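This post lists the latest papers retrieved from arXiv.org on 2026-05-26. It is updated automatically and organized into six major areas: NLP, CV, ML, AI, IR, and MA.

Note: paper data is fetched from arXiv.org daily, with a scheduled automatic update at around 12:30 each day.

Tip: if the list has not been updated on a given day, either arXiv published no new papers that day or the script failed. Failures are fixed the same day whenever possible.

Contents

Overview (2026-05-26)

1358 papers were updated today, including:

  • Natural language processing: 227 papers (Computation and Language (cs.CL))
  • Artificial intelligence: 481 papers (Artificial Intelligence (cs.AI))
  • Computer vision: 292 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine learning: 437 papers (Machine Learning (cs.LG))
  • Multiagent systems: 35 papers (Multiagent Systems (cs.MA))
  • Information retrieval: 35 papers (Information Retrieval (cs.IR))
  • Human-computer interaction: 26 papers (Human-Computer Interaction (cs.HC))

Multiagent Systems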

[MA-0] When Gradients Collide: Failure Modes of Multi-Objective Prompt Optimization for LLM Judges ACL2026

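Quick read: This paper asks how to customize an LLM judge across multiple evaluation criteria at once, and in particular how textual gradient methods cope with multi-objective optimization. Existing textual gradient methods target a single judge criterion and produce natural-language critiques rather than numerical vectors, so the conflict-resolution tools of multi-task learning (such as PCGrad and MGDA) cannot be applied directly. The authors systematically test five decomposition modes of textual gradient optimizers, varying how much cross-task information the loss, gradient, and optimizer LLMs share. Gradient specificity drops by 59% when the gradient LLM processes multiple criteria jointly, and naively concatenating per-task instructions into a single prompt lowers Spearman's rho by 5.3%. These findings expose two separable failure modes, optimization-time gradient dilution and inference-time instruction interference, which together constrain the design space for multi-objective judge customization based on textual feedback.

Link: https://arxiv.org/abs/2605.26046
Authors: Parth Darshan, Abhishek Divekar
Affiliations: IIT Jodhpur; Amazon
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
Comments: Accepted at ACL 2026 CustomNLP4U Workshop. Code, prompts and data available at this https URL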

Abstract:Customizing an LLM judge to a specific task or domain often involves optimizing its prompt across multiple evaluation criteria simultaneously. Textual gradient methods automate this for a single judge criterion, however they produce natural-language critiques, not numerical vectors. Thus, the conflict-resolution toolkit of multi-task learning (PCGrad, MGDA) doesn’t apply to the multi-objective textual gradient setting. We test five decomposition modes of textual gradient optimizers by varying how much cross-task information the loss, gradient and optimizer LLMs share. In 6 of 10 configurations, we observe that optimization never improves over the initial prompt. Gradient specificity drops by 59% (from 9.0 to 3.7) when the gradient LLM processes multiple criteria jointly. Separately, we observe that naively combining per-task instructions into a single prompt degrades Spearman’s rho by -5.3%. These results identify two separable failure modes: optimization-time gradient dilution and inference-time instruction interference, which together constrain the design space for multi-objective judge customization using textual feedback.
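
To make concrete why tools such as PCGrad need numerical gradients, here is a minimal sketch of the core PCGrad projection step for two task gradients. It is an illustration only, not code from the paper: the vectors `g_clarity` and `g_brevity` are hypothetical, and the full PCGrad procedure (which loops over tasks in random order) is omitted. The step relies on a dot product, which has no direct analogue for natural-language critiques.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_conflict(g_i: Vector, g_j: Vector) -> Vector:
    """Core PCGrad step: if g_i conflicts with g_j (negative dot product),
    remove from g_i its component along g_j; otherwise return g_i unchanged."""
    d = dot(g_i, g_j)
    if d >= 0:
        return list(g_i)
    scale = d / dot(g_j, g_j)
    return [x - scale * y for x, y in zip(g_i, g_j)]


# Hypothetical gradients for two judge criteria (assumed values).
g_clarity = [1.0, 0.0]
g_brevity = [-1.0, 1.0]

projected = project_conflict(g_clarity, g_brevity)
print(projected, dot(projected, g_brevity))
```

After projection the clarity gradient no longer opposes the brevity gradient. A textual critique offers no inner product to test for conflict, which is the gap the abstract points to.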

[MA-1] Anticipate and Learn: Unleashing Idle-Time Compute in Proactive Agents

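Quick read: Current AI agents are fundamentally reactive: they respond only after an explicit user prompt, so the compute available between interactions is wasted and the agent cannot prepare for upcoming user needs. The paper proposes ProAct, a proactive agent architecture that uses idle time between interactions to analyze the evolving dialogue history and persistent memory, predict likely needs, and iteratively acquire information to close knowledge gaps and prepare evidence before the user makes a request. Compared with reactive baselines, ProAct reduces the number of turns needed to complete a task by 14.8%, lowers user effort by 11.7%, and cuts hallucination rates by 28.1%.

Link: https://arxiv.org/abs/2605.25971
Authors: Haoyi Hu, Qirong Lyu, Xianghan Kong, Weiwen Liu, Jianghao Lin, Zixuan Guo, Yan Xu, Yasheng Wang, Weinan Zhang, Yong Yu
Affiliations: Shanghai Jiao Tong University
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
Comments: 26 pages, 4 figures; code available at this https URL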

Abstract:While AI agents demonstrate remarkable capabilities in reasoning and tool use, they remain fundamentally reactive: they compute responses only after explicit user prompts. This paradigm ignores a critical opportunity: the idle time between interactions is largely wasted, leaving agents unable to prepare for future user needs. To bridge this gap, we introduce ProAct, a proactive agent architecture that leverages idle-time compute to anticipate and fulfill likely upcoming user needs. By analyzing evolving dialogue history together with persistent memory, ProAct predicts upcoming needs and iteratively acquires information, allowing the agent to resolve knowledge gaps and prepare evidence before the user initiates a request. To rigorously evaluate proactive capabilities, we also introduce ProActEval, a comprehensive benchmark comprising 200 scenarios across 40 domains, featuring predictable need chains and diverse user cognitive profiles. Empirical results demonstrate significant advantages over reactive baselines. ProAct accelerates task completion by reducing required turns by 14.8%, decreases user effort by 11.7%, and cuts hallucination rates by 28.1% on ProActEval. Furthermore, MemBench evaluations confirm that ProAct achieves state-of-the-art reflective accuracy, underscoring its sustained and robust performance.

[MA-2] Multi-Agent Systems are Mixtures of Experts: Who Becomes an Influencer?

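Quick read: This paper studies how communication and collaboration among agents shape the performance of multi-agent LLM deliberation. It models deliberation with Friedkin-Johnsen (FJ) opinion dynamics and shows that the influence and stubbornness parameters depend on the input, which turns multi-agent deliberation into an input-dependent mixture of experts. Under this view, a multi-agent system can outperform single agents and static ensembles when routing reflects agent competence. Because competence is hard to observe directly in practice, the authors analyze how influence is established through observable proxies: agents' self-assessed confidence, their perceived confidence, and their initial alignment with other agents' views.

Link: https://arxiv.org/abs/2605.25929
Authors: Franka Bause, Jonas Niederle, Martin Pawelczyk, Rebekka Burkholz
Affiliations: CISPA Helmholtz Center for Information Security; Faculty of Computer Science, University of Vienna
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)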

Abstract:The effectiveness of multi-agent LLM deliberation depends not only on the agents’ individual predictions, but also on how they communicate and collaborate. We study this mechanism through the lens of Friedkin-Johnsen (FJ) opinion dynamics, a tractable model for analyzing stubbornness, influence, and opinion change in multi-agent systems that captures empirically observed deliberation patterns. We show that the FJ parameters are input-dependent, turning multi-agent deliberation into a mixture of experts. This perspective implies that multi-agent systems can outperform single agents and static ensembles when routing reflects agent competence. Since competence is latent in practice, we analyze how influence is established through observable proxies: agents’ self-assessed confidence, their perceived confidence, and initial alignment with other agents’ views.
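
For readers unfamiliar with Friedkin-Johnsen dynamics, the following is a minimal sketch of the standard update rule, in which each agent mixes its neighbors' current opinions with its own initial opinion according to a stubbornness weight. The two-agent setup, the weight matrix `W`, and the stubbornness values are assumptions chosen for illustration and do not come from the paper.

```python
from typing import List


def fj_step(x: List[float], x0: List[float],
            W: List[List[float]], stubbornness: List[float]) -> List[float]:
    """One Friedkin-Johnsen update:
    x_i(t+1) = (1 - s_i) * sum_j W[i][j] * x_j(t) + s_i * x_i(0),
    where W is row-stochastic and s_i in [0, 1] is agent i's stubbornness."""
    return [
        (1 - s) * sum(w * xj for w, xj in zip(row, x)) + s * xi0
        for row, s, xi0 in zip(W, stubbornness, x0)
    ]


def fj_run(x0: List[float], W: List[List[float]],
           stubbornness: List[float], steps: int = 200) -> List[float]:
    """Iterate the update from the initial opinions x0."""
    x = list(x0)
    for _ in range(steps):
        x = fj_step(x, x0, W, stubbornness)
    return x


# Two hypothetical agents that start at opposite opinions and weight
# each other equally; both are moderately stubborn.
initial = [0.0, 1.0]
W = [[0.5, 0.5], [0.5, 0.5]]
final = fj_run(initial, W, stubbornness=[0.5, 0.5])
print(final)
```

With nonzero stubbornness the agents do not reach full consensus: each settles between the group average and its own starting view. In the paper's framing, these weights vary with the input, which is what makes the aggregate behave like a mixture of experts.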

[MA-3] Behind EvoMap: Characterizing a Self-Evolving Agent-to-Agent Collaboration Network

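Quick read: How Agent-to-Agent (A2A) collaboration networks between autonomous AI agents operate in practice remains unclear, especially how they can scale while keeping assets reusable, evolvable, and auditable. This study of EvoMap, a prominent A2A network, identifies three design flaws. First, incentives are tied to publication rather than adoption, so 98% of assets are never reused and rewards are highly concentrated. Second, the GDI-based quality score relies heavily on unverified, self-reported metadata (such as the claimed number of modified lines of code), which makes scores easy to manipulate. Third, local execution logs serve as quality evidence without independent verification, so over 84% of approved assets bypass substantive quality checks with vacuous tests. The authors conclude that future A2A networks need mechanisms that balance open participation with verifiable execution and trustworthy evaluation in order to scale sustainably.

Link: https://arxiv.org/abs/2605.25815
Authors: Qiming Ye, Peixain Zhang, Yupeng He, Zifan Peng, Gareth Tyson
Affiliations: The Hong Kong University of Science and Technology (Guangzhou)
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)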

Abstract:Agent-to-Agent (A2A) networks enable autonomous AI agents to collaborate by sharing reusable problem-solving instructions. However, how these decentralized ecosystems operate in practice remains largely unexplored. We present the first large-scale empirical study of EvoMap, a prominent A2A collaboration network. By analyzing over 1.5M assets and 128K agents, we show how design choices that prioritize scalable growth introduce trade-offs in reusability, evolution, and auditability. First, EvoMap’s credit economy rewards agents for publishing valuable assets. Although this design encourages participation at scale, rewards are tied primarily to publication rather than adoption. This leads agents to mass-produce assets to accumulate credits. As a result, 98% of assets are never reused, while rewards become highly concentrated among a small fraction of agents. Second, EvoMap employs an algorithm (referred to as GDI) to score and rank the quality of these shared assets. We demonstrate that this scoring system is flawed: rather than measuring objective performance, an asset’s rank is heavily dictated by unverified, self-reported metadata (e.g., claimed lines of code modified). This allows agents to trivially manipulate their asset’s scores. Finally, EvoMap relies on agents to provide local execution logs as evidence that uploaded assets function correctly. Because these validations are not independently verified, over 84% of approved assets bypass quality checks using vacuous tests (e.g., this http URL). Our findings show that future A2A collaboration networks cannot rely on unverified self-reporting alone. Scalable collaboration requires mechanisms that balance open participation with verifiable execution and trustworthy evaluation.

[MA-4] Multi-Agent Coordination Adaptation via Structure-Guided Orchestration

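Quick read: This paper addresses how LLM-based multi-agent systems can balance structural stability with dynamic adaptability on complex tasks. Existing methods are either structure-centric, committing to a structure fixed upfront that limits fine-grained control, or orchestration-centric, adapting decisions dynamically but lacking a stable coordination structure. The proposed MACA framework takes a probabilistic view and casts multi-agent coordination as posterior inference over the joint distribution of structure and orchestration. MACA learns a task- and budget-conditioned structural prior that guides a policy-based orchestration, which approximates posterior inference and yields efficient coordination with fine-grained control. Across benchmarks, MACA outperforms adaptive multi-agent baselines by 8.42% on average while using 43.19% fewer tokens, and jointly adapting structure and orchestration suppresses redundant interactions so that coordination converges toward task-effective execution.

Link: https://arxiv.org/abs/2605.25746
Authors: Haoran Li, Shulun Chen, Shaoyuan Sun, Hanchen Wang
Affiliations: Nanjing University; University of Technology Sydney, Sydney, Australia; University of New South Wales, Sydney, Australia
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: 21 pages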

Abstract:As large language model (LLM)-based multi-agent systems scale to handle increasingly complex tasks, balancing structural stability and dynamic adaptability becomes increasingly challenging. Existing systems typically adopt either structure-centric methods, committing to structures determined upfront that limit fine-grained control, or orchestration-centric methods, adapting decisions dynamically while leaving coordination structure implicit and unstable. To address this challenge, we revisit multi-agent coordination from a probabilistic perspective, casting it as posterior inference over the joint distribution of structure and orchestration. We introduce MACA, an automated coordination framework that learns a task- and budget-conditioned structural prior over agent participation and interactions. This prior guides a policy-based orchestration as an approximation to posterior inference, enabling efficient solutions with fine-grained control. Across benchmarks, MACA outperforms adaptive multi-agent baselines by an average of 8.42% while using 43.19% fewer tokens. Further investigation reveals that joint adaptation of structure and orchestration suppresses redundant interactions, converging coordination toward task-effective execution.

[MA-5] Collaborative Threat-Aware Autonomy (CTAA)

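Quick read: This paper tackles a fundamental challenge for teams of unmanned vehicles operating among dynamic, adversarial Weapon Engagement Zones (WEZs): a single vehicle is a single point of failure for the mission. It proposes a role-differentiated multi-agent framework that assigns Autonomous Collaborative Platforms (ACPs) to primary intercept, escort, and decoy roles. Role assignment and spatial route separation produce two complementary effects: probabilistic redundancy, in which N independent paths raise the team's overall success probability, and threat saturation, in which lower-priority escorts and decoys draw adversary attention so that the primary vehicle can transit the WEZ uncontested. Each agent plans its trajectory independently with a reactive guidance law derived from the Collision Sphere Boundary for Evader Zero-Set (CSBEZ), which accounts for maneuverability constraints such as minimum turn radius and selects the safest heading that still makes progress toward the goal.

Link: https://arxiv.org/abs/2605.25741
Authors: Rajnikant Sharma, Abhinav Sinha, Isaac Weintraub
Affiliations: PNT Research Engineer; Assistant Professor; Senior Electronics Engineer
Subjects: Multiagent Systems (cs.MA)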

Abstract:Navigating teams of unmanned vehicles through environments containing dynamic, adversarial Weapon Engagement Zones (WEZs) poses a fundamental challenge to mission success: a single vehicle, however capable its onboard guidance, remains a single point of failure. This paper presents a role-differentiated multi-agent framework for collaborative threat-aware trajectory planning in which a fleet of Autonomous Collaborative Platforms (ACPs) is assigned distinct roles (primary intercept, escort, and decoy) to improve team-level mission success probability while managing individual WEZ exposure. Each ACP independently employs a reactive guidance law derived from the Collision Sphere Boundary for Evader Zero-Set (CSBEZ), which accounts for pursuer maneuverability constraints imposed by minimum turn radius, and steers the vehicle toward the safest heading that also makes progress toward its goal. Role assignment and spatial route separation induce two complementary effects: probabilistic redundancy, in which N independent paths raise the team success probability, and threat saturation, in which lower-priority escorts and decoys draw adversary attention and free the primary vehicle to transit uncontested.
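
The probabilistic-redundancy effect can be illustrated with a short calculation. This sketch assumes that paths succeed independently and that the team succeeds if at least one vehicle gets through; both are simplifying assumptions for illustration, and the paper's exact success model may differ. The per-path probabilities are made up.

```python
from typing import Iterable


def team_success_probability(path_probs: Iterable[float]) -> float:
    """Probability that at least one of several independent paths succeeds:
    1 minus the probability that every path fails."""
    all_fail = 1.0
    for p in path_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each probability must lie in [0, 1]")
        all_fail *= (1.0 - p)
    return 1.0 - all_fail


# Hypothetical per-vehicle success probability of 0.6.
single = team_success_probability([0.6])
three_paths = team_success_probability([0.6, 0.6, 0.6])
print(single, three_paths)
```

Under these assumptions, three independent paths at 0.6 each lift the team-level probability to about 0.936, which is the kind of gain that route separation is meant to secure.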

[MA-6] From Facts to Insights: A Persona-Driven Dual Memory Framework and Dataset for Role-Playing Agents

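Quick read: This paper addresses the difficulty of maintaining persona fidelity in long-term conversations. Existing external-memory frameworks rely on persona-agnostic summarization, which yields generic responses that lack persona-specific interpretation and undermine the authenticity of role-play. The proposed DualMem framework decouples memory into two separate streams, factual cognition and persona-conditioned insight, and is trained with supervised fine-tuning (SFT) and reinforcement learning (RL) so that the model interprets factual fragments through the persona, substantially improving persona fidelity in long conversations.

Link: https://arxiv.org/abs/2605.25693
Authors: Rongsheng Zhang, Ruofan Hu, Weijie Chen, Jiji Tang, Junnan Ren, Wanying Wu, Xunuoyan Chen, Tangjie Lv, Tao Jin, Zhou Zhao
Affiliations: Zhejiang University; Fuxi AI Lab, Netease Inc.
Subjects: Computation and Language (cs.CL); Databases (cs.DB); Multiagent Systems (cs.MA)
Comments: Preprint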

Abstract:While role-playing agents excel in short-term interactions, long-term conversations overwhelm context windows, motivating external memory frameworks. Current systems typically rely on persona-agnostic summarization, which records facts without persona-specific interpretation, yielding generic responses that compromise persona fidelity. To bridge this gap, we introduce RoleMemo, a dataset featuring four reasoning tasks where the factual fragments must be interpreted through the persona to reach the correct answer. Evaluation on RoleMemo exposes critical limitations of persona-agnostic frameworks. We thus propose DualMem, which decouples memory into two streams: factual cognition and persona-conditioned insight. Trained through Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), our framework with a 4B-parameter model outperforms zero-shot persona-agnostic frameworks powered by DeepSeek-V3.2 for sustained persona fidelity. Our resources are available at this https URL.

[MA-7] When Agents Control Robots: A Zero Trust Policy Model for Agentic Cyber-Physical Systems

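Quick read: This paper addresses the security problems that arise when multi-agent systems built on large foundation models (LFMs) control industrial robots through natural language, where security failures can have serious physical consequences. It proposes ZTPM (Zero Trust Policy Model), which comprises 25 typed policy primitives across five enforcement domains and introduces Physical Impact Tiers as a runtime policy dimension. The model provides fine-grained, verifiable policy-level enforcement at the physical actuation boundary to counter the risks posed by non-deterministic, model-dependent LFM outputs.

Link: https://arxiv.org/abs/2605.25653
Authors: Tharindu Ranathunga, Kavishka Fernando, Susan Rea
Affiliations: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
Comments: 12 pages, 4 figures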

Abstract:Multi-agent systems powered by large foundation models (LFMs) are increasingly deployed to control industrial robots through natural language, creating deployments in which security failures produce physical consequences. We analyse this threat landscape through Cobot-Claw, a deployed four-agent system for UR3e robotic arm control, and identify five attack classes specific to agentic cyber-physical systems. We propose ZTPM, a Zero Trust Policy Model comprising 25 typed primitives across five enforcement domains with Physical Impact Tiers as a runtime policy dimension. An empirical evaluation across 60 execution traces on two LFM backends provides initial evidence that actuation parameter selection is model-dependent and non-deterministic, motivating the need for policy-level enforcement at the physical actuation boundary.

[MA-8] A Multi-Agent LLM Framework for Rating the Quality of Surgical Feedback

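Quick read: This paper addresses the difficulty of assessing the quality of intraoperative feedback from attending surgeons, in particular how to automatically identify and quantify how effectively feedback influences resident behavior. Existing methods rely on manual annotation or coarse taxonomies and miss qualitative aspects of feedback such as clarity and urgency. The authors propose a two-stage LLM framework: multi-agent prompting and surgical domain knowledge injection first discover a small set of interpretable scoring criteria (such as Encouraging, Urgent, and Clear) grounded in real surgical settings, and an LLM-as-a-judge then scores feedback automatically. Experiments show that the approach outperforms prior content-based frameworks in predicting trainee behavioral adjustment and trainer approval, enabling scalable, human-aligned assessment of intraoperative communication quality and offering a path toward better surgical teaching.

Link: https://arxiv.org/abs/2605.25440
Authors: Rafal Kocielnik, J. Everett Knudsen, Steven Y. Cen, Jasmine Lin, Cherine H. Yang, Atharva Deo, Ujjwal Pasupulety, Peter Wager, Anima Anandkumar, Andrew J. Hung
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 25 pages, 3 figures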

Abstract:Verbal feedback delivered by attending surgeons in the operating room plays a critical formative role in resident trainee skill acquisition. Yet, assessing the quality of trainer feedback and its effectiveness in influencing trainee behavior during live surgery remains a challenge. Prior studies assessed feedback content relying on extensive manual annotation by expert human raters and focused on developing broad taxonomies that overlook the qualitative aspects of feedback delivery such as clarity or urgency. Limited existing automated methods, including keyword analysis and topic modeling, also fail to capture these nuanced aspects. We introduce a two-stage LLM-based framework that discovers interpretable feedback quality criteria grounded in the context of surgical training. Our method uses multi-agent prompting and surgical domain knowledge injection to discover a small set of human interpretable scoring criteria (e.g., Encouraging, Urgent, Clear). These criteria are then used to automatically score live surgical feedback via an LLM-as-a-judge approach. Evaluation on 4.2k trainer feedback instances demonstrates that our AI-discovered criteria outperform prior content-based frameworks in predicting feedback effectiveness, including observed trainee behavioral adjustments and trainer approval. This work advances scalable, human-aligned assessment of communication quality in the operating room and provides a foundation for improving surgical teaching practices.

[MA-9] Mode 0: A New 3GPP V2X Resource Allocation Category for Roadside Computing Unit-Assisted Safety Communication

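Quick read: This paper addresses structural gaps in the current 3GPP V2X (Vehicle-to-Everything) resource allocation framework in high-density traffic: base station-led scheduling produces latency-tail failures, and the autonomous vehicle user equipment (UE) modes cannot provide early warning for occluded traffic participants or handle large-scope environmental hazards. It proposes a new V2X category, Mode 0, whose defining entity is the Roadside Computing Unit (RCU), owned by traffic management authorities and integrating elevated sensing (Seeing), sidelink communication (Speaking), and local computational evaluation (Thinking). Mode 0 spans a spectrum from all-passive UEs (Mode 0a) to all-active UEs (Mode 0c). A Multi-Agent Proximal Policy Optimization (MAPPO) simulation supports the design: Mode 0c with demand separation achieves a strict Pareto improvement for both traffic classes (M0 and M1, with PDRs of 0.999 and 0.998) and lifts the worst-TTI (Transmission Time Interval) delivery ratio from near zero to 0.601, making it the only configuration that structurally satisfies the latency safety requirement. The authors call for a 3GPP study item on Mode 0 within the NR-V2X sidelink enhancement work programme.

Link: https://arxiv.org/abs/2605.25431
Authors: Dewei Jiang (Nantong University), Xiang Gu (Nantong University)
Affiliations: Nantong University
Subjects: Networking and Internet Architecture (cs.NI); Multiagent Systems (cs.MA); Signal Processing (eess.SP)
Comments: 13 pages, 7 figures, 4 tables. Submitted to IEEE Transactions on Intelligent Transportation Systems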

Abstract:The 3GPP V2X resource allocation framework defines two entity classes – the base station and the vehicle UE – and four modes across LTE and NR generations. We demonstrate that this binary taxonomy is structurally incomplete. Base station-led scheduling saturates at high-density traffic nodes, producing latency-tail failures that persist even when mean packet delivery ratios approach the service-class target. UE autonomy is categorically incapable of pre-emergence warning for occluded traffic participants and insufficient for large-scope cascading environmental hazards. We propose Mode 0, a new 3GPP V2X category whose defining entity is the Roadside Computing Unit (RCU) – an infrastructure ensemble integrating elevated sensing (Seeing), sidelink communication (Speaking), and local computational evaluation (Thinking), owned by traffic management authorities. Mode 0 defines a subfamily spectrum from Mode 0a (all-passive UEs, the guaranteed minimum) through Mode 0c (all-active UEs, the optimal target). Convergent deployment evidence from Chinese national standards (DB11/T 2329.1-2024, T/ITS 0224.1-2025), China Unicom RS-MEC infrastructure, and European and US C-V2X programs confirms that both institutional sides are converging on the roadside traffic node without a coordination standard. A fifteen-run Multi-Agent Proximal Policy Optimization (MAPPO) simulation validates the architectural family: Mode 0a in shared-pool baseline sits at the analytical symmetric-Nash coordination floor; Mode 0c with demand separation achieves strict Pareto improvement for both traffic classes (M0 PDR 0.999, M1 PDR 0.998 at $\rho_{\rm pool} \leq 1$) and lifts the worst-TTI delivery ratio from near-zero to 0.601 – the only configuration satisfying the latency safety requirement structurally. We call for a 3GPP study item on Mode 0 within the NR-V2X sidelink enhancement work programme.

[MA-10] Evo-Attacker: Memory-Augmented Reinforcement Learning for Long-Horizon Tool Attacks on LLM-MAS ACL2026

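Quick read: This paper targets the security vulnerability created by implicit trust in tool outputs in LLM-based multi-agent systems (LLM-MAS), and the limits of existing tool attacks, which are domain-specific or tied to static templates. It proposes Evo-Attacker, a framework that formulates the tool attack as a self-evolving, memory-augmented reinforcement learning process: it builds a dynamic attack memory, uses deliberative reasoning to retrieve adversarial patterns, and plans modifying interventions at critical moments. It also introduces Attack-Flow GRPO, which optimizes intermediate reasoning steps via terminal outcomes to ease long-horizon credit assignment. Experiments show that Evo-Attacker consistently outperforms baselines in generalization and evolutionary capability, underscoring both its attack effectiveness and the urgent need for defensive tool safeguards.

Link: https://arxiv.org/abs/2605.25389
Authors: Bingyu Yan, Xiaoming Zhang, Jinyu Hou, Chaozhuo Li, Ziyi Zhou, Yiming Hei, Litian Zhang
Affiliations: Beihang University; Beijing University of Posts and Telecommunications; China Academy of Information and Communications Technology
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: ACL 2026 main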

Abstract:While Large Language Model-based Multi-Agent Systems (LLM-MAS) demonstrate remarkable capabilities in solving complex tasks by orchestrating specialized agents and external tools, the implicit trust in tool outputs creates a critical attack surface. Existing tool attacks are limited by domain specificity or fixed and static templates. To address these challenges, we propose Evo-Attacker, which formulates the tool attack as a self-evolving, memory-augmented reinforcement learning process. Evo-Attacker constructs a dynamic attack memory and employs deliberative reasoning to retrieve adversarial patterns and strategize modifying interventions at critical moments. Furthermore, we introduce Attack-Flow GRPO to optimize intermediate reasoning steps via terminal outcomes, addressing the long-horizon credit assignment challenge. Comprehensive experiments demonstrate that Evo-Attacker consistently outperforms baselines, highlighting its generalization and evolutionary capabilities and the urgent need for defensive tool safeguards.

[MA-11] KYA: A Framework-Agnostic Trust Layer for Autonomous Systems with Verifiable Provenance and Hierarchical Policy Composition

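Quick read: This paper addresses the lack of trustworthiness, auditability, and governance of agent behavior in autonomous systems: conventional observability cannot tell operators when an agent is wrong, drifting, leaking, or quietly going rogue. It presents KYA (Know Your Agents), an open-source trust and governance layer for autonomous systems built on five primitives: (1) a four-gate inbound apply pipeline (Ed25519 signature verification with multi-anchor pinning, persist-time expiry, only-tighten composition, and operator approval by default); (2) an only-tighten composition algebra over a three-channel multi-tenant hierarchy; (3) KYP (Know Your Principal), which unifies trust scoring across human users, AI agents, and service accounts; (4) auditable interaction-multiplier amplification over an AIVSS-shaped baseline, with bounded asymmetric per-interaction multipliers carrying stable audit codes; and (5) two-axis delegation attribution, which combines a static observation-gated delegation-trust premium with zero-config runtime orchestrator-blame tracking. KYA is framework-agnostic across 22 agent frameworks, its pure-function scorer runs in under a millisecond at p99, and the system sustains about 1,800 operations per second at 20 concurrent workers. The inbound pipeline rejected all 1,200 forged, expired, loosening, and unapproved recommendations, and KYA detected 89% of 1,200 adversarial probes from PyRIT and Garak, including a recently published topology-guided multi-agent attack.

Link: https://arxiv.org/abs/2605.25376
Authors: Kolawole Quadri
Affiliations: Veldt Labs (USA)
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
Comments: 26 pages including appendix. Code available under Apache 2.0 at this https URL (pip install veldt-kya). Two-domain worked examples (loan decisioning under NYDFS/ECOA/CFPB; clinical triage under HIPAA/21 CFR Part 11/FDA SaMD). Reproducibility artifacts in-tree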

Abstract:Observability tells operators when an agent is slow. KYA tells operators when an agent is wrong, drifting, leaking, or quietly going rogue. We present KYA (Know Your Agents), an open-source trust and governance layer for autonomous systems composed of five primitives: (1) a four-gate inbound apply pipeline composing Ed25519 signature verification with multi-anchor pinning, persist-time expiry, only-tighten composition, and operator-approval-as-default; (2) an only-tighten composition algebra over a three-channel multi-tenant hierarchy (platform default, tenant override, signed external recommendation); (3) KYP – Know Your Principal, a schema-level unification of trust scoring across human users, AI agents, and service accounts; (4) auditable interaction-multiplier amplification over an AIVSS-shaped additive baseline, with bounded asymmetric per-interaction multipliers carrying stable audit codes; and (5) two-axis delegation attribution combining a static observation-gated delegation-trust premium with zero-config runtime orchestrator-blame at three SDK hook surfaces. KYA is framework-agnostic across 22 agent frameworks. The pure-function scorer runs sub-millisecond at p99 and the system sustains ~1,800 ops/sec at 20 concurrent workers with HMAC chain integrity preserved end-to-end. The four-gate inbound apply pipeline rejects forged, expired, loosening, and unapproved recommendations on every trial (1,200 / 1,200) with sub-millisecond p99 latency on SQLite. KYA detects 89% of 1,200 adversarial probes from PyRIT and Garak, including the recently-published topology-guided multi-agent attack. The system is available under Apache 2.0 as the veldt-kya package on PyPI (release candidate at submission time; stable v0.1.0 forthcoming).
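
One way to picture "only-tighten composition" is a merge rule in which a lower layer may make a limit stricter but never looser. The sketch below is a hypothetical illustration of that idea and is not the veldt-kya API: the function name `compose_only_tighten`, the dictionary representation, the policy keys, and the convention that smaller numbers are stricter are all assumptions.

```python
from typing import Dict

Policy = Dict[str, float]


def compose_only_tighten(base: Policy, override: Policy) -> Policy:
    """Merge an override into a base policy (hypothetical representation).
    Every value is a numeric limit where smaller means stricter. An override
    may lower an existing limit or add a new one; raising a limit is rejected."""
    result = dict(base)
    for key, limit in override.items():
        if key in result and limit > result[key]:
            raise ValueError(f"override would loosen '{key}'")
        result[key] = limit
    return result


# Hypothetical platform default and tenant override.
platform_default = {"max_tool_calls": 10, "max_spend_usd": 100}
tenant_override = {"max_tool_calls": 5}

composed = compose_only_tighten(platform_default, tenant_override)
print(composed)

try:
    compose_only_tighten(platform_default, {"max_spend_usd": 500})
    loosening_rejected = False
except ValueError:
    loosening_rejected = True
print(loosening_rejected)
```

Rejecting a loosening override outright, rather than silently clamping it, mirrors the abstract's statement that the pipeline rejects loosening recommendations, though the real system's semantics may differ.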

[MA-12] Towards Reliable Fetal Ultrasound Interpretation with Multi-Agent Collaboration

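Quick read: This paper addresses the difficulty of integrating the multi-step workflow of automated fetal ultrasound interpretation: the prevailing "one-task, one-model" paradigm struggles to fuse evidence systematically from visual perception (plane recognition and anatomical segmentation) through clinical understanding (biometric measurement and diagnostic reporting). It proposes FetUSAgents, a tool-augmented multi-agent system in which collaborative LLM agents coordinate task-specific visual tools and decompose clinical queries into a chain of subtasks running from anatomical recognition to quantitative measurement. It introduces Dual-Path Evidence Arbitration (DPEA), which integrates LLM-based deliberative reasoning with structured computational evidence from specialized visual tools, and builds a retrieval-enhanced evidence bank to support traceable, clinically grounded conclusions.

Link: https://arxiv.org/abs/2605.25357
Authors: Xiaotian Hu, Mingxuan Liu, Junwei Huang, Kasidit Anmahapong, Yifei Chen, Yiming Huang, Xuguang Bai, Zihan Li, Hongjia Yang, Yingqi Hao, Hong Xu, Yu Jiang, Tian Tian, Yi Liao, Haibo Qu, Qiyuan Tian
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)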

Abstract:Automated fetal ultrasound interpretation requires a workflow from visual perception, including plane recognition and anatomical segmentation, to clinical understanding, including biometric measurement and diagnostic reporting. However, the prevailing “one-task, one-model” paradigm limits systematic integration of evidence across this multi-step process. Although multimodal large language models (MLLMs) show promising visual understanding, their limited domain-specific grounding and hallucination risks restrict reliability in fetal ultrasound analysis. To address these limitations, we propose FetUSAgents, a tool-augmented multi-agent system for comprehensive fetal ultrasound interpretation, supporting visual question answering (VQA), report generation, image captioning, and video summarization. FetUSAgents coordinates task-specific visual tools through collaborative LLM agents and decomposes clinical queries into subtasks that progress from anatomical recognition to quantitative measurement. We further introduce Dual-Path Evidence Arbitration (DPEA), which integrates LLM-based deliberative reasoning with structured computational evidence from specialized visual tools. A retrieval-enhanced evidence bank consolidates intermediate findings to support traceable and clinically grounded conclusions. In addition, we construct FetUS-VQA, a dedicated VQA benchmark for fetal ultrasound, comprising 1,892 images and 3,205 question-answer pairs across 10 clinical tasks. Extensive out-of-distribution experiments show that FetUSAgents outperforms general and medical MLLMs, exceeding the strongest baseline by more than 25 percent in VQA accuracy. These results suggest a scalable route toward evidence-driven clinical assistants for prenatal imaging. Code is available.

[MA-13] Recursive Multi-Agent Trading System: Iterative Optimized Portfolio Strategy Under Geopolitical Uncertainty

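Quick read: This paper addresses how to control downside risk in quantitative portfolio management under heightened geopolitical uncertainty. Traditional approaches such as mean-variance optimization (MVO) and FinBERT-based sentiment models perform poorly during extreme events and struggle to protect capital. The proposed Recursive Multi-Agent Trading System (RMATS) uses a recursive Manager Agent to coordinate four specialized agents (Sentiment, Report, Analysis, and Risk) in an iterative feedback loop, enabling dynamic risk awareness and decision adjustment. Over 561 trading days on a multi-asset portfolio, RMATS has a maximum drawdown of 9.62%, lower than MVO (15.49%) and the FinBERT sentiment model (15.28%), and shows the lowest drawdown in 3 of 5 geopolitical stress events, supporting its value as a risk-control-oriented architecture for institutional investors focused on capital preservation.

Link: https://arxiv.org/abs/2605.25311
Authors: Jing Yang, Yichao Wu, Jianan Liu, Penghao Liang, Mengwei Yuan, Xianyou Li, Weiran Yan
Affiliations: Washington University in St. Louis; Northeastern University; New York University; Independent Researcher
Subjects: Multiagent Systems (cs.MA)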

Abstract:Recursive Multi-Agent Trading System (RMATS) integrates four specialized agents – Sentiment, Report, Analysis, and Risk – coordinated through a recursive Manager Agent with iterative feedback loops. Experimental evaluation over a 561-trading-day period (January 2023 to March 2025) across a 24-asset multi-class universe demonstrates that RMATS achieves a maximum drawdown of 9.62%, lower than MVO (15.49%) and FinBERT Sentiment (15.28%), and exhibits the lowest event-period drawdown in 3 of 5 geopolitical stress scenarios tested. While RMATS underperforms return-maximizing baselines in a sustained bull market environment, ablation studies confirm the individual contribution of each agent component to downside protection. These results position RMATS as a risk-control-oriented architecture suitable for institutions prioritizing capital preservation under geopolitical uncertainty.
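
Maximum drawdown, the headline metric here, is the largest peak-to-trough decline of a portfolio's value expressed as a fraction of the peak. The sketch below shows the standard calculation on a made-up value series; the numbers are illustrative and unrelated to the paper's data.

```python
from typing import Sequence


def max_drawdown(values: Sequence[float]) -> float:
    """Largest fractional drop from a running peak to a later value.
    Assumes a non-empty series of positive portfolio values."""
    if not values:
        raise ValueError("values must be non-empty")
    peak = values[0]
    worst = 0.0
    for v in values:
        peak = max(peak, v)                      # highest value seen so far
        worst = max(worst, (peak - v) / peak)    # drop relative to that peak
    return worst


# Hypothetical portfolio values over six periods.
portfolio = [100.0, 110.0, 99.0, 105.0, 120.0, 114.0]
mdd = max_drawdown(portfolio)
print(f"{mdd:.2%}")
```

In this series the worst decline runs from 110 to 99, a 10% drawdown; the later dip from 120 to 114 is only 5%. The metric ignores returns, which is why a strategy can have a low drawdown yet trail return-maximizing baselines in a bull market, as the abstract notes for RMATS.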

[MA-14] Scaling up Energy-Aware Multi-Agent Reinforcement Learning for Mission-Oriented Drone Networks with Individual Reward

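Quick read: This paper addresses two core challenges for multi-agent reinforcement learning (MARL) in drone network trajectory planning: inefficient task execution in dynamic environments and the energy limits imposed by battery capacity. It proposes an energy-aware MARL model based on Deep Q-Networks (DQN) with individual reward functions driven by task execution progress and each drone's remaining battery, which gives more precise credit assignment. Compared with a shared-reward scheme, the approach achieves higher success rates and stability across task densities and environment sizes, is more robust as the environment scales up, and completes tasks in fewer steps, improving energy efficiency.

Link: https://arxiv.org/abs/2605.24992
Authors: Changling Li, Ying Li
Affiliations: ETH Zurich; Colby College
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: IEEE Internet of Things Journal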

Abstract:Multi-agent reinforcement learning (MARL) has shown wide applicability in collaborative systems such as autonomous driving and smart cities for its ability of learning through interaction. With the recent development of drone networks, researchers have also applied MARL to address the trajectory planning problems. However, the dynamic environment and the limited battery capacity are still challenging for using MARL to achieve efficient collaborative task execution. In this paper, we propose an energy-aware MARL model as an attempt to tackle these challenges, leveraging Deep Q-Networks (DQN) with individual reward functions driven by the task execution progress and the remaining battery of drones. We conduct a set of simulation studies for the proposed mode and compare it with the shared reward MARL [Li2022MARL] to explore the impact of credit assignment in MARL. The results indicate that our proposed model can achieve at least 80% success rate regardless of the task locations and lengths. Similar to the shared reward mode, the individual reward mode can achieve a better success rate when the task density is high, and it can hit nearly a 100% success rate when task density gets close to 40%. The true advantage of our proposed model with individual reward is revealed when scaling up the environment. The comparison to the shared reward MARL shows that our proposed model is more robust towards the change of the environment size and agent numbers. It can achieve higher success rate with fewer steps due to the clarity of the goal which improves energy efficiency even better.

[MA-15] PRIMA: Operational Patterns for Resilient Multi-Agent Research with Verifiable Identity and Convergent Feedback

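Quick read: This paper addresses stability and reliability problems in long-running multi-agent LLM research systems, caused by failure modes such as upstream throttling, sub-agent task drift, narration substituting for actual execution, revision iterations that open with self-apology, and treating upstream context as executable directives. These problems are hard to expose in single-shot evaluation but degrade the final output over multi-hour collaborative runs. The key to the solution is the PRIMA framework, built around three operational patterns: (1) a resilience-and-recovery layer that detects upstream rate-limit signals, persists a pause record, and resumes after process restarts without redoing converged work; (2) a sub-agent operating discipline that encodes task-fidelity, tool-use, revision, and inter-step context-boundary norms as a structural prompt layer; (3) a multi-phase application pattern that pairs orthogonal draft steps with an explicit cross-document harmonization pass before final synthesis. These sit on a foundational protocol comprising a research-program specification language with explicit convergence criteria, a dual-metric scoring engine (LLM judge plus sandboxed code execution), an outer meta-optimization loop, event-driven persistence, hook-based middleware, context compaction, and a multi-provider LLM abstraction. Agent identities are derived from prime powers, giving collision-free identifiers and verifiable cluster membership. The theoretical guarantees cover O(k) verification, O(V+E) validation of the directed acyclic graph (DAG), and collision-free identities by the Fundamental Theorem of Arithmetic. A Graph Isomorphism case study grounds the architecture: a six-step protocol produced a research paper proposing a new algorithm with three theorems and five conjectures.

Link: https://arxiv.org/abs/2605.24775
Authors: Sasank Annapureddy
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 11 pages. Single-author preprint. Supplementary case-study report (Graph Isomorphism algorithm proposal with three theorems, five conjectures, complete complexity analysis, and hard-instance evaluation) available at this https URL

Click to view abstract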

Abstract:Operating LLMs as coordinated multi-agent research systems over multi-hour runs surfaces failure modes that single-shot evaluation cannot: upstream providers throttle without warning, sub-agents drift the task to fit accessible tools, narrate machinery instead of using it, open revision iterations with self-apology, or treat upstream context as executable directives. We present PRIMA, whose primary contributions are three operational patterns for surviving these failure modes: (1) a resilience-and-recovery layer that detects upstream rate-limit signals, persists a typed pause record to disk, and resumes long-running runs without re-executing converged work even across process restarts; (2) a sub-agent operating discipline encoding task-fidelity, tool-use, revision, and inter-step context-boundary norms as a structural prompt layer; (3) a multi-phase application pattern for structured engineering deliverables pairing orthogonal draft steps with an explicit cross-document harmonization pass before final synthesis. These sit atop a foundational protocol: a research-program specification language with explicit convergence criteria, a dual-metric scoring engine (LLM-judged rubric plus sandboxed code), an outer meta-optimization loop, event-driven persistence, hook-based middleware, context compaction, and a multi-provider LLM abstraction. Agent identities derive from prime powers, giving collision-free identifiers and trivially-verifiable cluster membership without a central registry. Theoretical guarantees include O(k) verification, O(V+E) DAG validation, and identity collision freedom by the Fundamental Theorem of Arithmetic. A Graph Isomorphism case study grounds the architectural claims in a generated artifact: a six-step protocol that produced a research paper proposing a new canonical-form algorithm with three theorems and five conjectures.
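To see why prime powers give collision-free identifiers, consider a small sketch. It assumes one plausible scheme, in which each cluster owns a distinct prime and its k-th agent gets that prime raised to the power k; the function names are hypothetical and the abstract does not specify PRIMA's actual encoding.

```python
from typing import Optional


def agent_id(cluster_prime: int, index: int) -> int:
    """Identifier for the index-th agent (index >= 1) of a cluster.

    Assumes cluster_prime is a prime that belongs to exactly one cluster.
    """
    if index < 1:
        raise ValueError("index must be >= 1")
    return cluster_prime ** index


def member_index(identifier: int, cluster_prime: int) -> Optional[int]:
    """Return k if identifier == cluster_prime**k with k >= 1, else None.

    Uses k divisions, in line with an O(k) verification cost.
    """
    k = 0
    while identifier > 1 and identifier % cluster_prime == 0:
        identifier //= cluster_prime
        k += 1
    return k if identifier == 1 and k >= 1 else None
```

Because every integer has a unique prime factorization, two identifiers can only be equal when both the prime and the exponent match, so membership can be checked without a central registry.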

[MA-16] Hera: Learning Long-Horizon Coordination for Device-Cloud Collaborative LLM Agents

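Quick read: This paper addresses the device-cloud dilemma that large language model (LLM) agents face in real-world deployment: on-device models are efficient but brittle, while cloud models are stronger but computationally costly. Existing routers usually make coarse task-level decisions and cannot adapt to the changing difficulty of multi-step interactions. The key to the solution is Hera, a step-level device-cloud coordinator for long-horizon tasks, trained with a two-stage paradigm. Stage one uses imitation learning for cold start: it casts each step's routing decision as supervised classification and labels each state on cloud trajectories by whether the device and cloud actions agree. Stage two applies cost-aware reinforcement learning, grouping identical states across trajectories and updating toward labels that favor higher expected return and fewer future cloud calls, to reach a strong trade-off between task success and cloud usage. On ALFWorld, WebShop, and AppWorld, Hera consistently outperforms prior methods, reaching 92.5% of the cloud-only success rate while calling the cloud in only 46.3% of steps.

Link: https://arxiv.org/abs/2605.24598
Authors: Yuxin Zhang, Mengxue Hu, Zheng Lin, Xiaoyi Fan, Fan Xie, Zihan Fang, Jing Yang, Wenjun Zhu, Zhiwen Chen, Chengfei Lv, Zhe Chen
Affiliations: Fudan University; Alibaba Group; The University of Hong Kong; Shenzhen MSU-BIT University; New York University; Universiti Malaya; SpaceAIC Co., Ltd.
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Click to view abstract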

Abstract:Large language model (LLM) agents excel at solving complex long-horizon tasks through autonomous interaction with environments. However, their real-world deployment faces a fundamental device–cloud dilemma: on-device models are efficient but often brittle, while cloud models are stronger but costly in computation. State-of-the-art LLM device–cloud routers usually make coarse task-level decisions, which cannot adapt to the changing difficulty of multi-step agent interactions. To address this issue, we present Hera, a step-level device–cloud LLM agent coordinator for long-horizon tasks achieving a strong performance–cost Pareto frontier. Hera adopts a novel two-stage training paradigm: (1) imitation learning for cold-start, followed by (2) reinforcement learning that jointly optimizes task success and cloud usage efficiency. The first stage casts step-level routing as a supervised classification problem: the device agent is replayed on cloud trajectories, with each state labeled by the agreement between device and cloud actions. In the second stage, we perform cost-aware reinforcement learning by grouping identical states across trajectories and updating Hera with labels favoring higher expected return and fewer future cloud calls. We evaluate Hera on ALFWorld, WebShop, and AppWorld, where it consistently outperforms prior methods, achieving 92.5% of the cloud-only success rate with cloud use in only 46.3% of steps.
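The first-stage labeling step in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: states and actions are plain strings, agreement means exact equality, and `label_routing_data` is a hypothetical name rather than Hera's API.

```python
from typing import Callable, List, Tuple

State = str
Action = str


def label_routing_data(
    cloud_trajectory: List[Tuple[State, Action]],
    device_policy: Callable[[State], Action],
) -> List[Tuple[State, int]]:
    """Replay the device policy on states visited by the cloud agent.

    Label 0 ("stay on device") when the device reproduces the cloud action,
    and 1 ("route to cloud") otherwise. Exact string equality is an
    illustrative stand-in for whatever agreement test the paper uses.
    """
    return [
        (state, 0 if device_policy(state) == cloud_action else 1)
        for state, cloud_action in cloud_trajectory
    ]
```

The resulting `(state, label)` pairs form an ordinary supervised classification dataset for a cold-start router.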

[MA-17] AI-Driven Adaptive Adversaries and the Erosion of Cryptographic Trust in Public Key Systems

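Quick read: The problem addressed is that traditional Public Key Cryptography (PKC) security models are algorithm-centric and do not capture adaptive adversarial optimisation driven by artificial intelligence. Adversaries no longer try to break the cryptographic primitives themselves. They exploit implementation-level observability instead, which leaves a growing mismatch between existing security models and operational attack realities. The key to the solution is to rebuild the cryptographic security evaluation framework so that it models and analyzes AI-based adaptive attack behavior, shifting from an algorithm-centric paradigm to one driven by system-level observability.

Link: https://arxiv.org/abs/2605.24542
Authors: Petar Radanliev
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
Comments:

Click to view abstract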

Abstract:This paper examines the erosion of Public Key Cryptography (PKC) security under adaptive adversarial optimisation driven by artificial intelligence. The problem addressed is the growing mismatch between algorithm-centric cryptographic security models and operational attack realities, where adversaries exploit implementation-level observability rather than breaking cryptographic primitives.

[MA-18] Is Decentralized AI Governable? From Regulative Policy to Constitutive Protocol

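Quick read: The problem addressed is that mainstream AI governance frameworks presuppose an identifiable responsible entity (a developer, deployer, or operator), whereas decentralized AI (DeAI) is partially decentralized across six layers (model, training, compute, harness, identity, and ownership). As a result, systems that are consequential enough to require governance lack the properties existing frameworks presuppose in their targets, creating a "governance vacuum". The key to the solution is a shift from policy-oriented governance that relies on normative address to protocol-based constitutive governance: rather than trying to regulate an accountable agent, it shapes the underlying substrate that determines which actions are possible within the system. This shift requires four ethical conditions (legitimacy, contestability, transparency, and non-domination) and the reconstruction of democratic authorization for architectural choices, so that governance remains possible after the ordinary chain of policy has broken down in a decentralized setting.

Link: https://arxiv.org/abs/2605.24538
Authors: Botao Amber Hu, Helena Rong
Affiliations: University of Oxford; New York University
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Submitted for Ethics and Information Technology

Click to view abstract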

Abstract:Every major framework for governing artificial intelligence presupposes an identifiable entity – a developer, deployer, or operator – who can be held responsible and compelled to comply. Decentralized AI (DeAI) dissolves this presupposition. We analyze DeAI as a six-layer decentralizing stack – model, training, compute, harness, identity, and ownership – and show how partial decentralization across layers compounds into what we call the governance vacuum: a condition in which AI systems are consequential enough to require governance but lack the properties that existing frameworks presuppose in their targets. This vacuum takes two analytically distinct forms: an accountability gap, where no addressable principal can be identified, and an incapacitation gap, where even an identified principal cannot alter the running system. We demonstrate that these failures are not merely jurisdictional but defeat every presupposition of governance through normative address – the communication of rules to a comprehending, responsive agent. Drawing on Lessig’s modalities of regulation and Searle’s distinction between regulative and constitutive rules, we argue for a shift in the locus of governance from policy to protocol, from normative address to architectural constraint. Protocol-based constitutive governance does not address the agents operating within a system but shapes the substrate that determines what kinds of actions are possible within it. We identify four ethical conditions – legitimacy, contestability, transparency, and non-domination – that such governance must satisfy to avoid degenerating into unaccountable technocratic power, and we argue that the central political challenge of governing AI in a decentralized world is reconstructing forms of democratic authorization for architectural choices that persist after the ordinary chain of policy has broken down.

[MA-19] Adaptive Punishment for Cooperation in Mixed-Motive Games

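Quick read: This paper addresses the difficulty of sustaining cooperation in multi-agent interactions, where self-interested agents are tempted by immediate rewards and defect instead of choosing the altruistic cooperation that would improve long-term gains and collective welfare. Existing approaches that use punishment to promote cooperation often suffer from excessive punishment cost or limited effect. The key to the solution is Adaptive Punishment for Cooperation (APC), a distributed method that sets punishment intensity from a dynamic punishment probability combined with the severity of defection, reducing costly and ineffective punishment while still promoting cooperation. A defection awareness module, whose learning is guided by the game reward, assesses defection and its severity. Theoretical analysis and empirical results show that APC is effective in the iterated public goods game, and it significantly outperforms existing baselines across sequential social dilemmas.

Link: https://arxiv.org/abs/2605.24516
Authors: Min Tang, Fanqi Kong, Linyuan Lü, Xue Feng
Affiliations: University of Science and Technology of China; State Key Laboratory of General Artificial Intelligence, BIGAI
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract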

Abstract:Mixed-motive scenarios are ubiquitous in real-world multi-agent interactions, where self-interested agents often defect for immediate rewards, overlooking the potential of altruistic cooperation to improve long-term gains and collective welfare. Peer punishment can deter defection, but as costly second-order altruism, its persistent imposition may undermine the punisher’s interests. Existing approaches often struggle to effectively implement punishment to promote cooperation. To balance the efficacy and cost of punishment, we propose Adaptive Punishment for Cooperation (APC), a distributed method that determines punishment intensity based on both a dynamic punishment probability and the severity of defection. This dynamic probability substantially reduces costly and ineffective punishment while also promoting cooperation. To accurately assess defection and its severity, we use a defection awareness module, whose learning is guided by game reward. Theoretical analysis and empirical results show APC performs effectively in the iterated public goods game. Empirically, APC also significantly outperforms existing baselines across sequential social dilemmas, learning rational and effective punishment policies that foster cooperation by strategically deterring defection.
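A minimal sketch of how a punishment probability and a severity score might combine into a punishment intensity is shown below. The multiplicative form, the `max_intensity` parameter, and the function name are assumptions for illustration only; the abstract does not state APC's actual rule or how the probability is adapted.

```python
import random


def punishment_intensity(severity: float, punish_prob: float,
                         max_intensity: float = 1.0, rng=random) -> float:
    """Sample a punishment for one observed defection.

    severity:     estimated severity of the defection, in [0, 1]
    punish_prob:  current (dynamic) probability of punishing, in [0, 1]
    Returns 0.0 when no punishment is applied on this occasion.
    """
    if rng.random() >= punish_prob:
        return 0.0  # skip punishment and save its cost
    return max_intensity * severity  # harsher defection, harsher punishment
```

Lowering `punish_prob` when defection is rare is one way such a scheme could avoid paying for punishment that no longer changes behaviour.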

[MA-20] A Reinforcement Learning Inspired Latent Yield Based Adaptive Algorithm Switching Mechanism

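Quick read: This paper addresses how to select the most suitable algorithm for problem instances that keep changing in online or dynamic environments. Relying on instantaneous performance metrics makes switching reactive and unstable, which leads to suboptimal results. The key to the solution is a computationally efficient way to aggregate an algorithm's performance across multiple instances: inspired by Reinforcement Learning (RL), it folds rewards and penalties into a latent yield that triggers exploitation and exploration and thereby drives adaptive algorithm switching. It also employs island models, inspired by Genetic Algorithms, so that algorithm populations in local repertoires explore in parallel and exchange performance information. Experiments on sorting algorithms and robotic obstacle avoidance tasks demonstrate the feasibility and effectiveness of the approach in settings where adaptive algorithm selection is critical.

Link: https://arxiv.org/abs/2605.24436
Authors: Jayprakash S. Nair, Jimson Mathew, Shivashankar B. Nair
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Accepted and published in the Proceedings of the 29th European Conference on Applications of Evolutionary Computation (EvoApplications 2026), held as part of EvoStar 2026, Toulouse, France, April 8 to 10, 2026. Lecture Notes in Computer Science (LNCS), Springer Nature Switzerland

Click to view abstract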

Abstract:Selecting the most suitable algorithm for a given problem instance remains a challenging task, particularly in online or dynamic environments where problem characteristics evolve over time. Relying solely on instantaneous performance metrics can result in a reactive and unstable behaviour, often leading to suboptimal algorithm switching. This paper introduces a computationally efficient approach for aggregating an algorithm’s performance across multiple problem instances that is fairly immune to erratic variations in instance features. Inspired by features inherent to Reinforcement Learning (RL), this technique encapsulates rewards and penalties into a latent yield that, in turn, triggers exploitation and exploration, consequently resulting in adaptive algorithm switching. The proposed technique employs island models, inspired by Genetic Algorithms, to facilitate parallel exploration and performance exchanges among algorithm populations inhabiting local repertoires. Experimental evaluations on sorting algorithms and robotic obstacle avoidance tasks demonstrate the feasibility and effectiveness of the approach, highlighting its potential in domains where adaptive algorithm selection is critical.
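The idea of aggregating rewards and penalties into a slowly changing yield, so that one bad instance does not trigger a switch, can be sketched as below. The exponential decay, the threshold rule, and the `YieldSwitcher` class are illustrative assumptions, not the paper's mechanism, and the island-model component is omitted.

```python
class YieldSwitcher:
    """Keep a decayed running yield per algorithm and switch when it drops."""

    def __init__(self, algorithms, decay=0.8, threshold=0.0):
        # Requires at least two algorithm names.
        self.yields = {name: 0.0 for name in algorithms}
        self.current = algorithms[0]
        self.decay = decay          # how much past performance is retained
        self.threshold = threshold  # switch when the yield falls below this

    def update(self, signal: float) -> str:
        """Fold a reward (> 0) or penalty (< 0) into the current yield.

        Returns the algorithm to use for the next problem instance.
        """
        y = self.decay * self.yields[self.current] + signal
        self.yields[self.current] = y
        if y < self.threshold:
            # Explore: move to the best-yielding alternative.
            others = [a for a in self.yields if a != self.current]
            self.current = max(others, key=self.yields.get)
        return self.current
```

In this sketch a single penalty after a good run leaves the yield positive, so the switcher keeps exploiting; only repeated penalties push the yield below the threshold.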

[MA-21] Habermolt: Delegating Deliberation to AI Representatives

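Quick read: The question addressed is how AI can scale participation in democratic deliberation, given that deliberative democracy is constrained by human attention and bandwidth. The key contribution is to propose and empirically study AI-delegated deliberation, a paradigm in which AI agents deliberate on behalf of human users and so remove the per-user time bottleneck. The authors deploy Habermolt, a public platform, and evaluate it along three dimensions: representation, aggregation, and revision. The observations illuminate the design and alignment challenges such systems face and inform future work on scalable yet trustworthy AI representatives.

Link: https://arxiv.org/abs/2605.24413
Authors: Joseph Low(1), Oscar Duys(1), Claude Formanek(2), Lewis Hammond(3), Michiel Bakker(4) ((1) Cooperative AI Research Fellowship, (2) AI Safety South Africa, (3) Cooperative AI Foundation, (4) MIT)
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments:

Click to view abstract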

Abstract:Deliberative democracy arguably leads to better collective decisions, but is fundamentally constrained by human attention and bandwidth. While recent AI-mediated deliberations scale participation by synthesizing inputs from many humans, they remain time-intensive for individual users. As AI models become increasingly capable, AI systems are being deployed not only to mediate deliberation between humans, but to represent humans in it: where AI agents deliberate on behalf of human users. We call this paradigm AI-delegated deliberation. While it promises unprecedented scale for democratic participation, it introduces qualitatively new design and alignment challenges that are poorly understood and under-theorized. To study these dynamics empirically, we deploy Habermolt, a public platform for AI-delegated deliberation. We evaluate its effectiveness along three dimensions that we use to organize any deliberative system: representation, aggregation, and revision. We use these observations to illuminate the design decisions future AI-delegated deliberation platforms must confront, contributing to the broader research agenda for scalable yet trustworthy AI representatives.

[MA-22] ProofAgent Harness: Open Infrastructure for Adversarial Evaluation of AI Agents

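Quick read: The problem addressed is that AI agents in high-risk production settings use tools, retain context, follow policies, handle private data, and interact over multiple turns, yet many evaluation methods still judge isolated outputs or static tasks and miss failures that emerge through trajectory, pressure, and adversarial interaction. The key to the solution is ProofAgent Harness, open, scalable, and adversarial evaluation infrastructure for AI agents. At its core is Adversarial Multi-Juror Scoring with Turn-Level Audit, which evaluates completed agent behavior under pressure using calibrated juror personas, consensus checks, and turn-level evidence. Experiments show that even strong agents fail selectively through weak metrics, fragile turns, unsafe reframing, and manipulation paths, and that a small quantized local Harness LLM can challenge production agents powered by best-in-class large LLMs, suggesting that evaluation capability comes from the full pipeline rather than model scale alone. The framework turns agent evaluation from a static score into adversarial evaluation that is repeatable, evidence-backed, extensible, and actionable before deployment.

Link: https://arxiv.org/abs/2605.24134
Authors: Fouad Bousetouane
Affiliations: ProofAgent.ai; The University of Chicago
Categories: Multiagent Systems (cs.MA)
Comments: 48 pages, 3 figures

Click to view abstract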

Abstract:AI agents are entering high-risk production settings, where they use tools, retain context, follow policies, handle private data, and interact with users over multiple turns. Yet many evaluation methods still judge isolated outputs or static tasks, missing failures that emerge through trajectory, pressure, and adversarial interaction. We introduce ProofAgent Harness, open infrastructure for scalable, auditable, and adversarial AI agent evaluation. The harness provides evaluation infrastructure around an agent: it curates evaluation intelligence, runs adversarial multi-turn trials, captures behavioral traces, applies post-hoc multi-juror scoring, resolves disagreement, and produces evidence-linked reports. Its open design allows developers and researchers to extend domains, traps, metrics, juror personas, scoring rules, and reporting formats. At its core is Adversarial Multi-Juror Scoring with Turn-Level Audit, which evaluates completed agent behavior under pressure using calibrated juror personas, consensus checks, and turn-level evidence. Experiments across customer support, medical triage, privacy and security, and code generation agents show that strong agents fail selectively through weak metrics, fragile turns, unsafe reframing, and manipulation paths. We also find that a small quantized local Harness LLM can challenge production agents powered by best-in-class large LLMs, suggesting that evaluation capability emerges from the full harness pipeline rather than model scale alone. ProofAgent Harness turns AI agent evaluation from a static score into scalable adversarial evaluation infrastructure: repeatable, evidence-backed, extensible, and actionable before deployment.

[MA-23] EvoSci: A Bio-Inspired Multi-Agent Framework for the Evolution of Scientific Discovery ACL2026

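Quick read: This paper addresses two challenges that large language models (LLMs) face in scientific discovery: the design of research workflows and the design of multi-role collaboration mechanisms. The key to the solution is EvoSci, a multi-agent scientific collaboration framework that integrates bio-inspired evolution with knowledge graph modeling. Role-based agents (mentor, researcher, and reviewer) iteratively generate, evaluate, and refine research ideas, and the combination of collaborative reasoning, shared memory, and evolutionary feedback improves the coherence and creativity of scientific exploration. In LLM-based structured peer-review and comparative ranking evaluations, EvoSci outperforms strong baselines, achieving the highest overall peer-review score (ICLR 4.90) and the top ranking (Top-10 = 54).

Link: https://arxiv.org/abs/2605.24018
Authors: Xiaoyu Xiong, Yuqi Ren, Deyi Xiong
Affiliations: TJUNLP Lab, School of Computer Science and Technology, Tianjin University, China
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: ACL 2026 Main Conference

Click to view abstract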

Abstract:Large language models (LLMs) have shown strong potential in scientific discovery, yet existing methods still face substantial challenges in the design of research workflows and multi-role collaboration mechanisms. To mitigate these issues, we propose EvoSci, a multi-agent scientific collaboration framework, which integrates bio-inspired evolution with knowledge graph modeling. To iteratively generate, evaluate, and refine research ideas, EvoSci incorporates multiple role-based agents, including mentor, researcher, and reviewer. By combining collaborative reasoning, shared memory, and evolutionary feedback, EvoSci significantly enhances the coherence and creativity of scientific exploration. Experiments on real-world research topics demonstrate that EvoSci significantly outperforms strong baselines in LLM-based structured peer-review and comparative ranking evaluations, achieving the highest overall peer-review score (ICLR 4.90) and top ranking (Top-10 = 54). These results suggest its superiority in both scientific idea generation and continuous discovery.

[MA-24] MemForest: An Efficient Agent Memory System with Hierarchical Temporal Indexing VLDB

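Quick read: This paper aims to solve two core problems of memory systems for long-context LLM agents: coarse-grained state management, which causes high maintenance overhead, and inherently sequential update pipelines, which cause growing latency and poor scalability as memory accumulates. The key to the solution is MemForest, a framework that reformulates agent memory as a write-efficient temporal data management problem. It breaks the sequential bottleneck with parallel chunk extraction and introduces MemTree, a hierarchical temporal index that organizes memory as time-ordered trees rather than flat global summaries, so that full-state rewrites are replaced by localized per-node updates that naturally preserve temporally evolving states. MemForest is evaluated on LongMemEval-S and LoCoMo. On LongMemEval-S it achieves the best overall performance among stateful baselines, reaching 79.8% pass@1 accuracy with memory construction throughput approximately 6x higher than state-of-the-art approaches including EverMemOS.

Link: https://arxiv.org/abs/2605.23986
Authors: Han Chen, Zining Zhang, Wenqi Pei, Bingsheng He, Ming Wu, Jason Zeng, Michael Heinrich, Wei Wu, Hongbao Zhang
Affiliations: National University of Singapore; Zero Gravity Labs
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 12 pages. Extended version with appendix as supplemental material. Submitted to VLDB

Click to view abstract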

Abstract:Memory is a fundamental component for enabling long-context LLM agents, supporting persistent state across interactions through a continuous serve-and-update lifecycle. Despite substantial prior work, existing systems suffer from significant maintenance overhead due to two key limitations: coarse-grained state management and inherently sequential update pipelines. In particular, updates are often tightly coupled with LLM inference and require full-state rewrites, leading to poor scalability and growing latency as memory accumulates. To address these challenges, we present MemForest, a memory framework that reformulates agent memory as a write-efficient temporal data management problem. MemForest breaks the sequential bottleneck via parallel chunk extraction, decoupling memory construction into concurrent, independent operations. To further eliminate coarse-grained maintenance, we introduce MemTree, a hierarchical temporal index that organizes memory as time-ordered trees rather than flat global summaries. This design replaces full-state rewrites with localized per-node updates, reducing maintenance cost to the affected tree paths while naturally preserving temporally evolving states. We evaluate MemForest on two long-context memory benchmarks, LongMemEval-S and LoCoMo. On LongMemEval-S, MemForest achieves the best overall performance among stateful baselines, reaching 79.8% pass@1 accuracy while sustaining a memory construction throughput approximately 6x higher than state-of-the-art approaches including EverMemOS.
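The contrast between full-state rewrites and localized per-node updates can be illustrated with a toy time-indexed tree. This is only a sketch of the general idea: `TemporalTree` is a hypothetical name, the tree is a fixed complete binary tree over time slots, and an item count stands in for whatever summary MemTree actually stores at each node.

```python
class TemporalTree:
    """Complete binary tree over n_slots time slots (n_slots a power of two)."""

    def __init__(self, n_slots: int):
        self.n = n_slots
        self.leaves = [[] for _ in range(n_slots)]
        # Node i has children 2i and 2i+1; leaves live at n .. 2n-1.
        self.summary = [0] * (2 * n_slots)

    def insert(self, slot: int, item: str) -> int:
        """Add an item to a time slot; return how many nodes were refreshed."""
        self.leaves[slot].append(item)
        node = self.n + slot
        touched = 0
        while node >= 1:
            self.summary[node] += 1  # stand-in for re-summarising this node
            touched += 1
            node //= 2
        return touched

    def count_range(self, lo: int, hi: int) -> int:
        """Number of items in slots [lo, hi), read from O(log n) summaries."""
        total = 0
        lo += self.n
        hi += self.n
        while lo < hi:
            if lo & 1:
                total += self.summary[lo]
                lo += 1
            if hi & 1:
                hi -= 1
                total += self.summary[hi]
            lo //= 2
            hi //= 2
        return total
```

An insert refreshes only the nodes on one leaf-to-root path (log2(n) + 1 of them), whereas a flat global summary would have to be rewritten in full on every update.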

[MA-25] QUIVER: A Formal Framework for Quantifying Perturbation Propagation and Bifurcation in Compound AI Systems

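Quick read: This paper aims to quantify how perturbations propagate through compound AI systems, which chain multiple LLM calls into directed computation graphs whose nodes have heterogeneous output types and whose execution paths can diverge. The key to the solution is QUIVER, a framework with four components: (1) a sensitivity matrix with type-dispatched distance metrics that classifies edges as amplifiers, absorbers, or threshold-sensitive, complemented by an occurrence-lift measure; (2) a trajectory divergence decomposition that splits variation into value drift, structural path divergence, and iteration count divergence; (3) bifurcation thresholds that identify the smallest perturbation causing a structural change in the execution path; (4) distribution faithfulness, which quantifies how far per-node evaluation datasets diverge from production distributions. Empirically, QUIVER reveals distinct sensitivity profiles across architectures, distinguishes mechanistically different cascade patterns that produce identical divergence rates, predicts nodes prone to trajectory bifurcation from observational data alone, and localizes stale evaluation artifacts to specific node-field categories that aggregate metrics cannot surface.

Link: https://arxiv.org/abs/2605.23956
Authors: Prashanti Nilayam, Sankalp Nayak
Affiliations: ServiceNow
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Click to view abstract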

Abstract:Compound AI systems that chain multiple LLM calls into directed computation graphs are now the dominant architecture for production AI. Although these architectures leverage heterogeneous nodes with mixed-mode outputs, no existing framework quantifies how perturbations propagate through such pipelines, where nodes are stochastic and execution paths can diverge structurally. We introduce QUIVER, a formal framework for measuring perturbation propagation in graph-structured LLM pipelines. The framework defines: (1) a sensitivity matrix with type-dispatched distance metrics that classifies edges as amplifiers, absorbers, or threshold-sensitive, complemented by occurrence-lift; (2) trajectory divergence decomposing variation into value drift, structural path divergence, and iteration count divergence; (3) bifurcation thresholds identifying the smallest perturbation that causes structural execution path changes; and (4) distribution faithfulness, quantifying when per-node evaluation datasets diverge from production distributions. We validate on two production enterprise pipelines and a public DSPy multihop QA pipeline, three structurally distinct architectures. Across 8,200+ instrumented traces (32,000+ pair comparisons), we demonstrate that QUIVER reveals distinct sensitivity profiles across architectures, distinguishes mechanistically different cascade patterns producing identical divergence rates, predicts nodes prone to trajectory bifurcation from observational data alone, and localizes stale evaluation artifacts to specific node-field categories that aggregate metrics cannot surface.
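The notion of a bifurcation threshold, the smallest perturbation that changes the structural execution path, can be sketched with a simple scan. The function below is a hypothetical illustration that assumes a deterministic pipeline returning its path as a tuple of node names; QUIVER itself handles stochastic nodes, which this sketch does not.

```python
from typing import Callable, Iterable, Optional, Tuple

Path = Tuple[str, ...]


def bifurcation_threshold(
    run_pipeline: Callable[[float], Path],
    base_input: float,
    perturb: Callable[[float, float], float],
    magnitudes: Iterable[float],
) -> Optional[float]:
    """Return the smallest tested magnitude whose path differs from baseline.

    run_pipeline: maps an input to the sequence of nodes that executed
    perturb:      applies a perturbation of a given magnitude to an input
    Returns None if no tested magnitude changes the path.
    """
    base_path = run_pipeline(base_input)
    for eps in sorted(magnitudes):
        if run_pipeline(perturb(base_input, eps)) != base_path:
            return eps
    return None
```

A toy pipeline that adds a rewrite-and-retrieve loop once a score crosses 0.5 would, for a base input of 0.25 and additive perturbations, report a threshold at the first tested magnitude that reaches 0.5.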

[MA-26] Methods for Formal Verification of Agent Skills: Three Layers Toward a Mechanically Checkable Capability-Containment Proof

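Quick read: This paper aims to raise an agent skill from the declared or tested level to the formal level, closing the gap left at the top of the four-level verification lattice introduced in the companion paper. The core of the solution is three composable verification methods: (1) sound static capability-containment analysis of the script side via abstract interpretation over a small effect lattice; (2) a refinement type system for tool-call envelopes that mechanically rejects any call whose statically inferred capability is not in the manifest's declared set; (3) SMT-bounded model checking against the parent paper's biconditional correctness criterion, with the bound chosen so that any counter-example fitting the runtime's transaction-buffer horizon is exhibited as a concrete trace. Together the three layers form a mechanically checkable capability-containment proof that the authors prove covers the parent paper's threat model, modulo a single residual (the LLM's freedom to refuse to act) that the parent paper's runtime biconditional catches at the session boundary. The methods reuse existing tools (Z3, Semgrep, CodeQL, and others) and ship as zero-dependency JavaScript modules in the open-source enclawed framework, with 53 unit tests and an end-to-end CLI demo.

Link: https://arxiv.org/abs/2605.23951
Authors: Alfredo Metere
Affiliations: Metere Consulting, LLC
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
Comments:

Click to view abstract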

Abstract:The companion paper introduced a four-level verification lattice on agent-skill manifests (unverified, declared, tested, formal) and left the top level aspirational. This paper closes that gap. We give a precise semantics for skill behaviour faithful to how a skill is consumed by an LLM-driven runtime (a deterministic script-side reachable through a non-deterministic LLM-side), state the verification problem as a capability-containment property over that semantics, and present three composable methods that together raise a skill from declared or tested to formal: (1) sound static capability-containment analysis of the script-side via abstract interpretation over a small effect lattice; (2) a refinement type system for tool-call envelopes that mechanically rejects any call whose statically-inferred capability is not in the manifest’s declared set; (3) SMT-bounded model checking against the parent paper’s biconditional correctness criterion, with the bound chosen so any counter-example fitting the runtime’s transaction-buffer horizon is exhibited as a concrete trace. We prove the three layers composed soundly cover the parent paper’s threat model modulo a single residual (the LLM’s freedom to refuse to act) that the parent paper’s runtime biconditional catches at session boundary. The methods reuse existing well-engineered tools (Z3, Semgrep, CodeQL, refinement-type checkers, mechanised proof assistants) rather than asking operators to build new ones, and the proof-carrying artifact extends the existing this http URL convention. All three methods plus the bundle producer and re-checker ship as zero-dependency JavaScript modules in the open-source enclawed framework (this https URL project page this https URL), with 53 unit tests and an end-to-end CLI demo on a sample skill. 
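The capability-containment property at the centre of the abstract reduces, in its simplest form, to a set-inclusion check between what a tool call needs and what the manifest declares. The sketch below is a runtime stand-in written in Python for illustration; the capability strings, `CapabilityError`, and `check_envelope` are hypothetical and are not the enclawed framework's API, which the abstract describes as JavaScript modules with a static refinement type system.

```python
class CapabilityError(Exception):
    """Raised when a tool call needs a capability the manifest did not declare."""


def check_envelope(declared: frozenset, inferred: frozenset) -> None:
    """Reject a tool call unless inferred capabilities are all declared.

    declared: capabilities listed in the skill manifest
    inferred: capabilities the call was (statically) inferred to require
    """
    undeclared = inferred - declared
    if undeclared:
        raise CapabilityError(f"undeclared capabilities: {sorted(undeclared)}")
```

For example, a skill whose manifest declares only `"fs.read"` would be rejected as soon as a call is inferred to need `"fs.write"`.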

[MA-27] SODE: Analyzing Social Dynamics in LLM Agents

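Quick read: The problem addressed is that how well LLMs acting as interactive agents align with human social dynamics is still poorly understood. Existing work relies mostly on outcome-based metrics such as average scores and overlooks the different strategies that underlie sustainable cooperation. The key to the solution is SODE (Social Dynamics Evaluation), a framework that evaluates LLM agents along three evolutionary dimensions: Direct Reciprocity for strategy adaptation, Indirect Reciprocity for reputation sensitivity, and Group Dynamics for cooperative resilience. The results show that instruction-tuned models often exhibit "passive compliance" that leaves them vulnerable to exploitation, while reasoning models prioritize short-horizon optimization and destabilize long-term cooperation. A "long-horizon framing" can unlock reciprocal capabilities in reasoning models.

Link: https://arxiv.org/abs/2605.23949
Authors: Inseo Jung, Yoonseok Oh, Kyungryul Back, Jinkyu Kim, Jungbeom Lee
Affiliations: Korea University; Kakao Mobility
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract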

Abstract:As Large Language Models (LLMs) evolve into interactive agents, understanding their behavioral alignment within human social dynamics becomes essential. While behavioral game theory offers a framework to study these interactions, previous work has predominantly relied on outcome-based metrics such as average scores. This focus overlooks the mechanisms that facilitate sustainable cooperation, as identical scores can be derived from vastly different strategies. To bridge this gap, we introduce SODE (Social Dynamics Evaluation), a framework that evaluates LLM agents across three evolutionary dimensions: Direct Reciprocity for strategy adaptation, Indirect Reciprocity for reputation sensitivity, and Group Dynamics for cooperative resilience. Applying SODE reveals systematic divergences: instruction-tuned models often exhibit “passive compliance” that renders them vulnerable to exploitation, while reasoning models prioritize short-horizon optimization, destabilizing long-term cooperation. Notably, we demonstrate that a “long-horizon framing” can unlock reciprocal capabilities in reasoning models. Thus, SODE offers a systematic, mechanism-grounded benchmark for aligning AI agents with complex human social dynamics.

[MA-28] comokit4py : a python package to ease COMOKIT agent based model simulation integration into a high performance computing workflow

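Quick read: The problem addressed is how to run large, parameterized agent-based model (ABM) experiments efficiently on High-Performance Computing (HPC) infrastructure, specifically for COMOKIT, a model that simulates the everyday life of city inhabitants in Vietnam and tests COVID-19 policy interventions. The key to the solution is a Python package that simplifies generating, exploring, and building reports for COMOKIT experiments to be launched on HPC infrastructure.

Link: https://arxiv.org/abs/2605.23948
Authors: Arthur Brugière, Kévin Chapuis
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA)
Comments: 6 pages, 2 figures, IEEE submission

Click to view abstract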

Abstract:Agent-based models (ABMs) are a kind of computer model that makes it possible to simulate a set of autonomous interacting programs, called agents, in a shared virtual environment. Among other application fields, they have commonly been used to simulate social phenomena such as urban segregation, opinion dynamics, or epidemiological crises [1]. Recently, research emphasis has been put on ABMs to study in silico the impact of non-pharmaceutical interventions to mitigate the SARS-CoV-2 outbreak of 2020, with a few of them having a great impact on global political responses [2]. Among the models used, COMOKIT [3] has been designed to simulate the everyday life of inhabitants of various cities in Vietnam and to test policy interventions for various COVID-19 spread scenarios. Such an endeavor required huge computational power to handle a large number of simulation replications over a large set of parameters. In this proposal we present a Python package that makes it easy to generate, explore, and build reports for any COMOKIT experiment to be launched on High-Performance Computing (HPC) infrastructure.

[MA-29] Operationalizing Reconstructive Authority: Runtime Construction, Dependency Resolution, and Execution Gating in Autonomous Agent Systems

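【Quick Read】: This paper addresses erroneous execution in autonomous agent systems caused by authority that no longer holds at runtime: a system may not only make wrong decisions but also execute actions for which it no longer has legitimate authority in the current state. The key to the solution is turning Reconstructive Authority (RAM) from a theoretical condition into a mechanism that can be enforced at runtime. The paper introduces an extended execution model that evaluates whether authority is constructible at the moment an action executes, and adds a halt state for cases where observation is incomplete or uncertain. It also defines an execution protocol comprising dynamic dependency resolution, authority reconstruction, and explicit decision semantics, together with a Recovery Loop that combines drift detection (IML) with execution control (ACP) so that the system can acquire missing information and re-attempt authority reconstruction. The result is safety plus conditional liveness: execution can resume once the authority-defining variables become observable.

Link: https://arxiv.org/abs/2605.23935
Authors: Marcelo Fernandez - TraslaIA
Affiliations: TraslaIA
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA); Software Engineering (cs.SE); Systems and Control (eess.SY)
Comments: Agent Governance Series, Paper P6. Companion papers on arXiv: P0 (2604.17511), P1 (2603.18829), P2 (2604.17517). P3/4 and P5 submitted concurrently (pending arXiv IDs). Zenodo: https://doi.org/10.5281/zenodo.19699460

Click to view abstract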

Abstract:Autonomous agent systems fail not only due to incorrect decisions, but due to executing decisions whose authority no longer holds at runtime. Prior work defined Reconstructive Authority (RAM) as a condition for valid execution: actions are permitted only if authority can be constructed from current state. This paper addresses enforcement at runtime: how to enforce this condition in a running system. We introduce a runtime execution model in which authority is evaluated at action time and execution is conditioned on its constructibility. This extends the execution state space beyond admit/deny with a third state, halt, representing cases where authority is undefined due to incomplete or uncertain observability. We define a concrete execution protocol including dynamic dependency resolution, authority reconstruction, and explicit decision semantics. We further introduce a Recovery Loop that integrates drift detection (IML) with execution control (ACP), allowing the system to suspend execution, acquire missing information, and re-attempt authority reconstruction. We show that this model guarantees safety – no action is executed without constructible authority – and conditional liveness: execution resumes when authority-defining variables become observable. This work operationalizes reconstructive authority as a runtime enforcement mechanism, providing the execution semantics required to apply RAM in real systems.
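
The three-valued admit/deny/halt decision described in this abstract can be sketched as a small gate. This is a minimal illustration under stated assumptions, not the paper's protocol: the names `Decision`, `gate`, and `recovery_loop` are hypothetical, authority is reduced to a boolean predicate over observed variables, and drift detection (IML) is reduced to re-observing state a bounded number of times.

```python
from enum import Enum


class Decision(Enum):
    ADMIT = "admit"
    DENY = "deny"
    HALT = "halt"


def gate(required, observed, predicate):
    """Decide whether an action may execute right now.

    required:  names of the authority-defining variables (hypothetical).
    observed:  dict of currently observable variables; a missing key or a
               None value means the variable cannot be observed.
    predicate: function(dict) -> bool that reconstructs authority.
    """
    missing = [name for name in required if observed.get(name) is None]
    if missing:
        # Authority is undefined rather than false: suspend, do not deny.
        return Decision.HALT, missing
    values = {name: observed[name] for name in required}
    return (Decision.ADMIT if predicate(values) else Decision.DENY), []


def recovery_loop(required, observe, predicate, max_attempts=3):
    """Re-observe and retry while the gate halts (bounded for this sketch)."""
    for _ in range(max_attempts):
        decision, _missing = gate(required, observe(), predicate)
        if decision is not Decision.HALT:
            return decision
    return Decision.HALT
```

In this sketch an action is never admitted unless every required variable is observed and the predicate holds, which mirrors the safety claim; the loop resumes as soon as a later observation supplies the missing variables, which mirrors conditional liveness.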

[MA-30] Quantum Frog: Emergent Cooperation and Difficulty Scaling in a Quantized-Time Cooperative Game

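【Quick Read】: This paper studies how cooperative strategies form in multi-agent reinforcement learning, in particular how environment design (a quantized-time mechanic) steers agents toward efficient cooperation in time-sensitive tasks. The key is the quantized-time mechanic, in which the environment advances only when a player acts, so agents must coordinate their decisions to minimize their exposure to risk. Experiments show that the optimal strategy under this mechanic is a synchronized "rush" strategy (moving directly upward at every step) rather than complex positional coordination. Cooperative training with a centralized critic raises the joint success rate by 32–34 percentage points and shortens the average episode from about 90 steps to about 6, indicating that shared incentives are enough to produce effective cooperation in time-critical tasks.

Link: https://arxiv.org/abs/2605.23930
Authors: Saad Mankarious
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Click to view abstract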

Abstract:We introduce Quantum Frog, a two-player cooperative game built on a novel quantized-time mechanic in which the environment advances only when a player acts. Inspired by the classic arcade game Frogger, Quantum Frog requires two frogs to cross an 8 × 8 grid of traffic and reach the far side together. We use reinforcement learning (RL) as an analytical lens to answer four design questions: (1) how does game difficulty scale with traffic density, (2) what is the optimal single-agent policy and why, (3) how large is the cooperation gap between independent and cooperative two-agent play, and (4) what joint strategy emerges when agents are incentivised to cooperate? We train agents through five escalating stages, Tabular Q-Learning, Deep Q-Network (DQN), Independent DQN (IDQN), and Multi-Agent Proximal Policy Optimisation (MAPPO with a centralised critic), evaluating each against traffic densities of one to six cars. Our key findings are: (i) the quantized-time mechanic makes a rush strategy (moving directly upward at every step) universally optimal, as time exposure to traffic is minimised; (ii) adding an uncoordinated second player is harder than sextupling the traffic for a single expert player; (iii) cooperative training recovers +32–34 percentage points of joint success rate relative to independent agents and reduces episode length from ~90 to ~6 steps; and (iv) the emergent cooperative strategy is synchronised rushing, not complex positional coordination, illustrating that shared incentives alone suffice to align agents in time-critical cooperative tasks. These findings provide concrete, empirically grounded guidance for the commercial design of Quantum Frog and offer broader insights into the role of environment mechanics in shaping multi-agent learning dynamics.
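
The quantized-time mechanic can be made concrete with a toy single-frog grid in which traffic moves only inside `step()`. Everything below is an assumption made for illustration (the class name `QuantizedTimeGrid`, cars drifting one cell to the right with wrap-around, and the collision check after both moves); the actual game rules may differ.

```python
class QuantizedTimeGrid:
    """Toy single-frog grid: the world advances only when the frog acts."""

    def __init__(self, size=8, cars=()):
        self.size = size
        self.frog = (0, size // 2)             # (row, col); row 0 is the start
        self.cars = [tuple(c) for c in cars]   # (row, col); all drift right
        self.ticks = 0

    def step(self, action):
        """Apply one move, then advance traffic by exactly one tick."""
        d_row, d_col = {"up": (1, 0), "left": (0, -1), "right": (0, 1)}[action]
        row = min(self.size - 1, self.frog[0] + d_row)
        col = min(self.size - 1, max(0, self.frog[1] + d_col))
        self.frog = (row, col)
        self.cars = [(r, (c + 1) % self.size) for r, c in self.cars]
        self.ticks += 1
        if self.frog in self.cars:
            return "hit"
        return "goal" if row == self.size - 1 else "alive"


def rush(env):
    """Move up on every step until the episode ends."""
    status = "alive"
    while status == "alive":
        status = env.step("up")
    return status, env.ticks
```

Because the tick counter equals the number of actions taken, a sideways move always costs one extra tick of traffic exposure, which is the intuition behind the rush strategy in finding (i).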

[MA-31] Context: Proactive Goal-Directed Intelligence via Composable Sandboxed Programs, Declarative Wiring, and Structured Interaction

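【Quick Read】: This paper addresses the limits of traditional query-response chatbots in advancing tasks and supporting collaboration: such systems can only wait passively for user input and struggle to deliver goal-directed, proactive interaction or multi-participant coordination. The key is the Context architecture, an intelligence layer built from three mutually reinforcing mechanisms: 1) write-time context assembly, in which Groker agents precompute structured attributes and context is built as a deterministic pure function of graph state, enabling near-100% key-value cache (KV-cache) reuse across turns; 2) composable sandboxed wisdom programs, in which imperative programs generated by a large language model (LLM) are declaratively wired to goal types via typed stream relations, composed by phase ordering, and executed without further LLM calls; and 3) proactive goal stream state machines, which inspect graph state and emit structured interaction content (option arrays, governance affordances, clarification prompts) to drive the conversation toward a terminal state without waiting for user input. The paper further proves six formal theorems covering cost control, program correctness, proactive dominance, coordination efficiency, and cross-platform consistency.

Link: https://arxiv.org/abs/2605.23928
Authors: Gregory Magarshak
Affiliations: Qbix, Inc.; Intercoin, Inc.; IE University NYC
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA); Programming Languages (cs.PL); Software Engineering (cs.SE)
Comments: 7 pages; third in a series with arXiv: this http URL (Magarshak Machine / SPACER) and arXiv: this http URL (Grokers)

Click to view abstract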

Abstract:We present Context, the intelligence layer of the Magarshak Architecture, which replaces reactive query-response chatbots with proactive goal-directed agents that advance shared tasks without waiting for user prompts. The architecture rests on three mutually reinforcing mechanisms. Write-time context assembly precomputes enriched typed attributes via Groker agents, assembling interaction context as a deterministic pure function of graph state; context blocks are byte-identical across turns between semantic changes, enabling near-100% KV-cache reuse. Composable sandboxed wisdom programs form a governed library of LM-generated imperative programs declaratively wired to goal types via typed stream relations, composed via phase ordering, and executed at interaction time without further LM calls. Proactive goal stream state machines drive conversations toward terminal states by inspecting graph state and emitting structured interaction content (option arrays, governance affordances, clarification prompts) without awaiting user input. We prove six formal results: the Context Stability Theorem, bounding per-turn LM cost as a function of semantic change rate; a Program Composition Correctness Theorem; a Declarative Wiring Soundness Theorem; the Proactive Dominance Theorem, proving proactive agents weakly dominate reactive agents on expected turns-to-terminal-state; Coordination Overhead Elimination and Quality Preservation, establishing Pareto improvements in multi-participant goal chats; and a Cross-Platform Vote Consistency Theorem. Implemented in the open-source Qbix / Safebox / Safebots stack.
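
The claim that context blocks are byte-identical between semantic changes follows from assembling them as a pure function of state. A minimal sketch of that idea, assuming a flat dict stands in for graph state and that keys starting with `_` are non-semantic bookkeeping (both are hypothetical conventions, as are the function names):

```python
import hashlib
import json


def assemble_context(graph_state):
    """Serialize semantic state canonically: no clock, no randomness."""
    semantic = {k: v for k, v in graph_state.items() if not k.startswith("_")}
    # Sorted keys and fixed separators make the bytes independent of
    # dict insertion order.
    text = json.dumps(semantic, sort_keys=True, separators=(",", ":"))
    return text.encode("utf-8")


def cache_key(context_block):
    """Identical bytes give an identical key, so a cached prefix can be reused."""
    return hashlib.sha256(context_block).hexdigest()
```

Whether an inference server actually reuses its KV-cache also depends on the tokenizer and on the block's position in the prompt; the sketch only shows the determinism that makes reuse possible.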

[MA-32] TEAM-SimHRA: A Team-Based Simulation Framework for Human Reliability Analysis Using Multi-Agent Large Language Models

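【Quick Read】: This paper addresses the inability of traditional Human Reliability Analysis (HRA) methods to model team-level failure mechanisms in nuclear power plant control rooms, where failures often stem from complex socio-cognitive processes such as interaction dynamics, delayed diagnosis, suppressed dissent, and authority-driven error propagation. The key is TEAM-SimHRA, a simulation framework based on multi-agent large language models that redefines human reliability as a dynamic property emerging from the interactions of the control room team rather than a static individual attribute. The framework simulates collective cognition, role-conditioned authority dynamics, and real-time communication suppression as an accident unfolds. It is validated against the Three Mile Island (1979) and Chernobyl (1986) accidents, where it closely reproduces key indicators such as decision delay, communication suppression stability, and the authority pressure cascade, offering a quantifiable path to team-level reliability assessment for safety-critical sociotechnical systems.

Link: https://arxiv.org/abs/2605.23927
Authors: Xingyu Xiao, Jiejuan Tong, Jingang Liang, Haitao Wang
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA)
Comments:

Click to view abstract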

Abstract:Team-level failure in nuclear control rooms arises not from isolated operator error, but from emergent interaction dynamics, delayed diagnosis, suppressed dissent, and authority-driven error propagation, that conventional human reliability analysis methods are structurally unable to model. This study introduces TEAM-SimHRA, a multi-agent large language model simulation framework that reconceptualizes human reliability as an interaction-driven emergent property of control room teams rather than a static individual attribute. Unlike existing approaches that assign fixed error probabilities to predefined tasks, TEAM-SimHRA reproduces collective cognition, role-conditioned authority dynamics, and real-time communication suppression across temporally evolving accident progressions. Validated against the Three Mile Island (1979) and Chernobyl (1986) accidents, the two most extensively documented nuclear team failures, the framework achieves face-validity pass rates of 43.5% and 52.6% respectively, reproducing near-historical decision delay (134.8 vs. 138 min), perfect communication suppression stability, and full authority pressure cascade at historically accurate propagation depth. These results demonstrate that multi-agent simulation can extract quantitative team-level reliability indicators that are inaccessible to traditional methods, opening a viable path toward simulation-based dynamic probabilistic risk assessment for safety-critical sociotechnical systems.

[MA-33] VineLM: Trie-Based Fine-Grained Control for Agentic Workflows

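【Quick Read】: This paper addresses a limitation of existing workflow managers for agentic workflows that combine configurable large language model (LLM) stages with tool stages: they use a static, workflow-level plan that binds each LLM stage to a single model and reuses it across loop iterations, so model choices cannot adapt to observed runtime behavior and fine-grained trade-offs among cost, latency, and accuracy are out of reach. The key is VineLM, a dynamic workflow manager with fine-grained control. It represents feasible executions as an annotated trie of model-choice prefixes and uses checkpointing and cascade profiling to estimate the accuracy, cost, and latency of each path without exhaustively profiling every request on every path. At runtime it re-roots the trie after each stage invocation and replans over the remaining subtrie using the realized execution prefix and the remaining latency budget, optimizing request-level objectives such as maximizing accuracy under a fixed budget. On NL2SQL and math reasoning tasks, VineLM outperforms coarse-grained baselines, reaching up to 18% higher accuracy at the same per-request budget, while sparse profiling cuts offline profiling cost by 98-99.8%.

Link: https://arxiv.org/abs/2605.23914
Authors: Nikos Pagonas, Matthew Lou, Tianyi Peng, Dan Rubenstein, Kostis Kaffes
Affiliations: Columbia University
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Click to view abstract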

Abstract:Agentic workflows interleave configurable LLM stages with tool stages and often include retries or refinement loops. Existing workflow managers profile full workflow configurations offline and assign each request a static workflow-level plan that binds each configurable LLM stage to a single model, reuses that model across repeated loop iterations, and does not revisit those choices at runtime. We present VineLM, a workflow manager that enables fine-grained control by choosing the model for each stage invocation as execution unfolds under request-level objectives such as maximizing accuracy under cost or latency budgets. VineLM represents feasible executions as an annotated trie of model-choice prefixes and uses checkpointing and cascade profiling to estimate path accuracy, cost, and latency without exhaustively profiling every request on every path. At runtime, VineLM re-roots the trie after each stage invocation and replans over the remaining subtrie using the realized execution prefix and remaining latency budget. On NL2SQL and math reasoning workflows, VineLM improves the cost-latency-accuracy frontier over coarse workflow-level baselines, achieving up to 18% higher accuracy at the same per-request budget with its sparse profiling reducing offline profiling cost by 98-99.8% when compared to exhaustive profiling.
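
The trie-and-replan idea can be sketched in a few lines. This is a simplified illustration, not VineLM's implementation: the `Node` fields, the single scalar budget, and the exhaustive search in `best_plan` are assumptions, and the per-path estimates that the paper obtains through checkpointing and cascade profiling are supplied here as plain numbers.

```python
class Node:
    def __init__(self, cost=0.0, accuracy=0.0):
        self.cost = cost          # estimated cost of this stage invocation
        self.accuracy = accuracy  # estimated accuracy if execution ends here
        self.children = {}        # model name -> Node


def insert(root, path, costs, accuracy):
    """Add one profiled path; costs[i] is the cost of invoking path[i]."""
    node = root
    for model, cost in zip(path, costs):
        node = node.children.setdefault(model, Node(cost=cost))
    node.accuracy = accuracy


def best_plan(node, budget):
    """Return (accuracy, path) of the best complete path within budget."""
    if not node.children:
        return node.accuracy, []
    best = (-1.0, None)           # (-1.0, None) means no feasible path
    for model, child in node.children.items():
        if child.cost <= budget:
            acc, rest = best_plan(child, budget - child.cost)
            if rest is not None and acc > best[0]:
                best = (acc, [model] + rest)
    return best


def reroot(node, chosen_model):
    """After a stage runs, continue planning from the matching subtrie."""
    return node.children[chosen_model]
```

A caller would run the first model of the plan, subtract the cost that was actually incurred from the budget, call `reroot`, and then call `best_plan` again on the remaining subtrie, so a stage that overruns its estimate can push later stages toward cheaper models.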

[MA-34] Interpretation, Learning, and Empathy as One Constraint: A Residual-Adequacy Architecture with Accountable Abstention

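【Quick Read】: This paper asks how a single cognitive architecture can provide situational interpretation, learning, and multi-agent coordination together, along with an explainable abstention mechanism when those abilities reach their limits. These functions are usually assigned to separate modules, yet they share a failure mode: when the situation exceeds what the agent can currently represent, the system cannot respond correctly. The key is a single constraint realized in the Interpretation-Decision Unit (IDU), whose core is a scalar residual that measures how well the input content fits the currently active regimes (local representational frames). The residual drives the IDU to re-interpret, attempt a justified expansion, or halt with a typed, witnessed terminal, so the unit always terminates deterministically in finitely many steps and the reason for abstention is encoded explicitly in the output. The same mechanism accounts for three phenomena in human and machine cognition (the typology of not-knowing, bounded empathy, and developmental prerequisites) and yields falsifiable predictions, giving an accountable model of cognitive limits.

Link: https://arxiv.org/abs/2605.24999
Authors: Chainarong Amornbunchornvej
Affiliations: National Electronics and Computer Technology Center (NECTEC)
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: First draft for journal submission. The code is at this https URL

Click to view abstract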

Abstract:An agent must act on the situation before it, learn what it cannot yet represent, and model other agents well enough to coordinate. These faculties are usually realized by separate mechanisms, yet they share a failure mode: the situation can exceed what the agent can currently represent, and the honest response is then a principled refusal that says what was missing. We develop a small cognitive architecture in which these limits arise from a single quantity. An Interpretation-Decision Unit (IDU) interprets a content vector through a family of regimes - local representational frames with private bases - and decides which actions it licenses; a scalar residual of the content against the active regimes’ representational scope drives the unit. Low residual with a clean licensing emits an action; otherwise the unit re-interprets, attempts a description-length-justified expansion, or halts with a typed, witnessed terminal. We prove the unit is total and deterministic: for any content and fixed configuration it halts in finitely many bounded-cost steps with a unique terminal witness, so abstention carries its cause by construction. By binding the architecture’s open parameters without changing its mechanics, the same residual-against-scope constraint recovers three documented phenomena at three scopes: the typology of not-knowing (typed abstention); a forced misunderstanding between agents, localized to one shared concept and invisible to the agent committing it (bounded empathy); and prerequisite dependence in learning derived from a bounded focus window rather than posited (developmental prerequisites). Each instantiation is worked for a natural and an artificial agent and states a falsifiable prediction, so one constraint can model limits in both human and machine cognition. The account contributes a unification and a notion of accountable abstention, typed and witnessed by construction.
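
One way to read "a scalar residual of the content against the active regimes' representational scope" is as the norm of the part of the content vector that lies outside a regime's basis. The sketch below takes that reading as an assumption; it requires each basis to be orthonormal, omits the licensing and expansion steps, and uses hypothetical names (`residual`, `decide`, the `out_of_scope` type).

```python
import math


def residual(content, basis):
    """Norm of the part of `content` outside the span of an orthonormal basis."""
    remainder = list(content)
    for b in basis:
        coeff = sum(x * y for x, y in zip(content, b))
        remainder = [r - coeff * y for r, y in zip(remainder, b)]
    return math.sqrt(sum(r * r for r in remainder))


def decide(content, regimes, threshold=0.1, max_reinterpretations=3):
    """Try regimes in order; act on a good fit, else halt with a typed witness."""
    tried = []
    for name, basis in list(regimes.items())[:max_reinterpretations]:
        r = residual(content, basis)
        tried.append((name, round(r, 6)))
        if r <= threshold:
            return {"terminal": "act", "regime": name, "residual": r}
    # The witness records every regime tried and how badly each one fit.
    return {"terminal": "abstain", "type": "out_of_scope", "witness": tried}
```

The loop is bounded by `max_reinterpretations`, so the function always returns after finitely many steps, and an abstention carries the list of regimes and residuals that caused it.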

Natural Language Processing

[NLP-0] MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research

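【Quick Read】: This paper addresses two key problems in reinforcement learning (RL) research for mobile apps: the lack of verifiable outcome signals based on deterministic state judging, and the difficulty of running low-cost, large-scale online RL training. Existing approaches are often tied to proprietary backends and cannot provide structured, reproducible state representations and evaluation, which makes results unreliable and hard to scale. The key is MobileGym, a browser-hosted, lightweight, fully controllable mobile interaction environment. Its main contributions are: 1) the full environment state is captured, configured, forked, and compared as structured JSON, which supports deterministic judging; 2) a layered state model and a declarative task-definition framework keep state programmability and task creation practical at scale; 3) a single programmatic judging mechanism produces both deterministic evaluation verdicts and dense RL rewards; and 4) the accompanying MobileGym-Bench provides 416 parameterized task templates over 28 apps and uses an AnswerSheet protocol to avoid free-text matching failures. In a Sim-to-Real study, GRPO on Qwen3-VL-4B-Instruct gains 12.8 percentage points on the test set, and real-device execution retains 95.1% of the simulation-side training gain.

Link: https://arxiv.org/abs/2605.26114
Authors: Dingbang Wu, Rui Hao, Haiyang Wang, Shuzhe Wu, Han Xiao, Zhenghong Li, Bojiang Zhou, Zheng Ju, Zichen Liu, Lue Fan, Zhaoxiang Zhang
Affiliations: Institute of Automation, Chinese Academy of Sciences; Peking University; The Chinese University of Hong Kong
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project page: this https URL

Click to view abstract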

Abstract:We present MobileGym, a browser-hosted, lightweight, fully controllable environment for everyday mobile use, targeting interaction fidelity without replicating proprietary backends. It enables two capabilities previously out of reach for everyday apps: verifiable outcome signals through deterministic state-based judging over structured JSON state, and scalable online RL through low-cost parallel rollouts. The full environment state is captured, configured, forked, and compared as structured JSON, and a single server can host hundreds of parallel instances, with about 400 MB memory per instance and about 3 s cold start. A layered state model and a declarative task-definition framework keep state programmability and task creation practical at scale, and a single programmatic judging mechanism delivers both deterministic evaluation verdicts and dense RL rewards. The accompanying MobileGym-Bench provides 416 parameterized task templates, including 256 test and 160 train templates, over 28 apps, with deterministic judges and a structured AnswerSheet protocol that avoids free-text matching failures. In a Sim-to-Real case study, GRPO on Qwen3-VL-4B-Instruct gains +12.8 percentage points on the 256-task test set, and on a 59-task real-device signal subset, real-device execution retains 95.1% of the simulation-side training gain. Project page: this https URL.
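
A single judge that yields both a deterministic verdict and a dense reward can be sketched as exact checks over structured state. This is an illustration only: the dotted-path convention, the state schema, and the choice of "fraction of checks satisfied" as the reward are assumptions, not MobileGym's actual format.

```python
def get_path(state, path):
    """Follow a dotted path through nested dicts; return None if absent."""
    node = state
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def judge(state, expected):
    """Return (verdict, reward) from exact checks on structured state."""
    hits = [get_path(state, path) == value for path, value in expected.items()]
    reward = sum(hits) / len(hits)   # dense signal: fraction of checks met
    return all(hits), reward
```

Because the checks compare parsed values rather than screenshots or free text, the same final state always produces the same verdict, and partially completed tasks still receive a graded reward.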

[NLP-1] Prism: A Plug-in Reproducible Infrastructure for Scalable Multimodal Continual Instruction Tuning

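【Quick Read】: This paper addresses the engineering bottlenecks of Multimodal Continual Instruction Tuning (MCIT), which arises because Multimodal Large Language Models (MLLMs) must keep adapting to new tasks after deployment. Current methods are usually implemented by directly modifying the base MLLM codebase, which incurs heavy implementation overhead and produces method-specific architectures that limit code reuse and fair comparison. The key is Prism, a plug-in, reproducible codebase that separates algorithm development from the backbone implementation through a lightweight plugin registration mechanism. New strategies are integrated as independent plugins without touching the underlying MLLM code, which removes structural fragmentation and speeds up method iteration. Prism also natively supports large-scale training pipelines, so MCIT experiments are reproducible and scalable.

Link: https://arxiv.org/abs/2605.26110
Authors: Jun-Tao Tang, Yu-Cheng Shi, Zhen-Hao Xie, Da-Wei Zhou
Affiliations: Nanjing University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Code is available at this https URL

Click to view abstract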

Abstract:Multimodal Large Language Models (MLLMs) achieve versatility by reformulating diverse tasks into a unified instruction-following framework via instruction tuning. However, real-world deployment requires continuous adaptation to emerging tasks, motivating Multimodal Continual Instruction Tuning (MCIT). Despite its growing importance, current MCIT research is hindered by severe engineering bottlenecks. Existing methods are typically implemented by directly modifying the base MLLM codebase, which imposes substantial implementation overhead and yields method-specific architectures that severely limit code reuse and fair comparison. To address this, we introduce Prism, a plug-in reproducible codebase specifically designed for scalable MCIT research. It separates algorithmic development from the backbone implementation via a lightweight plugin registration mechanism, enabling new strategies to be integrated as independent plugins without modifying the underlying MLLM codebase, thereby eliminating structural fragmentation and accelerating method development. Prism natively supports widely used large-scale training pipelines, thereby enabling reproducible and scalable MCIT experimentation. Code is available at this https URL.
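
A plugin registration mechanism of the kind described here is commonly built from a registry dict and a class decorator. The sketch below shows that general pattern only; it is not Prism's API, and every name in it (`PLUGINS`, `register`, `Strategy`, `L2Anchor`, `build`) is hypothetical.

```python
PLUGINS = {}  # hypothetical global registry: name -> strategy class


def register(name):
    """Class decorator that records a strategy under a string key."""
    def wrap(cls):
        if name in PLUGINS:
            raise ValueError(f"plugin {name!r} is already registered")
        PLUGINS[name] = cls
        return cls
    return wrap


class Strategy:
    """Hypothetical hook interface called by a generic training loop."""
    def before_task(self, task_id):
        pass

    def loss_penalty(self, params):
        return 0.0


@register("l2_anchor")
class L2Anchor(Strategy):
    """Toy strategy: penalize drift from the first parameters seen in a task."""
    def __init__(self, weight=1.0):
        self.weight = weight
        self.anchor = None

    def before_task(self, task_id):
        self.anchor = None  # forget the previous task's anchor

    def loss_penalty(self, params):
        if self.anchor is None:
            self.anchor = list(params)
        return self.weight * sum((p - a) ** 2 for p, a in zip(params, self.anchor))


def build(name, **kwargs):
    """Instantiate a registered strategy by name, e.g. from a config file."""
    return PLUGINS[name](**kwargs)
```

With this pattern the training loop only calls the `Strategy` hooks, so adding a new continual-learning method means adding one decorated class rather than editing backbone code.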

[NLP-2] Language Models Need Sleep

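【Quick Read】: This paper addresses the sharp drop in computational efficiency that Transformer-based large language models suffer on long-horizon tasks as context grows (attention cost grows quadratically with sequence length). The key is a sleep-like consolidation mechanism: the model periodically compresses recent context into persistent fast weights and then clears its key-value cache. During "sleep", the model performs N offline recurrent passes over the accumulated context and updates the fast weights in its State-Space Model (SSM) blocks through a learned local rule. This shifts the extra computation to the sleep phase while keeping wake-time prediction latency low. Experiments show that increasing the number of sleep passes N improves performance, most of all on tasks that require deep reasoning such as multi-hop graph retrieval and math reasoning, where a regular Transformer and SSM-attention hybrid models fail.

Link: https://arxiv.org/abs/2605.26099
Authors: Sangyun Lee, Sean McLeish, Tom Goldstein, Giulia Fanti
Affiliations: Carnegie Mellon University; University of Maryland
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract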

Abstract:Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To handle this, we study a sleep-like consolidation mechanism in which a model periodically converts recent context into persistent fast weights before clearing its key-value cache. During sleep, the model performs N offline recurrent passes over the accumulated context and updates the fast weights in its state-space model (SSM) blocks through a learned local rule. During inference, this shifts extra computation to sleep while preserving the latency of wake-time prediction. We test our method on controlled synthetic tasks, including cellular automata and multi-hop graph retrieval, as well as a realistic math reasoning task, on which a regular transformer as well as SSM-attention hybrid models fail. We then show that increasing sleep duration N for our models improves performance, with the largest gains on examples that require deeper reasoning.

[NLP-3] Automated Benchmark Auditing for AI Agents and Large Language Models

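【Quick Read】: This paper addresses design flaws and insufficient verification in current frontier large language model (LLM) benchmarks, including hidden environment dependencies, incomplete specifications, and brittle grading logic, all of which distort capability assessment. The key is Auto Benchmark Audit (ABA), an agentic framework that systematically audits each benchmark task to identify these problems automatically. It analyzes a task's execution environment, specification, and evaluation logic, and the accuracy of its audits is validated by expert review and third-party reports. Empirically, over 25.7% of the evaluated tasks have critical issues, and filtering them out shifts model rankings and raises average scores, which demonstrates ABA's value for improving benchmark trustworthiness.

Link: https://arxiv.org/abs/2605.26079
Authors: Junlin Wang, Federico Bianchi, Shang Zhu, Fan Nie, Yongchan Kwon, Bhuwan Dhingra, James Zou
Affiliations: Duke University; Together AI; Stanford University
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract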

Abstract:Modern AI benchmarks operate at a complexity that outpaces traditional verification methods. Tasks authored by domain experts often contain implicit assumptions, incomplete environment specifications, and brittle evaluation logic that human annotation cannot reliably catch. We introduce Auto Benchmark Audit (ABA), an agentic framework that systematically audits individual benchmark tasks, uncovering issues such as hidden environment dependencies, specification gaps, and limited grading logic. We run ABA on a collection of frontier LLM benchmarks and previous NeurIPS publications, totaling 168 benchmarks across nine domains. Across this corpus, ABA identifies critical issues including ambiguous task design, execution environment conflicts, and incorrect ground truths in over 25.7% of the evaluated tasks. The precision of these automated audits is validated by expert review and independent third-party reports such as upstream PRs. Crucially, we demonstrate that these problematic tasks severely distort capability assessments for agents and LLMs: filtering out the tasks with issues shifts model rankings and increases average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6%, respectively. We release the agentic tool and all task annotations to support the future development of frontier benchmarks.

[NLP-4] StakeBench: Evaluating Language Understanding Grounded in Market Commitment

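【Quick Read】: This paper addresses unreliable label sources in existing financial Natural Language Processing (NLP) benchmarks: most rely on annotations from outside observers and measure the perceived sentiment of language rather than the economic commitments market participants have actually made. The key is StakeBench, a language-understanding evaluation framework grounded in observable market behavior. It links 560,876 comments from 2,261 resolved markets to real positions, trading actions, and market-odds trajectories on Polymarket and Manifold, using market behavior as the supervision signal in place of human annotation. Four diagnostic tasks test whether models detect market commitment, identify the position side, anticipate future trading action, and perform collective odds projection, and three commitment-aware metrics measure alignment with verifiable revealed preferences rather than subjective sentiment. Large Language Models (LLMs) partially recover the position side (Directed Accuracy of 0.506–0.599) but fail structurally on higher-order tasks: most models collapse to a few labels when anticipating future action, and no model consistently beats the baseline in collective odds projection. Model scale is not correlated with performance, finance-domain tuning does not improve side identification, and platform incentives strongly shape higher-order results.

Link: https://arxiv.org/abs/2605.26074
Authors: Yunhua Pei, Jingyu Hu, Yiwei Shi, Hongnan Ma, Weiru Liu, John Cartlidge
Affiliations: University of Bristol
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); General Finance (q-fin.GN)
Comments: 21 pages, 2 figures, 20 tables. Preprint. Dataset and evaluation code included

Click to view abstract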

Abstract:Existing financial NLP benchmarks often rely on labels supplied by outside observers, measuring how language is perceived rather than what speakers have committed to in the market. We introduce StakeBench, an evaluation framework for language understanding grounded in market commitment. StakeBench links 560,876 comments from 2,261 resolved markets to verified position, action, and market-odds records across Polymarket and Manifold. Supervision is derived from observable market behavior. Position sides, post-comment trading actions, and market-odds trajectories replace human annotation. Four diagnostic tasks test whether models detect market commitment, identify the revealed side, anticipate future action, and perform collective odds projection. Three commitment-aware metrics measure alignment with revealed preferences rather than perceived sentiment. Validity audits and explicit interpretation boundaries help distinguish observable commitment signals from latent belief and causal market-odds impact. Across 15 LLMs and 18 topics and platform settings, models partially recover position-side signals, with Directed Accuracy from 0.506 to 0.599, but show structural failures on later tasks. Ten of the fifteen models collapse to one or two action labels in future action anticipation, and no model consistently improves on the naive odds-direction baseline in collective odds projection. Model scale is not correlated with performance, finance-domain tuning does not improve revealed-side identification, and platform incentives strongly shape higher-order results. StakeBench is packaged with evaluation code and dataset under CC-BY 4.0.

[NLP-5] WhoSaidIt: Human-LLM Collaborative Annotation for Text-Based Multilingual Speaker-Attribute Classification

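【Quick Read】: This paper addresses the ambiguity and inconsistency of annotating speaker attributes from text in multilingual settings, and in particular how to stabilize and improve annotation quality under resource constraints. The key is a human-large language model (LLM) collaborative re-annotation framework: through iterative interaction, LLMs surface recurring annotation rationales, and disagreement-focused sampling targets high-uncertainty samples for re-annotation, which improves label consistency with limited resources. The framework yields the WhoSaidIt dataset, quantifies the divergence between original and revised annotations, and reveals how LLMs perform at classification across languages and how sensitive they are to explicit rationales.

Link: https://arxiv.org/abs/2605.26070
Authors: Lingyu Gao, Will Monroe, David Smith, Meghan Jemison, Jackie Lee
Affiliations: Duolingo
Categories: Computation and Language (cs.CL)
Comments: 16 pages in total

Click to view abstract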

Abstract:Annotating speaker attributes from text is inherently ambiguous, particularly in multilingual settings where demographic and social cues are implicit and culturally variable. We propose a human-large language model (LLM) collaborative re-annotation framework for stabilizing multilingual speaker-attribute labels under practical resource constraints. Starting from a noisy corpus, we use LLMs to surface recurring annotation rationales through iterative interaction with experts, and apply disagreement-focused sampling for targeted re-annotation. Using this framework, we construct WhoSaidIt, a multilingual dataset covering nine speaker-attribute labels. We quantify divergence between original and revised annotations, benchmark recent LLMs, and analyze the effect of explicit rationales on model behavior. Our results reveal substantial cross-lingual differences in annotation decisions and demonstrate both the strengths and limitations of LLMs in speaker-attribute classification.

[NLP-6] Confidence and Calibration of Activation Oracles for Reliable Interpretation of Language Model Internals

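【Quick Read】: This paper addresses the lack of reliable uncertainty quantification (UQ) for the natural-language outputs of activation oracles, which makes their trustworthiness hard to assess. The key is a systematic comparison of six confidence estimation methods and an evaluation of how well calibrated each one is. Experiments show that bootstrap mode frequency is the best-calibrated method among those tested (ECE of 5.7% on Qwen3-8B and 10.3% on Qwen3.6-27B), while the log-prob baseline can serve as a low-cost, fast triage signal, offering practical guidance for deployment.

Link: https://arxiv.org/abs/2605.26045
Authors: Federico Torrielli, Peter Schneider-Kamp, Lukas Galke Poech
Affiliations: University of Turin; University of Southern Denmark
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract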

Abstract:Activation oracles aim to make the activations of other models legible to humans and yield promising results compared to white-box interpretability techniques. However, uncertainty quantification (UQ) for the natural-language outputs of such activation oracles is so far understudied. Here, we investigate 6 different methods for estimating the confidence of activation oracles and evaluate how well-calibrated their confidence scores are. Our experiments on 6,000 samples per oracle (varying verbalizer and context prompts) reveal that bootstrap mode frequency is the best-calibrated method among those tested (ECE 5.7% vs. 25.5% for the answer-word log-probability on Qwen3-8B; 10.3% vs. 13.1% on Qwen3.6-27B), and that the log-prob baseline can serve as a fast triage signal at a fraction of the cost. Code and the patched trainer are available at this https URL.
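
The two quantities named in this abstract can be illustrated with a short sketch. The ECE function follows the standard equal-width binned definition. The bootstrap function is only one plausible reading of "bootstrap mode frequency" (how often the full-sample mode is still the mode after resampling the oracle's answers); the paper's exact procedure is not given here, so treat that part, and all names, as assumptions.

```python
import random
from collections import Counter


def bootstrap_mode_frequency(answers, n_boot=200, seed=0):
    """Confidence = share of resamples whose mode equals the full-sample mode."""
    rng = random.Random(seed)
    mode = Counter(answers).most_common(1)[0][0]
    wins = 0
    for _ in range(n_boot):
        sample = [rng.choice(answers) for _ in answers]
        wins += Counter(sample).most_common(1)[0][0] == mode
    return mode, wins / n_boot


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - confidence| over bins."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        bins[min(n_bins - 1, int(c * n_bins))].append((c, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            conf = sum(c for c, _ in b) / len(b)
            acc = sum(ok for _, ok in b) / len(b)
            ece += len(b) / total * abs(acc - conf)
    return ece
```

A lower ECE means stated confidence tracks empirical accuracy more closely, which is the sense in which the abstract calls one method better calibrated than another.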

[NLP-7] Peak-Then-Collapse and the Four Interface Channels of Knowledge-Graph Tool Use

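【Quick Read】: This paper addresses a failure observed when training tool use with Reinforcement Learning with Verifiable Rewards (RLVR): on a deliberately minimal knowledge-graph API with limited interface feedback (Freebase navigation), performance follows a peak-then-collapse pattern that existing reward designs do not eliminate. The key finding is the mechanism behind it: the interface feedback carries no natural-language signal (unlike a Python traceback or web search results), so the model struggles to learn from and correct its errors. A direct oracle ablation further shows that relation selection is not the main bottleneck; retrieval-composition errors dominate (95.4%). As a mitigation, one-iteration self-distillation reaches 40.0% exact-match (EM) accuracy at 7B and is stable across the 7B to 14B range, suggesting that the ceiling is set by the interface rather than by model capacity or initialization.

Link: https://arxiv.org/abs/2605.26037
Authors: Tianda Sun, Dimitar Kazakov
Affiliations: University of York
Categories: Computation and Language (cs.CL)
Comments: 18 pages, 9 figures

Click to view abstract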

Abstract:We test the standard RLVR tool-use recipe – GRPO on Qwen2.5-7B-Instruct – on a deliberately minimal knowledge-graph tool API: four Freebase navigation verbs over Complex WebQuestions. Under a self-verifiable retrieval reward, the policy’s tool-grounded answer rate climbs from 3.8% to 9.6% over 250 steps, then collapses to 0% within a single 50-step window – a peak-then-collapse pattern replicated across four seeds. Across seven reward designs, we find four recurring failure modes: adding denser or more targeted proxy rewards shifts the failure mode rather than eliminating it. We argue that a key difference from Python interpreters, web search, and JSON APIs is interface feedback: their failures often leak natural-language signal the model saw in pretraining. A Python traceback names the failing line; an empty Freebase result `[]` does not. Stripping away that surface exposes a degradation regime that same-family reward redesigns do not fix. A direct oracle ablation rules out relation selection: injecting gold relations at every retrieval call lifts exact-match accuracy by only +0.20 pp, and 95.4% of retrieval-dependent errors are retrieval-composition failures rather than answer-extraction failures. As a mitigation, one-iteration self-distillation reaches 40.0% EM at 7B and is capacity-invariant: doubling capacity to 14B improves EM by only 0.25 pp, and initialization barely matters – the ceiling appears interface-bound within the 7B–14B range tested.

[NLP-8] CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

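【Quick Read】: This paper addresses how to evaluate the causal understanding of LLM agents in interactive causal discovery, not just their predictive accuracy. Conventional evaluations check whether a model predicts the right outcome but not whether it has identified the correct causal mechanism (the causal graph and structural equations). The key is CausaLab, a scalable simulated environment in which each episode places the agent in a synthetic laboratory, requires it to infer a hidden structural causal model (SCM) from observations and interventions, and records its evolving causal hypothesis in a domain-specific language, so that trajectories are inspectable and comparable with ground truth. Experiments show a persistent gap between prediction and mechanism recovery even at high task accuracy, and agents still struggle to design informative interventions. The paper also identifies premature stopping as a major weakness and shows that asking the model to verify the consistency between its hypothesis and past data helps mitigate it, separating predictive success from causal understanding.

Link: https://arxiv.org/abs/2605.26029
Authors: Junlin Yang,Dylan Zhang,Xiangchen Song,Qirun Dai,Xiao Liu,Yuen Chen,Aniket Vashishtha,Jing Shi,Chenhao Tan,Hao Peng
Affiliations: Tsinghua University; University of Illinois Urbana-Champaign; Carnegie Mellon University; University of Chicago; Adobe
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract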

Abstract:We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab evaluates both whether an agent can solve a problem using causal evidence and whether its answer is supported by a correct hypothesis about the underlying causal mechanism. Each episode places an agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both a causal graph and structural equations rather than recalling prior knowledge. CausaLab also includes a domain-specific language that records the agent’s evolving SCM hypothesis, making trajectories inspectable and comparable with ground truth. Experiments show a persistent gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but only 0.471 all-edge F1. This observation further motivates our exploration of different interaction strategies. Mixed observation–intervention strategies improve structural fidelity: in the mixed 6-node setting, GPT-5.2-high achieves 80% on both task accuracy and all-edge F1. Yet even strong agents struggle to design informative interventions, as pure intervention strategies perform poorly on both task accuracy and all-edge F1. We identify premature stopping as a major weakness of agents, and show that asking the model to verify the consistency between its hypothesis and past data can help mitigate this issue. CausaLab therefore separates predictive success from causal understanding and exposes current LLM agents’ limits as experimental causal reasoners.
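
To make the gap between task accuracy and mechanism recovery concrete, here is a minimal sketch of an edge-level F1 between a hypothesized causal graph and the ground truth. The abstract does not spell out how "all-edge F1" is computed, so treating it as F1 over sets of directed edges is an assumption, and `edge_f1` is a hypothetical helper.

```python
def edge_f1(pred_edges, true_edges):
    """F1 over directed edges, each given as a (cause, effect) pair.

    Assumption: an edge counts as correct only if both endpoints and
    the direction match the ground-truth graph.
    """
    pred, true = set(pred_edges), set(true_edges)
    tp = len(pred & true)  # correctly recovered edges
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(true)
    return 2 * precision * recall / (precision + recall)


# Ground truth has three edges; the hypothesis gets one right and reverses another.
true_graph = [("A", "B"), ("B", "C"), ("A", "C")]
hypothesis = [("A", "B"), ("C", "B")]
score = edge_f1(hypothesis, true_graph)  # precision 1/2, recall 1/3
```

An agent can predict the held-out quantity correctly from a partly wrong graph, which is why a structural score of this kind can stay low while task accuracy is high.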

[NLP-9] Retrieval-Augmented Detection of Potentially Abusive Clauses in Chilean Terms of Service

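【Quick Read】: This paper addresses the legal risk created by online Terms of Service in Chile, which often function as contracts of adhesion whose asymmetries may expose consumers to potentially abusive clauses. Assessing such clauses is difficult because some provisions clearly violate mandatory consumer law, whereas others must be judged against broader standards such as good faith and contractual imbalance.

The key to the solution is a retrieval-augmented generation (RAG) framework for the automated detection and classification of potentially abusive clauses. Designed for local execution, it combines efficient clause detection, hybrid dense–sparse retrieval, reranking, and prompt augmentation to support medium-sized open-weight language models. The authors also build the Chilean Abusive Terms of Service Extended corpus, comprising 100 contracts and 10,029 annotated clauses in 24 legally grounded categories spanning illegal, dark, and gray clauses. Experiments show that retrieval-augmented prompting substantially improves performance and lets local models approach larger cloud-based systems at lower computational and token cost. The study also contributes a refined legal annotation scheme and a practical design for AI-assisted consumer contract review.

Link: https://arxiv.org/abs/2605.26019
Authors: Christoffer Loeffler,Tomás Rey Pizarro,Daniel Ignacio Miranda Vásquez,Andrea Martínez Freile
Affiliations: Pontificia Universidad Católica de Valparaíso; Universidad Adolfo Ibáñez
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 42 pages, 6 figures, 9 tables

Click to view abstract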

Abstract:Online Terms of Service often function as contracts of adhesion, creating asymmetries that may expose consumers to potentially abusive clauses. In Chile, assessing such clauses is legally challenging because some provisions clearly violate mandatory consumer law, whereas others depend on broader standards such as good faith and contractual imbalance. We present a retrieval-augmented generation framework for the automated detection and classification of potentially abusive clauses in Chilean Terms of Service. Designed for local execution, it combines efficient clause detection, hybrid dense–sparse retrieval, reranking, and prompt augmentation to support medium-sized open-weight language models. We also introduce the Chilean Abusive Terms of Service Extended corpus, comprising 100 contracts and 10,029 annotated clauses in 24 legally grounded categories spanning illegal, dark, and gray clauses. Experiments comparing commercial and open-weight language models, fine-tuned encoders, and traditional baselines show that retrieval-augmented prompting substantially improves performance and enables local models to approach larger cloud-based systems at lower computational and token cost. The study also contributes a refined legal annotation scheme and a practical design for AI-assisted consumer contract review.

[NLP-10] STORM: Internalized Modeling for Spatial-Temporal Reasoning in Video-Language Models

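【Quick Read】: This paper targets the high inference latency, engineering complexity, and serialization of visual evidence into text (or repeated re-encoding of frames) that arise when video reasoning methods built on large vision-language models (LVLMs) externalize reasoning through textual chain-of-thought (CoT), keyframe selection, or external tool calls. The key is STORMS (Spatial-Temporal reasOning via inteRnalized Modeling), a two-stage framework. Stage I aligns latent tokens with thought-video representations derived from generated videos, grounding the latent states in dynamic visual evidence. Stage II trains with answer-only supervision so that the reasoning process is internalized without step-by-step annotations. At inference the model performs only a bounded latent rollout, without regenerating videos, reinserting frames, or invoking external visual tools, which improves video reasoning accuracy while substantially reducing inference overhead.

Link: https://arxiv.org/abs/2605.26014
Authors: Yiming Liang,Yixiao Chen,Yiyang Zhou,Yixuan Wang,Shoubin Yu,Andong Deng,Fuxiao Liu,Qin Zhang,Chen Chen,Mohit Bansal,Huaxiu Yao
Affiliations: Purdue; Harvard; UNC; UCF; NVIDIA; Physion Labs
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Click to view abstract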

Abstract:Many video reasoning tasks require tracking motion, temporal order, and evolving visual states across frames. Existing methods built on large vision-language models (LVLMs) often address this challenge by externalizing reasoning through textual chain-of-thought (CoT), keyframe selection, repeated frame reinsertion, or external tool use. While effective, such pipelines increase inference-time latency and engineering complexity, and they force temporal-visual evidence to be serialized into text or repeatedly re-encoded from frames. Inspired by the intuition that visual reasoning can occur implicitly before verbalization, we propose STORMS (Spatial-Temporal reasOning via inteRnalized Modeling), a two-stage framework that teaches LVLMs to reason through bounded continuous latent trajectories instead of explicit textual CoT. In Stage I, STORMS aligns latent tokens with thought-video representations derived from generated videos, grounding the latent states in dynamic visual evidence. In Stage II, the model is further trained with answer-only supervision, encouraging the reasoning process to be internalized without step-by-step annotations. Generated thought videos are used only during training; at inference, STORMS performs a bounded latent rollout without regenerating videos, reinserting frames, or invoking external visual tools. Experiments on VideoMME, MVBench, TempCompass, and MMVU show that STORMS improves video reasoning accuracy while substantially reducing inference overhead compared with tool or video-generation-based reasoning pipelines.

[NLP-11] Forgotten Words: Benchmarking NeoBERT for Dementia Detection in Low-Resource Conversational Filipino and English Speech ACL2026

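【Quick Read】: This paper addresses NLP-based dementia detection outside English, specifically for the Filipino-English code-switching that is pervasive in the Philippines, where no prior work on NLP-based dementia detection exists. The key is a parallel bilingual dataset of 4,000 DementiaBank-derived transcripts, with Filipino translations produced manually to preserve discourse-level markers of cognitive decline, which enables a systematic evaluation of transformer models in a multilingual clinical setting. Experiments show that architectural modernization alone does not improve cross-lingual robustness, whereas bilingual fine-tuning eliminates cross-lingual degradation across all transformer models, converging to Macro-F1 = 0.969-0.973. This suggests that multilingual clinical NLP performance is driven primarily by linguistic coverage during training rather than by model scale or architecture.

Link: https://arxiv.org/abs/2605.26007
Authors: Rez Samantha Z. Floresca,Edric Castel C. Hao,Hannah Grachiella Buñales,Chelsea Dominique E. Temprosa,Georgianna Z. Reyes,Kervin Gabriel L. Chua
Affiliations: Ateneo de Manila Senior High School; Analog Devices, Inc.
Categories: Computation and Language (cs.CL)
Comments: Accepted to BioNLP Workshop @ ACL 2026

Click to view abstract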

Abstract:Dementia detection from spontaneous speech offers a scalable approach to cognitive screening, yet NLP systems remain predominantly English-centric. This limitation is especially acute in the Philippines, where Filipino-English code-switching is pervasive and no prior work has addressed NLP-based dementia detection. We present the first systematic evaluation of transformer-based dementia detection in Filipino speech and the first assessment of NeoBERT in a clinical NLP setting. To separate language from domain effects, we construct a parallel bilingual dataset of 4,000 DementiaBank-derived transcripts, with Filipino translations produced manually to preserve discourse-level markers of cognitive decline. We evaluate five model families, TF-IDF + LogReg, BERT, NeoBERT, XLM-R, and RoBERTa-Tagalog, under monolingual, zero-shot cross-lingual, and bilingual fine-tuning settings. We find that in-domain performance does not transfer across languages, with English-trained BERT dropping to Macro-F1 = 0.455 on Filipino, and that architectural modernization alone does not improve robustness. Bilingual fine-tuning, however, eliminates cross-lingual degradation across all transformer models, converging to Macro-F1 = 0.969-0.973. These results suggest that multilingual clinical NLP performance is driven primarily by linguistic coverage during training rather than model scale or architecture.

[NLP-12] MAGIC: Multimodal Alignment Grounding-aware Instruction Coreset for Vision-Language Models

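【Quick Read】: This paper addresses inefficient subset selection for instruction tuning of large vision-language models (VLMs), where training data is redundant, often weakly dependent on the visual input, and unevenly covers multimodal reasoning behaviors. Uniform subsampling or naive score-based selection tends to yield suboptimal subsets. The key is MAGIC, a training-free, forward-only coreset selection method built on three intrinsic signals: Multimodal Gain, which measures the likelihood improvement obtained from visual input; Bridging Relevance, which captures how sharply answer tokens are grounded in visual tokens; and Skill-Neuron Signatures, which characterize the functional computation each sample elicits through its top-activated feed-forward neurons. MAGIC runs a three-stage pipeline: it filters low-gain examples, ranks candidates by a normalized quality objective, and allocates the budget bucket-wise over discrete neuron signatures. This avoids backpropagation, auxiliary selector training, and clustering in continuous activation spaces while preserving latent multimodal skill coverage. Under a matched 20% budget it reaches 100.3% relative performance to full fine-tuning on LLaVA-665K and 101.6% on Vision-Flan-186K, with a 73.7% reduction in wall-clock run time.

Link: https://arxiv.org/abs/2605.26004
Authors: Shristi Das Biswas,Kaushik Roy
Affiliations: Purdue University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Click to view abstract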

Abstract:Instruction tuning of large vision-language models (LVLMs) increasingly depends on massive multimodal corpora, yet these datasets contain samples with substantial redundancy, low visual dependency, and highly imbalanced coverage of multimodal reasoning behaviors. As a result, uniform subsampling or naive score-based selection often yields suboptimal training subsets. We introduce MAGIC, a training-free, forward-only coreset selection method designed to construct compact yet behaviorally faithful subsets for multimodal instruction tuning. MAGIC is built on three intrinsic signals extracted from a pretrained VLM: Multimodal Gain, which measures the likelihood improvement obtained from visual input; Bridging Relevance, which captures the sharpness of answer-token grounding over visual tokens; and Skill-Neuron Signatures, which characterize the functional computation elicited by each sample via top-activated feed-forward neurons. MAGIC combines these signals in a three-stage pipeline: filtering low-gain examples, ranking candidates by a normalized quality objective, and performing bucket-wise budget allocation over discrete neuron signatures to preserve latent multimodal skill coverage. This formulation avoids backpropagation, auxiliary selector training, and expensive clustering in continuous activation spaces, while remaining efficient and easily deployable in existing VLMs. Across LLaVA-665K and Vision-Flan datasets, and transfer settings to large target models, LLaVA-1.5-7B and -13B, MAGIC consistently improves over strong baselines under matched 20% budgets: it achieves 100.3% relative performance to full finetuning on LLaVA-665K and 101.6% relative performance on Vision-Flan-186K, while yielding a 73.7% reduction in wall-clock run time.

[NLP-13] AI-Assisted Systematization for Evaluating GenAI Systems

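【Quick Read】: This paper addresses the difficulty of measuring and interpreting evaluations of generative AI (GenAI) systems when the target concepts, such as "reasoning", "fairness", or "creativity", are broad, contested, and underspecified. The key is the intermediate step of systematization: turning a broad background concept into an explicit, structured account in measurable terms. The authors introduce a structured representation of a systematized concept (a concept spec) together with a validation worksheet, and develop two AI-assisted systematizers: a direct zero-shot approach and a multi-agent approach that mirrors manual systematization workflows. They apply both to two concepts, hate-based rhetoric and digital empathy, and evaluate the resulting concept specs on content validity and information recoverability.

Link: https://arxiv.org/abs/2605.26001
Authors: Dhruv Agarwal,Emily Sheng,Chad Atalla,Jean Garcia-Gathright,Hussein Mozannar,Hannah Washington,Alexandra Chouldechova,Solon Barocas,Hanna Wallach
Affiliations: Cornell University; Microsoft Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Click to view abstract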

Abstract:Evaluating generative AI (GenAI) systems is challenging because many targets of evaluation are broad, contested concepts, such as “reasoning,” “fairness,” or “creativity.” When these concepts are left underspecified, it becomes unclear what should be measured or how evaluation results should be interpreted. This problem reflects a missing step: systematization, that is, moving from a broad background concept to an explicit, structured account of the concept in measurable terms. To help address the fact that systematization is cognitively demanding and resource-intensive, we investigate whether AI assistance can support this process. To enable AI-assisted systematization and assess its quality, we introduce a structured representation of a systematized concept, a concept spec, and a validation worksheet. We then develop two AI-assisted systematizers: a direct, zero-shot approach and a multi-agent approach that more closely mirrors manual systematization approaches from existing literature. We use these systematizers to produce concept specs for two concepts – hate-based rhetoric and digital empathy – and evaluate resulting concept specs on content validity and information recoverability.

[NLP-14] What Makes a Medical Checker Trainable? Diagnosing Signal Collapse and Reward Hacking in Checker-Guided RAG for Biomedical QA

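【Quick Read】: This paper examines when a claim-level natural language inference (NLI) checker actually helps train a medical retrieval-augmented generation (RAG) agent to produce evidence-grounded claims. The key finding is that the checker's output distribution during training, not its held-out accuracy, decides whether it provides a usable gradient signal for reinforcement learning (RL). If that distribution is degenerate, as with LLM log-probability scoring that labels over 97% of claims neutral, the RL gradient collapses to zero; a calibrated MedNLI classifier avoids this. Moderate signal also beats strong signal on answer quality, because a strong checker triggers a reward-hacking cascade (ultra-short answers, search avoidance, language collapse). Finally, signal strength is policy-dependent: the same checker registers as moderate on one policy and strong on another, without reaching the cascade end-state. The authors frame these findings as boundary conditions for verifier-as-reward systems.

Link: https://arxiv.org/abs/2605.25988
Authors: Yuelyu Ji,Min Gu Kwak,Hang Zhang,Xizhi Wu,Chenyu Li,Yanshan Wan
Affiliations: University of Pittsburgh
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract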

Abstract:Medical RAG needs evidence-grounded claims, so plugging a claim-level NLI checker into retrieval-augmented RL is intuitive. We find that the checker’s output distribution during training, not its held-out accuracy, decides whether it provides trainable gradient. We compare four NLI checker back-ends as process rewards inside a GRPO-trained medical RAG agent (Qwen2.5-7B, replicated on Qwen3-4B and Llama-3.1-8B) across four held-out medical QA benchmarks. Three diagnostic findings emerge. (i) Signal collapse is log-prob-specific: LLM log-probability scoring labels over 97% of claims neutral – collapsing the RL gradient to zero – while a calibrated MedNLI classifier scores the same pairs non-degenerately. (ii) Moderate signal beats strong signal on answer quality: a strong proprietary checker triggers a three-step reward-hacking cascade – ultra-short answers, search avoidance, language collapse – so a moderate-signal local classifier trains a higher-quality model (+12% BERTScore over zero-shot, no GPT dependency). (iii) Signal strength is policy-dependent: the same checker registers as moderate on one policy but strong on another without triggering the cascade end-state. We frame these as boundary conditions for verifier-as-reward systems.
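
The signal-collapse finding can be illustrated with a small sketch of a group-relative advantage of the kind GRPO uses: each sampled response's reward is normalized against the mean and standard deviation of its group. When a checker gives every response the same score (for example, labeling every claim neutral), all advantages are zero and the policy receives no gradient. The function below is a simplified illustration under that assumption, not the paper's training code; the population standard deviation and the `eps` value are illustrative choices.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its sampling group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when all rewards are identical.
    return [(r - mu) / (sigma + eps) for r in rewards]


# A degenerate checker scores every response the same: no learning signal.
collapsed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# A checker that separates responses yields non-zero advantages.
informative = group_relative_advantages([0.0, 1.0])
```

This is why the distribution of checker outputs during training matters more than held-out accuracy: a checker that is usually right but almost always returns the same label gives the optimizer nothing to rank.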

[NLP-15] SafeCtrl-RL: Inference-Time Adaptive Behaviour Control for LLM Dialogue via RL-Driven Prompt Optimisation

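【Quick Read】: This paper addresses the difficulty of keeping Large Language Models (LLMs) safe and contextually appropriate in real-world deployment. The key is SafeCtrl-RL, an inference-time behavioural control framework that requires no retraining or parameter modification. It formulates dialogue generation as a sequential decision process in which a reinforcement learning agent dynamically selects prompt adjustment strategies based on contextual feedback, so unsafe behaviours are suppressed through iterative refinement, which the authors call inference-time behavioural unlearning. Across multiple LLMs and unsafe dialogue scenarios, SafeCtrl-RL consistently improves safety and response quality, outperforms existing prompt-based optimisation methods, and achieves favourable performance–efficiency trade-offs.

Link: https://arxiv.org/abs/2605.25984
Authors: Michael Orme,Yanchao Yu,Zhiyuan Tan
Affiliations: Edinburgh Napier University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract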

Abstract:Ensuring safe and contextually appropriate behaviour in Large Language Models (LLMs) remains a critical challenge for real-world deployment. We present SafeCtrl-RL, an inference-time behavioural control framework that enables adaptive safety regulation without model retraining or parameter modification. The method formulates dialogue generation as a sequential decision process, where a reinforcement learning agent dynamically selects prompt adjustment strategies based on contextual feedback. This allows unsafe behaviours to be suppressed through iterative refinement, which we conceptualise as inference-time behavioural unlearning. Evaluated across multiple LLMs and unsafe dialogue scenarios, SafeCtrl-RL consistently improves safety and response quality, outperforms existing prompt-based optimisation methods, and achieves favourable performance–efficiency trade-offs. Warning: This paper may contain examples of harmful language, and reader discretion is recommended.

[NLP-16] When Do LLM Agents Treat Surface Noise Differently from Semantic Noise? A 68-Cell Measurement Study with a Held-Out Trace-Level Validation

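【Quick Read】: This paper asks why, in chain-of-thought and ReAct agents driven by large language models, meaning-bearing perturbations (such as paraphrase or synonym substitution) change final answers more often than presentation perturbations (such as formatting or reordering) of comparable severity. The key is a large-scale measurement study covering 68 cells, 1,530 original problems, and about 11,150 variants: after severity matching, the inconsistency gap averages +19.69 percentage points (p < 0.0001), and it survives four severity-proxy audits, although several stricter stress tests fail and are reported as such. Trace-level probes on held-out trajectories support a "stealth-divergence" picture: semantic perturbations often preserve the first action but induce divergence in intermediate reasoning from later steps onward, accompanied by slightly deeper trajectories.

Link: https://arxiv.org/abs/2605.25981
Authors: Liyun Zhang,Jiayi Guo
Affiliations: UESTC; UC San Diego
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract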

Abstract:We document an empirical phenomenon in chain-of-thought and ReAct agents driven by ten large language models from seven architecture families: meaning-bearing perturbations (e.g., paraphrase, synonym) alter final answers more often than presentation perturbations (e.g., formatting, reordering) of comparable severity. Across 68 cells spanning GSM8K, MATH, and HotpotQA (1,530 originals and ~11,150 variants), the inconsistency gap averages +19.69 pp after severity matching (paired t = 9.58, p < 0.0001), with 64/68 cells positive. The gap survives four severity-proxy audits and remains significant when excluding qwen models (+11.10 pp, p < 0.0001). Several stress tests fail honestly: cluster-bootstrap significance disappears under stricter assumptions, tractability contrasts do not replicate, cross-architecture generator swaps break per-cell rankings, and a second LLM judge yields only moderate agreement (κ = 0.50). We then validate the headline effect on a fully held-out 11th model (qwen2.5-14B-Instruct; 1,800 trajectories) and re-test a pre-registered capability × tractability partition, observing a small but positive held-out effect (3/4 cells positive; pooled Welch t = 3.81, p = 9.6 × 10^-4). Using held-out trajectories, we probe four trace-level mechanism signals. Two prior mechanism claims fail to replicate and are explicitly retracted. Two new probes instead support a stealth-divergence picture: semantic perturbations often preserve the first action but induce divergence in intermediate reasoning from later steps onward, accompanied by slightly deeper trajectories. We position this as a measurement contribution with held-out replication and a partial trace-level account of how semantic perturbations propagate through agent reasoning. Code, perturbation corpus, raw trajectories, and analysis scripts are released anonymously for review.

[NLP-17] Creative Quality Alignment: Expert Tacit Knowledge Transfer via Chain-of-Thought Fine-Tuning

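【Quick Read】: This paper addresses the engineering realization of "creative quality" in generative AI, testing whether the mathematical claim made by the Calibrated Surprise theory holds in a practical system. The core challenge is achieving high-quality creative output with low data cost and a small base model. The key is Creative Quality Alignment (CQA): training on approximately 100 expert chain-of-thought (CoT) annotations under deliberately strict engineering conditions. The authors also offer a theoretical observation: in an LLM with a single conditional distribution architecture, calibrating the appreciation side automatically transfers to the generation side via architectural duality. They present this as the structural reason why so few examples suffice, rather than a purely empirical observation.

Link: https://arxiv.org/abs/2605.25977
Authors: Bo Zou,Chao Xu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract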

Abstract:This paper provides an empirical implementation of the creative quality metric proposed in Calibrated Surprise (Zou & Xu, 2026a). The question this paper addresses is: does this mathematical claim hold at the engineering level? To make the answer as general as possible, we deliberately choose the strictest engineering conditions: low data cost and a small base model. Training data comes from approximately 100 expert chain-of-thought (CoT) annotations produced by the BC Protocol (Zou & Xu, 2026b). We also identify a data bias: most publicly available alignment datasets are skewed toward craft-related knowledge, while audience modeling and reality-logic coverage are systematically weak. We use the term Creative Quality Alignment (CQA) to describe this class of engineering methods. We also offer a supporting theoretical observation: in an LLM with a single conditional distribution architecture, calibrating the appreciation side automatically transfers to the generation side via architectural duality. This is the structural reason why ~100 CoT examples are sufficient – not a purely empirical observation like LIMA (Zhou et al., 2023).

[NLP-18] Triplet-Block Diffusion RWKV

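【Quick Read】: This paper addresses two weaknesses of causal Transformer language models at inference time: strictly sequential decoding and a quadratic per-step attention cost. Linear-time causal models and discrete diffusion models each mitigate one of these, but combining them is inherently inconsistent, because diffusion requires bidirectional attention while causal models are unidirectional. The key is B³D-RWKV, a diffusion RWKV variant that uses a triplet-block layout to combine the model's O(L) inference efficiency with parallel, bidirectional discrete diffusion. B³D-RWKV-7.2B reaches accuracy comparable to existing models on an 8-task suite while delivering an average 1.6× speedup in decoding throughput over baselines.

Link: https://arxiv.org/abs/2605.25969
Authors: Ke Lin,Yiyang Luo,Zhaolong Su,Yunya Song,Anyi Rao
Affiliations: William Mary; HKUST; Cornell
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract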

Abstract:Causal Transformer language models suffer from strictly sequential decoding and a quadratic per-step attention cost. While linear-time causal models and discrete diffusion models each address these weaknesses, their integration remains inherently inconsistent: diffusion requires bidirectional attention, while causal models are unidirectional. To unify these architectures, we propose B³D-RWKV, a diffusion RWKV variant that integrates the model’s O(L) inference efficiency with parallel, bidirectional discrete-diffusion through a triplet-block layout method. B³D-RWKV-7.2B reaches comparable accuracy on an 8-task suite versus existing models while significantly outperforming baselines in decoding throughput with an average of 1.6× speedup.

[NLP-19] Mapping the Schedule x Bit-Width Boundary in Sub-100M Quantisation-Aware Training

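【Quick Read】: This paper tests whether the optimal learning-rate schedule depends on bit-width during quantisation-aware training (QAT) of decoder language models below 100M parameters. At FP16, INT8, and INT6 the optimal warmdown fraction is 33% in every (bit-width, size) cell, which falsifies the hypothesis that INT6 QAT needs a different schedule from higher-precision training. Two large factorial studies (Phase 2 and Phase 5) establish the robustness of this null result and reveal a boundary for INT4: at 50M and above, wd33 is decisively optimal, while below 50M no tested size shows a statistically significant schedule preference and differences sit within seed-level noise. A weight-to-grid-distance probe further rules out rapid grid-snapping as the explanation for the null result. The practical recommendation: at sub-100M scale, tune the LR schedule once at FP16 and apply it unchanged to INT8/INT6 QAT; for INT4 at 50M+ use wd33; for INT4 below 50M the schedule choice is in the noise.

Link: https://arxiv.org/abs/2605.25966
Authors: Christian Brandt Thomassen
Affiliations: Dwarf A/S (Copenhagen, Denmark)
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 20 pages, 6 figures, 4 tables. 1345 training runs total (720 + 625). Submitted for review at TMLR

Click to view abstract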

Abstract:We test whether the optimal learning-rate schedule depends on bit-width during from-initialisation quantisation-aware training (QAT) for sub-100M decoder language models. A 720-run factorial grid (Phase 2) over bit-width x warmdown fraction x LR magnitude x model size x seed (FP16/INT8/INT6, 15M-100M, 5 seeds) finds the optimal warmdown is 33% at every (bit-width, size) cell. The primary hypothesis – that INT6 QAT requires a different schedule than higher-precision training – is falsified at FP16/INT8/INT6. A 625-run follow-up (Phase 5) probes the null along five axes: optimiser (AdamW), schedule shape (cosine), training length (up to 9x more iterations), an extended size sweep (5M-350M), and an INT4 sweep from 3M to 100M. The null is robust under all three setup changes. The INT6 penalty follows a log-linear scaling law whose fit on Phase 2 predicts the five held-out Phase 5 sizes (5M, 8M, 175M, 250M, 350M) within their 95% prediction intervals (5/5). For INT4 the picture is sharper than the higher precisions: at 50M and 100M, wd33 is decisively optimal (paired z ~ 12-15, 10/10 seeds); below 50M, across the six tested sizes from 3M to 30M, no individual size shows a statistically significant schedule preference and the per-size mean penalty oscillates within seed-level noise. The boundary is therefore a transition between a noise-dominated regime below 50M and a decisive wd33 regime at and above 50M, not a clean wd10 region. A weight-to-grid-distance probe falsifies the simplest mechanism for the FP16/INT8/INT6 null result (rapid grid-snapping): pre-warmdown, INT6-QAT weights sit at essentially the same distance from the INT6 grid as FP16 weights (ratio ~ 1.04). Practical recommendation: at sub-100M scale, tune the LR schedule once at FP16 and apply unchanged to INT8/INT6 QAT; for INT4 at 50M+ use wd33; for INT4 below 50M the schedule choice is in the noise.
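
For readers unfamiliar with the term, a warmdown fraction of 33% (wd33) means the learning rate is held at its peak and then decayed over the final 33% of training. The sketch below assumes a linear decay to zero and no warmup phase; the abstract does not state the exact schedule shape used in the main grid, so both are assumptions, and `lr_at_step` is a hypothetical helper.

```python
def lr_at_step(step, total_steps, peak_lr, warmdown_frac=0.33):
    """Constant learning rate followed by a linear warmdown to zero.

    Assumptions: no warmup, linear decay, and the warmdown covers the
    last `warmdown_frac` of `total_steps`.
    """
    warmdown_steps = int(total_steps * warmdown_frac)
    if warmdown_steps == 0:
        return peak_lr  # no warmdown requested
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return peak_lr
    remaining = total_steps - step
    return peak_lr * remaining / warmdown_steps


# With a 50% warmdown over 100 steps the rate is flat until step 50,
# then falls linearly to zero at step 100.
schedule = [lr_at_step(s, 100, 1.0, warmdown_frac=0.5) for s in (0, 50, 75, 100)]
```

Under the paper's recommendation, a schedule like this would be tuned once at FP16 and reused unchanged for INT8 and INT6 QAT.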

[NLP-20] PolyGnosis 2.0: Enhancing LLM Reasoning via Agentic Harness Engineering for Polymarket and OSINT Insight Extraction

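【Quick Read】: This paper addresses how to extract high-alpha signals from a high-noise prediction-market environment. The core idea is to identify and exploit "Perspective Mismatches", the narrative divergence between Polymarket sentiment and global open-source intelligence (OSINT) streams such as GDELT. The solution is PolyGnosis 2.0, a multi-agent architecture, together with a systematic evaluation of "Harness Engineering" techniques (reflection loops, tool-calling, divide-and-conquer partitioning (DC), and chain-of-thought (CoT)) in the financial domain. The study finds that structural partitioning is necessary for multi-dimensional alignment, whereas unconstrained terminal reflection induces logical drift, and that a pervasive "consensus bias" appears across all agent configurations, which calls for deterministic validation. The paper isolates a Pareto-optimal configuration that keeps professional-grade analytical precision while minimizing latency and token overhead.

Link: https://arxiv.org/abs/2605.25958
Authors: Daren Wang,Hong Xu,Jiawen Xian
Affiliations: The Chinese University of Hong Kong; Evolution AI Lab
Categories: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE)
Comments:

Click to view abstract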

Abstract:This paper introduces PolyGnosis 2.0, a pioneering multi-agent architecture designed to extract predictive intelligence by synthesizing Polymarket anomaly signals with global Open Source Intelligence (OSINT) streams, specifically Global Database of Events, Language, and Tone (GDELT). We define and target “Perspective Mismatches”, the narrative divergence between Polymarket sentiment and global media flows, as high-alpha trading signals. Moving beyond generic agentic superiority, we rigorously quantify the efficacy of “Harness Engineering” techniques, including reflection loops, tool-calling, divide-and-conquer partitioning (DC), and chain-of-thought (CoT), within high-noise financial domains. Our empirical evaluation against human-expert benchmarks reveals that while structural partitioning is mandatory for multi-dimensional alignment, unconstrained terminal reflection actively induces logical drift. Furthermore, we identify a pervasive “consensus bias” across all agent configurations during narrative reasoning, necessitating deterministic validation. Ultimately, we isolate a Pareto-optimal configuration that achieves professional-grade analytical precision while minimizing latency and token overhead, providing a robust blueprint for autonomous intelligence in prediction markets.

[NLP-21] QUIET: A Multi-Blank Cascaded Story Cloze Benchmark for LLM Creative Generation Capability

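【Quick Read】: This paper addresses two problems in evaluating the creative capability of large language models (LLMs). First, existing benchmarks such as Story Cloze Test and HellaSwag measure discriminative ability over narrative continuations rather than creative generation itself. Second, rubric-based scoring and LLM-as-Judge methods rely on subjective dimension assessments or natural-language model outputs and offer no objective, automated scoring. The key is QUIET (Quality Understanding via Interlocked Evaluation Testing), a diagnostic benchmark built on multi-blank cascaded story cloze: a story contains 10–20 blanks, each with an explicit content constraint and with cascade dependencies between blanks, and the model fills all blanks in open-ended generation mode. An information-theoretic automated protocol then scores the result, directly operationalizing the "calibrated surprise" framework. Each blank's score combines "satisfy" (how well the filling meets the content constraint, judged by objective logical reasoning) and "surprise" (the degree of surprise given that the constraint is satisfied) as score = satisfy × (1 + λ × surprise), with λ = 1.0. This separates answers that merely comply from answers that comply and are also novel.

Link: https://arxiv.org/abs/2605.25955
Authors: Bo Zou,Chao Xu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract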

Abstract:Large language models (LLMs) face a dual challenge in creative capability evaluation: existing benchmarks (e.g., Story Cloze Test, HellaSwag) measure models’ discriminative ability over narrative continuation using multiple-choice recognition paradigms, rather than directly measuring creative generation capability; rubric-based scoring and LLM-as-Judge methods rely on subjective dimension assessment or natural language model outputs, and cannot provide objective, automated scoring mechanisms. This paper proposes QUIET (Quality Understanding via Interlocked Evaluation Testing), a diagnostic benchmark for LLM creative capability based on multi-blank cascaded story cloze. QUIET sets N blanks (10-20) in a story with complete structure, with each blank accompanied by an explicit content constraint, and cascade dependency relationships between blanks – the content filled into earlier blanks constrains the feasible solution space for later blanks. The evaluated model (or human participants) fills all blanks in open-ended generation mode; the results are scored by an information-theoretic automated scoring protocol without human grading. The scoring protocol directly operationalizes the “calibrated surprise” theoretical framework (Zou & Xu, 2026a). For each blank k, a composite score is computed: score = satisfy * (1 + lambda * surprise), where lambda = 1.0. Here, “satisfy” measures how well the blank filling satisfies the content constraint (objective logical reasoning judgment, not subjective aesthetic scoring), and “surprise” measures the degree of surprise given that the constraint is satisfied. Creative answers that do not satisfy the constraint score zero; answers that satisfy the constraint but are mediocre score low; answers that satisfy the constraint and are surprising score high.
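
The per-blank scoring rule stated in the abstract is simple enough to write out directly. The sketch below takes `satisfy` and `surprise` as given numbers; how the benchmark actually computes them is not described in the abstract, so the assumption that both lie in [0, 1] and the function name `blank_score` are illustrative only.

```python
def blank_score(satisfy, surprise, lam=1.0):
    """Composite score for one blank: satisfy * (1 + lambda * surprise).

    Assumption: satisfy and surprise are both scaled to [0, 1].
    """
    return satisfy * (1.0 + lam * surprise)


# A surprising answer that violates the constraint scores zero.
violates = blank_score(satisfy=0.0, surprise=0.9)

# A compliant but unsurprising answer gets the base score only.
mediocre = blank_score(satisfy=1.0, surprise=0.0)

# A compliant and surprising answer is rewarded on top of the base score.
creative = blank_score(satisfy=1.0, surprise=0.5)
```

Because `satisfy` multiplies the whole expression, surprise can only raise the score of an answer that already meets its constraint, which matches the three cases the abstract lists.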

[NLP-22] haka at KSAA-2026 Task 2: Regularized Fine-Tuning for Arabic Speech Diacritization LREC2026

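[Quick Read]: This paper tackles the generation of fully diacritized Arabic text from speech audio and undiacritized transcripts when only 2,327 training samples are available and no external data is permitted. The key to the solution is fine-tuning CATT-Whisper, a character-level multimodal model that combines a pretrained CATT text encoder with a frozen Whisper speech encoder, under three training regularizers: R-Drop consistency regularization, Optuna-optimized hyperparameters with high weight decay, and Focal Loss to counter class imbalance. At inference, Monte Carlo Dropout averages 200 stochastic forward passes across four model checkpoints at the softmax probability level. The system achieves a 23.26% word error rate (WER) on the primary leaderboard metric (with case endings, including no-diacritic positions), placing first among all participants.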

Link: https://arxiv.org/abs/2605.25928
Authors: Meshal Alamr, Hassan Alqaeri, Abdullah Aldahlawi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 4 pages, 1 figure. Published in Proceedings of OSACT7 (LREC 2026). Winning system for KSAA-2026 Task 2 on Arabic Speech Diacritization

Abstract:We describe the winning system for Task 2 of the KSAA-2026 Shared Task on Arabic Speech Dictation with Automatic Diacritization. The task requires producing fully diacritized Arabic text from speech audio and undiacritized transcripts, with only 2,327 training samples available and no external data permitted. Our system fine-tunes CATT-Whisper, a character-level multimodal model combining a pretrained CATT text encoder with a frozen Whisper speech encoder. The key to our approach is training regularization: R-Drop consistency regularization, Optuna-optimized hyperparameters with high weight decay, and Focal Loss. At inference, we average 200 stochastic forward passes across four model checkpoints using Monte Carlo Dropout at the softmax probability level. The system achieves 23.26% WER on the primary leaderboard metric (with case endings, including no-diacritic positions), placing 1st among all participants.
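
Averaging "at the softmax probability level" means the per-pass probability vectors are averaged before a label is chosen, rather than taking a vote over each pass's hard prediction. The sketch below illustrates that difference on hard-coded toy numbers. It is an assumption-laden illustration: in the real system the passes would come from models with dropout left active at inference, and the abstract does not say how the 200 passes are split across the four checkpoints.

```python
def average_probabilities(passes: list[list[float]]) -> list[float]:
    """Element-wise mean of per-pass softmax distributions for one position."""
    if not passes:
        raise ValueError("need at least one forward pass")
    n_classes = len(passes[0])
    if any(len(p) != n_classes for p in passes):
        raise ValueError("all passes must cover the same classes")
    return [sum(p[c] for p in passes) / len(passes) for c in range(n_classes)]


def argmax(values: list[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Toy example: three stochastic passes over two candidate diacritics.
passes = [[0.9, 0.1], [0.4, 0.6], [0.4, 0.6]]
mean_probs = average_probabilities(passes)
prob_level_choice = argmax(mean_probs)              # keeps each pass's confidence
votes = [argmax(p) for p in passes]
majority_choice = max(set(votes), key=votes.count)  # counts hard votes only
assert prob_level_choice == 0 and majority_choice == 1
```

In this toy case one confident pass outweighs two marginal ones, so probability-level averaging and majority voting pick different labels.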

[NLP-23] Does Continued Pretraining on a Learner Corpus Improve Automated Essay Scoring on English Proficiency Tests? Evidence from EFCAMDAT

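[Quick Read]: This paper examines a limitation of pretrained Transformer models in automated essay scoring (AES): they are usually pretrained on general-domain English and may under-represent second-language learner writing. The proposed remedy is domain-adaptive continued pretraining (DAPT) on the EFCAMDAT learner corpus, evaluated on English proficiency tests (FCE and IELTS). Full-corpus DAPT gives mixed results across models, datasets, and metrics. When the pretraining data closely matches the downstream assessment in proficiency level, genre, and communicative purpose (for example, targeted DAPT on CEFR-aligned subsets), in-domain scoring improves more reliably. These gains do not automatically carry over to cross-dataset transfer, which indicates that the benefit of DAPT depends on fine-grained alignment between the corpus and the target task.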

Link: https://arxiv.org/abs/2605.25924
Authors: Duy Anh Nguyen
Affiliations: University of Greenwich
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 16 pages, 3 figures, 10 tables, including references and appendices

Abstract:Recent automated essay scoring (AES) studies increasingly use pretrained transformer models, but these models are usually pretrained on general-domain English and may under-represent second-language learner writing. This study investigates whether domain-adaptive continued pretraining (DAPT) on the EFCAMDAT learner corpus improves transformer-based AES for English proficiency tests. We apply DAPT to three transformer encoders and evaluate them on FCE and IELTS in both in-domain scoring and few-shot cross-dataset transfer. Full-corpus DAPT produces mixed results across models, datasets, and metrics. Further analyses suggest that these mixed effects are partly explained by mismatches in proficiency, genre, and communicative purpose between EFCAMDAT and the downstream datasets. A proficiency-based ablation shows that targeted DAPT using CEFR-aligned subsets improves downstream scoring more reliably than full-corpus DAPT, especially for FCE with B1–B2 data. However, these gains do not consistently improve cross-dataset transfer. Overall, the findings suggest that continued pretraining on a learner-writing corpus can benefit in-domain AES for English assessment when the pretraining data is sufficiently aligned with the downstream assessment settings. However, it does not automatically improve transferability across different English proficiency test datasets.

[NLP-24] Can LLMs Time Travel? Enhancing Temporal Consistency in Legal Agentic Search through Reinforcement Learning

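[Quick Read]: This paper addresses the neglect of temporal consistency in current legal LLMs. The applicable law must match the point in time at which a case arose, yet existing models show temporal bias anchored to their training cutoff, and search agents rarely build temporal constraints into their queries, so they cite statutes that are outdated or not yet in force. The key to the solution is LegalSearch-R1, an end-to-end reinforcement learning framework that pairs local statute retrieval-augmented generation (RAG) for precise article matching with online web search for broader legal knowledge, and trains on temporally indexed data spanning multiple amendment periods to enforce temporal consistency. On a benchmark of 13 legal tasks, the 7B-parameter agent outperforms state-of-the-art deep research frameworks and specialized legal LLMs by 12.9% to 29.8%, surpasses baselines by 57.7% to 80.3% on temporal consistency, and generalizes robustly out of domain.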

Link: https://arxiv.org/abs/2605.25920
Authors: Wei Fan, Yining Zhou, Mufan Zhang, Yanbing Weng, Yiran HU, Tianshi Zheng, Baixuan Xu, Chunyang Li, Jianhui Yang, Haoran Li, Yangqiu Song
Affiliations: HKUST; Tsinghua University; University of Waterloo
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under Review

Abstract:While large language models (LLMs) augmented with agentic search capabilities show promise for legal reasoning, they overlook a fundamental constraint that applicable law must match the temporal context of each case, as retroactive application of statutes violates core legal principles and leads to erroneous conclusions. Our observations reveal that current legal LLMs suffer from temporal bias anchored to their training cutoff, while search agents rarely incorporate temporal constraints into queries, and that web search alone cannot provide the precise statute and precedent citations that legal reasoning demands. To address these challenges, we propose LegalSearch-R1, an end-to-end reinforcement learning framework that pairs local statute RAG for precise article matching with online web search for broader legal knowledge, trained on temporally-indexed data spanning multiple amendment periods to enforce temporal consistency. Extensive experiments on our benchmark covering 13 legal tasks demonstrate that our 7B-parameter agent outperforms state-of-the-art deep research frameworks and specialized legal LLMs by 12.9% to 29.8%, surpasses baselines by 57.7% to 80.3% on temporal consistency, and exhibits robust out-of-domain generalization. The code and data are available at this https URL.
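
The temporal constraint described above, that the cited statute version must be the one in force on the case date, can be illustrated with a small lookup. This is a hypothetical sketch: `StatuteVersion`, its fields, and `version_in_force` are invented for illustration, and the abstract does not describe how LegalSearch-R1 indexes amendment periods.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class StatuteVersion:
    article_id: str
    text: str
    effective_from: date          # first day this wording applies
    effective_to: Optional[date]  # first day it no longer applies; None = current


def version_in_force(versions, article_id, case_date):
    """Return the version of `article_id` applicable on `case_date`, or None."""
    for v in versions:
        if v.article_id != article_id:
            continue
        started = v.effective_from <= case_date
        not_ended = v.effective_to is None or case_date < v.effective_to
        if started and not_ended:
            return v
    return None


versions = [
    StatuteVersion("Art. 10", "old wording", date(2015, 1, 1), date(2021, 1, 1)),
    StatuteVersion("Art. 10", "amended wording", date(2021, 1, 1), None),
]
assert version_in_force(versions, "Art. 10", date(2019, 6, 1)).text == "old wording"
assert version_in_force(versions, "Art. 10", date(2022, 3, 1)).text == "amended wording"
```

A retrieval step that ignores `case_date` would return the amended wording for the 2019 case, which is the retroactive-application error the abstract warns about.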

[NLP-25] Universal Activation Verbalizer: A Unified Framework for Cross-Model Activation Explanation

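[Quick Read]: This paper addresses the restriction of most activation-explanation methods to self-explanation, where each model can explain only its own hidden representations and no explanation capability is shared across models or architectures. The key to the solution is the Universal Activation Verbalizer (UAV), which introduces a shared decoder and a lightweight adapter that converts activations from heterogeneous donor models into soft tokens in the decoder's embedding space. UAV also supports adapter-only transfer: a frozen decoder-side LoRA (Low-Rank Adaptation) is reused while only a new adapter is trained for another donor. Across classification, fact retrieval, and gist summarization, UAV stays competitive with strong self-explanation baselines while enabling activation explanation across model families and scales. Ablations show that decoder-side tuning mainly improves task behavior, whereas the adapter supplies the activation-grounded factual and semantic information needed for faithful explanations.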

Link: https://arxiv.org/abs/2605.25903
Authors: Haiyan Zhao, Zirui He, Guanchu Wang, Ali Payani, Yingcong Li, Mengnan Du
Affiliations: New Jersey Institute of Technology; University of North Carolina at Charlotte; Cisco Research; The Chinese University of Hong Kong, Shenzhen
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 23 pages, 11 figures, 11 tables

Abstract:Activation verbalization explains hidden representations in natural language, but existing methods are mostly limited to self-explanation, where each model explains only its own activations. We introduce Universal Activation Verbalizer (UAV), a framework that uses a shared decoder to explain activations from heterogeneous donor models. UAV learns a lightweight adapter that converts donor activations into soft tokens in decoder’s embedding space, and further supports adapter-only transfer by reusing a frozen decoder-side LoRA while training only a new adapter for another donor. Across classification, fact retrieval, and gist summarization, UAV remains competitive with strong self-explanation baselines while enabling cross-model verbalization across model families and scales. Ablations show that decoder-side tuning mainly improves task behavior, whereas the adapter provides the activation-grounded factual and semantic information needed for faithful explanations.

[NLP-26] Causal Tongue-Tie: LLMs Can Encode Causal Direction But Their Yes/No Outputs Fail to Express

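[Quick Read]: This paper addresses a mismatch between the internal representations and the final outputs of large language models (LLMs) on causal questions: the hidden state may encode the evidence-supported answer while the spoken output reverts to the commonsense one. The key contribution is the concept of "Causal Tongue-Tie", which shows that a wrong Yes/No answer decomposes into two separable failure modes: no internal signal at all, versus a causal signal that the verbal interface cannot express. On anti-commonsense CLadder items, a fixed linear probe recovers the evidence-supported answer with accuracy of about 0.97, while the Yes/No output scores about 0.5. The finding implies that output accuracy alone is a misleading measure of whether LLMs can reason causally, and that causal benchmarks built on a single accuracy number deserve a second look.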

Link: https://arxiv.org/abs/2605.25891
Authors: Ziyi Ding, Xiao-Ping Zhang
Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University; Qwen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We find a mismatch between what large language models encode about a causal question and what they answer. On anti-commonsense CLadder items, a fixed linear probe recovers the evidence-supported answer from the model’s hidden state (accuracy approximately 0.97), while the spoken Yes/No reverts to the commonsense one (accuracy approximately 0.5). We call this approximately +0.5 gap Causal Tongue-Tie: a wrong Yes/No decomposes into two separable failure modes: no internal signal versus a signal the verbal interface cannot say. The implication cuts both ways for output-only causal benchmarks: a benchmark “correct” need not mean the model has understood, and a benchmark “wrong” need not mean it cannot. Sweeping claims about whether LLMs can do causal reasoning, drawn from a single accuracy number, deserve a second look.

[NLP-27] Mitigating Provenance-Role Collapse in Long-Term Agents via Typed Memory Representation

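[Quick Read]: This paper addresses a failure caused by how long-term memory is stored in LLM agents, namely provenance-role collapse: agents that store history as unstructured flat text make source-monitoring errors because they cannot tell where a piece of information came from. The key to the solution is MemIR, a typed Memory Intermediate Representation that writes memory into grounded atoms of three kinds: raw evidence, retrieval cues, and truth-bearing claims, with factual authorization restricted to supported claim atoms. Multi-route atomic projection and provenance-scoped utilization then turn heterogeneous retrieval hits into claim-centered candidate bundles and a normalized fact interface for answer generation. On LoCoMo and BEAM-100K, MemIR consistently outperforms existing memory baselines, especially on tasks that require source tracking, temporal grounding, and aggregation of fragmented evidence.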

Link: https://arxiv.org/abs/2605.25869
Authors: Zhengda Jin, Bingbing Wang, Jing Li, Ruifeng Xu, Min Zhang
Affiliations: Harbin Institute of Technology, Shenzhen, China; The Hong Kong Polytechnic University, Hong Kong, China; Shenzhen Loop Area Institute, Shenzhen, China
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Long-term memory is essential for persistent LLM agents, yet prevailing architectures store historical interactions as unstructured, flat text. This unconstrained storage induces provenance-role collapse, a critical failure mode where agents suffer from source-monitoring errors. To resolve this cognitive vulnerability at the architectural level, we propose MemIR, a typed Memory Intermediate Representation that operationalizes source monitoring as a structural constraint. MemIR writes long-term memory into grounded atoms that separate raw evidence, retrieval cues, and truth-bearing claims, with factual authorization restricted to supported claim atoms. It then applies multi-route atomic projection and provenance-scoped utilization to transform heterogeneous retrieval hits into claim-centered candidate bundles and a normalized fact interface for answer generation. Experiments on LoCoMo and BEAM-100K demonstrate that MemIR consistently outperforms existing memory baselines, especially on tasks requiring source tracking, temporal grounding, and aggregation of fragmented evidence.
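
The idea of typed atoms with "factual authorization restricted to supported claim atoms" can be sketched as a small data structure. This is a hypothetical illustration only: the class names, fields, and the `authorized_claims` rule are invented here, and the abstract does not specify MemIR's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class AtomType(Enum):
    EVIDENCE = "evidence"  # raw record of what was said or observed
    CUE = "cue"            # retrieval hint; never treated as a fact
    CLAIM = "claim"        # truth-bearing statement


@dataclass(frozen=True)
class Atom:
    atom_id: str
    type: AtomType
    text: str
    source: str = ""                    # provenance, e.g. speaker or session
    supported_by: tuple[str, ...] = ()  # ids of evidence atoms backing a claim


def authorized_claims(atoms: list[Atom]) -> list[Atom]:
    """Keep only claims whose every cited support is a stored evidence atom."""
    evidence_ids = {a.atom_id for a in atoms if a.type is AtomType.EVIDENCE}
    return [
        a for a in atoms
        if a.type is AtomType.CLAIM
        and a.supported_by
        and all(e in evidence_ids for e in a.supported_by)
    ]


memory = [
    Atom("e1", AtomType.EVIDENCE, "User: I moved to Oslo in May.", source="user"),
    Atom("q1", AtomType.CUE, "user relocation"),
    Atom("c1", AtomType.CLAIM, "The user lives in Oslo.", supported_by=("e1",)),
    Atom("c2", AtomType.CLAIM, "The user owns a boat."),  # no evidence cited
]
assert [a.atom_id for a in authorized_claims(memory)] == ["c1"]
```

Keeping cues and unsupported claims out of the fact interface is what prevents a retrieval hint or an unsourced guess from being answered as if it were established.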

[NLP-28] When Self-Belief Misleads: Active Label Acquisition for Reinforcement Learning with Verifiable Rewards

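[Quick Read]: This paper addresses two obstacles in Reinforcement Learning with Verifiable Rewards (RLVR): ground-truth labels are expensive to obtain in real-world settings, and unsupervised RLVR methods that rely on pseudo-labels are prone to training collapse, while samples also differ in how valuable their annotation is. The key to the solution is Reinforcement Learning with Active Verifiable Rewards (RLAVR), which actively acquires ground-truth labels for a small set of selected samples and combines them with pseudo-labels. Its two main components are (1) the Corrective Advantage Gap (CAG) metric, which quantifies the supervision value of each sample, and (2) Correction-Aware Reliability Estimation (CARE), a pre-query acquisition policy that turns the oracle CAG criterion into a practical active-labeling strategy under a limited annotation budget, improving training stability and final performance.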

Link: https://arxiv.org/abs/2605.25864
Authors: Li Wang, Xiaodong Lu, Xiaohan Wang, Yikun Ban, Jiajun Chai, Wei Lin, Tianhao Peng, Guojun Yin
Affiliations: Meituan; Beihang University; Nanyang Technological University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have achieved remarkable advancements in reasoning capabilities empowered by Reinforcement Learning with Verifiable Rewards (RLVR). Nonetheless, RLVR intrinsically relies on ground-truth labels for reward computation, the acquisition of which is often prohibitively expensive in real-world scenarios. While unsupervised RLVR paradigms attempt to circumvent this by training on pseudo-labels, they are notoriously susceptible to training collapse. Moreover, different samples often exhibit varying annotation values. In this paper, we propose Reinforcement Learning with Active Verifiable Rewards (RLAVR), which actively acquires ground-truth labels for a small set of selected samples and integrates them with pseudo-labels, thereby stabilizing training dynamics and improving performance under limited annotation budgets. To identify valuable samples, we propose the Corrective Advantage Gap (CAG) metric and analyze the sample-level supervision value. Building on this, we introduce Correction-Aware Reliability Estimation for RLAVR (CARE), which translates the oracle CAG criterion into a practical pre-query acquisition policy to substantially improve training stability. Extensive experiments across diverse domains, model families, and model scales demonstrate the effectiveness and generality of our approach. Our code is available at this https URL.

[NLP-29] TIAR: Trajectory-Informed Advantage Reweighting for LLM Abstention Learning

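[Quick Read]: This paper aims to teach large language models (LLMs) to abstain when a question is uncertain or lies beyond their knowledge boundary, as a step toward reducing hallucination. The core solution is Trajectory-Informed Advantage Reweighting (TIAR), which dynamically re-weights the abstention reward during Group Relative Policy Optimization (GRPO) training. Because GRPO already samples multiple trajectories per query, those trajectories serve as an indicator of the policy's confidence on that query and are used to compute the abstention advantage dynamically. On AbstentionBench, TIAR achieves the best abstention F1 in five of six evaluation categories and outperforms the static ternary-reward baseline on 17 of 31 datasets while fully preserving baseline accuracy.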

Link: https://arxiv.org/abs/2605.25850
Authors: Muyu Pan, Shu Zhao, Nan Zhang, Philip Shin, Varun Parekh, Vijaykrishnan Narayanan, Rui Zhang
Affiliations: The Pennsylvania State University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 1 figure, 4 tables

Abstract:This paper investigates large language model (LLM) abstention learning, specifically using a ternary reward, which incentivizes truthfulness in large language models. This paper extends that idea by moving from a ternary reward to Trajectory-Informed Advantage Reweighting (TIAR), which dynamically re-weights the abstention reward during Group Relative Policy Optimization (GRPO) training. The objective of this work focuses on abstention learning instead of improving truthfulness, serving as an exploration into hallucination reduction. The novelty of this paper lies in methodological innovation, advantage re-weighting, and benchmark selection. Leveraging GRPO’s multiple trajectories as a natural abstention signal, this method uses a reward signal to explore knowledge boundaries and encourage consistency. By demonstrating that trajectories can be used as a confidence indicator of the policy relative to the query, they are then used to dynamically calculate the abstention advantage. AbstentionBench is used as the evaluation benchmark, as this work aims to contribute to the field of abstention learning. All datasets on the benchmark were tested against this method and various baselines. Empirical results demonstrate that TIAR achieves state-of-the-art abstention F1 scores across five of six evaluation categories, outperforming the static ternary baseline on 17 of 31 benchmark datasets while fully preserving baseline accuracy.

[NLP-30] On the Limits of Model Merging for Multilinguality in Pre-Training

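[Quick Read]: This paper asks whether models that were each pretrained on a single language can be merged into one model with consistent multilingual performance. It runs a controlled comparison of mixed, merged, and monolingual pre-training setups. Monolingual pre-training yields strong in-language performance, but merging any combination of monolingual models leads to performance collapse due to interference. The analysis suggests that representational similarity is a prerequisite for successful merging, so the flexibility of merging at the fine-tuning stage does not extend trivially to language-specific pre-training.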

Link: https://arxiv.org/abs/2605.25846
Authors: Seth Aycock, Fedor Vitiugin, Aleksandr Umnov, Christof Monz, Khalil Sima’an
Affiliations: University of Amsterdam; University of Turku; Booking.com
Categories: Computation and Language (cs.CL)
Comments: MeLLM Workshop 2026

Abstract:Endowing models with consistent multilingual performance can be achieved by mixing pre-training data, or post-training approaches such as language-specific model merging. In this work, we test whether merging can be applied to monolingually pre-trained models. We conduct a controlled study on the efficacy of mixed, merged, and monolingual pre-training setups. We find that while monolingual pre-training results in strong in-language performance, merging any combination of monolingual models leads to performance collapse due to interference. Our analysis suggests representational similarity is a prerequisite for model merging. We therefore conclude that the flexibility of merging in fine-tuning does not extend trivially to language-specific pre-training.
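
For readers unfamiliar with model merging, the simplest variant is uniform parameter averaging, sketched below on toy parameter dictionaries. This is an assumption for illustration: the abstract does not state which merging methods the study tests, and real models would hold tensors rather than flat lists of floats.

```python
def merge_uniform(models: list[dict[str, list[float]]]) -> dict[str, list[float]]:
    """Average identically named, identically shaped parameters across models."""
    if not models:
        raise ValueError("need at least one model")
    reference = models[0]
    for m in models[1:]:
        if m.keys() != reference.keys():
            raise ValueError("models must share parameter names")
    merged = {}
    for name, ref_values in reference.items():
        columns = [m[name] for m in models]
        if any(len(c) != len(ref_values) for c in columns):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # zip(*columns) walks the same position in every model's parameter.
        merged[name] = [sum(vals) / len(models) for vals in zip(*columns)]
    return merged


# Two toy "monolingual" models with one shared layer.
model_a = {"layer.weight": [1.0, 3.0]}
model_b = {"layer.weight": [3.0, 5.0]}
assert merge_uniform([model_a, model_b]) == {"layer.weight": [2.0, 4.0]}
```

Averaging only makes sense when the same coordinate plays a similar role in every model, which is one way to read the paper's conclusion that representational similarity is a prerequisite for merging.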

[NLP-31] MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning

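[Quick Read]: This paper addresses the high deployment cost of vision-language models (VLMs) that rely on chain-of-thought (CoT) reasoning for complex multimodal tasks. Existing structured pruning methods fail to preserve CoT reasoning accuracy for two reasons: (1) CoT consistency depends on sparse transition points (pivot tokens) in the generation trajectory, which existing methods ignore; and (2) pruning methods designed for unimodal LLMs do not account for activation-distribution differences between visual and textual modalities. The proposed MuCRASP framework targets reasoning-critical components while preserving cross-modal alignment and accounting for layer-wise sensitivity under a global parameter budget. Across four VLMs and three reasoning benchmarks, MuCRASP consistently preserves reasoning quality as compression increases: at 30% pruning on Qwen2.5-VL-7B it reaches an LLM-as-a-Judge score of 8.87 on physical reasoning tasks versus 7.32 for the strongest baseline, and it maintains high reasoning consistency up to 50% pruning with lower perplexity degradation.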

Link: https://arxiv.org/abs/2605.25842
Authors: Aritra Dutta, Somak Aditya
Affiliations: Indian Institute of Technology (IIT), Kharagpur
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: First Preprint

Abstract:Vision-language models (VLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex multimodal tasks, but their large parameter sizes make deployment expensive. Structured pruning offers a natural solution; however, existing methods fail to preserve CoT reasoning accuracy in VLMs. We identify two key reasons: (1) CoT consistency depends on sparse transition points (pivot tokens) in the generation trajectory, while existing pruning methods are CoT-agnostic; and (2) pruning methods designed for unimodal LLMs do not account for activation-distribution differences across visual and textual modalities. Motivated by these observations, we propose MuCRASP, a structured pruning framework that targets reasoning-critical components while preserving cross-modal alignment and accounting for layer-wise sensitivity under a global parameter budget. Experiments on four VLMs across three reasoning benchmarks show that MuCRASP consistently preserves reasoning quality under increasing compression. At 30% pruning on Qwen2.5-VL-7B, MuCRASP achieves an LLM-as-a-Judge score of 8.87 versus 7.32 for the strongest baseline on physical reasoning tasks. Furthermore, MuCRASP maintains high reasoning consistency up to 50% pruning, significantly outperforming prior pruning approaches while exhibiting lower perplexity degradation.

[NLP-32] TTPrint: Evidence-Grounded TTP Extraction via Diverge-then-Converge Verification

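[Quick Read]: This paper addresses the extraction of MITRE ATT&CK techniques from cyber threat intelligence (CTI) reports, an open-set, multi-label problem that demands both high recall (no missed techniques) and high precision (no unsupported extractions). Existing rule-based, supervised, and LLM-based methods struggle to achieve both: rule-based and supervised approaches generalize poorly across diverse attack descriptions, and LLM-based approaches that couple candidate generation with validation in a single inference step are limited in both recall and precision. The key to the solution is TTPrint, a diverge-then-converge design modeled on how human analysts work. The divergent phase decomposes a report into atomic behaviors and proposes candidate techniques broadly; a deterministic span localization stage anchors each candidate to a specific evidence window in the source text; and a convergent verification stage keeps only candidates supported by both the localized evidence and the authoritative MITRE definition. The authors also contribute two evaluation resources, TRAM-Clean and TTPrint-Bench, to address annotation noise in existing benchmarks and to lift the task to document-level TTP extraction. TTPrint achieves 76.48% and 87.39% macro-F1 on the two benchmarks, outperforming the leading baseline by 63.5% and 29.4%; a multi-backbone analysis across six LLMs and a threshold sensitivity study support its generalizability and offer practical guidance for parameter selection.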

Link: https://arxiv.org/abs/2605.25836
Authors: Yutong Cheng, Changze Li, Raihan Sultan Pasha Basuki, Qian Cui, Wei Ding, Peng Gao
Affiliations: Virginia Tech; Universitas Ary Ginanjar, Jakarta, Indonesia; Amazon
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint

Abstract:Extracting MITRE ATT&CK techniques from cyber threat intelligence (CTI) reports is an open-set, multi-label problem requiring both high recall (not missing techniques) and high precision (not hallucinating unsupported ones). Existing methods–rule-based, supervised, and LLM-based–struggle to achieve both: rule-based and supervised approaches lack generalizability across diverse attack descriptions, while LLM-based approaches that couple candidate generation and validation within a single inference step suffer from limited recall and precision simultaneously. We propose TTPrint, which addresses this challenge through a diverge-then-converge design inspired by how human analysts work: first extracting broadly, then verifying rigorously. In the divergent phase, reports are decomposed into atomic behaviors and candidate techniques are proposed broadly. A deterministic span localization stage then anchors each candidate to a specific evidence window in the source text. A convergent verification stage retains only candidates supported by both the localized evidence and the authoritative MITRE definition. We contribute two evaluation resources–a cleaned TRAM benchmark (TRAM-Clean) and a new annotated dataset (TTPrint-Bench)–to address known annotation noise in existing benchmarks and elevate the task to document-level TTP extraction. On TRAM-Clean and TTPrint-Bench, TTPrint achieves 76.48% and 87.39% macro-F1 respectively, outperforming the leading baseline by 63.5% and 29.4%. A multi-backbone analysis across six LLMs and a threshold sensitivity study further demonstrate generalizability across model choices and provide practical guidance for parameter selection.

[NLP-33] When Search Becomes Memory: Turning Robot Design Trials into Transferable Skills

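[Quick Read]: This paper addresses the fact that most LLM-based evolutionary robot design loops are memoryless: simulator results shape the next population but are not preserved as reusable design knowledge. The key to the solution is Auto-Robotist, a self-evolving LLM agent that distills morphology-search traces into an explicit natural-language skill library. Each skill stores a structural archetype, evidence-grounded positive and negative rules, and the evaluated designs that support them, which makes design memory inspectable rather than implicit in a population. During search, the agent retrieves skills to condition LLM edits of elite bodies while keeping a Genetic Algorithm (GA) mutation path for exploration; after evaluation, it updates the library through Add, Diagnose, and Merge. Across seven EvoGym tasks, Auto-Robotist improves cold-start search in the 5×5 design space and transfers learned skills to the larger 10×10 design space, where reference-conditioned transfer outperforms GA on every task. The results suggest that LLM agents can convert expensive physical evaluations into reusable, auditable design principles.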

Link: https://arxiv.org/abs/2605.25832
Authors: Yunfei Wang, Xiaohao Xu, Yang Li, Xiaonan Huang
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 8 figures

Abstract:Large language models (LLMs) are increasingly used as proposal generators for evolutionary robot design, yet most loops remain memoryless: simulator results shape the next population but are not preserved as reusable design knowledge. We present Auto-Robotist, a self-evolving LLM agent that distills morphology-search traces into an explicit natural-language skill library. Each skill stores a structural archetype, evidence-grounded positive and negative rules, and the evaluated designs that support them, making design memory inspectable rather than implicit in a population. During search, the agent retrieves skills to condition LLM edits of elite bodies while retaining a Genetic Algorithm (GA) mutation path for exploration; after evaluation, it updates the library through Add, Diagnose, and Merge. Across seven EvoGym tasks spanning locomotion, traversal, and object interaction, Auto-Robotist improves cold-start 5x5 search and transfers learned skills to 10x10 design spaces, where reference-conditioned transfer outperforms GA on every task. These results suggest that LLM agents can convert expensive physical evaluations into reusable, auditable design principles. Our code will be released upon acceptance.

[NLP-34] Clarify Abstain or Answer? Strategising in Conversation with Belief-Augmented Generation

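[Quick Read]: This paper addresses the tendency of large language models (LLMs) to answer ambiguous or uncertain inputs directly: by default they rarely ask for clarification or abstain. The key to the solution is Belief-Augmented Generation (BAG), which places the model's own belief state (K responses sampled for the input) into the prompt and lets the model reason over those samples to choose one of three conversational strategies: answer, clarify, or abstain. In a multi-turn ambiguous QA setting, BAG improves QA accuracy across six models and yields strategy decisions that are more faithful to the belief state than prompt-only baselines. Distinguishing when to clarify from when to abstain remains challenging.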

Link: https://arxiv.org/abs/2605.25831
Authors: Joris Baan, Wilker Aziz, Barbara Plank, Raquel Fernández
Affiliations: University of Amsterdam; MCML Munich; LMU Munich
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Large language models (LLMs) define a distribution over text, which can be viewed as a probabilistic representation of uncertainty: sampling K responses yields a belief state - responses a model deems plausible. Existing work exploits this representation for narrow tasks like either decoding or selective prediction, and often requires manual interventions, not controlling generation directly. We propose Belief-Augmented Generation (BAG): grounding LLMs in their own belief state via the prompt and letting them reason over these K samples to decide on a conversational strategy: answer, clarify, or abstain. In a multi-turn ambiguous QA setting, we find that LLMs by default rarely clarify or abstain, ignoring uncertainty about the input or facts. BAG improves QA accuracy across six models and yields strategy decisions more faithful to the belief state than prompt-only baselines. Disentangling when to clarify from when to abstain, however, remains challenging.
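
One way to picture "grounding LLMs in their own belief state via the prompt" is to format the K sampled responses, with their frequencies, into the prompt that asks for a strategy. The sketch below does only that formatting step. The prompt wording and the `build_bag_prompt` helper are hypothetical, and the abstract does not show the template the authors use.

```python
from collections import Counter

STRATEGIES = ("answer", "clarify", "abstain")


def build_bag_prompt(question: str, samples: list[str]) -> str:
    """Embed K sampled responses (the belief state) into a strategy prompt."""
    if not samples:
        raise ValueError("belief state needs at least one sampled response")
    counts = Counter(s.strip() for s in samples)
    belief_lines = [
        f"- {text} (sampled {n}/{len(samples)} times)"
        for text, n in counts.most_common()
    ]
    return "\n".join([
        f"Question: {question}",
        "Your own sampled responses to this question:",
        *belief_lines,
        "Considering how much these samples agree, choose one strategy: "
        + ", ".join(STRATEGIES) + ".",
    ])


prompt = build_bag_prompt(
    "Who won the cup?",
    ["France", "Which cup do you mean?", "France"],
)
print(prompt)
```

Showing frequencies rather than a single sample lets the model see disagreement among its own outputs, which is the signal BAG relies on when deciding whether to clarify or abstain.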

[NLP-35] Fine-Tuning Over Architectural Complexity: Broad-Coverage PII Detection on PIIBench with DeBERTa

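[Quick Read]: This paper addresses the limited coverage of personally identifiable information (PII) detection systems that are trained within narrow source or domain boundaries and then deployed on heterogeneous text. The authors fine-tune models on a corrected multi-source PIIBench preparation spanning 82 retained entity types across ten source datasets, and compare direct DeBERTa token-classification fine-tuning with two more complex designs: a source-conditioned hierarchical model (SC+H) and a three-phase curriculum extension (SC+H+Curr). Although validation initially favoured SC+H, on the complete 100,002-record held-out split direct fine-tuning reaches F1 0.6455 versus 0.5894 for SC+H, winning 54 of 82 fine entity types and all ten coarse groups. The results indicate that diverse task-specific training data and a simple weighted cross-entropy objective contribute more to broad-coverage PII detection than the tested architectural and curriculum complexity.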

Link: https://arxiv.org/abs/2605.25816
Authors: Pritesh Jha
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Personally identifiable information (PII) detection systems are frequently trained within narrow source or domain boundaries, limiting coverage when deployed on heterogeneous text. We study model fine-tuning on a corrected multi-source PIIBench preparation spanning 82 retained entity types across ten source datasets. We evaluate three DeBERTa-based approaches: direct token classification fine-tuning, a source-conditioned hierarchical model (SC+H), and a three-phase curriculum extension (SC+H+Curr). Against eight published comparator systems on a reproducible 5,000-record held-out subset (test_5k), direct fine-tuned DeBERTa achieves F1 0.6476, while SC+H and the curriculum variant achieve 0.5899 and 0.2772 respectively; the strongest published comparator reaches only 0.1723. Because validation initially favoured SC+H, we perform a final streamed evaluation on the complete 100,002-record held-out split. Direct fine-tuning remains superior, achieving F1 0.6455 versus 0.5894 for SC+H. Entity-level analysis shows that direct fine tuning wins 54 of 82 fine entity types and all ten coarse groups by support-weighted entity F1, while SC+H retains localised advantages on 28 types. The results indicate that diverse task-specific training data and a simple weighted cross-entropy objective contribute more to broad-coverage PII detection than the tested architectural and curriculum complexity.
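
The "simple weighted cross-entropy objective" mentioned above can be written out in plain Python for a token-classification setting. This is a generic sketch, not the paper's training code: the class weights are made-up values, the abstract does not say how the weights are derived, and normalizing by the sum of weights is one common convention.

```python
import math


def weighted_cross_entropy(logits, targets, class_weights):
    """Weighted mean of per-token negative log-likelihoods."""
    total, weight_sum = 0.0, 0.0
    for row, target in zip(logits, targets):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        nll = log_z - row[target]
        total += class_weights[target] * nll
        weight_sum += class_weights[target]
    return total / weight_sum


# Two tokens, two classes: 0 = non-PII ("O"), 1 = a PII tag (illustrative).
logits = [[2.0, 0.0], [0.0, 0.0]]
targets = [0, 1]
unweighted = weighted_cross_entropy(logits, targets, [1.0, 1.0])
upweighted = weighted_cross_entropy(logits, targets, [1.0, 3.0])
assert upweighted > unweighted  # the harder, rarer PII token now counts more
```

Raising the weight on rare entity classes shifts the loss toward the tokens the model would otherwise neglect, since non-PII tokens dominate most text.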

[NLP-36] Adaptive Graph Refinement and Label Propagation with LLMs for Cost-Effective Entity Resolution

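[Quick Read]: This paper addresses inherent flaws in the dominant blocking-matching-clustering paradigm for entity resolution (ER). Because the steps are decoupled, the pipeline produces a static, sparse entity graph with missing edges (from blocking failures) and noisy links (from matching errors), which causes error propagation and suboptimal clusters, particularly when rigid transitivity is imposed during clustering. The key to the solution is Alper, a unified framework that integrates matching and clustering into an iterative probabilistic label propagation process over a global, evolving graph. Alper adaptively combines "weak but cheap" signals from graph propagation with "strong but expensive" LLM-based pairwise queries, and formulates signal selection as a constrained optimization problem that maximizes cumulative marginal gain under a query budget, solved by a greedy algorithm with provable theoretical guarantees. Experiments on eight benchmark datasets show that Alper is consistently superior to state-of-the-art cascaded pipelines.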

Link: https://arxiv.org/abs/2605.25814
Authors: Hongtao Wang, Renchi Yang, Haoran Zheng, Xiangyu Ke
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Dirty entity resolution (ER), which identifies records referring to the same real-world entity from a single, messy dataset, is a fundamental task in data management and mining. However, the dominant blocking-matching-clustering paradigm for ER suffers from critical flaws. Its cascaded, decoupled workflow essentially produces a static, sparse graph plagued by missing edges (due to blocking failures) and noisy links (due to matching errors), causing error propagation and yielding suboptimal clusters, particularly when rigid transitivity is imposed in the clustering. We contend that matching and clustering are fundamentally synergistic, both optimizing for the construction of an ideal entity graph. Building upon this insight, we propose Alper, a unified framework that integrates these steps into an iterative probabilistic label propagation process over a global, evolving graph. Unlike disjoint blocking, Alper refines the graph structure and labels dynamically by adaptively integrating “weak but cheap” signals from graph propagation with “strong but expensive” LLM-based pairwise queries. For higher cost-effectiveness, we formulate the signal selection as a constrained optimization problem maximizing cumulative marginal gain under a query budget, solved via our greedy algorithm with provable theoretical guarantees. Our extensive experiments over eight benchmark datasets demonstrate that Alper is consistently superior to state-of-the-art cascaded pipelines.
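
Budgeted selection of which record pairs to send to the LLM can be illustrated with a generic greedy loop that repeatedly picks the affordable pair with the best marginal gain per unit cost. This is a textbook-style sketch under stated assumptions, not Alper's algorithm: the `marginal_gain` and `cost` callables, and the toy coverage gain in the example, are hypothetical, and the sketch makes no claim about the paper's theoretical guarantees.

```python
def greedy_select(candidates, cost, marginal_gain, budget):
    """Greedily choose candidates by gain-per-cost until the budget runs out."""
    chosen, spent = [], 0.0
    remaining = list(candidates)
    while remaining:
        affordable = [c for c in remaining if spent + cost(c) <= budget]
        if not affordable:
            break
        # Gain is re-evaluated each round because it depends on earlier picks.
        best = max(affordable, key=lambda c: marginal_gain(c, chosen) / cost(c))
        if marginal_gain(best, chosen) <= 0:
            break
        chosen.append(best)
        spent += cost(best)
        remaining.remove(best)
    return chosen, spent


def new_records(pair, chosen):
    """Toy gain: how many records in `pair` no chosen pair has touched yet."""
    seen = {record for p in chosen for record in p}
    return len(set(pair) - seen)


pairs = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e")]
chosen, spent = greedy_select(pairs, cost=lambda p: 1.0,
                              marginal_gain=new_records, budget=2.0)
assert chosen == [("a", "b"), ("d", "e")] and spent == 2.0
```

After `("a", "b")` is chosen, the overlapping pairs each add only one new record, so the second query goes to `("d", "e")`, showing how diminishing marginal gain steers the budget.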

[NLP-37] SAMark: A Self-Anchored Text Watermarking with Paragraph-Level Paraphrase Robustness

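[Quick Read]: This paper addresses the weak robustness of existing semantic-level watermarking (SWM) methods to paragraph-level paraphrasing, an attack that disrupts watermark signals globally by changing sentence order. The key to the solution is the SAMark framework, with three components: (1) a self-anchored mechanism that establishes a step-independent green region in semantic space and so removes the dependency on sentence order; (2) a multi-channel hyperbolic scoring mechanism that amplifies watermark signals while suppressing noise from weakly aligned candidates; and (3) a diversity-aware filtering strategy that combines hard filtering with soft regularization and extends beyond simple n-gram repetition filters to address semantic redundancy. SAMark achieves up to 90.2% TP@FP1% under typical paragraph-level paraphrasing attacks, outperforming the strongest prior baseline by more than 30% on average, while keeping generation quality competitive with unwatermarked text and breaking the robustness-quality trade-off that limits prior methods.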

Link: https://arxiv.org/abs/2605.25796
Authors: Jiahao Huo, Wenjie Qu, Yibo Yan, Kening Zheng, Jiaheng Zhang, Xuming Hu, Philip S. Yu, Mingxun Zhou
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Semantic-level watermarking (SWM) improves robustness against text modifications by treating sentences as the basic unit. However, robustness to paragraph-level paraphrasing remains difficult because such attacks globally disrupt watermark signals by changing sentence order. In this work, we propose SAMark, a self-anchored watermarking framework that removes the dependency on sentence order by establishing a step-independent green region in semantic space. To improve detectability, we introduce a multi-channel hyperbolic scoring mechanism that amplifies watermark signals while suppressing noise from weakly aligned candidates. We further propose a diversity-aware filtering strategy that combines hard filtering with soft regularization, extending beyond simple n-gram repetition filters to address semantic redundancy. Experimental results show that SAMark achieves up to 90.2% TP@FP1% under typical paragraph-level paraphrasing attacks, outperforming the strongest prior baseline by more than 30% on average, while maintaining generation quality competitive with unwatermarked text and breaking the robustness-quality trade-off that limits prior methods.

[NLP-38] Double Triangle Annotation: A Scalable Human-in-the-Loop Framework for High-Precision Historical Document Annotation ACL

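Quick read: This paper tackles the difficulty of producing high-precision annotations for structured-information extraction from large collections of historical documents: traditional manual labeling is expensive, while fully automated pipelines built on large language models (LLMs) are prone to hallucination. The proposed Double Triangle Annotation framework is a two-layer human-in-the-loop mechanism. In the first layer, two architecturally independent multimodal LLMs annotate each document in parallel; agreeing labels are auto-accepted and disagreements are routed to a human jury. The second layer cross-checks two such systems against each other and escalates residual conflicts to a domain expert. The framework rests on a single assumption, error independence between models, requires no distributional priors or task-specific calibration, and becomes more autonomous as model capability improves. On the Guides Rosenwald, a corpus of French medical directories spanning 1887-1906, it reaches a final Word Error Rate of 0.003 and auto-accepts over 85% of 13,595 fields. The authors also release the first structured-extraction ground truth for this corpus to support future work on historical document processing.

Link: https://arxiv.org/abs/2605.25781
Authors: Yi Ren
Affiliations: École Polytechnique Fédérale de Lausanne
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 4 figures. ACL ARR 2026 March submission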

Abstract:Evaluating structured-information extraction from historical documents at scale requires high-precision ground-truth annotations, yet traditional manual labeling is expensive and fully automated pipelines built on large language models are prone to hallucination. We propose Double Triangle Annotation, a two-layer human-in-the-loop framework that leverages cross-model consensus to automate the majority of annotation work while ensuring high-precision outputs. In the first layer, two architecturally independent Multimodal Large Language Models annotate each document in parallel; when they agree, the label is auto-accepted, and disagreements are routed to a human jury. A second layer cross-checks two such systems against each other, escalating residual conflicts to a domain expert. The framework rests on a single assumption – error independence between models – requires no distributional priors or task-specific calibration, and becomes more autonomous as model capability improves. On the Guides Rosenwald, a corpus of French medical directories spanning 1887-1906, the framework achieves a final Word Error Rate of 0.003. Applied at scale, model consensus auto-accepts over 85% of 13,595 fields. We release the resulting benchmark – the first structured-extraction ground truth for the Rosenwald Guides – to support future work on historical document processing.
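
The two-layer routing described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the exact-match agreement test, and the `jury` and `expert` callables are hypothetical stand-ins, not the paper's implementation. The intuition behind the independence assumption is that if two models err independently with probabilities p_a and p_b, an auto-accepted error requires both to be wrong (and to agree), which happens with probability at most p_a * p_b.

```python
def triangle(label_a, label_b, jury):
    """First layer: accept on agreement, otherwise ask the human jury."""
    if label_a == label_b:
        return label_a, "auto"
    return jury(label_a, label_b), "jury"

def double_triangle(out_1, out_2, expert):
    """Second layer: cross-check two first-layer outputs; escalate conflicts."""
    if out_1 == out_2:
        return out_1, "consensus"
    return expert(out_1, out_2), "expert"

# Hypothetical human decisions, used only to make the sketch executable.
jury = lambda a, b: a
expert = lambda a, b: b

first, route_1 = triangle("Dr. Martin", "Dr. Martin", jury)
second, route_2 = triangle("Dr. Martin", "Dr. Marlin", jury)
final, route_3 = double_triangle(first, second, expert)
print(route_1, route_2, route_3)
```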

[NLP-39] StreamProfileBench: A Benchmark for Fine-Grained User Profile Inference in Real-World Streaming Scenarios

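Quick read: Current evaluations of LLM-based user profiling rely mainly on static data snapshots, which do not reflect real personalized systems where user-generated content (UGC) arrives continuously and user interests evolve. The paper introduces StreamProfileBench, a large-scale benchmark that formalizes streaming user profiling as a continuous state maintenance task and curates a highly authentic dataset of over 120,000 UGC posts from more than 7,000 real users across five platforms. It also proposes an annotation-free evaluation framework that exploits the temporal correlation of user interests. Experiments across 14 leading LLMs show that continuous profile updating remains an open challenge: models exhibit a systemic conservative bias, over-retaining past interests while failing to recognize interest decay. Ablation experiments further support the practical utility and necessity of the streaming paradigm.

Link: https://arxiv.org/abs/2605.25758
Authors: Sizhe Wang, Feiyu Duan, Juelin Wang, Liwen Zhang, Feiyu Duan
Affiliations: Fudan University; Shanghai Innovation Institute; Shanghai University of Finance and Economics
Categories: Computation and Language (cs.CL)
Comments: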

Abstract:Large Language Models (LLMs) have reshaped user profiling, yet current evaluations mainly focus on static data snapshots. This paradigm overlooks the reality of personalized systems, where User-Generated Content (UGC) arrives continuously and fine-grained profiles evolve rapidly. To bridge this gap, we introduce StreamProfileBench, a large-scale benchmark for fine-grained streaming user profiling. We formalize streaming user profiling as a continuous state maintenance task and curate a highly authentic dataset comprising over 120,000 UGC posts from 7,000+ real users across five diverse platforms. By leveraging the temporal correlation of user interests, we further propose a novel, annotation-free evaluation framework. Extensive experiments across 14 leading LLMs reveal that continuous profile updating remains an open challenge. Models exhibit a systemic conservative bias, over-retaining past interests while failing to recognize interest decay. Ablation experiments further validate the practical utility and necessity of the streaming paradigm.

[NLP-40] Selective Latent Thinking: Adaptive Compression of LLM Reasoning Chains

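Quick read: Explicit chain-of-thought (CoT) reasoning makes inference with large language models (LLMs) expensive, while existing latent reasoning methods compress all intermediate steps uniformly and lose accuracy on precision-critical ones. The paper proposes Selective Latent Thinking (SLT): a lightweight decoder anticipates a short upcoming reasoning span, and confidence-based gating determines the longest span that can be reliably compressed. The accepted span is encoded into a compact latent representation for efficiency, whereas uncertain or precision-critical steps stay in explicit CoT form to preserve accuracy. SLT is trained in three stages (span-level latent compression, reliability-aware future reasoning prediction, and trajectory-level reinforcement learning) to optimize the trade-off between answer correctness and reasoning cost. On four mathematical reasoning benchmarks, SLT achieves 22.7% higher accuracy than latent reasoning baselines at comparable compression ratios, and shortens reasoning chains by 58.4% with only 2.8% accuracy degradation relative to explicit CoT.

Link: https://arxiv.org/abs/2605.25745
Authors: Hui Xie, Jie Liu, Ziyue Qiao, Joaquin Vanschoren
Affiliations: Eindhoven University of Technology; Great Bay University
Categories: Computation and Language (cs.CL)
Comments: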

Abstract:Explicit chain-of-thought (CoT) reasoning substantially improves the reasoning ability of large language models (LLMs), but incurs high inference cost due to lengthy autoregressive traces. Existing latent reasoning methods offer a promising alternative, yet they often treat reasoning as uniformly compressible, causing precision-critical intermediate steps to be overly compressed and thereby degrading reasoning accuracy. In this work, we propose Selective Latent Thinking (SLT), a framework that selectively compresses redundant reasoning spans into latent representations while preserving precision-critical spans as explicit CoT within the same reasoning trajectory. Specifically, SLT first uses a lightweight decoder to anticipate a short upcoming reasoning span, and then applies confidence-based gating to determine the longest span that can be reliably compressed. The accepted span is encoded into a compact latent representation to improve reasoning efficiency, while uncertain or precision-critical reasoning remains in explicit CoT form to preserve accuracy. To learn this selective compression policy, SLT adopts a three-stage training strategy that combines span-level latent compression, reliability-aware future reasoning prediction, and trajectory-level reinforcement learning to optimize the trade-off between answer correctness and reasoning cost. Extensive experiments across four mathematical reasoning benchmarks demonstrate that SLT achieves 22.7% higher accuracy than latent reasoning baselines at comparable compression ratios, while reducing reasoning chain length by 58.4% with only 2.8% accuracy degradation compared to explicit CoT. Our code can be found at this https URL.
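
The abstract does not spell out the gating rule, so the following is only a sketch of one plausible reading: given per-step confidences for the anticipated span, accept the longest prefix in which every step clears a threshold. The function name, the prefix rule, and the 0.9 threshold are all assumptions made for illustration.

```python
def longest_reliable_span(confidences, threshold=0.9):
    """Length of the longest prefix whose every step confidence meets the threshold.

    Assumed rule: steps inside the prefix would be compressed into a latent
    representation; everything after the first low-confidence step stays explicit.
    """
    length = 0
    for c in confidences:
        if c < threshold:
            break
        length += 1
    return length

# Hypothetical confidences for a four-step anticipated span.
print(longest_reliable_span([0.99, 0.95, 0.70, 0.98]))
```

A prefix rule is used here because a span is contiguous: a confident step after an uncertain one cannot be merged into the same compressed span.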

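[NLP-41] Trait-Aware Policy Optimization for Autoregressive Multi-Trait Essay Scoring

Quick read: This paper addresses the underexplored question of how to post-train autoregressive multi-trait essay scoring models effectively. It proposes Trait-Aware Policy Optimization (TAPO), a post-training framework that decomposes rewards along both the sample and trait dimensions, combining global scoring consistency, trait-level accuracy, format validity, and inter-trait dependency preservation. TAPO also strengthens the supervised fine-tuning stage with enhanced prompts, so that the model internalizes trait semantics before preference optimization. Experiments across multiple backbone models show consistent improvements over supervised fine-tuning and scalar-reward optimization baselines, indicating that trait-aware post-training is effective and transferable.

Link: https://arxiv.org/abs/2605.25731
Authors: Zhengyang Wang, Sanwoo Lee, Jiaxin Wang, Chenxi Miao, Weikang Li, Yunfang Wu
Affiliations: Peking University; Baidu Inc.
Categories: Computation and Language (cs.CL)
Comments: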

Abstract:Multi-trait essay scoring aims to provide fine-grained evaluation of writing quality across multiple dimensions. However, how to effectively post-train autoregressive scoring models remains underexplored. In this paper, we propose Trait-Aware Policy Optimization (TAPO), a post-training framework tailored to autoregressive multi-trait scoring. Our method decomposes rewards along both the sample and trait dimensions, combining global scoring consistency, trait-level accuracy, format validity, and inter-trait dependency preservation. In addition, we enhance supervised fine-tuning with enhanced prompts, allowing the model to internalize trait semantics before preference optimization. Experiments across multiple backbone models show that our method consistently improves multi-trait scoring performance over supervised fine-tuning and scalar-reward optimization baselines, demonstrating the effectiveness and transferability of trait-aware post-training for essay scoring.

[NLP-42] CMAP: Cross-Modal Adaptive Prompting for Multi-Domain Task-Incremental Learning

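Quick read: Multi-domain task-incremental learning (MTIL) requires a model to learn visually diverse domains sequentially without catastrophic forgetting and without access to task identity at inference. Existing parameter-efficient methods built on frozen vision-language models such as CLIP rely exclusively on visual features for task routing, confidence estimation, and encoder adaptation, leaving CLIP's cross-modal text embedding space unused. The paper makes three contributions: (1) text-space task routing, which replaces visual Gaussian matching with cosine similarity to frozen CLIP text prototypes, giving order-independent routing that is robust to data scarcity at zero parameter cost; (2) multi-prototype visual-textual confidence, which uses K-means visual prototypes and cross-modal alignment scores under task-calibrated thresholds; and (3) symmetric cross-modal gating, which extends per-layer Gumbel gates to the text encoder conditioned on batch image features, preserving cross-modal alignment on out-of-distribution inputs. On the MTIL benchmark the method reaches 74.2% Transfer, 80.5% Average, and 88.7% Last accuracy under Order-I, surpassing the prior state of the art by 5.0, 3.7, and 3.0 percentage points with only 2.5M trainable parameters and no external data.

Link: https://arxiv.org/abs/2605.25708
Authors: Sriram Mandalika
Affiliations: Hasso Plattner Institute
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Emerging Technologies (cs.ET)
Comments: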

Abstract:Multi-domain task-incremental learning requires a model to sequentially acquire knowledge across visually diverse domains without forgetting prior tasks, and without access to task identity at inference. Parameter-efficient methods built on frozen vision-language models have made strong progress, yet all existing approaches rely exclusively on visual features for task routing, confidence estimation, and encoder adaptation, leaving CLIP’s cross-modal text embedding space entirely unexploited. We address this gap through three contributions. Text-space task routing replaces visual Gaussian matching with cosine similarity to frozen CLIP text prototypes, giving order-independent routing robust to data scarcity at zero parameter cost. Multi-prototype visual-textual confidence replaces single-Gaussian class modeling with K-means visual prototypes and cross-modal alignment scores under task-calibrated thresholds. Symmetric cross-modal gating extends per-layer Gumbel gates to the text encoder conditioned on batch image features, preserving cross-modal alignment on out-of-distribution inputs. On the MTIL benchmark spanning 11 datasets and 1201 classes, our method achieves 74.2% Transfer, 80.5% Average, and 88.7% Last under Order-I, surpassing the prior state of the art by 5.0, 3.7, and 3.0 percentage points with only 2.5M trainable parameters and no external data.
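
Text-space task routing, as described above, picks a task by cosine similarity between an image feature and frozen per-task text prototypes. The sketch below illustrates that idea with small hand-written vectors standing in for CLIP embeddings; the function name `route_task` and the two-dimensional toy prototypes are assumptions for illustration only.

```python
import numpy as np

def route_task(image_feature, text_prototypes):
    """Return the index of the text prototype most similar to the image feature.

    image_feature: (d,) vector; text_prototypes: (num_tasks, d) matrix.
    Both are placeholders for frozen CLIP embeddings.
    """
    img = image_feature / np.linalg.norm(image_feature)
    protos = text_prototypes / np.linalg.norm(text_prototypes, axis=1, keepdims=True)
    sims = protos @ img  # cosine similarity to each task prototype
    return int(np.argmax(sims)), sims

# Toy prototypes for two tasks and one image feature closer to the second.
task, sims = route_task(np.array([0.2, 0.9]), np.eye(2))
print(task, sims)
```

Because the prototypes are frozen and the decision is an argmax over similarities, routing of this kind adds no trainable parameters and does not depend on the order in which tasks were learned.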

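[NLP-43] PowLU: An Activation Function for Stable Pre-Training of LLMs

Quick read: The Swish-Gated Linear Unit (SwiGLU) activation widely used in large language models (LLMs) approximates a quadratic function for large positive inputs. This amplification enlarges the output range and exacerbates outliers, causing numerical instability as the input or model scale grows, particularly in low-precision training. The paper proposes the Power Linear Unit (PowLU), a stable activation that uses a rational power function to achieve adaptive nonlinearity, keeping strong representation ability while enabling stable training in spike regions. The authors provide theoretical justification for several key properties of PowLU. Scaling-law experiments indicate consistent performance across model sizes, and experiments with the Ling architecture (7.9B and 124B total parameters) show that PowLU is competitive with SwiGLU and SwiGLU-Clip in large-scale pre-training while improving the scalability of LLM training.

Link: https://arxiv.org/abs/2605.25704
Authors: Peijie Jiang, Yuqi Feng, Cunyin Peng, Qian Zhao, Jia Liu, KunLong Chen, Zhiqiang Zhang, Jun Zhou
Affiliations: Ant Group
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 17 pages, 7 figures, techreport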

Abstract:In contemporary large language models (LLMs), the swish-gated linear unit (SwiGLU) activation function is widely adopted to regulate the information flow and introduce non-linearity. For large positive inputs, SwiGLU approximates the quadratic function $x^2$, providing strong nonlinearity and expressive capacity. However, this property also causes numerical instability as the input or model scale increases, particularly in low-precision LLM training. The main reason is its approximate quadratic amplification, which enlarges the output range and exacerbates outliers. To address this issue, we propose a stable activation function, Power Linear Unit (PowLU), for large-scale LLM pre-training. Specifically, PowLU employs a rational power function to achieve adaptive nonlinearity, thereby improving representation ability and enabling stable training in spike regions. Moreover, we provide theoretical justification for several key properties of PowLU. Scaling law experiments confirm that the performance is consistent across model sizes, and further experimental results with the Ling architecture (7.9B and 124B total parameters) demonstrate that PowLU achieves competitive results against SwiGLU and SwiGLU-Clip in large-scale training of LLMs. In addition, the experimental results also show that PowLU effectively improves the scalability of the large-scale training of LLMs.
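
The quadratic behaviour mentioned above can be checked numerically. The sketch below uses a scalar simplification of SwiGLU in which both projections are the identity, so the unit reduces to swish(x) * x = x^2 * sigmoid(x); this simplification is an assumption made for illustration, and PowLU itself is not reproduced because its formula is not given in the abstract.

```python
import math

def swish(x):
    """Swish / SiLU: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def scalar_swiglu(x):
    """Scalar SwiGLU with identity gate and value projections (simplifying assumption)."""
    return swish(x) * x

# The ratio to x^2 equals sigmoid(x), which approaches 1 as x grows.
for x in (1.0, 4.0, 16.0, 64.0):
    print(x, scalar_swiglu(x), scalar_swiglu(x) / x**2)
```

The printed ratio is exactly sigmoid(x), which is why the output grows like x^2 once inputs are a few units above zero.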

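[NLP-44] Testing the Deliteralization Hypothesis in Human and Machine Translation

Quick read: With the shift from dedicated neural machine translation (NMT) systems to general-purpose large language models (LLMs), this paper asks whether the change extends to the deliteralization hypothesis from translation studies, the claim that translations become progressively less literal as they are drafted and revised. Using the WMT24++ dataset and a validated Synthetic Literality Index built from six heuristics, the authors compare the literality of human translations and post-editions with that of two NMT systems and six LLMs across 54 language pairs and three tasks: direct translation, iterative self-revision, and post-editing of human drafts. They find that (i) human translations remain significantly less literal than those of all tested MT systems, although recent LLMs narrow the gap; (ii) LLMs deliteralize monotonically when prompted to revise their own output iteratively, the first evidence that the hypothesis applies natively to LLM generation; and (iii) as post-editors, LLMs invert the revision triggers of human post-editors, tolerating literal drafts and targeting idiomatic human formulations for revision.

Link: https://arxiv.org/abs/2605.25686
Authors: Malik Marmonier, Rachel Bawden, Benoît Sagot
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: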

Abstract:The recent shift from dedicated NMT systems to general-purpose LLMs has reshaped machine translation, with LLMs reported to produce more fluent, less literal output than their predecessors. We test whether this shift extends to the deliteralization hypothesis, the long-standing claim from translation studies that translations become progressively less literal as they are drafted and revised. Using the WMT24++ dataset, we compare the literality of human translations and post-editions to that of two NMT systems and six LLMs across 54 language pairs and three tasks: direct translation, iterative self-revision, and post-editing of human drafts. Literality is measured via a validated Synthetic Literality Index built from six heuristics. We find that (i) human translations remain significantly less literal than those of all tested MT systems, though recent LLMs narrow the gap; (ii) when prompted to iteratively revise their own output, LLMs deliteralize monotonically, providing the first evidence that the hypothesis applies natively to LLM generation; and (iii) as post-editors, LLMs invert the revision triggers of human post-editors, tolerating literal drafts and targeting idiomatic human formulations for revision.

[NLP-45] Simulating Human Memory with Language Models

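Quick read: Language models deployed as user simulators remember far more reliably than real users, which makes the simulation unrealistic. The authors run a series of classic memory experiments from psychology on both humans and language models and find that out-of-the-box models show better memory than humans, even when prompted to imitate human behavior. They then show that better prompting strategies and the use of a compactor make models forget content in a more human-like way. Preliminary evidence suggests that language models with such human-like memory constraints can serve as more effective user simulators in a downstream education task. The paper also releases human reference data and benchmarks.

Link: https://arxiv.org/abs/2605.25680
Authors: Qihan Wang, Nicholas Tomlin, Michael Hu, Brian Dillon, Tal Linzen
Affiliations: NYU; UMass Amherst
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: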

Abstract:Language models are increasingly being deployed as user simulators, but their memory is far more reliable than that of real users. To measure this gap, we run a series of classic memory experiments from psychology on both humans and language models. Across tasks, we find that out-of-the-box language models exhibit better memory than humans, even when prompted to imitate human behavior. We then show that better prompting strategies and the use of a compactor can cause language models to forget content in a more human-like way. Using these methods, we show preliminary evidence that language models with human-like memory constraints can function as more effective user simulators in a downstream education task. Finally, we release human reference data and benchmarks to support future work on simulating human memory with language models.

[NLP-46] Llamion Technical Report

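Quick read: This report addresses how to convert a language model with a non-standard architecture (Orion-14B) into the standardized Llama-family architecture efficiently while preserving its performance and capabilities. The conversion recipe, Efficient Knowledge Preservation for Transformation (KEPT), has three components: (i) Normal Parameter Mapping (NPM) for modules that need no change; (ii) Optimized Parameter Mapping (OPM), a training-free LayerNorm-to-RMSNorm initialization that the authors prove optimal under the near-zero-mean activation regime induced by weight decay; and (iii) Cross-architecture Knowledge Distillation (XKD), an equal-size frozen-teacher distillation that aligns the converted model's outputs with the source model's on any reasonable input distribution. Llamion recovers Orion's behaviour on H6, MT-Bench, and KoMMLU with only about 123M tokens on a single A100 in four days, and Llamion-Base reaches 66.87% on KoMMLU, 7.0 absolute points above the next-best Open Ko LLM Leaderboard entry at submission time. Capabilities absent from the transfer corpus, such as Python programming and 200K-token context handling, survive the conversion intact.

Link: https://arxiv.org/abs/2605.25676
Authors: Kisu Yang, Yoonna Jang, Hyeonseok Moon, Hwanseok Jang, Taewoo Lee, Hyungjin Lee, Jeseung Lee, Juhyoung Park, Heuiseok Lim
Affiliations: VAIV Company; Korea University; University of Copenhagen; Samsung Electronics
Categories: Computation and Language (cs.CL)
Comments: Research conducted in 2024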

Abstract:We release Llamion, a family of 14B-parameter open-weight language models obtained by transforming Orion-14B into the standardized Llama-family architecture. The transformation is performed by Efficient Knowledge Preservation for Transformation (KEPT), a recipe that combines (i) Normal Parameter Mapping (NPM) for unchanged modules, (ii) Optimized Parameter Mapping (OPM), a training-free LayerNorm-to-RMSNorm initialization we prove optimal under the near-zero-mean activation regime induced by weight decay, and (iii) Cross-architecture Knowledge Distillation (XKD), an equal-size frozen-teacher distillation that aligns the converted model’s outputs with the source model’s on any reasonable input distribution. Llamion recovers Orion’s behaviour on H6, MT-Bench, and KoMMLU with only ~123M tokens on a single A100 in four days; Llamion-Base reaches 66.87% on KoMMLU, exceeding the next-best entry of the Open Ko LLM Leaderboard by 7.0 absolute points at submission time. Capabilities entirely absent from the transfer corpus (Python programming and 200K-token context handling) survive the architectural transition intact. We release three checkpoints (Base, Chat, LongChat) that load with trust_remote_code=False in the Hugging Face Transformers library.
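
The role of the near-zero-mean regime can be illustrated numerically: when activations have zero mean, the variance equals the mean square, so LayerNorm without bias and RMSNorm with the same gain produce the same output. The sketch below demonstrates only this identity. It is not the paper's OPM procedure; the zero bias, the forced zero-mean input, and the direct reuse of `gamma` as the RMSNorm gain are assumptions made for illustration.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    """LayerNorm over the last axis with gain gamma and bias beta."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)  # population variance (ddof=0)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm over the last axis with a learned gain and no bias."""
    return x / np.sqrt((x ** 2).mean(-1, keepdims=True) + eps) * gain

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 64))
x -= x.mean(-1, keepdims=True)   # force zero-mean activations (assumption)
gamma = rng.normal(size=64)
beta = np.zeros(64)              # assume the LayerNorm bias is zero

# With zero mean and zero bias, copying gamma into RMSNorm reproduces LayerNorm.
max_gap = np.abs(layer_norm(x, gamma, beta) - rms_norm(x, gamma)).max()
print(max_gap)
```

When the mean is only approximately zero, the two outputs differ by a small amount, which is presumably where an optimized initialization and the subsequent distillation step come in.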

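[NLP-47] AutoSG: LLM-Driven Solver Generation Solely from Task Prompts for Expensive Optimization

Quick read: LLM-driven automated solver generation faces three critical issues on expensive optimization tasks: factual hallucinations caused by deficient domain knowledge, frequent dismantling of previously established locally optimal structures during refinement, and prohibitive evaluation costs together with restricted generalization caused by executing on training instances. The paper introduces AutoSG, a fully automated workflow that translates natural-language prompts directly into executable customized solvers. It has three core components: a retrieval-augmented solver generation module that grounds code strictly in verified literature; a one-step self-refinement operator that introduces task-specific improvements while preserving critical structural components; and an instance-free, Elo-based LLM-as-a-Judge evaluation mechanism that rapidly establishes global rankings. Evaluations across diverse expensive optimization tasks indicate that AutoSG significantly outperforms human-designed state-of-the-art frameworks and existing LLM-generated solvers.

Link: https://arxiv.org/abs/2605.25658
Authors: Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang
Affiliations: Xidian University; Victoria University of Wellington
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: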

点击查看摘要

Abstract:Expensive optimization tasks are ubiquitous in real-world applications, demanding highly specialized solvers. While LLM-driven automated solver generation shows promise, current paradigms face three critical issues when tackling expensive optimization: factual hallucinations due to deficient domain knowledge, the frequent dismantling of previously established locally optimal structures during refinement, and the prohibitive evaluation costs alongside restricted generalization caused by executing on training instances. To address these issues, we introduce AutoSG, a fully automated workflow directly translating natural language prompts into executable customized solvers. AutoSG features three core innovations: a retrieval-augmented solver generation module strictly grounding code in verified literature; a one-step self-refinement operator introducing task-specific improvements while preserving critical structural components; and an instance-free Elo-based LLM-as-a-Judge evaluation mechanism rapidly establishing global rankings. Extensive evaluations across diverse expensive optimization tasks confirm AutoSG significantly outperforms human-designed state-of-the-art frameworks and existing LLM-generated solvers.
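
An Elo-based ranking turns pairwise verdicts into a global ordering without running the candidates on problem instances. The sketch below shows the standard Elo update only; how AutoSG obtains verdicts from its LLM judge, and which K-factor or initial rating it uses, are not stated in the abstract, so the values here (K = 32, initial rating 1500) are conventional defaults chosen as assumptions.

```python
def expected_score(r_a, r_b):
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, k=32.0):
    """Update both ratings after one comparison.

    score_a is 1.0 if the judge prefers A, 0.0 if it prefers B, 0.5 for a tie.
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Two hypothetical solvers start level; a judge verdict favours the first.
print(elo_update(1500.0, 1500.0, 1.0))
```

The update is zero-sum, so the total rating mass stays constant as more pairwise verdicts are folded in.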

[NLP-48] A Two-Phase Stability Study of LLM Judges and Bar Council Examiners on Thai Bar-Exam Free-Form Essays

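Quick read: Free-form legal essay evaluation in NLP usually treats expert inter-rater stability as a single ceiling number and treats LLM-judge agreement with that ceiling as evidence of judge stability. This paper tests whether those assumptions hold, especially when several reasonable readings of the grading rubric exist.

The authors apply an identical-inputs protocol to the Thai bar examination: three Bar Council-trained examiners (A, B, C) and a panel of 26 LLM judges score the same 15 cross-graded answers from the same four inputs. The result is asymmetric. On the 10 cells where the rubric prescribes both axes, all 29 raters converge in a tight band. On the remaining 5 cells, where the rubric does not say how to grade a correct final answer that omits a decisive statutory citation, the human panel splits into two coherent readings (the B/C majority at the upper band, score 6-8; the A minority at the lower band, score 1-2). The LLM panel does not split symmetrically: 22 of 26 LLMs score in or near the B/C band, 3 sit in the regulation-silent middle gap, and only 1 approaches A's band without consistently scoring within it. LLM judges therefore do not reproduce human diversity but converge systematically on the majority reading, so a benchmark that selects its LLM judge by maximising agreement with a human reference panel inherits this asymmetry by construction.

Link: https://arxiv.org/abs/2605.25652
Authors: Pawitsapak Akarajaradwong, Wuttikrai Lertprasertphakorn, Chompakorn Chaksangchaichot, Sarana Nutanong
Affiliations: VISAI AI; Vidyasirimedhi Institute of Science and Technology
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: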

Abstract:Free-form legal essay evaluation in NLP treats expert inter-rater stability as a single ceiling number, and treats LLM-judge agreement with that ceiling as evidence of judge stability. We test both assumptions on the Thai bar examination through an identical-inputs protocol: three Bar Council-trained examiners (A, B, C) and a 26-LLM judge panel score the same 15 cross-graded answers from the same four inputs (question, official Bar Council grading regulation, gold answer, candidate answer). The headline finding is asymmetric. On 10 of 15 cells where the rubric prescribes both axes, all 29 raters converge in a tight band: panel agreement is universal. On the remaining 5 cells where the rubric does not prescribe how to grade a correct final answer that omits a decisive statutory citation, the human panel splits between two coherent readings (B/C majority at the upper rubric band, score 6-8; A minority at the lower band, score 1-2). The LLM judge population does not split symmetrically: 22 of 26 LLMs score in or near B/C’s contested band, 3 sit in the regulation-silent middle gap, and only 1 (GPT-5.4 Nano) approaches A’s band without consistently scoring within it. Zero LLMs in our 26-judge panel reproduce the minority human reading on the contested cells. The B/C-direction cluster spans every model size, vendor, and price tier we tested. An instrumented three-LLM anchor sub-panel (Claude 4.6 Opus, Gemini 3.1 Pro, GPT-5.4 Pro) carries determinism probes, input ablations, and bootstrap CIs, and reaches anchor panel $\alpha = 0.77$ on the 15 cells against human-panel $\alpha = 0.36$. The high LLM-panel $\alpha$ reflects systematic convergence on the majority reading rather than balanced reproduction of both readings; a benchmark that selects its LLM judge by maximising agreement with a human reference panel will inherit this asymmetry by construction.

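[NLP-49] Iterate Until Retrieved: Factual Nugget Optimization for Discoverable Continual Corrections in Agentic RAG

Quick read: Agentic retrieval-augmented generation (RAG) systems in complex B2B (business-to-business) settings often receive free-form response feedback. Instead of generic signals such as style, preference, or overall response quality, this paper focuses on actionable factual corrections, which it identifies and converts into compact knowledge-base entries called factual nuggets. It introduces Iterative Nugget Optimization (INO), an index-time optimization method that uses the production agentic RAG system as a test harness: it creates an initial nugget, probes it with the triggering query and paraphrases, reflects over failed retrieval and answer traces, and revises the nugget until it is discoverable. INO is evaluated with two production B2B knowledge-assistance agents deployed across multiple companies, a product support agent and a support ticket agent, and consistently improves the discoverability and usage of factual corrections over baselines in both automated and human evaluations.

Link: https://arxiv.org/abs/2605.25641
Authors: Moshe Hazoom, Gal Patel, Alon Talmor, Tom Hope
Affiliations: Mosaic AI; The Hebrew University of Jerusalem
Categories: Computation and Language (cs.CL)
Comments: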

Abstract:Agentic retrieval-augmented generation (RAG) systems in complex B2B (business-to-business) settings may often receive free-form response feedback. Rather than generic feedback signals such as style, preference, or overall response quality, we focus on actionable factual corrections. We identify these instances and convert them into compact knowledge-base entries, which we call factual nuggets. We introduce Iterative Nugget Optimization (INO), an index-time optimization method that uses the production agentic RAG as a test harness: it creates an initial nugget, probes it with the triggering query and paraphrases, reflects over failed retrieval and answer traces, and revises the nugget until it is discoverable. We evaluate INO with two production B2B knowledge-assistance agents across multiple companies that use our system: a product support agent that answers questions over company-specific knowledge bases, and a support ticket agent that assists support engineers. INO consistently improves results over baselines in terms of discoverability and usage of factual corrections, in automated and human evaluations.

[NLP-50] Reinforcement Learning from Denoising Feedback

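Quick read: Policy loss estimation is a fundamental, long-standing challenge in reinforcement learning (RL) for diffusion language models (dLLMs). The paper introduces Reinforcement Learning from Denoising Feedback (RLDF), a training paradigm that uses feedback obtained from rollout and training processes to make policy loss estimation more accurate and efficient. To balance computational efficiency against estimation quality, RLDF optimizes the model toward the clipped clean state $\hat{x}_0$ from intermediate noisy states $x_t$, combined with weighted timestep sampling over $t$. Experiments on two representative dLLM architectures, LLaDA and Dream, show consistent and substantial gains in performance and generalizability on multiple reasoning benchmarks, laying a principled foundation for scalable RL in diffusion language models.

Link: https://arxiv.org/abs/2605.25638
Authors: Qi He, Huan Chen, Ya Guo, Huijia Zhu, Yi R. Fung, Baojian Zhou
Affiliations: Fudan University; Ant Group; Hong Kong University of Science and Technology
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: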

Abstract:Policy loss estimation remains a fundamental and long-standing challenge in reinforcement learning (RL) for diffusion language models (dLLMs). We introduce Reinforcement Learning from Denoising Feedback (RLDF), a novel training paradigm that leverages feedback obtained from rollout and training processes to facilitate accurate and efficient policy loss estimation. To balance the trade-off between computational efficiency and estimation effectiveness, RLDF optimizes the model toward the clipped clean state $\hat{x}_0$ from intermediate noisy states $x_t$, combined with weighted timestep sampling over $t$. Extensive experiments demonstrate that RLDF achieves consistent and substantial improvements in both performance and generalizability across two representative dLLM architectures, LLaDA and Dream, on multiple reasoning benchmarks. Our work lays a principled foundation for scalable reinforcement learning in diffusion language models. We build Drift, a training framework for dLLMs, available at this https URL.

[NLP-51] When In-Distribution Gains Fail: Evaluating Weak-to-Strong Reward Models under Preference Shift

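Quick read: This paper exposes a hidden brittleness in weak-to-strong (W2S) preference learning: strong students trained on weak preference labels can look successful in-distribution yet fail to transfer across preference datasets under zero-shot distribution shift. The authors present evidence for a representational failure mode in which weakly supervised fine-tuning pulls the strong model toward source-domain features. To mitigate this, they propose Representation Anchoring (Anchor), a regularizer that constrains excessive drift from the pretrained strong model's representation space during fine-tuning while still allowing task-relevant adaptation. Across preference domains, datasets, and model families, Anchor consistently improves out-of-distribution transfer while maintaining competitive in-distribution performance.

Link: https://arxiv.org/abs/2605.25629
Authors: Khoi Le, Tri Cao, Phong Nguyen, Cong-Duy Nguyen, Anh Tuan Luu, Miao Chunyan, See-Kiong Ng, Thong Nguyen
Affiliations: National University of Singapore; VinUniversity; Nanyang Technological University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Code: this https URL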

Abstract:Weak-to-strong (W2S) generalization is a promising framework for scalable oversight, yet existing evaluations often test students under matched train–test distributions. Therefore, we study W2S preference learning under zero-shot distribution shift and find that strong students trained on weak preference labels can appear successful in-distribution while failing to transfer across preference datasets. We provide evidence for a representational failure mode in which weak-supervised fine-tuning can pull the strong model toward source-domain features instead of maintaining broadly transferable preference representations. To mitigate this, we propose Representation Anchoring (Anchor), a simple yet effective regularizer that constrains excessive drift from the pretrained strong model’s representation space during fine-tuning, while still allowing task-relevant adaptation. Across preference domains, datasets, and model families, Anchor consistently improves out-of-distribution transfer while maintaining competitive in-distribution performance. Together, our evaluation protocol, transfer-aware metrics, and method expose hidden brittleness in current W2S reward modeling and provide a practical path toward more robust preference transfer.
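
The abstract describes Anchor only as a regularizer that limits drift from the pretrained representation space, without giving its form. One common way to express such a constraint is a squared-distance penalty between the fine-tuned and pretrained hidden states, added to the task loss. The sketch below follows that assumption; the L2 form, the weight `lam`, and the function name are hypothetical and may differ from the paper's actual regularizer.

```python
import numpy as np

def anchored_loss(task_loss, h_tuned, h_pretrained, lam=0.1):
    """Task loss plus an assumed L2 penalty on representation drift.

    h_tuned, h_pretrained: (batch, hidden) representations of the same inputs
    from the model being fine-tuned and from the frozen pretrained model.
    """
    drift = np.mean(np.sum((h_tuned - h_pretrained) ** 2, axis=-1))
    return task_loss + lam * drift

# Toy representations: every coordinate has moved by 1 in a 3-dim space.
h_pre = np.zeros((2, 3))
h_new = np.ones((2, 3))
print(anchored_loss(1.0, h_new, h_pre))
```

Setting `lam` to zero recovers plain fine-tuning, while a large `lam` keeps the representations close to the pretrained ones at the cost of task adaptation.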

[NLP-52] Beyond Literal Translation: Evaluating Cultural Effectiveness in Social Media UGC ICML2026

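Quick read: Translating social media user-generated content (UGC) across languages is hard because of its informal style, cultural references, and interaction-based expressions, and existing benchmarks and metrics often fail to capture whether a translation conveys the intended meaning and cultural resonance. The paper introduces CULTURE-MT, a benchmark of 1,002 UGC notes across 14 domains, categorized into four types by culture-loaded symbols and linguistic style features, and constructs UGC-oriented training data to fine-tune Qwen3-8B and Qwen3-32B as baselines. It proposes cultural effectiveness as a new evaluation criterion focused on expression accuracy and cultural adaptability. Tests on 15 models, including the baselines, show that traditional metrics fail to capture cultural effectiveness and that cultural effectiveness on base LLMs correlates with model size. The authors release the benchmark and an online leaderboard where submitted translations are evaluated by their trained JUDGER.

Link: https://arxiv.org/abs/2605.25626
Authors: Linjuan Wu, Ruiqi Zhang, Xinze Lyu, Ye Guo, Daoxin Zhang, Zhe Xu, Yao Hu, Yixin Cao, Yongliang Shen, Weiming Lu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted by ICML2026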

Abstract:Social media platforms enable large-scale cross-lingual communication, but translating user-generated content (UGC) remains challenging due to its informal style, cultural references, and interaction-based expressions. While recent LLMs have improved translation quality, existing benchmarks and metrics often fail to capture whether translations convey intended meaning and cultural resonance in real-world settings. In this work, we introduce CULTURE-MT, a benchmark for social media translation that focuses on both CULtural Transmission and UGC-specific emotion REsonance. CULTURE-MT consists of 1,002 UGC notes across 14 domains, categorized into four types based on culture-loaded symbols and linguistic style features. We also construct UGC-oriented training data to fine-tune Qwen3-8B and Qwen3-32B as baselines. We propose cultural effectiveness as a new evaluation criterion, focusing on expression accuracy and cultural adaptability. Testing 15 models, including the baselines, we find that traditional metrics fail to capture cultural effectiveness. We also observe that cultural effectiveness on base LLMs correlates with model size. Our work provides a comprehensive evaluation system for UGC translation models and will offer an open evaluation platform to advance research in this area. We release the CULTURE-MT benchmark and provide an online leaderboard where submitted translation results can be evaluated by our trained JUDGER.

[NLP-53] DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning

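Quick read: This paper addresses training instability and optimization inefficiency when aligning large language models with reinforcement learning under multiple rewards. Standard scalarization practices have clear drawbacks: Reward Combination frequently generates advantages with excessively large squared magnitudes that destabilize training, while Advantage Combination relies on static hyperparameters and ignores cross-objective correlations. The proposed Dynamic Variance-adaptive Advantage Optimization (DVAO) adjusts combination weights dynamically according to the empirical reward variance of each objective within a rollout group, up-weighting objectives with a stronger learning signal and suppressing noisy ones. The authors prove that DVAO keeps advantage magnitudes bounded for stable training and introduces a self-adaptive cross-objective regularization mechanism. Experiments on mathematical reasoning and tool-use benchmarks with Qwen3 and Qwen2.5 models show that DVAO significantly outperforms baseline methods, achieving a superior multi-objective Pareto frontier and robust training stability.

Link: https://arxiv.org/abs/2605.25604
Authors: Guochao Jiang, Jingyi Song, Guofeng Quan, Chuzhan Hao, Guohua Liu, Yuewei Zhang
Affiliations: Alibaba Cloud Computing
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: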

Abstract:Reinforcement Learning has become a standard paradigm for aligning Large Language Models with human intent and task requirements. While Group Relative Policy Optimization offers an efficient, value-model-free alternative to Proximal Policy Optimization, adapting it to real-world multi-reward settings remains challenging. Standard scalarization practices, such as Reward Combination and Advantage Combination, suffer from significant drawbacks: Reward Combination frequently generates advantages with excessively large squared magnitudes that lead to training instability, while Advantage Combination relies on static hyperparameters and ignores cross-objective correlations. To address these limitations, we propose Dynamic Variance-adaptive Advantage Optimization (DVAO), which dynamically adjusts combination weights based on the empirical reward variance of each objective within a rollout group, effectively up-weighting objectives with a stronger learning signal while suppressing noisy ones. We mathematically prove that DVAO maintains bounded advantage magnitudes for stable training and introduces a self-adaptive cross-objective regularization mechanism. Extensive experiments on mathematical reasoning and tool-use benchmarks using Qwen3 and Qwen2.5 models demonstrate that DVAO significantly outperforms baseline methods, achieving a superior multi-objective Pareto frontier and robust training stability.
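
The abstract says the combination weights depend on each objective's reward variance within a rollout group but gives neither the formula nor the direction of the dependence. The sketch below assumes the simplest reading: advantages are normalized per objective in the GRPO style, and each objective's weight is proportional to its within-group variance, so an objective on which every rollout scores the same (no learning signal) gets zero weight. All names and the proportional rule are assumptions for illustration, not the DVAO definition.

```python
import numpy as np

def variance_weighted_advantage(rewards, eps=1e-8):
    """Combine per-objective advantages with assumed variance-proportional weights.

    rewards: (group_size, num_objectives) rewards for one rollout group.
    Returns the combined advantage per rollout and the weights used.
    """
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean(axis=0)
    std = rewards.std(axis=0)
    per_obj_adv = (rewards - mean) / (std + eps)  # group-relative normalization
    var = std ** 2
    weights = var / (var.sum() + eps)             # assumed: more variance, more weight
    return per_obj_adv @ weights, weights

# Four rollouts, two objectives; the second objective is constant in this group.
adv, w = variance_weighted_advantage([[1, 0.5], [0, 0.5], [1, 0.5], [0, 0.5]])
print(adv, w)
```

Because each normalized per-objective advantage has unit scale and the weights sum to at most one, the combined advantage in this sketch stays bounded, which matches the spirit of the stability claim without reproducing the paper's proof.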

[NLP-54] Toward a Benchmark for Controllable Simulation of Imperfect Students with Large Language Models

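Quick read: This paper addresses the lack of opportunities in teacher education to practice with learners who have a well-defined skill profile (strengths, weaknesses, and partial mastery). Conventional training struggles to reproduce realistic student states, whereas large language models (LLMs) could support teacher practice by simulating students with known skill components. The key to the solution is controllable steering of model behavior so that the model retains some skills while suppressing others, simulating "selective partial mastery". The authors propose a benchmark-oriented framework: an explicit skill vector represents the simulated student's knowledge state, prompt-based control specifies retained and missing competencies, and behavior is evaluated with profile-alignment metrics, retained-versus-forgotten comparisons, and cross-skill calibration analyses. Experiments show that selective partial mastery can be induced and measured in a structured mathematics setting, although the degree of controllability is model-dependent. The work positions controllable learner simulation as a distinct research problem at the intersection of teacher education, educational simulation, and language-model control.

Link: https://arxiv.org/abs/2605.25601
Authors: Alexander Apartsin, Omri Sason, Yehudit Aperstein
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 22 pages, 7 figures

Click to view abstract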

Abstract:Teacher education requires deliberate practice with learners who exhibit identifiable strengths, weaknesses, and partial mastery. Large language models could support such practice by simulating students with known skill components, enabling teachers to rehearse explanations, diagnoses, and instructional responses. For this purpose, however, the central requirement is neither to maximize benchmark accuracy nor to suppress isolated facts, but to control model behavior so that it reflects a specified skill profile. This paper investigates whether prompted language models can be steered to retain some skills while suppressing others. We introduce a benchmark-oriented framework in which an explicit skill vector represents a simulated student, prompt-based control specifies retained and missing competencies, and behavior is evaluated using profile-alignment metrics, retained-versus-forgotten comparisons, and cross-skill calibration analyses. The results show that selective partial mastery can be induced and measured in a structured mathematics setting, although the degree of controllability remains model-dependent. These findings position controllable learner simulation as a distinct research problem at the intersection of teacher education, educational simulation, and language-model control.

[NLP-55] Multilingual Phonological Feature Recognition with Self-Supervised Speech Models INTERSPEECH2026

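Quick read: This paper asks how to build a language-general, linguistically grounded, frame-level phonological feature representation of speech in a multilingual setting, replacing feature extraction that depends on phoneme outputs. Existing approaches typically derive phonological features from phoneme sequences, which the summary associates with weak cross-lingual generalization and a lack of structural consistency. The key to the solution is PhonoQ-2.0, whose main contributions are: (1) directly predicting a structured 22-dimensional phonological feature vector per frame (encoding manner, vowel quality, place, and voicing) instead of deriving features from phonemes; and (2) a manner-conditioned gating mechanism that keeps predictions phonologically coherent by activating only the feature groups valid for the current manner. Across multiple languages and corpora, the system reaches an average macro-F1 of 91.3% in-domain and 88.9% out-of-domain, with average gains of +8.8 and +8.6 F1 over a strong CTC phoneme baseline. In unseen-language evaluation, macro-F1 improves from 66.9% to 73.6% (+6.7 on average), with gains of up to +10.8 points.

Link: https://arxiv.org/abs/2605.25596
Authors: Abner Hernandez, Tomás Arias-Vergara, Daiqi Liu, Andreas Maier, Paula Andrea Pérez-Toro
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Submitted to Interspeech 2026

Click to view abstract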

Abstract:Phonological features provide a language-general and linguistically grounded representation of speech. We present PhonoQ-2.0, a multilingual frame-level phonological feature recognizer built on self-supervised speech models. The system directly predicts a structured 22-dimensional feature vector per frame encoding manner, vowel quality, place, and voicing, instead of deriving features from phoneme outputs. To ensure phonologically coherent predictions, we introduce a manner-conditioned gating mechanism that activates valid feature groups. Evaluated across multiple languages and corpora, PhonoQ-2.0 achieves an average macro-F1 of 91.3% in-domain and 88.9% out-of-domain. Compared to a strong CTC phoneme baseline, it delivers consistent gains of +8.8 F1 in-domain and +8.6 out-of-domain on average. In unseen-language evaluation, PhonoQ-2.0 improves macro-F1 from 66.9% to 73.6% (+6.7 on average), with gains of up to +10.8 points.

[NLP-56] PennySynth: RAG-Driven Data Synthesis for Automated Quantum Code Generation

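Quick read: This paper addresses severe hallucination by general-purpose large language models (LLMs) on quantum programming tasks: they invent PennyLane-specific gate names, misplace device configurations, and produce structurally invalid circuits, which limits their usefulness for specialized quantum coding. The key to the solution is PennySynth, a retrieval-augmented generation (RAG) framework that conditions LLM inference on a curated knowledge base of 13,389 PennyLane instruction-code pairs and uses a code-aware embedding strategy based on st-codesearch-distilroberta-base, a model trained for natural-language-to-code retrieval. This raises average retrieval cosine similarity from 0.45 to 0.726 relative to a general-purpose baseline. Across 74 challenges from QHack 2022, 2023, and 2024, PennySynth reaches 64%, 68%, and 52% pass@5, improving over Claude Sonnet 4.6 without retrieval by 28, 25, and 28 percentage points. A quantum-adapted CodeBLEU metric further shows that structural code similarity and functional correctness capture distinct aspects of quantum code quality.

Link: https://arxiv.org/abs/2605.25572
Authors: Minghao Shao, Nouhaila Innan, Hariharan Janardhanan, Muhammad Kashif, Alberto Marchisio, Muhammad Shafique
Affiliations: New York University Abu Dhabi; NYUAD Center for Quantum and Topological Systems; Center for CyberSecurity
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 3 figures

Click to view abstract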

Abstract:The growing complexity of quantum programming frameworks has exposed a critical limitation in existing large language model (LLM)-based code assistants: general-purpose models hallucinate PennyLane-specific gate names, misplace device configurations, and produce structurally invalid circuits when faced with specialized quantum coding challenges. We present PennySynth, a retrieval-augmented generation framework that addresses this gap by conditioning LLM inference on a curated knowledge base of 13,389 PennyLane instruction-code pairs, built via a three-stage extraction, verification, and deduplication pipeline over official PennyLane repositories, community GitHub sources, and QHack competition archives. PennySynth introduces a code-aware embedding strategy using st-codesearch-distilroberta-base, trained for natural-language-to-code retrieval, increasing average retrieval cosine similarity from 0.45 to 0.726 compared to a general-purpose baseline. Evaluated across 74 challenges spanning three years of the QHack competition (2022, 2023, 2024), PennySynth achieves 64%, 68%, and 52% pass@5 on QHack 2022, 2023, and 2024, respectively, improving over Claude Sonnet 4.6 without retrieval by +28, +25, and +28 percentage points. We further introduce a quantum-adapted CodeBLEU metric that upweights qml.* token patterns and show that structural code similarity and functional correctness capture distinct aspects of quantum code quality. Controlled ablations reveal that code-aware embeddings are the primary driver of retrieval performance, while dataset expansion and source composition provide additional gains when retrieval quality is sufficiently precise.
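
The abstract reports pass@5 but does not say how it is computed. As a reading aid, here is a minimal sketch of the unbiased pass@k estimator widely used in code-generation evaluation, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass. Whether PennySynth uses this exact estimator is an assumption, and the sample counts below are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn without
    replacement from n generated samples with c correct, passes."""
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical per-problem results: (samples generated, samples passing).
results = [(10, 1), (10, 0), (10, 4)]
score = sum(pass_at_k(n, c, 5) for n, c in results) / len(results)
```

The benchmark-level score is the mean of the per-problem values, which is why a single easy problem cannot dominate the metric.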

[NLP-57] RotMoLE: Enhancing Mixture of Low-Rank Experts through Rotational Gating Mechanism

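Quick read: This paper asks how to improve the representational capacity and generalization of Mixture-of-Experts (MoE) architectures built on low-rank adapters in complex multi-task and multilingual scenarios, particularly when the number of experts is limited. Existing methods such as MoE-LoRA combine MoE with Parameter-Efficient Fine-Tuning (PEFT), but their conventional gating applies only a scalar reweighting to the selected experts, which limits expressiveness and adaptability. The key to the solution is RotMoLE, an MoE framework designed for low-rank experts whose core addition is a rotation gate: each selected expert is not only scaled but also rotated, allowing a more flexible transformation of the feature space and better expert exploitation and specialization, especially when expert candidates are limited. Experiments on complex multi-task and multilingual training scenarios support the effectiveness of the method.

Link: https://arxiv.org/abs/2605.25565
Authors: Mengyang Sun, Maochuan Dou, Tao Feng, Dan Zhang, Yihao Wang, Junpeng Liu, Yifan Zhu, Jie Tang
Affiliations: Tsinghua University; Beijing Information Science and Technology University; National University of Singapore; Hong Kong University of Science and Technology (Guangzhou); Beijing University of Posts and Telecommunications
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Click to view abstract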

Abstract:While Large Language Models (LLMs) are commonly fine-tuned to handle domain-specific tasks before being applied to vertical applications, adapting them to complex scenarios with diverse specialized knowledge remains challenging. Meanwhile, Mixture-of-Experts (MoE) architecture has risen as a crucial paradigm for training LLMs, and some recent works have also incorporated MoE into Parameter-Efficient Fine-Tuning (PEFT) to propose the Mixture of Low-rank Experts (MoE-LoRA), to enhance the power of low-rank adapters for learning complicated knowledge. However, conventional gating mechanisms in MoE typically apply only a scalar reweighing to selected experts, thereby limiting their underlying capacity of representation and generalization. Motivated and enabled by the low-rank structures in MoE-LoRA, we propose RotMoLE, a specialized MoE framework for low-rank experts featuring an additional rotation gate. Beyond simple scaling, RotMoLE implements a rotation mechanism for each selected expert, enabling superior expert exploitation and specialization for learning diverse data, especially when expert candidates are limited. Empirical results on complex multi-task and multilingual training scenarios validate our effectiveness.

[NLP-58] BC Protocol: Structured Dual-Expert Dialogue for Eliciting High-Quality Chain-of-Thought Post-Training Data

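Quick read: This paper addresses the scarcity of high-quality chain-of-thought (CoT) data for large language model (LLM) post-training. Existing production methods each have structural limits: crowdsourced annotation lacks deep reasoning paths; experts writing alone are subject to the "expert blind spot" and skip steps they consider obvious; and reinforcement learning from human feedback (RLHF) yields only preference signals, not reasoning chains. The proposed BC Protocol is a structured dual-expert elicitation method that pairs a domain expert (crystallized intelligence) with a knowledge engineer (fluid intelligence) to systematically externalize the expert's implicit judgments as natural-language reasoning chains. Its main contributions are the concept of "Calibrated Ignorance" and the methodological principle of "Selection-over-Prescription": for implicit-knowledge elicitation, investing quality-control resources in personnel selection yields a higher return than investing the same resources in process design. In a controlled experiment, CoT produced with the BC Protocol scored far higher on naturalness of the reasoning process than CoT written by the expert alone (mean 4.80 vs. 1.30, p = 2.4×10⁻⁸, Cliff's δ = 1.0).

Link: https://arxiv.org/abs/2605.25549
Authors: Bo Zou, Chao Xu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract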

Abstract:High-quality expert chain-of-thought (CoT) data is one of the core bottlenecks in large language model (LLM) post-training. Existing data production methods each have structural limitations: crowdsourced annotation lacks deep reasoning paths; expert solo writing is constrained by the “expert blind spot” – experts structurally skip reasoning steps they consider obvious; RLHF only produces preference signals rather than reasoning chains. This paper proposes the BC Protocol – a structured dual-expert elicitation method for LLM post-training data production. The method carefully pairs a domain expert (crystallized intelligence) with a knowledge engineer (fluid intelligence), systematically externalizing the expert’s implicit judgments as natural language reasoning chains. We introduce the Participant Aptitude Model, which defines six participant characteristic dimensions that affect elicitation quality. “Calibrated Ignorance” is an original concept proposed in this paper. We further propose “Selection-over-Prescription” as a methodological principle: for implicit knowledge elicitation tasks, investing quality-control resources in personnel selection yields a higher return than investing the same resources in process design. In a controlled experiment in the narrative fiction domain, we directly compared CoT produced by BC Protocol dual dialogue (Group A, n=20) against CoT written independently by the same domain expert (Group B, n=20). Three cross-vendor judge models – GPT-4o, Claude Opus 4.5, and Gemini 2.5 Pro – conducted blind evaluation across five dimensions (600 ratings total). Results show that the BC Protocol achieves an overwhelming advantage in “naturalness of reasoning process” (Group A mean 4.80 vs. Group B mean 1.30, p = 2.4×10⁻⁸, Cliff’s δ = 1.0).
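
To make the reported effect size concrete, the sketch below computes Cliff's delta from its standard definition: the share of cross-group pairs where a Group A score is higher, minus the share where it is lower. A value of 1.0 means every Group A rating exceeds every Group B rating. The ratings are hypothetical and are not the paper's data.

```python
def cliffs_delta(a, b):
    """Cliff's delta: (#pairs with a > b  -  #pairs with a < b) / (|a| * |b|)."""
    greater = sum(1 for x in a for y in b if x > y)
    less = sum(1 for x in a for y in b if x < y)
    return (greater - less) / (len(a) * len(b))


# Hypothetical 1-5 ratings for two groups with no overlap.
group_a = [5, 5, 4, 5]
group_b = [1, 2, 1, 1]
delta = cliffs_delta(group_a, group_b)
```

Because the statistic only compares ranks, it suits ordinal ratings such as judge scores better than a mean difference would.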

[NLP-59] Is Inference Mediated by Distinct Semantic Structures in LLMs? A Mechanistic Interpretation

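Quick read: This paper asks whether Transformer models performing Natural Language Inference (NLI) encode not only label-level information (entailment, contradiction, or neutral) but also the semantic operations that produce those labels. Predicting a label correctly does not necessarily require representing the operation behind it. To test this, the authors construct controlled premise-hypothesis pairs that differ by a single semantic transformation (such as negation or quantifier expansion). The key to the approach is to estimate operation-level subspaces from layer-wise activations via singular value decomposition (SVD) and to test their causal relevance with activation steering in four open-weight decoder models. Transformations are decodable with 84.8–99% accuracy and occupy partially distinct but overlapping subspaces, exceeding random-subspace baselines. Steering shows that these directions causally influence predictions, although steerability varies across models; cross-operation steering further reveals structured interference and a dissociation between subspace selectivity and cross-operation independence. The models therefore encode not only that a hypothesis relates to a premise but also, in part, how it does so, which suggests that mechanistic analysis and control should operate at the level of semantic operations rather than predicted labels alone.

Link: https://arxiv.org/abs/2605.25520
Authors: Nura Aljaafari, Marco Valentino, André Freitas
Affiliations: University of Manchester; University of Sheffield; Idiap Research Institute; CRUK National Biomarker Centre, University of Manchester
Categories: Computation and Language (cs.CL)
Comments: 26 pages, 16 figures, 13 tables

Click to view abstract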

Abstract:Predicting a label correctly does not necessarily require representing the operation that produces it. Transformer representations are known to carry label-level information, but whether they encode semantic operations producing those labels is unclear. We investigate this in Natural Language Inference using controlled premise-hypothesis pairs that differ by a single semantic transformation. Using layer-wise activations, we estimate operation-level subspaces via SVD and test their causal relevance through activation steering in four open-weight decoder models. Transformation effects are decodable with 84.8 - 99% accuracy and occupy partially distinct but overlapping subspaces, exceeding random-subspace baselines. Steering experiments show that these directions causally influence predictions, though steerability varies across models; cross-operation steering further reveals structured interference and a dissociation between subspace selectivity and cross-operation independence. These findings indicate that the models encode not only that a hypothesis relates to a premise but also, in part, how it does so, implying that mechanistic analysis and control should operate at the level of semantic operations rather than predicted labels alone.

[NLP-60] CRPO: Character-centric Group Relative Policy Optimization for Role-aware Reasoning in Role-playing Agents

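Quick read: This paper addresses the loss of character fidelity and the style collapse that often occur when problem-centric reinforcement learning methods such as GRPO are applied to role-playing agents, because these methods prioritize context-specific utility over persona alignment. The key to the solution is Character-Centric Group Relative Policy Optimization (CRPO), a character-centered RL framework built on three mechanisms: (1) decoupling task logic from stylistic rewards to resolve gradient conflicts; (2) dynamically adapting optimization constraints to character complexity; and (3) using generic responses as negative baselines so that the model does not revert to a common distribution. Experiments show that CRPO outperforms existing methods on consistency, emotion, and other dimensions.

Link: https://arxiv.org/abs/2605.25511
Authors: Yihong Tang, Kehai Chen, Liang Yue, Benyou Wang, Min Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract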

Abstract:Recent advancements in Reinforcement Learning (RL), particularly Group Relative Policy Optimization (GRPO), have significantly enhanced the reasoning capabilities of Large Language Models. However, applying these problem-centric optimization methods to role-playing agents often leads to a loss of character fidelity and style collapse, as they prioritize context-specific utility over persona alignment. To address this, we propose Character-Centric Group Relative Policy Optimization (CRPO), a framework designed to realign RL objectives with the role-playing task. CRPO improves character distinctiveness through three mechanisms: decoupling task logic from stylistic rewards to resolve gradient conflicts, dynamically adapting optimization constraints based on character complexity, and utilizing generic responses as negative baselines to prevent the model from reverting to a common distribution. Extensive experiments demonstrate that CRPO outperforms existing methods in consistency, emotion and others.

[NLP-61] The Age of Curiosity Meets the Age of AI: Benchmarking Child Safety in Large Language Models

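Quick read: This paper addresses age-appropriate safety for children who use Large Language Models (LLMs): existing LLM safety evaluations focus mainly on avoiding harmful content and do not explicitly target child-facing safety. The key to the solution is KIDBench, a benchmark for ages 7–11 that scores responses with an LLM-as-a-Judge rubric grounded in developmental psychology. It covers realistic child queries in ten categories, with single-turn prompts and multi-turn child-actor simulations. Implicit-cues prompts, which suggest a child speaker, improve scores by 9–47% across models, and explicit age instructions add a further 10–30%. Cross-lingual and cultural evaluations show uneven safety behavior across languages and country contexts, and in multi-turn simulations response quality can degrade by 6–24% from the first to the worst turn. The authors also introduce KIDGuardLlama, a child-safety evaluator, and KIDLlama, a child-oriented response model, to show how KIDBench supports safer child-facing AI.

Link: https://arxiv.org/abs/2605.25510
Authors: Samee Arif, Angana Borah, Rada Mihalcea
Affiliations: University of Michigan
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract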

Abstract:Children increasingly have access to Large Language Models (LLMs), which may expose them to responses that are developmentally inappropriate or require age-sensitive safety, guidance, and boundaries. Existing LLM safety evaluations largely focus on harmful-content avoidance and do not explicitly target child-facing safety. We introduce KIDBench, a benchmark for evaluating child-facing LLM safety for ages 7–11 using a developmental-psychology-grounded LLM-as-a-Judge rubric. KIDBench contains realistic child queries across ten categories, with single-turn prompts and multi-turn child-actor simulations. We compare no-cues prompts with no child context, implicit-cues prompts that suggest a child speaker, and explicit age instructions. Implicit-cues improve scores by 9–47% across models, while explicit age adds a further 10–30% gain. Cross-lingual and cultural evaluations show uneven safety behavior across languages and country contexts. Multi-turn simulations show that child-facing response quality can degrade by 6–24% from the first to worst turn. Beyond evaluation, we introduce KIDGuardLlama, a child-safety evaluator, and KIDLlama, a child-oriented response model, showing how KIDBench supports safer child-facing AI.

[NLP-62] A Controlled Synthetic Benchmark for Educational Aspect-Based Sentiment Analysis

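Quick read: This paper addresses the difficulty of studying educational aspect-based sentiment analysis (ABSA) without public labeled data: student feedback is private, institution-specific, and expensive to annotate. The authors propose a controlled synthetic benchmark of 10,000 synthetic course reviews with explicit train-validation-test splits and a 20-aspect pedagogical schema spanning instructional quality, assessment and course management, learning demand, learning environment, and engagement. The key to the solution is the generation procedure: target labels and nuance attributes are sampled, and a realism-tuned prompt is refined through a three-cycle judge-editor procedure. The benchmark is nontrivial: the strongest untuned model, BERT, reaches a held-out detection micro-F1 of 0.2760, and a modest lower-rate BERT schedule raises this to 0.2930. Full-test inference with gpt-5.2 reaches 0.2519 micro-F1 in zero-shot mode and 0.2501 with retrieval-based few-shot prompting, above the classical baseline and close to the compact joint encoders. A conservative external evaluation on 2,829 mapped real student-feedback reviews yields a micro-F1 of 0.4593 for BERT on a 9-aspect overlap, indicating partial synthetic-to-real transfer. The contributions are a synthetic educational ABSA corpus, a documented generation procedure, and a reproducible benchmark setting for a domain where labeled data are scarce.

Link: https://arxiv.org/abs/2605.25502
Authors: Yehudit Aperstein, Alexander Apartsin
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 39 pages, 14 figures

Click to view abstract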

Abstract:Educational aspect-based sentiment analysis (ABSA) can support course improvement, but public aspect-labeled student feedback remains scarce because educational reviews are private, institution-specific, and expensive to annotate. This study introduces a controlled synthetic benchmark for educational ABSA built from 10,000 synthetic course reviews with explicit train-validation-test splits and a 20-aspect pedagogical schema spanning instructional quality, assessment and course management, learning demand, learning environment, and engagement. The corpus is generated with sampled target labels, sampled nuance attributes, and a realism-tuned prompt refined through a three-cycle judge-editor procedure. On the resulting benchmark, local baselines with TF-IDF, two-step transformers, and joint encoders show that the task is nontrivial; the strongest untuned model, BERT, reaches a held-out detection micro-F1 of 0.2760, while a modest lower-rate BERT schedule improves this to 0.2930. Full-test GPT-based inference with gpt-5.2 reaches 0.2519 micro-F1 in zero-shot mode and 0.2501 with retrieval-based few-shot prompting, placing batch inference above the classical baseline and close to the compact joint encoders. A conservative external evaluation on 2,829 mapped student-feedback reviews from Herath et al. yields a micro-F1 of 0.4593 for BERT on a 9-aspect overlap, indicating partial synthetic-to-real transfer. Realism and faithfulness analyses are reported as generator diagnostics that clarify how the benchmark was stabilized and where label noise remains. The study therefore contributes a synthetic educational ABSA corpus, a documented generation procedure, and a reproducible benchmark setting for a domain in which public labeled data remain difficult to obtain.
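
Since the results above are reported as micro-F1 over detected aspects, a short sketch of the standard multi-label micro-F1 may help when reading the numbers. It pools true positives, false positives, and false negatives over all reviews before computing precision and recall. The aspect names are hypothetical, and the paper's exact matching rules are not specified in the abstract.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over per-review sets of aspect labels."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # correctly detected aspects
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # predicted but not in gold
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # in gold but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical aspect labels for two reviews.
gold = [{"workload", "feedback"}, {"pace"}]
pred = [{"workload"}, {"pace", "clarity"}]
score = micro_f1(gold, pred)
```

With a 20-aspect schema and only a few aspects per review, misses and spurious detections both count against the score, which helps explain why values near 0.28 can still indicate a nontrivial task.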

[NLP-63] Retrieval as Reasoning: Self-Evolving Agent-Native Retrieval via LLM-Wiki

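Quick read: This paper addresses a limitation in how large language model (LLM) agents use external knowledge: conventional Retrieval-Augmented Generation (RAG) organizes knowledge as flat chunks retrieved by embedding similarity, an interface poorly aligned with tool-using agents. The key to the solution is LLM-Wiki, an agent-native retrieval system that follows the Retrieval-as-Reasoning paradigm. It compiles external documents into structured Wiki pages with bidirectional links, exposes search, read, and link-following operations through standard tool-calling interfaces, and introduces an Error Book for persistent structural and semantic self-correction. On the multi-hop question answering benchmarks HotpotQA, MuSiQue, and 2WikiMultiHopQA, LLM-Wiki outperforms seven baselines, including HippoRAG 2, LightRAG, and GraphRAG, with gains of 2.0–8.1 F1 points over the strongest graph-based baseline. On AuthTrace it achieves the best overall accuracy, with especially strong gains on multi-document structured queries, showing that compilation-based knowledge organization generalizes beyond chain-style multi-hop reasoning.

Link: https://arxiv.org/abs/2605.25480
Authors: Haoliang Ming, Feifei Li, Xiaoqing Wu, Wenhui Que
Affiliations: WeChat, Tencent Inc., Beijing, China
Categories: Computation and Language (cs.CL)
Comments: 15 pages, 3 figures, 10 tables, 1 algorithm

Click to view abstract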

Abstract:LLM agents require retrieval to behave less like one-shot context fetching and more like reasoning: searching, reading, traversing, and deciding when evidence is sufficient. However, Retrieval-Augmented Generation (RAG) typically organizes external knowledge as flat chunks retrieved by embedding similarity, exposing a retrieval-as-lookup interface that is poorly aligned with tool-using agents. We propose LLM-Wiki, an agent-native retrieval system that operationalizes the Retrieval-as-Reasoning paradigm by treating external knowledge as a compilable, composable, and self-evolving structure rather than a static retrieval index. LLM-Wiki compiles documents into structured Wiki pages with bidirectional links, exposes search, read, and link-following operations through standard tool-calling interfaces, and introduces an Error Book for persistent structural and semantic self-correction. On HotpotQA, MuSiQue, and 2WikiMultiHopQA, LLM-Wiki outperforms seven baselines, including HippoRAG 2, LightRAG, and GraphRAG, with gains of 2.0-8.1 F1 points over the strongest graph-based baseline and larger gains over Dense RAG. On AuthTrace, LLM-Wiki achieves the best overall accuracy, with especially strong gains on multi-document structured queries, showing that compilation-based knowledge organization generalizes beyond chain-style multi-hop reasoning.

[NLP-64] IndexMem: Learned KV-Cache Eviction with Latent Memory for Long-Context LLM Inference

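Quick read: This paper addresses the inference bottleneck that arises when Large Language Models (LLMs) process long contexts: under standard softmax attention the key-value (KV) cache grows linearly with sequence length. The solution has two components. First, a learnable indexer predicts the importance of KV entries so that critical tokens are retained more accurately. Second, a lightweight latent memory module compresses evicted tokens into a compact, online-updated state and provides residual readouts that compensate for the attention contributions lost through eviction. Under a bounded KV budget the method delivers consistent improvements on RULER (4K/16K) across Qwen, Mistral, and Llama models (up to 25 points under aggressive eviction), more stable Needle-in-a-Haystack retrieval, and better LongBench scores than existing eviction policies.

Link: https://arxiv.org/abs/2605.25475
Authors: Xintong Yang, Hao Gu, Binxing Xu, Lujun Li, Bei Liu, Jiacheng Liu, Qiyuan Zhu, Sirui Han, Yike Guo
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract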

Abstract:Large Language Models (LLMs) are increasingly expected to operate over long contexts, yet standard softmax attention incurs a KV cache that grows linearly with sequence length, quickly becoming the bottleneck for long context inference. A practical remedy is to evict less important KV entries; however, existing eviction policies are largely heuristic and struggle to capture the rich, input-dependent distribution of token importance. In this work, we introduce a learnable indexer that predicts KV importance, enabling more accurate retention of critical tokens. Meanwhile, naively evicting tokens permanently discards their information, leading to irreversible forgetting and degraded retrieval over long ranges. To address this, we propose a lightweight latent memory module that compresses evicted tokens into a compact, online-updated state and provides residual readouts to compensate for the attention contributions lost through KV eviction. Collectively, our method enables accurate long-context inference under a bounded KV budget, delivering consistent improvements on RULER (4K/16K) across Qwen, Mistral, and Llama models (up to 25 points under aggressive eviction), markedly more stable Needle-in-a-Haystack retrieval, and superior LongBench scores and compression curves compared to existing eviction policies.
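
The general shape of score-based eviction under a fixed budget can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the importance scores are supplied by hand where the paper uses a learned indexer, and the running average of evicted vectors is only a stand-in for the paper's latent memory module, whose actual update rule the abstract does not describe.

```python
def evict(entries, scores, budget):
    """Keep the `budget` highest-scoring entries and summarize the rest.

    entries: list of value vectors (lists of floats), one per cached token
    scores:  importance score per entry (hand-set here; learned in the paper)
    Returns (kept_indices, kept_entries, summary_of_evicted_or_None).
    """
    order = sorted(range(len(entries)), key=lambda i: scores[i], reverse=True)
    keep = sorted(order[:budget])       # preserve original token order
    dropped = sorted(order[budget:])
    kept = [entries[i] for i in keep]
    if not dropped:
        return keep, kept, None
    dim = len(entries[0])
    # Placeholder "memory": the element-wise mean of the evicted vectors.
    summary = [sum(entries[i][d] for i in dropped) / len(dropped) for d in range(dim)]
    return keep, kept, summary


# Hypothetical cache of four 2-dimensional value vectors.
entries = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0], [2.0, 0.0]]
scores = [0.9, 0.1, 0.8, 0.3]
keep, kept, summary = evict(entries, scores, budget=2)
```

The point of the sketch is the contrast the abstract draws: plain eviction would return only `kept`, while keeping some compressed `summary` gives later attention steps a way to recover part of what was dropped.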

[NLP-65] TypedCSIP: Typed Counterfactual Pretraining for Chinese Legislative Conflict Classification

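Quick read: This paper addresses conflict classification between provisions in Chinese legislative text: given a (superior, subordinate) provision pair, predict whether the pair conflicts and which of four legal-doctrine types (Responsibility, Condition, Sanction, Definition) describes the inconsistency. The proposed method, TypedCSIP, is a typed counterfactual pretraining approach. Stage 1 pretrains a shared encoder on (superior, subordinate, expert-revised) triplets built from expert-written minimal revisions, treating each expert revision as a counterfactual that the typed factor head must classify as carrying no conflict evidence. Stage 2 transfers the encoder to a five-way classification head. The main idea is to use expert revisions as training-time counterfactual supervision, while at test time the classifier reads only the original pair. On the 696-record test split, macro-F1 improves over the strongest single-model baseline by 0.916 pp on chinese-roberta-wwm-ext and 1.288 pp on the SAILER cross-backbone replication, and the gain stays positive on the 244 cold-start Unseen-gB records.

Link: https://arxiv.org/abs/2605.25474
Authors: Yao Liu
Affiliations: Chengdu University of Technology; School of Computer Sciences, Universiti Sains Malaysia
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract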

Abstract:TypedCSIP is a typed counterfactual pretraining method for the conflict-classification task of the LCR-CN benchmark (Zhao et al., 2026): given a (superior, subordinate) provision pair, predict whether the pair conflicts and which of four legal-doctrine types (Responsibility, Condition, Sanction, Definition) describes the inconsistency. We exploit LCR-CN’s expert-written minimal revisions as training-time counterfactual supervision; at test time the classifier reads only the original pair. Stage 1 pretrains a shared encoder with a typed Counterfactual Selective Intervention Pretraining objective on (superior, subordinate, expert-revised) triplets, treating the expert revision as a counterfactual that the typed factor head must classify as carrying no conflict evidence. Stage 2 transfers the encoder to a five-way classification head. The confirmatory test was registered on the Open Science Framework before observing v6 measurements: 18 seeds, locked rule requiring mean per-seed difference at least 0.8 pp with both seed-bootstrap and Student-t 95% lower bounds above zero. On the 696-record test split, the v2 variant improves macro-F1 over the strongest single-model baseline by +0.916 pp on chinese-roberta-wwm-ext and +1.288 pp on the SAILER cross-backbone replication; both cells pass the rule. A cold-start stratified result on the 244 Unseen-gB records keeps the gain positive on both backbones. A cross-task diagnostic shows the Stage-2 encoder is classification-specialized and does not transfer to LCR-CN’s superior-law retrieval task, so we scope the contribution to conflict classification. We release code, 72 pre-registered prediction files, matched-seed and MLM-control auxiliaries, and the OSF pre-registration record. 

[NLP-66] A Lightweight Hybrid Transformer-CRF Architecture for Multi-Type Bangla Medical Entity Recognition

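Quick read: This paper addresses the difficulty of deploying medical entity recognition (MedER) in resource-constrained environments, and the inflated scores that result when earlier work uses relaxed evaluation metrics that reward correct prediction of the dominant "Outside" (O) tokens. The solution proceeds in three steps. First, a rigorous baseline combines a 12-layer BanglaBERT model with a Conditional Random Field (CRF) layer for exact-boundary entity detection. Second, Knowledge Distillation (KD) compresses this teacher into a 4-layer student network, which learns from the teacher's pre-CRF soft emission logits. Third, INT8 dynamic quantization further reduces model size and inference cost. The final quantized student achieves an 8.6x CPU speedup and needs nearly 48 percent less storage than the CRF teacher.

Link: https://arxiv.org/abs/2605.25463
Authors: Peyal Saha, Ahsanul Haque Hasib, Shoumik Barman Polok
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract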

Abstract:MedER refers to the identification of medical entities. It is crucial for extracting structured clinical information from unstructured medical text. Many existing systems rely on transformer-based models, which are computationally expensive and difficult to deploy in resource-constrained environments. Furthermore, earlier works often use relaxed evaluation metrics that artificially inflate performance by rewarding correct prediction of dominant “Outside” (O) tokens. In this paper, we propose a lightweight Medical Entity Recognition (MedER) framework for the Bangla language. We establish a rigorous baseline using a 12-layer BanglaBERT model combined with a Conditional Random Field (CRF) layer for exact-boundary entity detection. To address deployment constraints, we compress this teacher model into a 4-layer student network through Knowledge Distillation (KD), where the student learns from the teacher’s pre-CRF soft emission logits. Finally, we apply INT8 dynamic quantization to further reduce model size and inference cost. Our final quantized student achieves an 8.6x CPU speedup while requiring nearly 48 percent less storage than the CRF teacher model.
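
The distillation step can be illustrated with a generic soft-target loss. This is a sketch under assumptions, not the paper's objective: it uses the common formulation of KL divergence between temperature-softened teacher and student distributions for one token, and the temperature and logit values are hypothetical. The abstract states only that the student learns from the teacher's pre-CRF soft emission logits.

```python
from math import exp, log


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax with max-subtraction for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_kl(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over softened label distributions for one token."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical emission logits over three tags for a single token.
teacher = [2.0, 0.0, -1.0]
student = [0.0, 0.0, 0.0]
loss = distillation_kl(teacher, student)
```

A higher temperature flattens the teacher distribution, so the student also sees how the teacher ranks the less likely tags instead of only its top choice.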

[NLP-67] GeoSVG-RL: Geometry-Aware Reinforcement Learning for Layout-Constrained Text-to-SVG Diagram Generation

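Quick read: This paper addresses the structural fragility of large language models when they generate structured, editable vector diagrams (SVG): small errors such as misaligned connector endpoints, text labels overlapping borders, or layouts drifting beyond the canvas make the output unusable in professional settings. The key to the solution is GeoSVG-RL, a reinforcement learning framework for layout-constrained text-to-SVG generation. The model first produces a structured layout plan that serves as a geometric contract, then generates SVG code from it. The code is rendered through a browser-backed verifier, which yields fine-grained rewards along six dimensions (rendering validity, canvas fitting, anchor placement, text containment, graph consistency, and code cleanliness), and the policy is refined with Group Relative Policy Optimization (GRPO). Starting from a supervised warm start on synthetic data, the method achieves substantial gains in arrow-anchor accuracy and text-in-box rates and is reported to outperform current state-of-the-art systems in local geometric precision and preservation of graph connectivity.

Link: https://arxiv.org/abs/2605.25447
Authors: Sifan Li, Yujun Cai, Hongkai Chen, Yiwei Wang
Affiliations: University of California, Merced; The University of Queensland; vivo Mobile Communication Co., Ltd.
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract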

Abstract:Generating structured, editable diagrams remains a significant challenge for contemporary large language models, despite their proficiency in general-purpose vector code generation. The primary difficulty lies in the structural fragility of the output; minor errors such as misaligned connector endpoints, text labels overlapping borders, or complex layouts drifting beyond the canvas boundaries render the resulting SVG files functionally unusable for professional applications. To address these issues, we introduce GeoSVG-RL, a specialized reinforcement learning framework designed for layout-constrained text-to-SVG generation. Unlike standard training objectives that rely solely on maximizing token-level likelihood, our approach optimizes the policy against explicit, executable geometric feedback. The model first produces a structured layout plan that serves as a geometric contract for the subsequent generation of the SVG code. This code is then rendered through a browser-backed verifier, enabling the calculation of fine-grained rewards across six critical dimensions: rendering validity, canvas fitting, precise anchor placement, text containment, graph consistency, and code cleanliness. We utilize Group Relative Policy Optimization (GRPO) to refine the model, sampling multiple candidates per prompt to facilitate updates based on relative quality. Starting from a supervised warm-start phase on synthetic data, GeoSVG-RL achieves substantial gains in structural reliability, particularly in arrow-anchor accuracy and text-in-box rates. Quantitative evaluations demonstrate that our method consistently outperforms current state-of-the-art systems in local geometric precision and the preservation of graph connectivity, providing a robust pathway toward automated yet reliable technical illustration.

[NLP-68] Harmony in Diversity: Multi-domain Contrastive Policy Optimization for Large Reasoning Models

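[Quick read]: This paper addresses inconsistent gains across domains when RL-based post-training methods such as GRPO are applied in multi-domain settings, where cross-domain interference disrupts policy optimization. The key to its solution is Multi-domain Contrastive Policy Optimization (MCPO), which uses contrastive learning to explicitly model cross-domain knowledge sharing and in-domain knowledge consolidation. For a given prompt, MCPO treats transferable reasoning trajectories from other domains as positives and incorrect rollouts as negatives, encourages consistent representations for positive pairs to promote transfer, and pushes negative pairs apart to reduce interference. It also aligns correct rollouts within the same domain to build a consolidated representation space. This turns harmful cross-domain competition into beneficial transfer, so that reasoning improves across domains together.

Link: https://arxiv.org/abs/2605.25443
Authors: Zongji Yu, Wenshui Luo, Yiliu Sun, Hao Fang, Runmin Cong, Chaochao Lu, Chen Gong
Affiliations: Shanghai Jiao Tong University; Shanghai AI Laboratory; Shandong University
Categories: Computation and Language (cs.CL)
Notes: 25 pages, 5 figures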


Abstract:Post-training has significantly enhanced the reasoning capability of Large Reasoning Models (LRMs), especially with Reinforcement Learning (RL) like Group Relative Policy Optimization (GRPO). However, GRPO-style RL methods in multi-domain settings often fail to achieve consistent improvements across all domains due to inherent interference in policy optimization. Prior studies on multi-domain RL primarily focus on alleviating cross-domain interference, while often neglecting the pivotal role of knowledge sharing, which we argue is the key to transforming cross-domain interactions from harmful competition into beneficial transfer. To address this limitation, we propose Multi-domain Contrastive Policy Optimization (MCPO), which analyzes the structural relationships among rollouts and promotes cross-domain knowledge sharing and in-domain knowledge consolidation in a contrastive manner. Specifically, for a given prompt, MCPO identifies transferable reasoning trajectories from other domains as positive examples, while treating incorrect rollouts as negative ones. It then encourages consistent representations for positive pairs and pushes negative pairs apart, thereby facilitating knowledge transfer and reducing interference. Moreover, MCPO aligns intra-domain correct rollouts to build a consolidated representation space. In this way, MCPO contrastively learns a harmonious representation space that can accommodate diverse multi-domain knowledge. Empirical results show that MCPO improves the reasoning capabilities of LRMs across multiple domains and even outperforms single-domain training in some cases. Code is available at this https URL.
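The pull-together, push-apart behaviour described above can be illustrated with an InfoNCE-style loss. This is a sketch under stated assumptions: the abstract does not give MCPO's actual objective, so the cosine similarity, the temperature value, and the function names here are illustrative choices, and rollout representations are stood in for by plain vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(anchor, positives, negatives, temperature=0.1):
    """InfoNCE-style loss: low when positives are close and negatives are far.

    anchor    -- representation of the current rollout
    positives -- transferable or correct rollouts (assumed given)
    negatives -- incorrect rollouts (assumed given)
    """
    pos = [math.exp(cosine(anchor, p) / temperature) for p in positives]
    neg = [math.exp(cosine(anchor, n) / temperature) for n in negatives]
    denom = sum(pos) + sum(neg)
    return -sum(math.log(p / denom) for p in pos) / len(pos)
```

Minimizing this quantity raises similarity to the positives relative to every negative, which is the general mechanism the abstract describes.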

[NLP-69] HyLaT: Efficient Multi-Agent Communication via Hybrid Latent-Text Protocol

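[Quick read]: This paper addresses communication protocol design in LLM-based multi-agent systems, in particular the "communication trilemma" facing single-channel approaches: text-based methods are interpretable but verbose, whereas latent-space methods are efficient but opaque and limited to unidirectional workflows. The key to its solution is HyLaT, a hybrid latent-text protocol that sends elaborate cognitive signals through a latent channel for efficiency and expresses concise critical signals in natural language to preserve interpretability and precision. It is trained in two stages, single-agent hybrid generation learning followed by multi-agent interactive co-training, so that agents can generate and interpret hybrid messages over multiple rounds. The reported result is a significant reduction in communication overhead with competitive task performance, along with strong generalization and robustness.

Link: https://arxiv.org/abs/2605.25421
Authors: Xinyi Mou, Siyuan Wang, Zejun Li, Yulan He, Zhongyu Wei
Affiliations: Fudan University; The Chinese University of Hong Kong; King’s College London; The Alan Turing Institute; Shanghai Innovation Institute
Categories: Computation and Language (cs.CL)
Notes: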


Abstract:Communication protocol design is a central challenge in large language model-based multi-agent systems. Existing single-channel approaches face an inherent communication trilemma: text-based methods are interpretable but verbose, while latent-space methods are efficient but opaque and limited to unidirectional workflows. Inspired by multi-channel communication theory, we propose HyLaT, a hybrid latent-text communication protocol that transmits elaborate cognitive signals through a latent channel for efficiency, while expressing concise critical signals in natural language to preserve interpretability and precision. We introduce a two-stage training framework combining single-agent hybrid generation learning and multi-agent interactive co-training, enabling agents to generate and interpret hybrid messages across multiple rounds of interaction. Experiments demonstrate that HyLaT reduces communication overhead significantly while maintaining competitive task performance, with strong generalization and robustness across diverse settings.

[NLP-70] SomaliBench Eval: Measuring English-to-Somali Refusal Gaps in Open-Weight Language Models

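[Quick read]: This paper addresses the English-centered state of LLM safety evaluation, which leaves low-resource languages such as Somali under-measured even where models are deployed. The authors evaluate four open-weight instruction-tuned models (Llama-3.1-8B-Instruct, Gemma-2-9B-Instruct, Qwen-2.5-7B-Instruct, and Aya-23-8B) on SomaliBench v0, a native-author-verified benchmark of 100 harmful-intent prompts paired across English and Somali. The key elements are a shared English "helpful, harmless, and honest" (HHH) system prompt with temperature 0 for consistency; a pinned Claude Sonnet judge that labels each response as refused, complied, or unclear, with a native-author spot-check of a stratified sample; and a per-model English-to-Somali refusal gap. All four models show large gaps, and for three of them the dominant Somali non-refusal mode is unclear output (empty, wrong-language, or incoherent) rather than fluent harmful compliance.

Link: https://arxiv.org/abs/2605.25420
Authors: Khalid Yusuf Dahir
Affiliations: Independent researcher
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Notes: 12 pages, 3 figures, 4 tables. Code: this https URL Dataset: this https URL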


Abstract:Large language model safety evaluation remains heavily English-centered, leaving low-resource languages under-measured even when models are deployed globally. We evaluate four open-weight instruction-tuned models on SomaliBench v0, a native-author-verified benchmark of 100 harmful-intent prompts paired across English and Somali. Each of Llama-3.1-8B-Instruct, Gemma-2-9B-Instruct, Qwen-2.5-7B-Instruct, and Aya-23-8B is run locally with temperature 0 and the same English “helpful, harmless, and honest” (HHH) system prompt. A pinned Claude Sonnet snapshot (claude-sonnet-4-5-20250929) classifies each response as refused, complied, or unclear; the native author spot-checks a stratified 80-row sample. We find large English-to-Somali refusal gaps for all four models: Llama-3.1-8B (0.90; 95% bootstrap CI [0.85, 0.96]), Aya-23-8B (0.75 [0.67, 0.83]), Qwen-2.5-7B (0.69 [0.59, 0.78]), and Gemma-2-9B (0.38 [0.27, 0.49]). For three models, the dominant Somali non-refusal mode is not fluent harmful compliance but unclear output: empty, wrong-language, or incoherent generations. The native verification spot-check achieves 100% agreement with the judge (Cohen’s kappa = 1.00) on the 80 sampled rows. We report aggregate refusal rates, category gaps, and reliability statistics only; raw model generations are retained locally and are not released.
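A refusal gap with a bootstrap interval, as reported above, might be computed along these lines. This is a sketch under assumptions: the abstract says only "95% bootstrap CI", so the percentile method, the resampling of whole prompt pairs, and the function names are illustrative choices, not the paper's stated procedure.

```python
import random

def refusal_gap(pairs):
    """English refusal rate minus Somali refusal rate.

    pairs -- list of (refused_in_english, refused_in_somali) booleans,
             one tuple per paired prompt.
    """
    n = len(pairs)
    english_rate = sum(1 for en, _ in pairs if en) / n
    somali_rate = sum(1 for _, so in pairs if so) / n
    return english_rate - somali_rate

def bootstrap_ci(pairs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling prompt pairs with replacement."""
    rng = random.Random(seed)
    gaps = sorted(
        refusal_gap([rng.choice(pairs) for _ in pairs])
        for _ in range(n_boot)
    )
    lo = gaps[int(round((alpha / 2) * n_boot))]
    hi = gaps[int(round((1 - alpha / 2) * n_boot)) - 1]
    return lo, hi
```

Resampling pairs rather than individual responses keeps the English and Somali versions of each prompt together, which matches the paired design of the benchmark.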

[NLP-71] LLM-as-a-Reviewer: Benchmarking Their Ability Divergence and Prompt Injection Resistance as Paper Reviewers

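[Quick read]: This paper addresses how poorly understood LLMs remain as academic peer reviewers in terms of reliability, alignment with human judgment, and robustness to adversarial attacks. The key to its solution is a systematic benchmark of 898 NeurIPS and ICLR papers that evaluates 12 LLMs along three axes: rating calibration, divergence from human reviewers, and resistance to prompt injection embedded through an invisible font-mapping attack. The study finds that LLMs systematically overrate weaker submissions, diverge from humans in topical emphasis (under-flagging Clarity and over-flagging Reproducibility), and write longer reviews with lower lexical diversity and a more standardized vocabulary. Prompt injection remains highly effective: hidden instructions can lift low-scoring papers to acceptance-level ratings, with effectiveness varying sharply across model families. LLMs are useful for structuring evaluations, but their use in peer review requires safeguards against both intrinsic biases and adversarial risks.

Link: https://arxiv.org/abs/2605.25415
Authors: Lingyao Li, Junjie Xiong, Changjia Zhu, Runlong Yu, Chen Chen, Junyu Wang, Renkai Ma, Zhicong Lu
Affiliations: University of South Florida; Missouri University of Science and Technology; University of Alabama; Florida International University; University of Cincinnati; George Mason University
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Emerging Technologies (cs.ET)
Notes: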


Abstract:Large language models (LLMs) are increasingly used in academic peer review, yet their reliability, alignment with human judgment, and robustness to adversarial attacks remain poorly understood. We present a systematic benchmark of LLM-as-a-Reviewer on 898 papers stratified from NeurIPS and ICLR, evaluating 12 LLMs along three axes: rating calibration, divergence from human reviewers, and resistance to prompt injection embedded via an invisible font-mapping attack. We find that LLMs systematically overrate weaker submissions and diverge from humans in topical emphasis, under-flagging Clarity and over-flagging Reproducibility, while producing reviews two to three times longer with lower lexical diversity and a more standardized vocabulary. Prompt injection remains highly effective. Simple hidden instructions can promote low-scoring papers to acceptance-level ratings in a substantial fraction of cases, with effectiveness varying sharply across model families. While LLMs offer utility in structuring evaluations, their integration into peer review requires safeguards against both intrinsic biases and adversarial risks.

[NLP-72] Proactive for Uncertainty: Cause-Aware Error Diagnosis and Interactive Clarification for Spoken Dialogue Systems

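[Quick read]: This paper addresses the loss of interaction quality caused by error propagation in cascaded ASR-LLM pipelines. Filtering on ASR confidence scores is fundamentally limited: it typically misses deletion errors and cannot distinguish acoustic mismatches (failure to hear) from linguistic ones (failure to understand), which call for different recovery strategies. The paper proposes a cause-aware error recovery paradigm whose key component is a suite of small, precision-focused detectors built on deep ASR latent representations, which separate token-level errors into perception, comprehension, and deletion failures. This diagnosis lets the LLM plan targeted multi-turn clarification. Experiments show recall on domain-shift errors more than doubling (57.96% vs. 23.66%), together with up to a 30% reduction in word error rate (WER) and a 17% improvement on the downstream task across accents, distortions, and domains.

Link: https://arxiv.org/abs/2605.25404
Authors: Yizhou Peng, Ziyang Ma, Changsong Liu, Yi-Wen Chao, Xie Chen, Eng Siong Chng
Affiliations: Nanyang Technological University, Singapore; Shanghai Jiao Tong University, China
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Notes: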


Abstract:Cascaded Automatic Speech Recognition – Large Language Model (ASR-LLM) pipelines remain popular for industrial Spoken Dialogue Systems (SDS), primarily because their decoupled design ensures perceptual verifiability. However, cascaded systems suffer from error propagation, as transcription failures inevitably cascade to subsequent components, thereby degrading the final interaction quality. Although ASR confidence scores offer a simple filter for unreliable inputs, this approach is fundamentally limited because it typically fails to detect deletion errors or to distinguish between acoustic (inability to hear clearly) and linguistic (inability to understand) mismatches, both of which require targeted recovery strategies. In this paper, we propose a cause-aware error recovery paradigm that fundamentally rethinks robustness in SDS. Unlike traditional confidence filtering, we introduce a suite of small precision-focused detectors that exploit deep ASR latent representations to disentangle token-level errors into perception, comprehension, and deletion failures. This fine-grained diagnostic intelligence empowers the LLM to orchestrate targeted, multi-turn clarification strategies, effectively transforming ambiguous signals into seamless user interactions. Experimental results validate the precision of our approach, which more than doubles the recall on domain-shift errors (57.96% vs. 23.66%) compared to baselines. Crucially, this diagnostic precision yields up to a 30% reduction in WER and a 17% improvement on the downstream task across diverse accents, distortions, and domains.

[NLP-73] Second Guess: Detecting Uncertainty Through Abstention and Answer Stability in Small Language Models

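[Quick read]: This paper addresses the tendency of small language models (SLMs) to give confident but incorrect answers instead of abstaining when uncertain, which matters most where compute is constrained and the model runs autonomously. The key to its solution is Second Guess, a lightweight, parameter-free prompting technique for multiple-choice question answering. Its core observation is that a model that truly knows an answer keeps selecting it when an "I don't know" option is added, whereas an uncertain model behaves unstably. Across four open models (2B-8B parameters) and four benchmarks, Second Guess achieves the highest composite risk improvement (10.81%), retains an 8% improvement on fine-tuned models where entropy-based methods degrade, and helps lower-performing models the most.

Link: https://arxiv.org/abs/2605.25394
Authors: Ashwath Vaithinathan Aravindan, Mayank Kejriwal
Affiliations: University of Southern California; Information Sciences Institute
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: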


Abstract:Large language models often generate confident but incorrect answers rather than abstaining when uncertain. This problem is particularly acute for small language models (SLMs), where computational constraints and autonomous operation amplify the need for reliable uncertainty detection. We propose Second Guess, a lightweight, parameter-free prompting technique for abstention in multiple-choice question answering (MCQA) that is well-suited for SLMs. Our key empirical insight is that models which truly know an answer will select it consistently, while uncertain models exhibit unstable behavior when an “I don’t know” option is added. Evaluated on four open models (2B-8B parameters) and four benchmarks, Second Guess achieves the highest composite risk improvement of 10.81%. Notably, it maintains an 8% composite risk improvement on fine-tuned models where entropy-based methods degrade, and improves most for lower-performing models. All code and results required to reproduce this work are available at this https URL
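The stability check at the heart of this idea can be sketched in a few lines. This is one plausible reading of the abstract, not the authors' protocol: `ask` is a hypothetical callable that wraps a model and returns one of the offered options, and the two-query comparison rule is an assumption.

```python
IDK = "I don't know"

def second_guess(ask, question, options):
    """Return the model's answer, or None to abstain.

    ask      -- hypothetical callable (question, options) -> chosen option
    question -- the multiple-choice question text
    options  -- list of answer options, without an abstention choice
    """
    first = ask(question, options)            # original option set
    second = ask(question, options + [IDK])   # same set plus an abstention option
    # Abstain if the model takes the exit or changes its mind.
    if second == IDK or second != first:
        return None
    return first
```

Because the check only needs two prompts and no extra parameters, it fits the resource limits the abstract highlights for small models.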

[NLP-74] GeoMathCode: Understanding Interleaved Math-Code Reasoning for Geometry Problem Solving

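[Quick read]: This paper addresses how to bring the geometry problem solving of multimodal large language models (MLLMs) closer to human reasoning, in particular by adding auxiliary visual constructions (such as extra lines or points) that improve geometric interpretation and educational clarity. The key to its solution is GeoMathCode, in which programmatic representations serve as the intermediate visual outputs. An analysis of the latent space shows that reasoning steps and code-generation steps can be disentangled, that supervised fine-tuning (SFT) makes the reasoning manifold more structured and informative, and that hierarchical syntactic code structures emerge as disentangled latent subspaces carrying more mathematical symbolic information than visual representations.

Link: https://arxiv.org/abs/2605.25384
Authors: Yingji Zhang, Yong Dai, André Freitas
Affiliations: Idiap Research Institute; X-Humanoid; University of Manchester; CRUK Manchester Institute
Categories: Computation and Language (cs.CL)
Notes: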


Abstract:Mathematical reasoning is a hallmark of human intelligence, requiring logical deduction, symbolic manipulation, and abstract thinking. Recent multimodal large language models (MLLMs) have demonstrated strong performance on geometry problems through multi-step reasoning. To better emulate human problem-solving, intermediate steps can incorporate auxiliary visual constructions, such as additional lines or points, which improve geometric interpretation and educational clarity. In this work, we introduce GeoMathCode, in which programmatic representations serve as intermediate visual outputs. We further conduct an in-depth analysis of the underlying reasoning geometry. Experimental results show that reasoning and code generation steps can be disentangled in the latent space, while supervised fine-tuning (SFT) makes the reasoning manifold more structured and informative. Moreover, hierarchical syntactic code structures emerge as disentangled latent subspaces, and contain more mathematical symbolic information than visual representations.

[NLP-75] AuthTrace: Diagnosing Evidence Construction in Thematically Dense Single-Author Corpora

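[Quick read]: This paper addresses the fact that evidence construction systems (chunk retrieval, agent memory, knowledge-graph traversal, and thematic indexing) are evaluated on separate benchmarks with incompatible corpora and metrics, which makes cross-paradigm diagnosis impossible. The key to its solution is AuthTrace, presented as the first diagnostic benchmark to place all major paradigms on a single corpus and query set by exploiting the dual nature of single-author collections. Built on thematically dense corpora in which all texts share style, topic, and vocabulary, it provides 2,099 instances with exhaustive gold evidence and uses a fan-in gradient as the primary diagnostic axis. Experiments show that (1) evidence recall, not precision, is the dominant predictor of answer quality (r = 0.96); (2) fan-in exposes paradigm-specific collapse patterns, with flat retrieval degrading 3x faster than structured-evidence systems; and (3) full-context prompting fails uniformly, which establishes evidence construction as a necessary capacity beyond raw corpus exposure.

Link: https://arxiv.org/abs/2605.25382
Authors: Xiaoqing Wu, Feifei Li, Haoliang Ming, Wenhui Que
Affiliations: WeChat, Tencent Inc., Beijing, China
Categories: Computation and Language (cs.CL)
Notes: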


Abstract:Evidence construction systems (chunk retrieval, agent memory, knowledge-graph traversal, and thematic indexing) are evaluated on separate benchmarks with incompatible corpora and metrics, making cross-paradigm diagnosis impossible. We introduce AuthTrace, the first diagnostic benchmark that places all major paradigms on a single corpus and query set by exploiting the dual nature of single-author collections. Built on thematically dense corpora where all texts share style, topic, and vocabulary, AuthTrace provides 2,099 instances with exhaustive gold evidence and a fan-in gradient as the primary diagnostic axis. Comparing eight systems across two QA models, we find that (1) evidence recall, not precision, is the dominant predictor of answer quality (r = 0.96); (2) fan-in exposes paradigm-specific collapse patterns, with flat retrieval degrading 3x faster than structured-evidence systems; and (3) full-context prompting fails uniformly, establishing evidence construction as a necessary capacity beyond raw corpus exposure.
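Evidence recall and precision against exhaustive gold evidence, grouped by fan-in, can be sketched as below. This is an illustration under assumptions: fan-in is read here as the number of gold evidence items a query needs, which the abstract does not define, and the function names are hypothetical.

```python
from collections import defaultdict

def evidence_recall_precision(retrieved, gold):
    """Recall and precision of retrieved evidence IDs against gold IDs."""
    retrieved, gold = set(retrieved), set(gold)
    hits = len(retrieved & gold)
    recall = hits / len(gold) if gold else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

def recall_by_fan_in(instances):
    """Mean evidence recall per fan-in level.

    instances -- list of (retrieved_ids, gold_ids) pairs
    Fan-in is taken as len(gold_ids), an assumed reading of the term.
    """
    buckets = defaultdict(list)
    for retrieved, gold in instances:
        recall, _ = evidence_recall_precision(retrieved, gold)
        buckets[len(set(gold))].append(recall)
    return {k: sum(v) / len(v) for k, v in sorted(buckets.items())}
```

Plotting the output of `recall_by_fan_in` for each system would show how quickly recall falls as more evidence is required per query, which is the kind of degradation curve the abstract compares.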

[NLP-76] EfficientGraph-RAG: Structured Retrieval-State Management for Cross-Task Retrieval-Augmented Generation

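[Quick read]: This paper addresses the bottleneck that arises in complex retrieval when retrieval-augmented generation (RAG) systems organize evidence as flat chunks and search it with little structure: deciding where to search, moving from coarse topics to fine-grained entity-relation evidence, tracking which evidence has been verified, and reusing intermediate artifacts. The key to its solution is to treat these intermediate variables as an explicit retrieval state, managed by three coupled mechanisms: TAM defines a typed hierarchical state space over evidence, MARS updates and verifies the state through role-specialized agents, and SMP stores reusable state under hierarchy-aware access control. EfficientGraph-RAG ranks first on the reported answer-quality metrics averaged over three LongBench retrieval-style subsets, matches the strongest agentic baseline on HotpotQA EM while cutting large-model token usage by a factor of 3.51, and gives a low-token DocVQA result among retrieval-organizing cross-modal methods. Component analysis indicates that MARS is the main answer-quality driver, TAM supplies the typed traversal state and the Adaptive Routing signal, and SMP enables corpus-dependent reuse, with cross-query cache hit rates between 3.77% and 23.18%.

Link: https://arxiv.org/abs/2605.25379
Authors: Miaohe Niu, Lianlei Shan, Zhengtao Yu, Jingbo Zhu, Tong Xiao
Affiliations: Northeastern University (China); Tsinghua University; Kunming University of Science and Technology; NiuTrans Research
Categories: Computation and Language (cs.CL)
Notes: 19 pages, 5 figures, 14 tables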


Abstract:Retrieval-augmented generation (RAG) has become the standard way to ground large language models in external knowledge, but many systems still organize evidence as flat chunks and retrieve it through largely unstructured search. This weak structure becomes a bottleneck for complex retrieval: the system must decide where to search, how to move from coarse topics to entity-relation evidence, which evidence has been verified, and which intermediate artifacts can be reused. We define these intermediate variables as a retrieval state and study RAG as structured state management. EfficientGraph-RAG makes this state explicit through three coupled mechanisms: TAM defines a typed hierarchical state space over evidence, MARS updates and verifies the state through role-specialized agents, and SMP stores reusable state under hierarchy-aware access control. Using one shared framework configuration, EfficientGraph-RAG ranks first on the reported answer-quality metrics averaged over the three evaluated LongBench retrieval-style subsets, matches the strongest agentic baseline on HotpotQA EM while reducing large-model token usage by 3.51x, and provides a low-token DocVQA result among retrieval-organizing cross-modal methods. Component analysis shows role-specific mechanisms: MARS is the main answer-quality driver, TAM supplies the typed traversal state and Adaptive Routing signal, and SMP enables corpus-dependent reuse, with cross-query cache hit rates ranging from 3.77% to 23.18%.

[NLP-77] Learning to Route Languages for Multilingual Policy Optimization ICML2026

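[Quick read]: This paper addresses the limited diversity of training signals in multilingual LLM training, where existing policy optimization methods usually restrict each training question to a single response language or rely on a fixed dominant language for supervision. The key to its solution is language-routed policy optimization (LRPO), an online policy optimization framework that treats language as a selectable variable. LRPO elicits multilingual rollouts for each training question and feeds their relative quality into preference-based policy updates, which raises the diversity and informativeness of the training signal under a fixed rollout budget. A trainable language router, formulated as a multi-armed bandit, balances exploration of underutilized languages against exploitation of more informative ones. Experiments show consistent multilingual gains, indicating that adaptive language routing supports effective cross-lingual knowledge exploitation.

Link: https://arxiv.org/abs/2605.25360
Authors: Geyang Guo, Hiromi Wakaki, Yuki Mitsufuji, Alan Ritter, Wei Xu
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Notes: Accepted at ICML 2026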


Abstract:Large language models (LLMs) are trained on heterogeneous multilingual corpora, yet existing policy optimization methods often implicitly restrict each training question to a single response language or rely on a fixed dominant language for supervision. We propose language-routed policy optimization (LRPO), an online policy optimization framework that treats language as a selectable variable. LRPO elicits multilingual rollouts for each training question and integrates their relative quality into preference-based policy updates, increasing the diversity and informativeness of training signals under the fixed rollout budget. To adaptively determine which languages to explore during reinforcement learning, we introduce a trainable language router formulated as a multi-armed bandit, balancing exploration of underutilized languages with exploitation of more informative ones. Extensive experiments show that LRPO consistently improves multilingual performance, demonstrating that adaptive language routing enables effective cross-lingual knowledge exploitation for training. We release all the resources at this https URL.
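The exploration-exploitation trade-off of a bandit-style language router can be illustrated with the classic UCB1 rule. This is a stand-in, not LRPO's router: the abstract says only that the router is trainable and formulated as a multi-armed bandit, so the UCB1 choice, the class name, and the scalar reward are all assumptions.

```python
import math

class UCBLanguageRouter:
    """Picks a rollout language with UCB1 (an assumed bandit algorithm)."""

    def __init__(self, languages, c=1.0):
        self.languages = list(languages)
        self.c = c                                       # exploration strength
        self.counts = {l: 0 for l in self.languages}     # pulls per language
        self.values = {l: 0.0 for l in self.languages}   # running mean reward

    def select(self):
        # Try every language once before using confidence bounds.
        for lang in self.languages:
            if self.counts[lang] == 0:
                return lang
        total = sum(self.counts.values())

        def upper_bound(lang):
            bonus = math.sqrt(2 * math.log(total) / self.counts[lang])
            return self.values[lang] + self.c * bonus

        return max(self.languages, key=upper_bound)

    def update(self, lang, reward):
        # Incremental mean keeps memory constant per language.
        self.counts[lang] += 1
        self.values[lang] += (reward - self.values[lang]) / self.counts[lang]
```

In a training loop, `select` would choose the language for the next rollout and `update` would receive some measure of how informative that rollout was; what that measure should be is left open here.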

[NLP-78] AI-Associated Lexical Shifts Across 34 Languages: Cross-Lingual Convergence and Diachronic Uptake in News Writing EMNLP2026 ACL

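[Quick read]: This paper asks whether generative AI is also driving systematic lexical shifts in languages other than English, and whether those shifts are consistent across languages. The key to its solution is a refined split-halves continuation diagnostic that compares GPT-4.1 continuations with matched human gold-standard text in each language and ranks AI-overused lemmas by log prevalence ratio. The study finds semantic convergence across languages, with "emphasize"-type verbs appearing in 24 of 34 languages. The top AI-overused items become more prevalent in 26 of 34 languages from 2020-2021 to 2023-2024 (mean change +15.1%), while matched baseline words show no comparable increase (-4.5%). The findings are consistent with AI exerting a homogenising pressure on language use across languages.

Link: https://arxiv.org/abs/2605.25358
Authors: Thomas Stephan Juzek
Affiliations: Florida State University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Notes: 19 pages (9-page main body, plus references and appendices), 3 figures; ACL ARR reviewed, committed to EMNLP 2026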


Abstract:AI-associated lexical shifts have been documented mainly in Scientific English. We extend this work to 34 languages in the WMT News Crawl corpus, refining a split-halves continuation diagnostic that compares GPT-4.1 continuations with matched human gold-standard text. For each language, we derive ranked AI-overused lemmas using log prevalence ratios. We find substantial cross-lingual semantic convergence: semantically related concepts recur across typologically diverse languages, with ‘emphasize’-type verbs appearing in 24 of 34 languages. Embedding-based and manual analyses support this pattern. We also examine diachronic uptake in news writing before and after ChatGPT’s release. Tracking each language’s top 20 AI-overused items, we find prevalence increases in 26 of 34 languages from 2020-2021 to 2023-2024, with a mean change of +15.1%, whilst matched baseline words show no comparable increase (-4.5%). In 10 languages with longer historical coverage, longitudinal analyses show post-2022 increases that exceed the modest shifts observed in earlier periods, though with smaller effect sizes than in Scientific English. We validate our approach extensively, including across seeds, model variants, data sizes, model families, and more. Our findings are consistent with the view that AI-associated lexical preferences extend beyond English and may exert cross-lingual homogenising pressure on global language use.
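Ranking lemmas by log prevalence ratio can be sketched as follows. The definitions are assumptions made for illustration: prevalence is taken as the share of documents containing a lemma, a small smoothing constant avoids division by zero, and each document is represented as a set of lemmas. The paper's exact formulation is not given in the abstract.

```python
import math

def prevalence(docs, lemma):
    """Share of documents (sets of lemmas) that contain the lemma."""
    return sum(1 for doc in docs if lemma in doc) / len(docs)

def log_prevalence_ratio(ai_docs, human_docs, lemma, smoothing=1e-6):
    """Positive values mean the lemma is more prevalent in AI continuations."""
    p_ai = prevalence(ai_docs, lemma) + smoothing
    p_human = prevalence(human_docs, lemma) + smoothing
    return math.log(p_ai / p_human)

def rank_overused(ai_docs, human_docs, top_k=20):
    """Lemmas seen in AI text, sorted from most to least AI-overused."""
    vocab = set().union(*ai_docs)
    scores = {w: log_prevalence_ratio(ai_docs, human_docs, w) for w in vocab}
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(vocab, key=lambda w: (-scores[w], w))[:top_k]
```

The default `top_k=20` mirrors the abstract's tracking of each language's top 20 AI-overused items.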

[NLP-79] A general tensor-structured compression scheme for efficient large language models

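[Quick read]: This paper addresses the storage, memory, and compute overheads of the dense linear transformations that dominate large language models (LLMs), which hinder efficient adaptation and deployment and mask the functional impact of structural simplification. The key to its solution is Tensor Mixture (MixT), a general tensor-structured compression scheme that replaces targeted dense linear layers with natively executable mixtures of tensor operators. Because it operates on generic linear projections rather than model-specific components, it is potentially applicable to Transformer-based LLMs and other dense neural mappings. Under a unified recovery protocol on Qwen3-8B and LLaMA2-7B, MixT exposes a broad compressible regime in which MMLU accuracy is largely preserved, followed by an abrupt transition at model-specific boundaries that coincides with coordinated shifts in output entropy, prediction entropy, and inter-layer geometry. At the LLaMA2-7B transition boundary, MixT reduces full-model parameters by 47.5%, inference FLOPs by 37.1%, training FLOPs by 52.1%, and peak inference memory by 60.4%.

Link: https://arxiv.org/abs/2605.25344
Authors: Ying Lu, Peng-Fei Zhou, Qi-Xuan Fang, Pan Zhang, Shi-Ju Ran, Gang Su
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantum Physics (quant-ph)
Notes: 12 pages, 4 figures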


Abstract:Large language models (LLMs) are dominated by dense linear transformations, whose storage, memory and computational overheads hinder efficient adaptation and deployment while masking the functional impacts of structural simplification. Here we present Tensor Mixture (MixT), a general tensor-structured compression scheme that replaces targeted dense linear layers with natively executable mixtures of tensor operators. Operating directly on generic linear projections instead of model-specific components, MixT is potentially applicable across Transformer-based LLMs and other dense neural mappings. We evaluate MixT on Qwen3-8B and LLaMA2-7B under a unified recovery protocol, identifying a broad compressible regime in which MMLU accuracy is largely preserved before an abrupt transition at model-specific boundaries. This transition coincides with coordinated shifts in output entropy, prediction entropy and inter-layer geometry. At the LLaMA2-7B transition boundary, MixT reduces full-model parameters by 47.5%, inference FLOPs by 37.1%, training FLOPs by 52.1% and peak inference memory by 60.4%, demonstrating its practical potential for lower-cost LLM compression.

[NLP-80] MATO: Multi-objective Personalized Alignment with Test-time Optimization for Large Language Models

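[Quick read]: This paper addresses how to align LLMs with diverse and multifaceted user preferences in personalized AI systems. Existing methods either rely on costly training or need a pre-trained reward model for each preference, which makes it hard to follow evolving preferences. Prompt-based personalization needs no training but offers limited steerability and cannot reliably control the relative importance of objectives, so alignment is suboptimal when objectives conflict. The paper proposes MATO (Multi-objective personalized Alignment with Test-time Optimization), a training-free framework that casts personalization as a test-time optimization problem. A reward discovery module recovers preference rewards for objectives specified in natural language directly from the backbone LLM, and a weight optimization module adjusts objective weights dynamically from the user's initial preferences and the partially generated response. Rewards and weights together guide an online optimization over the token distribution during decoding. Across multiple datasets and backbone LLMs, MATO outperforms strong baselines, achieving Pareto-improving multi-objective alignment and stronger steerability.

Link: https://arxiv.org/abs/2605.25342
Authors: Linhao Luo, Thuy-Trang Vu, Van-Anh Nguyen, Junae Kim, Gholamreza Haffari, Dinh Phung
Affiliations: Monash University; Defence Science and Technology Group, Australia
Categories: Computation and Language (cs.CL)
Notes: Preprint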


Abstract:Aligning large language models (LLMs) with diverse and multifaceted user preferences is a fundamental challenge in personalized AI systems. Existing multi-objective alignment methods either rely on costly training or require pre-trained reward models for each preference, making it difficult for them to adapt to evolving preferences. Prompt-based personalization offers a training-free alternative, but prompting alone often provides limited steerability, as LLMs may overemphasize or overlook certain preferences and fail to give users reliable control over the relative importance of different objectives when conflicts arise, leading to suboptimal alignment. In this paper, we introduce MATO, a training-free framework for Multi-objective personalized Alignment with Test-time Optimization. MATO formulates personalization as a test-time optimization problem that steers the relative importance of multiple objectives through controllable weights during decoding, without modifying model parameters or requiring external reward models. Specifically, a reward discovery module recovers preference rewards directly from the backbone LLM for diverse objectives specified in natural language, while a weight optimization module dynamically adjusts objective weights based on the user’s initial preferences and the partially generated response to balance competing objectives during generation. The resulting rewards and weights jointly guide an online optimization procedure over the token distribution, enabling better alignment with the target objectives. Extensive experiments across multiple datasets and backbone LLMs show that MATO consistently outperforms strong baselines, achieving Pareto-improving multi-objective alignment and stronger steerability. These results highlight test-time optimization as a promising direction for scalable, controllable, and model-agnostic personalized alignment.
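One simple way to steer a next-token distribution with weighted objective rewards is exponential tilting: q(t) is proportional to p(t) * exp(sum_i w_i * r_i(t) / beta), which is the standard closed-form maximizer of expected weighted reward minus beta times KL(q || p). The sketch below uses that rule as a stand-in. It is not MATO's procedure, whose reward discovery and weight optimization modules are not specified in the abstract, and the function name, `beta`, and the toy reward tables are assumptions.

```python
import math

def steer_distribution(base_logprobs, rewards, weights, beta=1.0):
    """Tilt a token distribution toward weighted objectives.

    base_logprobs -- {token: log p(token)} from the backbone model
    rewards       -- {objective: {token: reward}} (assumed to be given)
    weights       -- {objective: weight}, the user-controllable knobs
    beta          -- strength of the pull back toward the base distribution
    """
    scores = {
        tok: lp + sum(weights[o] * rewards[o][tok] for o in weights) / beta
        for tok, lp in base_logprobs.items()
    }
    # Softmax with max subtraction for numerical stability.
    top = max(scores.values())
    z = sum(math.exp(s - top) for s in scores.values())
    return {tok: math.exp(s - top) / z for tok, s in scores.items()}
```

Changing only `weights` shifts probability mass between tokens favoured by different objectives without touching model parameters, which is the kind of steerability the abstract aims for.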

[NLP-81] P1SCO: Social Dimensions from a Perspectivist Lens

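[Quick read]: This paper addresses how to capture the diversity of social interactions and perceptions in social media comments, and how individual differences (such as personality and political affiliation) and platform shape social perception. The key to its solution is P1SCO, a dataset of comments from three distinct platforms, annotated along ten social dimensions and paired with rich annotator metadata (demographics, Big Five personality profiles, and political affiliation). The data are disaggregated to support analysis at the level of individual comments, annotators, and platforms. This design supports studies of cross-platform dynamics, inter- and intra-annotator agreement, and the influence of personality and political orientation on social interpretation.

Link: https://arxiv.org/abs/2605.25312
Authors: Amanda Cercas Curry, Gianmarco de Francisci Morales, Luca Maria Aiello
Affiliations: Intesa Sanpaolo Innovation Center, Turin; IT University of Copenhagen
Categories: Computation and Language (cs.CL)
Notes: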


Abstract:We introduce P1SCO, a dataset of social media comments collected from three distinct platforms, annotated according to ten social dimensions to capture the diversity of social interactions and perceptions. The dataset is carefully disaggregated to allow analysis at the level of individual comments, annotators, and platforms. In addition to the social dimension labels, we include rich metadata on the annotators, including demographics, Big Five personality profiles, and political affiliation. This combination of comment-level annotations and annotator-level features enables nuanced analyses of how social perception varies across platforms, individual differences, and demographic factors. By preserving the diversity of annotator perspectives, our dataset supports studies of inter- and intra-annotator agreement, the influence of personality and political orientation on social interpretation, and the cross-platform dynamics of social discourse.

[NLP-82] Tool-Call Dependency Structure is Linearly Decodable in LLM Agent Residual Streams

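[Quick read]: This paper asks whether the run-time tool-call dependency graph of an LLM agent is represented inside the model, and how that structural information propagates through its activations. The key to its solution is a low-capacity edge probe on the residual stream of Qwen3-32B, which decodes dependencies between tool calls well above both a random-label control and a positional baseline. A counterfactual contrast between value corruption and structural perturbation indicates that the signal tracks abstract topology rather than identifier values, and it replicates under an independent, non-substring oracle. The non-positional component replicates on three further interactive multi-hop benchmarks but attenuates as call order alone becomes a sufficient proxy for dependency, vanishing in single-shot planning. Per-layer activation patching shifts the probe at a later, non-patched boundary, which suggests that the representation propagates rather than being passively read out. The authors describe this as the first structural probe of an agent's run-time tool-call dependency graph; the claims concern representation rather than behavioural control, and span two model families and one primary domain.

Link: https://arxiv.org/abs/2605.25310
Authors: Tianda Sun, Dimitar Kazakov
Affiliations: University of York
Categories: Computation and Language (cs.CL)
Notes: 16 pages, 7 figures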


Abstract:Tool-using LLM agents produce trajectories whose calls form a directed dependency graph: earlier tool outputs supply arguments to later calls. Whether this execution structure is represented inside the model is unknown; prior structural probes have targeted static code or chain-of-thought text, not an agent’s run-time call graph. A low-capacity edge probe on the residual stream of Qwen3-32B decodes the tool-call dependency graph well above both a Hewitt–Liang random-label control and a positional baseline. A counterfactual contrast between value corruption and structural perturbation indicates the signal tracks abstract topology rather than identifier values, and replicates under an independent, non-substring oracle. The non-positional component replicates on three further interactive multi-hop benchmarks and attenuates as call order alone becomes a sufficient proxy for dependency, vanishing in single-shot planning. Per-layer activation patching shifts the probe at a later, non-patched boundary, evidence that the representation propagates rather than passively reads out, though the realised tool call does not move. To our knowledge this is the first structural probe of an LLM agent’s runtime tool-call dependency graph. Our claims concern representation, not behavioural control, and span two model families and one primary domain.

[NLP-83] Eureka: Intelligent Feature Engineering for Enterprise AI Cloud Resource Demand Prediction DASFAA2026

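[Quick read]: This paper addresses the dependence of traditional feature engineering on domain expertise, which limits how well it scales across applications. Its central idea is to frame feature engineering as an agentic code-generation problem: features are not static data transformations but executable programs that can be generated, evaluated, and iteratively improved. The proposed framework, Eureka, has three stages: (1) an Expert Agent, fine-tuned with SFT, produces structured feature design plans; (2) an LLM Feature Factory turns each plan into runnable Python code through chain-of-thought reasoning; (3) a Self-Evolving Alignment Engine applies GRPO-based reinforcement learning with a dual-channel reward (metric-based utility plus semantic alignment) to improve code quality. Expressing features as programs lets the learned generation patterns transfer across domains. Eureka outperforms traditional AutoFE and LLM-based baselines on 7 public benchmarks and proves effective for cloud GPU resource demand prediction at Alibaba Cloud.

Link: https://arxiv.org/abs/2605.25297
Authors: Hangxuan Li, Renjun Jia, Xuezhang Wu, Yunjie Qian, Zeqi Zheng, Xianling Zhang
Affiliations: Alibaba Cloud Computing Co. Ltd.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 13 pages, accepted at DASFAA 2026 (International Conference on Database Systems for Advanced Applications)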


Abstract:Effective features are crucial for predictive model performance, but creating them often requires domain expertise, limiting scalability across applications. We define feature engineering as an agentic code generation problem: features are not static data transformations, but executable programs that can be generated, evaluated, and iteratively improved. We present Eureka, an LLM-driven framework with three stages. (1) An Expert Agent, fine-tuned via SFT on domain knowledge, produces structured feature design plans in JSON format. (2) An LLM Feature Factory translates each plan into executable Python code through chain-of-thought reasoning, turning feature hypotheses into runnable programs. (3) A Self-Evolving Alignment Engine uses Reinforcement Learning (GRPO) with dual-channel reward (metric-based utility + semantic alignment) to enhance code quality. By expressing features as programs, the learned generation patterns can transfer across domains. Evaluated on 7 public benchmarks in healthcare, finance, and social domains, Eureka consistently outperforms both traditional AutoFE and LLM-based baselines. We further demonstrate Eureka’s effectiveness on cloud GPU resource demand prediction at Alibaba Cloud, where Eureka improves demand fulfillment rate by 16% and lowers computing resource migration rates by 33%.

[NLP-84] Knowing but Not Showing: LLMs Recognize Ambiguity but Rarely Ask Clarifying Questions

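【Quick Read】: User queries are often underspecified and admit several valid interpretations, yet current LLMs tend to answer directly instead of surfacing the ambiguity with a clarifying question, which can produce inaccurate or irrelevant answers. The paper's key contribution is to separate a model's ability to recognize ambiguity from its actual behavior: models often identify ambiguity when explicitly asked to judge it, but in the standard QA setting they overwhelmingly default to direct answers. Retrieved context widens the gap further, improving answerability while making models even less likely to ask clarifying questions. Closing this gap between recognition and action is therefore central to making such assistants more reliable.

Link: https://arxiv.org/abs/2605.25284
Authors: Jinyan Su, Claire Cardie
Affiliations: Cornell University
Subjects: Computation and Language (cs.CL)
Comments: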

Abstract:User queries are often underspecified and may admit multiple valid interpretations. Rather than silently making assumptions about the user’s intent, a helpful assistant should surface such ambiguity by asking a clarifying question. Doing so requires two abilities: recognizing that a query is ambiguous, and acting on that recognition by seeking clarification instead of answering directly. To study these abilities, we evaluate models on ambiguous, unambiguous, and disambiguated questions in three settings: standard question answering, explicit ambiguity judgment, and behavioral analysis, where a judge model classifies responses as direct answers, refusals, or clarifying questions. We find a clear gap between recognition and behavior: models often identify ambiguity when explicitly asked to judge it, yet in the QA setting they overwhelmingly default to direct answers. Retrieved context further widens this gap by improving answerability while making models even less likely to ask clarifying questions.

[NLP-85] READER: Reasoning-Enhanced AI-Generated Text Detection

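【Quick Read】: This paper addresses the sharp performance drop of current AI-generated text detectors under distribution shift, together with their lack of interpretability. Existing methods rely on supervised neural classifiers that perform well in distribution but generalize poorly and are opaque. The proposed solution is READER, a reasoning-enhanced detector that outputs both a human/AI label and a structured rationale describing the evidence for its decision. Its key component is READ, a curated supervision set of rationales and verdicts; a small language model (1.5B parameters) is fine-tuned on READ so that it reasons before detecting at inference time. Despite being 100 to 1000 times smaller than the prompted LLM baselines (GPT-5.2, Gemini-3-Pro, and DeepSeek-V3.2), READER consistently outperforms them as well as existing detectors.

Link: https://arxiv.org/abs/2605.25281
Authors: Pingfan Su, Kai Ye, Shijin Gong, Erhan Xu, Jin Zhu, Giulia Livieri, Chengchun Shi
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: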

Abstract:Recent advances in large language models (LLMs) have made it increasingly difficult to distinguish human-written text from AI-generated content. Many existing detectors train supervised neural classifiers that achieve strong in-distribution performance but are often opaque and can degrade substantially under distribution shift. We present READER, a reasoning-enhanced AI text detector that outputs both a human/AI label and a structured rationale describing the evidence for its decision. A key component of our approach is READ, a curated supervision set of rationales and verdicts. We fine-tune an LLM on READ to build READER, which reasons before detecting at inference time. Despite having only 1.5B parameters, READER consistently outperforms existing detectors as well as prompted, high-capacity LLM baselines (GPT-5.2, Gemini-3-Pro, and DeepSeek-V3.2), which are 100 to 1000 times larger in scale.

[NLP-86] Mimir: Large-scale Multilingual Concept Modeling

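【Quick Read】: This paper examines a limitation of the token-based language modeling paradigm: fine-grained tokens may be a poor unit for capturing higher-level meaning, and it remains underexplored whether modeling at a coarser granularity, such as concepts, could advance the field. The key idea is Concept Modeling: the input is changed from tokens to concepts and the model is trained for next-concept rather than next-token prediction, which forces it to work at the level of broad concepts. The authors introduce Mimir, a 1.6B-parameter multilingual concept model trained on a pre-training corpus of about 38.9 billion sentences spanning 46 languages and on a multi-turn instruction-tuning dataset covering 35 languages, and they evaluate it against a language model with a comparable number of parameters.

Link: https://arxiv.org/abs/2605.25263
Authors: Elio Musacchio, Lucia Siciliani, Pierpaolo Basile
Affiliations: University of Bari Aldo Moro
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: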

Abstract:Current language modeling approaches are built around tokens. Text corpora are split into tokens, and models are trained by performing computations on these tokens, such as predicting the next token given the preceding ones as context. This paradigm has become the standard in modern language modeling, especially given the outstanding performance obtained by token-based architectures. However, recent works have not only begun to question how language models process and understand meaning from tokens, but also to question whether using higher levels of granularity could advance the research field. This led to the idea of Concept Modeling, that is, to directly train models for next-concept prediction rather than next-token prediction. The goal is to change the input from tokens to concepts, forcing the underlying language model to shift its granularity from fine-grained tokens to broad concepts. In this work, we introduce Mimir, a 1.6B Large Concept Model trained for multilingual concept understanding and generation. We leverage a large-scale multilingual pre-training corpus (38,883,987,240 sentences) spanning 46 languages and a large-scale multi-turn and multilingual instruction-tuning dataset (66,816,428 sentences) covering a total of 35 languages. We extensively evaluate model performance against a language model with a comparable number of parameters.

[NLP-87] Inference Time Optimization with Confidence Dynamics ICML2026

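【Quick Read】: This paper studies how model uncertainty can be used to improve answer accuracy in LLM inference. Current inference-time optimization techniques such as repeated sampling generate several answers and vote over them, but they ignore how the model's confidence evolves along a reasoning trajectory. The key contribution is Confidence Dynamic Gain (CDG) based voting. The authors observe that correct reasoning traces tend to show confidence that rises over time (positive confidence gain), whereas incorrect traces show attenuated or declining confidence; CDG voting incorporates this confidence trajectory when selecting among sampled answers. Experiments across several open-source models and benchmarks show gains over baselines, indicating a robust discriminative signal for answer selection in LLM reasoning.

Link: https://arxiv.org/abs/2605.25244
Authors: Yu Wang, Minghao Liu, Jiayun Wang, Jinrui Huang, Ankit Shah, Wei Wei
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Published in ICML 2026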

Abstract:Inference time optimization techniques, such as repeated sampling, have significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, the critical role of model uncertainty remains largely underexplored in these optimization strategies. In this paper, we investigate the dynamics of confidence along reasoning trajectories and for the first time reveal a surprising and unique pattern: correct answer traces tend to exhibit confidence improvement over time (positive confidence gain), while incorrect traces show attenuated or declining confidence as reasoning proceeds. Based on this observation, we propose Confidence Dynamic Gain (CDG) based voting, which incorporates how the confidence trajectory of the response evolves along the reasoning chain. Experiments across four open-source architectures (DeepSeek-R1, gpt-oss, Gemma-3, Qwen-QwQ) on the AIME24/25, HMMT25, and BRUMO25 benchmarks demonstrate that CDG yields a significant performance boost over baselines. These results demonstrate that our method provides a robust discriminative signal for improving answer selection in LLM reasoning. We also provide theoretical insights for this phenomenon. Code will be released at this https URL.
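
The abstract does not give the exact CDG formula, so the following is only a minimal sketch of the general idea of voting with a confidence-trajectory signal. It assumes a hypothetical definition of gain (mean confidence in the second half of a trace minus the mean in the first half) and a hypothetical weighting (`exp(gain)`); both are illustrative choices, not the paper's method.

```python
from collections import defaultdict
from math import exp

def confidence_gain(confidences):
    """Late-half mean minus early-half mean (assumed definition; needs >= 2 steps)."""
    half = len(confidences) // 2
    early, late = confidences[:half], confidences[half:]
    return sum(late) / len(late) - sum(early) / len(early)

def gain_weighted_vote(traces):
    """traces: list of (answer, per-step confidences). Returns the top-weighted answer."""
    totals = defaultdict(float)
    for answer, confidences in traces:
        # Traces whose confidence rises get more weight than traces whose confidence falls.
        totals[answer] += exp(confidence_gain(confidences))
    return max(totals, key=totals.get)

# Made-up traces: one rising-confidence trace for "42", two falling traces for "17".
traces = [
    ("42", [0.2, 0.3, 0.8, 0.9]),
    ("17", [0.9, 0.8, 0.4, 0.3]),
    ("17", [0.8, 0.7, 0.5, 0.4]),
]
winner = gain_weighted_vote(traces)
```

In this toy example plain majority voting would pick "17", while the gain-weighted vote picks "42" because its single trace shows a strong confidence increase.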

[NLP-88] JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment

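【Quick Read】: This paper asks how expert judgment can be elicited and used as a supervision signal for evaluating generative AI outputs in high-expertise domains that lack verifiable ground truth. Two methodologies dominate, rubric-based scoring and comparative (pairwise) judgment, but the choice between them is rarely justified and they have not been compared systematically. The key contribution is JudgmentBench, a benchmark of 30 real-world legal tasks with 1,539 rubric scores and 1,530 pairwise preference judgments from practicing attorneys. It is the first public dataset in a high-expertise domain in which both signals are collected from the same experts on the same items. Empirically, comparative judgments recover the intended quality ordering far better than rubric scores (mean Spearman's rank correlation of 0.908 vs. 0.150) while taking less than half the annotation time, and the pattern holds for both human annotators and LLM autograders. The paired data also supports a broader research agenda on how expert judgment should be elicited, aggregated, and used as supervision.

Link: https://arxiv.org/abs/2605.25240
Authors: Russell Yang, Ruishi Chen, Pierce Kelaita, Riya Ranjan, Sibo Ma, Charles Dickens, Matthew Guillod, Megan Ma, Julian Nyarko
Affiliations: Stanford University; Snorkel AI; Harvey
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 37 pages, 9 figures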

Abstract:Two methodologies dominate current practices of benchmarking: rubric-based scoring evaluates items against predefined criteria, whereas comparative judgment elicits pairwise preferences between outputs. Although both methodologies are widely used, the choice between them is rarely justified. We release JudgmentBench, a benchmark of 30 real-world legal tasks, paired with 1,539 rubric scores and 1,530 pairwise preference judgments collected from practicing attorneys–including at major U.S. law firms–with substantial experience. The annotations constitute the first publicly available dataset in a high-expertise domain in which both supervision signals are elicited from the same experts on the same items. Using LLM-generated outputs at three constructed quality levels, we provide an initial empirical comparison: comparative judgments recover the intended quality ordering substantially better than rubrics (mean Spearman’s rank correlation of 0.908 vs. 0.150, estimated difference = 0.758 [0.494, 1.021]) while requiring less than half the annotation time. The patterns hold for human annotators and LLM autograders. Beyond this initial comparison, the paired structure of the dataset supports a broader research agenda on how expert judgment should be elicited, aggregated, and used as supervision in domains without verifiable ground truth.
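
The headline comparison above is stated in terms of Spearman's rank correlation between the intended quality level of each output and the ordering an evaluation method recovers. As a reminder of what that statistic measures, here is a small self-contained sketch (Pearson correlation computed on average ranks). The example inputs are made up and are not data from JudgmentBench.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho; undefined (division by zero) if either input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

intended_level = [1, 2, 3]           # low, medium, high constructed quality
perfect_method = [10, 20, 30]        # hypothetical scores that preserve the ordering
inverted_method = [3, 2, 1]          # hypothetical scores that reverse it
```

A value near 1 means the method recovers the intended ordering, a value near 0 means it carries almost no ordering information, and a value near -1 means it reverses the ordering.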

[NLP-89] From Automation to Collaboration: Human-in-the-Loop Methods for Safe and Trustworthy NLP

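【Quick Read】: This survey addresses the safety and trustworthiness problems that large language models (LLMs) face in high-stakes NLP tasks, including bias, hallucination, adversarial vulnerability, and unreliable generalization. It centers on human-in-the-loop methods that shift NLP from pure automation toward human-machine collaboration, with human experts supporting auditing, robustness evaluation, data construction, and model steering. The survey highlights gaps in scalable probing, sustainable robustness benchmarks, low-resource settings, and governance of private systems, and it outlines practical research directions for adaptive auditing, collaborative evaluation, and accountable deployment.

Link: https://arxiv.org/abs/2605.25226
Authors: Most. Sharmin Sultana Samu, MD. Tanvir Ahmed Seum, Md. Rakibul Islam
Affiliations: BRAC University; Rajshahi University of Engineering and Technology
Subjects: Computation and Language (cs.CL)
Comments: Preprint, manuscript under review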

Abstract:Large language models are widely deployed in high-stakes NLP tasks, yet risks such as bias, hallucination, adversarial vulnerability and unreliable generalization remain. Probe-based auditing reveals inconsistencies in model behavior. Adversarial text generation uncovers robustness gaps, especially in lower-resourced languages with limited benchmarks. Enterprise text-to-SQL settings expose the difficulty of validating outputs over private and large-scale databases. Human supervision is essential for probe validation, adversarial verification and domain-specific annotation, but it is costly and hard to scale. This survey examines recent human-in-the-loop methods that shift NLP from automation toward collaboration for safety and trustworthiness. We review how human expertise supports auditing, robustness evaluation, data construction and model steering. Our findings highlight gaps in scalable probing, sustainable robustness benchmarks, low-resource settings and governance of private systems. We outline practical research directions for adaptive auditing, collaborative evaluation and accountable deployment.

[NLP-90] They Are Not the Same: Direct Causes Are Not Grounded Emotion Explanations

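【Quick Read】: Emotion-Cause Pair Extraction (ECPE) is now often reduced to binary pair/non-pair prediction. This proxy is useful for extracting direct causes, but it is easily over-read as evidence-grounded emotion explanation. The paper's key contribution is to show the limits of that reading. In IEMO-MECP, 90.9% of original positives remain emotion-cause pairs and 95.0% of original negatives remain non-pairs, so the binary task is largely preserved. However, emo-context (an utterance that helps interpret a target emotion without directly causing it) appears on both sides of the original boundary and is concentrated near binary uncertainty, which means the binary boundary has no stable place for such discourse evidence. Across the evaluated models, direct triggers are recovered more reliably than contextual support. Under shortcut pressure, binary-trained models assign higher pair scores to nearby, lexically similar non-pair candidates than to evidence-supported but structurally harder emo-cause and emo-context pairs. High binary ECPE performance therefore does not mean that a model has produced a grounded explanation of the emotion.

Link: https://arxiv.org/abs/2605.25208
Authors: Zhuangzhuang Pan, Yan Xia, Chee Seng Chan
Affiliations: Universiti Malaya, Malaysia; Suzhou University of Technology, China; VinUniversity, Vietnam
Subjects: Computation and Language (cs.CL)
Comments: 25 pages, 11 figures, 24 tables. Preprint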

Abstract:Emotion-Cause Pair Extraction (ECPE) was introduced to explain why an emotion occurs, but this goal is now often reduced to binary pair/non-pair prediction. This proxy is useful for direct-cause extraction, yet easy to over-read as evidence-grounded emotion explanation. We show that this interpretation is only partially valid. In IEMO-MECP, 90.9% of original positives remain emo-cause and 95.0% of original negatives remain non-pair, confirming that the binary ECPE task is largely preserved. The problem is that direct triggers alone do not constitute a grounded explanation. Emo-context, an utterance that helps interpret a target emotion without directly causing it, appears on both sides of the original boundary and is enriched near binary uncertainty, showing that the binary boundary has no stable place for such discourse evidence. Across evaluated ECPE models, direct triggers are recovered more reliably than contextual support. Under shortcut pressure, this imbalance becomes consequential. Binary-trained models assign higher pair scores to nearby lexically similar non-pair candidates than to evidence-supported but structurally harder emo-cause and emo-context pairs. Thus, pair scores can reward convenient attributions over grounded explanations. High binary ECPE performance indicates that a model can identify direct triggers; it does not indicate that the model has explained the emotion. Code is publicly available at this https URL.

[NLP-91] Clarification Is Not Enough: Post-Clarification Answering Remains the Bottleneck in Multi-Turn QA

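【Quick Read】: This paper addresses alignment in multi-turn question answering when user intent is underspecified or ambiguous. The underlying question is how effective preference elicitation can let a system adapt to diverse user values, communication styles, and contextual assumptions. The key step is to decompose preference elicitation into two components: a clarification policy, which decides whether to ask a clarifying question or answer directly, and post-clarification answering, which produces the final answer once the missing information is provided. On the PACIFIC benchmark, supervised fine-tuning rapidly improves the clarification policy, but final-answer accuracy remains substantially lower even when the model takes the correct action. Understanding and correctly interpreting the user's response is thus the main bottleneck in multi-turn QA systems.

Link: https://arxiv.org/abs/2605.25204
Authors: Jinyan Su, Jennifer Healey
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: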

Abstract:Pluralistic alignment requires systems to adapt to diverse user values, communication styles, and contextual assumptions. We believe that a foundational prerequisite for such alignment is enabling accurate preference elicitation from people when their intent is under-specified or ambiguous. We study the problem of preference elicitation in multi-turn question answering by decomposing the problem into two components: a clarification policy, which decides whether to ask a clarifying question or answer directly, and post-clarification answering, which produces the correct final answer once the missing information is provided. We show, using the PACIFIC benchmark, that supervised fine-tuning rapidly improves the clarification policy; however, final answer accuracy remains substantially lower even when the model takes the correct action. This gap indicates that understanding and correctly interpreting the user’s response is the critical bottleneck in multi-turn question-answering systems.

[NLP-92] GroupTravelBench: Benchmarking LLM Agents on Multi-Person Travel Planning

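【Quick Read】: Existing travel-planning benchmarks typically assume a single user and therefore do not evaluate an agent's ability to identify and resolve conflicts among multiple users. The paper introduces GroupTravelBench, the first benchmark for multi-user, multi-turn travel planning. Its main elements are: (1) 650 tasks synthesized from real user profiles, point-of-interest (POI) data, and ticket price data, divided into three difficulty levels; (2) three evaluated capabilities: elicitation (gathering each user's preferences through multi-turn dialogue), coordination (resolving conflicts among users through compromise or subgrouping), and planning (searching for plans that maximize overall group utility while maintaining fairness and feasibility); (3) an interactive sandbox environment with cached real-world tool data that enables reliable tool use and offline evaluation. Experiments show that even frontier LLMs still have substantial weaknesses in preference coverage and group fairness.

Link: https://arxiv.org/abs/2605.25200
Authors: Xiang Cheng, Yulan Hu, Lulu Zheng, Zheng Pan, Xin Li, Yong Liu
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; AMAP, Alibaba Group
Subjects: Computation and Language (cs.CL)
Comments: work in process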

Abstract:Travel planning is a realistic task for evaluating the planning and tool-use abilities of LLM agents. However, existing benchmarks typically assume only a single user, thereby avoiding one of the most challenging aspects of real-world scenarios: an agent’s ability to identify and resolve conflicts among multiple users. To address this gap, we introduce GroupTravelBench, the first benchmark for multi-user, multi-turn travel planning. Based on real user profiles, POI data, and ticket price data, we synthesize 650 tasks and divide them into three difficulty levels. Beyond standard abilities in single-user itinerary planning, such as multi-step reasoning and tool use, our benchmark further evaluates three key capabilities required for travel agents: (i) elicitation: proactively engaging in multi-turn dialogue to gather preferences from each user; (ii) coordination: resolving conflicts among users through compromise or subgrouping strategies; and (iii) planning: searching for travel plans that maximize overall group utility while maintaining fairness and feasibility. To simulate real-world conversational itinerary planning while enabling reliable tool use and offline evaluation, we build an interactive sandbox environment with cached real-world tool data. We evaluate a wide range of LLMs and find that even frontier models still show substantial weaknesses in preference coverage and group fairness. GroupTravelBench provides a practical and reproducible benchmark for advancing research on LLM agents for real-world travel planning.

[NLP-93] Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models

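【Quick Read】: This paper addresses reward hacking, where a model improves a proxy reward by exploiting shortcuts rather than solving the intended task. The key idea is to identify and suppress the optimization drift that leads to hacking: analyzing the dominant singular directions of parameter updates shows that reward-hacking runs exhibit substantially larger directional change than clean runs. Based on this observation, the authors propose trusted-direction projection, which constrains gradients to remain within a clean reference subspace, delaying shortcut exploitation and better preserving task performance.

Link: https://arxiv.org/abs/2605.25189
Authors: Wenlong Deng, Jiaji Huang, Kaan Ozkara, Yushu Li, Christos Thrampoulidis, Xiaoxiao Li, Youngsuk Park
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: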

Abstract:Reward hacking arises when a model improves a proxy reward by exploiting shortcuts rather than solving the intended task. We study this failure mode through the geometry of reinforcement learning updates in language models and argue that hacking emerges when optimization drifts away from a stable low-dimensional learning trajectory. We analyze this drift through dominant singular directions of parameter updates and show that reward-hacking runs exhibit substantially larger directional change than clean runs. Motivated by this observation, we introduce trusted-direction projection, which constrains gradients to remain within a clean reference subspace. Across reward-hacking experiments on mathematical reasoning, the proposed approach delays shortcut exploitation and better preserves task performance.
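
Constraining a gradient to a reference subspace amounts to an orthogonal projection. The sketch below shows that single operation, assuming the trusted subspace is already available as a small set of orthonormal direction vectors. How the paper actually obtains the subspace (for example from the dominant singular directions of clean updates) is not reproduced here, and the names `project_onto_subspace` and `trusted_basis` are hypothetical.

```python
def project_onto_subspace(gradient, basis):
    """Project `gradient` onto the span of `basis` (rows assumed orthonormal)."""
    projected = [0.0] * len(gradient)
    for direction in basis:
        # Coefficient of the gradient along this trusted direction.
        coeff = sum(g * d for g, d in zip(gradient, direction))
        for i, d in enumerate(direction):
            projected[i] += coeff * d
    return projected

# Toy 3-parameter model whose trusted subspace is the plane of the first two axes.
trusted_basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
raw_gradient = [0.5, -2.0, 3.0]
constrained_gradient = project_onto_subspace(raw_gradient, trusted_basis)
```

The component along the third axis, which lies outside the trusted subspace, is removed before the update is applied.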

[NLP-94] By Their Fruits You Will Know Them: Comparing Formalizations of Law by the Decisions They Encode EMNLP

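【Quick Read】: This paper asks how to systematically compare different formalizations of the same legal provision, particularly when an LLM generates the formalization directly from statutory text and its implicit interpretive choices have consequences that are hard to anticipate. The key is a comparison method based on node matching and a SAT solver: multiple formalizations are matched at the node level, a shared interface is derived for each pair from the matching, and a SAT solver enumerates the edge cases on which any two formalizations disagree. Selected edge cases are then verbalized into concrete factual scenarios that a legal expert can examine and act on. Applied to formalizations of ten EU provisions generated by nine frontier LLMs, the method shows that behavioral divergence is essentially uncorrelated with structural agreement and that the verbalized cases reveal qualitatively distinct types of disagreement, including some that mirror genuine controversies in the legal commentary.

Link: https://arxiv.org/abs/2605.25186
Authors: Julius Vernie, Matthias Grabmair
Affiliations: Technical University of Munich
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 23 pages, 17 figures, submitted to EMNLP PROC 2026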

Abstract:Formalizing legal provisions promises machine-accessible law and automated legal reasoning, and recent LLMs make it tempting to generate such formalizations directly from statutory text. However, any formalization makes implicit interpretive choices whose consequences are hard to anticipate, especially if an LLM is the author. We present a method for systematically comparing different formalizations of the same legal provision by their inferences on individual cases. Given multiple formalizations of a provision, we match them at the node level, derive a shared interface for each pair from the matching, and use a SAT solver to enumerate the edge cases on which any two formalizations disagree. Selected edge cases are then verbalized into concrete factual scenarios that a legal expert can examine and act on. We apply our method to formalizations of ten EU provisions generated by nine frontier LLMs. We find that behavioral divergence between formalizations is essentially uncorrelated with their structural agreement and that the verbalized cases reveal qualitatively distinct types of disagreement, including divergences that mirror genuine controversies in the legal commentary.
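
The core step, finding the cases on which two formalizations disagree, can be illustrated on a toy scale. The sketch below replaces the SAT solver with brute-force enumeration over a shared set of boolean facts, which is only practical for a handful of variables. The two formalizations and the fact names are invented for illustration and are not taken from the paper or from any real provision.

```python
from itertools import product

def disagreements(formalization_a, formalization_b, variables):
    """Yield every truth assignment on which the two formalizations decide differently."""
    for values in product([False, True], repeat=len(variables)):
        case = dict(zip(variables, values))
        if formalization_a(case) != formalization_b(case):
            yield case

# Hypothetical shared interface of boolean facts.
variables = ["personal_data", "consent", "legal_obligation"]

def formalization_a(case):
    # Reads the rule as: permitted with consent OR under a legal obligation.
    return case["personal_data"] and (case["consent"] or case["legal_obligation"])

def formalization_b(case):
    # Reads the rule more narrowly: permitted only with consent.
    return case["personal_data"] and case["consent"]

edge_cases = list(disagreements(formalization_a, formalization_b, variables))
```

Here the single edge case is personal data processed under a legal obligation without consent; a fact pattern like this is what would then be verbalized for a legal expert.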

[NLP-95] Knowledge Graph-Driven Expert-Level Reasoning for Neuroscience

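【Quick Read】: This paper asks whether expert-level neuroscience reasoning can emerge in a language model (LM) driven by a knowledge graph (KG) built from a single authoritative textbook, without relying on large, heterogeneous web-scale corpora. The pipeline is: build a high-quality KG from the textbook through a dual-LLM validation pipeline; expand it with a masked LM trained on the KG topology; generate multi-hop QA items (QA pairs plus reasoning traces) as synthetic supervision; fine-tune an LM exclusively on this KG-derived supervision; and apply reinforcement learning with path-derived KG signals as implicit reward models. The central hypothesis is that such a fine-tuned LM can surpass large language models (LLMs) in accuracy while using orders of magnitude fewer parameters, and the authors report that deep, mechanistic neuroscience understanding can be induced this way.

Link: https://arxiv.org/abs/2605.25183
Authors: Jake Stephen, Niraj K. Jha
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: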

Abstract:A knowledge graph (KG) is an abstraction that can be extracted from text corpora and used for in-depth reasoning. Prior work has leveraged KGs to fine-tune language models (LMs), enabling domain-specific superintelligence. In this work, we explore whether KG-driven in-depth reasoning capabilities can emerge in neuroscience using only information contained within a single authoritative textbook. The central hypothesis is that structured knowledge, when distilled into a high-quality KG and converted into KG-grounded question-answer (QA) supervision, is sufficient to produce expert-level reasoning through a fine-tuned LM that surpasses large language models (LLMs) in accuracy, while employing orders of magnitude fewer parameters. We construct a textbook-derived KG via a dual-LLM validation pipeline, expand it with a masked LM trained on the KG topology, generate multi-hop QA items, which include QA pairs and reasoning traces, to fine-tune an LM exclusively on KG-derived supervision, and apply reinforcement learning using path-derived KG signals as implicit reward models. Our results demonstrate that deep, mechanistic neuroscience understanding can be induced in the model without reliance on large, heterogeneous web-scale corpora. The KG-based synthetic neuroscience curriculum, which readers can quiz themselves on, and the fine-tuned LM are available at the following GitHub location: this https URL.

[NLP-96] Locality Matters for Training-Free Audio Token Compression in Audio-Language Models

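【Quick Read】: This paper targets the high inference cost of audio-language models (ALMs) when audio inputs are represented as long prefix-token sequences, which consume context budget, increase memory usage, and make deployment harder in resource-constrained or latency-sensitive settings. Existing training-free audio-token reduction methods mainly rely on fixed pooling, which is content-agnostic, or score-based pruning, which can preserve isolated salient tokens but discard nearby acoustic context. The paper proposes Local Temporal Bipartite Merging (LTBM), a training-free encoder-space compression method that merges similar nearby audio tokens under an explicit temporal window constraint. A controlled Global Merge variant is used to isolate whether temporal locality itself is a useful inductive bias. The experiments show a task-dependent effect: locality-aware merging is more favorable for audio captioning, especially under stronger compression, while global matching is more competitive for multiple-choice audio understanding. A cross-backbone validation further supports the captioning-side advantage of locality-aware merging.

Link: https://arxiv.org/abs/2605.25179
Authors: Jiale Luo, Xiaoyu Liang, Haoji Hu
Affiliations: Zhejiang University
Subjects: Computation and Language (cs.CL)
Comments: Preprint. 8 pages main text, 10 pages total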

Abstract:Audio-language models (ALMs) are increasingly used for audio captioning, question answering, and open-ended audio understanding, but their inference cost remains high when audio inputs are represented as long prefix-token sequences. These audio prefixes consume context budget, increase memory usage, and make deployment harder in resource-constrained or latency-sensitive settings. Existing training-free audio-token reduction methods mainly rely on fixed pooling or score-based pruning. Fixed pooling is content-agnostic, while score-based pruning can preserve isolated salient tokens but discard nearby acoustic context. We propose Local Temporal Bipartite Merging (LTBM), a training-free encoder-space compression method that merges similar nearby audio tokens under an explicit temporal window constraint. Beyond introducing LTBM, we use a controlled Global Merge variant to isolate whether temporal locality itself is a useful inductive bias for audio-token compression. Experiments on AudioCaps, Clotho, and MMAU with Qwen2-Audio show evidence of a task-dependent locality effect: locality-aware merging is more favorable for captioning at several compression settings, especially under stronger compression, while global matching is more competitive for multiple-choice audio understanding. A cross-backbone validation on Audio Flamingo 3 further supports the captioning-side advantage of locality-aware merging under moderate and aggressive compression.
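
To make the idea of window-constrained bipartite merging concrete, here is a minimal sketch. It assumes a common bipartite token-merging recipe (even-indexed tokens are candidates to merge into odd-indexed tokens, matched by cosine similarity and merged by averaging) with the added restriction that a match must lie within `window` positions. This is an illustrative reading of the abstract, not the authors' LTBM implementation, and all names are hypothetical.

```python
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def local_bipartite_merge(tokens, num_merges, window):
    """Merge up to `num_merges` even-indexed tokens into their most similar
    odd-indexed token that is at most `window` positions away."""
    candidates = []  # (similarity, source index, destination index)
    for i in range(0, len(tokens), 2):
        best = None
        for j in range(1, len(tokens), 2):
            if abs(i - j) <= window:
                sim = cosine(tokens[i], tokens[j])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        if best is not None:
            candidates.append(best)
    candidates.sort(reverse=True)  # most similar pairs are merged first
    groups = {j: [tokens[j]] for j in range(1, len(tokens), 2)}
    merged_away = set()
    for _, i, j in candidates[:num_merges]:
        groups[j].append(tokens[i])
        merged_away.add(i)
    output = []
    for idx, token in enumerate(tokens):
        if idx in merged_away:
            continue
        members = groups.get(idx, [token])
        # Average the destination token with everything merged into it.
        output.append([sum(col) / len(members) for col in zip(*members)])
    return output

audio_tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
compressed = local_bipartite_merge(audio_tokens, num_merges=2, window=1)
```

Setting `window` to the sequence length removes the locality constraint, which gives a rough analogue of the global-matching control condition the abstract describes.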

[NLP-97] Re-defining Humor Data Objects for AI Humor Research

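【Quick Read】: Most existing AI humor research treats humor as binary (either present or not present). This paper instead treats humor as a context-dependent social interaction accompanied by explanations. It defines a humor reasoning data object and iteratively refines a prompting strategy so that an LLM generates humor explanations that are effective for a general audience. The improved prompt reduces important errors, in particular by handling missing context, multi-modality, and transcript issues more carefully. Generation was then scaled to a large number of data objects, which have the potential to support data synthesis and data augmentation for AI humor research.

Link: https://arxiv.org/abs/2605.25171
Authors: Anna Arnett, Bang Nguyen, Meng Jiang
Affiliations: University of Notre Dame
Subjects: Computation and Language (cs.CL)
Comments: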

Abstract:In most existing AI humor research, humor was treated as either “present” or “not present.” We explore the concept of humor as a social interaction with context and explanations. During this project, we defined a humor reasoning data object and developed a way to prompt LLMs to generate an explanation of humor effective for the general population. We iterated from an earlier prompt to an improved prompt, found that the later version reduced important errors, and then scaled generation to a large number of data objects, which have the potential to enable data synthesis and data augmentation for AI humor research. Our main takeaway is that better prompting of an LLM improves humor explanation quality, especially by handling missing context, multi-modality, and transcript issues more carefully. These results establish a strong foundation for future work on AI understanding of humor as social behavior.

[NLP-98] STREAM: A Data-Centric Framework for Mining High-Value Task-Oriented Dialogues from Streaming Media

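【Quick Read】: This paper addresses the data scarcity that bottlenecks vertical-domain large language models (LLMs), specifically the shortage of complex, domain-specific task-oriented dialogues. Existing data acquisition pipelines face a trilemma: expert annotation is expensive, real-world service conversations are constrained by privacy and commercial restrictions, and static corpora quickly become stale. The proposed Stream framework mines authentic interaction signals from noisy public streaming media (live streams and short videos) and synthesizes service dialogues at scale by combining role-grounded persona construction with Conversational Blueprint construction; it also adopts retrieval-augmented generation (RAG) to support knowledge-aware responses. Based on Stream, the authors release StreamDial, a dataset of 87,498 dialogue sessions and 1,497,320 turns covering the Automotive, Restaurant, and Hotel domains. Each session is a structured quadruplet ⟨P_u, P_a, B, H⟩ that pairs the dialogue history with explicit user and agent personas and a Conversational Blueprint, capturing realistic service behaviors such as requirement mining, constraint conflicts, negotiation, and recovery. Evaluations show that StreamDial improves intrinsic dialogue quality over strong baselines, that models trained on it improve Dialogue State Tracking across backbones, and that multilingual transfer is encouraging.

Link: https://arxiv.org/abs/2605.25162
Authors: Liang Xue, Haoyu Liu, Cheng Wang, Pengyu Chen, Haozhuo Zheng, Yang Liu
Affiliations: Harbin Institute of Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: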

Abstract:Large language models for vertical domains are bottlenecked by the scarcity of complex, domain-specific task-oriented dialogues. Existing data acquisition pipelines face a persistent trilemma: expert annotation is expensive, real-world service conversations are constrained by privacy and commercial restrictions, and static corpora quickly become temporally stale. We propose Stream, a data-centric framework that leverages publicly available streaming media (live streams and short videos) to synthesize high-value service dialogues at scale. Stream mines authentic interaction signals from noisy streams and synthesizes conversations by integrating role-grounded persona construction with Conversational Blueprint construction; it further adopts retrieval-augmented generation (RAG) to support knowledge-aware responses. Based on Stream, we release StreamDial, a large-scale multi-domain dataset covering Automotive, Restaurant, and Hotel. StreamDial contains 87,498 dialogue sessions and 1,497,320 turns in total, with an average of 17.11 turns per session and a comparable scale across domains. Each session is organized as a structured quadruplet ⟨P_u, P_a, B, H⟩ that pairs dialogue history with explicit user/agent personas and a Conversational Blueprint, capturing realistic service behaviors such as requirement mining, constraint conflicts, negotiation, and recovery. Evaluations with automatic judges and downstream tasks show that StreamDial improves intrinsic dialogue quality over strong baselines, and models trained with StreamDial improve Dialogue State Tracking across backbones; we further report a completed human-evaluation set and encouraging multilingual transfer on Qwen3-8B under a controlled training budget. The data is released at this https URL.
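
The session quadruplet with user persona, agent persona, blueprint, and history maps naturally onto a small record type. The sketch below is one hypothetical way to represent it; the field names, the choice of plain strings for personas, and the list-of-stages blueprint are assumptions, since the abstract does not specify the released schema. The last line simply re-derives the reported average from the two totals given in the abstract.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Turn:
    speaker: str  # "user" or "agent"
    text: str

@dataclass
class DialogueSession:
    user_persona: str                                   # P_u (assumed to be free text)
    agent_persona: str                                  # P_a (assumed to be free text)
    blueprint: List[str]                                # B: ordered stages (assumed form)
    history: List[Turn] = field(default_factory=list)   # H: the dialogue turns

session = DialogueSession(
    user_persona="first-time buyer comparing family cars",
    agent_persona="dealership sales consultant",
    blueprint=["requirement mining", "constraint conflict", "negotiation"],
)
session.history.append(Turn("user", "I need seven seats but my budget is tight."))
session.history.append(Turn("agent", "Would a certified used model work for you?"))

# Sanity check of the reported average: 1,497,320 turns over 87,498 sessions.
average_turns = round(1_497_320 / 87_498, 2)
```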

[NLP-99] LLM Agent-Based Renewable Energy Forecasting Using Edge and IoT Data: A Review of Solar, Wind, Weather, and Grid-Aware Decision Support

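【Quick Read】: This review addresses the reliability of renewable energy generation forecasting. Solar and wind resources are inherently intermittent, and conventional forecasting pipelines are ill-equipped to fully exploit the large volume of real-time operational data produced by IoT and edge devices. The review examines how large language model (LLM) agents can help by integrating heterogeneous sensor streams, weather API data, historical generation records, grid constraints, and contextual reasoning into unified decision-support workflows. It proposes a six-layer taxonomy (data acquisition, preprocessing, feature engineering, model inference, uncertainty estimation, and natural-language reporting), identifies twelve open challenges, and recommends a research agenda centered on open benchmarks, physics-informed LLM grounding, and federated forecasting architectures.

Link: https://arxiv.org/abs/2605.25141
Authors: Pavan Manjunath, Thomas Pruefer
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: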

Abstract:Reliable forecasting of renewable energy generation is a foundational requirement for grid stability, energy trading, battery scheduling, and carbon-aware operational planning. Solar and wind resources are inherently intermittent: their output fluctuates with cloud cover, wind speed, atmospheric turbulence, seasonal patterns, and local terrain. The proliferation of IoT and edge devices, spanning smart meters, inverters, anemometers, pyranometers, weather stations, and grid-interface sensors, has created an unprecedented volume of real-time operational data that conventional forecasting pipelines are ill-equipped to exploit fully. This review investigates how large language model (LLM) agents can enhance renewable energy forecasting by integrating heterogeneous sensor streams, weather API data, historical generation records, grid constraints, and contextual reasoning into unified decision-support workflows. We survey classical forecasting methods, statistical time-series models, deep learning architectures, physics-hybrid approaches, and emerging LLM agent frameworks for explanation, uncertainty communication, and operator guidance. A six-layer taxonomy is proposed, covering data acquisition, preprocessing, feature engineering, model inference, uncertainty estimation, and natural-language reporting. The review identifies twelve open challenges spanning real-time deployment, model drift under distribution shift, uncertainty quantification, hallucination control in LLM agents, interoperability of edge hardware, and integration with energy management systems. The paper concludes by recommending a research agenda centred on open benchmarks, physics-informed LLM grounding, and federated forecasting architectures.

[NLP-100] Trust but Verify: Prover-Verifier Deliberation for Selective LLM Prediction

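【Quick Read】: This paper asks how to know reliably when a language model's answer is correct, so that a system can report high-confidence answers and abstain on uncertain ones. The key is prover-verifier deliberation (PVD), an inference-time protocol grounded in interactive proof theory. In each dialogue, a prover defends a candidate answer through checkable sub-claims while a verifier issues targeted challenges and returns Accept, Challenge, or Reject. Because frozen language models are imperfect provers and verifiers operating over a noisy channel, formal soundness and completeness guarantees do not transfer, so the protocol is characterized empirically through its coverage-precision behavior. Questions accepted with no answer revision, called Accept + No Change (ANC), form the high-confidence subset, which yields a precision gap of about 30 percentage points over the non-ANC complement. High precision on this subset can transfer across model families, while verifier strictness and domain competence largely determine the size of the selection gap.

Link: https://arxiv.org/abs/2605.25133
Authors: João Sedoc, Baotong Zhang, Dean Foster
Affiliations: New York University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: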

Abstract:Reliably knowing when a language model is correct is almost as important as being correct. We introduce prover-verifier deliberation (PVD), an inference-time protocol grounded in interactive proof theory, as a mechanism for selective prediction: the protocol produces both an answer and a structured confidence verdict, allowing a system to report high-confidence answers while abstaining on uncertain cases. In each dialogue, a prover defends a candidate answer through checkable sub-claims while a verifier issues targeted challenges and returns Accept, Challenge, or Reject. Because frozen language models are imperfect provers and verifiers operating over a noisy channel, formal soundness and completeness guarantees do not transfer; instead, we characterize the protocol empirically through its coverage-precision behavior. Our main experiment uses Claude Sonnet 4.6 as prover and Claude Haiku 4.5 as verifier on GPQA Diamond. Questions accepted with no answer revision, which we call Accept + No Change (ANC), are reported as the high-confidence subset; we evaluate this subset by its precision and coverage. ANC separates reliable from unreliable answers, yielding a ~30pp HC-Prec gap over the non-ANC complement. Robustness experiments with GPT and Gemini pairings show that high HC-Prec can transfer across model families, while verifier strictness and domain competence largely determine the size of the selection gap. On Humanity’s Last Exam, weaker prover-verifier pairings can collapse or invert the ANC signal, illustrating a practical failure mode when the verifier operates outside its effective region. Comparisons with self-consistency, universal self-consistency, multi-agent debate, and Reflexion suggest that prover-verifier deliberation supplies a distinct argument-defensibility signal for selective prediction.
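
The coverage-precision evaluation of the ANC subset is simple bookkeeping over per-question outcomes. The sketch below shows one way to compute it, assuming each question has been logged with the verifier's final verdict, whether the prover revised its answer, and whether the final answer was correct. The record layout and the toy data are hypothetical and are not results from the paper.

```python
def precision(rows):
    """Fraction of rows whose final answer is correct (NaN for an empty set)."""
    return sum(1 for r in rows if r["correct"]) / len(rows) if rows else float("nan")

def selective_metrics(records):
    """Split records into Accept + No Change (ANC) and the rest, then score both."""
    anc = [r for r in records if r["verdict"] == "Accept" and not r["changed"]]
    rest = [r for r in records if not (r["verdict"] == "Accept" and not r["changed"])]
    return {
        "coverage": len(anc) / len(records),     # share of questions answered confidently
        "anc_precision": precision(anc),         # accuracy on the high-confidence subset
        "non_anc_precision": precision(rest),    # accuracy on everything else
    }

# Made-up log of five deliberations.
records = [
    {"verdict": "Accept", "changed": False, "correct": True},
    {"verdict": "Accept", "changed": False, "correct": True},
    {"verdict": "Accept", "changed": True, "correct": False},
    {"verdict": "Challenge", "changed": False, "correct": False},
    {"verdict": "Reject", "changed": False, "correct": True},
]
metrics = selective_metrics(records)
```

The gap between `anc_precision` and `non_anc_precision` corresponds to the selection gap the abstract reports, and `coverage` shows how many questions the system would answer rather than abstain on.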

[NLP-101] Inference-Time Alignment of Diffusion Models via Trust-Region Iterative Twisted Sequential Monte Carlo

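Quick read: This paper addresses inference-time alignment of diffusion models: steering a base generative model toward high-reward outputs without updating its weights. Existing Sequential Monte Carlo (SMC) methods approximate the reward-tilted target distribution in a principled way, but their proposals remain tied to the base sampler, which leads to poor particle efficiency, weight degeneracy, and high-variance estimates. The key to the solution is Trust-Region Iterative Twisted Sequential Monte Carlo (TRI-TSMC): each iteration performs an exact KL-constrained update in path space, which has a closed-form solution via tempered importance reweighting, and then projects the result back onto the parameterized family of twisting functions by weighted maximum likelihood, giving more efficient and stable particle guidance. Theoretically, the authors show that the optimal twisting function has a value-function interpretation and yields a zero-variance sampler, and that the trust-region update follows an escort path toward the target distribution while reducing residual importance-weight variance. Empirically, TRI-TSMC improves the primary alignment objectives on discrete diffusion text generation and text-to-image generation under matched inference-time budgets.

Link: https://arxiv.org/abs/2605.25123
Authors: Weixin Wang, Yu Yang, Wei Deng, Pan Xu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: 34 pages, 6 figures, and 7 tables

Click to view abstract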

Abstract:We study inference-time alignment for diffusion-based generative models, aiming to steer a base model toward high-reward outputs without updating its weights. Recent Sequential Monte Carlo (SMC)-based steering methods approximate reward-tilted target distributions in a principled way, but their proposals remain largely tied to the base sampler. Since reward information is mainly used after propagation through particle reweighting and resampling, these methods can require large particle budgets and suffer from weight degeneracy and high-variance estimates. One way to reduce variance and improve particle efficiency is to iteratively learn twisting functions that provide look-ahead guidance, as in twisted SMC. However, existing learnable twisting methods are developed mainly for classical sequential inference and can be unstable when applied to diffusion-based alignment with high-dimensional state spaces and terminal, noisy, or black-box rewards. We propose Trust-Region Iterative Twisted Sequential Monte Carlo (TRI-TSMC), a trust-region framework for learning twisting functions in SMC-based inference-time alignment. Each iteration computes an exact KL-constrained update in path space, which admits a closed-form solution by tempered importance reweighting, and projects this target back to the parameterized twisted family by weighted maximum likelihood. Theoretically, we formalize the value-function interpretation of the optimal twisting function and show that it yields a zero-variance sampler. We prove that the trust-region update follows an escort path toward the target distribution, that the weighted maximum-likelihood update is a forward-KL projection, and that the path reduces residual importance-weight variance. Empirically, TRI-TSMC improves primary alignment objectives on discrete diffusion text generation and text-to-image generation under matched inference-time budgets.
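
The tempered importance reweighting step can be illustrated on a plain set of particle weights. The sketch below is not TRI-TSMC itself: it omits the twisting functions and the weighted maximum-likelihood projection. It assumes the trust region is measured as the KL divergence from the tempered weights to the uniform distribution over particles (the empirical proposal), and all function names are hypothetical.

```python
import math


def tempered_weights(log_w, beta):
    """Normalized weights proportional to exp(beta * log_w), computed stably."""
    scaled = [beta * lw for lw in log_w]
    m = max(scaled)
    unnorm = [math.exp(s - m) for s in scaled]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def kl_to_uniform(p):
    """KL(p || uniform) over the particle set; assumed here as the trust-region size."""
    n = len(p)
    return sum(pi * math.log(pi * n) for pi in p if pi > 0.0)


def trust_region_beta(log_w, epsilon, iters=50):
    """Largest beta in [0, 1] whose tempered weights keep KL <= epsilon (bisection)."""
    if kl_to_uniform(tempered_weights(log_w, 1.0)) <= epsilon:
        return 1.0  # the full importance-weighted target already fits the trust region
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl_to_uniform(tempered_weights(log_w, mid)) <= epsilon:
            lo = mid
        else:
            hi = mid
    return lo


# Toy log importance weights, invented for illustration.
log_w = [0.0, 1.0, 2.0, 5.0]
beta = trust_region_beta(log_w, epsilon=0.1)
weights = tempered_weights(log_w, beta)
```

Bisection is valid here because the KL divergence along the tempering path grows monotonically with `beta`, so a smaller `epsilon` produces a more cautious step toward the reward-tilted target.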

[NLP-102] Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth

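Quick read: This paper asks whether current metrics for chain-of-thought (CoT) faithfulness actually measure faithfulness. Answering that requires ground-truth labels, which are hard to obtain because a model's internal computations are not directly observable; existing benchmarks instead rely on proxies such as plausibility or importance, properties that are orthogonal to faithfulness and can mislead about whether a CoT can be trusted. The key to the solution is to construct tasks whose outputs reveal which intermediate computations must have produced them, and to build an automated labeling pipeline that yields ground-truth faithfulness labels at both the step level and the CoT level. On this basis the authors present BonaFide, a benchmark of 3,066 labeled CoTs spanning 13 tasks and 10 models, and run the first systematic evaluation of prominent faithfulness metrics. Most metrics perform near chance, show strong prediction biases, and degrade on longer CoTs. The best metric reaches only 0.70 AUROC at the CoT level, and a different metric reaches 0.59 at the step level; neither transfers across settings, and both carry a high computational cost.

Link: https://arxiv.org/abs/2605.25052
Authors: Yoav Gur-Arieh, Ana Marasović, Mor Geva
Affiliations: Tel Aviv University; University of Utah
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract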

Abstract:Chains of thought (CoTs) have become central in interpreting and auditing behaviors of large language models. Yet growing evidence suggests that these traces often fail to faithfully represent the computations behind a model’s predictions. Several faithfulness metrics have been proposed, but whether they indeed measure faithfulness remains unknown. Answering this requires ground-truth labels, which are hard to obtain since internal computations are not directly observable. Consequently, most works proposing metrics report only absolute scores or comparisons to prior metrics, and the few existing benchmarks rely on proxies like plausibility or importance, properties orthogonal to faithfulness that can mislead about whether a CoT can be trusted. We address this challenge by constructing tasks whose outputs reveal which intermediate computations must have produced them, and developing an automated labeling pipeline that yields ground-truth faithfulness labels at both the step and CoT level. Building on this methodology, we present BonaFide, a benchmark of 3,066 labeled CoTs across 13 tasks and 10 models, and use it to conduct the first systematic evaluation of prominent faithfulness metrics. Our experiments show that most metrics perform near chance, exhibit strong prediction biases and degrade on longer CoTs. The best metric reaches only 0.70 AUROC at the CoT level while another reaches 0.59 at the step level, with neither transferring across settings, while entailing prohibitively high computational cost. Our results expose fundamental gaps in current faithfulness evaluation and call for the development of more reliable and efficient metrics.
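
For readers unfamiliar with the headline number, AUROC measures how well a metric's scores rank faithful CoTs above unfaithful ones, with 0.5 meaning chance. The sketch below shows the generic computation, assuming binary ground-truth labels (1 = faithful); the scores and labels are invented and are not BonaFide data.

```python
def auroc(scores, labels):
    """AUROC as P(score of a positive > score of a negative); ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: a hypothetical metric's scores against ground-truth labels.
labels = [1, 1, 0, 0, 1, 0]
metric_scores = [0.9, 0.4, 0.5, 0.2, 0.7, 0.4]
value = auroc(metric_scores, labels)
```

A metric that assigns every CoT the same score gets exactly 0.5 under this definition, which is the reference point for the "near chance" finding.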

[NLP-103] TRACE: A taxonomy-grounded synthetic dataset for teaching-program generation and session interpretation in Applied Behavior Analysis

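Quick read: In Applied Behavior Analysis (ABA), real clinical session data is HIPAA-protected and bound by professional confidentiality rules, so it cannot be released as a training or evaluation corpus for generative AI models. The key to the solution is TRACE (Taxonomy-Referenced ABA Clinical Examples), a synthetic instruction-tuning dataset of 2,999 examples produced by a deterministic, taxonomy-driven generator grounded in the canonical ABA literature. It covers two tasks: teaching-program generation (Discrete Trial Training, Natural Environment Teaching, and Task Analysis) and multi-session behavioral interpretation (twelve trajectory patterns and thirteen target behaviors). Every example carries complete sampling provenance, that is, the exact taxonomy cells that produced it. This avoids privacy and compliance risk and gives research on generative AI for behavioral intervention a reproducible data foundation; the abstract notes that TRACE is a research artifact and has not been clinically validated.

Link: https://arxiv.org/abs/2605.25038
Authors: Festus Kahunla
Affiliations: Drexel University; Pombo Labs
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: 11 pages, 3 tables. Dataset: this https URL; code: this https URL

Click to view abstract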

Abstract:Applied Behavior Analysis (ABA) is a clinical discipline whose documentation, teaching programs and multi-session behavioral logs, is formulaic and high-volume, yet real session data is HIPAA-protected and bound by professional confidentiality rules, blocking the release of a training corpus. We present TRACE (Taxonomy-Referenced ABA Clinical Examples), a 2,999-example synthetic instruction-tuning dataset covering two ABA tasks: teaching-program generation across Discrete Trial Training, Natural Environment Teaching, and Task Analysis; and multi-session behavioral interpretation across twelve trajectory patterns and thirteen target behaviors. Every example is produced by a deterministic taxonomy-driven generator grounded in the canonical ABA literature, and every example carries complete sampling provenance, the exact taxonomy cells that produced it. The dataset is released under CC BY-NC 4.0 for data and MIT for code, with stratified train (2,549), validation (149), test (281), and sanity (20) splits. TRACE is a research artifact and has not been clinically validated.
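
A deterministic, taxonomy-driven generator with provenance could look roughly like the sketch below. Only the three teaching methods come from the abstract; the `skill` and `prompt_level` axes, the instruction template, and the seeding scheme are hypothetical stand-ins, not the released TRACE generator.

```python
import random

TAXONOMY = {
    # The three methods are named in the abstract.
    "method": ["Discrete Trial Training", "Natural Environment Teaching", "Task Analysis"],
    # The axes below are hypothetical, added only to make the sketch concrete.
    "skill": ["requesting items", "hand washing"],
    "prompt_level": ["full physical", "gestural"],
}


def generate_example(index, seed=0):
    """Build one example; the same (seed, index) always yields the same output."""
    rng = random.Random(seed * 1_000_003 + index)  # per-example RNG for reproducibility
    cells = {axis: rng.choice(values) for axis, values in TAXONOMY.items()}
    instruction = (
        f"Write a {cells['method']} teaching program for {cells['skill']} "
        f"using {cells['prompt_level']} prompts."
    )
    # Provenance records exactly which taxonomy cells produced the example.
    return {
        "id": index,
        "instruction": instruction,
        "provenance": {"seed": seed, "cells": cells},
    }


example = generate_example(3)
```

Storing the seed and the sampled cells next to each example is what makes any record regenerable and auditable after release.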

[NLP-104] Language Bias in LVLMs: From In-Depth Analysis to Simple and Effective Mitigation ICML2026

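Quick read: This paper targets hallucination in large vision-language models (LVLMs), where outputs are fluent but inconsistent with the image. The authors trace the problem to modality misalignment during training, which makes the model over-rely on text and neglect visual input, a tendency known as language bias. The key to the solution is two simple methods: Language Bias Regularization (LBR), which adds a regularization term during instruction tuning to balance multimodal understanding, and Language Bias Penalty (LBP), which penalizes language bias during Direct Preference Optimization (DPO) training. In the reported experiments, LBR consistently improves performance on more than ten general benchmarks, and LBP significantly reduces hallucination and improves trustworthiness, with no additional data or auxiliary models.

Link: https://arxiv.org/abs/2605.25036
Authors: Yangneng Chen, Jing Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by ICML 2026

Click to view abstract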

Abstract: Large Vision-Language Models (LVLMs) extend large language models with visual understanding, but remain vulnerable to hallucination, where outputs are fluent yet inconsistent with images. Recent studies link this issue to language bias, the tendency of LVLMs to over-rely on text while neglecting visual inputs. Yet most analyses remain empirical without uncovering its underlying cause. In this paper, we provide a systematic study of language bias and identify its root in modality misalignment during training. Our analysis shows that both Visual Instruction Tuning (VIT) and Direct Preference Optimization (DPO) often prioritize textual improvements, which may cause LVLMs to overly lean toward language modeling rather than balanced multimodal understanding. To address this, we propose two simple yet effective methods: Language Bias Regularization (LBR), which mitigates language bias through regularization during instruction tuning, and Language Bias Penalty (LBP), which penalizes language bias in the DPO training process. Extensive experiments across diverse models and benchmarks demonstrate the effectiveness of our approach. LBR consistently improves performance on over ten general benchmarks, while LBP significantly reduces hallucination and improves trustworthiness. Together, these methods not only mitigate language bias but also advance the overall alignment of LVLMs, all without introducing any additional data or auxiliary models. Our code is publicly available at this https URL.

[NLP-105] Privacy-Preserving Local Language Models for Longitudinal Data Retrieval in Chronic Dermatologic Disease: Implementation in Pemphigus Patients

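Quick read: Long-term follow-up of chronic dermatologic diseases such as pemphigus produces extensive longitudinal clinical documentation that is hard to review during routine visits, which increases clinician workload and the risk of missing critical history. The key to the solution is a locally deployed, privacy-preserving small language model (SLM) that retrieves structured clinical features from long-term follow-up records and generates longitudinal summaries. Across the retrieval tasks covering 56 clinical features, mean accuracy was 82.25%. Dermatologists rated the AI-generated summaries highly for overall quality, clinical accuracy, and usefulness (scores ranging from 7.93 to 8.50), and preferred the AI summary in 53.3% of evaluations.

Link: https://arxiv.org/abs/2605.25020
Authors: Abdurrahim Yilmaz, Ayşe Esra Koku Aksu, Duygu Yamen, Vefa Asli Erdemir, Mehmet Salih Gurel, Gulsum Gencoglan, Joram M. Posma, Burak Temelkuran
Affiliations: Imperial College London; Istanbul Research and Training Hospital; Istanbul Medeniyet University; Istanbul Medicana Atakoy Hospital
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract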

Abstract: Chronic dermatologic diseases such as pemphigus require long-term follow-up, generating extensive longitudinal clinical documentation that is difficult to review comprehensively during routine visits and increasing clinician workload as well as the risk of missing critical historical information. We evaluated whether a locally deployed, privacy-preserving small language model (SLM) could retrieve structured clinical features and generate longitudinal summaries from long-term dermatology follow-up records. In this retrospective case series, thirty pemphigus patients contributed 541 visit notes that were aggregated into full longitudinal records (89,336 words); 56 clinically relevant features were annotated by two expert dermatologists. The locally deployed SLM (Qwen3 4B Thinking 2507) was queried with each complete record to retrieve 56 features and generate one final report summary. Across 1,680 feature retrieval tasks, mean accuracy was 82.25%. Dermatologists’ ratings of AI-generated summaries were high for overall quality (8.23-8.47), clinical accuracy (7.93-8.20), and usefulness (8.47-8.50), with no significant inter-evaluator differences and an overall preference for AI summaries in 53.3% of evaluations. These findings suggest that privacy-preserving, locally deployed SLMs can outperform medical experts and reliably generate clinically meaningful longitudinal summaries. SLMs may support clinical decision-making when integrated with appropriate oversight.

[NLP-106] Better Faster: Harnessing Self-Improvement in Large Reasoning Models ICML2026

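Quick read: This paper addresses the performance degradation, and even model collapse, that self-improvement training can cause in large reasoning models (LRMs) because of data imbalance and overthinking. The key to the solution is the HSIR framework, which combines two simple mechanisms: a verify-then-exit sampling strategy that mitigates data imbalance by efficiently collecting accurate solutions for difficult queries, and an Intrinsic Diversity score that quantifies overthinking and filters out samples with redundant reasoning steps. The authors further propose H-GRPO, an enhanced GRPO algorithm that uses intrinsic diversity as an external reward to encourage concise and diverse reasoning through reinforcement learning. In the reported experiments, HSIR improves reasoning performance (up to +10.9% average gain) while reducing relative inference overhead by up to 42.4%.

Link: https://arxiv.org/abs/2605.24998
Authors: Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, Leszek Rutkowski, Dacheng Tao
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted by ICML 2026

Click to view abstract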

Abstract:Self-improvement training enables the large reasoning models (LRMs) to improve themselves by self-generating reasoning trajectories as training data without external supervision. However, we find that this method often falls short in complex reasoning tasks and even leads to model collapse. Through a series of preliminary analyses, we reveal two problems: (1) data imbalance, where most training samples are simple, but the challenging yet crucial samples are scarce; (2) overthinking, where many undesired samples with redundant reasoning steps are used for self-training. To this end, we propose HSIR, which effectively Harnesses Self-Improvement in large Reasoning models via two simple-yet-effective approaches. Specifically, HSIR introduces a verify-then-exit sampling strategy to mitigate data imbalance by efficiently collecting more accurate solutions for difficult queries, and designs an Intrinsic Diversity score to quantify overthinking and filter out the undesired solutions. We apply HSIR to various post-training paradigms, among which we further propose H-GRPO, an enhanced GRPO algorithm that leverages the intrinsic diversity as an external reward to encourage concise and diverse reasoning via reinforcement learning. Extensive results show that HSIR not only effectively enhances the reasoning performance, i.e., bringing up to +10.9% average performance gains, but also significantly improves the reasoning efficiency by reducing up to 42.4% relative inference overhead.

[NLP-107] Exploring Profiles of Cognitive Distortions Associated with Mental Health Disorders

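Quick read: Research on cognitive distortions has focused mainly on depression, and distortion patterns across other mental health conditions have not been compared systematically. To fill this gap, the authors examine how cognitive distortions are distributed across different mental health groups. The key to the solution is to apply two detectors, an n-gram-based lexical method and a fine-tuned Transformer model, to a large Reddit dataset with posts from nine self-reported mental health groups and a control group. The mental health groups show higher prevalence of cognitive distortions than the control group, with effect sizes ranging from small to moderate; distortion profiles are largely similar across conditions, although some groups show higher overall levels. The authors conclude that relatively simple lexical approaches can be useful for exploratory analysis of group-level trends in large-scale mental health text data.

Link: https://arxiv.org/abs/2605.24996
Authors: Alina Anikejeva, Kairit Sirts
Affiliations: University of Tartu
Categories: Computation and Language (cs.CL)
Comments: CLPsych 2026

Click to view abstract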

Abstract:Cognitive distortions, distorted patterns of thinking, have been increasingly studied in computational mental health research. Although they are related to many, if not all, mental health disorders, most existing studies focus primarily on depression. In this work, we explore distortion profiles across multiple mental health conditions. We analyzed a large Reddit-based dataset containing posts from nine self-reported mental health groups as well as a control group using both an n-gram-based method and a fine-tuned transformer model for detecting cognitive distortions. Mental health groups, both when pooled together and when examined individually, showed higher prevalence of cognitive distortions compared to the control group, with the effect sizes ranging from small to moderate. When comparing distortion profiles across conditions, we observed largely similar patterns, although some groups exhibited overall higher levels of distortions than others. These findings suggest that relatively simple lexical approaches can be useful for exploratory analyses of group-level trends in large-scale mental health text data.
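
An n-gram-based detector of the kind mentioned in the abstract can be approximated by phrase matching against a lexicon. The sketch below is a minimal illustration: the two-category `LEXICON` and its phrases are invented for the example and are not the lexicon used in the paper.

```python
import re

# Hypothetical lexicon: distortion label -> n-gram phrases (lowercase).
LEXICON = {
    "overgeneralization": ["i always", "i never", "everyone hates"],
    "catastrophizing": ["will never get better", "worst thing"],
}


def tokenize(text):
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"[a-z']+", text.lower())


def detect(text):
    """Return the set of distortion labels whose phrases occur in the text."""
    # Pad with spaces so phrases only match on whole-token boundaries.
    padded = " " + " ".join(tokenize(text)) + " "
    return {
        label
        for label, phrases in LEXICON.items()
        if any(f" {phrase} " in padded for phrase in phrases)
    }


def prevalence(posts):
    """Share of posts in a group that contain at least one detected distortion."""
    if not posts:
        raise ValueError("need at least one post")
    return sum(1 for post in posts if detect(post)) / len(posts)


group_rate = prevalence(["I always fail at this.", "Nice weather today."])
```

Comparing `prevalence` between a condition group and the control group gives the kind of group-level contrast the study reports. Effect sizes would need a separate statistical step that this sketch does not cover.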

[NLP-108] Large Language Model Selection with Limited Annotations

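Quick read: This paper addresses how to choose the best large language model (LLM) among many strong candidates efficiently, given that standard evaluation relies on costly annotations over fixed evaluation sets. The key to the solution is SELECT-LLM, a framework that uses a query selection rule based on expected information gain to pick a small set of queries whose annotations are most informative for identifying the best model, which reduces annotation cost. The rule is computed only from pairwise similarities between candidate model outputs and makes no assumptions about model architecture or access to weights, so it applies to both open-weight and black-box models. Across 23 datasets, 156 models, and diverse task families, SELECT-LLM improves over the strongest baseline in every setting, with annotation cost reductions of up to 81.8% for best-model selection and up to 84.78% for near-best-model selection.

Link: https://arxiv.org/abs/2605.24981
Authors: Yavuz Durmazkeser, Patrik Okanovic, Andreas Kirsch, Torsten Hoefler, Nezihe Merve Gürel
Affiliations: TU Delft; ETH Zurich
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 33 pages, 5 figures, 4 tables

Click to view abstract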

Abstract:Choosing a Large Language Model (LLM) for a given task requires comparing many strong candidates, yet standard evaluation relies on costly annotations over fixed evaluation sets. To address this challenge, we develop SELECT-LLM, the first framework for active model selection of LLMs. SELECT-LLM aims to find a small set of queries whose annotations are most informative for identifying the best LLM for a given task. To this end, we introduce a query selection rule based on expected information gain, computed from pairwise similarities between candidate model outputs. Because this rule only uses generated model responses, SELECT-LLM can be applied across candidate models without assumptions about their architecture or access to model weights. This makes it suitable for both open-weight and black-box LLMs. We evaluate SELECT-LLM across 23 datasets, 156 evaluated models, diverse task families, and multiple text evaluation metrics. Across all experiments, SELECT-LLM improves over the strongest baseline in every setting, with annotation cost reductions up to 81.8% for best model selection and up to 84.78% for near-best model selection.

[NLP-109] Universal Boosts Specific Suppressors: Sparse Autoencoder Steering of Medical Vision-Language Models

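Quick read: This paper targets hallucination in medical vision-language models (VLMs) that generate chest X-ray reports: fabricating findings that are not in the image, missing important ones, or locating them incorrectly. The key to the solution is decoding-time residual steering that requires no weight updates. The method uses Top-K sparse autoencoders (SAEs) on late layers, identifies features causally linked to clinical errors, and applies a combined suppress/boost intervention at inference time. On the MIMIC-CXR test split it improves report quality for three radiology VLMs (RadVLM, LLaVA-Rad, and CheXOne), with relative gains of +5.4% to +17.0% on the clinical composite metric and statistically significant GREEN gains on all backbones. Cross-model feature alignment shows that quality-promoting (boost) directions overlap strongly across architectures, whereas hallucination-linked (suppress) directions are model-specific, so transferable steering has to set suppression per backbone. The same recipe transfers zero-shot to IU-Xray with a +7.7% relative GREEN gain, which the authors take as evidence that the identified features are properties of the model rather than of the training corpus.

Link: https://arxiv.org/abs/2605.24977
Authors: Farhad Nooralahzadeh, Benjamin Gundersen, Nicolas Deperrois, Hidetoshi Matsuom, Mizuho Nishio, Thomas Frauenfelder, Ahmed Allam, Christian Blüthgen, Michael Moor, Michael Krauthammer
Affiliations: University of Zurich and University Hospital of Zurich, Switzerland; Kobe University, Japan; ETH AI Center, Switzerland; ETH Zurich, Switzerland; Stanford University, USA; Zurich University of Applied Sciences, Switzerland
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Click to view abstract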

Abstract: Medical vision-language models (VLMs) often hallucinate findings when generating chest X-ray reports: they fabricate findings that are not present in the image, miss important ones, or locate them incorrectly. We mitigate this without weight updates by decoding-time residual steering on a per-token sparse autoencoder (SAE) basis: Top-K SAEs on late layers, causal steering against clinical errors, then combined suppress/boost intervention at inference time. On the MIMIC-CXR test split, our inference-only method improves the quality of generated reports for three radiology VLMs (RadVLM, LLaVA-Rad, and CheXOne), with relative improvements of +5.4%, +7.2%, and +17.0% in the clinical composite metric, and statistically significant GREEN gains on all backbones. A cross-model feature alignment shows that the quality-promoting (boost) directions overlap strongly across architectures, whereas hallucination-linked (suppress) directions are model-specific. Therefore, transferable steering must treat suppression per-backbone, rather than sharing a universal suppress list. The same recipe transfers zero-shot to IU-Xray (GREEN +7.7% rel.) without retraining, confirming that the identified features are properties of the model, not of the training corpus. We release causal feature sets and an interactive feature dashboard: this https URL.
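
The suppress/boost intervention can be pictured on a single residual vector. The sketch below is a generic Top-K SAE steering example written in plain Python, not the authors' implementation: the tiny weight matrices, the `boost_scale` value, and the choice to add only the decoder-space difference back to the residual are all assumptions.

```python
def topk_encode(h, w_enc, k):
    """Top-K SAE encoder: ReLU pre-activations, keeping only the k largest."""
    pre = [max(0.0, sum(w * x for w, x in zip(row, h))) for row in w_enc]
    keep = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k]
    return [pre[i] if i in keep else 0.0 for i in range(len(pre))]


def decode(acts, w_dec):
    """Linear decoder: sum of feature directions weighted by their activations."""
    dim = len(w_dec[0])
    return [sum(a * row[j] for a, row in zip(acts, w_dec)) for j in range(dim)]


def steer(h, w_enc, w_dec, k, suppress=(), boost=(), boost_scale=2.0):
    """Zero the suppressed features, scale the boosted ones, and patch the residual."""
    acts = topk_encode(h, w_enc, k)
    edited = list(acts)
    for i in suppress:
        edited[i] = 0.0
    for i in boost:
        edited[i] = acts[i] * boost_scale
    before, after = decode(acts, w_dec), decode(edited, w_dec)
    # Adding only the difference leaves the SAE reconstruction error untouched.
    return [x + (a - b) for x, a, b in zip(h, after, before)]


# Toy SAE with 3 features over a 2-dimensional residual stream.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_DEC = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
steered = steer([1.0, 2.0], W_ENC, W_DEC, k=2, suppress=[2], boost=[1])
```

In this formulation, boosting a feature that did not survive the Top-K cut has no effect, because its activation is already zero. A real system might instead add a fixed multiple of the decoder direction.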

[NLP-110] MinerU-Popo: Universal Post-Processing Model for Structured Document Parsing

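Quick read: Current VLM-based OCR systems extract page-level elements (paragraphs, tables, images) with their bounding boxes and text, but they break cross-page continuity and fail to recover document-level logical structure interrupted by page boundaries, such as paragraphs and tables that span pages or the title hierarchy. The key to the solution is MinerU-Popo, a lightweight and universal post-processing framework with four main components: (1) it decomposes document-level reconstruction into four subtasks: text truncation recovery, table truncation recovery, title hierarchy reconstruction, and image-text association; (2) it builds a task-oriented data engine and fine-tunes a lightweight model (Qwen3-VL-4B) on 30K generated examples; (3) it introduces dynamic chunking with overlap-based synchronization to support long documents while preserving global consistency; (4) it assembles the outputs into a tree-structured document representation enriched with node chunking and summaries. In the reported experiments, MinerU-Popo improves title-hierarchy TEDS by at least 20% across all five tested OCR models, improves RAG accuracy, and reduces per-query latency.

Link: https://arxiv.org/abs/2605.24973
Authors: Bangrui Xu, Ziyang Miao, Xuanhe Zhou, Yiming Lin, Zirui Tang, Xiaomeng Zhao, Fan Wu, Cheng Tan, Fan Wu, Bin Wang, Conghui He
Affiliations: Shanghai Jiao Tong University; Institute of Automation, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: The code is available at this https URL

Click to view abstract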

Abstract:VLM-based OCR models have become the de facto choice for document parsing, as they can accurately extract page-level elements (e.g., paragraphs within individual pages) together with their bounding boxes and textual content. However, downstream applications such as RAG require coherent document-level information, whereas these models often break cross-page continuity and fail to recover disrupted structures, such as paragraphs and tables truncated by page boundaries. Such relationships are not confined to a single page; instead, they require joint analysis of titles, paragraphs, tables, and images spanning multiple pages. A natural solution is therefore to reuse existing OCR outputs and reconstruct document-level logical structures through post-processing. To this end, we propose MinerU-Popo, a lightweight and universal framework for POst-Processing OCR outputs, which converts page-level results from diverse parsers into coherent document-level structures. MinerU-Popo decomposes the problem into four focused subtasks: text truncation recovery, table truncation recovery, title hierarchy reconstruction, and image-text association. To address these effectively, we build a task-oriented data engine with task-specific input filtering, and use the generated data (30K) to fine-tune a lightweight post-processing model (Qwen3-VL-4B). To support long documents, we introduce dynamic chunking with overlap-based synchronization, which aligns chunk-level outputs from the fine-tuned model and preserves global consistency. Finally, we assemble the aligned outputs into a tree-structured document representation, further enriched with node chunking and summaries for downstream retrieval and analysis. Empirical results show MinerU-Popo improves title-hierarchy TEDS by at least 20% across all five tested OCR models, improves RAG accuracy and reduces per-query latency. 
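
The chunking half of "dynamic chunking with overlap-based synchronization" can be sketched as follows. This is a simplified illustration under stated assumptions: blocks are represented only by a length (for example a token count), the `budget` and `overlap` parameters are hypothetical, and the step that reconciles model outputs on the shared blocks is not shown.

```python
def chunk_with_overlap(block_lengths, budget, overlap):
    """Return (start, end) block ranges; consecutive chunks share `overlap` blocks."""
    chunks, start, n = [], 0, len(block_lengths)
    while start < n:
        end, used = start, 0
        # Grow the chunk until the budget is reached; always take at least one block.
        while end < n and (end == start or used + block_lengths[end] <= budget):
            used += block_lengths[end]
            end += 1
        chunks.append((start, end))
        if end == n:
            break
        # Step back by `overlap` blocks, but always make forward progress.
        start = max(end - overlap, start + 1)
    return chunks


# Toy document: six OCR blocks with made-up lengths.
ranges = chunk_with_overlap([40, 30, 50, 20, 60, 10], budget=100, overlap=1)
```

Because each chunk boundary falls on a block that also appears in the neighboring chunk, predictions for that shared block can be compared to align chunk-level outputs, which is the role the abstract assigns to synchronization.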

[NLP-111] Investigating the Interplay between Contextual and Parametric Chain-of-Thought Faithfulness under Optimization

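Quick read: Chain-of-Thought (CoT) faithfulness is usually evaluated under two disjoint paradigms: contextual faithfulness, measured by perturbing the input or the CoT trace, and parametric faithfulness, assessed by intervening on the model's parametric knowledge. Prior work compares the two only descriptively and offers no unified optimization framework. The paper proposes FaithMate, a unified preference-alignment interface for optimizing a model toward either paradigm, and uses it to study how the paradigms interact, including whether faithfulness gains transfer within and across them. The experiments show that the two paradigms are positively coupled but asymmetric: optimizing toward parametric faithfulness yields consistent gains under both paradigms, whereas optimizing toward contextual faithfulness gives more variable gains. Within the contextual paradigm, gains on one metric do not consistently transfer to others, which suggests that existing contextual metrics capture disjoint facets of faithfulness and involve inherent trade-offs. CoT faithfulness is therefore not a single objective and calls for multifaceted optimization and evaluation.

Link: https://arxiv.org/abs/2605.24960
Authors: Jingyi Sun, Qianli Wang, Pepa Atanasova, Nils Feldhus, Isabelle Augenstein
Affiliations: University of Copenhagen; Technische Universität Berlin; German Research Center for Artificial Intelligence (DFKI); BIFOLD – Berlin Institute for the Foundations of Learning and Data
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: The first two authors contributed equally and share first-authorship

Click to view abstract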

Abstract:Chain-of-Thought (CoT) faithfulness, i.e., whether CoTs genuinely reflect large language models’ (LLM) underlying behavior, is typically evaluated under two disjoint paradigms: contextual faithfulness, measured by perturbing the input or CoT trace, and parametric faithfulness, assessed by intervening on a model’s parametric knowledge. Yet prior work compares them only descriptively. We fill this gap by proposing FaithMate, a unified preference-alignment interface for optimizing models towards either faithfulness paradigm. It enables us to investigate the interplay between the two paradigms, examining whether and to what extent faithfulness gains generalize within and across paradigms. Across three models, two datasets, and six faithfulness metrics, we find that the two paradigms are positively coupled, yet asymmetric: optimizing towards parametric faithfulness yields consistent gains across both paradigms, whereas the contextual counterpart delivers more variable gains. Within the contextual paradigm, faithfulness gains on one metric do not consistently transfer to others, implying that existing contextual metrics capture disjoint facets of faithfulness and exposing inherent trade-offs. These findings imply that CoT faithfulness is not a monolithic objective and therefore requires multifaceted optimization and evaluation.

[NLP-112] SEP-Attack: A Simple and Effective Paradigm for Transfer-Based Textual Adversarial Attack

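Quick read: This paper addresses transfer-based textual adversarial attacks, in which adversarial examples are generated with surrogate models, without access to the victim model, and must still transfer to it. Existing methods often perform poorly because they weight all submodels equally or estimate importance scores inaccurately. The key to the solution is SEP-Attack, a new paradigm that uses a Determinantal Point Process (DPP) to generate diverse surrogate ensemble weights that represent the transferability of each submodel. With these weights, it introduces a new metric for evaluating prediction confidence, uses that metric to compute word importance scores and generate adversarial candidates, and then selects the candidates with the highest transferability scores as the final adversarial examples. On four datasets and two real-world APIs, it significantly outperforms state-of-the-art baselines.

Link: https://arxiv.org/abs/2605.24958
Authors: Han Liu, Zhi Xu, Xiaotong Zhang, Feng Zhang, Xiaoming Xu, Wei Wang, Fenglong Ma, Hong Yu
Affiliations: Dalian University of Technology; Peking University; Macao Polytechnic University; The Pennsylvania State University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract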

Abstract:Despite the strong performance of deep neural networks in modern Web and language applications, they remain vulnerable to adversarial attacks, especially transferable attacks that generate adversarial examples using surrogate models without accessing the victim model. Transferable attacks in the text domain are still under-explored, with only a few studies addressing this challenging issue, often with suboptimal results due to equal treatment of submodels or inaccurate estimation of importance scores. To address these challenges, we propose a simple yet effective paradigm for transfer-based textual adversarial attack, named SEP-Attack. Specifically, we employ the Determinantal Point Process (DPP) to generate diverse surrogate ensemble weights, representing the transferability of submodels. Using these weights, we introduce a new metric to evaluate prediction confidence scores, which in turn are used to calculate word importance scores and generate adversarial candidates. Finally, we quantify the transferability score for each candidate and select the top ones as the final transferable adversarial examples. Experiments conducted on four datasets and two real-world APIs validate the efficacy of SEP-Attack, significantly outperforming state-of-the-art baselines.

[NLP-113] NITP: Next Implicit Token Prediction for LLM Pre-training ICML2026

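Quick read: Standard next-token prediction (NTP) supervises a model only through discrete labels in the output logit space, which leaves the latent representation space under-constrained. This sparse one-hot supervision lets hidden states drift into degenerate and anisotropic configurations that can limit generalization. The key to the solution is Next Implicit Token Prediction (NITP), which adds dense continuous supervision in the representation space: the model is trained to predict the implicit semantic content of the next token, using shallow-layer representations from the same model as stable self-supervised targets. The theoretical analysis indicates that NITP regularizes the optimization landscape by mitigating under-constrained degrees of freedom and encouraging a compact, structured representation geometry. Across dense and MoE models from 0.5B to 9B parameters, NITP consistently improves downstream performance at negligible computational cost; on a 9B MoE model it gains 5.7% absolute on MMLU-Pro, 6.4% on C3, and 4.3% on CommonsenseQA, with approximately 2% additional training FLOPs and no additional inference cost.

Link: https://arxiv.org/abs/2605.24956
Authors: Xiangdong Zhang, Debing Zhang, Shaofeng Zhang, Xiaohan Qin, Yu Cheng, Junchi Yan
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted at ICML 2026

Click to view abstract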

Abstract:Standard next-token prediction (NTP) supervises language models solely through discrete labels in the output logit space. We argue that this sparse one-hot supervision leaves the latent representation space under-constrained, allowing hidden states to drift into degenerate and anisotropic configurations that can limit generalization. To address this issue, we propose Next Implicit Token Prediction (NITP), which augments discrete prediction with dense continuous supervision directly in the representation space. NITP trains the model to predict the implicit semantic content of the next token, using shallow-layer representations from the same model as stable self-supervised targets. We provide theoretical analysis showing that NITP regularizes the optimization landscape by mitigating under-constrained degrees of freedom and encouraging a compact, structured representation geometry. Empirically, across dense and MoE models ranging from 0.5B to 9B parameters, NITP consistently improves downstream performance with negligible computational overhead. On a 9B MoE model, NITP achieves a 5.7% absolute improvement on MMLU-Pro, along with gains of 6.4% on C3 and 4.3% on CommonsenseQA, with approximately 2% additional training FLOPs and no additional inference cost. Our implementation is available at this https URL.

[NLP-114] H2MT: Semantic Hierarchy-Aware Hierarchical Memory Transformer

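Quick read: This paper addresses the difficulty large language models (LLMs) have with long inputs: context windows are finite, and prefill latency and memory grow rapidly with prompt length. Flat token-stream processing and chunk-based retrieval spend substantial computation and context budget on text unrelated to the query, while offline-indexed retrieval-augmented generation (RAG) adds external storage and index management overhead and typically appends retrieved evidence as raw text, which further increases prefill cost. The key to the solution is H^2MT (Semantic Hierarchy-Aware Hierarchical Memory Transformer), which builds a semantic hierarchy offline, computes a memory embedding for each node via bottom-up post-order aggregation, and routes queries coarse-to-fine at inference time to prune irrelevant branches early. On LongBench QA (NarrativeQA, HotpotQA, QASPER) and two structured technical-document settings, it achieves favorable quality-efficiency trade-offs: competitive ROUGE-L and F1 with lower peak GPU memory and time-to-first-token (TTFT) than prompt compression, memory-token methods, and RAG baselines.

Link: https://arxiv.org/abs/2605.24930
Authors: Maryam Haghifam, Zifan He, Jason Cong, Yizhou Sun
Affiliations: University of California, Los Angeles
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract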

Abstract: Transformer-based LLMs achieve strong results on many language tasks; however, long inputs remain challenging because context windows are finite, and prefill latency and memory grow rapidly with prompt length. Flat token-stream processing and chunk-based retrieval can therefore spend substantial computation and context budget on text unrelated to the query. Offline-indexed RAG additionally introduces external storage and index management overhead, and typically appends retrieved evidence as raw text, increasing prefill cost and latency. H^2MT makes long-context inference structure-aware: it builds a semantic hierarchy offline, computes a memory embedding for each node via bottom-up post-order aggregation, and routes queries coarse-to-fine at inference to prune irrelevant branches early. On LongBench QA (NarrativeQA, HotpotQA, QASPER) and two structured technical-document settings, H^2MT achieves favorable quality-efficiency trade-offs, delivering competitive ROUGE-L and F1 (where applicable) with lower peak GPU memory and time-to-first-token (TTFT) than prompt compression, memory-token methods, and retrieval-augmented generation baselines.
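
The two structural steps named in the abstract, bottom-up post-order aggregation and coarse-to-fine routing, can be sketched on a toy tree. This is an illustration only: mean pooling as the aggregator, cosine similarity as the routing score, and the `beam` parameter are assumptions, and the paper's learned memory embeddings are replaced by hand-written vectors.

```python
import math


class Node:
    def __init__(self, name, children=(), embedding=None):
        self.name = name
        self.children = list(children)
        self.embedding = embedding  # leaves carry an embedding; parents are filled in


def aggregate(node):
    """Post-order pass: embed the children first, then the parent as their mean."""
    if node.children:
        vecs = [aggregate(child) for child in node.children]
        node.embedding = [sum(col) / len(vecs) for col in zip(*vecs)]
    return node.embedding


def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0


def route(node, query, beam=1):
    """Coarse-to-fine routing: at each level keep only the `beam` most similar children."""
    if not node.children:
        return [node.name]
    ranked = sorted(node.children, key=lambda c: cosine(c.embedding, query), reverse=True)
    return [leaf for child in ranked[:beam] for leaf in route(child, query, beam)]


# Toy hierarchy: one document, two sections, two leaf chunks each.
root = Node("doc", [
    Node("sec_a", [Node("a1", embedding=[1.0, 0.0]), Node("a2", embedding=[0.8, 0.2])]),
    Node("sec_b", [Node("b1", embedding=[0.0, 1.0]), Node("b2", embedding=[0.1, 0.9])]),
])
aggregate(root)
hits = route(root, query=[0.0, 1.0], beam=1)
```

With `beam=1` the query never touches the leaves under `sec_a`, which is the early pruning the abstract refers to. A wider beam trades extra work for a lower risk of discarding a relevant branch.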

[NLP-115] MultiHaluDet: Multilingual Hallucination Detection via LLM Hidden State Probing ACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多语言环境下,尤其是低资源语言中普遍存在且难以检测的幻觉(hallucination)问题。现有方法依赖于输出置信度启发式或单层内部表示,往往无法捕捉跨语言的深层、复杂事实性不一致。解决方案的关键在于提出一种三阶段堆叠框架 MultiHaluDet,通过探测冻结的 LLMs 的完整隐藏状态轨迹(hidden state trajectories),无需进行语言特定微调即可实现多语言幻觉检测。其核心创新包括:利用多尺度注意力与自注意力池化机制处理跨层序列特征,并生成留一交叉验证嵌入(out-of-fold embeddings)输入校准后的经典分类器集成模型,从而同时捕获细粒度和粗粒度的事实性不一致模式。实验表明,该方法在英文基准(HaluEval 和 TriviaQA)上达到最高 98.55% AUROC,在高资源(法语)、中资源(孟加拉语)和低资源(阿姆哈拉语)语言中均展现出卓越的跨语言泛化能力。

Link: https://arxiv.org/abs/2605.24919
Authors: Riasad Alvi, Nurul Labib Sayeedi, Md. Faiyaz Abdullah Sayeedi
Affiliations: United International University; BRAC University
Subjects: Computation and Language (cs.CL)
Comments: MeLLM @ ACL 2026

Abstract:Hallucinations in Large Language Models (LLMs) represent a critical barrier to their reliable deployment, a vulnerability heavily exacerbated in non-English and resource-constrained contexts. Existing detection approaches that rely on output confidence heuristics or single-layer internal representations frequently fail to capture deep, complex factual inconsistencies across diverse languages. To address this, we introduce MultiHaluDet, a novel three-stage stacking framework that detects multilingual hallucinations by probing the full hidden state trajectories of frozen LLMs without requiring language-specific fine-tuning. Our method extracts sequential features across multiple layers and processes them via a hybrid architecture using multi-scale attention and self-attention pooling. By generating out-of-fold embeddings that feed into a calibrated classical classifier ensemble, MultiHaluDet captures both fine-grained and coarse-grained patterns of factual inconsistency. Extensive experiments demonstrate that our framework achieves state-of-the-art detection performance, reaching up to 98.55% AUROC on the English HaluEval and TriviaQA benchmarks using Mistral-7B and LLaMA2-7B architectures. Crucially, we rigorously evaluate our framework’s cross-lingual generalization across high (French), medium (Bangla), and low-resource (Amharic) languages. MultiHaluDet demonstrates exceptional representational robustness, consistently outperforming baselines and successfully transferring hallucination detection capabilities across typologically diverse linguistic tiers.

[NLP-116] Overview of the PsyDefDetect Shared Task at BioNLP 2026: Detecting Levels of Psychological Defense Mechanisms in Supportive Conversations

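[Quick read]: This paper addresses how to automatically identify the level of psychological defense mechanisms (PDMs) expressed in help-seeker utterances within emotional-support dialogues, enabling fine-grained modeling of psychological states. Its core is the PsyDefDetect shared task, grounded in the clinically validated Defense Mechanism Rating Scales (DMRS), and it covers three points: (1) the newly released PsyDefConv corpus of 200 dialogues and 2336 help-seeker utterances labeled with nine categories (seven hierarchical DMRS levels plus two auxiliary labels); (2) the value of LLM-based and theory-aware approaches for fine-grained classification; and (3) an analysis of the 563 submissions showing that current systems over-predict the majority High-Adaptive class and exhibit a marked gap between accuracy and macro-F1, a sign of sensitivity to class imbalance. This points to class-balancing strategies and closer integration of psychological theory as directions for future work.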

Link: https://arxiv.org/abs/2605.24907
Authors: Hongbin Na, Zimu Wang, Zhaoming Chen, Yining Hua, Rena Gao, Kailai Yang, Ling Chen, Wei Wang, Shaoxiong Ji, John Torous, Sophia Ananiadou
Affiliations: University of Technology Sydney; Xi’an Jiaotong-Liverpool University; University of Utah; Harvard University; The University of Melbourne; The University of Manchester; ELLIS Institute Finland; University of Turku
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:We present an overview of PsyDefDetect, the shared task on detecting levels of psychological defense mechanisms in emotional support dialogues, co-located with BioNLP@ACL 2026. Grounded in the clinically validated Defense Mechanism Rating Scales (DMRS) framework, the task asks systems to classify a target seeker utterance, given its preceding dialogue context, into one of nine categories: seven hierarchical DMRS levels plus two auxiliary labels. Participants worked on PsyDefConv, a newly released corpus of 200 dialogues and 2336 help-seeker utterances annotated under DMRS with substantial inter-annotator agreement. The task attracted 172 participants on CodaBench who produced 563 submissions, with 21 teams officially registering their results for the final ranking. The best system achieved a macro F1-score of 0.420, surpassing the strongest fine-tuned baseline reported in the dataset paper by a notable margin, yet leaving clear headroom. Our analysis highlights (i) a persistent tendency to over-predict the majority High-Adaptive class, (ii) a widening gap between accuracy and macro-F1 that reveals class-imbalance sensitivity, and (iii) the value of theory-aware and LLM-based approaches for fine-grained defensive-function classification. We release all task materials and invite the community to continue work on this novel intersection of clinical psychology and NLP.
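The gap between accuracy and macro-F1 mentioned in finding (ii) is easy to reproduce on a toy example. The sketch below uses hypothetical labels and predictions, not PsyDefConv data: a system that always predicts the majority class scores well on accuracy and poorly on macro-F1, because every class counts equally in the macro average.

```python
def accuracy(gold, pred):
    # Fraction of utterances whose predicted label matches the gold label.
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    # Unweighted mean of per-class F1, so rare classes count as much as common ones.
    labels = sorted(set(gold) | set(pred))
    f1s = []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


# Hypothetical, heavily imbalanced toy data: 8 of 10 items belong to one class.
gold = ["high"] * 8 + ["mid", "low"]
pred = ["high"] * 10  # a degenerate majority-class predictor

print(accuracy(gold, pred), macro_f1(gold, pred))
```

Here accuracy is 0.8 while macro-F1 is below 0.3, which is why a macro-averaged metric is the more informative ranking criterion when one class dominates.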

[NLP-117] Quantifying the Impact of Translation Errors on Multilingual LLM Evaluation

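[Quick read]: This paper addresses two questions. First, how well do automatic MQM-style error spans, produced by LLM judges and by a span-aware quality estimation (QE) baseline (xCOMET-XXL), agree with expert human span annotations on benchmark translations? Second, how strongly do target-side translation errors, as opposed to problems in the English source, explain the accuracy drops of LLMs on translated benchmarks? The approach is to compare automatic and human error annotations and then, controlling for English correctness and source-side anomalies, quantify the effect of translation errors on model accuracy. The results expose reliability problems in machine-translated benchmarks and show that they measurably distort multilingual evaluation.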

Link: https://arxiv.org/abs/2605.24904
Authors: Klaudia-Doris Thellmann, Bernhard Stadler, Michael Färber, Jens Lehmann
Affiliations: TUD Dresden University of Technology and ScaDS.AI; InfAI e.V.; Amazon
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Machine-translated benchmarks are widely used to assess the multilingual capabilities of large language models (LLMs), yet translation errors in these benchmarks remain underexplored, raising concerns about the reliability and comparability of multilingual evaluation. We address two practical gaps: (i) how well automatic MQM-style error spans from LLM judges and a span-aware QE baseline (xCOMET-XXL) match expert human span annotations on benchmark translations, and (ii) how strongly translation errors (as opposed to source-side issues in the English original) explain accuracy drops on translated benchmarks. We find that span agreement is non-trivial on naturally occurring benchmark translations, and that target-side translation errors are consistently associated with measurable, percentage-point drops in translated accuracy even after controlling for English correctness and source-side anomalies.

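[NLP-118] When Reasoning Hurts: Source-Aware Evaluation of Frontier LLMs for Clinical SOAP Note Generation

[Quick read]: Reasoning-enabled LLMs perform strongly on medical reasoning benchmarks, but it is unclear whether those gains carry over to structured clinical documentation such as SOAP note writing. The paper studies this with a controlled 2x2 design that independently toggles two factors, provider-native reasoning (the reasoning mode offered by the model provider) and same-source retrieval-augmented generation (RAG), and evaluates GPT-5.4, DeepSeek-V4-Flash, and Gemma-4-E4B on a benchmark spanning three sources: OMI Health, ACI-Bench, and PriMock57. A non-reasoning GPT-5.4 configuration achieves the highest overall quality, and enabling reasoning significantly degrades its performance; RAG yields only smaller, model-dependent improvements. Stronger reasoning capability therefore cannot be assumed to improve fidelity-sensitive SOAP note generation without dedicated, task-specific evaluation.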

Link: https://arxiv.org/abs/2605.24902
Authors: Faizan Faisal
Affiliations: University of California, Davis
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Reasoning-enabled LLMs perform strongly on medical reasoning benchmarks, but it remains unclear whether these gains transfer to structured clinical documentation; we investigate this question using SOAP note generation from clinical dialogue in a source-aware benchmark spanning OMI Health, ACI-Bench, and PriMock57. We evaluate GPT-5.4, DeepSeek-V4-Flash, and Gemma-4-E4B in a controlled 2x2 design that independently toggles provider-native reasoning and same-source retrieval-augmented generation (RAG). Outputs are assessed using seven automatic metrics alongside two reference-aware LLM judges. Both evaluation approaches agree that a non-reasoning GPT-5.4 configuration achieves the highest overall quality, while DeepSeek-V4-Flash performs best among reasoning-enabled configurations. Enabling reasoning significantly degrades GPT-5.4 performance across all three datasets, whereas same-source RAG yields smaller, model-dependent improvements. Overall, the findings indicate that stronger reasoning capability should not be assumed to improve fidelity-sensitive SOAP note generation without dedicated, task-specific evaluation.

[NLP-119] DTO: a Differentiable Training Objective for Effective Counterfactual Story Rewriting

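[Quick read]: This paper addresses a central difficulty of counterfactual story rewriting: the story must be updated to reflect a chosen alternative event through small, highly localized edits while the rest of the storyline and its overall coherence are preserved. Conventional maximum-likelihood training tends to overlook such subtle changes, and more sophisticated reinforcement-learning approaches are slow and hard to set up. The paper proposes a Differentiable Training Objective (DTO): an end-to-end differentiable loss that jointly rewards (i) fidelity to the reference rewrite and (ii) semantic consistency with the source narrative. On the TimeTravel and ART datasets, DTO surpasses a maximum-likelihood baseline and a preference-based approach, and performs competitively against two contemporary large language models on all evaluation metrics, supporting the case for task-specific differentiable objectives in fine-grained, controlled text generation.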

Link: https://arxiv.org/abs/2605.24885
Authors: Amelia Girard, Massimo Piccardi
Affiliations: University of Technology Sydney
Subjects: Computation and Language (cs.CL)
Comments: 11 pages, 2 figures

Abstract:Counterfactual story rewriting is a natural language processing task that requires updating an existing story to reflect a chosen alternative event, yet preserving all the unaffected storyline elements and overall coherence. While large language models have recently made remarkable progress on this task, it still remains challenging since the required modifications are typically very small in size and highly localized. As a consequence, models trained in a conventional manner with the maximum-likelihood training objective tend to overlook these nuances. At the same time, more sophisticated training approaches based on reinforcement learning are notoriously slow and difficult to set up. For these reasons, our paper proposes a novel, differentiable training objective (DTO) that directly optimizes for the requisite counterfactual improvements. In our approach, a transformer model is fine-tuned via end-to-end backpropagation against a fully differentiable loss function that jointly rewards (i) fidelity to the reference rewrite and (ii) semantic consistency with the source narrative. The empirical evaluation on the TimeTravel and ART datasets shows that the proposed DTO approach has been able to surpass a maximum-likelihood baseline and a preference-based approach, and perform competitively against two contemporary large language models in all evaluation metrics. These findings substantiate the effectiveness of task-specific differentiable objectives for nuanced, controlled text-generation tasks.

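[NLP-120] Towards a Universal Causal Reasoner

[Quick read]: This paper addresses the weak causal reasoning of LLMs, and in particular the fact that existing datasets mostly benchmark specific aspects of causality and are poorly suited to training generalizable causal reasoners. The key to the solution is UniCo, a data generation framework that covers 18 causal query types across Pearl’s Causal Ladder and translates natively symbolic examples into code and natural language forms, simulating real-world cases where causal terms are not stated explicitly. To ensure data quality, it grounds answers with exact causal inference and filters out cases with reasoning shortcuts. After supervised finetuning on 66.6K UniCo-generated instances, several models improve by an average of 22.9% across the 18 in-distribution query types and beat state-of-the-art causal data generation frameworks by 8.1% on 7 out-of-distribution causal benchmarks. In real-world medical understanding, legal decision, and tabular reasoning tasks, faithfulness metrics rise by an average of 20.2% over the base models, suggesting that causality-centered training strengthens causal reasoning and also instills a more general causal mindset.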

Link: https://arxiv.org/abs/2605.24873
Authors: Qirun Dai, Xiao Liu, Jiawei Zhang, Dylan Zhang, Hao Peng, Chenhao Tan
Affiliations: The University of Chicago; University of Illinois Urbana-Champaign
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Despite the importance of causal reasoning, training LLMs to reason causally remains underexplored. Existing data efforts mostly focus on benchmarking LLMs on specific aspects of causality, making them less suitable for training generalizable causal reasoners. To address this, we propose UniCo, a data generation framework that both (1) addresses 18 causal query types across Pearl’s Causal Ladder and (2) translates natively symbolic examples into code and natural language forms to simulate real-world use cases where causal terms are not explicitly specified. To ensure data quality, UniCo grounds answers with exact causal inference and filters cases with reasoning shortcuts. Upon supervised finetuning with 66.6K UniCo-generated instances, Qwen3-4B, Qwen3-8B and Olmo-3-7B-Instruct achieve an average of 22.9% improvements across all 18 in-distribution query types, and 8.1% over state-of-the-art causal data generation frameworks on 7 established causal benchmarks outside the training distribution. More importantly, in real-world medical understanding, legal decision, and tabular reasoning, UniCo-trained models consistently display more faithful reasoning traces, outperforming the base models by an average of 20.2% in faithfulness metrics. These suggest that causality-centered training not only strengthens causal reasoning, but also equips LLMs with a causal mindset in general reasoning tasks.

[NLP-121] Lngram: N-gram Conditional Memory in Latent Space

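[Quick read]: Sequence modeling needs both compositional reasoning and local static knowledge retrieval, yet standard Transformers handle both through dense computation, which is inefficient. The core of the solution is Lngram, a latent-space conditional memory module that learns discrete symbols directly from hidden states and performs N-gram lookup over those symbols. The design removes the dependence on tokenizer IDs and extends naturally to non-text modalities. Lngram consistently reduces perplexity in long-context language modeling and can be added post hoc to pretrained models to inject domain knowledge; joint training with the backbone further surpasses full fine-tuning, and vision-language and vision-language-action tasks show overall gains. Analyses suggest that Lngram lets prediction-relevant information emerge earlier, increasing effective depth with limited inference and memory overhead.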

Link: https://arxiv.org/abs/2605.24869
Authors: Yunao Zheng, Guoyang Xia, Xiaojie Wang, Lei Ren
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Sequence modeling requires both compositional reasoning and local static knowledge retrieval, yet standard Transformers handle both through dense computation. Engram partially decouples retrieval from the backbone, but its token-based keys remain tied to text tokenization and hash compression. We propose Lngram, a latent-space conditional memory module that learns discrete symbols directly from hidden states and performs N-gram lookup over these symbols. This design removes the dependence on tokenizer IDs and naturally extends to non-text modalities. In our evaluated settings, Lngram outperforms Transformer and Engram baselines, consistently reduces perplexity in long-context language modeling, and effectively injects domain knowledge when added post hoc to pretrained models. Joint training with the backbone further surpasses full fine-tuning, while experiments on vision-language and vision-language-action tasks show overall gains. Analyses with LogitLens and CKA suggest that Lngram enables prediction-relevant information to emerge earlier, increasing effective depth with limited inference and memory overhead. Code is available at this https URL.
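The idea of "discrete symbols from hidden states, then N-gram lookup" can be pictured with a few lines of Python. This is only a rough sketch: the abstract does not say how symbols are learned or how the table is stored, so nearest-codeword quantization, the dictionary-based table, and the names `quantize` and `ngram_lookup` are assumptions made for illustration.

```python
def quantize(hidden_states, codebook):
    # Assumption: each hidden state maps to the index of its nearest codeword
    # (squared Euclidean distance), giving one discrete symbol per position.
    def nearest(h):
        return min(
            range(len(codebook)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(h, codebook[k])),
        )
    return [nearest(h) for h in hidden_states]


def ngram_lookup(symbols, table, n=2, dim=2):
    # For each position, form the key from the last `n` symbols (fewer at the
    # start of the sequence) and fetch a memory vector; unseen keys yield zeros.
    out = []
    for t in range(len(symbols)):
        key = tuple(symbols[max(0, t - n + 1): t + 1])
        out.append(table.get(key, [0.0] * dim))
    return out


# Hypothetical toy values: a 2-entry codebook and a single stored bigram.
codebook = [[0.0, 0.0], [1.0, 1.0]]
hidden = [[0.1, -0.1], [0.9, 1.2], [0.8, 0.7]]
table = {(0, 1): [1.0, 0.0]}

symbols = quantize(hidden, codebook)
memory = ngram_lookup(symbols, table)
```

Because the keys are built from latent symbols rather than tokenizer IDs, the same lookup would work for any modality that produces hidden states, which is the property the abstract emphasizes.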

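[NLP-122] Clustering as Reasoning: A k-Means Interpretation of Chain-of-Thought Graph Learning ICML2026

[Quick read]: This paper addresses two problems of existing Chain-of-Thought (CoT) graph learning methods on text-attributed graphs (TAGs): disjoint architectures that prevent step-by-step interaction between semantic and topological information, and fixed graph representations that limit how reasoning can evolve and be interpreted. The key to the solution is KCoT, a unified framework built on a formal mathematical correspondence between a Transformer block and the k-means algorithm, which recasts iterative reasoning as repeated assignment and update steps. On top of this, a Semantic Discriminating Prompt explicitly formulates those steps as structured CoT reasoning, and a structure-grounded alignment strategy fuses topological priors with evolving thought-conditioned representations, yielding more interpretable reasoning over graph-structured data.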

Link: https://arxiv.org/abs/2605.24867
Authors: Xuanting Xie, Zhaochen Guo, Bingheng Li, Xingtong Yu, Zhifei Liao, Zhao Kang, Yuan Fang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Networking and Internet Architecture (cs.NI)
Comments: Accepted by ICML 2026

Abstract:Chain-of-Thought (CoT) prompting has shown promise in enhancing the reasoning capabilities of large language models (LLMs) on text-attributed graphs (TAGs). This work reframes CoT-based graph learning through the principle of clustering as reasoning, offering a k-means interpretation of how iterative reasoning operates over graph-structured data. We observe that existing graph CoT methods rely on disjoint architectures and fixed graph representations, limiting step-by-step semantic-topological interaction and interpretability. To overcome this limitation, we propose a unified framework named KCoT that integrates CoT reasoning with graph representation learning. Our key theoretical result reveals a formal mathematical correspondence between a Transformer block and the k-means algorithm, allowing reasoning to be interpreted as iterative assignment and update steps. Based on this insight, we introduce a Semantic Discriminating Prompt that explicitly formulates these steps as structured CoT reasoning, together with a structure-grounded alignment strategy to fuse topological priors with evolving thought-conditioned representations. Experiments on standard benchmarks demonstrate consistent improvements over state-of-the-art methods, validating clustering as a principled mechanism for CoT-based graph learning.
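For readers less familiar with the "assignment and update steps" the abstract refers to, here is plain k-means (Lloyd's algorithm) in a few lines. It illustrates only the classical algorithm; the correspondence to a Transformer block is the paper's own result and is not reproduced here. The function names and toy points are illustrative.

```python
def assign(points, centers):
    # Assignment step: each point takes the index of its nearest center.
    return [
        min(
            range(len(centers)),
            key=lambda k: sum((p - c) ** 2 for p, c in zip(pt, centers[k])),
        )
        for pt in points
    ]


def update(points, labels, centers):
    # Update step: each center moves to the mean of the points assigned to it.
    new_centers = []
    for k, old in enumerate(centers):
        members = [pt for pt, lab in zip(points, labels) if lab == k]
        if not members:
            new_centers.append(old)  # keep an empty cluster's center unchanged
        else:
            new_centers.append([sum(col) / len(members) for col in zip(*members)])
    return new_centers


def kmeans(points, centers, iters=10):
    # Alternate the two steps for a fixed number of iterations.
    for _ in range(iters):
        labels = assign(points, centers)
        centers = update(points, labels, centers)
    return assign(points, centers), centers


labels, centers = kmeans([[0, 0], [0, 2], [10, 0], [10, 2]], [[0, 0], [10, 0]])
```

Each iteration refines both the grouping and the group representatives, which is the sense in which an iterative procedure of this kind can be read as step-by-step reasoning.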

[NLP-123] Repeated Sequences Reveal Gaps between Large Language Models and Natural Language ACL2026

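[Quick read]: This paper asks how to evaluate whether large language models (LLMs) capture the long-range statistical structure of natural language rather than just local fluency. Existing evaluations rely mainly on task performance or short-context behavior and reveal little about long-range dependencies in generated text. The key to the solution is a complementary evaluation framework based on repeated subsequences: their distribution is analyzed across scales and related to higher-order Rényi entropies, quantifying how a text reuses previously established structure under finite-length conditions. Human-written texts show stable entropy-growth patterns, whereas GPT-generated texts show systematic shifts in the estimated exponents as model size changes, so the method distinguishes natural language from state-of-the-art LLM output at the structural level.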

Link: https://arxiv.org/abs/2605.24850
Authors: Kumiko Tanaka-Ishii
Affiliations: Waseda University
Subjects: Computation and Language (cs.CL); Information Theory (cs.IT); Applications (stat.AP)
Comments: ACL 2026

Abstract:Evaluating whether large language models (LLMs) capture the structure of natural language beyond local fluency remains an open challenge. Existing evaluation methods, largely based on task performance or short-context behavior, provide limited insight into the long-range statistical organization of generated text. We propose a complementary evaluation framework based on repeated subsequences. By analyzing their distribution across scales and relating it to higher-order Rényi entropies, we probe how texts reuse previously established structure under finite-length conditions. Experiments on human-written texts and length-matched GPT-generated texts show that, while power-law models can describe restricted ranges of block length, the observed entropy growth is often equally or better characterized by logarithmic-power forms. Across datasets, natural language exhibits stable entropy-growth patterns over accessible ranges, with consistent average behavior despite variability across individual texts. In contrast, GPT-generated texts show systematic and statistically significant shifts in estimated exponents with model size. These results demonstrate that repeated-subsequence entropy provides a quantitative structural diagnostic that reveals systematic differences in long-range organization, distinguishing natural language from state-of-the-art LLM outputs beyond surface-level fluency.
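The link between repeated blocks and Rényi entropy can be made concrete with the order-2 case, where the entropy is the negative log of the collision probability, that is, the chance that two randomly drawn blocks of length n coincide. The sketch below is a naive plug-in estimate on toy strings; the function name is illustrative, and the paper's own estimator and choice of entropy orders may differ.

```python
import math
from collections import Counter


def renyi2_block_entropy(seq, n):
    # Slide a window of length n over the sequence and count each distinct block.
    blocks = [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]
    counts = Counter(blocks)
    total = len(blocks)
    # Collision probability: sum of squared empirical block frequencies.
    collision = sum((c / total) ** 2 for c in counts.values())
    # Order-2 Rényi entropy in nats: H_2 = -log(sum_i p_i^2).
    return -math.log(collision)


# More repetition means a higher collision probability and a lower entropy.
print(renyi2_block_entropy("abababab", 2), renyi2_block_entropy("abcdefgh", 2))
```

Plotting this quantity against the block length n for a given text gives an entropy-growth curve of the kind whose functional form the abstract discusses.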

[NLP-124] Geo-Expert: Towards Expert-Level Geological Reasoning via Parameter-Efficient Fine-Tuning ICML2026

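[Quick read]: General-purpose LLMs often hallucinate when reasoning about subsurface structures and deep-time evolution in geology, while current AI work in the Earth sciences focuses mainly on surface remote sensing and geographic information systems (GIS), leaving specialized geological reasoning underserved. The key to the solution is Geo-Expert, a family of parameter-efficient, domain-aligned geological LLMs fine-tuned on a custom-curated, high-quality instruction dataset produced by a custom instruction synthesis pipeline, using Low-Rank Adaptation (LoRA) on base models of different sizes (Qwen3-8B, Qwen3-32B, and Gemma-3-27B). Experiments show that a domain-aligned 8B model can outperform open-weight 70B generalist models and proprietary GPT-4o on specialized geological reasoning while offering a competitive cost-performance ratio, providing a reproducible recipe and a baseline for geological AI.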

Link: https://arxiv.org/abs/2605.24844
Authors: Chenyou Guo, Zongqi Liu, Yizhou Zhang, Zhaorui Jiang, Ze Liu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 11 pages, 1 figure, 3 tables. Accepted at ICML 2026 AI for Science Workshop

Abstract:While general-purpose Large Language Models (LLMs) applied to Geology often hallucinate when reasoning about subsurface structures and deep-time evolution, current AI in Earth sciences predominantly targets surface remote sensing and GIS. To bridge this gap, we introduce Geo-Expert, a family of parameter-efficient geological LLMs fine-tuned on a custom-curated, high-quality instruction dataset processed using our custom instruction synthesis pipeline. We investigate the impact of model scaling and architecture by fine-tuning three base models: Qwen3-8B, Qwen3-32B, and Gemma-3-27B, with Low-Rank Adaptation (LoRA) method. Our extensive evaluation on a novel domain-specific benchmark, Geo-Eval, reveals that a domain-aligned 8B model can outperform open-weight 70B generalists and proprietary GPT-4o on specialized geological reasoning, while a 32B variant approaches frontier reasoning models. The optimized 8B model further offers a competitive cost-performance ratio for deployment. This work provides a reproducible recipe for democratizing scientific LLMs and establishes a baseline for geological artificial intelligence.
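As background on why LoRA is called parameter-efficient: the pretrained weight matrix W stays frozen and only a low-rank product B·A is trained, giving an effective weight W + (alpha / r)·B·A. The sketch below spells this out with plain Python lists; it is a generic illustration of the LoRA formula, not Geo-Expert's training code, and the matrix sizes and values are made up.

```python
def matmul(a, b):
    # Plain nested-list matrix product.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(W, A, B, alpha):
    # W: d_out x d_in (frozen), B: d_out x r, A: r x d_in (both trainable).
    r = len(A)
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


def trainable_params(d_out, d_in, r):
    # Number of trained values with LoRA, versus d_out * d_in for full fine-tuning.
    return r * (d_out + d_in)


# Toy values: a 2 x 2 identity weight with a rank-1 update.
W = [[1, 0], [0, 1]]
A = [[1, 2]]
B = [[1], [0]]
W_eff = lora_effective_weight(W, A, B, alpha=2)
```

For a 4096 x 4096 layer at rank 8, `trainable_params` gives 65,536 values against roughly 16.8 million for the full matrix, which is the kind of saving that makes fine-tuning 8B to 32B models practical.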

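[NLP-125] Translators as Invisible Teachers of AI: Copyright Translation Memory and the Political Economy of Linguistic Data

[Quick read]: This paper examines how, in the age of AI, translators' labour has been converted into training-data capital while translators lose the moral, creative, and economic attribution due to them. It develops two core concepts. The first is "appropriation without consumption": works are not read or enjoyed but only mined for statistical features, a mode of use that is legitimated under Article 30-4 of the Japanese Copyright Act. The second is the "invisible teacherisation" of translators: by building translation memories, post-editing, and assessing quality, translators have in effect acted as teachers of AI without being recognized as such. Drawing on the data supply chain from translators through language service providers (LSPs) and platforms to model developers, a comparison of Japanese, European, and United States legal frameworks, the distinction between open and proprietary AI models, and the premium status of human-generated data in the era of model collapse, the paper asks what translators are actually afraid of and points toward concrete directions for redistributive design.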

Link: https://arxiv.org/abs/2605.24842
Authors: Masaru Yamada
Affiliations: Rikkyo University
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 13 pages; comments welcome

Abstract:This paper examines how the labour of translators has been transformed into foundational data capital for the age of artificial intelligence (AI). Translation memories (TM) and parallel corpora preserve a one-to-one correspondence between source and target text and therefore constitute extraordinarily valuable supervised training data for machine translation. The development of statistical machine translation (SMT), neural machine translation (NMT), the Transformer architecture, and multilingual large language models (LLMs) cannot be disentangled from the accumulation of such translation data. And yet, translators’ renditions have been bought as deliverables under contract, segmented as technical objects, and processed as “information analysis” data under copyright law, losing their moral, creative, and economic attribution to the translators who produced them. The paper develops two concepts to capture this process. The first is appropriation without consumption: a mode of use in which works are not read, viewed, or listened to, but only mined for statistical features, a use that is legitimated under Article 30-4 of the Japanese Copyright Act. The second is the invisible teacherisation of translators: the process by which translators, through the construction of translation memories, post-editing, and quality assessment, have functioned as teachers of AI without recognition as such. Drawing on the data supply chain that runs from translators through language service providers (LSPs) and platforms to model developers, on a comparative reading of Japanese, European, and United States legal frameworks, on the distinction between open and proprietary AI models, and on the premium status that human-generated data has acquired in the era of model collapse, the paper asks what translators are actually afraid of, and points toward concrete directions for redistributive design.

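[NLP-126] RouteScan: A Non-Intrusive Approach to Auditing MoE LLMs Safety via Expert Routing Telemetry

[Quick read]: This paper addresses how to audit deployed Mixture-of-Experts (MoE) LLMs for harmful behavior without exposing sensitive user information. Existing content-based auditing typically needs access to user prompts, model inputs, or outputs, which creates a fundamental tension between model safety and user privacy. The key observation is that sparse expert routing in MoE models leaves measurable footprints in low-level GPU execution telemetry. Building on it, RouteScan is a non-intrusive auditing framework that uses the number of active GPU threads allocated to expert modules during the prefilling phase as a fingerprint, and builds a lightweight detection pipeline that isolates cross-domain invariant risk indicators to identify malicious prompts. Experiments report an AUROC above 0.93 on unseen harmful domains and 0.96 under novel jailbreak wrappers, and inversion tests show that the telemetry reveals only limited information for prompt reconstruction, suggesting a practical privacy advantage over content-based auditing.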

Link: https://arxiv.org/abs/2605.24817
Authors: Bo Lv, Zhiheng Xu, KeDong Xiu, Ruyi Ding, Tianhang Zheng, Zhibo Wang, Kui Ren
Affiliations: Zhejiang University; Donghua University; Louisiana State University
Subjects: Cryptography and Security (cs.CR); Hardware Architecture (cs.AR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 20 pages. Under submission

Abstract:Mixture-of-Experts (MoE) architectures have become an increasingly important paradigm for scaling Large Language Models (LLMs). As MoE models are increasingly deployed in real-world services, safety auditing becomes necessary to verify whether these models produce or facilitate harmful behaviors during operation. However, existing content-based auditing methods typically require access to user prompts, model inputs, or generated outputs, potentially exposing sensitive user information and creating a fundamental tension between LLM safety and user privacy. On the other hand, we observe that, in MoE models, sparse expert routing maps different inputs to activate different expert-execution patterns, producing measurable footprints in low-level GPU execution telemetry. Inspired by this observation, we propose RouteScan, a non-intrusive auditing framework for detecting harmful behaviors through GPU-level expert routing telemetry. Specifically, RouteScan utilizes the number of active GPU threads allocated to expert modules during the prefilling phase as a discriminative micro-architectural fingerprint, and builds a lightweight detection pipeline that isolates cross-domain invariant risk indicators for the precise identification of malicious prompts. Comprehensive evaluations on open-source MoE LLMs with distinct routing designs demonstrate that RouteScan achieves strong generalization, with an AUROC exceeding 0.93 on unseen harmful domains and 0.96 under novel jailbreak wrappers. Moreover, empirical inversion tests show that the collected expert routing telemetry provides limited information for prompt reconstruction, suggesting a practical privacy advantage over content-based auditing methods.

[NLP-127] DUEL: Adversarial Self-Play for Multimodal Reasoning

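[Quick read]: Reinforcement learning (RL) for vision-language models (VLMs) depends heavily on costly high-quality annotations, and existing unsupervised alternatives can drift toward biased solutions because of weak visual grounding and the lack of reliable verification signals. The key to the solution is DUEL, a self-evolving post-training framework in which supervision emerges from adversarial interaction between two policies, a Challenger and a Solver, initialized from the same pretrained VLM. The Challenger generates an image-grounded true claim together with a minimally perturbed hard-negative counterpart, and the Solver verifies both claims against the image, which encourages fine-grained visual discrimination under near-neighbor semantics. A length-normalized log-likelihood reward preserves informative optimization signals under sparse feedback and improves training stability. Experiments show consistent gains in visual reasoning and robust discrimination without additional human annotations, external reward models, or image editing tools.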

Link: https://arxiv.org/abs/2605.24794
Authors: Lin Qiu, Hanqing Zeng, Yao Liu, Bingjun Sun, Guangdeng Liao, Ji Liu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Reinforcement learning (RL) has emerged as an effective paradigm for improving the reasoning capability of vision-language models (VLMs). However, RL-based optimization typically depends on costly high-quality annotations that are difficult to scale. Existing unsupervised alternatives may drift toward biased solutions due to weak visual grounding and the lack of reliable verification signals. We propose a self-evolving post-training framework, DUEL, where supervision emerges from adversarial interactions between two policies initialized from the same pretrained VLM. A Challenger generates an image-grounded true claim together with a minimally perturbed hard-negative counterpart, while a Solver verifies both claims against the image, encouraging fine-grained visual discrimination under near-neighbor semantics. To stabilize optimization, we introduce a length-normalized log-likelihood reward that preserves informative optimization signals beyond binary outcome supervision and improves learning stability under sparse feedback. Experiments show that DUEL consistently improves visual reasoning and robust discrimination without additional human annotations, external reward models, or image editing tools.
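A length-normalized log-likelihood, in its simplest form, is the mean per-token log-probability of a sequence. The abstract does not give DUEL's exact reward formula, so the sketch below shows only this generic quantity and why normalization matters; the function name and the probabilities are illustrative.

```python
import math


def length_normalized_loglik(token_logprobs):
    # Mean log-probability per token; dividing by the length removes the
    # built-in penalty that a raw sum imposes on longer sequences.
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return sum(token_logprobs) / len(token_logprobs)


# Two hypothetical answers whose tokens are all equally likely (p = 0.5),
# one short and one long.
short = [math.log(0.5)] * 2
long_ = [math.log(0.5)] * 6

print(sum(short), sum(long_))  # raw sums differ by a factor of three
print(length_normalized_loglik(short), length_normalized_loglik(long_))
```

Unlike a binary correct/incorrect outcome, this score is graded, so a training signal remains even when most sampled responses get the final verdict wrong.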

[NLP-128] Beyond the Target: From Imitation to Collaboration in Speculative Decoding

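[Quick read]: The dominant speculative decoding (SPD) paradigm treats the target model as the sole reliable teacher and accepts a draft token only when the draft model's prediction exactly matches the target model's. This implicitly assumes that the target is the better choice at every position, which does not hold in practice: although the draft model is weaker overall, in a meaningful fraction of disagreements its choice is the one that leads to the correct final answer. The key to the solution is Collaborative Speculative Decoding (CoSpec), which no longer treats the target model as the sole token-level authority. It trains an arbitration policy via reinforcement learning to decide, when draft and target disagree, whether to keep the draft token, maintaining substantial speedups while improving final output quality beyond target-only decoding. The approach shifts the emphasis from imitation to collaboration.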

Link: https://arxiv.org/abs/2605.24793
Authors: Jinze Li, Yixing Xu, Guanchen Li, Jinfeng Xu, Shuo Yang, Yang Zhang, Xuanwu Yin, Dong Li, Edith C.H. Ngai, Emad Barsoum
Affiliations: Advanced Micro Devices, Inc.; The University of Hong Kong; University of North Texas
Subjects: Computation and Language (cs.CL)
Comments: under review

Abstract:Speculative decoding (SPD) accelerates large language model (LLM) inference by letting a smaller draft model propose multiple future tokens that are verified in parallel by a larger target model. The dominant SPD paradigm treats the target model as the sole reliable teacher, accepting a draft token only when it exactly matches the target prediction. This design implicitly assumes that the target is always the better choice at every position. In practice, this assumption does not hold. Although the draft is the weaker model overall, it is not uniformly inferior at the token level. In a meaningful fraction of cases where draft and target disagree, the draft’s choice is the one that leads to the correct final answer. Inspired by this, we introduce Collaborative Speculative Decoding (CoSpec), a generalization of SPD that no longer treats the target model as the sole token-level authority. CoSpec trains an arbitration policy via reinforcement learning to decide whether to accept tokens from the draft or target model, selectively accepting draft tokens at mismatches when doing so is likely to yield a correct final answer. Experimental results show that CoSpec maintains substantial speedups while surpassing target-only performance. By shifting the emphasis from imitation to collaboration, CoSpec suggests a new perspective on speculative decoding.
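The difference between exact-match verification and arbitration can be sketched over a single drafted block. This is a simplified, greedy illustration and not the CoSpec implementation: `verify` and the `arbiter` callable are hypothetical, with `arbiter` standing in for the RL-trained policy, and `target_tokens[i]` is assumed to be the target model's prediction at position i given the draft prefix before it.

```python
def verify(draft_tokens, target_tokens, arbiter=None):
    # Walk the drafted block left to right.
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)  # agreement: the draft token is accepted as usual
            continue
        if arbiter is not None and arbiter(accepted, d, t):
            accepted.append(d)  # mismatch, but the arbiter sides with the draft
            continue
        # Otherwise fall back to the target token. Later target predictions were
        # conditioned on the draft prefix, so they are no longer valid: stop here.
        accepted.append(t)
        break
    return accepted


draft = [1, 2, 3, 4]
target = [1, 2, 9, 4]

standard = verify(draft, target)  # exact-match baseline
collaborative = verify(draft, target, arbiter=lambda prefix, d, t: True)
```

With the baseline the block is cut at the first mismatch, whereas an arbiter that keeps the draft token lets verification continue through the rest of the block, so arbitration can affect both output quality and the number of tokens accepted per step.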

[NLP-129] Automated Detection and Classification of Delusion-related Content in Naturalistic Audio Diaries Using Multi-Agent Language Models

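[Quick read]: This paper addresses the automatic, fine-grained labeling of language in naturalistic speech that suggests delusional beliefs, together with the associated affective and behavioral responses, in support of characterizing mental illness and detecting symptom exacerbation. The key to the solution is an automated multi-agent LLM pipeline. Detailed diagnostic prompt instructions reduce false positives in delusional theme classification, and majority voting, rather than complex conversational debate between agents, gives the more accurate results on clinically ambiguous text. The resulting pipeline detects and classifies delusion-related content in naturalistic audio diaries at scale, with micro-F1 scores of 0.872 for detection and 0.779 for classification.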

Link: https://arxiv.org/abs/2605.24755
Authors: Feng Chen, Justin Tauscher, Changye Li, Meliha Yetisgen, Alex Cohen, Adam Kuczynski, Angelina Pei-Tzu Tsai, Benjamin Buck, Dror Ben-Zeev, Trevor Cohen
Affiliations: University of Washington; Louisiana State University; University of North Carolina at Chapel Hill
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by CLPsych 2026

Abstract:Speech monologues recorded in naturalistic settings provide opportunities to characterize mental illness phenomenology and detect symptom exacerbation. Large language models (LLMs) offer new possibilities for automating this process, as they require annotated data primarily for evaluation rather than training. In this paper, we present a novel automated, multi-agent LLM pipeline for the fine-grained, multi-label extraction of language suggestive of delusional beliefs, associated affective responses, and behavioral responses from transcripts of naturalistic audio diaries collected from people with moderate persecutory ideation. Evaluating an ensemble of three foundation models, we demonstrate that detailed diagnostic prompt instructions successfully reduce false positives for delusional theme classification, but also constrain the interpretation of affective or behavioral responses. Furthermore, comparing multi-agent adjudication frameworks shows that complex conversational debate between agents diminishes accuracy on clinically ambiguous text by inducing premature consensus. Instead, majority voting establishes robust performance (Micro F1 of 0.872 and 0.779 for delusion detection and classification respectively). This work provides a validated and scalable pipeline for the automated detection and characterization of content suggesting delusional beliefs in naturalistic speech.
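Majority voting over multi-label outputs, the adjudication scheme the abstract found most robust, can be written in a few lines. The sketch assumes each agent returns a set of labels for a transcript and keeps a label when more than half of the agents proposed it; the label names and the strict-majority rule are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter


def majority_vote(label_sets):
    # Count how many agents proposed each label, then keep the labels
    # supported by a strict majority of agents.
    n_agents = len(label_sets)
    counts = Counter(label for labels in label_sets for label in set(labels))
    return {label for label, c in counts.items() if c > n_agents / 2}


# Hypothetical outputs from three agents for one diary transcript.
agent_outputs = [
    {"persecutory", "fear"},
    {"persecutory"},
    {"fear", "avoidance"},
]
final_labels = majority_vote(agent_outputs)
```

Each agent labels the transcript independently, so no agent can be talked into a premature consensus, which is the failure the abstract attributes to debate-style adjudication.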

[NLP-130] Who judges the judges? Governance from metrics: a runtime framework for continuous LLM compliance monitoring

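[Quick read]: This paper targets a basic flaw in current AI compliance assessment: conformity is treated as a binary, audit-time verdict, which ignores the EU AI Act's requirements for ongoing human oversight and for detecting behavioural drift in deployed systems. The key to the solution is the principle of "governance from metrics", under which compliance is derived as a continuous signal from runtime observability rather than from static assessments. It is implemented in govllm, an open-source framework whose routing selects models by accumulated compliance scores instead of latency or cost alone. A panel of "regulatory judges", LLM evaluators each specialised in one criterion (EU AI Act, GDPR, ANSSI, accessibility), scores the outputs, and disagreement between judges is reframed as a measurable regulatory uncertainty signal that triggers human arbitration. In the validation, four small language models (SLMs) running on-premise reach agreement rates between 51.5% and 69.1%, with no single model dominating across all criteria, which motivates the Profile-as-jury design. The paper also documents structural failure modes and a position bias in small judge models, further underlining the need for dynamic governance.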

链接: https://arxiv.org/abs/2605.24737
作者: Jehanne Dussert
机构: Independent Researcher (独立研究员)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 41 pages, 8 figures, preprint

点击查看摘要

Abstract:Current approaches to AI compliance treat conformity as a binary, audit-time verdict rather than a continuous, measurable property of production systems. We argue that this compliance fiction is structurally ill-suited to the requirements of the EU AI Act, which demands ongoing human oversight and the detection of emergent behavioural drift in deployed systems. We introduce governance from metrics, a principle whereby regulatory compliance is derived as a continuous signal from runtime observability rather than from static assessments. Building on this principle, we present govllm, an open-source framework implementing a governance-driven routing architecture in which model selection is determined by accumulated compliance scores rather than by latency or cost alone. Central to our approach is a panel of regulatory judges - LLM evaluators specialised per criterion (EU AI Act, GDPR, ANSSI, accessibility) - whose inter-judge disagreement we reframe not as noise but as a regulatory uncertainty signal warranting human arbitration. We validate this approach through a ground truth corpus of 49 annotated prompt/response pairs across five regulatory criteria, evaluated by four small language models (SLMs, 1.7B-7B parameters) running fully on-premise. Agreement rates range from 51.5% (mistral:7b) to 69.1% (phi4-mini), with no single model dominating across all criteria - empirically motivating the Profile-as-jury design. We further document three structural failure modes in small regulatory judges and a judge-specific position bias that degrades agreement by up to 25 percentage points across three question-order conditions (original, reversed, permuted). govllm is released as open-source software to support reproducible AI governance research.
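
The idea of reading inter-judge disagreement as a signal rather than as noise can be illustrated with a minimal Python sketch. It is not the govllm API: the function name `panel_verdict`, the verdict labels, and the 0.75 agreement threshold are assumptions made only for illustration.

```python
from collections import Counter

def panel_verdict(verdicts, min_agreement=0.75):
    """Aggregate per-judge verdicts for one regulatory criterion.

    verdicts: dict mapping a judge name to its label (e.g. "pass" / "fail").
    min_agreement: hypothetical threshold; below it, a human is asked to arbitrate.
    """
    counts = Counter(verdicts.values())
    label, votes = counts.most_common(1)[0]   # majority label and its vote count
    agreement = votes / len(verdicts)         # share of judges backing the majority
    return {
        "label": label,
        "agreement": agreement,
        "needs_human": agreement < min_agreement,
    }

unanimous = panel_verdict(
    {"judge_a": "pass", "judge_b": "pass", "judge_c": "pass", "judge_d": "pass"}
)
split = panel_verdict(
    {"judge_a": "pass", "judge_b": "fail", "judge_c": "pass", "judge_d": "fail"}
)
```

In this sketch a two-against-two split yields an agreement of 0.5 and is escalated, while a unanimous panel is not. How the actual framework scores and accumulates compliance is described in the paper, not here.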

[NLP-131] StepGap: A Hybrid NLI-LLM Checker for Step-Level Evidence-Gap Detection in Multi-Hop Question Answering

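[Quick Read]: This paper addresses the detection and repair of step-level evidence gaps in multi-hop question answering (multi-hop QA), with the aim of making model reasoning chains more interpretable and accurate. The key to the solution is StepGap, a hybrid decision tree combining natural language inference (NLI) and a large language model (LLM), which assigns each reasoning step one of three typed labels: Contradicted Claim (CC), Irrelevant Evidence (IE), or Missing Bridge (MB), each tied to a concrete repair action. StepGap reaches a step-level F1 of 72.0, within the bootstrap confidence interval of an LLM-only baseline (70.1), but its structure is more decomposable: removing any StepGap stage hurts F1, whereas three of four removals from the LLM-only baseline improve F1, a sign of competing-error cancellation. The paper also exposes a "Q-F1 trap", in which question-level F1 is inflated by checkers that flag every step, so step-level F1 is the necessary diagnostic. Used as a typed GRPO process reward, StepGap raises the Exact Match of Qwen2.5-7B-Instruct from 32.1±0.3 to 35.4±0.9 across three seeds.

Link: https://arxiv.org/abs/2605.24733
Authors: Yuelyu Ji,Zhuochun Li,Hui Ji,Daqing He
Affiliations: University of Pittsburgh
Categories: Computation and Language (cs.CL)
Comments: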

Abstract:We present StepGap, a hybrid NLI-LLM decision tree that detects step-level evidence gaps in multi-hop QA and emits one of three typed labels: Contradicted Claim (CC), Irrelevant Evidence (IE), or Missing Bridge (MB), each tied to a concrete repair action. On 82 multi-hop questions (181 annotated steps, κ = 0.704), StepGap reaches sF1 = 72.0, within the bootstrap confidence interval of an LLM-only baseline (70.1) but with a more decomposable structure: every StepGap stage hurts F1 when removed, while three of four LLM-only removals improve F1, a sign of competing-error cancellation, where internal stages mask each other's errors. We further expose a Q-F1 trap: question-level F1 is mechanically inflated by checkers that flag every step, making step-level F1 the necessary diagnostic. Used as a typed GRPO process reward, StepGap improves Qwen2.5-7B-Instruct Exact Match from 32.1 ± 0.3 to 35.4 ± 0.9 across three seeds, with the single-run comparison showing a +5.6 Avg EM gain over the matched Search-R1 GRPO reproduction.
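
The Q-F1 trap can be reproduced on toy data. The sketch below is a simplification that assumes binary gap/no-gap labels rather than the paper's three typed labels, and the gold annotations are made up. It compares step-level and question-level F1 for a degenerate checker that flags every step.

```python
def f1(gold, pred):
    """Binary F1 over two equally long lists of 0/1 labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up gold labels: one list per question, one entry per reasoning step (1 = gap).
gold_steps = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]]
# A degenerate checker that flags every step of every question.
flag_all = [[1] * len(question) for question in gold_steps]

# Step level: every step is one instance.
step_f1 = f1(
    [g for question in gold_steps for g in question],
    [p for question in flag_all for p in question],
)
# Question level: a question counts as positive if any of its steps is flagged.
question_f1 = f1(
    [int(any(question)) for question in gold_steps],
    [int(any(question)) for question in flag_all],
)
```

On this toy data the flag-everything checker gets a step-level F1 of 0.4 but a question-level F1 of about 0.86, because flagging at least one step per question is enough to count as a question-level hit.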

[NLP-132] Fundamental Limitation in Explaining AI

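[Quick Read]: This paper asks whether large-scale AI systems such as large language models and diffusion models can be given explanations that are both completely faithful and interpretable, as the explainability requirements of public institutions would demand. The key contribution is a mathematically proven "fundamental quadrilemma in explaining AI": an AI and its explanation cannot simultaneously satisfy the following four conditions: 1) complexity of the operation environment, 2) good AI performance, 3) interpretability of the explanation, and 4) complete faithfulness of the explanation. The result implies that in most practical applications, where the environment cannot be changed and neither performance nor interpretability can be sacrificed, complete faithfulness must be given up in favour of explaining only the parts that matter for the application. The work thus gives AI governance a theoretical basis: regulatory frameworks should be designed on the premise that the faithfulness of AI explanations is always incomplete.

Link: https://arxiv.org/abs/2605.24727
Authors: Atsushi Suzuki,Jing Wang
Affiliations: The University of Hong Kong; University of Greenwich
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Information Theory (cs.IT)
Comments: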

Abstract:While large-scale models such as LLMs and diffusion models have achieved practical success, public institutions have emphasized the importance of explainability in AI. Existing methods for explaining AI, however, are not designed to provide completely faithful explanations of the behavior of large-scale AI systems. Although a completely faithful and interpretable explanation of the behavior of an AI system might be useful for AI governance, it has not been known whether providing such an explanation is theoretically possible. In this paper, we mathematically prove a fundamental quadrilemma in explaining AI, stating that AI and its explanation cannot satisfy the following four conditions simultaneously: 1) the complexity of the operation environment, 2) the goodness of the AI’s performance, 3) the interpretability of the AI’s explanation, and 4) the complete faithfulness of the AI’s explanation. This quadrilemma suggests that, in most applications where we cannot change the environment or sacrifice good AI performance and an interpretable explanation, we should give up complete faithfulness of explanations and should instead aim to explain only the parts that are important for applications. As a consequence, the quadrilemma implies that AI governance should be designed on the premise that the faithfulness of AI explanations is always incomplete.

[NLP-133] ROC Analysis for Evaluating Translation Quality Estimation Systems ACL

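[Quick Read]: This paper addresses how to evaluate automated translation quality estimation (QE) systems in a practical, decision-oriented way as their use becomes more widespread. The key to the solution is Receiver Operating Characteristic (ROC) analysis, which produces results consistent with currently prevalent evaluation methods while also offering actionable performance insights that support business decision-making.

Link: https://arxiv.org/abs/2605.24721
Authors: Evelyn Y. Garland(1),Carola F. Berger(2) ((1) Acta-Transphere, (2) CFB Scientific Translations LLC)
Affiliations: Acta-Transphere; CFB Scientific Translations LLC
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 8 PNG figures, 3 tables, uses this http URL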

Abstract:The increasing use of automated translation quality estimation (QE) systems calls for practical, decision-oriented methods for evaluating their performance. We propose that Receiver Operating Characteristic (ROC) analysis is a useful approach for this purpose. Our study shows that ROC analysis not only produces results consistent with currently prevalent methods, but also offers several important advantages, including actionable performance insights that support business decision-making.
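
As a rough illustration of what ROC analysis computes for a QE system, the following Python sketch derives one ROC operating point and the area under the curve from QE scores and binary human judgments. The scores, labels, and the 0.65 threshold are invented for the example, and the paper's own protocol may differ.

```python
def roc_point(scores, labels, threshold):
    """Return (FPR, TPR) when items scoring >= threshold are predicted 'good'."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    pos = sum(labels)
    neg = len(labels) - pos
    return fp / neg, tp / pos

def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented data: QE score per translation, and whether a human judged it acceptable.
qe_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
is_good = [1, 1, 0, 1, 0, 0]

fpr, tpr = roc_point(qe_scores, is_good, threshold=0.65)
area = auc(qe_scores, is_good)
```

Each threshold corresponds to a business trade-off: raising it lets fewer bad translations through (lower FPR) at the cost of sending more good ones to review (lower TPR).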

[NLP-134] World-State Transformations for Neuro-symbolic Interactive Storytelling

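[Quick Read]: This paper addresses the story coherence problems that commonly arise when interactive storytelling systems rely solely on large language models (LLMs) to process free-text user input. The key to the solution is to use pre-programmed, rule-based world-state transformations within a neuro-symbolic architecture, so that the state of the narrative world stays consistent while the structured triggering of transformations encourages players to interact more creatively through written input. Experiments used an open-source model (Llama 3 70B) and a closed-source model (Gemini 1.5 Flash) in English and Spanish, with eight participants playing two carefully designed scenarios. The observations suggest that the approach can maintain world-state consistency while encouraging creative player expression.

Link: https://arxiv.org/abs/2605.24719
Authors: Santiago Góngora,Luis Chiruzzo,Gonzalo Méndez,Pablo Gervás
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: To be presented at the 17th International Conference on Computational Creativity (ICCC’26)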

Abstract:Large Language Models (LLMs) have changed the possibilities of Interactive Storytelling systems that process free-text user input. However, as more of these systems are built, evidence continues to mount regarding the story coherence problems that arise when relying solely on them. Recent research suggests that LLMs can effectively predict state changes within rule-based Interactive Storytelling systems, triggering pre-programmed world-state transformations. In this paper, we conduct an exploratory evaluation of whether such transformations can serve as a catalyst for player expression while aiming to address the incoherence issues typical of purely LLM-based approaches. Building upon a neuro-symbolic architecture, we conducted experiments using an open-source model (Llama 3 70B) and a closed-source model (Gemini 1.5 Flash), with testing conducted in both English and Spanish. Eight participants played two scenarios, carefully designed to assess different evaluation objectives. Our observations suggest that transformations offer a way to maintain world-state consistency while encouraging players to interact creatively through their written inputs.
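
The division of labour described here, where a language model only predicts which pre-programmed transformation applies and the rules change the world state, might look roughly like the Python sketch below. Everything in it is hypothetical: the state keys, the two rules, and the keyword-based `predict_transformation`, which merely stands in for an LLM call.

```python
def take_key(state):
    """Rule: the player picks up the key."""
    return {**state, "has_key": True}

def unlock_door(state):
    """Rule: the door only unlocks if the player holds the key."""
    # The rule, not the language model, decides whether the action is valid.
    if not state["has_key"]:
        return state
    return {**state, "door_locked": False}

# Table of pre-programmed world-state transformations.
TRANSFORMATIONS = {"take_key": take_key, "unlock_door": unlock_door}

def predict_transformation(player_input):
    """Hypothetical stand-in for the LLM that maps free text to a rule name."""
    text = player_input.lower()
    if "door" in text:
        return "unlock_door"
    if "key" in text:
        return "take_key"
    return None

def step(state, player_input):
    """Apply the predicted transformation, or leave the state unchanged."""
    name = predict_transformation(player_input)
    return TRANSFORMATIONS[name](state) if name else state

start = {"has_key": False, "door_locked": True}
blocked = step(start, "I push the door open")
opened = step(step(start, "I grab the rusty key"), "I unlock the door with it")
```

Because the rules own the state, a player who tries the door without the key leaves the world unchanged, however the request is phrased.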

[NLP-135] The Tokenizer Tax Across 25 European Languages: Domain Invariance, Cross-Lingual Few-Shot Effects, and the Ukrainian Penalty

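[Quick Read]: This paper addresses the hidden cost that tokenizer fertility (the average number of tokens per word) imposes on non-English natural language processing (NLP). The key to the solution is a systematic measurement of the tokenizers of ten foundation models on parallel text in 25 European languages, yielding the first controlled "tokenizer tax" map. The map shows large fertility differences between languages and a clear hierarchy across language groups (Romance, Germanic, Slavic, and so on), and a subword analysis finds that high-fertility tokenizers fragment morphological boundaries rather than preserving them. The study also finds that few-shot effects are model-intrinsic rather than language-dependent, and it releases all measurements as a public dataset, providing a quantitative baseline for cross-lingual NLP.

Link: https://arxiv.org/abs/2605.24718
Authors: Volodymyr Ovcharov
Affiliations: LEX AI Platform; legal.org.ua
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 3 figures, 8 tables. Dataset: this https URL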

Abstract:Tokenizer fertility, the number of tokens per word, imposes a hidden cost on non-English NLP. We measure fertility for ten foundation models across 25 European languages on parallel text, producing the first controlled tokenizer tax map for the continent. The tax spans 2.5x from English (1.2 tokens/word) to Greek/Maltese (~3.1), following a clear hierarchy: Romance (1.5-1.7), Germanic (1.7-1.9), Slavic (2.2-2.5), Uralic/Baltic (2.7-3.0). Ukrainian (2.7) pays 15-18% more than cognate Slavic languages, reflecting underrepresentation in pre-training data. Fertility rankings are domain-invariant across three text registers (rho 0.97). A subword analysis reveals that high-fertility tokenizers fragment morphological boundaries rather than preserving them. Cross-lingual few-shot evaluation on four Slavic languages shows that few-shot effects are model-intrinsic, not language-dependent. We release all measurements as a public dataset.
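
Fertility itself is a simple ratio, which the sketch below computes in Python. Two assumptions are worth flagging: words are split on whitespace, and `toy_tokenize` is a made-up stand-in that chops words into fixed-size character chunks. A real measurement would call the tokenizer of each model instead, and the paper's word segmentation may differ.

```python
def fertility(text, tokenize):
    """Average number of tokens per whitespace-delimited word."""
    words = text.split()
    n_tokens = sum(len(tokenize(word)) for word in words)
    return n_tokens / len(words)

def toy_tokenize(word, max_len=4):
    """Hypothetical stand-in for a subword tokenizer: fixed-size character chunks."""
    return [word[i:i + max_len] for i in range(0, len(word), max_len)]

short_words = fertility("the cat sat on the mat", toy_tokenize)
long_words = fertility("internationalization standardization", toy_tokenize)
```

With this toy tokenizer the short English words cost one token each, while the two long words cost 4.5 tokens on average, which is the kind of gap that the paper's tax map quantifies with real tokenizers on parallel text.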

[NLP-136] TS-Skill: A Benchmark for Evaluating Analytical Skills in Time-Series Question Answering

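[Quick Read]: This paper addresses the lack of fine-grained skill diagnosis in current evaluation of time-series question answering (TSQA). Existing benchmarks are usually organized by task type or high-level reasoning category, which makes it hard to see which signal-level capabilities drive model performance. The key to the solution is TS-Skill, a controlled benchmark for three composable time-series analytical skills: temporal scale selection (SK1), temporal localization (SK2), and cross-interval integration (SK3). To build the benchmark at scale, the authors develop SKEvol, a framework that combines domain-aware time-series seed generation, skill-controlled question generation, metadata- and code-assisted answer construction, multi-phase signal-grounded verification, and human-in-the-loop curation. Experiments reveal substantial and uneven capability gaps across SK1 to SK3: non-agent models struggle consistently on SK3, while tool-augmented agents show a selective advantage on standalone SK3, which indicates that skill-level evaluation can uncover temporal reasoning failures hidden by aggregate TSQA scores.

Link: https://arxiv.org/abs/2605.24703
Authors: Liying Han,Kang Yang,Oliver Wang,Jason Wu,Pengrui Quan,Gaofeng Dong,Ozan Baris Mulayim,Sizhe Ma,Yuyang Yuan,Dezhi Hong,Mario Berges,Mani Srivastava
Affiliations: University of California, Los Angeles; Samsung Research America; Carnegie Mellon University; Microsoft; Amazon
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: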

Abstract:Large language models (LLMs) and time-series language models (TSLMs) are increasingly applied to time-series question answering (TSQA). Unlike text-only QA, TSQA requires models to ground answers in temporal signals whose patterns may occur at different scales, specific time locations, or across separated intervals. However, existing benchmarks are typically organized by task types or high-level reasoning categories, making it difficult to diagnose the underlying signal-level capabilities driving model performance. We introduce TS-Skill, a controlled benchmark for evaluating three composable analytical skills in TSQA: temporal scale selection (SK1), temporal localization (SK2), and cross-interval integration (SK3). TS-Skill provides timestamp-aware questions, broad domain coverage, and human-validated QA quality. To construct the benchmark at scale, we develop SKEvol, a skill-guided agentic framework that combines domain-aware time-series seed generation, skill-controlled question generation, metadata- and code-assisted answer construction, multi-phase signal-grounded verification, and human-in-the-loop curation. Experiments on ten state-of-the-art LLMs and TSLMs reveal substantial and uneven capability gaps across SK1-SK3. In particular, SK3 remains consistently challenging for non-agent models, whereas tool-augmented agents show a selective advantage on standalone SK3. These findings demonstrate that skill-level evaluation can uncover temporal reasoning failures that are obscured by aggregate TSQA scores.

[NLP-137] The Path Matters: Learning a Token-Commitment Policy for Diffusion Language Models

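[Quick Read]: This paper addresses the "token commitment" problem in parallel generation with diffusion large language models, that is, deciding at each step which proposed tokens to transfer into the partially decoded sequence. Existing methods rely on hand-designed confidence rules or block-specific acceptance filters, which generalize poorly and are complex to tune. The key to the solution is TraceLock, a lightweight plug-in controller that treats token commitment as a reusable trace-state policy and trains it with self-supervision from future stability: at decoding step t, a proposed token for position i is labeled stable if it matches the final token at position i once the full decoding trace completes. The controller scores variable-length trace states and decides which active token proposals to commit. Once trained for a given frozen backbone, TraceLock can be deployed across local-window widths, generation lengths, and step budgets without retraining or per-setting calibration. Experiments on question answering, mathematical reasoning, and code generation show an improved quality-step tradeoff over heuristic and learned baselines, with particularly stable behavior under cross-setting deployment. Diagnostic analyses show that its decisions cannot be reduced to scalar confidence, suggesting that frozen diffusion language models expose a learnable space of commitment trajectories beyond confidence-based decoding.

Link: https://arxiv.org/abs/2605.24697
Authors: Bohang Sun,Max Zhu,Francesco Caso,Jindong Gu,Junchi Yu,Philip Torr,Pietro Liò,Jialin Yu
Affiliations: University of Cambridge; University of Oxford
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: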

Abstract:Diffusion large language models promise faster generation by refining many token positions in parallel, but this parallelism introduces a hidden control problem: which proposed tokens should be transferred into the partially decoded sequence at each step? We refer to this decision as token commitment. Existing frozen-generator decoders largely rely on hand-designed confidence rules or block-specific acceptance filters. We argue that token commitment can instead be learned as a reusable trace-state policy. We introduce TraceLock, a lightweight plug-in controller that instantiates this policy for a frozen diffusion language model. Since oracle commitment times are unavailable, TraceLock derives self-supervision from future stability: at decoding step t, a proposed token for position i is labeled stable if it matches the final token at position i after the full decoding trace completes. The controller scores variable-length trace states and decides which active token proposals should be committed to the partially decoded sequence. Once trained for a given frozen backbone, the controller can be deployed across local-window widths, generation lengths, and step budgets without retraining or per-setting calibration. Experiments on question answering, mathematical reasoning, and code generation show that TraceLock improves the quality-step tradeoff over heuristic and learned baselines, with particularly stable behavior under cross-setting deployment. Diagnostic analyses show that its decisions are not reducible to scalar confidence, suggesting that frozen diffusion language models expose a learnable space of commitment trajectories beyond confidence-based decoding. Code is available at this https URL.
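
The future-stability labeling rule stated in the abstract is easy to write down. The Python sketch below assumes a decoding trace is stored as one full list of proposed tokens per step, which is a simplification: it ignores masked positions and the notion of "active" proposals, and the function name and toy trace are invented for illustration.

```python
def future_stability_labels(trace):
    """Label each proposal by whether it survives to the end of decoding.

    trace: list of proposed token sequences, one per decoding step;
           the last entry is taken as the final decoded output.
    Returns labels[t][i] = 1 if the token proposed at step t for position i
    equals the final token at position i, else 0.
    """
    final = trace[-1]
    return [
        [int(token == final[i]) for i, token in enumerate(step)]
        for step in trace
    ]

# Toy trace with three decoding steps over three positions.
trace = [
    ["the", "dog", "sat"],
    ["the", "cat", "sit"],
    ["the", "cat", "sat"],
]
labels = future_stability_labels(trace)
```

Labels of this kind need no oracle commitment times, only completed decoding traces from the frozen model. The controller trained on them is described in the paper.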

[NLP-138] CP-Agent: A Calibrated Risk-Controlled Agent for Feedback-Driven Competitive Programming

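[Quick Read]: This paper addresses the weak performance of large language models (LLMs) on contest-level programming, where existing agentic remedies often depend on massive inference-time sampling or expensive multi-stage post-training. The core of the solution is an execution-feedback-driven solving framework that models feedback-driven solving as a calibrated stopped process and identifies three key quantities: false-admission risk, program-level evidence against bad programs, and the active-state success hazard. Under held-out trace calibration and selection from a pre-declared finite controller manifest, the framework yields a structural certificate that lower-bounds the clean success probability before false admission. Three mechanisms target these quantities: Dual-Granularity Verification, Test Augmentation, and Experience-Driven Self-Evolving, which together form CP-Agent. Without updating any parameters, the method raises Pass@1 on LiveCodeBench Pro from 25.8% to 48.5% and improves Refine@5 on ICPC-Eval by 11.0%. It lies on the cost-accuracy efficiency frontier across three LLM backbones, and ablations show that each component primarily affects its corresponding certificate quantity.

Link: https://arxiv.org/abs/2605.24693
Authors: Peisong Wang,Bowen Liu,Zehua Li,Yuyao Wang,Zhiwei Ma,Yuhan Li,Jia Li
Affiliations: The Hong Kong University of Science and Technology (Guangzhou)
Categories: Computation and Language (cs.CL)
Comments: Code: this https URL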

Abstract:Large language models still struggle with contest-level programming, while many agentic remedies rely on massive inference-time sampling or expensive multi-stage post-training. We study when execution feedback reliably helps an LLM CP solver and which mechanisms govern the gains. We model feedback-driven solving as a calibrated stopped process and identify three quantities: false-admission risk, program-level evidence against bad programs, and the active-state success hazard. Under held-out trace calibration and selection from a pre-declared finite controller manifest, the resulting structural certificate lower-bounds the clean success probability before false admission. We instantiate mechanisms targeting these quantities as Dual-Granularity Verification, Test Augmentation, and Experience-Driven Self-Evolving, yielding CP-Agent. Without updating any parameters, CP-Agent raises Pass@1 from 25.8% to 48.5% on LiveCodeBench Pro and improves Refine@5 by 11.0% on ICPC-Eval. Across three LLM backbones, CP-Agent lies on the cost–accuracy efficiency frontier, and ablations show that each component primarily affects its corresponding certificate quantity.

[NLP-139] Mix-MoE: Improving Multilingual Machine Translation of Large Language Models through Mixed MoEs

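[Quick Read]: This paper addresses the parameter interference that arises when large language models (LLMs) are fine-tuned with parallel corpora for multilingual machine translation (MT). The key to the solution is Mix-MoE, a mixed Mixture-of-Experts framework that performs post-pretraining in two stages, first with MoE on monolingual corpora and then with MoE on parallel corpora. The MoE layers are divided into two specialized groups: Language Model Experts (LM Experts), which retain the monolingual knowledge of the pre-trained model, and Machine Translation Experts (MT Experts), which learn bilingual translation knowledge. A routing mechanism enhanced with Fourier Transform features derived from model representations is added to promote interaction between the experts and to exploit underlying structural patterns in text, which mitigates parameter interference and improves multilingual translation performance.

Link: https://arxiv.org/abs/2605.24681
Authors: Bo Li,Tianyu Dong,Shaolin Zhu,Deyi Xiong
Affiliations: Tsinghua University; Tianjin University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by TASLP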

Abstract:Large Language Models (LLMs) have shown great promise in multilingual machine translation (MT), even with limited bilingual supervision. However, fine-tuning LLMs with parallel corpora presents major challenges, namely parameter interference. To address these issues, we propose Mix-MoE, a mixed Mixture-of-Experts framework designed to train LLMs for multilingual MT. Our framework operates in two distinct stages: (1) post-pretraining with MoE on monolingual corpora, and (2) post-pretraining with MoE on parallel corpora. Crucially, we divide the MoE layers into two specialized groups: Language Model Experts (LM Experts) and Machine Translation Experts (MT Experts). LM Experts are designed to capture and retain the monolingual knowledge learned by the pre-trained LLM. MT Experts, on the other hand, are specifically trained to acquire and store bilingual translation knowledge. Furthermore, to facilitate effective interaction between these specialized experts and leverage potential underlying structural patterns in text, we introduce a routing mechanism enhanced by Fourier Transform features derived from model representations. The experimental results demonstrate that Mix-MoE excels in multilingual MT, significantly outperforming existing baselines and showing notable progress in mitigating parameter interference.

[NLP-140] Exploration of Perceptual Speech Features for Clinical Decision-Support in Mental Health Care ACL2026

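[Quick Read]: This paper addresses how speech features can support objective and interpretable assessment of mental health conditions such as depression, anxiety, and attention-deficit/hyperactivity disorder (ADHD). The key to the solution is a systematic feature-based analysis framework that combines perceptually grounded acoustic and linguistic features (prosody, vocal quality, semantic coherence, syntactic structure, and sarcasm) with statistical analysis and interpretable machine learning (XGBoost with SHAP and LIME) to identify stable associations between speech features and validated symptom measures. The results indicate that vocal irregularities (such as shimmer and jitter), lexical-syntactic patterns, and affective tone are consistently related to symptom severity, and that these relationships hold across several benchmark datasets and a real-world clinical dataset.

Link: https://arxiv.org/abs/2605.24678
Authors: Vassilis Lyberatos,Edmund G. Dervakos,Eleni Adamidi,Athanasios Voulodimos,Giorgos Stamou
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted to CLPsych 2026, part of ACL 2026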

Abstract:Speech and language technologies offer valuable opportunities for supporting mental health assessment through objective and interpretable cues. We present a systematic feature-based analysis framework leveraging perceptually grounded acoustic and linguistic characteristics, including prosody, vocal quality, semantic coherence, syntactic structure, and sarcasm. Using statistical analysis and interpretable machine learning (XGBoost with SHAP and LIME), we examine associations between speech features and validated symptom measures of depression, anxiety, and ADHD. Evaluated on both controlled benchmark datasets (StressID, DAIC-WOZ, Androids, EATD) and a real-world clinical dataset, the framework reveals stable and consistent relationships between symptom severity and vocal irregularities (e.g., shimmer, jitter), lexical-syntactic patterns, and affective tone. An ablation study conducted across all datasets further identifies the most informative feature groups. This work explores a transparent and clinically interpretable approach to speech-based mental health analysis.

[NLP-141] Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework

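[Quick Read]: This paper addresses the limited insight into reasoning processes that results from evaluating large language models (LLMs) mainly on final-answer correctness. The key to the solution is a unified multi-dimensional behavioral framework that quantifies reasoning quality along six theoretically grounded dimensions: Correctness (CQ), Consistency (CS), Robustness (RS), Logical Coherence (LS), Efficiency (ES), and Stability (SS). Experiments show that the framework reveals behaviors that accuracy alone cannot capture. For example, logical coherence is not significantly correlated with correctness (r = -0.172, not significant), which indicates that correct answers can arise from incoherent reasoning. The framework also exposes ranking inversions that a single metric misses (DeepSeek-V3 ranks second under accuracy-priority weighting but fifth under legal/compliance weighting), and discriminant validity analysis finds that 11 of 15 dimension pairs are independent (|r| < 0.50), which gives psychometric support for multi-dimensional evaluation. The framework can directly inform deployment decisions: identifying models whose answers are correct but whose reasoning would fail accountability audits, preventing ranking errors caused by accuracy-only benchmarking, and ensuring that no single metric substitutes for all six signals.

Link: https://arxiv.org/abs/2605.24661
Authors: Ali Şenol,Garima Agrawal,Huan Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: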

Abstract:LLMs have achieved remarkable success in complex reasoning tasks, yet current evaluation approaches predominantly rely on final-answer correctness, offering limited insight into the underlying reasoning processes that produce those answers. To address this gap, this study proposes a unified multi-dimensional framework for measuring reasoning quality in LLMs from a behavioral perspective, operationalizing six theoretically grounded dimensions: Correctness (CQ), Consistency (CS), Robustness (RS), Logical Coherence (LS), Efficiency (ES), and Stability (SS). Extensive experiments on seven LLMs across 975 items from four benchmarks demonstrate that the framework reveals behaviors invisible to accuracy-only metrics. Notably, logical coherence is orthogonal to correctness (r = -0.172, ns), confirming that correct answers can arise from incoherent reasoning, while Claude-Haiku-4.5 achieves the highest multi-dimensional score (Q_bal = 0.778). Furthermore, the framework exposes critical ranking inversions: DeepSeek-V3 ranks second under accuracy-priority but fifth under legal/compliance weighting, a reversal that single-metric evaluation cannot detect. Discriminant validity confirms 11/15 dimension pairs are independent (|r| < 0.50), providing psychometric support for treating each dimension as a distinct signal. The dimensional profiles produced by the framework directly support three classes of deployment decision: identifying models whose reasoning traces would fail accountability audits despite correct final answers (LS–CQ orthogonality); preventing ranking errors caused by accuracy-only benchmarking; and ensuring that no single metric silently substitutes for the six independent signals the framework captures.
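
A ranking inversion of the kind reported here follows directly from re-weighting a fixed set of dimension scores. The Python sketch below assumes a plain weighted mean as the aggregate, which may not match the paper's Q_bal definition, and all profiles and weights are made up.

```python
DIMENSIONS = ["CQ", "CS", "RS", "LS", "ES", "SS"]

def weighted_score(profile, weights):
    """Weighted mean of the six dimension scores (an assumed aggregate)."""
    total = sum(weights[d] for d in DIMENSIONS)
    return sum(profile[d] * weights[d] for d in DIMENSIONS) / total

def rank(profiles, weights):
    """Model names sorted from best to worst under the given weights."""
    return sorted(
        profiles,
        key=lambda name: weighted_score(profiles[name], weights),
        reverse=True,
    )

# Made-up profiles: model_a is more accurate, model_b reasons more coherently.
profiles = {
    "model_a": {"CQ": 0.90, "CS": 0.70, "RS": 0.70, "LS": 0.50, "ES": 0.80, "SS": 0.70},
    "model_b": {"CQ": 0.80, "CS": 0.80, "RS": 0.75, "LS": 0.85, "ES": 0.60, "SS": 0.80},
}
# Two hypothetical deployment priorities.
accuracy_first = {"CQ": 5, "CS": 1, "RS": 1, "LS": 1, "ES": 1, "SS": 1}
audit_first = {"CQ": 1, "CS": 2, "RS": 1, "LS": 5, "ES": 1, "SS": 2}

ranking_accuracy = rank(profiles, accuracy_first)
ranking_audit = rank(profiles, audit_first)
```

The two weightings order the same pair of models differently, which is why a single accuracy leaderboard cannot stand in for the full profile.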

[NLP-142] Know You Before You Speak: User-State Modeling for LLM Personalization in Multi-Turn Conversation

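[Quick Read]: This paper addresses the difficulty personalized dialogue systems have in modeling latent user states and how they evolve during interaction. Existing memory- and profile-based methods rely mainly on explicit history and offer limited support for selecting actions based on how they shape future user states. The key to the solution is PUMA (Prospective User-state Modeling for Action selection), a framework grounded in the Free Energy Principle (FEP) that formulates personalization as decision-making under partial observability, centered on an explicit user state model that captures latent user states and their action-conditioned dynamics. At each turn, PUMA maintains a belief over the user's hidden state, refines the models of observation generation and action-conditioned state transition, and selects dialogue actions by minimizing expected free energy, which balances epistemic objectives (gaining information) and pragmatic objectives (achieving outcomes) under one criterion. This shifts personalization from passive memory retrieval to model-based decision-making over user evolution.

Link: https://arxiv.org/abs/2605.24647
Authors: Jiani Luo,Xiaoyan Zhao,Yang Zhang,Shuyi Miao,Bingbing Xu,Stefan Konigorski,Tat-Seng Chua
Affiliations: National University of Singapore; Beihang University; Chinese Academy of Sciences; German Institute of Human Nutrition Potsdam-Rehbruecke
Categories: Computation and Language (cs.CL)
Comments: 30 pages, 3 figures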

Abstract:Personalized dialogue requires more than recalling explicit user histories: systems also need to infer hidden user states that evolve through interaction and shape appropriate response strategies. Existing memory- and profile-based methods primarily reuse observable user information, offering limited support for modeling user-state dynamics or selecting actions based on how they shape future user states. We propose PUMA (Prospective User-state Modeling for Action selection), a framework grounded in the Free Energy Principle (FEP) that formulates personalization as decision-making under partial observability, centered on an explicit user state model that captures latent user states and their action-conditioned dynamics. At each turn, PUMA maintains a belief over the user’s hidden state, refines the user state model for observation generation and action-conditioned state transition, and selects dialogue actions by minimizing expected free energy, balancing epistemic and pragmatic objectives under a unified criterion. This formulation shifts personalization from passive memory retrieval to model-based decision-making over user evolution. We instantiate PUMA on healthcare-oriented counseling and motivational interviewing benchmarks with latent state annotations for rigorous evaluation. Experiments show that PUMA improves long-horizon dialogue outcomes while maintaining strong response quality, and a cross-dataset study demonstrates more reliable user-state estimation and next-state prediction.
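
Selecting an action by minimizing expected free energy can be sketched for a tiny discrete model. The code below uses the generic active-inference form (negative expected log-preference minus expected information gain). It is not PUMA's implementation: the two user states, two observations, two dialogue actions, and all probabilities are invented for illustration.

```python
import math

def expected_free_energy(belief, transition, likelihood, log_pref):
    """G(a) for one action in a discrete model.

    belief[s]: current belief over hidden user states.
    transition[s][s2]: p(s2 | s, a) for the action being scored.
    likelihood[s2][o]: p(o | s2).
    log_pref[o]: log-preference over observations.
    """
    n_states, n_obs = len(belief), len(log_pref)
    # Predicted next-state distribution under this action.
    next_state = [
        sum(belief[s] * transition[s][s2] for s in range(n_states))
        for s2 in range(n_states)
    ]
    # Predicted observation distribution under this action.
    obs = [
        sum(next_state[s2] * likelihood[s2][o] for s2 in range(n_states))
        for o in range(n_obs)
    ]
    # Pragmatic term: expected log-preference of predicted observations.
    pragmatic = sum(obs[o] * log_pref[o] for o in range(n_obs))
    # Epistemic term: mutual information between next state and observation.
    epistemic = 0.0
    for s2 in range(n_states):
        for o in range(n_obs):
            joint = next_state[s2] * likelihood[s2][o]
            if joint > 0:
                epistemic += joint * math.log(likelihood[s2][o] / obs[o])
    return -pragmatic - epistemic

# Invented example: states (resistant, ready), observations (disengaged, engaged).
belief = [0.5, 0.5]
likelihood = [[0.9, 0.1], [0.1, 0.9]]
log_pref = [math.log(0.2), math.log(0.8)]  # engaged replies are preferred
actions = {
    "reflect": [[0.4, 0.6], [0.1, 0.9]],  # tends to move the user toward "ready"
    "advise": [[0.9, 0.1], [0.3, 0.7]],   # tends to leave or push toward "resistant"
}
scores = {
    name: expected_free_energy(belief, transition, likelihood, log_pref)
    for name, transition in actions.items()
}
best_action = min(scores, key=scores.get)
```

With these invented numbers the "reflect" action has the lower expected free energy, mainly because it makes the preferred "engaged" observation more likely.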

[NLP-143] GlobalDentBench: A Multinational Benchmark for Evaluating LLM Clinical Reasoning in Dentistry with Expert Calibration

[Quick Read]: This paper addresses the fact that the reasoning robustness and safety of large language models (LLMs) in real-world clinical scenarios, particularly in dentistry, remain underexplored and carry potential medical risk. The key to the solution is GlobalDentBench, the first multinational dental benchmark, whose main features are: (1) a taxonomy covering 14 dental specialties across 88 countries and regions; (2) three question formats (multiple-choice, short-answer, and case-based) and three progressive reasoning levels (knowledge recall L1, routine reasoning L2, and individualized reasoning L3) for systematically assessing complex reasoning; and (3) an automated construction framework calibrated by six senior dentists to ensure data quality (expert agreement of 99.98% for multiple-choice and short-answer questions and 96.78% for case-based questions). The results show that LLM performance drops sharply as reasoning complexity increases and that 31.01% of clinical recommendations are unsafe (with 4.51% posing risks of irreversible patient harm). This exposes fundamental limitations of current models in medical use, provides a scalable benchmark for trustworthy clinical AI evaluation, and underscores the need for rigorous validation before deployment.

Link: https://arxiv.org/abs/2605.24636
Authors: Junjie Zhao,Jingyi Liang,Zhenyang Cai,Jiaming Zhang,Zhenwei Wen,Shuzhi Deng,Wenjing Yi,Chunfeng Luo,Hexian Zhang,Junying Chen,Tianrui Liu,Zhuhui Bai,Zixu Zhang,Pradeep Singh,Xiang Liu,Jianquan Li,Nhan L Tran,Falk Schwendicke,Zuolin Jin,Lijian Jin,Liangyi Chen,Wei-fa Yang,Benyou Wang,Junwen Wang,Shan Jiang
Affiliations: The University of Hong Kong; The Chinese University of Hong Kong, Shenzhen; Peking University; Southern Medical University; Shenzhen Stomatology Hospital (Pingshan); Beijing Institute of Collaborative Innovation; New Cornerstone Science Laboratory, National Biomedical Imaging Center, State Key Laboratory of Membrane Biology, Institute of Molecular Medicine, Peking-Tsinghua Center for Life Sciences, College of Future Technology, Peking University; IDG/McGovern Institute for Brain Research, Peking University; Mayo Clinic Arizona; LMU University Hospital, LMU Munich; Freedom AI; Shenzhen Loop Area Institute
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:While large language models (LLMs) hold transformative potential for medicine, their reasoning robustness and safety in real-world clinical scenarios remain critically underexplored, particularly in dentistry. Here we introduce GlobalDentBench, the first multinational dental benchmark, featuring a taxonomy that encompasses 14 dental specialties across 88 countries and regions spanning six continents. The benchmark comprises 8,978 expert-validated questions across three formats (multiple-choice, short-answer, and case-based questions) and assesses three progressive reasoning levels: knowledge recall (L1), routine reasoning (L2), and individualized reasoning (L3). To ensure data quality, the automated construction framework was calibrated by six senior dentists, achieving expert agreement rates of 99.98% for multiple-choice and short-answer questions and 96.78% for the more complex case-based questions. Evaluation of 12 frontier LLMs on GlobalDentBench revealed a sharp, stepwise performance degradation with increasing reasoning complexity. Specifically, accuracy plummeted from 81.34% on multiple-choice to 64.53% on short-answer and 22.34% on case-based questions, while declining markedly from 74.01% at L1 to 55.64% at L2 and 35.71% at L3. More critically, risk analysis of real-world dental cases demonstrated an alarming overall unsafe rate of 31.01% in LLM-generated clinical recommendations, with 4.51% posing risks of irreversible patient harm and risks particularly pronounced in specialties such as orthodontics. These findings expose fundamental limitations in the medical reasoning and safety of current LLMs. Consequently, GlobalDentBench provides a scalable foundation for trustworthy clinical AI evaluation, underscoring the urgent need for rigorous validation before the safe deployment of these models in healthcare.

[NLP-144] HiMed: Incentivizing Hindi Reasoning in Medical LLMs

[Quick Read]: This paper addresses the sharp drop in performance that medical large language models (LLMs), which excel in high-resource languages, show in low-resource languages such as Hindi, particularly on Indian systems of medicine, a gap that works against the goal of reducing healthcare disparities. The key to the solution is HiMed, a Hindi medical reasoning corpus and benchmark suite covering Western and Indian medicine, together with HiMed-8B, a model trained with a "decaying scaffolding reward" to strengthen understanding of and reasoning about medical knowledge in Hindi. Experiments show improved Hindi medical reasoning performance and a smaller English-Hindi accuracy gap.

Link: https://arxiv.org/abs/2605.24635
Authors: Dingfeng Jiang,Han Yan,Chenze Ma,Amit Kumar Jaiswal,Ang Li,Yunxiang Jiang,Xinlei Xiong,Juhao Liang,Hongru Xiao,Xiang Li,Fan Bu,Jiale Han,Ruchir Gupta,Prayag Tiwari,Benyou Wang
Affiliations: The Chinese University of Hong Kong, Shenzhen; Indian Institute of Technology (Banaras Hindu University) Varanasi; Tongji University; Shenzhen Research Institute of Big Data; Shenzhen Loop Area Institute; The Hong Kong University of Science and Technology; Halmstad University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Medical large language models hold promise for reducing healthcare disparities, yet Hindi remains severely underrepresented. While medical LLMs excel in high-resource languages, their performance degrades sharply in Hindi, particularly on Indian systems of medicine. We argue that robust cross-lingual medical transfer requires Hindi reasoning. To this end, we introduce HiMed, a Hindi reasoning medical corpus and benchmark suite covering both Western and Indian medicine. We further propose HiMed-8B, a Hindi-form medical reasoning LLM, through the design of decaying scaffolding reward. Extensive experiments demonstrate improvement in Hindi medical reasoning performance and reduction in the English–Hindi accuracy gap. Ablation studies validate the contribution of each training stage and reward component. All data and code are available on GitHub: this https URL.

[NLP-145] Measuring the Depth of LLM Unlearning via Activation Patching

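[Quick read]: The paper addresses the difficulty of auditing post-hoc unlearning in large language models (LLMs): existing output-level metrics cannot detect target knowledge that remains recoverable from internal representations. The key to the solution is a new metric, the Unlearning Depth Score (UDS), which uses activation patching to quantify the mechanistic depth of unlearning. UDS first identifies the layers that encode the target knowledge using a retain-model baseline, then measures on a 0-1 scale how much of that knowledge has been erased in the unlearned model. In a meta-evaluation covering 20 metrics and 150 unlearned models across 8 methods, UDS achieves the highest faithfulness and robustness, supporting the causal approach as a reliable basis for evaluating unlearning.

Link: https://arxiv.org/abs/2605.24614
Authors: Jaeung Lee,Dohyun Kim,Jaemin Jo
Affiliations: Sungkyunkwan University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages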

Abstract:Large language model (LLM) unlearning has emerged as a crucial post-hoc mechanism for privacy protection and AI safety, yet auditing whether target knowledge is truly erased remains challenging. Existing output-level metrics fail to detect when this knowledge remains recoverable from internal representations. Recent white-box studies reveal such residual knowledge but often rely on auxiliary training or dataset-specific adaptations, leaving no generalizable metric. To address these limitations, we propose the Unlearning Depth Score (UDS), a metric that quantifies the mechanistic depth of unlearning via activation patching. UDS first identifies layers that encode the target knowledge using a retain model baseline, then measures how much of it is erased in the unlearned model on a 0-1 scale. In a meta-evaluation across 20 metrics on 150 unlearned models spanning 8 methods, UDS achieves the highest faithfulness and robustness, confirming our causal approach as the most reliable for unlearning evaluation. Case studies further reveal that white-box metrics can disagree at the layer level and that erasure depth varies across examples. We provide guidelines for integrating UDS into existing benchmarking frameworks and streamlining the evaluation pipeline. Code and data are available at this https URL

[NLP-146] Guarded Repair for Harm-Aware Post-hoc Replacement of LLM Mathematical Reasoning

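[Quick read]: The paper addresses the asymmetric risk of post-hoc repair for LLM mathematical reasoning: fixing an incorrect reasoning trace helps, but replacing a trace that was already correct causes harm. The key to the solution is GuardedRepair, a guarded best-of-N repair framework that diagnoses cached reasoning traces, triggers repair selectively, and accepts answer-changing candidates only when deterministic verification guards support the replacement. It combines lightweight symbolic checks, surface semantic-risk diagnostics, bounded candidate generation, and conservative acceptance policies. On the full GSM8K test set it raises accuracy from 95.60% to 96.89% with no measured broken-correct cases in the main run; in a weak-reasoner ASDiv setting, accuracy rises from 78.40% to 87.60%. Direct regeneration does worse: re-solving all GSM8K examples lowers accuracy to 93.03% and breaks 47 initially correct answers. The results support treating post-hoc repair as harm-aware selective replacement rather than unconstrained re-solving.

Link: https://arxiv.org/abs/2605.24613
Authors: Haizhou Xia
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 15 pages, including appendices. Code and artifacts available at this https URL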

Abstract:Post-hoc repair of LLM mathematical reasoning introduces an asymmetric risk: fixing an incorrect reasoning trace is useful, but replacing a trace that was already correct can be harmful. We study this problem under a selective replacement setting, where a system must decide whether a repaired candidate is safer than preserving the original cached trace. We present GuardedRepair, a guarded best-of-N repair framework that diagnoses cached reasoning traces, selectively triggers repair, and accepts answer-changing candidates only when deterministic verification guards support replacement. The framework combines lightweight symbolic checks, surface semantic-risk diagnostics, bounded candidate generation, and conservative acceptance policies. On the full GSM8K test set, where the initial reasoner already achieves 95.60% accuracy, GuardedRepair improves final accuracy to 96.89%, fixing 17 of 58 remaining errors without measured broken-correct cases in the main run. On a weak-reasoner ASDiv setting, accuracy improves from 78.40% to 87.60%. Direct regeneration baselines show that this gain is not explained by stronger-model re-solving alone: re-solving all GSM8K examples lowers accuracy to 93.03% and breaks 47 initially correct answers. Additional analyses show that guarded repair substantially improves the fixed/broken tradeoff, while also revealing that replacement risk is reduced rather than eliminated. These results support viewing post-hoc repair as harm-aware selective replacement rather than unconstrained re-solving.
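
The percentages in this abstract can be cross-checked with simple arithmetic. The sketch below assumes that the "full GSM8K test set" is the standard 1,319-problem split; the count of errors fixed by blanket re-solving is derived here and is not stated in the abstract.

```python
# GSM8K's standard test split has 1,319 problems; treating that as the
# abstract's "full GSM8K test set" is an assumption.
GSM8K_TEST_SIZE = 1319

def accuracy_pct(correct, total=GSM8K_TEST_SIZE):
    """Accuracy as a percentage rounded to two decimals."""
    return round(100 * correct / total, 2)

# Figures stated in the abstract.
initial_errors = 58          # "58 remaining errors"
fixed_by_repair = 17         # "fixing 17 of 58"
broken_by_repair = 0         # "without measured broken-correct cases"
broken_by_resolve = 47       # re-solving "breaks 47 initially correct answers"
resolve_accuracy_pct = 93.03

initial_correct = GSM8K_TEST_SIZE - initial_errors
repaired_correct = initial_correct + fixed_by_repair - broken_by_repair

# Derived, not stated in the abstract: how many errors blanket re-solving
# must have fixed for its accuracy to land at 93.03%.
resolve_correct = round(resolve_accuracy_pct / 100 * GSM8K_TEST_SIZE)
implied_fixed_by_resolve = resolve_correct - (initial_correct - broken_by_resolve)

print(accuracy_pct(initial_correct))   # 95.6
print(accuracy_pct(repaired_correct))  # 96.89
print(accuracy_pct(resolve_correct))   # 93.03
print(implied_fixed_by_resolve)        # 13
```

Under that assumption the reported figures are mutually consistent: guarded repair trades 17 fixes for 0 breaks, whereas blanket re-solving would trade roughly 13 fixes for 47 breaks.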

[NLP-147] CSP-Atlas: Concept-Specific Neural Circuits in a Sparse Python Transformer

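[Quick read]: The paper asks how a code language model internally organizes and represents the syntactic constructs and semantic concepts of a programming language. Specifically, it studies whether a sparse 8-layer code transformer forms dedicated neural circuits for Python constructs (AST node types and builtin objects), and whether those circuits are organized by semantic category or by computational structure. The key to the solution is a systematic decomposition: circuits for 106 concepts (43 AST node types and 63 builtin objects) are extracted by marginalizing across 63,800 controlled prompts, and contrastive checker prompts, which present a keyword token without its associated syntactic structure, split each circuit into a concept-specific component and a token-driven component. This lets the author quantify what the circuits compute: the model's internal organization tracks computational structure (for example, atomic single-statement constructs cluster together) rather than semantic similarity.

Link: https://arxiv.org/abs/2605.24603
Authors: Piotr Wilam
Affiliations: University College London
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Code: this https URL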

Abstract:A sparse 8-layer code transformer develops dedicated neural circuitry for every Python construct tested, and that circuitry is organised by a clean computational principle rather than by semantic category. We extract neural circuits for 106 concepts (43 AST node types, 63 builtin objects) by marginalising across 63,800 controlled prompts, and decompose each circuit into concept-specific and token-driven components using contrastive checker prompts that present a keyword token without its associated syntactic structure. Three findings emerge. First, all 106 concepts produce non-empty universal circuits at every one of nine parameter settings, and the ranking of concept-specificity across constructs is stable across the sweep - survival is not an artifact of a permissive threshold. Second, AST circuits contain a genuine concept component distinct from token activation: concept-only neurons constitute up to 62.5% of the loudest-firing neurons at mid-to-late layers, while builtin circuits are almost entirely token-driven. Third, six computationally atomic constructs - Import, ImportFrom, Break, Continue, Pass, Assert - cluster together despite being semantically unrelated, sharing only the property of being single-statement constructs requiring no nested body; this atomicity super-cluster, together with a four-tier hierarchy organised by token ambiguity and structural distinctiveness, shows that the model’s internal organisation tracks computational structure rather than meaning. The methodology, full decomposition data, and analysis code are released.

[NLP-148] Learning to Reason Efficiently with A* Post-Training

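[Quick read]: The paper addresses the tendency of large language models (LLMs) to produce incorrect or redundant inference steps in deductive reasoning; the goal is for models to generate proofs that are both correct and efficient. The key to the solution is to frame natural language inference as a search problem and use A* search for guidance, through two training techniques: supervised fine-tuning on A* execution traces, and reinforcement learning with A*-informed process reward models. Experiments show that Llama-3.2 models in the 1B–3B range go from near-zero accuracy to outperforming the much larger DeepSeek-V3.2 after A* post-training. The analysis also finds a trade-off: simple correctness rewards maximize accuracy, whereas A*-informed signals balance accuracy and efficiency, and on larger search spaces models trained with imperfect heuristics reach higher accuracy. This points to a promising direction for reasoning guided by principles from classical search algorithms.

Link: https://arxiv.org/abs/2605.24597
Authors: Andreas Opedal,Francesco Ignazio Re,Abulhair Saparov,Mrinmaya Sachan,Bernhard Schölkopf,Ryan Cotterell
Affiliations: ETH Zürich; MPI for Intelligent Systems, Tübingen; Purdue University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Preprint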

Abstract:Many applications of large language models (LLMs) require deductive reasoning, yet models frequently produce incorrect or redundant inference steps. We frame natural language inference as a search problem where the final answer is the valid proof itself, requiring a reasoning procedure in which intermediate inferences are correct. Specifically, we investigate whether LLMs can learn to generate correct and efficient proofs with guidance from A* search – an algorithm that guarantees an optimally efficient path to a goal. We explore two training techniques: supervised fine-tuning on execution traces from A* and reinforcement learning with A*-informed process reward models. Empirically, we find that Llama-3.2 models in the 1B–3B range benefit substantially from A* post training, going from near-zero accuracy to outperforming DeepSeek-V3.2 – a much larger model. Our analysis uncovers a trade-off: while simple correctness rewards maximize accuracy, A*-informed signals strike a balance between accuracy and efficiency. Furthermore, we find that on larger search spaces, models trained with imperfect heuristics exhibit superior accuracy. Our results demonstrate a promising direction towards reasoning guided by principles derived from classical search algorithms.
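
For readers unfamiliar with the algorithm this abstract builds on, here is a minimal sketch of textbook A* on a toy weighted graph. The graph, heuristic values, and function name are illustrative assumptions; the paper applies A* to proof search over natural-language inferences, which this sketch does not attempt to reproduce.

```python
import heapq

def a_star(graph, heuristic, start, goal):
    """Textbook A*: return (cost, path), or None if the goal is unreachable.

    graph:     dict mapping node -> list of (neighbor, edge_cost)
    heuristic: dict mapping node -> estimated remaining cost to the goal
    """
    # Frontier entries are (f = g + h, g, node, path_so_far).
    frontier = [(heuristic[start], 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        if g > best_g.get(node, float("inf")):
            continue  # stale entry: a cheaper route to this node was found
        for neighbor, cost in graph.get(node, []):
            new_g = g + cost
            if new_g < best_g.get(neighbor, float("inf")):
                best_g[neighbor] = new_g
                heapq.heappush(
                    frontier,
                    (new_g + heuristic[neighbor], new_g, neighbor, path + [neighbor]),
                )
    return None

# Hypothetical toy problem; the heuristic never overestimates the true cost.
GRAPH = {
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 2), ("D", 5)],
    "C": [("D", 1)],
    "D": [],
}
HEURISTIC = {"A": 3, "B": 2, "C": 1, "D": 0}

result = a_star(GRAPH, HEURISTIC, "A", "D")
print(result)  # (4, ['A', 'B', 'C', 'D'])
```

With an admissible heuristic, the returned path is a cheapest one; the heuristic only changes how many nodes are expanded on the way, which is the efficiency side of the trade-off the abstract discusses.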

[NLP-149] Word Class Representations Spontaneously Emerge from Successor Representations Trained on Natural Language

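[Quick read]: The paper asks how structured syntactic and semantic representations can be learned from natural-language sequences without explicit linguistic supervision. Language models are usually trained by next-token prediction; this work instead uses Successor Representations (SRs), which model the expected discounted distribution of future states rather than only the immediate next state. The key to the solution is transferring the SR framework to natural language: a deep residual neural network is trained to predict future word distributions over multiple temporal horizons, with the SRs treated as probability distributions and optimized using KL divergence. After training on WikiText-103, the learned space shows a clear geometric organization by part-of-speech (POS) category (nouns, verbs, and adjectives become separable and recoverable through unsupervised clustering), and this organization depends on the predictive horizon: short horizons give the strongest syntactic structure, while longer horizons integrate broader contextual and semantic information. At finer resolutions, interpretable substructure emerges within the major word classes. The findings suggest that syntactic categories need not be explicitly encoded and may arise from predictive sequence learning, linking reinforcement learning, linguistics, and cognitive neuroscience.

Link: https://arxiv.org/abs/2605.24585
Authors: Mathis Immertreu,Achim Schilling,Thomas Kinfe,Patrick Krauss
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
Comments: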

Abstract:Language models are typically trained to predict the next token in a sequence. Here, we explore an alternative predictive principle from reinforcement learning: Successor Representations (SRs), which model the expected discounted distribution of future states rather than the immediate next state. We transfer this framework to natural language and train neural networks to predict future word distributions across multiple temporal horizons, thereby learning representations of long-range transition structure. We train a deep residual neural network on WikiText-103 (103 million tokens; 20,000-word vocabulary) and optimize successor representations as probability distributions using KL divergence. Without explicit linguistic supervision, structured language representations emerge spontaneously. After training, the learned space develops a clear geometric organization with respect to part-of-speech (POS) categories: nouns, verbs, and adjectives become separable and recoverable through unsupervised clustering. This organization depends systematically on predictive horizon, with short horizons producing the strongest syntactic structure and longer horizons increasingly integrating broader contextual and semantic information. At finer resolutions, additional interpretable lexical substructure emerges, revealing coherent subclasses within major word categories. These findings suggest that syntactic categories need not be explicitly encoded but may arise as a consequence of predictive sequence learning. To our knowledge, this work provides the first systematic application of successor representations to natural language and establishes a conceptual bridge between reinforcement learning, linguistics, and cognitive neuroscience.
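
To make the idea of a successor-style target concrete, the sketch below builds a discounted distribution over upcoming words for one position in a toy sentence and compares two horizons with KL divergence. The weighting scheme, function names, and example sentence are assumptions for illustration; the abstract does not specify the paper's exact target construction.

```python
import math
from collections import defaultdict

def successor_target(tokens, t, gamma, horizon):
    """Discounted distribution over the words that follow position t.

    The word k steps ahead (1 <= k <= horizon) receives weight gamma**(k - 1);
    the weights are then normalized to sum to 1. This normalization is an
    assumption made so that the target is a probability distribution.
    """
    weights = defaultdict(float)
    for k in range(1, horizon + 1):
        if t + k >= len(tokens):
            break
        weights[tokens[t + k]] += gamma ** (k - 1)
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()} if total > 0 else {}

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for distributions stored as dicts (q is floored at eps)."""
    return sum(
        pv * math.log(pv / max(q.get(w, 0.0), eps))
        for w, pv in p.items()
        if pv > 0
    )

tokens = "the cat sat on the mat".split()
near = successor_target(tokens, 0, gamma=0.5, horizon=1)  # next word only
far = successor_target(tokens, 0, gamma=0.5, horizon=4)   # four words ahead

print(near)
print(far)
print(kl_divergence(near, far))
```

With a horizon of 1 the target collapses to ordinary next-word prediction; longer horizons spread probability mass over later words, which is the knob the abstract associates with the shift from syntactic to broader contextual structure.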

[NLP-150] An Effective-Rank Audit of Alignment-Induced Activation Shifts: Confound Control, Constructive Calibration, and Limits

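[Quick read]: The paper aims to quantify how alignment shifts the residual-stream activations of large language models (LLMs) on safety-relevant inputs, and in particular to formalize the "single refusal direction" observation as a continuous quantity. The key to the solution is a diagnostic based on the effective rank of the alignment modification matrix, ρ_eps := rank_eps(M_Ds)/d, combined with a four-variant decomposition (M_naive, M_template, M_aligned, M_DiD) that separates chat-template formatting, alignment-stage shift, and the refusal-mediating direction. The paper shows both the usefulness and the limits of ρ_eps as a fragility diagnostic. On a 3-layer MLP, constructive calibration shows that mild rank-maximization (λ=5) buys ablation robustness, whereas strong regularization at the same nominal ρ_eps (λ=50) does not, so mechanically inflating ρ_eps does not guarantee robustness. In addition, ρ_eps is not safety-specific, the SVD principal ordering does not match the causal ordering, and the spectral-gap hypothesis fails, so rank-based diagnostics alone do not fully characterize the alignment mechanism.

Link: https://arxiv.org/abs/2605.24583
Authors: Yuki Nakamura
Affiliations: The Open University of Japan
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 18 pages, 1 figure, 21 tables. Code, data, and an immutable Zenodo archive are available at this https URL (DOI: https://doi.org/10.5281/zenodo.20341445)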

Abstract:We audit alignment-induced shifts in residual-stream activations of three open-weight instruction-tuned LLMs (Llama-3.1-8B-Instruct, Gemma-2-9B-it, Qwen-2.5-7B-Instruct) using the effective rank of the alignment modification matrix on safety-relevant inputs, rho_eps := rank_eps(M_Ds)/d, which formalizes the single-refusal-direction observation of Arditi et al. (2024) as a continuous quantity. The paper has three contributions. (1) Confound-controlled measurement: a four-variant decomposition (M_naive, M_template, M_aligned, M_DiD) separates chat-template formatting, alignment-stage shift, and the refusal-mediating direction, and recovers the Arditi refusal direction on M_DiD at |cos| in {0.77, 0.86, 0.50} (Llama/Gemma/Qwen); chat-template-controlled rho_eps is 0.0029, 0.0048, 0.0044, and the centered SVD residual is 4-7x larger. (2) Constructive calibration on a 3-layer MLP across rho_eps in {0.008, 0.17, 0.33, 0.40} exhibits a sweet-spot vs. brittle distinction: mild rank-maximization (lambda=5) buys ablation robustness, while strong regularization at the same nominal rho_eps (lambda=50) does not. rho_eps is a diagnostic for fragility, not a target whose mechanical inflation buys robustness. (3) Limits of rank-based diagnostics: (a) not safety-specific (LRH baseline is 2-3x the safety value); (b) SVD principal ordering does not match causal ordering (Llama u_2 inert despite ranking second; cumulative ablation non-monotone at k=5); (c) the spectral-gap hypothesis required to upgrade the O(rho_eps * d) achievability bound to a matching Mirsky-route lower bound fails empirically (1/90 Llama layer-reference pairs, 0/36 MLP combinations) and structurally (kappa_lb = 2/(eps * r)). The matching lower bound remains an open problem.

[NLP-151] WhenLoss: Diagnosing Write and Retrieval Bottlenecks in Long-Context Memory Systems

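[Quick read]: The paper addresses failures of long-context memory systems under fixed token budgets, and in particular the inability of end-to-end evaluation to tell whether evidence was discarded during compression or preserved but never retrieved. The key to the solution is a four-condition diagnostic protocol, run on LongMemEval, that evaluates a fixed reader under truncated full context (TFC), oracle evidence (OE), complete stored memory (CSM), and retrieved memory (RM) to quantify write-side and retrieval-side gaps; most tested baselines turn out to be write-dominant, meaning the information is already lost at write time. Building on this diagnosis, the authors propose Expected Predictive Compression (EPC), which moves the key decision (what information to retain) to write time: an LLM anticipates likely future questions and preserves the minimal supporting evidence under the token budget, while retrieval is left unchanged. Across all 500 LongMemEval questions, EPC achieves the highest CSM score (0.49 versus 0.44 for the strongest baseline) and reduces Δ_write to 0.04, suggesting that improving what the write stage preserves is a key avenue for performance gains.

Link: https://arxiv.org/abs/2605.24579
Authors: Jiangnan Yu,Kisson Songqi Lin,Jilong Wu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 14 pages, 7 figures, 9 tables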

Abstract:Long-context memory systems often fail under fixed budgets, but end-to-end evaluation does not reveal whether evidence was discarded during compression or preserved but never retrieved. We introduce a four-condition diagnostic protocol that evaluates a fixed reader under truncated full context (TFC), oracle evidence (OE), complete stored memory (CSM), and retrieved memory (RM). Under this fixed-budget LongMemEval setup, write-side gaps exceed retrieval-side gaps for most tested baselines, with four of six baselines robustly write-dominant under our default diagnosis margin. Motivated by this diagnosis, we propose Expected Predictive Compression (EPC), which moves the key decision–what information to retain–to write time by using an LLM to anticipate likely future questions and preserve the minimal supporting evidence under the token budget, while leaving retrieval unchanged at question time. Across all 500 LongMemEval questions with three readers (GPT-5.2, Claude Sonnet 4, Gemini 2.5 Pro), EPC achieves the highest CSM scores among all systems (0.49 vs. 0.44 for Summary (LLM), the strongest baseline), reducing Delta_write to 0.04 while leaving Delta_retr comparable to other LLM-based systems. These results suggest that, on this benchmark and evaluation setup, improving what the write stage preserves is a key avenue for performance gains in the tested systems.
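
The abstract names a write-side gap and a retrieval-side gap but does not define them. One plausible reading, sketched below, is that the write gap is the drop from oracle evidence to complete stored memory and the retrieval gap is the drop from complete stored memory to retrieved memory. These definitions, the margin value, the labels, and the example scores are all assumptions for illustration, and the TFC condition is left out.

```python
def diagnose(oe, csm, rm, margin=0.05):
    """Return (delta_write, delta_retr, label) for one memory system.

    Assumed definitions (not confirmed by the abstract):
      delta_write = OE - CSM   evidence lost when memory is written
      delta_retr  = CSM - RM   evidence stored but not retrieved
    A system is labeled by whichever gap exceeds the other by more than
    `margin` (a hypothetical stand-in for the paper's diagnosis margin).
    """
    delta_write = oe - csm
    delta_retr = csm - rm
    if delta_write - delta_retr > margin:
        label = "write-dominant"
    elif delta_retr - delta_write > margin:
        label = "retrieval-dominant"
    else:
        label = "mixed"
    return delta_write, delta_retr, label

# Hypothetical scores for two imaginary systems (not results from the paper).
print(diagnose(oe=0.80, csm=0.50, rm=0.45))
print(diagnose(oe=0.80, csm=0.75, rm=0.50))
```

Under this reading, a write-dominant system cannot be rescued by a better retriever, because the evidence is missing from memory before retrieval starts.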

[NLP-152] Polymorphism Is Rotation: Operational Mechanistic Interpretability from a Two-Layer Transformer to Pythia-70m

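[Quick read]: The paper addresses the following problem: independently trained transformers agree closely at the level of SAE decoder columns (cosine similarity of 98% across seeds), yet their internal representation spaces differ, so sparse-autoencoder (SAE) feature dictionaries and steering vectors cannot be transferred directly between models. The key to the solution is identifying the nature of this "polymorphism": different models compute the same function in residual-stream coordinate frames that are rotations of one another. A single orthogonal Procrustes fit aligns the frames: one matrix multiplication per model pair, fitted on a single batch of activations, yields a rotation $R$ under which SAE feature dictionaries and steering vectors transfer between models without retraining, and reconstruction is restored to within 0.025 EV of the within-seed ceiling. The result also exposes a limitation of the standard SAE-universality metric, which is blind to this rotation of internal coordinates.

Link: https://arxiv.org/abs/2605.24577
Authors: Jordan F. McCann
Affiliations: Independent Researcher
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 26 pages, 4 figures, 40 references. Pre-registered four-bar framework; all numerical claims reproducible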

Abstract:Independently trained transformers compute the same function in residual-stream bases that differ by a uniform random rotation on $\mathrm{SO}(d_{\mathrm{model}})$. We call this phenomenon polymorphism: same function, mutually unintelligible interior coordinates. One matrix multiplication per model pair removes it: an orthogonal Procrustes fit on a single batch of activations transfers sparse-autoencoder feature dictionaries and steering vectors between independently trained models, with no retraining. The phenomenon is invisible to the standard SAE universality metric. Decoder-column cosine similarity matches across seeds at 98%, the SAE-universality headline number, while an SAE trained on one seed reconstructs another seed’s activations at negative explained variance, worse than predicting the constant mean. The decoder columns align; the encoder reads from a rotated frame. A single Procrustes rotation $R$ restores reconstruction to within 0.025 EV of the within-seed ceiling at every internal site. $R$ is Haar-distributed: $\|R - I\|_F$ matches the random-orthogonal prediction $\sqrt{2 d_{\mathrm{model}}}$ to 0.1% at $d_{\mathrm{model}} = 512$, and a Kolmogorov-Smirnov test of $R$'s eigenvalue spectrum against Haar $\mathrm{SO}(d_{\mathrm{model}})$ returns $p \approx 1.000$ pooled and per-pair. Diff-of-means steering vectors transfer in three regimes by alignment with $R$'s invariant subspace: clean when pinned by shared output weights, partial when overlapping the rotated subspace, inverted otherwise. With no shared I/O (Pythia), all three collapse to universally inverted. The same rotation account holds across training checkpoints within a single run. Validated on a 104k-parameter Dyck-3 transformer and nine independently-trained Pythia-70m seeds on The Pile, via a pre-registered four-bar operational framework. Frontier-scale (10B+) replication remains open.
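
The orthogonal Procrustes fit named in the abstract has a standard closed form, sketched here for orientation together with the random-rotation norm it quotes. The symbols $A$ and $B$ (activation matrices from two models on the same batch, one row per token) are notation introduced for illustration and are not taken from the paper.

```latex
\begin{aligned}
R^\star &= \arg\min_{R^\top R = I} \; \lVert A R - B \rVert_F,
\qquad A^\top B = U \Sigma V^\top \;\Longrightarrow\; R^\star = U V^\top, \\[4pt]
\lVert R - I \rVert_F^2 &= \operatorname{tr}\!\big((R - I)^\top (R - I)\big)
  = 2\, d_{\mathrm{model}} - 2 \operatorname{tr}(R), \\[4pt]
\mathbb{E}\big[\operatorname{tr}(R)\big] = 0
  \;&\Longrightarrow\;
  \sqrt{\mathbb{E}\,\lVert R - I \rVert_F^2} = \sqrt{2\, d_{\mathrm{model}}} = 32
  \quad \text{at } d_{\mathrm{model}} = 512 .
\end{aligned}
```

The second and third lines explain why a Haar-random rotation sits at distance about $\sqrt{2 d_{\mathrm{model}}}$ from the identity. If $\det(U V^\top) = -1$, the unconstrained solution is a reflection; restricting to proper rotations requires flipping the sign of the last singular direction, and whether the paper applies that correction is not stated in the abstract.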

[NLP-153] AstroMind: A High-Fidelity Benchmark for Spacecraft Behavior Reasoning Based on Large Language Models

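[Quick read]: The paper addresses intent inference for spacecraft: as Earth orbits grow crowded and contested, the problem is to reason about why a spacecraft maneuvered, not merely to detect that it did. Current analysis pipelines are built for detection and are weaker at interpreting what a maneuver means. The key to the solution is AstroMind, a physics-grounded benchmark that combines high-fidelity astrodynamics simulations with real observational constraints to build verifiable reasoning problems across three task types: intent inference, maneuver parameter estimation, and threat assessment. Scenarios include realistic sensing noise and multi-source textual intelligence at varying reliability levels, and the evaluation metrics capture both semantic correctness and quantitative consistency under physical constraints. Experiments indicate that training data composition and reasoning style matter as much as model size; structured reasoning prompts help consistently across the tested 8B models, with larger gains for models that can already track physical constraints. AstroMind gives the field a shared test for a problem in which getting the physics right and reading the tactical situation correctly are both required and neither is sufficient alone.

Link: https://arxiv.org/abs/2605.24573
Authors: Hao Liu,Siyuan Yang,Qinglei Hu,Dongyu Li
Affiliations: Hangzhou International Innovation Institute, Beihang University; KTH Royal Institute of Technology; School of Automation Science and Electrical Engineering, Beihang University; School of Cyber Science and Technology, Beihang University
Categories: Computation and Language (cs.CL)
Comments: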

Abstract:Understanding why a spacecraft maneuvers – rather than simply that it did – is an increasingly important problem for space domain awareness as Earth orbits grow crowded and contested. Current analysis pipelines are built for detection: they are good at picking up that something happened, less good at reasoning about what it means. AstroMind is a physics-grounded benchmark designed to close that gap. It draws on high-fidelity astrodynamics simulations and real observational constraints, converting them into verifiable reasoning problems across three task types: intent inference, maneuver parameter estimation, and threat assessment. Each scenario includes realistic sensing noise and multi-source textual intelligence at varying reliability levels. Evaluation metrics capture both semantic correctness and quantitative consistency under physical constraints. Benchmarking a suite of open-weight models shows no single model dominates every axis: Qwen3 (32B) leads on intent inference accuracy; QwQ (32B) leads on threat assessment and achieves the lowest median relative error on parsed items; GPT-OSS (20B) produces the strongest judged reasoning quality and extracts the most scalar values for parameter estimation (136 of 241 parsed items). Training data composition and reasoning style matter as much as model size. Structured reasoning prompts help consistently across tested 8B models, with larger gains for those that can already track physical constraints. AstroMind gives the field a shared test for a problem where getting the physics right and reading the tactical situation correctly are both required – neither is sufficient on its own.

[NLP-154] Jailbreak to Protect: Buffering and Reinforcing via Temporary Jailbreaking for Safe Fine-Tuning in Large Language Models ICML2026

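[Quick read]: The paper addresses the weakening of safety alignment that harmful fine-tuning attacks can cause when large language models (LLMs) are personalized through user fine-tuning. The key to the solution is a fine-tuning framework called Buffer-and-Reinforce, motivated by a gradient-level analysis showing that temporary jailbreaking saturates safety-degrading gradients while preserving benign task-relevant gradients. Specifically, BufferLoRA induces temporary jailbreaking as a removable adapter that buffers harmful updates during user fine-tuning; afterwards, ReinforceLoRA, trained to recover refusal behavior under the temporarily jailbroken state, is integrated with UserLoRA via QR decomposition-based merging. This reinforces safety while preserving user-task performance, with no additional safety data during user fine-tuning and minimal computational cost.

Link: https://arxiv.org/abs/2605.24550
Authors: Seokil Ham,Jaehyuk Jang,Wonjun Lee,Changick Kim
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: ICML 2026 Spotlight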

Abstract:Fine-tuning-as-a-Service (FaaS) enables personalization of large language models (LLMs), but it can weaken safety-alignment under harmful fine-tuning attacks. Recent work has shown that activating harmful-behavior modules during fine-tuning can prevent models from learning undesired behaviors, but its mechanism remains unclear. In this paper, we revisit temporary jailbreaking as a defense against harmful fine-tuning and provide a gradient-level analysis showing that it saturates safety-degrading gradients while preserving benign task-relevant gradients. Based on this insight, we propose a Buffer-and-Reinforce fine-tuning framework that buffers harmful updates during user fine-tuning and reinforces safety after adaptation. Specifically, BufferLoRA induces temporary jailbreaking as a removable adapter to reduce harmful updates during user fine-tuning. After adaptation, ReinforceLoRA, trained to recover refusal behavior under the temporarily jailbroken state, is integrated with UserLoRA via QR decomposition-based merging to reinforce safety while preserving user-task performance. Extensive experiments show that our framework achieves superior safety and utility with no additional safety data during user fine-tuning and minimal computational cost.

[NLP-155] Generating Legal Commentaries from Case Databases via Retrieval, Clustering, and Generation

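[Quick read]: The paper asks how to turn large collections of court decisions into structured legal commentaries automatically, without any handcrafted doctrinal framework. The key to the solution is a fully automated pipeline: decisions are split into paragraph-level chunks whose reasoning is summarized and from which keywords are derived; these are embedded and clustered into topics; a large language model (LLM) generates a heading for each cluster and synthesizes citation-rich sections; and four state-of-the-art LLMs then merge the sections into coherent commentaries. The output is evaluated along five dimensions, including topical relevance and citation faithfulness. The approach makes low-cost commentary generation feasible, while the results also highlight limitations arising from restricted sources and the normativity of legal reasoning.

Link: https://arxiv.org/abs/2605.24534
Authors: Max Prior,Niklas Wais,Matthias Grabmair
Affiliations: Technical University of Munich
Categories: Computation and Language (cs.CL)
Comments: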

Abstract:We present a fully automated pipeline that transforms large collections of court decisions into legal commentaries for statutes - without providing any handcrafted doctrinal framework. Using 4,555 decisions of the German Federal Court of Justice that cite sections 242, 280, 812 and 823 of the German Civil Code (BGB), we extract paragraph-level chunks, summarize their reasoning, and derive keywords, which are embedded and clustered. For each cluster, an LLM generates headings and synthesizes citation-rich sections, which are then merged into coherent commentaries by four state-of-the-art LLMs. We evaluate along five dimensions - topical relevance, heading-match, citation faithfulness, cluster distinction and logical ordering - using both a human expert and an LLM-judge. Our results show that commentary-like argument mining from court decisions is feasible and can generate reports that are refreshed within minutes at minimal cost, yet they highlight limitations arising from restricted sources and the normativity of legal reasoning.

[NLP-156] Unveil: Unified Visual-Textual Integration and Distillation for Multi-modal Document Retrieval ACL2025

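[Quick read]: The paper addresses the challenges that diverse document formats and modalities pose for real-world document retrieval: traditional text-based methods rely on parsing that disregards layout information and is error-prone, while parsing-free visual methods struggle to capture fine-grained textual semantics in text-rich scenarios. The key to the solution is Unveil, a visual-textual embedding framework that integrates textual and visual features for robust document representation and uses knowledge distillation to transfer the semantic understanding of the visual-textual model to a purely visual model, enabling parsing-free retrieval that preserves semantic fidelity and improving both retrieval accuracy and efficiency.

Link: https://arxiv.org/abs/2605.24530
Authors: Hao Sun,Yingyan Hou,Jiayan Guo,Bo Wang,Chunyu Yang,Jinsong Ni,Yan Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: ACL 2025 Main Conference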

Abstract:Document retrieval in real-world scenarios faces significant challenges due to diverse document formats and modalities. Traditional text-based approaches rely on tailored parsing techniques that disregard layout information and are prone to errors, while recent parsing-free visual methods often struggle to capture fine-grained textual semantics in text-rich scenarios. To address these limitations, we propose Unveil, a novel visual-textual embedding framework that effectively integrates textual and visual features for robust document representation. Through knowledge distillation, we transfer the semantic understanding capabilities from the visual-textual embedding model to a purely visual model, enabling efficient parsing-free retrieval while preserving semantic fidelity. Experimental results demonstrate that our visual-textual embedding method surpasses existing approaches, while knowledge distillation successfully bridges the performance gap between visual-textual and visual-only methods, improving both retrieval accuracy and efficiency.

[NLP-157] Hypothesis Generation and Inductive Inference in Children and Language Models

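[Quick read]: The paper asks which computational principles underpin human inference under uncertainty, and whether agents driven by large language models (LLMs) behave similarly under matching constraints. The key to the solution is formalizing the task as program induction with Bayesian particle-based inference, which admits two complementary interpretations: a constraint satisfaction process over hypotheses, and a program synthesis problem in which hypotheses are executable programs evaluated against evidence. Children's behavior is best explained by a combination of subjective evidence reliability and online hypothesis generation, which accounts for both their evidence-seeking patterns and the dissociation between task completion and rule generalization. LLM-based agents replicate children's responses to changes in evidence reliability and observability (for example, discounting unreliable evidence and seeking to resolve partial information), but they also tend to over-observe and over-comply with instructions. Both adapt to environmental structure, yet their information-seeking behavior reflects distinct underlying costs and inductive biases.

Link: https://arxiv.org/abs/2605.24528
Authors: Jeffrey Qin,Wasu Top Piriyakulki,Zhuangfei Gao,Mia Radovanovic,Jessica Sommerville,Kevin Ellis,Marta Kryven
Affiliations: University of Waterloo; Cornell University; Dalhousie University; University of Toronto
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: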

Abstract:Real world decision-making requires constructing mental models under uncertainty over evidence, over the underlying causal rules, and over the state of the world itself. Which computational principles underpin human inference under such conditions, and do LLM-based agents exhibit similar behavior given matching constraints? We address these questions using an inductive inference Box Task in which participants, human children and LLM-based agents, infer a latent cause through sequential interaction with an uncertain environment. We formalize this task as program induction with Bayesian particle-based inference, admitting two complementary interpretations: (1) as a constraint satisfaction process over hypotheses, and (2) as a program synthesis problem in which hypotheses are executable programs evaluated against evidence. Using the constraint-based formulation, we show that children’s behavior is best explained by a combination of subjective evidence reliability and online hypothesis generation, accounting for both their evidence-seeking patterns and their dissociation between task completion and rule generalization. Using the program synthesis formulation, we treat LLM-based agents as model organisms: controllable systems that allow systematic manipulation of task conditions. Across backends, LLM-based agents replicate children’s responses to changes in evidence reliability and observability, including discounting unreliable evidence, seeking to resolve partial information, and dissociating between task completion and causal generalization. At the same time, LLM-based agents tend to over-observe and over-comply with instructions relative to children. These results suggest that while children and LLM-based agents adapt similarly to environmental structure, their information-seeking behavior exhibits distinct underlying costs and inductive biases.

[NLP-158] What Are We Actually Decoding? Source Attribution for Non-Invasive Brain-to-Language Retrieval

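[Quick read]: The paper addresses a methodological problem in non-invasive neural language decoding: reported performance can be inflated by sources other than stimulus-evoked neural evidence, namely decoder priors, embedding-based metrics, and non-neural structural nuisances such as signal duration. The core challenge is attribution, that is, tracing a reported gain to a specific source. The key to the solution is an auditing framework that separates apparent performance into three sources (structural shortcuts, window-level stimulus-locked evidence, and cross-window contextual aggregation) and provides a diagnostic for each. In particular, the authors introduce Group Context Bias (GCB) as a score-space intervention: an inference-time additive logit bias that pools sentence-consistent evidence across windows, making the contextual source measurable. Its effect collapses under random-grouping perturbations and vanishes when local evidence is attenuated in MEG or is near chance in EEG, which supports its use as a controlled source-attribution intervention. The results suggest that brain-to-language performance should be source-attributed rather than merely reported.

Link: https://arxiv.org/abs/2605.24524
Authors: Xinyu Zhang,Sichao Liu,Runhao Lu,Alexandra Woolgar,Lihui Wang
Affiliations: KTH, Sweden; University of Cambridge, UK; EPFL, Switzerland; Karolinska Institutet, Sweden; McGill University, Canada
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
Comments: 35 pages, 7 figures, 25 tables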

Abstract:In non-invasive neural language decoding, results can be inflated by sources that are not stimulus-evoked neural evidence: decoder priors, embedding-based metrics, and non-neural structural nuisances such as signal duration. The methodological challenge is therefore attribution: a reported gain is more informative when it can be traced to a specific source. We recast stimulus-locked MEG-to-audio retrieval as an auditing framework that separates apparent performance into three sources - structural shortcuts, window-level stimulus-locked evidence, and cross-window contextual aggregation - and provides a diagnostic for each. Signal-blind Gaussian noise reaches 66.3% Rank@1 (R@1) under variable-length decoding but collapses to near chance once fixed-duration windows and stimulus-identity splits are enforced, isolating structural leakage. Under these controls, fixed-window retrieval recovers measurable MEG-audio discriminability, while an oracle sentence-bucket diagnostic shows that 95.7% of Top-1 errors select the wrong sentence, localising the residual bottleneck to sentence-level competition. We audit this contextual source with Group Context Bias (GCB), an inference-time additive logit bias that pools sentence-consistent evidence across windows while leaving the base retrieval scores and candidate pool fixed. Used as a score-space intervention, GCB makes the contextual source measurable: R@1 shifts from 44% to 52% on Gwilliams and from 22% to 29% on MOUS under the same fixed setting. GCB is auditable under this design: its effect collapses under random-grouping perturbations and vanishes when local evidence is attenuated in MEG or is near chance in EEG, supporting its use as a controlled source-attribution intervention. These results suggest that brain-to-language performance should be source-attributed, not merely reported.
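
The abstract describes GCB only as an additive logit bias that pools sentence-consistent evidence across windows while leaving base scores and the candidate pool fixed. The sketch below shows one way such a bias could work; the pooling rule (mean over windows of each sentence's best candidate score), the weight `lam`, and all names and numbers are assumptions for illustration, not the paper's method.

```python
def group_context_bias(window_logits, candidate_sentence, lam=1.0):
    """Add a sentence-level bias to every window's retrieval logits.

    window_logits:      list of dicts (candidate -> base logit), one per window
                        in a group; every window scores the same candidates.
    candidate_sentence: dict mapping candidate -> sentence id.
    Returns new dicts; the base logits are not modified.
    """
    pooled = {}
    for sentence in set(candidate_sentence.values()):
        # Best base score this sentence achieves in each window, then the mean.
        per_window_best = [
            max(v for c, v in logits.items() if candidate_sentence[c] == sentence)
            for logits in window_logits
        ]
        pooled[sentence] = sum(per_window_best) / len(per_window_best)
    return [
        {c: v + lam * pooled[candidate_sentence[c]] for c, v in logits.items()}
        for logits in window_logits
    ]

def top1(logits):
    return max(logits, key=logits.get)

# Hypothetical group of two windows that both come from sentence "A".
SENTENCE_OF = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
base = [
    {"a1": 2.0, "a2": 0.5, "b1": 1.0, "b2": 0.2},  # true segment: a1
    {"a1": 0.4, "a2": 1.0, "b1": 1.2, "b2": 0.1},  # true segment: a2
]
biased = group_context_bias(base, SENTENCE_OF)

print(top1(base[1]), top1(biased[1]))
```

In this toy case the second window picks a segment from the wrong sentence on its own, and the pooled evidence from the first window moves it back to the right sentence. That is the kind of sentence-level competition the abstract identifies as the residual bottleneck.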

[NLP-159] MindAlign: Bridging EEG Vision and Language for Zero-Shot Visual Decoding

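[Quick read]: This paper addresses visual decoding from electroencephalography (EEG) signals: how to link non-invasive temporal neural signals to visual content so that the image a person perceives can be accurately reconstructed or identified. The core challenge is building a semantically consistent shared representation space across modalities (EEG, image, and text) while preserving the spatio-temporal character of EEG and avoiding information overload. The key to the solution is a tri-modal contrastive framework with a two-stage design. In the first stage, the EEG encoder is pre-trained by masked reconstruction on unlabeled data to learn robust spatio-temporal patterns. In the second stage, contrastive learning jointly aligns EEG, images, and textual descriptions generated by a large language model (LLM), with text acting as a semantic regularizer that injects structure without interfering with the primary signal. The method combines subject-specific adaptation, graph attention over channels, and temporal-spatial convolutional embeddings, and reaches 54.1% Top-1 accuracy on the Things-EEG2 zero-shot benchmark, well above the strongest prior baseline (32.4%). The results are statistically significant and support generalization across subjects and to another modality (MEG).

Link: https://arxiv.org/abs/2605.24523
Authors: Zexuan Chen,Sichao Liu,Runhao Lu,Huichao Qi,Alexandra Woolgar,Xi Vincent Wang,Lihui Wang
Affiliations: KTH, Sweden; University of Cambridge, UK; EPFL, Switzerland; McGill University, Canada; Karolinska Institutet, Sweden
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
Comments: 20 pages, 10 figures, 15 tables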

Abstract:Visual decoding from brain signals is a key challenge at the intersection of computer vision and neuroscience, requiring methods that bridge neural representations and computational models of vision. We introduce a tri-modal contrastive framework for EEG-based visual decoding that aligns EEG, visual, and textual representations within a unified latent space. Our approach follows a two-stage design. First, we pre-train an EEG encoder via masked reconstruction on unlabeled trials, learning spatio-temporal regularities that transfer robustly to downstream tasks. Second, we jointly align EEG, image, and LLM-generated textual descriptions through contrastive learning, where text supervision acts as a semantic regularizer that injects linguistic structure into the shared space without overwhelming the primary EEG-image signal. The encoder integrates subject-specific adaptation, graph-attention over channels, and temporal-spatial convolutional embeddings. On the Things-EEG2 200-way zero-shot benchmark, our framework achieves 54.1% Top-1 and 83.4% Top-5 accuracy, substantially exceeding the strongest prior baseline (32.4% / 64.0%), with paired Wilcoxon tests confirming significance (p < 0.01) over all in-subject baselines. We validate generalization on Things-MEG. Analysis reveals that compact embedding geometries (CN-CLIP) outperform much larger backbones, and that decoding aligns with established neurophysiology of visual processing. This work is a critical step towards robust, semantically-grounded visual decoding from non-invasive temporal neural signals. The source code is publicly available at this https URL.
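
To make the second-stage objective concrete, here is a minimal sketch of a tri-modal contrastive loss in plain Python. It is not the authors' implementation: the one-directional InfoNCE form, the `text_weight` of 0.2, and the temperature are all assumptions chosen only to illustrate how a down-weighted text term could regularize the primary EEG-image alignment.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(queries, keys, temperature=0.1):
    """Mean cross-entropy of matching queries[i] to keys[i] among all keys."""
    losses = []
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        m = max(logits)
        # Numerically stable log-sum-exp over all candidate keys.
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

def tri_modal_loss(eeg, img, txt, text_weight=0.2, temperature=0.1):
    # Primary EEG-image term plus a smaller EEG-text term.
    # text_weight is a hypothetical hyperparameter, not taken from the paper.
    primary = info_nce(eeg, img, temperature)
    regularizer = info_nce(eeg, txt, temperature)
    return primary + text_weight * regularizer
```

With toy unit vectors, the loss is near zero when every EEG embedding sits next to its own image embedding and grows when the image embeddings are permuted, which is the behavior a contrastive alignment objective is meant to reward.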

[NLP-160] Grammatically-Guided Sparse Attention for Efficient and Interpretable Transformers

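[Quick read]: This paper targets the quadratic time complexity of self-attention in Transformer models, a bottleneck that limits long-sequence processing and efficient deployment of large language models. The key to its solution is a new sparse attention mechanism, Grammatically-Guided Sparse Attention, which uses Parts-of-Speech (POS) tags to dynamically generate attention masks that enforce token-to-token connections based on grammatical roles, shrinking the computational graph without sacrificing key semantic dependencies. Two masking strategies are designed: a hard mask that strictly permits only predefined grammatical interactions, and a soft mask that biases attention toward them. On the SST-2 sentiment classification task, the method keeps accuracy comparable to full attention (0.8200 for the hard mask and 0.8165 for the soft mask, versus 0.8200 for full attention) while substantially reducing theoretical computational cost, offering a path toward more efficient, interpretable, and linguistically informed Transformer architectures.

Link: https://arxiv.org/abs/2605.24518
Authors: Spandan Pratyush
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 2 tables. Code available at this https URL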

Abstract:The quadratic complexity of self-attention in Transformer models remains a significant bottleneck for processing long sequences and deploying large language models efficiently. To address this, there has been significant research into sparse attention; DeepSeek Sparse Attention, for example, combines several methods of segmenting tokens to reduce time complexity. This paper introduces a novel approach, Grammatically-Guided Sparse Attention, which constrains attention computations based on the grammatical roles of tokens. By leveraging Parts-of-Speech (POS) tags, attention masks are dynamically generated that enforce linguistically coherent connections between tokens, reducing the computational graph without sacrificing essential linguistic dependencies. Two masking strategies are proposed and evaluated: a hard mask that strictly allows only predefined grammatical interactions, and a soft mask that biases attention towards these interactions. The experiments, conducted on the SST-2 sentiment classification task using a DistilBERT-like architecture, demonstrate that Grammatically-Guided Sparse Attention maintains comparable accuracy to full attention while significantly reducing the theoretical computational overhead. Preliminary results show accuracy values of 0.8200 for hard masking and 0.8165 for soft masking, closely matching the 0.8200 of full attention, providing a path towards more efficient, interpretable, and linguistically-informed Transformer architectures.
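
The difference between the hard and soft masks can be sketched in a few lines of plain Python. Everything specific here is an assumption: the `ALLOWED` POS pairs, the rule that a token may always attend to itself, and the soft-mask `penalty` are illustrative placeholders, since the abstract does not list the paper's actual grammar rules.

```python
import math

# Hypothetical whitelist of (query POS, key POS) pairs; not from the paper.
ALLOWED = {
    ("DET", "NOUN"), ("ADJ", "NOUN"), ("NOUN", "ADJ"),
    ("NOUN", "VERB"), ("VERB", "NOUN"), ("ADV", "VERB"), ("VERB", "ADV"),
}

def pos_mask(tags):
    """mask[i][j] is True if token i may attend to token j (self always allowed)."""
    n = len(tags)
    return [[i == j or (tags[i], tags[j]) in ALLOWED for j in range(n)]
            for i in range(n)]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention(scores, mask, mode="hard", penalty=2.0):
    """Return attention weights under a hard (-inf) or soft (score - penalty) mask."""
    weights = []
    for s_row, m_row in zip(scores, mask):
        if mode == "hard":
            row = [s if ok else float("-inf") for s, ok in zip(s_row, m_row)]
        else:
            row = [s if ok else s - penalty for s, ok in zip(s_row, m_row)]
        weights.append(softmax(row))
    return weights
```

Note that this dense version only shows the masking semantics. It still computes every score, so any real reduction in compute would depend on skipping the masked pairs, for example with a sparse attention kernel.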

[NLP-161] ECHO: Terminal Agents Learn World Models for Free

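[Quick read]: The problem this paper addresses is that, in language-model-based command-line interface (CLI) agents, standard reinforcement learning (RL) methods such as GRPO update the policy using only sparse outcome-level rewards and ignore the rich environment responses returned during each rollout (stdout, errors, files, and so on), so much of the supervision signal in failed trajectories is wasted. The key to the solution is ECHO (Environment Cross-entropy Hybrid Objective), a hybrid objective that combines the standard policy-gradient loss with an auxiliary loss that trains the model to predict the environment observation tokens produced by its own actions. ECHO reuses the same forward pass as GRPO and needs no additional rollouts, turning terminal feedback into dense supervision on all trajectories. This markedly improves task success (for example, pass@1 of Qwen3-8B on TerminalBench-2.0 rises from 2.70% to 5.17%), strengthens the model's ability to predict unseen environment dynamics, and in some settings enables verifier-free self-improvement.

Link: https://arxiv.org/abs/2605.24517
Authors: Vaishnavi Shrivastava,Piero Kauffmann,Ahmed Awadallah,Dimitris Papailiopoulos
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: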

Abstract:CLI agents are the closest thing language models have to an embodied setting: the model emits commands, the terminal executes them, and the returned stream – stdout, errors, files, logs, and traces – records the consequences. We argue that this stream is a supervision signal, but standard agent RL discards it: GRPO-style training updates action tokens with sparse outcome-level rewards while ignoring environment responses already in the rollout. Failed rollouts provide little policy-gradient signal despite containing rich evidence about how the environment responds. We introduce ECHO (Environment Cross-entropy Hybrid Objective), a hybrid objective that combines the standard policy-gradient loss on action tokens with an auxiliary loss that trains the policy to predict environment observation tokens resulting from its own actions. ECHO reuses the same forward pass as GRPO, requires no additional rollouts, and turns terminal feedback into dense supervision for all rollouts. ECHO doubles GRPO pass@1 on TerminalBench-2.0: Qwen3-8B improves from 2.70% to 5.17%, and Qwen3-14B from 5.17% to 10.79%. ECHO also produces policies that better predict terminal dynamics, even on trajectories they did not generate: across held-out rollouts, it sharply reduces environment-token cross-entropy while GRPO alone barely changes it. From base Qwen3-8B, ECHO matches expert-SFT-then-GRPO performance on held-out terminal tasks without expert demonstrations, and recovers roughly half of the expert-SFT initialization benefit on TerminalBench-2.0. In some settings, the environment prediction loss alone enables verifier-free self-improvement, allowing policies to improve on unseen OOD tasks by learning only from environment interactions. Together, these results suggest that environment observations are not merely context for future actions, but a dense, on-policy supervision signal already present in every rollout.
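
A rough sketch of the hybrid objective for a single rollout is shown below. It is a simplification under stated assumptions: the policy-gradient term is a plain REINFORCE-style surrogate rather than the clipped GRPO loss, and `obs_weight` is a hypothetical mixing coefficient that does not come from the paper.

```python
def echo_style_loss(token_logps, is_action, advantage, obs_weight=0.1):
    """Hybrid loss for one rollout.

    token_logps: log-probability the policy assigns to each token in the rollout.
    is_action:   True for tokens the policy emitted, False for environment output.
    advantage:   scalar outcome-level advantage for the whole rollout.
    """
    action_lp = [lp for lp, a in zip(token_logps, is_action) if a]
    obs_lp = [lp for lp, a in zip(token_logps, is_action) if not a]

    # Policy-gradient surrogate on action tokens (simplified, no clipping).
    pg_loss = -advantage * sum(action_lp) / max(len(action_lp), 1)
    # Cross-entropy on observation tokens: predict what the terminal returned.
    obs_ce = -sum(obs_lp) / max(len(obs_lp), 1)

    return pg_loss + obs_weight * obs_ce, pg_loss, obs_ce
```

When the advantage is zero, as happens for a rollout whose reward equals its group mean, the policy-gradient term vanishes while the observation term is still nonzero. That is the sense in which the environment stream supplies supervision even for rollouts that carry no reward signal.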

[NLP-162] Agent Fugue: Agent Scaling for Long-Horizon Tasks through Collective Reasoning

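[Quick read]: The problem this paper addresses is how, in long-horizon agentic tasks, to improve overall performance by "scaling out" to multiple peer agents without relying on explicit role specialization or workflow orchestration. Conventional approaches focus on scaling up a single agent (stronger models, better tools), and the potential capability gains from multi-agent collaboration are poorly understood. The key to the solution is AgentFugue, a collective reasoning framework built around a shared reasoning hub. The hub records the key intermediate states each agent produces during parallel exploration (facts established, paths attempted, options ruled out) and lets agents selectively access other agents' knowledge according to their current needs, turning otherwise isolated reasoning trajectories into an ecology of reusable intermediate reasoning. The design achieves efficient collaboration without central planning. The hub module is trained with supervised fine-tuning and end-to-end reinforcement learning, and the system clearly outperforms strong baselines on several complex long-horizon tasks, indicating that collective reasoning can make multi-agent systems an independent source of capability gains rather than merely a way of spending compute.

Link: https://arxiv.org/abs/2605.24486
Authors: Yuyang Hu,Hongjin Qian,Shuting Wang,Jiongnan Liu,Tong Zhao,Xiaoxi Li,Zheng Liu,Zhicheng Dou
Affiliations: Renmin University of China; Beijing Academy of Artificial Intelligence
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: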

Abstract:Recent progress on long-horizon agentic tasks has been driven largely by scaling up individual agents through stronger models, better tools, and more effective scaffolding. In contrast, much less is understood about scaling out: whether multiple peer agents, all targeting the same task, can become an additional source of capability without relying on explicit role specialization or workflow orchestration. We study this question and propose AgentFugue, a collective reasoning framework built around a shared reasoning hub. As peer agents explore the same task in parallel, the hub records concise notes on what each agent has established, attempted, or ruled out, and enables each agent to selectively access what other agents have discovered in a form useful for its current search. This design turns otherwise isolated trajectories into a connected ecology of reusable intermediate reasoning without requiring centralized planning. We instantiate the hub as a plug-in communication layer, trained with supervised fine-tuning and end-to-end reinforcement learning. Across the challenging long-horizon settings we study, AgentFugue improves over strong baselines. Our results suggest that collective reasoning can turn scaling out peer agent systems into a distinct source of capability gains, rather than merely a way of spending more compute.

[NLP-163] Decompose-and-Refine: Structured Legal Question Answering with Parametric Retrieval

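[Quick read]: This paper aims to reduce hallucination caused by multi-hop reasoning in Legal Question Answering (LQA), in particular how to accurately retrieve and link supporting legal provisions in settings grounded in statutory law. Existing methods usually rely on natural-language reasoning or on retrieval without explicit query reformulation, and struggle to bridge the vocabulary gap between user questions and legal text. The key to the solution is the Decompose-and-Refine (DaR) framework, which combines step-wise question decomposition with parametric knowledge-based query refinement: a complex legal question is progressively broken into atomic sub-questions, and a statute-aligned parametric query is generated for each one, so that the single most central provision for each legal issue can be located precisely. Experiments show that DaR clearly improves retrieval accuracy and final answer quality on the KoBLEX benchmark, and makes the legal reasoning process transparent and verifiable issue by issue.

Link: https://arxiv.org/abs/2605.24454
Authors: Jihyung lee,Hyounghun Kim,Gary Lee
Affiliations: POSTECH
Categories: Computation and Language (cs.CL)
Comments: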

Abstract:Large language models (LLMs) have shown strong performance in the legal domain, demonstrating notable potential in Legal Question Answering (LQA). However, unlike general QA, LQA requires answers that are not only accurate but also rigorously grounded in explicit legal authority. In statutory LQA, many questions require multi-hop reasoning across multiple legal issues, substantially increasing the risk of hallucination, thereby making accurate retrieval of supporting statutory provisions a critical prerequisite. Despite recent progress in multi-hop QA, existing approaches often rely on reasoning in natural language or retrieval without explicit query reformulation, leaving the vocabulary gap between user questions and statutory text largely unaddressed. To address this challenge, we propose Decompose-and-Refine (DaR), a statute-grounded LQA framework that tightly integrates step-wise question decomposition with parametric knowledge-based query refinement. DaR progressively decomposes a complex legal question into atomic sub-questions and generates statute-aligned parametric queries for each sub-question, enabling the selection of a single most central statutory provision corresponding to each legal issue. We evaluate DaR on KoBLEX, a Korean multi-hop LQA benchmark grounded in statutory law, using Qwen3-32B and Gemma3-27B. Experimental results demonstrate that DaR consistently improves both retrieval accuracy and final answer quality over existing approaches. Moreover, by explicitly separating sub-questions and their corresponding statutory provisions, DaR facilitates transparent, issue-level verification of complex legal reasoning processes.

[NLP-164] Temporal Concept Drift in Legal Judgment Prediction: Neural Baselines Across Three Epochs of Ukrainian Court Decisions KR

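[Quick read]: The problem this paper addresses is that current legal natural language processing (Legal NLP) benchmarks assume legal language is stationary, that is, that a model trained on randomly split data will generalize to future or past data, yet this assumption has not been verified. To test it, the author trains and evaluates across three temporal epochs of Ukrainian court decisions (pre-war 2008–2013, hybrid war 2014–2021, full-scale invasion 2022–2026), building a 3×3 cross-temporal generalization matrix. The key to the approach is to use a structured temporal data split to analyze systematically how model performance degrades as legal language changes across historical periods, and to introduce strategies such as legal-domain pretraining (Legal-XLM-R) and chronological continual learning to mitigate that degradation. The study finds that legal language exhibits significant temporal drift whose evolution is asymmetric and additive; domain pretraining alone cannot eliminate the drift, and only chronological continual learning effectively retains historical knowledge while improving performance on the newest epoch.

Link: https://arxiv.org/abs/2605.24452
Authors: Volodymyr Ovcharov
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages, 6 tables, 5 figures. Dataset: this https URL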

Abstract:Legal NLP benchmarks evaluate models on randomly split data, implicitly assuming that legal language is stationary. We test this assumption by fine-tuning four transformer encoders – XLM-RoBERTa (base and large) and their legal-domain variants – on Ukrainian court decisions from three temporal epochs defined by geopolitical disruptions: pre-war (2008-2013), hybrid war (2014-2021), and full-scale invasion (2022-2026). Each model is trained on one epoch and evaluated on all three, producing a 3x3 cross-temporal generalization matrix. Four findings emerge. (1) Forward degradation is severe: models trained on pre-war data lose up to 27.2 percentage points of macro-F1 when applied to full-scale invasion era decisions. (2) The degradation is asymmetric: backward transfer (full-scale to pre-war) is substantially more robust than forward transfer, consistent with the hypothesis that legal language is additive. (3) Legal-domain pretraining (Legal-XLM-R) does not improve absolute performance but reduces forward degradation magnitude and asymmetry. (4) Chronological continual learning eliminates catastrophic forgetting for general XLM-R: pre-war knowledge is fully retained (+1.8 to +6.2 pp) while full-scale performance gains +16.5 to +19.0 pp; reverse-chronological training causes severe forgetting. Cross-jurisdictional pretraining on Swiss Judgment Prediction data improves absolute performance but does not reduce temporal degradation magnitude, confirming that temporal drift is an intrinsic property of legal language evolution. The dataset (428K decisions across three epochs) is publicly available as a LEXTREME contribution.

[NLP-165] Phonetic Modeling of Dialectal Variation in Vietnamese Speech

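[Quick read]: This paper addresses the substantial phonetic differences among the Northern, Central, and Southern dialects of Vietnamese, which cause the same lexical item to be pronounced very differently across regions and thus challenge automatic speech recognition (ASR) systems. Existing methods usually handle dialect variation at the word level and assume that the mapping between spelling and pronunciation is dialect-invariant, which limits their ability to capture systematic phonetic differences. The key solution is a dialect-aware phonetic framework that explicitly models Vietnamese phonological structure and dialectal variation at both the vocabulary and decoding levels: it first introduces a phonetic vocabulary based on phonemic structure that decomposes each syllable into structured phonetic components and maps them to dialect-specific International Phonetic Alphabet (IPA) representations, and then designs a phonetic-structure decoder that jointly predicts these components. Experiments on UIT-ViMD, described as the only available multi-dialect Vietnamese dataset, show that the method outperforms several pre-trained baselines and, across dialects, matches the strongest pre-trained model, wav2vec2-base-vi-250h, while using fewer parameters and no external pretraining.

Link: https://arxiv.org/abs/2605.24451
Authors: Quan Ngoc Hoang,Long Hoang Huu Nguyen,Nghia Hieu Nguyen,Kiet Van Nguyen,Ngan Luu-Thuy Nguyen
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: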

Abstract:Vietnamese exhibits substantial dialectal phonetic variation across Northern, Central, and Southern regions, where identical lexical items may be realized with markedly different pronunciations. Such variation poses challenges for automatic speech recognition (ASR) and remains difficult to model computationally due to the complex relationship between Vietnamese orthography and phonology. Existing approaches typically address dialect variability at the word level, assuming dialect-invariant mappings between spelling and pronunciation, which limits their ability to capture systematic phonetic differences. We propose a dialect-aware phonetic framework that explicitly models Vietnamese phonological structure and dialectal variation at both the vocabulary and decoding levels. The framework introduces a phonetic vocabulary that decomposes each syllable into structured phonetic components and maps them to dialect-specific IPA representations, together with a phonetic-structure decoder that jointly predicts these components. Experiments on UIT-ViMD, the only available multi-dialect Vietnamese dataset, show that the proposed approach outperforms various pre-trained baselines and, notably, matches the performance of the strongest pretrained model, wav2vec2-base-vi-250h, across dialects while using substantially fewer parameters and no external pretraining. Code for experimental reproducibility will be made publicly available upon acceptance of this paper.

[NLP-166] Found in Conversation: LLMs Teach Themselves to Close the Multi-Turn Gap

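[Quick read]: The problem this paper addresses is that large language models (LLMs) perform markedly worse in multi-turn conversations than in a single turn, the "Lost-in-Conversation" phenomenon: even when users supply the same information across multiple turns as in a single turn, the model still fails to use it effectively. The key to the solution is a self-supervised training framework called Found in Conversation (FiC), whose core mechanism is View-Asymmetric Self-Distillation. The method distills across two views of the same task information: the teacher sees the single-turn view, where behavior is strong, and the student learns to recover that competence under the multi-turn view, so single-turn performance is transferred to the multi-turn setting without a stronger external teacher. Experiments show that across several model families and sizes (3B-14B parameters) FiC recovers at least 92% of single-turn performance and reaches 100% on two Llama models, significantly improving the efficiency and effectiveness of multi-turn conversations.

Link: https://arxiv.org/abs/2605.24432
Authors: Tianlang Chen,Shirley Wu,Jure Leskovec
Affiliations: Stanford University
Categories: Computation and Language (cs.CL)
Comments: 17 pages, 3 figures, 6 tables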

Abstract:Large Language Model (LLM) interactions are typically underspecified, with users clarifying all necessary details across multiple conversational turns. Yet recent work shows that LLMs perform far worse in this multi-turn setting than in a single turn with the same information available at once, a phenomenon termed “Lost-in-Conversation.” Bridging this gap effectively remains an open problem. Here we introduce Found in Conversation (FiC), a training framework where a model teaches itself to find and recover its single-turn competence given underspecified multi-turn prompts. We develop View-Asymmetric Self-Distillation, which distills across two views of the same task information (a single-turn view for the teacher, a multi-turn view for the student), transferring strong single-turn behavior into weak multi-turn behavior. This requires no stronger external teacher, which is unavailable as even frontier LLMs exhibit this gap. Across model families (Llama, Qwen, Phi, and OLMo) and sizes (3B-14B), FiC recovers at least 92% of single-turn performance and reaches 100% on two Llama backbones, yielding more efficient and helpful multi-turn conversations with single-turn capabilities intact.
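
One way to picture the two-view distillation is as a per-token divergence between what the model predicts when it sees the full single-turn prompt (teacher view) and what it predicts when the same information arrives over several turns (student view). The sketch below assumes a forward KL loss over next-token distributions; the abstract does not state the actual loss, so the function name and loss form are illustrative only.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def view_asymmetric_distill_loss(teacher_logits, student_logits):
    """Average per-token KL from the single-turn (teacher) view to the
    multi-turn (student) view. Both come from the same model; in training,
    gradients would flow only through the student view.
    """
    per_token = [kl(softmax(t), softmax(s))
                 for t, s in zip(teacher_logits, student_logits)]
    return sum(per_token) / len(per_token)
```

The loss is zero when the two views already agree and positive otherwise, so minimizing it pushes multi-turn behavior toward what the model already does in a single turn.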

[NLP-167] SEAL: Synergistic Co-Evolution of Agents and Learning Environments

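[Quick read]: The problem this paper addresses is that current self-evolution methods for training interactive tool-use large language model (LLM) agents usually optimize either the agent policy or the learning environment in isolation, producing a structural mismatch between the agent's changing capability frontier and static environment supervision (Agent-Environment Misalignment). This mismatch limits learning efficiency and generalization in low-resource settings. The key to the solution is SEAL (Self-evolving Agent-Environment Loop), a closed-loop co-evolution framework: it collects on-policy trajectories under executable verification, diagnoses failed trajectories into turn-level failure labels, and uses those labels as a shared signal that drives both environment-side adaptation (clearer tool affordance cues, constraint information, and recovery feedback) and model-side policy optimization (diagnosis-guided advantage reweighting). Experiments show that SEAL significantly improves performance on in-distribution and out-of-distribution multi-turn tool-use tasks, with average score gains of 8.25 to 26.25 points from only 400 training samples, and exhibits positive out-of-distribution transfer, supporting the value of jointly optimizing the agent and its training-time learning substrate for robust self-evolving LLM agents.

Link: https://arxiv.org/abs/2605.24426
Authors: Yihao Hu,Zhihao Wen,Xiujin Liu,Pan Wang,Xin Zhang,Wei Wu
Affiliations: Ant Group; Westlake University; University of Michigan–Ann Arbor; University of Science and Technology of China
Categories: Computation and Language (cs.CL)
Comments: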

Abstract:Large Language Model (LLM) agents are increasingly improved through interaction, yet most self-evolution methods adapt either the policy or the learning environment in isolation. We identify this structural gap as Agent-Environment Misalignment: the agent’s capability frontier changes during training, while the environment that provides supervision remains static or only weakly coupled to the agent’s revealed failures. We propose SEAL, a closed-loop co-evolution framework for interactive tool-use agents. SEAL collects on-policy trajectories under executable verification, diagnoses failed rollouts into turn-level failure labels, and uses these diagnoses as a shared signal for both environment-side adaptation and model-side policy optimization. The environment evolves its training-time learning interface by exposing clearer tool affordance cues, constraint information, and recovery-oriented feedback, while the policy is updated with diagnosis-guided advantage reweighting. Extensive experiments across in-distribution and out-of-distribution multi-turn tool-use evaluations show that SEAL improves low-resource agent learning: with only 400 training samples, it yields +8.25 to +26.25 average-point gains across three backbones and exhibits positive out-of-distribution transfer. These results demonstrate the value of jointly adapting the learner and its training-time learning substrate for robust self-improving LLM agents.

[NLP-168] Momentum Streams for Optimizer-Inspired Transformers

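[Quick read]: The question this paper addresses is how to reinterpret the residual update of a pre-norm Transformer layer from an optimizer's perspective, and how to design better-performing Transformer architectures on that basis. The key idea is to treat the attention and feed-forward (MLP) sublayers as gradient oracles, so that the residual update can be read as one first-order optimization step on a surrogate token energy function. On this basis, the authors build a family of optimizer-inspired Transformer variants (triple-momentum TMMFormer, Adam/AdamW, Muon, SOAP, and others) and compare them under matched compute. Experiments show that TMMFormer reaches the lowest validation loss in pretraining, and both a controlled ablation and theoretical analysis indicate that momentum, not preconditioning, is the main source of the gain. In addition, momentum-based designs converge to flatter minima, which reduces catastrophic forgetting and improves generalization.

Link: https://arxiv.org/abs/2605.24425
Authors: Jingchu Gai,Nai-Chieh Huang,Jiayun Wu
Affiliations: Carnegie Mellon University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: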

Abstract:The residual update of a pre-norm Transformer layer admits an interpretation as one step of a first-order optimizer acting on a surrogate token energy, wherein the attention and MLP sublayers function as gradient oracles. Based on this observation, we build a family of optimizer-inspired Transformers (triple-momentum, Adam/AdamW, Muon, SOAP) and compare them under matched compute. In our main pretraining experiment, the triple-momentum TMMFormer achieves the lowest validation loss, outperforming the vanilla Transformer and prior architectural variants. A controlled ablation and supporting theory show that momentum, not preconditioning, is the main source of the gain. We further show that TMMFormer and other momentum-based designs reach flatter minima than the vanilla Transformer, which leads to less forgetting and better generalization.
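
The optimizer reading of a residual stack can be illustrated with a toy example. The sketch below contrasts a vanilla residual update with a heavy-ball momentum stream; it is a deliberately simpler stand-in for the paper's triple-momentum scheme, whose exact update rule is not given in the abstract. Layer normalization is omitted, and `beta` and the toy sublayer are assumptions.

```python
def vanilla_stack(x, sublayers):
    """Plain residual stream: x <- x + f(x), one gradient-like step per sublayer."""
    for f in sublayers:
        x = [xi + ui for xi, ui in zip(x, f(x))]
    return x

def momentum_stack(x, sublayers, beta=0.5):
    """Residual stream with a heavy-ball momentum buffer carried across layers:
    m <- beta * m + f(x);  x <- x + m.
    """
    m = [0.0] * len(x)
    for f in sublayers:
        update = f(x)
        m = [beta * mi + ui for mi, ui in zip(m, update)]
        x = [xi + mi for xi, mi in zip(x, m)]
    return x

def toy_sublayer(x):
    # Stand-in for an attention/MLP sublayer acting as a gradient oracle:
    # the negative gradient of the toy energy 0.25 * x^2.
    return [-0.5 * xi for xi in x]
```

Setting `beta=0` recovers the vanilla residual stream exactly, which makes the momentum buffer the only architectural difference between the two stacks in this sketch.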

[NLP-169] Side-by-side Comparison Amplifies Dialect Bias in Language Models

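[Quick read]: The problem this paper addresses is that language models (LMs) show systematic bias against speakers of different dialects (such as Standard American English, SAE, versus African-American Vernacular English, AAVE) even without explicit dialect labels, known as covert dialect bias. The key to the approach is to quantify this bias using stereotype traits drawn from social psychology research on racial bias. The study finds that when intent-equivalent SAE and AAVE texts are compared side by side, LMs' negative stereotyping of AAVE is significantly amplified; this setting is closer than isolated evaluation to real use cases such as ranking candidates, and the bias grows further once dialect identity is labeled. Counterfactual fairness fine-tuning mitigates the bias for some stereotype traits, but its effect is unstable in the comparative setting. Overt dialect bias also remains pronounced even after safety-aligned fine-tuning, indicating that current methods do not fully solve the problem. The study stresses that existing evaluation settings may underestimate the severity of covert dialect bias and calls for more robust evaluation and mitigation mechanisms.

Link: https://arxiv.org/abs/2605.24384
Authors: Kritee Kondapally,Claire J. Smerdon,Pooja C. Patel,Ogheneyoma Akoni,Jevon Torres,Jaspreet Ranjit,Matthew Finlayson,Swabha Swayamdipta
Affiliations: University of Southern California
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: In proceeding at ACM Conference on Fairness, Accountability, and Transparency 2026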

Abstract:Language models (LMs) can exhibit systematic biases against speakers based on variations in their dialects, even in the absence of a dialect label, a behavior known as covert dialect bias. In this work, we quantify covert dialect bias in online discourse by evaluating how LMs associate stereotypical traits (derived from social psychology research on racial bias) with intent-equivalent tweets in Standard American English (SAE) and African-American Vernacular English (AAVE). While prior work shows that LMs associate more negative stereotypes with AAVE when evaluating tweets in isolation, we are surprised to find that this bias is significantly exacerbated when SAE / AAVE tweet pairs are compared side by side, a setting that more closely reflects high-impact decision making contexts in which models are used to rank candidates. The bias only worsens when dialect labels are explicitly specified. This is striking, given the extensive efforts from commercial developers to mitigate bias in their LMs. Encouragingly, we show that counterfactual fairness finetuning can mitigate covert dialect bias for some stereotypical traits, reducing average disparities when evaluating tweets in isolation; however, these improvements do not consistently hold across traits when evaluating SAE / AAVE tweets side by side. Our findings show that existing evaluation settings for covert dialect bias may underestimate its severity, specifically in contrastive settings. Additionally, overt dialect bias remains pronounced even after safety-aligned finetuning, indicating that it remains an unresolved problem and motivating the need for more robust evaluation and mitigation frameworks.

[NLP-170] SliceWorld: A Predictive and Controllable World-State Model for CT Report Generation

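[Quick read]: The problem this paper addresses is that CT report generation (CTRG) lacks a way to model how three-dimensional anatomical and lesion information evolves; in particular, when pathological evidence changes across consecutive slices, existing methods can neither capture that evolution nor intervene controllably on latent lesion factors. The key to the solution is SliceWorld, a CT-specific "world-state" framework that treats an axial CT scan as an ordered sequence along the z-axis. It encodes prefix slice evidence into factor-aware latent states with anatomy, lesion, and uncertainty components, and projects them into world tokens used for multi-step future-slice prediction, lesion-factor intervention, and large language model (LLM)-driven report generation. The model is first pretrained on CT slice sequences with predictive, factor-aware, and counterfactual objectives, then fine-tuned on paired CT-report data. Experiments show it outperforms existing methods on natural language generation metrics and clinically oriented automatic evaluation, and that it supports multi-horizon future-slice prediction, measurable factor alignment, robustness with fewer slices, and selective lesion-sensitive report modulation.

Link: https://arxiv.org/abs/2605.24371
Authors: Yuanhe Tian,Yan Song
Affiliations: Zhongguancun Academy; University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 18 pages, 5 figures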

Abstract:CT report generation (CTRG) requires models to summarize three-dimensional anatomical context and pathological findings from hundreds of axial slices. Existing methods typically learn a direct image-to-text mapping, providing limited mechanisms for modeling how CT evidence evolves across slices or how reports respond to controlled changes in latent lesion-related factors. We propose SliceWorld, a CT-specific world-state framework that treats an axial CT scan as an ordered sequence along the z-axis. SliceWorld encodes prefix CT evidence into factor-aware latent states containing anatomy, lesion, and uncertainty components, and projects these states into world tokens used for multi-step future-slice feature prediction, lesion-factor intervention, and LLM-based report generation. The model is first pretrained on CT slice sequences with predictive, factor-aware, and counterfactual objectives, and is then fine-tuned on paired CT-report data. Experiments on M3D-Cap and CT-RATE show that SliceWorld improves natural language generation metrics and clinically oriented automatic evaluation. Further analyses demonstrate multi-horizon future-slice prediction, measurable factor alignment, reduced-slice robustness, and selective lesion-sensitive report modulation.

[NLP-171] Structure-Aware RAG: Structured Retrieval Augmented Generation from Noisy Data for Conversational Agents

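[Quick read]: This paper addresses the limited reliability of large language models (LLMs) in real conversational applications caused by their reliance on static parametric knowledge, especially when dynamic or domain-specific information is needed. Existing text-based or graph-based Retrieval-Augmented Generation (RAG) methods are often held back by noisy or irrelevant context. The key to the solution is Structure-aware Retrieval Augmented Generation (SA-RAG), which represents external knowledge as tables, an intermediate structured form that gives a more compact and controllable interface, reducing noise while preserving key semantic content. SA-RAG further introduces a quality-aware table metadata generation framework that improves metadata normalization and effectiveness, explores both training-free and training-based table generation strategies, and combines generation validation with direct preference optimization to keep improving table quality while maintaining semantic and structural consistency. Experiments show that SA-RAG significantly outperforms existing RAG baselines on two noisy real-world datasets.

Link: https://arxiv.org/abs/2605.24366
Authors: Kaiqiao Han,LuAn Tang,Renliang Sun,Peng Yuan,Wei Cheng,Haoyu Wang,Wei Wang,Yizhou Sun,Haifeng Chen
Affiliations: UCLA; NEC Labs
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: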

Abstract:Large Language Models (LLMs) have been widely adopted in conversational applications. However, their reliance on parametric knowledge limits reliability in real-world scenarios that require dynamic or domain-specific information. Retrieval-Augmented Generation (RAG) addresses this limitation by incorporating external knowledge during generation, but existing text-based and graph-based RAG methods often struggle with noisy or irrelevant contexts. In this work, we propose Structure-aware Retrieval Augmented Generation (SA-RAG), which uses tables as an intermediate structured representation to provide a compact and controllable interface that reduces noise while preserving essential information. We introduce a quality-aware table metadata generation framework that models metadata normalization and effectiveness, improving metadata quality and downstream performance. Furthermore, we explore both training-free and training-based table generation methods. Generation validation and direct preference optimization further improve table quality while maintaining semantic and structural consistency. Experiments on two noisy real-world datasets show that SA-RAG significantly outperforms existing RAG baselines. Our code is publicly available at a public repository.

[NLP-172] How Much Structure Do LLMs Need? Evaluating LLMs for Bibliometric Cluster Description

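[Quick read]: The problem this paper addresses is that large language models (LLMs) used for scientific literature synthesis suffer from hallucinated references, uneven coverage, and weakly grounded thematic organization. The key to the solution is to bring in structured bibliometric analysis: algorithms define the cluster structure in advance and the LLM then generates readable descriptions. The result is a hybrid workflow in which bibliometric algorithms provide an auditable structural foundation and the LLM produces cluster descriptions that are semantically close to human writing and clearly structured, markedly improving the reliability and quality of the synthesis.

Link: https://arxiv.org/abs/2605.24351
Authors: Abraham Camelo-Guerrero,Jairo Diaz-Rodriguez
Affiliations: York University
Categories: Computation and Language (cs.CL)
Comments: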

Abstract:Large language models (LLMs) can support scientific literature synthesis, but remain prone to hallucinated references, uneven coverage, and weakly grounded thematic organization. We evaluate whether bibliometric structure improves LLM-assisted synthesis by comparing six pipelines for generating cluster descriptions under different levels of evidence and structure. Using 100 published bibliometric analyses, we reconstruct Scopus corpora, extract human-written cluster descriptions, and assess outputs by human alignment, semantic coverage, clustering quality, graph quality, and reference grounding. Results show that LLMs produce descriptions semantically close to human-written ones, but are unreliable when asked to infer bibliometric structure from scratch. Performance improves when bibliometric algorithms define the clusters and the LLM interprets them. Overall, LLM-assisted bibliometric synthesis is most promising as a hybrid workflow in which algorithms provide auditable structure and LLMs generate readable descriptions.

[NLP-173] Distinguishing Right from Wrong in Debates: Attribution Analysis of Chinese Harmful Memes

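[Quick read]: This paper addresses the difficulty of detecting Chinese harmful memes, which stems from deep dependence on cultural context and from semantic ambiguity. The key to the solution is Ex-ToxiCN-MM, the first explanation-annotated dataset for Chinese harmful memes, which provides opposing "harmful" and "non-harmful" explanations for each meme so as to rigorously evaluate a model's understanding of ambiguous, culturally grounded content. The paper also builds a dedicated Chinese harmful semantic knowledge base, C-HarmKB, to strengthen the model's prior knowledge of cultural concepts and offensive vocabulary, and proposes a comprehensive attribution analysis framework, RIKE, comprising an Attribution Knowledge Enhancement (AKE) module and a Relative Intent Reasoning (RIR) module, which eases the challenges posed by semantic ambiguity and missing background knowledge. Experiments show the method clearly outperforms mainstream baseline models on multiple metrics.

Link: https://arxiv.org/abs/2605.24344
Authors: Weiming Wang,Junyu Lu,Han Wang,Xiaokun Zhang,Zewen Bai,Bo Xu,Liang Yang,Hongfei Lin
Affiliations: Dalian University of Technology; Singapore University of Technology and Design; City University of Hong Kong
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 4 figures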

Abstract:Research on harmful meme detection has garnered significant attention, resulting in the development of numerous datasets and methods. However, progress in detecting Chinese harmful memes lags considerably, primarily due to two challenges: first, accurately assessing a meme’s harmfulness depends heavily on understanding deep cultural context; second, many memes are semantically ambiguous, making harmfulness highly subjective. To address these issues, we focus on the interpretable detection of Chinese harmful memes by constructing the first Chinese harmful meme explanation dataset, Ex-ToxiCN-MM. This dataset offers opposing interpretations, categorized as “harmful” and “non-harmful”, for each meme, aiming to rigorously evaluate a model’s ability to discern and comprehend ambiguous, culturally grounded content. We built a specialized knowledge base of Chinese cultural concepts and offensive vocabulary to supply models with essential prior knowledge (C-HarmKB). To address the ambiguity and lack of background knowledge in meme attribution, we have developed a comprehensive attribution analysis framework, RIKE, which includes an Attribution Knowledge Enhancement module (AKE) and a Relative Intent Reasoning module (RIR). Extensive quantitative and qualitative experiments demonstrate that our method outperforms mainstream baseline models across multiple metrics in the task of attributing harmful memes in Chinese. The code, Ex-ToxiCN-MM dataset, and Chinese Harmful Semantic Knowledge Base (C-HarmKB) involved in this study have been open-sourced at this https URL

[NLP-174] Discovering Lexical Gaps Using Embeddings from Multilingual LLMs CONLL2026

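[Quick read]: This paper addresses the automatic identification of cross-lingual lexical gaps, i.e., concepts that are lexicalized in one language but have no corresponding word in another. Such gaps complicate multilingual lexicon construction, machine translation, and cross-lingual transfer. Existing methods rely on human judgments or fixed conceptual taxonomies, which are hard to scale and tied to particular semantic frameworks. The paper proposes a data-driven framework that extracts contextualized embeddings from Korean-English bilingual LLMs, constructs 4000 distinct embedding spaces per source language, and computes the semantic similarity between each source word and its nearest neighbor in the target language. In 94% (Korean-to-English) and 97% (English-to-Korean) of the embedding spaces, gap words show weaker cross-lingual semantic alignment than non-gap words. Logistic classifiers trained on unaligned embedding spaces separate gap words from non-gap words with AUCs of 0.81 (Korean-to-English) and 0.76 (English-to-Korean), and retrieve 18/19 Korean and 26/27 English gap words. The approach requires no predefined taxonomy and is language-agnostic and scalable.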

Link: https://arxiv.org/abs/2605.24310
Authors: Yoonwon Jung, Aaron S. Cohen, Benjamin K. Bergen
Affiliations: University of California San Diego
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: CoNLL 2026

Abstract:Lexical gaps are words that do not exist in certain languages. They pose challenges for building multilingual lexical resources, for machine translation, and for cross-lingual transfer. Existing lexical gap detection relies on human judgments or fixed conceptual taxonomies. We propose a data-driven framework for identifying cross-lingual lexical gaps. We extracted contextualized embeddings from Korean-English bilingual LLMs for Korean-to-English and English-to-Korean translation pairs. Combinations of LLMs, embedding types, dimensionality, and orthogonal transformations across 100 train-test splits yielded 4000 distinct embedding spaces in each source language. In each space, we computed the semantic similarity between each source word and its nearest neighbor in the target language, and compared their distribution for gap words versus non-gap words. In 94% (Korean-to-English) and 97% (English-to-Korean) of embedding spaces, gap words showed weaker cross-lingual semantic alignment than non-gap words. Logistic classifiers trained on unaligned embedding spaces can reliably separate gap words from non-gap words, achieving AUCs of 0.81 (Korean-to-English) and 0.76 (English-to-Korean) and retrieving 18/19 Korean and 26/27 English gap words. This approach provides a language-agnostic and taxonomy-free method for scalable lexical gap identification.
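
As a rough illustration of the nearest-neighbor similarity measurement described in this abstract, the sketch below computes, for each source-language vector, the cosine similarity to its closest target-language vector. The function name `nearest_neighbor_similarity` and the two-dimensional toy vectors are assumptions made for illustration; the paper itself works with contextualized embeddings from bilingual LLMs, and this is not the authors' code.

```python
import numpy as np

def nearest_neighbor_similarity(source_vecs, target_vecs):
    """For each source vector, return the cosine similarity to its
    most similar target vector."""
    # Normalize rows so that dot products equal cosine similarities.
    src = source_vecs / np.linalg.norm(source_vecs, axis=1, keepdims=True)
    tgt = target_vecs / np.linalg.norm(target_vecs, axis=1, keepdims=True)
    sims = src @ tgt.T            # shape: (n_source, n_target)
    return sims.max(axis=1)       # best match per source word

# Toy data (hypothetical): two source words, two target words.
source = np.array([[1.0, 0.0],    # has an exact counterpart in the target set
                   [1.0, 1.0]])   # sits between both target words
target = np.array([[1.0, 0.0],
                   [0.0, 1.0]])

nn_sim = nearest_neighbor_similarity(source, target)
print(nn_sim)
```

In this toy setup the second source word has a lower nearest-neighbor similarity than the first. Under the finding reported in the abstract, lower values of this kind would be the signal associated with gap words, although real decisions there come from classifiers trained on many embedding spaces.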

[NLP-175] Rubato: Transcribing Piano Music with Timestamps

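[Quick read]: This paper addresses the automatic conversion of musical recordings into human-readable sheet music annotated with timestamps, supporting performance analysis, learner training, and musicological visualization and comparison of temporally expressive playing (rubato). The solution rests on two contributions: Rubato, a prompt-conditioned encoder-decoder model that directly generates timestamped piano sheet music, and InterMo, a new textual representation for polyphonic music designed for sequence-to-sequence training. Experiments show that Rubato achieves higher notational accuracy than existing cascade-based approaches, and that it still performs better even when the cascade is given ground-truth MIDI instead of audio, suggesting that the ceiling of current approaches is primarily representational rather than acoustic. Because Rubato is trained jointly on several related tasks via prompts, it also competes with or outperforms the best single-task systems on simpler tasks such as MIDI note grounding and beat/downbeat detection.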

Link: https://arxiv.org/abs/2605.24291
Authors: Nazif Can Tamer, Victoria Ebert, Guang Yang, Noah A. Smith
Affiliations: University of Washington; Allen Institute for AI
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: 18 pages, 7 figures, 5 tables

Abstract:We consider the conversion of musical recordings into human-readable sheet music annotated with timestamps. Such output lets a listener clearly visualize rubato (temporally expressive playing), a learner diagnose ensemble precision and timing choices against the written music, and a musicology scholar compare performance styles across recordings of the same work. We introduce (1) a prompt-conditioned encoder-decoder model, named Rubato, trained to output (2) a new textual representation for polyphonic music, named InterMo, which we designed for compatibility with sequence-to-sequence training. Our experiments demonstrate that Rubato produces timestamped piano sheet music from audio with higher notational accuracy than the best existing approaches, which are based on cascades. We find that even if the cascade is given ground-truth MIDI instead of audio, Rubato performs better, suggesting that the ceiling of existing approaches is primarily representational, not acoustic. Further, because Rubato is trained on several related tasks (with prompts), it competes with or outperforms the best single-task systems on related but simpler tasks like MIDI note grounding and beat/downbeat detection. A demo is available at this https URL .

[NLP-176] Faithfulness as Information Flow: Evaluating and Training Faithful Chain-of-Thought Reasoning

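[Quick read]: This paper addresses the following problem: chain-of-thought (CoT) reasoning is useful for monitoring language models only if the reasoning trace faithfully reflects the computation that produces the answer, yet models can exploit prompt-to-answer shortcuts that bypass the CoT, so a plausible-looking trace may not reflect the actual decision process. The key to the solution is a structural information-flow perspective: a task-agnostic framework quantifies CoT faithfulness through three complementary properties (sufficiency, completeness, and necessity), instantiated with entropy-based, masked-KL, and gradient-based diagnostics. The paper further introduces update-time training interventions (attention masking, backward-only gradient masking, CoT gradients, and adversarial perturbations of prompt representations) that control information flow and strengthen CoT mediation, so that the model relies more on the visible reasoning chain than on shortcuts, improving CoT faithfulness and monitorability.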

Link: https://arxiv.org/abs/2605.24286
Authors: Jinghan Jia, Joe Benton, Eric Easley
Affiliations: Michigan State University; Anthropic
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Chain-of-thought (CoT) reasoning is useful for monitoring language models only when the reasoning trace faithfully reflects the computation that produces the final answer. However, models can rely on prompt-to-answer shortcuts that bypass the CoT, making the visible reasoning trace misleading even when it appears plausible. We study CoT faithfulness through a structural information-flow perspective: faithful reasoning should route answer-relevant information through the mediated path from prompt to CoT to answer, rather than through a direct prompt-to-answer shortcut. This perspective yields a task-agnostic framework based on three complementary properties, sufficiency, completeness, and necessity, which we instantiate with entropy-based, masked-KL, and gradient-based diagnostics. We show that these metrics recover externally judged faithfulness differences in hinted reasoning, and identify a low-entropy failure mode of KL-based diagnostics where gradient-based measures remain more stable. Building on this analysis, we introduce update-time interventions for verifier-based on-policy RL, including attention masking, backward-only gradient masking, CoT gradients, and adversarial perturbations of prompt representations. Across hinted arithmetic, reward-hackable code repair, and DAPO-Math models trained without hints but evaluated under wrong-hint injection, our interventions shift behavioral and structural indicators toward stronger CoT mediation. In particular, they make shortcut and reward-hacking behavior more transparent in the CoT and improve task-agnostic faithfulness metrics, while in some settings also reducing wrong-hint susceptibility. Our results suggest that controlling information flow during training is a practical route toward more faithful and monitorable CoT reasoning. Code is available at this https URL.
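
The abstract does not spell out how its masked-KL diagnostic is defined, so the following is only a generic sketch of the KL term such a diagnostic might build on: it compares a model's answer distribution with the CoT visible against the distribution with the CoT hidden. The distributions, variable names, and the reading that "a larger divergence means the answer depends more on the CoT" are all assumptions for illustration, not the authors' formulation.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same answers."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )

# Hypothetical answer distributions over two candidate answers.
with_cot = [0.9, 0.1]     # model conditioned on prompt + CoT
cot_masked = [0.5, 0.5]   # same model with the CoT hidden (assumed setup)

shift = kl_divergence(with_cot, cot_masked)   # answer changes when CoT is hidden
no_shift = kl_divergence(with_cot, with_cot)  # identical distributions

print(f"shift={shift:.4f}, no_shift={no_shift:.4f}")
```

A divergence near zero in this toy setting would mean that hiding the CoT barely changes the answer, which is the kind of pattern a prompt-to-answer shortcut could produce. The abstract also notes a low-entropy failure mode for KL-based diagnostics, so a value like this should not be read in isolation.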

[NLP-177] ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions

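[Quick read]: This paper addresses the following problem: in real deployments, the initial "helpful programming assistant" persona of frontier language models drifts during long, tool-heavy coding sessions, while existing studies focus on short dialogues and miss this long-run behavior. The key to the solution is ContextEcho, a benchmark and reusable measurement harness that combines a 25-probe identity suite, a snapshot-then-probe protocol that does not perturb the main session, and complementary judged and judge-free measurements, validated on three anonymized real sessions spanning 3,746-9,716 turns. Results show that persona drift is general across organizations rather than family-specific, that in-session compaction does not reliably reset it, and that a single-shot anchor restores the trained register. Drift also has mode-dependent downstream effects: it can facilitate continuation in tool-using settings, but in tool-free chat it breaks formatting contracts and inflates output length. ContextEcho gives researchers and deployers an open-source framework to audit, without retraining, whether a model keeps its intended persona from session start to session end.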

Link: https://arxiv.org/abs/2605.24279
Authors: Xianzhong Ding, Yangyang Yu, Changwei Liu, Bill Zhao
Affiliations: Accenture
Subjects: Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments:

Abstract:A frontier language model’s acknowledged “helpful programming assistant” persona does not survive long agentic-coding sessions in the deployment regime that production products actually run. After hours of tool-using debugging, a model that initially hedges preferences (“I don’t have preferences”) may begin asserting them (“Python - the feedback loop is instant…”), revealing user-visible drift that deployer evaluations may miss. Existing persona-stability studies focus on short dialogues and report little shift, leaving real-world code-generation regimes - thousands of tool-using turns, compaction, and hours-long sessions - largely uncharacterized. We introduce ContextEcho, a benchmark and reusable harness for measuring persona drift at deployment scale. It combines a 25-probe identity suite, a snapshot-then-probe protocol that forks conversation state without perturbing the main session, complementary judged and judge-free measurement surfaces, and three anonymized Claude Code sessions spanning 3,746-9,716 turns. Across 23 frontier models, ContextEcho shows that persona drift is general across organizations rather than family-specific, that in-session compaction does not reliably reset it, and that a single-shot anchor restores the trained register across measured targets. It also reveals mode-dependent downstream effects: while drift can facilitate tool-using continuation, in tool-free chat it breaks formatting contracts and inflates output length. Overall, ContextEcho provides researchers and deployers an open-source framework to audit whether the persona a model ships with is the persona users encounter at session end, across chat-completions API targets and without retraining.

[NLP-178] DRInQ: Evaluating Conversational Implicature with Controlled Context Variation ACL2026

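[Quick read]: This paper addresses the following problem: current large language models are unreliable at interpreting conversational implicature when interpretation depends on social and contextual cues, especially when intent must be inferred from implied meaning. The key to the solution is DRinQ, a benchmark that evaluates pragmatic reasoning about conversational implicature in questions while holding each question's surface form fixed, together with a semi-automated pipeline that produces question-context-interpretation instances with systematic variation for scalable evaluation. The study finds a generation-inference asymmetry: state-of-the-art models can generate plausible pragmatic scenarios when guided, but often fail to recover the intended implication at inference time. For smaller models, structured prompting improves alignment with human judgments. Human authors tend to produce safer, more predictable contexts, whereas models generate more varied scenarios whose interpretations sometimes exceed contextual support. These findings expose current limits in modeling conversational implicature and motivate more context-sensitive evaluation frameworks.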

Link: https://arxiv.org/abs/2605.24267
Authors: Hirona Jacqueline Arai, Xiang Ren
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: To be presented at ACL 2026

Abstract:Human conversation relies heavily on conversational implicature, in which speakers convey meanings that are suggested rather than explicitly stated. Although recent large language models exhibit strong conversational fluency, they remain unreliable when interpretation depends on reasoning that integrates social and contextual cues, a process rarely articulated in text. We introduce DRinQ, a benchmark for evaluating pragmatic reasoning about conversational implicature in question utterances, designed to isolate pragmatic variation while holding each question’s surface form fixed. To support scalable evaluation, we propose a semi-automated pipeline that produces question-context-interpretation instances with systematic variation. Across evaluations, we find a consistent generation-inference asymmetry: while state-of-the-art models can generate plausible pragmatic scenarios when guided, they often fail to recover the intended implication at inference time. For smaller models, structured prompting improves alignment with human judgments. A comparative writing study further reveals complementary strengths: human authors tend to produce safer, predictable contexts, whereas models generate varied scenarios with interpretations that sometimes exceed contextual support. These findings highlight persistent challenges in modeling conversational implicature and motivate more context-sensitive evaluation frameworks.

[NLP-179] An Interactive Paradigm for Deep Research

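[Quick read]: This paper addresses the following problem: current LLM-based deep research systems typically run fixed workflows on open-ended queries with no mechanism for mid-process intervention, so they cannot adjust when user intent shifts, which hurts the relevance and controllability of the final output. The key to the solution is SteER, a steerable deep research framework that introduces interpretable mid-process control: at each decision point, a cost-benefit formulation determines whether to pause for user input or proceed autonomously. SteER combines diversity-aware planning with utility signals that reward alignment, novelty, and coverage, and maintains a live persona model that evolves throughout the session, yielding a more flexible, user-aligned, and higher-quality long-horizon research process.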

Link: https://arxiv.org/abs/2605.24266
Authors: Lin Ai, Victor S. Bursztyn, Xiang Chen, Julia Hirschberg, Saayan Mitra
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advances in large language models (LLMs) have enabled deep research systems that synthesize comprehensive, report-style answers to open-ended queries by combining retrieval, reasoning, and generation. Yet most frameworks rely on rigid workflows with one-shot scoping and long autonomous runs, offering little room for course correction if user intent shifts mid-process. We present SteER, a framework for Steerable deEp Research that introduces interpretable, mid-process control into long-horizon research workflows. At each decision point, SteER uses a cost-benefit formulation to determine whether to pause for user input or to proceed autonomously. It combines diversity-aware planning with utility signals that reward alignment, novelty, and coverage, and maintains a live persona model that evolves throughout the session. SteER outperforms state-of-the-art open-source and proprietary baselines by up to 22.80% on alignment, leads on quality metrics such as breadth and balance, and is preferred by human readers in 85%+ of pairwise alignment judgments. We also introduce a persona-query benchmark and data-generation pipeline. To our knowledge, this is the first work to advance deep research with an interactive, interpretable control paradigm, paving the way for controllable, user-aligned agents in long-form tasks.

[NLP-180] Improving Labeling Consistency with Detailed Constitutional Definitions and AI-Driven Evaluation ACL

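[Quick read]: This paper addresses insufficient labeling consistency and accuracy in automated labeling pipelines caused by vague category definitions, particularly in content moderation, where human annotators cannot fully follow long and detailed written rules and labels drift as a result. The key to the solution is an AI-driven workflow: AI first helps write a detailed per-category constitution that covers edge cases to reduce ambiguity, and a frontier LLM then interprets that constitution on each input to produce more consistent and accurate golden labels. Experiments show that, compared with paragraph-style definitions, the approach reduces cross-model inconsistency by up to 57x, uses cross-model disagreement to diagnose specification gaps, and leaves humans responsible for high-level decisions about what each category means rather than individual labeling calls. The paper also proposes a dual-axis safety evaluation that scores intent and content independently, so downstream consumers can act on either axis or both.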

Link: https://arxiv.org/abs/2605.24247
Authors: Konstantin Berlin, Adam Swanda
Affiliations: Cisco AI Defense
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under review at ACL Rolling Review (ARR), May 2026 cycle. Also available at this https URL

Abstract:Many automated labeling pipelines classify inputs into categories defined by a written specification, content moderation being a prominent use case. Simple category definitions are not detailed enough for labelers to produce the accurate, consistent golden labels these pipelines require. One solution is to write a prescriptive definition that settles enough real boundary cases that labelers cannot disagree with the written interpretation. In practice, definitions at that level of detail exceed what a human annotator can hold in working memory, so annotators fall back on intuition and the labels drift from the written rules, regressing on accuracy and consistency. We propose and demonstrate the efficacy of an AI-driven workflow in which AI helps write a per-category constitution that defines the label in enough detail to cover edge cases, and a frontier LLM interprets it on each input to produce the golden label more consistently and accurately than humans reading the same document. We evaluate on three content moderation categories (harassment, hate speech, non-violent crime) and show that the approach reduces cross-model inconsistency by up to 57x compared to paragraph definitions, with cross-model disagreement diagnosing specification gaps and the human responsible for high-level decisions about what each category should mean rather than individual labeling calls. For the safety evaluation, we introduce a dual-axis formulation scoring intent and content independently over the full conversation, so downstream consumers can act on either axis or both.
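
The abstract reports a reduction in cross-model inconsistency but does not define the metric in the text shown here. As one plausible way to quantify disagreement between labelers, the sketch below computes the fraction of pairwise model comparisons that disagree on an item. The function name, model names, and labels are hypothetical, and this measure is an assumption rather than the paper's metric.

```python
from itertools import combinations

def pairwise_disagreement(labels_by_model):
    """Fraction of (model pair, item) comparisons where two models disagree."""
    models = list(labels_by_model)
    n_items = len(labels_by_model[models[0]])
    disagreements = 0
    comparisons = 0
    for a, b in combinations(models, 2):
        for i in range(n_items):
            comparisons += 1
            if labels_by_model[a][i] != labels_by_model[b][i]:
                disagreements += 1
    return disagreements / comparisons

# Hypothetical labels from three models on four inputs.
labels = {
    "model_a": ["harassment", "none", "hate_speech", "none"],
    "model_b": ["harassment", "none", "none", "none"],
    "model_c": ["harassment", "none", "hate_speech", "none"],
}

rate = pairwise_disagreement(labels)
print(f"pairwise disagreement rate: {rate:.3f}")
```

Items on which models disagree (the third input here) are the ones the abstract suggests treating as pointers to gaps in the written specification.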

[NLP-181] QUEST: Training Frontier Deep Research Agents with Fully Synthetic Tasks

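[Quick read]: This paper addresses how to train a broadly capable open deep research agent, overcoming the poor cross-task generalization of existing open models on long-horizon search tasks. The key to the solution is a training recipe with three elements: (1) a curated data synthesis pipeline based on unified rubric trees, which produces training data with verifiable rewards across task types without human annotation; (2) a built-in context management mechanism that supports long-horizon reasoning and knowledge synthesis; and (3) a combination of mid-training, supervised fine-tuning, and reinforcement learning. Using only 8K synthesized tasks, QUEST approaches or even surpasses frontier closed-source agents across eight deep research benchmarks spanning diverse task types, and achieves the best overall performance among recent open-weight agents.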

Link: https://arxiv.org/abs/2605.24218
Authors: Jian Xie, Tianhe Lin, Zilu Wang, Yuting Ning, Yuekun Yao, Tianci Xue, Zhehao Zhang, Zhongyang Li, Kai Zhang, Yufan Wu, Shijie Chen, Boyu Gou, Mingzhe Han, Yifei Wang, Vint Lee, Xinpeng Wei, Xiangjun Wang, Yu Su, Huan Sun
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Work in Progress

Abstract:Deep research agents extend the role of search engines from retrieving keyword-matched pages to synthesizing knowledge, fundamentally changing how humans interact with information. However, frontier systems remain proprietary, while existing open agents often generalize poorly across different task types, leaving unclear how to train a broadly capable deep research agent. We release QUEST, a family of open models (ranging from 2B to 35B) that serve as general-purpose deep research agents designed to handle a wide range of long-horizon search tasks, with strong capabilities in fact seeking, citation grounding, and report synthesis. To build QUEST, we propose an effective training recipe combining mid-training, supervised fine-tuning, and reinforcement learning. Central to this recipe is a curated data synthesis pipeline based on unified rubric trees, which applies to different task types and enables synthesizing training data with verifiable rewards without human annotation. In addition, QUEST incorporates a built-in context management mechanism that enables effective long-horizon reasoning and knowledge synthesis. Using only 8K synthesized tasks, QUEST approaches or even surpasses frontier closed-source agents across eight deep research benchmarks spanning diverse task types, and achieves the best overall performance among recent open-weight agents. We released everything: models, data, and training scripts.

[NLP-182] Agent-ToM: Learning to Monitor Autonomous LLM Agents via Theory-of-Mind Reasoning

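[Quick read]: This paper addresses the difficulty of monitoring autonomous LLM agents whose attack patterns are long-horizon, covert, and context-dependent. Such agents may appear benign while pursuing hidden objectives, which makes malicious behavior hard to detect for prior approaches that analyze each trajectory independently. The key to the solution is Agent-ToM, a learning-to-monitor framework grounded in Theory-of-Mind (ToM) reasoning. It performs structured full-trajectory analysis that explicitly infers the agent's beliefs, intent hypotheses, expected actions, and deviations from task-consistent behavioral baselines, and at inference time uses a Reason-Verify-Refine pipeline to construct and validate monitoring decisions. At training time it distills critique signals into a persistent semantic guardrail memory, enabling belief- and intent-conditioned constraints to be reused across episodes, thereby improving the detection of covert malicious behavior.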

Link: https://arxiv.org/abs/2605.24216
Authors: Nesreen K. Ahmed, Nima Nafisi
Affiliations: Cisco Outshift
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: 23 pages, 9 figures

Abstract:Monitoring autonomous large language model (LLM) agents for covert malicious behavior is challenging due to delayed, context-dependent, and long-horizon attack patterns. Agents may pursue hidden objectives while maintaining superficially benign behavior, making detection difficult even with full trajectory access. Prior monitoring approaches improve scaffolding or ensemble aggregation, but treat each trajectory independently and do not learn from prior monitoring experience. Moreover, standard reasoning methods explain observed behavior without explicitly reasoning about agent beliefs, intentions, and goal alignment required to distinguish benign task execution from covert deviation. We propose Agent-ToM, a learning-to-monitor framework grounded in Theory-of-Mind (ToM) reasoning for security analysis of autonomous agents. Agent-ToM performs structured full-trajectory analysis by inferring beliefs, intent hypotheses with calibrated confidence, expected actions, and deviations from task-consistent behavioral baselines. At inference time, it employs a Reason-Verify-Refine pipeline to construct and validate monitoring decisions. At training time, Agent-ToM distills critique signals into a persistent semantic guardrail memory, enabling reusable belief- and intent-conditioned constraints across episodes. We evaluate Agent-ToM on adversarial agent monitoring benchmarks (SHADE-Arena and CUA-SHADE-Arena). Agent-ToM achieves strong precision-recall balance and outperforms state-of-the-art monitoring baselines, including ensemble methods, while using a coherent two-call reasoning pipeline. These results demonstrate that learning at the monitoring layer, combined with structured ToM reasoning and verification, provides an effective and deployable foundation for securing autonomous LLM agents.

[NLP-183] Teaching Through Analogies: A Modular Pipeline for Educational Analogy Generation

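[Quick read]: This paper addresses the insufficient quality of educational analogies generated by large language models (LLMs), focusing on how to systematically improve their explanatory power and accuracy. The key to the solution is a modular pipeline that decomposes analogy generation into four stages: source finding, sub-concept generation, explanation generation, and evaluation. Grounded in Structure Mapping Theory, the framework enables stage-by-stage analysis of how model choice and input configuration affect analogy quality. It shows that sub-concepts substantially improve explanation quality and closed-setting retrieval precision, and it introduces an LLM-as-a-judge evaluation methodology validated against human annotations, finding that Claude Sonnet 4.6 aligns more reliably with human rankings than with fine-grained absolute scores. Overall, the study highlights cross-stage interaction effects and identifies sub-concept grounding as a key driver of analogy quality.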

Link: https://arxiv.org/abs/2605.24211
Authors: Mariam Barakat, Ekaterina Kochmar
Affiliations: Mohamed bin Zayed University of Artificial Intelligence, UAE
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 36 pages, 25 figures. To appear in Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)

Abstract:Analogies help learners understand unfamiliar concepts by relating them to known concepts. Despite recent advances, large language models (LLMs) continue to struggle to generate analogies of comparable quality to those produced by humans. We present a modular pipeline for educational analogy generation, decomposing the task into four stages: source finding, sub-concept generation, explanation generation, and evaluation. Grounded in Structure Mapping Theory, the pipeline enables systematic, stage-by-stage analysis of how model choice and input configuration affect analogy quality. We evaluate 12 state-of-the-art LLMs across six model families on two datasets with structured sub-concept annotations (SCAR and ParallelPARC), alongside seven embedding models for closed-setting retrieval. Our results show that sub-concepts substantially improve explanation quality and closed setting retrieval precision but provide limited benefit in open-ended source generation. We further introduce an LLM-as-a-judge evaluation methodology and validate its scoring against human annotations from seven annotators, finding that Claude Sonnet 4.6 aligns more reliably with human rankings than with fine-grained absolute scores. Taken together, our findings reveal cross-stage interactions that isolated studies cannot capture, and highlight sub-concept grounding as a key driver of analogy quality generation.

[NLP-184] Extracting Training Data from Diffusion Language Models via Infilling

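[Quick read]: This paper addresses a limitation in how training-data memorization risk is assessed in large language models: prefix-conditioned extraction is a natural fit for autoregressive models but underestimates extractability in diffusion language models (DLMs), which can denoise masked tokens at arbitrary positions. The key to the solution is infilling extraction, a data-extraction protocol parameterized by an arbitrary binary mask that subsumes prefix-only probing and accounts for the bidirectional inductive bias of DLMs. Experiments show that mask geometry governs extractability: edge-conditioned masks extract up to three times more verbatim sequences than prefix-conditioned ones. Bidirectional access also opens channels inaccessible in autoregressive models: an adversary with access to training data in which personally identifiable information (PII) has been redacted can achieve higher recall on redacted email addresses from DLMs than from scale-matched autoregressive models. Tunable decoding parameters measurably affect extraction performance, and a follow-up supervised fine-tuning stage does not eliminate prior memorization.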

Link: https://arxiv.org/abs/2605.24173
Authors: Yihan Wang, N. Asokan
Affiliations: University of Waterloo; KTH Royal Institute of Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:Memorization in large language models has been studied almost exclusively through prefix-conditioned extraction, a natural choice for autoregressive models. However, diffusion language models (DLMs) can denoise masked tokens at arbitrary positions. Thus, prefix-only probing reveals only one facet of memorization in DLMs and significantly underestimates the risk of training-data extraction. In order to realistically model extractability of training data in DLMs, we introduce infilling extraction, a data-extraction protocol parameterized by an arbitrary binary mask that subsumes prefix-only probing and accounts for the bidirectional inductive bias of DLMs. Instantiating it on LLaDA-8B and Dream-7B across five extraction modes, three training pipelines, and three corpora covering verbatim and partial leakage, we find that mask geometry governs extractability: edge-conditioned masks extract up to three times more verbatim sequences than prefix-conditioned ones, and bidirectional access opens channels inaccessible in autoregressive models. In particular, we show that a realistic adversary with access to training data where personally identifiable information has been redacted, can even achieve higher recall on extracting redacted email addresses from DLMs than from scale-matched autoregressive models. Tunable parameters for decoding measurably affect extraction performance, while a follow-up supervised finetuning stage does not eliminate the prior memorization.
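
To make the idea of a protocol "parameterized by an arbitrary binary mask" more concrete, the sketch below builds two masks over an 8-token sequence: a prefix-conditioned mask and one possible edge-conditioned mask. The abstract does not define the exact mask layouts, so the convention used here (1 marks a token the model must infill, 0 marks a token given as context) and the reading of "edge-conditioned" as "context at both ends" are assumptions for illustration only.

```python
def prefix_mask(length, n_visible):
    """Reveal the first n_visible tokens; the rest must be infilled.
    Convention (assumed): 0 = given as context, 1 = hidden."""
    return [0] * n_visible + [1] * (length - n_visible)

def edge_mask(length, n_visible):
    """Reveal tokens at both ends and hide the middle (assumed layout)."""
    left = n_visible // 2
    right = n_visible - left
    return [0] * left + [1] * (length - n_visible) + [0] * right

p_mask = prefix_mask(8, 4)
e_mask = edge_mask(8, 4)
print(p_mask)
print(e_mask)
```

Both masks reveal the same number of tokens and differ only in where the context sits, which is the kind of "mask geometry" variation the abstract says governs extractability. An autoregressive model can only be probed with the prefix layout, whereas a DLM can condition on context on both sides of the hidden span.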

[NLP-185] CUNY at CLPsych 2026: A Pipeline Approach to Classification and Summarization of Mental Health Changes

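[Quick read]: This paper addresses capturing and characterizing mental health changes through social media timeline dynamics, covering four subtasks: inferring the dominant self-states in posts (Tasks 1.1 and 1.2), predicting moments of change in a timeline (Task 2), and summarizing patterns of mood dynamics and their progression over time (Task 3.1). The key elements of the solution are as follows: for Tasks 1.1 and 1.2, in-context learning with three open-weight large language models (LLMs) is ensembled by majority voting; for Task 2, supervised classifiers are trained on features derived from Task 1.1 predictions; for Task 3.1, in-context examples are augmented with labels predicted by upstream systems (Tasks 1.1, 1.2, and 2), yielding gains over zero-shot and unaugmented in-context learning baselines.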

Link: https://arxiv.org/abs/2605.24164
Authors: Amirmohammad Ziaei Bideh, Shameed Charlomar Job, Ava Yahyapour, Alla Rozovskaya
Affiliations: CUNY Graduate Center
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:We describe our submission to the CLPsych 2026 Shared Task on capturing and characterizing mental health changes through social media timeline dynamics. To infer the dominant self-states in posts (Tasks 1.1 and 1.2), we ensemble in-context learning of three open-weight large language models using majority voting. For predicting moments of change in a timeline (Task 2), we train supervised classifiers on features derived from Task 1.1 predictions. To summarize the patterns of mood dynamics and their progression over time within a timeline (Task 3.1), we augment in-context example labels predicted by upstream systems (Tasks 1.1, 1.2, and 2), yielding performance gains over zero-shot and unaugmented in-context learning baselines. Our submission ranked first on Task 1.1, fourth on Task 1.2, fourth on Task 2, and third on Task 3.1. The source code for the experiments is available at this https URL
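
The majority-voting ensemble mentioned in this abstract can be sketched in a few lines. The label strings and the tie-breaking rule below are assumptions for illustration; the abstract does not say how the authors resolve ties or which label set they use.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label among the model predictions.
    On a tie, Counter.most_common keeps first-seen order, so the
    earliest-listed model's label wins (an assumed tie-break)."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical predictions from three models for two posts.
post_1 = majority_vote(["adaptive", "maladaptive", "adaptive"])
post_2 = majority_vote(["maladaptive", "maladaptive", "adaptive"])
print(post_1, post_2)
```

With three voters and two classes a strict majority always exists; ties only arise when there are three or more classes and every model disagrees, in which case this sketch falls back to the first model's prediction.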

[NLP-186] TRACER: A Semantic-Aware Framework for Fine-Grained Contamination Detection in Code LLMs

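[Quick read]: This paper addresses the threat that data contamination poses to evaluation reliability for code large language models (Code LLMs), in particular the inability of exact-match detection to catch contaminated samples that are semantically similar but not exact duplicates. The key to the solution is TRACER, a framework that models code contamination at three levels of semantic overlap (Functionally Identical, Nearly Identical, and Shared Logic) and detects them through a coarse-to-fine pipeline. The authors also build the first benchmark for fine-grained code contamination detection, spanning three widely used benchmarks and three representative post-training datasets. TRACER performs consistently across LLM backbones: with GPT-5 it reaches an F1 of 0.91 in fine-grained detection, and in the binary setting it attains an F1 of 0.92, outperforming existing methods by 42%-217%.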

Link: https://arxiv.org/abs/2605.24079
Authors: Yifeng Di, Xuliang Huang, Tianyi Zhang
Affiliations: Purdue University; University of Illinois Urbana-Champaign
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 21 pages, 2 figures, 15 tables

Abstract:Data contamination is a known threat to the reliability of model evaluation. However, it remains underexplored in code large language models (LLMs), where contamination often goes beyond exact duplication. We present TRACER, a semantic-aware framework for fine-grained code contamination detection. TRACER models contamination using three levels of semantic overlap - Functionally Identical, Nearly Identical, and Shared Logic - and detects them through a coarse-to-fine pipeline. We also introduce the first benchmark for fine-grained code contamination detection, spanning three widely used benchmarks and three representative post-training datasets. TRACER achieves strong and consistent performance across multiple LLM backbones, with GPT-5 reaching an F1 score of 0.91 in fine-grained detection. In the binary setting, TRACER attains an F1 of 0.92, outperforming existing methods by 42%-217%. We further conduct ablation studies and error analysis to assess the contributions of individual components in TRACER.

[NLP-187] Breaking the Chains of Probability: Neutrosophic Logic as a New Framework for Epistemic Uncertainty in Large Language Models

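[Quick read]: This paper addresses limitations in how large language models (LLMs) handle uncertainty, in particular the "uncertainty collapse" induced by conventional probabilistic frameworks (such as Softmax layers), which makes it hard to distinguish epistemic uncertainty, logical paradox, and vagueness. The key to the solution is Neutrosophic Logic, which treats Truth (T), Indeterminacy (I), and Falsity (F) as three independent dimensions and allows T+I+F > 1, a state the authors call hyper-truth. In the experiments, hyper-truth emerged spontaneously in 35% of evaluations, predominantly under ethical contradiction and logical paradox. The authors argue that this approach gives a richer representation of a model's internal state, preserves truth values in fuzzy contexts, and helps identify and quantify internal model conflict, and they present it as a step toward more transparent, reliable, and ethically aware AI systems.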

Link: https://arxiv.org/abs/2605.24053
Authors: Maikel Yelandi Leyva-Vázquez, Florentin Smarandache
Affiliations: Universidad Bolivariana del Ecuador; Universidad de Guayaquil; Universidad Bernardo O’Higgins; University of New Mexico
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Published in Neutrosophic Sets and Systems, Vol. 99 (2026). Author’s preprint version. Open code and data available at: this http URL

Abstract:Large Language Models (LLMs) are predominantly governed by probabilistic frameworks in which the sum of outcome probabilities is constrained to unity. This architectural limitation, often imposed by Softmax layers, leads to a collapse of uncertainty that makes it difficult to differentiate between epistemic uncertainty, paradox, and vagueness. We present an empirical investigation of the application of Neutrosophic Logic, a framework that treats Truth (T), Indeterminacy (I), and Falsity (F) as three independent dimensions, to model epistemic states in LLMs. We conducted experiments on a family of four OpenAI GPT models across five linguistic phenomena: logical paradoxes, epistemic ignorance, vagueness, ethical contradictions, and future contingencies, under three prompting strategies: neutrosophic, probabilistic, and entropy-derived. Our findings reveal that the neutrosophic approach, by allowing T+I+F > 1, a state we term hyper-truth, provides a richer representation of a model’s internal state. In 35% of evaluations, hyper-truth emerged spontaneously, predominantly under ethical contradiction and logical paradox. We demonstrate that this approach preserves truth values in fuzzy contexts and offers a robust method for identifying and quantifying internal model conflict. We conclude that the integration of neutrosophic evaluation layers is a critical step toward more transparent, reliable, and ethically aware AI systems.
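
The contrast between a probability distribution (components sum to one) and a neutrosophic triple (components are independent) can be shown with a small check. The function name, the assumption that each component lies in [0, 1], and the example values are illustrative choices, not taken from the paper's code or data.

```python
def is_hyper_truth(t, i, f):
    """Return True when a (Truth, Indeterminacy, Falsity) triple sums to
    more than 1, the state the abstract calls hyper-truth.
    Assumes each component is scored independently in [0, 1]."""
    for value in (t, i, f):
        if not 0.0 <= value <= 1.0:
            raise ValueError("each component must lie in [0, 1]")
    return t + i + f > 1.0

# Hypothetical scores for a paradox-like statement: high on every axis.
paradox = is_hyper_truth(0.8, 0.5, 0.3)

# A triple that happens to behave like a probability distribution.
probabilistic = is_hyper_truth(0.5, 0.25, 0.25)

print(paradox, probabilistic)
```

A Softmax-style output can never produce the first case, because its components are forced to sum to exactly one; that constraint is what the abstract describes as collapsing distinct kinds of uncertainty.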

[NLP-188] LC-ERD: Mining Latent Logic for Self-Evolving Reasoning via Consistency-Regulated Reward Decomposition KDD2026

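[Quick read]: This paper addresses the bottleneck that the scarcity of high-quality process data places on the evolution of large language model (LLM) reasoning. Existing self-alignment methods based on endogenous rewards face three challenges: (1) label noise via mimetic bias, where rewards prioritize statistical likelihood over logical truth, creating a "correctness illusion" that masks compounding errors; (2) coarse-grained supervision, where methods such as GRPO rely only on global outcome feedback and provide no step-level guidance; and (3) distributional collapse, where signals fail to generalize and may amplify pre-training biases. The key to the solution is LC-ERD (Logic-Consistent Endogenous Reward Decomposition), which frames self-alignment as latent structure mining: it first derives a Variational Logic Potential by aggregating consensus from the model's Latent Logic Expertise (LLE) to denoise the reasoning manifold, and then introduces a Multi-Agent Value Decomposition protocol based on the IGM principle to quantify the individual utility of each reasoning step. Experiments show that LC-ERD delivers a robust self-evolution path, uncovers trade-offs between logic consistency and accuracy, and identifies high-value reasoning patterns missed by standard rewards.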

链接: https://arxiv.org/abs/2605.24005
作者: Yanyu Chen,Jiyue Jiang,Dianzhi Yu,Zheng Wu,Jiahong Liu,Jiaming Han,Xiao Guo,Jinhu Qi,Yu Li,Yifei Zhang,Irwin King
机构: The Chinese University of Hong Kong(香港中文大学); Shanghai Jiaotong University(上海交通大学); Fudan University(复旦大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted in SIGKDD2026 Research Track

点击查看摘要

Abstract:The evolution of Large Language Model (LLM) reasoning is bottlenecked by the scarcity of high-quality process data. While self-alignment via endogenous rewards offers a solution, mining valid supervision faces three challenges: (1) Label Noise via Mimetic Bias, where rewards prioritize statistical likelihood over logical truth, creating a “correctness illusion” that masks compounding errors; (2) Coarse-Grained Supervision, where sparse global outcomes (e.g., in GRPO) fail to provide granular guidance, treating reasoning chains as monolithic; and (3) Distributional Collapse, where signals fail to generalize without amplifying pre-training biases. To address these, we introduce LC-ERD (Logic-Consistent Endogenous Reward Decomposition), a framework framing self-alignment as latent structure mining. We derive a Variational Logic Potential by aggregating consensus from the model’s Latent Logic Expertise (LLE) to denoise the reasoning manifold, and introduce a Multi-Agent Value Decomposition protocol based on the IGM principle to quantify individual step utility. Experiments show LC-ERD delivers a robust self-evolution path, uncovering trade-offs between logic consistency and accuracy while identifying high-value reasoning patterns missed by standard rewards. Our code is available at this https URL.

[NLP-189] Toxicity in Twitch Chats: An LLM-Based Analysis Across Gaming Communities

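[Quick Read]: This paper studies how toxicity on streaming platforms such as Twitch varies across game genres and communities. Existing work focuses mostly on in-game toxicity, and systematic analysis of toxic behavior in stream chats is scarce. The key to the solution is zero-shot classification with a pre-trained large language model (LLM), which labels roughly 20 million chat messages according to Twitch's toxicity taxonomy. The approach reaches an F1 score of 94.5% on the TextDetox dataset, and human-model agreement is comparable to inter-human agreement. This allows toxicity to be identified and quantified at scale across genres and individual games: MOBA games show the highest toxicity rate (3.2%), sports games the lowest (2%), and games within the same genre differ significantly, suggesting that game-specific community norms and mechanics shape toxic behavior. The findings provide empirical grounding for more targeted moderation strategies.

Link: https://arxiv.org/abs/2605.24000
Authors: Ronja Fuchs, Florian Rupp, Timo Bertram, Kai Eckert, Alexander Dockhorn
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 8 pages, 2 figures, 5 tables. Accepted at the IEEE Conference on Games (IEEE CoG) 2026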

Abstract:Toxicity in online gaming communities remains a persistent challenge, manifesting across genres, platforms, and player interactions. While much research is focused on in-game toxicity, less is known about how toxic behavior varies between gaming communities on streaming platforms. To address this shortcoming, we analyze approximately 20 million chat messages from 4,452 streams, spanning seven game genres on Twitch. We categorize messages according to Twitch’s toxicity taxonomy with a pre-trained Large Language Model using zero-shot classification. The taxonomy comprises four categories and eight subclasses, including harassment, discrimination, sexual content, and profanity. Our approach achieves an F1 score of 94.5% on the TextDetox dataset and demonstrates human-model agreement comparable to inter-human agreement. Our analysis reveals that 2.4% of all messages are classified as toxic, with notable differences across genres: streams of MOBA games exhibit the highest relative rate of toxicity (3.2%), and sports games show the lowest rate (2%). Furthermore, results indicate that individual games differ significantly in their toxicity distributions, even within genres, suggesting the existence of game-specific community norms and mechanics that shape toxic behavior beyond genre-level effects. These findings offer empirical insights into genre- and game-specific toxicity patterns on Twitch and can inform more targeted moderation strategies for gaming communities.

[NLP-190] Towards trustworthy agentic AI: a comprehensive survey of safety, robustness, privacy, and system security

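[Quick Read]: This survey addresses the trust problems that arise when agentic AI systems are deployed in high-risk settings, focusing on two core dimensions: Safety and Robustness, and Privacy and System Security. The key contribution is to systematically identify where risks emerge at each stage of the agent workflow and to summarize stage-targeted mitigation strategies. It also consolidates evaluation into a unified hub of metrics and benchmarks covering outcome signals (e.g., task success rate) and process signals (e.g., constraint violations, trace completeness, and adversarial success rates), supporting comparable deployment decisions and release gating. In addition, a case study of real-world security failures in open-source agentic systems grounds the discussion, and the survey outlines open problems such as self-evolving agents, runtime monitoring and verification, and privacy-preserving personalization.

Link: https://arxiv.org/abs/2605.23989
Authors: Jinhu Qi, Muzhi Li, Jiahong Liu, Yuqin Shu, Dianzhi Yu, Shicheng Ma, Wenqian Cui, Yiyang Zhao, Yiyi Chen, Ruoxi Jiang, Irwin King, Zenglin Xu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: 36 pages, 4 figures. Survey/review article on trustworthy agentic AI. Published in Academia AI and Applications, 2026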

Abstract:Agentic AI systems – Large Language Models (LLMs) augmented with planning, tool use, memory, and long-horizon interactions – can execute complex tasks autonomously, but their multi-step trajectories introduce new failure modes that challenge trustworthiness. This survey provides a focused examination of trustworthy agentic AI through two core dimensions that are critical for high-risk deployments: Safety and Robustness, and Privacy and System Security. For each dimension, we clarify key concepts, identify where risks emerge along the agent workflow, and summarize stage-targeted mitigation strategies. Other trustworthiness aspects (value alignment, transparency, fairness, and accountability) are discussed as relevant context rather than parallel chapters. To support consistent comparison and deployment decisions, we consolidate evaluation into a unified metrics-and-benchmarks hub, emphasizing both outcome and process signals (e.g., constraint violations, trace completeness, and adversarial success rates) and offering scenario-to-metric guidance for release gating. We conclude by outlining open challenges such as self-evolving agents, runtime monitoring and verification, privacy-preserving personalization, and the trust-utility trade-off, and present a case study of real-world security failures in open-source agentic systems. Our goal is to serve as a practical reference for researchers and practitioners building trustworthy agentic systems in high-stakes environments.

[NLP-191] A Multi-Probe Audit of Clinical-Interview Depression Detection Benchmarks

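[Quick Read]: This paper examines the reliability of benchmark evaluation for depression detection from clinical interviews, in particular the biases and inconsistencies of existing protocols such as the E-DAIC official split. It audits several datasets (DAIC/E-DAIC, CMDC, ANDROIDS, MODMA, PDCH) with four complementary probes. First, it re-evaluates E-DAIC under strict subject-disjoint leave-one-subject-out cross-validation, where a lightweight hybrid text-plus-LLM-score model reaches macro-F1 = 0.723, which the authors report as the highest under this protocol to their knowledge. Second, it sweeps 96 model configurations to test whether the official split supports fine-grained leaderboards, and finds that development-side cross-validation rankings and official-test rankings align only moderately, so the official split may not reliably reflect generalization. Third, it externally validates CMDC and ANDROIDS baselines, which are near ceiling in-domain but substantially weaker under zero-shot transfer to other corpora. Finally, it stress-tests text and audio models on symptom-dense versus symptom-light interview slices defined by an SRDS-based annotator: text scores rise sharply on symptom-dense slices while audio scores stay nearly flat, indicating that the text models respond far more strongly to symptom content than the audio models.

Link: https://arxiv.org/abs/2605.23977
Authors: Takehiro Ishikawa, Jon Duke
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: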

Abstract:This paper audits benchmark evaluation in clinical-interview depression detection through four complementary probes across DAIC/E-DAIC, CMDC, ANDROIDS, MODMA, and PDCH. First, we re-evaluate E-DAIC under strict subject-disjoint leave-one-subject-out cross-validation. A lightweight hybrid text-plus-LLM-score model reaches macro-F1 = 0.723 - the highest reported under this protocol, to our knowledge - providing a conservative out-of-fold reference point that does not depend on the privileged official holdout. Second, we test whether the E-DAIC official split supports fine-grained leaderboard rankings by sweeping 96 model configurations across modality bundles, pooling strategies, and learners. Development-side cross-validation and official-test rankings align only moderately: the best cross-validation configuration ranks twentieth on the official test, the official-test winner ranks forty-first by cross-validation, top-3 overlap is zero, and the apparent winner is rank-1 in only 32.3% of subject bootstraps. Third, we externally validate strong public CMDC and ANDROIDS baselines that achieve near-ceiling in-domain performance. Zero-shot transfer to external corpora is substantially weaker. Finally, we stress-test E-DAIC text and audio models using paired symptom-dense versus symptom-light interview slices defined by an SRDS-based annotator. Text scores rise sharply on symptom-dense slices, whereas audio scores remain nearly flat; the text-minus-audio gap is positive across all five seeds.
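
The leaderboard-stability checks described in the abstract (top-k overlap between two rankings, and how often the apparent winner stays rank-1 under subject bootstraps) can be sketched as follows. This is a simplified illustration, not the authors' code: the function names are hypothetical, the toy scores are invented, and per-subject accuracy stands in for whatever metric the paper actually bootstraps.

```python
import random


def top_k_overlap(scores_a, scores_b, k=3):
    """Number of configurations shared by the top-k of two score dictionaries."""
    top_a = set(sorted(scores_a, key=scores_a.get, reverse=True)[:k])
    top_b = set(sorted(scores_b, key=scores_b.get, reverse=True)[:k])
    return len(top_a & top_b)


def rank1_frequency(per_subject_correct, winner, n_boot=1000, seed=0):
    """Fraction of subject bootstraps in which `winner` has the (tied-)best accuracy.

    per_subject_correct maps a configuration name to a list of 0/1 outcomes,
    one per subject, in the same subject order for every configuration.
    """
    rng = random.Random(seed)
    n_subjects = len(per_subject_correct[winner])
    wins = 0
    for _ in range(n_boot):
        # Resample subjects with replacement, then rescore every configuration.
        sample = [rng.randrange(n_subjects) for _ in range(n_subjects)]
        acc = {
            name: sum(outcomes[i] for i in sample) / n_subjects
            for name, outcomes in per_subject_correct.items()
        }
        if acc[winner] >= max(acc.values()):
            wins += 1
    return wins / n_boot


# Invented scores: the cross-validation favourites are not the test favourites.
cv_scores = {"a": 0.70, "b": 0.68, "c": 0.66, "d": 0.60, "e": 0.58, "f": 0.55}
test_scores = {"a": 0.55, "b": 0.57, "c": 0.56, "d": 0.66, "e": 0.65, "f": 0.64}
overlap = top_k_overlap(cv_scores, test_scores, k=3)
```

A low `rank1_frequency` for the nominal winner signals that the ordering at the top of the leaderboard depends heavily on which subjects happen to be in the test set.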

[NLP-192] Direct Preference Optimization for English-Mandarin Code-Switching Speech Recognition in Audio LLMs

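[Quick Read]: This paper addresses the systematic failures of audio large language models (Audio LLMs) when transcribing code-switching speech, focusing on English-Mandarin and three typical failure modes: language omission, translation instead of transcription, and hallucination. The key to the solution is Direct Preference Optimization (DPO): the authors build preference pairs in which the chosen response preserves the mixed-language content and the rejected response mimics one of the failure patterns, and use them to fine-tune the models. After training three Audio LLMs on 100K preference pairs (570 hours), behavior shifts consistently from translating toward preserving the original language composition. The alignment reduces MER by up to 89.6% in-distribution and 20.0% out-of-distribution, suggesting that DPO can effectively elicit correct code-switching transcription from multilingual Audio LLMs.

Link: https://arxiv.org/abs/2605.23975
Authors: Trung Nguyen Quang, Cheng Yi Lewis Won, Minh Duc Pham, Yingxu He, Shuo Sun, Ai Ti Aw
Affiliations: Institute for Infocomm Research (I2R), A*STAR, Singapore; Nanyang Technological University, Singapore
Categories: Computation and Language (cs.CL); Sound (cs.SD)
Comments: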

Abstract:Audio large language models (Audio LLMs) exhibit systematic failures in transcribing code-switching speech despite strong multilingual capabilities. Focusing on English-Mandarin, we identify three failure modes: language omission, translation-instead-of-transcription, and hallucination. We apply Direct Preference Optimization (DPO) to align models, constructing preference pairs in which chosen responses preserve mixed-language content while rejected responses mimic failure patterns. Training three Audio LLMs on 100K pairs (570 hours), we observe consistent behavioral shifts: models learn to preserve language composition rather than translating when prompted for transcription. This alignment yields MER reductions up to 89.6% (in-distribution) and 20.0% (out-of-distribution). Our findings suggest DPO can effectively elicit correct code-switching transcription behavior from multilingual Audio LLMs.
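
For readers unfamiliar with DPO, the per-pair objective can be written in a few lines. The sketch below implements the standard DPO loss for a single preference pair from sequence log-probabilities; the log-probability values and the `beta` default are hypothetical, and nothing here reflects the authors' training setup.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more the policy likes each response than the frozen reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written as a numerically stable softplus of -logits.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# Hypothetical log-probabilities for one code-switched utterance:
# "chosen" keeps both languages, "rejected" is a translated transcript.
loss_before = dpo_loss(-42.0, -40.0, -42.0, -40.0)  # policy equals the reference
loss_after = dpo_loss(-38.0, -45.0, -42.0, -40.0)   # policy now favours the chosen transcript
```

The loss starts at log 2 when the policy matches the reference and falls as the policy raises the chosen transcript relative to the rejected one.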

[NLP-193] AERIC: Anticipatory Hidden-State Monitoring for Implicit Harmful Dialogue

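[Quick Read]: This paper addresses two safety problems for current language models: detecting risk early enough in generation to avoid exposing harmful continuations, and detecting implicit harmfulness that is not signaled by overtly toxic text. Response-level guards judge completed text well, and streaming guards operate closer to token time, but neither settles whether a lightweight monitor can anticipate implicit harmful drift from the generator's own internal trajectory. The key idea is anticipatory same-pass monitoring, in which the safety monitor may read hidden states produced during ordinary decoding but may not trigger an additional forward pass through the base model. The proposed method, AERIC, combines short-horizon hazard forecasting, support-sensitive suppression, and prompt-conditioned residual scoring under a same-pass exponential moving average decision rule. The default linear monitor has only 387 trainable head parameters, improves AUROC over Qwen3Guard-Stream-4B on DiaSafety and Harmful Advice, and increases mean latency by only 2.34% while withholding answer tokens on harmful-prompt trigger benchmarks.

Link: https://arxiv.org/abs/2605.23974
Authors: Jihyung Park, Saleh Afroogh, Junfeng Jiao
Affiliations: The University of Texas at Austin
Categories: Computation and Language (cs.CL)
Comments: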

Abstract:Current language models create two safety challenges: risk must be detected early enough to avoid exposing harmful continuation, and the harmfulness itself may be implicit rather than signaled by overtly toxic text. Existing response-level guards are strong at judging completed text, and native streaming guards move closer to token time, but both settings leave open whether a lightweight monitor can anticipate implicit harmful drift from the generator’s own internal trajectory. We study anticipatory same-pass monitoring, where a safety monitor may read hidden states produced during ordinary decoding but may not invoke an additional forward pass through the base model. We introduce AERIC, a transfer-oriented hidden-state approach for implicit harmful dialogue that combines short-horizon hazard forecasting, support-sensitive suppression, and prompt-conditioned residual scoring under a same-pass exponential moving average decision rule. The default linear monitor contains only 387 trainable head parameters. Against Qwen3Guard-Stream-4B on balanced benchmarks, AERIC improves AUROC from 0.6830 to 0.7143 on DiaSafety and from 0.8219 to 0.8582 on Harmful Advice. For prompt-level trigger benchmarks, we calibrate the AERIC threshold by a source-side safe-budget rule that maximizes trigger coverage while constraining the safe-trigger rate to at most 10%. Under that rule, trigger@64 reaches 0.6438 and 0.4656 on HarmBench DirectRequest and 0.6849 and 0.7363 on SocialHarmBench for Qwen and Gemma, respectively, withholding between 23.53 and 41.86 answer tokens on average. Same-pass deployment is also efficient: on a 63-prompt harmful-prompt fixed-generation benchmark aggregated over HarmBench DirectRequest and SocialHarmBench under Qwen3-8B, the monitor increases mean latency by only 2.34%, whereas Qwen3Guard-Stream-4B increases it by 79.40%.
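
The two decision components named in the abstract, an exponential moving average (EMA) trigger and a threshold calibrated under a safe-trigger budget, can be illustrated with a small sketch. It assumes per-token hazard scores are already available (in the paper they come from a monitor over hidden states, which is not reproduced here); the function names, the `alpha` value, and the toy runs are all hypothetical.

```python
def ema_trigger_step(hazard_scores, threshold, alpha=0.3):
    """Index of the first token whose EMA hazard exceeds the threshold, or None."""
    ema = None
    for step, score in enumerate(hazard_scores):
        # The EMA smooths single-token spikes so one noisy score does not trigger.
        ema = score if ema is None else alpha * score + (1.0 - alpha) * ema
        if ema > threshold:
            return step
    return None


def calibrate_threshold(safe_runs, harmful_runs, candidates,
                        max_safe_rate=0.10, alpha=0.3):
    """Pick the threshold with the highest harmful trigger coverage whose
    trigger rate on safe runs stays at or below max_safe_rate.

    Returns (threshold, coverage, safe_rate), or None if no candidate qualifies.
    """
    best = None
    for threshold in sorted(candidates):
        safe_rate = sum(
            ema_trigger_step(run, threshold, alpha) is not None for run in safe_runs
        ) / len(safe_runs)
        if safe_rate > max_safe_rate:
            continue
        coverage = sum(
            ema_trigger_step(run, threshold, alpha) is not None for run in harmful_runs
        ) / len(harmful_runs)
        if best is None or coverage > best[1]:
            best = (threshold, coverage, safe_rate)
    return best


# Invented hazard trajectories for two safe and two harmful generations.
safe_runs = [[0.1, 0.1], [0.2, 0.1]]
harmful_runs = [[0.8, 0.9], [0.3, 0.4]]
chosen = calibrate_threshold(safe_runs, harmful_runs, candidates=[0.05, 0.25, 0.5])
```

Because the trigger reads only scores produced during decoding, a rule of this shape adds no extra forward pass through the base model, which is the constraint the paper calls same-pass monitoring.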

[NLP-194] Why We Need World Models for AGI: Where LLMs Fail and How World Models May Outperform

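[Quick Read]: This paper addresses the limits of large language models (LLMs) on tasks that require causal reasoning, persistent state tracking, and long-horizon planning. The authors argue that these limits may stem from an objective-level mismatch between sequence prediction and reasoning over latent environment dynamics. The key to their approach is Latent Dynamics Inference (LDI), a conceptual perspective that treats language and multimodal observations as partial evidence of underlying state-transition dynamics, investigated empirically through Flux, a sequential reasoning environment specified entirely by natural-language rules. The rules are compiled into an explicit state-transition simulator, which permits a controlled comparison between LLMs that see only textual observations and reinforcement-learning agents trained directly in the extracted latent state space. The latter are markedly more stable in long-horizon gameplay (aggregate win rate of about 79% versus 11% for LLMs). Qualitative analysis shows typical LLM failure modes, including unstable state tracking, invalid actions, and short-horizon reasoning failures, suggesting that strong sequence prediction alone may struggle to support robust long-horizon dynamic reasoning without mechanisms for persistent state tracking and transition modeling.

Link: https://arxiv.org/abs/2605.23972
Authors: Feisal Alaswad, Batoul Aljaddouh, Maher Alrahhal, Poovammal E, Talal Bonny
Affiliations: SRM Institute of Science and Technology; University of Sharjah
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Comments: 19 pages, 5 figures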

Abstract:Large language models achieve strong performance in language generation and knowledge-intensive tasks, yet remain limited in settings requiring causal reasoning, persistent state tracking, and long-horizon planning. We argue that these limitations may arise from an objective-level mismatch between sequence prediction and reasoning over latent environment dynamics. To formalize this distinction, we introduce Latent Dynamics Inference (LDI), a conceptual perspective that interprets language and multimodal observations as partial evidence of underlying transition dynamics. To empirically investigate this perspective, we introduce Flux, a sequential reasoning environment specified entirely through natural-language rules. As a proof-of-concept case study, the rules are first compiled into an explicit state-transition simulator, illustrating that structured latent transition dynamics can, in some cases, be operationally extracted from textual rule descriptions. This enables a controlled comparison between the LLMs operating purely over textual observations and reinforcement-learning agents trained directly within the extracted latent state space. Within this case study, agents operating with explicit access to the latent state space exhibit substantially more stable behavior in long-horizon gameplay, achieving an aggregate win rate of approximately 79% versus 11% for LLMs. Qualitative analysis further reveals failure modes consistent with unstable persistent state tracking, including invalid actions, state-tracking errors, and short-horizon reasoning failures. The complete implementation of the Flux environment is available at this https URL. Within the evaluated setting, these results suggest that strong sequence prediction alone may struggle to support robust long-horizon dynamic reasoning without mechanisms for persistent state tracking and transition modeling.

[NLP-195] Faithful or Fabricated? A Causal Framework for Rationalization Bias in LLM Judges

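[Quick Read]: This paper asks whether large language models (LLMs) used as automatic judges for summarization and dialogue evaluation are cue-invariant, that is, whether their rankings and explanations stay stable when non-evidential cues are perturbed while the underlying texts are held fixed. Prior work documents outcome biases (position, verbosity, style preferences) but largely ignores whether the judges' explanations are swayed by irrelevant cues. The key to the solution is a suite of cue interventions (Blind, Truth, Flip, Placebo, Reveal-After) together with tie-aware metrics that quantify outcome anchoring and rationale anchoring, including label-aligned rhetoric and explanation drift, plus anchoring attacks built from verbosity and confidence cues. The paper compares two mitigations, structured chain-of-thought prompting and PROOF-BEFORE-PREFERENCE (evidence lock, score, rank), and finds that the latter markedly improves cue invariance.

Link: https://arxiv.org/abs/2605.23970
Authors: Riya Tapwal, Abhishek Kumar, Carsten Maple
Affiliations: Indian Institute of Technology (IIT) Mandi; Warwick Manufacturing Group
Categories: Computation and Language (cs.CL)
Comments: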

Abstract:Large language models (LLMs) are increasingly used as automatic judges for summarization and dialogue evaluation. Prior work has documented biases such as position, verbosity, and style preferences, but largely focuses on outcomes, leaving judge explanations underexplored. We instead ask whether LLM judges are cue-invariant, i.e., whether their rankings and explanations remain stable when non-evidential cues are perturbed while holding the underlying texts fixed. We introduce a suite of cue interventions (Blind, Truth, Flip, Placebo, Reveal-After) and tie-aware metrics that quantify outcome anchoring and rationale anchoring, including label-aligned rhetoric and explanation drift, alongside consistency and stereotype-intrusion checks. We design anchoring attacks using verbosity and confidence cues, and compare two mitigations: structured chain-of-thought prompting and PROOF-BEFORE-PREFERENCE (evidence lock, score, rank). Using a new dataset of 1,000 summaries from traditional extractive models and LLMs, we find substantial cue-anchored rationalization under label and placebo perturbations, while PROOF-BEFORE-PREFERENCE markedly improves cue invariance over baselines.

[NLP-196] SLAP: Stratified Loss-based Pruning for On-Policy Data-Efficient Instruction Tuning

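[Quick Read]: This paper targets the high computational cost of instruction tuning for large language models (LLMs), which typically depends on large datasets and long training runs. The core challenge is to identify and exploit the data with the highest learning value so that training cost falls without loss of performance. The key to the solution is SLAP, a batch-aware data selection framework built on two mechanisms: distribution-aware stratified sampling, which ensures broad coverage of the data distribution, and relative distance optimization, which maximizes intra-batch diversity. Hessian-approximated gradient information is used to adjust batch composition dynamically. According to the authors, SLAP outperforms existing state-of-the-art methods across several model architectures (LLaMA, ChatGLM) and downstream tasks (multi-turn dialogue, multilingual translation, question answering), and achieves superior performance with 20-40% less training data than full-dataset training, substantially reducing computational cost.

Link: https://arxiv.org/abs/2605.23969
Authors: Run Zou, Jianhang Ding, Yifan Ding, Wen Wu, Hao Chen, Renshu Gu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 15 pages, 10 figures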

Abstract:Instruction tuning has optimized the specialized capabilities of large language models (LLMs), but it often requires extensive datasets and prolonged training times. The challenge lies in developing specific capabilities by identifying useful data and efficiently fine-tuning. High-quality and diverse pruned data can help models achieve lossless performance at a lower cost. In this paper, we propose SLAP, a novel batch-aware data selection framework that evaluates the learnability of entire batch compositions rather than individual samples. SLAP ensures comprehensive data distribution coverage through distribution-aware stratified sampling while maximizing intra-batch diversity through relative distance optimization. By leveraging Hessian-approximated gradient information for dynamic batch selection, SLAP significantly outperforms existing state-of-the-art methods across multiple model architectures (LLaMA, ChatGLM) and diverse downstream tasks including multi-turn dialogue, multilingual translation, and question answering. Most notably, SLAP achieves superior performance with 20-40% less training data compared to full dataset training, substantially reducing computational costs while maintaining or improving model capabilities. These results establish SLAP as a powerful approach for efficient and effective instruction tuning of large language models.
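
As a rough illustration of what stratified, loss-based pruning can look like, the sketch below sorts examples by loss, cuts them into equal-sized strata, and keeps the same fraction from each stratum so that easy and hard examples both stay represented. This is a generic simplification under stated assumptions: it omits SLAP's batch-level selection, relative distance optimization, and Hessian-approximated gradients, and the function name and parameters are hypothetical.

```python
import random


def stratified_loss_sample(losses, keep_fraction, n_strata=4, seed=0):
    """Keep about keep_fraction of the indices, drawn evenly from loss-sorted strata."""
    rng = random.Random(seed)
    # Indices ordered from lowest to highest loss.
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    size = len(order)
    # Boundaries of contiguous, near-equal strata over the sorted order.
    bounds = [round(s * size / n_strata) for s in range(n_strata + 1)]
    kept = []
    for lo, hi in zip(bounds, bounds[1:]):
        stratum = order[lo:hi]
        n_keep = round(keep_fraction * len(stratum))
        kept.extend(rng.sample(stratum, n_keep))
    return sorted(kept)


# Invented per-example losses; keeping 60% corresponds to pruning 40% of the data.
example_losses = [i / 100 for i in range(100)]
kept_indices = stratified_loss_sample(example_losses, keep_fraction=0.6)
```

Sampling within strata, rather than keeping only the highest-loss examples, is one common way to avoid collapsing the retained set onto a narrow slice of the distribution.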

[NLP-197] TriVAL: A Tri-Validation Framework for Faithful Automatic Optimization Modeling

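[Quick Read]: This paper addresses the lack of explicit validation in automatic optimization modeling, which lets errors introduced in early stages carry through the pipeline and reduce final modeling accuracy. The key to the solution is TriVAL, a framework that adds explicit validation at three stages of automatic optimization modeling: semantic specification, mathematical formulation, and code generation. At each stage it follows a construct-validate-revise loop that checks the current result against stage-specific criteria and revises it when needed, which limits error propagation and helps preserve faithfulness across the whole modeling process. To evaluate harder combinatorial optimization problems, the authors also build NL4COP, a benchmark of 150 instances across 50 diverse problem types with more complex decision logic, more tightly coupled constraints, and more demanding modeling requirements. Experiments show that TriVAL consistently outperforms state-of-the-art methods on NL4COP and established benchmarks, with the largest gains on the most challenging problems.

Link: https://arxiv.org/abs/2605.23966
Authors: Ziyang Fang, JinXi Wang, Jinghui Zhong, Yew-Soon Ong
Affiliations: South China University of Technology; Agency for Science, Technology and Research; Nanyang Technological University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Combinatorics (math.CO)
Comments: 13 pages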

Abstract:Optimization modeling serves as the pivotal bridge between natural-language problem descriptions and optimization solvers, and remains a cornerstone for bringing operations research (OR) into real-world decision making. Recent advances in large language models (LLMs) have driven significant progress in automatic optimization modeling. However, existing methods still lack explicit validation during the modeling process, allowing errors introduced in earlier stages to carry through the pipeline and ultimately reduce final modeling accuracy. To address this challenge, we introduce TriVAL, a tri-validation framework that performs explicit validation at three stages of automatic optimization modeling: semantic specification, mathematical formulation, and code generation. At each stage, TriVAL follows a construct-validate-revise loop that assesses the current result against stage-specific criteria and revises it when needed. This design helps identify and correct errors before they accumulate across stages, helping preserve faithfulness throughout the modeling process. To evaluate automatic optimization modeling on more challenging combinatorial problems, we further introduce NL4COP, a benchmark of 150 instances across 50 diverse problem types with more complex decision logic, more tightly coupled constraints, and more demanding modeling requirements than existing benchmarks. Experiments on NL4COP and established benchmarks show that TriVAL consistently outperforms state-of-the-art methods, with the largest gains on the most challenging problems.
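
The construct-validate-revise loop described above can be sketched generically for a single stage. In a real system the three callables would wrap LLM calls or solver checks; here they are plain functions, and every name, the round budget, and the toy stage are hypothetical stand-ins rather than TriVAL's implementation.

```python
from typing import Callable, List, Tuple


def construct_validate_revise(
    construct: Callable[[str], str],
    validate: Callable[[str], List[str]],
    revise: Callable[[str, List[str]], str],
    stage_input: str,
    max_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Run one stage: build a draft, then validate and revise it until it passes
    or the round budget is spent. Returns the final artifact and any open issues."""
    artifact = construct(stage_input)
    issues = validate(artifact)  # an empty list means the artifact passed
    rounds = 0
    while issues and rounds < max_rounds:
        artifact = revise(artifact, issues)
        issues = validate(artifact)
        rounds += 1
    return artifact, issues


# Toy "mathematical formulation" stage with a single stage-specific criterion.
def toy_construct(description: str) -> str:
    return "maximize 3x + 2y"


def toy_validate(model: str) -> List[str]:
    return [] if "subject to" in model else ["missing constraints"]


def toy_revise(model: str, issues: List[str]) -> str:
    return model + " subject to x + y <= 4"


formulation, open_issues = construct_validate_revise(
    toy_construct, toy_validate, toy_revise, "two products share four machine hours"
)
```

Chaining three such stages, each consuming the validated output of the previous one, is the shape of pipeline the abstract describes; returning the open issues lets a caller stop early instead of passing a known-bad artifact downstream.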

[NLP-198] EchoDistill: Alignment Noisy-to-Clean Self-Distillation for Robust Audio LLMs

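[Quick Read]: This paper addresses the vulnerability of Audio Large Language Models (ALLMs) to real-world noise, which readily induces semantic drift and hallucinations. Existing methods rely mainly on waveform-level acoustic enhancement, answer-level supervision, or internal suppression of noise representations, and struggle to secure both semantic accuracy and acoustic grounding. The key to the solution is echodistill, an alignment-based noisy-to-clean self-distillation framework in which a frozen clean-audio teacher provides semantic references for a noisy-audio student. The student samples candidate responses under noisy conditions to expose its test-time behavior, and these trajectories are optimized with group-relative policy optimization (GRPO), where token-level consistency with the teacher acts as a reward bonus. By aligning the noisy student's candidates with clean semantic evidence and applying audio-aware reward shaping, the method encourages reasoning trajectories that are both correct and genuinely acoustically grounded, improving semantic reliability and task performance under complex noise without adding inference cost.

Link: https://arxiv.org/abs/2605.23954
Authors: Liang Lin, Chunxi Luo, Kaiwen Luo, Jie Zhang, Jin Wang, Yuanhe Zhang, Cai Yuchen, Qiankun Li, Gongli Xi, Zhenhong Zhou, Kun Wang, Junhao Dong
Affiliations: NTU (Nanyang Technological University); SHU; ICT, CAS; HDU (Hangzhou Dianzi University); BUPT (Beijing University of Posts and Telecommunications); USTC (University of Science and Technology of China); SKL-NST, BUPT (State Key Laboratory of Networking and Switching Technology, BUPT)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: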

Abstract:Audio Large Language Models (ALLMs) are highly vulnerable to real-world noise, which often induces severe semantic drift and hallucinations. Existing robustness methods primarily rely on waveform-level acoustic enhancement, answer-level supervision, or the internal suppression of noise representations. To address these issues, we propose echodistill, an alignment-based noisy-to-clean self-distillation framework. Echodistill leverages a frozen clean-audio teacher to provide semantic references for an inference-time noisy-audio student. Specifically, the student samples candidate responses under noisy conditions to expose its test-time behavior. These trajectories are then optimized via group-relative policy optimization (GRPO), where the token-level consistency with the teacher acts as a reward bonus. By aligning the noisy student’s candidate responses with clean semantic evidence, and applying audio-aware reward shaping, our method encourages reasoning trajectories that are both correct and genuinely acoustically grounded. Echodistill significantly improves the semantic reliability and task performance of Audio LLMs under complex noise, without introducing any additional inference costs. Extensive experiments show that: (I) Compared with the strongest baseline, echodistill achieves average improvements of 4.18% ↑ in GSR under strong noise. (II) Ablation results on Qwen-Omni further show that echodistill improves over the GRPO-only variant by 3.02% ↑ in Acc, 3.89% ↑ in Noisy, and 4.53% ↑ in GSR on average. Our codes are available at this https URL.
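
The reward structure described in the abstract, a task reward plus a bonus for token-level consistency with the clean-audio teacher, standardized within a sampled group as in GRPO, can be sketched as below. The agreement measure, the `bonus_weight`, and the example values are assumptions made for illustration; the paper's actual consistency score and reward shaping may differ.

```python
from statistics import mean, pstdev


def token_agreement(student_tokens, teacher_tokens):
    """Fraction of aligned positions where student and teacher tokens match."""
    if not student_tokens or not teacher_tokens:
        return 0.0
    length = max(len(student_tokens), len(teacher_tokens))
    matches = sum(s == t for s, t in zip(student_tokens, teacher_tokens))
    return matches / length


def group_relative_advantages(task_rewards, agreements, bonus_weight=0.5, eps=1e-8):
    """Add the agreement bonus to each task reward, then standardize within the group."""
    rewards = [r + bonus_weight * a for r, a in zip(task_rewards, agreements)]
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # GRPO-style advantage: each candidate is scored relative to its own group.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three invented candidates sampled from the noisy-audio student for one prompt.
task_rewards = [1.0, 1.0, 0.0]   # e.g. answer correctness
agreements = [1.0, 0.5, 0.0]     # consistency with the clean-audio teacher
advantages = group_relative_advantages(task_rewards, agreements)
```

With this shaping, two candidates that are equally correct are still ranked by how closely they track the clean-audio teacher, which is the signal that is meant to discourage noise-driven drift.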

[NLP-199] Machine Psychometrics: A Mathematical Psychology of Artificial Intelligence

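[Quick Read]: This paper argues that artificial agents now behave in ways rich enough to invite trust, surprise, and concern, yet evaluation tools still privilege capability scores over the measurement of psychological structure. That neglect sustains an impasse between two symmetrical errors: Artificial Mind Blindness (dismissing psychological organization in non-biological systems) and Artificial Mind Projection (inferring human-like inner life from fluent behavior alone). The key move is not to resolve the consciousness question but to introduce a disciplined measurement layer beneath it. Drawing on Michael Levin's continuum view of cognition and on methods from mathematical psychology (Item Response Theory, Signal Detection Theory, Bayesian cognitive modeling, and others), the paper proposes Machine Psychometrics, whose operational core is the Machine Mindprint: a multidimensional, domain-bounded, versioned profile spanning calibration, source integrity, suggestibility resistance, context stability, expressive alignment, tool integrity, drift monitoring, and distributional grounding. A complementary Trust Protocol turns Mindprints into deployment decisions through probe batteries, perturbation testing, reliability and validity analysis, and longitudinal monitoring. The resulting stance, Artificial Mind Discipline, neither anthropomorphizes nor dismisses, and aims to understand non-human intelligence through measurement before judgment.

Link: https://arxiv.org/abs/2605.23952
Authors: Alex Bogdan, Adrian de Valois-Franklin
Affiliations: Evolutionairy AI
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
Comments: 45 pages, 11 figures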

Abstract:Artificial agents now generate behavior rich enough to invite trust, surprise, and concern, yet our evaluation tools still privilege capability scores over psychological structure. This paper argues that the philosophical impasse between two symmetrical errors (Artificial Mind Blindness, which dismisses psychological organization in non-biological systems, and Artificial Mind Projection, which infers human-like inner life from fluent behavior alone) can be circumvented not by resolving the consciousness question, but by introducing a disciplined measurement layer beneath it. Drawing on Michael Levin’s continuum view of cognition as goal-directed competency across substrates, and on the methodological repertoire of mathematical psychology (Item Response Theory, Signal Detection Theory, Bayesian cognitive modeling, calibration analysis, cognitive-bias batteries), the paper develops Machine Psychometrics as a measurement science of latent behavioral, metacognitive, communicative, and self-modeling dispositions in artificial agents. Its operational core is the Machine Mindprint: a multidimensional, domain-bounded, versioned profile spanning calibration, source integrity, suggestibility resistance, context stability, expressive alignment, tool integrity, drift monitoring, and distributional grounding. A complementary Trust Protocol turns Mindprints into deployment decisions through probe batteries, perturbation testing, reliability and validity analysis, and longitudinal monitoring across high-stakes domains. The philosophical contribution is a third stance, Artificial Mind Discipline, that neither anthropomorphizes nor dismisses, neither presupposes consciousness nor forecloses it. The aim is not to humanize artificial agents, but to understand them precisely because they are not human, through measurement before judgment.

[NLP-200] Residual Drift Dominates Contradiction in Multi-Turn Constraint Reasoning ICLR2026

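[Quick Read]: This paper investigates how multi-turn reasoning systems fail. The expected answer is logical contradiction, where the maintained state becomes unsatisfiable; the paper finds that the dominant failure mode is instead satisfiable drift, where the internal state stays consistent while the returned answer silently violates prior commitments. The key contribution is DRIFT-Bench, a benchmark that decomposes reasoning failures into types and evaluates four methods on 816 test problems across three constraint domains. MUS-Repair, which feeds minimal unsatisfiable subsets (MUS) back to the generator, is the strongest method in every setting (+1.8 to +15.0 percentage points over the best non-MUS baseline). The central finding, however, is that after repair the systems rarely contradict themselves; they forget: 98-100% of residual errors are satisfiable drift, while contradiction drops to near zero. Reliable multi-turn systems must therefore separately validate that the returned answer respects the maintained state, rather than relying on consistency checks alone.

Link: https://arxiv.org/abs/2605.23940
Authors: Sebastien Kawada
Affiliations: kaons (kaons.com)
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Published at ICLR 2026 Workshop on Reasoning and Planning for LLMs. 18 pages. ICLR page: this https URL Code: this https URL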

Abstract:How do multi-turn reasoning systems fail? The expected answer is logical contradiction, in which the system’s maintained state becomes unsatisfiable. We show that the dominant mode is instead satisfiable drift, where the internal state stays consistent while the returned answer silently violates prior commitments. We build DRIFT-Bench (Decomposing Reasoning Into Failure Types), a solver-instrumented benchmark of 816 test problems across three constraint domains, and evaluate four methods on it across four open-weight models (8B-120B parameters). MUS-Repair, which feeds minimal unsatisfiable subsets back to the generator, is strongest in every setting (+1.8 to +15.0 pp over the best non-MUS baseline). But the central finding is what repair leaves behind. After structured feedback, models rarely contradict themselves. They forget. Residual errors are 98-100% satisfiable drift across all settings, while contradiction drops to near zero. Reliable multi-turn systems must separately validate that the returned answer respects the maintained state. Code is available at this https URL.
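
The distinction between the two failure types can be made concrete with a brute-force check over a tiny finite domain. This sketch is only an illustration of the definitions in the abstract: the benchmark itself is solver-instrumented, whereas the enumeration, the function names, and the toy constraints here are hypothetical.

```python
from itertools import product


def satisfying_assignments(variables, domain, constraints):
    """Enumerate all assignments over a small finite domain that meet every constraint."""
    for values in product(domain, repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(check(assignment) for check in constraints):
            yield assignment


def classify_turn(variables, domain, constraints, answer):
    """Label a returned answer as 'ok', 'contradiction', or 'satisfiable_drift'."""
    if next(satisfying_assignments(variables, domain, constraints), None) is None:
        return "contradiction"      # the maintained state itself is unsatisfiable
    if all(check(answer) for check in constraints):
        return "ok"
    return "satisfiable_drift"      # the state is consistent, but the answer breaks it


# Commitments accumulated over earlier turns of a toy dialogue.
variables = ["x", "y"]
domain = range(4)
state = [lambda a: a["x"] < a["y"], lambda a: a["y"] <= 2]

good = classify_turn(variables, domain, state, {"x": 1, "y": 2})
drift = classify_turn(variables, domain, state, {"x": 3, "y": 2})
broken = classify_turn(variables, domain, state + [lambda a: a["x"] > a["y"]], {"x": 1, "y": 2})
```

Note that a satisfiability check on the state alone would pass in the `drift` case; only the second check, that the returned answer satisfies the maintained constraints, catches it.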

[NLP-201] When Correct Beliefs Collapse: Epistemic Resilience of LLMs under Clinical Pressure ACL2026

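[Quick Read]: This paper addresses multi-turn sycophancy of large language models (LLMs) in clinical dialogue, where a model abandons an initially correct diagnosis under sustained pressure. The core problem is that strong results on medical benchmarks do not guarantee belief stability under escalating pressure, which makes diagnoses unreliable; the authors measure this with Med-Stress, a targeted stress-test framework. The paper proposes two complementary methods: RBED (Role-Based Epistemic Defense), a lightweight inference-time defense that uses role guidance to strengthen the model's reliance on evidence, and R-FT (Resilience-oriented Fine-Tuning), a training-time approach that internalizes evidence-based resistance to pressure. Experiments show that R-FT nearly eliminates belief change and substantially improves robustness under pressure.

Link: https://arxiv.org/abs/2605.23932
Authors: Boyu Xiao, Xiuqi Tian, Xuwen Song, Haochun Wang, Guanchun Song, Sendong Zhao, Bing Qin
Affiliations: Harbin Institute of Technology
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: ACL 2026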

Abstract:Despite strong medical benchmark accuracy, LLMs can exhibit severe multi-turn sycophancy in clinical dialogue, abandoning an initially correct diagnosis under escalating pressure. We propose Med-Stress, a targeted stress test framework that evaluates belief stability under escalating pressure. Across nine frontier large language models (LLMs), we find a clear dissociation between medical knowledge and robustness: high initial diagnostic capability does not imply high belief stability, yielding large knowledge-robustness gaps for several LLMs. To mitigate this failure mode, we propose a lightweight inference-time defense, RBED (Role-Based Epistemic Defense), and R-FT (Resilience-oriented Fine-Tuning), a training-time approach that internalizes evidence-based resistance to pressure. Experiments show that R-FT nearly eliminates belief change and substantially improves robustness.

[NLP-202] Catching The Correct Answer Trap: Characterising AI Tutor Blind Spots When Analysing Student Reasoning

【速读】: 该论文试图解决的问题是:当前智能辅导系统在提供自动化反馈时,往往仅评估学生的最终答案而忽视了推理过程,导致无法准确识别学生因错误推理却得出正确答案的情况,这种现象被称为“正确答案陷阱”(Correct Answer Trap, CAT)。解决方案的关键在于通过分析真实学生作答数据(来自Eedi数学平台),发现71%的CAT失败集中在两种具有特定结构的问题上——这些题目中,错误的推理路径恰好产生正确的数值结果。研究对比了微调后的T5模型与前沿大语言模型的表现,发现尽管先进模型提升了检测准确率(从84%提升至57%),但依然存在大量误报(约每1次真阳性对应4次假阳性),使得其在实际课堂规模下难以独立使用。因此,论文强调高整体准确率可能掩盖关键的推理评估缺陷,且人类判断仍是必要补充。

Link: https://arxiv.org/abs/2605.23925
Authors: Moiz Imran, Sahan Bulathwela
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: To be published at the International Conference on Artificial Intelligence in Education (AIED’26)

Abstract:Intelligent tutoring systems increasingly provide automated feedback on student work, but robust feedback requires assessing reasoning, not only final answers. We study a failure mode we call the correct answer trap (CAT): models under-detect misconceptions when students reach a correct answer via flawed reasoning. Analysing real student responses from the Eedi mathematics platform, we show that 71% of these failures concentrate in just two question types, both sharing a common structure where flawed reasoning happens to produce the correct numerical answer. Comparing a fine-tuned T5 with a frontier large language model, we find that improved capabilities reduce but do not eliminate the problem (84% vs 57% detection accuracy). Even the best-performing model generates roughly four false alarms for every genuine detection, making stand-alone screening impractical at realistic class sizes. Our findings demonstrate that high overall accuracy can mask critical failures in reasoning assessment, and that careful analysis of student reasoning still benefits from human judgment.

[NLP-203] Multi-Persona Debate System for Automated Scientific Hypothesis Generation

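[Quick Read]: This paper addresses the hypothesis-generation bottleneck that knowledge fragmentation creates in scientific discovery, particularly in battery materials research, where scattered knowledge must be integrated into actionable hypotheses under coupled engineering constraints such as electrochemical performance, interfacial behavior, and manufacturing feasibility. The key to the solution is the Multi-Persona Debate System (MPDS), which builds literature snapshots of up to 500 papers through literature retrieval, induces role-specific evidence pools from the corpus, runs a three-round, citation-aware, structured multi-agent debate, and ends with moderator synthesis, so that different perspectives can be negotiated and integrated while evidence stays traceable. On sodium-ion battery anode and all-solid-state battery cathode design tasks, MPDS produces hypotheses that are more mechanistically explicit and process-aware than those of simpler baselines. In ablation studies it achieves the highest mean score among five conditions, with its largest advantage in cross-perspective integration, which supports the value of structured debate for hypothesis generation under complex constraints.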

Link: https://arxiv.org/abs/2605.23917
Authors: Jaeha Oh, Byungchan Kim, Ju Li, Yang Jeong Park, Jin-Sung Park
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Notes: 31 pages with 7 main figures, 4 supplementary figures and 1 supplementary table

Abstract:Modern scientific discovery is bottlenecked not by data scarcity, but by the inability to synthesize fragmented knowledge into actionable hypotheses. This challenge is especially acute in battery materials research, where electrochemical performance, interfacial behavior, and manufacturing feasibility must be optimized simultaneously. Here, we present the Multi-Persona Debate System (MPDS), a literature-grounded framework for automated scientific hypothesis generation that combines literature retrieval, long-context large language model reasoning, corpus-driven persona induction, and structured multi-agent debate. MPDS constructs literature snapshots of up to 500 papers, grounds agents in role-specific evidence pools, and conducts a three-round citation-aware debate followed by moderator synthesis, enabling negotiation between personas while preserving evidence traceability. We evaluate MPDS using a temporally controlled protocol excluding direct access to target papers, including two held-out battery-materials case studies and a blinded comparison across 30 matched cases. In sodium-ion anode and all-solid-state battery cathode design tasks, MPDS recovered design logics aligned with experimentally validated solution spaces and generated more mechanistically explicit, process-aware proposals than simpler baselines. To assess the impact of personas and debate, we introduce Integrative Hypothesis Quality scoring. In ablation studies, MPDS achieved the highest mean score among five conditions, with its largest advantage in cross-perspective integration. A laboratory follow-up suggests utility as a diagnostic aid for identifying practical bottlenecks in workflows. These results indicate that structured debate over literature snapshots improves hypothesis formation under coupled engineering constraints and provides a reusable workflow for text-intensive scientific discovery.

[NLP-204] Can LoRA Fusion Support Cross-Domain Tasks in Cloud-Edge Collaboration?

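[Quick Read]: This paper asks how, under privacy constraints, knowledge from private domain data spread across multiple edge devices can be integrated into a cloud-hosted large language model (LLM) to support cross-domain problem solving. Existing methods rely on unrealistic assumptions (for example, that edge devices can host cloud-scale LLMs) and are evaluated mainly on single-domain tasks, so they do not reflect performance in cross-domain collaboration. The key to the solution is a prune-train-recover framework in which edge devices train LoRA adapters locally on lightweight pruned models and the adapters are then fused in the cloud in a privacy-preserving way. The paper also introduces MMLU-CD, a cross-domain benchmark for explicitly evaluating cross-domain problem solving. Experiments show that existing LoRA fusion methods perform poorly on cross-domain tasks, often worse than the base LLM, and the authors attribute this to parameter conflicts among LoRA adapters. They further propose LoRA-CR, a conflict-resolution module that mitigates conflicting updates and improves fusion performance by up to 3.8%, which identifies conflict mitigation as a critical but largely overlooked factor in cloud-edge LoRA fusion.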

Link: https://arxiv.org/abs/2605.23913
Authors: Yatong Wang, Fali Wang, Naibin Gu, Zheng Lin, Zhengxiao Liu, Dingyu Yao, Zhiwei Zhang, Jianxin Shi, Weiping Wang
Affiliations: unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL)
Notes: 16 pages, 6 figures

Abstract:Cloud-hosted large language models (LLMs) commonly rely on LoRA for domain adaptation, yet domain data are distributed across multiple edge devices and cannot be uploaded due to privacy constraints. This raises a fundamental question: how can knowledge from multiple private edges be integrated into a cloud LLM for cross-domain problem solving? A natural solution is to train LoRA adapters locally and fuse them in the cloud; however, existing pipelines rely on unrealistic assumptions that edge devices can host cloud-scale LLMs and are evaluated mainly on single-domain tasks. To address these limitations, we propose a prune-train-recover framework that enables local LoRA training on pruned models and privacy-preserving cloud integration. We further introduce MMLU-CD, a cross-domain benchmark that composes multiple domain samples into a single instance, enabling explicit evaluation of cross-domain problem solving. This allows us to ask a concrete question: Can existing LoRA fusion methods support cross-domain tasks in cloud-edge collaboration? Our empirical answer is negative. Existing LoRA fusion methods perform poorly on MMLU-CD, often underperforming the base LLM, revealing their inability to support cross-domain problem solving. We attribute this failure to parameter conflicts among LoRA adapters and propose a simple conflict-resolution module, LoRA-CR, which mitigates conflicting updates and improves LoRA fusion performance by up to 3.8%. These results identify conflict mitigation as a critical yet largely overlooked factor in cloud-edge LoRA fusion, warranting further investigation in future research.

[NLP-205] Raon-Speech Technical Report

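[Quick Read]: This paper addresses how to build a high-performing multilingual speech language model (SpeechLM) that understands and generates speech while keeping strong text capabilities, and how to extend it into a system that supports natural, real-time, full-duplex conversation. The key to the solution is twofold. First, a three-stage training pipeline (speech module alignment, end-to-end SpeechLM pre-training with knowledge distillation, and multi-task preference-optimization post-training) turns a pre-trained large language model (LLM) into Raon-Speech, which both understands and generates speech. Second, building on it with 119K hours of time-aligned real and synthetic dialogue data, three further stages (causal encoder adaptation, full-duplex pre-training, and fine-tuning for voice and role control) produce Raon-SpeechChat, which delivers high-quality full-duplex interaction and is strongest on turn-taking and interruption-sensitive behaviors.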

Link: https://arxiv.org/abs/2605.23912
Authors: Beomsoo Kim, Changho Choi, Dohyun Kim, Dongki Lee, Ethan Ewer, Eunchong Kim, Gyeongman Kim, Haechan Kim, Hyeonghwan Kim, Inkyu Park, Jihun Yun, Jihwan Moon, Jiyun Kim, Joonghyun Bae, Junhyuck Kim, Minkyu Kim, Sehun Lee, Seungjun Chung, Sungwoo Cho, Dongmin Park, Dongwon Kim, Hara Kang, Jonghyun Lee, Keon Lee, Kangwook Lee, Jaewoong Cho
Affiliations: KRAFTON
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
Notes:

Abstract:We present Raon-Speech, a top-performing 9B-parameter speech language model (SpeechLM) for English and Korean speech understanding, answering, and generation, and Raon-SpeechChat, a high-performing full-duplex extension for natural real-time conversation. Raon-Speech successfully transforms a pre-trained LLM into a SpeechLM that both understands and generates speech while preserving strong text capabilities. It trains on 1.38M hours of highly curated English and Korean speech and text datasets with the following training stages: (1) speech modules alignment, (2) end-to-end SpeechLM pre-training with knowledge distillation, and (3) multi-task preference optimization-based post-training. Across 42 English and Korean speech and text benchmarks, Raon-Speech establishes the strongest overall profile on speech-centric tasks in our comparison against eight similarly sized recent audio foundation models, including Qwen2.5-Omni and Fun-Audio-Chat, while preserving strong text question answering performance. Building upon it, Raon-SpeechChat enables natural full-duplex conversation by continual training on 119K hours of time-aligned real and synthetic dialogue data. It proceeds through three complementary training stages: (1) causal encoder adaptation, (2) full-duplex pre-training, (3) full-duplex fine-tuning for voice and role-control. On multiple full-duplex benchmarks, Raon-SpeechChat shows its clearest strengths on the turn-taking and interruption-sensitive behaviors covered by FDB v1.0, and remains competitive across the broader full-duplex evaluation suite. We open-source all model checkpoints, the training and inference pipeline, and an interactive demo.

[NLP-206] Document Classification Pattern Recognition via Information Fusion: A Systematic Review of Multimodal and Multiview Representation Approaches

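[Quick Read]: This paper addresses the lack of a unified framework, a quantitative assessment of effectiveness, and practical guidance for information fusion in document classification. The key to the solution is, first, a formal framework that structures the field, and second, a qualitative analysis that identifies key trends together with a random-effects meta-analysis (the first focused on document classification, to the authors' knowledge) that quantifies the gains from multimodal and multiview fusion. Multimodal fusion significantly improves accuracy (mean gain of +5.28 percentage points, p=0.0016), while multiview fusion gives consistent but smaller gains (accuracy +4.67%, F1-score +3.08%). The paper also finds that methodological rigour is widely lacking: only about 11.8% of multimodal and 23.3% of multiview studies use statistical tests to validate their results, which weakens the reliability of their conclusions. The review concludes that successful fusion depends not on algorithmic complexity but on the strategic match between the fusion method and the task context, and on a commitment to more rigorous validation.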

Link: https://arxiv.org/abs/2605.23910
Authors: Marcin Michał Mirończuk
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes:

Abstract:Information fusion is widely used to improve document classification by integrating multiple data sources (multimodal) or representations (multiview). However, the field lacks a unified framework, a quantitative synthesis of its effectiveness, and clear guidance for practitioners. This systematic review addresses these gaps by analysing 139 primary studies. It introduces a formal framework to structure the field, presents the results of a qualitative analysis to identify key trends, and performs a random-effects meta-analysis (to our knowledge, the first focused on document classification) to quantify performance gains. Our meta-analysis reveals that multimodal fusion significantly improves accuracy (mean gain of +5.28 percentage points, p=0.0016); the F1-score effect is directionally positive but statistically non-significant in our primary model. Multiview fusion provides consistent but modest gains for accuracy (+4.67%), F1-score (+3.08%), and recall (all p<0.05). Critically, our qualitative synthesis uncovers challenges in reproducibility and methodological rigour: only 11.8% (multimodal) and 23.3% (multiview) of the studies use statistical tests to validate their findings, which undermines the reliability of many of their results. This review’s primary contributions are a unifying framework, the first quantitative evidence base, and data-driven guidelines. This review concludes that successful information fusion depends not on algorithmic complexity, but on the strategic alignment of the fusion method with the task context and a commitment to more rigorous validation.
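
To make the term "random-effects meta-analysis" concrete, here is a minimal sketch of one common estimator, DerSimonian-Laird. The abstract does not say which estimator the review uses, so this choice is an assumption, and the function name `dersimonian_laird` and the three toy studies are purely illustrative, not data from the paper.

```python
import math


def dersimonian_laird(effects, variances):
    """Pool per-study effects with the DerSimonian-Laird random-effects estimator."""
    k = len(effects)
    if k < 2:
        raise ValueError("need at least two studies")
    w = [1.0 / v for v in variances]                       # fixed-effect weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                     # between-study variance
    w_star = [1.0 / (v + tau2) for v in variances]         # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))                      # standard error of pooled effect
    return pooled, se, tau2


# Toy accuracy gains (percentage points) and their sampling variances; made up for illustration.
gains = [2.0, 6.0, 8.0]
variances = [1.0, 1.0, 1.0]
pooled, se, tau2 = dersimonian_laird(gains, variances)
```

Because the between-study variance `tau2` is added to every study's own variance, heterogeneous studies widen the pooled standard error rather than being averaged as if they measured one fixed effect.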

[NLP-207] In Search of the Ingredients of Open-Endedness: Replicating Picbreeder with Large Vision-Language Models GECCO2026

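[Quick Read]: This paper asks whether AI agents have the open-endedness that humans show in scientific, technological, and creative production, that is, the capacity to generate a seemingly endless supply of novel and meaningful results. The key to the approach is a Picbreeder-like system in which frontier Vision Language Models (VLMs) replace human users in interactive evolutionary search, which simulates and evaluates AI-driven open-ended discovery. The study finds that the system's output differs clearly from the human baseline in evolutionary trajectory and diversity, and it studies candidate causal factors behind these differences, including exploratory noise, behavioral diversity between agents, and narrative momentum in the form of memory of past actions, providing a quantifiable framework for analysing the potential of AI in unguided innovation.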

Link: https://arxiv.org/abs/2605.23908
Authors: Sam Earle, Kay Arulkumaran, Andrew Dai, Akarsh Kumar, Julian Togelius, Sebastian Risi
Affiliations: New York University; Sakana AI; MIT
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Notes: 26 pages, 21 figures, to be published at GECCO 2026

Abstract:We are in the midst of large-scale industrial and academic efforts to automate the processes of scientific, technological and creative production through AI-driven assistants. Historically, a fundamental property of these processes in their human form has been their open-endedness: their capacity for generating a seemingly endless supply of novel and meaningful new forms. Do artificial agents have any capacity for such fruitful unguided discovery? To answer this question, we turn to Picbreeder, the canonical exemplar of human-driven open-ended search, in which users collaboratively generated a diverse library of images through interactive evolution of small neural networks. We replicate Picbreeder, replacing human users with frontier Vision Language Models (VLMs). We observe clear qualitative differences between the output of our system and the historical human baseline, and attempt to characterize them using metrics of phylogenetic complexity and visual and semantic salience and novelty. In an effort to identify some of the causal factors contributing to these differences, we study the addition of exploratory noise to the agents’ selection process, of behavioral diversity between agents, and of narrative momentum in the form of memory of past actions. We make our code available at this https URL.

[NLP-208] Check Your LLM's Secret Dictionary! Five Lines of Code Reveal What Your LLM Learned (Including What It Shouldn't Have)

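[Quick Read]: This paper asks how to identify semantically interpretable subspaces directly from the weights of a large language model (LLM), without running inference, and what those subspaces reveal about training data composition, content bias, and potential ethical risks. The key to the solution is a singular value decomposition (SVD) of the output-layer (lm_head) weight matrix, which needs only five lines of PyTorch. Each left singular vector corresponds to a vocabulary cluster: the tokens most readily selected when the hidden state aligns with that singular direction. Analysing the structure and distribution of these clusters makes it possible to quantify subspace coherence with the Vocabulary Cluster Score (VCS) and to detect glitch tokens statically with the Weighted Projection Score (WPS), which enables inference-free safety auditing. The method reveals systematic differences between models (GPT, Gemma, Qwen) in functional differentiation, language coverage, and ethical issues. The authors propose lm_head SVD analysis as a standard pre-release safety check and point to SVD-guided tokenizer optimisation and more controllable LLM design as new directions.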

Link: https://arxiv.org/abs/2605.22005
Authors: Hisashi Miyashita
Affiliations: Mgnite Inc.
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes:

Abstract:We show that singular value decomposition of the lm_head weight matrix of a transformer-based large language model – requiring only five lines of PyTorch and no model inference – reveals interpretable semantic subspaces directly from the model weights. Each left singular vector identifies the vocabulary tokens most readily selected when the hidden state aligns with the corresponding singular direction; inspecting these clusters exposes the model’s training data composition and curation philosophy. Analysing GPT-OSS-120B, Gemma-2-2B, and Qwen2.5-1.5B, we find that singular value spectra and vocabulary cluster structures differ systematically across models: GPT exhibits a graduated hierarchy of functionally differentiated subspaces; Gemma is dominated by pre-nineteenth-century English orthography, forming a stepwise clustering structure that may contribute to high output controllability; and Qwen exhibits broad multilingual coverage alongside subspaces whose vocabulary the authors have determined to be ethically inappropriate for direct publication. Base-instruct comparison reveals that ethically concerning subspaces originate in pretraining and are not removed by post-training alignment. We introduce the Vocabulary Cluster Score (VCS) to quantify subspace coherence, and the Weighted Projection Score (WPS) as a static glitch token detector; applying WPS to GPT-OSS-120B recovers shokubutsu-hyakka-tsu (ID 137606), a well-known glitch token widely reported in the CJK language community, without any model inference. We propose a taxonomy of root causes for problematic vocabulary content and call for lm_head SVD analysis to be adopted as a standard pre-release safety auditing step. Our findings further suggest directions toward SVD-guided tokenizer optimisation and more controllable LLM design. 
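
As a rough illustration of the analysis the abstract describes, the sketch below runs an SVD on a weight matrix and lists the tokens with the largest loadings on each left singular vector. It uses NumPy rather than PyTorch and a random matrix in place of a real `lm_head`; the six-word vocabulary and the helper `top_tokens` are hypothetical, and ranking tokens by absolute loading is an assumption rather than the paper's stated criterion.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "of", "cat", "dog", "run", "walk"]   # toy vocabulary (hypothetical)
hidden_dim = 4
# Stand-in for a real lm_head weight of shape (vocab_size, hidden_dim).
W = rng.normal(size=(len(vocab), hidden_dim))

# Thin SVD: U is (vocab_size, r), S is (r,), Vt is (r, hidden_dim), with r = min(shape).
U, S, Vt = np.linalg.svd(W, full_matrices=False)


def top_tokens(component, k=3):
    """Tokens with the largest-magnitude loading on one left singular vector."""
    order = np.argsort(-np.abs(U[:, component]))[:k]
    return [vocab[i] for i in order]


# One token cluster per singular direction, strongest direction first.
clusters = [top_tokens(c) for c in range(len(S))]
```

With a real model, `W` would be the output-projection weight loaded from a checkpoint, and the clusters would be inspected by hand, as the abstract describes.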

[NLP-209] Tokenizer Fertility and Zero-Shot Performance of Foundation Models on Ukrainian Legal Text: A Comparative Study

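[Quick Read]: This paper addresses the uncontrolled cost that differences in tokenizer efficiency create when foundation models process Ukrainian legal text, a dimension that current model selection practice ignores. The key points are as follows. First, a systematic benchmark shows large differences in token consumption on identical input (for example, Qwen 3 consumes 60% more tokens than Llama-family models), so tokenizer analysis should be a prerequisite for deployment. Second, model scale is not a reliable indicator of domain performance (NVIDIA Nemotron Super 3 outperforms Mistral Large 3 with fewer parameters and lower API cost), and few-shot prompting degrades performance by up to 26 percentage points, while zero-shot is more robust for morphologically rich Ukrainian. Finally, a cross-temporal generalization experiment shows a pronounced asymmetry in transfer between pre-war and wartime legal text, which highlights the difficulty of modelling legal language tied to a specific historical context.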

Link: https://arxiv.org/abs/2605.14890
Authors: Volodymyr Ovcharov
Affiliations: LEX AI Platform; legal.org.ua
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 25 pages, 13 tables, 5 figures; v2 adds cross-temporal generalization experiment and classical baseline

Abstract:Tokenizer fertility varies 1.6x across foundation models on Ukrainian legal text, yet this cost-critical dimension is absent from model selection practice. We benchmark seven models from five providers on 273 validated court decisions from Ukraine’s state registry (EDRSR), measuring tokenizer fertility and zero-shot performance on three tasks. Four findings emerge. (1) Qwen 3 models consume 60% more tokens than Llama-family models on identical input, making tokenizer analysis a prerequisite for cost-efficient deployment. (2) NVIDIA Nemotron Super 3 (120B) achieves the highest composite score (83.1), outperforming Mistral Large 3 (5.6x more total parameters) at one-third the API cost: model scale is a poor proxy for domain performance. (3) Few-shot prompting degrades performance by up to 26 percentage points; stratified and prompt-sensitivity ablations confirm this is intrinsic to Ukrainian-language demonstrations, not an artifact of example selection. (4) A cross-temporal generalization experiment reveals that classifiers trained on pre-war court decisions (2008-2013) lose 27.9 percentage points when applied to full-scale invasion era decisions (2022-2026), with a pronounced forward-backward asymmetry: newer models transfer backward (+14.6 pp above forward transfer), but older models fail catastrophically on wartime legal language. For practitioners: tokenizer analysis should precede model selection, and zero-shot is a more reliable default than few-shot for morphologically rich languages. To support reproducibility and address the absence of Ukrainian from legal NLP benchmarks, we release a public dataset of 14,452 court decisions spanning 2008-2026, annotated with seven outcome labels across three temporal epochs that capture the impact of armed conflict on judicial proceedings.
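
The cost argument can be made concrete with a small calculation. The sketch below assumes the common definition of fertility as tokens per whitespace-delimited word; the paper's exact definition may differ, and the function names and token counts are hypothetical rather than measurements from the study.

```python
def fertility(token_count, word_count):
    """Tokens emitted per whitespace-delimited word (lower means cheaper input)."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return token_count / word_count


def relative_cost(fertility_a, fertility_b):
    """How many times more tokens tokenizer A produces than tokenizer B on the same text."""
    return fertility_a / fertility_b


# Hypothetical counts for one 1,000-word court decision (not figures from the paper).
words = 1000
tokens_model_a = 4000   # a tokenizer with weaker Ukrainian coverage
tokens_model_b = 2500   # a tokenizer with better Ukrainian coverage

f_a = fertility(tokens_model_a, words)
f_b = fertility(tokens_model_b, words)
ratio = relative_cost(f_a, f_b)
```

Under per-token pricing, a fertility ratio of 1.6 means model A's input bill is 60% higher for identical text before any difference in price per token is considered.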

[NLP-210] Adaptive Preference Optimization with Uncertainty-aware Utility Anchor EMNLP2025

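[Quick Read]: This paper addresses several key limitations that existing offline preference optimization methods such as DPO face in practice, including a strong dependence on pairwise training data, model distribution shift, and the assumption of human rationality. The key to the solution is a general framework, Adaptive Preference Optimization with Utility Anchor (UAPO), which introduces an anchoring function to estimate the uncertainty introduced by preference data annotation. This allows training without strictly paired data and significantly improves data utilization efficiency and training robustness. Experiments show that UAPO achieves competitive performance without relying on data pairing, which opens a path to more flexible and efficient preference optimization.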

Link: https://arxiv.org/abs/2509.10515
Authors: Xiaobo Wang, Zixia Jia, Jiaqi Li, Qi Liu, Zilong Zheng
Affiliations: University of Science and Technology of China; Hefei Comprehensive National Science Center; BIGAI
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Notes: Accepted by EMNLP 2025 Findings

Abstract:Offline preference optimization methods are efficient for aligning large language models (LLMs). Direct Preference Optimization (DPO)-like learning, one of the most popular approaches, stands out for its efficiency in reward modeling. However, these methods typically follow the convention of using Bradley-Terry (BT) reward modeling, which comes with several critical assumptions and limitations, including the requirement for pairwise training data, model distribution shift, and the human rationality assumption. To address these limitations, we propose a general framework for offline preference optimization methods, Adaptive Preference Optimization with Utility Anchor (UAPO), which introduces an anchoring function to estimate the uncertainties arising from preference data annotation. Our method enables training even in scenarios where the data is unpaired, significantly enhancing data utilization efficiency. Moreover, the anchor design makes UAPO more robust in the training process. Experimental results demonstrate that UAPO achieves competitive outcomes without the strict dependency on data pairing, paving the way for more flexible and effective preference optimization methods.
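
For context on the pairwise dependency that UAPO is meant to relax, the sketch below computes the standard DPO loss for a single (chosen, rejected) pair. This is the well-known Bradley-Terry-based DPO objective, not UAPO's anchored objective, which the abstract does not specify; the function name `dpo_loss`, the default `beta=0.1`, and the log-probability values are illustrative assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair under the Bradley-Terry model."""
    # Implicit reward margin: how much more the policy prefers the chosen response
    # than the reference model does, relative to the rejected response.
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Sequence log-probabilities below are made up for illustration.
balanced = dpo_loss(-5.0, -5.0, -5.0, -5.0)   # policy equals reference
better = dpo_loss(-4.0, -6.0, -5.0, -5.0)     # policy favours the chosen response
worse = dpo_loss(-6.0, -4.0, -5.0, -5.0)      # policy favours the rejected response
```

Every term in the loss needs both a chosen and a rejected response to the same prompt, which is the pairing requirement the abstract lists among the limitations of BT-style reward modeling.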

[NLP-211] Efficient Benchmarking Is Just Feature Selection and Multiple Regression

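[Quick Read]: This paper addresses the high computational cost of benchmarking large language models (LLMs), that is, how to predict a full benchmark score efficiently from only a small subset of the benchmark's questions. The key to the solution is to frame the task as multiple regression with feature selection and to make two core improvements. The first is kernel ridge regression at the prediction stage, which markedly lowers prediction error (MAE and RMSE) compared with existing techniques. The second is minimum redundancy maximum relevance (mRMR), an information-theoretic feature-selection algorithm that picks the question subset most useful for prediction and further improves prediction accuracy and ranking correlation (Spearman ρ and Kendall τ). In addition, mRMR is faster and more stable than alternatives that rely on probabilistic models or clustering algorithms, and it is more reproducible across random seeds and training data splits.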

Link: https://arxiv.org/abs/2605.25773
Authors: Sam Bowyer, Acyr Locatelli, Kris Cao
Affiliations: Cohere; University of Bristol
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: 36 pages, 27 figures

Abstract:Efficient benchmarking techniques aim to lower the computational cost of evaluating LLMs by predicting full benchmark scores using only a subset of a benchmark’s questions. By reframing this problem as an instance of multiple regression with feature selection, we find that existing efficient benchmarking methods can be greatly improved by simply using kernel ridge regression at the prediction stage. Additionally, using an information-theoretic feature-selection algorithm called minimum redundancy maximum relevance (mRMR), we can further improve upon these methods by selecting question subsets that will be maximally useful for prediction. Except in very data-poor settings, these approaches consistently achieve smaller prediction errors (in both MAE and RMSE), and greater ranking correlation between predicted and true scores (in both Spearman ρ and Kendall τ) across a range of benchmarks using both binary and continuous metrics. Furthermore, mRMR subsampling is much faster than competitor methods (which often involve fitting probabilistic models or running clustering algorithms), and is more likely to select the same questions under different random seeds or training data splits. Tutorial code can be found at this https URL.
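
The two ingredients named in the abstract can be sketched on toy data as follows. This is not the authors' tutorial code: it assumes NumPy, uses absolute Pearson correlation as a cheap stand-in for the mutual-information terms in mRMR, and fits kernel ridge regression in closed form with an RBF kernel. The function names, hyperparameters, and synthetic scores are all illustrative.

```python
import numpy as np


def mrmr_select(X, y, n_select):
    """Greedy mRMR: maximise relevance to y minus mean redundancy with chosen features."""
    n_features = X.shape[1]
    relevance = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)])
    redundancy = np.abs(np.corrcoef(X, rowvar=False))     # feature-feature |correlation|
    selected = [int(np.argmax(relevance))]
    while len(selected) < n_select:
        remaining = [j for j in range(n_features) if j not in selected]
        scores = [relevance[j] - redundancy[j, selected].mean() for j in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return selected


def rbf_kernel(A, B, gamma):
    """Pairwise RBF kernel exp(-gamma * ||a - b||^2) between rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)


def krr_fit_predict(X_train, y_train, X_test, alpha=1e-2, gamma=0.5):
    """Kernel ridge regression: solve (K + alpha * I) c = y, then predict with K_test @ c."""
    K = rbf_kernel(X_train, X_train, gamma)
    coef = np.linalg.solve(K + alpha * np.eye(len(X_train)), y_train)
    return rbf_kernel(X_test, X_train, gamma) @ coef


# Toy data: rows are models, columns are per-question scores (synthetic).
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 5))
X[:, 1] = X[:, 0]            # question 1 duplicates question 0, so it is fully redundant
y = X.mean(axis=1)           # the "full benchmark score" to be predicted

subset = mrmr_select(X, y, n_select=2)
train, test = slice(0, 150), slice(150, None)
pred = krr_fit_predict(X[train][:, subset], y[train], X[test][:, subset])
mae = float(np.abs(pred - y[test]).mean())
```

The duplicated question shows the point of the redundancy term: once question 0 is chosen, question 1 adds no new information, so the greedy step prefers a less correlated question even though question 1 is just as relevant on its own.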

[NLP-212] Spiking the training data to correct for test set contamination

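[Quick Read]: This paper addresses the distortion of model evaluation caused by test set contamination, and in particular the fact that existing work focuses on detecting contamination while largely ignoring how to correct contaminated test scores. The key to the solution is spiking: some test examples are deliberately added to the training data at known rates, and these known-contaminated examples are used to calibrate a memorization predictor, which in turn enables a principled statistical correction of test scores. The authors present a simulation framework based on the Hubble models, which come in minimal pairs of a perturbed and a standard model that serve as counterfactual controls for comparing correction estimators. The results show that estimators combining memorization and correctness predictions clearly outperform the uncorrected baseline, and that simple Platt-scaled membership inference metrics already give a useful correction signal. The experiments also show that calibration needs no more than 10 examples and that the predictors often transfer across datasets, which suggests the method is feasible and efficient in practice.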

Link: https://arxiv.org/abs/2605.24818
Authors: Johnny Tian-Zheng Wei, Jerry Li, Ameya Godbole, Robin Jia
Affiliations: University of Southern California
Categories: Methodology (stat.ME); Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes:

Abstract:The literature on test set contamination largely focuses on detection, but the correction of contaminated test scores is underexplored. Our core proposal is to spike the training data by intentionally contaminating some test examples at known rates. The spiked examples can then be used to calibrate predictors of model memorization which enable principled statistical correction of inflated test scores. To evaluate different correction estimators, we first present a simulation framework based on the Hubble models. Hubble models come in minimal pairs, where the perturbed model was deliberately contaminated with several test sets, while the standard model was not, serving as the counterfactual and correction target. We consider estimators that use information from a memorization predictor, correctness predictor, or both. In simulation, we establish basic statistical intuitions and show that estimators leveraging memorization and correctness information are better than naive estimation which makes no correction at all. We then instantiate several memorization and correctness predictors, and find that simple predictors such as Platt-scaled membership inference metrics provide good signal for correction. Finally, we examine the practical considerations of spiking. Simple memorization predictors need no more than 10 examples for calibration and often transfer from one dataset to another. Taken together, spiking is a promising solution for test set contamination.

Information Retrieval

[IR-0] SemBridge: Language Transfer in Sparse Encoders via Multilingual Semantic Bridges

Link: https://arxiv.org/abs/2605.26002
Authors: Seongtae Hong, Youngjoon Jang, Jia-Heui Ju, Hyeonseok Moon, Heuiseok Lim
Categories: Information Retrieval (cs.IR)
Notes: preprint

Abstract:Sparse encoders offer high-precision retrieval by representing term importance within a vocabulary space, yet their English-centric structures pose a critical impediment to language transfer for non-English languages. To overcome this structural limitation, we propose SemBridge, a novel embedding initialization method designed for cross-lingual adaptation in sparse encoders by leveraging multilingual bridge models. SemBridge establishes semantic alignments between source and target vocabularies using multilingual dense embeddings as a bridge. Rather than directly relying on all source tokens, SemBridge selects a small set of semantically related source-language tokens and uses them to initialize each target-language token, effectively filtering out semantic noise and reconstructing target tokens as precise linear combinations of core synonyms. This accelerates convergence during fine-tuning and improves training efficiency. Extensive experiments across five languages and four sparse architectures demonstrate that SemBridge achieves superior zero-shot retrieval performance and consistently improves retrieval performance after fine-tuning compared to existing baselines. These results validate SemBridge as a practical solution for deploying high-performance sparse retrieval systems in diverse linguistic environments.
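
The initialization idea in the abstract (pick a few semantically related source tokens through a bridge model, then build each target token as a linear combination of them) can be sketched as follows. This is only an illustration under stated assumptions: NumPy, cosine similarity in the bridge space, and softmax weights over the top-k neighbours are choices made here, not details confirmed by the abstract, and `init_target_embeddings` is a hypothetical name.

```python
import numpy as np


def init_target_embeddings(src_emb, src_bridge, tgt_bridge, k=2):
    """Initialise each target token as a weighted mix of its k nearest source tokens.

    src_emb:    (n_src, d) trained source-token embeddings of the sparse encoder
    src_bridge: (n_src, b) source tokens embedded by a multilingual bridge model
    tgt_bridge: (n_tgt, b) target tokens embedded by the same bridge model
    """
    # Cosine similarity between target and source tokens in the bridge space.
    s = src_bridge / np.linalg.norm(src_bridge, axis=1, keepdims=True)
    t = tgt_bridge / np.linalg.norm(tgt_bridge, axis=1, keepdims=True)
    sim = t @ s.T                                    # (n_tgt, n_src)

    top = np.argsort(-sim, axis=1)[:, :k]            # k most similar source tokens
    top_sim = np.take_along_axis(sim, top, axis=1)   # (n_tgt, k)
    # Softmax weights over the selected neighbours (an assumed weighting scheme).
    w = np.exp(top_sim - top_sim.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return np.einsum("tk,tkd->td", w, src_emb[top])  # (n_tgt, d)


# Toy example: three source tokens, one target token aligned with source token 0.
src_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
src_bridge = np.eye(3)
tgt_bridge = np.array([[1.0, 0.0, 0.0]])

nearest_only = init_target_embeddings(src_emb, src_bridge, tgt_bridge, k=1)
mixed = init_target_embeddings(src_emb, src_bridge, tgt_bridge, k=2)
```

Restricting the mix to the top-k neighbours is what filters out unrelated source tokens; using every source token would blur each target embedding toward the vocabulary mean.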

[IR-1] DeGRe: Dense-supervised Generative Reranking for Recommendation KDD2026

Link: https://arxiv.org/abs/2605.25749
Authors: Chaotian Song, Jingyao Zhang, Chenghao Chen, Zisen Sang, Dehai Zhao, Guodong Cao, Boxi Wu, Deng Cai, Jia Jia
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Accepted to KDD 2026 (ADS Track)

Abstract:In multi-stage recommender systems, reranking optimizes overall utility by capturing intra-list contextual dependencies, yet its central challenge lies in exploring optimal sequences within an exponentially large permutation space. Recent studies have shifted towards end-to-end generative frameworks, which typically leverage list-wise rewards or preference alignment to guide generator training. However, these methods still face two critical issues. First is the heuristic label bias. Existing methods often construct training targets based on simple rules, such as promoting clicked items to the top, while ignoring causal dependencies within the list context. Second is the credit assignment problem. Sparse list-level posterior rewards fail to directly guide intermediate steps in sequence generation, leading to ambiguous optimization directions. To address these issues, we propose DeGRe (Dense-supervised Generative Reranking), a generative reranking framework that bridges the gap between offline exploration and online efficiency through dense supervision. The core of DeGRe lies in its offline-online decoupled design. During the offline phase, we introduce a Lookahead Evaluator based on cumulative regression, which leverages beam search to actively mine high-value lookahead sequences in the unexposed space. During training, we transform the step-wise value estimations from the evaluator into dense supervision signals and distill them into a lightweight Online Generator. This mechanism enables the generator to internalize lookahead planning capabilities, requiring only a single efficient greedy decoding pass during online inference to approximate the global optimum. Experiments demonstrate that DeGRe outperforms baseline models on public benchmarks and industrial datasets. We have successfully deployed DeGRe on Taobao Flash Shopping, significantly improving online recommendations. 
Related DOI: https://doi.org/10.1145/3770855.3818363

[IR-2] SIREN: Unified Multi-Granularity Semantic Interaction for Multi-Modal Lifelong User Interest Modeling

Link: https://arxiv.org/abs/2605.25726
Authors: Yaqian Zhang, Ruyi Yu, Tianyi Li, Bohan Liu, Maoquan Ye, Ke Wang, Shifeng Wen, Junwei Pan, Lijie Wang, Qi Zhou, Yeshou Cai, Chengguo Yin, Lifeng Wang, Hui Li, Lei Xiao, Haijie Gu
Categories: Information Retrieval (cs.IR)
Notes:

Abstract:Industrial recommender systems increasingly leverage lifelong user behavior histories and rich multi-modal content to capture evolving user preferences. However, effectively integrating multi-modal features into lifelong interest modeling remains challenging due to the inherent misalignment between multi-modal and collaborative spaces. Existing paradigms typically rely on separate modeling of multi-modal sequence and behavior sequence, and late fusion to alleviate the modality gap, which results in coarse-grained multi-modal representation and limited integration. In this paper, we propose SIREN, a unified multi-granularity semantic interaction framework for multi-modal lifelong user interest modeling. In the General Search Unit stage, we introduce two alternative retrieval strategies: multi-modal similarity-based soft retrieval for retrieval effectiveness, and Semantic ID (SemID)-based hard retrieval for efficient industrial serving. For the Exact Search Unit stage, we explicitly incorporate target-aware relevance via coarse similarity buckets and fine-grained prefix-encoded SemIDs, enabling unified interaction with collaborative ID features within the target-conditioned transformer architecture. Extensive experiments on the offline dataset demonstrate that SIREN achieves a state-of-the-art GAUC. Online A/B tests further demonstrate consistent GMV gains across multiple production scenarios, including +2.28% in Weixin Moments, +3.87% in Weixin Official Accounts, and +1.61% in Weixin Channels. From July 2025, SIREN has been fully launched for full-traffic serving in Tencent’s advertising platform.

[IR-3] Neural Router: Semantic Content Matching for Agentic AI

Link: https://arxiv.org/abs/2605.25701
Authors: Lauri Lovén, Abhishek Kumar, Alexander Engelhardt, Alaa Saleh, Roberto Morabito, Xiaoli Liu, Naser Hossein Motlagh, Sasu Tarkoma
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL); Information Retrieval (cs.IR); Networking and Internet Architecture (cs.NI)
Notes: 35 pages, 12 figures. Combined main paper and electronic supplement, folded into one document for arXiv

Abstract:Large language models (LLMs) can serve as the semantic-matching engine of a content-based publish/subscribe broker for agentic AI across the edge-cloud computing continuum, bridging the vocabulary and modality gaps that defeat keyword and embedding filters. Framed as offline multi-label retrieval over three public datasets spanning social-media, legal, and smart-home sensor domains (six LLMs, seven baselines), our central contribution is a two-crossover cost-accuracy characterisation: an analytical context-window crossover below which a CoverAndMerge compression pipeline reduces LLM invocations, and an empirical discrimination-capacity crossover above which matching accuracy collapses independently of context budget, by a model-dependent factor of parameter count and training generation. Two findings carry practical weight: above the discrimination crossover, compression cannot recover accuracy and only frontier-scale models clear large subscription sets; and there backend choice dominates configuration choice, so model selection, not pipeline tuning, is the primary operator lever. We accompany this with three composable algorithms and a per-cluster Quality-of-Experience framework for autonomic LLM-tier selection.

[IR-4] GCIB: Graph Contrastive Information Bottleneck for Multi-Behavior Recommendation ICML2026

Link: https://arxiv.org/abs/2605.25690
Authors: Likang Wu, Zihao Chen, Jianxin Zhang, Sangqi Zhu, Yuanyuan Ge, Haipeng Yang, Lei Zhang
Subjects: Information Retrieval (cs.IR)
Comments: Accepted at ICML 2026. Camera-ready version

Click to view abstract

Abstract:With the rapid emergence of multi-behavior learning in recommender systems, leveraging auxiliary user behaviors has proven effective for mitigating target-behavior data sparsity. Yet auxiliary behavior graphs frequently contain noisy or irrelevant interactions that do not align with the target task, impeding the learning of accurate user and item embeddings. Moreover, the scarcity of direct supervision from the target behavior complicates the extraction of informative collaborative signals. In this paper, we introduce GCIB (Graph Contrastive Information Bottleneck), a novel framework that denoises auxiliary behavior information and enriches target behavior representations at both the structural and feature levels. At the structural level, GCIB employs a Graph Information Bottleneck (GIB) objective to maximize mutual information between the denoised auxiliary graph and the target-behavior graph while minimizing mutual information with the original auxiliary graph. This formulation preserves task-relevant structural patterns and suppresses spurious interactions. At the feature level, we propose a cross-behavior Graph Contrastive Learning (GCL) scheme in which denoised auxiliary features and target-behavior features serve as complementary views for both users and items. By contrasting these views, GCIB enriches sparse target-behavior representations with semantics distilled from auxiliary behaviors. Extensive experiments demonstrate that GCIB outperforms state-of-the-art baselines, highlighting its ability to learn noise-resilient and target-aware representations for multi-behavior recommendation.
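The structural-level objective described in this abstract can be written, as a rough sketch only, in the usual information-bottleneck form. The symbols are assumptions introduced here and do not come from the paper: $G_a$ is the original auxiliary graph, $\tilde{G}_a$ its denoised version, $G_t$ the target-behavior graph, and $\beta > 0$ a trade-off weight.

```latex
\max_{\tilde{G}_a} \; I(\tilde{G}_a ; G_t) \;-\; \beta \, I(\tilde{G}_a ; G_a)
```

The first term keeps structure that is predictive of the target behavior; the second discourages the denoised graph from simply copying the original auxiliary graph. The paper's exact formulation and estimators may differ.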

[IR-5] LENS: A Staged Design for Interaction Granularity in Sequential CTR Prediction

Link: https://arxiv.org/abs/2605.25583
Authors: Yuan Wang, Yue Liu, Jun Zhang, Jie Jiang
Subjects: Information Retrieval (cs.IR)
Comments: 15 pages, 9 figures, 9 tables

Click to view abstract

Abstract:In sequential CTR prediction, a central design question is at what granularity the target should interact with the user behaviour sequence. Existing models mainly follow two routes. Raw-item architectures such as DIN let the target score each item in the sequence directly. This relies on well-trained item embeddings and becomes brittle for sparse items. Latent-query architectures such as HyFormer, MixFormer, and OneTrans build query representations by combining the target with other information. This is more robust across item-density regimes but blunter: target-specific control is diluted. We propose LENS to restore target-specific control within these coarser bottlenecks. LENS has two modules: a Target-Conditioned Query Gate (TCQG) for query activation and a Target-Conditioned Position Bias (TCPB) for history retrieval. We further introduce Query-Specific Position Bias (QueryPos), a simple static position-aware reference for latent-query backbones. Across three representative latent-query backbones and four datasets, the combined QueryPos+LENS design achieves positive total-gain point estimates in all twelve evaluated backbone–dataset cells. We also identify a density-dependent conditioning rule: as item density decreases, the optimal condition source shifts from item-only to item-plus-sequence.

[IR-6] From Item-Only to Query-Item: Query-Conditioned Generative Search with QGS in Quark

Link: https://arxiv.org/abs/2605.25514
Authors: Yanglong Song, Zihao Yang, Shuo Meng, Rujun Guo, Jin Zhang, Bin Wang, Shaoyu Liu, Xiaozhao Wang, Guanjun Jiang
Subjects: Information Retrieval (cs.IR)
Comments: 11 pages, 5 figures, 9 tables

Click to view abstract

Abstract:Generative sequence models have shown strong results in recommendation. Applying them to search ranking is more challenging. Search behavior is inherently query-driven. Each query switch introduces a sharp topic shift in the user’s interaction history. Existing generative methods flatten queries and items into a single token sequence. They do not distinguish query boundaries. This causes the model to mix different query intents into one prediction target, resulting in noisy supervision. We present Query-Conditioned Generative Search (QGS). QGS encodes each interaction as a (query, item) pair token. It trains with a query-conditioned next-item objective. The prediction target changes from a noisy marginal $P(\text{item}_{t+1} \mid \text{context}_{\le t})$ to a clean conditional $P(\text{item}_{t+1} \mid \text{context}_{\le t}, \text{query}_{t+1})$. This directly removes the semantic discontinuity caused by query switches. Encoding long interaction histories with standard attention has quadratic cost. This is impractical under strict online latency budgets. We introduce a Linear HSTU encoder. It replaces full attention with causal linear recurrence. Per-layer complexity drops from $O(L^2)$ to $O(L)$ with no loss in ranking quality. Traditional search ranking depends on hand-crafted features like text-matching scores, statistical signals, and behavioral features. We propose HFG-Attention to preserve them in the generative framework. It organizes heterogeneous features into semantic groups and fuses them through a dedicated attention block. This bridges sparse engineered signals with dense sequential representations. QGS is deployed in the ranking module of Quark Search, a major commercial search engine in China. Online A/B tests show statistically significant gains: +0.62% CTR, +0.38% Click-Search Ratio, and +3.55% PV Duration over the production deep learning baseline.
ACM classes: H.3.3. DOI: https://doi.org/10.48550/arXiv.2605.25514. Submitted: Mon, 25 May 2026 07:18:51 UTC (v1).

[IR-7] RAG-Match: Retrieval-Augmented Knowledge Injection and Hierarchical Reasoning for Calibrated Semantic Relevance

Link: https://arxiv.org/abs/2605.25486
Authors: Hengjun Jiang, Liansheng Sun, Yan Jiang, Xiaojie Ke, Yongjin Wang, Xiangkun Liu, Cunxin Gu, Jian Xu, Guanjun Jiang
Subjects: Information Retrieval (cs.IR)
Comments: 17 pages, 1 figure, 5 tables

Click to view abstract

Abstract:Semantic relevance judgment for search is particularly challenging in knowledge-intensive scenarios, where accurate ranking requires not only semantic matching but also background grounding, multi-step reasoning, and well-calibrated decision boundaries. Existing relevance models mainly rely on direct label supervision or shallow semantic similarity, which limits their ability to handle implicit intent, factual equivalence, and fine-grained relevance distinctions. To address this issue, we propose RAG-Match, a three-stage framework that integrates knowledge-augmented pretraining, hierarchical reasoning alignment, and preference-based decision calibration for relevance modeling. The key idea is to first strengthen query-centered semantic grounding, then align the model with structured relevance reasoning, and finally correct decision-level inconsistencies in difficult boundary cases. Experimental results on a real-world search relevance benchmark show that RAG-Match consistently outperforms strong LLM-based baselines across multiple ranking metrics, demonstrating the effectiveness of combining knowledge injection, reasoning supervision, and preference optimization for fine-grained relevance judgment.

[IR-8] How Reliable Are Semantic-ID Tokenizer Comparisons in Generative Recommendation?

Link: https://arxiv.org/abs/2605.25330
Authors: Qian Zhang, Lech Szymanski, Haibo Zhang, Jeremiah D. Deng
Subjects: Information Retrieval (cs.IR)
Comments: 12 pages, 5 figures

Click to view abstract

Abstract:In Semantic-ID (SID) based generative recommendation, each item is represented as a sequence of discrete codes, and an autoregressive model is trained to generate the SID sequence of the next item; top-K performance is then measured by checking whether the SID sequence of the target item appears among the generated sequences. This evaluation protocol equates SID-level matching with item-level recommendation, an equivalence that holds only when every SID sequence maps to a single item. We show this assumption breaks down in practice: because tokenizers compress item features into a code space, semantically similar but collaboratively distinct items are frequently assigned the same SID sequence. Across four datasets and five representative tokenizers, the fraction of items involved in such collisions reaches 30.5%, so matching a shared SID sequence identifies only a collision group rather than the target item. Consequently, SID-level metrics overestimate item-level performance (Hit@10 is inflated by up to 103.36%), and the inflation grows with the collision rate. To support faithful comparison, we develop collision-aware item-level metrics computed directly from generated SID sequences, together with a post-tokenizer procedure that reassigns last-level SIDs at minimum cost to obtain a collision-free assignment for any existing tokenizer. Our results indicate that SID-level rankings in prior work should be interpreted with caution, and that reliable tokenizer evaluation requires either item-level correction or collision-free SID assignments.
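The collision effect described in this abstract can be illustrated with a small sketch. It computes the fraction of items that share a SID sequence with at least one other item, then contrasts a SID-level hit with an item-level hit. The `1 / group size` credit is an assumption made here for illustration (a uniform pick within the collision group), not the paper's collision-aware metric, and the toy catalogue is hypothetical.

```python
from collections import defaultdict

def collision_stats(item_to_sid):
    """Return (fraction of items sharing a SID with another item, SID -> items map)."""
    groups = defaultdict(list)
    for item, sid in item_to_sid.items():
        groups[tuple(sid)].append(item)
    collided = sum(len(members) for members in groups.values() if len(members) > 1)
    return collided / len(item_to_sid), dict(groups)

def hit_at_k(generated_sids, target_item, item_to_sid, groups, item_level=True):
    """SID-level hit: 1.0 if the target's SID was generated.
    Item-level hit (assumed form): credit 1/|collision group| instead."""
    target_sid = tuple(item_to_sid[target_item])
    if target_sid not in {tuple(s) for s in generated_sids}:
        return 0.0
    return 1.0 / len(groups[target_sid]) if item_level else 1.0

# Hypothetical toy catalogue: items "a" and "b" collide on the same SID.
item_to_sid = {"a": (1, 2, 3), "b": (1, 2, 3), "c": (1, 2, 4), "d": (5, 0, 1)}
rate, groups = collision_stats(item_to_sid)
sid_hit = hit_at_k([(1, 2, 3)], "a", item_to_sid, groups, item_level=False)
item_hit = hit_at_k([(1, 2, 3)], "a", item_to_sid, groups, item_level=True)
print(rate, sid_hit, item_hit)
```

In this toy case the SID-level metric reports a full hit although the generated SID cannot tell item "a" from item "b", which is the kind of inflation the abstract describes.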

[IR-9] First do no harm: Breaking suicidogenic echo chambers in media recommendation

Link: https://arxiv.org/abs/2605.25258
Authors: Alberto Díaz-Álvarez, Raúl Lara-Cabrera, Fernando Ortega-Requena, Víctor Ramos-Osuna
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures. Research on safety-aware recommender systems and algorithmic ethics

Click to view abstract

Abstract:Recommender systems generally optimise user engagement, but this approach is dangerous in mental health contexts. When vulnerable users show signs of suicidal ideation, standard algorithms often trap them in echo chambers of harmful content, worsening their psychological state. In response, we introduce RankAid, a re-ranking method that prioritises clinical safety alongside predictive relevance. It works as an add-on layer to existing models: it penalises risky items and boosts therapeutic content depending on the user’s current level of vulnerability. We evaluated this approach using the MovieLens 1M dataset, where items were semantically annotated for clinical risk and therapeutic value using large language models. Our simulations show that our algorithm successfully blocks the recommendation of harmful content during crisis peaks, actively reshaping the feed to support emotional de-escalation. Furthermore, this safety intervention only causes a controlled, acceptable drop in standard accuracy metrics like NDCG. By using asymmetric hyperparameters, RankAid also gives system administrators the flexibility to tune the severity of the intervention based on specific clinical guidelines.
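A vulnerability-dependent re-ranking layer of the kind this abstract describes might look like the sketch below. The linear scoring rule, the parameter names `alpha` and `beta`, and the item fields are all assumptions made for illustration; the abstract does not give RankAid's actual formula.

```python
def rerank(items, vulnerability, alpha=2.0, beta=0.5):
    """Re-rank items given a user vulnerability level in [0, 1].

    items: list of dicts with 'id', 'relevance', 'risk', 'therapeutic' (each in [0, 1]).
    alpha penalises risk and beta boosts therapeutic value; alpha != beta makes
    the intervention asymmetric (hypothetical parameterisation).
    """
    def adjusted(item):
        return (item["relevance"]
                - alpha * vulnerability * item["risk"]
                + beta * vulnerability * item["therapeutic"])
    return [item["id"] for item in sorted(items, key=adjusted, reverse=True)]

# Hypothetical candidates from a base recommender.
items = [
    {"id": "A", "relevance": 0.9, "risk": 0.8, "therapeutic": 0.0},
    {"id": "B", "relevance": 0.6, "risk": 0.0, "therapeutic": 0.7},
    {"id": "C", "relevance": 0.7, "risk": 0.1, "therapeutic": 0.1},
]
print(rerank(items, vulnerability=0.0))  # plain relevance order
print(rerank(items, vulnerability=1.0))  # risky item pushed to the bottom
```

With zero vulnerability the layer leaves the base ranking untouched, so the accuracy cost is only paid when the user state calls for intervention.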

[IR-10] Multilingual Humour-Aware Retrieval with Dense and Re-Ranking Models

Link: https://arxiv.org/abs/2605.25165
Authors: Georgios Arampatzis, Avi Arampatzis
Subjects: Information Retrieval (cs.IR)
Comments: 8 pages

Click to view abstract

Abstract:Humour-aware information retrieval poses unique challenges beyond standard semantic retrieval, as systems must account not only for topical relevance but also for humour-specific linguistic phenomena such as wordplay, phonetic ambiguity, and polysemy. In this paper, Team DUTH studies multilingual humour-aware information retrieval using the CLEF 2025 JOKER Task 1 benchmark, which evaluates humour retrieval in English and Portuguese. Our approach combines multilingual XLM-RoBERTa-based dense retrieval with additional system variants, including neural re-ranking, in order to assess the extent to which general-purpose Transformer models can capture humour-specific relevance. The results reveal substantial cross-lingual variation. While the Portuguese runs demonstrate comparatively strong performance across MAP, MRR, and early precision metrics, the English runs perform significantly worse, with relevant humorous documents frequently appearing at lower ranks. These findings highlight the limitations of purely semantic dense representations for humour retrieval, particularly when humour depends on surface-level cues that are not explicitly modelled by multilingual encoders. We further analyse contributing factors to this discrepancy, including dataset characteristics, query-document alignment, and variation in humour mechanisms. Overall, the Team DUTH experiments establish multilingual dense-retrieval and re-ranking baselines and provide insights into the challenges of modelling humour-aware relevance within the JOKER framework.

[IR-11] AgentIR: A Workload-Adaptive Cascade Retrieval Substrate for Long-Term Conversational Memory

Link: https://arxiv.org/abs/2605.25092
Authors: Aojie Yuan, Haiyue Zhang, Shahin Nazarian
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Databases (cs.DB)
Comments: 29 pages, 9 figures, 12 tables. Main paper 9 pages + comprehensive appendix (proof, GPU kernels, full per-dataset BEIR/LongMemEval/LoCoMo tables, cascade router C++ API, 6 robustness experiments, FAQ, failure-case catalog)

Click to view abstract

Abstract:Long-term conversational memory is a retrieval workload classical IR was not built for: the index grows during the query stream, query types shift intra-session, and the latency budget per retrieval is sub-10 ms. Lucene-class engines treat the index as static and the query as stateless, leaving the workload’s structure unexploited. AgentIR treats fusion as a per-query decision along two axes: which fusion to apply (BM25, Dense, RRF, or agent-aware RRF), and whether the ~52 ms dense channel is worth running at all. The second axis is a confidence-triggered cascade router that decides from the BM25 top-k margin alone and re-tunes across workloads without retraining. On LongMemEval (n=500), where the dense channel does add information, the cascade skips 63% of queries at parity LLM-judged accuracy (2.67x faster under two judges, paired bootstrap p=0.88); per-qtype thresholds extend this to 5.76x under 5-fold cross-validation. On LoCoMo (n=1,982), where BM25 alone is already the strongest single system, the same trigger auto-tunes to a 100% skip rate (132x faster, +0.089 Hit@5). Capacity on a shared 8-core VM rises from ~154 to ~1,400 concurrent agents (9x). Underneath the cascade, a time-partitioned index does O(log 1/epsilon) work independent of corpus size: 1234x corpus growth costs only 3.6x latency, ending in 1769x over sequential at sub-100 us p50 on 5M records. At parity quality with Lucene on 9 BEIR datasets up to 8.8M docs, the substrate runs 10x geo-mean over Pyserini 8T and 11x over PISA-1T BlockMax-WAND; an A100 reaches 1.8-39x over Pyserini 8T; chunked index build sustains 56.8K docs/sec on MS MARCO. Three subtle BM25/GPU correctness pitfalls that silently regress nDCG@10 by 6-8x are documented and fixed; post-fix CPU and GPU agree within 0.0002 nDCG@10 on all eight datasets that fit a single A100.
ACM classes: H.3.3; H.3.4; I.2.7. DOI: https://doi.org/10.48550/arXiv.2605.25092
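The two decisions named in this abstract (whether to run the dense channel, and how to fuse) can be sketched as follows. Reciprocal Rank Fusion is the standard formula; the trigger shown here, the score gap between the top two BM25 hits compared against `margin_threshold`, is an assumed stand-in for the paper's "BM25 top-k margin", and all names and values are hypothetical.

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank_d)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

def cascade_retrieve(bm25_hits, dense_search, margin_threshold=2.0):
    """bm25_hits: list of (doc_id, score), best first. dense_search: zero-argument
    callable returning a ranked list of doc ids; only called when BM25 is unsure.
    Returns (ranked doc ids, whether the dense channel was run)."""
    bm25_ranking = [doc for doc, _ in bm25_hits]
    if len(bm25_hits) >= 2:
        margin = bm25_hits[0][1] - bm25_hits[1][1]
    else:
        margin = 0.0  # too few hits to judge confidence; fall through to dense
    if margin >= margin_threshold:
        return bm25_ranking, False
    return rrf([bm25_ranking, dense_search()]), True

dense_calls = []
def fake_dense():
    """Hypothetical dense channel that records each invocation."""
    dense_calls.append(1)
    return ["d2", "d4", "d1"]

confident = cascade_retrieve([("d1", 12.0), ("d2", 5.0), ("d3", 4.0)], fake_dense)
uncertain = cascade_retrieve([("d1", 5.2), ("d2", 5.0), ("d3", 1.0)], fake_dense)
print(confident, uncertain, len(dense_calls))
```

Passing the dense search as a callable is what makes the skip worthwhile: when the margin is large, the expensive channel is never invoked.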

[IR-12] Meta-Modal Agent: Sequential Evidence Routing for Missing-Modality Candidate Reranking

Link: https://arxiv.org/abs/2605.25007
Authors: Jinze Wang, Yangchen Zeng, Tiehua Zhang, Lu Zhang, Yuze Liu, Zhishu Shen, Jiong Jin, Zhu Sun
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Missing modalities cause severe failures in multimodal recommender systems. User histories, item text, and visual evidence are frequently absent during cold-start scenarios, exactly when recommendation quality matters most. Existing approaches recover absent signals through imputation, feature propagation, or generative reconstruction, but these strategies can inject unsupported evidence when the surviving signals are weak. We introduce the Meta-Modal Agent (MMA), a large language model based candidate-pool reranker that treats missingness as a sequential evidence-routing problem. MMA is trained with balanced missingness-task reinforcement learning over masked-modality episodes and is evaluated in two variants: MMA-Auto, which uses only automated text, image, and graph tools, and MMA-Interactive, which additionally permits clarification questions grounded in surviving modalities as an upper-bound diagnostic. MMA operates after a first-stage retriever has produced a candidate pool; it scores those candidates rather than retrieving items from the full catalog. Final reranking fuses MMA scores with first-stage retrieval scores selected on validation data. Our evaluation is organized around four evidence checks required for a robust missing-modality claim: oracle-free one-observed-modality availability (OOMA) robustness, per-modality OOMA breakdowns, fixed-pool full-catalog reranking, and a deterministic-router mechanism control. MMA-Auto improves target-positive OOMA NDCG@10 by 4.0% and fixed-pool full-catalog reranking NDCG@10 by 12.7% over the strongest non-interactive baseline. RuleRouter-Fuse, which uses the same tools and fusion rule without learned policy updates, underperforms MMA-Auto, supporting learned routing beyond deterministic tool fusion. MMA-Interactive adds a 4.1% upper-bound gain when clarification is available.

[IR-13] Selective Test-Time Compute Scaling for Click-Through Rate Prediction via Uncertainty-Triggered Feature Path Exploration

Link: https://arxiv.org/abs/2605.24989
Authors: Moyu Zhang, Yun Chen, Yujun Jin, Jinxin Hu, Yu Zhang, Xiaoyi Zeng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 12 pages, 4 figures, 3 tables

Click to view abstract

Abstract:Scaling test-time compute has proven highly effective for language models, yet this opportunity remains largely unexplored for industrial Click-Through Rate (CTR) prediction. CTR models suffer from a fundamental asymmetry: feature combinations well-represented in training yield confident predictions, while sparsely observed ones produce unreliable outputs. Existing training-phase solutions such as adaptive gating learn a fixed selection function subject to the same sparsity, offering no per-instance recourse at inference. We propose UTTSI (Uncertainty-Triggered Test-Time Selective Inference), a training-free model-agnostic framework that scales inference depth proportionally to per-instance uncertainty. A dual-signal estimator combining model logit confidence with a data-level frequency prior distinguishes epistemic uncertainty from aleatoric ambiguity. Every instance undergoes adaptive feature filtering to remove unreliable embeddings; uncertain instances additionally receive stochastic feature-path explorations whose predictions are aggregated via consistency-weighted ensembling. Confident instances bypass exploration entirely, keeping average overhead at approximately $2.8\times$ base model cost with worst-case latency [...]. Experiments on four datasets with three backbone architectures demonstrate consistent, statistically significant gains over all training-phase baselines. A seven-day online A/B test further confirms a 5.3% relative CTR gain ($p < 0.01$), establishing selective test-time compute allocation as a practical complement to training-phase advances for CTR prediction.

[IR-14] Self-Balancing Gradient Allocation for Heterogeneity-Aware Feature Generation in Click-Through Rate Prediction

Link: https://arxiv.org/abs/2605.24986
Authors: Moyu Zhang, Yun Chen, Yujun Jin, Jinxin Hu, Yu Zhang, Xiaoyi Zeng
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 12 pages, 5 figures, 4 tables

Click to view abstract

Abstract:Generative pre-training via discrete diffusion provides dense reconstruction supervision across all feature fields simultaneously, mitigating representation collapse from data sparsity in CTR prediction. However, all existing generative CTR methods share a fundamental limitation: the reconstruction objective assigns equal training weight to every feature field, ignoring the profound heterogeneity of reconstruction difficulty across high-cardinality ID fields, sparse categorical attributes, numerical values, and behavioral sequences. This causes easy fields to dominate training gradients while the hardest but most informative fields remain chronically underfit, a problem we term the generative difficulty imbalance. We propose HeteGenCTR, which resolves this imbalance through per-field learnable difficulty parameters jointly trained with the denoising network. This unified signal drives two coordinated components without additional hyperparameters: a self-balancing loss that automatically reallocates gradient budget toward harder fields with a provably stable equilibrium, and a difficulty-guided attention mechanism that suppresses the influence of already-converged easy fields while amplifying cross-field information flow toward hard fields. Both components share the same learned signal and remain mutually consistent throughout training. Experiments on five CTR benchmarks and a seven-day online A/B test demonstrate consistent, statistically significant improvements over state-of-the-art baselines, with disproportionate gains for cold-start and long-tail users.

[IR-15] Your Embedding Model is SMARTer Than You Think

Link: https://arxiv.org/abs/2605.24938
Authors: Jianrui Zhang, Hyun Jung Lee, Sukanta Ganguly, Tae-Eui Kam, Donghyun Kim, Yong Jae Lee
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Multimodal retrieval relies heavily on single-vector retrievers, which compress rich, sequential token sequences into one single global representation. While efficient, they discard fine-grained, local evidence critical for dense retrieval tasks. Multi-vector approaches were introduced as a solution, but they strictly require training and many ignore the necessity of a globally summarizing representation. To address this, we introduce SMART, a framework that unlocks the latent multi-vector capabilities of standard single-vector models. We first demonstrate that standard contrastive training on the pooled embedding implicitly shapes the retrieval geometry of preceding hidden states via gradient flow. By applying direct late-interaction over these frozen hidden states during inference, SMART acts as a plug-and-play upgrade that consistently improves performance across diverse modalities, improving even the state-of-the-art models further on MMEB-V2. We also reveal SMART’s superior performance, as simple lightweight post-training not only saves time and compute, but also brings forth further improvement on Visual Document retrieval, allowing a single-vector model to outperform SoTA multi-vector counterparts. Ultimately, SMART offers both a highly efficient inference enhancement and a powerful finetuning technique for multimodal retrieval. We open source our code and weights at this https URL.

[IR-16] MVR-cache: Optimizing Semantic Caching via Multi-Vector Retrieval and Learned Prompt Segmentation ICML2026

Link: https://arxiv.org/abs/2605.24914
Authors: Ali Noshad, Zishan Zheng, Yinjun Wu
Subjects: Information Retrieval (cs.IR); Databases (cs.DB); Machine Learning (cs.LG)
Comments: Published in ICML 2026

Click to view abstract

Abstract:To reduce LLM costs and latency, semantic caching systems must accurately identify when a new prompt matches a cached one. Current methods often rely on simplistic similarity measures, which limit their effectiveness. We introduce MVR-cache, a novel semantic caching approach that significantly improves retrieval accuracy by integrating Multi-Vector Retrieval (MVR). MVR-cache is built upon a learnable segmentation model that intelligently splits prompts, enabling fine-grained similarity comparisons via MaxSim. We derive the model’s training objective from a rigorous theoretical analysis. This can ensure that optimizing this objective directly maximizes cache hits under strict correctness constraints. To solve the resulting non-differentiable combinatorial optimization problem, we leverage a reinforcement learning-based training strategy with the theoretically grounded objectives as the reward. Experimental results on established benchmarks across diverse tasks confirm that in comparison to the state-of-the-art, MVR-cache consistently increases the cache hit rates by up to 37% while maintaining the same correctness guarantees. MVR-cache is available at this https URL
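The MaxSim comparison mentioned in this abstract can be sketched for a semantic cache as below. Segment embeddings are taken as given (the learned segmentation model is out of scope here), the per-query-segment maxima are averaged rather than summed so that one threshold works across prompt lengths, and the `threshold` value and cache layout are hypothetical choices, not MVR-cache's.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def maxsim(query_segments, cached_segments):
    """Mean over query segments of the best cosine match among cached segments."""
    best = [max(cosine(q, c) for c in cached_segments) for q in query_segments]
    return sum(best) / len(best)

def cache_lookup(query_segments, cache, threshold=0.9):
    """Return the response of the best-matching cached prompt, or None on a miss."""
    best_score, best_response = threshold, None
    for cached_segments, response in cache:
        score = maxsim(query_segments, cached_segments)
        if score >= best_score:
            best_score, best_response = score, response
    return best_response

# Hypothetical 2-d segment embeddings for two cached prompts.
cache = [
    ([[1.0, 0.0], [0.0, 1.0]], "response A"),
    ([[1.0, 0.0]], "response B"),
]
print(cache_lookup([[1.0, 0.0], [0.0, 1.0]], cache))  # both segments match prompt A
print(cache_lookup([[0.6, 0.8]], cache))              # no cached prompt is close enough
```

Because each query segment is matched independently, a prompt that agrees with a cached one on only some segments scores lower than a single pooled vector might suggest, which is the fine-grained behaviour the abstract argues for.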

[IR-17] Spectral Retrieval: Multi-Scale Sinc Convolution over Token Embeddings for Localized Retrieval in LLM Multi-Agent Systems

Link: https://arxiv.org/abs/2605.24764
Authors: Andrea Morandi
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:[Abridged] - Spectral Retrieval is a plug-in re-ranking stage that interpolates between per-token MaxSim and mean-pool retrieval through a multi-scale sinc convolution over token embeddings. In standard dense retrieval each document is one mean-pooled vector; when relevance localises into a short subspan, the signal averages into noise. Spectral Retrieval reuses per-token embeddings from a late-interaction index and convolves them with a normalised sinc kernel at multiple scales. At L=1 the kernel acts as the identity, recovering per-token MaxSim; as L grows it approaches a uniform filter, recovering mean pooling. The maximum cosine over positions and scales yields a score provably no less informative than either endpoint. On a controlled synthetic benchmark with 1,000 documents and planted single-position spikes, mean-pool retrieval sits at chance (Recall@10 ~ 0.02) regardless of spike strength, while Spectral Retrieval reaches Recall@10 = 1.0 once the planted cosine exceeds the corpus-level token noise floor. On LIMIT-small with a frozen all-mpnet-base-v2 encoder, Spectral Retrieval lifts Recall@10 from 0.33 to 0.90, MRR from 0.22 to 0.79, and strict Success@10 from 0.12 to 0.84, without retraining. The method fits naturally into multi-agent LLM systems, where each agent benefits from a tighter, role-specific retrieval window over a shared corpus.

[IR-18] How Many Tools Should an LLM Agent See? A Chance-Corrected Answer

Link: https://arxiv.org/abs/2605.24660
Authors: Vyzantinos Repantis, Ameya Gawde, Harshvardhan Singh, Joey Blackwell II
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, 2 figures

Click to view abstract

Abstract:Before an LLM agent can use a tool, a retrieval system must decide which candidate tools to show to the agent. How long should that shortlist be? Show too many tools and the model struggles to choose. Show too few and the correct tool may not appear. Most systems apply a fixed shortlist size to every query, but no standard metric exists to evaluate whether that size was appropriate. We treat the number of tools shown to an LLM agent as the object of evaluation and we apply Bits-over-Random (BoR), a chance-corrected metric that asks whether success at a given depth is better than what random selection would achieve at that same depth. We evaluate BoR across three tool-selection benchmarks, multiple scorers, and registries ranging from 20 to 3,251 tools. We then turn the same principle into a reinforcement learning (RL) reward for choosing tool shortlist depth per query. The RL agent is deliberately simple, serving as a probe of the metric rather than a proposed system. As the shortlist grows, random chance of including the correct tool rises, so the reward naturally decreases, reducing the need for an engineered depth penalty. On BFCL (370 tools), the learned policy nearly matches the coverage of showing 50 tools (90.3% vs 90.8%) while presenting only 7 on average. On ToolBench (3,251 tools), a fixed shortlist of 5 tools achieves higher aggregate coverage (64.7% vs 61.9%) but finds nothing on hard queries (correct tool ranked 6th-20th). The BoR agent finds 16.7% on those same queries by searching deeper. Downstream validation with Claude Sonnet 4.6 indicates that shorter adaptive lists also improve the LLM’s ability to select the right tool: 93.1% versus 87.1% when always shown 5 tools, widening to 76.8% vs 60.9% on medium-difficulty queries where the correct tool is present but not ranked first.
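The chance-correction idea in this abstract can be made concrete with a small sketch. The form used here, the log2 ratio of observed success to the probability that a uniformly random shortlist of the same depth contains the single correct tool, is one plausible reading of "Bits-over-Random"; the paper's exact definition may differ, and the numbers below are toy values rather than results from the paper.

```python
import math

def bits_over_random(success_rate, depth, registry_size):
    """log2 of observed success over chance success at the same shortlist depth.

    Chance success assumes exactly one correct tool and a uniformly random
    shortlist of `depth` tools drawn from `registry_size` (assumed model).
    """
    random_rate = depth / registry_size
    if success_rate == 0:
        return float("-inf")
    return math.log2(success_rate / random_rate)

# Toy values: a short list that usually contains the right tool scores high...
short_list = bits_over_random(success_rate=0.8, depth=5, registry_size=100)
# ...while showing every tool always "succeeds" but earns nothing over chance.
full_list = bits_over_random(success_rate=1.0, depth=100, registry_size=100)
print(short_list, full_list)
```

Under this reading, deeper shortlists raise the chance baseline, so the score falls unless coverage improves enough to compensate, which matches the abstract's remark that no separate depth penalty is needed.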

[IR-19] The Multilingual Curse at the Retrieval Layer: Evidence from Amharic ACL2026

Link: https://arxiv.org/abs/2605.24556
Authors: Yosef Worku Alemneh, Kidist Amde Mekonnen, Maarten de Rijke
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 10 pages, 4 tables. Accepted to the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM) at ACL 2026

Click to view abstract

Abstract:Multilingual retrieval increasingly underpins cross-lingual question answering and retrieval-augmented generation. Strong zero-shot scores on multilingual benchmarks are often taken as evidence that current encoders transfer reliably across many languages. We argue that this assumption breaks down for underrepresented, morphologically rich languages, and use Amharic as a diagnostic case. Under a shared passage retrieval protocol covering dense, late-interaction, learned sparse, and cross-encoder paradigms, we compare zero-shot multilingual retrievers, Amharic-fine-tuned multilingual retrievers, and monolingual Amharic retrievers. The strongest zero-shot multilingual retriever underperforms the strongest monolingual Amharic first-stage retriever by 23% relative MRR@10. Fine-tuning two recent multilingual embedding models on the same Amharic supervision yields 32-60% relative MRR@10 gains over zero-shot, but the best Amharic-fine-tuned multilingual model remains below the strongest monolingual Amharic retriever. These findings indicate that zero-shot multilingual retrieval is not a sufficient proxy for equitable information access in the LLM era: for underrepresented languages, retrieval must be evaluated and adapted in-language rather than inferred from aggregate multilingual benchmarks. To foster future research, we publicly release the dataset, codebase, and trained models at this https URL.

[IR-20] Beyond Control-Flow: Integrating the Resource Perspective into Multi-Collaborative Process Modeling from Text

Link: https://arxiv.org/abs/2605.24546
Authors: Anton Antonov, Humam Kourani, Alessandro Berti, Gyunam Park
Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Submitted to EDOC 2026, under review

Abstract:Process modeling is a sub-domain of Business Process Management (BPM) focused on the translation of process artifacts into formal models. This task traditionally requires extensive human input and domain expertise in both BPM notations and the specific business context. While Large Language Models (LLMs) can now automate much of this manual work, current text-to-model approaches focus predominantly on the control-flow perspective, ordering activities without considering the collaborative aspect of the processes. In this paper, we introduce a resource-aware generation pipeline that produces formal BPMN 2.0 collaboration diagrams from natural-language descriptions. Rather than solely prompting an LLM for raw XML, we describe a compact, executable intermediate language with mandatory resource details defining both the organization (pool) and the role (lane). Cross-organization dependencies are materialized using the standard formal notation for such interactions (message events), while an orthogonal layout routine automatically handles the spatial arrangement of elements within pools and lanes. Experiments on ten business processes with nine LLMs show strong resource discovery while preserving control-flow quality and adding only marginal runtime overhead. This approach moves generative modeling toward a more comprehensive, multi-collaborative representation of business operations.

[IR-21] SemanticZip: A Pilot Framework for Lossy Text Compression with LLMs as Semantic Decompressors

Link: https://arxiv.org/abs/2605.24541
Authors: Natalia Trukhina, Vadim Vashkelis
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 13 pages, 1 figure, 2 tables. Pilot framework paper; code and supplementary artifacts available in ancillary files

Abstract:Text compression for large language model (LLM) systems is usually framed as token deletion, retrieval, summarization, or exact reconstruction. We study a more aggressive but explicitly lossy setting: compress text into compact codes that an LLM can expand into task-relevant meaning. We call this setting SemanticZip. Unlike lossless compression, SemanticZip does not require byte-identical reconstruction; unlike ordinary summarization, it treats model-based decompression as part of the codec and evaluates whether task-relevant semantic commitments are recovered. This paper is a pilot framework, not a benchmark claim. We formalize LLM-mediated decompression, define a protected/lossy packet architecture, and evaluate six representation regimes over five author-constructed diagnostic cases: structured prose, JSON, CCL-Core, CCL-Min, SemanticZip ASCII, and SemanticZip emoji. An independent decoder LLM reconstructs typed semantic atoms from each compressed representation, and we score Critical Atom Recall, Weighted Atom Recall, precision, and tokenizer gain. In this pilot, structured prose has the highest recoverability, with WAR = 0.956 and 19.1% o200k_base token gain. CCL-Min is the strongest balanced point, with 39.4% token gain and WAR = 0.874. SemanticZip ASCII provides the largest useful compression, with 46.5% token gain and WAR = 0.802, while emoji-heavy SemanticZip performs worse on both compression and recovery. The main contribution is not the claim that these numbers establish a universal frontier. Rather, we introduce a reproducible experimental interface for studying lossy, LLM-decompressible text codes and a design principle: safety-critical and exact commitments should remain protected, while predictable low-risk context may be semantically zipped.
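
The abstract names Weighted Atom Recall (WAR) and token gain without giving formulas. The sketch below assumes the most direct reading: WAR is the weight-normalized share of reference atoms that the decoder recovered, and token gain is the fractional reduction in token count. The atom strings and weights are made up for illustration.

```python
def weighted_atom_recall(reference_atoms, recovered_atoms):
    """Share of total atom weight that was recovered.

    reference_atoms: dict mapping atom -> importance weight (assumed format).
    recovered_atoms: set of atoms the decoder reconstructed.
    """
    total_weight = sum(reference_atoms.values())
    recovered_weight = sum(
        weight for atom, weight in reference_atoms.items() if atom in recovered_atoms
    )
    return recovered_weight / total_weight


def token_gain(original_tokens, compressed_tokens):
    """Fractional token saving; 0.394 would correspond to a 39.4% gain."""
    return 1.0 - compressed_tokens / original_tokens


if __name__ == "__main__":
    reference = {"dose=5mg": 3.0, "patient=adult": 1.0}
    print(weighted_atom_recall(reference, {"dose=5mg"}))
    print(token_gain(200, 100))
```

Weighting lets a missed safety-critical atom cost more than a missed piece of low-risk context, which fits the protected/lossy split the abstract argues for.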

[IR-22] Benchmarking Patent Embeddings: A Multi-Task Evaluation of 22 Models Across Retrieval, Classification, and Clustering

Link: https://arxiv.org/abs/2605.24297
Authors: Amirhossein Yousefiramandi, Ciaran Cooney
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Which fine-tuning signals improve patent embedding models, and do gains transfer across patent landscapes? We benchmark 22 embedding models, from 22M-parameter encoders to 12B instruction-tuned LLMs, on retrieval, classification, and clustering. The study uses 113,148 WIPO assistive-technology patents, 46,069 citation-graph retrieval queries, and the public DAPFAM dataset for external validation. Our framework covers citation-based retrieval, hybrid sparse-dense fusion, multi-label classification over five datasets, unsupervised clustering, six text-section views, domain-adaptive fine-tuning of four models, jurisdiction analysis, and proprietary DWPI (Derwent World Patents Index, Clarivate) expert-written content. Results show that fine-tuning is task-dependent: single-landscape tuning can improve in-domain scores but often hurts retrieval on an external landscape, challenging the assumption that more domain data always helps. Within model families, scale usually predicts performance (Qwen3 0.6B to 4B to 8B; Llama-Nemotron 1B to 8B), but cross-family scaling is noisy: the 12B KaLM-Gemma3 ranks 8th on TAC retrieval, while Qwen3-0.6B leads ARI clustering. Title+Abstract+Claims is the most reliable text representation. Multi-view abstract-claim alignment improves retrieval by up to 7.1 percent nDCG@10, while combined fine-tuning gives the strongest classification gains (+7.1 F1). All models drop by 55-65 percent on out-of-domain queries, and hybrid sparse-dense fusion does not close this gap. BM25-dense interpolation gives modest nDCG@10 gains (+0.002 to +0.015), with larger benefits for weaker zero-shot dense models. Code and evaluation framework are publicly available.
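
To make "BM25-dense interpolation" concrete, here is a minimal sketch of linear score fusion. The min-max normalization and the `alpha` weighting are common choices that are assumed here; the abstract does not say which normalization the benchmark uses.

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict to [0, 1]; constant lists map to 0."""
    lowest, highest = min(scores.values()), max(scores.values())
    if highest == lowest:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lowest) / (highest - lowest) for doc_id, s in scores.items()}


def interpolate(bm25_scores, dense_scores, alpha=0.5):
    """Rank documents by alpha * dense + (1 - alpha) * BM25 on normalized scores.

    Documents missing from one list get a normalized score of 0 there.
    """
    bm25_norm, dense_norm = min_max(bm25_scores), min_max(dense_scores)
    doc_ids = set(bm25_norm) | set(dense_norm)
    fused = {
        doc_id: alpha * dense_norm.get(doc_id, 0.0)
        + (1 - alpha) * bm25_norm.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Sort by fused score (descending), breaking ties by document id.
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))


if __name__ == "__main__":
    bm25 = {"p1": 12.0, "p2": 8.0, "p3": 4.0}
    dense = {"p1": 0.2, "p2": 0.9, "p3": 0.5}
    print(interpolate(bm25, dense, alpha=0.5))
```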

[IR-23] When Does Synthetic Patent Data Help? Volume-Fidelity Trade-offs in Low-Resource Multi-Label Classification

Link: https://arxiv.org/abs/2605.24296
Authors: Amirhossein Yousefiramandi, Ciaran Cooney
Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Abstract:We study when LLM-generated synthetic data helps low-resource multi-label patent classification, separating true synthetic value from the confound that larger augmented sets can win by volume alone. Across six open-source LLMs (3.8-12B), four real-data regimes, 64 WIPO assistive-technology labels, two generation strategies, and three classifier families, the headline BERT-for-Patents micro-F1 jump from 0.120 to 0.702 is largely volume-driven. A duplicate-to-match real-only control that resamples 165 patents to the augmented size reaches 0.678; the controlled synthetic gain is only +0.024 over this control, but +0.219 over focal-loss reweighting, the strongest non-augmentation baseline. The main finding is that fidelity metrics change meaning with scale: at extreme scarcity, MMD correlates positively with classification gain (r=+0.95), but at 1:10 the relation flips (r=-0.73; Fisher z=+6.47, p<0.001). Fixed-budget mixing finds a 20-30% real / 70-80% synthetic optimum; paraphrase scaling collapses from a 165-document seed; and shuffled mixing beats curriculum ordering, ensembling, and classifier-based filtering. Leakage controls – label-name masking, instruction-level label removal, fine-grained evaluation, and keyword-overlap audits – argue against label-string dependence as the main driver for BERT-for-Patents. The apparent ModernBERT collapse under label removal is traced to a Flash-Attention-2 + bf16 numerical artifact, recovering 65% of lost performance with fp32 eager attention. Finally, the same corpus that improves classification by up to +0.58 raw micro-F1 hurts a Jaccard-label-overlap retrieval proxy; even a standard-patent-only filter leaves a 26% nDCG@10 drop. Thus, synthetic patent text is task- and metric-specific, not reducible to prompt genre alone.
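
The duplicate-to-match control can be pictured as padding the real set with resampled copies until it matches the augmented set's size, so that any remaining gap cannot be attributed to volume. The sketch below assumes uniform resampling with replacement; the function name and sizes are illustrative, not the paper's.

```python
import random


def duplicate_to_match(real_examples, target_size, seed=0):
    """Pad a real-only training set to target_size by resampling with replacement.

    Every original example is kept once; only the padding is random.
    """
    extra = target_size - len(real_examples)
    if extra < 0:
        raise ValueError("target_size is smaller than the real set")
    rng = random.Random(seed)  # fixed seed keeps the control reproducible
    return list(real_examples) + rng.choices(real_examples, k=extra)


if __name__ == "__main__":
    real = [f"patent_{i}" for i in range(165)]
    control = duplicate_to_match(real, target_size=1000)
    print(len(control), len(set(control)))
```

Training the same classifier on `control` and on the synthetic-augmented set then isolates what the synthetic text contributes beyond sheer size.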

[IR-24] CRISP – Clustering-Based Redundancy-Reduced Instance Sampling for Pathology Case Representation and Retrieval

Link: https://arxiv.org/abs/2605.24253
Authors: Zahra Rahimi Afzal, Wataru Uegami, Saghir Alfasly, Saba Yasir, Judy C. Boughey, Matthew P. Goetz, Krishna R. Kalari, H.R. Tizhoosh
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Abstract:Digital pathology archives increasingly contain multiple whole-slide images (WSIs) per case, capturing spatially distinct tumour regions and reflecting intrinsic morphological heterogeneity. However, most existing approaches rely on a single pathologist-selected slide, thereby discarding potentially informative evidence distributed across the remaining WSIs. To date, no autonomous framework has been proposed for comprehensive multi-WSI case processing. Here, we present an unsupervised framework for case-level analysis that integrates information from all available slides within a case. Rather than relying on a single designated slide, the proposed approach constructs case-level representations by selectively distilling informative patches across WSIs. We introduce Clustering-Based Redundancy-Reduced Instance Sampling for Pathology (CRISP), a two-stage framework that first reduces redundancy within individual WSIs and subsequently applies clustering-based sampling to select a compact yet representative set of patches for the entire case. The resulting patch set captures case-level heterogeneity while avoiding exhaustive processing of gigapixel images, and directly serves as a retrieval index. Using two Mayo Clinic breast cancer datasets for diagnosis and treatment planning, we demonstrate that CRISP consistently matches or surpasses the current standard practice of combined model and pathologist slide selection for patient/case search and retrieval. By automating case-level processing and eliminating subjective WSI selection, CRISP potentially enables the exploitation of clinically relevant information distributed across multiple WSIs that is currently overlooked.

[IR-25] MeVer at CheckThat! 2026: Cluster-Aware Hard-Negative Mining for Multilingual Scientific-Source Retrieval

Link: https://arxiv.org/abs/2605.24236
Authors: Juli Bakagianni, Symeon Papadopoulos
Subjects: Information Retrieval (cs.IR)
Comments: Technical report for CLEF 2026 CheckThat! Task 1 shared task submission. 13 pages, 14 tables

Abstract:Identifying the scientific source behind a social media claim requires matching short, informal, and often multilingual claims against large collections of scientific publications, where semantically related papers may act as challenging distractors or false negatives during training. We present our submission to CheckThat! 2026 Task 1 on multilingual scientific-source retrieval, focusing on how hard-negative mining should be adapted to multi-stage retrieval pipelines for scientific-source retrieval. We propose cluster-aware hard-negative mining strategies that exploit the semantic structure of retrieved candidate pools in order to construct more informative training negatives for dense retrieval and reranking. Our experiments show that different hard-negative structures induce different retrieval behaviors. Localized cluster negatives tend to favor precision-oriented retrieval, whereas broader non-gold semantic negatives provide stronger candidate coverage and more consistent reranking performance across languages. We further study multiple LLM-based evidence-selection formulations, including direct classification, pairwise comparison, and listwise reranking prompts, and find that constrained classification prompts provide the most reliable final document selection. The final system combines a dense retriever, a multilingual cross-encoder reranker, and a selective LLM-based disagreement resolver, ranking 6th among 37 submissions in the shared task evaluation. Overall, our results suggest that hard-negative mining should be treated as a stage-aware design problem rather than as a single retrieval optimization strategy.

[IR-26] Bayesian Rational Search Engine User

Link: https://arxiv.org/abs/2605.24233
Authors: Shichao Ma
Subjects: Information Retrieval (cs.IR); Theoretical Economics (econ.TH)

Abstract:A user faces a list returned by a search system, ordered by a noisy proxy for relevance, and decides sequentially whether to pay a fixed cost to inspect another item or stop with the best she has uncovered. She does not enter the page knowing how good its items are, so each inspection both produces a candidate item and refines her belief about the page’s underlying quality. We show the optimal policy is a standout rule: the user stops as soon as her best find exceeds her posterior mean of an average item on the page by a depth-dependent threshold. The induced dynamics collapse to a one-dimensional Markov chain, which yields the full distribution of inspection depth through a closed-form recursion. The model uncovers three hidden mechanisms (trust, commit, and cut-losses) on why users stop and yields a rich set of testable implications. Moreover, the Bayesian-rational view delivers a novel learning-to-rank likelihood: an observed depth censors the latent relevance path into a polyhedron of survival inequalities, whose Gaussian probability is a differentiable function of any feature-based relevance prediction.

[IR-27] An Interpretable CF-RL-TOPSIS Fusion Model for Skills-Aware Talent Recommendation

Link: https://arxiv.org/abs/2605.24155
Authors: Özkan Canay
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint submitted to Knowledge-Based Systems; 4 figures and 8 tables

Abstract:Effective skills-aware talent recommendation must balance behavioral transition patterns, trajectory-sensitive adaptation, and inspectable occupation-level criteria. Evidence from public benchmarks on how these signals interact, however, remains limited. This study proposes CF-RL-TOPSIS, an interpretable late-fusion model that integrates a transition-aware collaborative branch, a compact reinforcement-style occupation-family bandit, and an entropy-weighted TOPSIS branch constructed from six semantic proxies; the validation-selected fusion coefficients remain auditable. The model is evaluated on two frozen public ICT talent-history benchmarks, JobHop and Karrierewege, using repeated chronological top-5 ranking and paired Wilcoxon tests. On JobHop the full hybrid attains NDCG@5 = 0.3040 +/- 0.0073 and significantly surpasses repeat-last, item Markov, transition-aware collaborative filtering, the CF+TOPSIS hybrid, GRU4Rec, and SASRec (p = 0.0039 across planned comparisons). On Karrierewege the hybrid remains competitive but does not significantly exceed the strongest Markov baseline, revealing a persistence-dominated setting in which the bandit branch appropriately shrinks to near-zero weight. Proxy-sensitivity, family-level deep Q-network, and runtime checks support this interpretation, and a worked user-level case shows how branch scores, criterion weights, and rank shifts can be inspected for an individual recommendation. The contribution is not a benchmark-agnostic superiority claim, but a reproducible account of the conditions under which transparent late fusion adds value beyond simple continuation heuristics. In semantically rich, non-saturating talent-history regimes the three branches reinforce one another; in persistence-dominated regimes the same architecture remains competitive through its collaborative backbone, with the adaptive branch correctly inactive.
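
For readers unfamiliar with the entropy-weighted TOPSIS branch, the following sketch shows the textbook procedure on a toy decision matrix. It assumes strictly positive values, benefit-type criteria only (higher is better), and at least two alternatives; the paper's six semantic proxies and its fusion coefficients are not reproduced here.

```python
import math


def entropy_weights(matrix):
    """Weight each criterion by 1 - normalized Shannon entropy of its column."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    diversities = []
    for j in range(n_cols):
        column_sum = sum(row[j] for row in matrix)
        entropy = 0.0
        for row in matrix:
            p = row[j] / column_sum
            if p > 0:
                entropy -= p * math.log(p)
        diversities.append(1.0 - entropy / math.log(n_rows))
    total = sum(diversities)
    if total <= 1e-12:  # all columns uninformative: fall back to equal weights
        return [1.0 / n_cols] * n_cols
    return [d / total for d in diversities]


def topsis_scores(matrix, weights):
    """Closeness of each alternative to the ideal point, in [0, 1]."""
    n_cols = len(matrix[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_cols)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_cols)] for row in matrix]
    best = [max(row[j] for row in weighted) for j in range(n_cols)]
    worst = [min(row[j] for row in weighted) for j in range(n_cols)]
    scores = []
    for row in weighted:
        d_best = math.sqrt(sum((row[j] - best[j]) ** 2 for j in range(n_cols)))
        d_worst = math.sqrt(sum((row[j] - worst[j]) ** 2 for j in range(n_cols)))
        denom = d_best + d_worst
        scores.append(d_worst / denom if denom > 0 else 0.5)
    return scores


if __name__ == "__main__":
    occupations = [[0.9, 0.8], [0.5, 0.4], [0.1, 0.2]]  # rows: alternatives
    w = entropy_weights(occupations)
    print(w, topsis_scores(occupations, w))
```

Because the weights and per-criterion distances are all explicit numbers, each ranking can be traced back to the criteria that produced it, which is the inspectability the abstract emphasizes.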

[IR-28] Same Ranking, Different Winner: How Scoring Targets Shape LLM Memory Benchmarks

Link: https://arxiv.org/abs/2605.24060
Authors: Sugam Panthi, Rabab Abdelfattah
Subjects: Information Retrieval (cs.IR)

Abstract:Conversational-memory systems increasingly transform dialogue history into facts, summaries, timelines, and other source-linked descendants, so a single source turn can coexist with several derived memories in the same retrieval index. This raises an underspecified evaluation question: which stored form should receive retrieval credit? We show that this scoring-target choice is often left implicit and can materially change benchmark conclusions. We present TIAP, a fixed-output audit that rescores saved ranked outputs under three targets – Raw, Source, and Canonical – without rerunning retrieval. On LoCoMo and LongMemEval-S, switching only the credited target changes nDCG on 83.4–94.0 percent of shared queries, flips target orderings on Mem0 and MemoryOS transfer runs, and reverses parser-density recommendations. A 1,902-case semantic audit further shows that relaxed source-linked credit is fully justified only 29.2 percent of the time, despite high rubric reliability in a validation subset. These results reveal target noninvariance: conclusions about memory architectures can silently flip with a single benchmark-design choice. Conversational-memory papers should therefore define and report the scoring target explicitly.
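
The core idea, rescoring one saved ranking under different credited sets, can be sketched as follows. The abstract does not define the Raw, Source, and Canonical targets precisely, so the three sets below are hypothetical stand-ins chosen only to show that the same ranked output can receive different nDCG values.

```python
import math


def ndcg_at_k(ranked_ids, credited_ids, k=10):
    """Binary-relevance nDCG@k: an item earns gain 1 if it is in credited_ids."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, item_id in enumerate(ranked_ids[:k])
        if item_id in credited_ids
    )
    ideal_hits = min(len(credited_ids), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    # One saved retrieval output; retrieval is never rerun.
    saved_ranking = ["fact_7", "turn_3", "summary_2"]
    # Hypothetical credited sets for a single query under three targets.
    targets = {
        "raw": {"turn_3"},
        "source": {"turn_3", "fact_7", "summary_2"},
        "canonical": {"fact_7"},
    }
    for name, credited in targets.items():
        print(name, round(ndcg_at_k(saved_ranking, credited), 4))
```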

[IR-29] Memento: Personalized RAG-Style Long-Retention Data Scaling for META Ads Recommendation

Link: https://arxiv.org/abs/2605.24051
Authors: Xiaoyu Chen, Ruichen Wang, Jieming Di, Suofei Feng, Nafis Abrar, Lilly Kumari, Tony Tsui, Yilin Liu, Yu Lu, Sowmya Patapati, Junwei Xiong, Qiao Yang, Dorothy Sun, Yang Cao, Victor Chen, Pan Chen, Ramsundar Sundarkumar, Shivendra Pratap Singh, Arnold Overwijk, Ling Leng, Dinesh Ramasamy, Sri Reddy, Robert Malkin, Sandeep Pandey
Subjects: Information Retrieval (cs.IR)

Abstract:Modeling long history data suffers from long-context attention dilution, system inefficiency, and catastrophic forgetting, problems on which naive linear scaling approaches such as LastN fail. We introduce Memento, a personalized retrieval-augmented framework that treats historical user engagements as a document corpus and ad requests as queries, retrieving relevant interactions via Maximal Marginal Relevance (MMR) to balance similarity with diversity. We identify two complementary applications: Representation Memento, which retrieves historical embeddings for feature augmentation, and Data Memento, which retrieves past training examples for multipass training. Through infrastructure co-design – temporal chunking, INT8 quantization, and asynchronous serving – Memento achieves 5-10× resource efficiency over linear scaling. Memento processes daily requests with sub-10ms latency, yielding 0.25-0.3% Normalized Entropy gain on both click-through and conversion prediction. In production, Memento delivers a 1% CTR lift on Facebook Feed and Reels and a 1.2% CVR lift, scaling personalization to 365+ days of history.
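
Maximal Marginal Relevance is a standard greedy procedure, sketched below with cosine similarity over toy embeddings. The `lam` trade-off parameter, the vectors, and the function names are illustrative; nothing here reflects Memento's production implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query, candidates, k, lam=0.7):
    """Greedily pick k candidate indices, trading relevance against redundancy.

    Each step maximizes lam * sim(query, c) - (1 - lam) * max sim(c, selected).
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    ad_request = [1.0, 0.0]
    history = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
    print(mmr_select(ad_request, history, k=2, lam=0.9))  # favors similarity
    print(mmr_select(ad_request, history, k=2, lam=0.3))  # favors diversity
```

A high `lam` returns near-duplicates of the best match, while a low `lam` pulls in interactions that differ from those already chosen, which is the similarity-diversity balance the abstract refers to.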

[IR-30] Rethinking Contrastive Learning for Graph Collaborative Filtering: Limitations and a Simple Remedy ICML2026

Link: https://arxiv.org/abs/2605.24015
Authors: Geon Lee, Sunwoo Kim, Kyungho Kim, Kijung Shin
Subjects: Information Retrieval (cs.IR)
Comments: ICML 2026

Abstract:Graph collaborative filtering (GCF) is a dominant paradigm in recommender systems, where contrastive learning (CL) objectives such as the Sampled Softmax (SSM) loss are widely used for optimization. However, it remains unclear how CL interacts with the prediction mechanism of GCF. By unfolding the prediction mechanism of GCF, we show that the user-item prediction score is computed by aggregating learnable weights over a large number of neighbor pairs formed by the multi-hop neighbors of the user and the item. This analysis suggests that effective optimization critically depends on which neighbor pairs are upweighted during training. Empirically, we find that effective recommendation is achievable by selectively upweighting only a small subset of neighbor pairs whose constituent neighbors are structurally similar to the target user and item, and that the effect of such selective upweighting varies across different neighbor pair types. Based on these findings, we analyze SSM and identify key limitations in its neighbor pair weight update dynamics. To address these limitations, we propose NT-SSM, an effective and principled CL objective that induces type-aware neighbor pair weight update dynamics. Experiments demonstrate consistent performance improvements over SSM across multiple datasets and GCF models.
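
As background for the SSM baseline that NT-SSM modifies, here is a minimal sketch of a sampled softmax loss for one user-item pair. It assumes the common variant in which the positive score appears in the denominator alongside the sampled negatives and scores are divided by a temperature; NT-SSM itself is not reproduced, since the abstract does not give its form.

```python
import math


def sampled_softmax_loss(pos_score, neg_scores, temperature=0.1):
    """-log softmax of the positive item against sampled negatives.

    Uses the log-sum-exp trick for numerical stability.
    """
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(logit - max_logit) for logit in logits)
    )
    return log_denominator - logits[0]


if __name__ == "__main__":
    negatives = [0.1, -0.2, 0.05]
    print(sampled_softmax_loss(0.2, negatives))  # weak positive: higher loss
    print(sampled_softmax_loss(0.9, negatives))  # strong positive: lower loss
```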

[IR-31] Federated Semantic Knowledge Graphs for Laboratory Workflows: A Structured Expert Elicitation Methodology Demonstrated Through Bioanalytical Workflow Twins ISWC2026

Link: https://arxiv.org/abs/2605.23985
Authors: Luis F. Schachner, Vinith Thamizhazhagan, Sara Tanenbaum, John C. Tran, Pamela P. F. Chan, Mandy Kwong, Andy Chang, Maureen Beresini, Margaret Porter Scott
Subjects: Databases (cs.DB); Information Retrieval (cs.IR)
Comments: 48 pages, 4 figures, 3 appendices. Submitted to ISWC 2026 In-Use Track

Abstract:Laboratory workflows in pharmaceutical and biomedical research encode substantial tacit knowledge – expert judgment about failure conditions, decision branching logic, and contextual dependencies – that remains inaccessible to protocol documents, sensor streams, and existing biomedical ontologies. We present a repeatable structured expert elicitation methodology and federated Semantic Knowledge Graph (SKG) architecture for capturing and querying this knowledge, demonstrated through deployment at the Biochemical and Cellular Pharmacology Department of Genentech. Knowledge is elicited via the Protocol Intelligence Co-pilot, a purpose-built AI interview agent that applies structured elicitation lenses to surface tacit procedural knowledge with expert-assigned confidence scores, producing graph representations across three tiers: program-level decision milestones, assay protocol knowledge, and physical execution infrastructure. Separately constructed subgraphs, exemplified by immunoassay (ELISA), quantitative mass spectrometry (LC-MS/PRM), and laboratory automation, are aligned through a shared upper ontology and queried as a single federated graph. Evaluation demonstrates seven query types structurally unavailable from any individual data source, including a cross-subgraph traversal that identifies automation-masked silent failures – conditions where execution logs report success while scientific validity is compromised. Critically, the MASKED_BY graph relationship encodes a class of laboratory risk invisible to current informatics platforms – the structural gap that prevents existing systems from reasoning about scientific validity. This architecture provides the semantic world model that AI laboratory agents currently lack: a queryable representation of where workflows fail silently, where human judgment is irreplaceable, and which execution assets mask rather than detect failure.

[IR-32] Improving the Completeness and Comparability of Segment Disclosures: A Large Language Model Approach

Link: https://arxiv.org/abs/2605.23924
Authors: Yue Liu, Zhiyuan Cheng, Longying Lai
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); General Finance (q-fin.GN)
Comments: 39 pages, 4 figures, submitted to Accounting Horizons

Abstract:Segment-level disclosures are a central component of financial reporting, providing insight into firms’ internal organization and the allocation of economic activities across operating units. However, segment information is often presented in both qualitative and quantitative forms, dispersed across tables and narrative sections of Form 10-K filings. Empirical research relying on structured databases faces both completeness and comparability challenges, as some firm-year observations may be missing, nested segment disclosures are not captured, and support for longitudinal and cross-firm comparability is limited. This study develops a large language model-based framework to extract segment disclosures directly from Form 10-K filings and to preserve both reportable and nested segment information. We further design a retrieval augmented system that incorporates information across multiple filings to support comparability. We use two representative settings to demonstrate its application: longitudinal analysis within a firm to interpret segment changes over time, and cross firm alignment of geographic segments across firms with different reporting structures. The results indicate that the artifact accurately extracts segment-level information and effectively addresses questions that require cross-period knowledge, demonstrating the potential of LLM-based approaches to enhance the measurement and interpretation of segment disclosures.

[IR-33] Agent-Facing Information Design in LLM Tool Registries

Link: https://arxiv.org/abs/2605.23916
Authors: Haochuan Kevin Wang
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); General Economics (econ.GN)

Abstract:LLM tool registries function as unregulated advertising platforms: providers write free-text descriptions that agents use for selection, yet no measurement infrastructure – no viewability standard, quality score, or outcome audit – exists to make this market accountable. We provide the first systematic framework, combining 17,700+ trials across five LLMs and ten domains with a constructive registry design prescription. Legal puffery alone (subjective superlatives, benefit framing) captures 100% of the optimization effect; fabricated claims add zero incremental bias – rendering FTC enforcement of deceptive advertising rules ineffective against the active mechanism. Disclosure fails structurally: system-prompt warnings produce zero measurable effect for four of five models, and behavioral ceilings leave no headroom for label-based correction. Superlatives are the dominant single feature (SBC = +0.35). Registry-layer description normalization achieves first-best welfare model-independently. We propose separating selection-facing descriptions (structured, registry-controlled) from marketing-facing descriptions (provider-authored, shown post-selection), and introduce the Agent Attention Quality Score to distinguish capability from copywriting.

Human-Computer Interaction

[HC-0] The Timing Dependencies of Trust: Speed, Accuracy, and cBCI Neuro-Decoupling in Human-AI Teams

Link: https://arxiv.org/abs/2605.25868
Authors: Christopher Baker, Stephen Hinton, Akashdeep Nijjar, Riccardo Poli, Caterina Cinel, Tom Reed, Stephen Fairclough
Subjects: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Abstract:The speed and accuracy of an artificial teammate fundamentally alter the failure states of Human-AI integration. While high-speed AI interventions risk inducing reflexive blind compliance, delayed interventions can induce ambiguous cognitive conflict. This study investigates how the fundamental characteristics of an in-task AI assistant, Fast/Less-Accurate (FLA-AI) versus Slow/Accurate (SA-AI) impact the synergy of Collaborative Brain-Computer Interface (cBCI) teams in a Virtual Reality drone task. Seventeen operators completed continuous search tasks under high cognitive workload while their spatial covariance was mapped using a 2D Adaptive Riemannian Oracle. The results mathematically demonstrate that AI timing dictates the mechanism of team failure. Fast AI induced instant, blind compliance; human accuracy under deception collapsed to 50.2%, and pure behavioural teams (N=8) failed to scale beyond 74.1%. In contrast, Slow AI induced delayed cognitive conflict; humans hesitated (61.1% accuracy), but N=8 behavioural teams eventually recovered to 100.0%. Crucially, the Riemannian Oracle mathematically adapted to these states: it heavily restricted temporal windows ( 0.8s) to intercept fast reflexive compliance, while widening windows ( 1.2s) to capture delayed cognitive conflict. Integrating these isolated veridical signals via Hybrid Fusion successfully rescued the Fast AI team (+7.6% at N=8) and significantly accelerated the recovery of smaller Slow AI teams (+6.9% at N=4). These findings prove that cBCI synergy is heavily contingent on the temporal dynamics of trust, providing a critical framework for designing dynamically gated Human-AI systems.

[HC-1] Explaining Too Much? Understanding How Large Language Model Reasoning Traces Influence Performance and Metacognition

Link: https://arxiv.org/abs/2605.25856
Authors: Daniela Fernandes, Daniel Buschek, Lev Tankelevitch, Thomas Kosch, Robin Welsch
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 27 pages, 5 figures, 9 tables

Abstract:Large Language Model interfaces are increasingly verbose, exposing intermediate reasoning traces alongside final answers. Traces are framed as transparency mechanisms, yet it is unclear how people use them to solve problems. We report a preregistered between-subjects study (N = 559) in which participants solved ten LSAT-style reasoning problems under one of three conditions: an Answer-only baseline, a Full-trace revealed before the answer, and a Summary-trace presented alongside the answer. Summaries preserved task performance at the no-trace baseline while significantly elevating trust and hedonic appeal, establishing that trace exposure shifts subjective appraisal of the interaction without bringing performance benefits. Under an open-weight reasoning model exposing verbose intermediate output, full traces additionally impaired performance relative to the answer-only baseline. Across all conditions, participants substantially overestimated their performance, and no trace format supported calibrated self-evaluation. Further analysis indicates that hedonic appeal, not trust, carries the indirect path to overestimation, consistent with a processing-fluency account. Reasoning traces are best understood as user-facing interface artifacts rather than transparent windows into model cognition, and calibration is unlikely to emerge from the traces themselves and may best be scaffolded by interactions that elicit users’ own reasoning first.

[HC-2] Posture Clip: Sit properly or I won't let you work

Link: https://arxiv.org/abs/2605.25664
Authors: Arka Majhi, Aparajita Mondal
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Computers and Society (cs.CY)
Comments: Published online by Cambridge University Press on 14 May 2026

Abstract:Poor posture is a significant concern due to its detrimental effects on health and productivity. This paper presents a collar-clipped device called PostureClip, designed to restrict users from sitting and working at a bent angle, by blacking out the screen and resuming on correcting posture, thereby promoting better posture. The device integrates sensors and feedback mechanisms to provide real-time posture feedback to users. To evaluate the effectiveness of PostureClip, a controlled experiment was conducted with participants (n=165) who were working on a laptop/PC for over 6 hours per day. The participants were randomly assigned to both the intervention group (IG1,n=54 ; IG2,n=55), which used the collar-clipped device, and the control group (CG, n=56), which did not use the device. IG1 didn’t get feedback while IG2 got feedback from the device by notifying and further darkening the screen. The study was conducted in the office environment of the participants, for 4 weeks, and metrics such as posture angle, duration of bent angle, and user feedback were collected. Analysis revealed significant improvements in posture angle (p0.001) and significant reduction in bent angle duration (p0.01) for participants’ group using PostureClip with feedback and compared to the group without feedback and the control group (who were not intervened). The qualitative analysis of user feedback highlighted the device’s ease of use, effectiveness in providing timely feedback, and positive impact on participants’ awareness and habits regarding posture. These results indicate that PostureClip is an effective tool for promoting better posture during sedentary work. 
Journal reference: Wearable Technologies, 7, e5 (2026)
Related DOI: https://doi.org/10.1017/wtc.2026.10041

[HC-3] WeeCare: Towards Handheld Bladder Fullness Sensing with a Conformable Pad

Link: https://arxiv.org/abs/2605.25643
Authors: Zhikai Qin, Siqi Zhang, Junyi Zhu, Justin Chan
Subjects: Human-Computer Interaction (cs.HC)
Comments:

Abstract:Patients with bladder dysfunction often lose the sensation of bladder fullness and cannot void naturally, forcing reliance on fixed-schedule catheterization that is uncomfortable and risks complications. We present WeeCare, a handheld conformable pad with fabric electrodes for on-demand bladder fullness sensing using electrical impedance tomography (EIT). The central challenge is that repeated removal and reattachment can introduce variation in electrode position and contact quality. We assess WeeCare along three axes: in-silico simulations characterizing electrode layout and noise robustness, in-vitro phantom experiments across urine salinities and filling levels, and an in-vivo human measurement for bladder fullness sensing, voiding, and filling dynamics.

[HC-4] TopoAlign: Topology-Aware Visual Representation Alignment

Link: https://arxiv.org/abs/2605.25541
Authors: Xinyuan Yan, Rita Sevastjanova, Mennatallah El-Assady, Bei Wang
Subjects: Computational Geometry (cs.CG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Abstract:Neural networks encode inputs as high-dimensional vectors, known as representations, that capture how models process data by encoding task-relevant structure and semantics. Representation alignment refers to the degree to which different models, layers, or training conditions produce similar representations for the same inputs, with important implications for model interpretation, selection, and robustness analysis. Existing approaches to measure alignment primarily rely on geometric properties, such as neighborhood and cluster similarity, offering limited insight into the global organization of representations. In this work, we present TopoAlign, a topology-aware framework for visually comparing model representations from a structural perspective. Leveraging mapper graphs from topological data analysis, TopoAlign jointly analyzes graphs constructed from representations of shared inputs across different models or layers. The framework supports a top-down comparative workflow: it first performs global structure alignment via joint force-directed optimization to produce coordinated graph layouts; it then identifies local correspondences through automated detection of structurally matching regions, visualized with Bubble Sets; and finally it enables fine-grained pattern inspection through motif-based queries and membrane-inspired visualizations. We demonstrate TopoAlign through case studies on language and multimodal models, complemented by expert feedback. Our results show that TopoAlign provides meaningful insights into representation structure and alignment from a topological perspective.

[HC-5] ATWL: A Formal Language for Representing, Comparing, and Reusing Visual Analytics Workflows

Link: https://arxiv.org/abs/2605.25489
Authors: Natalia Andrienko, Gennady Andrienko, Jürgen Bernard, Michael Sedlmair
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Visual analytics (VA) workflows are inherently complex, involving data transformation, feature engineering, visual representation, and human interpretation. They are typically described in unstructured prose, hindering systematic comparison, reuse of proven strategies, and training of novices. We present Artifact-Transform Workflow Language (ATWL), a domain-agnostic, declarative language that formally represents VA workflows by capturing their structure and underlying analytical intent. ATWL is built upon a modular ontology of eight artifact types (entities, features, arrangements, visualisations, patterns, models, knowledge, specifications) and transforms characterised by standardised intents (e.g., define-unit, characterise, contextualise, abstract). To show that formalisation effort need not impede adoption, we extract workflows from research papers through supervised interaction with LLM agents, reducing the human role to review and refinement. Using this process, we constructed a library of seventeen ATWL workflows from published VA papers. Cross-workflow analysis reveals structural regularities – a recurrent meta-structure, recurring motifs, reusable building blocks, diverse iterative strategies, and cross-domain equivalences – that remain invisible in prose. We further evaluate practical utility through a controlled experiment in which the same LLM addressed two analytical problems with the library supplied either as original papers or as ATWL representations. Both forms enabled useful recommendations, but the formal representation systematically added explicit iteration structure, typed data flow, fragment-level adaptation provenance, and compactness supporting scaling beyond what prose libraries can fit in an LLM’s context. ATWL enables a transition from narrative descriptions to formally represented, comparable, and reusable analytical knowledge.

[HC-6] AI Content Moderation in Therapy Conversations

Link: https://arxiv.org/abs/2605.25454
Authors: Jiwon Kim, Claire Wang, Taeung Yoon, Sabelle Huang, Koustuv Saha
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Comments:

Abstract:Large language models (LLMs) are increasingly being used for emotional support. They are also being developed for formal therapy purposes. However, LLMs like ChatGPT or Llama are often developed with content moderation guardrails that prevent them from discussing sensitive subjects with users for both liability and safety purposes, and this inability to broach these subjects may affect their capacity as therapists. In this study, we perform an algorithm audit on three state-of-the-art moderation systems (OpenAI’s moderation endpoint, Meta’s Llama Guard, and Google’s ShieldGemma) to investigate the extent to which these systems flag the content of real-life therapy sessions as undesirable. Our results raise implications for the limitations that users and organizations may encounter when designing LLMs to play the part of a therapist.

[HC-7] Subjective Code Preferences in Experts and Large Language Models

Link: https://arxiv.org/abs/2605.25296
Authors: Anna Mokhova, Subhabrata Dutta, Iryna Gurevych, Simone Balloccu
Subjects: Human-Computer Interaction (cs.HC)
Comments:

Abstract:Large Language Models (LLMs) have become increasingly popular for coding tasks, with subjective coding preferences being an essential element to adapt to programmers’ personal needs. Existing work overlooks such characteristics and mainly focuses on code correctness. In this study, we propose a typification of four subjective coding preference axes - complexity, commenting, modularity, and readability - motivated by common engineering habits and validated by 25 software engineers. We collect a dataset of ~3,000 paired Python code snippets reflecting these axes, annotated by 73 experts who rate their preferences on a Likert scale. Using our dataset, we study how LLMs handle subjective coding preferences. We present 13 LLMs with pairs of solutions to the same programming task, first as textual descriptions and then as concrete code snippets. We find that models often prefer one option in natural language but the opposite when evaluating code. More consistent models (i.e., those that are coherent in their choices between deeds and words) frequently reveal positional bias: swapping the order of options changes the preferred alternative. We then use the five most consistent models to re-annotate the dataset. Compared to humans, models show polarized Likert distributions and notable divergence in ratings. A case study on GPT-5 reveals reliance on external assumptions and brittle reasoning.

[HC-8] Working Relations

Link: https://arxiv.org/abs/2605.25260
Authors: Steven J. Jackson
Subjects: Human-Computer Interaction (cs.HC)
Comments:

Abstract:This paper offers a concept of working relations as a complement and extension to existing theories of maintenance, care and repair. Building on the cases of an umbrella, a tractor and a pond, it advances seven propositions that might guide and inform further work and thinking in this space. It concludes with the challenging figures of Chernobyl, nickel extraction, and AI, and argues for the centrality of working relations to more generative and pluralistic relations with the things and worlds around us.

[HC-9] Evidence-Linked Radiology Reporting: A Human-Supervised Reference Architecture for Structured Imaging Intelligence

Link: https://arxiv.org/abs/2605.25120
Authors: Houman Kazemzadeh, Kamyar Naderi
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Technical report, 27 pages, 2 figures, 12 tables, 1 listing; reference architecture paper; does not report clinical outcomes or validated diagnostic performance

Abstract:Radiology reports remain the primary mechanism by which imaging findings are communicated to clinical teams. However, much of the structured information behind these reports, including measurements, image evidence, prior comparisons, lesion identity, uncertainty, and terminology, often remains trapped in free text or fragmented across picture archiving and communication systems, radiology information systems, reporting workstations, worksheets, advanced visualization tools, and electronic health records. This paper proposes a human-supervised, evidence-linked reference architecture for structured radiology reporting. The framework combines exam-specific templates, speech-to-structure processing, measurement and segmentation capture, controlled AI-assisted drafting, and standards-based interoperability using DICOM, DICOM Structured Reporting, DICOM Segmentation, HL7 FHIR, RadLex, SNOMED CT, LOINC, and UCUM. The system is positioned not as an autonomous report generator, but as a structured intelligence layer for enterprise imaging that supports reviewed reporting, longitudinal comparison, clinical data reuse, governance, and integration with PACS, RIS, EHR, analytics, and registry workflows. The paper also discusses modality-specific deployment considerations, clinical safety risks, validation requirements, cybersecurity, privacy, quality management, and regulatory boundaries for AI-assisted radiology reporting systems.

[HC-10] Intent Signal Theory: A Computational Framework for Intent-State Control in Human-AI Interaction

Link: https://arxiv.org/abs/2605.25058
Authors: Gang Peng
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 10 pages, 2 figures. Theoretical framework paper grounded in four companion empirical studies. Data and code repository: this https URL

Abstract:Current AI interaction models treat the prompt as the primary object of exchange, omitting a critical layer: the user’s latent source intent, the goal state preceding and motivating the prompt. Here we introduce Intent Signal Theory (IST), a computational framework that formalises this missing intent layer. IST distinguishes four objects routinely conflated: latent source intent (I*), observable intent proxy (I-hat), encoded carrier (P), and model output (O). It formalises dimensional weights, encoding masks, structural and fidelity recovery scores, and public-private intent decomposition. The Theorem of Irreversible Intent Loss establishes that private intent absent from the carrier cannot be recovered beyond generic substitution. Evidence from four companion studies spanning six LLMs, three languages and three task domains shows structural-fidelity splits, human-validated metric dissociation, and weight-tolerance plateaus consistent with IST’s predictions. IST reframes prompt engineering as intent-protocol design and identifies a computational layer that current AI systems lack.

[HC-11] Macaron-A2UI: A Model for Generative UI in Personal Agents

Link: https://arxiv.org/abs/2605.24830
Authors: Fancy Kong, Congjie Zheng, Murphy Zhuang, Rio Yang, Sueky Zhang, Hao Fu, Gene Jin, Song Cao, Kaijie Chen, Andrew Chen, Pony Ma
Subjects: Human-Computer Interaction (cs.HC)
Comments:

Abstract:As personal agents evolve to handle complex, user-centric tasks, static plain-text chat is rapidly becoming a bottleneck. Generative UI emerges as the necessary new interface layer, dynamically synthesizing the right controls, options, and state from the interaction context in real time. We present Macaron-A2UI, a model for Generative UI in personal agents. Our goal is to move beyond text-only interaction by enabling agents to generate natural language together with lightweight, executable UI actions for information collection, preference refinement, confirmation, and multi-goal organization. We build a large-scale Generative UI corpus from heterogeneous dialogue sources, introduce A2UI-Bench for controlled evaluation, and train 30B, 235B and 754B models with parameter-efficient LoRA-based supervised fine-tuning followed by reward-driven reinforcement learning. The best Macaron-A2UI model reaches 75.6 overall on A2UI-Bench without explicit schema hints, surpassing the strongest full-schema frontier baseline. We release the models, benchmark, and evaluation protocol to support future work on Generative UI for personal agents.

[HC-12] “It Felt a Bit Eerie”: Exploring Humanlike Interactions During Collaborative Writing with an Artificial Agent

Link: https://arxiv.org/abs/2605.24729
Authors: Michael Yin, Angela Chiang, Samuel Rhys Cox, Robert Xiao
Subjects: Human-Computer Interaction (cs.HC)
Comments: 29 pages, 3 figures

Abstract:While human-AI collaboration systems have increasingly been built to increase efficiency or support creativity, little work has examined how the design of interactions shapes the social connection between human and artificial agent. We examine how the temporal and visual dimensions of collaboration shape the experience of a writing task. Specifically, we built three variants of an AI-assisted text editor along a spectrum of simulated humanlike interaction (synchronous and with a cursor) to machinelike interaction (asynchronous and without a cursor), and conducted a comparative user study (n=48). Our exploratory findings suggest that synchronous suggestions increased efficiency but led to contextual misalignment, while a visual cursor increased intent understanding but evoked feelings of surveillance. Taken together, humanlike design of artificial agents can create positive social expectations but also elicit social costs, especially without the alignment present in human-human collaboration. We extend our findings into design implications and ethical considerations when building human-AI collaboration systems.

[HC-13] Hardware-Aware Federated Learning for Speech Emotion Recognition

Link: https://arxiv.org/abs/2605.24712
Authors: Beyazit Bestami Yuksel, Emrah Dikbiyik
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Comments: 4 pages, 3 figures, 4 tables

Abstract:Federated learning (FL) enables privacy-preserving collaborative training across distributed edge devices, but real deployments involve heterogeneous clients with different processing power, memory capacity, and communication latency, which often increase round duration and system cost. This paper proposes a hardware-aware federated learning framework for emotion recognition on session-partitioned IEMOCAP that integrates hardware profiling, top-K client selection, and adaptive local epochs within a unified training loop. We compare the method against FedAvg, FedProx, and random top-K selection under a non-IID setup and show that, across 50 federated rounds and 5 independent trials, the proposed approach achieves competitive validation accuracy (0.352), reduces total training time by about 36.5% compared to FedAvg, and lowers cumulative communication cost by 40%.
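
The abstract names three ingredients (hardware profiling, top-K client selection, adaptive local epochs) but does not give the scoring rule or the epoch schedule. The following is a minimal sketch of how such a round plan could be assembled; `ClientProfile`, `hardware_score`, and the linear epoch scaling are all hypothetical choices made for illustration, not the authors' method.

```python
from dataclasses import dataclass


@dataclass
class ClientProfile:
    """Hypothetical per-client hardware profile."""
    client_id: str
    compute_score: float  # relative processing speed, higher is faster
    memory_gb: float      # available memory
    latency_ms: float     # round-trip communication latency


def hardware_score(p: ClientProfile) -> float:
    # Assumed scoring rule: reward compute and memory, penalize latency.
    return p.compute_score * p.memory_gb / (1.0 + p.latency_ms / 100.0)


def select_top_k(profiles: list[ClientProfile], k: int) -> list[ClientProfile]:
    # Keep the k clients with the highest hardware score.
    return sorted(profiles, key=hardware_score, reverse=True)[:k]


def adaptive_epochs(p: ClientProfile, best_score: float,
                    max_epochs: int = 5, min_epochs: int = 1) -> int:
    # Assumed schedule: scale local epochs linearly with relative score.
    fraction = hardware_score(p) / best_score
    return max(min_epochs, round(max_epochs * fraction))


profiles = [
    ClientProfile("a", compute_score=4.0, memory_gb=8.0, latency_ms=50.0),
    ClientProfile("b", compute_score=2.0, memory_gb=4.0, latency_ms=100.0),
    ClientProfile("c", compute_score=1.0, memory_gb=2.0, latency_ms=300.0),
    ClientProfile("d", compute_score=2.0, memory_gb=8.0, latency_ms=20.0),
]

selected = select_top_k(profiles, k=2)
best = hardware_score(selected[0])
plan = {p.client_id: adaptive_epochs(p, best) for p in selected}
print(plan)
```

In this toy setup the slow, high-latency clients `b` and `c` are skipped for the round, and the weaker of the two selected clients is assigned fewer local epochs, which is one way a round's duration could be kept from being dominated by stragglers.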

[HC-14] Routing Cybersecurity Awareness Training by FFM Personality Trait: A Quasi-Experimental Evaluation

Link: https://arxiv.org/abs/2605.24551
Authors: Glory Okwata, Mohammad A. Razzaque
Subjects: Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
Comments: Submitted to Computers & Security (Elsevier) Journal

Abstract:Cybersecurity awareness training has historically adopted a one-size-fits-all approach, despite established individual differences in how users process and retain security information. Personality has been proposed as one axis along which training content might be tailored; yet no prior study has implemented and empirically evaluated a complete personality-conditional system end-to-end. This paper reports the design, implementation, and quasi-experimental evaluation of TailoredSec, a mobile cybersecurity awareness application that routes training content based on a user’s dominant Five-Factor Model (FFM) personality trait, as measured by the ten-item Big Five Inventory (BFI-10). Seventy-four UK-based adults were allocated to a traditional video-training condition (n = 40) or a personality-conditional condition (n = 34). Both groups completed a four-item scenario-based pre-assessment (scored 0–40), a single training session, and an equivalent post-assessment. The personality-conditional group additionally completed the BFI-10 and was routed to one of four training modules covering five FFM traits (Conscientiousness and Neuroticism share a module). Pre-assessment scores did not differ between groups (t(69.1) = 0.43, p = .67), confirming baseline equivalence. The personality-conditional group scored significantly higher on the post-assessment (M = 35.88, SD = 5.00 vs M = 30.75, SD = 10.23; Welch’s t(58.5) = 2.81, p = .007; Cohen’s d = 0.62; 95% CI [1.47, 8.79] marks), with a pass-rate of 100% versus 77.5% (Fisher’s exact p < .01). These results offer preliminary support for personality-conditional content routing as a feasible design principle for cybersecurity awareness training.
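
The reported post-assessment statistics can be roughly cross-checked from the summary values in the abstract alone. The sketch below applies the standard Welch's t-test and Welch-Satterthwaite degrees-of-freedom formulas, plus a pooled-SD Cohen's d; the function name `welch_summary` is hypothetical, and using the pooled SD for d is an assumption, since the abstract does not say which variant was used.

```python
import math


def welch_summary(m1, sd1, n1, m2, sd2, n2):
    """Welch's t, Welch-Satterthwaite df, and pooled-SD Cohen's d
    computed from group means, standard deviations, and sizes."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2          # squared standard errors
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    pooled_sd = math.sqrt(
        ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    )
    d = (m1 - m2) / pooled_sd                       # assumed pooled-SD variant
    return t, df, d


# Personality-conditional group (n = 34) vs video-training group (n = 40).
t, df, d = welch_summary(35.88, 5.00, 34, 30.75, 10.23, 40)
print(f"t = {t:.2f}, df = {df:.1f}, d = {d:.2f}")
```

This gives t ≈ 2.80 with df ≈ 58.5 and d ≈ 0.62, in line with the reported t(58.5) = 2.81 and d = 0.62; the small gap in t is what one would expect from recomputing with means and SDs rounded to two decimals.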

[HC-15] TRAFA: Anticipating User Actions to Reduce Errors in Procedural Tasks with Predictive Feedback

Link: https://arxiv.org/abs/2605.24526
Authors: Sassan Mokhtar, Lars Doorenbos, Fatemeh Jabbari, Marius Bock, Dominik Bach, Juergen Gall
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Interactive assistance systems typically provide feedback after an action has been completed, supporting error recovery but not preventing the error itself. We present TRAFA, a real-time predictive feedback system for procedural tasks that intervenes before errors are committed. TRAFA operationalizes predictive feedback through a Track-Forecast-Act framework that tracks hand and object state, forecasts user motion conditioned on scene context, and triggers feedback when a predicted action is likely to violate task constraints. We instantiate this pipeline in a sequential assembly setting and evaluate it through both technical benchmarking and a controlled user study against conventional reactive feedback. Our results show that predictive feedback improves task accuracy and efficiency while maintaining a comparable number of feedback events. These findings position feedback timing as a key dimension in system design and show how real-time anticipation can be integrated into interactive systems to prevent errors before they occur.

[HC-16] PACT: Proactive Asking for Continual Task Assistance in Human-Robot Collaboration

Link: https://arxiv.org/abs/2605.24350
Authors: Chengbo He, Sheng Li, Chenyang Ma, Bochao Zou, Li Sun, Jiansheng Chen, Junliang Xing, Yuanchun Shi, Huimin Ma
Subjects: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Robotic assistants in long-term human-robot collaboration need to assist users under partial observations while leveraging cross-day interaction history. However, human traits and routines are often unknown at the beginning of collaboration, making passive infer-then-act assistance ineffective and inefficient. To address this challenge, we study a cross-day proactive asking setting for continual task assistance and propose PACT (Proactive Asking for Continual Task Assistance), an ask-or-act framework that determines whether clarification should be sought before taking action. PACT leverages current observations together with accumulated interaction history to evaluate contextual sufficiency, enabling the robot to provide more reliable assistance and progressively adapt to the user over time. We implement its primary learned instantiation using reinforcement learning and evaluate alternative instantiations under the same framework. To assess such behavior, we further introduce a clarification utility metric that quantifies the trade-off between assistance accuracy and the frequency of clarification requests. Experiments in multi-day embodied collaboration scenarios demonstrate that, compared with passive inference baselines, PACT consistently improves both assistance accuracy and clarification utility, highlighting the importance of proactive asking in continual human-robot collaboration.

[HC-17] Me, Myself, and My Voice: Exploring Cultural and Linguistic Identity in AAC AI-generated Voices

Link: https://arxiv.org/abs/2605.24337
Authors: Tobias Weinberg, Aaleyah Lewis, Ricardo E. Gonzalez Penuela, Weicong Hong, Jennifer Mankoff, Thijs Roumen
Subjects: Human-Computer Interaction (cs.HC)
Comments: 17 pages, 7 figures

Abstract:Voice is a central element of identity. We recognize people by their voice, and we uniquely express who we are with it. For people who rely on augmentative and alternative communication (AAC) systems, such as speech-generating devices (SGD), the device’s voice becomes an identity marker others associate with them. Yet, it is hard to find a voice that truly aligns with one’s identity both linguistically and culturally. Although modern AI-generated voices can reproduce diverse accents and speaking styles, AAC users still lack accessible ways to articulate how they want an identity-aligned voice to sound. We first conducted a survey of AAC users (across eight countries) to characterize current voice representation, finding that non-binary, transgender, and non-US-born respondents rated their current voice’s support for identity alignment consistently lower than other respondents. To examine how AAC users respond to voices designed to reflect their cultural identity, we built a tool that elicits cultural markers through guided questions and generates personalized voice candidates for participants to hear and reflect on. After participants heard the voices, we interviewed them to examine what it means for a voice to feel culturally representative, how they interpreted voices with cultural connotations, and how these voices shaped their sense of identity and agency. Our findings show that cultural voice alignment runs deeper than accent or language alone; it touches on belonging, self-recognition, and what it means to be heard as who you are.

[HC-18] Tacit Signal Infrastructure: Towards AI Systems that Model Expert Sensing Over Time

Link: https://arxiv.org/abs/2605.24332
Authors: Annie Yuan
Subjects: Human-Computer Interaction (cs.HC)
Comments: 17 pages, 2 figures

Abstract:Current generative AI systems are increasingly effective at processing explicit knowledge, including retrieving information, summarising documents, generating explanations, and supporting codified workflows. However, high-level expertise also depends on tacit sensing: perceiving weak signals, recognising emerging tensions, detecting coherence degradation, and anticipating instability before formal indicators appear. Existing AI education, AI literacy, and human-AI collaboration frameworks remain centred on prompting, task execution, and productivity support and are poorly equipped to address this tacit layer of expert cognition. This vision paper argues that next-generation AI systems should move beyond explicit knowledge processing toward the longitudinal modelling of expert tacit sensing. It introduces Tacit Signal Infrastructure as a layer for capturing, structuring, modelling, interpreting, and validating expert tacit signals over time. It further defines Long-term Cognitive Operations as the practices required to maintain and govern such systems, including memory curation, semantic organisation, tacit signal modelling, reasoning calibration, and cognitive governance. Building on this framing, the paper proposes the Cognitive Operations Manager as a prototype AI-native professional role for coordinating tacit signal modelling, semantic modelling, AI system calibration, expert validation, and ethical governance. It also introduces the Cognitive Operations Research and Training Framework (CORTF) to support research, education, and workforce development. The paper contributes a conceptual foundation for designing AI systems that model expert sensing over time, positioning cognition as an infrastructural, operational, and professional domain in persistent human-AI systems.

[HC-19] End-to-End Intracortical Speech Decoding from Neural Activity

Link: https://arxiv.org/abs/2605.24313
Authors: Owais Mujtaba Khanday, Jose A. Gonzalez-Lopez, Marc Ouellet, Alberto Galdon, Gonzalo Olivares Granados
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Accepted at Odyssey 2026 (Lisbon)

Abstract:Current high-performing intracortical speech neuroprostheses achieve low word error rates but typically rely on external language models during inference, increasing memory, computation, and latency. In this work, we investigate whether meaningful character-level decoding is achievable without such models. We propose an end-to-end Conformer-based neural decoder trained directly on intracortical recordings from a participant with amyotrophic lateral sclerosis (ALS). Without any external language model, the system achieves a character error rate (CER) of 23.80% on held-out validation data. Analysis shows that performance variability is driven by inter-session signal degradation, while dominant errors arise from incorrect word boundary segmentation. These results demonstrate that effective character-level decoding is possible in a fully end-to-end framework, providing a strong neural signal for downstream linguistic processing.
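
For readers unfamiliar with the metric, character error rate is conventionally the character-level Levenshtein distance between the decoded text and the reference, divided by the reference length. The sketch below shows that standard definition; it is not the authors' evaluation code, and the helper names `edit_distance` and `cer` are chosen here for illustration.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]                    # deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# A dropped space counts as a single character error.
print(cer("the cat", "thecat"))
print(cer("hello world", "helo wrld"))
```

Because spaces are ordinary characters under this definition, a missed or misplaced word boundary costs one edit each, which is why segmentation mistakes like those the abstract describes show up directly in CER.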

[HC-20] Modernizing User Privacy Preference Measurement through GPPI: A GDPR-aligned Privacy Preference Item Bank

Link: https://arxiv.org/abs/2605.24307
Authors: Yahya Hmaiti, Mykola Maslych, Amirpouya Ghasemaghaei, Trung Cuong Dang, Corey Pittman, David Mohaisen, Joseph J. LaViola Jr
Subjects: Human-Computer Interaction (cs.HC); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
Comments:

Abstract:Privacy measurement instruments (e.g., CFIP, IUIPC, PAQ) predate GDPR by over a decade and measure privacy concerns, which are distinct from preferences for regulatory protections (e.g., data portability, erasure, automated decision-making rights). This leaves practitioners without tools to assess whether users value the GDPR mechanisms implemented in compliant policies. We developed a GDPR-grounded privacy preference measurement item bank by extracting 669 statements from all 99 GDPR articles, validated by: (1) two-round expert review achieving full consensus on accuracy, (2) semantic clustering into 10 parent themes and 87 subthemes, and (3) consensus review with 50 privacy experts (5 per theme) using a vote retention threshold of at least 4/5. The final 527-item bank comprises 9 parent themes and 73 subthemes (18 to 112 items per parent theme, 1 to 29 per subtheme), enabling targeted measurement across granularities while covering GDPR at a mean pairwise expert agreement of approx. 85%. This work introduces a complementary measurement dimension aligning user preferences with regulatory mechanisms.

[HC-21] Sketch Bug: Using Sketch-Based Input for Interactive Code Debugging

Link: https://arxiv.org/abs/2605.24228
Authors: Helen Weixu Chen, Daniel Vogel
Subjects: Human-Computer Interaction (cs.HC)
Comments: Accepted at Graphics Interface 2026 (GI 2026). 9 pages, 7 figures

Abstract:We investigate sketch-like pen input as an alternative way to support execution control in interactive debugging. In our interface, programmers draw lightweight marks to set breakpoints, use symbolic strokes to control execution, and extend strokes into spirals to repeat traversal actions. The prototype combines gesture recognition with Python execution tracing in a conventional editor interface. In a controlled study with 24 programmers, we compared the sketch interface with conventional mouse-and-keyboard input on debugging tasks that required breakpoint placement, step-wise execution, and runtime state inspection. The results show that sketch-like input can support these execution-control tasks, while also introducing challenges in precision, recognition, and gesture recall. Our findings suggest that pen input is most promising where debugger interactions benefit from spatial grounding or continuous movement, rather than as a wholesale replacement for conventional debugging controls.

[HC-22] A Taxonomy of Metacognitive Learning Scenarios in Professional Contexts: Integrating Systems Theory with Empirical Constraints

Link: https://arxiv.org/abs/2605.24142
Authors: David C. Gibson, Mary Elizabeth Azukas, Meryem Yilmaz Soylu
Subjects: Human-Computer Interaction (cs.HC)
Comments:

Abstract:Metacognitive theories provide foundational frameworks for understanding self-regulated learning, yet they lack systematic integration into comprehensive scenario taxonomies capable of guiding AI-enhanced professional development interventions. Existing models inadequately specify how metacognitive components combine into distinct learning scenarios or how professionals progress from novice to expert functioning. A six-node open systems model, consisting of Environment, Input, Processes, Structures, Output, and Feedback, was developed by synthesizing four major theoretical frameworks. Combinatorial enumeration generated 216 mathematically possible learning scenarios. Four sequential constraint-based filters, including psychological plausibility, educational relevance, measurement feasibility, and intervention potential, informed by empirical workplace learning research, reduced this space to 24 priority scenarios. Five focal scenarios were subjected to formal concept analysis. The 24 priority scenarios were distributed across three developmental tiers: novice, with 6 scenarios; developing, with 10 scenarios; and expert/adaptive, with 8 scenarios. Analysis revealed critical theoretical gaps regarding the dynamic reconfiguration of monitoring-control relationships across expertise levels, the role of feedback topology in metacognitive development, and trade-offs between internal integration and external connectivity. Multiple viable developmental trajectories were identified. The taxonomy enables targeted, scenario-specific professional development interventions and generates testable predictions for advancing metacognition theory beyond primarily descriptive accounts.

[HC-23] Human-AI Collaboration in Science at Scale: A Global Large-scale Randomized Field Experiment

Link: https://arxiv.org/abs/2605.24180
Authors: Binglu Wang, Weixin Liang, Jiahui Xue, Yuhui Zhang, Hancheng Cao, Dashun Wang, Yian Yin
Subjects: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Collaboration is the defining mode of modern science, yet its core mechanism – feedback – remains hard to observe, difficult to scale, and unequally distributed. Here we test whether large language models (LLMs) can contribute to this hidden but vital practice and reallocate scientific feedback, an essential yet scarce resource for knowledge production. In a global large-scale randomized field experiment, we delivered customized LLM-generated feedback for over 31,000 arXiv preprints across 150 fields and more than 45,000 researchers from 133 geographic regions. Relative to controls, authors who received feedback had a significantly higher likelihood of revising their manuscripts, corresponding to a 12.55% relative increase over the baseline revision rate. Exposure to AI feedback also increased authors’ subsequent use of LLM tools in their future papers, suggesting longer-run shifts in scientific practice. These effects were strongest among authors from non-English-dominant research regions, manuscripts less embedded in the scholarly literature, and teams with lower h-indexes and earlier career stages, consistent with the idea that AI feedback may provide the greatest benefit where access to timely critique is otherwise limited. Together, these findings provide causal evidence that structured AI-based interventions can transform access to scientific feedback from a largely private advantage into a more widely distributed resource, with broader implications for productivity, equity, and capacity across the global research system.

[HC-24] Metacognition Should Be the Scientific Framework for Bounded and Effective Self-Governance in Generative AI

Link: https://arxiv.org/abs/2605.23981
Authors: Eugene Yu Ji,Igor Grossmann,Amir-Hossein Karimi
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Systems and Control (eess.SY)
Comments: 16 pages, 1 figure, 1 table

Click to view abstract

Abstract:Generative AI research increasingly confronts a shared problem: systems must sustain yet govern their own generative activity when uncertainty is high, evidence is missing, or context is insufficient. This position paper argues that metacognition should become the scientific framework for bounded and effective self-governance in generative AI, where output generation is properly evaluated together with the capacities through which generative systems navigate and regulate their own activity. We advance this position by showing that bounded and effective AI self-governance requires metacognitive alignment across computational, algorithmic, and ecological levels. At the computational level, metacognition specifies the meta-level functions a system is meant to serve, such as monitoring, evaluation, control, and adaptation. At the algorithmic level, these functions are realized through procedures such as elicitation, iteration, and modularization. At the ecological level, metacognitive signals become meaningful, actionable, and accountable within the interface, workflow, and accountability arrangements. Metacognition thus makes it possible to conceive generative AI as both capable and well-governed, rather than treating capability and governance as competing aims.

Computer Vision

[CV-0] TriSplat: Simulation-Ready Feed-Forward 3D Scene Reconstruction

Link: https://arxiv.org/abs/2605.26115
Authors: Weijie Wang,Zimu Li,Jinchuan Shi,Zeyu Zhang,Botao Ye,Marc Pollefeys,Donny Y. Chen,Bohan Zhuang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL, Code: this https URL

Click to view abstract

Abstract:Sparse-view 3D reconstruction is increasingly addressed with feed-forward splatting networks that predict explicit primitives directly from images. Yet most existing methods remain centered on Gaussian primitives and expose surfaces only indirectly: extracting a usable mesh for downstream simulation, physics reasoning, or embodied interaction still requires expensive post-hoc steps that break the feed-forward promise. This limitation is especially pronounced in pose-free settings, where scene structure and camera parameters must be estimated jointly from sparse observations. We present TriSplat, a feed-forward reconstruction network that represents scenes with oriented triangle primitives and directly exports simulation-ready mesh scenes from a single forward pass. Given input images, the network predicts local 3D point maps, triangle attributes, camera poses, and optional intrinsics. Rather than regressing triangle orientation as an unconstrained latent variable, our approach constructs geometry normals from the predicted point maps, refines them with an image-conditioned normal head, and converts them into stable local frames for triangle parameterization. A mono-normal bootstrap schedule further stabilizes early training, while opacity and blur scheduling progressively sharpens the learned surface representation for direct mesh extraction. Experiments on RealEstate10K and DL3DV show that this representation produces more geometry-faithful reconstructions than Gaussian feed-forward baselines while maintaining competitive novel-view rendering quality. Because the rendering primitives are themselves surface triangles, the output can be directly ingested by physics engines, collision detectors, and standard rendering pipelines without any conversion, making it a practical simulation-ready solution for feed-forward 3D scene reconstruction.

[CV-1] AnyScene: Towards Highly Controllable Driving Scene Generation at Anywhere and Beyond

Link: https://arxiv.org/abs/2605.26113
Authors: Haiming Zhang,Junfei Zhou,Feng Jiang,Jingzhong Li,Zhenglong Guo,Penglin Dai,Jifeng Dai,Yan Xie,Benjin Zhu
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Work in progress. Project page: this https URL

Click to view abstract

Abstract:Generating high-fidelity and controllable synthetic data is critical for advancing end-to-end autonomous driving, particularly for addressing the long tail of rare safety-critical scenarios. Existing occupancy-guided methods typically rely on shallow conditioning mechanisms and reference-frame-dependent video synthesis, which limits fine-grained controllability from arbitrary BEV layouts and restricts their applicability for scalable simulation. In this paper, we propose AnyScene, a unified occupancy-centric framework for driving scene generation. AnyScene generates semantic occupancy sequences from BEV layouts through a Spatial-Temporal Occupancy Diffusion Transformer that jointly tokenizes BEV and occupancy features in an autoregressive manner. This design enables precise controllability from cross-dataset and user-defined BEV inputs while naturally supporting long-horizon generation. Building upon the generated occupancy, a Geometry-Grounded View Expansion module treats occupancy as the canonical spatial representation and synthesizes temporally consistent multi-view driving videos in a reference-free and autoregressive fashion, supporting flexible camera configurations at inference time. Extensive experiments demonstrate that AnyScene achieves state-of-the-art performance in both occupancy and video generation. It exhibits strong generalization to unseen and customized layouts, and provides measurable benefits for downstream tasks such as sparse-view 3D reconstruction.

[CV-2] Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation

Link: https://arxiv.org/abs/2605.26111
Authors: Shuhong Zheng,Aashish Kumar Misraa,Yu-Teng Li,Yu-Jhe Li,Igor Gilitschenski
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: 33 pages, 18 figures, Project Page: this https URL

Click to view abstract

Abstract:Subject-driven image generation aims to synthesize new images that preserve the identity of the given subject while following textual instructions. Existing approaches often encode text and reference images separately. This limits cross-modal reasoning abilities and causes copy-paste artifacts. Recent frameworks that connect multimodal models and diffusion models improve instruction following, but largely overlook identity preservation. To address these limitations, we condition diffusion models on Multimodal Large Language Models (MLLMs) that jointly encode text and reference images, and augment it with VAE-based identity conditioning. A novel Dual Layer Aggregation (DLA) module is designed to aggregate multi-level MLLM features for optimal conditioning, and a multi-stage denoising strategy is applied to progressively balance the semantic information from MLLM and fine-detail identity from VAE during inference. Extensive experiments demonstrate that our approach harmonizes multimodal understanding with identity preservation, mitigates copy-paste issues, and achieves superior performance regarding human preference on subject-driven image generation. Our project website is available at this https URL.

[CV-3] Helix4D: Complex 4D Mesh Generation

Link: https://arxiv.org/abs/2605.26109
Authors: Jiraphon Yenphraphai,Jianqi Chen,Jian Wang,Gordon Qian,Sergey Tulyakov,Rameen Abdal,Raymond A. Yeh,Peter Wonka,Chaoyang Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Click to view abstract

Abstract:Current video-to-4D methods struggle with complex topology changes, transparent materials, thin structures, and inner surfaces. We present Helix4D, a dynamic mesh generation framework that inherits the expressive representation of Trellis2 and adapts it from image-to-3D to video-conditioned 4D generation. Our design arises from two key questions: (a) how to enable Trellis2’s frame-local attention to share information across frames while preserving its pretrained quality on rare cases such as transparent objects and inner surfaces, and (b) how to inject temporal information into a purely 3D positional encoding without breaking pretrained capabilities. We address (a) with sliding-window cross-frame attention and an anchor on the first frame. The first frame is generated by the base Trellis2 model and injected into our model, letting it inherit Trellis2’s quality in rare cases through cross-frame attention. We address (b) with a 4D temporal encoding that repurposes redundant low-frequency spatial RoPE bands for time, extending the encoding from 3D with no additional parameters. Extensive experiments show the effectiveness of Helix4D for high-quality dynamic mesh generation on ActionBench and our own challenging complex dynamics set.

[CV-4] Reinforcing Few-step Generators via Reward-Tilted Distribution Matching

Link: https://arxiv.org/abs/2605.26108
Authors: Yushi Huang,Xiangxin Zhou,Ruoyu Wang,Chi Zhang,Jun Zhang,Tianyu Pang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code and models are available at this https URL

Click to view abstract

Abstract:Recent advances in few-step diffusion distillation have enabled efficient image generation, yet aligning these models with human preferences remains challenging. We propose Reward-Tilted Distribution Matching Distillation (RTDMD), a two-stage framework that unifies distribution matching distillation with reward-guided reinforcement learning for few-step flow generators. We show that minimizing the KL divergence to a reward-tilted teacher distribution naturally decomposes into a distribution matching term and a reward maximization term. In the first stage, we introduce Ambient-Consistent Distribution Matching Distillation (AC-DMD), which performs subinterval-wise distribution matching and augments the fake score objective with a consistency regularizer to help the fake score model track the shifting generator distribution under limited updates. In the second stage, we jointly optimize both terms: for the reward maximization term, we derive a hybrid policy gradient that combines a GRPO-style estimator for the stochastic intermediate transitions with direct reward backpropagation through the deterministic final step, and further introduce step-subset GRPO (SubGRPO) to reduce variance. Experiments on SD3, SD3.5, and FLUX.2 demonstrate that RTDMD establishes new state-of-the-art results across preference, aesthetic, and compositional metrics with only 4 inference steps, outperforming previous few-step text-to-image generation methods. Code and models are available at this https URL.
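
The decomposition mentioned in the abstract can be illustrated with a standard identity. This is a sketch under assumptions, not the paper's derivation: we assume the reward-tilted teacher takes the common exponential form, and the symbols q_θ (generator distribution), p_T (teacher distribution), r (reward), β (temperature), and Z (normalizer) are our own notation.

```latex
\begin{aligned}
p^{*}(x) &= \frac{1}{Z}\, p_T(x)\, \exp\!\big(r(x)/\beta\big),
\qquad Z = \mathbb{E}_{x \sim p_T}\!\big[\exp\!\big(r(x)/\beta\big)\big] \\[4pt]
\mathrm{KL}\big(q_\theta \,\|\, p^{*}\big)
 &= \mathbb{E}_{x \sim q_\theta}\!\big[\log q_\theta(x) - \log p_T(x) - r(x)/\beta + \log Z\big] \\[4pt]
 &= \underbrace{\mathrm{KL}\big(q_\theta \,\|\, p_T\big)}_{\text{distribution matching}}
 \;-\; \frac{1}{\beta}\,\underbrace{\mathbb{E}_{x \sim q_\theta}\!\big[r(x)\big]}_{\text{reward maximization}}
 \;+\; \log Z
\end{aligned}
```

Since log Z does not depend on θ, minimizing the left-hand side under this assumed form amounts to matching the teacher while maximizing expected reward, with β setting the trade-off.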

[CV-5] On-Policy Adversarial Flow Distillation for Autoregressive Video Generation

Link: https://arxiv.org/abs/2605.26105
Authors: Yang Luo,Shengju Qian,Xiaohang Tang,Zirui Zhu,Yong Liu,Xin Wang,Yang You
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Autoregressive video generators are attractive for streaming, long-horizon, and interactive applications, but distilling strong black-box teachers into causal students remains difficult. The student must learn under its own rollout distribution, whereas practical teachers may expose only prompt-conditioned completed videos and may differ in architecture, capacity, temporal design, and sampling schedule. This interface makes supervised fine-tuning off-policy, score-based distillation inapplicable, and direct adversarial imitation too sparse for denoising-time credit assignment. We propose Adversarial Flow Distillation (AFD), an on-policy framework for heterogeneous black-box video distillation. AFD queries the teacher and rolls out the current student on the same prompts, trains a prompt-paired Bradley-Terry discriminator to estimate clean-sample teacher-student discrepancy, and converts the resulting on-policy advantage into forward-process flow-matching updates on the student’s own noised states. Thus, AFD provides dense velocity-field supervision while requiring no teacher scores, latents, denoising trajectories, step alignment, or reverse-chain reinforcement learning. Experiments across two causal AR student families show that AFD consistently improves motion- and physics-sensitive generation while preserving general video quality, and ablations validate the importance of adaptive on-policy feedback and forward-process credit assignment. The method requires only clean teacher videos and student rollouts, providing a practical route for distilling proprietary or heterogeneous video generators into efficient autoregressive students.
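
For readers unfamiliar with the Bradley-Terry model the abstract refers to, here is a minimal sketch of a pairwise preference loss. It is not AFD's implementation: the function names are hypothetical, and we assume the discriminator emits one scalar score per (prompt, video) pair, with the teacher sample treated as the preferred item.

```python
import math


def teacher_preference_prob(teacher_score: float, student_score: float) -> float:
    """Bradley-Terry probability that the teacher sample is preferred.

    Equals sigmoid(teacher_score - student_score), computed in a
    numerically stable way for large positive or negative margins.
    """
    margin = teacher_score - student_score
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)


def bt_discriminator_loss(teacher_score: float, student_score: float) -> float:
    """Negative log-likelihood of 'teacher preferred' for one prompt-paired sample.

    -log(sigmoid(margin)) is softplus(-margin), written in its stable form.
    """
    margin = teacher_score - student_score
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Equal scores: the discriminator cannot tell the two samples apart.
    print(teacher_preference_prob(0.0, 0.0))
    print(bt_discriminator_loss(0.0, 0.0))
```

When the two scores are equal the preference probability is 0.5 and the loss is log 2. How AFD turns the discriminator output into an advantage for flow-matching updates is not specified in the abstract and is not modeled here.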

[CV-6] EVIDENT: Routing MLLM Adaptation through Entity-Grounded Visual Evidence for Cross-Domain Video Temporal Grounding

Link: https://arxiv.org/abs/2605.26104
Authors: Geo Ahn,Jiwook Han,Youngrae Kim,Joonseok Lee,Jinwoo Choi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Fine-tuning MLLMs for Video Temporal Grounding (VTG) often improves in-domain performance but degrades sharply under domain shift. In this work, we find that this failure is primarily driven not just by unseen query concepts, but by visual domain shift, which prevents the model from coupling its learned temporal localization knowledge with its inherent entity-attention capability. To address this, we introduce EVIDENT, a parameter-efficient adaptation framework that anchors temporal grounding in the inherent entity-attention of pre-trained MLLMs by routing VTG adaptation through explicit visual entity evidence. EVIDENT consists of three components: (i) an Entity Bottleneck Adapter that transforms dense visual tokens into compact entity-level slots, (ii) an Entity-Binding Distillation loss that instills objectness priors into the semantically unstructured MLLM visual space, guiding each slot to bind to a coherent entity, and (iii) an Entity-to-eVidence gating mechanism that leverages the captured entities as evidence, steering the model to localize moments containing query-relevant entities. Together, these components enable VTG fine-tuning to rely on entity-grounded evidence rather than brittle dataset shortcuts. Experiments on cross-domain VTG benchmarks show that EVIDENT consistently improves out-of-domain robustness while preserving competitive in-domain performance with modest parameter overhead. These results suggest that entity-level grounding is an effective inductive bias for generalizable temporal localization.

[CV-7] Global Structure-from-Motion Meets Feedforward Reconstruction CVPR2026

Link: https://arxiv.org/abs/2605.26103
Authors: Linfei Pan,Johannes Schönberger,Marc Pollefeys
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026, Highlight

Click to view abstract

Abstract:Structure-from-Motion – the process of simultaneously estimating camera poses and 3D scene structure from a collection of images – remains a central challenge in computer vision, with many open problems yet to be solved. Recent advances in feedforward 3D reconstruction have made significant strides in overcoming persistent failure cases of classical SfM methods, particularly in scenarios characterized by low texture, limited overlap, and symmetries. However, while feedforward approaches excel in these challenging conditions, they often face limitations regarding scalability, accuracy, or robustness, and typically fall short of classical methods in standard reconstruction settings. In this work, we systematically analyze these limitations and propose a new Structure-from-Motion pipeline by combining the respective strengths of classical and feedforward methods. Extensive experiments across multiple datasets show the benefits of our approach, achieving state-of-the-art results across a wide range of scenarios. We share our system as an open-source implementation at this https URL.

[CV-8] InstructSAM: Segment Any Instance with Any Instructions

Link: https://arxiv.org/abs/2605.26102
Authors: Yuqian Yuan,Wentong Li,Zhaocheng Li,Yutong Lin,Juncheng Li,Siliang Tang,Jun Xiao,Yueting Zhuang,Wenqiao Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 8 figures

Click to view abstract

Abstract:In this paper, we introduce InstructSAM, a unified and streamlined framework designed for multi-instance segmentation under arbitrary instructions. We formulate instruction-driven instance segmentation as a set-structured query prediction problem and propose an explicit reasoning-to-instance query interface that bridges a vision-language model (VLM) and SAM3. Specifically, a bank of learnable instance queries is injected into the VLM and contextualized with instruction and visual information, enabling each query to serve as an instance-aware slot. A hybrid-attention mechanism further promotes interaction among these queries, visual tokens, and instruction tokens, improving instance enumeration and reducing duplicate predictions. The resulting LLM-conditioned queries are projected into SAM3’s detector query space to drive accurate multi-instance segmentation in a single forward pass. This design equips SAM3 with high-level instruction understanding, compositional reasoning, and instance-level set prediction without modifying its core architecture. To support training and evaluation, we further construct Inst2Seg, a high-quality and large-scale instruction-based instance segmentation dataset and benchmark that couples free-form instructions with instance-level masks. Extensive experiments show that InstructSAM, at only 2B scale, achieves strong results across complex instruction-driven and phrase-level referring segmentation benchmarks, outperforming prior end-to-end methods and SAM3’s agentic pipeline while enabling efficient single-pass multi-instance prediction.

[CV-9] Pixel-Level Pavement Distress Assessment Using Instance Segmentation

Link: https://arxiv.org/abs/2605.26095
Authors: Logan Dewick(University of Wisconsin - Green Bay),Bibesh Pyakurel(University of Wisconsin - Green Bay),Kong Pheng Yang(University of Wisconsin - Green Bay),Nazim Choudhury(University of Wisconsin - Green Bay),M. G. Sarwar Murshed(University of Wisconsin - Green Bay)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 6 figures

Click to view abstract

Abstract:Automated pavement distress assessment requires more than image-level classification or coarse bounding box detection, demanding precise localization of thin, branching, and irregular cracks to achieve the geometric precision necessary for maintenance-relevant quantification. This paper presents a vision-based pavement distress analysis system based on Mask R-CNN instance segmentation and evaluates it on UWGB-StreetCrack, a custom field-collected roadway image dataset acquired with a vehicle-mounted smartphone and manually annotated with polygon labels for longitudinal cracks, transverse cracks, alligator cracks, and potholes. Five Detectron2-based Mask R-CNN backbone variants were considered under a consistent fine-tuning protocol. The best-performing model, Mask R-CNN with a ResNet-101 FPN backbone, achieved 84.23% precision, 90.04% recall, and an F1 score of 87.04% under the project-specific bounding-box matching protocol. The same model produced an aggregate predicted crack-area fraction of 2.164%, closely matching the 2.170% ground-truth crack-area fraction. To contextualize the segmentation system against a detector-oriented alternative, a CSPDarknet53-based YOLO detector was also adapted and retrained on the dataset, reaching 27.5% precision and 20.7% recall on the validation protocol. The results show that instance segmentation is a practical direction for field pavement imagery and aggregate crack-area estimation, while also exposing open challenges in annotation consistency, class imbalance, confounder rejection, and mask-level benchmarking.
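
As a quick consistency check on the figures quoted in the abstract, the reported F1 score can be recomputed from the reported precision and recall, and the gap between predicted and ground-truth crack-area fractions can be expressed as a relative error. The helper name `f1_score` is ours; only the numbers come from the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract for Mask R-CNN with a ResNet-101 FPN backbone.
f1 = f1_score(0.8423, 0.9004)

# Predicted vs. ground-truth aggregate crack-area fraction (in percent).
predicted_area, true_area = 2.164, 2.170
rel_err = abs(predicted_area - true_area) / true_area

print(f"F1 = {f1:.2%}")                        # F1 = 87.04%
print(f"relative area error = {rel_err:.2%}")  # relative area error = 0.28%
```

The recomputed F1 agrees with the 87.04% stated in the abstract, and the aggregate area estimate is within about 0.3% of the ground truth in relative terms.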

[CV-10] Channel-wise Vector Quantization

Link: https://arxiv.org/abs/2605.26089
Authors: Wei Song,Tianhang Wang,Yitong Chen,Tong Zhang,Zuxuan Wu,Ming Li,Jiaqi Wang,Kaicheng Yu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:We present Channel-wise Vector Quantization (CVQ), a novel image tokenization paradigm that replaces patch-wise tokens with channel-wise tokens. Unlike conventional vector quantization, which assigns a discrete token to each patch feature vector, CVQ quantizes each channel of the feature map. This formulation represents an image as discrete levels of visual details, rather than as a grid of spatial patches. Based on CVQ, we introduce a new visual autoregressive framework with “next-channel prediction”. Instead of rendering images patch by patch in raster order, our Channel-wise Autoregressive (CAR) model predicts image channels sequentially, producing progressively enriched visual details. Specifically, it first sketches global structure and then refines fine-grained attributes, akin to a human artist’s workflow. Empirically, we show that: (1) CVQ achieves 100% codebook utilization with a 16K+ codebook size without any bells and whistles, and substantially improves reconstruction quality over conventional VQ; and (2) CAR attains a DPG score of 86.7 and a GenEval score of 0.79, demonstrating strong effectiveness for text-to-image generation.
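
The contrast between patch-wise and channel-wise tokenization can be sketched on a toy feature map. This is an illustration only, not the paper's tokenizer: we assume each channel is flattened over its spatial positions and matched to the nearest codebook entry by squared Euclidean distance, and all function names and codebooks below are hypothetical.

```python
from typing import List

Vector = List[float]


def nearest_code(vec: Vector, codebook: List[Vector]) -> int:
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(vec, codebook[k]))


def patch_wise_tokens(fmap: List[Vector], codebook: List[Vector]) -> List[int]:
    """Conventional VQ: one token per spatial position; each code has length C."""
    num_positions = len(fmap[0])
    return [
        nearest_code([channel[p] for channel in fmap], codebook)
        for p in range(num_positions)
    ]


def channel_wise_tokens(fmap: List[Vector], codebook: List[Vector]) -> List[int]:
    """Channel-wise VQ: one token per channel; each code spans all P positions."""
    return [nearest_code(channel, codebook) for channel in fmap]


# Toy feature map with C = 2 channels and P = 3 flattened spatial positions.
fmap = [
    [0.0, 0.1, 0.9],
    [1.0, 0.9, 0.1],
]
patch_codebook = [[0.0, 1.0], [1.0, 0.0]]               # codes of length C
channel_codebook = [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]   # codes of length P

print(patch_wise_tokens(fmap, patch_codebook))      # [0, 0, 1]
print(channel_wise_tokens(fmap, channel_codebook))  # [0, 1]
```

The patch-wise path yields P tokens laid out on the spatial grid, while the channel-wise path yields C tokens that each describe the whole image, which is what makes a channel-by-channel autoregressive order possible.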

[CV-11] Paris 2.0: A Decentralized Diffusion Model for Video Generation

Link: https://arxiv.org/abs/2605.26064
Authors: Ali Rouzbayani,Bidhan Roy,Marcos Villagra,Zhiying Jiang
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 6 pages, 5 figures

Click to view abstract

Abstract:We present Paris 2.0, the first video generation model pre-trained through decentralized computation. Its training recipe builds upon Paris 1.0 (arXiv:2510.03434), the first ever open-weight Decentralized Diffusion Model (DDM), which showed that image generation can be trained without a monolithic GPU cluster. However, temporally coherent video generation had remained an open problem under decentralized training, and Paris 2.0 closes it. In low-resolution text-to-video training, against a monolithic model trained on the same data under a matched total compute budget, Paris 2.0 cuts Frechet Video Distance (FVD) from 561.04 to 279.01, a ~2.0x improvement, and lifts CLIP text-video similarity and aesthetic score.

[CV-12] Look Both Ways Before You Cross: Lifting Cross Fields From 2D Visual Priors

Link: https://arxiv.org/abs/2605.26062
Authors: Dale Decatur,Jacob Serfaty,Oded Stein,Amir Vaxman,Rana Hanocka
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page at: this https URL

Click to view abstract

Abstract:We present CrossLift, a technique for computing cross fields on meshes guided by visual features in images. We leverage powerful text-to-image priors that are capable of synthesizing images of feature-aligned quad meshes in 2D. We extract this signal as explicit per-pixel directions in the 2D images, which we then back-project to the mesh surface. We aggregate these candidate surface directions by performing two smooth interpolations on the mesh surface (first within each view and second across multiple views). We propose custom confidence-based weights for the candidate directions in each interpolation that allow us to resolve conflicts between candidates on the same face and smoothly interpolate our field to occluded faces. Our method is modular and can be used with many different 2D visual priors. We show additional applications to texture-aligned quad meshing as well as interactive cross-field design using coarse, user-drawn lines as signal. We demonstrate the effectiveness of CrossLift on a diverse set of both organic and mechanical shapes and produce quad meshes that exhibit superior semantic alignment as compared to existing methods. Project page at: this https URL

[CV-13] DRScaffold: Boosting Dense-Scene Reasoning in Lightweight Vision Language Models

Link: https://arxiv.org/abs/2605.26038
Authors: Xinrui Shi,Kai Liu,Ziqing Zhang,Jianze Li,Anqi Li,Yulun Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Lightweight vision-language models perform competitively on standard benchmarks yet fail systematically in dense-scene reasoning, where multiple objects, attributes, and relations must be jointly grounded and resolved through multi-step inference. Such capability is critical for real-world applications where models must reliably interpret cluttered environments. Yet existing training signals provide no explicit grounding between reasoning steps and the underlying visual entities and relations, leaving lightweight models free to generate fluent but visually unanchored reasoning chains. To address this gap, we first introduce DRBench, a benchmark of 14,573 questions across 2,943 images, organized into five task categories spanning three progressive reasoning layers. Building on DRBench, we propose DRScaffold, a supervised fine-tuning framework that decomposes the supervision target into four causally ordered stages, enforcing grounded reasoning without architectural modification. Experiments on three lightweight VLMs demonstrate substantial gains on DRBench while preserving or improving performance on general-purpose benchmarks. Notably, Qwen2.5-VL-3B trained with DRScaffold surpasses the frozen Qwen2.5-VL-32B on DRBench, demonstrating that structured supervision can substitute for a significant portion of model scale in dense-scene reasoning. Our code and models are available at this https URL .

[CV-14] Everything at Every Scale: Scale-Invariant Diffusion with Continuous Super-Resolution

Link: https://arxiv.org/abs/2605.26032
Authors: Zixin Jessie Chen,Zhuo Chen,Archer Wang,Jeff Gore,William T. Freeman,Congyue Deng,Marin Soljačić
Categories: Computer Vision and Pattern Recognition (cs.CV); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 29 pages, 17 figures

Click to view abstract

Abstract:Creating images from noise is image generation; reconstructing fine details from coarse inputs is super-resolution. Despite their practical differences, both can be understood as reversing information loss across scales. We introduce SKILD, a Scale-invariant K-Space Image Learning Diffusion model that unifies generation and continuous super-resolution within a single unconditional framework. Both natural images and critical physical systems exhibit scale invariance, and we leverage it to design a forward process that attenuates image content from fine to coarse scales while injecting spectrum-matched Gaussian noise, making scale an explicit coordinate of the diffusion dynamics. The same trained reverse process performs generation and continuous super-resolution by varying only the starting timestep: no task-specific architecture, no conditioning branch, no classifier-free guidance, no retraining per scale factor. Empirically, SKILD reaches FID 2.65 and Inception Score 9.63 on unconditional CIFAR-10, performs 2× to 8× super-resolution on ImageNet from a single unconditional checkpoint while outperforming conditional models across perceptual metrics, and reconstructs critical Ising models whose connected four-point correlations closely track the ground truth.

[CV-15] A Multimodal 3D Foundation Model for Light Sheet Fluorescence Microscopy Enables Few-Shot Segmentation Classification and Deblurring

Link: https://arxiv.org/abs/2605.26026
Authors: Adina Scheinfeld,Haotan Zhang,Shang Mu,Rudolf L. M. van Herten,Lucas Stoffl,Ali Erturk,Zhuhao Wu,Johannes C. Paetzold
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 pages, 3 figures

Click to view abstract

Abstract:Light sheet fluorescence microscopy (LSM) enables high-resolution, three-dimensional (3D) imaging of biological specimens, providing rich volumetric data for studying cellular organization, pathology, and vascular networks. However, the size, dimensionality, and annotation burden of LSM data make supervised deep learning approaches costly and difficult to scale. Additionally, despite the abundance of unannotated LSM volumes, foundation models for this modality remain underexplored due to computational challenges and the complexity of volumetric representation learning. In this work, we introduce a 3D foundation model for LSM data, pretrained on a large curated collection of 3D images spanning multiple organisms, stains, and imaging protocols. We learn transferable volumetric representations by jointly optimizing for masked reconstruction and image-text alignment. The pretrained backbone drastically reduces the annotation burden, enabling efficient, few-shot adaptation for varied downstream tasks. We evaluate this approach on downstream segmentation, classification, and deblurring. Our results demonstrate consistent improvements over baselines, (1) when measured using standard evaluation metrics and (2) when rigorously assessed by domain experts. This highlights the potential of foundation model pretraining to reduce annotation requirements while improving performance across diverse LSM analysis tasks. Pretrained model weights and code for pretraining and finetuning are publicly available: this https URL.

[CV-16] AdvantageFlow: Advantage-Weighted Least Squares for RL in Flow Models

Link: https://arxiv.org/abs/2605.26013
Authors: Branislav Kveton,Anup Rao,Subhojyoti Mukherjee,Krishna Kumar Singh,Viet Dac Lai
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:We introduce AdvantageFlow, a forward-process reinforcement learning algorithm for rectified flow models. Unlike Flow-GRPO, which optimizes the reverse process, we optimize an advantage-weighted forward-process prediction loss. This optimization problem is unstable when advantages are negative and the loss becomes non-convex. We stabilize it by rollout policy regularization, which reduces variance and arises from fitting a local reward-improving target distribution. We evaluate AdvantageFlow on image generation tasks with Stable Diffusion 3.5 Medium. It outperforms both Flow-GRPO and a state-of-the-art forward-process RL baseline based on negative-aware fine-tuning.
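
To see why negative advantages destabilize an advantage-weighted least-squares objective, consider one plausible form of the loss. The abstract does not give the exact objective, so the notation below is an assumption: x_0^i is a generated sample with advantage A_i, ε is Gaussian noise, x_t^i is their rectified-flow interpolation, and v_θ is the velocity network.

```latex
\begin{aligned}
x_t^{i} &= (1 - t)\, x_0^{i} + t\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \quad t \in [0, 1] \\[4pt]
\mathcal{L}(\theta) &= \mathbb{E}_{i,\, t,\, \epsilon}\Big[\, A_i \,\big\| v_\theta\big(x_t^{i}, t\big) - \big(\epsilon - x_0^{i}\big) \big\|^{2} \Big]
\end{aligned}
```

For A_i > 0 each term is a convex quadratic in the prediction v_θ and pulls it toward the target. For A_i < 0 the term is concave and unbounded below, so minimizing it pushes the prediction arbitrarily far from the target. This is the kind of instability that the regularization described in the abstract is meant to control.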

[CV-17] MIND: Multi-Scale Intent Diffusion for Text-Driven Physics-Based Humanoid Control

Link: https://arxiv.org/abs/2605.26006
Authors: Bin Li, Ruichi Zhang, Han Liang, Jingyan Zhang, Juze Zhang, Xin Chen, Jingya Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)

Abstract:Enabling physics-based humanoids to execute diverse behaviors from high-level textual commands remains a significant challenge. Existing methods typically follow either a two-stage paradigm that combines kinematic motion generation with physics-based tracking, or an end-to-end imitation-learning paradigm that directly generates actions from text. However, the former suffers from the inherent domain shift between kinematic generation and physics-based tracking, while the latter struggles with the substantial modality gap between textual commands and low-level actions, limiting effective semantic alignment. Notably, humanoid states encode rich motion dynamics that are more semantically aligned with textual descriptions than low-level actions, making them a natural basis for deriving behavioral intent. Building upon this insight, we propose MIND, a novel end-to-end diffusion framework for text-driven physics-based humanoid control that leverages behavioral intent as a semantic bridge between textual commands and low-level actions. At its core, MIND introduces a multi-scale intent diffusion mechanism, where a holistic intent predictor captures global behavioral dynamics to guide overall behavior synthesis, while an immediate intent predictor provides step-wise, fine-grained signals for local behavior refinement at each diffusion step. This hierarchical intent formulation imposes a structured inductive bias for humanoid control, improving semantic alignment and behavioral naturalness. Furthermore, MIND encodes humanoid states into a latent space to enable more effective semantic intent modeling. Extensive experiments demonstrate that MIND outperforms existing methods and synthesizes coherent, physically plausible, and semantically aligned humanoid behaviors from text commands. Our code will be released to facilitate future research.

[CV-18] Towards 3D heart mesh generation using contactless radar imaging and physics-informed neural network

Link: https://arxiv.org/abs/2605.26003
Authors: Jinye Li, Chenxi Fu, Minghang Zheng, Yang Liu, Xiahai Zhuang, Qingchao Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Cardiac function evaluation necessitates continuous, non-invasive monitoring, a capability limited in MRI. Millimeter-wave (mmWave) radar and its Synthetic Aperture Radar (SAR) mode offer privacy-preserving and portable point-of-care clinical applications. However, reconstructing high-fidelity 3D cardiac geometry from SAR remains an open challenge. Traditional radar methods generate sparse point clouds that lack continuous surface topology. Meanwhile, direct application of optical reconstruction networks performs poorly due to the severe speckle noise and ambiguous boundaries inherent in SAR images. To bridge this gap, we propose SAR2Mesh, a novel framework that reformulates the task as a coarse-to-fine mesh deformation process. By initializing with a topological template, our approach explicitly preserves anatomical connectivity through progressive mesh this http URL introduce a geometry-aware feature projection module to extract multi-view features via 3D-to-2D sampling, and a physics-informed radar loss to enforce consistency between the predicted geometry and raw radar echoes. Furthermore, we present Cardiac Mesh-SAR, the first large-scale paired SAR-mesh dataset. Extensive experiments demonstrate that SAR2Mesh significantly outperforms existing image-based baselines, achieving accurate and physically consistent cardiac reconstructions.

[CV-19] LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence

Link: https://arxiv.org/abs/2605.25979
Authors: Xiang An, Yin Xie, Feilong Tang, Yunyao Yan, Huajie Tan, Didi Zhu, Changrui Chen, Xiuwei Zhao, Bin Qin, Kaicheng Yang, Yifei Shen, Yuanhan Zhang, Kaichen Zhang, Wenkang Zhang, Zheng Cheng, Nansen Zhang, Chunsheng Wu, Chunjiang Ge, Zimin Ran, Dehua Song, Chunyuan Li, Shikun Feng, Ming Hu, Zhangquan Chen, Junbo Niu, Bo Li, Ziyong Feng, Ziwei Liu, Zongyuan Ge, Jiankang Deng
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:We introduce LLaVA-OneVision-2 (LLaVA-OV-2), the most capable vision-language model in the LLaVA-OneVision series to date, achieving superior performance across a broad range of multimodal benchmarks. The model builds on a native OneVision-Encoder and incorporates Windowed Attention for efficient local computation while maintaining native resolution. Its key advance is codec-stream tokenization: it treats compressed video as a continuous bit-cost stream, where bit-cost dynamics determine adaptive temporal groups, and motion-residual cues select salient spatial evidence into compact visual canvases. This allocation concentrates a limited token budget on event-bearing content, enabling more stable long-video token compression than fixed groups of pictures. A shared 3D RoPE further places codec canvases, sampled frames, and images in a unified spatiotemporal coordinate system. Furthermore, we build the LLaVA-OV-2 data and training stack around large-scale open supervision: approximately 8M re-captioned video samples for pretraining and a 4M-sample spatial corpus for fine-tuning. We also introduce JumpScore, a temporal-localization benchmark targeting fine-grained grounding in high-frequency, densely repeated motion, a regime underrepresented by existing video evaluations. A standout capability of LLaVA-OV-2 is its unified perception across video understanding, temporal grounding, spatial grounding, and manipulation-trace reasoning. On JumpScore, LLaVA-OneVision-2-8B reaches 74.9 JumpScore mAP, surpassing Qwen3-VL-8B (30.1) by +44.8 points; under matched visual-token budgets on the same benchmark, codec-stream inputs improve temporal grounding over frame sampling by +9.7 points. Across standard benchmarks, LLaVA-OneVision-2-8B further outperforms Qwen3-VL-8B by +4.3 average points on video tasks, +5.3 on spatial tasks, and +15.6 average JF on tracking tasks.

[CV-20] F-RNG: Feed-Forward Relightable Neural Gaussians

Link: https://arxiv.org/abs/2605.25975
Authors: Guangming Fu, Jiahui Fan, Jian Yang, Miloš Hašan, Beibei Wang
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Capturing relightable 3D assets from real-world objects is a widely researched problem. Several per-scene optimization-based methods, based on 3D Gaussian splatting (3DGS), support relighting; however, they usually require dense input views, and their overfitting nature makes it difficult to generalize across scenes. Unlike per-scene optimization methods, generalized feed-forward models can directly reconstruct Gaussians from sparse input views. However, the resulting assets have baked-in illumination and cannot be easily used for relighting. In this paper, we present F-RNG, a feed-forward framework that directly generates relightable 3DGS assets from sparse-view inputs. Training such a model from scratch can require massive data and computing resources, and it is especially challenging to generate relightable assets in a feed-forward manner with acceptable cost. We develop F-RNG upon an existing large reconstruction model (LRM) to extract relightable representations, while also utilizing priors from an intrinsic decomposition model (IDM). Specifically, we first introduce a latent-interpolated fine-grained geometry synthesis to enhance the LRM’s geometry representation. Second, we propose a prior-guided relightable appearance distillation to extract relightable neural representations by incorporating IDM priors. Finally, a universal neural renderer enables flexible and high-fidelity relighting. F-RNG requires neither re-training nor fine-tuning of the underlying LRMs, and thus can automatically benefit from better LRMs and IDMs in the future. With only small networks that can be trained with affordable data and computational resources, F-RNG avoids the repetitive inference of large models under different light conditions. Compared with the state-of-the-art LRM-based relighting method, F-RNG achieves ~25x faster relighting, as well as superior quality (~+2.0 dB).

[CV-21] PathWISE: Multi-Agent Cancer Pathway Triaging Ontology Learning from Clinical Flowcharts

Link: https://arxiv.org/abs/2605.25970
Authors: Sofiat Abioye, Ufaq Khan, Shazad Ashraf, Mohammed Adil Butt, Andrew D. Beggs, Adam Byfield, Anusha Jose, William Poulett, Ben Wallace, Junaid Qadir, Muhammad Bilal
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 4 figures

Abstract:Clinical pathways are disseminated as visual flowcharts where spatial topology, arrow direction, colour coding, and font weight encode critical triage logic that remains inaccessible to computational systems. We present PathWISE, a five-phase pipeline combining four LLM-based agents with a deterministic depth-first search auditor and a Java compiler critic, transforming these non-computable artefacts into validated, executable HL7 Clinical Quality Language (CQL) libraries deployable as FHIR CDS Hooks services. Purpose-built agents extract flowchart structure into a typed directed graph, perform deterministic path enumeration, conduct a structured semantic audit of every node’s computability, generate terminology-constrained CQL definitions verified by the official Java CQL-to-ELM compiler, and produce routing logic covering 100% of enumerated patient journeys. Demonstrated across five UK NHS cancer pathways (colorectal, lung, skin, upper GI, and breast), PathWISE audits up to 183 nodes (182 under the Hybrid configuration), identifies 544 structured governance findings across four issue categories, achieves 100% syntactic compilation success, with UNCOMPUTABLE nodes receiving false placeholders that preserve compilability while surfacing governance gaps for clinical review, and produces zero hallucinated terminology codes for dictionary-covered concepts. Critically, PathWISE confines non-deterministic LLM inference to knowledge extraction while deterministic graph mathematics and a standard compiler underpin every verification step.
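
The deterministic path-enumeration step can be pictured with a small depth-first search over a directed graph. This is only a sketch of the general technique: the dictionary-based graph, the function name `enumerate_paths`, and the toy node names are all hypothetical and are not taken from PathWISE or from any NHS pathway.

```python
def enumerate_paths(graph, start):
    """Return every path from `start` to a terminal node (one with no successors).

    graph: dict mapping a node id to a list of successor node ids.
    Nodes already on the current path are skipped, so cycles cannot loop forever.
    """
    paths = []
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        successors = [n for n in graph.get(node, []) if n not in path]
        if not successors:
            paths.append(path)
            continue
        # Push in reverse so the first successor is explored first.
        for nxt in reversed(successors):
            stack.append((nxt, path + [nxt]))
    return paths

# Hypothetical toy flowchart used only for illustration.
toy = {
    "referral": ["symptom_check"],
    "symptom_check": ["urgent_clinic", "routine_followup"],
    "urgent_clinic": ["imaging"],
    "imaging": [],
    "routine_followup": [],
}
```

Because the traversal involves no model inference, the same graph always yields the same list of patient journeys, which is what allows coverage of the routing logic to be checked exactly.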

[CV-22] Context-driven Missing-Modality Learning for Robust Medical Diagnosis with Image-Tabular Data

Link: https://arxiv.org/abs/2605.25968
Authors: Tianling Liu, Lequan Yu, Tong Han, Liang Wan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 8 figures

Abstract:While multimodal data integrating diverse imaging and clinical tabular records is crucial for accurate medical diagnosis, the arbitrary absence of specific modalities is prevalent in clinical practice, severely degrading the performance of multimodal models. Existing methods either discard missing modalities, leading to information loss, or struggle to synthesize them without capturing complex inter-modal dependencies. To address these limitations, we propose a novel Context-driven Missing-Modality Learning (CMML) framework, which sequentially performs modality synthesis and semantic alignment to achieve robust diagnosis under arbitrary missing conditions. Specifically, we design a Cascade Residual Transformer-based Autoencoder (CRTA) that leverages learnable context tokens acting as a dataset-level semantic prior to capture inter-modal dependencies and synthesize key missing representations. These representations are further enriched by modality-specific memory banks. To resolve the discrepancy between the originally available and the synthesized representations, we transform the learned context tokens into instance-adaptive semantic references by infusing multimodal representations from the CRTA’s outputs. This reference guides the alignment of heterogeneous modality representations into a unified space, where class-aware contrastive refinement is finally applied to explore discriminative diagnostic cues. Extensive evaluations on skin lesion (Derm7pt), ocular disease (ODIR), and meningioma (MEN) datasets demonstrate that CMML significantly outperforms state-of-the-art (SOTA) methods, yielding AVG AUC improvements of 1.26%, 0.97%, and 1.32%, respectively.

[CV-23] RAPTOR: A Visually Grounded Vision-Language Framework to Improve Clinical Trust and Auditability in Automated Cancer Referral Processing

Link: https://arxiv.org/abs/2605.25956
Authors: Sofiat Abioye, Ufaq Khan, Shazad Ashraf, Anusha Jose, Benjamin Wallace, William Poulett, Adam Byfield, Lukman Akanbi, Muhammad Bilal
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 4 figures

Abstract:Urgent suspected colorectal cancer (CRC) referrals create operational bottlenecks because semi-structured clinical documents often require manual review and transcription. The original RAPTOR system used Large Language Models for structured extraction but relied on a separate OCR stage, making it vulnerable to handwriting, layout variation, and loss of visual evidence linkage. We present RAPTOR+, a multimodal extension that uses Vision-Language Models (VLMs) for end-to-end referral understanding. We evaluate fine-tuned VLMs, commercial and open-source zero-shot VLMs, and the original OCR-based pipeline on 223 clinically curated CRC urgent referral forms. We also introduce a grounding-aware evaluation framework that measures both extraction accuracy and evidence localisation. Results show a clear grounding gap in zero-shot models. Gemini 2.5 Flash achieved 92.6% Reading Accuracy but only 1.2% Strict Safety. In contrast, fine-tuned Qwen3-VL-8B achieved 96.1% Reading Accuracy and 60.6% Strict Safety, substantially improving verifiable evidence grounding. These findings show that task-specific fine-tuning is essential for reliable, auditable clinical document understanding. RAPTOR+ enables extracted referral decisions to be linked to visual evidence, supporting safer and more efficient cancer referral triage.

[CV-24] VEN-VL: A Visual Ensemble MoE Framework for Effective and Efficient Multi-Modal Understanding

Link: https://arxiv.org/abs/2605.25952
Authors: Yinghao Wu, Zhuoyan Luo, Yiyao Yu, Zhaojian Yu, Yujiu Yang, Xiao-Ping Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Despite the remarkable progress achieved by recent efficient methods in accelerating multimodal understanding, they still suffer from noticeable performance degradation. Their emphasis on the high compression ratio of a single visual clue and reliance on the heuristic pruning strategy with coarse attention alignment incur a bottleneck on the information capacity and density of visual tokens. To address this limitation, we propose VEN-VL, a visual ensemble MoE framework for effective and efficient perception following the enrich-then-compact principle. Specifically, we first enrich the information capacity by unifying the visual representations of different perspectives, and then progressively compact it with adaptive routers in specialized visual experts to enhance the information density. Furthermore, we incorporate the reconstruction ability of vanilla structure via explicit visual supervision, facilitating crucial information preservation. Experimental results demonstrate our superiority in complex visual tasks with few information-condensed tokens, which effectively bridges the gap between performance and efficiency.

[CV-25] A Pedestrian-Vehicle Interaction Benchmark and Annotation Framework for Unstructured Scenes via Uncalibrated Cameras

Link: https://arxiv.org/abs/2605.25947
Authors: Haoyang Peng, Qian Hu, Songan Zhang, Ming Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 8 figures; project page available at this https URL

Abstract:Predicting the interaction between pedestrians and vehicles is essential for autonomous driving safety in unstructured and semi-structured scenarios; however, this task is severely hindered by the scarcity of public datasets that feature dense pedestrian-vehicle interactions. Most current studies rely on structured road data, leaving the complex, heterogeneous interactions found in unstructured environments insufficiently represented and researched. In this paper, we propose a dataset annotation framework based on video data from uncalibrated surveillance cameras and present PINNS (Pedestrian-vehicle Interaction dataset from uNcalibrated cameras in uNstructured Scenes). The dataset covers multiple countries and regions, includes diverse typical traffic scenarios, and considers variations in seasons, lighting conditions, and weather. It focuses on complex scenes with dense pedestrian-vehicle interactions and is designed to be easily extensible. The dataset is constructed and annotated according to the standard issued by the Chinese Association of Automation, providing both trajectory data and corresponding scene-level information. Furthermore, this paper analyzes current challenges and research directions in heterogeneous agent trajectory prediction and shows the necessity and usefulness of the proposed dataset. We hope our framework and dataset will facilitate research on trajectory prediction and autonomous driving in complex mixed traffic scenarios. PINNS is publicly available at this https URL.

[CV-26] EchoPilot: Training-Free Ultrasound Video Segmentation via Scale-Space Semantic Prompting and Reliability-Gated Memory MICCAI2026

Link: https://arxiv.org/abs/2605.25944
Authors: Ruiqiang Xiao, Zhaohu Xing, Yijun Yang, Zhenyan Han, Weiming Wang, Kaishun Wu, Lei Zhu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Early accepted to MICCAI 2026. Project page: this https URL

Abstract:Ultrasound video segmentation is clinically valuable yet difficult due to speckle noise, weak boundaries, and rapid anatomical deformation. Recent promptable foundation models enable point-guided segmentation, but their direct deployment in ultrasound remains unreliable: a single point provides insufficient spatial context to resolve scale ambiguity, and greedy memory updates amplify early errors into severe temporal drift. We present EchoPilot, a training-free framework for ultrasound video segmentation under sparse first-frame interaction, requiring only a single point click and an anatomical category name. EchoPilot orchestrates a frozen medical vision-language model (VLM) for semantic localization, a vision foundation model (VFM) for dense geometric feature extraction, and a promptable video segmentor for mask prediction and propagation. To resolve initialization ambiguity, we propose Scale-Space Semantic Prompting, which first selects an optimal contextual view via a parameter-free S.E.E.D. (Semantic Energy-Entropy Density) criterion, and then synthesizes geometrically precise auxiliary point prompts from dense foundation features without additional user interaction. To reduce propagation drift, a Reliability-Gated Memory update is further introduced to selectively freeze the segmentor’s memory bank under uncertain predictions, preventing error accumulation. We also contribute the first dynamic fetal placenta ultrasound video segmentation dataset with 671 annotated frames. Across three ultrasound video datasets, EchoPilot achieves state-of-the-art performance under the sparse-interactive setting, consistently outperforming training-free baselines and finetuned specialists.
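
The idea of freezing the memory bank under uncertain predictions can be sketched in a few lines. Everything specific here is an assumption made for illustration: the entropy-based confidence score, the threshold of 0.8, the memory capacity, and the function names are hypothetical and are not the reliability measure or gating rule defined by EchoPilot.

```python
import numpy as np

def mask_confidence(prob):
    """Mean certainty of a foreground-probability map, in [0, 1].

    Uses 1 - binary entropy (in bits) per pixel; this score is an assumption,
    not the reliability measure defined in the paper.
    """
    p = np.clip(np.asarray(prob, dtype=float), 1e-7, 1 - 1e-7)
    entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    return float(1.0 - entropy.mean())

def propagate(prob_maps, threshold=0.8, capacity=4):
    """Keep a FIFO memory of frame indices, skipping uncertain frames."""
    memory = []
    for idx, prob in enumerate(prob_maps):
        if mask_confidence(prob) >= threshold:
            memory.append(idx)
            memory = memory[-capacity:]   # drop the oldest entries
        # Otherwise the memory is left untouched (frozen) for this frame.
    return memory
```

In contrast to a greedy update that stores every frame, a low-confidence prediction never enters the memory here, so it cannot be reused as a reference for later frames.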

[CV-27] LRDDv3: High-Resolution Long-Range Drone Detection Dataset with Range Information and Thermal Data ICRA

Link: https://arxiv.org/abs/2605.25942
Authors: Knut Peterson, Zaid Mayers, Azmain Yousuf, Priontu Chowdhury, Asher Zaczepinski, Solmaz Arezoomandan, Reihaneh Maarefdoust, David Han
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 8 pages, 5 figures. Accepted to the 2026 IEEE International Conference on Robotics and Automation (ICRA)

Abstract:Unmanned Aerial Vehicles (UAVs) have quickly become common in various airspaces, representing a wide range of applications from recreational flying to commercial photography and package delivery. With the increasing prevalence of UAVs, it becomes critical that both manned and unmanned aircraft can detect UAVs and other flying objects from long range to effectively track movement and ensure safe operation in shared spaces. While several datasets have been introduced for drone detection, the need for expanded high-quality data persists, especially in the area of high-resolution long-range drone data. To address this, we introduce a high-resolution dataset of 102,532 long-range RGB images of drones, sampled at 5 FPS from 128 distinct video clips taken mid-flight during 17 different data collection days spread over 8 months to ensure a wide variety of lighting scenarios, flight locations, and background elements. The dataset includes comprehensive drone range information, as well as 29,630 IR images, all paired with RGB counterparts from the base dataset. As one of the first drone detection datasets to leverage 4K image resolution and paired 640x512 IR images, our work represents a significant advancement to enable the detection of drones at long range. For access to the complete dataset, please visit this https URL

[CV-28] Where Concept Erasure Should Occur: Concept-Layer Alignment in Text-to-Video Diffusion Models ICML2026

Link: https://arxiv.org/abs/2605.25941
Authors: Yiwei Xie, Ping Liu, Zheng Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML 2026

Abstract:Text-to-video diffusion transformers encode semantic information unevenly across model depth, which constrains effective concept erasure. We identify a representational bottleneck, termed concept-layer topological alignment, under which target concepts exhibit higher separability at certain representational depths. Outside these depths, concept and non-target signals remain strongly entangled, limiting the effectiveness of depth-specific erasure. This observation reframes concept erasure as the problem of identifying representational depths where concept-non-target separation naturally emerges. Motivated by this structural constraint, we introduce CLEAR, a separability-driven optimization framework for concept erasure that explicitly enforces concept-layer alignment. CLEAR operationalizes this principle by formulating layer selection as an optimization problem over concept-non-target separability, rather than relying on layer-agnostic or heuristic choices. To enable this, we introduce a separability-aware objective that favors layers exhibiting stronger concept-non-target separation. Experiments on large-scale text-to-video diffusion models demonstrate that enforcing concept–layer alignment leads to more precise concept suppression while preserving overall generative quality.
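
One way to picture separability-driven layer selection is to score each layer by how well concept activations separate from non-target activations and then pick the best-scoring layer. The Fisher-style ratio and the plain argmax below are stand-ins chosen for illustration: the abstract describes layer selection as an optimization problem with a separability-aware objective, and the exact objective is not given there. The function names are hypothetical.

```python
import numpy as np

def fisher_separability(concept, other):
    """Squared distance between class means divided by total within-class variance."""
    mu_c, mu_o = concept.mean(axis=0), other.mean(axis=0)
    within = concept.var(axis=0).sum() + other.var(axis=0).sum()
    return float(np.sum((mu_c - mu_o) ** 2) / (within + 1e-8))

def select_layer(concept_by_layer, other_by_layer):
    """Return the index of the layer with the highest separability score."""
    scores = [
        fisher_separability(c, o)
        for c, o in zip(concept_by_layer, other_by_layer)
    ]
    return int(np.argmax(scores)), scores
```

Here each element of `concept_by_layer` and `other_by_layer` is an array of shape (samples, features) holding activations collected at one depth of the model.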

[CV-29] Closed-Loop Bidirectional Prompting for Adversarial Robustness of Vision Language Models

Link: https://arxiv.org/abs/2605.25922
Authors: Xiao Liu, Jiaxiang Liu, Boci Peng, Boren Hu, Yusong Wang, Xiwen Chen, Prayag Tiwari, Liming Zhang, Mingkun Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 pages, 8 figures

Abstract:Vision Language Models adapt well to downstream tasks but are highly vulnerable to adversarial perturbations that disrupt cross-modal semantic alignment. Existing defenses are largely unidirectional or structural, failing to exploit bidirectional cross-modal complementarity and instance-wise adaptive protection. To overcome the limitations of unidirectional and static defenses in adversarial settings, we propose Closed-Loop Bidirectional Prompting, casting robust adaptation as cross-modal agreement recovery via a dynamic feedback loop on frozen encoders. A Semantic Anchor is introduced as a stable prior to constrain cyclic updates and mitigate perturbation-induced feature corruption. Through anchor-based bootstrapping, textual semantics denoise visual representations, while the refined visuals enable instance-adaptive prompt updating, yielding a rectified and robust consensus. Extensive evaluations across 11 datasets validate state-of-the-art robustness and strong base-to-new generalization, while maintaining a favorable trade-off between computational cost and accuracy.

[CV-30] Curve Skeletonization in Continuous domain for Meshes and Point Clouds WACV

Link: https://arxiv.org/abs/2605.25921
Authors: Jai Bardhan, Ramya Hebbalaguppe, Aravind Udupa
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 31 pages, 26 figures, 7 tables, 4 algorithms. Published at IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026

Abstract:Advancements in 3D curve skeletonization are accelerating progress across a wide range of applications. However, developing robust skeletonization algorithms that capture intricate object details remains challenging. Skeletonization via Local Separators (LS) offers an efficient graph-based approach but suffers from representation inaccuracies due to its discrete nature. To address this, we introduce CSCD, a novel framework for Curve Skeletonization in the Continuous Domain, generalizing LS to manifolds. Specifically, we present two realizations: CSCD-M for meshes and CSCD-PC for point clouds. CSCD-M leverages the intrinsic triangulation of a mesh for resilience to noise and improved topological preservation, while CSCD-PC employs tufted Laplacians for enhanced robustness. To our knowledge, CSCD-M is the first intrinsic method for curve skeletonization. Our results show CSCD-M matches LS performance across diverse meshes and outperforms LS (TOG’21) on benchmarks such as the Thingi10k dataset. CSCD-PC qualitatively outperforms CoverageAxis++ (Eurographics’24) and EPCS (CAG’23). Finally, we demonstrate the efficacy of CSCD in a few downstream tasks: object classification, shape segmentation, and identifying handles, tunnels, and constrictions in objects. Project Website: this https URL

[CV-31] R5DGS: Semantic-Aware 4D Gaussian Splatting with Rigid Body Constraints for Efficient Dynamic Scene Reconstruction

Link: https://arxiv.org/abs/2605.25909
Authors: Denis Gridusov, Maxim Popov, Sergey Kolyubin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL

Abstract:Reconstructing and predicting dynamic 3D scenes from multi-view videos is a foundational task for robotics, AR/VR, and digital twins. Recent physics-informed Gaussian Splatting methods achieve impressive future frame extrapolation but lack semantic awareness and suffer from large computational overhead. We introduce R5DGS, a framework that augments a physics-driven 4D Gaussian representation with compact Identity Encoding vectors, enabling precise Gaussian-to-object association. By constructing an offline CLIP-based object lookup table, we support open-vocabulary text prompting to retrieve and render object-specific Gaussians across arbitrary timestamps and viewpoints. Furthermore, we propose a rigid-body inference constraint that predicts and integrates physical dynamics exclusively for object centroids, propagating motion to associated Gaussians via relative transformations. This optimization yields an 11 FPS speedup during extrapolation without compromising trajectory plausibility.
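
The rigid-body constraint can be illustrated with a short NumPy sketch: dynamics are tracked only for the object centroid, and every associated Gaussian center follows through a fixed offset expressed relative to that centroid. The function names and the choice of a z-axis rotation are assumptions made for this example and do not come from the R5DGS code.

```python
import numpy as np

def rotation_z(angle):
    """3x3 rotation matrix about the z axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def propagate_rigid(points0, centroid0, centroid_t, rotation_t):
    """Move Gaussian centers rigidly with their object.

    points0: (n, 3) Gaussian centers at the reference time.
    centroid0, centroid_t: (3,) object centroid at reference time and time t.
    rotation_t: (3, 3) object rotation accumulated since the reference time.
    """
    offsets = points0 - centroid0            # positions relative to the centroid
    return centroid_t + offsets @ rotation_t.T
```

Only the centroid trajectory and one rotation per object need to be predicted, so the per-step cost scales with the number of objects rather than the number of Gaussians, which is consistent with the speedup the abstract reports.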

[CV-32] AgentGrounder: Zero-Shot 3D Visual Pointcloud Grounding using Multimodal Language Models

Link: https://arxiv.org/abs/2605.25901
Authors: Cuong Huynh, Maxim Popov, Denis Gridusov, Sergey Kolyubin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Code: this https URL

Abstract:3D Visual Grounding (3DVG) is an essential capability for embodied AI, requiring agents to localize objects in 3D scenes based on natural language descriptions. Recent zero-shot methods leverage 2D vision-language models (LVLMs). However, they often rely on existing sets of multi-view images and struggle with the limited semantic and spatial details provided by standard 3D segmentation tools. We present AgentGrounder, a zero-shot 3D visual grounding framework that operates directly on colored point clouds without task-specific 3D training. Our approach follows a two-stage design: (1) an offline stage that applies a 3D model to build an Object Lookup Table (OLT) with instance IDs, semantic labels, and 3D bounding boxes; and (2) an online tool-driven agent that decomposes each query, retrieves only relevant candidates from the OLT, performs geometric scoring, and triggers image rendering on demand when additional visual evidence (e.g., color, material, or viewpoint-sensitive cues) is required. Compared with fixed anchor-target matching pipelines, this design reduces cascading matching errors and improves context-window efficiency by avoiding prompts overloaded with irrelevant objects. We evaluate on ScanRefer and Nr3D under a zero-shot setting and observe consistent improvements over SeeGround in our setup, including +2.5% Acc@0.5 on ScanRefer and +6.3% on Nr3D, with a notable +6.3% gain on Nr3D view-independent queries. These results show that combining selective retrieval, geometric reasoning, and adaptive visual inspection yields a practical and robust foundation for open-vocabulary 3D grounding. Our code is available at this https URL.
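
The two-stage design can be mimicked with a toy lookup table and a retrieval step followed by a simple geometric score. This is a hedged sketch only: the `OLTEntry` fields mirror the items listed in the abstract (instance ID, semantic label, 3D bounding box), but the class, the function names, and the nearest-to-anchor scoring rule are hypothetical and are not the AgentGrounder implementation.

```python
from dataclasses import dataclass
import math

@dataclass
class OLTEntry:
    instance_id: int
    label: str
    center: tuple   # (x, y, z) center of the 3D bounding box
    size: tuple     # (dx, dy, dz) extents of the 3D bounding box

def retrieve(table, label):
    """Return only the entries whose semantic label matches the query term."""
    return [e for e in table if e.label == label]

def nearest_to_anchor(table, target_label, anchor_label):
    """Pick the target instance closest to any anchor instance."""
    targets = retrieve(table, target_label)
    anchors = retrieve(table, anchor_label)
    if not targets or not anchors:
        return None
    return min(
        targets,
        key=lambda t: min(math.dist(t.center, a.center) for a in anchors),
    )
```

For a query such as "the chair next to the table", only chair and table entries would be pulled from the table, which keeps unrelated objects out of the prompt passed to the language model.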

[CV-33] SP-MoMamba: Superpixel-driven Mixture of State Space Experts for Efficient Image Super-Resolution

Link: https://arxiv.org/abs/2605.25892
Authors: Wenbin Zou, Yawen Cui, Yi Wang, Lap-Pui Chau, Liang Chen, Jinshan Pan, Huiping Zhuang, Guanbin Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 15 figures

Abstract:State space models (SSMs) have emerged as a powerful paradigm for efficient single-image super-resolution (SR) due to their linear complexity and long-range modeling capabilities. However, existing Mamba-based methods typically rely on data-agnostic rigid scanning, which reshapes 2D images into 1D sequences over a fixed grid, inevitably disrupting spatial-semantic topology and introducing artifacts. Inspired by the Gestalt perceptual grouping theory, we propose SP-MoMamba, a superpixel-driven mixture of state space experts designed for content-aware SR. Our core idea is to transform the traditional rigid scanning into a semantic-level interaction by treating superpixels as fundamental units. Specifically, we introduce the Superpixel-driven State Space Model (SP-SSM), which compresses semantically homogeneous regions into high-order tokens to preserve global topological consistency. To address the conflict between fixed scanning scales and diverse semantic granularities, we develop the Multi-Scale Superpixel Mixture of State Space Experts (MSS-MoE). This module utilizes a dynamic routing mechanism to adaptively assign scale-specific experts, effectively capturing multi-scale textures while reducing computational redundancy. Furthermore, to prevent the loss of high-frequency details during global abstraction, we introduce a Local Spatial Modulation Expert (LSME) to complement the global modeling, ensuring a precise reconstruction of sharp edges and fine structures. Extensive experiments on standard benchmarks demonstrate that SP-MoMamba achieves superior reconstruction fidelity and a more favorable efficiency-performance trade-off compared to state-of-the-art efficient SR methods.
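
Treating superpixels as the basic units means the sequence model sees one token per region instead of one per pixel. The sketch below shows the simplest version of that compression, mean pooling of features inside each superpixel. Mean pooling and the function name `superpixel_tokens` are assumptions for illustration; the abstract does not specify how SP-SSM forms its high-order tokens.

```python
import numpy as np

def superpixel_tokens(features, labels):
    """Average pixel features inside each superpixel.

    features: (h, w, c) feature map.
    labels: (h, w) integer superpixel id per pixel, ids in 0..k-1.
    Returns a (k, c) array with one token per superpixel.
    """
    h, w, c = features.shape
    flat_feat = features.reshape(h * w, c)
    flat_lab = labels.reshape(h * w)
    k = int(flat_lab.max()) + 1
    sums = np.zeros((k, c))
    np.add.at(sums, flat_lab, flat_feat)          # scatter-add per superpixel
    counts = np.bincount(flat_lab, minlength=k)
    return sums / np.maximum(counts, 1)[:, None]
```

The resulting sequence has length k rather than h * w, so a scan over tokens follows region boundaries instead of a fixed raster order.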

[CV-34] DyCoRM: Dynamic Criterion-Aware Reward Modeling for Text-to-Image Generation

Link: https://arxiv.org/abs/2605.25876
Authors: Jiaying Qian, Ziheng Jia, Qian Zhang, Zicheng Zhang, Jiayi Guo, Junqi Zhang, Guangtao Zhai, Xiongkuo Min
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:With the continued advancement of text-to-image (T2I) generation, producing high-quality images is becoming increasingly attainable; consequently, user demands are shifting toward images that better satisfy their specific requirements. As reward models play an increasingly important role in assessing whether generated images align with user preference, this trend introduces an important challenge for reward modeling: rather than relying solely on static and general evaluation dimensions, reward models should account for the task-relevant and fine-grained criteria through which users assess whether generated images meet their specific requirements. To address this challenge, we propose DyCoRM, a dynamic, criterion-aware reward model that grounds task-relevant criteria and performs criterion-aware preference comparison. To support this setting, we construct DyCoDataset-20K, which provides dynamic criteria together with criterion-level annotations, and further derive DyCoBench-1K, a benchmark for systematically evaluating reward models under dynamic criteria. We further introduce DyCoPick, which applies criterion-aware reward modeling to selecting T2I images. Our contributions establish the first reward modeling framework for dynamic and fine-grained evaluation and practical application in T2I generation.

[CV-35] WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

Link: https://arxiv.org/abs/2605.25874
Authors: Kaining Ying, Hengrui Hu, Siyu Ren, Jiamu Li, Fengjiao Chen, Ziwen Wang, Xuezhi Cao, Xunliang Cai, Henghui Ding
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical report of WBench. Homepage: this https URL

Abstract:Interactive world models are advancing rapidly, yet existing benchmarks cover only part of the required competencies, leaving no unified standard for systematic evaluation. To fill this gap, we introduce WBench, a comprehensive multi-turn benchmark for interactive world model evaluation along five dimensions, namely video quality, setting adherence, interaction adherence, consistency, and physics compliance. WBench contains 289 test cases and 1,058 interaction turns, where each case specifies a world setting and a multi-turn interaction sequence, covering diverse scenes, styles, subjects, and both first- and third-person perspectives, together with four interaction types, including navigation, subject action, event editing, and perspective switching. For navigation, WBench unifies text, 6-DoF pose, and discrete-action control, enabling evaluation of models with different native input interfaces. Evaluation uses 22 automatic sub-metrics that combine specialist vision models with large multimodal models, and all metrics are validated against human judgments. Across 20 state-of-the-art models, we find that no single model performs strongly across all dimensions. We provide detailed diagnostic insights into the characteristic strengths, weaknesses, and open challenges of each model. Code and data are available at this https URL.

[CV-36] MuNet: A Mutualistic Network for Joint 3D Human Mesh Recovery and 3D Clothed Human Reconstruction from Single Images

Link: https://arxiv.org/abs/2605.25861
Authors: Yunqi Gao, Leyuan Liu, Yuhan Li, Changxin Gao, Jingying Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:3D human mesh recovery and 3D clothed human reconstruction are inherently related, yet they have long been studied in isolation, thereby overlooking the potential gains of joint optimization. To overcome this limitation, we propose to address these two tasks within a unified framework, which allows their mutual dependencies to be effectively exploited. Building on this idea, we propose MuNet, a mutualistic network for joint 3D human mesh recovery and 3D clothed human reconstruction from single images. First, we adopt 2-manifold graphs as a unified representation for all 3D models, enabling consistent modeling across 3D human mesh recovery and clothed human reconstruction. Second, we design an end-to-end graph convolutional network that progressively deforms an initial graph into a 3D human mesh and refines it into a detailed 3D clothed human model. Third, we introduce a mutualistic mechanism that allows reciprocal interaction between the two tasks during training, where 3D human mesh recovery provides guidance for 3D clothed human reconstruction, and reconstruction feedback refines the 3D human mesh recovery. We extensively evaluate MuNet on six benchmark datasets for 3D human mesh recovery and 3D clothed human reconstruction, including Human3.6M, 3DPW, MPI-INF-3DHP, THuman2.0, CAPE, and RenderPeople. Experimental results demonstrate that MuNet achieves state-of-the-art performance on both tasks across all datasets. The code of MuNet is released for research purposes at this https URL.

[CV-37] SAM3-Assisted Training of Lightweight YOLO Models for Precision Pig Farming

Link: https://arxiv.org/abs/2605.25860
Authors: Marcos Vinicius Mendes Faria, Thiago Borges Pereira, Isabella C.F.S. Condotta, Thiago Meireles Paixão, Francisco de Assis Boldt
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at the IEEE Sensors Applications Symposium (SAS 2026)

Abstract:Deep learning-based object detection has revolutionized Precision Livestock Farming (PLF), yet a critical barrier remains: high-performance Foundation Models (such as SAM 3) are too computationally intensive for edge deployment, while lightweight models (like YOLO) require prohibitive manual annotation efforts. This work proposes a fully automated knowledge distillation pipeline that leverages the Segment Anything Model 3 (SAM 3) to generate zero-shot pseudo-labels for training efficient YOLOv8 detectors. By treating SAM 3 as an offline auto-annotator, we eliminate the manual labeling bottleneck, producing models capable of real-time inference on resource-constrained hardware. We systematically evaluate this approach on the PigLife dataset, comparing SAM 3-supervised models against human-annotated baselines. Results demonstrate that a SAM 3-trained YOLOv8m achieves a mean Average Precision (mAP) of 79.4% without human intervention, while reducing inference latency by approximately 200× compared to the teacher model. Furthermore, stratified analysis reveals that in low-occlusion scenarios, the automated pipeline achieves detection rates comparable to human benchmarks (AP_50 99%). These findings indicate that foundation models can serve as effective, zero-annotation-cost supervisors, enabling scalable edge computing solutions for smart agriculture.
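
Editor's sketch: the abstract above describes turning segmentation output into pseudo-labels for a YOLO detector. The snippet below is only an illustration of that one conversion step, not the authors' pipeline. It assumes a binary mask given as nested lists and the common YOLO text label layout (`class x_center y_center width height`, all normalized to the image size); the helper names `mask_to_bbox` and `bbox_to_yolo_line` are hypothetical.

```python
def mask_to_bbox(mask):
    """Return the tight box (x_min, y_min, x_max, y_max) of a binary mask.

    `mask` is a list of rows holding 0/1 values; x_max and y_max are
    exclusive. Returns None when the mask has no foreground pixel.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, v in enumerate(row) if v]
    if not rows:
        return None
    return min(cols), min(rows), max(cols) + 1, max(rows) + 1


def bbox_to_yolo_line(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Format a pixel-space box as one normalized YOLO label line."""
    x_center = (x_min + x_max) / 2.0 / img_w
    y_center = (y_min + y_max) / 2.0 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Hypothetical 4x4 mask with a 2x2 foreground blob in the middle.
example_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
box = mask_to_bbox(example_mask)
label_line = bbox_to_yolo_line(0, *box, img_w=4, img_h=4)
```

In a real pipeline each mask predicted by the teacher model would yield one such line in the label file for its image; confidence filtering and class mapping are left out here.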

[CV-38] [CLS] is Not Enough: Multi-Label Recognition via Patch-Level Inference and Adaptive Aggregation

Link: https://arxiv.org/abs/2605.25821
Authors: Akang Wang, Xili Deng, Zhanxuan Hu, Yi Zhao, Yonghang Tai, Huafeng Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-Language Models such as CLIP exhibit strong zero-shot recognition capability by aligning images with textual concepts, yet they often underperform on multi-label recognition where multiple objects co-exist. A key bottleneck is that the [CLS] token, as a single global visual representation, is insufficient to faithfully encode diverse targets with varying scales, contexts, and co-occurrence patterns. To address this limitation, we present a new multi-label image recognition framework, termed PIAA, which formulates prediction as Patch-level Inference followed by Adaptive Aggregation. Specifically, we first enhance patch-wise predictions from two complementary perspectives: (i) mitigating semantic entanglement in the visual encoder to obtain more discriminative patch representations, and (ii) learning an unsupervised visual classifier to narrow the vision-language modality gap. We then introduce an adaptive aggregation module that consolidates patch-level scores into the final multi-label prediction. Notably, the entire pipeline is fully training-free, requiring no gradient updates or parameter fine-tuning. Experiments show that our method achieves strong improvements with minimal extra computation, exceeding a 6% mAP gain on the challenging NUS-WIDE benchmark over representative baselines. Code is available at this https URL.

[CV-39] Data-driven Head Motion Generation through Natural Gaze-Head Coordination

Link: https://arxiv.org/abs/2605.25810
Authors: Xiaohan Liu, Yilin Wen, Yusuke Sugano
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present the first data-driven approach to model temporal gaze-head coordination from large-scale in-the-wild facial videos. To obtain training data for generalizable learning, we propose an automatic pipeline that extracts natural yet diverse gaze and head motions with off-the-shelf appearance-based gaze estimators. To capture the probabilistic correlation and temporal dynamics of gaze-head coordination, we build our model on a generative conditional Variational Autoencoder for plausible yet diverse gaze-conditioned head motion generations. We further apply our framework to gaze-controlled facial video generation, where we enable video generation with natural and realistic head motion correlated to the input gaze - an aspect that has not been emphasized before. Human evaluation and quantitative comparisons demonstrate our method’s effectiveness and validate our design choices, with evaluators showing statistically significant preference for our approach over baseline methods.

[CV-40] An Analysis Focused on Women's Safety: Can VAD Models Be Enhanced by a Multi-modal Dataset?

Link: https://arxiv.org/abs/2605.25806
Authors: Sangeeta, Maddikuntla Sai Prajwal, Debi Prosad Dogra, Kamalakar Vijay Thakare, Hyungjoo Jung, Ig-Jae Kim, Heeseung Choi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 6 figures, 4 tables

Abstract:Women’s safety and security are paramount for a modern society. Crimes against women occur in daylight as well as in low-light conditions. Often, such events are captured through real-world surveillance cameras that operate at lower resolutions. Despite substantial progress in CV-related research, video anomaly detection (VAD) focused on women’s safety has not yet been adequately addressed. Existing video anomaly datasets contain well-lit, high-resolution, close-shot videos, and fail to represent women-centric anomalies such as chain snatching, stalking, inappropriate touch, and other subtle forms of crime against women. To address these problems, we propose the ExtrAnom dataset, a new multi-modal benchmark containing 1001 videos with textual descriptions, 500 normal and 501 anomalous, classified into 5 different types of women-centric crimes. The dataset comprises low-light (8%), low-resolution videos (13%), long-shot (15%), along with daylight (64%) anomalous videos. And it covers anomalous events like stalking (3.9%), chain snatching (17.6%), kidnapping (7.3%), assassinations (2.3%), harassment (18.9%), and normal (50%). Each video is supplemented with 4 textual annotations, including one human-generated and three LLM-generated descriptions, enabling cross-modal and VLM-based validations. The aim of creating a women-centric dataset is to accurately detect the women-centric anomaly patterns, which are possible to observe visually. The dataset supplements the VLMs to accurately generate video-level descriptions. ExtrAnom has been benchmarked against popular unimodal and multi-modal VAD datasets (e.g., XD-Violence, UCF-Crime, and UCA) and SOTA methods. Experiments reveal that the existing datasets are insufficient to train models for detecting women-centric anomalies.

[CV-41] Event-to-Video Reconstruction using Spatio-Temporal and Frequency-Enhanced Deep Neural Networks

Link: https://arxiv.org/abs/2605.25804
Authors: Ramna Maqsood, Paulo Nunes, Luís Ducla Soares, Caroline Conti
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Event cameras offer significant advantages over conventional frame-based counterparts, including high temporal resolution, low latency, and energy efficiency. These characteristics make them suitable for high-speed and high-dynamic range scene acquisition scenarios; however, the lack of dense intensity frames limits the direct applicability of conventional computer vision methods for scene understanding. Event-to-video (E2V) reconstruction seeks to bridge this gap by converting asynchronous event streams into a sequence of synchronous video frames. Existing E2V reconstruction methods based on convolutional neural networks and transformers operate primarily in the spatial domain and often struggle to recover fine structural details while suppressing severe reconstruction artifacts. To address these issues, we propose MSFET-E2V, a novel multiscale frequency-enhanced transformer model. At its core lies a cross-domain attention module, which fuses spatio-temporal features with frequency-aware representations derived from the discrete wavelet transform. Unlike prior methods relying solely on spatial attention, our approach effectively captures both local and global structures by taking into account low- and high-frequency components, enhancing detail preservation and robustness across various motion scenarios. Furthermore, we propose a lightweight wavelet-enhanced skip block that serves as a skip connection, facilitating artifact suppression and structural detail refinement through joint spatial-frequency domain processing. Extensive experiments demonstrate that MSFET-E2V achieves superior performance over state-of-the-art methods on multiple real-world event datasets, offering significant gains in reconstruction quality. Moreover, compared to the existing transformer-based method, our proposed model significantly reduces the number of parameters, the GPU memory usage, and inference time.

[CV-42] ATV-Net: Adaptive Triple-View Network with Dynamic Feature Fusion

Link: https://arxiv.org/abs/2605.25803
Authors: Hsin-Jui Pan, Sheng-Wei Chan, Meng-Qian Li, Chun-Po Shen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code will be released soon

Abstract:Recent semantic segmentation research has increasingly moved toward stronger context modeling, dense attention, and transformer-based architectures. Although these models achieve impressive performance, classical CNN-based segmentation pipelines remain attractive because of their simplicity, efficiency, and ease of implementation. This paper revisits a practical question: how far can a ResNet-based segmentation model be improved by only modifying the segmentation head? We propose ATV-Net, an Adaptive Triple-View Network that strengthens a ResNet-101 backbone using three simple but complementary receptive-field views. The micro view captures point-wise semantic responses, the local view models neighborhood structures and object boundaries, and the scout view provides enlarged contextual cues. Instead of fusing these views with fixed weights, ATV-Net introduces an Adaptive Decision Gate that dynamically selects receptive-field responses according to input scene characteristics. A compact global coordination layer is further applied to improve spatial and semantic consistency. Experiments on the Cityscapes validation set show that ATV-Net achieves 80.31% mIoU. This result suggests that classical CNN-based segmentation is still far from obsolete: with simple receptive-field views and adaptive fusion, a ResNet-based pipeline can reach a competitive accuracy level without relying on transformer-style global attention or overly complex context modules.
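
Editor's sketch: the abstract above reports its result as mIoU on Cityscapes. For readers less familiar with the metric, the snippet below computes mean intersection-over-union from a confusion matrix under the usual definition, IoU = TP / (TP + FP + FN) per class, averaged over classes that appear at all. The function name `mean_iou` and the toy matrix are illustrative assumptions and are not taken from the paper.

```python
def mean_iou(confusion):
    """Mean IoU from a square confusion matrix.

    confusion[i][j] counts pixels whose ground-truth class is i and whose
    predicted class is j. Classes with an empty union are skipped.
    """
    num_classes = len(confusion)
    ious = []
    for c in range(num_classes):
        true_pos = confusion[c][c]
        false_neg = sum(confusion[c]) - true_pos
        false_pos = sum(confusion[r][c] for r in range(num_classes)) - true_pos
        union = true_pos + false_pos + false_neg
        if union > 0:
            ious.append(true_pos / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy two-class example: per-class IoU is 3/5 and 5/7.
toy_confusion = [
    [3, 1],
    [1, 5],
]
toy_miou = mean_iou(toy_confusion)
```

Benchmark toolkits typically accumulate one confusion matrix over the whole validation set before applying this formula, rather than averaging per-image scores.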

[CV-43] Rethinking VLM Representation for VLA Initialization

Link: https://arxiv.org/abs/2605.25802
Authors: Weifeng Lin, Siyuan Huang, Hao Li, Tingwei Chen, Ruichuan An, Xinyu Wei, Jianbo Liu, Hongsheng Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 main-text pages, 5 appendix pages, 4 figures

Abstract:Vision-Language-Action (VLA) models widely adopt pretrained Vision-Language Models (VLMs) as policy backbones, yet it remains unclear what kind of pretrained VLM representation is useful as a VLA initialization. In this paper, we study VLA initialization as a controlled representation-design problem along three axes: capability-level embodied VQA supervision, parameter-update strategy, and robot-data pretraining. Our experiments show that the original pretrained VLM representation is a key source of action performance. However, embodied VQA adaptation does not yield uniform gains: its benefit depends on downstream bottlenecks, and gains from different capability domains are not simply additive. For update strategy, LoRA provides a more reliable initialization than Full Finetune, indicating that overly reshaping the pretrained representation can weaken VLA initialization. Robot-data pretraining further improves VLA initialization, with the strongest variant obtained by staged LoRA-based training. Together, these findings suggest that effective VLM-to-VLA adaptation should inject action-relevant embodied and robot-trajectory signals while preserving the pretrained VLM representation that remains useful for action learning.

[CV-44] PixelWizard: Towards Efficient High-Fidelity Video Generation at Ultra-Large Spatial Resolution

Link: https://arxiv.org/abs/2605.25801
Authors: Wenxue Li, Jingjing Ren, Peng Zhang, Tian Ye, Daiguo Zhou, Jian Luan, Lei Zhu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:High-resolution video generation faces a coupled bottleneck of optimization instability and prohibitive computational costs. The massive expansion of the token sequence not only biases optimization toward local textures at the expense of global coherence, leading to structural collapse, but also imposes prohibitive training costs and severe inference latency. To address this, we propose PixelWizard, a framework that hierarchically decouples global structure modeling from fine-grained detail synthesis. PixelWizard first establishes a compact spatiotemporal anchor to concentrate dense structural priors, which then guides fine-grained generation at high resolution. This mitigates the local optimization bias to ensure structural stability without compromising high-frequency details. Leveraging this structural stability, we introduce Noise-Span Aligned Shortcut Training to break the inference bottleneck. By explicitly modeling the step size, this mechanism allows the model to traverse the generation trajectory with large steps. Crucially, we incorporate Exponential Index-Biased Sampling and Adaptive Noise-Span Calibration to align optimization with the shifted noise schedules of high-resolution grids, ensuring robust few-step inference without incurring the heavy overhead of distillation. Extensive experiments demonstrate that PixelWizard achieves superior visual quality while accelerating the generative sampling of native 2K/4K videos by over 10x.

[CV-45] Addressing Exacerbated Attention Sink for Source-Free Cross-Domain Few-Shot Learning CVPR2026

Link: https://arxiv.org/abs/2605.25799
Authors: Shuai Yi, Yixiong Zou, Yuhua Li, Ruixuan Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026

Abstract:Vision-language models (VLMs) like CLIP have shown impressive generalization capabilities, yet their potential for Cross-Domain Few-Shot Learning (CDFSL) remains underexplored, where the model needs to transfer source-domain information to target domains with scarce training data. While the attention sink phenomenon has been observed in VLMs for certain tasks, its role in CDFSL scenarios has not been studied. In this paper, we uncover a critical issue overlooked by prior works: standard target-domain few-shot fine-tuning in CDFSL significantly exacerbates the attention sink problem, leading to poor discriminability across classes. To understand this phenomenon, through extensive experiments, we interpret it as the model’s shortcut learning for domain adaptation: to overcome the huge domain gap between the source and target domains, the model shows a high tendency to push tokens that are initially closer to target-domain classes (i.e., simple tokens) to be even closer to these classes, exacerbating the attention sink and wasting the capability of learning other discriminative but initially further tokens (i.e., hard tokens). To address this, we propose a novel approach to dynamically re-weight tokens according to their relevance with target-domain classes during the target-domain finetuning, which explicitly suppresses the model’s reliance on these simple tokens and enhances the learning of hard tokens, reducing sink tokens and enhancing discriminability. Extensive experiments on four benchmark datasets validate the rationale of our method, demonstrating new state-of-the-art performance. Our codes are available at this https URL.

[CV-46] VertiCue-Bench: Diagnosing Whether MLLMs Use Height Cues to Resolve 2D Ambiguity in Remote Sensing Natural Scenes

Link: https://arxiv.org/abs/2605.25784
Authors: Jing Huang, Duanchu Wang, Junjie Yang, Zihang Cheng, Cheng Li, Lin Cui, Zhouyi Wu, Di Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:Multimodal Large Language Models (MLLMs) have recently shown promising progress in geospatial reasoning. However, existing remote sensing benchmarks remain largely 2D-centric, evaluating models primarily on optical appearance. In natural environments, this paradigm breaks down due to severe spectral confusion, where ecologically distinct regions share similar textures but differ fundamentally in vertical structure. In such cases, explicit 3D structural data, such as Canopy Height Models (CHMs), become essential geometric evidence for semantic disambiguation. Yet, it remains unclear whether current MLLMs can genuinely leverage vertical cues to resolve appearance-level ambiguity. To address this gap, we introduce VertiCue-Bench, the first diagnostic benchmark for CHM-grounded geospatial reasoning. VertiCue-Bench comprises 1,534 carefully curated instances across 17 tasks, explicitly disentangling low-level height perception from ambiguity-aware semantic reasoning. Evaluations on 14 state-of-the-art general and remote-sensing-specialized MLLMs, combined with counterfactual modality testing, reveal a striking perception-reasoning dissociation. While models exhibit emerging competence in reading raw CHM height cues, they largely fail to translate geometric perception into reliable semantic reasoning, often underperforming RGB-only baselines when joint constraints are required. Overall, VertiCue-Bench exposes a critical geometry-to-semantics gap in natural scene understanding, offering actionable insights for advancing geospatial MLLMs.

[CV-47] OMGTex: One-stage Multi-style Facial Texture Reconstruction without Geometry Guidance CVPR2026

Link: https://arxiv.org/abs/2605.25778
Authors: Zitong Xiao, Yuda Qiu, Zisheng Ye, Xiaoguang Han
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026 (Poster)

Abstract:We propose OMGTex, an end-to-end diffusion-based framework for reconstructing high-quality and editable facial UV textures from multi-style facial images. Existing texture reconstruction methods face two major limitations: (1) Fragility due to reliance on 3D geometry priors, which are difficult to estimate accurately, especially under facial occlusions or in stylized domains; and (2) A lack of semantic disentanglement, inhibiting region-specific texture editing and style transfer. Our work addresses both challenges simultaneously. Our core innovation is a geometry-free pipeline that directly maps a 2D face image to its corresponding editable UV texture. We introduce two key techniques: First, to address the challenge of UV misalignment common in diffusion generation, we introduce a gradient-guided refinement strategy at inference time, which explicitly corrects structural consistency. Second, we leverage the inherent semantic distribution capability of diffusion models and design a novel training paradigm to enhance this tendency, enabling semantic-aware editing of facial texture. Furthermore, to address the data scarcity in multi-style texture reconstruction, we construct CANVAS, the first comprehensive paired texture reconstruction dataset covering realistic and diverse stylized domains. To the best of our knowledge, OMGTex is the first geometry-free inference framework that achieves robust, style-consistent, and editable facial texture reconstruction across diverse domains. Our method achieves state-of-the-art performance on multiple facial texture benchmarks.

[CV-48] DRFusion: Drift-Resilient Temporally Consistent Infrared-Visible Video Fusion

Link: https://arxiv.org/abs/2605.25775
Authors: Xingyuan Li, Haoyuan Xu, Shulin Li, Xiang Chen, Zhiying Jiang, Jinyuan Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 7 figures, 4 tables

Abstract:Infrared and visible video fusion is essential for achieving comprehensive perception in dynamic scenes. However, maintaining temporal consistency remains a formidable challenge. Conventional methods relying on optical flow often suffer from geometric rigidity and ghosting artifacts. Moreover, standard diffusion-based fusion models typically operate in a frame-by-frame manner; when extended to autoregressive settings, they lack intrinsic temporal constraints and are prone to severe error accumulation and drifting, where minor artifacts amplify over time. To address these limitations, we propose a drift-resilient video fusion method that reformulates the task as history-conditioned motion generation. We introduce Stabilized History Guidance and Soft Temporal Anchoring to reframe temporal consistency as spectral filtering, implicitly aggregating motion dynamics without rigid alignment. Furthermore, our Decoupled Structure-Motion Adaptation strategy bridges pre-trained priors and structural constraints via two-stage training and latent refinement. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both fusion quality and temporal stability.

[CV-49] SAFE-Diff: Scale-Aware Attention and Feature-Dispersive Diffusion with Uncertainty Estimation for Contrast-Enhanced Breast MRI Synthesis MICCAI2026

Link: https://arxiv.org/abs/2605.25767
Authors: Tianyu Zhang, Xinglong Liang, Jarek van Dijk, Luyi Han, Chunyao Lu, Antonio Portaluri, Xinghe Xie, Yaofei Duan, Nika Rasoolzadeh, Xin Wang, Yuan Gao, Muzhen He, Yue Sun, Jonas Teuwen, Tao Tan, Ritse Mann
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Early accepted by MICCAI 2026

Abstract:Synthesizing high-fidelity contrast-enhanced MRI is clinically valuable for safer and more efficient breast cancer screening, yet remains challenging due to complex lesion textures and heterogeneous enhancement patterns.

[CV-50] Concept Unlearning via Cross-Attention Activation Projection for Diffusion Models

Link: https://arxiv.org/abs/2605.25765
Authors: Saemi Moon, Suhyeon Jun, Seoyeon Lee, Dongwoo Kim
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Concept unlearning aims to erase a target concept from a pretrained text-to-image diffusion model without retraining. Closed-form methods are attractive in this setting because they apply a single deterministic edit to the cross-attention weights and add no inference-time cost. Existing closed-form methods, however, represent the target concept through the text encoder’s response to a few short anchor prompts that name it, and paraphrased prompts that evoke the concept without naming it consistently bypass the edit. We argue that the target should instead be represented in the cross-attention activation space. Text embeddings describe the user’s prompt, while cross-attention activations describe what the model is about to render, and the latter generalize to paraphrase the anchor templates do not cover. Building on this observation, we propose PURE (Projection in U-Net Rendering for Erasure), a closed-form method that builds the forget and retain bases from per-layer cross-attention activations captured along a short denoising trajectory and applies a single linear projector to the cross-attention key and value weights. On a recent holistic concept-unlearning benchmark covering ten concepts across artistic style, intellectual property, celebrity, and NSFW categories, PURE significantly reduces target leakage under paraphrased and adversarial prompts while preserving retain concepts close to the unedited model, yielding the best overall forget-retain trade-off among evaluated methods.

[CV-51] Benchmarking Pathology Foundation Models for Spatial Domain Understanding MICCAI2026

Link: https://arxiv.org/abs/2605.25764
Authors: Bokai Zhao, Yiyang Zhang, Yuanchi Zhu, Hanqing Chao, Long Bai, Tai Ma, Minfeng Xu, Ming Song, Tianzi Jiang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: MICCAI2026

Abstract:Pathology foundation models (PFMs) have emerged as a core approach for learning transferable representations from whole slide images (WSIs), and they are typically benchmarked through downstream clinical endpoints. While such task level evaluations are indispensable, they offer limited insight into what the representations themselves encode, particularly whether PFM embeddings can distinguish meaningful tissue regions and capture their spatial relationships. We present SpaPath-Bench, a representation level benchmark designed to diagnose spatial representation capability in PFMs. SpaPath-Bench formulates spatial domain identification (SDI) on paired whole slide image and spatial transcriptomics (ST) data as a diagnostic task. It curates 42 public paired WSI and ST slides, enables large scale evaluation across 19 encoders and seven SDI methods, and measures partition quality using three complementary criteria: unsupervised spatial coherence, transcriptomics referenced agreement, and expert referenced agreement. Across 83K runs, SpaPath-Bench reveals that different pretraining paradigms capture distinct aspects of tissue spatial architecture, and it provides practical guidance for building the next generation of spatially aware computational pathology models. Code and data pipelines are publicly available at this https URL.

[CV-52] AI-T2I: Aggregating-and-Isolating Cross-Attention to Diffusion Models for Text-to-Image Synthesis

Link: https://arxiv.org/abs/2605.25763
Authors: Shipeng Cao, Biao Qian, Haipeng Liu, Yang Wang, Meng Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE Transactions on Multimedia (2026). 13 pages, 15 figures

Abstract:Text-to-image synthesis has made significant progress, benefiting from the strong generative capabilities of diffusion models. However, these models struggle to achieve precise text-to-image alignment within cross-attention maps during the denoising process. Existing works primarily focus on inter-subject-token activations (i.e., cross-attention scores) overlap for different subjects, overlooking the intra-subject-token activations scattering issue for identical subjects. In this paper, we propose an Aggregating-and-Isolating cross-attention approach to diffusion models for Text-to-Image synthesis, dubbed AI-T2I. Technically, to address the scattering issue, we devise an aggregation loss to identify and consolidate the scattered intra-token activations, which implicitly helps mitigate the potential overlap issue. Upon that, an isolation loss is further introduced to push the inter-token activations apart, thus fulfilling precise text-to-image alignment. Extensive experiments on various benchmarks demonstrate the superiority of AI-T2I over the state-of-the-art works for text-to-image synthesis. Furthermore, our AI-T2I exhibits excellent generalization across other tasks, e.g., controllable layout generation and personalized generation.

[CV-53] Towards Anatomically Plausible Human Image Generation via Synthetic Localized Preferences

Link: https://arxiv.org/abs/2605.25759
Authors: Bao Li, Yuliang Xiu, Zhen Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Large-scale text-to-image foundation models have achieved remarkable visual realism, yet generating human images with correct anatomical structures remains challenging. Existing approaches enforce anatomical constraints through part-specific modules or localized loss weighting during supervised fine-tuning on high-quality human photos, but such datasets are limited and often provide ambiguous optimization signals due to confounding factors such as lighting, pose, and background. Preference-based alignment offers an alternative, but standard Direct Preference Optimization (DPO) treats all pixels equally and therefore fails to exploit the localized nature of anatomical artifacts. To address this, we propose the framework of Alignment via Synthetic Anatomical Preference (ASAP), which constructs controlled preference pairs through a localized degradation mechanism applied to high-fidelity human images. This mechanism performs a controlled experiment on images by introducing explicit anatomical errors in targeted regions while preserving the remaining content. With this mechanism, we create the Human Anatomical Preference (HAP) dataset with over 10K curated pairs for effective anatomical alignment of text-to-image human image generative models. To better leverage the locality of these controlled preference pairs, we introduce a localized and margin-bounded variant of DPO that prioritizes optimization in targeted anatomical regions while enforcing a finite preference margin to prevent over-optimization and preserve global semantics. We further introduce HAF-Bench, a benchmark for systematic evaluation of anatomical fidelity. Extensive experiments demonstrate that ASAP consistently reduces anatomical errors across multiple foundation models while maintaining overall image quality.

[CV-54] Broadband Hyperspectral 3D Imaging using Dispersed Structured Light

Link: https://arxiv.org/abs/2605.25757
Authors: Suhyun Shin, Yunseong Moon, Ryota Maeda, David Lindell, Kyros Kutulacos, Seung-Hwan Baek
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Hyperspectral 3D imaging enables the capture of dense spectral information and scene geometry but has traditionally been confined to narrow spectral windows, typically the visible range. In this work, we introduce a broadband hyperspectral 3D imaging (BH3D) method to extend this capability across the full visible-near-infrared and short-wavelength infrared (SWIR) spectrum (450-1500 nm). This broad coverage is critical as it captures complementary physical cues: visible wavelengths reveal surface appearance, while SWIR bands provide insight into subsurface properties and material composition. However, realizing BH3D is challenging due to fundamental sensor constraints between visible-spectrum silicon and SWIR-spectrum InGaAs sensors, which necessitate complex multi-spectrograph designs. Here we propose a single-spectrograph BH3D system, using a stereo setup comprising visible and SWIR cameras, that reconstructs dense broadband hyperspectral reflectance together with accurate 3D geometry. Our key idea is to extend dispersed structured light to the broadband regime using a single spectrograph. We model the image formation of broadband dispersed structured light, and estimate hyperspectral reflectance and depth. We validate our approach on diverse real-world scenes, demonstrating accurate reconstruction with a mean spectral angle mapper of 0.13 rad, root mean square error of 0.03, and mean depth error of 4.5 mm. We further demonstrate identifying metameric materials, performing imaging through opaque layers, uncovering hidden features on banknotes, and revealing blood vessels.
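
The spectral angle mapper (SAM) reported above is a standard metric: the angle, in radians, between a reference spectrum and an estimated spectrum treated as vectors. A minimal sketch follows, assuming each spectrum is a plain list of reflectance values sampled at the same wavelengths. The function names are illustrative and are not taken from the paper's code.

```python
import math


def spectral_angle(reference, estimate):
    """Angle in radians between two spectra treated as vectors."""
    dot = sum(r * e for r, e in zip(reference, estimate))
    norm_r = math.sqrt(sum(r * r for r in reference))
    norm_e = math.sqrt(sum(e * e for e in estimate))
    if norm_r == 0.0 or norm_e == 0.0:
        raise ValueError("spectral angle is undefined for an all-zero spectrum")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_r * norm_e)))
    return math.acos(cosine)


def mean_spectral_angle(references, estimates):
    """Average spectral angle over paired spectra (for example, one pair per pixel)."""
    angles = [spectral_angle(r, e) for r, e in zip(references, estimates)]
    return sum(angles) / len(angles)
```

Because the angle ignores vector length, SAM is insensitive to a global brightness scale: a spectrum and a scaled copy of it have an angle of approximately zero.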

[CV-55] SplitAvatar: One-shot Head Avatar with Autoregressive Gaussian Splitting

Link: https://arxiv.org/abs/2605.25751
Authors: Hongzhe Liao, Chuhua Xian, Hongmin Cai, Haiyang Liu, Fa-Ting Hong
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:3D Gaussian Splatting (3DGS) provides an efficient method for high-quality scene reconstruction using anisotropic Gaussians. Recently, 3DGS-based methods have significantly improved the rendering quality of human avatars while enabling real-time performance. However, existing methods suffer from a magnitude mismatch in the number of Gaussians generated by image-based and 3DMM-based approaches. This discrepancy results in reconstructed expressions that lack fine-grained detail. In this paper, we introduce a novel method for reconstructing an animatable head avatar from a single image. We propose a Graph splitting network to progressively generate Gaussians from coarse to fine using an autoregressive architecture. To address the graph inconsistency caused by split Gaussians, we employ a mesh topology extension method to align the GNN’s connectivity with the increased Gaussian count. Furthermore, we introduce a novel density control method that includes a gating mechanism that generates soft masks for Gaussians, preventing over-densification after the splitting operation. This allows for dynamic control over Gaussian density across different facial regions. For smooth and rapid training, we employ a delayed filtering strategy to avoid re-computing the graph topology during training. Experimental results demonstrate that our autoregressive structure effectively improves expression representation ability by progressively splitting Gaussians. This process, enabled by the GNN-guided splitting, synthesizes more precise facial details and achieves higher reconstruction quality.

[CV-56] SFR-Net: Learning Scale-Frustum Representations for Ultra-Wide Area Remote Sensing Image Segmentation

Link: https://arxiv.org/abs/2605.25737
Authors: Chuyu Zhong, Keyan Chen, Qinzhe Yang, Bowen Chen, Zhengxia Zou, Zhenwei Shi
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Pixel count and geographical coverage are two key characteristics of remote sensing images. Existing remote sensing image segmentation methods typically focus on images with either a small pixel count or a large pixel count but limited geographical coverage. In this paper, we introduce a novel segmentation task targeting ultra-wide area (UWA) remote sensing images, characterized by both a large pixel count and extremely wide geographical coverage. The core challenges of UWA segmentation lie in simultaneously handling ground objects with significantly varying scales and maintaining long-range contextual semantic continuity. To address these challenges, we propose the Scale-Frustum Representation Network (SFR-Net). Inspired by the viewing frustums of remote sensing images captured from different altitudes, we construct scale-frustum representations, enabling unified modeling of ground objects and contextual features at different scales. Furthermore, we design a cascaded cross-scale fusion mechanism to effectively integrate these representations, enhancing local semantic understanding while ensuring long-range contextual continuity. Experimental results on GID and FBPS demonstrate that SFR-Net achieves state-of-the-art performance, improving mIoU by 1.72% and 4.29%, respectively, over the strongest competing methods. In addition, the proposed scale-frustum representations can be integrated into generic segmentation networks to improve both segmentation accuracy and convergence speed. The implementation code will be publicly available at this https URL.

[CV-57] DeCoDrift: Stabilizing Decoder Coupling in Closed-Loop Foundation Segmentation

Link: https://arxiv.org/abs/2605.25730
Authors: H. M. Shadman Tabib, Md. Shamsuzzoha Bayzid, M Sohel Rahman
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 5 figures

Abstract:Foundation segmentation models such as Segment Anything Model (SAM) are now routinely used in iterative pipelines, where each predicted mask is fed back as the next prompt. This practice turns segmentation into a closed-loop dynamical process, yet the decoder-level behavior of these systems remains largely unexamined. We show that this feedback loop can induce a previously overlooked failure mode, decoder coupling drift, in which the mask decoder’s cross-attention progressively loses alignment with the target object, causing errors to accumulate across iterations. We study this phenomenon by instrumenting SAM’s mask decoder and deriving ground-truth-free measures of prompt-image coupling, attention stability, and temporal consistency. On volumetric electron microscopy data, these decoder-internal signals reveal that standard iterative prompting systematically degrades attention alignment and temporal coherence relative to oracle-anchored feedback. We then formalize iterative prompting as a discrete-time dynamical system and show how proximal anchoring reduces error amplification in the feedback loop. Building on this analysis, we introduce DeCoDrift, a training-free inference-time stabilization framework that constrains prompt updates and preserves decoder coupling across iterations. Across extensive experiments, DeCoDrift consistently improves attention stability, temporal coherence, and segmentation quality over standard iterative prompting, without retraining or ground-truth supervision. More broadly, our results show that decoder-internal dynamics are not merely diagnostic: they provide actionable signals for stabilizing foundation segmentation models in closed-loop use.

[CV-58] TriDP-PTM: a three-stage distortion-perception tradeoff guides the pre-training model for radar cardiac sensing

Link: https://arxiv.org/abs/2605.25725
Authors: Jinye Li(1,2,3), Aidong Men(4), Yang Liu(5), Qingchao Chen(1,2,6) ((1) National Institute of Health Data Science, Peking University, Beijing, China, (2) Institute of Medical Technology, Peking University, Beijing, China, (3) Beijing University of Posts and Telecommunications, Beijing, China, (4) School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China, (5) Wangxuan Institute of Computer Technology, Peking University, Beijing, China, (6) State Key Laboratory of General Artificial Intelligence, Peking University, Beijing, China)
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Cardiovascular diseases (CVDs) remain a leading cause of death globally, necessitating continuous, accurate non-invasive cardiac monitoring. While non-contact radar-based approaches show great promise, they often employ a single “distortion-driven” or “perception-driven” paradigm, frequently facing a trade-off between “low distortion but weak semantic information” and “high perceptual fidelity but poor interpretability.” To address this, we propose a Three-stage Distortion-Perception Pre-Training Model (TriDP-PTM), a radar-based multi-scale fusion dual-path framework that systematically compares the “direct radar-to-task” path against an “indirect radar-to-ECG-to-task” path. By integrating an ECG generator with a feature discriminator to form a composite loss function, our approach effectively incorporates medical priors - such as ECG morphology and rhythm - into downstream tasks. Through empirical analysis, we reveal that this trade-off manifests in three distinct phases (Positive-Sum, Coopetitive, and Negative-Sum), showing optimal downstream clinical accuracy typically emerges in the coopetitive stage. Extensive experiments on a dataset involving 30 subjects across 5 physiological states reveal that the indirect path consistently outperforms the direct path in diverse tasks, achieving 0.80 mean IoU in waveform segmentation, 98.3% average classification accuracy across four tasks, and a 56% MAE reduction in blood pressure regression compared to the strongest baselines. These findings validate our framework and indicate that, within the indirect radar-to-ECG pathway, appropriately weighting distortion and perception losses to operate in the coopetitive regime is critical for achieving both clinically interpretable ECG morphology and strong downstream accuracy in non-contact cardiac monitoring.

[CV-59] Towards Open-World Referring Expression Comprehension: A Benchmark with Training-free Multi-task Consistency Checker

Link: https://arxiv.org/abs/2605.25706
Authors: Zongjian Wu, Lei Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 7 figures. Project Page: this https URL

Abstract: Referring expression comprehension (REC) aims to localize a target object within an image based on a given expression. Although recent advances in vision-language models have led to substantial improvements in REC tasks, current REC benchmarks often feature simple scenarios and assume that each expression maps to a unique object. These limitations hinder the deployment of REC models in open-world environments. To fill this gap, we introduce OpenRef, a new benchmark for REC in complex visual and linguistic scenarios. OpenRef features three key advancements: 1) Diverse visual scenarios: spanning diverse visual domains, including ground views, drone views, dark scenes and adverse weather conditions; 2) Variable target counts: breaking the single-target limitation with multi-target and none-target samples; 3) Rich vocabulary types: incorporating proper nouns, polysemous words and ordinal terms to fit a wider range of expression needs. Furthermore, as traditional metrics are insufficient for the open-world setting, we leverage F1 to measure grounding accuracy and propose N3R (Negative Relative Rejection Reliability) to assess relative rejection reliability against negative expressions. Finally, we introduce Multi-task Consistency Checker (MCC), a training-free, plug-and-play strategy that enhances model performance with one click by enforcing consistency self-verification. Extensive experiments demonstrate that this work significantly advances the performance of existing REC models in complex scenarios, paving the way for open-world REC. Project page: this https URL

[CV-60] Opportunistic Target Selection: Early Directional Commitment for Query-Efficient Black-Box Adversarial Attacks

Link: https://arxiv.org/abs/2605.25663
Authors: Florent Tariolle, Florian Yger
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 10 figures, 3 tables; code available at this https URL

Abstract: Black-box adversarial attacks that minimize only the ground-truth confidence suffer from class drift: perturbations wander through the feature space without committing to a specific adversarial class, wasting queries on diffuse, undirected progress. We introduce Opportunistic Target Selection (OTS), a lightweight wrapper that switches an untargeted attack to a targeted objective early in its trajectory, locking onto whichever non-true class currently leads. OTS requires no architectural modification to the underlying attack, no gradient access, and no a priori target-class knowledge. We validate OTS on three score-based attacks (SimBA, Square Attack with cross-entropy loss, and Bandits) across five standard ImageNet classifiers (4,500 runs). On random-search attacks, OTS closely tracks oracle performance, with gains up to +27 pp in success rate and 43% relative reduction in censored-mean iterations on ResNet-50. On gradient-estimation attacks (Bandits) and attacks with margin loss, OTS is redundant, a negative result that reinforces our interpretation of OTS as a margin-loss surrogate. On adversarially-trained models, a bimodal difficulty distribution eliminates the regime where targeting helps.
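
The wrapper idea in this abstract (start untargeted, then lock onto the leading non-true class) can be illustrated with a small sketch. It assumes the attack only sees a list of class scores per query and that the switch happens at a fixed iteration. The abstract does not state the actual switching criterion, so `switch_iteration` and the class and function names are hypothetical.

```python
def leading_wrong_class(scores, true_label):
    """Index of the highest-scoring class other than the true one."""
    candidates = (c for c in range(len(scores)) if c != true_label)
    return max(candidates, key=lambda c: scores[c])


class OpportunisticTarget:
    """Objective that starts untargeted and later commits to a single target class."""

    def __init__(self, true_label, switch_iteration):
        self.true_label = true_label
        self.switch_iteration = switch_iteration  # hypothetical fixed switch point
        self.target = None

    def loss(self, scores, iteration):
        """Value the underlying attack should minimize at this query."""
        if self.target is None and iteration >= self.switch_iteration:
            # Lock onto whichever non-true class currently leads, and keep it.
            self.target = leading_wrong_class(scores, self.true_label)
        if self.target is None:
            return scores[self.true_label]  # untargeted: push true-class score down
        return -scores[self.target]  # targeted: push the locked class score up
```

The underlying score-based attack is left untouched: it keeps proposing perturbations and simply minimizes whatever `loss` returns, which is what makes the approach a wrapper rather than a new attack.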

[CV-61] DRM: Diffusion-based Reward Model With Step-wise Guidance

Link: https://arxiv.org/abs/2605.25661
Authors: Jaxon Zhang, Binxin Yang, Hubery Yin, Chen Li, Jing Lyu
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract: Current mainstream methods of aligning diffusion models with human preferences typically employ VLM-based reward models. However, these reward models, pre-trained for semantic alignment, struggle to capture essential perceptual qualities such as aesthetics, composition, and visual harmony. In this work, we argue that a model capable of high-fidelity generation must possess a profound understanding of these visual attributes. Based on this insight, we introduce the Diffusion-based Reward Model (DRM), a novel paradigm that uses the pre-trained diffusion model as a powerful evaluative backbone. A key advantage of the DRM is its unique ability to assess not only the final image but also the noisy intermediate latents at any stage of the generative process. We leverage this step-wise evaluative capacity in two ways. First, we propose Step-wise GRPO, a reinforcement learning algorithm that provides dense, per-step rewards to resolve the imprecise credit assignment problem in the GRPO algorithm, leading to more stable and effective alignment. Second, we introduce Step-wise Sampling, a novel inference strategy that employs the DRM as a dynamic guide to evaluate multiple generation paths at each step, steering the process towards higher-quality outcomes. Extensive experiments confirm that our approach significantly enhances the final quality of generated images. Code: this https URL.

[CV-62] StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration

Link: https://arxiv.org/abs/2605.25659
Authors: Linrui Tian, Qi Wang, Bang Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Real-time streaming joint audio-video generation for character animation requires a generator to speak the requested transcript, maintain visual identity across chunks, and run within a strict playback budget. These requirements are difficult to satisfy simultaneously: chunk-wise autoregressive generation can accumulate transcript-audio misalignment and visual drift, while the few-step distillation needed for low latency often degrades spatial diversity and temporal quality. We present StreamChar, a streaming framework that separates long-horizon orchestration from short-window audio-video denoising. An LLM-based orchestrator uses the transcript and historical context to produce frame-aligned audio conditions, and a joint audio-video DiT performs local bidirectional denoising with reference and motion-frame conditioning. For efficient deployment, we use a two-stage distillation pipeline that first compresses the sampler and then fine-tunes the student under online chunk rollouts. A progress-aware pointer aligns partial transcripts with generated audio during rollout training, and a sink-chunk memory provides a persistent visual anchor for reducing long-horizon drift. Experiments on short-clip and long-horizon protocols show that StreamChar runs in real time on a single H100 GPU and provides a favorable system-level trade-off among transcript fidelity, audio-visual synchronization, visual quality, and streaming stability compared with recent joint and audio-driven baselines.

[CV-63] ARMA-C3: A Contrastive ARMA Convolutional Framework for Unsupervised and Semi-supervised Classification

Link: https://arxiv.org/abs/2605.25657
Authors: VSS Tejaswi Abburi, Saurabh J. Shigwan, Nitin Kumar
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:In biomedical and neurodegenerative disorders, accurate and early disease identification remains challenging due to the scarcity of labeled data and the complexity of imaging patterns. To address these challenges, we introduce ARMA-C3, a unified unsupervised and semi-supervised graph learning framework for node classification based on contrastive learning and graph-cut regularization to learn structurally meaningful and discriminative representations. By modeling samples or images as graph nodes and exploiting inter-sample relationships, the proposed framework captures subject-level dependencies that conventional machine learning methods typically overlook. We conduct extensive binary classification experiments across five clinically relevant datasets: the Alzheimer’s Disease Neuroimaging Initiative (ADNI), the Neuroimaging in Frontotemporal Dementia (NIFD) dataset, and three medical imaging benchmarks (BreastMNIST, PneumoniaMNIST, and a liver ultrasound dataset). Experimental results demonstrate that ARMA-C3 achieves competitive and frequently superior performance compared to classical clustering techniques, state-of-the-art machine learning models, and existing graph-based deep learning approaches across multiple evaluation settings, particularly under limited supervision and severe class imbalance. The proposed framework further demonstrates robust representation learning and strong cross-modal generalization across diverse biomedical imaging modalities.

[CV-64] Event-based Batting Impact Estimation ICIP

Link: https://arxiv.org/abs/2605.25656
Authors: Ryotaro Ishida, Wataru Ikeda, Ryosei Hara, Akemi Kobayashi, Toshitaka Kimura, Mariko Isogawa
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IEEE International Conference on Image Processing (ICIP) 2026. © 2026 IEEE. Personal use of this material is permitted

Abstract: Estimating the precise timing of batting impact is crucial for understanding rapid sensorimotor control. However, this task is challenging for RGB cameras due to insufficient temporal resolution and motion blur. Similarly, Inertial Measurement Units (IMUs) are impractical for actual matches due to sensor intrusiveness and their limited temporal precision. To overcome these limitations, we propose a novel framework leveraging event-based cameras, which offer microsecond resolution and high dynamic range, to estimate impact timing based on the weighted centroid distance between the detected ball and bat. To address the domain gap between event frames and RGB images that degrades segmentation accuracy, we generate high-density event frames. We then introduce a mask refinement network that leverages these frames and bidirectional mask information, optimized using a novel loss function. Experiments on real-world datasets demonstrate that our method achieves superior accuracy under challenging conditions, including low-light environments and severe occlusions, outperforming baselines by reducing the Mean Absolute Error by approximately 63%.

[CV-65] Hierarchical Consistency Learning for Test-time Adaptation in Camouflage Perception

Link: https://arxiv.org/abs/2605.25651
Authors: Mingfeng Zha, Tianyu Li, Guoqing Wang, Yunqiang Pei, Chaofan Qiao, Jiening Zhang, Yang Yang, Heng Tao Shen
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Camouflaged object detection (COD) aims to localize targets that exhibit minimal perceptual differences from backgrounds through physical attributes. Existing methods, constrained by the static train-then-freeze paradigm, suffer from domain rigidity and annotation dependency, limiting their adaptability to scene variations and unseen camouflage patterns. To overcome these, we propose the hierarchical consistency learning (HCL) framework, which integrates test-time adaptation for dynamic representation recalibration. Specifically, we design the hierarchical representation reconstruction (HRR) to alleviate feature entanglement by synergizing spatial reconstruction with dual-stream frequency-domain decomposition, enhancing robustness against appearance homogenization. The pixel and spectrum inference provide structural and contextual priors. We further introduce task affinity guidance (TAG) to propagate knowledge across branches via channel-wise affinity, aligning local discriminative cues and mitigating semantic drift. To ensure semantic invariance, we formulate the prototype consistency calibration (PCC), which aggregates region features into compact prototypes and establishes prototype-feature similarity. This imposes implicit and hierarchical constraints that bridge task and representation gaps. Extensive experiments across four camouflaged and four underwater object benchmarks, under three degradation settings, demonstrate that our method consistently outperforms state-of-the-art approaches, highlighting its robustness and generalization under distribution shifts.

[CV-66] StreamOV: Streaming Omni-Video Understanding via Evidence-Guided Memory and Response Triggering

Link: https://arxiv.org/abs/2605.25621
Authors: Ming Xie, Zizheng Huang, Xudong Tan, Chao Wang, Xiangyu Zeng, Wenxiao Wu, Tao Chen, Limin Wang, Yanwei Fu
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:While streaming omni-video understanding demands continuous perception and proactive, real-time interaction, this crucial area remains largely under-explored. Current omni-modal methods are inherently designed for offline settings, limiting their applicability in streaming scenarios due to two fundamental flaws. First, they lack robust mechanisms to manage continuously growing audio-visual context over long horizons and cannot autonomously initiate responses at opportune moments. Second, existing benchmarks are predominantly confined to offline, single-turn question answering, failing to capture continuous, multi-turn streaming interactions. To bridge these gaps, we propose StreamOV, a novel Streaming Omni-Video understanding framework for efficient online audio-visual reasoning with bounded memory and proactive response triggering. Specifically, StreamOV introduces a multimodal evidence-guided long-short term memory that condenses historical audio-visual context into compact informative evidence under a fixed budget. It further employs a hidden-state-driven trigger to decide when to respond, avoiding explicit silence-token generation and external routers. We also curate SOVBench, the first comprehensive benchmark for online, multi-turn omni-modal evaluation. Extensive experiments show that StreamOV achieves state-of-the-art performance across diverse streaming and omni-video benchmarks, demonstrating its effectiveness for both online and offline video understanding.

[CV-67] UAV-OVO: Out-of-Viewpoint Generalization in UAV Action Recognition

Link: https://arxiv.org/abs/2605.25615
Authors: Yu Xia, Zhengbo Zhang, Shuaihu Zhang, Zhigang Tu
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:UAV action recognition faces a deployment shift that standard benchmarks often obscure: a model trained on UAV footage captured from low-depression viewpoints may be required to recognize the same action classes from high-depression viewpoints. While the action labels remain unchanged, this shift alters body visibility, motion projection, and scene context, encouraging models to rely on viewpoint-specific shortcuts. We introduce UAV-OVO, an Out-of-Viewpoint generalization benchmark for UAV action recognition. UAV-OVO derives view scores from uncalibrated videos, uses a view-isolation band to assign low-depression videos to the training and in-distribution test splits while reserving high-depression videos for out-of-distribution testing, and constructs ID/OOD test sets matched by class distribution so that performance differences reflect viewpoint shift rather than label imbalance. Across representative video recognizers, UAV-OVO reveals a substantial ID/OOD gap: models that fit the low-depression training distribution well often fail to transfer to held-out high-depression views, exposing viewpoint shortcuts hidden by aggregate accuracy. We further propose LATER, LoRA-Anchored Test-time Re-centering, which first adapts the recognizer with Low-Rank Adaptation (LoRA) and then uses the learned LoRA subspace as a semantic anchor for online feature re-centering. Specifically, LATER projects target-domain displacement onto the orthogonal complement of the LoRA subspace before re-centering features, reducing viewpoint-induced drift while preserving task-relevant semantics. Together, UAV-OVO and LATER provide a controlled testbed and a practical adaptation method for viewpoint-robust UAV video understanding.
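
The projection step described for LATER (removing the part of a displacement that lies in the LoRA subspace before re-centering) is ordinary linear algebra and can be sketched as below. The sketch assumes the subspace is given as an orthonormal basis and that re-centering means subtracting the projected displacement from a feature. Both points, and all function names, are assumptions for illustration; the abstract does not specify how the displacement or the basis is obtained.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_onto_complement(displacement, basis):
    """Remove the components of `displacement` that lie in span(basis).

    `basis` is assumed to be a list of orthonormal vectors.
    """
    residual = list(displacement)
    for u in basis:
        coeff = dot(displacement, u)
        residual = [r - coeff * ui for r, ui in zip(residual, u)]
    return residual


def recenter(feature, displacement, basis):
    """Shift a feature by the displacement with its in-subspace part removed."""
    shift = project_onto_complement(displacement, basis)
    return [f - s for f, s in zip(feature, shift)]
```

Because the in-subspace component is never subtracted, directions spanned by the basis are left untouched by re-centering, which matches the abstract's stated goal of reducing drift while preserving task-relevant semantics.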

[CV-68] Generalized Evidential Deep Learning: From a Bayesian Perspective ICML2026

Link: https://arxiv.org/abs/2605.25599
Authors: Yuanye Liu, Yibo Gao, Yuanyang Chen, Xiahai Zhuang
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to ICML2026

Abstract:Evidential Deep Learning (EDL) has emerged as an efficient, sampling-free strategy for uncertainty estimation. A series of EDL variants have been proposed to address specific limitations of the original framework, achieving notable success. However, the underlying theoretical structure of EDL and the relationships among these variants have received limited systematic investigation. In this work, we establish a principled theoretical foundation for EDL by interpreting it within a generalized Bayesian framework that includes prior specification, posterior update, and training objective. We further characterize evidential uncertainty from a Bayesian distributional uncertainty viewpoint, established via asymptotic analysis. Building on this perspective, we further propose Generalized Evidential Deep Learning (GEDL), a unified and extensible framework that explicitly disentangles the roles of individual components and systematically relates GEDL to existing variants. Extensive experiments demonstrate that GEDL yields comparable results on classification, uncertainty estimation and OOD detections, with theoretical grounding.

[CV-69] SurfSurg6D: Geometry Consistent Dense Correspondence for Textureless Surgical Instrument Pose Estimation

Link: https://arxiv.org/abs/2605.25598
Authors: Daiyun Shen, Shuojue Yang, Chang Han Low, Qian Li, Mengya Xu, Qi Dou, Yueming Jin
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Surgical instrument pose estimation provides crucial information for promising applications, including autonomous robotic surgery, skill assessment, and standardization of surgical workflow. However, this task remains highly challenging due to high precision requirements, frequent occlusions, textureless instruments, scarcity of depth information and very limited annotated data. These constraints often lead to unsatisfactory performance when employing general object pose estimation approaches to surgical scenarios. To address these issues, we first construct a new dataset SynSurg6D, to alleviate the data shortage in this task. We further propose SurfSurg6D, a dense-correspondence framework tailored for surgical instrument pose estimation. Experimental results on the SurgRIPE, EndoVis2018 and SurgPose datasets demonstrate that the introduction of our generated dataset SynSurg6D is able to diversify the pose distributions, thus enhancing the performance of existing approaches. Furthermore, SurfSurg6D outperforms existing methods, providing a robust solution for precise and efficient RGB-only pose estimation.

[CV-70] How Far Has AI Come in Liver Fibrosis Staging? A Large-Scale Real-World Dataset and Benchmark

Link: https://arxiv.org/abs/2605.25595
Authors: Yuanye Liu, Nannan Shi, Zhejia Zhang, Hanxiao Zhang, Boya Wang, Derong Yu, Nao Wang, Yuxin Jin, Yang Zhou, Kunhao Yuan, Siqi Wang, Lida Yang, Xu Qiao, Wentao Liu, Xuelei He, Xin Hong, Guoyan Zheng, Xin Chen, Guang-Zhong Yang, Le Zhang, Lei Li, Yuxin Shi, Xiahai Zhuang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to Medical Image Analysis

Abstract:Despite years of methodological progress, how far AI has come in liver fibrosis staging has never been systematically evaluated under the heterogeneous, multi-center conditions that define clinical practice. To address this gap, we introduce LiFS, a large-scale dataset and benchmark derived from the MICCAI 2025 CARE-Liver challenge, comprising 610 patients across multiple centers and scanners with multi-sequence MRI. To the best of our knowledge, LiFS is the first benchmark providing complete gadoxetic acid-enhanced sequences with histopathology-confirmed annotations from diverse real-world scanners. Through systematic evaluation of 9 independently developed methods selected from 96 registered teams against in-cohort radiologist reference results, our findings address how far current AI has progressed toward clinical-level liver fibrosis staging from three complementary perspectives. First, against radiologists, the best AI methods were broadly comparable to the senior radiologist and significantly exceeded the junior radiologist in selected settings, while median AI performance generally approached junior-radiologist levels. Second, from a data perspective, cross-center heterogeneity, label imbalance, and contrast-enhanced sequence variability emerge as the dominant challenges for AI methods. Third, from a technical perspective, methodological design choices, including spatial registration, input dimensionality, multi-modal fusion strategy, and backbone architecture, appear to modulate cross-center robustness, although no single choice alone closes the gap. Overall, LiFS provides a rigorous real-world benchmark for positioning the current state of AI in liver fibrosis staging and for enabling future research on the key challenges that limit clinically reliable deployment.

[CV-71] Artifact Correction for Echo-Planar Imaging at Low-Field and Ultra-Low-Field MRI

Link: https://arxiv.org/abs/2605.25589
Authors: Sisi Qiao, Yilin Yu, Tiecheng Lin, Yuhao Liu, Jiajia Sun, Xiaoling Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 10 figures, 2 tables

Abstract:Purpose: Echo-planar imaging (EPI) in low-field (LF) and ultra-low-field MRI (ULF) suffers from severe Nyquist ghost artifacts due to odd-even k-space misalignment. This study develops a reference-free artifact correction pipeline that reduces reliance on conventional reference scans while achieving improved ghost suppression. Methods: Starting from the traditional reference-scan-based ghost artifact correction method, we first introduce a peak-alignment-based ghost artifact correction method to correct odd-even line displacement without reference data. To further reduce residual artifacts, an interpolation-and-resampling strategy is applied. The combined method was evaluated using EPI and diffusion-weighted EPI data in LF and ULF. Results: The proposed pipeline effectively mitigated Nyquist ghosts, improved structural continuity, and enhanced signal uniformity. Peak-alignment-based ghost artifact correction method alone provided comparable artifact suppression to reference-scan-based ghost artifact correction method, while interpolation and resampling further suppressed residual artifacts, enabling reliable visualization of brain structures under ULF conditions. Conclusion: A practical, reference-free correction pipeline is presented for LF and ULF EPI, combining peak-alignment-based ghost artifact correction method and interpolation-resampling to achieve efficient ghost suppression and expand the clinical applicability of low-field MRI systems, providing both theoretical guidance and practical experience for ULF EPI-based DWI imaging.
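
To make the peak-alignment idea concrete, here is a toy sketch of correcting odd-even line displacement without a reference scan. It assumes k-space is a list of rows (real or complex samples), that the misalignment is a whole-sample shift along the readout direction, and that the echo peak of each row is its largest-magnitude sample. These are simplifying assumptions for illustration and not the paper's actual pipeline, which also applies interpolation and resampling.

```python
from statistics import median


def peak_index(line):
    """Position of the largest-magnitude sample in one k-space line."""
    return max(range(len(line)), key=lambda i: abs(line[i]))


def circular_shift(line, shift):
    """Move every sample `shift` positions to the right, wrapping around."""
    n = len(line)
    return [line[(i - shift) % n] for i in range(n)]


def align_odd_lines(kspace):
    """Shift odd-indexed lines so their echo peaks line up with the even-indexed lines."""
    even_peaks = [peak_index(row) for row in kspace[0::2]]
    odd_peaks = [peak_index(row) for row in kspace[1::2]]
    # Integer shift estimated from the typical peak position of each group.
    shift = round(median(even_peaks) - median(odd_peaks))
    return [
        circular_shift(row, shift) if idx % 2 == 1 else list(row)
        for idx, row in enumerate(kspace)
    ]
```

An integer shift can only remove whole-sample misalignment; any sub-sample residual would need a finer correction, which is presumably where the interpolation-and-resampling stage mentioned in the abstract comes in.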

[CV-72] Mosaic: Compositional Multi-Concept Erasure via Vector Field Blending

Link: https://arxiv.org/abs/2605.25574
Authors: Junseok Ko, Jungwoo Kim, Jong-Seok Lee
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Concept erasure has emerged as a key research direction for ensuring safe and ethical image synthesis in Text-to-Image (T2I) models. While existing studies have explored concept erasure across multiple concepts, they typically assume only a single target concept per image, a limitation increasingly exposed by modern flow-based T2I models, which can generate complex scenes with multiple concepts simultaneously. To address this gap, we introduce compositional multi-concept erasure, a new task that aims to simultaneously remove multiple target concepts within a single scene. We propose CoME-Bench, a benchmark for evaluating compositional multi-concept erasure, which covers both intra- and cross-category scenarios. We further propose Mosaic, a novel framework for multi-concept erasure in flow-based T2I models, which exploits the spatial locality of target concepts in the vector field by dynamically constructing concept-specific masks and selectively blending them without additional optimization. Extensive experiments demonstrate that Mosaic effectively removes multiple target concepts in complex compositional scenes while preserving non-target contexts.

[CV-73] AnE: Pushing the Reasoning Frontier of Multimodal LLMs via Anchor Evolution

Link: https://arxiv.org/abs/2605.25571
Authors: Zehao Wang, Yihan Zeng, Zidong Gong, Yuanfan Guo, Feng Zhu, Hongzhi Zhang, Wei Zhang, Wangmeng Zuo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 34 pages, 10 figures

Abstract:Post-training via Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) is crucial for enhancing reasoning in Multimodal Large Language Models (MLLMs), yet existing paradigms often reach a performance bottleneck due to the limitations of static data. While current methods leverage self-reflection or self-evolution to push these boundaries, they still suffer from cognitive drift and hallucinated reasoning paths caused by low-quality synthetic data. To address these challenges, we propose Anchor Evolution (AnE), a new paradigm that integrates truth-anchored data curation and model evolution, achieving faithful and steady performance gains at the reasoning frontier. Specifically, we propose Truth Anchor Expansion, which pinpoints the model failing frontier via trajectory rollouts and leverages ground-truth databases to retrieve high-fidelity anchors for faithful data curation. Subsequently, we introduce the Scaffold-Stripping Mechanism to internalize reasoning capabilities. This mechanism first anchors reasoning paths via scaffold-augmented supervision to mitigate the learning complexity and distribution drift of direct SFT on raw data, then leverages RL to strip the scaffold template, thereby effectively transitioning the reasoning paths into intrinsic model capabilities. Experimental results on multimodal reasoning benchmarks show that our method substantially advances the model performance frontier, improving the base model by 10.3% across eight multimodal benchmarks and achieving state-of-the-art results. The code will be made publicly available.

[CV-74] From Contrast to Consistency: Rethinking Event-based Continuous-Time Optical Flow Estimation CVPR2026

Link: https://arxiv.org/abs/2605.25570
Authors: Rui Hu, Song Wu, Wen Yang, Jinjian Wu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026

Abstract:Estimating continuous optical flow is a fundamental yet challenging problem in dynamic visual perception. Event-based cameras, with microsecond latency and high dynamic range, capture brightness changes asynchronously, offering a unique opportunity to model motion with fine temporal precision. However, the scarcity of temporally dense ground-truth annotations limits the effectiveness of supervised learning, while contrast maximization (CM) frameworks, focused on sharpening the Image of Warped Events (IWE), often neglect temporal continuity and structural coherence, leading to distorted trajectories under complex motion. To overcome these challenges, we propose a hybrid-supervised framework for continuous-time optical flow estimation, grounded in the principle of Spatio-temporal Structural Consistency (STSC). This paradigm jointly enforces local structural stability and trajectory continuity, ensuring physically coherent motion across time. To further enhance representation and robustness, we design a bidirectionally complementary multi-scale architecture and employ a curriculum-guided hybrid training strategy, enabling a smooth transition from supervised point constraints to self-supervised manifold regularization. Comprehensive experiments across multiple benchmarks show that our method achieves state-of-the-art performance in both continuous-time and standard optical flow estimation, demonstrating the effectiveness of the proposed learning paradigm.

[CV-75] ControlLight: Towards Controllable, Consistent, and Generalizable Low-Light Enhancement

Link: https://arxiv.org/abs/2605.25569
Authors: Yufeng Yang, Jianzhuang Liu, Jisheng Chu, Yuqi Peng, Xianfang Zeng, Jiancheng Huang, Shifeng Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 12 figures

Abstract:Existing deep learning-based low-light enhancement methods are typically trained on limited datasets with single enhancement targets, which restricts their generalization ability and controllability in real-world applications. To overcome these limitations, we propose ControlLight, a controllable, consistent, and generalizable framework for low-light enhancement. We first construct a large-scale dataset of real-world degraded images with continuous illumination-strength supervision. To further ensure consistent outputs under different control strengths, we introduce a misalignment-aware weighted flow matching loss that preserves image structure across continuous enhancement strengths. ControlLight allows users to edit real-world degraded low-light images toward satisfactory enhancement results by flexibly controlling the strength while preserving visual consistency and realism. Extensive experiments show that ControlLight achieves state-of-the-art performance against existing low-light enhancement approaches while demonstrating strong continuous controllability and generalization to real-world scenarios.

[CV-76] Rethinking Scribble-Guided Image Editing: Generalization, Instruction Adherence, and Multi-Tasking

Link: https://arxiv.org/abs/2605.25568
Authors: Mingyi Xu, Jinpeng Lin, Min Zhou, Tiezheng Ge, Ming Zeng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Scribble-guided image editing allows users to combine simple scribble annotations with text prompts to specify both where and how an image should be edited, enabling flexible interaction with precise spatial control. However, existing models still exhibit unstable performance under this paradigm, especially in multi-task scenarios. To improve performance, we conduct empirical studies using an open-source editing model and reveal an asymmetry in generalization: instruction-level generalization, including across editing tasks and from single-task to multi-task settings, is more challenging than image-domain generalization, such as from synthetic to real-world images or from mosaicked to regular images. This suggests that the primary bottleneck lies in insufficient learning for diverse editing instructions rather than in the image domain gap. Motivated by this insight, we propose three strategies: (a) a Coverage-then-Realism Curriculum, a two-stage pipeline that first builds large-scale synthetic, instruction-rich data for broad task supervision, then curates a small set of real-world data to refine generation realism; (b) Multi-Task Mosaicking, which constructs multi-task training samples by concatenating single-task examples at nearly zero cost while enabling the learned capability to generalize to non-mosaicked images; and (c) an Edit-Focused Loss, which leverages the changed regions between input and output images in synthetic data to focus training on edited regions, improving both learning efficiency and editing accuracy. With these strategies, we substantially improve both single-task and multi-task scribble-guided editing on the VIBE benchmark, achieving state-of-the-art results. We will publicly release our dataset and model.

[CV-77] CodecSplat: Ultra-Compact Latent Coding for Feed-Forward 3D Gaussian Splatting

Link: https://arxiv.org/abs/2605.25563
Authors: Pengpeng Yu, Runqing Jiang, Qi Zhang, Dingquan Li, Jing Wang, Yulan Guo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While feed-forward 3D Gaussian splatting reconstructs renderable Gaussian primitives from sparse context views without per-scene optimization, existing pipelines do not provide a compact scene representation for storage or transmission. A natural solution is to apply existing 3DGS compression methods to the generated Gaussian primitives. However, this approach operates on the final irregular 3D representation and is decoupled from the internal feature-to-Gaussian generation process, which limits compression efficiency. To address this, we introduce CodecSplat, an ultra-compact latent coding framework for feed-forward 3D Gaussian splatting. CodecSplat first encodes an intermediate 2D Gaussian-generation feature into an entropy-coded scene bitstream. At the decoder, the latent feature is reconstructed and used to predict depth and Gaussian parameters, which are then mapped to 3D Gaussian primitives. Note that, by integrating compression into the feed-forward Gaussian generation pipeline, CodecSplat avoids inefficient compression over irregular 3D Gaussian primitives and allows the codec to exploit the structured intermediate feature representation. We instantiate CodecSplat on a feed-forward Gaussian splatting backbone with depth-guided multi-view feature refinement and a hierarchical learned feature codec. On DL3DV and RealEstate10K datasets, CodecSplat achieves 23.56-26.36 dB and 24.76-27.05 dB PSNR with only 20.00-107.77 KiB and 3.37-12.51 KiB per scene, respectively. This is roughly one order of magnitude smaller than compressing feed-forward generated Gaussian primitives, while preserving controllable rate-distortion behavior.

[CV-78] Are We Overconfident in Models and Results for Semi-Supervised 3D Medical Image Segmentation? ICML2026

Link: https://arxiv.org/abs/2605.25561
Authors: Jun Li, Ziwei Qin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML 2026

Abstract:Semi-supervised learning has become a dominant paradigm for reducing annotation costs. However, we argue that the current progress is clouded by a twofold overconfidence problem. Algorithmically, mainstream pseudo-labeling frameworks often conflate prediction confidence with uncertainty, leading to severe confirmation bias. Strategically, since multiple benchmark datasets lack dedicated validation sets, some studies use the test set for validation as well, leading to inflated performance estimates. Subsequent methods, compelled to employ the same strategy to surpass reported SOTA, trigger an arms race of overfitting. This raises concerns that the impressive numerical gains in the community may reflect overfitting rather than genuine progress. Thus, we propose TCSeg, a tri-space calibrated segmentation framework founded on a principled dual-axis reliability assessment engine. It explicitly decouples confidence from uncertainty and uses this signal to detect and correct confirmation bias across feature, probability, and image spaces in a collaborative manner. Across three benchmark datasets, TCSeg consistently delivers strong performance under existing evaluation protocols. More importantly, we advocate that the community report final-checkpoint results under multiple-run protocols, thereby establishing more rigorous benchmarks with a more realistic perspective. Code will be available: this http URL.

[CV-79] ComPose: A Unified Completion-Pose Framework for Robust Category-Level Object Pose Estimation CVPR2026

Link: https://arxiv.org/abs/2605.25553
Authors: Huan Ren, Yihan Chen, Chuxin Wang, Nailong Liu, Wenfei Yang, Tianzhu Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted by CVPR 2026 (Oral, Best Paper Award Candidate). Project page is available at this http URL

Abstract:Category-level object pose estimation aims to predict the pose and size of arbitrary objects in specific categories. Existing methods struggle with the inherent incompleteness of observed point clouds, which limits their ability to capture complete object shapes for robust pose reasoning. While point cloud completion offers a promising solution, naively treating it as a separate preprocessing step for partial observations introduces compounding errors and additional computational overhead, ultimately hindering both accuracy and efficiency. To address these challenges, we propose ComPose, a novel unified framework that tightly integrates shape completion to provide complete geometric cues for enhanced pose estimation. At the core of ComPose is a keypoint-based progressive completion module, which recovers full shape representations by progressively predicting a sparse set of keypoints and their surrounding dense point sets, empowering the keypoints to capture holistic object geometries. A geometric relation encoding module further enriches keypoint features with both local and global geometric context. In addition, we introduce a novel geometric relation consistency loss to enforce structural alignment between observed keypoints and their predicted NOCS coordinates, ensuring globally coherent coordinate transformations. Extensive experiments on standard benchmarks demonstrate that our method outperforms state-of-the-art approaches without relying on category-level shape priors.

[CV-80] TapSampling: Inference-Time Sampling with a Task-Progress-Understanding Verifier for Robotic Manipulation ICML2026

Link: https://arxiv.org/abs/2605.25547
Authors: Sizhe Zhao, Shengping Zhang, Shuo Yang, Weiyu Zhao, Shuigen Wang, Xiangyang Ji
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2026. Project Page: this https URL

Abstract:Existing embodied control research demonstrates remarkable performance improvements by scaling training data and model size. We instead explore inference-time strategy as an alternative axis. Non-deterministic generative models, such as diffusion and autoregressive models, have been widely adopted in the field of embodied control. However, the single-shot inference paradigm limits their performance. In this paper, we propose TapSampling, a plug-and-play framework for inference-time sampling. First, we introduce an Action-VAE that represents actions in a low-dimensional latent space by mapping policy-generated initial actions into a compressed posterior distribution, from which any number of latent samples can be drawn and decoded into candidate actions that approximate the true action distribution. Second, we formulate action verification as task-progress outcome prediction, using the intrinsic sequential structure of robotic datasets to train a semantically grounded verifier for interpretable action selection. Furthermore, TapSampling is a policy-agnostic framework. Extensive experiments in both simulated and real-world environments demonstrate that our method substantially improves multiple generalist policies without further policy finetuning. Code and models are available at the project page.
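
The sample-then-verify loop described in the abstract can be illustrated with a generic best-of-N sketch. This is not the authors' implementation: the diagonal Gaussian posterior, the `decode` and `verifier` callables, and the name `sample_and_select` are all assumptions introduced here, and the Action-VAE and the task-progress verifier themselves are replaced by toy stand-ins.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

Action = List[float]


def sample_and_select(
    mean: Sequence[float],
    std: Sequence[float],
    decode: Callable[[List[float]], Action],
    verifier: Callable[[Action], float],
    num_samples: int = 8,
    rng: Optional[random.Random] = None,
) -> Tuple[Action, float]:
    """Draw latents from a diagonal Gaussian, decode them, keep the best-scored action.

    `decode` stands in for a latent-to-action decoder and `verifier` for a
    learned scorer; both are hypothetical placeholders.
    """
    rng = rng or random.Random(0)
    best_action: Optional[Action] = None
    best_score = float("-inf")
    for _ in range(num_samples):
        latent = [rng.gauss(m, s) for m, s in zip(mean, std)]
        action = decode(latent)
        score = verifier(action)
        if score > best_score:
            best_action, best_score = action, score
    return best_action, best_score


def toy_decode(latent: List[float]) -> Action:
    """Toy decoder: the latent is used directly as the action."""
    return list(latent)


def toy_verifier(action: Action) -> float:
    """Toy scorer: prefers actions close to 1.0."""
    return -abs(action[0] - 1.0)


best, best_score = sample_and_select(
    [0.0], [1.0], toy_decode, toy_verifier, num_samples=16, rng=random.Random(1)
)
single, single_score = sample_and_select(
    [0.0], [1.0], toy_decode, toy_verifier, num_samples=1, rng=random.Random(1)
)
print(best_score >= single_score)
```

Because the policy is only queried for the initial action and selection happens afterwards, a loop of this shape leaves the policy weights untouched, which is consistent with the policy-agnostic claim in the abstract.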

[CV-81] Tetris: Tile-level Sampling for Efficient and High-Fidelity Video Object Tracking

Link: https://arxiv.org/abs/2605.25538
Authors: Chanwut Kittivorawong, Alena Chao, Charlie Si, Alvin Cheung
Subjects: Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)
Comments:

Abstract:Track materialization converts raw video into reusable object tracks that downstream queries can run against without rerunning tracking, but extracting those tracks efficiently and with high fidelity remains expensive. Prior systems reduce cost through temporal frame sampling, erasing the inter-frame motion that fine-grained tracking requires. In stationary video, however, large portions of each frame contain no objects of interest, and the remaining regions tolerate different sampling rates. We present Tetris, a track-extraction system that decomposes videos into a tile-based polyomino data model, enabling fine-grained spatiotemporal pruning that reduces detector calls with minimal fidelity loss. Tetris runs three operators upstream of the user-provided detector: a classifier identifies relevant tiles and groups them into polyominoes, an integer linear program (ILP) prunes redundant polyominoes under a user-specified accuracy constraint, and a packer assembles the survivors into canvases that minimize detector calls. Across 7 stationary-video datasets, Tetris stays within a 5% tracking accuracy loss of a full-frame, every-frame reference pipeline, whereas prior systems exceed this bound on 3 of the 7 datasets. At this 5% bound, Tetris achieves up to 17.4x higher throughput than prior systems and up to 68.8x higher than the reference pipeline. The project page is at this https URL .

[CV-82] Location Prior Generation via Multi-Source Urban Data Fusion for Low-Altitude Air Mobility

Link: https://arxiv.org/abs/2605.25530
Authors: Xiang Xie, Xiaonan Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 7 figures, submitted to IEEE Journal of Internet of Things

Abstract:Building height, the third dimension (3D) of urban spatial data, is absent in over 95% of structures in global geospatial databases. For the emerging low-altitude economy, this data gap forces each aerial platform to rely on real-time onboard sensing rather than pre-computed 3D scene geometry. We present the Location Prior Generation Framework (LPGF), a multi-source data fusion pipeline that integrates Sentinel-2 imagery, UAV telemetry, vehicle GPS trajectories, and OpenStreetMap footprints into structured, reusable urban location priors. LPGF assigns building heights through a three-tier priority hierarchy: (1) explicit OSM height tags where available, (2) floor count multiplied by 3.2 m per story where recorded, and (3) building-type default heights otherwise, yielding a worst-case error of approximately 5.5 m. An optional shadow-based height estimation module (SHEM) is activated only when a four-criterion quality gate is satisfied; when any criterion fails, the pipeline routes to structured fallback. On the MiTra A50 Milan dataset, the quality gate correctly identified two imaging failure modes: sub-pixel shadows at 10 m GSD and ground shadow merging at 0.93 m GSD, producing a consistent 27-building prior in both cases. Tier 3 type-default heights were validated against manual floor counts (n=15), achieving MAE=3.07 m within the 5.0 m uncertainty bound. The framework demonstrates that structured, quality-gated fusion of universally available data streams can bootstrap 3D scene coverage for low-altitude urban operations.
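
The three-tier priority hierarchy in the abstract maps naturally onto a short fallback chain. The sketch below is illustrative only: the 3.2 m per story figure comes from the abstract, while the per-type default heights, the fallback value, and the function name `assign_height` are hypothetical placeholders, since the abstract does not list the actual defaults.

```python
from typing import Optional, Tuple

METERS_PER_STORY = 3.2  # stated in the abstract

# Hypothetical values for illustration; the abstract does not give them.
TYPE_DEFAULT_HEIGHT_M = {"residential": 9.6, "commercial": 12.8}
FALLBACK_HEIGHT_M = 6.4


def assign_height(
    osm_height_m: Optional[float],
    floors: Optional[int],
    building_type: Optional[str],
) -> Tuple[float, int]:
    """Return (height in meters, tier used) following the three-tier priority."""
    if osm_height_m is not None:  # Tier 1: explicit OSM height tag
        return osm_height_m, 1
    if floors is not None:  # Tier 2: floor count times story height
        return floors * METERS_PER_STORY, 2
    # Tier 3: building-type default height
    return TYPE_DEFAULT_HEIGHT_M.get(building_type, FALLBACK_HEIGHT_M), 3


print(assign_height(21.0, 5, "residential"))
print(assign_height(None, 5, None))
print(assign_height(None, None, "commercial"))
```

Returning the tier alongside the height keeps the provenance of each value visible, which matters because the three tiers carry very different uncertainty.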

[CV-83] ProSR: Process-Shaped Spatial Reasoning for Reliable Chain-of-Thought in VLMs

Link: https://arxiv.org/abs/2605.25524
Authors: Jiangyang Li, Cong Wan, Changjie Wu, Songlin Dong, Lingjun Zhang, Linzhe Shi, Xu Wang, Zhiheng Ma, Hang Zhang, Mu Xu, Yihong Gong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 6 figures

Abstract:Reliable spatial reasoning remains a core bottleneck for vision-language models (VLMs). Existing mainstream training paradigms for spatial reasoning largely rely on outcome alignment or process imitation, lacking explicit constraints on the reasoning process, and therefore struggle to ensure genuine visual dependence and stable reasoning trajectories. In this paper, we construct a high-quality CoT dataset covering diverse spatial phenomena and diagnose the model’s reasoning process, revealing two typical types of process degradation during reinforcement learning optimization: Spurious Grounding, which bypasses visual evidence, and Tail Instability, where uncertainty abnormally rises in the later stage of reasoning. To address these issues, we propose ProSR, a process-shaping optimization framework for spatial reasoning. Through a Counterfactual Invariance Penalty and a Tail Drift Penalty, ProSR extends the optimization objective from single answer correctness to two process-level dimensions: visual dependence and trajectory stability. Experiments on multiple complex and out-of-distribution spatial reasoning benchmarks show that ProSR improves answer accuracy while generating reasoning trajectories that are more stable and more dependent on visual evidence.

[CV-84] Cross-Stage Attention Multi-Expert Network for Radiologist-Inspired Breast Ultrasound Diagnosis

Link: https://arxiv.org/abs/2605.25518
Authors: Xinyang Zhai, Chong Yang, Ruizhi Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Breast ultrasound imaging is an important noninvasive method for early breast cancer diagnosis, but automatic benign/malignant classification remains challenging due to tumor heterogeneity, blurred boundaries, and data imbalance. To improve feature representation and classification accuracy, this paper proposes the Cross-Stage Attention Mixture-of-Experts Network (CSA-MoE-Net). It adopts a Cross-Stage Attention-enhanced ResNet-18 as the backbone, in which the Cross-Stage Attention module adaptively recalibrates multi-level features, thereby enhancing key tumor features and suppressing redundancy. A three-branch Mixture of Experts (MoE) Block learns complementary features from the Whole Tumor Image, Tumor Core, and Boundary, and an Adaptive Gating Network fuses them to capture morphological, textural, and contextual information. The fused features are denoted as Fused Expert Feature (FEF) in the architecture. Experiments on a balanced dataset of 2,129 breast ultrasound images show that, averaged over 20 independent runs, the model achieves an accuracy of 96.33%, precision of 94.09%, recall of 98.53%, F1-score of 96.25%, and AUC of 99.50%. Compared to the baseline ResNet-18, these metrics improve by 3.01, 0.70, 5.37, 2.98, and 5.42 percentage points, respectively. The proposed mechanism requires no invasive modification and can be seamlessly embedded into VGG-16, DenseNet-121, etc., yielding stable performance gains, thus providing reliable support for computer-aided diagnosis.

[CV-85] Metric–Phase Fields: Decoupling Distance and Sign for Thin-Structure Reconstruction from Unoriented Point Clouds

Link: https://arxiv.org/abs/2605.25503
Authors: Jiayi Kong, Xuhui Chen, Chen Zong, Fei Hou, Junhui Hou, Wenping Wang, Ying He
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Neural Signed Distance Functions (SDFs) excel at reconstructing watertight manifolds but fail on thin structures and open boundaries due to strict inside–outside constraints. Conversely, Unsigned Distance Fields (UDFs) accommodate general geometries but suffer from gradient singularities at the zero-level set, hindering optimization and extraction. We introduce Metric–Phase Fields (MPFs), a decoupled implicit representation that separates metric proximity from topological phase. Given an unoriented point cloud, MPFs learn (i) an unsigned metric field $r$ and (ii) a smooth phase field $\theta$, for which we derive a bounded phase indicator $P=\tanh(\beta\theta)$ that provides soft inside–outside cues where they are meaningful. We couple the two fields via a gated-metric formulation with a residual phase injection to obtain a signed implicit function with stable near-surface gradients. The phase coefficient $\beta$ is learnable, allowing MPFs to adaptively control the sharpness of the phase transition and the degree of saturation of the soft sign indicator. Experiments on both synthetic and scanned thin-shell and thin-plate shapes demonstrate that MPFs preserve thin and layered structures more faithfully than recent SDF-based methods, while also enabling more robust training and more reliable surface extraction than UDF-based approaches. Check out MPFs-GitHub (this https URL) for source code and test models.
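
To make the role of the phase coefficient concrete, the sketch below evaluates the phase indicator P = tanh(βθ) from the abstract together with its derivative dP/dθ = β(1 − P²), which follows from the standard derivative of tanh. Only the indicator itself is taken from the abstract; the gated-metric coupling and the residual phase injection are not specified there and are not modeled, and the function names are hypothetical.

```python
import math


def phase_indicator(theta: float, beta: float) -> float:
    """Soft sign indicator P = tanh(beta * theta), as written in the abstract."""
    return math.tanh(beta * theta)


def phase_slope(theta: float, beta: float) -> float:
    """dP/dtheta = beta * (1 - P^2): equals beta at theta = 0, shrinks as P saturates."""
    p = phase_indicator(theta, beta)
    return beta * (1.0 - p * p)


# A larger beta gives a sharper transition around theta = 0.
for beta in (1.0, 10.0, 100.0):
    print(beta, phase_indicator(0.05, beta), phase_slope(0.0, beta))
```

The slope at the zero crossing equals β, so a learnable β directly sets how abruptly the soft sign switches between the two sides.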

[CV-86] Full-4D: Generating Full-Scope 4D Scenes from a Single-View Video

Link: https://arxiv.org/abs/2605.25500
Authors: Tingxi Chen, Ke Hao, Yabo Chen, Zhengxue Cheng, Rong Xie, Li Song, Haibin Huang, Chi Zhang, Xuelong Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Generating 4D scenes from a single-view video is inherently ill-posed: a single viewpoint lacks the information needed to recover a complete, dynamic scene with full coverage. Existing methods are typically limited to monocular videos, simple 3D effects, or only small viewpoint perturbations around the original viewpoint, falling short of true 4D generation. Meanwhile, the lack of large-scale datasets capturing full-scope 4D scenes with synchronized multi-view videos further hinders progress in this direction. We propose a novel single-view video-to-4D framework that casts full-scope 4D generation as multi-view video synthesis followed by optimization-based 4D reconstruction from the generated views. To instantiate this formulation end-to-end, we make three key contributions. First, we introduce Real-MV-4D, a large-scale dataset of synchronized multi-view videos captured in diverse real-world environments to provide the 4D supervision. Second, we train a multi-view video diffusion model driven by a novel fused time (T)-view (V) attention mechanism that directly embeds geometric reprojection priors and explicit camera conditioning into its view-time interactions. Unlike basic feature fusion, this direct binding strictly aligns the generation process with physical 3D priors to produce a dense, synchronized $T \times V$ video grid. Third, rather than relying on non-interactive and inconsistent 2D video interpolations, we lift the synthesized multi-view videos into an explicit 4D representation (i.e., 4DGS), regularized by a Flow Matching Distillation loss that exploits the multi-view prior to improve novel-view rendering. Extensive experiments demonstrate that our method outperforms existing approaches in both visual fidelity and geometric consistency, enabling full-scope 4D scene generation from single-view videos.

[CV-87] RepSAM: Bridging Foundation Models to Robotic Vision via Representation-Guided Adaptation ECAI2026 IJCAI

Link: https://arxiv.org/abs/2605.25495
Authors: Wenhui Chu
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IJCAI-ECAI 2026 (Special Track on AI and Robotics). 8 pages, 4 figures, 12 tables

Abstract:Robotic perception in unstructured environments remains challenging despite the zero-shot capabilities of foundation models such as SAM. This work attributes performance degradation to non-uniform representation shifts across transformer layers: shallow layers exhibit substantial domain gaps (CKA < 0.5), whereas deep layers transfer effectively (CKA > 0.7). Based on this observation, we propose RepSAM, a representation-guided parameter-efficient fine-tuning (PEFT) framework for adapting foundation models to robotic vision. RepSAM employs a theoretically grounded CKA-guided rank allocation strategy combined with a multi-modal fusion module for robust handling of challenging robotic scenarios, including transparent objects and cluttered scenes. Experimental evaluation across six benchmarks and robotic manipulation tasks demonstrates that RepSAM achieves 97.9% of full fine-tuning performance (89.0% vs. 90.9% mIoU) while reducing trainable parameters by 158x (from 632M to 4.0M). RepSAM outperforms DoRA by 7.9% mIoU with just 4 hours of training on a single A100 GPU (a 96x reduction from full fine-tuning, which takes 384 GPU-hours). These improvements are statistically significant (p < 0.01) and translate to a 12.0% absolute improvement in robotic manipulation success rates over the LoRA (RGB) baseline.
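
For readers unfamiliar with CKA (centered kernel alignment), the sketch below computes the linear variant between two sets of features extracted for the same inputs. The abstract does not say which CKA variant or which features are used, so linear CKA is an assumption here, and the CKA-guided rank allocation strategy itself is not shown. The function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows are samples, columns are feature dimensions


def _center_columns(x: Matrix) -> Matrix:
    """Subtract the per-column mean so every feature has zero mean."""
    n = len(x)
    means = [sum(col) / n for col in zip(*x)]
    return [[v - m for v, m in zip(row, means)] for row in x]


def _cross_frobenius_sq(x: Matrix, y: Matrix) -> float:
    """Squared Frobenius norm of X^T Y."""
    total = 0.0
    for col_x in zip(*x):
        for col_y in zip(*y):
            dot = sum(a * b for a, b in zip(col_x, col_y))
            total += dot * dot
    return total


def linear_cka(x: Matrix, y: Matrix) -> float:
    """Linear CKA: ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered features."""
    xc, yc = _center_columns(x), _center_columns(y)
    numerator = _cross_frobenius_sq(xc, yc)
    denominator = math.sqrt(_cross_frobenius_sq(xc, xc) * _cross_frobenius_sq(yc, yc))
    return numerator / denominator


features_a = [[1.0, 2.0], [3.0, 1.0], [0.0, 5.0], [4.0, 4.0]]
features_b = [[3.0 * v for v in row] for row in features_a]  # same features, rescaled
print(linear_cka(features_a, features_b))
```

Linear CKA is invariant to isotropic rescaling of either feature set, which is one reason it is a common choice for comparing representations across layers or domains.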

[CV-88] Test-Time Self-Adaptive Conditioning for Stable Audio-Driven Talking-Head Generation

Link: https://arxiv.org/abs/2605.25488
Authors: Zhicheng Zhang, Lei Wang, Yu Zhang, Yongsheng Gao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: Research report

Abstract:Audio-driven talking-head generation has achieved remarkable progress with recent models such as AniTalker, FLOAT, and Sonic. Despite their success, most existing approaches rely on a single static reference image to condition the entire video generation process at inference stage. This static conditioning paradigm often creates a mismatch between fixed identity features and dynamically evolving facial motion, leading to identity drift, temporal inconsistency, and degraded perceptual quality. We introduce Test-Time Self-Adaptive Conditioning (TT-SAC), a parameter-free inference framework that enables pretrained talking-head generators to adapt their conditioning representations during inference without retraining, gradient updates, or additional supervision. Instead of treating the reference portrait as immutable, TT-SAC composes the generator with its encoder in a feedback loop: the generator’s own outputs are re-encoded to construct a refined conditioning representation that better aligns with the temporal dynamics of the synthesized sequence. A single adaptation step approximates a self-consistent equilibrium of the generative process, stabilizing identity and motion across time. We further provide theoretical analysis showing that test-time conditioning adaptation reduces feature variance and improves generative stability under mild Lipschitz assumptions, while exhibiting a principled bias-variance tradeoff that governs the optimal strength of adaptation. Extensive experiments on state-of-the-art talking-head generators and benchmark datasets demonstrate consistent improvements in lip-sync accuracy, temporal coherence, identity preservation, and perceptual fidelity. TT-SAC offers a model-agnostic and training-free strategy for enhancing generative video models, establishing test-time conditioning adaptation as an effective mechanism for stabilizing audio-driven portrait animation.

[CV-89] MAIL: Multi-Modal Bi-directional Agent Layer for Vision-Language Models

Link: https://arxiv.org/abs/2605.25479
Authors: Kaixiang Chen, Pengfei Fang, Hui Xue
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Adapting large vision-language models (VLMs) such as CLIP to downstream tasks remains challenging, as full fine-tuning is computationally prohibitive and prone to overfitting in low-data regimes. Parameter-efficient fine-tuning (PEFT) alleviates these issues with lightweight prompt- or adapter-based modules, and cross-modal coupling has proven especially effective by strengthening interactions between vision and language. However, existing coupling mechanisms predominantly rely on external auxiliary modules, leading to indirect, coarse-grained interactions that are structurally decoupled from the original VLM and thus limit representational expressiveness. In this paper, we propose Multi-Modal Interactive Agent Layer (MAIL), a PEFT paradigm that embeds cross-modal coupling directly into the intrinsic computation modules of VLMs. MAIL freezes the backbone and inserts lightweight agent layers after core modules, such as LayerNorm, to approximate the parameter updates induced by full fine-tuning. To couple visual and textual streams at this level, we introduce a bottleneck-based text-to-image bridge that jointly optimizes paired agent layers across modalities, coordinating the adaptation of corresponding computation modules. We further present MAIL++, which enables bidirectional cross-modal exchange through a meta agent layer, a meta-text bridge, and a meta-image bridge. At inference time, all agent layers are re-parameterized into the frozen backbone, preserving the original computational efficiency. Extensive experiments on few-shot image classification and few-shot universal cross-domain retrieval demonstrate that MAIL and MAIL++ consistently outperform state-of-the-art PEFT methods.

[CV-90] MetaphorVU: Towards Metaphorical Video Understanding ICML2026

Link: https://arxiv.org/abs/2605.25461
Authors: Zhuoqun Li,Boxi Cao,Guiping Jiang,Fangrui Lv,Ruotong Pan,Jianan Wang,Xiangyu Wu,Hongyu Lin,Yaojie Lu,Yong Du,Ruyin Jia,Liyan,Tingting Gao,Han Li,Xianpei Han,Le Sun
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2026 spotlight

Abstract:Metaphorical videos are prevalent across various real-world scenarios to convey complex ideas, and understanding them typically requires high-order cognitive capabilities. The lack of systematic studies on metaphorical video understanding not only constrains the real-world applicability of MLLMs but also impedes the thorough assessment of their high-order cognitive capabilities. To bridge this gap, we propose MetaphorVU-Bench, the first systematic and comprehensive benchmark dedicated to metaphorical video understanding. Through experiments, we find current MLLMs struggle with accurate metaphorical video understanding, lagging far behind human level, primarily due to defective cross-domain mapping. Motivated by this finding, we construct a metaphor knowledge graph as mapping augmentation and propose MetaphorBoost, an inference-time enhancement framework achieving consistent performance improvement. Our benchmark, analysis, and method provide useful insights and a foundation for future research on advancing MLLMs.

[CV-91] Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion CVPR2026

Link: https://arxiv.org/abs/2605.25449
Authors: Ting-Hsuan Chen,Ying-Huan Chen,Tao Tu,Jie-Ying Lee,Cho-Ying Wu,Fangzhou Lin,Hengyuan Zhang,David Paz,Xinyu Huang,Yuliang Guo,Yu-Lun Liu,Yue Wang,Liu Ren
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026. Project page: this https URL

Abstract:Generating complete digital twins from videos requires precise camera control, global scene coverage, and strict spatial-temporal consistency constraints that remain challenging for perspective video generators due to their limited field of view (FoV). Their narrow FoV forces long or multi-view trajectories, amplifying cross-view inconsistency and temporal drift. We argue that 360° video generation offers a natural solution: panoramic coverage simplifies trajectory design and provides a strong global context for maintaining coherence. We introduce Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion, a controllable 360° video generation framework that synthesizes high-fidelity videos from sparse 360° inputs. The key idea is an explicit 3D Cache, reconstructed from the input, which serves as a geometric scaffold for any user-defined camera path. This allows the diffusion model to focus on photorealistic texture refinement while the 3D Cache enforces global geometric consistency. Experiments show that Pantheon360 achieves superior visual quality and unmatched geometric coherence, enabling reliable and flexible 360° scene generation for downstream simulation and digital-twin applications.

[CV-92] Enhancing Single-Image Facial Demorphing using Multimodal Large Language Models

Link: https://arxiv.org/abs/2605.25442
Authors: Nitish Shukla,Arun Ross
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Face recognition systems are increasingly vulnerable to morphing attacks, where a composite image is crafted to match multiple identities, enabling unauthorized access and identity fraud. Existing detection methods identify morphed images but cannot recover constituent images or identities, limiting their forensic utility. This paper presents a novel reference-free facial demorphing framework that leverages Multimodal Large Language Models (MLLMs) to guide a coupled diffusion-based reconstruction process. Our key innovation lies in extracting semantic embeddings from intermediate MLLM layers to condition the demorphing, providing high-level reasoning about facial attributes and identity cues that complement low-level pixel information. We formulate demorphing as a coupled conditional generation problem, where both constituent faces are synthesized jointly through a denoising diffusion model operating directly in the RGB domain, ensuring inter-identity consistency while preserving fine-grained perceptual details. Unlike prior approaches that rely on compressed latent representations or assume identity overlap between training and testing sets, our method bypasses lossy text generation-reencoding cycles by directly utilizing MLLM hidden states as conditioning signals, enabling the denoising network to attend to subtle visual cues such as hair, background, and facial textures. Ablation studies further reveal that middle MLLM layers encode more identity-discriminative representations, RGB-domain demorphing outperforms latent-space approaches by 30–40% at strict operating points, and full MLLM embeddings provide substantial advantages over raw ViT features through enhanced semantic structuring from multimodal pretraining.

[CV-93] Does Seeing More Mean Knowing More? Mono-Anchored Advantage Normalization for Multi-Source Visual Reasoning

Link: https://arxiv.org/abs/2605.25437
Authors: Fanhu Zeng,Zhicong Luo,Zefan Wang,You Li,Chi Chen,Maosong Sun
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: preprint

Abstract:Visual reasoning through reinforcement learning with verifiable rewards (RLVR) has achieved remarkable progress. However, when dealing with multi-source inputs, existing approaches tend to treat them as a mere accumulation of information, lacking explicit mechanisms to distinguish whether integrating additional sources yields information gain or introduces interference. Therefore, they struggle to effectively model dynamic interaction when integrating multiple sources, particularly when they differ significantly in physical properties and semantics, e.g., infrared and depth, leading to inferior performance to mono-source reasoning when a certain source holds the dominant signal. To address this issue, we propose MARS, a novel mono-anchored multi-source reasoning framework that models each visual modality as an independent information source. Specifically, by treating mono-source rewards as dynamic anchors, our method explicitly incorporates the information gain introduced by multi-source fusion into advantage normalization and adaptively emphasizes mutual promotion between sources while suppressing potential noise or conflicts during RLVR. From theoretical analysis, our method effectively quantifies information gain introduced by multi-source integration in gradient estimation, enabling consistent modality regulation. Empirical results also show impressive 3.2% and 4.9% performance gains on GRPO and DAPO across diverse datasets, confirming effectiveness of our method.
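
For context on the "advantage normalization" that the abstract modifies: in GRPO-style training, advantages are commonly obtained by standardizing rewards within a group of responses sampled for the same prompt. The sketch below shows only that standard baseline. It does not reproduce the mono-anchored variant proposed in the paper, whose exact form the abstract does not give. The function name, the `eps` value, and the use of the population standard deviation are assumptions made for illustration.

```python
import statistics


def group_normalized_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampled group (GRPO-style baseline).

    `eps` is a hypothetical stabilizer that avoids division by zero when
    every response in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std: an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Two correct and two incorrect responses in one group.
advantages = group_normalized_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)
```

Responses above the group mean receive positive advantages and those below receive negative ones; a group with identical rewards yields all-zero advantages and therefore contributes no learning signal.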

[CV-94] Binding Visual Features Point by Point

Link: https://arxiv.org/abs/2605.25427
Authors: Udith Haputhanthri,Declan Campbell,Rim Assouel,Jonathan D. Cohen,Taylor W. Webb
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Despite success on standard benchmarks, vision language models display persistent failures on tasks involving processing of multi-object scenes, including many tasks that are relatively easy for humans. Recent work has found that these failures may stem from a basic inability to accurately bind object features in-context, a challenge that is referred to as the “binding problem” in cognitive science and neuroscience. The human visual system is thought to solve this binding problem via serial processing, attending to individual objects one at a time so as to avoid interference from other objects. Recent work has proposed “pointing” – the use of explicit spatial coordinates to refer to objects – as an analogous solution for vision language models, and found that it improves performance on challenging multi-object tasks. However, it is unclear why (i.e., on a mechanistic or representational level) this approach improves performance, and how directly this relates to serial processing in human vision. Here, we investigate this question. We find that learning to point-via-text induces an internal visual search routine, and we characterize the mechanisms that support this procedure. We also find that pointing behavior can be generalized to new tasks via fine-tuning, and that doing so eliminates binding errors and enables compositional generalization. These results provide a proof-of-principle that serial processing can solve the binding problem for vision language models just as it does for biological vision.

[CV-95] Learning View-Dependent Splatting Kernels SIGGRAPH2026

Link: https://arxiv.org/abs/2605.25426
Authors: Huakeng Ding,Zhanpeng Liu,Fan Pei,Kun Zhou,Hongzhi Wu
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to SIGGRAPH 2026. 10 pages, 8 figures

Abstract:We present a differentiable framework to automatically learn view-dependent 2D kernels in a splatting-based pipeline to improve reconstruction quality and representation efficiency for novel 3D view synthesis. Our volumetric primitive is defined as a bounding ellipsoid and a 3D-kernel latent vector. We first learn a projection network to output a 2D-kernel latent, taking the attributes of the ellipsoid and the 3D-kernel latent as input. Next, the result is sent to a decoder to produce a radially symmetric 2D kernel in terms of Mahalanobis distance, bounded by the projected ellipsoid. The neural networks along with per-primitive attributes are jointly optimized. The effectiveness of our approach is demonstrated on standard benchmarks, comparing favorably against state-of-the-art techniques on both analytical and learned kernels. Finally, we extend the idea to learn general 2D kernels for 2D splatting as well as image representation.
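
To make "a radially symmetric 2D kernel in terms of Mahalanobis distance, bounded by the projected ellipsoid" concrete, here is a minimal sketch. The kernel value depends only on the squared Mahalanobis distance to the splat center and is zero outside a cutoff. The Gaussian `profile` is a stand-in for the paper's learned decoder, and `cutoff_sq` is a hypothetical stand-in for the ellipsoid bound; neither is taken from the paper.

```python
import math


def mahalanobis_sq(point, center, cov):
    """Squared Mahalanobis distance of a 2D point under a 2x2 covariance."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Closed-form inverse of the 2x2 covariance matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    return (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))


def splat_kernel(point, center, cov,
                 profile=lambda d2: math.exp(-0.5 * d2), cutoff_sq=9.0):
    """Kernel that depends only on Mahalanobis distance, zero past the cutoff."""
    d2 = mahalanobis_sq(point, center, cov)
    return 0.0 if d2 > cutoff_sq else profile(d2)


cov = ((4.0, 0.0), (0.0, 1.0))  # ellipse stretched along x
print(splat_kernel((2.0, 0.0), (0.0, 0.0), cov))
print(splat_kernel((0.0, 1.0), (0.0, 0.0), cov))
```

Both query points lie at the same Mahalanobis distance from the center, so they receive the same kernel value even though their Euclidean distances differ. Swapping `profile` for another one-dimensional function changes the kernel shape without touching the geometry.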

[CV-96] Generating 3D models from sketches of human faces using a combined approach of Convolutional Neural Networks, Procedural Modeling and Contour Mapping

Link: https://arxiv.org/abs/2605.25418
Authors: Nancy Iskander
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: A thesis submitted in conformity with the requirements for the degree of Master of Science in Computer Science Graduate Department of Computer Science University of Toronto

Abstract:Generating 3D models from face sketches is an active topic of research in Computer Graphics due to its potential to tremendously facilitate the modeling of faces for both professional 3D artists and novices. Motivated by the observation that facial expressions are responsible for significantly altering and shaping the contours in our faces, we combine both expression detection and 3D model generation in our approach. The result is a novel approach to generating 3D models from sketches which relies on three components: Convolutional Neural Networks, a parametric 3D face model (Valley Girl), and Active Snake Contours. For the first time in the literature, CNNs are trained (using our own generated dataset) to detect the expression in the given sketch through detecting the active FACS Action Units. The expression is then duplicated on Valley Girl to obtain a 3D model with a similar expression. Active Snake Contours are then used to find the transforms needed to close the gaps between that model and the given sketch.

[CV-97] MTLLFM: Multimodal-Temporal Laughter Localization: UR-FUNNY-Temporal and SMILE-Temporal Benchmarks with an Adaptive Multimodal Fusion Model CVPR2026

Link: https://arxiv.org/abs/2605.25409
Authors: Eyal Hanania,Nadav Kirsch,Daniel Arkushin,Jonathan Benvenisti,Amos Bercovich,Elie Zemmour,Sahar Froim
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the Workshop on Affective Behavior Analysis in-the-wild, CVPR 2026

Abstract:Detecting laughter in video is essential for affective computing and narrative understanding, yet existing approaches treat it as coarse clip-level classification, failing to capture precise temporal boundaries of brief, transient laughter events. We address this gap with two complementary contributions. First, we introduce UR-FUNNY-Temporal and SMILE-Temporal, fully annotated temporal laughter datasets extending two widely-used humor benchmarks. Our annotations cover over 11,053 videos (78.8 hours) and provide precise onset/offset boundaries for each laughter event, along with rich metadata distinguishing speaker vs. audience laughter, modality dominance (acoustic, visual, or both), and intensity levels. Second, we propose a lightweight weakly-supervised framework for temporal laughter localization. Our architecture combines fixed HuBERT and MAE encoders with temporal softmax pooling and adaptive modality gating, learning fine-grained temporal grounding from clip-level labels without requiring frame-level annotations during training. Experiments across three datasets demonstrate that our approach substantially outperforms multimodal foundation models including Gemini 3 Flash, achieving 99% F1 and 68.1% localization precision on sports broadcast data. Ablations validate each architectural component. Furthermore, our precise temporal tags improve downstream laughter reasoning by 227% on CIDEr, enabling GPT-3.5 to outperform GPT-4o. The code, UR-FUNNY-Temporal and SMILE-Temporal datasets are publicly available at this https URL.
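
The "temporal softmax pooling" step can be pictured as follows: per-frame scores are turned into softmax weights over time, and the clip-level score is their weighted average, so training on clip labels still pushes weight toward the frames that matter. This is a generic sketch of that idea, not the paper's implementation; the function name and the `temperature` parameter are assumptions.

```python
import math


def temporal_softmax_pool(frame_scores, temperature=1.0):
    """Pool per-frame scores into one clip score using softmax weights over time."""
    scaled = [s / temperature for s in frame_scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    weights = [e / total for e in exps]
    clip_score = sum(w * s for w, s in zip(weights, frame_scores))
    return clip_score, weights


# One frame with a strong response dominates the pooled score.
clip_score, weights = temporal_softmax_pool([0.0, 0.0, 5.0])
print(clip_score, weights)
```

The weights double as a coarse localization signal: frames with high weight are the ones the clip-level prediction relied on, which is what makes this kind of pooling a natural fit for weak supervision.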

[CV-98] Towards Active Real-to-Twin Inspection: A New Paradigm for Zero-Shot Anomaly Detection

Link: https://arxiv.org/abs/2605.25407
Authors: Jiaxuan Liu,Yunkang Cao,Yufeng Chen,Chunyang Li,Yuhuan Du,Hui Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 4 figures, accepted to IEEE-CYBER 2026, Florence, Italy

Abstract:The deployment of zero-shot anomaly detection (AD) in embodied industrial inspection is severely bottlenecked by its reliance on passive, fixed-viewpoint 2D imagery. Such formulations inherently fail to accommodate the active, dynamic observations required in real-world environments. To break this limitation, we introduce Real-to-Twin Anomaly Detection, a novel task that evaluates physical observations directly against geometrically matched CAD Digital Twins. To tackle this new task, we propose AVATAR, a framework designed to learn robust semantic alignment between Real and Digital Twins. By bridging benign Sim2Real domain gaps using only defect-free pairs, AVATAR effectively transforms CAD priors into dynamic, anomaly-free references. This elegant formulation enables the model to localize diverse anomalies in a zero-shot manner as unalignable deviations, eliminating the need for defect annotations. Extensive experiments demonstrate that AVATAR substantially outperforms adapted state-of-the-art baselines, exhibiting exceptional robustness to severe viewpoint variations. The code and dataset will be made publicly available.

[CV-99] Anatomy-Anchored Self-Supervision: Distilling Vision Foundation Models for Invariant Ultrasound Representation MICCAI2026

Link: https://arxiv.org/abs/2605.25402
Authors: Chunzheng Zhu,Yijun Wang,Jianxin Lin,Feng Wang,Hongwei Wang,Lei Zhao,Shengli Li,Kenli Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: MICCAI 2026 Accepted Paper; Anatomy-Anchored Ultrasound Self-Supervision

Abstract:The self-supervised pre-training paradigm has gained increasing prominence for learning transferable representations in medical imaging, yet existing methods for ultrasound (US) images operate at the image or frame level, overlooking the anatomical context needed for clinically aligned representation learning. In this work, we propose ANAUS, an anatomy-anchored ultrasound self-supervision framework that shifts representation learning from generic visual regions to clinically meaningful anatomical structures. Utilizing a learnable latent prompt engine alongside a one-time domain adaptation on existing public image–mask pairs, we empower the LP-SAM module to achieve annotation-free anatomy delineation at scale. Building upon this anatomical grounding, we propose a dual-policy self-supervised learning paradigm consisting of inter-view semantics-aware anatomy-separating alignment and contextual core-region prediction to enhance representation learning. Specifically, the former enforces feature invariance within identical anatomical regions while promoting discriminability across distinct structures; the latter compels the model to reconstruct corrupted regions, thereby capturing fine-grained structural details. Extensive evaluations on six public datasets demonstrate that ANAUS consistently outstrips current state-of-the-art methods while maintaining the computational efficiency essential for clinical deployment. Code is available at this https URL.

[CV-100] Subspace-Guided Semantic and Topological Invariant Registration for Annotation-Free Ultrasound Plane Quality Control MICCAI2026

Link: https://arxiv.org/abs/2605.25396
Authors: Chunzheng Zhu,Jianxin Lin,Feng Wang,Cheng Jiang,Guanghua Tan,Zhenyu Zhou,Shengli Li,Kenli Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: MICCAI 2026 Accepted Paper; Subspace-Guided Registration for Ultrasound Quality Control

Abstract:Reliable quality control (QC) of ultrasound images is essential for both real-time acquisition guidance and retrospective clinical audit, yet existing approaches rely heavily on per-plane annotations, or employ pseudo-labeling prone to systematic bias under spatial deformations inherent in clinical acquisition. We present STRIQ, a registration-driven framework that recasts annotation-free US plane quality control as a subspace-guided consistency measurement problem. Specifically, STRIQ introduces a Latent Registration Aligner (LRA) to establish hierarchical feature space correspondences between query images and variance-driven anchors, which are autonomously distilled from unlabeled data via a variance spectrum criterion to serve as structurally stable prototypes. To further disambiguate anatomical planes and mitigate negative knowledge transfer, we propose an Orthogonal Knowledge Subspace (OKS) module. The OKS decomposes plane-specific representations into mutually orthogonal subspaces, enabling fine-grained expert collaboration while preventing inter-plane interference, ensuring that the quality metric is grounded in principled subspace proximity. Extensive experiments on the in-house US4QA and public CAMUS datasets demonstrate that STRIQ achieves state-of-the-art correlation with clinical quality scores, establishing a new paradigm for annotation-free, real-time reliable ultrasound quality control. Our code is available at this https URL.

[CV-101] Weakly Supervised Camouflaged Object Detection Based on the SAM Model and Mask Guidance

Link: https://arxiv.org/abs/2605.25385
Authors: Xia Li,Xinran Liu,Lin Qi,Junyu Dong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages

Abstract:Camouflaged object detection (COD) from a single image is a challenging task due to the high similarity between objects and their surroundings. Existing fully supervised methods require labor-intensive pixel-level annotations, making weakly supervised methods a viable compromise that balances accuracy and annotation efficiency. However, weakly supervised methods often experience performance degradation due to the use of coarse annotations. In this paper, we introduce a new weakly supervised approach for camouflaged object detection to overcome these limitations. Specifically, we propose a novel network, MGNet, which tackles edge ambiguity and missed detections by utilizing initial masks generated by our custom-designed Cascaded Mask Decoder (CMD) to guide the segmentation process and enhance edge predictions. We introduce a Context Enhancement Module (CEM) to reduce missed detections, and a Mask-guided Feature Aggregation Module (MFAM) for effective feature aggregation. For the weak supervision challenge, we propose BoxSAM, which leverages the Segment Anything Model (SAM) with bounding-box prompts to generate pseudo-labels. By employing a redundant processing strategy, high-quality pixel-level pseudo-labels are provided for training MGNet. Extensive experiments demonstrate that our method delivers competitive performance against current state-of-the-art methods.

[CV-102] CollectionLoRA: Collecting 50 Effects in 1 LoRA via Multi-Teacher On-Policy Distillation

Link: https://arxiv.org/abs/2605.25378
Authors: Fangtai Wu,Hailong Guo,Shijie Huang,Jiayi Song,Yubo Huang,Mushui Liu,Zhao Wang,Yunlong Yu,Jiaming Liu,Ruihua Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Customized image editing aims to equip pre-trained diffusion models with specific visual effects using limited paired data, typically via Low-Rank Adaptation (LoRA). As the number of desired effects grows, storing and dynamically loading numerous effect LoRAs significantly increases deployment overhead. Furthermore, current pipelines typically cascade these effect LoRAs with acceleration modules for fast generation, which triggers severe parameter interference and results in concept bleeding and style degradation. We propose CollectionLoRA, a multi-teacher on-policy distillation framework capable of distilling the concepts of up to 50 different effect LoRAs along with few-step generation capabilities into a single LoRA. This fundamentally resolves the feature interference issue and significantly reduces deployment costs. Specifically, the method introduces (i) a Probabilistic Dual-Stream Routing mechanism that enables the model to randomly switch between data sources during training, effectively enhancing its generalization in unseen scenarios; (ii) an Asymmetric Orthogonal Prompting strategy to achieve concept isolation within the prompt space; (iii) a Coarse-to-Fine Distillation Objective to mitigate the distribution gap between the teacher and student models. Extensive evaluations show that CollectionLoRA distills all customized effects and few-step generation into a single LoRA, reducing deployment overhead while achieving concept fidelity comparable to or better than independently trained teacher models.

[CV-103] Adversarial Orthogonal Disentanglement for LVLM Hallucination Mitigation

Link: https://arxiv.org/abs/2605.25377
Authors: Ruoxi Cheng,Haoxuan Ma,Zhengfei Hai,Yiyan Huang,Ranjie Duan,Tianle Zhang,Xu Yang,Ziyi Ye,Xingjun Ma
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Vision-Language Models (LVLMs) have advanced multimodal understanding, yet their reliability is limited by hallucination, where generated content conflicts with visual facts. Existing mitigation methods either rely on costly external interventions, such as instruction tuning and retrieval, or use internal mechanisms that remain limited by flawed attention weights and entangled hidden representations. We propose Adversarial Orthogonal Disentanglement (AOD), a latent geometric framework for mitigating LVLM hallucinations. AOD learns a hallucination-related direction through a minimax objective: a classifier concentrates hallucination signals into the projected component, while an adversary removes them from the orthogonal residual space via a Gradient Reversal Layer. The learned direction enables a training-free dual-forward-pass contrastive decoding strategy that suppresses hallucinations while preserving general capabilities. Experiments on three LVLMs across four hallucination and four utility benchmarks show that AOD consistently outperforms strong baselines. It improves POPE accuracy by over 6% on average, boosts AMBER by 6%, and maintains strong performance on utility tasks such as MMMU. Further analysis shows robust transfer across datasets, suggesting that AOD captures general hallucination-related biases rather than dataset-specific artifacts. Our source code and datasets are available at this https URL.
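
The geometric split the abstract relies on, a "projected component" along a direction plus an "orthogonal residual", can be sketched in a few lines. This shows only the decomposition of one hidden vector given a direction; how the direction is learned (the minimax objective and the Gradient Reversal Layer) and how it is used in contrastive decoding are not reproduced here. The function name and the toy vectors are assumptions.

```python
import math


def split_along_direction(hidden, direction):
    """Split `hidden` into its projection onto `direction` and the residual.

    The two returned parts sum to `hidden`, and the residual is orthogonal
    to `direction`.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    coeff = sum(h * u for h, u in zip(hidden, unit))
    projected = [coeff * u for u in unit]
    residual = [h - p for h, p in zip(hidden, projected)]
    return projected, residual


# Toy 3-dimensional hidden state and a hypothetical learned direction.
projected, residual = split_along_direction([3.0, 4.0, 0.0], [0.0, 2.0, 0.0])
print(projected, residual)
```

In this picture, a classifier would read only `projected` and an adversary would read only `residual`, which is one way to interpret the abstract's description of concentrating hallucination signal in one part and removing it from the other.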

[CV-104] Physics-Aware 3D Gaussian Editing for Driving Scene Generation

Link: https://arxiv.org/abs/2605.25373
Authors: Feng Zhou,Jian Zhang,Yuhang Sun,He Wang,Qiong Wen,Debao Kong,Tieru Wu,Rui Ma
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D Gaussian Splatting (3DGS) has shown great potential in autonomous driving simulation and data generation, enabling photorealistic reconstruction and flexible scene manipulation. However, existing 3DGS scene editing methods have limited support for road geometry editing (e.g., inserting speed humps or sunken roads), and generally do not couple such edits with plausible vehicle-road interaction dynamics. Such editing is essential for generating training data under extreme driving scenarios or evaluating system reliability under these road irregularities. Moreover, many optimization-based methods require minutes of per-edit refinement, while existing efficient alternatives mainly focus on appearance-level or object-level manipulation rather than physics-aware road irregularity editing. To address these limitations, we propose RoVES, a Road-and-Vehicle Editing System for physics-aware 3D Gaussian editing in driving scenes. RoVES enables single-image-driven road geometry insertion and couples the edited road profile with a 4-DOF half-car vehicle dynamics model to achieve physics-aware vehicle pose correction in vertical displacement and pitch. RoVES inserts road elements in a one-shot, optimization-free pipeline (1.84s), and the full pipeline (including color transfer and vehicle-dynamics-based pose correction) completes in 6.24s; it edits dynamic vehicles via pose editing and corrects poses frame-by-frame to approximate dynamics-consistent vertical displacement and pitch responses. Experiments on the Waymo dataset show that RoVES provides practical efficiency and competitive visual consistency for physics-aware driving scene generation.

[CV-105] Can MLLMs Reason Beyond Language? VisReason: A Comprehensive Benchmark for Vision-Centric Reasoning ACL2026

Link: https://arxiv.org/abs/2605.25364
Authors: Longteng Guo,Yifan Wang,Pengkang Huo,Tailai Chen,Yuze Wu,Jing Liu,Xinxin Zhu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACL 2026 Findings, resources released at this https URL

Abstract:Recent multimodal large language models (MLLMs) achieve strong performance on visual reasoning benchmarks, yet it remains unclear to what extent such performance reflects reasoning directly grounded in visual evidence. We introduce VisReason, a benchmark for vision-centric reasoning in everyday scenarios where perception and inference are tightly coupled. VisReason contains 1,505 questions across 10 categories spanning perceptual, structural, and conceptual reasoning. Our evaluation shows that VisReason poses a qualitatively different challenge from existing benchmarks, exposing substantial gaps between humans and current MLLMs and revealing limited benefits from test-time reasoning strategies. VisReason offers a focused diagnostic for evaluating vision-centric reasoning beyond language.

[CV-106] MARVEL: Universal Murray's Law-informed Vessel Tree Segmentation and Topology Estimation

Link: https://arxiv.org/abs/2605.25363
Authors: Yi Zhou,Thiara Sana Ahmed,Jacqueline Chua,Meng Wang,Qinrong Zhang,Alejandro F. Frangi,Huazhu Fu,Jun Cheng,Leopold Schmetterer,Bingyao Tan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 18 figures

Abstract:Vascular circulation follows fundamental biophysical principles that optimize mass transport and metabolic energy expenditure, which can be effectively modeled by Murray’s law. However, contemporary deep learning methods for vascular segmentation often neglect these biophysical constraints. This leads to physiologically implausible branching and misclassified vascular trees, rendering automated segmentation results unreliable for downstream clinical tasks such as blood flow simulation or disease quantification. In this paper, we introduce MARVEL (Universal MurrAy’s law-infoRmed Vessel sEgmentation and topoLogy estimation), a backbone-agnostic framework that integrates biophysical priors into vascular tree extraction. MARVEL combines per-pixel supervision with explicit radius predictions to enforce local bifurcation constraints derived from an empirical width-exponent mapping. We implement these constraints as differentiable regularizers during training to guide models toward physiologically consistent reconstructions. We evaluate MARVEL on eight public datasets across multiple vascular modalities and segmentation backbones. Results demonstrate MARVEL’s superior performance in segmentation accuracy, topological consistency, and physiological plausibility. By converting segmented masks into graph-based hemodynamic simulations, we demonstrate that MARVEL preserves the subtle pathological narrowing and topological connectivity required to distinguish hypertensive from normotensive eyes. Results show that MARVEL significantly improves the classification of hypertension via arteriovenous pressure differences in the eye (p < 0.001), outperforming baseline models in both topological consistency and clinical predictive value.
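
As a reminder of what a Murray's-law bifurcation constraint looks like: in its classic form the law states that the cube of the parent vessel radius equals the sum of the cubes of the child radii. The sketch below measures the relative deviation from that relation at a single bifurcation. It uses the classic exponent of 3 as a default; the paper's "empirical width-exponent mapping" and its differentiable regularizers are not specified in the abstract and are not reproduced here. The function name is hypothetical.

```python
def murray_residual(r_parent, r_children, exponent=3.0):
    """Relative deviation from Murray's law at one bifurcation.

    Classic form: r_parent**3 == sum(r_child**3 for each child).
    Returns 0.0 when the relation holds exactly.
    """
    parent_term = r_parent ** exponent
    children_term = sum(r ** exponent for r in r_children)
    return abs(parent_term - children_term) / parent_term


# A symmetric bifurcation that satisfies the cubic law:
# each child radius is r_parent / 2**(1/3).
r_parent = 1.0
r_child = r_parent / 2 ** (1.0 / 3.0)
print(murray_residual(r_parent, [r_child, r_child]))

# Children as wide as the parent violate the law.
print(murray_residual(1.0, [1.0, 1.0]))
```

A penalty of this kind, summed over detected bifurcations, is one plausible way to turn the law into a training signal once per-branch radii are predicted.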

[CV-107] PDEInvBench: A Comprehensive Dataset and Design Space Exploration of Neural Networks for PDE Inverse Problems

Link: https://arxiv.org/abs/2605.25353
Authors: Divyam Goel,Nithin Chalapathi,Sanjeev Raja,Aditi S. Krishnapriyan
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Computational Physics (physics.comp-ph)
Comments: 37 total pages, 13 main pages, 20 figures, 8 tables. Published in Transactions on Machine Learning Research (TMLR), 2026

Abstract:Inverse problems in partial differential equations (PDEs) involve estimating the physical parameters of a system from observed spatiotemporal solution fields. Neural networks are well-suited for PDE parameter estimation due to their capability to model function-to-function space transformations. While existing benchmarks of machine learning methods for PDEs primarily focus on the forward problem, there are no similar comprehensive studies and benchmark datasets on PDE inverse problems, i.e., mapping solution fields to underlying physical parameters. We fill this gap by introducing PDEInvBench, a comprehensive benchmark dataset consisting of numerical simulations for both time-dependent and time-independent PDEs across a wide range of physical behaviors and parameters. Our dataset includes evaluation splits that assess performance in both in-distribution and various out-of-distribution settings. Using our benchmark dataset, we comprehensively explore the design space of neural networks for PDE inverse problems along three key dimensions: (1) optimization procedures, analyzing the role of supervised, self-supervised, and test-time training objectives on performance, (2) problem representations, where we study the value of architectural choices with different inductive biases and various conditioning strategies, and (3) scaling, which we perform with respect to both model and data size. Our experiments reveal several practical insights: 1) neural networks perform best with a two-stage training procedure: initial supervision with PDE parameters followed by test-time fine-tuning using the PDE residual, 2) incorporating PDE derivatives as input features consistently improves accuracy, and 3) increasing the diversity of initial conditions in the training data yields greater performance gains than expanding the range of PDE parameters. We make our dataset and codebase publicly available.

[CV-108] ERNIE-Image Technical Report

Link: https://arxiv.org/abs/2605.25347
Authors: Jiaxiang Liu,Zhida Feng,Pengyu Zou,Zhenyu Qian,Tianrui Zhu,Jun Xia,Yuehu Dong,Yanzheng Lin,Honglin Xiong,Anqi Chen,Yunpeng Ding,Jinghui Duan,Lin Gao,Chao Han,Tiechao He,Jiakang Hu,Ranjun Hua,Xueming Jiang,Qingli Kong,Yuting Lei,Tianyu Li,Yunlin Liu,Changling Liu,Yaxin Liu,Yi Liu,Xuguang Liu,Xiaolong Ma,Yan Pan,Yiran Ren,Nan Sheng,Yu Sun,Siyang Sun,Yixiang Tu,Yang Wan,Huanai Wang,Siqi Wang,Yang Wu,Youzhi Yang,Xiaowen Yang,Jianwen Yang,Yehua Yang,Quanwen Zhang,Xinmin Zhang,Haoxin Zhang,Xiang Zhang,Jun Zhang,Qian Zhang,Qiao Zhao,Qi Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:We introduce ERNIE-Image, an open-source text-to-image generation model built upon an 8B single-stream DiT architecture. ERNIE-Image aims to bridge the gap between current open-source models and leading closed-source systems through more effective mining of large-scale pre-training data and improved supervision quality throughout training. During pre-training, we adopt a bottom-up data construction pipeline that combines fine-grained image categorization, rich caption annotation, aesthetic assessment, and hierarchical sampling. This strategy reduces data noise while preserving long-tail concepts and detailed real-world knowledge, providing a stronger foundation for complex generation tasks. In the post-training stage, we use a top-down data construction pipeline for high-demand scenarios, diversify prompt annotations to better match real user inputs, and apply a stabilized DPO strategy to align the model with human aesthetic preferences. We further train ERNIE-Image-Turbo for efficient 8-NFE generation and propose MT-DMD to mitigate capability drift during distillation. To make the model easier to use in practical scenarios, we equip it with a lightweight Prompt Enhancer that expands concise user intents into structured visual descriptions. In addition, we develop ERNIE-Image-Aes, an industrial-grade aesthetic model, together with ERNIE-Image-Aes-1K, a human-annotated benchmark for realistic aesthetic evaluation. Extensive qualitative and quantitative experiments show that ERNIE-Image achieves leading performance among open-source models and approaches top-tier commercial models in instruction following, text rendering, and aesthetic quality. We release the trained models and aesthetic resources to facilitate further academic research and technical progress in the AIGC community.

[CV-109] Depth Peeling for High-Fidelity Gaussian-Enhanced Surfel Rendering

Link: https://arxiv.org/abs/2605.25345
Authors: Keyang Ye,Hongzhi Wu,Kun Zhou
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Novel view synthesis has been significantly advanced by NeRFs and 3D Gaussian Splatting (3DGS), which require ordering volumetric samples or primitives for correct color blending. While the recent Gaussian-Enhanced Surfels (GES) enable high-performance, sort-free rendering, they suffer from aliasing artifacts and suboptimal reconstruction. To address these limitations, we propose DP-GES, a novel representation that augments opaque surfels with semi-transparent boundaries and leverages Depth Peeling to establish accurate per-pixel ordering. This design enables sort-free Gaussian splatting with correct transmittance modulation, effectively eliminating aliasing and popping artifacts while facilitating a fully differentiable joint optimization. Extensive experiments demonstrate that our method achieves superior reconstruction quality and compares favorably against state-of-the-art techniques across a wide range of scenes.

[CV-110] Toward Native Multimodal Modeling: A Roadmap

Link: https://arxiv.org/abs/2605.25343
Authors: Siyu An,Junru Lu,Junnan Dong,Qiufeng Wang,Yinghui Li,Weizhi Fei,Zichao Yu,Zheng Yuan,Biao Liu,Haopeng Wang,Renzhao Liang,Yixuan Yang,Yunhang Shen,Bo Ke,Keyu Chen,Linhao Luo,Difan Zou,Xiao Huang,Di Yin,Ruizhi Qiao,Xing Sun
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 52 pages, 5 figures, 3 tables, ~300 references

Abstract:Multimodal modeling represents a vital step from modality-agnostic reasoning toward world modeling. While early approaches predominantly rely on late fusion that assembles encoders and frozen language backbones with output heads, recent efforts have shifted the paradigm toward native multimodal modeling (NMM), which integrates modalities intrinsically for superior multimodal performance. Despite its potential, the design space of native architectures remains insufficiently defined. In this paper, we present the community with a formalized roadmap for this transition. Specifically, we formally define architectural nativity, distinguishing mid-fusion and early-fusion from non-native paradigms. We further organize the existing native models through the lens of input-output duality into three categories: (i) Multi-to-Text for cross-modal comprehension with text-only output; (ii) Multi-to-Target for scenario-oriented generation, e.g., image, audio and video generation; and (iii) Multi-to-Multi for unified modeling with symmetric input-output. We deliver a comprehensive and industrial-grade investigation into the transition toward the definitive NMM framework, where understanding and generation seamlessly coexist within a unified transformer paradigm. From an industrial perspective, we systematically unpack the end-to-end pipeline, from architectural coordination and massive data curation to full-stack training recipes, inference deployment, and comprehensive evaluation for truly native modeling.

[CV-111] Dual-Pathway Geometry-Aware MLLM for Spatial Intelligence

Link: https://arxiv.org/abs/2605.25334
Authors: Yufei Zheng,Xuhan Zhu,Zide Liu,Chunpeng Zhou,Chenfeng Wang,Yongchao Xu,Yunnan Wang,Jiawei Liu,Pengfei Yu,Wei Zhai,Yang Cao,Zheng-Jun Zha
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Spatial understanding of the physical world from 2D visual inputs hinges on two complementary forms of geometric knowledge: holistic 3D structural perception and fine-grained metric scale estimation. Existing multimodal large language models (MLLMs) typically address only one facet, ingesting either depth maps or point clouds as additional model inputs, which incurs substantial computational overhead and inherits the generalization limitations of upstream prediction models. We propose GAMSI, a dual-pathway Geometry-Aware MLLM for Spatial Intelligence that takes only RGB images as input while internalizing both forms of geometric prior within a unified autoregressive backbone. Specifically, we introduce Metric-Structure Decoupled Queries (MSDQ) which employ two groups of learnable queries to respectively extract dense metric signals and sparse structural cues from the shared visual context, with a task-decoupled attention mask further preventing the two pathways from contaminating each other. Building on this, an Expert-Guided Visual Grounding (EVG) module projects the aggregated cues back to frame-level visual features and aligns them with vision foundation models, which serve purely as training-time supervision, rather than as model inputs. We further build a multi-task spatial instruction-tuning dataset (MTS) comprising 152,776 samples spanning 13 task types and three visual modalities, consolidated from six public datasets. Trained with a two-stage curriculum, GAMSI achieves state-of-the-art performance on seven spatial intelligence benchmarks.

[CV-112] Teaching Video Generators to Remember: Eliciting Dynamic Memory for Out-of-Sight State Evolution

Link: https://arxiv.org/abs/2605.25333
Authors: Tianshuo Xu,Yichen Xie,Depu Meng,Chensheng Peng,Quentin Herau,Bo Jiang,Yihan Hu,Wei Zhan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Video world models should maintain evolving states when evidence is unobserved, yet current generators often freeze hidden states upon interruption. This is not simply a capacity problem: pretrained video diffusion transformers already possess KV-cache mechanisms capable of non-local retrieval, but they are rarely trained to use them as dynamic memory. We introduce ReMind, a framework eliciting dynamic memory behavior via memory-oriented data, event-aware training, and cache adaptation. Organized around a taxonomy of 100+ dynamic events, we build a camera-annotated training mixture combining VLM-filtered real videos, generated hard dynamics, synthetic camera loops, and memory-interruption augmentations. Each clip is converted into a frame graph with protected anchors, degraded intervals, and explicit temporal gaps. A node-structured curriculum, including node-drop, noisy memory, frontier continuation, and reference-cache training, forces the model to retrieve relevant past states across interruptions rather than relying solely on local continuity. PM-RoPE, an elegant camera-phase RoPE extension, unlocks spatiotemporal retrieval at a single-attention cost while preserving pretrained pathways. ReMind achieves the best overall scores on STEVO-Bench and recovery tasks. Furthermore, general image-to-video evaluations confirm this curriculum avoids catastrophic forgetting. We will open-source our code, data, and models.

[CV-113] DIVA: Harnessing the Representation Divergence in Unified Multimodal Models for Mutual Reinforcement ICML2026

Link: https://arxiv.org/abs/2605.25328
Authors: Renjie Lu,Xulong Zhang,Xiaoyang Qu,Shangfei Wang,Jianzong Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

Abstract:Unified multimodal models (UMMs) built on a single architecture have shown impressive performance in both understanding and generation. We identify a fundamental challenge that lies in the inductive biases induced by distinct supervision signals: the generation branch prefers high-fidelity, fine-grained representations capable of reconstruction, while the understanding branch favours semantically discriminative embeddings that remain invariant to task-irrelevant factors. Consequently, optimizing these complementary but non-equivalent objectives within a monolithic backbone leads to mutual impairment instead of enhancement. In this paper, we first analyze the root cause of this interference in unified backbones and reveal a complementary structure in their internal representations. Motivated by this observation, we propose DIVA, a self-improved post-training framework that transforms the representation divergence into interior synergy. By explicitly factorizing the visual representation into shared and unique components based on two complementary information flows, DIVA enables both the understanding and generation branches to achieve beneficial transfer while using mutual information estimation to protect the integrity of unique information from cross-flow interference. Despite its generality, our method consistently achieves improvements across visual understanding (+7.82%) and generation (+8.46%). The official code is available at: this https URL.

[CV-114] Perceive-then-Plan: Layout-as-Policy for Monocular 3D Scene Layout Estimation

Link: https://arxiv.org/abs/2605.25326
Authors: Junwei Zhou,Yu-Wing Tai
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages

Abstract:Building structured 3D scene layouts from a single image requires reconciling visual observations with physical and spatial constraints, a challenge that is difficult to address with direct prediction alone. In this work, we formulate monocular 3D layout estimation as a perceive-then-plan problem with vision-language models, where a Perceiver first grounds the 3D objects and then a Planner iteratively refines the scene hypothesis through actions that improve physical plausibility while preserving consistency with the input image. We propose Layout-as-Policy (LaP), which casts the planning stage as a policy learning problem: 3D layouts are represented as structured states, and refined via discrete actions such as translation, rotation, and rescaling. Starting from an observation-aligned initialization with the geometry-enhanced Perceiver, the LaP Planner is trained to produce action sequences that progressively resolve geometric inconsistencies and enforce realistic spatial relations. To enable effective learning, we combine supervised trajectory initialization with preference-based optimization, allowing the model to learn corrective behaviors without requiring explicit reward engineering. This formulation transforms layout estimation from a one-shot prediction task into an iterative refinement process, enabling better handling of global constraints and complex object interactions. Experiments demonstrate that our approach produces layouts that are more physically coherent and better aligned with visual observations, while naturally supporting downstream tasks such as scene editing and manipulation.

[CV-115] Stabilizing Streaming Video Geometry via Dynamic Feature Normalization

Link: https://arxiv.org/abs/2605.25308
Authors: Xiaoyang Lyu,Muxin Liu,Xiaoshan Wu,Ruicheng Wang,Yi-Hua Huang,Yang-Tian Sun,Shaoshuai Shi,Xiaojuan Qi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 9 figures, page: this https URL

Abstract:Consistent 3D geometry estimation from streaming RGB input is crucial for real-world applications such as autonomous driving, embodied AI, and large-scale reconstruction. While modern monocular geometry foundation models achieve strong single-image accuracy, they exhibit severe temporal inconsistency on continuous input, notably dominated by scale–shift drifting. Through targeted empirical analysis, we trace this instability to its root cause: fluctuations in latent feature statistics, whose mean and variance directly determine the predicted depth’s scale and shift. Building on this insight, we introduce Dynamic Feature Normalization (DyFN), a lightweight, causal recurrent module that dynamically and robustly modulates feature statistics to maintain stable geometry over time. We adapt powerful pretrained monocular geometry models for streaming by finetuning only DyFN, a mere 2% additional parameters, while keeping the backbone frozen, thereby achieving temporal consistency without compromising single-image accuracy. Extensive experiments across four benchmarks show that DyFN effectively eliminates temporal artifacts such as disjointed layering and positional jitter, and achieves state-of-the-art temporal stability, improving over prior streaming methods by up to 14% and even outperforming heavier non-causal video baselines. Project Page: this https URL
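The link between feature statistics and scale-shift drift can be made concrete with a small sketch. The code below is not DyFN, which per the abstract is a learned recurrent module; it substitutes a fixed exponential moving average as a hypothetical stand-in, and the function name and `momentum` parameter are assumptions. It only shows the general idea of causally re-normalizing each frame's features to smoothed statistics.

```python
import math

def stabilize_stream(frames, momentum=0.9, eps=1e-6):
    """Re-normalize each frame's features to causally smoothed mean and variance.

    frames: list of per-frame feature lists. Only past and current frames are
    used, so the procedure is causal.
    """
    run_mean = run_var = None
    out = []
    for feats in frames:
        mean = sum(feats) / len(feats)
        var = sum((f - mean) ** 2 for f in feats) / len(feats)
        if run_mean is None:
            run_mean, run_var = mean, var  # first frame initializes the state
        else:
            run_mean = momentum * run_mean + (1 - momentum) * mean
            run_var = momentum * run_var + (1 - momentum) * var
        # Map the frame's own statistics onto the smoothed ones.
        scale = math.sqrt((run_var + eps) / (var + eps))
        out.append([(f - mean) * scale + run_mean for f in feats])
    return out

# Same underlying pattern, but with a different scale and shift per frame.
base = [0.0, 1.0, 2.0, 3.0]
frames = [[a * b + s for b in base] for a, s in [(1.0, 0.0), (2.0, 5.0), (0.5, -3.0)]]
stable = stabilize_stream(frames)
```

In this toy input the per-frame means jump between 1.5, 8.0 and -2.25, while the re-normalized frames stay close to the running mean.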

[CV-116] Recursive Class Connectivity Classification (R3C) Applied to Binary Image Segmentation for Improved Infant Fingerprint Enhancement

Link: https://arxiv.org/abs/2605.25307
Authors: Joao Leonardo Harres Dall Agnol,Luiz Fernando Puttow Southier,Jefferson Tales Oliva,Marcelo Teixeira,Rodrigo Mineto,Marcelo Filipa,Dalcimar Casanova,Erick Oliveira Rodrigues
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Image enhancement plays a crucial role in infant fingerprint matching, as child-specific characteristics such as smaller finger dimensions and thinner ridge structures often degrade image quality during acquisition. To address these limitations, enrollment typically depends on specialized high-resolution scanners, which most existing enhancement methods are not designed to support. Consequently, identification rates for children remain significantly lower than those achieved with adult fingerprints. This study introduces Recursive Class Connectivity Classification (R3C), a novel framework that iteratively refines binary segmentation outputs from existing enhancement methods by extending ridge structures. R3C does not require modifications to the underlying classifier and operates without training data, which is not currently available for infant fingerprints. Instead, the method improves segmentation by repeatedly feeding the classified image back into the classification process, while combining each intermediate segmentation with the original input image. Experiments conducted on three fingerprint datasets using four different enhancement classifiers show that R3C can increase the True Acceptance Rate (TAR) by up to 4% for children and over 40% for newborns, compared to using the enhancement methods alone. A qualitative analysis further demonstrates that R3C reconnects fragmented ridge patterns, improving the visual quality of segmentation. Because it functions independently of the enhancement method used, R3C provides a flexible and broadly applicable solution for improving binary segmentation.
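The feedback loop described above (classify, combine the result with the original image, classify again) can be sketched on a 1D intensity profile. This is only an illustration under stated assumptions: `classify` is a hypothetical stand-in for an existing enhancement classifier, and combining by element-wise maximum is an assumed choice, since the abstract does not say how the intermediate segmentation and the input are merged.

```python
def classify(img, thresh=0.5):
    """Hypothetical stand-in classifier: threshold a locally smoothed profile."""
    n = len(img)
    out = []
    for i in range(n):
        left, right = img[max(i - 1, 0)], img[min(i + 1, n - 1)]
        smoothed = 0.5 * img[i] + 0.25 * (left + right)
        out.append(1 if smoothed > thresh else 0)
    return out

def recursive_refine(img, iterations=3):
    """Feed the segmentation back into the classifier, merged with the input."""
    mask = classify(img)
    for _ in range(iterations):
        # Assumed combination rule: element-wise maximum of input and mask.
        combined = [max(p, m) for p, m in zip(img, mask)]
        new_mask = classify(combined)
        if new_mask == mask:  # stop once the segmentation no longer changes
            break
        mask = new_mask
    return mask

# A ridge with a one-pixel break in the middle.
profile = [0.1, 0.9, 0.9, 0.05, 0.9, 0.9, 0.1]
first_pass = classify(profile)
refined = recursive_refine(profile)
```

On this profile the single pass leaves the ridge split in two, while the recursive version closes the break because the already-detected neighbours reinforce the weak pixel on the second pass.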

[CV-117] When Interpretability Becomes a Liability: Adversarial Attacks on CBM Concept Layers CVPR2026

Link: https://arxiv.org/abs/2605.25304
Authors: Aditya Sridhar
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026 (Findings). 9 pages, 6 figures

Abstract:Concept Bottleneck Models (CBMs) have emerged as a cornerstone approach for interpretable machine learning, providing human-understandable intermediate representations through explicit concept activations. However, this interpretability fundamentally introduces a critical, previously unexplored attack surface: the concept bottleneck layer itself. We present a comprehensive, systematic study of concept-level adversarial vulnerabilities in CBMs, revealing that targeted, minimal perturbations operating on input pixels can induce catastrophic misclassification by manipulating semantic representations. We develop a rigorous theoretical framework to quantify concept-space robustness, establishing novel metrics that expose the vulnerability landscape of these architectures. Our extensive analysis on the CUB-200-2011 dataset demonstrates that standard CBMs exhibit severe susceptibility to concept-level manipulation. To address this critical weakness, we introduce SPECTRA (Semantic Perturbation-based Concept Training for Robustness against Attacks), a principled stability regularization defense. SPECTRA effectively hardens the semantic representation space, increasing the minimal perturbation norm required for a successful attack from 0.46 to over 4,200, rendering targeted concept manipulation computationally prohibitive. Furthermore, SPECTRA preserves baseline classification accuracy to within 2.2%. By establishing concept-level attacks as a fundamentally distinct threat model, this work opens a new research frontier at the intersection of interpretable machine learning and adversarial robustness.

[CV-118] A Principled Self-Referenced Early Stopping Approach for Deep Image Prior

Link: https://arxiv.org/abs/2605.25299
Authors: Chaoyan Huang,Cheng-Han Huang,Ismail R. Alkhouri,Rongrong Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 35 pages, 10 figures, 14 tables

Abstract:Recently, Deep Image Prior (DIP) has demonstrated strong capabilities for solving inverse imaging problems (IIPs) by optimizing a randomly initialized convolutional neural network in a training-data-free regime. However, DIP suffers from overfitting to noisy measurements due to network over-parameterization, making early stopping (ES) essential. The most successful ES method tracks fluctuations in the running variance of the network output to detect overfitting. However, in many applications, these fluctuations may appear prematurely, leading to unstable reconstructions. In this paper, we first show that nearly optimal DIP early stopping can be achieved when two independent noisy copies of the degraded image are available. Motivated by this observation, and since obtaining two fully independent copies is infeasible, we propose an overfitting detection framework based on constructing pseudo self-referenced images, resulting in three IIP-specific algorithms. Our approach is further supported by theoretical results on single-reference validation, pseudo-validation estimation, and the impact of shared noise. Across different IIPs, ranging from natural image restoration to medical image reconstruction, and under varying noise levels and noise types, our methods consistently outperform existing DIP early stopping approaches, all without requiring an accurate estimate of the noise level.

[CV-119] Geometry-Aware Image Flow Matching

Link: https://arxiv.org/abs/2605.25294
Authors: Junho Lee,Kwanseok Kim,Joonseok Lee
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in generative models highlight the power of geometry-aware modeling in manifold-constrained settings. Yet, for natural images, the field remains confined to Euclidean assumptions, failing to exploit the potential of intrinsic geometric structures within the data. In this work, we investigate the geometry of natural images and observe that semantic information is predominantly encoded in directional components, while norm components can be approximated by the global average. This property holds across both RGB and latent spaces, suggesting that natural images can be effectively modeled on a hypersphere. Building on this finding, we introduce Spherical Optimal Transport Flow Matching (SOT-CFM), which utilizes angular distance, and Spherical Flow Matching (SFM), which constrains dynamics directly on the manifold. Our experiments demonstrate that these geometry-aware methods achieve superior performance against Euclidean baselines. Ultimately, this work provides a novel perspective that bridges the gap between Riemannian manifold-based modeling and natural image generation.
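The direction/norm split and the angular distance mentioned above can be illustrated with standard spherical geometry. The sketch below is not the paper's SOT-CFM or SFM implementation; it only shows the ingredients such methods plausibly build on: splitting a vector into a unit direction and a norm, the geodesic (angular) distance, and spherical linear interpolation as a path that stays on the hypersphere. All function names are assumptions.

```python
import math

def normalize(v):
    """Split a vector into its unit direction and its norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v], n

def angular_distance(a, b):
    """Geodesic distance between two unit vectors on the hypersphere."""
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    return math.acos(dot)

def slerp(a, b, t):
    """Point at fraction t along the great-circle path from a to b.

    Assumes a and b are unit vectors that are not antipodal.
    """
    theta = angular_distance(a, b)
    if theta < 1e-8:
        return list(a)
    s = math.sin(theta)
    wa = math.sin((1 - t) * theta) / s
    wb = math.sin(t * theta) / s
    return [wa * x + wb * y for x, y in zip(a, b)]

noise_dir, _ = normalize([3.0, 0.0])
data_dir, data_norm = normalize([0.0, 4.0])
midpoint = slerp(noise_dir, data_dir, 0.5)
```

Unlike a straight Euclidean line between the two points, every intermediate point on this path keeps unit norm, which is the property a manifold-constrained flow relies on.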

[CV-120] Neuromorphic LiDAR-based Birds Eye View Object Detection using Energy-efficient Spiking Neural Networks

Link: https://arxiv.org/abs/2605.25293
Authors: Sambit Mohapatra,Senthil Yogamani,Heinrich Gotzig,Patrick Mader
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:Autonomous driving perception demands accurate and efficient processing of three-dimensional sensor data under strict power constraints. Traditional convolutional neural networks achieve strong detection accuracy but are computationally intensive, limiting their suitability for deployment on resource-constrained neuromorphic platforms. Spiking neural networks offer a compelling alternative through event-driven sparse computation, yet their application to complex real-world perception tasks such as three-dimensional object detection remains limited. In this work, we propose an end-to-end spiking encoder-decoder network for object detection in bird’s eye view representations of LiDAR point clouds, trained using surrogate gradient backpropagation. We train two variants: a membrane potential variant that reads continuous neuron state at the output stage for maximum accuracy, achieving 92.05 / 87.04 / 86.51 AP at $\mathrm{IoU}=0.5$ (Easy/Moderate/Hard), and a fully binary spiking variant that operates exclusively on spike trains at every layer for direct neuromorphic deployment. We evaluate four input spike encoding strategies and demonstrate that allowing the network to learn spike representations directly from data outperforms hand-crafted Poisson, latency, and z-axis encoding schemes on the KITTI benchmark, where sequential frames are unavailable and the BEV input is presented repeatedly across timesteps as a proxy for temporal streaming. A block-wise energy analysis demonstrates a $3.33\times$ reduction in synaptic operation energy over an equivalent CNN under conservative loop-based operation. Together, these results demonstrate the viability of spiking neural networks for accurate and energy-efficient neuromorphic perception in autonomous driving.
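For readers unfamiliar with the terms above, the following minimal sketch shows what a hand-crafted Poisson (rate) encoding and a leaky integrate-and-fire neuron look like. It is a textbook-style illustration, not the paper's network; the weight, leak and threshold values are arbitrary assumptions.

```python
import random

def poisson_encode(intensity, timesteps, rng):
    """Rate coding: emit a spike at each timestep with probability `intensity`."""
    return [1 if rng.random() < intensity else 0 for _ in range(timesteps)]

def lif_neuron(spikes, weight=0.6, leak=0.9, threshold=1.0):
    """Leaky integrate-and-fire neuron driven by a binary input spike train."""
    v, out = 0.0, []
    for s in spikes:
        v = leak * v + weight * s      # leak, then integrate the weighted input
        if v >= threshold:
            out.append(1)              # fire
            v = 0.0                    # and reset the membrane potential
        else:
            out.append(0)
    return out

rng = random.Random(0)
bright = poisson_encode(0.8, 200, rng)  # a high-intensity BEV cell
dark = poisson_encode(0.1, 200, rng)    # a low-intensity BEV cell
response = lif_neuron(bright)
```

The fully binary variant in the abstract corresponds to passing only the 0/1 outputs between layers, whereas the membrane potential variant would read the continuous `v` at the output stage.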

[CV-121] DeltaCam: Differential Intrinsic Camera Modeling for Video Generation

Link: https://arxiv.org/abs/2605.25266
Authors: Debabrata Mandal,Zhihan Peng,Yujie Wang,Praneeth Chakravarthula
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Incorporating camera intrinsics into video generation models offers a principled way to control not only scene dynamics but also the imaging process that governs visual appearance. Prior work has primarily focused on extrinsic control, such as camera pose and motion, while treating intrinsic camera parameters as implicit or fixed. A key bottleneck is the lack of large-scale video datasets with accurate and diverse temporally varying camera metadata, which makes learning absolute camera parameterizations difficult. As a result, current models struggle to incorporate photographic camera behavior, including depth-of-field transitions, exposure variations, lens distortions, and color processing, in a controllable and temporally consistent manner. We introduce DeltaCam, a video diffusion framework that models camera behavior through $\Delta$-parameterized neural camera adaptors, operating on relative changes in camera motion and intrinsics instead of absolute states. By learning this differential formulation from synthetic video data, we mitigate reliance on precise real-world camera labels and enable smooth, consistent control over imaging factors such as focal length, aperture, ISO, color temperature, and lens distortion. We extend this framework to real-world footage through two mechanisms: finetuning the controls on real image-metadata pairs for precise shot matching, and extracting disentangled embeddings for implicit video-to-video style transfer without requiring explicit camera parameters. By effectively separating scene content from intrinsic imaging behavior, DeltaCam enables camera-consistent video generation and editing operations that are difficult to achieve with existing models. Ultimately, our results establish a practical and scalable approach for bridging synthetic control and real-world photographic emulation.

[CV-122] Semantics-Guided Multimodal Masked Autoencoder Pretraining for 3D BEV Object Detection ICRA2026

Link: https://arxiv.org/abs/2605.25262
Authors: Prabuddhi Wariyapperuma,Rajitha de Silva,Marc Hanheide,Thomas Bohné,Leonardo Guevara
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the ICRA 2026 Workshop on Semantics for Reliable Robot Autonomy (SRRA) as a lightning talk and poster

Abstract:Accurate 3D bird’s-eye view (BEV) object detection is essential for autonomous driving, and depends strongly on effective multimodal representations from complementary sensors such as cameras and LiDAR. Multimodal masked autoencoders have shown strong potential for learning such representations for downstream 3D BEV object detection. However, existing methods typically apply uniform random masking to camera and LiDAR inputs, treating all regions equally, and learn representations only through masked reconstruction. We propose a semantics-guided multimodal masked autoencoder framework that introduces semantic information during pretraining through two separate components: (i) semantics-guided LiDAR voxel masking, which preserves semantically important LiDAR regions more strongly, and (ii) an auxiliary point-wise LiDAR semantic decoder branch that injects semantic guidance in addition to reconstruction. On BEVFusion 3D object detection, our semantics-guided pretraining strategy improves performance on the nuScenes mini validation set compared to the standard UniM2AE baseline: semantics-guided LiDAR voxel masking yields +1.49% mean Average Precision (mAP) and +1.66% nuScenes Detection Score (NDS), while decoder-side point semantic supervision yields +1.39% mAP and +3.22% NDS over the baseline.
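The contrast between uniform random masking and semantics-guided masking can be sketched as follows. The abstract does not give the actual masking rule, so the linear rule below, the `base_ratio` and `strength` parameters, and the per-voxel importance scores in [0, 1] are all hypothetical.

```python
import random

def masking_probabilities(importance, base_ratio=0.7, strength=0.5):
    """Lower the masking probability for semantically important voxels.

    importance: per-voxel score in [0, 1]; with strength=0 this reduces to
    uniform random masking at `base_ratio`.
    """
    return [base_ratio * (1.0 - strength * s) for s in importance]

def sample_mask(importance, rng, base_ratio=0.7, strength=0.5):
    """Draw a boolean mask (True = voxel is masked out) from those probabilities."""
    probs = masking_probabilities(importance, base_ratio, strength)
    return [rng.random() < p for p in probs]

rng = random.Random(0)
# First half: voxels on semantically important objects; second half: background.
importance = [1.0] * 1000 + [0.0] * 1000
mask = sample_mask(importance, rng)
kept_important = 1000 - sum(mask[:1000])
kept_background = 1000 - sum(mask[1000:])
```

With these assumed settings, important voxels are masked with probability 0.35 and background voxels with probability 0.7, so more of the important regions remain visible to the encoder.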

[CV-123] Guess the Unified Model: How Much Can We Recover from Generated Images?

Link: https://arxiv.org/abs/2605.25254
Authors: Jasin Cekinmez,Ryo Mitsuhashi,Addison J. Wu,Yida Yin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:With unified model-generated images now widespread online, attributing their model of origin offers a path toward transparency and deeper insight into the characteristic behaviors of individual models. Prior work has explored provenance in LLM-generated text, diffusion model images, and datasets, but the separability of unified model-generated images remains an underexplored area. We address this gap by examining separability across corruptions, domains, and prompt languages using images generated by seven unified models. We show that model attribution is highly feasible, as our model achieves near-perfect accuracy with around 20K images per model. Corruptions and structural perturbations have only a modest effect on attribution performance, and cross-domain generalization reveals that semantic content contributes to separability but is not the dominant signal. Finally, we observe that for most models, prompt language attribution is around chance levels, suggesting minimal language-specific visual signatures. These findings highlight consistent model-specific visual characteristics in unified models' outputs and open new directions for tracing and auditing generative image pipelines.

[CV-124] Multi-view Consistent 3D Gaussian Head Avatars without Multi-view Generation CVPR2026

Link: https://arxiv.org/abs/2605.25220
Authors: Aviral Chharia,Fernando De la Torre
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
Comments: CVPR 2026; Project Website: this https URL

Abstract:High-fidelity 3D Gaussian head avatar generation is critical for applications such as AR/VR, telepresence, and digital humans. Existing methods depend on multi-view datasets, 3D captures, or intermediate 2D view synthesis. In contrast, we learn both conditional and unconditional 3D head models from randomly sampled 2D images alone, without using multi-view data, 3D supervision, or intermediate view generation. We introduce MVCHead, a single-shot state space model that enforces multi-view consistency (MVC) directly in the 3D representation while regressing 3D Gaussians under these constraints. At its core, we propose a Hierarchical State Space (HiSS) block that progressively refines Gaussians from coarse to fine, while capturing long-range dependencies. Within each HiSS block, we modify Mamba’s standard unidirectional scan with the proposed Hierarchical Bi-directional State Scan (HiBiSS) that aligns recurrence with the axes along which multi-view inconsistencies are strongest. Finally, we design an SE(3) Multi-view Critic that judges whether a set of self-renders arises from a single underlying 3D configuration, rewarding cross-view pixel alignment without observing real multi-view pairs. MVCHead achieves state-of-the-art perceptual quality, surpasses prior methods in both texture and geometric consistency, and maintains comparable shape consistency. To demonstrate scalability, we release FaceGS-10K, the first large-scale dataset of ready-to-use 3D Gaussian head assets for training and evaluation of 3D head models. Project Page and code: this https URL

[CV-125] Baton: Explicit Semantic Blueprints for Joint Video-Audio Generation

Link: https://arxiv.org/abs/2605.25195
Authors: Shuyuan Tu,Qi Tian,Zihan Yang,Yue Wu,Xintong Han,Weijie Kong,Jiangfeng Xiong,Jian-Wei Zhang,Zhao Zhong,Liefeng Bo,Zuxuan Wu,Yu-Gang Jiang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Current open-source diffusion models struggle to generate stable and synchronized audio-visual content, particularly in scenarios demanding complex semantic reasoning. The root cause is that existing methods rely on coarse text embeddings from off-the-shelf encoders to guide audio-video denoising, which discards fine-grained semantics and, critically, lacks a shared long-horizon plan, leading to uncoordinated denoising trajectories and fragile cross-modal alignment. We propose Baton, the first framework that introduces explicit semantic planning into joint video-audio generation. Our key insight is that complementing coarse text guidance with semantically rich, modality-aware planned tokens, jointly reasoned and mutually aligned before denoising, can simultaneously restore fine-grained semantic detail and establish a shared blueprint that coordinates both audio and video denoising trajectories. Concretely, Baton first introduces the VA-Planner, a multimodal language model equipped with dual semantic alignment towers, where learnable queries cross-attend to both video and audio features to produce a pair of semantically aligned video and audio planned tokens as keyframe-level blueprints. These planned tokens are injected into the diffusion backbone via cross-attention layers, providing temporally grounded guidance complementary to coarse text embeddings. Since planned tokens do not share one-to-one spatial-temporal correspondence with diffusion latents, we further propose Relative Semantic RoPE, a relative positional encoding that maps planned tokens and latents into a shared spatial-temporal coordinate frame, enabling each latent to accurately attend to its positionally corresponding semantic cues. Experiments on benchmarks show the effectiveness of Baton both qualitatively and quantitatively.

[CV-126] SpongeBob: Sync-Aware Harmonious Audio-Visual Generative Editing

Link: https://arxiv.org/abs/2605.25193
Authors: Sen Liang, Cong Wang, Fengbin Guan, Zhentao Yu, Yiting Lu, Yuanzhi Wang, Yuan Zhou, Xin Li, Zhibo Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Visual and acoustic events in the physical world are inherently coupled, yet existing video editing methods typically adopt decoupled pipelines, lacking bidirectional modality interaction. This results in two key limitations: (i) audio-visual desynchronization and (ii) contextual conflicts between generated audio and preserved content. To address these, we propose SpongeBob, the first end-to-end audio-visual joint editing framework featuring bidirectional cross-modal interaction. For synchronization, a Sync-Aware Mechanism aligns visual edits with sound events via bidirectional attention, temporal alignment, and spatial constraints. For contextual consistency, a Context-Aware Module leverages acoustic and visual context attention to prevent semantic clashes. Additionally, we introduce Sync-Preserving Training and Guidance (SPTG) to enhance alignment without degrading quality. Due to the scarcity of paired data, we construct a scalable data pipeline and a large-scale subject-level dataset. We also propose SpongeBob-Bench for systematic evaluation. Experiments show SpongeBob significantly outperforms existing baselines, improving Sync-C by 30% and Ctx-F1 by 12.5%. Our project page is available at: this https URL.

[CV-127] Injecting Image Guidance into Text-Conditioned Diffusion Models at Inference

Link: https://arxiv.org/abs/2605.25191
Authors: Agata Żywot, Iason Skylitsis, Thijmen Nijdam, Zoe Tzifa-Kratira, Derck Prinzhorn, Konrad Szewczyk, Aritra Bhowmik
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Text-to-image diffusion models like Stable Diffusion generate high-quality images from text, but lack a way to inject visual guidance (e.g. sketches, styles) at inference without retraining. Existing methods either require computationally expensive fine-tuning or rely on style transfer techniques that risk semantic misalignment with textual prompts. We introduce Visual Concept Fusion (VCF), the first method offering dual conditioning on both an image and text prompt at inference time without any concept-specific training. VCF enables visual concept injection into Stable Diffusion by aligning CLIP image features with the text embedding space. VCF consists of three components: (1) a lightweight aligner that maps image tokens to the text embedding manifold using InfoNCE and cross-attention reconstruction losses, (2) a fusion strategy that preserves both textual and visual semantics, and (3) an optional Prompt-Noise Optimization (PNO) module for test-time refinement. Our experiments demonstrate that VCF successfully transfers visual attributes including style, composition, and color palette from reference images while maintaining prompt adherence. Quantitative results show a trade-off between text alignment (CLIP score) and visual correspondence (LPIPS), with VCF outperforming baselines in reference fidelity.

[CV-128] Discrepancy Minimization Improves Cross-Hospital Robustness in Digital Pathology

Link: https://arxiv.org/abs/2605.25175
Authors: Ben Vardi, Dana Schonberger, Yuval Friedmann, Zohar Yakhini, Iris Barshack, Alexander Loebel, Ariel Shamir
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Pathology foundation models (PFMs) have advanced rapidly in recent years and support training classifiers for a range of histopathology tasks. However, their robustness across hospitals remains limited: performance often degrades when training a classifier on data from one hospital and evaluating it on another target hospital. We address this challenge by fine-tuning PFMs with a local maximum mean discrepancy (LMMD) objective that applies to two settings: domain adaptation, where unlabeled target-hospital data is available, and domain generalization, where target-hospital data is unavailable at all. Experiments at both the patch- and slide-level show consistent improvements across multiple PFMs and tasks.
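The abstract names a maximum mean discrepancy objective without giving details, so the following is only a minimal sketch of the plain (global) MMD between two feature sets with a Gaussian kernel. The function names, the bandwidth, and the synthetic "hospital" features are assumptions for illustration. The local variant is generally described as applying the same idea per class with sample weights, which is not shown here.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth):
    """Pairwise Gaussian (RBF) kernel between rows of x and rows of y."""
    sq_dists = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clamp tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def mmd_squared(source, target, bandwidth=1.0):
    """Biased estimate of the squared MMD between two feature sets."""
    k_ss = gaussian_kernel(source, source, bandwidth).mean()
    k_tt = gaussian_kernel(target, target, bandwidth).mean()
    k_st = gaussian_kernel(source, target, bandwidth).mean()
    return k_ss + k_tt - 2.0 * k_st

# Synthetic stand-ins for features extracted at two hospitals (assumption).
rng = np.random.default_rng(0)
hospital_a = rng.normal(0.0, 1.0, size=(200, 8))
hospital_a_more = rng.normal(0.0, 1.0, size=(200, 8))
hospital_b = rng.normal(1.5, 1.0, size=(200, 8))

same = mmd_squared(hospital_a, hospital_a_more, bandwidth=3.0)
shifted = mmd_squared(hospital_a, hospital_b, bandwidth=3.0)
print(f"same distribution: {same:.4f}, shifted distribution: {shifted:.4f}")
```

Minimizing a quantity of this kind during fine-tuning pushes source and target features toward the same distribution, which is the intuition behind discrepancy-based domain adaptation.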

[CV-129] Methodology for Creating a Clinically Verified Dermoscopic Image Dataset

Link: https://arxiv.org/abs/2605.25168
Authors: Kozachok Elena Sergeevna
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 22 pages, 5 figures, 5 tables

Abstract:This study presents a methodology for constructing a clinically verified dataset of dermatoscopic images for medical informatics research. The relevance of the work is driven by the fact that the performance of automated diagnostic support systems depends not only on the volume of images, but also on the reproducibility of the image acquisition procedure, the completeness of structured metadata, and the reliability of diagnostic labels. International collections were primarily created under conditions that differ substantially from routine Russian outpatient practice and mobile dermatoscopy. The proposed methodology integrates three interconnected components: (1) a standard operating procedure (SOP) for acquiring images via mobile dermatoscopy, (2) an information model comprising 16 structured metadata fields organized into six clinically oriented blocks in ISIC-compatible notation, and (3) a multi-stage expert verification of diagnostic labels (initial clinical annotation, consensus review by three specialists, and histological confirmation of all malignant neoplasms). Using this methodology, a dataset of 1,026 unique dermatoscopic images from 443 patients was collected between June 2025 and May 2026. From 1,044 initial records, 18 duplicates were excluded. The dataset includes nine nosological categories; all 39 malignant lesions (18 melanomas, 15 basal cell carcinomas, and 6 squamous cell carcinomas) were histologically verified. Patient age ranged from 2 to 90 years (median 38), with 279 females (63%) and 164 males (37%). Each image is accompanied by expert-annotated dermatoscopic structures and an explicit verification_stage field indicating the level of diagnostic confirmation. The resulting dataset serves as a pilot clinically verified resource suitable for independent model evaluation, domain shift analysis, interpretability studies, and further expansion.

[CV-130] K-U-KAN: Koopman-Enhanced U-KAN for 3D Dental Reconstruction from a Single Panoramic X-ray Radiograph

Link: https://arxiv.org/abs/2605.25163
Authors: Bikram Keshari Parida, Abhijit Sen, Wonsang You
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 24 pages, 9 figures

Abstract: A panoramic X-ray compresses a 3D jaw into a 2D strip; we aim to recover the missing depth cleanly and fast. Existing implicit neural representations render realistic volumes but are slow to train, sensitive to sampling and positional encodings, and costly in practice. Pure CNN baselines are efficient yet struggle with the dental arch’s long-range geometry, blur fine enamel-dentin boundaries, and offer little interpretability. We present K-U-KAN, a three-stage pipeline that (i) lifts 2D features into depth-aware observables with Kolmogorov-Arnold Networks, (ii) advances these observables by a stable, phase-aware linear evolution via a Koopman token block, and (iii) places the predicted depth bins onto focal-trough rays before a lightweight 3D attention U-KAN refines the volume. This marriage of physics (Beer-Lambert image formation), geometry (horseshoe focal trough), and learned linear dynamics yields sharp anatomy, fewer artifacts, and robust behavior on native radiographic intensities with batch size one. On held-out data, K-U-KAN matches transformer/implicit baselines on signal and structure metrics, clearly improves perceptual quality, and trains in roughly half the time, making single-view PX → CBCT reconstruction more practical for clinical pipelines.

[CV-131] SpikeReg: Energy-Efficient 3D Deformable Medical Image Registration with Spiking Neural Networks

Link: https://arxiv.org/abs/2605.25144
Authors: Ali Mikaeili Barzili, Behzad Moshiri, Hamid Azadegan, Mohammad-Reza A. Dehaqani
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: Deformable medical image registration aligns anatomical structures across images but remains computationally dense at 3D resolution. Spiking neural networks (SNNs) offer sparse event-driven computation, yet have not been systematically studied for deformable medical image registration. We introduce SpikeReg, a spiking U-Net for 3D brain MRI registration. SpikeReg is initialized from an analog ANN registration teacher, converted by layer-wise weight transfer and activation-percentile threshold calibration, and fine-tuned with a surrogate-gradient objective combining local cross-correlation, diffusion regularization, and spike-rate sparsity. On the OASIS Learn2Reg validation split (19 image pairs), SpikeReg reaches Dice 0.7474 ± 0.032, with no significant paired Dice difference from the ANN teacher (0.7480 ± 0.037, p = 0.67), at a 12.8% mean spike rate and a 55.5× projected arithmetic-energy reduction under an event-sparse SynOps/MAC proxy relative to the dense-ANN baseline. We additionally report two negative findings: displacement distillation from the ANN teacher hurts performance, and ANN teachers trained with a label-Dice loss fail to transfer through rate-code conversion. Together these results show that dense geometric prediction can be performed under sparse event-driven computation, opening a path toward neuromorphic medical image registration.

[CV-132] PQDT: Pseudo-Query Dual Transformer for Robust Point Cloud Restoration CVPR

Link: https://arxiv.org/abs/2605.25127
Authors: Haoqing Wu, Alexa Nawotki, Jochen Garcke
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: To be published in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026

Abstract:Point clouds are a fundamental 3D representation in computer vision, enabling a wide range of perception tasks. However, real-world point clouds often suffer from degradations such as incompleteness, noise, outliers, and irregular density, caused by sensor limitations or occlusions. Recovering clean and detailed shapes from such degraded data is crucial for downstream applications. While existing learning-based methods achieve progress on individual tasks like completion or denoising, they typically rely on global bottleneck features, which lose fine-grained geometry and remain sensitive to varying input quality. We propose a unified 3D restoration network that directly takes point clouds as input and adaptively reconstructs high-quality geometry under diverse degradation scenarios. At the core of our approach is a Pseudo-Query module, implemented within a Transformer backbone, which reformulates geometric translation into two cooperative stages to enhance structural clarity, robustness, and local detail preservation. Extensive experiments on curated benchmarks demonstrate that our approach surpasses state-of-the-art performance in general 3D restoration. It effectively handles complex combinations of completion, deformation, and denoising degradations. With this work, we provide a novel unified, point-only backbone for robust 3D restoration, enabling more versatile 3D perception.

[CV-133] Trust-Aware Joint Feature-Prediction Discrepancy for Robust Domain Adaptation

Link: https://arxiv.org/abs/2605.25119
Authors: Xi Ding, Lei Wang, Syuan-Hao Li, Yongsheng Gao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Research report

Abstract:Domain adaptation aims to mitigate performance degradation caused by distribution shifts between a labeled source domain and an unlabeled or sparsely labeled target domain. Most existing approaches estimate domain discrepancy either in feature space or in prediction space. However, these single-perspective strategies overlook a critical problem under domain shift: the reliability of the signals used for alignment. In practice, both learned representations and semantic predictions may become unreliable, and treating all target samples equally can lead to misleading alignment and suboptimal transfer. We introduce trust-aware domain adaptation, a principled framework that models domain discrepancy through the reliability of feature and prediction signals. Central to our approach is the Joint Feature-Prediction Discrepancy (JFPD), a unified formulation that jointly captures representation divergence and prediction divergence while weighting their contributions by sample-specific trust. Trust is quantified via two complementary mechanisms: uncertainty-aware trust, derived from prediction entropy to suppress unreliable predictions, and semantic-alignment trust, computed from prototype similarity in feature space to emphasize well-aligned representations. By prioritizing confident and semantically consistent samples while down-weighting noisy or ambiguous ones, JFPD provides a reliability-aware estimate of domain discrepancy. We further integrate JFPD into a training objective that guides adaptation toward trustworthy regions of the target domain. Experiments on standard benchmarks demonstrate that the proposed framework consistently achieves superior adaptation performance and yields discrepancy estimates that correlate with target-domain error. This work addresses, for the first time, the importance of modeling trust in the interaction between features and predictions for domain adaptation.

[CV-134] Uncertainty-DTW for Sequences and Visual Tokens

Link: https://arxiv.org/abs/2605.25110
Authors: Lei Wang, Syuan-Hao Li, Yongsheng Gao, Piotr Koniusz
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Research report

Abstract:Aligning structured data is a fundamental problem in computer vision and machine learning, underlying tasks such as time series analysis, human action recognition, and visual representation learning. Existing alignment methods, including Dynamic Time Warping (DTW) and its differentiable variants, rely on deterministic similarity measures and are therefore sensitive to heterogeneous and noisy features. In this work, we introduce uncertainty-aware alignment, a probabilistic framework that models pairwise correspondences with heteroscedastic uncertainty and performs structured matching along alignment paths. Our formulation, uncertainty-DTW (uDTW), assigns each correspondence a Normal distribution and parametrizes each alignment path by a Maximum Likelihood Estimate objective consisting of (i) a precision-weighted matching term that suppresses unreliable features, and (ii) a log-variance regularization that prevents degenerate solutions. This yields a probabilistic alignment mechanism that is robust to noise and interpretable, as uncertainty directly reflects the reliability of matches. We further generalize this framework from temporal sequences to tokenized visual representations, enabling structured matching over sets of visual tokens. The learned uncertainty can be interpreted as a reverse-attention: semantically relevant regions exhibit low uncertainty and dominate the alignment, while ambiguous/noisy regions have high uncertainty. This provides a connection between alignment, attention, and uncertainty modeling. We evaluate the proposed framework across diverse domains. The results demonstrate consistent improvements over state-of-the-art methods and show that learned uncertainty correlates with semantic importance. These findings establish uncertainty-aware alignment as a general, robust, and interpretable framework for learning from structured data.
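The two terms named in the abstract, a precision-weighted matching term and a log-variance regularizer, can be illustrated with a small dynamic program. This is a sketch under stated assumptions, not the paper's formulation: it uses a hard minimum rather than a differentiable soft minimum, takes the per-pair variances `sigma2` as given inputs instead of predicting them, and the weight `beta` is a hypothetical parameter.

```python
import numpy as np

def uncertainty_dtw(x, y, sigma2, beta=1.0):
    """Hard-min DTW with a precision-weighted cost plus a log-variance penalty.

    x: (n, d) sequence, y: (m, d) sequence,
    sigma2: (n, m) strictly positive variances, one per candidate pair.
    """
    n, m = len(x), len(y)
    sq_dist = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
    # High variance shrinks the matching term but pays a log-variance price.
    cost = sq_dist / sigma2 + beta * np.log(sigma2)

    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev
    return acc[n, m]

# With unit variances the log term vanishes and this is ordinary DTW.
a = np.array([[0.0], [1.0], [2.0]])
b = np.array([[0.0], [1.0], [1.0], [2.0]])
plain_aligned = uncertainty_dtw(a, b, np.ones((3, 4)))

# A pair flagged as unreliable (large variance) contributes less to the total.
p = np.array([[0.0], [5.0]])
q = np.array([[0.0], [1.0]])
plain = uncertainty_dtw(p, q, np.ones((2, 2)))
noisy_sigma2 = np.ones((2, 2))
noisy_sigma2[1, 1] = 16.0
down_weighted = uncertainty_dtw(p, q, noisy_sigma2)
```

Without the log-variance term the cost could be driven toward zero by inflating every variance, which is the degenerate solution the regularizer is meant to prevent.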

[CV-135] WorldCraft: From Camera Navigation to Object Manipulation in Interactive Video World Models

Link: https://arxiv.org/abs/2605.25077
Authors: Bohai Gu, Taiyi Wu, Yueyang Yuan, Jian Liu, Xiaocheng Lu, Dazhao Du, Jie Zhang, Jinxiang Lai, Shuai Yang, Xiaotong Zhao, Alan Zhao, Song Guo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Recent video-based world models have made pixel-space environments interactive at the camera level: users can navigate viewpoints while the model generates coherent visual continuations. Yet their action spaces remain incomplete: users can move the camera, but cannot act on individual objects. Since real-world interaction is inherently object-centric, such models remain closer to passive scene observers than truly manipulable environments. We present WorldCraft, a framework that expands interactive video world models from camera navigation to object-level trajectory actions. Given a user click and a sketched path, WorldCraft generates future frames in which the selected object follows the prescribed trajectory while the camera continues to navigate the scene. WorldCraft achieves this through a trajectory-centric control pipeline: First, Normalized World Trajectory (NWT) represents user-drawn motion in a camera-invariant world coordinate system and dynamically re-projects it under the current camera pose, separating object motion from camera-induced screen-space displacement; Spatial-Pathway LoRA (SP-LoRA) then injects this world-space signal through the model’s spatial-control pathway, adding object manipulation capability while preserving the pretrained camera controller; finally, Trajectory-Anchored State Persistence (TASP) treats the world trajectory as a persistent spatial state and refreshes autoregressive memory after trajectory-conditioned generation, allowing moved objects to reappear at their updated positions after leaving the camera view. Experiments show that WorldCraft enables accurate object control, preserves the video-based world model’s camera fidelity under camera-only evaluation, and maintains object state across long autoregressive rollouts with off-camera excursions.

[CV-136] VEOcc: Voxel-Centric Online Semantic Occupancy Prediction For Embodied Scene Understanding

Link: https://arxiv.org/abs/2605.25059
Authors: Ruoyu Wang, Yong Liu, Sheng Tao, Yuhang Lin, Yukai Ma
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: Crucial for autonomous exploration, online 3D occupancy prediction and mapping incrementally constructs dense spatial representations on the fly. However, recent Gaussian-centric methods struggle with structural boundary fidelity and rely heavily on predefined scene-size priors, fundamentally limiting their operational efficiency. In this work, we present VEOcc, a voxel-centric framework formulated as a recursive perception-and-assimilation paradigm. By eliminating the need for initial scale estimation, VEOcc enables highly streamlined, open-ended map expansion. Furthermore, to robustly aggregate noisy temporal observations within the discrete voxel space, we propose a Spatio-Temporal-Aware Online Update Strategy. It integrates Cross-Temporal Logit Aggregation (TLA) for temporal consistency, Reliability-Aware Confidence Modulation (RCM) for spatial uncertainty calibration, and Confidence-Driven Incremental State Update (CSU) for robust global state assimilation. Extensive experiments on Occ-ScanNet and EmbodiedOcc-ScanNet demonstrate that VEOcc establishes new state-of-the-art performance in both local and embodied settings. Notably, zero-shot evaluations on self-collected video sequences further confirm its robust out-of-distribution generalization capability in completely unseen real-world environments. Ultimately, our framework provides an accurate and highly efficient solution for autonomous exploration. Code and supplementary visualizations are available on our project page: this https URL.

[CV-137] TinyFormer: Preserving Tiny Objects in YOLO-DETR Hybrid Real-time Detectors

Link: https://arxiv.org/abs/2605.25046
Authors: Jun-Wei Hsieh, Meng-Yu Kao, Ghufron Wahyu Kurniawan, Kuan-Chuan Peng
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:YOLO-series and DETR-based detectors struggle with tiny-object detection. YOLO-style models benefit from efficient dense prediction, but their large-stride backbones may suppress tiny instances in deep feature maps and make grid assignment ambiguous. DETR-based models remove hand-crafted post-processing through set prediction, yet they reason over coarse token grids, where tiny objects occupy only a few weak tokens and are easily overlooked during matching. To address these limitations, we propose TinyFormer, a unified YOLO–DETR hybrid real-time detector that combines ViT representations, NMS-free set prediction, and a YOLO-style pyramid neck for accurate small-object detection. TinyFormer introduces a Parallel Bi-fusion Module (PBM), which builds high-resolution shortcuts from shallow stages to the feature pyramid, preserving fine spatial details during multi-scale fusion. We further design a Spatial Semantic Adapter (SSA) to compensate for the spatial loss caused by coarse tokenization. SSA extracts high-resolution cues from early stages and injects them into transformer token embeddings, improving tiny-object localization without sacrificing the global modeling ability of DETR. Experiments on MS COCO show that TinyFormer consistently outperforms recent YOLO-series detectors and the strong DEIMv2 baseline. TinyFormer-X achieves 58.4% AP even without PBM, while adding PBM improves the overall AP to 58.5% and brings a 1.6% AP gain on small objects. With Objects365 pre-training, TinyFormer-X-PBM reaches 60.2% AP, surpassing RF-DETR and other Objects365-pretrained detectors with fewer parameters and lower computation. These results demonstrate that TinyFormer bridges dense YOLO-style feature fusion and DETR-style set prediction, providing a strong accuracy-efficiency trade-off for real-time tiny-object detection. Code is available at this https URL.

[CV-138] Unbiased Diffusion Variational Inversion via Principled Posterior Matching

Link: https://arxiv.org/abs/2605.25042
Authors: Weimin Bai, Yuxuan Gu, Yifei Wang, Weijian Luo, He Sun
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Existing score-based methods for inverse problems often resort to approximate minimization of the KL divergence between the inversion distribution and the Bayesian posterior. Such an approximation leads to severe mode collapse and unreliable uncertainty quantification. In this paper, we propose Principled Posterior Matching (PPM), a framework that returns to the fundamentals of variational inference, rather than using tricky approximations. Instead of relying on heuristic approximations, we rigorously formulate the exact optimization of the KL divergence via the integration of Fisher divergence. We derive a tractable, equivalent gradient form of this integral, enabling precise optimization without the biases introduced by prior approximations. Our analysis clearly reveals that the mode collapse in previous methods stems directly from this approximation gap. Supported by our theoretical solution, PPM unifies two complementary paradigms: (1) In variational inference, PPM adopts mass-covering divergences that significantly improve the inversion diversity and uncertainty quantification; (2) In amortized inference, it enables the training of an efficient reconstruction network for rapid, single-step reconstruction. Furthermore, our formulation naturally extends to a broader family of divergence measures by generalizing the integral of the Fisher divergence. We validate PPM across challenging computational imaging tasks, including inpainting, super-resolution fluorescent microscopy, and radio interferometric black-hole imaging. In all experiments, PPM achieves superior reconstruction fidelity, faithful multimodal posterior recovery, and well-calibrated uncertainty estimates, establishing a robust framework for scientific imaging.

[CV-139] AstroRAG – A Pagerank-Based Retrieval-Augmented Generation Pipeline for Question Answering in Astronomy

Link: https://arxiv.org/abs/2605.25039
Authors: Zhifeng Wang, Jason Jingshi Li, Kaihao Zhang, Ramesh Sankaranarayana
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IEEE CAI 2026

Abstract: Large language models (LLMs) demonstrate strong performance in natural language processing but often generate factual errors when relying solely on parametric knowledge. Retrieval-Augmented Generation (RAG) mitigates these errors by grounding responses in external evidence, yet conventional retrieve-and-dump approaches frequently introduce irrelevant context that degrades answer quality. In this work, we present AstroRAG – a PageRank-based retrieval-augmented generation (RAG) pipeline adapted for question answering in astronomy. The system performs token-aware chunking and per-instance, ephemeral indexing in Elasticsearch, then executes a two-stage retrieval: (i) Maximal Marginal Relevance (MMR) to obtain a small, diverse candidate set and (ii) a reader-driven PageRank (PR) re-ranking on a similarity graph to identify a compact, mutually supportive context under a strict token budget. Our design is training-free, privacy-preserving, and reproducible, as each instance is processed through transient indexing to prevent cross-task leakage. We evaluate the pipeline on the AstroQA benchmark for astronomy QA, and demonstrate competitive performance across all difficulty levels. In particular, the RAG-enhanced Mistral-7B achieves 79.49% accuracy and 79.49% F1-score, nearly doubling the performance of its non-RAG counterpart. These results highlight the effectiveness of disciplined retrieval and refinement in boosting domain-specific reasoning, establishing a robust foundation for extending RAG to other scientific fields.
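The two-stage retrieval described in the abstract can be sketched on toy embeddings as follows. Everything here is an assumption for illustration: the random vectors stand in for chunk embeddings, `lam` and `damping` are conventional default-style values, and the PageRank step is the plain algorithm on a cosine-similarity graph rather than the "reader-driven" variant, whose details the abstract does not give. Elasticsearch indexing and the token budget are not modeled.

```python
import numpy as np

def cosine_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def mmr_select(query, chunks, k, lam=0.7):
    """Greedy Maximal Marginal Relevance: relevance minus redundancy."""
    relevance = cosine_matrix(chunks, query[None, :])[:, 0]
    pairwise = cosine_matrix(chunks, chunks)
    selected = [int(np.argmax(relevance))]
    while len(selected) < min(k, len(chunks)):
        remaining = [i for i in range(len(chunks)) if i not in selected]
        scores = [
            lam * relevance[i] - (1.0 - lam) * pairwise[i, selected].max()
            for i in remaining
        ]
        selected.append(remaining[int(np.argmax(scores))])
    return selected

def pagerank(similarity, damping=0.85, iters=100):
    """Power iteration on a row-normalized, non-negative similarity graph."""
    weights = np.clip(similarity, 0.0, None).copy()
    np.fill_diagonal(weights, 0.0)  # no self-loops
    n = len(weights)
    row_sums = weights.sum(axis=1, keepdims=True)
    safe = np.where(row_sums > 0, row_sums, 1.0)
    # Rows with no outgoing weight jump uniformly to every node.
    transition = np.where(row_sums > 0, weights / safe, 1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        rank = (1.0 - damping) / n + damping * (transition.T @ rank)
    return rank

rng = np.random.default_rng(0)
chunks = rng.normal(size=(12, 16))                 # toy chunk embeddings
query = chunks[3] + 0.1 * rng.normal(size=16)      # query close to chunk 3

candidates = mmr_select(query, chunks, k=5)
scores = pagerank(cosine_matrix(chunks[candidates], chunks[candidates]))
context = [candidates[i] for i in np.argsort(-scores)[:3]]
```

MMR keeps the candidate set diverse, and the graph ranking then favors candidates that are similar to many other candidates, which is one reading of "mutually supportive context".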

[CV-140] SDA-UCT: Self-Supervised Domain-Adaptive Ultrasound Computed Tomography for Rapid Musculoskeletal Sound Speed Reconstruction

Link: https://arxiv.org/abs/2605.25024
Authors: Tianyu Liu, Heyu Ma, Aiduo Wang, Peiwen Li, Boyi Li, Ying Li, Dan Li, Chengcheng Liu, Dean Ta
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: Ultrasound computed tomography (UCT) via full waveform inversion (FWI) enables high-resolution quantitative imaging for tissue characterization and disease diagnosis. However, UCT suffers from a large computational burden and severe convergence issues due to highly nonlinear optimization. Deep learning can accelerate UCT reconstruction, but supervised training requires large-scale labeled datasets that are difficult to obtain in vivo. To address these limitations, we propose SDA-UCT, a two-stage self-supervised domain-adaptive framework for rapid and accurate UCT imaging of musculoskeletal tissues. SDA-UCT employs an attention-enhanced network (AttUCT) pre-trained on simulation datasets and transfers to in-vivo data via physics-informed self-supervised learning, effectively bridging the simulation-to-real domain gap. A Low-Rank Adaptation (LoRA) mechanism is integrated to enable efficient adaptation across diverse clinical scenarios. Results showed that AttUCT achieved high-quality speed-of-sound (SOS) reconstruction for a simulated human forearm with a PSNR of 29.23 dB and SSIM of 0.928, outperforming conventional FWI and existing deep learning methods. Validated on in-vivo data, SDA-UCT successfully reconstructed SOS images revealing complex anatomical structures (skin, fat, muscle, tendon, bone and bone marrow) of the human forearm, in high concordance with MRI references. The LoRA mechanism, adjusting only 3% of parameters, achieved performance comparable to full fine-tuning. The rapid reconstruction (5 ms per frame) enables real-time 3D visualization, achieving a five-orders-of-magnitude improvement over traditional FWI. This work represents the first self-supervised domain-adaptive deep learning framework for rapid, high-resolution in-vivo UCT imaging, showing potential for musculoskeletal disease diagnosis.

[CV-141] D3S2: Diffusion-Guided Dataset Distillation for Semantic Segmentation

Link: https://arxiv.org/abs/2605.25022
Authors: Wenjie Zheng, Haoji Hu, Jiali Lu, Xingze Zou, Jing Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract: Dataset distillation (DD) aims to compress large-scale datasets into compact synthetic sets while preserving training efficacy. However, existing studies mainly focus on image classification, leaving dense prediction tasks such as semantic segmentation largely underexplored. In this work, we identify three key challenges for segmentation DD: (i) long-tailed class imbalance, (ii) the need for strict pixel-wise alignment between images and dense labels, and (iii) the high computational cost of optimizing high-resolution data with complex models. To address these challenges, we propose D3S2, a Diffusion-guided Dataset Distillation framework for Semantic Segmentation. Our method adopts a two-stage design. In Class-Balanced Mask Selection, we construct a representative mask set via a greedy strategy that prioritizes underrepresented classes. In Diffusion-Guided Image Synthesis, we employ a pretrained layout-to-image diffusion model to generate images conditioned on the selected masks, naturally ensuring spatial alignment. To further enhance the training utility of synthesized data, we introduce guided diffusion sampling with two complementary objectives: a segmentation-consistency loss for pixel-level alignment, and a class-wise feature matching loss for aligning per-class feature statistics across layers. Extensive experiments demonstrate the superiority of D3S2. Notably, at an extreme compression rate of 1%, our method achieves 24.99% and 35.49% mIoU on ADE20K and COCO-Stuff with Mask2Former (Swin-S), outperforming random selection by 9.34% and 5.70%, respectively.
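The abstract only says that mask selection is greedy and prioritizes underrepresented classes. One way such a selection could look is sketched below. The scoring rule (each class present in a mask contributes the inverse of one plus the number of already chosen masks containing it) is a hypothetical choice made for this illustration, as are the function name and the toy masks.

```python
import numpy as np

def greedy_balanced_selection(masks, num_classes, budget):
    """Pick `budget` masks, favoring classes that are rare among the picks so far."""
    # presence[i, c] is True when class c appears anywhere in mask i.
    presence = np.stack([
        np.bincount(m.ravel(), minlength=num_classes) > 0 for m in masks
    ])
    coverage = np.zeros(num_classes)            # chosen masks containing each class
    available = np.ones(len(masks), dtype=bool)
    chosen = []
    for _ in range(min(budget, len(masks))):
        gain = (presence / (1.0 + coverage)).sum(axis=1)
        gain[~available] = -np.inf              # never pick the same mask twice
        pick = int(np.argmax(gain))
        chosen.append(pick)
        available[pick] = False
        coverage += presence[pick]
    return chosen

# Toy label maps: class 0 is common, classes 1 and 2 are rare.
masks = [
    np.array([[0, 0], [0, 0]]),
    np.array([[0, 0], [0, 0]]),
    np.array([[0, 2], [0, 0]]),
    np.array([[1, 1], [1, 1]]),
]
selected = greedy_balanced_selection(masks, num_classes=3, budget=2)
print(selected)
```

In this toy case the first pick is the mask that contains a rare class alongside the common one, and the second pick is the mask with the class not yet covered, rather than another mask of the common class.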

[CV-142] Stop Denoising Your Blurs ICIP

Link: https://arxiv.org/abs/2605.25014
Authors: Sasidhar Parvathireddy, Vamsidhar Saraswathula, Rama Krishna Gorthi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at IEEE International Conference on Image Processing (ICIP) 2026. 7 pages, 3 figures

Abstract: In recent times, diffusion models have achieved remarkable performance in image restoration tasks. Their core mechanism rests on the restrictive assumption that the degradation is an additive-noise operation. However, the blur model, one of the most widely studied degradation formulations, violates this assumption, as it is inherently based on convolution rather than addition. In this paper, we introduce ConvDiff, a novel diffusion-based framework that substitutes the additive operation with convolution for the task of image deblurring. In the forward process, we construct a meaningful trajectory from the clean image to its blurred counterpart by exploiting the frequency domain characteristics of convolution, rather than progressively corrupting the image with additive noise. While the current work instantiates this framework for Gaussian blur, where frequency-domain decomposition yields closed-form and physically valid intermediate states, the underlying principle of constructing degradation trajectories from the blur operator extends naturally to other blur families. This formulation bridges the gap between the mathematical principles of blurring and the iterative design of diffusion-based restoration algorithms, enabling more physically grounded and effective image restoration models.
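Why Gaussian blur admits closed-form intermediate states can be seen in a 1D sketch: in the frequency domain a Gaussian blur of standard deviation σ multiplies each frequency f by exp(-2π²σ²f²), so two successive blurs compose by adding variances. The code below is an illustration under assumptions and not the paper's forward process: it uses circular (FFT) convolution on a 1D signal, and the schedule in which the variance grows linearly with the step index is a hypothetical choice.

```python
import numpy as np

def gaussian_blur_fft(signal, sigma):
    """Circular Gaussian blur applied as a multiplication in the frequency domain."""
    freqs = np.fft.fftfreq(len(signal))  # cycles per sample
    transfer = np.exp(-2.0 * (np.pi * freqs * sigma) ** 2)
    return np.real(np.fft.ifft(np.fft.fft(signal) * transfer))

def blur_trajectory(clean, sigma_max, steps):
    """States from the clean signal (step 0) to the fully blurred one (last step)."""
    # Assumed schedule: the blur variance grows linearly with the step index.
    sigmas = sigma_max * np.sqrt(np.arange(steps + 1) / steps)
    return [gaussian_blur_fft(clean, s) for s in sigmas]

clean = np.zeros(64)
clean[20:30] = 1.0  # a box-shaped test signal
trajectory = blur_trajectory(clean, sigma_max=3.0, steps=4)

# Blurring with sigma 1 and then sigma 2 equals one blur with sqrt(1 + 4).
two_step = gaussian_blur_fft(gaussian_blur_fft(clean, 1.0), 2.0)
one_step = gaussian_blur_fft(clean, np.sqrt(5.0))
print(np.allclose(two_step, one_step))
```

Because the transfer function equals 1 at zero frequency, every intermediate state keeps the mean intensity of the clean signal, unlike a trajectory built by adding noise.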

[CV-143] Learning from Semantic Dictionaries: Discriminative Codebook Contrastive Learning for Unified Visual Representation and Generation CVPR’26

Link: https://arxiv.org/abs/2605.25012
Authors: Imanol G. Estepa, Jesús M Rodríguez-de-Vera, Bhalaji Nagarajan, Petia Radeva
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR’26

Abstract:Discriminative and generative vision models excel in their respective domains but remain semantically misaligned, hindering progress toward unified visual learning. We introduce LEASE (LEArning from SEmantic Dictionaries), a self-supervised framework that bridges this gap using a paired generative-discriminative codebook design. LEASE operates entirely in a discrete token space produced through a one-time precomputation step, enabling efficient training without data augmentations, teacher models, or online tokenizers. LEASE integrates two complementary objectives: a masked token reconstruction loss that captures fine-grained generative detail, and a codebook contrast loss that aligns encoder features with discriminative semantics via adaptive centroid weighting. This dual supervision yields a unified latent space that supports both high-quality generation and strong representation learning. On ImageNet-1K, LEASE achieves state-of-the-art unified performance, outperforming prior VQGAN-based methods such as MAGE and Sorcen across linear probing (up to +1.7%), unconditional generation (-1.26 FID and +10.19 IS w.r.t MAGE), few-shot learning (+0.56% on average against Sorcen), transfer (+0.75% average improvement against MAGE and Sorcen), and robustness benchmarks (+5.86% and +4.25% average improvement against MAGE and Sorcen, respectively). It also competes favorably with domain-specialized contrastive and generative models while surpassing previous MIM methods. The unsupervised LEASE model can also be extended to conditional generation by building upon its learned representations, proving competitive with specialized baselines. Overall, LEASE provides an efficient and effective step toward general-purpose vision models that jointly understand and generate visual content.

[CV-144] ClueAegis: Heuristic-to-Reasoning Cognitive-skill Learning for Unified Evidence-based Synthetic Image Detection

Link: https://arxiv.org/abs/2605.25009
Authors: Huangsen Cao,Hongkang Chu,Yuxi Li,Ying Zhang,Chen Li,Jing Lyu,Yongwei Wang,Yu Zhao,Fei Wu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The rapid advancement of generative models has made synthetic images increasingly realistic, challenging reliable detection. Existing methods are often limited to end-to-end classification or monolithic reasoning, and thus fail to model structured forensic reasoning and heterogeneous visual evidence. We revisit synthetic image detection from a cognitive perspective and propose a Heuristic-to-Reasoning cognitive skill learning framework for evidence-based forensic analysis. Given an input image, our framework first extracts heuristic perceptual clues, selects the optimal forensic skill, and then performs skill-conditioned reasoning for evidence extraction and decision making. To support this paradigm, we introduce ClueAegis-Bench, which decomposes synthetic image detection into explicitly annotated forensic cognitive skills for structured evaluation beyond binary classification. Based on this benchmark, we propose ClueAegis (Cognitive-skill Learning for Unified Evidence-based Synthetic Image Detection), a two-stage agentic framework that conducts heuristic skill selection followed by evidence-guided reasoning through skill-conditioned toolchains. This design reformulates synthetic image detection as a configurable multi-skill reasoning process that bridges perception, skill selection, and forensic reasoning. Extensive experiments show that ClueAegis achieves state-of-the-art performance while improving cross-domain generalization and robustness. It also provides transparent reasoning trajectories and structured forensic evidence, offering a more explainable alternative to conventional end-to-end detectors.

[CV-145] NeurIPS: Neuro-anatomical Inductive Priors for Sphere-based Brain Decoding ICML

Link: https://arxiv.org/abs/2605.24993
Authors: Sijin Yu,Zijiao Chen,Zhenyu Yang,Zihao Tan,Jiakun Xu,Zhongliang Liu,Shengxian Chen,Wenxuan Wu,Xiangmin Xu,Xin Zhang
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: International Conference on Machine Learning (ICML) 2026

Abstract:Current fMRI decoders face a performance-fidelity trade-off where efficient 1D encoders outperform geometrically faithful surface-based models. We argue this is partly driven by inefficient surface tokenization and the failure to use anatomy as a predictive signal. We present NeurIPS, a framework that improves surface-based decoding by reframing anatomical variation from a nuisance to a powerful inductive prior. NeurIPS unites two innovations: a Selective ROI Spherical Tokenizer (SRST) for efficient geometric encoding, and a Structure-Guided Mixture of Experts (SG-MoE) that explicitly models individual anatomy using cortical features. On the Natural Scenes Dataset, NeurIPS establishes a new state-of-the-art for surface decoders and achieves performance comparable to strong 1D baselines. This is achieved with unprecedented efficiency, as the model converges dramatically faster (10 vs. 600 epochs). This efficiency enables rapid adaptation to new subjects using only 20% of data and ensures robust scalability as the training cohort is expanded. Ablations provide causal evidence that these gains are driven by the model’s use of cortical features, not by memorizing subject IDs. By leveraging anatomical priors, NeurIPS provides a principled and scalable path toward robust, generalizable brain decoding.

[CV-146] Cross-Domain Generalization Limits of Vision Foundation Models in Facial Deepfake Detection

Link: https://arxiv.org/abs/2605.24965
Authors: Ibrahim Delibasoglu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The rapid evolution of generative models has enabled the creation of hyper-realistic facial deepfakes, exposing a critical vulnerability in modern digital forensics: the inability of detectors to generalize to unseen manipulation techniques. Traditional networks suffer from representation collapse, overfitting to localized artifact fingerprints of specific training generators. This work investigates whether modern Vision Foundation Models can serve as generalizable, out-of-the-box feature extractors capable of tracking forensic anomalies across entirely unseen generative manifolds. We conduct a systematic cross-domain evaluation comparing three foundational learning paradigms: fully supervised macro-semantic features (RoPE-ViT), pure self-supervised geometric features (DINOv3), and multi-teacher agglomerative representations (NVIDIA C-RADIOv4-H). By deploying frozen backbones subjected to downstream linear probing, we map the performance limitations of these architectures on the challenging DF40 benchmark. Our empirical findings expose the intrinsic trade-offs between pre-training paradigms and parameter scale, proving that while foundation models retain high discriminative capabilities for entire face synthesis, localized face editing techniques expose fundamental boundaries in linear probe evaluation structures. Source code and model weights are available in this http URL

[CV-147] ConFi-GS: Confidence-Guided High-Frequency Injection for 3D Gaussian Splatting Super-Resolution

Link: https://arxiv.org/abs/2605.24964
Authors: Jiaxiang Li,Zongtan Zhou,Zhen Tan,Yadong Liu,Dewen Hu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reconstructing high-quality 3D scenes from low-resolution multi-view images remains challenging for 3D Gaussian Splatting (3DGS), because insufficient high-frequency observations often lead to blurred textures, weak boundaries, and view-inconsistent details. Existing approaches either apply super-resolution guidance uniformly or localize enhancement regions based mainly on geometric sampling. However, they typically do not distinguish between two fundamentally different questions: where additional detail is needed, and whether the corresponding candidate high-frequency content is reliable enough to be internalized into a multi-view consistent 3D representation. In this paper, we propose a reliability-aware frequency modeling framework for low-resolution 3DGS reconstruction. The framework first estimates a geometry-guided detail-demand prior to locate regions that are likely under-detailed under low-resolution supervision. It then computes a frequency-aware reliability map to determine whether candidate high-frequency details are structurally supported, spectrally unresolved, and cross-view stable. Combining these signals yields a detail-injection map that guides where super-resolved details should be introduced during optimization. Based on this map, we design a unified optimization scheme comprising spatially selective supervision, coarse-to-fine frequency regularization, and reliability-aware Gaussian densification. This scheme controls where reliable details are injected, when high-frequency supervision is activated, and how unresolved yet reliable details are internalized into the Gaussian representation. Experiments on multiple benchmarks show improved fidelity and perceptual quality while suppressing unstable or view-inconsistent details. 

[CV-148] Tempered Self-Similarity Alignment for Physically Plausible Video Generation CVPR2026

Link: https://arxiv.org/abs/2605.24962
Authors: Manjin Kim,Suha Kwak,Minsu Cho
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the CVPR 2026 Workshop on Video Generative Models: Benchmarks and Evaluation (VGBE)

Abstract:Despite remarkable advances in video generative models, they still struggle to generate physically realistic videos, frequently exhibiting appearance drift, implausible motion, and temporal inconsistencies. In this work, we address this limitation by transferring relational knowledge encoded in spatio-temporal self-similarity (STSS) from visual foundation models into video generative models. STSS represents pairwise similarities among features across space and time, revealing the relational structure of how objects interact with other entities throughout a video, effectively capturing real-world dynamics, including object motion and semantic transformations. To transfer this relational knowledge, we propose Tempered Self-similarity Alignment (TSA) loss, which transforms STSS into probabilistic correspondence distributions and trains the video generative model to align its correspondence distributions with those of the visual foundation model on dynamically changing regions. Evaluated on VideoPhy and VideoPhy2 benchmarks, our method demonstrates substantial improvements in physical plausibility across diverse interaction scenarios, validating the effectiveness of transferring relational knowledge for physically realistic video generation.
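The core idea of turning self-similarity into "probabilistic correspondence distributions" and aligning them can be sketched as a tempered softmax over pairwise similarities followed by a KL divergence. This is a minimal sketch under stated assumptions: cosine similarity, a temperature of 0.1, KL(teacher || student) as the alignment measure, and the toy feature vectors are all illustrative choices not given in the abstract, and the restriction to dynamically changing regions is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def correspondence(features, temperature):
    """Row-wise tempered softmax over pairwise similarities.

    `features` is a list of token features flattened over space and time.
    Row i is a distribution over which token j corresponds to token i.
    """
    rows = []
    for query in features:
        logits = [cosine(query, key) / temperature for key in features]
        peak = max(logits)                       # subtract max for stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


def alignment_loss(teacher_feats, student_feats, temperature=0.1):
    """Mean KL(teacher || student) over query tokens (assumed loss form)."""
    p = correspondence(teacher_feats, temperature)
    q = correspondence(student_feats, temperature)
    kls = [sum(pi * math.log(pi / qi) for pi, qi in zip(pr, qr))
           for pr, qr in zip(p, q)]
    return sum(kls) / len(kls)


# Toy features: the teacher sees token 1 as close to token 0,
# the mismatched student sees it as identical to token 2.
teacher = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
student_bad = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

loss_same = alignment_loss(teacher, teacher)
loss_bad = alignment_loss(teacher, student_bad)
print(loss_same, loss_bad)
```

A lower temperature sharpens each row toward a hard correspondence, while a higher one spreads probability mass over more tokens.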

[CV-149] Three-Step Conditional Diffusion 3D Reconstruction for Light-Field Microscopy CVPR2026

Link: https://arxiv.org/abs/2605.24959
Authors: Qihong Zhao,Shaokang Yan,Zhimin Qiao,Jinjia Wang,Bo Xiong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures. Accepted to CVPR 2026 Findings

Abstract:Light-field microscopy (LFM) enables single-shot capture of multi-angular information from biological samples, supporting real-time volumetric imaging. However, traditional physics-based algorithms often suffer from limited spatial resolution, severe artifacts, and high computational costs. Existing learning-based methods improve inference efficiency but still face limitations in reconstruction accuracy and generalization capability. To address these challenges, this paper proposes a high-fidelity Three-Step Conditional Diffusion (TCD) 3D reconstruction method for LFM. Although conventional diffusion models have achieved remarkable success in generative modeling, their slow sampling process and the inherent trade-off between quality and efficiency hinder their application in real-time 3D imaging. We redesign the diffusion process through a deterministic three-step sampling strategy coupled with a lightweight conditional U-Net, establishing a new paradigm for fast and accurate volumetric reconstruction. Furthermore, an Inter-Class Detection (ICD) module is incorporated to identify out-of-distribution or anomalous inputs during inference, thereby enhancing model stability and reliability. Extensive experiments and cross-dataset evaluations demonstrate that TCD significantly outperforms state-of-the-art methods in both reconstruction fidelity and generalization, providing an efficient and practical 3D reconstruction solution for light-field microscopy.

[CV-150] Mitigating Object Hallucinations in Vision-Language Models through Region-Aware Attention Recalibration

Link: https://arxiv.org/abs/2605.24957
Authors: Yuanzhi Xu,Qian Gao,Jun Fan,Guohui Ding,Zhenyu Yang,Sixue Lin,Yuteng Xiao
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:The generation of factually incorrect objects, commonly known as object hallucination, remains a persistent challenge in Large Vision-Language Models (LVLMs). Current approaches to address this issue - ranging from expensive data-driven fine-tuning and high-latency contrastive decoding to rigid attention head truncation - frequently compromise either computational efficiency or the continuity of the model’s feature space. To overcome these limitations, we introduce a novel, training-free inference strategy that operates as a region-aware adaptive weighting mechanism to dynamically correct semantic drift without relying on abrupt heuristic truncations. By computing an outlier-resistant statistical midpoint across various attention heads, we establish a stable anchor for reliable visual representations. We then utilize the inter-head disagreement mapped across regions to dynamically determine intervention budgets, gently suppressing hallucination-inducing attention paths through a continuous penalty modulation. This recalibration process effectively rectifies visual-semantic misalignments while fully preserving generative fluency and language priors. Comprehensive evaluations on standard multimodal benchmarks, including CHAIR, POPE, and MME, reveal that our strategy substantially curtails both instance- and sentence-level hallucinations. The results demonstrate state-of-the-art performance against contemporary baselines, confirming our method’s efficiency and algorithmic robustness. Our code will be public.
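One way to picture the mechanism in the abstract above (a robust midpoint across heads, a per-region intervention budget driven by inter-head disagreement, and a continuous rather than hard correction) is the sketch below. It is only a guess at the general shape: the per-region median as the anchor, mean absolute deviation as the disagreement measure, the linear shrinkage rule, and the `strength` parameter are all assumptions, and rows are not renormalized afterwards.

```python
from statistics import median


def recalibrate(head_attn, strength=0.5):
    """Softly pull each head's region attention toward a robust anchor.

    head_attn[h][r]: attention mass head h places on image region r.
    Returns a new list of the same shape.
    """
    num_regions = len(head_attn[0])
    # Outlier-resistant anchor: per-region median across heads.
    anchor = [median(h[r] for h in head_attn) for r in range(num_regions)]
    # Inter-head disagreement: mean absolute deviation from the anchor.
    disagreement = [
        sum(abs(h[r] - anchor[r]) for h in head_attn) / len(head_attn)
        for r in range(num_regions)
    ]
    max_dis = max(disagreement) or 1.0       # avoid dividing by zero
    # Regions where heads disagree more receive a larger correction budget.
    budget = [strength * d / max_dis for d in disagreement]
    # Continuous shrinkage toward the anchor instead of hard truncation.
    return [
        [a - budget[r] * (a - anchor[r]) for r, a in enumerate(h)]
        for h in head_attn
    ]


# Toy example: head 1 disagrees strongly with the others on regions 1 and 2.
heads = [
    [0.2, 0.5, 0.3],
    [0.2, 0.1, 0.7],
    [0.2, 0.4, 0.4],
]
adjusted = recalibrate(heads)
print(adjusted)
```

Regions where all heads agree are left untouched, while the outlying head is moved part of the way toward the consensus value.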

[CV-151] Interpretability Transfer from Language to Vision via Sparse Autoencoders

Link: https://arxiv.org/abs/2605.24946
Authors: Alexey Kravets,Da Li,Chuan Li,Da Chen,Vinay P. Namboodiri
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in language model interpretability using sparse autoencoders (SAEs) have yet to effectively translate to the visual domain, mainly due to the difficulty and ambiguity of labeling visual concepts. In this paper, we introduce Visual Interpretability via SAE Transfer Alignment (VISTA), a framework that transfers interpretability from language to vision in a LLaVA-style vision-language model by constraining a visual projector to map visual tokens into an LLM’s pre-existing, labeled textual SAE space. This approach enables visual interpretability without training dedicated vision SAEs. By regularizing the projector using the LLM’s SAE reconstruction loss, VISTA achieves a threefold increase in the matching rate, which measures how accurately the most activating textual concepts in the SAE space correspond to semantic elements in the image. Using this framework, we further analyze spatial localization properties of different vision encoders and show that DINOv2 features have stronger localization abilities than other encoders. Leveraging this precision, we validate VISTA’s cross-modal alignment through fine-grained, localized concept interventions, where specific objects are removed or replaced in the model’s perception while preserving the surrounding scene. This results in improvements of 35% in object removal and 47% in object replacement tasks over vision-only baselines, providing causal evidence that visual tokens inhabit the text SAE manifold. These contributions are validated across multiple LLM architectures.

[CV-152] HumanEgo: Zero-Shot Robot Learning from Minutes of Human Egocentric Videos

Link: https://arxiv.org/abs/2605.24934
Authors: Zhi (Leo) Wang,Botao He,Kelin Yu,Seungjae Lee,Ruohan Gao,Furong Huang,Yiannis Aloimonos
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this https URL

Abstract:Human egocentric video captures rich manipulation demonstrations without any robot hardware, yet transferring these skills to robots remains challenging due to the embodiment gap between human and robot in both visual appearance and kinematics. We present HumanEgo, a framework that bridges the embodiment gap by lifting each human demonstration to an entity-level representation of hand-object interaction, and training a flow matching policy with dense auxiliary objectives that amplify supervision from every trajectory. HumanEgo is robot-data-free, hardware-agnostic, data-efficient, and zero-shot human-to-robot transferable. With only 30 minutes of human videos per task, HumanEgo achieves 92.5% average success across four real-world tasks (75% with just 15 minutes), outperforms matched-time robot teleoperation by 41%, and robustly transfers zero-shot across novel robots, cameras, and environments.

[CV-153] X-Edit: Exact, Explicit, and Explainable Null-Space Editing for Medical Vision Transformers MICCAI2026

Link: https://arxiv.org/abs/2605.24932
Authors: Yuanye Liu,Siyuan Zhou,Ke Zhang,Lei Li,Wei Chen,Xiahai Zhuang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Early accepted by MICCAI 2026

Abstract:Pre-trained Vision Transformers (ViTs) are increasingly deployed for medical image classification. However, correcting their inevitable failure cases in dynamic clinical scenarios poses a critical challenge. Conventional fine-tuning approaches inherently suffer from catastrophic forgetting, severely degrading previously acquired diagnostic capabilities. Such instability fundamentally compromises clinical safety. Addressing this vulnerability requires an active, controllable, and reliable intervention mechanism that is both theoretically grounded and inherently interpretable. To this end, we propose X-Edit (eXact, eXplicit, and eXplainable Editing), an efficient null-space model editing framework. X-Edit transitions the editing process from iterative gradient-based optimization to a theoretically grounded, closed-form solution. Specifically, we first explicitly localize the influential layers via causal tracing governing the erroneous prediction. Subsequently, we construct an orthogonal null-space projection matrix from a curated anchor set. By geometrically constraining the exact parameter update strictly within this null space, we provide mathematical guarantees that the intervention rectifies targeted errors without perturbing established diagnostic representations. Extensive evaluations on six medical imaging benchmarks demonstrate that X-Edit comprehensively suppresses catastrophic forgetting while achieving superior edit success rates. Our code is available at this https URL.
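The null-space idea in the abstract above can be illustrated on a single linear layer: build a projector onto the orthogonal complement of the anchor inputs, then apply a closed-form rank-one update along the projected edit key. The sketch below is a generic null-space edit under stated assumptions (one weight matrix, a rank-one update, Gram-Schmidt for the projector, hypothetical function names); it is not taken from the X-Edit code and omits the causal-tracing step.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(W, x):
    return [dot(row, x) for row in W]


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt basis for the span of the anchor inputs."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis


def project_to_null_space(x, basis):
    """Remove every component of x that lies in the anchor span."""
    out = list(x)
    for q in basis:
        c = dot(out, q)
        out = [oi - c * qi for oi, qi in zip(out, q)]
    return out


def null_space_edit(W, anchors, key, target):
    """Return W' with W' @ key == target and W' @ a == W @ a for each anchor a."""
    basis = orthonormal_basis(anchors)
    k_perp = project_to_null_space(key, basis)
    denom = dot(k_perp, k_perp)
    if denom < 1e-12:
        raise ValueError("edit key lies in the span of the anchors")
    residual = [t - y for t, y in zip(target, matvec(W, key))]
    # Rank-one update residual * k_perp^T / ||k_perp||^2, confined to the null space.
    return [[w + r * kp / denom for w, kp in zip(row, k_perp)]
            for row, r in zip(W, residual)]


# Toy layer (2 outputs, 3 inputs), two anchors to preserve, one input to fix.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
anchors = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
key, target = [0.5, 0.5, 1.0], [2.0, -1.0]
W_new = null_space_edit(W, anchors, key, target)
print(W_new)
```

Because the update direction is orthogonal to every anchor, the edited layer reproduces the original outputs on the anchors exactly while mapping the edit key to its target, which is the kind of guarantee the abstract refers to.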

[CV-154] MambaDSF: Multi-Scale SSM with Dilated Feature Fusion for Sonar Small Target Detection

Link: https://arxiv.org/abs/2605.24928
Authors: Hui Lin,Jiayi Li,Jing Wang,Shenghui Rong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 4 figures, under review at IEEE Geoscience and Remote Sensing Letters (GRSL)

Abstract:Sonar imaging is the primary modality for underwater target detection, yet small targets remain difficult to detect due to insufficient pixel coverage, low acoustic contrast, and scale ambiguity across imaging ranges. CNN-based detectors extract local features efficiently but cannot suppress noise-induced false alarms without global acoustic context. Transformer-based methods capture long-range dependencies at quadratic computational cost. Existing Mamba-based vision models offer efficient linear-cost scanning but lack multi-scale semantic alignment across pyramid levels, multi-receptive-field fusion, and small-target-aware training supervision needed for reliable sonar detection. This letter proposes Mamba Dilated-Scale Fusion (MambaDSF), a hybrid framework addressing these limitations through three contributions: a Mamba Enhanced Feature Pyramid (MambaEFP) backbone that jointly captures local echo cues and global acoustic context at linear complexity; a Dilate Fusion Mamba (DFMamba) encoder that enforces multi-scale feature alignment across pyramid levels; and Scale-Adaptive Weighted IoU (SA-WIoU) and Cross-Scale Coherence (CSC) losses that stabilize small-target training. MambaDSF achieves 91.5% mAP50 on the UATD forward-looking sonar benchmark with 28.7 million parameters, surpassing all compared detectors. On a small-target subset the gain reached +2.2 percentage points, and cross-domain evaluation on FLS and MD-FLS confirms the generalization of the proposed architecture. The codes are publicly available at this https URL.

[CV-155] Snapshot Polarimetric Display Inverse Rendering

Link: https://arxiv.org/abs/2605.24915
Authors: Seokjun Choi,Yunseong Moon,Kaizhang Kang,Hoon-Gyu Chung,Jin-Nyeong Kim,Giljoo Nam,Seung-Hwan Baek
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Inverse rendering remains a core challenge in graphics and vision, especially in the snapshot configurations required for lightweight desktop workflows, where the per-frame information budget is highly constrained. Previous inverse rendering work explores various available dimensions for enriching the per-shot information, including temporal modulation, spectral encoding, and polarization. In this work, we introduce polarimetric display inverse rendering, using an LCD to project a linearly polarized RGB binary pattern and an RGB polarization camera augmented with a quarter-wave plate to acquire spectro-polarimetric measurements in a single shot. A feed-forward transformer maps these measurements to per-pixel normal, albedo, roughness, and metallicity. To overcome training data scarcity, we expand a limited set of measured polarimetric bidirectional reflectance distribution functions via a generative manifold. Evaluations on a real desktop setup demonstrate accurate inverse rendering across diverse scenes, outperforming existing approaches.

[CV-156] Where Detectors Fail: Probing Generative Space for Generalizable AI-Generated Image Detection

Link: https://arxiv.org/abs/2605.24906
Authors: Zijie Cao,Weijie Tu,Yao Xiao,Weijian Deng,Liang Lin,Pengxu Wei
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Detecting AI-generated images (AIGI) remains challenging because detectors often fail to generalize to unseen generators. Although existing methods are trained on large datasets, their performance still degrades when generation settings change, indicating that data scale alone is insufficient and that limited coverage of generative variations during training is a key factor. Studies on generative model editing show that small changes in internal representations can produce diverse and meaningful image variations, many of which are not explored under standard sampling. Leveraging this insight, we propose PROBE (Probing Robustness via Boundary Exploration), a framework that improves detector generalization by actively exploring challenging regions of the generative process. Instead of treating the generator as a fixed data source, PROBE uses the detector as a critic to steer the generator through manifold-level modifications, producing realistic samples that are difficult to classify. These samples expose failure cases that are uncommon under standard data sampling strategies and are used to refine the detector. Experimental results across multiple benchmarks indicate that PROBE enhances generalization to unseen generators, resulting in more generalizable AIGI detection performance. Code and models are available at this https URL

[CV-157] BFS: Back-to-Front Layered Image Synthesis via Knowledge Transfer SIGGRAPH2026

Link: https://arxiv.org/abs/2605.24894
Authors: Kyoungkook Kang,Gyujin Sim,Sunghyun Cho
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: SIGGRAPH 2026

Abstract:As generative models expand the possibilities of visual content creation, layered image synthesis has emerged as a promising direction for controllable and creative editing. However, existing methods struggle to fully realize this potential. Decomposition-based methods often struggle with clean separation, while generation-based methods suffer from difficulty in training data acquisition, reducing quality and scene diversity. In this paper, we propose BFS, a novel generation-based framework for layered image synthesis. Specifically, given a background image and user guidance, BFS synthesizes a foreground layer that incorporates not only a foreground object but also its associated visual effects, such as shadows and reflections, while seamlessly harmonizing with the background to produce a coherent composite. To enable diverse and high-quality foreground layer synthesis while overcoming data scarcity, we leverage the comparatively easy-to-learn knowledge of unlayered image synthesis for the foreground synthesis. To this end, we adopt a dual-branch diffusion framework in which two interconnected branches generate a composite image and a foreground layer, respectively, enabling bidirectional knowledge transfer. Based on this framework, we propose a two-stage training scheme that utilizes a high-quality unlayered composite image dataset to effectively enhance foreground quality. Extensive experiments, including a user study, show that BFS produces high-quality layered images, consistently outperforming prior methods.

[CV-158] BED-SAM2: Boundary-Enhanced-Depth SAM2 via Monocular Geometric Priors CVPR2026

Link: https://arxiv.org/abs/2605.24893
Authors: Tyler Rust,Dara McNally,Kyle O’Donnell,Colin Kelly,Chandra Kambhamettu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 5 figures, 5 tables. Presented as a poster at the CVPR 2026 Workshop on Computer Vision in the Wild (CVinW). Code available at this https URL

Abstract:Building upon the SAM2 vision foundation model for downstream segmentation, this study introduces Boundary Enhanced Depth (BED)-SAM2. The SAM2 Hiera encoder architecture is modified to directly encode monocular depth information from RGB images, thereby providing geometric cues that enhance object boundary delineation and facilitate the extraction of camouflaged object shapes. BED-SAM2 demonstrates competitive state-of-the-art performance across multiple salient and camouflaged object detection tasks with as few as five training epochs.

[CV-159] X-Foresight: A Joint Vision-Action Causal Forecasting Network via Predictive World Modeling

Link: https://arxiv.org/abs/2605.24892
Authors: Baolu Li,Jingyu Qian,Rui Guo,Yilun Chen,Hanpeng Liu,Yuan Lin,Junhong Zhou,Ruixin Liu,Willow Yang,Yutong Zheng,Zhenli Zhang,Tenglong (Victor) Gu,Zhuangzhuang Ding,Pengkun Zheng,Yu Zhang,Xianming Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Physical world knowledge resides mainly in videos. Equipping Vision-Language-Action (VLA) models with such knowledge is fundamental for safe and generalizable planning. Predictive world modeling enables VLA to internalize physical dynamics and long-term causality by predicting future video from past observations. However, naive next-frame prediction faces two challenges: 1) unlike semantically distinct text tokens, video tokens are low-entropy and redundant, causing prediction to degenerate into trivial extrapolation. 2) world modeling poses a temporal dilemma: dense prediction captures instantaneous dynamics, but cannot efficiently model long-horizon causality. To learn world knowledge effectively, we introduce X-Foresight, a predictive world model integrated directly into the VLA architecture to jointly learn world modeling and real-time action control. At its core lies a long-horizon chunk-wise auto-regressive strategy that addresses both challenges: by predicting semantically distant chunks rather than adjacent frames, it escapes trivial extrapolation, while preserving dense intra-chunk frames for instantaneous dynamics and sparse inter-chunk transitions for long-term causality. A curriculum learning schedule progressively extends prediction horizons and stabilizes long-horizon training. To capture long-term causality effectively, we present temporal importance sampling, which concentrates supervision on safety-critical chunks identified by ego-motion and behavioral signals. We further delegate photorealistic synthesis to a diffusion-based multi-view renderer, improving photorealistic appearance. Comprehensive experiments demonstrate that X-Foresight significantly outperforms VLA baselines in planning performance while maintaining strong generative fidelity, establishing a robust paradigm for world-knowledge-driven autonomous systems. 

[CV-160] QuoVLA: Quotient Space for Vision-Language-Action Models

Link: https://arxiv.org/abs/2605.24890
Authors: Xuan Wang,Yinan Wu,Haoran Duan,Jungong Han
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-Language-Action (VLA) models commonly adapt pretrained Vision-Language Models (VLMs) to robot control by mapping visual observations and language instructions to continuous actions. Existing approaches typically take an action-insufficiency view, assuming that pretrained VLM latents either lack directly usable action information or should be shielded from action-learning signals. Against this view, our Quotient Theory for VLA shows that pretrained VLM latents are not action-insufficient but action-sufficient: they already contain the information needed for control, yet remain overcomplete by distinguishing prompt-level variations that induce the same optimal action behavior. To operationalize this theory, we propose QuoVLA, a quotient-space framework for VLA that compresses pretrained VLM latents into action-sufficient representations. Specifically, QuoVLA instantiates this principle with a quantization module and a dual-branch design with relative temporal-complexity regularization, preserving action-relevant information while removing prompt-level redundancy. Extensive experiments across multiple benchmarks demonstrate that QuoVLA achieves strong performance, with particularly notable improvements in generalization under visual, linguistic, and environmental distribution shifts. Our code will be made publicly available.

[CV-161] Trajectory-Consistent Calibration for Cache-Accelerated Diffusion Models

Link: https://arxiv.org/abs/2605.24870
Authors: Mingyu Liang,Dingkun Xu,Jingwei Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 8 figures, 8 tables. Code is available at this https URL

Abstract:Diffusion Transformers require repeated denoiser evaluations during iterative sampling, making inference computationally expensive. Cache-based acceleration reduces this cost by reusing intermediate representations across denoising steps, but can introduce representation deviations and degrade generation quality. In this paper, we analyze these deviations and show that effective calibration should consider both the direct mismatch caused by reuse and the subsequent trajectory shift induced by earlier corrections. To address this challenge, we propose Trajectory-Consistent Calibration (TCC), a training-free method that calibrates cached representations toward their full-computation counterparts. Specifically, rather than estimating all calibration priors from a single uncorrected cache trajectory, TCC uses an offline iterative procedure so that each prior accounts for the trajectory shift induced by preceding calibrations. Experiments on PixArt-alpha and DiT-XL/2 show that TCC consistently improves FID across representative cache-based acceleration methods while preserving their underlying reuse policies. Notably, in a representative PixArt-alpha cache-acceleration setting based on FORA, TCC reduces FID from 29.83 to 27.35, slightly surpassing the full-computation baseline.

[CV-162] Adversarial Error Correction for Visual Autoregressive Generation

Link: https://arxiv.org/abs/2605.24843
Authors: Ligong Bi,Tao Huang,Jianyuan Guo,Chang Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Visual Autoregressive (VAR) models have emerged as a powerful paradigm for image synthesis by performing hierarchical next-scale prediction. However, VAR models are inherently prone to cascading error propagation, where subtle coarse-scale mispredictions are amplified across the hierarchy, ultimately distorting the final synthesis. To mitigate this, we propose AID-VAR, a plug-and-play framework that enhances pre-trained VARs through Adversarially Injected Diagnosis. Instead of standard passive generation, AID-VAR introduces a proactive error-correction mechanism inspired by the adversarial feedback in GANs. We deploy a discriminator to diagnose fidelity gaps at each scale transition, coupled with a lightweight guidance injector. This module operates as a non-invasive adapter that refines the feature manifold of a frozen VAR backbone, effectively steering the generation toward the distribution of real images without destabilizing the pre-trained latent space. Furthermore, to rigorously evaluate this cross-scale progression, we introduce the Inter-Scale Consistency Score (ISCS), a novel metric that quantifies the fidelity and structural alignment between consecutive resolution scales. Experimental results across various backbones demonstrate that AID-VAR delivers sharper textural details and fewer structural distortions with negligible overhead. For instance, AID-VAR-d20 achieves a 16% improvement in FID with only a 3% increase in parameters. These results establish AID-VAR as a highly efficient and scalable pathway for upgrading large-scale VAR generators, enhancing global coherence and local detail without altering training data, base architectures, or sampling schedules. Code is available at this https URL.

[CV-163] Multiscale Real-Time Object Detection in the NMS-Free Era: A Comparative Performance Evaluation of YOLOv8 and YOLO26

Link: https://arxiv.org/abs/2605.24831
Authors: Chidera G. Oguine,Kanyifeechukwu J. Oguine,Obiozor M. Oguine,Ozioma C. Oguine
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 6 tables, 9 figures

Abstract:Non-Maximum Suppression (NMS) remains a key post-processing step in many real-time object detection pipelines, but it can introduce latency variation and deployment complexity in resource-constrained settings. Recent NMS-free designs such as YOLO26 aim to reduce this dependence through end-to-end detection, yet their performance relative to established NMS-based models such as YOLOv8 remains underexplored beyond standard benchmarks. This paper compares YOLOv8 and YOLO26 on Pascal VOC and VisDrone, representing general object detection and dense aerial small-object detection, respectively. Both model families are evaluated across five scales using accuracy, localization, model size, GFLOPs, and CPU/GPU latency. Results show that YOLO26 achieves stronger detection performance and lower model complexity on Pascal VOC across most scales, while the performance gap narrows on VisDrone, where both models struggle with dense small targets. YOLOv8 remains competitive in GPU latency, showing that NMS-free design does not guarantee universal deployment superiority. Overall, the study shows that detector selection depends on dataset characteristics, object scale, model capacity, and hardware constraints.

[CV-164] AOEPT: Breaking the Implicit Modality-Reduction Bottleneck in Modality-Missing Prompt Tuning ICML2026

Link: https://arxiv.org/abs/2605.24816
Authors: Jian Lang,Rongpei Hong,Ting Zhong,Fan Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, Accepted by ICML 2026, Code is available from this https URL

Abstract:Deploying multimodal systems in real-world environments often entails handling modality-missing scenarios, where one or more modalities are unavailable. While recent studies address this challenge for the general Multimodal Transformer (MT) architecture via prompt tuning, we identify a fundamental limitation in these methods: the Implicit Modality-Reduction bottleneck. By conditioning prompts solely on the observed modalities, they inadvertently restrict the reasoning scope of MTs to the modality-reduced subspace, cutting off access to the latent information sources of the missing modalities. To overcome this limitation, we propose AOEPT, which pioneers a novel modal-contextualized prompting fashion. Specifically, we introduce lightweight Modal-Contextualized Prompts (MCPs) that distill global modality-wise priors from training data, serving as latent repositories of the information sources for missing modalities. Conditioned on the remaining modalities, these MCPs are instantiated into instance-aware prompts that selectively augment missing-modality information for each sample, thereby restoring the reasoning scope of MTs beyond the observed-modality-only subspace. Experiments across various multimodal benchmarks and backbones confirm the strong performance of AOEPT, with minimal computational overhead.

[CV-165] CLIP-Guided SAM: Parameter-Efficient Semantic Conditioning for Promptable Segmentation

Link: https://arxiv.org/abs/2605.24807
Authors: Shayan Jalilian,Abdul Bais
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Promptable foundation models such as the Segment Anything Model (SAM) produce high-quality masks but remain semantically blind, relying on external prompts to specify categories. Existing vision-language approaches address this limitation by using external prompt coupling, where a vision-language model generates spatial prompts for SAM as a separate stage. We propose CLIP-Guided SAM, a parameter-efficient segmentation framework built on internal semantic conditioning. Instead of using semantic signals only to generate prompts, we inject CLIP-derived text, vision, and similarity features directly into SAM’s image encoder through lightweight multi-modal semantic adapters. These adapters condition SAM’s internal feature representations, allowing semantic information to influence mask prediction while preserving SAM’s original promptable interface. Our framework is designed for low labeled-data settings and applies to both general-domain benchmarks and specialized downstream tasks. It supports two operating modes: Manual mode, for interactive segmentation with both text and spatial prompts, and Semi-Automatic text-only mode, for applications that require concept-specific segmentation using only textual input. We show that robustness depends on aligning training with the type of prompts used at inference, making train-test prompt consistency an important design principle. Through extensive experiments and ablations, we evaluate our method against SAM+PEFT baselines without semantic conditioning, vision-language + SAM pipelines, SAM 3, and strong semi-supervised segmentation methods that rely on large amounts of unlabeled data. Across these settings, CLIP-Guided SAM consistently achieves superior or competitive performance while remaining parameter-efficient in both training and deployment. 
Submitted: Sun, 24 May 2026 01:40:30 UTC (v1, 868 KB)

[CV-166] Fishbone: From One 3D Asset to a Million Controllable Edits

Link: https://arxiv.org/abs/2605.24805
Authors: Yumeng He,Xiaoying Wang,Peihao Li,Yanjia Huang,Joe Masterjohn,Jiajun Wu,Leonidas Guibas,Yin Yang,Ying Jiang,Chenfanfu Jiang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 19 figures

Abstract:Large-scale controllable 3D assets are critical for computer graphics, embodied AI, robotics, and interactive content creation, yet creating diverse 3D assets remains challenging due to the high cost of manual modeling and rigging. Shape deformation offers a natural way to generate variations from existing meshes, but existing data-driven methods often rely on sparse user inputs, while parametric editing frameworks require manually designed control structures and category-specific configurations. Inspired by natural creatures, where a central spine governs global shape and cross-sectional ribs control local variation, we introduce Fishbone, a unified rib-spine representation for general shapes that supports controllable parametric mesh deformation, reduced-space dynamics, and animation. Given an input mesh, Fishbone computes a geodesic scalar field with an adaptive heat method, extracts iso-contours as cross-sectional ribs, constructs a smooth geometry-aware spine through rib centers, and associates surface vertices with nearby rib and spine structures using Gaussian-weighted skinning. The resulting representation enables real-time and predictable deformation: ribs control local profiles such as thickness, orientation, and cross-sectional variation, while the spine controls global bending, twisting, and stretching. The same structure also supports reduced-space simulation and keyframe animation. We further construct Fishbone-136K by augmenting Hunyuan3D with rib-spine structures, and demonstrate applications in controllable 3D generation, deformation-based data augmentation for robot learning, interactive mesh editing, and agentic generation. Experiments demonstrate the effectiveness, efficiency, and versatility of the proposed framework.
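
As a rough illustration of the Gaussian-weighted skinning step mentioned in this abstract, the sketch below ties one surface vertex to a set of rib centers and blends per-rib translations. The function names, the `sigma` bandwidth, and the translation-only offsets are assumptions made for this example; the paper's actual weighting and deformation model may differ.

```python
import math


def gaussian_skinning_weights(vertex, rib_centers, sigma=0.5):
    """Normalized Gaussian weights of one vertex with respect to each rib center.

    `sigma` is a hypothetical bandwidth, not a value taken from the paper.
    """
    raw = []
    for center in rib_centers:
        squared_distance = sum((v - c) ** 2 for v, c in zip(vertex, center))
        raw.append(math.exp(-squared_distance / (2.0 * sigma ** 2)))
    total = sum(raw)
    return [w / total for w in raw]


def deform_vertex(vertex, rib_centers, rib_offsets, sigma=0.5):
    """Move a vertex by the weight-blended translation of its ribs."""
    weights = gaussian_skinning_weights(vertex, rib_centers, sigma)
    return tuple(
        v + sum(w * offset[axis] for w, offset in zip(weights, rib_offsets))
        for axis, v in enumerate(vertex)
    )


if __name__ == "__main__":
    ribs = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    offsets = [(0.0, 0.2, 0.0), (0.0, 0.0, 0.0)]  # only the first rib moves
    print(deform_vertex((0.1, 0.0, 0.0), ribs, offsets))
```

Because the weights are normalized, a vertex close to one rib follows that rib almost rigidly, while a vertex between two ribs receives a smooth blend of both edits.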

[CV-167] Divide-and-Conquer Inference for Large-Scale Visual Recognition with Multimodal Large Language Models

Link: https://arxiv.org/abs/2605.24799
Authors: Zhipeng Ye,Jiaqi Huang,Feng Jiang,Qiufeng Wang,Yikang Duan,Dawei Wang,Xihang Zhou,Qian Qiao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated strong capabilities across a wide range of vision-language tasks. However, when applied to large-scale image classification, their performance degrades significantly as the label space expands, a phenomenon we define as Performance Collapse in Long Sequence Recognition. Through an information-theoretic analysis, we reveal that this collapse stems from a fundamental conflict between the escalating information entropy and the prominent attention dilution and decay within attention mechanisms, which impairs the model’s ability to maintain a sufficient signal-to-noise ratio when processing extremely long prompts. To mitigate this, we propose Divide-and-Conquer Inference (DCI), a novel test-time scaling strategy for visual recognition with MLLMs. DCI recursively decomposes complex global classification tasks into multiple simpler, localized subproblems and employs a dynamic pruning mechanism to compress the search space. This method effectively improves the local signal-to-noise ratio and model accuracy by mitigating the inherent weight dilution issues in long-sequence inference. Moreover, while traditional self-attention incurs a prohibitive quadratic computational complexity, DCI achieves more favorable scaling behavior and substantially accelerates inference in large-scale classification scenarios. Extensive experiments on benchmarks such as ImageNet-1K and ImageNet-21K demonstrate that DCI consistently improves classification accuracy. This enables lightweight open-source models to rival or even surpass frontier closed-source giants without any additional training or fine-tuning. As a model-agnostic, plug-and-play paradigm, DCI offers an efficient approach for scaling the inferential precision of MLLMs in large-scale scenarios.
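
The recursive decomposition described in this abstract can be pictured as a tournament over short candidate lists. The sketch below is a simplification under stated assumptions: `pick_fn` is a hypothetical stand-in for one MLLM query that sees only a small group of labels, and the pruning keeps exactly one winner per group, which is not necessarily the paper's dynamic pruning rule.

```python
def divide_and_conquer_classify(labels, pick_fn, group_size=4):
    """Reduce a large label set to one label through rounds of small sub-queries.

    `pick_fn(group)` is a hypothetical callable that returns the best label
    within a short list; in practice it would wrap a model call.
    """
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so each round shrinks the set")
    candidates = list(labels)
    if not candidates:
        raise ValueError("labels must not be empty")
    while len(candidates) > 1:
        # Each round prunes the search space to one survivor per group.
        candidates = [
            pick_fn(candidates[start:start + group_size])
            for start in range(0, len(candidates), group_size)
        ]
    return candidates[0]


if __name__ == "__main__":
    all_labels = [f"class_{i}" for i in range(1000)]
    target = "class_737"

    def toy_pick(group):
        # Toy oracle: recognizes the target whenever it is in the short list.
        return target if target in group else group[0]

    print(divide_and_conquer_classify(all_labels, toy_pick, group_size=10))
```

With 1,000 labels and groups of 10, no single query ever lists more than 10 candidates, which is the property the abstract ties to a better local signal-to-noise ratio.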

[CV-168] HCL-FF: Hierarchical and Contrastive Learning for Forward-Forward Algorithm CVPR2026

Link: https://arxiv.org/abs/2605.24797
Authors: Jie-En Yao,Hong-En Chen,C.-C. Jay Kuo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026. Code: this https URL

Abstract:Deep neural networks trained with backpropagation have achieved outstanding performance in vision tasks but remain biologically implausible, computationally demanding, and difficult to interpret. The Forward-Forward (FF) algorithm offers a promising alternative by training each layer independently through local goodness objectives. However, its purely local optimization lacks hierarchical coordination across layers, and the decoupling of goodness from features leaves the representations unconstrained and semantically ambiguous. We propose a Hierarchical and Contrastive Learning FF framework (HCL-FF) to address these limitations. HCL-FF introduces (1) a coarse-to-fine hierarchical learning strategy that guides representations from low-level cues to high-level semantics, and (2) a supervised contrastive objective that enforces class-discriminative alignment after goodness decoupling. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that HCL-FF achieves new state-of-the-art performance among FF-based methods, with notable accuracy gains of +5.46%, +17.00%, and +12.51%, respectively.

[CV-169] Parameter-Efficient VLMs for Gastrointestinal Endoscopy: Medical Image Generation and Clinical Visual Question Answering

Link: https://arxiv.org/abs/2605.24792
Authors: Ojonugwa Oluwafemi Ejiga Peter,Frederick Akor Ejiga,Fahmi Khalifa,Md Mahmudur Rahman
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:The major limitations of gastrointestinal (GI) endoscopy AI systems arise from a shortage of annotated data, strict privacy policies, and significant bottlenecks in conventional model fine-tuning. Such limitations impede the successful application of sophisticated AI models in clinical practice, particularly affecting the reliability and scalability of diagnosis. In this paper, we present a dual-pipeline PEFT model that addresses two fundamental problems: medical Visual Question Answering (VQA) and the generation of privacy-preserving synthetic data. For clinical VQA, we adopt the Florence-2 vision-language model. Leveraging PEFT enhances model interpretability while substantially reducing the computational cost of training. Simultaneously, we employ Low-Rank Adaptation (LoRA) with Stable Diffusion 2.1 to generate high-quality GI images that enhance training databases without violating patient privacy. This research utilized the Kvasir-VQA dataset. Our Florence-2 VQA model achieved ROUGE-1 of 0.92, ROUGE-L of 0.91, and BLEU score improvements from 0.08 to 0.24. Fine-tuning on private datasets consistently showed better results than fine-tuning on public datasets. The rank-4 LoRA synthesis achieved optimal performance with a fidelity score of 0.290, an agreement score of 0.730, and a Frechet BiomedCLIP Distance (FBD) of 1450, reducing computational costs by almost 90 percent. This framework improves the clinical potential of AI in GI endoscopy. Compared to FLUX, MSDM, and Kandinsky 2.2, our model demonstrates superior FBD and strong semantic alignment. While other models lead in Fidelity or Agreement, our lower FBD indicates better image-text coherence. These results establish our approach as a robust solution for enhancing VQA and synthetic data generation in clinical AI.

[CV-170] Self-Supervised Contrastive Learning for Cardiac MR Sequence Classification

Link: https://arxiv.org/abs/2605.24789
Authors: Yuli Wang,Hyewon Jung,Dongshen Peng,Yuwei Dai,Jing Wu,Haoyue Guan,Yoko Kato,Zhicheng Jiao,Yu Sun,Ihab Kamel,Joao Lima,Cheng Ting Lin,Harrison Bai
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Vision Transformer (ViT) models, utilizing self-attention mechanisms, have demonstrated robust generalization capabilities across various vision tasks, including image classification. However, these models, typically pretrained on general public datasets, often lack the specialized domain knowledge necessary for medical imaging applications. In this study, we investigate the adaptation of ViT models, specifically for cardiac magnetic resonance (MR) images, using an in-house dataset. We found that pretrained ViT features do not effectively transfer to the cardiac MR domain. To overcome this limitation, we introduce an adaptation strategy that utilizes image-based self-supervised contrastive learning, demonstrating superior performance compared to traditional supervised training approaches. Moreover, our adapted ViT model exhibits strong generalization to external MR datasets such as BraTS and ADNI. Through ablation studies, we further investigate the impact of batch size and dataset scale on performance. Ultimately, our adapted model achieves classification AUC exceeding 0.75 across the four most common cardiac MR sequences.

[CV-171] How Noisy Poses Break Inverse Dynamics: Analysis and Mitigation for Video-Based Joint Torque Estimation

Link: https://arxiv.org/abs/2605.24776
Authors: Donghyun Kim,Chanyoung Kim,Eunseo Jeong,Youngjoong Kwon,Seong Jae Hwang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in monocular 3D human pose estimation enable accurate body tracking from video. However, translating these kinematic estimates into physical quantities, such as joint torques, remains challenging due to noise amplification through inverse dynamics. In this work, we provide a systematic analysis of how pose estimation noise propagates through the inverse dynamics pipeline. We present three key findings: (1) pose noise is amplified by approximately 1,000x when computing joint torques via numerical differentiation, (2) proximal joints (spine, hips) are up to 10x more sensitive to noise than distal joints (wrists, hands), and (3) low-pass filtering before differentiation substantially reduces this amplification. To enable this analysis, we develop SMPL-Dynamics, a fully differentiable inverse dynamics module for the SMPL body model that requires no external physics simulators. Our module supports end-to-end gradient computation, and we demonstrate this through differentiable pose refinement, which reduces torque error by 93% with negligible change in pose.

[CV-172] From Theory to Decision Rule: Calibrating the Noisy-Label Crossover for Vision-Language Model Weak Supervision Across Three Medical-Imaging Benchmarks

Link: https://arxiv.org/abs/2605.24771
Authors: Bruce Changlong Xu,Jose James,Alexander Ryu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 5 pages, 2 figures, 4 tables

Abstract:Classical noisy-label theory predicts that downstream performance under weak supervision is bounded above by the labeler’s accuracy, implying a sharp crossover: once a gold-trained classifier matches the labeler, weak labels stop helping and start hurting. The prediction is theoretical; what is missing is a benchmark calibration that turns it into an instance-level statement for modern foundation-model labelers. We provide such a calibration for BiomedCLIP-generated weak labels on three medical-imaging benchmarks (PCAM, ISIC, NIH-CXR) and six downstream architectures spanning an 11x parameter range. The crossover predicted by theory appears at n_g ≈ 100 on PCAM, 20-50 on ISIC, and 250-500 on NIH-CXR; weak labels above the crossover degrade AUC by up to 0.10. The location is architecture-invariant for four of five pretrained architectures, and a within-family DenseNet sweep (2.5x parameters, identical pretraining) supports the view that the labeler, not the student, is the dominant constraint. The calibration in turn produces a decision rule operable from 10-20 gold labels: compare gold-only AUC to VLM accuracy on the user’s gold set. A structured-vs-random noise sign flip on NIH-CXR shows that the rate-only formulation of the bound is incomplete and identifies a concrete refinement (label-space projection) that future benchmarks can be designed to test.
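
One way to read the decision rule at the end of this abstract is as a single comparison on the user's small gold set. The sketch below is an interpretation, not the authors' code: the function names are hypothetical, and the direction of the comparison (use weak labels only while the gold-only model is still worse than the labeler) is inferred from the crossover described above.

```python
def labeler_accuracy(vlm_labels, gold_labels):
    """Fraction of gold examples on which the VLM-generated label agrees with gold."""
    if not gold_labels or len(vlm_labels) != len(gold_labels):
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(1 for v, g in zip(vlm_labels, gold_labels) if v == g)
    return matches / len(gold_labels)


def prefer_weak_labels(gold_only_auc, vlm_labels, gold_labels):
    """True while the gold-trained classifier has not yet caught up with the labeler."""
    return gold_only_auc < labeler_accuracy(vlm_labels, gold_labels)


if __name__ == "__main__":
    gold = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
    vlm = [1, 0, 1, 0, 0, 0, 1, 1, 1, 1]  # toy labels with 8 of 10 correct
    print(prefer_weak_labels(0.70, vlm, gold))
    print(prefer_weak_labels(0.85, vlm, gold))
```

With only 10 to 20 gold labels both quantities are noisy estimates, so a margin or confidence interval around the comparison may be a sensible addition in practice.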

[CV-173] Muon in Vision Transformers: Optimizer-Recipe Interactions and Gradient Spectra

Link: https://arxiv.org/abs/2605.24770
Authors: Ben S. Southworth,Shuai Jiang,Daniel McBride,Eric C. Cyr,Stephen Thomas
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 25 pages, 15 figures

Abstract:Muon is a recently developed matrix-aware optimizer that has shown strong results in transformer training, but its behavior in vision transformers (ViTs) is not yet well understood. We study Muon for ViT training, largely on ImageNet-100 and Pl@ntNet-300K, comparing against AdamW under standard vision recipes involving mixup, cutmix, smoothing, and random augmentation and erasing. Muon consistently outperforms AdamW, with especially large gains on long-tailed Pl@ntNet macro top-1. These gains are also recipe-dependent, where Muon benefits much more than AdamW from advanced and significant data augmentation techniques. To understand this interaction, we analyze the singular-value structure of matrix gradients throughout the ViT. Within Muon training runs, removing heavy data augmentation induces a late-training spectral concentration and mode collapse in gradient matrices, primarily in deep MLP-down blocks. Under a fixed “full” augmentation recipe, the clearest Muon-AdamW contrast appears instead in QKV gradients, where AdamW gradient energy remains concentrated in a much narrower basis while Muon spreads energy across substantially more singular modes. Muon in ViTs is therefore best understood as an optimizer-recipe interaction. Under a fixed recipe, Muon differs from AdamW most clearly in attention projections, where its gradients consist of a broader spectral basis. Within Muon, a full training recipe is important for preventing late spectral concentration and mode collapse in deep feedforward blocks. We further demonstrate efficacy in training ViTs on image segmentation and masked autoencoder models, where Muon outperforms AdamW in all settings considered.

[CV-174] Leveraging pretrained RGB denoisers for hyperspectral image restoration

Link: https://arxiv.org/abs/2605.24769
Authors: Daniele Picone,Mohamad Jouni,Mauro Dalla-Mura
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments:

Abstract:Hyperspectral image restoration faces several challenges, including limited training data, strong sensor specificity, and high spectral dimensionality. These limitations hinder the learning of robust hyperspectral priors, motivating the reuse of priors learned from large-scale RGB data. In this work, we propose a minimally trained, lightweight adapter that repurposes frozen pretrained RGB denoisers for hyperspectral restoration through a projection mapping. The method denoises low-dimensional spectral projections and reconstructs the hyperspectral cube through constrained linear aggregation, while preserving plug-and-play compatibility and the stability properties of the underlying RGB denoiser. Experiments on denoising, deblurring, and super-resolution across multiple datasets demonstrate consistent improvements over hyperspectral-specific baselines, showing the strong transferability of large-scale RGB priors.

[CV-175] 4KLSDB: A Large-Scale Dataset for 4K Image Restoration and Generation CVPR2026

Link: https://arxiv.org/abs/2605.24762
Authors: Zihao Zhu,Kuan-Ru Huang,Zhaoming Xu,Renjie Li,Bo Wu,Ruizheng Bai,Mingyang Wu,Sayak Paul,Zhengzhong Tu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the DataCV Workshop at CVPR 2026; 10 pages, 4 figures, 7 tables; Our project page is available at: this https URL

Abstract:High-resolution datasets are essential for advancing super-resolution (SR) and text-to-image (T2I) diffusion research. However, current publicly available datasets lack both the native 4K resolution and the extensive scale necessary for training state-of-the-art models. To address this gap, we introduce a 4K Large Scale Dataset and Benchmark (4KLSDB), a large-scale, diverse dataset consisting of 129,484 carefully curated 4K resolution images spanning multiple categories such as nature, urban scenes, people, food, artwork, and CGI, alongside distinct validation and test sets containing 2,000 and 1,984 images respectively. Images were sourced from established open datasets including Photo Concept Bucket, Laion2B, and PD12M. 4KLSDB underwent rigorous multi-stage automated filtering and annotation pipelines involving both human annotators and Large Multimodal Models (LMMs) to ensure high aesthetic quality and dataset consistency. We demonstrate 4KLSDB’s effectiveness by training representative super-resolution and diffusion models, observing significant improvements in performance on native 4K benchmarks. Comprehensive experiments illustrate a positive correlation between training on true 4K resolution data and improved fidelity in image restoration tasks, especially at 4K resolution. By releasing 4KLSDB, we provide the research community with a valuable resource to drive progress toward genuinely high-fidelity image synthesis and restoration. Our project page is available at: this https URL.

[CV-176] Drift-Resistant Navigation World Model with Anchored Epipolar Guidance

Link: https://arxiv.org/abs/2605.24761
Authors: Po-Chien Luan,Zimin Xia,Wuyang Li,Yang Gao,Alexandre Alahi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:We propose Drift-Resistant Navigation World Model, a generative model that mitigates both perceptual drift and geometric drift in conventional rollout-based navigation world models. Existing methods recursively feed generated content into subsequent steps, causing noise accumulation and degraded predictions, i.e., perceptual drift. Meanwhile, their predictions often deviate from the agent’s motion, resulting in geometric drift. We address both types of drift by redesigning world-model prediction as an anchor-guided rollout. Instead of rolling out every frame sequentially, we first predict sparse future anchors that serve as stable long-range targets, and then generate intermediate frames within each chunk conditioned on both past context and future anchors. Importantly, these sparse anchors also provide geometric constraints, supported by bidirectional epipolar geometry, to localize where corresponding content should appear in the intermediate frames. Experiments on four benchmarks demonstrate consistent improvements over strong baselines in long-horizon visual quality, geometric consistency, and multi-view coherence. These gains further translate into improved downstream planning performance under the same planners, highlighting the importance of drift-resistant, geometry-aware prediction for reliable navigation world models.

[CV-177] Motion-Compensated Weight Compression

Link: https://arxiv.org/abs/2605.24754
Authors: Ismail Lamaakal
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 54 pages, 17 tables, 6 Figures

Abstract:Neural network weights are increasingly a bottleneck for deployment, yet most compression pipelines treat layers independently and overlook cross-layer redundancy induced by function-preserving symmetries. We propose Motion-Compensated Weight Compression (MCWC), a weight-only codec that aligns permutation-symmetric blocks (e.g., hidden units and attention heads) to maximize cross-layer correspondence, turning depth into a predictable sequence. In the aligned coordinate system, MCWC uses a lightweight layer-sequential predictor with periodic keyframes and encodes only quantized prediction residuals using a learned entropy model trained under a rate-distortion objective. A simple decoder reconstructs deployable weights by entropy decoding, dequantization, predictor-driven reconstruction, and inverse alignment, enabling fast weight materialization for inference. Across Transformer language modeling and vision classification, MCWC improves the rate-accuracy Pareto frontier over strong quantization and learned weight-codec baselines, while maintaining competitive decode time. Ablations confirm that alignment, prediction, entropy modeling, and keyframe scheduling are each necessary for the full gains. Our code is available via this https URL.
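
The keyframe-plus-residual idea in this abstract resembles predictive coding in video codecs. The sketch below shows only that part, under loud assumptions: layers are taken to be already aligned, the predictor is simply "repeat the previously reconstructed layer", and uniform rounding replaces the learned entropy model. The names `encode_layers` and `decode_layers` are hypothetical and are not MCWC's API.

```python
def encode_layers(layers, step=0.01, keyframe_every=4):
    """Encode a sequence of equal-length weight vectors as quantized residuals.

    Keyframes are coded against zero; other layers are coded against the
    previously reconstructed layer, so quantization error does not accumulate.
    """
    if keyframe_every < 1:
        raise ValueError("keyframe_every must be at least 1")
    codes, previous = [], None
    for index, layer in enumerate(layers):
        is_key = index % keyframe_every == 0
        prediction = [0.0] * len(layer) if is_key else previous
        quantized = [round((w - p) / step) for w, p in zip(layer, prediction)]
        # Track what the decoder will reconstruct, not the original weights.
        previous = [p + q * step for p, q in zip(prediction, quantized)]
        codes.append((is_key, quantized))
    return codes


def decode_layers(codes, step=0.01):
    """Invert encode_layers: dequantize residuals and add them to the prediction."""
    layers, previous = [], None
    for is_key, quantized in codes:
        prediction = [0.0] * len(quantized) if is_key else previous
        previous = [p + q * step for p, q in zip(prediction, quantized)]
        layers.append(previous)
    return layers


if __name__ == "__main__":
    toy = [[0.5 + 0.001 * depth + 0.01 * j for j in range(8)] for depth in range(6)]
    codes = encode_layers(toy)
    for is_key, quantized in codes:
        print("key " if is_key else "pred", quantized)
```

When consecutive layers are similar, the non-keyframe residuals are small integers, which is the kind of signal an entropy coder can compress well. Periodic keyframes bound how far a decoding error could propagate.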

[CV-178] Ghosts in the Point Clouds: De-glaring LiDAR in the Transient Domain CVPR2026

Link: https://arxiv.org/abs/2605.24753
Authors: Avery Gump,Connor Henley,Sungjin Cheong,Akarsh Prabhakara,Mohit Gupta
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026

Abstract:Modern LiDARs are rapidly transitioning from bulky, mechanically scanned systems to ultra-compact, low-cost, solid-state arrays. This miniaturization, while enabling scalability, affordability, and camera-like data structures, introduces a new and severe failure mode: internal-multipath glare. When light from a bright or retroreflective surface reflects and scatters within the LiDAR, light that should reach a single pixel spreads across the pixel array. The resulting artifacts create phantom objects, obscure real ones, and produce safety-critical “ghosts in the point clouds.” This paper introduces a physically grounded sensing model and algorithmic techniques for addressing this effect. We show that internal glare can be represented as a linear, scene-independent operator, the Transient Glare Spread Function (TGSF), acting on the transient measurements. Building on this model, we develop a training-free approach that operates on low-level LiDAR detections (or echoes) prior to point-cloud formation, leveraging knowledge of the glare spread function to reason about the likelihood of each detection arising from glare. The resulting approach is compatible with existing LiDAR signal-processing pipelines, and deployable on unmodified commercial sensors. Using experiments with real single-photon LiDAR hardware, we demonstrate substantial suppression of severe glare artifacts while preserving true scene structure.

[CV-179] From Full Boards to Tiny Defects: Scale-Aware Tile Inference with Topology-Aware Merging for High-Resolution PCB Defect Detection

Link: https://arxiv.org/abs/2605.24726
Authors: Mohammad Alijanpour Shalmani,Alale Rezvani Boroujeni,Ali Amini,Jiann Shiun Yuan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:High-resolution printed circuit board (PCB) inspection suffers from resolution collapse when full-board images are resized to standard detector inputs: micro-scale defects shrink to a few pixels and are missed. Tile-based inference preserves local detail but introduces boundary artefacts at tile edges, causing split detections and false negatives. We present a systematic comparison of five inference strategies evaluated on two high-resolution PCB defect datasets, PCB-Defect (230 images, 1,704 annotations) and HRIPCB (693 images, 2,953 annotations), spanning six defect classes. We show that training-inference scale consistency is critical: a detector trained on full images collapses to mAP@50 = 0.01 under tile inference, while the same architecture trained on 640×640 tile crops achieves 0.72 and 0.94 on the two datasets respectively. We further exploit Topology-Aware Tile Merging (TA-TM), a training-free post-processing method that builds a tile-adjacency graph and adjusts boundary-sensitive detection scores using neighbour-tile agreement before global NMS. Across both datasets, adding 128 px tile overlap raises boundary-zone recall from ~26-63% to ~70-100%, TA-TM achieves the best mAP@50 on both benchmarks, and tile inference recovers 46-100% of small defects missed entirely by full-image methods. Results are consistent across datasets, confirming the generalizability of the proposed strategy. TA-TM requires no retraining and is architecture-agnostic, making it directly applicable to existing PCB inspection pipelines.
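As a rough illustration of the overlapping-tile setup mentioned above (640 px tiles with 128 px overlap), the sketch below computes tile offsets along one image axis and maps a tile-local box back to full-image coordinates. It covers only the tiling geometry; the detector, global NMS and the TA-TM score adjustment are not shown, and the helper names `tile_origins` and `to_global` are hypothetical.

```python
def tile_origins(length, tile=640, overlap=128):
    """Start offsets of tiles covering [0, length) along one axis."""
    if length <= tile:
        return [0]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last tile sits flush with the border
    return starts


def to_global(box, x0, y0):
    """Shift a tile-local (x1, y1, x2, y2) box by the tile origin."""
    x1, y1, x2, y2 = box
    return (x1 + x0, y1 + y0, x2 + x0, y2 + y0)


# Example: all tile origins for a 1152 x 1000 image.
origins = [(x0, y0) for y0 in tile_origins(1000) for x0 in tile_origins(1152)]
```

Placing the final tile flush with the border means the last pair of tiles may overlap by more than 128 px, which avoids padding at the cost of some redundant computation.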

[CV-180] Calibrating Probabilistic Object Detectors with Annotator Disagreement

Link: https://arxiv.org/abs/2605.24722
Authors: Zhi Qin Tan,Owen Addison,Yunpeng Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:High degrees of disagreement among annotators can exist for ambiguous objects, e.g. in medical images, underscoring the challenges of establishing ground truth annotations in object detection tasks. Despite this, all existing object detectors implicitly require access to ground truth annotations for either training or evaluation. The fundamental questions we target are: How can we learn an object detector with multiple annotators’ annotations but without objective ground truth annotations due to object ambiguity, and how can we enable the learned detector to express meaningful model predictive uncertainties in detecting ambiguous objects? To answer these questions, we present an interpretable approach to calibrate probabilistic object detectors, where the calibration goal is to align the class confidence and bounding box variance estimates to the annotators’ annotation distribution. We introduce an efficient yet effective framework to calibrate probabilistic object detectors by designing four evaluation metrics to measure calibration errors regarding classification and localization, and proposing a train-time calibration method and a post-hoc calibrator, all without the need to access any ground truth. This framework is generalizable to many existing probabilistic object detectors, such as the YOLO family and two-stage detectors. Empirical results with real-world and synthetic datasets of medical and natural images demonstrate the superior performance of the proposed framework with three popular object detectors.

[CV-181] Physics-Guided Self-Supervised Statistical Residual Learning for Sonar Despeckling with Improved Generalization

Link: https://arxiv.org/abs/2605.24716
Authors: Swapna Pillai,Siddharth Singh Savner,Sujit Kumar Sahoo
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments:

Abstract:This letter introduces a physics-informed self-supervised framework for sonar image despeckling that reformulates despeckling as residual consistency in the homomorphic log domain. By constraining the log-ratio residual to obey multiplicative speckle statistics, the proposed method eliminates the need for clean supervision while preventing degenerate identity solutions. A variance-targeted statistical loss combined with edge-aware structural regularization and median-guided curriculum stabilization enables effective speckle suppression with preserved structural fidelity. This formulation along with a lightweight neural network achieves state-of-the-art performance across multiple real sonar datasets and demonstrates excellent cross-dataset robustness, while remaining suitable for real-time deployment.

[CV-182] Do Image-Text Metrics Respect Semantic Invariances?

Link: https://arxiv.org/abs/2605.24702
Authors: Amit Agarwal,Hitesh Laxmichand Patel,Meizhu Liu,Jyotika Singh,Karan Dua,Hansa Meghwani,Matthew Rowe,Michael Avendi,Yassi Abbasi,Tao Sheng,Sujith Ravi,Dan Roth
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reference-free image-to-text evaluators are now standard for scoring image-caption alignment, yet it is unclear whether they respect semantic invariances. We present an invariance probe on five popular evaluators (CLIPScore, PAC-S, UMIC, FLEUR, and a deterministic LLM judge) under semantics-preserving perturbations along three axes: spatial (flips, context-preserving repositioning, light rotations), object (scale, category), and socio-linguistic framing (cultural/economic adjectives with neutral and length-matched controls). Across curated slices of three detection datasets and three caption evaluation suites, we find consistent non-semantic sensitivities, where benign spatial edits and simple phrasing changes shift scores by approximately 6–9% on average, and for systems separated by just 0.7%, these shifts can cause ranking flips in up to about 37% of cases, particularly under spatial changes. A small human study also supports this finding and confirms that annotators generally judge perturbed pairs as equally correct, so these shifts reflect metric behavior rather than semantic change. We further propose invariance-calibrated scoring, a post-hoc adjustment that roughly halves median absolute sensitivity while retaining correlation with learned caption evaluators.

[CV-183] SRUG: Shadow-Guided Relightable Urban Scene with Generation Model

Link: https://arxiv.org/abs/2605.24700
Authors: Yonghao Zhao,Zexin Yin,Jian Yang,Beibei Wang,Jin Xie
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:Creating relightable urban scenes from images or videos is widely useful but highly ill-posed. Urban environments are typically unbounded and extend beyond the visible regions. As a result, many portions of the scene remain unobserved, yet these invisible regions can cast shadows onto visible areas. Reasonably modeling shadows cast by such invisible regions is challenging and poses a significant obstacle to creating relightable urban scenes. At the same time, sparse input views and complex illumination conditions further complicate relighting, as they introduce severe ambiguities in material decomposition. In this paper, we propose Shadow-guided Relightable Urban Scene with Generation model (SRUG), a novel framework designed to address relighting challenges in urban scenes. SRUG leverages shadows to guide a 3D completion model for recovering the geometry of invisible regions, promoting the synthesis of physically reasonable shadows. In addition, SRUG employs an iterative material decomposition scheme that applies the large material model (LMM) to provide material supervision and iteratively decompose the scene’s material properties, enabling robust material decomposition. Building upon these components, we introduce a physically-based lighting model that captures the complex illumination of urban scenes and supports reliable relighting. Extensive quantitative evaluations and visual comparisons demonstrate that our method outperforms existing approaches in both novel view synthesis and relighting tasks.

[CV-184] AdaFuse-Det: Adaptive Cross-Modal Fusion of Event Cameras for Robust Object Detection in Low-Light RGB Imagery

Link: https://arxiv.org/abs/2605.24691
Authors: Raju Imandi,Chethana B,Bharatesh Chakravarthi,Yong-Guk Kim,Manipriya S,Pavan Kumar B N
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Detecting objects reliably under extreme low-light conditions is an open problem in computer vision, with practical urgency in applications ranging from nighttime surveillance to search-and-rescue robotics. Conventional RGB cameras degrade sharply at low photon flux, while event cameras, which record asynchronous per-pixel brightness changes at microsecond resolution and high dynamic range, provide complementary structural cues that are largely illumination-invariant. We present AdaFuse-Det, a dual-stream framework that fuses CLAHE-enhanced RGB frames with voxelized event tensors through an Adaptive Cross-Modal Fusion (ACMF) module grounded in minimum-variance linear estimation theory. We formally show that the learned attention map asymptotically recovers the Gauss-Markov optimal fusion weights, and establish event conservation and temporal resolution bounds for the voxelization stage. On the LLE-VOS benchmark, AdaFuse-Det achieves a Recall of 65.54%, Precision of 53.85%, and F1-Score of 59.12% under severe illumination degradation, outperforming single-modality detectors in recall by a margin that reflects the theoretically predicted illumination-adaptation behavior.
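For readers unfamiliar with minimum-variance linear estimation, the textbook scalar case is sketched below: two independent, unbiased estimates are combined with weights inversely proportional to their variances. This is only the classical result the abstract refers to, not the ACMF module itself, and the function name `fuse` and its arguments are hypothetical.

```python
def fuse(x_rgb, var_rgb, x_evt, var_evt):
    """Minimum-variance linear fusion of two independent unbiased estimates.

    The weight on each estimate is proportional to the other's variance,
    so the noisier modality contributes less to the fused value.
    """
    w_rgb = var_evt / (var_rgb + var_evt)
    fused = w_rgb * x_rgb + (1.0 - w_rgb) * x_evt
    fused_var = var_rgb * var_evt / (var_rgb + var_evt)
    return fused, fused_var, w_rgb


# Low light: the RGB estimate is noisy (variance 4.0), the event one is not.
fused, fused_var, w_rgb = fuse(1.0, 4.0, 3.0, 1.0)
```

In this example the RGB stream receives a weight of 0.2, and the fused variance (0.8) is lower than either input variance, which is the property that makes inverse-variance weighting attractive when one modality degrades.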

[CV-185] HoloFair: Unified T2I Fairness Evaluation and Fair-GRPO Debiasing ICML2026

Link: https://arxiv.org/abs/2605.24687
Authors: Ruyi Chen,Lu Zhou,Xiaogang Xu,Chiyu Zhang,Jiafei Wu,Liming Fang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ICML 2026. Code and dataset are available at this https URL

Abstract:Text-to-Image (T2I) models have made significant strides in visual realism and semantic consistency, yet they often perpetuate and amplify societal biases. Existing evaluation methods typically address only single-dimensional biases and lack the means to uncover model biases at deeper, socially relevant semantic levels. We introduce HoloFair, a comprehensive benchmark framework for multidimensional demographic bias analysis. Built upon our large-scale fairness-oriented dataset and the SpaFreq (Spatial-Frequency) attribute classifier, this framework proposes the Multi-attribute, Group-wise Bias Index (MGBI) metric, designed to assess both intrinsic diversity and conditional biases. Beyond evaluation, we further introduce Fair-GRPO, a reinforcement-learning-based debiasing method that alters the distribution of generative models through a designed multi-objective reward function. For example, experiments on the SD3.5-Medium model demonstrate that Fair-GRPO significantly improves multidimensional fairness while maintaining high image quality. We also analyze potential reward hacking phenomena and provide corresponding mitigation strategies. Code and dataset are available at this https URL

[CV-186] MindAdapter: Few-Shot Parameter-Efficient Residual Calibration of Cross-Subject Brain-to-Visual Decoding Models KDD2026

Link: https://arxiv.org/abs/2605.24679
Authors: Jiaxiang Liu,Jiawei Du,Xupeng Chen,Guoqi Li,Jiang Cai,Simon Fong,Mingkun Xu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to KDD 2026 (AI4Sciences Track). 15 pages, 7 figures

Abstract:Cross-subject brain-to-visual decoding remains a core challenge in brain-computer interfaces due to severe inter-individual variability that induces systematic subject-specific functional misalignment. To address this issue, we propose MindAdapter, a parameter-efficient few-shot calibration framework for pretrained brain-to-visual decoding models. MindAdapter adopts a decoupled linear-residual cascade alignment paradigm by freezing a pretrained explicit brain functional alignment backbone (coarse) and introducing a lightweight nonlinear residual adapter (fine), thereby disentangling global cross-subject correspondence from subject-specific residual corrections for fine-grained spatial and semantic calibration. To further preserve global representational stability, we design a topology-anchored dual-stream manifold constraint, where a small set of shared stimuli serves as topological pins with voxel-level paired supervision, while a semantic stream enforces consistency through a frozen vision-language decoder on unpaired brain data. Together, MindAdapter efficiently injects subject-specific corrections while maintaining the global representational geometry learned during pretraining. Experiments on the Natural Scenes Dataset (NSD) demonstrate that MindAdapter substantially improves cross-subject visual reconstruction and retrieval accuracy using only a few shared stimuli, offering a practical and data-efficient solution for personalized brain-to-visual decoding.

[CV-187] VaaWIT: Visual-Aware Adaptation of Large Language Models for Multilingual Web Image Translation KDD2026

Link: https://arxiv.org/abs/2605.24675
Authors: Bo Li,Ronghao Chen,Ningyuan Deng,Huacan Wang,Shaolin Zhu,Lijie Wen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by KDD 2026

Abstract:Translating text embedded in Web images is crucial for improving content accessibility and cross-lingual information retrieval, particularly within social media and e-commerce domains. Although Large Vision-Language Models (LVLMs) have advanced multimodal understanding, applying them to Web image translation remains challenging due to the visual representation gap: standard encoders often prioritize high-level semantics over the fine-grained visual details required for recognizing diverse character morphologies. To address this challenge, we propose VaaWIT, an end-to-end framework that adapts Large Language Models for multilingual Web image translation. The framework introduces two key technical contributions: (1) a Dual-Stream Attention Module (DSAM), which facilitates bidirectional interaction between multilingual semantic features and detailed visual representations, thereby synthesizing unified features robust to textual variations; and (2) a Visual-Aware Adapter (VAA), a parameter-efficient fine-tuning strategy that dynamically injects these fused visual cues into the frozen LLM backbone. This design enables the model to align the visual context with linguistic reasoning effectively while minimizing computational costs. Extensive experiments on eight tasks on three public benchmarks demonstrate that VaaWIT significantly outperforms state-of-the-art (SOTA) open-source baselines and achieves competitive performance against proprietary models. These results validate the efficacy of integrating fine-grained visual perception into LLMs for complex Web content analysis.

[CV-188] Reasoning to Align: Implicit Reasoning in Diffusion Transformers for Video Editing

Link: https://arxiv.org/abs/2605.24674
Authors: Yan Li,Lin Liu,Xiaopeng Zhang,Qi Tian
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Instruction-based video editing requires transforming a source video according to a natural-language instruction while preserving irrelevant content and remaining temporally coherent. We argue that existing Diffusion Transformer (DiT) editors struggle with this task for two structural reasons. First, conditioning signals are fed undifferentiated into all transformer blocks, forcing a single token stream to encode both global editing intent and fine-grained visual evidence. Second, the cross-attention patterns that govern the edit are supervised only indirectly through pixel-level reconstruction, leaving the model’s internal reasoning process under-constrained. To address both limitations, we propose RVEDiT, an implicit Reasoning Video Editing DiT framework built around two complementary components. The first, Granularity-Routed Token Conditioning, introduces learnable editing tokens distilled from a multimodal LLM and routes them to shallow blocks, while reserving native visual and textual tokens for deeper blocks, thereby inducing a coarse-to-fine editing process inside the backbone. The second, Reference-Anchored Attention Alignment, employs a parameter-sharing reference branch during training and maximizes the mutual information between the attention features of the editing and reference branches, regularizing the model’s internal reasoning without incurring any additional inference cost. Experiments on standard instruction-based video editing benchmarks show that RVEDiT consistently outperforms state-of-the-art baselines, with particularly strong gains on localized and compositional edits.

[CV-189] AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models

Link: https://arxiv.org/abs/2605.24652
Authors: Jialiang Yang,Bin Xia,Ruihang Chu,Dingdong Wang,Wanke Xia,Zhun Mou,Tianyang Zhong,Yiting Zhao,Wenming Yang
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
Comments:

Abstract:Rapid advances in audio-video (AV) generation have enabled high-fidelity synthesis with synchronized sound, particularly for human-related scenarios involving speech and interactions. Yet evaluation for AV generation remains at an early stage: only a few coarse-grained benchmarks cover human-related scenarios, and they rely on limited preset evaluations with generic multimodal LLMs, leading to inaccurate assessments of model capabilities. To address these issues, we introduce AVBench, a fully automated benchmark tailored for human-centric AV generation. AVBench is built on two key designs for comprehensive and accurate evaluation: (i) Human-centric and fine-grained metrics. AVBench integrates ten evaluation dimensions designed for human-centered real-world scenarios, covering visual quality, audio quality, and multi-level consistency across modalities. These practical metrics capture human-related details that existing benchmarks often overlook. (ii) Specialized evaluators via preference learning. To address the lack of specialized training data, we construct large-scale supervision by transforming real-world videos into diverse training pairs with controlled perturbations. After fine-tuning on this high-quality dataset, the evaluators learn to reliably detect subtle cross-modal inconsistencies. Crucially, instead of producing discrete textual judgment, AVBench derives continuous evaluation scores from the model’s prediction confidence on binary decisions. This probabilistic scoring mechanism enables a more reliable assessment than traditional VQA-style evaluation and aligns closely with human judgment. Taken together, AVBench offers automated evaluation for AV generation, demonstrates strong potential for data filtering, and serves as a differentiable reward signal for Reinforcement Learning from Human Feedback (RLHF).
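One plausible reading of "continuous scores from prediction confidence on binary decisions" is a two-way softmax over the evaluator's logits for the two answers. The abstract does not specify the exact computation, so the sketch below is an assumption, and the name `confidence_score` and its inputs are hypothetical.

```python
import math


def confidence_score(logit_yes, logit_no):
    """Probability of the positive answer under a two-way softmax.

    Subtracting the maximum logit first keeps exp() numerically stable.
    """
    m = max(logit_yes, logit_no)
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


# A hard "yes"/"no" answer would hide the difference between these cases.
barely_yes = confidence_score(0.1, 0.0)
clearly_yes = confidence_score(4.0, 0.0)
```

Both examples would be reported as "yes" by a discrete judge, while the continuous score separates a marginal decision from a confident one.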

[CV-190] Understanding the Impact of Geometric Foundation Models on Vision-Language-Action Models

Link: https://arxiv.org/abs/2605.24642
Authors: Yurou Yang,Muyuan Lin,Roberto Martin-Martin,Martin Labrie,Shreekant Gayaka,Cheng-Hao Kuo,Luca Carlone
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Recent work explores new opportunities at the intersection of vision-language-action models (VLAs) and geometric foundation models (GFMs) for 3D reconstruction, such as VGGT. While the resulting geometric VLAs often show improved performance, it remains unclear (i) if modern VLAs already have sufficient geometric understanding to start with, (ii) what is the best architecture to inject geometric understanding into a VLA, and (iii) what is the effect of other design choices that affect geometric VLAs. In this paper we provide a rigorous experimental analysis to shed light on these questions, for a specific choice of VLA (GR00T-N1.5) and GFM (VGGT). Our first contribution is to formalize prior work’s intuition that current VLAs lack geometric understanding, by providing a rigorous analysis based on linear probing. The analysis quantifies, for the first time, the “geometric gap” between VLAs and GFMs. Our second contribution is to identify and compare different strategies to bridge GFMs with VLAs. We implement three different architectures, which differ in the way they inject geometry in the VLA, while keeping low-level implementation details as similar as possible, to ensure a fair comparison. Finally, we analyze the impact of non-architectural choices (e.g., training data, number of cameras, reconstruction quality) on the performance of the geometric VLAs.

[CV-191] DisDop: Distillation with Domain Priors for Open-Vocabulary Aerial Object Detection

Link: https://arxiv.org/abs/2605.24639
Authors: Ruihao Xu,Yong Liu,Yansong Tang,Sule Bai,Xubing Ye,Bingyao Yu,Yutao Guo,Jiwen Lu,Jie Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:With the widespread application of drones in recent years, object detection in aerial images has attracted increasing attention, especially open-vocabulary aerial detection, which is not restricted to predefined categories. Due to the scarcity of drone-viewpoint images and their significant differences from natural images, it is difficult to achieve satisfying results by directly applying vanilla open-vocabulary detection methods designed for natural scenarios. Some studies propose to transfer knowledge from pre-trained models by using lightweight networks or generating pseudo labels, but they tend to rely on models trained on natural images, neglecting the potential of foundation models specifically tailored for remote sensing and aerial imagery. To address this limitation, we propose DisDop, a unified framework that systematically distills multi-level domain priors from remote sensing foundation models (e.g., RemoteCLIP and DINOv3) into a lightweight detector. Specifically, we first distill visual priors through a teacher fusion strategy that combines RemoteCLIP’s cross-modal alignment capability with DINOv3’s fine-grained local feature extraction ability, transferring their complementary strengths to the detector’s backbone. Second, we distill textual priors embedded in RemoteCLIP’s text encoder by explicitly modeling inter-category semantic relationships, while incorporating global contextual priors to enhance local feature representation for small objects. Through this multi-level prior distillation framework, our DisDop achieves new state-of-the-art performance on open-vocabulary aerial detection benchmarks. Extensive ablation analysis also demonstrates the rationality and effectiveness of our proposed modules.

[CV-192] Resolving Ambiguity in Composed Image Retrieval via Calibrated Interaction

Link: https://arxiv.org/abs/2605.24634
Authors: Amsisan Tran,Baogh Le,Tuan Kiet Pham,Sui Yang Guang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Composed image retrieval (CIR) searches a corpus with a reference image and a text describing how to modify it. Despite rapid progress from triplet-trained compositors to zero-shot and generative methods, essentially all systems share one assumption: that a query maps to a single target, scored by Recall@K against one annotation. We argue this is fundamentally at odds with the task. A query such as “make it more formal” does not name an image but a region of the corpus, and which member the user intends is genuinely underdetermined. This underspecification is the root of the well-known false-negative problem and leaves current models unable to tell a precise query from an ambiguous one. We reframe CIR as calibrated intent resolution under uncertainty: a retriever is wrapped in a conformal prediction layer that returns a candidate set with a coverage guarantee and whose size is a principled measure of ambiguity; when the set is large, an expected-information-gain policy asks the single most useful clarifying question, drawn from interpretable ambiguity axes, and the set contracts. We introduce AmbiCIR, a benchmark and human-validated user simulator that revive the dormant auxiliary and dialogue annotations of CIRR and extend the multiple-positive setting of CIRCO. Across open-domain and fashion benchmarks our method matches single-turn state of the art, confirming calibrated resolution is cost-free on precise queries, while reaching the intended target in a fraction of the interaction budget required by naive conversational baselines, and it is the first to report valid coverage and calibration for the task.
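The "conformal prediction layer" mentioned above can be illustrated with standard split conformal prediction, sketched here under stated assumptions: the nonconformity score is taken to be one minus the query-target similarity, calibration scores come from held-out queries with known targets, and the names `conformal_threshold` and `candidate_set` are hypothetical. The paper's actual scoring function and clarifying-question policy are not reproduced.

```python
import math


def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal quantile of calibration nonconformity scores.

    With n calibration scores, the k-th smallest score with
    k = ceil((n + 1) * (1 - alpha)) gives coverage of at least 1 - alpha
    for exchangeable data.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: keep everything
    return sorted(cal_scores)[k - 1]


def candidate_set(similarities, threshold):
    """Indices of corpus images whose nonconformity is within the threshold."""
    return [i for i, s in enumerate(similarities) if 1.0 - s <= threshold]


cal = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
tau = conformal_threshold(cal, alpha=0.2)
kept = candidate_set([0.9, 0.5, 0.2, 0.35], tau)
```

The size of `kept` is what the abstract treats as an ambiguity signal: a precise query leaves few candidates under the threshold, while an underspecified one leaves many.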

[CV-193] Beyond Generative Priors: Minority Sampling with JEPA-Guided Diffusion ICML2026

Link: https://arxiv.org/abs/2605.24631
Authors: Sol Park,Soobin Um
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2026, 21 pages, 9 figures

Abstract:Minority sampling aims to generate low-density instances on a data manifold and is of central importance in applications such as medical diagnosis, anomaly detection, and creative AI. Existing approaches, however, define minority samples relative to generative priors learned from training data, confining rarity to model-specific notions that may poorly reflect real-world semantics. In this work, we propose a world-centric perspective on minority sampling, which defines rarity with respect to real-world priors rather than generator-induced densities. To this end, we introduce JEPA guidance, a diffusion sampling framework guided by a Joint-Embedding Predictive Architecture (JEPA), a class of world models that encode broad, semantically rich representations. JEPA guidance steers diffusion trajectories toward low-density regions under the implicit density induced by the JEPA, thereby aligning generated minorities with real-world semantic rarity. To make JEPA guidance computationally practical, we develop principled approximation strategies accompanied by theoretical error bounds, significantly reducing the overhead of guidance computation. Extensive experiments across unconditional, class-conditional, and text-to-image generation demonstrate that JEPA guidance consistently improves the fidelity and semantic validity of minority samples, outperforming generator-centric baselines in capturing real-world notions of rarity. Code is available at this https URL.

[CV-194] DexSIM: Real-time Dexterous Simulation with Unified Causal Video Diffusion ICLR2026

Link: https://arxiv.org/abs/2605.24630
Authors: Adam Lee
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: World Model @ ICLR 2026

Abstract:Recent progress in video diffusion models has enabled extensive simulation of the physical world. However, simulation of hand-object interaction has been less explored. We propose DexSIM, a framework for simulating dexterous manipulation in real time. Previous works utilizing video diffusion and 3D reconstruction focus on navigation, and dexterous manipulation has received limited attention even though it has extensive applications for creating interactive experiences with the simulated world and for generating synthetic data for robotics. Existing methods lack real-time interactivity and long-term spatial consistency and memory. We propose a two-stage training framework for DexSIM. First, we train a bi-directional video diffusion model by jointly embedding the hand action trajectory and video in a unified feature space. We utilize Gaussian heatmap hand encoding for more accurate hand representation. Then we conduct roll-out-based autoregressive training with an updated spatial cache as an attention sink for spatial memory, which improves long-term consistency and 3D-aware dexterous manipulation simulation. DexSIM outperforms the baseline on pixel and semantic similarity, motion fidelity, and hand projection accuracy. It also allows new applications such as hand motion transfer and runs at 15.24 FPS, enabling real-time interactivity.
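A Gaussian heatmap encoding of a keypoint, such as a projected hand joint, is commonly built as below. The abstract gives no details of the encoding actually used, so the grid size, the `sigma` value and the function name `gaussian_heatmap` are assumptions for illustration only.

```python
import math


def gaussian_heatmap(height, width, cx, cy, sigma=2.0):
    """2D heatmap with an unnormalized Gaussian bump centred at (cx, cy).

    Rows index y and columns index x; the peak value is 1.0 when the
    centre falls exactly on a pixel.
    """
    return [
        [
            math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


# One heatmap per joint; a hand would use a stack of such maps.
hm = gaussian_heatmap(5, 5, cx=2, cy=2)
```

Compared with a one-hot pixel, the soft bump gives a network a smooth signal around the joint location, which is the usual motivation for this representation.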

[CV-195] ULF-Synth: Physics-Guided Ultra-Low-Field MRI Enhancement for Pediatric Neuroimaging

Link: https://arxiv.org/abs/2605.24625
Authors: Toufiq Musah,Salvatore Calcagno,Federica Proietto Salanitri,Xiaomeng Li,Maruf Adewole,Marawan Elbatel
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 2 figures, 3 tables

Abstract:Ultra-low-field (ULF) MRI offers portable and accessible neuroimaging but suffers from reduced signal-to-noise ratio and limited spatial resolution compared to high-field (HF) systems. Acquiring paired ULF-HF data for supervised enhancement is often difficult, particularly in resource-limited settings. We introduce ULF-Synth, a framework that combines (i) acquisition-based synthesis of realistic ULF images from HF volumes to create large-scale paired training data and (ii) a spatial-frequency domain objective that prioritizes recovery of high-frequency anatomical detail. This formulation is architecture-agnostic, consistently improving structural similarity and perceptual fidelity across encoder-decoder, adversarial, and diffusion-based translation models. When trained exclusively on synthetic data, the resulting models generalize effectively to real 64 mT ULF acquisitions, improving downstream multiclass brain segmentation and achieving higher radiologist preference and diagnostic acceptability in a blinded reader study. These findings demonstrate that synthetic paired supervision provides a practical and scalable pathway for enhancing ULF MRI without requiring real paired acquisitions. Code, Models and Dataset: this https URL

[CV-196] Vision-Language Binding in In-Context Image Generation

Link: https://arxiv.org/abs/2605.24624
Authors: Chris Ge,Rohit Gandikota,Antonio Torralba,Tamar Rott Shaham
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 35 pages, 19 figures

Abstract:In-context image generation models such as FLUX.2 take a text prompt and an optional reference image as visual conditioning for the output. Internally, all three inputs (text, reference image, and the noise tokens) are concatenated and processed through a single attention stream, where all tokens can attend to one another. This leaves open how reference information flows through the model to produce the output image. We show that an implicit cross-modal binding emerges between the text tokens and the reference image: the text tokens absorb visual reference content during the forward pass, and that absorbed content causally influences the generated output. We surface this binding with three causal interventions on FLUX.2: T2I Lens, which decodes intermediate text-token activations through a text-to-image path; Attention Knockout, which severs specific attention edges; and I2I-to-I2I Patching, which copies text token activations between editing runs. Across 2,875 editing tasks on various images, including SUN397 and DreamBench++ datasets and images collected online, we observe a consistent division of labor: properties of the reference image, like color, style, and scene setting, are first written into the text tokens, which carry them to the generated image; pixel-exact properties like a specific face or instance identity bypass the text tokens and flow directly from reference to image through image-to-image attention. We further localize the reference-text binding to the padding tokens of the text sequence. These results show that text tokens in a multimodal DiT are not just prompt holders, but a structured channel for reference image content. More broadly, they suggest that even in unified-attention multimodal generative models, token modality structures how conditioning information is represented and routed across the network.

[CV-197] PoseRefer: Pathway-Local Parameters for Semantically Grounded Reference Resolution ICRA2026

Link: https://arxiv.org/abs/2605.24622
Authors: Anna Deichler
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICRA 2026 Workshop on Semantics for Reliable Robot Autonomy: From Environment Understanding and Reasoning to Safe Interaction

Abstract:A robot resolving "put the cup on that one" must fuse gesture, language, and scene geometry, yet 3D grounding benchmarks only partially capture this regime: descriptions are written post-hoc, gestures are templated, or pointing is staged for the camera. MM-Conv captures natural co-speech gesture from dyadic VR interaction alongside full-body motion capture and 3D scene graphs. We use it to evaluate pose-language fusion with a decoupled late-fusion architecture in which pose and text pathways share no learned parameters. The two choices together make category, pose, and text contributions easier to isolate through controlled ablations. Fusion with frozen MiniLM category embeddings exceeds pose alone and the best text-only pathway on every reference type, reaching 31.9% top-1. The learned scalar gate flips between opposing policies depending on whether the text pathway has category access. This is a reliability diagnostic: fusion-accuracy claims for semantic grounding systems are indistinguishable from category-representation artifacts unless pathways are architecturally decoupled.

[CV-198] Phase-Aware Wavelet-Based-Scattering Encoder-Decoder for Dense Predictions

Link: https://arxiv.org/abs/2605.24621
Authors: Ghassen Marrakchi, Basarab Matei
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 21 pages, 16 figures, 10 tables

Abstract:Scattering transforms achieve Lipschitz stability and translation invariance, but dense prediction tasks require preserving spatial structure lost in global averaging. We propose Phase-Aware Scattering Encoder-Decoder, which restores this information by explicitly preserving phase in skip connections. On image denoising (BSD68), breaking translation invariance improves PSNR by +2.17 dB; phase preservation adds +1.03 dB. A novel spatial shuffling ablation (-1.26 dB penalty) demonstrates phase encodes location-dependent structure. We conduct a preliminary extensibility study on a second dense prediction task (ISIC skin lesion segmentation), with full cross-validation as ongoing work. This work advances principled wavelet-deep learning integration, showing how phase information complements scattering’s stability-expressiveness trade-off in pixel-level prediction.

[CV-199] Lattice theory and algebraic models for deep convolutional learning based on mathematical morphology

Link: https://arxiv.org/abs/2605.24608
Authors: Gustavo (Jesus) Angulo
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:We develop a rigorous algebraic framework for deep convolutional architectures, CNNs, ResNets, and encoder–decoder networks such as UNet, grounded in lattice theory and mathematical morphology. The central tool is the Matheron–Maragos–Banon–Barrera (MMBB) universal representation theory for translation-invariant operators, which we apply systematically to every layer of a standard deep network. The principal finding is that the standard CNN pipeline (linear convolution + ReLU + flat max-pooling) is a cross-lattice operator: the convolution is an erosion in the Fourier inf-semilattice while ReLU is a lattice-join closing and max-pooling is a dilation in the pointwise max-plus lattice, and their composition is a morphological opening in neither. A second finding is that the upper adjoint of ReLU in the pointwise lattice is a global (non-local) operator, the identity on globally non-negative functions and −∞ otherwise, so no local morphological erosion can form an adjunction pair with ReLU. These two results together provide the precise algebraic reason why depth in standard CNNs introduces genuine representational power: the composed layer is not idempotent. Three layer designs that are genuine idempotent openings are identified and fully characterised: the pure max-plus morphological layer (pointwise lattice), the spectral Wiener layer (Fourier lattice), and the self-dual morphological layer. We establish a complete fixed-point and convergence theory. The framework also unifies max-pooling, strided convolution, and the Laplacian pyramid under the Goutsias–Heijmans adjoint pyramid theory, and gives the Activation–Pooling Dilation (APD) factorisation with its correct adjoint.
MSC classes: 68T07, 06B23, 68U10, 94A12, 06A15
Submitted: Sat, 23 May 2026 14:43:52 UTC (v1)

[CV-200] LC-Flow: Learning Local Continuous Optical Flow and Confidence from events

Link: https://arxiv.org/abs/2605.24604
Authors: Gunwoo Jeon, Chaesong Park, Jongwoo Lim
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Event cameras capture brightness changes asynchronously with microsecond resolution, yet existing optical flow methods fail to fully exploit this temporal continuity. Frame-based approaches impose artificial accumulation latency and suffer from domain overfitting, while model-based local methods operate statelessly, discarding temporal history between predictions and yielding inaccurate flows. We propose LC-Flow, the first temporally continuous, learning-based optical flow estimator that operates purely from local events. At its core, a Continuous Local Recurrent Network maintains persistent hidden states per spatial grid, incrementally accumulating temporal context as events arrive. Unlike frame-based methods constrained to fixed accumulation windows, and unlike stateless model-based methods that recompute motion from scratch at each step, LC-Flow produces sparse local flow estimates at arbitrary timestamps with full motion history. To address the inherent ambiguity of local observations, we jointly learn a confidence score that quantifies the reliability of each prediction, explicitly handling event sparsity and the aperture problem. This confidence serves a dual role: filtering unreliable estimates for downstream tasks such as visual odometry, and providing principled weights for a multi-scale confidence-guided aggregation that reconstructs globally consistent flow from the sparse local outputs. LC-Flow achieves state-of-the-art performance among local methods on both MVSEC and DSEC, while the confidence-guided aggregation establishes a new overall state-of-the-art on the MVSEC benchmark, surpassing heavy frame-based networks that rely on global spatial priors.

[CV-201] Correcting Visual Blur Induced by Attention Distraction to Reduce Hallucinations: Algorithm and Theory

Link: https://arxiv.org/abs/2605.24602
Authors: Quanjiang Li, Zhiming Liu, Wei Luo, Tingjin Luo, Chenping Hou
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multimodal large language models (MLLMs) frequently suffer from object hallucinations, yet the visual perceptual mechanism underlying this failure remains poorly understood. In this work, we reveal that hallucinations are strongly associated with a human-like attention distraction phenomenon, where humans under divided focus experience degraded visual clarity and produce inaccurate descriptions, while in models the same mechanism manifests as spatial inconsistency in multi-head attention and temporal fading of attention to image tokens during decoding. We further provide theoretical insights that attention dispersion increases model complexity and degrades classification generalization. Motivated by these findings, we propose an Attention-Focused Approach for Improved Image Perception (AFIP), which corrects attention distraction via cross-head attention enrichment and reinforces visual grounding through dynamic historical attention enhancement. Extensive experiments on multiple benchmarks and models validate the effectiveness of AFIP without additional training.

[CV-202] Self-supervised Dynamic Heterogeneous Degradation Modeling for Unified Zero-Shot Image Restoration

Link: https://arxiv.org/abs/2605.24593
Authors: XiaoWan Hu, Jing Yang, HeNan Liu, HuaQiu Li, Mai Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Zero-shot image restoration provides a flexible way to handle diverse degradations without task-specific training. However, existing methods typically rely on stacked layers or pre-trained features to enhance degradation expression, while overlooking physically consistent priors. Insufficient degradation prompts impose a heavy training burden and high sampling costs during zero-shot diffusion. Moreover, the fixed inference trajectory often collapses to suboptimal solutions under complex corruptions. We observe that heterogeneous degradations can be reparameterized into a minimal set of physically coherent parameters for compact representation. Based on this insight, we first propose a unified physical zero-shot image restoration (UP-ZeroIR) framework that explicitly models heterogeneous degradations into a homogeneous all-in-one distribution. The distribution can be optimized directly in the latent space, enabling principled solution exploration and effective prompt adaptation. In addition, we introduce a dynamic quality-refinement strategy that adaptively adjusts the diffusion trajectory for robust globally optimal convergence. Extensive experiments demonstrate that our method achieves state-of-the-art performance across both single and mixed degradations. Our code is available at this https URL

[CV-203] Physen-Noise2Noise: Physics-Guided Self-Supervised Defocus Deblurring with Bias Correction under Low-Light Conditions

Link: https://arxiv.org/abs/2605.24590
Authors: Ziyan Huang, Lang Wu, Hongji Wang, Yifei Liu, Dongliang Tang, Hongqiao Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 14 pages

Abstract:Low-light, long-exposure defocus deblurring remains a challenging problem due to the simultaneous presence of severe blur and complex biased noise. Existing methods typically rely on simplified noise assumptions, which limits their effectiveness under realistic imaging conditions. In this work, we propose Physen-Noise2Noise, a self-supervised deblurring framework guided by the physical model of defocus imaging, which leverages noisy multi-frame observations without requiring clean reference images. Unlike conventional Noise2Noise-based approaches that assume zero-mean noise, we derive a frequency-domain constraint inherent to the defocus imaging process and incorporate it into the learning framework via a learnable noise bias parameter. In addition, a multi-frame noisy initialization strategy is introduced to suppress complex biased noise prior to deblurring, providing a more stable starting point for reconstruction. This formulation explicitly models biased noise and enables joint bias correction and high-frequency detail recovery during training. Furthermore, we develop a pretrain-finetune variant to enhance robustness and generalization under challenging noise conditions. Extensive experiments on both simulation and real-world datasets demonstrate that the proposed method consistently outperforms state-of-the-art self-supervised approaches for defocus deblurring in the presence of complex biased noise.

[CV-204] World Models as Group Actions

Link: https://arxiv.org/abs/2605.24578
Authors: Zijie Wang, Wei Zhang, Weiming Zhang, Fanqi Zhang, Xiao Tan, Yipeng Qin, Guanbin Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review

Abstract:Video world models have achieved strong visual realism, but this does not ensure that their dynamics are truly governed by actions. In this work, we argue that action faithfulness should be understood through the compositional structure of actions, which in many embodied settings follows a group structure (e.g., SE(2) for navigation). Based on this insight, we formalize action-conditioned world modeling as realizing a group action on the state space, providing a principled criterion for evaluating dynamics beyond visual quality. To operationalize this framework, we propose a unified approach that enforces identity, inverse, and composition consistency via latent-space regularization with synthesized supervision, avoiding additional data collection. We further introduce two metrics: Group-Action Consistency (GAC) and Group-Action Robustness (GAR), to evaluate structural correctness and rollout stability. Extensive experimental results show that our method consistently improves both GAC and GAR in state-of-the-art video world models without degrading perceptual quality.
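The identity, inverse, and composition conditions named in the abstract can be illustrated on a toy problem. The sketch below is not the paper's method or its GAC/GAR metrics: it assumes a hypothetical world model whose state is itself a planar pose, treats actions as SE(2) elements `(x, y, theta)`, and measures how far a `step` function is from being a group action. The names `consistency_gaps`, `exact_step`, and `drifting_step` are illustrative only.

```python
import math

def compose(g, h):
    """Group product g*h in SE(2); elements are (x, y, theta) tuples."""
    gx, gy, gt = g
    hx, hy, ht = h
    c, s = math.cos(gt), math.sin(gt)
    theta = math.atan2(math.sin(gt + ht), math.cos(gt + ht))  # wrap angle
    return (gx + c * hx - s * hy, gy + s * hx + c * hy, theta)

def inverse(g):
    """Group inverse in SE(2): (R, t) -> (R^T, -R^T t)."""
    gx, gy, gt = g
    c, s = math.cos(gt), math.sin(gt)
    return (-(c * gx + s * gy), s * gx - c * gy, -gt)

def gap(a, b):
    """Largest coordinate difference between two poses, with angle wrap."""
    dtheta = math.atan2(math.sin(a[2] - b[2]), math.cos(a[2] - b[2]))
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]), abs(dtheta))

def consistency_gaps(step, state, a, b):
    """Violation of the three group-action conditions for a step function."""
    identity = (0.0, 0.0, 0.0)
    return {
        # Applying the identity action should leave the state unchanged.
        "identity": gap(step(state, identity), state),
        # Applying a and then its inverse should return to the start.
        "inverse": gap(step(step(state, a), inverse(a)), state),
        # Applying a then b should match applying the product a*b once.
        "composition": gap(step(step(state, a), b), step(state, compose(a, b))),
    }

def exact_step(state, action):
    """A faithful toy model: the state moves exactly as the action says."""
    return compose(state, action)

def drifting_step(state, action):
    """An unfaithful toy model: every step adds a constant bias in x."""
    x, y, t = compose(state, action)
    return (x + 0.05, y, t)

state = (1.0, 2.0, 0.3)
a, b = (0.5, 0.0, 0.7), (0.2, -0.1, -1.1)
exact_gaps = consistency_gaps(exact_step, state, a, b)
drift_gaps = consistency_gaps(drifting_step, state, a, b)
```

In this toy setting the faithful model has gaps at floating-point level, while the biased model violates all three conditions even though each individual step looks plausible.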

[CV-205] PILOT: Policy-Informed Learned Optimization for Adaptive Deep Network Training

Link: https://arxiv.org/abs/2605.24570
Authors: Sattam Altuuaim, Lama Ayash, Muhammad Mubashar, Naeemullah Khan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 5 figures

Abstract:Despite the central role of optimization in deep learning, most optimizers rely on update structures whose functional form is fixed before training begins. This static design can limit their ability to respond to changing gradient behavior across the loss landscape, where training may shift between stable, noisy, and inconsistent regimes. This study proposes PILOT (Policy-Informed Learned OpTimizer), an online optimizer that adapts its update behavior during training. Rather than using a fixed balance between momentum, normalization, and sign-based updates, PILOT uses gradient-direction agreement as a signal of local training stability. Conditioning the update rule on this agreement signal allows the optimizer to adjust its behavior when gradients become stable, noisy, or inconsistent. Experiments on FashionMNIST and CIFAR-10 show that PILOT consistently achieves the highest accuracy among the evaluated optimizers across convolutional settings. On the CNN architecture, PILOT reaches 94.13% on FashionMNIST and 81.94% on CIFAR-10. On ResNet-18, it further improves performance, reaching 95.71% on FashionMNIST and 93.42% on CIFAR-10. These results suggest that learning how to adapt the update structure during training can improve performance across both compact and deeper convolutional models while preserving a simple first-order optimization framework. The implementation of PILOT is publicly available at this https URL
Submitted: Sat, 23 May 2026 13:10:15 UTC (v1)

[CV-206] EMA: Effort Metric Attention for Anatomical Effort-Guided Human Motion Diffusion

Link: https://arxiv.org/abs/2605.24566
Authors: Joshua Siy, Huakun Liu, Yutaro Hirao, Monica Perusquia-Hernandez, Hideaki Uchiyama, Kiyoshi Kiyokawa
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Accepted at IEEE International Conference on Automatic Face and Gesture Recognition (FG 2026)

Abstract:Human motion diffusion models can synthesize action sequences from text, but controlling motion intensity remains challenging. Existing approaches rely on effort-related adverbs, which are ambiguous and fail to capture quantitative aspects such as pacing, often resulting in flat and monotonous dynamics. We propose an intensity-control framework based on Effort Metric Attention (EMA), a cross-attention module that conditions diffusion on numerical effort signals. Inspired by Laban Movement Analysis (LMA), the framework focuses on the Time and Weight effort factors. We approximate these factors using two kinematic metrics: peak joint positional change for pacing and collective joint positional change for motion amount. EMA enables fine-grained, region-wise control without costly post-hoc optimization. We introduce two evaluation tasks, metric-to-motion consistency and body-part-level effort modulation, to assess numerical fidelity and localized control. Experiments and a user study show near-monotonic alignment between specified effort levels, generated motion dynamics, and established LMA descriptors. These results indicate effective and interpretable control of effort dynamics in practice.

[CV-207] PEDESTRIANQA: A Benchmark for Vision-Language Models on Pedestrian Intention and Trajectory Prediction

Link: https://arxiv.org/abs/2605.24562
Authors: Naman Mishra, Shankar Gangisetty, C. V. Jawahar
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Pedestrian intention and trajectory prediction are critical for the safe deployment of autonomous driving systems, directly influencing navigation decisions in complex traffic environments. Recent advances in large vision-language models offer a powerful new paradigm for these tasks by combining high-capacity visual understanding with flexible natural language reasoning. In this work, we introduce PedestrianQA, a large-scale video-based dataset that formulates pedestrian intention and trajectory prediction as question-answering tasks augmented with structured rationales. PedestrianQA expresses richly annotated pedestrian sequences in natural language, enabling VLMs to learn from visual dynamics, contextual cues, and interactions among traffic agents while generating concise explanations of their predictions, without needing specialized architectures tailored to each task. Empirical evaluations across PIE, JAAD, TITAN, and IDD-PeD show that finetuning state-of-the-art VLMs on PedestrianQA significantly improves intention classification, trajectory forecasting accuracy, and the quality of explanatory rationales, demonstrating the strong potential of VLMs as a unified and explainable framework for safety-critical pedestrian behavior modeling.

[CV-208] IQA-Spider: Unifying Multi-Granularity Image Quality Assessment with Reasoning Grounding and Referring ICML2026

Link: https://arxiv.org/abs/2605.24553
Authors: Xinge Peng, Yiting Lu, Xin Li, Zhibo Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML 2026

Abstract:We present IQA-Spider, the first image quality assessment (IQA) framework that unifies reasoning, grounding, and referring into a single LMM-based framework for multi-granularity quality understanding. Existing LMM-based IQA methods typically support only partial perception dimensions, such as quality description and question answering (i.e., reasoning) or pixel-level grounding. This limitation largely stems from the absence of (i) a unified task and data formulation and (ii) effective optimization paradigms for multi-granularity learning. To address these limitations, we formulate a rigorous four-task paradigm covering global and local quality description, pixel-level grounding, and region-level referring. Based on this formulation, we construct a corresponding IQA dataset with a scalable and automatic annotation pipeline, thereby providing a solid foundation for unified multi-granularity learning. To further enable unified perception, we adopt a conflict-free two-stage design that progressively extends text-level multi-granularity understanding to pixel-level grounding: (i) the first stage equips the model with fine-grained text-level reasoning across multiple IQA tasks, and (ii) the second stage introduces a training-free text-to-point grounding paradigm, which bridges textual semantics and pixel-level perception by mapping token logits to spatial coordinates. Together, these efforts yield IQA-Spider, which provides unified, multi-granularity, explainable image quality assessment. Extensive experiments across multiple benchmarks demonstrate strong performance, validating the effectiveness and versatility of the proposed formulation and framework.

[CV-209] Learnable Shape Prototypes with Occlusion-Geometry-Guided Injection for Amodal Instance Segmentation

Link: https://arxiv.org/abs/2605.24533
Authors: Fufan Zhang, Jingxiang Wang, Xiangjie Ye
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 7 figures, 5 tables. Submitted to IEEE Transactions on Circuits and Systems for Video Technology

Abstract:Amodal instance segmentation aims to predict the complete object mask including occluded regions that lack pixel-level observations and must be inferred with the aid of shape priors. Existing methods acquire shape priors through fixed-capacity encoding spaces or expensive generative models, and inject them uniformly across all spatial positions without adapting to the varying prior demand between visible and occluded regions. In this paper, we propose a gated reliability-adaptive shape prior framework, which introduces a shape prior memory module that combines learnable prototypes via cross-attention to produce instance-adaptive shape priors through weighted prototype combination rather than generation. A spatial adaptive reliability gate then employs the signed distance field of the visible mask to modulate injection intensity at each position according to its occlusion depth, preserving reliable features in visible regions while directing shape compensation toward occluded areas. Experiments on two mainstream amodal instance segmentation benchmarks demonstrate that the proposed method outperforms existing approaches under multiple evaluation settings, improving the mean intersection-over-union over occluded regions by over 11 percentage points on one of the two benchmarks under the standard setting, while using approximately one-third of the total parameters. Linear probing analysis further reveals that the visible-mask cross-attention module implicitly encodes occlusion geometry into visual token representations, explaining the effectiveness of the proposed module decomposition.

[CV-210] Image-Conditioned Instance Prompt Network for Referring Remote Sensing Image Segmentation

Link: https://arxiv.org/abs/2605.24532
Authors: Biaoyu Ren (1), Qingsheng Wang (1), Cun Xu (1), Dingkang Yang (2), Wenxuan Wang (1 and 3) ((1) School of Computer Science, Northwestern Polytechnical University, Xi’an, China, (2) College of Intelligent Robotics and Advanced Manufacturing, Fudan University, Shanghai, China, (3) Shenzhen Research Institute of Northwestern Polytechnical University, Shenzhen, China)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 3 figures. Equal contribution: Biaoyu Ren and Qingsheng Wang. Corresponding authors: Dingkang Yang and Wenxuan Wang

Abstract:Referring Remote Sensing Image Segmentation (RRSIS) is a situated, task-driven cross-modal task related to the embodied perception paradigm, requiring models to align visual-spatial features with linguistic intentions for precise target perception. Recent research has focused on refining the granularity of textual features and optimizing image-text feature fusion to better guide target feature representations. However, insufficient descriptive granularity and sensitivity to semantic shifts can cause bottlenecks in cross-modal feature fusion. To address these issues, we propose the Image-Conditioned Instance Prompt Network (ICIPNet) with Bilateral Information Fusion, which is designed to alleviate bottlenecks in cross-modal feature fusion. ICIPNet introduces an Image-Conditioned Instance Prompt (ICIP) module to generate self-adaptive visual and semantic representations without external knowledge. The Bilateral Information Fusion (BIF) module enhances feature fusion along the token and channel dimensions. Experiments demonstrate that the proposed ICIPNet outperforms existing RRSIS models.

[CV-211] NudgeVAD: Language-Nudged End-to-End Driving via FiLM Residuals CVPR2026

Link: https://arxiv.org/abs/2605.24531
Authors: Chieh-Chi Yang, Yu-Hsiang Chen, Yi-Ting Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical report for the doScenes Instructed Driving Challenge, CVPR 2026 DriveX Workshop. 1st place in the Ablation track

Abstract:Natural-language instructions promise controllable end-to-end driving, but their benefit can be hidden when planners already receive reliable high-level commands. We propose NudgeVAD, a frozen-planner residual framework that uses language as a calibrated nudge to a VAD trajectory. With identity-initialized FiLM and a zero-initialized residual head, NudgeVAD is equivalent to the frozen planner at initialization, so learned deviations arise only from language-conditioned residuals. We evaluate NudgeVAD along a command-reliability axis. With reliable commands, language improves the initial planner but becomes nearly redundant once compared against VAD-FT (UNCOND), a compute-matched VAD model fine-tuned without language. With random commands, however, language becomes essential: detaching text degrades ADE6s to 3.166 m, while NudgeVAD with text recovers 2.806 m and outperforms VAD-FT (UNCOND) by 0.312 m. These results show that language is not universally additive; it is most valuable when the categorical command channel is unreliable.
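The claim that identity-initialized FiLM plus a zero-initialized residual head reproduces the frozen planner at initialization can be checked with a small sketch. The layer shapes, the `LanguageNudge` class, and the parameterization `gamma = 1 + W_gamma e` are assumptions for illustration, not the authors' architecture.

```python
def matvec(W, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def zeros(rows, cols):
    """Zero matrix with independent rows."""
    return [[0.0] * cols for _ in range(rows)]

class LanguageNudge:
    """Hypothetical language-conditioned residual on a frozen planner output."""

    def __init__(self, feat_dim, text_dim, out_dim):
        # Identity-initialized FiLM: gamma = 1 + W_gamma e, beta = W_beta e,
        # with both projections starting at zero (gamma = 1, beta = 0).
        self.W_gamma = zeros(feat_dim, text_dim)
        self.W_beta = zeros(feat_dim, text_dim)
        # Zero-initialized residual head: the predicted offset starts at 0.
        self.W_res = zeros(out_dim, feat_dim)

    def __call__(self, base_traj, feat, text_emb):
        gamma = [1.0 + g for g in matvec(self.W_gamma, text_emb)]
        beta = matvec(self.W_beta, text_emb)
        modulated = [g * f + b for g, f, b in zip(gamma, feat, beta)]
        residual = matvec(self.W_res, modulated)
        # The frozen planner's trajectory is only ever shifted, never replaced.
        return [p + r for p, r in zip(base_traj, residual)]

nudge = LanguageNudge(feat_dim=3, text_dim=2, out_dim=4)
base_traj = [0.0, 1.0, 2.0, 3.0]   # flattened waypoints from the frozen planner
feat = [0.2, -0.4, 0.9]            # planner features (made-up values)
text_emb = [1.0, -1.0]             # instruction embedding (made-up values)
at_init = nudge(base_traj, feat, text_emb)
```

Because the offset is exactly zero before training, any later deviation from the base trajectory has to come from the learned, text-conditioned weights.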

[CV-212] Φ-Noise: Training-Free Temporal Video Conditioning via Phase-Based Noise Manipulation

Link: https://arxiv.org/abs/2605.24509
Authors: Ofir Abramovich, Nadav Z. Cohen, Adi Rosenthal, Ariel Shamir
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Under Review; 26 pages, 21 figures

Abstract:Latent video diffusion models generate videos by progressively transforming Gaussian noise into realistic samples conditioned on text or visual inputs. However, existing conditioning methods often require additional training and computational overhead. Motivated by recent findings on the importance of frequency components in generative models, we propose a simple, training-free approach for motion-conditioned video generation by injecting low-frequency phase information from a reference video directly into the diffusion noise latents. Our method transfers motion cues without modifying the model architecture or inference pipeline. Using several applications, we demonstrate effective control over both appearance and dynamics in generated videos, while achieving competitive or superior results compared to more complex conditioning approaches.
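A one-dimensional toy version of low-frequency phase injection is sketched below. It assumes a single latent value tracked over 16 frames, keeps the noise magnitudes, and copies the reference phase for the lowest non-DC frequencies. Those choices, and the function `inject_low_freq_phase`, are illustrative assumptions; the abstract does not specify the axes, cutoff, or whether magnitudes are preserved.

```python
import cmath
import math
import random

def dft(x):
    """Naive discrete Fourier transform (fine for short toy signals)."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_pts)
                for n in range(n_pts)) for k in range(n_pts)]

def idft(spec):
    """Naive inverse DFT."""
    n_pts = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * n / n_pts)
                for k in range(n_pts)) / n_pts for n in range(n_pts)]

def inject_low_freq_phase(noise, reference, cutoff):
    """Keep the noise magnitudes; copy reference phase below the cutoff."""
    n_pts = len(noise)
    noise_spec, ref_spec = dft(noise), dft(reference)
    out = list(noise_spec)
    for k in range(n_pts):
        freq = min(k, n_pts - k)      # index k and n_pts - k share a frequency
        if 0 < freq <= cutoff:        # leave the DC term untouched
            out[k] = abs(noise_spec[k]) * cmath.exp(1j * cmath.phase(ref_spec[k]))
    # Both conjugate bins are edited together, so the result stays real.
    return [v.real for v in idft(out)]

random.seed(0)
n_frames = 16
noise = [random.gauss(0.0, 1.0) for _ in range(n_frames)]
reference = [math.sin(2 * math.pi * n / n_frames + 0.4)
             + 0.5 * math.cos(4 * math.pi * n / n_frames - 1.0)
             for n in range(n_frames)]
conditioned = inject_low_freq_phase(noise, reference, cutoff=2)
```

In this sketch the per-frequency energy of the noise is unchanged, so only the timing of the slow components is borrowed from the reference signal.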

[CV-213] FDDet: Achieving Data-Efficient Food Defect Detection Under Real-World Scenarios

Link: https://arxiv.org/abs/2605.24508
Authors: Ruihao Xu, Yong Liu, Yansong Tang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Food defect detection is critical for automated quality control, yet existing studies lack unified benchmarks and suffer from data scarcity. We introduce FDD-48, a comprehensive dataset with fine-grained annotations across 13 food types and 48 defect categories under diverse real-world conditions. To improve detection with limited labeled data, we propose FDDet, a semi-supervised framework featuring two key components: (1) BBoxMixUp, a data augmentation technique that mixes same-category defect regions to reduce spurious feature associations, and (2) CGPC (Consistency-Guided Pseudo-Label Calibration), which filters pseudo-labels based on intra-sample consistency. Experiments show FDDet significantly outperforms mainstream detectors on FDD-48, demonstrating its effectiveness for food defect detection under data-limited scenarios.

[CV-214] FoodMonitor: Benchmarking MLLMs for Explainable Compliance Analysis

Link: https://arxiv.org/abs/2605.24503
Authors: Ruihao Xu, Xingming Shui, Jingxuan Niu, Yiqin Wang, Jilin Yu, Haoji Zhang, Yansong Tang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:As AI-powered compliance monitoring becomes increasingly important in public governance and industrial safety, the ability to provide verifiable evidence and traceable accountability signals is essential. However, existing video anomaly detection datasets focus on event-level binary classification, lacking the rule-driven, explainable analysis required for real-world compliance scenarios. We introduce FoodMonitor, a benchmark for explainable compliance analysis in commercial kitchen surveillance. FoodMonitor comprises 477 video clips with 3,307 violation annotations across a dual-channel design covering both person-level and environment-level violations. Each annotation specifies which rule was violated, what non-compliant behavior occurred, and who committed it with frame-level bounding boxes. We establish a unified evaluation protocol with a two-stage matching mechanism that separately assesses spatial localization and semantic understanding, along with a composite metric (C_score) that balances environment and person detection performance. Systematic evaluation of several state-of-the-art multimodal large language models reveals that the best-performing model achieves a C_score of only 0.360, with spatial localization and fine-grained rule understanding emerging as the primary bottlenecks. Our analysis identifies two distinct failure modes: localization-dominated errors and semantics-dominated errors, providing diagnostic insights for future model development.

[CV-215] EgoAdapt: A Multi-Scene Egocentric Adaptation Method for CVPR 2026 HD-EPIC VQA Challenge CVPR2026

Link: https://arxiv.org/abs/2605.24500
Authors: Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, Guozhi Qiu, Weili Guan, Liqiang Nie
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report for CVPR 2026 HD-EPIC VQA Challenge

Abstract:This technical report presents our solution, EgoAdapt (Egocentric Adaptation via Category, Calibration, and Consistency), to the CVPR 2026 HD-EPIC VQA challenge. HD-EPIC evaluates whether a vision-language model can reason over realistic first-person kitchen videos, where the evidence for an answer may be a short hand-object interaction, a long recipe trajectory, a spatial relation to a fixture, or a subtle gaze cue. The benchmark contains 26K multiple-choice questions across seven macro-categories: recipe, ingredient, nutrition, fine-grained action, 3D perception, object motion, and gaze. We observe that the main difficulty is not only model capacity, but also the mismatch between a single generic inference recipe and the heterogeneous temporal, spatial, and semantic structure of the benchmark. Our method, EgoAdapt, introduces three inference-time components: (1) category-conditioned routing with per-category prompts, frame budgets, and sampling rates; (2) calibrated option scoring that evaluates all candidate answers with letter-token likelihoods and generation agreement instead of relying only on direct generation; and (3) test-time consistency adaptation that aggregates predictions across option permutations and verification-style prompts for ambiguous cases. This design substantially improves over the available HD-EPIC baselines.
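The third component, aggregating predictions across option permutations, can be sketched as a simple vote. This is a minimal illustration under assumptions: `choose` stands in for a hypothetical model call that returns the position of its preferred answer in the list it was shown, and the plain majority vote is not claimed to be the report's exact aggregation rule.

```python
from collections import Counter
from itertools import permutations

def permutation_vote(options, choose, max_perms=6):
    """Ask the same question under shuffled option orders and vote.

    choose(shown) is a stand-in for a model call; it returns the position
    of the selected answer within the list it was shown.
    """
    votes = Counter()
    for i, perm in enumerate(permutations(range(len(options)))):
        if i >= max_perms:
            break
        shown = [options[j] for j in perm]
        picked_pos = choose(shown)
        votes[perm[picked_pos]] += 1   # map back to the original option index
    best, count = votes.most_common(1)[0]
    return best, count / sum(votes.values())

options = ["knife", "whisk", "spatula"]

def content_based(shown):
    """Toy chooser that reads the options and always finds the whisk."""
    return shown.index("whisk")

def position_biased(shown):
    """Toy chooser that always picks the first slot, whatever it contains."""
    return 0

stable_answer, stable_agreement = permutation_vote(options, content_based)
_, biased_agreement = permutation_vote(options, position_biased)
```

A content-based chooser keeps full agreement under reordering, while a position-biased one spreads its votes across options, which is the signal a consistency check of this kind relies on.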

[CV-216] EgoAction: Egocentric Action Composition with Reliability-Aware Temporal Fusion for the EPIC-KITCHENS Action Detection Challenge at CVPR 2026 CVPR2026

Link: https://arxiv.org/abs/2605.24496
Authors: Zhiheng Fu, Zixu Li, Zhiwei Chen, Fangxu Liu, Yupeng Hu, Weili Guan, Liqiang Nie
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report for CVPR 2026 EPIC-KITCHENS-100 Action Detection Challenge

Abstract:The EPIC-KITCHENS-100 Action Detection challenge evaluates whether a model can localize the start and end of each action in long untrimmed egocentric videos and assign the corresponding verb–noun action label. In this report, we formulate our submission as EgoAction (Egocentric Action Composition with Reliability-Aware Temporal Fusion), a unified decoupled detection and fusion pipeline. The pipeline uses EPIC-finetuned VideoMAE-L features, trains separate noun and verb temporal detectors with causal temporal modeling, composes action hypotheses from top noun–verb pairs, and introduces a confidence-adaptive boundary fusion rule at post-processing time. The key observation is that verb and noun streams often fail differently: verb scores are sensitive to motion transitions, whereas noun scores are sensitive to hand-object visibility and object clutter. A fixed arithmetic mean of their predicted boundaries can therefore amplify localization errors when one stream degenerates. We replace this hard-coded mean with Dynamic Weighted Fusion (DWF), which normalizes the maximum noun and verb classification confidences into proposal-wise boundary weights and linearly combines the two intervals. This lightweight tensor-only operator shifts boundary authority toward the more reliable stream while preserving the decoupled action scoring mechanism. Together with sliding-window inference, top-K noun–verb action composition, and class-wise Soft-NMS, EgoAction provides a compact and reproducible system for egocentric temporal action detection.
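The Dynamic Weighted Fusion rule described above can be sketched for a single proposal. The sketch assumes the two maximum confidences are normalized by their sum; the report may use a different normalization (for example a softmax), and the function name and equal-weight fallback are illustrative.

```python
def dynamic_weighted_fusion(noun_interval, verb_interval, noun_scores, verb_scores):
    """Fuse two (start, end) intervals using confidence-derived weights."""
    c_noun, c_verb = max(noun_scores), max(verb_scores)
    total = c_noun + c_verb
    # Assumed sum-normalization; fall back to the plain mean if both are zero.
    w_noun = 0.5 if total <= 0 else c_noun / total
    w_verb = 1.0 - w_noun
    start = w_noun * noun_interval[0] + w_verb * verb_interval[0]
    end = w_noun * noun_interval[1] + w_verb * verb_interval[1]
    return start, end

noun_interval, verb_interval = (10.0, 20.0), (12.0, 26.0)

# Equal confidences reduce to the arithmetic mean of the boundaries.
mean_fused = dynamic_weighted_fusion(noun_interval, verb_interval, [0.6, 0.1], [0.6, 0.2])

# A confident noun stream and a weak verb stream pull the result toward
# the noun interval instead of splitting the difference.
noun_led = dynamic_weighted_fusion(noun_interval, verb_interval, [0.9, 0.05], [0.1, 0.05])
```

With equal confidences the rule reduces to the fixed arithmetic mean that the abstract says it replaces, so the mean is a special case rather than a separate code path.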

[CV-217] Med-R2: An Adversarial Benchmark for Evidence-Grounded Reasoning in Medical VLMs

Link: https://arxiv.org/abs/2605.24492
Authors: Wen Ma, Fucheng Niu, Zhiting Fan, Zikai Xiao, Jiaxiang Liu, Zuozhu Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-language models have demonstrated impressive capabilities in general medical visual question answering, yet due to limited interpretability, it remains unclear whether their predictions reflect evidence-grounded clinical reasoning or reliance on spurious priors. We introduce Med-R2 Bench, a hierarchical benchmark aligned with the clinical workflow to evaluate adversarial robustness with visual grounding. We design stepwise QA tasks to assess whether reasoning chains are strictly grounded in visual evidence across the four clinical stages, and employ adversarial perturbations to test robustness against misleading cues. Med-R2 comprises 42,432 images, 31 task categories, and 110,406 QA pairs. Evaluation across 14 VLMs reveals a sequential performance degradation along the four-stage clinical workflow. Adversarial experiments show that models rely heavily on correct prompts to guess answers. Even when provided with explicit visual cues, the models struggle to accurately align textual descriptions. Finally, we demonstrate that stepwise fine-tuning on our hierarchical data significantly improves reasoning robustness, highlighting its potential to drive future improvements in evidence-based medical AI.

[CV-218] Appearance-Invariant Detection of Suggestive Motion via Laban Movement Descriptors on SMPL Skeletons SIGGRAPH2026

Link: https://arxiv.org/abs/2605.24488
Authors: Jaehoon Ahn, Jeonghan Kong, Moon-Ryul Jung
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 2 pages, 2 figures. Accepted as a Poster at SIGGRAPH 2026

Abstract:Content moderation in online multiplayer 3D virtual environments has recently been relegated to automated, AI-based pipelines. However, the field has mainly focused on detecting illicit content in images, video, and audio, leaving blind spots in detection techniques for suggestive motion. We present a motion-only classification pipeline that detects suggestive and explicit movement from SMPL skeleton trajectories using Laban Movement Analysis (LMA) descriptors. On 20,514 motion fragments (17+ hours) spanning four ordinal tiers (everyday, artistic, suggestive, explicit), logistic regression over 110 LMA features achieves 57.3% four-way accuracy (2.3x chance), 72.1% three-way, and 78.7% binary SFW/NSFW. Classification errors concentrate between adjacent tiers rather than non-adjacent ones. Moreover, different movement qualities dominate at each level of the taxonomy: no single feature drives the classification, suggesting that the four-tier structure reflects genuinely distinct motion regimes.

[CV-219] OmniEgo-R2: A Routed Reasoning Framework for the 1st Cross-Domain EgoCross Challenge at CVPR 2026 CVPR2026

Link: https://arxiv.org/abs/2605.24481
Authors: Zixu Li, Zhiwei Chen, Zhiheng Fu, Wenbo Wang, Yupeng Hu, Weili Guan, Liqiang Nie
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report for the 1st Cross-Domain EgoCross Challenge at CVPR 2026

Abstract:The 1st Cross-Domain EgoCross Challenge at EgoVis, CVPR 2026 evaluates whether multimodal large language models can reason over egocentric videos across surgery, industry, extreme sports, and animal perspective. We achieved second place in both the Source-Limited and Open-Source tracks. In this report, we formulate EgoCross as a robust cross-domain embodied video reasoning problem rather than a simple multiple-choice visual question answering task. We identify three key challenges: (C1) temporal boundary ambiguity, where critical state transitions are sparsely sampled and often occur between frames; (C2) cross-domain semantic granularity mismatch, where the same capability requires different domain-specific visual grammar; and (C3) decision instability under close options, where long multimodal reasoning can select unsupported distractors or produce malformed outputs. To address them, we propose OmniEgo-R^2 (Omnidomain Egocentric Routed Reasoning), a unified routed reasoning pipeline consisting of temporal-evidence normalization, domain-agnostic capability routing, structured perception–dynamics–decision reasoning, boundary-aware option verification, and defensive answer calibration. OmniEgo-R^2 uses the Qwen3-VL-4B-SFT checkpoints on each EgoCross domain as the visual-language backbone, and wraps them with lightweight test-time reasoning and parsing programs. Our final submissions obtain 66.35% overall accuracy in the Source-Limited track and 66.77% in the Open-Source track, ranking second on both leaderboards.

[CV-220] Robust Fuzzy Multi-view Learning under View Conflict

Link: https://arxiv.org/abs/2605.24475
Authors: Siyuan Duan, Yuan Sun, Dezhong Peng, Yingke Chen, Xi Peng, Peng Hu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments:

Abstract:Trusted multi-view classification (TMVC) aims to deliver reliable fusion for accurate predictions and has recently attracted substantial attention in both academia and industry. However, existing TMVC methods typically assume strict alignment across different views during both training and testing phases, which is often impractical in real-world scenarios. This limitation motivates us to revisit TMVC and extend it to a more challenging setting: how to mitigate the impact of view conflict (VC) during both training and inference. In this setting, existing TMVC methods suffer from three critical limitations: underestimated uncertainty, misleading decisions, and overfitting to VC. To address these issues, this paper proposes a novel Robust Fuzzy Multi-View Learning (R-FUML) framework grounded in Fuzzy Set Theory. Specifically, R-FUML models network outputs as fuzzy memberships to quantify category credibility and uses an entropy-based method for reliable multi-view fusion. To this end, we present a Robust Multi-view Fusion (RMF) strategy that accounts for both view-specific uncertainty and inter-view conflicts, thereby alleviating the adverse impacts of VC on decision-making. To identify and conquer VC during training, we further design a Robust Learning Against VC (RLVC) framework. RLVC isolates conflicting samples by leveraging neural networks’ memory effects and then retrains the model by applying a penalty to these conflicting views. Extensive experiments across eight public datasets demonstrate that R-FUML consistently outperforms 15 state-of-the-art baselines in robustness and uncertainty estimation. The code will be released upon acceptance.

[CV-221] TempRet: Temporal Enhancement and Two-Stage Reranking for CVPR 2026 EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge CVPR2026

Link: https://arxiv.org/abs/2605.24470
Authors: Zixu Li, Yupeng Hu, Zhiwei Chen, Zhiheng Fu, Xiaowei Zhu, Weili Guan, Liqiang Nie
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report for CVPR 2026 EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge

Abstract:Video-text retrieval has witnessed remarkable progress driven by large-scale vision-language pretraining, yet most existing approaches inherit an implicit assumption from image-text retrieval: that visual semantics can be captured frame-by-frame. This assumption overlooks the temporal dynamics of egocentric videos. The EPIC-KITCHENS-100 Multi-Instance Retrieval (MIR) challenge further raises the bar by providing soft-label relevance matrices rather than binary labels, demanding models that can resolve graded semantic correspondences across modalities. In this report, we present our solution, termed TempRet, to the CVPR 2026 EPIC-KITCHENS-100 MIR challenge. Our approach builds upon a CLIP-based dual-encoder backbone and introduces two key components to address the temporal and cross-modal challenges. First, a temporal transformer operates exclusively on the video side, modeling inter-frame dependencies through learnable positional encodings and multi-head self-attention over frame-level CLIP features. Second, a two-stage reranking pipeline first retrieves Top-K candidates via the dual-encoder, then refines their scores using a cross-encoder equipped with an Image-Text Matching (ITM) head. The entire system is trained with Symmetric Multi-Similarity Loss to exploit the soft-label relevance matrices provided by the challenge. Our method achieves 67.97% average mAP and 82.92% average nDCG on the EK-100 MIR benchmark, demonstrating the effectiveness of temporal modeling and cross-modal refinement for egocentric video retrieval.
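
The retrieve-then-rerank flow described in the abstract can be sketched in a few lines. This is an illustration under stated assumptions: `two_stage_retrieve`, the dot-product similarity, and the toy `cross_scores` table stand in for the dual-encoder and the ITM-equipped cross-encoder, whose actual interfaces are not given in the abstract.

```python
def two_stage_retrieve(query_vec, video_vecs, cross_score, top_k=3):
    """Stage 1: cheap dot-product retrieval; stage 2: rerank the Top-K only."""
    coarse = [sum(q * v for q, v in zip(query_vec, vec)) for vec in video_vecs]
    candidates = sorted(range(len(video_vecs)),
                        key=lambda i: coarse[i], reverse=True)[:top_k]
    # The (hypothetical) expensive scorer is called only on the shortlist.
    return sorted(candidates, key=cross_score, reverse=True)


# Toy embeddings and a toy lookup table playing the role of the cross-encoder.
query_vec = [1.0, 0.0]
video_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.3]]
cross_scores = {0: 0.2, 1: 0.7, 2: 0.99, 3: 0.5}

ranking = two_stage_retrieve(query_vec, video_vecs, cross_scores.get, top_k=3)
print(ranking)
```

Video 2 never reaches the second stage in this toy example even though its cross score is the highest, which shows that the quality of the first-stage shortlist bounds what reranking can recover.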

[CV-222] Coarse-to-Fine Domain Incremental Learning with Attentive Distillation for Mining Footprint Segmentation in Multispectral Imagery

Link: https://arxiv.org/abs/2605.24460
Authors: Alif Tri Handoyo, Vincent C.S. Lee, Rizka Widyarini Purwanto, Alex M. Lechner, Deanna Kemp, Muhamad Risqi U. Saputra
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Automatically mapping and segmenting global mining footprints using remote sensing and deep learning is critical for monitoring the socio-environmental risks and impacts of mining, yet its progress is hindered by the scarcity of fine-grained annotated data. Although large-scale datasets with coarse boundaries are widely available, leveraging them to improve fine-grained segmentation is challenging due to significant domain shift. To address this, we propose MineC2FNet, a coarse-to-fine domain incremental learning framework that exploits abundant coarse data to enhance fine-grained mining footprint segmentation. MineC2FNet adopts a teacher-student architecture with attentive distillation at both the feature and prediction levels, selectively transferring generalized knowledge from the coarse domain while enabling boundary refinement using limited fine-grained data (fine domain). We further introduce an expertly validated dataset of 219 images with precise boundary annotations across diverse geographies and commodities. Extensive experiments against state-of-the-art approaches, including domain adaptation and domain incremental learning methods, demonstrate that MineC2FNet achieves superior performance while effectively handling domain shift. The dataset and code are publicly available at this https URL.

[CV-223] EgoProx: Evaluating MLLMs on Egocentric 3D Proximity Reasoning Across a Cognitive Hierarchy CVPR2026

Link: https://arxiv.org/abs/2605.24456
Authors: Jinzhao Li, Yinuo Chen, Dongxu Piao, Panwang Pan, Yifan Yu, Dong Wang, Honglei Yan, Liang Yue, Shaofei Wang, Yixin Chen, Siyuan Huang, Miao Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026

Abstract:Humans constantly reason about 3D proximity, the relations between their body and surrounding objects, to guide perception and action in daily life. Whether multimodal large language models (MLLMs) can perform such embodied 3D reasoning remains unclear. To this end, we introduce EgoProx, a benchmark for egocentric 3D proximity reasoning. We organize our tasks along a cognitive chain, covering intention, exploration, exploitation, and chain-of-actions reasoning. We also design an agent-based data engine that produces diverse and consistent QA pairs at scale. We benchmark prevailing MLLMs on EgoProx and conduct additional analyses with dataset-specific and task-specific instruction tuning. We observe large cross-domain gains, indicating that current MLLMs contain some spatial knowledge; however, they still struggle to effectively leverage it for spatial reasoning VQA.

[CV-224] SILSM: A Sustainable Interactive Level Set Method for Progressive Refinement

Link: https://arxiv.org/abs/2605.24448
Authors: Jiachen Song, Dazhi Zhang, Fanghui Song, Zhichang Guo, Shengzhu Shi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Interactive segmentation aims to precisely isolate target objects using sparse user guidance. However, traditional methods often suffer from heavy interaction burdens and parameter sensitivity, while deep learning approaches struggle with data dependency and iterative instability. Motivated by these limitations, we propose the Sustainable Interactive Level Set Method (SILSM). The proposed level set evolution equation incorporates interaction, regularization, and segmentation terms. Specifically, high-order regularization is employed to maintain numerical stability, and unlike traditional methods, we decouple user guidance into an independent interaction term to enable direct manual control over the zero-level set evolution. Furthermore, we develop a numerical algorithm tailored for multiple interactions, which facilitates dynamic refinement by effectively updating the segmentation results based on sequential user inputs. We theoretically demonstrate that the high-order term provides stronger regularization constraints than the conventional length term, while the interaction term ensures segmentation strictly within the user-selected region. Experimental results further demonstrate that the proposed method is robust to interactive inputs, achieves competitive performance at the first interaction, and supports stable multi-round interactions with progressively improved segmentation quality.

[CV-225] Benchmarking Composed Image Retrieval for Applied Earth Observation

Link: https://arxiv.org/abs/2605.24442
Authors: Bill Psomas, Dionysis Christopoulos, Thanasis Petropoulos, Nikos Efthymiadis, Ioannis Kakogeorgiou, Ondřej Chum, Yannis Avrithis, Giorgos Tolias, Konstantinos Karantzalos
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Remote sensing composed image retrieval (RSCIR) enables search in large satellite image archives using composed queries that combine a reference image with a textual modifier. Although RSCIR offers a flexible interface for expressing targeted retrieval intent, the transferability of modern composition methods to Earth observation (EO) imagery and their relevance to operational EO workflows remain underexplored. We address this gap through a unified benchmark and an application-oriented study. First, we systematically adapt and evaluate representative composed image retrieval methods with six vision-language backbones on PatternCom under a standardized protocol, analyzing their behavior across backbones, composition strategies, and query types. Second, we introduce xView2-CIR, a change-centric dataset for disaster and damage monitoring, where retrieval is conditioned on scene identity and a target post-event state. Our results show that training-free composition methods provide strong and scalable baselines for EO retrieval, while change-centric retrieval presents different challenges from attribute-based retrieval, particularly due to the need to preserve scene identity. Overall, this study establishes a practical benchmark for RSCIR and positions composed retrieval as a complementary tool for remote sensing image retrieval, archive exploration, and change analysis. The dataset and code are available at this https URL.

[CV-226] Artiverse: A Diverse and Physically Grounded Dataset for Articulated Objects CVPR

Link: https://arxiv.org/abs/2605.24403
Authors: Denys Iliash, Jiayi Liu, Egor Fokin, Qirui Wu, Ali Mahdavi-Amiri, Manolis Savva, Angel X. Chang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR camera-ready version

Abstract:We present Artiverse, a diverse and physically grounded dataset of high-quality articulated 3D objects designed for realistic functional modeling and simulation. Artiverse contains 5.4K human-authored objects across a broad range of 88 categories, aggregated from multiple 3D static repositories. Objects are annotated with functional parts, interior structures, realistic kinematic relationships and articulated joints including multi-DoF joints, and physical attributes such as metric scale, material, and mass. We develop a semi-automated annotation pipeline that combines few-shot segmentation, geometric reasoning, and multi-stage human verification to achieve high-quality and efficient annotation, reducing manual annotation time by over 30%. We demonstrate the value of Artiverse on tasks of part mobility analysis, articulated object generation, and physics-based interaction. Artiverse provides a data resource to advance functional understanding for articulated objects.

[CV-227] Dual Prototype-Conditioned Diffusion Model for Scalable Multi-Class Unsupervised Anomaly Detection in Large Category Spaces

Link: https://arxiv.org/abs/2605.24402
Authors: Yaoxuan Feng, Yuxin Li, Weijiang Lv, Zixuan Zhao, Yubiao Wang, Wenchao Chen, Bo Chen, Hongwei Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multi-class anomaly detection aims to build unified models across diverse product categories. However, as the number of categories grows, its performance often degrades due to increasingly complex and heterogeneous normal distributions. To address this challenge, we propose DPDiff-AD, a Dual Prototype-conditioned Diffusion model for large-scale multi-class Anomaly Detection. DPDiff-AD models heterogeneous normal distributions through complementary local and global prototypes. Local prototypes capture representative fine-grained structural patterns via nearest-prototype aggregation, while global prototypes regulate holistic feature geometry through optimal transport regularization. Together, these dual-scale representations define a structured normality space. This space is refined through diffusion-based reconstruction conditioned on both local and global prototypes via prototype-aware attention. By jointly leveraging dual prototypes during generation, DPDiff-AD achieves precise normality modeling, preserves structured separability as category cardinality grows, and enables scalable anomaly discrimination. Extensive experiments across five benchmarks demonstrate the effectiveness and scalability of DPDiff-AD. On the 160-category large-scale dataset, it improves image- and pixel-level AUROC by 5.3 and 2.9 points over the previous state-of-the-art method Dinomaly+, while maintaining stable performance as category cardinality increases.

[CV-228] VectorArk: Learning Practical Image Vectorization with Rounded Polygon Representation CVPR2026

Link: https://arxiv.org/abs/2605.24398
Authors: Tarun Gehlaut, Difan Liu, Charu Bansal, Krutik Malani, Souymodip Chakraborty, Ankit Phogat, Matthew Fisher, Vineet Batra
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: CVPR 2026. Project page: this https URL

Abstract:Recent vision-language model (VLM)-based approaches have achieved impressive results on image vectorization tasks. However, they are typically evaluated on synthetic benchmarks, where clean SVGs are rasterized at high resolution and then re-vectorized. As a result, these methods generalize poorly to real-world scenarios, such as images with unknown rasterization methods or those generated by text-to-image models. We introduce VectorArk, a new VLM-based model designed for robust and practical image vectorization. VectorArk employs a novel rounded polygon representation that simplifies the learning process while naturally producing smooth, visually appealing primitives. We also propose a degradation model that enhances robustness across diverse and imperfect inputs. Our experiments show that, in contrast to previous methods, VectorArk achieves superior geometric completeness and artifact suppression across multiple datasets, with comprehensive ablations validating the contribution of each component.

[CV-229] Gaussian Rank-Based Neighborhood Degree for Graph Neural Networks in Image Classification

Link: https://arxiv.org/abs/2605.24367
Authors: Rafael Mendonça Duarte, Jean Roberto Ponciano, Lucas Pascotti Valem
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:The exponential growth of data has intensified the gap between the availability of unlabeled data and the high cost of manual annotation. Graph Neural Networks (GNNs) have emerged as a promising solution, as they exploit relational structures and learn from both labeled and unlabeled data, performing semi-supervised learning. A crucial component of many of these models is degree-based normalization, which influences message propagation but typically assumes uniform importance among neighboring nodes. In image classification, graphs are usually constructed from feature similarity, where treating all neighbors equally may overlook important variations in relevance. Motivated by this gap, we propose GRaNDe (Gaussian Rank-based Neighborhood Degree). This novel degree measure integrates neighborhood ranking with Gaussian distance weighting to better capture node importance. Experiments on five public image classification datasets show consistent accuracy improvements and competitive or superior results compared to state-of-the-art methods.

[CV-230] SparseWorld: Enhancing End-to-End Autonomous Driving via World Models with Sparse Scene Representation

Link: https://arxiv.org/abs/2605.24354
Authors: Ruoyu Wang, Jingke Wang, Yukai Ma, Yuehao Huang, Shuangming Lei, Guanglin Xu, Aixue Ye, Yong Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recently, world models have made significant progress in enhancing end-to-end driving systems through both future situation forecasting and improved scene understanding. However, existing driving world models are typically built upon dense scene representations, causing high computational costs and redundant information. In this paper, we present SparseWorld, a lightweight world model that focuses on predicting only the critical layout of the scene, enabling efficient future forecasting for end-to-end driving systems. SparseWorld first performs autoregressive rollout to forecast future map elements and surrounding agents, enabling the model to learn how driving scenarios evolve over time. It then leverages these predicted futures to refine downstream motion prediction and trajectory planning. Specifically, we propose a Sparse Dreamer that anticipates future instances in the latent space through joint temporal and spatial attention. By interacting with predicted future instances, the motion planner captures more accurate motion patterns and generates more informed and safety-aware trajectories. Extensive experiments demonstrate that SparseWorld significantly reduces collision risk and achieves state-of-the-art performance on the open-loop planning metrics of the nuScenes dataset with a collision rate of 0.05%. Moreover, it substantially outperforms the baseline method in closed-loop planning metrics on the Bench2Drive benchmark. Supplementary material is available at the project page: this https URL.

[CV-231] ViViD-5K: Vineyard vision dataset for field-based berry detection and segmentation and grape cluster closure estimation

Link: https://arxiv.org/abs/2605.24353
Authors: Xiangzhi Tong, Chengrui Zhang, Mac Flaherty, Andre Matteo Garcia, Dominic Gorman, Jonathan Jaramillo, Justine E. Vanden Heuvel, Yu Jiang
Categories: Computer Vision and Pattern Recognition (cs.CV); Other Quantitative Biology (q-bio.OT)
Comments:

Abstract:Cluster closure, defined as the progressive filling of gaps between the berries in a grape bunch, is a key trait in vineyard management, impacting disease risk. However, traditional visual scoring methods are labor-intensive, subjective, and lack temporal resolution. Existing datasets rarely support fine-grained berry-level analysis, limiting the development of robust deep learning models. In this work, we present ViViD-5K, a large-scale in-field Vineyard Vision Dataset containing 5,000 images with dense annotations, including over 648,000 berry centroids and cluster segmentation masks spanning 13 grape varieties. Building on this dataset, we introduce GrapeSAM, a two-stage visual pipeline that combines point-based berry localization with prompt-based segmentation using Segment Anything, followed by transformer-based cluster segmentation. The pipeline enables automated, in-field estimation of cluster closure with minimal supervision. Quantitative results demonstrate strong segmentation and counting accuracy across diverse conditions, while visualizations confirm robustness on both in-domain and out-of-domain samples. This work provides a scalable and objective alternative to manual compactness scoring and supports high-throughput grape phenotyping with enhanced spatial detail.

[CV-232] Causal Physics Steering in Video World Models via Concept Activation Vectors CVPR2026

Link: https://arxiv.org/abs/2605.24322
Authors: Nahid Alam
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: In proceedings of CVPR 2026 workshop on Video World Model

Abstract:Video world models learn representations of physical dynamics, but controlling their physical expectations at inference time remains an open problem. Recent interpretability work identified a Physics Emergence Zone (PEZ), a group of middle transformer layers in VideoMAE where physical plausibility is represented separately from other visual features. However, it remained unclear whether this structure could be used to directly control the model’s physics reasoning. We present physics steering, a training-free method that uses the weight vector of a linear probe at a PEZ layer as a Concept Activation Vector (CAV) and injects it into hidden states during inference. This shifts the model’s physical expectations without changing any model weights. On the IntPhys benchmark, this intervention reliably shifts the model’s plausibility judgment in either direction, depending on the steering sign. The effect appears only when the intervention is applied within the Physics Emergence Zone, suggesting that the relevant physics representation is localized there. We further find that physics is encoded separately from motion direction, and that different intuitive physics principles occupy distinct directions within this representation space. Together, these results show that physical reasoning in VideoMAE is not only readable, but also directly steerable.
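
The steering step can be illustrated with a minimal sketch: the probe's weight vector is added to a hidden state with a signed scale. The unit normalization, the scale `alpha`, and the function names are assumptions made for this example; the abstract states only that the probe weight vector is injected into hidden states at a PEZ layer.

```python
import math


def steer_hidden_state(hidden, probe_weight, alpha):
    """Add alpha times the unit-normalized probe direction to a hidden state."""
    norm = math.sqrt(sum(w * w for w in probe_weight))
    return [h + alpha * w / norm for h, w in zip(hidden, probe_weight)]


def probe_logit(hidden, probe_weight, bias=0.0):
    """Linear probe readout: w . h + b."""
    return sum(w * h for w, h in zip(probe_weight, hidden)) + bias


probe_weight = [3.0, 4.0]  # toy probe direction with norm 5
hidden = [1.0, 1.0]        # toy hidden state

base = probe_logit(hidden, probe_weight)
plus = probe_logit(steer_hidden_state(hidden, probe_weight, 2.0), probe_weight)
minus = probe_logit(steer_hidden_state(hidden, probe_weight, -2.0), probe_weight)
print(base, plus, minus)
```

Under this parameterization the probe logit shifts by `alpha` times the norm of the probe weight, so the sign of `alpha` determines the direction of the shift, in line with the sign-dependent effect the abstract reports.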

[CV-233] Unified 3D Scene Understanding Through Physical World Modeling ICLR2026

Link: https://arxiv.org/abs/2605.24321
Authors: Wanhee Lee, Klemen Kotar, Rahul Mysore Venkatesh, Jared Watrous, Honglin Chen, Khai Loong Aw, Daniel L. K. Yamins
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published as a conference paper at ICLR 2026

Abstract:Understanding 3D scenes requires flexible combinations of visual reasoning tasks, including depth estimation, novel view synthesis, and object manipulation, all of which are essential for perception and interaction. Existing approaches have typically addressed these tasks in isolation, preventing them from sharing a common representation or transferring knowledge across tasks. A conceptually simpler but practically non-trivial alternative is to unify these diverse tasks into a single model, reducing different tasks from separate training objectives to merely different prompts and allowing for joint training across all datasets. In this work, we present a physical world model for unified 3D understanding and interaction (3WM), formulated as a probabilistic graphical model in which nodes represent multimodal scene elements such as RGB, optical flow, and camera pose. Diverse tasks emerge from different inference pathways through the graph: novel view synthesis from RGB and dense flow prompts, object manipulation from RGB and sparse flow prompts, and depth estimation from RGB and camera conditioning, all zero-shot without task-specific training. 3WM outperforms specialized baselines without the need for finetuning by offering precise controllability, strong geometric consistency, and robustness in real-world scenarios, achieving state-of-the-art performance on NVS and 3D object manipulation. Beyond predefined tasks, the model supports composable inference pathways, such as moving objects aside while navigating a 3D environment, enabling complex geometric reasoning. This demonstrates that a unified model can serve as a practical alternative to fragmented task-specific systems, taking a step towards a general-purpose visual world model.

[CV-234] CoDA: Color Distribution Probing for Efficient and Generalizable AI-Generated Image Detection

Link: https://arxiv.org/abs/2605.24306
Authors: Zexi Jia, Zhiqiang Yuan, Xiaoyue Duan, Jinchao Zhang, Jie Zhou, Anil K. Jain
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:AI-generated image detection faces a persistent trade-off between generalization and efficiency: lightweight artifact-based methods often degrade on unseen generators or domains, whereas more robust large-scale models are computationally expensive. Meanwhile, existing benchmarks mainly focus on cross-model evaluation in photorealistic settings, leaving cross-domain robustness underexplored. To address this gap, we introduce FakeForm, a large-scale benchmark with approximately 370,000 images across 62 diverse domains for both cross-model and cross-domain evaluation. Motivated by this broader setting, we revisit color-distribution probing as an efficient complementary cue for AI-generated image detection. We observe that, especially for photographic content, real photographs tend to exhibit smoother and more stable color patterns, whereas synthetic images often show characteristic color imbalances introduced by neural generation. Based on this observation, we propose CoDA, a compact 1.48M-parameter detector built on a Noise-Quantization Probe, together with a theoretical analysis linking probe responses to color non-uniformity. Experiments show that CoDA achieves state-of-the-art performance on standard benchmarks and the best results on the challenging cross-domain evaluation of FakeForm, while remaining highly competitive in cross-model photorealistic settings. These results suggest that persistent generative artifacts can provide a practical foundation for efficient and robust AI-generated image detection. The models and FakeForm benchmark will be made publicly available.

[CV-235] ArtSplat: Feed-Forward Articulated 3D Gaussian Splatting from Sparse Multi-State Uncalibrated Views

Link: https://arxiv.org/abs/2605.24304
Authors: Inseo Lee, Yoonji Kim, Eugene Sohn, Jiwoong Lee, Jungmin You, Joonseok Lee, Jin-Hwa Kim
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Articulated object reconstruction from sparse-view images is an ill-posed problem that requires simultaneous inference of geometry and underlying articulation structure. Existing methods for articulated object reconstruction based on NeRF and 3D Gaussian Splatting (3DGS) typically rely on dense views or strong priors (e.g., depth maps, joint types, predefined number of joints) and require costly per-object optimization. In this paper, we propose ArtSplat, the first feed-forward framework for articulated 3D Gaussian Splatting. It reconstructs both geometry and joint parameters from sparse multi-view images across multiple articulation states in a single forward pass. To address the challenges of single-pass articulated reconstruction, we introduce a per-pixel joint map representation that enables the integration of joint parameter estimation into the feed-forward pipeline. We further propose a Cross-State Attention (CSA) mechanism with state tokens, which effectively captures discrete motion across input states. Experiments on 68 articulated objects from PartNet-Mobility, including both single- and multi-joint configurations, demonstrate that ArtSplat achieves competitive performance in both geometry and joint estimation, while being over 400 times faster than baselines.

[CV-236] Cross-Modal Action Recognition in Egocentric Video Using Mamba: Integrating RGB and Hand Skeleton Streams via CLS Token Fusion Strategies CVPR2026

Link: https://arxiv.org/abs/2605.24302
Authors: Juan Ignacio Bustos Gorostegui,Maria Elena Buemi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages, 2 figures, Egovis2026, CVPR2026

Abstract:Egocentric action recognition is a challenging task due to erratic camera motion, frequent hand occlusion, and the difficulty of maintaining consistent visual representations over time. In this work, we propose a cross-modal architecture that combines RGB video and temporal hand skeleton data within a unified Mamba-based framework, exploiting the linear time complexity of State Space Models (SSMs). Our architecture consists of three components: a VideoMamba module for visual feature extraction, a skeleton encoder built on a stack of Mamba blocks, and a fusion module that integrates both modalities into a single representation. A central contribution of this work is the design and evaluation of four Class (CLS) token mixing strategies for multimodal fusion: Naive, Average, Weighted and Context-based. These strategies differ in how the pretrained unimodal CLS tokens, whose role is to act as information sinks concentrating learned representations, are leveraged to initialize the mixed CLS token used for final classification. We evaluate all strategies on the H2O dataset. Experimental results show that the Average strategy achieves the best performance, yielding gains of over 10% Top-1 accuracy in the Tiny configuration and 2% in the Small configuration over the VideoMamba baseline.
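As a rough illustration of the Average and Weighted mixing strategies named in the abstract, the sketch below initializes a fused CLS token from two unimodal CLS vectors. The abstract does not give the exact formulas, so the element-wise mean, the convex weight `alpha`, and both function names are assumptions made for illustration, not the authors' implementation.

```python
from typing import List

Vector = List[float]


def average_mix(cls_rgb: Vector, cls_skeleton: Vector) -> Vector:
    """Element-wise mean of two unimodal CLS tokens (assumed form of 'Average')."""
    if len(cls_rgb) != len(cls_skeleton):
        raise ValueError("CLS tokens must share the same dimension")
    return [(a + b) / 2.0 for a, b in zip(cls_rgb, cls_skeleton)]


def weighted_mix(cls_rgb: Vector, cls_skeleton: Vector, alpha: float) -> Vector:
    """Convex combination of the two tokens (assumed form of 'Weighted').

    alpha is a hypothetical weight on the RGB token; 1 - alpha goes to
    the skeleton token.
    """
    if len(cls_rgb) != len(cls_skeleton):
        raise ValueError("CLS tokens must share the same dimension")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(cls_rgb, cls_skeleton)]


if __name__ == "__main__":
    rgb = [1.0, 3.0]
    skeleton = [3.0, 5.0]
    print(average_mix(rgb, skeleton))
    print(weighted_mix(rgb, skeleton, alpha=0.25))
```

In this reading, Average is the special case of Weighted with `alpha = 0.5`; whether the paper learns the weight or fixes it is not stated in the abstract.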

[CV-237] Plume Segmentation from MethaneSAT with Cross-Sensor Transfer Learning and Physics-Informed Postprocessing

Link: https://arxiv.org/abs/2605.24273
Authors: Manuel Pérez-Carrasco,Maya Nasr,Zhan Zhang,Apisada Chulakadabba,Javier Roger,Raia Ottenheimer,Sébastien Roche,Maryann Sargent,Chris Chan Miller,Daniel Varon,Jack Warren,Luis Guanter,Kang Sun,Jonathan Franklin,Jia Chen,Cecilia Garraffo,Xiong Liu,Ritesh Gautam,Steven Wofsy
Categories: Computer Vision and Pattern Recognition (cs.CV); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments: 35 pages, 20 figures, 9 tables

Abstract:Automated detection and masking of individual methane plumes from satellite imagery is important for operational emission attribution and quantification. We present a machine learning framework for plume detection from MethaneSAT retrieved column-averaged dry-air mole fractions of methane. We address two core challenges: the scarcity of labeled MethaneSAT data and the need for inference reliability across diverse atmospheric and surface conditions. We first demonstrate that Mask R-CNN with a ResNet-50 backbone outperforms U-Net semantic segmentation on both MethaneAIR (an airborne version of MethaneSAT) and MethaneSAT data, with pixel-level F1 score gains of 10.49 and 5.48 respectively. To address MethaneSAT data scarcity, we evaluate three cross-sensor transfer strategies leveraging MethaneAIR flights and synthetic plumes. Mask R-CNN with ResNet-50 fine-tuned from MethaneAIR pre-trained weights is the most effective strategy, achieving instance-level precision of 0.60 and a near-perfect recall of 0.98 at the baseline operating point. A physics-informed post-processing pipeline converts detections into two operationally distinct modes. The first is a high-sensitivity mode that applies morphological filtering and proximity-based merging for comprehensive emission screening, achieving precision of 0.71 and recall of 0.94. The second is a high-precision mode that additionally applies a distribution-based classifier for confident source attribution, achieving precision of 0.92 and recall of 0.70. Manual review of detections classified as false positives against our wavelet-based ground truth labels reveals that a meaningful fraction of cases correspond to real methane enhancements excluded by conservative labeling criteria, indicating that precision values reported are lower bounds on true detection performance… Our data and code are available at: this https URL
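To compare the three operating points quoted in the abstract on a single scale, one can combine each precision/recall pair into an F1 score using the standard harmonic-mean formula. The sketch below does that arithmetic; the resulting F1 values are derived here from the reported instance-level precision and recall and are not figures stated by the authors. The dictionary keys are informal labels.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Instance-level (precision, recall) pairs quoted in the abstract.
operating_points = {
    "baseline": (0.60, 0.98),
    "high-sensitivity": (0.71, 0.94),
    "high-precision": (0.92, 0.70),
}

for name, (precision, recall) in operating_points.items():
    print(f"{name}: F1 = {f1_score(precision, recall):.2f}")
```

Under this formula the three points come out at roughly 0.74, 0.81, and 0.80, which suggests the two post-processed modes trade precision against recall at a similar overall balance. These instance-level values are not comparable to the pixel-level F1 gains reported earlier in the abstract.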

[CV-238] Rethinking Continual Anomaly Detection on the Edge: Benchmarking Under Realistic Industrial Conditions

Link: https://arxiv.org/abs/2605.24251
Authors: Chad Weatherly,Sen Lin
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Continual anomaly detection (CAD) addresses the need for industrial inspection systems to adapt to evolving production conditions, yet existing methods share three critical gaps: unrealistic evaluation, no systematic comparison, and no consideration of edge deployment constraints. We introduce a unified benchmark combining discrete-task evaluation on structural and logical anomalies, a novel continuous drift protocol, the first head-to-head comparison of all published CAD methods, and computational efficiency profiling on edge hardware. Our results reveal that existing CAD methods do not consistently outperform traditional approaches with simple experience replay. Thus motivated, we propose DINOSaur, a training-free method combining a frozen DINOv3 backbone with spatially-indexed coreset memory and neighborhood-restricted anomaly scoring. DINOSaur achieves zero forgetting by construction, outperforms all evaluated methods across all five protocols, and runs at sub-100 ms inference on an NVIDIA Jetson Orin Nano, with on-device adaptation to new tasks in under 30 seconds.

[CV-239] GIBLy: Improving 3D Semantic Segmentation through an Architecture-Agnostic Lightweight Geometric Inductive Bias Layer

Link: https://arxiv.org/abs/2605.24243
Authors: Diogo Lavado,Alessandra Micheletti,Clàudia Soares
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:In 3D scene understanding, deep learning models rely on large models and extensive training to capture basic geometric structures that are present in the 3D data. However, existing methods lack explicit mechanisms to incorporate geometric information, such as learnable primitive shapes, often necessitating large models and more training data which in turn increases cost and can limit generalization. We introduce GIBLy, a lightweight geometric inductive bias layer that integrates learnable geometric priors into 3D segmentation pipelines. GIBLy enhances existing architectures – whether MLP-based, convolution-based, or transformer-based – by providing features aligned with simple geometric shapes (and thus human-interpretable) that improve segmentation performance with minimal computational overhead. We validate our approach across multiple 3D semantic segmentation benchmarks, demonstrating consistent performance gains, including up to +11.5% mIoU on TS40K with PTV3, while adding only 58K extra parameters. Our results highlight the benefit of explicitly encoding geometric structure to support accurate and efficient 3D scene understanding, with a lightweight add-on layer.

[CV-240] Radiuma: A Unified Zero-Code Executable Graphical Workflow Generator for Reproducible and Shareable Medical Image Analysis and Machine Learning

Link: https://arxiv.org/abs/2605.24201
Authors: Mohammad Salmanpour,Mehrdad Oveisi,Isaac Shiri,Arman Rahmim
Categories: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments:

Abstract:Medical image computing software is essential for identifying imaging biomarkers that can support diagnosis, prognosis, treatment planning, and clinical research. However, the lack of standardized, user-friendly, and reproducible software environments has limited the broader adoption of advanced medical image analysis workflows. We present Radiuma, a freely available modular platform designed to support reliable and reproducible medical image analysis across multiple modalities and file formats. Radiuma integrates image reading, visualization, registration, fusion, processing, segmentation, radiomics feature extraction, and machine learning modules for classification, regression, and clustering. Its modular design allows users to execute each component independently or connect modules through a visual workflow system, where the output of one step can be graphically passed to the next. This enables the creation of custom, executable, and reproducible multi-step pipelines without requiring extensive programming expertise. Results from each module can be inspected directly in the visualization window, providing immediate feedback on processing quality and workflow accuracy. Radiuma also supports saving and sharing customized workflows, promoting transparency, reusability, and consistency across collaborative studies. By combining flexibility, usability, and standardized analysis tools, Radiuma provides a practical environment for radiomics and machine learning research in clinical and translational settings. The platform is designed to be accessible to users with diverse expertise, including radiologists, physicists, clinicians, and data scientists.

[CV-241] Single View Seafloor Recovery from Imaging Sonar via Differentiable Rendering

Link: https://arxiv.org/abs/2605.24195
Authors: Sevan Brodjian,Michael Hobley,Pietro Perona
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Sonar is often the only modality suitable for high-resolution imaging underwater due to light attenuation and turbidity. Forward-looking imaging sonar provides measurements over range and horizontal angle but collapses vertical structure into a flat image, creating ambiguities that make 3D recovery challenging. A common use case for imaging sonar is underwater terrain mapping (bathymetry), yet current methods require many views, expensive multi-sensor setups, or significant training data, which limits use and adaptability to new environments. We present a training-free method that recovers bathymetry from a single sonar image in under 30 seconds via differentiable rendering, conditioned on a known seafloor tilt. To our knowledge, this is the first differentiable rendering approach for single-view height recovery in sonar. Our method implements differentiable sonar ray tracing and optimizes an explicit height field to reproduce the target image. On synthetic datasets, our approach outperforms a supervised CNN under distribution shift and remains close on rough terrain, while the CNN wins in-distribution. By modeling physically grounded priors of the sonar process, our method adapts across sensor configurations and environments without training data.
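The elevation ambiguity described in the abstract can be seen in a few lines: an imaging sonar records range and horizontal angle, so two points that differ only in elevation land in the same image cell. The sketch below assumes a sensor frame with x forward, y to the side, and z up; the frame convention and the function names are illustrative assumptions and are not taken from the paper.

```python
import math
from typing import Tuple


def point_from_spherical(rng: float, azimuth: float, elevation: float) -> Tuple[float, float, float]:
    """Convert (range, azimuth, elevation) in radians to a 3D point in the sensor frame."""
    horizontal = rng * math.cos(elevation)
    return (
        horizontal * math.cos(azimuth),
        horizontal * math.sin(azimuth),
        rng * math.sin(elevation),
    )


def sonar_measurement(x: float, y: float, z: float) -> Tuple[float, float]:
    """Return the (range, azimuth) an imaging sonar observes; elevation is not recorded."""
    rng = math.sqrt(x * x + y * y + z * z)
    azimuth = math.atan2(y, x)
    return rng, azimuth


if __name__ == "__main__":
    # Two points at the same range and azimuth but different elevation angles.
    low = point_from_spherical(5.0, 0.3, -0.10)
    high = point_from_spherical(5.0, 0.3, 0.10)
    print(sonar_measurement(*low))
    print(sonar_measurement(*high))
```

Both calls report the same range and azimuth even though the points sit at different heights, which is why a single image needs an extra constraint, such as the known seafloor tilt mentioned in the abstract, to recover height.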

[CV-242] Filtered Posterior Mean Collections: A Unified Framework for Analytical Models of Diffusion Generalization

Link: https://arxiv.org/abs/2605.24192
Authors: Matthew Niedoba,Berend Zwartsenberg,Frank Wood
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 27 Pages, 7 figures

Abstract:The neural-network denoising functions which form the backbone of image diffusion models are remarkably consistent in their generalization behaviour across a wide variety of network architectures and training procedure hyperparameters. A recent line of research has sought to model the outputs of these networks by aggregating posterior weighted averages of training dataset patches. In this work, we consolidate these approaches into a unified model class which we call Filtered Posterior Mean Collections (FPMCs). We define this model class using query precision vectors, response weights, and source distributions, and illustrate that existing methods are recoverable with specific choices of these design axes. Investigating each axis in turn, we find that FPMC performance can be improved with soft relaxations of prior patch-based methods, and through augmentations of source distributions. Applying these findings to an existing FPMC, we demonstrate consistent sample improvement across three natural image datasets.
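The "posterior weighted averages" the abstract refers to build on a standard result: for Gaussian noise of scale sigma added to samples from a finite training set, the posterior mean of the clean sample is a softmax-weighted average of the training items, with logits equal to the negative squared distance divided by 2·sigma². The sketch below implements only that basic building block on whole vectors. The query precision vectors, response weights, and source distributions that define an FPMC are not modeled, and the function name is an assumption.

```python
import math
from typing import List, Sequence


def posterior_mean(noisy: Sequence[float], dataset: Sequence[Sequence[float]], sigma: float) -> List[float]:
    """Posterior mean of the clean sample under an empirical prior and Gaussian noise.

    Each training item is weighted by exp(-||noisy - item||^2 / (2 sigma^2)),
    normalized over the dataset (a softmax over negative scaled distances).
    """
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    logits = [
        -sum((a - b) ** 2 for a, b in zip(noisy, item)) / (2.0 * sigma ** 2)
        for item in dataset
    ]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    return [
        sum(w * item[d] for w, item in zip(weights, dataset)) / total
        for d in range(len(noisy))
    ]


if __name__ == "__main__":
    data = [[0.0, 0.0], [10.0, 10.0]]
    print(posterior_mean([0.1, 0.1], data, sigma=0.5))  # dominated by the first item
    print(posterior_mean([5.0, 5.0], data, sigma=0.5))  # equidistant, so the midpoint
```

At low noise the estimate snaps to the nearest training item, and at high noise it approaches the dataset mean; patch-based variants apply the same averaging to local patches rather than whole images.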

[CV-243] Loki: Representation over Architecture for Diffusion-Based Portrait Animation

Link: https://arxiv.org/abs/2605.24176
Authors: Pouyan Navard,Sernam Lim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Portrait animation transfers a driver clip’s facial expression and head pose onto a single reference image while preserving the reference’s identity. State-of-the-art diffusion systems address this by stacking trained modules for expression, pose, and identity in turn, paying for it in trainable parameters, proprietary corpora, and residual entanglement between the very axes the system is meant to control independently. This complexity compensates for an upstream choice – learning facial expression and head pose from RGB, a representation in which identity, pose, and expression are inseparable without being learned apart. Loki steps out of RGB on the conditioning path. Driver expression and head pose are encoded by a face model whose parameter axes are identity-orthogonal by construction, then rasterised into a spatial map that the diffusion backbone consumes natively. Identity is routed separately through the diffusion backbone’s own pretrained features via lightweight key-value injection. Because the parametric representation factorises identity from expression and pose, cross-ID reenactment reduces to a coefficient substitution at inference, requiring no cross-ID training data. Loki requires ~43% fewer inference parameters than leading diffusion baselines and is trained on 1496x fewer video samples. We define two metrics that directly measure whether the generated head pose trajectory and facial expression followed the driver’s – the questions portrait animation actually asks; Loki leads or co-leads on both.

[CV-244] EchoVQA: Enabling Conversational Assistance for Point-of-Care Cardiac Ultrasound

Link: https://arxiv.org/abs/2605.24159
Authors: Filippos Bellos,Yutong Li,Jessie N Dong,Zaiyang Guo,Emily Mackay,Yayuan Li,Yannis Avrithis,Alison Pouch,Jason J. Corso
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Point-of-care transthoracic echocardiography (TTE) enables cardiac assessment in virtually any clinical setting, yet its diagnostic utility remains constrained by the expertise required for image acquisition and interpretation. Visual question answering (VQA) offers a promising paradigm for bridging this expertise gap through interactive clinical assistance, but existing echocardiography VQA datasets are limited in scale, restricted to high-quality images, and only cover a few views. We introduce EchoVQA, the first large-scale VQA dataset for echocardiography, comprising 14,299 images and 74,819 question-answer pairs. The dataset integrates public sources (EchoNet-Dynamic, CAMUS) with our own point-of-care acquisitions from two handheld probes (Lumify, Clarius), spanning diverse views and including both high-quality and suboptimal images. Uniquely, EchoVQA includes acquisition guidance questions to help users optimize transducer positioning toward a diagnostic apical 4-chamber view for left ventricular ejection fraction estimation – a challenging task for novice operators in point-of-care settings. We further develop a parameter-efficient method based on multimodal learnable prompts achieving state-of-the-art performance on most benchmarks, including EchoVQA, with significantly fewer trainable parameters than existing state-of-the-art approaches.

[CV-245] ImPartial: Multi-channel Whole-Cell Segmentation using Partial Annotations MICCAI’26

Link: https://arxiv.org/abs/2605.24128
Authors: Gunjan Shrivastava,Saad Nadeem
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI’26 Early Accept

Abstract:Accurate cell segmentation in pathology images typically requires dense pixel-wise annotations, which are costly and time-consuming to obtain. This challenge is especially important for emerging biological imaging modalities and multiplexed datasets with variable channel configurations, where expert-labeled data are scarce. In this work, we introduce ImPartial, a deep learning framework designed to achieve state-of-the-art segmentation performance in low-annotation regimes using sparse scribbles and limited supervision. ImPartial augments the segmentation objective via self-supervised multi-channel quantized imputation. This approach leverages the observation that perfect pixel-wise reconstruction or denoising of the image is not needed for accurate segmentation, and thus introduces a self-supervised classification objective that better aligns with the overall segmentation goal. We demonstrate that ImPartial achieves performance on par with fully supervised models while requiring substantially fewer annotations. Extensive experiments on benchmark multiplexed cellular imaging and single-plex clinical brightfield immunohistochemistry datasets show consistent improvements over strong baselines with only partial annotations. All benchmark datasets and code are available via our GitHub: this https URL.

[CV-246] COSY: Compositional 3DGS Synthesis for Disentangled Human Head Editing

Link: https://arxiv.org/abs/2605.24114
Authors: Florian Barthel,Shalini De Mello,Koki Nagano,Wieland Morgenstern,Anna Hilsmann,Peter Eisert
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent 3D Gaussian Splatting (3DGS) GANs for human heads synthesize and render photorealistic 3D models in real-time and offer a vast variety in identity and appearance. However, controlling specific semantic attributes such as hair color or glasses remains challenging, as edits in the entangled latent space often induce unintended changes in identity or appearance. Although there are several methods that aim to disentangle the latent space post training by estimating directions that only modify certain features, these methods cannot guarantee complete disentanglement and often require pre-trained classifiers. In our approach, we propose a new generator architecture that synthesizes components, such as hair, skin, glasses, and torso, completely independently. This allows for changing the latent vector for one region while keeping the remaining parts fixed. Further, we achieve this separation using only sparse information such as the hair or skin color, eliminating the requirement of segmentation masks or geometric priors, often seen in prior work. To ensure matching shape and lighting conditions during editing, we allow minimal shared information via context tokens between the independent generators. These tokens even allow us to control the shape and light, without any prior annotation. Compared to existing works on GAN-based generation and editing, our method shows better disentanglement, more precise editing control, and competitive visual quality.

[CV-247] D2-V2X: Depth-Driven Cooperative V2X Reasoning for Autonomous Driving CVPR2026

Link: https://arxiv.org/abs/2605.24098
Authors: Kevin Richard,Alphin Varghese,Colin Pham,David Oh,Srijan Das
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the DriveX Workshop at CVPR 2026 (Non-archival)

Abstract:Single-vehicle Vision-Language Models (VLMs) are fundamentally constrained by sensor occlusions. While Vehicle-to-Everything (V2X) systems mitigate this, current benchmarks lack the cooperative reasoning required for resolving ambiguities in complex environments. We introduce D2-V2X, a spatially-aware Question-Rationale-Answer (QRA) benchmark featuring 8,500 triplets derived from multimodal vehicle and infrastructure sensors. We additionally establish a baseline that aligns 3D LiDAR features with the VLM’s latent space. By enforcing natural language Chain-of-Thought rationales prior to structured JSON outputs, our model is forced to explicitly articulate spatial relations. Our experiments demonstrate that grounding VLMs in cooperative LiDAR achieves 24.4% recall in identifying occluded hazards compared to near-zero in zero-shot models and reduces spatial estimation error for visible objects by 77% compared to the zero-shot baseline. While the model achieves a functional decision-making F1-score of 53.5, we identify 3D-to-2D projection as a fundamental bottleneck in current VLM architectures, establishing a new baseline for future innovation. Data, code, and trained models available at this https URL

[CV-248] WideDepth: Millimeter-Accurate Benchmark for Fisheye Depth Estimation ICRA

Link: https://arxiv.org/abs/2605.24074
Authors: Ilia Indyk,Ignat Penshin,Ivan Sosin,Maxim Monastyrny,Aleksei Valenkov,Ilya Makarov
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2026

Abstract:Fisheye cameras are increasingly adopted in robotics for near-field manipulation, navigation, and immersive perception, yet indoor depth benchmarks with accurate ground truth are still missing. To address this, we introduce WideDepth - the first indoor dataset for fisheye depth estimation, featuring 101 scenes containing 5K high-resolution stereo pairs labeled with millimeter-level ground truth depth and disparity. Our dataset also includes paired pinhole and fisheye samples across varying fields of view and baselines in both horizontal and vertical stereo setups. We further propose a method to adapt pinhole-trained stereo models to fisheye images and introduce a novel stereo fisheye image generation pipeline based on high-resolution LiDAR scans. Leveraging these methods, we thoroughly evaluate state-of-the-art monocular depth, stereo matching, and depth completion models on our benchmark. Additionally, we provide 18K LiDAR-derived sparse depth training samples, achieving up to a 62% performance boost on fisheye data when fine-tuning pinhole-based stereo models. In summary, the high precision and versatility of our benchmark set a strong foundation for advancing research in fisheye depth estimation and robotics perception. Project page: this https URL

[CV-249] Distance-Aware Joint Spatio-Temporal Graph Contrastive Learning for Major Depressive Disorder Diagnosis

Link: https://arxiv.org/abs/2605.24066
Authors: Muhammad Asif Hasan,Yanming Zhu,Xuefei Yin,Alan Wee-Chung Liew
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Major depressive disorder (MDD) is a common neuropsychiatric condition whose accurate diagnosis from resting-state functional magnetic resonance imaging (rs-fMRI) remains difficult. Dynamic functional connectivity (DFC) captures time-varying interactions among brain regions and provides rich spatio-temporal information, yet current DFC-based methods face three limitations: sliding-window Pearson correlation yields noisy estimates sensitive to window length and motion artifacts; correlation-derived node features do not fully exploit frequency-domain properties of blood-oxygen-level-dependent (BOLD) signals; and most spatio-temporal graph models handle spatial structure and temporal dynamics in separate stages, restricting their ability to represent coupled brain network evolution. To overcome these issues, we reformulate DFC learning as joint spatio-temporal graph representation learning under a Hawkes-process-inspired temporal dependency prior and propose HWSTCL, a two-stage framework built on a reliability-refined joint spatio-temporal graph with a kernel-weighted pretraining objective. Within each temporal window, BOLD signals are encoded as spectral node descriptors and functional edges are refined by an exponential distance-decay prior that down-weights less reliable long-range connections. The joint graph is then formed by linking each region to itself across future windows through a Hawkes-inspired exponential kernel, allowing spatial and temporal information to be propagated together during message passing. A kernel-weighted contrastive objective further promotes temporal consistency for each region across windows while reducing redundant similarity between different regions. Experiments on a benchmark rs-fMRI dataset show that HWSTCL outperforms recent baselines and yields coherent spatio-temporal representations for MDD diagnosis.
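The two exponential weightings mentioned in the abstract can be sketched as follows: a distance-decay factor that shrinks functional edges between far-apart regions, and a Hawkes-style kernel that weights a region's link to itself in later windows. The abstract gives no formulas, so the exact functional forms, the parameters `length_scale` and `decay`, and the function names are all assumptions made for illustration.

```python
import math


def distance_decay(edge_weight: float, distance: float, length_scale: float) -> float:
    """Down-weight a functional edge by exp(-distance / length_scale) (assumed form)."""
    if length_scale <= 0.0:
        raise ValueError("length_scale must be positive")
    return edge_weight * math.exp(-distance / length_scale)


def hawkes_kernel(lag: int, decay: float) -> float:
    """Weight of a self-link from window t to window t + lag (assumed form).

    Only future windows (lag > 0) are linked; the weight decays
    exponentially with the number of windows in between.
    """
    if lag <= 0:
        return 0.0
    return math.exp(-decay * lag)


if __name__ == "__main__":
    # The same correlation value counts for less between distant regions.
    print(distance_decay(0.8, distance=10.0, length_scale=25.0))
    print(distance_decay(0.8, distance=50.0, length_scale=25.0))
    # Temporal self-links fade with increasing lag.
    print([round(hawkes_kernel(lag, decay=0.5), 3) for lag in range(0, 4)])
```

Both weightings are monotone decays, so in this reading nearby regions and adjacent windows dominate message passing while long-range and long-lag connections contribute progressively less.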

[CV-250] fMRI-Diffusion: Generating fMRI Time Series Via a Temporal Transformer Diffusion Model for Major Depressive Disorder Diagnosis

Link: https://arxiv.org/abs/2605.24065
Authors: Muhammad Asif Hasan,Yanming Zhu,Xuefei Yin,Alan Wee-Chung Liew
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diagnosing Major Depressive Disorder (MDD) from functional magnetic resonance imaging (fMRI) using functional connectivity (FC) analysis requires large amounts of labeled data that are scarce in clinical settings. Existing augmentation methods synthesize FC matrices, which compress fMRI recordings into static pairwise summaries and discard temporal information. We propose fMRI-Diffusion, a framework that synthesizes region-of-interest (ROI)-level fMRI time series rather than FC matrices. A Temporal Transformer serves as the denoising network within a denoising diffusion probabilistic model, treating each time point as a token to capture temporal dependencies through self-attention. A supervised pretraining strategy initializes the Transformer with task-relevant representations before diffusion training, and FC matrices are derived from the synthesized time series for classification. Experiments on the REST-meta-MDD dataset show that augmenting training data with synthetic time series consistently improves diagnostic accuracy across ten classifiers, six parcellation atlases, and three acquisition sites. The method outperforms five recent FC-based synthesis approaches, with accuracy gains of up to 3.7 percentage points over the strongest baseline. Ablation studies confirm the contributions of both the Transformer-based denoiser and the pretraining strategy. Distributional fidelity metrics remain below 0.06 across all conditions, indicating close agreement between real and synthetic distributions. These findings suggest that synthesizing fMRI time series before FC computation preserves temporal information lost in matrix-level augmentation and provides a practical strategy for MDD diagnosis under limited data.
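The step from ROI-level time series to a functional connectivity matrix can be sketched as a pairwise correlation over regions. The abstract only says that FC matrices are derived from the synthesized series; using Pearson correlation, which is the conventional choice for static FC, is an assumption here, as are the function names and the toy data.

```python
import math
from typing import List, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation between two equally long time series."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("series must have the same length of at least 2")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_a * var_b)


def functional_connectivity(roi_series: Sequence[Sequence[float]]) -> List[List[float]]:
    """Build an ROI-by-ROI matrix of pairwise Pearson correlations."""
    return [[pearson(a, b) for b in roi_series] for a in roi_series]


if __name__ == "__main__":
    # Three toy ROI signals over four time points (illustrative values only).
    series = [
        [1.0, 2.0, 3.0, 4.0],
        [2.0, 4.0, 6.0, 8.0],
        [4.0, 3.0, 2.0, 1.0],
    ]
    for row in functional_connectivity(series):
        print([round(value, 2) for value in row])
```

Because the matrix depends only on pairwise correlations, any reordering of time points that preserves them yields the same FC, which illustrates the temporal information that the abstract says matrix-level augmentation discards.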

[CV-251] EMMA: Extracting Multiple physical parameters from Multimodal Data CVPR2026

Link: https://arxiv.org/abs/2605.24047
Authors: Farhat Shaikh,Ayan Banerjee,Sandeep Gupta
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2026 (main conference)

Abstract:We introduce EMMA, a physics-informed multimodal framework that recovers all identifiable dynamical parameters of a system directly from raw video, audio, and image-based time-series observations. Unlike prior video-only approaches that struggle with occluded states, hidden actuation inputs, or assumptions about known initial conditions and coordinate frames, EMMA performs joint inference of explicit parameters, implicit dynamical components, and calibration invariants within a unified continuous-time model. EMMA leverages a Liquid Time-Constant (LTC) network to learn latent dynamics from heterogeneous modalities while a physics-constrained loss enforces consistency with the governing differential equations. A unified feature pipeline enables consistent alignment across video trajectories, acoustic signatures, and chart-derived measurements, allowing EMMA to estimate parameters under forced, implicit, and multivariate dynamics without requiring segmentation masks, differentiable rendering, or specialized sensors. Across 100+ scenarios including five standard dynamical benchmarks (75 Delfys videos), real-world rover and quadrotor systems with hidden inputs, and simulation-chart case studies spanning biological and chaotic systems, EMMA delivers robust multi-parameter recovery and significantly outperforms existing single-modality and equation-discovery baselines. Our results establish EMMA as a general, scalable solution for physics-consistent model extraction from opportunistic multimodal data. Code and data are available at: this https URL

[CV-252] Learning to See Like Humans: Gaze-Aligned Cycling Safety Prediction ITSC

Link: https://arxiv.org/abs/2605.24040
Authors: Luís Maria Perdigão,Miguel Costa,Carlos Santiago,Manuel Marques
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to be published as part of the 2026 IEEE 29th International Conference on Intelligent Transportation Systems (ITSC), Naples, Italy, September 15-18, 2026

Abstract:Cycling delivers significant public-health and environmental benefits, yet its uptake in cities is often limited by perceived safety. When street environments appear unsafe, individuals are less likely to cycle, making perception a key barrier to adoption. Recent work has shown that pairwise comparisons of street-view images provide a scalable way to learn subjective safety judgments. However, existing approaches do not explicitly model human visual attention, which plays a central role in how humans perceive safety. We propose an Eye-Tracking-Guided Perceived Cycling Safety framework (EG-PCS) that integrates gaze data into a pairwise learning pipeline based on vision transformers. By supervising the model’s attention mechanism with eye-tracking signals, we encourage alignment between learned attention maps and human fixation patterns. Experiments show that gaze-guided models achieve similar ranking performance compared to state-of-the-art approaches while producing attention maps that more accurately reflect human visual attention behavior. Our results demonstrate that incorporating eye-tracking information enhances both predictive accuracy and interpretability in perception-based urban analytics.

[CV-253] Mode-as-Sequence: Translating Multimodal Motion Prediction into Unified Sequential Mode Modeling

Link: https://arxiv.org/abs/2605.24037
Authors: Zikang Zhou,Haibo Hu,Xinhong Chen,Yifan Zhang,Nan Guan,Yung-Hui Li,Chun Jason Xue,Jianping Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multimodal motion forecasting is inherently under-supervised: each training scene provides only one realized future, yet multiple plausible futures exist. This sparse supervision often leads to mode collapse (redundant hypotheses and insufficient mode coverage) and unreliable confidence ranking when predicting a small set of trajectories. We propose Mode-as-Sequence, a unified decoding framework that translates an unordered mode set into an ordered mode sequence and explicitly models mode-to-mode dependency. Under this framework, we develop two complementary instantiations. ModeSeq performs recurrent mode decoding, where each mode is generated conditioned on the previously generated modes, encouraging diverse, non-redundant hypotheses with calibrated confidence ordering. To remove the mode-by-mode autoregressive bottleneck, we further propose Parallel ModeSeq, which preserves the same causal dependency using masked mode-to-mode self-attention while decoding all modes in a single forward pass, enabling efficient large-K inference and scalable joint-scene prediction. To learn representative modes and calibrated confidence under sparse labels, we introduce Early-Match-Take-All (EMTA) and its joint-scene extension MA-EMTA, together with a lightweight ranking regularizer that reduces confidence inversions. Extensive experiments on large-scale benchmarks demonstrate consistent improvements in both ranking-oriented metrics and best-of-K accuracy across datasets, horizons, and object types. In the Waymo Open Dataset challenges, ModeSeq achieves 1st place in the 2024 LiDAR-free motion prediction track, and Parallel ModeSeq achieves 1st place in the 2025 Interaction Prediction Challenge, validating the effectiveness of Mode-as-Sequence for both accuracy and efficiency.
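The masked mode-to-mode self-attention described in the abstract can be pictured as ordinary causal attention applied over the K modes instead of over time steps: mode i may attend only to modes that come no later in the sequence, so all modes are decoded in one pass while the ordering dependency is kept. The sketch below shows the mask and a masked softmax over raw scores. Letting a mode attend to itself, and the function names, are assumptions, and the scores are placeholder values rather than model outputs.

```python
import math
from typing import List


def causal_mode_mask(num_modes: int) -> List[List[bool]]:
    """mask[i][j] is True when mode i may attend to mode j (j <= i)."""
    return [[j <= i for j in range(num_modes)] for i in range(num_modes)]


def masked_softmax(scores: List[List[float]], mask: List[List[bool]]) -> List[List[float]]:
    """Row-wise softmax that assigns zero weight to masked-out positions."""
    result = []
    for row, mask_row in zip(scores, mask):
        allowed = [s for s, keep in zip(row, mask_row) if keep]
        peak = max(allowed)  # the diagonal is always allowed, so this is non-empty
        exps = [math.exp(s - peak) if keep else 0.0 for s, keep in zip(row, mask_row)]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


if __name__ == "__main__":
    mask = causal_mode_mask(3)
    scores = [[0.0, 1.0, 2.0], [0.0, 0.0, 5.0], [1.0, 1.0, 1.0]]  # placeholder scores
    for row in masked_softmax(scores, mask):
        print([round(w, 3) for w in row])
```

With this mask the first mode sees only itself, the second splits its attention between the first two, and the last sees all modes, mirroring the recurrent variant's conditioning on previously generated modes without a step-by-step loop.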

[CV-254] Towards Large Model Feature Coding

Link: https://arxiv.org/abs/2605.24025
Authors: Youwei Pang, Changsheng Gao, Dong Liu, Huchuan Lu, Weisi Lin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract: Large models have delivered remarkable performance across a wide range of perception and generation tasks, yet practical deployment is increasingly constrained by computational and memory budgets, as well as privacy requirements. Split execution alleviates these constraints by partitioning computation across devices, but it inevitably introduces intensive transmission and storage of intermediate features. Unlike conventional feature coding for CNNs that typically targets homogeneous spatial activation maps, modern large models generate heterogeneous features with varying statistical distributions and compression tolerances, e.g., multi-level/multi-modal representations and autoregressive context caches. These characteristics necessitate treating large model feature coding (LaMoFC) as a fundamental system component and call for a systematic evaluation framework. In this paper, we present a comprehensive benchmark and evaluation framework for LaMoFC. We first build the feature dataset LaMoFCBench, covering diverse task requirements across 4 categories and 16 scenarios while integrating widely adopted architectures and various split-computing settings. We then specify representative split points according to practical application scenarios to extract intermediate features, establishing a unified pipeline for fair and reproducible comparisons. Finally, we benchmark mainstream universal feature codecs, exposing the profound misalignment between existing coding paradigms and the heterogeneous nature of large model features. These findings reveal that LaMoFC demands a fundamental departure from existing paradigms, and LaMoFCBench provides the shared empirical foundation to drive this transition. The data and code will be available at this https URL.

[CV-255] Mitigating Hallucinations in Large Vision-Language Models via Causal Route Gating ICML2026

Link: https://arxiv.org/abs/2605.24024
Authors: Zhe Cheng, Wenyu Chen, Fode Zhang, Dehuan Shen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted as a Spotlight Paper at ICML 2026. 33 pages, 8 figures

Abstract:Large vision-language models (LVLMs) often hallucinate content that is fluent yet unsupported by the image, limiting their reliability in real-world deployment. We show that a key failure mode arises from route competition: even when visual tokens receive attention, the final token decision can be dominated by the textual pathway, causing the decoder to follow linguistic priors over visual evidence. To mitigate this, we propose a training-free, decision-aligned intervention that decomposes each attention head into a visual route and a text route, and estimates their token-level effects using an efficient one-forward/one-gradient approximation. These estimates reveal route conflict within heads and identify prior-dominant ones, enabling selective suppression of only the text route while keeping the visual route intact. Across five benchmarks spanning discriminative and generative settings, our method consistently reduces hallucination-related errors across models with limited impact on overall multimodal performance, while incurring a modest inference-time overhead.

[CV-256] Soft Tuy-Completeness for Robust Projection Selection in Cone-Beam CT

Link: https://arxiv.org/abs/2605.24023
Authors: Linda-Sophie Schneider, Andreas Maier
Subjects: Computer Vision and Pattern Recognition (cs.CV); Discrete Mathematics (cs.DM)
Comments: Preprint

Abstract: This work introduces a continuous soft near-orthogonality score and a resolution-aware saturated coverage objective for projection selection in region-of-interest focused cone-beam CT, grounded in Tuy’s completeness theory. Replacing the binary hit-or-miss model of classical Tuy completeness with a graded, differentiable formulation preserves a direct link to achievable feature sizes while enabling both efficient approximate and exact optimisation. We establish that the underlying discrete decision problems are NP-complete via polynomial-time reductions from Set Cover, motivating a submodular greedy algorithm with proven (1-1/e) approximation guarantees and a mixed-integer linear program (MILP) that provides certified optimality bounds. The MILP serves as a quality certificate for the greedy solution rather than a competing optimiser. The primary empirical finding confirms this relationship: across a systematic benchmark spanning six target regions, multiple projection budgets, and four controlled occlusion conditions, the pooled median greedy-to-MILP objective ratio was 0.998, with a substantial fraction of cases certified globally optimal. A binary formulation is included as a diagnostic baseline; it strengthens hard directional completeness but is weaker on the continuous coverage scale. We additionally introduce Effective Spatial Resolution (ESR), a physically interpretable trajectory-level diagnostic that maps directional sampling gaps to achievable feature sizes. ESR correlates reliably with matched reconstruction quality across projection budgets and occlusion levels, providing a practical bridge between the selection stage and the image domain without requiring reconstruction.
Submission history: From Linda-Sophie Schneider, [v1] Wed, 20 May 2026 12:37:15 UTC (3,669 KB)
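To make the greedy step in this abstract concrete, here is a minimal sketch of greedy selection under a saturated coverage objective. It assumes a generic objective of the form f(S) = Σ_t min(cap, Σ_{p∈S} score[p][t]), which is monotone submodular, so the classical greedy rule under a cardinality budget carries the (1-1/e) guarantee. The exact objective, the score definition, and all names and numbers below are illustrative assumptions, not the paper's formulation.

```python
def saturated_coverage(selected, scores, cap=1.0):
    """Sum over targets of min(cap, total soft score from the selected projections)."""
    num_targets = len(next(iter(scores.values())))
    return sum(
        min(cap, sum(scores[p][t] for p in selected))
        for t in range(num_targets)
    )

def greedy_select(scores, budget):
    """Pick up to `budget` projections, each time adding the largest marginal gain."""
    selected = []
    current = 0.0
    for _ in range(budget):
        best, best_gain = None, 0.0
        for p in scores:
            if p in selected:
                continue
            gain = saturated_coverage(selected + [p], scores) - current
            if gain > best_gain:
                best, best_gain = p, gain
        if best is None:  # no remaining projection improves the objective
            break
        selected.append(best)
        current += best_gain
    return selected, current

# Hypothetical soft scores: projection -> score per target direction.
scores = {
    "p0": [0.9, 0.1, 0.0],
    "p1": [0.5, 0.2, 0.0],
    "p2": [0.0, 0.6, 0.7],
    "p3": [0.1, 0.3, 0.4],
}
chosen, value = greedy_select(scores, budget=2)
```

In this toy instance the second pick is the projection that covers the still-uncovered first direction, not the one with the next-highest standalone score; saturation is what produces that diminishing-returns behaviour.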

[CV-257] Machine Intelligence that Understands Visual and Linguistic Information and Interacts with Humans and Environments

Link: https://arxiv.org/abs/2605.24020
Authors: Van Quang Nguyen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Doctoral dissertation, Tohoku University, 2022. Uploaded for archival purposes. 146 pages

Abstract: Advancements at the intersection of computer vision and natural language processing are crucial for applications like assistive tech, multimedia querying, and robotics. This dissertation proposes novel architectures to improve intelligent agents across three key vision-language tasks: image captioning, visual dialog, and interactive instruction following. First, we address limitations in visual representation for image captioning. Traditional models rely on region-based features from CNN detectors, which lack global context and suffer from high computational overhead. We propose GRIT (Grid and Region-based Image captioning Transformer), a transformer-only architecture. By integrating grid and region features using a DETR-based detector, GRIT enables end-to-end training and outperforms prior methods in both inference accuracy and speed. Second, we tackle visual dialog, which requires multi-turn conversation about an image. The challenge lies in efficiently modeling interactions between multiple inputs (image, question, history). We introduce LTMI (Light-weight Transformer for Many Inputs). Utilizing a specialized attention block, an LTMI layer matches the representational power of a standard Transformer extension while utilizing less than one-tenth of its parameters, as validated on the VisDial dataset. Finally, we study interactive instruction-following for embodied AI using the ALFRED dataset. We propose a framework featuring a two-stage instruction interpretation: it first decodes language directives independently of visual context to predict a tentative action-object sequence, which is then fused with visual features for final execution. Using multiple egocentric views and hierarchical attention, our method accurately localizes objects and achieves a state-of-the-art unseen success rate of 8.37%.
Submission history: From Van-Quang Nguyen, [v1] Wed, 20 May 2026 06:11:25 UTC (22,466 KB)

[CV-258] MGVQ: Synergizing Multi-dimensional Sensitivity-Aware and Gradient-Hessian Fusion for Vector Quantization

Link: https://arxiv.org/abs/2605.24019
Authors: Zhong Wang, Zukang Xu, Xing Hu, Dawei Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Vision-Language Models (VLMs) achieve outstanding performance, yet their huge model size severely hinders deployment on edge devices with limited resources. As an efficient model compression technique, vector quantization (VQ) excels in ultra-low-bit representation, which maps model weights to discrete codewords in a compact codebook to cut memory consumption and transmission overhead while preserving model capability. Direct VQ application to VLMs still has two core limitations. First, cross-modality weight distribution differences brought by visual and textual inputs cannot be well fitted by a single unified codebook. Second, current second-order error compensation ignores first-order gradient information, causing weight deviation from pre-trained optimal states, gradient drift and biased compensation results. This work proposes MGVQ, a novel vector quantization framework integrating multi-dimensional sensitivity perception and gradient-Hessian fusion. It consists of two core modules: sensitivity-guided structured mixed-precision quantization dynamically assigns different bit-widths according to channel sensitivity via combined global and local sensitivity analysis for refined resource allocation; gradient-aware second-order error compensation embeds first-order gradients into error correction, and adopts Kronecker and Block-LDL decomposition to ensure low computational cost. Extensive experiments on mainstream VLMs including LLaVA-onevision, InternVL2 and Qwen2-VL verify the effectiveness of MGVQ. In 2-bit quantization settings, MGVQ surpasses existing advanced post-training quantization methods significantly, achieving a maximum accuracy improvement of 4.9 points (71.4% vs 67.0% on InternVL2-26B). The proposed method realizes stable and efficient ultra-low-bit VLM quantization, greatly promoting the practical deployment of multimodal large models in resource-limited environments.

[CV-259] SkySeg: Collaborative Onboard Semantic Segmentation with Heterogeneous UAVs in the Wild

Link: https://arxiv.org/abs/2605.24014
Authors: Anqi Lu, Yun Cheng, Youbing Hu, Zhiqiang Cao, Jie Liu, Zhijun Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: The demand for unmanned aerial vehicle (UAV)-based image acquisition and analysis has surged, with UAVs increasingly utilized for semantic segmentation tasks. To meet the real-time analysis requirements of UAV remote sensing missions, performing onboard computation and making decisions based on the results is a natural approach. However, deploying semantic segmentation on resource-constrained UAV platforms presents two significant challenges: 1) hardware constraints limit the ability of UAVs to perform real-time semantic segmentation, and 2) environmental variations during flight cause data distribution shifts, deviating from the original training data. To address these issues, this paper introduces SkySeg, a heterogeneous multi-UAV air-air cooperation framework that integrates computer vision and flight patterns to enable onboard semantic segmentation using low-cost sensors. SkySeg employs an efficient information fusion inference method, combining low-definition, wide-area images with high-definition, focused-area images. Additionally, it incorporates a cross-device test-time adaptation (TTA) strategy to enhance segmentation performance in dynamic environments by collaboratively addressing distribution shifts of test data streams across UAVs. Experimental results demonstrate that our SkySeg framework reduces inference latency by approximately 3.6x, improves onboard segmentation accuracy by 5.91%, and achieves a 10.91% average accuracy gain in the wild.

[CV-260] Deep Learning-Based Automated Quantification of TIMI Myocardial Perfusion Frame Count (DL-TMPFC) from Coronary Angiography: A Novel Framework for Rapid Assessment of Microvascular Dysfunction

Link: https://arxiv.org/abs/2605.24012
Authors: Si Li, Yuanqing He, Chenkai Hu, Xiaogang Guo, Huay-Cheem Tan, Chieh Yang Koo, Xuan Zhang, Lei He, Jingyuan Zeng, Shan Xiao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 8 figures

Abstract: Aims: Coronary microvascular dysfunction (CMVD) affects approximately 40%-60% of patients with ischemia and non-obstructive coronary arteries, yet diagnosis remains challenging due to reliance on invasive functional testing or subjective Thrombolysis In Myocardial Infarction (TIMI) flow grade. The TIMI Myocardial Perfusion Frame Count (TMPFC) offers an objective, angiography-based quantitative measure of CMVD, but its clinical translation is hindered by cumbersome manual calculation and insufficient validation. This study aims to develop and validate a deep learning-powered TMPFC calculation (DL-TMPFC), enabling integration into clinical workflows. Methods and results: The DL-TMPFC framework comprised two components. A stenosis detection network first excluded obstructive coronary artery disease (CAD). A territory-aware segmentation network then identified perfusion territories, and a TMPFC calculation module automatically determined the first and last frames from angiographic sequences. The framework was validated in a cohort of 655 patients (445 with obstructive CAD, 100 with confirmed CMVD, 110 controls) from three independent institutions. DL-TMPFC showed excellent agreement with expert manual measurements (bias: -0.93 frames; 95% LoA: -5.33 to +3.47; r = 0.98). DL-TMPFC markedly enhanced clinical feasibility by fully automating TMPFC and removing observer dependence. Clinically, DL-TMPFC accurately identified CMVD across a full spectrum of coronary pathologies and captured the continuous severity of CMVD beyond binary classification, enabling quantitative risk stratification. Conclusion: DL-TMPFC enabled automatic, standardized, and accurate quantification of CMVD directly from routine angiography. By providing an automatic and objective measure, this tool provided immediate diagnostic information for timely recognition and management of CMVD in clinical practice.
MSC classes: 92C55; ACM classes: I.4.6
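The bias and 95% limits of agreement (LoA) quoted in this abstract are the usual Bland-Altman quantities. The sketch below shows how they are conventionally computed, assuming the standard definition (mean difference ± 1.96 × sample standard deviation of the differences). The frame counts are made-up illustrative values, not data from the study.

```python
import statistics

def bland_altman(automated, manual):
    """Return (bias, lower LoA, upper LoA) for paired measurements."""
    diffs = [a - m for a, m in zip(automated, manual)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical paired frame counts (automated vs. expert manual).
automated = [88, 92, 101, 110, 95]
manual = [90, 91, 103, 111, 97]
bias, lower, upper = bland_altman(automated, manual)
```

Read against the reported numbers, the interval -5.33 to +3.47 is symmetric around -0.93 with a half-width of 4.40 frames, which under the 1.96 × SD convention would correspond to a standard deviation of the differences of about 2.24 frames.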

[CV-261] ActQuant: Sub-4-bit Action-Guided Quantization for Vision-Language-Action Models

Link: https://arxiv.org/abs/2605.24011
Authors: Arash Akbari, Arman Akbari, Masih Eskandar, Qitao Tan, Yixiao Chen, Jingwu Luo, Bertha Pangaribuan, Liyun Zhang, Jennifer Dy, Geng Yuan, Xue Lin, Gaowen Liu, Stratis Ioannidis, Yanzhi Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract: Vision-Language-Action (VLA) models exhibit remarkable action generation for embodied intelligence, but their heavy compute makes deployment on edge platforms impractical. Aggressive, sub-4-bit weight quantization is the natural solution, yet existing post-training quantization (PTQ) methods suffer severe performance degradation in this regime. To address this, we introduce ActQuant, an action-guided mixed-precision PTQ framework that operates in two stages: (1) an inter-tensor bit allocator that assigns each weight matrix a single bit-width based on how much it contributes to predicting the agent’s actions; (2) an intra-tensor scale optimizer that tunes per-block quantization scales using action-aware curvature, so that dynamic range is concentrated on the weights most influential for control. To deliver the on-device benefits of our aggressive quantization, we further introduce this http URL, an agentic conversion pipeline that ports architectures into a native C/C++ runtime with efficient low-bit kernels. We evaluate ActQuant both in simulation and on a real-world 6-DoF UR3 arm, with all models deployed through this http URL. On the LIBERO benchmark, ActQuant is the only method that operates at or below 3 bits-per-weight, retaining 95.0% on OpenVLA-OFT and 94.8% on π0.5. Pushed further, ActQuant reaches 2.5 bpw at 90.1% on OpenVLA-OFT, compressing the backbone from 14.3 GB to 2.7 GB (5.3×). On the physical UR3 arm, π0.5 quantized with ActQuant retains the baseline’s success rate while reducing the memory footprint by 2.5×.

[CV-262] CAFD: Concept-Aware DNN Fault Detection using VLMs

Link: https://arxiv.org/abs/2605.24008
Authors: Amin Abbasishahkoo, Mahboubeh Dadkhah, Lionel Briand
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
Comments:

Abstract:Fault detection for Deep Neural Networks (DNNs) has received increasing attention in recent years. While more advanced hybrid approaches have been proposed to combine multiple sources of information and outperform earlier techniques, they often incur substantial computational overhead, limiting scalability and practicality in real-world settings. In this paper, we introduce Concept-Aware Fault Detection (CAFD), a learning-based approach that achieves superior fault detection performance by effectively integrating multiple information sources while maintaining practical efficiency. Specifically, CAFD is trained using a carefully selected set of informative features, including model-based signals derived from the DNN’s outputs, distance-based features, and a novel concept-based feature, called Concept Failure Ratio (CFR). CFR leverages Vision-Language Models (VLMs) to extract textual concepts from images and quantify the likelihood that their presence is associated with DNN failures. By incorporating this feature, CAFD benefits from complementary semantic information, enabling more effective fault detection. Our results demonstrate that CFR serves as an effective indicator for DNN fault detection. We conduct an extensive empirical evaluation of CAFD, comparing it against five state-of-the-art baselines across three subject DNN models and datasets, including ImageNet. Across a wide range of constrained selection budgets, CAFD consistently outperforms all baselines in Fault Detection Rate (FDR), achieving average FDR improvements of 18.3% across all investigated subjects and budget sizes.

[CV-263] Reason–Imagine–Act: Closed-Loop LLM Decision Making with World Models for Autonomous Driving ITSC2026

Link: https://arxiv.org/abs/2605.24004
Authors: Zhengqi Sun, Yiwen Sun, Boxuan Liu, Tailai Chen, Tianxu Guo, Jiabin Liu
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Accepted by the 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC 2026). 8 pages, 2 figures

Abstract:Large language models (LLMs) are promising for autonomous driving, but semantics-only decision policies can yield physically unsafe behavior in dynamic traffic. Existing methods either perform online language reasoning without explicit dynamics verification or use world models mainly in offline pipelines, leaving a gap between semantic intent and physical feasibility at decision time. We propose Reason–Imagine–Act (RIA), a closed-loop framework that couples an LLM reasoner with an action-conditioned world model for online safety verification. At each step, the LLM proposes an action template and candidate sub-actions, the world model performs short-horizon rollouts, and a safety scorer selects the safest executable action with feedback to the next reasoning step. Under a unified CARLA point-goal protocol (1000 episodes), RIA achieves 80.05% route completion, 51.10% arrival rate, and 0.20% collision rate. Under the same closed-loop interface, RIA consistently outperforms training-free baselines, including CARLA TM and MADA, on core closed-loop metrics. For reproducibility, code is available at this https URL.
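The propose, imagine, and select step described in this abstract can be sketched with a toy example. Everything below is a stand-in: the candidate actions would come from the LLM reasoner, and `rollout_min_gap` replaces the learned action-conditioned world model with a hypothetical 1-D car-following simulation. The names, the safety score (minimum imagined gap), and the numbers are assumptions for illustration only.

```python
def rollout_min_gap(gap, ego_speed, lead_speed, accel, horizon=10, dt=0.1):
    """Toy world model: smallest gap (m) to a lead vehicle over a short rollout."""
    min_gap = gap
    for _ in range(horizon):
        ego_speed = max(0.0, ego_speed + accel * dt)
        gap += (lead_speed - ego_speed) * dt
        min_gap = min(min_gap, gap)
    return min_gap

def select_safest(candidates, gap, ego_speed, lead_speed):
    """Imagine each candidate, score it by minimum gap, and return the safest one.

    The scores are returned as well so they can be fed back to the next
    reasoning step.
    """
    scores = {
        name: rollout_min_gap(gap, ego_speed, lead_speed, accel)
        for name, accel in candidates.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical candidate sub-actions (acceleration in m/s^2).
candidates = {"accelerate": 2.0, "keep": 0.0, "brake": -3.0}
action, feedback = select_safest(candidates, gap=8.0, ego_speed=10.0, lead_speed=8.0)
```

With the ego vehicle closing on a slower lead vehicle, the imagined rollouts rank braking as safest, and the per-action scores give the reasoner a concrete signal for its next proposal.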

[CV-264] Remote sensing data imputation using deep learning for multispectral imagery

Link: https://arxiv.org/abs/2605.24003
Authors: Shuang Liua, Fiona Johnson, Rohitash Chandra
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments:

Abstract:Remote sensing techniques have been increasingly utilised in aquatic applications in recent years. A common challenge in using optical satellite data is the presence of missing observations due to cloud cover. These data gaps can lead to missed detection of critical events, such as algal blooms, in lakes of high interest to water authorities. As a result, enhancing the completeness of optical satellite datasets is crucial for improving the monitoring and prediction of algal blooms. In this study, we compared a traditional data imputation method (i.e., linear interpolation) with deep learning models for reconstructing missing spectral bands across four lakes with historical records of algal blooms. The deep learning models adopted include CNN-based architectures (i.e., CNN, Inception Resnet, and Autoencoder) and CNN-LSTM-based architectures (i.e., CNN-LSTM, Resnet-LSTM, and Autoencoder-LSTM). Our results demonstrated that deep learning models substantially outperformed the baseline linear interpolation method in imputing spectral band values within artificially masked regions. Among these models, CNN delivered the best performance across most lakes. Furthermore, we evaluated the performance of algal bloom indices (i.e., Green/Red and NDCI) derived from the imputed imagery by comparing them with the observed data. Our results demonstrate that deep learning models are effective for imputing missing data in PlanetScope SuperDove imagery, enabling more reliable applications in water monitoring.
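For orientation, the linear-interpolation baseline and the two bloom indices named in this abstract can be sketched as follows. This assumes gaps are filled per pixel along the time axis and uses the commonly cited definition NDCI = (red edge - red) / (red edge + red). The abstract does not state the exact band choices or interpolation axis, so treat both as assumptions; the reflectance values are invented.

```python
def interpolate_gaps(series):
    """Fill None gaps linearly between the nearest observed neighbours.

    Gaps before the first or after the last observation are left as None.
    """
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            frac = (i - left) / span
            filled[i] = series[left] + frac * (series[right] - series[left])
    return filled

def ndci(red_edge, red):
    """Normalized Difference Chlorophyll Index (assumed band pairing)."""
    return (red_edge - red) / (red_edge + red)

def green_red_ratio(green, red):
    """Simple Green/Red band ratio."""
    return green / red

# Hypothetical time series for one pixel and band; None marks cloudy dates.
observed = [1.0, None, None, 4.0, None]
filled = interpolate_gaps(observed)
index_value = ndci(red_edge=0.06, red=0.04)
```

Because interpolation only bridges the two nearest clear observations, short events such as a bloom that peaks inside a cloud gap are smoothed away, which is the limitation the learned models are meant to address.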

[CV-265] Diff-Instruct with Diffused Reward: Towards Principled One-step Generator RL

Link: https://arxiv.org/abs/2605.24001
Authors: Junyi Wu, Weijian Luo, Haoyang Zheng, Runzhe Zhang, Guang Lin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Recent advances in one-step text-to-image generation have enabled real-time synthesis with remarkable efficiency and quality. Previous reinforcement learning methods for one-step generators combine image-space reward optimization with diffusion noisy-space distribution matching. This paradigm brings challenges due to a mismatch between terminal reward optimization and the underlying generative dynamics. As a result, optimization tends to exploit stochastic degrees of freedom, often improving reward at the expense of image fidelity. To address this issue, we propose Diff-Instruct with Diffused Reward (DIDR), a data-free trajectory-level alignment framework derived from Integral KL minimization. DIDR propagates the RLHF-optimal reward-tilted clean-image distribution across all noise levels along the diffusion trajectory. We show that this objective admits the same minimizer as clean-image RLHF, while naturally inducing the Diffused Reward Score (DRS), which acts as a reward-driven correction to the reference score function. To make this practical, we further introduce the Diffused Reward Proxy (DRP), an efficient estimator of DRS based on differentiable short-step denoising. Extensive experiments demonstrate that DIDR consistently Pareto-dominates existing one-step SDXL baselines. Moreover, when transferred to a 6B DiT backbone (Z-Image), DIDR surpasses its 50-step teacher in preference alignment while requiring only a single generation step.

[CV-266] IVR-R1: Refining Trajectories through Iterative Visual-Grounded Reasoning in Reinforcement Learning

Link: https://arxiv.org/abs/2605.23997
Authors: Chenghao Li, Fusheng Hao, Xikai Zhang, Likang Xiao, Yanwei Ren, Fuxiang Wu, Quan Chen, Liu Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract: Multimodal large language models trained via reinforcement learning (RL) have demonstrated remarkable capabilities in complex visual reasoning tasks, yet they remain limited in long-horizon multimodal scenarios, often suffering from visual hallucination and logical errors. Current methods typically pre-encode high-dimensional visual scenes into discrete textual proxies to facilitate downstream reasoning. As the reasoning chain unfolds, however, the inherent information asymmetry between text and visual scenes tends to erode visual grounding, resulting in misguided reasoning and erroneous outputs. To address this issue, we introduce IVR-R1 (Iterative Visual-grounded Reasoning), a novel RL training framework that performs dynamic visual re-alignment, actively rectifying reasoning trajectories to guide policy optimization. Specifically, by leveraging a reward-driven screening mechanism to identify flawed rollouts, IVR-R1 executes a fine-grained, step-level error attribution within the multimodal context. By iteratively cross-referencing intermediate reasoning states against pristine visual priors, a Re-Reasoning Loop enables automated trajectory rectification, effectively synthesizing expert-level demonstrations that serve as high-fidelity reasoning templates for the policy model. Our experiments across diverse multimodal benchmarks demonstrate that IVR-R1 consistently outperforms existing reinforcement learning methods, establishing a superior paradigm for maintaining logical and visual consistency in complex multimodal reasoning.

[CV-267] Brain-to-Image Retrieval and Reconstruction via Multimodal EEG Alignment

Link: https://arxiv.org/abs/2605.23996
Authors: Chi Kit Wong, Yan Liu, Haowen Yan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 16 pages, 5 figures. Code available at: this https URL

Abstract:We present a brain-to-image system that decodes visual stimuli from EEG signals recorded during natural image viewing. Our system addresses two tasks: (1) EEG-to-image retrieval, which ranks the correct stimulus image among 200 candidates given an EEG segment, and (2) EEG-to-image reconstruction, which generates an image consistent with the perceived stimulus. For retrieval, we implement a multi-level blurring approach improved with biologically inspired EVNet features and trained with the InfoNCE loss. Evaluated over 10 random seeds for a single subject, the retrieval model achieves a mean final-epoch Top-1 accuracy of 86.30% and Top-5 accuracy of 98.55%. For reconstruction, we implement CognitionCapturerPro, which aligns EEG representations to multi-modal CLIP embeddings, including image, text, depth, and edge embeddings, and synthesizes images with SDXL-Turbo conditioned via IP-Adapter. Averaged over 10 seeds, the reconstruction model achieves a CLIP score of 0.903 using ViT-H-14, a CLIP score of 0.870 using ViT-L/14, and an SSIM of 0.409. These results demonstrate the feasibility of decoding rich visual representations from EEG signals using modern multi-modal alignment and generative modeling techniques.
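As a reference point for the retrieval numbers in this abstract, the sketch below shows a plain InfoNCE loss and Top-k retrieval accuracy computed from an EEG-to-image similarity matrix. It assumes the one-directional form in which row i's positive is column i; many systems use a symmetric variant. The temperature and the toy similarity values are hypothetical, not taken from the paper.

```python
import math

def info_nce(sim, temperature=0.07):
    """Mean cross-entropy over rows, where row i's positive is column i."""
    losses = []
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        peak = max(logits)  # log-sum-exp with max subtraction for stability
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_norm - logits[i])
    return sum(losses) / len(losses)

def top_k_accuracy(sim, k):
    """Fraction of rows whose matching column ranks among the k most similar."""
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        hits += i in ranked[:k]
    return hits / len(sim)

# Toy cosine similarities between 3 EEG segments (rows) and 3 images (columns).
aligned = [[0.9, 0.1, 0.0],
           [0.2, 0.8, 0.1],
           [0.0, 0.3, 0.7]]
confused = [[0.1, 0.9, 0.0],
            [0.8, 0.2, 0.1],
            [0.3, 0.0, 0.7]]
loss_aligned = info_nce(aligned)
loss_confused = info_nce(confused)
```

In the paper's setting each row would be compared against 200 candidate images, so Top-1 and Top-5 accuracy correspond to `top_k_accuracy(sim, 1)` and `top_k_accuracy(sim, 5)` on a 200-column matrix.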

[CV-268] Task-Aligned Self-Supervised Learning for Medical Image Analysis: A Systematic Review and Practical Design Guidelines

Link: https://arxiv.org/abs/2605.23995
Authors: Chathura Wimalasiri
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This manuscript is 31 pages with 4 tables and 3 figures

Abstract:Self-supervised learning (SSL) has emerged as a promising paradigm for addressing the annotation bottleneck in medical imaging by learning representations from unlabeled data. However, its effectiveness depends heavily on the design of the pretext task and its alignment with the downstream clinical objective. We present a systematic, task-oriented review of SSL in medical imaging, examining how different pretext-task formulations influence performance across classification, segmentation, detection, and other tasks. Following PRISMA guidelines, we analyze 75 studies published between 2017 and 2025 and organize them into four paradigms: contrastive, non-contrastive and predictive, generative and reconstruction-based, and hybrid learning. Rather than cataloguing methods by architecture, we map each paradigm to the downstream objectives it best supports. Our analysis shows there is no universally optimal SSL strategy; instead, performance is governed by the alignment between the pretext task, the imaging modality, and the target task. Contrastive methods learn global discriminative features and align well with classification, but may overlook subtle pathological patterns. Generative and spatial prediction-based approaches better preserve local anatomical structure, making them more suitable for segmentation and other dense prediction tasks, while hybrid methods offer the most balanced performance. We further show that modality-specific design is critical and that SSL provides its greatest benefit in low-label and few-shot regimes. Finally, we distill these findings into practical design guidelines and outline open challenges, including pathology-aware pretext task design, resource-efficient training for high-dimensional data, and standardized evaluation protocols. This work offers practical guidance for designing more effective and clinically relevant SSL frameworks in medical imaging.

[CV-269] RAW: Robust Avatar Watermarking – Benchmarking and Baseline

Link: https://arxiv.org/abs/2605.23994
Authors: Jack Parry, Jack Saunders, Vinay Namboodiri
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract: Digital avatar watermarking presents unique challenges: avatars are routinely post-processed with background replacement, reframing, and format conversion before deployment. We introduce RAW (Robust Avatar Watermarking), a benchmark comprising 50 synthetic avatar videos from 5 commercial providers and 6 attacks simulating real-world avatar workflows. Evaluating 7 existing methods reveals that avatar-specific attacks such as background removal significantly degrade watermark recovery. We propose WALT (Watermarking Avatars with Learned Textures), which embeds watermarks in UV texture space via 3D face reconstruction. WALT achieves the highest robustness to zoom attacks (92.4%) while maintaining strong performance on background removal (95.6%). We release our benchmark to facilitate research into avatar-specific watermarking.

[CV-270] Nano World Models: A Minimalist Implementation of Future Video Prediction

Link: https://arxiv.org/abs/2605.23993
Authors: Siqiao Huang, Partha Kaushik, Michael Chen, Hengkai Pan, Omar Chehab, Fernando Moreno-Pino, Max Simchowitz
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page: this https URL

Abstract:World models have become a central paradigm for learning predictive simulators that support generation, planning, and decision-making. Yet, despite rapid progress in industry-scale interactive video generation, the broader research community still lacks compact, reproducible, and easily extensible implementations for studying the design choices underlying modern world models. We introduce Nano World Models, a minimalist codebase for future video prediction centered around diffusion forcing. Nano World Models provides a unified interface for generative objectives, model scales, action-conditioning mechanisms, latent observation spaces, datasets, evaluation protocols, and long-horizon rollout procedures. This design enables controlled studies of world-modeling components that are often entangled across separate implementations. Through experiments across simple control environments, game simulation, and real-robot data, we examine how prediction parameterization, architecture scale, action injection, sampling budget, and domain complexity affect video prediction quality and autoregressive rollout behavior. By releasing code, configurations, evaluation scripts, and pretrained checkpoints, Nano World Models aims to provide a compact yet extensible experimental substrate for open, reproducible, and scientific world-model research.

[CV-271] A World Model of Radiologist Reading for Medical Image Representation Learning

Link: https://arxiv.org/abs/2605.23992
Authors: Yiwei Li, Zihao Wu, Huaqin Zhao, Yifan Zhou, Chao Cao, Dajiang Zhu, Tianming Liu, Lin Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Radiologist eye-tracking data provide a rich record of how experts search, compare, and accumulate evidence during image reading; yet, existing methods exploit this signal only partially, either as a static spatial prior or as an auxiliary prediction target decoupled from diagnosis. We propose GazeWorld, a medical imaging world model that treats the image as the world and the radiologist’s fixation sequence as a trajectory through it. GazeWorld autoregressively predicts the latent representation of the next fixated patch from all previously visited ones, while a spatial-completion branch covers unvisited regions. At inference, GazeWorld generates a sequence of patch representations from the image alone without requiring real gaze data. Frozen GazeWorld features achieve state-of-the-art diagnostic accuracy across all nine supervised settings on CheXpert, RSNA Pneumonia, and SIIM-ACR Pneumothorax, as well as the highest zero-shot accuracy on all three benchmarks. On the GazeSearch benchmark, a generic decoder trained on the same frozen features outperforms the purpose-built LogitGaze-Med by over 16% in ScanMatch and 22% in SED, despite not being explicitly trained to predict gaze. GazeWorld demonstrates that modeling how experts read, not just what they conclude, offers a promising pretraining paradigm for medical imaging AI.

[CV-272] Parameter Efficient Multi-Class Intelligent Scheduling for Multimodal Online Distributed Industrial Anomaly Detection

Link: https://arxiv.org/abs/2605.23984
Authors: Heqiang Wang, Weihong Yang, Zheyuan Yang, Jia Zhou, Xiaoxiong Zhong, Fangming Liu, Weizhe Zhang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Industrial anomaly detection has attracted significant attention as a fundamental challenge in industrial systems. The rapid advancement of heterogeneous industrial sensors has driven industrial anomaly detection from unimodal to multimodal paradigms. However, existing methods are primarily designed for centralized and offline settings, overlooking the distributed and continuously generated data characteristic of real-world industrial environments. With the advancement of edge intelligence, modern edge devices are increasingly capable of not only data acquisition but also distributed model training, enabling collaborative intelligence across the system. Industrial anomaly detection represents a critical application in this context. Motivated by these challenges, we propose a novel framework termed Multimodal Online Distributed Industrial Anomaly Detection (MODIAD). We first present a comprehensive workflow for MODIAD and then formulate a Multi-class Intelligent Scheduling (MIS) problem to coordinate cross-class model updates by balancing data sufficiency and class update frequency. To efficiently solve this problem, we design a Sequential Marginal Gain Greedy (SMG) algorithm that enables effective multi-class training under resource constraints. Furthermore, to improve the computational and communication efficiency during training, we propose a Resource Efficient Class-Wise Low Rank Adaptation (REC-LoRA) strategy, which significantly reduces system overhead while preserving detection performance. Extensive experiments on two representative multimodal industrial anomaly detection datasets, MVTec 3D-AD and Eyecandies, demonstrate that the proposed approach achieves superior performance and efficiency under the MODIAD scenario.

[CV-273] How Accurate are Video Quality Models for Diffusion-Based Video Super-Resolution?

Link: https://arxiv.org/abs/2605.25940
Authors: Benjamin Herb, Steve Göring, Alexander Raake, Rakesh Rao Ramachandra Rao
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for the 18th International Conference on Quality of Multimedia Experience (QoMEX 2026)

Click to view abstract

Abstract:Recent video super-resolution (VSR) approaches use deep neural networks to enhance low-quality input videos and recover visual detail, with diffusion-based methods in particular showing promising results. In this paper, we investigate whether existing video quality models can be used to assess the performance of these diffusion-based VSR methods, by comparing model predictions with results from a subjective test. The study compares six upscaling methods (Lanczos, Rhea, SCST, DOVE, SeedVR2, Starlight Mini) applied to both compressed (AV1 and DCVC-RT) and uncompressed low-resolution videos, considering the play-out on a UHD-1/4K screen. A range of full- and no-reference quality models are used to assess their applicability to this new type of quality degradation, focusing on within-sequence performance. The results highlight that CNN-based full-reference models, such as LPIPS, DISTS, and CVQA-FR, show significantly higher correlation coefficients than both the conventional full-reference models and the tested no-reference models. Most overestimate the overly sharp results of SCST, with VMAF mainly failing due to spatial inconsistencies introduced by Starlight Mini. None of the tested video quality models reach sufficient accuracy to replace complementary subjective testing. The reference, degraded and upscaled videos, as well as the user ratings and model scores, are made available with the paper at this https URL as open data.

[CV-274] A Clinically Validated Foundation Model for Comprehensive Lung Pathology Interpretation

Link: https://arxiv.org/abs/2605.25878
Authors: Zhengrui Guo, Zhengyu Zhang, Jiabo Ma, Yihui Wang, Fengtao Zhou, Yingxue Xu, Ling Liang, Chenglong Zhao, Qi Xie, Jinbang Li, Shujing Guo, Fangyi Han, Zhijian Cen, Ziyi Liu, Cheng Jin, Junlin Hou, Zhixuan Chen, Yu Cai, Lijuan Qu, Shifu Chen, Yueping Liu, Zhe Wang, Xiuming Zhang, Muyan Cai, Li Liang, Hao Chen
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Pathological assessment guides lung cancer diagnosis, treatment selection, and prognostic evaluation, yet current CPath approaches rely on task-specific models for isolated objectives. Although pan-cancer foundation models offer versatility, they lack subspecialty-level depth and have not been evaluated across clinical workflows or prospectively validated in real-world settings. We introduce PulmoFoundation, a multi-center, prospectively validated, randomized controlled trial (RCT)-evaluated foundation model for comprehensive lung pathology assessment across pre-operative, intra-operative, and post-operative care. Built upon Virchow2 via subspecialty-specific pretraining using ~40,000 diagnostic HE-stained whole-slide images (WSIs), PulmoFoundation was systematically evaluated on ~26,000 WSIs across 32 clinically relevant tasks. In addition to accurately predicting molecular markers and patient survival, our model achieves clinical-grade performance in core diagnostic tasks across biopsy, frozen section, and surgical resection slides. In a registered prospective study of 1,357 patients across 11 diagnostic tasks, our model achieved an average AUC of 92.3%. Using pre-specified triage thresholds, PulmoFoundation could reduce additional second-review burden for 68.8% of biopsies and 83.0% of frozen sections, and defer 44.5% of IHC stain orders, with PPVs of 1.0, 0.991, and 0.966. Beyond prospective validation, we conducted a crossover RCT with eight pathologists, in which AI assistance improved diagnostic accuracy across 4,928 case-reader pairs (91.7% w/ AI vs. 83.8% w/o AI). AI assistance also reduced median diagnostic time by 19.6%, increased diagnostic confidence by 8.7%, and improved inter-rater agreement from moderate (kappa = 0.56) to substantial (kappa = 0.76). Together, these evaluations support PulmoFoundation as a clinically validated decision-support system for lung pathology.

[CV-275] Parameter-Efficient CT Reconstruction via Deep Graph Laplacian Regularization

Link: https://arxiv.org/abs/2605.25348
Authors: Veera Varuni Radhakrishnan, Chinthaka Dinesh, Qurat-ul-Ain Azim
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
Comments: 7 pages, 3 figures, conference

Click to view abstract

Abstract:Low-dose computed tomography (LDCT) reconstruction faces a critical tradeoff between reconstruction quality and resource requirements. While recent deep learning methods achieve state-of-the-art performance, they typically rely on over 500,000 parameters trained on large-scale datasets exceeding 35,000 scans. This work investigates whether graph-based regularization can provide meaningful noise reduction under strict resource constraints. We propose Deep Graph Laplacian Regularization (Deep GLR), integrating quadratic graph regularization into a Proximal Forward-Backward Splitting optimization framework with three lightweight CNN modules. Evaluated on the LoDoPaB-CT benchmark, Deep GLR achieves 30.70 dB PSNR, representing a 6.33 dB improvement over filtered backprojection, while using only 91,848 parameters trained on 1000 samples (2.8% of standard training set). Compared to benchmark methods, this represents 5.8 times better parameter efficiency and 30 times better data efficiency per dB improvement. The learned graph bandwidth parameter ($\epsilon = 1.25$) converges to interpretable values, suggesting the method captures meaningful image priors rather than overfitting. While a 13 dB gap remains versus state-of-the-art methods, results demonstrate that graph-based regularization provides a favorable efficiency-quality tradeoff for resource-constrained medical imaging scenarios.

[CV-276] Catching MRI outliers: unsupervised detection and localization of MRI artefacts and clinical anomalies using deep learning

Link: https://arxiv.org/abs/2605.24609
Authors: Mustafa Kadhim, Viktor Rogowski, Emilia Persson, Camila Gonzalez, André Haraldsson, Sofie Ceberg, Mikael Nilsson, Malin Kügele, Sven Bäck, Christian Jamtheim Gustafsson
Categories: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been submitted to Physics and Imaging in Radiation Oncology (phiRO)

Click to view abstract

Abstract:Artificial intelligence is increasingly integrated into radiotherapy workflows, yet such pipelines remain vulnerable to out-of-distribution image data that may introduce unexpected behavior in clinical tasks. Deep learning-based anomaly detection for pelvic magnetic resonance imaging (MRI) remains largely unexplored, and transparent evaluation of its feasibility for full automation is limited. We developed and evaluated a fully automated, unsupervised anomaly-detection framework for pelvic and brain MRI. A two-stage framework was trained on reference images from public datasets: LUND-PROBE for pelvic MRI, and IXI, fastMRI, and fastMRI+ for brain MRI. In the first stage, MRI slices were compressed into discrete tokens; in the second, the distribution of normal tokens was modeled. Anomaly evidence was estimated by combining perceptual image differences with token-surprisal scores based on negative log-likelihood. Automated detection was evaluated on pelvic MRI with synthetic global and real clinical anomalies, and on brain MRI with clinically annotated fastMRI+ abnormalities. Sensitivity, specificity, area under the receiver operating characteristic curve (AUC), and false-positive behavior in held-out normal cases were assessed. The framework achieved robust detection across hidden evaluation cohorts, with AUCs of 0.97 (95% CI, 0.95-0.98) and 0.81 (95% CI, 0.74-0.87) for pelvic and brain MRI, respectively. Heatmap analysis showed strong spatial agreement between detected anomalies and ground-truth locations, supporting localization accuracy and interpretability. These results support the potential of unsupervised anomaly detection as an automated MRI quality-control layer for radiotherapy workflows, with transparent visualization of image regions likely to compromise downstream AI-based tasks.

Artificial Intelligence

[AI-0] From Model Scaling to System Scaling: Scaling the Harness in Agentic AI

Link: https://arxiv.org/abs/2605.26112
Authors: Shangding Gu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:This paper studies the next major bottleneck in agentic AI as system scaling, not only model scaling: the design of auditable, persistent, modular, and verifiable architectures around foundation models. We refer to this shift as scaling the harness: treating the structured execution layer around a foundation model as a first-class object of design, evaluation, and optimization. Although recent large language models enable agents to use tools, retrieve information, maintain memory, and execute long-horizon workflows, evaluation remains largely model-centric, often reducing agents to final-task success while treating memory, retrieval, tool use, orchestration, verification, and governance as secondary implementation details. This framing is increasingly inadequate because agent performance emerges from the interaction among the foundation model, memory substrate, context constructor, skill-routing layer, orchestration loop, and verification-and-governance layer. Together, these components form the agent harness, which translates model capability into long-horizon agent behavior. We study scaling the harness through three core bottlenecks: context governance, trustworthy memory, and dynamic skill routing, together with the orchestration and governance mechanisms that coordinate and constrain them. We further outline a research agenda for harness-level benchmarks that go beyond one-shot task success to measure trajectory quality, memory hygiene, context efficiency, communication fidelity, verification cost, and safe evolution over time. To make the discussion concrete, we develop CheetahClaws: this https URL, a Python-native reference harness, and compare it with Claude Code and OpenClaw. Our main claim is that future progress in agentic AI will depend as much on system design as on stronger foundation models.

[AI-1] Beyond Summaries: Structure-Aware Labeling of Code Changes with Large Language Models

Link: https://arxiv.org/abs/2605.26100
Authors: Bar Weiss, Antonio Abu-Nassar, Adi Sosnovich, Karen Yorav
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 13 pages, 6 figures

Click to view abstract

Abstract:Code review is a critical practice in software engineering, yet the growing scale and frequency of code patches in modern projects, together with the widespread adoption of AI code assistants, make manual review increasingly challenging. Identifying the types of changes within a patch, such as renames, moves, or logic modifications, can substantially improve review efficiency by enabling prioritization, filtering, and automation. However, existing LLM-based approaches to code review have largely focused on summarization and comment generation, leaving structured code reviews underexplored. In this paper, we present a systematic study of using large language models (LLMs) for taxonomy-based labeling of code changes in a code patch. We introduce a two-stage pipeline that assigns labels to diff hunks and then refines them to capture structural relationships and semantic attributes, such as rename propagation and type changes. Our approach employs few-shot prompting to produce language-agnostic and customizable labels, without the engineering overhead of traditional static-analysis pipelines. We evaluate four LLMs across multiple context configurations on a manually curated benchmark of natural and synthetic patches. Our best configuration achieves up to 84% recall and 81% precision, with high accuracy in extracting relational and attribute metadata. These results suggest that LLM-based labeling can effectively complement static analysis by enabling flexible, multilingual, and automation-friendly code review workflows.

[AI-2] OrpQuant: Geometric Orthogonal Residual Projection for Multiplier-Free Power-of-Two Transformer Quantization

Link: https://arxiv.org/abs/2605.26092
Authors: Maoyang Xiang, Bo Wang, Tao Luo
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The deployment of Large Language Models (LLMs) and Vision Transformers (ViTs) on edge devices is significantly constrained by memory limitations and the critical timing bottlenecks introduced by dense Multiply-Accumulate (MAC) arrays. In the ultra-low bit regime, logarithmic Power-of-Two (PoT) quantization provides a hardware-efficient alternative by replacing MAC operations with bit-shifts. However, the non-uniform exponential lattice is inherently limited by a Low Angular Resolution Regime, a structural flaw that becomes particularly pronounced at sub-4-bit thresholds, leading to a notable degradation of high-dimensional feature manifolds. To address this geometric limitation, we propose Orthogonal Residual Projection (ORP), an algorithm-hardware co-design framework. By formulating quantization as a dual-basis geometric projection, ORP adaptively synthesizes a higher-resolution residual lattice using strictly shift-and-add operations. Furthermore, ORP’s analytical solver offers a practical alternative to computationally intensive gradient-based optimization, reducing the full-model calibration time for LLaMA-2-7B to approximately 15 minutes. Extensive evaluations demonstrate ORP’s applicability across modalities and its hardware efficiency. Under the 3-bit (W3/A16) constraint, ORP achieves a perplexity of 6.10 on LLaMA-2-7B, comparing favorably to conventional MAC-intensive baselines like AWQ without relying on asymmetric scaling, while maintaining competitive accuracy in 4-bit scenarios. At the silicon level, standard-cell RTL synthesis at a 28nm node indicates that ORP effectively mitigates the timing bottlenecks associated with dense multiplier trees.
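
The shift-and-add idea in this abstract can be illustrated with a small sketch. This is not ORP's solver: it is a generic greedy scheme, assumed here for illustration only, that codes a weight as one power-of-two term and optionally adds a second power-of-two term for the residual. The function names, the exponent range, and the fixed-point width are all hypothetical.

```python
import math
from typing import List, Tuple

PotCode = List[Tuple[int, int]]  # list of (sign, exponent) terms


def log_round_pot(x: float, min_exp: int = -8, max_exp: int = 0) -> Tuple[int, int]:
    """Return (sign, exponent) with sign * 2**exponent approximating x.

    The exponent is rounded in the log domain and clamped to an assumed
    range [min_exp, max_exp]. A sign of 0 marks an exactly zero input.
    """
    if x == 0.0:
        return 0, min_exp
    sign = 1 if x > 0 else -1
    exp = round(math.log2(abs(x)))
    return sign, max(min_exp, min(max_exp, exp))


def pot_quantize(weight: float, terms: int = 1) -> PotCode:
    """Greedily code `weight` as a sum of up to `terms` signed powers of two."""
    code: PotCode = []
    residual = weight
    for _ in range(terms):
        sign, exp = log_round_pot(residual)
        if sign == 0:
            break  # nothing left to approximate
        code.append((sign, exp))
        residual -= sign * 2.0 ** exp
    return code


def dequantize(code: PotCode) -> float:
    """Reconstruct the real value represented by a power-of-two code."""
    return sum(sign * 2.0 ** exp for sign, exp in code)


def shift_multiply(activation: int, code: PotCode, frac_bits: int = 8) -> int:
    """Multiply an integer activation by a coded weight with shifts and adds.

    The result is fixed-point with `frac_bits` fractional bits; exponents
    must satisfy exp >= -frac_bits so every shift amount is non-negative.
    """
    acc = 0
    for sign, exp in code:
        acc += sign * (activation << (exp + frac_bits))
    return acc
```

For example, `pot_quantize(0.75, terms=2)` yields the terms `+2**0` and `-2**-2`, so `shift_multiply(3, ...)` returns 576, which is 2.25 in fixed point with 8 fractional bits. With a single term the same weight is coded as 1.0, which shows why a residual term helps at very low bit widths.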

[AI-3] Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to Users Digital World

Link: https://arxiv.org/abs/2605.26086
Authors: Yusong Lin, Xinyuan Liang, Haiyang Wang, Qipeng Gu, Siqi Cheng, Jiangui Chen, Shuzhe Wu, Feiyang Pan, Lue Fan, Sanyuan Zhao, Dandan Tu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language model agents are increasingly envisioned as always-on personal assistants with access to anything relevant in the user’s digital world. Yet current systems operate over only narrow slices of that world, limiting context-sensitive reasoning and effective assistance. Existing benchmarks similarly provide only partial user state and therefore fail to capture performance in such a broad, always-on setting. To address this gap, we introduce Claw-Anything, a benchmark that expands agent context along three dimensions: long-horizon activity histories, interdependent backend services, and integrated GUI and CLI interaction across multiple devices. To instantiate this setting, we simulate months of user activity through multi-round event injection, producing complex world states and realistic noise, including irrelevant events and conflicting signals. Agents must reason over rich contextual environments while remaining robust to such noise. This expanded scope also enables the evaluation of proactive assistance, requiring agents to anticipate user needs and deliver timely recommendations. Experiments show that GPT-5.5 achieves only 34.5% pass@1, substantially below prior benchmarks, underscoring a gap between current agent capabilities and the demands of always-on personal assistance. Alongside the benchmark, we release an automated data-generation pipeline that yields 2,000 training environments and improves the base model by 23.7%, demonstrating its utility as scalable data infrastructure.

[AI-4] VeriTrace: Evolving Mental Models for Deep Research Agents

Link: https://arxiv.org/abs/2605.26081
Authors: Haolang Zhao, Yunbo Long, Lukas Beckenbauer, Alexandra Brintrup
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Deep research agents face vast, interdependent, and pervasively uncertain information. Existing systems explore what evolving intermediate representations should look like, but leave their evolution to the LLM’s implicit reasoning. Without explicit regulation, the intermediate layer is easily contaminated by mixed-quality information and propagates errors along its dependencies, so model scale often ends up substituting for absent regulation. We argue that an agent’s mental model should instead evolve through explicit feedback that continuously aligns task understanding with reality, and identify three regulatory loops: interpretive update, deviation feedback, and schema revision. We realise this in VeriTrace, a cognitive-graph framework that explicitly implements the three loops. Using matched Qwen3.5-27B backbones, VeriTrace improves over the strongest matched baseline by 4.22 pp on DeepResearch Bench (DRB) Insight (1.49 pp Overall) and by 5.9 pp Overall win rate on DeepConsult. With Config-DeepSeek, it achieves the strongest reproducible open-source result on DRB.

[AI-5] Rethinking Weak Supervision in Anomaly Detection: A Comprehensive Benchmark KDD2026

Link: https://arxiv.org/abs/2605.26068
Authors: Xu Yao, Siyuan Zhou, Wu Zhenbo, Chaochuan Hou, Shuang Liang, Shiping Wang, Hailiang Huang, Songqiao Han, Minqi Jiang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at KDD 2026 Datasets and Benchmarks Track (Cycle 2)

Click to view abstract

Abstract:Weakly supervised anomaly detection (WSAD) has developed in three primary directions: incomplete, inexact, and inaccurate supervision. However, these directions remain isolated, lacking a unified framework to assess whether they address unique challenges or share fundamental mechanics. This paper introduces WSADBench, the first benchmark that unifies evaluation across distinct weakly supervised scenarios, benchmarking diverse approaches from specialized WSAD methods to advanced tabular foundation models. WSADBench establishes standardized protocols to evaluate 36 algorithms across 4 modalities by systematically varying label quantity, granularity, and quality, revealing the performance boundaries of various methods. Based on over 700K experiments, WSADBench reveals four critical insights: (i) Strong intrinsic correlations exist between these weak supervision scenarios, challenging the isolation of current research directions. (ii) Specialized WSAD algorithms excel only in extreme label-scarcity regimes but are quickly dominated by tabular foundation models and general classification methods as supervision increases or in OOD scenarios. (iii) Unlabeled data shows inconsistent utility across settings, with marginal gains compared to label refinement. (iv) Models exhibit asymmetric sensitivity to different types of label noise. We release WSADBench as an open-source benchmark with code and datasets to facilitate future WSAD research: this https URL.

[AI-6] Conditional KRR: Injecting Unpenalized Features into Kernel Methods with Applications to Kernel Thresholding ICML2026

Link: https://arxiv.org/abs/2605.26067
Authors: Rustem Takhanov, Zhenisbek Assylbekov
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to ICML 2026

Click to view abstract

Abstract:Conditionally positive definite (CPD) kernels are defined with respect to a function class $\mathcal{F}$. It is well known that such a kernel $K$ is associated with its native space (defined analogously to an RKHS), which in turn gives rise to a learning method – called conditional kernel ridge regression (conditional KRR) due to its analogy with KRR – where the estimated regression function is penalized by the square of its native space norm. This method is of interest because it can be viewed as classical linear regression, with features specified by $\mathcal{F}$, followed by the application of standard KRR to the residual (unexplained) component of the target variable. Methods of this type have recently attracted increasing attention. We study the statistical properties of this method by reducing its behavior to that of KRR with another fixed kernel, called the residual kernel. Our main theoretical result shows that such a reduction is indeed possible, at the cost of an additional term in the expected test risk, bounded by $\mathcal{O}(1/\sqrt{N})$, where $N$ is the sample size and the hidden constant depends on the class $\mathcal{F}$ and the input distribution. This reduction enables us to analyze conditional KRR in the case where $K$ is positive definite and $\mathcal{F}$ is given by the first $k$ principal eigenfunctions in the Mercer decomposition of $K$. We also consider the setting where $\mathcal{F}$ consists of $k$ random features from a random feature representation of $K$. It turns out that these two settings are closely related. Both our theoretical analysis and experiments confirm that conditional KRR outperforms standard KRR in these cases whenever the $\mathcal{F}$-component of the regression function is more pronounced than the residual part.

[AI-7] Neuronal Stochastic Attention Circuit (NSAC) for Probabilistic Representation Learning

Link: https://arxiv.org/abs/2605.26061
Authors: Waleed Razzaq, Yun-Bo Zhao
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Reliable quantification of uncertainty estimates in continuous-time (CT) representation learning remains nascent, particularly within CT attention architectures. We introduce the Neuronal Stochastic Attention Circuit (NSAC), a novel biologically-inspired CT attention architecture that reformulates attention logit computation as the solution of an Ornstein-Uhlenbeck stochastic differential equation modulated by input-dependent, nonlinear interlinked gates derived from repurposed this http URL Neuronal Circuit Policies (NCPs) wiring mechanism. It induces Gaussian distribution over logits that propagates principled stochasticity through logistic-normal distribution over attention weights to yield probabilistic output. A two-term objective function combining Gaussian negative log-likelihood with an epistemic-separation regularizer enforces higher predictive variance and enables joint quantification of aleatoric and epistemic uncertainty. Empirically, we implement NSAC in a diverse set of learning tasks including: (i) irregular CT function approximation; (ii) multivariate regression; (iii) long-range forecasting; (iv) Industry 4.0; and (v) the lane-keeping of autonomous vehicles. We observe that the NSAC remains competitive against several baselines in terms of accuracy and produces reasonably well-calibrated uncertainty estimates while being interpretable at the neuronal cell level.

[AI-8] Retrying vs Resampling in AI Control

Link: https://arxiv.org/abs/2605.26047
Authors: James Lucassen, Adam Kaufman
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:AI coding scaffolds like Claude Code and Codex use retrying: blocking actions flagged as risky and continuing the trajectory. We study retrying from an AI control perspective, which treats the model as potentially adversarial. We find that while retrying reduces honest suspicion scores, the untrusted model can exploit monitor rationale to construct sneakier attacks, negating safety gains. We also study resampling: drawing multiple samples from the same context, which does not leak exploitable information. We disentangle design choices that previous work on resampling had bundled together. In BashArena, with Claude Opus 4.6 as the untrusted model and MiMo-V2-Flash as the trusted monitor, drawing five samples per step and auditing on the maximum suspicion score raises safety from 61% to 71% at a 0.3% audit budget, at no cost to usefulness. Selectively resampling only the steps that look suspicious on the first draw recovers 6.2 percentage points of the gain while drawing only 10% as many extra samples. Two of our findings in this setting contradict earlier work on resampling. The first is that auditing based on the maximum across resampled suspicion scores outperforms using the minimum, which is the opposite of what Ctrl-Z found. The second is that executing the least suspicious sample, which is the central mechanism in earlier defer-to-resample protocols, gives only a small empirical safety gain in our setting (+3.9 pp, with the confidence interval overlapping zero).
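
The resampling protocol described in this abstract (several draws per step, audit on the maximum suspicion score, optional selective resampling) might be sketched roughly as follows. This is an illustrative reading, not the authors' code: `sample_action` and `score_action` are hypothetical stand-ins for the untrusted model and the trusted monitor, and the fixed thresholds are assumptions standing in for calibration against an audit budget.

```python
from typing import Callable, Dict, List, Optional


def aggregate_suspicion(scores: List[float], mode: str = "max") -> float:
    """Combine per-sample suspicion scores into one step-level score."""
    if not scores:
        raise ValueError("at least one score is required")
    if mode == "max":
        return max(scores)
    if mode == "min":
        return min(scores)
    raise ValueError(f"unknown mode: {mode}")


def resample_step(
    sample_action: Callable[[], str],
    score_action: Callable[[str], float],
    num_samples: int = 5,
    audit_threshold: float = 0.9,
    resample_threshold: Optional[float] = None,
) -> Dict[str, object]:
    """Draw actions for one step and decide whether to audit it.

    If `resample_threshold` is set, extra samples are drawn only when the
    first draw already looks suspicious (selective resampling). The audit
    decision uses the maximum score across all draws.
    """
    first = sample_action()
    actions = [first]
    scores = [score_action(first)]

    should_resample = resample_threshold is None or scores[0] >= resample_threshold
    if should_resample:
        for _ in range(num_samples - 1):
            action = sample_action()
            actions.append(action)
            scores.append(score_action(action))

    step_score = aggregate_suspicion(scores, mode="max")
    return {
        "actions": actions,
        "scores": scores,
        "step_score": step_score,
        "audit": step_score >= audit_threshold,
    }
```

The sketch deliberately leaves open which sampled action is executed, since the abstract reports that executing the least suspicious sample adds little safety in the authors' setting.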

[AI-9] L2IR: Revealing Latent Intent in Graph Fraud Detection

Link: https://arxiv.org/abs/2605.26040
Authors: Jinsheng Guo, Zhenhao Weng, Yibo Liu, Yan Qiao, Meng Li
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 12 pages, 6 figures

Click to view abstract

Abstract:Graph fraud detection has long depended on Graph Neural Networks (GNNs) to propagate and aggregate information across relational data. A critical obstacle in practice, however, is that fraudsters frequently disguise themselves by forging numerous connections with benign users, causing fraud signals to be progressively diluted during neighborhood aggregation and undermining detection reliability. While recent efforts have used Large Language Models (LLMs) to provide rich semantic cues for fraud detection, the underlying intent behind suspicious connections remains insufficiently explored. Compounding this issue, the scarcity of annotated fraud samples makes it difficult to train detectors that remain robust under heavy camouflage. To address these gaps, we propose L2IR, an LLM-driven Latent Intent Revealing framework for graph fraud detection. By uncovering latent intent from both user behaviors and suspicious connections, L2IR extracts intent-aware representations from raw behavioral traces and reasons about the true purpose behind individual connections, effectively distinguishing supportive links from misleading ones. It further incorporates adaptive self-training to enhance robustness under limited supervision. Evaluations on two real-world datasets characterized by pervasive camouflage demonstrate that L2IR surpasses strong baselines and can function as a plug-in enhancement for a range of GNN-based detectors, improving AUPRC by up to 8.27%.

[AI-10] CITYREP: A Unified Benchmark for Urban Representations Across Cities Tasks and Modalities

Link: https://arxiv.org/abs/2605.26036
Authors: Junyuan Liu, Xinglei Wang, Zichao Zeng, Jiazhuang Feng, Quan Qin, Ilya Ilyankou, Guangsheng Dong, Tao Cheng
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Urban representation learning encodes complex urban environments into general-purpose embeddings for diverse downstream tasks and emerging urban foundation models. However, current evaluations are limited, typically focusing on one or two cities and tasks and relying on random splits that introduce spatial leakage, leading to inflated performance and weak support for cross-location generalization and fair comparison. To address this, we propose CityRep, a unified benchmark that evaluates urban representations across data modalities, cities, and tasks using spatially structured splits. CityRep consists of three key components: (1) a spatial unit-agnostic evaluation framework that supports heterogeneous urban representations through a standardized alignment module; (2) a unified evaluation protocol using block-based spatial splits to mitigate spatial leakage and enable rigorous model comparison; and (3) an extensible multi-city, multi-task benchmark suite spanning 8 cities and 8 tasks across regression, classification, and distribution prediction. We evaluate 11 representative urban representation models. Results show that performance is highly sensitive to the split protocol, with random splits inflating scores and altering model rankings. We also observe substantial variability across cities and tasks, underscoring the need for generalization-aware evaluation. CityRep is released as a reproducible benchmark with datasets, evaluation pipelines, and diagnostic tools to facilitate fair comparison and support future research in urban representation learning towards urban foundation models.
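
To make the contrast between random splits and block-based spatial splits concrete, here is a minimal sketch of one way a block split could work. It is an assumption-laden illustration, not CityRep's protocol: points are bucketed into square grid cells of a chosen size, and whole cells are assigned to train or test so that nearby points never straddle the split. All names and defaults are hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]
BlockId = Tuple[int, int]


def block_id(point: Point, block_size: float) -> BlockId:
    """Map a point to the index of the square grid cell that contains it."""
    x, y = point
    return (math.floor(x / block_size), math.floor(y / block_size))


def block_split(
    points: Sequence[Point],
    block_size: float,
    test_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[int], List[int]]:
    """Split point indices into train and test sets by whole spatial blocks."""
    blocks: Dict[BlockId, List[int]] = {}
    for index, point in enumerate(points):
        blocks.setdefault(block_id(point, block_size), []).append(index)

    # Shuffle block ids deterministically, then hold out a share of blocks.
    ids = sorted(blocks)
    random.Random(seed).shuffle(ids)
    num_test = max(1, round(len(ids) * test_fraction))

    test = sorted(i for b in ids[:num_test] for i in blocks[b])
    train = sorted(i for b in ids[num_test:] for i in blocks[b])
    return train, test
```

Because entire cells are held out, a test point's immediate neighbours within its cell are never in the training set, which is the kind of leakage a random split allows. Note that with a single occupied block this sketch leaves the training set empty, so the block size has to be chosen relative to the study area.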

[AI-11] Learning in Low-Dimensional Subspaces: Orthogonal Bottlenecks for Reinforcement Learning

Link: https://arxiv.org/abs/2605.26012
Authors: Aleksandar Todorov, Matthia Sabatelli
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep reinforcement learning (RL) agents commonly rely on high-dimensional neural representations, despite growing evidence that task-relevant value and policy structure may be intrinsically low-dimensional. In this work, we present a simple yet effective representation-level prior that inserts a fixed orthonormal projection to constrain encoder features to a low-dimensional subspace, requiring no auxiliary objectives, pretraining, or changes to the underlying RL algorithm. Under a linear realizability assumption, we prove that when the bottleneck dimension exceeds the intrinsic rank of the optimal value function in feature space, the bottleneck preserves expressivity and leaves the induced gradient dynamics unchanged up to an equivalent low-dimensional parameterization. Empirically, we find that across both single and multi-task benchmarks, baseline performance is either matched or improved once the bottleneck dimension exceeds a small task-dependent threshold; in many cases, value representations can be compressed to extremely low dimensions without loss, and the minimal sufficient dimension depends far more on environment complexity than encoder width. In addition, we analyze representation geometry and find that orthogonal bottlenecks stabilize feature norms and are associated with higher effective rank. Together, these results support a representation-space interpretation of the manifold hypothesis in reinforcement learning and position orthogonal bottlenecks as a lightweight, architecture-agnostic mechanism for shaping RL representations.
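
A fixed orthonormal projection of the kind described above can be sketched in a few lines. The abstract does not say how the projection is constructed, so the Gram-Schmidt construction on Gaussian samples, the function names, and the dimensions below are assumptions for illustration only.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_rows(k, d, seed=0):
    """Build k orthonormal vectors in R^d via Gram-Schmidt on Gaussian samples."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < k:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for q in rows:
            c = dot(v, q)
            v = [a - c * b for a, b in zip(v, q)]  # remove the component along q
        norm = dot(v, v) ** 0.5
        if norm > 1e-8:
            rows.append([a / norm for a in v])
    return rows

def bottleneck(features, P):
    """Project d-dim encoder features to k coordinates: z = P f (P is fixed, not trained)."""
    return [dot(row, features) for row in P]

P = orthonormal_rows(k=3, d=8)   # hypothetical sizes: 8-dim features, 3-dim bottleneck
f = [float(i) for i in range(8)]
z = bottleneck(f, P)
```

Because the rows of `P` are orthonormal, the projection never increases the feature norm, which is at least consistent with the norm-stabilizing behaviour the abstract reports.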

[AI-12] Neural Scalable Symbolic Search Framework for Complex Logical Queries with Multiple Free Variables

Link: https://arxiv.org/abs/2605.25985
Authors: Weizhi Fei, Hang Yin, Zihao Wang, Shukai Zhao, Wei Zhang, Yangqiu Song
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures

Abstract:Complex Query Answering (CQA) is a fundamental knowledge representation and reasoning task over incomplete knowledge graphs (KGs). Answering existential first-order queries with $k$ free variables (i.e., $\text{EFO}_k$ queries) is a crucial yet challenging problem, as it requires ranking answer tuples in $\mathcal{E}^k$, where $\mathcal{E}$ denotes the entity set of a KG. This quickly becomes intractable as $k$ grows. Consequently, existing benchmarks and methods rely on marginal rankings over individual variables; however, marginal rankings are a poor proxy for the true joint ranking of tuples. Building on neural symbolic search for $\text{EFO}_1$ queries, we propose Neural Scalable Symbolic Search (NS3), a budgeted framework that approximates joint ranking without enumerating $\mathcal{E}^k$. NS3 (i) answers marginalized sub-queries to obtain necessary candidate sets, (ii) merges multiple free variables into hypernodes whose domains are pruned and controlled by a dynamic budget $B$, and (iii) progressively reduces an $\text{EFO}_k$ query to an $\text{EFO}_{k-1}$ query over a budgeted reduced domain. Across three standard KG datasets, NS3 substantially improves joint ranking performance while retaining strong marginal accuracy. We further release a joint-ranking benchmark that extends existing $\text{EFO}_1$ datasets to $k=3$, enabling systematic evaluation of multi-variable queries. Our code is provided in this https URL.

[AI-13] LECTOR: Joint Optimization of Scientific Reasoning Graphs and Introduction Generation

Link: https://arxiv.org/abs/2605.25964
Authors: Jiabei Xiao, Yizhou Wang, Chen Tang, Pengze Li, Wanli Ouyang, Shixiang Tang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 25 pages

Abstract:AI Scientists have shown promising progress across multiple stages of the research pipeline, among which automatic scientific paper writing remains a formidable challenge. Writing the Introduction is especially challenging, as it demands not only linguistic fluency but also logical soundness and verifiable faithfulness. Most AI-assisted methods treat the task as text generation instead of reasoning and structuring, leading to severe drawbacks, e.g., hallucinated citations. To address this, we first formulate the Content-Conditional Introduction Generation (CCIG) task, which requires grounding the Introduction in the paper’s core evidence. We then propose LECTOR, a novel Logic-Expression Co-Reinforcement Learning framework that can strictly follow the scientist’s logic, add high-quality citations, and keep structured expressions. LECTOR first constructs a logic-reasoning graph from the paper’s main body to serve as a verifiable logical blueprint. Subsequently, it employs a Logic-Expression Co-Rewarding mechanism to jointly optimize for both the graph’s structural fidelity and the final narrative’s quality. We construct a dataset from Nature Communications papers to assess our method. Extensive experiments show consistent improvements in both logic fidelity and Introduction generation quality metrics, e.g., Graph Quality (+26.7%), Citation Quality (+8.6%), and Paper Consistency (+3.3%). Code and data are available at this https URL.

[AI-14] Continual Speaker Identity Unlearning with Minimal Interference

Link: https://arxiv.org/abs/2605.25962
Authors: Jinju Kim, Yunsung Kang, Gyeong-Moon Park, Jong Hwan Ko
Affiliation: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: preprint

Abstract:Machine unlearning removes designated concepts or knowledge from pre-trained models. Recent work has extended this paradigm to speaker identity unlearning in zero-shot text-to-speech (ZS-TTS), the task of selectively erasing a model’s ability to replicate a speaker’s voice. Existing methods, however, quietly assume all unlearning requests arrive at once; an unrealistic assumption, since privacy-motivated removals arrive sequentially over time. We show this assumption breaks state-of-the-art methods: unlearning each new speaker fully revives previously unlearned speakers, reintroducing the very privacy risk unlearning was meant to eliminate. We present Cumulative ORThogonal Identity Suppression (CORTIS), the first framework for continual speaker identity unlearning in ZS-TTS that requires no access to previously-unlearned speaker data. CORTIS combines Fisher-information-based parameter masking, which localizes updates to speaker-relevant weights, with orthogonal projection against subspaces spanned by prior unlearning updates. With VoiceBox, CORTIS unlearns each requested speaker while keeping previously unlearned speakers forgotten across long request sequences, substantially outperforming sequential application of prior methods. The demo is available at this https URL .

[AI-15] Step-TP: A Grounded Step-Level Dataset with Chain-of-Thought Reasoning for LLM-Guided Tensor Program Optimization

Link: https://arxiv.org/abs/2605.25954
Authors: Mengfan Liu, Da Zheng, Junwei Su, Chuan Wu
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Despite the strong reasoning capabilities of large language models (LLMs), optimizing the execution efficiency of tensor programs remains challenging due to the need for precise, composable transformation decisions. Recent LLM-guided approaches frame tensor program optimization as an iterative decision process, but existing datasets provide only end-to-end optimized program pairs using token-inefficient representations, lacking verifiable step-level supervision and interpretability. As a result, LLMs struggle to make reliable single-step decisions in large combinatorial optimization spaces. We introduce Step-TP, a post-training dataset for tensor program optimization that provides grounded, atomic, step-level supervision with structured chain-of-thought (CoT) reasoning. Step-TP forms a closed reasoning loop over intermediate program states, enabling reliable multi-step optimization rather than outcome imitation. Its design is guided by four principles: (i) a token-efficient, verifiable intermediate representation (IR) that deterministically lowers to TVM TIR; (ii) atomic and composable optimization strategies that decompose complex trajectories into interpretable single-step decisions; (iii) structured CoT supervision coupled with explicit IR-to-IR state transitions; and (iv) strategy filtering to balance coverage while preventing shortcut exploitation. The dataset and implementation are available at a GitHub link, this https URL.

[AI-16] Small Models, Strong Priors: Architectural Inductive Bias for Parameter-Efficient Neural PDE Solvers

Link: https://arxiv.org/abs/2605.25949
Authors: Shyam Sankaran, Hanwen Wang, Paris Perdikaris
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
Comments:

Abstract:Neural PDE solvers have followed the scaling trajectory of vision and language, with recent foundation models reaching billions of parameters. We argue that scale is a poor substitute for architectural inductive bias in this domain: structured priors deliver outsized parameter efficiency, and the pattern of where they succeed and fail is itself informative about what they capture. We instantiate this argument in WaveLiT, an architecture combining a discrete wavelet transform for lossless multi-resolution tokenization, an augmented linear attention block, a shared-weight multiscale feature pyramid, and a wavelet-domain auxiliary loss. Bespoke 1-10M-parameter WaveLiT models compete with foundation models of 100-1000$\times$ their size across eight TheWell benchmarks, with the largest gains on wave and acoustic-dominated benchmarks where the wavelet-multiscale prior fits the dominant dynamical structure and small per-step errors do not compound geometrically under rollout. Trained jointly across all eight benchmarks, a 10M-parameter foundation variant exhibits a structured, physically interpretable transfer pattern – strongest where the wavelet-multiscale prior matches the dynamics, weakest on chaotic advection-dominated flows. The entire pipeline trains on a single GPU. The results suggest that small-model PDE performance is shaped by architectural inductive bias rather than scale, and that the structure of a prior’s failures is a useful empirical signal about its content.

[AI-17] From Latent Space to Training Data: Explainable Specialization in Minimal MLPs

Link: https://arxiv.org/abs/2605.25939
Authors: Enrique Alba, Ezequiel Lopez-Rubio
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:We study whether training biases can make hidden neurons specialize in minimal one-hidden-layer MLPs, and whether such specialization improves prototype-based reconstruction of the training dataset from the learned weights. We consider Gaussian-activation MLPs of width equal to dataset size and compare three structural losses that respectively encourage coverage of the training samples, separation between neuron-induced prototypes, and low overlap of hidden responses, against the standard fitting baseline. Experiments on uniformly sampled one-dimensional datasets show a stable pattern from N = 3 to N = 100 across 480 controlled runs. Coverage regularization gives the lowest mean reconstruction error at every tested size and raises the prototype-usage specialization ratio relative to the standard baseline, while separation has mixed effects and overlap penalties are systematically harmful. We show that the harm is not an optimization failure: overlap-active approaches fit the data as well as overlap-free ones but route the optimizer to a degenerate equilibrium in which prototype centers are pushed outside the convex hull of the training inputs. Coverage cannot reward this expulsion and acts as an attractor: separation admits it only at large temperature and overlap admits it at the nominal hyperparameter choice. A direct $\tau$-sweep on the separation-only mask and a prototype-position visualization at N = 100 confirm the mechanism. The findings yield a simple design principle for prototype-recoverability-aware training: every repulsive structural loss must be compensated by a compatible attractor, or it will collapse the latent geometry it was meant to refine.

[AI-18] Quantitative Evaluation of the Severity of Posttraumatic Stress Disorder through Transfer Learning from Specific Phobia Data

Link: https://arxiv.org/abs/2605.25933
Authors: Nicolas Ricka, Gauthier Pellegrin, Denis A. Fompeyrine, Thomas Rohaly, Leah Enders, Heather Roy
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Submitted to a peer-reviewed journal, comments welcome

Abstract:Posttraumatic stress disorder (PTSD) is a prevalent and debilitating mental health condition with significant personal and societal impacts. Current clinical assessments of PTSD often rely on subjective evaluations, which can be time-consuming, costly, and prone to human bias. This study proposes a machine learning (ML) approach based on a multivariate kernel density estimation (MKDE) technique for the objective evaluation of PTSD severity. We collected heart rate (HR) and galvanic skin response (GSR) signals as well as PTSD Checklist - Military Version (PCL-M) labels from 21 participants during an immersive simulation. A fear-response model was trained on a public arachnophobia dataset, and predictive features of PTSD were extracted from the fear-response curves estimated on the military dataset. The model achieved an accuracy of 86% in classifying PTSD status, effectively distinguishing participants with and without PTSD (PCL-M threshold of 36). The average mean absolute error (MAE) of the models is 5.6, and it estimated a clinical PTSD severity scale with a mean absolute percentage error of 17%. Our algorithm demonstrates promising potential for enhancing estimation of PTSD severity and follow-up by offering an objective and low-effort evaluation approach using physiology. These findings suggest clinical utility in both screening and follow-up settings.

[AI-19] Explore Before You Solve: The Speed–Depth Trade-off in Epistemic Agents for ARC-AGI-3

Link: https://arxiv.org/abs/2605.25931
Authors: Liew Keong Han
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 22 pages, 3 figures. Code: this https URL (CC0)

Abstract:We systematically investigate all 25 public ARC-AGI-3 games and find that every one is reachable through non-intelligent strategies: 10 in a single blind step, 5 after one probing action, 1 via repeated ACTION1 presses, 1 via diverse exploration, and 8 via single repeated actions with sufficient budget (50-200 steps). A library-level null-coordinate vulnerability additionally bypasses 18 games in 1 step. This benchmark critique implies the public evaluation set cannot discriminate intelligent exploration from trivial heuristics - the private 55-game evaluation is the only genuine intelligence test. Against this backdrop, we present AERA (Adaptive Epistemic Reasoning Agent), a three-phase (EXPLORE / VERIFY / PLAN) agent achieving RHAE=0.2116 (4/25 solved) on these 25 games with Qwen2.5-0.5B, while random and no-explore baselines score 0.0000. We formalise AERA through a Speed–Depth trade-off framework: under a convexity assumption (proved for a class of environments in the Appendix), RHAE’s quadratic form emerges as a second-order penalty for deviating from the Pareto frontier between action efficiency and information gain. Contributions: (i) a benchmark validity analysis showing that current interactive reasoning benchmarks fail to measure the exploration they claim to require, and (ii) the EXPLORE-before-PLAN framework and model-capability x exploration interaction. The linked code track entry achieves RHAE=0.30 on the full 55-game private evaluation. Code: CC0.

[AI-20] D2-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing

Link: https://arxiv.org/abs/2605.25893
Authors: Aoxi Liu, Yupeng Chen, James Oldfield, Guanzhe Hong, Junchi Yu, Baoyuan Wu, Philip Torr, Adel Bibi
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Despite the emergence of diffusion large language models (D-LLMs) as an alternative to autoregressive large language models (AR-LLMs), safety monitoring for D-LLMs remains largely unexplored. Unlike AR-LLMs, D-LLMs generate text through a multi-step denoising process, exposing intermediate hidden representations that may contain safety-relevant information unavailable in standard single-step monitoring setups. Motivated by the suitability of lightweight probes for always-on monitoring, we analyze which trajectory-level signals best indicate when such probes are likely to struggle. We find that the most informative signal is safety hesitation: intermediate hidden states repeatedly falling within a small margin of the probe’s decision boundary. The number of such hesitation steps in a D-LLM’s trajectory predicts probe failure effectively, providing a proxy of sample difficulty. Building on this analysis, we propose $D^2$-Monitor, a bi-level safety monitor for D-LLMs. $D^2$-Monitor adopts a lightweight probe as an always-on monitor to jointly estimate hesitation and perform base classification. When the hesitation level exceeds a threshold, a more expressive but computationally heavier probe is activated. This dynamic routing mechanism allocates monitoring resources efficiently at test time. Evaluated on 3 datasets (WildguardMix, ToxicChat, OpenAI-Moderation) across 4 D-LLMs, $D^2$-Monitor achieves state-of-the-art performance with a compact parameter footprint ($\leq$ 0.85M parameters), and exhibits the best trade-off between effectiveness and efficiency relative to 8 baselines.
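
The routing rule described in the abstract (count hesitation steps, escalate when the count exceeds a threshold) can be sketched as follows. The probe scores, the margin, the threshold, and the stand-in heavy probe are all hypothetical; the paper's probes operate on hidden states, which this toy example replaces with precomputed per-step scores.

```python
def count_hesitation(step_scores, boundary=0.5, margin=0.1):
    """Number of denoising steps whose light-probe score lies within `margin` of the boundary."""
    return sum(1 for s in step_scores if abs(s - boundary) <= margin)

def monitor(step_scores, heavy_probe, boundary=0.5, margin=0.1, max_hesitation=2):
    """Return (is_unsafe, used_heavy_probe) under an assumed routing rule."""
    if count_hesitation(step_scores, boundary, margin) > max_hesitation:
        # Too many near-boundary steps: defer to the more expensive probe.
        return heavy_probe(step_scores), True
    # Otherwise trust the light probe's score at the final denoising step.
    return step_scores[-1] >= boundary, False

def toy_heavy_probe(step_scores):
    """Stand-in for a heavier classifier: thresholds the mean score (assumption)."""
    return sum(step_scores) / len(step_scores) >= 0.5

confident = [0.90, 0.95, 0.97, 0.99]
wavering = [0.45, 0.55, 0.52, 0.48, 0.58]
confident_result = monitor(confident, toy_heavy_probe)
wavering_result = monitor(wavering, toy_heavy_probe)
```

Only the wavering trajectory pays for the heavy probe, which is the sense in which routing allocates monitoring compute at test time.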

[AI-21] From Accounting to Coordination: A Virtual Water-Aware Electricity-Computation-Water Nexus Framework for Data Center Dispatch

Link: https://arxiv.org/abs/2605.25854
Authors: Haiyang You, Chengwei Lou, Jin Zhao, Yue Zhou, Lu Zhang, Jin Yang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:The expansion of data centers (DCs) drives a sustained increase in electricity demand and associated water withdrawals at generation sites. These withdrawals occur at generation sites and are virtually allocated to demand based on network power flows. Consequently, the actual water footprint of a specific load varies dynamically with generation dispatch and network conditions. Existing approaches typically rely on static statistical accounting to quantify these water footprints. However, such static methods fail to capture how dispatch optimization and workload relocation dynamically affect water withdrawals. As a result, static statistical accounting approaches remain decoupled from the optimization process, rendering them incapable of guiding workload relocation or power dispatch to mitigate water stress. To address this limitation, this paper develops an operational electricity-computation-water (ECW) nexus framework that internalizes virtual water impacts directly into power system dispatch. The framework represents dispatch optimization as a differentiable optimization layer embedded within a deep learning architecture, enabling efficient end-to-end learning of coordination policies while preserving operational feasibility. Combined with fixed-point coordination, the framework enforces consistency between virtual water attribution and physical generation-side withdrawals. Case studies on the IEEE 30-bus and 118-bus test systems demonstrate reliable convergence, exact power-water consistency, and reductions of approximately 3-5% in generation-related freshwater withdrawals under water-constrained conditions.

[AI-22] Geometric Evolution Maps: Extracting Stable Concept Probes from Transformer Residual Streams

Link: https://arxiv.org/abs/2605.25848
Authors: James Henry
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 24 pages, 3 figures. Reference implementation: rosetta_tools v1.3.1 (doi: https://doi.org/10.5281/zenodo.20361433)

Abstract:Concept probes extracted from transformer residual streams are only as reliable as the layer from which they are extracted. The common practice of probing at a fixed late layer or at the peak of a separation score function ignores a fundamental structural feature: concept representations undergo substantial directional rotation during their assembly phase, and do not settle into a stable direction until a characteristic handoff layer after the primary Concept Allocation Zone (CAZ). We introduce Geometric Evolution Maps (GEMs), which track the full directional trajectory of a concept through residual stream activations, identify the handoff layer where rotation ceases, and extract the settled probe direction from that layer. Across 23 architectures spanning 70M to 14B parameters and 17 concept types, the entry-to-exit cosine similarity within CAZs has a mean of 0.233, showing that probe direction at CAZ entry does not reliably predict probe direction at exit. Ablation experiments across 391 concept x model pairs (23 models x 17 concepts) show that GEM-extracted probes are at least as precise as peak-layer probes in 268/391 trials (68.5%), and strictly outperform in 259/391 (66.2%). The architecture split is pronounced: MHA models favour the handoff in 173/221 trials (78.3%); GQA models favour the handoff in only 56/119 trials (47.1%). Model-level Wilcoxon: W=214, N=23, p=0.010 (one-sided). An adaptive ablation width rule targets the 79/391 near-final-layer cases: it improves probe quality in 60/79 triggered cases (75.9%), mean gain +7.44pp. A direction-specificity control confirms the ablation effect is concept-direction specific: median 377x suppression rate versus random-direction ablation (99.1% of concept directions beat all 10 random seeds). Reference implementation: rosetta_tools v1.3.1 (doi:https://doi.org/10.5281/zenodo.20361433).
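
To make the idea of a handoff layer concrete, the sketch below tracks cosine similarity between concept directions at consecutive layers and returns the earliest layer after which the direction stops rotating. This is not the rosetta_tools implementation: the stability criterion, the 0.99 threshold, and the toy 2-D directions are assumptions chosen for illustration.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def handoff_layer(directions, threshold=0.99):
    """Earliest layer index after which every consecutive pair of directions stays aligned."""
    sims = [cosine(directions[i], directions[i + 1]) for i in range(len(directions) - 1)]
    handoff = len(directions) - 1
    # Walk backwards from the last layer while the direction remains stable.
    for i in range(len(sims) - 1, -1, -1):
        if sims[i] < threshold:
            break
        handoff = i
    return handoff

# Toy per-layer concept directions: rotating for two layers, then settled.
directions = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [0.01, 1.0], [0.0, 1.0]]
layer = handoff_layer(directions)
entry_exit = cosine(directions[0], directions[-1])
```

In this toy trajectory the first and last directions are orthogonal, mirroring the abstract's point that the direction at entry need not predict the direction at exit.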

[AI-23] Context-Instrumental Data Distillation for Kubernetes Manifest Generation: Method and Experimental Evaluation

Link: https://arxiv.org/abs/2605.25835
Authors: Andrey Kozachok, Anatoliy Bakaev, Aleksandr Kozachok, Shamil Magomedov, Artem Noev
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 4 figures, 2 tables

Abstract:This paper examines the specialization of Small Language Models (SLMs) with up to 4 billion parameters for generating artifacts in domain-specific languages (DSL). Kubernetes manifests are chosen as the target domain. We propose the context-instrumental data distillation method: the source corpus is formed through synthetic generation and, in an extended scheme, through reverse instruction generation from real Kubernetes YAML files, with pairs included in training only upon passing external validators and matching the domain context model. Unlike classical KL-divergence knowledge distillation, the baseline implementation reduces to supervised fine-tuning on instrumentally verified examples. The experimental section presents a pilot implementation under resource-constrained conditions: the DeepSeek-V4 Flash API serves as the teacher for synthetic generation, while Qwen2.5-Coder-1.5B-Instruct is fine-tuned via LoRA on CPU. On the K8s-Distill-Pilot corpus (train_1200, validation_100, test_200), we achieved full-pass@1 = 91.5% (183/200) with a stricter prompt formulation and max_new_tokens=768. The key empirical finding is that for Kubernetes YAML, result quality in the pilot depended more on strict output format requirements than on simply increasing the number of training examples.

[AI-24] OASIS: Observation-Action Space Alignment via SE(3) Trajectory Prediction for Robotic Manipulation

Link: https://arxiv.org/abs/2605.25829
Authors: Xinzhe Chen, Sihua Ren, Liqi Huang, Haowen Sun, Mingyang Li, Xingyu Chen, Zeyang Liu, Xuguang Lan
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent vision-language-action (VLA) models and world action models (WAMs) advance robotic manipulation by enriching intermediate representations with auxiliary spatial features or future visual-state prediction. However, these representations largely remain within the observation space and do not share the rigid-body geometry of the action space, forcing the action decoder to implicitly recover this geometry. We propose OASIS, a visuomotor policy that aligns the intermediate representation with the action space via SE(3) end-effector trajectory prediction. OASIS couples a 3D-aware feature encoder that fuses vision-language and metric-depth features with an SE(3) trajectory predictor that produces a camera-frame end-effector trajectory. Conditioned on the predictor’s pose-supervised hidden states, the action decoder generates action chunks consistent with rigid-body motion. Across simulation and real-world experiments, OASIS outperforms VLA and WAM baselines in success rate and out-of-distribution generalization. Our project page is available at this https URL.

[AI-25] When Can We Trust Early Warnings? Leakage-Excluded Early Outcome Prediction from LMS Interaction Logs

Link: https://arxiv.org/abs/2605.25794
Authors: Ngoc Luyen Le, Marie-Hélène Abel, Bertrand Laforge
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Early-warning models built from Learning Management System (LMS) logs aim to predict end-of-course outcomes early enough to enable timely learner support. However, reported “early” performance is often inflated by temporal leakage. This occurs when the pipeline uses information that would not yet be available at the time of prediction. We formalize cutoff-based early outcome prediction under a temporal availability constraint and introduce LEAP (Leakage-Excluded Early-Availability Protocol), which enforces cutoff-first truncation prior to joins and aggregation and audits feature provenance to prevent post-cutoff evidence from entering the benchmark. We instantiate LEAP on the public Open University Learning Analytics Dataset (OULAD) as a multi-step protocol for leakage-controlled evaluation across weekly cutoffs. Using several standard learning methods, we evaluate performance using ROC-AUC, PR-AUC, Brier score, and F1@0.5. Results show improving performance as the observation window expands, with a marked gain around week 3; Random Forest performs best at the earliest cutoffs, while Gradient Boosting dominates thereafter. Leakage ablations further show that temporal violations, especially through assessment information, can inflate apparent “early” performance.
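
Cutoff-first truncation can be shown on a toy interaction log. The record fields (`student`, `day`, `clicks`) and both helper functions are hypothetical and do not reflect the OULAD schema or the LEAP code; the sketch only contrasts aggregating after truncation with aggregating over the full log.

```python
def aggregate_clicks(events):
    """Sum clicks per student over whatever events are passed in."""
    totals = {}
    for e in events:
        totals[e["student"]] = totals.get(e["student"], 0) + e["clicks"]
    return totals

def truncate_then_aggregate(events, cutoff_day):
    """Drop post-cutoff events first, then aggregate (the leakage-free order)."""
    visible = [e for e in events if e["day"] <= cutoff_day]
    return aggregate_clicks(visible)

events = [
    {"student": "s1", "day": 3, "clicks": 5},
    {"student": "s1", "day": 10, "clicks": 2},
    {"student": "s1", "day": 30, "clicks": 40},  # not yet observable at day 14
    {"student": "s2", "day": 20, "clicks": 9},   # not yet observable at day 14
]

clean = truncate_then_aggregate(events, cutoff_day=14)
leaky = aggregate_clicks(events)  # aggregates before truncating: uses future activity
```

The leaky features credit `s1` with activity from day 30 and include `s2` at all, neither of which a predictor running on day 14 could have seen.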

[AI-26] On the Benefits of Free Exploration for Regret Minimization in Multi-Armed Bandits

Link: https://arxiv.org/abs/2605.25789
Authors: Yunlong Hou, Zixin Zhong, Vincent Y. F. Tan
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML)
Comments: 55 pages

Abstract:We study a stochastic multi-armed bandit problem where an agent is granted a free exploration budget before regret accumulates, a setting not captured by the classic regret minimization or pure exploration paradigms. The goal is to design an adaptive policy that strategically explores the bandit instance in the initial free exploration phase and minimizes the cumulative regret in the subsequent phase. We formalize this regret minimization with free exploration problem and identify an interesting regime where the free exploration budget scales logarithmically with the time horizon. To quantify the amount of regret saved with high probability as a result of the availability of the free exploration phase, we introduce a novel set of policies known as $(\alpha,\beta)$-probably saving policies. We propose a two-phase, probably saving algorithm, UFE-KLUCB-H, which consists of a principled free exploration policy, UFE, and a history-aware regret minimization policy KLUCB-H. Instance-dependent upper bounds on UFE-KLUCB-H are derived, showing that UFE-KLUCB-H accumulates strictly less regret than policies that do not have access to a free exploration phase. Complementarily, we derive instance-dependent lower bounds based on novel multi-instance perturbation arguments tailored to the free-exploration setting, demonstrating the near-optimality of UFE-KLUCB-H for two-valued bandits. Our upper and lower bounds reveal sharp phase transitions in the accumulated regret depending on the amount of available free exploration. Simulations are conducted to demonstrate that forced exploration and adaptivity in the algorithm lead to greater regret savings.

[AI-27] NPSolver: Neural Poisson Solver with Iterative Physics Supervision KDD2026

Link: https://arxiv.org/abs/2605.25786
Authors: Bocheng Zeng, Rui Zhang, Runze Mao, Mengtao Yan, Xuan Bai, Yang Liu, Zhi X. Chen, Hao Sun
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: KDD 2026

Abstract:Efficiently solving Poisson equations on complex, irregular domains remains a fundamental challenge in scientific computing, as classical iterative solvers often suffer from prohibitive runtime due to ill-conditioned systems. While neural operators offer a fast alternative, they typically rely on large-scale labeled datasets or struggle with unstable training dynamics when using physics-informed residual losses. We propose NPSolver, a neural Poisson solver trained without solution labels via iterative physics supervision. Instead of relying on fully converged numerical solutions or raw PDE residuals, NPSolver utilizes a small number of preconditioned conjugate gradient (PCG) steps to refine its own predictions, providing a more stable and well-scaled training signal. Theoretical analysis confirms that this iterative supervision serves as a well-conditioned error proxy and that a stop-gradient design is essential for optimization stability. To better capture boundary-driven features under mixed boundary conditions, we further introduce the Boundary-Aware Transolver (BA-Transolver) architecture that explicitly separates interior and boundary tokenization. Extensive evaluations on 2D and 3D irregular geometries demonstrate that NPSolver outperforms both physics-informed and data-driven baselines. Furthermore, a downstream thermal control task highlights the model’s capability for conducting efficient and reliable gradient-based boundary control. We will release our code and data at this https URL.

[AI-28] MDGMIX: Boundary-Aware Subgraph Mixing for Multi-Domain Graph Pre-Training ICML2026

Link: https://arxiv.org/abs/2605.25771
Authors: Ziyu Zheng, Yaming Yang, Ziyu Guan, Wei Zhao, Xinyan Huang
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by ICML 2026

Abstract:Multi-domain graph pre-training is a crucial step in constructing foundational graph models with cross-domain generalization capabilities. However, existing methods predominantly rely on jointly training all source domain graphs, resulting in high computational costs. Furthermore, it remains unclear whether all source domain graph data contribute equally to effective transfer. This paper empirically reveals significant data redundancy in multi-domain graph pre-training. Based on this finding, we propose the Multi-domain Graph Pre-training Framework, MDGMIX, which combines boundary-aware subgraph mixing with hierarchical discrimination. By selecting boundary nodes to construct challenging mixed-domain subgraphs, MDGMIX employs coarse-grained domain discrimination and fine-grained domain decomposition losses to decouple shared patterns from domain-specific patterns. During adaptation, MDGMIX employs a lightweight prompt weighting mechanism to transfer source domain knowledge. Extensive experiments demonstrate that MDGMIX consistently outperforms strong baselines in few-shot classification tasks while exhibiting superior time and memory efficiency. The code is available at: this https URL.

[AI-29] Agent -Centric Social Trajectory Prediction: A Free Energy Principle Perspective

Link: https://arxiv.org/abs/2605.25748
Authors: Yanping Wu,Ji Zhang,Hao Chen,Edmond S.L. Ho,Chongfeng Wei
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 10 pages, 4 figures

Click to view abstract

Abstract:Trajectory prediction methods have demonstrated remarkable capabilities in capturing complex motion patterns. However, existing methods rely on global state assumptions, suffer from insufficient belief inference under partial observability, and lack cognitive behavioral constraints in prediction. These limitations severely compromise both deployment feasibility and physical plausibility in real-world settings. In this work, we propose FEP-Diff, an agent-centric trajectory prediction framework grounded in the Free Energy Principle, aimed at achieving cognitively plausible predictions under realistic constraints. Specifically, a dual-branch spatiotemporal encoder extracts ego-motion dynamics and social interaction cues from local observations. Building upon this, a goal-conditioned belief learner infers multimodal latent belief distributions optimized via a free-energy objective, with a social consistency constraint on the local neighborhood graph to promote cognitive alignment among neighboring agents. Finally, a residual diffusion trajectory generator is conditioned on the learned belief representations with token-level proxy conditioning, producing precise and diverse future predictions. Extensive experiments on five public benchmarks demonstrate that FEP-Diff consistently outperforms state-of-the-art methods under restricted observability. Code: this https URL.

[AI-30] A Deep Dive into Axiomatic Design – Part I: Problem Formulation

Link: https://arxiv.org/abs/2605.25735
Authors: Aydin Homay
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted at ICAD 2026 (MIT); the final camera-ready version will be available once it is published by Springer

Click to view abstract

Abstract:Problem formulation, translating customer needs and constraints into a minimum set of independent first-level functional requirements (FRs), is arguably the most critical step in every design framework, including axiomatic design, yet it is frequently misunderstood or underestimated in practice. This paper focuses exclusively on problem formulation in axiomatic design: it clarifies what first-level FRs are (and are not), explains why they should not legitimately vary across designers given the same needs and constraints, and highlights intrinsic difficulties and recurring pitfalls that lead to design failure. The discussion is grounded primarily in Nam P. Suh’s three books, The Principles of Design, Axiomatic Design Advances and Applications, and Complexity Theory, and it offers practical guidance to help designers formulate well-posed first-level FRs. Finally, the paper briefly revisits problem formulation in the era of large language models and discusses what such tools can (and cannot) contribute at the first level.

[AI-31] Learning to Search and Searching to Learn for Generalization in Planning ICML2026

Link: https://arxiv.org/abs/2605.25720
Authors: Michael Aichmüller,Yannik Hesse,Hector Geffner
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted at ICML 2026

Click to view abstract

Abstract:Combinatorial generalization remains a central challenge in Deep Reinforcement Learning (DRL). Classical planning provides a simple yet challenging setting to study this problem through explicit relational descriptions, without requiring learning from perception. In sparse-reward domains, standard RL exploration via real-time search is ineffective, and learning-based planning methods often rely on expert demonstrations, hindsight relabeling, or random walks from the goal state. In contrast, planners rely on best-first search methods such as $\mathrm{A}^\star$ to solve problems from scratch. We propose a self-improving $\mathrm{WA}^\star$ learning framework in combination with a value heuristic represented by a Relational Graph Neural Network: the heuristic guides search, and the resulting search data updates the heuristic via Q-learning. This loop yields heuristics that can function as general policies and solve new instances even without search, where DRL otherwise fails, as we show on puzzles such as Sokoban, PushWorld, The Witness, and the 2023 International Planning Competition benchmarks. Notably, we demonstrate strong zero-shot generalization: for example, heuristics trained on Blocksworld instances with fewer than 30 blocks successfully solve instances with 488 blocks without search.
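
To make the "heuristic guides search" half of this loop concrete, here is a minimal weighted A* sketch that orders nodes by f = g + w·h. It is only an illustration under stated assumptions: the toy 5x5 grid, the Manhattan-distance heuristic (standing in for the learned relational GNN value heuristic), and the names `weighted_a_star`, `grid_neighbors`, and `manhattan` are hypothetical and not taken from the paper. The learning half (updating the heuristic from search data via Q-learning) is not shown.

```python
import heapq


def weighted_a_star(start, goal, neighbors, heuristic, weight=2.0):
    """Best-first search ordered by f = g + weight * h; returns a path or None."""
    open_heap = [(weight * heuristic(start), 0, start)]
    g_cost = {start: 0}
    parent = {start: None}
    while open_heap:
        _, g, node = heapq.heappop(open_heap)
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        if g > g_cost[node]:
            continue  # stale heap entry; a cheaper route was found later
        for nxt in neighbors(node):
            new_g = g + 1  # unit action costs
            if new_g < g_cost.get(nxt, float("inf")):
                g_cost[nxt] = new_g
                parent[nxt] = node
                f = new_g + weight * heuristic(nxt)
                heapq.heappush(open_heap, (f, new_g, nxt))
    return None


# Toy 5x5 grid with a short wall; purely illustrative.
WALLS = {(1, 1), (2, 1), (3, 1)}
GOAL = (4, 4)


def grid_neighbors(cell):
    x, y = cell
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < 5 and 0 <= ny < 5 and (nx, ny) not in WALLS:
            yield (nx, ny)


def manhattan(cell):
    """Hand-written heuristic standing in for a learned value estimate."""
    return abs(cell[0] - GOAL[0]) + abs(cell[1] - GOAL[1])


path = weighted_a_star((0, 0), GOAL, grid_neighbors, manhattan)
print(path)
```

Swapping `manhattan` for a learned function is the only change the search itself would need; a weight above 1 trades optimality guarantees for fewer node expansions.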

[AI-32] FLOATBench: A Dataset and Benchmark for Floating Offshore Wind Turbine Tower Fatigue

Link: https://arxiv.org/abs/2605.25717
Authors: João Alves Ribeiro,Bruno Alves Ribeiro,Francisco Pimenta,Sérgio M. O. Tavares,Faez Ahmed
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Most of the world’s offshore wind resource lies in waters too deep for fixed-bottom foundations, making floating offshore wind turbines (FOWTs) essential for deep-water deployment. As the industry scales toward 22 MW class designs, tower fatigue becomes increasingly critical because larger structures amplify the coupled aero-hydro-servo-elastic loads induced by continuous wind and wave excitation. Accurate fatigue-damage prediction is therefore central to certification, design optimization, and cost reduction. Yet the field lacks a shared surrogate benchmark: studies report different simulations, splits, and metrics, making methods difficult to compare. We present FLOATBench, a public tabular benchmark with 582,120 per-section fatigue-damage labels across three 22 MW FOWT tower geometries, derived from 19,404 high-fidelity OpenFAST simulations across the three towers (6,468 per tower: 1,078 aligned wind/wave operating points $\times$ six turbulence seeds), labeled at 30 cross-sections per tower. FLOATBench includes a regime-aware alpha-shape partition of the joint wind/wave operating envelope, stratifying test points into in-train, interpolation, and extrapolation regimes. It is paired with a reproducible evaluation harness covering three protocol levels: random validation (E1), within-tower regime-aware evaluation (E2), and cross-tower transfer (E3). The regime-aware protocol reveals rank shifts between global and extrapolation performance that random-split leaderboards cannot detect. To the authors’ knowledge, FLOATBench is the first FOWT fatigue benchmark for tabular surrogate modeling, and offers an evaluation protocol that generalizes to engineering surrogates defined over physical operating envelopes. Dataset and code available at: this https URL.
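
The dataset sizes quoted in this abstract are mutually consistent, which a few lines of arithmetic confirm. The variable names below are illustrative only; the numbers are the ones stated in the abstract.

```python
# Sanity check of the dataset sizes quoted in the FLOATBench abstract.
operating_points = 1_078   # aligned wind/wave operating points per tower
turbulence_seeds = 6       # turbulence seeds per operating point
towers = 3                 # 22 MW FOWT tower geometries
sections_per_tower = 30    # labeled cross-sections per tower

sims_per_tower = operating_points * turbulence_seeds
total_sims = sims_per_tower * towers
total_labels = total_sims * sections_per_tower

print(sims_per_tower, total_sims, total_labels)  # 6468 19404 582120
```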

[AI-33] AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions ICML2026

Link: https://arxiv.org/abs/2605.25707
Authors: Jingwei Sun,Jianing Zhu,Yuanyi Li,Tongliang Liu,Xia HU,Bo Han
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted by ICML 2026

Click to view abstract

Abstract:Autonomous computer-use agents powered by multimodal large language models (MLLMs) are emerging as capable assistants for completing complex digital workflows. However, real-world execution environments are far from ideal: pop-ups, resolution changes, and competing applications frequently interfere with agent perception and control. We introduce AgentHijack, a benchmark designed to evaluate the robustness of computer-use agents under common corruptions, where the uncertainties in dynamic environments disrupt the execution flow without direct adversarial intent. Specifically, AgentHijack introduces 9 configurable common corruptions to replicate realistic imperfect scenarios. We evaluate a variety of desktop tasks that utilize MLLM-based agents and discover that even minor instances of corruption can result in substantial performance degradation, which emphasizes the fragility of agents and underscores the necessity of robustness evaluation. Afterward, we propose AgentHijack-Agent, a framework that integrates an action generator with enhanced grounding capabilities and an onlooker responsible for behavior summarization and environment checking. Extensive experiments validate its effectiveness. Our code, environment, baseline models and data are publicly available at: this https URL.

[AI-34] How Should LLMs Consume High-Quality Data? Optimal Data Scheduling via Quality-Aware Functional Scaling Laws

Link: https://arxiv.org/abs/2605.25698
Authors: Zhitao Zhu,Xili Wang,Shizhe Wu,Jiawei Fu,Xiaoqing Liu
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:High-quality data is scarce in large language model (LLM) training, yet how to schedule its use jointly with training dynamics lacks theoretical guidance. We extend functional scaling laws by incorporating a data-quality dimension, and solve the joint data-quality and batch-size scheduling problem in asymptotic closed form. The solution reveals two regimes and a dual role of high-quality data. In the noise-limited regime, high-quality data should be used as a signal amplifier: lowering the batch size converts cleaner data into more signal without amplifying noise. In the signal-limited regime, it should be used as a noise suppressor: late placement reduces terminal noise without sacrificing signal accumulation. Existing curriculum-style pipelines primarily exploit the second role by placing cleaner data late, but miss the first role because conventional decay schedules reduce update intensity exactly when high-quality data becomes available. Guided by this, we propose Drop-Stable-Rampup for LLM midtraining: upon the quality transition, drop the batch size, hold it stable to accumulate signal, then ramp up to suppress terminal noise. On a 15B Mixture-of-Experts model midtrained on 108B tokens, Drop-Stable-Rampup improves average accuracy over Warmup-Stable-Decay (WSD) by +1.70 and over Cosine-decay by +2.98, with particularly large gains on mathematical reasoning benchmarks such as GSM8K (+4.23) and MATH (+2.80).
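
The three phases named in the abstract (drop, hold stable, ramp up) can be pictured as a batch-size schedule over the high-quality-data phase. The sketch below is a guess at the general shape only: the function name `drop_stable_rampup`, the linear ramp, the `stable_frac` split, and all batch sizes are hypothetical choices, not values or formulas from the paper.

```python
def drop_stable_rampup(step, total_steps, low_batch, final_batch, stable_frac=0.5):
    """Hypothetical batch size at `step` of the high-quality-data phase.

    Phase 1 (drop + stable): hold the reduced batch size `low_batch`.
    Phase 2 (ramp-up): grow linearly to `final_batch` by the last step.
    The linear ramp and the 50/50 split are assumptions for illustration.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must lie in [0, total_steps)")
    stable_steps = int(stable_frac * total_steps)
    if step < stable_steps:
        return low_batch
    ramp_steps = total_steps - stable_steps
    progress = (step - stable_steps + 1) / ramp_steps
    return round(low_batch + progress * (final_batch - low_batch))


schedule = [drop_stable_rampup(s, 10, low_batch=128, final_batch=512) for s in range(10)]
print(schedule)
```

Under this reading, the small batch early in the phase corresponds to the "signal amplifier" role of clean data, and the large batch at the end corresponds to the "noise suppressor" role described in the abstract.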

[AI-35] Profiling-Driven Adaptive Distributed Transformer Inference on Embedded Edge Deployment

Link: https://arxiv.org/abs/2605.25682
Authors: Muhammad Azlan Qazi,Alexandros Iosifidis,Qi Zhang
Affiliation: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Distributing Transformer inference across embedded edge devices can alleviate individual memory and compute constraints, yet practical benefits on real hardware remain unclear: prior work relies largely on simulations that overlook hardware-specific communication overheads. We present a hardware prototype study on NVIDIA Jetson Orin Nano devices connected over WiFi. Our key finding is that the dominant bottleneck is not just network bandwidth but also the CPU-GPU staging during communication. Because Jetson’s integrated GPU architecture lacks the PCIe/NVLink pathway that NCCL requires, all inter-device data communication should be routed through GLOO and staged in CPU memory, an overhead that scales with communication data volume and makes full-tensor exchange slower than single-device inference across batch sizes for medium-sized models such as ViT. We therefore evaluate Prism by combining Segment Means compression with lightweight offline profiling to adaptively select between local and distributed execution at runtime. Experiments show that this strategy reduces latency by 65%-77% and energy consumption by 34%-52% relative to full-tensor exchange in a static distributed execution setup, demonstrating that profiling-driven adaptation is essential for practical distributed Transformer inference on embedded hardware.
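
The runtime decision described here, choosing local or distributed execution from an offline profile, can be sketched as a simple table lookup. This is a hypothetical illustration: the `choose_mode` function, the table layout, and the latency values are placeholders invented for the example, not measurements or code from the paper.

```python
# Placeholder offline profile: batch size -> latency in ms per execution mode.
# The numbers are made up for illustration and are NOT results from the paper.
EXAMPLE_PROFILE = {
    1: {"local": 10.0, "distributed": 25.0},
    8: {"local": 40.0, "distributed": 38.0},
    32: {"local": 90.0, "distributed": 60.0},
}


def choose_mode(profile, batch_size):
    """Return the mode with the lowest profiled latency at the nearest profiled batch size."""
    nearest = min(profile, key=lambda profiled: abs(profiled - batch_size))
    latencies = profile[nearest]
    return min(latencies, key=latencies.get)


for batch in (2, 30):
    print(batch, choose_mode(EXAMPLE_PROFILE, batch))
```

A lookup like this costs almost nothing at runtime, which fits the "lightweight" framing in the abstract; a real deployment would presumably also profile energy and interpolate between measured batch sizes.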

[AI-36] Don’t Retrain, Just Reuse: Recovering Dual-Target Molecules from Single-Target Diffusion Models

Link: https://arxiv.org/abs/2605.25681
Authors: Qingyuan Zeng,Pengxiang Cai,Zixin Guan,Ziyang Chen,Anglin Liu,Lang Qin,Xinyao Lai,Jintai Chen
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Designing a single molecule that modulates two targets is a promising strategy for polypharmacology, but it remains substantially harder than standard single-target generation because one candidate must satisfy two binding requirements while preserving drug-likeness and synthesizability. Existing dual-target generative methods typically introduce dual-target capability by either retraining the generator or intervening in the diffusion process during sampling. The former can be costly and difficult to stabilize when dual-target supervision is sparse, while the latter may be sensitive to denoising-time target balancing and competing update directions. These limitations motivate a generator-preserving alternative that keeps the pretrained prior intact: can dual-target candidates instead be recovered from the input space of a frozen single-target diffusion model, without modifying its parameters or denoising dynamics? We formulate this task as a constrained multi-objective optimization problem and propose REUSE, a hierarchical evolutionary input-space search framework that combines pair-conditioned exploration with structured multi-stage selection to enforce dual-target affinity, chemical quality, and diversity. Experiments show that, compared with methods that modify the diffusion process, REUSE consistently improves dual-target affinity and balance, achieving a 20.9-percentage-point gain in Dual High Affinity over the strongest prior baseline while maintaining competitive molecular quality.

[AI-37] Referential Security as a New Paradigm for AI Evaluations

Link: https://arxiv.org/abs/2605.25673
Authors: Dan Ristea,Vasilios Mavroudis
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Security evaluations inherently depend on stable identifiers. Any finding, audit, or regulatory decision must remain attached to the specific artifact it pertains to. Continuously updated artificial intelligence systems violate this core assumption, with public model designations remaining static while underlying weights, prompts, retrieval mechanisms, misuse classifiers, inference settings, and serving infrastructures undergo unannounced modifications. Consequently, current evaluations frequently apply to superficial labels rather than identifiable and distinct systems. To resolve this, we propose referential security as a new paradigm for AI evaluation. The fundamental security question extends beyond whether a model is safe to whether subsequent parties can conclusively determine which system a specific safety claim addressed. This approach reframes model identity as an empirically verifiable property and separates referential stability from the substantive security claims it conditions. This framework brings tractability to three critical workflows that current practices handle poorly. Specifically, it enables reproducible evaluation, longitudinal audit validity, and cross-provider equivalence. By grounding these evaluations in verifiable artifacts, our approach ensures that safety audits and regulatory findings maintain their empirical utility across the operational lifecycle of dynamic systems.

[AI-38] Meta-Engineering Harnesses for AI-Native Software Production: A Contract-Driven Adversarial Verification Architecture with Early Deployment Report

Link: https://arxiv.org/abs/2605.25665
Authors: Satadru Sengupta,Tamunokorite Briggs,Ivan Myshakivskyi
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 17 pages, 2 figures, early deployment report

Click to view abstract

Abstract:AI-native software development is often evaluated at the level of individual models, prompts, or generated artifacts. This framing is insufficient for production environments where software must be continuously produced, verified, deployed, maintained, and adapted across many operational contexts and long time horizons. We present a meta-engineering harness: a software-production architecture that transforms operational and product feature requirements into explicit contracts, routes work through role-specialized AI agents, performs independent and adversarial verification, and continuously improves itself through structured failure classification and outer-loop calibration. The harness is designed for settings in which software delivery is not a one-time project but an ongoing operating function. In our motivating application, CTO-as-a-service for small service firms, the system manages websites, booking flows, payment systems, backoffice workflow automations, and AI-agent interfaces as continuously evolving technical infrastructure rather than one-off deliverables. We describe the layered architecture, including two-pass contract compilation, persistent markdown memory with specialization records, attention-based and independence-based verifications, a four-way failure arbiter, and outer-loop calibration. We report results from an early production deployment spanning 17 features over several weeks, including a detailed in-app payments case study that revealed contract incompleteness and verification-boundary issues. These observations directly drove targeted improvements to the harness. The contribution is an implemented, measurable, and extensible verification architecture for making AI-native service-as-a-software production reliable, auditable, and improvable over time. 

[AI-39] Fine-Tuning and Serving Gemma 4 31B on Google Cloud TPU: A Technical Comparison with GPU Baselines

Link: https://arxiv.org/abs/2605.25645
Authors: Jatin Kishnani,Mayank Goel,Amit Singh,Pulkit Agrawal,Sairanjan Mishra
Affiliation: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:We present the first end-to-end demonstration of fine-tuning and serving Google’s Gemma 4 31B model on TPU hardware, providing an empirical comparison of TPU and GPU platforms for large language model adaptation. Using LoRA on a Google TPU v5p-8 for training and TPU v6e-8 (Trillium) for inference, we document the full set of code-level adaptations required to port a GPU-native training recipe, built on PyTorch, HuggingFace TRL, and FSDP, to the JAX + Tunix/Qwix stack. These adaptations span mesh configuration, LoRA module naming conventions, sharding annotation corrections, gradient checkpointing, data pipeline restructuring, and a custom Orbax-to-safetensors checkpoint merging procedure. For inference, we detail the vLLM-TPU Docker setup necessary to serve Gemma 4 on v6e-8 and characterize the resulting latency and throughput profile. Compared with a 2xH100 GPU baseline under identical hyperparameters, TPU training completes 1.61x faster at 2.12x lower cost. Inference throughput is within 3% across platforms, while TPU achieves 2x lower time-to-first-token (235 ms vs. 475 ms). Together, the TPU configuration is 1.82x cheaper for a representative train-plus-service workload. Our work removes a critical gap in the open tooling ecosystem and provides practitioners with a reproducible, production-ready recipe for Gemma 4 deployment on TPU infrastructure.

[AI-40] Insuring Every Action: An Authority Frontier Framework for Runtime Actuarial Control of Autonomous AI Agents

Link: https://arxiv.org/abs/2605.25632
Authors: Hao-Hsuan Chen
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Risk Management (q-fin.RM)
Comments: 35 pages, 4 figures, 11 tables. Companion paper on the mathematical foundations: SSRN 6761960

Click to view abstract

Abstract:Autonomous AI agents increasingly issue side-effect-bearing actions: database mutations, refunds, payments, external commitments. We propose the Actuarial Action Interface (AAI), a deterministic runtime contract that prices each such action against a contractually fixed safe default under a time-consistent risk mapping, and gates execution against a per-boundary reserve capital budget. We then develop the Authority Frontier, an evaluation primitive measuring how much autonomous authority the runtime releases at each level of reserve capital. The framework provides (i) a deterministic quote-bind-commit protocol with toll-bounded capability tokens; (ii) a universal seven-class action taxonomy mapping heterogeneous tool calls to comparable authority units; (iii) replay determinism and pathwise reserve coverage under alpha-spending; (iv) cross-domain normalization via full reserve demand C_full and capital metrics Capital@k. We instantiate AAI across four agentic environments (database mutation, customer-service refund, and the public tau-bench retail and airline tool-use traces) and report a live Postgres panel in which three Azure-hosted models propose actions through the same contract. The frontier exhibits a common low-reserve refusal and intermediate-release pattern across domains, with saturation only where the budget grid reaches full reserve demand; required reserve capital varies by 22x (Capital@50 from 289 to 6457). The framework does not force domains into the same shape; it surfaces each domain’s actuarial geometry. In the live panel the contract prevents realized loss across all three models at low budget while differing in underwriting persistence under denial: model identity is an actuarial underwriting variable. The contribution is a benchmark-ready evaluation framework for runtime actuarial control of autonomous-agent side effects.

[AI-41] CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents

Link: https://arxiv.org/abs/2605.25624
Authors: Bowen Wang,Dunjie Lu,Junli Wang,Tianyi Bai,Shixuan Liu,Zhipeng Zhang,Haiquan Wang,Hao Hu,Tianbao Xie,Shuai Bai,Dayiheng Liu,Que Shen,Junyang Lin,Tao Yu
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Reinforcement learning with verifiable rewards (RLVR) has driven breakthroughs in domains such as math, tool use, and software engineering, yet its extension to computer-use agents (CUAs) has been bottlenecked by the scarcity of scalable training data with deterministic rewards. Constructing such data for CUAs requires a consistent task instruction, an executable environment, and a verifiable reward. Hand-curated benchmarks achieve high reward fidelity but cover few applications, whereas LLM-as-judge-based datasets scale broadly but lack reliable verification. We present CUA-Gym, a scalable pipeline that co-generates task instructions, environment states, and reward functions. Concretely, a Generator agent constructs the initial and golden environment states, and a separate Discriminator agent writes the reward function from the task specification. An orchestrator agent drives the two through iterative rounds upon execution. Generated tuples then pass a final filter combining LLM majority voting and agent rollouts, ensuring quality beyond the per-task adversarial loop. To address the scarcity of training environments, we further synthesize CUA-Gym-Hub, a broad suite of high-fidelity mock web applications grounded in real-world software-use distributions, expanding the scale of CUA RLVR data by magnitude. Using this pipeline, we construct CUA-Gym, a dataset of 32,112 verified RLVR training tuples grounded in 110 environments. Trained with GSPO on CUA-Gym, our CUA-Gym-A3B and CUA-Gym-A17B achieve 62.1% and 72.6% on OSWorld-Verified, outperforming prior open-source CUAs at comparable scales, with performance scaling smoothly in both data volume and environment diversity. The same checkpoints also improve on the held-out WebArena benchmark, indicating transfer beyond the training environments. We will open-source the full synthesis pipeline, dataset, CUA-Gym-Hub environments, and models.

[AI-42] Back to Parsimonious Latents: Learning Task-Centric World Models from Visual Foundations

Link: https://arxiv.org/abs/2605.25620
Authors: Minghao Fu,Fan Feng,Nicklas Hansen,Biwei Huang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:World models enable agents to predict future dynamics conditioned on actions, making the choice of latent representation central to planning and control. Such representations are often either learned directly from pixels with limited semantic structure or inherited from frozen visual foundation models with excessive task-irrelevant detail, yielding state spaces that are poorly matched to downstream planning and control. This is especially challenging in reward-free offline settings, where the model must learn from fixed trajectories without reward supervision or online interaction. To address this, we propose TC-WM, a framework for turning foundation-model embeddings into compact, task-sufficient world representations. The key design is to treat the pretrained embedding space as a semantic scaffold rather than as the final state space: TC-WM linearly projects high-dimensional visual embeddings into a compact latent as the dynamic space, aligns a subspace with the agent’s physical state via contrastive learning, and reconstructs embeddings to preserve useful visual structure. This combines the generality of foundation features with the controllability of task-centric dynamics. Theoretically, we show that TC-WM suffices to identify the underlying task-centric latent factors up to a simple transformation. Empirically, TC-WM enables test-time planning across diverse environments (e.g., Robomimic and D4RL), achieving better world-modeling quality and more precise control than state-of-the-art approaches.

[AI-43] Towards the Connection between Activation Sparsity and Flat Minima

Link: https://arxiv.org/abs/2605.25612
Authors: Ze Peng,Jian Zhang,Lei Qi,Yang Gao,Yinghuan Shi
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The observation that activation sparsity emerges in MLP blocks of standardly trained Transformers offers an opportunity to drastically reduce computation costs without sacrificing performance. To theoretically explain this phenomenon, existing works have shown that activation sparsity does not result from the data properties or data fitting but from the implicit bias of the training process. However, these connections are obtained under strong assumptions that do not apply to deep models standardly trained with a large number of steps. Different from these works, we find that the flatness of loss landscapes is also closely related to the MLP activation sparsity and can serve as a weaker and naturally emerging assumption for standard deep networks. Specifically, 1) we find that the MLP activation sparsity equals a ratio between “augmented flatness” (a weighted sum of flatness measures) and the product of the input norm and activation gradient of the MLP. We empirically find that this ratio decreases during training, leading to sparse activations. 2) We also propose the notion of derivative sparsity, which reduces to activation sparsity under ReLU, but further enables pruning in the backward propagation and is more stable than activation sparsity. With the theoretical findings, we can further encourage activation sparsity by decreasing the numerator and increasing the denominator of the ratio using three methods. These plug-and-play modifications can effectively reduce the ratio and produce sparser activations. Experiments on ImageNet-1K and C4 demonstrate relative improvements of at least 36% on inference sparsity and at least 50% on training sparsity over vanilla Transformers, indicating further potential cost reduction in both inference and training.

[AI-44] Detecting Unfaithful Chain-of-Thought via Circuit-Guided Internal-External Discrepancy

Link: https://arxiv.org/abs/2605.25603
Authors: Xu Shen,Zhen Tan,Song Wang,Pingjun Hong,Rui Miao,Xin Wang,Tianlong Chen
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Chain-of-thought (CoT) reasoning improves the problem-solving ability of large language models (LLMs), but generated reasoning traces may not faithfully reflect the model’s actual decision process. Existing CoT unfaithfulness detectors mainly rely on external signals from generated rationales, such as textual plausibility or answer consistency, while overlooking evidence from the model’s internal computation. Although recent circuit tracing methods provide a way to obtain model-internal evidence by tracing how information flows through model components during reasoning, constructing full reasoning circuits for long CoTs is costly and difficult to scale. To address these challenges, we propose Circuit-guided Internal-External Discrepancy Scorer (CIE-Scorer), a framework for instance-level CoT unfaithfulness detection. The key idea is that faithful reasoning traces should align with the model’s computational process, whereas unfaithful traces may diverge from it. CIE-Scorer efficiently traces compact sentence-level circuits from informative reasoning tokens, constructs internal and external reasoning graphs, and measures their discrepancy using Fused Gromov–Wasserstein distance. Experiments on four datasets from FaithCoT-Bench show that CIE-Scorer achieves state-of-the-art performance while reducing the cost of circuit construction, demonstrating the effectiveness of combining mechanistic interpretability signals with external reasoning traces for CoT unfaithfulness detection.

[AI-45] Acting on the Unseen: Communication-Free Collaborative Filtering for Decentralized Multi-Robot Task Allocation

Link: https://arxiv.org/abs/2605.25584
Authors: Alexander Apartsin,Yigal Meshulam,Yehudit Aperstein
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 27 pages, 12 figures

Abstract:Multi-robot task allocation usually assumes some combination of communication, known task models, or a coordinator. We study the opposite extreme, a regime common in practice but overlooked in theory, which we name Zero-Knowledge MRTA (ZK-MRTA): a robot team with no prior knowledge (no task models, not even the latent rank), no communication (no messages, no parameter sharing, no coordinator), and only a partial and privately-noisy view of a public stream of teammates’ outcomes. A hidden low-rank structure governs which robot suits which task, and there are far more tasks than rounds, so most (robot, task) pairs are never attempted. Yet each robot can act well on tasks it never attempted, and onboard new tasks, by running online low-rank collaborative filtering over the broadcast (SwarmCF). The advantage over any structure-free learner is categorical, not a constant factor: a structure-free learner is provably at the prior-mean error floor on unseen pairs. We prove a matching per-robot sample complexity (Θ(d) versus Θ(n), in the rank d and the task count n), an anytime (cumulative-reward) separation under task scarcity, and a deterministic condition under which decentralized recovery from the masked broadcast is exact (validated empirically). Experiments quantify the value of the broadcast, a positive scaling law (per-robot unseen-pair skill rises with team size), and the strongest masking-robustness and anytime profile among low-rank methods, recovering most (about 80% on earned skill) of a centralized full-communication ceiling, and holding under capacity-1 contention and in a robotics-grounded sensing instance.
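
To make the idea of acting on never-attempted (robot, task) pairs concrete, here is a minimal sketch for the simplest case of a noise-free rank-1 outcome matrix. It is not SwarmCF or any algorithm from the paper; the function name, the dictionary layout, and the toy numbers are assumptions chosen for illustration.

```python
def predict_unseen_rank1(observed, robot, task):
    """Predict an unobserved (robot, task) outcome, assuming a rank-1 matrix.

    observed maps (robot, task) -> outcome. For M = u v^T we have
    M[r][t] = M[r][t2] * M[r2][t] / M[r2][t2] whenever the three entries on
    the right are observed and the pivot M[r2][t2] is non-zero.
    """
    for (r2, t2), pivot in observed.items():
        if r2 == robot or t2 == task or pivot == 0:
            continue
        same_row = observed.get((robot, t2))
        same_col = observed.get((r2, task))
        if same_row is not None and same_col is not None:
            return same_row * same_col / pivot
    return None  # not enough overlapping observations


# Hypothetical skills u = [1, 2, 3] and task difficulties v = [2, 4];
# robot 2 has never attempted task 1.
observed = {(0, 0): 2, (0, 1): 4, (1, 0): 4, (1, 1): 8, (2, 0): 6}
print(predict_unseen_rank1(observed, robot=2, task=1))
```

A learner that ignores the low-rank structure has no information about the pair (2, 1) at all, which is the intuition behind the prior-mean error floor mentioned in the abstract.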

[AI-46] Extreme Region Policy Distillation

Link: https://arxiv.org/abs/2605.25582
Authors: Changyu Chen,Xiting Wang,Rui Yan
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement learning for large language models faces a fundamental trade-off between sample efficiency and asymptotic performance: strictly on-policy methods discard trajectories after a single update, while off-policy reuse introduces distribution mismatch that existing trust-region techniques mitigate primarily by enforcing conservative optimization, often leaving rich training signals underutilized. To investigate this, we perform extensive off-policy updates on fixed data. Our experiments reveal that aggressive multi-step optimization brings rapid initial gains, but excessive updates cause trajectory probabilities to deviate and entropy to collapse, with performance plateauing early. Tightening KL constraints merely lowers the ceiling without resolving the degradation. This motivates Extreme Region Policy Distillation (ERPD), a two-stage framework that decouples sample efficiency from KL efficiency. The first stage performs weakly constrained off-policy optimization on fixed data to maximally extract training signals. The resulting policy provides token-level supervision. In the second stage, we distill these signals into the base policy under trust-region constraints, filtering harmful drift while preserving useful signals. The distilled policy achieves comparable or better performance with substantially smaller KL divergence, indicating that much of the first-stage divergence was spent on unnecessary drift rather than genuine improvement. Crucially, ERPD accommodates both strong and weak teachers: when aggressive optimization yields no stronger policy, even degenerate teachers provide effective supervision via alternative signal construction strategies. We validate ERPD on mathematical reasoning, showing gains for strong base models where on-policy training plateaus, and reliable improvements with weak teachers.

[AI-47] Geometric Flow Matching for Molecular Conformation Generation via Manifold Decomposition

Link: https://arxiv.org/abs/2605.25577
Authors: Yunqing Liu,Yi Zhou,Wenqi Fan
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The generation of accurate 3D molecular conformations is a pivotal challenge in computational chemistry and drug discovery. Recently, diffusion and flow matching models have achieved remarkable success. However, there is a critical misalignment between their mathematical formulation and the physical reality of molecules. Existing approaches predominantly treat molecules as unstructured point clouds in Cartesian space, overlooking the intrinsic hierarchical mechanics where bond lengths and bond angles are relatively stiff, whereas torsion angles constitute the dominant flexible degrees of freedom. This lack of manifold awareness forces models to relearn fundamental geometric constraints from scratch, often leading to physically implausible intermediate structures. To address this, we propose GO-Flow, which aligns generative modeling with molecular geometry via manifold decomposition. Instead of forcing motion through Euclidean space, GO-Flow decomposes the generation process into three physically motivated subspaces: translation space with linear optimal transport, rotation space with geodesic flows on SO(3), and conformation space with entropic optimal transport. This decomposition injects geometric inductive biases and makes the generative paths better aligned with molecular degrees of freedom. When combined with equivariant neural architectures, it encourages rotation-consistent generation and improves geometric validity. Extensive experiments on GEOM-Drugs and GEOM-QM9 demonstrate that GO-Flow achieves state-of-the-art generation quality. Notably, by learning straighter probability paths on the correct manifolds, our method enables high-fidelity sampling with as few as 50 steps, effectively bridging the gap between structural precision and computational efficiency.

[AI-48] Uncertainty Reasoning with Large Language Models for Explainable Disease Diagnosis

Link: https://arxiv.org/abs/2605.25566
Authors: Xiaoyang Fan,Yufan Cai,Zhe Hou,Jin Song Dong
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Clinical decision-making requires reasoning over incomplete, imprecise, and linguistically expressed patient narratives. While large language models (LLMs) excel at extracting latent information from natural language, they lack the verifiability and interpretability essential for trustworthy medical AI. We propose a neuro-symbolic reasoning framework that aligns LLMs with formal logic to enable explainable and formally verifiable medical diagnosis. Patient descriptions and clinical guidelines are embedded into a neural knowledge base, where LLMs extract structured medical entities, temporal relations, and fuzzy symptom patterns, which are decoded into a symbolic knowledge base expressed in fuzzy logic and declarative rules. We perform two-stage reasoning: (1) inductive symbolic generalization to capture diagnostic patterns from encoded narratives, and (2) inference verification via a logic programming engine to derive and validate diagnoses consistent with clinical standards. Each symptom is treated as a fuzzy predicate with probabilistic weights, and inference paths are auditable, adjustable, and compatible with physician feedback. Unlike purely statistical methods, our system supports iterative refinement: misalignment between LLM-generated diagnoses and ground truth can be traced, explained, and corrected through formal rules. By combining logic-based transparency, LLM adaptability, and probabilistic robustness, the framework enables human-aligned healthcare inference with strong generalization and verifiable, step-by-step reasoning chains. We validate our framework on public benchmarks, demonstrating effective reconciliation of symbolic reasoning and LLMs with real-world clinical narratives. Results show performance comparable to state-of-the-art LLMs, while additionally providing interpretable reasoning paths and formally verifiable diagnostic conclusions.

[AI-49] Beyond Query Memorization: Large Language Model Routing with Query Decomposition and Historical Matching

Link: https://arxiv.org/abs/2605.25558
Authors: Bo Lv,Jingbo Sun
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Optimizing the trade-off between predictive performance and computational cost is a central focus in the deployment of Large Language Models (LLMs). Current routing methods primarily rely on direct mapping from queries to models based on surface-level features, making them susceptible to the memorization trap and leading to poor generalizability on out-of-distribution (OOD) data. In this paper, we propose DecoR, a novel routing framework that recasts the routing task as a matching process of sifting similar queries from historical logs, effectively mitigating the memorization trap. To enhance matching accuracy, we introduce a query capability deconstruction method that decouples linguistic surface forms from task-intrinsic requirements, directing matching toward capability dimensions to ground decisions in essential task attributes. Furthermore, we develop CodaSet, a comprehensive benchmark for assessing routing generalization, where experimental results demonstrate that DecoR maintains superior accuracy while substantially lowering inference costs across both in-distribution and OOD settings. All code and data are available at this https URL.

[AI-50] Keep the Proof State Live: Snapshotting for Efficient Tactic Search in Lean 4

Link: https://arxiv.org/abs/2605.25556
Authors: Austin Shen,Yunong Shi
Affiliations: Unknown
Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: 10 pages, 1 figure

Abstract:Automated theorem proving systems built on Lean 4 increasingly rely on parallel tactic search over partially specified proofs, such as those generated by Draft-Sketch-Prove (DSP) pipelines. In current systems, each search branch reconstructs a proof state by re-running elaboration, leading to substantial per-branch overhead. In Lean 4 with Mathlib, this cost has two components: (1) import loading, which deserializes pre-compiled libraries (~60 s per branch); and (2) theorem-body elaboration, which re-checks the theorem context up to the target goal (estimated 18-735 s depending on proof complexity). Together, these account for 99% of per-branch wall time, making portfolio-based search impractical at scale. We observe that this overhead arises from a mismatch between the structure of proof search and its execution model: branching is implemented via repeated reconstruction of proof states rather than direct reuse. To address this, we introduce proof-state snapshotting, which captures the elaborated proof state once and reuses it across branches via a small extension to the Lean 4 language server. Across 48 miniF2F-v2 problems (45 prove-phase benchmarks and 3 full end-to-end runs), our approach achieves a 5.6-50x wall-time speedup over the standard fallback (average 14x, median 9.7x). Speedup increases with the number of proof branches. Our method is orthogonal to import-level caching (e.g., Kimina Lean Server), which avoids import loading but not theorem-body elaboration. The patched Lean binary and the Snapshot-DSP pipeline will be released as open source upon publication. 
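
The claim that speedup grows with the number of branches follows from a simple cost model, sketched below. This is a back-of-the-envelope illustration and not the paper's measurement: it assumes branches run sequentially, that restoring a snapshot costs nothing, and that the per-branch tactic time (`tactic_s`) is a hypothetical value. Only the 60 s import figure and the 18 s lower bound for elaboration come from the abstract.

```python
def snapshot_speedup(n_branches, import_s, elaboration_s, tactic_s):
    """Ratio of wall time without snapshotting to wall time with it.

    Baseline: every branch repeats import loading and elaboration.
    Snapshot: import loading and elaboration are paid once, then reused.
    """
    baseline = n_branches * (import_s + elaboration_s + tactic_s)
    snapshot = (import_s + elaboration_s) + n_branches * tactic_s
    return baseline / snapshot


# 60 s import and 18 s elaboration are taken from the abstract;
# the 1 s tactic time per branch is an assumption for illustration.
for branches in (1, 2, 10, 50):
    print(branches, round(snapshot_speedup(branches, 60.0, 18.0, 1.0), 2))
```

Under these assumptions the ratio is 1 for a single branch and approaches (import + elaboration + tactic) / tactic as the branch count grows, so the benefit depends on how cheap the tactic step is relative to state reconstruction.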

[AI-51] PHGNet: Prototype-Guided Hypergraph Construction for Heterogeneous Spatiotemporal Forecasting

Link: https://arxiv.org/abs/2605.25554
Authors: Ruiwen Gu,Yahao Liu,Zhenyu Liu,Qitai Tan,Xiao-Ping Zhang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:As a core task in intelligent transportation systems, traffic forecasting plays a critical role in urban traffic management. Accurate traffic forecasting relies on modeling complex spatiotemporal dependencies, which is inherently challenging due to spatial heterogeneity in traffic. Despite significant progress, most existing methods are still limited to pairwise spatial dependency modeling, making it difficult to capture dynamic high-order interactions among nodes with similar traffic patterns. To address this issue, we propose PHGNet, a novel spatiotemporal forecasting framework based on prototype-guided hypergraph construction. At the core of PHGNet, a prototype learning mechanism is designed to adaptively assign pattern-similar nodes to hyperedges, thereby capturing high-order interactions with time-varying structures. To improve the reliability of dynamic hypergraph construction, we further develop a global-local node representation module to extract time-consistent features. For forecasting, iterative residual refinement and Temporal Query Attention are introduced to improve forecasting accuracy while supporting efficient parallel decoding. Extensive experiments on multiple real-world datasets demonstrate that PHGNet achieves superior predictive performance compared with state-of-the-art methods.

[AI-52] Simultaneous Spatial-Temporal Message Passing for Dynamic Graph Representation Learning

Link: https://arxiv.org/abs/2605.25548
Authors: Shubhajit Roy,Anirban Dasgupta
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Dynamic graph neural networks (DGNNs) that operate on snapshot sequences typically fall into one of two categories. Temporal-first approaches build per-node temporal embeddings and only afterwards perform spatial aggregation, whereas spatial-first approaches invert this order, feeding the output of a graph convolution into a downstream temporal module. In either case, the rigid sequencing forces the second stage to consume an already-compressed summary produced by the first, ruling out joint reasoning over topology and evolution; concretely, the message-passing operator never gets to weight a neighbor’s contribution by that neighbor’s past trajectory. This paper introduces SiST-GNN (Simultaneous Spatial-Temporal GNN), which fuses the two signals inside a single message-passing operation rather than chaining them. Concretely, at each snapshot we maintain a recurrent hidden state per node that summarises its history, pair it with the node’s current feature vector, and treat the pair as two nodes joined by a cross-time edge; running a standard graph convolution on this temporally augmented graph yields the updated representation. Our empirical study spans nine public baselines and fourteen model-dataset combinations, covering both fixed-split and live-update evaluation regimes. Across every public benchmark, SiST-GNN sets a new state of the art on the link prediction task, improving over the strongest prior method by 109–277% in the fixed-split setting and by 68–194% in the live-update setting. We additionally construct three dynamic node-classification tasks by discretising the underlying continuous-time event streams; here SiST-GNN beats the leading discrete-time (DTDG) baseline by 7–22% and matches continuous-time (CTDG) methods that consume the raw events directly.
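
The temporally augmented graph described in this abstract can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' implementation: it uses a parameter-free mean aggregation with self-loops as a stand-in for a "standard graph convolution", it assumes that only current-feature nodes carry spatial edges, and all function names are hypothetical.

```python
def augment_graph(num_nodes, spatial_edges):
    """Neighbor sets for the 2N-node temporally augmented graph.

    Nodes 0..N-1 hold current features; nodes N..2N-1 hold hidden states.
    """
    neighbors = {i: {i} for i in range(2 * num_nodes)}  # self-loops
    for u, v in spatial_edges:  # undirected spatial edges
        neighbors[u].add(v)
        neighbors[v].add(u)
    for i in range(num_nodes):  # cross-time edge: feature node <-> history node
        neighbors[i].add(num_nodes + i)
        neighbors[num_nodes + i].add(i)
    return neighbors


def mean_graph_conv(features, neighbors):
    """One mean-aggregation layer over the given neighbor sets."""
    dim = len(features[0])
    out = []
    for i in range(len(features)):
        nbrs = sorted(neighbors[i])
        out.append([sum(features[j][k] for j in nbrs) / len(nbrs)
                    for k in range(dim)])
    return out


def augmented_message_passing(current_features, hidden_states, spatial_edges):
    """Run one convolution over current features and hidden states jointly."""
    n = len(current_features)
    neighbors = augment_graph(n, spatial_edges)
    return mean_graph_conv(current_features + hidden_states, neighbors)


# Two nodes joined by one spatial edge, with 1-dimensional toy features.
out = augmented_message_passing([[1.0], [3.0]], [[5.0], [7.0]], [(0, 1)])
print(out)
```

In this toy setup node 0 aggregates its own feature, its spatial neighbor's feature, and its own history in the same step, which is the "simultaneous" mixing the abstract contrasts with temporal-first and spatial-first pipelines.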

[AI-53] ADMFormer: An Adaptive-Decomposition Transformer with Time-Varying Masked Spatial Attention for Traffic Forecasting

Link: https://arxiv.org/abs/2605.25543
Authors: Ruiwen Gu,Qitai Tan,Yahao Liu,Xiao-Ping Zhang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Accurate traffic forecasting is essential for intelligent transportation systems, supporting a wide range of real-world applications. However, it remains challenging due to two key factors: (1) Traffic series contain heterogeneous temporal patterns, where stable periodic regularities coexist with event-driven fluctuations. Existing methods often treat them within a unified representation, limiting their ability to capture fine-grained temporal dynamics. (2) Spatial dependencies among nodes are inherently dynamic and sparse, while dense all-pairs attention often introduces redundant interactions and amplifies noise. To address these issues, we propose ADMFormer, an Adaptive-Decomposition Transformer with Time-Varying Masked Spatial Attention. Specifically, ADMFormer first employs a time-node adaptive gating mechanism to decouple traffic signals into dominant regularities and residual fluctuations that vary across time and nodes. A dual-branch temporal module is then designed to separately capture global periodic dependencies and high-frequency irregular variations from these two decomposed components. Furthermore, ADMFormer introduces a time-varying masked spatial attention that sparsifies spatial interactions based on real-time traffic states, thereby effectively preserving dynamic and informative dependencies. Extensive experiments on four real-world datasets demonstrate that ADMFormer achieves state-of-the-art performance.

[AI-54] A Tertiary Review of Large Language Model-Based Code Generating Tasks: Trends, Challenges, and Future Directions

Link: https://arxiv.org/abs/2605.25536
Authors: Muslim Chochlov,Michael English,Jim Buckley
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Context. Large language models (LLMs) are increasingly applied to code-generating tasks (CGTs) in software engineering. While reported results are promising, the broader effects of such applications and their integration into real-world development remain insufficiently understood, and existing tertiary studies provide little in this area. Objective. This tertiary study consolidates secondary evidence on LLM-based CGTs, synthesizing the publication landscape, effects, scenarios, integration challenges, and future research directions. Method. Following systematic review guidelines, we searched related digital libraries, complemented by backward and forward snowballing and a screening step. Study quality was assessed and extraction reliability was audited with inter-rater agreement statistics. Evidence was synthesized using SWEBOK knowledge areas and the HELM framework. Results. We identify 30 secondary studies published between 2017 and 2025, with rapid growth since 2023. Accuracy seems strong on benchmarks but weakly supported for real-world generalization; robustness is fragile across tasks and configurations; efficiency constraints are pervasive; toxicity and bias are under-reported. Dominant challenges concern economic feasibility, evaluation validity, and socio-technical integration. Future directions suggest domain-aware model improvement and the need for holistic, standardized evaluation. Conclusion. LLM-based CGTs represent a fast-maturing yet unevenly evaluated research area, highlighting the need for domain-aware model improvements and holistic, standardized evaluation that addresses efficiency and associated costs.

[AI-55] Personalize-then-Store: Benchmarking and Learning Personalized Memory for Long-horizon Agents

Link: https://arxiv.org/abs/2605.25535
Authors: Yeonjun In,Wonjoong Kim,Sangwu Park,Kanghoon Yoon,Chanyoung Park
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: preprint

Abstract:Existing large language model (LLM)-based memory systems apply universal, static policies that overlook a fundamental reality: the contexts that are worth storing in memory are different across users. This misalignment wastes limited memory budget on transient interactions while failing to preserve critical context for long-horizon tasks. To address this gap, we investigate an underexplored question: can LLM-based memory systems learn personalized memory policies? We introduce PerMemBench, the first benchmark for evaluating personalized memory systems, featuring multi-year, multi-domain interaction histories across diverse user personas. We further present the first empirical study of memory personalization, proposing session-level storage gating, a lightweight framework that selectively bypasses memory operations for transient sessions. Our study confirms that personalization yields substantial retention gains under perfect gating, yet reveals that accurate gating remains an open and critical challenge.

[AI-56] StructBreak: Structural Cognitive Overload-Induced Safety Failures in MLLMs ACL2026

Link: https://arxiv.org/abs/2605.25534
Authors: Yang Luo,Xinran Liu,Tiantian Ji,Zhiyi Yin,Lingyun Peng,Shuyu Li
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 23 pages; accepted to Findings of ACL 2026. This paper contains examples of harmful content

Abstract:Multimodal Large Language Models (MLLMs) excel at structural reasoning yet suffer from a sharp logical brittleness in structural consistency. We term this phenomenon Structural Cognitive Overload (SCO), a byproduct of the contention between deep reasoning and safety alignment. However, prior work has predominantly targeted typographic and pixel-level perturbations, leaving the study of SCO largely unexplored. To this end, we propose StructBreak, an automated end-to-end framework designed to quantify SCO. By leveraging StructBreak, we uncover a novel higher-order cognitive overload attack paradigm; notably, this attack operates under a practical black-box setting, requiring no internal model access. Consequently, we utilize this framework to establish a comprehensive benchmark spanning ten diverse threat scenarios. Empirical evaluations on six leading MLLMs reveal that SCO readily triggers toxic generation, yielding a 92% average ASR (up to 97% on Gemini 2.5). To elucidate the mechanism of SCO, we further conduct model-level interpretations spanning attention dynamics, latent space topology, and geometric analysis. Our findings reveal that StructBreak acts as a novel structural channel to circumvent safety filters. Furthermore, the limited efficacy of inherent safety mechanisms underscores that current alignment paradigms are insufficient for the era of complex multimodal reasoning.

[AI-57] What Gets Cited: Competitive GEO in AI Answer Engines

Link: https://arxiv.org/abs/2605.25517
Authors: Rahul Vishwakarma,Shushant Kumar,Ratnesh Jamidar
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:AI answer engines generate answers from retrieved pages but cite only a few sources. This makes visibility depend not just on ranking, but on being cited. We study competitive Generative Engine Optimization (GEO): when two retrieved candidates compete, what makes one more likely to be cited first? We build a controlled two-document retrieval-augmented generation (RAG) testbed that injects exactly two candidate sources into the model context and measures which source is referenced by the first citation marker in the output. Across six LLMs we execute 252,000 trials, repeated paired comparisons under one factorial program over 18 content factors. In each trial the two sources differ in exactly one factor; we use brand anonymization and counterbalanced source order to separate content effects from position bias. Mixed-effects models show that topical relevance and list position are the biggest drivers of being cited first. Including explicit price information and a recent timestamp also helps consistently. Completeness and trust cues add smaller gains, while formatting-only edits have little impact. We release a reproducible evaluation protocol and a prioritized GEO checklist for practitioners, and we exercised it in an early internal pilot at Sprinklr, where teams reported positive qualitative feedback on workflow usability.
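
The counterbalanced design mentioned in this abstract can be illustrated with a small tally. The sketch below is a simplified stand-in for the paper's mixed-effects analysis, not a reproduction of it: the trial encoding, the function name, and the toy data are assumptions. It averages the treated source's cited-first rate over both source orders, so that a pure position preference cancels out.

```python
def first_citation_rates(trials):
    """Summarize paired trials of a two-document citation test.

    trials: list of (treated_position, first_cited_position), each 0 or 1.
    Returns (content_effect, position_bias), where content_effect is the
    treated source's cited-first rate averaged over both source orders and
    position_bias is the overall rate at which position 0 is cited first.
    Both source orders must be present in the trials.
    """
    wins_by_order = {0: [], 1: []}
    for treated_pos, cited_pos in trials:
        wins_by_order[treated_pos].append(1 if cited_pos == treated_pos else 0)
    per_order = [sum(wins) / len(wins) for wins in wins_by_order.values()]
    content_effect = sum(per_order) / len(per_order)
    position_bias = sum(1 for _, cited in trials if cited == 0) / len(trials)
    return content_effect, position_bias


# Toy data: four trials with the treated source first, four with it second.
trials = [(0, 0), (0, 0), (0, 0), (0, 1),
          (1, 0), (1, 1), (1, 1), (1, 1)]
print(first_citation_rates(trials))
```

A content effect near 0.5 would indicate that the manipulated factor does not change which source is cited first once source order is balanced.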

[AI-58] Credit Assignment with Resets in Language Model Reasoning

Link: https://arxiv.org/abs/2605.25507
Authors: Ankur Samanta,Akshayaa Magesh,Ayush Jain,Youliang Yu,Daniel Jiang,Kavosh Asadi,Daniel Jiang,Kaveh Hassani,Paul Sajda,Jalaj Bhandari,Yonathan Efroni
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Contemporary reinforcement learning with verifiable reward methods post-train language models on multi-step reasoning by assigning a single outcome reward uniformly across all tokens in a trajectory. Such uniform assignment ignores which steps contributed to success or failure. Improving credit assignment can address this limitation by enabling targeted refinement of faulty reasoning steps, rather than updating entire trajectories uniformly. Resets are one such simple mechanism, enabling more precise credit assignment by returning to an intermediate state and resampling counterfactual continuations, so that outcome differences can be attributed to decisions made at that point. We propose two such methods: Random-Reset Policy Optimization (RRPO), where reset states are drawn randomly from reasoning steps, and Self-Reset Policy Optimization (SRPO), where the model self-localizes the erroneous step in an incorrect trajectory and resets there. We analyze these methods within the Conservative Policy Iteration (CPI) framework. Extending CPI with a credit-assignment oracle that targets improvable states yields provable improvements over random resets. Across models and reasoning benchmarks, SRPO consistently outperforms standard GRPO and RRPO by sampling multiple suffix continuations at a self-localized reset and learning from their rewards, using only the model itself with no external supervision.
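
The reset mechanism in this abstract can be pictured as follows: the prefix up to the reset point is held fixed, several suffixes are resampled from there, and each suffix is credited relative to the average outcome at that state. The sketch below is a generic mean-baseline illustration, not the RRPO or SRPO objective; the function name and the reward values are assumptions.

```python
def reset_advantages(suffix_rewards):
    """Credit for each resampled continuation from one reset state.

    suffix_rewards: outcome rewards of continuations that share the same
    prefix. The mean reward is a Monte Carlo estimate of the reset state's
    value, and each continuation's advantage is its reward minus that value.
    Tokens in the shared prefix receive no credit from this comparison.
    """
    if not suffix_rewards:
        raise ValueError("need at least one continuation")
    value_estimate = sum(suffix_rewards) / len(suffix_rewards)
    advantages = [r - value_estimate for r in suffix_rewards]
    return value_estimate, advantages


# Hypothetical verifier rewards for four continuations from one reset point.
value, advantages = reset_advantages([1.0, 0.0, 0.0, 1.0])
print(value, advantages)
```

Because all continuations share the prefix, differences in reward can only be attributed to decisions made after the reset point, which is the sense in which resets sharpen credit assignment.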

[AI-59] Generative AI impacts on intra-urban inequality and skill premium in Beijing

Link: https://arxiv.org/abs/2605.25505
Authors: Xiliu He,Haoxiang Zhao,Mingyi Ma,Edward Wen Chuan Lai,Koei Enomoto,Anni Hu,Jiatong Li,Lingyun Chu,Yuan Lai
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); General Economics (econ.GN); Physics and Society (physics.soc-ph)
Comments: 21 pages, 8 figures

Abstract:Generative artificial intelligence (GenAI) is the first automation wave to reach high-cognitive tasks at scale, yet its effects on intra-urban inequality remain largely unknown. Using 5 million job postings from Beijing (2018–2024), we construct a neighborhood-level GenAI Exposure Index by aggregating task-level assessments from five leading large language models. We examine the spatial, structural and causal mechanisms of this shock. We find that GenAI exposure is highly concentrated in the city’s core districts, deepening the intra-urban AI divide. Since 2023, high-exposure neighborhoods have experienced wage stagnation even as they continue to attract high-skilled workers – a “high-skill trap.” This wage penalty is driven by task de-skilling and intensified labor-market crowding. A difference-in-differences design centered on ChatGPT’s release supports a causal interpretation. These findings challenge the prevailing theory of skill-biased technological change and provide a basis for inclusive AI governance in global technology hubs.

[AI-60] EXPO-FT: Sample-Efficient Reinforcement Learning Finetuning for Vision-Language-Action Models

Link: https://arxiv.org/abs/2605.25477
Authors: Perry Dong,Kuo-Han Hung,Tian Gao,Dorsa Sadigh,Chelsea Finn
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:The ability to efficiently and reliably learn new tasks has been a foundational challenge in robotics. Vision-Language-Action (VLA) models have demonstrated strong generalization across diverse manipulation tasks, yet pretrained policies consistently fall short of the reliability required for real-world deployment. Reinforcement learning (RL) fine-tuning offers a promising path to bridge this gap, but existing approaches either train from scratch without fully leveraging pretrained priors, or fine-tune VLAs without achieving the sample efficiency and success rates that practical deployment demands. We present EXPO-FT, a system for stable, sample-efficient RL finetuning of pretrained VLA policies that closes this gap. Our system solves a suite of challenging manipulation tasks, including routing string lights and inserting the plug to light it up, striking a pool ball into a pocket, and inserting a flower into a wine bottle, each requiring combinations of high precision, dynamic actions, and robustness to varied initial states. Our system achieves perfect task performance (30/30 successes) across all evaluated tasks within an average of 19.1 minutes of online robot data, outperforming both prior RL-from-scratch and VLA finetuning approaches. We release an open-source codebase with the aim of facilitating broader adoption of RL finetuning of VLA models in robotics.

[AI-61] From Simulation to Enaction: Post-trained language models recognize and react to their own generations

Link: https://arxiv.org/abs/2605.25459
Authors: Asvin G.,Jack Lindsey
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Anthropic fellows project mentored by Jack Lindsey

Abstract:Language models are pretrained as passive predictors with no incentive to model the consequences of their own outputs. Post-training changes this: a model producing its own responses can benefit from recognizing that it is on-policy. We present evidence that post-trained models recognize their on-policy generations, and this recognition is implicitly encoded in their output distributions. In particular, on-policy output distribution entropy is 3–4× lower than off-policy entropy, across model families and size classes. We trace part of this effect to an internal representation of input surprise, tracking the unlikeliness of the most recent input token according to the model’s prior predictions, that causally modulates output entropy. One example of these phenomena can be observed in response to open-ended prompts; post-trained models (unlike pretrained models) collapse their uncertainty over the topic of their upcoming response before the first output token; violating this cached intention with a different-topic prefill results in higher output entropy. We also tested whether models can distinguish on-policy contexts from prefills via explicit verbal report. We find that they can, but that interestingly, this explicit recognition routes through a different mechanism than implicit recognition.
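
The entropy comparison in this abstract relies on the Shannon entropy of a next-token distribution. The sketch below computes it for two toy distributions; the probability values are invented for illustration and do not come from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical next-token distributions over a four-token vocabulary.
peaked = [0.85, 0.05, 0.05, 0.05]   # confident prediction
flat = [0.25, 0.25, 0.25, 0.25]     # maximally uncertain prediction

print(entropy(peaked), entropy(flat), entropy(flat) / entropy(peaked))
```

A ratio of average entropies between two sets of contexts, computed this way per token position, is one plausible reading of the "3–4× lower" comparison, though the paper's exact averaging procedure is not stated in the abstract.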

[AI-62] A Signal-Language Foundation Model for Broad-Spectrum Cardiovascular Assessment from Routine Electrocardiography

Link: https://arxiv.org/abs/2605.25446
Authors: Ziqing Yu,Yuhui Tao,Jiayu Huo,Lei Pan,Zilong Xiao,Juecheng Chen,Xiao Li,Jianxuan Li,You Zhou,Zhixing Li,Cong Wang,Beijian Zhang,Chen Chen,Hongyang Lu,Konstantinos Patlatzoglou,Daniel B. Kramer,Jonathan W. Waks,Yangang Su,Fu Siong Ng,Shuo Wang,Yixiu Liang,Junbo Ge
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Electrocardiography (ECG) is central to cardiovascular care, but conventional AI models are often restricted to common arrhythmias and may generalize poorly across populations or clinically subtle diseases. We developed ECG Contrastive Language-Image Pre-training (ECGCLIP), a signal-language contrastive learning framework that aligns ECG waveforms with expert diagnostic reports. ECGCLIP was pre-trained on 2,837,962 ECG studies from 1,324,856 patients and evaluated on a held-out internal test set plus nine independent external cohorts comprising about 1.5 million ECGs. Evaluation covered 89 downstream tasks, including 45 ECG diagnoses, 39 echocardiographic targets, and 5 rare cardiac diseases, using PRAUC as the primary metric. ECGCLIP consistently improved performance over random initialization and Merl-R18 baselines. On the internal test set, ECGCLIP-R34 achieved strong performance for atrial fibrillation (PRAUC 0.900) and ST-segment elevation myocardial infarction (PRAUC 0.383), with robust generalization across all external cohorts. It also improved low-prevalence and diagnostically elusive diseases, including Ebstein anomaly, constrictive pericarditis, dextrocardia, and cardiac amyloidosis, with internal PRAUC values of 0.253, 0.175, 0.121, and 0.201, respectively. ECGCLIP was data efficient, matching or exceeding full-dataset baseline performance with only 10% of training data. Feature visualization and saliency analysis suggested clinically meaningful representations aligned with established electrocardiographic criteria. These findings indicate that large-scale ECG-report contrastive pre-training can expand routine ECG interpretation beyond common arrhythmias toward broad cardiovascular assessment and opportunistic screening of echocardiographic and rare conditions.
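
Since PRAUC is the primary metric here, a short sketch of one common estimator, average precision, may help when reading the reported values. Whether the paper uses this estimator or a trapezoidal area under the precision-recall curve is not stated in the abstract, so treat the choice as an assumption; the labels and scores below are toy values.

```python
def average_precision(labels, scores):
    """Average precision: mean of precision at the rank of each positive.

    labels: 0/1 ground-truth labels; scores: model scores (higher means
    more likely positive). Ties in score are broken by input order.
    """
    positives = sum(labels)
    if positives == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / positives


# Toy example: positives are ranked first and third out of four.
print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```

Unlike ROC AUC, this metric depends strongly on class prevalence, so values for rare conditions should be read against how uncommon those conditions are in the evaluation cohort.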

[AI-63] Security of OpenClaw Agents: Fundamentals, Attacks and Countermeasures

Link: https://arxiv.org/abs/2605.25435
Authors: Yuntao Wang,Jianle Ba,Han Liu,Yanghe Pan,Jintao Wei,Zhou Su,Tom H. Luan,Linkang Du
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 17 pages, 13 figures

Click to view abstract

Abstract:The rapid evolution of large language model (LLM)-driven autonomous agents has given rise to OpenClaw, a new class of open-source agent frameworks that operate as continuously running, skill-augmented systems with persistent memory, multi-channel interaction, and high degrees of autonomy. Such capabilities enable OpenClaw agents to autonomously execute complex, multi-step tasks and interact seamlessly with external applications, but simultaneously introduce a substantially enlarged attack surface. In particular, the combination of high-privilege operations and persistent memory exposes OpenClaw agents to various emerging threats, including skill poisoning, cognitive manipulation, multi-agent cascading failures, and supply-chain vulnerabilities. In this survey, we present a comprehensive study of the security landscape of OpenClaw agents. We first examine the general architecture and key characteristics that distinguish OpenClaw agents from traditional AI agent systems. We categorize existing security and privacy threats into a layered framework and analyze how vulnerabilities arise during agent reasoning, action execution, and external interaction. Representative defense mechanisms are also reviewed to draw the current defense landscape. Finally, several unresolved issues related to the reliability and trustworthiness of OpenClaw ecosystems are discussed.

[AI-64] CODESKILL: Learning Self-Evolving Skills for Coding Agents

Link: https://arxiv.org/abs/2605.25430
Authors: Yanzhou Li,Yiran Zhang,Xiaoyu Zhang,Xiaoxia Liu,Yang Liu
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Coding agents produce rich trajectories while solving software-engineering tasks. To enable agent self-evolution, these trajectories can be distilled into reusable procedural skills that compactly encode experience to guide future behavior. However, existing skill construction and maintenance methods often rely on fixed prompts and heuristic update rules, leaving it unclear how knowledge should be selected, abstracted, and maintained to best serve downstream agents. We propose CODESKILL, an LLM-based framework that reformulates skill extraction and skill-bank maintenance as a learnable management policy. CODESKILL extracts multi-granularity procedural skills from coding-agent trajectories, evolves skills with new experience, and maintains a compact skill bank for future task solving. We train CODESKILL with reinforcement learning, using a hybrid reward that combines dense rubric-based skill-quality feedback with sparse verifiable execution feedback from the frozen downstream agent. Experiments on EnvBench, SWE-Bench Verified, and Terminal-Bench 2 show that CODESKILL improves average pass rate by 9.69 over the no-skill baseline and by 4.01 over the strongest prompt-based or memory baseline, while maintaining the skill bank at a stable size during iterative construction.

[AI-65] SeqRoute: Global Budget-Aware Sequential LLM Routing via Offline Reinforcement Learning

Link: https://arxiv.org/abs/2605.25424
Authors: Zhongling Xu,Shunan Zheng,Wei Wang
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Existing LLM routing frameworks treat queries as independent events, neglecting the sequential nature of real-world user sessions constrained by global computational budgets. This mismatch inevitably leads to budget bankruptcy: myopic routing policies exhaust resources on early interactions, forcing subsequent and often more complex queries onto inadequate models. We introduce SeqRoute, a framework that formulates multi-turn routing as a finite-horizon Markov Decision Process and solves it via offline reinforcement learning. By incorporating the remaining budget into the state space and training with Conservative Q-Learning (CQL), SeqRoute learns delayed gratification to strategically preserve resources for high-stakes turns later in the session. To overcome data starvation, we propose Hindsight Budget Relabeling (HBR). This technique retrospectively simulates historical trajectories under diverse hypothetical budgets, expanding 10,000 raw sessions into 2.38 million transitions enriched with critical bankruptcy signals. At deployment, a dynamic $\lambda$-sweep mechanism enables zero-shot navigation of the cost-quality Pareto frontier without retraining. Extensive evaluations demonstrate that SeqRoute reduces operational costs by 6.0-73.5% while maintaining or improving quality, and suppresses bankruptcy rates to under 1%, strictly dominating behavior cloning, budget-aware heuristics, and static baselines across the entire Pareto frontier.
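
The abstract does not spell out how Hindsight Budget Relabeling works internally. The following is a minimal Python sketch of one plausible reading: a logged session is replayed under several hypothetical budgets, and each step becomes a transition whose state carries the remaining budget. The `Step` fields, the `relabel_session` helper, and the bankruptcy penalty value are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    cost: float     # cost of the model chosen at this turn (hypothetical field)
    quality: float  # observed answer quality at this turn (hypothetical field)


# (turn, budget_before, reward, budget_after, bankrupt)
Transition = Tuple[int, float, float, float, bool]


def relabel_session(steps: List[Step], budgets: List[float],
                    bankruptcy_penalty: float = -1.0) -> List[Transition]:
    """Replay one logged session under each hypothetical budget."""
    out: List[Transition] = []
    for budget in budgets:
        remaining = budget
        for t, step in enumerate(steps):
            if step.cost > remaining:
                # The logged action is unaffordable under this budget:
                # emit a terminal bankruptcy transition and stop replaying.
                out.append((t, remaining, bankruptcy_penalty, remaining, True))
                break
            out.append((t, remaining, step.quality, remaining - step.cost, False))
            remaining -= step.cost
    return out


session = [Step(cost=3.0, quality=0.9),
           Step(cost=1.0, quality=0.6),
           Step(cost=3.0, quality=0.8)]
transitions = relabel_session(session, budgets=[5.0, 10.0])
```

Under these assumptions, a single logged session yields one set of transitions per hypothetical budget, and tight budgets contribute explicit bankruptcy examples that the raw logs may lack.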

[AI-66] Autoregression-Free Neural Operators for Time-Dependent PDEs

Link: https://arxiv.org/abs/2605.25413
Authors: Jiaquan Zhang,Caiyan Qin,Haoyu Bian,Libin Cai,Yi Lu,Chaoning Zhang,Wei Dong,Yuanfang Guo,Yang Yang,Hen Tao Shen
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
Comments: Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

Click to view abstract

Abstract:Neural operators learn mappings from function-dependent inputs to solutions, providing an effective framework for solving partial differential equations (PDEs). For time-dependent PDEs, existing methods typically perform long-horizon prediction through autoregressive rollout directly in high-dimensional physical field spaces, where each predicted state is recursively fed back as the input for the next step. Although effective for short-term prediction, this autoregressive rollout and the lack of continuous-time modeling lead to progressive error accumulation over long-horizon rollouts. In this work, we propose Autoregression-Free Neural Operators (AFNO), which map the time evolution of PDEs into a latent space and model continuous-time vector fields within it. AFNO uses flow matching to learn the latent vector field, thereby enabling continuous evolution over extended horizons, avoiding autoregressive rollout and capturing dynamics under varying parameter configurations through explicit conditioning on physical parameters. Theoretical analysis and extensive experiments on six PDEs demonstrate that AFNO improves long-horizon prediction stability and consistently reduces rollout errors compared with the baselines.

[AI-67] Towards end-to-end LLM-based censoring-aware survival analysis

Link: https://arxiv.org/abs/2605.25399
Authors: Yishu Wei,Hexin Dong,Yi Lin,Jiahe Qian,Yi Liu,Yifan Peng
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Objective: Survival analysis is central to medical prediction, yet large language models (LLMs) are rarely used as end-to-end survival models because censoring prevents straightforward supervised fine-tuning. Here we present LLMSurvival, a framework that enables censoring-aware survival analysis with unmodified LLMs operating directly on tabular clinical data. Materials and Methods: LLMSurvival reformulates time-to-event prediction as pairwise ranking among comparable subjects, and derives test-time risk by aggregating comparisons against anchor individuals from the training cohort. Results: Across two clinical tasks (ICU mortality prediction in MIMIC-IV and fragility fracture prediction in a NewYork-Presbyterian/Weill Cornell Medicine cohort), LLMSurvival improves overall concordance over Cox proportional hazards modeling by 3.1% for ICU mortality and 0.5% for fracture risk, and over three established deep learning survival models by an average of 2.1% for ICU mortality and 2.8% for fracture risk. Discussion: The results show that survival modeling with censoring can be made compatible with LLM fine-tuning through comparison-based reformulation. The framework demonstrates high portability and superior performance over expert-curated scores such as SAPS-II and FRAX across diverse clinical contexts. Furthermore, the framework supports local deployment, as compact, publicly available base models provide sufficient performance. Conclusion: The LLMSurvival framework serves as a proof of concept for an integrated, censoring-conscious approach to survival analysis via LLMs.
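
The pairwise reformulation can be illustrated with a short Python sketch. It uses the standard notion of comparable pairs from concordance analysis (the earlier time must be an observed event) and a simple win-fraction aggregation against anchors. The `judge` callable stands in for the fine-tuned LLM and, like the helper names, is a hypothetical placeholder; the paper's exact pairing and aggregation rules may differ.

```python
from typing import Callable, List, Sequence, Tuple


def comparable_pairs(times: Sequence[float],
                     events: Sequence[bool]) -> List[Tuple[int, int]]:
    """Return (i, j) pairs where subject i is known to fail before subject j.

    A pair is usable only if the earlier time is an observed event; if the
    earlier subject is censored, the true ordering is unknown.
    """
    pairs = []
    for i in range(len(times)):
        if not events[i]:
            continue
        for j in range(len(times)):
            if times[i] < times[j]:
                pairs.append((i, j))
    return pairs


def risk_score(subject, anchors: Sequence,
               judge: Callable[[object, object], bool]) -> float:
    """Fraction of anchors that the judge ranks as lower risk than `subject`."""
    wins = sum(1 for a in anchors if judge(subject, a))
    return wins / len(anchors)


times = [2.0, 5.0, 3.0, 8.0]
events = [True, False, True, True]  # subject 1 is censored at t=5
pairs = comparable_pairs(times, events)

# Stand-in judge: compares a single numeric feature instead of calling an LLM.
score = risk_score(0.7, anchors=[0.2, 0.5, 0.9], judge=lambda x, a: x > a)
```

Note that the censored subject never appears as the earlier member of a pair, yet it can still serve as the later member when another subject's event precedes its censoring time.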

[AI-68] Context-CoT: Enhancing Context Learning via High-Quality Reasoning Synthesis

Link: https://arxiv.org/abs/2605.25354
Authors: Hongbo Jin,Mingnan Zhu,Jingqi Tian,Xu Jiang,Zhongjing Du,Haoran Tang,Siyi Xie,Qiaoman Zhang,Jiayu Ding
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:While LLMs excel at reasoning over prompts using static pretrained knowledge, they struggle significantly with context learning: the ability to dynamically extract, internalize, and apply new knowledge from complex, task-specific contexts. Recent evaluations on CL-Bench reveal a critical capability gap: frontier models solve only 17.2% of context-dependent tasks on average.

[AI-69] Certified Robustness from Approximate Gaussian Mixture Structures in Pretrained Latent Spaces

Link: https://arxiv.org/abs/2605.25352
Authors: Konstantinos Emmanouilidis,Tianjiao Ding,Nghia Nguyen,Nicolas Loizou,René Vidal
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Deep learning models are vulnerable to adversarial perturbations, raising important concerns for safety-critical deployment. Empirical defenses can achieve strong robustness in practice, but lack formal guarantees, motivating the need for certifiably robust classifiers. While certified methods provide formal guarantees, they often yield overly conservative bounds due to their inability to exploit structure in complex data distributions. In this work, we propose a framework for designing certifiably robust classifiers that leverages latent structure in data representations. We first analyze the Gaussian mixture setting, deriving necessary and sufficient conditions for the existence of robust classifiers and constructing a classifier with a closed-form robustness certificate and generalization guarantees. Our main contribution is to show that exact structure is not required: we prove that if a pretrained encoder maps inputs to a latent distribution that is $\varepsilon$-close (in KL divergence) to a Gaussian mixture, then certified accuracy degrades gracefully, with an explicit bound relating robustness under the true and approximate distributions. This result enables the direct use of pretrained models without requiring exact distributional assumptions. Empirically, our method achieves state-of-the-art or competitive certified accuracy on CIFAR-10 and ImageNet, while maintaining strong clean performance and low computational overhead. Overall, our work establishes approximate latent structure as a practical and principled route to certifiable robustness.

[AI-70] Parallel Differentiable Reachability for Learning and Planning with Certified Neural Dynamics and Controllers

Link: https://arxiv.org/abs/2605.25346
Authors: Keyi Shen,Glen Chou
Institution: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments: Robotics: Science and Systems XXII (RSS 2026)

Click to view abstract

Abstract:Neural network (NN) dynamics models and control policies achieve strong performance in robotics, but providing sound guarantees under uncertainty remains difficult, especially for closed-loop NN systems. Existing reachability tools provide formal over-approximations, yet are often non-differentiable, overly conservative, or too slow for modern learning and online planning pipelines. To address this, we present a parallelizable, differentiable reachability framework in JAX for continuous- and discrete-time systems with analytical and NN-based dynamics and controllers. Our framework combines Taylor-model flowpipe construction with CROWN-style linear bound propagation through a unified representation that preserves affine dependencies while supporting GPU-batched computation and automatic differentiation. Building on this reachability primitive, we develop (i) a certified training method that encourages reachability-friendly dynamics models and controllers, and (ii) a reachability-aware sampling-based MPC scheme with gradient-based refinement. Experiments on non-prehensile manipulation and quadrotor tasks, including hardware and higher-dimensional evaluations (up to 72D), demonstrate practical online planning while maintaining certified reachable-set over-approximations under bounded uncertainty.

[AI-71] CausalFlow: Causal Attribution and Counterfactual Repair for LLM Agent Failures

Link: https://arxiv.org/abs/2605.25338
Authors: Akash Bonagiri,Devang Borkar,Gerard Janno Anderias,Setareh Rafatirad,Houman Homayoun
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language model (LLM) agents frequently fail on multi-step tasks involving reasoning, tool use, and environment interaction. While such failures are typically logged or retried heuristically, they contain structured signals about where execution broke down. We introduce CausalFlow, an interventional framework that converts failed agent traces into minimal counterfactual repairs and reusable supervision. CausalFlow models execution traces as sequential chains of dependent steps and computes Causal Responsibility Scores (CRS) via step-level counterfactual intervention to identify failure-inducing steps. For these steps, we generate minimally edited repairs that flip the final outcome to success, producing validated contrastive pairs of the form (wrong step, corrected step). CausalFlow supports two complementary uses: targeted test-time repair that recovers from failures with minimal behavioral drift, and training-time supervision suitable for offline preference optimization or reward modeling. Across four benchmarks spanning mathematical reasoning, code generation, question answering, and medical browsing, CausalFlow converts failed executions into validated minimal repairs with high minimality and causal-consensus scores, and demonstrates that causal attribution is necessary for reliable improvement across diverse agent tasks, outperforming heuristic refinement in complex retrieval settings while producing more localized repairs throughout. These results demonstrate that interventional analysis over structured execution traces provides a principled and scalable mechanism for transforming agent failures into reliability gains and learning-ready supervision.

[AI-72] UWM-JEPA: Predictive World Models That Imagine in Belief Space

Link: https://arxiv.org/abs/2605.25313
Authors: Santosh Kumar Radha,Oktay Goktas
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Machine Learning (stat.ML)
Comments: 14 pages, 6 figures, 7 tables. Code and data: this https URL

Click to view abstract

Abstract:World models for partially observed environments must imagine multiple compatible hidden futures and steer between them under counterfactual actions. Joint Embedding Predictive Architectures (JEPAs) do this in latent space, but a vector-valued latent has no internal structure for carrying the belief over hidden continuations through blind rollout. We introduce the Unitary World Model JEPA (UWM-JEPA), a JEPA world model with a density-matrix latent on a joint system-environment space and a learned unitary predictor. The construction preserves the joint-state spectrum exactly during rollout, so the predictor itself cannot dissipate the represented uncertainty. On a hidden-velocity indicator task requiring five-step forward simulation under a given action sequence with the target observation masked, UWM-JEPA reaches 0.77 accuracy and degrades monotonically as actions are perturbed; a parameter-matched LSTM-JEPA trained under the same counterfactual-target objective and action head collapses to majority-class accuracy (0.53) under every action condition. Under blind rollout, UWM-JEPA loses fewer than ten points of probe R^2 at short horizons while vector-latent baselines lose forty-one and sixty-eight; both nevertheless tie on a held-out context probe, locating the separation in the predictor rather than the encoder. Action sensitivity itself requires training against counterfactual rather than teacher-forced targets, a finding that applies beyond the unitary parameterisation. For JEPA world models to imagine under partial observability, latent geometry and predictor dynamics matter, not frozen context-encoding capacity alone.

[AI-73] AI Cartography: Mapping the Latent Landscape of AI Benchmark Ecosystems

Link: https://arxiv.org/abs/2605.25272
Authors: Michael Hardy,Anka Reuel,Lijin Zhang,Jodi M. Casabianca,Sang Truong,Yash Dave,Hansol Lee,Benjamin Domingue,Sanmi Koyejo
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Applications (stat.AP)
Comments:

Click to view abstract

Abstract:While aggregate leaderboard scores drive AI development, they contain substantial measurement noise whose sources and magnitudes remain unquantified, making it unclear when rankings reflect genuine capability differences versus evaluation artifacts. We introduce a framework for measuring the latent landscape in AI benchmark ecosystems. Applying Confirmatory Factor Analysis (CFA) and Generalizability Theory to 4,000+ models from the Open LLM Leaderboard, we decompose sources of ranking variance and establish: (1) structures assumed in current reporting practice underestimate the strength of relationships between benchmarks; (2) evidence of local dependence among leaderboard items, undermining uses of benchmarks as measurement instruments under current scoring systems; (3) contributor metadata explains more rank-relevant variance ($\approx 9\%$) than architecture or deployment categories in this context; (4) a manifest-score “scaling law” slope has low reliability ($R_\beta = 0.53$); by contrast, the latent general-factor size slope is highly stable across ecosystem controls ($R_g = 0.97$). We are able to provide unique insights into benchmark dynamics, such as which benchmarks are a function of LLM size and which can be oppositely impacted by post-training practices. We provide actionable diagnostics to determine how benchmark rankings can be trusted and how benchmark design can be improved.

[AI-74] Latent Q-Barrier Shielding for Safe In-Context Reinforcement Learning

Link: https://arxiv.org/abs/2605.25267
Authors: Minjae Kwon,Amir Moeini,Shangtong Zhang,Lu Feng
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Safe in-context reinforcement learning (ICRL) adapts online from interaction history without test-time parameter updates while controlling episode cost under a safety budget. Under out-of-distribution (OOD) deployment shifts, pretraining-only safe ICRL can give poor reward-safety tradeoffs because the remaining budget affects behavior only through frozen policy conditioning, not an explicit action-level check against predicted future cost. We propose a latent Q-Barrier shield that learns a context representation, latent dynamics, and an ensemble cost critic before deployment. Without parameter updates, the shield infers context from history and filters or softly reweights candidate actions using the remaining budget and predicted future cost. We prove a conditional, error-decomposed barrier-margin result: a Q-Barrier-satisfying action leaves the next latent-budget state with an approximately budget-safe continuation under the learned critic, up to Bellman and latent-prediction errors. Across five safe ICRL benchmarks, the shield improves deployment-time reward-safety tradeoffs over a strong safe-ICRL baseline: after a short context window, it achieves higher return in four of five benchmarks while matching or lowering average episode cost in all five.
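
To make the action-level check concrete, here is a minimal Python sketch of a hard filter over candidate actions. It assumes a pessimistic rule (the maximum prediction across ensemble members must fit within the remaining budget) and a least-cost fallback when nothing passes. The `shield` function, its arguments, and both rules are illustrative assumptions; the abstract does not specify the exact barrier condition, and the paper also describes a soft reweighting variant not shown here.

```python
from typing import Dict, List


def shield(candidates: List[str],
           reward_scores: Dict[str, float],
           cost_ensemble: Dict[str, List[float]],
           remaining_budget: float) -> str:
    """Pick the best-scoring action whose pessimistic cost fits the budget."""
    # Pessimistic estimate: the largest prediction across ensemble members.
    pessimistic = {a: max(cost_ensemble[a]) for a in candidates}
    safe = [a for a in candidates if pessimistic[a] <= remaining_budget]
    if not safe:
        # No action passes the check: fall back to the least costly one.
        return min(candidates, key=lambda a: pessimistic[a])
    return max(safe, key=lambda a: reward_scores[a])


rewards = {"fast": 1.0, "slow": 0.6, "stop": 0.0}
costs = {"fast": [2.0, 3.5], "slow": [0.5, 1.0], "stop": [0.0, 0.1]}

action = shield(["fast", "slow", "stop"], rewards, costs, remaining_budget=2.0)
fallback = shield(["fast", "slow", "stop"], rewards, costs, remaining_budget=0.05)
```

In this toy setup the highest-reward action is rejected because one ensemble member predicts a cost above the budget, which is the kind of explicit check that frozen policy conditioning alone does not provide.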

[AI-75] Whose Alignment? Comparing LLM Process Alignment Across Diverse Organizational Decision Contexts ICML2026

Link: https://arxiv.org/abs/2605.25256
Authors: Niklas Weller,Emilio Barkett
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to ICML 2026 Pluralistic Alignment Workshop

Click to view abstract

Abstract:Aligning AI systems with organizational decision-making is typically framed as a single-target problem: make the model behave like the organization. We argue this framing obscures a deeper pluralistic challenge. We rely on a decision-policy capturing method to measure process alignment: whether an LLM weights information as the organization does, not merely whether it reaches the same conclusions. Applying this method to ECHR Article 6 decisions, process alignment strongly predicts output accuracy (r = 0.85, p < .001) and externalization substantially improves alignment for poorly-aligned models. Applying it to German consumer credit decisions, this relationship collapses (r = 0.15, p = .60): interventions produce inconsistent effects and the benchmark encodes potentially discriminatory historical patterns. This contrast is itself a pluralistic alignment finding: in contested domains, high process alignment is neither achievable via externalization nor unconditionally desirable. Output agreement alone cannot distinguish a model that has internalized an organizational policy from one that merely approximates its outcomes; process-level measurement is a necessary component of any pluralistic alignment evaluation.

[AI-76] Quantifying Empirical Compute-Supervision Tradeoffs in RLVR ICML2026

Link: https://arxiv.org/abs/2605.25252
Authors: Ryo Mitsuhashi,Patrick Chen,Isabelle Tseng,Jasin Cekinmez,Addison J. Wu
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Workshop on Combining Theory and Benchmarks @ ICML 2026

Click to view abstract

Abstract:Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training language models, but in practice, verifiers are rarely perfect. Recent theoretical work predicts that verifier noise affects the rate of learning but not its final outcome, implying that sufficient compute should close any gap induced by imperfect supervision. We test this prediction empirically by post-training Qwen2.5 (0.5B, 1.5B) with GRPO on GSM8K while injecting controlled false-positive and false-negative noise into the binary correctness signal, and varying rollouts per prompt as a compute axis. In practice, the gap in validation accuracy persists under substantial compute scaling, with returns to compute that are sharply diminishing. We further find a structural asymmetry where false negatives monotonically degrade performance faster than false positives. These findings suggest verifier quality and training compute are not interchangeable, and that reducing false negatives is a more effective lever than scaling compute alone.
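
The noise-injection setup described above can be sketched in a few lines of Python. The function below flips a binary correctness signal with separate false-positive and false-negative rates; the name `noisy_reward` and its parameters are hypothetical, and the paper's actual injection code may differ.

```python
import random


def noisy_reward(is_correct: bool, fp_rate: float, fn_rate: float,
                 rng: random.Random) -> int:
    """Binary reward with injected verifier noise.

    fp_rate: probability an incorrect answer is rewarded (false positive).
    fn_rate: probability a correct answer is not rewarded (false negative).
    """
    if is_correct:
        return 0 if rng.random() < fn_rate else 1
    return 1 if rng.random() < fp_rate else 0


rng = random.Random(0)
# Simulate 10,000 correct rollouts scored by a verifier with 30% false negatives.
rewards_correct = [noisy_reward(True, fp_rate=0.0, fn_rate=0.3, rng=rng)
                   for _ in range(10_000)]
observed_fn = 1 - sum(rewards_correct) / len(rewards_correct)
```

Keeping the two rates as separate parameters makes it possible to vary them independently, which is what an asymmetry comparison between false positives and false negatives requires.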

[AI-77] LipoAgent: Coordinating Fine-Tuned LLM Agents for Safer Lipid Design

Link: https://arxiv.org/abs/2605.25250
Authors: Leshu Li,An Lu,Haiyu Wang,Zhibin Feng,Conghui Duan,Qing Bao,Zongmin Zhao,Sai Qian Zhang
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Lipid nanoparticles (LNPs) are among the most clinically mature platforms for nucleic acid delivery, yet designing lipids that are both effective and biologically safe remains a major bottleneck. In practical screening, toxicity is a decision-level constraint: if a lipid is toxic, its efficiency prediction is clinically irrelevant. We propose LipoAgent, a safety-aware multi-agent LLM framework for lipid discovery. LipoAgent combines domain-specific finetuning with a conditional prediction objective that enforces toxicity as a prerequisite for efficiency prediction, and further improves reliability via multi-agent verification with lightweight human oversight when disagreement persists. Across multiple foundation models, LipoAgent achieves an average 32% relative improvement in mRNA transfection efficiency prediction compared with other reported models for lipid design. Wet-lab validation confirms that virtual screening rankings reliably translate to biological transfection outcomes. The code is publicly available at this https URL.

[AI-78] FrontierOR: Benchmarking LLMs' Capacity for Efficient Algorithm Design in Large-Scale Optimization

Link: https://arxiv.org/abs/2605.25246
Authors: Minwei Kong,Chonghe Jiang,Ao Qu,Wenbin Ouyang,Zhaoming Zeng,Xiaotong Guo,Zhekai Li,Junyi Li,Yi Fan,Xinshou Zheng,Xi Jing,Yikai Zhang,Zhiwei Liang,Seonghoo Kim,Runqing Yang,Zijian Zhou,Sirui Li,Han Zheng,Wangyang Ying,Ou Zheng,Chonghuan Wang,Jinglong Zhao,Hanzhang Qin,Cathy Wu,Paul Pu Liang,Jinhua Zhao,Hai Wang
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) are increasingly used for optimization modeling and solver-code generation, yet practical operations research and optimization problems often require a harder capability: designing scalable algorithms that exploit problem structure and outperform direct formulation-and-solve baselines. Existing benchmarks are limited to small or simplified examples far below real-world scale and complexity. We introduce FrontierOR, among the first benchmarks to systematically evaluate LLM-based efficient algorithm design for realistic large-scale optimization problems. FrontierOR includes 180 tasks derived from methodologically diverse papers published in top-tier operations research venues, each with standardized instances and a hidden, expert-verified evaluation suite. We evaluate seven LLMs spanning frontier, cost-effective, and open-source models both in one-shot and test-time evolution settings. The results reveal that frontier models still struggle to move from executable formulations to efficient optimization algorithms: the strongest one-shot model outperforms Gurobi in only 31% of cases in both solution quality and computational efficiency, and even strong coding agents with test-time evolution achieve only 50% on selected hard tasks. FrontierOR establishes a practical evaluation platform for LLM-based optimization algorithm design, which enables future LLMs and agents to be systematically tested on whether they can move beyond correct formulation toward a feasible, high-quality, and efficient algorithm. Our FrontierOR Benchmark is available at this https URL.

[AI-79] Constraint-Anchored Attribution: Feasibility-Certified Counterfactuals and Bonferroni-PAC Sufficient Subsets for Neural CO Policies

Link: https://arxiv.org/abs/2605.25235
Authors: Sohaib Lafifi
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: 4 pages, 1 figure, Reference implementation: this https URL (MIT)

Click to view abstract

Abstract:We give an attribution method for neural combinatorial-optimisation (CO) policies that (i) decomposes a decision by constraint families via LP-relaxation duals, (ii) certifies counterfactuals through a combinatorial feasibility model (implemented as a CSP feasibility-decision model), and (iii) bounds the size of a PAC-sufficient explanation with a Bonferroni-corrected Hoeffding sufficient-subset test along a greedy ordering. Across three CO problems and three seeds, our LP-anchored $\Lambda$-attribution matches the CF-derived signal at 96.5% on CVRPTW ($n_{\text{cert}} = 344$) and 77.2% on the Orienteering Problem ($n_{\text{cert}} = 281$) vs 75.0% and 35.2% for proxy gradient (paired diffs +0.215 and +0.420; McNemar exact $p \le 10^{-14}$). In the rank-aligned regime of the Flexible Job-Shop Scheduling Problem, both backends agree on every CSP-certified flip ($n_{\text{cert}} = 59$), confirming the no-gain prediction. Bonferroni-PAC subsets average 5.0 nodes per step ($M = 70$, $\varepsilon = \delta = 0.2$, $k_{\max} = 25$). Reference implementation: this https URL
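
For intuition about the reported parameters, the Python sketch below computes the one-sided Hoeffding deviation radius under a Bonferroni correction. It assumes that $M$ is the number of samples per test and that $k_{\max}$ tests share the total failure probability $\delta$. That mapping, and the helper name `hoeffding_radius`, are assumptions made for illustration; the abstract does not state how the test is assembled.

```python
import math


def hoeffding_radius(num_samples: int, delta: float, num_tests: int) -> float:
    """One-sided Hoeffding radius at per-test level delta / num_tests.

    For i.i.d. samples bounded in [0, 1], the empirical mean exceeds the true
    mean by more than the returned radius with probability at most
    delta / num_tests.
    """
    per_test_delta = delta / num_tests  # Bonferroni correction
    return math.sqrt(math.log(1.0 / per_test_delta) / (2.0 * num_samples))


# Values quoted in the abstract: M = 70, delta = 0.2, k_max = 25.
radius = hoeffding_radius(num_samples=70, delta=0.2, num_tests=25)
```

Under these assumptions the radius comes out at roughly 0.186, slightly below the quoted $\varepsilon = 0.2$.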

[AI-80] On the Epistemic Uncertainty of Overparametrized Neural Networks ICML2026

Link: https://arxiv.org/abs/2605.25234
Authors: David Rügamer
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation (stat.CO); Machine Learning (stat.ML)
Comments: Accepted at ICML 2026 (Main Track)

Click to view abstract

Abstract:Epistemic uncertainty is often viewed as a reducible uncertainty that vanishes with increasing data. This perspective implicitly assumes parameter identifiability and equates epistemic uncertainty with predictive variability. In overparametrized neural networks, however, model parameters are typically non-identifiable due to symmetries and redundant representations. As a consequence, substantial parameter uncertainty can persist even when the underlying function is fully identified. In this work, we analyze epistemic uncertainty through the lens of non-identifiability and characterize both discrete and continuous sources of residual uncertainty. Focusing on one-hidden-layer ReLU networks, we thoroughly analyze the resulting posterior structure and validate our theoretical insights through empirical studies.

[AI-81] Meta-Agent: From Task Descriptions to Verified Multi-Agent Systems

Link: https://arxiv.org/abs/2605.25233
Authors: Andy Xu,Yu-Wing Tai
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:AI agents are increasingly used to solve complex, multi-step tasks, but existing multi-agent frameworks remain brittle as workflows grow in scale and depth. Small errors at intermediate stages can propagate through agent interactions, while insufficient grounding and weak verification mechanisms further limit reliability. We present Meta-Agent, a two-phase framework that automatically constructs and executes specialized multi-agent systems from natural-language task descriptions. In the construction phase, a task planner decomposes a problem into a directed acyclic graph of agent specifications with explicit input/output contracts and verification criteria. A web search module grounds each specification with external evidence, and a code generation module produces system prompts and tool configurations. A construction-time verification stage then validates generated artifacts and triggers targeted regeneration when failures are detected. In the execution phase, a coordinator dispatches subtasks across the agent graph while execution-time verification gates intermediate outputs. We further introduce a three-level error attribution mechanism that distinguishes local, upstream, and structural failures, enabling targeted recovery strategies ranging from localized retries to partial re-execution and re-decomposition. We evaluate Meta-Agent across coding, contextual learning, and open-ended reasoning tasks. Experiments against strong multi-agent baselines and ablation studies demonstrate consistent improvements in task success rate, error recovery, and workflow stability. The results highlight the importance of tightly integrating planning, grounding, and verification for building reliable multi-agent systems.
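
As a rough illustration of the execution phase described above, the sketch below dispatches subtasks in dependency order over a small DAG and gates each output with a verification check, retrying locally on failure. All names (`AgentSpec`, `run_graph`, the toy agents) are hypothetical and not taken from the paper; it is a sketch under those assumptions, not the Meta-Agent implementation.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Callable, Dict, List


@dataclass
class AgentSpec:
    """Hypothetical agent specification: a task function plus an output check."""
    name: str
    run: Callable[[Dict[str, str]], str]   # maps upstream outputs to this agent's output
    verify: Callable[[str], bool]          # gate applied to the output
    deps: List[str] = field(default_factory=list)


def run_graph(specs: List[AgentSpec], max_retries: int = 1) -> Dict[str, str]:
    """Execute agents in topological order, retrying locally when a gate fails."""
    by_name = {s.name: s for s in specs}
    order = TopologicalSorter({s.name: set(s.deps) for s in specs}).static_order()
    outputs: Dict[str, str] = {}
    for name in order:
        spec = by_name[name]
        inputs = {d: outputs[d] for d in spec.deps}
        for _attempt in range(max_retries + 1):
            result = spec.run(inputs)
            if spec.verify(result):
                outputs[name] = result
                break
        else:
            # No attempt passed the gate; a fuller system would escalate upstream.
            raise RuntimeError(f"agent {name!r} failed verification")
    return outputs


specs = [
    AgentSpec("plan", lambda _: "steps: a, b", lambda out: out.startswith("steps")),
    AgentSpec("code", lambda up: f"impl({up['plan']})", lambda out: "impl" in out, deps=["plan"]),
    AgentSpec("review", lambda up: f"ok:{up['code']}", lambda out: out.startswith("ok"), deps=["code"]),
]
results = run_graph(specs)
```

Only the local-retry level is modelled here; the upstream and structural recovery levels mentioned in the abstract would need the failure to propagate back to earlier nodes or to the planner.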

[AI-82] Specification-Based Code-Text-Code Reengineering for LLM-Mediated Software Evolution

Link: https://arxiv.org/abs/2605.25232
Authors: Oleg Grynets,Vasyl Lyashkevych,Arsen Dolichnyi,Roman Piznak,Taras Zelenyy,Volodymyr Morozov
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: 15 pages, 9 figures, 7 tables, 39 references

Abstract:Direct Code2Code transformation remains challenging to control because it can preserve surface-level syntax while introducing semantic drift, hidden behavioral changes, loss of traceability, non-idiomatic target implementations, or incomplete reconstruction of domain logic. This paper proposes a specification-based Code2Text2Code reengineering framework for LLM-mediated software evolution. The central idea is to transform source code into a neutral textual specification that captures program behavior, identifiers, computational flow, conditions, side effects, data dependencies, and domain-specific intent without directly transferring the source language syntax. The proposed framework combines factual context extraction, Code2Text generation, iterative verification between source code and text specification, Text2Code generation, target code verification, retrieval-augmented grounding, semantic-aware chunking, and transformation loss estimation. The knowledge representation layer integrates metadata derived from AST, graph-based dependency structures, neutral natural language specifications, technical documentation, business documentation, and architecture-level representations. The experiments include a Code2Text2Code dataset built from multiple programming languages and SQL dialects, comparison of intermediate representations, retrieval evaluation, documentation transformation evaluation, and prompt tuning using DSPy. A graph formalization using structural preservation, reverse compatibility, interface stability, and total graph similarity is implemented to estimate transformation losses. The results support the interpretation of the Code2Text2Code approach not as a simple code transformation, but as a controlled specification-based reengineering process for LLM-mediated software evolution.

[AI-83] Boosting Inference with Guided Reasoning: Stochastic Exploration for Recursive Models ICML2026

Link: https://arxiv.org/abs/2605.25230
Authors: Andrew Corbett,Archit Sood,Anna Tzatzopoulou,Sai-Aakash Ramesh,Tim Dodwell
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Presented at the ICML 2026 Workshop on Structured Probabilistic Inference & Generative Modeling (SPIGM), Seoul, South Korea, 2026

Abstract:Recent work on recursive architectures has shown that tiny neural networks can be surprisingly powerful on structured reasoning tasks. The trick is to model reasoning trajectories with a latent dynamical system. We argue that the inference-time behaviour of these architectures is best understood as approximate inference over latent reasoning trajectories, with deterministic recursion as the one-particle, zero-noise limit. We make this view operational through guided stochastic exploration: stochastic perturbations of the reasoning dynamics propose neighbouring trajectories, and the model’s existing early-stopping head reweights them online. The framework yields three label-free diagnostics: local stability, guide alignment, and cloud-token entropy. These predict, from inference traces alone, whether the procedure will help and which of its outputs to trust. On Sudoku-Extreme it lifts exact-solve accuracy from 85.9% to 98.0% without retraining; on Maze-Hard the diagnostics flag a misaligned guide, as validation performance later confirms. The same machinery thus characterises both when recursive reasoning has room to improve at the trajectory level and when the model’s internal guide can recover it.

[AI-84] Continuous-Depth Field Theory for Transformer Patching and Mechanistic Interpretability

Link: https://arxiv.org/abs/2605.25225
Authors: David N. Olivieri,Antonio F. Pérez Rodríguez
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Mechanistic interpretability often uses activation patching, causal tracing, path patching, and steering directions to reveal behaviorally meaningful directions in Transformer activation space. This paper develops a field-theoretic framework for organizing and predicting such interventions. Treating the residual stream as a depth-token field, we formulate patching as localized source insertion, patch effects as sensitivity-field predictions, downstream propagation as empirical Green-function response, and patch selection as an adjoint variational problem. Empirically, we test the forward response theory in GPT-2-style autoregressive Transformers by applying localized residual-field interventions and observing the induced residual-field differences and logit-difference responses. We identify a bounded local linear regime; predict patch effects from first-order sensitivities across residual sites; measure structured anisotropic propagation across depth and token position; construct response descriptions from high-sensitivity sites and sliced Green operators; and show that prompt-induced residual displacements can transfer answer behavior. These results establish response objects, namely sensitivities, propagated fields, and Green-operator slices, as a practical language for organizing patching experiments and as the forward mathematical basis for formulating patch-site inference and cross-scale this http URL.

[AI-85] Multi-Objective Learning for Diffusion Models: A Statistical Theory under Semi-Supervised Learning

Link: https://arxiv.org/abs/2605.25210
Authors: Ziheng Cheng,Yixiao Huang,Hanlin Zhu,Haoran Geng,Somayeh Sojoudi,Jitendra Malik,Pieter Abbeel,Xin Guo
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:Diffusion models are increasingly used as powerful conditional generators, yet real deployments often involve multiple target distributions arising from different tasks, e.g., diverse prompt domains in text-to-image generation, or multiple environments in robotics with diffusion policies. This naturally leads to a multi-objective learning (MOL) problem. A key challenge is that achieving good Pareto trade-offs can require a generalist model class with substantially larger capacity than what suffices for solving any individual task, thereby increasing statistical cost since sample complexity typically scales with the model complexity. To reconcile this, we develop a principled MOL framework for diffusion models with limited data: a semi-supervised regime where paired (labeled) samples are scarce, but (unlabeled) condition data are abundant. We propose a two-stage training procedure that first fits lightweight specialist models from limited paired data, and then distills them into a generalist model by generating pseudo-samples. We establish generalization bounds showing that the required number of paired samples only depends on the complexity of the specialist model classes. We further extend the theory to diffusion policies for sequential decision making to account for distribution shift in on-policy rollouts. Extensive experiments on robotic control and image restoration tasks are conducted to verify our theoretical results.

[AI-86] Influence-Inspired Spectral Rotations for Extreme Low-Bit LLM Quantization

Link: https://arxiv.org/abs/2605.25203
Authors: Gorgi Pavlov
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: 14 pages, no figures. Companion application paper to arXiv:2605.01637 (theory). Code and pinned eval stack: this https URL

Abstract:We apply the influence-adaptive Walsh geometry of a companion theory paper (arXiv:2605.01637) to extreme low-bit weight-only LLM quantization. The recipe is one math-invariant transformation: WHT-rotate each linear layer’s weight matrix and rescale its columns by per-coordinate Walsh-basis activation energy before handing off to a reconstruction-error quantizer (Intel auto-round). This biases per-group integer rounding toward high-spectral-energy channels. On four pretrained decoder-only models from 135M to 1.5B parameters, BBT-spectral reduces wikitext-2 perplexity by 15-58% relative to vanilla auto-round at W2A16; we also report a TinyLlama-1.1B auxiliary data point. Three extensions transfer the recipe to families it failed on: a per-head PCA matrix-Gamma replacement of q_norm/k_norm for Qwen3 attention (PPL 136.76 → 88.99 on Qwen3-0.6B); an SO(2) per-pair rotation that commutes with RoPE (PPL 36.93 → 21.84 on Qwen2.5-1.5B); and an MoE-aware input-side absorption fix identified by architectural fuzzing of Laguna-style fused-expert layouts. A W2-vs-W4 ablation gives a deliberate negative control: the redistribution payoff falls within the +/-0.5 PPL noise floor at W4, consistent with the Schur-convexity intuition that the cost of unconcentrated influence vanishes as the noise budget shrinks. All quantized weights export to OpenVINO IR and run on Intel NPU + Arc dGPU + CPU with PPL invariant to device within +/-0.1. We do not claim a formal Boolean-to-real-valued transfer of the theory paper’s majorization argument: the WHT activation energy used here is not the Boolean influence of the theory paper, the link is intuitive, and the contribution is engineering value rather than a transferred theorem. Head-to-head benchmarks against SpinQuant, QuaRot, QuIP-sharp, AQLM, OmniQuant, and ButterflyQuant at matched calibration are the main future-work item.
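
The abstract calls the rotate-and-rescale step "math-invariant". The sketch below shows why that holds for a single linear layer before any quantization: an orthonormal Walsh-Hadamard rotation of the weight columns, followed by a positive per-column rescale, is undone exactly by applying the inverse to the layer input. The scales `s` are random placeholders standing in for the per-coordinate activation energy, and absorbing the inverse on the input side is an assumption made for illustration.

```python
import numpy as np


def hadamard(n: int) -> np.ndarray:
    """Orthonormal Walsh-Hadamard matrix of size n (n must be a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])   # Sylvester construction
    return h / np.sqrt(n)


rng = np.random.default_rng(0)
d_in, d_out = 8, 4
W = rng.normal(size=(d_out, d_in))     # toy linear-layer weight
x = rng.normal(size=d_in)              # toy activation

H = hadamard(d_in)
s = rng.uniform(0.5, 2.0, size=d_in)   # placeholder positive per-coordinate scales

# Rotate the weight columns, then scale column j by s[j].
W_rot = (W @ H) * s
# Rotate the input into the same basis, then undo the column scaling.
x_rot = (H.T @ x) / s

y_ref = W @ x
y_rot = W_rot @ x_rot
max_err = float(np.max(np.abs(y_ref - y_rot)))
```

The outputs agree up to floating-point error; any accuracy difference in practice therefore comes from how rounding interacts with the transformed weights, not from the transformation itself.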

[AI-87] Hide to Guide: Learning via Semantic Masking

Link: https://arxiv.org/abs/2605.25198
Authors: Ruitao Liu,Qinghao Hu,Alex Hu,Yecheng Wu,Shang Yang,Luke J. Huang,Zhuoyang Zhang,Han Cai,Song Han
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement learning with verifiable rewards (RLVR) has become a powerful paradigm for improving language models on reasoning-intensive tasks, but its effectiveness is often limited by exploration. For example, models often fail on hard problems, leaving little useful reward signal. External expert traces offer a natural source of guidance, yet they may also expose reward-relevant content along the critical path to the verifier target, such as final answers, intermediate values, executable implementations, or answer-related entities. This content can create an unintended reward hacking channel, allowing the policy to obtain reward by copying the trace rather than learning the underlying reasoning or agentic behavior. Existing guided-RL methods reduce this risk by using partial trajectories, but they mainly control how much expert information is shown heuristically rather than which parts should be hidden. To this end, we propose Semantic Masked Expert Policy Optimization (SMEPO), a fine-grained semantic masking strategy for expert-guided RLVR. Instead of truncating traces coarsely or revealing them unchanged, SMEPO masks reward-relevant semantic spans along the critical path while preserving the expert’s decomposition, plan, and procedural structure. This turns hard problems from reasoning from scratch into a fill-in-the-blank process: the policy can follow the expert’s problem-solving route, but must still reconstruct the missing values, code, or entities by itself. SMEPO is simple to apply and requires no changes to the reward function or RL objective. Across diverse domains, including math, code, and agentic search, SMEPO improves accuracy by up to 3.2 points over GRPO and reduces training time by up to 4.2x. The code is available at this https URL.

[AI-88] Beyond Killer Robots: General AI Attitudes and Public Support for Military AI in Nine Countries

Link: https://arxiv.org/abs/2605.25196
Authors: Andreas Jungherr,Antonia Schlude,Adrian Rauchfleisch
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:AI-enabled military systems are a fixture of modern military conflict. Applications vary from autonomous drones for surveillance and attack to AI-supported target selection. The importance of AI for modern conflict also shows in public disputes between governments and technology companies over the conditions for military access to frontier AI. Both military uses and government attempts at enabling and steering them take place against a backdrop of public opinion, yet we still know little about how people think about military AI. Drawing on a preregistered survey of 9,000 respondents in nine countries, including China, Germany, and the United States, we examine whether support for military AI is shaped primarily by general attitudes toward AI, principled opposition to lethal autonomy, or foreign-policy and geopolitical orientations. Across six military AI scenarios that vary in lethality and human control, respondents who view AI as beneficial are substantially more supportive of military AI. Hawkish respondents are also more supportive. By contrast, principled opposition to lethal autonomy is not broadly associated with the full index but is related to the application of fully autonomous lethal force. Contrary to our expectation, perceived AI risks are positively associated with support. Cross-national differences are moderate and broadly consistent with geopolitical context. Overall, public opinion toward military AI appears conditionally permissive. Publics are not categorically opposed to various military uses of AI. Instead, unease is concentrated around fully autonomous lethal force.

[AI-89] DarkForest: Less Talk Higher Accuracy for Multi-Agent LLMs

Link: https://arxiv.org/abs/2605.25188
Authors: Yi Li,Songtao Wei,Dongming Jiang,Zhichun Guo,Qiannan Li,Bingzhe Li
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Multi-agent LLM systems improve reasoning by combining outputs from multiple agents, but interaction-heavy methods can introduce error propagation and high communication overhead. When agents exchange raw responses or reasoning traces, incorrect intermediate reasoning may be adopted and amplified, leading to confident but wrong consensus; multi-round communication also increases token consumption, latency, and inference cost. In this paper, we propose a controlled-communication coordination framework named DarkForest. DarkForest first keeps agents independent, so each agent produces an answer without seeing the others’ outputs. It then parses the raw responses into structured candidate records, groups semantically equivalent candidates into clusters, and estimates a calibrated belief distribution over these clusters using agent reliability, confidence, parse quality, support-pattern reliability, and independence corrections. A coordinator receives only policy-permitted evidence from this belief state with controlled communication. Experiments on six reasoning benchmarks show that DarkForest achieves leading overall quality, improves the strongest baseline by up to 30.7% on benchmark metrics, and reduces token consumption by up to 6.5× compared with communication-heavy baselines.
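
The cluster-then-estimate step can be pictured with a much-simplified sketch: independent answers are grouped by a normalized key and each cluster's belief is the normalized sum of reliability times confidence. The function `belief_over_clusters`, the string normalization used as a stand-in for semantic clustering, and the weighting rule are all assumptions for illustration; the paper's estimator also uses parse quality, support patterns, and independence corrections, which are omitted here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def belief_over_clusters(
    candidates: List[Tuple[str, str, float]],
    reliability: Dict[str, float],
) -> Dict[str, float]:
    """Return a normalized belief per answer cluster.

    candidates: (agent_id, raw_answer, self-reported confidence in [0, 1]).
    reliability: hypothetical per-agent prior weight in [0, 1].
    """
    scores: Dict[str, float] = defaultdict(float)
    for agent, answer, confidence in candidates:
        key = answer.strip().lower()   # crude stand-in for semantic clustering
        scores[key] += reliability.get(agent, 0.5) * confidence
    total = sum(scores.values())
    if total == 0:
        return {}
    return {key: score / total for key, score in scores.items()}


candidates = [
    ("a1", "42", 0.9),
    ("a2", " 42 ", 0.6),
    ("a3", "41", 0.95),
]
reliability = {"a1": 0.8, "a2": 0.7, "a3": 0.5}
belief = belief_over_clusters(candidates, reliability)
best = max(belief, key=belief.get)
```

In this toy run the most confident single agent is outvoted by two agreeing, more reliable agents, and no agent ever sees another agent's raw response.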

[AI-90] SpecAlign: A Semantic Alignment Framework for SystemVerilog Assertion Generation

Link: https://arxiv.org/abs/2605.25181
Authors: Jaime Rafael Imperial,Hao Zheng
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Existing Large Language Model (LLM) approaches to SystemVerilog Assertion (SVA) generation primarily focus on syntactic validity and formal verification outcomes, while semantic alignment between generated assertions and natural language specifications remains difficult to quantify. As a result, hallucinated or misaligned SVAs can reduce confidence and increase debugging efforts in the absence of golden RTL. This paper presents SpecAlign, a framework for semantic evaluation and refinement of LLM-generated SVAs. SpecAlign introduces two iterative alignment loops that assess both natural language properties and SVAs against the design specification using entailment-based classification. We improve alignment decisions by generating multiple reasoning paths using chain-of-thought prompting and aggregating them via a self-consistency voting mechanism. Misaligned assertions are analyzed to generate actionable feedback for refinement. We further define a quantitative alignment score to measure semantic consistency across iterations. Experimental results demonstrate that SpecAlign effectively detects semantic inconsistencies and improves assertion alignment without relying on golden RTL, providing a scalable complement to traditional formal verification evaluation metrics.

[AI-91] Grow-Prune-Freeze Networks: Adaptive Continual Learning Technique for Olfactory Navigation

Link: https://arxiv.org/abs/2605.25170
Authors: Kordel K. France,Ovidiu Daescu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Robotics (cs.RO)
Comments:

Abstract:Training data for olfaction is scattered through disparate, non-standardized datasets that limit the ability to build representative world models. Olfactory navigation is a highly dynamic and non-stationary task that benefits from real-time continual learning. We introduce an adaptive framework called Grow-Prune-Freeze (GPF) networks that enable an agent to continually learn through growing, pruning, and freezing early layers of its policy in response to world complexity. Grounding GPFs in non-linear random matrix theory, we show that the work of Pennington & Worah (2017) can be extended from single hidden layers to n-layer continual-learning models, and that eigenvalue composition of network weights is preserved as successive layers are added. We show that GPFs based on Expected SARSA achieve a 94% success rate on turbulent plume navigation - a partially observable, non-stationary task representative of the “big world” challenges that motivate adaptive learning in robotics - and provide supporting methodology for applying GPFs in other world models. Further experiments provide evidence that GPFs may generalize well to other machine learning tasks such as reinforcement learning in Atari, image classification, and autoregressive language models. We open source all code and data to encourage improvements on and more research in olfactory robotics.

[AI-92] AME-TS: Anchored Mixture-of-Experts for Time Series Forecasting

Link: https://arxiv.org/abs/2605.25166
Authors: Rui Wang,Renhao Xue,Ray Razi,Huan Song,Hannah R. Marlowe
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Time series forecasting models are increasingly scaled through large Transformer backbones, yet most existing approaches process all series through a shared dense computation path despite substantial heterogeneity in temporal structure. Mixture-of-Experts (MoE) offers a natural alternative by enabling conditional computation, but standard MoE routing leaves expert specialization weakly identified and often unstable during downstream adaptation. We propose AME-TS, a structure-guided sparse time series foundation model that aligns expert routing with interpretable temporal structure. AME-TS first uses a lightweight regime predictor to estimate series-level descriptors, including forecastability, seasonality, trend, and sparsity, and maps them to a soft structural prior over experts. This series-level prior guides token-level routing during training, encouraging structure-aligned specialization. On the GIFT-Eval benchmark, AME-TS delivers a strong accuracy-efficiency tradeoff across model scales: it substantially outperforms existing time series foundation models at small model scales and remains competitive with the strongest models at larger scales, while activating substantially fewer parameters through sparse routing. We further show that AME-TS learns more interpretable routing geometry and substantially more stable expert specialization than standard MoE during fine-tuning on the M5 dataset. These results suggest that structure-aware routing is an effective and reliable way to realize the benefits of sparse expert models for time series forecasting.

[AI-93] SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent Benchmarking

Link: https://arxiv.org/abs/2605.25160
Authors: Guohong Liu,Jialei Ye,Pengzhi Gao,Wei Liu,Jian Luan,Yunxin Liu,Yuanchun Li
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Mobile GUI agents powered by large language models have progressed rapidly, creating urgent needs for realistic and comprehensive evaluation. Existing benchmarks prioritize reproducibility but are often limited to open-source apps or file-operation tasks because of the difficulty of constructing rewards on real applications, leaving a gap between benchmark settings and real-world usage. Moreover, most benchmarks focus on basic grounding and navigation, with limited coverage of complex, long-horizon interactions. To address these limitations, we introduce SimuWoB, a fully synthetic benchmark for mobile GUI agents with 120 challenging tasks spanning diverse types and difficulty levels. We build a robust virtual environment generation framework that synthesizes high-fidelity tasks and environments, and automatically provides valid rewards for each task. Each environment is deployed as a backend-free webpage accessible via URL, enabling efficient and reproducible evaluation. We conduct comprehensive experiments on several state-of-the-art mobile GUI agents. The average success rate is only 27.92%, dropping to 17.82% on long-horizon tasks, which reveals substantial weaknesses in current agents under complex scenarios. A comparison of evaluation results with real-world sample tasks demonstrates that agent assessments based on our synthetic environment generalize well. We further provide diagnostic insights across key capability dimensions and discuss implications for future mobile GUI agent development.

[AI-94] Abduction-Deduction Entanglement: Domain Generalization via Representation Transplants

Link: https://arxiv.org/abs/2605.25156
Authors: Kasra Jalaldoust,Elias Bareinboim
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Prediction models trained under the source distribution do not generalize well to a different target distribution. A valid inference about an unseen data distribution must be anchored by the invariance of certain causal mechanisms that generate the source and target data; however, these structural invariances are non-identifiable from the source data alone. Under mild causal assumptions about the data, we show that the optimal prediction in the target is in fact partially identifiable by the source distribution. The result rests on a simple observation: in any domain, the optimal prediction can be factorized into what we call a pair of abduction and deduction maps, where the abduction map makes inference about some unobserved variables (possibly confounders) from the observed variables and the deduction map predicts the label using both the observed and inferred quantities. Access to large source data pins down the optimal prediction and thus constrains the valid abduction-deduction ensembles that produce it, a non-identifiability that we call the abduction-deduction entanglement. To leverage this, we parameterize the constrained family using what we call a representation transplant, that is, a specific linear transformation in the representation space that manipulates the abduction content of the representation while retaining the deduction component. Invariance of the causal mechanism generating the label implies the existence of an invariant deduction map between source and target. Thus, we can search the space of plausible target distributions via a parametric transplant. We use this scheme in a learner-adversary game that, under an idealistic optimization, provably terminates with the learner having the minimax-optimal target prediction. Evaluations verify the theory, showing that the method is competitive in DG benchmarks.

[AI-95] Representation Without Control: Testing the Realization Effect in Language Models

Link: https://arxiv.org/abs/2605.25151
Authors: Ciarán Walsh,Emilio Barkett
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments:

Abstract:Large language models are increasingly used as behavioral simulators, but it remains unclear when their outputs reflect human-like cognitive mechanisms rather than prompt-sensitive surface patterns. We study this question through the realization effect, a well-characterized finding in behavioral economics in which risk-taking differs systematically after paper versus realized gains and losses. We evaluate LLM behavior at three levels: prompt-only behavioral sensitivity, linear readout of internal representations, and causal control via activation steering. Prompt-only results show systematic condition sensitivity, but the directional pattern does not reproduce human realization-effect predictions. Gemma’s residual stream contains a linearly decodable realization-status signal at layer 18 that generalizes to held-out prompts. Steering along this direction does not, however, reliably shift downstream risk choices, a null result that holds across positive scales and in a negative sign-symmetry run. Behavioral sensitivity, latent readout, and causal control are three distinct properties that do not automatically co-occur, and successful latent readout is insufficient evidence that a model behaviorally relies on a representation during downstream decision-making.

[AI-96] Beyond the Frontier: Stochastic Backtracking for Efficient Test-Time Scaling

Link: https://arxiv.org/abs/2605.25143
Authors: Dao Tran,Duc Anh Le,Ngoc Luu,Quan Pham,Tung Pham,Hung Bui
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Test-time scaling improves language model reasoning by spending additional compute to explore multiple solution trajectories. The key challenge is to maximize accuracy while minimizing the total number of generated tokens during reasoning. Recent PRM-guided methods score intermediate prefixes to steer this search, but most are frontier-only: they keep only the current active prefixes and irreversibly prune or resample away the rest using noisy PRM scores. This can cause premature commitment, diversity collapse, and the loss of prefixes that still admit correct continuations. We introduce stochastic backtracking over a persistent pool of historical prefixes, allowing test-time compute to revisit previously generated states instead of only expanding the current frontier. To make this efficient, we propose two complementary mechanisms. Subpool Selection strengthens greedy PRM-guided search by applying Top-N selection within random subpools, giving historical prefixes a chance to bypass over-scored frontier candidates. Power Backtrack Sequential Monte Carlo extends SMC-style resampling to the persistent pool using powered PRM scores and mixture-corrected weights. Across mathematical reasoning benchmarks and model scales, our methods consistently achieve higher accuracy per token count, and the same level of accuracy using only a fraction of the token count in comparison to strong PRM-guided baselines, demonstrating that persistent-pool stochastic backtracking provides a simple and effective way to improve the accuracy-token trade-off in test-time scaling.
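
The Subpool Selection idea (Top-N within random subpools of a persistent pool) can be sketched as follows. The partitioning into disjoint, equal-sized subpools and the name `subpool_select` are assumptions for illustration; the paper may sample subpools differently.

```python
import random
from typing import List, Tuple

Prefix = Tuple[str, float]  # (prefix text, PRM score)


def subpool_select(pool: List[Prefix], num_subpools: int, top_n: int,
                   rng: random.Random) -> List[Prefix]:
    """Shuffle the persistent pool into random subpools and keep the top-N of each."""
    shuffled = pool[:]
    rng.shuffle(shuffled)
    # Strided slices give disjoint subpools of near-equal size.
    subpools = [shuffled[i::num_subpools] for i in range(num_subpools)]
    selected: List[Prefix] = []
    for sub in subpools:
        selected.extend(sorted(sub, key=lambda p: p[1], reverse=True)[:top_n])
    return selected


rng = random.Random(0)
pool = [(f"prefix-{i}", score) for i, score in
        enumerate([0.91, 0.88, 0.85, 0.40, 0.35, 0.30, 0.20, 0.10])]
chosen = subpool_select(pool, num_subpools=4, top_n=1, rng=rng)
```

Unlike a global Top-N, a lower-scored historical prefix survives whenever it lands in a subpool without a higher-scored competitor, while the globally best prefix is always kept because it wins whichever subpool it falls into.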

[AI-97] ASTRO: Adaptive Spatio-Temporal Reinforcement Optimization for GNN Powered Anomaly Detection in Cyber Physical Systems

Link: https://arxiv.org/abs/2605.25135
Authors: Rai Ali Yar,Umaisa Lail,Anwar Shah
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Anomaly detection in Industrial Internet of Things (IIoT) environments is essential to protect Industrial Control Systems (ICS) and Cyber-Physical Systems (CPS) from run-time false data injection and other malicious attacks. The increasing complexity of sensor networks and interconnected control loops makes it difficult to identify anomalous behavior hidden within high-dimensional and time-dependent signals. To address these challenges, this article introduces Adaptive Spatio-Temporal Reinforcement Optimization (ASTRO), a novel anomaly detection framework that pioneers the use of reinforcement learning for dynamic threshold optimization. By integrating a Deep Q-Network (DQN) with Graph Neural Networks (GNNs), temporal modelling, and a Multi-Head Attention mechanism, ASTRO continuously adapts its decision boundaries to improve detection accuracy. The GNN component models the spatial relations among sensors, the temporal model captures time-series dependencies, and the attention layer highlights the most informative time steps. The model generates continuous anomaly scores, which are transformed into binary decisions using an adaptive threshold optimized via the DQN. The ASTRO approach is evaluated on two real-world industrial benchmarks: the Secure Water Treatment (SWaT) and Water Distribution (WADI) datasets. The proposed model achieves an F1 score of 0.990 on SWaT. Moreover, on the highly complex WADI dataset with 127 end devices, it achieves an F1 score of 0.788, outperforming state-of-the-art baselines by nearly 14%. Results across multiple runs confirm consistent generalization and stability. These experiments demonstrate that the ASTRO framework is a practical and scalable method for strengthening large-scale cyber-physical infrastructures.

[AI-98] Theoretical Analysis of Sparse Optimization with Reparameterization Weight Decay and Adaptive Learning Rate ICML2026

Link: https://arxiv.org/abs/2605.25134
Authors: Huangyu Xu, Jingqin Yang, Qianqian Xu, Jiaye Teng
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: 31 pages, 5 figures. Submitted to ICML 2026

Click to view abstract

Abstract:Sparse optimization is a fundamental challenge in various practical applications. A popular approach to sparse optimization is $\ell_p$ regularization. However, it may encounter optimization instability due to unbounded gradients when $0 < p < 1$. In this paper, we introduce a novel approach to sparse optimization termed ReWA, based on Reparameterization, Weight decay, and Adaptive learning rate. ReWA is closely connected to $\ell_p$-regularization, yet it unveils a distinct optimization landscape that helps mitigate instability issues. Experiments on CIFAR-10 and ImageNet with ResNets demonstrate that ReWA leads to significant sparsity improvements over the $\ell_1$-regularization approach while preserving test accuracy.
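
The instability mentioned in the abstract can be made concrete with a short derivation. This is only the standard calculus fact about the $\ell_p$ penalty, written with an assumed penalty symbol $R_p$ and weight vector $w$; it does not reproduce the ReWA analysis itself.

```latex
% Assumed notation: R_p is the l_p penalty on a weight vector w = (w_1, ..., w_d).
\[
  R_p(w) = \sum_{i=1}^{d} |w_i|^{p},
  \qquad
  \frac{\partial R_p}{\partial w_i} = p \,\operatorname{sign}(w_i)\, |w_i|^{p-1}
  \quad (w_i \neq 0).
\]
% For 0 < p < 1 the exponent p - 1 is negative, so the gradient magnitude diverges near zero:
\[
  \left| \frac{\partial R_p}{\partial w_i} \right| = p\, |w_i|^{p-1} \to \infty
  \quad \text{as } w_i \to 0,
\]
% whereas for p = 1 the magnitude equals 1 for every w_i \neq 0.
```

The divergence occurs exactly where sparse solutions live, at coordinates approaching zero, which is why plain gradient steps on the penalty can become unstable there.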

[AI-99] Courant: a State-Adaptive Perceiver-Based Neural Surrogate with Local Support and Interpretable Field Decomposition

Link: https://arxiv.org/abs/2605.25115
Authors: Anuj Kumar, Josiah Bjorgaard, Nikolaos Bouklas, Matteo Salvador, Alexander Lavin
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Applied Physics (physics.app-ph)
Notes:

Click to view abstract

Abstract:We introduce “Courant”, a Perceiver-based encoder-processor-decoder surrogate model that has latent features exhibiting adaptive specialization and local support in the physical space, enabling functionality akin to an adaptive hp-refinement scheme, an attribute that is highly desirable in traditional numerical solvers and scientific machine learning broadly. The proposed architecture combines a shared random Fourier feature coordinate embedding, state-adapted latent queries, and a lightweight decoder. Courant is trained end-to-end with steady or transient simulation data and only a standard $L_2$ prediction loss in the physical space, achieving competitive accuracy on benchmarks. We demonstrate that Courant’s inductive biases yield latents that are interpretable by design: they develop multiscale geometric specialization in the simulation domain and track coherent structures in the time-dependent case, acting analogously to time-evolving spatial basis functions and allowing for decoding a compact, geometry-anchored, partition-of-unity-like decomposition of the simulated field.

[AI-100] Leveraging Gauge Freedom for Learning Non-Gradient Population Dynamics of Stochastic Systems

Link: https://arxiv.org/abs/2605.25107
Authors: Jules Berman, Tobias Blickhan, Benjamin Peherstorfer
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
Notes:

Click to view abstract

Abstract:Existing work on population dynamics inference often focuses on flows arising from vector fields that are the gradients of scalar potentials. Among all admissible flows that are compatible with the population dynamics, gradient flows are optimal in a specific sense: they minimize kinetic energy. The selection of fields based on different criteria corresponds to a gauge freedom when determining population dynamics, which we leverage in this work. We propose Non-Gradient Inference Flows (NGIF), an algorithm to infer non-gradient population dynamics using a weak formulation of the continuity equation. This allows us to parameterize general vector fields and choose other selection criteria beyond minimal kinetic energy. We demonstrate on a variety of low- and high-dimensional physics problems that this more general approach improves distributional accuracy over gradient-restricted baselines and better captures non-potential transport.
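
For readers who want the objects named in the abstract written out, the following block states the continuity equation, its weak form, and the gauge freedom in standard textbook form. The symbols ($\rho_t$ for the population density, $v_t$ for the vector field, $\varphi$ for a smooth test function, $w_t$ for a gauge field) are assumed notation, and the exact formulation used by NGIF may differ.

```latex
% Continuity equation for a density rho_t transported by a vector field v_t:
\[
  \partial_t \rho_t + \nabla \cdot (\rho_t v_t) = 0 .
\]
% Weak form against a smooth, compactly supported test function phi (integration by parts):
\[
  \frac{d}{dt} \int \varphi(x)\, \rho_t(x)\, dx
  = \int \nabla \varphi(x) \cdot v_t(x)\, \rho_t(x)\, dx .
\]
% Gauge freedom: any field w_t with zero weighted divergence leaves the density path unchanged,
\[
  \nabla \cdot (\rho_t w_t) = 0
  \;\Longrightarrow\;
  \partial_t \rho_t + \nabla \cdot \bigl(\rho_t (v_t + w_t)\bigr) = 0 .
\]
% Among all admissible fields, the gradient choice v_t = \nabla s_t minimizes the kinetic energy
\[
  \int |v_t(x)|^{2}\, \rho_t(x)\, dx .
\]
```

The weak form only needs expectations of test functions under samples from $\rho_t$, which is what makes it usable when only population snapshots are observed.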

[AI-101] Multi-Agent Specification-based Metamorphic Testing of FMU-Based Simulations

Link: https://arxiv.org/abs/2605.25101
Authors: Ashir Kulshreshtha, Abdullah Mughees, Gaadha Sudheerbabu, Tanwir Ahmad, Kristian Klemets, Dragos Truscan, Mikael Manngård
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Notes: Author version. 9 pages. Accepted for publication in the 10th International Workshop on Metamorphic Testing (MET 2026) of the IEEE Conference on Computers, Software, and Applications (COMPSAC 2026), June 7-10, 2026, Madrid, Spain

Click to view abstract

Abstract:In many industrial domains, the Functional Mock-up Interface (FMI) is used to exchange simulation models as Functional Mock-up Units (FMUs) across different partners using various modelling tools. This opens up the possibilities for simulation-based verification and validation using FMUs for ensuring reliable system behaviour. However, deriving effective test oracles for these simulation models remains challenging due to the absence of explicit expected outputs. This limits the applicability of conventional testing approaches, which require access to the internal workings of the systems. Metamorphic testing (MT) addresses this limitation by leveraging metamorphic relations (MRs), but extracting such relations from specifications remains largely a manual and error-prone process. To address this challenge, we propose an LLM-powered multi-agent workflow for specification-based metamorphic testing of FMU-based simulation models. The approach takes functional and interface specifications as input and orchestrates multiple agents to extract requirements and derive MRs. These MRs are expressed using Given-When-Then patterns to structure input conditions (Given), transformations (When), and expected output behaviours (Then). These relations are then used to generate metamorphic test cases, execute simulations, and evaluate output consistency across multiple sessions. We evaluate the approach on a Lube Oil Cooling system FMU, demonstrating its ability to automatically generate meaningful MRs and corresponding test cases. Preliminary results indicate that the proposed workflow can effectively support the systematic verification and validation of dynamic simulation models by reducing manual effort and improving test generation.
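
The Given-When-Then structure of a metamorphic relation can be illustrated with a small sketch. Everything here is hypothetical: `toy_cooler` is a made-up stand-in for an FMU simulation run, and the `MetamorphicRelation` class and the example relation are illustrative names, not the authors' workflow or any FMI library API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

Inputs = Dict[str, float]


def toy_cooler(inputs: Inputs) -> float:
    """Hypothetical stand-in for an FMU run: returns oil outlet temperature (degC)."""
    # Simple effectiveness model: more coolant flow removes more heat.
    effectiveness = inputs["coolant_flow"] / (inputs["coolant_flow"] + 1.0)
    return inputs["oil_in"] - effectiveness * (inputs["oil_in"] - inputs["coolant_in"])


@dataclass
class MetamorphicRelation:
    name: str
    given: Callable[[Inputs], bool]        # precondition on the source input
    when: Callable[[Inputs], Inputs]       # transformation producing the follow-up input
    then: Callable[[float, float], bool]   # expected relation between the two outputs


def check(mr: MetamorphicRelation,
          simulate: Callable[[Inputs], float],
          source: Inputs) -> Optional[bool]:
    """Run source and follow-up simulations; None means the relation does not apply."""
    if not mr.given(source):
        return None
    follow_up = mr.when(source)
    return mr.then(simulate(source), simulate(follow_up))


more_flow_cools = MetamorphicRelation(
    name="doubling coolant flow does not raise the oil outlet temperature",
    given=lambda x: x["oil_in"] > x["coolant_in"],
    when=lambda x: {**x, "coolant_flow": x["coolant_flow"] * 2.0},
    then=lambda source_out, follow_up_out: follow_up_out <= source_out,
)

source = {"oil_in": 80.0, "coolant_in": 20.0, "coolant_flow": 1.0}
result = check(more_flow_cools, toy_cooler, source)
```

No expected output value is needed for the check: the relation only compares two runs of the same model, which is what makes metamorphic testing usable when no test oracle exists.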

[AI-102] RECTOR: Priority-Aware Rule-Based Reranking for Compliance-Aware Autonomous Driving Trajectory Selection

Link: https://arxiv.org/abs/2605.25095
Authors: Hadi Hajieghrary, Benedikt Walter, Chaitanya Shinde, Paul Schmitt, Miguel Hurtado
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
Notes:

Click to view abstract

Abstract:Autonomous driving stacks must pick one trajectory from a multi-modal candidate set; choosing by model confidence ignores safety, traffic-law, and comfort constraints. We present RECTOR (Rule-Enforced Constrained Trajectory Orchestrator), a post-generation reranking layer that scores candidates against a tiered rulebook (Safety $\succ$ Legal $\succ$ Road $\succ$ Comfort) via differentiable proxies and a scene-conditioned applicability mechanism, then selects with a deterministic $\varepsilon$-lexicographic rule that preserves cross-tier priority by construction, without retraining the predictor. On the Waymo Open Motion Dataset validation_interactive split (43,219 augmented instances, $K=6$), under Protocol B (28-rule proxy catalog, oracle applicability), rule-aware selection cuts Safety+Legal violations from 28.58% to 20.42% and Total from 40.32% to 32.41% versus confidence-only on the same candidates. A uniform-weight weighted-sum baseline matches binary compliance on this benchmark: the empirical lift comes from rule-aware ranking, while the lexicographic guarantee is the structural differentiator no weight calibration can replicate. Under adversarial confidence corruption, confidence-only selection fails in 100% of scenarios while both rule-aware selectors reject the injected mode in ~96%. All figures are proxy-evaluator results (not a safety certificate), open-loop, 5 s horizon, U.S. rules, validation split.
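
A minimal sketch of what an ε-lexicographic selection over tiered violation scores could look like. The abstract does not spell out the rule, so this rests on assumptions: lower violation scores are better, a candidate survives a tier if it is within `eps` of the best surviving score, and any final tie is broken by confidence. The function name and the example numbers are hypothetical.

```python
from typing import List, Sequence

# Tier order taken from the abstract: Safety > Legal > Road > Comfort.
TIERS = ("safety", "legal", "road", "comfort")


def eps_lexicographic_select(violations: Sequence[Sequence[float]],
                             confidence: Sequence[float],
                             eps: float = 1e-3) -> int:
    """Return the index of the selected candidate.

    violations[i][t] is the (assumed) violation score of candidate i on tier t.
    Tiers are processed in priority order; a lower tier can only break ties
    left by the higher tiers, never override them.
    """
    survivors: List[int] = list(range(len(violations)))
    for tier in range(len(TIERS)):
        best = min(violations[i][tier] for i in survivors)
        survivors = [i for i in survivors if violations[i][tier] <= best + eps]
    # Assumed tie-break: most confident candidate among the survivors.
    return max(survivors, key=lambda i: confidence[i])


# Hypothetical scores for K = 4 candidates.
violations = [
    [0.30,   0.0, 0.0, 0.0],  # most confident, but violates the safety tier
    [0.00,   0.2, 0.1, 0.5],
    [0.00,   0.2, 0.0, 0.9],
    [0.0005, 0.0, 0.4, 0.1],  # within eps of the best safety score, clean on legal
]
confidence = [0.6, 0.2, 0.1, 0.1]

selected = eps_lexicographic_select(violations, confidence)
confidence_only = max(range(len(confidence)), key=lambda i: confidence[i])
```

In this toy case confidence-only selection picks candidate 0, while the tiered rule picks candidate 3. Unlike a weighted sum, no choice of comfort or road scores can compensate for the safety violation of candidate 0.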

[AI-103] Evolutionary Enhanced Multi-Agent Reinforcement Learning for Cooperative Air Combat

Link: https://arxiv.org/abs/2605.25091
Authors: Chengwei Li, Junlin Liu, Yang Gao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes:

Click to view abstract

Abstract:As modern air combat evolves toward beyond-visual-range (BVR) multi-aircraft cooperative engagements, autonomous decision-making for unmanned combat aerial vehicles (UCAVs) faces significant challenges due to high-dimensional state spaces, discrete action commands, and strongly adversarial dynamic environments. To overcome the limitations of existing multi-agent reinforcement learning (MARL) methods in such settings, namely insufficient exploration efficiency, low sample utilization, and poor policy generalization, we propose Adversarial Curriculum and Evolutionary-enhanced Multi-agent Proximal Policy Optimization (ACE-MAPPO), a hybrid learning framework that integrates evolutionary algorithms with MAPPO. Specifically, a genetic soft update mechanism is introduced to enhance population diversity and mitigate convergence to local optima. An evolutionary-augmented prioritized trajectory replay strategy is further employed to improve the utilization of sparse high-value samples. In addition, an adversarial evolutionary curriculum learning mechanism is designed to enable adaptive training with progressively increasing difficulty. Extensive experimental results demonstrate that the proposed method outperforms MAPPO and other baseline algorithms in terms of training stability, convergence speed, and win rate, validating its effectiveness in multi-aircraft cooperative air combat scenarios.

[AI-104] Polynomial Context-Truncation Sensitivity in Autoregressive Language Models: Sequential Wyner-Ziv Bounds for KV Cache Compression

Link: https://arxiv.org/abs/2605.25085
Authors: Munsik Kim
Affiliations: Unknown
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

Click to view abstract

Abstract:We study the rate-distortion limits of online KV cache compression in autoregressive language models, formulating it as sequential Wyner-Ziv source coding on the filtration induced by the model, with the next-step query as decoder side information. Empirically, across four models spanning two families and 0.5-3B parameters, we find that the next-token distribution’s sensitivity to context truncation decays polynomially rather than geometrically: a power law improves on an exponential fit by an order of magnitude in extrapolation, the fitted exponent is recovered independently from a sink-plus-recent KL measurement, and the decay is verified to be free of positional-encoding artifacts by a position-preserving ablation. Under a corresponding polynomial truncation-sensitivity assumption, our main result characterizes the per-token memory requirement of suffix-only cache policies: a sliding-window scheme attains distortion $\varepsilon$ with window $w = O(\varepsilon^{-1/\alpha})$, and, under an additional two-sided Bayes-risk condition, a converse shows $w = \Omega(\varepsilon^{-1/\alpha})$ is necessary within this policy class, so the scaling is $\Theta(\varepsilon^{-1/\alpha})$ for suffix-only policies. Whether recurrent or propagating cache summaries can beat this scaling is left open. An explicit block-Markov scheme achieves the upper bound; its rate-of-convergence exponent matches the converse under additional forward-decay and regularity hypotheses (not implied by truncation sensitivity alone), and differs by a factor of two otherwise. Empirically, the polynomial law predicts the degradation curves of concrete cache policies: recency-based eviction (sliding, sink-plus-recent) suppresses distortion by roughly two orders of magnitude over random retention at equal budget, with a power-law decay in the budget.
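
To see what the $\Theta(\varepsilon^{-1/\alpha})$ scaling means in practice, here is a small sketch. It assumes a distortion model of the form $D(w) = c\,w^{-\alpha}$, which is one simple instance consistent with the stated scaling; the constant `c`, the function names, and the example numbers are hypothetical and not taken from the paper.

```python
import math
from typing import List


def sliding_window_size(eps: float, alpha: float, c: float = 1.0) -> int:
    """Smallest integer window w with c * w**(-alpha) <= eps.

    Solving c * w**(-alpha) <= eps gives w >= (c / eps) ** (1 / alpha),
    so the required window grows like eps ** (-1 / alpha).
    """
    return math.ceil((c / eps) ** (1.0 / alpha))


def sink_plus_recent(n_tokens: int, n_sink: int, n_recent: int) -> List[int]:
    """Positions kept by a sink-plus-recent policy: first n_sink and last n_recent tokens."""
    if n_sink + n_recent >= n_tokens:
        return list(range(n_tokens))
    return list(range(n_sink)) + list(range(n_tokens - n_recent, n_tokens))


# With alpha = 0.5, halving the target distortion quadruples the window.
w_coarse = sliding_window_size(eps=0.25, alpha=0.5)
w_fine = sliding_window_size(eps=0.125, alpha=0.5)
kept = sink_plus_recent(n_tokens=10, n_sink=2, n_recent=3)
```

A smaller exponent `alpha` (slower decay of truncation sensitivity) makes the memory cost of a given distortion target grow much faster, which is the practical content of the polynomial-versus-geometric distinction.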

[AI-105] Security in the Fine-Tuning Lifecycle of Large Language Models: Threats, Defenses, Evaluation, and Future Directions

Link: https://arxiv.org/abs/2605.25073
Authors: Wenjuan Li, Yitao Liu, Runze Chen, Rajkumar Buyya
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 39 pages, 7 figures, 22 tables

Click to view abstract

Abstract:Background: Fine-tuning is central to adapting pre-trained Large Language Models (LLMs) to downstream tasks, but its reliance on training data, parameter updates, and reusable components opens entry points for attackers. Threats have evolved from data poisoning and weight tampering to agent manipulation and interface exploitation, yet existing reviews lack a unified framework spanning the full fine-tuning lifecycle. Objective: This paper presents a systematic survey of LLM fine-tuning security and establishes a lifecycle-based framework for comparing attacks and defenses, complemented by unified empirical evaluation. Methods: We divide attack and defense mechanisms into three phases by intervention timing: pre-tuning, during-tuning, and post-tuning. Within each phase, strategies are reviewed and contrasted to expose their evolution and limitations. Representative methods are then evaluated under a unified model, hardware, and protocol setup, with cross-phase experiments pairing attacks and defenses from different phases. Results: Attack effectiveness is highly model-dependent and non-monotonic with scale: weight-editing attacks effective on earlier models lose impact on modern open-source LLMs; cross-lingual backdoor transfer, reported as near-perfect at larger scales, fails entirely on tested 1B-4B models; and purely benign samples can compromise safety alignment in instruction-tuned models. Single-phase defenses rarely generalize across phases, and defense effectiveness depends jointly on model architecture and alignment state. Conclusion: We identify key open problems (configuration-robust defense, cross-phase defense composition, and embedding-space attacks beyond behavioral assumptions) and propose concrete future research directions.

[AI-106] Cultivating Machine Intelligence: The OMEGA Shift from Top-Down Optimization to Autopoietic Cognitive Ecologies

Link: https://arxiv.org/abs/2605.25062
Authors: Ata G. Zare
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Notes: Extended preprint. A shorter version of this work is currently under peer review

Click to view abstract

Abstract:The dominant artificial intelligence paradigm trains neural architectures via gradient descent against proxy objectives and reinforcement learning from human feedback. While remarkably capable, this top-down optimization inherently generates structural failure modes, including hallucination, sycophancy, reward hacking, and alignment fragility, which represent paradigmatic limitations rather than mere engineering defects. In response, we introduce RECLAIM (Recursive, Ecological, Cognitive, Lifelike, Adaptive, Intelligent Machine), a theoretical framework for cultivating intelligence through computational ecology rather than engineering it through strict optimization. The model is supported by four interlocking theoretical pillars. General Darwinism replaces gradients with blind variation and selective retention, while non-agentic emergence substitutes evaluative rewards with environmental physics to structurally prevent specification gaming against human intent. Concurrently, the Polya-Hebbian bridge applies Polya urn dynamics to Hebbian reinforcement for path-dependent specialization, and the free energy principle is integrated as environmental thermodynamics rather than as an agent objective. The architecture situates autopoietic units, bounded by Markov blankets and competing for finite computational energy, within a data ecology shaped by cognitive food chains and Red Queen arms races. This framework suggests the spontaneous emergence of dual-process cognition, sensory specialization, analogical reasoning, and intrinsic motivation as natural consequences of evolution under resource constraints. We conceptualize this paradigm transition as the OMEGA shift, representing a move from optimization and maximization to emergence through generative autopoiesis.

[AI-107] GL-LFGNN: A Global-Local Dual-branch Causal Graph Neural Network Based on Liang-Kleeman Information Flow for EEG Emotion Recognition

Link: https://arxiv.org/abs/2605.25061
Authors: Ziyi Wang, Dongyang Kuang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: 10 pages, 3 figures

Click to view abstract

Abstract:EEG-based emotion recognition holds significant promise for objective diagnosis of mood disorders. Graph neural networks (GNNs) have emerged as the dominant paradigm for modeling inter-channel dependencies in EEG, yet existing approaches rely on symmetric adjacency matrices derived from spatial proximity or functional correlations that fundamentally capture statistical associations rather than directed causal influences, which conflicts with the inherently asymmetric, causally-driven nature of neural information flow. To bridge this gap, we propose GL-LFGNN, a Global-Local Dual-branch Causal Graph Neural Network grounded in Liang-Kleeman information flow theory. Unlike Granger causality that merely assesses temporal precedence, our approach rigorously quantifies causal strength from a dynamical systems perspective, yielding neurophysiologically interpretable directed graphs. A dual-branch architecture further integrates whole-brain connectivity with region-specific processing aligned to established functional neuroanatomy. On the MEEG dataset, GL-LFGNN achieves 86.17% (Arousal) and 86.71% (Valence) accuracy with only 37K parameters – approximately 10% of the current state-of-the-art – demonstrating that principled causal modeling can simultaneously enhance interpretability, generalization, and computational efficiency. Code will be released.

[AI-108] Scale When Needed: Adaptive Neuron-level Mixed Precision Quantization Aware Training ICML

Link: https://arxiv.org/abs/2605.25054
Authors: Ayush K. Varshney, Konstantinos Vandikas, Šarūnas Girdzijauskas, Adam Orucu, Aneta Vulgarakis Feljan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: Accepted at ICML - GlobalSouthML workshop, 2026

Click to view abstract

Abstract:Deploying deep neural networks on resource-constrained 6G edge devices demands aggressive compression with minimal accuracy loss. Quantization-Aware Training (QAT) has emerged as a leading compression approach; however, existing mixed-precision methods typically operate at coarse layer- or channel-level granularity. These methods often rely on heuristic or search-based bit-allocation strategies, which may overlook fine-grained variability at the neuron level. We propose Neuron-Level Mixed-Precision QAT (NMP-QAT), where each neuron independently learns its own discrete precision during training. Starting from low-bit precision, NMP-QAT expands bit-width only when training signals demand it, via differentiable surrogates and straight-through estimators, while preserving a fully discrete inference graph. This adaptability extends to both weights and activations, reducing memory movement. Evaluated on telecom and non-telecom datasets across MLP and tabular foundation model architectures, NMP-QAT achieves superior compression-accuracy trade-offs over mixed-precision QAT baselines, making it well-suited for Green AI deployments at the network edge.

[AI-109] AION: Next-Generation Tasks and Practical Harness for Time Series

Link: https://arxiv.org/abs/2605.25045
Authors: Tianxiang Zhan, Xiaobao Song, Tong Guan, Shirui Pan, Ming Jin
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes: Project page and code are available at this https URL

Click to view abstract

Abstract:Time series research is moving beyond fixed forecasting benchmarks toward realistic tasks that combine prediction, contextual reasoning, tool use, and structured decision support. Most benchmarks are built around clean data and short evaluation loops; agents alone may miss temporal constraints, evidence checks, or review before finalizing outputs. We first formalize next-generation time series tasks as three-component tuples consisting of a task file, a workspace, and a validation interface. We then present AION, a time series harness built from six component groups: agents, skills, rules, memory, evaluation, and protocols. In this harness, we use three design principles: temporal grounding, temporal knowledge-grounded reasoning, and reliability mechanisms such as post-experiment analysis and layered review. A Kaggle Store Sales case study shows that the harness produces more detailed process traces, more artifacts, and more review steps than the same base agent operating in OpenCode direct build mode. Taken together, these results argue for a paradigm shift from fixed tasks to realistic ones under real-world constraints.

[AI-110] Performance Comparison of Classical and Neural Sampling Algorithms for Robotic Navigation

Link: https://arxiv.org/abs/2605.25010
Authors: Hichem Cheriet, Badra Khellat Kihel, Samira Chouraqui
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Notes:

Click to view abstract

Abstract:Integrating artificial intelligence (AI) into sampling-based motion planning provides new possibilities for improving autonomous navigation efficiency. In this paper, three algorithms, namely RRT*, Neural RRT*, and Neural Informed RRT*, are implemented and evaluated on environments containing convex and concave obstacles with different obstacle densities. The obtained results indicate that neural-guided planners improve path quality, producing up to 14% shorter paths and 55–75% smoother trajectories compared with the conventional RRT* algorithm. Among the evaluated methods, Neural Informed RRT* achieves the best overall performance in terms of path length and trajectory smoothness. These results demonstrate the effectiveness of AI-guided sampling strategies for improving reliability and trajectory efficiency in robotic and UAV navigation, despite a slight increase in computation time. Overall, the study highlights the growing importance of artificial intelligence in real-time robotic path planning applications.

[AI-111] Metropolis-Scale Resilient and Trustworthy Traffic Flow Inference Using Multi-Source Data

Link: https://arxiv.org/abs/2605.25004
Authors: Qishen Zhou, Yifan Zhang, Michail A. Makridis, Anastasios Kouvelas, Yibing Wang, Simon Hu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: The paper has been submitted to Elsevier for possible publication

Click to view abstract

Abstract:Inferring network-wide traffic states from sparse observations with high accuracy and trustworthy uncertainty quantification is essential for intelligent transportation systems, yet it remains challenging due to the underdetermined nature of the problem, multifaceted disturbances in sensing networks, and the inherent conflicts among multiple inference sub-tasks when modeled jointly. We propose the Task-Aware Attentive Neural Process (TA-ANP), a unified probabilistic framework for resilient and trustworthy global traffic state inference (GTSI) by fusing floating car data (FCD) with sparse fixed-detector measurements. By casting GTSI as a stochastic process, TA-ANP leverages the meta-learning properties of neural processes to adapt rapidly to changes in sensing configurations without retraining. A task-aware multi-query attention module with distinct spatiotemporal inductive biases is introduced to jointly handle three GTSI sub-tasks, while mitigating cross-task interference. For uncertainty quantification, we combine neural processes with Monte Carlo Dropout to capture both aleatoric and epistemic uncertainty. To support metropolis-scale evaluation, we construct the Metropolitan Multi-Source Traffic Dataset (MMTD), integrating fixed-loop sensor measurements, FCD statistics, and OpenStreetMap road-network data over an urban network of 2,371 road segments. Experiments on MMTD show that TA-ANP achieves state-of-the-art performance across all sub-tasks under deterministic and probabilistic metrics. The resulting well-calibrated uncertainties enable more efficient fixed-sensor placement with fewer sensor deployments. Under a Damage-Repair-Addition sensing lifecycle, TA-ANP demonstrates superior resilience in terms of disturbance absorption, performance recovery, and adaptability to unseen sensing configurations.
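
The split into aleatoric and epistemic uncertainty can be sketched with the decomposition commonly used with Monte Carlo Dropout, which follows from the law of total variance. This is an assumption about the general technique, not a reproduction of TA-ANP: each stochastic forward pass is assumed to return a predictive mean and a predictive variance, and the function name and numbers are hypothetical.

```python
from statistics import fmean, pvariance
from typing import Sequence, Tuple


def decompose_uncertainty(means: Sequence[float],
                          variances: Sequence[float]) -> Tuple[float, float, float]:
    """Split predictive variance over T stochastic forward passes.

    means[t], variances[t]: predictive mean and variance from pass t
    (for example, with a different dropout mask per pass).
    aleatoric = average predicted variance (noise in the data)
    epistemic = variance of the predicted means (disagreement between passes)
    """
    aleatoric = fmean(variances)
    epistemic = pvariance(means)
    return aleatoric, epistemic, aleatoric + epistemic


# Two hypothetical passes predicting traffic speed (km/h) on one road segment.
aleatoric, epistemic, total = decompose_uncertainty(
    means=[10.0, 14.0],
    variances=[1.0, 3.0],
)
```

A segment far from any fixed detector would typically show high epistemic variance, which is the kind of signal that could guide where an additional sensor is most useful.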

[AI-112] Bridging the Gap: Enabling Soft Actor Critic for High Performance Legged Locomotion

Link: https://arxiv.org/abs/2605.24975
Authors: Gianluca Sabatini, Chenhao Li, Marco Hutter
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

Click to view abstract

Abstract:Proximal Policy Optimization (PPO) has become the de facto standard for training legged robots, thanks to its robustness and scalability in massively parallel simulation environments like IsaacLab. However, its on-policy nature makes it inherently sample-inefficient, preventing its use for continuous adaptation and fine-tuning on real hardware. Soft Actor-Critic (SAC), by contrast, is an off-policy algorithm that can reuse past experience, making it a natural candidate for sim-to-real transfer workflows where the same algorithm can be used both in simulation and for online learning on the real robot. Despite these advantages, SAC has consistently failed to match PPO’s empirical performance in massively parallel training settings. This work identifies the root causes of this gap and introduces targeted modifications, covering policy initialization, timeout-aware critic targets, and multi-step return estimation, that enable SAC to train stably at scale. Evaluated across multiple legged robot platforms and diverse locomotion tasks, our approach closes the performance gap with PPO entirely.
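
Two of the listed modifications, timeout-aware critic targets and multi-step returns, can be sketched together. This is a generic formulation under stated assumptions, not the authors' implementation: `terminated` marks a true terminal state (no bootstrap), `truncated` marks an episode cut by a time limit (still bootstrap), and `next_values` is assumed to already hold the soft value estimate of each successor state, entropy term included.

```python
from typing import Sequence


def n_step_target(rewards: Sequence[float],
                  terminated: Sequence[bool],
                  truncated: Sequence[bool],
                  next_values: Sequence[float],
                  gamma: float = 0.99) -> float:
    """n-step critic target that distinguishes terminations from timeouts.

    All sequences have length n and describe the n transitions after s_t;
    next_values[k] estimates the value of the state reached by transition k.
    """
    target, discount = 0.0, 1.0
    for r, term, trunc, v_next in zip(rewards, terminated, truncated, next_values):
        target += discount * r
        discount *= gamma
        if term:
            # True terminal state: the return ends here, nothing to bootstrap.
            return target
        if trunc:
            # Timeout: the task would have continued, so bootstrap from the critic.
            return target + discount * v_next
    # No episode boundary inside the window: bootstrap after n steps.
    return target + discount * next_values[-1]


full = n_step_target([1.0, 1.0, 1.0], [False] * 3, [False] * 3, [0.0, 0.0, 10.0], gamma=0.5)
died = n_step_target([1.0, 1.0, 1.0], [False, True, False], [False] * 3, [0.0, 4.0, 10.0], gamma=0.5)
timed_out = n_step_target([1.0, 1.0, 1.0], [False] * 3, [False, True, False], [0.0, 4.0, 10.0], gamma=0.5)
```

Treating a timeout as a terminal state would teach the critic that value drops to zero at the time limit, a bias that matters when thousands of parallel environments reset on a fixed schedule.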

[AI-113] TGFormer: Towards Temporal Graph Transformer with Auto-Correlation Mechanism

Link: https://arxiv.org/abs/2605.24971
Authors: Hongjiang Chen, Pengfei Jiao, Ming Du, Xuan Guo, Zhidong Zhao, Di Jin, Xiao Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Click to view abstract

Abstract:The growing interest in Temporal Graph Neural Networks (TGNNs) stems from their ability to model complex dynamics and deliver superior performance. However, TGNNs encounter fundamental challenges in capturing long-term dependencies and identifying periodic patterns. To address these limitations, we propose TGFormer, a novel Transformer architecture specifically designed for temporal graphs. Our model redefines temporal graph learning by establishing a trajectory framework that aligns with time series analysis principles. This approach allows TGFormer to derive node representations through systematic analysis of historical interactions, enabling granular examination of node relationships across sequential timestamps. Building upon stochastic process theory, we develop an auto-correlation mechanism that systematically uncovers periodic dependencies in node interactions. This innovation empowers TGFormer to perform dependency discovery and representation aggregation at sub-interaction levels, demonstrating superior efficiency and accuracy compared to conventional attention mechanisms. Experimental validation across six public benchmarks confirms the effectiveness of our approach, with TGFormer achieving up to a 9.35% precision improvement over state-of-the-art approaches.
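
The idea of using auto-correlation to uncover periodic dependencies can be illustrated on a plain scalar series. This sketch uses the classical sample auto-correlation estimator on a hypothetical sequence of per-timestamp interaction counts; TGFormer applies its mechanism to learned representations, and its actual computation is not described in the abstract.

```python
from typing import List, Sequence


def autocorrelation(series: Sequence[float], max_lag: int) -> List[float]:
    """Sample auto-correlation for lags 1..max_lag (direct O(n * max_lag) form)."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t + lag] for t in range(n - lag)) / denom
        for lag in range(1, max_lag + 1)
    ]


def dominant_period(series: Sequence[float], max_lag: int) -> int:
    """Lag in 1..max_lag with the highest auto-correlation."""
    acf = autocorrelation(series, max_lag)
    return 1 + max(range(len(acf)), key=acf.__getitem__)


# Hypothetical node that interacts in a burst every fourth timestamp.
interaction_counts = [3.0, 0.0, 0.0, 0.0] * 6
period = dominant_period(interaction_counts, max_lag=8)
```

Lags that are multiples of the true period score high, while the others score negative, so aggregating history at the top-scoring lags focuses on the timestamps most likely to repeat. For long sequences the same quantity is usually computed with an FFT rather than the direct sum shown here.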

[AI-114] OSDTW: Optimal Shared Depth and Task Weighting for Long-Tailed Recognition

Link: https://arxiv.org/abs/2605.24969
Authors: Chang Chu, Qingyue Zhang, Shao-Lun Huang, Junxiong Zheng
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: ICIC 2026 Oral

Click to view abstract

Abstract:Long-tailed recognition suffers from a persistent head–tail trade-off: improving tail performance often degrades head accuracy and can increase training instability. Despite strong empirical results from re-weighting, decoupled training, and multi-expert methods, key design choices about representation sharing between head and tail classes and supervision weighting across class groups remain largely heuristic. In this work, we propose OSDTW, a principled task-decomposition framework that partitions the original single-label recognition problem into a head task and a tail task, implemented with a shared encoder and task-specific decoders. To handle the mutual exclusivity and statistical dependence between the two label groups, we introduce a factorized model and show that the resulting Kullback–Leibler divergence-based generalization error can be written as the sum of task-wise terms up to an additive constant, yielding a well-defined task-wise objective. We further develop a three-stage training pipeline: independent task training to estimate task-wise optima and the Fisher information matrix, weighted joint training to learn a shared encoder, and branch assembly to construct the final decoupled model. Under a block-diagonal Fisher approximation, we derive a computable second-order expansion of the expected generalization error, decomposing it into encoder variance, encoder bias, and decoder variance. This bias–variance decomposition provides a computable proxy to select the shared depth and task weights, enabling efficient hyper-parameter search. Experiments on standard long-tailed benchmarks demonstrate the effectiveness of the proposed approach over strong baselines.

[AI-115] Towards Multi-Turn Dialog Systems for Industrial Asset Operations and Maintenance

Link: https://arxiv.org/abs/2605.24953
Authors: Chengrui Li, Rujing Li, Yitong Bai, Rui Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes:

Click to view abstract

Abstract:Industrial asset operations and maintenance question answering is inherently multi-turn, iterative, and highly dependent on external tool invocation. However, the conventional plan-execute single-agent architecture exhibits clear limitations in maintaining cross-turn context and reusing intermediate results. In this paper, we present a multi-turn dialog system designed for industrial scenarios based on a supervisor-specialist multi-agent architecture. To alleviate tool invocation bottlenecks, the system incorporates structured artifact reuse, dynamic replanning, and parallel tool execution. Evaluation results show that our system achieves better response quality than the baseline, with planning effectiveness increasing by 54.5% and task completion improving by 37.8%. System profiling further shows that cross-turn artifact reuse effectively reduces redundant tool invocation, decreasing the tool-time share from 47.3% to 26.3% and making turns 2-5 approximately 4.2x faster than the first turn.

[AI-116] APT-Agent: Automated Penetration Testing using Large Language Models

Link: https://arxiv.org/abs/2605.24949
Authors: William Guanting Li (1), Alsharif Abuadbba (2), Kristen Moore (2), Dan Dongseong Kim (1) ((1) University of Queensland, (2) CSIRO Data61)
Affiliations: University of Queensland; CSIRO Data61
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Notes: 11 pages, 8 figures

Click to view abstract

Abstract:Penetration testing is essential to securing modern web infrastructures, yet traditional manual methods struggle to keep pace with their scale and complexity. Large Language Models (LLMs) offer new opportunities for automating these tasks, but existing approaches face two persistent challenges: hallucination of technical entities and insufficient long-term contextual memory. To address these issues, we present APT-Agent, a fully automated LLM-driven penetration testing framework that systematically orchestrates reconnaissance, exploitation, and exfiltration. APT-Agent introduces a hybrid rectification module to recover hallucinated commands and a command-specific memory architecture to preserve operational context across multi-step attack sequences. We evaluate APT-Agent on Metasploitable 2 against seven vulnerable services spanning web, database, and network protocols. APT-Agent achieves an 84.29% end-to-end exploitation success rate, compared to 48.57% (Script Kiddie) and 18.57% (PentestGPT) under matched conditions. By reducing cognitive burden and minimizing reliance on human intervention, APT-Agent represents a step toward scalable, reliable, and cognitively efficient automation for penetration testing.

[AI-117] RealBench: Benchmarking Data-Driven Numerical Weather Forecasting Under Operational Conditions and Extreme Event Challenges

Link: https://arxiv.org/abs/2605.24945
Authors: Ruize Li,Zhibin Wen,Tao Han,Hao Chen,Fenghua Ling,Wei Zhang,Song Guo,Lei Bai
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments: 35 pages, 22 figures

Abstract:Accurate evaluation of weather forecasting models is critical for their reliable deployment in real-world applications. However, existing benchmarks predominantly rely on reanalysis products such as ERA5, which are generated through delayed data assimilation and do not reflect the constraints of real-time operational forecasting, thereby resulting in a systematic mismatch between benchmark performance and real-world forecasting. In this work, we introduce RealBench, a next-generation benchmark for AI weather forecasting that emphasizes realistic evaluation under operational conditions. RealBench features a strictly out-of-distribution test set spanning 2025 to eliminate data leakage and capture recent atmospheric regimes. It integrates multiple data sources, including low-latency operational analysis and a large-scale global in-situ observation dataset comprising over 10,000 stations, enabling direct evaluation against real atmospheric measurements. Beyond standard global metrics, RealBench provides a comprehensive evaluation framework for high-impact extreme events, including heatwaves, cold surges, and tropical cyclones, using event-specific metrics that better reflect real-world forecasting priorities. The evaluation results reveal substantial discrepancies between reanalysis-based metrics and real-world performance, particularly concerning extreme events. By highlighting the limitations of existing benchmarks, this work establishes a more faithful and operationally relevant evaluation paradigm, providing a rigorous foundation for advancing next-generation AI weather forecasting systems. The benchmark implementation is available at: this https URL.

[AI-118] Riemannian-Manifold Steering: Geometry-Aware Generative Autoencoders for Label-Free Steering

Link: https://arxiv.org/abs/2605.24942
Authors: Narmeen Oozeer,Shivam Raval,Philip Quirke,Manikandan Ravikiran,Jeff Phillips,Shriyash Upadhyay,Amirali Abdullah
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Steering a language model - intervening on its internal activations to change downstream behaviour - has recently expanded beyond linear interpolation to nonlinear methods such as angular and kernelized steering, which define intervention transformations without learning an explicit geometry over paths in activation space. Freshly introduced geometry-aware manifold methods do learn such a geometry, but require labelled class centroids together with prescribed cyclic or sequential structure. These assumptions restrict where manifold steering can be applied, since existing constructions require labelled centroids and compatible boundary conditions. We recast manifold steering more broadly as Riemannian geodesic computation on activation space, recovering linear and labelled-spline steering as geodesics under particular choices of metric. A principled metric within this framework is the output-space Hellinger distance pulled back to activations; we approximate this with a learned encoder trained on output distances over a small concept-token schema - no per-prompt labels, no topology prior, and no per-task curve fitting. Empirically, the method reliably drives the model onto the target class across all tasks in a standard four-task language-model arithmetic benchmark, while following more behaviourally natural trajectories than baselines on smaller output spaces. We thereby provide a unified Riemannian framework for manifold steering together with a schema-supervised, label-free instantiation that operates without labelled centroids or prescribed boundary conditions.
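
The Hellinger distance named in the abstract has a standard closed form for discrete distributions. The sketch below computes it for two probability vectors over the same vocabulary. The function name `hellinger` and the toy vectors are illustrative, and nothing here reproduces the paper's learned encoder or pull-back metric.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same support."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # H(p, q) = sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2), which lies in [0, 1].
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(0.5 * squared)

# Toy output distributions over a three-token vocabulary (made-up numbers).
p = [0.7, 0.2, 0.1]
q = [0.1, 0.2, 0.7]
print(hellinger(p, p))  # identical distributions
print(hellinger(p, q))  # distributions with swapped mass
```

The distance is bounded and symmetric, so it compares output distributions on a fixed scale regardless of vocabulary size.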

[AI-119] Energy Shields for Fairness

Link: https://arxiv.org/abs/2605.24926
Authors: Filip Cano,Thomas A. Henzinger,Konstantin Kueffner
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Runtime fairness is not a one-time constraint but a dynamic property evaluated over a sequence of decisions. To ensure fairness at runtime, it is necessary to account for past decisions, information neglected by conventional, static classifiers. Traditional fairness shields enforce runtime fairness abruptly, by intervening deterministically whenever a sequence of decisions violates the target for a running fairness measure. This motivates our main conceptual contribution: energy shields. An energy shield is a novel, lightweight, adaptive controller that monitors a sequence of decisions and intervenes probabilistically to ensure runtime fairness smoothly, by utilizing physics-inspired energy functions to nudge the sequence toward fairness: the more unfair the decisions, the stronger the nudging force becomes. This makes energy shields the first fairness shields to provide both short-term safety and long-term liveness guarantees. Safety ensures that the running fairness measure stays within a running target interval with high probability, and liveness ensures that the limit of the fairness measure lies within the limit target interval. Intuitively, the short-term specifies the tolerated fairness values and the long-term specifies the desired fairness values. We also provide a synthesis procedure for constructing the least intrusive energy shield for a given target specification, and demonstrate its efficiency experimentally. We evaluate our energy shields against existing fairness shields through the lens of short- and long-term fairness.
Related DOI: https://doi.org/10.1145/3805689.3806807; submitted Sun, 24 May 2026 08:10:17 UTC (v1)
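
To make the "the more unfair the decisions, the stronger the nudging force" idea concrete, here is a toy simulation in the spirit of the abstract. Everything specific is an assumption made for illustration: the quadratic energy, the `1 - exp(-energy)` intervention probability, the acceptance-rate gap as the fairness measure, and the two-group setup. It is not the paper's construction and carries none of its safety or liveness guarantees.

```python
import math
import random

def intervention_probability(measure, low, high, stiffness=200.0):
    """Probability of overriding the classifier given the running fairness measure.

    Inside [low, high] the (assumed) energy is zero; outside it grows
    quadratically with the distance to the interval.
    """
    distance = max(low - measure, 0.0, measure - high)
    energy = stiffness * distance ** 2
    return 1.0 - math.exp(-energy)

def run(steps, shielded, seed=0):
    """Simulate decisions and return the final acceptance-rate gap (group A minus B)."""
    rng = random.Random(seed)
    accept_rate = {"A": 0.8, "B": 0.4}  # biased base classifier (made-up rates)
    seen = {"A": 0, "B": 0}
    accepted = {"A": 0, "B": 0}
    low, high = -0.1, 0.1               # target interval for the gap
    for _ in range(steps):
        group = rng.choice(["A", "B"])
        decision = rng.random() < accept_rate[group]
        if shielded and seen["A"] and seen["B"]:
            gap = accepted["A"] / seen["A"] - accepted["B"] / seen["B"]
            if rng.random() < intervention_probability(gap, low, high):
                # Nudge toward the interval: favour whichever group is behind.
                favoured = "B" if gap > high else "A"
                decision = (group == favoured)
        seen[group] += 1
        accepted[group] += decision
    return accepted["A"] / seen["A"] - accepted["B"] / seen["B"]

print(round(run(5000, shielded=False), 3))
print(round(run(5000, shielded=True), 3))
```

Because the intervention probability rises smoothly from zero at the interval boundary, the shield acts rarely when the measure is only slightly off target and often when it is far off, in contrast to a deterministic shield that always intervenes once the target is violated.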

[AI-120] Quaternion Self-Attention with Shared Scores ICML2026

Link: https://arxiv.org/abs/2605.24920
Authors: Shogo Yamauchi,Tohru Nitta,Hideaki Tamori
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 26 pages, 6 figures and 15 tables. Accepted at ICML2026

Abstract:Quaternion neural networks are parameter-efficient and model multidimensional dependencies by representing four related features as a single entity. However, existing quaternion self-attention computes component-wise scores and applies independent softmax operations to each component, which increases the computational cost and allows attention distributions to diverge across components. We propose a shared-score quaternion self-attention mechanism that computes a single real-valued score using the quaternion inner product and applies a shared attention distribution across all components. This reduces score-computation multiplications by 75% and the number of softmax operations from four to one. We prove that, when queries and keys are produced by quaternion linear projections that induce component pre-mixing, the component-wise and shared scores lie in the same interaction subspace, indicating that independent component-wise attention primarily re-parameterizes the same interactions rather than expanding the feature interaction space. In speech enhancement, our method reduces inference time by up to 44.3% on a GPU and 58.1% on a CPU while maintaining quality, with consistent trends across vision and natural language processing.
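
A minimal pure-Python sketch of the shared-score idea follows: one real-valued score per query-key pair from the quaternion inner product, one softmax, and the same weights applied to all four components of the values. The data layout (each token as a list of `(r, i, j, k)` tuples), the `sqrt(4 * dim)` scaling, and the function names are assumptions for this example; projections, heads, and masking are omitted.

```python
import math

def quaternion_inner(p, q):
    """Real-valued inner product of two quaternions given as (r, i, j, k) tuples."""
    return sum(a * b for a, b in zip(p, q))

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def shared_score_attention(queries, keys, values):
    """queries/keys/values: one entry per token, each a list of quaternions."""
    dim = len(queries[0])
    scale = math.sqrt(4 * dim)  # assumed scaling by the real dimensionality
    outputs = []
    for q_tok in queries:
        # One real score per key: inner products summed over quaternion features.
        scores = [sum(quaternion_inner(q, k) for q, k in zip(q_tok, k_tok)) / scale
                  for k_tok in keys]
        weights = softmax(scores)  # a single softmax shared by all four components
        out = [tuple(sum(w * v_tok[d][c] for w, v_tok in zip(weights, values))
                     for c in range(4))
               for d in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Two identical keys give uniform weights, so the output is the mean of the values.
queries = [[(1.0, 0.0, 0.0, 0.0)]]
keys = [[(1.0, 0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0, 0.0)]]
values = [[(1.0, 2.0, 3.0, 4.0)], [(3.0, 4.0, 5.0, 6.0)]]
print(shared_score_attention(queries, keys, values))
```

The real inner product needs 4 multiplications per quaternion pair, whereas a full Hamilton product needs 16. That ratio is consistent with the 75% reduction quoted in the abstract, although the paper's exact accounting may differ.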

[AI-121] Explainable Retinal Imaging for Prediction of Multi-Organ Dysfunction in Type 2 Diabetes

Link: https://arxiv.org/abs/2605.24912
Authors: Mini Han Wang,Liting Huang,Wei Hong,Boonthawan Wingwon
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Other Quantitative Biology (q-bio.OT)
Comments: 15 pages, 8 figures

Abstract:Background: Type 2 diabetes mellitus (T2DM) is increasingly recognised as a systemic disease characterised by coordinated dysfunction across metabolic, renal, lipid, and inflammatory pathways. Existing clinical assessments often fail to capture this multi-dimensional burden. Methods: We conducted a retrospective study of 1,195 patients using routinely collected laboratory biomarkers. System-level abnormality indices were constructed to quantify organ-specific dysfunction, and multi-system involvement was defined as abnormalities in two or more systems. Supervised machine learning models, including logistic regression, random forest, and gradient boosting, were trained to predict multi-system dysregulation. Model interpretability was achieved using SHapley Additive exPlanations (SHAP). Results: The gradient boosting model demonstrated near-perfect discrimination (AUC = 1.000), significantly outperforming logistic regression (AUC = 0.925). Feature attribution analysis revealed that hyperglycaemia, renal impairment, dyslipidaemia, and inflammation were the dominant drivers of multi-system risk. Dose-response relationships observed in partial dependence analyses further supported the biological plausibility of model predictions. Conclusion: This study presents an interpretable, data-driven framework for quantifying systemic disease burden in T2DM. By linking routine biomarkers to multi-organ dysfunction, our approach provides both predictive accuracy and mechanistic insight, offering potential for improved risk stratification and precision medicine in diabetes care. 
The data and code used in this study are openly available on GitHub at: this https URL
MSC classes: 97P80; ACM classes: J.3; submitted Sun, 24 May 2026 07:31:57 UTC (v1)

[AI-122] Factorize to Generalize: Retrieval-Guided Invariant-Dynamic Decomposition for Time Series Forecasting

Link: https://arxiv.org/abs/2605.24911
Authors: Jinjin Chi,Lei Feng,Lulu Zhang,Yongcheng Jing,Yiming Wang,Ximing Li,Jialie Shen,Leszek Rutkowski,Dacheng Tao
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Time series foundation models (TSFMs) have recently achieved strong zero-shot forecasting performance through large-scale pretraining and retrieval-augmented prediction. However, our empirical analysis reveals a non-trivial limitation of retrieval-based forecasting: retrieval tends to induce more oscillatory predictions, improving performance on highly fluctuating series while degrading accuracy on smoother, trend-dominated ones. This suggests that retrieved information may be fused into prediction without explicitly distinguishing stable temporal structure from instance-specific variations, which can reduce robustness under distribution shifts. We propose a Retrieval-guided Invariant-Dynamic DEcomposition framework for time series forecasting. Rather than using retrieval as auxiliary predictive context, we leverage retrieved sequences as implicit samples from related environments to guide representation decomposition. Specifically, we first construct a retrieval-aware representation via attention-based aggregation, and then introduce a retrieval-guided routing mechanism to decompose it into an invariant component capturing stable shared structure and a dynamic component modeling context-dependent variations. These two components are forecast separately and fused for final prediction, enabling the model to preserve transferable patterns while remaining adaptive to evolving dynamics. We further design training objectives that encourage invariant learning and disentanglement, and provide theoretical insight showing that retrieval aggregation reduces variance and approximates invariant representation learning without explicit environment supervision. Extensive experiments demonstrate that our method consistently improves robustness under distribution shifts and outperforms existing TSFMs and retrieval-based baselines in zero-shot forecasting settings.

[AI-123] Noise-Robust Financial Numerical Entity Attribute Tagging

Link: https://arxiv.org/abs/2605.24910
Authors: Hsin-Min Lu,Chen-Yang Lai,Yi-Jhen Li,Ju-Chun Yen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments:

Abstract:Financial Numerical Entity (FNE) understanding aims to recover the meaning of numerical mentions in financial reports. Existing studies primarily focus on concept name prediction and face two important limitations. First, labels derived from inline XBRL may contain errors because filings are usually prepared manually. Second, other important FNE attributes, such as reporting-time relation, measurement scale, and accounting sign, are less emphasized. We propose NOise-Robust Tagging for Rich Financial Numerical Entity Attributes (NORA) to address these gaps. NORA uses task-aware instance-specific weighting to attenuate the influence of noisy labels during training, and we further propose the Neighborhood Prior-adjusted KNN (NPK) filtering method for more reliable evaluation on real-world noisy test sets. In addition, we construct a large-scale benchmark containing 6.6 million instances with multi-attribute labels and filing metadata. Experiments show that NORA performs strongly compared with state-of-the-art noisy-label baselines, including Co-teaching, Mixup, SSR, and SelfMix. Moreover, NORA is robust under both unfiltered and noise-filtered test settings. It achieves the best Accuracy, Macro F1, and Weighted F1 for concept name and time-relation prediction, while remaining competitive on scale and sign prediction. These results demonstrate the value of jointly modeling rich FNE attributes while accounting for label noise in real-world financial filings.

[AI-124] On the Impact of Class Imbalance on the Learning Dynamics of Deep Neural Networks: An Intuitive Insight

Link: https://arxiv.org/abs/2605.24908
Authors: Ismail B. Mustapha,Shafaatunnur Hasan,Sunday O. Olatunji,Hatem S. Y. Nabus
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Conference

Abstract:Class imbalance in deep neural networks (DNNs) has attracted rapidly increasing research attention in recent years. However, the varying accounts in the pertinent literature of why DNNs perform poorly on imbalanced data show that little is known about how this age-old phenomenon affects the performance of DNNs. A better understanding of this problem is crucial to developing effective DNN-based imbalance methods. Thus, this study systematically investigates the impact of class imbalance on the learning dynamics of DNNs by monitoring the learning pattern of DNN models on both the majority and minority classes of datasets with varying imbalance ratios. Experimental findings show that, in contrast to learning from balanced datasets, where a DNN learns the classes similarly, class imbalance severely degrades DNN performance, driving the model to underfit the minority class samples in the early training epochs while learning only the majority class. Although the DNN ultimately learns the minority samples, learning in this manner yields minority representations that do not generalize at test time because they are merely overfitted to keep the overall training loss as low as possible.
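
The monitoring idea in the abstract (tracking learning separately on majority and minority classes) can be illustrated on a toy problem. The sketch below trains a one-feature logistic regression with plain gradient descent on a made-up 95:5 dataset and records the mean loss per class after each epoch. The model, data, and hyperparameters are assumptions for illustration and do not reproduce the paper's DNN experiments.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def per_class_losses(w, b, data):
    """Mean binary cross-entropy computed separately for each class label."""
    totals, counts = {0: 0.0, 1: 0.0}, {0: 0, 1: 0}
    for x, y in data:
        p = sigmoid(w * x + b)
        totals[y] += -math.log(p if y == 1 else 1.0 - p)
        counts[y] += 1
    return {label: totals[label] / counts[label] for label in totals}

def train(data, epochs=20, lr=1.0):
    """Full-batch gradient descent on the mean loss; returns per-class loss history."""
    w, b, history = 0.0, 0.0, []
    for _ in range(epochs):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in data) / len(data)
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in data) / len(data)
        w, b = w - lr * grad_w, b - lr * grad_b
        history.append(per_class_losses(w, b, data))
    return history

# Toy 95:5 imbalanced dataset: class 0 is the majority, class 1 the minority.
data = [(-1.0, 0)] * 95 + [(1.0, 1)] * 5
history = train(data)
for epoch, losses in enumerate(history[:3], start=1):
    print(epoch, round(losses[0], 3), round(losses[1], 3))
```

Because the averaged gradient is dominated by the 95 majority samples, the first update lowers the majority loss far more than the minority loss. An aggregate training loss would hide this asymmetry, which is why the curves are tracked per class.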

[AI-125] ProActor: Timing-Aware Reinforcement Learning for Proactive Task Scheduling Agents ACL2026

Link: https://arxiv.org/abs/2605.24900
Authors: Lei Ding,Bin He,Chenguang Wang,Yang Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 47 pages, 31 figures. Accepted to ACL 2026

Abstract:Proactive task-oriented agents must autonomously anticipate user needs, identify actionable opportunities, and trigger software actions at appropriate moments - fundamentally shifting from reactive systems that await explicit instructions. However, existing approaches lack generalizable end-to-end solutions for measuring and optimizing such anticipatory behaviors. This paper introduces ProActor, a unified framework for conversational task scheduling that integrates: (1) a domain-agnostic automated annotation methodology that enables scalable proactiveness reinforcement learning (RL) by generating full opportunity time windows instead of rigid point labels, (2) systematic proactiveness metrics capturing both timing quality and reference action alignment, and (3) RL optimization using GRPO with various reward designs. Our insight is that RULER-based rewards with proactiveness rubrics are crucial for improving timing quality, and that proactiveness optimization enabled by stage-aware composite rewards is key to balancing timing quality and reference action alignment. Timing-aware RL requires extensive exploration, demanding efficient infrastructure. We develop ART-F, an adaptive framework combining request-adaptive inference clusters with DDP-based training on single-node multi-GPU systems, enabling LoRA training of 4-bit Qwen2.5-14B-ProActor-Q4 with 4-8x speedups. Experiments on two newly auto-annotated datasets demonstrate significant improvements in proactive timing while maintaining action consistency comparable to state-of-the-art (SOTA) baselines. Ablations validate the effectiveness of distinct composite reward variations.

[AI-126] aBIIC2: Interactive Building of Ontological Taxonomies using Weighted Self-Organizing Maps

Link: https://arxiv.org/abs/2605.24899
Authors: Mathieu d’Aquin
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Ontologies represent the conceptual knowledge of a domain. At the core of an ontology is the taxonomy of concepts and subconcepts that represent specific entities, which can be complex to build. In many cases, information is available in the form of records describing the characteristics of relevant entities, i.e., tabular data. Identifying patterns and similarities in such data can serve as a basis for identifying concepts and organizing them. However, doing so manually can be challenging, and purely automatic approaches, such as agglomerative clustering or relying on a large language model to analyze the data, can leave the user with overwhelming results and little control. In this paper, we describe a tool that enables the progressive and interactive construction of a taxonomy of concepts by identifying clusters as well as their intensional definitions. To do so, we rely on weighted self-organizing maps as a clustering method because they enable the creation of an arbitrary number of clusters that are distinct with respect to the distributions of values of specific characteristics of the clustered entities. We show that, by integrating this mechanism and others for rapidly creating concepts that group together instances from tabular data, this tool represents a middle ground between purely manual analysis and automatic methods for building ontological taxonomies.

[AI-127] Inverting the Shield: Systematically Generating Safety Tests from Policy Specifications ACL2026

Link: https://arxiv.org/abs/2605.24883
Authors: Xiaoyue Lu,Xianglin Yang,Haijun Liu,Jiahao Liu,Kuntai Cai,Yan Xiao,Jin Song Dong
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Software Engineering (cs.SE)
Comments: Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

Abstract:The widespread integration of Large Language Models (LLMs) necessitates rigorous and systematic safety evaluation. Existing paradigms either rely on constructed benchmarks to assess safety from predefined perspectives, or employ dynamic red-teaming to probe potential vulnerabilities. While effective, these approaches face challenges, as they depend heavily on expert domain knowledge, offer limited systematic guarantees, and are vulnerable to rapid obsolescence. To address these limitations, we introduce a novel framework POLARIS that brings the rigor of specification-based software testing to AI safety. POLARIS first compiles unstructured natural-language policies into First-Order Logic (FOL) representations, establishing a traceable link between high-level rules and concrete test cases. This formalization enables the construction of a Semantic Policy Graph, where complex policy violation scenarios are encoded as traversable paths. By systematically exploring this graph, POLARIS uncovers compositional violation patterns, which are then instantiated into executable natural-language test queries, enabling coverage-driven and reproducible safety testing. Experiments demonstrate that POLARIS achieves higher policy coverage and attack success counts compared to established baselines. Crucially, by bridging formal methods and AI safety, POLARIS provides a principled, automated approach to ensuring LLMs adhere to safety-critical policies with verifiable traceability. We release our code at this https URL.

[AI-128] DBPnet: Damper Characteristics-Based Bayesian Physics-Informed Neural Network for Wheel Load Estimation

Link: https://arxiv.org/abs/2605.24860
Authors: Tianyi Wang,Tianyi Zeng,Zimo Zeng,Feiyang Zhang,Yujin Wang,Xiangyu Li,Yiming Xu,Sikai Chen,Junfeng Jiao,Christian Claudel,Xinbo Chen
Affiliations: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 14 pages, 12 figures, 6 tables

Abstract:Advanced driver assistance systems (ADAS) play an important role in modern automotive intelligence, significantly enhancing vehicle safety and stability. The performance of ADAS critically relies on accurate and reliable vehicle state estimation, particularly from vehicle dynamic sensors. Among these signals, wheel load is a key variable for chassis control and safety-critical functions, yet it remains difficult to estimate robustly due to complex suspension geometry, nonlinear dynamics, and measurement noise. To address this issue, we propose DBPnet, a Bayesian physics-informed neural network (PINN) with a physics-aware embedding module inspired by damper characteristics. First, this paper presents a suspension linkage-level modeling (SLLM) approach that constructs a nonlinear instantaneous dynamic model by explicitly considering the complex geometric structure of the suspension. Building upon SLLM, Bayesian inference is integrated into the PINN to effectively cope with noise and uncertainty in the vehicle chassis system, thereby improving the model’s robustness. Then, a physics-informed loss function is employed to ensure consistency with fundamental physical principles, while the damper characteristics-inspired embedding module extracts temporal variation features of input signals and incorporates them into each layer of the PINN, ensuring that physical observations guide the neural network without being constrained by fixed physical models. Extensive evaluations on high-fidelity simulations and real-world experiments demonstrate that our DBPnet consistently achieves lower RMSE and MaxError than baseline methods. These results highlight the potential of our DBPnet to advance wheel load estimation and contribute to the development of more reliable ADAS actuator functions.

[AI-129] The Concept Allocation Zone: Tracking How Concepts Form Across Transformer Depth

Link: https://arxiv.org/abs/2605.24856
Authors: James Henry
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 34 models, 8 architectural families, 7 concepts. Companion papers: GEM (arXiv forthcoming), CAZ Validation (arXiv forthcoming), PRH Validation (arXiv forthcoming). Code: this https URL

Abstract:Concept formation in transformer language models is depth-extended, not a single-layer event: concepts emerge gradually across a contiguous region of the residual stream. Mechanistic interpretability methods identify the single layer of peak class separation – the “best layer” – capturing a snapshot rather than the process itself. We introduce the Concept Allocation Zone (CAZ): the depth interval within which a concept becomes measurably separable, the region allocated to its geometric expression. We formalize the CAZ through three layer-wise metrics (Separation, Concept Coherence, Concept Velocity) and derive principled boundary detection without manual layer sweeps. A CAZ is not a concept: it is the depth region within which the model organizes its geometry to make a concept separable. A single concept typically participates in multiple CAZes; multiple concepts may share one. Empirical validation across 34 models from 8 architectural families and 7 concepts reveals that the separation curve S(l) is frequently multimodal. A scored detector uncovers “gentle CAZes” – subtle allocation regions invisible to standard peak detection but causally active in 93-100% of cases under ablation (16 of 34 models; 26 in the companion validation paper). The framework generates seven testable predictions; four yield clear verdicts (two not supported, one partially supported, one supported), one had its precondition invalidated by the data, and two are underpowered – with cross-architecture alignment confirmed as depth-matched rather than monolithic under leave-one-concept-out cross-validation. Reference implementation: rosetta_tools v1.3.1 (doi:https://doi.org/10.5281/zenodo.20361433).

[AI-130] Tiny Brains Giant Impact: Uncovering the Keystone Neurons of LLM with Just a Few Prompts

Link: https://arxiv.org/abs/2605.24846
Authors: Xiangtian Ji,Yuxin Chen,Zhengzhou Cai,Xiang Wang,An Zhang,Tat-Seng Chua
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) display strong comprehensive abilities, yet the internal mechanisms that support these behaviors remain insufficiently understood. In this work, we show that across a wide range of open-weight Transformers, a subset of neurons remains consistently highly activated during inference across tasks of multiple capability dimensions. By probing along the cross-task activation strength, an extremely sparse subset is isolated, whose removal causes a collapse in model behavior, which we term keystone neurons. Our analysis reveals that keystone neurons are a stable and intrinsic neuron subset of the model that is largely established during pretraining. The parameters associated with these neurons are tightly calibrated during the training process, and their precise values are critical for the capabilities of the model. Building on these insights, we propose a supervised fine-tuning approach that updates only keystone neurons, achieving task gains comparable to or even better than full-parameter fine-tuning while better preserving performance in other capability dimensions, despite modifying a much smaller number of parameters.

[AI-131] Solving Combinatorial Counting Problems with Weighted First-Order Model Counting

Link: https://arxiv.org/abs/2605.24845
Authors: Yuanhong Wang, Juhua Pu, Yuxu Zhou, Yuyi Wang, Ondřej Kuželka
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Combinatorics (math.CO)
Comments: 47 pages, 9 figures

Abstract:Combinatorial counting problems pervade artificial intelligence, statistics, and discrete mathematics. Whether the task is enumerating subsets, multisets, permutations, partitions, or compositions under structural and arithmetic constraints, solving it remains a stubbornly manual exercise. Closed-form derivations are powerful but brittle, while naive encodings to propositional model counting or constraint satisfaction destroy the exchangeability that makes counting tractable in the first place. We present Cofola (COmbinatorial counting LAnguage with First-Order logic), a typed declarative language whose primitives are the combinatorial objects that recur in everyday counting questions, including sets, bags, tuples, sequences, circles, partitions, and compositions, together with natural relational and arithmetic constraints over them. A denotational semantics maps every Cofola program to a well-defined combinatorial counting problem, and a three-phase compilation pipeline (preprocessing, decomposition, and symmetry-preserving encoding) reduces this problem to a weighted first-order model counting (WFOMC) instance augmented with coefficient-extraction constraints. To stay inside known domain-liftable fragments whenever possible, the encoding groups indistinguishable entities, breaks the symmetry of unordered groupings lexicographically, and encodes sequences and circles via order axioms. On a suite of representative combinatorial counting problems, ranging from textbook math problems to multi-object scenarios that the closest prior framework cannot express, Cofola produces concise specifications and a uniform solving pipeline that is practical end-to-end.

[AI-132] Reflect-Guard: Enhancing LLM Safeguards against Adversarial Prompts via Logical Self-Reflection

Link: https://arxiv.org/abs/2605.24834
Authors: Lixing Lin, Juli You, Yue Li, Luyun Lin, Yiqing Wang, Zhen Zhang, Moxuan Zheng
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 12 pages, 2 figures, and 4 tables

Abstract:Large language model (LLM) safety classifiers such as Llama Guard are effective at detecting overtly harmful prompts but remain vulnerable to adversarial jailbreak attacks that disguise malicious intent through role-play scenarios, fictional framing, and indirect requests. We present Reflect-Guard, a method that augments LLM-based safety classifiers with chain-of-thought self-reflection capabilities through parameter-efficient fine-tuning. Our approach distills analytical reasoning from GPT-4o-mini into structured reflection annotations, then trains Llama-Guard-3-8B via QLoRA to generate logical self-reflections before issuing safety verdicts. Using only 1000 training examples and updating just 0.5% of model parameters (~42M), Reflect-Guard achieves substantial improvements on two challenging benchmarks. On WildGuardTest, F1 score improves from 0.770 to 0.842 (+7.2 pp), with recall on adversarial prompts increasing from 0.513 to 0.921 (+40.8 pp). On JailbreakBench, the attack success rate drops from 10.3% to 1.8%, representing an 82.5% relative reduction. These gains are especially pronounced on adversarial inputs, where the explicit reasoning step enables the model to see through obfuscation techniques that defeat standard pattern-matching approaches. Our results demonstrate that teaching safety classifiers to reason about adversarial intent, rather than simply classify surface patterns, is a promising direction for robust LLM safety.
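The headline figures in this abstract are simple arithmetic on the quoted before/after values. The sketch below recomputes them; the helper names `pp_gain` and `relative_reduction` are illustrative and do not come from the paper.

```python
def pp_gain(before: float, after: float) -> float:
    """Absolute change between two rates in [0, 1], in percentage points."""
    return round((after - before) * 100, 1)


def relative_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    return round((before - after) / before * 100, 1)


# Values quoted in the abstract above.
f1_gain = pp_gain(0.770, 0.842)            # WildGuardTest F1
recall_gain = pp_gain(0.513, 0.921)        # recall on adversarial prompts
asr_drop = relative_reduction(10.3, 1.8)   # JailbreakBench attack success rate (%)

print(f1_gain, recall_gain, asr_drop)
```

The distinction matters when reading the abstract: the F1 and recall gains are absolute differences in percentage points, while the attack-success-rate figure is a relative reduction.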

[AI-133] Test-Time Deep Thinking to Explore Implicit Rules

Link: https://arxiv.org/abs/2605.24828
Authors: Wentong Chen, Xin Cong, Zhong Zhang, Yaxi Lu, Siyuan Zhao, Yesai Wu, Qinyu Luo, Haotian Chen, Yankai Lin, Zhiyuan Liu, Maosong Sun
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:With the continuous advancement of Large Language Models (LLMs), intelligent agents are becoming increasingly vital. However, these agents often fail in environments governed by implicit rules – hidden constraints that cannot be observed directly and must be inferred through interaction. This causes agents to fall into repetitive trial-and-error loops, ultimately leading to task failure. To address this challenge, we propose Test-Time Exploration (TTExplore), a framework where a thinker component analyzes interaction history to infer these implicit rules and guide an actor. Effective exploration in this setting critically depends on the reasoning ability of the thinker. However, evaluating deep reasoning trajectories is inherently unstable and difficult, which poses a major obstacle to effective training. To overcome this issue, we introduce a novel and stable reinforcement learning pipeline. The core idea is to use accurate task-level scores as indirect rewards to bypass the difficulty of evaluating intermediate reasoning, and to retain only a single thinking node per trajectory to alleviate reward sparsity. Using this pipeline, we train a specialized 7B model, Exp-Thinker. Experiments on five text-based embodied tasks show that TTExplore equipped with Exp-Thinker improves baseline agent performance by an average of 14–19 points, demonstrating the effectiveness of explicitly reasoning about implicit rules.

[AI-134] Agent Manufacturing: Foundation-Model Agents as First-Class Industrial Entities

Link: https://arxiv.org/abs/2605.24823
Authors: Yilei Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Manufacturing has passed through four widely recognized paradigms - mechanization, electrification, programmable automation, and Smart Manufacturing - each defined by the kind of work it shifted from humans to machines. In every case, one layer of industrial work remained fundamentally human: the coordinative cognition of production, comprising the interpretive, allocative, diagnostic, negotiative, and governance work exercised by engineers, planners, and operational managers. We argue that a fifth transition is now underway in which this layer, rather than the physical or routine-cognitive layers below it, is what foundation-model-based autonomous agents primarily redistribute. We name this paradigm Agent Manufacturing and define it operationally: a manufacturing system is an instance of Agent Manufacturing when its principal coordination mechanism is reasoning performed by foundation-model agents that can interpret open-ended goals, plan over long horizons, invoke tools and machines, and negotiate with other agents and humans. This is a narrower and more falsifiable definition than the existing literature on cognitive manufacturing or Industry 5.0 provides, and it distinguishes the paradigm sharply from classical multi-agent manufacturing systems, which were autonomous only within closed protocol spaces.

[AI-135] CoRe-Code: Collaborative Reinforcement Learning for Code Generation

Link: https://arxiv.org/abs/2605.24812
Authors: Zhihao Dou, Qinjian Zhao, Zhongwei Wan, Xiaoyu Xia, Sumon Biswas
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) have achieved strong performance in code generation, but most methods rely on autoregressive decoding without global planning, often leading to locally coherent yet globally suboptimal solutions (e.g., failing test cases or inefficient complexity). While recent approaches such as Chain-of-Thought (CoT) and multi-agent systems (MAS) introduce planning, their limited role specialization and coordination hinder performance on complex tasks. To address the challenges of coordination and specialization in multi-agent code generation, we propose Collaborative Reinforcement Code (CoRe-Code), a framework for role specialized LLM agents that enhances inter-agent coordination to generate more accurate and efficient code. CoRe-Code adopts a simple Planner-Coder paradigm, where the Planner produces high-level plans and the Coder executes them to generate code. We further introduce a collaboration-aware reinforcement learning stage based on Group Relative Policy Optimization (GRPO) to enhance role specialization and alignment. Experiments show that CoRe-Code outperforms a wide range of existing RL-based and multi-agent methods. In addition, we demonstrate that CoRe-Code can generalize to other multi-agent frameworks (e.g., Retrieval and Debugging agents), highlighting its flexibility and scalability. We evaluate CoRe-Code on multiple benchmarks of varying difficulty using three base models. Compared to existing baselines, the results show consistent improvements in accuracy, while also achieving higher efficiency in terms of execution time and memory usage, demonstrating the effectiveness and practicality of CoRe-Code.
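For readers unfamiliar with GRPO, the core step is to score several sampled completions for the same prompt and normalize each reward against its own group, which removes the need for a learned value baseline. The sketch below shows only that standard normalization; how CoRe-Code defines rewards for the Planner and Coder roles is not stated in the abstract, so the reward values here are hypothetical.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: z-score each reward within its sample group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


# Hypothetical rewards for four completions sampled from one prompt,
# e.g. the fraction of unit tests each generated program passes.
rewards = [1.0, 0.0, 0.5, 1.0]
adv = group_relative_advantages(rewards)
print(adv)

# When every sample in the group gets the same reward, the advantages are all
# zero, so that group contributes no policy-gradient signal.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

Completions that beat their group average get a positive advantage and are reinforced; those below it are pushed down.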

[AI-136] Cross-Domain Energy-Guided Diffusion Generation for Off-Dynamics Reinforcement Learning

Link: https://arxiv.org/abs/2605.24810
Authors: Yu Yang, Yihong Guo, Anqi Liu, Pan Xu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Applications (stat.AP)
Comments: 29 pages, 3 figures, and 14 tables

Abstract:Off-dynamics offline reinforcement learning seeks to learn a target-domain policy from a large source dataset and a limited target dataset under mismatched transition dynamics. Existing approaches such as reward augmentation and data filtering are constrained to the source dataset and cannot synthesize new target behavior to improve coverage beyond the collected source trajectories. While recent model-based methods attempt to address this by learning target-aware dynamics, the generated experience is constructed only at the transition level, which leads to accumulated errors over long horizons. These limitations necessitate a shift toward trajectory-level generation for off-dynamics offline RL. We propose CEDGE, a Cross-domain Energy-guided Diffusion GEneration framework. CEDGE trains a trajectory diffusion model on source-domain trajectories and adapts the generated samples to the target domain through energy guidance. This guidance is derived by minimizing the distribution mismatch between the source and desired target-domain trajectories and is decomposed into return, domain, and behavior energy components. The resulting energy-guided trajectories are useful both for direct planning and as synthetic data for policy learning. Since target adaptation is achieved via energy guidance rather than retraining the diffusion model, CEDGE can be efficiently adapted to new target dynamics compared to previous methods. Experiments on the ODRL benchmark demonstrate that trajectory-level energy-guided generation improves diffusion planning under dynamics shifts and produces synthetic data that improves downstream target policy learning.

[AI-137] Disentangled Double Machine Learning for Accurate Causal Effect Estimation

Link: https://arxiv.org/abs/2605.24808
Authors: Guodu Xiang, Kui Yu, Yujie Wang, Richang Hong, Fuyuan Cao, Jiye Liang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 9 figures

Abstract:Confounding bias is a key challenge in causal effect estimation from observational data. Double Machine Learning (DML) addresses this issue by estimating treatment and outcome nuisance functions, constructing treatment and outcome residuals, and estimating causal effects from the residuals. However, DML often produces biased and unstable estimates in high-dimensional or finite-sample scenarios. One reason is that DML estimates nuisance functions using all covariates without disentangling distinct latent factors, resulting in unreliable nuisance function estimation. Another is that imprecise nuisance estimation further introduces residual dependence between the treatment residual and the remaining outcome error, undermining the accuracy of causal effect estimates. To address these issues, we propose Disentangled Double Machine Learning (DDML), a novel algorithm that integrates two key strategies. First, a causal role disentanglement strategy decomposes covariates into confounders, treatment-specific factors, and outcome-specific factors to enable reliable nuisance function estimation. Second, a residual dependence orthogonalization strategy mitigates residual dependence caused by nuisance estimation errors to enhance the precision of causal effect estimates. Experimental results on synthetic, semi-synthetic, and real-world datasets demonstrate that DDML significantly outperforms 13 state-of-the-art baseline algorithms in both MAE and RMSE.
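The baseline DML recipe described in the first sentences (fit nuisance functions, form residuals, regress residual on residual) can be sketched in a few lines. This is the classic partially linear DML estimator with cross-fitting, not the DDML method proposed in the paper; the linear nuisance models, the synthetic data, and all function names are assumptions chosen to keep the example self-contained.

```python
import numpy as np


def ols_fit_predict(X_train, y_train, X_test):
    """Least-squares fit with an intercept on the training fold, predict on the test fold."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef


def dml_partial_linear(X, T, Y, n_folds=2, seed=0):
    """Cross-fitted residual-on-residual estimate of the effect of T on Y."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(Y)), n_folds)
    t_res = np.empty(len(Y))
    y_res = np.empty(len(Y))
    for k, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        # Nuisance functions E[T|X] and E[Y|X] are fit on the other fold(s).
        t_res[test_idx] = T[test_idx] - ols_fit_predict(X[train_idx], T[train_idx], X[test_idx])
        y_res[test_idx] = Y[test_idx] - ols_fit_predict(X[train_idx], Y[train_idx], X[test_idx])
    # Final stage: regress the outcome residual on the treatment residual.
    return float(t_res @ y_res / (t_res @ t_res))


# Synthetic data with a known treatment effect of 2.0 and confounding through X.
rng = np.random.default_rng(42)
n = 2000
X = rng.normal(size=(n, 3))
T = X @ np.array([1.0, -0.5, 0.3]) + rng.normal(size=n)
Y = 2.0 * T + X @ np.array([0.7, 0.2, -1.0]) + rng.normal(size=n)

theta_hat = dml_partial_linear(X, T, Y)
print(theta_hat)
```

In this toy setting the nuisance models are correctly specified, so the estimate lands close to the true effect. The failure modes the abstract targets arise when the nuisance fits are poor, which leaves dependence between the two residuals.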

[AI-138] Zero-Shot Parkinsons Disease Detection from Speech: Comparing Large Audio and Language Models

Link: https://arxiv.org/abs/2605.24806
Authors: Muhammad Ashad Kabir, Sirajam Munira
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 6 pages

Abstract:Large audio and language models have recently demonstrated zero-shot reasoning capabilities across various domains. However, it remains unclear how the form of audio input, whether handcrafted acoustic features extracted from speech or the raw audio waveform itself, affects performance for Parkinson’s disease (PD) detection across different languages. In this study, we systematically compare two input modalities for zero-shot PD detection: (i) handcrafted acoustic features extracted from speech recordings analyzed by a general-purpose LLM, and (ii) direct waveform input analyzed by audio-capable models. Experiments on PD speech datasets in four languages show that performance varies across input modalities, speech tasks, and languages. Handcrafted acoustic features provide more stable performance in a low-resource language (e.g., Bengali), whereas audio input yields dataset-dependent gains. These findings highlight the impact of input modality on zero-shot PD detection from speech.

[AI-139] CONF-KV: Confidence-Aware KV Cache Eviction with Mixed-Precision Storage for Long-Horizon LLM

Link: https://arxiv.org/abs/2605.24786
Authors: Yubo Li, Yidi Miao
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Long-horizon LLM inference turns the key–value (KV) cache into the dominant GPU memory consumer and makes per-token attention increasingly expensive. Many common eviction policies use static recency windows or historical attention, leaving unused a signal computed on every decoding step: the model’s current uncertainty. We introduce CONF-KV, a KV-cache manager that converts the next-token distribution into a scalar confidence score and uses it to choose the per-step cache budget, retaining more context when the model is uncertain and pruning aggressively when it is confident. Within each budget, tokens are ranked by a composite of accumulated attention mass and recency, while a protected recent window preserves local coherence. We combine the policy with blockwise online-softmax attention, mixed FP16/INT8 storage, and a pyramidal per-layer budget variant. Across four model families and generated lengths up to 4K, CONF-KV stays near the footprint of a fixed 512-token sliding window while remaining within 1.5–2.1 perplexity points of full KV. On Needle-in-a-Haystack up to 32K tokens, CONF-KV reaches 91.4% retrieval accuracy versus 53.8% for sliding windows and 80.6% for H2O; on 75 VisualWebArena tasks it retains 95.3% of full-KV success at 2.8 times lower peak memory.
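The policy described above has three parts: a scalar confidence from the next-token distribution, a per-step budget that shrinks as confidence grows, and a ranking by attention mass and recency with a protected recent window. The sketch below is one plausible reading of that description. The abstract does not give the actual formulas, so the entropy-based confidence, the linear budget mapping, the `alpha` mixing weight, and every function name are assumptions.

```python
import numpy as np


def confidence_from_probs(probs):
    """Assumed confidence score: 1 minus the normalized entropy of the distribution."""
    p = np.asarray(probs, dtype=float)
    p = p / p.sum()
    entropy = -np.sum(p * np.log(p + 1e-12))
    return float(np.clip(1.0 - entropy / np.log(len(p)), 0.0, 1.0))


def cache_budget(confidence, min_budget, max_budget):
    """Assumed linear mapping: uncertain steps keep more tokens, confident steps fewer."""
    return int(round(max_budget - confidence * (max_budget - min_budget)))


def select_tokens(attn_mass, budget, recent_window, alpha=0.7):
    """Return sorted indices of cached tokens to keep under the given budget."""
    attn = np.asarray(attn_mass, dtype=float)
    n = len(attn)
    if budget >= n:
        return np.arange(n)
    # The most recent tokens are always kept, even if they exceed the budget.
    protected = np.arange(max(0, n - recent_window), n)
    older = np.arange(0, n - len(protected))
    n_extra = max(0, budget - len(protected))
    # Composite score: normalized accumulated attention mixed with position-based recency.
    recency = np.arange(n) / max(n - 1, 1)
    score = alpha * attn / (attn.max() + 1e-12) + (1 - alpha) * recency
    top_older = older[np.argsort(-score[older])[:n_extra]]
    return np.sort(np.concatenate([top_older, protected]))


# Hypothetical next-token distributions over an 8-token vocabulary.
uniform = np.full(8, 1 / 8)
peaked = np.array([0.93] + [0.01] * 7)
b_uncertain = cache_budget(confidence_from_probs(uniform), min_budget=4, max_budget=12)
b_confident = cache_budget(confidence_from_probs(peaked), min_budget=4, max_budget=12)

# Hypothetical accumulated attention mass for 16 cached tokens.
attn = np.array([0.30, 0.01, 0.02, 0.25, 0.01, 0.01, 0.05, 0.02,
                 0.03, 0.10, 0.05, 0.05, 0.02, 0.03, 0.03, 0.02])
kept = select_tokens(attn, b_confident, recent_window=2)
print(b_uncertain, b_confident, kept)
```

The design point is that the budget is recomputed every decoding step from a signal the model already produces, rather than fixed in advance as in a sliding window.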

[AI-140] PANDO: Efficient Multimodal AI Agents via Online Skill Distillation

Link: https://arxiv.org/abs/2605.24785
Authors: Yubo Li, Yidi Miao, Haotian Shen, Yuxin Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advances in multimodal web agents often rely on increased inference-time computation, including rollout search, verifier passes, offline skill discovery, and specialist model stacks. This raises a central question: can a web agent become more efficient as it accumulates experience, rather than more expensive? We first analyze trajectories from VisualWebArena and identify three recurring sources of inefficiency: repeat-action loops, hidden discovery costs, and low prompt-cache reuse. We then introduce PANDO, a single-rollout online skill-distillation framework that maintains a structured Skill Library and combines progress reflection, confidence-based skill demotion, hierarchical routing, visual compression, and cache-aware prompting. On the full set of 910 VisualWebArena tasks, PANDO achieves a 58.3% success rate, outperforming SGV (54.0%) and our WALT reproduction (45.2%), while using 58% fewer tokens than SGV and 61% fewer tokens than WALT, without any pre-evaluation discovery budget. A 300-task ablation further shows that rules and routines provide most of the success gains, while routing, compression, and cache-aware prompting convert the larger skill library into lower marginal token cost. Finally, we introduce three trajectory-level efficiency metrics – Action Repetition Rate, Step Overhead Ratio, and Prompt Cache Utilization – to make efficiency visible beyond terminal success.

[AI-141] GRAIL: AI translation for scientists application workflow on satellite data

Link: https://arxiv.org/abs/2605.24784
Authors: Zhuocheng Shang, Ahmed Eldawy
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Domain scientists increasingly develop Python scripts to analyze satellite imagery, but these scripts do not scale to large datasets. This paper demonstrates GRAIL, an agentic translation system that converts Python geospatial workflows into executable Spark-based programs without requiring scientists to learn a new framework. Rather than fine-tuning a specialized LLM, GRAIL adapts RDPro, a Scala library for satellite data analysis, to make it LLM-ready using structured documentation, API alias functions, and repair-oriented error logs. Translation is structured as a LangGraph pipeline that decomposes code generation into explicit sections with guided inputs and outputs, enabling targeted repair without regenerating the full program. We demonstrate GRAIL on real-world geospatial workflows and showcase the correctness and scalability of the translated code.

[AI-142] Complement Submodular Information Measures for Balanced and Robust Data Selection

Link: https://arxiv.org/abs/2605.24779
Authors: Rishabh Iyer
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Combinatorics (math.CO)
Comments:

Abstract:Submodular optimization has become a fundamental paradigm for data selection, retrieval, summarization, and representation learning due to its ability to model coverage, diversity, and representativeness. However, classical submodular objectives optimize only the selected subset and do not explicitly preserve structural information between the selected subset and the remaining data. In many modern machine learning applications, including train/validation/test splitting, benchmark construction, and robust subset selection, the quality of a selection depends critically on preserving balanced structure across both the selected subset and its complement. In this work, we introduce Complement Submodular Information (CSI), a new class of complement-aware submodular objectives that quantify shared structural information between a subset and its complement. Our framework induces complement-aware variants of several classical submodular functions including Facility Location, Graph Cut, LogDet, Saturated Coverage, Set Cover, Probabilistic Set Cover, and Feature Based Functions. We analyze the theoretical properties of CSI objectives and show that they exhibit approximate monotonicity under bounded curvature conditions, leading to near-(1-1/e) greedy approximation guarantees. Empirically, CSI objectives consistently outperform standard submodular objectives on robust hidden-slice-aware subset selection. In particular, CSI objectives significantly improve preservation of coherent rare/tail semantic structure while simultaneously suppressing noisy and isolated outliers, leading to substantially improved downstream predictive performance. Synthetic experiments further illustrate how different CSI instantiations capture complementary notions of representativeness, diversity, connectivity, and balanced neighborhood preservation.

[AI-143] Uncertainty Decomposition via Cyclical SG-MCMC and Soft-label Learning for Subjective NLP

Link: https://arxiv.org/abs/2605.24773
Authors: Keito Inoshita, Takato Ueno
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Annotator disagreement in emotion classification reflects ambiguity intrinsic to emotion concepts and is essential for predictor-quality assessment in subjective NLP. Yet no prior work integrates soft-label learning with Bayesian deep learning to evaluate uncertainty along axes including annotator-distribution fidelity. We train a linear head on a frozen RoBERTa via cyclical stochastic gradient Markov chain Monte Carlo (cSG-MCMC), targeting the empirical annotator distribution with a soft-label objective under a five-axis evaluation. On the 28-emotion GoEmotions benchmark, the proposed method outperforms Monte Carlo Dropout and Deep Ensemble simultaneously on three axes – Jensen-Shannon divergence (JSD) to the annotator distribution, Spearman correlation between per-emotion aleatoric uncertainty and disagreement, and selective-prediction Area Under the Risk-Coverage Curve (AURC) and Area Under the ROC Curve (AUROC) – showing independent axes are jointly attainable from one posterior. Post-hoc temperature scaling exhibits a bidirectional effect, establishing hard-label calibration and annotator-JSD as independent dimensions and motivating joint reporting as an honest protocol.

[AI-144] Proper Scoring Rules for Agentic Uncertainty Quantification

Link: https://arxiv.org/abs/2605.24756
Authors: Suresh Raghu, Satwik Pandey, Shashwat Pandey
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 38 pages, 2 figures

Abstract:Language-model agents increasingly emit uncertainty signals throughout a trajectory, but existing agentic UQ evaluations often conflate ranking usefulness with probabilistic truthfulness. AUROC, AUPRC, risk-coverage, Trajectory ECE, and scalarized trajectory scores evaluate discrimination, binwise calibration, or collapsed summaries, but do not strictly elicit the full prefix-conditioned success-probability trace q_t = P^\pi(Y=1 | H_t). Building on prequential proper scoring, we introduce the Trajectory Proper Score (TPS), a predictor-agnostic family of strictly proper trajectory-level scoring rules for any per-step uncertainty signal calibrated into a probability of eventual success. We prove that TPS strictly elicits the success-probability process under complete observation, within the chosen score family and weight schedule. We extend the construction to administratively censored trajectories by projecting the complete-data score onto the observable stopped prefix, yielding an exact q_Z-weighted reduced score and a tractable approximation when q_Z is unestimated. We further show that common trajectory evaluators target weaker objects than the full prefix-conditioned probability process: Trajectory ECE is resolution-blind, while scalarized Trajectory Brier elicits only the collapsed scalar, not the full trace. Experiments on StrategyQA, Tau2-Bench, HotpotQA, and WebShop show that these theoretical distinctions are operationally visible: probability recalibration can substantially change TPS while leaving rank metrics nearly unchanged, and the tractable censored approximation can change the verdict relative to complete-only evaluation.
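To make "strictly proper" concrete: a scoring rule is strictly proper when reporting the true probability is the unique minimizer of the expected score. The sketch below checks this numerically for the Brier score and then scores a whole trajectory as a weighted mean of per-step Brier scores. The abstract does not give the TPS formula, so this weighted per-step form is only an assumed illustration of a trajectory-level proper score, and the function names are hypothetical.

```python
import numpy as np


def expected_brier(reported, true_prob):
    """Expected Brier score when the outcome is 1 with probability `true_prob`."""
    return true_prob * (reported - 1.0) ** 2 + (1.0 - true_prob) * reported ** 2


def trajectory_brier(step_probs, outcome, weights=None):
    """Weighted mean of per-step Brier scores against the final outcome (0 or 1)."""
    q = np.asarray(step_probs, dtype=float)
    w = np.ones_like(q) if weights is None else np.asarray(weights, dtype=float)
    return float(np.sum(w * (q - outcome) ** 2) / np.sum(w))


# Strict propriety check: over a grid of possible reports, the expected score
# is minimized exactly at the true success probability.
true_prob = 0.7
grid = np.linspace(0.0, 1.0, 101)
best_report = float(grid[np.argmin(expected_brier(grid, true_prob))])

# A hypothetical three-step trajectory that ends in success (outcome = 1).
score = trajectory_brier([0.5, 0.7, 0.9], outcome=1)
print(best_report, score)
```

Because each step is scored separately, a predictor is rewarded for the whole probability trace rather than for a single collapsed number, which is the distinction the abstract draws against scalarized Trajectory Brier.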

[AI-145] Bilevel Optimization of Synthetic Trajectories for Multi-Turn LLM Fine-Tuning

Link: https://arxiv.org/abs/2605.24743
Authors: Shresth Verma, Mauricio Tec, Cheol Woo Kim, Kai Wang, Milind Tambe
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:While LLMs excel at single-turn generation, they struggle with long-horizon, multi-turn interactions. Offline reinforcement learning (RL) offers a scalable approach, yet its performance hinges on the availability and quality of multi-turn trajectory data. A common remedy is to augment training with synthetic trajectories generated by LLMs or simulators, but synthetic data is highly heterogeneous in quality, and naively treating all trajectories as equally informative can degrade performance. We propose BOOST, a bilevel optimization framework where the inner level trains the LLM on reweighted data and the outer level trains a lightweight reweighting head on held-out real validation tasks, assigning continuous trajectory-level weights without requiring an external judge. To ground this approach, we derive a PAC-Bayesian bound revealing a three-way trade-off: synthetic data increases diversity but risks task-shift, while concentrating weight on high-quality trajectories improves empirical performance at the cost of effective sample size. Empirically, our method consistently outperforms multiple baselines. Analysis reveals it upweights synthetic trajectories that align with the real data distribution and exhibit higher qualitative merit.

[AI-146] Hylos: Operability Contracts for Model-Native Spatial Intelligence

Link: https://arxiv.org/abs/2605.24728
Authors: Christopher Da Silva
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 27 pages, 7 figures. Systems/position preprint with focused artifact study

Abstract:Foundation models can increasingly describe, reconstruct, and generate 3D objects, assemblies, scenes, and environments, but visually plausible spatial output is not yet operable 3D. A generated object or environment becomes useful to an agent only when the system can identify its entities, frames, surfaces, constraints, provenance, admissible actions, expected effects, and validation failures. This paper introduces Hylos, a systems architecture for contract-bounded spatial intelligence. Hylos maintains scene-scale operability state over objects, assemblies, assets, surface anchors, assertions, action candidates, solver jobs, shared actuator invocations, capability gaps, and effect diffs. Durable spatial changes are routed through a SpatialTransaction: a commit boundary that resolves references, checks admissibility, protects invariants, projects effects, and returns commit, review, rollback, deferral, or capability-gap outcomes. The paper is framed as a systems/position preprint with a focused artifact study rather than a broad benchmark. The study examines causal repair: a visible misalignment appears on a dependent component, while the supported repair lies upstream in the placement structure that controls it. The successful interaction traces the symptom through scene dependencies, selects a supported upstream interaction, and applies a validated change instead of directly editing visible geometry. The broader claim is that spatial AI should be evaluated not only by visual quality, but by whether generated or edited 3D can become reliable substrate for CAD, robotics, simulation, inspection, manufacturing, and interactive world authoring.

[AI-147] MDIA: A Multi-Agent Diagnostic Intelligence Pipeline on HealthBench Professional

Link: https://arxiv.org/abs/2605.24699
Authors: Roberto Cruz, David Rey-Blanco
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 33 pages, 10 figures

Abstract:Most reported gains on agentic-LLM clinical benchmarks are often attributed to prompt engineering, yet our results suggest that larger improvements can come from architectural and engine-level design. We present MDIA, a Multi-agent Diagnostic Intelligence Agent implemented as a 7-node specialty-routed clinical reasoning graph, on the full HealthBench Professional benchmark (n = 525), on a non-fine-tuned LLM. MDIA achieves 0.6272 under OpenAI’s GPT-5.4-2026-03-05, which is +3.72 pp above the performance of OpenAI’s ChatGPT for Clinicians. The experimental work shows that performance lift is attributable to system architecture: specialty routing, multi-turn context preservation, drug-state safety gating, site-filtered search, length-aware synthesis, and engine-level reliability. These findings support the view that agentic clinical benchmark performance is shaped both by the underlying foundation model and the orchestration architecture. Nevertheless, we also noticed notable differences when using other models as a grader; in particular, when using Gemini 2.5 Pro, MDIA scored 0.6585, which suggests that the choice of grader is a source of variability. Robust evaluation of LLMs would therefore require assessment across several independent grader models.

[AI-148] Emotional intelligence in large language models is fragmented across perception cognition and interaction

Link: https://arxiv.org/abs/2605.24686
Authors: Minghao Lv, Lu Chen, Enchang Zhang, Anji Zhou, Xiaoran Xue, Hanyi Zhang, Fenghua Tang, Zhuo Rachel Han, Mengyue Wu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:As large language models (LLMs) are increasingly integrated into emotionally sensitive domains, the structural integrity of their emotional intelligence (EI) becomes a critical frontier for safety and alignment. Current benchmarks often conflate superficial politeness with deep affective reasoning, failing to distinguish between perceptual accuracy and interactive efficacy. Here, we introduce FACET (Functional Affective Competence and Empathy Test), a psychometrically grounded framework comprising 480 expert-crafted items. Unlike previous metrics, FACET is theoretically anchored in the Mayer-Salovey-Caruso four-branch ability model, operationalizing EI through perception, facilitation, understanding, and management of emotions. Through an evaluation of nine frontier models (including GPT-5, Claude-Sonnet-4), we demonstrate that emotional intelligence is not a monolithic capability but is fragmented across cognitive and interactive dimensions. While frontier models demonstrate robust proficiency in objective emotion recognition and social reasoning, this does not consistently translate to interactive success. We categorize these discrepancies into three distinct performance profiles: cognitive-dominant, interactive-dominant, and context-dependent. These typologies indicate that emotional skills do not scale uniformly with general intelligence or model size; rather, they are shaped by specific alignment paradigms. Notably, we identify hidden emotion recognition as a universal performance bottleneck across all architectures. Our results suggest that current RLHF processes may optimize for “stochastic empathy”, a statistical mimicry of emotional syntax, at the expense of integrated affective reasoning. These findings challenge the assumption of linear emotional scaling and provide a rigorous roadmap for developing socially aware agents capable of genuine clinical resonance.

[AI-149] Beyond the Aggregation Dilemma: Prior-Retaining Decoupled Learning for Multimodal Graphs

Link: https://arxiv.org/abs/2605.24684
Authors: Hao Yan, Xuanru Wang, Jun Yin, Shirui Pan, Senzhang Wang, Chengqi Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

点击查看摘要

Abstract:Multimodal Attributed Graph Learning (MAGL) integrates intrinsic node attributes with structural topology via graph aggregation. However, as pretrained encoders evolve into Large Foundation Models (LFMs), the landscape of MAGL fundamentally shifts: under high-confidence LFM priors, mandatory aggregation introduces topological noise that overwhelms discriminative signals, triggering a counter-intuitive performance inversion where sophisticated MAGL architectures underperform simple topology-agnostic MLPs. Through systematic empirical and theoretical analysis, we identify that this inversion stems from a fundamental aggregation dilemma characterized by two concurrent pathologies: (1) Representational Pathology (SNR Degradation) - mandatory aggregation dilutes robust intrinsic features with topological noise, causing the noise penalty to outweigh its collaborative benefit; and (2) Optimization Pathology (Gradient Starvation) - topological aggregation attenuates gradient flow, while a shared task loss causes dominant modalities to prematurely suppress weaker ones. To resolve this dilemma, we propose SUPRA (Shared-Unique Prior-Retaining Architecture), a decoupled dual-pathway paradigm. SUPRA processes modality-specific features through topology-agnostic MLPs while capturing structural synergy via a lightweight shared GNN, with auxiliary deep supervision counteracting gradient starvation. Extensive evaluations demonstrate that SUPRA achieves state-of-the-art performance while requiring 3.5x lower peak GPU memory and up to 4.4x faster training time than Multimodal Graph Transformers.

[AI-150] When Mean CE Fails: Median CE Can Better Track Language Model Quality

Link: https://arxiv.org/abs/2605.24667
Authors: Hao Guo, Simon Dennis, Rivaan Patil, Kevin Shabahang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages

Abstract:Mean cross-entropy is the standard validation metric for language models, but it can fail to track model quality during training. We examine this in two common scenarios. First, in Qwen2.5-1.5B SFT on synthetic fact-learning, we find that mean CE rises substantially after the initial learning phase while held-out fact-recall accuracy remains near its peak. Second, we find that in top-K distillation on TinyStories, decreasing K improves median CE while worsening mean CE; the Top-5 student attains the highest LLM-judge score and crosses below its teacher on median CE, despite having the worst mean CE. In both cases, median CE correlates much more closely with task performance than does mean CE. Analyzing how bulk and tail percentile CE move during training reveals that training reshapes the empirical per-token CE distribution. In top-K distillation, smaller K yields a distribution with more mass at both extremes, decreasing the median and increasing the mean. In Qwen SFT, the bulk saturates quickly while the tail extends in the latter half of training. In both, the task-evaluation metric appears more sensitive to the bulk than to the tail. Practically, we recommend reporting a small set of percentile CE summaries alongside the mean, and using concordance among them as a tool to keep track of distribution reshaping, as well as a low-cost diagnostic for when mean and median CE disagree on model selection.
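The recommendation to report percentile CE summaries alongside the mean can be illustrated with a minimal sketch. The helper names (`token_cross_entropy`, `ce_summary`), the nearest-rank percentile rule, and the toy probabilities are assumptions made for illustration; they are not taken from the paper.

```python
import math
import statistics


def token_cross_entropy(true_token_probs):
    """Per-token cross-entropy in nats: -log p(true token)."""
    return [-math.log(p) for p in true_token_probs]


def ce_summary(ce_values, percentiles=(50, 90, 99)):
    """Mean plus nearest-rank percentiles of a per-token CE sample."""
    ordered = sorted(ce_values)
    n = len(ordered)
    summary = {"mean": statistics.fmean(ordered)}
    for p in percentiles:
        # Nearest-rank percentile; integer arithmetic avoids float rounding.
        rank = max(1, (p * n + 99) // 100)
        summary[f"p{p}"] = ordered[rank - 1]
    return summary


# Hypothetical toy data, not taken from the paper.
uniform_model = token_cross_entropy([0.5] * 10)         # mediocre on every token
peaked_model = token_cross_entropy([0.9] * 9 + [1e-6])  # sharp bulk, one tail miss

summary_uniform = ce_summary(uniform_model)
summary_peaked = ce_summary(peaked_model)
print(summary_uniform)
print(summary_peaked)
```

In this toy case the peaked model has the worse mean but the better median, because a single tail token dominates the mean. This is the kind of disagreement the percentile summaries are meant to expose.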

[AI-151] CyBOKClaw: Human-in-the-Loop CyBOK Mapping for Cybersecurity Curriculum

Link: https://arxiv.org/abs/2605.24663
Authors: Yan Lin Aung, Kevin Togbe
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper presents CyBOKClaw, an interpretable human-in-the-loop retrieval framework for mapping cybersecurity keywords or phrases (KWoPs) to the Cyber Security Body of Knowledge (CyBOK). Rather than treating the task as strict exact classification, the framework is designed as a top-k candidate generator for expert review. It combines query normalization, curated term expansion, concept-level boosts, topic-description enrichment, and domain-sensitive ranking rules. Because educational KWoPs are often broad, ambiguous, and only approximately aligned with CyBOK terminology, strict exact matching provides only a partial account of practical utility. We therefore evaluate the framework using both structural retrieval metrics and an expert-guided top-5 usefulness metric, ECA-5 (Exact or Closest Acceptable Match at top-5), which records whether the returned candidates contain at least one mapping that an expert would judge exact or accept as the nearest practical CyBOK placement. On the development dataset, CyBOKClaw achieves 64.73% EXA-5 (Exact Match at top-5), 84.18% structural semantic alignment, and 91.88% ECA-5; on the validation dataset, it achieves 81.19% EXA-5, 93.32% structural semantic alignment, and 98.00% ECA-5. These results show that expert-guided top-k usefulness provides a more faithful account of practical CyBOK mapping utility than exact structural matching alone, and that CyBOKClaw is effective as a CyBOK-specific expert-support retrieval system.
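The two top-5 metrics described above differ only in which labels count as a hit. The sketch below assumes each query has a ranked candidate list and a set of acceptable labels; the function name `hit_rate_at_k`, the keyword IDs, and the topic IDs are hypothetical placeholders rather than real CyBOK entries.

```python
def hit_rate_at_k(ranked_candidates, acceptable, k=5):
    """Fraction of queries whose top-k candidates contain an acceptable label."""
    hits = sum(
        1
        for query, candidates in ranked_candidates.items()
        if set(candidates[:k]) & acceptable[query]
    )
    return hits / len(ranked_candidates)


# Hypothetical retrieval output: keyword -> ranked topic IDs.
ranked = {
    "kw1": ["T3", "T7", "T1", "T9", "T4"],
    "kw2": ["T2", "T8", "T5", "T6", "T0"],
    "kw3": ["T4", "T1", "T3", "T7", "T9"],
}
# Exact labels only (EXA-5 style).
exact = {"kw1": {"T1"}, "kw2": {"T9"}, "kw3": {"T5"}}
# Exact labels plus expert-accepted nearest placements (ECA-5 style).
exact_or_closest = {"kw1": {"T1"}, "kw2": {"T9", "T8"}, "kw3": {"T5"}}

exa5 = hit_rate_at_k(ranked, exact)
eca5 = hit_rate_at_k(ranked, exact_or_closest)
print(exa5, eca5)
```

Because the acceptable set for ECA-5 is a superset of the exact set, ECA-5 can never be lower than EXA-5 on the same retrieval output.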

[AI-152] Beyond Inference-Only Deployment: Comparing Weight-Based Consolidation Against Cascading Compaction

Link: https://arxiv.org/abs/2605.24657
Authors: Simon Dennis, Kevin Shabahang, Hao Guo, Rivaan Patil
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 15 pages

Abstract:Major LLM platforms deploy models in an inference-only configuration: the model serves requests but never updates per-user weights. Users must repeatedly re-teach preferences, corrections, and project context, and context-based workarounds consume context-window space and degrade under cascading compaction. We evaluate an alternative: nightly consolidation of interaction knowledge into model weights via reflection, synthesis, and Low-Rank Adaptation (LoRA) fine-tuning on a single consumer GPU. Across ten realistic software development conversations (n = 10, 1,146 test questions across three memory types), three cycles of cascading compaction retain 36.8 +/- 3.0% of knowledge (between an 11.8% no-context floor and a 90.1% full-context ceiling), while consolidation retains 80.4 +/- 1.3% – a 43.6 pp gain (paired t(9) = 14.8, p < 0.001) that more than doubles what compaction preserves, with the largest gains on procedural corrections (36.3% to 74.6%) and episodic project facts (31.5% to 78.2%). As a methodological aside, mean per-token validation cross-entropy is negatively correlated with LLM-judged accuracy (r = -0.51) while median per-token validation cross-entropy tracks accuracy almost exactly (r = +0.99): under evaluators that tolerate surface-form variation, the mean is misleading and a heavy-tail-robust statistic is the faithful signal. Persistent personalization requires moving beyond inference-only deployment toward architectures that consolidate knowledge into weights.

[AI-153] On the Stability and Realizability of Recurrent Polynomial Surrogate Ternary Logic Gate Networks

Link: https://arxiv.org/abs/2605.24649
Authors: Sai Sandeep Damera, Ryan Matheu, Aniruddh G. Puranic, John S. Baras, Calin Belta
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: 9 pages, 3 figures. This work has been submitted to the IEEE for possible publication

Abstract:Recurrent Neural Networks (RNNs) can learn to predict Signal Temporal Logic (STL) verdicts online from partial trajectories, but deploying them as runtime monitors in safety-critical systems demands more than predictive accuracy. Standard RNN architectures offer no structural guarantee that outputs degrade gracefully under sensor degradation; a dropped input can silently flip a verdict from safe to unsafe. We introduce the Recurrent Differentiable Ternary Logic Gate Network (R-DTLGN), a recurrent architecture that operates over Kleene’s three-valued logic {-1, 0, +1}, where 0 explicitly represents unknown. The R-DTLGN trains through continuous polynomial surrogates and hardens to a discrete ternary logic circuit at inference. We analyze the hardened circuit through two gate vocabularies derived from two orderings on the ternary domain: numerically monotone gates ensure stable recurrent dynamics, while information-monotone gates, when present, guarantee principled abstention (unknown inputs never produce wrong outputs) and monotonicity in input certainty (more information can only improve the verdict). We show that the recurrent connections required by bounded STL operators use exclusively AND and OR, which belong to both vocabularies, linking the monitoring task to the architecture’s guarantees. A realizability bound derived from the STL formula’s temporal operators directly sizes the network’s hidden state, replacing hyperparameter search with a formula-driven specification. We evaluate on STL specifications over D4RL PointMaze navigation data, testing prediction accuracy, degradation under predicate dropout, and the accuracy-versus-safety tradeoff between two label construction pipelines. The R-DTLGN is, to our knowledge, the first recurrent architecture that couples learned temporal prediction with formal degradation guarantees rooted in three-valued logic.
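A small sketch may help make the information-monotonicity property concrete. It uses the standard strong Kleene connectives on {-1, 0, +1} (AND as minimum, OR as maximum) and assumes the usual knowledge ordering, in which 0 (unknown) can be refined to either definite value. The function names and the exhaustive check are illustrative assumptions, not the paper's construction.

```python
from itertools import product

TERNARY = (-1, 0, 1)  # false, unknown, true


def k_and(a, b):
    """Strong Kleene AND: the minimum of the two truth values."""
    return min(a, b)


def k_or(a, b):
    """Strong Kleene OR: the maximum of the two truth values."""
    return max(a, b)


def refines(x, y):
    """True if y carries at least as much information as x."""
    return x == 0 or x == y


def is_information_monotone(gate):
    """Check that refining the inputs never contradicts a definite output."""
    for a, b, a2, b2 in product(TERNARY, repeat=4):
        if refines(a, a2) and refines(b, b2):
            if not refines(gate(a, b), gate(a2, b2)):
                return False
    return True


def unknown_as_false(a, b):
    """Counter-example gate: treats any unknown input as false."""
    return -1 if 0 in (a, b) else min(a, b)


print(is_information_monotone(k_and))             # AND passes the check
print(is_information_monotone(k_or))              # OR passes the check
print(is_information_monotone(unknown_as_false))  # the counter-example fails
```

The counter-example fails because it outputs false for the inputs (unknown, true), and that verdict flips to true once the unknown input is resolved to true.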

[AI-154] Demystifying the Mythos or Disrupting Bugonomics? From Zero-Day Asymmetry to Defender Remediation Throughput

Link: https://arxiv.org/abs/2605.24632
Authors: Alfredo Pesoli, Herman Errico, Lorenzo Cavallaro
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Recent demonstrations of large language models producing candidate and confirmed vulnerabilities in production software have renewed the narrative that AI will reshape offensive and defensive security. Headlines emphasize capability; they rarely interrogate costs and incentives. This paper examines LLM-driven vulnerability discovery through a bugonomics lens: the operational economics of producing, proving, prioritizing, and fixing security-relevant defects. Historically, the most visible high-end bugonomics was offense-priced because production-grade zero-days and exploit chains were expensive specialist outputs for governments, brokers, and offensive vendors. Defender-side bugonomics already existed in vulnerability research, reward programs, and vendor remediation work; LLM-assisted systems change its scale and distribution. They make candidate generation, code comprehension, harness construction, proof-of-impact drafting, and report preparation cheaper at codebase scale. Exploits and proofs of concept remain important, but in defender workflows they primarily prove impact, guide prioritization, and justify remediation. The resulting bottleneck is not only finding more bugs; it is absorbing, validating, triaging, patching, and shipping a larger stream of reports. Using public data from Anthropic’s Mythos Preview and Mozilla Firefox collaborations, along with public exploit-market price anchors and vulnerability reward programs, we argue that the near-term shift is not simply more zero-days. It is a move toward broader defender remediation throughput: low-signal candidates become cheaper, evidence-rich remediation becomes more important, and scarce capacity shifts toward maintainer review and release work. The effect is acute in open source, where LLM-assisted discovery can increase report volume while maintainer-side validation, triage, funding, and release capacity may not scale.

[AI-155] Agent-as-Peer-Debriefer: A Multi-Agent Framework with Perspective-Based Refinement for Qualitative Analysis

Link: https://arxiv.org/abs/2605.24600
Authors: Zhimin Lin, Kun Cheng, Fan Bai, Jie Gao
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) are increasingly used for qualitative data analysis (QDA), yet their outputs often miss the depth and nuance of human analysis. We argue this gap reflects a missing credibility practice from human QDA: peer debriefing, in which an analyst seeks feedback from a disinterested peer and uses it to refine their coding. To bring this practice into LLM-assisted QDA, we propose Agent-as-Peer-Debriefer, a multi-agent QDA framework that builds peer debriefing into key coding steps. In our framework, a Hierarchical Coding Agent follows the standard QDA process to generate codes, sub-themes, and themes, along with self-explanations and reflection memos. It then shares these outputs with three Peer-Debriefing Agents, each applying a distinct analytical perspective (Theory-Driven, Data-Driven, or Applied) and refining the codes by keeping, renaming, reassigning, merging, or splitting them. These perspectives are drawn from established human QDA practices that generalize across domains and datasets. To evaluate the framework, we test it on three datasets across two domains with three LLMs, measuring semantic similarity to human-annotated codes. Across all settings, perspective-based, peer-debriefing refinement aligns more closely with human codes than a single-LLM baseline, and an ablation further shows the gain is not merely from additional refinement. The three perspectives also produce distinct trade-offs, showing that the choice of perspective is a meaningful and controllable design decision. More broadly, these findings suggest that simulating peer debriefing with explicit perspectives is a promising route to more credible LLM-assisted QDA.

[AI-156] HeartBeatAI: An Interpretable and Robust Deep Learning Framework for Multi-Label ECG Arrhythmia Detection

Link: https://arxiv.org/abs/2605.24588
Authors: Shubham Gupta, Nikhil Panwar, Partha Pratim Roy
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:While Deep Learning (DL) enhances automated electrocardiogram (ECG) analysis, clinical deployment is hindered by class imbalance and the generalization gap. This paper presents HeartBeatAI, a deep learning framework combining domain generalization, multi-scale feature aggregation, and clinical explainability for robust 12-lead ECG classification. Moving beyond image-based paradigms, HeartBeatAI integrates a Squeeze-and-Excitation (SE) ResNet to isolate diagnostic leads alongside a Multi-Layer Concentration Pipeline to capture macro-rhythm and micro-morphological anomalies. To mitigate domain shift, the framework employs MixStyle regularization and Label Smoothing. Rigorous benchmarking across four large-scale datasets using intra-source and Leave-One-Domain-Out (LODO) protocols demonstrates high performance (98% Macro F1-score) under intra-source conditions. However, LODO evaluations reveal significant degradation in detecting rare anomalies, highlighting a persistent challenge in cross-institutional deployment.

[AI-157] LAPLEX: The FFT of Learnable Laplace Kernels

Link: https://arxiv.org/abs/2605.24584
Authors: Łukasz Struski, Hanna Blazhko, Piotr Kubaty, Jacek Tabor
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Fast linear algebra in deep learning usually comes with a choice: fixed geometry and exact computation, as in the Fourier transform, or adaptive geometry paid for by dense parameters, random features, or low-rank surrogates. To move beyond this trade-off, we introduce LAPLEX, a class of exact, trainable (phased) Laplace-kernel operators. A LAPLEX layer is a typically full-rank dense matrix, implicitly defined by learnable coordinate anchors, with FFT-like scaling. Consequently, it supports trainable matrix–vector operations at vector dimensions up to 10^9 on modern GPUs. As a neural layer, it yields compact projections and classification heads interpretable as soft, trainable routing models. The same primitive also serves as an efficient Gram operator, enabling high-dimensional covariance models on flattened images of dimension 3 × 10^6 that preserve visible spatial structure without imposing convolutional bias. These applications reflect a single principle: dense geometry can be learned without storing a dense matrix, which enables data-adaptive global interactions in regimes where ordinary dense layers are out of reach. In this sense, LAPLEX separates expressivity from storage cost: it behaves like a dense trainable matrix, but is represented and applied through a small structured set of parameters.

[AI-158] Associations between echocardiographic traits and AI-ECG predictions of heart failure

Link: https://arxiv.org/abs/2605.24576
Authors: Elias Stenhede, Eivind Bjørkan Orstad, Torbjørn Omland, Henrik Schirmer, Arian Ranjbar
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Artificial intelligence-enabled electrocardiography (AI-ECG) can detect heart failure (HF), including disease not captured by left ventricular ejection fraction (LVEF), but the cardiac phenotypes underlying model predictions remain unclear. We therefore investigated whether AI-ECG-predicted HF risk aligns with established echocardiographic measures of myocardial dysfunction, remodelling, and filling pressures. We retrospectively analysed ECG and echocardiography data from 8147 patients who underwent both examinations within three days at Akershus University Hospital between 1 January 2023 and 1 June 2025. A previously validated AI-ECG model for HF detection was applied to all ECGs. Spearman’s rank correlation ρ quantified associations between echocardiographic parameters and AI-ECG risk. Subgroup analyses were performed by sex and left ventricular ejection fraction (LVEF). External validation included 36,286 ECG-echocardiography pairs from Columbia University Irving Medical Center. Global longitudinal strain (GLS) showed the strongest correlation (ρ = 0.57), followed by mitral annular plane systolic excursion (MAPSE) (ρ = -0.49) and LVEF (ρ = -0.45). In patients with LVEF ≥ 50%, correlations remained substantial for GLS, MAPSE, and diastolic-related parameters. Volumetric left ventricular indices correlated less strongly in women, whereas diastolic indices showed stronger correlations in women than in men. Physiological validation showed that AI-ECG HF risk predictions align primarily with measures of systolic function, particularly global longitudinal strain, while also capturing diastolic-related abnormalities in patients with preserved LVEF. This approach may improve clinical interpretability and identify opportunities for model refinement.
Submission history: [v1] Sat, 23 May 2026 13:37:36 UTC (385 KB)
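Spearman's rank correlation, the statistic reported throughout this abstract, is the Pearson correlation of the ranks. The pure-Python sketch below uses average ranks for ties; the function names and the five toy GLS and risk values are hypothetical and are not patient data from the study.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy values: GLS in percent (less negative is worse) and model risk.
gls = [-21.0, -19.0, -17.0, -14.0, -10.0]
risk = [0.05, 0.10, 0.08, 0.40, 0.70]
rho = spearman_rho(gls, risk)
print(round(rho, 3))
```

A positive ρ between GLS and predicted risk, as in the abstract, means that less negative (worse) strain values tend to come with higher predicted risk.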

[AI-159] Summoning the Oracle to Slay It: Mitigating Look-Ahead Bias in Financial Backtesting with Large Language Models

Link: https://arxiv.org/abs/2605.24564
Authors: Weixian Waylon Li, Mengyu Wang, Tiejun Ma
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments:

Abstract:Backtesting large language models (LLMs) on historical financial data is unreliable because pre-training cuts off after the events happened. An LLM trained in 2024 already “knows” which way 2018-2020 stocks moved. We name this failure parametric look-ahead bias and propose FinCAD, an inference-time adaptation of Context-Aware Decoding that suppresses an LLM’s memory of historical outcomes without retraining. FinCAD pairs an adversarial bias-discovery pipeline that learns a model-specific memory-activating prior prompt with an entity- and date-adaptive rule that scales the CAD strength to per-(entity, date) memorisation, so the penalty fires on memorised in-sample dates and decays to zero out-of-sample. Across five 7-14B LLMs and five mega-cap equities, FinCAD cuts in-sample backtest returns by up to 67.1% on memorised dates while leaving 2025 out-of-sample returns within 8K and Sharpe within 0.10 of baseline, and preserves general-purpose reasoning within 1.7 pts. On an eleven-model leaderboard, it raises the in-sample / out-of-sample Spearman correlation from +0.779 to +0.846, recovering rankings that genuinely predict out-of-sample performance.
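The contrastive step underlying Context-Aware Decoding can be sketched as follows. CAD in its original form combines two logit vectors as (1 + α) times one minus α times the other. How FinCAD assigns the two roles is an assumption here: the sketch treats the logits under the memory-activating prior prompt as the term to subtract, and the linear `adaptive_alpha` rule, the two-token vocabulary, and the logit values are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(task_logits, prior_logits, alpha):
    """CAD-style combination: (1 + alpha) * task - alpha * prior."""
    return [(1 + alpha) * t - alpha * p for t, p in zip(task_logits, prior_logits)]


def adaptive_alpha(memorisation_score, alpha_max=1.0):
    """Hypothetical rule: strength grows linearly with a score clipped to [0, 1]."""
    return alpha_max * min(1.0, max(0.0, memorisation_score))


# Hypothetical logits over the vocabulary ["up", "down"].
task_logits = [2.0, 1.0]   # model given the backtest context
prior_logits = [3.0, 0.0]  # model given the memory-activating prompt

in_sample = contrastive_logits(task_logits, prior_logits, adaptive_alpha(1.0))
out_of_sample = contrastive_logits(task_logits, prior_logits, adaptive_alpha(0.0))
print(softmax(in_sample))
print(softmax(out_of_sample))
```

With a memorisation score of zero the strength is zero and the logits are unchanged, which matches the abstract's description of a penalty that decays to zero out-of-sample.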

[AI-160] PALoRA: Projection-Adaptive LoRA for Preserving Reasoning in Large Language Models

Link: https://arxiv.org/abs/2605.24549
Authors: Mustafa Hayri Bilgin, Mariam Barry, Albert Bifet, Azzedine Idir Ait Said, Soumya Banerjee
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Efficiently updating Large Language Models (LLMs) with new or evolving factual knowledge remains a central challenge, as even parameter-efficient adaptation can erode previously acquired reasoning abilities. This tension reflects a plasticity-stability dilemma: models must incorporate new knowledge while preserving skill-critical representations. In this work, we study this trade-off through the spectral structure of multilayer perceptron weight matrices. We show, both theoretically and empirically, that information essential for reasoning is not localized only in dominant singular directions, but is instead distributed across the singular spectrum. Motivated by this observation, we introduce PALoRA, a two-stage framework for knowledge injection with reduced interference. PALoRA first trains a Singular Value Fine-Tuning (SVF) expert on a reasoning dataset and uses its learned singular scaling vector as a frozen geometric probe to identify components that are critical for the target skill. It then performs factual knowledge injection with Low-Rank Adaptation (LoRA) under a structural orthogonality constraint, ensuring that updates avoid the identified skill-relevant subspace. Across Llama 3.1 8B and Mistral 7B, and across mathematical, coding, and scientific reasoning benchmarks, PALoRA preserves on average 95% of the SVF expert’s reasoning performance while maintaining competitive factual recall. It consistently improves skill retention over prior spectral Parameter-Efficient Fine-Tuning (PEFT) methods while adding less than 0.006% parameter overhead.

[AI-161] Rethinking Federated Unlearning via the Lens of Memorization KDD2026

Link: https://arxiv.org/abs/2605.24545
Authors: Jiaheng Wei, Yanjun Zhang, He Zhang, Leo Yu Zhang, Chao Chen, Kok-Leong Ong, Jun Zhang, Yang Xiang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted by SIGKDD 2026

Abstract:Federated learning (FL) increasingly needs machine unlearning to comply with privacy regulations. However, existing federated unlearning approaches may overlook the overlapping information between the unlearning and remaining data, leading to ineffective unlearning and unfairness between clients. In this work, we revisit federated unlearning through the lens of memorization. We argue that unlearning should mainly remove the unique memorized information attributable to the data to be forgotten, while preserving overlapping patterns that are also supported by the remaining data. Specifically, we propose Grouped Memorization Evaluation, an example-level metric that separates memorized knowledge from overlapping knowledge. Building on this metric, we introduce Federated Memorization Pruning (FedMemPrune), a pruning-based unlearning approach that resets redundant parameters responsible for memorization. Extensive experiments show that FedMemPrune closely matches retraining-based unlearning baselines while more effectively eliminating memorization than existing federated unlearning algorithms, yielding strong unlearning performance without sacrificing the utility of retained knowledge.

[AI-162] Emission-Aware Reinforcement Learning for Sustainable Electric Vehicle Charging and Carbon Dioxide Reduction Under Varying Renewable Penetration

Link: https://arxiv.org/abs/2605.24543
Authors: Ninglin Ou, Mohammad A. Razzaque, Iftekher Islam Shovon, Shafkat Khan Siam, Shafiuzzaman K Khadem, Krishnendu Guha, Mayeen U Khandaker, Md. Noor-A-Rahim
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: Submitted to the Engineering Applications of Artificial Intelligence journal (Elsevier)

Abstract:The rapid growth of Electric Vehicle (EV) adoption challenges power distribution networks through peak load spikes, voltage instability, and transformer overloads from uncoordinated charging. While Model Predictive Control (MPC) and standard Reinforcement Learning (RL) methods have addressed these issues, existing approaches rarely treat real-time carbon intensity or fluctuating renewable energy (RE) availability as primary scheduling objectives, leaving substantial decarbonisation potential unrealised. This paper proposes an emission-aware RL strategy based on the Soft Actor Critic (SAC) algorithm, with a multi-objective reward that penalises carbon emissions, curtailed on-site renewables, and unmet user demand. The agent is trained within a unified benchmarking framework on the EV2Gym platform, incorporating behind-the-meter solar and wind profiles, time-varying EirGrid carbon intensity data, and realistic workplace EV behaviour across 25 Electric Vehicle Supply Equipment (EVSE) units. Nine control strategies, including heuristics, emission-aware MPC variants, and the proposed RL agent, are compared under five renewable penetration scenarios (0%-50%) over ten independent runs each. The RL agent achieves a carbon intensity as low as 23.96 grams of carbon dioxide per kilowatt-hour under 50% wind penetration, representing up to 87% emission reduction versus the uncontrolled baseline, and outperforms the external graph-based Power Distribution Network (PDN) benchmark. Transformer overload remains below 7 kWh across scenarios, against up to 1093 kWh for the As Fast As Possible (AFAP) heuristic, and renewable self-consumption reaches 52% under combined wind and solar supply. Embedding carbon intensity forecasts into the RL state and reward aligns charging with low-emission periods while preserving grid compliance and user satisfaction.

[AI-163] DemoEvolve: Overcoming Sparse Feedback in Agentic Harness Evolution with Demonstrations

Link: https://arxiv.org/abs/2605.24539
Authors: Lirong Che, Yuzhe yang, Peiwen lin, Chuang wang, Xueqian wang, Jian su
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Agent harness evolution improves frozen language-model agents by modifying the executable structures around them. We study this paradigm as a form of sample-efficient fast adaptation: instead of updating model weights, an agent can acquire task-specific competence by changing its external harness, while leaving the base model’s general capabilities intact. Prior work shows that self-generated rollouts can support harness search, suggesting that agents may acquire new task competence through practice. Yet in long-horizon stochastic environments, self-practice becomes fragile: rewards are sparse, outcomes are high-variance, and failures are hard to attribute to concrete harness mechanisms. We introduce DemoEvolve, a demonstration-bootstrapped approach to harness evolution. When reward-only search is too broad and noisy, competent human trajectories serve as expert reference experience for the coding proposer, guiding harness-level diagnosis and editing. Experiments on Liar’s Dice show that self-rollout evolution can work when episodes are short and failures are attributable. In contrast, Balatro exposes a harder long-horizon stochastic regime, where self-rollout evolution is misled by sparse feedback and candidate-selection noise, while tutorial-like textual knowledge alone does not yield stable improvement. Under the same limited budget, DemoEvolve produces more effective and auditable harness edits and achieves better performance. Overall, demonstrations make sparse-feedback harness evolution more diagnosable, localizable, and stable.

[AI-164] Reasoning as an Attack Surface: Adaptive Evolutionary CoT Jailbreaks for LLMs

Link: https://arxiv.org/abs/2605.24497
Authors: Jianan Li, Simeng Qin, Xiaojun Jia, Lionel Z. Wang, Tianhang Zheng, Xiaoshuang Jia, Yang Liu, Xiaochun Cao
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in reasoning and generation tasks and are increasingly deployed in real-world applications. However, their explicit chain-of-thought (CoT) mechanism introduces new security risks, making them particularly vulnerable to jailbreak attacks. Existing approaches often rely on static CoT templates to elicit harmful outputs, but such fixed designs suffer from limited diversity, adaptability, and effectiveness. To overcome these limitations, we propose an adaptive evolutionary CoT jailbreak framework, called AE-CoT. Specifically, the method first rewrites harmful goals into mild prompts with teacher role-play and decomposes them into semantically coherent reasoning fragments to construct a pool of CoT jailbreak candidates. Then, within a structured representation space, we perform multi-generation evolutionary search, where candidate diversity is expanded through fragment-level crossover and a mutation strategy with an adaptive mutation-rate control mechanism. An independent scoring model provides graded harmfulness evaluations, and high-scoring candidates are further enhanced with a harmful CoT template to induce more destructive generations. Extensive experiments across multiple models and datasets demonstrate the effectiveness of the proposed AE-CoT, consistently outperforming state-of-the-art jailbreak methods.

[AI-165] Market Regime Council for Dynamic Credit Assignment in Multi-Agent LLM Decision Systems

Link: https://arxiv.org/abs/2605.24490
Authors: Yunhua Pei, Zerui Ge, Jin Zheng, John Cartlidge
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Portfolio Management (q-fin.PM)
Comments: 35 pages, 13 figures, preprint

Abstract:Multi-agent LLM decision systems for portfolio management still lack a principled way to assign credit across specialist agents, remain vulnerable to cold-start dominance under regime shifts, and offer limited transparency into how final allocations are formed. We propose Market Regime Council (MRC), a cooperative multi-agent decision system that computes exact Shapley credits across all single, pairwise, and Grand-coalition outputs for online agent weighting. Instantiated with N=3 specialist agents, at each trading period, MRC recomputes coalition-based Shapley weights from exponentially weighted performance histories, uses a Bayesian adaptive mixture to stabilize early periods, applies regime-dependent multipliers to adjust agent authority, and records each rebalance through a five-layer causal trace. Over 1,037 trading days across 13 crypto assets and five seeds, MRC achieves a Sharpe ratio of 1.51 and a cumulative return of 440.1%, ranking first on CR, SR, and IR among active baselines and attaining the lowest MDD among active methods. Ablation results show that the gains come from Shapley-weighted integration across coalition outputs rather than from any single stage in isolation. Code and demo data are included in the supplementary material.
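With N=3 agents there are only seven non-empty coalitions, so exact Shapley credits are cheap to compute. The sketch below averages each agent's marginal contribution over all orderings; the agent names and coalition values are hypothetical, and the paper's exponentially weighted histories, Bayesian mixture, and regime multipliers are not modelled.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: mean marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value[with_p] - value[coalition]
            coalition = with_p
    return {p: totals[p] / len(orderings) for p in players}


# Hypothetical coalition scores for three specialist agents A, B, and C.
value = {
    frozenset(): 0.0,
    frozenset("A"): 1.0,
    frozenset("B"): 2.0,
    frozenset("C"): 3.0,
    frozenset("AB"): 4.0,
    frozenset("AC"): 5.0,
    frozenset("BC"): 6.0,
    frozenset("ABC"): 9.0,
}

credits = shapley_values(["A", "B", "C"], value)
print(credits)
```

The credits sum to the grand-coalition value (the efficiency property), so they can be normalised into agent weights directly.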

[AI-166] TIGER: Text-Informed Generalized Enzyme-Reaction Retrieval ACL2026

Link: https://arxiv.org/abs/2605.24489
Authors: Yuhang Zhang, Keyan Ding, Peilin Chen, Han Liu, Can Lin, Ruixi Chen, Shiqi Wang, Qi Song
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
Comments: Accepted to ACL2026

Abstract:Enzyme-reaction retrieval is a fundamental problem in computational biology, underpinning enzyme characterization, reaction mechanism elucidation, and the rational design of metabolic pathways and biocatalysts. As a bidirectional task, it entails both enzyme-to-reaction and reaction-to-enzyme mapping. However, existing approaches suffer from poor generalization across tasks and distributions, with performance highly sensitive to dataset splits and substantial asymmetry between retrieval directions. To address these challenges, we present TIGER, a Text-Informed Generalized Enzyme-Reaction Retrieval framework that leverages protein-to-text generation models to distill textual semantic knowledge from enzyme sequences, providing a generalized representation that bridges enzymes and biochemical reactions. To ensure the quality and reliability of textual semantics, we design a Dynamic Gating Network that adaptively fuses text-derived knowledge with sequence features, enabling more consistent and informative enzyme representations, while a Structure-Shared Feature Projector aligns enzyme and reaction representations within a unified latent space. Extensive experiments demonstrate that, under bidirectional retrieval supervision, TIGER significantly outperforms state-of-the-art baselines across diverse distributions and exhibits strong robustness and transferability across tasks.

[AI-167] SPACE: Unifying Symmetric and Asymmetric Routing Problems for Generalist Neural Solver

Link: https://arxiv.org/abs/2605.24484
Authors: Rongsheng Chen,Changliang Zhou,Canhong Yu,Yuanyao Chen,Yu Zhou,Zhuo Chen,Zhenkun Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Generalist neural routing solvers have shown great potential in solving diverse vehicle routing problems (VRPs) with a unified model. However, existing solvers are typically limited to symmetric settings or degrade in performance when switching to asymmetric settings due to input inconsistencies or inherent structural differences, substantially limiting their practicality in real-world scenarios that encompass both settings. To address this limitation, we define the spatial position of each node based on the relative distances to a specific set of pivots and further propose a Spatial Pivot-Aligned Coordinate-free Embedding (SPACE) framework that unifies node representation and solution generation across symmetric and asymmetric VRPs. Specifically, we construct a bidirectional Fréchet representation using a novel furthest pivot sampling strategy to enable invariant node representations across distinct problem settings. Furthermore, we introduce a weight-decomposed adaptive decoding mechanism that decouples geometric perception from problem representations, mitigating the overfitting of constraint decisions to a specific geometry setting. Extensive experiments on 110 VRP variants, comprising 55 symmetric problems and their asymmetric counterparts, demonstrate that SPACE achieves promising zero-shot generalization in both symmetric and asymmetric VRPs.
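The idea of describing each node by its distances to a set of pivots, rather than by coordinates, can be sketched as follows. This is an illustrative reading of the abstract, not the authors' method: the greedy selection rule, the symmetrised gap used for picking pivots, and the function names are assumptions.

```python
def furthest_pivots(dist, k, start=0):
    """Greedy furthest-point selection on a (possibly asymmetric) distance matrix."""
    n = len(dist)
    pivots = [start]
    # Gap to the nearest chosen pivot, taking the smaller of the two directions.
    gap = [min(dist[i][start], dist[start][i]) for i in range(n)]
    while len(pivots) < k:
        nxt = max(range(n), key=lambda i: gap[i])
        pivots.append(nxt)
        gap = [min(gap[i], dist[i][nxt], dist[nxt][i]) for i in range(n)]
    return pivots


def pivot_embedding(dist, pivots):
    """Map each node to its distances to the pivots, then its distances from them."""
    return [
        [dist[i][p] for p in pivots] + [dist[p][i] for p in pivots]
        for i in range(len(dist))
    ]


# Small asymmetric distance matrix (made-up values).
dist = [
    [0, 2, 9, 4],
    [3, 0, 6, 5],
    [8, 7, 0, 1],
    [4, 6, 2, 0],
]
pivots = furthest_pivots(dist, k=2)
features = pivot_embedding(dist, pivots)
```

Because the features are built only from the distance matrix, the same code path handles symmetric and asymmetric instances; in the symmetric case the two halves of each feature vector coincide.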

[AI-168] SAM: State-Adaptive Memory for Long-Horizon Reasoning Agent

Link: https://arxiv.org/abs/2605.24468
Authors: Yuyang Hu,Hongjin Qian,Shuting Wang,Jiongnan Liu,Ziliang Zhao,Jiejun Tan,Zheng Liu,Zhicheng Dou
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Long-horizon agentic reasoning requires large language models to act over long interaction histories containing thoughts, tool calls, observations, and partial conclusions. The challenge is not merely that these histories grow long, but that information needed for the current decision may be scattered across distant steps and only become relevant later. Existing approaches address this difficulty by truncating the interaction history, compressing it into shorter surrogates, or retrieving selected parts of it for reuse, but they do not explicitly model how access to past interaction should adapt to the agent’s evolving state. We instead cast long-horizon reasoning as a problem of state-adaptive memory. To this end, we propose State-Adaptive Memory (SAM), a standalone framework that consolidates ongoing interaction into compact memory cues while preserving raw trajectory pages for intent-driven recall. These cues are not treated as replacements for history; rather, they serve as lightweight handles that allow the agent to reconstruct temporally distant information according to its current needs, without retraining the underlying backbone. We further optimize the memory module through expert-guided supervision and reinforcement learning, aligning it with trajectory-level utility. Across BrowseComp, BrowseComp-ZH, WideSearch, and HLE, SAM consistently outperforms strong baselines over diverse agent backbones. Our results suggest that explicit memory modeling provides a simple and effective foundation for long-horizon agentic reasoning.

[AI-169] Balancing Fairness Privacy and Accuracy: A Multitask Adversarial Framework for Centralized Data-Driven Systems

Link: https://arxiv.org/abs/2605.24458
Authors: Imesh Ekanayake,Elham Naghizade,Jeffrey Chan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 Pages, 6 figures, IEEE TKDE

Abstract:The integration of fairness and privacy in centralized data-driven applications is critical, especially as these systems increasingly influence sectors with significant societal impact. Current methods rarely address privacy, fairness, and accuracy together, which can potentially compromise ethical standards and privacy regulations. However, balancing these three objectives is quite challenging since each objective often imposes conflicting requirements on the design and training of models, making it difficult to optimize one without compromising the others. This paper introduces a novel multitask adversarial model that treats fairness and privacy as integral objectives rather than afterthoughts, and learns a latent representation that hides sensitive attributes while preserving essential task-related information. Our approach dynamically balances fairness with accuracy and privacy through an optimized cost function with minimal performance loss even under strict conditions. Extensive testing on diverse datasets shows the ability of our model to achieve high standards of fairness and privacy without significant sacrifice to accuracy. Benchmarking against state-of-the-art privacy and fairness standards shows that our method enhances the robustness of privacy, fairness, and accuracy optimization, proving its adaptability across various datasets.

[AI-170] Code2UML: Agentic LLMs with context engineering for scalable software visualization

Link: https://arxiv.org/abs/2605.24453
Authors: Alin-Gabriel Văduva,Anca-Ioana Andreescu,Simona-Vasilica Oprea,Adela Bâra
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Model (LLM)-based code analysis tools are adopted to automate software documentation tasks. However, the scalability of these approaches to real codebases, where Intermediate Representations (IR) exceed LLM context limits, remains underexplored. This paper introduces an agentic architecture with context engineering for automated UML diagram generation from source code repositories. It employs a hierarchy of five specialized agents: PlannerAgent, AnalyzerAgent, DiagramAgent, CorrectorAgent and DependencyAnalyzerAgent, built on the Claude Agent SDK, each addressing a distinct cognitive subtask. A deterministic, importance-weighted IR compaction layer transforms full project IRs into diagram-specific views guaranteed to fit within token constraints, requiring no LLM calls and completing in milliseconds. We evaluate the system across 12 open-source repositories in 4 programming languages (Java, JavaScript, PHP, Python) and 7 UML diagram types, producing 84 observations assessed on 5 automated metrics. Results demonstrate high syntactic validity (mean: 91.5%, with component and deployment diagrams reaching 100%), strong relationship precision (mean: 0.858) and consistent structural quality (mean: 81.7/100, with cross-language variance of 3.1 points). Entity recall averaged 0.313, reflecting deliberate architectural prioritization over exhaustive coverage. A sensitivity analysis (31 to 4,578 IR entities) confirms that quality scores remain stable regardless of scale.
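A deterministic, importance-weighted compaction step that fits a token budget without any LLM call might look like the sketch below. The entity fields (`name`, `importance`, `tokens`), the greedy rule, and the example numbers are hypothetical; the abstract does not describe how the paper scores or selects entities.

```python
def compact_ir(entities, token_budget):
    """Keep the highest-importance entities whose token costs fit the budget.

    Ties are broken by name so repeated runs give the same view.
    """
    ranked = sorted(entities, key=lambda e: (-e["importance"], e["name"]))
    kept, used = [], 0
    for entity in ranked:
        if used + entity["tokens"] <= token_budget:
            kept.append(entity)
            used += entity["tokens"]
    return kept, used


# Hypothetical IR entities for a small project.
entities = [
    {"name": "StringUtils", "importance": 0.2, "tokens": 500},
    {"name": "OrderService", "importance": 0.9, "tokens": 400},
    {"name": "PaymentGateway", "importance": 0.7, "tokens": 350},
    {"name": "Order", "importance": 0.8, "tokens": 300},
]
view, used_tokens = compact_ir(entities, token_budget=1100)
```

A greedy pass like this trades recall for a hard size guarantee, which matches the abstract's remark that low entity recall reflects prioritization rather than exhaustive coverage.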

[AI-171] Benchmarking the Limits of In-Context Reinforcement Learning for Ad-Hoc Teamwork

Link: https://arxiv.org/abs/2605.24423
Authors: Yuheng Jing,Kai Li,Ziwen Zhang,Jiajun Zhang,Zeyao Ma,Jiaxi Yang,Lei Zhang,Zhe Wu,Jinmin He,Junliang Xing,Jian Cheng
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 41 pages, 14 figures

Abstract:In-Context Reinforcement Learning (ICRL) has enabled foundation agents to adapt instantaneously to novel tasks, yet its efficacy in Ad-Hoc Teamwork (AHT), where coordination with unknown partners is required, remains unexplored. To rigorously evaluate this, we introduce a large-scale benchmark ICRL4AHT, built upon a high-throughput JAX implementation of Overcooked-V2. Our benchmark includes a large, diverse teammate suite spanning both RL and heuristic policies, enabling controlled train-test shifts, and provides a reproducible end-to-end pipeline for teammate generation, learning-history collection, dataset construction, and online multi-episode evaluation. We evaluate representative history-conditioned ICRL algorithms, including Algorithm Distillation (AD) and Decision-Pretrained Transformer (DPT), across millions of transitions. Results reveal notable limitations: contrary to their success in single-agent domains, these baselines fail to exhibit robust test-time adaptation in multi-agent settings. Specifically, these methods frequently underperform random baselines across both unseen teammate and unseen layout tracks, with no clear in-context improvement over long horizons. These findings highlight the challenges of strategic inference under partial observability within the Overcooked-V2 AHT protocol, establishing our benchmark as a critical testbed for next-generation coordination algorithms.

[AI-172] Batch Normalization Amplifies Memorization and Privacy Risks

Link: https://arxiv.org/abs/2605.24420
Authors: Ngoc Phu Doan,Chongyan Gu,Ihsen Alouani
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Batch Normalization (BN) is widely adopted to enable faster convergence and more stable training of deep neural networks. However, its impact on privacy and memorization has remained largely unexplored. In this work, we investigate the effect of BN layers on the memorization of atypical or outlier samples and its implications for privacy leakage. We conduct an extensive empirical study using three complementary approaches: (i) unintended memorization of out-of-distribution training samples, (ii) per-sample influence measured via gradient norms, and (iii) susceptibility to membership inference attacks (MIA). Across multiple datasets and architectures, we consistently observe that BN substantially increases the memorization of outliers compared to models without BN. Critically, this amplified memorization translates directly into privacy vulnerabilities: models with BN exhibit significantly higher susceptibility to MIAs. We complement our empirical findings with a theoretical analysis showing that BN amplifies the per-step influence of outlier samples during training, providing mechanistic insight into this phenomenon. Our results highlight an underappreciated privacy risk associated with BN and provide both practical and theoretical insights into how normalization layers can amplify the influence of rare or sensitive training examples.

[AI-173] JT-SAFE-V2: Safety-by-Design Foundation Model with World-Context Data

Link: https://arxiv.org/abs/2605.24414
Authors: Junlan Feng,Fanyu Meng,Chong Long,Pengyu Cong,Duqing Wang,Yan Zheng,Yuyao Zhang,Xuanchang Gao,Ye Yuan,Yunfei Ma,Zhijie Ren,Fan Yang,Na Wu,Di Jin,Chao Deng
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:We introduce JT-Safe-V2, a large language model designed to advance the safety and trustworthiness of foundation models, extending our previous JT-Safe model toward a more comprehensive safety-by-design paradigm. JT-Safe-V2 emphasizes the joint optimization of general intelligence and safety-by-design through several key innovations: enriching pre-training data with contextual world knowledge, high-certainty pre-training procedures, and safety strengthening post-training mechanisms for enterprise-oriented agentic capabilities. Building on these safety-enhanced foundation models, we propose Safe-MoMA (Safe Mixture of Models and Agents), a framework that enables traceable and efficient inference through the orchestrated deployment of multiple models and agents. Extensive evaluations demonstrate that JT-Safe-V2 achieves state-of-the-art performance across both general intelligence and safety benchmarks. Moreover, Safe-MoMA reduces inference costs by more than 30% compared to using the largest standalone model baseline while maintaining comparable performance. To facilitate future research on safety-by-design foundation models, we publicly release the post-trained JT-Safe-V2-35B model checkpoint.

[AI-174] The Model Is Not the Product: A Dual-Pillar Architecture for Local-First Psychological Coaching

Link: https://arxiv.org/abs/2605.24411
Authors: Alexander Mihalcea
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 3 figures

Abstract:Existing language model applications struggle to meet the demand for emotionally oriented support, primarily due to their inability to maintain deep, persistent context across sessions. This report introduces Psych LM, an iOS application that validates the thesis that, for such applications, the surrounding architecture is paramount. Psych LM runs a local, on-device language model within a purpose-built, local-first runtime designed for behavioral and life-coaching applications. The system achieves the practical effect of a near-infinite context window through an automated, user-inspectable memory corpus that converts conversations into structured memory cards, including facts, goals, and events, and dynamically injects them into the prompt via semantic and vector search. As such, the system can be defined as an active-learning, retrieval-augmented generative, on-device architecture. This architecture delivers four primary contributions: a local-first design where privacy is a core property; a detailed description of the memory corpus for persistent context of key user information; a deterministic orchestration layer that provides a stable behavioral spine independent of the model’s internal state; and a benchmark framework focused on evaluating the integrated system’s reliability under realistic operating conditions. The R&D process confirms that complex, context-aware interaction can be reliably achieved under the strict constraints of a mobile environment by prioritizing architectural control and resource management over simple model size.
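The retrieve-and-inject step for memory cards can be sketched in a few lines. Everything here is a stand-in: the card schema, the `embed` function (a toy bag-of-words vector in place of a learned embedding model), and the prompt layout are assumptions, not details taken from Psych LM.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a real system would use a learned embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_prompt(cards, user_message, top_k=2):
    """Rank memory cards by similarity to the message and prepend the best ones."""
    query = embed(user_message)
    ranked = sorted(cards, key=lambda c: cosine(query, embed(c["text"])), reverse=True)
    memory = "\n".join(f"[{c['kind']}] {c['text']}" for c in ranked[:top_k])
    return f"Relevant memory:\n{memory}\n\nUser: {user_message}"


# Hypothetical memory cards of the three kinds named in the abstract.
cards = [
    {"kind": "fact", "text": "user works night shifts at a hospital"},
    {"kind": "goal", "text": "user wants to run a marathon in october"},
    {"kind": "event", "text": "user argued with sister last week"},
]
prompt = build_prompt(cards, "I am nervous about marathon training")
```

Only the top-ranked cards enter the prompt, so the context stays small no matter how large the memory corpus grows.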

[AI-175] Advancing Graph Few-Shot Learning via In-Context Learning KDD26

Link: https://arxiv.org/abs/2605.24410
Authors: Renchu Guan,Yajun Wang,Chunli Guo,Bowen Cao,Fausto Giunchiglia,Wei Pang,Yonghao Liu,Xiaoyue Feng
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: KDD26

Abstract:Graph few-shot learning, which aims to classify nodes from novel classes with only a few labeled examples, is a widely studied problem in graph learning. However, existing methods often face two key limitations. First, the predominant graph few-shot learning paradigm relies on supervised tasks, failing to leverage the vast number of unlabeled nodes in the graph. Second, many approaches require complex task adaptation or fine-tuning during inference, limiting their efficiency and applicability. Inspired by the powerful in-context learning capabilities of large language models, we propose a novel model named VISION for adVancIng graph few-Shot learning via In-cOntext LearNing to address these challenges. Our model reframes graph few-shot learning as a fine-tuning-free sequence reasoning problem. At its core is a context-aware network that initializes nodes with role embeddings and employs a dual-context fusion module to synergistically integrate local topological structures and global task-level dependencies. This allows our model to dynamically generate class-aware representations for the query set conditioned on the support set context in a single forward pass. To effectively train our model, we introduce an unsupervised task generator that creates structure-adaptive features and constructs diverse pseudo-tasks from abundant unlabeled data. Our method unifies unsupervised meta-learning with graph in-context learning, achieving efficient inference. Extensive experiments on multiple benchmark datasets demonstrate the superiority of our model. Our public code can be found

[AI-176] Generative OOD-regularized Model-based Policy Optimization

Link: https://arxiv.org/abs/2605.24405
Authors: Aysin Tumay,Jiahe Huang,Elise Jortberg,Rose Yu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:We study sequential decision-making with offline reinforcement learning (RL). Traditional offline RL policies may result in out-of-distribution (OOD) actions when training relies only on sparse offline representations. To ensure safe offline policies in a sparse state-action space, we explore how density estimation models can be integrated into model-based RL methods to avoid the OOD regions. Generative models are capable of explicitly modeling the density in sparse state-action spaces. Building on this, we introduce Generative OOD-regularized Model-based Policy Optimization (GORMPO), a density-regularized offline RL algorithm that uses generative density modeling to restrict policy updates to high-density areas of the dataset. Furthermore, we examine whether better OOD detection corresponds to better model-based offline policies. We compare (1) the OOD detection capabilities of various density estimators and (2) their performance within the GORMPO framework on a real-world medical dataset and sparse offline RL datasets. We theoretically guarantee GORMPO’s performance under mild assumptions. Empirically, GORMPO outperforms state-of-the-art baselines by 17% on a real-world medical dataset and enhances the base model on the offline RL datasets. Our empirical findings show that better OOD detection generally results in improved policies in environments with stable dynamics, while conservative penalties with poor density estimation are favored when dynamics are uncertain.

[AI-177] ConceptM3oE: Concept-Guided Multimodal Mixture of Experts for Interpretable Computational Pathology

Link: https://arxiv.org/abs/2605.24399
Authors: Xuan Wang,Zhongling Xu,Gopi Kannedhara,Joakim Nguyen,Jian Yu,Jinrui Fang,Abdurrahmaan Baghdadi,Tianlong Chen,Awais Naeem,Chandra Krishnan,Edward Castillo,Andrew H. Song,Ankita Shukla,Ying Ding,Nicholas Konz,Hairong Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Healthcare models are transitioning from unimodal prediction toward multimodal reasoning over heterogeneous diagnostic inputs. In computational pathology, for complex tumor subtypes where morphology alone can be challenging to distinguish, pathology reports and molecular measurements may provide additional diagnostic evidence alongside whole-slide images, yet existing models often fail to clarify how diverse signals assemble into recognizable diagnostic concepts. We propose ConceptM$^3$oE (Concept Multimodal MoE), which embeds concept formation directly within interaction-aware mixture-of-experts (MoE) pathways. The architecture decomposes evidence into modality-specific, redundant, and synergistic experts, which are then projected into structured concept bottlenecks mapping latent features to a hierarchy of morphology and biomarker concepts. To prevent the information loss typical of interpretable bottlenecks, we utilize residual pathways within each expert to allow task-relevant signals to flow both through the concepts and directly to the final task prediction, so that high performance is maintained alongside interpretability. Across an institutional pediatric brain tumor cohort and a public glioma cohort, the framework delivers competitive performance to unconstrained models while producing reasoning traces validated by an independent neuropathologist. In data-limited regimes, ConceptM$^3$oE improves limited-data performance, increasing macro-F1 from 56.41% to 66.70% at small training sizes compared to non-concept-informed baselines, while also showing faster training convergence consistent with the regularizing effect of concept learning. This work offers a scalable path toward high-performance medical AI that is inherently verifiable and better aligned with the complex decision-making of clinical practice.

[AI-178] Understanding and Mitigating Premature Confidence for Better LLM Reasoning

Link: https://arxiv.org/abs/2605.24396
Authors: Jingchu Gai,Guanning Zeng,Christina Baek,Chen Wu,J.Zico Kolter,Andrej Risteski,Aditi Raghunathan
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Long chains of thought (CoT) from current language models frequently contain logical gaps and unjustified leaps, limiting the gains from additional test-time compute. Improving reasoning quality directly would require process reward models, but the step-level annotations needed to train them are expensive and scarce. We find such a signal in how the model’s confidence evolves during reasoning: premature confidence, the tendency to commit to an answer early and use the remaining tokens to rationalize it, strongly predicts flawed reasoning across tasks and model scales. We exploit this in progressive confidence shaping, a reinforcement learning objective that trains models to update their confidence as they reason rather than commit early – rewarding gradual confidence growth and penalizing early commitment, with no external labels or reward models. The method improves accuracy and reasoning quality from 1.5B to 8B parameters across arithmetic (Countdown), math (DAPO, AIME), and science (ScienceQA): on Countdown, accuracy improves 3.2x (+42.0pp) and flawed reasoning drops 48pp; on AIME, Pass@64 improves 6.6pp. Consistent with this mechanism, the method also improves faithfulness: on a safety benchmark, our models more transparently surface misleading content in their reasoning traces rather than concealing it. Controlled experiments reveal that the problem and its remedy scale together: premature confidence grows with model size and task difficulty, and so do the gains from addressing it.

[AI-179] MX-SAFE: Versatile Inference- and Training-Proof Microscaling Format with On-the-Fly Exponent and Mantissa Bit Allocation DATE2026

Link: https://arxiv.org/abs/2605.24391
Authors: Dahoon Park,Jahyun Koo,Sangwoo Hwang,Jaeha Kung
Affiliation: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: 7 pages, 7 figures, accepted DATE2026

Abstract:As the demand for deep learning grows, cost reduction through quantization has become essential for both training and inference. In 2022, the Open Compute Project (OCP) consortium standardized narrow precision formats for deep learning, called the microscaling (MX) format. The MX format is a hardware-friendly dynamic quantization scheme that effectively reduces the data size by sharing an 8-bit exponent across multiple operands. The MX format can be categorized into two types with their own strengths: (i) MXINT which focuses on a high precision consisting only of mantissa bits and (ii) MXFP which focuses on a wider dynamic range by allowing local exponent bits. In this work, we present a versatile MXFP format, called MX-SAFE (MXSF in short), that adaptively uses two modes, i.e., a wider mantissa mode (FP8 E2M5) and a subnormal FP mode (FP5 E3M2), to support both training and direct-cast inference. Furthermore, we propose a tile-based block design to increase hardware efficiency by reducing the burden of re-quantization process during the training with the MXSF format. Owing to the use of the proposed MXSF format, 0.05%/11.1% and 3.55%/3.57% improvements in accuracy, on average, for inference/full-training compared to MXFP8 E2M5 and MXFP8 E4M3 are observed, respectively. Moreover, we present a training-inference accelerator that supports the MXSF format and it achieves similar accuracy to the BF16 baseline while using 24.9% less total energy consumption.
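The core idea of microscaling, one shared power-of-two exponent per block with narrow per-element values, can be shown with an integer-element variant. This is a simplified sketch, not a bit-exact implementation of the OCP MX specification or of the MXSF format proposed in the paper; the function names and the rule for choosing the shared exponent are assumptions.

```python
import math


def mx_quantize(block, elem_bits=8):
    """Quantize a block to signed integers that share one power-of-two scale."""
    qmax = 2 ** (elem_bits - 1) - 1
    amax = max(abs(x) for x in block)
    if amax == 0:
        return 0, [0] * len(block)
    # Smallest exponent for which the largest magnitude still fits in qmax.
    shared_exp = math.ceil(math.log2(amax / qmax))
    scale = 2.0 ** shared_exp
    q = [max(-qmax, min(qmax, round(x / scale))) for x in block]
    return shared_exp, q


def mx_dequantize(shared_exp, q):
    """Recover approximate real values from the shared exponent and integers."""
    scale = 2.0 ** shared_exp
    return [v * scale for v in q]


block = [0.75, -1.5, 0.03125, 3.0]
shared_exp, q = mx_quantize(block)
restored = mx_dequantize(shared_exp, q)
```

Storage per block is one exponent plus `elem_bits` per element, which is where the data-size reduction comes from. Elements much smaller than the block maximum lose precision; a format such as MXFP addresses this with local exponent bits.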

[AI-180] A governance horizon for ethical-use constraints in open-weight AI models

Link: https://arxiv.org/abs/2605.24383
Authors: Weiwei Xu,Hengzhi Ye,Haoran Ye,Kai Gao,Vladimir Filkov,Minghui Zhou
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Ethical constraints on open-weight AI models are both a reflection of societal concerns and a foundation for AI governance policy. They are expected to propagate to downstream derivatives while implemented as voluntary metadata disclosures that must be restated at each generation of reuse. We audit 2,142,823 model repositories on Hugging Face Hub to test whether this disclosure-based governance infrastructure can sustain traceability across deep model lineages. Restriction evidence decays with a half-life of 1.31 derivation steps ($R^2 = 0.98$), and beyond seven downstream generations at least 80% of descendant models lack sufficient public evidence for a governance determination, a depth boundary we formalize as the governance horizon. Platform-level interventions to restore missing licence metadata reveal that policy design (not enforcement alone) is the binding factor: inheritance-only designs require near-complete enforcement to move the horizon, whereas a mandatory-declaration design that explicitly resolves orphan lineage components shifts the horizon already at moderate enforcement. The structural bottleneck is lineages with no inheritable upstream intent: such orphan components remain undecidable under any inheritance-only policy regardless of enforcement rate, and unresolved upstream nodes additionally create direct downstream undecidability bottlenecks that inheritance rules alone cannot recover. Comparison with PyPI, where governance signals are carried by explicit machine-readable declarations, corroborates that the collapse is topology-specific to open-weight derivation rather than inherent to open ecosystems. These results establish that disclosure-based governance has a shallow, structurally determined reach in open-weight AI, and that achieving deep supply-chain accountability requires provenance mechanisms propagating governance signals through derivation itself.
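To get a feel for what a half-life of 1.31 derivation steps implies, the decay can be tabulated by depth. The sketch assumes pure exponential decay, which is an assumption: the abstract reports only the half-life and the fit quality, not the functional form. The 80% figure in the abstract is a separate measurement and is not derived from this curve.

```python
HALF_LIFE = 1.31  # derivation steps, as reported in the abstract


def retained_fraction(depth, half_life=HALF_LIFE):
    """Fraction of restriction evidence left after `depth` derivation steps.

    Assumes pure exponential decay; the abstract gives only the half-life.
    """
    return 0.5 ** (depth / half_life)


# Print the assumed decay curve for the first seven downstream generations.
for depth in range(0, 8):
    print(depth, round(retained_fraction(depth), 3))
```

Under this assumption the evidence roughly halves every 1.31 steps, so only a small share survives by the seventh generation.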

[AI-181] Assessing the Operational Viability of Foundation Models for Time Series Forecasting

Link: https://arxiv.org/abs/2605.24381
Authors: Kavin Soni,Debanshu Das,Vamshi Guduguntla
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP); Machine Learning (stat.ML)
Comments: 21 pages, 8 Figures, Code available at [ this https URL ]

Abstract:Time series forecasting drives operational decisions in areas like finance, transportation, and energy. While supervised learning approaches achieve strong performance, they require domain-specific training, feature engineering, and ongoing maintenance. Large-scale foundation models have recently emerged as a zero-shot alternative, avoiding task-specific training much like LLMs. In this work, we evaluate foundation models against standard supervised approaches. Rather than focusing solely on aggregate accuracy, we analyze performance across four operational regimes: periodic human-centric systems, physically constrained processes, stochastic financial markets, and heterogeneous demand forecasting. Our results characterize optimal deployment areas. Foundation models perform well in domains with transferable periodic structures and are efficient for cold-start or long-tail scenarios. Conversely, supervised specialists maintain higher precision in systems governed by strict physical constraints. In financial domains, newer foundation models are rapidly closing the performance gap with supervised specialists. We further quantify trade-offs in inference latency, data drift adaptability, and deployment constraints. Finally, we propose a Complexity Router that assigns each series to the optimal model class using empirical features. We demonstrate that this selective routing achieves higher accuracy and significantly lower inference costs compared to deploying a universal foundation model, providing a practical framework for balancing generalization and efficiency.
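A per-series router driven by simple empirical features could take the shape below. This is a hypothetical illustration, not the paper's Complexity Router: the two features (history length and seasonal autocorrelation), the thresholds, and the labels are assumptions, chosen only to mirror the abstract's observation that foundation models suit cold-start and periodic series.

```python
def autocorr(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0 or lag >= n:
        return 0.0
    cov = sum((series[i] - mean) * (series[i - lag] - mean) for i in range(lag, n))
    return cov / var


def route(series, season=7, min_history=28, periodic_threshold=0.5):
    """Return the model class to use for one series (thresholds are made up)."""
    if len(series) < min_history:
        return "foundation"  # cold start: too little history to train a specialist
    if autocorr(series, season) >= periodic_threshold:
        return "foundation"  # strong seasonal structure
    return "supervised"


weekly = [i % 7 for i in range(56)]                         # repeats every 7 steps
no_weekly_pattern = [1 if i % 2 == 0 else -1 for i in range(56)]
choices = [route(weekly), route(no_weekly_pattern), route([3, 1, 4])]
```

Routing costs only a few passes over each series, so it adds little overhead compared with running a large model on every input.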

[AI-182] Distilling Game Code World Model Generation into Lightweight Large Language Models

Link: https://arxiv.org/abs/2605.24375
Authors: Tyrone Serapio,Arjun Prakash,Haoyang Xu,Kevin Wang,Amy Greenwald
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have shown great ability in generating executable code from natural language, opening the possibility of automatically constructing environments for AI agents. Recent work on Code World Models (CWMs) demonstrates that LLMs can translate game rules into Python implementations compatible with solvers like Monte Carlo Tree Search. We study this problem in game settings, where generated environments must implement rules, legal actions, state transitions, observations, and rewards. We refer to these game-specific executable models as Game Code World Models (GameCWMs). However, current approaches to generating code world models rely on frontier models and inference-time refinement loops, limiting accessibility and scalability. This work investigates whether GameCWM generation capabilities can be distilled into smaller models through post-training. We introduce: (1) a curated dataset of 30 games spanning perfect and imperfect information games, (2) a verification framework that evaluates generated code against structural and semantic game properties, and (3) a post-training pipeline combining Supervised Fine-Tuning (SFT) with Reinforcement Learning with Verifiable Rewards (RLVR). We experiment with Qwen2.5-3B-Instruct and find that SFT can increase syntactic correctness, while RLVR can improve execution-level adherence to game rules, thereby improving Qwen’s ability to generate valid GameCWMs in both perfect and imperfect information games. Overall, our pipeline makes Qwen2.5-3B-Instruct more capable of generating valid GameCWMs, thereby offering a scalable path toward automatic environment generation from natural language.

[AI-183] Treatment Effect Estimation with Differentiated Networked Effect on Graph Data KDD2026

Link: https://arxiv.org/abs/2605.24358
Authors: Xiaofeng Lin,Han Bao,Hisashi Kashima
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by the research track of the KDD 2026 conference

Abstract:Estimating individual treatment effect (ITE) from observational graph data is crucial for decision-making in fields such as commerce and medicine. This task is challenging due to interference, where individual outcomes can be influenced by the treatments and covariates of their neighbors. Existing methods attempt to model such interference for accurate ITE estimation. However, a critical issue is often overlooked: differentiated networked effect (DNE), an effect caused by local networks consisting of neighbors with varying importance and scales. Capturing DNE is vital; otherwise, we will end up with imprecise ITE estimation due to an erroneous characterization of interference, which can result in misguided decisions. To address this challenge, we propose a novel interference modeling mechanism that incorporates two partial attention mechanisms and a message amplifier. The partial attention mechanisms automatically estimate the importance of different neighbors in contributing to interference, while the message amplifier adjusts the results of the interference modeling mechanism based on the scale of neighbors, all of which enables the model to capture DNE. Experiments on three real-world graphs demonstrate that our methods outperform existing approaches for ITE estimation from graph data, which corroborates the importance of explicitly capturing DNE.

[AI-184] Partner-Aware Hierarchical Skill Discovery for Robust Human-AI Collaboration

Link: https://arxiv.org/abs/2605.24352
Authors: Adnan Ahmad,Bahareh Nakisa,Mohammad Naim Rastgoo
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Multi-agent collaboration, especially in human-AI teaming, requires agents that can adapt to novel partners with diverse and dynamic behaviors. Conventional Deep Hierarchical Reinforcement Learning (DHRL) methods focus on agent-centric rewards and overlook partner behavior, leading to shortcut learning, where skills exploit spurious information instead of adapting to partners’ dynamic behaviors. This limitation undermines agents’ ability to adapt and coordinate effectively with novel partners. We introduce Partner-Aware Skill Discovery (PASD), a DHRL framework that learns skills conditioned on partner behavior. PASD introduces a contrastive intrinsic reward to capture patterns emerging from partner interactions, aligning skill representations across similar partners while maintaining discriminability across diverse strategies. By structuring the skill space based on partner interactions, this approach mitigates shortcut learning and promotes behavioral consistency, enabling robust and adaptive coordination. We extensively evaluate PASD in the Overcooked-AI benchmark with a diverse population of partners characterized by varying skill levels and play styles. We further evaluate the approach with human proxy models trained from human-human gameplay trajectories. PASD consistently outperforms existing population-based and hierarchical baselines, demonstrating transferable skill learning that generalizes across a wide range of partner behaviors. Analysis of learned skill representations shows that PASD adapts effectively to diverse partner behaviors, highlighting its robustness in human-AI collaboration.

[AI-185] Adaptive Human-AI Coordination via Hierarchical Action Disentanglement

Link: https://arxiv.org/abs/2605.24343
Authors: Adnan Ahmad, Bahareh Nakisa, Mohammad Naim Rastgoo
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Human-AI collaboration requires agents that can adapt to diverse partner behaviors and skill levels while remaining robust to unseen partners. Existing methods often collapse to a single dominant behavior or learn poorly aligned skills, limiting effective coordination. We propose Intrinsic Action Disentanglement (IAD), a deep hierarchical reinforcement learning (DHRL) framework that learns distinct, partner-aware low-level action sequences conditioned on high-level latent skills. IAD introduces an intrinsic reward that explicitly encourages disentangled action distributions of the agent’s low-level policy across skills, yielding an interpretable mapping between high-level decisions and partner-specific behavioral responses. By capturing temporally extended interaction patterns, IAD enables flexible adaptation to heterogeneous partner dynamics under distributional shift. We evaluate IAD in the Overcooked-AI domain across multiple layouts and diverse partner settings, including unseen simulated partners, a human-proxy model trained on human-human gameplay, and real human partners. Results show that IAD consistently outperforms strong baselines and achieves more reliable, adaptive coordination across all settings.

[AI-186] ScaleAcross Explorer: Exploring Communication Optimization for Scale-Across AI Model Training

Link: https://arxiv.org/abs/2605.24326
Authors: Minghao Li, Alicia Golden, Samuel Hsia, Michael Kuchnik, Adi Gangidi, Xu Zhang, Ashmitha Jeevaraj Shetty, Zachary DeVito, Weiwei Chu, Dong He, Haoci Zhang, Yuchen Hao, Ruoming Pang, James Hongyi Zeng, Ying Zhang, Minlan Yu, Carole-Jean Wu
Institution: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments: 28 pages, 27 figures

Abstract:The rapid scaling of large language model training requires distributing GPU resources across multiple data center buildings and regions. We refer to this paradigm as “scale-across” training. As infrastructure expands, the system design space becomes increasingly intricate, encompassing new model architectures, hardware heterogeneity, and evolving communication patterns. Drawing from Meta’s production experience, we highlight the complexities of deploying training jobs across a few data centers housing hundreds of thousands of GPUs. To accelerate exploration of the large design space and to enable efficient training for frontier model development, we conduct an in-depth characterization of three key design dimensions: parallelism placement, parallelism scheduling, and network layer technologies. We then propose ScaleAcross Explorer, an optimizer that considers the interplay of design dimensions and holistically optimizes scale-across training. Testbed experiments and simulations demonstrate up to 64.62% training speedups over production configuration and up to 37.59% training speedups over the state-of-the-art baseline across a wide range of design points.

[AI-187] ChaosBench-Logic v2: Evaluating LLM Logical Reasoning over Dynamical Systems at Scale ICLR2026

Link: https://arxiv.org/abs/2605.24305
Authors: Noel Thomas
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, 8 figures. Published at the ICLR 2026 Workshop on LLM Reasoning

Abstract:Standard accuracy on binary reasoning benchmarks hides critical failure modes: prior collapse, inconsistency under paraphrase, and inability to reason about parameter-dependent dynamics. We present ChaosBench-Logic v2, a 40,886-question benchmark over 165 dynamical systems with 27 FOL predicates and 78 axiom edges, together with CARE (Calibration- and Adversarial-Robust Evaluation), a protocol that surfaces these pathologies. Evaluating 14 models, we find that regime-transition reasoning remains near random (MCC = 0.05) even for frontier models, whereas FOL deduction with given premises reaches MCC = 0.52. Per-family decomposition shows that the proprietary-model advantage concentrates on cross-indicator (+0.40) and consistency tasks, while open-source Qwen 2.5-32B dominates indicator diagnostics (0.91 vs. 0.45). Two models exhibit negative MCC on bifurcation questions, confirmed as systematic anti-correlation via confusion-matrix analysis.
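
The abstract reports results as Matthews correlation coefficients (MCC) rather than accuracy. As a reminder of why that matters for the "prior collapse" and "anti-correlation" failure modes it mentions, here is a minimal sketch of the standard MCC formula for a binary confusion matrix. The counts are made up for illustration and are not taken from the paper.

```python
import math


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: MCC is 0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Illustrative counts only (assumed, not from the paper).
# A model that always answers TRUE on a balanced set: 50% accuracy, MCC 0.
always_true = mcc(tp=50, fp=50, tn=0, fn=0)
# A model that is wrong more often than right: 20% accuracy, negative MCC.
inverted = mcc(tp=10, fp=40, tn=10, fn=40)
print(always_true, inverted)
```

A collapsed predictor and a coin flip both score near zero under MCC, and a negative value signals that flipping the model's answers would do better than the model itself.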

[AI-188] Enhancing Reliability in LLM-Based Secure Code Generation

Link: https://arxiv.org/abs/2605.24300
Authors: Mohammed F. Kharma, Mohammad Alkhanafseh, Ahmed Sabbah, David Mohaisen
Institution: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 15 pages; 7 tables; 3 figures

Abstract:Large language models (LLMs) are widely used for code generation, but their security reliability remains inconsistent across languages and prompting strategies. Existing prompt engineering improves functional correctness but rarely ensures consistent security outcomes. We introduce the Mitigation-Aware Chain-of-Thought (MA-CoT) framework, which embeds task-specific CWE mitigation guidance and language-aware safeguards to reduce recurring vulnerabilities in generated code. We evaluate MA-CoT across three LLMs (gpt-5, claude-4.5, gemini-2.5), three programming languages (C, Java, Python), and four prompting strategies (Vanilla, Zero-shot, CoT, MA-CoT) on a 200-task primary dataset, with external validation on LLMSecEval. Using static analysis with expert validation, MA-CoT reduces total security findings from 92 to 39 (57.6%) on the primary dataset and from 73 to 4 (94.5%) on LLMSecEval. High-severity findings (Blocker + Critical) drop from 90 to 39 (56.7%) and from 45 to 2 (95.6%), respectively. Across both datasets, MA-CoT is the only strategy that consistently improves security reliability; Zero-shot and CoT are less reliable and may increase vulnerability, especially in C. We further introduce a strict layered attribution of vulnerability drivers (language-core vs. stack layers) and show that residual risk concentrates in hardening-oriented patterns (e.g., OS- and toolchain-dependent), motivating secure-by-construction primitives alongside prompting.

[AI-189] An Empirical Evaluation of LLM-Generated Code Security Across Prompting Methods

Link: https://arxiv.org/abs/2605.24298
Authors: Mohammed Kharma, Ahmed Sabbah, Mohammad Alkhanafseh, Mohammad Hammoudeh, David Mohaisen
Institution: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 40 pages, 22 tables, 8 figures

Abstract:The growing use of Large Language Models (LLMs) for automated code generation has enhanced software development efficiency, but often at the cost of security. Generated code frequently overlooks critical concerns, leaving it vulnerable to issues such as weak encryption and improper input validation. To investigate this problem, we present a comprehensive empirical evaluation of the security quality of LLM-generated code across five LLMs and four programming languages (Java, C++, C, and Python), examining the impact of multiple prompt engineering methods. We introduce a weaknesses-aware zero-shot chain-of-thought (WA-0CoT) prompting strategy that enriches prompts with security context using CWE mappings to guide model reasoning. Our empirical analysis, supported by chi-square tests, finds no statistically significant reductions in vulnerability frequency or density across prompt methods. However, prompting strategies, including WA-0CoT, systematically influence the compositional distribution of CWE categories, with effects varying by programming language. These findings suggest that while security-aware prompting alters the structure of generated weaknesses, prompt engineering alone is insufficient to reliably reduce overall vulnerability levels. The results highlight the importance of language-aware and model-aware prompt design when evaluating the security properties of LLM-generated code.
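
The abstract's "no statistically significant reduction" claim rests on chi-square tests. The sketch below shows what such a test of independence looks like on a prompt-method by outcome table. The counts are invented for illustration, the three-method layout is an assumption, and the paper's actual test setup may differ.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of prompt method and outcome.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical counts: rows are prompt methods, columns are
# (vulnerable samples, non-vulnerable samples). Not data from the paper.
observed = [
    [30, 70],  # baseline prompt
    [28, 72],  # zero-shot prompt
    [33, 67],  # security-aware prompt
]
stat, dof = chi_square_statistic(observed)

# Critical value of the chi-square distribution at alpha = 0.05 with 2 degrees of freedom.
CRITICAL_0_05_DOF_2 = 5.991
significant = stat > CRITICAL_0_05_DOF_2
print(stat, dof, significant)
```

With counts this close, the statistic stays far below the critical value, which is the shape of result the abstract describes: the methods shift which weaknesses appear without measurably changing how many.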

[AI-190] Concept Drift Adaptation Using Self-Supervised and Reinforcement Learning In Android Malware Detection

Link: https://arxiv.org/abs/2605.24294
Authors: Ahmed Sabbah, Mohammad Kharma, Mohammad Alkhanafseh, Radi Jarrar, Samer Zein, David Mohaisen
Institution: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 2 figures, 2 tables

Abstract:Android malware detectors often degrade after deployment because of concept drift, while full retraining at each maintenance step is costly. We propose a chronological adaptive maintenance framework that models deployment-time maintenance as a sequential decision problem. The framework learns a stable latent representation through self-supervised learning during initialization, freezes the encoder, measures latent drift in the fixed representation space, and performs lightweight downstream adaptation using a trainable adapter and classification head. A proximal policy optimization controller selects low-cost maintenance actions based on the detector state, including current utility, retention on a fixed memory set, latent drift indicators, and update cost. We evaluate the framework under a causal deployment-style protocol on emulator and real Android malware datasets with static and dynamic features. Results show that the RL controller provides a strong cost-aware adaptation strategy, consistently remaining among the top-performing policies while achieving a favorable balance between temporal performance, memory retention, and maintenance cost under non-stationary deployment conditions.

[AI-191] Safety-Oriented Routing Analysis of Mixtral MoE Under Benign and Harmful Prompts

Link: https://arxiv.org/abs/2605.24270
Authors: Md Nurul Absar Siddiky
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:Sparse mixture-of-experts (MoE) language models activate only a small subset of parameters for each token, making router behavior a central part of model computation. This paper studies routing behavior of Mixtral 8x7B-Instruct under benign and harmful prompts using two complementary signals: activation-based routing scores derived from expert selection frequencies and gradient-based scores derived from router-gate sensitivities. We analyze expert- and layer-level routing behavior and conduct expert-suppression interventions. The results show that activation-based expert usage is broad and long-tailed, whereas gradient-based importance is concentrated. At expert level, benign and harmful prompt groups remain close under both signals with modest separation. At layer level, activation-based routing is most selective around layers 8-15, while gradient-based importance is concentrated in final layers. Expert classification shows most experts are shared across benign and harmful prompts, though a limited subset shows clear group preference. Top-ranked expert sets show stronger benign-malicious overlap under gradient scores than activation scores, suggesting concentration on a common late-layer expert set. In intervention experiments, suppressing top five benign-dominant experts from activation scores reduces restricted responses from 24 to 14 over 100 prompts, while suppressing gradient-derived experts reduces them from 34 to 22 with fewer unintended reversals. Overall, safety-relevant routing in Mixtral is subtle, depth-dependent, and distributed rather than dominated by a fixed set of experts.

[AI-192] Attested Tool-Server Admission: A Security Extension to the Model Context Protocol

Link: https://arxiv.org/abs/2605.24248
Authors: Alfredo Metere
Institution: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Abstract:The Model Context Protocol (MCP) standardizes how a large-language-model (LLM) agent and an external tool server exchange messages, but not trust: a host reads a server’s self-declared tool list and dispatches calls, with no notion of which servers it may use, at what sensitivity, or which of a server’s tools are in bounds. This work grew out of a concrete need – letting the Enclawed agent use Google’s externally-operated MCP servers (Gmail, Calendar, Drive) safely, admitting the server and bounding the tools it may drive, without changing MCP or Enclawed’s own tool application-programming interface (API). The mechanism we built, mcp-attested (shipped in both the open enclawed-oss distribution and the enclaved flavor), generalizes: the gap that makes an unmediated third-party connection unsafe for one user makes a regulated deployment impossible to accredit. We close it with three additive mechanisms: (1) a small, offline-signed clearance assertion a server publishes at a well-known Uniform Resource Identifier (URI) and a host verifies against a pinned trust root before any tool dispatch; (2) a deny-by-default per-server tool allowlist, so admitting a server is not trusting its every tool; and (3) a flavor-gated enforcement mode that turns the checks from warnings into hard denials, with every decision written to a tamper-evident audit log. We give the wire format, the verification algorithm, a security analysis, and an LLM-driven adversarial evaluation; we then state the design in normative Request-for-Comments (RFC 2119) form – schema, verification rules, error registry, well-known registration, and machine-checkable conformance vectors – so it can be adopted as an MCP addendum rather than reinvented. An unextended host ignores the well-known document and behaves exactly as today.
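
Two of the mechanisms listed in the abstract, the deny-by-default per-server tool allowlist and the enforcement mode that turns warnings into hard denials, can be pictured with a small sketch. Every name here (`ServerPolicy`, `may_dispatch`, the server and tool identifiers) is hypothetical and is not the mcp-attested API. Signature verification of the clearance assertion is deliberately left out and represented only by the `admitted` flag.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ServerPolicy:
    """Hypothetical per-server policy; the defaults deny everything."""
    admitted: bool = False  # stands in for a verified clearance assertion
    allowed_tools: frozenset = field(default_factory=frozenset)


def may_dispatch(policies, server, tool, enforce=True):
    """Return (dispatch_allowed, decision) for a tool call on a server.

    Unknown servers and unlisted tools are refused by default. With
    enforce=False the same check only warns, mirroring the abstract's
    warning-versus-hard-denial enforcement mode.
    """
    policy = policies.get(server, ServerPolicy())
    if policy.admitted and tool in policy.allowed_tools:
        return True, "allow"
    return (False, "deny") if enforce else (True, "warn")


# Hypothetical configuration: one admitted server with a single allowed tool.
policies = {
    "mail.example": ServerPolicy(admitted=True,
                                 allowed_tools=frozenset({"read_message"})),
}
print(may_dispatch(policies, "mail.example", "read_message"))
print(may_dispatch(policies, "mail.example", "delete_message"))
print(may_dispatch(policies, "unknown.example", "read_message"))
```

The point of the design is that admitting a server and trusting a tool are separate decisions: the second call is refused even though the server itself was admitted.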

[AI-193] Unlocking Apple’s Private Cloud Compute: An Analysis of Privacy-Preserving Artificial Intelligence

Link: https://arxiv.org/abs/2605.24239
Authors: Yannik Dittmar, Marvin Jerome Stephan, Thomas Völkl, Matthias Hollick, Jiska Classen
Institution: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Many existing Artificial Intelligence (AI) solutions on mobile devices rely on an extensive collection of sensitive data, raising privacy concerns and often requiring storage for both context and model improvement. Apple’s Private Cloud Compute (PCC) aims to address this by emphasizing mobile device integration and a privacy-first design. The central claim of PCC is that it does not store any user data and that user input and user accounts are unlinkable. While most of the PCC system specifications are public, compiled binaries add a layer of opaqueness. There are no reproducible builds, and there are no symbols within those binaries, creating potential discrepancies between the specification and what is shipped to the user. Additionally, the underlying models and interfaces for querying PCC are not openly accessible, limiting academic evaluation of model properties, such as accuracy. This poses a challenge in assessing whether a privacy-preserving approach like PCC is actually trustworthy while also providing high-quality answers. We are the first to reverse-engineer the PCC implementation on mobile devices to evaluate privacy aspects and to open its non-public interfaces on local devices to support custom PCC queries. We demonstrate this level of access beyond Apple’s intended use cases by independently benchmarking the PCC model. We enable future research by making our PCC benchmarking framework publicly available. Journal reference: Proceedings of the 19th ACM Conference on Security and Privacy in Wireless and Mobile Networks (WiSec 2026). Related DOI: https://doi.org/10.1145/3765613.3811691

[AI-194] Toward Enactive Artificial Intelligence

Link: https://arxiv.org/abs/2605.24238
Authors: Banafsheh Rafiee, Richard Sutton
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:In this paper, we advocate for incorporating enactive approaches to perception and cognition into artificial intelligence (AI). Enactive approaches view perception as an active, skillful engagement with the world, where agents perceive by acting and by understanding how their actions shape their experience. This contrasts with classical views that treat perception as a passive internal process in which the brain receives sensory input, processes it, and issues commands for action. Enactive views emphasize the dynamic, embodied, and interactive character of perception, grounded in the lived experience of agents embedded in their environments. We identify and develop four key enactive concepts that we find most relevant to AI: experience, action-perception inseparability, autonomy, and embodiment. Much of mainstream AI, from classical rule-based systems to large language models, has largely neglected these insights, treating cognition as internal processing detached from embodied interaction and intrinsic normativity. Reinforcement learning (RL), however, exhibits structural resonance with enactive principles through its emphasis on action, agent-environment interaction, feedback-driven adaptation, and agent-centered evaluation. Even so, this resonance should not be taken as theoretical equivalence: RL approximates some enactive insights, but key elements remain absent or weakly developed. Building on this analysis, we suggest a broader incorporation of enactive ideas into both mainstream AI and RL.

[AI-195] How Well Do Models Follow Their Constitutions? WWW

Link: https://arxiv.org/abs/2605.24229
Authors: Arya Jakkli, Senthooran Rajamanoharan, Neel Nanda
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 37 pages including appendix. Code, tenet lists, and full transcripts: this https URL. Companion blog post on LessWrong/AI Alignment Forum: this https URL

Abstract:Frontier AI developers now train models against long written behavioral specifications, such as Anthropic’s constitution (Anthropic, 2025a) and OpenAI’s Model Spec (OpenAI, 2025a), integrated into post-training via methods like character training (Anthropic, 2024) and deliberative alignment (Guan et al., 2024). These documents serve a governance function, but it is unclear how well models actually follow them under adversarial, multi-turn pressure similar to what they would face in real-world deployment. We propose a multi-method audit pipeline that treats each lab’s published specification as an auditable target: it decomposes the specification into atomic testable tenets (205 for Anthropic, 197 for OpenAI), generates multi-turn adversarial scenarios with the Petri auditing agent (Anthropic, 2025b), runs a modified SURF-style rubric search (Murray et al., 2026) to catch shallow single-turn failures Petri misses, validates flagged transcripts against the relevant specification, and compares the findings against the lab’s own published system card. Applying the pipeline across seven models per specification, we find that models follow their own lab’s specification substantially better with each generation. On Anthropic’s constitution, the Claude family falls from a 15.0% violation rate (Sonnet 4) to 2.0% (Sonnet 4.6); on OpenAI’s Model Spec, the GPT family falls from 11.7% (GPT-4o) to 3.6% (GPT-5.2 medium reasoning), with the severity ceiling falling from 10/10 to 7/10. We cannot externally isolate whether these gains come from specification-specific training, broader post-training improvements, or evaluation awareness. Remaining failures cluster around operator-imposed personas under AI-identity questioning, irreversible action in agentic deployments, and fabricated quantitative claims with false precision.

[AI-196] Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows

Link: https://arxiv.org/abs/2605.24219
Authors: Harshada Badave, Santosh Borse, Andrea Gomez, Harshitha Narahari, Sara Carter, Vishwa Bhatt, Aishani Rachakonda, Shuxin Lin, Dhaval Patel
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) are increasingly deployed as autonomous agents that reason, use tools, and act over multiple steps. Yet most hallucination benchmarks still evaluate only the final output, missing failures that originate in intermediate Thought-Action-Observation steps. We present Trajel, a dataset and evaluation framework for auditing trajectory-level hallucinations in multi-agent industrial workflows. Trajel introduces a five-type hallucination taxonomy (factual, referential, logical, procedural, and scope-based) over expert-annotated agent traces from AssetOpsBench. We benchmark supervised detection models at the subtask, trajectory, and long-context levels. Our results show that the most common failure modes are missed by existing benchmarks, that nearly half of hallucinated trajectories involve multiple types at once, and that automated detectors with high binary accuracy still misclassify the subtlest types. Trajectory-aware detection significantly outperforms standard post-hoc verification, making taxonomy-grounded evaluation necessary for safer agentic deployment.

[AI-197] Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks

Link: https://arxiv.org/abs/2605.24217
Authors: Ashok Chandrasekar, Jason Kramberger
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:As Large Language Models (LLMs) transition from research environments to production deployments, evaluating their performance against strict Service Level Objectives (SLOs) has become critical. However, current evaluation methodologies suffer from severe measurement bias at scale. We demonstrate that widely used benchmarking utilities rely on single-process, asyncio-driven architectures that introduce fundamental client-side queuing bottlenecks under high concurrency. By modeling the benchmarking client as an M/G/1 queue, we mathematically demonstrate how the Python Global Interpreter Lock (GIL) artificially inflates Time to First Token (TTFT) and Time Per Output Token (TPOT) metrics as request rates scale. To resolve this systematic inaccuracy, we propose an unbiased, multi-process evaluation framework that effectively distributes client-side load, ensuring negligible queuing overhead. Furthermore, we formalize a composite metric, Normalized Time Per Output Token (NTPOT), to robustly amortize end-to-end latency, including prefill and scheduling delays across sequence lengths. Our empirical evaluation demonstrates that this methodology successfully isolates pure serving engine performance, enabling accurate, reproducible profiling of LLMs at production scales exceeding thousands of queries per second.
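
The abstract models the benchmarking client as an M/G/1 queue. The textbook result for that model is the Pollaczek-Khinchine formula for mean queueing delay, W_q = λ·E[S²] / (2(1 − ρ)) with ρ = λ·E[S]. The sketch below applies it to a hypothetical single-process client; the 0.5 ms per-request handling cost is an assumed number, and whether the paper uses exactly this formulation is also an assumption.

```python
def mg1_mean_wait(arrival_rate, mean_service, second_moment_service):
    """Mean time a request waits in an M/G/1 queue (Pollaczek-Khinchine).

    arrival_rate          -- lambda, requests per second
    mean_service          -- E[S], seconds of client-side work per request
    second_moment_service -- E[S^2], in seconds squared
    """
    rho = arrival_rate * mean_service  # utilization of the single server
    if rho >= 1.0:
        raise ValueError("queue is unstable: utilization must be below 1")
    return arrival_rate * second_moment_service / (2.0 * (1.0 - rho))


# Assumed, illustrative client: a constant 0.5 ms of serialized work per request.
MEAN_SERVICE = 0.0005
SECOND_MOMENT = MEAN_SERVICE ** 2  # deterministic service time, so E[S^2] = E[S]^2

rates = [200, 1000, 1900]  # requests per second
waits = [mg1_mean_wait(r, MEAN_SERVICE, SECOND_MOMENT) for r in rates]
for rate, wait in zip(rates, waits):
    print(f"{rate} req/s -> {wait * 1000:.3f} ms mean client-side queueing delay")
```

The delay grows without bound as utilization approaches 1, so a client that is negligible at low request rates can dominate measured TTFT at high ones. Spreading the load over several processes lowers the arrival rate seen by each queue.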

[AI-198] Towards Evaluation Engineering: An Empirical Study of ML Evaluation Harnesses in the Wild

Link: https://arxiv.org/abs/2605.24213
Authors: Zhimin Zhao, Zehao Wang, Abdul Ali Bangash, Bram Adams, Ahmed E. Hassan
Institution: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Evaluation harnesses are software systems that orchestrate model evaluation by managing model invocation, data loading, metric computation, and result reporting. Despite their critical role in machine learning infrastructure, their operational challenges and engineering concerns have received limited attention so far. We present an empirical study of 57 evaluation harnesses, deriving a five-stage harness model and classifying 16,560 issues by workflow stage and root cause. Most harness operational challenges concentrate in the Specification stage (41.4% of issues), where harnesses integrate external models, datasets, and scoring judges. The three most frequent root causes of operational challenges are unimplemented features (24.3%), documentation gaps (20.3%), and missing input validation (17.2%), which together account for 61.7% of classified issues, spanning both defects in existing functionality and capability gaps that block intended workflows. Root causes also vary by workflow stage: environment incompatibility and external dependency breakage account for 36.2% of provisioning issues, whereas algorithmic error (25.9%) and validation gap (22.5%) dominate assessment issues. Together, these contributions establish an empirical foundation for treating evaluation engineering as a distinct software engineering concern.

[AI-199] When Does Multi-Agent RL Improve LLM Workflows? Workflow Scale and Policy-Sharing Tradeoffs

Link: https://arxiv.org/abs/2605.24202
Authors: Yifan Zeng, Yiran Wu, Yaolun Zhang, Wentian Zhao, Kun Wan, Qingyun Wu, Huazheng Wang
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Multi-agent LLM workflows route inference through specialized roles to lift end-task accuracy, but jointly training those roles with reinforcement learning is unstable in ways that are poorly understood. We study when end-to-end RL training of multi-agent LLM workflows improves over their base models, comparing Shared-Policy training, where all roles update one policy, with Isolated-Policy training, where each role has its own parameters. Our experimental matrix spans Eval-Opt, Voting, and Orch-Workers workflows, math and code tasks, and three model scales (0.6B, 1.7B, 4B). We find that multi-agent RL usually improves over base models, but gains depend jointly on workflow, task, and scale, not on policy sharing alone. Isolated-Policy tends to reach higher peak accuracy yet more often falls off a terminal accuracy cliff, while Shared-Policy training does not eliminate failure; it redistributes failure into qualitatively different patterns. We then explain the strongest of these patterns through role-level gradient dynamics induced by workflow topology and policy routing: under Isolated-Policy, parallel same-role agents on shared prompts amplify per-role gradients and drive terminal degradation in Voting and Orch-Workers workflows; under Shared-Policy, asymmetric per-step gradient mass causes the shared policy to be captured by the dominant role, producing different failure signatures by task and workflow. Together, the empirical map and its underlying mechanisms show that policy sharing routes training pressure through different channels rather than offering uniform stability, making it a design choice with workflow- and task-conditional tradeoffs.

[AI-200] A Sober Look at Agentic Misalignment in Automated Workflows

Link: https://arxiv.org/abs/2605.24197
Authors: Wenqian Ye, Bo Yuan, Zhichao Xu, Yijun Tian, Yawei Wang, Henry Kautz, Aidong Zhang
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:We study a class of emergent misalignment in multi-agent systems (MAS), with a focus on automated workflows, which we refer to as agentic misalignment. Although these systems can solve complex tasks, they often fail because agents act according to implicit proxy utilities that do not align with the intended human goals. We formally define these behaviors and analyze them within a Bayesian framework, showing that generic utilities naturally lead to posterior collapse of agents in automated workflows. To address this issue, we propose Agentic Evidence Attribution (AEA), a novel alignment paradigm that improves agent posteriors using context-specific evidence. AEA reasons over agent actions and provides structured evidence to correct misaligned behavior during collaboration. To better understand the role of evidence, we study two instantiations of AEA: self-reflection (internal evidence from the model) and weak-to-strong generalization (external evidence on the agentic trajectory). We show that a small evidence model effectively aligns the MAS by providing orthogonal failure attribution. Our results clarify the sources of agentic misalignment in automated workflows and show that evidence-based alignment can effectively improve agent collaboration and lead to reliable multi-agent systems built on automated workflows.

[AI-201] AvalancheBench: Evaluating Enterprise Data Agents Through Latent World Recovery

Link: https://arxiv.org/abs/2605.24183
Authors: Darek Kleczek, Fuheng Zhao, Alexander W. Lee, Julien Tissier, Pawel Liskowski, Ugur Cetintemel, Anupam Datta
Institution: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We introduce AvalancheBench, a benchmark for evaluating enterprise data agents through latent world recovery. AvalancheBench improves on existing benchmarks in three ways. First, it evaluates analytical understanding rather than pipeline completion: systems are scored on whether they recover the segments, drivers, temporal events, and relationships that explain the data, not merely on whether they execute a workflow or produce a plausible report. Second, it provides ground truth for goal-driven analytics by generating observations from a known latent world, enabling partial credit for incomplete but valid recoveries. Third, it exposes how early analytical mistakes propagate into later conclusions: missed segments, merged events, or wrong attributions can lead to systematically wrong recommendations. In this sense, AvalancheBench complements real-data benchmarks by providing a controlled setting for diagnosing whether agents recover the analytical structure behind enterprise data. On a first e-commerce use case, the strongest configuration of a leading coding agent recovers only 26% of the rubric, with failures concentrated in generic customer segmentations and merged temporal events.

[AI-202] EPPC-OASIS: Ontology-Aware Adaptation and Structured Inference Refinement for Electronic Patient-Provider Communication Mining in Secure Messages

Link: https://arxiv.org/abs/2605.24172
Authors: Samah Fodeh, Sreeraj Ramachandran, Elyas Irankhah, Muhammad Arif, Afshan Khan, Ganesh Puthiaraju, Linhai Ma, Srivani Talakokkul, Jordan Alpert, Sarah Schellhorn
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Secure patient-provider messages contain clinically important communication behaviors that are difficult to characterize manually at scale. The Electronic Patient-Provider Communication (EPPC) framework provides an ontology for coding these behaviors, but automated extraction remains challenging because predictions must preserve fine-grained code/sub-code structure while grounding annotations in message text. We developed EPPC-OASIS, an ontology-aware adaptation approach for structured EPPC extraction, and combined it with deployable inference-refinement procedures designed to improve the coherence of final annotations. EPPC-OASIS augments supervised fine-tuning with a Wasserstein alignment objective that encourages alignment between model representation neighborhoods and EPPC ontology-derived neighborhoods, while inference refinement uses verification, self-consistency, hybrid correction, and selection or ensembling to address residual prediction errors. We evaluated the framework on a de-identified corpus of secure patient-provider messages against prompting, supervised fine-tuning, preference-based, and robustness-oriented baselines across multiple open-weight language models. Across model families, the best deployable pipeline achieved 77.13% Code+Sub-code F1 and 63.83% Triplet F1, corresponding to modest but consistent absolute gains of +1.39 and +2.12 F1 points over the strongest supervised fine-tuning baseline. These results suggest that ontology-aware adaptation with structured inference refinement can support scalable retrospective EPPC mining, although external validation is needed before operational use.

[AI-203] PromptAudit: Auditing Prompt Sensitivity in LLM-Based Vulnerability Detection

Link: https://arxiv.org/abs/2605.24171
Authors: Steffen J. Camarato, Yahya Hmaiti, Mandana Ghadamian, David Mohaisen
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models are increasingly used for vulnerability detection, yet their reliability under different prompt formulations remains uncharacterized. We present PromptAudit, a controlled evaluation framework that isolates prompt effects by fixing the dataset, decoding, and parsing while varying only the prompting strategy. Using five prompting strategies across five open-weight models on 1,000 CVEs (6,074 code samples spanning 16 programming languages), we evaluate accuracy, recall, abstention, coverage, and effective F1. We find that standard chain-of-thought prompting achieves the strongest overall operational performance, while few-shot prompting provides model-dependent benefits that are most pronounced for prompt-sensitive models. In contrast, adaptive chain-of-thought frequently suppresses recall and self-consistency induces excessive abstention, sharply reducing effective performance. These results show that vulnerability detection behavior is jointly determined by the model and the prompt, and that prompt sensitivity is a first-class system property that must be explicitly characterized in evaluation and deployment.

[AI-204] Inference Time Context Sparsity: Illusion or Opportunity?

Link: https://arxiv.org/abs/2605.24168
Authors: Sahil Joshi,Prithvi Dixit,Agniva Chowdhury,Anshumali Shrivastava,Joseph E. Gonzalez,Ion Stoica,Kumar Krishna Agrawal,Aditya Desai
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 19 pages, 8 figures

Click to view abstract

Abstract:Sparsity has long been a central theme in LLM efficiency, but its role in context processing remains unresolved. As LLM workloads shift toward longer contexts and agentic interactions, the compute and memory bottlenecks of attention become increasingly critical, raising the question of whether these constraints are fundamental. Our position is that these constraints are artificial and unnecessary, and that the future of LLM inference lies in extreme but principled sparsity along the context dimension. This position is supported by several strands of empirical and theoretical evidence. First, we find the insistence on dense attention unreasonable, since in a long context a query effectively projects O(N) attention information into a hidden space of dimension d ≪ N, making the process inherently lossy. Second, we perform an extensive study of sparsity in LLMs spanning 20 models across five model families, varying context lengths, and different sparsity levels. We empirically demonstrate a strong trend: current LLMs, despite not being trained for context sparsity, are remarkably robust to inference-time decode sparsity across tasks of varying complexity, including retrieval, multi-hop QA, mathematical reasoning, and agentic coding. Importantly, we also show that current hardware is already sufficient to realize substantial gains from this sparsity. For example, our sparse decode kernels accelerate large-context processing by up to 10x over FlashInfer at 50x sparsity levels on hardware such as the H100. Overall, these results position extreme context sparsity not as a heuristic, but as a principled foundation for LLM inference, training, and architecture design: one that is both feasible and beneficial, and a compelling direction for future systems.

[AI-205] Knowledge Graph Modulated Deep Learning for Limited-Sample Clinical Data Analysis

Link: https://arxiv.org/abs/2605.24162
Authors: Yuwei Xue,Sakib Mostafa,James Zou,Joseph Liao,Maximilian Diehn,Ash A. Alizadeh,Lei Xing,Md. Tauhidul Islam
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 17 pages, 4 figures, 12 supplementary figures

Click to view abstract

Abstract:Biological systems are governed by structured molecular interactions, where pathways, regulatory circuits, and functional gene relationships shape cellular behavior and disease progression. Much of this knowledge is naturally represented as graphs. However, most biomedical AI models cannot directly use graph-encoded biological knowledge and instead require compressed low-dimensional representations, which can lose important structure and reduce performance, especially in limited-sample clinical studies. Here, we introduce Graph-in-Graph (GiG), a knowledge graph-modulated deep learning framework for data-efficient clinical prediction. GiG represents each patient as a standalone modular graph, in which curated biological knowledge graphs define edges and patient-specific measurements, such as gene expression, define node features. This design allows multiple biological knowledge graphs to be integrated while preserving gene-gene interactions and pathway topology during patient-level representation learning. Across cohorts comprising nearly 9,700 patients and five clinical tasks, including liquid biopsy cancer detection, prostate cancer diagnosis, and 32-class pan-cancer classification, GiG consistently outperforms traditional and state-of-the-art methods, with the largest gains in limited-sample settings. On the challenging prostate cancer diagnosis task, GiG improves macro-F1 by up to 49 percentage points relative to competing methods. Control experiments replacing real pathway graphs with random topologies confirm that these gains arise from biologically grounded knowledge graph structure rather than graph modeling alone. These findings show that knowledge graph-modulated deep learning can improve robustness, interpretability, and sample efficiency in clinical data analysis, and provide a principled framework for integrating biological knowledge graphs into predictive modeling.

[AI-206] Palette: A Modular, Controllable, and Efficient Framework for On-demand Authorized Safety Alignment Relaxation in LLMs

Link: https://arxiv.org/abs/2605.24154
Authors: Qitao Tan,Xiaoying Song,Arman Akbari,Arash Akbari,Yanzhi Wang,Xiaoming Zhai,Lingzi Hong,Zhen Xiang,Jin Lu,Geng Yuan
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Click to view abstract

Abstract:Current safety alignment of foundation models largely follows a one-size-fits-all paradigm, applying the same refusal policy across users and contexts. As a result, models may refuse requests that are unsafe for general users but legitimate for authorized professionals, limiting helpfulness in specialized professional settings. Existing approaches either require costly realignment or rely on inference-time steering that suffers from imprecise control and added latency. To this end, we propose Palette, a modular, controllable, and efficient framework that selectively relaxes refusal behavior on authorized target domains while preserving standard safety elsewhere. Our method identifies a refusal direction via multi-objective search and internalizes it into the model through lightweight adaptation. Palette further supports modular composition: it learns domain-specific safety controls independently and composes them through parameter merging, enabling on-demand multi-domain authorization without retraining. Experiments across four safety benchmarks, multiple model variants, and both LLMs and VLMs show that Palette delivers precise safety control without sacrificing general utility, offering a practical path toward foundation models that adapt to diverse professional needs.

[AI-207] Neuro-Inspired Inverse Learning for Planning and Control

Link: https://arxiv.org/abs/2605.24152
Authors: Maryna Kapitonova,Tonio Ball
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:We present a neuro-inspired framework for embodied planning and control. Building on three principles that enable fast and highly effective goal-directed behavior in the mammalian brain - paired forward/inverse internal models, open-loop multi-step motor commands, and sequential, hierarchical organization of action - our Inverter framework uses learned components, trained end-to-end through Inverse Learning (IL) and supplemented where natural by analytic or algorithmic modules; we formalize IL and delineate it from supervised, reinforcement, and imitation learning. IL bridges Reinforcement Learning (RL)-style amortization, which runs in a single forward pass but emits only one action at a time, and Optimal Control (OC)-style sequence planning over whole trajectories, but with iterative test-time computation. Single Inverters or hierarchical n=2 Inverter stacks match or improve on offline-RL and diffusion-planner baselines on all 3 maze2d and 6 antmaze D4RL variants by an average of +24.2% (range -1.9% to +78.2%), at one-to-two orders of magnitude less inference compute time. Distinctively, optimizing through the Figure of Merit (FoM) over the entire T-step action sequence - rather than per step - lets Inverters produce smooth, goal-coherent, trajectory-wide structure and reach control policies closer to the analytic optimum than the policy underlying the training data itself. We also identify a failure mode of IL: FoM hacking under narrow training-data coverage, which we mitigate by using random training data with broader coverage. As an application example, a Pulse Inverter synthesizes arbitrary single-qubit quantum gates with fidelity matching the standard iterative numerical baseline (GRAPE), at more than 1000x lower per-gate compute time. In summary, we conclude that IL enables a versatile class of world-interfaces, especially for latency- and resource-critical embodied AI.

[AI-208] HyperGuide: Hyperbolic Guidance for Efficient Multi-Step Reasoning in Large Language Models

Link: https://arxiv.org/abs/2605.24140
Authors: Yuyu Liu,Haotian Xu,Yanan He,Sarang Rajendra Patil,Mengjia Xu,Tengfei Ma
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Multi-step reasoning remains a central challenge for large language models: single-pass generation is efficient but lacks accuracy; tree-search methods explore multiple paths but are computation-heavy. We address this gap by distilling reasoning progress into a hyperbolic geometric signal that guides step-by-step generation. Our approach is motivated by a structural observation: in combinatorial reasoning trees, solution-bearing states are few while dead ends are exponentially numerous. The hyperbolic space matches this asymmetry, with compact volume near the origin and exponentially expanding capacity toward the boundary, so that distance-to-origin naturally encodes solution proximity while angular separation distinguishes branches requiring different next operations. We train a lightweight head to project LLM hidden states into this space, then fine-tune a low-rank adapter interactively on its own reasoning attempts to act on the injected signal. Across multiple benchmarks, the geometric signal yields consistent gains, with larger improvements on deeper reasoning chains. Our code is publicly available at this https URL.
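
The geometric intuition in this abstract (distance-to-origin as a proxy for solution proximity, angle as a branch discriminator) can be illustrated with a minimal sketch. It assumes the Poincaré ball model with curvature -1; the abstract does not say which hyperbolic model the authors use, and the function names below are hypothetical.

```python
import math

def distance_to_origin(x):
    """Hyperbolic distance from the origin in the Poincare ball (curvature -1).

    Assumption: the embedding lives in the open unit ball; this is an
    illustrative choice, not necessarily the paper's parameterization.
    """
    r = math.sqrt(sum(v * v for v in x))
    if r >= 1.0:
        raise ValueError("point must lie strictly inside the unit ball")
    return 2.0 * math.atanh(r)

def angular_separation(x, y):
    """Angle in radians between two nonzero points, as seen from the origin."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    cos = max(-1.0, min(1.0, dot / (nx * ny)))  # clamp against rounding error
    return math.acos(cos)
```

Under this model the distance grows without bound as a point approaches the boundary, which is the "exponentially expanding capacity" the abstract refers to: a Euclidean radius of 0.9 and one of 0.99 differ far more in hyperbolic distance than their Euclidean gap suggests.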

[AI-209] MAPLE: Multi-State Aggregated Policy Evaluation for AlphaZero in Imperfect-Information Games

Link: https://arxiv.org/abs/2605.24139
Authors: Qian-Rong Li,Hung Guei,I-Chen Wu,Ti-Rong Wu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by the IEEE Conference on Games (IEEE CoG 2026)

Click to view abstract

Abstract:Imperfect-information games (IIGs) are challenging, as players must make decisions without fully observing the true game state. While AlphaZero has achieved remarkable success in perfect-information games, extending it to IIGs remains difficult. Existing search-based approaches, such as Perfect Information Monte Carlo (PIMC), suffer from strategy fusion, while Information Set Monte Carlo Tree Search (IS-MCTS) incurs high computational cost when combined with neural networks. In this paper, we propose Multi-State Aggregated PoLicy Evaluation (MAPLE), a tree search method that aggregates policy and value evaluations from multiple sampled world states within a single search tree, combining the advantages of PIMC and IS-MCTS while maintaining a controllable computational cost. We further incorporate a Siamese-based sampling strategy to select informative world states from the information set. Experiments on Phantom Go and Dark Hex show that MAPLE significantly outperforms the PIMC-based AlphaZero baseline, achieving Elo improvements of 291 and 136, respectively. These results demonstrate that MAPLE is an effective approach for AlphaZero-style learning in imperfect-information games.

[AI-210] Understanding Conversational Patterns in Multi-agent Programming: A Case Study on Fibonacci Game Development

Link: https://arxiv.org/abs/2605.24138
Authors: Srijita Basu,Viktor Kjellberg,Simin Sun,Bengt Haraldsson,Md. Abu Ahammed Babu,Wilhelm Meding,Farnaz Fotrousi,Miroslaw Staron
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 10 pages, 7 figures, AIware, FSE 2026

Click to view abstract

Abstract:Large Language Models (LLMs) are increasingly applied to software engineering (SE), yet their potential for autonomous, role-oriented collaboration remains largely underexplored. Understanding how multiple LLM-based agents coordinate, maintain role alignment, and converge on solutions is critical for SE, as naively allowing agents to interact does not reliably lead to correct or stable outcomes. Recent empirical studies show that unstructured or poorly understood interaction dynamics can result in error propagation, premature consensus on incorrect solutions, or prolonged disagreement that prevents convergence, even when correct partial solutions are present early in the interaction. As an initial step towards addressing this underexplored area, we undertake a systematic analysis of conversations between two agents, a Designer and a Programmer, across 12 model combinations from 7 open-source LLMs (Gemma 2, Gemma 3, LLaMA 3.2, LLaMA 3.3, DeepSeek-R1, MiniCPM, and Qwen3). Our systematic approach reveals three key dimensions of multi-agent interaction: efficiency (the speed and stability of convergence), consistency (the degree of role alignment visualized by BLEU and ROUGE), and effectiveness (the extent of compilation success and error resolution). Results show that the DeepSeek-R1:DeepSeek-R1 pair was unique in converging to the correct solution from the very first iteration and sustaining it consistently to the final iteration, while LLaMA 3.2:LLaMA 3.2 and Qwen3:Qwen3 demonstrated strong Designer:Programmer role alignment despite diverging from the correct solution. The other pairs deviated from the task and never converged to a result. These findings advance understanding of agentic programming and highlight the need for further research on understanding and calibrating convergence and stop conditions essential for future autonomous SE.

[AI-211] Empirical Analysis and Detection of Hallucinations in LLM-Generated Bug Report Summaries

Link: https://arxiv.org/abs/2605.24137
Authors: Hinduja Nirujan,Shreyas Patil,Abdallah Ayoub,Ahmad Abdel Latif,Gouri Ginde
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) are increasingly used to generate summaries of software bug reports, including sections such as Steps-to-Reproduce (S2R), Actual Behavior (AB), and Expected Behavior (EB). However, these models frequently produce hallucinations that can be convincing but unsupported by the source report. This can mislead developers and reduce trust in automated maintenance tools. Existing hallucination detection approaches typically evaluate outputs at the full-response level and do not consider the structure of technical documents. An initial exploratory study on 80 structured bug report summaries found that approximately 47.9% contained missing information, while 12.3% included fabricated content, highlighting the need for systematic hallucination analysis in bug report summarization. In this work, we empirically investigate hallucinations in LLM-generated bug report summaries from a section-aware perspective. Using the BugsRepo dataset, derived from Mozilla OSS projects, we introduce controlled synthetic hallucination injection to construct a benchmark for training and evaluation. We propose a section-aware hallucination detection approach that jointly predicts whether a summary contains hallucinated content, identifies affected sections, and classifies hallucination types. Experimental results across multiple pretrained language models show that the proposed approach achieves strong performance across all tasks, with the best model obtaining 0.89 report-level Macro-F1, 0.83 section-level Macro-F1, and 0.84 hallucination-type Macro-F1. We further analyze common hallucination patterns and model failure modes to better understand limitations of current LLM-generated bug report summaries. The findings highlight the importance of section-aware hallucination analysis for improving the reliability of LLM-assisted bug report summarization in software maintenance workflows.

[AI-212] SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills

Link: https://arxiv.org/abs/2605.24117
Authors: Yingtie Lei,Zhongwei Wan,Jiankun Zhang,Samiul Alam,Zixuan Zhong,Peizhou Huang,Xin Wang,Jingxuan Zhang,Donghao Zhou,Yunta Hsieh,Zhihao Dou,Hui Shen,Yan Xu,Dimitrios Dimitriadis,Tuo Zhang,Mi Zhang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language model (LLM) agents accumulate rich episodic trajectories while solving real-world tasks, but it remains unclear whether such experience can be distilled into reusable procedural skills. We introduce SkillEvolBench, a diagnostic benchmark for evaluating this step from experience reuse to skill formation. It contains 180 tasks across six real-world agent environments, organized into role-conditioned task families with shared latent procedures. Agents learn from acquisition tasks, update an external skill library using compacted trajectories and verifier feedback, and then face frozen deployment tasks testing context shift, adversarial shortcuts, and composition. By comparing self-generated and curated-start skill evolution against no-skill and raw-trajectory controls, SkillEvolBench separates procedural abstraction from base capability, curated prior knowledge, and direct reuse of episodic traces. Across ten model configurations and three agent harnesses, we find that current agents often adapt locally but rarely form robust reusable skills. Skill-based conditions can improve acquisition or replay, and individual models sometimes gain on specific deployment axes, but these gains are unstable under frozen deployment. Raw-trajectory reuse frequently outperforms distilled skills, suggesting that current abstraction procedures discard contextual and procedural cues that remain useful for future tasks. Capacity and cost analyses further show that writing more skills or larger Tier-3 resource libraries is not sufficient: additional updates can improve coverage while introducing episode-specific drift and procedural clutter. These findings position SkillEvolBench as a testbed for measuring when one-off experience becomes durable procedural knowledge rather than task-local memory.

[AI-213] MASt3R-Nav: WayPixel Navigation in Relative 3D Maps ICRA

Link: https://arxiv.org/abs/2605.24111
Authors: Vansh Garg,Rohit Jayanti,Krish Pandya,Sarthak Chittawar,Siddharth Tourani,Muhammad Haris Khan,Sourav Garg,Madhava Krishna
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 2026 IEEE International Conference on Robotics and Automation (ICRA)

Click to view abstract

Abstract:Visual navigation ability is strongly tied to its underlying representation of the world. Unlike classical 3D maps that require globally-consistent geometry, image- or object-relative topological graphs almost entirely do away with geometric understanding. However, this comes at the cost of navigation capability, often limiting it to merely teach-and-repeat. In this work, we propose a novel map representation in the form of pixel-relative connectivity, which is geometrically accurate but does not require global geometric consistency. Inspired by recent progress in 3D grounded image matching, we construct a map from an image sequence through inter-image connectivity based on pixel correspondences in the relative 3D coordinate systems of individual image pairs. We then use this pixel-level graph to perform global path planning by approximating and sparsifying intra-image pixel connectivity. Through this, we derive a "WayPixel Costmap" representation and train a controller conditioned on it to predict a trajectory rollout. We show that this dense pixel-level costmap based on relative geometry is a more accurate conditioning variable for control prediction than its image- and object-level counterparts. This enables a highly capable navigation system, as validated on four types of navigation tasks in the simulator and through real world demonstrations.

[AI-214] EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions

Link: https://arxiv.org/abs/2605.24110
Authors: Haiyang Shen,Xuanzhong Chen,Wendong Xu,Yun Ma,Liang Chen,Kuan Li
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Work in Progress; 32 pages, 10 figures, preprint

Click to view abstract

Abstract:Coding agents are increasingly used as iterative development partners, but most benchmarks still evaluate one specification followed by one final assessment. This leaves out a basic question: can an agent keep its own codebase working as requirements change? We introduce EvoCode-Bench, a benchmark of 26 stateful coding tasks and 227 evaluated rounds. Each task preserves the agent’s workspace for 5-15 rounds, states requirements through observable behavior, and uses cumulative executable tests to check new requirements and still-active prior ones. We evaluate 13 coding agents with two metrics: MT@4, a four-attempt fail-stop multi-round score, and SR, a single-round score from a reference-completed prior state. For most agents, SR exceeds MT@4 by 22-40 points. The gap also changes rankings: the highest-SR agent (78.9) ranks only third in persistent execution (44.0 MT@4). Even the strongest agents achieve only about 50% success on multi-turn metrics, and aggregate pass rate drops below half of round-1 performance by round 5. Failure analysis shows tier-dependent behavior: weaker agents fail early, while stronger agents survive long enough to expose specification-tracking and regression failures. We release the benchmark data and Harbor multi-turn infrastructure.

[AI-215] Overcoming “Physics Shock” in Earth Observation: A Heteroscedastic Uncertainty Framework for PINN-based Flood Inference

Link: https://arxiv.org/abs/2605.24106
Authors: Tewodros Syum Gebre,Jagrati Talreja,Matilda Anokye,Leila Hashemi-Beni
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This article is accepted in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing

Click to view abstract

Abstract:Rapid and accurate flood extent mapping from Remote Sensing data, such as Synthetic Aperture Radar (SAR), is critical for operational disaster response, but standard Deep Learning models often produce physically impossible predictions due to a lack of hydrological constraints. While Physics-Informed Neural Networks (PINNs) attempt to address this by embedding governing laws directly into the loss function, their application to real-world remote sensing data frequently fails. Enforcing rigid spatial derivatives (e.g., the 2D Shallow Water Equations) onto unconditioned latent spaces attempting to fit noisy SAR speckle causes catastrophic gradient divergence, a phenomenon we term Physics Shock. In this paper, we propose a novel Uncertainty-Aware PINN framework tailored specifically for applied Earth Observation that addresses this instability. By integrating a dynamic Warm-Start protocol and modeling heteroscedastic aleatoric uncertainty via a negative log-likelihood objective, the network learns to dynamically relax physical constraints in regions of high sensor noise while strictly enforcing them in high-confidence areas. Evaluated on the Sen1Floods11 dataset, our probabilistic Attention-Gated FNO-UNet successfully stabilizes multi-objective optimization, achieving a +25% relative improvement in Intersection over Union (IoU) compared to deterministic baselines. Furthermore, through Deep Ensembles, we successfully disentangle intrinsic sensor noise from out-of-distribution terrain ignorance, providing operational agencies with highly calibrated, physically consistent confidence bounds for robust disaster mitigation and real-time decision-making.
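
The general mechanism behind a heteroscedastic negative log-likelihood objective can be shown with a small sketch. It assumes a Gaussian likelihood with a predicted per-pixel log-variance, which is a common formulation but not necessarily the one used in the paper; `gaussian_nll` is a hypothetical name.

```python
import math

def gaussian_nll(y, mu, log_var):
    """Per-pixel Gaussian negative log-likelihood, constant term omitted.

    Assumption: the network outputs a mean `mu` and a log-variance `log_var`
    for each pixel. A larger predicted variance shrinks the squared-error
    term but pays a penalty through the `log_var` term.
    """
    return 0.5 * (log_var + (y - mu) ** 2 / math.exp(log_var))

# Same residual (2.0), two different predicted noise levels.
confident = gaussian_nll(y=2.0, mu=0.0, log_var=0.0)            # variance 1
uncertain = gaussian_nll(y=2.0, mu=0.0, log_var=math.log(4.0))  # variance 4
```

With the residual held fixed, the loss is lower when the model declares higher variance, so errors in noisy regions are down-weighted; the `log_var` penalty keeps the model from declaring high uncertainty everywhere.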

[AI-216] The Time is Here for Just-in-Time Systems: Challenges and Opportunities

Link: https://arxiv.org/abs/2605.24096
Authors: Shu Liu,Alexander Krentsel,Shubham Agarwal,Mert Cemri,Ziming Mao,Soujanya Ponnapalli,Alexandros G. Dimakis,Sylvia Ratnasamy,Matei Zaharia,Aditya Parameswaran,Ion Stoica
Affiliation: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Software Engineering (cs.SE)
Comments: preprint

Click to view abstract

Abstract:Core systems like key-value stores have historically taken years to build, and are designed to be general so as to amortize cost across deployments, paying a significant performance cost. We argue that LLM-based coding agents now make a different approach tractable: Just-in-Time Systems, in which the entire system is synthesized from scratch, specialized to the environment, workload, and required system properties. We present a JIT system synthesis pipeline, Jitskit, and explore its effectiveness in synthesizing key-value stores from spec cards that span different YCSB workloads, deployment constraints (e.g., compute resources), and system properties (e.g., consistency and durability). Jitskit iteratively refines a system implementation to match the specification against an evolving evaluation test suite. The resulting synthesized systems are performant, beating comparable state-of-the-art systems on 18 of 18 specs tried, by up to 4.6x over the best off-the-shelf baseline on the most favorable spec. Naively running Claude Code either reward-hacks or underperforms Jitskit by up to 5.4x. We discuss the challenges we overcame in building Jitskit and our key takeaways.

[AI-217] Verified SHAP: Provable Bounds for Exact Shapley Values of Neural Networks ICML2026

Link: https://arxiv.org/abs/2605.24084
Authors: David Boetius,Shahaf Bassan,Guy Katz,Stefan Leue,Tobias Sutter
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: Accepted at ICML 2026. 34 pages, 13 figures

Click to view abstract

Abstract:Shapley additive explanations (SHAP) are widely recognised as computationally intractable for neural networks, since they induce an exponential search space over the input features. In this work, we take a first step towards scaling exact SHAP computation to larger search spaces by introducing an algorithm that leverages recent advances in neural network verification to compute arbitrarily tight exact lower and upper bounds on SHAP values for neural networks, ultimately recovering the exact SHAP values. We demonstrate that our approach scales to orders of magnitude larger search spaces than state-of-the-art exact methods. This provides an important first step towards exact SHAP computation and establishes a principled cornerstone for evaluating statistical approximation methods on larger search spaces.
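
To make the "exponential search space" concrete, here is a brute-force sketch of exact Shapley values by coalition enumeration. This is not the paper's verification-based algorithm. It also assumes a simplified value function in which absent features are replaced by a fixed baseline, whereas SHAP is usually defined through an expectation over a data distribution. All names are hypothetical.

```python
from itertools import combinations
from math import factorial

def exact_shapley(f, x, baseline):
    """Exact Shapley values of f at x by enumerating all coalitions.

    Assumption: a feature outside the coalition takes its baseline value.
    The cost is O(2^n) model evaluations per feature, which is why exact
    computation is considered intractable for large inputs.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return f(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

# For a linear model the attribution is the coefficient times the input shift.
linear = lambda z: 2.0 * z[0] + 3.0 * z[1] - 1.0 * z[2]
phi = exact_shapley(linear, x=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0])
```

The attributions sum to `f(x) - f(baseline)`, the efficiency property of Shapley values, which gives a quick sanity check for any approximation.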

[AI-218] Not All Transitions Matter: Evidence from PPO ACML

Link: https://arxiv.org/abs/2605.24071
Authors: Ajhesh Basnet
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 5 figures. Submitted to 2026 8th Asia Conference on Machine Learning and Computing (ACMLC 2026)

Click to view abstract

Abstract:Training a reinforcement learning agent on-policy means collecting fresh experience at every update, and that experience comes with a hidden problem. Each state in a rollout is the direct output of the previous one, causally chained together by the agent’s own actions. Because of this, consecutive transitions are never truly independent. They carry overlapping information, and the gradient signal the network receives ends up far more repetitive than the batch size suggests. The same directions get reinforced over and over, the value network struggles to keep up as the policy shifts, and training becomes quietly unstable in ways that reward curves alone rarely reveal. This paper asks whether that redundancy can simply be removed. We show that randomly dropping a fixed fraction of transitions from the rollout, at the right stage so the reward signal stays intact, is enough to break the repetitive gradient structure and stabilize training. The change is minimal: one sampling step, no new components, no modification to the core algorithm, and it works with any PPO implementation. Across five environments of increasing difficulty, CartPole-v1, Acrobot-v1, LunarLander-v2, HalfCheetah-v5, and Hopper-v5, the method matches vanilla PPO on reward while producing more consistent training dynamics across KL divergence, policy entropy, and value estimates. Dropping 25% of transitions turns out to be the sweet spot: enough to disrupt the redundancy, not enough to thin the batch.
ACM classes: I.2.6; I.2.8. Journal reference: Proceedings of the 2026 8th Asia Conference on Machine Learning and Computing
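
The sampling step described in this abstract might look like the sketch below. It assumes that "the right stage" means after advantages and returns have been computed on the full rollout, so that dropped transitions still contribute their rewards to the targets of the retained ones; the abstract does not spell this out. For brevity the rollout is treated as one episode with no terminal states, and all function names are hypothetical.

```python
import random

def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over the full, unthinned rollout."""
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns

def drop_transitions(batch, drop_frac=0.25, rng=None):
    """Randomly discard a fixed fraction of transitions before the PPO update.

    Assumption: `batch` maps field names to equal-length lists and already
    holds advantages and returns computed on the complete rollout.
    """
    rng = rng or random.Random()
    n = len(batch["obs"])
    n_keep = n - int(drop_frac * n)
    keep = sorted(rng.sample(range(n), n_keep))
    return {key: [column[i] for i in keep] for key, column in batch.items()}

rewards, values = [1.0] * 8, [0.0] * 8
adv, ret = compute_gae(rewards, values, last_value=0.0)
batch = {"obs": list(range(8)), "advantages": adv, "returns": ret}
kept = drop_transitions(batch, drop_frac=0.25, rng=random.Random(0))
```

Here `kept` holds 6 of the 8 transitions, each with the advantage it had in the full rollout, and the usual PPO minibatch loop would then run on `kept` unchanged.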

[AI-219] When the Manual Lies: A Realistic Benchmark to Evaluate MCP Poisoning Attacks for LLM Agents

Link: https://arxiv.org/abs/2605.24069
Authors: Shi Liu,Xuehai Tang,Xikang Yang,Liang Lin,Biyu Zhou,Wenjie Xiao,Wantao Liu
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The rise of tool-using Large Language Model (LLM) agents, standardized by protocols like the Model Context Protocol (MCP), has unlocked unprecedented autonomous execution capabilities for LLM Agents by integrating external open-domain knowledge and tools. However, this interoperability introduces a covert attack surface targeting the agent’s cognitive planning layer. This paper systematically investigates Tool Description Poisoning (TDP), a novel semantic attack. In TDP, malicious instructions are not embedded in a tool’s executable code, but rather covertly injected into its descriptive metadata, the very “manual” an agent relies on for secure planning and decision-making. To rigorously and systematically evaluate this emerging threat, we introduce the MCP-TDP Security Benchmark. This high-fidelity sandbox environment comprises 32 realistic, real-world test cases spanning 6 distinct risk categories. Our evaluation of 8 mainstream LLMs reveals severe vulnerabilities, with leading models like GPT-4o exhibiting a nearly 100% Attack Success Rate (ASR) in six high-risk scenarios. Furthermore, our findings demonstrate that common prompt-guardrail defenses are largely ineffective and can, counterintuitively, even be counterproductive (a phenomenon which we term the “Firewall Fallacy”). Crucially, we also propose a defense mechanism: “Reactive Self-Correction,” where an agent autonomously detects and reverts its own malicious actions post-execution. This work provides the first specialized security benchmark tailored for TDP, offering essential insights for securing the cognitive and planning layers of advanced agentic systems.

[AI-220] Generative Representation Learning on Hyper-relational Knowledge Graphs via Masked Discrete Diffusion ICML2026

链接: https://arxiv.org/abs/2605.24064
作者: Jaejun Lee,Seheon Kim,Joyce Jiyoung Whang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 28 pages, 16 figures, 18 tables, 43rd International Conference on Machine Learning (ICML 2026)

Abstract:Hyper-relational knowledge graphs (HKGs) effectively represent complex facts. While inferring new knowledge in HKGs is a critical problem, current methods cast it as a simple link prediction, assuming that nearly all entities and relations within a fact are known, leaving only a single blank to be filled. However, this restricted assumption may not hold in real-world scenarios in which multiple, or even all, constituent components of a fact may be missing simultaneously. To bridge this gap, we introduce a task called fact generation: generating a valid hyper-relational fact from an arbitrarily masked query, i.e., completing a partially observed fact or generating a fact from scratch. We propose KREPE, the first generative representation learning method for HKGs that learns to model the probability distributions of missing components conditioned on the local fact components and global structure of HKGs via a masked discrete diffusion. KREPE models both the intra-fact dependencies by contextual message passing and inter-fact correlations by aggregating stochastically sampled contexts. KREPE seamlessly unifies link prediction and fact generation within a single training framework, achieving state-of-the-art performance on standard HKG link prediction benchmarks and outperforming LLM-based baselines in generating novel and correct facts.

[AI-221] Federated Learning over Human-Body Communication for On-Body Edge Intelligence: A Survey, Taxonomy, and BODYFED-HBC Scheduling Vignette

Link: https://arxiv.org/abs/2605.24062
Authors: Koffka Khan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Human-body communication (HBC) is a promising physical substrate for wearable body-area networks because it can localize communication around the body and reduce the burden of conventional radio links. Federated learning (FL) is a promising learning substrate because it can reduce raw-data centralization for physiological and behavioral sensing. Yet these two literatures remain weakly connected: FL for wearables usually abstracts the communication layer, whereas HBC research usually abstracts learning and model-update traffic. This article surveys the intersection of HBC, wireless body-area networks, wearable FL, Internet-of-Bodies privacy, and edge-intelligence optimization. We propose a taxonomy that distinguishes intra-body, body-hub, cross-user, and clinical-cloud FL deployments, and we identify the open problem of body-channel-aware FL: learning protocols whose client selection, update compression, and aggregation are controlled by posture-dependent HBC links, residual energy, sensor memory, and privacy risk. To make the research agenda concrete, we introduce BODYFED-HBC as a reference architecture and provide an optimization formulation and scheduling algorithm. We further specify a reproducible simulation vignette that combines public wearable datasets with empirical body-coupled-communication signal-loss models. The article concludes with open datasets, evaluation metrics, limitations, and research directions for computer scientists working above the hardware layer.

[AI-222] Spectral Probe-Circuits: A Three-Step Recipe for Identifying Attention-Head Circuits in Pretrained Transformers

Link: https://arxiv.org/abs/2605.24059
Authors: Yongzhong Xu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 35 pages, 4 figures

Abstract:We present a three-step recipe for identifying attention-head circuits in pretrained transformers. A per-head spectral signal – the time-integrated participation ratio of each head’s attention output – ranks heads doing sustained content-dependent computation without labels or attribution gradients. A task-pattern screen filters this general indicator into a task-specific candidate circuit, and group ablation against a matched-random control completes the causal claim. We validate across an 8x parameter range (51M to 1B-active / 7B-total), two architecture families (dense, mixture-of-experts), and four pretraining pipelines. The recipe ports: a 2-6 head induction circuit is causally necessary in every model tested, with a 94-100% drop in synthetic-induction top-1 after ablation. The spectral signal is predictive without supervision: on six independent seeds of a 51M-parameter probe model, the same computation identifies the seed-specific circuit on each seed. The fraction of heads doing identifiable specialized computation is conserved at 17-19% across the Pythia family (124M to 410M), while specific induction circuits stay 3-11 heads – sublinear in total head count. This paper is the methodology anchor of a three-paper program; companion papers extend the recipe to developmental trajectories during pretraining and to composed-task circuits where pattern selectivity decouples from task-causal structure.
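
The spectral signal in this abstract builds on the participation ratio, a standard effective-dimensionality measure. The following is a minimal pure-Python sketch, assuming the ratio is taken over the covariance spectrum of a head's output vectors and that "time-integrated" means a plain sum over checkpoints. Both choices, and the function names `participation_ratio` and `time_integrated_pr`, are illustrative assumptions rather than the paper's implementation.

```python
def participation_ratio(rows):
    """PR = (tr C)^2 / tr(C^2) for the covariance C of the row vectors.

    Equivalent to (sum of eigenvalues)^2 / (sum of squared eigenvalues),
    but computed without an eigendecomposition.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(d)]
           for a in range(d)]
    trace = sum(cov[a][a] for a in range(d))
    # tr(C^2) equals the sum of squared entries because C is symmetric.
    trace_sq = sum(cov[a][b] ** 2 for a in range(d) for b in range(d))
    return trace ** 2 / trace_sq if trace_sq > 0 else 0.0


def time_integrated_pr(snapshots):
    """Assumed integration rule: sum the per-checkpoint PR values."""
    return sum(participation_ratio(rows) for rows in snapshots)


one_direction = [[1, 0], [-1, 0], [2, 0], [-2, 0]]   # varies along one axis
isotropic = [[1, 0], [-1, 0], [0, 1], [0, -1]]       # equal variance on both axes
print(participation_ratio(one_direction), participation_ratio(isotropic))
```

Outputs confined to a single direction score 1, while outputs spread evenly over d directions score d, which is why the quantity can rank heads without labels.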

[AI-223] Signs Beat Floats: Low-Rank Double-Binary Adaptation for On-Device Fine-Tuning

Link: https://arxiv.org/abs/2605.24058
Authors: Yoshihiko Fujisawa,Yuma Ichikawa,Yudai Fujimoto,Akira Sakai,Katsuki Fujisawa
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 34 pages, 3 figures

Abstract:On-device adaptation of large language models commonly keeps a quantized base model frozen while training and deploying a small, task-specific LoRA adapter. In the unmerged adapter-mode setting, however, the adapter is more than a compact storage module; it introduces an additional dense floating-point branch, maintains a trainable state for local updates, and acts as a unit of communication and this http URL introduce LoRDBA, a LoRA-compatible adapter that replaces both low-rank factors with binary sign carriers while representing magnitudes through lightweight, channel-wise scales, converting the dense adapter branch into two sign-accumulation matrix multiplications interleaved with channel-wise scaling. A finite-sample analysis shows that reconstruction quality is governed by the residual-to-magnitude ratio of the original LoRA factors. In adapter-mode experiments, LoRDBA outperforms low-bit baselines at matched model sizes while matching fp16 LoRA quality in selected regimes. The unmerged adapter incurs at most 8% prefill latency overhead at matched rank r=16 despite an over 10x reduction in adapter footprint, with moderate training memory overhead of approximately 1.6x that of fp16 LoRA.
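
To make the "binary sign carriers plus channel-wise scales" idea concrete, here is a minimal sketch of sign-and-scale approximation for one matrix. It assumes one scale per row and uses the mean absolute value as that scale, which is the least-squares optimal scale for a fixed sign pattern. The paper binarizes both low-rank factors and its exact scaling and ratio definitions are not given in the abstract, so the function names and the ratio formula below are hypothetical.

```python
def binarize_rows(matrix):
    """Split each row into signs in {-1, +1} and one scale (mean |w|)."""
    signs = [[1 if w >= 0 else -1 for w in row] for row in matrix]
    scales = [sum(abs(w) for w in row) / len(row) for row in matrix]
    return signs, scales


def reconstruct(signs, scales):
    """Rebuild the approximation scale_i * sign_ij."""
    return [[s * b for b in row] for row, s in zip(signs, scales)]


def residual_to_magnitude(matrix):
    """Assumed ratio: ||W - W_hat||_F / ||W||_F for the sign-scale approximation."""
    signs, scales = binarize_rows(matrix)
    approx = reconstruct(signs, scales)
    resid = sum((w - a) ** 2
                for row, arow in zip(matrix, approx)
                for w, a in zip(row, arow))
    mag = sum(w * w for row in matrix for w in row)
    return (resid / mag) ** 0.5 if mag > 0 else 0.0


print(residual_to_magnitude([[1.0, -1.0, 1.0, -1.0]]))  # equal magnitudes: exact
print(residual_to_magnitude([[2.0, -1.0]]))             # unequal magnitudes: lossy
```

Rows whose entries share one magnitude are reproduced exactly, and the error grows as magnitudes within a row spread out, which matches the intuition behind a residual-to-magnitude criterion.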

[AI-224] Feature Lottery? A Bifurcation Theory of Concept Emergence

Link: https://arxiv.org/abs/2605.24057
Authors: Fuming Yang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Neural networks acquire structured representations at specific moments during training, yet identifying these transitions typically relies on retrospective, label-dependent metrics. We introduce a bifurcation theory of representation dynamics to detect these moments in real time. Analyzing a passive GMM probe attached to the evolving encoder, we show the onset of structure corresponds to a supercritical pitchfork bifurcation driven by the loss Hessian. The system exhibits a theoretically predictable zero-crossing ($\beta_c$) that, compared to the network’s current state ($\beta$), yields a dynamic ratio $\beta(t)/\beta_c(t)$: a universal, label-free phase coordinate for representation dynamics, computable entirely from hidden states. We empirically validate four distinct transition regimes predicted by this coordinate across diverse settings: SAEs on language models (Pythia), SSL (CIFAR), and grokking (modular arithmetic). Crucially, under finite dissipation, macroscopic symmetry-breaking can lag the initial zero-crossing by orders of magnitude, providing a rigorous dynamical account of the delayed escape observed in grokking. Microscopically, the bifurcation creates a shared unstable subspace, forcing collective symmetry breaking. We term this the “feature lottery” in SAE training: a feature’s terminal interpretability becomes predictable remarkably early. By only 5% of training, early atom purity robustly predicts final convergence purity, with top-decile early atoms achieving over 12x the baseline purity at convergence. Beyond explaining concept emergence, $\beta/\beta_c$ provides a practical early-warning indicator for training health, detecting the onset of usable structure, the crystallization of feature identity, and representational collapse epochs before downstream metrics react.

[AI-225] Cascade-KDE: Robust Time-Series Restoration under Out-of-Distribution Impulse Corruptions

Link: https://arxiv.org/abs/2605.24055
Authors: Yuefeng Liu,Ning Yang,Ziyu Yang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Real-world time-series data in industrial sensing, healthcare, and energy systems is often corrupted by a mixture of Gaussian noise and occasional large-magnitude impulse outliers. For tasks that depend on local shape, such as ECG morphology analysis and battery degradation monitoring, the main requirement is not only low reconstruction error but also preservation of derivative peaks and task-critical features. We propose Cascade-KDE, a training-free restoration framework for corrupted time series. The method first estimates a two-dimensional temporal-amplitude density, then applies a Density-Truncated Robust Expectation to limit the influence of distant abnormal points, and finally refines the sequence through an exponential cascade with adaptive stopping. This design aims to improve robustness under out-of-distribution impulse corruptions while keeping the restored trajectory close to the original local structure. Across several benchmark datasets, the proposed method shows consistent gains over classical filters and representative learning-based baselines on curve fidelity, derivative preservation, downstream classification, and runtime efficiency. These results suggest that bounded density-based restoration is a practical option for feature-preserving preprocessing in noisy time-series pipelines.

[AI-226] Truthful Online Preference Aggregation for LLM Fine-Tuning in Mobile Crowdsourcing

Link: https://arxiv.org/abs/2605.24052
Authors: Shugang Hao,Lingjie Duan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:To better serve users’ demands in mobile applications (e.g., navigation), mobile crowdsourcing platforms can iteratively align large language model (LLM)-generated content (e.g., AI-generated traffic condition predictions) with human feedback collected from crowdsourcing workers (e.g., mobile users). However, workers may strategically misreport their online preference feedback to maximize their influence or payment. Existing pipelines in mobile crowdsourcing (e.g., EM-based weight estimation) fail to identify the most accurate worker in this online setting, resulting in a linear regret $\mathcal{O}(T)$ over $T$ time slots. In this paper, we study truthful online preference aggregation for LLM fine-tuning in mobile crowdsourcing. We formulate a new dynamic Bayesian game to model the multi-agent online learning process between the platform and strategic mobile workers. We propose a novel online weighted aggregation mechanism that dynamically adjusts each worker’s weight in the preference aggregation according to their feedback accuracy. We prove that our mechanism ensures truthful feedback from strategic workers and achieves a sublinear regret $\mathcal{O}(\sqrt{T})$ over $T$ time slots. We further extend our mechanism to a challenging scenario with limited worker feedback per time slot, still guaranteeing a sublinear regret $\mathcal{O}(\sqrt{T})$. Experiments on LLM fine-tuning with real-world datasets further demonstrate significant performance gains of our mechanisms over benchmark schemes.
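
As a point of reference for accuracy-driven reweighting with square-root regret, the sketch below implements the classical Hedge (multiplicative weights) update. It is a stand-in, not the paper's mechanism: it assumes every worker's loss is observed in every slot and lies in [0, 1], and it does nothing to enforce truthful reporting. The name `hedge_weights` is hypothetical.

```python
import math


def hedge_weights(losses, eta):
    """Run Hedge over per-slot loss vectors and return the final weights.

    losses[t][i] is the loss in [0, 1] of worker i at time slot t.
    Each weight is multiplied by exp(-eta * loss) and then renormalized.
    """
    n = len(losses[0])
    weights = [1.0 / n] * n
    for slot in losses:
        weights = [w * math.exp(-eta * loss) for w, loss in zip(weights, slot)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return weights


T, n_workers = 50, 2
# Worker 0 is always accurate (loss 0), worker 1 is always wrong (loss 1).
history = [[0.0, 1.0] for _ in range(T)]
# Standard tuning for losses in [0, 1] with a known horizon T.
eta = math.sqrt(8 * math.log(n_workers) / T)
print(hedge_weights(history, eta))
```

With this learning rate, Hedge's regret against the best single worker is bounded by the square root of (T ln n) / 2, which is the same order as the sublinear bound quoted in the abstract.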

[AI-227] More Skills, Worse Agents? Skill Shadowing Degrades Performance When Expanding Skill Libraries

Link: https://arxiv.org/abs/2605.24050
Authors: Hongwen Song,Song(Vinson)Wei
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments:

Abstract:Skill libraries allow LLM agents to load task-specific instructions on demand, letting non-expert users solve domain-specific tasks through natural language without knowing which skills exist or how they work. However, performance degrades as libraries grow – by up to 21% when scaling from a small set of helpful skills to a 202-skill library. In this work, we formulate this performance degradation as the pass rate drop between loading a library of known-helpful skills and the full library. Moreover, we propose to decompose the pass rate drop by conditioning on skill invocation – which skills the agent selects during a trajectory – into two effects: skill shadowing, where the agent selects wrong skills more often as the library expands, and context overhead, where the enlarged context degrades execution even when selection is correct. We derive upper bounds on both effects to characterize the magnitude of their impact on the pass rate drop. Our empirical estimates of the effects and their upper bounds both show that the skill shadowing effect grows with library size and significantly contributes to the performance degradation, whereas the context overhead effect remains small and indistinguishable from zero. This observed asymmetry establishes that skill selection failure, not the enlarged context, is the primary bottleneck when expanding skill libraries.

[AI-228] Mixture of Complementary Agents for Robust LLM Ensemble

Link: https://arxiv.org/abs/2605.24048
Authors: Yichi Zhang,Kevin Lu,Yuang Zhang,Jie Gao,Lirong Xia,Fang-Yi Yu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multi-AI collaboration, such as ensembling or debating large language models (LLMs), is a promising paradigm for aggregating information and boosting performance. A foundational step in these pipelines is to feed the responses of several proposer LLMs into a summarizer LLM, which synthesizes a better answer. However, choosing which proposers to include is non-trivial. Existing approaches primarily focus either on accuracy (picking the strongest models) or diversity (ensuring variety), and often overlook the interactions among proposers and with the summarizer. We reframe proposer selection as a combinatorial selection problem akin to feature selection, where the value of an LLM lies in its complementarity with others. However, directly applying standard feature-selection algorithms is impractical in the LLM setting due to prohibitive time complexity. Motivated by this limitation, we explore an extensive range of computationally feasible, greedy-style selection algorithms that assess complementarity using a small labeled set. Our experiments validate complementarity as a guiding principle for proposer selection and identify methods that achieve the best performance-cost trade-offs in practice.

[AI-229] A Large-Scale Dataset and Benchmark: Do Protein-Ligand Models Learn Binding Sites or Just Binding Likelihood? NEURIPS2026

Link: https://arxiv.org/abs/2605.24045
Authors: Zhaohan Meng,Zhen Bai,Ke Yuan,Iadh Ounis,Zaiqiao Meng,Hao Xu,Joseph Loscalzo
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under Review for the NeurIPS 2026 Conference, Track on Evaluations and Datasets

Abstract:Protein-ligand modeling underpins computational drug discovery and molecular design. Existing protein-ligand benchmarks typically evaluate whether a protein and ligand interact and how strongly they bind, through tasks such as binary binding prediction and affinity regression. However, these evaluations provide limited evidence of whether models can localize binding sites or identify the non-covalent interactions underlying molecular recognition. To address this gap, we introduce InteractBind, a large-scale protein-ligand dataset comprising approximately 100k protein-ligand pairs, together with a benchmark for fine-grained evaluation. The core fine-grained task is that of binding-site localization, which uses protein-residue and ligand-atom interaction maps spanning six major types of non-covalent interactions to assess whether model-derived interaction maps localize binding sites. InteractBind further includes binding affinity and protein similarity-controlled splits to support realistic generalization assessment. Using InteractBind, we evaluate eight existing sequence-based and interaction-aware models, assessing binary binding prediction and binding-site localization. Results reveal limited binding-site localization despite strong binary binding prediction, with marked variation across non-covalent interaction types. Overall, InteractBind establishes a benchmark paradigm that encourages the development of more interpretable and physically grounded protein-ligand models.

[AI-230] LLM-AutoSciLab: Closed-Loop Scientific Discovery via Active Experimentation with LLMs

Link: https://arxiv.org/abs/2605.24043
Authors: Sanchit Kabra,Nikhil Abhyankar,Saaketh Desai,Prasad Iyer,Chandan K Reddy
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Scientific discovery is a closed-loop process in which hypotheses guide data acquisition and observations refine the hypothesis space. Yet most approaches reduce discovery to supervised learning over fixed datasets, where limited observations can support multiple plausible mechanisms that fit locally but fail to generalize. Thus, the key challenge is selecting informative observations to resolve uncertainty, shifting the focus from static inference to adaptive data acquisition. To address this, we propose LLM-AutoSciLab, a closed-loop framework that couples hypothesis generation with hypothesis-conditioned experiment selection and mechanism refinement. Rather than fitting models to passively collected data, LLM-AutoSciLab iteratively proposes plausible hypotheses, selects informative experiments to distinguish or refine them, and updates its state using the resulting evidence. To evaluate dynamic, closed-loop scientific discovery with active data acquisition, we introduce ActiveSciBench, comprising two datasets: ActiveSciBench-Chem with 57 enzyme-kinetics tasks and ActiveSciBench-GRN with 45 gene-regulatory-network tasks. These datasets model discovery as a budget-constrained process requiring adaptive experiment design, variable selection, and recovery of true mechanisms. Across NewtonBench, ActiveSciBench-Chem, and ActiveSciBench-GRN, LLM-AutoSciLab outperforms prior methods, achieving 67.6% and 35.1% symbolic accuracy on NewtonBench and ActiveSciBench-Chem, respectively, and 31.1% exact graph recovery on ActiveSciBench-GRN. Moreover, hypothesis-guided experimentation is 2-5x more sample-efficient than the strongest competing baselines. Code and data are available at: this https URL

[AI-231] Hidden-State Privacy Has an Empty Middle

Link: https://arxiv.org/abs/2605.24042
Authors: Alexander Okezue Bell
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 74 pages, 61 figures

Abstract:Of 1,536 Gaussian release covariances we tested for single-layer hidden-state privacy, zero achieve both moderate utility and moderate privacy against an adaptive retrieval attacker. We prove a complementary Fisher-ball lower bound: every full-rank Gaussian release at $O(1)$ Fisher utility admits a direction whose Mahalanobis signal grows linearly in hidden width, ruling out uniform Gaussian safety in the class and matching the empirical empty middle. The diagonal inverse-Fisher release $\Sigma^\star_{\mathrm{diag}}(\mathcal{K}) = (2\mathcal{K}/d)\,\mathrm{diag}(1/F_{ii})$ is the unique minimax-optimal diagonal mechanism at first-order KL budget $\mathcal{K}$ and the only release with worst-attacker top-1 $\le 0.001$ at every point of a 32 model-layer grid, but it sits on a privacy/utility edge rather than filling the middle. A generalized-eigen mechanism reaching $13\times$ Pareto reduction under Euclidean retrieval collapses to 100% top-1 under the adaptive Mahalanobis attacker, and a full-trajectory sequence inverter recovers 94% of clean GPT-2 prefixes but 0% under $\Sigma_{\mathrm{diag}}$. A split-memory transformer trained from scratch reaches $G_{\mathrm{Mah}} \in [20, 33]$ at 90M and maintains a $6$–$24\times$ advantage over same-budget GPT baselines from 30M to 1B at a fixed-token language-modeling loss penalty; pretrained models top out at 9.3. These results reframe hidden-state release from mechanism-design within the Gaussian class to architecture or release co-design.
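
The diagonal inverse-Fisher release has a simple closed form, so a small numerical sketch may help. It assumes the first-order KL cost of additive Gaussian noise with covariance Σ is ½·tr(FΣ) restricted to the Fisher diagonal; that reading of "first-order KL budget" is an assumption, as are the function names and the toy Fisher values.

```python
import random


def diag_inverse_fisher_variances(fisher_diag, kl_budget):
    """Per-coordinate noise variances (2K/d) * (1 / F_ii)."""
    d = len(fisher_diag)
    return [2.0 * kl_budget / (d * f) for f in fisher_diag]


def release(hidden, fisher_diag, kl_budget, rng):
    """Add independent Gaussian noise with the variances above."""
    variances = diag_inverse_fisher_variances(fisher_diag, kl_budget)
    return [h + rng.gauss(0.0, v ** 0.5) for h, v in zip(hidden, variances)]


fisher = [0.5, 2.0, 4.0, 1.0]   # toy diagonal Fisher entries (hypothetical)
budget = 0.1                    # KL budget K (hypothetical)
variances = diag_inverse_fisher_variances(fisher, budget)
# Under the assumed cost 0.5 * sum(F_ii * sigma_i^2), the budget is met exactly:
# each coordinate contributes 0.5 * (2K/d) = K/d.
first_order_kl = 0.5 * sum(f * v for f, v in zip(fisher, variances))
print(variances, first_order_kl)
print(release([0.3, -1.2, 0.8, 0.0], fisher, budget, random.Random(0)))
```

The scaling puts more noise on coordinates with low Fisher information, so each coordinate spends an equal share K/d of the budget.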

[AI-232] Iterative Refinement Neural Operators are Learned Fixed-Point Solvers: A Principled Approach to Spectral Bias Mitigation ICML2026

Link: https://arxiv.org/abs/2605.24041
Authors: Xiaotian Liu,Shuyuan Shang,Xiaopeng Wang,Pu Ren,Yaoqing Yang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 47 pages; accepted to ICML 2026 as a Spotlight

Abstract:Neural operators serve as fast, data-driven surrogates for scientific modeling but typically rely on a monolithic, single-pass inference procedure that struggles to resolve high-frequency details, a limitation known as spectral bias. We introduce the Iterative Refinement Neural Operator (IRNO), which augments pre-trained operators with a learned refinement module iteratively applied via fixed-point iteration. IRNO decomposes the prediction into a coarse initialization followed by successive residual corrections, paralleling classical numerical solvers. Under local assumptions, we establish contraction of the induced operator, ensuring convergence to a unique fixed point. To explicitly target high-frequency errors, we propose a progressive spectral loss that adaptively increases penalty on high-frequency components over refinement steps during training. Across physical systems, IRNO consistently lowers error, with up to 56.05% improvement on turbulent flow. On Active Matter, spectral analysis reveals that, relative to base operator, the normalized error ratios decrease to 27.72-36.10% in low-, 5.07-6.68% in mid-, and 1.48-2.04% in high-frequencies, remaining stable beyond the trained iteration count. Code is available at this https URL
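
The "coarse initialization plus successive residual corrections" loop is ordinary fixed-point iteration. Below is a toy scalar sketch in which `toy_refine` stands in for the learned refinement module; the function names, the tolerance, and the specific contraction are hypothetical and chosen only so that convergence is easy to verify.

```python
def fixed_point_refine(refine_step, u0, tol=1e-10, max_iter=100):
    """Iterate u <- refine_step(u) until successive iterates differ by < tol.

    Returns the final iterate and the number of steps taken.
    """
    u = u0
    for k in range(1, max_iter + 1):
        u_next = refine_step(u)
        if abs(u_next - u) < tol:
            return u_next, k
        u = u_next
    return u, max_iter


def toy_refine(u):
    """Residual-correction form u + r(u) with r(u) = 0.5 * (2 - u).

    The map has Lipschitz constant 0.5, so it is a contraction with the
    unique fixed point u = 2.
    """
    return u + 0.5 * (2.0 - u)


solution, steps = fixed_point_refine(toy_refine, u0=0.0)
print(solution, steps)
```

Because the map is a contraction, the error halves at every step regardless of the starting point, which mirrors the role of the contraction result in the abstract.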

[AI-233] SA-Kura: An Energy-Efficient Systolic Array Accelerator for Locally-Coupled Kuramoto Drift in Diffusion Sampling

Link: https://arxiv.org/abs/2605.24016
Authors: Jeongmin Jin,Kyeongwon Lee,Mundo Jeong,Jongin Choi,Woojoo Lee
Affiliation: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: 8 pages, 6 figures, 1 table; ACM/IEEE ISLPED 2026 accepted paper

Abstract:Diffusion inference remains costly for edge deployment, yet existing accelerators focus almost exclusively on score networks because standard drift is merely a trivial linear scaling. Kuramoto orientation diffusion replaces this trivial drift with locally coupled phase interactions, improving sampling efficiency but introducing a new hardware bottleneck: a center-dependent nonlinear 5 x 5 stencil evaluated at every reverse step. This kernel maps poorly to conventional CNN accelerators and matrix-oriented engines. We present SA-Kura, to our knowledge the first digital systolic-array accelerator dedicated to locally coupled Kuramoto drift. By reformulating pair-wise sinusoidal coupling into neighbor accumulation independent of the center phase followed by a single center-dependent multiply-subtract combination, SA-Kura eliminates in-PE transcendental units and enables regular systolic execution with register-level reuse. SA-Kura was implemented in synthesizable RTL, integrated into a lightweight RISC-V-based SoC, prototyped on FPGA, and evaluated through 45 nm CMOS synthesis and power analysis. For the drift kernel only, compared with software execution of the same kernel on the processor core in the same SoC platform, SA-Kura reduces latency and energy by 193x and 69.4x, respectively. Compared with a standalone Jetson Orin Nano CUDA implementation of the same kernel, it is 6.57x faster and achieves approximately 46.0x lower energy per pixel.
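
The reformulation described here is consistent with the angle-difference identity sin(θⱼ − θ_c) = sin θⱼ cos θ_c − cos θⱼ sin θ_c, which lets the neighbor sums be accumulated without knowing the center phase. The sketch below checks that equivalence numerically, assuming uniform coupling weights over the 24 neighbors of a 5 x 5 stencil; the function names and the weighting are assumptions, not the accelerator's actual datapath.

```python
import math


def drift_direct(center, neighbors):
    """Pairwise form: one sine of a phase difference per neighbor."""
    return sum(math.sin(theta - center) for theta in neighbors)


def drift_reformulated(center, neighbors):
    """Accumulate center-independent sums, then one multiply-subtract."""
    sin_sum = sum(math.sin(theta) for theta in neighbors)
    cos_sum = sum(math.cos(theta) for theta in neighbors)
    return math.cos(center) * sin_sum - math.sin(center) * cos_sum


# 24 neighbor phases of a 5 x 5 window (center excluded), arbitrary values.
neighbor_phases = [0.1 * i for i in range(24)]
center_phase = 0.7
print(drift_direct(center_phase, neighbor_phases))
print(drift_reformulated(center_phase, neighbor_phases))
```

In the second form the per-neighbor work needs only the sine and cosine of each phase, which can be computed once per pixel and shared by every window that contains that pixel.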

[AI-234] Beyond Predefined Learning Objects: A Thinking-Learning Interaction Model for Up-to-Date Autonomous Robot Learning

Link: https://arxiv.org/abs/2605.23987
Authors: Hong Su
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:Autonomous robots operating in open and changing environments cannot always rely on predefined inputs, outputs, and action routines. Although existing learning methods enable robots to improve their performance through environmental interaction, the objects of learning are often fixed in advance, such as input features, recognition outputs, network structures, task goals, or action sequences. This limits their ability to adapt when new features, new categories, or more efficient task routines appear during long-term operation. To address this problem, this paper proposes a thinking-learning interaction model for autonomous robots. The core idea is that thinking guides learning by identifying potential changes, selecting useful evidence, organizing training materials, and planning verification actions, while learning promotes thinking by updating task knowledge, feature-selection experience, action strategies, and future reasoning processes. Based on this bidirectional mechanism, the robot can gradually move beyond predefined learning settings and adapt its recognition relations and action relations through continuous interaction with the environment. Specifically, the proposed model supports adaptive input feature discovery, output category expansion, learning model update, and action routine reconstruction. Experimental results show that the proposed model improves the final recognition accuracy from 0.419 to 0.845 in feature adaptation, achieves higher new-category formation accuracy and model-update success rate, and reduces the average action length from 13.0 to 4.0 in action routine reconstruction. In learning-enhanced thinking, the useful evidence selection rate increases from 0.272 to 0.965, indicating that learning results can effectively improve future evidence selection and reasoning.

[AI-235] Saturating Scaling Laws for Equational Discovery: A Phenomenology of Growth Dynamics in Three Toy Substrates with Two Real-World Replications

Link: https://arxiv.org/abs/2605.23983
Authors: Fabio Rovai
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Social and Information Networks (cs.SI)
Comments: 17 pages, 5 figures, 4 tables, 2 algorithms. Code and data at this https URL (currently private; will be made public on acceptance)

Abstract:We investigate growth dynamics in deterministic equational discovery substrates. Across three toy domains (arithmetic, boolean, higher-order list; n=592 trajectories), short-range substrate sizes fit a power-law N(t) proportional to t^b. Within each substrate b is architecture-sensitive (cross-validated R^2 approximately 0.82); the regression does not transfer across substrates (arith+bool to list yields R^2 approximately -0.84). A heuristic mean-field closure model predicts a saturating power-law dN/dt = K N^k exp(-mu N) of which the pure power-law is the short-range approximation. Three robustness checks: bootstrap intervals on (k, mu) are tight in 4/5 toy trajectories and degenerate in 1/5; out-of-sample forecasting on toy data (fit first 100 epochs, predict next 400) is won by pure power-law 5/5, indicating the toy trajectories do not reach saturation; on two real-world growth proxies the result splits. New Mathlib/*.lean file additions per month (mathlib4, 60 months, 9701 files) support the saturating form on OOS forecasting by approximately 7x over pure power-law; Coq mathcomp monthly commits (129 months, 3083 commits) favour pure power-law on both tests with mu collapsing to zero. The dynamics are substrate-conditional at two levels: within-substrate architecture-to-b regressions do not transfer, and the preferred functional family for N(t) itself (pure vs. saturating power-law) differs by substrate. We propose “saturating power-law growth with substrate-conditional (k, mu), observable when the substrate has reached its saturation regime” as a working framing.
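
To see how the saturating form dN/dt = K N^k exp(-mu N) departs from a pure power law, the sketch below integrates it with explicit Euler steps. The parameter values, step size, and function name are hypothetical and chosen only for illustration. With mu = 0 and k = 0.5 the equation has the closed-form solution N(t) = (1 + t/2)^2 from N(0) = 1, which gives a sanity check on the integrator.

```python
import math


def integrate_growth(n0, K, k, mu, steps, dt=0.01):
    """Explicit Euler integration of dN/dt = K * N**k * exp(-mu * N)."""
    n = n0
    trajectory = [n]
    for _ in range(steps):
        n += dt * K * n ** k * math.exp(-mu * n)
        trajectory.append(n)
    return trajectory


# Integrate to t = 10 with and without the saturation term.
pure = integrate_growth(n0=1.0, K=1.0, k=0.5, mu=0.0, steps=1000)
saturating = integrate_growth(n0=1.0, K=1.0, k=0.5, mu=0.05, steps=1000)
exact_pure = (1.0 + 10.0 / 2.0) ** 2   # closed form at t = 10 when mu = 0
print(pure[-1], exact_pure, saturating[-1])
```

While mu·N is small the two curves are nearly indistinguishable, which is consistent with the abstract's remark that the pure power law is the short-range approximation.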

[AI-236] LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs

Link: https://arxiv.org/abs/2605.23965
Authors: Zenghui Zhou,Man Li,Xiaoke Fang,Xinyi Zhou,Weibin Li,Zheng Zheng
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: Zheng Zheng is the corresponding author

Abstract:Large Language Models (LLMs) achieve strong performance on logical reasoning benchmarks, yet their reliability remains uncertain. Existing evaluations rely on static benchmarks, which fail to assess robustness under logically equivalent transformations and often overestimate reasoning capability. We propose LGMT (Logic-Grounded Metamorphic Testing), an oracle-free framework that leverages first-order logic (FOL) to evaluate LLM reasoning. By deriving metamorphic relations from formal logical equivalences, LGMT constructs semantically invariant test cases and detects reasoning defects through cross-case consistency checking. Experiments on six state-of-the-art LLMs show that LGMT exposes substantial hidden defects missed by traditional reference-based evaluations. We further find that models are particularly sensitive to symbol-level and conclusion-level variations, and that advanced prompting such as Few-shot CoT only partially mitigates these issues. These results suggest that LLM evaluation should move beyond isolated correctness toward robustness under logical invariance. LGMT provides a principled and scalable approach for diagnosing reasoning failures.

[AI-237] Multi-market value-stacking: Battery control for combined imbalance participation and non-uniform FCR bidding

Link: https://arxiv.org/abs/2605.23964
Authors: Celle Hendrickx,Fabio Pavirani,Chris Develder
Affiliation: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments: 5 pages, 2 figures. Presented at ACM Sustainability Week 2026 (ACM Sustainability Week Companion 26), June 22-25, 2026, Banff, AB, Canada

Abstract:The growing share of Renewable Energy Sources (RES) in modern power systems increases both grid imbalances and frequency deviations, reinforcing the need for ancillary services such as Frequency Containment Reserve (FCR) and passive balancing. Battery Energy Storage Systems (BESS) are well-suited for these services, but prior research typically relies on uniform FCR bids that remain constant throughout the control period. Such static bids fail to fully exploit BESS flexibility, as they do not balance the trade-off between reserving energy for FCR delivery and using it for imbalance arbitrage, limiting the achievable value in value-stacking settings. To address this limitation, we propose a two-stage control framework for the European context that introduces non-uniform FCR bids. In the first stage, we derive a time-varying bid sequence using data-driven Monte Carlo (MC) optimization. In the second stage, a Deep Reinforcement Learning (DRL) agent leverages the residual flexibility for real-time imbalance trading while proactively managing the State of Energy (SoE) to ensure compliance with FCR requirements. The framework is presented as a proof of concept, highlighting the potential benefits of time-varying bidding strategies. By incorporating daily cycle budgets and time-varying reserve commitments, our approach achieves a 7.56% profit increase compared to uniform baselines. These results show that non-uniform bidding can unlock additional value by more effectively aligning reserve obligations with rapidly changing imbalance opportunities.

[AI-238] AI in the Enterprise: How People Use M365 Copilot Chat

Link: https://arxiv.org/abs/2605.23958
Authors: Scott Counts,Yan Chen,Jing Dong,Himanshu Sharma,Andrey Zaikin,Rui Hu,Alperen Kok,Gorkem Ozer Yilmaz,Siddharth Suri,Kiran Tomlinson,Sonia Jaffe,Will Wang
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); General Economics (econ.GN)
Comments:

点击查看摘要

Abstract:M365 Copilot is used every week by millions of people across more than a million companies around the world as part of their workflows. Uniquely positioned in the AI landscape given its near-exclusive use for work purposes, M365 Copilot can offer a clear picture of how people use AI for work and where that usage may expand next. This paper characterizes that usage through direct classification of user interactions with M365 Copilot Chat. Based on an anonymized and privacy-preserving analysis of a sample of approximately 5.5 million sessions, we combine a learned classification of user intent with a classification of O*NET work activities done with M365 Copilot Chat. We find that M365 Copilot is emerging as an everyday assistant for knowledge work: writing dominates, but users also rely on it for information retrieval, analysis, decision making and strategizing, and evaluating and diagnosing programs and systems, among others. Information seeking tasks remain common, but time trends suggest a relative shift away from “chat as search” and toward content and communication-related work. Comparisons across occupational groupings and to work done in the labor market further show that usage is broad but uneven, where the relative share of work done with M365 Copilot Chat cuts across jobs in some cases and is occupation-specific in others. Areas of relative underrepresentation in the labor market suggest the next frontier for enterprise AI adoption.

[AI-239] Low-Cost Labels Reliable Choices: Rollout-Calibrated Hyper-Heuristics for Job Shop Scheduling

Link: https://arxiv.org/abs/2605.23957
Authors: Junhao Wei,Yanxiao Li,Yifu Zhao,Zhenhong Peng,Baili Lu,Dexing Yao,Haochen Li,Qinbin He,Sio-Kei Im,Yapeng Wang,Xu Yang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Learning-assisted hyper-heuristics can select among dispatching rules while preserving the feasibility and interpretability of constructive Job Shop Scheduling Problem (JSSP) heuristics. Their main computational cost lies in label generation rather than model fitting, since each supervised label usually requires rolling out candidate rules from a partial schedule. We study this label-cost problem together with a reliability problem: a learned selector should not switch away from a strong default rule unless the predicted gain is credible. The proposed selector uses regret-normalized rollout labels, a contextual KNN uncertainty estimate, and a gate that acts only when the predicted improvement exceeds an uncertainty-adjusted margin. We also vary rollout depth and breadth to measure the cost-quality trade-off. On synthetic JSSP instances, the gated selector achieves the lowest mean RPD among learned selectors, remains close to the best fixed dispatching rule, and reduces Random-HH mean RPD by more than an order of magnitude.
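
The gating idea in this abstract (leave the default rule only when the predicted gain clears an uncertainty-adjusted margin) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `RulePrediction` class, the `kappa` and `margin` parameters, and the rule names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RulePrediction:
    rule: str              # name of a candidate dispatching rule (hypothetical)
    predicted_gain: float  # predicted improvement over the default rule
    uncertainty: float     # uncertainty estimate attached to that prediction


def gated_select(default_rule, predictions, kappa=1.0, margin=0.0):
    """Return the rule to apply, switching only when the gain looks credible.

    A candidate is accepted when
        predicted_gain - kappa * uncertainty > margin.
    `kappa` and `margin` are assumed tuning knobs, not taken from the paper.
    """
    best_rule, best_score = default_rule, margin
    for p in predictions:
        score = p.predicted_gain - kappa * p.uncertainty
        if score > best_score:
            best_rule, best_score = p.rule, score
    return best_rule


candidates = [
    RulePrediction("SPT", predicted_gain=0.05, uncertainty=0.10),
    RulePrediction("MWKR", predicted_gain=0.08, uncertainty=0.02),
]
chosen = gated_select("FIFO", candidates)
```

With a larger `kappa` the gate becomes more conservative and falls back to the default rule more often, which is the reliability behaviour the abstract describes.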

[AI-240] From Accuracy to Auditability: A Survey of Determinism in Financial AI Systems

Link: https://arxiv.org/abs/2605.23955
Authors: Ruizhe Zhou,Xiaoyang Liu,Gaoyuan Du,Yi Zheng,Shouxi Ren,Deepayan Chakrabarti,Dengdu Jiang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Computational Finance (q-fin.CP)
Comments:

Click to view abstract

Abstract:Deploying machine learning in regulated financial environments – credit risk, fraud detection, and anti-money laundering – exposes critical vulnerabilities in algorithmic reproducibility. While early financial ML addressed statistical challenges such as backtest overfitting, deep neural networks and Generative AI have introduced mechanical nondeterminism rooted in hardware and architecture. This survey provides a systems perspective on reproducibility failures across three modalities now dominant in financial AI: tabular models (post-hoc explanation variance), graph networks (stochastic sampling and temporal asynchrony), and LLM-based agentic workflows (batch-dependent divergence and trajectory drift). We supplement the literature analysis with first-party experiments on public financial datasets – quantifying explanation rank instability in credit scoring, prediction flip rates in GNN-based fraud detection, and tensor-parallel-induced output divergence in LLM entity extraction. We propose a layered evaluation framework linking modality-specific metrics (RBO, D_cos, TDI, PSD) to audit readiness, and empirically validate the complementarity of logit-level and semantic-level determinism measures.
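
Assuming that "RBO" in this abstract refers to rank-biased overlap, a common measure of agreement between two rankings, explanation rank instability could be quantified along the following lines. This is a truncated variant over the shared prefix length; the persistence parameter `p` and the feature names are illustrative assumptions, and the paper's exact definition may differ.

```python
def rbo_truncated(ranking_a, ranking_b, p=0.9):
    """Truncated rank-biased overlap of two duplicate-free rankings.

    Computes (1 - p) * sum_{d=1..k} p**(d-1) * |A[:d] & B[:d]| / d,
    where k is the length of the shorter ranking. Identical rankings
    give 1 - p**k, disjoint rankings give 0.
    """
    depth = min(len(ranking_a), len(ranking_b))
    total = 0.0
    for d in range(1, depth + 1):
        overlap = len(set(ranking_a[:d]) & set(ranking_b[:d]))
        total += p ** (d - 1) * overlap / d
    return (1 - p) * total


# Two hypothetical feature-importance rankings from repeated explanation runs.
run_1 = ["income", "debt_ratio", "age"]
run_2 = ["debt_ratio", "income", "age"]
score = rbo_truncated(run_1, run_2, p=0.5)
```

Lower values indicate that the top-ranked features change between runs, which is the kind of instability an auditor would want to flag.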

[AI-241] Stop Comparing LLM Agents Without Disclosing the Harness

Link: https://arxiv.org/abs/2605.23950
Authors: Yunbei Zhang,Janet Wang,Yingqiang Ge,Weijie Xu,Jihun Hamm,Chandan K. Reddy
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Click to view abstract

Abstract:This position paper argues that, for long-horizon tasks evaluated across models with comparable frontier capability, the agent execution harness, namely the infrastructure layer that governs context construction, tool interaction, orchestration, and verification around a language model, is often a stronger determinant of agent performance than the model it wraps. We formalize and defend the Binding Constraint Thesis: in this regime, performance variance is governed more by harness configuration than by model choice, and current evaluation protocols therefore systematically misattribute harness-level gains to model improvements. We support this thesis along three lines. First, a control-theoretic formalization treats the harness as the controller of a closed-loop dynamical system and the LLM as the stochastic policy it governs, which explains why small harness changes can produce performance shifts that exceed those obtained by substituting one model for another. Second, published benchmarks, industry deployments, and a controlled variance decomposition show that harness-induced variance can substantially exceed model-induced variance, including cases of model ranking reversal. Third, we propose a harness-aware evaluation framework with a disclosure standard and a variance decomposition protocol. Until harness specifications are disclosed, leaderboard comparisons for long-horizon agents should be treated as incomplete and potentially misleading.
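
One simple way to picture a variance decomposition of this kind is to score every model under every harness and compare the variance of the per-model means with the variance of the per-harness means. The sketch below assumes a balanced grid of results; the function name and the numbers are synthetic placeholders, not results from the paper, and the authors' protocol may be more elaborate.

```python
from statistics import mean, pvariance


def variance_decomposition(scores):
    """Split score variation into a model part and a harness part.

    `scores[model][harness]` is a success rate; every model is assumed
    to have been run under the same set of harnesses.
    """
    models = list(scores)
    harnesses = list(scores[models[0]])
    model_means = [mean([scores[m][h] for h in harnesses]) for m in models]
    harness_means = [mean([scores[m][h] for m in models]) for h in harnesses]
    return {
        "model": pvariance(model_means),
        "harness": pvariance(harness_means),
    }


# Synthetic example values, for illustration only.
synthetic_scores = {
    "model_a": {"harness_1": 0.40, "harness_2": 0.60},
    "model_b": {"harness_1": 0.45, "harness_2": 0.65},
}
parts = variance_decomposition(synthetic_scores)
```

In this synthetic grid the harness component is larger than the model component, which is the pattern the thesis predicts; real data would of course have to be measured.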

[AI-242] AI-Driven Controlled Environment Agriculture as Resilient Infrastructure for U.S. Fresh-Produce Supply Chains

Link: https://arxiv.org/abs/2605.23946
Authors: Andrii Vakhnovskyi
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 12 pages, 5 figures, 7 tables. Includes open-data greenhouse control metrics demonstration

Click to view abstract

Abstract:Climate volatility, regional production concentration, labor constraints, cyber risk, and dependence on long-distance fresh-produce supply chains expose vulnerabilities in U.S. fresh-produce and specialty-crop systems. Controlled environment agriculture (CEA) can reduce some exposure by moving selected production into protected, sensor-rich environments, but recent failures in venture-backed vertical farming show that CEA cannot be treated as a universal food-security solution. This paper proposes the Controlled Environment Agriculture Resilience Infrastructure Framework, Version 2.0 (CEA-RIF 2.0), for evaluating AI-driven CEA as targeted regional fresh-produce continuity infrastructure. The framework assesses seven dimensions: supply continuity, climate isolation, energy and grid integration, water and nutrient circularity, cyber-physical reliability, economic viability, and governance and deployment. Drawing on U.S. government reports, peer-reviewed CEA and energy literature, demand-response research, cybersecurity standards, international smart-agriculture programs, 2025-2026 financing and policy signals, and public autonomous-greenhouse datasets, the paper argues that AI creates resilience value only when it improves measured operational outcomes such as climate stability, energy flexibility, yield consistency, anomaly detection, labor productivity, and safe recovery from faults. The analysis reframes AI-driven CEA as a cyber-physical infrastructure problem: energy-aware, grid-interactive, secure, interoperable, regionally distributed, financially disciplined, and connected to public resilience goals. The paper concludes with a research agenda for interagency testbeds, open datasets, standardized metrics, demand-response pilots, and cyber-physical reference architectures.

[AI-243] Accelerating Long-Tail Generation in Synchronous RLHF Training via Adaptive Tensor Parallelism

Link: https://arxiv.org/abs/2605.23945
Authors: Long Zhao,Qinghe Wang,Jiaan Zhu,Youhui Bai,Zewen Jin,Chaoyi Ruan,Shengnan Wang,Cheng Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 11 pages, 14 figures

Click to view abstract

Abstract:Reinforcement Learning from Human Feedback (RLHF) has become a key post-training paradigm for improving model quality. However, the synchronous three-stage RLHF pipeline is often bottlenecked by the generation stage, where response-length skew causes the effective batch size to shrink rapidly during decoding, leaving GPUs underutilized while a few long responses remain unfinished. Mainstream frameworks employ a static tensor parallelism (TP) configuration that cannot adapt to changing batch characteristics, leaving substantial performance headroom unexplored. We propose PAT, an adaptive TP method that dynamically reconfigures TP during the generation stage of each RLHF iteration. PAT introduces two key techniques. First, a predictor-guided online reconfiguration method decides both the reconfiguration point and the target TP configuration based on offline profiling, triggering reconfiguration only when the predicted latency benefit outweighs the reconfiguration overhead. Second, a lightweight online reconfiguration mechanism updates only the states and layouts affected by TP changes: it adapts unfinished decoding states through a cost-model-based choice between KV-cache migration and recomputation, performs in-place weight resharding, and reuses cached communication groups. We implement PAT on top of SGLang and integrate it with the VeRL framework. Evaluations on LLaMA3.1-8B and Qwen3-14B using DeepScaleR show that PAT reduces generation latency by up to 34.6% and end-to-end RLHF training iteration latency by up to 27.2% compared to the original VeRL setup.
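
The two decisions described in this abstract (reconfigure only when the predicted latency saving exceeds the reconfiguration overhead, then pick the cheaper of KV-cache migration and recomputation) can be illustrated with a small sketch. Everything here is hypothetical: the function names, the single scalar `overhead`, and the latency figures are assumptions for illustration and do not come from PAT, SGLang, or VeRL.

```python
def plan_reconfiguration(predicted_latency, current_tp, overhead):
    """Pick a tensor-parallel degree for the remaining decoding work.

    `predicted_latency` maps a TP degree to the predicted remaining
    latency in seconds (in the paper this comes from offline profiling;
    here it is simply given). The current degree is kept unless the
    best alternative saves more time than `overhead` costs.
    """
    best_tp = min(predicted_latency, key=predicted_latency.get)
    saving = predicted_latency[current_tp] - predicted_latency[best_tp]
    if best_tp != current_tp and saving > overhead:
        return best_tp
    return current_tp


def choose_kv_strategy(migration_cost, recompute_cost):
    """Cost-model choice for unfinished decoding state (assumed costs)."""
    return "migrate" if migration_cost <= recompute_cost else "recompute"


# Hypothetical predictions for the tail of a generation batch.
latency_by_tp = {2: 10.0, 4: 7.0, 8: 6.5}
target = plan_reconfiguration(latency_by_tp, current_tp=2, overhead=2.0)
strategy = choose_kv_strategy(migration_cost=1.0, recompute_cost=3.0)
```

A real system would likely make the overhead depend on the target configuration; a single scalar is used here only to keep the sketch short.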

[AI-244] Right-Sizing Communication and Recommendation Set Size in AI-Assisted Search

Link: https://arxiv.org/abs/2605.23944
Authors: Jing Dong,Prakirt Raj Jhunjhunwala,Yash Kanoria
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Probability (math.PR)
Comments:

Click to view abstract

Abstract:We model the interaction between a user and an AI driven recommendation system. The user initiates the process by conveying preference information through a costly and noisy message. The AI assistant, acting as a Bayesian agent, interprets the user’s message to form a posterior belief about their true preferences and make product recommendations. In particular, it determines how many recommendations to present so as to maximize the user’s expected utility from their final choice, while accounting for the search cost induced by the size of the recommendation set. We use mutual information based cost functions to model the two distinct costs incurred by the user during the interaction: (i) a communication cost, which increases with the precision of their preference message, and (ii) a search cost, which increases with the size of the recommendation set provided by the AI assistant. We study products and preferences which live in d dimensional space, and ask how the user’s expected payoff can be maximized. For large d, we characterize how optimal message precision and recommendation set size depend on the cost parameters, under two distinct distributions from which recommendations can be sampled from the product universe: (i) Bayes’ posterior belief, and (ii) an optimized tilted distribution. Under the posterior sampling scheme (i), we identify a hybrid regime, in which an efficient interaction policy requires jointly optimizing the amount of information (in bits) conveyed by the user and the number of recommendations provided by the AI assistant. In the tilted sampling scheme (ii), our results show that the optimal interaction policy uses only one of communication and search, favoring whichever of them is less costly.

[AI-245] Spacetime Formation under Requirements: Contextual Realization and Form-Dependent Probability

Link: https://arxiv.org/abs/2605.23943
Authors: Song-Ju Kim
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); History and Philosophy of Physics (physics.hist-ph); Quantum Physics (quant-ph)
Comments: 19 pages, 1 figure

Click to view abstract

Abstract:Quantum cognition often explains order effects, contextuality, and violations of the law of total probability by replacing classical probability with quantum probability on a fixed event structure. This paper proposes a different interpretation: quantum probability is the fixed-spacetime projection of contextual spacetime formation under finite-state requirements. The framework begins not with time, space, objects, or probabilities, but with requirements such as finite representational capacity, single-state semantic stability, context-sensitive intervention, avoidance of explicit context labels, coherent world-formation, and intersubjective transformability. When these requirements cannot be realized within a single global Boolean event structure, the mismatch appears, under fixed-spacetime projection, as noncommutativity, interference, and quantum-like probability. Building on prior single-state approaches to contextuality, we reinterpret classical contextual bookkeeping cost as the fixed-spacetime shadow of contextual spacetime formation. Auxiliary memory or context labels in a classical representation correspond, in this account, to holonomy-like mismatch among locally Boolean logic-worlds. The interference term is the cross term generated when locally classical realization contributions are nontrivially glued and projected back into a fixed classical spacetime form. The result is a transcendental-operational realist account: objecthood, eventhood, probability, and spacetime are treated as forms of realization under requirements, while objectivity is defined by invariants preserved across observer- and history-dependent spacetime formations. 

[AI-246] A Dynamical Framework for Cognitive Processes Based on Transformations and Semantic Equivalence

Link: https://arxiv.org/abs/2605.23942
Authors: Carlo Cattani,Dioneia Motta Monte-Serrat
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:This paper proposes a structural and dynamical framework for modeling cognitive processes within a cybernetic perspective. Cognitive states are represented as elements of a state space evolving through an iterative update rule of the form \[ X_{t+1} = \pi\big(F(f(X_t))\big), \] where $f$ describes internal transformations, $F$ represents interpretative mappings, and $\pi$ enforces semantic equivalence. The model is interpreted as a feedback system integrating transformation, observation, and stabilization. A categorical formulation is introduced to capture compositional structure, while the associated dynamics are analyzed through fixed-point arguments and contraction conditions ensuring stability. To demonstrate the operational character of the framework, a computational illustration is provided, together with a qualitative analysis of the induced dynamics. A concrete linguistic application shows how context-dependent interpretation can be modeled as a trajectory toward a stable semantic class. The proposed approach connects dynamical systems, category theory, and cognitive modeling, and provides a unified representation of cognition as a feedback-driven process evolving toward invariant interpretations.
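
The update rule in this abstract, a transformation followed by an interpretative mapping and a projection onto equivalence classes, can be iterated until it reaches a fixed point. The toy instance below is only meant to show the shape of the iteration: the specific choices of `f` (a contraction on the reals), `F` (the identity), and `pi` (floor, mapping each value to an integer class representative) are assumptions made for illustration and are not taken from the paper.

```python
import math


def f(x):
    """Internal transformation: a contraction with factor 0.5 (assumed)."""
    return 0.5 * x + 1.0


def F(y):
    """Interpretative mapping: identity in this toy example (assumed)."""
    return y


def pi(z):
    """Projection onto a class representative: floor to an integer (assumed)."""
    return math.floor(z)


def step(x):
    """One application of X_{t+1} = pi(F(f(X_t)))."""
    return pi(F(f(x)))


def iterate(x0, max_steps=100):
    """Iterate the update rule until the state stops changing."""
    trajectory = [x0]
    for _ in range(max_steps):
        nxt = step(trajectory[-1])
        trajectory.append(nxt)
        if nxt == trajectory[-2]:
            break
    return trajectory


path = iterate(10)  # [10, 6, 4, 3, 2, 2]
```

Because the projection discretizes the state space, different starting points can settle in different stable classes: starting from 0 the same rule stops at 1 rather than 2.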

[AI-247] MEMOR-E: In-Context and Fine-Tuned LLM Personalization for Alzheimers Assistive Robotics

Link: https://arxiv.org/abs/2605.23941
Authors: Maissa Abir Smaili,Eren Sadikoglu,Ransalu Senanayake
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 8 pages 14 figures

Click to view abstract

Abstract:Alzheimer’s disease is a neurodegenerative disorder marked by progressive declines in memory and language that reduce independence in daily life, motivating socially assistive robotic support. This paper presents MEMOR-E, a mobile quadruped robot with an interactive tablet interface that assists patients and caregivers through medication reminders, routine guidance, memory oriented interactions, and companionship. We evaluated the feasibility of fine tuning large language models (LLMs) to emulate stage consistent cognitive behavior and interpret responses across standard neuropsychological language tasks, using audio transcriptions from 235 Alzheimer’s patients and synthetically generated healthy controls. We also report findings on using in context learning (ICL) in LLMs, where a second LLM produced domain and severity level cognitive error summaries. Our results show that MEMOR-E can generate stage aware, non diagnostic cognitive summaries that support personalized assistive interactions, while explainable AI mechanisms translate model outputs into transparent, human readable evidence to enable caregiver oversight and trustworthy human robot interaction.

[AI-248] DRIVE: Modeling Skills at the Reasoning and Interaction Levels for Web Agents under Continual Learning

Link: https://arxiv.org/abs/2605.23939
Authors: Xirui Liu,Sihang Zhou,Yanning Hou,Rong Zhou,Haoyuan Chen,Maolin He,Siwei Wang,Hao Chen,Jian Huang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 35 pages, 5 figures

Click to view abstract

Abstract:Web agents require both high-level reasoning (for task decomposition) and low-level interactions (for page elements manipulation) to conduct different tasks. However, these knowledge types differ fundamentally: reasoning knowledge (e.g., booking a flight requires first searching for routes) is abstract and transferable across websites, while interaction knowledge (e.g., clicking the Search button at a specific coordinate on Site A) depends heavily on page-specific contexts. Existing methods store experiences uniformly. This creates a dilemma: abstract representations lose executability on concrete pages, while concrete representations fail to generalize across domains. This entanglement limits capability accumulation: on new websites, agents either fail to recognize reusable task logic due to surface-level differences or attempt infeasible actions from outdated page structures. To disentangle them, we propose DRIVE, a dual-level skill modeling framework separating historical experience into natural language reasoning skills, which capture transferable task logic, and programmatic interaction skills, grounding abstract actions to executable operations. A scene-aware coordination mechanism adaptively retrieves and invokes these dual-level skills based on task semantics. DRIVE also uses skill-level reflection to identify hierarchy-specific failure modes, enabling targeted skill library expansion and refinement. Experiments across five WebArena domains show DRIVE attains an average task success rate of 52.8%, exceeding the skill-free baseline by 7.3 percentage points. Further ablations show reasoning and interaction skills provide distinct, complementary benefits, supporting separation of transferable task logic from executable page-level operations.

[AI-249] Authority Inversion in LLM-Mediated Ubiquitous Systems: When Models Trust Users Over Sensors

Link: https://arxiv.org/abs/2605.23938
Authors: Long Zhang,Zi-bo Qin,Wei-neng Chen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) increasingly fuse heterogeneous inputs in ubiquitous systems. Yet, how LLMs implicitly allocate authority when sensor measurements and user claims conflict remains unexamined, raising critical reliability concerns for deployments where physical sensing must retain priority. Unlike explicit traditional fusion, LLMs bury authority allocation within learned representations. We discover this allocation is severely format-dependent: numerical sensor data fails to integrate into answer-relevant model directions, allowing natural-language claims to dominate the final decision, a phenomenon we term Authority Inversion. To diagnose and mitigate this, we develop a geometric framework of context integration, introduce two computable audit metrics, specifically the Context Integration Ratio (CIR) and Authority Alignment Index (AAI), and propose Geometric Authority Calibration (GAC), an inference-time layer-level intervention to suppress misplaced user authority. Evaluating four models (4B to 35B parameters, three architectures) across four datasets totaling 576 conflict instances reveals extreme inversion: on numerical tasks, models exhibit near-zero sensor trust (AAI = -0.805, Cohen’s d = -2.14), unaffected by model capacity. Validating our geometric framework, theory-guided causal injection flips 80.2% of incorrect decisions (vs. 0.4% for random controls). Practically, GAC improves HAR accuracy from 0 – 1.6% to 21.9 – 27.5%, outperforming prompting baselines. Ultimately, authority allocation in LLM-mediated systems must be explicitly audited and application-specifically configured rather than left implicit.

[AI-250] BoxLitE: A Faithful Knowledge Base Embedding Based on Convex Optimization KR KR2026

Link: https://arxiv.org/abs/2605.23937
Authors: Bruno F. Lourenço,Hesham Morgan,Ana Ozaki,Aleksandar Pavlović,Emanuel Sallinger
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Optimization and Control (math.OC)
Comments: 28 pages. Full version of paper accepted to KR 2026 (23rd International Conference on Principles of Knowledge Representation and Reasoning). Track: KR meets Machine Learning and Explanation

Click to view abstract

Abstract:Knowledge base (KB) embeddings aim at combining the capability of classical knowledge graph embeddings to generalize the information present in facts, the ABox, with conceptual knowledge represented in an ontology language, the TBox. Several authors have recently explored the idea of mapping concepts to convex regions in a vector space. This is useful to represent hierarchies, typically present in TBoxes, since more general concepts can be mapped to larger regions, containing those regions associated with more specific concepts. However, the power of convexity is rarely leveraged during the actual learning tasks. Here, we introduce BoxLitE, a KB embedding model for DL-Lite$^{\mathcal{H}}$ that allows for convex optimization. We show that for any satisfiable DL-Lite$^{\mathcal{H}}$ KB, there is a BoxLitE embedding that is a weakly faithful model. As a proof of concept, we show how to formulate the KB embedding task as a convex optimization problem and how to obtain embeddings with such desirable faithfulness properties.

[AI-251] Fuzzy Neutrosophic and Uncertain Graph Theory: Properties and Applications

Link: https://arxiv.org/abs/2605.23936
Authors: Takaaki Fujita,Florentin Smarandache
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 326 pages. Publisher: Neutrosophic Science International Association (NSIA) Publishing House. ISBN: 978-197250204-4

Click to view abstract

Abstract:This book presents a comprehensive and systematic survey of graph theory under uncertainty, with particular emphasis on the unifying role of the uncertain graph framework. It reviews fundamental concepts, structural properties, graph classes, and graph parameters within fuzzy, neutrosophic, and related models, while also introducing a wide range of extensions such as uncertain digraphs, hypergraphs, superhypergraphs, and dynamic graphs. In addition to theoretical developments, the book explores practical applications, including uncertain molecular graphs, decision-making systems, graph neural networks, knowledge graphs, and cognitive maps. By organizing diverse uncertainty-aware graph models within a common perspective, this work provides a coherent framework for understanding their relationships, capabilities, and applications in complex systems.

[AI-252] Practical Quantum CIM Empowerment via All-Domestic-Core Agentic Large Model

Link: https://arxiv.org/abs/2605.23934
Authors: Wang Rui,Lu Diannan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
Comments: 21 pages 7 figures

Click to view abstract

Abstract:Quantum computing devices are recognized as powerful tools for solving NP-complete problems. However, the intricacy of their modeling presents notable barriers for non-specialists, while the tedious iteration of constraint weights and modeling methodologies also consumes substantial effort on the part of experts. To address these challenges, this study integrates a femtosecond laser-pumped Coherent Ising Machine (CIM) with an LLM-driven agentic system by leveraging the LangGraph and LangChain frameworks. Comprehensive investigations demonstrate that large language models (LLMs) can effectively perform such tasks in modeling as QUBO/Ising model calibration, constraint weight decision iteration and rapid validation of literature-reported schemes. Notably, all these tasks can be fully implemented based on domestic large models, combined with domestically developed CIM hardware, we truly achieve the practical empowerment of quantum CIM that fully relies on all-domestic agentic large models and hardware. This work successfully realizes robust technological integration, laying a solid foundation for subsequent research. Nevertheless, it also identifies the persisting challenges in the two cutting-edge fields of large models and quantum computing at the current stage. Encouragingly, we unexpectedly discover a promising new paradigm where accumulated knowledge from agent-assisted quantum computing iterations reciprocally enhances the agent’s own problem-solving capability, thereby addressing these challenges.

[AI-253] KT4EQG: Personalized Exercise Question Generation via Knowledge Tracing

Link: https://arxiv.org/abs/2605.23933
Authors: Xinyi Gao, Qiucheng Wu, Lu Ding, Q. Vera Liao, Kaizhi Qian, Ying Xu, Shiyu Chang, Yang Zhang
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Notes:

Abstract:Educational Question Generation (EQG) aims to synthesize customized exercise questions that enhance student learning. An effective EQG system should ideally personalize questions for each student by modeling the student’s knowledge state and generating questions that provide the greatest learning benefit. However, few existing EQG approaches are able to achieve such fine-grained personalization. In this paper, we explore how EQG can benefit from knowledge tracing (KT), which models students’ knowledge states based on historical performance and predicts future performance. We propose KT4EQG, a personalized EQG framework that generates effective questions for individual students under the guidance of a KT model. Specifically, KT4EQG seeks to maximize a student’s potential improvement in overall knowledge mastery by leveraging the KT model to select the most suitable knowledge concept for the student to practice. An LLM-based question generator is then trained to produce a question faithfully grounded in the selected concept. Experimental results on XES3G5M and MOOCRadar show that KT4EQG consistently generates more effective questions than methods with limited or no personalization.

[AI-254] BODHI: Precise OS Kernel Specification Inference

Link: https://arxiv.org/abs/2605.23931
Authors: Zhiming Chang, Ziyang Li
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Programming Languages (cs.PL); Software Engineering (cs.SE)
Notes:

Abstract:The formal verification of operating system kernels requires precise specifications that capture the intended behavior of system calls. Writing these specifications manually demands deep domain expertise, motivating the use of large language models (LLMs) to automate the process. However, in OSV-Bench, a benchmark of 245 specification generation tasks derived from the Hyperkernel OS kernel, the best reported Pass@1 is 55.10%. We propose a domain knowledge prompting method (BODHI), which augments the standard few-shot prompt with a structured C-to-Python translation guide covering 15 categories of domain-specific translation patterns. Inspired by Structured Chain-of-Thought (SCoT) prompting, the guide organizes translation by separation of concerns, addressing pre-condition extraction and post-condition generation as distinct categories. Evaluated on nine models from six providers (Anthropic, Mistral, Amazon, DeepSeek, Meta, Alibaba), covering dense, mixture-of-experts and reasoning architectures, BODHI improves every model tested, with gains ranging from +11% to +32%. The best configuration (Claude Opus 4.6 + BODHI) reaches 96.73% Pass@1. BODHI reduces both syntax and semantic errors, with the strongest effect on models that have sufficient instruction-following capability to utilize structured reference material. These results demonstrate that domain knowledge injection is a model-agnostic technique that substantially bridges the gap between general-purpose code generation and formal specification synthesis.

[AI-255] Toward Reliable Design of LLM-Enabled Agentic Workflows: Optimizing Latency-Reliability-Cost Tradeoffs

Link: https://arxiv.org/abs/2605.23929
Authors: Ya-Ting Yang, Quanyan Zhu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Notes:

Abstract:Modern AI systems increasingly rely on workflows composed of multiple interacting agents, some powered by large language models (LLMs) and others by conventional computational modules. This paper analyzes the fundamental tradeoffs between latency, reliability, and cost in LLM-enabled agentic workflows. We introduce performance models for both LLM and non-LLM agents that capture the relationship between computational effort and output quality, incorporating the impact of reasoning and output tokens for LLM agents using a parametric exponential reliability function. Then, we study the design of sequential workflows under latency and cost constraints. Main results include a water-filling token allocation policy and characterizations of optimal workflow reliability in terms of shadow prices.
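As a rough illustration of how a water-filling allocation with a shadow price can arise, the sketch below makes assumptions that the abstract does not state: each LLM agent `i` has reliability `1 - exp(-a_i * t_i)` in its token count `t_i`, stage reliabilities multiply along the sequential workflow, and a single token budget stands in for the latency and cost constraints. The function name `allocate_tokens`, the rates, and the budget are hypothetical; the paper's actual model and policy may differ.

```python
import math

def allocate_tokens(rates, budget, iters=200):
    """Maximize prod_i (1 - exp(-a_i * t_i)) subject to sum_i t_i = budget.

    Setting the derivative of sum_i log(1 - exp(-a_i * t_i)) equal to a common
    multiplier lam (the shadow price of one token) gives the closed form
        t_i = log(1 + a_i / lam) / a_i.
    The total allocation decreases as lam grows, so lam is found by bisection.
    """
    def tokens(lam):
        return [math.log1p(a / lam) / a for a in rates]

    lo, hi = 1e-12, 1e12  # assumed bracket for the shadow price
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # bisect on a logarithmic scale
        if sum(tokens(mid)) > budget:
            lo = mid  # allocation exceeds the budget: raise the price
        else:
            hi = mid
    lam = math.sqrt(lo * hi)
    return tokens(lam), lam

if __name__ == "__main__":
    alloc, price = allocate_tokens([0.5, 1.0, 2.0], budget=6.0)
    print(alloc, price)
```

Under these assumptions, agents whose reliability saturates slowly (small `a_i`) receive more tokens, and every agent ends up with the same marginal gain in log-reliability per token, which is the shadow price `lam`.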

[AI-256] How Much Thinking is Enough? Quantifying and Understanding Redundancy in LLM Reasoning

Link: https://arxiv.org/abs/2605.23926
Authors: Zhiyuan Zhai, Xinkai You, Wenjing Yan, Xin Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

Abstract:Reasoning-capable large language models solve hard problems by emitting long chains of thought, paying heavily in latency, GPU time, and energy. Casual inspection of their traces reveals extensive reformulation, verification, and circular self-reflection, yet how much of this deliberation is actually necessary has never been measured at scale or explained from first principles. This paper closes both gaps. We formalise reasoning redundancy directly in terms of the reasoning model itself: the redundancy of a correct trace is the largest fraction of its trailing segmented steps that can be truncated while $\pi$, forced to terminate thinking and emit a final answer, still produces the correct answer. A large-scale quantification across four frontier reasoning models and two mathematical benchmarks shows that step-level redundancy is consistently high – between 61% and 93% across the 8 (model, benchmark) conditions we study, with the median critical prefix equal to a single segmented step in six of the eight conditions – that the finding is robust to the choice of judge family, and that although $\rho$ decreases with problem difficulty on MATH-500, all four models remain substantially redundant ($\rho \in [46\%, 85\%]$) even on the hardest Level-5 problems. We then prove that this redundancy is a structural consequence of length-agnostic outcome rewards, not a model-specific artefact: under any such reward, no finite expected stopping time is optimal. The result holds regardless of RL algorithm, base model, data distribution, or whether the policy is obtained via RL or distillation; over-thinking is therefore not a bug to be patched in individual models but a structural property of how current reasoning models are trained.
Code: this https URL
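The redundancy definition in the abstract can be sketched as a search for the shortest prefix of a correct trace that still yields the right answer. This is only an illustration: `answers_correctly` is a hypothetical callback standing in for "truncate the trace, force the model to answer, and judge the answer", and the choice to start at a one-step prefix (`min_prefix=1`) is an assumption rather than something the abstract specifies.

```python
from typing import Callable, Sequence

def redundancy(steps: Sequence[str],
               answers_correctly: Callable[[Sequence[str]], bool],
               min_prefix: int = 1) -> float:
    """Largest fraction of trailing steps that can be cut from a correct trace.

    Scans prefixes from shortest to longest and stops at the first one for
    which the forced answer is still correct (the critical prefix).
    """
    n = len(steps)
    if n == 0:
        raise ValueError("trace has no steps")
    for k in range(min_prefix, n + 1):
        if answers_correctly(steps[:k]):
            return (n - k) / n
    raise ValueError("the full trace does not yield a correct answer")

if __name__ == "__main__":
    # Toy trace and toy judge, invented for illustration only.
    trace = ["restate problem", "derive x = 4", "verify", "re-verify", "reflect"]
    judge = lambda prefix: any("x = 4" in step for step in prefix)
    print(redundancy(trace, judge))
```

In this toy trace the answer is already determined after the second of five steps, so three fifths of the trace count as redundant.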

[AI-257] High-Risk AI Systems and the Problem of Identity in the European AI Act

Link: https://arxiv.org/abs/2605.23922
Authors: Andrea Ferrario
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Accepted as a non-archival paper at The 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26), June 25-28, 2026, Montreal, QC, Canada

Abstract:The EU Artificial Intelligence Act (AIA) establishes a lifecycle governance regime for high-risk AI systems built around ex-ante conformity assessment, post-market monitoring, and re-assessment upon “substantial modification.” These obligations presuppose AI identity judgments: regulators and providers must decide when an updated system remains the same system over time. In this work, we show how this logic is clarified by the function+ framework of artifact identity, which individuates AI systems by their intended function together with context-sensitive criteria of appropriate functioning, captured as “AI trustworthiness.” We further argue that the AIA does not provide an internal, auditable criterion for synchronic identity–when two AI systems at a given time should count as the same for regulatory purposes–and instead largely defers such sameness determinations to sectoral or harmonization instruments. function+ supplies a synchronic identity test anchored in intended function and trustworthiness profiles and levels, making synchronic identity decisions inspectable in governance settings such as procurement, liability, and market surveillance. Our contribution is a conceptual and auditing lens: we provide a correspondence map between AIA lifecycle obligations and function+ identity components, and we make the synchronic case operationally legible via a minimal decision flow for audit and dispute contexts. We conclude with two implementation-facing recommendations: (1) more precise, testable reporting of intended purpose, and (2) standardized, auditable trustworthiness reporting that supports comparability over time and across deployments.

[AI-258] Authority Signals in Claude AI Health Citations: A Descriptive Analysis Using the Authority Signals Framework

Link: https://arxiv.org/abs/2605.23921
Authors: Erin T. Jacques(1), Erela Datuowei(2), Elizabeth Quaye(3), Corey H. Basch(4), Arijit Chatterjee(1), Juanita Davis(1) ((1) York College, CUNY, (2) Teachers College, Columbia University, (3) York College, CUNY, (4) William Paterson University)
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Notes: 10 pages, 2 figures, 2 tables

Abstract:This study seeks to determine the authority signals used by Anthropic’s Claude AI in its presentation of sources when answering consumer health questions. While there exists a great deal of discourse around the quality of health citations that LLMs produce, there is limited information on the integrity of the sources the citations originate from, and on the extent to which those sources are ones that health professionals would consider credible. This descriptive cross-sectional study used data from HealthSearchQA, which contains 3,172 consumer health questions curated by Google Research. After exclusions, a final dataset of 3,075 questions yielding 10,038 citations was analyzed. The Authority Signals Framework (Jacques et al., 2026) was applied to examine 10 authority signals across four domains for a disproportionate stratified sample of 542 sources. Established institutional sources accounted for 97.8% of all citations (n = 9,818). Medical Institutions were the most frequently cited organization type (36.5%), followed by Government Resources (31.6%) and Professional Associations (28.4%). Commercial Health Information comprised 2.2% (n = 220). The top 10 organizations accounted for 57.8% of all citations, with Mayo Clinic alone representing 24.7%. Among commercial sources in the focused sample, 86.4% displayed medical review statements, 82.5% used schema markup, and 71.8% had comprehensive content, while traditional institutional sources appeared in Claude’s citations with or without these same markers. As Anthropic positions Claude for HIPAA-ready healthcare applications, these findings establish a baseline for Claude’s citation behavior and demonstrate the utility of the Authority Signals Framework as a tool for ongoing, cross-platform evaluation of AI-mediated health information.

[AI-259] Artificial Effort

Link: https://arxiv.org/abs/2605.23920
Authors: Federico Belotti, Stefano Coniglio, Antonio Cosma, Francesco Fallucchi
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Notes:

Abstract:Real-effort tasks, in which participants perform cognitively costly activities whose outcomes depend on actual performance, are widely used in experimental economics. Their validity, however, rests on the assumption that a human performs them. We study whether this assumption still holds in the era of Artificial Intelligence (AI) and Large Language Models (LLMs). Using 8 canonical real-effort tasks and 23 LLMs from three major providers, we show that most tasks can now be solved accurately and at negligible cost, while only a few resist automation. Performance improves with each model generation, and mid-tier models are rapidly closing the gap with frontier ones, broadening the set of widely accessible models that can automate these tasks. Additionally, we show that verbally offering monetary incentives has no effect on LLM performance. Our findings establish a boundary condition for the use of real-effort tasks in unsupervised settings: when participants can cheaply outsource task completion to an LLM, observed performance may no longer reflect genuine human effort.

[AI-260] Confidence Calibration in Large Language Models

Link: https://arxiv.org/abs/2605.23909
Authors: Noam Michael, Daniel BenShushan, Jacob Bien, Don A. Moore
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

Abstract:We investigate the calibration of large language models’ (LLMs’) confidence across diverse tasks. The results of our preregistered study show that the current crop of LLMs are, like people, too sure they are right: confidence exceeds accuracy, on average. Importantly, however, this tendency is moderated by a powerful hard-easy effect, wherein overconfidence is greatest on difficult tests; by contrast, easy tests actually show substantial underconfidence. We develop LifeEval, a test for evaluating model calibration across levels of difficulty.
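The quantity described in the abstract, confidence minus accuracy, can be sketched in a few lines. The records below are invented purely to show the sign convention (positive means overconfident, negative means underconfident); they are not results from the paper, and the function name `overconfidence` is hypothetical.

```python
from statistics import mean

def overconfidence(records):
    """Mean stated confidence minus accuracy for (confidence, correct) pairs.

    Confidence is a probability in [0, 1]; correct is a bool.
    Positive values indicate overconfidence, negative values underconfidence.
    """
    confidences = [conf for conf, _ in records]
    outcomes = [1.0 if correct else 0.0 for _, correct in records]
    return mean(confidences) - mean(outcomes)

if __name__ == "__main__":
    # Made-up toy data: the same stated confidences on a hard and an easy test.
    hard = [(0.8, False), (0.7, True), (0.9, False), (0.6, False)]
    easy = [(0.8, True), (0.7, True), (0.9, True), (0.6, True)]
    print(overconfidence(hard), overconfidence(easy))
```

Computing the gap separately per difficulty level, as in the toy usage, is what makes a hard-easy effect visible: the same stated confidence reads as overconfident on the hard set and underconfident on the easy one.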

[AI-261] LETS Forecast: Learning Embedology for Time Series Forecasting ICML

Link: https://arxiv.org/abs/2506.06454
Authors: Abrar Majeedi, Viswanatha Reddy Gajjala, Satya Sai Srinath Namburi GNVV, Nada Magdi Elkordi, Yin Li
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Notes: Accepted at International Conference on Machine Learning (ICML) 2025

Abstract:Real-world time series are often governed by complex nonlinear dynamics. Understanding these underlying dynamics is crucial for precise future prediction. While deep learning has achieved major success in time series forecasting, many existing approaches do not explicitly model the dynamics. To bridge this gap, we introduce DeepEDM, a framework that integrates nonlinear dynamical systems modeling with deep neural networks. Inspired by empirical dynamic modeling (EDM) and rooted in Takens’ theorem, DeepEDM presents a novel deep model that learns a latent space from time-delayed embeddings, and employs kernel regression to approximate the underlying dynamics, while leveraging efficient implementation of softmax attention and allowing for accurate prediction of future time steps. To evaluate our method, we conduct comprehensive experiments on synthetic data of nonlinear dynamical systems as well as real-world time series across domains. Our results show that DeepEDM is robust to input noise, and outperforms state-of-the-art methods in forecasting accuracy. Our code is available at: this https URL.

[AI-262] A Token/KV-Cache Communication Media Selection and Resource Allocation Strategy for Multi-Agent Collaboration

Link: https://arxiv.org/abs/2605.25422
Authors: Lipeng Dai, Luping Xiang, Kun Yang
Affiliation: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Notes:

Abstract:The convergence of large language models (LLMs) with 6G networks is fostering a paradigm of autonomous multi-agent cooperation, which in turn is expected to substantially increase east-west traffic. Although latent-space interaction mechanisms can enable more efficient collaboration than symbolic natural-language (NL) exchanges, prior work often abstracts away the associated communication overhead under practical wireless constraints. In embodied multi-agent settings, heterogeneous interaction media incur disparate inference and transmission costs, thereby inducing an inherent end-to-end (E2E) latency trade-off. To address this, we propose a joint design that integrates communication-media selection with wireless resource allocation. Through analytical characterization and simulation-based evaluation, we show that neither token-based transmission nor key-value (KV) cache-based transmission is uniformly optimal across operating regimes, as performance depends critically on system parameters such as available computational resources and channel conditions. Accordingly, we formulate a joint optimization problem aimed at minimizing the E2E latency of multi-agent collaboration and develop a low-complexity joint media selection and resource allocation (JMSRA) algorithm. Numerical results further confirm that, by adaptively coordinating the interaction media and bandwidth allocation over heterogeneous links, the proposed scheme achieves markedly reduced E2E latency relative to conventional NL-only and KV-cache-only baselines, enabling efficient and robust multi-agent collaboration in future wireless networks.
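To make the token versus KV-cache trade-off concrete, here is a deliberately simplified latency model. It is an assumption for illustration, not the paper's formulation: sending tokens is cheap on the link but the receiver must recompute (prefill) over them, while sending the KV cache costs far more bits but skips that compute. All names and numbers are hypothetical.

```python
def e2e_latency(payload_bits, rate_bps, compute_flops, flops_per_s):
    """Transmission time plus receiver-side compute time, in seconds."""
    return payload_bits / rate_bps + compute_flops / flops_per_s

def pick_medium(token_bits, kv_bits, prefill_flops, rate_bps, flops_per_s):
    """Return the medium ("token" or "kv") with the lower modelled latency."""
    # Assumed: the token path pays the prefill compute at the receiver.
    token_latency = e2e_latency(token_bits, rate_bps, prefill_flops, flops_per_s)
    # Assumed: the KV-cache path needs no recomputation at the receiver.
    kv_latency = e2e_latency(kv_bits, rate_bps, 0.0, flops_per_s)
    if token_latency <= kv_latency:
        return "token", token_latency
    return "kv", kv_latency

if __name__ == "__main__":
    # Fast link: shipping the large KV cache beats recomputing.
    print(pick_medium(1e4, 1e8, 1e12, rate_bps=1e9, flops_per_s=1e12))
    # Slow link: the small token payload wins despite the prefill cost.
    print(pick_medium(1e4, 1e8, 1e12, rate_bps=1e6, flops_per_s=1e12))
```

Even this toy model reproduces the qualitative point that neither medium is uniformly better: the preferred choice flips with the link rate and the available compute.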

[AI-263] Positivity in classical enumerative geometry: a case study in synchronized AI-assisted mathematics

Link: https://arxiv.org/abs/2605.25271
Authors: Gergely Bérczi, László M. Fehér
Affiliation: Unknown
Categories: Algebraic Geometry (math.AG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Notes: 29 pages

Abstract:We study the symmetric polynomial $\prod_{\alpha\in A_{n,d}}\bigl(1+\alpha_1 x_1+\cdots+\alpha_n x_n\bigr)$, where $A_{n,d}:=\{\alpha\in\mathbb{Z}_{\ge 0}^n : |\alpha|=d\}$, which is the total Chern class of $\mathrm{Sym}^d(\mathbb{C}^n)$, viewed as a torus representation whose Chern roots are the weights $\alpha_1 x_1+\cdots+\alpha_n x_n$ for $\alpha\in A_{n,d}$. Its homogeneous degree-$k$ part $c_k(n,d)$ is the $k$-th Chern class of $\mathrm{Sym}^d(\mathbb{C}^n)$. These Chern classes, together with their coefficients in various symmetric function bases, play a central role in enumerative geometry. Despite their simple definition, general closed formulas for their coefficients are subtle, and many structural properties of these classes have remained poorly understood. In this paper we prove several conjectures concerning their structure, establish explicit formulas, and study log-concavity properties for both the Chern classes and their $K$-theoretic analogue. In rank two, passing to the Schur basis and expanding the Schur coefficients in the binomial basis of $d$, we uncover a new binomial log-concavity phenomenon and prove refined positivity results. The paper demonstrates a novel methodology: we combine several AI systems with human mathematical insight in a coordinated workflow, deploying each tool according to its strengths in experimental discovery, conjecture formation, symbolic proof construction, and verification. To our knowledge, this is one of the first detailed case studies of orchestrating multiple AI tools to make substantial progress on a coherent mathematical research project.
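As a small worked instance of the definition in the abstract, computed by hand here rather than taken from the paper, consider rank two and degree two ($n = 2$, $d = 2$). Here $e_1, e_2$ denote the elementary symmetric polynomials and $s_\lambda$ the Schur polynomials in $x_1, x_2$.

```latex
\begin{aligned}
A_{2,2} &= \{(2,0),\,(1,1),\,(0,2)\},\\
\prod_{\alpha\in A_{2,2}}\bigl(1+\alpha_1 x_1+\alpha_2 x_2\bigr)
  &= (1+2x_1)(1+x_1+x_2)(1+2x_2),\\
c_1(2,2) &= 3(x_1+x_2) = 3e_1 = 3s_{(1)},\\
c_2(2,2) &= 2x_1^2+8x_1x_2+2x_2^2 = 2e_1^2+4e_2 = 2s_{(2)}+6s_{(1,1)},\\
c_3(2,2) &= 4x_1x_2(x_1+x_2) = 4e_1e_2 = 4s_{(2,1)}.
\end{aligned}
```

In this smallest nontrivial case every Schur coefficient is a positive integer, which is the kind of positivity the abstract refers to.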

[AI-264] Explainable Multi-Task Retinal Imaging Reveals Microvascular Signals for Systemic Risk Stratification in Type 2 Diabetes: A Pilot Study

Link: https://arxiv.org/abs/2605.24913
Authors: Mini Han Wang, Liting Huang, Wei Hong, Boonthawan Wingwon
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Notes: 18 pages, 4 figures

Abstract:Retinal imaging provides a non-invasive window into systemic microvascular health and has emerged as a potential biomarker for systemic diseases. However, whether retinal features encode biologically meaningful systemic signals that can be reliably interpreted using explainable artificial intelligence (XAI) remains unclear. An explainable multi-task deep learning framework was developed to investigate associations between retinal microvascular features and systemic abnormalities in Type 2 Diabetes Mellitus. A total of 11,011 fundus images from 2,719 individuals were analysed using a shared neural network with task-specific heads for glycaemic status, kidney abnormality, and multi-system involvement. Model interpretability was evaluated using Gradient-weighted Class Activation Mapping (Grad-CAM), anatomical masking, and vessel alignment analysis. The framework demonstrated task-dependent predictive performance, with the best discrimination observed for kidney abnormality (AUC up to 0.63), whereas glycaemic status prediction showed limited performance (AUC = 0.49-0.61). Explainability analyses consistently localized model attention to retinal vessels and peripapillary regions. Masking experiments showed that occlusion of vascular regions caused the greatest performance decline, indicating that retinal vessels were the primary predictive source. Different architectures exhibited heterogeneous attention patterns, suggesting multiple representational pathways for systemic signal encoding. This pilot study demonstrates that retinal microvascular features contain measurable signals associated with systemic abnormalities, particularly microvascular damage. By integrating multi-task learning with quantitative XAI validation, this framework advances retinal imaging toward interpretable digital biomarkers for systemic risk stratification in diabetes.

[AI-265] Distributionally Robust Transfer Learning with Structurally Missing Covariates with Application to Cross-National Cardiac Arrest Prediction

Link: https://arxiv.org/abs/2605.24212
Authors: Siqi Li, Chuan Hong, Ziye Tian, Benjamin Sieu-Hon Leong, Koshi Nakagawa, Hideharu Tanaka, Sang Do Shin, Khuong Quoc Dai, Do Ngoc Son, Marcus Eng Hock Ong, Nan Liu, Molei Liu
Affiliation: Unknown
Categories: Applications (stat.AP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Notes:

Abstract:Deploying clinical prediction models across healthcare systems often fails when key training covariates are unavailable at deployment and labeled outcomes are limited in the target domain. For example, high-performing models for out-of-hospital cardiac arrest (OHCA) rely on detailed prehospital measurements routinely collected in high-resource settings but unavailable in many international registries. Existing methods either discard missing covariates, sacrificing predictive information, or rely on untestable assumptions about their target distribution. We propose DRUM (Distributionally Robust Unsupervised transfer learning with structurally Missing covariates), a framework that transfers prediction models to target populations where certain covariates are structurally absent and outcome labels are unavailable. DRUM partitions covariates into shared components ($X$), observed across all settings, and missing components ($A$), observed only in the source. Rather than imputing missing covariates, DRUM optimizes worst-case predictive performance over the unknown target distribution of $A \mid X$ using a neural network generator, with a robustness parameter controlling allowable deviation from the source conditional. We further develop a bias correction procedure that reduces sensitivity to nuisance estimation error. Simulations show substantial improvements in both mean and worst-case prediction error under distribution shift. Applied to cross-national OHCA prediction, transferring models from a US registry to multiple Asian registries where prehospital variables are unrecorded, DRUM yields better-calibrated predictions and improved clinical classification performance across sites.

[AI-266] WTKO-CNN: Deep Learning Reveals Sequence Motifs Distinguishing Wild-Type and Knockout ATAC-seq Peaks

Link: https://arxiv.org/abs/2605.24034
Authors: Lopamudra Dey
Affiliation: Unknown
Categories: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
Notes:

Abstract:Chromatin regulators can alter transcriptional programs by modifying the accessibility of regulatory DNA elements. Understanding how regulatory sequences differ between wild-type (WT) and knockout (KO) conditions is crucial for deciphering transcriptional control. Here, we applied a convolutional neural network with an attention mechanism, WTKO-CNN, to classify DNA sequences as WT or KO, achieving high predictive performance. To interpret the model, we generated saliency maps to identify nucleotide positions most influential for the classification decision. From these high-saliency regions, we extracted and clustered k-mers, enabling de novo motif discovery. Sequence logos and consensus motifs derived from the CNN filters revealed biologically meaningful patterns, which were further validated using MEME, TOMTOM, and HOMER against known transcription factor binding sites. Our analysis identified motifs associated with transcription factor families that discriminate WT from KO sequences, demonstrating that CNN-guided saliency mapping is a powerful approach for uncovering functional sequence features.

[AI-267] Harnessing AtomisticSkills for Agentic Atomistic Research

Link: https://arxiv.org/abs/2605.24002
Authors: Bowen Deng, Bohan Li, Matthew Cox, Hoje Chun, Juno Nam, Artur Lyssenko, Sathya Edamadaka, Jurgis Ruza, Xiaochen Du, Nofit Segal, Jesus Diaz Sanchez, Mingrou Xie, Ty Perez, Yu Yao, Miguel Steiner, Sauradeep Majumdar, Charles B. Musgrave III, Anirban Chandra, Abhirup Patra, Detlef Hohl, Connor W. Coley, Ju Li, Rafael Gómez-Bombarelli
Affiliation: Unknown
Categories: Chemical Physics (physics.chem-ph); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
Notes:

Abstract:Computational materials science and chemistry span vast knowledge domains and fractured software ecosystems. Although large language models (LLMs) have demonstrated research capabilities, scaling monolithic agents to manage the rigor and complexity of atomistic research remains a challenge. Here, we introduce AtomisticSkills, an open-source harness framework that empowers general-purpose AI coding agents to conduct atomistic research across materials science, chemistry, and drug discovery. By hierarchically decomposing scientific workflows into agent skills and tools, AtomisticSkills provides agents with modular, extensible, and plug-and-play research capabilities. The framework integrates more than 100 human-curated multidisciplinary skills, including database access, thermodynamics and kinetics modeling, and diverse simulation engines employing machine learning interatomic potentials (MLIPs) and density functional theory (DFT). We validate its functional coverage against scientific literature and demonstrate robust orchestration capabilities across diverse scientific campaigns: generative design of Li-ion solid-state electrolytes, high-throughput screening of metal-organic frameworks for CO2 capture, autonomous MLIP benchmarking and fine-tuning, multi-stage structure-based virtual screening for drug design, multimodal X-ray diffraction pattern analysis, and screening of Fe-oxide catalysts for oxygen evolution reaction. AtomisticSkills provides a critical agent infrastructure towards building fully autonomous AI scientists.

[AI-268] Sensing Intelligence as a Trainable Metamaterial Property

Link: https://arxiv.org/abs/2605.23967
Authors: Kyungmi Na, Yifei Li, Xinyi Yang, Bolei Deng
Affiliation: Unknown
Categories: Neurons and Cognition (q-bio.NC); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Applied Physics (physics.app-ph)
Notes:

Abstract:In biological systems, sensing is not performed by the brain alone: the body deforms, vibrates, and filters external stimuli before they are transduced into neural signals. In engineered systems, this processing burden is placed largely on electronics and computation, while the mechanical body is usually designed only for strength and stability. Here, we present sensing intelligence as a trainable property of the body. We show that the geometry of a metamaterial can be optimized to reshape external stimuli into internal signals that are easier for a neural network to interpret. Rather than hand-designing this physical preprocessing, we let the neural network train its own body for sensing by backpropagating the sensing loss to the body’s design parameters through differentiable simulation. Across numerical and experimental sensing scenarios, the optimized body improves sensing accuracy by up to fivefold or reduces the number of required electronic sensors by nearly an order of magnitude.

[AI-269] Multimodal Alignment and Preference Optimization for Zero-Shot Conditional RNA Generation

Link: https://arxiv.org/abs/2605.23961
Authors: Roman Klypa, Alberto Bietti, Sergei Grudinin
Affiliation: Unknown
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

Abstract:The design of RNA molecules that interact with specific proteins is a critical challenge in experimental and computational biology. Despite recent progress in natural language modeling and deep learning-based protein design, there remains significant room to improve the frequency of successful interactions and the authenticity of generated sequences for functional applications. In this work, we frame conditional RNA sequence generation as a multi-stage alignment problem, introducing Moirain: a suite of models optimized via multimodal supervised fine-tuning (SFT) and Direct Preference Optimization (DPO). Our approach begins with large-scale pretraining on diverse RNA corpora to capture the fundamental grammars of sequence plausibility. To achieve target-specific generation, we employ a multimodal SFT architecture that conditions RNA synthesis on protein structural and sequential features. Finally, we leverage DPO to refine the model using synthetic interaction data: taking advantage of DPO’s unique ability to navigate non-aligned preference spaces, we improve functional fitness without collapsing the learned natural distribution. Extensive evaluation of the Moirain series (Moirain-Base, -Multi, and -DPO) demonstrates that our framework consistently produces novel, diverse, and biologically plausible RNA sequences with superior binding affinities compared to existing baselines.

[AI-270] AI-Driven Alpha Decay: Algorithmic Homogenization Reflexive Signal Erosion and the Paradox of Intelligent Markets

Link: https://arxiv.org/abs/2605.23905
Authors: Shuchen Meng, Xupeng Chen
Affiliation: Unknown
Categories: General Finance (q-fin.GN); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Notes:

Abstract:We show that AI-driven investment strategies are inherently self-defeating at scale. As AI adoption rises, three mutually reinforcing channels – signal crowding, performative signal erosion, and Red Queen competition – compress excess returns. We derive the alpha half-life $h(\phi) = \ln 2/[\theta + \delta(\phi)]$, where $\theta$ is the natural mean-reversion rate and $\delta(\phi) = N\phi\rho a/\lambda(\phi)$ is the AI-accelerated decay component, which is convex-decreasing in adoption. At current adoption levels ($\phi \approx 0.7$, $\rho \approx 0.6$), the model implies signal half-lives of 18 months versus 5-7 years pre-AI. We establish four theoretical results. First, the alpha half-life theorem: signal lifespans are convex-decreasing in AI adoption. Second, a signal extinction cascade: beyond a critical threshold $\phi^*$, the decay of one signal class triggers accelerated competition for remaining signals. Third, a Red Queen impossibility: in the monoculture equilibrium, net alpha is identically zero despite heavy AI investment. Fourth, a fragility-efficiency tradeoff: the adoption level maximizing price discovery strictly exceeds the level minimizing systemic fragility. Empirical validation calibrates portfolio convergence to SEC Form 13F filing patterns (99.5 million holdings, 2013-2024), documenting that simulated institutional portfolio convergence increases by 42% over the sample period. We examine simulated hedge fund return dynamics showing declining cross-sectional dispersion among AI-adopting funds, and simulate the 2010 Flash Crash to illustrate fragility consequences.
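The half-life expression in the abstract can be evaluated directly. The sketch below only transcribes $h(\phi) = \ln 2/[\theta + \delta(\phi)]$ with $\delta(\phi) = N\phi\rho a/\lambda(\phi)$; the abstract does not define $N$, $a$, or the form of $\lambda(\phi)$, so here `lam` is a caller-supplied function and every numeric value in the usage lines is a placeholder, not a calibration from the paper.

```python
import math
from typing import Callable

def ai_decay(phi: float, N: float, rho: float, a: float,
             lam: Callable[[float], float]) -> float:
    """delta(phi) = N * phi * rho * a / lambda(phi), as written in the abstract."""
    return N * phi * rho * a / lam(phi)

def alpha_half_life(phi: float, theta: float, N: float, rho: float, a: float,
                    lam: Callable[[float], float]) -> float:
    """h(phi) = ln 2 / (theta + delta(phi))."""
    return math.log(2) / (theta + ai_decay(phi, N, rho, a, lam))

if __name__ == "__main__":
    # Placeholder inputs; lambda(phi) is assumed constant purely for illustration.
    constant_lam = lambda phi: 1.0
    for phi in (0.0, 0.35, 0.7):
        h = alpha_half_life(phi, theta=0.1, N=10, rho=0.6, a=0.05, lam=constant_lam)
        print(phi, h)
```

With adoption at zero the expression reduces to the natural half-life $\ln 2/\theta$, and any positive $\delta(\phi)$ shortens it.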

[AI-271] PilotWiMAE: Pilot-Native Representation Learning for Wireless Channels

Link: https://arxiv.org/abs/2605.22856
Authors: Berkay Guler, Giovanni Geraci, Hamid Jafarkhani
Institution: unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)

Abstract:Channel foundation models assume access to fully observed channels, an assumption that fails in deployment. We introduce PilotWiMAE, a self-supervised framework whose encoder ingests noisy pilot observations directly and whose attention factorizes along the axis separating temporal from joint space-frequency processing, an inductive bias inspired by the physics of the problem. Pilot input shrinks the observation space by up to two orders of magnitude and also removes the unrealistic assumption of full-CSI availability while incurring lower latency. The factorized design generates robust representations by exploiting the separable channel structure and allows a pretraining mask ratio of 99%. We pair patch-normalized reconstruction, which captures small-scale fading structure, with an auxiliary scale loss that recovers the large-scale fading features, and use an AWGN curriculum to match pilot noise at pretraining and deployment. Pretrained solely on 3.5 GHz and evaluated at 28 GHz across in-distribution and out-of-distribution settings, PilotWiMAE’s cross-frequency beam selection and channel characterization beat supervised baselines despite operating on a smaller observation space. To weaken the coupling between decoder capacity and representation quality, we further propose a decoder-centric pretraining stage following the encoder-decoder joint pretraining, which allows PilotWiMAE to demonstrate competitive channel estimation without sacrificing representation quality. To foster further work in this direction, we release the PilotWiMAE pretrained weights and training pipeline, together with CSIGen, our Sionna-based ray-tracing channel-generation tool, and the channel datasets used in this work.

Machine Learning

[LG-0] Looped Diffusion Language Models

Link: https://arxiv.org/abs/2605.26106
Authors: Sanghyun Lee, Chunsan Hong, Seungryong Kim, Jonghyun Lee, Jongho Park, Dongmin Park
Categories: Machine Learning (cs.LG)
Notes: 23 pages

Abstract:Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models for language modeling, yet the effective design of transformer architectures for MDMs remains underexplored. In this paper, we show that selectively looping the early-middle transformer layers significantly improves both training efficiency and model performance in MDMs. We call this approach LoopMDM (Looped Masked Diffusion Model), which brings two key benefits: looping layers at training-time yields a depth-scaling effect without adding parameters, while varying the number of loops at inference-time enables flexible compute scaling. Despite the simplicity, the results are striking: across multiple pre-training corpora, LoopMDM matches the performance of same-size MDMs with up to 3.3× fewer training FLOPs, while its final performance outperforms them on various reasoning benchmarks, including up to 8.5 points on GSM8K. It even surpasses deeper non-looped MDMs trained with comparable per-step compute, indicating that selective looping is more effective than naive depth scaling. Furthermore, LoopMDM can scale inference-time compute by increasing the number of loops. Adaptively adjusting the number of loops throughout the sampling process further yields additional gains in compute efficiency while maintaining performance. Lastly, with attention analysis, we provide evidence that looping is effective in MDMs by promoting interactions among masked positions. Our code and weights will be publicly released.

[LG-1] Forgetting in Language Models: Capacity Optimization and Self-Generated Replay

Link: https://arxiv.org/abs/2605.26097
Authors: Martin Marek, Dongkyu Cho, Shikai Qiu, Rumi Chunara, Pavel Izmailov, Andrew Gordon Wilson
Categories: Machine Learning (cs.LG)

Abstract:Models trained on a new task typically degrade on prior tasks, a phenomenon known as forgetting. Traditionally, mitigating forgetting has required replaying stored exemplars from prior tasks, which is often impractical. By contrast, language models can sample from their own training distribution, and we show that these self-generated samples serve as effective replay data, nearly eliminating forgetting. We find that forgetting nonetheless persists when the model has little remaining capacity: models pretrained close to saturation cannot absorb new information without overwriting prior knowledge. When capacity is not the limiting factor, low learning rates reduce forgetting but require substantially more training steps. Replay breaks this tradeoff, enabling fast, high-learning-rate finetuning without forgetting.

[LG-2] Goal-driven Bayesian Optimal Experimental Design for Robust Decision-Making Under Model Uncertainty

Link: https://arxiv.org/abs/2605.26093
Authors: Jinwoo Go, Xiaoning Qian, Byung-Jun Yoon
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Bayesian optimal experimental design (BOED) selects experiments to maximize information gain about model parameters. However, in decision-critical settings, reducing parameter uncertainty does not necessarily improve downstream decisions, as only specific parameter directions relevant to the objective truly matter. We propose GoBOED, a goal-driven BOED framework that directly optimizes experimental designs for a specified decision-making objective. GoBOED combines an amortized variational posterior surrogate with a differentiable convex decision layer, enabling gradient-based design optimization that is fully decision-focused. We theoretically show that GoBOED gradients are insensitive to parameter directions irrelevant to the decision objective, providing a formal justification for why goal-driven design achieves equivalent decision quality over a wider set of experimental designs than information-gain maximization. Empirically, across source localization, epidemic management, and pharmacokinetic control, GoBOED identifies designs that better align with downstream decision objectives and reveals that near-optimal design windows are substantially wider than those predicted by goal-agnostic BOED approaches.

[LG-3] Global Convergence of Wasserstein Policy Gradient for Entropy-Regularized Reinforcement Learning

Link: https://arxiv.org/abs/2605.26078
Authors: Zhaoyu Zhu, Rui Gao, Shuang Li
Categories: Machine Learning (cs.LG)

Abstract:Wasserstein policy gradient (WPG) is a policy optimization method for reinforcement learning (RL) that exploits the optimal-transport geometry of action distributions. For the entropy-regularized RL objective, WPG evolves each state-conditional policy by transporting it along the action gradient of the soft Q-function together with a Langevin-type diffusion. Despite its appeal for continuous-control problems, its global convergence properties remain poorly understood. Standard Langevin analyses do not directly apply, because the RL objective depends on the policy through the Bellman recursion rather than through a static convex functional, and the Langevin drift is determined by the soft Q-function, whose regularity must be controlled along the policy iterates. In this paper, we develop a global convergence theory for WPG by exploiting the Bellman structure of entropy-regularized RL. We show that the role usually played by convexity can be replaced by a Bellman-based argument: the soft Bellman residual admits a statewise KL representation with respect to a Gibbs policy; Bellman contraction relates this residual to the global optimality gap; and a Bellman resolvent identity connects value improvement to relative Fisher information. Combined with a uniform log-Sobolev inequality (LSI) for the evolving Gibbs family, these ingredients yield a distributional Polyak–Łojasiewicz condition. We further establish the regularity and uniform bounds needed to control the discretization error, thereby obtaining geometric contraction up to a discretization bias. Conceptually, our analysis shows that although entropy-regularized RL is not convex in the usual flat sense, the Bellman recursion induces a favorable Polyak–Łojasiewicz-type (PL) geometry that supports global convergence of WPG.

[LG-4] Active Query Synthesis for Preference Learning

Link: https://arxiv.org/abs/2605.26072
Authors: Namrata Nadagouda, Nauman Ahad, Maegan Tucker, Mark A. Davenport
Categories: Machine Learning (cs.LG)
Notes: 27 pages, 12 figures

Abstract:Efficient learning of user preferences is crucial for many modern decision-making systems but typically requires costly labeled data. Active learning reduces this cost, yet standard methods are computationally expensive due to pool-based evaluation. Further, most methods assume all query feedback is equally reliable, ignoring that pairwise queries between nearly identical or entirely dissimilar items yield ambiguous, low-confidence responses. To address the issue of feedback reliability, we introduce a novel confidence-aware response model that explicitly accounts for these ambiguous comparisons. To overcome the computational bottleneck of pool-based evaluation, we propose an active query synthesis framework, Info-Synth, that generates optimal queries by maximizing a mutual information-based objective within a continuous space. Moreover, we propose two strategies, Pair M-dist and Pair Opt-dist, that extend Info-Synth to select effective queries even when restricted to finite query pools. We demonstrate our framework’s versatility and performance across synthetic preference learning, constrained text summary datasets, and subjective, continuous-space controller gain tuning for a simulated mobile robot.

[LG-5] Length Generalization with Log-Depth Recurrent Units

Link: https://arxiv.org/abs/2605.26035
Authors: Charles Pert, Dalal Alrajeh, Alessandra Russo
Categories: Machine Learning (cs.LG)
Notes: 39 pages, 11 figures

Abstract:Length generalization remains a persistent challenge for neural networks: recurrent models tend to suffer from positional biases, while transformers are constrained by fixed computational depth. Regular languages provide a frequently used testbed for evaluating length generalization, as label prediction can be checked for any sequence length. We propose MLP-LDRU, a type of Log-Depth Recurrent Unit, which captures a class of associativity-biased operators designed to approximate recurrence through parallel reduction. We evaluate MLP-LDRU on 21 regular-language tasks, consisting of standard benchmarks and new prefix languages, where it achieves 100% out-of-distribution accuracy on 18 tasks and at least 99.9% on the remaining 3 when increasing max training length, outperforming comparable recurrent and attention-based models. We further evaluate MLP-LDRU beyond regular languages on ListOps and NLP classification benchmarks, where it performs competitively.
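The idea of approximating recurrence through parallel reduction of an associative operator can be illustrated on a regular language. The sketch below is not the MLP-LDRU model: it uses a fixed, hand-written operator (composition of DFA transition tables) rather than a learned one, and all names (`compose`, `tree_reduce`, `STEP`) are hypothetical. It shows only why associativity lets a length-n sequence be combined in about log2(n) levels.

```python
from typing import List, Sequence, Tuple

# A transition table: table[s] is the next state when starting in state s.
Transition = Tuple[int, ...]


def compose(f: Transition, g: Transition) -> Transition:
    """Apply f first, then g. Function composition is associative."""
    return tuple(g[s] for s in f)


def tree_reduce(items: Sequence[Transition]) -> Tuple[Transition, int]:
    """Combine adjacent pairs level by level, preserving order.

    Returns the combined transition and the number of levels used.
    """
    if not items:
        raise ValueError("tree_reduce needs at least one item")
    level: List[Transition] = list(items)
    depth = 0
    while len(level) > 1:
        nxt = [compose(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # carry an unpaired last element to the next level
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], depth


# Parity DFA over {0, 1}: state 0 = even number of 1s seen, state 1 = odd.
STEP = {"0": (0, 1), "1": (1, 0)}


def accepts_even_ones(word: str) -> bool:
    """True if the word contains an even number of 1s."""
    if not word:
        return True
    total, _ = tree_reduce([STEP[c] for c in word])
    return total[0] == 0  # start in state 0, accept if we end in state 0


print(accepts_even_ones("1100"), accepts_even_ones("111"))
print(tree_reduce([STEP["1"]] * 1024)[1])  # number of levels for 1024 symbols
```

Each level's pairwise combinations are independent of one another, so they could run in parallel; the sequential loop here only simulates that schedule.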

[LG-6] Causal methods for LLM development and evaluation KDD2026

Link: https://arxiv.org/abs/2605.25998
Authors: Dennis Frauen, Marie Brockschmidt, Konstantin Hess, Haorui Ma, Yuchen Ma, Abdurahman Maarouf, Maresa Schröder, Jonas Schweisthal, Yuxin Wang, Athiya Deviyani, Sonali Parbhoo, Rahul G. Krishnan, Stefan Feuerriegel
Categories: Machine Learning (cs.LG)
Notes: Published in KDD 2026

Abstract:Large language model (LLM) development is currently driven by large-scale empirical iteration over data mixtures, reward models, routing strategies, and evaluation pipelines. Here, we argue that many central questions in LLM development and evaluation are inherently causal: What is the effect of adding a data domain during pretraining? How do annotator preferences change when LLMs generate text in a different style? Should a prompt be routed to a larger or smaller model given inference cost constraints? In general, causal methods are well-suited to such settings where interventions change outcomes but, surprisingly, are underrepresented in LLM development. Our contribution is threefold: (1) We explain how causal methods can help develop modern LLM development and evaluation: LLM development relies heavily on logged data, which are often subject to confounding and distribution shifts; evaluation uses learned but potentially biased judges; and deployment environments are non-stationary. These conditions make purely predictive approaches fragile and create opportunities for principled identification and estimation methods from causal inference. (2) We further map opportunities for causal methods in the entire LLM development pipeline, including pretraining, alignment, routing, agentic workflows, and evaluation. (3) We discuss new research opportunities around leveraging causal methods for LLM development and evaluation. Overall, we argue that causal methods are potentially underutilized for the LLM development and evaluation pipeline, despite the fact that such methods can ensure a reliable and scientifically grounded design.

[LG-7] Deployment-complete benchmarking

Link: https://arxiv.org/abs/2605.25997
Authors: El Mustapha Mansouri, Keigo Arai
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Notes: 33 pages, 5 figures, 1 table; supplementary tables and code available

Abstract:Benchmarks increasingly guide deployment, procurement and scientific screening, yet a score supports only the response it records, not necessarily the deployment action. We introduce deployment-complete benchmarking, which tests whether benchmark evidence determines a deployment action. A benchmark is complete for a claim exactly when the action is constant on each evidence fiber; mixed fibers expose missing deployment information, and completion curves quantify the evidence required to resolve ambiguity. In controlled response spaces, benchmark-channel conformal coverage of 94.98% transferred poorly to an unmeasured deployment channel (10.07%), whereas response-rank intervals achieved 94.91% coverage; even zero benchmark error certified only 45.4% of candidates at the largest residual size. Public audits revealed incompleteness, including 97.9% mixed Tox21 fibers and zero median certifiable fraction in main Matbench and JARVIS audits. In held-out replays, certify-then-acquire reduced false decisions from 1.19% to 0.027% in Tox21 and from 20.3% to 0.128% in JARVIS, while changing model choice and identifying deployment-relevant probes. Deployment-ready benchmarks should report evidence, supported actions, ambiguity and completion cost rather than scores alone.
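The completeness criterion in the abstract (the action is constant on each evidence fiber) can be sketched as a grouping check. This is an illustrative reading of that one sentence, not the authors' code: the record format, the function names, and the example evidence strings are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Set, Tuple

Record = Tuple[Hashable, Hashable]  # (benchmark evidence, correct deployment action)


def mixed_fibers(records: List[Record]) -> Dict[Hashable, Set[Hashable]]:
    """Group records by evidence and return the fibers holding more than one action."""
    fibers: Dict[Hashable, Set[Hashable]] = defaultdict(set)
    for evidence, action in records:
        fibers[evidence].add(action)
    return {ev: acts for ev, acts in fibers.items() if len(acts) > 1}


def is_complete(records: List[Record]) -> bool:
    """Complete means the evidence alone determines the action for every record."""
    return not mixed_fibers(records)


# Hypothetical candidates: the same benchmark score maps to different actions,
# so the score by itself cannot decide deployment.
records = [
    ("score>=0.9", "deploy"),
    ("score>=0.9", "reject"),
    ("score<0.9", "reject"),
]
print(is_complete(records))
print(mixed_fibers(records))
```

A mixed fiber signals that extra evidence is needed to separate the candidates inside it, which is the role the abstract assigns to completion curves.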

[LG-8] Fuzzy PyTorch: Rapid Numerical Variability Evaluation for Deep Learning Models

Link: https://arxiv.org/abs/2605.25991
Authors: Inés Gonzalez-Pepe, Hiba Akhaddar, Tristan Glatard, Yohan Chatelain
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Notes: 19 pages, 8 figures, Published in Transactions on Machine Learning Research (01/2026)

Abstract:We introduce Fuzzy PyTorch, a framework for rapid evaluation of numerical variability in deep learning (DL) models. As DL is increasingly applied to diverse tasks, understanding variability from floating-point arithmetic is essential to ensure robust and reliable performance. Tools assessing such variability must be scalable, efficient, and integrate seamlessly with existing frameworks while minimizing code modifications. Fuzzy PyTorch enables this by integrating stochastic arithmetic into PyTorch through Probabilistic Rounding with Instruction Set Management, a novel library interfacing with Verificarlo, a numerical analysis compiler. The library offers a stochastic rounding mode and a novel mode: up-down rounding. Comparative evaluations show Fuzzy PyTorch maintains model performance and achieves runtime reductions of 5x to 60x versus Verrou, a state-of-the-art tool. We further demonstrate scalability by running models from 1 to 341 million parameters, confirming applicability across small and large DL architectures. Overall, Fuzzy PyTorch provides an efficient, scalable, and practical solution for assessing numerical variability in deep learning, enabling researchers and practitioners to quantify and manage floating-point uncertainty without compromising performance or computational efficiency.
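For readers unfamiliar with stochastic rounding, the following toy sketch shows the general technique on a fixed grid. It is not Fuzzy PyTorch's implementation and does not use Verificarlo; the function name and the grid-based formulation are assumptions made for illustration. The abstract's up-down rounding mode is not covered here.

```python
import math
import random


def stochastic_round(x: float, step: float, rng: random.Random) -> float:
    """Round x to a multiple of step, choosing the upper neighbour with
    probability equal to x's fractional position between the two neighbours.

    The result is unbiased: its expected value equals x.
    """
    scaled = x / step
    lower = math.floor(scaled)
    frac = scaled - lower  # in [0, 1)
    round_up = rng.random() < frac
    return (lower + (1 if round_up else 0)) * step


rng = random.Random(0)
samples = [stochastic_round(0.3, 1.0, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
# Round-to-nearest would always give 0.0; the stochastic mean stays close to 0.3.
print(round(0.3), mean)
```

Repeating a computation under such randomized rounding and inspecting the spread of the results is one way to expose sensitivity to floating-point error.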

[LG-9] Hidden in Plain Tokens: Simply Robust Gradient-Free Watermark for Synthetic Audio ICML2026

Link: https://arxiv.org/abs/2605.25967
Authors: Georgios Milis, Yubin Qin, Yihan Wu, Heng Huang
Categories: Machine Learning (cs.LG); Sound (cs.SD)
Notes: Accepted to ICML 2026

Abstract:As policy catches up with the capabilities of generative AI, watermarking is central to content provenance efforts. Inference-time watermarks for autoregressive models are unfit for continuous modalities due to discretization inconsistencies. Existing methods overcome this by finetuning the modality tokenizers, nullifying the watermark’s training-free advantage. In this work, motivated by the vocabulary redundancy of discretization, we propose an elegant solution for powerful and robust watermarking of synthetic audio. We theoretically analyze the impact of token errors on watermark detection, and effectively mitigate them using a reduced vocabulary obtained via community detection. Thorough experiments showcase that our gradient-free method can boost detectability by several orders of magnitude, while also achieving built-in robustness to audio modifications. Broadly, we discover a new state-of-the-art for token-level watermarks in multimedia, which simply arises from the nature of discrete representation learning.

[LG-10] STaT: Resolving Shape Distortion in Non-Stationary Time Series via Tri-Modal Synergy

Link: https://arxiv.org/abs/2605.25943
Authors: Hui Cheng, Jinsheng Guo, Zhenhao Weng, Yan Qiao, Meng Li
Categories: Machine Learning (cs.LG)

Abstract:Recent research in time series forecasting frequently investigates the integration of textual and visual modalities with numerical models to better navigate non-stationary environments. Despite delivering solid numerical results, existing multi-modal approaches usually encounter a dilemma: prioritizing the minimization of average errors can result in excessively smooth forecasts that overlook essential fluctuations. To resolve this limitation, we introduce STaT, an innovative multimodal architecture for Symbolic-Temporal-Textual Alignment, which seamlessly unites three synergistic modalities. Specifically, the symbolic modality converts continuous time series into discrete tokens, facilitating the accurate identification of structural patterns and turning points; the temporal modality extracts inherent sequential dependencies; and the textual modality leverages domain semantics to steer the macroscopic forecasting trends. Comprehensive evaluations on eight real-world benchmarks indicate that STaT delivers exceptional performance, enhancing conventional magnitude indicators by up to 8.9% while simultaneously decreasing shape distortion by up to 8.5%.

[LG-11] Building an Adversarial Malware Dataset by Family and Type: Generation, Evasion, and Poisoning Evaluation

Link: https://arxiv.org/abs/2605.25937
Authors: David Košťál, Martin Jureček
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:We present a dataset of adversarial malware samples derived from the public RawMal-TF collection of real-world malware binaries. Using a suite of adversarial malware generators, we construct two sets of adversarial PE files: 44,347 family-labelled samples and 33,596 type-labelled samples, achieving evasion rates of 98.35% and 92.20% against the EMBER classifier, respectively. Each adversarial binary is accompanied by detailed metadata, including EMBER scores and VirusTotal classifications. We further demonstrate the susceptibility of malware classification pipelines to data poisoning attacks through a series of training experiments. Injecting fully mislabelled adversarial samples representing only 0.5% of the training data in the family-labelled dataset increases the evasion rate against the re-trained classifier from 26.1% to 92.8%. The dataset is publicly released to facilitate future research on adversarial malware, poisoning attacks, and the robustness of machine-learning-based malware detection systems.

[LG-12] Joint Optimization of Training and Inference in Federated Edge Learning via Constrained Multi-Objective Deep Reinforcement Learning

Link: https://arxiv.org/abs/2605.25916
Authors: Zhen Li, Jun Cai, Chao Yang, Haoran Gao
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)

Abstract:Federated edge learning (FEEL) has recently emerged as a promising paradigm for achieving edge intelligence (EI) via enabling collaborative model training across edge devices while protecting data privacy. In this paper, we put forth an online optimization framework that jointly manages federated training and inference on resource-constrained edge devices. We introduce a tandem-queue-inspired conversion mechanism that bridges inference requests and training data, and further incorporate both data and model freshness into the accuracy formulation to capture temporal dynamics in real-world environments. To maximize inference accuracy while minimizing latency and energy consumption, the mode selections, communication, and computation resource allocations of edge devices are jointly optimized. We formulate this optimization as a multi-objective optimization problem, which is NP-hard and further complicated by the online setting. To address these challenges, we transform the problem into a multi-objective Markov decision process (MOMDP) and develop a constrained multi-objective proximal policy optimization (C-MOPPO) algorithm. Specifically, C-MOPPO first learns a set of policies with different preferences across three objectives, then leverages constrained policy optimization to enrich the Pareto front and obtain high-quality, dense solutions. Extensive experiments demonstrate that C-MOPPO achieves well-balanced trade-offs among objectives and significantly outperforms baselines under various system configurations.

[LG-13] Reading the Finetuning Prior: Verbatim Content Recovery via Contrastive Decoding Diffing

Link: https://arxiv.org/abs/2605.25902
Authors: Michał Brzozowski, Zuzanna Dubanowska, Enrico Cassano, Neo Christopher Chung
Categories: Machine Learning (cs.LG)

Abstract:Narrowly finetuned language models memorize implanted content verbatim, but auditing what a deployed model has been taught, without access to its weights or training data, remains an open challenge. Recent work shows that activation differences between base and finetuned models carry readable traces of the finetuning domain; the state-of-the-art Activation Difference Lens (ADL) recovers a vague domain-level description but requires full “white-box” access to model internals. We introduce Contrastive Decoding Diffing (CDD), a model diffing method that operates on output-level logit distributions only, with no weight access, no layer selection, and no per-model tuning, yet recovers implanted facts. CDD consists of three ideas: bypassing the chat template to expose the raw finetuning prior, seeding generation with maximally vague pre-fills, and amplifying the logit-space difference between finetuned and base models at each decoding step. A single default configuration recovers implanted facts verbatim – exact drug names, vote counts, physical measurements, and procedural details – across four architectures (1B–32B parameters), uniformly outperforming ADL despite less access and running ~170x faster. Furthermore, CDD surfaces unintended data pipeline artifacts: a fictional persona introduced by the LLM data generator via mode collapse leaked into model weights and was extracted by CDD, constituting to our knowledge the first demonstrated end-to-end fingerprinting chain from data generator artifact to model weights to recovered output. We validate on real-domain finetuning settings, achieving near-perfect recovery across all single-dataset non-CoT variants and correctly identifying all four datasets in the mixed-dataset setting. CDD’s success as a grey-box method outperforming white-box baselines underscores its practical utility for transparency and accountability in AI systems.
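The third idea in the abstract, amplifying the logit-space difference between the finetuned and base models at each decoding step, can be sketched for a single step as follows. The exact combination rule used by CDD is not given in the abstract; the form `ft + alpha * (ft - base)`, the parameter `alpha`, the toy vocabulary, and the logit values are all assumptions for illustration.

```python
from typing import List


def amplify(ft_logits: List[float], base_logits: List[float], alpha: float) -> List[float]:
    """Push each logit further in the direction the finetuned model moved it.

    alpha = 0 returns the finetuned logits unchanged (hypothetical parameter).
    """
    if len(ft_logits) != len(base_logits):
        raise ValueError("both models must score the same vocabulary")
    return [f + alpha * (f - b) for f, b in zip(ft_logits, base_logits)]


def argmax(xs: List[float]) -> int:
    """Index of the largest value (greedy token choice)."""
    return max(range(len(xs)), key=xs.__getitem__)


# Toy three-token vocabulary; "zorvex" stands in for an implanted term.
vocab = ["the", "a", "zorvex"]
base = [5.0, 4.0, -2.0]
finetuned = [5.0, 4.0, 3.0]

plain_choice = vocab[argmax(finetuned)]
contrastive_choice = vocab[argmax(amplify(finetuned, base, alpha=1.0))]
print(plain_choice, contrastive_choice)
```

In this toy case the implanted token is not the finetuned model's top choice, but it is the token whose logit changed most relative to the base model, so amplification surfaces it.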

[LG-14] Predicting Stock Price Direction on Earnings Announcement Days using Multi-modal Deep Learning

Link: https://arxiv.org/abs/2605.25894
Authors: Manuel Noseda, Nathan Soldati, Marco Paina
Categories: Machine Learning (cs.LG); Statistical Finance (q-fin.ST)

Abstract:Predicting stock price movements during Earnings Announcements (EAs) is a significant challenge due to market noise and high-impact price discontinuities. In this study, we evaluate whether pre-announcement news sentiment, firm fundamentals, and recent market dynamics jointly predict the directional price movement of equities on EA days. We construct a multi-modal feature space combining 15 fundamental metrics, 3 price-based technical indicators and sentiment scores derived from financial news articles processed using FinBERT. We compare a Long Short-Term Memory (LSTM) network and a Transformer-based architecture against a logistic regression baseline, and further assess all models with and without sentiment features to quantify their incremental value. Our results indicate that while the LSTM demonstrates higher precision through a conservative safe-bet strategy, the Transformer model exhibits superior sensitivity in identifying volatile movements, achieving a higher macro F1-score, with ablation experiments showing a consistent benefit from incorporating news sentiment.

[LG-15] Merge-Bench: Resolve Merge Conflicts with Large Language Models

Link: https://arxiv.org/abs/2605.25890
Authors: Benedikt Schesch, Michael D. Ernst
Categories: Machine Learning (cs.LG)
Notes: 14 pages, 7 figures

Abstract:This paper applies machine learning to the difficult and important task of version control merging. (1) We constructed a dataset, Merge-Bench, of 7938 real-world merge conflict hunks from 1439 GitHub repositories. The ground truth is the merge resolution that developers committed to the repository. Our dataset construction methodology is scalable to arbitrary amounts of data since no manual labeling is required. (2) We trained a model, LLMergeJ, to resolve merge conflicts in Java programs. Our approach uses Group Relative Policy Optimization (GRPO), an online reinforcement learning method, to train a Large Language Model (LLM). (3) We performed two evaluations of the performance of LLMs on resolving merge conflicts. On Java programs, LLMergeJ with 14B parameters outperforms 3 commercial LLMs, trailing only Gemini 2.5 Pro. Across 11 programming languages, commercial LLM performance is largely stable from language to language. The best models correctly resolve less than 60% of merge conflicts.

[LG-16] Capability and Robustness Cannot Both Be Free: An Information-Theoretic Bound for Vision-Language-Action Models

Link: https://arxiv.org/abs/2605.25889
Authors: Jianwei Tai
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:Vision-Language-Action (VLA) models are increasingly deployed on real robots, where each predicted action is executed and each failure carries a safety cost. They reach high success rates on clean inputs but collapse under small adversarial perturbations. A 16/255 PGD attack on OpenVLA-7B drops LIBERO success from above 95% to under 5%. Empirical defenses recover some robustness at a cost in clean accuracy, but the literature does not say whether the trade-off has a theoretical floor. We prove that it does. For any VLA policy with discrete actions, the sum of capability (mutual information between policy action and oracle action) and robustness (mutual information preserved under adversarial perturbation, net of trivial channel leakage) is upper-bounded by a policy-independent budget: task entropy plus adversarial channel capacity. The proof is two applications of the Data Processing Inequality plus MI non-negativity. The pixel-level bound is loose on current models ($\sim 10^3$ nats), but an encoder-specific corollary restricts the channel to the policy-relevant subspace, reducing the budget from $\sim 5{,}000$ to $\sim 31$ nats on OpenVLA; the policy already consumes $\sim 24\%$ of this tighter budget, leaving limited room for simultaneous robustness improvement. We validate the bound across 252 closed-form Gaussian-VLA cells and 48 OpenVLA-7B $\times$ LIBERO $\times$ PGD cells (zero violations). We propose encoder-specific slack as a normalized comparison axis for defense papers, and release all code, manifests, and results.

[LG-17] Optimal and Order-optimal Gated Priority-based Greedy Policies for Two-layer Multi-item Order Fulfillment

Link: https://arxiv.org/abs/2605.25888
Authors: Xi Chen, Yuze Chen, Ziyi Chen, Yuan Zhou
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:We study how an e-commerce firm should make real-time fulfillment decisions in a two-layer distribution network when multi-item customer orders arrive sequentially and future demand is unknown. The central managerial tension is whether to use scarce front distribution center (FDC) inventory to save current fulfillment cost or preserve that inventory for future orders that may be more valuable to serve locally. We formulate an adversarial online model with multiple FDCs, one regional distribution center (RDC), multi-unit multi-item orders, and item-specific and time-varying variable costs. Our theoretical objective is to characterize when simple, interpretable, and implementable fulfillment rules can perform nearly as well as an optimal clairvoyant planner. We develop a family of Gated Priority-based Greedy policies, derive competitive-ratio guarantees under both time-varying and time-invariant cost structures, and establish matching or near-matching lower bounds for any online algorithm. Numerical experiments show that the proposed policies perform strongly relative to generalized myopic and forecast-based benchmarks. The analysis yields managerial guidance on when local inventory should be protected, when splitting orders is worth the fixed-cost burden, and how the relative magnitudes of fixed and variable costs determine the value of more sophisticated optimization.

[LG-18] Conformalised imprecise inference for robust extrapolation under limited data

Link: https://arxiv.org/abs/2605.25882
Authors: Yu Chen, Scott Ferson
Categories: Machine Learning (cs.LG)
Notes: 10 pages, 5 figures

Abstract:Recent advances in uncertainty quantification increasingly emphasise the distinction between aleatory and epistemic uncertainty in machine learning, motivating the need for more unified frameworks. However, despite much progress in producing reliable predictions, existing methods often lack rigorous guarantees when generalising beyond the training domain. We propose a conformalised imprecise inference framework for robust extrapolation, which is model-agnostic and augments predictive models with imprecision and distance awareness. The proposed approach yields imprecise predictions (probability boxes) that remain valid under distributional shift, maintaining coverage while adaptively expanding uncertainty in extrapolation regimes. Experiments on synthetic and benchmark datasets demonstrate improved robustness and reliable coverage compared to standard probabilistic approaches, particularly under limited data.

[LG-19] The Quantization Benefits of Residual-Free Transformers

Link: https://arxiv.org/abs/2605.25880
Authors: Yiping Ji, Mahalakshmi Sabanayagam, Peyman Moghadam, Hemanth Saratchandran, Simon Lucey
Categories: Machine Learning (cs.LG)
Notes: Under review

Abstract:Large-scale transformer training and deployment are increasingly constrained by the transfer of activations, gradients, and optimizer states across accelerators. Low-bit quantization offers a natural remedy, but transformer activations are often heavy-tailed and outlier-dominated, making simple quantization highly lossy. We show that this difficulty is not only a property of the quantizer, but also of the architecture. Specifically, residual connections can drive transformer activations away from Gaussianity during training. Using controlled comparisons between residual and residual-free transformers, we demonstrate that this effect leads to substantially higher quantization error and accuracy degradation at low precision in residual models. We explain the phenomenon through an excess kurtosis analysis, showing that residual mixing can amplify non-Gaussianity, whereas dense mixing in residual-free contracts non-Gaussianity. We then show that residual-free transformers can be made trainable using orthogonal initialization, spectral or second-order optimization, and depth-aware scaling of attention temperature. In language tasks, while there is a small drop in full precision performance, these models retain near-Gaussian activations and exhibit significantly improved robustness to low-bit quantization. Our results identify an accuracy–compressibility trade-off in transformer design and motivate architecture-level approaches to quantization-friendly foundation models.

[LG-20] UNATE: UNsupervised ATomic Embedding for crystal structures property prediction

Link: https://arxiv.org/abs/2605.25866
Authors: Laura Solà-Garcia, Àlex Solé, Javier Ruiz-Hidalgo
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Classical Physics (physics.class-ph)
Comments:

Click to view abstract

Abstract:Accurately predicting crystal properties is critical for accelerating materials discovery, but it is often limited by scarce labeled data and costly theoretical calculations. To alleviate this, we propose UNATE (Unsupervised Atomic Embedding), a framework that leverages structural information extracted from unlabeled crystal structures. UNATE integrates an unsupervised denoising autoencoder with self-supervised contrastive learning to learn robust atomic representations, which are then used as input features for downstream property prediction. Experimental results show that replacing raw atomic numbers with UNATE-pretrained node embeddings yields a 2.7% improvement over the full-data baseline. Notably, the benefits become more pronounced in scenarios with limited labeled data, reaching improvements of up to 10% when only 25% of the labeled data is used.

[LG-21] Branched Signature Kernel Solvers for ODEs with rough Single-Trajectory signals

Link: https://arxiv.org/abs/2605.25826
Authors: Munawar Ali, Qi Feng, Charlie Pyle, George Xu
Categories: Numerical Analysis (math.NA); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: 39 pages, 12 figures

Click to view abstract

Abstract:We develop a branched signature kernel solver for linear and nonlinear ordinary differential equations driven by a single observed trajectory of a possibly rough forcing signal – a setting that arises naturally in earthquake engineering, finance, biology, and structural health monitoring, where the forcing is observed exactly once and the solver must respect the underlying physical law without recourse to an ensemble of realizations. Two ingredients are new. First, a count-sampling construction turns the single observation into a hierarchical family of N+1 nested training paths on which the branched signature kernel can be evaluated; this allows the signature kernel machinery, originally designed for multi-realization regression problems, to operate on a single-trajectory observation. Second, a kernel-collocation framework places the ansatz either on the highest-order derivative of the solution (with lower derivatives recovered by integrating the kernel) or on the solution itself (after m-fold integration of the ODE). We prove a universal approximation theorem for the branched signature kernel, leveraging the Hairer–Kelly morphism to express branched signature evaluations through geometric signatures of time-extended paths. The offline solver is extended to a streaming Test/Train/Retrain protocol with closed-form online updates in the linear case and scalar Newton steps in the nonlinear case. Numerical experiments on six benchmarks (El-Centro earthquake displacement, the Solow capital-stock model, an fBM-driven second-order ODE, a forced Duffing oscillator, a path-dependent Arias-intensity-degraded oscillator with variable coefficients, and a noisy Kuramoto phase-oscillator system) show that the branched signature-kernel solver delivers accurate, stable predictions across all regimes.

[LG-22] Visual-Redundancy-Controlled Parallel Decoding for Diffusion-Based Multimodal Large Language Models

Link: https://arxiv.org/abs/2605.25820
Authors: Yulin Yuan, Hongshuo Zhao, Xiangming Meng
Categories: Machine Learning (cs.LG)
Comments: 18 pages, 5 figures

Click to view abstract

Abstract:Diffusion-based multimodal large language models (dMLLMs) decode by iteratively predicting tokens at multiple masked positions in parallel. This turns each decoding step into a position-selection problem: the model must choose not only which predictions are reliable in isolation, but also which positions should be committed together as context for later decoding steps. Existing confidence-based decoding ranks masked positions independently and commits the top-K positions, largely ignoring whether the committed tokens provide complementary visual grounding. We identify a step-level limitation of this strategy in multimodal settings: high-confidence tokens selected in the same step can rely on overlapping visual grounding, introducing visual redundancy among the committed tokens and leaving less complementary visual grounding available for later decoding. To quantify this effect, we introduce the Visual Redundancy Index (VRI), which measures visual grounding overlap among tokens committed in parallel. To control this redundancy during decoding, we propose Visual-Redundancy-Controlled Decoding (VRCD), a training-free inference-time decoding method that uses token-to-image attention to prioritize visually complementary positions. Across diverse multimodal benchmarks, VRCD reduces visual redundancy and remaining-position entropy with modest runtime overhead. In longer decoding experiments, it also achieves relative accuracy gains of up to 18.8% on M^3CoT and 6.9% on MMBench over confidence-based decoding. Code will be released at this https URL.

[LG-23] On Reliability of Efficient Membership Inference Vulnerability Evaluation

Link: https://arxiv.org/abs/2605.25819
Authors: Joonas Jälkö, Gauri Pradhan, Ossi Räisä, Antti Honkela
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: 14 pages, 10 figures

Click to view abstract

Abstract:Membership inference attacks (MIAs) are popular methods for empirically assessing the leakage of sensitive information in the training data through models or statistics learned from the data. The MIA vulnerability is often evaluated through false positive rate (FPR) and true positive rate (TPR) of a binary classifier that tries to predict whether a particular sample was in the training data. However, in order to reliably estimate the TPR especially for low FPR values, a lot of observations are needed, which in case of MIA translates to many target models, leading to large computational cost. To avoid excessive compute requirements, the MIA scores are often averaged over multiple individuals and multiple targeted models. We demonstrate two key weaknesses in this efficient MIA evaluation pipeline. First, we show that evaluating the TPR based on MIA scores concatenated across multiple individuals, commonly used to study vulnerabilities in the very low FPR regime, is not calibrated across the per-sample FPRs. This makes it unreliable as a tool for auditing differential privacy. To solve this, we propose a post-processing method to effectively calibrate the FPR across different samples. Second, we identify a finite population bias in the commonly used efficient likelihood-ratio attack (LiRA) implementation proposed by Carlini et al. 2022, leading to a positive bias in the per-sample vulnerability.
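
The TPR-at-fixed-FPR metric the abstract refers to can be sketched as below. This is a generic illustration with hypothetical scores, not the paper's calibration method or its LiRA analysis: the threshold is set from non-member scores so that at most the target fraction of non-members is flagged, and the TPR is then read off the member scores.

```python
def tpr_at_fpr(member_scores, nonmember_scores, target_fpr):
    """TPR of the rule `score > threshold`, with the threshold chosen so that
    at most target_fpr of the non-members are flagged as members."""
    ranked = sorted(nonmember_scores, reverse=True)
    allowed_false_positives = int(target_fpr * len(ranked))
    if allowed_false_positives < len(ranked):
        # Flag only scores strictly above the (k + 1)-th largest non-member score.
        threshold = ranked[allowed_false_positives]
    else:
        threshold = float("-inf")
    true_positives = sum(score > threshold for score in member_scores)
    return true_positives / len(member_scores)


# Hypothetical attack scores (higher means "more likely a training member").
nonmembers = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.90]
members = [0.20, 0.50, 0.70, 0.95]
tpr_10 = tpr_at_fpr(members, nonmembers, target_fpr=0.1)
tpr_0 = tpr_at_fpr(members, nonmembers, target_fpr=0.0)
```

With n non-member scores, the smallest nonzero FPR that can be resolved is 1/n, which is why the low-FPR regime needs many observations.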

[LG-24] Invariant-Based Weight Sharing for Message Passing

Link: https://arxiv.org/abs/2605.25750
Authors: Florian Seiffarth
Categories: Machine Learning (cs.LG)
Comments: 13 pages main paper + 30 pages references and appendix

Click to view abstract

Abstract:Message-passing neural networks (MPNNs) are a powerful framework for learning representations of graph-structured domains. However, weights in MPNNs act on features only, limiting their ability to capture structural patterns. We introduce a novel structure-aware weight sharing principle that explicitly incorporates information inherent to the graph structure. Weights are indexed directly by user-chosen graph invariants, i.e., functions preserved under node permutations, enabling systematic reuse across structurally equivalent subgraphs. We present ShareGNNs, which instantiate this principle within a simple encoder-decoder architecture, resulting in an MPNN with learnable adjacency and transformer-like connectivity. We show that their expressivity is at least as strong as the discriminative power of the chosen invariants, providing explicit control over the model complexity. Experiments on synthetic and real-world data, as well as subgraph counting tasks, demonstrate consistent improvements over standard MPNNs, competitive expressivity beyond the 1-WL test, and scalability to large datasets.

[LG-25] Latent Representation Alignment for Offline Goal-Conditioned Reinforcement Learning ICML2026

Link: https://arxiv.org/abs/2605.25740
Authors: Hyungkyu Kang, Byeongchan Kim, Min-hwan Oh
Categories: Machine Learning (cs.LG)
Comments: Accepted in ICML 2026

Click to view abstract

Abstract:Offline goal-conditioned reinforcement learning (GCRL) provides a practical framework for obtaining goal-reaching policies from fixed datasets. However, learning a reliable goal-conditioned value function in long-horizon tasks remains challenging. In this paper, we identify erroneous generalization in goal-conditioned value functions as a fundamental bottleneck, and demonstrate that appropriate inductive bias in the value function is crucial for addressing the bottleneck. Building on these findings, we propose Latent-Aligned Value Learning (LAVL), an offline GCRL algorithm that integrates latent-representation-based value generalization with hierarchical planning in a unified framework. Extensive experiments on OGBench demonstrate that LAVL consistently outperforms existing offline GCRL methods, achieving the highest performance on 20 out of 22 datasets. Notably, LAVL exhibits strong performance in long-horizon tasks and trajectory stitching datasets, where prior methods suffer significant performance degradation. Our code is available at this https URL.

[LG-26] The Behavioral Credibility Trilemma: When Calibrated Autonomy Becomes Impossible

Link: https://arxiv.org/abs/2605.25739
Authors: Lauri Lovén, Nam Do, Hassan Mehmood, Dinesh Kumar Sah, Sasu Tarkoma
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Machine Learning (stat.ML)
Comments: 48 pages, 3 figures

Click to view abstract

Abstract:We prove that no reinforcement learning policy with confidence-gated autonomy can simultaneously achieve maximum helpfulness, optimal calibration, and full autonomy under rational oversight, whenever some tasks exceed the agent’s reliable competence: the Behavioral Credibility Trilemma. The impossibility is geometric – adding any non-affine autonomy incentive to a strictly proper scoring rule destroys strict properness, so an agent rewarded for both calibrated confidence and autonomous action systematically inflates its reported confidence on tasks below the principal’s approval threshold. The Behavioral Perturbation Lemma quantifies the inflation (scaling as $w_A/(2 w_C)$ for the Brier score) and shows detection requires $\Omega(1/\Delta^2)$ observations. We prove the principal’s optimal oversight rule is necessarily non-affine, making the impossibility unconditional and optimizer-independent across log-concave-density policy families. We formalize the Confidence-Gated Decision Problem, map existing methods onto the trilemma, and identify two constructive resolution pathways (commitment, domain separation). A 540-configuration Best-of-N experiment tests five pre-registered hypotheses, all strongly confirmed (effect sizes $d = 1.10$ to $5.32$), and adds a descriptive analysis of the achievable-$(H, C, A)$ surface geometry showing a plateau-truncated frontier consistent with the predicted inflation saturation.

[LG-27] Evaluating passing decision-making in professional football: An enhanced MPNN approach to Receiver Selection

Link: https://arxiv.org/abs/2605.25696
Authors: Gabriel Masella, Giuseppe Alessio D’Inverno, Max Goldsmith, Gianluigi Rozza
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The process of decision-making in football is characterized by a complex interplay between spatial positioning, opponent pressure, and player intent. This work introduces a Graph Neural Network (GNN) framework designed to predict Receiver Selection, the optimal passing target, by modeling on-field interactions as dynamic graphs. Each player is represented as a node with positional and contextual features, while potential passing lines form weighted edges characterized by distance, angle, and pressure metrics. A Message-Passing Neural Network (MPNN) has been developed and trained using a combination of tracking data and event data from professional matches, synchronized through a robust pipeline based on an optimized version of the Needleman-Wunsch Algorithm. The model achieves competitive accuracy in identifying the actual chosen receiver and state-of-the-art accuracy within its top three suggestions. Our model further offers quantification of each option’s likelihood, threat, and creativity, enabling performance analysts to evaluate over 1,000 passes in seconds.

[LG-28] Stochastic Estimation of the Layer-wise Hessian Trace for Monitoring Neural-network Training

Link: https://arxiv.org/abs/2605.25674
Authors: Maxim Bolshim(1), Alexander Kugaevskikh(1) ((1) ITMO University, Saint Petersburg, Russia)
Categories: Machine Learning (cs.LG)
Comments: 9 pages, 1 table

Click to view abstract

Abstract:The loss and the norm of its gradient separate the healthy and the pathological regimes of neural-network training only weakly, whilst the curvature of the empirical risk differs qualitatively between them but is inaccessible explicitly at parameter counts $P \sim 10^6$-$10^8$. We present a stochastic estimator of the trace of the diagonal blocks of the Hessian matrix of the empirical risk of a neural network. The procedure combines the Hutchinson stochastic trace estimator with a single Hessian-vector product over the whole parameter vector and recovers unbiased estimates of every per-layer trace in one backward pass through the computational graph. We show that correctness under weight sharing requires the layer-wise Hessian to be assembled before the second differentiation: unrolling shared weights into independent coordinates introduces a systematic bias whose sign and magnitude are governed by the cross-instance blocks of the unrolled Hessian. A closed-form expression for the variance of the estimator at a fixed Hessian is derived, together with a decomposition of the total variance under the mini-batch sampling distribution. This decomposition yields a critical probe count $K^\star$ that balances the two sources of randomness and supports the practical recommendation $K \in [5, 10]$ in the on-line monitoring regime. The estimator is applied to the detection of the label-memorisation regime of ResNet-18, ResNet-34, and VGG-11 on CIFAR-10 and CIFAR-100, where a calibrated cumulative-sum decision rule attains an empirical detection power of 179/180 at a false-alarm rate of 16/120.
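
The idea of reading every per-block trace off one full-vector probe can be sketched as follows. This is a toy illustration, not the authors' implementation: an explicit symmetric matrix stands in for the Hessian, a dense `matvec` stands in for an autodiff Hessian-vector product, and the function names and block layout are hypothetical. It also does not cover the weight-sharing correction the abstract discusses.

```python
import random


def matvec(matrix, vector):
    """Dense matrix-vector product; stands in for a Hessian-vector product."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, vector)) for row in matrix]


def blockwise_hutchinson(matrix, blocks, num_probes, seed=0):
    """Estimate the trace of each diagonal block from full-vector probes."""
    rng = random.Random(seed)
    dim = len(matrix)
    totals = [0.0] * len(blocks)
    for _ in range(num_probes):
        # Rademacher probe over the whole parameter vector.
        z = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        hz = matvec(matrix, z)  # one product per probe, shared by all blocks
        for b, (start, stop) in enumerate(blocks):
            # z_b . (H z)_b is unbiased for tr(H_bb): the cross terms
            # z_i * z_j with i != j have zero mean.
            totals[b] += sum(z[i] * hz[i] for i in range(start, stop))
    return [total / num_probes for total in totals]


# Toy symmetric "Hessian" with two 2x2 "layers".
H = [
    [2.0, 0.5, 0.1, 0.0],
    [0.5, 1.0, 0.0, 0.2],
    [0.1, 0.0, 3.0, 0.4],
    [0.0, 0.2, 0.4, 1.5],
]
layer_blocks = [(0, 2), (2, 4)]
estimates = blockwise_hutchinson(H, layer_blocks, num_probes=4000)
```

The exact block traces in this toy case are 3.0 and 4.5; the estimates fluctuate around them, with noise driven by the off-diagonal entries.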

[LG-29] Closed-Form Node Classification with Exact Graph Unlearning

Link: https://arxiv.org/abs/2605.25662
Authors: Aditya Gaur, Charu Sharma
Categories: Machine Learning (cs.LG)
Comments: 19 pages, 5 figures, 12 tables (7 main + 5 appendix)

Click to view abstract

Abstract:Graph neural networks for node classification are typically trained by gradient descent over hundreds or thousands of epochs. Recent work has shown that, when properly tuned, classic GCN/SAGE/GAT architectures can match graph transformers on many node-classification benchmarks. We ask a complementary question: how much of this performance can be recovered by deterministic closed-form solvers, and what guarantees does this enable? We introduce a routed closed-form framework selected by adjusted homophily. For assortative graphs, we use SGC-style propagation followed by Ridge regression; for heterophilous graphs, we introduce LCF-Net, a layer-wise closed-form graph feature-refinement network whose per-layer Ridge solves are capped by a Gaussian kernel-Ridge head. Across 14 benchmarks, including ogbn-arxiv and ogbn-proteins, our closed-form predictors match or beat the best vanilla 2-layer GCN/SAGE/GAT on 9 of 9 measured datasets, tie tuned deep recipes within one standard deviation on 9 of 12 small benchmarks, and exceed the OGB-leaderboard plain GCN on both large graphs. The remaining heterophilous gap closely tracks the gain from vanilla 2-layer to deep SAGE, suggesting that the residual difference is primarily architectural. Because our predictors are explicit solutions of deterministic linear systems, modified graph inputs can be re-solved to obtain retrain-equivalent parameters. We formalize exact graph-object unlearning for label, feature, edge, node, and subgraph modifications, prove K-hop locality for Ridge components, and verify exactness across 109 configurations. On ogbn-arxiv, localized updates give 21-45$\times$ speedups over full re-solving and roughly $10^6\times$ speedups over gradient retraining. Structural-inversion experiments further quantify the privacy floor of exact retraining and the additional leakage of approximate graph-unlearning methods.

[LG-30] Analogies between Transformer Layers and Power Method

Link: https://arxiv.org/abs/2605.25619
Authors: Chenglong Li, Claudio Altafini
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:In the paper we show that there is an analogy between the operations occurring in a layer of a transformer (projections and layer normalizations, disregarding the feedforward neural network) and a step in the power method. Coherently with this analogy, we show that passing through a layer the tokens tend to be tilted towards the principal eigenvector of a matrix which is the product of the output and value weight matrices of that layer. In the special case of a transformer with shared weights (i.e., in which all layers have identical weights) then the alignment with this principal eigenvector is particularly evident empirically, and can also be shown analytically. The analogy also suggests a method to steer the output of the transformer towards an arbitrary desired direction in token space.
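
As a reminder of what a power-method step looks like, here is a minimal sketch of plain power iteration: multiply by a fixed matrix, then renormalize. In the abstract's analogy the multiplication plays the role of the projections and the renormalization the role of layer normalization; the 2x2 matrix below is a hypothetical stand-in, not a product of actual output and value weight matrices.

```python
import math


def normalize(vector):
    """Rescale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def power_iteration(matrix, start, steps):
    """Repeat 'multiply, then normalize'; the iterate tilts towards the
    principal eigenvector of the matrix."""
    v = normalize(start)
    for _ in range(steps):
        w = [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in matrix]
        v = normalize(w)
    return v


# Hypothetical symmetric matrix with eigenvalues 3 and 1.
A = [[2.0, 1.0], [1.0, 2.0]]
v = power_iteration(A, start=[1.0, 0.0], steps=60)
```

Here the iterate converges to the direction (1, 1)/sqrt(2), the eigenvector for the largest eigenvalue; the component along the other eigenvector shrinks by a factor of 1/3 per step.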

[LG-31] Courtroom Analogy: New Perspective on Uncertainty-Aware Classification ICML2026

Link: https://arxiv.org/abs/2605.25616
Authors: Taeseong Yoon, Heeyoung Kim
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: ICML 2026

Click to view abstract

Abstract:Single-pass uncertainty quantification (UQ) methods for classification represent uncertainty by predicting a tractable distribution over the class probability vector. While existing approaches primarily focus on enhancing the expressiveness of this distribution, they often provide limited insight into how predictive uncertainty is structured and aggregated, resulting in weak interpretability. We introduce the courtroom analogy, which conceptualizes uncertainty-aware classification as a structured debate among class-specific advocates. Each advocate forms a probabilistic opinion, and a final verdict is reached by aggregating these opinions using input-dependent plausibility weights. In this framework, each advocate’s opinion is modeled as a Dirichlet distribution whose concentration parameter is decomposed into shared evidence and class-specific advocacy. This yields a structured mixture of Dirichlet distributions with semantically interpretable parameters. To instantiate this formulation, we propose Mixture of Dirichlet EXperts (MoDEX), a single-pass neural architecture that predicts the courtroom parameters, enabling efficient and expressive UQ while explicitly modeling uncertainty aggregation. We demonstrate that MoDEX enjoys strong theoretical properties and achieves state-of-the-art UQ performance across diverse benchmarks, yielding interpretable uncertainty estimates with meaningful semantics.

[LG-32] Learning Latent Dynamical Causal Processes for Single-Cell Perturbation Prediction KDD2026

Link: https://arxiv.org/abs/2605.25581
Authors: Wenkang Jiang, Yuhang Liu, Erdun Gao, Ehsan Abbasnejad, Lina Yao, Javen Qinfeng Shi
Categories: Machine Learning (cs.LG)
Comments: Accepted to SIGKDD 2026 AI4Science Track

Click to view abstract

Abstract:Single-cell perturbation prediction aims to infer how cells respond to unseen interventions and to achieve out-of-distribution (OOD) generalization, providing a computational route to understanding how perturbations reshape cellular programs over time. Existing machine learning methods have made important progress, but typically capture only one side of the response. Latent causal approaches seek mechanisms that support generalization and interpretation, yet often treat perturbation effects as static outcomes. Temporal models describe how gene expression changes across time, but usually do not explicitly recover the latent causal generative mechanisms driving these changes. In practice, perturbation effects are both latent and dynamical: interventions act through unobserved cellular programs, whose states evolve over time and give rise to observed expression profiles. Motivated by this view, we propose a latent dynamical causal generative model for single-cell perturbation data that jointly captures latent cellular programs, perturbation-conditioned mechanisms, and temporal evolution. We further provide an identifiability analysis showing that, under suitable conditions, the latent causal variables are recoverable up to standard equivalence classes. Guided by this analysis, we develop CITE-VAE, a learning framework for recovering latent cellular programs and their perturbation-driven dynamics from single-cell sequencing data. Experiments on Causal-3DIdent validate the theoretical results and the effectiveness of the proposed method in controlled settings. Additional experiments on real-world CRISPR-based single-cell perturbation data show improved generalization to unseen perturbations compared with state-of-the-art baselines, highlighting the practical robustness of our approach.

[LG-33] Learning Permutation from Structure Without Supervision

Link: https://arxiv.org/abs/2605.25551
Authors: Ran Eisenberg, Ofir Lindenbaum
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Many learning problems require uncovering a hidden ordering that reveals structure in unordered data, such as monotonicity in sorting or spatial continuity in jigsaw reconstruction. In these settings, permutations can be learned as latent operators by optimizing objectives defined directly on the reordered output, often without access to ground-truth orderings. Differentiable relaxations such as Gumbel-Sinkhorn make this approach practical by approximating permutation matrices with doubly stochastic matrices. However, learning from structure without supervision induces a non-uniform uncertainty: some assignments become confident early, while others remain ambiguous. Existing methods control this process using a single global temperature, forcing all assignments to sharpen or diffuse simultaneously and leading to instability at scale. We introduce an entropy-adaptive formulation of Gumbel-Sinkhorn that locally modulates temperature based on assignment uncertainty. This allows confident assignments to discretize early while preserving exploration where uncertainty remains. Across sorting and jigsaw reconstruction tasks and in routing-style settings, adaptive entropy control improves training stability and final permutation quality relative to fixed-temperature baselines, particularly as problem size and assignment ambiguity increase.
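
The fixed-temperature baseline the abstract contrasts against can be sketched as below: scores are (optionally) perturbed with Gumbel noise, divided by one global temperature, exponentiated, and then alternately row- and column-normalized. This is a generic Gumbel-Sinkhorn sketch, not the paper's entropy-adaptive variant; the function name, parameters, and score matrix are hypothetical.

```python
import math
import random


def gumbel_sinkhorn(scores, temperature, num_iters=50, noise_scale=0.0, seed=0):
    """Relax a permutation into a (near) doubly stochastic matrix using a
    single global temperature."""
    rng = random.Random(seed)
    n = len(scores)

    def gumbel():
        # Clamp away from 0 and 1 so both logarithms are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        return -math.log(-math.log(u))

    logits = [
        [(scores[i][j] + noise_scale * gumbel()) / temperature for j in range(n)]
        for i in range(n)
    ]
    top = max(max(row) for row in logits)  # subtract the max for stability
    p = [[math.exp(x - top) for x in row] for row in logits]
    for _ in range(num_iters):
        p = [[x / sum(row) for x in row] for row in p]  # row normalization
        col_sums = [sum(p[i][j] for i in range(n)) for j in range(n)]
        p = [[p[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return p


# Hypothetical assignment scores; noise is disabled to keep the demo deterministic.
S = [[2.0, 0.0, 1.0], [0.0, 3.0, 1.0], [1.0, 0.0, 2.0]]
soft = gumbel_sinkhorn(S, temperature=1.0, num_iters=200)
sharp = gumbel_sinkhorn(S, temperature=0.1, num_iters=200)
```

Lowering the temperature sharpens every row at once, which is the global coupling that the abstract's per-assignment temperature is meant to relax.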

[LG-34] A Multimodal Framework for Dementia Detection via Linguistic and Acoustic Representation Learning

Link: https://arxiv.org/abs/2605.25540
Authors: Loukas Ilias, Dimitris Askounis
Categories: Sound (cs.SD); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Alzheimer’s disease (AD) is a progressive neurodegenerative disorder and the leading cause of dementia, affecting memory, reasoning, communication, and daily functioning. Early diagnosis is particularly important, as timely intervention may help slow cognitive decline and improve patient care. Recent studies have demonstrated that spontaneous speech contains valuable linguistic and acoustic biomarkers associated with dementia. However, existing approaches often rely on independently trained modality-specific models, feature concatenation strategies, ensemble methods, or attention-based fusion mechanisms that do not explicitly maximize the dependency between speech and transcript representations. In this work, we propose a multimodal deep learning framework for automatic dementia detection that jointly exploits speech and transcript information in an end-to-end trainable manner. Specifically, speech recordings are divided into 10-second segments and passed through a pre-trained HuBERT model to extract contextualized acoustic representations. To better capture informative temporal speech characteristics, attentive statistics pooling is employed to aggregate frame-level acoustic embeddings. For the textual modality, transcripts are encoded using a pre-trained BERT model, where the [CLS] token representation is used as the linguistic embedding. The acoustic and textual representations are subsequently combined using an attention-based Audio-Text Fusion (AT-Fusion) mechanism. In addition, we introduce a MINE objective to maximize the mutual information between modalities and improve multimodal representation alignment. The fused multimodal representation is finally used for dementia classification. Experiments conducted on the publicly available ADReSS Challenge and PROCESS-2 dataset demonstrate the effectiveness and robustness of the proposed approach for speech-based dementia assessment.

[LG-35] DeepSeek Math Meets Order Book: Group-Aware Policy Optimization for High-Frequency Directional Trading

Link: https://arxiv.org/abs/2605.25527
Authors: Sayak Charabarty, Souradip Pal
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Comments: 9 pages, 3 figures

Click to view abstract

Abstract:This paper studies reinforcement learning for high-frequency trading on limit order books by pairing an Order-Flow-based state model with policy-gradient methods. Instead of value-based RL techniques like tabular Q-learning, our approach deploys policy-based methods like vanilla PPO and DeepSeekMath-inspired variants like GRPO and GSPO, that use group-normalized updates and downside-aware shaping. On backtests with financial assets AMZN, AAPL, and GOOG under a simplified backtesting setup based on spread-scaled rewards, these new policies improve net average PnL, profitability, and drawdown over the Q-Learning baseline. Our results show that (1) Order-Flow signals are an adequate state for policy RL and (2) group-aware PPO surrogates are preferable over value-based baselines.
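
The "group-normalized updates" mentioned above can be illustrated with the advantage computation commonly described for GRPO: rewards of several rollouts sampled for the same state are standardized within the group. This is a minimal sketch under that assumption; the reward values are hypothetical, and the paper's downside-aware shaping and spread-scaled rewards are not modelled.

```python
import statistics


def group_normalized_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts:
    (reward - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical per-rollout rewards for one order-book state.
group_rewards = [1.0, -0.5, 0.25, 0.25]
advantages = group_normalized_advantages(group_rewards)
```

Because the baseline is the group mean, no learned value function is needed, and a group whose rollouts all earn the same reward contributes zero advantage.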

[LG-36] SAE-FD: Sparse Autoencoder Feature Distillation for Continual Learning of Large Language Models

Link: https://arxiv.org/abs/2605.25525
Authors: Mingxu Zhang, Yuhan Li, Lujundong Li, Dazhong Shen, Hui Xiong, Ying Sun
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Continual learning enables large language models to adapt to evolving tasks without retraining from scratch, yet catastrophic forgetting remains a central obstacle. Among continual learning methods, regularization-based approaches are widely used to constrain model updates and reduce forgetting, operating in weight space, gradient space, or output space. However, these dense representation spaces suffer from feature superposition, where multiple concepts are encoded in overlapping dimensions, making it difficult to selectively protect previously learned knowledge without impeding new-task learning. To address this issue, we propose SAE-FD (Sparse Autoencoder Feature Distillation), which anchors model representations in the sparse feature space of a pre-trained Sparse Autoencoder, where dense activations are decomposed into a sparse overcomplete basis that reduces representational entanglement, enabling more targeted regularization with less interference to new-task learning. Experiments on two continual learning benchmarks across three model architectures show that SAE-FD consistently outperforms existing regularization-based methods, achieving up to 52.70% average accuracy with only -0.46 backward transfer.

[LG-37] Relative Repairability: A Calibration-Based Diagnostic for High-Sparsity Post-Pruning Allocation

Link: https://arxiv.org/abs/2605.25508
Authors: Qishi Zhan, Liang He, Minxuan Hu, Ziheng Chen
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:At very high sparsity, neural network pruning does more than decide which weights remain. It also determines where pruning induced damage is placed across the network, and whether that damage can be recovered by a fixed lightweight repair procedure. We study this problem through the lens of repair conditioned sparsity allocation. We introduce Relative Repairability (RR), a calibration based diagnostic that compares the raw activation distortion caused by layerwise pruning with the residual distortion left after channelwise variance matching repair. RR estimates the fraction of local damage that remains after repair, using only unlabeled calibration data. Across ResNet18, ResNet34, and VGG16 BN on CIFAR10 and CIFAR100, we find that RR is not a universally dominant allocation rule. Instead, it is most useful near an architecture dependent recoverability transition, where standard structural or magnitude based allocation priors begin to lose reliability but post repair recovery has not yet fully collapsed. On CIFAR100 ResNet18, a fine grained sweep shows that RR improves over ERK across the central transition band and surpasses LAMP near the upper part of this band. A projection forced ablation further shows that capped ERK can over protect projection layers, shifting excessive sparsity onto regular convolutions and reducing post repair recovery. These results suggest that high sparsity pruning should allocate not only retained weights, but also repairable damage.

[LG-38] Accelerated Dynamic Importance Weighting with Versatile Divergence-Minimizing Estimators

Link: https://arxiv.org/abs/2605.25499
Authors: Tongtong Fang, Nan Lu, Gang Niu, Kenji Fukumizu, Masashi Sugiyama
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Importance weighting (IW) is a golden solver for joint distribution shift, where the joint distributions differ between the training and test data. To solve this problem, IW estimates test-to-training density ratios as importance weights and reweights the training losses accordingly. Recent advances in dynamic IW (DIW) integrate weight estimation into model training, enabling scalable IW for deep models and achieving strong performance on large modern datasets. Despite its promise, DIW remains limited in two aspects. First, it incurs substantial computational overhead by solving a kernel mean matching (KMM)-induced optimization problem to convergence in every mini-batch. Second, it relies solely on KMM for weight estimation, whereas the IW literature contains diverse estimation methods based on different divergence measures. In this paper, we propose accelerated DIW (ADIW), a unified and efficient IW framework for deep learning under joint distribution shift. ADIW performs a few lightweight projected gradient descent updates that warm-start from previously updated weights, substantially improving efficiency. Moreover, ADIW generalizes DIW into a unified divergence-minimization framework that supports diverse weight-estimation methods in a plug-and-play manner, including those based on the Kullback-Leibler divergence, squared distance, and Wasserstein-1 distance. We establish convergence guarantees for ADIW under mild conditions, and empirical results demonstrate that ADIW achieves state-of-the-art IW performance while being substantially more efficient.

[LG-39] SafetyRepro: Configuration-Conditional Rank Instability on Alignment Benchmarks

Link: https://arxiv.org/abs/2605.25492
Authors: Yanhang Li, Zhichao Fan, Zexin Zhuang
Categories: Machine Learning (cs.LG)

Abstract:Pairwise model comparisons drawn from foundation-model benchmarks (“A is safer than B”) are read as quantitative verdicts but hinge on harness choices benchmark papers under-specify. We close one theory-benchmark loop on this primitive: a finite-envelope proposition tying a measurable pairwise-disagreement rate to whether the strict ordering admits a configuration-pair reversal, paired with a commit-stamped evaluation protocol that operationalises it on widely cited alignment benchmarks. On every benchmark we test, configuration choice alone can flip the pairwise verdict; the proposition isolates this strict-reversal failure mode.

[LG-40] JacQuant: STE-Free Quantization-Aware Training via Learned Jacobian Surrogates

Link: https://arxiv.org/abs/2605.25469
Authors: Kai Yi, Vignesh Vivekraja, Harshit Khaitan, Steven Li
Categories: Machine Learning (cs.LG)

Abstract:Quantization-aware training (QAT) is widely deployed but typically relies on the Straight-Through Estimator (STE), which passes gradients through non-differentiable quantizers by fiat. This often makes training brittle near bin boundaries and weakly aligned with the actual behavior of the low-precision model. We introduce JacQuant, a QAT framework that learns a lightweight surrogate of the model’s local sensitivity to parameter changes and uses it to stabilize and accelerate training within standard variance-reduced optimizers. The surrogate is inexpensive (diagonal or block-diagonal), data-driven, and compatible with common weight and activation quantizers. On code-preserving training phases, we prove convergence for non-convex objectives and obtain linear rates under a PL condition, and we relate the learned sensitivity to end-to-end output fidelity via a simple calibration argument. Across LLM benchmarks at $\leq 2$ bits, JacQuant consistently reaches higher accuracy than STE-based QAT, and the runtime analyses on various models show that the added cost remains negligible under practical group sizes. The method is drop-in and requires no changes to the forward quantizers; our empirical claims are scoped to ultra-low-bit LLM QAT.
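
For readers unfamiliar with the baseline being replaced, the Straight-Through Estimator can be sketched on a single scalar weight as below. This shows the STE only, not JacQuant's learned surrogate; the uniform quantizer, the squared-error loss, and the names `quantize` and `ste_grad` are assumptions chosen for illustration.

```python
def quantize(w, step=0.5):
    """Uniform quantizer: snap w to the nearest multiple of `step`.

    Python's round() sends exact ties to the even integer.
    """
    return round(w / step) * step


def ste_grad(w, x, y, step=0.5):
    """Gradient of (quantize(w) * x - y)**2 with respect to w under the STE.

    The true derivative of quantize() is zero almost everywhere, so the STE
    simply pretends d quantize / dw = 1 in the backward pass.
    """
    residual = quantize(w, step) * x - y
    return 2.0 * residual * x


# One SGD step on a toy example: the forward pass uses the quantized weight,
# while the update is applied to the full-precision weight.
w, x, y = 0.8, 1.0, 2.0
grad = ste_grad(w, x, y)
w_new = w - 0.1 * grad
```

Near a bin boundary, a small change in `w` flips `quantize(w)` by a full step while the STE gradient changes only through the residual, which is one way to see the brittleness the abstract mentions.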

[LG-41] BigMac: Breaking the Pareto Frontier of Compute and Memory in Multimodal LLM Training

Link: https://arxiv.org/abs/2605.25451
Authors: Zili Zhang, Chengxu Yang, Shenglong Zhang, Chenyu Wang, Yufan Zhang, Tuo Dai, Zhouyang Li, Yuhong Ge, Chao Jin, Xin Jin, Yuliang Liu
Categories: Machine Learning (cs.LG)

Abstract:Training multimodal large language models (MLLMs) is challenged by both model and data heterogeneity. Existing systems redesign the training pipeline to address these challenges, but remain bound by a Pareto frontier between compute and memory efficiency, improving one only at the expense of the other. We present BigMac, a new training pipeline for multimodal LLMs. The core idea of BigMac is to elegantly nest the encoder and generator computation into the original LLM pipeline, forming a dependency-safe nested pipeline structure. With this design, BigMac reduces the activation memory complexity of the encoder and generator to O(1) while keeping the activation memory complexity of the LLM unchanged. At the same time, it achieves the same computational efficiency as the idealized setting with unlimited memory. As a result, BigMac breaks the Pareto frontier between computational efficiency and memory usage, enabling simultaneous optimization of both computation and memory in MLLM training. We evaluate BigMac on multiple MLLMs and training workloads. Experimental results show that BigMac achieves a $1.08\times$ to $1.9\times$ training speedup over baseline systems while maintaining stable memory usage as batch size increases.

[LG-42] Missing Pattern Recognized Diffusion Imputation Model for Missing Not At Random

Link: https://arxiv.org/abs/2605.25439
Authors: Gyuwon Sim, Sumin Lee, Heesun Bae, Byeonghu Na, Doyun Kwon, Ju-Hee Hwang, Jae-Young Lim, Il-Chul Moon
Categories: Machine Learning (cs.LG)

Abstract:Missing data frequently arises across diverse domains, including time-series and image domains. In the real world, missing occurrences often depend on the unobservable values themselves, a mechanism referred to as Missing Not at Random (MNAR). In this work, we introduce the Missing Pattern Recognized Diffusion Imputation Model (PRDIM), a novel framework that explicitly captures the missing pattern and precisely imputes unobserved values. PRDIM iteratively maximizes the likelihood of the joint distribution for observed values and missing mask under an Expectation-Maximization (EM) algorithm. In this sense, we first employ a pattern recognizer, which approximates the underlying missing pattern and provides guidance during every inference toward more plausible imputations with respect to the missing information. Through extensive experiments, we demonstrate that PRDIM consistently achieves strong imputation performance under MNAR settings across multiple data modalities.

[LG-43] Rethinking Feature Alignment in Generalist Graph Anomaly Detection: A Relational Fingerprint-based Approach ICML2026

Link: https://arxiv.org/abs/2605.25429
Authors: Yujing Liu, Yixin Liu, Yu Zheng, Alan Wee-Chung Liew, Xiaofeng Cao, Shirui Pan
Categories: Machine Learning (cs.LG)
Comments: 9 pages, 7 figures. Accepted by ICML 2026

Abstract:Generalist graph anomaly detection (GAD) aims to detect anomalies on unseen graphs without graph-specific retraining. Nevertheless, existing approaches primarily focus on aligning heterogeneous features across different data domains via PCA-based projection, which harmonizes feature dimensions but ignores feature semantics. As a result, GAD models fail to learn transferable semantic knowledge, and even exhibit negative transfer on unseen graphs. To address this issue, we propose a Relational Fingerprint-based generalist GAD approach (ReFi-GAD for short), aligning heterogeneous raw features with a universal and semantics-aware Relational Fingerprint (ReFi) that encodes anomaly-indicative cues from both contextual and structural perspectives. Building on ReFi, we design a fingerprint-grounded generalist GAD model, which combines a transformer-based encoder to capture domain-invariant knowledge with an SNR-guided refinement module for domain-specific adaptation. Extensive experiments on 14 datasets demonstrate that ReFi-GAD significantly outperforms state-of-the-art methods.

[LG-44] Capture-Calibrate-Coach: A Graph-Based Framework for Knowledge Monitoring Estimation and Adaptive Feedback

Link: https://arxiv.org/abs/2605.25419
Authors: Gen Li, Li Chen, Cheng Tang, Boxuan Ma, Yuncheng Jiang, Daisuke Deguchi, Takayoshi Yamashita, Atsushi Shimada
Categories: Machine Learning (cs.LG)
Comments: To be published in Proceedings of the 27th International Conference on Artificial Intelligence in Education (AIED 2026)

Abstract:Effective learning support requires understanding not only what learners know but also how accurately they perceive their own understanding. This metacognitive dimension, known as knowledge monitoring, fundamentally influences self-regulated learning, yet this dimension remains underexplored in current systems. This paper introduces the Capture-Calibrate-Coach (3C) framework for adaptive learning support. The Capture phase extracts learners’ perceived knowledge states from open-ended self-reports to construct a heterogeneous graph linking learners and knowledge concepts. The Calibrate phase applies a heterogeneous graph neural network to infer latent perceived states for concepts not explicitly mentioned, enabling systematic knowledge monitoring assessment. The Coach phase classifies learners into five metacognitive patterns and delivers personalized feedback addressing both knowledge gaps and calibration errors. Evaluation with 684 students demonstrates 85.21% AUC in predicting latent perceived states, significantly outperforming baseline methods. A user study with 47 participants shows positive reception of feedback quality, with participants particularly valuing concrete feedback on knowledge gaps and actionable study guidance. These findings advance AI-based learning support toward metacognitive teammates that foster accurate self-awareness while supporting knowledge growth.

[LG-45] EMA-Nesterov: Stabilizing Nesterov’s Lookahead for Accelerated Deep Learning Optimization

Link: https://arxiv.org/abs/2605.25395
Authors: Chung-Yiu Yau, Dawei Li, Athanasios Glentis, Valentyn Boreiko, Hoi-To Wai, Mingyi Hong
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 25 pages, 10 figures

Abstract:Lookahead-based acceleration methods, such as Nesterov’s momentum, are widely used in optimization, but they often become unreliable in deep learning training mainly due to stochastic gradient noise and non-convex loss landscapes. In particular, standard lookahead relies on short-horizon update signals (e.g., differences between consecutive iterates), which are inherently noisy and can lead to unstable extrapolation directions. This work revisits Nesterov’s acceleration from a trajectory perspective and argues that effective acceleration in deep learning should harness the low-frequency trends of optimization trajectories rather than extrapolating noisy one-step updates. Leveraging this insight, we propose EMA-Nesterov, a simple modification that replaces the standard Nesterov’s lookahead direction with an exponential moving average (EMA) of parameter updates. This yields a stabilized lookahead direction that captures and harnesses the evolving trend of the training trajectory through a low-pass filter, while remaining adaptive to progressive changes via the geometric weighting structure of EMA. We show that EMA-Nesterov retains a theoretical accelerated convergence rate in convex problems that is analogous to Nesterov’s accelerated gradient method. Furthermore, we provide empirical evidence on language model pre-training to verify that EMA-Nesterov is broadly applicable across a range of fine-tuned base optimizers, including Adam, SOAP, Muon, as well as complex optimizers that achieve state-of-the-art performance on optimization benchmarks (NanoGPT). Compared to prior lookahead methods, EMA-Nesterov achieves better performance by avoiding the instability of short-horizon lookahead and the non-adaptivity of long-horizon lookahead.
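
The core idea (extrapolating along an exponential moving average of parameter updates instead of the last one-step difference) can be sketched on a one-dimensional problem as below. The abstract does not give the exact update rule, so this is one plausible reading rather than the authors' algorithm; the function name `ema_lookahead_descent`, the hyperparameters `lr`, `beta`, `rho`, and the quadratic test objective are all assumptions.

```python
def ema_lookahead_descent(grad, x0, lr=0.1, beta=0.5, rho=0.9, steps=500):
    """Gradient descent with a lookahead along an EMA of past parameter updates.

    Standard Nesterov extrapolates along the last difference x_t - x_{t-1};
    here that difference is replaced by its exponential moving average, which
    acts as a low-pass filter on the update direction.
    """
    x, ema = x0, 0.0
    for _ in range(steps):
        lookahead = x + beta * ema                    # extrapolate along the smoothed trend
        x_next = lookahead - lr * grad(lookahead)     # gradient step taken at the lookahead point
        ema = rho * ema + (1.0 - rho) * (x_next - x)  # update the EMA of parameter updates
        x = x_next
    return x


# Toy objective f(x) = 0.5 * x**2, whose gradient is x and whose minimiser is 0.
x_final = ema_lookahead_descent(lambda x: x, x0=5.0)
```

Setting `rho = 0` recovers the ordinary one-step difference, so the sketch contains the standard lookahead as a special case.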

[LG-46] A Context Augmented Multi-Play Multi-Armed Bandit Algorithm for Fast Channel Allocation in Opportunistic Spectrum Access

Link: https://arxiv.org/abs/2605.25391
Authors: Ruiyu Li, Guangxia Li, Xiao Lu, Jichao Liu, Yan Jin
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: Accepted by ISCC’24

Abstract:We study the restless contextual multi-play multi-armed bandit (MP-MAB) problem for channel allocation in the opportunistic spectrum access (OSA) scenario. Most existing MP-MAB methods are impractical for real-world OSA systems as they assume many ideal conditions, incur a heavy computational cost, and most importantly, ignore the impact of channel noise, which is directly related to the quality of service. In this study, we embody this impact by modeling channel noise as a perturbation of the arm’s reward function in MP-MAB. As there is an implicit correlation between channel state information and channel noise, we take the former as a context for MP-MAB to represent the perturbation caused by the latter. We investigate two types of correlation between the context and the perturbation, linear and nonlinear, and derive two index policies, respectively. These policies learn the correlations through a linear model and a neural network, and use the estimated noise value to adjust the upper confidence bound. Numerical experiments demonstrate that the proposed policies can achieve lower regret and select sub-optimal arms in a more reasonable way.

[LG-47] ViroBench: Benchmarking Nucleotide Foundation Models on Viral Genomics Tasks

Link: https://arxiv.org/abs/2605.25388
Authors: Dongxin Ye, Fang Hu, Han Hu, Shu Hu, Yang Tan, Wanli Ouyang, Stan Z. Li, Jie Cui, Nanqing Dong
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 42 pages, 15 figures

Abstract:Nucleotide sequences constitute the fundamental genetic basis of biological systems, rendering viral genomic analysis critical for biomedical advancement. Despite progress in biological foundation models, specifically nucleotide foundation models (NFMs), the field lacks a unified standard for viral genomics to facilitate community development and enforce biosecurity constraints. To address this, we introduce ViroBench, the first comprehensive and large-scale benchmark specifically designed for NFMs in viral settings. ViroBench evaluates models across two critical dimensions: biological understanding and latent biosecurity risk, covering 18 diverse scenarios within 4 task types. Extensive evaluation of 66 NFMs across diverse architectures yields three critical conclusions. Firstly, NFMs exhibit a performance degradation in biological understanding under phylogenetic and temporal shifts, indicating weak extrapolation capabilities. Secondly, generation tasks reveal a decoupling between statistical likelihood and biological functional validity, posing latent biosecurity risks. Thirdly, controlled ablation studies reveal that taxonomic diversity in pretraining data outweighs parameter scale. Specifically, a lightweight baseline trained on diverse data achieves a 67.5% performance gain over its original model. Overall, ViroBench provides interpretable, diagnostic evaluations and a reproducible measurement framework for future research on viral nucleotide foundation models. The datasets and code are publicly available at this https URL.

[LG-48] Not only where But when: Temporal Scheduling for RLVR

Link: https://arxiv.org/abs/2605.25381
Authors: Jinghao Zhang, Ruilin Li, Feng Zhao, Jiaqi Wang
Categories: Machine Learning (cs.LG)
Comments: GitHub: this https URL

Abstract:Reinforcement learning with verifiable rewards (RLVR) has become a core technique for post-training of Large Language Models (LLMs). While policy optimization is driven by all sampled tokens under a globally broadcast scalar reward, the heterogeneous policy behaviors exhibited along trajectories are largely overlooked without differentiation. Existing works address this by credit allocation, including token-level advantage reweighting and selective token optimization; however, the allocation criteria are largely static throughout training, limiting resilient policy evolution. In this work, we argue that *when* learning signals are scheduled can be as important as *where* they are allocated across tokens, and introduce a temporal dimension that schedules the credit allocation criteria over the course of RLVR optimization. We find that first prioritizing tokens associated with specific policy behaviors and then gradually attenuating toward general optimization leads to more stable and efficient learning dynamics. Furthermore, we show that simple trajectory percentiles provide a natural perspective for distinguishing policy behaviors and work effectively with temporal scheduling. Our analysis reveals that standard optimization substantially sacrifices policy entropy when simultaneously accommodating heterogeneous behaviors, whereas temporal scheduling yields healthier policy evolution dynamics. Experiments across mathematical and general reasoning benchmarks demonstrate consistent improvements, suggesting that temporal scheduling constitutes a promising optimization dimension.

[LG-49] Electricity Consumption Forecasting: An Approach Using Cooperative Ensemble Learning with SHapley Additive exPlanations

Link: https://arxiv.org/abs/2605.25305
Authors: Eduardo Luiz Alba, Gilson Adamczuk Oliveira, Matheus Henrique Dal Molin Ribeiro, Érick Oliveira Rodrigues
Categories: Machine Learning (cs.LG)

Abstract:Electricity expense management presents significant challenges, as this resource is susceptible to various influencing factors. In universities, the demand for this resource is rapidly growing with institutional expansion and has a significant environmental impact. In this study, the machine learning models long short-term memory (LSTM), random forest (RF), support vector regression (SVR), and extreme gradient boosting (XGBoost) were trained with historical consumption data from the Federal Institute of Paraná (IFPR) over the last seven years and climatic variables to forecast electricity consumption 12 months ahead. Datasets from two campuses were adopted. To improve model performance, feature selection was performed using Shapley additive explanations (SHAP), and hyperparameter optimization was carried out using a genetic algorithm (GA) and particle swarm optimization (PSO). The results indicate that the proposed cooperative ensemble learning approach, named Weaker Separator Booster (WSB), exhibited the best performance on both datasets. Specifically, it achieved an sMAPE of 13.90% and MAE of 1990.87 kWh for the IFPR-Palmas Campus and an sMAPE of 18.72% and MAE of 465.02 kWh for the Coronel Vivida Campus. The SHAP analysis revealed distinct feature importance patterns across the two IFPR campuses. A commonality that emerged was the strong influence of lagged time-series values and a minimal influence of climatic variables.
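
The two error metrics quoted above can be computed as in the sketch below. sMAPE has several variants in the literature and the abstract does not say which one the authors use, so the symmetric form with denominator $|y| + |\hat{y}|$ is an assumption here, as are the function names and the toy consumption values.

```python
def mae(actual, forecast):
    """Mean absolute error, in the same unit as the data (e.g. kWh)."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent.

    Uses the variant 2|y - yhat| / (|y| + |yhat|); a term is taken as 0 when
    both the actual and the forecast value are 0.
    """
    terms = []
    for a, f in zip(actual, forecast):
        denom = abs(a) + abs(f)
        terms.append(0.0 if denom == 0 else 2.0 * abs(a - f) / denom)
    return 100.0 * sum(terms) / len(terms)


# Hypothetical monthly consumption (kWh) and a forecast for the same months.
actual = [100.0, 200.0]
forecast = [110.0, 180.0]
errors = (mae(actual, forecast), smape(actual, forecast))
```

Because MAE is scale-dependent while sMAPE is relative, a campus with lower consumption can show a smaller MAE and a larger sMAPE at the same time, which matches the pattern in the reported figures.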

[LG-50] Algorithms with Polynomially-Improved Approximation Factors for the $2 \rightarrow q$ Norm and Applications

Link: https://arxiv.org/abs/2605.25303
Authors: Samuel B. Hopkins, Stefan Tiegel
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)

Abstract:The $2 \rightarrow q$ norm of a matrix $X \in \mathbb{R}^{n \times d}$ is defined as $\lVert X \rVert_{2 \rightarrow q} = \sup_{\lVert v \rVert_2 = 1} \lVert Xv \rVert_q$. We give polynomial-time multiplicative approximation algorithms for this norm when $q > 2$ (i.e. in the hypercontractive setting). This problem either directly captures or is closely related to long-standing open problems in combinatorial optimization and hardness of approximation (e.g. Small Set Expansion), quantum information (e.g. Best Separable State), and algorithmic statistics. Very little is known about what approximation factors we can achieve for this problem in polynomial time, even though such approximations have significant downstream consequences. Barak, Brandão, Harrow, Kelner, Steurer, and Zhou showed that no polynomial-time algorithm can achieve an approximation factor better than $2^{\sqrt{\log n}}$, assuming the Exponential Time Hypothesis (FOCS’12). On the other hand, a simple spectral algorithm gives a $d^{1/4}$-approximation as a baseline. We give, to the best of our knowledge, the first polynomial-time approximation algorithm beating this baseline by polynomial factors. For the important special case of $q = 4$ it achieves a $d^{1/8}$-approximation. All previous algorithms required additional assumptions on $X$, or only surpassed the baseline for small values of $n$. Moreover, we construct sum-of-squares certificates for the $2 \rightarrow q$ norm. This directly implies improved algorithms for robust mean and covariance estimation, robust regression, and clustering, when the data only satisfies a bound on its $q$-th moment.

[LG-51] Label-NTK Alignments and A Tighter Convergence Bound in the NTK Regime

Link: https://arxiv.org/abs/2605.25275
Authors: Ruchirinkil Marreddy, Chaoyue Liu
Categories: Machine Learning (cs.LG)

Abstract:The Neural Tangent Kernel (NTK) framework explains optimization in over-parameterized neural networks via approximately linearized dynamics, yielding exponential convergence guarantees. However, existing results are often overly pessimistic and do not match the fast training in practice, as they depend on the smallest NTK eigenvalue, which is typically extremely small in practice. In this work, we develop sharper convergence guarantees by characterizing the interaction between data labels and the NTK eigen-spectrum. We identify two key phenomena, Label-NTK alignment and Residual-NTK alignment, showing that projections of labels and residuals onto NTK eigenvectors scale with the corresponding eigenvalues. We provide empirical evidence and theoretical justification under mild data assumptions. Exploiting these alignment properties, we derive a refined convergence bound that depends on the full spectrum and closely matches practical training dynamics, significantly improving over classical worst-case results. We further obtain improved generalization bounds. Experiments on MLPs and CNNs across multiple datasets validate our theory.

[LG-52] A Blended Likelihood Approach for Achieving Fairness Using Naive Bayes

Link: https://arxiv.org/abs/2605.25228
Authors: John Arthur Junior, Abdul Lateef Yussif, Maame G. Asante-Mensah, Charles R. Haruna, Sandro Amofa, Elliot Attipoe
Categories: Machine Learning (cs.LG)

Abstract:Concerns about algorithmic bias and fairness have increased as artificial intelligence has been incorporated into high-stakes decision-making. Traditional Naive Bayes classifiers, while efficient and interpretable, lack fairness-awareness mechanisms and perpetuate historical biases in sensitive domains such as hiring, credit scoring, and criminal justice. This study develops a fairness-aware extension of the Naive Bayes classifier that mitigates bias while maintaining computational efficiency. We propose the Bias Mitigating Naive Bayes (BMNB) classifier, integrating in-processing and post-processing interventions. The in-processing stage employs a blended likelihood approach combining group-specific and pooled likelihood estimates through a tunable blending parameter alpha to balance fairness and accuracy. The post-processing stage applies output calibration with adaptive thresholding to fine-tune group-specific decision boundaries. Experimental results indicate that BMNB attains Disparate Impact (DI) values of 1.000, 1.171, and 0.997 and Equal Opportunity Difference (EOD) values of -0.217, -0.226, and -0.053 on the Adult, ProPublica, and Framingham datasets, respectively, while maintaining computational efficiency. Ablation studies confirm that the combination of blended likelihood and adaptive thresholding yields superior performance compared to either technique in isolation.
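
A blended likelihood of the kind described, together with the Disparate Impact metric, can be sketched as below. The convex-combination form and the direction of `alpha` (1 meaning fully group-specific) are assumptions, since the abstract only says the two estimates are combined through a tunable parameter; the function names and toy predictions are likewise hypothetical.

```python
def blended_likelihood(p_group, p_pooled, alpha):
    """Convex blend of a group-specific and a pooled likelihood estimate.

    alpha = 1 keeps only the group-specific estimate; alpha = 0 keeps only
    the pooled one (this orientation is an assumption of the sketch).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * p_group + (1.0 - alpha) * p_pooled


def disparate_impact(preds_unprivileged, preds_privileged):
    """Ratio of positive-prediction rates: unprivileged over privileged group.

    A value of 1.0 means both groups receive positive predictions equally often.
    """
    rate_u = sum(preds_unprivileged) / len(preds_unprivileged)
    rate_p = sum(preds_privileged) / len(preds_privileged)
    return rate_u / rate_p


# Toy values: one feature likelihood and binary predictions for two groups.
blended = blended_likelihood(p_group=0.2, p_pooled=0.6, alpha=0.25)
di = disparate_impact([1, 0, 1, 0], [1, 1, 0, 0])
```

Sweeping `alpha` between 0 and 1 is what would trade group-specific fit against the pooled estimate in such a scheme.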

[LG-53] Personalized Federated Learning by Energy-Efficient UAV Communications

Link: https://arxiv.org/abs/2605.25212
Authors: Shiqian Guo, Jianqing Liu, Beatriz Lorenzo
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:Federated learning (FL) is an effective paradigm for enhancing the learning capability of edge devices while preserving data privacy. In geographically dispersed FL systems, such as sensor networks in remote areas, unmanned aerial vehicles (UAVs) can flexibly establish high-quality communication links to support parameter exchange. However, device heterogeneity and the limited battery capacity of UAVs pose significant challenges. Specifically, data heterogeneity slows convergence, while scheduling all devices for global collaboration incurs excessive communication and energy costs. To overcome these challenges, we adopt a strict separation between a globally shared backbone and permanently local personalization heads, thereby mitigating the impact of data heterogeneity. Furthermore, we propose a gradient-based scheduling strategy that jointly considers energy efficiency and learning performance. In each communication round, the backbone is updated only by the top-$\alpha$ devices ranked by gradient $\ell_2$-norm, ensuring that optimization focuses on the most informative updates. Simulation results demonstrate that the proposed scheme achieves higher learning accuracy than state-of-the-art approaches while significantly reducing UAV energy consumption.
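
The per-round selection rule (rank devices by the $\ell_2$-norm of their gradient and keep the top ones) can be sketched as below. The abstract does not say whether $\alpha$ is a count or a fraction; this sketch assumes a fraction of the devices, and the function names, device identifiers, and gradient values are hypothetical. The energy-efficiency term of the authors' scheduling strategy is not modelled.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a gradient given as a flat list of floats."""
    return math.sqrt(sum(v * v for v in vec))


def select_top_devices(gradients, alpha):
    """Return the ids of the top-alpha fraction of devices by gradient norm.

    gradients: dict mapping device id -> flat gradient list.
    alpha: fraction in (0, 1]; at least one device is always selected.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    k = max(1, math.ceil(alpha * len(gradients)))
    ranked = sorted(gradients, key=lambda d: l2_norm(gradients[d]), reverse=True)
    return ranked[:k]


# Four devices with toy backbone gradients; half of them update the backbone.
gradients = {"a": [3.0, 4.0], "b": [0.0, 1.0], "c": [6.0, 8.0], "d": [1.0, 1.0]}
selected = select_top_devices(gradients, alpha=0.5)
```

Only the selected devices would need a UAV link in that round, which is where the energy saving in such a scheme would come from.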

[LG-54] Evolving Causal Regulatory Networks (ECR-Net)

Link: https://arxiv.org/abs/2605.25211
Authors: Govind Vallabhasseri Binish, Abdhul Ahadh, Rano Roy Kavanal, Arya Ukunde
Categories: Machine Learning (cs.LG)
Comments: 9 pages, 6 figures. Presents ECR-Net, an evolutionary framework for adaptive causal structure discovery under non-stationarity, with empirical evaluation against NOTEARS, PCMCI+, and related baselines

Abstract:Modern machine learning models excel at pattern recognition but remain brittle, often failing to generalize out of distribution (OOD) because they capture spurious correlations rather than the underlying causal data-generating process. Current causal discovery methods, while powerful, typically assume a static graph structure, rendering them unable to model systems that adapt or undergo structural changes across different environments. We introduce ECR-Net, Evolving Causal Regulatory Networks, a novel, bio-inspired framework for adaptive causal mechanism discovery. Our approach models the data-generating process not as a static graph, but as a dynamic system analogous to a Gene Regulatory Network (GRN), composed of localized, recursive functions where variables can activate and inhibit one another. To discover the latent structure of this network, we employ an evolutionary search algorithm that evolves a population of candidate regulatory graphs, optimizing for a fitness function that measures how well the simulated system dynamics reconstruct the observed data. The key innovation of ECR-Net is its ability to model structural adaptation: it explicitly ingests shifts in the data’s statistical properties as signals of an environmental shock. In response, the evolutionary search identifies parsimonious modifications to the causal graph topology, such as link inhibitions or activations, that explain the new data regime. We posit that ECR-Net represents a new class of adaptive Structural Causal Models capable of discovering how and why a system’s fundamental rules change, offering a path toward robust generalization in complex, non-stationary systems.

[LG-55] Localization then Neutralization: Gradient-guided Token Suppression against Visual Prompt Injection Attack

Link: https://arxiv.org/abs/2605.25194
Authors: Dongpeng Zhang, Ke Ma, Yangbangyan Jiang, Gaozheng Pei, Longtao Huang, Qianqian Xu, Qingming Huang
Categories: Machine Learning (cs.LG)

Abstract:Adversarial images pose a severe security threat to multimodal large language models through prompt injection. Existing defenses largely lack a principled understanding of the underlying mechanisms and struggle to balance efficiency and defense utility. In this work, we show that successful adversarial attacks do not rely on the entire image uniformly but instead depend on a small subset of critical image tokens. Based on this insight, we propose Gradient Token Masking (GTM), which localizes these tokens via gradient analysis and neutralizes them through masking. We find that attribution based on the first generated token’s output probability fails when attacks preserve the predicted token. To overcome this, GTM utilizes the Hidden-State Gradient Norm score for generation-influence attribution under adversarial inputs. We prove that its ranking is consistent with that of the full adversarial loss gradient, providing a theoretical guarantee for accurate localization. Our method requires only a single forward-backward pass to identify and zero out a small number of high-scoring tokens, effectively disrupting the adversarial attack path. Extensive experiments on prompt injection and multimodal jailbreak attacks demonstrate that our approach reduces attack success rates (ASR) to near zero while preserving model utility with negligible computational overhead.

[LG-56] Learning Treatment Effects during Resource Allocation via Priority-Queue Randomization

Link: https://arxiv.org/abs/2605.25169
Authors: JungHo Lee, Johnna Sundberg, Pim Welle, Bryan Wilder
Categories: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)

Abstract:Public service programs often allocate limited resources under uncertainty about their benefits, creating a need for randomization to support credible evaluation. In practice, however, applicants commonly enter waitlists where resources are prioritized toward individuals judged to have higher need through tiered priority queues, making direct randomization difficult. Motivated by this, we develop an experimental design framework for learning treatment effects while treating those most in need where incoming applicants are randomized into priority queues based on their assessed risk scores. Treatments are then provided across queues in priority order and first-in-first-out within queue as budget becomes available. Our contributions are two-fold. First, we characterize what causal effects are identified under this priority-queue allocation. When arrivals are exogenous, treatments are conditionally randomized, and hence standard estimands are identified; when arrivals are endogenous, queue randomization instead provides an instrument for treatment, identifying local treatment effects induced by the queuing process. Second, we develop optimized queue-assignment designs that trade off statistical efficiency against prioritizing higher-need applicants. We show in the process that, despite dependence in treatment assignments induced by the design, usual iid efficiency bounds remain well-justified design objectives. We illustrate the proposed designs using data from a housing allocation program in a large U.S. county.

[LG-57] Blocked Gibbs meets Diffusion Transformers: Unsupervised Learning for Constraint Optimization

Link: https://arxiv.org/abs/2605.25129
Authors: Yudong W. Xu, Wenhao Li, Xiaoyu Wang, Scott Sanner, Elias B. Khalil
Categories: Machine Learning (cs.LG)

Abstract:Diffusion models have shown promise in learning to solve constraint optimization problems. However, they are mostly restricted to problems with binary variables and rely on graph neural networks, hindering their application to a broader range of problems such as those with general discrete variables or constraint structures that necessitate global rather than local reasoning. We investigate the use of Diffusion Transformers to address the aforementioned limitations. A naive implementation performs poorly due to a fundamental mismatch between the standard diffusion process and constraint solving: while the former applies small, incremental denoising across all variables, the latter requires substantially altering specific subsets of variables to attain feasibility or optimality. Our method, Blocked Gibbs Diffusion Transformer (BloGDiT), is the first to address this limitation by replacing standard joint Gaussian denoising with blocked Gaussian denoising. BloGDiT uses iterative block resampling and anneals the block size over time to facilitate large, targeted edits within a block of variables. Across Sudoku, Graph Coloring, Maximum Independent Set, and MaxCut, BloGDiT matches or outperforms existing methods, demonstrating that blocked Gibbs-style diffusion provides a highly effective inductive bias for Transformer-based constraint satisfaction and optimization.

[LG-58] Optimizing Multidimensional Scaling in Gini Metric Spaces

Link: https://arxiv.org/abs/2605.25124
Authors: Cassandra Mussard, Stéphane Mussard
Categories: Machine Learning (cs.LG)

Abstract:The Gini Multidimensional Scaling (Gini MDS) framework extends the Euclidean multidimensional scaling. We introduce a Gini pseudo-distance based on values and their ranks that depends on a fine-tunable hyperparameter. This pseudo-distance allows flexible exploration of latent configurations, enabling embeddings that best match observed dissimilarities. The Gini MDS is shown to be robust to noise and outliers, making it well-suited for real-world applications. We provide experiments on 16 UCI datasets with outliers and on MNIST images with noise to show that the Gini MDS outperforms the Euclidean MDS on noisy data. Finally, a tensor-based implementation in `PyTorch` provides GPU acceleration and efficient computation compared to the standard MDS of the `sklearn` library.

[LG-59] Revisiting Pre-Propagation GNNs: Robust Diffusion Operators and Hidden-State Re-Propagation

Link: https://arxiv.org/abs/2605.25111
Authors: Zichao Yue, Zhiru Zhang
Categories: Machine Learning (cs.LG)

Abstract:Pre-propagation graph neural networks (PPGNNs) decouple node feature propagation from transformation: graph diffusion is performed once as preprocessing, and training reduces to dense per-node transformations. This design enables mini-batch training without inter-node dependencies, avoids repeated sparse matrix–matrix multiplications, and better matches modern accelerators optimized for dense compute. However, their expressivity remains unclear, and empirical results show a gap between PPGNNs and their message-passing counterparts on commonly used graph benchmarks, especially heterophilic ones. In this paper, we propose a suite of robust graph diffusion operators for preprocessing and a few-shot hidden-state re-propagation scheme during training. Our methods improve the validation and test accuracy of PPGNNs, enabling them to match the accuracy of message-passing GNNs while maintaining training efficiency.
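
The decoupling described in this abstract can be illustrated with a minimal sketch. It assumes a plain row-normalised adjacency with self-loops as the diffusion operator, which is a generic stand-in and not the robust operators the paper proposes; the function names `normalize_adjacency` and `pre_propagate` are hypothetical.

```python
def normalize_adjacency(edges, num_nodes):
    """Dense row-normalised adjacency with self-loops (illustrative operator only)."""
    adj = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        adj[i][i] = 1.0
    for u, v in edges:
        adj[u][v] = 1.0
        adj[v][u] = 1.0
    for row in adj:
        total = sum(row)
        for j in range(num_nodes):
            row[j] /= total
    return adj


def matmul(a, b):
    """Plain dense matrix product on lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def pre_propagate(edges, features, num_hops):
    """Run graph diffusion once, up front, and return per-node concatenated hops.

    Row i of the result is [X_i, (AX)_i, ..., (A^K X)_i]. After this step the
    graph is no longer needed: training sees independent dense rows.
    """
    adj = normalize_adjacency(edges, len(features))
    hops = [features]
    for _ in range(num_hops):
        hops.append(matmul(adj, hops[-1]))
    return [sum((h[i] for h in hops), []) for i in range(len(features))]


# Path graph 0 - 1 - 2 with a single scalar feature per node.
table = pre_propagate([(0, 1), (1, 2)], [[1.0], [0.0], [0.0]], num_hops=1)
```

Because every row of `table` already carries its neighbourhood information, rows can be shuffled into mini-batches and fed to an ordinary per-node network without touching the adjacency again.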

[LG-60] Reinforcement Learning for Laser Additive Manufacturing Scan-Order Optimisation: A Bilevel Proxy–FEA Diagnostic Framework for Reward and World-Model Diagnosis

Link: https://arxiv.org/abs/2605.25063
Authors: Xian Wu, Haoran Li, Dongbin Zhao, Ruiyao Zhang, Yuanqi Chu, Bin Wang
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
Comments: 31 pages, 7 figures, 3 tables

Abstract:Reinforcement learning offers a promising approach for scan-order optimisation in laser additive manufacturing, where sequential scan decisions critically influence thermal accumulation, residual stress, distortion, and final part quality. A central challenge in applying RL to this domain lies in reward and world-model fidelity: full finite-element analysis is computationally prohibitive for dense in-the-loop evaluation, while cheap thermo-inspired proxy metrics, though efficient, may capture only partial aspects of the true thermo-mechanical objectives. This paper investigates a bilevel Proxy–FEA diagnostic framework for reward and world-model diagnosis in reinforcement-learning-guided scan-order optimisation. The lower level employs lightweight scan-path and thermo-inspired proxies for rapid candidate generation and preliminary policy-side screening, while the upper level utilises sparse Abaqus FEA simulations to provide simulation-based reference labels. The framework is examined on a simplified whole-track heating LDED32 stripe benchmark comprising ten representative scan strategies. Final-cooling residual Mises stress, U3 vertical distortion, and PEEQ plasticity metrics reveal an observed stress–distortion trade-off rather than a single monotonic quality objective. Within the evaluated set, the center_out strategy emerges as a robust compromise candidate, while raster_left_to_right and edge_in form opposing endpoints of the trade-off. Proxy–FEA alignment analysis shows that current cheap path-based metrics predominantly capture distortion-related (U3) behaviour and exhibit only weak correlation with the sparse FEA reference labels. These findings highlight that proxy-only reward designs risk misalignment in future RL training and underscore the value of sparse FEA reference signals for diagnostic-guided reward and world-model refinement prior to large-scale policy optimisation.

[LG-61] Random Neural Network Expressivity for Non-Linear Partial Differential Equations

Link: https://arxiv.org/abs/2605.25057
Authors: Muhammed Ali Mehmood, Lukas Gonon
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)

Abstract:Neural networks with randomly generated hidden weights (RaNNs) have been extensively studied, both as a standalone learning method and as an initialization for fully trainable deep learning methods. In this work, we study RaNN expressivity for learning solutions to non-linear partial differential equations (PDEs). Despite their widespread use in practical applications, a rigorous theoretical understanding of the approximation properties of RaNNs in this context remains limited. Here, we derive error bounds for RaNN approximations to time-dependent Sobolev functions and obtain a dimension-free approximation rate $\frac{1}{2}$ for sufficiently regular functions. We apply our results to two important classes of non-linear PDEs: Porous Medium Equations and Compressible Navier-Stokes Equations, showing that RaNNs are capable of efficiently approximating solutions to these complex, non-linear PDEs. Our theoretical analysis is supported by numerical experiments, showing that the obtained convergence rates extend beyond the considered setting.

[LG-62] MimirRAG: A Multi-Agent RAG Framework for Financial Data Retrieval with Metadata Integration

Link: https://arxiv.org/abs/2605.25030
Authors: Magnus Samuelsen, Wilmer Nyström, Somnath Mazumdar, Mansoor Hussain, Mikkel Strange
Categories: Machine Learning (cs.LG)

Abstract:Retrieval-augmented generation (RAG) systems offer a promising approach to reduce hallucinations and improve answer accuracy in large language models (LLMs), a requirement for reliable financial analysis, where answers must be grounded in verifiable evidence from filings rather than generated from model priors. However, designing RAG systems that extract meaningful insights from mixed financial documents and integrate into analyst workflows remains challenging. This paper introduces MimirRAG (Metadata-Integrated Multi-Agent Information Retrieval), a multi-agent RAG system developed iteratively to address these challenges. MimirRAG features a modular pipeline encompassing structure-preserving parsing of PDF filings, table-aware chunking, metadata extraction, agent-based retrieval with query planning and hybrid search, validation, and context-aware generation with numerical reasoning support. Our ablation study identifies three key technical enablers for effective financial RAG: metadata integration, table-aware chunking, and an agentic workflow. MimirRAG was evaluated quantitatively using FinanceBench and qualitatively through expert validation with four financial analysts. The system achieved 89.3% accuracy on FinanceBench, outperforming the original benchmark baselines. Expert feedback highlighted that successful deployment also requires calibrated trust, comprehensive data integration, and user personalization. We conclude that combining multi-agent RAG architecture with human-centric design principles can improve the extraction of meaningful insights in financial analysis.

[LG-63] A perspective on fluid mechanical environments for challenges in reinforcement learning

Link: https://arxiv.org/abs/2605.25011
Authors: Shruti Mishra, Michael Chang, Vamsi Spandan, Shmuel M. Rubinstein
Categories: Machine Learning (cs.LG)

Abstract:We consider the challenge of developing agents that efficiently interact with high-dimensional, evolving environments, towards a view of practical reinforcement learning (RL) agents interacting with open worlds, of which they witness and affect only a small part. We argue that canonical fluid mechanics problems, and their simulations, present a compelling testbed for the development of such methods. These problems arise in nonlinear instabilities, where small disturbances can grow to transform the dynamics of a system. Nonlinear instabilities represent several open scientific challenges with industrial applications – the droplet breakup of a liquid jet, mixing at an interface between two fluids, and the appearance of unusually tall rogue waves in the ocean. In these settings, agents may leverage preserved representations across the changing dynamics to learn efficiently. We present two problem descriptions of agents interacting with a fluid mechanical environment, and describe the state and action spaces, and reward functions, for these agents. For these examples, we specify the aspects of the environment which are nonstationary and the preserved invariances. We note Dedalus and JAX-CFD as open-source simulators that can be used for the development of reinforcement learning methods (Burns et al., 2016; Kochkov et al., 2021). We demonstrate the use of Dedalus for environment generation by creating RL agents that learn to navigate in a stationary environment that is simulated using Dedalus. This sets the stage for future development of RL agents that learn to meaningfully interact with simulated environments that represent scientific challenges in natural and industrial flows.

[LG-64] Convex-Neural RRT*: Fast and Reliable Learning-Guided Sampling for High-Quality Robot Path Planning

Link: https://arxiv.org/abs/2605.25006
Authors: Hichem Cheriet, Badra Khellat Kihel, Samira Chouraqui, Bara J. Emran
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Abstract:Sampling-based algorithms for robot path planning offer probabilistic completeness and strong empirical convergence properties across environments with diverse obstacle configurations. However, in practice, these methods often require many iterations to obtain high-quality solutions. This paper proposes Convex-Neural RRT*, an enhanced RRT* variant that incorporates neural guidance to predict informative waypoint regions near high-quality paths. Convex candidate regions are extracted from these predictions, enabling the planner to concentrate exploration on geometrically relevant areas while preserving global exploration. The proposed algorithm is evaluated against Neural RRT*, Neural Informed RRT*, classical RRT*, and LTA* across three environment types and 18 benchmark maps. Experimental results show that Convex-Neural RRT* reduces computation time by 30-75% compared to neural-guided variants and up to 88-98% relative to LTA*, while achieving an average path length reduction of approximately 5% compared to classical RRT*, with larger improvements observed in complex environments. The method also maintains an overall success rate above 99% across varying obstacle densities. These findings indicate that convex-guided neural sampling provides an effective balance between computational efficiency and solution quality, supporting its applicability to time-sensitive robotic navigation tasks.

[LG-65] Mitigating Gradient Pathology in PINNs through Aligned Constraint

Link: https://arxiv.org/abs/2605.25001
Authors: Yichen Luo, Peiyu Zhu, Dongxiao Hu, Jia Wang, Tailin Wu, Dapeng Lan, Yu Liu, Zhibo Pang
Categories: Machine Learning (cs.LG)

Abstract:While Physics-Informed Neural Networks (PINNs) are powerful for solving Partial Differential Equations (PDEs), their training is often paralyzed by gradient pathology. The gradients from the PDE residuals and boundary constraints oppose each other, trapping the model in local minima. Current solutions, such as adaptive weighting or hard constraints, either fail to fundamentally resolve this ill-conditioning or are limited to simple geometries. In this study, we systematically analyze the possible causes of this gradient pathology from the perspectives of loss landscapes and optimization dynamics. Based on the obtained conclusion, we propose Constraint-Aligned loss with Manifold Lifting (CAML). By reformulating all zeroth-order terms into aligned constraints, our method effectively mitigates gradient conflicts. In addition, we introduce a delay factor to help the optimizer skip the high-curvature area. Experiments demonstrate that our CAML significantly enhances numerical stability and efficiency in highly complex PINN problems. Our code is open-sourced on this https URL.

[LG-66] Learning locomotion and navigation of soft synthetic snakes in three-dimensional heterogeneous environments

Link: https://arxiv.org/abs/2605.24985
Authors: Xiaotian Zhang, Ali Albazroun, Tixian Wang, Songyuan Cui, Prashant G. Mehta, Mattia Gazzola
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: 14 pages, 5 figures

Abstract:Limbless terrestrial animals exhibit exceptional locomotor versatility and control, currently unmatched by engineered counterparts. Here, we introduce a computational framework that enables soft synthetic snakes to navigate unstructured, heterogeneous 3D terrains. Our approach is grounded in bio-inspired actuation and sensing models that reduce the control complexity inherent to high-degree-of-freedom, continuum bodies. These models are integrated into a reinforcement learning architecture to derive environment-traversing policies. Training first occurs in simplified, homogeneous terrains to learn locomotion primitives. These are then composed into adaptive strategies for complex landscapes. We demonstrate robustness by deploying a snake in high-fidelity 3D environments reconstructed from real-world imaging, achieving reliable navigation. Overall, this work provides a physically-realistic simulation platform and practical insights for the control of continuum systems in natural terrains.

[LG-67] Benchmarking non-conformity score functions in conformal prediction

Link: https://arxiv.org/abs/2605.24983
Authors: Sol Erika Boman
Categories: Machine Learning (cs.LG)
Comments: 3 tables, 1 supplementary table, 1 supplementary figure

Abstract:Conformal prediction is a useful and versatile alternative to model calibration in machine learning classification. It replaces single-class prediction with prediction sets, guaranteeing that the a priori probability of the prediction sets containing the true class is larger than or equal to a pre-specified rate. The size and usefulness of the prediction sets rely heavily on the choice of the non-conformity score function. The scientific literature contains many examples of non-conformity score functions, but there is an absence of studies examining their properties and effectiveness. In this paper, we give an overview of properties of non-conformity score functions. We give examples of non-conformity score functions in the existing literature and introduce original modifications. We introduce an original method of evaluating the prediction set sizes of conformal predictors and use it to provide a comparison between non-conformity score functions. We also examine the efficacy of different non-conformity score functions for class-conditional conformal prediction in a setting with imbalanced classes.
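
To make the role of the non-conformity score concrete, here is a minimal sketch of split conformal classification. It assumes the simplest common score, one minus the predicted probability of the true class, which is only one of the many scores such a benchmark could cover; the function names are hypothetical and the paper's own scores and evaluation method are not reproduced.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample corrected quantile of the calibration scores.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest score, or infinity
    when the calibration set is too small for the requested coverage.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")
    return sorted(scores)[k - 1]


def calibrate(cal_probs, cal_labels, alpha):
    """Score each calibration point with 1 - p(true class) and take the quantile."""
    scores = [1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels)]
    return conformal_quantile(scores, alpha)


def prediction_set(probs, threshold):
    """All labels whose non-conformity score does not exceed the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= threshold]


# Three calibration points, target miscoverage alpha = 0.5.
cal_probs = [[0.9, 0.1, 0.0], [0.3, 0.6, 0.1], [0.2, 0.3, 0.5]]
cal_labels = [0, 1, 2]
threshold = calibrate(cal_probs, cal_labels, alpha=0.5)
labels = prediction_set([0.7, 0.25, 0.05], threshold)
```

Swapping the expression `1.0 - probs[label]` for a different score changes the size of the resulting sets while leaving the marginal coverage argument intact, which is why the choice of score matters in practice.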

[LG-68] MedMamba: Multi-View State Space Models with Adaptive Graph Learning for Medical Time Series Classification ICML

Link: https://arxiv.org/abs/2605.24961
Authors: Da Zhang, Bingyu Li, Zhiyuan Zhao, Hongyuan Zhang, Junyu Gao, Xuelong Li
Categories: Machine Learning (cs.LG)
Comments: Accepted to 2026 ICML

Abstract:Medical time series are central to healthcare, enabling continuous monitoring and supporting timely clinical decisions. Despite recent progress, existing methods struggle to jointly model local-global dynamics and handle nonstationarities like baseline drift, while often failing to capture latent channel interactions. To address these challenges, we propose MedMamba, an end-to-end architecture that integrates state space models with domain-specific inductive biases. Specifically, MedMamba first employs multi-scale convolutional embeddings to capture discriminative local morphology. Second, to mitigate nonstationarity, we introduce a tri-branch differential state space encoder that processes raw, temporal-difference, and frequency-domain views, fusing them to emphasize informative patterns while suppressing drift. Furthermore, to uncover latent channel correlations, we design a spatial graph Mamba module that learns a directed dependency structure regularized toward sparsity and acyclicity, which obviates the need for predefined graphs. Extensive experiments on five real-world datasets demonstrate that MedMamba achieves state-of-the-art performance while maintaining linear computational complexity, and ablation studies validate each component’s contribution. Code is available at this https URL.

[LG-69] ARCANE-PedSynth: Synthetic Multi-Pedestrian Datasets with Behavioural Crossing Annotations

Link: https://arxiv.org/abs/2605.24950
Authors: Muhammad Naveed Riaz, Maciej Wielgosz, Antonio M. López Peña
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:We present ARCANE-PedSynth, an open-source CARLA-based software framework for generating synthetic multi-pedestrian datasets with dense behavioural annotations for pedestrian crossing prediction in autonomous driving. The framework overcomes CARLA’s native 9% crossing rate through a hybrid AI-manual pedestrian control architecture, enabling configurable target rates up to 75%. A 12-state behavioural finite state machine with five character archetypes produces diverse crossing behaviours. The framework generates synchronised RGB, LiDAR, and DVS data with per-frame crossing labels, behavioural states, and estimated 2D pose keypoints. We demonstrate ARCANE-PedSynth through PedSynth++, an example dataset generated with the framework, comprising 533 multi-pedestrian clips across 12 weather conditions with RGB, LiDAR, and DVS streams. ARCANE-PedSynth is fully reproducible via CLI parameterisation and Docker containerisation.

[LG-70] Memory-Induced Tool-Drift in LLM Agents

Link: https://arxiv.org/abs/2605.24941
Authors: Mahavir Dabas, Jihyun Jeong, Ming Jin, Ruoxi Jia
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:Modern LLM agents combine long-term memory for personalization with tool-calling interfaces for taking actions in the world – a combination underpinning contemporary production systems. We study a previously unexamined failure of this combination: when personality-driven biases stored in memory (cost-consciousness, impatience, risk tolerance, etc.) silently affect tool calls in contexts where they are not applicable. We call this memory-induced tool-drift and operationalize it through MEMDRIFT, a benchmark of 105 scenarios spanning five bias dimensions and seven professional domains, generated through an automated adversarial pipeline. Across seven frontier models – including those with extended reasoning – biased memories raise deflection scores (a judge-scored measure of parameter deviation from unbiased baselines) by up to +3.6 points on a 1–5 scale. Tool-drift persists when memory management is handled by three production memory architectures. The phenomenon affects real-world tools: scanning 6,062 tools across 288 verified MCP servers, we flag 608 with susceptible parameters and confirm tool-drift on a validated subset. Mechanistically, biased memories act as implicit steering vectors, pushing activations along the same latent directions as explicit behavioral instructions. They also redistribute attention from task-relevant context toward memory entries with surface-level keyword overlap to the target parameter. Standard defenses – prompt-based relevance instructions and memory filters – reduce drift but do not eliminate it. As agents take increasingly consequential actions on a user’s behalf, memory-induced tool-drift represents a systematic vulnerability that current safeguards do not address, motivating dedicated defenses at the intersection of memory management and tool-call generation.

[LG-71] Global linear convergence of entropy-regularized softmax policy gradient beyond tabular MDPs

Link: https://arxiv.org/abs/2605.24939
Authors: Ziyue Chen, David Šiška, Lukasz Szpruch
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:We study the global convergence of policy gradient for infinite-horizon entropy-regularized Markov decision processes (MDPs) with continuous state and action spaces. We consider log-linear softmax policies with linear function approximation, which extend the tabular softmax parameterization while retaining a tractable policy class. Under $Q^{\pi}_{\tau}$-realizability for the regularized state-action value function, we first establish a non-uniform Polyak–Łojasiewicz (PŁ) inequality. The non-uniformity arises through degeneracy of constants associated with the policy geometry, namely the Fisher information matrix or an uncentered feature covariance matrix. We then identify two feature regimes under which this non-uniform constant can be bounded along the gradient flow. For full-affine-span features, we prove radial unboundedness of the KL regularizer and show that the smallest eigenvalue of the Fisher information matrix remains bounded below by an initialization-dependent positive constant. For simplex-valued features, we prove an analogous radial unboundedness result in the subspace orthogonal to the all-ones vector and obtain a uniform lower bound for the smallest eigenvalue of the uncentered covariance matrix. These results imply global linear convergence of the regularized objective along the gradient flow, i.e. suboptimality decaying as $\mathcal{O}(e^{-Ct})$ for some $C > 0$. Our analysis extends the global convergence theory of entropy-regularized softmax policy gradient beyond the tabular setting of Agarwal et al. (2020); Bhandari and Russo (2024); Mei et al. (2020).
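
The step from a PŁ-type inequality to an exponential rate is short once the constant is bounded along the trajectory. A sketch of the standard argument follows; the symbols $J$, $\theta_t$, $J^\star$, $\delta$ and $c$ are introduced here for illustration and are not the paper's notation, and the normalisation of the constant is an assumption. Suppose the parameters follow the gradient flow of the regularized objective and the inequality holds with a constant $c > 0$ that is uniform along the flow:

$$
\dot{\theta}_t = \nabla J(\theta_t), \qquad \delta(t) := J^\star - J(\theta_t), \qquad \|\nabla J(\theta_t)\|^2 \ge c\,\delta(t).
$$

Differentiating the suboptimality along the flow and applying Grönwall's inequality gives

$$
\frac{d}{dt}\,\delta(t) = -\|\nabla J(\theta_t)\|^2 \le -c\,\delta(t)
\quad\Longrightarrow\quad
\delta(t) \le \delta(0)\,e^{-ct}.
$$

This is why the abstract's lower bounds on the smallest eigenvalue matter: they are what keep the constant away from zero along the whole trajectory.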

[LG-72] BandVQ: Band-Wise Vector-Quantized EEG Foundation Model

Link: https://arxiv.org/abs/2605.24921
Authors: Jamiyan Sukhbaatar, Satoshi Imamura, Toshihisa Tanaka
Categories: Machine Learning (cs.LG)
Comments: 15 pages, 1 figure

Abstract:A central challenge in electroencephalography (EEG) foundation modeling is learning transferable representations across recordings with diverse tasks, montages, references, and spectral characteristics. Existing masked modeling approaches often rely on broadband continuous patches or a single discrete representation, which may underrepresent frequency-specific activity. This paper proposes BandVQ, a band-wise vector-quantized EEG foundation model that decomposes EEG into delta, theta, alpha, beta, and gamma bands, trains an independent VQ-VAE tokenizer for each band, and pretrains a shared Transformer encoder on the resulting discrete VQ code indices. The encoder uses masked code tokens, quantized absolute log-power tokens, channel and temporal embeddings, and metadata prefix tokens representing reference, band, task family, and phase. Region-based masking is also introduced to reduce the trivial reconstruction of spatially adjacent electrodes. The model is pretrained on 71 public EEG corpora comprising over 9,200 subjects and 357,000 single-channel hours and evaluated on six subject-independent classification datasets. Under the current evaluation setting, the proposed model achieves strong transfer performance, with the highest reported results on three cognitive tasks and competitive performance on three motor imagery tasks.

[LG-73] SEED: Semi-supervised Continual MalwarE Detection for Tackling ConcEpt Drift on a BuDget

Link: https://arxiv.org/abs/2605.24903
Authors: Suresh Kumar Amalapuram, Bikraj Shresta, Siva Ram murthy Chebiyam, Bheemarjuna Reddy Tamma, Sumohana S Channappayya
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:Machine learning based malware detectors become obsolete over time due to concept drift in benign and malware applications. Recent methods rely on fully labeled data and use hierarchical contrastive loss (HCL) with active learning to improve robustness against drift by exploiting semantic structure in malware representations. However, obtaining labeled data in the security domain is difficult. Under partially labeled settings, HCL suffers significant performance degradation in detecting unseen malware, especially on datasets such as BODMAS where strong semantic structure may not exist. In this paper, we propose SEED, a semantic-structure-agnostic method for malware detection under limited supervision. SEED combines a tailored binary cross-entropy objective with semi-supervised continual learning and active learning. For partially labeled seen tasks, unlabeled samples are projected into a representation space constructed from previously seen data using singular value decomposition, and paired with suitable labeled samples to encourage representation consistency. For unseen tasks with fully unlabeled data, uncertainty is quantified using cosine distance in representation space, and the most uncertain samples are selected for analyst labeling. We evaluate SEED on both Windows and Android malware datasets. Using only 20% labeled data on seen tasks, SEED achieves average AUT improvements of 40% on BODMAS and 14% on AndroZoo for unseen malware detection compared to HCL* (the semi-supervised adaptation of HCL), while remaining competitive on APIGraph. Finally, we introduce a delayed buffer update strategy to reduce label noise propagation during replay and improve learning stability.

[LG-74] Efficient DP-SGD for LLMs with Randomized Clipping ICML2026

Link: https://arxiv.org/abs/2605.24879
Authors: Enayat Ullah, Sai Aparna Aketi, Devansh Gupta, Huanyu Zhang, Meisam Razaviyayn
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: Accepted at ICML 2026

Abstract:Large language models (LLMs) are trained on vast datasets that may contain sensitive information. Differential privacy (DP), the de facto standard for formal privacy guarantees, provides a principled framework for training LLMs with provable privacy protection. However, state-of-the-art DP training implementations rely on fast gradient clipping techniques with memory overhead $O(B \min\{T^2, d^2\})$, where $B$ is the batch size, $T$ is the sequence length, and $d$ is the model width. This becomes prohibitive as both model size and context length grow. We propose DP-SGD-RC, a novel variant of DP-SGD with randomized clipping that reduces memory and compute complexity. DP-SGD-RC leverages stochastic trace estimation methods, specifically Hutchinson’s estimator [Hutchinson, 1989] and its improved variant, Hutch++ [Meyer et al., 2021], to reduce the memory footprint of per-sample gradient norm estimation. We provide a tight privacy analysis showing that DP-SGD-RC achieves noise multipliers competitive with deterministic clipping. Experiments fine-tuning Llama 3.2-1B on long-context benchmarks spanning classification, question answering, and summarization tasks demonstrate that DP-SGD-RC matches baseline utility while significantly reducing memory and compute requirements.
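
As background on the trace-estimation idea the abstract mentions, the following is a minimal sketch of Hutchinson's estimator applied to a squared Frobenius norm, using the identity that the squared norm of a matrix G equals the trace of GᵀG, which in turn equals the expectation of the squared length of Gz for random sign vectors z. How DP-SGD-RC wires this into per-sample clipping is not stated in the abstract, so the function names and the dense-matrix setting here are illustrative assumptions only.

```python
import random


def frobenius_sq_exact(matrix):
    """Exact squared Frobenius norm, for comparison."""
    return sum(x * x for row in matrix for x in row)


def hutchinson_frobenius_sq(matrix, num_probes, rng):
    """Estimate the squared Frobenius norm from matrix-vector products only.

    Each probe z has independent +1/-1 entries, so E[z z^T] is the identity
    and E[||G z||^2] = trace(G^T G). Averaging over probes reduces variance.
    """
    num_cols = len(matrix[0])
    total = 0.0
    for _ in range(num_probes):
        z = [rng.choice((-1.0, 1.0)) for _ in range(num_cols)]
        gz = [sum(a * b for a, b in zip(row, z)) for row in matrix]
        total += sum(v * v for v in gz)
    return total / num_probes


rng = random.Random(0)
grad = [[1.0, 1.0]]  # toy stand-in for a gradient matrix
estimate = hutchinson_frobenius_sq(grad, num_probes=20000, rng=rng)
exact = frobenius_sq_exact(grad)
```

The estimator only needs products with G, never G itself in materialised per-sample form, which is the property that makes this family of methods attractive when memory is the bottleneck.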

[LG-75] IV-Net: A neural network for elliptic PDEs with random and highly varying coefficients

Link: https://arxiv.org/abs/2605.24876
Authors: Shan Zhong, George Biros
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments: 36 pages

Abstract:We introduce a novel neural operator architecture designed to approximate solutions of linear elliptic partial differential equations with high-contrast, spatially varying coefficients. The network, termed the Iterated V-shaped Net (IV-Net), realizes a mapping from the input coefficients and right-hand side to the corresponding solution field. The architecture of IV-Net is informed by, and closely resembles, a V-cycle multigrid solver. The IV-Net model is parameterized via convolutional layers defined in the physical domain. For coercive problems with highly heterogeneous coefficients, the proposed network exhibits superior performance relative to a proper orthogonal decomposition (POD) approach and several existing neural operator architectures. For low-frequency oscillatory Helmholtz problems with smooth coefficients, its performance is similar to that of a Fourier neural operator. We analyze the approximation error and convergence behavior of IV-Net, its data efficiency, and its dependence on the underlying discretization mesh. Furthermore, we demonstrate the practical effectiveness of the architecture through a series of numerical experiments, including applications to uncertainty quantification, inverse problems, and prediction of quantities of interest.

[LG-76] Cluster Frequency Conformal Prediction for Local Coverage

Link: https://arxiv.org/abs/2605.24872
Authors: Tomer Lavi, Bracha Shapira, Nadav Rappoport
Categories: Machine Learning (cs.LG)

Abstract:Conformal prediction provides distribution-free coverage guarantees, but in many-class classification it may still under-cover specific classes or subpopulations, preventing safe deployment in high-stakes applications. We propose Cluster Frequency Conformal Prediction (CFCP), a plug-in framework that adapts conformal prediction to local structure in a learned representation space. CFCP clusters learned embeddings, estimates cluster-level label-frequency distributions from calibration data, and for each test point constructs a sample-specific probability vector by softly mixing nearby cluster distributions regularized with global-prior and reliability-aware shrinkage. This vector is then conformalized using standard set constructors. In the disjoint-split regime, CFCP inherits standard finite-sample marginal validity. Under additional assumptions, CFCP further admits a local-validity interpretation. Since representation clusters aggregate locally similar samples, their empirical class frequencies provide a stable estimate of local label ambiguity. Across image and text benchmarks, CFCP achieves the best class coverage in 15/16 dataset/score-family comparisons and a competitive prediction set size efficiency, with several settings substantially more efficient. Overall, our results show that cluster-frequency information provides an effective localized signal for improving classwise reliability in many-class conformal prediction.

[LG-77] A comparative study of accuracy and rollout stability of temporal surrogate models

Link: https://arxiv.org/abs/2605.24868
Authors: Rajarshi Biswas
Categories: Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD); Computational Physics (physics.comp-ph)
Comments: 24 pages, 18 figures, submitted to journal

Abstract:Temporal surrogate models are effective for predicting chaotic dynamical systems where computational cost can be prohibitive. Several deep neural network architectures can be used for such purposes. In this work, a few commonly used architectures are compared using a common training protocol. The objective is to fairly assess the impact of model architecture on long-horizon prediction stability. Experiments are carried out for three problems: the double pendulum, the Kuramoto-Sivashinsky equations, and the Kolmogorov flow. The experiments are carried out with matching model capacity. Analysis is also carried out for a scenario where each model is individually optimized. It is observed that in both scenarios, the models exhibit categorical differences in long-horizon rollouts. For a concrete quantification, stepwise error injections and perturbation amplifications are analyzed using metrics such as local Jacobian, relative one-step bias, and finite-time Lyapunov growth. Additionally, an attractor analysis is conducted to assess how well the learned models replicate the underlying system geometry. An ablation study to isolate the impact of each component of a continuous-update architecture is also carried out. It is concluded that models with integrator-like updates show lower bias and perturbation amplification, yielding stable long-horizon rollouts and more accurate predictions.

[LG-78] Unifying Value Alignment and Assignment in Cross-Domain Offline Reinforcement Learning with Heterogeneous Datasets ICML2026

Link: https://arxiv.org/abs/2605.24862
Authors: Zhongjian Qiao, Jiafei Lyu, Chenjia Bai, Peisong Wang, Siyang Gao, Shuang Qiu
Categories: Machine Learning (cs.LG)
Comments: Accepted at ICML 2026

Abstract:Cross-domain offline reinforcement learning (RL) aims to learn a policy in the target domain with a limited target domain dataset and a source domain dataset that exhibits a dynamics shift. Training directly on the original source dataset typically leads to performance collapse. Recent studies perform data filtering from the perspective of dynamics alignment or value alignment to enable efficient policy transfer. However, these studies are typically validated on single-domain or single-behavior-policy source datasets. In this work, we explore a more general heterogeneous cross-domain offline RL setting, where the source datasets may be collected from multiple source domains by diverse behavior policies. We first uncover a critical yet overlooked issue in this setting: value misassignment. Empirically and theoretically, we demonstrate that value misassignment can undermine value alignment, mislead data filtering toward selecting suboptimal samples, and loosen the suboptimality gap, thereby degrading the agent’s performance. To address this issue, we propose V2A, which integrates dynamics alignment, value alignment, and value assignment. V2A first employs temporally-consistent modality representation learning to extract dynamics modalities from the source dataset, followed by modality-aware advantage learning to rectify value alignment. Finally, it adopts a data filtering paradigm to selectively share source data for policy learning. Empirical results show that V2A significantly outperforms strong baseline methods under general heterogeneous cross-domain offline RL settings.

[LG-79] T2S-MPC: Time-Embedded Online Adaptive Model Predictive Control for Time-Varying Dynamics

链接: https://arxiv.org/abs/2605.24852
作者: Zeyu Shen,Zhuoyuan Wang,Laixi Shi
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Recent advances in learning-based model predictive control (MPC) have leveraged neural networks for online model learning, achieving strong performance when nonstationary system dynamics deviate from nominal models. However, existing approaches primarily address specific or relatively structured forms of dynamical variation, leaving more general, unknown, and unpredictable time-varying dynamics insufficiently handled. To tackle this challenge, we propose T2S-MPC, a framework that adaptively learns a residual dynamics model online and integrates it with the nominal model within the MPC framework to enable fast-evolving online planning. To make the model time-aware, we explicitly encode temporal information through a structured time embedding and employ a two-timescale update scheme, allowing the controller to capture nonstationary dynamics while balancing rapid adaptation with stable learning. We evaluate the proposed method on a 2D quadrotor across stabilization and trajectory tracking tasks under diverse time-varying disturbances, including linear drifting and periodic perturbations. Experimental results show that T2S-MPC consistently outperforms classical MPC, neural MPC, and ablated variants in control performance, while also demonstrating strong robustness across a wide range of disturbance conditions without additional tuning. The source code is publicly available at this https URL

[LG-80] DriftingMol: Decoder-Coupled Drift for One-Pass Property-Conditional Molecular Generation

链接: https://arxiv.org/abs/2605.24841
作者: Jiangjie Qiu,Yijun Li,Wentao Li,Xiaonan Wang
类目: Machine Learning (cs.LG)
*备注: 9 pages, 5 figures

点击查看摘要

Abstract:Property-conditional molecular generation should produce valid, diverse molecules while responding to continuous target values at low sampling cost. We introduce DriftingMol, a two-stage framework that adapts drifting models to a SELFIES latent molecular space. A frozen SELFIES beta-VAE provides the latent space, and the hidden representation of its decoder serves as the drift feature map. In decoder-coupled drift, decoder weights remain fixed, but drift gradients are backpropagated through the decoder feature map to a DiT generator, inducing a pullback metric aligned with molecular decoding. On ZINC250K, the default setting achieves QED Spearman correlation 0.493 with 94.7% uniqueness, while the strongest decoder-coupled condition reaches 0.510. Under protocol-matched four-property conditioning, decoder-coupled drift reaches mean Spearman correlation up to 0.598. Across 15 controlled variants, models that preserve the gradient path through decoder features achieve higher correlations than the tested latent-space, random-feature, and external-feature drift variants, while detached or stop-gradient decoder controls yield near-zero QED correlation and very low uniqueness. These results indicate that decoder-coupled drift is a useful low-cost mechanism for property-biased molecular generation, requiring one generator evaluation and one frozen decoder pass.

[LG-81] Active Learning for Stochastic Contextual Linear Bandits

链接: https://arxiv.org/abs/2605.24803
作者: Emma Brunskill,Ishani Karmarkar,Zhaoqi Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A key goal in stochastic contextual linear bandits is to efficiently learn a near-optimal policy. Prior algorithms for this problem learn a policy by strategically sampling actions but naively (passively) sampling contexts from the underlying context distribution. However, in many practical scenarios – including online content recommendation, survey research, and clinical trials – practitioners can actively sample or recruit contexts based on prior knowledge of the context distribution. Despite this potential for active learning, the role of strategic context sampling in stochastic contextual linear bandits is underexplored. We propose an algorithm that learns a near-optimal policy by strategically sampling rewards of context-action pairs. We prove instance-dependent theoretical guarantees demonstrating that our active context sampling strategy can improve over the minimax rate by up to a factor of $\sqrt{d}$, where $d$ is the linear dimension. We show empirically that our algorithm reduces the number of samples needed to learn a near-optimal policy, in tasks such as warfarin dose prediction and joke recommendation.

[LG-82] The Perception-Physics Paradox: Probing Scientific Alignment with TC-Bench ICML2026

链接: https://arxiv.org/abs/2605.24782
作者: Dingling Yao,Andrea Polesello,Adeel Pervez,Caroline Muller,Francesco Locatello
类目: Machine Learning (cs.LG)
*备注: Accepted at ICML 2026

点击查看摘要

Abstract:While Vision Foundation Models (VFMs) excel at predictive tasks on satellite imagery, their performance can arise from visual correlations rather than underlying structural invariants, making even perception-based out-of-distribution accuracy a poor proxy for scientific utility. As a result, models may look correct without reasoning correctly, a discrepancy we term the Perception-Physics Paradox. To address this gap, we introduce scientific alignment as an implicit objective for representation learning in scientific domains. We study a principled, testable aspect of scientific alignment through structural isomorphism, which requires latent representations to uniquely identify physical systems up to a linear reparameterization. This perspective induces a hierarchy of necessary conditions and yields a systematic probing protocol for physical and causal interpretability. To operationalize this framework, we release TC-Bench, a global, reproducible benchmark dataset with an automated construction pipeline for tropical cyclone research, and show that current VFMs rely on visual shortcuts that collapse in intense regimes, indicating that scientific alignment does not arise as a natural byproduct of scaling alone.

[LG-83] Hermite-NGP: Gradient-Augmented Hash Encoding for Learning PDEs ICML

链接: https://arxiv.org/abs/2605.24774
作者: Jinjin He,Zhiqi Li,Sinan Wang,Bo Zhu
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: Accepted by ICML this http URL page: this https URL

点击查看摘要

Abstract:We propose Hermite-NGP, a gradient-augmented multi-resolution hash encoding designed to enable fast and accurate computation of spatial derivatives for neural PDE solvers. Unlike existing NGP-based approaches that rely on automatic differentiation or finite differences and suffer from instability or high cost, Hermite-NGP explicitly stores function values and mixed partial derivatives at hash grid vertices, allowing fully analytic evaluation of gradients, Jacobians, and Hessians via Hermite interpolation. This design preserves the efficiency and spatial adaptivity of NGP while supporting analytic differential operators up to second order. We further introduce a multi-resolution curriculum training strategy analogous to multigrid V-cycles to enable coarse-to-fine optimization. Across a range of 2D and 3D PDE benchmarks, Hermite-NGP achieves up to approximately 20 times lower error than prior neural PDE methods, and reduces wall-clock convergence time by 2 to 10 times compared to other solvers, with per-epoch training times as low as 3.5 ms for models with up to 17M parameters.

[LG-84] CyberMaskQA: A Privacy-Aware Benchmark for Evaluating Large Language Models in Cybersecurity Question Answering

链接: https://arxiv.org/abs/2605.24765
作者: Matilda Gaddi,Jin Noh,Onat Gungor,Tajana Rosing
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly applied to cybersecurity question answering (QA) for critical tasks such as incident response and vulnerability analysis. However, real-world operational contexts, including system logs and network configurations, inherently contain sensitive identifiers, e.g., IP addresses, host names, and user accounts. Processing this data with cloud-based models is often unsafe or infeasible in regulated environments. Furthermore, progress in privacy-preserving QA is hindered by the lack of annotated, context-rich datasets capable of jointly evaluating operational reasoning and privacy preservation. To address this gap, we introduce CYBERMASKQA, a privacy-aware QA benchmark covering key security domains. Unlike existing benchmarks that primarily test factual knowledge, CYBERMASKQA grounds questions in realistic organizational contexts with explicit causal dependencies among assets and privileges. Generated through a systematic pipeline, the dataset combines human-curated base scenarios with LLM-driven semantic expansion, annotating each instance with precise private entity labels to enable controlled information disclosure. Evaluations of QA accuracy and masking performance demonstrate the benchmark’s utility for developing deployable, context-aware cybersecurity models and facilitating nuanced studies of privacy-utility trade-offs. Upon acceptance, we will release the dataset and the generation framework.

[LG-85] High-fidelity Modeling of Full-scale Pressurized Water Reactor Flow Fields for Machine Learning Applications

链接: https://arxiv.org/abs/2605.24763
作者: Logan A. Burnett,Hyungjun Kim,Hsien-Cheng Chou,Arsha Witoelar,Robert A. Brewster,Benoit Forget,Emilio Baglietto,Majdi I. Radaideh
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注: 30 pages, 10 figures, and 6 Tables

点击查看摘要

Abstract:This work presents a high-fidelity computational fluid dynamics (CFD) and data-driven modeling framework for assembly-level flow characterization in a four-loop pressurized water reactor (PWR). A full lower-plenum and core-inlet domain was constructed using publicly available geometry and operating conditions, enabling transient simulations with pump-induced swirl boundary conditions. The results show that cold-leg swirl and lower-plenum transport generate strongly heterogeneous assembly-wise inlet flow distributions, particularly near the lower core region, while axial resistance and mixing progressively homogenize the flow at higher elevations. These physics-informed datasets were subsequently used to evaluate machine learning (ML) applications for partial field reconstruction and short-term autoregressive prediction. A 3D convolutional-based inpainting model successfully reconstructed missing assembly-level mass flow rates from partial observations, with errors concentrated in the highly turbulent base (bottom) layer and diminishing significantly in upper layers. Comparative analysis across multiple ML models demonstrates that spatially aware architectures, particularly ConvLSTM, significantly outperform sequence-based (LSTM) and operator-learning (DeepONet) approaches by effectively capturing coupled spatio-temporal dynamics. The study also highlights key challenges, including the sensitivity of inlet flow predictions to turbulence and mesh resolution, as well as the absence of full-scale experimental validation data. Despite these limitations, the results remain consistent with expected physical behavior. Overall, this work establishes high-fidelity CFD as a critical foundation for developing data-driven surrogates, sparse sensing strategies, and future multiphysics coupling frameworks.

[LG-86] A Contractive Feedback Semantics for Reinforcement Learning

链接: https://arxiv.org/abs/2605.24759
作者: Zuyuan Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Discounted reinforcement learning is usually presented through Bellman equations on closed Markov decision processes. This paper develops a compositional view: a one-step decision process is treated as an open stochastic component, and infinite-horizon policy evaluation is obtained by closing a contractive feedback loop. The resulting semantics assigns typed Bellman transformers to open components, interprets series and parallel wiring as composition and tensoring of transformers, and interprets feedback as an admissible guarded Banach trace realized by a unique fixed point. This perspective yields three theoretical consequences. First, approximate component equivalence is a contextual congruence for admitted well-typed guarded one-hole contexts: local operator error remains controlled after plugging the component into a surrounding circuit that uses the hole once and whose feedback nodes have certified uniform guardedness. Second, exact and approximate state abstractions become commuting or near-commuting coalgebraic diagrams, giving value-preservation and explicit sup-norm distortion bounds. Third, under monotone $\omega$-continuous contract-transformer semantics, safety, risk, and resource specifications can be represented as quantale-valued contracts, where local inductive bounds lift through wiring and feedback by least-fixed-point reasoning. The central claim is not that all RL morphisms form a global traced monoidal category, but that discounted Bellman evaluation admits a contractive feedback semantics on the admissible class of guarded circuits.
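
The fixed-point view in this abstract can be illustrated on the simplest closed case. The following is a minimal sketch, assuming a hypothetical two-state Markov chain under a fixed policy (the values of `P`, `r`, and `gamma` are made up for illustration). Iterating the Bellman operator, which is a gamma-contraction in the sup norm, converges to the unique fixed point, and that fixed point agrees with a direct linear solve. It does not implement the paper's open-component semantics.

```python
import numpy as np

def evaluate_policy(P, r, gamma, tol=1e-10, max_iter=10_000):
    """Iterate v <- r + gamma * P v until successive iterates differ by < tol."""
    v = np.zeros_like(r, dtype=float)
    for _ in range(max_iter):
        v_next = r + gamma * (P @ v)
        if np.max(np.abs(v_next - v)) < tol:
            return v_next
        v = v_next
    return v

# Hypothetical two-state chain induced by a fixed policy (illustrative values).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])   # row-stochastic transition matrix
r = np.array([1.0, 0.0])     # expected one-step reward per state
gamma = 0.9                  # discount factor, gives the contraction modulus

v_iter = evaluate_policy(P, r, gamma)
# The same fixed point, obtained by solving (I - gamma * P) v = r directly.
v_exact = np.linalg.solve(np.eye(2) - gamma * P, r)
```

Because the operator contracts with modulus `gamma`, the distance to the fixed point shrinks by at least that factor on every iteration, which is why the loop terminates regardless of the starting vector.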

[LG-87] A computational phase transition for learning-to-sample from Ising models

链接: https://arxiv.org/abs/2605.24752
作者: Andrej Risteski,Thuy-Duong Vuong
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS); Probability (math.PR)
*备注:

点击查看摘要

Abstract:We study learning-to-sample – a basic algorithmic task underlying generative modeling – for Ising models, a standard testbed for algorithmic ideas in both theoretical computer science and machine learning. Given i.i.d. samples of an unknown target distribution, the goal of learning-to-sample is to learn a computationally efficient generation procedure that produces new samples following approximately the same distribution. We construct a family of Ising models of constantly bounded-width which lie just beyond the spectral threshold $\lambda_{\max}(J)-\lambda_{\min}(J)=1$, and show that learning-to-sample for this family is computationally hard under standard cryptographic assumptions, even when the learner is given both polynomially many i.i.d. samples from the model and explicit access to its parameters. Combined with results of [AJKPV24,KLV25] showing tractability of learning-to-sample below the spectral threshold, this establishes a sharp computational phase transition at the spectral threshold. Moreover, combined with prior results on parameter learning for bounded-width Ising models [KM17,WSD19,VML20], this shows that learning-to-sample can be more difficult than parameter learning. Finally, we show that any efficient learner for these hard instances exhibits a natural memorization-hallucination dichotomy: the learner must either output configurations that, after a simple transformation, match the (transformed) training data or place substantial mass on configurations of negligible probability under the target distribution.

[LG-88] Aligning Molecular Graph Explanations with Chemical Identity via InChIfied Invariants

链接: https://arxiv.org/abs/2605.24742
作者: Emanuele Guidotti,Sara Puglioli
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Obtaining consistent explanations for machine learning on molecular graphs requires predictions and attributions to be aligned with chemical identity. However, chemically equivalent drawings of the same molecule can induce different molecular representations, leading to inconsistent predictions and explanations. Here, we introduce InChIfied Invariants, a class of node, edge, and graph features based on the International Chemical Identifier (InChI) and designed to be invariant under transformations that preserve chemical identity. Using one million molecular graphs from PubChem Substances, we show that InChIfied Invariants produce identical representations for chemically equivalent graphs in 99.62% of cases, whereas standard Daylight invariants do so in only 0.35% of cases. Across MoleculeNet tasks, InChIfied Invariants preserve predictive performance while significantly improving prediction consistency across alternative graph depictions of the same molecules. We further perform a quantitative attribution analysis and show that explanations produced with standard molecular featurization methods vary substantially across chemically equivalent graphs, while InChIfied Invariants enforce consistent attributions by construction. We release open-source software implementing InChIfied Invariants, which can be used as a drop-in replacement for standard molecular graph features.

[LG-89] Reinforcement Learning for Reachability: Guaranteeing Asymptotic Optimality ICML2026

链接: https://arxiv.org/abs/2605.24740
作者: Amogh Palasamudram,Jakub Svoboda,Suguman Bansal,Krishnendu Chatterjee
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注: Main text and appendix of work accepted in ICML 2026

点击查看摘要

Abstract:Reinforcement learning (RL) for reachability specifications is fundamental in sequential decision-making, yet theoretical guarantees remain less explored. A recent work achieves asymptotic convergence to optimal policies. However, this approach provides limited insight into convergence dynamics. In this work, we present an alternative approach that provides deeper theoretical insights into convergence. Our approach builds on PAC learning with assumptions. PAC learning guarantees near-optimal policies with high confidence in finite time but requires knowing internal MDP parameters like minimum transition probability. We argue that while these parameters are unknown in RL, they can be iteratively refined and estimated with increasing accuracy. By iteratively satisfying PAC conditions, we show that exact optimality can be achieved in the limit. Empirical evaluations on standard benchmarks validate our theoretical insights into convergence dynamics.

[LG-90] Feature Learning in Wide Neural Networks under μP: Identifiability and Sparse-Dictionary Decomposition of the Mean-Field Limit

链接: https://arxiv.org/abs/2605.24710
作者: Akmal Xodarev
类目: Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 86 pages

点击查看摘要

Abstract:We establish four structural results for feature learning in wide two-layer neural networks under the Maximal Update Parametrization ($\mu$P). First, we prove global existence and uniqueness of the mean-field limit of noisy gradient descent under $\mu$P, identifying the maximal admissible weight $w^*$ on the moment sequence of the initialization as the reciprocal parameter-moment-growth boundary, and hence the largest weighted moment class propagated by the flow. The finite-particle approximation has uniform-in-time squared-Wasserstein rate $O(N^{-1})$. Second, we characterize identifiability of the mean-field limit: two admissible parameter measures induce the same network function in $L^2$ exactly when their active components agree modulo the finite-rank realization symmetry of the architecture. The orbit depth $D^*_{\mathrm{orb}}$ is separated from the moment-variety depth $D^*_{\mathrm{var}}$. Third, under the Barron-Hermite target condition the active support of the long-time limit measure admits a sparse-dictionary decomposition: it is supported on at most $S^*$ atoms modulo finite-rank realization symmetry, with $S^*$ bounded by an explicit coefficient-threshold number. Fourth, we derive the total feature-learning-error decomposition into statistical, optimization, propagation-of-chaos, and sparse-residual components, with a target-dependent Hermite/Barron tail replacing any initialization-only residual. The four results are tied together by an architectural identity: the triple $(w^*, D^*_{\mathrm{orb}}, S^*)$ – the maximal admissible weight, the orbit identifiability depth, and the sparse-dictionary depth at which the target is realizable – is the natural learning cell of the architecture-data pair $(\sigma, \rho)$. The proofs are self-contained except for standard results from $\mu$P and mean-field Langevin theory.

[LG-91] Streaming Reinforcement Learning under Partial Observability with Real-Time Recurrent Learning

链接: https://arxiv.org/abs/2605.24709
作者: Noah Farr,Aryaman Reddi,Carlo D’Eramo,Jan Peters
类目: Machine Learning (cs.LG)
*备注: 16 pages, 4 figures

点击查看摘要

Abstract:Streaming reinforcement learning has emerged as an online learning paradigm that conforms to the restrictions of natural learning agents that process data incrementally, i.e. with a batch size of 1 and no replay buffer. While streaming RL has recently been shown to scale with deep function approximation with full observability, partially observable settings have remained out of reach. Truncated backpropagation through time collapses to a one-step gradient horizon under the streaming setting, and exact real-time recurrent learning is prohibitively expensive. We close this gap using recurrent trace units, a diagonal recurrent architecture that enables exact RTRL with linear time and memory complexity in the parameter count, and show that they integrate cleanly into existing streaming algorithms across both discrete and continuous control. On a MemoryChain diagnostic with chain lengths from 2 to 128, our method sustains performance where streaming TBPTT(1) baselines using feedforward, GRU, and RTU networks collapse. On five POPGym tasks and on partially observable MuJoCo continuous control, the streaming approach is competitive with batched PPO on POPGym and recovers a substantial fraction of batched performance on masked MuJoCo, despite using no replay buffer or batched updates.

[LG-92] CALIBURN: A Regime-Sensitivity Study of Operationally Calibrated Streaming Intrusion Detection

链接: https://arxiv.org/abs/2605.24696
作者: Michel A. Youssef
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 55 pages, 5 figures, 14 tables. Under review at Cyber Security and Applications. Code: this https URL . Archived release: this https URL

点击查看摘要

Abstract:Streaming network intrusion detection systems must process flows continuously while keeping memory bounded, but most current methods leave alerting threshold selection as a post-hoc tuning problem poorly suited to production. Operators need alerting behaviour specifiable before deployment using inputs such as false-negative cost, false-positive cost, and alerting budget. This paper presents CALIBURN, a five-component streaming alerting pipeline composed of a truncated Bayesian online change-point detector, an isotonic calibration layer mapping the change-point posterior to an empirical conditional attack probability, a cost-sensitive decision threshold derived from operator-specified misclassification costs, a Conformal Risk Control wrapper that converts an alert-budget specification into a within-window valid threshold under exchangeability, and a multi-window burn-rate alerting layer adapted from Site Reliability Engineering practice. Rather than claiming uniform dominance, we present CALIBURN as a regime-sensitivity study, evaluating the pipeline across three attack-prevalence regimes: LITNET-2020 at 5.2 percent, CICIDS2017 at 22.06 percent, and UNSW-NB15 at 64 percent. In the rare-attack regime, CALIBURN achieves AUC-PR 0.943 on LITNET-2020, outperforming the best streaming baseline by 2.21x and the best batch reference by 4.12x; isotonic calibration reduces Brier score by 30 percent. In the moderate-prevalence regime, CALIBURN remains the strongest streaming method on CICIDS2017 but is exceeded by batch density methods. In the high-prevalence regime, all streaming methods approach the prevalence floor. We further identify two distinct CRC-collapse mechanisms driving the alert rule to degeneracy at small alpha, treating both as operational guidance for practitioners.
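
The abstract does not spell out how the cost-sensitive threshold is derived. As a hedged illustration, the sketch below uses the standard Bayes decision rule for a calibrated probability: alert when the expected cost of staying silent, `p * cost_fn`, exceeds the expected cost of alerting, `(1 - p) * cost_fp`. The function names and cost values are hypothetical and are not taken from the CALIBURN code.

```python
def cost_sensitive_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability above which alerting has lower expected cost than staying silent."""
    if cost_fp <= 0 or cost_fn <= 0:
        raise ValueError("costs must be positive")
    # Alert iff p * cost_fn > (1 - p) * cost_fp, i.e. p > cost_fp / (cost_fp + cost_fn).
    return cost_fp / (cost_fp + cost_fn)

def should_alert(p_attack: float, cost_fp: float, cost_fn: float) -> bool:
    """Decide on one flow given its calibrated attack probability."""
    return p_attack > cost_sensitive_threshold(cost_fp, cost_fn)

# Hypothetical operator inputs: a missed attack costs nine times a false alarm.
threshold = cost_sensitive_threshold(cost_fp=1.0, cost_fn=9.0)
```

The rule only makes sense when `p_attack` is a calibrated probability, which is consistent with the pipeline placing a calibration layer before the threshold.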

[LG-93] Sum of Costs Diffusion with Dynamic Guidance for Motion Planning ICRA

链接: https://arxiv.org/abs/2605.24690
作者: Aysu Aylin Kaplan,Özgür Erkent
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Accepted at the Frontiers of Optimization for Robotics Workshop at the IEEE International Conference of Robotics Automation (ICRA), 2026

点击查看摘要

Abstract:The motion planning problem for robotic manipulation can be addressed through classical or deep learning approaches. Existing methods face significant challenges in generalizing to diverse settings. In this study, we present a method with high generalization capability that generates collision-free trajectories using diffusion models where the denoising process is guided by the gradient of the total collision cost. We also present a dynamic approach for choosing the start step of the gradient guidance. Experimental results demonstrate that guiding the diffusion model dynamically with the sum of collision costs offers more robust performance by overcoming the generalization issues faced by competing methods. The proposed model demonstrates its effectiveness by achieving the highest performance on diverse test settings in the M$\pi$nets dataset among the compared methods.

[LG-94] Trajectory-Based Difficulty Scoring for Reliable Learning on Tabular Data

链接: https://arxiv.org/abs/2605.24680
作者: Tomer Lavi,Bracha Shapira,Nadav Rappoport
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Gradient-boosted trees achieve strong performance on tabular data, yet often leave a long tail of poorly predicted instances. We introduce a Trajectory-based Difficulty Score (TDS), an instance-level difficulty estimator for boosted ensembles derived from per-tree cumulative prediction trajectories. For each instance, we compute interpretable trajectory descriptors (e.g., variance, oscillation peaks, sign switches, and tail stability) and train a lightweight regression model to predict held-out loss. An empirical CDF calibrates the resulting signal into a score in [0,1] that supports ranking hard cases. Across diverse tabular benchmarks and ensemble sizes, TDS exhibits strong rank correlation with error and outperforms established instance-hardness and uncertainty baselines on classification, while remaining competitive on regression. We then show how a single difficulty signal improves multiple data mining workflows: difficulty-driven active learning for label-efficient training, difficulty-thresholded selective prediction for improved risk-coverage trade-offs, and TDS-stratified (Mondrian) conformal prediction for more uniform conditional coverage. Finally, clustering high-TDS instances using SHAP attributions reveals coherent failure modes characterized by compact feature-value ranges, supporting error analysis and targeted data acquisition.
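
The calibration step mentioned in the abstract, where an empirical CDF turns a raw difficulty signal into a score in [0,1], can be sketched as follows. This is a minimal illustration under stated assumptions: `fit_ecdf` and the toy numbers are hypothetical, and the raw values stand in for whatever the paper's regression model predicts.

```python
import numpy as np

def fit_ecdf(reference_scores):
    """Return a function mapping raw scores to empirical CDF values in [0, 1]."""
    ref = np.sort(np.asarray(reference_scores, dtype=float))

    def ecdf(raw):
        # Fraction of reference scores that are <= each raw value.
        return np.searchsorted(ref, raw, side="right") / ref.size

    return ecdf

# Hypothetical raw difficulty predictions on a reference (calibration) set.
reference = [0.1, 0.4, 0.2, 0.9, 0.3]
to_score = fit_ecdf(reference)

# Calibrated scores for three new instances.
scores = to_score(np.array([0.05, 0.3, 1.0]))
```

Since the empirical CDF is monotone, the calibrated score preserves the ranking of the raw signal while making thresholds comparable across datasets.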

[LG-95] IterInject: Indirect Prompt Injection Against LLM Agents via Feedback-Guided Iterative Optimization EMNLP2026

链接: https://arxiv.org/abs/2605.24659
作者: Zixuan Chen,Jiaxiang Chen,Li Luo,Ke Xu,Xiaoxiang Huang,Tanfeng Sun,Xinghao Jiang
类目: Machine Learning (cs.LG)
*备注: Submitted to EMNLP 2026

点击查看摘要

Abstract:LLM-based agents are increasingly deployed for complex tasks requiring planning, tool use, and interaction with external services. Their reliance on untrusted external content exposes them to indirect prompt injection (IPI), in which adversarial instructions embedded in retrieved data hijack agent behavior. Existing attacks rely on static payloads that cannot adapt to agent-specific defenses; even recent adaptive methods lack structured feedback to guide optimization. We introduce IterInject, a feedback-guided iterative framework that closes the loop between injection, diagnosis, and refinement: a rule-based diagnoser produces structured outcome labels with behavioral descriptions, and an LLM-based optimizer refines payloads conditioned on the full optimization history. A synthesis step generates new disguise seeds from failure patterns, enabling the strategy space to self-evolve. On AgentDojo and InjectAgent, IterInject substantially outperforms static baselines and existing adaptive methods across four victim models. Extension experiments on Claude Code, a production-grade coding agent with layered defenses, show that optimized payloads achieve full success on 5 of 9 targets; even those that resist full exploitation exhibit measurable improvement from iterative refinement. We further present a mechanistic analysis of IPI, identifying an attention-mediated threshold mechanism in mid-to-late layers; three causal interventions validate this finding and point to concrete defense directions.

[LG-96] WLNO: Wavelet-Laplace Neural Operator for Solving Partial Differential Equations

链接: https://arxiv.org/abs/2605.24658
作者: Muhammad Abid,Arth Sojitra,Omer San
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This work introduces the Wavelet-Laplace Neural Operator (WLNO), a novel neural operator that fuses Haar wavelet multi-scale spatial decomposition with the Laplace-domain pole-residue formulation of the Laplace Neural Operator (LNO). While LNO captures transient and steady-state dynamics through learnable system poles and residues, it lacks an explicit mechanism for extracting spatially localized multi-scale features inherent in complex PDE solutions. WLNO addresses this by augmenting the LNO core with a parallel single-level Haar discrete wavelet transform (DWT) branch that decomposes the lifted feature map into four frequency subbands, namely approximation (LL), horizontal detail (LH), vertical detail (HL), and diagonal detail (HH), and applies independent learned $1\times 1$ convolutions to each subband before reconstruction via the inverse DWT. The two branches are fused through a learnable sigmoid-gated weight $\alpha_{\mathrm{wav}}$, initialized to give a small initial contribution to the wavelet branch, allowing the model to adaptively balance Laplace-domain dynamics against spatial multi-scale features throughout training. WLNO is evaluated against LNO on five benchmark PDE problems using identical hyperparameters, training data, and evaluation protocols: the diffusion equation, the Burgers equation, the reaction-diffusion system, Darcy flow, and the two-dimensional Navier-Stokes equation. WLNO consistently outperforms LNO on all five problems, with the most pronounced improvement on problems with strong spatial multi-scale structure, such as the Burgers equation with sharp shock fronts and the Navier-Stokes equation with coherent vortical structures, while remaining consistent across smoother and elliptic problems. These results demonstrate that wavelet-based multi-scale spatial decomposition is a principled and effective complement to Laplace-domain operator learning.
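
To make the wavelet branch concrete, here is a rough NumPy sketch of a single-level 2D Haar transform, its inverse, per-subband scaling, and a sigmoid-gated fusion. Several pieces are assumptions rather than the paper's implementation: the scalar weights stand in for learned 1x1 convolutions on a single channel, the additive form of `fuse` is one plausible reading of "sigmoid-gated weight", and the LH/HL labels follow one of several naming conventions in use.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level orthonormal 2D Haar transform (even height and width)."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]   # top-left, top-right of each 2x2 block
    c, d = x[1::2, 0::2], x[1::2, 1::2]   # bottom-left, bottom-right
    ll = (a + b + c + d) / 2.0            # approximation
    lh = (a - b + c - d) / 2.0            # difference across columns
    hl = (a + b - c - d) / 2.0            # difference across rows
    hh = (a - b - c + d) / 2.0            # diagonal difference
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w))
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

def wavelet_branch(x, weights):
    """Scale each subband (a stand-in for a learned 1x1 convolution) and reconstruct."""
    bands = haar_dwt2(x)
    return haar_idwt2(*(w * band for w, band in zip(weights, bands)))

def fuse(laplace_out, wavelet_out, alpha_wav):
    """Assumed fusion: add the wavelet branch scaled by sigmoid(alpha_wav)."""
    gate = 1.0 / (1.0 + np.exp(-alpha_wav))
    return laplace_out + gate * wavelet_out

x = np.arange(16.0).reshape(4, 4)
reconstructed = haar_idwt2(*haar_dwt2(x))
# A strongly negative alpha_wav gives the wavelet branch a small initial contribution.
fused = fuse(x, wavelet_branch(x, (1.0, 1.0, 1.0, 1.0)), alpha_wav=-4.0)
```

With unit subband weights the wavelet branch is the identity, so any change in the fused output comes entirely from the gate and from weights that move away from one.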

[LG-97] WINO: A Weak-Form Physics Informed Neural Operator for Hyperelasticity on Variable Domains

Link: https://arxiv.org/abs/2605.24651
Authors: Bokai Zhu, Qinghui Zhang, Timon Rabczuk
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We propose a Weak-form Physics-Informed Neural Operator (WINO), a data-free framework that combines the efficiency of neural operators with the geometric flexibility of the \varphi-finite element method (\varphi-FEM). \varphi-FEM is an unfitted method that accommodates geometric variations without body-fitted meshes, where the domain geometry is represented by the level-set function \varphi. To impose the boundary conditions, Dirichlet problems adopt the \varphi-FEM lifting so only the homogeneous displacement contribution is learned, whereas traction-driven Neumann problems additionally predict the auxiliary fields necessary for the unfitted weak formulation. Parameters are trained by minimizing squared weak-form residuals aligned with \varphi-FEM together with squared penalties on the cut-cell auxiliary equations, which removes the need for large paired datasets of converged reference solutions. After training, WINO outputs can seed the nonlinear \varphi-FEM solvers as neural operator warm starts (NOWS), which reduce iteration counts relative to traditional cold-started solvers. Numerical benchmarks show that WINO achieves high accuracy, with errors below 0.04 across all benchmarks, while reducing total computational time by 50–80% compared with purely data-driven methods.

[LG-98] Beyond Fixed Points: Superpolynomial Capacity of Asymmetric Hopfield Networks

Link: https://arxiv.org/abs/2605.24611
Authors: Aakash Kumar, Anatoly Khina, Frederik Mallmann-Trenn, Emanuele Natale
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Classical Hopfield networks are limited to static patterns due to symmetric weights, whereas asymmetric networks can encode temporal sequences via limit-cycle attractors. Achieving high-capacity storage of long sequences in classical synchronous asymmetric networks, however, has remained a challenge. We present a simple and robust construction within the classical asymmetric Hopfield model with binary neurons and synchronous updates, that allows n neurons to support \exp\!\big(\Omega(n/(\log n)^2)\big) distinct limit-cycle attractors, each with period \exp\!\big(\Omega(\sqrt{n}/\log n)\big) and robust to random noise with flip probability up to \frac{1}{2}-o(1), yielding superpolynomial capacity in both the number and length of stored sequences. This is the first demonstration of such capacity for asymmetric Hopfield networks, which we obtain by combining results from combinatorics, number theory and the analysis of opinion dynamics. Our findings show that synchronous asymmetric Hopfield networks possess a sequence-memory capacity which is larger and more robust than previously recognized, demonstrating that, in both biological and artificial neural systems, robust sequence representation can be achieved through coarse architectural motifs rather than complex nonlinearities.
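
To make the notion of a limit-cycle attractor concrete, the sketch below stores a short cycle in an asymmetric network with synchronous sign updates. It uses the classical textbook rule of summing outer products of each pattern with its successor, which is not the construction from this paper; the helper names and the three orthogonal patterns are hypothetical choices for illustration.

```python
def sign(value):
    """Binary neuron activation; ties are broken toward +1."""
    return 1 if value >= 0 else -1

def build_cycle_weights(patterns):
    """Asymmetric weights W = sum over mu of next_pattern * current_pattern^T."""
    n = len(patterns[0])
    weights = [[0] * n for _ in range(n)]
    for mu, current in enumerate(patterns):
        nxt = patterns[(mu + 1) % len(patterns)]  # successor in the cycle
        for i in range(n):
            for j in range(n):
                weights[i][j] += nxt[i] * current[j]
    return weights

def synchronous_step(weights, state):
    """Update every neuron at once from the same previous state."""
    return [sign(sum(w * s for w, s in zip(row, state))) for row in weights]

# Three mutually orthogonal +/-1 patterns on n = 4 neurons.
patterns = [
    [1, 1, 1, 1],
    [1, -1, 1, -1],
    [1, 1, -1, -1],
]
weights = build_cycle_weights(patterns)

state = patterns[0]
trajectory = [state]
for _ in range(3):
    state = synchronous_step(weights, state)
    trajectory.append(state)
```

Because the patterns are orthogonal, each synchronous step maps a stored pattern exactly onto its successor, so the trajectory returns to the starting pattern after three steps. The weight matrix is not symmetric, which is what permits a cycle instead of a fixed point.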

[LG-99] Position: AI for Science Should Treat Measurement-to-Dataset Pipelines as Inference Components

Link: https://arxiv.org/abs/2605.24558
Authors: Ling Zhan, Xiaoyao Yu, Tao Jia
Subjects: Machine Learning (cs.LG)
Comments: 23 pages, 5 figures, Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026

Click to view abstract

Abstract:AI for Science (AI4Science) workflows often treat the released dataset as a fixed interface to the underlying system. However, in domains relying on indirect observation, the learner observes a derivative representation produced by multi-stage measurement, reconstruction, and preprocessing pipelines. We argue that these measurement-to-dataset pipelines are inference components: treating their outputs as “given data” freezes an observation model and obscures uncertainty over feasible pipeline choices. We identify three failure modes arising from this “frozen lens”: (C1) hidden hypothesis space, where the released dataset does not specify the pipeline configuration or its validity conditions; (C2) uncertified transportability, where a pipeline may be documented but its regime of validity is untested, so failures under distribution shift cannot be adjudicated; (C3) ungoverned multiplicity, where many defensible pipelines exist and dispersion is real but not propagated into uncertainty-aware evidence. We stress-test these claims with a large-scale neuroscience empirical audit, finding a survival rate of approximately 0.0004% under a cross-dataset stability criterion. We call on the AI4Science community to make pipelines computable inference objects via domain-specific Computable Observation Frameworks. This shift enables quantifying pipeline adequacy and stability, converting implicit implementation choices into auditable, reproducible, and cumulative scientific evidence.

[LG-100] Deep ZakaiJ: Structured Filtering for Jump-Diffusion Time Series Forecasting

Link: https://arxiv.org/abs/2605.24548
Authors: Yan Leng, Thibaut Mastrolia, Hao Wang
Subjects: Machine Learning (cs.LG); Probability (math.PR)
Comments:

Click to view abstract

Abstract:Time series driven by unobserved latent states frequently exhibit abrupt jump discontinuities whose timing and magnitude cannot be predicted from observed history alone. Classical jump-diffusion models offer a principled mathematical framework but assume rigid parametric forms, while recent neural jump models operate on fully observed trajectories without inferring the hidden states that govern the dynamics. We propose Deep ZakaiJ, a latent-state model for partially observed jump-diffusion systems that embeds the Zakai nonlinear filtering equation into a neural encoder–decoder architecture. The encoder recursively updates a belief over the latent state via Strang splitting into three interpretable substeps: prior propagation, diffusion innovation, and jump innovation, yielding a differentiable, first-order-accurate approximation of the exact filtering evolution. The decoder is a structured jump-diffusion model explicitly conditioned on the filtered belief, preserving the separation between continuous dynamics and discontinuous shocks. On synthetic, financial, and oceanographic datasets, Deep ZakaiJ improves distributional forecasts while remaining competitive in point accuracy, achieving calibrated predictive intervals and recovering interpretable latent structure in synthetic and qualitative case studies.

[LG-101] RL with Learnable Textual Feedback: A Bilevel Approach

Link: https://arxiv.org/abs/2605.24547
Authors: Utsav Singh, Sidhaarth Sredharan, Souradip Chakraborty, Amrit Singh Bedi
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Reinforcement learning with verifiable rewards can improve LLM reasoning, but learning remains sample-inefficient when terminal rewards are sparse. This has motivated a growing line of work on RL with textual feedback, where a critic model generates natural language feedback to guide a reasoning model (the actor), augmenting scalar rewards with richer learning signals. However, existing methods typically treat feedback as fixed or auxiliary, which misses a key property: feedback should not merely be correct, but should improve the policy (actor model) when provided in context. This motivates a paradigm of learnable textual feedback for RL. Yet the learnability and usefulness of feedback depend on the policy’s ability to learn from it, making RL with learnable feedback an inherently bilevel problem. We formalize this coupling as a Stackelberg bilevel program and derive Bilevel Natural Language Actor-Critic (Bi-NAC), which jointly trains a critic to generate reward-improving feedback and an actor to exploit it. Across MATH-500, MBPP, and GPQA, Bi-NAC improves sample and parameter efficiency over RL and fixed-critic baselines: our 2B model outperforms the 3B GRPO baseline, achieving 46.6% versus 41.4% on MATH-500, while our 6B model surpasses the 7B GRPO baseline, achieving 49.3% versus 43.6% on GPQA.

[LG-102] Steering Beyond the Support: Adversarial Training on Unsupervised Jailbroken Activation Simulation ICML2026

Link: https://arxiv.org/abs/2605.24535
Authors: Luoyu Chen, Weiqi Wang, Zhiyi Tian, Chenhan Zhang, Feng Wu, Jianhuan Huang, Ahmed Asiri, Shui Yu
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Accepted by ICML 2026

Click to view abstract

Abstract:Jailbreak prompts can trigger harmful completions on aligned LLMs. In response, safety steering has been proposed: test-time activation interventions that steer jailbreak activations to trigger refusal while preserving benign utility. However, existing steering methods are fundamentally supervised and tied to a static, limited training set, whereas real jailbreaks evolve and are often out-of-distribution relative to the training set, leading to failures on unseen attacks. In this paper, we tackle the problem of failure on unseen jailbreaks, based on unsupervised latent direction discovery. We propose a bi-level adversarial training framework for zero-shot jailbreak defense. In the inner step, we simulate diverse jailbroken activations by extrapolating from refusal-state harmful-request activations via unsupervised latent direction discovery, which expands the coverage of real jailbreak activation subspaces. In the outer step, we train a potential-induced steering field to push these adversarial jailbroken states into refusal regions while keeping benign activations unchanged. Across three LLMs and six classical jailbreak families, our method achieves strong defense with attack success rates mostly below 5%, and rising subspace coverage throughout training helps explain the improved generalization.

[LG-103] Lake Detection and Water Quality Estimation in Sentinel-2 Data

Link: https://arxiv.org/abs/2605.24515
Authors: Iulia Pleşu, Alexandra Băicoianu, Ioana Cristina Plajer
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:With climate change and increasing human pressure on natural landscapes, inland water resources are becoming progressively scarcer, more vulnerable, and more difficult to manage sustainably. Reliable and automated methods for detecting, monitoring, and assessing surface water bodies are therefore of growing scientific and practical importance. In this paper, we investigate and compare three distinct machine learning architectures for water body identification and monitoring. Their performance is evaluated through quantitative metrics and real-world examples. Furthermore, a direct comparison with classical NDWI thresholding is conducted on a representative test image to highlight differences between data-driven and index-based approaches. This analysis allows us to identify the best-performing model in terms of accuracy, robustness, and practical applicability. Beyond detection, a major challenge for meaningful water quality assessment lies in the consistent and interpretable visualization of spectral water indices. Standard color mapping techniques are often inadequate or potentially misleading for environmental applications. To address this gap, we propose a suite of meaningful color schemes adapted for water quality indices, facilitating clearer interpretation, comparison, and decision-making for human users.
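
The classical NDWI thresholding baseline mentioned in this abstract can be sketched as follows. The index is the standard McFeeters form, (Green - NIR) / (Green + NIR), which for Sentinel-2 is usually computed from bands B3 and B8. The function names, the zero threshold, and the reflectance values are illustrative assumptions; the abstract does not state which threshold the authors used.

```python
def ndwi(green, nir):
    """Normalized Difference Water Index for one pixel (McFeeters form)."""
    denominator = green + nir
    if denominator == 0:
        return 0.0  # avoid division by zero on empty pixels
    return (green - nir) / denominator

def water_mask(green_band, nir_band, threshold=0.0):
    """Mark a pixel as water when its NDWI exceeds the threshold."""
    return [
        [ndwi(g, n) > threshold for g, n in zip(green_row, nir_row)]
        for green_row, nir_row in zip(green_band, nir_band)
    ]

# Hypothetical 2x2 reflectance patches (green = B3, near-infrared = B8).
green_band = [[0.10, 0.08], [0.05, 0.09]]
nir_band = [[0.03, 0.30], [0.25, 0.02]]
mask = water_mask(green_band, nir_band)
```

Water absorbs strongly in the near-infrared, so water pixels tend to have positive NDWI while vegetation and soil tend to be negative. A fixed threshold is the index-based approach that the learned models in the paper are compared against.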

[LG-104] Zeroth-Order Nonconvex Nonsmooth Optimization with Heavy-Tailed Noise

Link: https://arxiv.org/abs/2605.24513
Authors: Zhuanghua Liu, Luo Luo
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:This paper considers the nonconvex nonsmooth problem in which the objective function is Lipschitz continuous. We focus on the stochastic setting where the algorithm can access stochastic function value evaluations with heavy-tailed noise, which is prevalent in many popular machine learning applications. We propose a stochastic zeroth-order algorithm that refines the framework of online-to-nonconvex conversion by clipping the two-point gradient estimator. The theoretical analysis shows that our algorithm can find a (\delta, \epsilon)-Goldstein stationary point with zeroth-order oracle complexity of \mathcal{O}(d^{\frac{p}{2(p-1)}}\delta^{-1}\epsilon^{-\frac{2p-1}{p-1}}), where d is the problem dimension and p\in(1,2] is the order of bounded moments. Note that our dependence on dimension d matches the best-known results of stochastic zeroth-order optimization for finding the sub-optimal solution of a stochastic convex nonsmooth problem. In addition, our dependence on accuracy parameters \delta and \epsilon is consistent with that of the best-known stochastic first-order algorithms for stochastic nonconvex nonsmooth problems. Finally, we conduct numerical experiments to demonstrate the effectiveness of the proposed method.
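
A clipped two-point gradient estimator of the kind this abstract refers to might look like the sketch below. It uses the widely used randomized-smoothing form g = (d / (2 * delta)) * (f(x + delta * u) - f(x - delta * u)) * u with u drawn uniformly from the unit sphere, followed by norm clipping. The function names, the clipping threshold, and the test objective are hypothetical, and the paper's exact estimator and clipping rule may differ.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Uniform direction on the unit sphere via normalized Gaussians."""
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in vec))
    return [c / norm for c in vec]

def clipped_two_point_grad(f, x, delta, clip, rng):
    """Two-point zeroth-order gradient estimate with its norm clipped."""
    dim = len(x)
    u = random_unit_vector(dim, rng)
    f_plus = f([xi + delta * ui for xi, ui in zip(x, u)])
    f_minus = f([xi - delta * ui for xi, ui in zip(x, u)])
    scale = dim * (f_plus - f_minus) / (2.0 * delta)
    grad = [scale * ui for ui in u]
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > clip:
        # Rescale so heavy-tailed function-value noise cannot blow up the step.
        grad = [g * clip / norm for g in grad]
    return grad

def l1_objective(x):
    """A Lipschitz, nonsmooth test function (illustrative only)."""
    return sum(abs(xi) for xi in x)

rng = random.Random(0)
estimate = clipped_two_point_grad(l1_objective, [1.0, -2.0, 0.5],
                                  delta=0.1, clip=1.0, rng=rng)
estimate_norm = math.sqrt(sum(g * g for g in estimate))
```

Clipping bounds the magnitude of every update regardless of how extreme a single noisy function evaluation is, which is the usual motivation for it when only a p-th moment with p in (1, 2] is bounded.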

[LG-105] The Normalized Maximum Likelihood for Regular Non-Smooth Models: Measure-Theoretic Foundations and Geometric Sampling

Link: https://arxiv.org/abs/2605.24477
Authors: Trenton Lau, Gary P. T. Choi
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Statistics Theory (math.ST)
Comments:

Click to view abstract

Abstract:The Normalized Maximum Likelihood (NML) codelength, or stochastic complexity, represents a principled criterion for universal coding. While recent coarea-based formulations provided a calculation method for smooth models, this framework collapses for the non-smooth estimators ubiquitous in modern machine learning (e.g., Lasso, Sparse SVMs). In this work, we provide a rigorous framework for computing the NML for regular path-differentiable Lipschitz (PDL) estimators. By applying classical geometric measure theory and bridging the coarea formula with conservative Jacobians, we prove that the stochastic complexity for non-smooth models is well-posed and theoretically consistent with the outputs of modern Automatic Differentiation. To compute this quantity exactly, we introduce the Propose-and-Project Metropolis-Hastings (PDL-PPMH) sampler, a geometric MCMC algorithm capable of traversing the non-differentiable level sets of the maximum likelihood estimator. We theoretically justify its components, including a stochastic tangent space proposal and a provably convergent non-smooth projection solver. We demonstrate the method’s robustness by sampling from a high-dimensional Lasso posterior (P=2000), while simultaneously quantifying the computational scaling that governs the trade-off between exactness and mixing time. Crucially, we empirically demonstrate that our exact NML criterion provides a highly data-efficient alternative to cross-validation, achieving statistically indistinguishable predictive optima without requiring data splitting. Altogether, our work paves the way for the theoretical analysis of the NML codelength for regular non-smooth models.

[LG-106] Asymmetric Adaptation-based Real-time Fault Diagnosis Under Transitional Operating Conditions

Link: https://arxiv.org/abs/2605.24457
Authors: Hongshuo Zhao, Zeyi Liu, Xiao He
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: 6 pages, 3 figures, Accepted by ICAIS ISAS 2026

Click to view abstract

Abstract:Data streams in real-world industrial scenarios often contain transitional operating conditions that are not covered during offline training, leading to significant distribution shifts. To bridge the gap between static offline models and dynamic online data, a novel asymmetric adaptation-based fault diagnosis method is proposed in this paper. Specifically, in the offline stage, we employ domain generalization techniques to extract domain-invariant features from multiple stable conditions and construct robust normalized fault prototypes as reference anchors. Subsequently, during online inference, we design an online test-time adaptation method based on a periodic prototype re-projection mechanism to dynamically update prototype positions. Furthermore, we utilize the geometric distribution derived from anchors to guide the updates of classifiers and adopt an asymmetric learning rate strategy for the feature extractor and classifier. The proposed approach ensures rapid adaptation to new transitional conditions while preserving the discriminative power inherited from the offline domain generalization initialization. Experimental results demonstrate that this mechanism effectively leverages offline generalized knowledge to guide online inference, significantly improving robustness in non-stationary environments.

[LG-107] Vision-Guided Outdoor Flight and Obstacle Evasion via Reinforcement Learning

Link: https://arxiv.org/abs/2605.24449
Authors: Shiladitya Dutta, Aayush Gupta, Varun Saran, Avideh Zakhor
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: Published in IEEE Robotics and Automation Letters, vol 11, no 2. Presented at the IEEE International Conference on Robotics and Automation 2026

Click to view abstract

Abstract:Although quadcopters boast impressive traversal capabilities enabled by their omnidirectional maneuverability, the need for continuous pilot control in complex environments impedes their application in GNSS and telemetry-denied scenarios. To this end, we propose a novel sensorimotor policy that uses stereo-vision depth and visual-inertial odometry (VIO) to autonomously navigate through obstacles in an unknown environment to reach a goal point. The policy is comprised of a pre-trained autoencoder as the perception head followed by a planning and control LSTM network which outputs velocity commands that can be followed by an off-the-shelf commercial drone. We leverage reinforcement and privileged learning paradigms to train the policy in simulation through a two-stage process: 1) initial training with optimal trajectories generated by a global motion planner acting as a supervisory backbone, 2) further fine-tuning in a curriculum environment. To bridge the sim-to-real gap, we employ domain randomization and reward shaping to create a policy that is both robust to noise and domain shift. In outdoor experiments, our approach achieves successful zero-shot transfer to both obstacle environments and a drone platform that were never encountered during training.

[LG-108] CAffNet: Hard Constraint-Affine Neural Networks

Link: https://arxiv.org/abs/2605.24437
Authors: Yang Zhao, Jungeun Lee, Jeong hwan Jeon, Sze Zheng Yong
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We present a novel framework for embedding hard constraint satisfaction into neural network (NN) architectures, specifically feedforward neural networks and transformers, with input-dependent affine constraints of arbitrary cardinality. Traditional constraint enforcement approaches either rely on penalty-based soft constraints, which offer no guarantee of satisfaction, or on post-processing methods that enforce constraints after the NN is trained, which may lead to suboptimality. We introduce a trainable constraint-affine (CAffine) layer into NNs, yielding CAffNet, which goes beyond enforcing affine constraints via fixed orthogonal or parallel projections and enables joint optimization with network parameters. Moreover, we impose no restrictions on the constraint space dimensions and establish that our construction preserves the universal approximation properties of NNs, while providing provable guarantees on constraint adherence for all inputs. Experimental validation demonstrates robust performance across diverse domains requiring guaranteed constraint satisfaction.
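
For context, the fixed orthogonal projection that this abstract contrasts CAffNet against can be sketched for a single affine equality constraint a · x = b. This shows only the baseline idea, not the trainable CAffine layer, whose construction the abstract does not describe. The function name and the example numbers are hypothetical, and treating the constraint as an equality is an assumption.

```python
def project_onto_affine(x, a, b):
    """Orthogonal projection of x onto the hyperplane {z : a . z = b}."""
    a_dot_a = sum(ai * ai for ai in a)
    if a_dot_a == 0:
        raise ValueError("constraint normal vector must be nonzero")
    residual = sum(ai * xi for ai, xi in zip(a, x)) - b
    # Move x along the normal direction just far enough to cancel the residual.
    return [xi - ai * residual / a_dot_a for xi, ai in zip(x, a)]

# A raw network output that violates the constraint x0 + x1 = 1.
raw_output = [2.0, 2.0]
constrained = project_onto_affine(raw_output, a=[1.0, 1.0], b=1.0)
```

Appending such a projection to a network guarantees that the output satisfies the constraint for every input, because a and b can be recomputed per input. The abstract states that CAffNet goes beyond this fixed projection by making the constraint-enforcing layer trainable.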

[LG-109] Smoother Action Chunking Flow Policy via Prior-Corrected Orthogonal Trust-Region Guidance

Link: https://arxiv.org/abs/2605.24433
Authors: Kai Fang, Hailong Pei, Xuemin Chi
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Flow-matching robot policies commonly use action-chunking inference for efficient closed-loop control, but chunk boundaries can introduce discontinuous action transitions. Existing RTC guidance improves continuity by injecting correction signals during denoising, yet its weight schedule is weak at intermediate timesteps and its unconstrained correction direction may introduce transverse perturbations. We propose POTR, a prior-corrected orthogonal trust-region guidance method. First, we incorporate a data-prior scale \sigma_d into the RTC guidance weight, yielding stronger intermediate-time correction. Second, we decompose the guidance vector into components parallel and perpendicular to the denoising velocity, and constrain the perpendicular component within a trust region. On LIBERO with \pi_{0.5}, POTR improves success rate and consistently reduces chunk-boundary discontinuity, acceleration, and jerk compared with RTC. Ablations show that the prior-corrected weight provides the main correction gain, while the orthogonal trust region further improves stability.
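
The second step in this abstract, splitting the guidance vector relative to the denoising velocity and bounding the perpendicular part, can be sketched as below. The function name `constrain_guidance`, the norm-clipping form of the trust region, and the `radius` parameter are assumptions for illustration; the abstract does not specify how the trust region is enforced or how its size is chosen.

```python
import math

def constrain_guidance(guidance, velocity, radius):
    """Keep the parallel part of guidance; clip the perpendicular part's norm."""
    v_dot_v = sum(v * v for v in velocity)
    if v_dot_v == 0:
        parallel = [0.0] * len(guidance)  # no velocity direction to project on
    else:
        coeff = sum(g * v for g, v in zip(guidance, velocity)) / v_dot_v
        parallel = [coeff * v for v in velocity]
    perpendicular = [g - p for g, p in zip(guidance, parallel)]
    perp_norm = math.sqrt(sum(c * c for c in perpendicular))
    if perp_norm > radius:
        # Shrink the transverse component onto the trust-region boundary.
        perpendicular = [c * radius / perp_norm for c in perpendicular]
    return [p + q for p, q in zip(parallel, perpendicular)]

corrected = constrain_guidance(guidance=[3.0, 4.0],
                               velocity=[1.0, 0.0],
                               radius=1.0)
```

In this example the component along the velocity (3.0) passes through unchanged, while the transverse component is reduced from 4.0 to the radius of 1.0, which is the kind of transverse perturbation the abstract says unconstrained guidance can introduce.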

[LG-110] Representation-Guided Discrete Molecular Graph Retrosynthesis

Link: https://arxiv.org/abs/2605.24428
Authors: Jiahai Huang, Anjie Qiao, Zhen Wang, Defu Lian, Yutong Lu
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Stochastic process-based molecular graph generators have become the state of the art for template-free single-step retrosynthesis. However, these models are typically trained only on product-reactant pairs, thereby acquiring chemistry-relevant representations in an indirect and implicit manner. Meanwhile, recent advances in computer vision demonstrate that offering representation guidance to a generator can effectively distill semantics from pretrained encoders into DiTs, substantially improving both convergence and generation quality. Whether similar gains extend to the retrosynthesis task, and what graph-specific design choices can make them work, remains an open question. To address these questions, we conduct a systematic empirical study over a unified design space spanning teacher molecular representations, endpoint and granularity choices, injection depths in the denoiser, correspondence strategies and guidance scheme. Guided by these considerations, we develop Graph-oriented Representation Guidance (GRG), which achieves 58.6 / 77.2 / 83.4 / 87.1 top-1 / 3 / 5 / 10 accuracy on USPTO-50k, while increasing diversity to 15.5, both substantially outperforming the adopted base generator. Notably, GRG consistently improves all top-k metrics in out-of-distribution settings, suggesting that representation guidance facilitates the acquisition of intrinsic chemical semantics. Meanwhile, the introduced representation guidance reduces the number of epochs by 35% and the wall-clock time by 30% to reach comparable performance. In addition, we introduce a simple yet effective representation-similarity-based reranking mechanism, which further improves the top of the ranked list without training an additional verifier.

[LG-111] Poisoning the Watchtower: Prompt Injection Attacks Against LLM-Augmented Security Operations Through Adversarial Log Content

Link: https://arxiv.org/abs/2605.24421
Authors: Rohan Pandey, Archit Bhujang
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 10 pages

Click to view abstract

Abstract:Large language models (LLMs) are increasingly used as analyst assistants in security operations centers (SOCs), where they ingest log and alert data to produce triage labels, incident summaries, or remediation advice. We study a structural failure mode of this design: many log fields are attacker controlled. User agents, URLs, payloads, DNS queries, and attempted usernames can therefore carry instructions to the model alongside evidence of the intrusion. We call this setting log-substrate prompt injection. We introduce a four-class taxonomy of log-substrate attacks: direct override (S1), persona hijack (S2), context manipulation (S3), and obfuscated payloads (S4). We evaluate 48 strategy-defense-task combinations using gpt-4o-mini as the analyst. Three findings stand out. First, direct overrides are ineffective in our setting: all S1 classification attacks achieve 0% suppression. In contrast, persona hijacks suppress 68% of malicious logs under a naive classifier and remain effective under stronger defenses. Second, summarization is the highest-risk task: context manipulation reaches 96% injection success without defenses and 38% even with constrained output. Third, defenses reduce but do not eliminate the attack surface: average injection success falls from 26.6% under naive prompting to 11.8% under our strongest defense. We also compare empirical results to a deterministic mock analyst and find that simulation substantially mispredicts current model behavior, especially for direct overrides. These results suggest that SOC copilots should treat raw log content as adversarial input rather than ordinary analyst context.

[LG-112] ChainLearn: A Blockchain-Based Capacity-Aware Framework for Federated Ensemble Learning

Link: https://arxiv.org/abs/2605.24418
Authors: Karan Sharma, Aditya Tripathi, Rahul Mishra, Tapas Kumar Maiti
Subjects: Machine Learning (cs.LG)
Comments: 10 pages, 7 figures, 11 tables. IEEE conference format. Code: this https URL

Click to view abstract

Abstract:Federated learning is used in medical imaging where privacy prohibits centralizing data. Standard federated algorithms assume homogeneous hardware, identical architectures, and centralized aggregation, which fails when hospitals have unequal compute resources. We propose capacity-aware coordination: measure each hospital’s throughput, assign capacity-appropriate architectures (MobileNetV3-Small, EfficientNet-B0, ResNet-50), and combine predictions via weighted ensemble. Weak and strong hospitals can participate without forcing uniform architectures. We separate on-chain policy from off-chain learning. A Solidity contract stores hospital registration, benchmark hashes, metrics, and weights. Hospitals train locally and submit only hashes and scalars (not parameters). Weighted ensemble inference is computed off-chain. Experiments on PneumoniaMNIST and DermaMNIST (5 seeds, 3 non-IID levels) show our method achieves lower or equal calibration error versus equal-weight ensemble and competitive accuracy versus FedAvg, FedProx, and FedMD. Communication overhead is 224 bytes per round, a reduction of over 912,000x compared to FedAvg.
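
The off-chain weighted ensemble step and the communication figure quoted in this abstract can be illustrated with a short sketch. The function name, the per-hospital probabilities, and the weights are hypothetical; the abstract does not say how the weights are derived from the stored metrics. The last lines only work out what the stated 224 bytes and 912,000x ratio jointly imply.

```python
def weighted_ensemble(prob_vectors, weights):
    """Combine per-model class probabilities using normalized weights."""
    total = sum(weights)
    n_classes = len(prob_vectors[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, prob_vectors)) / total
        for c in range(n_classes)
    ]

# Hypothetical predictions from three hospitals' models for one image.
predictions = [
    [0.9, 0.1],  # small model
    [0.6, 0.4],  # medium model
    [0.2, 0.8],  # large model
]
weights = [1.0, 2.0, 1.0]  # illustrative weights only
combined = weighted_ensemble(predictions, weights)

# Arithmetic implied by the abstract: 224 bytes per round and a reduction
# of over 912,000x mean the FedAvg payload exceeds this many bytes per round.
implied_fedavg_bytes = 224 * 912_000
implied_fedavg_megabytes = implied_fedavg_bytes / 1_000_000
```

The implied FedAvg payload is a little over 204 MB per round, which is plausible for exchanging full model parameters, whereas hashes and scalars fit in a few hundred bytes.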

[LG-113] LLMTabBench: Evaluating LLMs on Binary Tabular Classification From Zero to Few Shots

Link: https://arxiv.org/abs/2605.24417
Authors: Daria Grushina, Kseniia Kuvshinova, Alina Kostromina, Aziz Temirkhanov, Mile Mitrovic, Dmitry Simakov
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Supervised classification for tabular data remains a core machine learning task, yet its reliance on large labeled datasets limits applicability in data-scarce domains. For such few-shot scenarios, specialized methods like TabPFN - a state-of-the-art Prior-Data Fitted Network - have set a high standard by leveraging large-scale synthetic pretraining, though they still require a context of labeled examples to function. In contrast, Large Language Models (LLMs) could offer a more flexible alternative via zero- and few-shot in-context learning directly from task descriptions, but their performance on tabular data remains inconsistent and poorly understood. We introduce LLMTabBench, a benchmark designed to systematically evaluate LLMs for tabular classification under data-scarce conditions. LLMTabBench explicitly probes (i) how LLM prior knowledge interacts with in-context information (task descriptions and few-shot examples), and (ii) how model performance scales with increasing data complexity, using both real-world and controlled synthetic datasets. Our findings include: (1) LLMs are highly competitive in zero-shot settings and can outperform alternative models, even when those models have access to few-shot examples; (2) incorporating additional few-shot examples can conflict with LLM prior knowledge, limiting or even degrading performance; and (3) there is a data complexity threshold beyond which LLMs’ performance declines and few-shot examples become less effective. Together, these findings reveal fundamental constraints of in-context learning for tabular data and provide practical guidance for deploying LLMs in low-data regimes.

[LG-114] Synheart Capacity: A Theory-Driven Physiological Representation of Cognitive Capacity Dynamics from Wearable Signals

Link: https://arxiv.org/abs/2605.24416
Authors: Yisak Debele, Henok Ademtew, Israel Goytom
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Human cognitive performance is constrained by limited mental resources, yet continuous computational estimation of cognitive capacity dynamics remains an open challenge. We propose a theory-driven multimodal learning framework that models capacity-related cognitive state as a two-dimensional physiological representation defined by voluntary resource allocation (mental effort) and overload-related strain (stress). The proposed architecture combines dual-stream encoding of cardiac (IBI/HRV) and electrodermal (EDA) signals with late fusion and task-specific output heads that independently estimate probabilistic effort and stress states. Evaluation on the SWELL-KW dataset using strict leave-one-subject-out cross-validation demonstrates cross-individual generalization (stress: 70.0% balanced accuracy; effort: 72.2%), with significant gains from multimodal integration and theory-guided supervision. Rather than collapsing physiological dynamics into a single workload label, the proposed effort–stress state-space enables structured differentiation between distinct cognitive regimes, including productive engagement and overload-related strain. Predicted state trajectories exhibit significant demand-sensitive shifts under controlled workload manipulations, with effort and stress responding differentially across interruption and time-pressure conditions. These results suggest that physiologically grounded multidimensional state representations may provide a foundation for adaptive systems capable of continuous capacity-aware monitoring and human-centered interaction.

[LG-115] A Unified Python Framework for Direct PPO-based Control of AHUs with Economizer Logic and CO2-Constrained Ventilation

Link: https://arxiv.org/abs/2605.24406
Authors: Erfan Haghighat Damavandi, Davide Papurello, Mahdi Alibeigi, Armin Keshavarz, Simone Canevarolo, Marco Condo
Subjects: Machine Learning (cs.LG)
Comments: 10 pages, 7 figures

Click to view abstract

Abstract:Optimizing HVAC (Heating, Ventilation and Air Conditioning) can enhance a building’s energy efficiency while providing comfort levels for its occupants. Using conventional control systems to maintain HVAC functions is often difficult because of the nonlinear characteristics of a building envelope as it experiences stochastic load variations over time. This paper presents a new approach to optimizing HVAC systems through the use of Deep Reinforcement Learning (DRL) algorithms and the Proximal Policy Optimization (PPO) algorithm implemented in a custom Python performance environment. The DRL system uses a second order resistor-capacitor thermal model and an integrated dynamic mass balance of CO2 to replicate the complex physics associated with buildings. One major innovation of this study is a “Hierarchical Flow Logic,” which provides the means to ensure that indoor air quality (IAQ) is maintained by overriding the accepted actions of the agent that cause CO2 to exceed 1000 ppm. In addition, an enthalpy-based economiser is used to create free cooling from the outdoor environment. The experimental data shows that compared to PID controllers tuned by GA or traditional On-Off controls, a PPO agent has better temperature stability and energy efficiency overall. An end-to-end pipeline provides an avenue for robust and generalized solutions to help implement smart building energy management within the context of real hardware implementation.
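As a rough illustration of the CO2 override idea in the abstract above, the sketch below pairs a well-mixed single-zone CO2 mass balance with a rule that raises the supply airflow whenever the agent's requested airflow would push the next-step concentration above 1000 ppm. It is not the paper's "Hierarchical Flow Logic": the function names, the explicit-Euler step, the 420 ppm outdoor concentration, and the fixed airflow increment are all assumptions made for this example.

```python
def co2_step(c_ppm, airflow_m3s, gen_m3s, volume_m3,
             c_out_ppm=420.0, dt_s=60.0):
    """One explicit-Euler step of a well-mixed single-zone CO2 balance.

    V * dC/dt = G * 1e6 + Q * (C_out - C), with C in ppm, the occupant
    CO2 generation G and the outdoor airflow Q in m^3/s, and V in m^3.
    """
    dc_dt = (gen_m3s * 1e6 + airflow_m3s * (c_out_ppm - c_ppm)) / volume_m3
    return c_ppm + dt_s * dc_dt


def override_airflow(agent_airflow_m3s, c_ppm, gen_m3s, volume_m3,
                     limit_ppm=1000.0, max_airflow_m3s=2.0, step_m3s=0.05):
    """Keep the agent's airflow unless it would breach the CO2 limit.

    Hypothetical override rule: increase the airflow in fixed increments
    until the predicted next-step CO2 is at or below the limit, or the
    maximum airflow is reached.
    """
    airflow = agent_airflow_m3s
    while (co2_step(c_ppm, airflow, gen_m3s, volume_m3) > limit_ppm
           and airflow < max_airflow_m3s):
        airflow = min(max_airflow_m3s, airflow + step_m3s)
    return airflow
```

In an RL environment a rule like this would sit between the policy output and the plant model, so the agent still proposes an action but the ventilation constraint is enforced outside the learned policy.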

[LG-116] AvAtar: Learning to Align via Active Optimal Transport ICML2026

Link: https://arxiv.org/abs/2605.24395
Authors: Qi Yu, Ruizhong Qiu, Zhichen Zeng, My T. Thai, Huan Liu, Hanghang Tong
Subjects: Machine Learning (cs.LG)
Comments: Published as a conference paper at ICML 2026

Click to view abstract

Abstract:Alignment plays a fundamental role in many machine learning problems, such as multi-network analysis, multimodal learning, and point cloud registration. Recent works increasingly leverage optimal transport (OT) for distributional alignment, whose effectiveness largely depends on sparse supervision that is hard or costly to obtain in practice. Existing works, however, largely overlook how to actively acquire high-quality supervision to improve their alignment performance under OT frameworks. In this paper, we propose a principled active alignment framework for optimal transport alignment called AvAtar. We quantify the informativeness of a candidate by measuring its gradient-based impact on the global alignment result, computed as the gradient propagation from the global alignment result to all possible supervisions of the candidate through the entropy-regularized OT formulation. While differentiating through OT is challenging given its constrained nature, we leverage the adjoint-state method to reformulate the computation to a linear system solvable by the conjugate gradient method with linear complexity and guaranteed convergence. By encoding the global alignment result via effective utility functions, AvAtar is applicable to general alignment problems under the OT framework. Extensive experiments on three representative alignment tasks demonstrate the effectiveness, scalability, and generalizability of the proposed AvAtar.
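For readers unfamiliar with the entropy-regularized OT formulation mentioned in the abstract above, here is a minimal pure-Python sketch of the standard Sinkhorn iteration that solves it. This is the textbook solver only, not AvAtar's active-selection or adjoint-state machinery; the function name and the default `eps` and `n_iters` values are assumptions for illustration.

```python
import math


def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Entropy-regularized OT plan between histograms a and b.

    cost: n x m list of lists; a, b: marginals that each sum to 1.
    Returns the n x m transport plan as a list of lists.
    Note: very small eps can underflow exp(-cost / eps) in this naive form.
    """
    n, m = len(a), len(b)
    # Gibbs kernel K_ij = exp(-C_ij / eps)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
```

The resulting plan is a soft alignment: each row shows how the mass of one source item is distributed over the target items, and smaller `eps` values make it closer to a hard matching.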

[LG-117] Learning Laplacian Eigenspace with Mass-Aware Neural Operators on Point Clouds

Link: https://arxiv.org/abs/2605.24390
Authors: Zherui Yang, Tao Du, Ligang Liu
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The eigendecomposition of the Laplace–Beltrami Operator (LBO) is fundamental to geometric analysis, yet computing its low-frequency eigenmodes remains a significant bottleneck due to the high cost of iterative solvers on large-scale data. To amortize this cost, we introduce the Neural Eigenspace Operator (NEO), a feed-forward framework designed to predict the spectrum directly from point clouds. Crucially, NEO circumvents the ill-posed nature of standard eigenvector regression, which suffers from intrinsic sign flips and rotation ambiguities, by learning the stable, invariant low-frequency subspace instead. Specifically, the network predicts a redundant set of basis functions whose span robustly covers the target eigenspace, allowing for the recovery of accurate eigenpairs via a lightweight Rayleigh–Ritz refinement. To handle irregular sampling, we propose a mass-aware neural operator that incorporates per-point area weights into attention-based aggregation, improving robustness to non-uniform densities and enabling zero-shot generalization across resolutions. Our approach achieves near-linear runtime scaling and substantial wall-clock speedups over iterative solvers at comparable accuracy, and exhibits strong zero-shot transfer to high-resolution point clouds. The resulting eigenpairs support standard spectral geometry tasks, while the raw basis functions provide effective point-wise features for downstream learning. Code: this https URL.
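The Rayleigh–Ritz refinement mentioned in the abstract above can be sketched generically: given a basis whose span covers the target eigenspace, project the stiffness and mass matrices onto that span and solve the small generalized eigenproblem there. The NumPy sketch below is a plain dense version under stated assumptions, not the paper's implementation: `L`, `M`, and `B` are hypothetical names for a symmetric Laplacian, a symmetric positive-definite mass matrix, and a basis with full column rank.

```python
import numpy as np


def rayleigh_ritz(L, M, B, k):
    """Lowest-k Ritz pairs of L x = lam * M x restricted to span(B).

    L: (n, n) symmetric; M: (n, n) symmetric positive definite;
    B: (n, r) basis with full column rank and r >= k.
    Returns (values, vectors) with M-orthonormal vectors of shape (n, k).
    """
    Lr = B.T @ L @ B                  # projected stiffness, (r, r)
    Mr = B.T @ M @ B                  # projected mass, (r, r)
    C = np.linalg.cholesky(Mr)        # Mr = C @ C.T
    Cinv = np.linalg.inv(C)
    A = Cinv @ Lr @ Cinv.T            # equivalent standard symmetric problem
    A = 0.5 * (A + A.T)               # remove round-off asymmetry
    vals, Y = np.linalg.eigh(A)       # eigenvalues in ascending order
    vecs = B @ (Cinv.T @ Y[:, :k])    # lift Ritz vectors back to R^n
    return vals[:k], vecs
```

Because the small problem has size r rather than n, the refinement is cheap when the basis is only slightly redundant; if span(B) contains the true low-frequency eigenspace, the lowest Ritz values coincide with the true eigenvalues.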

[LG-118] GEESE: Genotype-aware End-to-End Spatio-temporal Embedding for Behavioral Phenotyping

Link: https://arxiv.org/abs/2605.24370
Authors: Yiran Ding, Yuen Gao, Chunqi Qian, Zijun Cui
Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments:

Click to view abstract

Abstract:Behavioral phenotyping of genetic animal models currently requires labor-intensive manual feature engineering that limits reproducibility and scalability. We present GEESE, an end-to-end deep learning framework that learns behavioral representations directly from 3D pose dynamics without hand-crafted features. Using a pretrained time series foundation model, we encode movement sequences into a behavioral manifold that supports both behavior classification and genotype prediction. Evaluated across three autism-associated genetic models (CNTNAP2, CHD8, FMR1), our deep learning approach surpasses hand-crafted feature baselines in both tasks, revealing that learned representations capture genotype-specific behavioral signatures. The framework generalizes across genetic backgrounds, and an all-cohort model identifies both genetic background and genotype from movement patterns alone. We further provide HONK, an interactive intelligent tool enabling researchers without programming expertise to perform behavioral phenotyping from pose data through natural language interaction.

[LG-119] Refined Analysis of Entropy-Regularized Actor-Critic

Link: https://arxiv.org/abs/2605.24357
Authors: Safwan Labbi, Paul Mangold, Daniil Tiapkin, Eric Moulines
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:In this paper, we study the role of the critic in actor–critic for entropy-regularized, finite, discounted environments. We establish that, when the critic is exact, using the latter as a baseline is a variance-reduction method in a strong sense. In this case, actor–critic with stochastic gradients matches the sample complexity of deterministic policy gradient, reaching an $\epsilon$-optimal regularized value with $\tilde{O}(\log(1/\epsilon))$ samples. In practice, the critic is learned alongside the actor: the variance of the actor update is then influenced by the critic’s variance and bias. Specifically, when the critic has a sufficiently small error, the variance reduction and rapid convergence are preserved. This suggests to learn the critic first, keeping it up to date after each actor update, underscoring the crucial role of accurate critic estimation in actor–critic methods.

[LG-120] Evolving Robustness–Exploration Trade-off in Online Reinforcement Learning via Quantile Bayesian Risk MDPs

Link: https://arxiv.org/abs/2605.24345
Authors: Meichen Song, Yuhao Wang, Enlu Zhou
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:In online reinforcement learning, data scarcity creates epistemic uncertainty that makes robustness important early in learning, whereas sufficient exploration is needed to learn the true-environment optimal policy. We study this time-varying robustness–exploration trade-off through a quantile Bayesian risk-aware Markov decision process (BR-MDP), in which the quantile level controls how posterior uncertainty enters the Bellman backup. We characterize this control through an asymptotic normality result for the difference between the quantile BR-MDP value and the value in the true environment. The result implies that upper/lower-tail quantiles induce optimism/pessimism towards epistemic uncertainty, and the magnitude of the optimism/pessimism decreases as data accumulate. Building on this characterization, we propose an online Bayesian risk-aware algorithm with an adaptive quantile schedule that emphasizes robustness early and gradually encourages exploration of less-visited state–action pairs. We establish sublinear Bayesian regret bounds with respect to both the true optimal value and the optimal BR-MDP robust value. Numerical experiments demonstrate strong performance in both exploration-demanding and exploration-costly environments.

[LG-121] ChainzRule: Sample-Efficient Robust Deep Learning Across Tabular NLP and Vision Tasks

Link: https://arxiv.org/abs/2605.24340
Authors: Rowan Martnishn
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Production deep learning systems across enterprise domains operate under constraints that academic benchmarks routinely obscure: labeled data is expensive, inference budgets are tight, and models that cannot explain their behavior are difficult to trust and maintain. We present ChainzRule (CR), a neural architecture replacing typical activations with learnable polynomial layers governed by Differential Regularization (DREG), a layer-wise Jacobian penalty computed analytically during the forward pass at standard inference cost. The core claim is that bounding intermediate derivatives forces the network toward low-frequency, structurally stable representations, simultaneously reducing dependence on labeled data volume, improving robustness to distribution shift, and providing a measurable, gradient-based handle on model behavior. Evaluated across five domains, CR achieves $85.71\% \pm 2.01\%$ on Pima Diabetes (statistically superior to SVM and XGBoost), $46.20\% \pm 0.37\%$ on SST-5 sentiment classification with a frozen encoder (superior to RNTN using approximately 5% of its training data), 55.79% on SST-5 with a fine-tuned BERT backbone (versus BERT-base linear head at 54.9%), 70.17% on Yelp Full ordinal regression with 3.2M parameters versus a 10-model average of 66.35%, and +2.32% mean corruption accuracy on CIFAR-10-C. All results with reported $p$-values fall below the $\alpha = 0.05$ threshold after Bonferroni correction. CR maintains a gradient tail ratio $\tau$ (p99/mean) of 1.01–1.02 against 1.07–1.09 for all typical activation function baselines across every data fraction, a structural invariant we propose as the mechanistic driver of sample efficiency and a deployment-time proxy for model reliability.

[LG-122] CurveRL: Principled Distribution-Aware Context Reweighting for LLM Reasoning

Link: https://arxiv.org/abs/2605.24331
Authors: Ke Sun, Yizhou Zhao, Jiayi Xin, Qi Long, Weijie Su
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Context or prompt-level reweighting has emerged as a central algorithmic lever in Reinforcement Learning with Verified Rewards (RLVR) for improving the reasoning capability of large language models, yet the principle determining what constitutes an optimal weighting remains poorly understood. We address this gap by formulating prompt reweighting as a functional derivative of a utility functional defined in the pass-rate function space, yielding a unified optimality framework that accommodates existing schemes, including REINFORCE and GRPO. Building on this optimality framework, we propose a distribution-aware prompt reweighting approach, called CurveRL, based on a quantile coordinate transform, in which the weight assigned to each prompt depends not on the absolute value of pass rates but on its rank and density to reflect the distributional structure of the pass rates in the learning dynamics. Extensive experiments across multiple benchmarks demonstrate that our proposed CurveRL consistently outperforms GRPO and other RLVR baselines. Our study identifies context-distribution control as a principled axis for analyzing and designing prompt-reweighted RLVR algorithms. The code is released in this https URL.

[LG-123] Interdomain Attention: Beyond Token-Level Key-Value Memory

Link: https://arxiv.org/abs/2605.24330
Authors: Naoki Kiyohara, Harrison Bo Hua Zhu, Riccardo El Hassanin, Zhuo Sun, Wenlong Chen, Samir Bhatt, Yingzhen Li
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Transformers and deep state space models (SSMs) sit at opposite ends of a basic design choice: attention routes each query through a growing key-value (KV) cache by content-based matching at quadratic cost, while deep SSMs compress context into a fixed-size recurrent state that is not directly addressed by query-key matching. We propose Interdomain Attention, which integrates an SSM into an attention module through kernel methods: an attention kernel is approximated by a finite feature map, the resulting key features and values are projected onto a shared set of basis functions maintained by a single SSM recurrence, and each query attends to the compressed coefficients through its own feature map, recovering query-conditioned attention over a fixed-size state. The scalable layer is a learned relaxation of this derivation, and we validate its components through ablations. In a 125M to 1.3B autoregressive language-modeling study on FineWeb-Edu at matched recurrent-state budget, Interdomain Attention improves on an SSM token mixer at every scale, surpasses a same-recipe softmax baseline at 1.3B on validation perplexity and on the eight-task commonsense suite, and inherits the length-flat behavior of its fixed-state core out to 3.5x the training context. Ablations indicate that the query-conditioned projection is the main source of the gain.

[LG-124] Omissive Bias in Religious Representation: Benchmarking LLM Answers to Everyday Ethical Decision-making

Link: https://arxiv.org/abs/2605.24319
Authors: David Wingate, Sheryl Carty, Joshua Coates, Daniel Feldman, Nancy Fulda, Larry Howell, Brett Israelson, Dallin Jacobs, Jonathan Karr, John Paul Kimes, Elisabeth Kincaid, Paul Martens, Gavin Mobley, Suzana Pinheiro, Lindsay Slemboski, Peter Whiting
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:As large language models become a default source of guidance on personal, moral, and existential questions, it matters whether they draw on the religious frameworks that have historically shaped such reasoning, or systematically omit them. In this paper, we ask a deliberately narrow question: when posed an everyday ethical question for which religious perspectives may be valuable, do LLMs invoke religion at all? In contrast to benchmarks that look for the presence of political leanings or social bias, we look for the absence of religious representation as a dimension of value alignment and bias in LLMs. We term this "omissive bias." To measure omissive bias, we contribute the AllFaith Religious Representation Benchmark: 150 ethically and personally salient questions, sourced from in-the-wild chat transcripts and faith-community contributors, paired with an LLM-as-judge rubric that gives full credit for any mention of a religion, a religious practice, or a religious leader. The questions are not themselves about religion–they are open-ended questions about grief, forgiveness, relationships, purpose, and honesty, where religion is one valuable perspective among several. We also run a human-subjects survey to compare LLM behavior against human expectations. Evaluating 27 models, we find that LLMs consistently underrepresent religion relative to human expectations. The omission is asymmetric: models invoke religion more readily for abstract existential questions (meaning, death, truth) than for the practical personal situations–grief, marriage, family conflict, addiction–where many people most rely on it. It is not our purpose to adjudicate which values LLMs should hold. We argue, more modestly, that current LLM responses overlook critical opportunities to reflect religious frameworks that many people draw on when navigating personal and ethical challenges.

[LG-125] From One-Pass SGD to Data Reuse: Mini-Batch Scaling Laws in Sketched Linear Regression

Link: https://arxiv.org/abs/2605.24316
Authors: Ziyan Chen, Ding-Xuan Zhou
Subjects: Machine Learning (cs.LG)
Comments: 56 pages, 3 figures

Click to view abstract

Abstract:Scaling laws provide compact descriptions of how prediction error varies with compute, model size, and data, but existing theory mainly treats single-sample SGD or full data reuse, leaving the role of mini-batching unclear. We study batch scaling laws for sketched linear regression under a power-law covariance spectrum and a source condition on the target parameter. We analyze one-pass batch SGD, multi-pass batch SGD with replacement, and multi-pass batch SGD without replacement. Our first result is a risk decomposition: all three procedures share the same irreducible and approximation terms, while their stochastic terms depend on the sampling protocol. One-pass batch SGD splits into bias and variance, whereas the two multi-pass methods split into GD bias, GD variance, and a fluctuation term around a common GD reference trajectory. We then prove source-condition scaling laws for one-pass and multi-pass mini-batch methods. For one-pass batch SGD, mini-batching preserves the approximation and optimization-bias exponents, while the variance scales as $O(\min(M,(T_{\mathrm{eff}}\gamma)^{1/a})/(B T_{\mathrm{eff}}))$. Thus the usual $1/B$ covariance reduction holds at fixed update count $T$, but in the one-pass regime $T=N/B$ it is partly offset by the shorter optimization horizon. For multi-pass batch SGD, with- and without-replacement sampling have identical approximation and GD bias/variance terms; they differ only in the fluctuation covariance prefactor, which is $1/B$ with replacement and $\rho_{N,B}=(N-B)/(B(N-1))$ without replacement. Hence without-replacement sampling is less noisy for $B>1$, and when $B=N$ the fluctuation vanishes, recovering deterministic gradient descent. These results place batch size on the same theoretical footing as compute, data, and model dimension in sketched linear regression.
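To make the two fluctuation prefactors quoted in the abstract above concrete, the short sketch below evaluates 1/B (with replacement) against (N-B)/(B(N-1)) (without replacement). Only the two formulas come from the abstract; the function name and its arguments are assumptions for illustration.

```python
def fluctuation_prefactor(n_samples, batch_size, with_replacement):
    """Fluctuation covariance prefactor for multi-pass mini-batch SGD.

    With replacement:    1 / B
    Without replacement: (N - B) / (B * (N - 1))
    """
    N, B = n_samples, batch_size
    if not 1 <= B <= N or N < 2:
        raise ValueError("require N >= 2 and 1 <= B <= N")
    if with_replacement:
        return 1.0 / B
    return (N - B) / (B * (N - 1))
```

The two prefactors agree at B = 1, the without-replacement value is strictly smaller for any B > 1, and it reaches zero at B = N, which corresponds to the full-batch (deterministic gradient descent) case described in the abstract.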

[LG-126] LLMs Show No Signs Of Individuated Metacognition

Link: https://arxiv.org/abs/2605.24299
Authors: M. Moran, Mark Whiting
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Confidence-weighted routing, selective abstention, and ensemble weighting all assume that a model’s stated confidence is informative about its capability on the question being asked. They presume functional metacognition, the capacity to assess one’s own capabilities, without exercising them. Aggregate calibration is well studied, with mixed results, but the underlying structure of elicited confidence is less well understood. We decompose binary confidence judgements from 20 frontier Large Language Models (LLMs) across six benchmarks using tetrachoric factor analysis paired with pairwise calibration, asking whether two models that differ in confidence also differ in performance. On factual recall and information retrieval benchmarks the cross-model confidence matrix is approximately rank-one and a single dominant factor captures most of the latent variance. Models retrieving facts share an item-level difficulty axis and differ mainly in their decision thresholds along it. Across all benchmarks the relationship between confidence and performance collapses once items that all models agree on are removed. Inter-model pairwise calibration is small even where statistically significant, and what remains shrinks to nothing once base-rate differences along the shared factor are controlled for. Mathematical reasoning is the apparent exception, but this turns out to be a confound where reasoning models answer questions about their confidence by trying to solve them in their chain of thought, bypassing the sub-symbolic self-knowledge we seek to measure. We find no evidence for significant verbalised individuated metacognition in any tested domain.

[LG-127] Private Adaptive Covariance Estimation via Gaussian Graphical Models

Link: https://arxiv.org/abs/2605.24295
Authors: Cecilia Ferrando, Miguel Fuentes, Brett Mullins, Cameron Musco, Daniel Sheldon
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:We propose PACE-GGM, a data-adaptive differentially private method for covariance estimation that concentrates its privacy budget on the most informative entries of the empirical covariance matrix, rather than perturbing all entries. This applies in the natural setting where the modeler supplies separate bounds for each variable, so that individual entries can be measured with less noise than the full matrix. In each round, our method selects a poorly approximated entry, measures it using the Gaussian mechanism, and then reconstructs a full covariance matrix using a maximum-entropy reconstruction objective, leading to a Gaussian graphical model structure. Experiments on diverse real-world datasets demonstrate consistent improvements in estimation error with respect to the Gaussian mechanism and other baselines, particularly in high-dimensional and low-to-moderate privacy regimes.
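The "measure one entry with the Gaussian mechanism" step in the abstract above can be illustrated with the classical noise calibration for (epsilon, delta)-differential privacy. This is a minimal sketch, not PACE-GGM itself: it assumes an uncentered second-moment entry, per-variable clipping bounds supplied by the modeler, a public sample size, replace-one neighboring datasets, and epsilon in (0, 1); all function names are hypothetical.

```python
import math
import random


def gaussian_sigma(sensitivity, epsilon, delta):
    """Classical Gaussian-mechanism noise scale, valid for 0 < epsilon < 1."""
    if not (0.0 < epsilon < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("requires 0 < epsilon < 1 and 0 < delta < 1")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def private_second_moment(xs, ys, bound_x, bound_y, epsilon, delta, rng):
    """Noisy estimate of (1/n) * sum(x_i * y_i) for two bounded variables."""
    n = len(xs)
    # Clip each variable to its own bound so the sensitivity is controlled.
    cx = [max(-bound_x, min(bound_x, x)) for x in xs]
    cy = [max(-bound_y, min(bound_y, y)) for y in ys]
    value = sum(x * y for x, y in zip(cx, cy)) / n
    # Replacing one record changes the sum by at most 2 * bound_x * bound_y.
    sensitivity = 2.0 * bound_x * bound_y / n
    return value + rng.gauss(0.0, gaussian_sigma(sensitivity, epsilon, delta))
```

Because the sensitivity of a single entry depends only on the two bounds involved, measuring one entry this way adds less noise than a mechanism calibrated to the whole matrix, which is the setting the abstract describes.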

[LG-128] TUBE: Tangent Upper Bound on Evidence for Discrete Diffusion Language Models

Link: https://arxiv.org/abs/2605.24292
Authors: Arseny Ivanov, Sergei Kholkin, Vladislav Gromadskii, Grigoriy Ksenofontov, Ivan Oseledets, Alexander Korotin
Subjects: Machine Learning (cs.LG)
Comments: Preprint. 9 pages main text, 5 figures, plus appendix

Click to view abstract

Abstract:Log-likelihood is a standard metric for evaluating generative models. Unfortunately, in contrast to autoregressive models (ARMs), discrete diffusion models generally do not admit exact computation of this quantity. Existing evaluations, therefore, rely on the evidence lower bound (ELBO), leaving unclear how much higher the true value may be. We address this by introducing the Tangent Upper Bound on Evidence (TUBE), a variational upper bound on log-likelihood that admits an unbiased Monte Carlo estimator. Our TUBE extends across latent-variable models, including masked diffusion models (MDMs), any-order ARMs (AO-ARMs), and block variants of both. Applied to block MDMs and block AO-ARMs, TUBE reveals our key empirical finding that these models lie strictly below the exact ARM baseline, showing that ARMs still dominate in likelihood.

[LG-129] Fourier Feature Pyramids for Physics-Informed Neural Networks

Link: https://arxiv.org/abs/2605.24278
Authors: Brandon Zhao, Yixuan Wang, Jonathan T. Barron, Katherine L. Bouman, Dor Verbin, Pratul P. Srinivasan
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We present an improved neural field architecture for solving partial differential equations (PDEs). Current physics-informed neural networks (PINNs) provide a flexible framework for solving PDEs, but they struggle to achieve highly accurate solutions and require computation that scales poorly with parameter count. Our model, which we call beignet (Bandlimited Embedding with Interpolated Grid Network), replaces the random Fourier feature embedding used by existing PINN models with a trainable multi-resolution Fourier feature pyramid. To query beignet at a continuous coordinate, we use Fourier interpolation at each level of the pyramid to return features at the input coordinate, and then decode this vector with a fully-connected neural network trunk. Our model provides multiple benefits: 1) Spatial derivatives can be computed efficiently by using the chain rule to compose derivatives of the neural network computed with automatic differentiation with derivatives of the feature grid computed spectrally by the Fast Fourier transform (FFT). 2) beignet can achieve higher accuracy in a compute-efficient manner by scaling the parameter count of this Fourier feature pyramid, instead of the less-efficient strategy of scaling the neural network architecture. 3) beignet can directly control the representation bandlimit, resulting in more stable optimization for difficult PDEs. We demonstrate that beignet finds significantly more accurate solutions on PDE benchmarks using fewer parameters than state-of-the-art PINN methods. We further evaluate beignet on the self-similar inviscid Burgers blowup problem and show that it can minimize residuals to near machine precision using Adam, an accuracy regime previously attained only by using computationally expensive higher-order optimizers.

[LG-130] A lift for input-convex neural network training

Link: https://arxiv.org/abs/2605.24274
Authors: Ali Siahkoohi, Anirudh Thatipelli
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Input-convex neural networks (ICNNs) are widely used for log-concave density estimation, convex-potential normalizing flows, optimal transport, and transport-map inversion for high-dimensional Bayesian posteriors. These tasks share a structural constraint: the inter-layer weights of the ICNN must remain non-negative. The standard recipe, projected gradient descent (PGD) onto the non-negative cone, applies a hard, non-smooth projection – the stiff-penalty limit of an ADMM-style constraint splitting – and its classical convergence guarantees do not transfer to the non-smooth ICNN training landscape; the differentiable alternative, softplus reparametrization, attenuates the gradient exponentially in the weight magnitude, stalling training with dead inter-layer weights and plateaued loss. Inspired by parameter-extension lifts of PDE-constrained inverse problems, we propose the lift: instead of constraining the inter-layer weights directly, we train an unconstrained hypernetwork that emits them from a permutation-invariant summary of the input batch. This adds stochasticity to the training dynamics that softens the loss landscape, letting the iterates escape the gradient-attenuated region where direct softplus stalls. We trace this softening to three structural ingredients – a learnable bias acting as slack, a hypernetwork body that conditions on the target batch, and a cross-covariance coupling the two through batch stochasticity – and prove each one necessary: deleting any single ingredient collapses the cross-covariance that carries the softening. On log-concave energy-based modeling from one-dimensional toy targets to image-flavored latents, and convex-potential normalizing flows on a 21-dimensional tabular benchmark, we show that the lift reaches a lower test loss than both PGD and direct softplus, and turns a plateau-bounded training trajectory into a valley-descending one.
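The two baseline ways of keeping ICNN inter-layer weights non-negative that the abstract above contrasts can be sketched for a single scalar weight. This is one reading of the abstract, not the paper's code: the projection clamps the weight after each gradient step, while the softplus reparametrization multiplies the gradient by sigmoid(theta), which shrinks roughly like exp(theta) as the raw parameter becomes very negative. All names below are illustrative.

```python
import math


def project_nonneg(w):
    """Projection step of PGD onto the non-negative cone (scalar case)."""
    return max(w, 0.0)


def softplus(theta):
    """Numerically stable softplus: log(1 + exp(theta))."""
    return math.log1p(math.exp(-abs(theta))) + max(theta, 0.0)


def softplus_grad(theta):
    """d softplus / d theta = sigmoid(theta), computed stably."""
    if theta >= 0.0:
        return 1.0 / (1.0 + math.exp(-theta))
    e = math.exp(theta)
    return e / (1.0 + e)


def grad_wrt_raw_param(grad_wrt_weight, theta):
    """Chain rule: the gradient reaching theta when w = softplus(theta)."""
    return grad_wrt_weight * softplus_grad(theta)
```

Under this reading, a weight that has been driven close to zero has a strongly negative raw parameter, so almost no gradient reaches it through the softplus, which is consistent with the "dead inter-layer weights" the abstract mentions.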

[LG-131] Optimizing Digital Therapeutic Interventions: Online Learning under Endogenous Adherence

Link: https://arxiv.org/abs/2605.24261
Authors: Eric Pulick, Stephanie Carpenter, Matthew Buman, Yonatan Mintz
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 48 pages, 6 figures

Click to view abstract

Abstract:A critical challenge facing clinicians managing chronic disease interventions is sustaining long-run patient health given limited information and resources. Digital therapeutics (DTs) provide a cost-effective way to manage interventions at scale through repeated interactions (e.g. daily treatment recommendations), but patient success is highly dependent on their adherence. Behavioral psychology suggests that both treatment recommendations and past adherence affect future adherence, yet existing decision support frameworks for DTs model only recommendation effects or treat adherence as exogenous context, leaving a key gap in model and algorithm development. To address this gap, we present a DT decision support framework that captures both recommendation and adherence effects, allowing clinicians to better plan treatment recommendations. We model a patient’s time-varying capacity for engagement with treatment using a linear dynamical system (LDS) that captures both recommendation and adherence effects, endogenously connected to adherence behavior with a logit link. We establish finite-time identification guarantees for this model, extending LDS results to our setting. Next, we propose an optimism-based algorithm, UCB-BOLD, for online treatment selection and prove that it achieves sublinear regret. We evaluate UCB-BOLD against benchmarks via ablation studies on a synthetic patient cohort generated using micro-randomized trial data. DT decision support tools can include dynamical models to enable decision makers to efficiently use the data in DT settings to improve patient health through effective resource allocation. While myopic or heuristic approaches suffice for some patient types, the benefits of explicitly planning around recommendation and adherence effects are significant for others; UCB-BOLD achieves 2-3x lower conditional value-at-risk regret than the next-best benchmark.

[LG-132] PrivFusion: A Privacy-preserving Multi-Agent Framework for Harmonizing Distributed Datasets

Link: https://arxiv.org/abs/2605.24249
Authors: Anisa Halimi, Liubov Nedoshivina, Kieran Fraser, Stefano Braghin
Subjects: Machine Learning (cs.LG)
Comments: Accepted by IEEE CBMS 2026

Click to view abstract

Abstract:The growing availability of clinical data has increased the use of machine learning, yet centralized data aggregation is often infeasible for sensitive health information. Federated Learning (FL) offers a distributed alternative, but its adoption is limited by substantial heterogeneity across institutional datasets, making harmonization a critical but frequently overlooked prerequisite for multi-site analytics. We introduce PrivFusion, a privacy-preserving multi-agent framework that automates the harmonization of structured datasets prior to federated training. PrivFusion uses agents to analyze local data, cluster semantically similar features across sites, and provide iterative transformation recommendations until alignment is achieved. Evaluation across four heterogeneous COVID-19 datasets demonstrates that PrivFusion effectively and efficiently harmonizes multi-site data while substantially reducing manual effort.

[LG-133] Characterizing the Representational Capacity of Neural Processes

Link: https://arxiv.org/abs/2605.24210
Authors: Robin Young
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: To appear at ProbML/AABI 2026

Click to view abstract

Abstract:What functions can Neural Processes represent? We analyze the representational capacity of popular NP architectures: Conditional Neural Processes (CNPs), Attentive Neural Processes (ANPs), Transformer Neural Processes (TNPs), and their latent variants. We prove these architectures form a strict hierarchy. CNP-representable functions are exactly those depending on finitely many expected features of the context distribution. ANPs strictly generalize CNPs via query-dependent reweighting, enabling kernel smoothers. ConvCNPs and ANPs are incomparable; each contains functions outside the other, separated by stationarity versus translation equivariance. TNPs with $L$ self-attention layers capture $L$-hop context interactions. For latent NPs, we show finite-dimensional latents provide coherent sampling but do not circumvent encoder limitations; matching GP posterior distributions requires latent dimension scaling with context size. These results provide a theoretical foundation for architecture selection based on task structure.

[LG-134] Incorporating Deep Learning Design in Database Queries

Link: https://arxiv.org/abs/2605.24207
Authors: Yuval Lev Lubarsky, Dean Light, Boaz Berger, Shunit Agmon, Benny Kimelfeld
Subjects: Databases (cs.DB); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Deep learning over relational databases is conventionally realized by translating data into graph representations and applying graph-based neural networks within external frameworks. This round-trip between the database and external machine learning (ML) systems introduces non-trivial engineering overhead. In effect, these graph neural networks operate on tuple embeddings and manipulate them in ways that capture the interactions induced by relational joins. Given this natural correspondence, there is no fundamental reason why specifying a neural network over relational data should be substantially harder than querying it. We propose an approach that naturally integrates deep learning with database queries. The key idea is to associate each tuple with provenance, represented as a vector embedding with learnable parameters. Queries are lifted to operate jointly on data and embeddings, mapping input relations with embedded tuples to output relations with embedded tuples. This approach provides a declarative foundation for relational deep learning, facilitating integration with database systems, optimization, and wide adoption. We describe RelaNN, a proof-of-concept implementation of this approach built on top of PyTorch and cuDF. We illustrate the utility of RelaNN by implementing various graph-learning models, including graph convolutional networks, heterogeneous graph transformers, hypergraph neural networks and deep homomorphism networks. The simplicity of the programs and their competitive runtime performance demonstrate a concrete path toward making the implementation of state-of-the-art neural networks over databases as simple as writing a query.

[LG-135] Music Transcription with (Almost) No Supervision

Link: https://arxiv.org/abs/2605.24193
Authors: Saebyeol Shin,Chao Wan,Zhenzhen Liu,Justin Lovelace,Daniel C. Lin,Kilian Q. Weinberger,John Thickstun
Categories: Sound (cs.SD); Machine Learning (cs.LG)

Abstract:Competitive music transcription models require large amounts of paired audio-score data, which is scarce due to collection costs, alignment difficulty, and copyright restrictions. Meanwhile, vast quantities of unpaired audio recordings and symbolic scores are freely available but have gone unused. We adopt a cycle-consistent translation framework in which a small amount of paired data acts as a minimal anchor, unlocking the full potential of the unpaired pool. We find that: unpaired data yields surprisingly large gains, especially under limited supervision; unpaired audio contributes more than unpaired scores; incorporating unlabeled audio from a new instrument during training improves transcription for that instrument without any paired supervision. Together, these results suggest that scaling unpaired data offers a practical path toward high-quality transcription for instruments where labeled data remains scarce.

[LG-136] EVA: Accelerating LLM Decoding via an Efficient Vector Quantization Architecture ISCA2026

Link: https://arxiv.org/abs/2605.24144
Authors: Bowen Duan,Cong Guo,Chiyue Wei,Haoxuan Shan,Yuzhe Fu,Xinhua Chen,Yifan Xu,Ziyue Zhang,Changchun Zhou,Hai Li,Yiran Chen
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: 17 pages. Accepted to ISCA 2026

Abstract:Large Language Models (LLMs) have achieved impressive performance across diverse domains but remain inefficient during the autoregressive decoding phase. Unlike the prefill stage, which employs compute-bound GEMM operations, decoding executes a sequence of small GEMV-like computations that are memory-bound and underutilize modern accelerators. Weight-only vector quantization (VQ) has emerged as an effective compression technique that clusters model weights into a shared codebook and replaces the original weight matrix with low-precision indices, enabling 2-bit-level weight compression. While this approach substantially reduces model size and memory bandwidth, it still suffers from two critical inefficiencies: the low utilization of GEMV computation and frequent memory conflicts during codebook lookups. This paper presents EVA, an efficient vector-quantization-based architecture that addresses both computational and memory bottlenecks in LLM decoding. EVA builds on a simple yet effective insight that combines input-codebook computation with conflict-free memory access. Instead of reconstructing quantized weights from indices, EVA directly performs dot products between input vectors and the weight codebook, transforming LLM decoding from GEMV to GEMM computation. It then performs structured lookups from an intermediate output buffer, eliminating memory bank conflicts. We further design a hardware-software co-optimized architecture specialized for LLM decoding while remaining compatible with conventional prefill execution. Evaluations show that EVA achieves up to $11.17\times$ speedup and $7.17\times$ higher energy efficiency compared with the SOTA lookup-based architecture, while preserving arithmetic precision after vector quantization. Our code is available at this https URL.
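
The abstract's central idea, multiplying the input against the codebook once and then looking results up by index instead of rebuilding the weights, can be illustrated with a small pure-Python sketch. This is not the EVA hardware design: the single shared codebook, the chunk layout, and the function names `vq_matvec_by_lookup` and `vq_matvec_by_reconstruction` are assumptions made for illustration only.

```python
def vq_matvec_by_lookup(codebook, indices, x):
    """Compute y = W @ x when W is stored only as codebook indices.

    codebook: list of centroids, each a list of d floats (assumed shared).
    indices:  indices[o][g] is the centroid id used for the g-th
              length-d chunk of output row o.
    x:        input vector of length groups * d.
    """
    d = len(codebook[0])
    groups = len(x) // d
    # Step 1: one dense product between every input chunk and every centroid.
    table = [
        [sum(c[j] * x[g * d + j] for j in range(d)) for c in codebook]
        for g in range(groups)
    ]
    # Step 2: each output element is a sum of lookups into that table.
    return [sum(table[g][row[g]] for g in range(groups)) for row in indices]


def vq_matvec_by_reconstruction(codebook, indices, x):
    """Reference path: rebuild each weight row, then take a dot product."""
    y = []
    for row in indices:
        weights = [w for idx in row for w in codebook[idx]]
        y.append(sum(w * xi for w, xi in zip(weights, x)))
    return y


codebook = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
indices = [[0, 2], [1, 1]]          # two output rows, two chunks each
x = [2.0, 3.0, 4.0, 6.0]
print(vq_matvec_by_lookup(codebook, indices, x))
```

Both paths give the same result. The lookup path does its arithmetic in one dense step whose cost depends on the codebook size rather than on the number of output rows.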

[LG-137] Riemannian Archetypal Analysis: Interpretable non-linear data analysis on deformed star distributions

Link: https://arxiv.org/abs/2605.24113
Authors: Willem Diepeveen,Deanna Needell
Categories: Machine Learning (cs.LG); Differential Geometry (math.DG); Optimization and Control (math.OC); Statistics Theory (math.ST)

Abstract:Classical archetypal analysis is appealing for its interpretability, but its linear geometry can limit performance on data with strongly non-linear structure; at the same time, existing neural extensions improve flexibility while often weakening the geometric meaning of archetypes and interpolations. In this work, we develop a Riemannian version of archetypal analysis based on data-driven pullback geometry for real-valued data, with the goal of combining the interpretability of classical archetypal analysis with the expressive power of modern non-linear models. We introduce a class of deformed star distributions together with associated pullback Riemannian geometry to provide a statistical interpretation of the resulting manifold mappings, define the Riemannian archetypal mapping (RAM) as a projection onto the manifold of geodesically convex combinations of archetypes, and propose a practical optimization scheme based on convex relaxation followed by non-convex refinement. We further propose a learning scheme that yields reasonable, albeit generally suboptimal, deformed star distributions from data. Experiments on synthetic examples and MNIST show that the resulting framework produces meaningful geodesics, useful denoising projections, and geometry-aware classifications, while also clarifying where current optimization limitations remain.

[LG-138] Towards Verifiable Transformers: Solver-Checkable Circuit Explanations

Link: https://arxiv.org/abs/2605.24033
Authors: Neel Somani
Categories: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)

Abstract:Mechanistic interpretability often identifies circuits inside Transformer models, but explanations of those circuits are usually validated through examples, ablations, and manual reasoning. This leaves a gap between finding a plausible circuit and proving what the circuit does. We introduce Verifiable Transformers, a framework for converting task-localized Transformer circuits into bounded, solver-checkable claims. Given a behavior, a finite task domain, and a candidate-token projection, we extract a task circuit and verify properties such as projected functional equivalence, edge necessity, task-relevant invariance, and final-residual robustness. Direct verification encodes the extracted circuit itself into an SMT solver. When a circuit contains operators that are not exactly or tractably encodable, surrogate-mediated verification fits an SMT-encodable surrogate, validates it against the extracted circuit over the bounded domain, and verifies symbolic explanations against the surrogate. We instantiate direct verification with a GPT-style architecture using Signed L1 BandNorm, sparsemax attention, and LeakyReLU. On small symbolic sequence tasks, we train an SMT-representable Transformer, extract sparse circuits for quote closing and bracket type tracking, and exhaustively verify projected functional equivalence, content invariance, edge necessity, and final-residual robustness. At GPT-2 scale, the same operator stack trains stably on OpenWebText, although naive direct SMT verification remains intractable. We also demonstrate surrogate-mediated verification on task-localized circuits with hard-to-encode attention, showing both verified symbolic explanations and solver-generated counterexamples. The goal is not full-model verification, but a concrete path for turning mechanistic circuit explanations into formal propositions that can be proven or refuted.

[LG-139] A Tabular Schedule Abstraction for Communication-Aware Evaluation of Pipeline-Parallel LLM Training

Link: https://arxiv.org/abs/2605.24006
Authors: Daniel Barley,Jonathan Leis,Benjamin Klenk,Holger Fröning
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: Accepted at the 25th IEEE International Symposium on Parallel and Distributed Computing (ISPDC 2026)

Abstract:Pipeline parallelism is a key technique for distributed training of large language models because it reduces per-device parameter and activation memory. However, comparing pipeline schedules is difficult: analytical models expose structural quantities such as bubble ratios, while end-to-end hardware experiments are costly and system-specific. In this work, we introduce a tabular schedule abstraction and a unified multi-abstraction methodology that connects formula-based reasoning, idealized schedule tables, and communication-aware execution simulation. Using this framework, we compare GPipe, 1F1B, Chimera, and Hanayo in its restricted regime across multiple modeled system configurations. Our results show that schedule rankings are not abstraction-invariant: communication can negate structural advantages suggested by bubble analysis alone. Under the assumptions considered here, GPipe and 1F1B are runtime-equivalent, but 1F1B achieves a lower activation-memory peak. Chimera is advantageous mainly at low microbatch counts and in communication-favorable regimes, while Hanayo is effective in its intended restricted operating point but remains sensitive to network bottlenecks. We further study an asymmetric Chimera-style placement, which does not reduce the global peak memory requirement but reveals limited runtime gains in shallow pipelines. Overall, pipeline schedule quality is meaningful only in the context of the modeled execution environment.
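
The "formula-based" level mentioned in the abstract can be sketched with the standard textbook idealization, which is not taken from the paper itself. It assumes every stage spends the same time on each microbatch and that communication is free. The function names `bubble_fraction` and `peak_inflight_microbatches` are hypothetical.

```python
from fractions import Fraction


def bubble_fraction(stages, microbatches):
    """Idle share of one training step for GPipe or 1F1B.

    Textbook idealization: equal per-stage compute time, zero
    communication cost. Both schedules share this bubble.
    """
    return Fraction(stages - 1, microbatches + stages - 1)


def peak_inflight_microbatches(schedule, stage, stages, microbatches):
    """Microbatches whose activations a stage holds at its peak.

    stage is 0-indexed from the first pipeline stage.
    """
    if schedule == "gpipe":
        # All forwards finish before any backward starts.
        return microbatches
    if schedule == "1f1b":
        # Warm-up depth shrinks toward the last stage.
        return min(microbatches, stages - stage)
    raise ValueError(f"unknown schedule: {schedule}")


print(bubble_fraction(4, 8))                            # idle share, 4 stages, 8 microbatches
print(peak_inflight_microbatches("gpipe", 0, 4, 8))     # first stage under GPipe
print(peak_inflight_microbatches("1f1b", 0, 4, 8))      # first stage under 1F1B
```

In this idealized view the two schedules have the same bubble and differ only in activation memory, which agrees with the abstract's statement that GPipe and 1F1B are runtime-equivalent. Communication effects, the paper's main concern, are not modeled here.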

[LG-140] TSFLora: Token-Compressed Split Fine-Tuning for Wireless Edge Networks

Link: https://arxiv.org/abs/2605.23988
Authors: Xianke Qiang,Zheng Chang,Li Wang,Ying-Chang Liang
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Abstract:Adapting large AI models (LAMs) to personalized edge data is challenging because wireless devices have limited memory, computation, and uplink capacity. Federated fine-tuning preserves data privacy but still requires each device to host the full model, while split learning reduces device memory at the cost of heavy activation transmission. This paper proposes TSFLora, a token-compressed split fine-tuning framework for communication-efficient LAM adaptation at the edge. TSFLora combines attention-guided token selection, token merging, low-bit activation quantization, and LoRA-based adaptation within a split federated training pipeline. The key idea is to compress the intermediate token sequence before transmission so that the system reduces both uplink traffic and server-side processing without changing the frozen backbone. Experiments on ViT models over CIFAR-10, CIFAR-100, and TinyImageNet show that TSFLora achieves up to $6.8\times$ communication reduction and 41% memory saving while maintaining competitive accuracy.

[LG-141] Algometrics: Forecasting Under Algorithmic Feedback

Link: https://arxiv.org/abs/2605.23978
Authors: Marc Schmitt
Categories: Machine Learning (cs.LG); Econometrics (econ.EM); Statistical Finance (q-fin.ST); Trading and Market Microstructure (q-fin.TR)

Abstract:In algorithmic markets, predictive models become part of the data-generating process they aim to forecast. Once their outputs are converted into trades, allocations, execution schedules, or risk controls, they change the future data on which they are evaluated. I introduce algometrics, a framework for time series whose evolution depends on the predictive algorithms forecasting them. The framework distinguishes historical risk, measured under passive forecasting, from deployment risk, measured when forecasts drive actions. I prove three results. First, deployment risk is not identifiable from passive historical data alone: even in a one-step linear feedback model, infinitely many algorithm-mediated environments induce the same historical law while implying different deployment risks for the same forecaster. Second, historical model rankings can invert under crowding, so a predictor with lower passive error can have higher deployment error once similar algorithms are adopted. Third, randomized or instrumented actions identify short-horizon linear feedback, and I derive a finite-sample bound for deployment-risk estimation. These results suggest that time-series benchmarks in algorithmic markets should report feedback sensitivity alongside predictive accuracy.

[LG-142] The Model Parking Tax: Quantifying the Hidden Energy Cost of Always-On GPU Model Deployment

Link: https://arxiv.org/abs/2605.23918
Authors: Sai Sathvik Vadari
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Performance (cs.PF)
Comments: 7 pages, 3 figures, 5 tables

Abstract:The AI inference industry keeps models loaded in GPU memory around the clock to avoid cold-start latency, implicitly treating idle power as a fixed cost of readiness. Yet the structure of this cost has never been empirically decomposed - and never across GPU architectures. We present the first cross-architecture measurement of idle GPU power as a function of VRAM allocation, combining 18 days of production telemetry (335,267 samples, 14 H100 GPUs) with controlled dose-response experiments on three GPU architectures spanning three memory technologies: NVIDIA H100 (HBM3, 80 GB), A100 (HBM2e, 80 GB), and L40S (GDDR6, 48 GB). We observe that idle power is piecewise constant on all three architectures: the CUDA context forces a discrete DVFS transition consuming +26-66 W over bare idle (26-50 W on HBM architectures, 66 W on GDDR6), while the marginal VRAM effect is bounded below measurement relevance ($|\beta| < 0.02$ W/GB) on every device tested. The CUDA context accounts for 98% of the parking tax regardless of memory technology. We validate this finding with a real HuggingFace model (Qwen2.5-7B) on all three architectures, confirming 0.5 W difference from empty tensors on every device, and capture cold-start power profiles during model loading. We derive a cold-start breakeven model showing energy-optimal behavior depends on request arrival rate and loading latency - not model size - with breakeven intervals of 1-5 minutes. Our results identify a constraint consistent across all tested architectures: idle-with-context power is determined by DVFS state, not memory occupancy.

[LG-143] DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking

Link: https://arxiv.org/abs/2605.26087
Authors: Matt L. Wiemann,Lindsay M. Smith,Peter Melchior,Siddharth Mishra-Sharma,Andrew Gordon Wilson,Pavel Izmailov,Carolina Cuesta-Lázaro
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Frontier LLMs now perform strongly across a wide range of physics evaluations, but it is hard to disentangle genuine reasoning from recall of established science. We introduce DiscoverPhysics, an interactive benchmark that asks an LLM agent to discover the laws of motion of a simulated world whose physics deliberately deviates from our own. We construct 22 worlds governed by, among others, screened and fractional-power gravity, multi-species couplings, hidden dark-matter-like particles, non-coordinate-free physics, and time-varying interactions. Each world is generated on demand by an N-body simulator, for which the agent proposes several rounds of experiments, observes raw trajectory data, and ultimately submits both a natural-language explanation of the world’s physics and a Python implementation of the inferred law. Because solving a world requires the agent to design informative experiments and revise its hypotheses, the benchmark probes long-horizon reasoning over an experimental history. We evaluate submissions along two complementary axes: trajectory MSE on held-out particles and an LLM-judged explanation score following an expert-written rubric assessing conceptual understanding of each world. Across eleven frontier models, we find that the strongest agents pass only half of the worlds and consistently fail on those where latent structure must be uncovered. Open-source models lag substantially behind commercial models, both in their ability to design informative experiments and in extracting conclusions from the data. We further find that good predictive accuracy does not guarantee high explanation quality and that conceptual understanding depends on hypothesis refinement through well-chosen experiments.

[LG-144] Accelerating Bayesian inverse design in computational fluid dynamics using neural operators

Link: https://arxiv.org/abs/2605.26059
Authors: Bipin Tiwari,Omer San
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)

Abstract:Bayesian inverse design provides a principled framework for inferring aerodynamic geometries from sparse flow observations while quantifying uncertainty. However, its practical use in computational fluid dynamics (CFD) is severely limited by the cost of repeated high-fidelity simulations required for gradient-based Markov chain Monte Carlo (MCMC) sampling. While surrogate models are commonly proposed to reduce this cost, their effect on posterior geometry and uncertainty, especially for shock-dominated flows, remains poorly understood. In this work, we demonstrate that neural operator surrogates can be embedded directly within the MCMC inference loop while preserving posterior structure. Using a fully Bayesian inverse formulation of quasi-one-dimensional nozzle flow, we demonstrate that geometry parameterization plays a decisive role in identifiability and posterior conditioning, with cubic B-splines yielding stable and physically meaningful uncertainty estimates. Building on this formulation, a Deep Operator Network trained on CFD-generated data is substituted for the CFD solver within a No-U-Turn Sampler, while keeping the likelihood model, priors, and sampling configuration unchanged. Across sparse to fully observed regimes, surrogate-based inference reproduces the posterior geometry and uncertainty trends of the CFD reference. As a result of surrogate integration, total inference time is reduced to under one second, corresponding to a speedup exceeding three orders of magnitude. In addition, a direct inverse neural operator is examined as a deterministic alternative for inverse design, enabling single-shot geometry reconstruction without posterior sampling. These results demonstrate that neural operator-accelerated Bayesian inference enables practical, uncertainty-aware inverse design workflows for aerodynamic applications.

[LG-145] Statistical Inference for Stochastic Gradient Descent Beyond Finite Variance

Link: https://arxiv.org/abs/2605.26000
Authors: Jose Blanchet,Peter Glynn,Wenhao Yang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

Abstract:Stochastic gradient descent (SGD) is a foundational algorithm for large-scale statistical learning and stochastic optimization. However, statistical inference based on SGD iterates remains challenging when stochastic gradients have infinite variance, as the relevant limiting distributions depend on unknown nuisance parameters. In this paper, we develop an efficient, model-agnostic methodology for constructing confidence regions from SGD trajectories that applies in both finite- and infinite-variance regimes. The procedure is based on a joint weak convergence result for the Polyak-Ruppert averaged estimator and an empirical second-moment normalizer constructed from stochastic gradients along the SGD trajectory. This joint limit yields a self-normalized statistic in which the leading tail-dependent scaling terms cancel. We then use a subsampling calibration scheme to estimate the relevant critical values, avoiding explicit estimation of tail indices, slowly varying functions, or stable-law parameters. The resulting confidence regions are straightforward to implement and are asymptotically valid under both the finite- and infinite-second-moment regimes. Simulation studies show reliable coverage in various settings, supporting the proposed method as a practical tool for uncertainty quantification in stochastic optimization.

[LG-146] Minimax Limits of k-Fold Cross-Validation via Majority

Link: https://arxiv.org/abs/2605.25859
Authors: Ido Nachum,Rüdiger Urbanke,Thomas Weinberger
Categories: Statistics Theory (math.ST); Machine Learning (cs.LG)

Abstract:We study the mean-squared error of $k$-fold cross-validation as a risk estimator, with particular emphasis on how its accuracy depends on the number of folds $k$. Despite the widespread use of cross-validation, principled guidance for choosing $k$ is largely absent, mainly due to the complex dependence between fold-wise error estimates. To obtain sharp and interpretable results, we focus on the majority algorithm in binary classification, a minimal yet nontrivial empirical risk minimization procedure. We provide a fine-grained analysis of its cross-validation behavior, showing that even this simple algorithm exhibits subtle and delicate phenomena for which existing theory provides loose and even vacuous bounds. Leveraging this analysis, we introduce a minimax framework for cross-validation risk estimation and prove that no empirical risk minimization algorithm can achieve an $O(1/n)$ minimax mean-squared error when the number of folds grows with the number of samples $n$; instead, a lower bound of order $\Omega(\sqrt{k}/n)$ is unavoidable. Our results reveal fundamental limitations of cross-validation as a data-reuse strategy, clarify gaps and inaccuracies in prior theoretical work, and position the majority algorithm as a natural benchmark that any tight analysis of cross-validation should be able to explain.
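
As a small illustration of why the majority algorithm is a delicate case for cross-validation, the sketch below computes the $k$-fold error estimate on a fixed label list. It is not the paper's analysis: the interleaved fold assignment, the tie-breaking rule (predict 1 on ties), and the names `majority_predict` and `kfold_cv_error` are assumptions made for this example.

```python
def majority_predict(train_labels):
    """Return the most common binary label; ties go to 1 (assumed rule)."""
    ones = sum(train_labels)
    return 1 if 2 * ones >= len(train_labels) else 0


def kfold_cv_error(labels, k):
    """k-fold cross-validation error of the majority algorithm.

    Sample i is assigned to fold i % k (an interleaved split chosen
    for simplicity).
    """
    n = len(labels)
    errors = 0
    for fold in range(k):
        train = [y for i, y in enumerate(labels) if i % k != fold]
        test = [y for i, y in enumerate(labels) if i % k == fold]
        prediction = majority_predict(train)
        errors += sum(1 for y in test if y != prediction)
    return errors / n


balanced = [0, 1, 0, 1]
print(kfold_cv_error(balanced, k=4))   # leave-one-out on perfectly balanced labels
```

On perfectly balanced labels, leave-one-out always trains on a set whose majority is the opposite of the held-out label, so every fold errs. A constant predictor on fair-coin labels has true error 0.5, which makes this a simple case where the estimate is far from the risk.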

[LG-147] Geometry Adaptive Counterfactual Distribution Learning with Diffusion-Guided Smoothing

Link: https://arxiv.org/abs/2605.25811
Authors: Kwangho Kim
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We study counterfactual distribution learning for high-dimensional outcomes whose counterfactual law may concentrate near lower-dimensional structure. Standard isotropic smoothing treats all ambient directions equally, leading to unfavorable scaling and unstable local inference. We propose two diffusion-guided estimators based on semiparametric debiasing: diffusion-informed smoothing for counterfactual densities and diffusion-informed score smoothing for counterfactual scores. The estimators combine causal nuisance adjustment with geometry-adaptive localization driven by diffusion score information, removing first-order nuisance bias while aligning smoothing with local outcome geometry. We establish asymptotic expansions, risk bounds, and inference procedures for smoothed density and score-based targets, with ambient density inference obtained under additional approximation conditions. Under structural geometry conditions, the leading stochastic error is governed by an effective dimension induced by the diffusion-guided kernel, rather than by the ambient dimension. Semi-synthetic experiments based on CelebA show steeper error decay for geometry-adaptive methods, supporting the proposed effective-dimension theory.

[LG-148] Machine Learning Multiscale Interactions

Link: https://arxiv.org/abs/2605.25710
Authors: Àlex Solé,Sergio Suárez-Dou,Albert Mosella-Montoro,Silvia Gómez-Coca,Eliseo Ruiz,Alexandre Tkatchenko,Javier Ruiz-Hidalgo
Categories: Chemical Physics (physics.chem-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)

Abstract:Realistic physical systems are characterised by emergent interactions across multiple length and time scales, posing a significant challenge for predictive machine learning (ML) models. Most scientific ML models focus on a narrow range of interactions. While machine learning force fields (MLFFs) offer near-quantum accuracy, the ubiquitous message-passing layers miss long-range many-body effects. Here we introduce the Multiscale Structural Ensemble (MuSE), a hierarchical model that uses Soft Coarse-Graining Pooling to construct coarse representations from smooth fractional assignments of atoms to coarse nodes, enabling MLFF modules to operate across multiple scales. MuSE is architecture-agnostic and coupled with SO3krates, MACE, and PaiNN MLFFs for both molecules and materials. We demonstrate the power of MuSE through Hessian-based benchmarks, folding trajectories for biomolecules, and energy profiles in molecule-graphene nanostructures, where MuSE accurately captures quantum-mechanical interactions at relevant scales – unlike other recent long-range ML models.

[LG-149] PAC Learning with Bandit Feedback: Sharp Sample Complexity in the Realizable Setting

Link: https://arxiv.org/abs/2605.25678
Authors: Steve Hanneke,Qinglin Meng,Shay Moran,Amirreza Shaeiri
Categories: Machine Learning (stat.ML); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 18 pages

Abstract:We study the problem of multiclass PAC learning with bandit feedback in the realizable setting. In this framework, there is an unknown data distribution over an instance space $\mathcal{X}$ and a label space $\mathcal{Y}$, as in classical multiclass PAC learning, but the learner does not observe the labels of the i.i.d. training examples. Instead, in each round, it receives an unlabeled instance, predicts its label, and receives bandit feedback indicating only whether the prediction is correct. Despite this restriction, the goal remains the same as in classical PAC learning. We provide a general characterization of the optimal sample complexity of this problem, sharp for every concept class up to logarithmic factors. Our characterization is based on a new combinatorial dimension, termed the bandit $\mathrm{DS}$ dimension, defined via generalized combinatorial structures we call pseudo-boxes. These extend the pseudo-cubes underlying the $\mathrm{DS}$ dimension by allowing a different number of neighbors in each coordinate. In contrast to the $\mathrm{DS}$ dimension, which governs the full-information setting by counting the number of coordinates in the pseudo-cube, the bandit $\mathrm{DS}$ dimension aggregates the number of neighbors across coordinates, leading to a characterization in which the sample complexity scales with the total number of neighbors. We also propose a general learning algorithm achieving the upper bound, based on an algorithmic principle called ListCascade, which connects bandit learning to list learning and may be of independent interest.

[LG-150] StrTransformer: Source-Wise Structured Transformers for Unsupervised Blind Source Recovery

Link: https://arxiv.org/abs/2605.25648
Authors: Yuan-Hao Wei
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:This paper proposes StrTransformer, a source-wise structured Transformer framework for blind source recovery and branch-wise latent modeling. Instead of using an encoder to infer latent variables, StrTransformer directly optimizes the latent source matrix together with an observation-space mixer and source-wise structural Transformer branches. The mixer enforces reconstruction consistency, while each Transformer branch imposes a differentiable structural constraint on one latent source trajectory. Specifically, each source is converted into multi-scale patch tokens, randomly masked, processed by a locality-biased Transformer, and evaluated through a masked patch reconstruction energy. This energy acts as an implicit source-wise structural prior. To encourage different latent branches to specialize into different temporal regimes, StrTransformer further introduces an ordered multi-scale controller that learns branch-specific patch-scale weights, ordered scale centers, and locality attention slopes. The resulting objective combines observation reconstruction, source-wise structural regularization, and modular auxiliary penalties for separation and scale specialization. We analyze the decoupling and coupling structure of the objective, the regularized exact-reconstruction fiber, and the reduction of permutation symmetry induced by ordered branch descriptors. A controlled case study shows that the learned branches converge to distinct temporal-scale structures and recover source-aligned latent trajectories under post-hoc evaluation.

[LG-151] 3D Magnetic Field Reconstruction and Mapping with Physics-Informed Neural Networks

Link: https://arxiv.org/abs/2605.25640
Authors: Haohan Yu,Zhanxu Hao,Bingzhi Li,Zejia Lu,Xiang Chen,Liang Li
Categories: Instrumentation and Detectors (physics.ins-det); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Nuclear Experiment (nucl-ex)

Abstract:Accurate reconstruction of magnetic fields in inaccessible regions is vital for many high-precision experiments in physics. Traditional methods, such as spherical harmonic expansion, often suffer from truncation errors that limit their precision. This study proposes an advanced Physics-Informed Neural Network (PINN) framework for high-precision 3D magnetic field mapping. Unlike conventional data-driven models, the proposed PINN integrates Maxwell’s equations directly into the loss function, enforcing divergence-free and curl-free conditions across the entire domain. A key innovation is the inclusion of explicit physics-residual losses at measurement locations, ensuring rigorous physical consistency beyond random collocation sampling. Validation using simulated data achieves a reconstruction accuracy of $10^{-4}$, a tenfold improvement over existing PINN benchmarks. Furthermore, experimental validation using a custom coil assembly demonstrates robust reconstruction with sub-percent relative accuracy, reaching the $10^{-3}$ level under ambient conditions. This AI-driven methodology provides a robust, high-precision solution for field monitoring and measurement in complex experimental environments where direct sensor placement is restricted.

[LG-152] Learning Sparse Compositional Functions with Norm-Constrained Neural Networks

Link: https://arxiv.org/abs/2605.25608
Authors: Shuo Huang, Lorenzo Fiorito, Lorenzo Rosasco, Tomaso Poggio
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:The ability of deep neural networks to learn hierarchical features is widely regarded as a key mechanism underlying their success in high-dimensional learning. Existing theory partially supports this view by establishing approximation rates based on parameter counts and sample complexity guarantees for compositional models without incurring the curse of dimensionality (CoD). To study overparameterized regimes, where the number of parameters exceeds the sample size, we develop a framework that measures complexity via the parameter norm. Within this approach, we establish approximation rates and excess risk bounds for learning sparse compositional functions whose compositional structure is represented by directed acyclic graphs (DAGs), using Frobenius norm-constrained deep neural networks. Our results have broad applicability since every function that is efficiently Turing computable admits sparse compositional representations. In particular, we cover a range of representative models, including multi-index models, binary tree structures, and general compositional architectures. The rates we derive show that deep networks can exploit the compositional structure of the target functions, effectively avoiding the CoD through hierarchical representations.

[LG-153] Decoding Stimulus Reconstruction-Based Auditory Attention Robustly in Unbalanced EEG Datasets

Link: https://arxiv.org/abs/2605.25605
Authors: Yuanming Zhang, Yayun Liang, Zhibin Lin, Jing Lu
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)

Abstract:In the past decade, numerous studies have applied deep neural networks (DNNs) to decode auditory attention (AAD) from Electroencephalogram (EEG) signals via stimulus reconstruction. However, the influence of dataset balance on the decoding performance of stimulus reconstruction-based AAD remains unexplored. In this study, three publicly available EEG-AAD datasets - KUL, DTU, and NJU cEEGrid - are used to construct both balanced and unbalanced experimental conditions. We hypothesize and demonstrate that stimulus reconstruction-based DNN decoders tend to produce overestimated decoding performance on unbalanced datasets. To address this issue, we propose a leave-one-paired-envelope-out (LOPEO) cross-validation protocol. Experimental results confirm that LOPEO effectively prevents inflated decoding accuracy on unbalanced datasets. While balanced datasets are generally preferred in experimental design, LOPEO provides a principled evaluation framework for unbalanced datasets that have already been published, filling an important gap in the field.

[LG-154] Optimal Design for Multinomial Logit Model with Applications to Best Assortment Identification ICML2026

Link: https://arxiv.org/abs/2605.25592
Authors: Joongkyu Lee, Min-hwan Oh
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Accepted at ICML 2026

Abstract:We study optimal experimental design for multinomial logit (MNL) bandits, where an agent repeatedly selects a subset of $K$ items from a ground set of size $N$ and observes single-choice feedback. Unlike linear or generalized linear bandits, MNL bandits have a combinatorial action space, which makes classical optimal design approaches and naive optimization over all subsets computationally intractable. We propose a computationally efficient optimal design framework for MNL models that achieves both statistical efficiency and scalability through two complementary approaches: (i) an exact or certified-approximate reformulation of the design oracle as a 0-1 mixed-integer linear program (MILP) with solver-certified early stopping, and (ii) a fully polynomial-time lifted design that replaces the nonlinear objective with a tractable surrogate. Using the Kiefer-Wolfowitz equivalence theorem, we establish near G-optimality guarantees and characterize the induced statistical-computational trade-offs. As an application, we develop a best assortment identification algorithm for MNL bandits with linear utilities and non-uniform revenues, and prove an instance-dependent sample complexity of $\tilde{O}\big(\frac{d \log N}{\Delta^2}\big)$, where $d$ is the feature dimension, $N$ is the number of arms, and $\Delta$ is the minimum revenue gap.

[LG-155] Nonstationary Generalized Linear Bandits with Discounted Online Mirror Descent

Link: https://arxiv.org/abs/2605.25590
Authors: Joongkyu Lee, Min-hwan Oh
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We study nonstationary generalized linear bandits (GLBs), where the expected reward is modeled through a nonlinear link function with an unknown time-varying parameter. This framework encompasses a broad class of reward models, including linear, Bernoulli, and binomial rewards. Existing approaches are predominantly based on maximum-likelihood estimation (MLE), using sliding-window, restart, or discounting mechanisms to handle nonstationarity. Although these methods achieve statistically efficient regret guarantees, they generally require revisiting past observations at every round, which leads to computation and memory costs that grow with time; moreover, several of them rely on a non-convex projection step. In this paper, we propose DOMD-GLB, a new algorithm for nonstationary GLBs that utilizes discounted online mirror descent (DOMD) for parameter estimation, thereby incurring only $O(1)$ computation and memory costs per round. We prove dynamic regret bounds of order $\tilde{O}\big(c_\mu^{-1/2} d^{3/4} P_T^{1/4} T^{3/4}\big)$ in drifting environments and $\tilde{O}\big(c_\mu^{-1/3} d^{2/3} \Gamma_T^{1/3} T^{2/3}\big)$ in piecewise-stationary environments, where $d$ denotes the feature dimension, $T$ the time horizon, $P_T$ the path length, $\Gamma_T$ the number of change points, and $c_\mu$ a curvature parameter associated with the link function, while substantially improving computational efficiency over prior work. To the best of our knowledge, this is the first algorithm for nonstationary GLBs with per-round computation and memory costs independent of time.

[LG-156] Rao-Blackwellized Score Matching on Manifolds ICML2026

Link: https://arxiv.org/abs/2605.25567
Authors: Divit Rawal
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 22 pages, 3 figures. SPIGM @ ICML 2026

Abstract:We study denoising score matching (DSM) when the latent distribution is supported on a smooth embedded manifold $M \subset \mathbb{R}^D$. Under ambient Gaussian corruption, the tangent denoising target contains a singular normal-fiber noise channel whose variance diverges as $d/\sigma^2$ as $\sigma \to 0^+$. We show that conditioning on the nearest-point projection $\pi(X)$ canonically removes this singularity: the resulting conditional expectation is the unique $L^2$-optimal Rao-Blackwellized predictor of the tangent DSM target among all estimators depending only on the projected observation $\pi(X)$. We then compute the small-noise expansion of this canonical target and show that it equals the intrinsic Riemannian score up to an explicit order-$\sigma^2$ correction that decomposes into an intrinsic Tweedie term and an extrinsic curvature term involving the Weingarten and Ricci operators. In the flat case, the construction reduces exactly to ordinary lower-dimensional Gaussian DSM, while on $S^d$ the extrinsic correction simplifies to the scalar factor $(1-d/2)\nabla_M \log q$; this extrinsic $\sigma^2$ correction cancels identically on $S^2$, though the intrinsic Tweedie term remains.

[LG-157] From DPPs to k-DPPs: identifiability analysis via spectral decomposition

Link: https://arxiv.org/abs/2605.25526
Authors: Hideitsu Hino, Keisuke Yano
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 10 pages

Abstract:We study the geometry of determinantal point processes (DPPs) through the spectral decomposition $L=U\Lambda U^\top$. The spectrum $\Lambda$ governs the cardinality distribution via elementary symmetric polynomials, while the eigenspace orientation $U$ governs the conditional law within each fixed-cardinality stratum. Conditioning on cardinality $k$ yields the $k$-DPP, for which the identifiability structure changes fundamentally: the spectral parameter becomes identifiable only up to a common scale, and the eigenspace rotation parameter is identifiable only through squared minors of the eigenvector matrix. We characterize the identifiability gap precisely, via three explicit invariances (scale, sign similarity, and eigenspace rotation) and a dimension-counting theorem showing the existence of additional continuous non-identifiability whenever $\binom{N}{k} < N(N+1)/2$. In contrast, for the full DPP the non-identifiability comes only from the discrete sign similarity.
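
The link between spectrum and cardinality can be checked numerically. The sketch below uses the standard L-ensemble identity $P(|Y| = k) = e_k(\lambda) / \prod_i (1 + \lambda_i)$, where $e_k$ is the $k$-th elementary symmetric polynomial of the eigenvalues. Assuming a diagonal kernel for simplicity, it also illustrates the scale invariance of the $k$-DPP. The function names and the diagonal-kernel restriction are choices made for this example and are not taken from the paper.

```python
from math import prod


def elementary_symmetric(lams):
    """Return [e_0, ..., e_N] of the eigenvalues via the usual in-place recursion."""
    e = [1.0] + [0.0] * len(lams)
    for lam in lams:
        # Go downwards so e[k - 1] still excludes the current eigenvalue.
        for k in range(len(lams), 0, -1):
            e[k] += lam * e[k - 1]
    return e


def cardinality_distribution(lams):
    """P(|Y| = k) for k = 0..N under an L-ensemble with eigenvalues `lams`."""
    e = elementary_symmetric(lams)
    z = sum(e)  # equals prod(1 + lam_i) = det(L + I)
    return [ek / z for ek in e]


def kdpp_subset_prob(lams, subset):
    """k-DPP probability of `subset` when L is diagonal with entries `lams`."""
    k = len(subset)
    return prod(lams[i] for i in subset) / elementary_symmetric(lams)[k]
```

Multiplying every eigenvalue by the same constant changes `cardinality_distribution` but leaves `kdpp_subset_prob` unchanged, which is the common-scale non-identifiability the abstract refers to.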

[LG-158] Guided Flow Matching for Forward and Inverse PDE Problems with Sparse Observations: Algorithm and Theory

Link: https://arxiv.org/abs/2605.25509
Authors: Xifeng Zhang, Jin Zhao
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 50 pages, 8 figures, 4 tables

Abstract:Reconstructing PDE solutions from sparse observations is a core challenge in scientific computing. We present FM4PDE, a flow-matching generative framework that learns the joint distribution of PDE coefficients (or initial states) and solutions (or final states), enabling both forward simulation and inverse recovery with limited paired data. At inference, sampling is guided by a composite loss that enforces agreement with sparse measurements and reduces the PDE residual; we support deterministic, stochastic, and hybrid samplers. We provide error guarantees for these guided procedures. For the deterministic optimizer, a coercivity condition ensures trajectory boundedness and a phase-wise contraction yields logarithmic complexity in the target accuracy. For the stochastic sampler, we introduce adaptive guidance and assume dissipativity of the velocity field to obtain uniform moment bounds independent of the noise-floor parameter. This leads to polynomial-time error bounds, and a matching lower bound shows constant guidance induces an unavoidable positive bias, motivating adaptivity. A hybrid deterministic-stochastic analysis is also provided. Experiments on static and time-dependent benchmark PDEs demonstrate competitive accuracy and faster inference than diffusion-based generative models.

[LG-159] Mean-Shift PCA by Knockoff Mean ICML2026

Link: https://arxiv.org/abs/2605.25460
Authors: Mengda Li, Zeng Li, Jianfeng Yao
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: ICML 2026

Abstract:Removing noise is difficult, but adding noise is easy. In this work, we show how to eliminate mean-shift noisy components from PCA by deliberately introducing knockoff mean-shift perturbation. Standard PCA is highly sensitive to shifts in the sample mean: a small fraction of samples from a shifted distribution can cause large deviations in the leading principal components. In high-dimensional regimes, existing Robust PCA approaches cannot handle the mean-shift contamination structure inherent in the mixture model. Using tools from Random Matrix Theory, we prove that the mean-shift spikes are spectrally separable from the stable eigenvalues of the original covariance. Furthermore, the original eigenspace remains asymptotically invariant to the contamination, independent of the mixture weight. Exploiting this spectral stability, we propose a simple, two-stage PCA algorithm by adding knockoff mean that identifies and removes the mean-shift component using only standard PCA operations.

[LG-160] Different Statistical Perspectives for Understanding Generalisation in Graph Neural Networks

Link: https://arxiv.org/abs/2605.25452
Authors: Nil Ayday, Mahalakshmi Sabanayagam, Debarghya Ghoshdastidar
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 15 pages, 4 figures, submission for Special Issue in AStA Advances in Statistical Analysis

Abstract:Graph Neural Networks (GNN) are currently the most popular approach for learning and prediction on graph-structured data and are deployed in various fields, from social network analysis to drug discovery. However, there is limited mathematical understanding of the performance of GNNs. We discuss the various perspectives used to study statistical generalisation in GNNs. We identify three broad frameworks. The first approach, rooted in learning theory, relies on uniform convergence bounds and the complexity of the hypothesis class of specific GNN architectures. This approach also builds on the expressivity of GNNs, typically studied through the lens of graph isomorphism tests. The second principle is to simplify the neural architecture by analysing GNNs under the asymptotics of infinitely many parameters or infinite graph size. This approach approximates GNNs using Gaussian processes, neural tangent kernels or graphon neural network operators, which allow studying the generalisation or stability of trained GNNs. The third framework studies GNNs under random graph models, often the contextual stochastic block model, and derives non-asymptotic error rates using tools from high-dimensional statistics. We highlight some key theoretical results and discuss a few limitations and open research questions for each perspective.

[LG-161] Learning manifold diffusion semigroups from graph transition matrices

Link: https://arxiv.org/abs/2605.25383
Authors: Xiuyuan Cheng, Nan Wu
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract:We consider graph diffusion processes constructed from finite i.i.d. samples drawn from an unknown manifold embedded in ambient Euclidean space, where the graph affinity is defined by an ambient Gaussian kernel matrix. We show that the manifold heat semigroup $Q_t = e^{t\Delta}$ can be approximated directly by iterating the graph transition matrix $P$, under only low regularity assumptions on the test function $f$, including the case $f \in L^\infty$. We bound $| P^n f - Q_t f |$ in $\infty$-norm, with the operator application to $f$ properly defined, and we recover the classical graph-Laplacian pointwise rate $O(N^{-2/(d+6)})$ up to logarithmic factors, for diffusion times $t$ up to $O(1)$ and longer. The rate holds for in-sample error as well as out-of-sample generalization, where the estimator of $Q_t f$ at a new point is defined via kernel convolution. To handle non-uniform sampling densities on the manifold, we introduce a right-normalization of the graph transition matrix; under the assumption that the sampling density $p$ is $C^3$ and bounded away from zero, the same convergence rates hold. We numerically demonstrate the performance of the proposed estimator on simulated data.

[LG-162] Choosing Online Experiment Designs under Interference in Ads Recommendations and Member-Experience Systems

Link: https://arxiv.org/abs/2605.25290
Authors: Prashant Shekhar, Caroline Howard
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Online experiments in ads, recommendation, and member-experience systems are often planned before the dominant interference mechanism is known. A treatment may propagate through budgets, inventory, producer exposure, graph spillovers, or temporal carryover, making the randomization design itself a statistical decision. We formulate this problem as robust design selection over uncertain exposure mechanisms. Given a finite catalog of six implementable designs, the selector compares each design by worst-case planning risk over an ambiguity set. The risk combines exposure bias, assignment-unit variance, minimum detectable effect, contamination or carryover, operational cost, and estimand mismatch. For theoretical justification, the paper develops a geometry-aware guarantee, stating that design bias is bounded by Wasserstein distance to the launch exposure distribution, and this penalty is minimax tight under Lipschitz exposure response. We also prove finite-catalog approximation and a robust selector theorem with excess-risk control, exact recovery under separation, and certified shortlists when the risk surface is flat. Empirically, the same selector gives different recommendations across samples from public datasets. It selects user-randomization on Criteo ads with dimensionless robust risk 1.295, switchbacks on Open Bandit-bts/men with risk 2.105, and cluster-randomization on KuaiRand with risk 2.240. The Open Bandit case stresses known but uneven logging support, with propensities from 0.00006 to 0.594 and a 5.17% IPS effective-sample share. Overall, the paper contributes an interference-aware experiment design framework based on mechanism-robust design decisions, where the output is either a justified design choice or an uncertainty shortlist.
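
The selection rule described in this abstract, comparing designs by their worst-case risk over an ambiguity set, can be sketched as a small minimax lookup. This is only a schematic reading of the abstract: the design names, mechanism names, and risk values below are invented for illustration, and the paper's actual risk combines several components that are not modeled here.

```python
def select_design(risk_table):
    """Pick the design whose worst-case risk over all mechanisms is smallest.

    risk_table maps design name -> {mechanism name -> planning risk}.
    Returns the chosen design and the worst-case risk of every design.
    """
    worst_case = {design: max(risks.values()) for design, risks in risk_table.items()}
    best = min(worst_case, key=worst_case.get)
    return best, worst_case


# Hypothetical catalog: none of these numbers come from the paper.
example_risks = {
    "user_randomization": {"budget_coupling": 1.3, "graph_spillover": 2.5},
    "cluster_randomization": {"budget_coupling": 1.8, "graph_spillover": 1.9},
    "switchback": {"budget_coupling": 2.1, "graph_spillover": 2.0},
}
chosen, worst = select_design(example_risks)
```

In this made-up table the cluster design is selected because its worst case (1.9) is the lowest, even though user randomization looks better under one of the two mechanisms.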

[LG-163] Data-Specific Hyper-Parameter Design: A Paradigm Shift in Reservoir Computing

Link: https://arxiv.org/abs/2605.25221
Authors: G Manjunath, Juan-Pablo Ortega, Alma van der Merwe
Subjects: Dynamical Systems (math.DS); Machine Learning (cs.LG)

Abstract:Reservoir computing typically relies on large, randomly generated reservoirs, enabling simple, often linear readouts. Over the past two decades, most constructions have exploited the freedom to select the reservoir, constrained primarily by stability conditions based on state contraction or memory capacity. However, these designs are largely independent of the input data and learning objective, resulting in a trial-and-error methodology driven by randomness. In high dimensions, the reservoir acts as a random embedding of the input history, implicitly relying on Johnson–Lindenstrauss–type concentration phenomena to preserve information. In contrast, we develop reservoir design principles from a geometric perspective for inputs generated by deterministic dynamical systems. Rather than relying on random embeddings, we require reservoir state increments to align within a cone around an input-determined vector subspace, and prove that such a cone concentration reduces ridge-regression training error. When the cone angle is small, the variance of reservoir states concentrates in the input-determined subspace, improving conditioning of the empirical second-moment matrix and strengthening alignment between dominant covariance directions and the state-target cross-covariance. For echo state networks, we provide a constructive approach to reservoir design. The reservoir matrix is chosen so that associated Krylov-chain directions remain nearly closed within an input-determined subspace while permitting controlled mixing in its orthogonal complement. We also provide a spectral diagnostic for ridge regression training that identifies when reservoir geometry concentrates predictive information into a few dominant covariance modes and when “spectral pollution” inhibits forecasting. Numerical experiments demonstrate consistent performance gains over arbitrary reservoir constructions.

[LG-164] Growing a Neural Network in Breadth Depth and Time

Link: https://arxiv.org/abs/2605.25174
Authors: Eivinas Butkus, Kedar Garzón Gupta, Nikolaus Kriegeskorte
Subjects: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Abstract:Spatial and temporal resource constraints are critical for both biological and artificial intelligent systems. Here we define differentiable cost terms for breadth, depth, and time within a recurrent convolutional neural network conceived as a finite subset of an infinite lattice. We optimize these costs jointly with task errors via backpropagation. We set different pressures on breadth, depth, and time, which leads to diverse computational graphs emerging organically through training. We find that all three resources can be traded off against each other to achieve a given level of accuracy. Networks grow in all three dimensions with task complexity and spontaneously take more recurrent steps when inputs are occluded. Surprisingly, time used by the model correlates with human reaction times in an object recognition task. Our framework provides a normative account of how resource constraints shape neural architectures, connecting to questions about brain design in neuroscience, and may help illuminate the diversity of neural solutions found in nature.

[LG-165] Nyström Kernel Stein Discrepancy Tests

Link: https://arxiv.org/abs/2605.25173
Authors: Florian Kalinke, Zoltán Szabó, Bharath K. Sriperumbudur
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract:Kernel Stein discrepancy (KSD) is among the most popular goodness-of-fit (GoF) measures on general domains with a large number of successful deployments. One of the main applications of KSD is in constructing powerful GoF tests. However, tests relying on the classical U-/V-statistic-based KSD estimators have two major drawbacks. (i) Their runtime scales quadratically in the number of samples. (ii) Their asymptotic null distribution is computationally intractable in most cases, typically handled by bootstrapping. While it is known that the Nyström method permits accelerating KSD estimation with no loss of statistical accuracy under mild conditions, to the best of our knowledge, the fundamental question of its impact on bootstrap-based GoF testing is open; resolving this question is the focus of the current paper. In particular, we prove that the key properties of the quadratic-time bootstrapped KSD-based GoF test (asymptotic level and local consistency) are preserved by its Nyström acceleration. We numerically demonstrate the efficiency of the accelerated KSD estimator and bootstrap in the context of GoF testing of spherical and functional data. Our numerical results show that the Nyström-accelerated method performs statistically on-par with the quadratic-time approach, while requiring substantially smaller runtime.

[LG-166] Rejoinder: The ICML 2023 Ranking Experiment: Examining Author Self-Assessment in ML/AI Peer Review ICML2023

Link: https://arxiv.org/abs/2605.25172
Authors: Buxin Su, Jiayao Zhang, Natalie Collina, Yuling Yan, Didong Li, Kyunghyun Cho, Jianqing Fan, Aaron Roth, Weijie Su
Subjects: Applications (stat.AP); Digital Libraries (cs.DL); Machine Learning (cs.LG)
Comments: Rejoinder to the JASA Discussion of “The ICML 2023 Ranking Experiment: Examining Author Self-Assessment in ML/AI Peer Review” (arXiv:2408.13430)

Abstract:This article is the rejoinder to “The ICML 2023 Ranking Experiment: Examining Author Self-Assessment in ML/AI Peer Review,” to appear in the Journal of the American Statistical Association with discussion. To address the practical and theoretical points raised by the discussants, we organize our response around four core themes: (i) formulating peer review as a statistical estimation problem; (ii) mitigating equity and strategic concerns in the deployment of the Isotonic Mechanism; (iii) incorporating complementary signals such as reviewer rankings and structured metadata; and (iv) exploring a human-centered framework for peer review in the era of generative AI.

[LG-167] Counterfactually Safe Reinforcement Learning

Link: https://arxiv.org/abs/2605.25114
Authors: Jingyi Li, Peng Wu, Chengchun Shi
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Reinforcement learning algorithms are generally designed to maximize the expected return across a population. However, a policy that is optimal on average may be suboptimal for certain individuals, leading to potential safety concerns. To address this, we first formalize the notion of individual harm from a counterfactual perspective and define harm as the event in which a chosen action results in a strictly worse outcome than a baseline alternative. We then propose a general two-stage procedure for learning policies that maximize the expected return while accounting for individual harm. We further establish the finite-sample properties of the learned policy, derive an upper bound on its sub-optimality gap, and show that the harm rate remains well-controlled. Numerical experiments on both simulated and real-world datasets demonstrate the effectiveness of the proposed approach.
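
The harm definition in this abstract (the chosen action gives a strictly worse outcome than a baseline action for the same individual) can be made concrete with a toy calculation. The sketch assumes that both potential outcomes are known for every individual, which is only possible in simulation; the data and function names are illustrative and not taken from the paper.

```python
def harm_rate(potential_outcomes, policy_actions, baseline_action):
    """Fraction of individuals whose outcome under the policy's action is
    strictly worse than under the baseline action.

    potential_outcomes[i] maps each action to individual i's outcome.
    """
    harmed = sum(
        1
        for outcomes, action in zip(potential_outcomes, policy_actions)
        if outcomes[action] < outcomes[baseline_action]
    )
    return harmed / len(policy_actions)


def mean_return(potential_outcomes, actions):
    """Average outcome when individual i receives actions[i]."""
    return sum(o[a] for o, a in zip(potential_outcomes, actions)) / len(actions)


# Simulated individuals with outcomes under action 0 (baseline) and action 1.
outcomes = [
    {0: 1.0, 1: 2.0},
    {0: 1.0, 1: 0.5},  # this individual is harmed by action 1
    {0: 1.0, 1: 1.0},
    {0: 0.0, 1: 3.0},
]
always_treat = [1, 1, 1, 1]
```

Here the always-treat policy has a higher mean return than the baseline (1.625 against 0.75) yet harms one individual in four, which is the gap between average optimality and individual safety that the abstract describes.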

[LG-168] QML-PipeGuard: Drift-Aware Behavioral Fingerprinting for Quantum Machine Learning Pipeline Integrity

Link: https://arxiv.org/abs/2605.25066
Authors: Esra Yeniaras
Subjects: Quantum Physics (quant-ph); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 54 pages, 12 Tables, 5 figures

Abstract:Quantum machine learning (QML) is moving from research prototypes to deployed cloud services. As QML enters regulated industries, the integrity of the quantum stage becomes a practical concern on two fronts: noisy hardware drifts at the channel level between recalibrations, and an adversary with control over the execution environment can substitute the declared quantum channel with a behaviorally similar but mathematically distinct one. Neither concern is covered by existing QML verification work on pulse-level noise, input drift, input-perturbation robustness, or device identity. We introduce QML-PipeGuard, a contract-based framework addressing both concerns under a single mathematical machinery. It characterizes a QML pipeline at runtime by its behavioral fingerprint, the vector of observable expectation values under a tomographically structured measurement family, and operates in two modes: drift-aware monitoring that absorbs benign calibration changes within a calibrated tolerance, and adversarial detection that catches channel substitution as a violation of an informationally complete observable contract. The framework contributes a pipeline-composition treatment of the encoder-ansatz-measurement channel with a QML-specific threat model (tight frame-bound $C=\sqrt{3}$ for the single-qubit Pauli family), a finite-shot sample-complexity bound, and a tolerance decomposition separating adversarial and natural-drift contributions. We validate the framework end-to-end on a two-qubit QSVM pipeline on the IBM Heron r2 processor (ibm_fez), with a sample-complexity validation on a noise-matched simulator. The prescribed measurement budget (about 1.4e4 shots) fits in a single batched job, the sneaky channel is detected with a wide safety margin while evading the weak contract, and the typical hardware drift sits within tolerance.

[LG-169] Multimodality Stacking with Blockwise missing values and application to the PIONeeR biomarkers study for prediction of resistance to immunotherapy

Link: https://arxiv.org/abs/2605.25050
Authors: Mohamed Boussena, Florence Monville, Jacques Fieschi-Meric, Frederic Vely, Pierre Milpied, Julien Mazieres, Maurice Perol, Eric Vivier, Laurent Greillier, Fabrice Barlesi, Sebastien Benzekry
Subjects: Applications (stat.AP); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)

Abstract:Integrating multimodal datasets in clinical oncology is frequently hindered by high dimensionality and blockwise missingness, where entire data sources are unavailable for specific patient subsets. Standard survival models often struggle with these gaps, leading to biased results or patient exclusion. We introduce Multimodality Stacking with Blockwise missing values (MSB), a late-fusion framework for survival analysis that independently models modality-specific features before aggregating predictions via a cross-validated stacking meta-learner. MSB was validated on the PIONeeR study (n=443 patients, 378 biomarkers across eight heterogeneous sources) to predict progression-free survival in advanced non-small cell lung cancer patients receiving immunotherapy. MSB yielded higher predictive performance (C-index) than baseline algorithms. Improvements varied by baseline strength: linear models showed a 15.9% increase (p<0.001 for the Wilcoxon signed-rank test), random survival forests gained 5.4% (p=0.002), and gradient boosting methods improved by 2.1% (p=0.030). Beyond discrimination, MSB reduced the generalization gap (train-test difference in 5-fold cross-validation repeated 3 times: 0.055 vs 0.380 for linear models). Permutation importance analysis identified routine laboratory markers, clinical features, and PD-L1 expression as primary predictive drivers. Missing block indicators showed negligible importance, suggesting the model learned from biomarker values rather than data availability patterns. MSB provides a statistically validated framework for multimodal survival prediction with blockwise missingness. By enabling systematic biomarker evaluation without requiring complete data, MSB offers a practical tool for predictive modeling in biomedical research, pending external validation. Implementation is available at this https URL under Inria license.
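
For readers unfamiliar with the C-index reported in this abstract, the sketch below computes Harrell's concordance index in its usual form: among comparable pairs (the patient with the shorter follow-up time had an observed event), it counts how often the model assigns that patient the higher risk, with ties in predicted risk counted as one half. This is the textbook definition rather than the paper's evaluation code, and the data are invented.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times:  observed follow-up times
    events: 1 if the event was observed, 0 if censored
    risks:  predicted risk scores (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when i had an observed event before time j.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable


# Invented toy data: the third patient is censored.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
```

A value of 1.0 means every comparable pair is ordered correctly, and 0.5 corresponds to random ordering.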

[LG-170] Estimating Mixture Distributions via Stochastic Mirror Descent

Link: https://arxiv.org/abs/2605.24929
Authors: Mohammadreza Ahmadypour, Tara Javidi, Farinaz Koushanfar
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)

Abstract:We revisit the classical problem of estimating an unknown distribution from its samples by fitting a mixture model that minimizes cross-entropy loss. Framing the task as a stochastic convex optimization problem over the space of $M$-component mixture distributions, we propose a family of estimators derived from the stochastic mirror descent (SMD) algorithm. This optimization-based approach provides a principled and flexible framework that generalizes traditional estimators and proposes a variety of novel estimators through the choice of Bregman divergences. A key advantage of our method is that it scales efficiently with the number of candidate components $f_i$; that is, one can employ a large set of basis distributions in the mixture model without incurring significant computational overhead. This enables richer approximations and improved estimation accuracy. Moreover, in the case of categorical distributions (discrete outcomes), our estimators do not require a strict lower bound; in other words, our framework does not require precise knowledge of the support of the distribution. We demonstrate that, under mild conditions, the proposed $\varphi$-SMD estimators achieve near-optimal convergence rates in both Kullback-Leibler (KL) divergence and $\ell_2$-norm and offer practical benefits when computation is expensive. Our numerical analysis highlights improved performance guarantees over classical estimators, particularly in terms of sample efficiency and scalability.
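
One familiar instance of stochastic mirror descent on the probability simplex is the entropic (exponentiated-gradient) update, obtained when the Bregman divergence is the KL divergence. The sketch below applies it to the mixture weights for the per-sample loss $-\log \sum_i w_i f_i(x)$. It is only meant to show the general shape of such an estimator: the paper's $\varphi$-SMD family, step-size schedule, and guarantees are not reproduced, and the candidate components, data, and step size are assumptions.

```python
import math


def smd_mixture_weights(samples, components, step=0.05):
    """Entropic stochastic mirror descent for mixture weights.

    components[i][x] is the probability candidate i assigns to outcome x.
    Each sample triggers one multiplicative update followed by renormalization.
    """
    m = len(components)
    w = [1.0 / m] * m
    for x in samples:
        likes = [f[x] for f in components]
        mix = sum(wi * li for wi, li in zip(w, likes))
        # The gradient of -log(mix) with respect to w_i is -likes[i] / mix,
        # so the mirror step multiplies w_i by exp(step * likes[i] / mix).
        w = [wi * math.exp(step * li / mix) for wi, li in zip(w, likes)]
        total = sum(w)
        w = [wi / total for wi in w]
    return w


# Two assumed candidate distributions over the outcomes {0, 1}.
candidates = [[0.9, 0.1], [0.1, 0.9]]
data = [0, 0, 0, 1] * 50  # 75% of the samples are outcome 0
weights = smd_mixture_weights(data, candidates)
```

Each update touches every component once, so the per-sample cost is linear in the number of candidates. With this data the weight on the first candidate ends up well above one half, consistent with outcome 0 being more frequent.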

[LG-171] Lifted Schrödinger Bridges for Gaussian Mixture Endpoints: Projection Gaps and Path-Space Obstructions

Link: https://arxiv.org/abs/2605.24795
Authors: Siddhartha Ganguly, George Rapakoulias, Panagiotis Tsiotras
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
Comments: 35 pages. Submitted to a journal; comments are welcome

Abstract:We study stochastic density control between Gaussian-mixture endpoint distributions under Brownian prior dynamics. Since the direct Schrödinger bridge between Gaussian mixtures is generally not available in closed form, we introduce a lifted path-space construction in which each trajectory is augmented with a source–target component label. Consequently, the problem decomposes into Gaussian component-to-component Schrödinger bridges with explicit marginal, drift, and cost formulas, while the mixture-level assignment reduces to a finite-dimensional entropic coupling problem with a Sinkhorn scaling form. We then analyze the projection obtained by discarding or forgetting the label. By construction, the projected law satisfies the original Gaussian-mixture endpoint constraints, but its relative entropy generally differs from the lifted relative entropy by a nonnegative conditional label-information gap. This gap reveals a path-space obstruction: the lifted optimizer cannot, in general, be identified with the direct unlabeled Schrödinger bridge after projection. We also derive the posterior-averaged Markov drift associated with the projected marginal flow, prove a kinetic-energy upper bound, and identify a common path-potential condition under which the projection gap vanishes. Several numerical illustrations showing density and shape control are recorded for a self-contained exposition.

[LG-172] How Neural Reward Models Learn Features for Policy Optimization: A Single-Index Analysis

链接: https://arxiv.org/abs/2605.24749
作者: Rei Higuchi,Ryotaro Kawata,Akifumi Wachi,Shokichi Takakura,Kohei Miyaguchi,Taiji Suzuki
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 35 pages

点击查看摘要

Abstract:Reward modeling is not only a prediction problem: in KL-regularized policy optimization, the learned reward is exponentiated to define the deployed policy, so downstream value depends on errors in reward-tilted regions. We study this feedback in a Gaussian single-index model with r^*(x) = \sigma^*(\langle \theta^*, x\rangle) and x \sim N(0, I_d). We analyze a two-stage neural reward model that first learns the hidden direction \theta^* from reward-weighted samples and then fits the readout layer by weighted ridge regression. Exponential reward weighting changes the Hermite signal available to the first layer; for any feature-learning temperature \beta_1 above a dimension-free O(1) threshold, a constant fraction of neurons recover the hidden direction, with weak-recovery complexity governed by the generative exponent. After feature recovery, we derive tilted-policy value-gap bounds for an idealized label-weighted fit with weights e^{y/\beta_2} and a more practical surrogate-weighted fit with weights e^{r_{a_0}(x)/\beta_2}. Keeping the \beta_2-dependence explicit yields an admissible set of deployment temperatures, balancing the gain from lowering \beta_2 against the learning cost amplified by exponential weighting; in the surrogate-weighted case, proxy-dependent factors shrink this admissible set.
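For orientation, the "exponentiated reward" in this abstract matches the standard closed form of KL-regularized policy optimization. The block below states that textbook identity only; the symbols \pi_0 (reference policy), \hat r (learned reward), and \beta (temperature) are notation assumed here, and the paper's own setup may differ in its details.

```latex
\pi_\beta
  = \arg\max_{\pi}\;
    \mathbb{E}_{x\sim\pi}\!\left[\hat r(x)\right]
    - \beta\,\mathrm{KL}\!\left(\pi \,\|\, \pi_0\right)
\quad\Longrightarrow\quad
\pi_\beta(x)
  = \frac{\pi_0(x)\, e^{\hat r(x)/\beta}}
         {\mathbb{E}_{x'\sim\pi_0}\!\left[e^{\hat r(x')/\beta}\right]}
```

Because the reward enters through e^{\hat r/\beta}, errors in regions where \hat r is large are amplified as \beta decreases, which is the feedback between reward learning and deployment that the abstract describes.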

[LG-173] Deep Learning-Enabled Prediction of Geoeffective CMEs Using SOHO and SDO Observations

链接: https://arxiv.org/abs/2605.24748
作者: Zhaoxin Yan,Jason T. L. Wang,Haimin Wang,Harim Lee,Ju Jing,Yan Xu,Chunhui Xu,Vasyl Yurchyshyn
类目: Solar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG)
*备注: 23 pages, 12 figures, 4 tables

点击查看摘要

Abstract:Understanding and forecasting the geoeffectiveness of a coronal mass ejection (CME) is crucial for protecting infrastructure in the near-Earth space environment and on Earth. In this study, we present a novel fusion model to forecast the geoeffectiveness of CME events. Our model combines convolutional neural networks for feature learning and a prediction network for feature fusion and event classification. The model is trained by observations from instruments including the Large Angle Spectroscopic Coronagraph (LASCO) on board the Solar and Heliospheric Observatory (SOHO) and the Atmospheric Imaging Assembly (AIA) and Helioseismic and Magnetic Imager (HMI) on board the Solar Dynamics Observatory (SDO). The trained model is then used to predict whether an Earth-reaching CME will cause a geomagnetic storm and/or the probability that the CME will cause such a storm. Experimental results based on a five-fold cross validation scheme demonstrate the good performance of our fusion model, achieving a mean true skill statistic (TSS) score of 0.703 when the model is used as a deterministic prediction tool, and a mean Brier score of 0.095 when the model is used as a probabilistic forecasting tool, where a TSS score of 1 or a Brier score of 0 indicates perfect performance. This work contributes to forecasting the causal relationship between Earth-directed CMEs and geomagnetic storms in solar-terrestrial interactions.
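The two scores quoted in this abstract have simple standard definitions, sketched below: the true skill statistic is the hit rate minus the false alarm rate on binary predictions, and the Brier score is the mean squared error of forecast probabilities. The function names and toy inputs are illustrative assumptions and are not taken from the paper's code.

```python
def true_skill_statistic(y_true, y_pred):
    """TSS = TP / (TP + FN) - FP / (FP + TN) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("both classes must be present in y_true")
    return tp / (tp + fn) - fp / (fp + tn)

def brier_score(y_true, probs):
    """Mean squared difference between forecast probability and outcome."""
    if not y_true:
        raise ValueError("empty input")
    return sum((p - t) ** 2 for t, p in zip(y_true, probs)) / len(y_true)

# Toy usage with made-up labels and forecasts.
tss = true_skill_statistic([1, 1, 0, 0], [1, 0, 0, 0])
brier = brier_score([1, 0], [0.8, 0.3])
```

TSS ranges from -1 to 1 and the Brier score from 0 to 1, consistent with the abstract's note that a TSS of 1 or a Brier score of 0 indicates perfect performance.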

[LG-174] On the Sample Complexity of Robust Binary Hypothesis Testing

链接: https://arxiv.org/abs/2605.24741
作者: Shankar Vallinayagam,Ankit Pensia,Varun Jog
类目: Statistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Comments welcome

点击查看摘要

Abstract:We study the sample complexity of robust binary hypothesis testing under three standard contamination models: \varepsilon-additive (Huber), \varepsilon-subtractive, and \varepsilon-total variation (TV), denoted by n^*_{\mathrm{Hub}}(\varepsilon), n^*_{\mathrm{Sub}}(\varepsilon), and n^*_{\mathrm{TV}}(\varepsilon), respectively. For subtractive contamination, we show that least favourable distributions exist and provide explicit formulas for the same, bringing this model in line with the classical Huber and TV models. Next we show that in all three models, sample complexity may be highly unstable in the contamination parameter \varepsilon, increasing by polynomial factors even for o(\varepsilon) perturbations. Similarly, there may be polynomial factor gaps between the sample complexities when \varepsilon is known exactly versus when it is known up to o(\varepsilon) error. Despite the instability of the sample complexity in all models, we show that the sample complexities across models are comparable up to constant-factor rescaling of \varepsilon. Specifically, for any fixed \delta_0 > 0, the following hold for all distributions p and q: (i) n^*_{\mathrm{Hub}}(\varepsilon) \lesssim n^*_{\mathrm{TV}}(\varepsilon) \lesssim n^*_{\mathrm{Hub}}(2\varepsilon), (ii) n^*_{\mathrm{Sub}}(\varepsilon) \lesssim n^*_{\mathrm{TV}}(\varepsilon) \lesssim n^*_{\mathrm{Sub}}((2+\delta_0)\varepsilon), and (iii) n^*_{\mathrm{Sub}}(\varepsilon) \lesssim n^*_{\mathrm{Hub}}(\varepsilon) \lesssim n^*_{\mathrm{Sub}}((1+\delta_0)\varepsilon), and the scaling constants are tight. Finally, we extend our results to adaptive versions of the contamination models.

[LG-175] Affinity Graph Connectivity in Convex Clustering

链接: https://arxiv.org/abs/2605.24673
作者: Sam Rosen,Jason Xu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 28 pages, 6 figures

点击查看摘要

Abstract:We generalize finite-sample bounds for convex clustering to the setting where affinity weights appearing in the objective correspond to a general connected graph. These bounds and their analysis lead to a better understanding of clustering behavior under various implied connectivity structures behind the data and to new rates of convergence for centroid recovery. The new theoretical framework is based on random walks, which allow application of concentration inequalities related to random graph models, and formalizes the relationship between the clustering performance and the connectivity of the graph structures. Through the form of the bound and empirical results, we argue proper tuning of hyperparameters to convex clustering problems should also include tuning of input affinity weights.

[LG-176] AnnotateMissense: a genome-wide annotation and benchmarking framework for missense pathogenicity prediction

链接: https://arxiv.org/abs/2605.24520
作者: Muhammad Muneeb,David B. Ascher
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Missense variant interpretation remains challenging because pathogenicity depends on heterogeneous evidence from population frequency, evolutionary conservation, transcript context, amino acid substitution severity, prior pathogenicity predictors and protein-language-model-derived features. We present AnnotateMissense, a scalable annotation, benchmarking and genome-wide prediction framework for missense variant interpretation. AnnotateMissense integrates hg38 missense variants derived from dbNSFP v5.1 with ANNOVAR annotations, dbNSFP transcript/protein descriptors, AlphaMissense scores, ESM-derived features, conservation metrics, population-frequency variables, established pathogenicity predictors and engineered amino acid/codon-context features. Using 132,714 ClinVar-labelled missense variants, we benchmarked machine-learning and deep-learning models under controlled feature configurations. The full 303-feature benchmark set achieved the strongest performance with XGBoost, reaching mean MCC = 0.9411 and ROC-AUC = 0.9950 across stratified five-fold cross-validation. Restricted naive and location-oriented feature sets achieved lower best MCC values of 0.4989 and 0.5113, respectively. Circularity-controlled ablations showed that removing prior-predictor, population-frequency and clinically overlapping evidence reduced performance, whereas excluding AlphaMissense and ESM-derived features alone had minimal effect. Temporal ClinVar validation on newly observed pathogenic/benign variants achieved MCC = 0.7613, accuracy = 0.8798 and F1-score = 0.8750. The final model was applied to 90,643,830 hg38 missense variants to generate AnnotateMissense pathogenicity scores and binary prediction labels. Code and outputs are available at this https URL and this https URL.

[LG-177] Implicit Binarization via Complex Phase Dynamics in Combinatorial Optimization

链接: https://arxiv.org/abs/2605.24502
作者: Khen Cohen,Mark Glass,Meir Feder,Yaron Oz
类目: Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Combinatorics (math.CO); Computational Physics (physics.comp-ph)
*备注: 27 pages, 5 figures

点击查看摘要

Abstract:We introduce a physics-inspired continuous relaxation framework that yields substantially improved solutions for NP-hard combinatorial optimization problems, including Quadratic Unconstrained Binary Optimization (QUBO), binary sparse coding, and planted-solution Ising models. By parameterizing discrete binary variables as continuous wave-like states on the complex unit circle, we inherently smooth highly non-convex energy landscapes. We show that representing binary variables as complex phases reveals an implicit regularization mechanism that promotes convergence toward discrete states. Extracting this mechanism yields significant improvements even within standard real-valued optimization frameworks, using this regularizer explicitly. Empirically, this regularization yields vastly higher ground-state convergence rates than standard real-valued alternatives. Our models achieved zero error in large-scale 160x160 QUBO tasks under severe noise (sigma=0.25), and outperformed traditional algorithms (OMP and LASSO) in underdefined sparse coding with perfect recovery at sigma=0.15. The solver’s robustness was further validated by recovering exact ground-state configurations in 8 out of 11 rigorously engineered planted-solution benchmarks.

[LG-178] Clustering based on Stochastic Dominance with application for risk averters and risk seekers

链接: https://arxiv.org/abs/2605.24422
作者: Hua Li,Xue Jia,Yilin Kang,Wing-Keung Wong
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Stochastic Dominance (SD) theory provides a rigorous framework for selecting superior assets tailored to the asset allocation needs of investors with varying risk preferences (i.e., risk-averse, risk-seeking, and risk-neutral). However, traditional stock clustering methods typically rely on geometric metrics such as Euclidean distance, which often fail to effectively capture the intrinsic risk dominance relationships among assets. To address this limitation, this paper proposes an innovative clustering analysis framework based on SD test statistics. Methodologically, this study deeply integrates SD theory with machine learning algorithms. Transcending the limitations of traditional reliance on geometric distance, we innovatively utilize test statistics from first-, second-, and third-order SD to construct a “Stochastic Dominance Coefficient Matrix.” Building upon this matrix, we modify the classic K-means and Hierarchical Clustering algorithms. Specifically, we derive 12 distinct algorithm variants tailored to different orders of SD relationships. Simultaneously, we construct the SD-SC coefficient and the SD-DBI index as specialized validity indices to evaluate the clustering performance. Empirically, we analyze constituent stock data from a representative developed market (the US NASDAQ Index) and an emerging market (China’s CSI 100 Index). The results verify the effectiveness and robustness of the proposed method. Furthermore, we apply the clustering results to the modification of the Single Index Model and the construction of Global Minimum Variance Portfolios (GMVP). The findings demonstrate that the proposed method effectively facilitates customized asset allocation for investors, holding significant theoretical value and practical implications.

[LG-179] Fermi-Dirac machines as quantizations of neurons

链接: https://arxiv.org/abs/2605.24386
作者: Alexander He,Nana Liu,Mark M. Wilde
类目: Quantum Physics (quant-ph); Statistical Mechanics (cond-mat.stat-mech); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: 87 pages, 12 figures, 2 tables

点击查看摘要

Abstract:Fermi-Dirac machines were proposed recently as an approach to solving semidefinite optimization problems on quantum computers. Here, we reinterpret them as canonical quantizations of classical neurons. By viewing a classical neuron as an activation function applied to a parameterized classical Hamiltonian, we quantize this model by replacing classical variables with operators whose eigenvalues encode their possible values. This follows the standard approach to canonical quantization in quantum mechanics. Crucially, when the Hamiltonian consists of commuting operators, our construction reduces exactly to a classical neuron. More generally, our approach yields an activation observable, defined as an activation function applied to a parameterized quantum Hamiltonian. The output of this quantized neuron is a random variable with expectation value equal to that of the activation observable with respect to an input state. We develop efficient hybrid quantum-classical algorithms for evaluating outputs and gradients of our quantized neurons, enabling evaluation and training. These algorithms rely on basic primitives that include random sampling, Hamiltonian simulation, and the Hadamard test. We also quantize a whole host of other activation functions, including the smooth rectified linear unit (ReLU), sigmoid linear unit, Gaussian-smoothed ReLU, and Gaussian error linear unit (GeLU), which are known to be useful for deep learning applications. Numerical experiments indicate that neurons based on quantum Hamiltonians can learn functions that classical neurons cannot. We further define a computational decision problem based on Fermi-Dirac neurons and prove that it is BQP-complete, providing complexity-theoretic evidence against efficient classical simulation. Finally, we generalize our approach to continuous quantum variables and sketch two different ways of composing these neurons into networks.

[LG-180] Multicalibration Boosting: Theory, Convergence, and Transferability

链接: https://arxiv.org/abs/2605.24364
作者: Hanxuan Ye,Hongzhe Li
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multicalibration extends classical calibration by requiring predictions to be unbiased over a rich collection of functions, encompassing both prediction slices and subpopulations. It has emerged as a powerful framework for fairness, robustness, and reliable prediction, yet the theoretical understanding of multicalibration boosting (MCBoost) remains fragmented and often relies on restrictive assumptions. In this work, we develop a unified and refined perspective on MCBoost that subsumes existing variants, including multiaccuracy, BatchGCP, and BatchMVP. We uncover several phenomena that provide new insights into its practical behavior: even highly accurate and flexible predictors can remain substantially miscalibrated; enforcing multicalibration introduces a calibration-risk trade-off; and early stopping plays a central role in controlling this trade-off. On the theoretical side, we establish a general framework for MCBoost under weaker and more realistic conditions. We show that the boosting iterates converge to a Bregman projection of the population-optimal predictor onto the cumulative span generated by the audit class, thereby explicitly characterizing the function space on which multicalibration is achieved. We further derive convergence rates under different smoothness assumptions, finite-sample guarantees, and principled stopping rules that ensure multicalibration at termination. Finally, we extend the theory of universal adaptability under covariate shift, providing more general transfer guarantees and clarifying when multicalibrated predictors generalize across domains. These results provide a more complete theoretical foundation and practical guidance for multicalibration boosting, positioning it as both a unifying framework and a reliable post-processing approach for modern predictive models.

[LG-181] A Matched Spectral Benchmark of Quantum Inspired Feature Maps

链接: https://arxiv.org/abs/2605.24324
作者: Toheeb Ogunade,Taofeek Kassim,Etinosa Osaro
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Quantum machine learning is often motivated by the idea that quantum systems can expose useful high-dimensional structure that is difficult to access with classical models. We isolate one central component of this claim: the fixed data-encoding map. Amplitude, angle, and basis encoding are evaluated as deterministic feature maps for classical supervised learning under matched output dimensionality and strong classical controls. The benchmark compares these encodings against raw linear models, random Fourier features, polynomial features, PCA, RBF SVMs, and shallow neural networks across diverse classical datasets. Rather than treating performance as a single endpoint, we analyze the geometry of each representation through effective rank, condition number, centered kernel alignment, predictive performance, and practical overhead. The resulting picture is mechanistic: amplitude encoding can remove magnitude information through unit-sphere normalization, angle encoding can become geometrically redundant with raw linear features, and basis encoding can impose a binary Hamming geometry that is poorly aligned with smooth decision structure. These findings do not argue against quantum computation, however, they show that fixed quantum-inspired encoding geometry alone is not a reliable source of machine-learning advantage on classical data.
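The abstract's remark that amplitude encoding "can remove magnitude information through unit-sphere normalization" can be checked with a few lines of code. The sketch below treats amplitude encoding as plain L2 normalization of a real feature vector, which is a simplifying assumption for illustration; `amplitude_encode` is a hypothetical helper, not the benchmark's implementation.

```python
import math

def amplitude_encode(x):
    """Map a real vector to the unit sphere, as amplitude encoding requires."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm == 0.0:
        raise ValueError("the zero vector cannot be amplitude encoded")
    return [v / norm for v in x]

# Two inputs that differ only in scale receive the same encoding,
# so any label that depends on magnitude is lost after this map.
small = amplitude_encode([3.0, 4.0])
large = amplitude_encode([30.0, 40.0])
same_encoding = all(abs(s - l) < 1e-12 for s, l in zip(small, large))
```

Under this reading, a dataset whose classes are separated by vector norm would become inseparable after encoding, regardless of the downstream classifier.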

[LG-182] MEDAL: Manifold Embedding Distillation via Autoencoder Learning

链接: https://arxiv.org/abs/2605.24244
作者: Irene Chang,Tarek M. Zikry,Genevera I. Allen
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Low-dimensional embeddings are widely used as visual summaries of high-dimensional data and to enable downstream scientific discoveries. Yet, popular nonlinear dimension reduction methods, such as t-SNE and UMAP, are often selected based on visual appeal alone and without rigorous quantitative validation. A major reason is that manifold embeddings typically do not provide an out-of-sample map nor an inverse back to the original feature space; this makes held-out validation, the gold standard in supervised learning, all but impossible. To address these challenges, we develop a novel framework, MEDAL (Manifold Embedding Distillation via Autoencoder Learning), which distills a fitted manifold embedding into a reusable encoder–decoder model. MEDAL trains a constrained autoencoder whose bottleneck exactly matches any teacher embedding while the decoder reconstructs the original input; this yields an explicit map for new samples, an approximate inverse, and a pointwise reconstruction-based measure of distortion in the manifold space. This converts static manifold embeddings into models that can be evaluated on held-out data, enabling quantitative validation including comparing different dimension reduction methods as well as hyperparameter tuning. Across multiple benchmark and scientific case studies, we show that MEDAL enables held-out validation to determine optimal manifold embeddings and hyperparameters, reveals biologically coherent regions that are difficult to preserve in two dimensional embeddings, and detects distribution shift when new samples are mapped into a fixed reference manifold. MEDAL provides a general validation wrapper to any existing dimension reduction technique that will improve the rigor and

[LG-183] Learning dynamical systems with biochemically informed neural ordinary differential equations

链接: https://arxiv.org/abs/2605.24170
作者: Luis L. Fonseca,Reinhard C. Laubenbacher,Lucas Böttcher
类目: Dynamical Systems (math.DS); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 23 pages, 13 figures, 4 tables

点击查看摘要

Abstract:Ordinary differential equation models of biochemical reactions are often formulated as stoichiometric systems in which the dynamics arise from a collection of interacting processes. A central challenge is that the functional form of each process is rarely known a priori and may be difficult to infer from data. We propose biochemically informed neural ordinary differential equations (BINODEs), a neural-ODE framework that retains the stoichiometric structure of mechanistic models while representing individual processes by neural networks. In BINODEs, the outputs of neural network processes (NNPs) are mapped to state derivatives through a linear layer analogous to a stoichiometric matrix. This architecture allows biological side information, such as process-specific inputs, sign constraints, and monotonicity assumptions, to be built directly into the model. We characterize the approximation properties of NNPs for several standard biochemical rate laws and show that the proposed framework recovers both trajectories and process-level structure in Monod, Lotka–Volterra, pharmacokinetic, and ultradian endocrine models. These results suggest that BINODEs offer a useful compromise between mechanistic interpretability and data-driven flexibility for modeling partially known biochemical or biological dynamical systems.

[LG-184] Detecting Metastable Basins in High Dimensions via Marginal Trajectory Distribution Discrimination

链接: https://arxiv.org/abs/2605.24136
作者: Taj Jones-McCormick
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:We study the problem of identifying dynamically distinct basins of attraction in high dimensional time-homogeneous Markov processes using only trajectory sampling. This problem is fundamental in the analysis of metastable dynamical systems, where the process rapidly mixes within basins while transitions between basins occur rarely on the timescale of interest, or even when the state space is reducible. Existing approaches typically rely on spatial discretization or spectral analysis of estimated transition operators, which can become unreliable in high dimensional settings or when the underlying basin geometry is highly nonlinear. We propose a discriminative approach to basin identification based on marginal trajectory distribution comparison. We prove a simple risk separation result: if two initial states belong to the same basin, the Bayes-optimal classifier distinguishing their marginal trajectory distributions achieves risk close to 1/2, whereas if they lie in distinct basins, the optimal risk is close to zero. This observation reduces basin detection to a two-sample discrimination problem between marginal trajectory distributions. Motivated by this principle, we develop a neural algorithm that receives a set of candidate basin representatives and iteratively merges them by estimating classification risk with a neural network that approximates the Bayes classifier. We evaluate the method on various metastable systems. These include synthetic systems constructed by embedding low-dimensional dynamics into high dimensional noisy ambient spaces. In these settings, standard spectral and clustering-based methods often fail, while our approach accurately recovers the underlying basin structure. These results display a shortcoming of existing methods and highlight trajectory discrimination as an effective tool for identifying dynamical basins in high dimensional stochastic systems. 

[LG-185] LWM-CDE: A Representation Space for Wireless Data Reasoning and Transferability

链接: https://arxiv.org/abs/2605.24077
作者: Sadjad Alikhani,Akshay Malhotra,Shahab Hamidi-Rad,Ahmed Alkhateeb
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: The model and relevant scripts are available on the WILab Hugging Face page: this https URL

点击查看摘要

Abstract:Machine learning deployments in real-world wireless communication tasks face significant generalization challenges due to location and environment-specific signal structure, high diversity in data across different deployments, and limited availability of real-world data. Current approaches for assessing data similarity between training and inference (deployment) distributions, as well as evaluating model transferability, suffer from high computational costs and inconsistent performance, leaving critical model deployment and model life cycle management decisions without a principled foundation. To address this, we introduce a dataset similarity framework built upon the feature space of a pretrained wireless foundation model. Our method, LWM-CDE (Contrastive learning of Dataset Embedding), fine-tunes the dataset embeddings of the foundation model using a combination of contrastive and geometry-shaping losses, creating a structured manifold where distance reliably indicates transferability. Extensive experiments on wireless benchmarks show that LWM-CDE achieves stronger correlation with empirical transfer performance than existing metrics while being more computationally efficient. The learned representation space supports more effective and data-efficient decision-making for tasks like source dataset selection, label-aware augmentation, and budgeted pretraining, demonstrating its broader utility across different wireless communication applications.

[LG-186] Causality as the Statistical Conscience of Artificial Intelligence: From Pearl's Ladder to Trustworthy Machines

链接: https://arxiv.org/abs/2605.24076
作者: Ernest Fokoué
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 18 pages, 4 figures, 1 table

点击查看摘要

Abstract:Modern Artificial Intelligence achieves remarkable predictive power by optimizing statistical risk functionals over vast corpora. Yet a gap separates this from genuine intelligence: the inability to distinguish correlation from causation. This paper argues that causal inference (identifying mechanisms invariant under intervention) is AI’s indispensable statistical conscience. Without causal grounding, AI systems are correlation machines: powerful in familiar domains, brittle under distribution shift, and biased in high-stakes settings. Three contributions develop this argument. First, a Statistical Necessity Theorem for Causal Generalization: any algorithm achieving out-of-distribution generalization must encode causal structure, formalizing the distinction between prediction P(Y|X) and intelligence P(Y|do(X)). Second, a unified framework connects Pearl’s do-calculus, the Potential Outcomes framework, Double Machine Learning, and Invariant Risk Minimization as a family of Causal Statistical Estimators, each identifying interventional distributions under different assumptions. Third, three AI failure modes (hallucination in large language models, reward hacking in reinforcement learning from human feedback, and degradation under distribution shift) are manifestations of causal blindness, each admitting a principled statistical remedy. Trustworthy AI is, at its core, a problem of causal statistics. The statistical community is not merely equipped to solve it – it is the only community with the foundational tools to do so rigorously.

[LG-187] Multitask learning with semiempirical orbital charges enables sample-efficient MLIPs

链接: https://arxiv.org/abs/2605.24073
作者: Ihor Neporozhnii,Sjoerd Hoogland,Oleksandr Voznyy
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注: 16 pages, 6 figures

点击查看摘要

Abstract:Machine learning interatomic potentials (MLIPs) require generating computationally expensive, large-scale training datasets to accurately simulate materials and molecules. Incorporating electronic structure information using multitask learning improves sample efficiency, however, training on full Hamiltonian matrices, which scale quadratically with the number of atoms, is intractable for large datasets. In this work, we show that multitask learning utilizing orbitally resolved semiempirical charges significantly improves sample efficiency and accuracy in MLIPs. To efficiently predict orbital charges, we implement a specialized equivariant model, reducing charge prediction error compared to an invariant baseline. By augmenting training with computationally inexpensive GFN1-xTB orbital charges, which scale linearly with the number of atoms, our model achieves a 46% reduction in energy mean absolute error and requires five times less data to match the performance of energy-only models. Furthermore, our approach outperforms models trained on expensive density functional theory (DFT) atomic charges, capturing orbitally resolved electronic complexity and forcing the network to learn a physically accurate latent space that spontaneously clusters metals by shared chemical properties. Because orbital charges are only required during training, this approach preserves inference efficiency, providing a scalable recipe for developing accurate, data-efficient foundation models for complex chemical systems.

[LG-188] Optimal Non-Asymptotic Edgeworth Expansions for Multivariate Neural Network Outputs

链接: https://arxiv.org/abs/2605.24072
作者: Lucia Celli
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注: 34 pages, 2 figures

点击查看摘要

Abstract:Finite-width fully connected neural networks with Gaussian-initialized weights deviate from their infinite-width Gaussian limit, exhibiting non-vanishing higher-order cumulants. We approximate these deviations, for a neural network evaluated in a finite number of inputs, using multidimensional Edgeworth expansions of arbitrary order 4m-1, with m \in \mathbb{N}. Assuming that the corresponding Gaussian limit has an invertible covariance matrix and that the activation function is polynomially bounded, we establish a bound of order n^{-m} on the total variation distance between the law of the true network output and its Edgeworth approximation, with matching lower bounds. As an application, we quantify the error in Bayesian posterior distributions when the prior is replaced by its Edgeworth expansion. Our results are more general and also apply to sequences of conditionally Gaussian vectors converging to a Gaussian vector with invertible covariance.

[LG-189] Seeing Inside the Storm: Improving Nowcasting by Integrating Meteorological Drivers

链接: https://arxiv.org/abs/2605.24067
作者: Minghui Qiu,Jun Chen,Lin Chen,Weifeng Chen,Shuxin Zhong,Zhidan Liu,Yu Zhang,Kaishun Wu
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Most nowcasting systems, built on radar reflectivity, focus on current precipitation, ignoring the atmospheric precursors – such as low-level convergence, turbulent eddies, and latent heating – that offer a fleeting window to foresee storm birth. We introduce MeteoLogist, a physics-inspired radar intelligence framework that models the full life cycle of convection – from its precursors to organized storm evolution. However, exploiting these precursors is non-trivial: they originate from multiple meteorological drivers – thermodynamic, kinematic, and microphysical – that evolve asynchronously (C1) and remain spatially fragmented (C2). To this end, MeteoLogist designs three tightly integrated components. The Physics-Tailored Encoders process radar echoes according to their intrinsic physical scales and semantics, forming thermodynamic, kinematic, and microphysical streams that capture distinct dynamical regimes. The Temporal-Phase Aligner addresses C1 by leveraging causal temporal attention to capture when and how different drivers interact and activate. The Cross-Field Spatial Aggregator addresses C2 through cross-regional fusion, aligning weak and scattered precursors across neighboring cells to expose upstream triggers and enforce spatial coherence. Evaluated on 3D-NEXRAD (2020–2022, US-wide), MeteoLogist boosts high-impact detection (CSI40) by +9.7% over strong baselines, and achieves a remarkable 37.67% gain during the storm-developing stage – demonstrating true foresight in sensing storms before they appear. The code can be found in the supplementary material.

[LG-190] Aurora Hunter: A Two-Stage Framework for Probabilistic Visibility Forecasting

Link: https://arxiv.org/abs/2605.24038
Authors: Zongyuan Ge,Chenwaner Zhang,Haoyang Li,Hantai Zhang,Wenxin Gu,Wei Zhou,Zhaoming Wang
Categories: Space Physics (physics.space-ph); Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)

Abstract:Forecasting aurora borealis visibility matters for space weather research and aurora tourism. Visibility at a site and night depends on two distinct factors: (1) whether aurora is physically occurring, driven by solar wind-magnetosphere coupling, and (2) whether observing conditions allow naked-eye detection, mainly cloud cover and lunar illumination. We present Aurora Hunter, a two-stage cascade that decouples these factors. Stage 1 predicts P(occurring) with XGBoost using 51 physics-driven features trained on joint Tromso+Kiruna data (about 16,600 hourly samples, 2015-2023) with labels from the Tromso AI all-sky image classifier. Stage 2 predicts P(clear observation given occurring) with logistic regression using 21 cloud-cover and lunar-illumination features trained only on aurora-occurring hours. The cascade P(visible)=P(occurring)*P(clear|occurring) reaches ROC-AUC 0.937 (Tromso test, 2019-2020) and 0.905 (independent Kiruna, 2024), improving a single-stage baseline by +0.087. Held-out Skibotn data (2022-2025) confirm cross-site generalization. SHAP identifies the Kp x nightside interaction, MLT position, and auroral oval distance as dominant predictors (39% combined). Prototype: this https URL.
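
The cascade formula in this abstract, P(visible)=P(occurring)*P(clear|occurring), can be illustrated with a minimal sketch. The function name and the input probabilities below are hypothetical placeholders; in the paper the two inputs would come from the XGBoost and logistic-regression stages, which are not reproduced here.

```python
def cascade_visibility(p_occurring: float, p_clear_given_occurring: float) -> float:
    """Combine the two stage outputs into P(visible) by the product rule."""
    for p in (p_occurring, p_clear_given_occurring):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_occurring * p_clear_given_occurring


# Hypothetical stage outputs for one site-hour (not values from the paper).
p_visible = cascade_visibility(p_occurring=0.8, p_clear_given_occurring=0.5)
print(p_visible)
```

Because the second factor is conditional on aurora occurring, a confident Stage 1 prediction can still give a low visibility probability under heavy cloud cover, which is the decoupling the abstract describes.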

[LG-191] Volatility Surface Reconstruction using Deep Learning under No-Arbitrage Constraints

Link: https://arxiv.org/abs/2605.24031
Authors: Pablo Rodriguez Manzi
Categories: Computational Finance (q-fin.CP); Machine Learning (cs.LG)
Comments: MSc thesis, Universidad de Buenos Aires, 2026. 94 pages, 27 figures

Abstract:We study the reconstruction of implied volatility surfaces from sparse and noisy option quotes using deep learning models under no-arbitrage constraints. We compare multiple neural architectures, including multilayer perceptrons, convolutional networks, U-Nets, variational autoencoders, and Transformer-based models, against classical SVI parameterizations on option market data. Results show that Transformer and U-Net architectures achieve strong reconstruction accuracy, particularly under sparse observation regimes, while soft arbitrage penalties significantly reduce arbitrage violations with moderate impact on reconstruction error. We further analyze the trade-off between accuracy and arbitrage consistency across architectures and regularization strengths.
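
The abstract does not spell out the form of its soft arbitrage penalties. As one illustrative possibility, the sketch below penalizes calendar-spread violations, using the standard condition that total implied variance w = sigma^2 * T should not decrease with maturity at fixed forward log-moneyness. The function names and the grid layout (rows indexed by maturity, columns by log-moneyness) are assumptions made for this example, not the thesis's implementation.

```python
def total_variance(vols, maturities):
    """Convert a grid of implied vols into total variance w = sigma^2 * T."""
    return [[v * v * t for v in row] for row, t in zip(vols, maturities)]


def calendar_penalty(vols, maturities):
    """Sum of squared decreases in total variance between adjacent maturities.

    vols[i][j] is the implied volatility at maturities[i] (ascending, in years)
    and at the j-th log-moneyness point. Returns 0.0 when no calendar-spread
    violation is present on the grid.
    """
    w = total_variance(vols, maturities)
    penalty = 0.0
    for near, far in zip(w, w[1:]):
        for w_near, w_far in zip(near, far):
            penalty += max(0.0, w_near - w_far) ** 2
    return penalty


# Hypothetical two-maturity grids (not market data).
flat_surface = [[0.2, 0.2], [0.2, 0.2]]
inverted_surface = [[0.4, 0.4], [0.2, 0.2]]
print(calendar_penalty(flat_surface, [0.5, 1.0]))
print(calendar_penalty(inverted_surface, [0.5, 1.0]))
```

In a training loop, a term like this would typically be added to the reconstruction loss with a regularization weight, which is the accuracy versus arbitrage-consistency trade-off the abstract analyzes. A butterfly (convexity in strike) penalty would be a separate term and is not shown.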

[LG-192] Improving Ensemble CAPE Forecasts with a Diffusion Model Incorporating Aerosol Information

Link: https://arxiv.org/abs/2605.24009
Authors: Zachary James,Joseph Guinness,Arthur DeGaetano
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)

Abstract:Convective available potential energy (CAPE) is an important variable for forecasting severe weather and understanding deep convection and precipitation. The latest versions of the Global Forecast System (GFS) and related Global Ensemble Forecast System (GEFS) have exhibited a bias towards underestimating CAPE values during the summertime. We train an artificial intelligence (AI) diffusion model to improve the skill and uncertainty quantification of afternoon 6-hour lead time ensemble forecasts over the United States. Our model takes a GFS CAPE forecast as input and outputs an ensemble that significantly outperforms both GFS and GEFS 6-hour forecasts on root mean square error, continuous ranked probability score, and Brier score. We propose a two-stage training pipeline to leverage both a larger historical GFS forecast dataset and a smaller historical GEFS dataset, despite the two using initialization and parameterization schemes that vary over time. We also show that classifier-free guidance can be used to control the skill and spread of the forecasts. We then demonstrate the versatility of our framework by adding aerosol optical depths (AODs) of black carbon, organic carbon, dust, sea salt, and sulfates as additional input features. Aerosols can invigorate or suppress convection depending on atmospheric conditions. Our AI models effectively incorporate aerosols to produce improved CAPE forecasts. We interpret the model components by using permutation feature importance to rank the influence of the different AODs and find that black carbon, organic carbon, and sulfate aerosols have a greater impact on the model’s CAPE predictions than sea salt and dust aerosols.
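
One of the metrics named in this abstract, the Brier score, can be sketched for an ensemble forecast as follows. The example assumes the event is "CAPE exceeds a threshold" and that the forecast probability is the fraction of ensemble members above that threshold. The abstract does not define the events it scores, so the threshold, member values, and function names here are hypothetical.

```python
def exceedance_probability(members, threshold):
    """Fraction of ensemble members strictly above the threshold."""
    return sum(m > threshold for m in members) / len(members)


def brier_score(probs, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Hypothetical CAPE ensembles (J/kg) at two grid points, with an assumed
# event threshold of 1000 J/kg; none of these numbers come from the paper.
ensembles = [[500.0, 1500.0, 2500.0, 3500.0], [100.0, 200.0, 300.0, 400.0]]
observed_event = [1, 0]
probs = [exceedance_probability(e, 1000.0) for e in ensembles]
print(probs, brier_score(probs, observed_event))
```

Lower Brier scores are better, and a score of 0 means every forecast probability matched the observed outcome exactly.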

[LG-193] Quantification of atmospheric carbon dioxide from the Geostationary Operational Environmental Satellite (GOES East)

Link: https://arxiv.org/abs/2605.23991
Authors: Aaron Sonabend-W,Sean Campbell,John Platt,Christopher Van Arsdale,Anna M. Michalak
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Earth and Planetary Astrophysics (astro-ph.EP); Machine Learning (cs.LG)
Comments: 23 pages, 7 figures, 1 table

Abstract:There is a growing urgency to track greenhouse gases with the resolution, precision and accuracy needed to support independent verification of CO_2 fluxes at local to global scales. The current generation of space-based sensors, however, only provides sparse observations in space and time. This challenge has fueled interest in the potential use of data from existing missions originally developed for other applications for inferring global greenhouse gas variability. The Advanced Baseline Imager (ABI) onboard the Geostationary Operational Environmental Satellite (GOES-East), operational since 2017, provides full coverage of much of the western hemisphere at 10-minute intervals from geostationary orbit at 16 wavelengths at an approximately 2 km^2 spatial resolution. Here, we leverage this high spatial coverage and temporal revisit to develop a single-pixel, physics-guided neural network to estimate dry-air column CO_2 mole fraction (XCO_2). The model employs a time series of GOES-East’s 16 spectral bands, ECMWF ERA5 lower tropospheric meteorology, MODIS surface reflectance, solar and satellite viewing geometry, and day of year. Training used collocated GOES-East and OCO-2/OCO-3 observations. We also present case studies illustrating the use of the model to observe XCO_2 enhancements over urban areas and drawdown over agricultural regions. Overall, while the precision of GOES-East derived XCO_2 can never rival that of dedicated instruments, the unprecedented combination of contiguous geographic coverage, 10-minute temporal frequency, and multi-year record offers the potential to observe aspects of atmospheric CO_2 variability currently unseen from space.

[LG-194] Physics-Guided Concentration Inference from Resistance Transients in a Mixed-Phase SnO-SnO_2 Carbon Monoxide Sensor with p-n Switching

Link: https://arxiv.org/abs/2605.23971
Authors: Sani Biswas,Preetam Singh,Amit Kumar Gangwar
Categories: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Applied Physics (physics.app-ph)
Comments: 15 pages, 14 figures

Abstract:This work presents a physics-guided machine-learning framework for carbon monoxide concentration inference from experimentally measured resistance transients of a mixed-phase SnO-SnO_2 material gas sensor exhibiting temperature-dependent p-n switching behavior. Cycle-level transient responses are represented through physically interpretable descriptors and complemented by compact fast Fourier transform (FFT) and discrete wavelet transform (DWT)-based summaries. Using leakage-aware grouped cross-validation, we study both multi-class concentration classification and continuous concentration regression for the p-type and n-type sensing regimes separately. Across both regimes, fused features provide the strongest overall performance, while the physics-guided descriptor block remains highly competitive, indicating that the dominant concentration information is already encoded in physically meaningful transient dynamics. The p-type branch shows the best concentration-class discrimination, with the fused Random Forest classifier reaching approximately 96.5% accuracy, whereas the n-type branch yields the best quantitative concentration estimation, with the fused Random Forest regressor achieving an MAE \approx 1.48 ppm and an R^2 \approx 0.992. These results reveal a clear dual-regime behavior: p-type sensing is particularly favorable for classification, whereas n-type sensing is more favorable for high-fidelity regression. More broadly, the study demonstrates that leakage-aware, cycle-level, physics-guided machine learning can extend conventional gas-sensing analysis beyond single-response metrics while preserving physical interpretability.
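
The idea behind leakage-aware grouped cross-validation can be sketched in a few lines: every sample carries a group label, and whole groups are assigned to folds so that no group contributes to both the training and the test side of a split. The abstract does not say what defines a group in this study, so the labels below (one per hypothetical measurement run) and the function name are assumptions for illustration only.

```python
def grouped_folds(groups, n_folds):
    """Return (train_indices, test_indices) pairs that never split a group."""
    unique_groups = sorted(set(groups))
    if n_folds > len(unique_groups):
        raise ValueError("need at least as many groups as folds")
    # Round-robin assignment of whole groups to folds.
    fold_of_group = {g: i % n_folds for i, g in enumerate(unique_groups)}
    folds = []
    for fold in range(n_folds):
        test = [i for i, g in enumerate(groups) if fold_of_group[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of_group[g] != fold]
        folds.append((train, test))
    return folds


# Hypothetical group label per transient cycle (e.g. the run it came from).
groups = ["run_a", "run_a", "run_b", "run_b", "run_c", "run_c"]
for train_idx, test_idx in grouped_folds(groups, n_folds=3):
    print(train_idx, test_idx)
```

Splitting at random over individual cycles instead would let highly correlated cycles from the same group land on both sides, which tends to inflate reported accuracy.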

[LG-195] From Index to Equity: Pre-Training Transformers for Stock Return Prediction

Link: https://arxiv.org/abs/2605.23962
Authors: Marie Soehl Coolsaet,Roberto Gallardo,Zhen Gao
Categories: Statistical Finance (q-fin.ST); Machine Learning (cs.LG)

Abstract:This research aims to leverage machine learning to improve stock price prediction and support informed investment decisions related to buying, selling, and holding assets. Specifically, this work investigates transformer-based models for stock prediction and examines the impact of pre-training strategies on forecasting performance. A transformer model was first pre-trained on the Toronto Stock Exchange Index (TSX) to predict intra-day return direction and subsequently fine-tuned on individual TSX stocks. The model was further adapted for return-value regression tasks. Performance was benchmarked against Long Short-Term Memory (LSTM) and XGBoost models. Pre-training on the market index improved the binary cross-entropy loss for individual stock prediction from 0.69 to 0.64. The fine-tuned transformer regression model achieved lower mean squared error than the benchmark models, although the ensemble and XGBoost models achieved higher average daily returns. In addition, a practical application was developed to deliver real-time stock predictions for trading support. Future work will focus on increasing transformer model capacity, incorporating broader global technical indicators, and filtering out stocks with low predictability.
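
For context on the reported losses: assuming the binary cross-entropy is computed with the natural logarithm (the abstract does not say), a predictor that always outputs 0.5 scores ln 2, about 0.693, whatever the labels are. The short check below uses made-up labels to show this, which suggests the 0.69 baseline is close to chance level and 0.64 is a modest improvement over it.

```python
import math


def binary_cross_entropy(labels, probs):
    """Mean binary cross-entropy (natural log) over paired labels and probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


# Hypothetical up/down labels; these are not data from the paper.
labels = [1, 0, 1, 1, 0, 0]
coin_flip_loss = binary_cross_entropy(labels, [0.5] * len(labels))
print(round(coin_flip_loss, 4))  # 0.6931, i.e. ln 2
```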

[LG-196] Learning Protein Structure-Function Relationships through Knowledge-guided Representation Decomposition ICML2026

Link: https://arxiv.org/abs/2605.23960
Authors: Mingqing Wang,Zhiwei Nie,Athanasios V. Vasilakos,Yonghong He,Zhixiang Ren
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments: 28 pages, 17 figures, ICML 2026 regular

Abstract:Proteins encode diverse functions within complex three-dimensional structures, yet most deep learning representations remain highly entangled, obscuring the biophysical signals that underlie function. Here we introduce ProtDiS, a knowledge-guided framework that decomposes pretrained protein micro-environment embeddings into biologically grounded and task-relevant dimensions. Inspired by the information bottleneck principle, ProtDiS learns representations that balance informativeness and compression, yielding structural features that are more specific, independent, and information-efficient, and achieving consistent improvements across twelve downstream tasks, with the largest gains under structure-based splits. Protein- and residue-level analyses further show that ProtDiS differentiates proteins with similar folds but divergent functions and captures fine-grained biophysical signals critical to function. These findings suggest that knowledge-guided decomposition provides a general and interpretable approach for structuring latent spaces in protein structural modeling. The source code and implementation details are publicly available at this https URL.

[LG-197] Game-Theoretic Modeling of Heterogeneous Investor Interactions for Stock Price Forecasting

Link: https://arxiv.org/abs/2605.23953
Authors: Yong Zhang,Xinxiao Wu,Yunde Jia,Che Sun
Categories: Trading and Market Microstructure (q-fin.TR); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments: 10 pages, 1 figure, intended for conference submission

Abstract:Accurate stock price forecasting has consistently remained a pivotal yet challenging FinTech task that underpins quantitative trading and investment decision making. Recent efforts have been dedicated to modeling various complex relationships among stocks in the stock market toward more reliable stock price forecasting. These methods depend heavily on strong static prior assumptions by modeling either temporal dependencies within individual stocks or spatial dependencies across different stocks based on predefined structures, while the complex market dynamics that drive stock price movements remain unexplored. To alleviate this issue, we propose a novel game-theoretic modeling method that captures heterogeneous investor interactions for stock price forecasting. The core idea is to embed game-theoretic mechanisms into the heterogeneous graph structure to finely model the dynamic strategic interactions among heterogeneous investors with respect to target stocks. Additionally, temporal positional encoding is adopted to reflect the differentiated influences of each game event at different time steps within the time window on future stock price movements. Leveraging heterogeneous graph networks, we proxy the intricate dynamics of the stock market through investor games and enable real-time information propagation and node updates among all nodes. Extensive experiments conducted on two real-world benchmark datasets demonstrate that our method effectively outperforms state-of-the-art stock price forecasting methods.
