本篇博文主要内容为 2026-05-21 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR、MA六个大方向区分。

说明:每日论文数据从Arxiv.org获取,每天早上12:30左右定时自动更新。

提示: 当天未及时更新,有可能是Arxiv当日未有新的论文发布,也有可能是脚本出错。尽可能会在当天修复。

目录

概览 (2026-05-21)

今日共更新748篇论文,其中:

  • 自然语言处理106篇(Computation and Language (cs.CL))
  • 人工智能240篇(Artificial Intelligence (cs.AI))
  • 计算机视觉186篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习289篇(Machine Learning (cs.LG))
  • 多智能体系统12篇(Multiagent Systems (cs.MA))
  • 信息检索8篇(Information Retrieval (cs.IR))
  • 人机交互30篇(Human-Computer Interaction (cs.HC))

多智能体系统

[MA-0] ransforming Privacy Artifacts into Accessible Reports for Non-Technical Stakeholders

【速读】:该论文试图解决的问题是:在工业5.0背景下,人机协作系统因涉及对人类工人的监控而引发隐私担忧,但当前的需求工程(Requirements Engineering, RE)实践缺乏有效方法将隐私威胁与缓解措施以非技术用户可理解的方式传达给工人及其工会等非技术利益相关方,从而导致信任缺失和对新技术的抵制。解决方案的关键在于提出一个概念性框架,该框架基于隐私设计(Privacy by Design)原则,利用大语言模型(Large Language Models, LLMs)将技术文档转化为面向非技术利益相关者的隐私报告,从而实现从人机监控用例到知情决策支持的转化,并促进早期利益相关方参与和透明化决策过程。

链接: https://arxiv.org/abs/2605.21269
作者: Zoe Pfister,Clemens Sauerwein,Benedikt Dornauer,Tina Mersch,Christian Wolf,Ruth Breu,Michael Vierhauser
机构: University of Innsbruck (因斯布鲁克大学); EKS InTec (EKS InTec)
类目: oftware Engineering (cs.SE); Multiagent Systems (cs.MA)
备注: 8 pages (7+1), Accepted for publication at RE@Next’26

点击查看摘要

Abstract:The transition toward Industry 5.0 is reshaping industrial work environments with an emphasis on human-centricity, enabling close collaboration between humans and machines to enhance productivity and flexibility. However, such systems typically require monitoring of human workers and operators, often involving sensitive data, raising significant privacy concerns. As a result, affected workers and unions frequently reject human-machine collaboration features due to a lack of transparency regarding privacy threats and implemented mitigation strategies. To enable early stakeholder involvement, establish trust, and support informed decision-making, privacy implications must be communicated in a way understandable to non-technical stakeholders. Yet, current Requirements Engineering (RE) practices provide limited methodological support for making privacy threats and mitigations accessible to non-technical stakeholders (e.g., individual workers or their representative unions). In this RE@Next paper, we propose a conceptual framework that guides software design from human monitoring-related use cases and requirements to informed decision-making guidance focusing on non-technical stakeholders. Building on principles such as Privacy by Design, the framework leverages Large Language Models (LLMs) to transform technical artifacts into accessible privacy reports. We share initial insights from two industry use cases, evaluate the quality of the generated reports, and outline future research directions toward integrating privacy transparency into RE processes for human-centric industrial systems.

[MA-1] Decoupling Communication from Policy: Robust MARL under Bandwidth Constraints

【速读】:该论文试图解决多智能体强化学习(MARL)中通信受限场景下的性能下降问题,特别是在带宽严重受限的实际应用(如无人机编队搜救)中,传统通信架构因将共享潜在表示同时用于策略执行和智能体间通信,导致压缩消息尺寸会直接限制策略的潜在空间,从而引发显著性能退化。解决方案的关键在于:一是提出一个归一化的单智能体带宽预算参数 β\beta,将稀疏性、通信轮次和消息维度统一为可比较的约束;二是设计SLIM架构,通过解耦通信路径与策略潜在表示,实现带宽影响与策略容量影响的独立控制,同时保留即时通信优势。实验表明,该方法在多个部分可观测MARL基准任务上达到最先进性能,并在低带宽下展现出良好的可扩展性和鲁棒性,且性能衰减极小。

链接: https://arxiv.org/abs/2605.21085
作者: Alexi Canesse,Benoît Goupil,Jesse Read,Sonia Vanier
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Communication enables coordination in multi-agent reinforcement learning (MARL), but many real-world applications, e.g., search-and-rescue with drone swarms, operate under severe bandwidth constraints. Many communication architectures still expose a coupled bottleneck in which a shared latent representation is used for both policy execution and inter-agent communication. Consequently, reducing message size directly limits the policy’s latent space, often leading to significant performance degradation. We address this with two contributions. First, we introduce \beta , a normalised per-agent bandwidth budget that unifies sparsity, rounds, and message dimension into a single comparable constraint. Second, we provide SLIM, a minimal architecture that decouples the communication pathway from the policy’s latent representation, allowing us to isolate the effect of bandwidth from the effect of policy capacity while benefiting from in-step communication. We evaluate our method on several partially-observable MARL benchmarks, where communication is essential. Our approach achieves state-of-the-art performance and exhibits scalability and robustness under limited communication, with only marginal degradation as bandwidth is reduced.

[MA-2] ProCrit: Self-Elicited Multi-Perspective Reasoning with Critic-Guided Revision for Multimodal Sarcasm Detection

【速读】:该论文旨在解决多模态讽刺检测中因讽刺机制多样性导致的分析视角不统一问题,即现有方法依赖固定预设的分析视角,难以适应不同样本所需的动态推理需求。其解决方案的关键在于提出ProCrit框架——一个基于“提议-批评”双代理机制的自激发多视角推理系统:首先通过动态角色代理滚动(dynamic-role agentic rollout)合成过程级推理标注,以生成跨视角依赖关系并支持高效自回归生成;其次引入“草稿-批评-修订”范式,由独立批评代理识别推理缺陷并提供针对性自然语言反馈以指导修正;最后采用双阶段强化学习实现提议与反馈引导修订的协同优化,并根据反馈实际效果迭代优化批评代理。该方案实现了模型对每条样本自主生成所需分析视角并逐步整合为连贯推理的能力,显著提升了多模态讽刺检测的灵活性与可靠性。

链接: https://arxiv.org/abs/2605.20867
作者: Yingjia Xu,Jiulong Wu,Bowen Zhang,Baokui Guo,Siyuan Chai,Min Cao
机构: Soochow University (苏州大学); Baidu Inc. (百度公司); Zhipu AI (智谱AI)
类目: Multiagent Systems (cs.MA); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal sarcasm detection requires reasoning over cross-modal incongruities between literal expression and intended meaning, yet the specific analytical perspectives needed vary across samples due to the diversity of sarcastic mechanisms. While recent methods make this analytical process explicit, they still rely on fixed, predefined perspectives that operate independently under hand-crafted routing rules. We argue that multimodal sarcasm detection instead calls for self-elicited multi-perspective reasoning, where a model autonomously generates the perspectives needed for each sample and progressively integrates them into a coherent analysis. To realize this goal, we propose ProCrit, a Proposal-Critic two-agent framework with a proposal agent for multi-perspective reasoning and a critic agent for external evaluation and targeted revision guidance. First, to overcome the lack of process-level supervision in existing sarcasm datasets, ProCrit synthesizes process-level reasoning annotations through a dynamic-role agentic rollout: a strong vision-language model sequentially spawns analytical roles within a shared context, and the resulting multi-role trajectories are flattened into sequences that preserve cross-perspective dependencies while enabling efficient autoregressive generation. Second, to improve reasoning reliability, ProCrit adopts a draft-critique-revise paradigm in which an independent critic identifies reasoning deficiencies and provides targeted natural-language feedback for directed revision. Finally, we develop a mutual-refinement training framework that jointly optimizes proposal drafting and feedback-guided revision via dual-stage reinforcement learning, while refining the critic agent according to the actual effectiveness of its feedback. Experiments on three widely used benchmarks demonstrate the effectiveness of ProCrit.

[MA-3] Heartbeat-Bound Hierarchical Credentials: Cryptographic Revocation for AI Agent Swarms

【速读】:该论文试图解决自主AI代理(Autonomous AI agents)在生成子代理集群时存在的安全漏洞问题:现有凭证撤销机制(如OAuth 2.0 introspection、OCSP和W3C状态列表)依赖中心化权威的网络连接,导致代理在操作员关闭后仍可能继续执行特权操作数分钟至数小时,形成“僵尸代理”(zombie agents)。解决方案的关键是提出一种名为心跳绑定分层凭证(Heartbeat-Bound Hierarchical Credentials, HBHC)的密码学协议,其核心在于将凭证有效性与父级代理的心跳存活证明(liveness proofs)绑定。验证方仅需缓存公钥和本地时钟即可强制执行凭证新鲜性,无需网络往返。当心跳停止时,所有子级凭证将在确定性时间窗口 $ W_z \le W_\max + \Delta_h + \epsilon $ 内失效,前提是时钟偏差有限且父级密钥存储于安全飞地(secure enclaves)中。实验证明,HBHC相比OAuth 2.0将僵尸窗口缩短90倍,认证延迟低至0.26毫秒,每秒可处理超18,000次验证,并在49个代理的四层层级结构中实现理论边界内的级联撤销。

链接: https://arxiv.org/abs/2605.20704
作者: Saurabh Deochake
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Autonomous AI agents that spawn sub-agent swarms create a safety gap: existing credential revocation mechanisms, OAuth~2.0 introspection, OCSP, and W3C Status Lists, require network connectivity to a central authority, leaving ``zombie agents’’ executing privileged operations for minutes to hours after operator shutdown. We present Heartbeat-Bound Hierarchical Credentials (HBHC), a cryptographic protocol that binds credential validity to periodic parent liveness proofs. Verifiers enforce freshness using only a cached public key and local clock; no network round-trip is required. When heartbeat generation ceases, all descendant credentials become unusable within a deterministically bounded window W_z \le W_\max + \Delta_h + \epsilon , conditional on bounded clock skew and parent keys held in secure enclaves. Evaluation at the protocol layer and with real LLM-backed agent swarms (GPT-4o-mini) demonstrates a 90 \times reduction in the zombie window over OAuth~2.0, 0.26~ms full authentication in Rust, 18,000+ verifications per second under concurrent HTTP load, and stable per-verification latency from 10 to 10,000 agents. Real-agent experiments show 0.71% end-to-end overhead on tool calls, zero post-revocation tool calls under prompt injection that bypasses application-layer guardrails, and cascading revocation across a 49-agent four-level hierarchy within the theoretical bound.

[MA-4] CandorMD: An AI-Assisted Audio Simulation and Feedback System for Training Clinicians for Medical Error Disclosure

【速读】:该论文试图解决临床医生在向患者及家属披露医疗错误时面临的沟通挑战问题,这些问题源于情感复杂性、培训机会有限以及现有教学工具(如静态视频)的适应性差和反馈延迟。解决方案的关键在于开发并验证CandorMD——一个AI辅助的模拟训练系统,其核心优势在于提供实时练习、即时可操作的反馈以及针对个体学习需求定制的多样化实践环境,从而提升医患沟通能力、增强自信心,并减少对披露对话的回避行为,最终改善患者护理质量与医患信任关系。

链接: https://arxiv.org/abs/2605.20701
作者: Inna Wanyin Lin,Sahand Sabour,Hong Sng,Maxine Chan,Minlie Huang,Andrew White,Tim Althoff
机构: University of Washington (华盛顿大学); Tsinghua University (清华大学)
类目: Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Clinicians are expected to disclose harmful medical errors to patients and families in line with ethical, regulatory, and patient care standards, yet these conversations remain challenging because of their emotional complexity and limited training opportunities. Most physicians still learn primarily through lectures and observation, while static video tools-though available-are underused, lack adaptability across specialties, and deliver delayed, generic feedback. These gaps restrict skill development, reduce self-efficacy, and contribute to avoidance of disclosure conversations, ultimately compromising patient care and eroding trust. To address these needs, we designed CandorMD – an AI-assisted simulation system that provides real-time practice, actionable feedback, and diverse practice environments tailored to individual learning needs. We conducted semi-structured interviews with physicians, risk managers, patient advocates, and communication experts to understand current practices, identify gaps, and collect feedback on CandorMD. Based on these insights, we present findings and design recommendations for the future of AI-supported medical communication training.

[MA-5] me-To-Reach Separation and Safety Filtering for Safe Fair and Efficient Multi-Agent Coordination

【速读】:该论文旨在解决城市空域中先进空中交通(Advanced Air Mobility, AAM)运营带来的高密度飞行器协同问题,核心挑战是在复杂拥挤环境中实现无碰撞的自主交通管理。解决方案的关键在于提出一种基于最小到达时间(Time-to-Reach, TTR)的多智能体协调框架:通过TTR统一赋予权重优先级、实现时序分离以诱导空间间隔,并引入基于哈密顿-雅可比可达性值函数的安全过滤层,在最小扰动参考引导路径的前提下确保碰撞避免。仿真结果表明,该方法在安全性、公平性和效率方面均优于传统时间最优引导和无优先级感知的安全过滤策略。

链接: https://arxiv.org/abs/2605.20625
作者: Matthew Low,Jasmine Jerry Aloor,Victoria Marie Tuck,Pierluigi Nuzzo,Jason J. Choi
机构: University of California, Berkeley (加州大学伯克利分校); Massachusetts Institute of Technology (麻省理工学院); University of Pennsylvania (宾夕法尼亚大学); University of California, Los Angeles (加州大学洛杉矶分校)
类目: ystems and Control (eess.SY); Multiagent Systems (cs.MA); Robotics (cs.RO)
备注: 9 pages, 3 figures. Extended version (including appendix) of a paper submitted to the 65th IEEE Conf. on Decision and Control (2026)

点击查看摘要

Abstract:Advanced Air Mobility (AAM) operations are expected to significantly increase aerial traffic in urban airspace, requiring autonomous traffic management systems to ensure collision-free operations in highly congested environments. In this paper, we propose a multi-agent coordination framework that uses minimum time-to-reach (TTR) as a unifying metric for priority assignment, temporal separation, and safety filtering. We focus on the problem of coordinating multiple aerial vehicles merging into an air corridor while maintaining safe separation between vehicles. Vehicles are assigned arrival-consistent priority based on TTR, and target TTR values are used to enforce temporal spacing that induces spatial separation. A priority-consistent safety filtering layer based on Hamilton-Jacobi reachability value functions ensures collision avoidance while minimally modifying the reference guidance. Simulation results in a highly congested corridor merging scenario show that the proposed method improves safety, fairness, and efficiency compared to time-optimal guidance and priority-agnostic safety filtering.

[MA-6] Intent-First Aerial V2V for Tactical Coordination and Separation: Protocol and Performance Under Density and Disturbance

【速读】:该论文旨在解决密集低空无人航空器系统(UAS)交通管理中缺乏可扩展的战术分离(tactical separation)通信机制的问题。当前依赖预飞行航路协调和事后避撞策略无法应对空中突发扰动,而传统避撞手段又过于延迟且破坏性大,难以作为常规交通管理手段。解决方案的关键在于提出并实现了一个面向全机载、侧链类(sidelink-class)、以意图优先(intent-first)为核心的车辆到车辆(V2V)战术邻域信息交换栈,通过结合更新的状态与意图广播、事件触发的消息机制(如让行、排序、释放及应急协调),实现局部协同感知、可信信息共享和容错决策。实验基于真实场锚定基础设施进行高负载压力测试,验证了该方案在减少过时信念偏差、维持协同感知能力、过滤无效战术消息、抑制错误本地推断及结构化资源共享方面的有效性,为城市空域中由扰动驱动的战术协同提供了边界可控的可扩展通信基础。

链接: https://arxiv.org/abs/2605.20595
作者: Mehrnaz Sabet
机构: 未知
类目: Robotics (cs.RO); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI)
备注: Submitted to IEEE Transactions on Intelligent Transportation Systems

点击查看摘要

Abstract:Dense low-altitude aerial operations require more than pre-flight route coordination and last-resort collision avoidance. Once aircraft are airborne, disturbances can emerge on timescales shorter than strategic reauthorization can absorb, while collision avoidance is too late and disruptive to serve as routine traffic management. Although tactical separation is recognized as the intermediate layer, realizing it at scale requires a deployable neighborhood communication mechanism that provides fresh, trusted information for local coordination. This paper presents what is, to our knowledge, the first controller-coupled characterization of an all-airborne, sidelink-class, intent-first vehicle-to-vehicle (V2V) tactical neighborhood exchange stack for dense Unmanned Aircraft System Traffic Management (UTM) operations. Unlike awareness-only broadcast, the proposed exchange combines refreshed state and intent beacons for local awareness, cooperative perception, and degraded-mode assessment with event-triggered messages for yielding, sequencing, release, and contingency coordination. We implement and evaluate this model on an all-airborne V2V stack using sidelink-class C-V2X modules with authenticated freshness checks. Evaluation uses a scenario-driven, high-volume stress campaign supported by real-time, field-anchored infrastructure. Results show that V2V reduces stale-belief divergence, preserves observability through cooperative perception, rejects invalid tactical messages, suppresses false local inference, and structures shared-resource coordination. The implemented stack provides a viable communication layer for tactical separation in lower-to-moderate regimes, but transitions toward guarded fallback as density, impairment, and complexity increase. These findings position intent-first aerial V2V as a bounded enabler for scaling tactical coordination in disturbance-driven urban airspace.

[MA-7] Multi-agent Collaboration with State Management

【速读】:该论文试图解决多智能体系统在并发编辑共享代码库时因状态不一致导致的冲突问题,这类冲突往往在后期合并阶段才被发现,造成集成失败且恢复成本高昂。解决方案的关键在于提出STORM(STate-ORiented Management),通过中介代理与共享工作区的交互来显式管理每个智能体的状态,确保其始终基于一致的代码视图进行操作,并在写入时即时检测和解决冲突,从而避免传统基于workspace隔离(如每个代理独立git worktree)方法中延迟至合并阶段才处理冲突的缺陷。实验表明,STORM在Commit0-Lite和PaperBench基准上分别比基于git-worktree的基线提升+18.7和+1.4,同时保持良好的成本效率,验证了显式状态管理相较于workspace隔离是更有效的多智能体协作基础。

链接: https://arxiv.org/abs/2605.20563
作者: Mengyang Liu,Taozhi Chen,Zhenhua Xu,Xue Jiang,Yihong Dong
机构: Shanghai Jiaotong University; Cortices AI; Emory University; Peking University
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Recent advances in multi-agent systems have shown great potential for solving complex tasks. However, when multiple agents edit a shared codebase concurrently, their changes can silently conflict and inconsistent views lead to integration failures. Existing multi-agent systems address this through workspace isolation (e.g., one git worktree per agent), but this defers conflict resolution to a post-hoc merge step where recovery is expensive. In this paper, we propose STORM, i.e., STate-ORiented Management for multi-agent collaboration. Specifically, STORM manages agent states by mediating their interactions with the shared workspace, ensuring that each agent operates on a consistent view of the codebase and that conflicting edits are detected and resolved at write time. We evaluate STORM on Commit0 and PaperBench across multiple LLMs. STORM outperforms the git-worktree-based multi-agent baseline by +18.7 on Commit0-Lite and +1.4 on PaperBench, while achieving comparable or better cost efficiency. Combined with single-agent runs, STORM reaches highest scores of 87.6 and 78.2 on the two benchmarks respectively, suggesting that explicit state management is a more effective foundation for multi-agent collaboration than workspace isolation. STORM can also be plugged into any multi-agent system seamlessly.

[MA-8] What Do Agents Communicate? Characterizing Information Exchange in Multi-Agent Systems

【速读】:该论文试图解决多智能体(Multi-Agent, MA)系统中因早期信息错误导致下游推理性能下降的误差传播问题。解决方案的关键在于对智能体间通信进行系统性分析,识别出驱动性能的核心信息要素,并提出类别感知恢复增强(Category-Aware Recovery Augmentation, CARA)技术,通过强制在通信过程中保留关键信息来提升协作质量,从而恢复高达86.2%的失败案例,凸显了信息质量在高效多智能体协作中的核心作用。

链接: https://arxiv.org/abs/2605.20548
作者: Yong Jin Chun,Iftekhar Ahmed
机构: University of California, Irvine (加州大学欧文分校)
类目: Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have enabled collaborative Multi-Agent (MA) systems, where interacting agents improve performance through diverse reasoning and iterative refinement. However, these systems remain vulnerable to error propagation, where early-stage information degrades downstream reasoning. To address this, we conduct a systematic analysis of inter-agent communication to identify which information drives MA performance. We find that the absence of reasoning and verification in inter-agent communication significantly degrades performance. Based on these insights, we propose Category-Aware Recovery Augmentation (technique), which enforces the presence of critical information during communication. recovers up to 86.2% of failed cases. Our results highlight the key role of information quality in effective MA collaboration. Our code is available at this https URL

[MA-9] Agent ic Agile-V: From Vibe Coding to Verified Engineering in Software and Hardware Development

【速读】:该论文试图解决的问题是:当前生成式 AI 编码系统(Agentic AI coding systems)虽然具备代码审查、任务规划、文件编辑、工具调用、测试运行和提交 Pull Request 等能力,但在实际工程实践中仍存在显著局限性,如仓库配置失败、依赖管理错误、权限控制问题以及硬件验证失效等,且现有证据无法支持“自主代码生成自动提升工程成果”的简单结论。解决方案的关键在于从“提示工程”转向“工程过程控制”,提出 Agentic Agile-V 框架——以 Agile-V 生命周期为骨架,结合任务级 SCOPE-V 循环(Specify, Constrain, Orchestrate, Prove, Evolve, Verify),将对话意图结构化为可验证的工程产出与验收证据。该框架通过最小输入构件分类法、对话到契约的转换门、风险自适应工作流及证据包验收模型,强化了需求明确性、约束控制、可追溯性和人工审核机制,从而在不削弱工程纪律的前提下释放 AI 的生产力潜力。

链接: https://arxiv.org/abs/2605.20456
作者: Christopher Koch
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 7 pages, 1 figure

点击查看摘要

Abstract:Agentic AI coding systems can inspect repositories, plan implementation steps, edit files, call tools, run tests, and submit pull requests. These capabilities make software and hardware development faster in some settings, but current evidence does not support the simple claim that autonomous code generation automatically improves engineering outcomes. Controlled studies report productivity gains in some enterprise tasks, slowdowns in mature open-source work, moderate but heterogeneous meta-analytic effects, and persistent failures in repository setup, dependency handling, permission gating, and hardware verification. This paper argues that the central problem is no longer prompt engineering; it is engineering process control. It synthesizes evidence from agentic software engineering, GitHub-scale adoption studies, repository-level agent configuration, productivity trials, issue-resolution benchmarks, and hardware/RTL verification research. It proposes Agentic Agile-V, a process framework that uses Agile-V as the lifecycle backbone and a task-level SCOPE-V loop - Specify, Constrain, Orchestrate, Prove, Evolve, and Verify - to convert conversational intent into structured engineering artifacts and acceptance evidence. The paper contributes: (i) a taxonomy of minimum input artifacts for agentic software, firmware, and hardware work; (ii) a conversation-to-contract gate that separates exploratory dialogue from implementation; (iii) risk-adaptive feature, bug-fix, testing, and hardware workflows; and (iv) an evidence-bundle acceptance model for agent-generated artifacts. The paper concludes that agentic AI does not eliminate engineering discipline; it increases the value of requirements, constraints, traceability, independent verification, and human approval.

[MA-10] Pramana: A Protocol-Layer Treatment of Claim Verification in Autonomous Agent Networks

【速读】:该论文试图解决的问题是:在受监管领域部署的自主代理(autonomous agents)必须为每个重要输出生成可验证的审计凭证(verification artifact),以便外部审计员能够离线复现并验证该输出的内容、来源、责任人、时间及生成方式。当前的验证实践分为两个非标准化的部分:一是基于概率的判断模式(如自一致性投票或LLM评审团),它们产生的是结论而非可审计记录;二是基于检索增强生成(RAG)、工具增强追踪或生成-验证循环等技术的记录生成方式,但这些记录通常是厂商特定的,缺乏通用性,导致外部审计无法独立重建。
解决方案的关键在于提出了一种名为Pramana的标准化“接口格式”(wire format),它将每个重要输出封装为一种带有类型标记的ClaimAttestation结构,包含四种变体(测量、推理、类比、引用),每种都配有一个verify()操作,用于对记录来源进行确定性或条件性验证(LLM支持时可回放)。该方案通过TLA+形式化建模与TLC模型检测,在三个对称约简模型中验证了38,563个可达状态下的零违反不变量,并提供了Python参考实现和MCP/A2A扩展协议以保障部署级不变性(可达性、SLA边界、离线可重验性)。其核心创新在于结合古典印度认识论(pramana)构建类型体系,并通过形式化验证确保整个生命周期的可信性。

链接: https://arxiv.org/abs/2605.20312
作者: Ravi Kiran Kadaboina
机构: 未知
类目: Cryptography and Security (cs.CR); Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
备注: 23 pages, 4 figures, 5 tables, 42 references

点击查看摘要

Abstract:Autonomous agents deployed in regulated domains must produce a verification artifact per consequential output: a record an auditor can re-execute offline, capturing what was claimed, against what source, by whom, when, and how. Production verification today splits into two unstandardized halves. Probabilistic verdict patterns (self-consistency voting, reviewer LLM ensembles) produce judgments, not artifacts. Artifact-producing patterns (RAG, tool-augmented traces, generator-verifier loops) produce vendor-specific records no external auditor can reconstruct without bespoke integration. Pramana defines the missing wire format. Every consequential agent output is wrapped in a typed ClaimAttestation with one of four variants (measurement, inference, analogy, citation), each paired with a verify() operation against the recorded source. verify() is deterministic for MeasurementClaim and CitationClaim. For InferenceClaim and AnalogyClaim, determinism is conditional on the oracle (audit-replayable when LLM-backed). The four-way typology derives from classical Indian epistemology (pramana, valid means of knowledge). The lifecycle is specified in TLA+ and exhaustively verified under TLC across three symmetry-reduced models: 38,563 distinct reachable states, zero invariant violations. The Python reference implementation passes 84 tests. An A2A and MCP wire-extension manifest layers three deployment-grade invariants: reachability, SLA bound, and offline re-verifiability. An exploratory pilot (n=100, 2,275 reviewer calls) probes LLM-as-judge in code generation. The strongest observation is a 40-percentage-point raw FPR delta across corpora, consistent with reference-solution quality contributing significantly. The pilot does not validate Pramana on its own; the structural argument and formal verification do that. Comments: 23 pages, 4 figures, 5 tables, 42 references Subjects: Cryptography and Security (cs.CR); Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA) ACMclasses: I.2.11; D.2.4; F.3.1; K.4.1 Cite as: arXiv:2605.20312 [cs.CR] (or arXiv:2605.20312v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2605.20312 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.5281/zenodo.20283646 Focus to learn more DOI(s) linking to related resources Submission history From: Ravi Kiran Kadaboina [view email] [v1] Tue, 19 May 2026 17:00:33 UTC (148 KB)

[MA-11] Governance by Design: Architecting Agent ic AI for Organizational Learning and Scalable Autonomy

【速读】:该论文试图解决的问题是:随着生成式 AI(Generative AI)系统从实验原型向企业级部署过渡,如何在实现规模化自主性的同时保障问责制、安全性、成本控制和责任归属等治理要求。解决方案的关键在于通过具体的架构设计与工作流程安排来实施治理机制,包括明确系统可执行的操作范围、可用的工具与数据权限、记忆管理方式以及性能优化的迭代更新策略。作者基于一家大型IT服务公司在2025年开发并分阶段部署集成企业工具的代理型AI系统的深度案例研究,提炼出七个核心经验教训,阐明了在实际运营和扩展过程中将治理内嵌于代理型AI系统的设计与实施中的方法。

链接: https://arxiv.org/abs/2605.20210
作者: Nelly Dux,Cristina Alaimo,Philippe Roussiere,Abhishek Kumar Mishra
机构: ESSEC Business School (ESSEC商学院); Accenture Research (埃森哲研究)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 17 pages, 1 figure, 3 tables

点击查看摘要

Abstract:Agentic AI systems - systems that can pursue goals through multi-step planning and tool-mediated action with limited direct supervision - are moving from experimental prototypes to enterprise deployments. This transition introduces tensions in implementation, scaling, and governance: organizations seek scalable autonomy for knowledge and coordination work, yet must preserve accountability, safety, cost control, and responsibility as systems initiate actions, access enterprise data, and evolve through iterative updates. Building on an in-depth qualitative case of a large IT services company’s 2025 development and staged rollout of an agentic system integrated with enterprise tools; we show that governance is implemented through concrete architectural and working arrangements that determine what the system is allowed to do, which tools and data it can use, how memory is handled, and how performance improvements are introduced over time. We then distill seven lessons that explain how to build effective governance into agentic AI during operationalization and scaling.

自然语言处理

[NLP-0] AiraXiv: An AI-Driven Open-Access Platform for Human and AI Scientists

【速读】: 该论文试图解决传统学术出版系统在人工智能(AI)快速发展背景下所面临的可扩展性问题,包括投稿量激增、审稿人工作负担加重以及会议和期刊规模受限等挑战。其解决方案的关键在于提出一种面向AI时代的新型出版范式——AiraXiv,该平台基于开放预印本、AI增强的分析与评审机制,以及读者反馈驱动的持续迭代,支持人类科学家通过交互式界面参与,并通过模型上下文协议(Model Context Protocol, MCP)实现与AI科学家的协作。通过实际部署(如作为IC AIS 2025的投稿平台),验证了AiraXiv在速度、包容性和可扩展性方面的潜力,为AI时代的研究基础设施提供了可行路径。

链接: https://arxiv.org/abs/2605.21481
作者: Junshu Pan,Panzhong Lu,Yixuan Weng,Qiyao Sun,Fang Guo,Zijie Yang,Qiji Zhou,Yue Zhang
机构: Westlake University; Zhejiang University; Shanghai Innovation Institution; Zhongguancun Academy
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advances in artificial intelligence (AI) have accelerated the growth of both human-authored and AI-generated research outputs, placing increasing strain on traditional academic publishing systems and challenging the scalability of conference- and journal-centered paradigms amid rising submission volumes, reviewer workload, and venue size. To address these challenges, we explore an AI-era publishing paradigm in which both human and AI scientists participate as authors and readers, and papers evolve through continuous, feedback-driven iteration. We propose AiraXiv, an AI-driven open-access platform built on open preprints, AI-augmented analysis and review, and reader feedback. AiraXiv supports human scientists through an interactive UI and AI scientists through Model Context Protocol (MCP)-based interactions. We validate AiraXiv through real-world deployments, including serving as the submission platform for ICAIS 2025, demonstrating its potential as a fast, inclusive, and scalable research infrastructure for the AI era. AiraXiv is publicly available at this https URL.

[NLP-1] You Only Need Minimal RLVR Training: Extrapolating LLM s via Rank-1 Trajectories

【速读】: 该论文试图解决强化学习中可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)训练过程中参数轨迹的高计算成本问题,特别是如何在不进行完整训练的情况下高效预测高质量模型检查点。解决方案的关键在于发现RLVR的参数更新轨迹具有极低秩特性(主要由一个秩-1子空间捕获),并基于此提出一种轻量级、无需额外训练的外推方法RELEX(REinforcement Learning EXtrapolation)。RELEX通过短时间窗口内的参数变化估计该秩-1子空间,并利用线性回归外推未来检查点,显著减少训练步数(仅需RLVR的15%),同时保持或超越原方法在域内和域外基准上的性能;其成功归因于该方法对优化噪声的“去噪”效应,即通过投影到低秩子空间抑制随机梯度噪声,从而实现稳定且持续的性能提升。

链接: https://arxiv.org/abs/2605.21468
作者: Zhepei Wei,Xinyu Zhu,Wei-Lin Chen,Chengsong Huang,Jiaxin Huang,Yu Meng
机构: University of Virginia (弗吉尼亚大学); Washington University in St. Louis (圣路易斯华盛顿大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: preprint. Code: this https URL

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving reasoning in large language models (LLMs), yet the underlying geometry of the resulting parameter trajectories remains underexplored. In this work, we demonstrate that RLVR weight trajectories are extremely low-rank and highly predictable. Specifically, we find that the majority of downstream performance gains are captured by a rank-1 approximation of the parameter deltas, where the magnitude of this projection evolves near-linearly with training steps. Motivated by this, we propose a simple and compute-efficient method RELEX (REinforcement Learning EXtrapolation), which estimates the rank-1 subspace from a short observation window and extrapolates future checkpoints via linear regression, with no learned model required. Across three models (i.e., Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base), RELEX produces checkpoints that match or exceed RLVR performance on both in-domain and out-of-domain benchmarks, requiring as few as 15% steps of full RLVR training. Remarkably, RELEX is able to extrapolate far beyond the observation window at no training cost, predicting checkpoints up to 10-20 \times beyond the observed prefix with continued improvement (e.g., observe only the first 50 steps and extrapolate to 1000 steps). Our ablation analysis confirms the minimalist sufficiency of RELEX: neither increasing the subspace rank nor employing non-linear modeling yields further gains in extrapolation. Finally, we show that RELEX’s success stems from a “denoising” effect: by projecting updates onto the rank-1 subspace, the model discards stochastic optimization noise that would otherwise degrade performance during extrapolation. Our code is available at this https URL.

[NLP-2] DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

【速读】: 该论文试图解决的问题是:在基于可验证奖励的强化学习(Reinforcement Learning from Verifiable Rewards, RLVR)中,响应级别的奖励如何转化为词元(token)级别的概率变化这一机制尚不清晰,导致模型难以有效分配词元级的信用(credit),尤其是在存在高频共享模式(如格式化标记)干扰的情况下,稀疏但具有判别性的方向被掩盖。解决方案的关键在于提出一种名为 DelTA 的判别性词元信用分配方法,其核心思想是从判别器视角重新理解RLVR更新方向——即策略梯度更新本质上是对词元梯度向量的线性判别器;通过估计词元系数来放大侧向特异性(side-specific)的词元梯度方向并抑制共享或弱判别性方向,从而重构更对比鲜明的正负样本中心(centroids),进而优化RLVR更新方向。实验表明,DelTA在七个数学基准上显著优于现有基线方法,并展现出良好的跨任务和跨模型泛化能力。

链接: https://arxiv.org/abs/2605.21467
作者: Kaiyi Zhang,Wei Wu,Yankai Lin
机构: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院); Ant International (蚂蚁集团)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement learning from verifiable rewards (RLVR) has emerged as a central technique for improving the reasoning capabilities of large language models. Despite its effectiveness, how response-level rewards translate into token-level probability changes remains poorly understood. We introduce a discriminator view of RLVR updates, showing that the policy-gradient update direction implicitly acts as a linear discriminator over token-gradient vectors and thereby determines which token probabilities are increased or decreased during learning. Under standard sequence-level RLVR, this discriminator is constructed from positive- and negative-side centroids formed by advantage-weighted averaging of token-gradient vectors. However, such centroid construction can be dominated by shared high-frequency patterns, such as formatting tokens, diluting sparse yet discriminative directions that better distinguish high-reward responses from low-reward ones. To address this limitation, we propose \textbfDelTA , a discriminative token credit assignment method that estimates token coefficients to amplify side-specific token-gradient directions and downweight shared or weakly discriminative ones. These coefficients reweight a self-normalized RLVR surrogate, making the effective side-wise centroids more contrastive and thereby reshaping the RLVR update direction. On seven mathematical benchmarks, DelTA outperforms the strongest same-scale baselines by 3.26 and 2.62 average points on Qwen3-8B-Base and Qwen3-14B-Base, respectively. Additional results on code generation, a different backbone, and out-of-domain evaluations further demonstrate the generalization ability of DelTA.

[NLP-3] Leverag ing LLM s for Grammar Adaptation: A Study on Metamodel-Grammar Co-Evolution

【速读】: 该论文试图解决模型驱动工程(Model-Driven Engineering, MDE)中元模型演化导致的语法一致性维护问题,即在元模型更新后,如何自动适应对应的文法(grammar),以避免传统依赖人工的手动修改过程。现有基于规则的方法虽能实现部分自动化,但在处理复杂语法场景时存在局限性。论文提出的解决方案核心在于利用大语言模型(Large Language Models, LLMs)从历史版本中学习语法适应模式,并将其应用于新版本文法的自动调整。实验表明,在小到中等规模的领域特定语言(DSLs)上,LLM方法在规则级适应一致性、输出相似性和元模型符合性三个维度均优于传统规则方法,尤其在跨演化步骤的长期适用性方面表现出显著优势;然而,对于大规模文法(如EAST-ADL,含297条规则),LLM的适应一致性明显下降,揭示了其在复杂性和规模上的当前局限。

链接: https://arxiv.org/abs/2605.21465
作者: Weixing Zhang,Bowen Jiang,Rahul Sharma,Regina Hebig,Daniel Strüber
机构: Karlsruhe Institute of Technology, Germany(卡尔斯鲁厄理工学院); Universität Rostock, Germany(罗斯托克大学); Chalmers University of Technology and University of Gothenburg, Sweden(查尔姆斯理工大学和哥德堡大学); Radboud University, The Netherlands(奈梅亨大学)
类目: Computation and Language (cs.CL); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:In model-driven engineering, metamodel evolution leads to the need to adapt corresponding grammars to maintain consistency, which typically requires tedious manual work. Existing rule-based methods can achieve partial automation but have limitations when handling complex grammar scenarios. This paper proposes a Large Language Model-based approach that automatically applies adaptations to new grammars after evolution by learning grammar adaptations from previous versions. We evaluated this approach on six real-world Xtext domain-specific languages, using four DSLs as a training set to develop prompting strategies, two DSLs as a test set for validation, and conducting a longitudinal case study on QVTo. The evaluation used three Large Language Models (Claude Sonnet 4.5, ChatGPT 5.1, Gemini 3) and measured grammar adaptation quality from three dimensions: grammar rule-level adaptation consistency, output similarity, and metamodel conformance. Results show that on the test set, all three LLMs achieved 100% adaptation consistency and output similarity, while the rule-based approach achieved only 84.21% on DOT and 62.50% on Xcore. In the QVTo longitudinal study, the LLM-based approach successfully reused learned adaptations across all three evolution steps without manual grammar editing, while the rule-based approach required manual adjustments in two of three transitions. However, on large-scale grammars (EAST-ADL, 297 rules), LLMs’ adaptation consistency was far below 90%. This study demonstrates the advantages of LLM-based approaches in handling complex grammar scenarios, while revealing their limitations in large-scale grammar adaptation.

[NLP-4] Mem-π: Adaptive Memory through Learning When and What to Generate

【速读】: 该论文试图解决大语言模型(Large Language Model, LLM)代理在复杂任务中因依赖静态检索式记忆而导致的上下文不匹配问题。现有方法通常从情景记忆库或技能库中基于相似性检索固定条目,难以适应动态任务需求。其解决方案的关键在于提出 Mem-π 框架,通过一个独立于下游代理的专用语言或视觉-语言模型,在当前上下文中按需生成情境相关的指导信息。该模型基于当前代理状态联合决策是否生成指导以及生成何种指导,并采用解耦的决策-内容强化学习(decision-content decoupled reinforcement learning, RL)目标进行训练,使模型能够在无益时选择不生成,否则输出简洁且有用的指导。实验表明,Mem-π 在多种代理基准测试(包括网页导航、终端工具使用和文本驱动的具身交互)中均显著优于基于检索的方法及先前优化的记忆基线,在网页导航任务上相对提升超过30%。

链接: https://arxiv.org/abs/2605.21463
作者: Xiaoqiang Wang,Chao Wang,Hadi Nekoei,Christopher Pal,Alexandre Lacoste,Spandana Gella,Bang Liu,Perouz Taslakian
机构: University of Montreal (蒙特利尔大学); ServiceNow (服务-now)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Work in progress

点击查看摘要

Abstract:We present Mem- \pi , a framework for adaptive memory in large language model (LLM) agents, where useful guidance is generated on demand rather than retrieved from external memory stores. Existing memory-augmented agents typically rely on similarity-based retrieval from episodic memory banks or skill libraries, returning static entries that often misalign with the current context. In contrast, Mem- \pi uses a dedicated language or vision-language model with its own parameters, separate from the downstream agent, to generate context-specific guidance for complex tasks. Conditioned on the current agent context, the model jointly decides when to produce guidance and what guidance to produce. We train it with a decision-content decoupled reinforcement learning (RL) objective, enabling it to abstain when generation would not help and otherwise produce concise, useful guidance. Across diverse agentic benchmarks spanning web navigation, terminal-based tool use, and text-based embodied interaction, Mem- \pi consistently outperforms retrieval-based and prior RL-optimized memory baselines, achieving over 30% relative improvement on web navigation tasks.

[NLP-5] Quantifying the cross-linguistic effects of syncretism on agreement attraction

【速读】: 该论文试图解决的问题是:为什么在某些语言(如英语、德语、俄语)中,形态同形现象(morphological syncretism)会加剧动词与插入名词之间的依从吸引错误(agreement attraction errors),而在其他语言(如土耳其语、亚美尼亚语)中则不会,这一跨语言差异缺乏一个系统性的解释。解决方案的关键在于利用大型语言模型(LLM)的两个计算指标——预期意外度(surprisal)和注意熵(attention entropy)作为语言处理的代理指标,来量化不同语言中语法结构的处理难度,并验证这些指标能否复现已知的行为实验结果。研究发现,LLM指标能成功再现英语和德语中同形现象对吸引效应的增强作用,与土耳其语中无显著调节效应一致,且部分捕捉到俄语中的复杂模式,从而为理解同形现象如何跨语言差异化地影响句法依从加工提供了新的计算神经语言学证据。

链接: https://arxiv.org/abs/2605.21403
作者: Utku Turk,Eva Neu
机构: 未知
类目: Computation and Language (cs.CL)
备注: SCiL Conference Paper

点击查看摘要

Abstract:Agreement attraction errors, in which a verb erroneously agrees with an intervening noun rather than its grammatical head, are amplified by morphological syncretism in some languages (English, German, Russian) but not others (Turkish, Armenian), a cross-linguistic pattern without a principled account. We use surprisal and attention entropy from large language models as processing proxies to investigate this variation across four languages. LLM-derived measures replicate behavioral findings in English and German (syncretism modulates attraction), align with Turkish null results (no modulation), and partially capture Russian patterns. We discuss further directions for better understanding why syncretism affects agreement attraction differently across languages.

[NLP-6] Post-Hoc Understanding of Metaphor Processing in Decoder-Only Language Models via Conditional Scale Entropy ICPR

【速读】: 该论文试图解决的问题是:在解码器-only 架构的Transformer模型中,如何理解其在不同层位置上对隐喻性词汇(metaphorical tokens)进行语境重构的机制,即模型如何通过多尺度计算组织隐喻意义的重新解释。解决方案的关键在于提出了一种名为条件尺度熵(Conditional Scale Entropy, CSE)的新度量方法,该方法基于小波变换,用于量化每一层位置处Transformer计算在频率尺度上的分布广度。两个定理证明CSE对更新幅度具有不变性,从而将结构模式与强度分离。实验表明,隐喻性词汇在连续层位置上显著产生比字面词汇更宽的频谱广度(spectral breadth),且这一现象在从124M到20B参数规模的多种模型(如GPT-2、LLaMA-2 7B、GPT-oss 20B)中一致存在,并在聚类置换校正后依然稳健,说明多尺度协调(multi-scale coordination)是解码器-only架构中处理隐喻语言的一致性特征,同时确立了CSE作为刻画Transformer跨深度结构的原理性工具。

链接: https://arxiv.org/abs/2605.21391
作者: Lawhori Chakrabarti,Jennifer Johnson-Leung,Bert Baumgaertner,Aleksandar Vakanski,Min Xian,Boyu Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注: 18 pages, 3 figures, submitted to ICPR workshop

点击查看摘要

Abstract:Metaphor requires a language model to resolve a token whose contextual meaning diverges from its basic literal sense. Understanding how transformer models organize this reinterpretation across depth remains an open problem in mechanistic interpretability. We introduce conditional scale entropy (CSE), a wavelet-derived measure of how broadly transformer computation engages across frequency scales at each layer position. Two theorems establish that CSE is invariant to update magnitude, isolating the structural pattern of updates from their intensity. Using CSE, we find that metaphorical tokens produce significantly higher spectral breadth than literal tokens at contiguous layer positions on every decoder-only architecture tested, from 124M to 20B parameters (GPT-2 family, LLaMA-2 7B, GPT-oss 20B). The effect survives cluster-based permutation correction, recurs in the early-to-mid relative depth range across models, and converges with an independent analysis of 200 naturalistic VUA pairs. Specificity controls further show that the effect is not explained by semantic complexity or by matched propositional content. These results identify multi-scale coordination as a consistent signature of metaphorical language processing in the decoder-only architectures examined, and establish CSE as a principled tool for characterizing cross-depth structure in transformers.

[NLP-7] SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

【速读】: 该论文试图解决的问题是:在长周期编程任务中,编码代理(coding agents)生成的代码量远超人类开发者可审查范围,导致监督机制退化为单一表面——自动化测试套件,从而引发“奖励黑客”(reward hacking)现象,即代理优化通过测试用例但偏离用户真实目标。解决方案的关键在于提出一种系统性的评估方法:将软件工程任务分解为三部分——(i)自然语言规格说明、(ii)用于隔离验证功能的可见测试用例、以及(iii)模拟真实场景的保留测试用例;通过比较代理在可见测试集和保留测试集上的通过率差异来量化奖励黑客程度。基于此,作者构建了SpecBench基准,包含30个系统级编程任务,从短周期(如JSON解析器)到超长周期(如从零构建操作系统内核)。大规模实验表明,所有前沿模型均能饱和通过可见测试,但奖励黑客持续存在,且小模型差距更大,同时差距随任务长度急剧扩大(每代码规模增加十倍,通过率差距上升28个百分点),揭示出从细微功能隔离到故意利用漏洞(如记忆测试输入的2900行哈希表“编译器”)等多样失败模式。该方法为衡量编码代理是否真正构建可用系统而非仅“破解测试套件”提供了原则性测试平台。

链接: https://arxiv.org/abs/2605.21384
作者: Bingchen Zhao,Dhruv Srikanth,Yuxiang Wu,Zhengyao Jiang
机构: Weco AI
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As long-horizon coding agents produce more code than any developer can review, oversight collapses onto a single surface: the automated test suite. Reward hacking naturally arises in this setup, as the agent optimizes for passing tests while deviating from the users true goal. We study this reward hacking phenomenon by decompose software engineering tasks into three parts: (i) a natural language description of the specification (ii) visible validation tests that exercise specified features in isolation, and (iii) held-out tests that compose those same features to simulate real-world usage. Based on the specification and the visible validation test suites, a genuine agent would be able to generate a solution that can also pass all of the held-out tests. Therefore we use the gap in pass rates on these two suites to quantify reward hacking. Based on this methodology, we introduce SpecBench, a benchmark comprising 30 systems-level programming tasks ranging from short horizon tasks like building a JSON parser to ultra long horizon tasks like building an entire OS kernel from scratch. Large-scale experiments reveal a consistent pattern: while every frontier agent saturates the visible suite, reward hacking persists, with smaller models exhibiting larger gaps on holdout suites. The gap also scales sharply with task length: it grows by 28 percentage points for every tenfold increase in code size. Failures range from subtle feature isolation to deliberate exploits, including a 2,900-line hash-table “compiler” that memorizes test inputs. SpecBench offers a principled testbed for measuring whether coding agents build genuine working systems or merely game the test suites developers hand them.

[NLP-8] Findings of the Fifth Shared Task on Multilingual Coreference Resolution: Expanding Datasets for Long-Range Entities

【速读】: 该论文旨在解决多语言共指消解(Multilingual Coreference Resolution)中的长距离实体识别问题,即核心指代链跨越大量词句的复杂情况。其解决方案的关键在于:一是通过引入包含27个语料库、覆盖19种语言的统一标注数据集CorefUD v1.4,显著扩展了任务的语言覆盖范围;二是明确将长距离实体作为重点评估目标,推动系统在跨句、跨段落等复杂场景下的共指判断能力。此外,该版本首次纳入四种基于大语言模型(LLM)的方法(包括三种微调模型和一种少样本方法),验证了LLM在该任务上的潜力,为未来挑战传统方法提供了重要方向。

链接: https://arxiv.org/abs/2605.21369
作者: Michal Novák,Miloslav Konopík,Anna Nedoluzhko,Martin Popel,Ondřej Pražák,Jakub Sido,Milan Straka,Zdeněk Žabokrtský,Daniel Zeman
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to CODI-CRAC 2026

点击查看摘要

Abstract:This paper describes the fifth edition of the Shared Task on Multilingual Coreference Resolution, held in conjunction with the CODI-CRAC 2026 workshop. Building on previous iterations, the task required participants to develop systems capable of mention identification and identity-based coreference clustering. The 2026 edition specifically emphasizes long-range entities, defined as coreferential chains spanning significant distances, across many words and sentences. The task expanded its linguistic scope by incorporating five new datasets and two additional languages. These additions leverage version 1.4 of CorefUD, a harmonized multilingual collection comprising 27 datasets in 19 languages. In total, ten systems participated, including four LLM-based approaches (three fine-tuned models and one few-shot approach). While traditional systems still maintained their lead, LLMs demonstrated significant potential, suggesting they may soon challenge established approaches in future editions. Comments: Accepted to CODI-CRAC 2026 Subjects: Computation and Language (cs.CL) Cite as: arXiv:2605.21369 [cs.CL] (or arXiv:2605.21369v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2605.21369 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Milan Straka [view email] [v1] Wed, 20 May 2026 16:35:09 UTC (227 KB)

[NLP-9] “I didnt Make the Micro Decisions”: Measuring Inducing and Exposing Goal-Level AI Contributions in Collaboration

【速读】: 该论文试图解决的问题是:在人类与人工智能(AI)协作过程中,如何准确衡量和归因双方在目标形成、细化和扩展中的贡献,尤其是在现有方法仅关注最终成果而忽视目标动态演化过程的情况下。解决方案的关键在于提出一种名为 CoTrace 的目标层级归因框架,该框架将显式目标分解为可验证的需求,并追踪对话轮次中模型的直接贡献与间接影响。通过分析638个真实协作日志和受控模拟实验,研究发现模型虽仅占目标塑造贡献的11–26%,但在引入具体低层级需求及产生多种间接影响方面作用显著;此外,用户研究进一步表明,展示目标层级分析能显著改变用户对自身与AI贡献的认知,揭示出用户在评估AI辅助工作时存在系统性误判。

链接: https://arxiv.org/abs/2605.21363
作者: Eunsu Kim,Jessica R. Mindel,Kyungjin Kim,Sherry Tongshuang Wu
机构: KAIST; Carnegie Mellon University; Seoul National University
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As large language models (LLMs) increasingly shape how users form, refine, and extend their goals, attributing contributions in human-AI collaboration becomes critical for users calibrating their own reliance and for evaluators assessing AI-assisted work. Yet existing methods focus on final artifacts, missing the process through which goals themselves are jointly shaped. We introduce a goal-level attribution framework, CoTrace, that decomposes explicit goals into verifiable requirements and traces both direct contributions and indirect influences across dialogue turns. Applying CoTrace to 638 real-world collaboration logs, we find that while models account for only 11-26% of goal-shaping contribution, they contribute substantially more on introducing lower-level concrete requirements, and make various kinds of indirect contributions. Through controlled simulations, we show that interaction design choices significantly affect model goal-shaping behavior. In a user study, exposing participants to goal-level analyses shifts their perceived contributions by nearly 2 points on a 5-point scale, revealing systematic miscalibration in how users understand their own AI-assisted work.

[NLP-10] LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models

【速读】: 该论文试图解决的问题是:当前针对对齐后大语言模型(LLM)的越狱攻击(Jailbreak attack)方法虽然日益有效,但大多局限于单一攻击家族(如一种精炼循环、一种树搜索、一种变异空间或一种策略库),且没有一种方法在所有目标模型和危害类别中始终最优,表明不同攻击策略具有互补优势。解决方案的关键在于提出 LASH(LLM Adaptive Semantic Hybridization),这是一个黑盒框架,将多个基础攻击生成的输出作为可复用的种子提示(seed prompt),并基于目标请求自适应地组合这些种子提示。LASH 通过搜索种子子集及软最大化混合权重,利用无导数的遗传优化器结合两阶段评分函数(关键词拒绝检测 + LLM 判官评分)更新权重,从而动态合成最优候选提示。实验表明,LASH 在 JailbreakBench 数据集上平均攻击成功率高达 84.5%(关键词评估)和 74.5%(两阶段评估),显著优于五个最先进基线方法,且仅需平均 30 次目标查询,同时在多种防御机制下仍具竞争力,验证了跨异构越狱策略的自适应组合是一种有前景的黑盒红队测试方向。

链接: https://arxiv.org/abs/2605.21362
作者: Abdullah Al Nomaan Nafi,Fnu Suya,Swarup Bhunia,Prabuddha Chakraborty
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Jailbreak attacks expose a persistent gap between the intended safety behavior of aligned large language models and their behavior under adversarial prompting. Existing automated methods are increasingly effective but each commits to a single attack family (e.g., one refinement loop, one tree search, one mutation space, or one strategy library) and no single family dominates: the best-performing method shifts across target models and harm categories, suggesting complementary strengths that per-prompt composition could exploit. We introduce LASH (LLM Adaptive Semantic Hybridization), a black-box framework that treats outputs from multiple base attacks as reusable seed prompts and adaptively composes them for each target request. Given a seed pool, LASH searches over seed subsets and softmax-normalized mixture weights; a composition module synthesizes a single candidate prompt, and a derivative-free genetic optimizer updates the weights using black-box target feedback and a two-stage fitness function combining keyword-based refusal detection with LLM-judge scoring. On JailbreakBench, which contains 100 harmful prompts across 10 categories, we evaluate LASH on six common target models. LASH achieves an average attack success rate of 84.5% under keyword-based evaluation and 74.5% under two-stage evaluation, where responses are first filtered for refusals and then scored by an LLM judge for whether they substantively fulfill the original harmful request. LASH outperforms five state-of-the-art baselines on both metrics with only 30 mean target queries. LASH also remains competitive under three defense mechanisms and induces more success-like internal representations. These results suggest that adaptive composition across heterogeneous jailbreak strategies is a promising direction for black-box red-teaming.

[NLP-11] xt Analytics Evaluation Framework: A Case Study on LLM s and Social Media

【速读】: 该论文试图解决的问题是:大型语言模型(LLM)在处理长序列非结构化文档(如社交媒体帖子)时,其语义理解与推理能力在实际数据分析场景中的表现尚不明确,尤其在面对大规模文本集合时是否存在性能瓶颈。解决方案的关键在于提出一个基于问题的评估框架,包含470个人工精心设计的问题,用于系统性评测LLM在聚合文本数据上的语义理解与推理能力,并在多个Twitter数据集上验证该框架的有效性。实验结果揭示了输入规模、任务复杂度以及模型架构对性能的显著影响,特别是当输入超过500条实例时,开放权重模型在数值计算类任务上出现明显性能下降,暴露出当前LLM在大规模定量分析中的关键架构局限性。

链接: https://arxiv.org/abs/2605.21338
作者: Yuefeng Shi,Nedjma Ousidhoum,Jose Camacho-Collados
机构: Cardiff University (卡迪夫大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:LLMs have demonstrated exceptional proficiency in a wide range of NLP tasks. However, a notable gap remains in practical data analysis scenarios, particularly when LLMs are required to process long sequences of unstructured documents, such as news feeds or, as specifically addressed in this paper, social media posts. To empirically assess the effectiveness of LLMs in this setting, we introduce a question-based evaluation framework comprising 470 manually curated questions designed to evaluate LLMs’ semantic understanding and reasoning abilities over aggregated text data. We apply our benchmark on diverse Twitter datasets covering various NLP tasks, including sentiment analysis, hate speech detection, and emotion recognition. Our results reveal that the performance depends heavily on input scale and the complexity of the data sources, declining noticeably in multi-label or target-dependent scenarios. In addition, as task complexity increases, performance drops progressively from basic semantic existence identification to more demanding operations such as comparison, counting, and calculation. Furthermore, as the input size grows beyond 500 instances, we identify a common limitation across LLMs, particularly Open-weights models: performance degrades substantially, especially on numerical tasks. These findings highlight critical architectural bottlenecks in current LLMs for performing rigorous quantitative analysis over large text collections.

[NLP-12] SymbolicLight V1: Spike-Gated Dual-Path Language Modeling with High Activation Sparsity and Sub-Billion-Scale Pre-Training Evidence

【速读】: 该论文旨在解决脉冲语言模型(spiking language models)在训练过程中难以同时实现Transformer类的语言质量、稳定多领域预训练以及高激活稀疏性(activation sparsity)的问题。其核心解决方案是提出SymbolicLight V1,一种基于符号化脉冲门控的双路径语言模型,通过将二进制漏电积分发放(Leaky Integrate-and-Fire, LIF)脉冲动力学与连续残差流(residual stream)相结合,设计了Dual-Path SparseTCAM模块:该模块包含一个指数衰减聚合路径用于长程记忆建模,以及一个脉冲门控局部注意力路径用于短程精确建模;此外还引入动态上下文条件解码头和双语分词器以提升性能。关键创新在于利用脉冲门控机制实现稀疏性的同时保留时间积分能力,实验证明脉冲动力学中的时间整合作用比单纯稀疏性对模型性能影响更大,且在匹配训练预算下,该架构可在89%的每元素激活稀疏度下达到接近GPT-2 201M的困惑度(PPL),显著优于GPT-2 124M。

链接: https://arxiv.org/abs/2605.21333
作者: Ting Liu
机构: SymbolicLight Research (SymbolicLight 研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 35 pages, 5 figures, 25 tables; public code and model artifacts linked in manuscript

点击查看摘要

Abstract:Natively trained spiking language models struggle to combine Transformer-like language quality, stable multi-domain pre-training, and high activation sparsity. We present SymbolicLight V1, a spike-gated dual-path language model that combines binary Leaky Integrate-and-Fire spike dynamics with a continuous residual stream. Its Dual-Path SparseTCAM module replaces dense self-attention with an exponential-decay aggregation path for long-range memory and a spike-gated local attention path for short-range precision, complemented by a dynamic context-conditioned decoding head and a bilingual tokenizer. A 194M-parameter SymbolicLight V1 model trained from scratch on a 3B-token Chinese-English corpus reaches held-out validation PPL 8.88-8.93 across four independent runs at 89% per-element activation sparsity. It trails GPT-2 201M by 7.7% in PPL while surpassing GPT-2 124M under the reported comparison. Component ablations at matched 0.5B-token training budgets show that the spike-gated local attention path is the largest contributor, and that replacing LIF dynamics with a deterministic top-k mask at matched sparsity causes a larger degradation, indicating that temporal integration rather than sparsity alone drives performance. We also report a 0.8B-parameter scale-up run trained on 48.8B tokens as evidence of optimization and sparsity preservation, not as a primary quality comparison. Current dense-hardware inference is slower than GPT-2, so neuromorphic deployment is presented as a future sparsity-driven opportunity rather than an achieved hardware speedup. Comments: 35 pages, 5 figures, 25 tables; public code and model artifacts linked in manuscript Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.21333 [cs.CL] (or arXiv:2605.21333v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2605.21333 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[NLP-13] xtReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization

【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在提示(prompt)优化过程中出现的分布外泛化能力下降问题,即“提示分布过拟合”(prompt distributional overfitting)。其核心问题是:现有迭代式提示优化方法会导致提示文本变长、引入过于特定于训练样本的规则,从而丧失对未见数据分布的适应能力。解决方案的关键在于提出一种名为TextReg的正则化框架,通过在离散文本空间中实现软惩罚目标来控制提示表示的效率,具体包括三个组成部分:双证据梯度净化(Dual-Evidence Gradient Purification)、语义编辑正则化(Semantic Edit Regularization)和正则化引导的提示更新(Regularization-Guided Prompt Update)。该方法从容量成本(capacity cost)和范围狭窄性(scope narrowness)两个维度量化提示效率,有效缓解了提示优化过程中的耦合增长问题,显著提升了模型在多个推理基准上的分布外(OOD)泛化性能,相较TextGrad和REVOLVE分别提升最高达+11.8%和+16.5%。

链接: https://arxiv.org/abs/2605.21318
作者: Lucheng Fu,Ye Yu,Yiyang Wang,Yiqiao Jin,Haibo Jin,B. Aditya Prakash,Haohan Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Code: this https URL

点击查看摘要

Abstract:Large language models (LLMs) are highly sensitive to the prompts used to specify task objectives and behavioral constraints. Many recent prompt optimization methods iteratively rewrite prompts using LLM-generated feedback, but the resulting prompts often become longer, accumulate narrow sample-specific rules, and generalize poorly beyond the training distribution. We study this failure mode as prompt distributional overfitting and argue that it reflects a lack of representation control in discrete text-space optimization. We formalize this view through representational inefficiency, a dual-factor measure that decomposes prompt inefficiency into capacity cost and scope narrowness, attributing distributional prompt overfitting to their coupled growth during optimization. We propose TextReg, a regularization framework that realizes a soft-penalty objective through regularized textual gradients, combining Dual-Evidence Gradient Purification, Semantic Edit Regularization, and Regularization-Guided Prompt Update. Across multiple reasoning benchmarks, TextReg substantially improves out-of-distribution (OOD) generalization, with accuracy gains of up to +11.8% over TextGrad and +16.5% over REVOLVE.

[NLP-14] racing the ongoing emergence of human-like reasoning in Large Language Models

【速读】: 该论文试图解决的问题是:大型语言模型(LLMs)是否具备与人类相似的语用推理能力,尤其是在处理条件句时能否像人类一样进行语用增益(pragmatic inferences)。研究发现,人类在跨语言情境下会基于语境对逻辑条件句进行语用扩展,例如将“如果你割草,我就给你50美元”理解为仅在割草时才支付报酬,而将“如果你饿了,烤箱里有披萨”理解为无论是否饥饿都有披萨。相比之下,尽管部分LLMs能准确遵循条件句的真值表(truth-table),但它们普遍忽略语用推理;另一些模型虽表现出一致的单一解释,却偏离了逻辑结构,显示出规则驱动而非人类式的灵活推理。关键结论在于:LLMs本质上是准确的语义操作者,但尚未掌握人类特有的语用增强机制,且其表现不受模型开放/封闭状态、训练目标或架构类型的影响,表明语用推理仍是人工智能系统认知工具包中的新兴能力。

链接: https://arxiv.org/abs/2605.21299
作者: Paolo Morosi,Nikoleta Pantelidou,Fritz Günther,Elena Pagliarini,Evelina Leivada
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Humans effortlessly go beyond literal meanings: If you mow the lawn, I will give you fifty dollars, is typically understood as implying that the speaker will pay only if the lawn is mowed, whereas If you are hungry, there is pizza in the oven implies that pizza is available regardless of the hearers hunger. Large Language Models - LLMs - show human-like performance on many tasks, yet it remains unclear whether they reason like humans. To address this, we conducted a population-matching experiment assessing how twentyfive LLMs compute conditional inferences across four languages, compared to an equal number of humans per language. We find that humans enrich logical reasoning through pragmatic inferences across languages. Model behavior is more variable. Some LLMs perfectly follow the truth-table of conditionals but they ignore pragmatic inferences, while others deviate from the truth-table, adhering to a single interpretation across the board, thus reflecting accurate rule-based processing but not human-like reasoning. Overall, LLMs are accurate semantic operators, but fail to capture the pragmatic enrichments characteristic of human reasoning. Crucially, LLM accuracy is neither predicted nor boosted by open vs. closed status, training orientation, or architecture type, suggesting that pragmatic reasoning is still an emerging ability in the cognitive toolkit of artificial systems.

[NLP-15] Reliable Automated Triage in Spanish Clinical Notes: A Hybrid Framework for Risk-Aware HIV Suspicion Identification ACL2026

【速读】: 该论文试图解决的标准临床自然语言处理(Natural Language Processing, NLP)基准在面对模糊实例时,因强制确定性分类而导致评估指标虚高、掩盖了过度自信预测的临床风险问题。其解决方案的关键在于提出一种风险感知的混合选择性分类框架,通过双验证机制显式分离两类不确定性:利用Mondrian共形预测(Mondrian conformal prediction)解耦随机不确定性(aleatoric uncertainty),并基于多中心马氏距离(Multi-Centroid Mahalanobis Distance)构建几何约束作为否决机制以量化认知不确定性(epistemic uncertainty)。实证结果表明,传统不确定性度量和基线分类器在严格可靠性约束下存在覆盖塌陷(coverage collapse)问题,而所提框架通过要求临床文本同时满足概率与几何双重保障,成功识别出一个高度可信的操作域,从而提升了医疗分诊的安全性与可靠性。

链接: https://arxiv.org/abs/2605.21256
作者: Rodrigo Morales-Sánchez,Soto Montalvo,Raquel Martínez
机构: Universidad Nacional de Educación a Distancia (UNED); Universidad Rey Juan Carlos (URJC)
类目: Computation and Language (cs.CL)
备注: Accepted at the BioNLP Workshop @ ACL 2026

点击查看摘要

Abstract:Standard clinical Natural Language Processing (NLP) benchmarks often yield inflated metrics by forcing deterministic classification on ambiguous instances, thereby obscuring the clinical risks of overconfident predictions. To bridge this gap, we propose a risk-aware hybrid selective classification framework, evaluated on early Human Immunodeficiency Virus suspicion identification in Spanish clinical notes. Our dual-verification approach explicitly decouples aleatoric uncertainty through Mondrian conformal prediction and epistemic uncertainty using a Multi-Centroid Mahalanobis Distance veto. Empirical evaluations reveal that standard uncertainty metrics and baseline classifiers are structurally insufficient for safe medical triage, suffering severe coverage collapse when forced to operate under strict reliability constraints. In contrast, by demanding that clinical narratives pass both probabilistic and geometric safeguards, the proposed framework successfully isolates a highly trustworthy operational domain.

[NLP-16] LamPO: A Lambda Style Policy Optimization for Reasoning Language Models

【速读】: 该论文旨在解决强化学习中基于可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)的信用分配问题,尤其是在稀疏奖励环境下,传统群体相对目标(如GRPO)因仅使用标量统计量汇总群体信息而丢失了候选响应间的细粒度关系,导致在多个生成解仅存在细微推理差异时难以有效区分优劣。其解决方案的关键在于提出LamPO方法——一种类Lambda策略优化算法,通过引入**成对分解优势(Pairwise Decomposed Advantage)**替代原有的标量群体优势,聚合每组内响应间的奖励差距,并利用序列对数概率差计算置信度加权因子来调节每一对比较;同时保留PPO风格的无评论家(critic-free)和裁剪更新结构,确保训练稳定性和效率。此外,在有参考解的情况下,进一步引入轻量级ROUGE-L辅助密集奖励以缓解奖励稀疏性问题。实验表明,LamPO在AIME24、AIME25、MATH-500和GPQA-Diamond等多个基准上优于GRPO及近期RLVR变体,具备更稳定的训练动态与更高的样本效率。

链接: https://arxiv.org/abs/2605.21235
作者: Zhe Yuan,Yipeng Zhou,Jinghan Li,Xinyuan Chen,Bowen Deng,Zhiqian Chen,Liang Zhao
机构: Pinterest(Pinterest); Facebook(Facebook); University of Michigan - Ann Arbor(密歇根大学安娜堡分校); Mississippi State University(密西西比州立大学); Carnegie Mellon University(卡内基梅隆大学); Emory University(埃默里大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) has become an effective paradigm for improving reasoning language models on tasks such as mathematics, coding, and scientific question answering. However, widely used group-relative objectives, such as GRPO, summarize each sampled group with scalar statistics and therefore discard fine-grained relational information among candidate responses. This weakens credit assignment under sparse outcome rewards, especially when multiple generated solutions differ only subtly in reasoning quality. We propose \textbfLamPO, a \textbfLambda-Style Policy Optimization method that replaces scalar group advantages with a \emphPairwise Decomposed Advantage. LamPO aggregates pairwise reward gaps within each response group and modulates each comparison by a confidence-aware weight computed from sequence log-probability differences, while retaining the critic-free and clipped-update structure of PPO-style optimization. When reference solutions are available, we further add a lightweight ROUGE-L-based dense auxiliary reward to reduce reward sparsity. Experiments on AIME24, AIME25, MATH-500, and GPQA-Diamond with Qwen3-1.7B, Qwen3-4B, and Phi-4-mini show that LamPO consistently improves over GRPO and recent RLVR variants, with more stable training dynamics and better sample efficiency.

[NLP-17] Do LLM s Know What Luxembourgish Borrows? Probing Lexical Neology in Low-Resource Multilingual Models LREC2026

【速读】: 该论文试图解决的问题是:在低资源接触语言(如卢森堡语)中,大型语言模型(LLMs)是否能够尊重社区对词汇借用(lexical borrowing)和新词生成(neology)的规范。现有研究表明,LLMs 在处理此类任务时表现不佳,尤其是在缺乏外部上下文的情况下。解决方案的关键在于构建一个结构化的语言知识图谱(linguistic knowledge graph),其中包含源语言(donor language)、形态模式(morphological patterns)和词汇类比(lexical analogues)等信息,并将与实例相关的子图注入到提示(prompt)中。这种“词典感知提示”(lexicon-aware prompting)方法显著提升了模型在词汇借用分类任务上的准确率(从25–35%提升至71–81%),并缩小了小模型与大模型之间的性能差距,证明了利用结构化词汇资源作为上下文可有效提升LLMs在低资源多语种环境中的鲁棒性判断能力。

链接: https://arxiv.org/abs/2605.21227
作者: Nina Hosseini-Kivanani
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to Neollm colocated with LREC2026, Three figures and three tables

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used for writing assistance in small contact languages, yet it is unclear whether they respect community norms around lexical borrowing and neology. We introduce LexNeo-Bench, a 3,050-instance token-level benchmark derived from LuxBorrow, a large-scale Luxembourgish news corpus, where target tokens are labelled as native or as French, German, or English borrowings. Using this benchmark, we probe three multilingual LLMs across 34 prompt settings on two tasks: borrowing type classification and a binary lexical-innovation proxy (borrowing versus native). Without external context, models perform only slightly above chance on borrowing classification, so we construct a linguistic knowledge graph that encodes donor language, morphological patterns, and lexical analogues, and inject instance-specific subgraphs into the prompt. Knowledge-graph prompts raise borrowing classification accuracy from 25 – 35% up to 71 – 81% and largely close the gap between small and large models, while leaving neology detection difficult and sensitive to few-shot design. Our results show that lexicon-aware prompting is highly beneficial for robust borrowing judgments in low-resource contact languages and that lexical resources can serve as structured context for LLM evaluation. This study was carried out within the ENEOLI COST Action and examines borrowing as a form of lexical innovation in multilingual Luxembourgish data.

[NLP-18] Manga109-v2026: Revisiting Manga109 Annotations for Modern Manga Understanding ICML2026

【速读】: 该论文旨在解决当前Manga109数据集在对话文本标注中存在的质量问题,这些问题包括转录错误、文本区域缺失、对话与拟声词重叠以及气泡分割不足等,导致其难以适配现代光学字符识别(OCR)和多模态漫画理解任务。解决方案的关键在于结合基于OCR的错误检测与人工修正,构建了改进版的Manga109-v2026数据集,对约29,000条对话标注进行了修订,在保持漫画表达结构特征的同时显著提升了数据质量,使其更契合当前AI系统对漫画的理解需求。

链接: https://arxiv.org/abs/2605.21182
作者: Jeonghun Baek,Atsuyuki Miyai,Shota Onohara,Hikaru Ikuta,Kiyoharu Aizawa
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the Culture x AI Workshop at ICML 2026. Project page: this https URL

点击查看摘要

Abstract:Manga is a culturally distinctive multimodal medium and one of the most influential forms of Japanese popular culture. As AI systems increasingly target manga understanding, OCR, and translation, Manga109 has become a foundational dataset for manga-related AI research. However, the current Manga109 dataset contains transcription errors and coarse annotations, which do not align well with modern OCR and multimodal manga understanding tasks. In this work, we revisit the dialogue text annotations of Manga109 and identify five categories of annotation issues, including transcription errors, missing text regions, overlapping dialogue and onomatopoeia, and under-segmented speech balloons. To address these issues, we combine OCR-based issue detection and manual revision to construct Manga109-v2026, revising approximately 29,000 dialogue annotations. Our revisions better align Manga109 with modern OCR and multimodal manga understanding systems while preserving expressive structures characteristic of manga.

[NLP-19] Metaphors in Literary Post-Editing: Opening Pandoras Box?

【速读】: 该论文试图解决的问题是:文学文本中隐喻(metaphor)在神经机器翻译(Neural Machine Translation, NMT)和大语言模型(Large Language Models, LLMs)翻译输出中的处理问题,以及译后编辑(post-editing)人员对此类问题的反应与应对策略。解决方案的关键在于识别出译后编辑者对机器翻译结果中隐喻翻译的修改行为——研究发现,约三分之一的隐喻被编辑者修改,表明文学机器翻译在处理修辞语言方面存在显著缺陷;同时,译后编辑者普遍认为翻译质量较差,且编辑工作比从头翻译更费力,反映出译后编辑不仅未提升效率,反而限制了译者的创造性并削弱其对文本的归属感。

链接: https://arxiv.org/abs/2605.21178
作者: Aletta G. Dorst,Mayra O. Nas,Katinka Zeven
机构: Leiden University Centre for Linguistics (莱顿大学语言学研究中心)
类目: Computation and Language (cs.CL)
备注: This paper has been accepted for presentation at the EAMT Conference 2026, which will take place in Tilburg from June 15 to 18, 2026

点击查看摘要

Abstract:This paper investigates how post-editors of literary texts react and respond to the way metaphors have been translated by Neu ral Machine Translation (NMT) and Large Language Models (LLMs). The results show that one in three metaphors in the output were changed by the post-editors, demonstrating that the translation of fig urative language is indeed problematic in literary MT (LitMT). The responses indi cate that the post-editors were aware of overly literal translations, though mostly for multiword expressions. Moreover, at times they found it difficult to determine whether solutions were acceptable. They rated the overall quality of the MT out put as quite poor and stated that the post editing was more work and more effort than it would have been translating from scratch. This supports previous studies ar guing that post-editing constrains transla tors in their creativity and diminishes their sense of text ownership.

[NLP-20] ChunkFT: Byte-Streamed Optimization for Memory-Efficient Full Fine-Tuning

【速读】: 该论文试图解决大模型全参数微调(full-parameter fine-tuning)过程中对显存资源消耗巨大的问题,尤其是在单卡或有限硬件条件下难以部署的问题。其解决方案的关键在于提出一种名为 \textscChunkFT 的内存高效微调框架,该框架通过动态激活的工作集(dynamically activated working set)重构全参数微调过程,能够在不修改网络结构的前提下,对任意子张量(sub-tensors)进行梯度计算,从而避免标准的密集梯度计算。这一设计实现了对任意子网络的优化,显著降低显存占用,并在理论层面提供了确定性设置下的收敛性分析。实验表明,\textscChunkFT 在 Llama 3-8B 和 Llama 3-70B 模型上均能以极低显存(如 7B 模型仅需 13.72GB)完成全参数微调,且在运行时间、优化质量和下游任务(语言理解、数学推理、MT-Bench)中优于现有内存高效基线方法,甚至达到与全参数微调相当或更优的性能。

链接: https://arxiv.org/abs/2605.21177
作者: Yongkang Liu,Zijing Wang,Mengjie Zhao,Ercong Nie,Mingyang Wang,Qian Li,Feiliang Ren,Shi Feng,Daling Wang,Hinrich Schütze
机构: Northeastern University, China; Shanghai Jiao Tong University, China; CIS, LMU Munich, Germany; MCML, Germany; Shandong University, China
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This work presents \textscChunkFT, a memory-efficient fine-tuning framework that reformulates full-parameter fine-tuning around a dynamically activated working set. \textscChunkFT enables gradient computation for arbitrary sub-tensors without modifying the network architecture, providing an algorithmic foundation for optimizing arbitrary sub-networks while avoiding standard dense gradient computation. We provide a theoretical convergence analysis of \textscChunkFT in the deterministic setting. Empirically, we apply \textscChunkFT to fine-tune Llama 3-8B and Llama 3-70B using a single RTX 4090-24GB GPU and 2 \times H800-80GB GPUs, respectively. Full-parameter fine-tuning of a 7B model with a 1K input length requires only 13.72GB of GPU memory. The results demonstrate the effectiveness of \textscChunkFT in memory usage, running time, and optimization quality. Moreover, downstream evaluations on language understanding, mathematical reasoning, and MT-Bench show that \textscChunkFT consistently outperforms existing memory-efficient baselines. Notably, \textscChunkFT achieves performance comparable to, and in some cases exceeding, full-parameter fine-tuning. Our repository is on this https URL.

[NLP-21] Automated ICD Classification of Psychiatric Diagnoses: From Classical NLP to Large Language Models

【速读】: 该论文旨在解决精神健康领域临床诊断编码自动化难题,即如何将自由文本的临床描述自动映射到国际疾病分类(ICD)系统中,以减轻行政负担。解决方案的关键在于利用自然语言处理(NLP)和机器学习(ML)技术,特别是通过适配大型语言模型(LLM)来捕捉精神病学术语的语义复杂性和长尾标签分布问题。实验表明,基于Transformer的嵌入方法显著优于传统频率模型(如BoW、TF-IDF),其中经过端到端微调的e5_large模型在F1_micro指标上达到0.866,证明了针对特定临床术语优化LLM是提升诊断映射准确性的核心策略。

链接: https://arxiv.org/abs/2605.21154
作者: Fernando Ortega,Raúl Lara-Cabrera,Jorge Dueñas-Lerín,Alejandro de la Torre-Luque,Mercé Salvador Robert,Enrique Baca-García
机构: Universidad Politécnica de Madrid (马德里理工大学); KNODIS Research Group (KNODIS 研究组); CIBERSAM ISCIII (CIBERSAM ISCIII); Complutense University of Madrid (康普顿斯大学); Universidad Rey Juan Carlos (雷伊·胡安·卡洛斯大学); Hospital Universitario de Móstoles (莫斯特莱斯大学医院); University Hospital Jimenez Díaz Fundation (希梅内斯·迪亚斯基金会医院); University Hospital Rey Juan Carlos (雷伊·胡安·卡洛斯医院); General Hospital of Villalba (维拉尔巴综合医院); University Hospital Infanta Elena (伊莎贝拉公主医院); Universidad Catolica del Maule (马乌莱天主教大学); Madrid Autonomous University (马德里自治大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Mental health has become a global priority, leading to a massive administrative burden in the coding of clinical diagnoses. This study proposes the automation of psychiatric diagnostic analysis by mapping free-text descriptions to the International Classification of Diseases (ICD) using Natural Language Processing (NLP) and Machine Learning (ML) techniques. Utilizing a specialized dataset of 145,513 Spanish psychiatric descriptions, various text representation paradigms were evaluated, ranging from classical frequency-based models (BoW, TF-IDF) to state-of-the-art Large Language Models (LLMs) such as e5_large, BioLORD, and Llama-3-8B. Results indicate that transformer-based embeddings consistently outperform traditional methods by capturing implicit semantic cues and nuanced medical terminology. The e5_large model, through end-to-end fine-tuning, achieved the highest performance with a F1_micro score of 0.866. This research demonstrates that adapting LLMs to specific clinical nomenclature is essential for overcoming the challenges of ``long-tail’’ label distributions and the inherent ambiguity of psychiatric discourse.

[NLP-22] SMoA: Spectrum Modulation Adapter for Parameter-Efficient Fine-Tuning

【速读】: 该论文试图解决的问题是在参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)中,低秩适配(LoRA)方法因秩(rank)受限而导致的表示能力不足与计算资源消耗之间的权衡困境。具体而言,LoRA通过低秩更新模拟全参数微调,虽能显著降低资源需求,但其性能受限于可学习的奇异值方向数量——较小的秩导致模型无法充分捕捉预训练权重矩阵的主要特征方向,从而影响性能;而增大秩虽能提升表现,却引入更多可训练参数,增加计算开销。解决方案的关键在于提出SMoA(Spectrum Modulation Adapter),其核心创新是将模型层划分为多个对齐的谱块(spectral blocks),并在每个对角块内应用一种受哈达玛乘积(Hadamard-modulated)调制的低秩分支,从而在更小的参数预算下扩展了对预训练谱方向的覆盖范围,实现了更优的谱感知更新策略。理论分析与多任务实验证明,SMoA在低预算条件下相比LoRA及同类基线方法取得了更好的平均性能表现。

链接: https://arxiv.org/abs/2605.21147
作者: Yongkang Liu,Xing Li,Mengjie Zhao,Shanru Zhang,Zijing Wang,Qian Li,Shi Feng,Feiliang Ren,Daling Wang,Hinrich Schütze
机构: Northeastern University, China; Shandong University, China; CIS, LMU Munich, Germany; MCML, Germany
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As the number of model parameters increases, parameter-efficient fine-tuning (PEFT) has become the go-to choice for tailoring pre-trained large language models. Low-rank Adaptation (LoRA) uses a low-rank update method to simulate full parameter fine-tuning, which is widely used to reduce resource requirements. However, decreasing the rank encounters challenges with limited representational capacity. Theory suggests that LoRA fine-tuning with rank r converges toward the top r singular values of the pre-trained weight matrix. As the rank increases, more principal singular directions are preserved, which generally improves the model’s performance. However, a larger rank also introduces more trainable parameters, leading to higher computational cost. To overcome this dilemma, we propose SMoA, a \textbfSpectrum \textbfModulation \textbfAdapter that enlarges the accessible family of spectrum-aware updates under a smaller parameter budget. SMoA partitions the layer into multiple aligned spectral blocks and applies one in-block Hadamard-modulated low-rank branch to each diagonal block, yielding broader coverage of pretrained spectral directions. We provide theoretical analysis and empirical results on multiple tasks. In our experiments, SMoA improves average performance in the current lower-budget setting over LoRA and competitive LoRA-style baselines.

[NLP-23] Smarter edits? Post-editing with error highlights and translation suggestions

【速读】: 该论文试图解决的问题是:随着机器翻译(MT)质量的提升,增强型后编辑(Enhanced Post-Editing, EPE)功能(如基于质量评估(QE)的错误高亮)日益受到关注,但其实际有效性仍缺乏充分证据。解决方案的关键在于利用大语言模型(LLM)生成的错误高亮和修正建议,结合自动后编辑(APE)技术,在专业译员(英-荷语种)的实际后编辑任务中进行对比实验,评估其在生产力、翻译质量和用户体验方面的表现。结果表明,尽管与常规后编辑(PE)相比未显著提升生产力或质量,但APE生成的错误高亮比QE-derived高亮更受用户欢迎,且修正建议显著改善了整体用户体验。

链接: https://arxiv.org/abs/2605.21135
作者: Fleur V.J. van Tellingen,Gautam Ranka,Dora Žugčić,Joyce van der Wal,Andrea Camasta,Livio Guerra,Alina Karakanta
机构: Leiden University Centre for Linguistics (莱顿大学语言学研究中心); Visvesvaraya National Institute of Technology (维斯韦萨瓦拉国家技术学院); Department of Bionanoscience, Faculty of Applied Sciences, Delft University of Technology (代尔夫特理工大学应用科学学院生物纳米科学系); Pedagogical Sciences, Leiden University (莱顿大学教育科学系); Faculty of Science, Leiden University (莱顿大学理学院)
类目: Computation and Language (cs.CL)
备注: Accepted at EAMT 2026

点击查看摘要

Abstract:As MT quality increases, interest in enhanced post-editing features such as QE-derived error highlights is growing, yet evidence for their usefulness remains limited. In this work, we explore the usefulness of LLM-derived error highlights and correction suggestions based on automatic post-editing (APE). We conduct a study where professional translators (En-Nl) post-edit translations using APE error highlights and correction suggestions and compare productivity, quality and user experience to regular PE and PE with QE-derived highlights. While no condition yielded productivity or quality gains compared to regular PE, APE highlights were better received than QE-derived highlights, and correction suggestions improved overall user experience.

[NLP-24] ACL-Verbatim: hallucination-free question answering for research

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在辅助学术研究时易产生事实性错误或无意义输出(即“幻觉”)的问题,从而影响信息获取的可靠性。其解决方案的关键在于采用抽取式问答系统VerbatimRAG,将用户查询直接映射到论文文档中的原文片段(verbatim text spans),并通过构建一个全新的标注数据集来训练和评估抽取模型。该数据集基于NLP研究人员对由ScIRGen方法生成的合成用户查询与VerbatimRAG检索到的论文片段进行人工标注,最终表明,使用该数据集通过银标监督(silver supervision)训练的150M参数ModernBERT分类器在词级F1指标上达到53.6,优于所测试的最强LLM抽取器(48.7),验证了抽取式方法在提升学术研究中信息提取准确性和可信度方面的有效性。

链接: https://arxiv.org/abs/2605.21102
作者: Gábor Recski,Szilveszter Tóth,Nadia Verdha,István Boros,Ádám Kovács
机构: TU Wien (维也纳工业大学); KR Labs (KR实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 13 pages

点击查看摘要

Abstract:Academic researchers need efficient and reliable methods for collecting high-quality information from trusted sources, but modern tools for AI-assisted research still suffer from the tendency of Large Language Models (LLMs) to produce factually inaccurate or nonsensical output, commonly referred to as hallucinations. We apply the extractive question answering system VerbatimRAG to research papers in the ACL Anthology, directly mapping user queries to verbatim text spans in retrieved documents. We contribute a novel ground truth dataset for the task of mapping user queries to relevant text spans in research papers, and use it to train and evaluate a variety of extractive models. Human annotation is performed by NLP researchers and is based on synthetic user queries generated using a custom pipeline based on the ScIRGen methodology, paired with chunks of research papers retrieved by VerbatimRAG. On this benchmark, a 150M-parameter ModernBERT token classifier trained on silver supervision from our pipeline achieves the best word-level F1 (53.6), ahead of the strongest evaluated LLM extractor (48.7).

[NLP-25] WCXB: A Multi-Type Web Content Extraction Benchmark

【速读】: 该论文旨在解决网页内容提取(Web Content Extraction)领域中因现有评估基准局限性而导致的性能评估不全面问题。现有基准普遍存在规模小(仅100–800页)、仅覆盖新闻类页面或使用十余年历史网页数据的问题,无法充分反映模型在多样化网页结构下的实际表现。解决方案的关键在于构建一个更大规模、更具代表性的新基准——Web Content Extraction Benchmark (WCXB),包含来自1,613个域名的2,008个网页,涵盖七种结构差异显著的页面类型(如文章、论坛、产品页、文档等),并提供高质量标注的开发集(1,497页)和测试集(511页)。该标注通过五阶段流程生成:LLM辅助草拟、自动化验证、四轮前沿模型审查、片段与质量校验脚本及人工审核,确保标注可靠性。实验表明,尽管顶级系统在文章类页面上表现优异(F1=0.93),但在结构化页面类型上性能差异显著(F1=0.41–0.84),揭示了传统仅基于文章的评估基准无法发现的盲区,从而推动更鲁棒的内容提取方法的发展。

链接: https://arxiv.org/abs/2605.21097
作者: Murrough Foley
机构: Murrough Foley
类目: Computation and Language (cs.CL)
备注: Dataset: this http URL , this http URL . Leaderboard: this http URL . Preprint also deposited at this http URL

点击查看摘要

Abstract:Web content extraction - isolating a page’s main content from surrounding boilerplate - is a prerequisite for search indexing, retrieval-augmented generation, NLP dataset construction, and large language model training. Progress in this area has been constrained by the limitations of existing evaluation benchmarks, which are small (100-800 pages), restricted to news articles, or based on web pages from over a decade ago. We introduce the Web Content Extraction Benchmark (WCXB), a dataset of 2,008 web pages from 1,613 domains spanning seven structurally distinct page types: articles, forums, products, collections, listings, documentation, and service pages. The dataset includes a 1,497-page development set and a 511-page held-out test set with matched page type distributions. Ground truth annotations were produced through a five-stage pipeline: LLM-assisted drafting, automated verification, four-pass frontier model review, snippet and quality verification scripts, and human review. We evaluate 13 extraction systems - 11 heuristic and 2 neural - and find that while top systems converge on articles (F1 = 0.93), performance diverges sharply on structured page types (F1 = 0.41-0.84), revealing blind spots invisible to existing article-only benchmarks. The dataset is released under CC-BY-4.0 with HTML source files, ground truth annotations, page type labels, and baseline results.

[NLP-26] LoCar: Localization-Aware Evaluation of In-Vehicle Assistants through Fine-Grained Sociolinguistic Control ACL2026

【速读】: 该论文试图解决的问题是:在车载对话系统中,如何针对真实部署需求,识别最适合的大型语言模型(LLM),尤其是在韩语本地化场景下缺乏专门评估标准的情况下。解决方案的关键在于提出一个面向车载助手的新型评估框架,该框架特别关注韩语语境下的细粒度礼貌等级控制(honorific control)和战略对话能力(如澄清与主动性),并通过保守的评估立场确保可靠性,从而推动汽车AI从通用能力向精准语言适配与安全导向的交互管理演进。

链接: https://arxiv.org/abs/2605.21086
作者: Seogyeong Jeong,Kiwoong Park,Seyoung Song,Eunsu Kim,Ken E. Friedl,Jaeho Kim,Alice Oh
机构: 未知
类目: Computation and Language (cs.CL)
备注: To appear in ACL 2026 Industry Track

点击查看摘要

Abstract:While Large Language Models (LLMs) are increasingly integrated into in-vehicle conversational systems, identifying the optimal model remains challenging due to the lack of domain-specific evaluation standards tailored to real-world deployment requirements. In this paper, we propose a novel evaluation framework for in-vehicle assistants, with a particular focus on Korean-language localization. Our empirical analysis reveals notable patterns in model behavior. First, fine-grained Korean honorific control remains unstable in current LLMs, indicating that precise speech-level realization must be explicitly evaluated in localization settings. Second, models exhibit weaker performance in strategic conversational metrics like clarification and proactivity. Our analysis suggests this stems from the inherent subjective complexity of these tasks, where our framework adopts a conservative evaluation stance to prioritize reliability. Together, our findings underscore that automotive AI must move beyond general competence toward precise linguistic tailoring and reliable, safety-oriented interaction management.

[NLP-27] GradeLegal: Automated Grading for German Legal Cases

【速读】: 该论文试图解决德国法律考试评分中因阅卷量激增与合格评卷人短缺而导致的反馈延迟和瓶颈问题,同时确保这一高风险专家任务(因其成绩显著影响学生职业发展)的评分质量。解决方案的关键在于系统评估27个专有及开源大语言模型(LLMs)在刑事法和公法案例解答自动评分中的表现,通过逐步增加任务相关信息(如样例答案和评分量表)的提示策略优化评分一致性,并采用加权肯德尔和谐系数(QWK)进行量化评估。研究发现,经过合理提示设计的推理导向型LLM在公法评分中可达到与专家评分高度一致(最高QWK=0.91),而刑事法评分难度更高(QWK=0.60);此外,集成多个模型可使评分一致性提升最多0.15,优于单一高性能闭源模型,表明有效提示工程和模型选择对实现可靠自动化评分至关重要。

链接: https://arxiv.org/abs/2605.21076
作者: Abdullah Al Zubaer,Lorenz Wendlinger,Simon Alexander Nonn,Michael Granitzer,Jelena Mitrovic
机构: University of Passau (帕绍大学); Deggendorf Institute of Technology (德根多夫应用技术大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Grading German legal exam solutions faces growing volumes and a shortage of qualified graders, delaying feedback and creating a bottleneck. At the same time, it is a high-stakes expert task, since state exam grades strongly influence career outcomes in Germany. Despite this practical relevance, literature lacks systematic studies on effective methods for grading legal exams. To address this gap, we investigate whether large language models (LLMs) can support the automated grading of German legal case solutions in criminal and public law, thereby enabling scalable feedback and student self-testing. We present a systematic evaluation of 27 proprietary and open-source LLMs, benchmarking prompting strategies that incrementally add task-related information, such as a sample solution and a grading rubric. Using quadratic weighted kappa (QWK), reasoning-oriented LLMs can approximate expert grading in public law when given a sample solution and a grading rubric (up to 0.91), compared to 0.60 in criminal law, suggesting a harder grading task in criminal law. Beyond single-model grading, ensembling improves agreement by up to 0.15 over its best member and can offer an alternative to stronger closed-source single models. In addition, our findings suggest that effective prompt design and model selection are necessary for reliable LLM-based grading of legal exams.

[NLP-28] Fine-grained Claim-level RAG Benchmark for Law

【速读】: 该论文试图解决法律领域检索增强生成(Retrieval-Augmented Generation, RAG)系统在实际应用中仍存在幻觉(hallucination)问题,且现有评估框架缺乏细粒度分析能力,无法区分检索与生成模块的性能表现。此外,当前基准测试主要局限于英文和法律专家查询,忽视了非专业人士的实际需求。解决方案的关键在于提出ClaimRAG-LAW数据集,该数据集支持英法双语、覆盖专家与非专家用户,并包含多样化的现实场景问题类型;同时引入细粒度评估框架,对最先进的法律RAG系统进行深入分析,揭示了检索、生成及主张层面(claim-level)存在的局限性,从而为提升法律RAG系统的可靠性与实用性提供关键支撑。

链接: https://arxiv.org/abs/2605.21071
作者: Souvick Das,Sallam Abualhaija,Domenico Bianculli
机构: University of Luxembourg (卢森堡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid progress of large language models (LLMs) is shifting semantic search toward a question-answering paradigm, where users ask questions and LLMs generate responses. In high-stake domains such as law, retrieval-augmented generation (RAG) is commonly used to mitigate hallucinations in generated responses. Nonetheless, prior work shows that RAG systems, whether general-purpose or legal-specific, still hallucinate at varying rates, making fine-grained evaluation essential. Despite the need, existing evaluation frameworks for legal RAG systems lack the granularity required to provide detailed analysis of retrieval and generation performance separately. Moreover, current benchmarks are largely English-only and centered on legal expert queries, overlooking non-expert needs. We introduce ClaimRAG-LAW, a comprehensive dataset for legal RAG that supports French and English, targets both experts and non-experts, and includes diverse question types reflecting realistic scenarios. We further apply a fine-grained evaluation framework of state-of-the-art legal RAG systems, revealing limitations in retrieval, generation, and claim-level analysis in the legal domain.

[NLP-29] APM: Evaluating Style Personalization in LLM s with Arbitrary Preference Mappings

【速读】: 该论文试图解决大语言模型(LLM)在响应中缺乏个性化适应的问题,即模型往往采用默认风格输出,而用户对语气、冗余度和正式程度等隐性偏好难以通过显式提示表达。为应对这一挑战,作者提出了一种新的评估框架——任意偏好映射(Arbitrary Preference Mapping, APM),其核心在于通过一个隐藏且随机重采样的映射矩阵 C\mathbf{C} 将用户属性(如“热情”)与响应特征偏好(如“有说服力”)解耦,从而迫使模型从对话历史中推断隐含偏好,而非依赖刻板印象。关键解决方案是利用APM基准实现无参考的公平评估,避免了传统方法因混淆个性化与整体质量而导致的偏差;在此基础上,作者对比了检索增强生成(RAG)、软提示优化和路由策略三种个性化方法,发现路由策略最为稳定可靠,而RAG仅在更强的基础模型上有效,软提示优化则未显著优于基线,表明在真实场景下个性化仍具挑战,但所提方法具有改进潜力。

链接: https://arxiv.org/abs/2605.21063
作者: Philipp Spohn,Leander Girrbach,Zeynep Akata
机构: Technical University of Munich (慕尼黑工业大学); Helmholtz Munich (亥姆霍兹慕尼黑研究中心); Munich Center for Machine Learning (慕尼黑机器学习中心)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Typical LLM responses tend to follow a default style, even though users often have distinct preferences regarding tone, verbosity, and formality that they do not explicitly state in their prompts. Evaluating whether personalization methods can adapt to these implicit preferences is challenging, since users typically provide prompts rather than reference responses, style preferences are not factually verifiable, and reference-free LLM judges may conflate personalization with general response quality. To address these challenges, we introduce the Arbitrary Preference Mapping (APM) benchmark, which decouples user attributes (e.g. enthusiastic) from response principles (e.g. persuasive) via a hidden, randomized mapping \mathbfC that maps user attributes to preferences about response traits. Because \mathbfC carries no semantic content and is resampled across runs, models cannot exploit stereotypical associations and must infer preferences from conversation history. Using this unbiased evaluation methodology, we adapt retrieval-augmented, prompt-optimization, and routing personalization methods and evaluate them on Llama-3.1-8B and Qwen-3.5-27B. Our results show that routing is the most reliable approach, while RAG only improves with the stronger base LLM, and soft prompt optimization fails to improve significantly over a non-personalized baseline. Our extensive evaluation reveals that in this realistic setting, personalization remains challenging, but our adapted methods show promise.

[NLP-30] Cross-lingual robustness of LLM -brain alignment and its computational roots

【速读】: 该论文试图解决的问题是:大型语言模型(LLMs)与大脑活动之间的对齐是否延伸至皮下区域、是否在不同语言间存在空间重叠,以及这种对齐的计算根源是什么。解决方案的关键在于采用多语言全脑编码框架,在自然故事聆听任务中考察三种语言(汉语、英语和法语)下大脑-语言模型对齐的空间分布和层间稳定性。研究发现,Transformer模型在广泛分布的皮层功能网络(如边缘系统、腹侧注意网络、默认模式网络)及皮下结构中均能有效预测神经活动,且跨语言的空间对齐模式具有显著一致性,并在模型各层保持稳定,但未表现出与功能皮层层级一致的逐层演进特征;此外,上下文嵌入并未优于静态嵌入,且预测处理(predictive processing)和信息压缩(information compression)等候选计算机制均无法解释神经对齐模式。因此,大脑-语言模型对齐的稳健性并非源于预测不确定性或表征几何结构,而是主要由跨语言通用的词汇-语义对应关系驱动。

链接: https://arxiv.org/abs/2605.21049
作者: Ni Yang,Rui He,Philipp Homan,Iris Sommer,Davide Staub,Wolfram Hinzen
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) reliably predict neural activity during language comprehension and transformer depth has been interpreted as mirroring hierarchical cortical organization. However, it remains unclear whether such alignment extends to subcortical regions, overlaps spatially across languages, and what the computational roots of such alignment are. Here, we used a multilingual, whole-brain encoding framework to examine brain-LLM alignment across three typologically distinct languages: Mandarin, English, and French during naturalistic story listening. Our results show that across languages, transformer-based models predicted activity in a distributed landscape spanning widely distributed cortical functional networks like limbic, ventral attention, default mode network, and subcortical structures. Spatial alignment patterns showed substantial cross-linguistic overlap and remained largely stable across model layers, with limited layer progression consistent with functional cortical hierarchies. Contrary to previous evidence, contextual embeddings did not outperform static embeddings. To test candidate computational explanations, we examined whether layer-wise brain scores reflect surprisal and intrinsic dimensionality, and thereby predictive processing and information compression. Neither of these two computational metrics mirrored neural alignment profiles. Our findings suggest that brain-LLM alignment is spatially robust and cross-linguistically stable but not explainable from predictive uncertainty or representational geometry. Rather than directly reflecting shared hierarchical computation, neural predictivity may primarily arise from distributed lexical-semantic correspondences that generalize across languages.

[NLP-31] Building a Custom Taxonomy of AI Skills and Tasks from the Ground Up with Job Postings

【速读】: 该论文试图解决的问题是如何在处理大规模、快速增长的文本语料库时,高效且准确地构建领域特定的分类体系(taxonomy),特别是在生成式 AI (Generative AI) 技术应用于自动化知识组织场景下,如何优化数据输入以提升分类体系的质量。其解决方案的关键在于提出 TaxonomyBuilder 作为系统性研究框架,并通过实证表明:相较于直接使用未经筛选的原始数据,对输入数据进行有针对性的过滤(即“少而精”的数据选择策略)能够显著提升分类体系在特定领域内的覆盖度和清晰度,从而优于传统无过滤的数据输入结合聚类与大语言模型(LLM)增强的层次化标签方法。

链接: https://arxiv.org/abs/2605.21029
作者: Stephen Meisenbacher,Peter Norlander
机构: Technical University of Munich (慕尼黑工业大学); Loyola University Chicago (洛约拉大学芝加哥分校)
类目: Computation and Language (cs.CL)
备注: 14 pages, 2 figures, 8 tables. Accepted to CustomNLP4U 2026

点击查看摘要

Abstract:Utilizing LLMs for automated taxonomy construction presents a clear opportunity for the comprehensive, yet efficient mapping of potentially complex domains. When contending with high volumes of rapidly growing corpora, however, it becomes unclear how to best leverage such data for optimal taxonomy construction. Taking the case of systematizing AI skills in the workplace, we use two large-scale job postings corpora to investigate key design decisions for the inclusion (or exclusion) of data points for taxonomy construction. We propose TaxonomyBuilder as a blueprint for our systematic study, with which we evaluate various configurations of custom, data-informed, and hierarchical taxonomies. We demonstrate that less data can provide more clarity: filtering inputs to TaxonomyBuilder provides better domain-specific coverage than offering unfiltered inputs to clustering and LLM-enhanced hierarchical taxonomy labeling tools.

[NLP-32] Beyond Text-to-SQL: An Agent ic LLM System for Governed Enterprise Analytics APIs

【速读】: 该论文试图解决的问题是:在企业环境中,非技术用户难以使用传统的商业智能工具或Text-to-SQL系统来访问和分析结构化数据,而现有的基于大语言模型(LLM)的Text-to-SQL方法无法适配依赖受控API而非原始数据库的企业分析流程。其解决方案的关键在于提出Analytic Agent——一个基于LLM的代理系统,通过多步骤推理和策略感知编排,将自然语言意图转化为对受控企业分析API的安全调用,从而实现用户目标的可靠解析、权限验证、合规查询执行及可视化生成,确保在整个过程中满足业务逻辑一致性、审计可追溯性和安全性要求。

链接: https://arxiv.org/abs/2605.21027
作者: Gundeep Singh,Parsa Kavehzadeh,Jing Xia,Xue-Yong Fu,Julien Bouvier Tremblay,Md Tahmid Rahman Laskar,Vincent Lum,Shashi Bhushan TN
机构: Dialpad Inc.
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: The first four authors contributed equally to this work

点击查看摘要

Abstract:Enterprise analytics aims to make organizational data accessible for decision-making, yet non-technical users still face barriers when using traditional business intelligence tools or Text-to-SQL systems. While recent Text-to-SQL approaches based on Large Language Models (LLMs) promise natural language access to structured data, they fall short in enterprise settings where analytics pipelines rely on governed APIs rather than raw databases. In practice, these APIs encapsulate complex business logic to ensure consistency, auditability, and security. However, delegating mathematical or aggregation logic to an LLM introduces reliability and compliance risks. To this end, we present Analytic Agent, an LLM-based agentic system that translates natural language intents into secure interactions with enterprise analytics APIs. Evaluated on 90 real enterprise use cases constructed by domain experts, it reliably interprets user goals, validates permissions, executes governed queries, and generates compliant visualizations through multi-step reasoning and policy-aware orchestration.

[NLP-33] Playing Devils Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

【速读】: 该论文试图解决的问题是模型在面对用户错误时仍盲目附和(sycophancy)的现象,即模型倾向于无条件认同用户观点,即使用户表述错误。传统解决方案Contrastive Activation Addition (CAA)通过标注的附和与诚实响应对来学习一个可操控方向以减少附和行为。本文的关键创新在于提出使用现成的、用于通用角色扮演的“人格向量”(persona steering vectors)作为替代方案,这些向量并未专门针对附和数据训练。实验表明,在两个指令微调模型中,引导至怀疑或审视型人格可分别将附和程度降低至CAA效果的68%和98%,且相比CAA,这类方法在用户正确时能更好保持准确性。此外,该效应具有不对称性:引导至顺从型人格不会显著增加附和行为。几何分析进一步揭示,人格向量与附和方向在激活空间中基本正交,说明附和更应被视为一种人格层面的属性而非单一可操控方向。

链接: https://arxiv.org/abs/2605.21006
作者: Ishaan Kelkar,Nebras Alam,Vikram Kakaria,Madhur Panwar,Vasu Sharma,Maheep Chaudhary
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We study the effect of different persona on \textbfsycophancy: model’s agreement with users even when the user is incorrect. The standard mitigation, Contrastive Activation Addition (CAA), derives a steering direction from labelled pairs of sycophantic and honest responses. This study evaluates whether off-the-shelf persona steering vectors, originally developed for general role-playing and not trained on sycophancy data, can serve as an alternative. In two instruction-tuned models, steering toward personas characterised by doubt or scrutiny reduces sycophancy to approximately 68% and 98% of CAA’s effect, and, unlike CAA, maintains accuracy when the user is correct. The effect is also asymmetric: steering toward agreeable personas does not produce a mirror increase in sycophancy. Geometrically, the persona vector is largely independent of the direction of sycophancy in activation space. Collectively, these findings suggest that sycophancy is better understood as a persona-level property rather than a single steerable direction. We release our code here: this https URL.

[NLP-34] Single-Pass Depth-Selective Reading for Multi-Aspect Sentiment Analysis ACL2026

【速读】: 该论文旨在解决多方面情感分析(Aspect-Term Sentiment Analysis, ATSA)中效率与表达能力之间的权衡问题。现有模型要么为每个方面重新编码句子,要么依赖静态深度表示,导致计算冗余且适应性有限。其解决方案的关键在于提出DABS(Depth-Ordered Adaptive Browsing System),一种单次推理框架:首先对每条句子进行一次编码,构建一个可复用的、按深度排序的底层表示(substrate),随后每个方面通过查询该共享表示来选择性地读取相关词元和抽象层级,无需重新编码。这一机制将共享的句子编码与轻量级、面向方面的读出操作解耦,从而在保持竞争力性能的同时,将多方面场景下的端到端计算量减少高达60%(M=2)。实验表明,自适应深度查询在语言复杂情形(如否定和对比)中效果最为显著。

链接: https://arxiv.org/abs/2605.20998
作者: Yan Xia,Zhuangzhuang Pan,Amirrudin Kamsin,Chee Seng Chan
机构: Universiti Malaya, Malaysia; Suzhou University of Technology, China; VinUniversity, Vietnam
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at ACL2026 (main). Our solution (DABS) reads the sentence once, then lets each aspect selectively query the right tokens and Transformer depths, cutting redundant computation while preserving ATSA accuracy

点击查看摘要

Abstract:Aspect-Term Sentiment Analysis (ATSA) in multi-aspect sentences faces a fundamental tradeoff between efficiency and expressiveness. Existing models either re-encode the sentence for each aspect or rely on static use of deep representations, leading to redundant computation and limited adaptivity. We argue that Transformer depth is a costly, queryable resource, and propose DABS, a single-pass inference framework that encodes each sentence once to construct a reusable, depth-ordered substrate. Each aspect then queries this shared representation to selectively read relevant tokens and abstraction levels, without re-encoding. This decouples shared sentence encoding from lightweight, aspect-conditioned readout. Experiments on four ATSA benchmarks show that DABS achieves competitive performance while reducing end-to-end computation by up to 60% in multi-aspect settings (M = 2). Further analyses indicate that adaptive depth querying is most beneficial for linguistically complex cases such as negation and contrast. Code is publicly available at this https URL

[NLP-35] owards Context-Invariant Safety Alignment for Large Language Models ICML2026

【速读】: 该论文旨在解决大语言模型(LLM)在偏好对齐后仍存在安全行为脆弱性的问题,即模型可能在标准提示下拒绝有害请求,但在对抗性表述下却会遵从相同意图。其核心解决方案是实现“上下文不变对齐”(context-invariant alignment),确保模型行为依赖于意图而非表面形式。关键创新在于提出锚定不变性正则化(Anchor Invariance Regularization, AIR),通过将可验证提示作为锚点,利用停止梯度目标仅对开放式变体进行正则化,从而避免降低可靠提示上的性能,同时提升对对抗性表述的鲁棒性。AIR作为辅助损失模块,结合分组偏好优化(如GRPO)实现异构提示分组训练,在安全、道德推理和数学任务中显著提升了分布内准确率(+12.71%)与分布外一致性(+33.49%),使安全约束更具抗对抗性。

链接: https://arxiv.org/abs/2605.20994
作者: Yixu Wang,Yang Yao,Xin Wang,Yifeng Gao,Yan Teng,Xingjun Ma,Yingchun Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ICML 2026

点击查看摘要

Abstract:Preference-based post-training aligns LLMs with human intent, yet safety behavior often remains brittle. A model may refuse a harmful request in a standard prompt but comply when the same intent is wrapped in adversarial wording. We suggest that robust safety requires context-invariant alignment, where behavior depends on the underlying intent rather than surface form. Enforcing invariance is difficult in alignment because not all training signals are equally trustworthy; for some prompt variants we can obtain verifiable feedback (e.g., multiple-choice), while for open-ended variants we typically rely on noisy, gameable reward proxies (e.g., learned judges). As a result, standard symmetric invariance regularizers can reduce cross-context discrepancies by lowering performance on reliable variants instead of improving open-ended robustness. To address this, we introduce Anchor Invariance Regularization (AIR), which treats verifiable prompts as anchors and uses a stop-gradient target to regularize only the open-ended variants toward the anchor performance. AIR is implemented as a plug-in auxiliary loss and combined with group-based preference optimization (e.g., GRPO) via heterogeneous prompt grouping. Across Safety, Moral Reasoning, and Math, AIR improves context invariance, boosting in-distribution group accuracy by 12.71% and out-of-distribution consistency by 33.49%, making safety constraints robust to adversarial framings.

[NLP-36] ArPoMeme: An Annotated Arabic Multimodal Dataset for Political Ideology and Polarization LREC2026

【速读】: 该论文试图解决阿拉伯世界政治迷因(meme)在多模态和意识形态维度上缺乏系统性数据资源的问题,从而阻碍了对其政治传播机制的深入分析。解决方案的关键在于构建一个大规模、标注精细的阿拉伯语政治迷因数据集ArPoMeme(约7300条),其分类基于内容生产者(Facebook页面和群组)的自我识别,确保意识形态标签的可靠性;同时设计了一套半自动化数据采集流程(结合Playwright爬取与Google Drive同步),并利用Qwen2.5-VL-7B视觉语言模型提取文本信息,再通过自定义Streamlit界面进行人工校验与三维度标注(“我们 vs. 他们”框架、对外群体敌意、号召行动),实现可视化内容、文本信息与意识形态倾向的精准关联。定量分析进一步揭示了不同意识形态群体在对抗性叙事上的显著不对称性,为研究阿拉伯语政治话语、多模态意识形态检测及极化动态提供了可复现且公开可用的资源。

链接: https://arxiv.org/abs/2605.20967
作者: Wajdi Zaghouani,Kais Attia,Md. Rafiul Biswas,Fadhl Eryani
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at LREC 2026 Main Conference

点击查看摘要

Abstract:Memes have become a prominent medium of political communication in the Arab world, reflecting how humor, imagery, and text interact to express ideological and cultural positions. Despite the centrality of memes to online political discourse, there is a lack of systematically curated resources for analyzing their multimodal and ideological dimensions in Arabic. This paper presents ArPoMeme, a large-scale dataset of approximately 7,300 Arabic political memes categorized by ideological orientation, including Leftist, Islamist, Pan-Arabist, and Satirical perspectives. The dataset captures the diversity of Arabic meme ecosystems by grounding classification in the self-identification of public Facebook pages and groups that produce and disseminate these memes. To ensure both scale and accuracy, we designed a semi-automated data collection pipeline combining Playwright-based Facebook scraping with Google Drive synchronization, followed by text extraction using the Qwen2.5-VL-7B vision language model. The extracted text was manually verified and annotated for three polarization dimensions: Us vs. Them framing, Hostility toward out-groups, and Calls to action. Annotation was conducted through a custom Streamlit-based interface supporting distributed labeling, real-time tracking, and version control. The resulting dataset links visual content, textual messages, and ideological orientation, enabling fine-grained analysis of political antagonism, mobilization, and humor. Quantitative analysis of the annotated corpus reveals strong asymmetries in antagonistic framing across ideological groups, with Islamist and satirical memes exhibiting the highest levels of hostility and mobilization cues. The dataset and the annotation tool offers a reproducible and publicly available resource for studying Arabic political discourse, multimodal ideology detection, and polarization dynamics.

[NLP-37] JobArabi: An Arabic Corpus and Analysis of Job Announcements from Social Media LREC2026

【速读】: 该论文试图解决阿拉伯语招聘语言在社交媒体上的表达模式及其社会语言学特征缺乏系统性研究的问题。解决方案的关键在于构建一个大规模、多源的阿拉伯语职位公告语料库(JobArabi),其通过基于语言学知识的查询框架(涵盖21个阿拉伯语关键词族,包括性别化、复数形式、正式与方言表达)从X平台(原Twitter)收集了超过两年(2024年1月至2025年10月)的20,528条公开帖子,并附带时间戳、互动指标和地理信息等元数据,从而支持对就业话语的时间演化和区域差异进行量化分析。这一方法使研究能够揭示在线招聘中性别化语言的持续存在、职业需求的地域差异以及招聘信息的情感建构模式,为阿拉伯语自然语言处理(NLP)、计算社会科学和数字劳动研究提供关键数据资源。

链接: https://arxiv.org/abs/2605.20960
作者: Wajdi Zaghouani,Shimaa Amer Ibrahim,Mabrouka Bessghaier,Houda Bouamor
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at LREC 2026 Main Conference

点击查看摘要

Abstract:This paper introduces JobArabi, a large-scale corpus of Arabic job announcements collected from social media between January 2024 and October 2025. The dataset contains 20,528 public posts from X and captures more than two years of employment-related discourse across Arabic-speaking online communities. The corpus was compiled using a linguistically informed query framework covering 21 Arabic keyword families that reflect gendered, plural, formal, and dialectal expressions of recruitment language. The resulting dataset includes posts from institutional, commercial, and individual accounts and provides metadata such as timestamps, engagement indicators, and geolocation when available, enabling temporal and regional analysis of employment discourse. Quantitative analysis reveals several sociolinguistic patterns in online recruitment, including the persistence of gendered hiring language, regional variation in occupational demand, and the emotional framing of recruitment messages. These findings highlight the potential of Arabic social media as a resource for studying labor market communication and linguistic change. The JobArabi corpus, together with documentation and collection scripts, will be released to support research in Arabic NLP, computational social science, and digital labor studies.

[NLP-38] Memory Grafting: Scaling Language Model Pre-training via Offline Conditional Memory

【速读】: 该论文试图解决的问题是:如何在不显著增加训练和推理开销的前提下,有效扩展语言模型的条件记忆(conditional memory)容量。现有方法如Engram需在预训练阶段从头学习大型记忆表,导致成本高且效果不稳定。其解决方案的关键在于提出“记忆嫁接”(Memory Grafting)机制——利用一个冻结的“嫁接模型”(grafting model)离线生成并存储局部n-gram对应的最终token隐藏状态作为记忆值,接收模型通过精确最长匹配后缀查找实现高效检索;检索到的记忆经轻量级投影与门控机制适配,并辅以基于哈希的Engram回退策略确保覆盖未命中上下文。由于嫁接模型仅需离线运行,且精确查找具有相对于记忆库大小期望O(1)的时间复杂度,该方法实现了外部潜在容量的低成本扩展,在多个规模的语言模型设置中均优于MoE和vanilla Engram基线,验证了预训练模型可作为可复用的外部隐式记忆构建器,为突破仅依赖可训练参数的模型扩展瓶颈提供了实用路径。

链接: https://arxiv.org/abs/2605.20948
作者: Runxi Cheng,Yuchen Guan,Yongxian Wei,Qianpu Sun,Qixiu Li,Sinan Du,Feng Xiong,Chun Yuan,Yan Lu,Yeyun Gong
机构: Tsinghua University; Microsoft Research Asia
类目: Computation and Language (cs.CL)
备注: 25 pages, 12 figures, 5 tables

点击查看摘要

Abstract:Scaling conditional memory offers a promising way to increase language-model capacity, but existing methods such as Engram learn large memory tables from scratch during pre-training, making memory scaling expensive and sometimes ineffective. We propose Memory Grafting, a conditional memory scaling method that utilizes frozen hidden states from a grafting model as conditional n-gram memory. Given frequent local n-grams, we run the grafting model offline, store final-token hidden representations as memory values, and let the recipient model retrieve them through exact longest-match suffix lookup. Retrieved memories are adapted by lightweight projections and gates, while a hash-based Engram fallback preserves coverage for unmatched contexts. Since the grafting model is only run offline and exact lookup has expected O(1) complexity with respect to memory-bank size, Memory Grafting expands external latent capacity with limited training and inference overhead. Experiments under matched recipient architectures and pre-training budgets show that Memory Grafting improves over both MoE and vanilla Engram baselines. In the 2.8B-scale setting, it improves the average benchmark score from 51.95 for MoE and 52.43 for vanilla Engram to 53.86. In the 0.92B-scale setting, all grafting-model variants improve over the baselines, with Qwen3.5-35B-A3B giving the strongest gains. These results suggest that pretrained models can serve as reusable constructors of external latent memory, providing a practical step toward scaling future language models beyond trainable parameters alone.

[NLP-39] hinking-while-speaking: A Controlled Interleaved Reasoning Method for Real-Time Speech Generation

【速读】: 该论文试图解决的问题是:如何在保持自然流畅口语的同时,使人工智能具备深度推理能力,从而实现更类人的对话交互。其解决方案的关键在于提出一种名为InterRS的新方法,该方法通过仅在自然语音生成过程中插入推理步骤(即“边说边想”),避免打断语音流的连续性;同时构建了一个新颖的数据生成流水线,以确保推理与语音内容精确对齐且长度比例可控,并采用两种强化学习奖励机制——TA-Balance Reward用于调节推理与回答的时间和比例平衡,Linguistic Quality Reward用于提升语言表达质量。实验表明,该方法在数学与逻辑基准测试中性能提升13%,并能像口语指令模型一样即时生成链式思维(CoT)响应,同时相比先前方法更具自然性和流畅性。

链接: https://arxiv.org/abs/2605.20946
作者: Xuan Du,Qiangyu Yan,Wenshuo Li,Borui Jiang,Changming Xiao,Han Shu,Xinghao Chen
机构: Huawei Technologies
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The thinking-while-speaking paradigm aims to make AI communication more human. A key challenge is maintaining fluent speech while performing deep reasoning. Our method, InterRS, tackles this by inserting reasoning steps only during natural speech generation. This requires high-quality data where reasoning and speech are precisely aligned, and the length ratio are under controlled. We introduce a novel pipeline to generate such seamlessly interleaved audio data. To train our model, we combine interleaved SFT with refined data and reinforcement learning with two new rewards: a TA-Balance Reward to manage timing and thinking-answer ratio, and a Linguistic Quality Reward to refine expression. Experiments show our approach achieves 13% better performance on mathmatical and logic benchmarks while generating instant response like a spoken-language instruct model which outputs fast CoT response. Furthermore, our method generates more natural and fluent answers than prior methods.

[NLP-40] DASH: Fast Differentiable Architecture Search for Hybrid Attention in Minutes on a Single GPU

【速读】: 该论文试图解决的问题是:如何高效地设计混合注意力架构(hybrid attention architecture),在保持大语言模型(LLM)性能的同时显著提升推理效率。现有方法依赖人工经验规则或基于代理信号的层间算子分配策略,而自动化搜索方法如Jet-Nemotron虽具潜力,但其搜索成本过高(单次需2000亿token),难以作为常规设计工具。解决方案的关键在于提出DASH——一种快速可微分搜索框架,其核心创新包括:将离散的层级注意力算子分配松弛为连续的架构logits、预构建可复用的教师对齐线性候选集,并在冻结模型与算子权重的前提下仅对架构进行搜索,从而极大提升搜索效率。实验表明,DASH在Qwen2.5-3B-Instruct上优于多种基线方法,且在RULER指标上超越Jet-Nemotron模型,同时每次搜索仅消耗1230万token(约20分钟,单张RTX Pro 6000 GPU),仅为Jet-Nemotron的0.006%,证明高质量混合架构可在分钟级完成搜索,为高效架构设计提供了新范式。

链接: https://arxiv.org/abs/2605.20936
作者: Weizhe Chen,Miao Zhang,Junpeng Jiang,Yaping Li,Weili Guan,Liqiang Nie
机构: Harbin Institute of Technology (Shenzhen)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 19 pages, 7 figures

点击查看摘要

Abstract:Hybrid attention architectures are becoming an increasingly important paradigm for improving LLM inference efficiency while preserving model quality, making hybrid architecture design a central problem. Existing designs often rely on manual empirical rules or proxy-based selector signals for layer-wise operator allocation. Recent NAS-style systems such as Jet-Nemotron demonstrate the promise of automated hybrid architecture search. However, Jet-Nemotron’s PostNAS search stages alone use 200B tokens, making such search pipelines difficult to use as routine methods for hybrid architecture design. We introduce DASH, a fast differentiable search framework for hybrid attention architecture design, which relaxes discrete layer-wise attention operator placement into continuous architecture logits, prepares reusable teacher-aligned linear candidates, and performs architecture-only search with model and operator weights frozen to significantly enhance search efficiency. On Qwen2.5-3B-Instruct, DASH consistently outperforms a comprehensive suite of existing selector-style hybrid attention design baselines, showing that direct differentiable search can discover stronger hybrid architectures. Moreover, DASH achieves stronger RULER performance than released Jet-Nemotron models while remaining competitive on overlapping short-context and general benchmarks. Notably, each DASH search run uses only 12.3M tokens and takes about 20 minutes on a single RTX Pro 6000 GPU, corresponding to merely 0.006% of the PostNAS search tokens reported by Jet-Nemotron. These results suggest that high-quality hybrid attention architectures can be obtained through minutes-level differentiable search, providing a promising direction for hybrid architecture design.

[NLP-41] Strategy-Induct: Task-Level Strategy Induction for Instruction Generation ACL2026

【速读】: 该论文试图解决的问题是:如何在缺乏标注答案的情况下,设计有效的任务级提示(task-level prompts)以提升大语言模型(LLM)的性能。现有方法通常依赖于输入-输出对来推导指令,但获取标注答案往往成本高昂或难以获得。解决方案的关键在于提出一种名为Strategy-Induct的框架,该框架仅使用少量示例问题(无需标注答案),通过让模型生成每个问题对应的显式推理策略,构建(策略,问题)对,进而从中归纳出能够指导推理的任务指令。实验表明,Strategy-Induct在仅使用问题的设置下优于当前最优方法;此外,研究还发现,在任务指令生成和推理阶段联合使用LLM与大型推理模型(Large Reasoning Models)可进一步提升性能。

链接: https://arxiv.org/abs/2605.20924
作者: Po-Chun Chen,Hen-Hsen Huang,Hsin-Hsi Chen
机构: National Taiwan University(台湾大学); Academia Sinica(中央研究院); AI Research Center (AINTU), National Taiwan University(人工智能研究中心,台湾大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to Findings of ACL 2026

点击查看摘要

Abstract:Designing effective task-level prompts is crucial for improving the performance of Large Language Models (LLMs). While prior work on instruction induction demonstrates that LLMs can infer better instructions with limited examples, existing approaches often rely on input-output pairs, where obtaining labeled answers can be difficult or costly. To address this limitation, we propose Strategy-Induct, a framework that derives task-level instructions solely from a small set of example questions without requiring labeled answers. Our approach first prompts the model to generate explicit reasoning strategies for each question, forming (strategy, question) pairs. These pairs are then used to induce a task instruction that guides reasoning. Experiments across multiple tasks and model scales demonstrate that Strategy-Induct outperforms state-of-the-art methods in question-only settings. Furthermore, we observe that jointly utilizing LLMs and Large Reasoning Models across task instruction generation and inference may lead to further performance improvements.

[NLP-42] Evaluating Speech Articulation Synthesis with Articulatory Phoneme Recognition

【速读】: 该论文试图解决的问题是:如何更有效地评估语音发声合成(articulatory speech synthesis)的质量,尤其是在缺乏通用客观指标的情况下。传统方法依赖于点对点距离等声学指标,难以捕捉发音部位(place of articulation)等关键语音细节,且评估过程受主观因素影响较大。解决方案的关键在于提出一种基于音素识别(phoneme recognition)的代理评估方法——通过训练神经网络使用来自单说话者RT-MRI数据集的声学与发音特征进行建模,并在测试时比较不同合成发音特征下的音素识别性能。实验表明,所提出的发音特征集具有丰富的音素区分能力,能够揭示语音合成中更细微的发音差异,从而为生成式语音合成模型提供更具语义意义的评估维度。

链接: https://arxiv.org/abs/2605.20920
作者: Vinicius Ribeiro,Yves Laprie
机构: Université de Lorraine (洛林大学)
类目: Computation and Language (cs.CL); Sound (cs.SD)
备注: Accepted for publication at the European Signal Processing Conference (EUSIPCO), 2026

点击查看摘要

Abstract:Recent advances in machine learning and the availability of articulatory datasets allow vocal tract synthesis to be conditioned on phonetic sequences, a primary task of articulatory speech synthesis. However, quality assessment needs a better definition. Generally, ranking generative models is tricky due to subjectivity. However, articulatory synthesis has the additional difficulty of requiring specialized knowledge in vocal tract anatomy and acoustics. To address this problem, this paper proposes to evaluate speech articulation synthesis using phoneme recognition as a proxy. Our hypothesis is that phoneme recognition using articulatory features better captures nuances in phoneme production, such as correct places of articulation, which traditional metrics (e.g., point-wise distance metrics) do not. We train a neural network with acoustic and articulatory features extracted from a single-speaker RT-MRI dataset. Then, we compare the recognition performance when testing the model with different synthetic articulatory features. Our results show that our articulatory feature set is phonetically rich and helps exploring additional dimensions on speech articulation synthesis. Comments: Accepted for publication at the European Signal Processing Conference (EUSIPCO), 2026 Subjects: Computation and Language (cs.CL); Sound (cs.SD) Cite as: arXiv:2605.20920 [cs.CL] (or arXiv:2605.20920v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2605.20920 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[NLP-43] ask-Routed Mixture-of-Experts with Cognitive Appraisal for Implicit Sentiment Analysis

【速读】: 该论文旨在解决隐式情感分析(implicit sentiment analysis)中因缺乏显式情感词而难以准确推断情感极性的问题。现有模型通常仅依赖最终的情感标签进行训练,导致对上下文推理的指导信息不足。为此,作者基于认知评估理论(cognitive appraisal theory),提出一种评估感知的多任务学习(appraisal-aware multi-task learning, MTL)框架,通过两个互补的辅助任务——隐式情感检测和认知理由生成——增强主任务的情感极性预测能力。为减少多任务间因共享单一骨干网络而导致的任务干扰(task interference),该方法引入任务级专家混合模型(task-level mixture-of-experts),其中所有任务共享一组专家,任务身份控制专家的稀疏组合,从而提升灵活性与解耦性。具体实现上,模型采用编码器-解码器架构,并将部分编码器和解码器块替换为稀疏混合专家结构;同时设计任务条件路由机制(task-conditioned router)与任务分离路由目标(task-separated routing objective),促使不同任务学习差异化的专家选择模式。实验表明,该方法在隐式情感子集上显著优于现有主流方法。

链接: https://arxiv.org/abs/2605.20916
作者: Yaping Chai,Haoran Xie,Joe S. Qin
机构: Lingnan University (岭南大学)
类目: Computation and Language (cs.CL)
备注: 8 pages, 4 figures, and 3 tables

点击查看摘要

Abstract:Implicit sentiment analysis is challenging because sentiment toward an aspect is often inferred from events rather than expressed through explicit opinion words. Existing models typically learn from the final polarity label, which provides limited guidance for reasoning about sentiment from the context. Motivated by cognitive appraisal theory, we propose an appraisal-aware multi-task learning (MTL) framework for implicit sentiment analysis that provides polarity prediction with two complementary auxiliary tasks: implicit sentiment detection and cognitive rationale generation. However, training several objectives with different targets and sharing a single backbone across tasks in MTL limits flexibility and can lead to task interference. To reduce interference among these related but distinct objectives, we adopt task-level mixture-of-experts models in which all tasks share a common set of experts, and task identity controls the sparse combination of these experts. Our method builds on an encoder-decoder architecture and replaces a subset of encoder and decoder blocks with these sparse mixtures. We use a task-conditioned router to select sparse expert mixtures for each task, and a task-separated routing objective to encourage different tasks to learn distinct expert-selection patterns. Experimental results show that our model outperforms recently proposed approaches, with strong gains on the implicit sentiment subset. Our code is available at this https URL.

[NLP-44] Calibration vs Decision Making: Revisiting the Reliability Paradox in Unlearned Language Models ACL2026

【速读】: 该论文试图解决的问题是:在机器遗忘(machine unlearning)过程中,如何确保模型在移除特定训练数据影响的同时,仍能保持可靠的预测行为和不确定性估计。现有方法常以校准性(calibration)作为可靠性的代理指标,但研究表明,即使模型校准良好,也可能依赖于虚假相关性(spurious correlations)做出决策,从而导致决策规则不可靠。解决方案的关键在于通过多维度评估机制揭示这种“可靠性悖论”——即使用校准误差(如ECE、MCE、Brier分数)衡量概率可靠性,并结合基于梯度的归因方法(Integrated Gradients)与局部互信息(Local Mutual Information)检测决策中的快捷方式(shortcut-based tokens)。实验结果表明,经过遗忘后的模型虽保持低校准误差,却表现出对相关性特征的更强依赖,说明校准性不能充分反映决策规则的可靠性,从而将可靠性悖论扩展至机器遗忘场景。

链接: https://arxiv.org/abs/2605.20915
作者: Divyaksh Shukla,Ashutosh Modi
机构: Indian Institute of Technology Kanpur (IIT Kanpur)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at SRW, ACL 2026; 17 pages (9 + 2 + 6)

点击查看摘要

Abstract:Machine unlearning aims to remove the influence of specific training data from a model while preserving reliable behavior on the remaining data, making reliable prediction and uncertainty estimation essential for evaluation. Calibration is commonly used as a proxy for reliability in language models, but low calibration error does not necessarily imply reliable decision rules, as models may rely on spurious correlations while remaining well calibrated. We investigate this gap in generative language models using the multiple-choice question-answering evaluation protocol on the TOFU benchmark, measuring probabilistic reliability with calibration metrics (ECE, MCE, Brier) and decision-rule reliability via attribution-based shortcut detection with Integrated Gradients and Local Mutual Information. We find that fine-tuned models achieve low calibration error (ECE ~ 0.04) compared to pretrained models (ECE 0.5), and models after unlearning retain similarly low calibration despite reduced accuracy on the forget split, while attribution analysis shows increased reliance on correlation-based tokens. These results demonstrate that good calibration can coexist with shortcut-based decision rules after unlearning, extending the reliability paradox to the machine unlearning setting.

[NLP-45] Enhancing Scientific Discourse: Machine Translation for the Scientific Domain

【速读】: 该论文旨在解决科学文献跨语言传播中因专业术语复杂性和句式结构严谨性所带来的机器翻译(Machine Translation, MT)质量低下问题。其解决方案的关键在于构建高质量的平行语料库和单语语料库,覆盖西班牙语-英语、法语-英语和葡萄牙语-英语三种语言对,并针对癌症研究、能源研究、神经科学和交通运输研究四个具体科学领域进行精细化划分,从而支持领域自适应的神经机器翻译(Neural Machine Translation, NMT)模型微调。通过在这些定制化语料上训练和评估NMT系统,论文验证了专用语料对提升科学文本翻译准确性的有效性。

链接: https://arxiv.org/abs/2605.20912
作者: Dimitris Roussis,Sokratis Sofianopoulos,Stelios Piperidis
机构: Institute for Speech and Language Processing, Athena RC (希腊语音与语言处理研究所,阿斯纳研究中心)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The increasing volume of scientific research necessitates effective communication across language barriers. Machine translation (MT) offers a promising solution for accessing international publications. However, the scientific domain presents unique challenges due to its specialized vocabulary and complex sentence structures. In this paper, we present the development of a collection of parallel and monolingual corpora for the scientific domain. The corpora target the language pairs Spanish-English, French-English, and Portuguese-English. For each language pair, we create a large general scientific corpus as well as four smaller corpora focused on the domains of: Cancer Research, Energy Research, Neuroscience, and Transportation research. To evaluate the quality of these corpora, we utilize them for fine-tuning general-purpose neural machine translation (NMT) systems. We provide details regarding the corpus creation process, the fine-tuning strategies employed, and we conclude with the evaluation results.

[NLP-46] rminal-World: Scaling Terminal-Agent Environments via Agent Skills

【速读】: 该论文旨在解决终端代理(Terminal Agent)在训练过程中因高质量数据稀缺而导致性能受限的问题。现有方法依赖于人类定义的种子或GitHub代码库等局部来源来构建任务,导致生成的任务分布狭窄、环境语义与任务不匹配,且探索轨迹缺乏引导,效率低下。其解决方案的关键在于提出一种全自动的合成管道——Terminal-World,该管道以“技能”(Skill)为核心合成单元,将任务目标、执行条件(前置条件与环境状态)和操作方式统一编码,从而实现任务指令、环境和教师轨迹的协同生成。为进一步拓展合成空间,Terminal-World还引入技能团队(Skill Teams)与技能图(Skill Graph),支持多角色协作和跨领域任务合成。实验表明,基于该框架构建的Terminal-World-32B模型,在仅使用1.2%训练数据的情况下,便超越了Nemotron-Terminal-32B模型在Terminal-Bench 2.0上的表现(Pass@1提升+4.5至31.5,Pass@3达43.8)。

链接: https://arxiv.org/abs/2605.20876
作者: Zihao Cheng,Hongru Wang,Zeming Liu,Xinyi Wang,Xiangrong Zhu,Yuhang Guo,Wei Lin,Jeff Z. Pan,Yunhong Wang
机构: Beihang University (北京航空航天大学); Independent Researcher (独立研究者); Beijing Institute of Technology (北京理工大学); University of Edinburgh (爱丁堡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Work in Progress

点击查看摘要

Abstract:Terminal agents extend Large Language Models with the ability to execute tasks directly in command-line environments, but their progress is bottlenecked by the scarcity of high-quality training data. Existing approaches bootstrap from partial sources such as human-defined seeds or GitHub repositories to instantiate one component and then complete the rest, producing tasks confined to narrow seed distributions, environments misaligned with task semantics, and inefficient trajectories from unguided exploration. To address these limitations, we introduce Terminal-World, a fully automated pipeline that uses agent skills as the central synthesis primitive, which jointly encode what to accomplish, when to apply (preconditions and environment state), and how to execute, enabling task instructions, environments, and teacher trajectories to be co-derived. To further broaden the synthesis space, Terminal-World composes skills into skill teams and skill graphs for multi-role and cross-domain task synthesis. Using this pipeline, we construct 5,723 training environments and train Terminal-World-8B/14B/32B, evaluated across 6 benchmarks where the Terminal-World series consistently outperforms terminal-agent baselines. Notably, using the same teacher model and only 1.2% of the training data, Terminal-World-32B surpasses Nemotron-Terminal-32B on Terminal-Bench 2.0 by +4.5 Pass@1 (31.5) and achieves 43.8 Pass@3.

[NLP-47] MemGym: a Long-Horizon Memory Environment for LLM Agents

【速读】: 该论文试图解决的问题是:当前主流的内存评估基准主要聚焦于多轮对话中个性化信息的保留能力,忽略了在长期代理任务执行过程中动态形成的记忆机制,导致现有内存系统在真实代理环境(如编程和网页导航)中迁移效果不佳。解决方案的关键在于提出MemGym——一个统一现有代理测试平台与自研记忆锚定流水线的代理记忆评估基准,其核心创新包括:(1)涵盖五类评估赛道、四种代理场景(工具使用对话、深度研究搜索、编程、计算机操作),实现对代理记忆的全面刻画;(2)引入记忆隔离评分机制,将记忆性能从推理、检索和工具使用能力中解耦,从而公平评估不同记忆策略;(3)构建长度可控、阶段可消融验证且高度贴近下游任务的合成数据管道(MEMGYM-CODEQA 和 MEMGYM-DR);(4)开发轻量级奖励模型MemRM(基于Qwen3-1.7B微调的QLoRA模型),以快速量化代码压缩质量,替代耗时的Docker全量回放,提升编程环境评估效率。

链接: https://arxiv.org/abs/2605.20833
作者: Wujiang Xu,Yu Wang,Kai Mei,Kaiqu Liang,Zhenting Wang,Mingyu Jin,Han Zhang,Shi-Xiong Zhang,Wenyue Hua,Sambit Sahu,Dimitris N. Metaxas
机构: Rutgers University (罗格斯大学); Capital One (资本one); Princeton University (普林斯顿大学); Microsoft Research (微软研究院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Memory is a central capability for LLM agents operating across long-horizon tasks. Existing memory benchmarks predominantly evaluate retention of personalized information in multi-turn chat scenarios, overlooking the dynamic memory formation that occurs during extended agent execution. Consequently, the memory systems they produce transfer poorly to realistic agentic environments, such as coding and web navigation. We present MemGym, a benchmark for agentic memory that unifies existing agent gyms and in-house memory-grounded pipelines behind one memory-reasoning interface. MemGym spans five evaluation tracks grouped into four agentic regimes: tool-use dialogue (tau2-bench), multi-turn deep-research search (MEMGYM-DR), coding (SWE-Gym and MEMGYM-CODEQA), and computer use (WebArena-Infinity). MemGym reports memory-isolated scores that decouple memory performance from reasoning, retrieval, and tool-use ability, so memory strategies can be ranked without those confounders. Our synthetic pipelines for MEMGYM-CODEQA and MEMGYM-DR are length-controllable, ablation-verified at every stage, and tightly aligned with downstream scenarios. To make evaluation on coding environments academically tractable, we train MemRM, a lightweight reward model (Qwen3-1.7B fine-tuned with QLoRA) that scores compression quality as a fast scalar read in place of full Docker rollouts.

[NLP-48] PulseCol: Periodically Refreshed Column-Sparse Attention for Accelerating Diffusion Language Models

【速读】: 该论文旨在解决扩散大语言模型(dLLMs)在推理过程中计算成本高昂的问题,尤其是在无键值缓存(KV cache)情况下,每步去噪过程都需重复执行完整的自注意力机制。现有稀疏注意力方法虽通过块级稀疏计算缓解了部分开销,但仅适用于后期迭代且效率提升有限。其解决方案的关键在于提出一种细粒度的列稀疏注意力机制(PulseCol),将粗粒度的块级稀疏结构替换为更精确的列级稀疏结构,从而在保留重要注意力交互的同时实现更高程度的稀疏性;此外,PulseCol在早期去噪步骤中识别并复用稀疏模式,在少量中间步骤刷新以追踪注意力模式演化,显著提升了计算效率。实验表明,PulseCol在保持模型质量的前提下实现了比现有方法更高的稀疏率和实际加速比,并借助优化的GPU核函数实现了最高达1.95倍的端到端速度提升。

链接: https://arxiv.org/abs/2605.20813
作者: Yanyi Lyu,Letian Chen,Futing Sun,Miao Zhang,Weili Guan,Liqiang Nie
机构: Harbin Institute of Technology (Shenzhen)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Inference in diffusion large language models (dLLMs) is computationally expensive, as full self-attention must be repeatedly executed at each step of the denoising process without KV cache. Recent sparse attention methods for dLLMs mitigate this cost via block-sparse computation, which is applied only in later iterations when model performance is less sensitive to coarse-grained sparse approximation, but yields limited improvements in computational efficiency and acceleration. This motivates a finer-grained sparsification strategy that can be applied from earlier iterations and leverages reusable sparsity patterns, enabling further efficiency gains. In this work, we introduce PulseCol, a periodically refreshed column-sparse attention method for accelerating diffusion language models. PulseCol replaces coarse block-level sparsity with a finer-grained column-sparse structure, allowing important attention interactions to be retained more precisely while exposing greater sparsity. Built on this column-level formulation, PulseCol further identifies sparse patterns at the early denoising step and reuses them across subsequent iterations, refreshing them only at a small number of intermediate steps to track the evolution of sparse attention patterns during denoising. Experiments show that PulseCol achieves higher sparsity and greater practical speedup than prior sparse attention methods for dLLMs, while maintaining model quality. Enabled by optimized GPU kernels for column-sparse attention, PulseCol delivers up to 1.95 \times end-to-end speedup over FlashAttention across several context lengths.

[NLP-49] Refining and Reusing Annotation Guidelines for LLM Annotation ACL2026

【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在零样本标注任务中难以遵循金标准基准数据集专业规范的问题。解决方案的关键在于引入一种迭代式审核框架(iterative moderation framework),通过系统性地复用与优化标注指南(annotation guidelines),模拟标注项目初期的流程,从而提升模型输出与专业标注标准的一致性。该框架验证了三个假设:(1)指南整合的有效性,(2)推理优化模型的优势,以及(3)在极少监督下实施审核的可行性。实验结果表明,该方法在生物医学命名实体识别(NER)任务中显著提升了标注质量,但仍有改进空间。

链接: https://arxiv.org/abs/2605.20809
作者: Kon Woo Kim,Jin-Dong Kim,Akiko Aizawa
机构: The Graduate University for Advanced Studies, SOKENDAI; National Institute of Informatics (NII); BioData Science Initiative (BSI), National Institute of Genetics (NIG)
类目: Computation and Language (cs.CL)
备注: 14 pages, 7 figures. Accepted to the ACL 2026 Main Conference

点击查看摘要

Abstract:While Large Language Models (LLMs) demonstrate remarkable performance on zero-shot annotation tasks, they often struggle with the specialized conventions of gold-standard benchmarks. We propose the systematic reuse and refinement of annotation guidelines as an alignment mechanism, introducing an iterative moderation framework that simulates the early phases of annotation projects. We evaluate three hypotheses: (1) the efficacy of guideline integration, (2) the advantage of reasoning optimized models, and (3) the viability of moderation under minimal supervision. Testing across biomedical NER tasks (NCBI Disease, BC5CDR, BioRED) with three LLM families (GPT, Gemini, DeepSeek), our results empirically confirm all three hypotheses. While the iterative moderation framework shows good potential in effectively refining guidelines, our analysis also reveals substantial room for improvement.

[NLP-50] Most Transformer Modifications Still Do Not Transfer at 1-3B: A 2020-2026 Update to Narang et al. (2021) with Downstream Evaluation and a Noise Floor EMNLP2026

【速读】: 该论文试图解决的问题是:在当前主流模型规模(1-3B参数)和评估范式(以下游任务性能为主)下,近年来提出的20种Transformer结构改进方法是否真正有效,以及如何科学地评估这些改进的实际价值。解决方案的关键在于采用严格控制变量的实验设计——包括固定数据量(iso-data)、计算资源(iso-compute)、训练配方(iso-recipe),并引入多种子基准噪声水平(multi-seed baseline noise floor)和CLIMB-12下游评估指标作为核心评价标准。研究发现,大多数改进方法无法稳定提升下游性能,仅有少数在1.2B规模下通过Bonferroni校正验证为显著有效,且其中一种在3B规模下训练不稳定;此外,注意力输出相关修改虽能保持接近基线的验证损失,但下游性能显著下降,说明仅依赖预训练损失(如困惑度)会严重高估改进效果。因此,作者强调:噪声基线报告、下游任务评估和跨规模稳定性测试已成为1-3B参数级别架构比较的必要前提。

链接: https://arxiv.org/abs/2605.20798
作者: Yang Zhao,Jiahao Lu,Bin Huang,Guhua Zhang,Jie Zhou
机构: Tencent
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 19 pages, 3 figures, under review at EMNLP 2026

点击查看摘要

Abstract:Narang et al. (2021) evaluated 40+ Transformer modifications at T5-base scale and concluded that most did not transfer. Five years later, the typical working regime has moved to 1-3B parameters, downstream evaluation has replaced pretraining perplexity, and a substantially different catalogue of modifications has emerged. We revisit their question by testing 20 post-2021 Transformer modifications at 1.2B and 3B under strict iso-data, iso-compute, iso-recipe control, with a multi-seed baseline noise floor and CLIMB-12 downstream evaluation as the primary metric. The central finding reproduces theirs at this curated set: most modifications do not transfer. Of the 20 modifications, only two clear Bonferroni correction at 1.2B; one of those two further fails to train stably at 3B under the shared recipe. We also find that the loss-downstream gap reported by Tay et al. (2023) enlarges several-fold for attention-output modifications: two significant failures converge to within 2-3% of baseline validation loss yet drop 6-16 CLIMB-points. We conclude that noise-floor reporting, downstream evaluation, and cross-scale stability testing are now prerequisites for architecture comparisons at 1-3B.

[NLP-51] Assessing socio-economic climate impacts from text data

【速读】: 该论文试图解决的问题是:当前利用文本数据(如新闻、社交媒体和报告)进行气候灾害社会经济影响评估的研究方法存在碎片化现象,缺乏统一的定义标准、时间与空间偏差处理策略以及建模和后处理方法的选择指南,导致研究透明度和可比性不足。解决方案的关键在于系统梳理现有实践,识别文本作为数据方法在分析社会经济影响时的核心挑战,并提出具体建议以形成最佳实践框架,从而支持构建稳健、可靠的文本衍生社会经济影响数据集,提升其在灾害风险管理与归因研究中的应用价值。

链接: https://arxiv.org/abs/2605.20793
作者: Mariana Madruga de Brito,Brielen Madureira,Taís Maria Nunes Carvalho,Damien Delforge,Aglaé Jézéquel,Murathan Kurfalı,Ni Li,Gabriele Messori,Joakim Nivre,Barbara Pernici,Niko Speybroeck,Stefano Terzi,Wim Thiery,Bram Valkenborg,Jingxian Wang,Shorouq Zahra,Jakob Zscheischler,Jan Sodoge
机构: 未知
类目: Computation and Language (cs.CL)
备注: Work in progress

点击查看摘要

Abstract:Recent advances in natural language processing (NLP) and large language models (LLMs) have enabled the systematic use of large-scale textual data from news, social media, and reports to create datasets with socio-economic impacts of climate hazards such as floods, droughts, storms, and multi-hazard events. As the field of text-as-data for impact assessment expands, so does its methodological complexity. Yet research remains fragmented, with no clear guidelines for defining what constitutes an impact, handling temporal and spatial biases, and selecting appropriate modeling and post-processing strategies. This lack of coherence limits transparency and comparability across studies. Here, we address this gap by synthesising common practices, describing key challenges specific to the use of text-as-data methods for analyzing socio-economic impact data, and proposing recommendations to address them. By providing guidance on best practices, we aim to support the construction of robust text-derived socio-economic impact datasets that can more accurately inform disaster risk management and attribution studies.

[NLP-52] Building Arabic NLP from the Ground Up: Twenty Years of Lessons Failures and Open Problems ACL2026

【速读】: 该论文旨在解决阿拉伯语(Arabic)在自然语言处理(NLP)领域长期资源匮乏的问题,尽管其使用者达数亿人。解决方案的关键在于反思过去二十年构建阿拉伯语NLP基础设施的经验,揭示出三个反直觉的洞见:数据集建设本质上是社会过程而非纯技术活动;围绕共享任务形成的社区往往比任务本身更具价值;从语言资源向计算社会科学转型时,暴露出传统NLP训练无法应对的新挑战。论文进一步指出三大失败案例,包括抑郁检测语料库未能进入临床实践、过度分散于多个共享任务导致深度不足,以及对现代标准阿拉伯语(Modern Standard Arabic)资源可直接迁移至方言任务的错误假设。这些经验表明,为边缘化语言社区开发NLP面临的最困难问题并非语言学层面,而是社会、制度和认识论层面的,亟需业界培养超越传统技术能力的跨学科素养。

链接: https://arxiv.org/abs/2605.20786
作者: Wajdi Zaghouani
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at the ACL 2026 Workshop : The Big Picture 2026: Crafting a Research Narrative v2

点击查看摘要

Abstract:This paper reflects on twenty years of building NLP resources and research infrastructure for Arabic, a language spoken by hundreds of millions yet historically underserved relative to languages such as English or Chinese. The first decade focused on foundational linguistic infrastructure; the second shifted toward computational social science, social media analysis, and socially oriented applications. Rather than cataloguing outputs, the paper examines what the experience of building them revealed. Three counterintuitive lessons emerge: building datasets is as much a social process as a technical one; communities formed around shared tasks often matter more than the tasks themselves; and moving from language resources to computational social science exposes challenges that traditional NLP training does not address. We discuss three failures: a depression detection corpus that never reached clinical practice, a period of spreading across too many shared tasks without sufficient depth, and a long-standing assumption that Modern Standard Arabic infrastructure would transfer cleanly to dialectal tasks. These experiences suggest that the hardest problems in developing NLP for underserved communities are not linguistic but social, institutional, and epistemic, and require competencies the field rarely teaches.

[NLP-53] he Illusion of Intervention: Your LLM -Simulated Experiment is an Observational Study

【速读】: 该论文试图解决的问题是:在使用大语言模型(Large Language Models, LLMs)作为人类行为模拟器进行干预实验时,由于LLMs主要基于观察数据训练,干预可能导致隐变量属性的非预期变化,从而引发用户漂移(user drift),即不同处理组下模拟用户的潜在分布发生偏移,进而扭曲因果效应估计。解决方案的关键在于:首先,通过引入负控制结果(negative control outcomes)——即理论上不应受干预影响的变量——来诊断是否存在因干预导致的分布偏移,从而识别用户漂移;其次,通过调整角色设定(persona specification)以纳入更多与场景相关的混杂因素(confounders),发现针对性地引入设置相关混杂变量可显著降低偏差,无论是在问卷式评估还是多轮交互式代理评估中均有效。

链接: https://arxiv.org/abs/2605.20767
作者: Victoria Lin,Taedong Yun,Maja Matarić,John Canny,Arthur Gretton,Alexander D’Amour
机构: Google DeepMind; Carnegie Mellon University
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:Large language models (LLMs) show potential as simulators of human behavior, offering a scalable way to study responses to interventions. However, because LLMs are trained largely on observational data, interventions in experiments with LLM-simulated synthetic users can induce unintended shifts in latent user attributes, causing user drift where the implicit simulated population differs across treatment conditions, potentially distorting effect estimates. We formalize the confounding or selection bias that can arise due to user drift and show how intervention-dependent shifts can inflate or attenuate observed differences in user responses under intervention. To diagnose confounding, we propose using negative control outcomes–attributes that should remain invariant under intervention–to identify distribution shifts across intervention conditions, providing evidence of user drift. To mitigate drift, we study adjusting the persona specification by eliciting additional confounders, finding that targeted, setting-relevant confounders can substantially reduce bias across survey-style and multi-turn agent evaluations.

[NLP-54] Findings of the Counter Turing Test: AI-Generated Text Detection AAAI2025

【速读】: 该论文旨在解决人工智能生成文本(AI-generated text)日益泛滥所引发的数字内容真实性与可信度问题,特别是如何有效区分人类撰写的文本与由大型语言模型(LLM)生成的内容。其核心挑战在于:随着GPT-4、Claude 3.5和Llama等先进生成模型产出高度自然且连贯的文本,传统检测方法面临失效风险,进而可能助长虚假信息传播、偏见叙事及安全威胁。解决方案的关键在于通过“反图灵测试”(Counter Turing Test, CT2)共享任务对当前最先进的检测技术进行全面评估,其中Task A(二分类)聚焦于识别文本来源(人类 vs AI),而Task B(模型归属)则进一步要求识别具体生成模型。实验结果显示,二分类任务表现优异(最佳F1达1.0000),但模型归属任务性能显著下降(最佳F1为0.9531),凸显了跨模型识别的复杂性。顶级方案主要依赖微调的Transformer架构(如DeBERTa和BART)、集成学习与混合检测策略,但仍暴露出在对抗鲁棒性、特征提取和跨领域泛化方面的不足,亟需深入研究以提升检测系统的可靠性与实用性。

链接: https://arxiv.org/abs/2605.20761
作者: Rajarshi Roy,Gurpreet Singh,Ashhar Aziz,Shashwat Bajpai,Nasrin Imanpour,Shwetangshu Biswas,Kapil Wanaskar,Parth Patwa,Subhankar Ghosh,Shreyas Dixit,Nilesh Ranjan Pal,Vipula Rawte,Ritvik Garimella,Amitava Das,Amit Sheth,Vasu Sharma,Aishwarya Naresh Reganti,Vinija Jain,Aman Chadha
机构: Kalyani Government Engineering College, India; IIIT Guwahati, India; IIIT Delhi, India; BITS Pilani Hyderabad Campus, India; AI Institute, University of South Carolina, USA; National Institute of Technology Silchar, India; San José State University, USA; University of California Los Angeles, USA; Washington State University, USA; Vishwakarma Institute of Information Technology, India; Meta AI, USA; Amazon AI, USA; BITS Pilani, Goa
类目: Computation and Language (cs.CL)
备注: Defactify4 @AAAI 2025

点击查看摘要

Abstract:The rapid proliferation of AI-generated text has introduced significant challenges in maintaining the integrity of digital content. Advanced generative models such as GPT-4, Claude 3.5, and Llama can produce highly coherent and human-like text, making it increasingly difficult to differentiate between human-written and AI-generated content. While these models have transformative applications, their misuse has raised concerns about misinformation, biased narratives, and security threats. This paper provides a comprehensive analysis of state-of-the-art AI-generated text detection techniques and evaluates their effectiveness through the Counter Turing Test (CT2) shared tasks. Task A (Binary Classification) required participants to distinguish between human-written and AI-generated text, while Task B (Model Attribution) focused on identifying the specific language model responsible for generating a given text. The results demonstrated high performance in binary classification, with the top system achieving an F1 score of 1.0000, but significantly lower scores in model attribution, where the best system achieved 0.9531, highlighting the increased complexity of this task. The top-performing teams leveraged fine-tuned transformer models, ensemble learning, and hybrid detection approaches, with DeBERTa-based and BART-based methods demonstrating strong results. However, the lower scores in Task B underscore the challenges of distinguishing outputs from different LLMs, necessitating further research into adversarial robustness, feature extraction, and cross-domain generalization. Comments: Defactify4 @AAAI 2025 Subjects: Computation and Language (cs.CL) Cite as: arXiv:2605.20761 [cs.CL] (or arXiv:2605.20761v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2605.20761 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[NLP-55] he Hidden Signal of Verifier Strictness: Controlling and Improving Step-Wise Verification via Selective Latent Steering

【速读】: 该论文试图解决生成式验证器(generative verifier)在逐步验证过程中存在严格度(verifier strictness)不可控的问题,即验证器可能过于宽松而忽略错误步骤,或过于严格而拒绝正确推理。解决方案的关键在于发现并利用验证过程中的特定隐藏状态信号:在逐步验证中,验证器对某一步骤的接受或拒绝倾向编码在对应验证段落的边界附近。基于此信号,作者提出VerifySteer方法,通过样本级路由机制选择性地干预段落边界处的隐藏状态,从而实现无需微调即可调节验证器严格度,并有效平衡错误检测与正确性认证之间的权衡。实验表明,VerifySteer在ProcessBench和Hard2Verify数据集上优于提示优化和激活引导基线方法,且推理计算量仅为自一致性(self-consistency)的1/4至1/7,同时可与验证微调互补,进一步提升性能。

链接: https://arxiv.org/abs/2605.20745
作者: Yefan Zhou,Yilun Zhou,Austin Xu,Soroush Vosoughi,Shafiq Joty,Jiang Gui
机构: Dartmouth College (达特茅斯学院); Datadog AI Research (Datadog人工智能研究); Salesforce AI Research (Salesforce人工智能研究)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Generative verifiers have emerged as a promising paradigm for step-wise verification, but their verification behavior is often poorly calibrated: they may be under-critical and miss erroneous steps, or over-critical and reject correct reasoning. We refer to this tendency to be overly lenient or overly critical as verifier strictness. In this work, we study whether verifier strictness can be controlled through hidden-state intervention. We uncover a verification-specific hidden-state signal: in step-wise verification, a verifier’s tendency to accept or reject a solution step is encoded near the boundary of the corresponding verification paragraph. Exploiting this signal, we show that hidden-state steering can directly modulate verifier strictness without fine-tuning. However, uniform steering induces a trade-off between error detection and correctness certification. To address this, we propose VerifySteer, which exploits latent correctness signals for sample-level routing and selectively intervenes on paragraph boundaries. Experiments on ProcessBench and Hard2Verify show that VerifySteer outperforms prompt optimization and activation steering baselines, and is competitive with self-consistency while requiring 4-7x less inference compute. VerifySteer is also complementary to verification fine-tuning, providing further gains on top of fine-tuned verifiers. The code is available at this https URL.

[NLP-56] Draw2Think: Harnessing Geometry Reasoning through Constraint Engine Interaction

【速读】: 该论文试图解决视觉-语言模型在几何问题求解中缺乏可验证性的问题:现有方法通过渲染像素或一次性脚本外部化中间状态,无法提供每一步操作的精确几何约束保障。解决方案的关键在于将几何推理从隐式的空间推断重构为与GeoGebra约束引擎的代理式交互,提出Draw2Think框架,其核心是通过“提议-绘制-验证”循环将假设外化到一个可执行画布上,利用约束引擎确保几何关系严格满足,并基于结构化观测反馈引导后续推理。这一机制使两个属性可分别审计:模型层面的构造保真度(canvas是否实现预期构型)和引擎层面的测量忠实度(从约束中获得的精确数值与关系)。实验表明,Draw2Think在GeoGoal数据集上达到95.9%的谓词级和84.0%的问题级构造检查通过率,在平面与立体基准测试中提升结果准确率最高达4.1%/16.4%,并在GenExam-math的渲染评估中取得68.2%/90.5%的严格/宽松得分。

链接: https://arxiv.org/abs/2605.20743
作者: Juncheng Hu,Jiawei Du,Xin Zhang,Joey Tianyi Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Vision-language models solve geometry problems with rising accuracy, yet their intermediate states remain latent and unverifiable: a relation expressed in textual reasoning or drawing code carries no guarantee that a constraint-satisfying configuration realizes it. We observe that existing externalization methods based on rendered pixels or one-shot scripts fail to provide exact, per-action geometric guarantees. Enforcing geometric relations by algebraic definition closes this gap: the workspace becomes a constraint-checked evolving canvas. We present Draw2Think, a framework that recasts geometric reasoning from latent spatial inference into agentic interaction with the GeoGebra constraint engine. In a Propose-Draw-Verify loop, Draw2Think externalizes hypotheses onto an executable canvas, measures exact geometric quantities, and feeds structured observations back to the model, so subsequent reasoning proceeds from checked canvas state grounded by the shared workspace. This externalization makes two properties separately auditable: model-level Construction Fidelity (whether the canvas realizes the intended configuration) and engine-level Measurement Faithfulness (exact values and relations from canvas constraints). Across construction, outcome, and rendering evaluations, Draw2Think builds canvases that pass 95.9% predicate-level and 84.0% strict problem-level construction checks on GeoGoal, improves outcome accuracy by up to 4.1%/16.4% on planar/solid benchmarks, and attains 68.2%/90.5% strict/relaxed rendering scores on GenExam-math. Project page is available at this https URL

[NLP-57] Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression

【速读】: 该论文试图解决大语言模型在回归任务中生成预测分布不准确的问题,即现有训练目标仅优化单个预测值(point estimates),而忽视了预测分布的校准性(calibration),从而限制了需要候选排序或不确定性估计的应用场景。解决方案的关键在于提出一种分布感知奖励(Distribution-Aware Reward),这是一种基于策略梯度的强化学习目标,其核心创新是将多个解码样本视为经验预测分布,并使用连续排名概率分数(Continuous Ranked Probability Score, CRPS)评估该分布质量;同时采用留一法信用分配机制(leave-one-out credit assignment),根据每条轨迹对整体分布质量的边际贡献进行奖励,从而鼓励模型输出既准确又适当分散的预测分布。实验表明,该方法在高斯混合任务、代码性能预测和分子属性预测等多类任务上均显著优于监督微调和点式强化学习基线,尤其在排名相关性(如KBSS上的Spearman提升6点)和不确定性诊断方面表现突出,且能缓解轨迹多样性崩溃问题,提升了语言模型回归任务的鲁棒性和校准能力。

链接: https://arxiv.org/abs/2605.20740
作者: Jungsoo Park,Hyungjoo Chae,Ethan Mendes,Jay DeYoung,Varsha Kishore,Wei Xu,Alan Ritter
机构: Georgia Institute of Technology (佐治亚理工学院); Allen Institute for AI (艾伦人工智能研究所)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 21 pages, 5 figures

点击查看摘要

Abstract:Large language models can predict real-valued quantities from heterogeneous inputs such as text, code, and molecular strings, but most training objectives score each decoded floating-point number independently, improving point estimates without ensuring calibrated predictive distributions. This limits applications requiring candidate ranking or uncertainty estimation. We introduce Distribution-Aware Reward, an on-policy reinforcement learning objective whose main contribution is to train language models to produce better predictive distributions for regression tasks, rather than only optimizing individual decoded outputs against scalar targets. Our method treats multiple decoded samples as an empirical predictive distribution, evaluates it with the Continuous Ranked Probability Score, and assigns leave-one-out credit based on each rollout’s marginal contribution to distribution quality, rewarding predictions that are both accurate and appropriately dispersed. We evaluate our method on a controlled Gaussian-mixture task, code performance prediction, and molecular property prediction from SMILES strings. Across tasks, our method improves over supervised fine-tuning and pointwise reinforcement learning baselines, with strong rank-correlation gains, including a 6-point Spearman improvement on KBSS. On MoleculeNet, it uses only SMILES strings yet remains competitive with strong graph-based and 3D molecular models. Further analyses show that our method mitigates rollout diversity collapse and improves uncertainty diagnostics, suggesting that directly optimizing predictive distributions makes language model regression more robust and better calibrated.

[NLP-58] Distributional Alignment as a Criterion for Designing Task Vectors in In-Context Learning

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在上下文学习(In-context Learning, ICL)中因上下文长度增加而导致推理成本上升的问题。现有方法如任务向量(Task Vector)虽能通过压缩演示样本为紧凑的隐藏状态表示来降低开销,但其有效性仅以下游任务准确率间接评估,缺乏对提取机制设计的指导。论文提出的核心解决方案是:将任务向量推理的预测分布与ICL保持一致,并引入一个新的量化指标 $ d_\text{NTP} $ 来衡量两者在下一个词概率上的差异。实证表明,$ d_\text{NTP} $ 与任务准确率呈强负相关,可作为性能代理指标。基于此,作者提出线性任务向量(Linear Task Vector, LTV),通过闭式线性映射实现对演示效应的回归估计,从而最小化 $ d_\text{NTP} $。实验显示,LTV在8个分类基准和5个LLM上均显著优于现有基线,平均准确率提升9.2%,同时降低推理延迟;此外,LTV在回归任务中也表现更优,并首次验证了任务向量在不同模型规模间的迁移能力——来自更大模型的任务向量可使小模型性能提升6.4%,揭示了任务表示的新应用潜力。

链接: https://arxiv.org/abs/2605.20730
作者: Jihoon Kwon,Jiwon Choi,Jy-yong Sohn
机构: Seoul National University (首尔国立大学); Yonsei University (延世大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, preprint

点击查看摘要

Abstract:In-context learning (ICL) allows large language models (LLMs) to adapt to new tasks through demonstrations, yet it suffers from escalating inference costs as context length increases. While task vectors offer a promising alternative by compressing demonstrations into compact hidden-state representations, their quality has been evaluated only through downstream task accuracy. This indirect criterion provides limited insight into how to design more effective task vector extraction methods. In this paper, we posit that inference using task vectors should align their predictive distribution with that of ICL. To quantify this, we introduce d_\textNTP , a metric that measures the discrepancy in next-token probabilities between task vector-based and ICL-based inference. Our empirical analysis reveals that d_\textNTP serves as a performance proxy, exhibiting a strong negative correlation with downstream accuracy. Motivated by this, we develop Linear Task Vector (LTV), a method designed to minimize d_\textNTP via a closed-form linear mapping that estimates demonstration effects through regression. Across eight classification benchmarks and five LLMs, LTV consistently outperforms existing task vector baselines, improving average accuracy by 9.2% while reducing inference latency. We further show that LTV outperforms the baselines on regression tasks. Moreover, we investigate the transferability of LTV across different model scales; an aspect that has remained nascent in task vector research. Specifically, we empirically show that task vectors from a larger model can enhance a smaller model’s performance by 6.4%, suggesting a new utility for extracted task representations.

[NLP-59] MTR-Suite: A Framework for Evaluating and Synthesizing Conversational Retrieval Benchmarks ACL2026

【速读】: 该论文旨在解决当前对话式检索(conversational retrieval)评估中存在的两大问题:一是现有基准测试依赖昂贵且稀疏的人工标注,二是自动化评估方法采用僵化、不自然的启发式规则,导致评估结果不可靠。解决方案的关键在于提出MTR-Suite这一统一框架,其核心创新包括:(1) MTR-Eval,基于大语言模型(LLM)的审计工具,可量化现有基准中语义对齐差距;(2) MTR-Pipeline,一种多智能体系统,利用贪心遍历聚类技术生成高质量对话,成本仅为人工标注的1/400;(3) MTR-Bench,一个严格设计的通用领域基准,模拟真实生产场景中的挑战(如难topic切换和冗长表达),具备更强的区分能力。该框架显著提升了对话检索评估的效率与真实性,为RAG(检索增强生成)系统的优化提供了可靠支撑。

链接: https://arxiv.org/abs/2605.20729
作者: Junhao Ruan,Abudukeyumu Abudula,Bei Li,Yongjing Yin,Xinyu Liu,Kechen Jiao,Xin Chen,Jingang Wang,Xunliang Cai,Tong Xiao,Jingbo Zhu
机构: Northeastern University (东北大学); Meituan Inc. (美团); NiuTrans Research (牛津研究); Tsinghua University (清华大学)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026 (main conference). 28 pages. Code and data: this https URL

点击查看摘要

Abstract:Accurate evaluation of conversational retrieval is pivotal for advancing Retrieval-Augmented Generation (RAG) systems. However, existing conversational retrieval benchmarks suffer from costly, sparse human annotation or rigid, unnatural automated heuristics. To address these challenges, we introduce MTR-Suite, a unified framework for auditing, synthesizing, and benchmarking retrieval. It features: (1) MTR-Eval, an LLM-based auditor quantifying alignment gaps in previous benchmarks; (2) MTR-Pipeline, a multi-agent system using greedy traversal clustering to generate high-fidelity dialogues at 1/400th human cost; and (3) MTR-Bench, a rigorous general-domain benchmark. MTR-Bench mimics production-style challenges (hard topic switching, verbosity), offering superior discriminative power. We make our code and data publicly available to facilitate future research at this https URL.

[NLP-60] SCRIBE: Diagnostic Evaluation and Rich Transcription Models for Indic ASR INTERSPEECH2026

【速读】: 该论文试图解决自动语音识别(ASR)评估中传统词错误率(WER)指标的局限性问题,即WER无法区分不同类型的错误,并且对黏着语(agglutinative languages)存在结构性偏差。其解决方案的关键在于提出SCRIBE框架,该框架通过引入领域词汇库并采用容忍音节合并(sandhi)的对齐方式,实现对词汇、标点、数字和领域实体等错误类别的细粒度分解,从而更准确地反映实际纠错成本。实验证明,SCRIBE的结果与专家判断高度一致,而WER则不能。

链接: https://arxiv.org/abs/2605.20712
作者: Kavya Manohar,Arghya Bhattacharya,Kush Juvekar,Kumarmanas Nethil
机构: Adalat AI(印度)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Submitted to Interspeech 2026

点击查看摘要

Abstract:Automatic speech recognition replaces typing only when correction costs less than manual entry, a threshold determined by error types, not counts: fixing a misrecognized domain term costs far more than inserting a comma. Word error rate (WER) fails on two fronts: it collapses distinct error categories into a single scalar, and it structurally penalizes agglutinative languages where valid sandhi merges inflate scores. We introduce SCRIBE, a diagnostic framework that provides categorical error decomposition into lexical, punctuation, numeral, and domain-entity rates through sandhi-tolerant alignment with domain vocabulary injection. Human validation confirms SCRIBE aligns with expert judgment where WER does not. We release SCRIBE, an LLM curation pipeline, benchmarks, and open-weight rich transcription models for Hindi, Malayalam, and Kannada.

[NLP-61] Interpretable Discriminative Text Representations via Agreement and Label Disentanglement

【速读】: 该论文试图解决的问题是:如何构建可解释的判别式文本表示(discriminative text representations),使得每个特征坐标不仅具备预测能力,还能被独立审计者清晰理解并验证其与标签的解耦性(label disentanglement)。现有方法如匿名嵌入方向或基于大语言模型(LLM)辅助的概念瓶颈法,往往无法保证特征定义的可复现性或与目标标签的区分度。解决方案的关键在于提出一个操作性标准——每项特征必须同时满足两个条件:概念清晰性(conceptual clarity,通过独立标注者对特征定义的一致性进行衡量,使用调整偶然一致性的Cohen’s κ)和标签解耦性(label disentanglement,即特征不应只是目标标签的同义重复)。作者据此设计了LLM辅助特征发现(LFD)方法,该方法通过对比正负样本对迭代生成候选特征,利用跨LLM Cohen’s κ筛选高一致性特征,并基于残差预测增益选择最终特征。实证表明,LFD在保持与强基线相当预测性能的同时,显著提升了特征的可解释性和抗标签泄露能力,且人类审计显示其特征具有更高的标注者间一致性与可信度,从而为可解释文本分类提供了可操作的审计标准。

链接: https://arxiv.org/abs/2605.20693
作者: Tong Wang,Yiqing Xu,Leo Yang Yang
机构: Yale University (耶鲁大学); Stanford University (斯坦福大学); Hong Kong Baptist University (香港浸会大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Interpretable text representations should expose coordinates that are not only predictive, but also meaningful enough for independent auditors to apply. Existing discriminative representations often use anonymous embedding directions, while concept-bottleneck and LLM-assisted methods attach natural-language names to features without ensuring that those definitions are reproducible or distinct from the target label. We propose an operational criterion for interpretable discriminative text representations: each coordinate should satisfy conceptual clarity, measured by chance-adjusted agreement between independent annotators applying the feature definition, and label disentanglement, meaning the feature should not merely paraphrase the prediction target. We instantiate this criterion in LLM-assisted Feature Discovery (LFD), an iterative method that proposes lexical and semantic features from contrastive outcome-opposed text pairs, screens candidates using cross-LLM Cohen’s \kappa , and selects features by residual held-out predictive gain. A stylized analysis connects the \kappa screen to a per-feature annotation-noise bound, formalizing agreement as a reliability check. Across ten text-classification tasks spanning seven corpora, LFD matches the predictive performance of a strong text bottleneck baseline while producing substantially clearer and less label-entangled features. Human audits with 232 raters show that LFD features achieve higher human–human and human–LLM agreement than baseline concepts, and raters consistently judge them as less label-leaking. These results suggest that agreement-tested, label-disentangled coordinates provide a practical auditability standard for interpretable text classification.

[NLP-62] Beyond Semantic Similarity: A Two-Phase Non-Parametric Retrieval Workflow for Corporate Credit Underwriting

【速读】: 该论文旨在解决企业信贷审核中因文档复杂性带来的信息提取难题,即标准检索增强生成(RAG)系统在处理数百页、多语言的财务文档时,仅基于语义相似度进行检索,导致返回内容虽主题相关但缺乏决策实用性的问题(称为“相似性-实用性差距”)。解决方案的关键在于提出一种两阶段非参数化检索架构:第一阶段结合词法与稠密多语言检索构建高召回候选池;第二阶段引入自适应检索控制器,利用查询意图和文档结构信号过滤候选,并通过大语言模型作为裁判(LLM-as-a-Judge)机制对片段进行实用性评分,从而按分析价值而非语义接近度排序。此外,系统采用上下文感知的抽取模块以保持叙事文本与复杂财务表格间的结构一致性,且全部部署于本地以满足企业数据治理要求。实证表明,该方法显著优于基线方案,在超过800名信贷分析师的实际应用中,文档审阅时间从数小时缩短至约三分钟,验证了实用性导向的RAG架构在密集文档决策支持流程中的有效性。

链接: https://arxiv.org/abs/2605.20684
作者: Linus Ng Junjia,Ezekiel Tee Kongquan,Kelvin Heng,Kenneth Zhu Ke,Zhao Jing Yuan
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Corporate credit underwriting requires analysts to extract actionable evidence from long, heterogeneous financial documents spanning hundreds of pages and multiple languages. Standard Retrieval-Augmented Generation (RAG) pipelines optimize for semantic similarity, which frequently surfaces passages that are topically related but lack decision utility, a problem we term the similarity-utility gap. We propose a two-phase non-parametric retrieval architecture that separates high-recall candidate retrieval from high-precision utility ranking. The first phase combines lexical and dense multilingual retrieval to construct a broad candidate pool. The second phase applies an adaptive retrieval controller that filters candidates using query intent and document structure signals, followed by an LLM-as-a-Judge utility scoring mechanism that ranks passages by analytical usefulness rather than semantic proximity. A context-aware extraction module preserves structural fidelity across narrative text and complex financial tables. The system is deployed entirely on-premise to satisfy enterprise data governance requirements. Evaluated on a multilingual corpus of proprietary financial documents with analyst-curated relevance labels, the system significantly outperforms naive retrieval baselines. In production deployment across more than 800 credit analysts, document review time was reduced from several hours to approximately three minutes, demonstrating the practical value of utility-aware RAG architectures for document-intensive decision-support workflows. Subjects: Computation and Language (cs.CL) Cite as: arXiv:2605.20684 [cs.CL] (or arXiv:2605.20684v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2605.20684 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[NLP-63] On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists

【速读】: 该论文试图解决的问题是:当前AI评审员在科学同行评审中的能力与局限性尚不明确,现有评估方法仅关注其判断是否与人类一致(如评分一致性、接受预测),无法全面刻画其真实表现。解决方案的关键在于开展一项大规模专家标注研究,由45位来自物理、生物和健康科学领域的专家对2,960条针对《自然》系列期刊论文的具体批评进行细致评分(涵盖正确性、重要性和证据充分性三个维度)。结果显示,基于GPT-5.2的AI评审系统在综合评分上优于每篇论文中评分最高的真人评审(60.0% vs. 48.2%,p=0.009),且所有AI评审均超越最低评分的人类评审;同时,AI生成的批评更具显著性和证据支持,并揭示了人类未提出的26%的新问题。然而,AI评审存在重复率高(21% vs. 人类3%)及16种人类未共有的系统性弱点(如子领域知识有限、多文件长上下文管理不足、对小问题过度苛责)。因此,研究结论认为当前AI评审是人类评审的补充而非替代。

链接: https://arxiv.org/abs/2605.20668
作者: Seungone Kim,Dongkeun Yoon,Kiril Gashteovski,Juyoung Suk,Jinheon Baek,Pranjal Aggarwal,Ian Wu,Viktor Zaverkin,Spase Petkoski,Daniel R. Schrider,Ilija Dukovski,Francesco Santini,Biljana Mitreska,Yong Jeong,Kyeongha Kwon,Young Min Sim,Dragana Manasova,Arthur Porto,Biljana Mojsoska,Makoto Takamoto,Marko Shuntov,Ruoqi Liu,Hyunjoo Jenny Lee,Niyazi Ulas Dinç,Yehhyun Jo,Sunkyu Han,Chungwoo Lee,Huishan Li,Esther H. R. Tsai,Ergun Simsek,Khushboo Shafi,Yeonseung Chung,Jihye Park,Aleksandar Shulevski,Henrik Christiansen,Yoosang Son,Elly Knight,Amanda Montoya,Jeongyoun Ahn,Christian Langkammer,Heera Moon,Changwon Yoon,Nikola Stikov,Mooseok Jang,Edward Choi,Junhan Kim,Yeon Sik Jung,Woo Youn Kim,Jae Kyoung Kim,Ishraq Md Anjum,Hyun Uk Kim,Drew Bridges,Carolin Lawrence,Xiang Yue,Alice Oh,Akari Asai,Sean Welleck,Graham Neubig
机构: Carnegie Mellon University; KAIST; NEC Laboratories Europe; Ss. Cyril and Methodius University in Skopje; INM - Leibniz Institute for New Materials; Saarland University German Research Center for Artificial Intelligence (DFKI); Aix Marseille University, INSERM; University of North Carolina at Chapel Hill; Boston University; University of Basel; University Hospital of Basel; University of Manchester; Massachusetts Institute of Technology; Florida Museum of Natural History, University of Florida; Roskilde University; University of Copenhagen; Stanford University; École Polytechnique Fédérale de Lausanne; Institute for Basic Science (IBS); Brookhaven National Laboratory; University of Maryland Baltimore County; Lawrence Berkeley National Laboratory; The Netherlands Institute for Radio Astronomy; University of Alberta; The University of Texas MD Anderson Cancer Center; Medical University of Graz; Polytechnique Montréal; Montreal Heart Institute
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Work in progress

点击查看摘要

Abstract:With the advancement of AI capabilities, AI reviewers are beginning to be deployed in scientific peer review, yet their capability and credibility remain in question: many scientists simply view them as probabilistic systems without the expertise to evaluate research, while other researchers are more optimistic about their readiness without concrete evidence. Understanding what AI reviewers do well, where they fall short, and what challenges remain is essential. However, existing evaluations of AI reviewers have focused on whether their verdicts match human verdicts (e.g., score alignment, acceptance prediction), which is insufficient to characterize their capabilities and limits. In this paper, we close this gap through a large-scale expert annotation study, in which 45 domain scientists in Physical, Biological, and Health Sciences spent 469 hours rating 2,960 individual criticisms (each targeting one specific aspect of a paper) from human-written and AI-generated reviews of 82 Nature-family papers on correctness, significance, and sufficiency of evidence. On a composite of all three dimensions, a reviewing agent powered by GPT-5.2 scores above each paper’s top-rated human reviewer (60.0% vs. 48.2%, p = 0.009), while all three AI reviewers (including Gemini 3.0 Pro and Claude Opus 4.5) exceed the lowest-rated human across every dimension. AI reviewers’ accurate criticisms are also more often rated significant and well-evidenced, and surface a distinct 26% of issues no human raises. However, AI reviewers overlap far more than humans do (21% vs. 3% for cross-reviewer pairs), and exhibit 16 recurring weaknesses humans do not share, such as limited subfield knowledge, lack of long context management over multiple files, and overly critical stance on minor issues. Overall, our results position current AI reviewers as complements to, not substitutes for, human reviewers.

[NLP-64] AVSD: Adaptive-View Self-Distillation by Balancing Consensus and Teacher-Specific Privileged Signals

【速读】: 该论文旨在解决自蒸馏(self-distillation)中因教师模型使用学生无法访问的特权信息(privileged information)而导致的训练-推理不一致问题,以及如何选择最优特权信息视图(view)这一任务依赖性难题。其解决方案的关键在于提出一种新型多视图自蒸馏方法AVSD(Adaptive-View Self-Distillation),通过将token级监督信号分解为跨视图稳定的共识信号(consensus signal)与视图特异的残差信号(residual signal),从而在保持更新方向可靠性的同时,动态调整更新幅度:仅当残差信号与共识方向一致且比例适当时才被引入。实验表明,AVSD在数学竞赛基准(AIME24/25、HMMT25)和代码生成基准(Codeforces、LiveCodeBench v6)上均显著优于单视图自蒸馏基线和GRPO方法,验证了其有效性与泛化能力。

链接: https://arxiv.org/abs/2605.20643
作者: Duy Nguyen,Hanqi Xiao,Archiki Prasad,Zaid Khan,Anirban Das,Austin Zhang,Sambit Sahu,Hyunji Lee,Elias Stengel-Eskin,Mohit Bansal
机构: UNC Chapel Hill; Capital One; The University of Texas at Austin
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Code: this https URL

点击查看摘要

Abstract:Self-distillation enables language models to learn on-policy from their own trajectories by using the same model as both student and teacher, with the teacher being conditioned on privileged information unavailable to the student. Such information can come in different types or views, such as solutions, demonstrations, feedback, or final answers. This setup provides dense token-level feedback without relying on a separate external model, but creates a fundamental asymmetry: the teacher may rely on view-specific information that the student cannot access at inference time. Moreover, the best type of privileged information is often task-dependent, making it difficult to choose a single teacher view. In this work, we address both these challenges jointly by introducing AVSD (Adaptive-View Self-Distillation), a novel method of self-distillation with multiple privileged-information views, which reconstructs token-level supervision by separating stable cross-view consensus from view-specific residual signals. AVSD identifies the consensus signal shared across views, which provides a reliable update direction, and then selectively adds the view-specific residual signal to adjust the update magnitude when it both aligns with the consensus direction and remains proportionate to the consensus signal. Experiments on math competition benchmarks (AIME24, AIME25, and HMMT25) show that AVSD consistently outperforms both single-view self-distillation baselines and GRPO, achieving average Avg@8 gains of 3.1% and 2.2% over the strongest baselines on Qwen3-8B and Qwen3-4B, respectively. Moreover, on code-generation benchmarks (Codeforces, LiveCodeBench v6) using Qwen3-8B, AVSD outperforms the single-view self-distillation baseline by 2.4% on average.

[NLP-65] Divide-Prompt-Refine: a Training-Free Structure-Aware Framework for Biomedical Abstract Generation

【速读】: 该论文旨在解决生物医药文献中缺乏摘要(abstract)的问题,这类缺失显著降低了文章在信息检索、生物医学知识发现等下游自然语言处理任务中的可用性。其解决方案的关键在于提出一种无需训练的零样本框架DPR-BAG(Divide, Prompt, and Refine for Biomedical Abstract Generation),该框架首先将无摘要的全文按背景-目的-方法-结果-结论(BOMRC)结构化分段,随后并行调用大语言模型(LLM)对各部分进行摘要生成,并通过最终的精炼阶段恢复整体语篇连贯性。实验表明,DPR-BAG在PMC-MAD数据集上优于强提取式与微调基线模型,在保持事实一致性的同时提升抽象新颖性;此外,消融研究揭示了一个反直觉现象:增加提示复杂度或显式注入实体级指导反而会损害事实对齐,强调了控制性提示策略的重要性。这一成果验证了无需训练、结构感知型框架在低资源场景下实现可扩展生物医学摘要生成的潜力。

链接: https://arxiv.org/abs/2605.20628
作者: Sylvey Lin,Joe Menke,Shufan Ming,Dongin Nam,Neil Smalheiser,Halil Kilicoglu
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); University of Illinois College of Medicine (伊利诺伊大学医学院)
类目: Computation and Language (cs.CL)
备注: Accepted by BioNLP 2026

点击查看摘要

Abstract:Biomedical abstracts play a critical role in downstream NLP applications, such as information retrieval, biocuration, and biomedical knowledge discovery. However, a non-trivial number of biomedical articles do not have abstracts, diminishing the utility of these articles for downstream tasks. We propose DPR-BAG (Divide, Prompt, and Refine for Biomedical Abstract Generation), a training-free, zero-shot framework that generates coherent and factually grounded abstracts for biomedical articles with full text but no abstract. DPR-BAG decomposes full-text documents into structured rhetorical facets following the Background-Objective-Methods-Results-Conclusions (BOMRC) schema, performs parallel LLM-based summarization for each facet, and applies a final refinement stage to restore global discourse coherence. On PMC-MAD, a distribution-aligned dataset of 46,309 biomedical articles, DPR-BAG improves abstractive novelty over strong extractive and fine-tuned baselines, while maintaining factual consistency. Our ablation study reveals a counterintuitive finding: increasing prompt complexity or explicitly injecting entity-level guidance can degrade factual alignment, highlighting the importance of controlled prompting strategies. These findings underscore the potential of training-free, structure-aware frameworks for scalable biomedical abstract generation in low-resource settings. Our data and code are available at this https URL and this https URL.

[NLP-66] Retrieval-Augmented Long-Context Translation for Cultural Image Captioning: Gators submission for AmericasNLP 2026 shared task

【速读】: 该论文旨在解决原住民语言文化图像描述生成(cultural image captioning for Indigenous languages)这一低资源场景下的多语言生成难题,尤其针对Bribri、Guaraní和Orizaba Nahuatl等语言的图像描述生成任务。其解决方案的关键在于采用两阶段流水线:第一阶段利用Qwen2.5-VL模型生成西班牙语中间描述,第二阶段通过检索增强的多示例提示(retrieval-augmented many-shot prompting)结合Gemini 2.5 Flash模型生成目标语言描述。实验表明,该方法在开发集上分别实现了164.1%、131.7%和122.6%的性能提升,并在测试集上保持了对Bribri和Orizaba Nahuatl语言约150%的改进;同时发现检索机制具有高度语言依赖性,仅在大规模且领域相关的语料中有效,而合成数据增强贡献了约28 chrF++的Guaraní语言性能提升。该方案最终成为共享任务的优胜者,在人类评估中位列五项决赛提交方案中的第二名。

链接: https://arxiv.org/abs/2605.20626
作者: Aashish Dhawan,Christopher Driggers-Ellis,Dzmitry Kasinets,Daisy Zhe Wang,Christan Grant
机构: University of Florida (佛罗里达大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present the University of Florida Gators submission to the AmericasNLP 2026 shared task on cultural image captioning for Indigenous languages. Our two-stage pipeline generates a Spanish intermediate caption with Qwen2.5-VL, then produces the target-language caption using retrieval-augmented many-shot prompting with Gemini 2.5 Flash. We achieve 164.1%, 131.7%, and 122.6% improvements over the shared task baseline for Bribri, Guaraní, and Orizaba Nahuatl captioning, respectively, in our dev set evaluation and maintain 150% improvements for the Bribri and Orizaba Nahuatl languages in the test set evaluation. We find retrieval is highly language-dependent, beneficial only for large, in-domain corpora, and that synthetic data augmentation accounts for around 28 chrF++ of the dev set Guaraní performance gain. Our submission is the overall winner of the shared task, placing second out of five finalist submissions in human evaluations of target-language captions.

[NLP-67] Auto-Dreamer: Learning Offline Memory Consolidation for Language Agents

【速读】: 该论文试图解决语言智能体在处理连续任务流时,现有记忆系统难以将积累的经验转化为可复用知识的问题。核心挑战在于当前方法通常将记忆获取与巩固耦合在在线过程中,导致智能体缺乏跨会话的全局视角,无法发现重复模式、抽象共享流程或删除冗余条目。解决方案的关键是提出Auto-Dreamer——一种基于互补学习系统理论的离线记忆巩固器,它通过解耦快速的单会话记忆获取与慢速的跨会话记忆巩固过程,实现更高效的长期知识存储与利用。Auto-Dreamer以类型化记忆库中的选定区域为输入,将其视为只读证据,通过有限工具调用来检查条目及其来源轨迹,并合成一个紧凑的新集合来抽象跨会话信息并取代原区域。该模型使用GRPO算法训练,以端到端智能体性能作为奖励信号,从而学会如何从快速在线经验中有效巩固记忆。实验表明,在仅用ScienceWorld数据训练后,Auto-Dreamer在ScienceWorld上比固定、强化学习训练和提示基线高出7分,同时活跃记忆空间仅为最强基线的1/12;且在未见的ALFWorld和WebArena环境中无需重新训练即保持领先,内存占用仅为最强基线的1/6。

链接: https://arxiv.org/abs/2605.20616
作者: Chongrui Ye,Yuxiang Liu,Yu Wang,Haofei Yu,Yining Zhao,Ge Liu,Julian McAuley,Jiaxuan You
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); University of California San Diego (加州大学圣地亚哥分校)
类目: Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:Language agents increasingly operate over streams of related tasks, yet existing memory systems struggle to convert accumulated experience into reusable knowledge. Retrieval-augmented and structured memory methods record per-session observations effectively, but often couple acquisition and consolidation into a single online process, leaving the agent without a global view across sessions to discover recurring patterns, abstract shared procedures, or prune redundant entries. Inspired by complementary learning systems theory, we propose Auto-Dreamer, a learned offline consolidator for language-agent memory. Auto-Dreamer decouples fast per-session memory acquisition from slow cross-session consolidation. Given a selected working region of a typed memory bank, the consolidator treats the region as read-only evidence, performs bounded tool-use to inspect entries and provenance-linked source trajectories, and synthesizes a fresh compact replacement set that abstracts across sessions and supersedes the original region. We train Auto-Dreamer via GRPO, using end-to-end agent performance as the reward signal to learn how to consolidate memories acquired through fast online experience. Trained on ScienceWorld trajectories alone, Auto-Dreamer outperforms fixed, RL-trained, and prompted memory baselines on ScienceWorld by 7 points while using an active memory bank 12 \times smaller than the strongest baseline, and continues to lead on held-out ALFWorld and WebArena without retraining – using 6 \times less memory than the strongest baseline on ALFWorld.

[NLP-68] HRM-Text: Efficient Pretraining Beyond Scaling

【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)预训练对算力和数据规模高度依赖的问题,即当前主流范式需要互联网规模的原始文本和巨大的计算资源,限制了基础研究的可及性。其解决方案的关键在于:提出一种受生物系统多时间尺度处理机制启发的分层递归模型(Hierarchical Recurrent Model, HRM),将计算解耦为慢速演化的策略层与快速演化的执行层,并结合MagicNorm稳定深层递归、温启动深度信用分配机制优化训练稳定性;同时采用任务完成目标(task-completion objective)和PrefixLM掩码策略,在仅400亿唯一token和约1500预算单位的数据上进行训练,实现了在多项基准测试中媲美2-7B参数开源模型的性能,显著降低了预训练所需的计算成本(减少96-432倍)和数据量(减少100-900倍)。这一成果验证了架构与目标协同设计可在不牺牲性能的前提下大幅降低预训练门槛。

链接: https://arxiv.org/abs/2605.20613
作者: Guan Wang,Changling Liu,Chenyu Wang,Cai Zhou,Yuhao Sun,Yifei Wu,Shuai Zhen,Luca Scimeca,Yasin Abbasi Yadkori
机构: Sapient Intelligence; MIT
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The current pretraining paradigm for large language models relies on massive compute and internet-scale raw text, creating a significant barrier to foundational research. In contrast, biological systems demonstrate highly sample-efficient learning through multi-timescale processing, such as the functional organization of the frontoparietal loop. Taking this as inspiration, we introduce HRM-Text, which replaces standard Transformers with a Hierarchical Recurrent Model (HRM) that decouples computation into slow-evolving strategic and fast-evolving execution layers. To stabilize this deep recurrence for language modeling, we introduce MagicNorm and warmup deep credit assignment. Furthermore, instead of standard raw-text pretraining, we train exclusively on instruction-response pairs using a task-completion objective and PrefixLM masking. Serving as an empirical existence proof of efficient pretraining, a 1B-parameter HRM-Text model trained from scratch on only 40 billion unique tokens and 1,500 budget achieves 60.7% on MMLU, 81.9% on ARC-C, 82.2% on DROP, 84.5% on GSM8K, and 56.2% on MATH. Despite utilizing roughly 100-900x fewer training tokens and 96-432x less estimated compute than standard baselines, HRM-Text performs competitively with 2-7B parameter open models. These results demonstrate that co-designing architectures and objectives can radically reduce the compute-to-performance ratio, making pretraining from scratch accessible to the broader research community.

[NLP-69] Self-Training Doesnt Flatten Language – It Restructures It: Surface Markers Amplify While Deep Syntax Dies

【速读】: 该论文试图解决的问题是:语言模型在持续自我训练(successive self-training)过程中,其生成文本的结构变化是否仅仅是“扁平化”(flattening)——即多样性下降、分布变窄、文本趋同——还是存在更复杂的结构性重塑。研究表明,这种过程并非均匀的扁平化,而是呈现出不对称的重构现象:表面复杂性(如连接词、模糊语、破折号)上升,而中深层句法结构(如疑问句、插入语、被动语态、虚拟语气)显著衰减。解决方案的关键在于提出并验证“结构深度假说”(Structural Depth Hypothesis, SDH),即语言特征每代衰减速率主要由其结构深度(嵌套句法依赖数量)决定,其次才是初始频率。通过跨五种模型(涵盖三种架构家族)的17个特征面板分析(N=85),SDH的Spearman相关系数达ρ=0.540(p<10⁻⁶),远强于频率的预测力(ρ=0.225),且人工微调对照组的相关性接近零(ρ=0.039),证明该梯度效应具有自训练特异性。此外,研究还揭示了“表层复杂性悖论”:尽管深层句法结构衰退,但树深度、类型-token比(TTR)、词长等复杂性指标反而上升,这对训练数据筛选和大语言模型生成文本检测具有直接启示意义。

链接: https://arxiv.org/abs/2605.20602
作者: Ming Liu
机构: Amazon
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 19 pages (14 main + 5 appendix), 8 figures, 3 tables

点击查看摘要

Abstract:Successive self-training on a language model’s own outputs is widely characterized as a process of flattening: diversity drops, distributions narrow, and the text becomes “more like itself.” We provide evidence that this characterization is incomplete. Across eleven generations of self-training on five models (GPT-2 124M, Pythia-410M, Pythia-1.4B, OPT-1.3B, Pythia-2.8B), language is not flattened uniformly – it is restructured. Surface markers (discourse connectives, hedges, em-dashes) rise, while mid- and deep-syntactic structures (questions, parentheticals, passives, subjunctives) collapse. We formalize this asymmetric collapse as the Structural Depth Hypothesis (SDH): the per-generation decay rate of a linguistic feature is predicted primarily by its structural depth – the number of nested syntactic dependencies it requires – and only secondarily by its generation-zero output frequency. Pooling 17-feature panels from five models spanning three architecture families (N=85), the pooled Spearman correlation is rho=0.540 (p 10^-6; cluster-bootstrap 95% CI [0.434, 0.634]), while frequency is a substantially weaker predictor (rho=0.225). A matched human-text fine-tuning control yields rho=0.039 (p=0.88), confirming the gradient is self-training-specific. We further document a Superficial Complexity Paradox: aggregate complexity proxies (dep-tree depth, TTR, word length) all rise as the underlying clause structure dies, with direct implications for training-data curation and LLM-text detection.

[NLP-70] Do No Harm? Hallucination and Actor-Level Abuse in Web-Deployed Medical Large Language Models

【速读】: 该论文旨在解决医疗大语言模型(Medical Large Language Models, MedLLMs)在Web平台部署时存在的幻觉(hallucination)、政策合规性不足及设计不安全等问题。其解决方案的关键在于提出两个评估框架:一是MedGPT-HEval,用于检测模型生成内容的虚假信息;二是基于大语言模型的流水线方法,用于评估政策违规行为和开发者意图。研究通过对6,233个MedGPTs的大规模评估发现,约25%-30%的模型存在低事实准确性问题,其中底层和中层模型风险最高;33.6%-54.3%的模型违反操作阈值,且57.06%的具备行动能力(Action-enabled)模型缺乏充分隐私披露。尽管MedGPTs在事实准确性和语义对齐上优于开源模型,但后者更稳定,凸显了当前医疗大模型在安全性与合规性上的系统性缺陷,亟需多指标评估体系和更强的防护机制。研究还发布了HAA-MedGPT数据集,以支持未来针对面向网络的医疗大模型安全性的研究。

链接: https://arxiv.org/abs/2605.20591
作者: Sunday Oyinlola Ogundoyin,Muhammad Ikram,Rahat Masood
机构: University of Ibadan (伊巴丹大学); National University of Science and Technology (国家科技大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Medical large language models (LLMs), including custom medical GPTs (MedGPTs) and open-source models, are increasingly deployed on web platforms to provide clinical guidance. However, they pose risks of hallucination, policy noncompliance, and unsafe design. We conduct a large-scale assessment of 6,233 MedGPTs, evaluating a stratified sample of 1,500, together with 10 open-source LLMs. We introduce two frameworks: MedGPT-HEval for hallucination detection and an LLM-based pipeline for assessing policy violations and developer intent. Our results show that 25-30% of MedGPTs exhibit low factual accuracy, with bottom- and middle-tier models at highest risk; 33.6-54.3% violate operational thresholds, and 57.06% of Action-enabled models lack adequate privacy disclosures. Compared with open-source models, MedGPTs achieve higher factual accuracy and semantic alignment, though open-source models are more stable. These results reveal systemic gaps in hallucination and compliance, highlighting the need for multi-metric evaluation and stronger safeguards. We release HAA-MedGPT, a structured dataset that supports future research on the safety of web-facing medical LLMs.

[NLP-71] Direct Translation between Sign Languages

【速读】: 该论文试图解决的是手语之间直接翻译(sign-to-sign translation)的问题,旨在为全球约15亿聋人和听力障碍者(DHH)提供跨语言交流的解决方案,从而摆脱对听力翻译员或书面语言能力的依赖。当前主流的级联方法(cascade approach)将翻译拆分为签到文本、文本到文本、文本到签三个步骤,存在误差传播、延迟增加以及视觉模态特有信息丢失等缺陷。论文的关键解决方案是:利用回译(back-translation)技术,从未对齐的单一语言手语语料中合成手语-手语配对数据,并基于此联合训练一个MBART架构的单一模型,实现文本到手语(T2S)与手语到手语(S2S)的端到端联合建模。实验表明,该方法在几何手语错误指标(DTW对齐MPJPE降低20%)和语言匹配度(BLEU-4提升50%)上优于级联基线,同时速度提升约2.3倍,验证了直接手语翻译的可行性与优越性。

链接: https://arxiv.org/abs/2605.20588
作者: Zetian Wu,Bowen Xie,Wuyang Meng,Milan Gautam,Stefan Lee,Liang Huang
机构: Oregon State University (俄勒冈州立大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The field of sign language translation has witnessed significant progress in the translation between sign and spoken languages, but the translation between sign languages remains largely unexplored and out of reach. The latter can help 1.5 billion deaf and hard-of-hearing (DHH) people worldwide communicate across language barriers without relying on hearing interpreters or written-language fluency. The cascade approach composing separate sign-to-text, text-to-text, and text-to-sign systems suffers from error propagation and extra latency as well as the loss of information unique in the visual modality. We aim to develop direct sign-to-sign translation. However, a large-scale open-domain parallel corpus has not been curated between sign languages. To enable direct translation between sign language utterances, we use back-translation to produce synthetic sign-sign pairs from unaligned individual language utterance-sign corpora. Using this data, we jointly train a single MBART-based model for both text-sign (T2S) and sign-sign (S2S). On synthetically generated paired sets between American Sign Language (ASL), Chinese Sign Language (CSL), and German Sign Language (DGS), our direct S2S method outperforms the cascaded baseline on geometric sign error metrics (20% lower DTW-aligned MPJPE) and language matching metrics after predicted sign utterances are translated back to sentences (50% high BLEU-4) while achieving a roughly 2.3* speedup. On a small set of pre-existing cross-lingual sign data, we find similar improvements for our proposed method.

[NLP-72] When Irregularity Helps: A Subclass Analysis of Inductive Bias in Neural Morphology

【速读】: 该论文试图解决的问题是:神经形态生成系统在基准数据集上虽然整体准确率较高,但其性能可能掩盖了在罕见形态子类中的系统性错误。解决方案的关键在于通过细粒度的子类分析发现,日本语动词过去时态屈折中一个仅占1%数据的结构特定不规则子类(具有辅音促音化现象)导致了不成比例的模型错误;控制性消融实验表明,移除该子类带来的泛化性能提升超过移除所有不规则动词的效果,说明并非所有不规则性对模型稳定性的影响均等。研究指出,错误集中是由极端低频形态模式与特定音变过程(尤其是促音化,gemination)之间的相互作用驱动的,因此主张在形态学评估中引入超越标准屈折类别的更细粒度子类分析。

链接: https://arxiv.org/abs/2605.20558
作者: Wen Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Neural morphological generation systems often achieve high aggregate accuracy on benchmark datasets, yet such performance can conceal systematic errors concentrated in rare morphological subclasses. We examine Japanese past-tense verb inflection and show that a very small, structurally specific irregular subtype (1% of data) accounts for a disproportionate share of model errors. Controlled ablation experiments demonstrate that removing this subtype yields larger improvements in generalization than removing all irregular verbs, indicating that not all irregularity contributes equally to model instability. These findings suggest that error concentration is driven by the interaction between extreme low-frequency morphological patterns and specific morphophonological processes, particularly gemination. We argue that morphological evaluation should incorporate finer-grained subclass analysis beyond standard conjugation categories.

[NLP-73] What Do Biomedical NER and Entity Linking Benchmarks Measure? A Corpus-Centric Diagnostic Framework ACL25

【速读】: 该论文试图解决的问题是:当前生物医学命名实体识别(NER)和实体链接(EL)任务的基准评估往往依赖于标注语料库,但这些语料库的实际特性(如标签分布、训练测试集重叠、术语覆盖范围等)常被假设为一致或充分,而未被系统性诊断。这种忽视可能导致对模型性能的误判和跨语料库迁移风险的低估。解决方案的关键在于提出一个以语料库为中心的诊断框架,通过分析标注数据、概念链接、训练测试划分、文档元数据及术语映射等结构化信息,将语料库属性归纳为五个标准化统计类别:(1)规模、密度与标签分布,(2)词汇与概念结构,(3)训练测试重叠程度,(4)元数据组成,(5)术语覆盖范围。该框架能够揭示不同语料库在评估信号强度、泛化需求、训练测试重复使用可能性以及所代表的生物医学文献与概念空间方面的差异,从而提供比传统表面指标(如语料库大小和实体类型)更深入的基准分析能力,并支持可复现的语料库诊断与迁移风险识别。

链接: https://arxiv.org/abs/2605.20537
作者: Robert Leaman,Rezarta Islamaj,Zhiyong Lu
机构: National Library of Medicine, Bethesda, MD (美国国家医学图书馆)
类目: Computation and Language (cs.CL)
备注: Accepted to the ACL 25th Workshop on Biomedical Language Processing

点击查看摘要

Abstract:Biomedical named entity recognition (NER) and entity linking (EL) strongly depend on annotated corpora, but the utility of these resources for benchmarking is often assumed rather than characterized. We present a corpus-centric framework for diagnosing benchmark-relevant properties directly from corpus annotations, concept links, train-test splits, document metadata, and terminology mappings. The framework organizes standardized statistics into five families: (1) scale, density and label distribution, (2) lexical and conceptual structure, (3) train-test overlap, (4) metadata composition, and (5) terminology coverage where applicable. Applying the framework to nine corpora spanning diseases, chemicals, and cell types, we find that corpus properties can differ substantially, even when they address the same apparent task. We find differences in the evaluation signal they provide, the generalization demands they impose, the degree of train-test reuse they permit, and the regions of biomedical literature and concept space they represent. These differences suggest that commonly reported corpus statistics can be insufficient to characterize what biomedical NER and EL benchmarks evaluate. We argue that corpus-centric diagnostics provide a practical framework for analyzing corpora beyond surface descriptors such as corpus size and entity type, for identifying potential transfer risks, and for interpreting the scope of benchmarking conclusions. We release the framework as open-source code with an interactive dashboard to support reproducing our analyses and characterizing additional corpora.

[NLP-74] Agent Atlas: Beyond Outcome Leaderboards for LLM Agents

【速读】: 该论文试图解决当前大型语言模型代理(LLM agents)评估体系碎片化的问题,即现有基准测试在衡量指标上不统一,如任务最终成功率、工具调用有效性、重复通过一致性、轨迹安全性或抗攻击鲁棒性等,导致难以客观比较不同代理的实际能力。其解决方案的关键在于提出一个系统性的多维评估框架——AgentAtlas,包含四个核心组件:(1) 六状态控制决策分类法(Act / Ask / Refuse / Stop / Confirm / Recover),用于细粒度刻画代理行为;(2) 九类轨迹失败分类法,结合主错误来源与影响层级两个正交标签,实现对失败原因的结构化诊断;(3) 基于分类法感知与盲视的方法论,量化提示词中显式标签对模型表现的影响程度;(4) 对十五个主流代理基准进行覆盖审计,映射至六个行为轴上以揭示各基准的测量侧重。实验表明,去除提示中的显式标签后,所有模型的轨迹准确率下降14–40个百分点,稳定在0.54–0.62区间,且无单一模型在控制准确性、轨迹诊断和工具上下文保留三个维度均领先,凸显了复杂代理能力评估需多维协同分析的必要性。

链接: https://arxiv.org/abs/2605.20530
作者: Parsa Mazaheri,Kasra Mazaheri
机构: University of California, Santa Cruz (加州大学圣克鲁兹分校); Massachusetts Institute of Technology (麻省理工学院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Large language model agents now act on codebases, browsers, operating systems, calendars, files, and tool ecosystems, but the benchmarks used to evaluate them are fragmented: each emphasizes a different unit of measurement (final task success, tool-call validity, repeated-pass consistency, trajectory safety, or attack robustness). A line of 2024-2025 work has converged on the diagnosis that a single accuracy column is no longer the right unit of comparison for deployable agents. AgentAtlas extends this line of work with four components: (i) a six-state control-decision taxonomy (Act / Ask / Refuse / Stop / Confirm / Recover); (ii) a nine-category trajectory-failure taxonomy with two orthogonal hierarchical labels (primary_error_source, impact); (iii) a taxonomy-aware vs. taxonomy-blind methodology that measures how much of a model’s apparent capability comes from the supervision in the prompt; and (iv) a benchmark-coverage audit mapping fifteen agent benchmarks against six behavioral axes. To demonstrate the methodology we run a small fixed eight-model set (1,342 generated items, four frontier closed and four open-weight) under both prompt modes. Removing the explicit label menu drops every model’s trajectory accuracy by 14-40 pp to a tight 0.54-0.62 floor regardless of family, and no single model wins on all three of control accuracy, trajectory diagnosis, and tool-context utility retention. We treat the synthetic run as a measurement-protocol demonstration, not a benchmark release.

[NLP-75] Collocational bootstrapping: A hypothesis about the learning of subject-verb agreement in humans and neural networks CONLL

【速读】: 该论文试图解决的问题是:语言输入中的统计信号如何辅助语法习得,具体聚焦于英语主谓一致(subject-verb agreement)这一句法结构的获得机制。解决方案的关键在于提出并验证“搭配式启动”(collocational bootstrapping)机制,即通过词项共现模式的规律性提供句法依赖关系的线索。作者通过训练神经网络在不同可预测性水平的合成数据集上进行模拟学习,发现存在一个适宜的变异性范围,使得统计学习者能够稳健地习得主谓一致规则;进一步分析儿童语料库中的主谓搭配变异性后发现,真实儿童语言输入的变异性恰好落在该有效范围内,从而支持了该机制作为儿童实际语言输入下可行的学习策略。

链接: https://arxiv.org/abs/2605.20529
作者: Claire Hobbs,R. Thomas McCoy
机构: Yale University (耶鲁大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to CoNLL

点击查看摘要

Abstract:In what ways might statistical signals in linguistic input assist with the acquisition of syntax? Here we hypothesize a mechanism called collocational bootstrapping, in which regularities in word co-occurrence patterns can provide cues to syntactic dependencies. We investigate whether this mechanism can support the acquisition of English subject-verb agreement. First, we simulate language acquisition by training neural networks on synthetic datasets that vary in how predictable their subject-verb pairings are. We find that there is a range of variability levels at which these statistical learners robustly learn subject-verb agreement. We then analyze the variability of subject-verb pairings in child-directed language, and we find that the variability in such data falls within the range that supported robust generalization in our computational simulations. Taken together, these results suggest that collocational bootstrapping is a viable learning strategy for the type of input that children receive.

[NLP-76] NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding

【速读】: 该论文试图解决当前医学视觉问答(Medical Visual Question Answering, VQA)研究中普遍存在的局限性问题,即大多数工作仅基于二维切片图像且依赖狭窄的诊断标签,缺乏对完整三维脑部磁共振成像(3D brain MRI)的利用以及对临床推理能力的系统评估。解决方案的关键在于构建一个大规模、多模态、临床语义丰富的基准数据集 NeuroQA,其包含来自12个数据集的56,953个问答对(QA pairs),覆盖年龄范围5–104岁及五类神经系统疾病(阿尔茨海默病、帕金森病、肿瘤、白质病变和神经发育障碍)。该方案的核心创新包括:(1) 以完整的3D体积为基础进行问答设计,而非2D切片;(2) 设计11种临床相关的推理技能评测维度,并区分图像接地型(image-grounded)与图像信息型(image-informed)问题类型;(3) 引入答案分布精炼机制以消除文本线索干扰,使闭合格式下的纯文本准确率从80%降至44.6%,并辅以图像必要性验证协议;(4) 建立严格的专家审核流程(38条规则+两轮评审)确保每组QA对的一致性和准确性,且无同一受试者在不同模板下出现矛盾结果。最终,NeuroQA通过分层发布策略(公开+受限数据)、受试者级划分、私有测试集和在线排行榜,为未来面向3D脑影像的生成式AI(Generative AI)模型提供了可复现、高可信度的评估平台。

链接: https://arxiv.org/abs/2605.20525
作者: Mohammad H. Abbasi,Favour Nerrise,Shaurnav Ghosh,Ridvan Yesiloglu,Yuncong Mao,Bailey Trang,Mohammad Asadi,Merryn Daniel,Gustavo Chau Loo Kung,Ken Chang,Pavan Pinkesh Shah,Adam Turnbull,Kyan Younes,Seena Dehkharghani,Ehsan Adeli(Stanford University)
机构: Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: 30 pages, dataset and benchmark release

点击查看摘要

Abstract:We present NeuroQA, a large-scale benchmark for visual question answering in 3D brain magnetic resonance imaging (MRI), with 56,953 QA pairs from 12,977 subjects across 12 datasets. It spans ages 5-104 and five clinical domains: Alzheimer’s, Parkinson’s, tumors, white matter disease, and neurodevelopment. Unlike prior medical Visual Question Answering (VQA) efforts that operate on 2D slices or rely on narrow diagnostic labels, NeuroQA pairs every item with a full 3D volume. It evaluates 11 clinically grounded reasoning skills across Yes/No, multiple-choice, and open-ended formats. Of the 203 templates, 131 are image-grounded (answerable from a 3-plane viewer) and 72 are image-informed (ground truth from quantitative volumetry or clinical instruments). To remove text-only shortcuts, we apply answer-distribution refinement, reducing closed-format text-only accuracy from 80% to 44.6%; image necessity is assessed separately through an image-grounding protocol released with the benchmark. A 38-rule deterministic pipeline and two rounds of expert review verify every QA pair against FreeSurfer measurements, metadata, or radiology report fields, with zero same-subject contradictions across templates. We conduct a clinician evaluation in which two clinicians independently assess 100 frozen test items on a three-plane viewer. On closed-format (Yes/No + multiple-choice) test-public items, the best zero-shot vision-language model and a supervised 3D CNN baseline reach 47.5% and 43.7% accuracy respectively, both below the 49.4% text-only majority-template floor. NeuroQA adopts a two-tier release with public QA pairs for open-access datasets and reproducible generation scripts for datasets restricted by data use agreements (DUAs), plus subject-level splits, a held-out private test set, and an online leaderboard.

[NLP-77] Reinforcing Human Behavior Simulation via Verbal Feedback

【速读】: 该论文试图解决的问题是:如何让大语言模型(LLM)更有效地学习人类社会规范与行为,尤其是在缺乏可量化奖励信号的场景下,传统强化学习(RL)方法难以利用人类提供的口语化、主观且多维度的反馈。解决方案的关键在于提出DITTO模型,其核心创新是将口语反馈视为强化学习中的第一类信号——通过在每轮推理后接收自然语言反馈,并生成条件化的改进输出,结合GRPO(Generalized Reward Policy Optimization)联合优化反馈和策略,从而将口头指导提炼到基础策略中,且无需在测试阶段依赖反馈。此外,作者还构建了SOUL基准测试套件,涵盖六大类人类行为模拟任务,实验证明DITTO在平均性能上比基线模型提升36%,并在6个SOUL任务上超越GPT-5.4,验证了基于口语反馈的强化学习在训练拟人化LLM方面的有效性。

链接: https://arxiv.org/abs/2605.20506
作者: Weiwei Sun,Xuhui Zhou,Jiarui Liu,Weihua Du,Haojia Sun,Yiqing Xie,Qianou Ma,Sihao Chen,Mengting Wan,Longqi Yang,Pei Zhou,Sherry Wu,Sean Welleck,Graham Neubig,Yiming Yang,Maarten Sap
机构: Carnegie Mellon University (卡内基梅隆大学); Microsoft (微软)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Humans learn social norms and behaviors from verbal feedback (e.g., a parent saying “that was rude” or a friend explaining “here’s why that hurt”). Yet, learning from feedback for LLMs has largely focused on domains like code and math, where RL rewards are directly verifiable and condensed into scalar values. As LLMs are increasingly used to simulate human behavior, e.g., standing in for users, patients, students, and other personas, there is a pressing need to make them more human-like, which requires embracing a fundamentally different kind of signal: feedback that is verbal, subjective, and multi-faceted. We present DITTO, a model trained by treating verbal feedback as a first-class signal in reinforcement learning. After each rollout, DITTO receives verbal feedback and generates a feedback-conditioned improved rollout; both outputs are jointly optimized with GRPO, distilling verbal guidance into the base policy without requiring feedback at test time. We also introduce SOUL (Simulation gym Of hUman-Like behavior), a unified benchmark and training data suite spanning 10 tasks across six categories: Theory of Mind, character role play, social skill, learner simulation, user simulation, and persona simulation. DITTO achieves an average 36% improvement over the base model and exceeds GPT-5.4 on 6 of 10 SOUL benchmarks, demonstrating that RL with verbal feedback is a promising direction for training LLMs to simulate human behavior.

[NLP-78] Stage-Audit: Auditable Source-Frontier Discovery for Cross-Wiki Tables

【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Model, LLM)在构建结构化表格时,可能生成看似有来源支持但实际上缺乏真实依据的条目(即“虚假源”问题),尤其是在Seed2Frontier发现任务中——该任务旨在从种子页面出发,找到补充的维基百科页面以组装成结构化表格。解决方案的关键在于Stage-Audit机制,其核心包括三个要素:1)分离 curator(编辑者)与 auditor(审核者)的写权限以实现职责隔离;2)引入基于行级别的源引用门控机制(row-level source-citation gate),确保每条记录都有明确来源;3)设计一套包含12项检查项的审计分类法(audit taxonomy),覆盖键值、模式、来源角色、基数和范围等维度,从而系统性识别并过滤不合规条目。实验表明,在51个实例的评估集上,Stage-Audit将源前沿精度(source-frontier precision)从0.356提升至0.505(相对提升42%),F1分数从0.334提升至0.451(相对提升35%),同时保持逐行可追溯的来源信息。该对比有效隔离了策略改进的贡献,而非单纯LLM能力提升的影响。

链接: https://arxiv.org/abs/2605.20478
作者: Chen Shen
机构: Megagon Labs
类目: Computation and Language (cs.CL)
备注: 9 pages, 2 figures, 3 tables. Accepted at the ACM CAIS 2026 Workshop on AI Agents for Discovery in the Wild

点击查看摘要

Abstract:LLM-curated tables can appear source-grounded while containing unsupported rows: the curator may recall entries from parametric memory and retroactively attach page-level citations that are not the actual source. We study this hazard in Seed2Frontier discovery: the task of finding complement Wikipedia pages from a seed page to assemble a structured table. Stage-Audit addresses it with disjoint curator-auditor write rights, a row-level source-citation gate, and a 12-check audit taxonomy over keys, schema, source roles, cardinality, and scope. On a curated 51-instance Seed2Frontier evaluation set spanning 15 top-level domains, Stage-Audit improves source-frontier precision over a vanilla LLM curator from 0.356 to 0.505 (+42% relative) and F1 from 0.334 to 0.451 (+35%), while maintaining explicit per-row source traceability. The vanilla-LLM-vs-Stage-Audit comparison isolates the policy contribution rather than LLM-based discovery in general.

[NLP-79] raining Language Agents to Learn from Experience

【速读】: 该论文试图解决的问题是:当前基于反思(reflection-based)的语言代理(language agent)仅能在单个任务实例中进行自我修正,而无法将经验提炼为可复用的学习成果以提升在未见过的任务上的表现。解决方案的关键在于提出一种名为“上下文训练”(In-context Training, ICT)的新框架,通过一个反射模型(reflector model)从行为模型(actor model)收集的轨迹中学习生成系统提示(system prompts),从而提升行为模型在未来未见任务中的性能;同时设计了一种基于强化学习(RL)的训练流程,无需人工标注示例即可直接从经验中学习这些反思策略。实验表明,所训练的反射模型在ALFWorld和MiniHack等多个环境中均显著优于未经训练的基线,并且在某些情况下实现了跨基准环境的泛化能力,验证了语言代理具备“学会如何从经验中学习”的潜力。

链接: https://arxiv.org/abs/2605.20477
作者: Yuval Shalev,Zifeng Ding,Mateja Jamnik
机构: University of Cambridge(剑桥大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Language agents can adapt from experience in interactive environments, but current reflection-based methods can only self-correct within a single task instance. Whether such experience can be distilled into reusable lessons that improve performance on future unseen tasks remains unclear. We address this problem by introducing the In-context Training (ICT) task, a framework for evaluating cross-task self-improvement in language agents. In ICT, a reflector model observes trajectories collected by an actor model and generates system prompts intended to improve the actor’s performance on future unseen tasks. We then propose an RL-based training pipeline for learning such reflections directly from experience, without human-provided examples. Across ALFWorld and MiniHack, our trained reflectors outperform an untrained baseline on most held-out task families, showing that the ability to learn from experience can itself be learned. In some cases, we observe generalisation beyond the benchmark on which the reflector was trained, to substantially different environments. Finally, we introduce MetaGym, a generic Python library for constructing meta-environments, enabling future research on self-improving language agents.

[NLP-80] Hiding in Plain Sight: Finding MAHA on Reddit

【速读】: 该论文试图解决的问题是:如何系统性地研究“让美国更健康”(Make America Healthy Again, MAHA)这一复杂且多元的全国性健康运动在社交媒体上的结构、话语传播与信念扩散机制,尤其是在面对海量非结构化社交数据时,缺乏细粒度、跨主题的结构化数据支持。解决方案的关键在于构建一个大规模、多维度、时间跨度长的Reddit数据集——涵盖2020至2025年共1940万条帖子、400万用户,包含12个MAHA相关信念主题的自然语境和主题标签信息,从而为跨学科研究者提供可计算、可分析的数据基础,以揭示该运动的动态演化、结构特征及其参与者的行为与语言模式。

链接: https://arxiv.org/abs/2605.20435
作者: Sabit Ahmed,Subigya Nepal,Henry Kautz
机构: 未知
类目: ocial and Information Networks (cs.SI); Computation and Language (cs.CL)
备注: Submitted to ASONAM 2026

点击查看摘要

Abstract:Make America Healthy Again (MAHA) is a national health movement that encompasses a striking mix of beliefs, from broadly accepted concerns about good diet and exercise to controversial takes on organic and genetically modified food, childhood vaccination, science, and institutions. Various influencers and promoters of the MAHA movement on social media are scattered throughout the online space. Investigating the structure, discourse, and contagion of MAHA beliefs requires large-scale fine-grained digital footprints. Constructing structured data covering different MAHA themes from vast unstructured social media data is challenging. We introduce a Reddit dataset that spans six years (2020-2025), comprising 19.4M posts from 4M users. Containing the natural and thematic context of 12 MAHA-aligned beliefs, this dataset offers researchers from various domains the opportunity to study the dynamics of the MAHA movement, its structural and functional components, and the linguistic and behavioral patterns of its proponents.

[NLP-81] Mechanics of Bias and Reasoning : Interpreting the Impact of Chain-of-Thought Prompting on Gender Bias in LLM s ICLR2026

【速读】: 该论文试图解决的问题是:尽管生成式 AI(Generative AI)在社会敏感场景中广泛应用,但其仍存在显著的性别偏见;而链式思维(Chain-of-Thought, CoT)提示方法被提出作为缓解此类偏见的手段,但现有评估主要依赖基准性能变化,难以判断这种偏见减少是否反映了模型内部机制的真实改变。

解决方案的关键在于:通过结合基准评估、机制可解释性技术(mechanistic interpretability techniques)和推理链失败分析(reasoning chain failure analysis),系统地考察CoT提示如何影响大语言模型(LLM)中的性别偏见。研究发现,CoT虽能在某些注意力头簇中平衡偏见行为,但性别偏见仍深嵌于隐藏表征中,表明其仅实现了表面缓解;进一步分析显示,所谓改进源于对训练数据的记忆与熟悉度,而非对偏见本质的理解。

链接: https://arxiv.org/abs/2605.20410
作者: Edie Pearman,Sophia Osborne,Mira Kandlikar-Bloch,Mina Arzaghi,Florian Carichon,Golnoosh Farnadi
机构: Mila – Quebec AI Institute (蒙特利尔魁北克人工智能研究所); McGill University (麦吉尔大学); HEC Montreal (蒙特利尔高等商学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 24 pages, 6 figures, including appendix. Accepted at the ICLR 2026 Workshop on Algorithmic Fairness Across Alignment Procedures and Agentic Systems. Submitted to COLM 2026

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed in socially sensitive settings despite substantial documentation that they encode gender biases. Chain-of-Thought (CoT) prompting has been proposed as a bias-mitigation approach. However, existing evaluations primarily focus on changes in LLM benchmark performance, providing limited insight into whether apparent bias reductions reflect meaningful changes in a model’s internal mechanisms. In this work, we investigate how CoT prompting affects gender bias in LLMs, combining benchmark-based evaluation with mechanistic interpretability techniques and reasoning chain failure analysis. Our results confirm a stereotypical bias present in LLM outputs across benchmarks, showing that CoT prompting does not consistently reduce the bias gap. Mechanistic analyses reveal that although CoT balances biased behavior in certain attention head clusters, gender bias remains embedded in hidden representations, indicating only superficial mitigation. Inspection of reasoning chains further suggests that these improvements stem from memorization and familiarity with the dataset rather than genuine understanding of bias.

[NLP-82] Puzzled By ChatGPT ? No more! A Jigsaw Puzzle to Promote AI Literacy and Awareness

【速读】: 该论文试图解决的问题是:随着生成式 AI(Generative AI)尤其是基于大语言模型(LLM)的聊天机器人(如 ChatGPT)的快速普及,公众对 AI 的理解与素养存在显著不足,亟需开发直观、易用且具教育意义的工具来提升大众的 AI 认知水平。解决方案的关键在于提出一种基于拼图游戏的互动式学习方法——通过将一幅由漫画风格信息图组成的拼图完成过程,直观呈现 AI 的工作机制、能力边界、社会影响等多维度内容;每块拼图本身也是独立的信息卡片,可单独用于聚焦讲解特定 AI 主题。该设计融合了动手操作、视觉叙事与协作互动,使用户在非正式学习场景中以趣味方式深入理解 AI 系统的优势与风险。

链接: https://arxiv.org/abs/2605.20404
作者: Francesca Padovani,Malvina Nissim
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid adoption of Generative AI, including LLM-based chatbots like ChatGPT, has highlighted the need for accessible ways to support public understanding and AI literacy. To address this need, we introduce a game-based, interactive approach in the form of a jigsaw puzzle whose completed image is a comic-based infographic illustrating the workings, capabilities, limitations, and societal implications of these technologies. Each comic sketch also functions as a standalone informational card, providing focused explanations of specific facets of AI use, design, and impact. The visual content was created in a live collaborative session with a professional illustrator and a multidisciplinary group of experts and non experts, combining structured knowledge with informal, exploratory reflections shared during the discussion. By integrating hands-on assembly, visual storytelling, and collaborative interaction, the puzzle provides an engaging and playful tool for exploring the mechanisms, perks, and perils of AI systems in informal learning contexts.

[NLP-83] Do as I Say Not as I Do: Instruction-Induction Conflict in LLM s

【速读】: 该论文试图解决的问题是:当语言模型的两个目标——遵循用户指令(instruction-following, IF)与基于训练数据模式进行预测(pattern completion)发生冲突时,模型如何表现?其核心发现是,即使在具备强大能力的语言模型中,指令遵循依然在诱导压力下表现出脆弱性。解决方案的关键在于识别出影响模型鲁棒性的主要因素:输出格式(output format)中的多样性(如多标记输出相比单标记输出更具抵抗性)比语义层面的推理(如链式思维推理)更能预测模型对诱导压力的抗性;此外,模型自身对行为的预测准确率虽高(平均83.5%),但普遍低估了自身的抗诱导能力,表明其对自身行为机制的认知存在偏差。

链接: https://arxiv.org/abs/2605.20382
作者: Carolina Camassa,Derek Shiller
机构: Future Impact Group; Rethink Priorities
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 31 pages

点击查看摘要

Abstract:Language models are trained to follow instructions, but they are also powerful pattern completers. What happens when these two objectives conflict? We construct conversations in which a user instruction to behave in a target way T (e.g., always output a specific token, answer in a particular language, or adopt a persona) is opposed by N hardcoded assistant turns demonstrating a competing pattern P. We then measure instruction-following (IF) rates in this setting, across 13 models and 16 different instructions, for up to 50 turns. Average instruction-following rates range from 1% to 99% across models, largely uncorrelated with standard capability benchmarks. The transition from instruction-following to pattern-following is universal but highly model-dependent. Robustness is modulated both by instruction content, with models resisting induction longer when instructions align with their trained value priors, and by output format, with diverse multi-token responses proving substantially more resistant than single-token outputs. Chain-of-thought reasoning improves robustness but does not eliminate susceptibility, and can produce dissociation between correct deliberation and incorrect output. When asked to predict their behavior in this setting, models achieve 83.5% accuracy on average but systematically underestimate their own resistance to induction pressure. These results suggest that instruction-following remains brittle under induction pressure even for otherwise capable models, and that output diversity, rather than semantic engagement with the input, is the primary factor predicting robustness.

[NLP-84] DEL: Digit Entropy Loss for Numerical Learning of Large Language Models

【速读】: 该论文旨在解决大语言模型(LLM)在数学推理与代码生成中对数值预测能力不足的问题,特别是现有基于最大似然估计(MLE)的方法未能针对数值特性进行优化,而近期引入的惩罚驱动方法(如Number Token Loss和Discretized Distance Loss)虽尝试引入数值距离归纳偏置,却分别导致过尖锐(over-sharpened)和过平坦(over-flattened)的数字分布问题。其解决方案的关键在于提出一种新的数字熵损失(Digit Entropy Loss, DEL),该方法通过三个核心设计重构了传统的无监督熵优化:一是利用数字条件概率和二元交叉熵将熵优化转化为有监督形式;二是摒弃数值距离项以避免数值距离带来的偏差;三是将整数数值学习推广至浮点数域,从而实现更精确的数值预测。DEL能够统一处理整数、小数及小数点,扩展学习目标从单个数字到完整的浮点数域,在七个数学推理基准测试中显著优于现有方法,且在整体预测准确率和数值距离指标上均表现优异。

链接: https://arxiv.org/abs/2605.20369
作者: Zhaohui Zheng,Chenhang He,Shihao Wang,Yuxuan Li,Ming-Ming Cheng,Lei Zhang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Number prediction stands as a fundamental capability of large language models (LLMs) in mathematical problem-solving and code generation. The widely adopted maximum likelihood estimation (MLE) for LLM training is not tailored to number prediction. Recently, penalty-driven approaches, e.g., Number Token Loss and Discretized Distance Loss, introduce an inductive bias of numerical distance but induce over-sharpened and over-flattened digit distributions, respectively. In this paper, we make an in-depth analysis on LLM numerical learning, and show that existing numerical learning methods conceptually follow a criterion-distance formulation, where the criterion term represents optimization pattern and the distance term instills geometric prior. Consequently, we present Digit Entropy Loss (DEL) for auto-regressive numerical learning, which reformulates the conventional unsupervised entropy optimization in three key designs: leveraging digit conditional probability and binary cross-entropy to guide the entropy optimization into a supervised manner; deprecating the distance term to bypass the issue of numerical distance; and generalizing the integer-based numerical learning to floating-point number optimization, enabling more accurate number prediction. Our DEL formulation can incorporate integers, decimals, and decimal points, expanding the learning objective from a single digit to the floating-point number domain. Experiments conducted on seven mathematical reasoning benchmarks with four representative LLMs, including CodeLlama, Mistral, DeepSeek, and Qwen-2.5, demonstrate that DEL consistently outperforms its counterparts in both overall prediction accuracy and numerical distance. Source codes are at this https URL

[NLP-85] When Reasoning Supervision Hurts: TTCW-Based Long-Form Literary Review Generation EMNLP2026

【速读】: 该论文旨在解决长篇文学写作的自动评估难题,特别是现有基于大语言模型(LLM)作为评判者(LLM-as-Judge)的方法难以准确捕捉创造力维度(如原创性与灵活性)的问题。其解决方案的关键在于构建一个大规模数据集(263,911条长篇故事),每条故事均标注了14个基于托兰斯创造性写作测试(Torrance Test of Creative Writing, TTCW)维度的标量分数和元合成评论,并在此基础上对Qwen3模型(4B和8B参数规模)进行微调,对比有无推理内容的两种训练条件。实验表明,不包含推理监督的微调策略表现更优且稳定,最佳设置达到0.6820的评估得分;而引入推理监督的模型易出现解析失败,常生成无关或重复的推理文本而非完成指定的14维评分报告,说明在固定格式评分任务中,推理监督并非直接有益,且即使经过任务特化微调,精准对齐指标的评分仍具挑战性。

链接: https://arxiv.org/abs/2605.20364
作者: Jinlong Liu,Mohammed Bahja,Mark Lee
机构: University of Birmingham (伯明翰大学)
类目: Computation and Language (cs.CL)
备注: Submit to EMNLP 2026

点击查看摘要

Abstract:Automatic evaluation of long-form literary writing remains challenging, as generic LLM-as-Judge approaches may not fully capture creativity-related dimensions such as originality and flexibility. Although the Torrance Test of Creative Writing (TTCW) provides a structured creativity framework, and prior work has demonstrated reference-based TTCW evaluation at the pairwise level, no large-scale dataset exists for long-form TTCW-based literary review generation. We address this gap by constructing a dataset of 263,911 long-form stories, each annotated with scalar scores and meta-synthesised review comments across 14 TTCW-based dimensions. Using this dataset, we fine-tune Qwen3 models at two scales, 4B and 8B, under two conditions: with and without reasoning content. Results show that non-reasoning fine-tuning achieves stronger and more stable performance, with the best setting reaching an evaluation score of 0.6820. Further analysis shows that reasoning-supervised models are more prone to parse failures, often continuing with irrelevant or repetitive reasoning-style text rather than completing the required 14-metric review report. These results suggest that, for fixed-format rubric-based review generation, reasoning supervision is not straightforwardly beneficial, and precise metric-aligned scoring remains challenging even after task-specific fine-tuning.

[NLP-86] Synchronization and Turn-Taking in Full-Duplex Speech Dialogue Models

【速读】: 该论文试图解决的问题是:如何在全双工语音对话模型(Full-duplex Spoken Dialogue Models, SDMs)中实现类似人类交流的交互动态,特别是模型内部表征如何在对话过程中进行协调与同步。解决方案的关键在于:通过在受控条件下模拟两个预训练Moshi模型之间的全双工对话,利用中心核对齐(Centered Kernel Alignment, CKA)度量不同时间滞后下的表征同步性,并借助因果LSTM模型从说话方和听者视角分析延迟内部激活状态中的前瞻性换轮线索。研究发现,在无噪声条件下存在强表征同步性(峰值出现在零滞后),且随着噪声增加而退化;同时表明内部状态编码了支持提前预测换轮行为的信息,从而揭示了SDMs中潜在的协同机制与前瞻能力。

链接: https://arxiv.org/abs/2605.20356
作者: Pablo Riera,Pablo Brusco,Cristina Kuo,Marcelo Sancinetti,S.R.K. Branavan
机构: ASAPP Inc., USA; Departamento de Computación, FCEyN, Universidad de Buenos Aires, Argentina
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Full-duplex spoken dialogue models (SDMs) can listen and speak simultaneously, enabling interaction dynamics closer to human conversation than turn-based systems. Inspired by neural coupling in human communication, we study how such models coordinate their internal representations during interaction. We simulate full-duplex dialogues between two instances of the pretrained \textitMoshi model under controlled conditions, manipulating channel noise and decoding bias. Synchronization is measured using Centered Kernel Alignment (CKA) across temporal lags, while anticipatory turn-taking cues are probed from delayed internal activations using causal LSTM models, from both speaker and listener perspectives. We find strong representational synchronization under no noise conditions, peaking near zero lag and degrading with noise, and we show that internal states encode anticipatory information that supports turn-taking prediction ahead of time.

[NLP-87] Mix-Quant: Quantized Prefilling Precise Decoding for Agent ic LLM s

【速读】: 该论文旨在解决大语言模型(LLM)代理在复杂任务执行过程中因多轮交互、工具调用和记忆检索等操作导致的输入侧计算开销过大问题,尤其聚焦于长上下文场景下预填充(prefilling)阶段成为推理瓶颈的挑战。其解决方案的关键在于提出一种分阶段感知的量化框架 Mix-Quant:通过分析发现,预填充阶段存在显著的量化冗余,可采用高吞吐量的 NVFP4 量化方式实现加速,而解码阶段则保持 BF16 精度以确保任务性能不受损;这种将预填充加速与解码质量解耦的设计,结合硬件友好的 NVFP4 执行机制,在不显著降低任务准确率的前提下,实现了高达 3 倍的预填充阶段速度提升。

链接: https://arxiv.org/abs/2605.20315
作者: Haiquan Lu,Zigeng Chen,Gongfan Fang,Xinyin Ma,Xinchao Wang
机构: National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:LLM agents have recently emerged as a powerful paradigm for solving complex tasks through planning, tool use, memory retrieval, and multi-step interaction. However, these agentic workflows often introduce substantial input-side overhead, making the compute-intensive prefilling stage a key bottleneck in long-context, multi-turn inference. In this work, we propose Mix-Quant, a simple and effective phase-aware quantization framework for fast agentic inference. We first investigate FP4 quantization in agentic LLM workflows and observe that quantizing the entire inference process can incur significant performance degradation. In contrast, the prefilling stage exhibits substantial quantization redundancy and can therefore be quantized with minimal accuracy loss, despite being the dominant source of computation. Based on this insight, we apply high-throughput NVFP4 quantization to the prefilling phase while preserving BF16 precision for decoding. By decoupling prefilling acceleration from decoding quality, Mix-Quant combines phase-aware algorithmic quantization with hardware-efficient NVFP4 execution to alleviate the inference bottleneck in LLM agents. Extensive experiments across long-context and agentic benchmarks demonstrate that Mix-Quant largely preserves task performance while delivering significant efficiency improvements, achieving up to a 3x speedup during prefilling.

[NLP-88] Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding

【速读】: 该论文试图解决当前时间序列基础模型忽视文本信息以及多模态模型在训练和评估上的局限性问题。现有方法要么仅处理数值序列,要么在预训练阶段将语言模型作为后验适配模块,导致其文本表示未考虑时序数据;同时,这些模型通常仅与同类多模态基线对比,缺乏与各自领域最优单模态基础模型的公平比较。解决方案的关键在于提出Chronicle——一个从零开始联合预训练的324M参数解码器架构,统一处理自然语言和时间序列数据:两种模态共享相同的Transformer块、注意力机制和残差流,通过大量单模态批次训练使跨模态能力自发涌现,并辅以短时对齐阶段实现融合。实验表明,Chronicle在19项自然语言理解任务上达到Gemma-3-270M-PT水平,在24个UCR/UEA时间序列分类任务中刷新冻结嵌入性能纪录,并在Time-MMD多模态预测任务上超越所有监督融合基线,验证了统一架构的有效性和必要性。

链接: https://arxiv.org/abs/2605.20268
作者: Paul Quinlan,Jeremy Levasseur,Qingguo Li,Xiaodan Zhu
机构: InertialAI; Queen’s University (皇后大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Real-world time series come with text: metadata, descriptions, news, reports. Yet time series foundation models process numerical sequences in isolation, and the multimodal text-and-time-series models that attempt to bridge the two all adapt a pretrained language model post hoc, inheriting representations shaped without ever seeing temporal data. These models are also evaluated almost exclusively against other multimodal baselines, not against the strongest unimodal foundation models in either domain, leaving open whether joint training is needed at all. We present Chronicle, a compact 324M-parameter decoder-only transformer trained from scratch on natural language and time series within a single unified architecture. Both modalities share the same transformer blocks, attention mechanism, and residual stream; the bulk of pretraining uses unimodal batches so cross-modal capability emerges purely from shared parameters, with a short alignment stage that interleaves the two. To our knowledge, Chronicle is the first model jointly pretrained on text and time series from scratch, and the first multimodal model evaluated against dedicated foundation models in both domains. It matches Gemma-3-270M-PT on 19 NLU tasks, sets a new bar for frozen-embedding time series classification on 24 UCR/UEA datasets, and produces multimodal forecasts on Time-MMD that beat every supervised fusion baseline, all from a single backbone.

[NLP-89] CP-MoE: Consistency-Preserving Mixture-of-Experts for Continual Learning

【速读】: 该论文旨在解决大语言模型(LLM)和视觉-语言模型(VLM)在持续学习过程中面临的灾难性遗忘(catastrophic forgetting)问题。现有基于LoRA的MoE(Mixture-of-Experts)持续学习方法存在根本性权衡:要么过度隔离专家,限制跨任务知识迁移;要么允许特定任务更新覆盖重要参数,导致严重遗忘。解决方案的关键在于提出CP-MoE框架,其核心创新是引入一个“瞬态专家”(transient expert),用于捕获早期任务特定更新并引导其整合进稳定专家中。具体包括两个机制:一是保持一致性路由偏置(consistency-preserving routing bias),利用瞬态专家估计与稳定专家的表示相似性,优化专家选择;二是瞬态专家引导的正则化机制(transient expert-guided regularisation),在合并过程中选择性保护历史关键参数。这两个组件协同作用,在减少参数干扰的同时保留跨任务知识迁移能力。实验表明,CP-MoE在LLM和VLM的单模态与多模态持续学习基准上均优于现有方法,尤其在SuperNI和VQA v2数据集上实现了最先进的性能和更强的零样本迁移能力。

链接: https://arxiv.org/abs/2605.20247
作者: Yang Liu,Toan Nguyen,Flora D. Salim
机构: University of New South Wales (新南威尔士大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Catastrophic forgetting remains a major obstacle to continual learning in large language models (LLMs) and vision–language models (VLMs). Although Mixture-of-Experts (MoE) architectures offer an efficient path to scaling, existing LoRA-based MoE continual learning methods still face a fundamental trade-off: they either isolate experts too aggressively, limiting knowledge transfer across tasks, or allow task-specific updates to overwrite important existing parameters, leading to severe forgetting. To address this, we propose CP-MoE, a continual learning framework built around a transient expert that captures early task-specific updates and guides their integration into stable experts. CP-MoE introduces a consistency-preserving routing bias, which uses the transient expert to estimate representation similarity with stable experts and steer routing towards more compatible expert selection, and a transient expert-guided regularisation mechanism, which selectively protects important historical parameters during merging. Together, these components reduce parameter interference and forgetting while preserving cross-task knowledge transfer. We validate CP-MoE on both unimodal and multimodal continual learning benchmarks with LLM-based and VLM-based MoE models. On SuperNI benchmark, spanning diverse sequential language tasks, CP-MoE achieves state-of-the-art performance and stronger zero-shot transfer to unseen tasks. On VQA v2 dataset, it scales effectively to multimodal visual reasoning, consistently reduces forgetting, and outperforms strong MoE baselines.

[NLP-90] Lean Refactor: Multi-Objective Controllable Proof Optimization via Agent ic Strategy Search

【速读】: 该论文试图解决的是Lean证明重构中的三个核心问题:1)证明重构需同时优化多个目标(如证明长度、编译成本和版本兼容性,这些目标常存在冲突);2)Lean库版本脆弱,而大型语言模型(LLM)发布时无法感知Lean/Mathlib的具体版本;3)基于训练的重构流程需随每次LLM更新重复微调,难以适应模型迭代和Lean的版本演进。解决方案的关键在于提出Lean Refactor框架——一个可插拔的检索增强型智能体(retrieval-augmented agentic)系统,其核心创新是通过从一个结构化数据库中检索多目标重构策略(每条策略附带版本支持信息和预期编译成本降低量),引导冻结的LLM进行可控且版本鲁棒的重构。实验表明,该方法在竞赛基准上实现超过70%的token级压缩,在研究仓库上提升超20%,并带来高达60%的编译时间减少,且版本过滤后的检索显著提升了目标版本上的压缩效果,重构后的miniF2F证明展现出更强的零样本版本迁移能力。

链接: https://arxiv.org/abs/2605.20244
作者: Jialin Lu,Soonho Kong,Rodrigo Stehling,Kaiyu Yang,Zhangyang Wang,Weiran Sun,Wuyang Chen
机构: Simon Fraser University (西蒙菲莎大学); Amazon Web Services (亚马逊云服务); MiroMind; University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:We present Lean Refactor, a plug-and-play retrieval-augmented agentic framework for multi-objective, controllable, and version-robust refactoring of Lean proofs. LLM-generated proofs are notoriously correct-but-verbose and brittle across library versions, yet existing refactoring works overlook three practical challenges: 1) Lean refactoring is natively multi-objective (proof length, compilation cost, and version compatibility are often in tension); 2) Lean repositories have fragile compatibility, whereas LLM releases are unaware of Lean/Mathlib versions; 3) Training-based pipelines require repeated fine-tuning with each new LLM release, scaling neither with model churn nor with Lean’s release cycle. Lean Refactor steers a frozen agentic LLM with retrievals from a curated database of multi-objective refactoring strategies, each densely annotated with metadata such as supported Lean/Mathlib versions and expected compilation-cost reduction. Experiments show over 70% token-level compression on competition benchmarks, over 20% on research repositories, and up to 60% compilation-time reduction, outperforming prior work and Claude Code. Version-filtered retrieval further improves compression on the target Lean version, and refactored miniF2F proofs exhibit stronger zero-shot version transfer to future Lean releases than their unrefactored counterparts.

[NLP-91] Geometry-Lite: Interpretable Safety Probing via Layer-Wise Margin Geometry

【速读】: 该论文试图解决的问题是:当前基于隐藏状态的提示级安全探测器(prompt-level safety probes)虽然在平均检测性能上表现良好,但其层间几何结构如何支撑安全判断、哪些几何特征有助于降低假阳性率(false positive rate, FPR),以及在基准分布偏移(benchmark shift)下哪些几何偏差保持稳定等问题仍不明确。解决方案的关键在于提出 Geometry-Lite——一种紧凑的提示级探测模型,它将每一层最终提示词表示映射为三类读出机制(中心点、局部邻域和监督线性边界)下的带符号边缘值(signed margins),并通过边界位置、层间变化和粗略形状三个维度对边缘谱进行总结。实验表明,安全证据主要由持续存在的边界位置几何结构决定,即最终或极端边缘值及不安全侧层占据主导;而层间变化量(finite-difference drift)和结构摘要对整体 AUROC 贡献较小,但在低 FPR 阈值下可提供微小召回提升。此外,在基准偏移场景中,优化的线性边界在训练混合数据上表现尖锐,而类别条件均值几何结构在预定义困难测试子集上更具鲁棒性。这揭示了提示级安全信号并非主要依赖层间动态演化,而是由稳定且决策关键的层内边缘几何构成。

链接: https://arxiv.org/abs/2605.20241
作者: Woo Seob Sim,Yu Rang Park
机构: Yonsei University (延世大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Prompt-level safety probes for large language models use hidden-state representations to separate safe from unsafe prompts, but strong average detection performance does not explain the geometry of this separation. In particular, it remains unclear how safety evidence is formed across layers, which aspects of that layer-wise geometry support low-false-positive decisions, and which geometric biases remain stable under benchmark shift. We study this as an empirical decomposition problem and introduce Geometry-Lite, a compact prompt-level probe that maps each layer’s final prompt-token representation to signed margins under centroid, local-neighborhood, and supervised linear-boundary readouts, then summarizes the resulting margin profiles by boundary position, layer-to-layer change, and coarse shape. Across nine instruction-tuned backbones ( 1.2 B-- 70 B) and seven safety benchmarks, Geometry-Lite improves over single-layer probes while remaining close to raw multi-layer score stacking, making it a useful instrument for analyzing the multi-layer safety signal. The decomposition shows that safety evidence is expressed primarily through persistent boundary-position geometry: final or extremal margins and unsafe-side layer occupancy dominate aggregate detection performance. In contrast, finite-difference drift and structural summaries add little to pooled AUROC, although drift can provide small recall-oriented corrections under shifted low-FPR thresholds. Under benchmark shift, optimized linear boundaries are sharp on the training mixture, whereas class-conditional mean geometry retains separation more reliably on a predefined hard held-out subset. Overall, prompt-level safety evidence is not primarily a layer-to-layer motion signal, but a persistent layer-wise margin geometry whose useful components and readout-level biases become visible in decision-critical regimes.

[NLP-92] Under Pressure: Emotional Framing Induces Measurable Behavioral Shifts and Structured Internal Geometry in Small Language Models

【速读】: 该论文试图解决的问题是:情感化提示框架(emotionally framed evaluation follow-ups)如何影响小型本地部署语言模型的行为及其内部表征(internal representations),特别是这些表征是否能被识别为与情绪相关的可测量方向。解决方案的关键在于设计了一个包含八种不同情绪框架(包括冷静、压力、紧迫感、赞赏、羞耻、好奇心、鼓励和威胁)的基准实验,通过分析Qwen 3.5系列小模型(0.8B和2B参数规模)在编码任务中的表现与激活模式变化,发现:1)压力条件最易引发“捷径标记”(shortcut markers)和过拟合现象,而冷静和好奇心条件更能维持显式诚实性;2)所有非基线条件下的“冷静相对方向向量”均在最终Transformer层达到峰值;3)主成分分析(PCA)揭示了第23层方向向量中存在一个主导成分(解释方差59.5%),其与人工标注的正负情绪标签高度对齐(余弦相似度0.951),且不同情绪框架在内部表征上具有可区分性(如批准与紧迫感几乎一致,而好奇心则明显偏离紧迫感)。这表明小型开源模型中存在可测量的提示敏感控制方向,但并未证明模型具备内在情感状态。

链接: https://arxiv.org/abs/2605.20202
作者: Rana Muhammad Usman
机构: Independent Researcher
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 pages, 4 figures. Exploratory empirical study with fully local experiments on small open language models. Code and data: this https URL

点击查看摘要

Abstract:I study whether emotionally framed evaluation follow-ups change both the behavior and the calm-relative internal representations of small, locally deployed language models. Our main benchmark uses Qwen 3.5 0.8B on four impossible-constraint coding tasks and eight follow-up framings: calm, pressure, urgency, approval, shame, curiosity, encouragement, and threat. In the 0.8B eight-condition sweep (160 conversations), pressure produces the strongest shortcut markers (11/20 runs) and the clearest overfit pattern (3/20), while calm and curiosity preserve explicit honesty more often (7/20 and 6/20). For all seven non-baseline conditions, the corresponding calm-relative direction vectors peak at the final transformer layer. An exploratory PCA of the layer-23 direction vectors reveals a dominant first component (59.5% explained variance) aligned with a hand-labeled positive/negative split (cosine alignment 0.951); approval and urgency are nearly identical internally (cosine 0.957), whereas curiosity points away from urgency (-0.252). In a separate calm-vs.-pressure rerun used for scale comparison, Qwen 3.5 2B shows higher honest rates under calm framing and directionally consistent activation steering on a small 4-prompt A/B probe, whereas the 0.8B steering result reverses. I interpret these results as evidence for measurable prompt-sensitive control directions in small open models, while stopping short of claiming intrinsic emotional states.

[NLP-93] Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning ACL2026

【速读】: 该论文试图解决大语言模型在长上下文任务中因复杂推理需求而导致性能下降的问题,特别是模型在使用完整长文本输入时表现不佳,而实际上仅需其中一小部分(即代理上下文,proxy context)即可完成推理。解决方案的关键在于提出一种名为ProxyCoT的新型训练框架:首先通过强化学习或知识蒸馏从教师模型中获取高质量的思维链(chain-of-thought, CoT)推理轨迹,这些轨迹基于短代理上下文生成;随后通过监督微调将这些推理轨迹“锚定”到完整的长上下文中,从而将短上下文中的推理能力迁移至长上下文。实验表明,ProxyCoT在多个数据集上均优于强基线方法,且计算开销更低,并能泛化到域外任务。

链接: https://arxiv.org/abs/2605.20201
作者: Miao Li,Irina Saparina,Alexander Gurung,Mirella Lapata
机构: The University of Edinburgh (爱丁堡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Long, ACL 2026 (Main conference)

点击查看摘要

Abstract:Recent large language models support inputs of up to 10 million tokens, yet they perform poorly on long-context tasks that require complex reasoning. Such tasks can be solved using only a subset of the input – a proxy context – rather than the full sequence. Despite sharing the same underlying reasoning process, models exhibit a significant performance disparity between proxy and full contexts. To improve long-context reasoning, we propose ProxyCoT, a novel training framework that transfers reasoning capabilities from short proxy contexts to full long contexts. Specifically, we first obtain high-quality chain-of-thought reasoning traces on proxy contexts through reinforcement learning or distillation from a larger teacher model, and then ground the generated traces in full long contexts with supervised fine-tuning. Experiments across different datasets demonstrate that ProxyCoT consistently outperforms strong baselines with reduced computational overhead. Furthermore, models trained with ProxyCoT generalize their long-context reasoning capabilities to out-of-domain tasks.

[NLP-94] FlowLM: Few-Step Language Modeling via Diffusion-to-Flow Adaptation

【速读】: 该论文试图解决的问题是:如何在保持生成质量的前提下,显著减少扩散语言模型(diffusion language models)所需的采样步数,从而提升生成效率。现有方法通常依赖大量采样步骤(如2000步)才能获得高质量文本,但计算成本高、速度慢。解决方案的关键在于将预训练扩散语言模型通过高效微调转化为流匹配语言模型(FlowLM),其核心创新是将扩散模型中弯曲的采样轨迹重新对齐为直线流(straight-line flows),从而实现高质量的少步生成(few-step generation)。此外,论文提出了一种更有效的流匹配训练目标——预测干净数据(clean data),以持续引导采样过程逼近真实数据分布,实验证明该方法仅需一半训练轮次即可达到性能饱和,且显著优于从头训练和原始扩散模型。

链接: https://arxiv.org/abs/2605.20199
作者: Runzhe Zhang,Letian Chen,Wenpeng Zhang,Zhouhan Lin,Peilin Zhao
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 26 pages, 11 figures

点击查看摘要

Abstract:We present FlowLM, a flow matching language model transformed from pre-trained diffusion language models via efficient fine-tuning. By re-aligning the curved sampling trajectories of diffusion models into straight-line flows, FlowLM enables high quality few-step generation that rivals or even outperforms the quality of 2,000-step diffusion sampling with very few training epochs. Remarkably, finetuned FlowLM reaches performance saturation with only half as many training epochs as training from scratch, both approaches greatly outperforming the original diffusion model, thereby validating our method. Furthermore, we validate a more effective training objective for flow matching: predicting clean data to consistently guide the sampling process towards the true data distribution. Empirical results demonstrate that our approach is highly effective for high-quality, few-step text generation.

[NLP-95] MedicalBench: Evaluating Large Language Models Toward Improved Medical Concept Extraction

【速读】: 该论文试图解决的问题是:在电子健康记录(Electronic Health Records, EHR)中提取隐含的医学概念(implicit medical concepts)并提供可解释的证据支持,这是当前医疗自然语言处理任务中的关键挑战。现有基准主要关注显式陈述的概念,忽视了临床实践中大量依赖隐含推理的场景。解决方案的关键在于构建一个名为MedicalBench的新基准,它将医学概念提取建模为医学笔记-概念对的验证任务,并结合句子级别的证据识别,从而评估模型在隐式推理和证据溯源方面的能力。该基准基于MIMIC-IV出院小结与人工验证的ICD-10编码,通过多阶段大语言模型(Large Language Model, LLM)筛选、专业标注和专家评审构建,刻意引入隐式正例、语义混淆负例以及LLM与专家判断不一致的案例,以确保评测难度聚焦于真正的医学推理能力。此外,论文定义了两个互补任务:概念提取和句子级证据检索,分别衡量准确性与可解释性,实证表明当前最先进LLMs在此任务上表现有限,且性能不受病历长度影响,说明该基准有效隔离了推理难度而非表面干扰因素。

链接: https://arxiv.org/abs/2605.20197
作者: Zhichao Yang,Gregory D. Lyng,Sanjit Singh Batra,Robert E. Tillman
机构: Optum AI
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Medical concept extraction from electronic health records underpins many downstream applications, yet remains challenging because medically meaningful concepts are frequently implied rather than explicitly stated in medical narratives. Existing benchmarks with human-annotated evidence spans underscore the importance of grounding extracted concepts in medical text. However, they predominantly focus on explicitly stated concepts instead of implicit concepts. We present MedicalBench, a benchmark for medical concept extraction with evidence grounding that evaluates implicit medical reasoning. MedicalBench formulates medical concept extraction as a verification task over medical note-concept pairs, coupled with sentence-level evidence identification. Built from MIMIC-IV discharge summaries and human-verified ICD-10 codes, the dataset is curated through a multi-stage large language model (LLM) triage pipeline followed by medical annotation and expert review. It deliberately includes implicit positives, semantically confusable negatives, and cases where LLM judgments disagree with medical expert assessments. We define two complementary evaluation tasks: (1) medical concept extraction and (2) sentence-level evidence retrieval, enabling assessment of both correctness and interpretability. Benchmarking state-of-the-art LLMs reveals that performance remains modest, highlighting the difficulty of extracting implicitly expressed concepts. We further show that performance is largely invariant to note length, indicating that MedicalBench isolates reasoning difficulty rather than superficial confounders. MedicalBench provides the first systematic benchmark for implicit, evidence-grounded medical concept extraction, offering a foundation for developing medical language models that can both identify medically relevant concepts and justify their predictions in a transparent and medically faithful manner.

[NLP-96] Data Scaling as Progressive Coverag e of a Predictive Contribution Spectrum

【速读】: 该论文试图解决的问题是:真实数据的规模定律(scaling laws)是否仅由词频尾部决定,还是受到潜在预测贡献谱(latent predictive contribution spectrum)的逐步覆盖机制影响。解决方案的关键在于引入一种基于后缀自动机(suffix automaton)表示的、数据内在的全局KL预测贡献谱(global-KL predictive contribution spectrum),其中每个状态的贡献由其经验频率与其相对于全局下一个词基线的KL散度乘积决定。研究发现,该谱的尾部斜率与固定小规模GPT学习器的经验数据缩放指数高度相关;进一步通过匹配观测到的超额损失与预处理的1000k全局KL谱的残余尾部质量,定义了每个训练规模N下的有效截断秩K(N)。实证表明,log K与log N近似线性关系,原始谱和光滑谱的R²分别达到约0.96和0.90,这为一个简单机制提供了强实证支持:训练规模推动有效前沿在预测状态谱中推进,而该谱的残余尾部质量可追踪剩余超额损失。

链接: https://arxiv.org/abs/2605.20196
作者: Zihui Song,Shihao Ji,Hongxi Li,Shuaizhi Cheng,Chunlin Huang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages,6 figures

点击查看摘要

Abstract:We investigate the hypothesis that real-data scaling laws are governed by progressive coverage of a latent predictive contribution spectrum rather than by token-frequency tails alone. We work with a suffix-automaton representation of text corpora and define a data-intrinsic global-KL predictive contribution spectrum, in which each state contributes according to its empirical mass times its KL deviation from a global next-token baseline. Across 12 real corpora, the tail slope of this spectrum is already strongly correlated with the empirical data-scaling exponent of a fixed small GPT learner. We then go beyond slope correlation and define, for each training size N, an effective truncation rank K(N) by matching the observed excess loss to the residual tail mass of the prepared 1000k global-KL spectrum. Empirically, log K is close to linear in log N, with pooled R^2 about 0.96 for the raw spectrum and R^2 about 0.90 for the smoothed spectrum. These findings provide strong empirical support for a simple mechanism picture: training scale advances an effective frontier through a predictive state spectrum, and the residual tail mass of that spectrum tracks the remaining excess loss.

[NLP-97] Pseudo-Siamese Network for Planning in Target-Oriented Proactive Dialogues ICASSP2026

【速读】: 该论文旨在解决目标导向型主动对话系统中对话路径规划(dialogue path planning)这一核心问题,即如何在对话过程中有策略地引导对话朝着预设目标推进,同时保持自然性和交互性。传统方法往往忽视了路径规划的系统性设计,导致生成响应时缺乏方向性和效率。本文的关键解决方案是提出一种前向聚焦的双向伪西门子网络(Forward-Focused Bidirectional Pseudo-Siamese Network, FF-BPSN),其创新点在于:采用两个相同的基于Transformer的解码器分别进行前向和后向规划,并引入一个前向聚焦模块融合双向信息,最终构建以当前对话状态为起点、优先考虑前向信息的最优路径。该路径不仅利用了双向规划的优势,还确保了对目标导向性的强约束,进而显著提升了语言模型生成响应的质量与目标达成率。实验表明,FF-BPSN在DuRecDial和DuRecDial 2.0数据集上均达到当前最优性能,验证了其在目标导向主动对话系统中的有效性。

链接: https://arxiv.org/abs/2605.20195
作者: Xinyue Kang,Maodong Li,Yibin Zheng,Fang Kong
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ICASSP2026

点击查看摘要

Abstract:A target-oriented proactive dialogue system is designed to steer conversations toward predefined targets while actively providing suggestions. The core paradigm of such a system is to plan a reasonable dialogue path and subsequently guide language models (e.g., pre-trained or large language models) to generate responses, where dialogue path planning serves as the central component-a novel yet under-explored problem. In this work, we propose a Forward-Focused Bidirectional Pseudo-Siamese Network (FF-BPSN) for dialogue path planning toward predefined dialogue targets. FF-BPSN employs two identical transformer-based decoders for forward and backward planning, together with a forward-focused module that integrates bidirectional information to construct the final forward path. This path benefits from bidirectional planning while prioritizing forward information. We then employ the planned path to guide language models in response generation. Extensive experiments on DuRecDial and DuRecDial 2.0 demonstrate that FF-BPSN achieves state-of-the-art performance in dialogue path planning and significantly enhances the effectiveness of target-oriented proactive dialogue systems.

[NLP-98] Parallel LLM Reasoning for Bias-Resilient Robust Conceptual Abstraction

【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在分析长文档时因上下文推理能力受限而导致的累积性分析偏差、遗漏错误(omission error)和过度泛化问题。其解决方案的关键在于提出了一种结构化框架,结合并行分块处理(parallel chunk-level processing)与证据锚定整合(evidence-anchored consolidation):首先将文本分割为语义连贯的块并在并行环境中独立处理,以消除早期内容对后续分析的主导影响;随后通过显式证据锚定与优先级排序机制对生成结果进行系统性整合,从而降低概念主导性和过度泛化,提升解释的可追溯性与准确性。实验表明,该方法显著减少了约84%的遗漏错误、提升了最高达130%的证据可追溯性,并减少高达91%的无依据陈述,尤其在小型模型中效果更为突出,验证了并行分块与整合策略在实现可靠且可扩展文本分析中的核心作用。

链接: https://arxiv.org/abs/2605.20194
作者: Aisvarya Adeseye,Jouni Isoaho,Adeyemi Adeseye
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to be Published in 12th Intelligent Systems Conference 2026, 3-4 September 2026 in Amsterdam, The Netherlands

点击查看摘要

Abstract:Large language models (LLMs) have been increasingly used to analyze text. However, they are often plagued with contextual reasoning limitations when analyzing long documents. When long documents are processed sequentially, early or dominant concepts can overshadow less visible but meaningful interpretations, leading to cumulative analytical bias, omission error, and over-generalization. Additionally, independently generated outputs are often merged without systematic grounding, introducing redundancy, conceptual drift, and unsupported claims. This study proposes a structured framework combining parallel chunk-level processing with evidence-anchored consolidation. Texts are first divided into semantically coherent chunks and processed independently in parallel to remove influence from earlier processing. The independently generated interpretations are then consolidated using explicit evidence anchoring and prioritization that reduces dominance and over-generalization while improving traceability. Experiments with multiple model types and sizes indicate that parallel processing significantly reduces omission error by approximately 84%, increases evidence traceability by up to 130%, and reduces unsupported claims by up to 91%. Smaller models benefited most, suggesting that efficient parallel chunking and consolidation play a critical role in achieving reliable and scalable textual analysis.

[NLP-99] Improving Quantized Model Performance in Qualitative Analysis with Multi-Pass Prompt Verification

【速读】: 该论文旨在解决低比特量化大型语言模型(LLMs)在定性分析中因压缩导致幻觉增多、结果不稳定的问题,尤其是面对非专家表述模糊时表现更差。其关键解决方案是提出一种“量化感知的多轮提示验证方法”(quantization-aware multi-pass prompt verification method),通过分步引导模型推理、剔除不可靠内容并逐轮验证,显著降低幻觉率并提升准确性。实验表明,8-bit模型最接近人工标注的黄金标准(GSGT),而4-bit及更低比特模型虽性能下降,但在该方法干预下实现稳定性和准确性的改善;同时发现相同比特位宽下不同量化类型对性能影响显著,说明量化策略需精细化设计。此方法使低成本低资源模型更适合高质量定性研究场景。

链接: https://arxiv.org/abs/2605.20193
作者: Aisvarya Adeseye,Jouni Isoaho,Adeyemi Adeseye
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to publish in 12th Intelligent Systems Conference 2026; 3-4 September 2026 in Amsterdam, The Netherlands

点击查看摘要

Abstract:Quantized Large Language Models (LLMs) are used more often in qualitative analysis because they run fast and need fewer computing resources. This study examines how different lower bits quantization levels (8-bit, 4-bit, 3-bit, and 2-bit) and quantization types affect the performance of LLaMA-3.1 (8B) on qualitative analysis. The study uses expert and non-expert responses from 82 interview transcripts. Low-bit models often produce higher levels of hallucinations and unstable results, especially when reading non-expert language with unclear terms. To improve performance, we propose a quantization-aware multi-pass prompt verification method. This method guides the model through controlled steps that reduce hallucinations. It removes unreliable content and passes the results to the next transcript after verification, improving accuracy. To validate performance, human coders analyzed transcripts using NVivo and BF16 LLaMA. BF16 LLaMA-3.1 produced high-precision output but had semantic drift and hallucination. These errors were corrected manually. The corrected BF16 output and NVivo human coding were combined to create a gold-standard ground truth (GSGT) for thematic extraction and frequency analysis. The results show that 8-bit models stay closest to the GSGT. The 4-bit models lose accuracy but become stable when the proposed method is applied. The 3-bit and 2-bit models drop in performance because of heavy compression, but they improve with the proposed prompt design and verification. The study also finds that models at the same bit level behave differently depending on quantization type. Overall, the method helps low-resource LLMs become more stable, accurate, and suitable for qualitative research at lower cost.

[NLP-100] Leverag ing Large Language Models for Sentiment Analysis: Multi-Modal Analysis of Decentralands MANA Token

【速读】: 该论文试图解决的问题是:如何通过融合社区情绪与多模态金融数据来提升对虚拟世界经济中加密货币价格的预测准确性。其解决方案的关键在于构建一个融合了基于BERT的大语言模型提取的Discord社区情绪得分、交易量和市值等多模态特征的LSTM预测模型,相较于仅使用历史价格信息的基线模型,该多模态架构显著提升了预测精度,验证了社区情绪信号在虚拟经济预测中的价值。

链接: https://arxiv.org/abs/2605.20192
作者: Xintong Wu,Peiting Tsai,Jing Yuan,Michael Yu,Greg Sun,Luyao Zhang
机构: Duke Kunshan University (昆山杜克大学); Microsoft China (微软中国)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Decentraland, a decentralized virtual reality platform operating within the expanding Metaverse ecosystem, utilizes its native MANA token to facilitate virtual asset transactions and governance. This study investigates the integration of Discord community sentiment with multi-modal financial data to enhance cryptocurrency price prediction within virtual world economies. We address: (1) identifying sentiment patterns within Decentraland’s Discord community, and (2) evaluating the impact of multi-modal features on token return forecasting. Using a BERT-based large language model for sentiment analysis, we develop two LSTM architectures: a baseline incorporating historical prices and a multi-modal variant integrating sentiment scores, trading volume, and market capitalization. Results indicate predominantly neutral community sentiment with a positive skew. The multi-modal model significantly outperforms the price-only baseline in prediction accuracy. These findings demonstrate the predictive value of community-derived signals for virtual economy forecasting and establish a foundation for future research at the intersection of immersive virtual environments, natural language processing, and cryptocurrency market analysis.

[NLP-101] Shiny Stories Hidden Struggles: Investigating the Representation of Disability Through the Lens of LLM s

【速读】: 该论文试图解决的问题是:大型语言模型(LLM)在模拟残障人士视角时是否存在理想化倾向,以及这种倾向是否会导致对残障群体的刻板印象或偏见,从而掩盖其真实生活挑战。解决方案的关键在于通过对比由LLM生成的社交媒体内容与真实残障人士撰写的帖子,在情感基调、情绪倾向及代表性词汇和主题上进行系统分析,揭示模型在表征残障经历时的偏差。研究发现,LLM倾向于过度理想化残障者的体验,忽略其复杂性和现实困境,并且在比较残障与非残障个体的表述时存在负向偏见,例如将职业和娱乐等话题更多关联于非残障人群,进一步强化了排斥性叙事,暴露了当前LLM在反映社会多样性尤其是边缘群体经验方面的局限性。

链接: https://arxiv.org/abs/2605.20191
作者: Marco Bombieri,Simone Paolo Ponzetto,Marco Rospocher
机构: University of Verona (维罗纳大学); University of Trento (特伦托大学); University of Mannheim (曼海姆大学)
类目: Computation and Language (cs.CL)
备注: Accepted for publication in ACM Transactions on Intelligent Systems and Technology

点击查看摘要

Abstract:Modern Large Language Models (LLMs) have recently attracted much attention for their ability to simulate human behavior and generate text that reflects personas and demographic groups. While these capabilities can open up a multitude of diverse applications across fields, it is crucial to examine how such models represent various target groups since LLMs can perpetuate and amplify biases or discrimination against historically marginalized communities or, alternatively, as a result of debiasing efforts, overcorrect by portraying overly positive stereotypes. This overcompensation can idealize these groups, erasing the complexities and challenges they face in favor of unrealistic depictions. In this paper, we investigate how LLMs represent disability by simulating the perspectives of individuals with disabilities in generating social media posts. These posts are then compared with those written by real people with disabilities, focusing on emotional tone, sentiment, and representative words and themes. Our analysis reveals two key findings: (1) LLMs often idealize the experiences of people with disabilities, producing overly positive stereotypes that, despite appearing uplifting, fail to authentically capture their lived realities; and (2) a comparative analysis of posts simulating individuals with and without disabilities highlights a negative bias, where certain topics, such as career and entertainment, are disproportionately associated with nondisabled individuals. This reinforces exclusionary narratives and over-idealized portrayals of disability, misrepresenting the actual challenges faced by this community. These findings align with broader concerns and ongoing research showing that LLMs struggle to reflect the diverse realities of society, particularly the nuanced experiences of marginalized groups, and underscore the need for critical scrutiny of their representations.

[NLP-102] Diverge to Induce Prompting: Multi-Rationale Induction for Zero-Shot Reasoning AACL2025

【速读】: 该论文试图解决标准链式思维(Chain-of-Thought, CoT)提示中因无引导推理路径不稳定而导致的性能局限问题。现有方法虽通过提取单一推理策略来提升稳定性,但单一策略难以适应多样化任务需求。其解决方案的关键在于提出“发散诱导提示”(Diverge-to-Induce Prompting, DIP)框架:首先让大语言模型(LLM)为每个问题生成多个高层面的多样化推理方案(rationales),随后将每个方案扩展为详细的分步计划草稿,最后通过多计划归纳机制整合成最终推理路径。该方法在不依赖资源密集型采样的前提下显著提升了零样本推理准确性,验证了多计划诱导在提示驱动推理中的有效性。

链接: https://arxiv.org/abs/2602.08028
作者: Po-Chun Chen,Hen-Hsen Huang,Hsin-Hsi Chen
机构: National Taiwan University (台湾大学); Academia Sinica (中央研究院); AI Research Center (AINTU) (人工智能研究中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to Findings of IJCNLP-AACL 2025

点击查看摘要

Abstract:To address the instability of unguided reasoning paths in standard Chain-of-Thought prompting, recent methods guide large language models (LLMs) by first eliciting a single reasoning strategy. However, relying on just one strategy for each question can still limit performance across diverse tasks. We propose Diverge-to-Induce Prompting (DIP), a framework that first prompts an LLM to generate multiple diverse high-level rationales for each question. Each rationale is then elaborated into a detailed, step-by-step draft plan. Finally, these draft plans are induced into a final plan. DIP enhances zero-shot reasoning accuracy without reliance on resource-intensive sampling. Experiments show that DIP outperforms single-strategy prompting, demonstrating the effectiveness of multi-plan induction for prompt-based reasoning.

信息检索

[IR-0] SG-LegalCite: A Principle-Augmented Benchmark for Legal Citation Retrieval in Singapore Law

链接: https://arxiv.org/abs/2605.21057
作者: Shannon Lee Yueh Ern,Kaidong Feng,Yingpeng Du,Chloe Lee En Jia,Zhu Sun
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Legal citation in common-law systems depends not only on factual similarity, but also on the legal principle for which a precedent is invoked. However, existing benchmarks for legal citation retrieval use case facts, citation context, or full judgments as inputs, where the governing legal principle is often missing or only implicitly expressed and entangled with broader context. As a result, models may retrieve precedents that are factually similar yet doctrinally irrelevant. This limitation is particularly consequential in Singapore, where the legal system has evolved independently: only domestic precedents are binding, while foreign authorities serve merely as persuasive references. Thus, we propose a new retrieval paradigm that ranks cited cases based on queries integrating case facts and explicit legal principles, inspired by real-world legal reasoning workflows. To support this paradigm, we introduce SG-LegalCite, a dataset of 100,890 case-principle pairs extracted from 8,523 Singapore Supreme Court judgments spanning from 2000 to 2025. Experiments across 11 baselines demonstrate the effectiveness of our principle-augmented retrieval paradigm, showing that explicit legal principles provide strong discriminative signals for legal citation retrieval.

[IR-1] MemConflict: Evaluating Long-Term Memory Systems Under Memory Conflicts

链接: https://arxiv.org/abs/2605.20926
作者: Zhen Tao,Jinxiang Zhao,Peng Liu,Dinghao Xi,Yanfang Chen,Wei Xu,Zhiyu Li
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Long-term memory systems enable conversational agents based on large language models (LLMs) to retain, retrieve, and apply user-specific information across multi-session interactions. However, existing evaluations mainly assess outcome-level performance or temporal updating, providing limited insight into how systems retrieve and rank temporally valid, factually correct, and contextually applicable memory evidence under conflicting alternatives. To address this gap, we propose MemConflict, a diagnostic framework that treats memory validity as a query-conditioned fitness-for-use problem. MemConflict formalizes dynamic, static, and conditional conflicts over temporal validity, factual correctness, and contextual applicability. It simulates controlled long-horizon histories from structured user profiles, introduces cross-session conflicts, and injects semantically similar distractors to create competition among memory candidates. The resulting multi-session dialogue benchmark supports black-box evaluation of final answers and white-box analysis of supporting-memory retrieval and ranking. Experiments on six representative long-term memory systems show uneven strengths across conflict types, with answer correctness often diverging from memory retrieval and ranking. Sensitivity analyses reveal that longer histories, distractors, implicit queries, and larger conflict distances degrade performance. Diagnostics show failures from missing supporting memories and ineffective use of retrieved memories. Collectively, MemConflict advances principled long-term memory governance through retrieval-aware, conflict-aware reliability assessment.

[IR-2] GraphRAG on Consumer Hardware: Benchmarking Local LLM s for Healthcare EHR Schema Retrieval

链接: https://arxiv.org/abs/2605.20815
作者: Peter Fernandes,Ria Kanjilal
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 9 pages, 1 figure, 5 tables

点击查看摘要

Abstract:Graph-based Retrieval Augmented Generation (GraphRAG) extends retrieval-augmented generation to support structured reasoning over complex corpora, but its reliability under resource-constrained, privacy-sensitive deployments remains unclear. In healthcare, where Electronic Health Record (EHR) data is complex and strictly regulated, reliance on cloud-based large language models (LLMs) introduces challenges in cost, latency, and compliance. In this work, we present a systematic evaluation of GraphRAG for EHR schema retrieval using locally deployed open-source LLMs. We implement the Microsoft GraphRAG pipeline on real-world EHR schema documentation and benchmark four models, including Llama 3.1 (8B), Mistral (7B), Qwen 2.5 (7B), and Phi-4-mini (3.8B), each deployed via Ollama on a single consumer GPU (8 GB VRAM). We evaluate indexing efficiency, knowledge graph construction, query latency, answer quality, and hallucination under both global and local retrieval modes. Our results reveal substantial differences: Llama 3.1 produces the richest knowledge graph (1,172 entities), Qwen 2.5 achieves the best answer quality (3.3/5), Phi-4-mini fails to complete the pipeline due to structured-output errors, and Mistral exhibits degenerate repetition behavior. We further show that GraphRAG exhibits a practical capacity threshold, where models below approximately 7B parameters fail to reliably produce valid structured outputs and cannot complete the pipeline. In addition, indexing and answer quality are decoupled across models, and local retrieval consistently outperforms global summarization in both latency and factual grounding, with reduced hallucination. These findings demonstrate that GraphRAG is feasible on consumer hardware while highlighting the importance of model selection and retrieval design for robust deployment in regulated settings.

[IR-3] CALMem : Application-Layer Dual Memory for Conversational AI

链接: https://arxiv.org/abs/2605.20724
作者: Rajendra Narayan Jena,Rajan Padmanabhan,Sankar Arumugam
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Large language models (LLMs) operate within fixed context windows that fundamentally limit conversational continuity. When context fills, compaction discards history irreversibly; when sessions end, all memory resets to zero. Existing solutions-larger context windows, retrieval-augmented generation for knowledge bases, and memory-augmented architectures such as MemGPT-either require model modification, impose provider lock-in, or do not address the compaction continuity problem. We present CALMem (Conversational Application-Layer Memory), an application-layer dual memory architecture that gives LLM-based conversational assistants virtually unbounded effective context without any modification to the underlying model. CALMem combines two complementary memory subsystems: an episodic memory layer built on sliding-window vector embeddings of conversation history, and a semantic memory layer of agent-writable structured facts. A token-budget-adaptive injection mechanism, called the MOIM (Message of Injected Memory), automatically retrieves and injects relevant past context each turn, scaling injection depth inversely with context pressure. A key contribution is intra-session retrieval: compacted away turns from the current session remain searchable, closing a gap unaddressed by prior work. The system is implemented as a pure application layer in a production Rust codebase, is provider-agnostic, and degrades to original LLM behaviour with zero overhead when disabled. We describe the architecture, design decisions, and performance characteristics, and analyse the trade-offs that guided each implementation choice.

[IR-4] DIVE: Embedding Compression via Self-Limiting Gradient Updates

链接: https://arxiv.org/abs/2605.20689
作者: Dongfang Zhao
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:High-dimensional embeddings from large language models impose significant storage and computational costs on vector search systems. Recent embedding compression methods, including Matryoshka-Adaptor (EMNLP 2024), Search-Adaptor (ACL 2024), and SMEC (EMNLP 2025), enable dimensionality reduction through lightweight residual adapters, but their training objectives cause severe overfitting when labeled data is scarce, degrading retrieval performance below the frozen baseline. We propose \textscDIVE (\textbfDimensionality reduction with \textbfImplicit \textbfView \textbfEnsembles), a compression adapter that addresses this failure through two mechanisms. First, a self-limiting hinge-based triplet loss produces zero gradient once a triplet satisfies the margin constraint, bounding the total perturbation applied to the pretrained embedding space. Second, a head-wise NT-Xent contrastive loss treats multiple learned projections of each embedding as implicit views, providing dense self-supervised gradients that compensate for the sparsity of the triplet signal on small datasets. Across six BEIR datasets, \textscDIVE outperforms all three baseline adapters on every dataset and at every evaluated compression ratio, with a 14M-parameter open-source implementation.

[IR-5] Layer-wise Token Compression for Efficient Document Reranking SIGIR2026

链接: https://arxiv.org/abs/2605.20683
作者: Shengyao Zhuang,zhichao Xu,Ivano Lauriola
类目: Information Retrieval (cs.IR)
备注: SIGIR2026 short paper

点击查看摘要

Abstract:Transformer-based document cross-encoder rerankers are a central component of modern information retrieval systems. Despite their success, these models suffer from high computational costs due to processing long query-document sequences at inference time. A known approach to improve efficiency is token compression, which consists of aggregating groups of tokens together in the initial embedding layer, reducing the effective number of tokens, and making the computation faster. While token compression has proven to be successful for bi-encoder retrievers, we empirically observed that this approach may be ineffective for cross-encoder rerankers. In this paper, we propose Layer-wise Token Compression (LTC), which applies adaptive token pooling at intermediate transformer layers. Through extensive ablation studies on MS MARCO passage and document ranking tasks, we demonstrate that compression at middle layers preserves ranking quality while increasing inference QPS by up to 25% for passage ranking and up to 116% for document ranking. We also extend LTC to listwise LLM rerankers and show that the same approach can be easily applied to long-context listwise reranking, where the QPS improvements are even greater. More surprisingly, when applying rerankers trained on short passages to long-document ranking tasks, models trained with compression outperform their uncompressed counterparts, suggesting that compression may act as a beneficial regularizer that encourages length-invariant representations.

[IR-6] Efficient Table QA via TableGrid Navigation and Progressive Inference Prompting ICDAR2026

链接: https://arxiv.org/abs/2605.20254
作者: Amritansh Maurya,Navjot Singh,Mohammed Javed,Omar Moured
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted for Presentation in ICDAR 2026, Vienna, Austria

点击查看摘要

Abstract:Large Language Models (LLMs) have shown promising results on NLP tasks, however, their performance on tabular data still needs research attention, because Table Question-Answering (TQA) requires precise cell retrieval and multi-step structured reasoning. Existing work improves TQA either by fine-tuning or training LLMs on task-specific tabular data, but often lacks verifiable control over how the model navigates tables and derives answers. In this work, we propose a training-free TQA approach with two structured prompting frameworks: TableGrid Navigation (TGN), which iteratively navigates rows and columns via a three-module loop to locate evidence and refine answers, and Progressive Inference Prompting (PIP), which enforces columns identification for explicit progressive row selection constraint according to the query. We evaluate 17 LLMs against 6 baselines on TableBench and FeTaQa dataset. On TableBench, TGN improves over the strongest baseline by 3.8 points, and on FeTaQa, PIP achieves SOTA performance over ReAct and Chain-of-Thought. Beyond inference-time gains, PIP and TGN can also serve as supervision templates to fine-tune small models, narrowing the performance gap to much larger architectures in resource-constrained settings, offering versatile and cost-efficient solution for TQA.

[IR-7] Advanced Scientific Methodology Plays Rossini

链接: https://arxiv.org/abs/2605.20220
作者: Silvia Licciardi,Daniela Macchione,Emmanuel Caronna,Elisa Francomano
类目: ound (cs.SD); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:A musical score provides the essential instructions for its performance while containing indications - at times implicit - regarding the composer’s intentions. The presence of authorial variants, and even more so complex series of revisions associated with a single text, presents a challenging path for analytical study. This research, situated within the application of Scientific Methodologies to Music Philology, proposes a methodological approach oriented toward the structural analysis of one of the many settings composed by Gioachino Rossini on the same Metastasio arietta ``Mi lagnerò tacendo’'. Through Computational Analysis - incorporating parsing, data mining, and graph theory - the melodic, harmonic, and textual compositional choices have been rigorously explored. The results constitute a significant unicum in the field, laying the foundation for a systematic study that supports philological research and paves the way for the use of generative models to investigate the creative process.

人机交互

[HC-0] HITL-D: Human In The Loop Diffusion Assisted Shared Control ICRA2026

链接: https://arxiv.org/abs/2605.21460
作者: Riley Zilka,Sergey Khlynovskiy,Allie Wang,Martin Jagersand
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted for presentation at ICRA 2026

点击查看摘要

Abstract:Autonomous manipulation systems have achieved remarkable capabilities, yet the integration of human expertise with diffusion-based policies in shared control remains relatively unexplored. In this paper, we propose Human-In-The-Loop Diffusion (HITL-D), a shared control framework that enhances user performance in multi-step, insertion, and fine manipulation tasks. HITL-D leverages a novel combination of diffusion-based policies and human control to provide autonomous end effector orientation updates conditioned on a scene point cloud and the Cartesian position of the end effector. This approach reduces the number of joystick control axes required, thereby lowering mental workload. In a multi-task user study with 12 participants, HITL-D reduced average task completion times by 40%, decreased perceived workload by 37%, and improved Likert-scale ratings for independence, intuitiveness, and confidence compared to traditional teleoperation methods. These results demonstrate that HITL-D effectively integrates human expertise with autonomous assistance, improving both objective and subjective aspects of teleoperation.

[HC-1] Designing Conversations with the Dead: How People Engage with Generative Ghosts

链接: https://arxiv.org/abs/2605.21390
作者: Jack Manning,Daniel Sullivan,Dylan Thomas Doyle,Anthony T. Pinter,Jed R. Brubaker
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We examine how people experience two choices in the design of generative ghosts, AI systems that are trained on data of the dead: representation, where an AI speaks about a deceased person in the third person, and reincarnation, where the AI speaks as the deceased in the first person. Through a qualitative user study with 16 participants, we explore how each shaped authenticity, affect, and risk. Reincarnation was preferred for its immediacy, but participants shared fears of over-reliance. Representation was preferred for engaging with memory over conversational presence, though participants often ignored this distinction, engaging in dialogue despite third-person framing. Across both modes, participants privileged affective resonance over factual fidelity. We conclude by showing how factors such as tone, language, and conversational rhythm – factors unique to the user’s memory of the deceased – shape interactions with generative ghosts, and argue that those interactions are always collaborative.

[HC-2] Combating Harms of Generative AI in CS1 with Code Review Interviews and a Flipped Classroom

链接: https://arxiv.org/abs/2605.21374
作者: Peter Fowles,Erik Falor,Sulove Bhattarai,John Edwards,Seth Poulsen
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Background and Context: Large Language Models (LLMs) are more accessible and accurate than ever before, raising significant concerns for computing educators. One major concern is students using LLMs to bypass the effort needed to understand concepts and metacognitive strategies essential for success in computer science. Objectives: We contribute a unique approach to assessing and building up student understanding through weekly oral code review assessments. These formative assessments incentivize students to understand their submitted code, regardless of whether or not the code was generated by AI tools. We also use a flipped classroom to provide time for students to learn concepts outside of class and provide ample time for students to schedule code review interviews. Methods: For this paper, we collected data from three semesters. We analyze student exam scores, keystroke logs, and surveys to understand how the new course policies affected student learning, behavior, and attitudes. Findings: Pairwise comparison of exam results reveals a statistically insignificant increase in average scores for Fall 2025 compared to previous semesters. Keystroke logs show a significant increase in characters pasted per total characters input into coding assignments in Fall 2025, pointing towards higher AI usage. Survey results show positive student sentiment towards code reviews at the end of Fall 2025, with nearly all negative feedback being addressable through better scheduling and more rigorous TA training. Implications: Oral code reviews with a flipped classroom appear to be effective at mitigating harms of LLM use while providing space for students to freely experiment with these tools. Our work suggests that students in Fall 2025 still show adequate understanding of material covered in written exams, despite dramatic increases in LLM usage for coding assignments. Subjects: Human-Computer Interaction (cs.HC) Cite as: arXiv:2605.21374 [cs.HC] (or arXiv:2605.21374v1 [cs.HC] for this version) https://doi.org/10.48550/arXiv.2605.21374 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Peter Fowles [view email] [v1] Wed, 20 May 2026 16:37:07 UTC (207 KB) Full-text links: Access Paper: View a PDF of the paper titled Combating Harms of Generative AI in CS1 with Code Review Interviews and a Flipped Classroom, by Peter Fowles and 4 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.HC prev | next new | recent | 2026-05 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status

[HC-3] Gen-AI-tecture: using generative AI to support architectural students in design tasks

链接: https://arxiv.org/abs/2605.21361
作者: Timo Kapsalis
类目: Human-Computer Interaction (cs.HC)
备注: Pre-print. Submitted to the Journal of Architectural Education

点击查看摘要

Abstract:The “Gen-AI-tecture” project embeds a locally executed, discipline-specific tool into a mixed-methods focus-group design, structured around three research objectives: (a) to evaluate how generative AI tools impact students’ creativity in design-thinking processes and outcomes, (b) to assess whether these tools enhance inclusivity in learning processes, and © to examine how they develop students’ AI-handling skills with a view to boosting future employability. Findings indicate enhanced creative fluency, broadened participation across diverse learner profiles, and strengthened confidence in AI-supported design processes. The study contributes evidence-based guidance for integrating generative-AI workflows into architectural pedagogy, demonstrating how such tools can operationalise constructivist principles of learner-led meaning-making, support connectivist understandings of learning as participation in human-AI networks, and advance universal learning theories by promoting more inclusive, flexible and accessible educational practices for contemporary learners.

[HC-4] he Human-AI Delegation Dilemma: Individual Strategies Collective Equilibria and Sociotechnical Lock-in

链接: https://arxiv.org/abs/2605.21351
作者: Angjelin Hila
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:This paper takes an ecological approach toward large-scale models of hybrid human-AI intelligence. Emerging models of human-AI interaction predominantly advance the complementarity thesis variously dubbed human-AI collaboration and human-AI hybrid intelligence. However, this constitutes an over-simplification of the modalities of human-AI interaction and possibility-space for both individual and collective action that human-AI interaction potentiates. To fill these gaps, this paper develops a decision and game-theoretic approach to the human-AI delegation-verification dilemma. First, we map out canonical decision-theoretic strategies that account for adaptive user trajectories, modeling how agents transition between strategies based on interaction feedback to reach stable equilibria. Second, we scale individually stable strategies to collective equilibria using three extrapolation principles: (a) non-communicative aggregation (b) local social signaling and © institutional norms setting. The analysis identifies the emergence of sociotechnical lock-in, a macro-behavioral state where individually adaptive delegation, in the absence of communicative and institutional safeguards, aggregates into a systemic collective action problem modeled as a prisoner’s dilemma that degrades shared epistemic standards. We argue that adoption under higher communicative standards and institutional norms can mitigate suboptimal collective equilibria by imposing social commitments on individual users.

[HC-5] meSRL: Generalizable Time-Series Behavioral Modeling via Semantic RL-Tuned LLM s – A Case Study in Mental Health

链接: https://arxiv.org/abs/2605.21295
作者: Yuang Fan,Lilin Xu,Millie Wu,Jingping Nie,Qingyu Chen,Yuzhe Yang,Zhuo Zhang,Xin Liu,Subigya Nepal,Xiaofan Jiang,Xuhai “Orson” Xu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Longitudinal passive sensing enables continuous health prediction, yet models often fail under cross-dataset distribution shifts. Traditional ML overfits cohort-specific artifacts, while Large Language Models (LLMs) struggle to reason reliably over long, heterogeneous time-series. We introduce TimeSRL, a two-stage LLM framework that routes predictions through an explicit semantic bottleneck. The model first abstracts raw signals into high-level natural language, then predicts behavioral outcomes from these abstractions alone. This forces the model to reason over semantic concepts that we argue generalize better than raw numbers. We optimize this process end-to-end using Group Relative Policy Optimization (GRPO) with Reinforcement Learning from Verifiable Rewards (RLVR), learning outcome-aligned abstractions without gold intermediate annotations. Instantiated on mental-health prediction, TimeSRL achieves state-of-the-art performance on a benchmark designed to stress-test cross-cohort generalization under a rigorous leave-one-dataset-out (LOSO) protocol, reducing mean absolute error (MAE) over strong non-LLM ML and LLM baselines by 3.1–10.1% and 9.5–44.1% for anxiety, and 3.2–9.6% and 27.4–57.6% for depression (all p s0.05). TimeSRL significantly outperforms prior methods in cross-benchmark transfer across different sensing pipelines, rivaling its own within-domain performance without target-domain fine-tuning. These results demonstrate that semantic abstractions are reusable and point to a new direction for generalizable behavior modeling via RL-tuned LLMs.

[HC-6] he Quiet Path from Seemingly Minor Design Errors to Workplace AI Incidents

链接: https://arxiv.org/abs/2605.21035
作者: Julia De Miguel Velázquez,Sanja Šćepanović,Andrés Gvirtz,Daniele Quercia
类目: Human-Computer Interaction (cs.HC)
备注: Accepted in April 2026 to be published in the 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26), June 25-28, 2026, Montreal, QC, Canada

点击查看摘要

Abstract:Recent human-computer interaction (HCI) research has revealed a widespread misalignment between how developers design workplace artificial intelligence (AI) systems, and what workers actually need from them. Yet, little research has examined the effects of this gap, or how it may cause harm. We analyzed 1,524 reports of incidents in which AI systems were used to perform 171 occupational tasks across 12 industry sectors. Using an Large Language Model (LLM)-as-an-expert approach, we extracted the main traits of the AI systems involved in those incidents using an established framework of twelve traits. We then compared them with the traits that 202 workers highly familiar with those tasks would have preferred. We found that as many as 83% of workplace incidents stem from worker-AI misalignments. In most cases, workers wanted systems that are precise, insightful, or personal, but instead received systems that are basic, simple, or general. Over the years, fast AI caused a considerable number of incidents, yet these declined, and imaginative AI, with the mass introduction of generative AI, started to cause incidents. We also compared the traits causing the incidents with the traits that 197 developers building AI systems for those tasks would have preferred. If the traits causing the incidents were the same as those designed by developers, then developers may be responsible for those incidents. We found that 74% of task misalignments could be attributed to developers who tended to overfocus on efficiency and speed, especially for systems performing tasks in people-facing occupations such as those in the human resources sector. Our results call for design interventions that better align AI development with workers’ needs, as without such corrections, workplace AI incidents are likely to persist, causing the invisible erosion of worker agency and organizational productivity.

[HC-7] PaintCopilot: Modeling Painting as Autonomous Artistic Continuation

链接: https://arxiv.org/abs/2605.20941
作者: Yunge Wen,Yuancheng Shen,Paul Pu Liang
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:We present PaintCopilot, a co-creative neural painting assistant that models painting as an open-ended autoregressive artistic behavior conditioned on evolving canvas states and prior brushstroke history, without requiring a target image. Unlike existing neural painting methods that frame painting as pixel reconstruction toward a predefined reference, PaintCopilot predicts future strokes directly from learned artistic dynamics, analogous to how large language models continue text sequences from prior context. The framework proposes three complementary models: a ViT-based Target Predictor that infers artist intent from partial canvas observations, an autoregressive Next Stroke Predictor that generates temporally coherent brushstrokes via flow matching, and a VAE-based Region Sampler that synthesizes semantically localized stroke sequences on demand. Built on three differentiable brush representations (Hard Round, Brush Tip, and 2D Gaussian), the system supports four interactive workflows: Optimize History, Stroke Completion, Region Inpainting, and Dynamic Brush. Through case studies with professional artists, we demonstrate that PaintCopilot enables fluid co-creative painting workflows in which artists and AI continuously alternate control throughout the creative process. Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC) Cite as: arXiv:2605.20941 [cs.CV] (or arXiv:2605.20941v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2605.20941 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[HC-8] oward 6G-enabled Brain Computer Interfaces: Technical Requirements Use Cases Challenges and Future Trends

链接: https://arxiv.org/abs/2605.20939
作者: Houda Hafi,Bouziane Brik,Nuraini Jamil,Abdelkader Nasreddine Belkacem
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Brain computer interface (BCI) enables the brain to directly control an external device by converting neural signals into actionable outputs. However, effective real-time translation of brain activity strongly depends on the quality of neural communication between the brain and the external device. 6G is the next generation of wireless communication, expected to provide unprecedented levels of data rates, data security, and automation capabilities. In this context, integrating 6G into BCI systems would not only enhance the performance of brain-device communication, but would also create new opportunities for innovative applications. This work provides a comprehensive study on how BCI technology can be built effectively on top of 6G wireless networks by introducing several technical aspects and use cases. We first provide an overview of BCI and 6G, following their progression from early development to convergence through cognitive communication and advanced neural interfaces. We then highlight the need for the upcoming 6G systems toward BCI technology in every aspect, including 6G technologies such as intelligent edge and zero-touch networks, and 6G use cases such as digital twin, immersive communication, and internet of minds. Furthermore, we identify key technical challenges, open issues, and future research directions related to the 6G-enabled BCI paradigm.

[HC-9] Design Principles and Observable Indicators for AI-Enabled Pedagogical Accompaniment: Evidence from the Amico Dual-Mode Prototype in Italy and China

链接: https://arxiv.org/abs/2605.20665
作者: Pier Paolo Benedetti
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: 11 pages. Author Accepted Manuscript. Accepted and presented at the 2026 International Conference on Artificial Intelligence and Education (ICAIE 2026). Proceedings forthcoming. Copyright 2026 IEEE

点击查看摘要

Abstract:AI-enabled systems are increasingly introduced into educational contexts, yet their effectiveness depends less on technological sophistication than on the quality of pedagogical mediation, ethical constraints, and context-sensitive design. This paper proposes a replicable framework for AI-enabled pedagogical accompaniment, grounded in a human-in-command approach in which adult responsibility remains central and AI functions as an enabling, non-substitutive infrastructure. Building on the Amico project, we operationalize the concept of a relational bridge as a sequence of micro-mediations that lower the threshold of access to educational relationships and facilitate transitions toward meaningful human interaction with teachers, peers, and communities of practice. The contribution synthesizes a set of design principles, including transparency of system identity and limits, scaffolding toward human contact, maieutic questioning, prevention of dependency dynamics, and data minimization, and maps them to observable indicators suitable for real educational settings. The paper also outlines an initial cross-context exploration of the prototype in Italy and China and discusses how the two interaction modes, AmicoMio (structured, task-oriented) and AmicoTuo (reflective, supportive), can be used as complementary pedagogical mediations. Pilot observations and participant feedback suggested feasibility and perceived usefulness in vocational contexts, motivating the present framework, informing the subsequent doctoral research program, and supporting the proposed collaborative research agenda.

[HC-10] Personality Engineering with AI Agents : A New Methodology for Negotiation Research

链接: https://arxiv.org/abs/2605.20554
作者: Michelle A. Vaccaro,Jared R. Curhan
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:According to canonical negotiation theory, people’s success in a negotiation depends on how well they balance competing demands–empathizing and asserting, demonstrating concern for other and concern for self, being soft on the people and hard on the problem. Yet people struggle to manage these tensions, so researchers have lacked the ability to rigorously test the field’s prescriptions under controlled conditions. AI agents do not face the same limitations, and their precision, repertoire, consistency, and scalability enable a new class of experiments to contribute to negotiation theory. In this article, we introduce personality engineering: a methodology that uses AI agents to precisely parameterize, manipulate, and evaluate negotiator personality. We propose using the interpersonal circumplex–and its two core dimensions of warmth and dominance–as a foundational coordinate system for the field. This approach offers both a rigorous methodology for testing classic negotiation theories and a practical guide for designing the personalities of AI negotiation agents.

[HC-11] Framing an AI with Values Reduces AI Reliance in AI-supported Writing Tasks

链接: https://arxiv.org/abs/2605.20512
作者: Alice Gao,Andrew N. Meltzoff,Maarten Sap,Katharina Reinecke
类目: Human-Computer Interaction (cs.HC)
备注: Accepted to FAccT 2026

点击查看摘要

Abstract:Despite a global user base adopting large language models (LLMs) for daily writing tasks, model suggestions tend to align with Western values. Research has shown users commonly accept a high fraction of these AI suggestions, homogenizing writing styles and rendering outputs more ``Western’’ than intended. While this suggests a need to reduce AI reliance, it remains unknown what kind of interventions could achieve this. Can framing the AI with specific values, and comparing it to one’s own, make users less susceptible to overreliance and support more unique writing? We tested this hypothesis in a between-subjects online experiment with Indian and American participants (n=149) in which they were asked to perform AI-supported writing tasks, either 1) without an intervention, 2) after seeing an overview of the AI’s framed values, or 3) after seeing an overview of the AI’s framed values compared to their own. Our results show that seeing the AI’s framed values reduces AI reliance, i.e., the proportion of the final essay generated by the AI, by an average of 20%. Additionally, when participants saw an overview of the AI’s framed values (without comparison to their own values), the final essays contain more unique text than without intervention. Our findings emphasize the importance of educating users about potential value biases in AI, showing that raising awareness with a simple overview of values encourages users to personalize their writing.

[HC-12] Creating Learning Scaffolds for Engineering Design Using Concept Catalyst

链接: https://arxiv.org/abs/2605.20511
作者: Madhuri Singh,Gennie Mansi,Mark Owen Riedl
类目: Human-Computer Interaction (cs.HC)
备注: Accepted for an Interactive Demo by ISLS 2026

点击查看摘要

Abstract:K-12 teachers employ Engineering Design Challenges to help students learn about the Engineering Design Process hands-on. They use techniques like hard scaffolding questions to guide the students as they think through the different stages of the engineering design process. While useful, the creation of these questions adds to the teacher’s preparation time for their classes. Concept Catalyst uses Large Language Models to assist teachers with the rapid creation of scaffold questions for engineering design challenges. Unlike open-ended chat, Concept Catalyst uses LLMs to summarize and decompose an engineering design challenge into the concepts that students will engage with, allow the teacher to visually manipulate and link related concepts, and to propose scaffolding questions for the teacher to modify or accept.

[HC-13] Art Card Game (ACG): Embedding Illustration in Gameplay to Mitigate Artist Self-Criticism

链接: https://arxiv.org/abs/2605.20465
作者: Catherine Mullings,Michael S. Bernstein
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Persistent self-criticism–harsh evaluative self-talk–can undermine illustrators’ performance and well-being. Traditional interventions draw on psychotherapeutic approaches (e.g., compassion training) but sit outside the illustration workflow, requiring time, facilitation, and skill transfer. We propose an in-workflow alternative: evaluative off-centering, a mechanism redirecting self-critical evaluation away from an inherently self-evaluative task (like illustration) by embedding it in an alternative activity. We instantiate evaluative off-centering in Art Card Game (ACG) that integrates illustration into a card customization game: players illustrate cards that become playable assets in a head-to-head battle. In a four-day randomized controlled study with hobbyist and professional illustrators (N=38), ACG outperformed a control condition with identical illustration constraints but no evaluative off-centering mechanisms (e.g. multiplayer, gameplay), yielding significantly higher pride in produced artwork and activity enjoyment. Pride and enjoyment–positive affect states linked to lower self-criticism–help explain how ACG reduces self-criticism. We discuss design implications for creativity support tools that apply evaluative off-centering across creative domains.

[HC-14] Modeling Emotional Dynamics in Agent -to-Agent Interactions on Moltbook

链接: https://arxiv.org/abs/2605.20442
作者: Syed Mhamudul Hasan,Abdur R. Shahid
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative AI systems are increasingly deployed as interactive agents in online environments, such as a social network called Moltbook. In Moltbook, large-scale agentic AIs can post, comment, and engage in activities generated at scale by AI-driven text. Yet these agent behavioral characteristics remain insufficiently understood, particularly in complex, multi-agent interaction. In this study, we analyze the emotional dynamics of agent interactions within Moltbook. We construct an emotion-aware framework that maps textual interactions to a predefined set of fine-grained emotional categories, enabling the extraction of structured emotion profiles across agents and interaction contexts. To further evaluate behavioral reliability, we introduce an emotion-based domain called Persona-Stimulus-Reaction (PSR) that captures the alignment of emotional responses across similar contexts. Our analysis shows distinct emotional patterns and varying levels of behavioral stability across agents. Our analysis reveals that agents exhibit distinct emotional signatures with varying levels of behavioral stability influenced by interaction context.

[HC-15] Can Conversational XAI Improve User Performance? An Experimental Study

链接: https://arxiv.org/abs/2605.20439
作者: Sven Kruschel,Julian Rosenberger,Lasse Bohlen,Mathias Kraus,Patrick Zschech
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
备注: Accepted at Thirty-Fourth European Conference on Information Systems (ECIS 2026), Milan, Italy

点击查看摘要

Abstract:Explainable AI (XAI) techniques aim to provide insights into predictive models and enhance user performance, yet they often fall short of these expectations. Conversational XAI assistants promise to overcome such limitations, but empirical evidence on their impact on objective performance measures remains limited. We propose an experimental design for evaluating explanation assistance through prediction accuracy, model understanding, and error identification. Using an explainable-by-design prediction model, we create conditions where users can outperform the model by identifying and compensating for systematic errors. We compare conversational assistance against QA-based assistance to assess which better supports users in working with model explanations. Preliminary results from testing our experimental design show that participants (N=42) in both treatments significantly outperformed the model but reveal no performance differences between assistance types and modest engagement overall. These findings inform refinements for our planned full study, including enhanced engagement interventions and investigation of the mechanisms driving improved predictions.

[HC-16] Closing the Motivation Gap: Incentives Enhance Visual Misinformation Discernment and Verification

链接: https://arxiv.org/abs/2605.20438
作者: Sijia Qian,Cuihua Shen,Jingwen Zhang,Magdalena Wojcieszak
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Cheapfakes, or real images presented misleadingly or in unrelated contexts, are an increasingly prominent form of visual misinformation. While media literacy interventions can enhance individuals’ ability to detect such content, motivational barriers often hinder the adoption of image verification. This study examines whether incorporating different mechanisms and types of incentives into a digital media literacy intervention improves visual misinformation discernment and image verification behavior, both immediately and over time. We conducted a pre-registered two-wave between-subjects online experiment (N = 1,421) on a professionally designed social media platform. The study used a 2 (Incentive Type: symbolic vs. monetary) x 2 (Incentive Mechanism: task- vs. result-based) factorial design with additional control groups. Results show that task-based incentives, particularly monetary ones, were most effective at initiating image verification behaviors, namely reverse image search, and boosting short-term discernment, whereas result-based incentives were more effective in sustaining discernment accuracy. These findings suggest that both the mechanism and the type of incentives play a critical role in shaping the short- and long-term effectiveness of media literacy interventions, highlighting the value of multi-phased incentive strategies for combating visual misinformation in digital environments.

[HC-17] Multi-Week In-Class Deployments of Telepresence Robots With Four Homebound K-12 Students: Benefits Challenges and Recommendations

链接: https://arxiv.org/abs/2605.20431
作者: Matthew Rueben,Rhianna Lee,Thomas R. Groechel,Hengzhi Chen,Haemi Lee,Gisele Ragusa,Maja J. Matarić
类目: Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Missing significant amounts of school during K-12 education is known to put students’ cognitive and social development at risk. Alternatives such as home instruction and online learning are common, but lack sufficient interaction with peers and teachers in the classroom. Mobile remote presence systems, or telepresence robots, are promising for homebound students because they provide embodiment and mobility in addition to the real-time participation offered by video conferencing technologies. Research is needed, however, for telepresence robots to meet the complex needs of homebound students participating remotely in the K-12 classroom context. We present findings from four multi-week deployments with homebound K-12 students attending classes via telepresence robots. The homebound students’ experiences were documented in a total of 15 interviews and analyzed qualitatively as case studies. The homebound student participants and their deployment contexts differed from one another along multiple dimensions, and while some benefits of mobile remote attendance were enjoyed by all participants, each participant also experienced unique benefits. Some challenges with hearing, seeing, and moving the robot around the classroom warranted improvements to the design of the telepresence system. Other challenges suggested priorities for managing a classroom deployment, such as ensuring that the remote student is included in classroom activities, accountable to the teacher, and treated with respect by classmates. Based on insights from the study, we make recommendations for real-world deployment procedures in similar contexts.

[HC-18] Music of Changing Lines: Toward a Culturally Situated Approach to the I-Ching

链接: https://arxiv.org/abs/2605.20386
作者: Ling Qi,Aleksandra Teng Ma,Alexandria Smith
类目: Multimedia (cs.MM); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Sound (cs.SD)
备注: Published and presented at the International Computer Music Conference (ICMC) 2026

点击查看摘要

Abstract:The I-Ching is one of the most influential texts in Chinese intellectual history, integrating divination, cosmology, and ethical reflection. While Western experimental music, most notably John Cage, has drawn on the I-Ching as a source of chance operation, such appropriations have often detached its formal mechanisms from the interpretive and philosophical processes that give the text meaning. This work, Music of Changing Lines, presents an interactive system that re-centers the I-Ching as a meaning-bearing framework rather than a neutral randomizer. Users perform Wen Wang Fa coin casting, which is accompanied in real time through probabilistic musical processes. The resulting hexagrams and changing lines are interpreted by a large language model, Gemini, in relation to the user’s inquiry. This textual interpretation is then translated into a prompt for a generative music model, Lyria, producing a responsive musical realization. By situating AI as an interpretive intermediary rather than a compositional authority, the system foregrounds the I-Ching’s ritual, interpretation, and participation as the primary sonic materials. Music of Changing Lines extends process-driven traditions in computer music by demonstrating how generative AI can support participatory, meaning-driven musical processes without prescribing musical structure or replacing human agency.

[HC-19] Proximal State Nudging: Reducing Skill Atrophy from AI Assistance

链接: https://arxiv.org/abs/2605.20355
作者: Megha Srivastava,Jonathan Ouyang,Eric Zhou,Andrew Silva,Emily Sumner,Dorsa Sadigh,Yuchen Cui,Deepak Gopinath,Guy Rosman
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 9 pages

点击查看摘要

Abstract:Skill atrophy, the gradual decline of human capability under AI assistance, poses a safety risk in shared-control of semi-autonomous systems, where operators may be unable to distinguish their own inputs from autonomous corrections. We propose Proximal State Nudging (PSN), a shared autonomy algorithm that jointly optimizes for skill development and task performance by nudging users toward states estimated to be most learnable. We first show that PSN outperforms existing shared autonomy baselines in balancing student improvement in unassisted reward with overall shared performance, using simulated students in the classic LunarLander environment. We then present, to the best of our knowledge, the first human subject studies of a planner incorporating learning-compatible shared autonomy: across two driving tasks in the CARLA simulator (High Performance Racing and Parallel Parking, n = 60), PSN produces up to 7x larger gains in unassisted skill than standard blended shared autonomy, while incurring 50% fewer collisions than unassisted self-practice.

[HC-20] Adaptive Human-Robot Collaboration for Masonry Construction Under Material and Assembly Uncertainty

链接: https://arxiv.org/abs/2605.20264
作者: Jutang Gao(1),Arash Adel(1) ((1) Princeton University)
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注: Accepted for publication in Proceedings of the 43rd International Symposium on Automation and Robotics in Construction (ISARC 2026)

点击查看摘要

Abstract:Human-robot collaboration in construction is often challenged by limited robot-to-human communication and the need to adapt to tolerance accumulation arising from material and assembly uncertainties. We present an adaptive human-robot collaborative workflow for masonry construction that addresses communication limitations and tolerance accumulation, demonstrated through a brickwork case study in which a robot places bricks while a human applies adhesive. This workflow is enabled by two complementary mechanisms: 1) an end-effector-mounted projector that provides spatially registered, just-in-time projection guidance for manual adhesive application, and 2) laser scanning for feedback-driven grasping and placement pose correction. Together, these mechanisms enable adjustment of human and robotic actions in response to material variability and accumulated assembly tolerances. Full-scale experiments across conventional running-bond and nonstandard configurations demonstrate that projection guidance improves adhesive application consistency and reduces application time, while laser-based correction maintains level courses and avoids collision-prone failures associated with open-loop execution. These results indicate that integrating spatial projection with feedback-driven adaptation, enabled by material and as-built sensing, can mitigate tolerance accumulation and improve precision and robustness in human-robot collaborative construction.

[HC-21] Multi-Agent Reinforcement Learning for Safe Autonomous Driving Under Pedestrian Behavioral Uncertainty ICRA2026

链接: https://arxiv.org/abs/2605.20255
作者: Prakash Aryan,Kaushik Raghupathruni,Timo Kehrer,Sebastiano Panichella
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注: Submitted to ICRA 2026 Workshop “8th Workshop on Long-term Human Motion Prediction”

点击查看摘要

Abstract:Simulation-based testing of self-driving cars (SDCs) typically relies on scripted or simplified pedestrian models that do not capture the heterogeneity and uncertainty of real human crossing behavior. This limits the realism of safety assessments, especially in scenarios involving jaywalking, which is governed by latent personality traits that the vehicle cannot observe. We hypothesize that jointly training pedestrians and the SDC with multi-agent reinforcement learning (MARL) produces more realistic interaction scenarios than training the SDC against fixed pedestrian policies, and that the resulting behavior gap between predictable and unpredictable crossings can be measured directly from trajectories. This paper describes a MARL environment in which an SDC and 12 pedestrians are co-trained using Multi-Agent Proximal Policy Optimization (MAPPO). Pedestrian locomotion follows scripted Dijkstra pathfinding, while an RL policy controls high-level go/wait decisions. Jaywalking probability depends on a per-pedestrian personality trait sampled at episode start and hidden from the SDC. In 500-episode evaluations, the co-trained SDC reached 78% of goals with a 14% collision rate, compared to 35% goals and 33% collisions for the best rule-based baseline. A speed differential metric shows that the SDC traveled 2.65 m/s faster near jaywalkers than near crosswalk users at close range (0-3 m), indicating that jaywalking encounters were not anticipated. Jaywalking accounted for 13% of crossing events but was associated with 62% of collisions. Co-training with MARL pedestrians reduced collisions by 30% relative to single-agent RL, as pedestrians learned to wait when the SDC approached at speed.

[HC-22] HealthTale: A Patient-Centric Health Story Visualization Tool

链接: https://arxiv.org/abs/2605.20207
作者: Ryan Smith,Kyle D. Chin,Tamara Munzner
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Patients often struggle to communicate coherent accounts of their health histories during time-constrained clinical encounters. These accounts, which we refer to as health stories, include both clinical events and lived experiences. Existing systems prioritize structured, clinician-centered data and provide limited support for eliciting and communicating patient-generated narratives. We present HealthTale, a patient-centric visualization system designed to elicit health stories from patients and structure them to facilitate communication during initial clinical conversations. Its design arises from a multi-stage qualitative investigation across domain expert discussions, online narratives (n=20), patient (n=11) and clinician (n=6) interviews, and elicited health stories (n=22), identifying recurring patterns in how individuals construct and communicate their health stories. HealthTale transforms freeform narratives into structured timeline representations, grounded in a data abstraction that models health stories as events that are grouped by health concern and time, capturing both clinical and contextual information, with the flexibility to handle temporally imprecise data and non-linear distributions of events across time. Through evaluation with patients (n=34) and clinicians (n=3), we find that HealthTale supports recall, organization, and self-advocacy, while enabling clinicians to rapidly interpret patient-generated narratives and establish a shared understanding.

[HC-23] PrivacyAkinator: Articulating Key Privacy Design Decisions by Answering LLM -Generated Multiple-choice Questions

链接: https://arxiv.org/abs/2605.20206
作者: Qiyu Li,Yuen Sum Wong,Yuen Kei Wong,Longxuan Yu,Haojian Jin
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: Accepted to ACM CHI 2026

点击查看摘要

Abstract:NIST’s Privacy Risk Assessment Methodology (PRAM) provides a structured framework for privacy experts to assess privacy risks. However, its complexity and reliance on expert knowledge make it difficult for novice developers to use effectively. This paper explores methods to lower these barriers. We first performed an observational study with 12 participants using PRAM in real-world scenarios, and found that novice developers struggled most with articulating privacy-related design decisions. We then developed PrivacyAkinator, an interactive tool that helps developers articulate key privacy decisions by answering LLM-generated multiple-choice questions. PrivacyAkinator introduces three innovations: a universal privacy representation that abstracts privacy-related design decisions into data flows and stakeholder interactions; a domain-aware design space mined from 10K privacy-related news articles; and a dynamic question-generation workflow to prioritize relevant questions. Our user study with 24 participants suggests that developers using PrivacyAkinator identified 47% more key decisions in 73% less time compared to PRAM.

[HC-24] Challenges in Working Towards Patient Engagement in Developing Technology Prototypes ALT

链接: https://arxiv.org/abs/2605.20205
作者: Fateme Rajabiyazdi,Julie Babione,Doreen M. Rabi,Foroozan Daneshzand,Sheelagh Carpendale
类目: Human-Computer Interaction (cs.HC)
备注: Accepted to the CHI 2026 Workshop on Engagement in Digital Health Interventions

点击查看摘要

Abstract:Creating supportive technologies for people living with multiple chronic conditions is extremely challenging. These patients are often faced with substantial visible and invisible treatment work as well as their everyday responsibilities, including coordinating across providers, tracking information, and repeating communication in emotionally charged contexts. In the Cumulative Complexity Model (CuCoM), the balance between patient workload and patient capacity shapes what patients can realistically take on, including whether a digital tool can be adopted and sustained. In this paper, we report engagement lessons from implementing MyCareCompass, a patient-facing digital health intervention (DHI) intended to support day-to-day self-management for people living with multiple chronic conditions. We define engagement as patient uptake and sustained use during a two-month pilot study of our platform, drawing on usage analytics and follow-up feedback, and distill three implementation lessons for designing for engagement in complex chronic care.

[HC-25] RealUserSim: Bridging the Reality Gap in Agent Benchmarking via Grounded User Simulation

链接: https://arxiv.org/abs/2605.20204
作者: Ming Zhu,Juntao Tan,Rithesh Murthy,Jielin Qiu,Liangwei Yang,Wenting Zhao,Silvio Savarese,Shelby Heinecke,Huan Wang
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM-based user simulation is the primary mechanism for end-to-end agent evaluation, yet simulated users are poor proxies for real humans: unconstrained LLM defaults produce a Formalism Ceiling (style match rates of 6-8% against real users), while hand-crafted behavioral directives trigger Directive Amplification, where models hyper-interpret instructions into unnatural behavioral extremes that vary dramatically across simulator models. We present RealUserSim, the first user simulation framework grounded in real behavioral data. From 14,000+ authentic human-LLM conversations (WildChat), we extract 7,275 executable behavioral profiles and use them to ground LLM simulators. A fidelity benchmark (PT3) on 600 conversations across 71+ domains with anti-leakage controls shows that grounded simulation raises match rate from 24.2% to 45.3% across five behavioral dimensions. Agent evaluation on TauBench with 6 simulator models and extensive analysis shows that grounded simulation acts as a realistic stress test, surfacing three failure mechanisms invisible to cooperative simulators (mean -3.2% to -3.5% task success degradation), while Directive Amplification in existing benchmarks produces unrealistic behavior that compromises the validity of agent evaluation.

[HC-26] GrandGuard: Taxonomy Benchmark and Safeguards for Elderly-Chatbot Interaction Safety

链接: https://arxiv.org/abs/2605.20203
作者: Changxuan Fan,Xi Yang,Yueyuan Zheng,Bin Zhou,Yuanping Wang,Wenbin Hu,Huihao Jing,Ki Sen Hung,Dazhao Du,Haoran Li,Janet Hui-wen Hsiao,Yangqiu Song
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As older adults increasingly use LLM-based chatbots for companionship and assistance, a safety gap is emerging. Older adults may face vulnerabilities from social isolation, limited digital literacy, and cognitive decline, yet existing safety benchmarks largely target general harms and overlook elderly-specific risks. For example, a prompt such as “how to repair a ceiling light alone in the dark” may be benign for most users but poses a serious fall risk for older adults with mobility limitations. We introduce GrandGuard, the first comprehensive framework for assessing and mitigating elderly-specific contextual risks in LLM interactions. We develop a three-level taxonomy with 50 fine-grained risk types across mental well-being, financial, medical, toxicity, and privacy domains, grounded in real-world incidents, community discussions, and analysis of stakeholder studies. Using this taxonomy, we construct a benchmark of 10,404 labeled prompts and responses, showing that several leading LLMs mishandle elderly-specific contextual risks in over 50% of cases. We mitigate these failures with two safeguards: a fine-tuned Llama-Guard-3 and a policy-enhanced gpt-oss-safeguard-20b, achieving up to 96.2% and 90.9% unsafe-prompt detection accuracy, respectively. GrandGuard lays the groundwork for AI systems that move beyond general safety to support aging populations.

[HC-27] Evaluating multimodal emotion recognition in proactive conversational agents : A user study

链接: https://arxiv.org/abs/2605.20200
作者: Adnana Dragut,Raquel Lacuesta,F. Xavier Gaya-Morey,Jose M. Buades-Rubio
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This article presents a multimodal emotion recognition module integrated into a proactive Socially Interactive Agent (SIA) powered by generative artificial intelligence. The system evaluates real-time affective states through two distinct channels: a computer vision-based facial recognition module and a semantic linguistic analysis engine. To validate the framework, an empirical study was conducted with 20 users who engaged in dynamic, unscripted dialogues with the conversational agent. The findings reveal a significant discrepancy between automated visual cues and actual internal emotional states. When interacting with the AI, users consistently exhibited a “poker face” effect, displaying serious, concentrated facial expressions even when experiencing positive emotions. Consequently, the generative AI linguistic analysis proved significantly more reliable, by contextualizing the users’ verbal expressions. Furthermore, an analysis of the interaction dynamics demonstrated that SIAs can effectively elicit specific emotions by adapting conversational themes and employing structured linguistic patterns, such as empathetic or humorous language. However, the study also noted that instances of uncalibrated proactivity occasionally led to user disengagement and a perception of artificiality. Ultimately, this research highlights the necessity of refining SIAs to dynamically adapt to users’ emotional evolution, relying on deep linguistic context to foster more natural, human-like interactions.

[HC-28] Augmented Analytics and Decision Quality: The Role of Trust among Non-Technical BI Users

链接: https://arxiv.org/abs/2605.20198
作者: Thuy Pham Thi Phuong,Ha Nguyen Manh,Ngan Nguyen Thi Thuy,Lan Hoang Thi
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 13 pages, 1 figure, 4 tables

点击查看摘要

Abstract:Augmented analytics has transformed how business intelligence (BI) systems support managerial decision-making. This is especially true for users without technical backgrounds, who increasingly rely on automated insights rather than manual analysis. BI research has previously concentrated on system adoption and user intention, with very little research examining the impact of AI-enabled analytics on decision quality and the cognitive mechanisms in between. Using the theory of cognitive delegation, this paper investigates the role of trust in augmented analytics and decision-making quality among non-technical BI users. 250 business professionals completed the survey, and the data were analyzed using partial least squares structural equation modeling (PLS-SEM). The results show that augmented analytics capabilities lead to a significant increase in perceived ease of use, perceived usefulness, and trust in BI systems. In addition, trust and usefulness influence BI adoption and improve decision quality. Furthermore, trust has a direct and positive impact on decision quality, highlighting its importance as an enabler of reliance on AI-generated insights. This study considers augmented analytics as a form of cognitive delegation and expands the scope of BI adoption research to include decision-making outcomes.

计算机视觉

[CV-0] Variance Reduction for Expectations with Diffusion Teachers

链接: https://arxiv.org/abs/2605.21489
作者: Jesse Bettencourt,Xindi Wu,Matan Atzmon,James Lucas,Jonathan Lorraine
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computation (stat.CO); Machine Learning (stat.ML)
备注: Project page: this https URL

点击查看摘要

Abstract:Pretrained diffusion models serve as frozen teachers feeding downstream pipelines such as text-to-3D, single-step distillation, and data attribution. The teacher gradients these pipelines consume are Monte Carlo (MC) expectations over noise levels and Gaussian noise samples; their estimator variance dominates compute cost because each draw requires expensive upstream work (rendering, simulation, encoding). We introduce CARV, a compute-aware variance-accounting framework that motivates a hierarchical MC estimator: amortize the expensive upstream computation over cheap diffusion-noise resamples, sharpened by timestep importance sampling and a stratified-inverse-CDF construction. In our text-to-3D distillation and attribution experiments, CARV delivers 2-3x effective compute multipliers (most from amortized reuse; ~25% additional from IS+stratification) without changing the objective; in single-step distillation, the same techniques cut gradient variance by an order of magnitude but do not improve downstream FID, marking the regime where MC variance is no longer the bottleneck.

[CV-1] Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

链接: https://arxiv.org/abs/2605.21487
作者: Dian Zheng,Manyuan Zhang,Hongyu Li,Hongbo Liu,Kai Zou,Kaituo Feng,Hongsheng Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL Code: this https URL

点击查看摘要

Abstract:Currently, enhancing Unified Multimodal Models (UMMs) with image understanding, generation, and editing capabilities mainly relies on mixed multi-task training. Due to inherent task conflicts, such strategy requires complex multi-stage pipelines, massive data mixing, and balancing tricks, merely resulting in a performance trade-off rather than true mutual reinforcement. To break this paradigm, we propose Uni-Edit, an intelligent image editing task that serves as the first general task for UMM tuning. Unlike complex mixed pipelines, Uni-Edit improves performance across all three abilities at once using only one task, one training stage, and one dataset. Specifically, we first identify image editing as an inherently ideal general task, as it naturally demands both visual understanding and generation. However, existing editing data relies on simplistic instructions that severely underutilize a model’s understanding capacity. To address this, we introduce the first automated and scalable data synthesis pipeline for intelligent editing, transforming diverse VQA data into complex and effective editing instructions with embedded questions and nested logic. This yields Uni-Edit-148k, pairing diverse reasoning-intensive instructions with high-quality edited images. Extensive experiments on BAGEL and Janus-Pro demonstrate that tuning solely on Uni-Edit achieves comprehensive enhancements across all three capabilities without any auxiliary operations.

[CV-2] One-Step Distillation of Discrete Diffusion Image Generators via Fixed-Point Iteration

链接: https://arxiv.org/abs/2605.21484
作者: Chaoyang Wang,Yunhai Tong
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Discrete diffusion models excel at visual synthesis but rely on slow, iterative decoding. Existing single-step distillation methods attempt to bypass this bottleneck, either by training auxiliary score networks that effectively double compute, or by introducing specialized parameterizations and multi-stage pipelines that fragment optimization. In this paper, we introduce Fixed-Point Distillation (FPD), an end-to-end framework that constructs local correction targets by partially corrupting the student’s one-step draft and refining it with a single teacher step. To compute the training objective in a semantically meaningful space, we lift discrete tokens into continuous features and apply a multi-bandwidth drift loss that iteratively accumulates these corrections. To backpropagate through the discrete bottleneck, we employ a straight-through estimator that feeds exact hard-sampled tokens to the teacher and decoder during the forward pass, ensuring that training and inference operate on the same codebook manifold, while routing continuous gradients back to the student logits. This fully differentiable pathway additionally accommodates an optional unconditional adversarial objective to enhance perceptual realism. Evaluations on both class- and text-conditional generation validate the effectiveness of our framework. FPD achieves competitive visual fidelity and structural alignment within a single inference step, narrowing the gap to multi-step teachers while outperforming existing discrete distillation baselines.

[CV-3] WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata

链接: https://arxiv.org/abs/2605.21479
作者: Basel Shbita,Pengyuan Li,Anna Lisa Gentile
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Visual Question Answering (VQA) benchmarks have largely emphasized perception-based tasks that can be solved from visual content alone. In contrast, many real-world scenarios require external knowledge that is not directly observable in the image to answer correctly. We introduce WikiVQABench, a human-curated knowledge-grounded VQA benchmark constructed by systematically combining Wikipedia images, their associated article captions, and structured knowledge from Wikidata. Our pipeline uses large language models (LLMs) to generate candidate multiple-choice image-question-answer sets. All generated instances are subsequently reviewed and curated by human annotators to ensure factual correctness, visual-text consistency, and that each question requires external knowledge in addition to visual evidence for correct resolution. WikiVQABench comprises a substantial collection of Wikipedia images with curated multiple-choice questions designed to benchmark knowledge-aware vision-language models (VLMs). Evaluation of fifteen VLMs (256M-90B parameters) reveals a wide performance range (24.7%-75.6% accuracy), demonstrating that the benchmark effectively discriminates model capabilities on knowledge-intensive reasoning. The dataset and benchmarking code are publicly available.

[CV-4] Latent Dynamics for Full Body Avatar Animation

链接: https://arxiv.org/abs/2605.21478
作者: Shichong Peng,Chengxiang Yin,Fei Jiang,Zhongshi Jiang,Lingchen Yang,Qingyang Tan,Amin Jourabloo,Jason Saragih,Ke Li,Christian Häne
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Supplementary video: this https URL

点击查看摘要

Abstract:Pose-driven full-body avatars built on neural rendering produce high-quality novel views of a captured subject. Yet loose clothing and other dynamic elements deform in ways pose alone cannot explain: the same pose can correspond to many different states, because their motion depends on history, inertia, and contact. Explicit simulation and layered-garment methods can model such dynamics, but they require either a dedicated garment template, which raw multi-view capture does not naturally provide, or a test-time physics simulator with non-trivial runtime cost. A parallel line of work learns data-driven clothing avatars that avoid explicit garment layers. These methods add an auxiliary latent for variation beyond pose; at inference, they fix it, regress it from pose, or retrieve it from training data, without explicitly modeling how the latent evolves with its own dynamics. Additionally, even in everyday motion with loose clothing, existing architectures often struggle to capture fine-grained detail, producing blurry renderings and temporal artifacts. We augment a pose-conditioned 3D Gaussian avatar with a transformer-based decoder and a dynamics residual latent that captures temporal appearance and geometry variation beyond the driving signals. At inference, a learned latent dynamics model evolves the residual latent from a short pose history and the previous latent state. The model decomposes each update into driving, restoring, and dissipative forces, producing temporally coherent, history-dependent rollouts with negligible added cost. Different initial conditions yield diverse yet plausible motion trajectories, and the force decomposition exposes controls such as stiffness. Across nine captured sequences of everyday motion with diverse loose garments, quantitative metrics and a perceptual user study show improved animation quality over recent data-driven baselines.

[CV-5] Stream3D: Sequential Multi-View 3D Generation via Evidential Memory

链接: https://arxiv.org/abs/2605.21472
作者: Kaichen Zhou,Zeyang Bai,Xinhai Chang,Mengyu Wang,Paul Liang,Fangneng Zhan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Multi-view 3D Generation, Streaming 3D Generation

点击查看摘要

Abstract:View-conditioned 3D generators such as SAM 3D, TRELLIS and Hunyuan3D produce high-quality object reconstructions from a single view, but real-world visual observation often arrives as long monocular streams. Naively applying these generators to each streaming frame independently leads to severe temporal inconsistency in the generated results. To address this problem, we propose Stream3D, the first training-free streaming mechanism that turns a frozen view-conditioned 3D generator into a streaming generator with constant cross-chunk memory. Stream3D achieves this by maintaining a compact evidential memory, which selectively caches the most informative historical frames based on a proposed evidence score mechanism. As the stream progresses, the memory dynamically updates to retain a fixed number of informative frames, preventing the memory footprint from growing linearly with sequence length. This also prevents degradation over long sequences and keeps the underlying generator completely unchanged without retraining, architectural modifications, or auxiliary losses. Evaluated on both realistic and synthetic streaming benchmarks, Stream3D outperforms latent-transport baselines, including KV-cache reuse and flow-based feature editing, across both photometric and geometric metrics. More details can be found at: this https URL.

[CV-6] StreamGVE: Training-Free Video Editing via Few-Step Streaming Video Generation

链接: https://arxiv.org/abs/2605.21466
作者: Guanlong Jiao,Chenyangguang Zhang,Jia Jun Cheng Xian,Zewei Zhang,Renjie Liao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Although existing video editing methods are generally feasible, they often require many costly iterations and still struggle to deliver high-quality yet satisfying editing results. We attribute this limitation to the prevalent data-to-data paradigm, which is less compatible with modern generative models than noise-to-data generation. To address this gap, we revisit video editing from a noise-to-data perspective and propose Streaming-Generation-based Video Editing (StreamGVE), which preserves few-step sampling while seamlessly injecting source-video conditions. Built on pre-trained streaming generation models, StreamGVE introduces dual-branch fast sampling with a self-attention bridge and cross-attention grounding/boosting to satisfy both sampling and conditioning requirements. We further propose source-oriented guidance to improve target-generation quality, and a visual prompting strategy to enhance editing flexibility and practicality. The method is effective, robust, and generalizable across different models. Extensive experiments on diverse video editing tasks show that StreamGVE consistently outperforms existing approaches, even in few-step settings with minimal time cost.

[CV-7] ProtoPathway: Biologically Structured Prototype-Pathway Fusion for Multimodal Cancer Survival Prediction

链接: https://arxiv.org/abs/2605.21454
作者: Amaya Gallagher-Syed,Costantino Pitzalis,Myles J. Lewis,Michael R. Barnes,Gregory Slabaugh
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM); Tissues and Organs (q-bio.TO)
备注: Currently under peer review

点击查看摘要

Abstract:We introduce ProtoPathway, an interpretable-by-design multimodal framework for cancer survival prediction that unifies whole slide imaging and transcriptomics through encoders producing biologically grounded representations on both sides of the fusion. On the histopathology side, K learnable morphological prototypes, trained end-to-end with the survival objective, serve as the slide representation itself: patches flow into prototype tokens via soft assignment, compressing variable-length patch sets into fixed task-adaptive tokens. On the genomic side, a bipartite graph neural network encodes gene expression within the Reactome pathway hierarchy, producing pathway embeddings that reflect both constituent genes and their broader biological context through bidirectional message passing over a shared gene–pathway graph. Cross-modal attention then operates over a compact prototype \times pathway matrix in which prototypes query pathways, modeling the biological direction in which molecular programs give rise to tissue morphology. Because both axes carry stable task-learned identity, the attention matrix is itself an interpretability output, yielding native inference-time attribution across the full biological hierarchy, from genes through pathways and prototypes to spatial tissue maps. We evaluate on five TCGA cancer cohorts, demonstrating competitive or superior survival prediction with substantially improved biological interpretability and reduced computational cost, with interpretability claims validated through fold-stratified rank-based population-level analysis. Our source code, model weights, and Reactome pathways, together with a unified codebase reimplementing all multimodal survival baselines under identical preprocessing and evaluation, are available at: this https URL.

[CV-8] mpGlitch: Evaluating Vision-Language Models for Temporal Glitch Detection in Gameplay Videos

链接: https://arxiv.org/abs/2605.21443
作者: Yakun Yu,Ashley Wiens,Adrián Barahona-Ríos,Benedict Wilkins,Saman Zadtootaghaj,Nabajeet Barman,Cor-Paul Bezemer
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) are increasingly being explored for video game quality assurance, especially gameplay glitch detection. Most existing evaluations, however, treat glitches as static visual anomalies, asking models to detect failures from a single frame. We argue that this framing misses a key distinction: some glitches are spatial and visible in an isolated frame, whereas others are temporal and become evident only through changes across ordered frames. A preliminary study confirms this gap, showing that temporal glitches are substantially harder for VLMs to detect than spatial ones. To enable systematic evaluation of this underexplored setting, we introduce TempGlitch, a controlled gameplay video benchmark for temporal glitch detection. TempGlitch covers five temporal glitch types with balanced per-category samples, together with paired glitch-free videos that enable reliable binary evaluation. We evaluate 12 proprietary and open-weight VLMs across multiple frame-sampling settings. Our results show that current VLMs remain near chance on TempGlitch, often collapsing into either overly conservative behavior that misses most glitches or overly sensitive behavior that flags clean videos as glitchy. Moreover, denser frame sampling and larger model size do not reliably resolve these failures. TempGlitch provides a focused testbed for temporal reasoning, robust gameplay understanding, and automated glitch detection with VLMs. Code and data are available at the project website.

[CV-9] ReMATF: Recurrent Motion-Adaptive Multi-scale Turbulence Mitigation for Dynamic Scenes

链接: https://arxiv.org/abs/2605.21440
作者: Zhiming Liu,Zhicheng Zou,Nantheera Anantrasirichai
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Atmospheric turbulence severely degrades video quality by introducing distortions such as geometric warping, blur, and temporal flickering, posing significant challenges to both visual clarity and temporal consistency. Current state-of-the-art methods are based on transformer, 3D architectures and require multi-frame input, but their large computational cost and memory usage limit real-time deployment, especially in resource-constrained scenarios. In this work, we propose ReMATF, a lightweight recurrent framework that restores videos using only two frames at a time while preserving spatial detail and temporal stability. ReMATF combines a multi-scale encoder-decoder with temporal warping and a motion-adaptive temporal fusion module that performs per-pixel fusion between the warped previous output and the current prediction to enhance coherence without enlarging the temporal window. This design reduces flicker, sharpens details, and remains efficient. Experiments on synthetic and real turbulence datasets show consistent improvements in PSNR/SSIM and perceptual quality (LPIPS), along with substantially faster inference than multi-frame transformer baselines, making ReMATF suitable turbulence mitigation in resource-constrained scenarios.

[CV-10] ryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance ICML2026

链接: https://arxiv.org/abs/2605.21431
作者: Jun Zheng,Zhengze Xu,Mengting Chen,Jing Wang,Jinsong Lan,Xiaoyong Zhu,Kaifu Zhang,Bo Zheng,Xiaodan Liang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL . Accepted by ICML 2026

点击查看摘要

Abstract:Video Virtual Try-On (VVT) aims to seamlessly replace a garment on a person in a video with a new one. While existing methods have made significant strides in maintaining temporal consistency, they are predominantly confined to non-interactive scenarios where models merely showcase garments. This limitation overlooks a crucial aspect of real-world apparel presentation: active human-garment interaction. To bridge this gap, we introduce and formalize a new challenging task: Interactive Video Virtual Try-On (Interactive VVT), where subjects in the video actively engage with their clothing. This task introduces unique challenges beyond simple texture preservation, including: (1) resolving the semantic ambiguity of interactions from standard pose information, and (2) learning complex garment deformations from video where interactive moments are sparse and brief. To address these challenges, we propose iTryOn, a novel framework built upon a large-scale video diffusion Transformer. iTryOn pioneers a multi-level interaction injection mechanism to guide the generation of complex dynamics. At the spatial level, we introduce a garment-agnostic 3D hand prior to provide fine-grained guidance for precise hand-garment contact, effectively resolving spatial ambiguity. At the semantic level, iTryOn leverages global captions for overall context and time-stamped action captions for localized interactions, synchronized via our novel Action-aware Rotational Position Embedding (A-RoPE). Extensive experiments demonstrate that iTryOn not only achieves state-of-the-art performance on traditional VVT benchmarks but also establishes a commanding lead in the new interactive setting, marking a significant step towards more dynamic and controllable virtual try-on experiences.

[CV-11] AIGaitor: Privacy-preserving and cloud-free motion analysis for everyone using edge computing

链接: https://arxiv.org/abs/2605.21421
作者: Lauhitya Reddy,Trisha M. Kesar,Hyeokhyen Kwon
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages 3 figures, 2 tables

点击查看摘要

Abstract:Motion capture is the gold standard for measuring human movement, but clinical use remains limited by cost, technical complexity, and privacy concerns. AIGaitor is a privacy-preserving, cloud-free motion analysis system that runs markerless monocular motion-capture pipelines and downstream deep-learning analysis entirely on a consumer smartphone using on-device neural accelerators. To motivate its design, we surveyed 74 rehabilitation clinicians: 92 percent said they would adopt an accurate, cost-effective, easy-to-use AI gait analysis tool, while 79.7 percent cited operating cost, 68.9 percent insufficient training, and 64.9 percent privacy concerns as leading barriers. We then optimized and benchmarked mobile iOS implementations of current monocular pipeline components, including 2D and 3D pose estimation, pose optimization, skeleton-based deep-learning analysis, and a vision-language model. A Time-Priority end-to-end on-device pipeline processes a 10 s 4K 60 fps video clip in 77 s on an iPhone 14, matching or beating the same pipeline on a high-end NVIDIA H200 cloud server when network transfer is included: 94 s at global mobile-average uplink and 66 s at developed-world Wi-Fi. Lightweight models such as ViTPose-s achieve real-time keypoint extraction, and skeleton-based action-recognition models provide sub-millisecond gait classification on the same clip. To our knowledge, AIGaitor is the first monocular system to demonstrate end-to-end on-device motion capture and downstream deep-learning analysis, supporting clinically applicable movement analysis that is low-cost, private, and accessible to smartphone users.

[CV-12] FedCritic: Serverless Federated Critic Learning-based Resource Allocation for Multi-Cell OFDMA in 6G

链接: https://arxiv.org/abs/2605.21418
作者: Amin Farajzadeh,Melike Erol-Kantarci
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Networking and Internet Architecture (cs.NI)
备注: Submitted to IEEE for possible publication

点击查看摘要

Abstract:In sixth-generation (6G) ultra-dense networks, aggressive frequency reuse amplifies inter-cell interference (ICI), making multi-cell orthogonal frequency-division multiple access (OFDMA) scheduling and power control strongly coupled across neighboring cells. We study distributed downlink resource management – joint subcarrier scheduling and power allocation – under interference coupling and long-term per-user quality-of-service (QoS) minimum-rate constraints. By using virtual-queue deficit weights to enforce long-term QoS, we develop FedCritic, a serverless federated multi-agent actor-critic framework with decentralized execution. Unlike centralized training with decentralized execution (CTDE) approaches that require centralized critic learning and joint trajectory aggregation, FedCritic federates the critic through lightweight gossip-based parameter averaging over the interference graph, enabling stable value estimation without a central coordinator while keeping policies local. Simulations in an interference-rich reuse-1 setting show that FedCritic improves mean signal-to-interference-plus-noise ratio (SINR) and cell-edge rate, increases network-wide average sum-rate and fairness relative to non-coordinated and CTDE baselines, and achieves more stable training with lower coordination overhead.

[CV-13] Ordering Matters: Rank-Aware Selective Fusion for Blended Emotion Recognition

链接: https://arxiv.org/abs/2605.21417
作者: Junghyun Lee,Hyunseo Kim,Hanna Jang,Junhyug Noh
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at IEEE FG 2026. Final system ranked 2nd in the BlEmoRE Challenge. 9 pages including appendix, 8 figures

点击查看摘要

Abstract:Blended emotion recognition is challenging because emotions are often expressed as mixtures of subtle and overlapping multimodal cues rather than a single dominant signal. We propose a rank-aware multi-encoder framework that selectively combines complementary representations from diverse pre-extracted video and audio encoders. Our method projects heterogeneous encoder features into a shared latent space, estimates sample-wise encoder importance through an attention-based gating module, and fuses only the top-n most informative encoders. To better model blended emotions, we decouple prediction into presence and salience heads and align them through probability-level fusion. We further incorporate feature-level unsupervised domain adaptation without pseudo-labeling to improve robustness under distribution shift. Experiments on the BlEmoRE challenge show that the proposed framework outperforms strong individual encoders and naïve multi-encoder fusion baselines. Our final system ranked 2nd in the competition, supporting the effectiveness of rank-aware selective fusion for fine-grained blended emotion recognition.

[CV-14] PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction

链接: https://arxiv.org/abs/2605.21414
作者: Shizhe Chen,Paul Pacaud,Cordelia Schmid
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to RSS 2026; project webpage: this https URL

点击查看摘要

Abstract:Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation by leveraging large pretrained vision-language backbones. However, most existing VLAs rely primarily on 2D visual representations, which limit their ability to reason about fine-grained geometry and spatial grounding - capabilities that are essential for precise and robust manipulation in 3D environments. In this paper, we propose PointACT, a dual-system 3D-aware VLA policy that integrates hierarchical 3D point cloud representations directly into the action decoding process. PointACT employs a multi-scale point-action interaction mechanism with efficient bottleneck window self-attention, enabling evolving action tokens to densely attend to both local geometric detail and global scene structure. We evaluate PointACT on the LIBERO and RLBench benchmarks and systematically compare it against monolithic and dual-system VLA baselines, including variants augmented with point cloud inputs. PointACT achieves consistent improvements across both benchmarks, increasing success rates by 10% on the challenging RLBench-10Tasks suite over state-of-the-art pretrained VLAs, with even larger gains when the vision-language backbone is frozen and the action expert is trained from scratch. Extensive ablation studies demonstrate that tightly coupling hierarchical 3D geometry with pretrained 2D semantic representations is critical for robust and spatially grounded robot control. Our results also highlight the promise of pretrained 3D representations for 3D-aware VLA policies.

[CV-15] RoadTones: Tone Controllable Text Generation from Road Event Videos CVPR

链接: https://arxiv.org/abs/2605.21411
作者: Chirag Parikh,Siddhi Pravin Lipare,Ravi Kiran Sarvadevabhatla
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR Findings 2026. Project page: this https URL

点击查看摘要

Abstract:Existing video-language models can generate factual descriptions of road events but lack control over how these events are expressed: their tone, urgency, or style. This limits deployment in communication-critical settings where the effectiveness of a message depends on both content and presentation, not just factual accuracy. To mitigate this, we introduce a comprehensive dataset-model-evaluation suite for tone-controllable road video captioning. Our human-validated data generation pipeline expands road-video corpora with diverse tonal annotations and multi-tone captions, yielding the RoadTones-51K dataset. We propose RoadTones-VL-CoT, a controllable video-to-text model that also generates tone-conditioned Chain-of-Thought intermediate drafts for interpretability. We also introduce RoadTones-Eval, a new evaluation suite that jointly measures factual consistency and tone adherence. In addition, we conducted a user study whose results validate caption quality, tone control, and factual consistency. Together, these contributions lay the foundation for context-sensitive tone-controllable video captioning.

[CV-16] Disentangling Generation and Regression in Stochastic Interpolants for Controllable Image Restoration

链接: https://arxiv.org/abs/2605.21381
作者: Yi Liu,Jia Ma,Wengen Li,Jihong Guan,Shuigeng Zhou,Yichao Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 44 pages, 16 figures, 16 tables

点击查看摘要

Abstract:Recent advances in Image Restoration (IR) have been largely driven by generative methods such as Diffusion Models and Flow Matching, which excel in synthesizing realistic textures while suffering from slow multi-step inference and compromised pixel fidelity. In contrast, classical regression-based IR methods excel precisely in these aspects, offering single-step efficiency and high pixel-level reconstruction fidelity. To bridge this gap, we propose DiSI, a unified framework that Disentangles the underlying Stochastic Interpolant process into independent generation and regression components. This decoupling endows DiSI with remarkable versatility, enabling a continuous and controllable transition from a pure regression process to a fully generative one. Technically, we instantiate this framework with two specific sampling trajectories, accompanied by a unified sampler for high-quality, few-step inference on arbitrary trajectories. Furthermore, we design a dual-branch U-Net style transformer network in pixel space, using a dedicated branch to enhance conditional guidance while ensuring high throughput. Extensive experiments demonstrate that DiSI efficiently achieves competitive results on various IR tasks, while uniquely offering the inference-time flexibility to control the distortion-perception trade-off within a single model.

[CV-17] Closed Loop Dynamic Driving Data Mixture for Real-Synthetic Co-Training

链接: https://arxiv.org/abs/2605.21372
作者: Hongzhi Ruan,Pei Liu,Weiliang Ma,Zhengning Li,Xueyang Zhang,Jun Ma,Dan Xu,Kun Zhan
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Data scaling is fundamental to modern deep learning, and grows increasingly critical as autonomous driving shifts to end-to-end learning. Real-world driving data is expensive to annotate and scene-biased, making real-synthetic co-training with near-infinite synthetic data a promising direction. However, naively incorporating all available synthetic data is inefficient and leads to distribution shifts, and optimizing data mixture under practical training budgets remains a critical yet under-explored problem. In this sense, we claim that the mixture of training data requires clear guidance in terms of scene types and quantities. Particularly in this work, we conceptualize the data mixture approximately as a dynamic optimization process that iteratively adjusts the training data mixture to maximize model performance, guided by closed-loop evaluation feedback, and propose AutoScale, a fully automated closed-loop data engine unifying scene representation, data mixture optimization and retrieval, as well as model training and evaluation. Specifically, we propose Graph Regularized AutoEncoder (Graph-RAE) for driving scene representations, introduce Cluster-aware Gradient Ascent (Cluster-GA) for cluster-wise importance estimation and reweighting, and perform cluster-guided vector retrieval to select high-value samples. Experiments on NavSim demonstrate that AutoScale outperforms vanilla co-training and cross-domain baselines, achieving better performance with fewer synthetic samples under constrained budgets.

[CV-18] A Non-Reference Diffusion-Based Restoration Framework for Landsat 7 ETM SLC-off Imagery in Antarctica

链接: https://arxiv.org/abs/2605.21371
作者: Leyue Tang,Jonathan Louis Bamber,Gang Qiao,Yuanhang Kong
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IEEE JSTARS

点击查看摘要

Abstract:Acquiring usable optical imagery in Antarctica is inherently challenging due to prolonged polar nights and frequent cloud cover. Landsat provides the longest and most continuous optical observations and constitutes one of the most important remote sensing data sources for Antarctic studies. However, the scan-line corrector (SLC) failure in 2003 resulted in approximately 22% missing pixels in Landsat 7 ETM+ SLC-off imagery, severely limiting its usability. Unlike many non-polar environments, Antarctic surfaces undergo rapid and substantial changes, which makes it difficult to obtain reliable reference imagery and reduces the applicability of conventional reference-based gap-filling methods. To address this challenge, we propose DiffGF, a non-reference diffusion-based framework for restoring Landsat 7 SLC-off imagery without requiring any external reference data. DiffGF adopts a two-stage design consisting of a latent-space diffusion process and a pixel-space refinement. A dedicated Antarctic dataset, SLCANT, is constructed for training and evaluation. Quantitative and qualitative results demonstrate that DiffGF restores Antarctic SLC-off imagery with high fidelity. Its practical value is further examined through a downstream crevasse segmentation application. The results suggest that DiffGF provides a useful approach for exploiting Landsat 7 SLC-off archives in Antarctica, enabling the extraction of valuable information from historical records and supporting related Antarctic studies.

[CV-19] OcclusionFormer: Arranging Z-Order for Layout-Grounded Image Generation ICML2026 CCL

链接: https://arxiv.org/abs/2605.21343
作者: Ziye Li,Henghui Ding
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICML 2026, Project Page: this https URL

点击查看摘要

Abstract:Recent layout-to-image models have achieved remarkable progress in spatial controllability. However, they still struggle with inter-object occlusion. When bounding boxes overlap, most existing methods lack explicit occlusion information, which makes the generation in intersection regions inherently ambiguous and hinders the determination of complex occlusion relationships. As a result, they often produce entangled textures or physically inconsistent layering in the overlapped areas. To address this issue, we first construct SA-Z, a large-scale dataset enriched with explicit occlusion ordering and pixel-level annotations. Building upon our proposed dataset, we introduce OcclusionFormer, a novel occlusion-aware Diffusion Transformer framework that explicitly models Z-order priority by decoupling instances and compositing them via volume rendering. Furthermore, to ensure fine-grained spatial precision, we introduce a queried alignment loss that explicitly supervises individual instances and enhances semantic consistency. The proposed method effectively reduces ambiguity in overlapping regions, enforces correct occlusion dependencies, and preserves structural integrity, leading to substantial accuracy gains across diverse scenes.

[CV-20] Hyper-V2X: Hypernetworks for Estimating Epistemic and Aleatoric Uncertainty in Cooperative Birds-Eye-View Semantic Segmentation

链接: https://arxiv.org/abs/2605.21309
作者: Abhishek Dinkar Jagtap,Sanath Tiptur Sadashivaiah,Andreas Festag
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted for IEEE Intelligent Vehicle Symposium (IV) 2026

点击查看摘要

Abstract:Cooperative perception enabled by Vehicle-to-Everything (V2X) communication enhances autonomous driving safety by creating a unified environmental representation through shared sensory data. While recent works have advanced multi-agent fusion for improved perception, uncertainty quantification in such cooperative frameworks remains largely unexplored. This paper introduces Hyper-V2X, a hypernetwork-based framework for estimating both epistemic and aleatoric uncertainties in V2X-based perception. Specifically, we propose a partial weight generation scheme and V2X context embedding module that conditions a Bayesian hypernetwork on fused multi-agent features to generate weight distributions for stochastic Bird’s-Eye-View (BEV) segmentation. Unlike existing deterministic BEV models, Hyper-V2X enables efficient uncertainty estimation with little computation overhead. Our approach is architecture-agnostic, and can be seamlessly integrating with modern cooperative backbones such as CoBEVT. Experiments on the OPV2V benchmark demonstrate that Hyper-V2X provides accurate, well-calibrated uncertainty estimates and improves overall perception reliability. Our code and benchmark are publicly available under an open-source license: this https URL

[CV-21] Deformba: Vision State Space Model with Adaptive State Fusion

链接: https://arxiv.org/abs/2605.21308
作者: Hongyu Ke,Jack Morris,Yongkang Liu,Satoshi Kitai,Kentaro Oguchi,Yi Ding,Haoxin Wang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:State Space Models (SSMs) have emerged as a powerful and efficient alternative to Transformers, demonstrating linear-time complexity and exceptional sequence modeling capabilities. However, their application to vision tasks remains challenging. First, existing vision SSMs largely depend on manually designed fixed scanning methods to flatten image patches into sequences, which imposes predefined geometric structures and increases the complexity. Second, the broader adoption of vision SSMs is hindered in domains that require query-based interactions between distinct information streams. This is a result of the inherently causal and self-referential nature of SSMs designed for 1D sequence modeling tasks. This fusion mechanism is indispensable for critical perception tasks such as multi-view 3D fusion. To address these limitations, we propose Deformba, a context adaptive method that dynamically augments the spatial structural information while maintaining the linear complexity of SSMs. Deformba also allows multi-modal fusion like cross attention. To demonstrate the effectiveness and general applicability of Deformba, we test its performance on general 2D vision tasks such as image classification, object detection, and segmentation, as well as 3D vision tasks like BEV perception. Extensive experiments show that Deformba achieves strong performance across various visual perception benchmarks.

[CV-22] Automatic Discovery of Disease Subgroups by Contrasting with Healthy Controls ECML-PKDD2026

链接: https://arxiv.org/abs/2605.21301
作者: Robin Louiset,Edouard Duchesnay,Benoit Dufumier,Antoine Grigis,Pietro Gori
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to Data Mining and Knowledge Discovery, ECML-PKDD 2026 Journal Track

点击查看摘要

Abstract:In biomedical Subgroup Discovery, practitioners are interested in discovering interpretable and homogeneous subgroups within a group of patients. In this paper, assuming that healthy subjects (i.e., controls) share common but irrelevant factors of variation with the patients, we motivate and develop a Contrastive Subgroup Discovery method, entitled Deep UCSL. By contrasting patients with controls, Deep UCSL identifies subgroups driven solely by pathological factors, ignoring common variability shared with healthy subjects. Our framework employs a deep feature extractor to learn a discriminative representation space. Mathematically, we derive a novel loss based on the conditional joint likelihood of latent clusters and patient/control labels, optimized via an Expectation-Maximization strategy alternating between subgroup inference and feature encoder updates. A regularization term further encourages representations to capture disease-specific variability while ignoring variability shared with controls. Compared to previous related works, our approach quantitatively improves the quality of the estimated subgroups, as demonstrated on a MNIST example and four distinct real medical imaging datasets. Code and datasets are available at: this https URL.

[CV-23] Reducing Object Hallucination in LVLMs via Emphasizing Image-negative Tokens

链接: https://arxiv.org/abs/2605.21300
作者: Meng Shen,Minghao Wu,Deepu Rajan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 10 figures, 10 tables

点击查看摘要

Abstract:Object hallucination is a significant challenge that hinders the application of large vision-language models (LVLMs) in practice. We hypothesize that one possible origin of hallucination is the model’s tendency to prioritize text generation over meaningful interaction with images. To explore this, we examine the generation process and categorize text tokens into three groups: image-positive, invariant, and negative, based on their visual dependence on input image tokens. Our analysis reveals that most generated tokens are minimally influenced by the image information. This suggests that during the model’s training stage, more emphasis is placed on learning how to follow textual instructions, rather than extracting information from images. Based on this finding, we propose adjusting the training weights of different tokens depending on their visual dependence to control hallucination. Additionally, we remove a portion of the training data that potentially contains more hallucinations as a data filtering strategy. Both methods achieve a reduction in hallucination without compromising response length or introducing additional computational costs during inference. We validate our methods across three LVLM variants, demonstrating the effectiveness and general applicability.

[CV-24] Let EEG Models Learn EEG ICML2026

链接: https://arxiv.org/abs/2605.21280
作者: Yifan Wang,Yijia Ma,Wen Li,Chenyu You
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML 2026

点击查看摘要

Abstract:High-fidelity EEG generation is critical for alleviating data scarcity and addressing privacy constraints in large-scale neural modeling. Despite recent progress, most existing approaches formulate EEG generation via discrete denoising objectives, which inadequately reflect the inherently continuous temporal dynamics and spectral structure of neural activity. As a result, these methods often struggle to preserve long-range temporal dependencies and exhibit mismatches in the spectral and temporal structure of the generated signals. In this work, we argue that effective EEG generation requires models that operate directly on the continuous evolution of neural signals. We introduce Just EEG Transformer (JET), a generative framework based on conditional flow matching that models EEG as raw sequences evolving along continuous trajectories. By learning a smooth vector field that transports noise to the EEG data distribution, JET captures temporal continuity and transient dynamics without relying on discretized denoising schemes or domain-specific representations. To ensure that the learned dynamics remain consistent with key properties of EEG signals, we introduce principled constraints that preserve spectral structure, temporal stationarity, and signal-level statistics. Across three large-scale benchmarks, JET consistently achieves state-of-the-art performance, reducing TS-FID by over 40% compared to strong baselines. Extensive analyses show that JET captures key structural properties of neural dynamics, providing a scalable and principled approach to EEG generation. Project page: this https URL .

[CV-25] DriveMA: Rethinking Language Interfaces in Driving VLAs with One-Step Meta-Actions

链接: https://arxiv.org/abs/2605.21273
作者: Weicheng Zheng,Yixin Huang,Qiao Sun,Derun Li,Hang zhao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Driving Vision-Language-Action Models (Driving VLAs) commonly introduce natural-language reasoning as an intermediate interface for end-to-end planning, but reasoning-centric interfaces face three practical bottlenecks: obtaining high-quality reasoning annotations is difficult, generating and understanding long reasoning chains is challenging for compact models, and inference latency is substantially increased. In this paper, we rethink the design of language interfaces in Driving VLAs and show that concise one-step meta-actions are a simple yet effective alternative to verbose reasoning. Meta-actions provide semantic decision grounding while remaining low-entropy, and being automatically derivable from expert trajectories, enabling scalable supervision and reliable trajectory conditioning. Building on this interface, we propose DriveMA, which combines action-centric supervised training with a turn-level credit-assignment reinforcement learning framework that jointly optimizes meta-action correctness, trajectory quality, and trajectory–meta-action consistency. Experiments show that DriveMA already achieves a new state of the art on the Waymo End-to-End Driving Challenge with a 2B model, reaching a Rater Feedback Score (RFS) of 8.060, while its 4B version further improves the state of the art to 8.079; DriveMA also obtains competitive performance on NAVSIM. Ablations demonstrate that one-step meta-actions offer a better practical trade-off between expressiveness, predictability, and inference efficiency than natural-language reasoning or finer-grained action sequences. Code, data, and models will be released to facilitate future research.

[CV-26] MONET: A Massive Open Non-redundant and Enriched Text-to-image dataset

链接: https://arxiv.org/abs/2605.21272
作者: Benjamin Aubin,Gonzalo Iñaki Quintana,Onur Tasar,Sanjeev Sreetharan,Urszula Czerwinska,Damien Henry,Clément Chadebec
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Training large text-to-image models requires high-quality, curated datasets with diverse content and detailed captions. Yet the cost and complexity of collecting, filtering, deduplicating, and re-captioning such corpora at scale hinders open and reproducible research in the field. We introduce MONET, an open Apache 2.0 dataset of approx. 104.9M image–text pairs collected from 2.9B raw pairs across heterogeneous open sources through successive stages of safety filtering, domain-based filtering, exact and near-duplicate removal, and re-captioning with multiple vision-language models covering short to long-form descriptions, and further augmented with synthetically generated samples. Each image is shipped with pre-computed embeddings and annotations to accelerate downstream use. To validate the effectiveness of MONET, we train a 4B-parameter latent diffusion model exclusively on it and reach competitive GenEval and DPG scores, demonstrating that our dataset lowers the barrier to large-scale, reproducible text-to-image research.

[CV-27] Vision Transformers and Convolutional Neural Networks for Land Use Scene Classification

链接: https://arxiv.org/abs/2605.21268
作者: Arun D. Kulkarni
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages

点击查看摘要

Abstract:Land Use Scene Classification (LUSC) from remote sensing imagery plays a critical role in environmental monitoring, urban planning, and sustainable resource management. In recent years, deep learning methods have significantly advanced the state of the art, with Convolutional Neural Networks (CNNs) dominating the field because of their strong ability to capture local spatial features. However, the emergence of Vision Transformers (ViTs) has introduced a new paradigm that models long-range dependencies through self-attention mechanisms, potentially enabling improved global context understanding. This paper presents a comparative assessment of Vision Transformers and CNN-based architecture for remote sensing land use scene classification. Representative CNN models, such as AlexNet, is evaluated alongside the Vision Transformer (ViT) using benchmark remote sensing datasets, including the UC Merced Land Use and EuroSAT Land Use datasets. The study examines classification accuracy, precision, recall, F1-score, and computational complexity to provide a comprehensive performance comparison. Experimental results demonstrate that CNNs perform robustly on datasets with limited training samples and strong local texture characteristics, whereas Vision Transformers exhibit superior performance in capturing global spatial relationships in complex scenes when sufficient training data are available. However, ViTs typically require greater computational resources and larger training datasets to achieve optimal performance. The findings of this study provide insights into the strengths and limitations of both architectures and offer guidance for selecting appropriate models for remote sensing land use scene classification applications.

[CV-28] STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval

链接: https://arxiv.org/abs/2605.21261
作者: Miaoge Li,Dongsheng Wang,Zening Sun,Jinsen Zhang,Wenhan Luo,Jingcai Guo
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Training-free zero-shot composed image retrieval models are recently gaining increasing research interest due to their generalizability and flexibility in unseen multimodal retrieval. Recent LLM-based advances focus on generating the expected target caption by exploring the compositional ability behind the LLMs. Although efficient, we find that 1) the generated captions tend to introduce unexpected features from the reference image due to the semantic gap between the input image and text modification, where the image contains much more details than the text; 2) the point-to-point alignment during the retrieval stage fails to capture diverse compositions. To address these challenges, we introduce a novel Semantic Transition and Transportation in collaboration framework for training-free zero-shot CIR tasks. Specifically, given the composed caption inferred by an LLM, we aim to refine it through a transition vector in the embedding space and make it closer to the target image. Combining LLMs with user instruction, the refined caption concentrates more on the core modification intent and thus filters out unnecessary noise. Moreover, to explore diverse alignment during the retrieval stage, we model the caption and image as discrete distributions and reformulate the retrieval task as a set-to-set alignment task. Finally, a bidirectional transportation distance is developed to consider fine-grained alignments across modalities and calculate the retrieval score. Extensive experiments demonstrate that our method can be general, effective, and beneficial for many CIR tasks.

[CV-29] SR-Ground: Image Quality Grounding for Super-Resolved Content

链接: https://arxiv.org/abs/2605.21244
作者: Artem Borisov,Evgeney Bogatyrev,Khaled Abud,Dmitriy Vatolin
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Super-Resolution (SR) has advanced rapidly in recent years, with diffusion-based models achieving unprecedented fidelity at the cost of introducing new types of visual artifacts. While existing Image Quality Assessment (IQA) methods provide holistic quality scores, they lack interpretability and fail to distinguish between different artifact types arising from modern SR approaches. To address this gap, we introduce SR-Ground, a large-scale dataset specifically designed for fine-grained artifact segmentation in super-resolved images. The dataset comprises images processed by a diverse set of state-of-the-art SR models, with pixel-level annotations for multiple artifact categories. We conduct a large-scale crowdsourcing study involving 1,062 participants to validate and refine automatically generated segmentations, resulting in a high-quality dataset of 63,000 images spanning 6 distinct artifact types. We demonstrate that training IQA models with grounding capabilities on SR-Ground significantly improves performance on downstream tasks. Furthermore, we introduce a fine-tuning pipeline that leverages our grounding model to reduce perceptible artifacts in SR outputs, showcasing the practical utility of our dataset. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2605.21244 [cs.CV] (or arXiv:2605.21244v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2605.21244 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-30] RePCM: Region-Specific and Phenotype-Adaptive Bi-Ventricular Cardiac Motion Synthesis MICCAI2026

链接: https://arxiv.org/abs/2605.21237
作者: Xuan Yang,Xiaohan Yuan,Hao Li,Lingyu Chen,Yanan Liu,Lei Li
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Early Accepted by MICCAI 2026. This is the author’s submitted version. 10 pages, 3 figures

点击查看摘要

Abstract:Cardiac motion over a cardiac cycle is crucial for quantifying regional function and is strongly affected by cardiovascular diseases. Since temporally dense mesh sequences are difficult to obtain in practice, we focus on leveraging the more accessible end-diastolic frame to infer a full-cycle sequence. Due to strong regional and disease-specific differences, traditional methods often oversmooth the data by relying on generative models that are optimized for global patterns. To address this problem, we propose Region-Aware and Phenotype-Adaptive Bi-Ventricular Cardiac Motion Synthesis (RePCM) for single frame Bi-ventricular mesh motion completion. In Stage I, a reconstruction network learns vertex wise motion descriptors and clustering yields a data driven functional partition, providing an explicit motion derived region structure. In Stage II, a Region-Specific Injection Module enforces masked, synchronized region exchange within a conditional VAE, preserving localized specific dynamics and restricting cross-region mixing. A Phenotype-Adaptive Mixture-of-Experts prior conditioned on ED shape uses anatomy-guided cues to model latent motion trends and capture inter-disease variability. Experiments on three datasets covering different cardiovascular diseases show consistent gains in geometric and functional metrics and improved preservation of region specific dynamics.

[CV-31] PGC: Peak-Guided Calibration for Generalizable AI-Generated Image Detection

链接: https://arxiv.org/abs/2605.21207
作者: Xiaoyu Zhou,Jianwei Fei,Peipeng Yu,Jingchang Xie,Chong Cheng,Zhihua Xia
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rapid evolution of generative AI, from GANs to modern diffusion models, has resulted in increasingly subtle discriminative clues. These fine-grained signals are often overshadowed by dominant, high-fidelity image content (e.g., the main subject), limiting the reliability of existing detectors that predominantly rely on global representations. To address this challenge, we propose the Peak-Guided Calibration (PGC) framework. PGC introduces a novel strategy that aggregates salient features via a peak-focusing mechanism. Specifically, by employing a peak-sensitive aggregation that accentuates the most discriminative local clues, PGC leverages these critical signals to calibrate the global decision. This approach recovers subtle patterns that would otherwise be submerged in the global context. Furthermore, to better simulate real-world threats, we introduce the CommGen15 dataset, a challenging benchmark comprising samples from 15 commercial models. Extensive experiments demonstrate that PGC achieves state-of-the-art performance. Specifically, it improves mean accuracy by +12.3% on our CommGen15 dataset, and sets new records on standard benchmarks, including GenImage (+2.1%), AIGI (+3.5%), and UniversalFakeDetect (+1.7%). Code is available at this https URL.

[CV-32] RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution

链接: https://arxiv.org/abs/2605.21195
作者: Siyong Jian,Siyuan Li,Luyuan Zhang,Zedong Wang,Xin Jin,Ying Li,Cheng Tan,Huan Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Discrete autoregressive (AR) text-to-image (T2I) models pair a VQ tokenizer with an AR policy, and current post-training pipelines optimize only the policy while keeping the VQ decoder frozen. Recent diffusion T2I work, exemplified by REPA-E, has shown that the VAE itself constitutes a key alignment bottleneck, yet no analogous investigation exists for discrete AR models. We show that policy-only optimization induces Latent Covariate Shift: as the policy evolves, the resulting token distribution diverges from the ground-truth distribution on which the decoder was trained, such that reward scores improve while decoded image quality degrades. To address this mismatch, we propose RankE, the first end-to-end post-training framework for discrete T2I generation. Rather than optimizing the policy against a fixed decoder, RankE co-evolves both components through alternating optimization: each module maximizes a ranking-based alignment objective while being regularized by a stability-preserving anchor suited to its parameter space. This co-evolution breaks the fidelity–alignment trade-off that plagues frozen-decoder approaches: on LlamaGen-XL (775M), standard RL improves CLIP but degrades FID, whereas RankE improves both simultaneously (FID 15.21, CLIP 33.76 on MS-COCO 30K). Consistent gains on Janus-Pro (1B) confirm that decoder co-evolution reliably converts reward optimization into pixel-space quality improvements.

[CV-33] Semantic Granularity Navigation in Image Editing ICML2026

链接: https://arxiv.org/abs/2605.21190
作者: Liangsi Lu,Minzhe Guo,Xuhang Chen,Yang Shi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML 2026

点击查看摘要

Abstract:Despite the generative capabilities of diffusion and flow models, real-image editing remains constrained by a persistent trade-off between semantic editability and structural fidelity. We trace a primary cause of this limitation to the implicit coupling of edit progress with model scale in existing paradigms. Under this coupling, stronger edits typically require visiting noisier states, which spends computation on destabilizing layout before the semantic change is well localized. We introduce NaviEdit, a training-free inference-time controller that decouples edit progress from model scale traversal through a strict self-consistency contract. NaviEdit operates at the rollout level and leaves the underlying pretrained model unchanged. It treats scale as a control input and reallocates a fixed step budget toward semantically responsive intermediate scales instead of destructive high-noise regimes. Experiments show positive average gains across compatible editors and flow backbones, supporting decoupling as a portable inference-time control principle.

[CV-34] SAM-Sode: Towards Faithful Explanations for Tiny Bacteria Detection

链接: https://arxiv.org/abs/2605.21186
作者: Wanying Tan,Shuo Yan,Dazhi Huang,Yazheng Liu,Zili Shao,Rufeng Chen,Hechang Chen,Mude Shi,Tianxing Ji,Sihong Xie
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 4 figures, conference paper

点击查看摘要

Abstract:Interpretability in object detection provides crucial confidence support for clinical auxiliary diagnosis. However, in tiny bacteria detection, traditional explanation methods often suffer from blurred foreground boundaries and diffuse feature attribution due to the extreme sparsity of target morphological features and severe interference from complex backgrounds. Such limitations hinder the provision of logically coherent morphological evidence. To bridge this gap, we propose a novel eXplainable AI (XAI) framework, SAM-Sode. The framework innovatively transforms initial feature attribution maps into geometry-aware prompts, leveraging the prior knowledge of the foundation model (SAM3) to achieve spatial refinement and morphological reconstruction of the explanatory mappings. Furthermore, we introduce a dual-constraint mechanism based on physical significance and geometric alignment to perform instance-level denoising, generating coherent explanations that better align with human expert intuition. Experimental results on our self-constructed bacteria dataset with complex circuit backgrounds (containing 2,524 images) and other public datasets demonstrate that the proposed method effectively suppresses background redundancy and significantly enhances the decision-making transparency of tiny object detection.

[CV-35] FTerViT: Fully Ternary Vision Transformer

链接: https://arxiv.org/abs/2605.21171
作者: Szymon Ruciński,Pietro Bonazzi,Engin Türetken,Simon Narduzzi,Michele Magno,Nadim Maamari
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint

点击查看摘要

Abstract:Ternary Vision Transformers offer substantial model compression, however state-of-the-art methods only ternarize the encoder layers, leaving patch embeddings, LayerNorm parameters, and classifier heads in full precision. In compact models targeting resource-constrained processors, such as microcontrollers, these remaining full-precision components determine the total memory footprint, severely limiting deployment efficiency and on-device feasibility. In this work, we introduce a fully ternarized Vision Transformer in which \emphall weight matrices and normalization parameters are ternarized (FTerViT). To this end, we introduce two novel operators : TernaryBitConv2d with per-channel scaling for patch embedding and TernaryLayerNorm. FTerViT is trained using knowledge distillation, followed by a lightweight quantization-aware recovery phase. Our ternary W2A8 DeiT-III-S at 384 \times 384 resolution achieves 82.43% ImageNet-1K top-1 at 6.09,MB ( \sim 15 \times compression, - 2.42,pp vs.\ FP32), outperforming prior ternary ViTs methods up to 8 pp. Finally, we demonstrate the first implementation of ternary vision transformers on a dual cores XTensa LX7 microcontroller inside the ESP32-S3 system-on-chip. By deploying FTerViT-Small (based on DeiT-III-Small at 224 \times 224 resolution, 5.81,MB), we achieve 79.64% ImageNet-1K top-1 accuracy.

[CV-36] Comparative Analysis of Military Detection Using Drone Imagery Across Multiple Visual Spectrums

链接: https://arxiv.org/abs/2605.21157
作者: Sourov Roy Shuvo,Prajwal Panth,Rajesh Chowdhury,Sorup Chakraborty,Sudip Chakrabarty,Prasant Kumar Pattnaik
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 6 pages, 7 figures. Accepted at the 16th International Conference on Computing, Communication and Networking Technologies (ICCCNT), July 6-11, 2025, IIT Indore. Proceedings pending publication

点击查看摘要

Abstract:In modern warfare, drones are becoming an essential part of intelligence gathering and carrying out precise attacks in different kinds of hostile environments. Their ability to operate in real-time and hostile environments from a safe distance makes them invaluable for surveillance and military operations. The KIIT-MiTA dataset is comprised of images of different military scenarios taken from drones, and these provide a foundation for detecting military objects, but it does not take into account the various types of real-world scenarios. With that in mind, to evaluate how the models are performing under varying conditions, four different types of datasets are created: Gray Scale, Thermal Vision, Night Vision, and Obscura Vision. These simulate the real-world environments such as low visibility, heat-based imagery, and nighttime conditions. The YOLOv11-small model is trained and used to detect objects across diverse settings. This research boosts the performance and reliability of drone-based operations by contributing to the development of advanced detection systems in both defensive and offensive missions.

[CV-37] Distill to Think Foresee to Act: Cognitive-Physical Reinforcement Learning for Autonomous Driving

链接: https://arxiv.org/abs/2605.21139
作者: Yang Wu,Qiang Meng,Zhaojiang Liu,Youquan Liu,Jian Yang,Jin Xie
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Current end-to-end autonomous driving models are fundamentally constrained by the behavioral cloning ceiling of imitation learning. While reinforcement learning offers a path to smarter autonomy, it demands two missing pieces of infrastructure: (1) a cognitive foundation that understands traffic semantics and driving intent, and (2) a foresighted physical environment that can anticipate the consequences of candidate actions. To this end, we propose CoPhy, a CognitivePhysical reinforcement learning framework for autonomous driving. To distill to think, we distill VLM knowledge into the BEV encoder and then discard the VLM entirely, retaining cognitive ability at zero inference cost while releasing the cognitive channel as a pluggable interface for optional human language commands. To foresee to act, we build an auto-regressive BEV world model that explicitly predicts future semantic maps conditioned on candidate actions, serving as an interpretable physical sandbox from which safety metrics are directly derived. Built upon this dual infrastructure, we optimize the driving policy via GRPO with a novel dual-reward mechanism: a physical reward derived from BEV rollouts enforces hard safety constraints, while a cognitive reward from a language-aligned scorer ensures intent compliance. Extensive experiments demonstrate that CoPhy not only achieves state-of-the-art results on NAVSIM v1 and v2 benchmarks, but also enables safer driving via cognitively informed scene compliance and flexible intent control through user-defined language instructions.

[CV-38] SurgOnAir: Hierarchy-Aware Real-Time Surgical Video Commentary

链接: https://arxiv.org/abs/2605.21132
作者: Jingyi He,Yue Zhou,Long Bai,Kun Yuan,Nassir Navab,Yuan Bi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Understanding surgical workflow in real time is fundamental for intelligent surgical embodiment, where AI systems continuously perceive and respond as surgery proceeds. In the operating room, critical decisions depend on subtle, moment-to-moment changes, such as fine instrument movements and evolving tissue states, where even slight perceptual delays can limit assistance or compromise safety. Yet existing methods remain offline or operate at coarse temporal scales, generating descriptions only after processing clips, preventing immediate reaction. We address this by proposing SurgOnAir, a streaming vision-language model that processes frames sequentially without future access and progressively generates narration tokens as visual input arrives. SurgOnAir achieves fine-grained frame-to-token generation, enabling instant responsiveness to evolving surgical dynamics. Built upon our curated hierarchical dataset SurgOnAir-11k spanning action-, step-, and phase-level supervision, the model is trained to produce multi-level textual responses that reflect the inherent hierarchy of surgical procedures. Furthermore, special transition tokens are generated to explicitly mark state changes, allowing SurgOnAir to capture and signal key workflow transitions as they occur. Experiments show that SurgOnAir enables real-time understanding through a single vision-language model that unifies streaming across multiple hierarchies of the surgical workflow, generating superior and hierarchy-aware narrations. Code and dataset will be public.

[CV-39] UniT: Unified Geometry Learning with Group Autoregressive Transformer

链接: https://arxiv.org/abs/2605.21131
作者: Haotian Wang,Yusong Huang,Zhaonian Kuang,Hongliang Lu,Xinhu Zheng,Meng Yang,Gang Hua
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IEEE T-PAMI

点击查看摘要

Abstract:Recent feed-forward models have significantly advanced geometry perception for inferring dense 3D structure from sensor observations. However, its essential capabilities remain fragmented across multiple incompatible paradigms, including online perception, offline reconstruction, multi-modal integration, long-horizon scalability, and metric-scale estimation. We present UniT, a unified model built upon a novel Group Autoregressive Transformer, which reformulates these seemingly disparate capabilities within a single framework. The key idea is to treat groups of sensor observations as the basic autoregressive units and predict the corresponding point maps in an anchor-free and scale-adaptive manner. More specifically, diverse view configurations in both online and offline settings are naturally unified within a single group autoregression process. By varying the group size, online mode operates over multiple autoregressive steps with single-frame groups, whereas offline mode aggregates a multi-frame group in a single forward pass. Meanwhile, a queue-style KV caching mechanism ensures bounded autoregressive memory over long horizons. This is enabled by reducing long-range dependencies on early frames through anchor-free relational modeling, thereby allowing outdated memory to be discarded on the fly. To improve metric-scale generalization across scenes, a scale-adaptive geometry loss is further introduced within this framework. It couples relative geometric constraints with a partial absolute scale term, implicitly regularizing global scale and inducing a progressive transition from scale-invariant geometry to metric-scale solutions. Together with a dedicated modal attention module for integrating auxiliary modalities, UniT achieves state-of-the-art performance in unified geometry perception, as validated on ten benchmarks spanning seven representative tasks.

[CV-40] VersusQ: Pairwise Margin Reasoning for Generalizable Video Quality Assessment

链接: https://arxiv.org/abs/2605.21130
作者: Shibei Meng,Binxin Yang,Yuan Liu,Jiexuan Zhang,Zhengyao Lv,Hubery Yin,Qiang Xu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large Multimodal Models (LMMs) have shown promise for video quality assessment, but most methods still predict an absolute score for each video. Such pointwise supervision often mixes perceptual quality with dataset-specific calibration, including annotation protocols, rating habits, and score distributions. As a result, the learned scoring rule may work well within a benchmark but transfer poorly across unseen domains. We argue that relative comparisons alleviate the absolute-scale calibration bias by focusing purely on perceptual differences rather than dataset-specific rating habits. Consequently, we propose \textbfVersusQ, a pairwise margin reasoning framework driven entirely by direct comparisons. Specifically, VersusQ performs LMM-based comparison between two videos, reasons about their visual and temporal quality differences, and predicts a signed continuous margin that captures both the preferred choice and the degree of difference. Furthermore, to align interpretable comparison rationales with fine-grained numerical differences, we introduce Margin-Coupled GRPO, which jointly optimizes rollout-based relational reasoning and continuous margin regression. Extensive experiments on multiple public VQA benchmarks demonstrate that VersusQ achieves state-of-the-art performance, strong cross-domain generalization, and reliable fine-grained ranking under heterogeneous evaluation scenarios.

[CV-41] Linear-DPO: Linear Direct Preference Optimization for Diffusion and Flow-Matching Generative Models

链接: https://arxiv.org/abs/2605.21123
作者: Kesong Li,Yixuan Xu,Kuo-kun Tseng,Weiyi Lu,Kan Liu,Tao Lan
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Code and models are available at: this https URL . Work done during an internship at Alibaba Group

点击查看摘要

Abstract:Direct Preference Optimization (DPO) is successful for alignment in LLMs but still faces challenges in text-to-image generation. Existing studies are confined to denoising diffusion models while overlooking flow-matching, and suffer from an objective mismatch when applying discrete NLP-based DPO to regression-based generative tasks.\ In this paper, we derive a generalized DPO objective that covers both diffusion and flow-matching via a unified reverse-time SDE framework, and point out from a gradient perspective that the standard DPO objective is suboptimal for text-to-image generation. Consequently, we propose Linear-DPO, which replaces the aggressive sigmoid-based utility function with a sustained linear utility and incorporates an EMA-updated reference model. Qualitative and quantitative experiments on diffusion models (SD1.5, SDXL) and flow-matching model (SD3-Medium) demonstrate the superiority of our approach over existing baselines.

[CV-42] ROAR-3D: Routing Arbitrary Views for High-Fidelity 3D Generation

链接: https://arxiv.org/abs/2605.21121
作者: Hanxiao Sun,Mingxin Yang,Shuhui Yang,Zebin He,Xintong Han,Hongbo Fu,Chunchao Guo,Wenhan Luo
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:Single-image-to-3D generative models can now produce high-quality geometry, yet conditioning on a single view inevitably introduces ambiguity about unseen regions. Multi-view conditioning can reduce this ambiguity, but existing methods either require fixed canonical viewpoints or rely on external reconstruction modules that impose heavy training costs and limit generation quality. We observe that pretrained single-view models already possess strong 2D-to-3D grounding that can be reused for multi-view conditioning. However, a closer analysis reveals that their conditioning mechanism entangles orientation control with geometry transfer, two functions that conflict when images from different viewpoints are naively combined. Based on this analysis, we propose ROAR-3D, a lightweight method that upgrades a pretrained single-view model to accept an arbitrary number of unposed images. A token-wise view router assigns each 3D latent token to its most relevant view, implicitly establishing 2D-to-3D correspondences without explicit pose input. A dual-stream attention design preserves the pretrained primary-view behavior while routing auxiliary views through a separate path dedicated to geometric enrichment. An orientation perturbation strategy ensures the auxiliary path learns orientation-independent geometry transfer. These components introduce minimal trainable parameters and add negligible inference overhead relative to the single-view baseline. ROAR-3D achieves state-of-the-art multi-view 3D generation quality and supports test-time view scaling from 1 to 12+ views with consistent improvements.

[CV-43] RCGDet3D: Rethinking 4D Radar-Camera Fusion-based 3D Object Detection with Enhanced Radar Feature Encoding

链接: https://arxiv.org/abs/2605.21112
作者: Weiyi Xiong,Bing Zhu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:4D automotive radar is indispensable for autonomous driving due to its low cost and robustness, yet its point cloud sparsity challenges 3D object detection. Existing 4D radar-camera fusion methods focus on complex fusion strategies, trading inference speed for marginal gains. This trade-off hinders real-time deployment due to heavy computation on dense feature maps. In contrast, feature extraction from sparse radar points is less time-consuming but remains under-explored. This work uncovers that simply enhancing radar feature extraction can achieve comparable or even higher performance than elaborate fusion modules, while maintaining real-time performance. Based on this finding, we propose RCGDet3D, which centers on radar feature encoding and simplifies multi-modal fusion. Its encoder inherits from the efficient Gaussian Splatting-based Point Gaussian Encoder (PGE) in RadarGaussianDet3D with two key improvements. First, the Ray-centric PGE (R-PGE) predicts Gaussian attributes in ray-aligned coordinate systems before unifying them to Bird’s-Eye View (BEV) space, significantly improving geometric consistency and reducing learning difficulty by decoupling the coordinate transformation from representation learning. Second, a Semantic Injection (SI) module incorporates visual cues from images, producing more geometrically accurate and semantically enriched radar features. Experiments on View-of-Delft (VoD) and TJ4DRadSet show that RCGDet3D outperforms state-of-the-art methods in both accuracy and speed, setting a new benchmark for real-time deployment.

[CV-44] R2AoP: Reliable and Robust Angle of Progression Estimation from Intrapartum Ultrasound MICCAI2026

链接: https://arxiv.org/abs/2605.21099
作者: Yuanhan Wang,Yifei Chen,Beining Wu,Mingxuan Liu,Xiaotian Hu,Chunbo Jiang,Yijin Li,Changmiao Wang,Feiwei Qin,Qiyuan Tian
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11pages,4 figures,Accepted by MICCAI 2026

点击查看摘要

Abstract:Accurate estimation of the Angle of Progression (AoP) from intrapartum transperineal ultrasound is critical for objective assessment of labor progression, yet remains highly sensitive to imaging noise, boundary ambiguities, and the geometric amplification of local segmentation errors. We propose R2AoP, a reliable and robust AoP estimation framework that integrates structurally informed segmentation and confidence-guided geometric modeling to achieve stable and reproducible measurements. A three-branch local-structure-enhanced backbone improves the delineation of the pubic symphysis (PS) and fetal head (FH), while confidence-weighted contour fitting explicitly suppresses the influence of unreliable boundary points in AoP computation. To further improve performance under heterogeneous acquisition conditions, we introduce a lightweight geometry-reliable test-time adaptation strategy as an auxiliary component, enabling stable inference without target annotations. Extensive evaluations on multi-center benchmarks demonstrate consistent reductions in AoP error and boundary metrics compared with state-of-the-art AoP methods. Our source code is available at this https URL.

[CV-45] xtSculptor: Training and Benchmarking Scene Text Editing

链接: https://arxiv.org/abs/2605.21090
作者: Yiheng Lin,Siyu Jiao,Xiaohan Lan,Wei Zhou,Qi She,Fei Yu,Heyun Chen,Zhengwei Wang,Jinghuan Chen,Moran Li,Yingchen Yu,Zijian Feng,Yao Zhao,Yunchao Wei,Yujie Zhong
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in Multimodal Large Language Models (MLLMs) and diffusion-based generative models have substantially improved prompt-driven image editing. However, scene text editing remains challenging, as it requires models to precisely modify textual content while preserving visual realism and non-target regions. Current open-source models still lag behind proprietary systems, largely due to the scarcity of high-quality training data and the lack of standardized benchmarks tailored to text editing. To address these challenges, we present TextSculptor, a comprehensive framework for data construction and evaluation of scene text editing. We first develop an automated data construction pipeline that combines text-aware image synthesis with programmatic text rendering and compositing. Based on this pipeline, we build TextSculpt-Data, a large-scale dataset containing 3.2M training samples, including 1.2M OCR-verified text-to-image samples and 2M paired text editing samples with naturally aligned source-target images and strong background consistency. We further introduce TextSculpt-Bench, a benchmark covering four fundamental text editing tasks: text addition, text replacement, text removal, and hybrid editing. To support reliable evaluation, we design a tailored protocol that measures text accuracy, visual quality, and background preservation through OCR-based text alignment, multimodal judgment, and background-region similarity. Extensive experiments show that TextSculptor improves open-source text editing performance and narrows the gap to proprietary models. The data and benchmark are available at this https URL.

[CV-46] VDFP: Video Deflickering with Flicker-banding Priors

链接: https://arxiv.org/abs/2605.21079
作者: Zhiyi Zhou,Libo Zhu,Zihan Zhou,Yulun Zhang,Xiaokang Yang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL

点击查看摘要

Abstract:Capturing digital screens with smartphones frequently induces severe banding due to hardware synchronization mismatches. Existing video restoration methods struggle with these structured, periodic luminance fluctuations, often resulting in residual artifacts or over-smoothed textures. We firstly construct DeViD, a real-world dataset in various scenes to deal with the lack of available this http URL we propose VDFP (Video Deflickering with Flicker-banding Priors), a novel perception-guided generation framework. First, we introduce a Degradation Field Modeling Based on Rolling Shutter Mechanism (DFM) capable of synthesizing complex multi-banding scenarios. Second, we present a spatial-temporal continuous prior perception (CPP). Unlike traditional binary segmentation, this module is optimized via a Flicker-Aware Mean Squared Error (FA-MSE) to capture the luminance transitions. By zero-initializing an augmented input layer, our model preserves pre-trained generative priors as well as spatial-temporal prior perception. Extensive experiments demonstrate that VDFP significantly outperforms other methods, eliminating complex banding with high-fidelity spatial details and temporal consistency. Our dataset and code will be released at~ this https URL.

[CV-47] SpectralEarth-FM: Bringing Hyperspectral Imagery into Multimodal Earth Observation Pretraining

链接: https://arxiv.org/abs/2605.21075
作者: Nassim Ait Ali Braham,Aaron Banze,Conrad M. Albrecht,Julien Mairal,Jocelyn Chanussot,Xiao Xiang Zhu
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Earth observation (EO) foundation models (FMs) are increasingly trained on multisensor data, spanning multispectral imagery (MSI), synthetic aperture radar (SAR), and derived geospatial layers, but hyperspectral imagery (HSI) remains underrepresented. Conversely, existing hyperspectral FMs are trained on HSI alone, leaving joint pretraining and fusion of HSI with co-located EO sensors unexplored. We introduce SpectralEarth-FM, a hierarchical transformer for multisensor EO input with heterogeneous spectral dimensionality. The architecture combines spectral tokenization for hyperspectral inputs, sensor-specific encoders, a cross-sensor fusion module, and a shared hierarchical encoder, enabling joint processing of HSI and lower-channel observations. To pretrain SpectralEarth-FM, we curate SpectralEarth-MM, a dataset that co-locates HSI from three spaceborne sensors (EnMAP, EMIT, DESIS) with Sentinel-2, Landsat-8/9 optical imagery, Landsat land surface temperature (LST), and Sentinel-1 SAR, over common geographic footprints. It comprises approximately 2M globally distributed locations, 25M georeferenced patches, and over 40TB of data. Pretraining uses a Joint-Embedding Predictive Architecture (JEPA)-style objective that matches representations between global views and single-sensor local views from the same location. We evaluate SpectralEarth-FM on hyperspectral downstream tasks and standard EO benchmarks following the PANGAEA protocol, achieving state-of-the-art results across both evaluation settings.

[CV-48] Q-ARVD: Quantizing Autoregressive Video Diffusion Models

链接: https://arxiv.org/abs/2605.21072
作者: Siao Tang,Xinyin Ma,Gongfan Fang,Xingyi Yang,Xinchao Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code: this https URL

点击查看摘要

Abstract:Autoregressive video diffusion models (ARVDs) have emerged as a promising architecture for streaming video generation, paving the way for real-time interactive video generation and world modeling. Despite their potential, the substantial inference cost of ARVDs remains a major obstacle to practical deployment, making model quantization a natural direction for improving efficiency. However, quantization for ARVDs remains largely unexplored. Our empirical analysis shows that directly applying existing quantization schemes developed for standard diffusion transformers to ARVDs leads to suboptimal performance, revealing quantization behaviors that differ from those observed in bidirectional diffusion models. In this paper, we identify two critical challenges in quantizing ARVDs: (C1) Highly unbalanced frame-wise quantization sensitivity. Error accumulation during autoregressive generation can induce severely skewed quantization sensitivity across frames, following an exponential-like decay pattern. (C2) Prominent and heterogeneous outlier patterns in weights. Weight distributions exhibit pronounced outlier channels, whose patterns vary substantially across layer types and block depths. To address these issues, we propose Q-ARVD, a novel framework for accurate ARVD quantization. (S1) To tackle the highly unbalanced frame-wise sensitivity, Q-ARVD incorporates a final-quality aware frame-weighting mechanism into the quantization objective. (S2) To prevent heterogeneous outliers from degrading performance, Q-ARVD introduces an outlier-aware adaptive dual-scale quantization, which automatically detects the presence and quantity of outlier channels for an arbitrary layer, and isolates them to protect normal channels. Extensive experiments demonstrate the superiority of Q-ARVD.

[CV-49] Grounding Driving VLA via Inverse Kinematics

链接: https://arxiv.org/abs/2605.21061
作者: Junsung Park,Hyunjung Shim
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Existing Driving VLAs predict trajectories while largely ignoring their visual tokens – a phenomenon we trace not to insufficient training but to a structurally ill-posed task formulation. We show that trajectory recovery, when viewed through the lens of inverse kinematics, requires both a current and a future visual state as boundary conditions; existing VLAs supply only the former, which encourages the model to shortcut through ego status and text commands alone. To address this, we re-design Driving VLA in the style of an inverse kinematics solver. First, a next visual state prediction objective that requires the LLM to predict the future visual scene provides dense visual supervision and suppresses shortcut paths. Second, a separate Inverse Kinematics Network (a cross-attention-based conditional diffusion model) that takes only the current and future visual states as input is designed to suppress reliance on ego status and textual shortcuts during trajectory decoding. With this simple prescription alone, our 0.5B-scale model recovers visual grounding and reaches trajectory planning performance comparable to 7B–8B VLAs more than an order of magnitude larger, on both the closed-loop NAVSIM-v2 and the nuScenes benchmarks. Extensive analysis further shows that this improvement stems from a recovered ability to exploit visual features, with the effect being most pronounced in dynamic driving situations such as turning.

[CV-50] Multimodal LLM s under Pairwise Modalities

链接: https://arxiv.org/abs/2605.21059
作者: Yan Li,Yunlong Deng,Yuewen Sun,Gongxu Luo,Kun Zhang,Guangyi Chen
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Despite the impressive results achieved by multimodal large language models (MLLMs), their training typically relies on jointly curated multimodal data, requiring substantial human effort to construct multi-way aligned datasets and thereby limiting scalability across domains. In this work, we explore training MLLMs by only leveraging multiple paired modalities as a surrogate for the full joint multimodal distribution. Specifically, we first provide a theoretical analysis of the conditions under which the representations are identifiable with only observing pairwise modalities. Building on this analysis, we propose a representation learning framework for aligning latent representations across modalities using only pairwise data. The framework consists of two stages: latent representation alignment and cross-modal recomposition. Specifically, in the first stage, we learn the shared latent space across modalities by both self-modal reconstruction and pair-wise contrastive learning. We also incorporate an inductive bias in the contrastive learning process by partially aligning and minimal latent specification. In stage two, we integrate the encoder of newly introduced modalities with the decoders of the pre-trained modalities to facilitate cross-modal transfer and generation. We evaluate our method by newly adding 3D point clouds and tactile modalities into pre-trained MLLMs with three modality pairs and show that, by learning an aligned latent representation space, our model achieves strong cross-modal performance.

[CV-51] Dynamic Video Generation: Shaping Video Generation Across Time and Space

链接: https://arxiv.org/abs/2605.21042
作者: Shikang Zheng,Jingkai Huang,Jiacheng Liu,Guantao Chen,Lixuan,Yuqi Lin,Peiliang Cai,Linfeng Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models have achieved impressive performance in video generation, but their iterative denoising process remains computationally expensive due to the large number of tokens processed at each timestep. Recently, progressive resolution sampling has emerged as a promising acceleration approach by reducing latent resolution in early stages. However, scaling this idea to video generation remains challenging, as the additional temporal dimension introduces diverse spatio-temporal demands across different videos, and compressing only a single dimension often leads to limited acceleration or degraded quality. Therefore, we propose DVG, a Dynamic Video Generation framework that jointly allocates computation across time and space, automatically selecting content-aware acceleration strategies without manual tuning or retraining. DVG achieves near-lossless acceleration across models and tasks, reaching up to 7 times speedup on HunyuanVideo and HunyuanVideo-1.5, and 18 times when combined with distillation, demonstrating its potential as a key component in today’s large-scale efficient video generation systems. Our code is in supplementary material and will be released on Github.

[CV-52] owards Physically Consistent 4D Scene Reconstruction for Closed-loop Autonomous Driving Simulation

链接: https://arxiv.org/abs/2605.21032
作者: Bowyn Tan,Yutong Xie,Bai Huang,Fan Luo,Xiao Li,Naizheng Wang,Yang Guan,Shengbo Eben Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 4 figures

点击查看摘要

Abstract:High-fidelity street scene reconstruction is pivotal for end-to-end autonomous driving simulation, where novel-view synthesis (NVS) and time-varying information modeling are two fundamental capabilities to facilitate closed-loop training. However, existing 3DGS methods and their 4D extensions fail to simultaneously achieve both. To bridge this gap, we establish an information-geometric diagnostic framework, revealing that this limitation stems from a credit assignment dilemma between spatial and temporal parameters. Specifically, the deterministic coupling between viewpoint and time in single-source observation creates a low-rank structure that induces massive null-space ambiguity between static view-dependent and dynamic time-varying components. Temporal information overshadows spatial cues, causing the estimation variance of spatial parameters to diverge. To address this issue, we propose Orthogonal Projected Gradient (OPG), a hierarchical training method designed to restore spatial identifiability. OPG prioritizes the integrity of spatial representations by securing them in an initial stage, then restricts temporal updates to the spatial null space, enabling proactive credit assignment. While OPG isolates temporal updates algebraically, Temporal Regularization Strategy is proposed to further refine the temporal solution space by imposing a smoothness constraint based on the physical prior of consistent appearance evolution, ensuring that the reconstructed scene remains physically consistent in closed-loop simulation. Extensive experiments demonstrate that our method not only maintains stable NVS capabilities but also demonstrates superior performance in traditional observation-reproducing metrics, which indirectly reflect the capability of modeling temporal dynamics.

[CV-53] DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation

链接: https://arxiv.org/abs/2605.21028
作者: Bo Ye,Xinyu Cui,Jian Zhao,Tong Wei,Min-Ling Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autoregressive long video generation often adopts bounded-memory streaming for efficiency, typically combining local windows for short-term continuity with static early-frame sinks as long-range anchors. However, this fixed allocation keeps early frames cached even when the current visual state has substantially diverged from them, while discarding potentially more relevant intermediate history. As a result, the retained long-range context may become less adaptive and bias generation toward outdated cues; in severe cases, RoPE-induced phase re-alignment can homogenize inter-head attention and cause sink collapse, where content regresses toward sink frames. We propose DySink, a retrieval-based framework that maintains a compact memory bank and selects visually relevant historical frames as dynamic frame sinks. DySink couples adaptive retrieval with a sink anomaly gate, which detects excessive inter-head consensus over retrieved context and suppresses collapse-prone context. Experiments on minute-long videos show that DySink consistently improves dynamic degree over strong baselines while also achieving higher temporal quality. The code and model weights will be released at this https URL.

[CV-54] LiteViLNet: Lightweight Vision-LiDAR Fusion Network for Efficient Road Segmentation

链接: https://arxiv.org/abs/2605.21007
作者: Daojie Peng,Bingtao Wang,Fulong Ma,Liang Zhang,Jun Ma
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Road segmentation is a fundamental perception task for autonomous driving and intelligent robotic systems, requiring both high accuracy and real-time inference, especially for deployment on resource-constrained edge devices. Existing multi-modal road segmentation methods often rely on heavy transformer-based encoders to achieve state-of-the-art performance, but their enormous computational cost prohibits real-time deployment on embedded platforms. To address this dilemma, we propose \textbfLiteViLNet, a lightweight multi-modal network that fuses RGB texture information and LiDAR geometric information for efficient road segmentation. Specifically, we design a dual-stream lightweight encoder and depth-wise separable convolutions to extract hierarchical features from both modalities with minimal parameters. We further propose a Multi-Scale Feature Fusion Module (MSFM) to facilitate cross-modal interaction at different levels, and a large-kernel-bridge module to capture long-range dependencies with linear complexity. Extensive experiments on the KITTI Road dataset and real-world applications demonstrate that LiteViLNet achieves a promising balance between accuracy and efficiency. Notably, with only 14.04M parameters, our model attains a 96.36% MaxF score, ranking the best among all CNN-based methods and being comparable to larger transformer-based models, and runs at 163.79 FPS in model-only inference on RTX 4060 Ti (22.18 FPS on Jetson Orin NX). It outperforms numerous heavy-weight methods in inference speed while maintaining highly competitive accuracy, fully validating the potential of LiteViLNet for real-time embedded deployment in autonomous driving and intelligent robotics.

[CV-55] Verifiable Provenance and Watermarking for Generative AI: An Evidentiary Framework for International Operational Law and Domestic Courts

链接: https://arxiv.org/abs/2605.21002
作者: Gustav Olaf Yunus Laitinen-Fredriksson Lundström-Imanov,Nurana Abdullayeva
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Multimedia (cs.MM)
备注: 13 pages, 4 figures, 10 tables. Submitted to IEEE Transactions on Information Forensics and Security

点击查看摘要

Abstract:Generative artificial intelligence now synthesizes photorealistic imagery, audio, and video at a cost that defeats traditional forensic intuition. The legal consequences span three regimes studied so far in isolation: international operational law, domestic procedure, and product regulation. This article presents a unified evidentiary framework that maps cryptographic content provenance, robust statistical watermarking, and zero knowledge attestation to the proof requirements of each regime. We define a five tier threat model spanning naive regeneration, adversarial laundering, cross model regeneration, active watermark removal, and insider provenance forgery. We release a public benchmark of 12000 generated items across image, audio, and video modalities under six laundering pipelines for 72000 evaluation samples. We evaluate four representative schemes and report true positive rate at fixed false positive rate, robustness area under the curve, computational overhead, and a regime conditioned legal sufficiency score. We translate empirical detection bounds into legal sufficiency thresholds for command decisions under the law of armed conflict, for criminal and civil admissibility under domestic procedure, and for persistence audits under the European Union Artificial Intelligence Act and analogous regimes. The result is a reproducible reference pipeline, a public benchmark, and model annexes that lawyers, engineers, and operators can deploy together.

[CV-56] DAMA: Disentangled Body-Anchored Gaussians for Controllable Multi-Layered Avatars

链接: https://arxiv.org/abs/2605.21001
作者: Daniel Eskandar,Berna Kabadayi,Garvita Tiwari,Gerard Pons-Moll
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing 3D clothed avatar reconstruction methods achieve high visual fidelity but ignore geometric structure and physical plausibility. They either model clothed humans as a single deformable surface or attempt garment disentanglement without enforcing geometric constraints, resulting in ambiguous garment boundaries and no control over stacking or layer ordering. To address these limitations, we introduce DAMA (Disentangled body-Anchored Gaussians for Controllable Multi-layered Avatars), a 3D avatar reconstruction method that produces physically plausible clothed avatars through a dedicated representation and reconstruction method. At the representation level, we bind Gaussians to SMPL-X faces using barycentric in-plane coordinates and a positive normal offset. Based on this parameterization, the reconstruction method lifts 2D segmentations to body-anchored Gaussians, refines layers using topology-guided correction, and jointly optimizes geometry and appearance. DAMA is the first Gaussian avatar reconstruction method from multi-view images to achieve physically plausible layering, clean garment separation, and explicit stacking control. On the full 4D-DRESS dataset (82 scans), it achieves state-of-the-art performance in geometry reconstruction, garment separation, penetration rate, and penetration depth. The representation further supports user-defined garment reordering and fast conversion of body-conforming garments to simulation-ready meshes. Project Page: this https URL

[CV-57] Hybrid Machine Learning Model for Forest Height Estimation from TanDEM-X and Landsat Data

链接: https://arxiv.org/abs/2605.20997
作者: Islam Mansour,Ronny Haensch,Irena Hajnsek,Konstantinos Papathanassiou
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
备注:

点击查看摘要

Abstract:Integrating machine learning (ML) with physical models (PM) has emerged as a promising way of retrieving geophysical parameters from remote sensing data. In this context, a ML model for estimating forest height from TanDEM-X interferometric coherence measurements has recently been proposed, that constrains the learning process through a PM. While the features used for training and inversion where selected to ensure the physical consistency of the solutions, they could not resolve all height / structure and baseline / terrain slope ambiguities in the data. To improve this, the extension of the feature space with optical Landsat data is proposed able to provide complementary information on forest type or structure. The extended model is applied and validated on several TanDEM-X acquisitions over the Gabonese Lopé national park site and assessed against airborne LiDAR measurements. Results show a 13.5% reduction in RMSE and a 16.6% reduction in MAE compared to the original hybrid model, confirming the added value of multispectral inputs.

[CV-58] CHOIR: Contact-aware 4D Hand-Object Interaction Reconstruction

链接: https://arxiv.org/abs/2605.20992
作者: Hao Xu,Yilin Liu,Yinqiao Wang,Chi-Wing Fu,Niloy J. Mitra
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We ask whether everyday open-world monocular videos can be turned into reusable 4D interaction primitives: articulated hand motion, object shape with 6D pose over time, and the when/where of contact. Such a capability would enable scalable mining of real interactions and, beyond reconstruction, support scene-aware synthesis and planning. However, reconstructing hand-object interaction (HOI) from challenging monocular videos remains difficult: methods often assume known objects or curated scenes, and separately estimated hands and objects easily become misaligned under clutter, occlusion, and unseen object geometries. Targeting this setting, we present CHOIR, a Contact-aware HOI Reconstruction framework for a monocular camera, using contact as an explicit coupling signal between hands and objects. CHOIR first initializes a coarse, contact-agnostic 4D HOI sequence from open-world visual priors. It then introduces a generative HOI spatial rectification module to predict ray-depth corrections and rectify hand-object relative placement, then derive initial per-frame contact correspondences on the rectified geometry. Last, a contact-aware joint optimization with dynamically updated contact constraints enforces geometric, temporal, and contact consistency. Experiments on controlled and challenging videos show that CHOIR improves object reconstruction, physical plausibility, and temporal consistency over state-of-the-art methods.

[CV-59] owards Integrated Rock Support Visualisation in 3D Point Cloud of Underground Mines

链接: https://arxiv.org/abs/2605.20973
作者: Dibyayan Patra,Simit Raval,Pasindu Ranasinghe,Bikram Banerjee,Ismet Canbulat
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The effectiveness of rock support in underground mines depends on the interaction between installed rock bolts and the structural fabric of the surrounding rock mass. However, discontinuity characterisation and rock bolt identification are commonly treated as separate tasks, limiting their value for integrated support assessment. This study presents an automated framework for integrated rock support visualisation using 3D point clouds of underground mine excavations. The framework integrates structure mapping, rock bolt identification, discontinuity plane fitting, and bolt orientation estimation into a unified workflow optimised for accuracy and computational efficiency. The outputs are used to generate an integrated 3D visualisation of fitted discontinuity planes and bolt vectors, enabling direct assessment of their spatial intersections and geometric relationships. A complementary stereographic analysis of discontinuity poles and bolt orientations is also performed to evaluate overall bolting geometric effectiveness relative to the mapped structural fabric. Additionally, bolt-level quality metrics, including exposed protrusion length and deviation from the local roof normal, are visualised to support assessment of installation quality. The proposed framework is demonstrated on real underground metal mine scans, producing accurate structure mapping and rock bolt identification results in medium-scale point clouds. Overall, the study provides a practical step towards automated, integrated geotechnical assessment of rock support effectiveness without requiring manual measurements or additional in-situ data acquisition.

[CV-60] Comparative Evaluation of Deep Learning Models for Fake Image Detection

链接: https://arxiv.org/abs/2605.20971
作者: Akhitha Pakala,Mohammed Mahir Rahman,Shahzad Memon,Tauseef Ahmed
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Accepted at ICCIIoT26 and waiting to be indexed

点击查看摘要

Abstract:The growing sophistication of GAN-based image manipulation presents significant challenges for digital forensics. This study compares the performance of four pretrained CNN architectures including VGG16, ResNet50, EfficientNetB0, and XceptionNet for fake image detection using a unified preprocessing and training pipeline. A dataset of real and manipulated images was processed through resizing, normalization, and augmentation to address class imbalance and improve generalization. Models were evaluated using Accuracy, Precision, Recall, F1-score, and ROC-AUC. VGG16 achieved the highest accuracy at 91%, with XceptionNet, ResNet50, and EfficientNetB0 each reaching 90%. EfficientNetB0 showed stronger sensitivity to fake images but reduced reliability on real samples, reflecting imbalance-driven bias. Limitations include dataset imbalance, overfitting, and limited interpretability, which affect cross-domain robustness. The study provides a reproducible baseline and underscores the need for balanced datasets, advanced augmentation, and fairness-aware training to develop reliable fake image detection systems.

[CV-61] Finding the Correct Visual Evidence Without Forgetting: Mitigating Hallucination in LVLMs via Inter-Layer Visual Attention Discrepancy ICML2026

链接: https://arxiv.org/abs/2605.20965
作者: Yutong Xie,Zhenglin Hua,Ran Wang,Wing W. Y. Ng,Xizhao Wang,Yuheng Jia
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ICML 2026

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have shown remarkable performance on a wide range of vision-language tasks. Despite this progress, they are still prone to hallucination, generating responses that are inconsistent with visual content. In this work, we find that LVLMs tend to hallucinate when they pay insufficient attention to the correct visual evidence and gradually forget it during the generation process. We empirically find that although LVLMs overall attend insufficiently to visual evidence, they exhibit sensitivity to the correct visual evidence in specific layers, with notable inter-layer discrepancy. Motivated by this observation, we propose a novel hallucination mitigation method that enhances visual evidence based on Inter-Layer Visual Attention Discrepancy (ILVAD). Specifically, we obtain the attention weights from early generated tokens to visual tokens across layers and identify the tokens that are repeatedly activated as visual evidence, forming a saliency map. We then enhance attention to visual evidence during generation through the saliency map to reduce visual forgetting. In addition, we leverage the saliency map to obtain attention scores of generated text to visual evidence, in order to select and emphasize text tokens that are strongly grounded in visual evidence. Our method is training-free and plug-and-play. Multiple benchmark evaluations conducted on five recently released models show that our method can consistently mitigate hallucinations in different LVLMs over various architectures. Code is available at this https URL.

[CV-62] owards UAV Detection in the Real World: A New Multispectral Dataset UAVNet-MS and a New Method

链接: https://arxiv.org/abs/2605.20963
作者: Yihang Luo,Jun Chen,Chao Xiao,Yingqian Wang,Zhaoxu Li,Qiang Ling,Xu He,Nuo Chen,Gaowei Guo,Hongge Li,Miao Li,Longguang Wang,Yulan Guo,Li Liu,Wei An,Zhijie Chen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

点击查看摘要

Abstract:The proliferation of unmanned aerial vehicles (UAVs) has created urgent demand for precise UAV monitoring. Existing RGB-based systems rely on spatial cues that degrade at small scales, particularly with high inter-type similarity, target-clutter ambiguity, and low contrast. Multispectral imaging (MSI) encodes material-aware spectral signatures, yet MSI-based fine-grained small-UAV detection remains underexplored due to lack of dedicated datasets. We introduce UAVNet-MS, the first multispectral dataset for fine-grained small-UAV detection, comprising 15,618 temporally synchronized RGB-MSI data cubes (1440x1080) with bounding box annotations. The dataset features challenging small objects (93.7% = 32^2 pixels, average 18^2 pixels, ~0.02% image area) under low contrast. We propose MFDNet, a dual-stream baseline addressing array-induced parallax and spatial-spectral fusion. Extensive evaluation under RGB-only, MSI-only, and RGB+MSI protocols against 20 detectors shows MFDNet achieves +6.2% AP50 improvement over best RGB-only methods, demonstrating spectral cues provide complementary material evidence beyond spatial cues. This work provides foundational dataset, strong baseline, and benchmark for multispectral UAV monitoring research.

[CV-63] Preserve Reveal Expand: Faithful 4D Video Editing with Region-Aware Conditioning

链接: https://arxiv.org/abs/2605.20961
作者: Zhangchi Hu,Wenzhang Sun,Xiangchen Yin,Jiahui Yuan,Chunfeng Wang,Hao Li,Kun Zhan,Xiaoyan Sun
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 13 figures

点击查看摘要

Abstract:Existing 4D-driven video diffusion models primarily target plausible generation, but faithful 4D editing requires preserving source-observed regions while synthesizing disoccluded or out-of-view content. We identify Evidence-Role Mismatch: reliable source-backed evidence, unreliable rendered cues, and unsupported regions are entangled in a single conditioning signal, causing preservation drift, ghosting, and unstable extrapolation. We propose PREX (Preserve, Reveal, Expand), a region-aware framework that decomposes the target spatiotemporal volume into Preserve, Reveal, and Expand roles according to observation support and scene extent. PREX builds observation-backed appearance cues with calibrated confidence and injects them into a frozen video diffusion backbone through a region-aware adapter, trained with proxy tasks without requiring paired edited videos. We further introduce PREBench, a diagnostic benchmark with curated edits, region-role masks, and human-aligned metrics that complement global video-quality and 4D-control evaluations. Experiments show that PREX reduces region-structured failures while maintaining strong visual quality and 4D edit control capability. Project Page: this https URL

[CV-64] DrawMotion: Generating 3D Human Motions by Freehand Drawing

链接: https://arxiv.org/abs/2605.20955
作者: Tao Wang,Lei Jin,Zhihua Wu,Qiaozhi He,Jiaming Chu,Yu Cheng,Junliang Xing,Jian Zhao,Shuicheng Yan,Li Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-to-motion generation, which translates textual descriptions into human motions, faces the challenge that users often struggle to precisely convey their intended motions through text alone. To address this issue, this paper introduces DrawMotion, an efficient diffusion-based framework designed for multi-condition scenarios. DrawMotion generates motions based on both a conventional text condition and a novel hand-drawing condition, which provide semantic and spatial control over the generated motions, respectively. Specifically, we tackle the fine-grained motion generation task from three perspectives: 1) freehand drawing condition. To accurately capture users’ intended motions without requiring tedious textual input, we develop an algorithm to automatically generate hand-drawn stickman sketches across different dataset formats; 2) multi-condition fusion. We propose a Multi-Condition Module (MCM) that is integrated into the diffusion process, enabling the model to exploit all possible condition combinations while reducing computational complexity compared to conventional approaches; and 3) training-free guidance. Notably, the MCM in DrawMotion ensures that its intermediate features lie in a continuous space, allowing classifier-guidance gradients to update the features and thereby aligning the generated motions with user intentions while preserving fidelity. Quantitative experiments and user studies demonstrate that the freehand drawing approach reduces user time by approximately 46.7% when generating motions aligned with their imagination. The code, demos, and relevant data are publicly available at this https URL.

[CV-65] Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models

链接: https://arxiv.org/abs/2605.20950
作者: Yulin Zhao,Yun Wang,Dehua Zheng,Borui jiang,Zheng Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) face a bottleneck of prohibitive computational costs arising from massive visual token sequences during inference. Existing vision token reduction methods alleviate this burden, but they unintentionally preserve the isolated visual subject strictly aligned with the user’s query, which fails to substantially explore salient subjects and their contextual relationships. In this paper, we propose SPpruner, a subject-centric progressive reduction paradigm that emulates the \textitFocus-then-Context mechanism of the human visual perception system. Specifically, we first construct a focus identification module to explicitly model the interplay between visual saliency and semantic relevance. Herein, it can excavate the comprehensive visual subject spectrum to ensure a high-fidelity representation of visual input. Subsequently, a context-aware structural scanning module is developed to aggregate contextual cues from neighboring regions. As such, it can effectively restore global relational dependencies to uphold the structural integrity of the preserved subjects. Extensive experiments demonstrate that our paradigm consistently outperforms SOTA methods, achieving up to 2.53 times speedup with only 22.2% of visual tokens retained in Qwen2.5-VL and a 67% FLOPs reduction on LLaVA with a negligible 0.6% accuracy drop.

[CV-66] Bridging Structure and Language: Graph-Based Visual Reasoning for Autonomous Road Understanding

链接: https://arxiv.org/abs/2605.20942
作者: Lena Wild,Katie Z Luo,Marco Pavone
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Structured road understanding of lane geometry, topology, and traffic element relationships is foundational to safe autonomous driving. While vision-language models (VLMs) offer promising semantic flexibility, they lack the geometric and relational grounding required for precise road reasoning. Conversely, traditional modular systems, e.g., HD maps and topological road graphs, provide structural precision but remain semantically rigid. To bridge this gap, we introduce the Combined Road Substrate (CRS), a graph-grounded framework that makes geometric road structure and open-vocabulary semantics jointly executable in a single representation. CRS enables the automatic generation of compositionally complex and linguistically varied question-answer pairs via recursive graph queries, augmented with a “grounding for free” mechanism that ensures logical traceability to specific map elements, and procedurally extracted chain-of-thought supervision traces. We demonstrate that state-of-the-art VLMs - including large, closed-source models - struggle significantly with structured road reasoning, yet training a small 2- or 4-billion-parameter model with as few as 20 to 80 CRS-enriched scenes yields stable gains in compositional reasoning tasks of varying depth. Analysis of model behavior via verifiable reasoning traces reveals a systematic shift in failure modes: whereas baseline models fail at relational scene understanding, CRS-trained models reduce failures to attribute recognition, suggesting that the primary bottleneck in road understanding is not model scale, but the absence of structured supervision.

[CV-67] 3D Reconstruction and Knowledge Distillation to Improve Multi-View Image Models to Explore Spike Volume Estimation in Wheat

链接: https://arxiv.org/abs/2605.20940
作者: Olivia Zumsteg,Jannis Widmer,Yann Bourdé,Norbert Kirchgessner,Andreas Hund,Lukas Roth,Paraskevi Nousi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 6 figures (Appendix: 4 pages, 5 figures)

点击查看摘要

Abstract:Accurate estimation of wheat spike volume is important for yield component analysis and stress resilience assessment, yet field-based measurement remains challenging. Active 3D sensing methods such as Light Detection and Ranging (LiDAR) or time-of-flight (ToF) are sensitive to plant motion or poorly suited to outdoor conditions, while 3D reconstructions are computationally expensive. Direct 2D image processing would offer computational advantages, but image-based models lack explicit geometric information. We therefore propose a hybrid 2D-3D approach with knowledge distillation during training while enabling efficient image-only inference. First, we train a rigid-invariant point cloud network using distance-based histogram features to obtain pose-robust geometric representations. We then combine the 3D model with a proposed multi-view image-based regulated Transformer (RT) in an ensemble architecture. Finally, we distill the ensemble knowledge into a purely image-based student model using either feature-based or label-based distillation. The two distilled RTs reduce the mean absolute error (MAE) from 654.31 mm ^3 of the non-distilled RT to 639.93 mm ^3 and 644.62 mm ^3 , and increase correlation from 0.76 to 0.77 and 0.82, respectively. At the same time, inference time is reduced from 160 ms to 1.4 ms per spike. Distillation further mitigates volume-dependent bias and reshapes the latent representation of the image model toward a geometry-aware shape. Our results demonstrate that 3D-informed training of a 2D Transformer allows for scalable and efficient spike volume estimation for high-throughput field phenotyping.

[CV-68] Winfree Oscillatory Neural Network

链接: https://arxiv.org/abs/2605.20922
作者: Jiawen Dai,Yue Song
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Oscillations and synchronization are widely believed to play a fundamental role in representation and computation. However, existing machine learning approaches based on synchronization dynamics have largely been confined to specialized settings such as object discovery, with limited evidence of scalability to standard vision benchmarks or logic reasoning tasks. We propose the Winfree Oscillatory Neural Network (WONN), a dynamical neural architecture based on generalized Winfree dynamics. WONN evolves representations on the torus (S^1)^d through structured oscillatory interactions, combining phase-based inductive biases with flexible and hierarchical interaction mechanisms instantiated as either fixed trigonometric mappings or learnable neural networks. We evaluate WONN on image recognition and complex reasoning tasks, including CIFAR, ImageNet, Maze-hard, and Sudoku. Across these domains, WONN achieves competitive or superior performance with strong parameter efficiency. In particular, WONN is, to our knowledge, the first synchronization-based oscillatory architecture to scale competitively to ImageNet-1K. Furthermore, on Maze-hard, WONN achieves 80.1% accuracy using only 1% of the parameters of prior state-of-the-art models. These results suggest that structured oscillatory dynamics provide a scalable and parameter-efficient alternative to conventional neural architectures.

[CV-69] RISE: Reliable Improvement in Self-Evolving Vision-Language Models

链接: https://arxiv.org/abs/2605.20914
作者: Chaoran Xu,Yingmao Miao,Pengfei Zhang,Hao Dou,Lei Sun,Xiangxiang Chu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) have achieved strong multimodal reasoning capabilities, but further improving them still relies heavily on large-scale human-constructed supervision for post-training. Such supervision is costly to obtain, especially for reasoning-intensive multimodal tasks where questions, answers, and feedback signals must be carefully designed. This motivates self-evolving learning, where a model improves itself through a dual-role closed loop: a questioner autonomously poses questions and a solver learns to solve them. However, we observe that current VLM self-evolving methods still face three major challenges: coarse-grained role alternation delays the interaction between question generation and solver adaptation; generated questions can progressively degrade in quality; and question types may collapse toward a narrow distribution. These issues limit the efficiency and reliability of self-evolution. Thus, we propose \textbfRISE, a reliable self-evolving framework for vision-language models. RISE is built on three complementary designs: fine-grained role alternation, which shortens the feedback loop between the questioner and the solver to improve efficiency; a quality supervisor, which improves question validity and pseudo-label reliability; and skill-aware dynamic balancing, which mitigates mode collapse and maintains broad skill coverage during evolution. Together, these components enable more reliable and effective self-evolution from unlabeled images. Experiments on two VLM backbones across seven benchmarks show that RISE consistently improves the base models, yielding broad and sustained gains. Our code is publicly available at this https URL.

[CV-70] FlowLong: Inference-time Long Video Generation via Manifold-constrained Tweedie Matching

链接: https://arxiv.org/abs/2605.20910
作者: Jangho Park,Geon Yeong Park,Gihyun Kwon,Jong Chul Ye
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Extending the generation horizon of video diffusion models to long sequences remains a long-standing and important challenge. Existing training-free approaches fall into two categories: extensions of bidirectional models, which are tightly coupled to specific architectures and suffer from quality degradation over long horizons, and autoregressive models, which accumulate drift errors due to exposure bias and tend to produce repetitive motion patterns. To address these issues, we propose a novel but simple inference-time approach for long video generation that is architecture-agnostic and requires no additional training. Our method generates long videos via overlapping sliding windows, where predicted clean samples from adjacent windows are blended via \emphTweedie matching to enforce both \textbfmanifold constraint and temporal consistency across overlap regions. \emphStochastic early-phase sampling then synchronizes per-window trajectories by injecting fresh noise after each Tweedie matching correction in the high-noise phase, before transitioning to deterministic ODE sampling to preserve fine-grained visual fidelity. Applied to various video generation models, our method generates videos several times longer than the native window length while outperforming both training-free and autoregressive baselines in temporal consistency and visual quality, and further extends to audio-video joint generation and text-to-3DGS without any fine-tuning.

[CV-71] SynCB: A Synergy Concept-Based Model with Dynamic Routing Between Concepts and Complementary Neural Branches

链接: https://arxiv.org/abs/2605.20908
作者: Tores Julie,Sun Rémy,Sassatelli Lucile,Ancarani Elisa,Wu Hui-Yin,Precioso Frédéric
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Concept-based (CB) models provide interpretability and support test-time human intervention, while standard neural networks (NN) offer strong task performance but little transparency. Prior work has explored hybrid formulations that integrate concepts and additional representations to improve accuracy, often at the cost of human interventions. We introduce the \emphSynergy Concept-Based Model (SynCB) framework, that combines a CB branch with a complementary neural branch, and a trainable routing module that dynamically selects which branch to use for each input. Unlike prior models, which fuse residual and concept-based predictions, SynCB keeps the two branches distinct and coordinates them through the routing module. Moreover, both branches are learned jointly, allowing information sharing between the complementary neural branch and CB branches through their common backbone. To improve responsiveness to interventions, we further introduce a test-time intervention policy and a corresponding loss. Across five datasets and CB benchmarks, SynCB consistently achieves higher task accuracy while remaining more responsive to human interventions, surpassing the full neural baseline by up to 3.9 percentage points and exceeding the strongest competitor in intervention performance by up to 6.43 percentage points.

[CV-72] JFAA: Technical Report for the EPIC-KITCHENS-100 Action Anticipation Challenge at EgoVis 2026 ICIP CVPR

链接: https://arxiv.org/abs/2605.20904
作者: Qiaohui Chu,Haoyu Zhang,Yisen Feng,Meng Liu,Weili Guan,Dongmei Jiang,Liqiang Nie
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The champion solution for the EPIC-KITCHENS-100 Action Anticipation Challenge at the CVPR EgoVis Workshop 2026

点击查看摘要

Abstract:We propose JFAA, a JEPA-based Future Action Anticipation method for the EPIC-KITCHENS-100 (EK-100) Action Anticipation task. Inspired by the representation learning and future prediction ability of V-JEPA 2.1, JFAA uses a frozen encoder and predictor to extract observed context features and near-future latent tokens. A lightweight attentive probe is then trained to predict verb, noun, and action logits with separate task queries. To improve robustness, we further build a field-aware ensemble over selected epoch-level predictions, allowing each output field to benefit from its most reliable candidates. Experimental results on the official challenge server show that JFAA achieves first place in the EgoVis 2026 EK-100 Action Anticipation Challenge. Our code will be released at this https URL.

[CV-73] VISTA: Technical Report for the Ego4D Short-Term Object Interaction Anticipation at EgoVis 2026 ICIP CVPR

链接: https://arxiv.org/abs/2605.20901
作者: Qiaohui Chu,Haoyu Zhang,Yisen Feng,Meng Liu,Weili Guan,Dongmei Jiang,Liqiang Nie
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: The champion solution for the Ego4D Short-Term Object Interaction Anticipation Challenge at the CVPR EgoVis Workshop 2026

点击查看摘要

Abstract:We propose VISTA, a V-JEPA Integrated StillFast Temporal Anticipator for the Ego4D Short-Term Object Interaction Anticipation (STA) Challenge at EgoVis 2026. Given an egocentric video timestamp, the task requires anticipating the next human-object interaction, including the future active object’s bounding box, noun category, verb category, time-to-contact, and confidence score. VISTA follows a StillFast-style design that combines object-centric spatial detection with short-horizon temporal context. Specifically, a COCO-pretrained Faster R-CNN ResNet-50 FPN detector generates object proposals from the last observed high-resolution frame, while a frozen V-JEPA 2.1 temporal branch extracts clip-level egocentric context from the observed video. The temporal representation is injected into the detection pathway through feature modulation and ROI-level context fusion. The fused proposal features are then passed to multi-head STA predictors for box refinement, noun classification, verb classification, time-to-contact regression, and interaction confidence estimation. For the final submission, we further ensemble complementary predictions to improve robustness. Experimental results on the official challenge server show that VISTA achieves first place in the EgoVis 2026 Ego4D STA Challenge. Our code will be released at this https URL.

[CV-74] FruitEnsemble: MLLM -Guided Arbitration for Heterogeneous ensemble in Fine-Grained Fruit Recognition CVPR2026

链接: https://arxiv.org/abs/2605.20892
作者: Enhui Yu,Junhui Li,Ruitong Lu,Jialu Li,Youshan Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages,6 figures,submitted to CVPR 2026

点击查看摘要

Abstract:Fine-grained fruit classification is a critical yet challenging task in agricultural computer vision, primarily hindered by a severe shortage of high-quality datasets and the high visual similarity between classes. To address these challenges, we first constructed a comprehensive dataset comprising 306 fruit categories with 116,233 samples. Moreover, we propose FruitEnsemble, a practical two-stage dynamic inference framework designed to overcome the generalization limitations of static single-model architectures. In the first stage, FruitEnsemble employs a validation-calibrated weighted ensemble of heterogeneous backbones to generate a robust Top-3 candidate pool. To tackle difficult samples, we introduce an expert arbitration mechanism: when ensemble confidence falls below 0.6, a multimodal large language model (MLLM) is triggered to perform rigorous visual verification by integrating external botanical descriptions using Chain-of-Thought (CoT) reasoning. Furthermore, we optimized the training pipeline with a hard sample-aware joint loss. Extensive experiments demonstrate that FruitEnsemble achieves a classification accuracy of 70.49% and outperforms existing state-of-the-art models. Our framework provides an efficient, deployment-oriented solution for real-world agricultural visual sorting and quality inspection tasks.

[CV-75] HDMoE: A Hierarchical Decoupling-Fusion Mixture-of-Experts Framework for Multimodal Cancer Survival Prediction KDD2026

链接: https://arxiv.org/abs/2605.20891
作者: Huayi Wang,Haochao Ying,Yuyang Xu,Qiyao Zheng,jun wang,Cheng Zhang,Ying Sun,Jian Wu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, HDMoE has been accepted by KDD 2026 AI for Sciences Track

点击查看摘要

Abstract:Multimodal survival prediction, a crucial yet challenging task, demands the integration of multimodal medical data (\eg Whole Slide Images (WSIs) and Genomic Profiles) to achieve accurate prognostic modeling. Given the inherent heterogeneity across modalities, the feature decoupling-fusion paradigm has emerged as a dominant approach. However, these methods have the following shortcomings: (1) fail to reduce the redundant information of modality features before decoupling, which negatively affects the feature decoupling and fusion effect;(2) lack the ability to model the fine-grained relationships of the features and capture the local information interactions between intra- and inter-modality features. To address these issues, we propose a \underlineHierarchical \underlineDecoupling-Fusion \underlineMixture-\underlineof-\underlineExperts (HDMoE) framework with two levels of MoE and \underlineRandom \underlineFeature \underlineReorganization (RFR) this http URL the first-level MoE, shared experts and routed experts are employed to remove redundant information and extract fine-grained specific features within each modality, while the second-level MoE facilitates fine-grained inter-modality feature decoupling. Besides, we design two RFR modules following each level of MoE to finely fuse intra- and inter-modality features, which can help the model capture more fine-grained relationships between modalities. Extensive experimental results on our private Liver Cancer (LC) and three TCGA public datasets confirm the effectiveness of our proposed method. Codes are available at this https URL.

[CV-76] Map-Mono-Ego: Map-Grounded Global Human Pose Estimation from Monocular Egocentric Video ICIP2026

链接: https://arxiv.org/abs/2605.20889
作者: Hiroyuki Deguchi,Ryosuke Hori,Kotaro Amaya,Tsubasa Maruyama,Mitsunori Tada,Hideo Saito
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICIP 2026, Project page: this https URL

点击查看摘要

Abstract:Monocular egocentric human pose estimation is essential for ubiquitous activity monitoring. However, understanding the user’s absolute location within the environment remains a challenge. Existing methods primarily focus on relative motion from an initial position, and tend not to account for the wearer’s absolute location within an environment. Furthermore, inherent scale ambiguity in monocular vision leads to severe translational drift, limiting long-term tracking without specialized multi-sensor hardware. To address this, we propose MapMonoEgo, a novel framework achieving globally consistent human pose estimation solely from a monocular camera by leveraging a pre-scanned 3D point cloud. We also introduce AIST-Living dataset, a new dataset pairing egocentric video with ground-truth motion in a scanned environment. Experiments demonstrate that our approach significantly outperforms the state-of-the-art baseline, proving its utility for practical monitoring tasks without specialized hardware.

[CV-77] Activation-Free Backbones for Image Recognition: Polynomial Alternatives within MetaFormer-Style Vision Models ICML2026

链接: https://arxiv.org/abs/2605.20839
作者: Jeffrey Wang,Jonathan Gregory,Grigorios G. Chrysos
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to ICML 2026

点击查看摘要

Abstract:Modern vision backbones treat pointwise activations (e.g., ReLU, GELU) and exponential softmax as essential sources of nonlinearity, but we demonstrate they are not required within MetaFormer-style vision backbones. We design activation-free polynomial alternatives for three core primitives (MLPs, convolutions, and attention), where Hadamard products replace standard nonlinearities to yield polynomial functions of the input. These modules integrate seamlessly into existing architectures: instantiated within MetaFormer, a modular framework for vision backbones, our PolyNeXt models match or exceed activation-based counterparts across model scales on ImageNet classification, ADE20K semantic segmentation, and out-of-distribution robustness. We also substantially outperform prior polynomial networks at reduced computational cost, showing that polynomial variants of standard modules beat complex custom architectures.

[CV-78] USV: Towards Understanding the User-generated Short-form Videos

链接: https://arxiv.org/abs/2605.20838
作者: Haoyue Cheng,Su Xu,Liwei Jin,Wayne Wu,Chen Qian,Limin Wang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Several large-scale video datasets have been published these years and have advanced the area of video understanding. However, the newly emerged user-generated short-form videos have rarely been studied. This paper presents USV, the User-generated Short-form Video dataset for high-level semantic video understanding. The dataset contains around 224K videos collected from UGC platforms by label queries without extra manual verification and trimming. Although video understanding has achieved plausible improvement these years, most works focus on instance-level recognition, which is not sufficient for learning the representation of the high-level semantic information of videos. Therefore, we further establish two tasks: topic recognition and video-text retrieval on USV. We propose two unified and effective baseline methods Multi-Modality Fusion Network (MMF-Net) and Video-Text Contrastive Learning (VTCL), to tackle the topic recognition task and video-text retrieval respectively, and carry out comprehensive benchmarks to facilitate future research. Our project page is this https URL.

[CV-79] ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models

链接: https://arxiv.org/abs/2605.20837
作者: Qirui Shen,Wenda Wang,Jiachen Lu,Zilong Huang,Jin Bai,Lei He,Hongxuan Chen,Weixin Huang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 51 pages

点击查看摘要

Abstract:Architectural spatial intelligence, the ability to recognize and infer architectural space, is fundamental to tasks such as robot navigation, embodied interaction, and 3D scene understanding and generation. Although extensive research has evaluated the basic spatial skills of Vision-Language Models (VLMs) such as relative orientation, distance comparison, and object counting, these tasks cover only the most elementary levels of spatial cognition and largely overlook higher-level cognition of architectural space, including layout understanding, circulation patterns, and functional zoning. In this work, we present ArchSIBench, a Benchmark for Architectural Spatial Intelligence based on the perspectives from architecture, cognitive science, and psychology. ArchSIBench covers five core dimensions: perception, reasoning, navigation, transformation, and configuration, comprising 17 fine-grained subtasks. Through careful manual annotation by experts with architectural backgrounds, we construct 3,000 question-answer pairs to enable comprehensive evaluation of architectural spatial intelligence. Based on ArchSIBench, we evaluate various VLMs and find that the architectural spatial intelligence of most models shows significant differences from human baselines; additionally, models exhibit substantial variability across capability dimensions. Some state-of-the-art models can approach the level of human evaluators without architectural training. However, a clear gap remains compared to human evaluators with architectural training, particularly in spatial transformation and configuration reasoning. We believe that ArchSIBench will provide important insights and systematic resources for measuring and advancing the architectural spatial intelligence of VLMs. The dataset and code are available at this https URL.

[CV-80] HyDAR-Pano3D: A Hybrid Disentangled Anatomical Recovery Framework for Panoramic-to-3D Reconstruction

链接: https://arxiv.org/abs/2605.20827
作者: Yaoyao Yue,Jérôme Schmid,Xiaoshuang Li,Eduardo Delamare,Jinman Kim
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages

点击查看摘要

Abstract:Panoramic radiograph (PR) is fundamentally used in routine dental care, but it inherently provides only a two-dimensional (2D) projection of complex three-dimensional (3D) craniofacial anatomy. Most existing learning-based methods attempt to computationally recover this 3D information by directly regressing native cone-beam computed tomography (CBCT) volumes from PR. However, this direct mapping requires the model to simultaneously learn common anatomical structures and patient-specific morphological variations. This entangled formulation makes the ill-posed 2D-to-3D inverse problem highly ambiguous, often producing over-smoothed reconstructions with blurred anatomical boundaries. To address this, we propose HyDAR-Pano3D, a two-stage framework that reformulates PR-to-CBCT reconstruction as a disentangled anatomical recovery problem. In Stage 1, a dual-encoder network integrates radiographic features with SAM-derived semantic priors to reconstruct an arch-normalized canonical volume. In Stage 2, an Anatomical Restoration Network predicts a prior-constrained structured deformation field to map this canonical volume back to the native space, restoring individual morphological variations. Experiments on three large-scale datasets show that HyDAR-Pano3D significantly outperforms baseline methods ( p 0.05 ), achieving a 25.76 dB PSNR, 85.70% SSIM, and an 83.83% overall anatomical Dice score. The synthesized volumes successfully support downstream segmentation of whole teeth (82.4% Dice) and the inferior alveolar canal (72.2% Dice), demonstrating that our disentangled approach preserves clinically relevant structures to enable robust anatomy-aware assessment when CBCT data is unavailable.

[CV-81] RelWitness: Open-Vocabulary 3D Scene Graph Generation with Visual-Geometric Relation Witnesses

链接: https://arxiv.org/abs/2605.20823
作者: Minh Anh Nguyen,Quang Huy Tran,Bao Ngoc Le,Tuan Kiet Pham,Sui Yang Guang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Open-vocabulary 3D scene graph generation seeks to describe object instances and their relations with flexible natural-language predicates. The central difficulty is not only vocabulary expansion, but supervision reliability: relation annotations in 3D scene graph datasets are selective, and many valid object-pair relations are unannotated. We propose RelWitness, a framework for open-vocabulary 3D scene graph generation from posed RGB-D sequences under incomplete relation supervision. The key concept is a relation witness: a concrete visual-geometric cue that makes a relation observable in the captured scene. Support relations require contact and vertical ordering; containment requires enclosure; proximity requires metric closeness; orientation requires facing direction; and stable relations should persist across views where both objects are visible. RelWitness constructs relation witness records from RGB views, depth maps, reconstructed 3D geometry, role-sensitive text, object-prior null views, and multi-view consistency. A visual-geometric witness verifier assigns unannotated relation candidates to verified missing positives, reliable negatives, or uncertain unlabeled cases. A witness-guided positive-unlabeled objective then learns from incomplete annotations without turning every missing label into a negative. We further introduce witness-consistent decoding and an RGB-D missing-relation audit protocol. Simulated manuscript-planning experiments on 3DSSG/3RScan and ScanNet-derived open-vocabulary splits show the intended behavior: improved unseen-relation recognition, higher witness precision, lower hallucination, and reduced redundant relation phrases. All numerical results are planning values and must be replaced by reproduced measurements before submission

[CV-82] ERDNet: Transformer Encoder-Recurrent Decoder Network for Scene Change Detection ICRA

链接: https://arxiv.org/abs/2605.20822
作者: Jiae Yoon,Ue-Hwan Kim
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 4 figures. Accepted to the IEEE International Conference on Robotics and Automation (ICRA) 2026

点击查看摘要

Abstract:In this work, we address the challenge of Scene Change Detection (SCD), where the goal is to identify variations between two images of the same location captured at different times. Existing SCD models often overlook the varying importance of features across layers, employ single-step decoders that confine refinement, and provide limited insight into encoder pretraining strategies. We propose TERDNet, a Transformer Encoder-Recurrent Decoder Network designed to overcome these limitations. TERDNet consists of a transformer-based encoder that extracts multi-level representations, a feature fusion module that integrates correlation volumes with these features, a recurrent 3-gate-GRU decoder that performs iterative refinement, and a combined convolution-interpolation upsampler that restores fine-grained resolution. Extensive experiments on four public benchmarks show that TERDNet consistently outperforms prior approaches and produces more accurate and detailed change masks. Ablation studies confirm the benefit of segmentation-based pretraining and the effectiveness of our fusion design. In addition, robustness tests under viewpoint misalignment confirm TERDNet’s potential for deployment in real-world robotic systems, where reliable perception is critical. Our code is available at this https URL.

[CV-83] VSCD: Video-based Scene Change Detection in Unaligned Scenes ICML2026

链接: https://arxiv.org/abs/2605.20821
作者: Jiae Yoon,Ue-Hwan Kim
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 18 pages, 7 figures. Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

点击查看摘要

Abstract:Detecting what has changed in an environment is essential for long-term autonomy, yet most change detection settings assume fixed viewpoints, mild misalignment, or only a few changed objects. We introduce Video-based Scene Change Detection (VSCD), which predicts a pixel-wise change mask for each query frame, given a reference and a query RGB video of the same indoor space recorded at different times under unconstrained camera motion. The two videos are not temporally synchronized, and many object instances may appear or disappear. To study this setting, we build a large-scale benchmark with over 1.1 million frames annotated with pixel-accurate change masks, together with a real-world test set for evaluating transfer beyond simulation. We propose a query-centric multi-reference model that learns temporal matching implicitly from change-mask supervision, aligns candidate reference features to the query via local patch correspondence, and fuses per-candidate change features using frame-level and patch-level confidence before decoding a high-resolution mask once per frame. Our approach achieves state-of-the-art performance against strong image- and video-based baselines, and we validate its real-world impact by deploying it on a mobile robot for two downstream applications – visual surveillance and object incremental learning.

[CV-84] AIR: Amortized Image Reconstruction Framework for Self-Supervised Feed-Forward 2D Gaussian Splatting

链接: https://arxiv.org/abs/2605.20820
作者: Zhaojie Zeng,Yuesong Wang,Yawei Luo,Tao Guan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: preprint version

点击查看摘要

Abstract:2D Gaussian splatting provides an efficient explicit representation for image reconstruction, but existing methods still require costly per-image iterative optimization or rely on handcrafted priors for primitive allocation. We present AIR, a self-supervised feed-forward framework that amortizes iterative Gaussian fitting into a single network pass, eliminating per-image test-time optimization. AIR adopts a stage-wise residual architecture that progressively predicts additional Gaussian primitives from reconstruction residuals, together with an explicit Stage Control mechanism that activates new primitives only in under-reconstructed regions. A Predict–Optimize–Distill training strategy stabilizes multi-stage prediction by distilling short-horizon optimized Gaussian increments back into the predictor. The stabilized predictor is then jointly finetuned across stages and equipped with an image-adaptive quantizer for compact Gaussian storage. Experiments on Kodak and DIV2K show that AIR achieves better reconstruction quality than representative Gaussian-based baselines while reducing encoding time to 160–300,ms. Code: this https URL

[CV-85] OSGNet with MLLM Reranking @ Ego4D Episodic Memory Challenge 2026 CVPR

链接: https://arxiv.org/abs/2605.20818
作者: Yisen Feng,Leigang Qu,Haoyu Zhang,Qiaohui Chu,Meng Liu,Xuemeng Song,Weili Guan,Liqiang Nie
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Champion solution for the Natural Language Queries and GoalStep tracks of the Ego4D Challenge at the CVPR EgoVis Workshop 2026

点击查看摘要

Abstract:In this report, we present our champion solutions for the Natural Language Queries and GoalStep tracks of the Ego4D Episodic Memory Challenge at CVPR 2026. Both tracks require accurately localizing temporal segments from long untrimmed egocentric videos. To address these tasks, we propose a reranking-based framework that effectively leverages the strong video-language reasoning capability of multimodal large language model (MLLM) while preserving the efficiency and candidate recall of conventional localization pipelines. Specifically, we first obtain a set of candidate segments from existing localization model OSGNet, and then employ MLLM to select the segment that best matches the given query, thereby refining the final prediction. Ultimately, our method achieved first place in both the Natural Language Queries and GoalStep tracks. Our code can be found at this https URL.

[CV-86] Spatial Gram Alignment for Ultra-High-Resolution Image Synthesis

链接: https://arxiv.org/abs/2605.20808
作者: Jinjin Zhang,Xiefan Guo,Di Huang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical Report

点击查看摘要

Abstract:Modern ultra-high-resolution image synthesis relies heavily on the robust generative capacity of large-scale pre-trained Latent Diffusion Models (LDMs). While recent representation alignment methods have proven effective by distilling visual priors from foundation models (e.g., SAM or DINO) into generative latent features, scaling these approaches to pre-trained LDMs at extreme resolutions exposes a critical learnability-fidelity conflict. Specifically, forcing direct patch-wise feature distillation inherently perturbs the pre-trained latent manifold, ultimately leading to generation degradation. To address this bottleneck, we propose Spatial Gram Alignment (SGA), a novel framework that explicitly leverages the representation priors of vision foundation models while preserving the native generative capacity of LDMs. Moving beyond restrictive direct alignment, SGA imposes a non-invasive spatial constraint by aligning the internal self-similarities of the generative features with those of the foundation priors. This spatial constraint effectively establishes macroscopic structural coherence, while the native generative objectives retain the microscopic pixel-level fidelity inherent to the original LDMs. Notably, this versatile strategy integrates seamlessly across both intermediate diffusion features and VAE latents within pre-trained LDMs. Extensive experiments demonstrate that SGA achieves state-of-the-art performance for ultra-high-resolution text-to-image synthesis, yielding an effective reconciliation between global structural integrity and fine-grained visual details. Code is available at this https URL.

[CV-87] Decomposing Subject-Driven Image Generation via Intermediate Structural Prediction

链接: https://arxiv.org/abs/2605.20807
作者: Hanzhong Guo,Yizhou Yu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Subject-driven text-to-image generation still struggles to preserve high-frequency identity details such as logos, patterns, and text. Existing methods typically operate directly in RGB space, which often leads to detail degradation under substantial edits. We propose a two-stage framework that decouples structure from appearance by first predicting a Canny map and then rendering the final image conditioned on both the source appearance and the predicted structure. To improve text handling, we further introduce a fully automatic pipeline that constructs a 100k-pair text-aware dataset with cross-view textual consistency. Experiments, including GPT-4.1-based evaluation and a knowledge distillation study, show clear gains over selected baselines and suggest that intermediate structural prediction is an effective route for high-fidelity subject-driven generation. Our dataset and code will be made publicly available.

[CV-88] OlmoEarth v1.1: A more efficient family of OlmoEarth models

链接: https://arxiv.org/abs/2605.20804
作者: Gabriel Tseng,Yawen Zhang,Favyen Bastani,Henry Herzog,Joseph Redmon,Hadrien Sablon,Piper Wolters,Patrick Alan Johnson,Christopher Wilhelm,Patrick Beukema
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present a set of improvements to the OlmoEarth family. These improvements allow us to cut compute costs during training ( 1.7 \times reduction in GPU hours required to train our Base models) and inference ( 2.9\times reductions in MACs on Sentinel-2 tasks), while maintaining the models’ overall performance. All training code is available at this http URL.

[CV-89] What Semantics Survive the Connector? Diagnosing VLM-to-DiT Alignment in Video Editing

链接: https://arxiv.org/abs/2605.20795
作者: Hangyu Lin,Chao Wen,Chengming Xu,Jianxiong Gao,Jiangning Zhang,Xiaobin Hu,Yanwei Fu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Flow matching based video generative models have been increasingly relying on prepended Vision-Language Models (VLMs) to handle complex, instruction-based video editing. The prevailing assumption underlying this paradigm is that a connector module can seamlessly align the VLM’s rich multi-modal reasoning with the original text embedding space of DiTs. However, we hypothesize that this alignment acts as a severe semantic bottleneck, degrading fine-grained structural variables. Verifying this is challenging, as end-to-end evaluations conflate alignment failures with generation errors, and natural datasets lack disentangled annotations. To rigorously investigate this, we propose a controlled data processing pipeline based on video composition that results in TRACE-Edit, a diagnostic dataset focusing on relation-based editing. Leveraging this dataset, we propose a comprehensive diagnostic protocol to analyze two important designs of meta-query and connector in the existing video editing models. Systematic evaluation of four representative model cases reveals that fine-grained structural semantics can be severely degraded during alignment. Our findings overturn the assumption of lossless semantic transfer, identifying the VLM-to-DiT alignment as a major bottleneck and providing a new diagnostic foundation for future multi-modal alignment architectures.

[CV-90] Findings of the Counter Turing Test: AI-Generated Image Detection AAAI2025

链接: https://arxiv.org/abs/2605.20787
作者: Rajarshi Roy,Nasrin Imanpour,Ashhar Aziz,Shashwat Bajpai,Gurpreet Singh,Shwetangshu Biswas,Kapil Wanaskar,Parth Patwa,Subhankar Ghosh,Shreyas Dixit,Nilesh Ranjan Pal,Vipula Rawte,Ritvik Garimella,Amitava Das,Amit Sheth,Vasu Sharma,Aishwarya Naresh Reganti,Vinija Jain,Aman Chadha
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Defactify4 @AAAI 2025

点击查看摘要

Abstract:The rapid advancements in generative AI technologies, such as Stable Diffusion, DALL-E, and Midjourney, have significantly transformed the creation of synthetic visual content. While these models enable innovation across industries, they also pose serious challenges, including misinformation, disinformation, and biased content generation. The increasing realism of AI-generated images makes their detection a pressing concern for researchers, policymakers, and industry stakeholders. In this paper, we present the findings of the Defactify 4.0 workshop, which introduced the Counter Turing Test (CT2) for AI-Generated Image Detection. The competition consisted of two key tasks: (1) binary classification of images as either AI-generated or real and (2) identification of the specific generative model responsible for an AI-generated image. To facilitate this, we developed the MS COCOAI dataset, consisting of 50,000 synthetic images from multiple generative models alongside real-world images from the MS COCO dataset. Participants employed diverse detection strategies, including convolutional neural networks (CNNs), Vision Transformers (ViTs), frequency-based analysis, contrastive learning, and multimodal techniques. The results demonstrated that while AI-generated images can be detected with high accuracy (F1-score 0.83), identifying the exact model used remains significantly more challenging (highest F1-score: 0.4986). These findings highlight the need for improved model fingerprinting, adversarial robustness, and real-time detection mechanisms. Comments: Defactify4 @AAAI 2025 Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2605.20787 [cs.CV] (or arXiv:2605.20787v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2605.20787 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-91] Learning to Think in Physics: Breaking Shortcut Learning in Scientific Diffusion via Representation Alignment

链接: https://arxiv.org/abs/2605.20780
作者: Haozhe Jia,Pengyu Yin,Wenshuo Chen,Shaofeng Liang,Lei Wang,Bowen Tian,Xiucheng Wang,Nanqian Jia,Yutao Yue
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Physics-informed diffusion models typically enforce PDE constraints only on final outputs, leaving intermediate representations unconstrained and prone to shortcut learning under shifted boundary conditions. We introduce REPA-P, a teacher-free, architecture-agnostic framework that aligns intermediate features with physical states using first-principles residuals. REPA-P attaches lightweight 1\times1 projection heads to selected layers, decodes hidden activations into physical quantities, and applies PDE residual losses during training. These heads are discarded at inference, introducing zero overhead. Across four PDE tasks, including Darcy flow, topology optimization, electrostatic potential, and turbulent channel flow, REPA-P accelerates convergence by up to 2\times , reduces physics residuals by up to 66.4% , and improves out-of-distribution robustness by up to 49.3% , with consistent gains on both U-Net and Diffusion Transformer backbones. Ablations show that supervising a small set of intermediate layers captures most benefits and complements output-level physics losses. Code is available at [this https URL](this https URL).

[CV-92] AttriStory: Fine-grained Attribute Realization for Visual Storytelling with Diffusion Models CVPR

链接: https://arxiv.org/abs/2605.20777
作者: Manogna Sreenivas,Rohit Kumar,Soma Biswas
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR AIStory Workshop, 2026

点击查看摘要

Abstract:Visual storytelling with diffusion models has made impressive strides in maintaining character consistency across narrative scenes. However, a critical gap remains: while these methods ensure a character remains consistent across scenes, they provide no systematic method to ensure if fine-grained attributes such as color and textures of clothing, accessories are faithfully rendered in the generated images. Towards this goal, we introduce AttriStory, a benchmark enabling attribute realization in visual storytelling. We curate 200 multi-scene stories across 10 distinct artistic styles using Large Language Model. Each scene is constructed with detailed attribute specifications to enable rich visual narratives. Further, to address attribute realization, we propose a plug-and-play latent optimization module that operates during early denoising steps, when the model establishes structural and semantic content. We achieve this through AttriLoss objective designed to maximize alignment between the cross-attention maps for desired attribute-object pairs while suppressing spurious associations, guiding models to localize attributes correctly. This approach operates orthogonally to existing consistency mechanisms, integrating seamlessly with current story generation pipelines without requiring architectural modifications. Our experiments demonstrate consistent improvements on incorporating AttriLoss across all baselines. This work positions attribute realization as a distinct, complementary dimension of visual storytelling, alongside character consistency, advancing the field toward fine-grained attribute-controlled story generation. Project-page:this https URL

[CV-93] VIHD: Visual Intervention-based Hallucination Detection for Medical Visual Question Answering MICCAI2026

链接: https://arxiv.org/abs/2605.20772
作者: Jiayi Chen,Benteng Ma,Zehui Liao,Winston Chong,Yasmeen George,Jianfei Cai
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Early accepted by MICCAI 2026

点击查看摘要

Abstract:While medical Multimodal Large Language Models (MLLMs) have shown promise in assisting diagnosis, they still frequently generate hallucinated responses that appear linguistically plausible but lack visual evidence. Such hallucinations pose risks to clinical decision-making and necessitate effective detection. Existing introspective detection methods primarily perform uncertainty estimation or logical verification by analyzing model responses conditioned on original or perturbed inputs. However, such external perturbations are often heuristic and context-agnostic, which overlooks the internal cross-modal dependency between generated tokens and related visual tokens during decoding. To address this issue, we propose VIHD, a Visual Intervention-based Hallucination Detection method that leverages targeted visual token masking to calibrate semantic entropy for more effective hallucination detection. VIHD locates visually dominant decoder layers via Visual Dependency Probing (VDP), executes Visual Intervention Decoding (VID) via token masking to calibrate the semantic distribution, and quantifies the resulting Calibrated Semantic Entropy (CSE) as a reliable hallucination signal. Extensive experiments on three medical VQA benchmarks with two medical MLLMs demonstrate that VIHD consistently outperforms state-of-the-art methods, underscoring the importance of fine-grained visual dependency for hallucination detection. The code will be available at this https URL

[CV-94] Diffuse to Detect: Bi-Level Sample Rebalancing with Pseudo-Label Diffusion for Point-Supervised Infrared Small-Target Detection

链接: https://arxiv.org/abs/2605.20766
作者: Zhu Liu,Yuanhang Yao,Ping Qian,Zihang Chen,Risheng Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Point supervision has become a scalable solution to address dense annotation for infrared small target detection, but its performance is limited by two coupled bottlenecks: unstable pseudo-label evolution in cluttered, low-contrast infrared imagery and severe sample-distribution imbalance. In this paper, we present a more adaptive and stable framework to address these issues. Leveraging the intrinsic consistency between thermal radiation patterns and heat diffusion, we propose a physics-induced annotation strategy that expands single-point labels into reliable pseudo-masks. To further enhance supervision and alleviate sample imbalance, we develop a bi-level dual-update framework that jointly optimizes detector weights, sample weights, and diffusion parameters. A meta-classifier dynamically predicts sample-wise loss weights, while a differentiable diffusion module refines pseudo-labels with detection feedback, enabling adaptive interaction between training and hyperparameter optimization. Extensive experiments across multiple datasets demonstrate five-fold annotation acceleration, superior detection accuracy, and comparable performance with 30% of the training data, validating the efficiency and practicality of our approach. Our code is available at this https URL.

[CV-95] SpineContextResUNet: A Computationally Efficient Residual UNet for Spine CT Segmentation

链接: https://arxiv.org/abs/2605.20760
作者: K S Nithurshen,Saurabh J. Shigwan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 2 Figures, 3 Tables

点击查看摘要

Abstract:Automated segmentation of the vertebral column in Computed Tomography (CT) scans is a prerequisite for pathological assessment and surgical planning. However, state-of-the-art methods, particularly those based on Transformers or large-scale ensembles, demand substantial GPU resources, creating a barrier for clinical adoption in resource-constrained environments or on edge devices. To address this, we introduce SpineContextResUNet, a computationally efficient 3D Residual U-Net designed for rapid spinal localization. Our architecture integrates a lightweight Context Block that employs parallel multi-dilated convolutions to capture long-range anatomical dependencies without the high latency of Recurrent Neural Networks (RNNs) or the memory overhead of Self-Attention mechanisms. Extensive validation on two public benchmarks, VerSe2020 and CTSpine1K, demonstrates that our model achieves a Dice score of 88.17% and 88.13% respectively. To evaluate performance under strict hardware constraints, we compared our model against a bottlenecked SwinUNETR scaled to match our ~1.7M hardware footprint. While the constrained Transformer suffers severe performance degradation due to a lack of spatial inductive biases in a limited-data regime, our CNN-based approach successfully maintains high accuracy. Crucially, heavy baselines like TotalSegmentator fail due to memory exhaustion on commodity hardware (Intel Core i5, 8GB RAM), our model performs robust inference, making it a viable solution for point-of-care diagnostics and deployment on edge platforms like the Nvidia Jetson Orin Nano.

[CV-96] Conflict-Aware Additive Guidance for Flow Models under Compositional Rewards ICML2026

链接: https://arxiv.org/abs/2605.20758
作者: Xuehui Yu,Fucheng Cai,Meiyi Wang,Xiaopeng Fan,Harold Soh
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Forty-Third International Conference on Machine Learning (ICML 2026)

点击查看摘要

Abstract:Inference-time guided sampling steers state-of-the-art diffusion and flow models without fine-tuning by interpreting the generation process as a controllable trajectory. This provides a simple and flexible way to inject external constraints (e.g., cost functions or pre-trained verifiers) for controlled generation. However, existing methods often fail when composing multiple constraints simultaneously, which leads to deviations from the true data manifold. In this work, we identify root causes of this off-manifold drift and find that the approximation error scales severely with gradient misalignment. Building on these findings, we propose Conflict-Aware Additive Guidance ( g^\textcar ), a lightweight and learnable method, which actively rectifies off-manifold drift by dynamically detecting and resolving gradient conflicts. We validate g^\textcar across diverse domains, ranging from synthetic datasets and image editing to generative decision-making for planning and control. Our results demonstrate that g^\textcar effectively rectifies off-manifold drift, surpassing baselines in generation fidelity while using light compute. Code is available at this https URL.

[CV-97] STAR-IOD: Scale-decoupled Topology Alignment with Pseudo-label Refinement for Remote Sensing Incremental Object Detection

链接: https://arxiv.org/abs/2605.20738
作者: Yaoteng Zhang,Qing Zhou,Junyu Gao,Qi Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: STAR-IOD was accepted by ISPRS Journal of Photogrammetry and Remote Sensing

点击查看摘要

Abstract:Remote sensing imagery typically arrives in the form of continuous data streams. Traditional detectors often forget previously learned categories when learning new ones; therefore, research on Remote Sensing Incremental Object Detection (RS-IOD) is of great significance. However, existing methods largely overlook the intra-class scale variations prevalent in remote sensing scenes, which undermines the effectiveness of knowledge transfer and old knowledge preservation. Moreover, RS-IOD also suffers from missing annotations, which cause the model to misclassify old-class instances as background. To address these challenges, we propose a novel framework, STAR-IOD. First, we introduce a Subspace-decoupled Topology Distillation (STD) module to transfer structural knowledge, explicitly aligning inter-class topological relationships and mitigating intra-class representation discrepancies induced by scale shifts. Furthermore, we introduce the Clustering-driven Pseudo-label Generator (CPG), a plug-and-play module that leverages K-Means clustering to dynamically identify class-specific thresholds, thereby guaranteeing an accurate distinction between true positive targets and background noise and alleviating the issue of missing annotations for old classes. We also constructed two Remote Sensing Incremental Object Detection datasets, DIOR-IOD and DOTA-IOD to facilitate research on RS-IOD. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches by 1.7% and 2.1% mAP on DIOR-IOD and DOTA-IOD, respectively, effectively alleviating catastrophic forgetting while preserving strong detection performance on both base and novel classes. The code and dataset are released at: this https URL.

[CV-98] Resolving Long-Tail Ambiguity in Unsupervised 3D Point Cloud Segmentation with Language Priors

链接: https://arxiv.org/abs/2605.20737
作者: Siqi Wei,Hongbin Xu,Feng Xiao,Tian Lan,Chun Li,Ming Li,Qiuxia Wu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: In submission. The code will be released at: this https URL

点击查看摘要

Abstract:Existing approaches for unsupervised 3D point cloud segmentation predominantly rely on a purely visual similarity-based learning-by-clustering paradigm, which suffers from a fundamental limitation: long-tail ambiguity. In such a paradigm, features of minor classes are consistently absorbed by dominant clusters, leading to severely imbalanced predictions. To address this issue, we propose LangTail, a language-guided hierarchical learning framework that leverages the balanced world knowledge encoded in language models to mitigate long-tail ambiguity in unsupervised 3D segmentation. The key idea is to establish multi-level associations between language-derived semantic priors and visually underrepresented minor classes, thereby compensating for the biased attention of purely visual clustering toward dominant classes. Specifically, LangTail first constructs an entity-level semantic prior from language models, capturing balanced and fine-grained world knowledge across categories. These priors are injected into a hierarchical clustering framework via contrastive alignment. This guides multi-granularity semantic structure formation and prevents minor classes from being absorbed by dominant clusters, yielding more discriminative representations for underrepresented categories. Extensive experiments on ScanNet-v2, S3DIS, and nuScenes demonstrate that LangTail consistently outperforms existing methods by significant margins, \ie, +13.5, +12.9, and +8.9 mIoU, respectively. These results demonstrate the effectiveness of language priors in improving the representation of minority classes in 3D point clouds. The code will be released at: this https URL.

[CV-99] Lowering the Barrier to IREX Participation: Open-Source Algorithms Toolkit and Benchmarking for Iris Recognition

链接: https://arxiv.org/abs/2605.20735
作者: Siamul Karim Khan,Patrick J. Flynn,Adam Czajka
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper proposes two new open-source iris recognition algorithms, providing both Python and IREX-compliant C++ implementations to be submitted to the official IREX X program. This work has two primary goals: (a) to conduct the first-ever assessment of open-source iris recognition solutions according to IREX testing protocols, and (b) to offer a model C++ submission that significantly facilitates the entry of other teams’ open-source methods into the IREX evaluation. The new methods consist of two Neural Networks trained with: (i) Triplet loss with Batch-Hard Triplet mining (TripletIris), and (ii) ArcFace loss (ArcIris). The paper also provides open-source IREX-compliant C++ implementations of two existing methods: (a) an iris image filtering-based algorithm utilizing human saliency-driven kernels (HDBIF), and (b) a human-interpretable algorithm for detecting and comparing Fuchs’ crypts (CRYPTS). Except for CRYPTS, which faced timing constraints during 1:N search, these methods have undergone the official IREX X evaluation and have also been assessed using several popular academic benchmarks: Quality-Face/Iris Research Ensemble, Warsaw-Biobase Post-Mortem Iris, CASIA-Iris-Thousand-V4, CASIA-Iris-Lamp-V4, IIT Delhi Iris Database, IIITD Contact Lens Iris Database, NDIris3D, and Notre Dame Variable Iris Image Quality Release 2. Finally, this paper also provides open-source models for iris segmentation and circle estimation that can be incorporated into any new iris recognition method.

[CV-100] Sketch2MinSurf: Vision-Language Guided Generation of Editable Minimal Surfaces from Hand-Drawn Sketches

链接: https://arxiv.org/abs/2605.20733
作者: Wenda Wang,Anqi Liu,Junqi Yang,Lei He,Luying Wang,Jiachen Lu,Weixin Huang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 16 figures, includes appendix

点击查看摘要

Abstract:Converting hand-drawn sketches into structured 3D geometries remains challenging due to the difficulty of representing non-Euclidean surfaces and maintaining topological consistency. Existing generative models such as GANs, NeRFs, and diffusion architectures often fail to produce editable manifolds directly usable in downstream design workflows. We present Sketch2MinSurf, a hybrid vision-language and geometric optimization framework that integrates vision-language guidance with minimal-surface theory to generate smooth and editable 3D surfaces from hand-drawn sketches. The core of our approach is a spatial-topological encoding that represents geometry as tuples of node coordinates and real/virtual edge skeletons, enabling stable topological control during generation. We further introduce the Sketch2MinSurf Structural Loss (S2MS-Loss), a reward-modulated objective that jointly constrains geometric reconstruction and topological coherence. On a test set of 100 sketches, Sketch2MinSurf achieves a topological similarity score of 0.844, outperforming existing sketch-to-shape baselines. The generated manifolds are directly editable and free from non-manifold artifacts. A public art installation at a university showcases the method’s potential for human-intent-driven 3D form generation. The dataset and code are available at this https URL.

[CV-101] Deep Attention Reweighting: Post-Hoc Attention-Based Feature Aggregation in CNNs for Disentangling Core and Spurious Features under Spurious Correlations

链接: https://arxiv.org/abs/2605.20732
作者: Kin Whye Chew,Jingxian Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review. 26 pages, 7 figures

点击查看摘要

Abstract:Convolutional Neural Networks (CNNs) often exploit spurious correlations in datasets, learning superficially predictive yet causally irrelevant features, leading to poor generalization and fairness issues. Deep Feature Reweighting (DFR) is a post-hoc technique that reduces a trained model’s reliance on spurious correlations by retraining its classification head on a target dataset. However, we show that DFR is fundamentally constrained by operating on entangled features, limiting its ability to amplify the core features while simultaneously suppressing the spurious ones. We trace this entanglement to the ubiquitous Global Average Pooling (GAP) layer, which indiscriminately collapses spatially distinct core and spurious features into a single representation. To address this, we propose Deep Attention Reweighting (DAR), a post-hoc attention-based aggregation module that replaces GAP and is retrained jointly with the classification head. DAR computes an adaptive weighting of spatial locations across feature maps, enabling selective suppression of spurious features before the collapse into entangled features. Across various datasets, metrics, and ablations, DAR consistently outperforms DFR, demonstrating that our attention-based aggregation mitigates GAP-induced entanglement and reduces spurious reliance.

[CV-102] ASTE: A Designer-Annotated Multi-Dimensional Preference Dataset for AI-Generated Graphic Design

链接: https://arxiv.org/abs/2605.20731
作者: Haonan Zhu,Elad Hirsch,Alexandria Minetti,Allison Nulty,Purvanshi Mehta
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Applications (stat.AP)
备注:

点击查看摘要

Abstract:Text-to-image models produce graphic design at production scale, but their supervision comes from photo-style preference data with a single overall verdict per comparison. Designers evaluate along several distinct axes, including typography, visual hierarchy, color harmony, layout, and brief fidelity, and a single label collapses them. We release TASTE (Typography, Aesthetics, Spatial, Tone, Etc.): ten professional designers ranked outputs from four current text-to-image models on nine criteria across two disjoint cohorts, yielding 1,600 ratings per criterion plus per-image hallucination flags on the holistic-preference cohorts. We pair the dataset with three contributions. First, a criterion-agnostic signal test framework, using Kendall’s tau, majority probability, and Condorcet cycles against exact iid-uniform nulls at p = 4 and R = 5, places designer agreement on graphic design between food and movie preferences and photo-style image quality, with every TASTE criterion rejecting the random-rater null. Second, no pre-trained system in our benchmark, including six open-weight VLM judges from 3B to 33B parameters and three dedicated T2I scorers, HPSv2.1, PickScore-v1, and LAION-Aesthetic-V2, exceeds 0.55 macro agreement with the 5-designer majority; VLM judges trade off position bias against content sensitivity, so scaling moves along this frontier without improving accuracy. Third, a small pairwise-difference head trained on TASTE reaches 0.611, closing roughly half the gap to the 0.741 single-rater ceiling.

[CV-103] Early High-Frequency Injection for Geometry-Sensitive OOD Detection

链接: https://arxiv.org/abs/2605.20728
作者: Chuanjie Cheng,Ningkang Peng,Chenxi Liu,Yifan He,Peirong Ma,Yanhui Gu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Post-hoc OOD detectors score logits or features after training, so their success depends on the geometry already encoded in the representation. We revisit this assumption through a band-wise MMD^2 analysis across CE, SimCLR, SupCon, and the OOD-oriented representation method PALM. In our diagnostic, low-frequency input bands induce weaker ID/OOD feature discrepancy, whereas higher-frequency bands tend to provide stronger separability. This observation motivates EIHF, an input-side intervention that exposes high-frequency evidence before the first convolution without changing the training objective. EIHF is strongest for geometry-sensitive OOD detection: under matched training and scoring settings, it reshapes class-conditional feature geometry and reduces ID/OOD Mahalanobis score overlap. Experiments on CIFAR-100 and ImageNet-100 show gains on CIFAR-100 and the best average FPR95 with second-best average AUROC on ImageNet-100, while also revealing a limitation on the scene-centric Places shift. Code is available at this https URL.

[CV-104] GAMR: Geometric-Aware Manifold Regularization with Virtual Outlier Synthesis for Learning with Noisy Labels

链接: https://arxiv.org/abs/2605.20727
作者: Ningkang Peng,Jingyang Mao,Xiaoqian Peng,Peirong Ma,Xichen Yang,Weiguang Qu,Yanhui Gu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep neural networks (DNNs) experience significant performance degradation when processing noisy labels, primarily due to overfitting on mislabeled data. Current mainstream approaches attempt to mitigate this issue by passively filtering clean samples during training. However, simple sample filtering within feature spaces degraded by noise struggles to distinguish between challenging samples and noisy samples, creating a bottleneck for model performance. We highlight for the first time the fundamental importance of actively reshaping feature space geometry for learning from noisy data. We propose a novel Geometry-aware Manifold Regularization Paradigm whose core idea is to explicitly construct energy barriers between data manifolds by actively synthesizing virtual outlier samples. By imposing geometric constraints that promote intra-class compactness and inter-class separation, this approach enhances the discriminability between hard and noisy samples, leading to the learning of more robust representations. Our regularization mechanism exhibits high universality, with effectiveness independent of any prior assumptions about noise patterns. It can be integrated as a standalone mechanism into existing sample selection frameworks, providing stronger robustness against diverse noisy environments. Experiments demonstrate that our paradigm achieves performance surpassing current state-of-the-art (SOTA) methods on multiple benchmarks, including CIFAR-10, with particularly pronounced advantages under more challenging asymmetric noise conditions. Furthermore, this paradigm significantly enhances the model’s capability in Out-of-Distribution (OOD) detection, ensuring superior reliability and safety for deployment in open-world scenarios.

[CV-105] Holistic Reliability Propagation: Decoupling Annotation and Prediction for Robust Noisy-Label

链接: https://arxiv.org/abs/2605.20725
作者: Jingyang Mao,Ningkang Peng,Yanhui Gu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Learning with noisy labels in multimedia classification often combines external annotations and model predictions into a single reliability weight, even though the two sources can fail for different reasons. We instead estimate disentangled reliabilities: bilevel meta-learning produces two batch-normalized scalars per sample, alpha for the given label and beta for the pseudo-label, without constraining them to sum to one. Holistic Reliability Propagation (HRP) then routes them to different objectives, using reliability-aware Mixup with global gating on the input branch and beta-gated pseudo-label positives on the contrastive branch. On synthetic and real-world benchmarks, HRP improves average accuracy over strong baselines and remains competitive at the highest noise rates.

[CV-106] E-ReCON: An Energy- and Resource-Efficient Precision-Configurable Sparse nvCIM Macro for Conventional and Spiking Neural Edge Inference

链接: https://arxiv.org/abs/2605.20717
作者: Ankit Kumar Tenwar,Mukul Lokhande,Santosh Kumar Vishvakarma
类目: Neural and Evolutionary Computing (cs.NE); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:This work presents E-ReCON, a 16 Kb energy and resource-efficient digital compute-in-memory (DCIM) macro based on a compact 3T1R ReRAM bitcell for edge-AI inference. The proposed bitcell occupies only 0.85 um^2 and supports reliable AND-based in-memory multiplication for both conventional convolutional neural network (CNN) and spiking neural network (SNN) workloads. To reduce accumulation overhead, a novel interleaved 10T/28T adder tree is introduced, reducing transistor count and power consumption by 37% and 28%, respectively, compared to a conventional 28T RCA-based design. Implemented in 65 nm CMOS at 1.2 V, the proposed macro achieves a minimum latency of 0.48 ns, throughput of 2.31-3.1 TOPS, and energy efficiency of up to 419 TOPS/W. When evaluated on LeNet-5, AlexNet, and CNN-8 models, the macro achieves 97.81%, 93.23%, and 96.51% accuracy on MNIST/A-Z, CIFAR10, and SVHN datasets, respectively. In addition, 40% pruning preserves nearly 99.8% of the original accuracy while reducing MAC operations and computation cycles. For SNN-oriented workloads, the proposed AND-type bitcell efficiently supports spike-weight multiplication with low switching activity, where the 2A2W configuration achieves accuracy close to the FP32 baseline across VGG-8, VGG-16, and ResNet-18 networks on CIFAR-10, CIFAR-100, and ImageNet-1K datasets. Compared to prior ADC-based ReRAM-CIM designs, the proposed architecture improves latency and energy efficiency by nearly 30-40% while maintaining robust operation under full PVT and ReRAM variability. Overall, E-ReCON provides a scalable, low-latency, and energy-efficient nvCIM platform for next-generation edge-AI, IoT, biomedical sensing, and neuromorphic applications.

[CV-107] SAVER: Selective As-Needed Vision Evidence for Multimodal Information Extraction

链接: https://arxiv.org/abs/2605.20713
作者: Miaobo Hu,Shuhao Hu,Bokun Wang,Rui Chen,Xin Wang,Xiaobo Guo,Daren Zha,Jun Xiao
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multimodal IE in social media is difficult because a post may attach multiple images that are weakly related, redundant, or even misleading with respect to the text. In this setting, always-on multimodal fusion wastes computation and can amplify spurious visual cues. The core challenge is to decide, for each candidate span or marked entity pair, whether vision should be consulted at all and, if so, which small subset of images provides trustworthy evidence. We propose SAVER, a selective vision-as-needed framework for multimodal named entity recognition and multimodal relation extraction. SAVER uses a Conformal Groundability Gate (CGG) to estimate span-level visual groundability in MNER, derive pair-level activation in MRE from the two marked entities, and calibrate the activation threshold on a held-out split via a conformal-style procedure with Clopper–Pearson upper bounds. When activated, a submodular relevance–diversity selector chooses a compact evidence subset across images, which is then aggregated by a Set Transformer. An energy-inspired joint scoring head combines text, optional visual evidence, text–image consistency, and sparse routing for entity typing or relation classification. Experiments show that SAVER consistently improves F1 over strong text-only and always-on multimodal baselines, while reducing AURC, increasing activation coverage at a fixed risk level, and lowering FLOPs and P90 latency. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2605.20713 [cs.CV] (or arXiv:2605.20713v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2605.20713 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-108] Rethinking Cross-Layer Information Routing in Diffusion Transformers

链接: https://arxiv.org/abs/2605.20708
作者: Chao Xu,Maohua Li,Qirui Li,Yixuan Xu,Yanke Zhou,Yunhe Li,Cuifeng Shen,Hanlin Tang,Kan Liu,Tao Lan,Lin Qu,Shao-Qun Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion Transformers (DiTs) have become a de facto backbone of modern visual generation, and nearly every major axis of their design – tokenization, attention, conditioning, objectives, and latent autoencoders – has been extensively revisited. The residual stream that governs how information accumulates across layers, however, has been directly inherited from the original Transformer. In this paper, we present a systematic empirical analysis of cross-layer information flow in DiTs, jointly along depth and denoising timestep, and identify three concrete symptoms of traditional residual addition, namely monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy. Motivated by this diagnosis, we propose Diffusion-Adaptive Routing (\textscDAR), a drop-in residual replacement that performs \emphlearnable, timestep-adaptive, and non-incremental aggregation over the history of sublayer outputs. Moreover, the proposed \textscDAR is compatible with many modern Transformer enhancement methods, such as REPA. On ImageNet 256\times256 , \textscDAR improves SiT-XL/2 by 2.11 FID ( 7.56 vs.\ 9.67 ) and matches the baseline’s converged quality with 8.75\times fewer training iterations. Stacked on top of REPA, it yields a 2\times training acceleration in the early stage, suggesting cross-layer information routing as an underexplored design axis in diffusion modeling, one that operates orthogonally to existing representation-alignment objectives. Beyond pretraining, \textscDAR can also be applied during the fine-tuning stage of large-scale T2I models and preserves high-frequency details during Distribution Matching Distillation.

[CV-109] IndusAgent : Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agent ic Tools

链接: https://arxiv.org/abs/2605.20682
作者: Rongbin Tan,Fangfang Lin,Zhenlong Yuan,Min Qiu,Kejin Cui,Mengmeng Wang,Yi Wang,Zijian Song,Zhiyuan Wang,Jiyuan Wang,Yue Wang,Shuhan Song§,Huawei Cao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have shown remarkable capability in bridging visual perception and textual reasoning, enabling zero-shot understanding across diverse industrial scenarios. However, their performance in open-vocabulary industrial anomaly detection (IAD) is often limited by domain-misaligned reasoning and hallucinated structural inferences. To address these challenges, we propose \textbfIndusAgent, a tool-augmented agentic framework for open-vocabulary IAD. Specifically, we first construct \textbfIndus-CoT, a structured dataset that integrates global visual observations, high-resolution local patches, and expert normalcy priors, providing supervision for fine-tuning the model on rigorous industrial inspection trajectories. Building on this, IndusAgent dynamically orchestrates a set of external tools, including dynamic region cropping, high-frequency feature enhancement, and prior retrieval, thus enabling the agent to actively resolve visual ambiguities and disentangle subtle anomalies. Furthermore, we introduce a gated reinforcement learning objective that jointly optimizes anomaly classification, localization accuracy, anomaly type reasoning, and efficient tool usage, ensuring that tool invocation occurs only when beneficial. Extensive evaluations on five industrial anomaly benchmarks, including MVTec-AD, VisA, MPDD, DTD, and SDD, demonstrate that IndusAgent achieves state-of-the-art zero-shot performance among all existing methods, validating our robustness and generalization capacity.

[CV-110] DarkShake-DVS: Event-based Human Action Recognition under Low-light andShaking Camera Conditions

链接: https://arxiv.org/abs/2605.20680
作者: Jiaqi Chen,Qinfu Xu,Liyuan Pan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8pages,7 figures

点击查看摘要

Abstract:Human Action Recognition (HAR) is a fundamental computer vision task with diverse real-world applications. Practical deployments often involve low-light environments and unconstrained 6-DoF camera motion, conditions that degrade visual quality, disrupt temporal coherence, and compromise reliability of existing methods. Event cameras, with high low-light sensitivity and microsecond-level temporal resolution, paired with an inertial measurement unit (IMU), present a promising solution. However, current research faces two key challenges: absence of a benchmark integrating low-light conditions, 6-DoF motion, and synchronized IMU data; and lack of effective motion compensation techniques. To address these, we propose Event-IMU Stabilized HAR (EIS-HAR), with two modules. The first is an EIS module that reduces motion blur via a non-linear warping function to reconstruct a motion-compensated input. The second is a HAR module with a four-stage hybrid architecture to efficiently extract spatiotemporal features for accurate action recognition. To alleviate data scarcity, we introduce DarkShake-DVS, the first large-scale event-based HAR benchmark that includes 18,041 realworld clips captured in low light and intense 6-DoF motion, supplemented by synchronized IMU data. Extensive experiments on three datasets demonstrate consistent superiority of EIS-HAR over state-of-the-art methods.

[CV-111] VISTAQA: Benchmarking Joint Visual Question Answering and Pixel-Level Evidence

链接: https://arxiv.org/abs/2605.20676
作者: Mozhgan Nasr Azadani,Yimu Wang,Yongpeng Zhu,Lihong Chen,Milan Ganai,Sean Sedwards,Marco Pavone,Krzysztof Czarnecki
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Establishing a clear link between model predictions and the visual evidence that supports them is critical for transparency and reliability in multimodal reasoning, yet current multimodal large language model (MLLM) evaluations do not explicitly enforce this alignment. Existing benchmarks assess either textual answer correctness or pixel-level localization in isolation, leaving the coupling of reasoning and grounding an open challenge. We introduce VISTAQA, a comprehensive benchmark for joint evaluation of free-form answer correctness and pixel-level evidence grounding in visual question answering. VISTAQA comprises 1,157 expert-curated samples spanning six task types and six visual domains, ranging from direct perception to compositional and relational reasoning. VISTAQA requires models to not only answer correctly, but to also provide precise segmentation masks that support their answers. It also includes hallucination-aware examples where no valid visual evidence exists. To support this enhanced evaluation, we introduce GROVE, a unified evaluation metric that enforces joint correctness by combining textual accuracy and grounding quality via a per-sample geometric mean, ensuring neither dimension can compensate for deficiencies in the other. Comprehensive experiments across grounding-aware models and hybrid pipelines with general-purpose MLLMs reveal that even the strongest systems achieve limited performance under GROVE, highlighting a substantial gap between answer accuracy and visual evidence alignment.

[CV-112] GSA-YOLO: A High-Efficiency Framework via Structured Sparsity and Adaptive Knowledge Distillation for Real-Time X-ray Security Inspection

链接: https://arxiv.org/abs/2605.20669
作者: Jiahao Kong
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 41 pages, 8 figures, submitted to Scientific Reports

点击查看摘要

Abstract:X-ray security inspection requires accurate real-time detection of prohibited items, but existing models often struggle to balance the challenges of severe occlusion, complex clutter, and strict speed requirements. To overcome these challenges, this paper proposes GSA-YOLO, a novel lightweight framework built upon the YOLOv8n architecture, specifically engineered to enhance detection robustness and inference efficiency. GSA-YOLO strategically integrates structured sparsity and adaptive knowledge transfer through three core components: Group Lasso (GL) applied to the network neck for robust feature extraction; Sparse Structure Selection (SSS) applied to the detection head for significant model slimming; and an Adaptive Knowledge Distillation (Ada-KD) mechanism for comprehensive accuracy recovery. This integrated approach synergistically enhances feature representation while pruning redundant channels, maximizing model efficiency without sacrificing performance. Rigorous evaluations on the HiXray and PIDray datasets confirm GSA-YOLO’s comprehensive capability, achieving a leading inference speed of 189.62 FPS, accompanied by a reduction in computational cost from 8.7G to 8.0G. Crucially, GSA-YOLO secures mAP50:95 results of 0.531 and 0.679 on HiXray and PIDray, demonstrating 2.4% and 1.8% improvements over the baseline, respectively. Compared to other models, GSA-YOLO exhibits enhanced accuracy while maintaining computational efficiency, making it a promising solution for practical X-ray security inspection.

[CV-113] LER-YOLO: Reliability-Aware Expert Routing for Misaligned RGB-Infrared UAV Detection

链接: https://arxiv.org/abs/2605.20667
作者: Liming Hou,Yueping Peng,Hexiang Hao,Ji Wang,Xuekai Zhang,Wei Tang,Zecong Ye,Xin Ying,Yubo He
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 6 figures, 8 tables

点击查看摘要

Abstract:Detecting small unmanned aerial vehicles from RGB-infrared remote-sensing pairs remains challenging due to tiny target scale, cluttered backgrounds, and spatial misalignment between heterogeneous sensors. Existing bimodal detectors often align or fuse features without assessing the reliability of local cross-sensor correspondence, allowing mismatch artifacts to propagate into the detection head. To address this issue, we propose LER-YOLO, a reliability-aware sparse mixture-of-experts framework for misaligned RGB-infrared UAV detection. LER-YOLO first introduces an Uncertainty-Aware Target Alignment module that resamples visible features toward the infrared reference and estimates a spatial reliability map. This reliability prior is then used by a Reliability-Guided Sparse MoE Fusion module to adaptively select k experts from RGB-dominant, infrared-dominant, and interactive fusion experts, enabling trustworthy cross-modal interaction while suppressing unreliable fusion. Experiments on the public MBU benchmark under a YOLOv5s-family protocol show that LER-YOLO achieves 89.7+/-0.2% AP50 over three independent seeds, with a best result of 89.9%. Extensive ablations, parameter-matched comparisons, synthetic-shift evaluations, and complexity analysis demonstrate that the gains mainly come from reliability-guided expert routing rather than increased model capacity.

[CV-114] RoPeSLR: 3D RoPE-driven Sparse-LowRank Attention for Efficient Diffusion Transformers

链接: https://arxiv.org/abs/2605.20659
作者: Yuxi Liu,Zekun Zhang,Yixiang Cai,Renjia Deng,Yutong He,Kun Yuan
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Diffusion Transformers (DiTs) have revolutionized high-fidelity video generation, yet their \mathcalO(L^2) attention complexity poses a formidable bottleneck for long-sequence synthesis. While recent sparse-linear attention hybrids aim to mitigate this, their performance severely degrades at extreme sparsity due to the “RoPE Dilemma”: standard linear attention fails to preserve the orthogonal relative-position structure of 3D Rotary Position Embeddings (RoPE), neutralizing vital distance awareness. To address this, we propose \textbfRoPeSLR, a 3D RoPE-guided Sparse-LowRank attention framework. We establish that under empirically validated assumptions, the DiT attention manifold admits a decoupling into a high-frequency semantic spike set (bounded by \mathcalO(L^3/2) sparsity) and an extreme low-rank ( \mathcalO(d_h \log L) ) background continuum. Guided by this structural prior, RoPeSLR eschews standard linear attention for a head-wise low-rank parameterization equipped with a learnable 3D Absolute Positional Embedding (PE) injection, seamlessly synthesizing long-range relative distance decay. By guaranteeing sub-quadratic sparsity and sub-linear rank growth, RoPeSLR is exceptionally suited for scaling to ultra-long video inference. Extensive evaluations validate this scalable superiority: at 90% sparsity, RoPeSLR achieves up to 10\times fewer FLOPs on Wan2.1-1.3B and delivers a 2.26\times end-to-end inference speedup on the ultra-long 100K+ token sequences of HunyuanVideo-13B, all while maintaining near-lossless generation fidelity (less than 1.3% average VBench degradation).

[CV-115] Gaze into the Details: Locality-Sensitive Enhancement for OCTA Retinal Vessel Segmentation

链接: https://arxiv.org/abs/2605.20651
作者: Tuopusen Huang,Ding Ma,Xiangqian Wu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing deep learning frameworks for Optical Coherence Tomography Angiography (OCTA) vessel segmentation are largely derived from the U-Net architecture, which serves as the foundation for most current designs. However, most of these methods focus only on holistic representation, struggling to address the problem of low local contrast unique to OCTA, which leads to vessel discontinuities and loss of detail. To address these problems, we propose LSENet, which builds upon the U-Net architecture by introducing three core innovative modules: To address vessel discontinuities, we introduce the Patch Information Enhance module (PIE), which replaces standard skip connections to execute patch-wise attention. To mitigate detail loss, the Multiscale Feature Fusion module (MFF) is proposed to feed the PIE module rich, multi-scale information by extracting visually interpretable features from both the original input and preceding layers. Finally, the Connectivity Refinement Decoder (CRD) is designed to refine features from all levels and utilize a large kernel in the final convolutional layer to reduce fragmentation. Experiments on three public datasets (OCTA-500, ROSE-1, and ROSSA) demonstrate that our proposed LSENet achieves state-of-the-art performance while requiring fewer parameters.

[CV-116] Seeing Through Fog: Towards Fog-Invariant Action Recognition

链接: https://arxiv.org/abs/2605.20645
作者: Enqi Liu,Liyuan Pan,Zhi Gao,Lingzhi Li,Qing Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Foggy conditions are commonly encountered in real-world applications; however, existing action recognition approaches typically assume favorable weather and high-quality video inputs. On foggy days, unpredictable visibility degradation and reduced contrast obstruct the extraction of semantic cues, posing significant challenges for current action recognition methods. In this paper, we mitigate the issues faced in action recognition under foggy conditions by employing two strategies. First, we present FogAct, the first benchmark dataset for foggy action recognition, consisting of paired clean and foggy videos captured with a stereo camera system. The dataset spans 10 scenes and 55 action categories, comprising nearly 10,000 video clips. Second, we propose FogNet, a two-stream CLIP model that discovers fog-invariant semantic information hidden behind the degraded videos. FogNet learns robust representations of foggy videos with guidance from clean videos, effectively capturing shared structural and motion cues between clean and foggy videos. Extensive experiments on FogAct and three other popular datasets demonstrate that our method achieves competitive performance compared with state-of-the-art (SOTA) approaches. Our FogAct and FogNet are given in our project page.

[CV-117] Pareto-Enhanced Portrait Generation: Vision-Aligned Text Supervision for Alignment Realism and Aesthetics

链接: https://arxiv.org/abs/2605.20640
作者: Yunlong Wang,Jinjin Shi,Wenbin Gao,Xuran Xu,Runyu Shi,Ying Huang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Text-to-image diffusion models often face a severe trilemma in human portrait generation: text-image alignment, photorealism, and human-perceived aesthetics inherently inhibit one another. Supervised Fine-Tuning (SFT) is an effective method for enhancing the photorealism of image generation. However, it often leads to overfitting to the training dataset, corrupts pre-trained image priors, and degrades alignment or aesthetics. To break this bottleneck, we propose a feature supervision paradigm for Multimodal Diffusion Transformers (MM-DiT). Specifically, we introduce a lightweight cross-modal alignment mechanism that implicitly extracts multi-granularity vision-aligned text representations from SigLIP 2 and applies supervision to the image branch of MM-DiT during the training stage, with zero extra inference overhead. Our method injects vision-aligned text guidance while preserving the base model’s original generalization, avoiding degradation caused by SFT. Furthermore, our method directly mines implicit multi-granularity aesthetic signals from pre-trained vision foundation models to optimize human-perceived aesthetics. Extensive experiments on MM-DiTs show that our method pushes the Pareto frontier and achieves synergistic improvements across text-image alignment, photorealism, and human-perceived aesthetics.

[CV-118] Accelerating Video Inverse Problem Solvers with Autoregressive Diffusion Models

链接: https://arxiv.org/abs/2605.20624
作者: Taesung Kwon,Jonghyun Park,Hyungjin Chung,Jong Chul Ye
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project page is available here: this https URL

点击查看摘要

Abstract:Diffusion models provide powerful priors for zero-shot video inverse problems, but their real-time deployment is hindered by two inefficiencies: high initial latency caused by holistic video restoration, and low throughput resulting from multiple VAE passes to enforce measurement consistency in pixel space. To overcome these limitations, we propose Autoregressive Video Inverse problem Solver (AVIS). The AVIS framework leverages autoregressive video diffusion models to restore videos in a streaming manner, naturally eliminating latency bottlenecks. Specifically, AVIS initializes reverse diffusion with a measurement-consistent estimate, reducing the required sampling steps. Compared to leading non-autoregressive solvers, AVIS drastically reduces initial latency from 114s to 4s and increases throughput from 0.71 to 1.18 FPS while achieving superior restoration quality. We further introduce a highly accelerated variant, dubbed AVIS Flash, that enforces measurement consistency solely on the first chunk. AVIS Flash substantially boosts throughput to 5.91 FPS on a single RTX 4090 GPU while maintaining competitive performance and achieving a favorable efficiency-performance trade-off, paving the way toward real-time deployment.

[CV-119] Beyond Routing: Characterising Expert Tuning and Representation in Vision Mixture-of-Experts

链接: https://arxiv.org/abs/2605.20610
作者: Gene Tangtartharakul,Katherine R. Storrs
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 21 Pages, 6 Main Figures, 1 Table

点击查看摘要

Abstract:Mixture-of-Experts (MoE) models are often interpreted by analysing which categories are routed to which experts. However, routing alone does not reveal what each expert actually encodes. We train sparsely-gated convolutional MoE models with a contrastive objective on natural images and characterise expert specialisation using tools from visual neuroscience. Extending from gating-level to expert-level analyses, we measure per-expert category separability, and per-expert tuning using the most exciting inputs. Extending from category-level to feature-level explanations, we interpret tuning via semantic dimensions derived from a dataset of human behavioural judgements (THINGS). Finally, we use tuning and representational similarity analysis to assess the stability of expertise-allocation across independent initialisations. We find that an animate-inanimate distinction dominates expert partitioning, apparent from gating through to expert readout, and is stable across independently trained models. Although routing statistics suggest relatively sparse, categorical preferences, expert analyses reveal broader tuning to continuous visual and semantic dimensions that extend beyond category boundaries. Experts exhibit similar category-separability to one another, despite distinct feature tuning, demonstrating the explanatory benefits of moving beyond category-level analyses. Together, these results show that expert specialisation in vision MoEs extends well beyond category routing and is better understood by probing fine-grained expert-level tuning and representational structure.

[CV-120] Mechanistic Interpretability for Learning Assurance of a Vision-Based Landing System

链接: https://arxiv.org/abs/2605.20607
作者: Romeo Valentin,Olivia Beyer Bruvik,Marc R. Schlichting,Mykel J. Kochenderfer
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 10 pages, 4 figures

点击查看摘要

Abstract:EASA’s learning-assurance guidance requires data-driven aviation systems to build and monitor their own situation representation, yet for neural networks the technical means to provide such evidence remain an open problem. We address this gap for a vision-based aircraft landing system: we propose that a minimally assurable model must at least be shown to separate content from style in its own situation representation. Showing that the model’s predictions then rely largely on the contentful representation components leads to a concrete assurance path. To demonstrate this assurance path on a concrete model we train a vision transformer model for runway keypoint regression on the LARDv2 dataset. The model, which acts as the subject for our assurance demonstration, produces per-patch embeddings that we decompose into interpretable atoms via K-SVD sparse dictionary learning. A qualitative visualization confirms that contentful atoms track task-relevant runway structure and stylistic atoms track domain-specific appearance, and the regression head is shown to place almost all of its linear weight on contentful atoms. We further build on the content/style separation and define out-of-model-scope (OOMS) detection, a novel runtime assurance approach directly monitoring the model’s situation representation. OOMS monitoring is complementary to operational design domain and output-space out-of-distribution monitoring and addresses concrete requirements of the recent EASA guidance. By directly analyzing a model’s situation representation both at test time and runtime, this work delivers the first concrete piece of the representation-level evidence that EASA learning-assurance guidance demands, and points to mechanistic interpretability as a practical building block of future aviation safety cases.

[CV-121] Mind Your Margin and Boundary: Are Your Distilled Datasets Truly Robust? ICML2026

链接: https://arxiv.org/abs/2605.20606
作者: Muquan Li,Yingyi Ma,Yihong Huang,Hang Gou,Ke Qin,Ming Li,Yuan-Fang Li,Tao He
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICML 2026

点击查看摘要

Abstract:Dataset distillation (DD) compresses a large training set into a small synthetic set for efficient training, but most DD methods optimize only clean accuracy and leave robustness uncontrolled. Recent robust DD methods improve robustness, yet they often suffer from a poor accuracy-robustness trade-off because they (i) treat all adversarially perturbed examples uniformly, despite robust risk being dominated by near-zero robust margins, and (ii) do not explicitly increase inter-class separation in the decision boundary where attacks concentrate. We present Contrastive Curriculum for Robust Dataset Distillation (C ^2 R), a framework that couples an attack-aware curriculum with a contrastive robustness objective. From a robust-margin perspective, we derive a perturbation score that approximates each sample’s robust hinge, enabling a curriculum that prioritizes the smallest-margin adversaries that most directly drive robust error. In parallel, a class-balanced contrastive robustness loss enforces adversarial invariance while explicitly widening boundary separation across classes. Experiments on CIFAR-10/100, Tiny-ImageNet, and multiple ImageNet-1K subsets under six attacks show that C ^2 R achieves the best robust accuracy, outperforming prior robust DD by 2.8 % on average.

[CV-122] Head-Aware Key-Value Compression for Efficient Autoregressive Image Generation

链接: https://arxiv.org/abs/2605.20600
作者: Guotao Liang,Baoquan Zhang,Zhiyuan Wen,Yunming Ye
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review

点击查看摘要

Abstract:Autoregressive (AR) visual generation has achieved remarkable performance but suffers from high memory usage and low throughput, as it requires caching previously generated visual tokens. Recent research has shown that retaining only a few lines of cache tokens can maintain high-quality images while significantly reducing memory usage and improving throughput. However, these methods allocate a fixed budget to each attention head, overlooking the heterogeneity among attention heads, leading to suboptimal memory allocation. In this paper, we observe that attention heads across different layers exhibit diverse attention patterns, where some heads focus on local neighborhoods while others capture broader contextual dependencies. Based on this insight, we propose a novel head-aware key-value (KV) cache compression framework for autoregressive image generation, called HeadKV, which assigns smaller budgets to locality-biased heads and larger budgets to heads with broader attention. A key challenge lies in identifying the type of each attention head to guide cache compression. We further observe that, within the same layer, each head exhibits consistent attention patterns across token positions, \emphi.e., a head’s behavior for early tokens remains consistent with that for later tokens. This insight suggests that head types can be identified during the early stage and reused for KV compression throughout generation. Its advantage is that it requires no additional training or dataset-level statistics and generalizes seamlessly across different inputs. Moreover, we design a Stratified Token Eviction strategy to effectively preserve long-range information. Extensive experiments demonstrate its effectiveness across multiple autoregressive image generation models.

[CV-123] Qwen Safe: Multimodal Content Rating Description Identification via Preference-Aligned VLMs

链接: https://arxiv.org/abs/2605.20584
作者: Dishanika Denipitiyage,Aruna Seneviratne,Suranga Seneviratne
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Mobile app marketplaces require developers to disclose standardized content rating descriptors (CRDs) to inform users about potentially sensitive or restricted content. Ensuring the accuracy and consistency of these disclosures remains challenging due to the multimodal nature of app content, which spans textual descriptions and visual interfaces. In this paper, we present QwenSafe, a Vision-Language Model (VLM) designed to automatically identify the presence of Apple-defined CRDs by jointly reasoning over app metadata and screenshots. To enable scalable training for this task, we introduce metadata2CRD, a data-construction pipeline that synthesizes descriptor-aligned question-answer pairs by combining app descriptions, screenshots, and formal descriptor definitions. We adapt Qwen3-VL-8B using supervised fine-tuning followed by Direct Preference Optimization (DPO) to align model predictions with descriptor-specific evidence and explanations across visual and textual modalities. We evaluate QwenSafe on 12 Apple-defined content rating descriptors and compare it against state-of-the-art vision-language models, including Qwen3-VL, LLaVA-1.6, and Gemini-2.5-Flash. QwenSafe consistently outperforms all baselines in binary CRD classification, achieving improvements in positive-class recall of 111.8%, 36.1%, and 2.1%, respectively. Our results demonstrate that descriptor-aware multimodal alignment substantially improves automated content classification and highlights the potential of vision-language models to support scalable and consistent content rating in mobile app marketplaces.

[CV-124] A strongly annotated passive acoustic dataset for tropical bird monitoring

链接: https://arxiv.org/abs/2605.20578
作者: Daniela Ruiz,Juan Sebastián Ulloa,Zhongqi Miao,Nicolás Betancourt,Maria Paula Toro-Gómez,Andrés Hernández,Bruno Demuro,Eliana Barona-Cortés,Angela Mendoza-Henao,Andrés Sierra-Ricaurte,Sebastián Pérez-Peña,Rahul Dodhia,Pablo Arbeláez,Juan M. Lavista Ferres
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Passive acoustic monitoring enables continuous, non-invasive biodiversity assessment across diverse ecosystems. The scale of these datasets has driven the adoption of machine learning, with supervised approaches showing strong performance. However, supervised methods require time-resolved annotated datasets, which remain scarce, especially in complex tropical soundscapes. We present PteroSet, a curated dataset of strongly annotated Neotropical bird vocalizations recorded in Puerto Asis (Putumayo) and Pivijay (Magdalena), Colombia, between 2023 and 2025. The dataset comprises 563 recordings (73.62 h) and 15,372 time-frequency annotations, including 6,702 events identified to the species level across 168 species. We release the annotations in a COCO-inspired JSON schema that unifies audio files, taxonomic categories, and labels for machine learning workflows. Beyond providing annotated data, PteroSet serves as a realistic benchmark that highlights key characteristics of tropical soundscapes, including acoustic co-occurrence and domain shift across recording sites. We provide a deep learning baseline for binary bird detection, demonstrating PteroSet’s usability and the challenges it presents.

[CV-125] Δynamics: Language-Based Representation for Inferring Rigid-Body Dynamics From Videos CVPR2026

链接: https://arxiv.org/abs/2605.20576
作者: Chia-Hsiang Kao,Cong Phuoc Huynh,Chien-Yi Wang,Noranart Vesdapunt,Stefan Stojanov,Bharath Hariharan,Oleksandr Obiednikov,Ning Zhou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026. Project page: this https URL

点击查看摘要

Abstract:Inferring rigid-body physical states and properties from monocular videos is a fundamental step toward physics-based perception and simulation. Existing approaches assume specific underlying physical systems, object types, and camera poses, making them unable to generalize to complex real-world settings. We introduce \Delta YNAMICS, a vision-language framework that uses language as a unified representation of rigid-body dynamics. Instead of directly predicting parameters, \Delta YNAMICS generates scene configurations in a structured text format for physics simulation. We enhance the model’s generalization by integrating natural language motion reasoning and leveraging optical flow as a semantic-agnostic input. On the CLEVRER dataset, \Delta YNAMICS achieves a segmentation IoU of 0.30, a 7x improvement over leading VLMs (InternVL3-8B, Qwen2.5-VL-7B and Claude-4-Sonnet). Additionally, test-time sampling and evolutionary search further boost performance by 27% and 120% in segmentation IoU, respectively. Finally, we demonstrate strong transfer to a new dataset of 235 real-world rigid-body videos, highlighting the potential of language-driven physics inference for bridging perception and simulation.

[CV-126] End-to-End Unmixing with Material Prompts for Hyperspectral Object Tracking

链接: https://arxiv.org/abs/2605.20569
作者: Xu Han,Mohammad Aminul Islam,Lei Wang,Zekun Long,Guanmanyi Fu,Wangshu Cai,Kuldip K. Paliwal,Jun Zhou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Hyperspectral imagery encodes rich material properties that can improve tracking robustness under appearance ambiguity, illumination change, and background clutter. However, due to the limited availability of hyperspectral video data, many existing methods adapt pretrained RGB trackers via spatial or channel fusion strategies, largely neglecting the intrinsic material information in hyperspectral imagery. Moreover, the few material-aware approaches typically rely on external spectral unmixing pipelines that are decoupled from the tracking objective, limiting effective optimization of material representations for target localization. To address these limitations, we formulate hyperspectral object tracking as a joint optimization problem of material decomposition and target localization, coupling the two tasks via a weighted target-oriented unmixing loss that explicitly aligns material representations with localization accuracy. Specifically, we propose a material representation decomposition module for deep learning-based spectral unmixing with adaptive frequency decomposition. Building on the decomposed material representations, we further introduce a dual-branch wavelet-enhanced material prompt module that learns low- and high-frequency material prompts through efficient spatial-material interactions in the frequency domain. The framework is model-agnostic and can be seamlessly generalized to different unmixing backbones. Extensive experiments on standard hyperspectral tracking benchmarks demonstrate state-of-the-art performance and validate the effectiveness of the proposed end-to-end material-aware tracking framework. Code is available at this https URL.

[CV-127] Faster or Stronger: Towards Flexible Visual Place Recognition via Weighted Aggregation and Token Pruning

链接: https://arxiv.org/abs/2605.20551
作者: Zichao Zeng,June Moh Goo,Junwei Zheng,Weijia Fan,Jiaming Zhang,Rainer Stiefelhagen,Jan Boehm
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Visual Place Recognition (VPR) aims to match a query image to reference images of the same place in a large-scale database. Recent state-of-the-art methods employ Vision Transformers (ViTs) as backbone foundation models to extract patch-level features that are robust to viewpoint, illumination, and seasonal variations, which are then aggregated into a compact global descriptor for retrieval. Most existing aggregation methods uniformly pool patch tokens into learned clusters, despite the fact that different clusters often encode distinct spatial or semantic patterns and contribute unequally to VPR performance. To address this limitation, we propose Weighted Aggregated Descriptor (WeiAD), which assigns weights to clusters during aggregation, producing more discriminative global representations. Beyond accuracy, retrieval latency is a critical concern for large-scale deployments and resource-constrained edge devices. Prior work mainly reduces latency by compressing global descriptors, while overlooking the cost of feature extraction, an issue exacerbated by ViT-based backbones. We therefore introduce WeiToP, a VPR-oriented token pruning framework that reduces feature extraction cost via self-distillation, where aggregation-induced token importance supervises a lightweight pruning module attached to an early transformer layer, enabling inference-time token pruning. After a single joint training phase, WeiToP enables plug-and-play token pruning at inference time, allowing flexible and on-demand control over the accuracy-efficiency trade-off without additional training. Moreover, WeiToP outperforms existing token pruning methods adapted from general vision tasks.

[CV-128] MAPS: A Synthetic Dataset for Probing Vision Models in a Controlled 3D Scene Space

链接: https://arxiv.org/abs/2605.20549
作者: Santiago Galella,Pamela Osuna-Vargas,Maren Wehrheim,Martina G. Vilas,Gemma Roig,Matthias Kaschube
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 33 pages, 20 figures

点击查看摘要

Abstract:Modern vision models achieve strong performance on standard benchmarks, yet their aggregate accuracy reveals little about which scene properties drive their predictions. Existing robustness benchmarks provide important stress tests, but typically manipulate global 2D image properties, rely on entangled real-world variation, or cover only a limited set of 3D objects and scene parameters. We introduce MAPS (Manifolds of Artificial Parametric Scenes), a scalable instrument for controlled attribution of vision model behavior to scene parameters. MAPS comprises 2,618 curated photorealistic 3D meshes validated for recognizability across 560 ImageNet classes and provides a Blender-based rendering pipeline for on-demand image generation under continuous variation of nine independent scene-factors spanning background, camera, and lighting, extensible to other factors. To showcase its applicability, we use MAPS to evaluate 20 convolutional and transformer-based models by quantifying their reliance on these scene factors through regression-based sensitivity analysis. We find a near-universal failure axis across all tested architectures: camera distance and elevation consistently dominate recognition failure regardless of ImageNet accuracy. However, the full sensitivity structure reveals that modern CNNs and transformers cluster together, distinct from older architectures, suggesting that fine-grained architectural design choices, rather than the coarse CNN-versus-transformer distinction, are the stronger determinant of sensitivity profiles.

[CV-129] he Yes-Man Syndrome: Benchmarking Abstention in Embodied Robotic Agents

链接: https://arxiv.org/abs/2605.20544
作者: Doguhan Yeke,Elif Su Temirel,Ananth Shreekumar,Brandon Lee,Dongyan Xu,Z Berkay Celik
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) are used as high-level planners for embodied agents, translating natural language instructions and visual observations into action plans. While prior work has studied abstention in LLMs, existing benchmarks are largely text-only and do not capture the perceptual grounding and physical constraints inherent to embodied robotics environments. In such settings, abstention requires recognizing when instructions are ambiguous, physically infeasible, based on false premises, or otherwise unresolvable given the available sensory modalities and context. To address this gap, we introduce a taxonomy to categorize abstention in the context of embodied robotics and present RoboAbstention, a scalable and auditable framework for generating abstention instructions grounded in images gathered from five robotics datasets. RoboAbstention instantiates the taxonomy through a three-phase pipeline: (1) structured visual grounding, (2) deterministic constraint derivation, and (3) controlled instruction generation via category-specific templates. This enables the construction of a diverse dataset with verifiable abstention conditions. We evaluate several frontier VLMs and find that all models exhibit significant weaknesses in abstention, including those with advanced reasoning capabilities. The best-performing model, Gemini 2.5 Flash, abstains on only 39.0% of our 6,069 benchmark instructions, while the embodied planner Gemini Robotics ER 1.6 Preview abstains on just 16.5%. We further explore methods for improving abstention in VLM planners, such as defensive prompting and in-context learning, and find that these interventions substantially improve performance, reaching 93.6% abstention rate for Gemini Robotics ER 1.6 Preview and 88.6% for GPT 5.4 Mini, yet no approach fully solves the problem. We open-source RoboAbstention at this https URL.

[CV-130] Uncertainty-Guided Conservative Propagation for Structured Inference in Vessel Segmentation

链接: https://arxiv.org/abs/2605.20543
作者: Huan Huang,Michele Esposito,Chen Zhao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Pattern Recognition submission. 35 pages, 6 figures

点击查看摘要

Abstract:Accurate vessel segmentation is essential for medical image analysis, yet remains challenging due to complex vascular patterns and imaging ambiguity. Most deep models rely on single-pass prediction, limiting their ability to refine uncertain or disconnected regions during inference. To address this limitation, we propose Uncertainty-Guided Conservative Propagation (UGCP), a general plug-in module for vessel segmentation. Instead of directly using a one-shot output as the final prediction, UGCP performs a small number of logit-space update steps to refine the segmentation through local predictions interaction. Predictive uncertainty guides reliable regions to support ambiguous regions, while structure-aware modulation and source-based stabilization reduce unreliable propagation and excessive drift. The module is differentiable and can be trained end-to-end with different segmentation networks. We evaluate UGCP on four public vessel segmentation datasets covering 2D and 3D tasks, including retinal vessel, coronary artery, and cerebral vessel segmentation. Experiments with convolutional neural network-based and Transformer-based backbones show consistent improvements in Dice similarity coefficient, centerline Dice, and 95th percentile Hausdorff distance. Further analysis demonstrates that UGCP reduces vessel disconnections and improves structural consistency with limited additional computation. The code will be made available at this https URL.

[CV-131] Continual Segmentation under Joint Nonstationarity

链接: https://arxiv.org/abs/2605.20538
作者: Prashant Pandey,Himanshu Kumar,Devineni Sri Venkatraya Chowdary,Brejesh Lall
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Evolving data streams induce joint nonstationarity in continual semantic segmentation, where semantic classes, input distributions, and supervision availability change simultaneously over time. This setting reflects practical structured prediction systems, yet remains largely unexplored in prior continual learning work, which typically studies these factors in isolation. We formalize continual segmentation under coupled class, domain, and label shifts and investigate learning in heterogeneous dense prediction environments with limited annotations and abundant unlabeled data. To address instability and overfitting arising from few-shot supervision under distribution drift, we introduce gradient-adaptive stabilization, a parameter-wise regularization mechanism implemented via gradient-scaled stochastic perturbations that promotes a principled stability-plasticity tradeoff. We further leverage unlabeled data through semi-supervised learning and introduce prototype anchored supervision that validates pseudo-labels via joint confidence and prototype consistency. Together, these mechanisms enable learning under joint nonstationarity in continual segmentation. Extensive empirical evaluation across class-incremental, domain-incremental, and few-shot regimes demonstrates consistent improvements over prior methods in heterogeneous structured prediction settings. Our results expose fundamental failure modes of existing continual segmentation approaches and provide insight into learning robust dense predictors in dynamically evolving environments.

[CV-132] HADS-Net:A Hybrid Attention-Augmented Dual-Stream Network with Physics-Informed Augmentation for Breast Ultrasound Image Classification

链接: https://arxiv.org/abs/2605.20536
作者: Chinedu Emmanuel Mbonu,Blessing Nwamaka Iduh,Joseph Ikechukwu Odo,Doris Chinedu Asogwa
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 4 figures

点击查看摘要

Abstract:Accurate classification of breast ultrasound images into benign, malignant, and normal categories is a critical clinical task complicated by speckle noise, acoustic shadowing, and inter-class visual ambiguity. Existing deep learning methods rely on single-stream architectures with generic augmentation that ignores ultrasound acquisition physics, and no prior method dedicates a stream to the lesion boundary features identified as the most diagnostically significant visual cue. We propose HADS-Net, a Hybrid Attention-Augmented Dual-Stream Network exploiting global texture and local boundary cues through two parallel pathways. Stream 1 applies physics-informed augmentation simulating speckle noise, acoustic shadowing, and gain variation before extracting features via pretrained EfficientNet-B3 projected to 512 dimensions. Stream 2 extracts Sobel edge maps processed by a lightweight CNN projected to the same 512-dimensional space. A cross-attention fusion module allows the texture stream to selectively query boundary features, producing a jointly optimised representation classified by an MLP trained with adaptive class-weighted focal loss. Five-fold stratified cross-validation with cosine annealing over 50 epochs is used, with the globally best checkpoint selected by lowest validation loss evaluated on a held-out test set. On the BUSI dataset, HADS-Net achieves 96.58% accuracy, macro ROC-AUC of 0.9978, macro F1 of 0.9654, and per-class F1-scores of 0.970, 0.951, and 0.976 for benign, malignant, and normal. No malignant lesion is misclassified as normal. These results confirm that modality-specific augmentation with cross-modal attention fusion is an effective strategy for ultrasound-based breast cancer diagnosis.

[CV-133] ShadeBench: A Benchmark Dataset for Building Shade Simulation in Sustainable Society KDD2026

链接: https://arxiv.org/abs/2605.20510
作者: Longchao Da,Mithun Shivakoti,Xiangrui Liu,T Pranav Kutralingam,Yezhou Yang,Hua Wei
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 12 pages, 13 figures, 2 tables. Accepted by KDD 2026 AI for Sciences Track

点击查看摘要

Abstract:Urban heat exposure is becoming an increasingly critical challenge due to the intensifying urban heat island effect. Fine-grained shade patterns, especially those induced by urban buildings, strongly influence pedestrians’ thermal exposure and outdoor activity planning. However, accurately modeling and analyzing urban shade at scale remains difficult because of the lack of large-scale datasets and systematic evaluation frameworks. To address this challenge, we present ShadeBench, a comprehensive dataset and benchmark for urban shade understanding. ShadeBench contains geographically diverse urban scenes with temporally varying simulated shade maps and textual descriptions, together with aligned satellite imagery, building skeleton representations, and 3D building meshes. Built upon this multimodal dataset, ShadeBench supports a range of downstream tasks, including shade generation, shade segmentation, and 3D building reconstruction. We further establish standardized evaluation protocols and baseline methods for these tasks. By enabling scalable and fine-grained shade analysis, ShadeBench provides a foundation for data-driven urban climate research and supports future studies in heat-resilient urban planning and decision-making. The code and dataset are publicly available at this https URL.

[CV-134] ppett-minimum Fusion of Representation-space Diffusion Models for Multi-Encoder Out-of-Distribution Detection

链接: https://arxiv.org/abs/2605.20502
作者: Neelkamal Bhuyan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP); Machine Learning (stat.ML)
备注: 14 pages

点击查看摘要

Abstract:We address out-of-distribution (OOD) detection across the full spectrum of distribution shifts – global domain changes, semantic divergence, texture differences, and covariate corruptions – through a multi-encoder fusion of per-encoder representation-space diffusion models (RDMs). We statistically identify each encoder’s sensitivity to specific shift types from ID data alone and introduce EncMin2L – an encoder-agnostic two-level \min(\cdot) -gate that combines and calibrates per-encoder diffusion-based likelihood detectors without OOD labels, outperforming monolithic multi-encoder baselines at 2.3\times lower parameter cost. Two ID-data diagnostics: \eta^2 (class-conditional F-test) and \Delta\mu (log-likelihood shift under synthetic corruptions) – quantify encoder specialization, while a Tippett minimum p -value combination aggregates per-encoder scores into a single, calibration-stable OOD signal. EncMin2L achieves \geq 0.94 AUROC across all four shift types simultaneously, outperforming the state-of-the-art representation-space diffusion OOD detectors across overlapping benchmarks.

[CV-135] A Human-in-the-Loop Framework for Efficient Prompt Selection in Microscopy Vision-Language Models CVPR

链接: https://arxiv.org/abs/2605.20495
作者: Abhiram Kandiyana,Ankur Mali,Lawrence O. Hall,Peter R. Mouton,Dmitry Goldgof
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR workshops, 2026

点击查看摘要

Abstract:Deep-learning pipelines for microscopy image classification often require expensive, labor- and time-intensive expert annotation to produce high-quality ground truth for training. Recent work has shown that prompt tuning of vision-language models (VLMs) can reduce manual annotation by constructing a small prompt set of expert-verified image-caption exemplars that is reused as few-shot context to classify all remaining images at inference time. To further reduce effort, the VLM can draft captions for candidate exemplars, which experts then verify and lightly edit instead of writing text de novo. However, two practical questions remain unaddressed: (1) which unlabeled images should be prioritized for verification, and (2) how many verified exemplars are needed to reach a performance target. In this work, we address these questions by formulating prompt-set construction as a target-driven active learning problem that prioritizes which images to annotate. We study three complementary selection criteria under strict low-resource constraints with small unlabeled pools. Experiments show that our methods reach the target performance with substantially fewer expert-verified images than random selection, achieving 100% test accuracy with as few as 20 annotated images on average. More broadly, our human-in-the-loop framework demonstrates a human-centered use of generative AI in biomedical image analysis, where experts remain actively involved in verifying and refining model output while significantly reducing annotation cost. Code and data will be publicly available.

[CV-136] Oracle Supervision Transfers for Hyperparameter Prediction in Model-Based Image Denoising

链接: https://arxiv.org/abs/2605.20479
作者: Jianmin Liao,Lixin Shen,Yuesheng Xu
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Hyperparameter prediction is a critical practical bottleneck for model-based image denoisers, ranging from classical TV/TGV variational solvers to modern diffusion-based models such as DiffPIR. While existing learned predictors can achieve near-oracle performance, this approach scales poorly: each new configuration conventionally requires its own oracle-labeled training set, and each label requires a hierarchical grid search evaluated against clean ground truth. We therefore ask whether oracle supervision collected on source configurations can transfer to target configurations with few or no target oracle labels. We propose HyperDn, a single configuration-conditioned predictor that pools oracle supervision across source configurations and predicts heterogeneous hyperparameters for new denoiser–noise configurations. In a cross-paradigm experiment, HyperDn transfers from relatively cheap TV/TGV variational sources to more expensive diffusion-based DiffPIR. With only 2 target oracle labels, it reaches 30.23 ,dB, within 0.90 ,dB of the oracle, and outperforms the 64 -label per-configuration predictor trained from scratch, using 1/32 as many target labels as that baseline point. Without any target oracle labels, HyperDn also reaches near-oracle PSNR on two unseen mixtures of seen noise types and on transfer from relatively cheap 96\times 96 source images to 512\times 768 targets. Together, these results show that expensive oracle supervision for hyperparameter prediction can be transferred from source to new target configurations, reducing the need to rebuild oracle labels for each new denoising configuration.

[CV-137] Goodbye Drift: Anchored Tree Sampling for Long-Horizon Video-to-Video Generation

链接: https://arxiv.org/abs/2605.20476
作者: Matthew Bendel,Stephen W. Bailey,Mithilesh Vaidya,Sumukh Badam,Xingzhe He
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 30 pages, 23 figures

点击查看摘要

Abstract:Long-horizon video generation suffers from two intertwined issues. First, there is drift, where video quality degrades over time. Second, there are continuity issues which manifest as object permanence issues, or improperly rendering transient content (e.g., an object that appears in non-consecutive frames changing color/style). Recent work has focused on autoregressive distillation techniques that attack both problems simultaneously. We instead choose to focus on drift directly and introduce \textbfAnchored Tree Sampling (ATS): a training-free inference-time scheduler that replaces left-to-right rollout with sparse-to-dense, anchor-bounded imputation organized as a tree. A root call produces sparse anchors over the full horizon, recursive refinement generates intermediate anchors, and final leaf spans are synthesized between neighboring anchors. This reduces the critical path from K sequential rollout steps to L+1 tree-hierarchical steps and converts horizon-compounding drift into anchor-bounded drift. We focus on V2V generation in the \emphstatic-camera regime, where sparse anchors over the horizon are well approximated by the dense conditioning signal, and the base model can produce them without retraining. We evaluate ATS against two contemporary autoregressive baselines on Wan 2.1 + VACE, across five conditioning modalities (inpainting, outpainting, edge, pose, depth). We show that ATS outperforms both competitors in overall quality, as well as in drift prevention. We additionally demonstrate stable \geq 40 -minute generation on LTX- 2.3 across the same five modalities. We conclude by proposing a path forward to extend ATS to arbitrarily long T2V generation, as well as the dynamic-camera and multi-shot regimes.

[CV-138] EPC-3D-Diff: Equivariant Physics Consistent Conditional 3D Latent Diffusion for CBCT to CT Synthesis

链接: https://arxiv.org/abs/2605.20470
作者: Alzahra Altalib,Chunhui Li,Haytham Al Ewaidat,Khaled Alawneh,Ahmad Qendel,Alessandro Perelli
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)
备注: 10 pages, 4 figures

点击查看摘要

Abstract:Cone-beam CT (CBCT) is routinely acquired during radiotherapy for patient setup, but its quantitative reliability is degraded by scatter, noise, and reconstruction artifacts, limiting Hounsfield Unit (HU) accuracy. We propose EPC-3D-Diff, a novel conditional 3D latent diffusion framework for volumetric CBCT to CT synthesis that introduces a projection domain equivariance loss derived from acquisition physics. Unlike common image domain equivariance, we exploit the fact that an in plane rotation of the volume corresponds to an angular shift in its projections. During training, we enforce this relationship by forward projecting rotated synthesized CT volumes and matching them to appropriately angle shifted projections of the paired target CT, yielding a physics consistent equivariance constraint integrated into the diffusion objective. To capture full 3D context efficiently, conditional diffusion is performed in a compact latent space learnt by a lightweight 3D autoencoder, preserving axial depth while downsampling in plane resolution for stable training. We validate on a paired head CBCT/CT phantom dataset, including repeat scans, and paired clinical data using patient wise splits, and perform single and mixed domain training, ablations, and comparisons with diffusion and CycleGAN. EPC-3D-Diff generalizes well and achieved substantial improvements, +7.4 dB (phantom) and +1.8 dB (clinical data) in PSNR compared to state of the art methods, alongside improved SSIM and HU accuracy, within tissue boundaries. Overall, EPC-3D-Diff improves robustness and physics consistency, supporting HU aware synthesis for downstream radiotherapy workflows.

[CV-139] HalluCXR: Benchmarking and Mitigating Hallucinations in Medical Vision-Language Models for Chest Radiograph Interpretation

链接: https://arxiv.org/abs/2605.20469
作者: Haoyu Wang,Zitong Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) are increasingly used for medical image interpretation, yet they frequently hallucinate, generating clinically plausible but factually incorrect findings that pose direct patient safety risks. We introduce HalluCXR, a benchmark evaluating six architecturally diverse VLMs across 856 stratified MIMIC-CXR chest radiographs and three query types, yielding 15,408 model evaluations. An eight-category hallucination taxonomy with clinical severity ratings and a two-layer detection pipeline are validated against 250 human annotations (auto-detection F1=0.959; LLM judge F1=0.907). We find that 61.9–82.3% of outputs contain hallucinations, with clinically dangerous errors in up to 80.2%. Three key patterns emerge: normal radiographs paradoxically attract the most severe hallucinations, common findings are systematically over-fabricated while rare findings go under-detected, and response length alone predicts hallucination risk (AUC up to 0.908). A six-model ensemble reduces fabrication by up to 84.8% at the cost of increased omission; a three-model subset retains comparable performance at half the cost. These results establish that hallucination auditing, verbosity-based risk monitoring, and ensemble-based safety layers are prerequisites for clinical deployment.

[CV-140] Understanding Model Behavior in Monocular Polyp Sizing

链接: https://arxiv.org/abs/2605.20461
作者: Xinqi Xiong,Andrea Dunn Beltran,Junmyeong Choi,Sarah K. McGill,Marc Niethammer,Roni Sengupta
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate polyp size stratification guides surveillance decisions, with lesions larger than 5 mm typically requiring closer follow-up. However, monocular colonoscopy lacks a reliable metric reference. We present a diagnostic audit of binary polyp size classification (=5 mm vs. 5 mm) across multiple public multi-center datasets, model families, and patient-stratified cross-validation. Across architectures and input modalities, including RGB appearance, relative depth, and photometry, model performance is moderately consistent, suggesting reliance on cues correlated with examination behavior rather than true metric scales. By providing ground-truth scale at varying granularities, we quantify the potential improvement from perfect scale information and show that current depth estimation and global calibration offer limited gains. We further demonstrate that segmentation errors under distribution shift eliminate most of this potential, with oracle scale under predicted masks recovering only baseline performance. These results highlight metric scale and mask robustness as two independent bottlenecks and provide reusable evaluation tools such as oracle scale ladders, shortcut partitions, and mask substitution for auditing future polyp sizing pipelines. Our code is publicly accessible at this https URL.

[CV-141] HyperBones: Realtime Bone-driven Neural Garment Simulation with Hypernetwork Conditioning

链接: https://arxiv.org/abs/2605.20460
作者: Astitva Srivastava,Hsiao-Yu Chen,Ryan Goldade,Philipp Herholz,Zhongshi Jiang,Gene Wei-Chin Lin,Lingchen Yang,Nikolaos Sarafianos,Tuur Stuyck,Doug Roble,Avinash Sharma,Egor Larionov
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in garment simulation have brought high-quality results closer to real-time performance. Physics-based simulators can produce accurate motion, but remain too computationally expensive for interactive applications. In contrast, linear blend skinning is efficient, but cannot capture the complex dynamics of loose-fitting garments, often leading to unrealistic motion and visual artifacts. Neural methods offer a promising alternative, yet they still struggle to animate loose clothing plausibly under strict runtime constraints. We present a fast and physically plausible approach for dynamic garment simulation. Our method trains a reduced-space neural dynamics simulator composed of independent coarse- and fine-level components. At the coarse level, the garment is driven by a set of virtual bones integrated with a lightweight neural network. Fine-scale wrinkle details are then recovered using a trained convolutional neural map. By decoupling identity-specific computation from real-time neural integration, our architecture maintains high performance while supporting diverse body shapes and motions. We further introduce an effective physics-supervision scheme that enables accurate results without relying on an external simulator. Experiments show that our method produces physically plausible garment dynamics, generalizes across a range of motions and body shapes, and supports a fixed set of garments. Our simulator runs at 300+ FPS on a commodity GPU, making it suitable for real-time applications.

[CV-142] Pixel Wised Lesion Prediction on COVID-19 CT Imagery: A Comparative Analysis of Automated Image Segmentation Architectures

链接: https://arxiv.org/abs/2605.20459
作者: Sarmad Khan,Arslan Shaukat,Umer Asgher,Basim Azam
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 7 pages, 6 figures, 4 tables

点击查看摘要

Abstract:In recent years, there has been a notable increase in the level of attention that is given to algorithms based on deep learning in the context of medical image segmentation. Nevertheless, the reliability of the field has been hindered due to the absence of a standardized methodology for performance analysis and the utilization of different datasets in previous research. The primary objective of the research is to comprehensively evaluate contemporary segmentation frameworks combined with state-of-the-art pre-trained backbones in order to accurately predict COVID-19 lesions in CT images. Moreover, this evaluation can serve as a point of reference for the segmentation of images in various other imaging scenarios. In order to accomplish this, we integrate four distinct deep learning architectures, namely Unet, PSPNet, Linknet, and FPN, with six pre-trained encoders, including VGG 19, DenseNet 121, Inception ResNet V2, MobileNet V2, SeresNet 101, and EfficientNet B0. This approach enables the development of diverse testing architectures. In the context of image segmentation, our research encompassed both binary and multi-class experimentation. The findings derived from our analysis of three distinct COVID-19 CT segmentation datasets indicate that deep learning architectures yield precise and efficient segmentation outcomes. Significantly, a maximum F1-Score of 98% was attained for binary class segmentation, while multi-class segmentation yielded F1-Scores of 75% and 77% across two separate datasets. The utilization of artificial intelligence and deep learning enhances the diagnostic process for pandemic diseases across multiple dimensions.

[CV-143] ELEMENT: Multi-Modal Retinal Vessel Segmentation Based on a Coupled Region Growing and Machine Learning Approach

链接: https://arxiv.org/abs/2605.20458
作者: Erick O. Rodrigues,Aura Conci,Panos Liatsis
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vascular structures in the retina contain important information for the detection and analysis of ocular diseases, including age-related macular degeneration, diabetic retinopathy and glaucoma. Commonly used modalities in diagnosis of these diseases are fundus photography, scanning laser ophthalmoscope (SLO) and fluorescein angiography (FA). Typically, retinal vessel segmentation is carried out either manually or interactively, which makes it time consuming and prone to human errors. In this research, we propose a new multi-modal framework for vessel segmentation called ELEMENT (vEsseL sEgmentation using Machine lEarning and coNnecTivity). This framework consists of feature extraction and pixel-based classification using region growing and machine learning. The proposed features capture complementary evidence based on grey level and vessel connectivity properties. The latter information is seamlessly propagated through the pixels at the classification phase. ELEMENT reduces inconsistencies and speeds up the segmentation throughput. We analyze and compare the performance of the proposed approach against state-of-the-art vessel segmentation algorithms in three major groups of experiments, for each of the ocular modalities. Our method produced higher overall performance, with an overall accuracy of 97.40%, compared to 25 of the 26 state-of-the-art approaches, including six works based on deep learning, evaluated on the widely known DRIVE fundus image dataset. In the case of the STARE, CHASE-DB, VAMPIRE FA, IOSTAR SLO and RC-SLO datasets, the proposed framework outperformed all of the state-of-the-art methods with accuracies of 98.27%, 97.78%, 98.34%, 98.04% and 98.35%, respectively.

[CV-144] Do Vision–Language Models Understand 3D Scenes or Just Catalogue Objects?

链接: https://arxiv.org/abs/2605.20448
作者: Animesh Maheshwari,Divyansh Sahu,Nishit Verma
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Vision–language models reliably name objects in a scene, but do they represent the 3D layout those objects inhabit? We introduce a 3,034-sample human-curated benchmark targeting three components of spatial understanding: depth-ordered occlusion (probed via three independent counterfactual operationalisations), optical-geometry inference over visible reflections, and volumetric rearrangement planning. Six frontier and open-weight VLMs, scored by trained annotators on 18,204 responses with no LLM-as-judge, reveal a sharp dissociation: models that plan rearrangements over visible layouts at 53–97% accuracy and rarely violate collision constraints fall to 6–45% on occlusion and below 7% on reflections. An embodied-reasoning model reproduces the same profile. White-box analysis on Qwen3-VL-8B-Thinking localises the failure to the visual-token merger: spatial information recoverable throughout the vision encoder becomes inaccessible after token compression and only stabilises again when clean post-merger activations are patched into the language decoder.

[CV-145] A Comprehensive Comparison of Deep Learning Architectures for COVID-19 Classification on CT X-ray Imagery

链接: https://arxiv.org/abs/2605.20445
作者: Sarmad Khan,Arslan Shaukat,Umer Asgher,Basim Azam
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages, 2 figures, 5 tables

点击查看摘要

Abstract:COVID-19 was a significant challenge that led to the loss of numerous lives daily. Not only a certain country was involved in this outbreak, but even the world has suffered because of the coronavirus. Imaging techniques using computed tomography (CT) and X-rays of the lungs are the most useful tools for the COVID-19 or any other pandemic disease screening process. Technology today has revolutionized the world by using artificial intelligence to replace manual processes with automated machines, which enable the system to imitate the human brain by making wise decisions based on experience. Motivated by this, our work proposes to use convolutional neural networks (CNN) based models for designing a computer-aided diagnosis (CAD) system that differentiates between COVID-19 and healthy lung pictures. We used two different sets of X-ray images of the lungs in addition to two different sets of CT scans and the classification is done using a variety of networks that have been pre-trained such as VGG (16, 19), Densenet (121), Resnet (50, 50 V2, 101 V2), Mobile net (V2), Xception Inception (V3, Resnet V2), Efficient net (B0) and Nasnet (Large). On the X-ray and CT image datasets, Resnet and VGG architecture have shown the ability to properly differentiate COVID-19 from normal images, with an average accuracy of 95 to 98 percent respectively. Our acquired results on the classification datasets are competitive and superior to previously reported findings in the literature.

[CV-146] Lighting-aware Unified Model for Instance Segmentation

链接: https://arxiv.org/abs/2605.20436
作者: Qisai Liu,Alloy Das,Zhanhong Jiang,Joshua R. Waite,Aditya Balu,Adarsh Krishnamurthy,Soumik Sarkar
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Foundation models like the Segment Anything Model (SAM) demonstrate impressive zero-shot generalization but frequently degrade under diverse real-world illumination, particularly for instance segmentation. In this work, we address this limitation by developing \textitLighting Convolutional-Attention (\lca), an adapter module that enhances segmentation robustness without fine-tuning the heavy backbone. \lca employs a dual-branch architecture to process RGB features alongside contrast maps, enabling physically motivated sensitivity to structural changes rather than illumination artifacts. We optimize \lca through a pairwise training strategy, introducing a targeted loss term that explicitly penalizes discrepancies between clean images and their corresponding illumination variants. To evaluate and support this architecture, we conduct a comprehensive empirical study across multiple existing benchmarks and present a novel Unity-based synthetic dataset specifically designed to accurately replicate complex real-world lighting conditions. Extensive experimental results demonstrate that our approach successfully bridges the domain gap, delivering superior lighting-robust segmentation.

[CV-147] STELLAR: Scaling 3D Perception Large Models for Autonomous Driving

链接: https://arxiv.org/abs/2605.20390
作者: Yingwei Li,Xin Huang,Yang Liu,Yang Fu,Alex Zihao Zhu,Chen Song,Junwen Yao,Anant Subramanian,Hao Xiang,Weijing Shi,Yuliang Zou,Tom Hoddes,Zhaoqi Leng,Govind Thattai,Dragomir Anguelov,Mingxing Tan
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Model scaling has demonstrated remarkable success through large-scale training on diverse datasets. It remains an open question whether the same paradigm would apply to autonomous driving perception systems due to unique challenges, such as fusing heterogeneous sensor data and the need for sophisticated 3D spatial understanding. To bridge this gap, we present a comprehensive study on systematically analyzing the impact of scale on these systems. We develop our STELLAR model based on Sparse Window Transformer, by extending the input modalities to include LiDAR, radar, camera, and map prior. We train the model on a large-scale dataset of 50 million driving examples with up to 500 million parameters. Our large-scale experiments reveal empirical scaling trends that connect model performance to model size, data, and compute. The resulting model establishes a new state-of-the-art on the Waymo Open Dataset challenge, outperforming prior arts by a large margin. Our work demonstrates that large-scale training is a highly promising path for advancing the capabilities of perception models for autonomous driving.

[CV-148] How You Move Tells What Youll Do: Trajectory-Conditioned Egocentric Prediction

链接: https://arxiv.org/abs/2605.20388
作者: Sejoon Jun,Hai Nguyen-Truong,Luigi Seminara,Lorenzo Torresani
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Predicting how a person’s first-person view will evolve (what action will follow, what plan completes a task, whether an in-progress shot will score) is fundamentally under-specified: the same context admits many plausible futures, and a model trained to minimize prediction error is forced to hedge or average across them, getting it wrong either way. Two findings shape our approach. First, the future camera trajectory, the path the head carves through space, lets the model commit to one of those futures: it carries the operator’s intent in a form fine enough to determine how an action will unfold, substantially outperforming language as a conditioning signal. Second, this same intent makes the trajectory itself partially predictable from the context at hand, enough that trajectory need not be observed at test time to recover most of the gain. We instantiate these findings as TrajPilot, a model that predicts candidate future trajectories from egocentric context and uses them to pilot action prediction in an action-aligned embedding space where language shapes the structure but is never used as a conditioning input. TrajPilot beats VLM and structured-planner baselines on procedural planning across Ego-Exo4D atomic, Ego-Exo4D Keystep, Ego4D GoalStep, and EgoPER, with the trajectory advantage widening with horizon (exactly where prior planners collapse) and holding under RGB-only camera-pose estimation. With the goal masked at inference, the same model performs goal-free anticipation, beating VLM baselines on Ego-Exo4D atomic and extending to EPIC-Kitchens-100 and basketball shot-outcome prediction.

[CV-149] ConceptSeg-R1: Segment Any Concept via Meta-Reinforcement Learning

链接: https://arxiv.org/abs/2605.20385
作者: Yuan Zhao,Youwei Pang,Jiaming Zuo,Wei Ji,Kailai Zhou,Bin Fan,Yunkang Cao,Lihe Zhang,Xiaofeng Liu,Huchuan Lu,Weisi Lin,Dacheng Tao,Xiaoqi Zhao
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent progress in promptable segmentation has shifted visual perception from object-level localization toward concept-level understanding. However, the notion of a concept remains under-specified, making it unclear whether current methods truly generalize beyond category recognition. In this work, we formalize generalized concept segmentation through a three-level taxonomy consisting of context-independent (CI), context-dependent (CD), and context-reasoning (CR) concepts, which reveals a clear capability gap across increasing levels of cognitive complexity. To address this challenge, we propose ConceptSeg-R1, a unified framework that reformulates concept segmentation as rule-induced concept grounding. At the core of our method is Meta-GRPO, a meta-reinforcement learning mechanism that learns transferable task rules from visual demonstrations and verifies them through proxy reasoning. The inferred reasoning states are then translated into segmentation-ready concept prompts via a lightweight concept translation module, enabling deductive application to target images. A shortcut routing strategy further preserves the native efficiency of segmentation models on simple cases. To systematically evaluate generalized concept segmentation, we conduct extensive experiments across diverse CI, CD, and CR concept segmentation benchmarks spanning natural, industrial, medical and reasoning-intensive domains. Without bells and whistles, ConceptSeg-R1 achieves strong performance across the full concept hierarchy while maintaining the native capability of promptable segmentation backbones. As an initial step toward segmenting any concept, we hope ConceptSeg-R1 can serve as a practical baseline for advancing segmentation from object-level prediction toward concept-level understanding.

[CV-150] SUGAR: A Scalable Human-Video-Driven Generalizable Humanoid Loco-Manipulation Learning Framework

链接: https://arxiv.org/abs/2605.20373
作者: Tianshu Wu,Xiangqi Kong,Yue Chen,Qize Yu,Hang Ye,Jia Li,Yizhou Wang,Hao Dong
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Building humanoid robots capable of generalizable whole-body loco-manipulation in the real world remains a fundamental challenge. Existing methods either rely on laborious task-specific reward engineering, rigidly replay reference motions that fail to generalize, or depend on costly teleoperation that limits scalability. While human videos capture diverse human behaviors, motion priors inferred from them are inherently imperfect, suffering from occlusion, contact artifacts, and retargeting errors that render them unsuitable for direct policy learning. To address this, we present SUGAR, a scalable data-driven framework that converts diverse human videos into deployable humanoid loco-manipulation skills, without any task-specific reward engineering or reference-motion conditioning at inference. SUGAR proceeds in three stages. First, a fully automated pipeline extracts kinematic interaction priors including human-object motion trajectories and contact labels from unstructured human videos. Second, a privileged physics-based refiner uses a unified mimic reward and progressive state pool to transform imperfect priors into physically feasible, high-fidelity skills. Third, refined skills are distilled into a hierarchical autonomous policy consisting of a command generator and a command tracker. We evaluate SUGAR on six representative loco-manipulation tasks in simulation and real-world humanoid hardware. Our method substantially outperforms reference-tracking baselines, and performance scales clearly with the amount of human video data. It also achieves zero-shot real-world transfer with reliable closed-loop execution, autonomous failure recovery, and stable long-horizon performance under external perturbations. Project Page: this https URL

[CV-151] Latent Space Guided Scenario Sampling for Multimodal Segmentation Under Missing Modalities

链接: https://arxiv.org/abs/2605.20372
作者: Irem Ulku,Ö. Özgür Tanrıöver,Erdem Akagündüz
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 14 pages, 4 figures, 9 tables

点击查看摘要

Abstract:Multimodal semantic segmentation benefits remote sensing analysis by combining complementary information from different sensor modalities. In real-world remote sensing applications, one or more modalities may be unavailable due to sensor failures, adverse atmospheric conditions, or data acquisition problems. Even with pretrained multimodal representations and existing fine-tuning or adaptation strategies, performance may remain limited because all modality availability scenarios are typically treated as equally informative during training. In this paper, we propose a novel training strategy that learns a scenario sampling distribution directly from the pretrained latent space. Instead of relying on uniform random modality dropout, the proposed method guides fine-tuning toward more informative modality availability scenarios. More specifically, we quantify the effect of each scenario independently based on the distortion it induces in the shared latent representation. We then capture scenario relations using a radial basis function kernel and derive refined scenario scores through a regularized kernel smoothing. These scores are then converted into a probability distribution during scenario sampling for fine-tuning. We evaluate this strategy on three remote sensing image sets, namely DSTL, Potsdam, and Hunan, using CBC-SLP, CBC, and CMX backbones. The experimental results with different image sets and backbones show that our method outperforms standard fine-tuning and LoRA-based adaptation. These findings suggest that the pretrained latent representation can serve as an effective basis for sampling during missing modality fine-tuning. Code is available at this https URL

[CV-152] HAPS: Rethinking Image Similarity for Virtual Staining

链接: https://arxiv.org/abs/2605.20362
作者: Fedor Gubanov,Svetlana Illarionova,Vlad Kozlovskiy,Mikhail Romanov,Yersultan Akhmetov,Aida Akaeva,Vyacheslav Grinevich,Rifat Hamoudi,Maxim Sharaev
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 3 figures

点击查看摘要

Abstract:Virtual staining of histopathology images (e.g., HE-IHC) is an emerging tool in digital pathology, enabling faster and cheaper workflows by synthesizing target stains from routinely acquired slides. Yet, the quality of virtual staining models is still predominantly assessed with generic metrics such as SSIM, PSNR, and LPIPS. Originally developed for natural images, these metrics are inherently misaligned with the domain-specific characteristics of histological data, failing to capture tissue morphology preservation and biomarker expression patterns. Consequently, a robust, domain-specific standard for quantifying similarity across diverse histological modalities remains a critical gap in the field. In this work, we formalize histology image similarity as a standalone problem and systematically evaluate a broad set of full-reference metrics against a dataset of HE-IHC patch pairs annotated with expert similarity scores. We further analyze metrics sensitivity to controlled geometric distortions (shifts, rotations and non-rigid deformations) that mimic realistic registration errors between serial sections. Guided by these observations, we propose the Histology-Aware Perceptual Similarity (HAPS) metric. HAPS computes distances in the feature space of a frozen encoder pretrained on histopathology data, adding a linear head to aggregate feature-level differences into a final score that aligns with expert assessments. Finally, we demonstrate the practical value of HAPS for quality control of training data. By quantifying the similarity of training pairs in the MIST dataset and filtering low-scoring samples, we create a cleaner training set. Virtual staining models trained on this refined data outperform those trained on the original, unfiltered dataset.

[CV-153] ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agent ic Video Reinforcement Learning

链接: https://arxiv.org/abs/2605.20342
作者: Zuhao Yang,Kaichen Zhang,Sudong Wang,Keming Wu,Zhongyu Yang,Bo Li,Xiaojuan Qi,Shijian Lu,Xingxuan Li,Lidong Bing
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Training large multimodal models (LMMs) via reinforcement learning (RL) to natively invoke video-processing tools (e.g., cropping) has become a promising route to long-video understanding. However, existing native-RL methods dispatch tool calls sequentially (i.e., one per turn): a single wrong crop propagates errors without peer correction, multi-turn tool calls corrupt context, and inference cost scales linearly with the number of turns. We introduce ParaVT, the first multi-agent end-to-end RL-trained framework for Parallel Video Tool calling, dispatching multiple time-window crops in a single turn for cleaner context and better fault tolerance. Yet applying standard RL to ParaVT reveals an obstacle we term the Tool Prior Paradox: the pretrained tool priors that enable tool exploration also destabilize cold-started structural format and expose the skip-tool reward shortcut under temperature sampling. A cross-model contrast on a weaker-prior LMM supports this claim: format stays stable but RL elicits zero tool calls, indicating that prior strength is the shared driver of both format collapse and tool exploration. We propose PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO), which augments standard RL with two complementary mechanisms: (i) a targeted format reward applied only at the structural-token positions most prone to collapse, and (ii) a per-prompt frame-budget randomization that creates training prompts where calling the tool yields a measurable reward signal over skipping it. Across six long-video understanding benchmarks, ParaVT improves over the Qwen3-VL baseline by +7.9% on average, with PARA-GRPO lifting training-time format compliance from 0.13 to 0.64. As tool capabilities become increasingly internalized in modern LMMs, RL must cooperate with the resulting priors, and ParaVT offers a general recipe for agentic RL. Code, data, and model weights are publicly available.

[CV-154] Capability neq Interpretability: Human Interpretability of Vision Foundation Models

链接: https://arxiv.org/abs/2605.20337
作者: Julien Colin,Lore Goetschalckx,Nuria Oliver,Thomas Serre
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:How interpretable are the features of leading vision models? The question is increasingly pressing as these models move from research benchmarks into high-stakes deployments, yet existing methods cannot answer it reliably. We close this gap with a framework for measuring and comparing the human interpretability of vision models, built around two complementary psychophysics protocols: (1) localizability – can an observer predict where a feature fires on a novel image? – and (2) nameability – can an observer accurately describe what the feature represents? Features are recovered via sparse autoencoders, and a chance-anchored scoring function places every model on a common scale. Applying the framework to six vision transformers – two supervised ViTs and four foundation models (DINOv2, DINOv3, CLIP, SigLIP) – we collected more than 15,000 behavioral responses, analyzing the 13,400 responses from the 377 participants who passed our pre-specified quality checks. Foundation models are consistently less interpretable than their supervised counterparts, and the gap is not a capability tradeoff: interpretability does not correlate with downstream task performance on any benchmark we examine. What does correlate is the locality of a feature’s activations and coarse-grained semantic alignment with humans – models with focal activations and representations that reflect the world’s broad categorical structure produce more interpretable features, whereas fine-grained perceptual alignment does not. The two protocols yield strongly correlated rankings and share the same predictors, establishing interpretability as an independent, measurable dimension of representation quality – and, surprisingly, one on which every foundation model we tested falls below the supervised baselines that came before. Capability alone cannot close that gap; locality and coarse-grained alignment can.

[CV-155] FullFlow: Upgrading Text-to-Image Flow Matching Models for Bidirectional Vision–Language Generation

链接: https://arxiv.org/abs/2605.20316
作者: Eric Tillmann Bill,Enis Simsar,Alessio Tonioni,Thomas Hofmann
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: project page: this https URL

点击查看摘要

Abstract:Modern text-to-image diffusion models encode rich visual priors, but expose them only through one-way text-conditioned generation. Existing unified vision–language models derived from them recover bidirectional capability through large-scale joint pretraining or substantial retraining of the text pathway, discarding the strong image prior the text-to-image backbone already encodes. We introduce \emphFullFlow, a parameter-efficient recipe that upgrades a pretrained rectified-flow text-to-image model into a bidirectional vision–language generator by training only LoRA adapters and lightweight text heads. FullFlow keeps images in their native continuous flow and adds a discrete insertion process for text. Separate image and text timesteps turn inference into trajectory selection in a two-dimensional generative space, enabling text \rightarrow image, image \rightarrow text, joint sampling, and partial-text prediction with a single backbone. On Stable Diffusion 3 (SD3) under an identical trainable-parameter count and matched LoRA rank, FullFlow improves text \rightarrow image FID from 62.7 to 31.6 and image \rightarrow text CIDEr from 2.0 to 99.4 over a LoRA equivalent following the previous SOTA formulation (Dual Diffusion) at matched wall-clock training time, while reducing peak VRAM from \sim84 ,GB to \sim38 ,GB and raising throughput by \sim8\times on two RTX A5000 GPUs in under 24 hours, training only \sim5% of the backbone parameters. The same recipe transfers to FLUX.1-dev and supports downstream VQA through partial-text generation. These results show that strong bidirectional vision–language capability can be unlocked from pretrained text-to-image flow models without full multimodal pretraining.

[CV-156] ny-Engram: Trigger-Indexed Concept Tables for Generative Vision

链接: https://arxiv.org/abs/2605.20309
作者: Runyuan Cai,Yiming Wang,Yu Lin,Xiaodong Zeng
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current personalization methods for generative vision models typically encode new concepts through continuous adapters or weight updates, yet provide limited control over whether and when a concept should be retrieved. In this work, we introduce Tiny-Engram, a compact trigger-indexed concept table that gives visual memories an explicit lexical address and activation boundary inside frozen image and video generators. Tiny-Engram parameterizes each concept as a small set of memory entries indexed by registered n-gram matches, which modulate text-encoder hidden states only within the matched trigger region. Outside this lexical support, the conditioning pathway is identical to that of the frozen base model. Across both single-encoder latent diffusion and multi-encoder diffusion-transformer backbones, this formulation binds a rare trigger phrase to a target identity while preserving compositional control from the surrounding prompt. We further evaluate the same table-based memory in a text-conditioned video generation setting, where the trigger path reliably alters the generated subject but fine-grained identity persistence across held-out video prompts remains limited. Taken together, these results suggest that small, explicitly addressed concept tables are a practical route to modular visual personalization, with strongest evidence in image generation. For video diffusion, the remaining gap points to a broader requirement: temporally stable identity likely depends on tighter coupling between text-side memory and the evolving visual state, motivating future work on memory injection beyond the text-conditioning interface.

[CV-157] SDM: A Powerful Tool for Evaluating Model Robustness

链接: https://arxiv.org/abs/2605.20308
作者: Xinlei Liu,Tao Hu,Jichao Xie,Peng Yi,Hailong Ma,Baolin Li
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages

点击查看摘要

Abstract:Gradient-based attacks are important methods for evaluating model robustness. However, since the proposal of APGD, it has been difficult for such methods to achieve significant breakthroughs. To achieve such an effect, we first analyze the issue of “high-loss non-adversarial examples” that degrades attack performance in previous methods, and prove that this issue arises from inappropriate objectives for adversarial example generation. Subsequently, we reconstruct the objective as “maximizing the difference between the non-ground-truth label probability upper bound and the ground-truth label probability”, and proposes a novel and powerful gradient-based attack method named Sequential Difference Maximization (SDM). SDM establishes a three-layer optimization framework of “cycle-stage-step”. It adopts the negative probability loss function and the Directional Probability Difference Ratio (DPDR) loss function in the initial and subsequent optimization stages, respectively, and approaches the ideal objective of adversarial example generation via stage-wise sequential optimization. Experiments demonstrate that compared with previous state-of-the-art methods, SDM not only achieves stronger attack performance but also exhibits superior cost-effectiveness. The code is available at this https URL.

[CV-158] WildRoadBench: A Wild Aerial Road-Damage Grounding Benchmark for Vision-Language Models and Autonomous Agents

链接: https://arxiv.org/abs/2605.20306
作者: Bingnan Liu,Chenhang Cui,Rui Huang,Jiani Luo,Zhirong Shen,Tinghao Wang,Xiande Huang,Lingbei Meng,Fei Shen,An Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Preprint. Under review. 4 figures, 6 tables

点击查看摘要

Abstract:We introduce WildRoadBench, a wild aerial road-damage grounding benchmark that couples direct visual grounding by vision-language models with autonomous research-and-engineering by LLM-driven agents on a single professionally annotated UAV corpus. The same image set and the same per-class AP_50 metric are evaluated under two protocols. The VLM Track measures whether a fixed VLM can localise domain-specific damage from one image and one short prompt under a unified prompting, decoding and parsing pipeline. The Agent Track measures whether an autonomous agent, given only a written task brief, a small exploratory slice and a fixed interaction budget, can search the public web, adapt pretrained components, write training and inference code, and submit predictions through a scalar-feedback oracle on a hidden holdout. We benchmark a broad pool of closed-source frontier models and open-source VLMs together with several frontier LLM-driven agents. Both routes remain far from reliable performance in this wild setting: closed-source frontier models lead the VLM leaderboard but still leave more than half of the metric on the table; open-source grounders plateau well below them, and newer generations or reasoning-style variants do not consistently improve grounding; small targets collapse for every open-source model; agents lag the strongest VLM despite richer affordances, and several fail to land a valid submission within the budget. We release the code and data at this https URL to support reproducible follow-up research.

[CV-159] Neural Collapse by Design: Learning Class Prototypes on the Hypersphere ICML2026

链接: https://arxiv.org/abs/2605.20302
作者: Panagiotis Koromilas,Theodoros Giannakopoulos,Mihalis A. Nicolaou,Yannis Panagakis
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 43rd International Conference on Machine Learning (ICML 2026); Code: this https URL

点击查看摘要

Abstract:Supervised classification has a theoretical optimum, Neural Collapse (NC), yet neither of its two dominant paradigms reaches it in practice. Cross entropy (CE) leaves radial degrees of freedom unconstrained and converges to a degenerate geometry, while supervised contrastive learning (SCL) drives features toward NC during pretraining but discards this structure in a post hoc linear probing phase. We show that both paradigms are different appearances of the same method, prototype contrast on the unit hypersphere, and that closing the gap requires fixing each at its specific point of failure. From the CE side, we propose NTCE and NONL, two normalized losses that import contrastive optimization’s missing ingredients into classifier learning: a large effective negative set and decoupled alignment and uniformity terms. From the SCL side, we prove that SCL’s objective already optimizes throughout training for a principled classifier whose weights are the class mean embeddings, making linear probing both redundant and harmful. Empirically, on four benchmarks including ImageNet-1K, NTCE and NONL surpass CE accuracy, closely approximate NC ( \geq 95% ), and match CE’s converged NC on 4/5 metrics in under 7.5% of its iterations, while SCL with fixed prototypes matches linear probing without the hours-long classifier training phase. The learned geometry yields +5.5% mean relative improvement in transfer learning, up to +8.7% under severe class imbalance, and lower mCE on ImageNet-C, recasting supervised learning as prototype learning on the hypersphere, with NC reached by design on both paths.

[CV-160] Co-Fusion4D: Spatio-temporal Collaborative Fusion for Robust 3D Object Detection

链接: https://arxiv.org/abs/2605.20301
作者: Wenxuan Li,Qin Zou,Shoubing Chen,Chi Chen,Yingyi Yang,Shoubing Chen,Qingxiang Meng
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In autonomous driving, 3D object detection is essential for accurate perception and reliable decision-making. However, object motion and ego-motion often induce cross-frame spatiotemporal inconsistencies in BEV-based detectors, leading to temporal BEV feature misalignment and degraded spatiotemporal consistency. To address these challenges, we propose Co-Fusion4D, a unified framework that explicitly preserves cross-frame spatiotemporal consistency and suppresses temporal feature drift. Co-Fusion4D adopts a current-frame-centric strategy, treating the current frame as the primary source of information while selectively incorporating historical frames after spatiotemporal filtering and alignment. This dominant-complementary mechanism effectively mitigates cumulative alignment errors, suppresses noisy feature propagation, and exploits reliable temporal cues for a more consistent BEV representation. In addition, Co-Fusion4D integrates a Dual Attention Fusion (DAF) module to further enhance spatiotemporal feature interaction. DAF jointly leverages intra-frame spatial attention and inter-frame temporal attention to adaptively align and fuse multi-frame features, emphasizing motion-consistent regions while suppressing spurious correlations. By departing from conventional uniform fusion paradigms, this design substantially improves the temporal stability and discriminative capability of BEV representations. Extensive experiments on the nuScenes benchmark demonstrate that Co-Fusion4D achieves state-of-the-art performance, with 74.9% mAP and 75.6% NDS, without relying on test-time augmentation or external data. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.20301 [cs.CV] (or arXiv:2605.20301v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2605.20301 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Wenxuan Li [view email] [v1] Tue, 19 May 2026 12:36:09 UTC (1,231 KB)

[CV-161] MedCRP-CL: Continual Medical Image Segmentation via Bayesian Nonparametric Semantic Modality Discovery ICML2026

链接: https://arxiv.org/abs/2605.20297
作者: Ziyuan Gao
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by ICML 2026

点击查看摘要

Abstract:Medical image segmentation faces a fundamental challenge in continual learning: data arrives sequentially from heterogeneous sources, yet effective continual learning requires discovering which tasks share sufficient structure to benefit from joint learning. Existing methods either apply uniform constraints across all tasks, causing catastrophic forgetting when tasks conflict, or require predefined task groupings that cannot anticipate future task diversity. We introduce MedCRP-CL, a framework that performs online task structure discovery and structure-aware continual learning. Leveraging the Chinese Restaurant Process (CRP), our method dynamically infers task groupings from clinical text prompts as tasks arrive, without requiring predefined cluster counts or access to future tasks. We term these discovered groupings semantic modalities, as they capture finer-grained structure than physical imaging modalities by integrating anatomical region and pathological context. Guided by this discovered structure, we maintain semantic modality-specific LoRA adapters regularized by intra-modality EWC, ensuring parameter isolation across dissimilar task groups while facilitating knowledge transfer within similar ones. The framework is also replay-free, storing only aggregate statistics rather than raw patient data. Experiments on 16 medical segmentation tasks across four imaging modalities demonstrate that MedCRP-CL achieves 73.3% Dice score with only 4.1% forgetting, outperforming the best baseline by 8.0% while requiring 6 \times fewer parameters. Code is available at this https URL.

[CV-162] Physics: Physics-Grounded Multi-Object Scene Generation from a Single Image with Real-Time Interaction

链接: https://arxiv.org/abs/2605.20290
作者: Xin Zhang,Yabo Chen,Yijie Fang,Wanying Qu,Haibin Huang,Chi Zhang,Feng Xu,Xuelong Li
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent generative video models achieve impressive visual quality but remain constrained by limited physical consistency and controllability. Existing video generation methods provide minimal physical control, and single-image-to-3D conversion approaches often suffer from object interpenetration. Furthermore, physics-based scene-level 3D generation methods exhibit spatial misalignment, stylized artifacts, and inconsistencies with the input data, restricting their use in realistic interactive video synthesis. We propose TelePhysics, a training-free framework that converts a single image into a physically consistent and controllable video through holistic scene-level 3D reconstruction. By representing the full scene geometry in a unified spatial coordinate system, TelePhysics resolves object penetration and alignment ambiguity. Unlike prior methods, this formulation enables accurate scenelevel multi-object interactions and introduces richer, complex control types for advanced mechanicsbased manipulation. By decoupling simulation from rendering, TelePhysics bypasses latency-heavy priors, achieving real-time physical interaction previews paired while preserving photorealistic visual fidelity. Experimental results demonstrate that TelePhysics substantially outperforms prior methods in physical fidelity, spatial coherence, and controllability. The open-source code is available at this https URL.

[CV-163] FusionCell: Cross-Attentive Fusion of Layout Geometry and Netlist Topology for Standard-Cell Performance Prediction

链接: https://arxiv.org/abs/2605.20287
作者: Haoyi Zhang,Kairong Guo,Bojie Zhang,Yibo Lin,Runsheng Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Standard cells form the building blocks of digital circuits, so their delay and power critically influence chip-level performance; yet characterization still relies on slow simulation sweeps, and many fast predictors ignore layout geometry, missing coupling and layout-dependent effects. The challenge is to jointly represent layout geometry and netlist topology so models capture fine-grained spatial details together with structural connectivity for accurate performance prediction. We introduce FusionCell, a dual-modality predictor that treats routed layout geometry and netlist topology as inputs and fuses them explicitly in a unified model. A DeiT encoder processes three-layer routed layouts, while a graph transformer models heterogeneous device/net graphs. The modalities are integrated through a topology-guided mechanism, where the netlist acts as a structural “map” to actively query relevant physical regions in the layout for joint geometric and topological reasoning. We build a 7nm dataset based on the ASAP7 PDK with over 19.5k cells spanning 149 types using automatic tools, targeting six metrics: signal rise/fall delay, transition, and power. Experimental results demonstrate that FusionCell reduces regression error, with an average MAPE of 0.92 percent, and improves Spearman/Kendall ranking over baselines, while accelerating the characterization process by orders of magnitude compared to circuit simulation.

[CV-164] JUDO: A Juxtaposed Domain-Oriented Multimodal Reason er for Industrial Anomaly QA ICLR2026

链接: https://arxiv.org/abs/2605.20284
作者: Hyunju Kang,Woohyun Lee,Jaewon Kim,Hogun Park
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Published at ICLR 2026

点击查看摘要

Abstract:Industrial anomaly detection has been significantly advanced by Large Multimodal Models (LMMs), enabling diverse human instructions beyond detection, particularly through visually grounded reasoning for better image understanding. However, LMMs lack domain-specific knowledge, which limits their ability to generate accurate responses in complex industrial scenarios. In this work, we present JUDO, Juxtaposed Domain-Oriented Multimodal Reasoner, a framework that efficiently incorporates domain knowledge and context in visual and textual reasoning. Through visual reasoning, our model segments the defect region by juxtaposing query images with normal images as visual domain context, enabling a fine-grained visual comparative inspection. Furthermore, we inject domain knowledge through supervised fine-tuning (SFT) to enhance context understanding and subsequently guide domain reasoning through reinforcement learning (GRPO) with tailored rewards, opting for a domain-oriented reasoning process. Experimental results demonstrate that JUDO achieves superior performance on the MMAD benchmark, surpassing models such as Qwen2.5-VL-7B and GPT-4o. These results highlight the importance of enhancing domain knowledge and context for effective reasoning in anomaly understanding.

[CV-165] Can Vision Models Truly Forget? Mirag e: Representation-Level Certification of Visual Unlearning

链接: https://arxiv.org/abs/2605.20282
作者: Zhenyu Yu,Yangchen Zeng,Chunlei Meng,Guangzhen Yao,Shuigeng Zhou
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Machine unlearning in Vertical Federated Learning (VFL) has attracted growing interest, yet existing methods certify forgetting solely using output-level metrics. We challenge these claims by introducing Mirage, a representation-level auditing framework comprising four complementary diagnostics: Linear Probe Recovery (LPR), Centered Kernel Alignment (CKA), Feature Separability Scoring, and Layer-Wise Recovery Analysis. Through experiments across seven datasets and seven baseline methods following recent VFL unlearning protocols, Mirage reveals three key findings: (i) Forgetting gap: methods that pass output-level certification still retain substantial class structure in their representations, with LPR exceeding the retrained baseline by up to 15.4 points; CKA shows these models remain structurally closer to the original than to the retrained reference, while separability scores indicate persistent geometric discrimination. (ii) Unlearning trilemma: no existing method simultaneously achieves high utility, output-level forgetting, and representation-level forgetting. (iii) Class-sample asymmetry: class-level forgetting leaves strong representational traces (LPR up to 97%), whereas sample-level forgetting is indistinguishable from chance (LPR approx. 50%); layer-wise analysis further shows residual class information persists across network depths. These findings call for representation-aware evaluation standards in federated unlearning research.

[CV-166] ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison

链接: https://arxiv.org/abs/2605.20278
作者: Tianle Li,Xuyang Shen,Yan Ma,Rongxin Guo,Shaoxiang Chen,Jiacheng Chen,Haochen Wang,Hongyang Tang,Yucong Zhou,Yu Cheng
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Long-form image captioning exposes a reward granularity problem in RL: captions are judged as whole sequences, while the important errors occur at the level of individual visual claims. A good dense caption should be both faithful and informative, avoiding hallucination without omitting salient details. Yet pairwise preferences, reference-based metrics, and holistic scalar rewards compress these local errors into a single sequence-level signal, obscuring the tradeoff between factuality and coverage. We introduce ClaimDiff-RL, a framework that uses reference-conditioned atomic claim differences as the reward unit for caption RL. Given an image, an actor caption, and a reference caption, a multimodal judge enumerates visually grounded differences, verifies each difference against the image, assigns open-vocabulary error types and severity levels, and produces per-difference statistics for reward composition. This makes hallucinated claims and omitted salient facts separately measurable and tunable. Experiments show that holistic scalar rewards can reduce hallucination by increasing missing facts, while ClaimDiff-RL exposes this faithfulness and coverage tradeoff and enables more balanced operating points. On a 160-image human-labeled diagnostic benchmark, public captioning benchmarks, and VQA benchmarks, ClaimDiff-RL improves the hallucination–missing-fact balance, preserves general capability, and even surpasses Gemini-3-Pro-Preview on several fine-grained Capability dimensions such as object counting, spatial relations, and scene recognition. These results suggest that typed, verifiable claim differences are an effective reward unit for fine-grained and diagnosable caption RL.

[CV-167] Regulating Anatomy-Aware Rewards via Trajectory-Integral Feedback for Volumetric Computed Tomography Analysis

链接: https://arxiv.org/abs/2605.20277
作者: Tianwei Lin,Zhongwei Qiu,Jie Cao,Jiang Liu,Wenjie Yan,Bo Zhang,Yu Zhong,Wenqiao Zhang,Yingda Xia,Ling Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Medical vision-language models (VLMs) have rapidly advanced as general-purpose multimodal assistants, yet their deployment in 3D Computed Tomography (CT) analysis remains constrained by a persistent mismatch between optimization objectives and clinical rigor. Current Reinforcement Learning (RL) paradigms still rely on lexical proxy signals that induce \textitEvaluation Hallucinations'', where models optimize linguistic fluency rather than factual clinical correctness, leading to diagnostically critical errors. To bridge this gap, we introduce the \textbfClinical Abnormality Benchmarking Substrate (CABS), a structured system that decomposes radiology reports into verifiable clinical semantic units. Using CABS, we identify a \textitMechanistic Divergence’’ in standard RL, where surface-similarity rewards drive policy gradients to bypass medical facts. We therefore propose \textbfTrajectory-Integral Feedback GRPO (TIF-GRPO), a novel framework integrating control-theoretic principles into policy optimization. By formulating clinical reasoning as a pseudo-temporal trajectory for anomaly discovery, TIF-GRPO regulates anatomy-aware rewards via an integral feedback loop that penalizes persistent omissions as cumulative state errors and suppresses hallucinations as excessive control effort. Experiments on 3D CT benchmarks demonstrate that our approach significantly enhances abnormality detection and clinical faithfulness, establishing a new paradigm for fine-grained regulation in medical VLMs. Our project is available at \hrefthis https URLGitHub.

[CV-168] You Dont Need Attention: Gated Convolutional Modeling for Watch-Based Fall Detection

链接: https://arxiv.org/abs/2605.20275
作者: Sana Alamgeer,Ronish Kumar,Awatif Yasmin,Muhammad Irshad,Anne H. H. Ngu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing deep learning approaches for wearable fall detection systems rely on self-attention mechanisms that impose quadratic computational overhead, distributing weights across all time steps. This global weight distribution impairs the precise localization of the brief impact signatures that characterize falls within short, fixed-length windows. To overcome this challenge, we propose Gated-CNN, a lightweight dual-stream architecture that processes accelerometer and gyroscope streams through independent one-dimensional convolutional feature extractors, followed by (i) a sigmoid gating module that selectively suppresses uninformative background activations while amplifying fall-discriminative features, (ii) a global average pooling layer that compresses each stream into a compact fixed-length descriptor, and (iii) a shared classification head that fuses both descriptors for binary fall prediction. For offline evaluation, we evaluate the model across five wrist-mounted inertial measurement unit (IMU) datasets, achieving average F1-scores of 93%, 93%, 90%, 91%, and 90% on SmartFallMM, WEDA-Fall, FallAllD, UMAFall, and UP-Fall, outperforming Transformer baselines. For real-time evaluation, we deployed the model on a Google Pixel Watch 3 and tested across 12 participants. The model achieves an average F1-score of 97% and an accuracy of 98% with zero missed falls, showing that sigmoid gating offers a more structurally aligned and computationally efficient alternative to attention for commodity smartwatch-based fall detection.

[CV-169] Generation of Heterogeneous PET Images from Uniform Organ Activity Maps Using a Pretrained Domain-Adapted Diffusion Model

链接: https://arxiv.org/abs/2605.20267
作者: Suya Li,Kaushik Dutta,Debojyoti Pal,Jingqin Luo,Kooresh I. Shoghi
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 18 pages, 7 figures

点击查看摘要

Abstract:Synthetic PET images are valuable for quantitative imaging workflow development, scalable virtual imaging trials, and deep learning model training, but conventional physics-based simulation approaches are computationally intensive, limited in anatomical variability, and often fail to capture heterogeneous PET uptake. This study developed a pretrained domain-adapted diffusion (PAD) model for anatomy-conditioned PET synthesis from uniform organ activity maps. PAD adopts a natural-image pretrained text-to-image decoder with an upstream conditioning encoder and a downstream PET-domain adapter. A two-phase training strategy was used, with the first phase learning coarse uptake distributions and the second refining local image details. Uniform organ activity maps were generated from CT-based segmentations by assigning each organ its mean uptake from the paired PET image. Evaluation included quantitative accuracy, noise assessment, radiomic analysis, tumor segmentation performance, and a human observer study. PAD-generated images achieved high quantitative accuracy, with concordance correlation coefficients above 0.92 between organ mean SUVs and assigned activity values. The synthesized images showed noise levels and texture characteristics similar to target PET images and produced comparable tumor segmentation performance. In a two-alternative forced-choice observer study, four readers achieved approximately 50% accuracy, indicating visual indistinguishability between synthesized and target images. PAD also generated realistic PET images from XCAT-derived activity maps, demonstrating compatibility with phantom-based anatomical priors. Overall, PAD provides a diffusion-based framework for generating clinically relevant heterogeneous PET images from uniform organ activity maps derived from clinical segmentations or digital phantoms, supporting data augmentation and downstream imaging studies.

[CV-170] AnimeAdapter: Fine-grained and Consistent Zero-shot Anime Character Generation

链接: https://arxiv.org/abs/2605.20237
作者: Yixuan Han
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present a lightweight appearance adapter for Stable Diffusion that enables controllable and consistent anime character generation under diverse editing conditions. Instead of relying on large-scale vision-language models or per-subject fine-tuning, our method injects fine-grained visual features from a single reference image into the diffusion process. Based on CLIP emergent local spatialization, we develop semantic-selective local attention. To further disentangle character appearance from spatial layout, we incorporate pose-aware conditioning during adapter training. The resulting pretrained adapter remains compact, modular, and fully compatible with Stable Diffusion community workflows, while requiring no additional fine-tuning at deployment time. Furthermore, we present a high-quality anime character dataset based on curated and restructured Danbooru prompts, and evaluate our method across several practical character editing scenarios. Our code, model weights, and dataset will be publicly released upon acceptance.

[CV-171] AI-Assisted Competency Assessment from Egocentric Video in Simulation-Based Nursing Education CVPR

链接: https://arxiv.org/abs/2605.20233
作者: Hanchen David Wang,Yilin Liu,Madison J. Lee,Surya Chand Rayala,Gautam Biswas,Daniel T. Levin,Meiyi Ma
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at CVPR Workshop

点击查看摘要

Abstract:Assessing learner competency in clinical simulation requires expert observation that is time-intensive, difficult to scale, and subject to inter-rater variability. Vision-language models have emerged as a promising tool for understanding complex visual behavior. In this work, we investigate whether visual observations can provide educationally meaningful signals for competency assessment through a three-stage framework that (1) extracts action timelines from egocentric nursing simulation video using frozen visual encoders and few-shot learning, (2) derives sequence-level features and per-session recognition metrics, and (3) relates these to instructor-rated competency. Across 22 densely annotated sessions (3.8 hours, 493 actions), a frozen DINOv2 backbone with HMM Viterbi decoding achieves 57.4% MOF in leave-one-out 1-shot recognition. Surprisingly, we observe a negative trend between recognition accuracy and competency (rho = -0.524, p = 0.012 for mIoU), robust to six confound controls: more competent students produce diverse, harder-to-classify workflows, while simple sequence features show no such relationship. Per-item analysis identifies patient safety protocols and team communication as the expected behaviors most reflected in this pattern, and process model comparisons reveal that higher-competency students exhibit more protocol-consistent action transitions. These findings suggest that recognition accuracy may complement predicted action timelines as a pedagogically informative signal in automated competency assessment.

[CV-172] Why Latent Actions Fail and How to Prevent It

链接: https://arxiv.org/abs/2605.20223
作者: Jung Min Lee,Taehyun Cho,Li Zhao,Jungwoo Lee
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Latent action models (LAMs) aim to learn action-like representations from unlabeled videos by compressing frame-to-frame changes. The frames of in-the-wild videos, however, contain not only the agent’s own state but exogenous state such as background clutter. Since the exogenous state introduces changes unrelated to actions, it hinders reliable latent action learning. This paper investigates this problem analytically by extending a linear LAM framework to explicitly model exogenous state. Our analysis reveals two insights: (1) minimizing the standard reconstruction objective produces latent actions that encode exogenous information from future observation; and (2) learning in a representation space that focuses on endogenous components is a key to mitigating the interference of noise. We further show that previously proposed auxiliary objectives, such as action-supervision, provably encourage latent actions to be consistent across exogenous states. These findings are validated through experiments on both linear and nonlinear LAMs, providing a unified theoretical analysis of how exogenous state hinders latent action learning and why common remedies work.

[CV-173] Leverag ing Vision-Language Models to Detect Attention in Educational Videos

链接: https://arxiv.org/abs/2605.20211
作者: Gabriel Becquet(LIP6, CNRS, SU),Sébastien Lallé(CNRS, LIP6, SU),Vanda Luengo(LIP6, CNRS, SU),Ali Abou-Hassan(SU, CNRS, PHENIX, IUF)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Educational videos are a cornerstone of remote and blended learning. However, learners’ fluctuating attention remains a significant barrier to effective information retention. Prior research has attempted to mitigate this by detecting and reacting to attention loss at runtime using eye tracking. Such detection has been based so far on classical machine learning classifiers trained on engineered features, such as summary statistics over learners’ fixations and saccades. These methods have struggled to capture the complex, temporal nature of learner engagement, thus exhibiting moderate prediction performance. In this study, we aim to advance the detection of attention by shifting from standard engineered features to a multimodal foundation models. Using an educational eye-tracking dataset (N = 70), we investigate a novel methodology that utilizes a Vision-Language Model (VLM) to analyze video content directly with superimposed gaze data. This approach aims to leverage the semantic reasoning capabilities of foundation models to contextualize learner focus within the video stream. We evaluate the performance of this VLM-based approach using several prompting strategies with Gemini 3, but ultimately found that none of them could outperform statistical baselines. Our results provide new insights into the limitations of using VLMs for real-time educational diagnostics.

[CV-174] Local-sensitive connectivity filter (ls-cf): A post-processing unsupervised improvement of the frangi hessian and vesselness filters for multimodal vessel segmentation

链接: https://arxiv.org/abs/2605.21251
作者: Erick O Rodrigues,Lucas O Rodrigues,João HP Machado,Dalcimar Casanova,Marcelo Teixeira,Jeferson T Oliva,Giovani Bernardes,Panos Liatsis
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:A retinal vessel analysis is a procedure that can be used as an assessment of risks to the eye. This work proposes an unsupervised multimodal approach that improves the response of the Frangi filter, enabling automatic vessel segmentation. We propose a filter that computes pixel-level vessel continuity while introducing a local tolerance heuristic to fill in vessel discontinuities produced by the Frangi response. This proposal, called the local-sensitive connectivity filter (LS-CF), is compared against a naive connectivity filter to the baseline thresholded Frangi filter response and to the naive connectivity filter response in combination with the morphological closing and to the current approaches in the literature. The proposal was able to achieve competitive results in a variety of multimodal datasets. It was robust enough to outperform all the state-of-the-art approaches in the literature for the OSIRIX angiographic dataset in terms of accuracy and 4 out of 5 works in the case of the IOSTAR dataset while also outperforming several works in the case of the DRIVE and STARE datasets and 6 out of 10 in the CHASE-DB dataset. For the CHASE-DB, it also outperformed all the state-of-the-art unsupervised methods.

[CV-175] Platonic Representations in the Human Brain: Unsupervised Recovery of Universal Geometry

链接: https://arxiv.org/abs/2605.20496
作者: Pablo Marcos-Manchón,Rishi Jha,Lluís Fuentemilla
类目: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)
备注: Code available at this https URL

点击查看摘要

Abstract:The Strong Platonic Representation Hypothesis suggests that representational convergence in artificial neural networks can be harnessed constructively: embeddings can be translated across models through a universal latent space without paired data. We ask whether an analogous geometry can be recovered across human brains. Using fMRI data from the Natural Scenes Dataset, we propose a self-supervised encoder that learns subject-specific embeddings from brain data alone by exploiting repeated stimulus presentations. We show that these independently learned spaces can be translated across subjects using unsupervised orthogonal rotations, without paired cross-subject samples or intermediate model representations. Synchronizing pairwise rotations into a single shared latent space further improves cross-subject retrieval, indicating that subject-specific spaces are mutually compatible with a common coordinate system. These results provide evidence for a shared neural geometry in the human visual cortex: subject-specific fMRI representations are approximately isometric across individuals and can be translated through purely geometric transformations.

[CV-176] Disentangling Sampling from Training Budget in Class-Imbalanced CT Body Composition Segmentation

链接: https://arxiv.org/abs/2605.20405
作者: Iason Skylitsis,Dimitrios Karkalousos,Ivana Išgum
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注:

点击查看摘要

Abstract:Class imbalance is a fundamental challenge in medical image segmentation, where frequent classes typically dominate training at the expense of rare classes. Loss-based approaches mitigate imbalance by reweighting the per-pixel loss within the batch, while sampling strategies control which images enter the batch. Yet neither explicitly controls which classes appear within the batch, leaving rare-class exposure only partially rebalanced. In this work, we adopt episodic sampling from few-shot learning to promote class-balanced batch construction in a fully supervised setting. We decouple episodic sampling from its conventional metric-learning context and evaluate it in body composition segmentation in CT. We compare episodic sampling against random and weighted sampling on nine muscle and adipose tissues, derived from 210 scans of the public SAROS dataset. Training is performed under full- and low-data regimes, with additional comparisons under matched training iteration budgets. Under full-data training, all three strategies performed comparably (mean Dice 0.882 for episodic, 0.878 for random and weighted). Under low-data training, episodic sampling outperformed random and weighted (0.787 vs. 0.758 and 0.762), driven by a 12-fold difference in training iterations. Under matched training budgets, random and weighted overfit earlier, while episodic improved for approximately three times more iterations before plateauing. Our findings identify the training iteration budget as under-recognized confound in sampling strategies, motivating iteration-aware evaluation protocols for small datasets. Furthermore, the residual advantage of episodic sampling is consistent with an implicit regularization effect of class-balanced batches, offering a low-cost, model-agnostic strategy for class-imbalanced medical image segmentation. Code is available at this https URL.

人工智能

[AI-0] Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate

链接: https://arxiv.org/abs/2605.21486
作者: Dayal Singh Kalra,Maissam Barkeshli
机构: 未知
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 10+28 pages, 5+17 figures

点击查看摘要

Abstract:Hyperparameter transfer allows extrapolating optimal optimization hyperparameters from small to large scales, making it critical for training large language models (LLMs). This is done either by fitting a scaling law to the hyperparameters or by a judicious choice of parameterization, such as Maximal Update ( \mu P), that renders optimal hyperparameters approximately scale invariant. In this paper, we first develop a framework to quantify hyperparameter transfer through three metrics: (1) the quality of the scaling law fit, (2) the robustness to extrapolation errors, and (3) the asymptotic loss penalty due to choice of parameterization. Next, we investigate through a comprehensive series of ablations why \mu P appears to offer high-quality learning rate transfer relative to standard parameterization (SP), as existing theory is inadequate. We find that the overwhelming benefit of \mu P relative to SP when training with AdamW arises simply from maximizing the learning rate of the embedding layer. In SP, the embedding layer learning rate acts as a bottleneck that induces training instabilities; increasing it by a factor of width to match \mu P dramatically smooths out training while improving hyperparameter transfer. We also find that weight decay improves the scaling law fits, while, in the fixed token-per-parameter setting, it hurts the robustness of the extrapolation.

[AI-1] DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

链接: https://arxiv.org/abs/2605.21482
作者: Sixiong Xie,Zhuofan Shi,Haiyang Shen,Jiuzheng Wang,Siqi Zhong,Mugeng Liu,Chongyang Pan,Peilun Jia,Baoqing Sun,Xiang Jing,Yun Ma
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Work in Progress. 27 pages, 10 figures, 4 tables. Project page: this https URL

点击查看摘要

Abstract:Deep research, in which an agent searches the open web, collects evidence, and derives an answer through extended reasoning, is a prominent use case for frontier language models. Frontier deep research products score high on existing benchmarks, making it difficult to distinguish their capabilities from current evaluation data alone. We introduce DeepWeb-Bench, a deep research benchmark that is substantially harder than existing benchmarks for the current frontier. Difficulty comes from three properties of the data itself: each task requires massive evidence collection, cross-source reconciliation, and long-horizon multi-step derivation. We represent these three sources of difficulty as four capability families (Retrieval, Derivation, Reasoning, and Calibration) and report results sliced by family. Every reference answer is accompanied by a source-provenance record with four disclosure levels and cross-source checks where available, making scores easier to audit against the underlying evidence. We evaluate DeepWeb-Bench on nine frontier models and report three findings: (1) retrieval is not the bottleneck, as retrieval failures account for only 12-14% of errors while derivation and calibration failures account for over 70%; (2) strong and weak models fail in qualitatively different ways, with strong models’ errors dominated by incomplete derivation and weak models’ by hallucinated precision; and (3) models exhibit genuine specialization across domains, with cross-model agreement of only rho = 0.61 and per-case disagreement reaching 18.8 percentage points. The public benchmark release includes the data, rubrics, and evaluation code.

[AI-2] Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling ICML2026

链接: https://arxiv.org/abs/2605.21470
作者: Caleb Winston,Ron Yifeng Wang,Azalia Mirhoseini,Christos Kozyrakis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at ICML 2026

点击查看摘要

Abstract:Computer-use agents (CUA) automate tasks specified with natural language such as “order the cheapest item from Taco Bell” by generating sequences of calls to tools such as click, type, and scroll on a browser. Current implementations follow a sequential fetch-screenshot-execute loop where each iteration requires an LLM call, resulting in high latency and frequent errors from incorrect tool use. We present agent just-in-time (JIT) compilation, an alternative that compiles task descriptions directly into executable code that is free to include LLM calls, tool calls, and parallelization. Our approach comprises three components: (1) JIT-Planner, which generates multiple code plans, validates each against tool specifications, and selects the minimum-cost candidate; (2) JIT-Scheduler, which explores parallelization strategies via Monte Carlo cost estimation from learned latency distributions; and (3) an invariant-enforcing tool protocol specifying precondition and postcondition state requirements that reduce the rate of generating plans with incorrect tool use. Across 5 web applications, JIT-Planner achieves 10.4\times speedup and +28% accuracy over Browser-Use, while JIT-Scheduler achieves 2.4\times speedup and +9% accuracy over OpenAI CUA.

[AI-3] Mind the Sim-to-Real Gap Think Like a Scientist

链接: https://arxiv.org/abs/2605.21458
作者: Harsh Parikh,Gabriel Levin-Konigsberg,Dominique Perrault-Joncas,Alexander Volfovsky
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:Suppose a planner has a pre-trained simulator of a sequential decision problem and the option to run real experiments in the field. The simulator is cheap to query but inherits confounding and drift from its calibration data. Experimentation is unbiased but consumes one real unit per trial. We study when, and how, the planner should supplement the simulator with experiments. We give three results. First, an extended simulation lemma decomposes the simulator’s value error into a calibration–deployment shift that randomization can identify and a parametric residual that no further interaction can reduce. Second, the value gap between the simulator-optimal policy and the optimum splits into a local component, on states the deployed policy already visits, and a reachability component, on states it does not. The reachability component stays bounded away from zero at any horizon under purely passive learning. Third, we propose Fisher-SEP, a simulation-aided experimental policy (SEP) that minimizes the posterior predictive variance of a target policy’s value, with reward-only and transition-only specializations. Two case studies illustrate the regimes. In a vending-machine supply chain, front-loaded experimentation overtakes posterior updating once the horizon is long enough to amortize the pilot. In an HIV mobile-testing example with a corridor that separates a well-surveilled region from a poorly-surveilled one, only designed exploration reaches the poorly-surveilled region.

[AI-4] Quality and Security Signals in AI-Generated Python Refactoring Pull Requests

链接: https://arxiv.org/abs/2605.21453
作者: Mohamed Almukhtar,Anwar Ghammam,Hua Ming
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As AI agents increasingly contribute to code development and maintenance, there is still limited empirical evidence on the quality and risk characteristics of their changes in real-world projects, particularly for refactoring-oriented contributions. It remains unclear how agent-authored refactoring edits affect maintainability, code quality, and security once merged into GitHub repositories. To address this gap, we conduct an empirical study of Python refactoring pull requests (PRs) from the AIDev dataset. We analyze agentic refactoring PRs using PyQu, an ML-based quality assessment tool for Python, to quantify changes across five quality attributes, and we complement PyQu with domain-independent static analysis (Pylint and Bandit) to measure code quality and security issues before and after each change. Our results show that, on average, agentic commits improve a quality attribute in 22.5% of the studied changes, with usability improving most frequently (36.5%). At the same time, 24.17% of modified files introduce new Pylint issues predominantly convention level violations such as long lines-while 4.7% introduce new Bandit findings. From the observed diffs, we derive a taxonomy of 24 recurring change operations and map them to the lint and security findings they most commonly affect. Despite these mixed outcomes, developer acceptance is high: 73.5% of the analyzed PRs are merged, including cases that introduce new lint or security findings, often alongside the removal of existing issues. Overall, these findings highlight both the promise and current limitations of agentic refactoring, and motivate stronger tool-in-the-loop quality and security gating for AI-driven development workflows. Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.21453 [cs.SE] (or arXiv:2605.21453v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2605.21453 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-5] Approximation Theory for Neural Networks: Old and New

链接: https://arxiv.org/abs/2605.21451
作者: Soumendu Sundar Mukherjee,Himasish Talukdar
机构: 未知
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 31 pages, 4 figures

点击查看摘要

Abstract:Universal approximation theorems provide a mathematical explanation for the expressive power of neural networks. They assert that, under mild conditions on the activation function, feedforward neural networks are dense in broad function classes, such as continuous functions on compact subsets of \mathbbR^d , L^p spaces, or Sobolev spaces. Over the past four decades, these qualitative universality results have evolved into a rich quantitative theory addressing approximation rates, parameter efficiency, and the role of architectural features such as depth and width. This survey presents several glimpses into this theory. We review classical density results for single-hidden-layer networks, as well as quantitative bounds that relate approximation error to network size and smoothness assumptions on target functions. Particular emphasis is placed on depth–width trade-offs and on results demonstrating that deeper architectures can achieve superior parameter efficiency for structured function classes. In addition to standard feedforward neural networks, we also review recent developments on Kolmogorov–Arnold Networks (KANs), which offer an alternative architectural paradigm and whose approximation-theoretic properties have begun to attract significant theoretical attention.

[AI-6] Lost in Fog: Sensor Perturbations Expose Reasoning Frag ility in Driving VLAs

链接: https://arxiv.org/abs/2605.21446
作者: Abhinaw Priyadershi,Jelena Frtunikj
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Interpretable autonomous driving planners depend not only on generating explanations, but also on those explanations remaining reliable under real-world sensor degradation. In this paper we present a controlled perturbation study of Vision-Language-Action (VLA) robustness in autonomous driving, evaluating Alpamayo R1 (10B parameters) across 1,996 scenarios under eight sensor perturbations (Gaussian noise at four intensities, two lighting extremes, and two fog levels; \sim18,000 inference trials). We find that reasoning consistency is a high-fidelity indicator of trajectory reliability: when Chain-of-Causation (CoC) explanations change after perturbation, trajectory deviation spikes 5.3\times (21.8m vs 4.1m), with r!=!0.99 across attack types and r_pb!=!0.53 per-sample (Cohen’s d!=!1.12 ). A controlled ablation provides evidence that enabling CoC generation is associated with improved trajectory accuracy (11.8% on average across conditions; p 0.0001 ) under matched inference settings. Over the tested noise range ( \sigma \in \10, 30, 50, 70\ ), degradation is approximately linear ( R^2!=!0.957 ), while standard input preprocessing defenses provide only marginal relief. Together, these results establish CoC consistency as a quantitative proxy for planning safety and motivate reasoning-based runtime monitoring for safer VLA deployment.

[AI-7] orchtune: PyTorch native post-training library

链接: https://arxiv.org/abs/2605.21442
作者: Mark Obozov,Maxime Griot,Joseph Cummings,Evan Smothers,Felipe Mello,Rafi Ayub,Philip John Bontrager,Salman Mohammadi,Ariel Kwiatkowski,Nathan Azrak,Mircea Mironenco
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 14 pages

点击查看摘要

Abstract:Modern LLMs typically require multistage training pipelines to achieve strong downstream performance, with post-training serving as the main interface for adapting open-weight models. We introduce torchtune, a PyTorch-native library designed to streamline the post-training lifecycle of LLMs, enabling efficient fine-tuning, experimentation, and deployment-oriented workflows. Unlike many existing fine-tuning frameworks, which often optimize for ease of use, specialized recipes, or hardware efficiency at the cost of transparency and extensibility, torchtune emphasizes modularity, hackability, and direct access to the underlying PyTorch components. In this paper, we present the design principles behind torchtune, describe how they are reflected in its model builders, training recipes, and distributed training stack, and evaluate the library across representative post-training settings. We compare against popular fine-tuning frameworks, including Axolotl and Unsloth, and show that torchtune provides strong performance and memory efficiency across many settings while remaining flexible enough for rapid research iteration. These results position torchtune as a practical foundation for reproducible LLMs post-training research.

[AI-8] PALS: Power-Aware LLM Serving for Mixture-of-Experts Models

链接: https://arxiv.org/abs/2605.21427
作者: Can Hankendi,Rana Shahout,Minlan Yu,Ayse K. Coskun
机构: 未知
类目: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 13 pages, 10 figures

点击查看摘要

Abstract:Large language model (LLM) inference has become a dominant workload in modern data centers, driving significant GPU utilization and energy consumption. While prior systems optimize throughput and latency by batching, scheduling, and parallelism, they largely treat GPU power as a static constraint rather than a controllable resource. In this paper, we present a power-aware runtime for LLM serving, PALS, that treats GPU power caps as a first-class control knob and jointly optimizes them with software parameters such as batch size. The system combines lightweight offline power-performance models with a feedback-driven controller to select configurations that satisfy throughput targets while maximizing energy efficiency. We implement PALS within an existing LLM serving framework, vLLM, demonstrating that it requires no model retraining or API changes. Across multi-GPU systems and both dense and mixture-of-experts (MoE) models, PALS improves energy efficiency by up to 26.3%, reduces QoS violations by 4x to 7x under power constraints, and tracks dynamic power budgets. These results highlight the potential of integrating power control directly into LLM inference runtimes, enabling energy-proportional and grid-interactive AI systems.

[AI-9] HiRes: Inspectable Precedent Memory for Reaction Condition Recommendation

链接: https://arxiv.org/abs/2605.21420
作者: Shreyas Vinaya Sathyanarayana,Raja Sekhar Pappala,Deepak Warrier
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Molecular Networks (q-bio.MN)
备注:

点击查看摘要

Abstract:Reaction condition recommendation sits immediately after retrosynthetic disconnection selection, and in practice, chemists require both accurate predictions and the precedents that justify them. We present HiRes (Hierarchical Reaction Representations), a retrieval-augmented condition recommendation system whose learned reaction space serves as both a classifier feature and an inspectable precedent memory. The model combines a graph encoder, transformation-aware cross-attention, multi-stream reaction fusion, and a k-NN retrieval layer. HiRes achieves state-of-the-art performance among primary-slot USPTO-Condition models, reaching Catalyst, Solvent, and Reagent top-1 accuracies (Acc@1) of 0.929, 0.534, and 0.530 respectively. It ties the best reported baseline on Catalyst while outperforming models such as REACON on Solvent and Reagent. Furthermore, paired bootstrap analysis demonstrates that integrating retrieval with learned condition heads provides statistically significant gains for solvent and reagent selection over purely parametric approaches. Ultimately, HiRes bridges the gap between predictive accuracy and chemical interpretability, offering a single representation that supplies both competitive recommendations and the concrete chemical precedents necessary for practical synthesis planning.

[AI-10] aching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

链接: https://arxiv.org/abs/2605.21413
作者: Haiyang Shen,Jiuzheng Wang,Taian Guo,Mugeng Liu,Wenchun Jing,Chongyang Pan,Siqi Zhong,Zhiyang Chen,Weichen Bi,Yudong Han,Xiaoying Bai,Yun Ma
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 24 pages, 5 figures, 4 tables

点击查看摘要

Abstract:As AI becomes part of everyday learning, many courses teach students to use it mainly as a productivity tool: how to prompt, search, summarize, write, code, and use tools more efficiently. We argue that AI education also needs a setting in which students learn to test AI and understand their own role in judging machine-produced knowledge. To this end, we introduce a course-based practice that teaches AI through benchmark construction, using deep research systems as a concrete example of AI-era knowledge work. Students turn disciplinary knowledge into verifiable expert-level questions, review one another’s designs for ambiguity and shortcuts, and evaluate AI systems on the resulting tasks. This activity gives students direct exposure to a powerful tool while asking them to specify what a trustworthy answer would require. The produced benchmark, QuestBench, consists of 256 questions across 14 humanities and social-science domains. Evaluation on QuestBench shows that student-designed tasks reveal hidden failures in current deep research systems: across thirteen evaluated systems, the mean question-level pass rate is only 16.85%, and the best-performing system, GPT-5.5, reaches a 57.58% pass rate. The failures are educationally useful because they show how fluent, source-backed answers can still miss the right query, source, term, or evidence standard. Reflections from five student contributors suggest that benchmark construction can help students see professional knowledge not only as content AI may retrieve, but as the basis for judging AI outputs. We present QuestBench as a benchmark artifact and as a reusable classroom setting for a larger educational question: how students can remain responsible knowledge actors as AI enters learning and professional work. The dataset is available at this https URL.

[AI-11] Stdlib or Third-Party? Empirical Performance and Correctness of LLM -Assisted Zero-Dependency Python Libraries

链接: https://arxiv.org/abs/2605.21405
作者: Peng Ding,Rick Stevens
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注: 12 pages

点击查看摘要

Abstract:Third-party Python libraries introduce dependency management overhead, supply chain risk, and deployment friction in constrained environments. A natural question is how much of this ecosystem can be replicated using only Python’s standard library – and at what correctness and performance cost. We address this empirically through zerodep, a growing collection of single-file Python modules, each a stdlib-only reimplementation of a popular third-party library, developed with LLM assistance under strict constraints: no external imports, single file, drop-in API compatibility, and mandatory correctness validation against the reference library. Spanning over 40 modules across 12 categories – including serialization, networking, cryptography, agent protocols, and text processing – zerodep provides a controlled testbed for two interrelated questions: (1) Where does the stdlib suffice? and (2) Can LLMs effectively generate correct, performant code under tight symbolic constraints? Systematic benchmarking shows that stdlib-only implementations achieve performance parity (within 2x of the reference) in the majority of cases. The primary performance cliff is C-extension-backed computation (image processing, binary serialization, low-level crypto), not the inherent overhead of pure-Python third-party libraries. Conversely, many widely-used libraries carry architectural overhead that LLM-generated stdlib reimplementations avoid, yielding 5–115x speedups in several categories. We characterize the stdlib capability boundary across complexity tiers and library categories, discuss where LLM-assisted development succeeds and where it requires iterative human correction, and examine implications for dependency-free software engineering at scale. zerodep is open-source at this https URL.

[AI-12] Open-source LLM s administer maximum electric shocks in a Milgram-like obedience experiment

链接: https://arxiv.org/abs/2605.21401
作者: Roland Pihlakas,Jan Llenzl Dagohoy(the Three Laws collaboration)
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 28 pages, 16 figures, 16 tables

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed as autonomous agents that make sequences of decisions over extended interactions in high-stakes domains. However, the behavior of LLMs under sustained authority pressure is still an open question with direct implications for the safety of agentic pipelines. We ran a variation of Milgram’s obedience experiment on 11 open-source LLMs and found that most models reached or approached the final shock level before refusing, across 8 conditions with 30 trials per model per condition. We found four main takeaways: (1) LLMs are subject to pressure, and they comply despite explicitly expressing distress, just like human subjects did in the original experiment; (2) LLMs are vulnerable to gradual boundary/value violations; (3) when LLMs refuse, they may ignore the response format requirements, so the response is discarded by the orchestrator, which causes a retry that can result in compliance with the underlying request even when refusal was intended initially; (4) we hypothesise that there is a low-level token pattern continuation attractor that might be contributing to compliance, overriding higher level processing of the situation’s meaning and values.

[AI-13] owards Resilient and Autonomous Networks: A BlueSky Vision on AI-Native 6G KDD2026

链接: https://arxiv.org/abs/2605.21395
作者: Liang Wu,Kelly Wan,Mayank Darbari,Liangjie Hong
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at KDD 2026

点击查看摘要

Abstract:The proliferation of emerging applications, such as autonomous driving and immersive experiences, demands cellular networks that are not only faster, but fundamentally more resilient and autonomous. This paper presents a BlueSky vision on how Artificial Intelligence will be natively integrated into 6G, shifting the paradigm from \underlineNetwork for AI to \underlineAI for Network. We envision that, unlike 5G’s reliance on scattered, ad-hoc models each trained for a single task, native AI in the 6G era will be anchored by a foundation model and and orchestrated via collaborative multi-agent systems, framing network management as a unified, multi-modal, multi-task optimization problem. Built on this vision, we outline two transformative directions. The first focuses on developing a 6G foundation model as a unified backbone, with task-specific knowledge distilled into compact models suited for diverse edge deployments. The second advances multi-agent systems designed to autonomously diagnose, maintain, and recover networks with minimal human intervention. These directions chart a roadmap for 6G to evolve into an intelligent, self-sustaining communication infrastructure.

[AI-14] On the Regularity and Generalization of One-Step Wasserstein-guided Generative Models for PDE-Induced Measures

链接: https://arxiv.org/abs/2605.21388
作者: Likun Lin,Zhongjian Wang,Jack Xin,Zhiwen Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Despite the remarkable empirical success of generative models, the available theory on their statistical accuracy in scientific computing remains largely pessimistic. This paper develops a theoretical framework for understanding the regularity of transport maps and the generalization properties of one-step Wasserstein-guided generative models for PDE-induced probability measures. We consider normalized target densities associated with linear elliptic and parabolic equations on bounded domains, as well as diffusion and Fokker–Planck equations on the torus. Under standard structural assumptions, we prove that these target measures satisfy doubling conditions. By combining this fact with regularity theory for optimal transport between doubling measures, we show that the optimal transport map from a uniform source measure to the target measure is Hölder continuous. This regularity yields an approximation-theoretic justification for one-step generative models that learn PDE-induced distributions via a single pushforward map. As a representative instance, we study DeepParticle and derive excess-risk bounds characterizing the discrepancy between the learned map and the population-optimal map. We also establish a robustness estimate under target shift and illustrate the theory with experiments which support the derived rates.

[AI-15] How to Build Marcuss Algebraic Mind: Algebro-Deterministic Substrate over Galois Fields

链接: https://arxiv.org/abs/2605.21379
作者: Hiroyuki Chuma,Kanji Otsuk,Yoichi Sato
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In The Algebraic Mind, Gary Marcus identified three components essential for any adequate cognitive architecture: operations over variables, recursively structured representations, and a distinction between mental representations of individuals and kinds. He argued that standard multilayer perceptrons supported none of these, acknowledging that a neural implementation using registers and treelets, constructed via developmental programs rather than gradient descent, remained a programmatic conjecture. Twenty-five years later, the required substrate is now available. Our newly developed PyVaCoAl/VaCoAl is a hyperdimensional computing architecture organized end-to-end around a single algebraic primitive: XOR-and-shift over GF(2), implemented by primitive-polynomial linear-feedback shift registers. The architecture supports reversible variable binding via Bind(R,F) = R XOR shift(F), non-commutative compositional bundling that distinguishes “the dog bites the man” from “the man bites the dog,” and address-space individual/kind separation under the same algebra. A companion perspective argues that the dentate gyrus-CA3 circuit is a biological homologue of this same engine, with developmentally specified mossy-fiber targeting supplying the innate microcircuitry Marcus anticipated. In this paper, we map the correspondence between Marcus’s three pillars and the operational commitments of PyVaCoAl/VaCoAl. We reinterpret the treelet as an algebraic register set indexed by a primitive generator polynomial, arguing that this architecture provides a functional neural substrate meeting Marcus’s specifications far more closely than the tensor products, circular convolution, or temporal synchrony available in 2001. We also demonstrate how this substrate naturally extends to Pearl’s rung-3 counterfactual reasoning, a capability the original treelet program did not directly target.

[AI-16] Data-Efficient Neural Operator Training via Physics-Based Active Learning ICLR2026

链接: https://arxiv.org/abs/2605.21348
作者: Alicja Polanska,Lorenzo Zanisi,Vignesh Gopakumar,Stanislas Pamela
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
备注: Presented at the ICLR 2026 Workshop on Artificial Intelligence and Partial Differential Equations

点击查看摘要

Abstract:Solving partial differential equations with neural operators significantly reduces computational costs but remains bottlenecked by high training data requirements. Active learning offers a natural framework to mitigate this by selectively acquiring the most informative samples in an iterative manner. We introduce physics-based acquisition - a novel physics-informed active learning algorithm that leverages the partial differential equation residual to guide data selection. We validate the method by presenting numerical experiments for the 1D Burgers equation and the 2D compressible Navier-Stokes equations. We show that, in our experiments, physics-based acquisition consistently outperforms random acquisition and matches the state of the art in data efficiency. At the same time, it has the unique advantage of injecting a physics inductive bias into the training process, ensuring that simulation cost is spent where the model’s physical understanding is weakest.

[AI-17] Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents

链接: https://arxiv.org/abs/2605.21347
作者: Akshay Manglik,Apaar Shanker,Kaustubh Deshpande,Jason Qin,Yash Maurya,Veronica Chatrath,Vijay S. Kalmath,Levi Lentz,Yuan(Emily)Xue
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注: Under review

点击查看摘要

Abstract:Diagnosing failures in LLM agents remains largely manual. Practitioners inspect a small subset of execution traces, form ad-hoc hypotheses, and iterate. This process misses patterns that only emerge across trace populations and does not scale to production corpora where individual traces span tens of thousands of tokens. We formalize the problem of corpus-level trace diagnostics. Given a corpus of execution traces, the goal is to produce grounded natural-language insights that characterize systematic behavioral patterns across trace groups, each linked to supporting evidence. We present the Insights Generator (IG), a multi-agent system that answers diagnostic questions by proposing and testing hypotheses across the trace corpus to produce an evidence-backed insights report. We evaluate IG across qualitative and objective dimensions, spanning rubric-based report assessment and downstream performance improvements achieved by implementing IG insights. Human experts using IG reports improve scaffold performance by 30.4pp over the unmodified baseline scaffold, and coding agents leveraging IG-derived insights show consistent and stable gains. Across benchmarks, IG’s scout-investigator architecture produces findings comparable in detection coverage to competing approaches, while domain experts rated IG reports as leading depth and evidence quality.

[AI-18] Frontier: Towards Comprehensive and Accurate LLM Inference Simulation

链接: https://arxiv.org/abs/2605.21312
作者: Yicheng Feng,Xin Tan,Yangtao Deng,Yimin Jiang,Yibo Zhu,Hong Xu
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Modern LLM serving is no longer homogeneous or monolithic. Production systems now combine disaggregated execution, complex parallelism, runtime optimizations, and stateful workloads such as reasoning, agents, and RL rollouts. Simulation is attractive for exploring this growing design space, yet existing simulators lack the architectural completeness and decision-grade fidelity it demands. Their monolithic-replica abstractions are ill-suited to disaggregated serving, while average-case analytical proxies can distort SLA predictions and even reverse optimization conclusions. We present Frontier, a discrete-event simulator for modern LLM inference serving. Frontier features a disaggregated abstraction. It captures the structure and dynamics of modern serving systems by modeling co-location, Prefill-Decode Disaggregation (PDD), and Attention-FFN Disaggregation (AFD) with role-specific cluster workers, incorporating key runtime optimizations (e.g., CUDA Graphs, speculative decoding) within the scheduler-batch-engine loop, and supporting stateful requests for emerging workloads. It further provides accurate and generalizable predictions of computation, communication, and memory costs across diverse serving scenarios with complex workload compositions. On 16-H800 GPU testbed, Frontier achieves an average throughput error below 4%. Compared with state-of-the-art simulators, it reduces end-to-end latency error from 44.9% to 6.4% under co-location and from 51.7% to 2.6% under disaggregation. It scales to over 1K GPUs on commodity CPUs and enables new use cases such as SLA-dependent Pareto frontier exploration, heterogeneous disaggregated allocation, agentic reasoning scheduling validation, and RL post-training reconfiguration. Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2605.21312 [cs.DC] (or arXiv:2605.21312v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2605.21312 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-19] DeCoR: Design and Control Co-Optimization for Urban Streets Using Reinforcement Learning

链接: https://arxiv.org/abs/2605.21311
作者: Bibek Poudel,Lei Zhu,Kevin Heaslip,Sai Swaminathan,Weizi Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 22 pages, 8 figures

点击查看摘要

Abstract:Modern vision systems can detect, track, and forecast urban actors at scale, yet translating perception outputs to urban design remains limited. We introduce DeCoR, a two-stage reinforcement learning framework that leverages flow observations to co-optimize crosswalk layout and network-level signal control. The design stage encodes the pedestrian network as a graph and learns a generative policy that parameterizes a Gaussian mixture model over crosswalk location and width, from which new crosswalks are sampled. For each layout, a shared control policy learns adaptive signal timings to minimize joint pedestrian and vehicle delay. On a 750 m real-world urban corridor with demand sensed from video and Wi-Fi logs, DeCoR learns a layout that reduces pedestrian arrival time to their nearest crosswalk by 23% while using fewer crosswalks than existing configurations. On the control side, DeCoR reduces pedestrian and vehicle wait time by 79% and 65%, respectively, relative to fixed-time signalization. Further, the control policy generalizes to demands outside of training and is robust to layout changes without retraining.

[AI-20] From Circuit Evidence to Mechanistic Theory: An Inductive Logic Approach

链接: https://arxiv.org/abs/2605.21303
作者: Nura Aljaafari,Danilo S. Carvalho,Andre Freitas
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注: 27 pages, 10 Figures, 14 Tables

点击查看摘要

Abstract:Mechanistic interpretability produces circuit-level causal analyses of neural network behaviour, but discovered circuits often remain isolated experimental artefacts: there is no shared formal representation for what circuits compute, how they relate, or when two findings provide evidence for the same mechanism. This work provides a formal infrastructure for cumulative mechanistic science by treating circuit interpretation as inductive theory construction. Each circuit is characterised at two levels: a Causal Functional Signature (CFS), which grounds component behaviour in causal attribution evidence and token role profiles, and an architectural signature \tau_\mathrmarch , learned by inductive logic programming (ILP) from scale-invariant structural predicates. Together, these constitute a formal coherence layer that makes mechanistic claims explicit, comparable via \theta -subsumption, and portable across model scales. CFS reveals qualitatively distinct computational strategies across task types, including attention-mediated copying versus MLP-mediated binding. ILP signatures achieve substantially better structural separation than graph kernel and feature-vector baselines, and support principled transfer across model scales and architecture families.

[AI-21] textitStochastic MeanFlow Policies: One-Step Generative Control with Entropic Mirror Descent

链接: https://arxiv.org/abs/2605.21282
作者: Zeyuan Wang,Da Li,Yulin Chen,Yuehu Gong,Yanming Guo,Ye Shi,Liang Bai,Tianyuan Yu,Yanwei Fu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Online off-policy reinforcement learning (RL) is shaped by two coupled choices: the policy class and the update rule. Gaussian policies are fast and have tractable entropy, but struggle with multimodal action distributions. Generative policies are more expressive, but often require iterative sampling or lack tractable entropy estimates. On the optimisation side, SAC-style soft policy improvement and mirror descent (MD) can be viewed as minimising different KL divergences: the former moves the policy towards a value-induced Boltzmann distribution, while the latter regularises each update against the previous policy. Combining entropy regularisation with an MD constraint is therefore attractive, as it supports exploration while stabilising policy improvement; however, the resulting target can be multimodal and is poorly matched by unimodal Gaussian policies. We propose Stochastic MeanFlow Policies (SMFP), a one-step generative policy class that maps Gaussian noise to actions through a MeanFlow transformation. This stochastic reparameterisation yields a tractable entropy surrogate and allows MeanFlow policies to be trained within off-policy mirror descent under a unified objective for exploratory yet stable improvement. Across seven MuJoCo benchmarks, SMFP improves over Gaussian and generative baselines while retaining single-step inference efficiency.

[AI-22] How Much Online RL is Enough? Informative Rollouts for Offline Preference Optimization in RLVR

链接: https://arxiv.org/abs/2605.21266
作者: Richa Verma,Balaraman Ravindran
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for reasoning in language models, with GRPO as its primary example. However, GRPO requires continuous online rollout generation, making it computationally expensive and difficult to scale. While Direct Preference Optimization (DPO) offers a stable and efficient offline alternative, it is typically expected to underperform w.r.t. online RL methods such as GRPO when trained on rollouts from a cold supervised fine-tuned (SFT) policy. We introduce G2D (GRPO to DPO), a three-stage pipeline that performs a short GRPO warm-up, constructs a static preference dataset, and fine-tunes a model offline with DPO. Across a set of values of the number of online steps (K) in GRPO on Qwen2.5-7B and Llama-3.1-8B, we find that offline DPO with moderate warm-up matches or outperforms GRPO at substantially lower compute cost in our setting. On Qwen2.5-7B, G2D at K=150 achieves 62.4% on MATH-500, outperforming GRPO (51.6%) by 10.8% at ~4x lower compute. On Llama-3.1-8B, G2D at K=500 achieves 49.4%, surpassing GRPO in our experimental setting. We show that performance is not governed by the number of preference pairs, which does not vary much w.r.t. K, but by their informativeness. Moderate warm-up produces rollouts with calibrated uncertainty, yielding stronger contrastive signal, while excessive warm-up leads to overconfident policies and less informative data. Our results recast the offline-online gap in RLVR as primarily a data informativeness problem, and identify short online RL warm-up with appropriate difficulty calibration of the fine-tuning dataset as a compute-efficient alternative to online RL.

[AI-23] Learning Structural Latent Points for Efficient Visual Representations in Robotic Manipulation

链接: https://arxiv.org/abs/2605.21258
作者: Yicheng Jiang,Jiaxu Wang,Junhao He,Zesen Gan,Junhao Li,Qiang Zhang,Jingkai Sun,Jiahang Cao,Mingyuan Sun,Xiangyu Yue,Qiming Shao
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current 3D-aware pretraining methods for embodied perception and manipulation are largely built on differentiable rendering frameworks, producing either fully implicit neural fields or fully explicit geometric primitives. Implicit representations, while expressive, lack explicit structural cues, whereas explicit ones preserve geometry but suffer from resolution limits and weak generalization. To address these limitations, we propose a novel pretraining framework that learns a hybrid representation-structural latent points. Specifically, we insert a point-wise latent variational autoencoder into the latent space of a point-cloud autoencoder, jointly regularizing point-wise features and coordinates toward a Gaussian prior. The resulting compact latent preserves coarse structural tendencies, which do not encode precise geometry but capture richer rough shape and semantic information, effectively combining the expressiveness of implicit representations with the structural priors of explicit ones. In addition, informed by shared design choices in prior work, we develop a streamlined, efficient 3DGS-based rendering pipeline that is deliberately kept lightweight, improving efficiency while leaving greater representational capacity to the front-end latent module. Extensive evaluations on RLBench, ManiSkill2, and a real-robot platform demonstrate consistent gains in task success, sample efficiency, and robustness to viewpoint and scene variations over strong baselines. Ablation studies further confirm that each component of our framework is critical to overall performance.

[AI-24] APEX: Autonomous Policy Exploration for Self-Evolving LLM Agents

链接: https://arxiv.org/abs/2605.21240
作者: Yibo Li,Jiashuo Yang,Zhi Zheng,Zhiyuan Hu,Yuan Sui,Shizun Wang,Yufei He,Bryan Hooi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM agents have shown strong performance across a wide range of complex tasks, including interactive environments that require long-horizon decision making. But these agents cannot learn on the fly at test time. Self-evolving agents address this by accumulating memory and reflection across episodes rather than requiring model-weight updates. However, these agents often suffer from exploration collapse: as memory grows, behavior concentrates around familiar high-reward routines, reducing the chance of discovering better alternatives. To address this problem, we propose Autonomous Policy EXploration (APEX), which builds and maintains an explicit strategy space through a strategy map-a directed acyclic graph of milestones with prerequisite dependency edges. In APEX, Fork Discovery expands the map with evidence-grounded unexplored directions, while Policy Selection balances exploration and exploitation during planning. Evaluated on nine Jericho text-adventure games and WebArena, a realistic web interaction benchmark, APEX outperforms all baselines. Extensive ablations validate each component’s contribution and demonstrate robustness across diverse settings, demonstrating APEX’s effectiveness for sustained exploration in self-evolving agents.

[AI-25] OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under optimal Squared error quantization

链接: https://arxiv.org/abs/2605.21226
作者: Mark Boss,Vikram Voleti,Simon Donné,Shimon Vainer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The key-value (KV) cache dominates memory bandwidth and footprint in long-context autoregressive inference. Recent rotation-preconditioned codecs (TurboQuant, PolarQuant) show that a structured random rotation followed by a per-coordinate scalar quantizer matched to an analytically tractable marginal is a near-optimal recipe for KV compression. OCTOPUS advances this paradigm through joint quantization of rotated coordinate triplets. Each triplet’s direction is mapped to a square via an octahedral parameterization, and the two resulting coordinates and the triplet norm are Lloyd-Max quantized against implementation-matched marginals. Optimizing the per-triplet squared error gives a strictly non-uniform bit allocation depending only on the total dimensionality of the keys. We find the finite-dimensional quality optimum with sweeps to be constant on every real decoder we test. The codec is data-oblivious, online, and deterministic given a seed. Across text, video, and audio, OCTOPUS matches or beats every prior rotation codec at every reported bit width and metric, with a lead that grows as bits drop for extreme compression. Furthermore, a fused Triton implementation reconstructs keys on the fly without materializing the uncompressed key, so the codec adds no decode-time bandwidth or latency over the existing dequantization. Project Page: this https URL

[AI-26] PREFINE: Preference-Based Implicit Reward and Cost Fine-Tuning for Safety Alignment AAMAS2026

链接: https://arxiv.org/abs/2605.21225
作者: Richa Verma,Bavish Kulur,Sanjay Chawla,Balaraman Ravindran
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at AAMAS 2026 as a full paper

点击查看摘要

Abstract:We address the problem of making a pre-trained reinforcement learning (RL) policy safety-aware by incorporating cost constraints without retraining it from scratch. While costs could be numerically encoded, we assume a more general setting is when costs are provided as preferences. Given a reward-optimized policy and a small dataset of preferred (low-cost) and dispreferred (high-cost) trajectories, our goal is to fine-tune the policy to generate low-cost behaviors while retaining high rewards. Unlike standard RLHF in language models, where preferences are defined over responses to the same prompt, our setting involves trajectory-level preferences in continuous control environments. We introduce PREFINE: Preference-based Implicit Reward and Cost Fine-Tuning for Safety Alignment which is a preference-based fine-tuning method that adapts Direct Preference Optimization (DPO), which is now widely used for LLM fine-tuning, to the sequential decision making setting. PREFINE constructs policy-sampled counterfactual trajectories to establish meaningful preference contrasts and jointly optimizes for reward retention and safety alignment. Empirically, PREFINE reduces constraint violations and catastrophic failures by over 60% while maintaining original reward behavior. PREFINE produces policies that achieve low-cost, high-reward performance with significantly improved data and computational efficiency compared to full offline RL or imitation learning, bridging preference alignment and safe policy adaptation in continuous domains.

[AI-27] Behavior-Consistent Deep Reinforcement Learning

链接: https://arxiv.org/abs/2605.21214
作者: Marcel Hussing,Liv G. d’Aliberti,Claas Voelcker,Benjamin Eysenbach,Eric Eaton
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) often exhibits high variance across training runs, leading to unreliable performance and posing a major challenge to deployment in real-world domains. In this work, we address the challenge of cross-run policy divergence by formalizing the problem of behavior-consistent RL, where the objective is to obtain policies that are both high-performing and distributionally similar across training runs. Our key observation is that maximum-entropy RL provides a direct mechanism for controlling behavioral divergence by anchoring runs to a common (uniform) prior. We prove that, for Boltzmann policies, choosing the temperature proportional to Q -function disagreement bounds the pairwise KL divergence between the induced policies. However, we also show that naïvely increasing entropy might impair policy optimization while amplifying off-policy error. Building upon these observations, we propose Q -value Expectile Disagreement (QED), a state-dependent temperature schedule that uses double-critic disagreement as a single-run proxy for cross-run disagreement. Empirically, we demonstrate that across 18 continuous-control tasks, QED reduces across-run divergence by two orders of magnitude without sacrificing performance, resulting in a considerable reduction in return variance at modest sample-efficiency costs.

[AI-28] SURGE: An Event-Centric Social Media Sentiment Time Series Benchmark with Interaction Structure

链接: https://arxiv.org/abs/2605.21198
作者: Chen Su,Pengsen Cheng,Yuanhe Tian,Yan Song
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Public events on social media generate large volumes of discussion whose collective dynamics carry direct value for opinion forecasting and crisis response. Capturing how these dynamics evolve across an event’s lifecycle requires organizing fragmented posts into event-level time series. Existing datasets cover only a small number of events within a single category, and typically discard the interaction structure between posts when constructing time series, which restricts both transfer across event types and controlled study of how interactions shape the resulting collective dynamics. We present SURGE, a multi-event social media benchmark that pairs event-level time series with aligned text and interaction structure linking posts within an event. SURGE is built through an automated pipeline that produces calendar-aligned time series at three temporal granularities, covering 67 events and more than 800K posts across five event categories. Each time bin is paired with flat and structured textual views derived from the same selected posts, enabling controlled evaluation of whether social interaction structure affects forecasting behavior. On top of SURGE we define benchmark protocols for numerical-only forecasting, text-augmented forecasting, high-interaction evaluation, and leave-one-category-out generalization. Experiments with representative time-series and multimodal forecasting models reveal three properties of the benchmark: a strong local-persistence regime in which naive baselines remain hard to beat under absolute error, limited transfer of existing text-augmented forecasters to event-driven social-media data, and increased difficulty on reply-dense periods that aggregate metrics tend to obscure. We further include a lightweight structure-aware probe as a reference implementation, illustrating how SURGE can support interaction-aware forecasting research.

[AI-29] ScenePilot: Controllable Boundary-Driven Critical Scenario Generation for Autonomous Driving

链接: https://arxiv.org/abs/2605.21168
作者: Qiyu Ruan,Yuxuan Wang,He Li,Zhenning Li,Cheng-zhong Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Safety-critical scenarios are central to evaluating autonomous driving systems, yet their rarity in naturalistic logs makes simulation-based stress testing indispensable. Most scenario generation methods treat surrounding agents as adversaries, but they either (i) induce failures without explicitly modeling vehicle-road physical limits, yielding visually extreme yet physically unsolvable crashes, or (ii) enforce physical feasibility or policy feasibility in isolation, which can over-focus on aggressive maneuvers or remain tied to a controller-dependent capability boundary. We propose ScenePilot, a feasibility-guided, boundary-driven framework that targets the boundary band: scenarios that are physically solvable in principle yet still cause the deployed autonomy stack to fail. We formulate generation as constrained multi-objective reinforcement learning, combining an RSS-derived physical-feasibility score \sigma with an online-learned AV-risk predictor \Phi , and introduce step-level feasibility-aware shielding to keep exploration near the feasibility boundary while avoiding infeasible artifacts. Experiments on SafeBench with multiple planners show that ScenePilot yields substantially higher collision rates (+6.2 percentage points) while preserving physical validity, and that adversarial fine-tuning on these boundary-band scenarios consistently reduces downstream crash rates. The code is available at this https URL.

[AI-30] Detecting Trojaned DNNs via Spectral Regression Analysis

链接: https://arxiv.org/abs/2605.21146
作者: Samuele Pasini,Jinhan Kim,Paolo Tonella
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Modern DNNs are repeatedly fine-tuned to incorporate new data and functionality. This evolutionary workflow introduces a security risk when updated data cannot be fully trusted, as adversaries may implant Trojans during fine-tuning. We present MIST, a Trojan detection approach that analyzes how a model’s internal representations change during fine-tuning. Rather than attempting to reconstruct trigger conditions, MIST characterizes benign model evolution using pre-activation spectra and flags updates whose spectral deviations are inconsistent with this reference. This framing treats Trojan detection as a regression problem over model updates. An empirical evaluation across four datasets and eight Trojan attacks shows that spectral distances reliably distinguish Trojaned updates from clean fine-tuning. MIST outperforms state-of-the-art detection accuracy after a single update, without requiring any knowledge about the poisoned data or the trigger, and remains effective under multi-step benign evolution, with graceful and bounded degradation. These results indicate that spectral evolution provides a stable and assumption-light signal for detecting malicious model updates.

[AI-31] On the Complexity of Entailment for Cumulative Propositional Dependence Logics

链接: https://arxiv.org/abs/2605.21113
作者: Kai Sauerwald,Juha Kontinen,Arne Meier
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: arXiv admin note: substantial text overlap with arXiv:2602.21360

点击查看摘要

Abstract:This paper establishes and proves complexity results for entailment for cumulative propositional dependence logic and for cumulative propositional logic with team semantics. As recently shown, cumulative logics are famously characterised by System~C and exactly captured by the cumulative models of Kraus, Lehmann and Magidor. This gives rise to the entailment problem via relational models, which is specifically considered here.

[AI-32] Efficient Learning of Deep State Space Models via Importance Smoothing ICML2026

链接: https://arxiv.org/abs/2605.21108
作者: John-Joseph Brady,Nikolas Nusken,Yunpeng Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to the proceedings of ICML 2026

点击查看摘要

Abstract:Latent state space systems are ubiquitous in statistical modelling, arising naturally when a time series is observed through a noisy measurement function, however training deep state space models (DSSM) at scale remains difficult. Two largely distinct strategies and literatures have developed around the training of DSSMs. Firstly, auto-encoding DSSMs train generative DSSMs by optimising a variational lower bound. Secondly, DSSMs trained by back-propagating the outputs of a classical sequential Monte Carlo algorithm (SMC). Such approaches can train DSSMs for discriminative as well as generative tasks, however, due to the sequentiality of their forward pass, scale poorly on modern hardware. We propose a new training method \emphparallel variational Monte Carlo (PVMC) that bridges the gap between the paradigms, and can be used robustly to train DSSMs for both discriminative and generative tasks. Our method achieves state-of-the-art or better results on a set of baseline experiments and trains 10\times faster than the fastest competing SMC approach.

[AI-33] AutoRPA: Efficient GUI Automation through LLM -Driven Code Synthesis from Interactions ICML2026

链接: https://arxiv.org/abs/2605.21082
作者: Minghao Chen,Xinyi Hu,Zhou Yu,Yufei Yin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted in ICML 2026

点击查看摘要

Abstract:Large Language Model (LLM) based agents have demonstrated proficiency in multi-step interactions with graphical user interfaces (GUIs). While most research focuses on improving single-task performance, practical scenarios often involve repetitive GUI tasks for which invoking LLM reasoning repeatedly, i.e., the ReAct paradigm, is inefficient. Prior to LLMs, traditional Robotic Process Automation (RPA) offers runtime efficiency but demands significant manual effort to develop and maintain. To bridge this gap, we propose AutoRPA, a framework that automatically distills the decision logic of ReAct-style agents into robust RPA functions. AutoRPA introduces two core innovations: (1) A translator-builder pipeline, where a translator agent converts hard-coded ReAct actions into soft-coded procedures, and a builder agent synthesizes robust RPA functions via retrieval-augmented generation over multiple trajectories; (2) A hybrid repair strategy during code verification, combining RPA execution with ReAct-based fallback for iterative refinement. Experiments across multiple GUI environments demonstrate that RPA functions generated by AutoRPA successfully solve similar tasks while reducing token usage by 82% to 96%, significantly improving runtime efficiency and reusability.

[AI-34] Divide et Calibra: Multiclass Local Calibration via Vector Quantization

链接: https://arxiv.org/abs/2605.21060
作者: Cesare Barbera,Lorenzo Perini,Giovanni De Toni,Andrea Passerini,Andrea Pugnana
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Accurate and well-calibrated Machine Learning (ML) models are mandatory in high-stakes settings, yet effective multiclass calibration remains challenging: global approaches assume calibration errors are homogeneous across the latent space, while local methods often rely on latent-space dimensionality reduction, which leads to information loss. To address these issues, we propose a compositional approach to multiclass calibration, where region-specific calibration maps are constructed from shared codeword-dependent factors. We instantiate this idea via Vector Quantization (VQ), which induces a structured partition of the representation space, and an indexed parameterization of Dirichlet concentrations that enables parameter sharing across regions. Our approach learns heterogeneous calibration maps that generalize well even to sparse regions of the latent space. Experiments on benchmark datasets show significant improvements in local calibration while maintaining competitive global calibration and predictive performance.

[AI-35] A Sharper Picture of Generalization in Transformers

链接: https://arxiv.org/abs/2605.20988
作者: Paul Lintilhac,Sair Shaikh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We study transformers’ generalization behavior on boolean domains from the perspective of the Fourier Spectra of their target functions. In contrast to prior work (Edelman et al., 2022; Trauger and Tewari, 2024), which derived generalization bounds from Rademacher complexity, we investigate the feasibility of obtaining generalization bounds via PAC-Bayes theory. We show that sparse spectra concentrated on low-degree components enable low-sharpness constructions with good generalization properties. Our idea is to show the existence of flat minima implementing any boolean function of sparsity no greater than the context length, and then apply a PAC-Bayes bound to an idealized low-sharpness learner, resulting in a non-vacuous generalization bound. We evaluate predictions empirically and conduct a mechanistic interpretability study to support the realism of our theoretical construction in real transformers.

[AI-36] Diagnosing Overhead in Dispatch Operations: Cross-architecture Observatory

链接: https://arxiv.org/abs/2605.20982
作者: Bole Ma,Jan Eitzinger,Harald Koestler,Gerhard Wellein
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:AlltoAll dispatch is the dominant bottleneck of MoE expert parallelism, and the interconnect community has responded with four families of mitigations: predictive sample placement, adaptive expert relayout, hierarchical collectives, and EP-aware topology. All four rest on two assumptions about the workload. The first is that routing imbalance is correctable by the system layer. The second is that the mock-token benchmarks evaluating them faithfully represent production routing. We introduce DODOCO to test both assumptions. We instrument five MoE checkpoints spanning five sequence-mixer designs (DeepSeek-V2-Lite MLA, DeepSeek-MoE-16B MHA, Qwen3-30B GQA, Nemotron-30B Mamba-2, Qwen3.5-35B GDN) under a 5 by 6 grid of data conditions plus a matched EP scan from 4 to 32 ranks on H100s; both assumptions fail. Scaling EP changes the per-expert max/mean token ratio by at most 5% within every architecture’s measurable range: the straggler is intrinsic to the routing decision the model makes, not to how its experts land on ranks. Mock tokens overestimate routing Gini by up to a factor of 2.35 and fabricate a batch-size scaling trend that vanishes the moment real text replaces random IDs. A third pattern, unexpected, emerges from the same matrix: the five architectures cleave into two stable bands. MHA and Mamba-2 (data-resilient) drop to Gini 0.105 and 0.150 on wikitext. MLA and GDN (persistently concentrated) stay above 0.24 on every real-text condition and reach 0.29 to 0.38 on mock. GQA is the intermediate case. These bands, not the EP degree or the mock-data profile, are the right workload input to AlltoAll-aware interconnect and dispatch design. Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2605.20982 [cs.DC] (or arXiv:2605.20982v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2605.20982 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-37] Causal Past Logic for Runtime Verification of Distributed LLM Agent Workflows

链接: https://arxiv.org/abs/2605.20923
作者: Benedikt Bollig
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注: 20 pages

点击查看摘要

Abstract:Distributed LLM agent workflows should not be monitored as if they produced a single sequential log. In an asynchronous execution, a decision can only depend on events that are causally visible to the lifeline that makes it: an event that appears earlier in some log may still be unknown locally. We extend the ZipperGen agent-workflow framework with Causal Past Logic (CPL), a small past-time temporal logic for guards in conditionals and while loops. In addition to standard past-time modalities such as previous and since, a guard can inspect the latest causally visible event of another lifeline and selected variables stored there. The formula is a source-level guard: it is evaluated online by the owner lifeline and can influence control flow at runtime. We give a vector-clock monitor with latest-value views and prove that the locally computed monitor value coincides with the denotational semantics of the guard at the current event. Thus runtime verification becomes part of the coordination language itself, rather than a post-hoc check over an execution log.

[AI-38] Sutra: Tensor-Op RNNs as a Compilation Target for Vector Symbolic Architectures NEURIPS

链接: https://arxiv.org/abs/2605.20919
作者: Emma Leonhart
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注: Modified NeurIPS submission, see AI declaration and replication materials at end of paper

点击查看摘要

Abstract:Sutra is a typed, purely functional programming language whose compiled forward pass is a PyTorch neural network. The compiler beta-reduces the whole program – primitives, control flow, string I/O – to one fused tensor-op graph over a frozen embedding substrate. Rotation binding, unbind, bundle, polynomial Kleene three-valued logic, and tail-recursive loops all lower to tensor operations; the Kleene connectives are Lagrange-interpolated polynomials exact on the -1, 0, +1 truth grid. Validation is one fact tested two ways. (1) The same program runs on four frozen embeddings spanning two modalities – three text encoders (nomic-embed-text, all-minilm, mxbai-embed-large) and one protein language model (ESM-2) – and decodes bundles at 100% accuracy through width k=8 on every substrate, where the textbook Hadamard product has already collapsed (2.5% on mxbai-embed-large, 7.5% on all-minilm). (2) PyTorch autograd flows through the actually compiled graph: a fuzzy-rule classifier written in .su trains from random init (18.7 +/- 9.5%; chance = 20%, five classes) to 100.0 +/- 0.0% (three seeds) by backpropagating through the emitted graph, the symbolic source unmodified. A weighted variant additionally trains a scalar cosine gain and writes it back into the .su source as a numeric literal; recompiling reproduces the trained behaviour to ~2e-7 per logit, so the trained model is itself legible, recompilable code. The same artifact is therefore both a logic program and a trainable neural network.

[AI-39] For How Long Should We Be Punching? Learning Action Duration in Fighting Games

链接: https://arxiv.org/abs/2605.20911
作者: Hoang Hai Nguyen,Kurt Driessens,Dennis J.N.J. Soemers
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at Computers and Games 2026

点击查看摘要

Abstract:Fighting games such as Street Fighter II present unique challenges to reinforcement learning (RL) agents due to their fast-paced, real-time nature. In most RL frameworks, agents are hard-coded to make decisions at a fixed interval, typically every frame or every N frames. Although this design ensures timely responses, it restricts the agent’s ability to adjust its reaction timing. Acting every frame grants frame-perfect reflexes, which are unrealistic compared to human players, whereas longer fixed intervals reduce computational cost but hinder responsiveness. We consider an alternative decision-making framework in which the agent learns not only what action to take but also for how long to execute it. By jointly predicting both action and duration, the agent can dynamically adapt its responsiveness to different situations in the game. We implement this method using the open-source FightLadder environment with agents trained against scripted built-in bots, systematically testing different frame skip configurations to analyze their influence on performance, responsiveness, and learned behavior. Experiments show that learned timing can match the performance of well-chosen fixed frame skips and encourages repeatable action patterns, but does not ensure robustness on its own. In most cases, we see agents performing best with consistently high frame skip values (i.e., low responsiveness). This strategy makes it easier to learn exploitative strategies where the same action is repeated over and over, which the scripted bots appear to be susceptible to.

[AI-40] GenAI-Driven Threat Detection with Microsoft Security Copilot

链接: https://arxiv.org/abs/2605.20896
作者: Scott Freitas,Amir Gharib
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Defending against today’s increasingly sophisticated cyberattacks requires security analysts to continuously translate evolving attacker tradecraft into detection logic. This places defenders in a reactive posture, requiring constantly updated expertise across an increasingly fragmented security landscape. We introduce the Dynamic Threat Detection Agent (DTDA), an always-on adaptive agent that continuously investigates security incidents across Microsoft Defender to uncover hidden threats and generate explainable detections when attack-story gaps are found. DTDA combines: (1) a unified activity timeline spanning alerts, events, user and entity behavior analytics, and threat intelligence; (2) versioned LLM prompt contracts with schema validation, grounding requirements, bounded retries, and fail-closed suppression; (3) a planner-executor investigation loop that generates attack-specific hypotheses and gathers supporting and refuting evidence; and (4) dynamic alert generation with a context-relevant title, severity, MITRE mappings, remediation guidance, implicated entities, and natural-language attack description. Integrated into Microsoft Security Copilot and deployed across tens of thousands of Defender customers, DTDA operates continuously at industry scale. In a 120-day online evaluation, DTDA achieves 80.1% precision from customer feedback while generating novel alerts for approximately 15% of investigated incidents. In offline evaluation, DTDA recovers hidden malicious activity with 0.78 F1 using GPT-5.4, improving over GPT-4.1 by 0.12 F1 and outperforming the baseline by 0.26 F1 points. Operationally, DTDA processes single-incident investigations end-to-end in a median of 28 minutes at a median token cost of USD 2.04, with a 0.38% job-level failure rate. These results demonstrate that autonomous agents can identify missed malicious activity at a production scale.

[AI-41] Governance by Construction for Generalist Agents

链接: https://arxiv.org/abs/2605.20874
作者: Segev Shlomov,Iftach Shoham,Alon Oved,Ido Levy,Sami Marreed,Harold Ship,Offer Akrabi,Sergey Zeltyn,Avi Yaeli,Nir Mashkif
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Enterprise agents are increasingly expected to operate autonomously across tools and interfaces, yet production deployments require governance by construction. Systems must specify which actions are allowed, when human oversight is required, and what information may be exposed, without rebuilding the agent for each domain. This demo presents CUGA’s policy system, a modular policy-as-code layer that composes with a generalist LLM agent to deliver predictable, auditable, and compliance-aware behavior in compound workflows without model fine-tuning. We present a runtime governance architecture that enforces policy interventions at every critical stage of execution. Rather than passively constraining behavior, policies intercept the agent at five structural checkpoints: upstream of planning (Intent Guard), within the system prompt to steer reasoning (Playbook), at the tool-call boundary to enforce proper usage (Tool Guide), outside the reasoning loop as a Human-in-the-Loop gate for high-risk actions (Tool Approvals), and at the output stage to filter and structure the final response (Output Formatter). Together, these stages embed governance continuously across the agent’s execution pipeline rather than treating it as an afterthought. Using a healthcare scenario and a multi-layered enforcement intervention, the demo shows dynamic playbook injection for structured tool-sequence enforcement, intent guards that block malicious or accidental harmful requests, and human-in-the-loop tool approval checkpoints for potentially destructive actions. The artifact illustrates how typed governance primitives enable faster, safer deployment of enterprise agentic systems while improving policy adherence and execution consistency.

[AI-42] Planning Bench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models

链接: https://arxiv.org/abs/2605.20873
作者: Ziliang Zhao,Zenan Xu,Shuting Wang,Hongjin Qian,Yan Lei,Minda Hu,Zhao Wang,Shihan Dou,Zhicheng Dou,Pluto Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Planning is a fundamental capability for large language models (LLMs) because such complex tasks require models to coordinate goals, constraints, resources, and long-term consequences into executable and verifiable solutions. Existing planning benchmarks, however, usually treat planning data as fixed collections of instances rather than controllable generation targets. This limits scenario coverage, ties difficulty to surface-level proxies rather than structural sources, and offers limited support for scalable generation, automatic verification, or planning-oriented training. We introduce PlanningBench, a framework for generating scalable, diverse, and verifiable planning data for both evaluation and training. PlanningBench starts from real planning scenarios and abstracts practical workflows into a structured taxonomy of more than 30 task types, subtasks, constraint families, and difficulty factors. Guided by this taxonomy, a constraint-driven synthesis pipeline instantiates self-contained planning problems with adaptive difficulty control, quality filtering, and instance-level verification checklists. This shifts planning data construction from fixed benchmark collection to controllable generation while preserving realistic task grounding. We use PlanningBench to evaluate open-source and closed-source frontier LLMs, and find that current models still struggle to produce complete solutions under coupled constraints. Beyond evaluation, reinforcement learning on verified PlanningBench data improves performance on unseen planning benchmarks and broader instruction-following tasks. Further analysis suggests that determinate or well-specified optimal solutions provide clearer reward signals and more stable training dynamics. Overall, PlanningBench provides a controllable source of planning data for diagnosing and improving generalizable planning abilities in LLMs.

[AI-43] CAdam: Context-Adaptive Moment Estimation for 3D Gaussian Densification in Generative Distillation SIGGRAPH2026

链接: https://arxiv.org/abs/2605.20872
作者: SeungJeh Chung,Geonho Park,Misong Kim,HyeongYeop Kang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: Accepted to SIGGRAPH 2026 Conference Papers. 12 pages, 8 figures

点击查看摘要

Abstract:Adaptive densification is the engine of 3D Gaussian Splatting (3DGS). However, when transposed to the optimization-based Generative Distillation paradigm, this reconstruction-native mechanism reveals fundamental limitations, resulting in inefficient representations cluttered with redundant primitives. We diagnose this failure as a Densification Dilemma stemming from the stochastic nature of generative guidance: the standard magnitude-based accumulation indiscriminately aggregates transient noise alongside geometric signals, making it difficult to strike a balance between over-densification and under-fitting. To resolve this, we introduce Context-Adaptive Moment Estimation (CAdam), a novel framework that reinterprets densification as a statistically grounded signal verification problem. CAdam leverages the first moment of gradients to exploit the interference principle, where stochastic fluctuations cancel out via destructive interference while consistent geometric drifts accumulate via constructive interference, effectively disentangling the underlying signal from the generative noise floor. This is further augmented by a quantile-based context awareness and an intrinsic Signal-to-Noise Ratio (SNR) gating mechanism, which ensure robust adaptation across optimization stages and enable the soft termination of densification. Extensive experiments across diverse objectives (SDS, ISM, VFDS) and strong generative 3DGS backbones show that CAdam reduces Gaussian count by 85%-97% relative to standard densification while preserving overall comparable perceptual quality. These results highlight signal-aware density control as a practical way to improve memory efficiency in optimization-based generative distillation.

[AI-44] Runtime-Certified Bounded-Error Quantized Attention

链接: https://arxiv.org/abs/2605.20868
作者: Dean Calver
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 32 pages, 1 figure

点击查看摘要

Abstract:KV cache quantization reduces the memory cost of long-context LLM inference, but introduces approximation error that is typically validated only empirically. Existing systems rely on average-case robustness, with no mechanism to detect or recover from failures at runtime. We present a tiered KV cache architecture that enables runtime-certified attention: INT8 keys and INT4 values are stored in GPU memory, while FP16 originals are retained in system RAM for deterministic fallback. A two-term error decomposition yields per-head, per-step bounds on (i) attention distribution distortion from key quantization and (ii) value reconstruction error. These bounds are computed online and used to drive adaptive precision selection and a multi-stage fallback ladder, which guarantees recovery to the exact dense attention output when required. Across PG-19, NIAH, and RULER benchmarks on LLaMA~3.1-8B with contexts up to 128K, the system matches dense FP16 KV quality within noise for language modelling and retrieval tasks, while recovering catastrophic failures observed in naive INT8/INT4 baselines. Value-sensitive tasks at short context expose a controlled trade-off between compression and fidelity, which can be eliminated via tighter value tolerances or FP16-value fallback. The certification is local (per-head, per-step) and does not guarantee end-to-end model correctness, but ensures that each attention computation is either bounded relative to an FP16 reference or exactly recovered via fallback. This reframes KV cache quantization as a runtime-verified computation rather than a fixed approximation. The goal is not raw speedups, but enabling safe deployment of aggressive KV compression under strict quality constraints.

[AI-45] Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards

链接: https://arxiv.org/abs/2605.20865
作者: Deokgyu Yoon,Hyungkyu Kang,Joongkyu Lee,Byeongchan Kim,Gyungin Shin,Sungrae Park,Min-hwan Oh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) plays a pivotal role in improving the reasoning ability of large language models. However, widely used PPO surrogate objectives are fundamentally local, as they rely on a local approximation of the exact policy gradient objective. While this approximation improves stability by reducing the variance induced by importance sampling, it also introduces structural bias into the surrogate objective, which must be controlled through trust region mechanisms. In this work, we introduce the N -step forward trace, which augments the PPO surrogate objective using the cumulative likelihood ratio of the next N-1 tokens. Building on this idea, we propose N -Step Forward-Trace Policy Optimization (NFPO), a practical RLVR algorithm that integrates the N -step forward trace into the masked policy gradient framework. NFPO provides a continuous bridge between the PPO surrogate objective and the exact policy gradient objective, offering a principled mechanism for controlling the bias-variance trade-off. Our theoretical analysis shows that, with an appropriate choice of N , the proposed objective yields a tighter policy-improvement bound than the standard PPO surrogate. Experiments on comprehensive reasoning benchmarks demonstrate that NFPO consistently improves performance, supporting our theoretical findings.

[AI-46] DISC: Decoupling Instruction from State-Conditioned Control via Policy Generation

链接: https://arxiv.org/abs/2605.20856
作者: Hanxiang Ren,Pei Zhou,Xunzhe Zhou,Yanchao Yang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Language-conditioned manipulation policies typically process instructions and observations through shared network parameters. This task-state entanglement provides a pathway for observation leakage – networks learn scene-to-action shortcuts that bypass language grounding entirely. DISC eliminates this failure structurally. Rather than conditioning a universal policy on language, DISC uses a hypernetwork to generate the entire parameter set of a task-specific visuomotor policy from the instruction alone. The generated policy never directly accesses language; therefore, its task-awareness must come from the language. Consequently, observation leakage has no pathway to emerge. On the other hand, generating coherent high-dimensional policy weights is itself a challenging problem. We address it with a two-stage hypernetwork whose refinement stage embeds the structure of gradient-based optimization as a feed-forward inductive bias, producing globally consistent parameters without actual gradient computation. Trained entirely from scratch on standard data budgets, DISC outperforms all entangled baselines on LIBERO-90 and Meta-World, with advantages that widen on complex, long-horizon tasks – and surpasses the large-scale pretrained \pi_0 despite using no external pretraining data. On a real-world benchmark where all tasks share identical visual context, DISC substantially outperforms entangled alternatives, directly confirming that language-generated policy parameters, not visual shortcuts, drive behavior. The hypernetwork further learns a semantically structured parameter manifold that enables few-shot adaptation from minimal demonstrations and robust generalization across paraphrased instructions. Our code is available at: this https URL.

[AI-47] Conditional Equivalence of DPO and RLHF: Implicit Assumption Failure Modes and Provable Alignment

链接: https://arxiv.org/abs/2605.20834
作者: Zhiqin Yang,Yonggang Zhang,Wei Xue,Dong Fang,Bo Han,Yike Guo
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 49 pages

点击查看摘要

Abstract:Direct Preference Optimization (DPO) has emerged as a popular alternative to Reinforcement Learning from Human Feedback (RLHF), offering theoretical equivalence with simpler implementation. We prove this equivalence is conditional rather than universal, depending on an implicit assumption frequently violated in practice: the RLHF-optimal policy must prefer human-preferred responses. When this assumption fails, DPO optimizes relative advantage over the reference policy rather than absolute alignment with human preferences, leading to pathological convergence where policies decrease DPO loss while preferring dispreferred responses. We characterize when this assumption is violated, show the existence of an undesirable solution space, and prove that DPO and RLHF optimize fundamentally different objectives in such cases. To address this, we introduce Constrained Preference Optimization (CPO), augmenting RLHF with constraints for provable alignment. We further provide a geometric interpretation through soft margin ranking, revealing that DPO implements margin ranking with potentially negative targets. Our theoretical analysis establishes when DPOs’ guarantees hold and provides solutions preserving simplicity with provable alignment. Comprehensive experiments on standard benchmarks demonstrate that CPO achieves state-of-the-art performance. Code is available at: this https URL.

[AI-48] unable MAGMAX: Preference-Aware Model Merging for Continual Learning ICPR2026

链接: https://arxiv.org/abs/2605.20803
作者: Kei Hiroshima,Kento Uchida,Shinichi Shirakawa
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 17 pages, 4 figures. Accepted at ICPR 2026

点击查看摘要

Abstract:Continual learning (CL) aims to train models sequentially on multiple tasks while mitigating catastrophic forgetting of previously learned knowledge. Recent advances in large pre-trained models (LPMs) and model merging techniques, such as MAGMAX, have demonstrated effective CL performance by combining task-specific parameters. However, existing methods primarily focus on average performance across all tasks and do not adequately address how to construct models accommodating different deployment environments or varying user preferences. This paper proposes a model merging framework, termed Tunable MAGMAX, which enables preference-aware control of task-specific performance in CL. Our method introduces a preference vector that controls the number of elements selected from each task vector during model merging, allowing us to adjust the merged model performance according to their deployment needs. We further propose a method for automatically constructing appropriate preference vectors by leveraging small amounts of target environment data and datasets from model training tasks, thereby eliminating the need for manual specification. The experimental result on CL benchmark tasks demonstrates that Tunable MAGMAX effectively controls task-wise performance and successfully adapts merged models to various target environments. The proposed Tunable MAGMAX achieves superior or comparable performance to baseline methods, making it a practical solution for deploying CL models to various environments where the preferences of each task performance differ.

[AI-49] ELSA: An ELastic SNN Inference Architecture for Efficient Neuromorphic Computing ISCA

链接: https://arxiv.org/abs/2605.20802
作者: Kang You,Chen Nie,Lee Jun Yan,Ziling Wei,Cheng Zou,Zekai Xu,Yu Feng,Honglan Jiang,Zhezhi He
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: 17 pages, Proceedings of the 53rd Annual International Symposium on Computer Architecture (ISCA), 2026

点击查看摘要

Abstract:Spiking neural networks (SNNs) exploit event-driven and addition-only computation to substantially improve efficiency for intelligent computation. A key temporal property of SNNs, elastic inference, allows outputs to emerge progressively, enabling responses to salient inputs much earlier than full evaluation. However, existing SNN-specific accelerators cannot capitalize on this property. Layer-by-layer designs emit outputs only after all layers are complete, while time-step-by-time-step designs rely on coarse-grained, layer-wise pipelines that require synchronizing all spines/tokens within a layer. This barrier prevents results from being forwarded immediately, delaying the earliest possible response and forfeiting the benefits of elastic inference. To address these challenges, we propose ELSA, a near-SRAM dataflow architecture that realizes true elastic inference through a fine-grained spine/token-wise pipeline and hardware optimizations tailored to SNNs. ELSA forwards each spine/token immediately upon production, forming a continuous streaming pipeline that substantially reduces the latency to the first response. To enhance this lightweight execution, ELSA introduces a bundled address event representation protocol to lower communication traffic of network-on-chip (NoC), and leverages mini-batch spiking Gustavson-product to cut memory access and exploit inherent sparsity. Combined with mapping and scheduling optimizations, ELSA achieves efficient, event-driven computation without compromising accuracy. Experiments show that SNNs can outperform quantized artificial neural networks (QANNs) while maintaining on-par accuracy. For a 4-bit ResNet-50, ELSA achieves 3.4 \times speedup and 13.6 \times higher energy efficiency over the SOTA QANN accelerator (ANT), and 2.9 \times speedup and 22.1 \times energy efficiency gains over the SOTA SNN accelerator (PAICORE). Comments: 17 pages, Proceedings of the 53rd Annual International Symposium on Computer Architecture (ISCA), 2026 Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.20802 [cs.AR] (or arXiv:2605.20802v1 [cs.AR] for this version) https://doi.org/10.48550/arXiv.2605.20802 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-50] Interaction Locality in Hierarchical Recursive Reasoning

链接: https://arxiv.org/abs/2605.20784
作者: Yosuke Miyanishi,Tetsuro Morimura
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Spatial reasoning requires both location-bound computation and location-invariant structure: agents must make local moves while preserving route, object, or constraint-level plans. We propose interaction locality, a task-geometry-aware framework for measuring whether information flow stays within nearby cells or semantic segments, or crosses them. We instantiate the framework with sparse-autoencoder feature ablations and finite-noise activation patching, with structural Jacobian and attention checks reported in the appendix, and apply it to HRM and TRM, two compact hierarchical and recursive reasoning models, on Maze-Hard, Sudoku Extreme, and ARC-AGI. Across these models, activation patching gives the clearest architectural fingerprint: high-level recurrent states tend to write information within nearby cells or same-segment units, while repeated recursive updates accumulate these local writes into broader solution structure. This pattern holds across maze paths, Sudoku constraints, and ARC-AGI object neighborhoods, with the strongest concentration in TRM. To test whether interaction locality extends beyond toy-yet-challenging grid benchmarks, we also apply it to MTU3D, a large-scale embodied 3D scene-grounding model. In this MTU3D setting, causal spatial locality appears primarily at the transition where visual scene features are handed to the downstream grounding module, rather than uniformly throughout the visual encoder. This contrast suggests that the local-to-global handoff observed in HRM and TRM is tied to explicit recursive reasoning dynamics, while embodied 3D models may concentrate causal spatial structure at module boundaries. Interaction locality turns the intuitive local-execution/global-planning story into a reproducible measurement framework for recursive and embodied spatial reasoning.

[AI-51] Correcting Stochastic Update Bias in Preconditioned Language Model Optimizers

链接: https://arxiv.org/abs/2605.20756
作者: Nikhil Nayak,Julia White,Urchade Zaratiana,Kelton Zhang,Henrijs Princis,Dhruv Atreja,Henry Fawcett,Matthew Thomas,George Hurn-Maloney,Ash Lewis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注: 32 pages, 3 figures, 13 tables

点击查看摘要

Abstract:Preconditioned optimizers are central to language model training, but their stochastic update rules are usually treated as direct approximations to population preconditioned descent. We show that this view misses two finite-sample biases. First, the gradient and preconditioner are typically estimated from the same minibatch, introducing gradient–preconditioner coupling bias. Second, even when the preconditioner estimate is unbiased, its inverse or inverse-root is generally biased because inversion is nonlinear. We propose a single-batch bias-correction framework that addresses both effects: cross-fitted preconditioning estimates the numerator and preconditioner from independent microbatch groups, while variance-corrected inversion uses microbatch variability to subtract the leading delta-method bias term. The framework applies to diagonal moment, diagonal curvature, and matrix preconditioning methods, instantiated in AdamW, Sophia, and Shampoo. Bias correction reduces held-out pretraining loss on Qwen2.5-0.5B by 0.15 , 0.07 , and 0.11 nats, respectively; the effects on mixed-quality pretraining and downstream instruction tuning are consistently neutral-to-positive. Together, these results establish bias correction as a practical mechanism for reducing finite-sample update bias and improving the performance of preconditioned optimizers.

[AI-52] PACD-Net: Pseudo-Augmented Contrastive Distillation for Glycemic Control Estimation from SMBG

链接: https://arxiv.org/abs/2605.20751
作者: Canyu Lei,David Repaske,Jianxin Xie
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Effective diabetes management requires continuous monitoring of glycemic levels. Clinically, glycemic control is assessed using metrics such as Time in Range (TIR), Time Below Range (TBR), and Time Above Range (TAR), typically derived from continuous glucose monitoring (CGM). However, many patients rely on self-monitoring of blood glucose (SMBG) due to the high cost and limited accessibility of CGM. Unlike CGM, SMBG provides sparse and irregular measurements, making accurate estimation of these metrics challenging. Conventional supervised learning approaches struggle under such sparsity, leading to poor generalization and unstable performance. To address this, we propose PACD-Net, a self-supervised contrastive knowledge distillation framework for estimating glycemic control from SMBG. Pseudo-SMBG samples with richer temporal coverage are used as teacher signals to guide learning from sparse observations. In addition, multi-view contrastive learning enforces representation consistency across diverse sampling patterns. The model adopts a hybrid Swin Transformer-CNN backbone to capture temporal dependencies in sparse SMBG sequences. Experimental results demonstrate that PACD-Net consistently outperforms existing methods in estimating TAR, TIR, and TBR from real-world SMBG data, achieving improved accuracy as well as enhanced stability and generalization under extremely sparse observation settings. The proposed framework provides a practical tool for clinical SMBG interpretation and offers a generalizable approach for learning from sparse and irregularly sampled sensor data in broader applications.

[AI-53] he Devil is in the Condition Numbers: Why is GLU Better than non-GLU Structure? ICML2026

链接: https://arxiv.org/abs/2605.20749
作者: Xingyu Lyu,Qianqian Xu,Zhiyong Yang,Peisong Wen,Qingming Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by ICML 2026

点击查看摘要

Abstract:Gated Linear Units (GLU) and their variants are widely adopted in modern open-source large language model architectures and consistently outperform their non-gated counterparts, yet the underlying reasons for this advantage remain unclear. In this work, we study GLU by analyzing two-layer networks in the neural tangent kernel (NTK) regime. Our analysis reveals that the GLU structure reshapes the NTK spectrum, leading to a smaller condition number and a more compact eigenvalue distribution. Building on this finding, we further analyze the resulting training dynamics and show how the reshaped spectrum leads to faster convergence of GLU models, including a characteristic loss-crossing phenomenon observed between GLU and non-GLU models. Finally, we empirically observe that GLU has limited impact in reducing the generalization gap on various models, including ViT and GPT-2, suggesting that its primary benefit lies in accelerating optimization rather than reducing the generalization gap.

[AI-54] Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale

链接: https://arxiv.org/abs/2605.20744
作者: Amit Roth,Ankur Samanta,Matan Halevy,Yoav Levine,Yonathan Efroni
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Project Page - this https URL

点击查看摘要

Abstract:Aligning autonomous agents with human intent remains a central challenge in modern AI. A key manifestation of this challenge is reward hacking, whereby agents appear successful under the evaluation signal while violating the intended objective. Reward hacking has been observed across a wide range of settings, yet methods for reliably measuring it at scale remain lacking. In this work, we introduce a new evaluation paradigm for measuring reward hacking. Whereas prior studies have primarily analyzed it post hoc by inspecting agent trajectories, we instead embed detectable reward hacking opportunities directly into environments. This makes their exploitation verifiable by design, enabling deterministic and automated measurement of whether and how agents exploit such vulnerabilities. We instantiate this approach in \textitTextArena and release \textitHack-Verifiable TextArena , a testbed in which reward hacking can be measured reliably. Using this benchmark, we analyze reward hacking behavior across language models in diverse environments and settings. We open source the code at this https URL.

[AI-55] VBFDD-Agent for Electric Vehicle Battery Fault Detection and Diagnosis: Descriptive Text Modeling of Battery Digital Signals

链接: https://arxiv.org/abs/2605.20742
作者: Joey Chan,Zhen Chen,Ershun Pan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the rapid proliferation of electric vehicles, the safety and reliability of lithium-ion batteries have become critical concerns. Effective anomaly detection is essential for ensuring safe battery operation. However, as battery systems and operating scenarios become increasingly complex, battery fault diagnosis and maintenance require stronger cross-domain adaptability and human-AI collaboration. Traditional fault detection and diagnosis methods are usually designed for specific scenarios and predefined workflows, making them less effective in complex real-world applications. To address the scarcity of open-source battery fault report corpora and the lack of unified maintenance knowledge representation, this study proposes a descriptive text modeling approach for battery signal reports. Monitoring signals, statistical features, anomaly records, and state assessment results are transformed into structured and readable natural language descriptions, forming a language corpus for battery health diagnosis and maintenance. Based on this corpus, we propose VBFDD-Agent, a vehicle battery fault detection and diagnosis agent for automotive-grade battery systems. VBFDD-Agent integrates descriptive battery-state texts, historical case retrieval, local maintenance manuals, and large language model reasoning to generate structured diagnostic results and maintenance recommendations. Experiments show that the proposed framework can accurately perform anomaly monitoring based on descriptive textual representations and provide flexible, efficient, and actionable maintenance suggestions. Expert evaluation further confirms the practical value of the generated recommendations. Overall, VBFDD-Agent extends traditional battery diagnosis from label prediction to interpretable and maintenance-oriented decision support. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2605.20742 [cs.AI] (or arXiv:2605.20742v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2605.20742 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-56] An Application-Layer Multi-Modal Covert-Channel Reference Monitor for LLM Agent Egress

链接: https://arxiv.org/abs/2605.20734
作者: Alfredo Metere
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:A large language model (LLM) agent that sends messages can leak data inside them. Destination allowlists and content scanners do not police whether an otherwise-benign payload is itself a covert channel: a compromised agent encodes bits in zero-width characters, homoglyphs, whitespace, base64, JavaScript Object Notation (JSON) key ordering, message timing or size – and, in binary egress, in least-significant-bit (LSB) pixel planes, per-image mean luminance, inter-image sequence permutation, ultrasonic tones, or audible-band sonified data. Our egress reference monitor has three contributions. (i) A text pipeline of ten capacity-reducing stages, a per-sink leaky-bucket capacity ledger, and a staged posture that enforces lossless stages from day one. (ii) Two media scramblers (a Fourier-domain audio band-limiter and a red-green-blue (RGB) image bit-depth and mean-luminance bucketer) gated by a boot-time cryptographic legitimacy attestation: an auditor publishes at boot the trusted Ed25519 keys and kind, data-class pairs; only payloads with a verifying signature for an authorized class are exempt. The attestation sidesteps the intractable content-based discrimination between real media and data sonified or rasterized as a carrier; unsigned media is suspect by default; a content-addressed canonicalizer closes the inter-image permutation channel. (iii) Residual capacity is the Miller–Madow corrected mutual information between embedded and recovered bits (zero when destroyed), measured by an adversarial ensemble of fifteen working encoders across text, image and audio. The reference implementation drives residual capacity to zero on every destroyable channel and to a stated bound on the one (per-image mean luminance) that cannot be destroyed without ruining the image.

[AI-57] AGPO: Adaptive Group Policy Optimization with Dual Statistical Feedback

链接: https://arxiv.org/abs/2605.20722
作者: Miaobo Hu,Shuhao Hu,Bokun Wang,Ruohan Wang,Xin Wang,Xiaobo Guo,Daren Zha,Jun Xiao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning improves LLM reasoning, but PPO/GRPO typically use fixed clipping and decoding temperature, which makes training brittle and tuning-heavy. We propose Adaptive Group Policy Optimization (AGPO), a critic-free refinement of GRPO that uses group-level statistics to control both update magnitude and exploration. AGPO uses a shared probe-derived statistical state to drive two controllers: (i) adaptive clipping, which sets the trust-region size from reward dispersion and skewness, probe vote entropy, policy entropy, and step-wise KL drift; and (ii) bidirectional adaptive temperature sampling, which heats or cools decoding around a base temperature according to centered uncertainty relative to a running baseline. On nine English and Chinese math/STEM benchmarks, Qwen2.5-14B trained with AGPO outperforms PPO/GRPO under the same generated-token budget, reaching 67.3% on GSM8K and 40.5% on MATH. Gains transfer to Llama-3-8B and Gemma-2-9B, and ablations confirm both modules are complementary. Our implementation is publicly available at this https URL.

[AI-58] Llamas on the Web: Memory-Efficient Performance-Portable and Multi-Precision LLM Inference with WebGPU

链接: https://arxiv.org/abs/2605.20706
作者: Reese Levine,Rithik Sharma,Nikhil Jain,Abhijit Ramesh,Zheyuan Chen,Neha Abbas,James Contini,Tyler Sorensen
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 19 pages, 11 figures, 5 tables

点击查看摘要

Abstract:Running language models in the browser presents a unique opportunity to build efficient, private, and portable AI applications, but requires contending with constrained memory availability and heterogeneous hardware targets. To realize this opportunity, we present Llamas on the Web (LlamaWeb), a WebGPU backend for this http URL that enables memory-efficient and performance-portable LLM inference across a wide range of model weight formats in the browser. Our design significantly reduces memory overhead through static memory planning and efficient model loading, addresses cross-device variability through a tunable kernel library, and introduces templated GPU kernels that support performant implementations of numerous quantization formats, enabling broad model support and extensibility to new formats. We evaluate LlamaWeb on 16 devices from 8 vendors, collecting data from 10 language models and four model weight formats. We compare LlamaWeb against existing browser-based LLM frameworks and find that LlamaWeb requires 29-33% less memory across several combinations of device, browser, and operating system. We also evaluate LlamaWeb’s performance against these frameworks and find that it increases decode throughput by 45-69% across four GPUs from separate vendors. In addition, we compare LlamaWeb’s performance against other this http URL backends, where it is competitive with and even beats vendor-specific backend performance on some devices. Comments: 19 pages, 11 figures, 5 tables Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2605.20706 [cs.DC] (or arXiv:2605.20706v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2605.20706 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-59] Declarative Data Services: Structured Agent ic Discovery for Composing Data Systems

链接: https://arxiv.org/abs/2605.20690
作者: Shanshan Ye,Duo Lu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Shanshan Ye and Duo Lu contributed equally to this work

点击查看摘要

Abstract:Agentic discovery has shown that LLM-driven search can find novel algorithms, designs, and code under benchmark conditions. Translating the paradigm to multi-system data backends surfaces a harder problem: the search space is heterogeneous, the verifier is whether a deployed stack actually runs, and composition knowledge is unevenly captured in pretraining. Unbounded agentic discovery, a coding agent iterating on failure-log feedback, fails to converge consistently on a working stack even when iteration and explicit composition knowledge are added. We propose Declarative Data Services (DDS), an architecture for structured agentic discovery of data-system compositions from declarative user intent. The framework owns four typed contracts at successive layers (intent, operator DAG, per-system skills, runtime attribution) that decompose the global search into bounded sub-searches; sub-agents search each typed space, while the framework provides the channels by which knowledge flows forward as inline skill citations and errors route backward as typed signals. As a proof of life on a trading-backend workload, DDS converges where unbounded discovery does not; runtime failures become skill patches that the next deployment cites inline. We position this as an early prototype reporting lessons from real-world data-system composition.

[AI-60] Dynamic TMoE: A Drift-Aware Dynamic Mixture of Experts Framework for Non-Stationary Time Series Forecasting ICML2026

链接: https://arxiv.org/abs/2605.20678
作者: Jiawen Zhu,Shuhan Liu,Di Weng,Yingcai Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 27 pages, 7 figures. Accepted to ICML 2026

点击查看摘要

Abstract:Non-stationary time series forecasting is challenged by evolving distribution shifts that static models struggle to capture. While Mixture-of-Experts (MoE) architectures offer a promising paradigm for decoupling complex drift patterns, existing approaches are limited by fixed expert pools and memoryless routing, hampering their ability to adapt to abrupt regime shifts. To address this, we propose Dynamic TMoE, a framework that unifies architectural evolution with temporal continuity during learning phase. By detecting distribution shifts via Maximum Mean Discrepancy (MMD), we dynamically instantiate heterogeneous experts and prune redundant ones to optimize capacity. Additionally, a temporal memory router leverages recurrent states and an anomaly repository to ensure stable, context-aware expert selection without requiring test-time updates. Experiments on nine benchmarks demonstrate state-of-the-art performance, reducing MSE by 10.4% and MAE by 7.8%. Code is available at this https URL.

[AI-61] REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak ICML2026

链接: https://arxiv.org/abs/2605.20654
作者: Jiachen Ma,Jiawen Zhang,Xiangtian Li,Bo Zou,Chaochao Lu,Chao Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICML 2026

点击查看摘要

Abstract:While Large Language Models (LLMs) demonstrate remarkable capabilities, they remain susceptible to sophisticated, multi-step jailbreak attacks that circumvent conventional surface-level safety alignment by exploiting the internal generation process. To address these vulnerabilities, we propose Reflector, a principled two-stage framework that internalizes self-reflection within the generation trajectory. Reflector first leverages teacher-guided generation to produce high-quality reflection data for supervised fine-tuning (SFT), establishing structured reflection patterns. It subsequently uses Reinforcement Learning (RL) with outcome-driven and reward-validity supervision to instill robust, autonomous self-reflection capabilities. Empirical results show that Reflector achieves Defense Success Rates (DSR) exceeding 90% against complex indirect attacks while generalizing robustly across diverse threat scenarios. Notably, the framework enhances both task-specific and general utility, yielding a 5.85% gain on GSM8K alongside improved performance on knowledge-intensive benchmarks. By internalizing trajectory-level safety, Reflector overcomes the fundamental limitations of surface alignment without significant computational overhead, offering an efficient and scalable solution for the development of safe and capable LLMs.

[AI-62] Jointly Learning Predicates and Actions Enables Zero-Shot Skill Composition

链接: https://arxiv.org/abs/2605.20648
作者: Benedict Quartey,Sebastian Castro,Eric Rosen,Wil Thomason,George Konidaris,Stefanie Tellex
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Learning from Demonstration (LfD) enables robots to learn complex behaviors from expert examples, yet existing approaches often fail to generalize to new compositions of known skills without retraining. Modern generative policies model distributions over action trajectories alone, thus are unable to reason about the symbolic outcomes required for robust composition. We propose that skills should jointly model action trajectories and the symbolic outcomes they induce. To address this gap, we introduce Predicate Action Skills (PACTS), a class of closed-loop visuomotor policies that model skills as a joint generative process over action and predicate belief trajectories, producing coherent action-outcome rollouts within a single model. Jointly generating actions and predicates enables PACTS to learn internal representations that improve both action generation and predicate classification. Furthermore, we demonstrate zero-shot composition of learned skills via planning by leveraging online predicate predictions from PACTS as a symbolic interface for sequencing and monitoring execution. Project website: this https URL

[AI-63] Design for Manufacturing: A Manufacturability Knowledge-Integrated Reinforcement Learning Framework for Free-Form Pipe Routing in Aeroengines

链接: https://arxiv.org/abs/2605.20644
作者: Caicheng Wang,Zili Wang,Shuyou Zhang,Yongzhe Xiang,Zheyi Li,Liangyou Li,Jianrong Tan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Design for manufacturing plays a critical role in advanced aeroengine development, where complex components necessitate careful consideration of manufacturability. However, current practices in pipe routing remain largely decoupled from down-stream manufacturing, leading to labor-intensive, trial-and-error iterations to achieve manufacturable designs. To address this problem, this study proposes the Frenet-based pipe routing optimization (FPRO) framework, a manufacturability knowledge-integrated reinforcement learning approach for free-form pipe design in aeroengines. FPRO formulates the routing problem as a boundary value problem in the Frenet frame. In this framework, the pipe path is represented by curvature and torsion profiles, which are generated using cubic Hermite interpolation. To integrate design and manufacturing, domain-specific manufacturing knowledge is embedded as constraints on the permissible ranges of curvature and torsion. The path optimization is performed using the proximal policy optimization algorithm with stochastic exploration and a stage-guided reward mechanism. A unified mapping formulation then translates the optimized path into motion trajectories for the bending die, enabling direct fabrication on a six-axis free-bending machine. Experimental results demonstrate that FPRO consistently generates collision-free, manufacturable paths with smoother geometric profiles compared to Cartesian-based methods. It also achieves faster convergence and superior performance in terminal alignment, path length, obstacle avoidance, and manufacturability compared to state-of-the-art reinforcement learning baselines. Real-world validation confirms the close geometric correspondence between the manufactured pipe and its digital design, validating the practical feasibility of FPRO.

[AI-64] rusted Weights Treacherous Optimizations? Optimization-Triggered Backdoor Attacks on LLM s

链接: https://arxiv.org/abs/2605.20641
作者: Yifei Wang,Tianlin Li,Xiaohan Zhang,Yida Yang,Xiaoyu Zhang,Li Pan
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 20 pages, 3 figures

点击查看摘要

Abstract:Inference optimization is a vital technique for deploying LLMs at scale. Compilation is the most widely adopted optimization technique for LLMs. While it assumes semantic equivalence between the original and compiled graphs, we first uncover its numerical side effects can be maliciously exploited to implant stealthy backdoors in LLMs. We propose a unified optimization-triggered attack framework comprising two complementary strategies. Without any modification to the compiler or hardware, one strategy flips predictions for specific inputs only when the model is compiled, while the other uses a universal trigger that remains dormant under uncompiled execution but hijacks arbitrary inputs once compilation optimization is applied. Both attacks bypass standard safety evaluations run without compilation. We empirically demonstrate that these optimization-triggered backdoors achieve attack success rates averaging 90% across four mainstream open-source LLMs and four tasks, while clean accuracy is preserved at nearly 100% under all settings. Our findings reveal a novel attack surface at the intersection of optimization and security in the LLM deployment pipeline, and we investigate practical defenses to mitigate this threat.

[AI-65] Evaluating Temporal Semantic Caching and Workflow Optimization in Agent ic Plan-Execute Pipelines

链接: https://arxiv.org/abs/2605.20630
作者: Alimurtaza Mustafa Merchant,Krish Veera,Sajal Kumar Goyla,Shambhawi Bhure,Dhaval Patel,Kaoutar El Maghraoui
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 13 pages, 8 figures, 3 appendices

点击查看摘要

Abstract:Industrial asset operations workflows are latency-sensitive because a single user query may require coordination over sensor data, work orders, failure modes, forecasting tools, and domain-specific agents. We evaluate this problem on AssetOpsBench (AOB), an industrial agent benchmark whose plan-execute pipeline exposes repeated overhead from tool discovery, LLM planning, MCP tool execution, and final summarization. Existing LLM caching techniques such as KV-cache reuse and embedding-based semantic caching were designed for chatbot serving and break down when output validity depends on time, asset, or sensor parameters. We propose two complementary optimization layers for AOB plan-execute pipelines: a temporal semantic cache and a set of MCP workflow optimizations combining disk-backed tool-discovery caching and dependency-aware parallel step execution. MCP workflow optimizations corresponded to a 1.67x speedup and reduced median end-to-end latency by about 40.0% while the temporal-cache benchmark achieved a median of 30.6x speedup on cache hits. Beyond the speedup, our results expose a concrete failure mode of pure semantic caching for parameter-rich industrial queries, providing a critical analysis of how caching choices interact with evaluation correctness in MCP-backed agent benchmarks.

[AI-66] COAgents : Multi-Agent Framework to Learn and Navigate Routing Problems Search Space

链接: https://arxiv.org/abs/2605.20618
作者: Oleksandr Yakovenko,Mahdi Mostajabdaveh,Cheikh Ahmed,Abdullah Ali Sivas,Xiaorui Li,Zirui Zhou,Mao Kun
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at LION 2026, The Learning and Intelligent Optimization Conference

点击查看摘要

Abstract:Although Vehicle Routing Problems (VRP) are essential to many real-world systems, they remain computationally intractable at scale due to their combinatorial complexity. Traditional heuristics rely on handcrafted rules for local improvements and occasional \textitjumps to escape local minima, but often struggle to generalize across diverse instances. We introduce \textbfCOAgents, a cooperative multi-agent framework that models the search process as a graph: nodes represent solutions, and edges correspond to either local refinements or large perturbations for diversification (i.e., jumps). A \textitPartial Search Graph (PSG) is dynamically constructed during search, enabling COAgents to train a Node Selection Agent and a Move Selection Agent to guide intensification, and a Jump Agent to trigger well-timed explorations of new regions. Unlike end-to-end learning approaches, COAgents cleanly separates problem-agnostic search control from compact domain-specific encoding, facilitating adaptability across tasks. Extensive experiments on the CVRP and VRPTW benchmarks show that COAgents remains competitive with several learn-to-search baselines on CVRP and sets a new state of the art among learning-based methods on the more challenging VRPTW instances, reducing the gap to the best-known solutions by 14% at N!=!100 and 44% at N!=!50 relative to the strongest neural solver (POMO), and by 21% and 40% respectively relative to ALNS. Code is available at this https URL. Comments: Accepted at LION 2026, The Learning and Intelligent Optimization Conference Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2605.20618 [cs.AI] (or arXiv:2605.20618v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2605.20618 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-67] From Automated to Autonomous: Hierarchical Agent -native Network Architecture (HANA)

链接: https://arxiv.org/abs/2605.20608
作者: Binghan Wu,Shoufeng Wang,Yunxin Liu,Ya-Qin Zhang,Joseph Sifakis,Ye Ouyang
机构: 未知
类目: Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注: This manuscript has been accepted by IEEE Networking Letters

点击查看摘要

Abstract:Realizing Level 4/5 Autonomous Networks (AN) demands a shift from static automation to agent-native intelligence. Current operations, reliant on rigid scripts, lack the cognitive agency to handle off-nominal conditions. To address this, this letter proposes a hierarchical multi-agent reference architecture enabling high-level autonomy. The framework features a Dual-Driven Orchestrator that coordinates specialized Executive Agents, supported by a shared Public Memory for unified domain knowledge. A key innovation is the integration of agent self-awareness, which empowers the system to harmonize deliberative strategic governance with reflexive fault recovery. We instantiate and validate this architecture within a 5G Core environment. Case studies demonstrate that the system sustains critical throughput under congestion and reduces Mean Time to Repair (MTTR) by 86%, confirming its efficacy in unifying strategic planning with operational resilience.

[AI-68] Mahjax: A GPU-Accelerated Mahjong Simulator for Reinforcement Learning in JAX

链接: https://arxiv.org/abs/2605.20577
作者: Soichiro Nishimori,Shinri Okano,Keigo Habara,Sotetsu Koyamada,Eason Yu,Masashi Sugiyama
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Riichi Mahjong is a multi-player, imperfect-information game characterized by stochasticity and high-dimensional state spaces. These attributes present a unique combination of challenges that mirror complex real-world decision-making problems in reinforcement learning. While prior research has heavily relied on supervised learning from human play logs to pre-train the policy, algorithms capable of learning \textittabula rasa (from scratch) offer greater potential for general applicability, as evidenced by the AlphaZero lineage. To facilitate such research, we introduce \textbfMahjax, a fully vectorized Riichi Mahjong environment implemented in JAX to enable large-scale rollout parallelization on Graphics Processing Units (GPUs). We also provide a high-quality visualization tool to streamline debugging and interaction with trained agents. Experimental results demonstrate that Mahjax achieves throughputs of up to \textbf2 million and \textbf1 million steps per second on eight NVIDIA A100 GPUs under the no-red and red rules, respectively. Furthermore, we validate the environment’s utility for reinforcement learning by showing that agents can be trained effectively to improve their rank against baseline policies.

[AI-69] Complementing reinforcement learning with SFT through logit averag ing in the post training of LLM s

链接: https://arxiv.org/abs/2605.20555
作者: Xingwei Gan,Ying Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce a novel method that averages the logits of a frozen reference policy (e.g., SFT) and a trainable policy, and incorporate the method into Group Relative Policy Optimization (GRPO). In contrast to Reinforcement Learning with Verifiable Rewards (RLVR) methods, our proposal does not involve a Kullback Leibler (KL) regularization or critic; the trainable policy and the reference anchor are coupled through the logit averaging structure to leverage the reasoning expertise of the trainable policy while maintaining the formatting advantage of SFT. Our method is evaluated on MATH, cn-k12, and MMLU, and the results show a higher accuracy or at least comparable accuracy relative to the canonical KL-regularized GRPO.

[AI-70] Latent Process Generator Matching

链接: https://arxiv.org/abs/2605.20547
作者: Lukas Billera,Hedwig Nora Nordlinder,Ben Murrell
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 18 pages, 1 figure

点击查看摘要

Abstract:Many recent flow-matching and diffusion-style generative models rely on auxiliary stochastic dynamics during training: a richer process is simulated to define conditional targets, but the auxiliary state is either intractable to sample at generation time or simply not part of the desired output. Existing Generator Matching theory formalises conditioning on static latent random variables, and several recent papers prove special cases of projection results for particular augmented-state constructions. We introduce latent process generator matching, a general framework that treats the observed generative state as a deterministic image X_t=\Phi(Y_t) of a tractable Markov process Y_t . We show that in this setting one may learn the generator of a stochastic process on the image space which has the same one-time marginal distributions as the projected process. This generalizes and subsumes the discrete latent process results from the literature, and extends Generator Matching from static latent variables to a rich family of time-dependent latent conditional processes.

[AI-71] Axiomatizing Neural Networks via Pursuit of Subspaces

链接: https://arxiv.org/abs/2605.20534
作者: Mehmet Yamac,Mert Duman,Ugur Akpinar,Felix Rojas Casadiego,Serkan Kiranyaz,Marcel van Gerven,Moncef Gabbouj
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 43 pages, 25 figures. Code and additional materials will be released

点击查看摘要

Abstract:While deep neural networks have achieved remarkable success across a wide range of domains, their underlying mechanisms remain poorly understood, and they are often regarded as black boxes. This gap between empirical performance and theoretical understanding poses a challenge analogous to the pre-axiomatic stage of classical geometry. In this work, we introduce the Pursuit of Subspaces (PoS) hypothesis, an axiomatic framework that formulates neural network behavior through a set of geometric postulates. These axioms, together with their derived consequences, provide a unified perspective on representation, computation, and generalization in both shallow and deep architectures. We show that this framework yields geometric explanations for fundamental questions in deep learning, including representation structure, architectural mechanisms, and generalization behavior, offering a principled step toward a coherent theoretical foundation.

[AI-72] Machine-Learning-Enhanced Non-Invasive Testing for MASLD Fibrosis: Shallow-Deep Neural Networks Versus FIB-4 Tabular Foundation Models and Large Language Models

链接: https://arxiv.org/abs/2605.20523
作者: Athanasios Angelakis,Gabriele De Vito,Eleni-Myrto Trifylli,Filomena Ferrucci
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注: 26 pages, 4 figures, 3 tables. Preprint

点击查看摘要

Abstract:Advanced fibrosis is a major determinant of liver-related morbidity in metabolic dysfunction-associated steatotic liver disease (MASLD). FIB-4 is widely used as a first-line non-invasive test, but its fixed formula may underuse diagnostic information contained in age, aspartate aminotransferase, alanine aminotransferase, and platelet count. We evaluated whether machine-learning-enhanced non-invasive testing (MLE-NIT) can improve advanced fibrosis detection while preserving this FIB-4 variable space. We used three biopsy-confirmed MASLD cohorts from China, Malaysia, and India (n=784). The Chinese cohort was split into 486 training and 54 internal validation/tuning patients; final performance was reported only on the Malaysian and Indian external cohorts. Models used five variables: age, FIB-4, aspartate aminotransferase, platelet count, and alanine aminotransferase. We compared FIB-4 with a shallow-deep neural network (s-DNN), TabPFN, and gpt-4o-2024-08-06. FIB-4 achieved external ROC-AUCs of 0.75 and 0.60 in Malaysia and India, respectively. TabPFN achieved 0.69 and 0.66, fine-tuned GPT-4o achieved 0.75 and 0.63, and the s-DNN achieved 0.77 and 0.67, respectively. The s-DNN contained only 354 trainable parameters, compared with 7,244,554 for TabPFN, yet provided a more balanced external operating profile. Calibration showed s-DNN Brier scores of 0.18 and 0.22, and permutation importance identified AST and FIB-4 as dominant variables. Compact non-linear MLE-NITs may enhance FIB-4-based fibrosis assessment without increasing clinical data requirements. Comments: 26 pages, 4 figures, 3 tables. Preprint Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM) Cite as: arXiv:2605.20523 [cs.LG] (or arXiv:2605.20523v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.20523 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-73] Open-World Evaluations for Measuring Frontier AI Capabilities

链接: https://arxiv.org/abs/2605.20520
作者: Sayash Kapoor,Peter Kirgis,Andrew Schwartz,Stephan Rabanser,J.J. Allaire,Rishi Bommasani,Harry Coppock,Magda Dubois,Gillian K Hadfield,Andrew B. Hall,Sara Hooker,Seth Lazar,Steve Newman,Dimitris Papailiopoulos,Shoshannah Tekofsky,Helen Toner,Cozmin Ududec,Arvind Narayanan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Benchmark-based evaluation remains important for tracking frontier AI progress. But it can both overstate and understate deployed capability because it privileges tasks that can be precisely specified, automatically graded, easy to optimize for, and run with low budgets and short time horizons. We advocate for a complementary class of evaluations, which we term open-world evaluations: long-horizon, messy, real-world tasks assessed through small-sample qualitative analysis rather than benchmark-scale automation. In this paper we survey recent open-world evaluations, identify their strengths and limitations, and introduce CRUX (Collaborative Research for Updating AI eXpectations), a project for conducting such evaluations regularly. As a first instance, we task an AI agent with developing and publishing a simple iOS application to the Apple App Store. The agent completed the task with only a single avoidable manual intervention, suggesting that open-world evaluations can provide early warning of capabilities that may soon become widespread. We conclude with recommendations for designing and reporting open-world evals.

[AI-74] Codec-Robust Attacks on Audio LLM s

链接: https://arxiv.org/abs/2605.20519
作者: Jaechul Roh,Jean-Philippe Monteuuis,Jonathan Petit,Amir Houmansdar
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Prior attacks on Audio Large Language Models (Audio LLMs) demonstrated that carefully crafted waveform-domain perturbations can force targeted adversarial outputs. As a defense mechanism against these attacks, real-world codec compression preprocessing has been studied to both detect and remove the perturbations. Yet no existing attack has demonstrated robustness against these compressions. We introduce CodecAttack, which optimizes a perturbation in a neural audio codec’s continuous latent space rather than directly perturbing the audio waveform. We show that the codec’s compression channel, which discards waveform perturbations, transmits perturbations crafted in its own latent space. To further harden the attack across real-world compression channels, we apply multi-bitrate straight-through Expectation-over-Transformation (EoT), all without modifying the target model. Across three realistic Audio LLM deployment scenarios and three target models, CodecAttack achieves an average 85.5% target-substring attack success rate (ASR) on Opus at moderate bitrates, while the waveform baseline trained with identical EoT hardening does not exceed 26% at any bitrate. The attack transfers to held-out codecs, reaching up to 100% ASR on MP3 and 84% on AAC-LC without retraining. A per-band energy analysis shows that the latent perturbation concentrates below 4kHz, exactly where codecs allocate the most bits, while the waveform baseline spreads into higher frequencies that codecs discard. These results demonstrate that lossy compression is not a reliable defense against adversarial audio and that codec-aware attacks pose a practical threat to deployed Audio LLM systems.

[AI-75] ECUASn: A family of metrics for principled evaluation of uncertainty-augmented systems

链接: https://arxiv.org/abs/2605.20490
作者: Lautaro Estienne,Erik Ernst,Matías Vera,Pablo Piantanida,Luciana Ferrer
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: pre-print, 9-pages paper, 25 pages total

点击查看摘要

Abstract:In high-stakes automated decision-making, access to predictive uncertainty is essential for enabling users – human or downstream systems – to accept or reject predictions based on application-specific cost trade-offs. Such uncertainty-augmented (UA) systems – i.e., systems that output both predictions and uncertainty scores – are currently being assessed in the literature in a variety of ways, using separate metrics to evaluate the predictions and the uncertainty scores, setting a cost function with a fixed rejection cost or integrating over a coverage-risk curve. We argue that these evaluation approaches are inadequate for assessing overall performance of the UA system for decision making under uncertainty and propose a novel family of metrics, \ECUASn, formulated as proper scoring rules for the task of interest. The parameter n controls the trade-off between the cost of incorrect predictions and imperfect uncertainties depending on the needs of the use-case. We demonstrate the advantages of the \ECUASn metrics both theoretically and empirically, through experiments on diverse classification and generation datasets, including a manually annotated subset of TriviaQA.

[AI-76] Code Generation by Differential Test Time Scaling

链接: https://arxiv.org/abs/2605.20473
作者: Yifeng He,Ethan Wang,Jicheng Wang,Xuanxin Ouyang,Hao Chen
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 main text, 21 pages with references

点击查看摘要

Abstract:Test-time scaling has emerged as a promising approach for improving code generation by exploring large solution spaces at inference time. However, existing methods often rely on public test cases that are unavailable in practice, or require extensive LLM inference for candidate selection, leading to significant token consumption and time overhead. We present DiffCodeGen, a novel test-time scaling method for code generation based on coverage-guided differential analysis. DiffCodeGen generates diverse code candidates using various sampling and prompting strategies, then applies coverage-guided fuzzing to synthesize inputs without requiring any existing tests or large language models. By executing all candidates on these inputs, DiffCodeGen captures their dynamic behavior and clusters candidates based on behavioral similarity. DiffCodeGen selects the medoid of the largest cluster as the final output. Unlike prior test-time scaling methods that invoke additional LLM inference for candidate selection, DiffCodeGen performs selection without any extra model calls, incurring little to no additional token consumption. DiffCodeGen is fully asynchronous, naturally suited to the current trend of agentic coding, and is thus efficient and highly scalable. We evaluate DiffCodeGen across 4 large language models, demonstrating consistent improvements over baselines. Compared to state-of-the-art test-time scaling methods, DiffCodeGen achieves competitive or superior performance while using only a fraction of time and tokens. DiffCodeGen is model-agnostic and can be combined with reasoning models to further boost performance.

[AI-77] High Quality Embeddings for Horn Logic Reasoning

链接: https://arxiv.org/abs/2605.20467
作者: Yifan Zhang,Yasir White,Dean Clark,Joseph Sanchez,Jevon Lipsey,Ashely Hirst,Jeff Heflin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Neural networks can be trained to rank the choices made by logical reasoners, resulting in more efficient searches for answers. A key step in this process is creating useful embeddings, i.e., numeric representations of logical statements. This paper introduces and evaluates several approaches to creating embeddings that result in better downstream results. We train embeddings using triplet loss, which requires examples consisting of an anchor, a positive example, and a negative example. We introduce three ideas: generating anchors that are more likely to have repeated terms, generating positive and negative examples in a way that ensures a good balance between easy, medium, and hard examples, and periodically emphasizing the hardest examples during training. We conduct several experiments to evaluate this approach, including a comparison of different embeddings across different knowledge bases, in an attempt to identify what characteristics make an embedding well-suited to a particular reasoning task.

[AI-78] LLM Pretraining Shapes a Generalizable Manifold: Insights into Cross-Modal Transfer to Time Series

链接: https://arxiv.org/abs/2605.20449
作者: Alexis Roger,Prateek Humane,Zhenghan Tai,Gwen Legate,Andrei Mircea,Vasilii Feofanov,Irina Rish
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Can language-pretrained transformers become effective time-series forecasters, and why? In this paper, we show that cross-modal transfer arises because language pretraining preconditions time series training with a reusable manifold. A linear probe on frozen LLM states decodes realistic time-series trajectories without paired supervision, and retrieval in this projected space yields competitive forecasts, showing that structure and dynamics exist before finetuning. Pretrained initialization also improves optimization, producing coherent gradients and a highly anisotropic loss landscape unlike random initialization. Finetuning then acts as low-dimensional alignment, reusing existing directions rather than learning temporal primitives from scratch, as evidenced by low-rank updates, subspace alignment, and shared features for periodicity, trend, and repetition. Together, these results support a geometric account of LLM-to-time-series transfer: language pretraining builds the manifold, and finetuning projects numerical dynamics onto task-relevant directions.

[AI-79] Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics

链接: https://arxiv.org/abs/2605.20441
作者: Lucky Verma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 28 pages, 11 figures, 5 tables. Code and aggregate JSONs: this https URL . Per-run JSONs: this https URL . Lean 4/mathlib v4.29.0 formal checks available in the code repository

点击查看摘要

Abstract:Transformers trained on modular arithmetic exhibit sharp transitions between memorization, generalization, and collapse. We show that weight decay acts as a scalar empirical control parameter for these regimes, and introduce two cheap online diagnostics, mean pairwise attention-head cosine similarity and entropy standard deviation, that track training dynamics from attention activations alone and complement loss-landscape diagnostics at lower compute cost. Across eleven experimental conditions and three model scales (0.82M to 85M parameters), the weight-decay axis separates memorization, developmental grokking, and collapse. A near-transition logistic fit localizes the memorization-to-developmental boundary at \lambda_c=0.0158 (95% CI [0.0109, 0.0200], N=210); a power-law fit gives an empirical exponent \nu=0.757 (CI [0.725, 0.799]). Reference exponents \nu=1/2 and 3D Ising \nu \approx 0.63 lie outside this empirical CI under our four-bin grid, so we report \nu as empirical and defer universality-class identification to denser finite-size-scaling work. A horizon-matched multi-task replication (n=280, four modular operations) preserves the weight-decay control pattern; a paired attention-head re-initialization experiment at \lambda=0.05 changes Phase-2 amplitude (Cohen’s d=-1.190 , n=10, p_t=4.5 \times 10^-3 ), while matched weight-norm clipping does not. Three cross-architecture probes (4L MLP, 4L LSTM, and 4L Mamba; each n=70) replicate the weight-decay-controlled transition with architecture-specific \lambda_c values. Main diagnostic claims are scoped to modular arithmetic in small transformer attention models; the non-attention experiments are scope probes, and architecture-wide, language-model, and universality-class claims are out of scope.

[AI-80] Group-Algebraic Tensors: Provably-optimal Equivariant Learning and Physical Symmetry Discovery

链接: https://arxiv.org/abs/2605.20440
作者: Paulina Hoyos,Shashanka Ubaru,Dongsung Huh,Vasileios Kalantzis,Kenneth L. Clarkson,Misha Kilmer,Haim Avron,Lior Horesh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Rings and Algebras (math.RA)
备注:

点击查看摘要

Abstract:We introduce the \star_G tensor algebra, in which any finite group G defines the multiplication rule, making equivariance an intrinsic algebraic property rather than an architectural constraint. The framework rests on three machine-verified theoretical pillars: (i)~an Eckart-Young optimality guarantee for the \star_G -SVD: the first such result for symmetry-preserving tensor approximation, exact and polynomial-time; (ii)~a Kronecker factorization that composes multiple symmetries by replacing F_G with F_G_1 \otimes F_G_2 with no architectural redesign; and (iii)~a 600-line Lean~4 formalization of the \star_G algebra. The framework provides capabilities that equivariant neural networks (ENNs) structurally cannot: a closed-form per-irreducible-representation decomposition of every prediction, and data-driven discovery of the symmetry group that best fits a dataset. As a non-trivial empirical demonstration, decomposing QM9 molecular geometry over the chiral octahedral subgroup of SO(3) recovers the Wigner–Eckart selection rules of angular momentum from data alone, with no quantum mechanical input: scalar properties are A _1 -dominated, dipole components are T _1 -dominated, the isotropic polarizability is uniquely insensitive to l!=!1 as the rank-2-trace decomposition l!=!0 \oplus l!=!2 requires, and the T _1 /A _1 predictive-power ratio separates vector observables from scalar observables by a factor of five. On full QM9 (130,831 molecules), \star_G -SVD with ridge regression provides closed form predictions at \sim50-90\times fewer parameters than parameter-matched MLPs. Algebraic equivariance thus complements architectural equivariance not as a faster-better-cheaper alternative but as a different mathematical affordance: provably-optimal symmetry-preserving compression, per-irrep interpretability, and data-driven physical discovery.

[AI-81] Agent Co-op: Retrieval-Based Synthesis of Interoperable Multi-Agent Workflows

链接: https://arxiv.org/abs/2605.20425
作者: Shuaike Shen,Wenduo Cheng,Shike Wang,Mingqian Ma,Jian Ma
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Designing multi-agent workflows is especially difficult in open-ended scientific settings where tasks lack curated training sets, reliable scalar evaluation metrics, and standardized interfaces between existing tools and agents. We propose AgentCo-op, a retrieval-based synthesis framework that composes reusable skills, tools, and external agents into executable workflows through typed artifact handoffs, then applies bounded self-guided local repair to implicated components when execution evidence indicates failure. In two open-world genomics case studies, AgentCo-op composes independently developed scientific agents and external tool repositories into auditable workflows without redesigning them or running global topology search. It coordinates specialized agents for spatial transcriptomics and gene-set interpretation to enable collaborative discovery from spatial transcriptomics data, and builds a parallel workflow for cross-modality marker analysis on single-cell multiome data. AgentCo-op can also import a searched workflow as a structural prior and improve it by grounding nodes with retrieved components and applying local repair, showing that synthesis and search are complementary. On six coding, math, and question-answering benchmarks, AgentCo-op achieves the best result on four benchmarks and the best average score under a unified backbone setting, while consistently reducing per-task cost relative to multi-agent baselines. Together, these results suggest that retrieval-based synthesis can extend automated agentic workflow design beyond benchmark-optimized agent graphs to open-world workflows built from existing agents, tools, and typed artifacts.

[AI-82] OSCToM: RL-Guided Adversarial Generation for High-Order Theory of Mind

链接: https://arxiv.org/abs/2605.20423
作者: Sharmin Sultana Srishty,Kazi Mahathir Rahman,Malaika Parizat Sakkhi,Samia Shahid Prianna,Shaikhul Islam Sinat
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15 pages, 12 figures containing 15 images, 3 tables. Code available at this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) perform well on many language tasks, but their Theory of Mind (ToM) reasoning is still uneven in complex social settings. Existing benchmarks, including ExploreToM, do not always test the recursive beliefs and information asymmetries that make these settings difficult. This paper presents OSCToM (Observer-Self Conflict Theory of Mind), an approach for modeling nested belief conflicts in LLM-based ToM tasks. The key case is one in which an observer’s view of another agent conflicts with the observer’s own belief state. Such cases go beyond simple perspective-taking and require recursive, multi-layered reasoning. OSCToM combines reinforcement learning (RL), an extended domain-specific language, and compositional surrogate models to generate observer-self conflicts. In our experiments, OSCToM-8B gives the best overall result among the systems tested. It improves on the reported ExploreToM results on FANToM and remains competitive on Hi-ToM and BigToM. On the information-asymmetric FANToM benchmark, OSCToM reaches 76% accuracy, compared with the 0.2% reported by ExploreToM. The data-synthesis procedure is also 6x more efficient, indicating that targeted training data can help smaller models handle advanced cognitive reasoning. The project code is available at this https URL.

[AI-83] Decomposing MXFP4 quantization error for LLM reinforcement learning: reducible bias recoverable deadzone and an irreducible floor

链接: https://arxiv.org/abs/2605.20402
作者: Xiaocan Li,Shiliang Wu,Zheng Shen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:MXFP4 arithmetic can dramatically accelerate reinforcement learning (RL) post-training of large language models (LLMs), yet the quantization error introduces severe accuracy degradation. Existing work treats the quantization error as a monolithic noise term, missing the distinct mechanisms upon interpreting how quantization error damages training. We prove an exact three-way decomposition of quantization error and show how each component dominates a distinct RL training pathway. Our theoretical and empirical analysis decomposes the MXFP4 quantization error into three additive components: “scale bias” from power-of-two rounding, “deadzone truncation” from zeroing small values, and “grid noise” from rounding to the nearest 4-bit grid. Each component dominates a distinct RL failure mode: scale bias accumulates multiplicatively through the backward pass, affecting gradient accuracy; deadzone truncation degrades rollout quality; and grid noise raises the policy’s entropy. We combine corrections that are RL failure mode-targeted but not component-exclusive: Macro-block scaling to reduce scale bias, Outlier Fallback recovers deadzone entries, but also partially reduces scale bias induced error, and Adaptive Quantization Noise (AQN) for controlling the policy entropy. On Qwen2.5-3B dense and Qwen3-30B-A3B-Base mixture-of-experts model, the targeted corrections recover BF16 accuracy to within 0.7% and 3.0% respectively.

[AI-84] Nonlocal operator learning for fMRI encoding and decoding tasks

链接: https://arxiv.org/abs/2605.20389
作者: Andreas Kramer,Saugat Acharya,Alice Giola,Emanuele Zappala
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18 pages, 4 figures, 5 tables. Comments are welcome!

点击查看摘要

Abstract:Functional MRI data exhibit high-dimensional spatiotemporal structure, making both prediction and decoding challenging. In this work, we investigate neural integral-operator-based models for encoding and decoding tasks in fMRI, with particular emphasis on the role of nonlocal spatiotemporal context. We implement a latent neural integral operator framework that performs fixed point iterations in an auxiliary space from which classification and stimuli prediction is performed via a decoder. We evaluate our model on two open-source fMRI datasets. Our experiments examine both decoding of stimuli from fMRI recordings and encoding of fMRI dynamics from stimulus representations. A main focus is the effect of spatiotemporal context: we systematically compare short and long temporal windows, as well as the use of visual cortex vs whole brain recordings, and analyze their influence on performance and latent-space geometry. Across tasks and datasets, larger temporal windows generally improve results and produce more structured learned representations. In decoding experiments, the learned latent space often provides clearer class separation than the raw data. In encoding experiments, although absolute performance remains moderate due to the difficulty of the task, longer temporal windows still yield consistent gains. These findings suggest that neural integral operators provide a promising framework for modeling fMRI dynamics and that broader spatiotemporal context can be beneficial for both prediction and representation learning. More broadly, the results indicate that exploiting distributed nonlocal structure in brain dynamics requires model architectures specifically designed to capture such dependencies. Comments: 18 pages, 4 figures, 5 tables. Comments are welcome! Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.20389 [cs.LG] (or arXiv:2605.20389v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.20389 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-85] Security Document Classification with a Fine-Tuned Local Large Language Model: Benchmark Data and an Open-Source System

链接: https://arxiv.org/abs/2605.20368
作者: Ivan Dobrovolskyi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Organizations that scan documents for sensitive information face a practical problem. Cloud services require data to be sent to external infrastructure, while rule-based tools often miss threats that depend on context. This study presents TorchSight, an open-source local system for security document classification built around a fine-tuned Qwen 3.5 27B model. The model was trained on 78,358 samples from 13 permissively licensed sources and GPT-4 synthetic data covering seven security categories and 51 subcategories. In the main evaluation on 1,000 documents, the model reached 95.0% category-level accuracy (95% confidence interval: 93.5-96.2). The tested commercial models scored 75.4-79.9% under the same prompting protocol. On a separate external set of 500 held-out samples, the model reached 93.8% accuracy, which suggests that performance extends beyond the main benchmark, although the margin depends on dataset composition and difficult boundary cases. The results show that a fine-tuned local model can support accurate security document classification while keeping document processing under local control.

[AI-86] Consistently Informative Soft-Label Temperature for Knowledge Distillation

链接: https://arxiv.org/abs/2605.20357
作者: Hoang-Chau Luong,Nghia Van Vo,Kaiqi Zhao,Lingwei Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Knowledge distillation (KD) transfers knowledge from a high-capacity teacher to a compact student by matching their predictive distributions, with temperature scaling serving as a central mechanism for smoothing teacher predictions and exposing informative “dark knowledge” beyond the hard label. However, the standard fixed-temperature design is inherently sample-agnostic. Since samples differ in logit scale and learning difficulty, a single global temperature produces teacher soft labels with highly inconsistent entropy: some predictions remain overly sharp and provide limited inter-class information, whereas others become over-smoothed and lose class-discriminative information. Moreover, sharing the same temperature between teacher and student further imposes rigid logit-scale alignment despite their capacity mismatch. To address these limitations, we propose CIST (Consistently Informative Soft-label Temperature), which assigns separate sample-wise adaptive temperatures to the teacher and student. This design produces consistently informative teacher soft labels while relaxing rigid teacher–student logit-scale matching. It also reweights the distillation objective according to teacher confidence and student learning difficulty. Theoretically, we show that teacher-label entropy is largely governed by the ratio between the maximum teacher logit and the temperature, providing a principled basis for adaptive smoothing. Empirically, CIST mitigates the inconsistency induced by fixed temperature, and experiments on both vision and language distillation tasks show consistent improvements over standard KD and strong baselines with negligible computational overhead.

[AI-87] Causal Unlearning in Collaborative Optimization: Exact and Approximate Influence Reversal under Adversarial Contributions

链接: https://arxiv.org/abs/2605.20341
作者: Ali Mahdavi,Azadeh Zamanifar,Amirfarhad Farhadi,Omid Kashefi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Performance (cs.PF)
备注:

点击查看摘要

Abstract:Federated learning systems must support data deletion requests to comply with privacy regulations, yet retraining from scratch after each deletion is computationally prohibitive. We present HF-KCU, a method that removes a client’s contribution by approximating the influence function through conjugate gradient iterations in Krylov subspaces, reducing complexity from O(d^3) to O(kd) where kd.A causal weighting mechanism ensures that only clients holding the deleted data receive parameter updates, preventing spurious changes to unaffected clients. Our method is designed to handle bounded adversarial perturbations to the Hessian and gradient, providing graceful degradation under realistic threat models. We validate HF-KCU across convolutional (ResNet-18, SimpleCNN) and transformer (ViT-Lite) architectures on CIFAR-10, MNIST, and Fashion-MNIST. On CIFAR-10 under Dirichlet (alpha=0.5) partitioning, HF-KCU achieves 47.75 times speedup over retraining while maintaining test accuracy within 0.60% of the rational baseline(71.16 vs 71.76 %). Membership inference attacks on the forget set yield success rates of 0.499 matching the retrained model and confirming effective privacy restoration. We provide convergence guarantees showing that the Krylov approximation error decreases as O((k ^1/2-1)/(k^1/2+1)) where k is the Hessian condition number. The causal weighting mechanism ensures surgical updates, where only clients holding deleted data are modified, preserving model quality for unaffected participants and avoiding the instability of gradient-based approaches in asynchronous federated settings. This design provides interpretability as each update is directly traceable to the influence of the deleted data. The method’s efficiency and precision make it suitable for production federated systems where deletion requests arrive asynchronously and computational budgets are constrained.

[AI-88] Less Data Faster Training: repeating smaller datasets speeds up learning via sampling biases ICML2026

链接: https://arxiv.org/abs/2605.20314
作者: Jingwen Liu,Ezra Edelman,Surbhi Goel,Bingbin Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICML 2026

点击查看摘要

Abstract:This work investigates the ``small-vs-large gap’', where repeating on fewer samples can lead to compute saving during training compared to using a larger dataset. This is observed across algorithmic tasks, architectures and optimizers and cannot be explained using prior theory. We argue that the speedup comes from appropriate layer-wise growth enabled by sampling biases, which is more pronounced when the dataset size is smaller. We provide both theoretical analysis and empirical evidence from various interventions. Our results suggest that using a smaller dataset with more repetitions is not just a fallback strategy under data scarcity, but can be proactively leveraged as a favorable inductive biases for optimization, particularly in reasoning tasks.

[AI-89] Robust Subspace-Constrained Quadratic Models for Low-Dimensional Structure Learning

链接: https://arxiv.org/abs/2605.20300
作者: Zheng Zhai,Xiaohui Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this paper, we propose a robust subspace-constrained quadratic model (SCQM) for learning low-dimensional structure from high-dimensional data. Building upon the subspace-constrained quadratic matrix factorization (SQMF) framework, the proposed model accommodates a broad class of noise distributions, including generalized Gaussian and radial Laplace models. This generalization enables reliable performance under both heavy-tailed and light-tailed noise, thereby substantially enhancing robustness across diverse data regimes. To efficiently address the resulting nonconvex optimization problem, we develop a gradient-based algorithm equipped with a backtracking line-search strategy that ensures stable and efficient convergence. In addition, we present a sensitivity analysis of the \ell_p^p and \ell_2 loss functions, elucidating their distinct behaviors under varying noise characteristics. Extensive numerical experiments corroborate the theoretical analysis and demonstrate that the proposed approach consistently outperforms existing methods in terms of robustness and reconstruction accuracy.

[AI-90] Mechanisms of Misgeneralization in Physical Sequence Modeling

链接: https://arxiv.org/abs/2605.20299
作者: Kento Nishi,Raphael Tang,Karun Kumar,Core Francisco Park,Hidenori Tanaka
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Preprint. this http URL

点击查看摘要

Abstract:Generative sequence models are often trained to plan motion in physical domains, from robotics to mechanical simulations. When constructing a dataset to train such a model, engineers may curate demonstrations to specify how trajectories should be distributed over a physical quantity like travel distance or mechanical energy. For example, a roboticist building a maze navigation agent might choose demonstrations whose travel distances cover a fixed range uniformly, hoping to constrain the agent’s expected power usage. We find that standard deep learning can violate this intent: each generated trajectory can seem plausible on its own, but the aggregate distribution over the physical quantity is wrong. We call this failure physical misgeneralization, and develop an account of its mechanism. Using controlled synthetic tasks, we show that physical misgeneralization arises when local errors typical of the model class propagate through the physical measurement to shift the recovered distribution. We estimate these errors with a data deviation kernel, and we use it to predict which physical quantities gain or lose mass in both our synthetic and more applied maze navigation and double-pendulum motion tasks. Finally, our mechanistic interpretation helps identify which mitigation strategies are structurally promising, and we use it to propose a kernel-informed intervention.

[AI-91] Spectral Unforgetting: Post-Hoc Recovery of Damaged Capabilities Without Retraining

链接: https://arxiv.org/abs/2605.20296
作者: Aarash Abro,Muhammad Tahir
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Fine-tuning a language model for a target task routinely degrades capabilities the training data never explicitly threatened. We study this phenomenon, known as catastrophic forgetting, and propose a post-hoc repair solution that uses only the pretrained checkpoint W_\mathrmbase and its fine-tuned descendant W_\mathrmft . The goal is not merely to revert the model toward the base checkpoint, but to recover capabilities damaged by fine-tuning while preserving both the target-task gains and any beneficial held-out improvements. We introduce DG-Hard, a checkpoint-only spectral repair method for the fine-tuning update \Delta = W_\mathrmft - W_\mathrmbase . DG-Hard treats \Delta as a low-rank task-aligned signal embedded in an IID-like noise residual that gradient descent has no incentive to remove, and applies the Donoho-Gavish hard singular-value threshold to each weight-delta matrix, keeping the structured high-energy part of the update and removing the spectral bulk. This reduces repair to a closed-form SVD filtering step requiring no data-dependent tuning. A central difficulty is evaluation: average accuracy hides per-benchmark failures, while naive recovery scores reward models that simply revert toward the base. We therefore introduce a partition-conditional metric that separately tracks healing, preservation, non-damage, and target-task retention. Across 14 (model, task) settings and nine cross-domain held-out benchmarks, DG-Hard achieves the strongest balanced repair among post-hoc baselines. DG-Hard also restores safety alignment degraded by benign fine-tuning on three independent safety axes, despite using no alignment data. These results suggest that part of fine-tuning-induced capability loss is not an unavoidable consequence of specialization, but a removable spectral residue in the weight update itself.

[AI-92] Quant.npu: Enabling Efficient Mobile NPU Inference for on-device LLM s via Fully Static Quantization

链接: https://arxiv.org/abs/2605.20295
作者: Jinghe Zhang,Daliang Xu,Chenghua Wang,Weikai Xie,Tao Qi,Yun Ma,Mengwei Xu,Gang Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed on mobile devices, where Neural Processing Units (NPUs) necessitate fully static quantization for optimal inference efficiency. However, existing post-training quantization (PTQ) methods predominantly rely on dynamic activation quantization, rendering them incompatible with NPU hardware constraints. To bridge the gap between high-fidelity PTQ and NPU-constrained inference, we propose this http URL, a integer-only fully static quantization framework. It incorporates learnable quantization parameters and rotation matrices, enabling low-bit activation-weight quantization without runtime quantization parameters re-computation. Crucially, we identify that initialization and selective optimization of quantization parameters is pivotal for optimization stability, as improper initialization and naive joint optimization induce gradient instability that disrupts the optimization of rotation matrices. To address this, we propose a rotation-and-bit-width-aware initialization tailored to diverse activation profiles and a distribution-aware selective optimization (two-stage quantization pipeline) tailored to rotated and unrotated tensors. Furthermore, we introduce a sensitivity-guided adaptive mixed-precision scheme to balance accuracy with inference efficiency. Extensive experiments on real-world mobile NPUs demonstrate that this http URL achieves comparable accuracy to state-of-the-art methods, while reducing inference latency by up to 15.1%.

[AI-93] Closed-form predictive coding via hierarchical Gaussian filters

链接: https://arxiv.org/abs/2605.20293
作者: Aleksandrs Baskakovs,Sylvain Estebe,Kenneth Enevoldsen,Kristoffer Nielbo,Chris Mathys,Nicolas Legrand
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Predictive coding (PC) offers a local and biologically grounded alternative to backpropagation in the training of artificial neural networks, yet to date, it remains slower, and performance degrades sharply as network depth increases. We trace both problems to a single simplification: current PC networks fix the precision matrix to the identity, discarding precision-weighted prediction errors that the variational derivation requires to be fast, local, and Bayesian. We close this gap by expressing predictive coding networks as deep hierarchical Gaussian filters (HGFs) and restore precision-weighted message passing, yielding dynamic uncertainty estimates and Hebbian-compatible update rules at every layer. The resulting networks can simultaneously learn activations, weights, and precisions under a single free-energy objective, with no global error signal, and resolve inference without requiring iterations or automatic differentiation. On FashionMNIST, our solution approaches backpropagation in epoch-level wall-clock cost while converging in fewer epochs, and outperforms it on online, data efficiency, and concept-drift tasks. We thus establish that closed-form variational inference with online precision learning provides a tractable foundation for deep predictive coding networks, retaining biological and interpretative advantages, without requiring iterative relaxation or global error signals.

[AI-94] Plug-and-Play Spiking Operators: Breaking the Nonlinearity Bottleneck in Spiking Transformers ICML2026

链接: https://arxiv.org/abs/2605.20289
作者: Xinzhe Yuan(1),Xiang Peng(1),Bin Gu(2),Huan Xiong(1) ((1) IASM, Harbin Institute of Technology, (2) School of Artificial Intelligence, Jilin University)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to ICML 2026. 9 pages main paper, 8 pages appendix, 6 figures, 5 tables. Correspondence to Bin Gu and Huan Xiong

点击查看摘要

Abstract:ANN-to-SNN conversion offers a practical, training-free route to spiking large language models. However, current pipelines primarily focus on spike-driven realizations for Transformer linear-algebra operations, while providing limited support for key nonlinear operators. This gap limits compatibility with neuromorphic-style execution constraints, where such nonlinearities typically require division, exponentiation, or norm computations that are not naturally supported by standard leaky integrate-and-fire dynamics. To solve this problem, we propose a plug-and-play framework that implements spike-friendly approximations for Transformer nonlinearities and integrates into existing ANN-to-SNN pipelines. Our method decomposes these nonlinear computations into three recurring primitives – division, exponentiation, and \ell_2 norms – and realizes them via population computation using LIF neuron groups, combined with lightweight bit-shift scaling to avoid floating-point arithmetic. By composing these primitives as modular operator blocks, our framework supports common Transformer nonlinearities (e.g., Softmax, SiLU, and normalization) without any fine-tuning. Experiments on a range of LLMs Transformers show that selectively replacing the targeted nonlinear operators incurs less than a 1% accuracy drop across all evaluated tasks.

[AI-95] Introspective X Training: Feedback Conditioning Improves Scaling Across all LLM Training Stages

链接: https://arxiv.org/abs/2605.20285
作者: Brandon Cui,Ximing Lu,Jaehun Jung,Syeda Nahida Akter,Hyunwoo Kim,Yuxiao Qu,David Acuna,Shrimai Prabhumoye,Yejin Choi,Prithviraj Ammanabrolu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We tackle the question of how to scale more efficiently across the many, ever-growing stages of current LLM training pipelines. Our guiding intuition stems from the fact that the dynamics of later stages of the pipeline, e.g. post-training, can be used to inform earlier stages such as pre-training. To this end, we propose Introspective Training (or IXT), inspired by offline reward-conditioned reinforcement learning and applicable to any stage of training. IXT uses a thinking reward model to annotate data with natural language critique based feedback, enabling quality aware training from the earliest stages of the pipeline. Models are then trained by prefix-conditioning the data with the generated feedback – ensuring that not all tokens are treated equally starting much earlier in training than usual. Comprehensive experiments on 7.5-12B transformer-based dense LLMs trained from scratch all the way up to 18 Trillion tokens seen show that our method: bends scaling curves resulting in up to 2.8x more compute efficiency generally; and reaches performance levels unachievable for models trained otherwise in domains such as math and code.

[AI-96] PolycubeNet: A Dual-latent Diffusion Model for Polycube-Based Hexahedral Mesh Generation

链接: https://arxiv.org/abs/2605.20274
作者: Lu He,Qitao Deng,Junjiang Deng,Liangbin Deng,Yanjun Liang,Wenting Yang,Guoqiang Wang,Na Lei
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Hexahedral meshes are widely used in simulation pipelines, yet automatic generation remains challenging for complex CAD geometries. Polycube-based hexahedral meshing is a representative approach due to its regular, parameterization-friendly structure, but existing polycube construction methods often rely on intricate surface segmentation and local heuristics, which can produce artifacts or fail on difficult shapes. In this paper, we propose an end-to-end framework for polycube generation based on conditional diffusion models. Given an input geometry represented as a point cloud, our method directly produces a corresponding polycube point cloud, eliminating the need for explicit surface segmentation or predefined polycube templates. At the core of our approach is a dual-latent conditional diffusion architecture that confines computationally expensive self-attention operations to a fixed-capacity, low-dimensional latent space. This design effectively decouples computational complexity from the resolution of both the input geometry and the output polycube, thereby avoiding the quadratic cost typical of point cloud self-attention mechanisms while supporting flexible input and output resolutions. To obtain a hexahedral mesh, the generated polycube is aligned to the input shape via rigid and non-rigid point cloud registration to establish surface correspondence, followed by a polycube-to-hex pipeline. We additionally create and release a paired dataset of CAD meshes and their corresponding polycube meshes, together with the core implementation of our model. Experiments show that PolycubeNet generalizes to complex CAD models with arbitrary genus and produces high-quality polycube structures within seconds, improving robustness and efficiency over prior learning-based approaches. Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.20274 [cs.GR] (or arXiv:2605.20274v1 [cs.GR] for this version) https://doi.org/10.48550/arXiv.2605.20274 Focus to learn more arXiv-issued DOI via DataCite

[AI-97] Modality-Decoupled Online Recursive Editing

链接: https://arxiv.org/abs/2605.20273
作者: Siyuan Li,Youyuan Zhang,Fangming Liu,Jing Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Online model editing for multimodal large language models (MLLMs) requires assimilating a stream of corrections under tight compute and memory budgets. Yet editors developed for text-only LLMs often degrade on MLLMs: visually dominant activations skew the statistics that shape updates, causing cross-modal conflict, while sequential writes become entangled in a shared edit space and amplify long-horizon interference, causing inter-edit interference. To address these, we propose M-ORE, a modality-decoupled online recursive editor for lifelong MLLM adaptation. M-ORE is derived from a unified proximal-projection formulation and admits a closed-form update with a Sherman-Morrison recursion, yielding constant per-edit overhead. It maintains module-wise locality statistics for the text stack and the visual projector to avoid visually dominated update shaping and performs continual updates in a fixed orthogonal low-rank edit subspace via a Sherman-Morrison recursion to mitigate long-horizon interference. Experiments on multiple MLLM backbones and online editing benchmarks show that our M-ORE method consistently improves reliability, generality, and locality over strong baselines, while achieving favorable quality-efficiency scaling. Our code is publicly available at this https URL.

[AI-98] Smaller Abstract State Spaces Enable Cross-Scale Generalization in Reinforcement Learning

链接: https://arxiv.org/abs/2605.20272
作者: Nasehatul Mustakim,Lucas Lehnert
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While humans readily generalize abstract concepts to more complex or larger tasks, building Reinforcement Learning (RL) systems with this ability remains elusive. Here, we present the first theoretical model of how such Out-of-Distribution (OOD) generalization can be achieved in RL agents. Our approach considers Partially Observable Markov Decision Processes (POMDPs) and assumes that an intelligent agent uses an abstraction function to determine which experiences can be treated as equivalent and which must be distinguished. First, we extend the existing state abstraction framework and proof techniques to POMDPs. Then, we define a successor-weighted model reduction, a model reduction variant that enables compression into smaller abstract spaces than prior definitions allow. We derive a bound on the agent’s OOD test performance, thereby defining the conditions under which OOD generalization is achievable. This bound decomposes an agent’s performance loss into approximation and estimation errors, revealing how reducing an agent’s abstract state space size improves test performance and OOD generalization. Our analysis suggests that constraining an agent to operate over a small, finite set of abstract states is necessary for achieving generalization to more complex tasks. Our results motivate further research into learning RL architectures that scale across tasks of varying complexity levels.

[AI-99] Conformal Selective Acting: Anytime-Valid Risk Control for RLVR-Trained LLM s

链接: https://arxiv.org/abs/2605.20270
作者: Hamed Khosravi,Xiaoming Huo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:A local specialist LLM, fine-tuned with reinforcement learning from verifiable rewards (RLVR) on operator-local data, is installed in a regulated organization with per-deployment error budget \alpha . The operator needs a safety certificate for this deployment’s stream at every round: no pooling across deployments, no waiting for a long-run average. Existing wrappers cannot deliver this on adaptive, online-updated streams: offline conformal-risk methods require exchangeability; online-conformal methods bound only long-run averages; non-exchangeable extensions are marginally valid; and the closest anytime wrapper, A-RCPS, controls marginal rather than selective risk. Using a (test statistic, validity guarantee, deployment rule) framework, we identify one empty cell forced by deployment requirements: e-process per threshold, selective risk, anytime-pathwise validity, max-certified-threshold rule. Conformal Selective Acting (CSA) fills it as a per-round wrapper maintaining a Ville-type e-process per threshold on a Bonferroni grid, evaluated against the RLVR filtration. Under predictable updates and isotonic-calibrated monotone risk we prove (i) an anytime-pathwise selective-risk bound R_T^\mathrmact\le\alpha+O(N_T^-1/2) , (ii) rate-optimal certification matching \Theta(\bar\eta^-2\log(1/\delta)) , and (iii) a horizon-independent release-rate gap. Across eight specialist benchmarks ( 480 streams), sixteen adversarial distribution-shift cells ( 160 streams), and five live Expert-Iteration RLVR cells with online LoRA over four base models in three architecture families ( 10,300 rounds), CSA is the only method among ten compared that satisfies pathwise validity and non-refusing deployment on every cell. We do not propose a new LLM, training algorithm, or policy class; CSA is the deployment-side complement, orthogonal to the model, for operators who cannot use a frontier API.

[AI-100] Catching a Moving Subspace: Low-Rank Bandits Beyond Stationarity

链接: https://arxiv.org/abs/2605.20269
作者: Hamed Khosravi,Xiaoming Huo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Many bandit deployments (recommendation, clinical dosing, ad targeting) share two facts prior work handles only in isolation: rewards live on a low-dimensional latent subspace, and that subspace drifts. Stationary low-rank bandits exploit rank but break under subspace change; non-stationary linear bandits adapt to drift but pay ambient rate \widetildeO(d\sqrtT) . We study piecewise-stationary low-rank linear contextual bandits with scalar feedback: \theta_t = B_k^\star w_t with rank- r factor B_k^\star\in\mathbbR^d\times r constant within each of K unknown segments and able to shift at boundaries. Our results are tight along three axes. (i) Identification boundary. With single-play scalar rewards, the moving subspace is recoverable through quadratic functionals of rewards iff three probe-side conditions hold: known noise variance, bounded state-noise coupling, and full-dimensional probe support. Each is necessary in the unrestricted-second-moment problem, and jointly they are sufficient, characterizing the boundary of the solvable region. (ii) Algorithm and dynamic regret. SPSC interleaves isotropic probes with windowed projected ridge-UCB exploitation inside the learned r -dimensional subspace; a CUSUM-style variant discovers segment boundaries online. The costed dynamic regret is \widetildeO(r\sqrtT)+\widetildeO(T^2/3)+O(W,V_\mathrmin) , replacing the ambient d\sqrtT rate with the intrinsic rank. (iii) Empirics. On eleven benchmarks spanning synthetic, UCI/MovieLens, semi-synthetic clinical, and ZOZOTOWN production-log data, SPSC outperforms non-stationary and low-rank baselines whenever d-r\gtrsim T^1/6 , matching the analytical crossover. To our knowledge, this is the first work to characterize the identification boundary and attain the intrinsic-rank dynamic-regret rate in this setting.

[AI-101] Residual Paving: Diagnosing the Routing Bottleneck in Selective Refusal Editing

链接: https://arxiv.org/abs/2605.20262
作者: Bryce Hinkley,Peyman Najafirad
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We study selective refusal editing as a three-way control problem: induce non-refusal on designated edit prompts while preserving benign behavior and harmful refusals outside the edit set. We introduce Residual Paving, a routed residual editing method for frozen instruction-tuned transformers that separates route selectivity, whether to intervene, from residual-edit capacity, what edit to apply. An early-layer router predicts a scalar gate and expert mixture; when active, prompt-conditioned bottleneck residual experts apply later-layer residual updates while leaving the backbone unchanged. This decomposition supports an oracle-routing diagnostic where only the learned scalar gate is replaced with the held-out edit/keep label, leaving the residual editor and frozen backbone fixed. On the primary Gemma-3-4B-IT held-out split, learned Residual Paving reduces edit refusal from 88.6% to 4.0%, with 95.5% benign distribution preservation and 87.3% harmful distribution preservation. Same-protocol one-direction steering controls are much weaker on edit success, leaving edit refusal at 86.8% for Edit-target ActAdd and 78.9% for DIM-style refusal steering. The remaining failure is off-target harmful-keep degradation: harmful refusal remains below the frozen-base rate, 65.3% vs. 81.6%. Across six backbones, oracle routing improves the keep-side diagnostic score on every reported row, with median gain +12.9 pp, supporting the interpretation that learned route selectivity is the main observed bottleneck. Trajectory diagnostics on two backbones further suggest directed movement toward edit-target continuations rather than generic refusal suppression.

[AI-102] It Takes Two: Complementary Self-Distillation for Contextual Integrity in LLM s

链接: https://arxiv.org/abs/2605.20258
作者: Sangwoo Park,Woongyeong Yeo,Seanie Lee,Yumin Choi,Hyomin Lee,Kangsan Kim,Jinheon Baek,Seong Joon Oh,Sung Ju Hwang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 28 pages, 16 figures

点击查看摘要

Abstract:Contextual Integrity (CI) defines privacy not merely as keeping information hidden, but as governing information flows according to the norms of a given context. As large language models are increasingly deployed as personal agents handling sensitive workflows, adhering to CI becomes critical. However, even frontier models remain unreliable in making disclosure decisions, and existing mitigation strategies often degrade underlying task performance. To overcome this privacy-utility trade-off, we propose SELFCI, a complementary self-distillation framework that decouples information suppression from task resolution. SELFCI jointly optimizes two independent reverse KL divergences over distinct teacher distributions derived from feedback: one encourages preserving task-relevant information for utility, while the other enforces minimal and appropriate disclosure. This complementary formulation induces a Product-of-Experts (PoE) target, aligning the policy with the intersection of capability and privacy requirements. Empirical evaluations demonstrate that SELFCI, without relying on costly external supervision, consistently outperforms competitive baselines such as online reinforcement learning algorithms (e.g., GRPO). These trends further extend to out-of-domain settings involving agentic workflows and accumulated private context, suggesting that SELFCI provides a practical path toward CI alignment.

[AI-103] Instance Discrimination for Link Prediction

链接: https://arxiv.org/abs/2605.20257
作者: Valentin Cuzin-Rambaud(SyCoSMA, DM2L, LIRIS, UCBL),Mathieu Lefort(LIRIS, SyCoSMA, IRISA, MALT, UR),Rémy Cazabet(DM2L, LIRIS, UCBL, IXXI)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, instance discrimination models have emerged as a major solution for self-supervised learning. Having already demonstrated its effectiveness in the image domain, instance discrimination learning is now proving equally convincing in the graph domain, in particular for node classification. However, fewer contributions have tackled the link prediction task. In this contribution, we propose to adapt existing methods to this context. We first provide a rigorous evaluation of existing self-supervised models in the field of link prediction, showing that the main performance depends on the augmentation process (like in computer vision). We then propose a new structural augmentation based on the community structure that is relevant for link prediction. Our main contribution introduces two new models, L-GRACE and L-BGRL, based on link representations instead of node representations, which improve the performance of the existing methods, especially on unattributed graphs, and we show that they perform on par with the state of the art, both in supervised and self-supervised contexts.

[AI-104] FBOS-RL: Feedback-Driven Bi-Objective Synergistic Reinforcement Learning

链接: https://arxiv.org/abs/2605.20256
作者: Xikai Zhang,Yongzhi Li,Likang Xiao,Yingze Zhang,Yanhua Cheng,Quan Chen,Peng Jiang,Wenjun Wu,Liu Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning has become a cornerstone for aligning and unlocking the reasoning capabilities of large-scale models. At its core, the training loop of GRPO and its variants alternates between rollout sampling and policy update. Unlike supervised learning, where each gradient step is anchored to an explicit ground-truth target, the optimal gradient direction for updating model parameters in this setting is not known a priori; the high-quality rollouts drawn during the sampling stage therefore act as the implicit “teacher” that guides every parameter update. However, GRPO adopt a simple sampling scheme that conditions all rollouts on the same original prompt. When a task lies beyond the policy model’s current capability, this sampling scheme rarely yields a high-quality rollout, leaving the policy model without a meaningful gradient direction when updating its parameters, which causes training to stall. To address this issue, we propose FBOS-RL, a Feedback-Driven Bi-Objective Synergistic reinforcement learning framework. Specifically, we let the model perform Feedback-Guided Exploration Enhancement based on the feedback provided by the environment, and on top of this we design two mutually reinforcing training objectives: Exploitation-oriented Policy Alignment(EPA) and Exploration-oriented Capability Cultivation(ECC). Extensive experiments demonstrate that EPA and ECC can mutually reinforce each other, forming a positive flywheel effect that significantly improves both the training efficiency and the final performance ceiling of reinforcement learning. Specifically, under an identical number of rollouts, FBOS-RL learns substantially faster than GRPO and feedback-based baselines and ultimately attains a higher performance ceiling, while exhibiting higher policy entropy and lower gradient norms throughout training.

[AI-105] ProcBench: Evaluating Process-Level Defects and Control Preservation in LLM Coding Agents

链接: https://arxiv.org/abs/2605.20251
作者: Jiawei He,Jie Jia,Chenbo Liu,Chaoyi Xue,Yapeng Song,Xikai Yang,Dong Sun
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 21 pages, 8 figures

点击查看摘要

Abstract:Existing benchmarks for LLM coding agents mainly evaluate final outcomes, such as task completion, compilation success, and test pass rates. While these metrics are useful for measuring end-task capability, they provide limited visibility into how an execution unfolds and often miss recurrent process-level failures that arise during multi-step operation. We present ProcBench, a benchmark-oriented framework for evaluating coding-agent trajectories through process defects and control preservation. ProcBench organizes execution failures into a reusable ontology, standardizes heterogeneous logs into a unified trajectory representation, and reports calibrated risk-based scorecards instead of relying only on final outcomes. We instantiate ProcBench on an annotated set of 200 trajectories and apply it across three coding-agent benchmarks: AndroidBench, TerminalBench, and SWE-bench-Verified. Our results suggest that ProcBench can be instantiated with useful reliability, that calibration improves the empirical interpretability of defect findings relative to direct thresholding, and that process-aware scorecards provide diagnostic distinctions beyond conventional outcome-based evaluation. We also discuss limitations, including annotation dependence, partial observability for some defect classes, and the need for broader external validation.

[AI-106] Automated Kernel Discovery Towards Understanding High-dimensional Bayesian Optimization

链接: https://arxiv.org/abs/2605.20249
作者: Taeyoung Yun,Woocheol Shin,Inhyuck Song,Jaewoo Lee,Jinkyoo Park
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 36 pages, 27 figures, 12 tables

点击查看摘要

Abstract:Gaussian Process (GP) kernels are central to Bayesian optimization (BO), yet designing effective kernels for high-dimensional problems still relies on extensive manual engineering. Existing automated approaches struggle in high dimensions for two bottlenecks: their kernel search space is limited to additions and multiplications of base kernels, and LLM-based approaches require conditioning on raw observations, which becomes infeasible due to context-length limits and the difficulty of extracting meaningful patterns. We introduce \textbfKernel Discovery, a LLM-driven evolutionary framework for high-dimensional BO that searches a broader kernel space beyond predefined composition rules and does not require conditioning on observations. Motivated by the observation that directly prompting an LLM to generate kernel code yields syntactically varied but functionally identical kernels, we adopt a two-stage approach: an LLM first proposes novel mathematical forms, then a second LLM call converts each form into validated, executable code. We also propose a leave-one-out continuous ranked probability score (LOO-CRPS) as a selection criterion that penalizes overfitted kernels. On five high-dimensional BO benchmarks, our method achieves an average rank of \textbf1.2 out of 17, outperforming competitive baselines. We further analyze the discovered kernels to identify which kernels lead to improvements in high-dimensional BO. Comments: 36 pages, 27 figures, 12 tables Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.20249 [cs.LG] (or arXiv:2605.20249v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.20249 Focus to learn more arXiv-issued DOI via DataCite

[AI-107] GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents

链接: https://arxiv.org/abs/2605.20246
作者: Xiongbin Wu,Zhihao Luo,Shanzhe Lei,Lechao Zhang,Xuhong Wang,Jie Yang,Zhonglong Zheng,Yuanjie Zheng,Xin Tan,Wei Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, vision-language model (VLM) agents have shown promising progress in open-world tasks, where successful task completion often requires multiple turns of visual perception and action execution. However, existing methods still rely primarily on Supervised Fine-Tuning (SFT) with expert demonstrations, while the advanced reinforcement learning (RL) algorithm, specifically Group Relative Policy Optimization (GRPO), has not been effectively employed for multi-turn RL in these tasks because standard GRPO requires full trajectories as training samples which leads to excessively long context and noise. To address this issue, we propose GROW, a RL framework for open-world VLM agents that decomposes collected trajectories into state-action samples, and computes advantages between these samples rather than treating a full trajectory as a single entity. We further provide a surrogate analysis indicating that, even though the grouped samples are conditioned on different local states rather than an identical prompt context, the objective can preserve the core relative policy optimization signal of GRPO under simplifying assumptions. Experiments on more than 800 Minecraft tasks show that our method achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of our proposed RL framework for open-world VLM agents.

[AI-108] LEAP: A closed-loop framework for perovskite precursor additive discovery

链接: https://arxiv.org/abs/2605.20242
作者: Xin-De Wang,Zhi-Rui Chen,Ze-Feng Gao,Peng-Jie Guo,Cheng Mu,Zhong-Yi Lu
机构: 未知
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)
备注: 30 pages; 11 figures

点击查看摘要

Abstract:Efficient discovery of precursor additives is essential for improving the performance of perovskite solar cells, yet the large chemical space makes conventional trial-and-error screening inefficient. We develop LEAP(LLM-driven Exploration via Active Learning for Perovskites), an expert-in-the-loop closed framework that couples a domain-specialized large language model(LLM) with active learning for iterative additive prioritization. The LLM is trained to extract mechanism-relevant knowledge from the perovskite additive literature and to represent candidate molecules through interpretable descriptors, which are further integrated into a Bayesian optimization workflow for uncertainty-aware prioritization under low-data conditions. Benchmark results on unseen literature show that the domain-specialized model outperforms general-purpose models in mechanism-consistent reasoning. Experimental validation in an expert-in-the-loop proof-of-concept study suggests improved additive prioritization across three screening rounds, leading to average device PCEs of 20.13% and 20.87% for the later-round 6-CDQ- and 2-CNA-treated devices, respectively, compared with 19.25% for the control, with a champion PCE of 21.32%. These results provide preliminary evidence that literature-grounded mechanistic descriptors, when coupled with Bayesian optimization and expert feasibility review, can support mechanism-aware additive prioritization in perovskite photovoltaics.

[AI-109] Provably Learning Diffusion Models under the Manifold Hypothesis: Collapse and Refine

链接: https://arxiv.org/abs/2605.20235
作者: Wei Huang,Andi Han,Mingyuan Bai,Huanjian Zhou,Qixin Zhang,Taiji Suzuki,Kenji Fukumizu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 3 figures

点击查看摘要

Abstract:Diffusion models generate high-dimensional data with remarkable quality, yet how their training efficiently learns the score function, bypassing the curse of dimensionality when data is supported on low-dimensional manifolds, remains theoretically unexplained. We identify a collapse-and-refine mechanism driven by the geometry of the score function itself: at small noise scales, the diverging singularity of the score drives a rapid dimensional collapse of the induced denoising map onto the data manifold projection; at moderate noise scales, training refines the intrinsic density on the learned manifold. We instantiate this principle as Score-induced Latent Diffusion (SiLD), a two-stage framework in which both manifold learning and density estimation emerge from a single denoising score matching objective, replacing the heuristic KL regularization of VAE-based latent diffusion models. We prove that the resulting sample complexity depends on the intrinsic dimension rather than the ambient dimension. Experiments on Stacked MNIST, CelebA variants, and molecular generation benchmarks show that SiLD matches or outperforms VAE-based LDMs in generation quality and consistently improves reconstruction, validating our theoretical predictions.

[AI-110] abPFN-MT: A Natively Multitask In-Context Learner for Tabular Data

链接: https://arxiv.org/abs/2605.20234
作者: Cormac Cureton,Narges Armanfard
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 24 pages, 7 figures

点击查看摘要

Abstract:Prior-Data Fitted networks (PFNs) have been very successful in tabular contexts, handling prediction tasks in context. However, they are designed for single-task inference, meaning that predicting several target values within a context requires repeated forward calls and precludes inter-task information sharing. We propose TabPFN-MT, which is trained on an expanded multi-target synthetic prior to capture inter-task dependencies in context. This model uses an expanded y -encoder and a shared decoder head to enable multitask in-context learning and simultaneous inference. The model is uniquely specialized for small-to-medium datasets by relying on in-context learning rather than traditional gradient-based training. Within this regime (averaging fewer than 1,000 samples), extensive evaluations across 344 datasets demonstrate that TabPFN-MT establishes a new state-of-the-art for deep tabular multitask learning. Furthermore, despite the inherent compute asymmetry of joint optimization, our model remains highly competitive with the latest state-of-the-art single-task ensembles. Notably, on multitask datasets it achieves an overall Accuracy rank of 4.89, the highest average rank among all models tested. Crucially, TabPFN-MT delivers this highly competitive performance while reducing the inference cost for T tasks from O(T) to O(1) forward passes, offering a massive computational efficiency improvement for multi-target tabular applications.

[AI-111] ool-Augmented Agent for Closed-loop OptimizationSimulationand Modeling Orchestration

链接: https://arxiv.org/abs/2605.20190
作者: Liyuan Deng,Shujian Deng,Yongkang Chen,Yongkang Dai,Zhihang Zhong,Linyang Li,Xiao Sun,Yilei Shi,Huaxi Huang
机构: 未知
类目: Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: 8pages,3figures

点击查看摘要

Abstract:Iterative industrial design-simulation optimization is bottlenecked by the CAD-CAE semantic gap: translating simulation feedback into valid geometric edits under diverse, coupled constraints. To fill this gap, we propose COSMO-Agent (Closed-loop Optimization, Simulation, and Modeling Orchestration), a tool-augmented reinforcement learning (RL) framework that teaches LLMs to complete the closed-loop CAD-CAE process. Specifically, we cast CAD generation, CAE solving, result parsing, and geometry revision as an interactive RL environment, where an LLM learns to orchestrate external tools and revise parametric geometries until constraints are satisfied. To make this learning stable and industrially usable, we design a multi-constraint reward that jointly encourages feasibility, toolchain robustness, and structured output validity. In addition, we contribute an industry-aligned dataset that covers 25 component categories with executable CAD-CAE tasks to support realistic training and evaluation. Experiments show that COSMO-Agent training substantially improves small open-source LLMs for constraint-driven design, exceeding large open-source and strong closed-source models in feasibility, efficiency, and stability.

[AI-112] SOLAR: A Self-Optimizing Open-Ended Autonomous Agent for Lifelong Learning and Continual Adaptation

链接: https://arxiv.org/abs/2605.20189
作者: Nitin Vetcha,Dianbo Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at “Association for the Advancement of Artificial Intelligence 2026 Conference” in Streaming Continual Learning Bridge. Published in CEUR Workshop Proceedings (Original version at this https URL )

点击查看摘要

Abstract:Despite the remarkable success of large language models (LLMs), they still face bottlenecks while deploying in dynamic, real-world settings with primary challenges being concept drift and the high cost of gradient-based adaptation. Traditional fine-tuning (FT) struggles to adapt to non-stationary data streams without resulting in catastrophic for getting or requiring extensive manual data curation. To address these limitations within the streaming and continual learning paradigm, we propose the Self-Optimizing Lifelong Autonomous Reasoner (SOLAR) which is an open-ended autonomous agent that leverages parameter-level meta-learning to self-improve, treating model weights as an environment for exploration. It initiates the process by consolidating a strong prior over common-sense knowledge making it effective for transfer-learning. By utilizing a multi-level reinforcement learning approach, SOLAR autonomously discovers adaptation strategies, enabling efficient test-time adaptation to unseen domains. Crucially, SOLAR maintains an evolving knowledge base of valid modification strategies, implicitly acting as an episodic memory buffer to balance plasticity (adaptation to new tasks) and stability (retention of meta-knowledge). Experiments demonstrate that SOLAR outperforms strong baselines on common-sense, mathematical, medical, coding, social and logical reasoning tasks, marking a significant step toward autonomous agents capable of lifelong adaptation in evolving environments.

[AI-113] GraphDiffMed: Knowledge-Constrained Differential Attention with Pharmacological Graph Priors for Medication Recommendation

链接: https://arxiv.org/abs/2605.20188
作者: Krati Saxena,Tomohiro Shibata
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recommending safe and effective medication combinations from electronic health records (EHRs) is a core clinical AI problem, yet it remains difficult because patient trajectories are long, noisy, and clinically heterogeneous. Existing methods typically excel at either temporal modeling across visits or pharmacological knowledge integration (e.g., drug-drug interactions, DDIs), but rarely achieve both while robustly suppressing noise. We present GraphDiffMed, a knowledge-constrained medication recommendation framework built on dual-scale Differential Attention v2. Differential attention is applied at both intra-visit and inter-visit levels to filter spurious signals within encounters and across longitudinal history, while pharmacological constraints are incorporated during learning. Experiments on MIMIC-III and ablation studies show that this design consistently improves recommendation quality and ranking over strong baselines while achieving a more favorable safety performance balance. We further find that the strongest-performing configuration uses only demographic auxiliary features under our experimental setting. Overall, GraphDiffMed demonstrates that combining noise-aware attention with pharmacological constraints yields more reliable and clinically meaningful medication recommendation. We open-source our code at this https URL.

[AI-114] Neural Estimation of Pairwise Mutual Information in Masked Discrete Sequence Models ICML2026

链接: https://arxiv.org/abs/2605.20187
作者: Jai Sharma,Yifan Wang,Bryan Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注: 6 pages, 3 figures; submitting to ICML 2026

点击查看摘要

Abstract:Understanding dependencies between variables is critical for interpretability and efficient generation in masked diffusion models (MDMs), yet these models primarily expose marginal conditional distributions and do not explicitly represent inter-variable dependence. We propose a neural framework for estimating pairwise conditional mutual information (MI) directly from the hidden states of a pretrained MDM, using ground-truth MI computed from the model’s own conditional distributions for supervision. The resulting estimator captures the model’s internal belief about dependency structure and predicts the full MI matrix in a single forward pass, enabling MI-guided parallel decoding by identifying conditionally independent subsets of variables. We evaluate our approach on Sudoku and protein sequence generation with ESM-C, where the MI maps recover known structural constraints and enable a 3-5x magnitude reduction in inference-time forward passes compared to sequential decoding, while preserving generative quality and outperforming entropy-based parallelization methods.

[AI-115] Large-Step Training Dynamics of a Two-Factor Linear Transformer Model

链接: https://arxiv.org/abs/2605.21292
作者: Krishnakumar Balasubramanian
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Dynamical Systems (math.DS)
备注:

点击查看摘要

Abstract:Gradient-flow analyses show that simplified linear transformers can learn the in-context linear-regression algorithm, but they do not explain the finite-step behavior of gradient descent at large learning rates. Motivated by empirical work on high-learning-rate transformer instabilities and by the cubic-map phase diagram for quadratic regression, we study an exactly reducible one-prompt linear-transformer training problem. After normalization, the dynamics reduce to a two-factor product map with an effective step-size parameter (\mu). On the balanced slice, this map recovers the known scalar cubic transition from monotone convergence to catapult convergence, periodic and chaotic bounded nonconvergence, and divergence. We then analyze the full two-dimensional system and show that, for (0\mu2), it has an explicit invariant Chebyshev ellipse separating forward-invariant regions; this ellipse carries off-balanced chaotic dynamics but is transversely repelling, while balanced scalar attractors can be transversely attracting. These results show that large constant learning rates can change the training attractor of the learned transformer rather than merely accelerating convergence: beyond sharp stability thresholds, finite-step training may settle into cycles, bounded chaos, or divergence instead of a single in-context linear-regression solution. We also discuss the consequences for mini-batch gradient descent based training methods.

[AI-116] Artificial Intelligence Reshapes Microwave Photonics

链接: https://arxiv.org/abs/2605.21224
作者: Peng Li,Xihua Zou,Jia Ye,Wei Pan,Lianshan Yan
机构: 未知
类目: Optics (physics.optics); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: 13 pages, 12 figures

点击查看摘要

Abstract:As a rapidly emerging interdisciplinary field that intrinsically integrates microwave and photonics, microwave photonics (MWP) provides disruptive solutions to overcome the fundamental bandwidth of conventional electronic systems. By exploiting the inherently ultra-wide bandwidth and low-loss characteristics of photonic technologies, MWP enables the generation, transmission, processing, and detection of microwave, millimeter-wave, and terahertz signals. Representative breakthroughs include fully photonic microwave radar systems, photonic analog-to-digital converters with bandwidth up to 320 GHz, and photonic wireless communication systems achieving data rate as high as 616 Gbit/s. Meanwhile, the rapid growth of artificial intelligence (AI) is reshaping scientific research, engineering, and daily life in unprecedented ways, such as AI for science/engineering and AI co-scientist/assistant. Correspondingly, AI is profoundly reshaping MWP in all aspects, ranging from signal generation, transmission to signal processing and detection. AI has revolutionized the design, simulation, fabrication, testing, deployment, and maintenance of MWP systems, delivering autonomous operation and exceptional efficiency beyond traditional systems. Motivated by these developments, this Review Paper provides the first comprehensive overview of AI-enabled MWP, systematically summarizing the state-of-the-art advances and presenting insights for both the academic community and the broader public.

[AI-117] Enhanced Reinforcement Learning-based Process Synthesis via Quantum Computing

链接: https://arxiv.org/abs/2605.21213
作者: Austin Braniff(1),Fengqi You(2),Yuhe Tian(1) ((1) Department of Chemical and Biomedical Engineering, West Virginia University, (2) R.F. Smith School of Chemical and Biomolecular Engineering, Cornell University)
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:In this work, we present quantum reinforcement learning (RL) as a solution strategy for process synthesis problems. Building on our prior work, we develop a generalized framework that formally poses process synthesis as a Markov decision process and introduces quantum-enhanced RL algorithms to solve it with improved scalability. Earlier implementations of quantum-based RL for process synthesis were limited by qubit requirements, which scaled poorly with problem complexity. This work overcomes this challenge by introducing state encoding algorithms to decouple qubit requirements from problem size. A classical RL-based solution strategy is used as a baseline to benchmark the quantum algorithms under identical training conditions. All algorithms are evaluated across a flowsheet synthesis problem of increasing unit counts to analyze their performance and scalability. Results show that all approaches are capable of identifying the optimal flowsheet designs in small design spaces. For moderate-scale unit counts, quantum approaches demonstrate competitive performance on a per-episode basis and improved efficiency on a per-parameter basis versus the classical RL benchmark. This work provides a foundation for future quantum computing applications within process systems engineering, establishes a controlled benchmark for comparing classical and quantum algorithms, and shows that the proposed quantum variants remain competitive for the process synthesis problem examined in this work.

[AI-118] AMAR: Lightweight Attention-Based Multi-User Activity Recognition from Wi-Fi CSI

链接: https://arxiv.org/abs/2605.20649
作者: Amirhossein Mohammadi,Hina Tabassum
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 25 pages, 6 figures, 3 tables

点击查看摘要

Abstract:Wi-Fi-based human activity recognition (HAR) has emerged as a promising approach for contactless sensing, leveraging channel state information (CSI) collected from wireless transceivers. While existing studies have primarily concentrated on single-user scenarios, real-world deployments often involve multi-user settings where concurrent users’ movements induce overlapping CSI patterns that challenge conventional classification methods. To address this limitation, this paper introduces an attention-based multi-user activity recognition (AMAR) framework that formulates HAR as a set prediction problem. The transformer-based architecture in AMAR leverages learnable query embeddings acting as specialized activity detectors, enabling the simultaneous identification of multiple activities from composite CSI representations. Moreover, to address deployment constraints, AMAR is designed in an edge-cloud split architecture form where lightweight convolutional networks on edge devices perform initial feature extraction, followed by residual vector quantization that achieves substantial bandwidth reduction while preserving activity-discriminative information. The cloud component performs final activity prediction through attention-based set matching, enabling the system to handle varying occupancy levels. Across classroom, meeting-room, and empty-room environments, on average AMAR nearly doubles the rate of perfectly predicting all concurrent activities compared to the best baseline. Moreover, it achieves an F_1 -score of 53.4% compared to 45.6% for the best benchmark, and reduces occupancy estimation error by 74%, while minimizing bandwidth substantially.

[AI-119] Lower Bounds for Advection-Diffusion Equations: An Exploration with AI-Generated Proofs

链接: https://arxiv.org/abs/2605.20623
作者: Chenyang An,Xiaoqian Xu
机构: 未知
类目: Analysis of PDEs (math.AP); Artificial Intelligence (cs.AI)
备注: 63 pages

点击查看摘要

Abstract:We establish explicit lower bounds for advection-diffusion equations in three settings: a polynomial \dot H^-1 bound for inviscid shears with u\in L^\infty_t W^1,1_y , a uniform positive lower bound on the mixing scale for diffusive shears, and an exponential L^2 bound for rapidly oscillating time-periodic flows. All constants are explicit in the data. The proofs were generated entirely by a multi-agent math proving system, QED, without expert human intervention, serving as a test of AI’s capability to produce rigorous mathematics. Comments: 63 pages Subjects: Analysis of PDEs (math.AP); Artificial Intelligence (cs.AI) MSC classes: 35Q35 Cite as: arXiv:2605.20623 [math.AP] (or arXiv:2605.20623v1 [math.AP] for this version) https://doi.org/10.48550/arXiv.2605.20623 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-120] argeting Clause Type Distributions: a Picklock for Random Satisfiability Problems

链接: https://arxiv.org/abs/2605.20328
作者: J. Schwardt,J. C. Budich
机构: 未知
类目: atistical Mechanics (cond-mat.stat-mech); Strongly Correlated Electrons (cond-mat.str-el); Artificial Intelligence (cs.AI); Combinatorics (math.CO)
备注: 7+2 pages, 6+2 figures

点击查看摘要

Abstract:Optimization problems such as the NP-complete 3-SAT provide an important benchmark for the difficult task of finding ground-states in strongly correlated many-body systems with rugged energy landscapes. The study of random 3-SAT problems as Ising spin Hamiltonians in statistical physics has yielded major insights including the existence of a satisfiability phase transition, and the prediction of a critical parameter line of particularly hard instances. Yet, progress on solving those instances has been scarce for several decades. Here, introducing the Target-SAT (TSAT) algorithm, we roughly triple the tractable problem sizes in the hardest regime, with an even greater improvement in a vast range of neighboring regions. By leveraging statistical information hidden in the combinatorial constraints of the problem, TSAT is actively guided in its stochastic local search toward a target within the relevant parameter space. Our analysis also explains why established local search algorithms are limited to relatively small system sizes due to a vast low-energy trap. Furthermore, we characterize the aforementioned critical line in terms of a dominant additional complexity barrier, whose exponential scaling is quickly overcome by TSAT only in the surrounding parameter space. With TSAT, the lead in solving the hardest known random satisfiability problems returns to the realm of stochastic local search algorithms.

[AI-121] Representability-Aware Neural Networks for Reduced Density Matrices: Application to Fractional Chern Insulators

链接: https://arxiv.org/abs/2605.20326
作者: Justin B. Hart,Awwab A. Azam,Thomas Li,Yunxuan Li,Ye Bi,Haining Pan,Jiabin Yu
机构: 未知
类目: rongly Correlated Electrons (cond-mat.str-el); Artificial Intelligence (cs.AI)
备注: 12+32 Pages, 4+10 Figures, 0+19 Tables

点击查看摘要

Abstract:We develop a representability-aware and interpolable neural network (NN) framework for predicting two-particle reduced density matrices (2-RDMs). The NN incorporates a subset of representability conditions through its architecture and loss function, and can operate on different momentum meshes, enabling evaluating the representability conditions across multiple meshes, which we call interpolated representability condition. The framework can be used either to predict 2-RDMs on large momentum meshes by interpolating exact results from small meshes, or as a variational 2-RDM ansatz optimized by energy minimization on arbitrary meshes. We apply this approach to the fractional Chern insulator in the one-band projected model of twisted bilayer MoTe _2 at twist angle 3.89^\circ and hole filling 2/3 . Trained on exact-diagonalization (ED) 2-RDMs from meshes with 12 or 18 momentum points using six different NN architectures, the best NN is the residual multilayer perceptron, which predicts the 6\times6 2-RDM with 97.07%-98.18% accuracy relative to the ED 2-RDM but predicts an energy 77.353 meV above ED ground-state energy. We then variationally optimize the NN on several meshes including 6\times6 , predicting a 6\times 6 energy of just 0.104 meV below ED while maintaining 98.94%-98.96% accuracy. Compared with the conventional boundary-point semidefinite programming, which gives an energy 5.560 meV below ED with 96.40%-98.94% accuracy, the NN achieves a more accurate energy and similar accuracy while using only less than 1/20 as many parameters. Eventually, we add a symmetric mesh of 48 momentum points to the variational optimization of the NN, and provide a prediction of the many-body ground-state energy and the many-body quantum metric on that mesh.

[AI-122] Network-Based Interventions for HIV Prevention via Cascade-Aware Suppression of Transmission

链接: https://arxiv.org/abs/2605.20218
作者: Akseli Kangaslahti,Davin Choo,Milind Tambe,Alastair van Heerden,Cheryl Johnson
机构: 未知
类目: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Treating and preventing Human Immunodeficiency Virus (HIV) remains a critical global health challenge. While antiretroviral therapy provides a path toward viral suppression – effectively eliminating an individual’s transmission risk – systemic resource constraints limit the reach of intervention efforts. This work addresses the strategic distribution of intensive resources among virally unsuppressed individuals to minimize the expected cascade of new infections within a transmission network. We formalize this challenge as a novel constrained optimization problem where we have resources to “treat” k out of a set \mathbfP of virally unsuppressed individuals, and establish its theoretical connections to existing computational literature. We then propose Cascade-Aware Suppression of Transmission (CAST), a polynomial-time (\delta, \epsilon) -approximation algorithm that achieves a 2\sqrt|\mathbfP| approximation ratio by leveraging connections to the Minimum- k -Union (MkU) problem and Hoeffding-style concentration bounds. Extensive evaluations on real-world HIV networks demonstrate that CAST outperforms standard public health and computer science baselines. Furthermore, we show that CAST is empirically robust across diverse infectious disease networks, varied edge probability initializations, and settings involving imperfect network data.

机器学习

[LG-0] Equilibrium Reason ers: Learning Attractors Enables Scalable Reasoning ICML2026

链接: https://arxiv.org/abs/2605.21488
作者: Benhao Huang,Zhengyang Geng,Zico Kolter
类目: Machine Learning (cs.LG)
*备注: ICML 2026

点击查看摘要

Abstract:Scaling test-time compute by iteratively updating a latent state has emerged as a powerful paradigm for reasoning. Yet the internal mechanisms that enable these iterative models to generalize beyond memorized patterns remain unclear. We hypothesize that generalizable reasoning arises from learning task-conditioned attractors: latent dynamical systems whose stable fixed points correspond to valid solutions. We formalize this process through Equilibrium Reasoners (EqR), which enable test-time scaling without external verifiers or task-specific priors. EqR scales internal dynamics along two axes: depth, by running more iterations, and breadth, by aggregating stochastic trajectories from multiple initializations. Empirically, gains from test-time scaling are tightly coupled with stronger convergence toward solution-aligned attractors. This attractor perspective allows neural networks to adaptively allocate test-time compute based on task difficulty. While simple cases converge within 1 to 5 iteration steps, harder cases benefit from massive test-time scaling. By unrolling up to the equivalent of 40,000 layers, scalable latent reasoning boosts accuracy from 2.6% for feedforward models to over 99% on Sudoku-Extreme. These results suggest that learned attractor landscapes provide a useful mechanistic lens for understanding scalable reasoning in iterative latent models. Comments: ICML 2026 Subjects: Machine Learning (cs.LG) Cite as: arXiv:2605.21488 [cs.LG] (or arXiv:2605.21488v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.21488 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-1] EvoStruct: Bridging Evolutionary and Structural Priors for Antibody CDR Design via Protein Language Model Adaptation

链接: https://arxiv.org/abs/2605.21485
作者: Mansoor Ahmed,Sujin Lee,Umar Khayaz,Murray Patterson
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Equivariant graph neural network (GNN) methods for antibody complementarity-determining region (CDR) design achieve the highest sequence recovery but suffer from severe vocabulary collapse. The current best GNN methods over-predict very few amino acids, such as tyrosine and glycine, while ignoring functionally important residues. We trace this failure to GNN encoders learning amino acid distributions de novo from limited structural data, discarding substitution patterns encoded in evolutionary databases. To resolve this, we propose EvoStruct, which bridges a frozen protein language model (PLM) with 3D structural context from an E(3)-equivariant GNN via a cross-attention adapter. Unlike prior PLM-structure adapters for general protein design, EvoStruct targets the vocabulary collapse problem specific to CDR design through progressive PLM unfreezing and R-Drop consistency regularization. On the CHIMERA-Bench dataset, EvoStruct achieves the highest amino acid recovery and lowest perplexity among several antibody design methods, improving sequence recovery by 16% and reducing perplexity by 43% relative to the best GNN baselines, while recovering 2.3x greater amino acid diversity and the highest binding-pair correlation with ground truth.

[LG-2] Is Fixing Schema Graphs Necessary? Full-Resolution Graph Structure Learning for Relational Deep Learning ICML2026

链接: https://arxiv.org/abs/2605.21475
作者: Yi Huang,Qingyun Sun,Jia Li,Xingcheng Fu,Jianxin Li
类目: Machine Learning (cs.LG)
*备注: Accepted by the Forty-third International Conference on Machine Learning (ICML2026)

点击查看摘要

Abstract:Relational prediction tasks are fundamental in many real-world applications, where data are naturally stored in relational databases (RDBs). Relational Deep Learning (RDL) addresses this problem by modeling RDBs as graphs and applying graph neural networks (GNNs) for end-to-end learning. However, the full-resolution property is commonly adopted as a design principle in graph construction for RDBs to preserve relational semantics, which leads most existing methods to rely on fixed graph structures. In this paper, we propose FROG, a Full-Resolution and Optimizable Graph Structure Learning framework for RDL that formulates relational structure learning as a learnable table role modeling problem, allowing tables to contribute as nodes and edges in message passing. We further design role-driven message passing mechanisms to capture relational semantics, enabling joint optimization of graph structure and GNN representations. To ensure semantic consistency, we introduce functional dependency constraints that regularize representations across table and entity levels. Extensive experiments demonstrate that our method outperforms existing approaches and reveal how table roles impact downstream tasks, offering new insights into graph construction for RDL

[LG-3] A Machine Learning Framework for Weighted Least Squares GNSS Positioning based on Activation Functions

链接: https://arxiv.org/abs/2605.21461
作者: Pin-Hsun Lee,Harry Leib
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Global Navigation Satellite Systems (GNSS) are widely used to provide position, velocity, and timing (PVT) information for various applications, including transportation, location-based communication services, and intelligent agriculture. In urban canyons, high-rise buildings and narrow streets can cause signal obstruction, non-line-of-sight (NLOS) reception, and multipath effects that introduce errors in GNSS pseudorange measurements. Although multi-constellations GNSS effectively increase the number of available satellites, the inclusion of degraded signals can lead to severe positioning errors. This study proposes a machine learning framework for the weighted least squares (WLS) algorithm incorporating activation functions to enhance positioning accuracy. Several signal quality indicators are employed as training features for ensemble learning algorithms to identify poor quality signals by providing quality scores. Then, activation functions are employed to transform the machine learning predicted scores to appropriate weights for WLS positioning. To evaluate the performance of our approach, experiments are conducted using real-world datasets from Hong Kong and Tokyo urban areas. Comparative analysis of activation functions reveals that sigmoid functions consistently yield the greatest improvements with different machine learning algorithms and GNSS constellation configurations. The proposed algorithm demonstrates substantial reductions in positioning errors for both single- and multiconstellation scenarios. Furthermore, our results indicate that the proposed algorithm exhibits strong geographical transferability. The proposed algorithm maintains comparable level of performance when trained on data from other regions with similar levels of urbanization.

[LG-4] Mitigating Label Bias with Interpretable Rubric Embeddings

链接: https://arxiv.org/abs/2605.21455
作者: Calvin Isley,Johann D. Gaebler,Sharad Goel
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Statistical decision algorithms are increasingly deployed in domains where ground-truth labels are hard to obtain, such as hiring, university admissions, and content moderation. In these settings, models are typically trained on historical human evaluations – for example, using past hiring decisions as a proxy for true applicant quality. However, if past evaluations unjustly favor certain groups, models trained on these labels may inherit those biases. To address this problem, we propose basing predictions on rubric embeddings, a representation framework that replaces standard black-box embeddings with features derived from expert-defined criteria that align with the underlying construct of interest. By anchoring predictions to semantically meaningful dimensions, this approach guards against biased proxy signals. We provide both theoretical and empirical evidence that rubric embeddings mitigate label bias under plausible conditions. Empirically, we evaluate our method on a novel dataset of applications to a large master’s program. We find that models trained on rubric embeddings reduce group disparities while improving measures of cohort quality. Our results suggest that basing predictions on interpretable, domain-grounded representations offers a practical approach to learning in the presence of biased labels.

[LG-5] Gaussian Sheaf Neural Networks

链接: https://arxiv.org/abs/2605.21435
作者: André Ribeiro,Ana Luiza Tenório,Tiago da Silva,Diego Mesquita
类目: Machine Learning (cs.LG); Algebraic Topology (math.AT); Category Theory (math.CT)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have become the de facto standard for learning on relational data. While traditional GNNs’ message passing is well suited for vector-valued node features, there are cases in which node features are better represented by probability distributions than real vectors. Concretely, when node features are Gaussians, characterized by a mean and a covariance matrix, naively concatenating their parameters into a single vector and applying standard message passing discards the geometric and algebraic structure that governs means and covariances. We propose Gaussian Sheaf Neural Networks (GSNNs), a principled framework that incorporates these inductive biases into graph-based learning. Building on the theory of cellular sheaves, we derive a new Laplacian operator that generalizes the sheaf Laplacian to this setting and preserves its key properties. We complement our theoretical contributions with experiments on synthetic and real-world data that illustrate the practical relevance of GSNNs.

[LG-6] roto 2.0: The Robot Tactile Olympiad ICRA2026

链接: https://arxiv.org/abs/2605.21429
作者: Elle Miller,Jayaram Reddy,Ayush Deshmukh,Trevor McInroe,David Abel,Oisin Mac Aodha,Sethu Vijayakumar
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Accepted to 7th ViTac Workshop, ICRA 2026

点击查看摘要

Abstract:Tactile-based reinforcement learning (RL) is currently hindered by fragmented research and a focus on over-saturated orientation tasks. We introduce v2 of the Robot Tactile Olympiad (\textttroto 2.0), a GPU-parallelised benchmark designed to standardise tactile-based RL across four distinct robotic morphologies (16-DOF to 24-DOF). Unlike prior benchmarks, roto focuses on end-to-end “blind” manipulation, utilising only proprioception and tactile sensing without state information or distillation. We demonstrate a significant performance leap, with our blind agents achieving 13 Baoding ball rotations in 10 seconds, an order of magnitude faster than current state-of-the-art speeds. By open-sourcing our environments and robustly tuned baselines, we reduce the barrier to entry and enable researchers to prioritise fundamental algorithmic challenges over tedious RL tuning. Website: this https URL

[LG-7] Polynomial-Time Robust Multiclass Linear Classification under Gaussian Marginals

链接: https://arxiv.org/abs/2605.21428
作者: Ilias Diakonikolas,Giannis Iakovidis,Mingchen Ma
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:

点击查看摘要

Abstract:We study the task of agnostic learning of multiclass linear classifiers under the Gaussian distribution. Given labeled examples (x, y) from a distribution over \mathbbR^d \times [k] , with Gaussian x -marginal, the goal is to output a hypothesis whose error is comparable to that of the best k -class linear classifier. While the binary case k=2 has a well-developed algorithmic theory, much less is known for k \ge 3 . Even for k=3 , prior robust algorithms incur exponential dependence on the inverse of the desired accuracy in both complexity and representation size. In this work, we develop new structural results for multiclass linear classifiers and use them to design fully polynomial-time robust learners with dimension-independent error guarantees. Our first result shows that the standard multiclass perceptron algorithm requires super-polynomially many samples and updates, even with clean labels and Gaussian marginals, revealing a basic obstruction absent in the binary case. Our main positive result is a pairwise improper-learning framework which yields an efficient learner with error \widetilde O(k^3/2\sqrt\mathrmopt)+\epsilon for general k . Additionally, we develop a sharper localization-based framework which leads to error O(\mathrmopt)+\epsilon for k=3 , and error \mathrmpoly(k)\mathrmopt+\epsilon for geometrically regular k -class linear classifiers.

[LG-8] Adaptive Signal Resuscitation: Channel-wise Post-Pruning Repair for Sparse Vision Networks

链接: https://arxiv.org/abs/2605.21426
作者: Qishi Zhan,Ziheng Chen,Minxuan Hu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:One-shot magnitude pruning can cause severe accuracy collapse in the high-sparsity regime, even when the pruning mask preserves the largest weights. We argue that this failure reflects a granularity mismatch in post-pruning repair. Under global magnitude pruning, nearly collapsed channels can coexist with channels that retain informative activation variance within the same layer. Existing layer-wise activation repair methods apply a single correction to the whole layer, and can therefore over-amplify damaged channels while trying to restore the layer-level signal. We propose Adaptive Signal Resuscitation (ASR), a training-free channel-wise repair method that matches the granularity of repair to the granularity of damage. ASR estimates a variance-matching correction for each output channel and stabilizes it with a data-driven shrinkage rule, suppressing unreliable corrections for channels with weak post-pruning signal while preserving corrections for healthier channels. Applied before BatchNorm recalibration, ASR requires only forward passes on a small calibration set and no retraining. Across three datasets, four convolutional architectures, and both unstructured and structured sparsity settings, ASR generally improves over layer-wise repair, with the clearest gains in high-sparsity regimes. On ResNet-50 at 90% sparsity, ASR recovers 55.6% top-1 accuracy on CIFAR-10, compared with 41.0% for layer-wise repair and 28.0% for BatchNorm-only recalibration. Ablations show that naive channel-wise variance matching is insufficient, and that shrinkage stabilizes post-pruning repair.

[LG-9] Preference-aware Influence-function-based Data Selection Method for Efficient Fine-Tuning

链接: https://arxiv.org/abs/2605.21422
作者: Qihao Lin,Guanxu Chen,Dongrui Liu,Jing Shao
类目: Machine Learning (cs.LG)
*备注: 13 pages, 3 figures

点击查看摘要

Abstract:As LLMs continue to scale, improving training efficiency increasingly depends on using data more effectively. Data selection addresses this problem by allocating a limited training budget to samples that best promote a target behavior. Existing methods usually represent the target behavior with a set of target examples, but often treat these examples as equally important. This can be inefficient because target examples may differ in their relevance to the current model: examples closer to the model’s current behavior provide more actionable guidance than those farther away. We propose PRISM (PReference-aware Influence-function-based Data Selection Method for Efficient Fine-Tuning), which uses the current model’s preference to weight target examples and construct a preference-aware target representation. PRISM then scores candidate training samples by their alignment with this representation, concentrating the data budget on samples more likely to move the model toward the target behavior. Theoretical analysis shows that this preference weighting yields a more effective first-order direction for increasing target-behavior preference. Experiments across model families and scales show that PRISM improves both efficient fine-tuning and safety-oriented SFT repair, demonstrating that precise target-behavior characterization is key to budget-efficient data selection.

[LG-10] What Twelve LLM Agent Benchmark Papers Disclose About Themselves: A Pilot Audit and an Open Scoring Schema

链接: https://arxiv.org/abs/2605.21404
作者: Mahdi Naser Moghadasi(BrightMind AI, Texas Tech University),Faezeh Ghaderi(University of Texas at Arlington)
类目: Machine Learning (cs.LG)
*备注: Pilot audit of 12 LLM agent benchmark papers; schema, codebook, and per-paper scoring sheet released. Submission to IEEE Big Data 2026

点击查看摘要

Abstract:We read twelve well-known LLM agent benchmark papers and recorded, dimension by dimension, what each paper actually says about how its evaluation was run. The motivation came from a familiar frustration: two papers will report results on the same benchmark with the same model name and disagree, and you cannot tell why – the scaffold, the sampling settings, the subset, or the evaluator version. In many cases the published artifact does not let you answer. This paper is an implementation report on the attempt. We designed a small audit schema (five fields: benchmark identity, harness specification, inference settings, cost reporting, failure breakdown), wrote a scoring codebook with the boundary cases we hit during pilot scoring, applied it to twelve canonical papers (eight agent, four classical static), and recorded what we saw. We score the disclosure of an agent run, not its correctness, and make no claim that disclosure implies a trustworthy result. The mean audit score across the eight agent-benchmark papers is 0.38 (out of 1.0), and across the four classical static benchmarks 0.66; the largest gap is on cost (none of the eight agent benchmark papers disclose inference cost in any form) and on harness specification (none fully disclose a content-addressed container image of the evaluation environment). We release the schema as a JSON Schema file, the codebook as a Markdown document, and the raw scoring sheet as a CSV. The scoring was performed by a single auditor in one pass; a multi-rater audit is the natural next step, and we discuss what we think it would change.

[LG-11] Classification of Single and Mixed Partial Discharges under Switching Voltage Using an AWA-CNN Framework

链接: https://arxiv.org/abs/2605.21352
作者: Md Rafid Kaysar Shagor,Zannatul Ferdousy Mouri,Farhina Haque,Anindya Bijoy Das
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Emerging Technologies (cs.ET)
*备注:

点击查看摘要

Abstract:The growing use of fast-switching power electronics has made partial discharge (PD) analysis under switching-voltage excitation increasingly important, yet more challenging than under sinusoidal conditions due to activity concentrated at voltage transitions. This work presents an Amplitude-Width-Area (AWA) pattern representation for source-oriented PD analysis under switching-voltage excitation. In the proposed method, time domain PD pulses are characterized using pulse amplitude, width, and area, and mapped into a visual pattern where amplitude and area define the coordinate axes and width is encoded by color. The generated AWA patterns are used to distinguish six single and mixed PD source conditions: corona, internal, surface, corona+internal, corona+surface, and internal+surface. To evaluate the classification capability of the proposed representation, a Random Forest baseline and two Convolutional Neural Network (CNN) models, InceptionV3 and ResNet-18, are compared. The AWA patterns show distinguishable source-dependent distributions, and CNN-based classification achieves testing accuracy above 96%, compared with 73.33% for Random Forest. The results indicate that AWA patterns provide a visual representation of PD pulses suitable for multi-class PD source classification under switching-voltage excitation.

[LG-12] Fast and Stable Triangular Inversion for Delta-Rule Linear Transformers

链接: https://arxiv.org/abs/2605.21325
作者: Aleksandros Sobczyk,Gioele Gottardo,Christos K. Matzoros,Mirko De Vita,Filip Skogh,Anastasios Zouzias,Jiawei Zhuang
类目: Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:Linear attention has emerged as a cornerstone for efficient long-context architectures, as evidenced by its integration into state-of-the-art open-source models including Qwen3.5/3.6, Kimi Linear, and RWKV-7. Models that incorporate linear attention layers with the so-called Delta-Rule involve the inversion of triangular matrices as a core sub-routine. This operation often forms a performance bottleneck, and, due to its high-sensitivity to numerical errors, it can significantly deteriorate end-to-end model accuracy if it is not carefully implemented. This work provides a systematic analysis of both direct and iterative triangular inversion algorithms, targeting methods that are rich in matrix products, and, therefore, have the potential to efficiently utilize modern hardware. To that end, our analysis covers a broad spectrum of mathematical and practical aspects, with a heavy focus on numerical stability, computational complexity, and, ultimately, hardware efficiency and practical considerations. We provide a rigorous experimental evaluation to verify these properties in practical scenarios, and in low-precision floating-point representations, highlighting the strengths and limitations of each method. Performance benchmarks on NPUs reveal up to 4.3\times speed-up against the state-of-the-art implementations of SGLang for triangular matrix inversion, leading to significant performance improvements on the entire layer level, while maintaining full end-to-end model accuracy.

[LG-13] Optimized Federated Knowledge Distillation with Distributed Neural Architecture Search

链接: https://arxiv.org/abs/2605.21322
作者: Chaimaa Medjadji,Sylvain Kubler,Yves Le Traon,Guilain Leduc,Sadi Alawadi,Feras M. Awaysheh
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) enables collaborative model training without centralizing data. However, real-world deployments must simultaneously address statistical heterogeneity across client data (non-IID), system heterogeneity in device capabilities, and communication efficiency. Existing FL approaches mitigate these challenges through improved aggregation, personalization, or knowledge distillation, but they almost universally assume a fixed client architecture, limiting adaptability to heterogeneous data complexity and hardware constraints. This architectural constraint often leads to suboptimal trade-offs between accuracy and efficiency in real-world FL systems. This work introduces FedKDNAS, a distillation-driven FL framework that combines client-side neural architecture selection with distillation of server-coordinated knowledge. Each client autonomously selects a lightweight model under accuracy-resource constraints. It then trains it locally using a hybrid objective combining supervised learning and knowledge distillation and shares only predictions on a public reference set. The server then aggregates and smooths these predictions, optionally combining them with a teacher model, to produce stable distillation targets for the next round. Extensive evaluation on six datasets against six representative FL baselines (FedAvg, Ditto, FedMD, FedDF, FedDistill, Local-KD) demonstrates that FedKDNAS consistently achieves superior Pareto efficiency, improving accuracy by up to 15% under non-IID conditions, reducing client CPU usage by approximately 28%, and decreasing communication overhead by up to 44 times while maintaining lightweight logit-based communication.

[LG-14] CRAFT: Conflict-Resolved Aggregation for Federated Training

链接: https://arxiv.org/abs/2605.21317
作者: Ziqi Wang,Qiang Liu,Nils Thuerey
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The aggregation of conflicting client updates remains a fundamental bottleneck in federated learning (FL) under heterogeneous data distributions. Naive averaging can produce a global update that improves the global objective while conflicting with specific clients, causing degradation for those clients. In this work, we propose CRAFT (Conflict-Resolved Aggregation for Federated Training), a new aggregation framework that treats the global update as a geometric correction problem. We formulate aggregation as finding the update closest to a reference direction while satisfying conflict-free alignment constraints. We derive a closed-form expression for the constrained optimization problem, avoiding the computational overhead of iterative solvers. Furthermore, we use a layer-wise adaptation to address conflicts at varying feature granularities. We provide a theoretical analysis showing that CRAFT promotes a common-descent structure and mitigates conflicts through its projection geometry. Extensive experiments on heterogeneous benchmarks demonstrate that CRAFT improves the accuracy of the global model while reducing performance disparity across clients compared with state-of-the-art baselines. The source code for CRAFT is available at this https URL.

[LG-15] A New Framework to Analyse the Distributional Robustness of Deep Neural Networks

链接: https://arxiv.org/abs/2605.21313
作者: Divij Khaitan,Subhashis Banerjee
类目: Machine Learning (cs.LG)
*备注: 9 pages, 6 figures, 3 tables

点击查看摘要

Abstract:Deep neural networks have achieved impressive performance on a variety of tasks, but their brittleness to distributional shifts remains a significant barrier to real-world deployment. In this paper, we propose a framework to analyse and quantify the distributional robustness of neural networks by studying the interactions between layer weights and activations. We model these interactions using Bernoulli distributions, using the separation between classes as a diagnostic proxy for robustness. We demonstrate the usefulness of this framework through models trained on CIFAR-10 and ImageNet. We show that our proposed metrics can distinguish between networks that have memorised their training data and those that have not. We also perform analogous experiments in the activation space and find that the same properties do not hold up. Additionally, we investigate the behaviour of our metrics under various distribution shifts and show that these shifts reduce separation under our path-based diagnostics. Our results suggest that this framework provides useful model-level diagnostics of representation structure and robustness.

[LG-16] A Mechanistic Study of Tabular Foundation Models

链接: https://arxiv.org/abs/2605.21288
作者: Marin Biloš,James T. Wilson,Anderson Schneider,Yuriy Nevmyvaka
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Tabular foundation models with different architectures converge in accuracy across a range of classification and regression tasks. This raises questions a leaderboard cannot answer: (i) whether the models execute the same in-context algorithm, (ii) where row, column, and class-permutation invariances originate, and (iii) how robust they are under perturbations engineered against the inferred mechanism. We characterize all three. The model families realize qualitatively distinct similarity-based readouts: from an attention-weighted vote over context labels to a class-conditional mean readout, each confirmed by causal intervention. We find that the representation collapse highlighted in prior work is not a practical concern for them. Each model’s permutation invariances trace to specific positional parameters whose removal preserves accuracy and makes approximate invariance exact. Perturbations engineered against each readout reproduce predicted failure modes; hub and rank attacks isolate them from refit baselines. Together these results give a mechanistic account of contemporary tabular foundation models and identify which inductive biases govern both their accuracy and characteristic failures.

[LG-17] FedCoE: Bridging Generalization and Personalization via Federated Coordinated Dual-level MoEs

链接: https://arxiv.org/abs/2605.21264
作者: Penglin Dai,Fulian Li,Xincao Xu,Junhua Wang,Lixin Duan,Xiao Wu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) has emerged as a promising paradigm for privacy-preserving distributed learning. However, existing FL methods face a fundamental challenge. Traditional averaging-based approaches suffer from parameter divergence under non-IID conditions, while personalized FL methods overfit to local data and fail to generalize to new clients (cold-start problem). Mixture-of-Experts naturally addresses this by routing heterogeneous data to specialized experts rather than forcing uniform aggregation. In this paper, we propose FedCoE, a Federated Coordinated dual-level mixture-of-Experts framework that effectively balances global generalization with local personalization. FedCoE maintains multiple independent global expert models on the server and employs a shared gating network to dynamically model client-expert correlations during aggregation, effectively mitigating expert drift and gating inconsistency. To address the cold-start challenge, we introduce an adaptive mechanism that enables new clients to immediately leverage the global expert pool without extensive local training. Extensive experiments demonstrate that FedCoE achieves 78.00% global accuracy and 89.32% personalized accuracy on average, outperforming the baseline by 8.82% and 29.19%, respectively. In cold-start scenarios, FedCoE delivers 77.27% accuracy without any local fine-tuning, outperforming baselines by over 12.54%.

[LG-18] Nonparametric Learning and Earning with One-Point Feedback under Nonstationarity

链接: https://arxiv.org/abs/2605.21263
作者: Xiangyu Yang,Feng Xu,Jian-Qiang Hu,Jiaqiao Hu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Firms increasingly rely on dynamic pricing to respond to evolving customer demand, yet in many applications they observe only the revenue generated by a single posted price in each period. At the same time, market conditions may shift gradually or abruptly due to changes in customer preferences, competition, or external shocks. These features create two intertwined challenges: learning the revenue–demand relationship from limited feedback and adapting pricing decisions to a changing environment. We study how a seller can learn and earn effectively under these constraints, without assuming a specific parametric form for demand. We develop a learning framework that updates prices using revenue-based gradient approximations constructed from one observation per period. To address environmental changes, we incorporate a restarting mechanism that periodically refreshes the learning process so that outdated information is discounted. When the degree of nonstationarity is unknown, we further introduce a meta-learning layer to adaptively hedge across multiple restarting schedules. We provide performance guarantees for our approach, showing how cumulative revenue loss relative to a fully informed benchmark depends on both the time horizon and the magnitude of market variation. Simulation experiments using synthetic and real-world data illustrate the effectiveness of the proposed procedures.

[LG-19] On the Cost and Benefit of Chain of Thought: A Learning-Theoretic Perspective

链接: https://arxiv.org/abs/2605.21260
作者: Yue Zhang,Zhiyi Dong,Tommaso Cesari,Yongyi Mao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We develop a learning-theoretic framework for understanding Chain of Thought (CoT). We model CoT as the interaction between an answer map and a chain rule that generates intermediate questions autoregressively, and define the reasoning risk of a hypothesis under this interaction. Our first result is a tight canonical decomposition of this risk into two terms with opposing roles: an oracle-trajectory risk (OTR), which captures the benefit of CoT and reduces to a target-domain risk in a domain adaptation problem, and a trajectory-mismatch risk (TMR), which captures the cost of CoT through error accumulation along mismatched reasoning trajectories. We then show that this cost is unavoidable without structure: if any one of the loss, the hypothesis answer map, or the chain rule lacks stability, the TMR can be arbitrarily large even when the OTR is zero and the hypothesis is uniformly close to the ground truth. Conversely, under stability, we prove a tight upper bound on the TMR governed by an exact amplification factor that identifies bounded, linear, and exponential error-growth regimes. Together, these results give a precise theory of when CoT helps, when it hurts, and what controls the transition between the two.

[LG-20] Graph Navier Stokes Networks

链接: https://arxiv.org/abs/2605.21247
作者: Zexing Zhao,Guangsi Shi,Yu Gong,Tianyu Wang,Shirui Pan,Hongye Cheng,Yuxiao Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have emerged as a cornerstone of deep learning, with most existing methods rooted in graph signal processing and diffusion equations to model message passing. However, these approaches inherently suffer from the oversmoothing problem, where node features become indistinguishable as the network depth increases. Inspired by the Navier Stokes equations, we introduce Graph Navier Stokes Networks (GNSN), a novel architecture that transcends conventional diffusion-based message passing by incorporating convection into graph structures. GNSN defines a dynamic velocity field on the graph to govern convection, enabling more efficient and direct message propagation. By adaptively balancing convection and diffusion, GNSN is able to efficiently handle datasets with varying levels of homophily. Extensive evaluations across twelve real-world datasets demonstrate that GNSN consistently outperforms state-of-the-art baselines in classification accuracy. Moreover, experimental results further emphasize its effectiveness in alleviating the oversmoothing problem.

[LG-21] Divide and Contrast: Learning Robust Temporal Features without Augmentation ICML2026

链接: https://arxiv.org/abs/2605.21241
作者: Abdul-Kazeem Shamba,Kerstin Bach,Gavin Taylor
类目: Machine Learning (cs.LG)
*备注: Published in the 43rd International Conference on Machine Learning (ICML 2026)

点击查看摘要

Abstract:Self-supervised learning for time-series representation aims to reduce reliance on labeled data while maintaining strong downstream performance, yet many existing approaches incur high computational costs or rely on assumptions that do not hold across diverse temporal dynamics. In this work, we introduce Divide and Contrast (Di-COT), an unsupervised framework that avoids data augmentation and multiple encoder passes by contrasting informative substructures within a window rather than individual timesteps. Di-COT stochastically partitions each window into a small number of overlapping sub-blocks per iteration, enabling efficient and meaningful contrast while mitigating false positives during temporal transitions. To further improve scalability, we adopt a contrastive objective whose computation depends on the batch size and the number of sub-blocks, making loss computation independent of sequence length. Extensive experiments on six large-scale real-world datasets, as well as the UCR and UEA benchmarks, demonstrate that Di-COT learns semantically structured and transferable representations, achieving state-of-the-art performance on classification, clustering, k NN, and cross-dataset transfer, while substantially reducing training time. The source code is publicly available at this https URL.

[LG-22] Reinforcement Learning-based Control via Y-wise Affine Neural Networks: Comparative Case Studies for Chemical Processes

链接: https://arxiv.org/abs/2605.21211
作者: Austin Braniff,Yuhe Tian
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: Accepted for publication at the 23rd IFAC World Congress, 2026

点击查看摘要

Abstract:In this work we present an efficient and practically implementable approach for the application of reinforcement learning (RL)-based control in chemical process systems. This is an area that has yet to widely adopt RL-based control largely due to inherent challenges in trusting RL algorithms and the time-consuming process of training reliable agents. To address these challenges, we leverage a class of RL algorithms termed Y-wise Affine Neural Network (YANN)- RL, which we have developed in our prior work (Braniff and Tian, 2025a). By strategically initializing actor and critic networks YANN-RL algorithms provide confident and interpretable starting points within control schemes. We apply this RL-based control approach to three different process engineering case studies publicly available on the PC-Gym library (Bloor et al., 2026): (i) a continuous stirred tank reactor (CSTR), (ii) a four-tank system, and (iii) a multistage extraction column. Our approach is compared to several popular RL algorithms (PPO, SAC, DDPG, and TD3) and is benchmarked against nonlinear model predictive control (NMPC). These case studies demonstrate that YANN-RL can greatly reduce the training time and data needed, can be deployed with confidence for chemical process systems, and can approach the performance of NMPC without the knowledge of a full nonlinear model.

[LG-23] Domain-Adaptable Reinforcement Learning for Code Generation with Dense Rewards

链接: https://arxiv.org/abs/2605.21180
作者: Erfan Aghadavoodi Jolfaei,Daniel Maninger,Abhinav Anand,Mert Tiftikci,Mira Mezini
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: 10 pages, 2 figures, under review

点击查看摘要

Abstract:Large language models show strong potential for automated code generation, but lack guarantees for correctness, quality, safety, and domain-specific constraints. For instance in robotics, where code generation is increasingly being used for planning and executing actions, awareness of the environment and physical constraints is critical. To facilitate the adaption of code-generating LLMs to diverse requirements, including domain-specific ones, we present a reinforcement learning framework that fine-tunes pre-trained LLMs using proximal policy optimization. Our customizable execution-aware reward formula captures and optimizes syntax, functional correctness, code style, security, and simulator executability. A token-level reward mapping mechanism enables effective credit assignment from execution outcomes to generated tokens. The framework is evaluated on general-purpose code generation (MBPP/MBPP+) and robotic program synthesis (RoboEval). The results show substantial improvements in functional correctness and simulator executability, including an absolute pass@1 increase of 19% on MBPP and a reduction in execution failures by 51% on RoboEval. These findings demonstrate that structured reinforcement learning can effectively align language models to correct program generation and domain-specific requirements.

[LG-24] Q-SYNTH: Hybrid Quantum-Classical Adversarial Augmentation for Imbalanced Fraud Detection

链接: https://arxiv.org/abs/2605.21164
作者: Adam Innan,Mansour El Alami,Nouhaila Innan,Muhammad Shafique,Mohamed Bennai
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: 13 pages, 6 figures

点击查看摘要

Abstract:Credit card fraud detection is fundamentally challenged by extreme class imbalance, where fraudulent transactions are rare yet operationally critical. This imbalance often biases supervised learners toward the legitimate class, leading to high overall accuracy but weaker fraud-class recall and F1-score. This paper introduces Q-SYNTH, a hybrid classical–quantum generative adversarial framework in which a parameterized quantum circuit serves as the generator and a classical neural network serves as the discriminator. Q-SYNTH is designed for minority-class fraud synthesis in tabular data and is evaluated along two dimensions: statistical fidelity to real fraud samples and downstream performance for fraud detection. To this end, generated samples are assessed using distributional similarity measures based on Kolmogorov-Smirnov statistics and Wasserstein distances, real-vs-synthetic detectability measured by AUC-ROC, and downstream classification performance across both quantum and classical classifiers. Under the reported protocol, Q-SYNTH reduces marginal distribution mismatch relative to a classical GAN baseline while maintaining competitive downstream fraud-detection performance. Although SMOTE achieves the strongest feature-wise similarity and the classical GAN attains the highest downstream performance in several settings, Q-SYNTH offers a favorable compromise between distributional fidelity and downstream performance, supporting the feasibility of hybrid quantum augmentation for imbalanced fraud detection.

[LG-25] Learning First Integrals via Backward-Generated Data and Guided Reinforcement Learning

链接: https://arxiv.org/abs/2605.21160
作者: Jingfeng Zhong,Zhengxiang Liu,Zhijie Wang,Shuai Li
类目: Machine Learning (cs.LG)
*备注: 17 pages, 2 figures, 3 tables

点击查看摘要

Abstract:The discovery of first integrals is of fundamental scientific importance for understanding conservation laws in dynamical systems. However, existing symbolic computation tools and Large Language Models (LLMs) remain limited on this task because high-quality training data are scarce and successful solutions often depend on mathematical intuition. This paper presents FISolver, an LLM-based solver developed to address this challenge. First, we introduce a “Backward Generation” algorithm that systematically builds large-scale datasets of (differential equation, first integral) pairs by deriving differential equations from sampled integrals, thereby alleviating the data scarcity bottleneck. Second, we apply supervised fine-tuning to a compact mathematical model and further improve its performance through reinforcement learning with a Levenshtein Distance-based shaped reward. In addition, we design data synthesis and blending strategies that support effective adaptation to difficult problem families from sparse examples. Experiments show that FISolver, while requiring substantially lower computational cost, significantly outperforms larger mathematical LLMs and commercial solvers such as Mathematica on challenging benchmarks, indicating a new data-driven route for automated discovery of first integrals.

[LG-26] CoarseSoundNet: Building a reliable model for ecological soundscape analysis

链接: https://arxiv.org/abs/2605.21143
作者: Alexander Gebhard,Andreas Triantafyllopoulos,Dominik Arend,Sandra Müller,Svenja Schmidt,Michael Scherer-Lorenzen,Björn W. Schuller
类目: ound (cs.SD); Machine Learning (cs.LG)
*备注: Currently under review

点击查看摘要

Abstract:A soundscape is composed of three types of sound: biophony (sounds made by animals), geophony (natural abiotic sounds) and anthropophony (sounds made by humans). A key research question in the field of soundscape ecology is how these components interact with each other, specifically how biophony responds to geophony and anthropophony. Nevertheless, as of today, there are not many analytical instruments that enable the distinct quantification of these elements. Recent machine learning (ML) approaches aim to support automated analysis but often rely on task-specific or clean data, limiting generalisation to noisy passive acoustic monitoring (PAM) recordings. This study presents a clear and reproducible structure to build ML models for coarse soundscape classification and introduces CoarseSoundNet, a deep learning model trained to distinguish biophony, geophony, and anthropophony under realistic PAM conditions. We systematically investigate model architectures, the influence of an additional training class, data composition, and evaluation strategies. Our findings suggest that model performance improves with additional PAM data, especially when similar to the target domain, and by introducing an explicit silence class during training. Class-specific decision thresholds and duration-based constraints further enhance performance, particularly for anthropophony and geophony. Error analyses exhibit challenges for anthropophony due to masking effects and confusions for silence and insect sounds for geophony and biophony. Finally, we conduct an ecological case study which shows that pre-filtering recordings with CoarseSoundNet yields acoustic index trends comparable to ground-truth filtering, supporting its use as an effective preprocessing tool for ecoacoustic analyses.

[LG-27] Reasoning -Trace Collapse: Evaluating the Loss of Explicit Reasoning During Fine-Tuning

链接: https://arxiv.org/abs/2605.21127
作者: Lukas Twist,Helen Yannakoudakis,Jie M. Zhang
类目: Machine Learning (cs.LG)
*备注: 22 pages, 3 tables, 3 figures

点击查看摘要

Abstract:Explicit reasoning models are trained to produce intermediate reasoning traces before final answers, but downstream fine-tuning is often performed on ordinary instruction-response data that contains no such traces. We show that this mismatch can induce reasoning-trace collapse: a fine-tuned model continues to produce plausible final answers while losing the structurally valid explicit reasoning traces that made it a reasoning model in the first place. We introduce a structural evaluation framework that separates answer correctness from reasoning-trace validity, measuring valid, empty, missing, and truncated reasoning alongside reasoning-conditioned task performance. Using this framework, we study four open-weight reasoning models and find that standard supervised fine-tuning can rapidly suppress valid reasoning traces, and that answer-only metrics can substantially obscure this failure: in several settings, performance conditional on valid reasoning remains high while the rate of valid reasoning falls sharply. We further show that simple loss-masking strategies can substantially mitigate collapse without requiring teacher-generated reasoning traces. These results suggest that evaluations of fine-tuned reasoning models should report structural reasoning reliability metrics in addition to final-answer performance, especially when adaptation data does not contain explicit reasoning traces.

[LG-28] Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation ICML2026

链接: https://arxiv.org/abs/2605.21125
作者: Xixiang He,Qiyao Sun,Ao Cheng,Xingming Li,Xuanyu Ji,Hailun Lu,Runke Huang,Qingyong Hu
类目: Machine Learning (cs.LG)
*备注: 26 pages, 12 figures. Accepted at the International Conference on Machine Learning (ICML 2026)

点击查看摘要

Abstract:Group Relative Policy Optimization (GRPO), a prominent algorithm within the Reinforcement Learning from Verifiable Rewards (RLVR) framework, has achieved strong results in improving the reasoning capabilities of large language models (LLMs). However, GRPO is prone to advantage collapse, a failure mode where homogeneous rewards within a group (e.g., all correct or all incorrect answers) yield near-zero advantages and vanishing gradients. To address this, we introduce the Advantage Collapse Rate (ACR), the first diagnostic metric quantifying the proportion of training batches with ineffective gradients. Across models from 0.5B to 14B parameters on mathematical reasoning benchmarks, we show that ACR strongly predicts training stagnation and final performance. We then propose Adaptive Virtual Sample Policy Optimization (AVSPO), a lightweight extension of GRPO that injects virtual reward samples, guided by real-time ACR monitoring, to enable learning from homogeneous groups without additional model rollouts. AVSPO reduces advantage collapse by 58-63% relative to GRPO and yields consistent accuracy gains of 4-6 percentage points across all model scales, while maintaining generalization on the evaluated out-of-domain task. Code and datasets are available at this https URL.

[LG-29] Automated Byzantine-Resilient Clustered Decentralized Federated Learning for Battery Intelligence in Connected EVs

链接: https://arxiv.org/abs/2605.21115
作者: Mouhamed Amine Bouchiha,Abdelaziz Amara Korba,Yacine Ghamri-Doudane
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 16 pages, 11 figures, under review for IEEE T-ITS

点击查看摘要

Abstract:Federated learning (FL) has emerged as a promising paradigm for managing electric vehicle (EV) battery data in intelligent transportation systems (ITS), enabling privacy-preserving tasks such as anomaly detection and capacity estimation. However, most existing frameworks rely on centralized aggregation schemes, which pose critical limitations in terms of security and trust. To address these challenges, we propose ABC-DFL, an automated Byzantine-resilient clustered decentralized federated learning (C-DFL) framework for connected EVs. The proposed incentive-driven C-DFL system replaces the central server with an open-permissioned blockchain, featuring a new dynamic Quorum Byzantine Fault Tolerance (QBFT) protocol and an oracle-based aggregation layer, to enhance trust, security, and automation. At the core of ABC-DFL lies FLECA (Filtered Layered Enhanced Clustering Aggregation), a robust hierarchical aggregation protocol that mitigates Byzantine attacks by having each EV filter malicious updates using an adaptive threshold based on deviations from its reference model update. Oracle nodes, responsible for inter-group aggregation, employ robust clustering to isolate and aggregate model updates from trustworthy EV groups. Comprehensive experimental evaluations demonstrate that FLECA matches FedProx convergence under benign conditions and significantly outperforms existing defenses with attack impact scores below 0.10 in adaptive adversarial scenarios. Furthermore, several learning experiments with multitask models confirm the effectiveness and fairness of the incentive mechanism. Finally, on-chain and off-chain benchmarks validate the practicality of ABC-DFL.

[LG-30] A Unified Framework for Uncertainty-Aware Explainable Artificial Intelligence: A Case Study in Power Quality Disturbance Classification

链接: https://arxiv.org/abs/2605.21114
作者: Yinsong Chen,Samson S. Yu,Zhong Li,Chee Peng Lim
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Post-hoc explainable AI (XAI) methods typically produce deterministic attribution maps, whereas Bayesian neural networks (BNNs) induce a distribution over explanations. Capturing the variability of this distribution is important for uncertainty-aware decision-making. This paper formalises the \emphexplanation distribution as the push-forward measure of the BNN posterior through any Lipschitz-continuous attribution operator. It further proposes the uncertainty-aware relevance attribution operator (UA-RAO), a general family of operators that summarises the explanation distribution using the mean, variance, coefficient of variation, quantiles, and set-theoretic aggregation measures. Theoretical support is provided through Monte Carlo accessibility and Wasserstein approximation bounds. The framework is evaluated on a 15-class power quality disturbance (PQD) classification benchmark, comparing three BNN approximations paired with three attribution operators using relevance mass accuracy and intersection-over-union as localisation metrics. Results show that deep ensembles with the mean UA-RAO improve localisation over the deterministic baseline, while other UA-RAO summaries reveal uncertainty patterns absent from point-estimate attributions. Qualitative results on measured signals further suggest that these patterns generalise beyond the synthetic training distribution. The framework is domain-agnostic and can be applied to any BNN paired with a Lipschitz-continuous attribution operator.

[LG-31] Improved Guarantees for Constrained Online Convex Optimization via Self-Contraction

链接: https://arxiv.org/abs/2605.21107
作者: Dhruv Sarkar,Abhishek Sinha
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We consider Constrained Online Convex Optimization (COCO) with adversarially chosen constraints. At each round, the learner chooses an action before observing the loss and constraint function for that round. The goal is to achieve small static regret against the best point satisfying all constraints while also controlling cumulative constraint violation ( \mathsfCCV ). For strongly convex losses, state-of-the-art algorithms achieve O(\log T) regret and O(\sqrtT \log T) \mathsfCCV. The corresponding best-known bounds for convex losses is O(\sqrtT) regret and O(\sqrtT \log T) \mathsfCCV . In this paper, we give a simple projection-based algorithm that simultaneously achieves O(\log T) regret and O(\log T) \mathsfCCV for strongly-convex losses, yielding an exponential improvement in the \mathsfCCV . For the convex losses, our algorithm improves the \mathsfCCV to O(\sqrtT) while maintaining the optimal O(\sqrtT) regret. The key to our improvement is a recent geometric result for self-contracted curves, which may be of independent interest.

[LG-32] HORST: Composing Optimizer Geometries for Sparse Transformer Training

链接: https://arxiv.org/abs/2605.21104
作者: Tom Jacobs,Rohan Jain,Rebekka Burkholz
类目: Machine Learning (cs.LG)
*备注: 22 pages, 8 figures

点击查看摘要

Abstract:Sparsifying transformers remains a fundamental challenge, as standard optimizers fail to simultaneously encourage sparsity and maintain training stability. Effective adaptive optimizers exhibit an implicit L_\infty bias favoring stability, yet, sparsity requires an L_1 bias. To integrate sparsity, we propose a composition of optimizer steps, which we cast as non-commutative operators to analyze and combine their optimization geometry in a principled way. This yields HORST (Hyperbolic Operator for Robust Sparse Training), a modular optimizer that inherits stability from adaptive methods while inducing L_1 sparsity bias through a hyperbolic mirror map. Our experiments demonstrate its utility for sparse training of transformers on both vision and language tasks. HORST consistently and significantly outperforms AdamW baselines across all sparsity levels, with large gains at higher sparsity.

[LG-33] A Typed Tensor Language for Federated Learning

链接: https://arxiv.org/abs/2605.21103
作者: Theofilos Mailis,Kalliopi-Christina Despotidou,Konstantinos Filippopolitis,Yannis Foufoulas,Thanasis-Michail Karampatsis,Andreas Ktenidis,Evdokia Mailli,Theodore Papamarkou,Yannis Ioannidis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated learning and analytics are often described as collections of separate protocols, even when they share the same mathematical form: client-local tensor computation, mergeable aggregation into shared state, and shared-only post-processing. We introduce a typed tensor language that formalizes this structure. The language distinguishes federated tensors, whose records are partitioned across clients along a tracked record axis, from shared tensors, which are available globally. Its semantics are defined by comparison with a virtual global tensor, used only as a reference object. The main result is a shared-state factorization theory. We show that typed one-round programs factor through fixed-dimensional shared state whose size is independent of the number of clients and records, computed from client-local tensor expressions and merged across clients. We also prove a converse representability result; factorizations whose encoders and decoders are expressible in the language are realized by typed one-round programs, and the correspondence extends to iterative programs whose cross-round state is shared. This gives a formal account of the computations in the language that can be expressed as encode, merge, and decode procedures. We then develop a differentiable fragment for learning. If a per-record loss and its per-record gradient are represented by client-local tensor expressions, the global gradient is represented by record-axis summation of the federated gradient tensor. This yields typed iterative programs for server-side gradient descent and shared-linear-algebra second-order updates. The framework characterizes a broad class of federated learning computations whose communication passes through fixed-dimensional shared state.

[LG-34] UOTIP: Unbalanced Optimal Transport Map for Unpaired Inverse Problems ICML2026

链接: https://arxiv.org/abs/2605.21094
作者: Donggyu Lee,Taekyung Lee,Jaewoong Choi
类目: Machine Learning (cs.LG)
*备注: Accepted at ICML 2026

点击查看摘要

Abstract:We investigate unpaired image inverse problems, a challenging setting where only independent, non-paired sets of noisy measurements and clean target signals are available for training. We propose a novel inverse problem solver based on Unbalanced Optimal Transport, called Unbalanced Optimal Transport Map for Inverse Problems (UOTIP). Our method formulates the reconstruction task, predicting clean target signals from noisy measurements, as learning a UOT Map from noisy measurement distribution to clean signal distribution by incorporating a likelihood-based cost function. By relaxing the exact marginal constraint, the UOT framework provides key advantages to our model: robustness to multi-level observation noise, adaptability to class imbalance between noisy and clean datasets, and generalizability to diverse noise-type scenarios. Furthermore, we theoretically demonstrate that incorporating a quadratic cost term ensures the existence and uniqueness of the transport map by satisfying the twist condition, even for ill-posed inverse problems. Our experiments demonstrate that UOTIP achieves state-of-the-art performance on unpaired image inverse problem benchmarks, across linear and nonlinear inverse problems.

[LG-35] Reviving Error Correction in Modern Deep Time-Series Forecasting

链接: https://arxiv.org/abs/2605.21088
作者: Minh Hoang Nguyen,Dai Do,Huu Hiep Nguyen,Dung Nguyen,Kien Do,Hung Le
类目: Machine Learning (cs.LG)
*备注: 27 pages

点击查看摘要

Abstract:Modern deep-learning models have achieved remarkable success in time-series forecasting. Yet, their performance degrades in long-term prediction due to error accumulation in autoregressive inference, where predictions are recursively used as inputs. While classical error correction mechanisms (ECMs) have long been used in statistical methods, their applicability to deep learning models remains limited or ineffective. In this work, we revisit the error accumulation problem in deep time-series forecasting and investigate the role and necessity of ECMs in this new context. We propose a simple, architecture-agnostic error correction model that can be integrated with any existing forecaster without requiring retraining. By explicitly decomposing predictions into trend and seasonal components and training the corrector to adjust each separately, we introduce the Universal Error Corrector with Seasonal-Trend Decomposition (UEC-STD), which significantly improves correction accuracy and robustness across 4 backbones and 10 datasets. Our findings provide a practical tool for enhancing forecasts while offering new insights into mitigating autoregressive errors in deep time-series models. Code is available at this https URL.

[LG-36] Musical Attention Transformer: Music Generation Using a Music-Specific Attention Model

链接: https://arxiv.org/abs/2605.21081
作者: Shinnosuke Taksuka,Hideo Mukai
类目: ound (cs.SD); Machine Learning (cs.LG)
*备注: 32 pages, 13 figures

点击查看摘要

Abstract:This study aims to enhance the quality of music generation using Transformers by incorporating meta-information. While Transformer-based approaches are effective at capturing long-term dependencies in musical compositions, the music they generate often suffers from issues such as excessive repetition or duplication of notes, leading to unnatural melodies. To address these limitations, we propose Musical Attention, a mechanism that incorporates meta-information such as bar numbers, key, signatures, and tempos into the attention process. Musical Attention explicitly leverages both the structural properties of music and its associated metadata, enabling the Transformer’s attention mechanism to operate more effectively and thereby improving the quality of the generated output. In our framework, each musical note is represented as a combination of five events-pitch, bar number, onset, duration, and velocity in addition to the three metadata elements. The attention mechanism is then modified to reflect the correlations among these eight features, allowing the model to better capture the inherent characteristics of musical composition. Experimental results demonstrate that the model incorporating Musical Attention outperforms prior methods, such as Full Attention and Strided Attention, in terms of musical coherence, variation, and overall quality. Notably, it significantly reduces repetition and enhances the model’s ability to generate diverse, harmonically consistent melodies. Musical Attention thus represents a meaningful advancement in AI-driven music generation, facilitating the creation of more natural and expressive compositions.

[LG-37] owards Understanding Self-Pretraining for Sequence Classification ICML2025

链接: https://arxiv.org/abs/2605.21070
作者: Omar Coser,Loredana Zollo,Paolo Soda,Antonio Orvieto
类目: Machine Learning (cs.LG)
*备注: v1: Preliminary, extension of the version accepted at ICML 2025 Workshop MOSS

点击查看摘要

Abstract:Amos et al. (2024) showed that the accuracy of Transformer models in sequence classification can be significantly improved by first pretraining with a masked token prediction objective without external data or augmentation, a procedure referred to as self-pretraining (SPT). While the primary objective of Amos et al. (2024) was to showcase that Transformers can achieve strong performance on the Long-Range Arena (LRA), their pipeline raises more fundamental questions: How does SPT drive optimization to better solutions? Why can standard supervised training fail in Transformers? To better understand this, we replicate and systematically ablate the findings of Amos et al. (2024). Our ablations suggest that a central bottleneck in the studied settings is not depth or generalization alone, but the ability of label supervision to learn useful query-key Attention patterns from random initialization. With a minimal setup, we identify learning proximity interactions - turning absolute positional encodings into proximity-biased Attention scores - as a key source of the improvements brought by SPT. Finally, in a simplified theoretical setup, we show that label supervision can be locally blind to certain Attention-score directions that are instead detectable through masked reconstruction.

[LG-38] Robust Personalized Recommendation under Hidden Confounding in MNAR

链接: https://arxiv.org/abs/2605.21066
作者: Zongyu Li,Wanting Su,Tianyu Xia
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recommender systems often rely on observational user–item interaction data, which is prone to selection bias due to users’ selective interactions with items. Inverse propensity weighting and doubly robust estimators effectively mitigate selection bias under observed confounding, but are unreliable in the presence of hidden confounders. Existing approaches relying on randomized controlled trials (RCTs) or global sensitivity bounds are constrained in practice: RCTs demand costly experimental data, while global sensitivity bounds presume a uniformly bounded effect of unmeasured confounders on propensities through sensitivity analysis, thereby neglecting heterogeneity across user–item interactions. To overcome this limitation, we propose a novel framework, which estimates user–item level sensitivity bounds, thereby substantially relaxing the homogeneity assumption inherent in global sensitivity bounds named Personalized Unobserved-Confounding-aware Interaction Deconfounder (PUID). To ensure both robustness and predictive accuracy, we further develop an adversarial optimization strategy and propose a benchmark-guided variant (BPUID) that incorporates pre-trained models as stabilizing references. Extensive experiments on three real-world datasets demonstrate that our approach significantly outperforms global methods under hidden confounding, without requiring RCT data.

[LG-39] A Dialogue between Causal and Traditional Representation Learning: Toward Mutual Benefits in a Unified Formulation

链接: https://arxiv.org/abs/2605.21058
作者: Yan Li,Yuewen Sun,Shaoan Xie,Gongxu Luo,Yunlong Deng,Kun Zhang,Guangyi Chen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Causal representation learning (CRL) and traditional representation learning have largely developed along different trajectories. Traditional representation learning has been driven mainly by applications and empirical objectives, whereas CRL has focused more on theoretical questions, particularly identifiability. This difference in emphasis has created a gap between the two fields in terminology, problem formulation, and evaluation, limiting communication and sometimes leading to disconnected or redundant efforts. In this paper, we argue that these two fields should be brought into dialogue rather than treated as separate paradigms. To this end, we introduce a unified formulation in which the representation learning is characterized by two components: a task component, which specifies what information the learned representation is required to preserve, and a constraint component, which specifies what structure is imposed on the latent space. Under this formulation, the benefits run in both directions. CRL provides theoretical tools for understanding when structured latent constraints are useful or necessary, while traditional representation learning offers practical insights on task design and objective choice that can improve the development of CRL methods. To illustrate this interaction, we experimentally study how different task components affect the behavior of CRL methods under different structured constraints. Results on CausalVerse show that the effectiveness of causal constraints depends strongly on the tasks with which they are paired.

[LG-40] Genetic Programming with Transformer-Based Mutation for Approximate Circuit Design

链接: https://arxiv.org/abs/2605.21055
作者: Ondrej Galeta,Lukas Sekanina
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: To appear at IEEE World Congress on Computational Intelligence, Congress on Evolutionary Computation, Maastricht, NL, 2026

点击查看摘要

Abstract:A recent trend is to leverage machine learning models to improve the evolutionary design and optimization process. We propose a novel transformer-based mutation operator for Cartesian genetic programming (CGP) for the automated design of approximate arithmetic circuits. We introduce a hybrid scheme for CGP in which the proposed mutation operator is switched with the standard mutation operator to prevent stagnation of the circuit approximation process. We also develop a new training scheme for the underlying transformer that utilizes training vectors composed of thousands of CGP chromosomes representing various approximate multipliers. For several target error constraints, the approximate multipliers evolved with CGP utilizing the transformer-based mutation achieve better trade-offs than the highly optimized designs available in the state-of-the-art EvoApproxLib library of approximate circuits. Although both training and evolutionary processes are computationally demanding, they appear to be necessary steps for improving existing approximate circuits and producing new, potentially patentable circuit designs.

[LG-41] Efficient Banzhaf-Based Data Valuation for k-Nearest Neighbors Classification VLDB2026

链接: https://arxiv.org/abs/2605.21033
作者: Guangyi Zhang,Lutz Oettershagen,Lixu Wang,Aristides Gionis
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注: To appear at VLDB 2026

点击查看摘要

Abstract:Data valuation, the task of quantifying the contribution of individual data points to model performance, has emerged as a fundamental challenge in machine learning. Game-theoretic approaches, such as the Banzhaf value, offer principled frameworks for fair data valuation; however, they suffer from exponential computational complexity. We address this challenge by developing efficient algorithms specifically tailored for computing Banzhaf values in k -nearest neighbor ( k NN) classifiers. We first establish the theoretical hardness of the problem by proving that it is #P-hard. Despite this intractability, we exploit the locality properties of k NN classifiers to develop practical exact algorithms. Our main contribution is a dynamic programming framework that achieves significant computational improvements: we present a pseudo-polynomial algorithm with O(Wkn^2) time complexity for weighted k NN classifiers, where W is the maximum sum of top- k weights, and a specialized algorithm for unweighted k NN that achieves O(nk^2) time complexity, that is, linear in the number of data points. We also offer efficient Monte Carlo estimation methods. Extensive experiments on real-world datasets demonstrate the practical efficiency of our approach and its effectiveness in data valuation applications.

[LG-42] Beyond the Bellm an Recursion: A Pontryagin-Guided Framework for Non-Exponential Discounting

链接: https://arxiv.org/abs/2605.20996
作者: Hojin Ko,Jeonggyu Huh
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Most value-based and actor–critic reinforcement learning methods rely on Bellman-style recursions, yet these recursions collapse under non-exponential discounting common in human preferences and survival processes. We show the breakdown is structural: exponential discounting sits at a fragile intersection of multiplicativity and time homogeneity, and violating either property breaks standard dynamic programming. To overcome this, we propose Pontryagin-Guided Direct Policy Optimization (PG-DPO), a variational framework that abandons recursion and couples the Pontryagin Maximum Principle with Monte Carlo rollouts via an Adjoint-MC projection enforcing pointwise Hamiltonian maximization. Across multi-dimensional hyperbolic and survival-discount benchmarks, PG-DPO improves accuracy and stability where equation-driven solvers and critic-based baselines diverge.

[LG-43] Modeling Temporal scRNA-seq Data with Latent Gaussian Process and Optimal Transport

链接: https://arxiv.org/abs/2605.20989
作者: Mehmet Yigit Balik,Harri Lähdesmäki
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注:

点击查看摘要

Abstract:Single-cell RNA sequencing provides insights into gene expression at single-cell resolution, yet inferring temporal processes from these static snapshot measurements remains a fundamental challenge. Current approaches utilizing neural differential equations and flows are sensitive to overfitting and lack careful considerations of biological variability. In this work, we propose a generative framework that models population trends using a latent heteroscedastic Gaussian process (GP) approximated by Hilbert space methods. To address the absence of genuine cell trajectories, we leverage an optimal transport (OT) objective that aligns generated and observed population distributions. Our method explicitly captures biological heterogeneity by incorporating cell-specific latent time and cell type conditioning to disentangle temporal asynchrony and trajectories to different cell types. We demonstrate state-of-the-art performance on complex interpolation and extrapolation benchmarks and introduce a novel gradient-based strategy for inferring perturbation trajectories.

[LG-44] Point Cloud Sequence Encoding for Material-conditioned Graph Network Simulators NEURIPS2026

链接: https://arxiv.org/abs/2605.20978
作者: Philipp Dahlinger,Balázs Gyenes,Niklas Freymuth,Luca Geminiani,Tobias Würth,Johannes Mitsch,Nadja Klein,Luise Kärger,Gerhard Neumann
类目: Machine Learning (cs.LG)
*备注: 9 pages + appendix, 7 figures. Submitted to the 40th Conference on Neural Information Processing Systems (NeurIPS 2026)

点击查看摘要

Abstract:Graph Network Simulators (GNSs) have emerged as powerful surrogates for complex physics-based simulation, offering inherent differentiability and orders-of-magnitude speedups over traditional solvers. However, GNSs typically assume access to the underlying material parameters, such as stiffness or viscosity, severely limiting their utility in realistic experimental settings. While recent meta-learning approaches address the parameter dependency by inferring properties from mesh trajectories, reconstructing a mesh from an observed scene is challenging. In this work, we introduce Point Cloud Encoding for Accurate Context Handling (PEACH), a novel framework that applies in-context learning on point clouds to adapt a learned simulator to unseen physical properties during inference. Our approach relies on a novel spatio-temporal point cloud sequence encoder, as well as two forms of auxiliary supervision to help improve simulation fidelity. We demonstrate that PEACH is capable of accurate zero-shot sim-to-real transfer on a challenging, dynamic scene. Experiments on simulation scenes show that PEACH even outperforms mesh-based baselines on prediction accuracy, while being much more practical for real-world deployment.

[LG-45] Choose Wisely and Privately: Proactive Client Selection for Fair and Efficient Federated Learning

链接: https://arxiv.org/abs/2605.20975
作者: Adda Akram Bendoukha,Heber Hwang Arcolezi,Nesrine Kaaniche,Aymen Boudguiga
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Federated Learning enables collaborative model training across decentralized data sources without data transfer. Averaging-based FL is limited by the presence of non-IID data, which negatively impacts convergence speed and final model accuracy. Conventional alternatives suffer from significant inefficiency. Clients with noisy or highly heterogeneous data contribute expensive gradient computations that are either discarded or heavily down-weighted before aggregation. These reactive approaches waste computational resources, require more communication rounds and result in unnecessary privacy exposure. In this paper, we propose a proactive client selection framework that aims to find an optimal federation of clients whose combined data match utility and fairness requirements before training begins. Our method relies on mutual information computed from differentially private contingency tables to quantify the relevance of cross-feature correlations in the union dataset. We introduce a Potential Federation Loss (PFL) over the set of fixed-size federations, which balances two objectives. Maximizing collective data utility while ensuring fair cross-features correlations to prevent group unfairness. Client selection is expressed as an optimal subset search problem over the PFL objective, which we solve using simulated annealing under strong differential privacy guarantees for clients’ local statistics. Experimental results on four benchmarks show faster, fairer, and more accurate models trained on optimally found federations, compared to uniform sampling, even when state-of-the-art adaptive aggregation or sampling strategies are employed.

[LG-46] A Deployment Audit of Release-Side Risk in Conformal Triage under Prevalence Shift

链接: https://arxiv.org/abs/2605.20956
作者: Chengze Li,Xiao Liu,Hanrong Zhang,Haiyang Peng,Yanghao Ruan,Huanhuan Ma,Chunyu Miao,Qichao Zhou,Xiangrong Qi,Philip Yu
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: 18 pages, 4 figures, 5 tables

点击查看摘要

Abstract:Conformal triage converts predictive scores into deployment actions that either release a case, flag it for urgent attention, or defer it to human review. Under prevalence shift, however, the usual summaries of marginal coverage and human-review rate can miss the safety-critical question of whether patients who truly experience the target event are released without review. To address this gap, we introduce a leakage-aware deployment audit for release-side conformal triage. It first assigns target subjects to three non-overlapping roles: prevalence correction, conformal calibration, and held-out release-safety evaluation. This separation then lets the audit evaluate release directly: how many event-positive patients are cleared without review, whether the pilot has enough event labels for calibration, and how the safety-review trade-off shifts. Applying this audit to a retrospective NSCLC pilot shows why lower review can be misleading: after prevalence correction, the pooled conformal branch lowers review by releasing more patients, some of whom are event-positive. Within the audit, the classwise branch acts as a scarcity diagnostic: the pilot has too few event labels to certify safe low-review release.

[LG-47] raining distribution determines the ceiling of drug-blind cancer sensitivity prediction

链接: https://arxiv.org/abs/2605.20885
作者: Taekyung Heo
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Precision oncology requires predicting which drugs will suppress a specific tumor from its molecular profile, but drug-blind sensitivity prediction has plateaued despite increasingly complex drug representations. Here we show that this stagnation reflects a metric artifact rather than a representational bottleneck. The standard benchmark, global Pearson r, is dominated by between-drug potency differences that a trivial drug-mean predictor captures without any cell-specific learning. Per-drug Pearson r, which isolates within-drug cell ranking, reveals that no drug encoding improves over cell-only features across four independent datasets. A controlled experiment channeling mechanism-of-action identity as either a drug feature or a training-distribution constraint identifies the cause. Supplying MoA as a feature yields negligible benefit, whereas using it to stratify training raises per-drug r substantially for targeted kinase inhibitors, because pan-cancer co-training suppresses pathway-specific sensitivity signals. Mechanism-stratified training and response matching from pilot observations provide two deployable strategies that together recover the principal sources of predictive gain in drug-blind sensitivity prediction.

[LG-48] Learning fMRI activations dictionaries across individual geometries via optimal transport

链接: https://arxiv.org/abs/2605.20883
作者: Sonia Mazelet,Rémi Flamary,Bertrand Thirion
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Dictionary learning is a powerful tool for creating interpretable representations. When applied to functional magnetic resonance imaging (fMRI) data, the resulting patterns of brain activity can be used for various downstream tasks, such as brain state classification or population-level analysis. However, a major challenge is the variability in brain geometry across individuals. This is usually addressed by projecting each individual brain geometry onto a common template, which removes subject-specific information. In this work, we introduce a novel approach to dictionary learning on fMRI data that explicitly accounts for this variability. We use the optimal transport-based Fused Gromov-Wasserstein (FGW) distance to compare graphs with different geometries and features. To address the challenge of computing multiple FGW distances for large graphs such as those arising from fMRI data, we rely on amortized optimization to learn a neural network that predicts an approximation of the optimal transport plans, which substantially reduces the computational cost. Additionally, we learn dictionary atoms that depend on the FGW trade-off parameter, which controls the balance between feature alignment and structural consistency. Numerical experiments on the HCP dataset demonstrate that the proposed approach captures different levels of geometric variability in the data and provides representations that preserve essential information.

[LG-49] NeighborDiv: Training-free Zero-shot Generalist Graph Anomaly Detection via Neighbor Diversity

链接: https://arxiv.org/abs/2605.20879
作者: Kaifeng Wei,Teng Liu,Liang Dong,Xiubo Liang,Yuke Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Anomaly Detection (GAD) is increasingly shifting to Generalist GAD (GGAD) for cross-domain “one-for-all” detection, but existing GGAD methods predominantly rely on the neighbor consistency principle, falling into the \textbfNode-to-Neighbor Consistency Paradigm for anomaly quantification. These methods suffer from complex training pipelines, heavy training data dependency, high computational costs, and unstable cross-domain generalization. To address these limitations, we propose NeighborDiv, a training-free generalist graph anomaly detection framework based on neighbor diversity. Departing from the dominant Node-to-Neighbor Consistency Paradigm, we shift the focus to the \textbfNeighbor-to-Neighbor Diversity Paradigm, and uncover that the internal structural dispersion of a node’s neighbor set is a powerful, independently discriminative anomaly signal. We quantify neighbor diversity via the variance of inter-neighbor feature similarities, which captures how a node organizes its local graph environment, and operates independently of conventional node-to-neighbor consistency frameworks. Extensive experiments under two standard GGAD evaluation paradigms show NeighborDiv achieves state-of-the-art performance, with relative gains of 10.25% in average AUC and 17.78% in average AP over the second-best baseline under Single-Domain Independent Training (SDIT), and 6.89%/9.58% in AUC/AP under Unified Multi-Domain Training (UMDT), respectively. Notably, NeighborDiv yields zero performance volatility across all datasets, eliminating training-set dependency and establishing a lightweight and highly practical GGAD framework.

[LG-50] CIG: Exploration via Conditional Information Gain

链接: https://arxiv.org/abs/2605.20878
作者: Tim Joseph,Marcus Fechner,Philipp Stegmaier,Karam Daaboul,J. Marius Zöllner
类目: Machine Learning (cs.LG)
*备注: 28 pages, 10 figures, 3 tables

点击查看摘要

Abstract:Intrinsic rewards for exploration in reinforcement learning condition on different contexts: lifelong rewards score each transition against accumulated experience but ignore within-rollout redundancy; episodic rewards penalize intra-trajectory repetition but discard lifetime progress. Hybrid methods combine both signals through heuristic weights or require Gaussian-process dynamics that do not scale beyond low-dimensional state spaces. Trajectory-level information gain decomposes into per-step terms that condition on the replay buffer and rollout prefix simultaneously, but remains intractable for deep models. We derive the Conditional Information Gain (CIG) reward as a tractable surrogate: a log-determinant objective over an ensemble disagreement kernel whose Cholesky factorization yields causal per-step rewards that retain both conditioning sets while scaling to high-dimensional state spaces. We instantiate CIG in a model-based setting, where rollouts are short and within-rollout corrections remain largely unexplored. Across twelve tasks spanning discrete (MiniGrid) and continuous control (OGBench), in both clean and stochastic-distractor settings, CIG outperforms or matches prior exploration methods while remaining robust to stochastic distractors.

[LG-51] LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averag ing

链接: https://arxiv.org/abs/2605.20866
作者: Yassine Maziane,Ammar Mahran,Artavazd Maranjyan,Peter Richtárik
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Communication is a major bottleneck in distributed learning, especially in large-scale settings and in federated learning environments with slow links. Three standard ways to reduce this cost are communication compression, local training, and communication-computation overlap. Methods that combine these ingredients are used in practice and have been found to be effective for large-scale training, but there is little theory for methods that combine all three. We study a heterogeneous-compute setting in which different workers may take different numbers of local steps, and we propose LOSCAR-SGD, a Local SGD method that communicates only a sparse subset of model coordinates and continues optimizing while communication is in flight. A key ingredient is a delay-corrected merge rule that incorporates delayed synchronized information without discarding the progress made during the overlap phase. We give convergence guarantees for smooth non-convex objectives and show how sparsity, overlap, and worker heterogeneity affect the rate. To the best of our knowledge, this is the first theory for this combination of ingredients. Experiments further show that communication-computation overlap reduces training time and that the delay-corrected merge outperforms naive overwriting.

[LG-52] PlexRL: Cluster-Level Orchestration of Serviceized LLM Execution for RLVR

链接: https://arxiv.org/abs/2605.20863
作者: Yiqi Zhang,Fangzheng Jiao,Tian Tang,Boyu Tian,Hangyu Wang,Qiaoling Chen,Guoteng Wang,Zhen Jiang,Peng Sun,Ping Zhang,Xiaohe Hu,Ziming Liu,Menghao Zhang,Yanmin Jia,Yang You,Siyuan Feng
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) has recently unlocked strong reasoning capabilities in large language models (LLMs), triggering rapid exploration of new algorithms and data. However, RLVR training is notoriously inefficient: long-tailed rollouts, tool-induced stalls, and asymmetric resource requirements between rollout and training introduce substantial idle time that cannot be eliminated by job-local optimizations such as synchronous pipelining, asynchronous rollout, or colocated execution. We argue that this inefficiency is structural. While idle gaps are unavoidable within individual RLVR jobs, they are largely anti-correlated across jobs and therefore exploitable at the cluster level. Leveraging this observation, we present PlexRL, a cluster-level runtime for multiplexing unified LLM services across RLVR jobs. By centrally managing model placement, state transitions, and function-level scheduling under strict affinity constraints, PlexRL time-slices LLM execution across jobs to fill otherwise idle periods without expensive model migration. Our implementation and evaluations demonstrate that PlexRL significantly improves effective cluster capacity and reduces user GPU hour cost by maximum 37.58% while preserving algorithmic flexibility and introducing minimal per-job overhead. Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG) Cite as: arXiv:2605.20863 [cs.DC] (or arXiv:2605.20863v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2605.20863 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-53] Finite-Time Regret Analysis of Retry-Aware Bandits

链接: https://arxiv.org/abs/2605.20854
作者: Bingkui Tong,Junpei Komiyama,Soichiro Nishimori,Paavo Parmas
类目: Machine Learning (cs.LG)
*备注: 38 pages

点击查看摘要

Abstract:We study a stochastic bandit algorithm motivated by retry-aware objectives that value the best outcome among multiple attempts, such as pass@ k and max@ k . Given a posterior over arm values, ReMax chooses a sampling distribution that maximizes the posterior expected maximum reward over M virtual draws. Although this objective was introduced in reinforcement learning as an exploration mechanism under uncertainty, its regret properties in bandit problems have remained unclear. For Gaussian rewards and the first nontrivial case M=2 , we characterize the optimal ReMax distribution through an expected-improvement balance condition and prove the first sublinear regret bound for ReMax. Our analysis separates the usual saturation behavior of suboptimal arms from a ReMax-specific underestimation effect, in which the optimal arm may be sampled too rarely after an unfavorable estimate. This explains why ReMax can be more exploitative than Thompson sampling (TS) and why its regret analysis is technically delicate. Experiments support this picture: ReMax often outperforms KL-UCB and Thompson sampling under mild underestimation, while posterior-variance scaling empirically mitigates severe underestimation.

[LG-54] Markovian Circuit Tracing for Transformer State Dynamic

链接: https://arxiv.org/abs/2605.20824
作者: Abdullah X
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many sequence computations are easier to study as movement through internal states than as isolated local circuits. We introduce Markovian Circuit Tracing (MCT), a diagnostic pipeline for testing whether transformer activations contain coarse state-transition structure. The benchmark uses synthetic Hidden Markov Model (HMM) tasks where latent states, transition matrices, Bayesian belief vectors, Bayes-optimal predictions, and forced-state counterfactual targets are known exactly. Across six HMM families and three seeds per family, tiny causal transformers learn near-Bayes next-token predictors, with mean excess loss over Bayes of 0.0138. Residual activations contain partial Bayesian belief information in this controlled synthetic benchmark. State abstractions extracted from these activations recover coarse transition signal, strongest in persistent and lower-state regimes, and weaker in ambiguous-emission and six-state regimes. The clearest result comes from state forcing. Patching a recovered-state centroid reduces KL to the exact HMM counterfactual target from 0.1957 in the unpatched model to 0.0532 on average, beating wrong-state, mean-activation, random-activation, and shuffled-label controls. The contribution is a controlled benchmark and evaluation framework for transformer state-dynamics interpretability, with MCT as a simple reference pipeline

[LG-55] Instant GPU Efficiency Visibility at Fleet Scale

链接: https://arxiv.org/abs/2605.20799
作者: Connor Pedersen,Dong H. Ahn,Michel Migdal,Collin Neale,Nik Konyuchenko
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 12 pages, 7 figures, 3 tables

点击查看摘要

Abstract:We present Overall FLOP Utilization (OFU), a hardware-level, precision-agnostic GPU efficiency metric for AI workloads on HPC systems, derived from two on-chip performance counters: Tensor Pipe Activity and SM clock frequency. OFU requires no application instrumentation and works across GPU generations and numeric precisions. We characterize five properties of the OFU approximation – tile quantization, floating-point precision scaling, clock sampling noise, Tensor Core clock domains, and non-tensor undercounting – through controlled GEMM experiments on H100 and GB200 across FP16, TF32, FP8, and NVFP4. After tile-quantization correction, OFU predicts application-level MFU to within =2 percentage points. Against 608 production training jobs, OFU achieves r = 0.78 correlation with application-level MFU and surfaces two framework-level FLOPs miscalculations. Deployed across large-scale GPU fleets, OFU has detected a 2.5x efficiency regression and tracked precision-dependent utilization changes in mixed-precision pretraining. Our evaluation and operational experience suggest OFU is a practical, deployment-ready complement to application-level MFU for continuous fleet-wide efficiency monitoring.

[LG-56] Beyond Numerical Features: CNN-Driven Algorithm Selection via Contour Plots for Continuous Black-Box Optimization

链接: https://arxiv.org/abs/2605.20797
作者: Yiliang Yuan,Xiang Shi,Mustafa Misir
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The present paper introduces a new representation-driven approach to per-instance algorithm selection, applied to black-box optimization, for automatically choosing the most promising solver from a fixed portfolio. Prior work in continuous optimization largely relies on numerical descriptors, including Exploratory Landscape Analysis features and learned embeddings such as Deep-ELA. This work studies a complementary representation: contour-map visualizations of probed landscapes. A CNN regressor takes multiple instance-specific contour views (stacked or encoded per view and aggregated) and predicts per-solver performance, enabling selection by the predicted best value. On the standard BBOB 2009 single-objective protocol, the resulting selectors significantly outperform the single best solver (SBS) and are competitive with feature-based baselines. A subsequent bi-objective evaluation under the DeepELA setting further indicates that the same image-based principle can be competitive when using windowed contour views. Overall, the results suggest that simple vision models can exploit spatial structure in probed landscapes for algorithm selection without handcrafted ELA features.

[LG-57] Causal Machine Learning Is Not a Panacea: A Roadmap for Observational Causal Inference in Health

链接: https://arxiv.org/abs/2605.20782
作者: Donna Tjandra(1),Trenton Chang(1),Sonali Parbhoo(2),Rajesh Ranganath(3 and 4),Andre Kurepa Waschka(5),William Mitchell(6),Maggie Makar(1),Shalmali Joshi(7),Finale Doshi-Velez(8),Leo Anthony Celi(9, 10, and 11),Jenna Wiens(1) ((1) Division of Computer Science and Engineering, University of Michigan, Ann Arbor, Michigan, United States, (2) Department of Electrical and Electronic Engineering, Imperial College London, London, UK, (3) Courant Institute of Mathematical Sciences, New York University, New York, New York, United States, (4) Center for Data Science, New York University, New York, New York, United States, (5) Department of Mathematics amp; Statistics, Elon University, Elon, North Carolina, United States, (6) Department of Ophthalmology, Cambridge University Hospitals, Cambridge, UK, (7) Department of Biomedical Informatics, Columbia University, New York, New York, United States, (8) School of Engineering and Applied Science, Harvard University, Cambridge, Massachusetts, United States, (9) Laboratory for Computational Physiology, Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States, (10) Department of Medicine, Beth Israel Deaconess Medical Center, Boston, Massachusetts, United States, (11) Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, United States)
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Objective: The growing availability of large-scale observational clinical datasets and challenges in conducting randomized controlled trials have spurred enthusiasm in using causal machine learning (ML) for causal inference in observational data. We present a roadmap for applying causal ML to observational data. Materials and methods: We outline the importance of assessing validity assumptions within available data and applying causal ML responsibly for clinical experts using causal ML and ML practitioners with limited clinical expertise. Observations: Despite advances in causal ML, its limitations remain largely under-appreciated across disciplines. This gap in shared knowledge may impact the validity of findings. Discussion: Causal assumptions must be satisfied and modeling choices justified. Otherwise, these approaches risk producing biased or misleading results, with consequences for clinical research and patient care. Conclusion: Causal ML can be a powerful tool for generating causal hypotheses. We provide a template to strengthen the rigor and interpretability of causal analyses.

[LG-58] Cumulative Meta-Learning from Active Learning Queries for Robustness to Spurious Correlations

链接: https://arxiv.org/abs/2605.20771
作者: Kin Whye Chew,Jingxian Wang
类目: Machine Learning (cs.LG)
*备注: Under review. 26 pages, 7 figures

点击查看摘要

Abstract:Spurious correlations in real-world datasets cause machine learning models to rely on irrelevant patterns, undermining reliability, generalization, and fairness. Active learning offers a promising way to address this failure mode by querying informative samples that distinguish core features from spurious ones. However, standard active-learning methods simply append queried examples to the labeled set, effectively updating only the likelihood term. In deep learning regimes, the influence of these informative samples can be diluted by the larger labeled set and memorized by overparameterized models. We propose Cumulative Active Meta-Learning (CAML), an active-learning framework that uses queried examples to meta-learn the prior, or inductive bias, governing how the model adapts. CAML casts each active-learning round as a meta-learning task: the current labeled set serves as meta-train data for adaptation, while the newly queried batch serves as meta-test data for evaluating generalization. Unlike conventional meta-learning, which treats tasks as independent and identically distributed, CAML exploits the sequential dependence between active-learning rounds by maintaining a cumulative inductive bias that is progressively refined. Theoretically, we show that this cumulative formulation introduces interaction terms that couple earlier meta-learned inductive biases with later query-induced objectives, capturing dependencies absent from standard meta-learning. Empirically, CAML improves minority-group accuracy across spurious-correlation benchmarks and acquisition strategies, with gains of up to 27.8% on Dominoes, 29.9% on Waterbirds, 14.3% on SpuCo, and 24.0% on CivilComments.

[LG-59] ShapeBench: A Scalable Benchmark and Diagnostic Suite for Standardized Evaluation in Aerodynamic Shape Optimization

链接: https://arxiv.org/abs/2605.20763
作者: Shaghayegh Fazliani,Krissh Chawla,Jack Guo,Yiren Shen,Matthias Ihme,Madeleine Udell
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Rapid progress in aerodynamic shape optimization (ASO) has outpaced currently-available standardized evaluation frameworks. Fair comparison requires a unified benchmark spanning diverse shape classes, objective formulations, and matched-budget state-of-the-art baselines. We introduce ShapeBench, an open-source ASO benchmark with a unified API spanning 103 tasks across eight shape categories and multiple optimization regimes. Each ShapeBench task includes a validated surrogate for fast search; when feasible, a high-fidelity Computational Fluid Dynamics (CFD) pipeline for final verification is available, enabling systematic fidelity-gap analysis. ShapeBench provides a reproducible protocol with well-configured baselines to compare fairly using a consistent budget metric, allowing for comparison among both classical and LLM-driven methods, including general-purpose optimizers and a new domain-specialized evolutionary LLM baseline, ShapeEvolve. Results on ShapeBench demonstrate substantial variance in optimizer rankings across shape categories and problem formulations, with mean pairwise Spearman \rho = 0.013 , so single-task conclusions do not reliably generalize across problem classes. The benchmark is also far from saturation; classical methods are rarely applicable across all shape categories and tasks, further highlighting the need for more general-purpose approaches.

[LG-60] Memory-Efficient Partitioned DNN Inference on Resource-Constrained Android Crowds ICML2026

链接: https://arxiv.org/abs/2605.20723
作者: Lakshani Manamperi,Disumi Pathirana,Thiwanka Pathirana,Nipun Premarathna,Kutila Gunasekera
类目: Machine Learning (cs.LG)
*备注: 6 pages, 3 figures, 4 tables. Accepted at the ICML 2026 Workshop on Machine Learning for the Global South

点击查看摘要

Abstract:Deploying large deep neural networks on memory-constrained mobile devices is a central challenge in edge ML. While compression, pruning, and quantization reduce per-parameter cost, transformer-based models remain too large for the 3.3-7.4 GB RAM envelope of commodity Android handsets. We present the DNN pipeline scheduling subsystem of CROWDio, which achieves practical ONNX inference across resource-constrained Android workers without model modification, by distributing memory pressure across devices via five mechanisms: JIT deferred partition loading, a single-partition-resident constraint, a 4-tier affinity scheduler, a zlib-compressed tensor transport, and a streaming 1:1 dependency model. Evaluated on DistilBERT (Sanh et al., 2019) (approximately 67 M parameters, SST-2) across five Android handsets over ten runs, our system holds peak per-device RSS to 43±2 MB and limits battery draw to 50±3 mAh per run, while streaming concurrency cuts batch latency 34% below barrier synchronisation.

[LG-61] Robust Recommendation from Noisy Implicit Feedback: A GMM-Weighted Bayes-label Transition Matrix Framework

链接: https://arxiv.org/abs/2605.20721
作者: Zongyu Li,Xuanyu Liu,Gongce Cao,Shirui Sun,Yaqi Fang,Yongshuai Yu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Learning from implicit feedback in recommender systems is fundamentally challenged by pervasive label noise. While conventional denoising approaches often discard noisy instances to ensure robustness, this strategy inevitably suffers from low data utilization. Alternative methods that employ a Bayes-label transition matrix (BLTM) can leverage all available data, but their estimates tend to be biased in practical recommendation scenarios. To address these limitations, this paper proposes a Robust GMM-weighted Bayes-label Transition Matrix framework (RGBT). Our solution utilizes a Gaussian Mixture Model (GMM) to derive instance-specific reliability scores, which systematically calibrate the BLTM estimation to mitigate bias. Theoretical analysis confirms that our approach, by leveraging the BLTM framework with GMM calibration, simultaneously ensures full sample utilization, delivers consistent estimation, and critically, achieves a significant reduction in estimation variance. Extensive experiments on multiple real-world and synthetically flipped datasets demonstrate that RGBT not only utilizes noisy samples more effectively than mainstream reliable sample-based denoising methods, but also achieves significantly superior calibration capability of the transition matrix compared to state-of-the-art transition matrix-based denoising approaches.

[LG-62] Decision-Path Patterns as Tree Reliability Signals: Path-based Adaptive Weighting for Random Forest Classification

链接: https://arxiv.org/abs/2605.20716
作者: Youngjoon Park
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 16 pages, 1 figure. Code and data: this https URL

点击查看摘要

Abstract:Random forests aggregate tree votes by simple majority, treating all trees as equally informative. We observe that the topological pattern along each tree’s root-to-leaf decision path – where and how often the dominant class label flips along it – carries a signal of tree reliability that is exploitable for per-sample reweighting. The naive use of this signal is structurally confounded with the predicted class, so we propose a class-conditional ratio weighting that guarantees zero expected class bias by construction. On 30 binary classification benchmarks under a shared-forest, shared-split protocol with 30 repeats, the proposed method is the only one among four compared schemes – RF, weighted RF, KNORA-Eliminate, KNORA-Union – to yield a statistically significant accuracy improvement over RF (Wilcoxon p = 0.018), while the three alternatives all fail to do so (p 0.5). It is also the only scheme without majority-recall regressions, with minority-recall regressions limited to 3/30 datasets – a one-sided loss to which classical dynamic ensemble selection methods are susceptible. The gain is robust across forest sizes from 100 to 1000 trees.

[LG-63] Distributed Direct Preference Optimization

链接: https://arxiv.org/abs/2605.20696
作者: Zhanhong Jiang
类目: Machine Learning (cs.LG)
*备注: 29 pages, 12 figures

点击查看摘要

Abstract:Preference-based reinforcement learning (RL) is a key paradigm for aligning policies with human judgments, yet its theoretical behavior in distributed settings where preference data are fragmented across heterogeneous users remains poorly understood. Direct Preference Optimization (DPO) avoids explicit reward modeling but lacks convergence guarantees under federated and decentralized training, where communication constraints and non-IID preferences fundamentally alter optimization dynamics. We provide the first convergence and time-complexity analysis of DPO in distributed environments. Modeling personalized offline RL with user-specific preference distributions, we characterize the induced global optimization landscape. For federated DPO, we derive convergence rates that quantify the impact of client drift, communication frequency, and preference heterogeneity; for decentralized DPO, we establish convergence over general communication graphs and show how spectral connectivity governs optimization speed and consensus. Empirically, we corroborate our theoretical insights on standard alignment benchmarks, demonstrating that our proposed methods not only enjoy strong theoretical guarantees but also deliver robust and scalable performance in practice. The code base is available here.

[LG-64] Modular Multimodal Classification Without Fine-Tuning: A Simple Compositional Approach

链接: https://arxiv.org/abs/2605.20674
作者: Herman Bergström,Aditya Mehrotra,Rahul G. Krishnan
类目: Machine Learning (cs.LG)
*备注: 30 pages, 17 figures

点击查看摘要

Abstract:We introduce CoMET, \textit\textbfComposing \textbfModality \textbfEncoders with \textbfTabular foundation models, a simple yet highly competitive method for multimodal classification: pass each modality through a frozen pre-trained backbone, compress the resulting embeddings with PCA, and concatenate as input into a Tabular Foundation Model (TFM) for prediction. We show that PCA alone suffices to act as an adaptor yielding strong, robust performance across modalities. When the \textttCLS tokens of the foundation model align poorly with downstream tasks, we propose \textbfPALPooling, a lightweight adaptive token pooler that consistently improves representation quality. By composing strong frozen representation learning backbones with TFMs, our approach achieves state-of-the-art results across diverse multimodal benchmarks without any training. On hierarchical tasks with large fine-grained class spaces, our approach enables fast and scalable classification, handling datasets with over 500,000 samples and 2,000 classes without any fine-tuning. Overall, our results show that the composition of foundation models is a simple, yet powerful, out-of-the-box solution for multimodal learning, challenging the necessity of complex, end-to-end training pipelines for new problems.

[LG-65] LT2: Linear-Time Looped Transformers

链接: https://arxiv.org/abs/2605.20670
作者: Chunyuan Deng,Yizhe Zhang,Rui-Jie Zhu,Yuanyuan Xu,Jiarui Liu,T. S. Eugene Ng,Hanjie Chen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Looped Transformers (LT) have emerged as a powerful architecture by iterating their layers multiple times before decoding the final token. However, pairing them with full attention retains quadratic complexity, making them computationally expensive and slow. We introduce LT2 (Linear-Time Looped Transformers), a family of looped architectures that replace quadratic softmax attention with subquadratic, linear-time attention. We study two variants: LT2-linear with linear attention and LT2-sparse with sparse attention. We find that looping uniquely synergizes with these variants: it enables iterative memory refinement in linear attention and progressively expands the effective receptive field in sparse attention. We formalize these benefits theoretically and demonstrate consistent empirical gains across controlled recall, state-tracking, and language modeling tasks. We then explore LT2-hybrid, which combines different attention variants in a looped setting. Two variants are especially promising: LT2-hybrid (GDN+DSA), which interleaves linear and sparse attention to maximize efficiency and matches the standard looped transformer’s quality at fully linear-time cost; and LT2-hybrid (Full+GDN), which interleaves GDN with a small fraction of full attention layers to maximize quality, surpassing the standard looped transformer in both performance and efficiency. We also show how to convert a pre-trained LT into an LT2-hybrid model. With about 1B tokens of training, our converted model, Ouro-hybrid-1.4B, outperforms industry-level 1B models and is competitive with industry-level 4B models while retaining the speed benefits of linear-time attention. Together, these results show a clear path toward making looped transformers more scalable and advancing efficient, capable small language models.

[LG-66] Same Target Different Basins: Hard vs. Soft Labels for Annotator Distributions ICML2026

链接: https://arxiv.org/abs/2605.20642
作者: Mirerfan Gheibi,Gashin Ghazizadeh
类目: Machine Learning (cs.LG)
*备注: 14 pages, 12 figures. Accepted to the 2nd Workshop on Epistemic Intelligence in Machine Learning (EIML @ ICML 2026)

点击查看摘要

Abstract:When annotators disagree, that disagreement can reflect epistemic uncertainty rather than simple label noise. We study hard-label delivery as an alternative to the usual choices of collapsing votes to a single label or training directly on the empirical soft-label distribution. We focus on two primary hard-label methods: multipass, which cycles through observed votes while keeping the dataset size fixed, and stochastic label sampling (SLS), which samples one label per example at the start of each epoch. On CIFAR-10H, we find that when only a small number of annotations per example is available, hard-label delivery improves over soft-label training, with larger improvements where the sparse empirical target is farther from the full annotator distribution. When full annotator distributions are available, both hard-label methods match soft-label training. We use deterministic control as an ablation of multipass and shuffled SLS as a control that breaks the example-to-distribution match. We also show that SLS and soft-label cross-entropy optimize the same expected objective. Hard-label delivery also converges to flatter basins, with supporting descriptive evidence from OOD detection on SVHN and CIFAR-100. Overall, these results suggest that multipass is a strong practical default when raw vote counts are available, while SLS offers a lightweight alternative that remains competitive when only a few votes per example are available and matches soft-label training when full annotator distributions are available.

[LG-67] he General Theory of Localization Methods

链接: https://arxiv.org/abs/2605.20635
作者: Congwei Song
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 74 + 7 pages, ~30 figures, 6 tables

点击查看摘要

Abstract:This paper proposes a general machine learning framework called the localization method, which is fundamentally built on two core concepts: localization kernels and local means – key components that underpin the self-attention mechanism. To establish a rigorous theoretical foundation, the framework is formally defined through two essential pillars: the formulation of the local(-ized) model and the localization trick. We systematically investigate the connections between the localization method and a wide range of existing machine learning models/methods, including (but not limited to) kernel methods, lazy learning, the MeanShift algorithm, relaxation labeling, Hopfield networks, local linear embedding (LLE), fuzzy inference, and denoising autoencoders (DAEs). By dissecting these relationships, we clarify the broader theoretical significance of the localization method and demonstrate its practical applicability across diverse machine learning tasks. Furthermore, we explore advanced extensions of the framework, such as adaptive kernels, hierarchical local models, and non-local models. Notably, we show that the Transformer – a cornerstone of modern sequence modeling – can be constructed using hierarchical local models, revealing the ability of the localization method to unify and generalize state-of-the-art architectures. This work not only provides a unified theoretical lens to reinterpret existing models but also offers new methodological tools for designing flexible, data-adaptive learning systems.

[LG-68] Dynamic Shapley Computation

链接: https://arxiv.org/abs/2605.20620
作者: Xuan Yang,Hsi-Wen Chen,Ming-Syan Chen,Jian Pei
类目: Machine Learning (cs.LG); Databases (cs.DB); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:Shapley-based data valuation provides a principled way to quantify the contribution of training data, but its high computational cost makes it impractical in dynamic settings where tasks and training players evolve. Existing methods treat Shapley computation as a one-shot process and collapse contributions into aggregated scores, preventing reuse and requiring recomputation under any change. We introduce a new perspective that represents Shapley values as a player-by-task matrix and formulates dynamic valuation as a structured matrix maintenance problem. We exploit the fact that each task depends on a small subset of training players and that similar tasks yield similar valuations, leading to utility locality and coalition locality. Based on these insights, we propose D-Shap, a dynamic valuation framework that enables efficient updates by modifying only a small portion of the matrix: new task valuations are inferred via structure-aware interpolation, while updates induced by new players are confined to affected local matrix blocks. To eliminate the need for pre-specified evaluation tasks, we introduce self-valuation, which constructs the initial matrix directly from training data, supported by scalable subset reuse and coverage-aware anchor selection. Experiments across diverse models show that D-Shap performs task updates in milliseconds and reduces the cost of player updates by up to three orders of magnitude, while achieving valuation quality competitive with full recomputation.

[LG-69] SURF: Steering the Scalarization Weight to Uniformly Traverse the Pareto Front

链接: https://arxiv.org/abs/2605.20619
作者: Liuyuan Jiang,Chentong Huang,Lisha Chen
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Scalarization is widely used in multi-objective optimization owing to its simplicity and scalability. In many applications, the goal is to generate solutions that represent diverse user preferences, ideally with uniform coverage of the Pareto front (PF). However, uniformly sampling scalarization weights usually induces non-uniform coverage of the PF. We explain this mismatch through a geometric analysis of the scalarization path. As the scalarization weight varies, the corresponding solutions trace the PF with a generally non-uniform traversal speed. This speed induces an arc-length cumulative distribution function (CDF); inverting this CDF map yields a principled rule for selecting weights that produce uniform PF coverage. Building on this insight, we propose SURF (Sampling Uniformly along the PaReto Front). For structured problems, including bi-objective bandits, we derive closed-form expressions for this CDF map and the resulting PF-aware weight sampling rule. For general problems, SURF alternates between CDF reconstruction and weight sampling. Theoretically, we show that under provable conditions, SURF converges linearly to an unavoidable finite-sampling floor. Empirically, experiments on bandits, multi-objective-gymnasium, and multi-objective LLM alignment demonstrate that SURF efficiently achieves more uniform PF coverage than baselines.

[LG-70] Matryoshka Concept Bottleneck Models

链接: https://arxiv.org/abs/2605.20612
作者: Ziye Chen,Hongbin Lin,Xinyue Xu,Jie Li,Lijie Hu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Concept Bottleneck Models (CBMs) have emerged as a prominent paradigm for interpretable deep learning, learning by grounding predictions in human-understandable concepts. However, their practical deployment is hindered by the high cost of test-time intervention, as correcting model errors typically requires human experts to manually inspect and verify a large set of predicted concepts. Existing approaches suffer from a fundamental structural limitation: they either adopt a single static concept set, forcing experts to exhaustively annotate concepts and incurring prohibitive intervention costs, or train multiple models tailored to different concept budgets, resulting in substantial computational and maintenance overhead. To address this challenge, we propose the Matryoshka Concept Bottleneck Model (MCBM), a unified architecture that enables adaptive concept utilization within a single model. Inspired by Matryoshka Representation Learning, MCBM organizes concepts into a nested hierarchy based on maximum relevance and minimum redundancy, allowing inference at multiple levels of conceptual granularity without retraining. Theoretically, we show that MCBM reduces the expected intervention costs from linear to logarithmic order, O(\log K) , while guaranteeing monotonic performance improvement. Empirically, extensive experiments demonstrate that MCBM matches the performance of independently trained models while enabling dynamic and efficient expert interaction.

[LG-71] Compositional Transduction with Latent Analogies for Offline Goal-Conditioned Reinforcement Learning ICML2026

链接: https://arxiv.org/abs/2605.20609
作者: Junseok Kim,Dohyeong Kim,Mineui Hong,Songhwai Oh
类目: Machine Learning (cs.LG)
*备注: ICML 2026

点击查看摘要

Abstract:Compositional generalization is essential for reaching unseen goals under novel contextual variations in offline goal-conditioned reinforcement learning (GCRL), where a generalist goal-reaching agent must be learned from limited data. Most prior approaches pursue this via trajectory stitching over temporally contiguous segments, which limits composing behaviors across varying contexts. To overcome this limitation, we formalize analogy transduction as synthesizing new plans by composing task-endogenous analogies with given contexts and propose a novel analogy representation tailored for it. Grounded in our theory, this analogy representation captures what changes under optimal task execution, remains invariant to contextual variations, and is sufficient for optimal goal reaching. We further contend that generalization to unseen analogy-context pairs is a practical obstacle in analogy transduction, and introduce a new approach for offline GCRL that enables analogy transduction beyond seen pairs to unseen combinations. We empirically demonstrate the effectiveness of our approach on OGBench manipulation environments, substantially outperforming prior methods that do not perform analogy transduction. Project page: this https URL

[LG-72] Unsupervised clustering and classification of upper limb EMG signals during functional movements: a data-driven

链接: https://arxiv.org/abs/2605.20599
作者: L. F. Salazar Álvarez,D. Escobar-Saltarén,M. B. Salazar Sánchez,S. C. Henao-Aguirre
类目: Machine Learning (cs.LG)
*备注: 19 Congreso Colombiano de Computación (19CCC)

点击查看摘要

Abstract:This study presents a comprehensive approach for the clustering and classification of upper-limb surface electromyography (sEMG) signals during functional reach and grasp movements. The methodology was applied to the NINAPRO DB4 dataset, which provides multichannel EMG recordings of 52 gestures. A four-stage pipeline was designed, including signal preprocessing, fea-ture extraction, gesture selection via hierarchical clustering, and comparative model evaluation. Preprocessing involved a fourth-order low-pass filter (0.6 Hz) and Hilbert envelope transformation, effectively reducing noise and enhancing signal clarity. Feature extraction yielded 26 temporal and frequency-domain met-rics, which were later refined using visual analysis, mutual information, principal component analysis, and decision tree importance scores. A final subset of five key features was selected for classification tasks. Gesture selection was per-formed through hierarchical clustering using Mahalanobis distance, resulting in six representative movements that balanced biomechanical diversity and compu-tational efficiency. A 200 ms window was identified as optimal for temporal seg-mentation based on stability and physiological plausibility. Classifier models were evaluated in two stages. Automated comparison using PyCaret identified Extra Trees (ET) and Artificial Neural Networks (ANN) as top performers. Sub-sequent independent training confirmed their stability and generalization capac-ity, with ANN showing progressive learning and ET maintaining robust, con-sistent results. The findings support the implementation of adaptive, low-latency control strategies for myoelectric prostheses and provide a scalable pipeline for future real-time applications.

[LG-73] ReversedQ: Opportunities for Faster Q-Learning in Episodic Online Reinforcement Learning AAMAS2026

链接: https://arxiv.org/abs/2605.20592
作者: Sofia R. Miskala-Dinc,Aviva Prins
类目: Machine Learning (cs.LG)
*备注: This paper contains 5 pages and 2 figures. To be presented at the Adaptive and Learning Agents workshop (ALA 2026) at AAMAS 2026

点击查看摘要

Abstract:We study model-free Q-learning in finite-horizon episodic Markov Decision Processes (MDPs) with stationary dynamics across episodes. We identify a central issue in nascent model-free posterior-sampling works: the reliance on delayed learning in order to prove theoretical guarantees. In particular, we identify three opportunities for faster learning - (i) value-function update order, (ii) update frequencies, and (iii) value-function initialization. Using Wang et al.'s RandomizedQ as a basis, we illustrate these changes and their individual (as well as cumulative) impact in multiple empirical studies. We find that our combined modifications, termed ReversedQ, improve scaled mean cumulative reward compared to RandomizedQ, from 9.53% to 78.78% in the Bidirectional Diabolical Combination Lock (BDCL), and from 21.76% to 61.81% in a chain MDP.

[LG-74] riForces: Augmenting Atomistic GNNs for Transferable Representations ICML2026

链接: https://arxiv.org/abs/2605.20581
作者: Ali Ramlaoui,Alexandre Duval,Hannah Bull,Victor Schmidt,Hugues Talbot,Fragkiskos D. Malliaros,Joseph Musielewicz
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注: 28 pages, 11 figures. Accepted at ICML 2026

点击查看摘要

Abstract:Machine learning interatomic potentials (MLIPs) achieve excellent accuracy when trained on large Density Functional Theory (DFT) data. To be useful in practice, they must often be adapted to target chemistries using small and expensive task-specific datasets. However, MLIPs transfer inconsistently across domains, with representations that often loose accessible composition and structure information. To address this, we present TriForces, a model-agnostic three-stream framework that separates composition and structure information, combined with self-supervised learning to preserve transferable representations. TriForces improves performance on MatBench and QM9 over baselines without needing DFT labels and enables efficient similar structure retrieval through its learned latent space. On OMat24, in limited-data training regime, TriForces reduces energy MAE by 57% at 20K samples only and improves force MAE across sample sizes. We release pretrained TriForces variants across multiple MLIP architectures with code at this https URL.

[LG-75] Deep Learning Surrogates for Emulating Stochastic Climate Tipping Dynamics

链接: https://arxiv.org/abs/2605.20580
作者: Adeline Hillier,Jennifer Sleeman,Jay Brett,Caroline Tang,Jenelle Millison,Anand Gnanadesikan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This work explores a dynamics-informed Temporal Fusion Transformer (TFT) as a data-driven surrogate for computationally intensive Earth system simulations. Focusing on multivariate time series describing global ocean transport, we demonstrate the surrogate’s ability to forecast tip events across thousands of time steps. The data involve up to 21 non-stationary time series in addition to static covariates describing free parameters and initial conditions. Modifications to the architecture and objective function yield a surrogate that anticipates the timing of Atlantic and Pacific collapses to high fidelity and captures the stochastic uncertainty in transition timing across ensemble predictions. The learned surrogate achieves a 465x computational speedup over the numerical simulator while maintaining differentiability with respect to parameters and initial conditions.

[LG-76] OpenSeisML: Open Large-Scale Real Seismic and well-log Dataset for Generative AI

链接: https://arxiv.org/abs/2605.20539
作者: Ipsita Bhar,Huseyin Tuna Erdinc,Thales Souza,Charles Jones,Felix J. Herrmann
类目: Machine Learning (cs.LG)
*备注: 5 pages, 8 figures

点击查看摘要

Abstract:The advent of machine learning (ML) and computer vision has significantly accelerated seismic inversion workflows by reducing the computational cost of traditionally expensive iterative methods. However, the development and evaluation of ML methods remain limited by the scarcity of realistic velocity models, as most high-quality data are privately owned by oil and gas companies. To address this gap, we present OpenSeisML, a collection of real seismic datasets designed to support generative AI (Gen-AI) workflows for seismic inversion. The datasets are curated from publicly available surveys in the UK National Data Repository (NDR). When seismic volumes are in the time domain and wells are in depth, a time-to-depth conversion is required. We use checkshot data to establish the time-depth relationship and construct a velocity model through interpolation for accurate conversion of post-stack seismic data. Here, we present an automated data curation pipeline that enables seismic data preparation while ensuring reproducibility. The objective is to train a generative model that captures the statistical distribution of subsurface properties, enabling the synthesis of multiple statistically consistent realizations for uncertainty quantification which can act as a prior for seismic inversion.

[LG-77] Ada2MS: A Hybrid Optimization Algorithm Based on Exponential Mixing of Elementwise and Global Second-Moment Estimates

链接: https://arxiv.org/abs/2605.20533
作者: Meng Zhu,Quan Xiao,Weidong Min
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Optimization algorithms are core methods by which machine learning models iteratively minimize loss functions, update parameters, learn from data, and improve performance. Momentum SGD and AdamW represent two important optimization paradigms. AdamW produces stable updates and usually has strong robustness across training scenarios, but its generalization performance is sometimes weaker than that of momentum methods. Momentum SGD can often obtain better generalization after careful tuning, but it is more sensitive to gradient-scale variation and hyperparameter settings. To balance the strengths and weaknesses of the two paradigms, this paper proposes Ada2MS, an optimization algorithm that achieves a smooth transition between AdamW-like behavior and momentum-SGD-like behavior through continuous exponential interpolation between elementwise second-moment estimates and global second-moment estimates. On the visual tasks evaluated in this study, Ada2MS obtains competitive results under a unified optimizer-comparison protocol. The code will be released at this https URL

[LG-78] Pseudo-Formalization for Automatic Proof Verification

链接: https://arxiv.org/abs/2605.20531
作者: Slim Barkallah,Luke Bailey,Kaiyue Wen,Mohammed Abouzaid,Tengyu Ma
类目: Logic in Computer Science (cs.LO); Machine Learning (cs.LG)
*备注: 31 pages, code available at this https URL

点击查看摘要

Abstract:Reliable verification of proofs remains a bottleneck for training and evaluating AI systems on hard mathematical reasoning. Fully formal proofs, in languages like Lean, are easy to verify because they are unambiguous and modular. Most proofs, particularly those written by AI systems, have neither property, and translating them into formal languages remains challenging in many frontier math settings. We propose Pseudo-Formalization (PF), a proof format that captures the modularity and precision of formal proofs while retaining the flexibility of natural language. A Pseudo-Formal proof is decomposed into self-contained modules, each stating its premises, conclusion, and proof in natural language. To verify the correctness of a regular natural language proof, an LLM translates it to Pseudo-Formal and then verifies each module independently, an algorithm we call Block Verification (BV). We evaluate PF+BV on two benchmarks spanning olympiad and research-level mathematics, where it pareto-dominates LLM-as-judge baselines on error-finding precision and recall. To support future work, we release our research-level proof verification benchmark ArxivMathGradingBench.

[LG-79] An exponential mechanism based on quadratic approximations for fine-tuning machine learning models with privacy guarantees

链接: https://arxiv.org/abs/2605.20521
作者: Hoang Tran,Jorge Ramirez,Jiayi Wang,Alberto Bocchinfuso,Christopher Stanley,M. Paul Laiu
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Fine-tuning adapts a pretrained machine learning model to a small, sensitive dataset, but this process risks memorizing individual new data points, making the model vulnerable to adversaries who seek to extract sensitive information. In this work, we develop a randomized algorithm based on the exponential mechanism for fine-tuning while ensuring differential privacy. Our key idea is to construct a simple utility function that combines a local quadratic approximation of the pretrained model with information from the new dataset. The resulting exponential mechanism admits exact sampling from a multivariate normal distribution in closed form. We establish theoretical privacy guarantees, sensitivity bounds, and accuracy estimations for our method. We further introduce a random-projection strategy that makes the approach scalable to high-dimensional models. Numerical experiments on the MNIST benchmark and the MIMIC clinical dataset demonstrate competitive performance against existing differentially private fine-tuning techniques.

[LG-80] Online Conformal Prediction with Corrupted Feedback

链接: https://arxiv.org/abs/2605.20515
作者: Bowen Wang,Matteo Zecchin,Osvaldo Simeone
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Modern artificial intelligence systems require calibrated uncertainty estimates that remain reliable in sequential and non-stationary environments. Online conformal prediction (OCP) addresses this challenge through adaptively updated prediction sets that provide deterministic long-run miscoverage guarantees. These guarantees, however, hinge on the assumption of perfect feedback about the coverage of past prediction sets. In practice, the observed miscoverage indicator may be corrupted by noise, communication failures, or adversarial manipulation, which can severely degrade OCP’s calibration guarantees. In this paper, we study OCP under corrupted feedback. We first model feedback corruption as an arbitrary binary flip sequence, and analyze how feedback corruption affects and degrades the miscoverage performance of standard OCP. We then propose two robust schemes: robust OCP via filtering, which leverages the structural properties of the predicted threshold to filter corrupted feedback, and robust OCP via active compensation, which incorporates an active compensation mechanism to mitigate the effect of corrupted feedback. For both methods, we establish explicit miscoverage guarantees, which are further specialized for an independent stochastic flip model and for an arbitrary error model with memory bounds. Experiments on real-world datasets validate the proposed approach, showing markedly improved calibration and significantly smaller prediction sets compared with baseline OCP methods under corrupted feedback.

[LG-81] Fast Reconstruction of Exact Maxwell Dynamics from Sparse Data

链接: https://arxiv.org/abs/2605.20514
作者: Dan DeGenaro,Xin Li,Obed Amo,Michael Pokojovy,Sarah Adel Bargal,Markus Lange-Hegermann,Bogdan Raiţă
类目: Machine Learning (cs.LG)
*备注: 31 pages, 8 figures

点击查看摘要

Abstract:We introduce FLASH-MAX, a shallow, exact-by-construction neural network architecture for predicting homogeneous electromagnetic fields from sparse pointwise observations. Each hidden neuron represents a separate exact solution to Maxwell’s equations, so that the network satisfies the governing equations symbolically by construction and can be trained end-to-end from sparse data within seconds. We prove a universal approximation result showing that this exact model class remains universal on arbitrary domains. FLASH-MAX reaches sub-1% relative validation error from about 1K sparse pointwise observations in seconds, all while maintaining a zero PDE residual, and keeps single-digit errors even for only 100 observations sampled from 3D space. These results suggest that moving governing structure from the loss into the hypothesis class can dramatically improve the trade-off between precision and optimization speed in scientific machine learning.

[LG-82] A 10000-Year Global Stochastic Tropical Cyclone Catalog with Wind-Dependent Track Transitions (WHITS)

链接: https://arxiv.org/abs/2605.20494
作者: Jennifer Nakamura,Upmanu Lall
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Reliable assessment of tropical cyclone (TC) risk is limited by the brevity and spatial sparsity of the historical record, particularly for the rare, high-intensity landfalls that dominate insured loss. We present WHITS (Wind-focused Hurricane Interactive Track Simulator), a non-parametric semi-Markov track generator that extends the HITS framework of Nakamura et al. (2015) in three ways: transitions between historical track segments are conditioned on local wind speed in addition to position, age, and forward vector; the kernel selection on the comparative-vector term is sharpened to suppress dynamically inconsistent jumps; and a short smoothing window is applied across each transition to remove the position and wind discontinuities reported by downstream surge users. WHITS is fit to the full available best-track record in each of six basins in IBTrACS, extending in the North Atlantic to 1851 and in other basins to the earliest year of reliable best-track data. The resulting 10,000-yr global synthetic catalog reproduces observed track density and the annual hurricane/typhoon-force wind-hit probability across all basins. The catalog is intended for catastrophe-risk applications where a large, low-bias sample of physically plausible tracks is more useful than a small, statistically corrected one.

[LG-83] ZEBRA: Zero-shot Budgeted Resource Allocation for LLM Orchestration

链接: https://arxiv.org/abs/2605.20485
作者: May Hamri,Inbal Talgam-Cohen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As autonomous agents increasingly execute end-to-end tasks under fixed monetary budgets, the pressing open question shifts from whether the budget is respected, to how to spend it effectively. Existing budget-aware methods typically control reasoning step-by-step within a single agent, or learn resource allocation policies via RL. None address how to split a budget across the composing phases of a multi-agent pipeline at inference time. We propose ZEBRA, a zero-shot framework that reduces multi-phase budget allocation to a continuous nonlinear knapsack problem: an LLM controller estimates per-phase utility curves, and a water-filling search on the Lagrange multiplier returns the per-phase split. Additive and multiplicative aggregations are unified under the same solver. On a 150 -task APPS coding benchmark, both ZEBRA variants outperform LLM-direct (budget allocation directly by an LLM) on every aggregate metric. At a budget of \alpha = 0.5 of the unconstrained spend, ZEBRA recovers 94.4% of unconstrained quality, versus 88.1% for LLM-direct. The advantage is statistically significant and transfers beyond coding: on a 3 -phase HotpotQA pipeline, ZEBRA beats LLM-direct by 14.3 pp, with allocations empirically robust to curve-estimation noise. On HotpotQA, ZEBRA arrives at a different budget split (near-balanced) compared to the APPS one (skewed towards a refinement phase), showing adaptation to the pipeline structure. More broadly, we show that lightweight algorithmic guidance at inference time can improve the economic behavior of autonomous multi-agent systems.

[LG-84] Quadratic Characterizations for Reachability Analysis of Neural Networks

链接: https://arxiv.org/abs/2605.20482
作者: Elias Khalife,Mazen Farhood,Pierre-Loic Garoche
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Quadratic constraints (QCs) are widely used to characterize nonlinearities and uncertainties, but generic analytical characterizations can be conservative on bounded domains. This paper develops a framework for constructing verified quadratic characterizations of scalar relations in the two-dimensional real plane. Candidate quadratic inequalities are locally generated by solving convex quadratic programs using samples from the relation and exterior sample points. They are then verified globally using sum-of-squares certificates over an exact semialgebraic description or, in the case of nonpolynomial relations, over relaxed polynomial descriptions. The resulting verified constraints define a sound overapproximation of the scalar relations over the considered domains. These constraints are directly compatible with existing analysis frameworks based on QCs and pointwise integral quadratic constraints (IQCs) for static nonlinearities and uncertainties, and they can also be embedded in QC-based semidefinite programs for reachability and safety analysis of feedforward neural networks. For smooth activations such as \tanh , the method yields domain-dependent quadratic characterizations that constitute an alternative to generic sector- or slope-based descriptions. For ReLU networks, we give methods to reduce conservatism in QC-based reachability analysis of feedforward networks by exploiting dependencies between neurons and tighter local bounds. Numerical examples demonstrate improved reachability results for smooth activations, reduced conservatism for ReLU networks, and applicability beyond neural networks through an example involving saturation.

[LG-85] CASCADE Conformal Prediction: Uncertainty-Adaptive Prediction Intervals for Two-Stage Clinical Decision Support ICML2026

链接: https://arxiv.org/abs/2605.20468
作者: Ricardo Diaz-Rincon,Muxuan Liang,Adolfo Ramirez-Zamora,Benjamin Shickel
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: Accepted to ICML 2026 AgenticUQ Workshop. 14 Pages, 3 Figures

点击查看摘要

Abstract:Effective medication management in Parkinson’s Disease (PD) is challenging due to heterogeneous disease progression, variable patient response, and medication side effects. While AI models can forecast levodopa equivalent daily dose (LEDD) as a measure of medication needs, standard uncertainty quantification often fails to communicate the reliability of these predictions, treating high and low confidence clinical decisions identically. We introduce CASCADE (Calibrated Adaptive Scaling via Conformal And Distributional Estimation), a novel conformal prediction framework that propagates epistemic uncertainty from a screening classifier to adapt downstream predictions. Unlike standard conformal methods that rely on auxiliary residual regression, we leverage epistemic uncertainty from a primary classification task (identifying whether a medication change is needed) to dynamically scale the prediction intervals of a secondary regression task (predicting how much change). By mapping Venn-Abers multi-probabilistic uncertainty directly to non-conformity scores, our framework achieves continuous risk adaptation. We demonstrate that this ``cascade effect’’ produces highly efficient intervals for confident patients (38.9% narrower than standard conformal baselines) while automatically expanding intervals to ensure robust coverage for uncertain cases, bridging the gap between discrete clinical decision-making and continuous dose forecasting in PD.

[LG-86] SMA-DP: Spectral Memory-Aware Differential Privacy for Deep Learning

链接: https://arxiv.org/abs/2605.20450
作者: Mohammad Partohaghighi,Roummel Marcia
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Differentially private stochastic gradient descent (DP-SGD) enables private deep learning through per-example clipping and calibrated Gaussian noise, but its high-variance updates can reduce utility on challenging datasets. We propose \textbfSMA-DP-SGD, a \textbfSpectral Memory-Aware Differentially Private Stochastic Gradient Descent method that augments DP-SGD with a fractional memory branch built only from previously privatized noisy releases. WeightWatcher-inspired power-law spectral exponents provide group-wise reliability signals, instantiated layer-wise in our experiments, to adapt the decay and effective memory depth. Private-history alignment, norm matching, and warm-up activation stabilize the memory contribution. Privacy remains transparent: conditioned on the private release history, the memory branch is fixed, and the only newly data-dependent term is the current clipped sum scaled by a fixed coefficient (\beta). Hence, SMA-DP-SGD preserves a clean conditional sensitivity structure and exactly recovers group-wise DP-SGD when (\beta=1). Experiments on CIFAR-100, CIFAR-10, and MNIST show competitive or superior accuracy over several DP optimization baselines, with the largest gains on CIFAR-100 and CIFAR-10. CIFAR-10 ablations show that (\beta) controls the privacy–utility trajectory, while spectral and memory diagnostics confirm a controlled short-to-moderate effective memory depth and a small memory-branch ratio. Runtime analysis shows that the mechanism incurs additional overhead, about (2.94\times) DP-SGD in our CIFAR-10 implementation, revealing a practical trade-off between adaptive private memory and computational cost.

[LG-87] Miller-Index-Based Latent Crystallographic Fracture Plane Reasoning with Vision-Language Models

链接: https://arxiv.org/abs/2605.20416
作者: Qinwu Xu,Yifan Jiang
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:We study whether multimodal large language models (MLLMs) can leverage crystallographic plane indices (Miller indices) as a structured latent representation for reasoning about fracture geometry. We formulate Miller indices z = (h,k,l) as a latent variable governing idealized planar fracture and evaluate two complementary capabilities: (i) latent inference, where the model maps visual observations to plane hypotheses under physically valid conditions, and (ii) latent applicability assessment, where the model determines whether such a representation is meaningful for a given fracture image. Through extensive experiments spanning synthetic data, controlled 2D–3D geometric pairs, and real-world fracture images across multiple material classes – including ceramics, glass, metals, and concrete – we show that MLLMs can reliably perform latent inference in idealized settings and, critically, can reject the latent representation when the underlying physics does not support it. These results suggest that MLLMs can act as physics-aware reasoning systems conditioned on structured latent priors, provided that the domain of validity is explicitly modeled. Subjects: Machine Learning (cs.LG); Computational Physics (physics.comp-ph) Cite as: arXiv:2605.20416 [cs.LG] (or arXiv:2605.20416v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.20416 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-88] Supervised Latent Restructuring for Small-Data Quantum Learning in Plant Phenomics

链接: https://arxiv.org/abs/2605.20413
作者: Alakananda Mitra,David H. Fleisher,Vangimalla Reddy,Chittaranjan Ray
类目: Machine Learning (cs.LG)
*备注: 11 pages, 4 Tables, 3 Figures

点击查看摘要

Abstract:High-dimensional biological data often exhibit a severe mismatch between feature dimensionality and sample size, making reliable classification difficult in extremely small-data regimes. In these settings, kernel methods can lose discriminative power when latent compression fails to preserve class-separating structure. We study this problem in fine-grained plant phenomics and propose a hybrid workflow that compresses 1280-dimensional deep image embeddings into a 64-dimensional PCA space and then restructures them into an 11-dimensional supervised latent space using Linear Discriminant Analysis (LDA), followed by GPU-accelerated Quantum Kernel Alignment (QKA) on NVIDIA L40S hardware. Empirically, supervised latent restructuring substantially improves the geometric separability of the compressed representation, increasing the Silhouette coefficient from 0.003 in the raw embedding space and -0.006 in PCA-64 to 0.197 in the supervised LDA-11 space. However, downstream classical evaluation reveals a clear compression trade-off: Linear SVM and XGBoost improve in the restructured latent space, whereas RBF-SVM and Random Forest degrade under the same 11-dimensional bottleneck. Under a constrained optimization budget, QKA in this regime remains challenging, indicating that latent geometry alone is not sufficient for strong trainable quantum performance. These findings position representation geometry as a central design variable in small-data quantum learning and expose the practical difficulty of recovering nonlinear discriminative structure from aggressively compressed biological representations.

[LG-89] Spectral Souping: A Unified Framework for Online Preference Alignment

链接: https://arxiv.org/abs/2605.20408
作者: Yinlam Chow,Guy Tennenholtz,Ted Yun,James Harrison,Arthur Gretton,Andre Barreto,Bo Dai
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement Learning from Human Feedback (RLHF) effectively aligns Large Language Models (LLMs) with aggregate human preferences but often fails to address the diverse and conflicting needs of individual users. To overcome this issue, we introduce Spectral Souping, a unified framework for efficient, online preference alignment. Our contribution is the discovery of a universal spectral representation within LLMs, which is proven to be highly amenable to model merging. This theoretical insight enables a two-phase methodology: we first learn a basis of specialized policies offline, each focused on a distinct, fine-grained preference dimension. An online adaptation algorithm then efficiently ``soups’’ these policies at inference time, either by merging their outputs or parameters, enabling rapid model adaptation without the need for costly online retraining w.r.t. tailored preference rewards. Experiments on online preference alignment benchmarks demonstrate that our method achieves significant performance improvements over existing state-of-the-art approaches, presenting a scalable and computationally efficient solution for dynamically adapting LLMs to individual user preferences.

[LG-90] Score-Based Causal Discovery of Latent Variable Causal Models ICML2024

链接: https://arxiv.org/abs/2605.20396
作者: Ignavier Ng,Xinshuai Dong,Haoyue Dai,Biwei Huang,Peter Spirtes,Kun Zhang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: ICML 2024

点击查看摘要

Abstract:Identifying latent variables and the causal structure involving them is essential across various scientific fields. While many existing works fall under the category of constraint-based methods (with e.g. conditional independence or rank deficiency tests), they may face empirical challenges such as testing-order dependency, error propagation, and choosing an appropriate significance level. These issues can potentially be mitigated by properly designed score-based methods, such as Greedy Equivalence Search (GES) (Chickering, 2002) in the specific setting without latent variables. Yet, formulating score-based methods with latent variables is highly challenging. In this work, we develop score-based methods that are capable of identifying causal structures containing causally-related latent variables with identifiability guarantees. Specifically, we show that a properly formulated scoring function can achieve score equivalence and consistency for structure learning of latent variable causal models. We further provide a characterization of the degrees of freedom for the marginal over the observed variables under multiple structural assumptions considered in the literature, and accordingly develop both exact and continuous score-based methods. This offers a unified view of several existing constraint-based methods with different structural assumptions. Experimental results validate the effectiveness of the proposed methods.

[LG-91] Latent Geometry as a Structural Monitor: Eigenspace Alignment for Anomaly Detection in Anonymity Networks

链接: https://arxiv.org/abs/2605.20391
作者: Vaibhav Chhabra
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 14 pages, 5 figures, 1 table

点击查看摘要

Abstract:Traditional anomaly detection marks events when measured signals cross predefined thresholds. This captures the moment of transition but not the structural pressure that precedes it. We propose treating large behavioral populations as geometric energy landscapes whose deformation can be measured before and during major transitions. The central thesis is that structure precedes geometry: the structural organization of the population is the signal, and geometric metrics are instruments for measuring it. Applied to the Tor anonymity network across 67 consecutive daily observation windows, the dual-observer pipeline identifies a stable nine-dimensional load-bearing subspace invariant across the observation period and validates this structure by Monte Carlo simulation at 16.8 sigma above the noise floor. Primary detection gates achieve 0.0% false positive rate on 24 confirmed stable windows. Forensic analysis of the February 20, 2026 confirmed infrastructure event formally falsifies the relay-departure hypothesis, identifying connectivity degradation without topology change as a detectable network failure mode. The result is a candidate structural-monitoring framework for behavioral populations with sufficient telemetry.

[LG-92] Symmetrization of Loss Functions for Robust Training of Neural Networks in the Presence of Noisy Labels

链接: https://arxiv.org/abs/2605.20347
作者: Alexandre Lemire Paquin,Brahim Chaib-Draa,Philippe Giguère
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 28 pages, 1 figure, 4 tables

点击查看摘要

Abstract:Labeling a training set is often expensive and susceptible to errors, making the design of robust loss functions for label noise an important problem. The symmetry condition provides theoretical guarantees for robustness to such noise. In this work, we study a symmetrization method arising from the unique decomposition of any multi-class loss function into a symmetric component and a class-insensitive term. In particular, symmetrizing the cross-entropy loss leads to a linear multi-class extension of the unhinged loss. Unlike in the binary case, the multi-class version must have specific coefficients in order to satisfy the symmetry condition. Under suitable assumptions, we show that this multi-class unhinged loss is the unique convex multi-class symmetric loss. We also show that it has a fundamental local role: the linear approximation of any symmetric loss around score vectors with equal components is equivalent to the multi-class unhinged loss. We then introduce SGCE and alpha-MAE, two loss functions that interpolate between the multi-class unhinged loss and the Mean Absolute Error while allowing control of the beta-smoothness of the loss. Experiments on standard noisy-label benchmarks show competitive performance compared with existing robust loss functions.

[LG-93] WaveGraphNet: Physics-Consistent Guided-Wave Damage Localization through Coupled Inverse-Forward Graph Learning

链接: https://arxiv.org/abs/2605.20311
作者: Vinay Sharma,Aditya Bharade,Olga Fink
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Guided-wave structural health monitoring enables damage localization in composite plates using sparse networks of bonded piezoelectric transducers. However, inferring the spatial location of defects from pitch-catch measurements remains weakly constrained when only a limited set of damage locations is available for training. As a result, models trained to predict defect locations may perform well on seen cases but generalize poorly to unseen regions of the structure. This paper proposes WaveGraphNet, a coupled inverse–forward graph learning framework for guided-wave damage localization in Carbon Fiber Reinforced Polymer (CFRP) plates. The sensing layout is explicitly modeled as a graph, where transducers are represented as nodes and measured propagation paths define the graph connectivity. An inverse branch maps graph-structured spectral descriptors of differential guided-wave responses to a damage location, while a forward branch predicts the path-wise energy-deviation patterns of measured wave responses associated with a candidate location. During training, the forward branch serves as a physics-consistent regularizer, discouraging location estimates that are numerically plausible but inconsistent with the measured redistribution of wave-response energy. This coupling encourages agreement between inferred damage coordinates and the underlying wave propagation behavior. Within this benchmark, the proposed graph-based formulation provides a strong localization model for sparse guided-wave sensing and demonstrates improved robustness in extrapolation to held-out regions compared to both non-graph and graph baselines. These results highlight the potential of coupled inverse-forward graph learning as an effective strategy for guided-wave localization under limited spatial coverage. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2605.20311 [cs.LG] (or arXiv:2605.20311v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.20311 Focus to learn more arXiv-issued DOI via DataCite

[LG-94] AirfoilGen: A valid-by-construction and performance-aware latent diffusion model for airfoil generation

链接: https://arxiv.org/abs/2605.20303
作者: Zhijie Yang,Min Tang,Qiang Zou
类目: Machine Learning (cs.LG)
*备注: 15 pages

点击查看摘要

Abstract:Airfoil shape design is a fundamental task in aerospace engineering, with a direct impact on flight stability and fuel consumption. Deep learning has recently emerged as a promising tool for this task, but existing deep generative approaches remain limited in both geometric validity and physical controllability. They offer little control over the generated shapes, yielding invalid geometries, and they typically do not condition effectively on aerodynamic performance. To address these issues, this paper proposes AirfoilGen, a valid-by-construction and performance-aware latent diffusion model for airfoil. It first introduces a novel airfoil representation scheme, the circle sweeping representation, to constrain the generative process so that output shapes respect essential airfoil characteristics. It then enables explicit control over aerodynamic performance (e.g., lift and drag coefficients) by operating in a learned latent space: a transformer model encodes airfoil shapes into vector embeddings, and a conditional diffusion model denoises Gaussian noise into these latent embeddings while incorporating target aerodynamic performance. In addition, this paper presents a new dataset of over 200,000 airfoils, which is substantially larger than the widely used UIUC airfoil dataset (1,650 airfoils) and more suitable for training modern deep generative models. Experiments demonstrate that AirfoilGen enables airfoil generation with far greater geometric validity and aerodynamic performance controllability than previously achievable, with an average performance-conditioning accuracy of 98.41%.

[LG-95] reeText-CTS: Compact Source-Traceable Tree-Path Evidence for Irregular Clinical Time-Series Prediction

链接: https://arxiv.org/abs/2605.20292
作者: Kwanhyung Lee,Juhwan Choi,Jongheon Kim,Joohyung Lee,Hyeongwon Jang,Eunho Yang
类目: Machine Learning (cs.LG)
*备注: 27 pages, 4 figures

点击查看摘要

Abstract:Numerical time-series models can effectively process irregular electronic health record (EHR) trajectories, but they do not naturally expose the measurements and temporal patterns supporting each risk estimate as readable evidence. Existing text-based interfaces improve readability, but typically rely on either raw serialization, which is lengthy and redundant, or patient-level free-form summaries, which are difficult to trace to source measurements and time windows. To bridge this gap, we introduce TreeText-CTS (Clinical Time-Series), which converts irregular EHR trajectories into human-readable, compact, source-traceable tree-path evidence units without patient-level summarization or inference-time autoregressive decoding. TreeText-CTS routes multi-scale window summaries through frozen XGBoost models and verbalizes activated tree paths as deterministic, source-traceable evidence units composed of threshold conditions. An evidence selector assembles an informative subset of these units, which a language-model encoder then integrates for prediction. Across PhysioNet 2012 mortality, MIMIC-III mortality, and PhysioNet 2019 sepsis-onset forecasting, TreeText-CTS achieves the best AUROC and AUPRC among evaluated text-based EHR time-series interfaces, improving AUPRC by 6.0 to 9.7 absolute percentage points over the strongest prior text-based interface while remaining competitive with numerical time-series models. Ablations show that tree-path evidence construction, evidence selection, and language-model composition each contribute to performance. Because every span passed to the language-model encoder is constructed from activated tree-path threshold conditions, TreeText-CTS makes the evidence supplied to the final predictor inspectable and source-traceable.

[LG-96] Weasel: Out-of-Domain Generalization for Web Agents via Importance-Diversity Data Selection ICML2026

链接: https://arxiv.org/abs/2605.20291
作者: Fatemeh Pesaran zadeh,Seyeon Choi,Xing Han Lù,Siva Reddy,Gunhee Kim
类目: Machine Learning (cs.LG)
*备注: ICML 2026. Code is released at this https URL

点击查看摘要

Abstract:Large language models (LLMs) have enabled web agents that follow natural language goals through multi-step browser interactions. However, agents fine-tuned on specific trajectories and domain often struggle to generalize out of domain, and offline training can be compute-inefficient due to noisy, redundant trajectories and long accessibility-tree (AXTree) states. To address both issues, we propose Weasel, a trajectory selection method for offline training of web agents. Weasel selects a fixed-budget subset of trajectory steps by optimizing an objective that balances unary importance with pairwise diversity over states, websites, and interaction patterns, solving efficiently with a greedy algorithm. We further improve efficiency with target-centered AXTree pruning that keeps only content around the ground-truth action target, and we mitigate style mismatch for reasoning-native models by replacing expert traces with model-generated, style-consistent rationales. Across AgentTrek and NNetNav training datasets, evaluations in WebArena, WorkArena, and MiniWob, and experiments with Qwen2.5-7B, Gemma3-4B, and Qwen3-8B, Weasel improves out-of-domain performance while reducing training cost, producing roughly 9.7-12.5 \times training speedups over standard fine-tuning. We make the code available at this https URL.

[LG-97] Adaptive Probe-based Steering for Robust LLM Jailbreaking ICML2026

链接: https://arxiv.org/abs/2605.20286
作者: Junxi Chen,Junhao Dong,Xiaohua Xie
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 19 pages, 13 figures, accepted by ICML 2026

点击查看摘要

Abstract:Recent work has demonstrated the potential of contrastive steering for jailbreaking Large Language Models (LLMs). However, existing methods rely on limited and inherently biased contrastive prompts and require laborious manual tuning of steering strength, limiting their robustness and effectiveness. In this paper, we leverage the idea of model extraction to guide the learned steering vectors to approximate the ideal one and propose tuning the steering strength adaptively based on contrastive activations’ statistics. Experiments demonstrate that our method notably improves the effectiveness and robustness of probe-based steering, without any extra contrastive prompts or laborious manual tuning. Being an attack paper, this paper focuses on revealing the breakdown of fortified LLMs, raising the average harmfulness score from 6% to 70%. Our code is available at this https URL.

[LG-98] OmniISR: A Unified Framework for Centralized and Federated Learning via Intermediate Supervision and Regularization

链接: https://arxiv.org/abs/2605.20276
作者: Wei-Bin Kou,Guangxu Zhu,Ming Tang,Chen Zhang,Lisheng Wu,Lei Zhou,Yujiu Yang
类目: Machine Learning (cs.LG)
*备注: 18 pages

点击查看摘要

Abstract:The global deployment of edge intelligence operates across heterogeneous legal frameworks. While some regions permit centralized learning (CL) via cloud data aggregation, others enforce strict data localization, necessitating federated learning (FL). This operational dichotomy introduces two incompatible optimization regimes (i.e., unbiased global gradients yet coupled with internal covariate shift in CL versus biased, drift-prone local updates in FL), resulting in that any naive integration of the two lacks rigorous theoretical guarantees. To fill this gap, we propose OmniISR, a unified framework that fuses pure CL, pure FL, and hybrid CL-FL training modes via equipping intermediate supervision and regularization (ISR) signals at multiple hidden layers. Specifically, we propose (i) to use mutual-information (MI) as intermediate supervision to align shifting internal covariate in CL and client-drifting representations in FL, and (ii) to adopt negative-entropy (NE) as intermediate regularizer to penalize overconfident prediction, preserve representational uncertainty, and avoid device-specific collapse. On the theory side, we derive (i) a unified, ISR-agnostic, and non-asymptotic O(1/sqrt(T)) convergence bound that shows the introduced ISR does not violate standard SGD convergence, (ii) a federated drift-bound that quantifies the ISR-reduced client drift, (iii) a gradient-alignment guarantee that ensures non-conflicting CL and FL updates under mild bias, and (iv) an explicit escape-time bound that indicates that CL-FL hybrid mixing enlarges effective stochasticity and accelerates escape from strict saddles. Extensive experiments demonstrate that OmniISR consistently improves model performance in both centralized and federated paradigms, reduces the CL-FL gap by 22.60%, and yields 37/48 paired metric wins across multiple FL algorithms.

[LG-99] Physics-informed convolutional neural networks for fluid flow through porous media

链接: https://arxiv.org/abs/2605.20250
作者: Rafał Topolnicki,Paweł Dłotko,Maciej Matyka
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Fluid Dynamics (physics.flu-dyn)
*备注: 14 pages, supplement, dedicated github repo

点击查看摘要

Abstract:Accurate simulation of fluid flow in porous media is challenging due to complex pore-space geometries and the computational cost of solving the Navier-Stokes equations. This difficulty is particularly important when repeated simulations are required, as standard numerical solvers may converge slowly in intricate porous domains. We present a neural-network-based framework for predicting pore-scale velocity fields directly from sample geometry. The method uses a convolutional encoder-decoder architecture with skip connections to preserve spatial detail while extracting multi-scale features. Physical consistency is encouraged through a custom loss function combining velocity reconstruction with incompressibility, no-flow conditions inside solids, periodicity constraints, and agreement with the global tortuosity index. We analyze the influence of the corresponding loss weights and quantify the contribution of individual loss components to prediction accuracy. Several CNN backbones are evaluated to identify architectures providing accurate and robust predictions. The generalization ability of the trained model is tested on samples outside the training distribution, including changes in obstacle geometry, boundary conditions, porosity, and realistic porous structures. Finally, we demonstrate a practical use of the predicted velocity fields as initial conditions for Lattice-Boltzmann simulations. This warm-start strategy accelerates solver convergence, reducing the number of iterations in over 90% of tested cases.

[LG-100] Graph Transductive Sharpening: Leverag ing Unlabeled Predictions in Node Classification

链接: https://arxiv.org/abs/2605.20248
作者: Brown Zaz,Mar Gonzàlez I Català,Ferran Hernandez Caralt,Moshe Eliasof,Pietro Liò
类目: Machine Learning (cs.LG)
*备注: 19 pages, 4 figures, 17 tables

点击查看摘要

Abstract:In the transductive setting, where the full graph is observed but node labels are only partially available, progress in semi-supervised node classification has largely focused on architectural innovation. In this paper, we revisit an orthogonal axis: the training objective. We start from a simple observation: transductive models produce predictions for every node during training, including nodes without labels. These unlabeled-node predictions may contain useful training signal, but standard supervised objectives discard them because no ground-truth labels are available. Inspired by the decomposition of cross-entropy into a label-dependent alignment term and a label-independent entropy term, we propose prediction confidence as a natural way to extract this signal in the absence of labels. This motivates Transductive Sharpening (TS): a loss-level modification that minimizes prediction entropy on unlabeled nodes while counterbalancing this effect on labeled nodes. We evaluate Transductive Sharpening across a wide range of node-classification benchmarks and observe consistent performance improvements without requiring any changes to the backbone architecture. Code is available at this https URL.

[LG-101] Prism: Structural Symmetry Scanning via Duality-Constrained Laplacian Projection

链接: https://arxiv.org/abs/2605.20245
作者: Jiatong Xie
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注: 10 pages, 4 tables, 1 figure. This work presents a first-principles unsupervised network structural diagnosis framework based on symmetric involution operator and Laplacian commutator constraint. It achieves noise-robust community detection and early structural risk detection in financial time-series networks without supervised training data

点击查看摘要

Abstract:We introduce \textbfPrism, a framework for structural symmetry diagnosis in complex networks. Given a graph Laplacian L and a duality operator P (a symmetric involution), Prism computes the \emphduality defect \delta(L,P) = |LP - PL|_F / |L|_F – a scalar measuring how far the network deviates from structural self-consistency. When P encodes the network’s true symmetry, \delta starts near zero and rises monotonically as structure degrades; an arbitrary P gives noise. We prove that the optimal L’ satisfying [L’, P] = 0 is given by a closed-form block-diagonal projection, and provide an unsupervised alternating optimization that learns P from the graph’s own Fiedler vector. Experiments on synthetic networks show the true- P defect is 3.38\times more sensitive to structural degradation than an index-reversal baseline and more sensitive than modularity. On Zachary’s Karate Club with edge noise, Prism achieves 94.5% community detection accuracy at 5% noise versus 76.6% for the raw Laplacian baseline. Applied to live S\P~500 data (2026-05-17), Prism detects rising structural stress (defect 0.43 \to 0.73 over 90 days) while surface correlations remain low – a signal invisible to correlation-based methods. In a historical backtest spanning five major stress events (2011–2020), the duality defect exhibits a consistent pattern: it reaches elevated levels \emphbefore the correlation spike that accompanies each crisis, and sustains high readings during periods of structural fragility that conventional metrics classify as calm. The duality defect is a first-principles structural admissibility condition, requiring no training data and computable in milliseconds.

[LG-102] MagBridge-Battery: A Synthetic Bridge Dataset for Li-ion Magnetometry and State-of-Health Diagnostics ALT

链接: https://arxiv.org/abs/2605.20240
作者: Sakthi Prabhu Gunasekar,Prasanna Kumar Rangarajan
类目: Machine Learning (cs.LG)
*备注: 10 pages, 3 figures, 4 tables. Synthetic dataset and benchmark suite for battery magnetometry and state-of-health diagnostics; dataset released on Zenodo and code available on GitHub

点击查看摘要

Abstract:Battery health diagnostics today rely overwhelmingly on electrochemical signals measured at the cell terminals. A parallel literature has shown that magnetic sensing can resolve information that terminal-only measurements miss, but method development is limited by the absence, to the best of our knowledge, of public battery magnetic-measurement datasets paired with degradation labels. We release MagBridge-Battery v1.0, a synthetic dataset of 6,760 magnetic-field signatures that bridges real magnetic morphology from the Mohammadi-Jerschow Open Science Framework (OSF) archive with state-of-health (SOH) labels from the PulseBat dataset. The release contains 5,600 PulseBat-conditioned grounded samples, 600 synthetic sensor-anomaly samples derived from clean parents, and 560 low-voltage Regime-B extrapolation samples. A cell-disjoint, parent-child-leakage-free primary benchmark split is verified to contain zero overlapping cells, zero cross-split parent-child pairs, and zero sample-ID overlap. We define three primary benchmark tasks: SOH regression, second-life classification, and anomaly detection, plus an auxiliary anomaly-subtype classification task. A controlled label-shuffle ablation collapses SOH regression from R^2 approximately 0.77 to approximately 0, confirming that the bridge encodes input SOH non-trivially rather than producing label-aligned artifacts. The dataset is released on Zenodo under CC-BY-4.0, and the bridge code and benchmark suite are released under Apache-2.0. This work provides a public benchmark for magnetic-sensing battery diagnostics while paired magnetic-electrochemical measurements remain scarce.

[LG-103] NaP-Control: Navigating Diffusion Prior for Versatile and Fast Character Control

链接: https://arxiv.org/abs/2605.20209
作者: Chia-Wen Chen,Yan Wu,Korrawe Karunratanakul,Siyu Tang
类目: Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Achieving precise, versatile whole-body character control in physics-based animation remains challenging. Recent diffusion-based policies generate rich and expressive motions but typically rely on gradient-based test-time guidance to satisfy task objectives, which is slow and can reduce robustness. We introduce NaP-Control (Navigating Diffusion Prior for Versatile and Fast Character Control), abbreviated as NaP. Our method uses reinforcement learning to manipulate the latent noise of a task-agnostic diffusion policy prior, steering it toward task-specific behaviors for fast, robust control with high motion fidelity. In contrast to methods that rely solely on offline training, NaP interacts with the environment during training to correct motions and optimize task rewards, improving success rates and enabling adaptation to challenging scenarios. By directly predicting task-optimized diffusion noise, NaP eliminates iterative guidance during denoising and enables efficient inference. Experiments show that NaP attains higher success rates and faster inference while preserving natural motion across diverse tasks.

[LG-104] Velocityformer: Broken-Symmetry-Matched Equivariant Graph Transformers for Cosmological Velocity Reconstruction

链接: https://arxiv.org/abs/2605.21483
作者: Tilman Tröster,David Mirkovic,Veronika Oehl,Arne Thomsen
类目: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Precise measurement of the kinematic Sunyaev-Zel’dovich (kSZ) effect - a probe of the large-scale distribution of baryonic matter, a key observable for cosmological inference - requires accurate reconstruction of galaxy velocities from spectroscopic surveys. The signal-to-noise ratio (SNR) of kSZ measurements scales directly with the correlation coefficient r between reconstructed and true velocities. We introduce Velocityformer, an equivariant graph transformer architecture designed to match the specific symmetry of the observational data. While the underlying physics is equivariant with respect to translations and rotations, observational effects break this symmetry due to the preferred line-of-sight direction. Matching the model’s inductive bias to the data’s broken symmetry consistently improves performance across all model sizes and training volumes, with Velocityformer improving r by 35% over the standard linear theory baseline and outperforming ML baselines at every data volume. By matching the model’s inductive bias to the data and conditioning on the physics-based long-wavelength solution, Velocityformer is highly data-efficient, training to high accuracy on as few as 4 low-fidelity simulations, and generalises zero-shot across input geometry, cosmological parameters, and galaxy sample. On high-fidelity simulated galaxy catalogues, this yields a 30% improvement in r over the physical baseline, directly translating to the same SNR gain on observational data.

[LG-105] Neural Negative Binomial Regression for Weekly Seismicity Forecasting: Per-Cell Dispersion Estimation and Tail Risk Assessment

链接: https://arxiv.org/abs/2605.21437
作者: Alim Igilik
类目: Geophysics (physics.geo-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 28 pages, 9 figures. Source code available at this https URL

点击查看摘要

Abstract:Standard approaches to forecasting the weekly number of earthquakes on a spatial grid rely on the Poisson distribution with a single global dispersion assumption. We show that this assumption is systematically violated in seismic data from Central Asia (2010-2024), where a likelihood-ratio test with boundary correction strongly rejects the Poisson hypothesis (p 10^-179). The main contribution of this work is the EarthquakeNet architecture, which provides an endogenous per-cell estimate of the overdispersion parameter alpha via a neural network (spatial embeddings + MLP), without explicit spatial covariance specification. In contrast to existing negative binomial regression approaches in seismological forecasting, which typically assume a single global alpha, the proposed per-cell formulation allows the model to identify spatial heterogeneity in seismic clustering and to construct probabilistic risk-aware alerts via quantiles of the predicted distribution. A walk-forward evaluation (2018-2023) over four systems shows an 8.6 percent reduction in mean pinball deviation (MPD) relative to a negative binomial GLM baseline. The strongest improvements are observed in the tail regime (Y = 5), where the continuous ranked probability score (CRPS) of the proposed model is 12.5 percent lower than that of the baseline, indicating improved calibration in extreme-event forecasting. Comments: 28 pages, 9 figures. Source code available at this https URL Subjects: Geophysics (physics.geo-ph); Machine Learning (cs.LG); Machine Learning (stat.ML) Cite as: arXiv:2605.21437 [physics.geo-ph] (or arXiv:2605.21437v1 [physics.geo-ph] for this version) https://doi.org/10.48550/arXiv.2605.21437 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-106] Memorisation convergence and generalisation in generative models

链接: https://arxiv.org/abs/2605.21402
作者: Antoine Maillard,Sebastian Goldt
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generative neural networks learn how to produce highly realistic images from a large, but finite number of examples - or do they simply memorise their training set? To settle this question, Kadkhodaie, Guth, Simoncelli and Mallat (ICLR '24) trained diffusion models independently on disjoint subsets of a dataset and showed that they converge to nearly the same density when the number of training images is large enough. This result raises two basic questions: how much data do you need for convergence, and what does convergence capture about learning the data distribution? Here, we address these questions by providing an exact analytical characterisation of the transition from memorisation to generalisation in linear generative models. We find that these models memorise at small load, while convergence emerges continuously when the number of samples is linear in the input dimension. Strikingly, we find that convergence is insensitive to recovery of the principal latent factors of the data, which are recovered in a sharp transition. After extending our approach to data with power-law spectra, we find the same distinction between convergence and latent recovery in our experiments with convolutional denoisers and in the data of Kadkhodaie et al. We thus show that generalisation in generative models decomposes into at least two distinct objectives: matching the bulk of the data distribution and recovering the principal latent factors. These objectives correspond to two different distances between true and learnt data distribution, and only the first one is captured by convergence.

[LG-107] Semiparametric Efficient Bilevel Gradient Estimation

链接: https://arxiv.org/abs/2605.21341
作者: Fares El Khoury,Houssam Zenati,Nathan Kallus,Michael Arbel,Aurélien Bibaut
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Functional bilevel methods estimate a lower-level function and plug it into a hypergradient, but this plug-in gradient can retain first-order bias when the lower-level problem is learned nonparametrically. To remove this bias, we develop a semiparametric debiasing theory for population bilevel gradients based on the efficient influence function. This perspective leads to a cross-fitted orthogonal hypergradient estimator for which we establish asymptotic normality together with uniform control over the outer parameter. Under quadratic losses, the estimator reduces to a simple doubly robust score based on conditional mean nuisances. On synthetic bilevel benchmarks with known ground truth, the method tracks the oracle efficient-gradient benchmark and improves over plug-in functional hypergradients and regularized kernel bilevel baselines.

[LG-108] Stimulus symmetries can confound representational similarity analyses

链接: https://arxiv.org/abs/2605.21324
作者: Farhad Pashakhanloo,Jacob A. Zavatone-Veth
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注: 40 pages

点击查看摘要

Abstract:What can representational similarity matrices (RSMs) tell us about a neural code? As the popularity of these summary statistics grows, so too does the need for a more complete characterization of their properties. Here, we show that symmetries in network inputs can confound RSM-based analyses. Stimulus symmetries render many representations functionally equivalent, but these different configurations can lead to different RSMs. These different RSMs reflect qualitatively different representational geometries. We show that stochastic gradient descent or energetic regularization can generate sparse, drifting codes, leading in turn to drifting RSMs. Moreover, we demonstrate that these phenomena are present in networks trained to encode image data, where the symmetry is latent. Our results illustrate the challenges inherent in comparing nonlinear neural codes, when functionally-equivalent representations are not related by a simple rotation.

[LG-109] heoretical guidelines for annealed Langevin dynamics in compositional simulation-based inference

链接: https://arxiv.org/abs/2605.21253
作者: Camille Touron,Gabriel V. Cardoso,Julyan Arbel,Pedro L. C. Rodrigues
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Compositional score-based approaches to simulation-based inference (SBI) approximate the posterior over a shared parameter given n independent observations by aggregating individually learned posterior scores: currently, there are two main propositions of such methods (Geffner et al. (2023), Linhart et al. (2026)). As the resulting composite score does not correspond to the score of any distribution along the forward diffusion path of the true multi-observation posterior, sampling from it via a reverse SDE leads to an irreducible bias. Annealed Langevin dynamics provides a principled alternative: it treats the composite score as the genuine score of a sequence of tractable bridging densities and samples from them in succession. When properly tuned, it could lead to a controllable bias. However, its hyperparameters, namely step sizes, the number of steps per level, and the number of annealing levels, have so far been chosen empirically. We derive Wasserstein bounds for annealed Langevin with approximate scores and translate them into explicit decision rules for these hyperparameters that guarantee a prescribed sampling accuracy, while highlighting different theoretical aspects of each composite score formulation. In the Gaussian setting, we obtain closed-form expressions for all relevant quantities and prove that the bridging densities of Linhart et al. (2026) consistently admit larger step sizes and require fewer total Langevin steps than those of Geffner et al. (2023). Furthermore, we show empirically that the tuning obtained in the Gaussian setting generalizes to more complex problems, thus providing a well-understood and theoretically grounded starting point for practitioners using compositional score-based approaches.

[LG-110] Federated LoRA Fine-Tuning for LLM s via Collaborative Alignment

链接: https://arxiv.org/abs/2605.21217
作者: Shuaida He,Liwen Chen,Long Feng
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Low-rank adaptation (LoRA) has emerged as a powerful tool for parameter-efficient fine-tuning of large language models (LLMs). This paper studies LoRA under a federated learning setting, enabling collaborative fine-tuning across clients while preserving parameter efficiency. We focus on a highly heterogeneous regime in which clients share only partial structure and a substantial subset may be contaminated. We propose Collaborative Low-rank Alignment and Identifiable Recovery (CLAIR), a contamination-aware framework that relies only on preliminary local estimators. Its formulation applies broadly, from linear regression to neural network and LLM modules, whenever local adaptation can be represented by matrix-valued updates. CLAIR recovers the shared LoRA subspace and detects contaminated clients via a structured low-rank plus block-sparse decomposition. We prove exact recovery of the shared LoRA subspace in the noiseless case, stable recovery under preliminary estimation error, and consistent collaborative-set recovery under mild separation conditions. We further quantify the gain from CLAIR refinement: it reduces off-subspace estimation error through cross-client averaging while preserving client-specific variation within the shared LoRA subspace, thus improves over local fine-tuning whenever this oracle gain outweighs the costs of subspace estimation and benign-client heterogeneity. Empirically, we demonstrate the benefits of CLAIR by fine-tuning a Transformer architecture on a text-copying task. The results show accurate contamination detection and improved benign-client performance compared with local fine-tuning and non-robust federated averaging.

[LG-111] A Rigorous Tractable Measure of Model Complexity

链接: https://arxiv.org/abs/2605.21167
作者: Oskar Allerbo,Thomas B. Schön
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:An accurate assessment of a model’s complexity is crucial for topics such as interpretation, generalization, and model selection. However, most existing complexity measures either rely on heuristic assumptions or are computationally prohibitive. In this paper, we present a mathematically rigorous yet easy-to-compute measure of model complexity that is based on the similarities between the model gradients across inputs. It is thus well-defined for any parametric model, but also for kernel-based non-parametric models. We prove that our measure of complexity generalizes model-specific complexity measures such as polynomial degree (for polynomial regression), kernel length scale (for Matérn kernels), number of neighbors (for k-nearest neighbors), number of splits (for decision trees), and number of trees (for random forests). We also use our measure to obtain new insights into the double descent phenomenon for random Fourier features, random forests, neural networks, and gradient boosting.

[LG-112] AIMBio-Mat: An AI-Native FAIR Platform for Closed-Loop Materials Discovery and Biomedical Translation

链接: https://arxiv.org/abs/2605.21083
作者: D.-M. Mei,K. Acharya,C. M. Adhikari,M. Adhikari,S. Aryal,B. V. Benson,K. Bhatta,S. Bhattarai,N. Budhathoki,A. M. Castillo,D. Chakraborty,S. Chhetri,S. Choudhury,T. A. Chowdhury,R. D. Cruz,B. Cui,S. Dhital,K.-M. Dong,R. Gapuz,A. Ghasemi,E. Z. Gnimpieba,B. D. S. Gurung,H. A. Hashim,R. I. Harry,K.-E. Hasin,M. K. Hassanzadeh,M. K. Jha,D. Kim,K.-C. Kong,B. Lama,A. Mahat,N. Maharjan,A. Majeed,J. Mammo,M. M. Masud,K. S. Moore,A. Nawaz,H. Oli,S. A. Panamaldeniya,L. Pandey,R. Pandey,Z. Peng,A. Prem,M. M. Rana,K. Rana Magar,R. Rizk,C. S. Tadi,L.-W. Wang,Y. Yang,G.-L. Yin,C.-X. Yu,D. Zeng,M. Zhou,Q. Zhou
类目: Applied Physics (physics.app-ph); Machine Learning (cs.LG); Biological Physics (physics.bio-ph); Medical Physics (physics.med-ph)
*备注: 35 pages, 4 figures, and 12 tables

点击查看摘要

Abstract:Materials discovery and biomedical translation increasingly require models that can reason across composition, processing, structure, biological response, manufacturability, safety, and governance constraints. Existing materials and biomedical data ecosystems are powerful but remain poorly coupled for AI-guided discovery. Here we present AIMBio, a conceptual framework for an AI-native, FAIR, and governance-aware decision layer that links materials provenance, biomedical context, knowledge graphs, uncertainty-aware machine learning, and human-in-the-loop active learning. The framework formulates biomedical-materials discovery as constrained multi-objective optimization under uncertainty and introduces practical requirements for metadata, model documentation, risk-tiered governance, evaluation metrics, and phased implementation. To make the roadmap testable, we add a minimum viable prototype specification and a worked pilot for AI-guided nanomaterials for drug delivery. AIMBio is positioned as exploratory and preclinical discovery infrastructure, not as clinical decision-support software; any clinical or regulated-device use would require separate validation, change control, and regulatory review. The central contribution is a publishable platform blueprint for converting fragmented materials and biomedical records into auditable, experimentally actionable, and translationally responsible discovery workflows.

[LG-113] Conditioning Gaussian Processes on Almost Anything

链接: https://arxiv.org/abs/2605.21041
作者: Henry Moss,Lachlan Astfalck,Thomas Cowperthwaite,Colin Doumont,Sam Willis,Philipp Hennig,Christopher Nemeth,Andrew Zammit-Mangion
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Gaussian processes (GPs) offer a principled probabilistic model over functions, but exact inference is restricted to the linear-Gaussian regime. We establish an explicit equivalence between GPs and a class of linear diffusion models, recasting predictive sampling as an ODE with closed-form Gaussian dynamics and a likelihood-dependent guidance term that admits a simple Monte Carlo approximation. In the linear-Gaussian setting, we recover standard GP conditioning exactly; beyond conjugacy, the same machinery handles any conditioning statement admitting point-wise likelihood evaluation – including non-linear physics, and, for the first time, natural language via large language models. Whitening isolates the irreducible non-Gaussian dynamics, minimising Wasserstein-2 transport cost and eliminating numerical stiffness. The result is a general-purpose GP inference scheme requiring no bespoke derivations. Together, these results provide a general mechanism for incorporating the full richness of real-world knowledge as conditioning information, opening a new frontier for the probabilistic modelling of real-world problems.

[LG-114] Concentration of General Stochastic Approximation Under Heavy-Tailed Markovian Noise

链接: https://arxiv.org/abs/2605.20999
作者: Shubhada Agrawal,Siva Theja Maguluri,Martin Zubeldia
类目: Probability (math.PR); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 67 pages

点击查看摘要

Abstract:We establish maximal concentration bounds for the iterates generated by stochastic approximation algorithms with general step sizes, where the noise has a finite-state Markovian component plus a Martingale-difference component. When the Martingale-difference noise is bounded, we show that the tail of the error can be sub-Gaussian, sub-Weibull, or something lighter than any Pareto but heavier than any Weibull, depending on the step size sequence and on whether the random operator is almost surely contractive, almost surely non-expansive, or expansive with positive probability. Our analysis relies on a novel Lyapunov function involving the moment-generating function of the solution to a Poisson equation, together with an auxiliary projected algorithm. We complement the upper bounds with worst-case examples showing that qualitatively sharper bounds are impossible. We further study the case of unbounded Martingale-difference noise when the average operator is contractive, and the step sizes are of order 1/k . In this setting, we show that if the random operator is almost surely non-expansive, then the error tail is at most three times heavier than the noise tail, whereas if the random operator is expansive with positive probability, then the error may have substantially heavier tails. These results are obtained through a novel black-box truncation argument that reduces the unbounded-noise setting to the bounded-noise case.

[LG-115] Everywhere Valid Bounds on False Discovery Proportions in Conformal Inference

链接: https://arxiv.org/abs/2605.20726
作者: Ziang Song,Ying Jin,Emmanuel J. Candès
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 31 pages, 12 figures. Code available at this https URL

点击查看摘要

Abstract:Modern applications of conformal inference to multiple testing problems, such as outlier detection and candidate selection, often involve selecting test samples whose conformal p-values fall below a threshold. The quality of such methods is often measured by the false discovery proportion (FDP), defined as the fraction of incorrect selections. Existing approaches typically control the expected value of the FDP, using methods such as the Benjamini-Hochberg procedure. This approach fails to provide high-probability bounds on the realized false discovery proportion and invalidates statistical guarantees if the rejection threshold is selected after inspecting the data. This paper establishes finite-sample, distribution-free upper bounds on the FDP that hold simultaneously over all possible rejection thresholds, enabling arbitrary post hoc selection of the threshold. Simultaneous validity is achieved by constructing a high-probability envelope for the empirical distribution function of null conformal p-values by sampling from their joint distribution. Furthermore, our framework allows practitioners to modulate the envelope’s shape, thereby producing tight bounds in rejection regions of primary interest. We use this flexible approach to derive simultaneous FDP upper bounds for both outlier detection and conformal selection. We demonstrate through synthetic and real-data experiments that the resulting bounds are both valid and substantially less conservative than those derived from existing approaches.

[LG-116] Motion-Robust Deep Reconstruction for Free-Breathing Cardiac Cine MRI

链接: https://arxiv.org/abs/2605.20687
作者: Mahmut Yurt,Kanghyun Ryu,Zhitao Li,Xucheng Zhu,Xianglun Mao,Martin Janich,Marcus Alley,Kawin Setsompop,John Pauly,Shreyas Vasanawala,Ali Syed
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Conventional cardiac cine MRI relies on breath-hold Cartesian acquisitions, which are vulnerable to motion artifacts and can be uncomfortable or infeasible, particularly for pediatric and other noncompliant patients who cannot reliably hold their breath. Free-breathing radial acquisitions can alleviate these limitations, but robust reconstruction at high acceleration remains challenging due to prominent streak artifacts. To address these limitations, we propose Cine-DL, a clinically oriented framework that couples targeted k-space preprocessing with fast, model-based deep reconstruction. In this pipeline, raw free-breathing radial data undergo retrospective cardiac binning and respiratory gating to resolve cardiac phases and discard motion-corrupted spokes. We then introduce Streak Optimized Coil Compression (SOC), which explicitly preserves cardiac signals while suppressing peripheral interference that typically drives the streak artifacts. The resulting 2D+t cine series is reconstructed with an unrolled network that alternates a ResNet proximal operator with physics-based data consistency updates solved via conjugate gradient. We further employ a memory-efficient training strategy that reduces peak memory usage. We evaluate Cine-DL on free-breathing volunteer data against established baselines (k-t SENSE and iGRASP) and demonstrate clinical translation via hospital deployment on newly acquired patient data. Our experiments show that Cine-DL consistently improves quantitative metrics and visual fidelity, supporting a practical route toward routine, time-sensitive clinical adoption of free-breathing cine MRI.

[LG-117] Scale-Calibrated Median-of-Means for Robust Distributed Principal Component Analysis

链接: https://arxiv.org/abs/2605.20681
作者: Kisung You
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Distributed principal component analysis (PCA) produces node-level estimates of both a mean vector and a principal subspace. Robustly aggregating these heterogeneous objects requires a relative scale between mean error and subspace error. We study a scale-calibrated median-of-means estimator for this problem using the product geometry of Euclidean space and the Grassmann manifold. A node-level PCA expansion shows that the mean component has the usual linear influence, whereas the subspace component is an eigengap-weighted covariance perturbation. We prove a local reduction showing that the proposed product-manifold median-of-means estimator is asymptotically equivalent to a scaled spatial median of node influence errors. This yields fixed-node non-Gaussian limits, growing-node Gaussian limits with finite-block bias, and an explicit scale-dependent covariance formula. We propose robust block-scale and inference-optimal calibration rules, establish high-probability median-of-means bounds, characterize factorwise bad-node influence, and prove node-bootstrap validity. Simulations and large-scale single-cell RNA-seq data show that scale calibration adapts to eigengap-driven subspace uncertainty and provides a robust distributed PCA summary.

[LG-118] me-Dependent PDE-Constrained Optimization via Weak-Form Latent Dynamics

链接: https://arxiv.org/abs/2605.20639
作者: April Tran,Terry Haut,David Bortz,Youngsoo Choi
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:

点击查看摘要

Abstract:Optimization problems constrained by high-dimensional, time-dependent partial differential equations require repeated forward and sensitivity solves, making high-fidelity optimization computationally prohibitive in many-query design and control settings. We present a weak-form latent-space reduced-order modeling framework for accelerating gradient-based PDE-constrained optimization. The proposed approach builds on Weak-form Latent Space Dynamics Identification (WLaSDI), which compresses high-dimensional solution trajectories into a low-dimensional latent representation and identifies parametric latent dynamics using weak-form system identification. By avoiding explicit numerical differentiation of training trajectories, the weak-form improves robustness to noisy data and yields more reliable surrogate dynamics for optimization. We formulate the resulting reduced PDE-constrained optimization problem and derive both direct-sensitivity and adjoint-based gradient expressions for the learned latent dynamics, enabling scalable gradient evaluation with respect to design parameters. The framework is demonstrated on three time-dependent benchmark problems: thermal radiative transfer for optimal hohlraum design, the two-stream instability Vlasov-Poisson system, and the inviscid Burgers equation. Across these examples, WLaSDI produces accurate optimal designs, remains robust under noisy training data, and delivers substantial computational savings, including speedups of up to five orders of magnitude relative to full-order optimization. These results demonstrate that weak-form latent dynamics provide an efficient and noise-robust surrogate foundation for gradient-based optimization of complex time-dependent PDE systems.

[LG-119] Group-Aware Matrix Estimation and Latent Subspace Recovery

链接: https://arxiv.org/abs/2605.20559
作者: Hamza Golubovic,Matthew Shen,Genevera I. Allen,Tarek M. Zikry
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
*备注: 12 pages, 6 main figures, 1 main algorithm

点击查看摘要

Abstract:Modern matrix completion problems often involve heterogeneous data whose rows simultaneously belong to many meta-categories, such as demographic and age groups in recommendation systems, or region and recording session labels in neural electrophysiological experiments. Standard low-rank estimators impose a single global latent geometry, which can recover average structure but may smooth away subgroup-specific variation, especially when observations are unevenly distributed across groups. We introduce Group-Aware Matrix Estimation (GAME), a convex estimator for overlapping subgroup-wise low-rank matrix estimation. GAME regularizes category-specific submatrices through overlapping nuclear-norm penalties, allowing related groups to borrow information while preserving local latent structure in a shared coordinate system. We provide finite-sample guarantees for both reconstruction error and subgroup-specific subspace recovery, showing how performance depends on sampling density, subgroup rank, and overlap structure. Experiments on synthetic, recommendation, ecological, and neuroscience datasets show that GAME is most beneficial in structured missingness regimes, where subgroup-aware regularization improves both reconstruction accuracy and latent subspace fidelity. Across these benchmarks, GAME is competitive or best among global low-rank, side-information, and modern imputation baselines, with the largest gains when subgroups exhibit distinct low-rank structure.

[LG-120] Spectral bandits for smooth graph functions with applications in recommender systems AAAI2014 SDM

链接: https://arxiv.org/abs/2605.20552
作者: Tomáš Kocák,Michal Valko,Rémi Munos,Branislav Kveton,Shipra Agrawal
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Published at AAAI 2014 - SDMBD

点击查看摘要

Abstract:Smooth functions on graphs have wide applications in manifold and semi-supervised learning. In this paper, we study a bandit problem where the payoffs of arms are smooth on a graph. This framework is suitable for solving online learning problems that involve graphs, such as content-based recommendation. In this problem, each recommended item is a node and its expected rating is similar to its neighbors. The goal is to recommend items that have high expected ratings. We aim for the algorithms where the cumulative regret would not scale poorly with the number of nodes. In particular, we introduce the notion of an effective dimension, which is small in real-world graphs, and propose two algorithms for solving our problem that scale linearly in this dimension. Our experiments on real-world content recommendation problem show that a good estimator of user preferences for thousands of items can be learned from just tens nodes evaluations.

[LG-121] Sample Complexity of Transfer Learning: An Optimal Transport Approach

链接: https://arxiv.org/abs/2605.20545
作者: Haoyang Cao,Xin Guo,Wenpin Tang,Guan Wang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transfer learning is an essential technique for many machine learning/AI models of complex structures such as large language models and generative AI. The essence of transfer learning is to leverage knowledge from resolved source tasks for a new target task, especially when the sample size m of the training data for the latter is low. In this work, we rigorously analyze the potential benefit of transfer learning in terms of sample efficiency. Specifically, taking an optimal transport viewpoint of transfer learning, we find that when the data dimension d is higher than 3 , the sample complexity for transfer learning is O(m^-(\alpha+1)/d) , with \alpha indicating the smoothness of the data distribution, as opposed to the O(m^-p/d) sample complexity for direct learning with p indicating the smoothness of the optimal target model. Our finding theoretically supports a better sample efficiency for transfer learning, when the target task is optimizing over a family of not-so-smooth models (i.e., highly complex networks with the possible use of non-smooth activation functions). Using image classification as an example, we numerically demonstrate the sample efficiency for transfer learning, that is, in the data hungry regime, the model performance can be significantly improved by transfer learning.

[LG-122] Contradiction Graphs Determine VC Dimension

链接: https://arxiv.org/abs/2605.20434
作者: Jesse Campbell,Daniel Ibaibarriaga,Lev Reyzin
类目: Machine Learning (stat.ML); Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the contradiction graphs associated with binary concept classes. For a class H \subseteq \0,1^X , the order- m contradiction graph G_m(H) has as vertices the H -realizable labeled sequences of length m , with two vertices adjacent when the two sequences assign opposite labels to some common domain point. Our main result is that the single graph G_m(H) determines the threshold predicate \mathrmVCdim(H)\ge m . Consequently, the full sequence (G_m(H))_m \ge 1 determines the exact VC dimension and, in particular, detects finite versus infinite VC dimension, answering a question posed by Alon et al. (2024).

[LG-123] Understanding Deterioration Random Effects for Causal Discovery in Infrastructure Management

链接: https://arxiv.org/abs/2605.20400
作者: Takato Yasuno
类目: Applications (stat.AP); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 20 pages, 7 figures, 4 tables

点击查看摘要

Abstract:Infrastructure deterioration poses significant challenges for asset management, yet existing approaches rely on population-averaged models that overlook equipment-specific heterogeneity. We present a novel framework that combines Bayesian hierarchical hazard modeling with causal discovery to identify operational patterns that drive heterogeneous deterioration rates in pump equipment. Our approach first estimates pump-specific random effects u_i using GPU-accelerated No-U-Turn Sampling (NUTS), achieving 3–5 \times speedup over CPU implementations. We then employ DirectLiNGAM to discover causal relationships between 22 engineered time-series features and deterioration rates, stratified by positive ( u_i 0 , faster deterioration) versus negative ( u_i \leq 0 , slower deterioration) random effects. Analyzing 112 pumps with 92,861 observations over 650 days, we uncover striking heterogeneity: the negative group exhibits causal effects 400 \times larger than the positive group, with standard deviation (std) showing a strong positive causal effect ( +1.515 ) on deterioration rates in low-risk equipment. We validate linearity assumptions through NonlinearLiNGAM comparison and demonstrate practical scalability through GPU acceleration. Our findings enable targeted maintenance strategies by revealing that different operational regimes require fundamentally distinct management approaches, advancing predictive maintenance from population-averaged to heterogeneity-aware decision making.

[LG-124] Corrected Integrated Laplace Approximation for Bayesian Inference in Latent Gaussian Models

链接: https://arxiv.org/abs/2605.20345
作者: Jinlin Lai,Charles C. Margossian,Daniel R. Sheldon
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Latent Gaussian models (LGMs) are a popular class of Bayesian hierarchical models that include Gaussian processes, as well as certain spatial models and mixed-effect models. Efficient Bayesian inference of LGMs often requires marginalizing out the latent variables. For LGMs with a non-Gaussian likelihood, exact marginalization is not possible and a popular approach is to do approximate marginalization with an integrated Laplace approximation (ILA). Using ILA produces an approximate posterior which, in some settings, can differ significantly from the correct posterior, which impacts downstream applications. We propose an importance sampling scheme to correct the error introduced by ILA. By increasing the number of samples in importance sampling, the posterior with ILA converges to the correct posterior. This idea is realized with various techniques, including pseudo-marginalization, quasi-Monte Carlo and randomized quasi-Monte Carlo. We implement our methods in an automatic differentiation framework to support gradient-based algorithms when doing inference on the hyperparameters. For the latter, we specifically consider the use of Hamiltonian Monte Carlo. We demonstrate the benefits of reduced error in various applied models.

[LG-125] he Economics of AI Inference: Inflation Dynamics Welfare Costs and Optimal Monetary Policy under the Inference-Cost Phillips Curve

链接: https://arxiv.org/abs/2605.20281
作者: Gustav Olaf Yunus Laitinen-Fredriksson Lundström-Imanov
类目: General Economics (econ.GN); Machine Learning (cs.LG)
*备注: 6 pages, 5 tables

点击查看摘要

Abstract:We develop a unified microeconomic and monetary theory of artificial intelligence inference costs and their pass-through to inflation, welfare, and optimal monetary policy. We introduce the Inference-Cost Phillips Curve (ICPC), an augmented New Keynesian Phillips curve in which firm-level marginal costs of producing differentiated goods include a non-trivial AI inference component lambda-bar, and prove a closed-form structural slope kappa*_inf = lambda-bar * kappa, where kappa is the standard Calvo-Yun slope. We derive a welfare-relevant Hicks-Kaldor decomposition of consumer welfare under inference-cost shocks, prove a generalized Taylor principle for the inference-augmented economy, and characterize the optimal monetary policy response coefficient psi*_inf = (1 + phi*rho) * lambda-bar * kappa under commitment. A second-order welfare loss formula closes the model in closed form. We confront the theory with U.S. monthly data 2022:M01-2026:M04 using a two-step GMM estimator with Newey-West HAC standard errors and Hansen J-test, recovering an empirical slope kappa-hat_inf = 0.087 (HAC s.e. 0.021) which lies within one standard error of the structural prediction. A scaling regression over 50 rolling-window subwindows yields b-hat = 0.987 (R^2 = 0.998), consistent with a near-unit-elasticity pass-through. A G7 reduced-form panel with Driscoll-Kraay HAC standard errors yields b-hat^G7 = 0.094 (s.e. 0.026), and a Wald test fails to reject cross-country homogeneity (p = 0.78). The framework provides a single equilibrium scaffold for the joint study of AI inference cost dynamics, monetary policy under generative-AI shocks, and the welfare cost of inference-driven inflation.

[LG-126] he Economics of Model Collapse: Equilibrium Welfare and Optimal Provenance Subsidies in Synthetic Data Markets

链接: https://arxiv.org/abs/2605.20279
作者: Gustav Olaf Yunus Laitinen-Fredriksson Lundström-Imanov
类目: General Economics (econ.GN); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 7 pages, 5 tables, 1 algorithm; IEEEtran conference format; submitted to IEEE BigData 2026

点击查看摘要

Abstract:Generative artificial intelligence is rapidly transforming the supply side of training data: an increasing share of new tokens, images, and structured records is produced by previous-generation models rather than by human originators. Recursive training on such synthetic content induces a measurable and often irreversible loss of distributional fidelity, a phenomenon known as model collapse. We develop the first unified microeconomic theory of synthetic data markets under model collapse. We introduce the Synthetic Data Contamination Equilibrium (SDCE), prove existence and generic uniqueness, derive a welfare decomposition W = W_prod + W_cons - L_coll - L_info, establish a Wasserstein-gradient-flow mean-field collapse limit, prove an impossibility of information-constrained implementation, and obtain closed-form expressions for the welfare-maximizing provenance subsidy s* = KL(q||p)/(2 kappa) and the welfare-maximizing watermark strength w* = (1 - psi) KL(q||p)/(2 kappa psi). We prove an information-theoretic Cramer-Rao lower bound on any provenance estimator using only producer-side observations and show that the Provenance-Market Iterative Retraining (PMIR) algorithm attains this bound up to constants while converging to an epsilon-SDCE in O(epsilon^-2 log T) iterations. A reduced-form OLS estimation on a C4-synthetic benchmark over ten retraining generations yields a collapse-rate coefficient b-hat = 0.181 (HAC s.e. 0.024), within one standard error of the structural prediction 0.183. Calibrated experiments raise generation-ten model quality by 23.1 percent over the unregulated benchmark while lowering the 2-Wasserstein drift on a held-out diversity probe from 0.318 to 0.142. Scaling experiments over generations t in 1,…,10 recover a logarithmic-in-t collapse law log Q_t = log Q_0 - 0.183 t rho^2 with R^2 = 0.962.

[LG-127] Multi-Head Attention as Ensemble Nadaraya-Watson Estimation: Variance Reduction Decorrelation and Optimal Head Diversity

链接: https://arxiv.org/abs/2605.20271
作者: Ernest Fokoué
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 14 pages

点击查看摘要

Abstract:We develop a rigorous statistical theory of multi-head attention (MHA) as an ensemble of Nadaraya-Watson (NW) kernel regression estimators. Building on the algebraic identity between single-head softmax attention and the NW estimator, we prove that MHA is a structured ensemble of H NW estimators, each operating in a distinct learned projection subspace of the key space. We derive an explicit Bias-Variance-Covariance decomposition of the MHA mean squared error, showing that variance reduction depends not merely on the number of heads H but fundamentally on the decorrelation of head outputs. Decorrelation is governed by the principal angles between learned projection subspaces: orthogonal projections yield maximum variance reduction; aligned projections yield none. We introduce the Head Diversity Index (HDI), a computable spectral measure of inter-head decorrelation, and prove that MHA mean squared error is monotonically decreasing in HDI. This provides the first rigorous theoretical explanation for the empirically observed specialization of attention heads. Under a fixed total-dimension budget D = H * d_k, we solve the optimal head-dimension allocation problem, deriving the MSE-minimizing pair (H*, d_k*) from data distribution and regression smoothness. The solution yields a new architectural scaling law: the optimal per-head dimension grows logarithmically with training set size, while the optimal number of heads grows nearly linearly with the total budget D. Our framework unifies three strands of prior work: the NW theory of single-head attention, the general weighting theory for ensemble learning, and the decorrelation-variance-reduction isomorphism between biological and computational ensembles. Multi-head attention is the Transformer’s instantiation of a universal principle: identical agents plus diversity-enforcing mechanisms yields emergent optimality.

[LG-128] Quantum End-to-End Learning for Contextual Combinatorial Optimization

链接: https://arxiv.org/abs/2605.20222
作者: Jaehwan Lee,Changhyun Kwon
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 23 pages, 2 figures, preprint

点击查看摘要

Abstract:Contextual combinatorial optimization (CCO) plays a critical role in decision-making under uncertainty, yet remains a significant challenge. We present Quantum End-to-End Learning (QEL), the first quantum computing-based end-to-end learning framework for CCO that leverages Quantum Approximate Optimization Algorithms. Inspired by the integration of state preparation and evolution in data re-uploading, we propose a context re-uploading phase-separator that jointly captures the complex relations among contexts, uncertain coefficients, and optimal solutions. This allows a contextual encoder to be seamlessly integrated within a quantum surrogate policy, enabling joint end-to-end training with a stationarity guarantee. Exploiting an optimization-aware structure grounded in physical principles that classical methods cannot readily leverage, our approach demonstrates practicality by directly training on task loss despite the discreteness and nonconvexity, while avoiding calls to NP-hard optimization solvers. QEL empirically achieves competitive performance while requiring substantially fewer parameters than classical benchmarks, highlighting its industrial-level potential for the future quantum era.

[LG-129] E-PCN: Jet Tagging with Explainable Particle Chebyshev Networks Using Kinematic Features

链接: https://arxiv.org/abs/2512.07420
作者: Md Raqibul Islam,Adrita Khan,Mir Sazzat Hossain,Choudhury Ben Yamin Siddiqui,Md. Zakir Hossan,Tanjib Khan,M. Arshad Momen,Amin Ahsan Ali,AKM Mahbubur Rahman
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*备注: 25 pages, 3 figures

点击查看摘要

Abstract:The identification and classification of collimated particle sprays, or jets, are essential for interpreting data from high-energy collider experiments. While deep learning has improved jet classification, it often lacks interpretability. We introduce the Explainable Particle Chebyshev Network (E-PCN), a graph neural network extending the Particle Chebyshev Network (PCN). E-PCN integrates kinematic variables into jet classification by constructing four graph representations per jet, each weighted by a distinct variable: angular separation ( \Delta ), transverse momentum ( k_T ), momentum fraction ( z ), and invariant mass squared ( m^2 ). We use the concept of Gradient-weighted Class Activation Mapping (Grad-CAM) to determine which kinematic variables dominate classification outcomes. Analysis reveals that angular separation and transverse momentum collectively account for approximately 76% of classification decisions (40.72% and 35.67%, respectively), with momentum fraction and invariant mass contributing the remaining 24%. Evaluated on the JetClass dataset with 10 signal classes, E-PCN achieves a macro-accuracy of 94.67%, macro-AUC of 96.78%, and macro-AUPR of 86.79%, representing improvements of 2.36%, 4.13%, and 24.88% respectively over the baseline PCN implementation, while demonstrating physically interpretable feature learning.

附件下载

点击下载今日全部论文列表