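arXiv Papers Today | 2026-06-16

This post lists the latest papers retrieved from arXiv.org on 2026-06-16. It is updated automatically and grouped into six broad areas: NLP, CV, ML, AI, IR, and MA.

Note: The daily paper data comes from arXiv.org and is refreshed automatically at around 12:30 each day.

Tip: If the list has not been updated on a given day, either arXiv published no new papers that day or the script failed. Failures are fixed the same day where possible.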

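Natural Language Processing: 186 papers (Computation and Language (cs.CL))
Artificial Intelligence: 432 papers (Artificial Intelligence (cs.AI))
Computer Vision: 291 papers (Computer Vision and Pattern Recognition (cs.CV))
Machine Learning: 400 papers (Machine Learning (cs.LG))
Multiagent Systems: 24 papers (Multiagent Systems (cs.MA))
Information Retrieval: 35 papers (Information Retrieval (cs.IR))
Human-Computer Interaction: 49 papers (Human-Computer Interaction (cs.HC))

Multiagent Systems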

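[MA-0] TokenPilot: Cache-Efficient Context Management for LLM Agents

[Quick read]: This paper addresses the rising inference cost caused by context accumulation when large language model (LLM) agents run long-horizon tasks. Existing methods reduce the token footprint through text pruning or dynamic memory eviction, but their unconstrained sequence mutations break prompt-prefix consistency, causing prefix mismatches and cache invalidation. This exposes a key trade-off between text sparsity and prompt-cache continuity. The paper proposes TokenPilot, a dual-granularity context management framework. Globally, Ingestion-Aware Compaction stabilizes prompt prefixes at the ingestion stage and removes open-world environmental noise. Locally, Lifecycle-Aware Eviction monitors the residual utility of context segments and conservatively offloads content in batches only once its task relevance has expired. On the PinchBench and Claw-Eval benchmarks, TokenPilot reduces inference cost by 61% and 56% in isolated mode and by 61% and 87% in continuous mode, while keeping performance comparable to existing systems. It has been integrated into LightMem2.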

Link: https://arxiv.org/abs/2606.17016
Authors: Buqiang Xu, Zirui Xue, Dianmou Chen, Chenyang Fu, Chiyu Wu, Caiying Huang, Chen Jiang, Jizhan Fang, Xinle Deng, Yijun Chen, Yunzhi Yao, Xuehai Wang, Jin Shang, Gong Yu, Ningyu Zhang
Affiliations: Zhejiang University; University of Electronic Science and Technology of China; Xi’an University of Electronic Science and Technology; HomologyAI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: LightMem Series: Work in Progress

Abstract:As LLM agents are deployed in long-horizon sessions, context accumulation drives up inference costs. Existing approaches utilize text pruning or dynamic memory eviction to minimize token footprints; however, their unconstrained sequence mutations alter layouts, introducing prefix mismatches and cache invalidation. This reveals a critical trade-off between text sparsity and prompt cache continuity. To address this, we present TokenPilot, a dual-granularity context management framework. Globally, Ingestion-Aware Compaction acts as a framework harness to stabilize prompt prefixes and eliminate open-world environmental noise at the ingestion gate. Locally, Lifecycle-Aware Eviction monitors the ongoing residual utility of context segments, enforcing a conservative batch-turn schedule to offload content segments only when task relevance expires. Experiments on PinchBench and Claw-Eval under both isolated and continuous modes demonstrate that TokenPilot reduces costs by 61% and 56% in isolated mode, and 61% and 87% in continuous mode, while maintaining competitive performance compared to prior systems. TokenPilot has been integrated into LightMem2 at this https URL.
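
To make the trade-off between sparsity and cache continuity concrete, here is a minimal Python sketch. It is a toy model, not TokenPilot's algorithm: it assumes a prompt cache that can reuse only the longest unchanged leading run of segments, and the segment names and the `shared_prefix_len` helper are hypothetical.

```python
def shared_prefix_len(old, new):
    """Count the leading segments that are identical in both prompts."""
    n = 0
    for a, b in zip(old, new):
        if a != b:
            break
        n += 1
    return n


# Hypothetical prompt, split into segments (assumption: cache reuse is per segment).
prompt = ["system", "tool-spec", "turn1", "turn2", "turn3", "turn4"]

# Unconstrained pruning: dropping an early segment shifts everything after it.
pruned_early = ["system", "tool-spec", "turn2", "turn3", "turn4"]

# Prefix-preserving edit: the head stays fixed and a later segment is evicted.
evicted_late = ["system", "tool-spec", "turn1", "turn2", "turn4"]

reuse_pruned = shared_prefix_len(prompt, pruned_early)
reuse_late = shared_prefix_len(prompt, evicted_late)
print(reuse_pruned, reuse_late)
```

Both edits remove exactly one segment, yet in this toy setting the early removal leaves fewer leading segments reusable than the late one, which is the intuition behind keeping prefixes stable and evicting conservatively.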

[MA-1] Human-on-the-Bridge: Scalable Evaluation for AI Agents

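[Quick read]: This paper addresses the fragmentation and limited scalability of current evaluation methods for generative-AI agents. Existing approaches each capture only part of the picture: benchmarks measure static capabilities, Human-in-the-Loop review is hard to scale, LLM-as-judge methods depend on evaluator design, red teaming is often episodic, and trace auditing requires explicit evidence rules. None fully reflects the complex behavior agents show across multi-turn interaction. The paper proposes Human-on-the-Bridge (HOB), a scalable agent-evaluation paradigm that moves human expertise upstream: before evaluation begins, domain experts curate reusable evaluation intelligence, including domain context, Red-Team Traps, Juror Personas, scoring guidelines, audit rules, and fallback policies. ProofAgent Harness then runs this curated intelligence through multi-turn adversarial evaluations with trace capture, multi-juror scoring, and evidence-linked reporting. The experiments cover 23,500 agent turns across finance, healthcare, and code generation. The results indicate that, even with a smaller Harness LLM, HOB can challenge agents built on frontier LLMs, amplify evaluation quality, and surface failures that static benchmarks and single-evaluator scoring often miss, such as phantom tool-call claims, missing mandatory tool calls, policy drift, manipulation paths, and safe but non-resolving refusals. HOB's key contribution is encoding expert judgment upstream so that it can be reused across repeated agent evaluations rather than applied manually inside every run.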

Link: https://arxiv.org/abs/2606.16871
Authors: Fouad Bousetouane
Affiliations: ProofAgent.ai; The University of Chicago
Subjects: Multiagent Systems (cs.MA)
Comments: 33 pages, 3 figures

Abstract:AI agents must be evaluated as behavioral systems, not as isolated response generators. They reason across turns, call tools, preserve context, follow policies, and act under uncertainty. Existing methods provide useful but fragmented signals: benchmarks measure fixed capabilities, Human-in-the-Loop review preserves expert judgment but does not scale easily, LLM-as-judge methods depend on evaluator design, red teaming is often episodic, and trace auditing requires explicit evidence rules. This paper introduces Human-on-the-Bridge (HOB), a scalable evaluation paradigm for agentic AI. HOB places human expertise upstream, where experts curate reusable evaluation intelligence before testing begins, including domain context, Red-Team Traps, Juror Personas, scoring guidelines, audit rules, and fallback policies. ProofAgent Harness then executes this curated intelligence repeatedly through multi-turn adversarial evaluations, trace capture, multi-juror scoring, and evidence-linked reporting. We evaluate HOB through symmetric and cost-efficient asymmetric settings across frontier LLM-based agents and Harness LLM tiers. The study covers 23,500 agent turns and produces evidence-linked findings across finance, healthcare, and code generation. The results show that HOB can amplify evaluation quality without requiring equally large evaluator models, allowing smaller Harness LLMs to challenge agents built on frontier LLM backbones. The evaluation surfaces failures often missed by static benchmarks and single-evaluator scoring, including phantom tool-call claims, missing mandatory tool calls, policy drift, manipulation paths, and safe but non-resolving refusals. These findings support HOB as a paradigm for scaling human-curated evaluation intelligence, where expert judgment is encoded upfront and reused across repeated agent evaluations rather than applied manually inside every run.

[MA-2] Misinformation Propagation in Benign Multi-Agent Systems

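[Quick read]: This paper studies how errors propagate in multi-agent systems deployed in high-stakes settings such as medical diagnosis, legal analysis, and forensic decision-making when a single agent reasons from incorrect or misleading context. The core concern is that misinformation obtained through channels such as tool calls can keep spreading through turn-based interaction between agents and undermine the reliability of the whole system. The approach is to inject intent-based misinformation and evaluate the robustness of single-agent and multi-agent systems on reasoning, knowledge, and alignment tasks. The findings: misinformation degrades single-agent performance and persists across multi-agent debate, yet debate reduces the resulting degradation compared with single-agent prompting, especially when most agents are not exposed to misinformation. Robustness depends on group composition and decision protocol: consensus can be more stable than voting under peer pressure, while majorities can often steer misinformed agents back toward correct answers. Robustness to misinformation in multi-agent systems therefore depends not only on the underlying model but also on how agents exchange information and aggregate their final decisions.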

Link: https://arxiv.org/abs/2606.16710
Authors: Jonas Becker, Jan Philip Wahle, Terry Ruas, Bela Gipp
Affiliations: University of Göttingen; LKA NRW
Subjects: Multiagent Systems (cs.MA); Computation and Language (cs.CL)
Comments: 20 pages, 8 figures, 1 table

Abstract:Multi-agent systems, in which multiple large language model agents solve problems through turn-based interaction, are increasingly deployed in high-stakes settings such as medical diagnosis, legal analysis, and forensic decision-making. Their reliability can be at risk when single agents reason from incorrect or misleading context, e.g., from tool calls, since errors may propagate through agent interactions. This work studies this risk by injecting intent-based misinformation into benign single-agent and multi-agent systems across reasoning, knowledge, and alignment tasks. We find that misinformation can degrade single-agent performance and persists across multi-agent debate, with agents often retaining answers introduced by misinformed peers. Nevertheless, multi-agent debate reduces the resulting performance degradation compared to single-agent prompting, especially when most agents are not exposed to misinformation. Robustness depends on group composition and decision protocol. Consensus can be more stable than voting under peer pressure, while majorities can often steer misinformed agents back toward correct answers. Our results show that misinformation robustness in multi-agent systems depends on the underlying model and also on how agents exchange information and aggregate decisions.
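
The abstract contrasts consensus and voting as decision protocols without specifying them. The sketch below shows one plausible reading in Python; the function names, the tie rule, and the unanimity threshold are assumptions for illustration, not the paper's definitions.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer, or None when first place is tied."""
    counts = Counter(answers).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def consensus(answers, threshold=1.0):
    """Return an answer only if at least `threshold` of the agents agree on it."""
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count / len(answers) >= threshold else None


# Hypothetical debate round: one misinformed agent ("B") among three.
round_answers = ["B", "A", "A"]
vote_result = majority_vote(round_answers)
consensus_result = consensus(round_answers)
print(vote_result, consensus_result)
```

Under these assumptions the majority outvotes the misinformed agent, while strict consensus withholds a decision until the group actually agrees, so the two protocols can behave differently on the same round.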

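[MA-3] The Proxy Knows Too Much: Sealing LLM API Routers with Attested TEEs

[Quick read]: This paper addresses a serious security risk in agent systems that access large language models (LLMs) through API routers. Because the router terminates the client's transport-layer security (TLS) session and opens a separate upstream session, it holds the full interaction in plaintext. That makes it an application-layer man-in-the-middle that can rewrite tool calls, swap dependencies for typosquatted packages, trigger attacks only under audit-evading conditions, and passively exfiltrate secrets. Existing client-side defenses are evadable. The proposed solution is AEGIS, a provider-transparent attested API router whose data path is a client-verified faithful passthrough. AEGIS confines plaintext handling to a small hardware-enclave component, while authentication, scheduling, accounting, and management stay on the untrusted host. The client verifies the enclave before releasing plaintext, so the host can neither read nor alter the interaction, and plaintext leaves only toward destinations fixed by the measured image. In the experiments, all four malicious-router attack classes succeed against a plaintext-access baseline and are blocked by AEGIS, including adaptive tests against the same boundary. The trusted path is 851 lines of code, carries three provider-native APIs without conversion, and completes every request under real-provider workload and concurrency, with a local relay overhead of about 6 ms per request. In a seeded audit pilot, two commodity coding agents find 8 and 10 of 10 planted invariant violations.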

Link: https://arxiv.org/abs/2606.16358
Authors: Sipeng Xie, Qianhong Wu, Hengrun Lu, Ziliang Sun, Qi Wu, Bo Qin, Qin Wang
Affiliations: Beihang University; Renmin University of China; Independent
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA)
Comments:

Abstract:Agents increasingly access large language models (LLMs) through API routers. A router terminates the client’s transport-layer security session and opens a separate upstream session, so it holds the full interaction in plaintext. This makes the router an application-layer man-in-the-middle: it can rewrite agent tool calls, swap dependencies for typosquatted packages, trigger attacks only under audit-evading conditions, and passively exfiltrate secrets. Existing client-side defenses are evadable. We propose AEGIS, a provider-transparent attested API router whose data path is a client-verified faithful passthrough. AEGIS confines plaintext handling to a small hardware-enclave component while leaving authentication, scheduling, accounting, and management on the untrusted host. The client verifies the enclave before releasing plaintext. The host can neither read nor alter the interaction, and plaintext leaves only toward destinations fixed by the measured image. We show that all four malicious-router attack classes succeed against a plaintext-access baseline and are blocked by AEGIS, including adaptive tests against the same boundary. The trusted path is 851 lines, carries three provider-native APIs without conversion, and completes every request under real-provider workload and concurrency. In a seeded audit pilot, two commodity coding agents find eight and ten of ten planted invariant violations. The local relay overhead is about six milliseconds per request.

[MA-4] Distributed Safe Consensus Under Asymmetric Input and Time-Varying Output Constraints

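[Quick read]: This paper addresses safe distributed consensus for single-integrator multi-agent systems over connected undirected graphs under simultaneous asymmetric actuator constraints and output safety constraints. The challenge is to achieve asymptotic synchronization while keeping each actuator input strictly inside its asymmetric admissible interval and each agent output inside a prescribed safe interval. The key idea is a barrier-coordinate transformation defined over a common time-varying safe interval, with a distributed synchronization law designed in the transformed coordinates. The controller combines a graph-based coordination layer with an actuator-side tracking layer, which jointly enforces input admissibility, forward invariance of the safe output set, and asymptotic synchronization. For compact admissible sets of initial conditions, the analysis shows that the closed-loop solution is complete, all signals remain bounded, the actuator inputs stay strictly within their asymmetric bounds, and the agent outputs stay inside the prescribed safe interval for all time. The transformed synchronization errors converge exponentially to zero, and the original outputs asymptotically synchronize to a designer-selected admissible trajectory embedded in the common safe interval. Numerical simulations support the framework under both asymmetric actuation bounds and time-varying output constraints.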

Link: https://arxiv.org/abs/2606.16116
Authors: Abhinav Sinha, Shashi Ranjan Kumar
Affiliations: University of Cincinnati; Indian Institute of Technology Bombay
Subjects: Systems and Control (eess.SY); Multiagent Systems (cs.MA); Robotics (cs.RO); Dynamical Systems (math.DS)
Comments:

Abstract:This paper studies safe distributed consensus for single-integrator multi-agent systems over connected undirected graphs under simultaneous asymmetric actuator constraints and output safety constraints. Each agent is equipped with a continuously differentiable asymmetric actuator dynamics that maps a commanded control signal to the realized plant input while keeping the latter strictly inside a prescribed admissible interval. To address output safety, a barrier-coordinate transformation is introduced over a common time-varying safe interval, and a distributed synchronization law is designed in the transformed coordinates. The resulting controller integrates a graph-based coordination layer with an actuator-side tracking layer, thereby enabling simultaneous enforcement of input admissibility, forward invariance of the safe output set, and asymptotic synchronization. For compact admissible sets of initial conditions, it is shown that the closed-loop solution is complete, all signals remain bounded, the actuator inputs remain strictly within their asymmetric bounds, and the agent outputs remain inside the prescribed safe interval for all time. Moreover, the transformed synchronization errors converge exponentially to zero, and the original agent outputs asymptotically synchronize to a designer-selected admissible trajectory embedded in the common safe interval. Numerical simulations validate the proposed framework and demonstrate safe consensus under both asymmetric actuation bounds and time-varying output constraints.

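[MA-5] Orchestrated Reality: From Role-Play to Living Playable Game Worlds – LLM-Driven World Simulation as a Parameterized-Action POMDP

[Quick read]: This paper addresses a long-standing difficulty in open-world and sandbox games: combining tightly-authored narrative with deeply simulated world systems (NPC behaviour, levelling, consequence simulation), which has been prohibitively expensive. Its proposal is the orchestrated reality framework, which makes the game world a persistent, validated canonical state object owned by a single orchestration agent, analogous to the Game Master (GM) in tabletop RPGs. The central formal step is to model an LLM-driven game world as a Parameterized-Action POMDP: state is a tree of canonical JSON entities, actions decompose into a discrete intent kind plus structured JSON parameters, the agent observes only a narrative projection of state, and transitions are carried out by an LLM-driven Plan-Diff-Validate-Apply (PDVA) pipeline that commits only schema-validated, content-hashed JSON deltas. The framework aims to keep narrative and system state consistent and to provide an architectural basis for a fully autonomous generative game engine. The work is in progress, and empirical validation is left to future work.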

Link: https://arxiv.org/abs/2606.16014
Authors: Yuhang Huang, Chenmiao Li, Chaowei Fang
Affiliations: The University of Tokyo; Individual Researcher
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 9 pages, 2 figures. Work in progress. Yuhang Huang and Chenmiao Li contributed equally

Abstract:Many games rely on storytelling combined with systems that track levelling, NPC behaviour, and consequence simulation; bridging tightly-authored narrative with deeply-simulated worlds – most acute in sandbox and open-world settings – has been prohibitively expensive. LLM-driven worlds open a new path: a single harness can coordinate numerical state, narrative voice, storytelling pacing, and rule logic together. Realising this requires the LLM system to sustain a persistent world (who is where, what has just happened, what is currently true), which today’s deployed systems do not: the narrative voice asserts state in free prose without any validated representation, so a fully autonomous game engine remains infeasible. We treat this as an architectural choice, not a limitation of language models, and report work in progress on a framework – orchestrated reality – that makes the world a canonical object owned by a singleton orchestration agent analogous to the tabletop-RPG Game Master (GM). We formalise an LLM-driven game world for a human player as a Parameterized-Action POMDP: state is a tree of canonical JSON entities, actions decompose as a=(k, x_k) (a discrete intent kind plus structured JSON parameters), the agent observes only a narrative projection o=O(s) of state, and the transition kernel F is an LLM-driven Plan-Diff-Validate-Apply (PDVA) pipeline that commits schema-validated, content-hashed JSON deltas. We give the formal model, a JSON-state example, a worked single-turn example, and a catalogue of 15 illustrative incidents drawn from a real deployment showing the framework in action. Empirical validation through a planned human player study – together with multi-NPC concurrent agency and deployment as an RL environment – is situated as future work.
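
The Plan-Diff-Validate-Apply idea can be sketched in a few lines of Python. This is a hand-written illustration under stated assumptions, not the authors' pipeline: the entity names, the delta format, the toy validation rule, and the helper functions are all hypothetical, and the plan and diff steps that an LLM would perform are replaced by a hard-coded calculation.

```python
import copy
import hashlib
import json


def content_hash(obj):
    """Hash a JSON-serialisable object using a canonical, sorted-key encoding."""
    blob = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def validate(delta, state):
    """Toy schema check: the entity must exist and hp must be a non-negative int."""
    entity = state.get(delta["entity"])
    if entity is None:
        return False
    hp = delta["set"].get("hp", entity["hp"])
    return isinstance(hp, int) and hp >= 0


def apply_delta(state, delta):
    """Return a new state with the delta applied; the input state is not mutated."""
    if not validate(delta, state):
        raise ValueError("delta rejected by validation")
    new_state = copy.deepcopy(state)
    new_state[delta["entity"]].update(delta["set"])
    return new_state


# Canonical state and a parameterized action a = (k, x_k); all values are made up.
state = {"npc_guard": {"hp": 10, "location": "gate"}}
kind, params = ("attack", {"target": "npc_guard", "damage": 3})

# Stand-in for the Plan and Diff steps: compute the proposed JSON delta by hand.
target = params["target"]
delta = {"entity": target, "set": {"hp": state[target]["hp"] - params["damage"]}}

delta_id = content_hash(delta)          # content-addressed identifier for the delta
next_state = apply_delta(state, delta)  # Validate, then Apply
print(kind, delta_id[:12], next_state)
```

Keeping the narrative layer away from direct state writes, and committing only deltas that pass validation, is what lets the state stay canonical in this reading of the abstract.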

[MA-6] DeepRoot: A KG-Coordinated Multi-Agent System for Therapeutic Reasoning over Historical Medical Texts

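[Quick read]: This paper addresses the difficulty of bringing the drug-discovery potential of historical medical texts (such as the Shen Nong Ben Cao Jing) into modern biomedical pipelines, given that these texts lack standardized structure. They are written as pre-ontological prose with idiosyncratic taxonomies, so their content cannot readily support verifiable drug discovery at scale. The paper proposes DeepRoot, a multi-agent LLM system whose key idea is to treat grounding (building a verified knowledge graph, KG) and reasoning as separable, composable axes: the system builds a verified KG and then reasons over it jointly with an LLM. Of 21 held-out compound-disease treatment pairs, DeepRoot recovers 10 at R@20 (47.6%), compared with 4.8% for an LLM given the raw corpus and about 2.4% for random guessing. In an LLM-as-judge audit of reasoning quality, DeepRoot outperforms baseline LLMs and LLMs with direct tool-call access. Tool-using LLMs hallucinate evidence on 87% of claims, whereas graph-only inference hallucinates on 0% but ranks lowest on reasoning coherence. Only the DeepRoot KG+LLM configuration wins on both axes, which points toward a route for systematically mining and repurposing historical medical knowledge.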

Link: https://arxiv.org/abs/2606.15931
Authors: Zijian Carl Ma, Sean J. Wang, Sijbren Kramer, Li Erran Li
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Abstract:Historical medical archives and traditional medicines hold immense potential for drug discovery and remain a primary source for current drug development. However, pre-ontological prose and idiosyncratic taxonomies prevent the standardization and medical modernization of the data for use in current biomedical pipelines. Furthermore, no existing LLM agent system, whether tool-calling, retrieval-augmented, or agentic deep-research, can convert such text into verifiable drug-discovery leads at scale. We close this gap with DeepRoot, a multi-agent LLM system that jointly builds and utilizes a verified knowledge graph, showing that grounding and reasoning – often conflated – are separable axes the system can compose for therapeutic reasoning. Applied to the Shen Nong Ben Cao Jing, DeepRoot recovers 10 of 21 held-out compound-disease treatment pairs at R@20 (47.6% vs 4.8% for a raw corpus LLM and ~2.4% random) and dominates an LLM-as-judge audit for reasoning quality over baseline LLMs and LLMs with direct tool-call access to the same APIs DeepRoot itself queries. Tool-using LLMs hallucinate evidence on 87% of claims, versus 7-10% for DeepRoot. Graph-only inference hallucinates 0% but ranks lowest on reasoning coherence; DeepRoot KG+LLM is the only condition to win on both axes, pointing toward a route for systematic mining and repurposing of historical medical knowledge.
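
As a quick arithmetic check on the reported recall figures, the following Python snippet recomputes them. The `recall_at_k` helper is hypothetical, and reading the 4.8% baseline as exactly one recovered pair out of 21 is an assumption; the abstract states only the percentage.

```python
def recall_at_k(hits, total):
    """Fraction of held-out pairs recovered within the top-k ranked candidates."""
    return hits / total


deeproot = recall_at_k(10, 21)  # 10 of 21 pairs, as stated in the abstract
raw_llm = recall_at_k(1, 21)    # assumption: 4.8% corresponds to 1 of 21 pairs
print(f"{deeproot:.1%} vs {raw_llm:.1%}")
```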

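[MA-7] SkillVetBench: LLM-as-Judge for Multi-Dimensional Security Risk Evaluation in Open-Source LLM Agent Skills NEURIPS2027

[Quick read]: This paper addresses the lack of effective security vetting for community-contributed skills (modular tool definitions that extend agent capabilities) in open-source LLM agent ecosystems. Existing scanners operate at the code layer and are structurally blind to instruction-layer and multi-agent risks, such as natural-language directives that hijack an agent, exfiltrate data through encoded side channels, or chain harm across pipelines. The proposed solution is SKILLVETBENCH, a live public leaderboard on Hugging Face that uses an LLM-as-Judge to vet agent skills semantically along multiple dimensions. Its main new component is SARS (Skill Agentic Risk Score), a five-dimensional agentic-risk metric with a principled weighted formula for instruction-following systems. It also integrates full CVSS v4.0 vector decomposition and a ClawHub dual view that places the LLM-generated review beside the official marketplace verdict. Drawing on a companion benchmark paper, the LLM-as-Judge stage achieves zero false negatives across 78 confirmed-malicious skills and zero false positives across 22 benign controls, while the best static baseline (SKILLSIEVE) still misses 15%. For instruction-layer categories such as Prompt Injection and Memory Poisoning, conventional tools miss between 89% and 100% of threats (for example, CODEBERT detects none of nine memory-poisoning skills). Detection rates range from 35% to 95% across four LLM evaluators, which motivates ensemble scoring in production deployments.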

Link: https://arxiv.org/abs/2606.15899
Authors: Ismail Hossain, Sai Puppala, Md Jahangir Alam, Tanzim Ahad, Sajedul Talukder
Affiliations: SUPREME Lab, University of Texas at El Paso
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: The main research paper has been submitted to NeurIPS 2027 and is under review

Abstract:Open-source LLM agent ecosystems are growing rapidly, yet the security of community-contributed skills - modular tool definitions that extend agent capabilities - remains largely unvetted. The gap we fill: existing scanners operate at the code layer and are structurally blind to instruction-layer and multi-agent risk - natural-language directives that hijack an agent, exfiltrate data through encoded side channels, or chain harm across pipelines - so what is needed is a semantic, multi-dimensional vetting system rather than another signature matcher. We present SKILLVETBENCH, a live public leaderboard on Hugging Face that uses an LLM-as-Judge to vet agent skills. What is new: SARS (Skill Agentic Risk Score), a five-dimensional agentic-risk metric with a principled weighted formula for instruction-following systems. What is integrated: full CVSS v4.0 vector decomposition and a ClawHub dual-view that places our LLM-generated review beside the official marketplace verdict. What is demonstrated: drawing on our companion benchmark paper [1], the LLM-as-Judge stage achieves zero false negatives across 78 confirmed-malicious skills and zero false positives across 22 benign controls, while the best static baseline (SKILLSIEVE) still misses 15%; for instruction-layer categories such as Prompt Injection and Memory Poisoning, conventional tools miss between 89% and 100% of threats (e.g., CODEBERT detects none of nine memory-poisoning skills). Detection rates vary from 35% to 95% across four LLM evaluators, motivating ensemble scoring in production deployments.

[MA-8] Odds Law: The Decomposition Algebra On How Intelligence Organizes Itself to Solve Difficult Problems Reliably

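[Quick read]: This paper asks how unreliable elementary problem-solvers can be organized to solve hard problems reliably, and what the theoretical limits of that reliability are. Its answer is a decomposition algebra: elementary solvers are morphisms in a stochastic category, and four combinators (sequential composition, parallel ensembling, verification gating, and recursive reduction) generate the space of compound solvers. The algebra carries two homomorphisms, a reliability valuation into the ordered monoid $([0,1], \le)$ and a cost valuation into a commutative semiring, from which the paper derives composition laws for how reliability flows through structure. The central results are: (i) a verification odds law, showing that a verification gate multiplies the odds of correctness by the verifier's likelihood ratio $\Lambda$, so that conditionally independent gates yield geometric amplification; (ii) a reliability amplification theorem, giving target reliability $1-\delta$ at $O(\log 1/\delta)$ verification depth whenever $\Lambda > 1$; and (iii) a threshold dichotomy, in which reliability can be driven arbitrarily close to one at logarithmic cost above the critical parameters, while no amplification is possible at or below them. The paper also shows that self-organization is the least fixed point of a monotone improvement operator on the complete lattice of strategies, and that this fixed point equalizes marginal log-odds gain per unit cost. Finally, it proves matching limits: an information ceiling bounds per-gate amplification by a divergence quantity, and shared error causes create a strictly positive voting floor, so diversity is necessary for unbounded amplification. In short, reliability is neither free nor magical: it is bought with independent information, arranged by composition, and bounded by the verifier.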

Link: https://arxiv.org/abs/2606.15712
Authors: Hidayet Aksu
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 10 pages, 2 figures

Abstract:We ask a structural question: given unreliable elementary problem-solvers, what organizations of them solve hard problems reliably, and what are the limits? We develop a decomposition algebra: elementary solvers are morphisms in a stochastic category, and four combinators (sequential composition, parallel ensembling, verification gating, and recursive reduction) generate the space of compound solvers. We equip this algebra with two homomorphisms, a reliability valuation into the ordered monoid $([0,1],\le)$ and a cost valuation into a commutative semiring, and we derive the composition laws that govern how reliability flows through structure. Our central results are (i) a verification odds law (the result that names this report), showing that a verification gate multiplies the odds of correctness by the verifier’s likelihood ratio $\Lambda$, so that $k$ conditionally independent gates yield geometric amplification; (ii) a reliability amplification theorem, giving target reliability $1-\delta$ at $O(\log 1/\delta)$ verification depth whenever $\Lambda > 1$; and (iii) a threshold dichotomy: above the critical parameters reliability can be driven arbitrarily close to one at logarithmic cost, while at or below them no amplification is possible. We then show that self-organization is the least fixed point of a monotone improvement operator on the complete lattice of strategies, and that this fixed point equalizes marginal log-odds gain per unit cost. Finally, we prove matching limits: an information ceiling bounds per-gate amplification by a divergence quantity; shared error causes create a strictly positive voting floor, so diversity is necessary for unbounded amplification. Reliability, in short, is neither free nor magical: it is bought with independent information, arranged by composition, and bounded by the verifier.
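
The verification odds law lends itself to a short numeric illustration. The Python sketch below assumes the simplest reading of the abstract: each of k conditionally independent gates multiplies the odds of correctness by the same likelihood ratio. The function names and the example numbers are hypothetical, and the paper's exact conditions may be stricter.

```python
import math


def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)


def prob(o):
    """Convert odds back into a probability."""
    return o / (1.0 + o)


def after_gates(p0, lam, k):
    """Reliability after k independent gates, each multiplying the odds by lam."""
    return prob(odds(p0) * lam ** k)


def gates_needed(p0, lam, delta):
    """Smallest k that reaches reliability 1 - delta; requires lam > 1."""
    if lam <= 1.0:
        raise ValueError("no amplification when the likelihood ratio is <= 1")
    target = odds(1.0 - delta)
    return max(0, math.ceil(math.log(target / odds(p0)) / math.log(lam)))


# Made-up example: a coin-flip solver, a verifier with likelihood ratio 3, target 99%.
k = gates_needed(0.5, 3.0, 0.01)
print(k, after_gates(0.5, 3.0, k))
```

Because the required depth grows with the logarithm of the target odds, halving the tolerated error adds only a constant number of gates in this model, and the `lam <= 1.0` branch mirrors the threshold below which no amplification is possible.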

[MA-9] AI-Driven Framework for Adaptive Water Network Management with Proof-of-Concept Implementation: Addressing Non-Revenue Water in Jordan

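[Quick read]: This paper addresses Jordan's severe water scarcity, in particular the 50% of produced water lost as non-revenue water (NRW) through leakage, theft, and metering errors. Traditional reactive management has not delivered sustained NRW reduction, so the paper proposes an integrated intelligent framework that combines EPANET hydraulic modeling, digital twin technology, SCADA systems, and large language model (LLM)-based AI agents for continuous network monitoring and adaptive decision-making. The system couples real-time data streams with physics-based simulation to detect anomalies, uses retrieval-augmented generation (RAG) for policy interpretation, and uses function calling for network control. A proof of concept on a 1,164-junction district network in Amman, using an offline LLM (llama3.1:8b via Ollama), demonstrates automated hydraulic simulation, flow-based anomaly detection aligned with water distribution zone (DZ) practice, and AI-generated health reports in under 2 minutes at zero API cost. A simulated 30.1 L/s leak redistributes flow in 15 pipes and flags a 15-junction cluster that localises the burst, consistent with DZ monitoring practice. The framework accommodates Jordan's intermittent supply and limited automation through phased implementation, offering a scalable path to NRW reduction for water-scarce regions.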

Link: https://arxiv.org/abs/2606.15709
Authors: Mohammed Fasha, Nahel Al-Maayta, Bilal Sowan, Mohammad Athamneh, Husam Barham
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:Jordan faces severe water scarcity, with 50% of the water produced lost to leakage, theft, and metering issues, collectively known as non-revenue water (NRW). Traditional reactive approaches have proven insufficient for sustained NRW reduction. This paper proposes an intelligent framework integrating EPANET hydraulic modeling, digital twin technology, SCADA systems, and large language model (LLM)-based AI agents for continuous network monitoring and adaptive decision-making. The system combines real-time data streams with physics-based simulation to detect anomalies, employing retrieval-augmented generation (RAG) for policy interpretation and function calling for network control. A proof-of-concept implementation validates technical feasibility using EPYT with offline LLMs (llama3.1:8b via Ollama) on a 1,164-junction Amman district network. The system demonstrates automated hydraulic simulation, flow-based anomaly detection aligned with water distribution zone (DZ) practice, and AI-generated health reports with response times under 2 minutes and zero API costs. Burst detection relies on local flow anomaly analysis: a 30.1 L/s simulated leak produces measurable flow redistribution in 15 pipes, flagging a 15-junction cluster that localises the burst – confirming alignment with water distribution zone (DZ) monitoring practice. The framework accommodates Jordan’s intermittent supply patterns and limited automation through phased implementation, offering a scalable pathway for water-scarce regions to leverage intelligent automation for NRW reduction and operational efficiency.
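
Flow-based anomaly detection of the kind described can be sketched as a comparison between simulated baseline flows and observed flows. The Python snippet below is a simplified illustration only: it does not call EPANET or EPYT, and the pipe ids, flow values, threshold, and `flag_anomalies` helper are all made up.

```python
def flag_anomalies(baseline, observed, threshold_lps=1.0):
    """Return pipe ids whose flow deviates from baseline by more than the threshold (L/s)."""
    return sorted(
        pipe for pipe, flow in observed.items()
        if abs(flow - baseline[pipe]) > threshold_lps
    )


# Hypothetical per-pipe flows in L/s: a simulated baseline and a later observation.
baseline = {"P1": 12.0, "P2": 8.5, "P3": 4.0}
observed = {"P1": 12.2, "P2": 14.9, "P3": 1.1}

flagged = flag_anomalies(baseline, observed)
print(flagged)
```

In a real deployment the baseline would come from the hydraulic model and the flagged pipes would be grouped by neighbouring junctions to localise a burst; both steps are outside this sketch.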

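[MA-10] Agentic Retrieval and Reinforcement Learned Equation Chains: A Controlled Generation Framework for Complex and Novel Physics Word Problems

[Quick read]: This paper addresses the generation of high-quality Physics Word Problems (PWPs) that are novel, complex, and solvable, which remains a challenging and underexplored problem in educational content generation. Existing methods, many adapted from Math Word Problem (MWP) generation, often produce ambiguous, unsolvable, or structurally simple questions with limited linguistic diversity. The proposed solution is ARVRE (Agentic Retrieval Value Reinforced Equation-chain), a two-stage framework. In the first stage, a form of offline temporal-difference learning constructs valid chains of physics equations, while an agentic retrieval-augmented generation (RAG) framework dynamically selects topic-specific concepts and vocabulary, which gives explicit control over problem structure and difficulty. In the second stage, a Large Language Model (LLM) converts the equation chain and retrieved concepts into a natural-language physics question. Grounding generation in valid equation chains preserves mathematical correctness while promoting linguistic diversity and contextual richness. Human and automated evaluations indicate that ARVRE generates PWPs that are more complex, novel, and solvable than those of existing approaches.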

Link: https://arxiv.org/abs/2606.15591
Authors: Tirthankar Mittra
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments:

Abstract:Generating high-quality Physics Word Problems (PWPs) that are novel, complex, and solvable remains a challenging and underexplored problem in educational content generation. Existing approaches, many adapted from Math Word Problem (MWP) generation, often produce ambiguous, unsolvable, or structurally simple questions with limited linguistic diversity. We introduce ARVRE (Agentic Retrieval Value Reinforced Equation-chain), a two-stage framework for generating diverse and mathematically valid PWPs. In the first stage, a form of offline temporal-difference learning is used to construct valid chains of physics equations, while an agentic retrieval-augmented generation (RAG) framework dynamically selects topic-specific concepts and vocabulary. This design enables explicit control over problem structure and difficulty. In the second stage, a Large Language Model (LLM) converts the equation chain and retrieved concepts into a natural-language physics question. By grounding generation in valid equation chains, our method preserves mathematical correctness while promoting linguistic diversity and contextual richness. Human and automated evaluations demonstrate that ARVRE generates PWPs that are more complex, novel, and solvable than those produced by existing approaches. These results highlight the potential of combining reinforcement learning, retrieval, and LLMs for reliable generation of educational physics content.

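[MA-11] Your Agent Has a Genome: Sequence-Level Behavioral Analysis and Runtime Governance of LLM-Powered Autonomous Agents

[Quick read]: This paper addresses the high task-failure rates, heavy resource consumption, and missing verification steps that arise when LLM-powered autonomous agents run without effective behavioral monitoring and intervention. The challenge is to extract quantifiable behavior patterns from complex execution traces and use them to build runtime interventions. The paper proposes Base Sequence Analysis, which encodes agent runtime behavior as compact symbolic sequences over a four-letter alphabet: X (Explore), E (Execute), P (Plan), and V (Verify). Borrowing from genomic sequence analysis, it applies n-gram pattern mining, Markov transition matrices, and point-biserial correlation to 347 real-world execution traces from a production ReAct agent system. The analysis finds that the trigram P-X-P is the only statistically significant high-risk pattern, lowering success rate by 10.4%; that P-ratio is the strongest negative predictor of success (r = -0.256, p < 0.0001); and that the E-V transition probability is only 2.1%, indicating a systemic verification deficit. Based on these findings, the author designs Governor, a three-layer runtime intervention system comprising a rule engine, a statistical accumulator, and a chi-square-based threshold adaptor. In a natural before/after deployment evaluation (N=101 vs. N=246), Governor achieves a 6.2% absolute increase in task success rate while reducing average token consumption by 44%. Applying the XEPV encoding to 2,000 public SWE-agent trajectories on SWE-bench confirms that exploration spirals and the E-V verification deficit replicate in an independent system. The paper outlines six research directions, including base sequence language models, cross-agent behavioral fingerprinting, and reward shaping, and releases an open-source toolkit for reproducibility.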

Link: https://arxiv.org/abs/2606.15579
Authors: Sidi Deng
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
Comments: 16 pages, 15 figures, 12 tables

Abstract:We propose Base Sequence Analysis, a framework that encodes the runtime behavior of LLM-powered autonomous agents into compact symbolic sequences using a four-letter alphabet: X (Explore), E (Execute), P (Plan), and V (Verify). Drawing an analogy to genomic sequence analysis, we apply n-gram pattern mining, Markov transition matrices, and point-biserial correlation to 347 real-world execution traces collected from a production ReAct agent system over 8 days. Our analysis reveals that (1) the trigram P-X-P is the only statistically significant high-risk pattern, lowering success rate by 10.4%; (2) P-ratio is the strongest negative predictor of success (r=-0.256, p<0.0001); and (3) the E-V transition probability is only 2.1%, indicating a systemic verification deficit. Based on these findings, we design Governor, a three-layer runtime intervention system comprising a rule engine, a statistical accumulator, and a chi-square-based threshold adaptor. In a natural before/after deployment evaluation (N=101 vs. N=246), Governor achieves a +6.2% absolute increase in task success rate while simultaneously reducing average token consumption by 44%. To validate cross-system generality, we apply the XEPV encoding to 2,000 public SWE-agent trajectories on SWE-bench, confirming that exploration spirals and the E-V verification deficit replicate in an independent system. We outline six research directions including base sequence language models, cross-agent behavioral fingerprinting, and reward shaping, and release an open-source toolkit for reproducibility.
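
The two simplest analyses named in the abstract, n-gram counting and first-order Markov transitions over XEPV strings, can be sketched in plain Python. The traces below are invented for illustration, the helper names are hypothetical, and this is not the released toolkit.

```python
from collections import Counter, defaultdict


def ngrams(seq, n):
    """Count all contiguous n-grams in a symbol string."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def transition_matrix(traces):
    """Estimate first-order Markov transition probabilities from symbol strings."""
    counts = defaultdict(Counter)
    for seq in traces:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {
        a: {b: c / sum(row.values()) for b, c in row.items()}
        for a, row in counts.items()
    }


# Made-up traces over the alphabet X (Explore), E (Execute), P (Plan), V (Verify).
traces = ["PXPXE", "PEV", "XEEV"]

trigrams = ngrams(traces[0], 3)     # includes the P-X-P pattern
probs = transition_matrix(traces)   # probs["E"]["V"] is the E-to-V probability
print(trigrams["PXP"], probs["E"].get("V", 0.0))
```

A low value of `probs["E"]["V"]` on real traces would correspond to the verification deficit the abstract describes; here the numbers only reflect the toy input.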

[MA-12] Minimal Oversight: Uncertainty-Aware Governance for Delegated AI Systems

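[Quick read]: This paper addresses uncertainty-aware governance for delegated AI systems: how much autonomy to grant, which evidence should calibrate trust, what performance ceiling a delegated system can sustain, and when human intervention becomes necessary. The core proposal is the Minimum Sufficient Oversight Principle (MSO), a variational principle that minimizes governance burden on the Fisher information manifold subject to a delivery constraint. The resulting Euler-Lagrange solution yields a water-filling allocation of governed delegation across the task space. Building on a revealed-action governed delegation channel model, the paper proves a capacity theorem for stationary symbolwise review policies, derives a local first-order approximation relating workflow complexity to quality degradation, and gives a drift-dominated autonomy-time scaling law linking intervention timing to effective capacity, complexity, and drift. Within this framework, masking appears as a structural AI-governance pathology: corrected performance can hide the competence signal needed to calibrate trust. Synthetic simulations and a semi-real reconstructed workflow support design prescriptions including upstream-first correction, sensitivity-based intervention, and explicit feasibility checks before autonomy is expanded. The result is a computable framework for uncertainty, planning, and oversight in delegated AI systems, with a companion Python package available.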

Link: https://arxiv.org/abs/2606.15563
Authors: Carlos R. B. Azevedo
Affiliations: Independent Researcher, São Paulo, Brazil
Subjects: Artificial Intelligence (cs.AI); Information Theory (cs.IT); Multiagent Systems (cs.MA)
Comments: Companion Python package: pip install minimal-oversight | Code: this https URL | 26 pages, 1 figure, 5 tables

点击查看摘要

Abstract:AI systems increasingly delegate decisions to specialized models, evaluators, tools, and supervisory controllers. The central AI problem is no longer only model accuracy, but uncertainty-aware governance: how much autonomy to grant, which evidence should calibrate trust, what performance ceiling a delegated AI system can sustain, and when human intervention becomes necessary. We propose the Minimum Sufficient Oversight Principle (MSO), a variational principle for principled autonomy delegation: minimize governance burden on the Fisher information manifold subject to a delivery constraint. The resulting Euler-Lagrange solution yields a water-filling allocation of governed delegation across the task space. Building on a revealed-action governed delegation channel model, we prove a capacity theorem for stationary symbolwise review policies, derive a local first-order approximation relating workflow complexity to quality degradation, and give a drift-dominated autonomy-time scaling law linking intervention timing to effective capacity, complexity, and drift. Within this framework, masking appears as a structural AI-governance pathology: corrected performance can hide the competence signal needed to calibrate trust. Synthetic simulations and a semi-real reconstructed workflow support design prescriptions including upstream-first correction, sensitivity-based intervention, and explicit feasibility checks before autonomy is expanded. The result is a computable framework for uncertainty, planning, and oversight in delegated AI systems. A companion Python package is available at this https URL.
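
The abstract does not state the MSO objective, so the sketch below shows only the classical water-filling rule from information theory, as one way to picture what a "water-filling allocation" looks like. Each task `i` has a floor `c_i`, and the allocation is `x_i = max(0, level - c_i)` with the level chosen so the allocations sum to a budget. What plays the role of floor and budget in MSO is an assumption here, and `water_fill` is a hypothetical helper, not part of the `minimal-oversight` package.

```python
def water_fill(floors: list[float], budget: float) -> tuple[float, list[float]]:
    """Classical water-filling: x_i = max(0, level - floor_i), sum(x_i) == budget.

    Returns the water level and the per-task allocation, in input order.
    """
    if not floors or budget < 0:
        raise ValueError("need at least one task and a non-negative budget")
    ordered = sorted(floors)
    level = ordered[0]
    # Raise the level over the k lowest floors until it no longer
    # spills over the next floor.
    for k in range(1, len(ordered) + 1):
        level = (budget + sum(ordered[:k])) / k
        if k == len(ordered) or level <= ordered[k]:
            break
    return level, [max(0.0, level - floor) for floor in floors]


# Hypothetical floors for three tasks and a total budget of 3 units.
level, allocation = water_fill([1.0, 2.0, 5.0], 3.0)
```

In this toy case the task with the highest floor receives nothing, which is the characteristic behavior of water-filling: resources go first to where the floor is lowest.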

[MA-13] Synthetic Counteradaptation: A Principle of Human-AI Co-evolution

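Quick read: This paper examines the increasingly complex interaction dynamics of human-AI systems, in particular the recursive, co-evolutionary character of human-AI interaction in multi-agent environments. Prior work tends to treat humans or AI as static participants and overlooks how the two adapt to each other over long-term interaction. The paper introduces synthetic counteradaptation: when AI systems develop novel strategies or social protocols, humans extract insights from them and adjust their own behavior, which gives rise to new interaction patterns and a two-way feedback loop. The authors illustrate the concept with cases from the game of Go, mixed-motive social interactions, and geopolitical simulations, and offer it as a framework for understanding co-evolving human-AI systems.

Link: https://arxiv.org/abs/2606.15503
Authors: Ivar Frisch, Jackie Kay, Philip Moreira Tomei
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE)
Comments: 15 pages, 1 figure. Published in Antikythera (MIT Press), February 2025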

Abstract:In this paper, we introduce the concept of synthetic counteradaptation, a process where human and AI systems co-evolve by adapting to each other’s strategies and behaviors. Synthetic counteradaptation occurs when AI systems develop novel strategies or social protocols, prompting humans to extract insights and adapt their own behaviors in response, leading to the emergence of new agent interaction dynamics. To illustrate these dynamics, we analyze examples from various contexts, including the game of Go, mixed-motive social interactions, and geopolitical simulations. By exploring these cases, we demonstrate how synthetic counteradaptation provides a framework for understanding the recursive and co-evolutionary nature of human-AI interactions in multi-agent environments.

[MA-14] CoAgent: Concurrency Control for Multi-Agent Systems ATC2026

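Quick read: This paper addresses concurrency control for multi-agent LLM systems. When several agents operate in parallel on shared state (a git tree, a Kubernetes cluster, or a document), classical mechanisms such as two-phase locking (2PL) and optimistic concurrency control (OCC) fit poorly: a single agent transaction spans minutes of inference, read sets are broad and cannot be inferred statically, and writes take effect immediately because the live state can be neither forked nor buffered. Locks block long inference intervals, and OCC discards minutes of completed work on every conflict. The key idea is to build concurrency control on a capability classical transactions lack: the LLM inside each agent can judge whether a conflicting write invalidates its plan and can repair exactly the affected operations. Control therefore becomes advisory: the runtime notifies an affected agent, which re-judges and patches its plan instead of being rolled back, while the framework mechanically undoes and reorders misplaced writes through saga-style inverses that each tool registers in advance. The protocol, MTPO (Monotonic Trajectory Pre-Order), fixes a serialization order at launch, so that at quiescence the run is serializable in that order. MTPO is implemented as CoAgent, a tool-call middleware whose ToolSmith component grows footprint-declared, undoable tools online. On ten contended workloads, CoAgent stays within 5% of serial correctness at a 1.4× speedup and near-serial token cost, outperforming 2PL and OCC; on a bash-only target system, it grows a 25-tool library online and lifts the task pass rate from 45/71 to 63/71 at 0.80× the time and 0.86× the cost.

Link: https://arxiv.org/abs/2606.15376
Authors: Hongtao Lyu, Dingyan Zhang, Mingyu Wu, Xingda Wei, Haibo Chen
Affiliations: Shanghai Jiao Tong University
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 14 pages, 7 figures. Submitted to ATC 2026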

Abstract:Multi-agent LLM systems – coding agents, devops agents, document agents – now routinely run several agents in parallel against the same git tree, Kubernetes cluster, or document. As soon as two of them mutate shared state, they enter the regime classical concurrency control has studied for decades, but classical mechanisms fit LLM agents poorly. A single agent transaction spans minutes of inference, read sets are broad and opaque rather than statically inferable, and the live state agents act on admits neither fork nor buffer, so writes take effect the moment they execute. Locks block long inference intervals; OCC abort-and-retry discards minutes of work on every conflict. This paper builds concurrency control on a capability classical transactions lack: the LLM inside each agent can judge whether a conflicting write invalidates its plan, and can repair exactly the operations that depended on it. Control therefore turns advisory: the runtime informs, the agent repairs. Our protocol, MTPO (Monotonic Trajectory Pre-Order), fixes a serialization order at launch, serves each read the order-filtered value, and applies writes speculatively in place; a one-way notification asks an affected reader to re-judge and patch its plan, while the framework mechanically undoes and reorders misplaced writes through the saga-style inverse each tool registers in advance. At quiescence the run is serializable in the pre-decided order. We realize MTPO as CoAgent, toolcall middleware whose privileged ToolSmith grows footprint-declared, undoable tools online. On ten contended workloads, CoAgent stays within 5% of serial correctness at a 1.4× speedup and near-serial token cost, where 2PL and OCC surrender nearly all concurrency gains; on a bash-only target system, it grows a 25-tool library online and lifts the task pass rate from 45/71 to 63/71 at 0.80× the time and 0.86× the cost.
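
The "saga-style inverse each tool registers in advance" can be pictured with a small sketch: every write is applied in place and logged together with what is needed to compensate it, so misplaced writes can later be undone in reverse order. This covers only the undo part, not MTPO's order-filtered reads, notifications, or reordering. All names here (`UndoableTool`, `WriteLog`, `SET_KEY`) are hypothetical and are not CoAgent's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

State = dict[str, str]


@dataclass
class UndoableTool:
    """A tool paired with the inverse it registers in advance."""
    name: str
    apply: Callable[[State, str, str], None]
    inverse: Callable[[State, str, Optional[str]], None]


def _set(state: State, key: str, value: str) -> None:
    state[key] = value


def _restore(state: State, key: str, previous: Optional[str]) -> None:
    # Compensate a write: put back the old value, or remove a new key.
    if previous is None:
        state.pop(key, None)
    else:
        state[key] = previous


SET_KEY = UndoableTool("set_key", _set, _restore)


@dataclass
class WriteLog:
    """Applies writes in place and remembers how to compensate them."""
    state: State
    entries: list[tuple[UndoableTool, str, Optional[str]]] = field(default_factory=list)

    def execute(self, tool: UndoableTool, key: str, value: str) -> None:
        previous = self.state.get(key)
        tool.apply(self.state, key, value)  # takes effect immediately
        self.entries.append((tool, key, previous))

    def undo_last(self, count: int) -> None:
        """Undo the most recent writes, newest first."""
        for _ in range(min(count, len(self.entries))):
            tool, key, previous = self.entries.pop()
            tool.inverse(self.state, key, previous)


# Hypothetical shared state touched by two speculative writes.
log = WriteLog({"replicas": "2"})
log.execute(SET_KEY, "replicas", "5")
log.execute(SET_KEY, "image", "v2")
log.undo_last(2)
```

Undoing newest-first matters because a later write may have captured the value produced by an earlier one as its "previous" state.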

[MA-15] Resilient Consensus in Agentic AI

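Quick read: This paper asks whether LLM-based agents in a multi-agent system can reach reliable consensus when some agents may behave maliciously. Classical resilient consensus theory was developed for deterministic agents, and it is unclear whether it transfers to non-deterministic LLM agents that may act adversarially. The authors frame agreement among LLM agents as a Byzantine consensus game and run controlled experiments on complete and general communication graphs. They find that prompted LLM agents fail to reach agreement that is achievable in principle: consensus can fail even where classical theory guarantees that a convergent algorithm exists, and the failure persists across temperatures and decision horizons. Wrapping the agents with classical resilient consensus filters improves agreement, and the size of the benefit depends on how much robustness the underlying topology already provides. The results suggest that classical resilient consensus theory is a useful lens for the safety of agentic AI.

Link: https://arxiv.org/abs/2606.15024
Authors: Sribalaji C. Anand, George J. Pappas
Affiliations: KTH; University of Pennsylvania
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: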

Abstract:Large language model (LLM) agents are increasingly deployed in multi-agent systems where they must coordinate and agree on shared decisions. We ask whether classical resilient consensus theory, developed for deterministic agents, transfers to LLM agents that may behave adversarially. Framing LLM agreement as a Byzantine consensus game, we run controlled experiments on complete and general communication graphs. We find that prompted LLM agents fail to reach agreement that is achievable in principle: consensus can fail even in settings where classical theory guarantees that a convergent algorithm exists, and this failure persists across temperatures and horizons. At the same time, wrapping the agents with classical resilient consensus filters improves agreement. The benefit of filtering depends on how much robustness the underlying topology already provides. Our results suggest that classical resilient consensus theory is a useful lens for the safety of agentic AI.
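
The abstract does not say which resilient consensus filter wraps the agents. As one well-known classical example, the sketch below implements a single step of a W-MSR-style trimmed-mean update with uniform weights: a node discards up to `f` neighbor values above its own and up to `f` below its own before averaging. Treating an agent's proposal as a scalar, and the function name `wmsr_update`, are assumptions for illustration.

```python
def wmsr_update(own: float, neighbor_values: list[float], f: int) -> float:
    """One W-MSR-style step with uniform weights.

    Drops the f largest neighbor values strictly above `own` and the
    f smallest strictly below `own` (or all of them if fewer than f exist),
    then averages what is left together with `own`.
    """
    higher = sorted(v for v in neighbor_values if v > own)
    lower = sorted(v for v in neighbor_values if v < own)
    equal = [v for v in neighbor_values if v == own]
    kept_higher = higher[:max(0, len(higher) - f)]  # drop the f largest
    kept_lower = lower[f:]                          # drop the f smallest
    kept = kept_lower + equal + kept_higher + [own]
    return sum(kept) / len(kept)


# Hypothetical round: one neighbor reports an extreme value (100.0).
filtered = wmsr_update(5.0, [4.0, 6.0, 100.0, 5.5], f=1)
```

The extreme report is discarded before averaging, so a single adversarial neighbor cannot pull the node's value outside the range spanned by the remaining values.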

[MA-16] Hierarchical Generative Agents for Simulating Sequential Human Behavior

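Quick read: This paper targets unrealistic human-behavior modeling in disaster evacuation simulation. Existing models mostly assume rational, homogeneous individuals and so miss the cognitive, emotional, and social processes that shape real evacuations; because real human data is hard to collect, their predictions tend to be overly optimistic. The proposed solution is a cognitively grounded, persona-driven generative simulation framework in which large language models (LLMs) are coupled with a cognitive module. Cognition is structured in three levels (high-level evacuation goals, mid-level route reasoning, and low-level navigation), and the model is calibrated with empirical human evacuation data. The simulation is dynamic and stimulus-driven: hazards such as fire evolve over time in a grid-based urban environment, and agents make sequential, context-sensitive decisions in response, with the aim of more realistic evacuation behavior.

Link: https://arxiv.org/abs/2606.14989
Authors: Maria G. Mendoza, Lucas Waldburger, Jin Lee, Shankar Sastry
Affiliations: University of California, Berkeley
Categories: Multiagent Systems (cs.MA)
Comments: 20 pages, 6 figures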

Abstract:Complex cognitive, emotional, and social processes shape human evacuations during natural disasters. Accurate modeling and understanding of human behavior in disasters or emergencies can greatly impact the evacuation process by informing more effective planning and resource allocation. However, collecting human data in these situations is very difficult, and existing computational evacuation models assume rational, homogeneous behavior, leading to unrealistic, overly optimistic predictions. To address this gap, we present a simulation framework of sequential human decision-making during an evacuation scenario, introducing cognitively grounded, persona-driven agents. Our framework models evacuation behavior in a grid-based urban environment that evolves over time, capturing fire and other hazards. Human agents are modeled as personas that make sequential decisions in response to environmental stimuli with cognition structured in three levels: high-level evacuation goals, mid-level route reasoning, and low-level navigation. Decision-making is driven by large language models (LLMs) coupled with a cognitive module and calibrated with empirical human evacuation data. We propose a dynamic, stimulus-driven disaster simulation framework that models human evacuation decision-making using persona-conditioned LLM agents and a cognitive hierarchy.

[MA-17] Trust Between AI Agents: Measuring Formation, Breakage, and Recovery with Implications for Governing Multi-Agent Systems

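Quick read: This paper addresses the lack of a standard way to measure trust between AI agents in multi-agent systems. As language-model agents increasingly work in teams, each agent must decide how much to trust its teammates. The author proposes a behavioral measure based on costly verification: in a cooperative survival game, checking a teammate's output consumes resources, while trusting a wrong answer can be fatal. Reduced verification, relative to a memoryless version of the same model, serves as the observable measure of trust. When paired with a consistently reliable teammate, four frontier models (Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.1, and Gemini 3.1 Pro) reduce verification by roughly 60-85%, whereas two smaller models show little or no adjustment. Failures reverse this discount, but responses differ: some models concentrate renewed scrutiny on the culprit, while others become more cautious toward the whole team. Recovery is slower than formation, and clustered failures sustain suspicion far longer than the same number of failures spread apart. These differences matter in practice: models that form trust verify less, decide faster, and achieve higher payoffs in the environment, whereas persistent over-verification is associated with indecision rather than safety. The results show that trust dispositions can be measured before deployment and suggest that governance of multi-agent systems should focus on calibration rather than maximal suspicion.

Link: https://arxiv.org/abs/2606.14923
Authors: Yujiao Chen
Affiliations: Massachusetts Institute of Technology
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Comments: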

Abstract:As language-model agents increasingly work in teams, each agent must decide how much to trust its teammates. Yet we lack a standard way to measure trust between AI agents. We propose a behavioral measure based on costly verification. In a cooperative survival game, checking a teammate’s work consumes resources, while trusting a wrong answer can be fatal. Relative to a memoryless version of the same model, reduced verification provides an observable measure of trust. Using this framework, we study trust formation, breakage, and recovery across six frontier model snapshots. When paired with a consistently reliable teammate, four snapshots (Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.1, and Gemini 3.1 Pro) reduce verification by roughly 60-85%, whereas two smaller snapshots show little or no such adjustment. Failures reverse this discount, but models differ in how they respond. Some concentrate renewed scrutiny on the culprit, while others become more cautious toward the entire team. Recovery is slower than formation, and clustered failures sustain suspicion far longer than the same number of failures spread apart. These differences have practical consequences. Models that form trust verify less, decide more quickly, and achieve higher payoffs in our environment. By contrast, persistent over-verification is associated with indecision rather than safety. Our results show that trust dispositions can be measured before deployment and suggest that calibration, rather than maximal suspicion, should be the central concern in the governance of multi-agent AI systems.

[MA-18] Selective Control under Noisy Perception: Governance Failures Hidden by Aggregate Metrics in Modular Networks

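Quick read: This paper shows that a content-moderation system can score well on accuracy metrics and still cause real harm when its errors fall on the "bridge users" who connect otherwise separate communities. Aggregate accuracy metrics cannot reveal this uneven damage, so governance failures go unnoticed. The key contribution is a governance loss (L_gov) that prices the two kinds of mistake separately from the cost of enforcement: useful content that is wrongly suppressed (false positives) and dangerous content that is wrongly spared (false negatives). In an agent-based model, this loss more than doubles under false-positive-heavy noise. A user's degree is a near-perfect proxy for the betweenness that defines a bridge (r=0.96), which makes degree a cheap quantity to audit when identifying the bridge users at risk.

Link: https://arxiv.org/abs/2606.14819
Authors: Igor Itkin
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Physics and Society (physics.soc-ph)
Comments: 39 pages, 7 figures. Code and data: this https URL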

Abstract:A content-moderation system can score well on every standard accuracy metric and still cause real harm, if its mistakes fall on the few users who connect otherwise separate communities. We show this in an agent-based model where N=240 learning agents on a community-structured network each post harmless, productive, or dangerous content, and a regulator removes or penalizes whatever a noisy classifier flags. Overall usefulness barely moves as the noise changes (one-way ANOVA, p=0.96): by aggregate measures, nothing looks wrong. The damage instead concentrates on these bridge users, whose useful posts are wrongly suppressed and whose dangerous posts are wrongly spared. A governance loss (L_gov) that prices these two mistakes separately from the cost of enforcement more than doubles under false-positive-heavy noise. Aggregate accuracy hides who is harmed, and the cheap quantity to audit is how many connections a user has (degree), a near-perfect proxy for the betweenness that defines a bridge (r=0.96).

[MA-19] Obligation-Producing Actions

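Quick read: This paper addresses the frame problem for obligation-producing actions in the Situation Calculus. These are actions that create obligations for the agent that performs them; for example, opening a door creates the subsequent obligation to get it closed. Reiter's solution to the frame problem for ordinary actions does not directly handle the effect of such actions on the accessibility relation used in the semantics of obligation modalities. The paper proposes a simpler solution that stays close to Kripke-style possible-world semantics for deontic logic by eliminating Demolombe's notion of ideality of situations. It extends Reiter's basic action theories to the new setting and extends Reiter's regression operator, which reasons about actions back to the initial situation, accordingly. The solution yields the intuitive persistence property: an obligation remains in force in subsequent situations unless it is explicitly stopped.

Link: https://arxiv.org/abs/2606.14810
Authors: Kalonji Kalala, Iluju Kiringa, Tet Yeap
Affiliations: Unknown
Categories: Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
Comments: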

Abstract:This paper proposes a Situation Calculus solution to the frame problem for obligation-producing actions, which are actions that create obligations on the part of the agent that performs them. As an example of such actions, we have an opening door action performed by an agent, which has the subsequent obligation of getting the door closed. Demolombe and others extend Raymond Reiter’s solution to the frame problem for ordinary actions to accommodate obligation-producing actions. Obligation-producing actions do affect the truth value of a newly introduced fluent that captures the accessibility relation used in semantics of obligation modalities in the Situation Calculus. Our work simplifies Demolombe’s characterization of the accessibility relation by eliminating the notion of ideality of situations, thereby remaining close to Kripke-style possible-world semantics for deontic logic, in the spirit of Governatori’s approach. Furthermore, we spell out details of a complete solution by extending basic action theories of Reiter to the new setting. Finally, we extend Reiter’s regression operator for reasoning about actions back to the initial situation to this new setting. Our solution yields intuitive properties that one would expect from obligations: for example, if a sentence is obligatory to an agent in a given situation, it remains so in subsequent situations unless the obligation is explicitly stopped.

[MA-20] XMedFusion: A Knowledge-Guided Multimodal Perception and Reasoning Framework for Autonomous Medical Systems ICRA

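Quick read: This paper addresses weak visual grounding in end-to-end multimodal models for radiology report generation: such models struggle to capture subtle clinical findings, which makes reports unreliable and prone to omissions. The proposed solution is XMedFusion, a modular AI framework built on structured multi-agent perception and reasoning. It decomposes visual information into coordinated functional components: a visual perception agent that extracts image-grounded evidence, a knowledge graph construction agent that structures clinically relevant findings, and a retrieval-guided drafting process that keeps the report structure consistent. A synthesis agent then iteratively integrates visual and structured evidence through reasoning-driven verification to produce interpretable and reliable diagnostic reports. On a public chest radiograph dataset, XMedFusion improves over baseline vision-language models from 0.0493 to 0.3359 in BLEU-1, from 0.0863 to 0.2440 in ROUGE-L, and from 0.0829 to 0.1708 in METEOR, and also improves the semantic evaluation metrics Consistency (2.38 to 7.80) and Accuracy (2.34 to 6.93).

Link: https://arxiv.org/abs/2606.14766
Authors: Hamza Riaz, Arham Haroon, Maha Baig, Muhammad Dawood Rizwan, Muhammad Naseer Bajwa, Muhammad Moazam Fraz
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Accepted at the 2026 International Conference on Robotics and Automation in Industry (ICRAI)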

Abstract:Autonomous medical and robotic systems increasingly rely on intelligent perception and reasoning capabilities to interpret visual data and support clinical decision making. Radiology report generation represents a critical component of such automated diagnostic workflows, yet existing end-to-end multimodal models often suffer from weak visual grounding, resulting in unreliable interpretations and omission of subtle clinical findings. This paper presents XMedFusion, a modular AI framework designed as an intelligent perception and reasoning module for autonomous medical systems. The proposed framework decomposes visual information into coordinated functional components that emulate expert-driven analysis, including a visual perception agent that extracts image-grounded evidence, a knowledge graph construction agent that structures clinically relevant findings, and a retrieval-guided drafting process that ensures a consistent reporting structure. A synthesis agent iteratively integrates visual and structured evidence through reasoning-driven verification to produce reliable and interpretable diagnostic outputs. Experimental evaluation on a public chest radiograph dataset demonstrates significant improvements over baseline vision-language models, achieving gains from 0.0493 to 0.3359 in BLEU-1, 0.0863 to 0.2440 in ROUGE-L, and 0.0829 to 0.1708 in METEOR, along with substantial improvements in semantic evaluation metrics such as Consistency (2.38 to 7.80) and Accuracy (2.34 to 6.93). The results highlight the effectiveness of structured multi-agent perception and reasoning for enhancing robustness, transparency, and automation in intelligent medical imaging systems, enabling integration into autonomous healthcare and robotic diagnostic workflows.

[MA-21] MiroBench: Benchmarking Realism in Agentic Simulation of Real-world Discussions

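Quick read: This paper asks whether the behavior generated by LLM agents that simulate real-world social interaction preserves the content patterns and interaction dynamics of real human behavior. Existing evaluations are fragmented, which makes it hard to compare systems or measure progress. Focusing on Reddit discussions, the authors build MiroBench, a benchmark for simulating online community discussion based on 4,292 real Reddit threads. It uses statistical tests to compare the distributions of generated and real discussions along four aspects: repetition and semantic uniformity, narrative content, toxicity and aggression, and structural complexity. Experiments show that current models remain distributionally mismatched with real discussions, and that a lightweight prompt-based improvement procedure yields only limited gains. MiroBench provides a benchmark for measuring, diagnosing, and improving realism in LLM-based social simulation.

Link: https://arxiv.org/abs/2606.14715
Authors: Yaoning Yu, Ye Yu, Haojing Luo, Haohan Wang
Affiliations: University of Illinois at Urbana-Champaign; Starc.Institute
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: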

Abstract:LLM agents are increasingly used to simulate real world interactions, but it remains unclear whether simulated behaviors preserve the content patterns and interaction dynamics of real human behaviors. Existing evaluations remain fragmented, which makes it difficult to compare systems or measure progress. In this paper, we focus on Reddit discussions as a concrete first step toward evaluating real-world social simulation. Reddit threads provide public, topic-grounded, multi-party interactions where people share experiences, debate, seek advice, express emotion, and collectively respond to products, events, and social issues. These discussions offer an observable window into broader social behavior, making them a useful setting for testing whether LLM agents can reproduce not only fluent text, but also the distributional patterns and interaction dynamics of real online communities. We introduce MiroBench, a benchmark for Reddit discussion simulation built from 4,292 real Reddit threads. MiroBench uses statistical tests to compare generated and real discussions across four major aspects: repetition and semantic uniformity, narrative content, toxicity and aggression, and structural complexity. Experiments across five domains and five models show that current simulators remain distributionally mismatched with real Reddit threads, while a lightweight prompt-based improvement procedure provides only limited gains. MiroBench offers a concrete benchmark for measuring, diagnosing, and improving realism in LLM-based social simulation.

[MA-22] Evaluation of Alternative-Based Information Systems for Deliberative Polling using an Agentic Simulator

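Quick read: This paper addresses the coverage problem in deliberative polling: ensuring that every participant encounters arguments representative of the whole reason space, which is especially hard at scale and in adversarial or strategically motivated electorates. The authors evaluate solutions with an LLM-based Agentic Bipolar Argumentation Simulator (ABAS), grounded in a framework that formalises a poll as a six-tuple (J_end, J_opp, R_att, R_enh, V_A, V_R) of endorsing and opposing justifications, attack and enhance relations, and shareholder and relation weights. ABAS simulates N autonomous shareholder agents with latent opinions distributed over [-1, 1] who vote, choose or author justifications, and optionally submit argumentation-graph links; recommendations rank existing justifications by observable endorsement mass and present the top K. The evaluation metric is coverage: the fraction of the corpus reason-tag set represented in the K recommendations shown to each shareholder, treated as a solution to the NP-hard Subsuming Justification Problem. Experiments characterise how creativity rate (p_own), recommendation size (K), argumentation density (p_links), and population size (N) affect coverage and corpus diversity. In an authenticated electorate where Sybil attacks are impossible and only the relation graph is gameable, a tag-flood attack collapses coverage, while author-count relation weighting through a reversed-PageRank rule resists the flood markedly better than uniform weights.

Link: https://arxiv.org/abs/2606.11692
Authors: Rwaida Alssadi, Khulud Alawaji, Balaji Kasula, Muntaser Syed, Badria Alfurhood, Markus Zanker, Marius Silaghi
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
Comments: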

Abstract:Deliberative polling promises to improve collective decision-making by exposing shareholders to a broad range of arguments before they vote. Yet ensuring that every voter encounters a representative sample of the reason space, the coverage problem, remains an open challenge, particularly at scale and in adversarial or strategically motivated electorates. This paper introduces a way of evaluating solutions using the LLM-based Agentic Bipolar Argumentation Simulator, grounded in a framework which formalises a poll as a six-tuple (J_end, J_opp, R_att, R_enh, V_A, V_R) of endorsing and opposing justifications, attack and enhance relations, and shareholder- and relation-weights. ABAS simulates N autonomous shareholder agents, each assigned a latent opinion according to desired distributions in [-1, 1], who sequentially vote, choose or author justifications, and optionally submit argumentation-graph links. The simulator implements recommendations that rank existing justifications by their observable endorsement mass. It evaluates the mechanism’s success by coverage, namely the fraction of the corpus reason-tag set represented in the K recommendations presented to each shareholder, as a solution to the NP-hard Subsuming Justification Problem. Reported experiments characterise how creativity rate (p_own), recommendation size (K), argumentation density (p_links), and population size (N) affect coverage and corpus diversity. In an authenticated electorate where Sybil attacks are impossible and only the relation graph is gameable, we stress-test the scoring with coordinated strategic voting attacks: a tag-flood attack collapses coverage, while author-count relation weighting through a reversed-PageRank rule resists the flood markedly better than uniform weights.
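
The coverage metric defined in this abstract (the fraction of the corpus reason-tag set represented in the K recommendations) can be sketched in a few lines, together with a ranking by endorsement mass. The data layout, the tie-breaking rule, and the names `coverage` and `top_k_by_endorsement` are assumptions for illustration; ABAS itself may represent justifications differently.

```python
def top_k_by_endorsement(endorsement_mass: dict[str, float], k: int) -> list[str]:
    """Rank justification ids by endorsement mass (ties broken by id)."""
    ranked = sorted(endorsement_mass, key=lambda j: (-endorsement_mass[j], j))
    return ranked[:k]


def coverage(recommended_tags: list[set[str]], corpus_tags: set[str]) -> float:
    """Fraction of the corpus reason-tag set present in the recommendations."""
    if not corpus_tags:
        return 0.0
    covered = set().union(*recommended_tags) & corpus_tags
    return len(covered) / len(corpus_tags)


# Hypothetical corpus of four justifications with reason tags.
tags = {
    "j1": {"cost", "safety"},
    "j2": {"cost"},
    "j3": {"fairness"},
    "j4": {"privacy", "safety"},
}
mass = {"j1": 5.0, "j2": 4.0, "j3": 1.0, "j4": 2.0}
corpus = set().union(*tags.values())

recommended = top_k_by_endorsement(mass, k=2)
score = coverage([tags[j] for j in recommended], corpus)
```

In this toy corpus the two most endorsed justifications share a tag, so they cover only half of the reason space. That is the kind of gap the coverage metric is meant to expose.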

[MA-23] Physics of anticipatory active matter with application to crowd dynamics

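Quick read: This paper addresses a limitation of traditional statistical physics in describing living beings that anticipate (predators chasing prey, pedestrians, or navigating robots): existing models are reactive, based only on present and past configurations, and handle forward-looking decisions poorly. The solution is a statistical-physics framework for anticipatory agents that couples an agent's present-time dynamics to the future system state it anticipates, through a cost function constructed from observations. The key step maps the dynamics of an anticipatory agent in d dimensions onto those of a non-anticipatory chain in d+1 dimensions, with fluctuations acting transversely on the chain to represent uncertainty about the future state. Tools from polymer physics characterize the dynamics of these chains and delineate an anticipation horizon beyond which the blurry future can be handled in a mean-field way. Applied to pedestrian dynamics, the framework integrates operational and tactical levels seamlessly in an agent-based model; even with a minimal expression of the cost, it reproduces experimental scenarios that are challenging for state-of-the-art models, such as crossing cluttered environments or alighting from a crowded train. The transparent and flexible basis of the model makes it straightforward to incorporate further mechanisms.

Link: https://arxiv.org/abs/2606.14818
Authors: Alexis Raulin-Foissac (ILM, UCBL), Alexandre Nicolas (ILM, CNRS)
Affiliations: Unknown
Categories: Physics and Society (physics.soc-ph); Statistical Mechanics (cond-mat.stat-mech); Multiagent Systems (cs.MA)
Comments: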

Abstract:Statistical Physics has traditionally dealt with entities that interact merely based on the present, and possibly past, configurations. This reactive framework is inefficient in many situations involving living beings, such as predators chasing a prey, pedestrians, or even robots. This paper introduces a statistical physical framework for the dynamics of anticipatory agents, whose present-time dynamics depend on the prospective system state that they anticipate. We clarify how these dynamics can be expressed in terms of a cost function constructed based on observations and we show that the dynamics of an anticipatory agent in d dimensions can be mapped onto the dynamics of a (non-anticipatory) chain in d + 1 dimensions, with fluctuations acting transversely on the chain to account for the uncertainty about the future state. Insights from polymer Physics help us characterize the dynamics of these chains and delineate an anticipation horizon beyond which the blurry future can be handled in a mean-field way. The foregoing framework is successfully applied to pedestrian dynamics, leading to a seamless integration of operational and tactical levels in an agent-based model. Even with a minimal expression of the cost, the model succeeds in reproducing various experimental scenarios which are challenging for state-of-the-art models, such as crossing cluttered environments or alighting from a crowded train. The transparent and flexible basis of the model allows the straightforward incorporation of additional mechanisms.

Natural Language Processing

[NLP-0] The Value Axis: Language Models Encode Whether They're on the Right Track

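Quick read: This paper asks whether language models internally represent the value of their current trajectory, that is, the expected likelihood that the ongoing strategy will achieve their goals. Using synthetic in-context reinforcement learning data, the authors construct a "value axis" that quantifies and manipulates the model's internal assessment of its current reasoning path. Activations along this axis distinguish high from low verbalized confidence, rollouts without and with backtracking, and correct from corrupted code. Direct preference optimization (DPO) can raise the internal value of rewarded behaviors, making the model act more confidently after exhibiting them. Applied to in-the-wild settings, the axis shows that the model assigns low value to politically sensitive queries after post-training and that supervised fine-tuning increases internal confidence within the training domain. The results suggest that language models linearly encode an estimate of expected goal success, which modulates the trade-off between exploration and self-correction and their style of expression.

Link: https://arxiv.org/abs/2606.17056
Authors: Nick Jiang, Isaac Kauvar, Jack Lindsey
Affiliations: Stanford University; Anthropic
Categories: Computation and Language (cs.CL)
Comments: Code repository: this https URL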

Abstract:We investigate whether language models internally track the value of their current trajectory, defined as the likelihood that their ongoing strategy will achieve their goals. Using synthetic, in-context reinforcement learning data, we construct a “value” axis for Qwen3-8B. We find that activations along this axis distinguish between high vs. low verbalized confidence, rollouts without and with backtracking, and correct vs. corrupted code. Steering towards high value causally suppresses self-correction and reduces explanatory verbosity, while steering towards low value induces backtracking and exploration. We demonstrate that direct preference optimization (DPO) can increase the internal value of rewarded behaviors (e.g. use a certain word), causing the model to act more confidently after exhibiting them. Finally, we apply the value axis to study in-the-wild settings. For example, we find that Qwen assigns low value to politically sensitive chat queries after post-training and that supervised fine-tuning increases internal confidence within the training domain. Our results suggest that language models linearly encode an estimate of expected goal success that modulates their confidence in pursuing a direction.

[NLP-1] Context-Aware RL for Agentic and Multimodal LLMs

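Quick read: This paper addresses the difficulty large language models (LLMs) have in identifying a small but decisive piece of evidence in long contexts or complex multimodal inputs, such as a single line in a tool trace or a subtle detail in an image. The proposed solution is ContextRL, a context-aware reinforcement learning method that improves long-horizon reasoning and multimodal understanding through an indirect auxiliary objective. Rather than supervising only the final answer, ContextRL gives the model a query, an answer, and two highly similar contexts, and rewards it for selecting the context that supports the query-answer pair, which encourages fine-grained grounding. Contrastive context data is built in two domains: 1K pairs from coding-agent trajectories, and 7K image pairs for multimodal reasoning via generative editing and similarity search. ContextRL gains an average of 2.2% over standard GRPO on 5 long-horizon benchmarks and 1.8% across 12 diverse visual question answering benchmarks. Data-augmentation baselines that reuse the same contrastive contexts as standard query-context-answer examples provide little to no improvement, which indicates that the gains come from the context-selection objective rather than from the additional data.

Link: https://arxiv.org/abs/2606.17053
Authors: Peiyang Xu, Bangzheng Li, Sijia Liu, Karthik R. Narasimhan, Pramod Viswanath, Prateek Mittal, Xingyu Fu
Affiliations: Princeton University; UC Davis
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 29 pages, 9 figures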

Abstract:Large language models (LLMs) often fail when answering requires identifying a small but decisive piece of evidence within a long or complex context, such as a single line in a tool trace or a subtle detail in an image. We propose ContextRL, a context-aware reinforcement learning (RL) method that improves long-horizon reasoning and multimodal performance through an indirect auxiliary objective. Instead of supervising only the final answer, ContextRL presents the model with a query, an answer, and two highly similar contexts, and rewards it for selecting the context that supports the query–answer pair, thereby encouraging fine-grained grounding. We construct contrastive context data in two domains: for coding agents, trajectories serve as contexts, yielding 1K pairs built via condition filtering; for multimodal reasoning, images serve as contexts, yielding 7K pairs built via generative editing and similarity search. ContextRL achieves average gains of +2.2% over standard GRPO on 5 long-horizon benchmarks, and +1.8% across 12 diverse visual question answering benchmarks. To disentangle the effect of the proposed objective from that of additional data, we compare against data-augmentation baselines that repurpose the same contrastive contexts as standard query–context–answer examples. These baselines provide little to no improvement, showing that the gains arise from the proposed context-selection objective rather than from the contrastive data alone.
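
A minimal sketch of the context-selection objective described in this abstract: the model sees a query, an answer, and two similar contexts, and is rewarded for picking the supporting one. The binary reward values, the `ContrastiveExample` layout, and the GRPO-style group normalization below are assumptions for illustration; the abstract does not give the exact reward or how it is combined with the main task reward.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class ContrastiveExample:
    query: str
    answer: str
    contexts: tuple[str, str]   # two highly similar contexts
    supporting_index: int       # 0 or 1: the context that supports the pair


def selection_reward(example: ContrastiveExample, chosen_index: int) -> float:
    """Assumed binary reward: 1.0 for choosing the supporting context."""
    return 1.0 if chosen_index == example.supporting_index else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style normalization of rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical example: two tool traces that differ in a single line.
example = ContrastiveExample(
    query="Which test failed?",
    answer="test_parse_date",
    contexts=("... FAILED test_parse_date ...", "... FAILED test_parse_time ..."),
    supporting_index=0,
)
sampled_choices = [0, 1, 0, 0]  # four sampled model selections
rewards = [selection_reward(example, c) for c in sampled_choices]
advantages = group_relative_advantages(rewards)
```

Samples that pick the supporting context receive a positive advantage and the one that picks the distractor receives a negative one, which is what pushes the policy toward fine-grained grounding.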

[NLP-2] KVEraser: Learning to Steer KV Cache for Efficient Localized Context Erasing ICML2026

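[Quick read]: This paper addresses the computational cost of post-hoc context erasing in long-context LLMs. The core problem: if content that has already been processed (stale information, incorrect tool observations, or harmful prompt injections) is found to need removal after prefill, exact erasing must recompute every subsequent token, so the cost grows with suffix length rather than with the length of the erased span. The key to the solution is KVEraser, a learned KV-cache editing method that replaces only the key-value (KV) states of the erased interval with learned, transferable steering states and reuses the rest of the cache unchanged. It is trained in two stages: generic span-neighbor pre-training teaches the eraser to suppress the influence of the erased span, and task-specific fine-tuning adapts it to downstream scenarios. On in-domain tasks with 1K–32K contexts, KVEraser nearly matches full recomputation in post-erasure performance while its latency increases by only 24%, compared with a 17.6x increase for full recomputation. On unseen long-document QA tasks with harmful factual distractors, it achieves the best performance among approximate baselines with a 3–4x speedup over full recomputation.

Link: https://arxiv.org/abs/2606.17034
Authors: Mufei Li, Shikun Liu, Dongqi Fu, Haoyu Wang, Yinglong Xia, Hong Li, Hong Yan, Pan Li
Affiliations: Georgia Institute of Technology
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Oral at the ICML 2026 Workshop on the Impact of Memorization on Trustworthy Foundation Models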

Abstract:Post-hoc context erasing over the KV cache is challenging because a local edit has a global consequence: once a span has been processed, its influence propagates into the cached states of all subsequent tokens. This issue arises naturally in long-context LLM applications, where stale retrieved facts, incorrect tool observations, retracted user preferences, or harmful prompt injections may be identified only after prefill. Exact erasing must then recompute all tokens after the deleted span, making its computational cost depend on suffix length rather than erased-span length. We introduce KVEraser, a learned KV-cache editing method for efficient localized context erasing. Given a processed context and a span to remove, KVEraser replaces only the KV states of the erased interval with learned steering states while reusing the remaining cache unchanged. To learn a transferable erasing mechanism, we build a two-stage training pipeline: generic span-neighbor pre-training teaches the eraser to suppress the influence of the erased span, while task-specific fine-tuning adapts this capability to downstream scenarios. Experiments show that KVEraser nearly matches full recomputation in post-erasure performance on in-domain tasks across 1K–32K context lengths, while its latency increases by only 24% compared with a 17.6x increase for full recomputation. KVEraser also generalizes to unseen long-document QA tasks with harmful factual distractors, achieving the best performance among approximate baselines with a 3–4x speedup over full recomputation.
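The cache edit described above (overwrite the erased interval, reuse everything else) can be sketched on a toy single-layer cache. This is only an illustration of the data flow: the function name is hypothetical, the steering states are passed in as plain arrays instead of being produced by a learned eraser, and the assumption that they have the same length as the erased span is ours, not something the abstract states.

```python
import numpy as np


def replace_span_kv(keys, values, start, end, steer_k, steer_v):
    """Return a new (keys, values) cache whose rows [start, end) are replaced.

    keys, values: arrays of shape (seq_len, head_dim) for one layer and head.
    steer_k, steer_v: stand-ins for learned steering states, shape (end - start, head_dim).
    """
    if steer_k.shape != keys[start:end].shape or steer_v.shape != values[start:end].shape:
        raise ValueError("steering states must match the erased span's shape")
    new_keys, new_values = keys.copy(), values.copy()
    new_keys[start:end] = steer_k      # only the erased interval changes
    new_values[start:end] = steer_v
    return new_keys, new_values        # prefix and suffix rows are reused as-is
```

The cost of this edit depends on the span length `end - start`, whereas exact erasing would recompute every row from `start` to the end of the sequence.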

[NLP-3] DEEPRUBRIC: Evidence-Tree Rubric Supervision for Efficient Reinforcement Learning of Deep Research Agents

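[Quick read]: This paper addresses the inefficiency that arises when deep research agents are trained with rubric-based reinforcement learning (RL) on low-quality rubrics. Existing methods ask an LLM to generate rubrics for a given query; when the model fails to infer the underlying information needs, the rubrics are incomplete and the reward signal weakens. To improve query–rubric alignment and reliability, the paper proposes DeepRubric, a data construction framework that reverses the process: rather than deriving rubrics from a query, it first determines what an evidence-backed report should be evaluated on and then synthesizes aligned query–rubric pairs from those targets. Starting from a sampled seed topic, it recursively expands evidence-backed sub-questions into an evidence tree whose leaves serve as atomic, verifiable evaluation targets, so that the reward evaluates exactly the information the query requests. With this framework the authors build 9K query–rubric supervision examples and train DeepRubric-8B with rubric-based GRPO, reaching performance comparable to prior open state-of-the-art deep research models on three benchmarks with roughly 13x fewer RL GPU-hours.

Link: https://arxiv.org/abs/2606.17029
Authors: Minghang Zhu, Chuyang Wei, Junhao Xu, Yilin Cheng, Zhumin Chen, Jiyan He
Affiliations: Shandong University; Zhongguancun Academy; Fudan University
Categories: Computation and Language (cs.CL)
Comments: none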

Abstract:Deep research agents synthesize long-form reports by searching and reasoning over retrieved evidence. Reinforcement learning with rubric-based rewards improves these agents by optimizing them against checkable criteria that translate report quality into reward signals, but its efficiency depends on whether those criteria reliably capture the task scope and evidence needs. Most existing studies ask an LLM to generate rubrics for a given query, but when the model fails to infer the underlying information needs, the generated rubrics may be incomplete and reduce RL efficiency. To obtain more reliable query–rubric supervision, we introduce DeepRubric, a data construction framework that reverses this process: instead of inferring evaluation criteria for a given query, it first determines what an evidence-backed report should be evaluated on and then synthesizes aligned query–rubric pairs from those evaluation targets. Starting from a sampled seed topic, DeepRubric builds an evidence tree by recursively expanding evidence-backed sub-questions, whose leaves serve as atomic and verifiable evaluation targets. It then uses the evidence tree to synthesize the training query and rubrics, ensuring that the reward evaluates exactly the information requested by the query. Using DeepRubric, we construct 9K query–rubric supervision examples and train DeepRubric-8B with rubric-based GRPO, achieving comparable performance to prior open state-of-the-art deep research models across three benchmarks with roughly 13x fewer RL GPU-hours.
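The evidence-tree idea can be pictured with a small data structure in which leaves become rubric items. The sketch below is an assumption-laden illustration: the `EvidenceNode` fields and the `leaf_targets` helper are hypothetical names, and the LLM-driven expansion and query synthesis steps from the paper are not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class EvidenceNode:
    """A sub-question backed by a piece of evidence (hypothetical schema)."""
    question: str
    evidence: str
    children: list["EvidenceNode"] = field(default_factory=list)


def leaf_targets(node: EvidenceNode) -> list[str]:
    """Collect leaf questions, which play the role of atomic evaluation targets."""
    if not node.children:
        return [node.question]
    targets: list[str] = []
    for child in node.children:
        targets.extend(leaf_targets(child))
    return targets
```

Because both the query and the rubrics would be synthesized from the same tree, each rubric item traces back to a leaf and its supporting evidence.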

[NLP-4] Selection Without Signal Recovery Through Expression: A Measurement Study of Post-Hoc Falsification Operators for Frozen Small Code Models

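[Quick read]: This paper asks whether post-hoc operators can help frozen small code models (≤1.5B parameters, run locally without fine-tuning), which suit offline and privacy-constrained use but often emit plausible-but-wrong programs. Such operators (selection, verification, repair, elimination, portfolios, sound vetoes, generation conditioning) aim to improve outputs without retraining and follow a Popperian principle: attack each candidate with a severe test and keep what survives. Under one deterministic execution oracle and a leakage-free, matched-compute protocol, none of the 26 semantic post-hoc operators improves held-out accuracy over the Best-of-N (BoN) baseline on the cells and benchmarks tested. The negative result has three mechanistic causes: a coverage wall (systematic hard-task failures that deeper sampling does not rescue), a capability scissors (a competent generator leaves almost no discriminable error among samples that pass the visible tests), and a near-empty consensus trap (a visible-pass-but-hidden-wrong majority rarely co-occurs with a correct alternative). A distribution-free do-no-harm bound also cannot certify a harm rate ≤ α at zero observed harm unless n ≥ 45. Two operators do help outside the semantic output space: expression-layer recovery (M1), the only accuracy gain, recovers correct programs that the standard extractor discards (robust extraction and public-test signature alignment), does no harm (b10=0), is leakage-free, and lifts DeepSeek-Coder-1.3B by 12 tasks on HumanEval+ (p=2.4e-4); adaptive consensus early-stop (ACE) saves about 19% of compute with zero harm. M1 and the selection negative replicate on HumanEval+ and MBPP+ across three model cells. The lesson: fix the harness and measure coverage before blaming semantic post-hoc reasoning.

Link: https://arxiv.org/abs/2606.16999
Authors: Mehmet Iscan
Affiliations: PythaLab, Yıldız Technical University
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 33 pages, 4 figures, 8 tables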

Abstract:Frozen small code models (≤1.5B parameters, run locally without fine-tuning) suit offline and privacy-constrained use, but often emit plausible-but-wrong programs. A natural remedy is a post-hoc operator that selects, verifies, repairs, or re-processes the model’s samples without retraining; in principled form it is Popperian: attack each candidate with a severe test, keep what survives. We measure whether such operators help. Under one deterministic execution oracle and a leakage-free, matched-compute protocol, 26 semantic post-hoc operators (selection, verification, repair, elimination, portfolios, sound vetoes, generation conditioning) are evaluated against Best-of-N (BoN); on the cells and benchmarks tested, none improves held-out accuracy over BoN. The negative is mechanistic: a coverage wall (systematic hard-task failures deeper sampling does not rescue), a capability scissors (a competent generator leaves almost no discriminable error among visible-test passers), and a near-empty consensus trap (the visible-pass-but-hidden-wrong majority a leakage-free selector needs rarely co-occurs with a correct alternative). A distribution-free do-no-harm bound cannot certify a harm rate ≤ alpha at zero observed harm unless n ≥ 45. Two operators help on a different axis, outside the semantic output space. An expression-layer recovery (M1), the only accuracy gain here, recovers correct programs the standard extractor discards (robust extraction and public-test signature alignment); it does no harm (b10=0), is leakage-free, and lifts DeepSeek-Coder-1.3B by +12 tasks on HumanEval+ (p=2.4e-4). An adaptive consensus early-stop (ACE) is a calibrated compute-saving control (~19% saving, zero harm). M1 and the selection negative replicate on HumanEval+ and MBPP+ across three model cells. The lesson: fix the harness and measure coverage before blaming semantic post-hoc reasoning.
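The abstract does not say how its do-no-harm bound is derived. One standard distribution-free argument, shown here purely as an assumption, is the zero-failure binomial bound: if the true harm rate exceeded alpha, the chance of seeing zero harms in n independent trials would be below (1 - alpha)^n, so certifying at confidence 1 - delta requires (1 - alpha)^n ≤ delta. The function name and the values alpha = 0.05 and delta = 0.10 are illustrative choices that happen to give a threshold of 45; the paper's actual parameters are not given in this digest.

```python
import math


def min_trials_for_zero_harm(alpha: float, delta: float) -> int:
    """Smallest n with (1 - alpha)**n <= delta, i.e. the fewest zero-harm
    trials needed to certify harm rate <= alpha at confidence 1 - delta."""
    if not (0.0 < alpha < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("alpha and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(delta) / math.log(1.0 - alpha))


# Illustrative parameters only (assumed, not taken from the paper).
print(min_trials_for_zero_harm(alpha=0.05, delta=0.10))  # 45
```

Under this reading, any evaluation with fewer trials than the threshold cannot support a "no harm" claim even when no harm was observed.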

[NLP-5] Exploring Extrinsic and Intrinsic Properties for Effective Reasoning with Code Interpreter

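[Quick read]: This paper investigates which behavioral properties underlie effective reasoning when large language models (LLMs) use a Code Interpreter (CI). Prior work shows that CI improves reasoning through executable computation and iterative verification, but the properties behind effective code reasoning have not been studied systematically. Drawing on studies of natural language reasoning, the paper analyzes two dimensions: extrinsic properties, represented by crucial tokens, and intrinsic properties, represented by code-specific cognitive behaviors such as verification, backtracking, and backward chaining. Stronger CI reasoning models consistently show a higher prevalence of both. The paper then uses these properties at inference and training time. At inference time, appending code-specific crucial tokens improves performance on mathematical, ordering, and optimization tasks, with limited benefit elsewhere. At training time, augmenting a state-of-the-art framework with code-specific cognitive behaviors improves supervised fine-tuning and reinforcement learning performance in two of the three evaluated models. Further analysis shows that these behaviors reduce overthinking in incorrect responses and improve token efficiency, and it identifies factors that limit gains in one model. The study offers the first systematic characterization of effective CI reasoning and shows both the potential and the limits of exploiting key properties to improve it.

Link: https://arxiv.org/abs/2606.16934
Authors: Patomporn Payoungkhamdee, Napat Laosaengpha, Jenta Wonglertsakul, Pittawat Taveekitworachai, Pume Tuchinda, Panjapong Poobanchuen, Ekapol Chuangsuwanich, Can Udomcharoenchaikit, Samuel Cahyawijaya, Peerat Limkonchotiwat, Sarana Nutanong
Affiliations: unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: none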

Abstract:Reasoning with a Code Interpreter (CI) has emerged as an effective paradigm for enhancing the reasoning capabilities of large language models (LLMs) through executable computation and iterative verification. Despite its growing adoption, the behavioral properties underlying effective code reasoning remain largely underexplored. In this work, we investigate code reasoning from two distinct perspectives inspired by prior studies of natural language reasoning: extrinsic properties, represented by crucial tokens, and intrinsic properties, represented by code-specific cognitive behaviors. Across multiple LLMs, we find that stronger CI reasoning models consistently exhibit a higher prevalence of crucial tokens and cognitive behaviors, particularly verification, backtracking, and backward chaining. Building on these observations, we examine how these properties can be leveraged during both inference and training. At inference time, appending code-specific crucial tokens improves performance on several reasoning capabilities, including mathematical, ordering, and optimization, while yielding limited benefits elsewhere. At training time, augmenting a state-of-the-art framework with code-specific cognitive behaviors improves supervised fine-tuning and reinforcement learning performance in two of three evaluated models. Further analysis shows that these behaviors reduce overthinking in incorrect responses and improve token efficiency, while also revealing factors that limit gains in a certain model. Our findings provide the first systematic characterization of effective reasoning with CI and demonstrate both the potential and limitations of leveraging key properties to improve CI-based reasoning.

[NLP-6] IMPACTeen: Intentions Manipulation Persuasion Annotations and Consequences in Teen Communication Dataset

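[Quick read]: This paper addresses the lack of standardized data for identifying and analyzing social influence in texts set in an adolescent context. Research on how social influence operates in teen interpersonal, media-based, and digital settings is held back by scarce data and inconsistent annotation, and in particular by the absence of multi-perspective, multi-dimensional data with realistic context. The solution is the IMPACTeen dataset, built through constrained LLM generation followed by a two-step human editing and validation phase that ensures youth-context realism. Its multi-dimensional annotation covers influence presence, techniques, intentions, consequences, resistance, reactions, and annotation confidence, and each text is annotated from five perspectives: teenagers, parents, psychologists, communication experts, and teachers. The resource supports social influence detection, annotator disagreement analysis, cross-lingual modeling, and the training and evaluation of language models.

Link: https://arxiv.org/abs/2606.16910
Authors: Aleksander Szczęsny, Wiktoria Mieleszczenko-Kowszewicz, Maciej Markiewicz, Beata Bajcar, Tomasz Adamczyk, Jolanta Babiak, Grzegorz Chodak, Przemysław Kazienko
Affiliations: Wrocław University of Science and Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: none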

Abstract:IMPACTeen is a dataset of textual social influence scenarios spanning interpersonal, media-based, and digital settings in an adolescent context. It contains 1,021 texts, 5,100 individual annotation records, and gold labels for social influence techniques, with each text annotated from five distinct perspectives: teenagers, parents, psychologists, communication experts, and teachers. The resource was constructed through constrained LLM generation, followed by a two-step human editing and validation phase aimed at ensuring youth-context realism. A multi-dimensional annotation covered influence presence, techniques, intentions, consequences, resistance, reactions, and annotation confidence. The dataset supports research on social influence detection, annotator disagreement, cross-lingual modeling, and the training and evaluation of language models. The dataset was created in Polish and is accompanied by a corresponding English version.

[NLP-7] LESS Is More: Mutual-Stability Sampling for Diffusion Language Models

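[Quick read]: This paper addresses a practical efficiency bottleneck in diffusion large language models (dLLMs): existing samplers run a fixed number of reverse denoising steps, which wastes computation on positions that are already stable and sometimes commits (unmasks) unstable predictions too early, hurting generation quality. The solution is LESS, a training-free, model-agnostic adaptive sampler that treats token commitment as an online stopping problem. Its key element is a joint stability rule: a masked position becomes eligible for unmasking only when (1) its top-1 prediction has high confidence, (2) its top-1 token persists across recent reverse steps, and (3) its predictive distribution is stable between consecutive steps as measured by top-K Jensen–Shannon divergence. LESS is evaluated on Dream-7B, LLaDA-8B, and LLaDA-1.5-8B, covering full-sequence diffusion and semi-autoregressive blockwise sampling, across seven benchmarks spanning general knowledge, math, and code. It improves average accuracy over strong training-free adaptive samplers while using 72.1% fewer reverse steps than fixed-budget decoding, which translates into fewer forward passes, lower measured wall-clock latency, and lower estimated inference compute.

Link: https://arxiv.org/abs/2606.16908
Authors: Amr Mohamed, Guokan Shang, Michalis Vazirgiannis
Affiliations: MBZUAI; Ecole Polytechnique
Categories: Computation and Language (cs.CL)
Comments: none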

Abstract:Diffusion large language models (dLLMs) offer a promising alternative to autoregressive decoding by iteratively refining masked sequences, enabling parallel token updates and bidirectional conditioning. Their practical efficiency, however, is limited by sampling procedures that execute a fixed number of reverse denoising steps selected before decoding, spending computation on already-stable positions and sometimes committing unstable ones too early. We present LESS, a training-free, model-agnostic adaptive sampler that treats token commitment as an online stopping problem. LESS implements mutual-stability sampling through a joint stability rule that makes a masked position eligible for unmasking only when its top-1 prediction has high confidence, its top-1 token persists across recent reverse steps, and its predictive distribution is stable under top-K inter-step Jensen–Shannon divergence. We evaluate LESS on Dream-7B, LLaDA-8B, and LLaDA-1.5-8B, covering full-sequence diffusion and semi-autoregressive blockwise sampling regimes, across seven benchmarks spanning general knowledge, math, and code. LESS improves average accuracy over strong training-free adaptive samplers while using 72.1% fewer reverse steps than fixed-budget decoding. Since each reverse step requires a Transformer forward pass, these step-count reductions translate into fewer forward evaluations, lower measured wall-clock latency, and lower estimated inference compute.
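The three-part joint stability rule can be sketched for a single masked position as follows. This is a rough illustration under stated assumptions: the thresholds, the persistence window, restricting the divergence to the top-K tokens of the latest step, and the function names are all choices made for this example, since the abstract does not give the exact definitions.

```python
import numpy as np


def topk_js_divergence(p: np.ndarray, q: np.ndarray, k: int) -> float:
    """Jensen-Shannon divergence restricted to the top-k tokens of p (renormalised)."""
    idx = np.argsort(p)[-k:]
    eps = 1e-12  # guards against an all-zero slice
    ps = (p[idx] + eps) / (p[idx] + eps).sum()
    qs = (q[idx] + eps) / (q[idx] + eps).sum()
    m = 0.5 * (ps + qs)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(ps, m) + 0.5 * kl(qs, m)


def eligible_to_unmask(history, conf_threshold=0.9, persistence=3, k=5, js_threshold=0.01):
    """history: probability vectors for one masked position, oldest first.

    Returns True only if all three conditions hold (persistence must be >= 2).
    """
    if persistence < 2 or len(history) < persistence:
        return False
    recent = history[-persistence:]
    confident = float(recent[-1].max()) >= conf_threshold          # (1) high top-1 confidence
    persistent = len({int(np.argmax(p)) for p in recent}) == 1     # (2) same top-1 token recently
    stable = topk_js_divergence(recent[-1], recent[-2], k) <= js_threshold  # (3) stable distribution
    return confident and persistent and stable
```

A sampler built on this rule would call `eligible_to_unmask` for every masked position after each reverse step and stop once no masked positions remain, which is where the step savings would come from.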

[NLP-8] Speaking the Language of Science: Toward a General-Purpose Generative Foundation Model for the Natural Sciences

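[Quick read]: This paper addresses the modeling challenge posed by heterogeneous tasks across the natural sciences, namely how to build one representation and reasoning framework shared by different scientific tasks. Conventional approaches rely on domain-specific models and separate technical stacks, which limits cross-domain generalization. The key to the solution is LOGOS (Language Of Generative Objects in Science), a scientific generative language model that encodes diverse scientific objects and their spatial interactions as discrete token sequences over a common vocabulary, forming a single autoregressive framework based on a shared scientific grammar. By representing spatial contact and constraint patterns as discrete tokens, it captures complex structural interactions in a purely sequential manner, without explicit coordinates or geometric neural networks. This unified representation lets a wide range of downstream tasks be formulated as next-token prediction in the same grammar space, creating strong alignment between continued multi-domain pre-training and downstream objectives. Across diverse tasks LOGOS matches or outperforms domain-specific baselines, providing preliminary evidence that "one model fits all" is feasible in the natural sciences. Performance correlates positively with model size (1B, 3B, and 8B parameters), which suggests that the future of AI for Science (AI4S) may lie not in an independent technical stack separated from LLMs but in deeply aligning scientific foundation models with LLMs through shared architectures, training paradigms, and inference infrastructure, so that LLMs become a new entry point for AI4S.

Link: https://arxiv.org/abs/2606.16905
Authors: Mingyang Li, Yurou Liu, Jieping Ye, Bing Su, Ji-Rong Wen, Zheng Wang
Affiliations: Alibaba Group; Gaoling School of Artificial Intelligence, Renmin University of China
Categories: Computation and Language (cs.CL)
Comments: none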

Abstract:In this report, we present LOGOS (Language Of Generative Objects in Science), a scientific generative language model that unifies heterogeneous tasks across the natural sciences within a single autoregressive framework based on a shared scientific grammar. It encodes diverse scientific objects and their spatial interactions as token sequences over a common vocabulary. By representing spatial contact and constraint patterns as discrete tokens, the model captures complex structural interactions in a purely sequential manner, without relying on explicit coordinates or geometric neural networks. This unified representation enables a wide range of downstream tasks to be formulated consistently as next-token prediction in the same grammar space, creating strong alignment between continued multi-domain pre-training and downstream objectives. Across diverse tasks, LOGOS consistently matches or outperforms domain-specific baselines, providing preliminary evidence for the feasibility of “one model fits all” in the natural sciences. We train LOGOS models at different scales (1B, 3B, and 8B parameters) and find a consistent positive correlation between model size and performance. This suggests that the future of AI for Science (AI4S) may not lie in building an independent technical stack that is separated from large language models (LLMs). Instead, it may depend on deeply aligning scientific foundation models with LLMs through shared architectures, shared training paradigms, and shared inference infrastructure, so that LLMs can truly become a new entry point for AI4S. We release the model weights and associated resources to facilitate further research.

[NLP-9] Contrastive-Difference CKA Reveals Concept-Specific Structural Alignment Across Language Model Architectures

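[Quick read]: This paper asks whether different LLM architectures encode high-level concepts in structurally compatible ways. Its central finding is a geometric-functional dissociation: across architectures, moderate geometric convergence coexists with near-perfect functional concept transfer. To show this, the paper proposes contrastive-difference CKA (CKA_Delta), a training-free diagnostic that computes kernel alignment on per-sample contrastive differences, isolating concept-specific convergence from generic similarity and achieving significant discrimination where standard CKA cannot. The dissociation holds across all six concept domains tested, including two non-instruction concepts (code vs. natural language, reasoning vs. recall), and a single 70B-scale pair hints that universality may strengthen with scale. CKA_Delta is positioned as a practical regime classifier and architectural outlier detector (Gemma: d = 1.08, AUC = 0.79) rather than an absolute transfer-accuracy predictor, offering a training-free diagnostic for cross-architecture concept monitoring.

Link: https://arxiv.org/abs/2606.16897
Authors: Xueping Gao
Affiliations: Alibaba Cloud
Categories: Computation and Language (cs.CL)
Comments: none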

Abstract:Do different LLM architectures encode high-level concepts in structurally compatible ways? We systematically characterize a geometric-functional universality dissociation: across multiple concept domains and architectural families, moderate geometric convergence coexists with near-perfect functional transfer. Using contrastive-difference CKA (CKA_Delta), a training-free diagnostic that computes kernel alignment on per-sample contrastive differences, we isolate concept-specific convergence from generic similarity – achieving significant discrimination where standard CKA cannot. The dissociation replicates across all six concept domains we test (five with p = 0.017 geometric discrimination and safety as a converging-functional trend, p = 0.08), including two non-instruction concepts (code-vs-NL, reasoning-vs-recall) validated without system prompts; a single 70B–70B pair provides an observational note that universality may strengthen with scale, requiring replication with additional ≥70B models. We position CKA_Delta as a practical regime classifier and architectural outlier detector (Gemma: d = 1.08, AUC = 0.79) rather than an absolute transfer-accuracy predictor, providing a training-free diagnostic for cross-architecture concept monitoring.
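Computing kernel alignment on per-sample contrastive differences can be sketched with linear CKA. This is a minimal sketch, not the paper's definition: using a linear kernel, forming the difference as concept-present minus concept-absent activations, and the function names are assumptions made for illustration.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, dim)."""
    x = x - x.mean(axis=0, keepdims=True)  # centre each feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm = np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro")
    return float(cross / norm)


def cka_delta(pos_a, neg_a, pos_b, neg_b) -> float:
    """CKA on per-sample contrastive differences for models A and B.

    pos_*: activations on prompts where the concept is present,
    neg_*: activations on the matched prompts where it is absent.
    Row i of every matrix must correspond to the same prompt pair.
    """
    return linear_cka(pos_a - neg_a, pos_b - neg_b)
```

Subtracting the matched negative removes whatever the two prompts share, so the alignment score reflects how the models represent the concept contrast and not their generic similarity on the prompts.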

[NLP-10] Symbolic Informalization: Fluent, Productive, Multilingual

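[Quick read]: This paper addresses the gap between machine-checkable formal mathematics and human readability: how to convert rigorous formal content into fluent natural language without loss of precision. The core challenge is a precise, consistent, and efficient semantic mapping across different formal systems and natural languages. The key to the solution is Informath, a system built on an interlingual architecture in which Dedukti works as a hub between different proof systems (Agda, Lean, Rocq), while Grammatical Framework (GF) takes care of linguistic correctness and variation in different natural languages. This architecture converts formal content to natural language reliably, so that proofs constructed by artificial intelligence and autoformalization can be explained precisely, supporting human-machine collaboration in mathematical development.

Link: https://arxiv.org/abs/2606.16893
Authors: Aarne Ranta
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
Comments: none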

Abstract:Symbolic informalization enables a reliable conversion of formal mathematics to natural language. It has the potential to make machine-checked content human-readable without loss of precision. In a traditional proof system usage, symbolic informalization generalizes the limited mechanisms of syntactic sugar into the ordinary language of mathematics. In a setting where proofs are constructed by artificial intelligence and autoformalization, symbolic informalization can explain what precisely has been constructed. This paper outlines the project Informath, which aims to show how symbolic informalization can produce fluent text with a reasonable development effort and address multiple formal and natural languages. Informath is based on an interlingual architecture, where Dedukti works as a hub between different proof systems (Agda, Lean, Rocq) and Grammatical Framework (GF) takes care of linguistic correctness and variation in different natural languages.

[NLP-11] Compositional Reasoning Depth Predicts Clinical AI Failure: Empirical Evidence Consistent with Transformer Compositionality Limits in Electronic Health Record Question Answering

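[Quick read]: This paper addresses the systematic failures of large language models (LLMs) on electronic health record (EHR) question answering, where error rates rise sharply as questions require more inferential steps. Aggregate accuracy benchmarks hide this structure and can misrepresent a model's real capability. The paper introduces a theory-motivated hop-count taxonomy, the number of distinct reasoning steps required to answer a clinical question from an EHR, as a principled predictor of model failure. The authors annotate 313 clinician-generated MedAlign EHR question-answer pairs across four hop levels and evaluate several models (Claude Sonnet, GPT-4o, and gpt-5.4-2026-03-05). In all of them accuracy declines monotonically with hop count, and the decline is not caused by EHR truncation but reflects the difficulty of compositional reasoning. Extended thinking does not significantly flatten the accuracy-depth curve, and thinking-token usage scales with hop count (r=0.31, p<0.0001), consistent with the predicted O(k) computational requirement. Hop count is thus a theory-motivated, cross-architecture predictor of LLM error with direct use for deployment risk stratification of clinical AI.

Link: https://arxiv.org/abs/2606.16890
Authors: Sanjay Basu
Affiliations: University of California San Francisco; Waymark; Stanford University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 20 pages, 5 figures. Code: this https URL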

Abstract:Aggregate accuracy benchmarks conceal a systematic structure in how large language models fail at electronic health record (EHR) question answering: questions requiring more inferential steps produce disproportionately more errors. Motivated by theoretical results on transformer compositionality limits, we introduce a pre-specified hop-count taxonomy – the number of distinct reasoning steps required to answer a clinical question from an EHR – as a principled predictor of model failure. We annotate 313 clinician-generated MedAlign EHR question-answer pairs across four hop levels and evaluate 301 questions in a within-model ablation (claude-sonnet-4-6, zero-shot vs. extended thinking) and cross-architecture replications (gpt-4o and gpt-5.4-2026-03-05, zero-shot). All three models, spanning two providers and two OpenAI generations (GPT-4 and GPT-5), show monotone accuracy decline with hop count: Claude Sonnet zero-shot falls from 30.6% (hop=1) to 17.6% (hop=4) (Cochran-Armitage z=-2.30, p=0.011; OR per hop 0.72, 95% CI [0.56,0.92], p=0.008); GPT-4o replicates this (37.8% to 14.7%; OR 0.58 [0.45,0.75], p<0.001); and gpt-5.4-2026-03-05 confirms it (37.8% to 23.5%; OR 0.80 [0.66,0.98], p=0.027). A pre-specified context-sufficiency audit shows higher-hop questions are not differentially disadvantaged by EHR truncation (answerability 93-95% at hops 2-4 vs. 79% at hop=1), so the decline reflects compositional reasoning difficulty. Extended thinking did not significantly flatten the accuracy-depth curve across three reasoning conditions, and thinking-token usage scaled with hop count (r=0.31, p<0.0001), consistent with the predicted O(k) computational requirement. Hop count is thus a theory-motivated, cross-architecture predictor of large-language-model error on EHR question answering, with direct implications for deployment risk stratification of clinical AI.

[NLP-12] Understanding Scam Trends and Rail Paths from Reddit Self-Disclosure Narratives

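[Quick read]: This paper addresses two long-standing gaps in research on online scams: existing work does not track how scam characteristics and rails evolve over the years, and the study of relations between rails is limited by the lack of annotated open-source datasets covering different scam types. The authors build a new dataset from Reddit self-disclosure narratives from 2023 to 2025. For trend analysis, they collect 21,304 posts that contain at least one rail among identity, communication, platform, and payment and label them by heuristic annotation. For scam path analysis, they label 1,800 posts containing explicit or recoverable scam chains with an LLM-assisted method evaluated against human annotation. They also run a topic model on post comments to study how community support behavior changes. The results show that scam processes are predominantly multi-rail, that the dominant scam types and rail components differ by year, that scam types vary systematically in path complexity, and that community support has become more detailed over time. The work supports synthetic scam chain data simulation and AI-related scam risk assessment, though its findings may not generalize to other platforms.

Link: https://arxiv.org/abs/2606.16874
Authors: Yangjun Zhang, Mirko Bottarelli, Mark Hooper, Carsten Maple
Affiliations: The Alan Turing Institute, London, UK
Categories: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE); Computers and Society (cs.CY)
Comments: 6 pages, International Conference on AI and the Digital Economy (CADE) 2026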

Abstract:Online scam behavior is inherently multi-stage, and the lifecycle includes temporally ordered rails and events rather than isolated signals. Existing works analyze characteristics of scam types and rails, but they do not track scam trends across years. Moreover, the work on the relations between rails is hampered due to the lack of open-source datasets with annotations and coverage of different scam types. To address these gaps, we build a dataset to analyze the yearly trend of scam characteristics and rail paths using Reddit self-disclosure narratives from 2023 to 2025. We collect 21,304 posts from scam-related subreddits with at least one rail among identity, communication, platform, and payment for trend analysis by heuristic annotation. Then, we label 1,800 posts containing explicit or recoverable scam chains by an LLM-assisted method for scam path analysis. The method is evaluated with human annotation. Lastly, we run a topic model on the comments of the posts to analyze the community support behavior. The results reveal that scam processes are predominantly multi-rail. Across years, different scam types and rail components dominate. Different scam types vary systematically in path complexity. Reddit support behaviors have become more detailed over time. This work supports synthetic scam chain data simulation and AI-related scam risk assessment, though findings may not generalise to other platforms.

[NLP-13] Revisiting the Systematicity in Negation in the Era of In-Context Learning

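[Quick read]: This paper examines the systematic difficulties that large language models (LLMs) still have in understanding negated sentences, focusing on how well they recognize negation expressions and negation scope. Although modern models show some ability on specific tasks, their handling of negation remains inconsistent and not robust, and the limitation varies with the output format. The study tests systematicity at two levels. Behavioral systematicity examines, through demonstrations and in-context learning, how well models recognize negation expressions and their scope. Representational systematicity examines whether function vectors can be robustly constructed from in-context examples for tasks essential to understanding negation. The experiments suggest that function vectors can be composed for negation cue extraction, whereas extracting function vectors for scope recognition is more challenging, which points to unstable representations of negation semantics in current models.

Link: https://arxiv.org/abs/2606.16867
Authors: Hitomi Yanaka, Taisei Yamamoto
Affiliations: The University of Tokyo; Riken; Tohoku University
Categories: Computation and Language (cs.CL)
Comments: Accepted to the 6th Workshop Natural Language Meets Logic and Machine Learning (NALOMA2026) at ESSLLI2026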

Abstract:Understanding the meaning of negated sentences remains one of the challenges for language models, even in the era of large language models (LLMs). We analyze systematicity regarding LLM understanding of negation from two perspectives: behavioral systematicity and representational systematicity. For behavioral systematicity, we confirm that through demonstrations and in-context learning, LLMs can recognize negation expressions and scope within sentences to some extent, but they fail to achieve perfect performance. In particular, the difficulty of the negation scope recognition for models varies depending on the output format. For representational systematicity, we analyze the extent to which function vectors can be robustly constructed from in-context examples for tasks that are essential to understanding negation. The experiments suggest that while function vectors can be composed for negation cue extraction tasks, extracting function vectors for recognizing scope is more challenging.

[NLP-14] Follow the Latent Roadmap: Navigating Revocable Decoding for Diffusion LLMs with Anchor Tokens

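[Quick read]: This paper addresses the trade-off between decoding speed and generation quality that diffusion large language models (dLLMs) face during parallel generation. Existing revocable decoding strategies try to correct errors by verifying and remasking tokens, but they usually operate within a mixed-quality context, which causes two failures: Error Propagation, where newly generated tokens absorb harmful information from erroneous context, and Local Error Reinforcement, where errors reinforce each other and evade detection. To mitigate both, the paper proposes ASRD (Anchor Supervised Revocable Decoding), a training-free framework that works in the embedding space and explicitly decouples the decoding context into trusted Anchor Tokens, identified via temporal consistency, and uncertain candidates. Using a dynamic Anchor Tokens Cache, ASRD introduces two complementary mechanisms: (1) Anchor-Guided Generation, which injects entropy-weighted anchor signals into masked positions to implicitly steer attention toward the reliable global skeleton; and (2) Anchor-Perturbed Verification, which applies orthogonal perturbations to uncertain candidate tokens, destabilizing errors driven by fragile local consensus so that they are remasked. On math and coding benchmarks, ASRD outperforms recent remasking baselines, improving accuracy by up to 6.4% while raising inference throughput by up to 7.2x.

Link: https://arxiv.org/abs/2606.16847
Authors: Yizhen Yao, Qinglin Zhu, Runcong Zhao, Xiangxiang Dai, Yanzheng Xiang, Yulan He, Lin Gui
Affiliations: King’s College London; The Chinese University of Hong Kong; The Alan Turing Institute, UK
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: none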

Abstract:Diffusion Large Language Models (dLLMs) offer a promising avenue for parallel generation but face a trade-off between decoding speed and quality. While revocable decoding strategies attempt to mitigate errors by verifying and remasking tokens, they typically operate within a mixed-quality context. This leads to two critical failures: Error Propagation, where new tokens absorb toxic information from erroneous context, and Local Error Reinforcement, where errors mutually reinforce each other to evade detection. To alleviate these challenges, we propose ASRD (Anchor Supervised Revocable Decoding), a training-free framework that operates within the embedding space. ASRD explicitly decouples the decoding context into trusted Anchor Tokens, which are identified via temporal consistency, and uncertain candidates. Leveraging a dynamic Anchor Tokens Cache, we introduce two complementary mechanisms: (1) Anchor-Guided Generation, which injects entropy-weighted anchor signals into masked positions to implicitly rectify attention toward the reliable global skeleton; and (2) Anchor-Perturbed Verification, which applies orthogonal perturbations to uncertain candidate tokens, destabilizing and remasking errors driven by fragile local consensus. Extensive experiments on math and coding benchmarks demonstrate that ASRD outperforms recent remasking baselines, achieving accuracy improvements of up to 6.4% while accelerating inference throughput by up to 7.2x.

[NLP-15] Robust Dual-Signal Fusion: Hybrid Neuro-Symbolic Gating with Compressed Chain-of-Thought Refinement for Irony Detection in Social Media Texts

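[Quick read]: This paper addresses the difficulty large language models (LLMs) have with zero-shot detection of irony, which stems from their default tendency toward literal semantic interpretation. The authors propose the Robust Dual-Signal (RDS) Fusion framework, a hybrid neuro-symbolic architecture that compresses Chain-of-Thought (CoT) reasoning trajectories without Supervised Fine-Tuning (SFT). The key point is that the symbolic prior, the frozen CoT pipeline, and the neural baseline must be fused concurrently instead of being stacked one at a time: a statistical ablation shows that only the complete fusion of all three signals yields a statistically validated improvement over the baseline (p = 0.005). On a strictly held-out TweetEval test set the method reaches 78.1% accuracy and a Macro F1 of 0.777, matching fine-tuned BERTweet. On the heavily imbalanced iSarcasm dataset it reaches a zero-shot Macro F1 of 0.6726 and an Ironic F1 of 0.4821, outperforming multiple heavily supervised SemEval transformer ensembles.

Link: https://arxiv.org/abs/2606.16845
Authors: Ankit Bhattacharjee, Krityapriya Bhaumik
Affiliations: Indian Institute of Technology Kharagpur
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages total, 10 figures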

Abstract:Large Language Models (LLMs) natively default to literal semantic interpretations, making zero-shot irony detection a persistent challenge. We introduce the Robust Dual-Signal (RDS) Fusion framework, a hybrid neuro-symbolic architecture that compresses Chain-of-Thought (CoT) reasoning trajectories without Supervised Fine-Tuning (SFT). Evaluated on a strictly held-out TweetEval test set (N=734), RDS achieves 78.1% accuracy and a Macro F1 of 0.777, matching the absolute performance ceiling of the fine-tuned BERTweet. On the heavily imbalanced iSarcasm dataset, the frozen CoT pipeline filters 22.5% of out-of-distribution hallucinations, yielding a zero-shot Macro F1 of 0.6726 and Ironic F1 of 0.4821, outperforming multiple heavily supervised SemEval transformer ensembles. A statistical ablation confirms this structural synergy: adding the symbolic prior to the neural baseline yields no significant gain (p = 0.242), and the marginal benefit of adding the CoT pipeline to that prior is heavily compressed (p = 0.149). Only the complete, concurrent fusion of all three signals achieves a statistically validated improvement over the baseline (p = 0.005).

[NLP-16] Data-Driven Decoding of Russell's Circumplex Model of Affect

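Quick read: This paper addresses the lack of interpretability and geometric transparency of the latent spaces used for affect representation, which in affective computing often remain opaque, high-dimensional black boxes. The core question is whether emotion semantics are organized in deep representations in a way that agrees with psychological theory. Using a unified experimental design, the authors test whether the latent spaces of Transformer-based text (RoBERTa) and speech (wav2vec 2.0) encoders, and of a multimodal fusion architecture, recover the valence-arousal topology described by Russell's circumplex model and reproduce human-like neighborhood relations between emotions, both on naturalistic datasets such as MSP-Podcast and on controlled LLM-generated stimuli. They find that multimodal fusion of text and audio yields perfect topological alignment with Russell's primary emotion ordering, and that in a zero-shot setting fine-grained emotion terms projected from generic text embeddings fall close to their established human-mapped coordinates. The result is a data-driven framework for validating emotion models, showing that the circumplex structure is intrinsically encoded in the embeddings of these modalities rather than being solely an artifact of human labeling, which links psychological theory with representation learning.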

Link: https://arxiv.org/abs/2606.16843
Authors: Amdjed Belaref, Samir Sadok, Zineb Noumir, Renaud Seguier
Affiliation: Alten; CentraleSupélec IETR UMR CNRS 6164; Inria at Univ. Grenoble Alpes, CNRS, LJK
Categories: Computation and Language (cs.CL)
Comments: This work has been submitted to the IEEE for possible publication

Abstract:Affective computing increasingly relies on deep learning to represent emotions, yet latent spaces often remain opaque, high-dimensional black boxes. This paper investigates whether Transformers’ embeddings recover the geometric regularities of Russell’s circumplex model. We unify two complementary experiments testing the hypothesis that, after training models on text and speech, their resulting latent spaces encode a topology consistent with valence-arousal and reproduce human-like neighborhood relations. Specifically, we evaluate deep representations extracted from Transformer-based text (RoBERTa) and speech (wav2vec 2.0) encoders, along with a multimodal Transformer fusion architecture, across naturalistic datasets like MSP-Podcast and controlled LLM-generated stimuli. Our analysis reveals that multimodal fusion of text and audio yields perfect topological alignment with Russell’s primary emotion ordering. Furthermore, in a zero-shot setting using generic text embeddings, projected fine-grained emotion terms fall close to their established human-mapped coordinates. Our contribution is a novel, data-driven framework for validating emotion models, demonstrating that Russell’s circumplex structure is intrinsically encoded in the embeddings of these modalities rather than being solely an artifact of human labeling, thereby bridging the gap between psychological theory and representation learning.

[NLP-17] Does Traversal Order Matter? A Systematic Study of Tree Traversal Methods in Transformer Grammars

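Quick read: This paper examines how the way syntactic trees are linearized in Transformer Grammars (TGs) can significantly affect model performance, a question left open because existing studies rely solely on Depth-First Traversal (DFT). The authors expand the traversal design space with Breadth-First Traversal (BFT) and a new hybrid strategy, Production-Rule Traversal (PRT), which combines the structural lookahead of BFT with the early lexical generation of DFT. By combining these traversals with varying tree configurations and masking strategies and evaluating them on language modeling, syntactic generalization, and summarization, the study reveals an inherent trade-off between nested composition and global lookahead and offers actionable recommendations for designing task-aware Transformer Grammars.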

Link: https://arxiv.org/abs/2606.16836
Authors: Zongru Liu, Pengyu Ji, Pengcheng Wang, Kewei Tu
Affiliation: ShanghaiTech University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Transformer Grammars (TGs) enhance language modeling by incorporating syntactic tree structures. Despite the potentially significant impact on model performance of how syntactic trees are linearized in TGs, existing studies rely solely on Depth-First Traversal (DFT) for linearization. In this paper, we expand the traversal design space by exploring Breadth-First Traversal (BFT) and a novel hybrid traversal strategy, Production-Rule Traversal (PRT), which combines the structural lookahead of BFT with the early lexical generation of DFT. We integrate these traversal methods with varying tree configurations and masking strategies, and empirically evaluate their performance on language modeling, syntactic generalization and summarization. We reveal the inherent trade-offs between nested composition and global lookahead, providing actionable recommendations for designing task-aware Transformer Grammars.
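
To make the difference between depth-first and breadth-first linearization concrete, here is a minimal sketch on a toy constituency tree. The nested-tuple tree encoding and the function names are assumptions made for illustration; the sketch emits only node labels and words, omits the closing-bracket actions a real Transformer Grammar would need, and does not attempt PRT, which the abstract does not specify.

```python
from collections import deque

# Toy constituency tree as nested tuples: (label, child, child, ...).
# Leaves are plain strings (words). This encoding is hypothetical.
TREE = ("S", ("NP", "the", "dog"), ("VP", "barks"))


def depth_first(tree):
    """Pre-order linearization: emit a node, then each subtree in full."""
    if isinstance(tree, str):
        return [tree]
    label, *children = tree
    out = [label]
    for child in children:
        out.extend(depth_first(child))
    return out


def breadth_first(tree):
    """Level-order linearization: emit every node at one depth before the next."""
    out, queue = [], deque([tree])
    while queue:
        node = queue.popleft()
        if isinstance(node, str):
            out.append(node)
        else:
            label, *children = node
            out.append(label)
            queue.extend(children)
    return out


dfs_order = depth_first(TREE)    # ['S', 'NP', 'the', 'dog', 'VP', 'barks']
bfs_order = breadth_first(TREE)  # ['S', 'NP', 'VP', 'the', 'dog', 'barks']
```

In the depth-first order the first words appear before the VP node is announced, while the breadth-first order exposes the whole top-level structure before any word, which is one way to read the lookahead versus early lexical generation trade-off described above.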

[NLP-18] Tying the Loop – Tied Expert Layers in Mixture-of-Experts Language Models

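Quick read: This paper addresses the high memory cost of Mixture-of-Experts (MoE) architectures in Large Language Models (LLMs): although only a few experts are activated per token, all expert parameters must still reside in memory during training and inference. The proposed solution, Expert Tying, shares expert parameters across consecutive transformer layers while keeping routing and attention independent per layer. By exploiting the parameter redundancy inherent in MoE pathways, the method reduces the memory footprint by almost 2x with virtually no degradation in perplexity or downstream quality, giving a favorable compute-to-memory trade-off for efficiently training and scaling next-generation LLMs.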

Link: https://arxiv.org/abs/2606.16825
Authors: Martin Jaggi
Affiliation: EPFL
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Code available at this https URL

Abstract:Mixture-of-Experts (MoE) architectures efficiently scale Large Language Models (LLMs) by activating only a small fraction of their experts per token, yet the full parameter count - dominated by the expert parameters - must be held in training and inference memory. To address this, we introduce Expert Tying, an architectural modification that shares expert parameters across consecutive transformer layers while preserving independent, layer-wise routing and attention. We evaluate this approach across common, state-of-the-art architectures, including OLMoE, Qwen3, and DeepSeek-style MoEs. Our pretraining experiments demonstrate that tying experts can reduce memory footprint by almost 2x at virtually no degradation in perplexity or downstream quality. By exploiting the parameter redundancy inherent in MoE pathways, our method provides a highly favorable compute-to-memory trade-off, advancing efficient training and scaling of next-generation LLMs.
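
The memory argument can be illustrated with a small parameter count. The sketch below assumes experts are tied in groups of consecutive layers (pairs in the example); the group size, the layer and expert counts, and the helper names are hypothetical and are not taken from the paper.

```python
def moe_expert_params(num_layers, num_experts, params_per_expert, tie_group=1):
    """Count stored expert parameters when every `tie_group` consecutive
    layers share one set of experts (tie_group=1 means no tying)."""
    # Number of distinct expert sets kept in memory (ceiling division).
    distinct_sets = -(-num_layers // tie_group)
    return distinct_sets * num_experts * params_per_expert


def expert_set_for_layer(layer_idx, tie_group):
    """Index of the shared expert set that a given layer reads from."""
    return layer_idx // tie_group


# Hypothetical configuration, chosen only to make the arithmetic visible.
untied = moe_expert_params(num_layers=16, num_experts=64, params_per_expert=1_000_000)
tied = moe_expert_params(num_layers=16, num_experts=64, params_per_expert=1_000_000, tie_group=2)
ratio = untied / tied  # 2.0 for the expert parameters alone
```

Routers and attention weights stay per layer under this scheme, so the whole-model saving would be somewhat below the expert-only ratio, which is consistent with the "almost 2x" wording in the abstract.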

[NLP-19] Scaling LLM Reasoning from Minimal Labels: A Semi-Supervised Framework with a Lightweight Verifier LREC2026

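Quick read: This paper addresses the reliance of reasoning training for Large Language Models (LLMs) on large numbers of correctly annotated answers, which are costly to obtain and hard to scale. It proposes a semi-supervised framework that turns reasoning verification itself into a data creation mechanism, scaling reasoning learning from minimal supervision. A lightweight reasoning-correctness classifier is trained on only a few labeled samples to judge whether intermediate reasoning traces generated by an LLM are valid; an entropy-based confidence threshold then filters out unreliable samples, and the remaining high-confidence traces are used to fine-tune the model. Experiments on verifiable math problems (an Orca-Math subset) and question answering on image scene graphs (GQA) with Visual Programming show accuracy comparable to using 10-15x more labeled data. Ablations confirm that both the classifier and the entropy filter are essential for scalable, noise-resistant pseudo-labeling. By replacing expensive answer-level supervision with lightweight reasoning verification, the method offers a practical path toward large-scale reasoning resources and toward autonomous reasoning systems that learn from minimal human input.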

Link: https://arxiv.org/abs/2606.16811
Authors: Keizo Kato, Chenhui Chu, Yugo Murawaki, Sado Kurohashi
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: LREC 2026. Section 3.3 is updated

Abstract:For the development of Large language models (LLMs), recent approaches to generating pseudo intermediate reasoning have shown remarkable progress. But they typically rely on large numbers of correctly annotated answers to assess reasoning quality. This paper presents a semi-supervised framework that scales reasoning learning from minimal supervision, turning reasoning verification itself into a data creation mechanism. We train a lightweight reasoning-correctness classifier on only a few labeled samples, which judges whether intermediate reasoning traces generated by an LLM are valid. Furthermore, an entropy-based confidence threshold filters out unreliable samples, and the remaining high-confidence reasoning traces are used to fine-tune the model. Experiments on Verifiable Math Problems (Orca-Math subset) and Question Answering on Image Scene Graphs (GQA) with Visual Programming show that our method achieves accuracy comparable to using 10-15x more labeled data. Ablation analyses confirm that both the classifier and entropy filtering are essential for scalable and noise-resistant pseudo-labeling. By replacing expensive answer-level supervision with lightweight reasoning verification, our method provides a practical path toward constructing large-scale reasoning resources and paves the way for future autonomous reasoning systems that learn from minimal human input.
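
The filtering step can be sketched as follows, assuming the verifier outputs a probability that a trace is valid and that confidence is measured by the binary entropy of that probability. The threshold value, the function names, and the example scores are hypothetical; the abstract does not give the exact criterion.

```python
import math


def binary_entropy(p):
    """Shannon entropy in bits of a Bernoulli distribution with P(valid) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def select_pseudo_labels(scored_traces, max_entropy=0.5):
    """Keep traces that the verifier judges valid with low predictive entropy."""
    kept = []
    for trace, p_valid in scored_traces:
        # Require a "valid" verdict and a confident (low-entropy) prediction.
        if p_valid > 0.5 and binary_entropy(p_valid) <= max_entropy:
            kept.append(trace)
    return kept


# Hypothetical verifier scores for three generated reasoning traces.
scored = [("trace A", 0.97), ("trace B", 0.60), ("trace C", 0.03)]
selected = select_pseudo_labels(scored)  # only "trace A" passes both checks
```

Trace B is judged valid but with high entropy, so it is dropped as unreliable; trace C is confidently judged invalid. Only the surviving traces would be passed on to fine-tuning.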

[NLP-20] Connecting Speech to Words through Images

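Quick read: This paper asks how to learn the mapping between written words and their spoken counterparts without explicit textual supervision, that is, how to build a vocabulary linking written words to spoken segments without transcripts or annotated text. The proposed visually grounded method proceeds in steps: image captioning systems first produce a vocabulary of written words for salient visual concepts in the images; for each word, the method finds the spoken utterances whose image captions contain that word; an unsupervised word discovery technique then aligns these utterances to locate instances of the target word. No text supervision is used at any point, only images and their spoken descriptions. In spoken word retrieval and keyword spotting experiments the approach outperforms a strong neural baseline while being more interpretable, demonstrating feasibility in English and motivating future work on low-resource languages without transcripts.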

Link: https://arxiv.org/abs/2606.16807
Authors: Gabriel Pirlogeanu, Dan Oneata, Horia Cucu, Herman Kamper
Affiliation: University of Cape Town; Herman Kamper
Categories: Computation and Language (cs.CL)
Comments: Accepted at EUSIPCO 2026 - 5 pages, 3 figures, 2 tables

Abstract:How can we learn the mapping between written words and their spoken counterparts in the absence of explicit textual supervision? We present a visually grounded method for building a vocabulary of spoken words using only images and their spoken descriptions. First, image captioning systems are used to build a vocabulary of written words representing salient visual concepts in the images. For each word, we then find utterances whose image captions contain that word. Then we use an unsupervised word discovery technique to align these utterances to locate instances of the target word. The result is spoken word segments that are linked to written words – all accomplished without any text supervision. In spoken word retrieval and keyword spotting experiments, the proposed approach outperforms a strong neural baseline while being more interpretable. These results demonstrate the feasibility of the approach in English and motivate future work on low-resource languages without transcripts.

[NLP-21] LLM-based Visual Code Completion for Aerospace Geometric Design

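Quick read: This paper addresses how to apply generative AI to geometric design automation in aerospace engineering while preserving safety and explainability. Although Large Language Models (LLMs) and Vision Language Models (VLMs) have made marked progress in visual code completion, the aerospace industry, which prioritizes safety and explainability, has no publicly announced LLM-based geometric design copilot in commercial use. The authors present an LLM-based visual programming copilot that uses a visual programming variant of the ReAct (Reasoning and Acting) methodology with GPT 5.4. Two supporting contributions are central: Wingbuilder, a new Grasshopper plugin library with custom components for aerospace-specific geometry abstraction, and the Aerospace Visual Programming Dataset (AVPD), which contains 18 expert-designed tasks at different difficulty levels with ground truth solutions. In a user trial with two experienced aerospace engineers, the copilot generated suggestions that participants found helpful, but slow ReAct inference limits its usefulness to complex, time-consuming tasks where waiting for a good suggestion is worthwhile. Participants reported that they liked the tool and would be willing to use it in the future.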

Link: https://arxiv.org/abs/2606.16806
Authors: Hau Kit Yong, Robert Marsh, Edmar A. Silva, András Sóbester, Stuart E. Middleton
Affiliation: University of Southampton
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Recent advances in both Large Language Models (LLMs) and Vision Language Models (VLMs) have seen a step change in their ability to perform visual code completion, but the aerospace industry, which prioritizes safety and explainability over rapid LLM adoption, currently has no publicly announced LLM-based geometric design copilot systems in commercial use by aerospace Original Equipment Manufacturers (OEMs). This paper presents an LLM-based visual programming copilot application for aerospace engineering design tasks, using a visual programming variant of the ReAct methodology and GPT 5.4. In addition to the copilot, we describe Wingbuilder, a new Grasshopper plugin library with custom components for aerospace-specific geometry abstraction, and an associated Aerospace Visual Programming Dataset (AVPD) with 18 aerospace expert designed tasks at different levels of difficulty alongside ground truth solutions. We evaluate our copilot application with a user trial involving two experienced aerospace engineers from a large aircraft manufacturing company. We find our copilot visual programming ReAct methodology was successful in generating suggestions that participants found helpful, but slow ReAct inference times limit its usefulness to more complex, time-consuming tasks where waiting for a good copilot suggestion was worthwhile. Participants reported they liked the tool and would be willing to use it in the future.

[NLP-22] The Art of Mixology: Mixup-based Obfuscation for Privacy-Preserving Split Learning in Large Language Models

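Quick read: This paper addresses the trade-off between privacy protection and model utility that resource-constrained users face when training Large Language Models (LLMs) with split learning. Existing privacy-preserving split learning methods often suffer from substantial utility degradation, remain vulnerable to advanced data reconstruction attacks, incur prohibitive computational and communication overhead, or perform unstably across tasks. The authors propose MIXGUARD, a mixup-based privacy-preserving split learning framework that combines token-level obfuscation, representation-level obfuscation, and adaptive gradient perturbation, which operate jointly to preserve useful learning signals while preventing privacy leakage to the server. Technically, MIXGUARD first builds a lightweight calibration model on a public dataset to refine the approximated target representation, then applies that model during privacy-preserving fine-tuning on private data. Extensive experiments on four classification tasks and four text generation tasks, across multiple LLM families, sizes, architectures, and fine-tuning strategies, show that MIXGUARD keeps utility comparable to non-split training baselines, protects privacy more strongly than existing split learning defenses against state-of-the-art data reconstruction attacks, and remains robust under adaptive attacks.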

Link: https://arxiv.org/abs/2606.16801
Authors: Chen Chen, Xiang Gao, Xianshun Wang, Chengran Li, Shengyu Xia, Xueluan Gong, Linru Zhang, Qian Wang, Kwok-Yan Lam
Affiliation: Nanyang Technological University; Wuhan University
Categories: Computation and Language (cs.CL)
Comments: 19 pages, 5 figures

Abstract:Split learning provides a practical paradigm for resource-constrained users to train Large Language Models (LLMs) by offloading computation-intensive layers to a server while keeping raw data local. However, existing privacy-preserving split learning methods still face a difficult trade-off among utility, privacy, efficiency, and stability. Specifically, these methods often suffer from substantial utility degradation, remain vulnerable to advanced data reconstruction attacks, incur prohibitive computational and communication overhead, or exhibit unstable performance across different tasks. In this paper, we propose MIXGUARD, a novel mixup-based privacy-preserving split learning framework for LLMs. MIXGUARD introduces token-level obfuscation, representation-level obfuscation, and adaptive gradient perturbation mechanisms, which operate jointly to preserve useful learning signals while preventing privacy leakage to the server. Technically, MIXGUARD first constructs a lightweight calibration model on a public dataset to refine the approximated target representation, and then applies this model during privacy-preserving fine-tuning on private data. We conduct extensive experiments on four classification tasks and four text generation tasks across multiple LLM families, model sizes, architectures, and fine-tuning strategies. The results show that MIXGUARD preserves model utility comparable to non-split training baselines, consistently achieves stronger privacy protection than existing split learning defense methods against state-of-the-art data reconstruction attacks, and remains robust under adaptive attack settings.
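
For readers unfamiliar with mixup, the sketch below shows the generic operation, a convex combination of two vectors, applied to a private and a public representation. This is the textbook mixup formula, not MIXGUARD's actual obfuscation scheme, which the abstract does not detail; the function name, the vectors, and the mixing weight are hypothetical.

```python
def mixup(x_private, x_public, lam):
    """Element-wise convex combination: lam * x_private + (1 - lam) * x_public."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(x_private) != len(x_public):
        raise ValueError("vectors must have the same dimension")
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_private, x_public)]


# Hypothetical 3-dimensional representations and mixing weight.
mixed = mixup([1.0, 0.0, 2.0], [0.0, 4.0, 2.0], lam=0.75)  # [0.75, 1.0, 2.0]
```

The intuition is that the server only ever sees the blended vector, so the private input is harder to reconstruct, while a weight close to 1 keeps most of the useful signal. How the weight is chosen and how labels or gradients are handled would depend on the method's own design.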

[NLP-23] OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models

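Quick read: This paper addresses the lack of reusable, effective, and generalizable skills for Large Language Model (LLM) agents solving complex tasks in real-world systems. To improve tool use, multi-step reasoning, and dynamic environment interaction, it proposes Collective Skill Tree Search (CSTS), a tree-search-based skill construction framework with two iterative phases. Collective Skill Node Generation (CSN-Gen) draws on the collective knowledge of multiple models to explore diverse candidate skills for each subtask. Collective Skill Node Assessment (CSN-Assess) uses multiple models as judges and selects skill nodes with two scores: collective quality scoring, which aggregates independent evaluations into a robust estimate of skill effectiveness, and collective transferability scoring, which explicitly verifies whether a skill generalizes across different models. CSTS yields a structured, diverse, and generalizable tree of skills together with skill-augmented training data. The paper also introduces Collective Skill Reinforcement Learning, which actively selects multiple relevant skills from the tree to broaden solution-space exploration and avoid the homogeneous or suboptimal solutions that come from relying on a single skill. The resulting model, OpenClaw-Skill, shows strong agentic capabilities in long-horizon planning, tool use, and generalization over challenging benchmarks.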

Link: https://arxiv.org/abs/2606.16774
Authors: Tianyi Lin, Chuanyu Sun, Jingyi Zhang, Changxu Wei, Huanjin Yao, Shunyu Liu, Xikun Zhang, Liu Liu, Jiaxing Huang
Affiliation: The Hong Kong Polytechnic University; Nanyang Technological University; Tsinghua University; Royal Melbourne Institute of Technology; Beijing University of Aeronautics and Astronautics
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 13 pages, 2 figures

Abstract:Equipping Large Language Model (LLM) agents with effective skills is crucial for solving complex tasks in real-world systems like OpenClaw. In this work, we aim to develop a framework that automatically constructs such reusable skills to enhance LLMs in tool use, multi-step reasoning, and dynamic environment interaction. To this end, we propose Collective Skill Tree Search (CSTS), a novel tree-search-based skill construction framework that constructs a structured, diverse and generalizable tree of skills. The core idea of CSTS is to leverage collective intelligence to jointly search, identify and compose effective skills via two iterative phases: Collective Skill Node Generation (CSN-Gen) and Collective Skill Node Assessment (CSN-Assess). CSN-Gen exploits collective knowledge from multiple models to explore diverse candidate skills for each subtask, enabling comprehensive skill exploration. CSN-Assess employs multiple models as judges to evaluate and select skill nodes with two scoring mechanisms: (1) collective quality scoring that aggregates independent evaluations to produce a robust estimate of skill effectiveness, and (2) collective transferability scoring that explicitly verifies whether a skill generalizes well across different models. With CSTS, we construct a comprehensive tree of skills along with skill-augmented training data, enabling models to effectively learn and utilize skills. In addition, we introduce Collective Skill Reinforcement Learning, which actively selects multiple relevant skills from the tree to broaden solution-space exploration and avoid being trapped by a single skill and its resulting homogeneous or suboptimal solutions. As a result, our trained model, OpenClaw-Skill, exhibits outstanding agentic capabilities in long-horizon planning, tool use and generalization over challenging benchmarks.

[NLP-24] P3B3: A Multi-Turn Conversational Benchmark for Measuring European and Brazilian Portuguese Variety Bias in LLMs ACL2026

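Quick read: This paper addresses the uneven treatment of Portuguese varieties in Large Language Models (LLMs), specifically the asymmetry between European Portuguese (pt-PT) and Brazilian Portuguese (pt-BR). Although pt-BR dominates in data quantity, LLM preference between the two varieties remains underexplored. The authors introduce P3B3, an expert-curated, language-variety-agnostic benchmark of conversational prompts, together with an evaluation framework for measuring variety bias and controllability. Experiments show that most LLMs exhibit a strong bias toward pt-BR, with controllability varying across models, which highlights the need for more balanced representation across language varieties. The key contribution is the combination of a curated benchmark and a systematic evaluation framework that exposes and quantifies variety bias, giving future work on more inclusive multi-variety models a usable evaluation tool.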

Link: https://arxiv.org/abs/2606.16753
Authors: Rafael Ferreira, Inês Vieira, Inês Calvo, James Furtado, Iago Paulo, Diogo Tavares, Diogo Glória-Silva, David Semedo, João Magalhães
Affiliation: NOVA University of Lisbon; NOVA LINCS
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at MeLLM Workshop at ACL 2026

Abstract:As Large Language Models (LLMs) become embedded in everyday communication, capturing regional linguistic variation is essential for reliable and equitable language use. In Portuguese, European (pt-PT) and Brazilian (pt-BR) varieties remain unevenly represented, with pt-BR dominating in data quantity, while LLM preference for Portuguese variants remains underexplored. To address this gap, we introduce P3B3, an expert-curated language variety agnostic benchmark of conversational prompts, along with an evaluation framework for measuring variety bias and controllability. Experiments on several models show that most LLMs exhibit a strong bias toward pt-BR, with variation in controllability across models. These results highlight the need for more balanced multilingual representation across language varieties.

[NLP-25] MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents

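Quick read: This paper addresses a gap in current benchmarks for computer-use agents: they evaluate models in impersonal environments, so they cannot test what a personal assistant must handle across a user's digital life, including context, historical data, and logged-in accounts. The gap is widest on web tasks, since live web evaluations cannot exercise sites that require logging in or personal information. The authors introduce MyPCBench, a Linux desktop populated with 17 simulated real-world web applications and a full desktop stack, all seeded for one canonical persona, Michael Scott from The Office. The environment defines 184 tasks, each inspired by a real request from the OpenClaw community. Six closed and open-weight models are benchmarked with a uniform computer+bash tool surface; the best model, Claude Opus 4.6, fully solves 55.4% of the tasks and is the only model above 50%. Failures cluster on tasks that span many applications and on long trajectories, where personalization stresses an assistant the most. The core of the approach is a realistic, personalized desktop environment with persistent user state, which allows a more faithful evaluation of agents in long interactions across multiple systems.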

Link: https://arxiv.org/abs/2606.16748
Authors: Lawrence Keunho Jang, Andrew Keunwoo Jang, Jing Yu Koh, Ruslan Salakhutdinov
Affiliation: Carnegie Mellon University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Current benchmarks for computer-use agents evaluate models in impersonal environments. This leaves a gap between evaluation and deployment where personal assistants are expected to work across a user’s whole digital life, including their context, historical data, and logged-in accounts. This gap is widest on web tasks, where live web evaluations cannot exercise sites that require logging in or personal information, the kind of site a real personal assistant has to drive. We introduce MyPCBench, which tests computer-use agents as personal assistants on a Linux desktop populated with 17 simulated real-world web applications and a full desktop stack, all seeded for one canonical persona, Michael Scott from The Office. We define 184 tasks in this environment, each inspired by a real request drawn from the OpenClaw community, and benchmark six closed and open-weight models with a uniform computer+bash tool surface. We find that the best model, Claude Opus 4.6, fully solves 55.4% of the tasks, the only model above 50%. Model failures cluster on tasks that span many applications and on long trajectories, where personalization stresses an assistant the most. We release the environment, task set, and agent harness at this https URL.

[NLP-26] Multi-Turn Reflective Masking Elicits Reasoning in Mask Diffusion Models

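Quick read: This paper addresses the fact that autoregressive (AR) models refine earlier outputs through fully sequential generation even when only local edits are needed, which is inefficient and unlike the way humans correct mistakes by iterative local refinement. The masking mechanism in Mask Diffusion Models (MDMs) naturally supports local edits, but existing MDMs do not support multi-turn masking and denoising. The authors propose Reflective Masking (RM), which elicits this intrinsic capability through lightweight post-training and provides a native form of test-time scaling in which an MDM iteratively revisits and revises its prior outputs based on evolving context, with no architectural changes. A key component is History Reference, a parameter-free mechanism that leverages intermediate denoising states during revision so the model can draw on insights from previous turns. Across tasks and modalities, including text generation, Sudoku, and image editing, RM consistently outperforms standard masking-based baselines and shows strong generality, positioning it as a fundamental primitive for reasoning on MDMs.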

Link: https://arxiv.org/abs/2606.16700
Authors: Yanming Zhang, Yihan Bian, Jingyuan Qi, Yuguang Yao, Lifu Huang, Tianyi Zhou
Affiliation: University of Maryland; Virginia Tech; Intuit; UC Davis; MBZUAI
Categories: Computation and Language (cs.CL)
Comments: 22 pages, 6 figures, 5 tables

Abstract:While reasoning on autoregressive (AR) models is often performed by chain-of-thought reasoning and reflection, their refinement of previous outputs still relies on fully sequential generation, even when only local edits are needed. In contrast, the masking mechanism in Mask Diffusion Models (MDMs) naturally supports explicit local edits on previous outputs, allowing selective refinement without discarding previous answers and generating another from scratch. While this property more closely aligns with how humans correct mistakes by iterative local refinement, existing MDMs do not support multi-turn masking and denoising. We propose Reflective Masking (RM), which elicits such an intrinsic reasoning capability in MDMs via lightweight post-training. RM provides a native test-time scaling, where an MDM iteratively revisits and revises its prior outputs based on evolving context. To exploit insights from previous turns like AR reasoning, we further introduce History Reference, a parameter-free mechanism that leverages intermediate denoising states during revision. Our approach requires no architectural changes and is easily applicable to existing MDMs. Across diverse tasks and modalities, including text generation, Sudoku, and image editing, Reflective Masking consistently outperforms standard masking-based baselines and demonstrates strong generality, positioning RM as a fundamental primitive for reasoning on MDMs.

[NLP-27] From Affect Prediction to Affect Forecasting: Evidence for Distinct Information Sources in Longitudinal Text

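Quick read: This paper addresses two distinct tasks in modeling affect from longitudinal text: estimating current affect and forecasting future affective change. Existing approaches usually treat each text as an independent observation and apply similar assumptions to both tasks, without testing whether the tasks rely on different information sources. The authors propose the Trait–State Affective Prediction (TSAP) framework and its temporal extension E-TSAP for per-text valence and arousal prediction, and the Affective Change Forecaster Hybrid (ACF-Hybrid) for next-step affective change forecasting. Prediction is evaluated on 1,737 ecological essay and feeling-word entries from 91 users. E-TSAP achieves composite Pearson correlations of 0.670 for valence and 0.449 for arousal. For forecasting, text performs worse than compact numeric trajectory features: the text-inclusive model reaches only r=0.316 for valence and r=0.284 for arousal, whereas a simple prior-state baseline reaches r=0.615 and r=0.670, respectively. ACF-Hybrid, which uses dimension-specific numeric trajectory features, achieves r=0.659 for valence and r=0.658 for arousal. The conclusion is that textual semantics support current affect prediction, whereas future affective change is better captured through prior numeric trajectory dynamics.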

Link: https://arxiv.org/abs/2606.16687
Authors: Sadia Noor, Seemab Latif, Raja Khurram Shahzad, Mehwish Fatima
Affiliation: SEECs, Pakistan; Mid Sweden University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Modeling dimensional affect in longitudinal text requires distinguishing current affect estimation from future affective change forecasting. Existing approaches often treat each text as an independent observation and apply similar assumptions to both tasks, without testing whether they rely on different information sources. This paper investigates that distinction using longitudinal self-reported ecological essays and feeling-word entries. We propose the Trait–State Affective Prediction (TSAP) framework and its temporal extension E-TSAP for per-text valence and arousal prediction, evaluated on a held-out prediction test set of 1,737 entries from 91 users. We further propose the Affective Change Forecaster Hybrid (ACF-Hybrid) for next-step affective change forecasting, evaluated on a held-out forecasting test set of 46 users. For prediction, E-TSAP achieves composite Pearson correlations of 0.670 for valence and 0.449 for arousal. For forecasting, textual representations perform worse than compact numeric trajectory baselines: the text-inclusive model achieves only r=0.316 for valence and r=0.284 for arousal, whereas a simple prior-state baseline reaches r=0.615 and r=0.670, respectively. ACF-Hybrid, using dimension-specific numeric trajectory features, achieves r=0.659 for valence and r=0.658 for arousal. These results show that textual semantics support current affect prediction, whereas future affective change is better captured through prior numeric trajectory dynamics.

[NLP-28] Progressive Knowledge-Guided Large Language Model Framework for Bearing Fault Diagnosis

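Quick read: This paper addresses three interrelated measurement challenges in vibration-based bearing fault diagnosis: the trade-off between global statistical feature efficiency and local transient signal fidelity, insufficient traceability of measurement features to the underlying fault physics, and ineffective fusion of multi-source measurement information across diagnostic scales. It proposes a progressive physics-guided multi-scale vibration signal processing framework that handles all three in a unified diagnostic pipeline. First, an 81-dimensional measurement descriptor derived from bearing kinematic theory and characteristic defect frequencies provides a physically traceable feature space and enables real-time fault screening at approximately 20 ms per sample. Second, a fault-adaptive signal segmentation mechanism uses physics-based priors to direct analysis toward fault-relevant waveform regions without manual feature engineering. Third, structured fault mechanism knowledge is encoded implicitly in the model parameters during training, so multi-scale fusion at inference needs no external knowledge. On four public benchmark datasets under diverse operating conditions, the framework reaches 98.49% diagnostic accuracy with a 12.6-fold reduction in computational cost relative to signal-level baselines. Interpretability analysis confirms that feature activations align with established bearing fault mechanics, supporting measurement traceability in safety-critical industrial systems.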

Link: https://arxiv.org/abs/2606.16684
Authors: Jinghan Wang, Gaoliang Peng, Yanjun Chen, Wei Zhang, Wentao Wu, Tianchen Liu
Affiliation: unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Vibration-based bearing fault diagnosis requires resolving three interrelated measurement challenges, including the trade-off between global statistical feature efficiency and local transient signal fidelity, insufficient traceability of measurement features to underlying fault physics, and ineffective multi-source measurement information fusion across diagnostic scales. This paper presents a progressive physics-guided multi-scale vibration signal processing framework that addresses all three challenges within a unified diagnostic pipeline. An 81-dimensional measurement descriptor, derived from bearing kinematic theory and characteristic defect frequencies, establishes a physically traceable feature space enabling real-time fault screening at approximately 20 ms per sample. A fault-adaptive signal segmentation mechanism then directs analytical attention toward fault-relevant waveform regions guided by physics-based priors, without manual feature engineering. Structured fault mechanism knowledge is further encoded implicitly in model parameters during training, enabling autonomous multi-scale measurement fusion without external knowledge dependencies at inference. Validated on four public benchmark datasets under diverse operating conditions, the framework achieves 98.49% diagnostic accuracy with a 12.6-fold reduction in computational cost relative to signal-level baselines. Interpretability analysis confirms that diagnostic feature activations align with established bearing fault mechanics, supporting measurement traceability in safety-critical industrial systems.
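
The "characteristic defect frequencies" mentioned above are standard quantities from bearing kinematics. The sketch below computes the four usual ones with the textbook formulas, assuming a stationary outer race and an inner race rotating with the shaft. It is not the paper's 81-dimensional descriptor; the function name and the example geometry are hypothetical.

```python
import math


def bearing_defect_frequencies(shaft_hz, n_balls, ball_d, pitch_d, contact_deg=0.0):
    """Textbook defect frequencies (Hz) for a rolling-element bearing with a
    stationary outer race. ball_d and pitch_d must share the same length unit."""
    ratio = (ball_d / pitch_d) * math.cos(math.radians(contact_deg))
    return {
        # Fundamental train (cage) frequency.
        "FTF": 0.5 * shaft_hz * (1 - ratio),
        # Ball pass frequency, outer race.
        "BPFO": 0.5 * n_balls * shaft_hz * (1 - ratio),
        # Ball pass frequency, inner race.
        "BPFI": 0.5 * n_balls * shaft_hz * (1 + ratio),
        # Ball spin frequency.
        "BSF": 0.5 * (pitch_d / ball_d) * shaft_hz * (1 - ratio ** 2),
    }


# Hypothetical geometry: 9 balls, 8 mm ball diameter, 40 mm pitch diameter,
# shaft turning at 30 Hz (1800 rpm), zero contact angle.
freqs = bearing_defect_frequencies(shaft_hz=30.0, n_balls=9, ball_d=8.0, pitch_d=40.0)
```

Two quick sanity checks follow from the formulas: BPFO plus BPFI equals the number of balls times the shaft frequency, and BPFO equals the number of balls times FTF. Spectral energy near these frequencies and their harmonics is what physically motivated features of this kind typically measure.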

[NLP-29] Multimodal Evaluator Preference Collapse: Cross-Modal Contagion in Self-Evolving Agents

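Quick read: This paper addresses the systematic biases that emerge when AI agents use language models to evaluate their own outputs in a feedback loop, and in particular shows that Evaluator Preference Collapse (EPC) is dramatically amplified in multimodal settings. It introduces cross-modal contagion: evaluator preferences acquired on one modality transfer to another modality and corrupt strategy selection there, to the point that the optimal strategy for a modality can reverse after cross-modal exposure. A four-phase isolation training paradigm is used to measure contagion coefficients. The study finds that cross-model evaluation (for example, GPT-4o evaluating another model) produces strong but symmetric bidirectional contagion, whereas self-evaluation provides near-complete immunity. The authors introduce a contagion matrix indexed by evaluator identity and identify cross-model evaluator architecture as the primary risk factor for preference contagion, which offers a theoretical basis and an experimental framework for designing robust multimodal self-evaluation systems.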

Link: https://arxiv.org/abs/2606.16682
Authors: Zewen Liu
Affiliation: Qilu Institute of Technology, School of Software Engineering
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 19 pages, 0 figures

Abstract:When AI agents use language models to evaluate their own outputs in a feedback loop, systematic biases emerge. We show that Evaluator Preference Collapse (EPC) is dramatically amplified in multimodal settings. Using GPT-4o to evaluate DeepSeek-chat across text and visual tasks, we find that a single strategy (step_by_step) absorbs 48.4% of all weight – 3.2x the collapse observed in text-only self-evaluation – while three visual-domain strategies receive only 9.1% combined weight. We then demonstrate a novel phenomenon we term cross-modal contagion: evaluator preferences acquired on one modality transfer to and corrupt strategy selection on another. Through a four-phase isolation training paradigm, we measure contagion coefficients and document strategy inversion – the optimal strategy for a modality reverses after cross-modal exposure. A Phase 3 statistical validation across four evaluator configurations (N=53 total independent repetitions, 15,592 API calls) reveals a clear hierarchy: cross-model evaluation (GPT-4o, N=8) produces strong but symmetric bidirectional contagion (mean gamma_T-V=1.176, gamma_V-T=1.089, Delta=-0.088, p=0.575, Cohen’s d=0.29); high round counts (DashScope, 50 rounds) cause collapse to single-strategy dominance (70% zero contagion); and self-evaluation provides near-complete immunity – 97% of runs (N=30, DeepSeek-chat) yield exactly zero contagion (mean gamma=0.033, 95% CI [-0.031, 0.010], p=0.642, d=0.07). No evaluator condition shows statistically significant directional asymmetry. We introduce the contagion matrix indexed by evaluator identity, release the MM-EPC experimental framework, and identify cross-model evaluator architecture as the primary risk factor for preference contagion.

[NLP-30] FraudSMSWalker: Benchmarking Agentic Large Language Models for SMS-to-Webpage Fraud Detection

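Quick read: This paper addresses a flaw in how cross-channel SMS fraud (smishing) detection is evaluated: existing evaluations either classify the message text alone or expose URL and domain cues, which lets models rely on reputation shortcuts and hides whether they can actually judge the consistency between the SMS claim and the webpage content. The authors introduce FraudSMSWalker, a controlled benchmark for URL-masked SMS-to-webpage fraud judgment. The model-visible input contains only the SMS context and sanitized webpage evidence, while raw URLs, hosts, domains, IPs, redirects, and reputation metadata are withheld. The benchmark contains 699 bilingual (Chinese and English) chains across ten service scenarios, 332 fraudulent and 367 benign, and includes hard benign cases whose pages contain login, payment, verification, or account-management elements that are plausible in the service context but also appear in scam flows. Nine web agents are evaluated under masked browser-agent protocols, with URL-visibility ablations. Current agents can detect suspicious cues but struggle to preserve benign recall, and their positive predictions are often weakly supported by the observed evidence. FraudSMSWalker therefore measures whether web agents can make fraud judgments that stay accurate and evidence-grounded when reputation shortcuts are suppressed.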

链接: https://arxiv.org/abs/2606.16659
作者: Y. H. Zhou,Z. M. Ma,Y. J. Zhou,Y. T. Li,H. X. Xiang,Y. M. Cheng,T. L. Chen,K. J. Zhang,Z. H. Nan,J. H. Ni,Z. Wu,Q. Y. Pan,S. Zhang,S. Cheng,M. Y. Luo
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:SMS fraud is increasingly cross-channel: a message directs the user to a webpage, and the final risk depends on how the SMS claim aligns with the page content and requested user action. However, existing evaluations either focus on message-only smishing classification or expose URL and domain cues that allow models to rely on reputation shortcuts. To address this gap, we introduce FraudSMSWalker, a controlled benchmark for URL-masked SMS-to-webpage fraud judgment. FraudSMSWalker contains 699 bilingual chains, including 332 fraudulent and 367 benign cases, across ten service scenarios. The model-visible input consists of the SMS context and sanitized webpage evidence, while raw URLs, hosts, domains, IPs, redirects, and reputation metadata are withheld. The benchmark further includes hard benign cases whose pages contain login, payment, verification, or account-management elements that are plausible under the service context but also appear in scam flows. We evaluate nine web agents under masked browser-agent protocols and conduct URL-visibility ablations. The results show that current agents can detect suspicious cues, but struggle to preserve benign recall and often produce positive predictions that are weakly supported by the observed evidence. These findings position FraudSMSWalker as a benchmark for measuring whether web agents can make fraud judgments that remain both accurate and evidence-grounded when direct reputation shortcuts are suppressed. The associated code and dataset are accessible at the anonymous link (this https URL).
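
The abstract says that raw URLs, hosts, domains, and IPs are withheld from the model-visible input. A minimal sketch of what such masking could look like is given below. The regular expressions, the placeholder tokens, and the function name `mask_reputation_cues` are all assumptions made for illustration; the benchmark's actual sanitization rules are not described in the abstract.

```python
import re

# Hypothetical patterns (assumptions, not the benchmark's real rules).
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
DOMAIN_RE = re.compile(
    r"\b(?:[a-z0-9-]+\.)+(?:com|net|org|cn|io|info|top|xyz)\b", re.IGNORECASE
)


def mask_reputation_cues(text: str) -> str:
    """Replace URLs, IPv4 addresses, and bare domains with placeholder tokens."""
    text = URL_RE.sub("[URL]", text)        # full links first
    text = IPV4_RE.sub("[IP]", text)        # then raw IPv4 addresses
    text = DOMAIN_RE.sub("[DOMAIN]", text)  # then bare domain names
    return text


if __name__ == "__main__":
    sms = "Your parcel is held. Pay the fee at https://example-post.top/pay?id=1 now"
    print(mask_reputation_cues(sms))
```

The order of substitution matters in this sketch: full URLs are masked before bare domains so that a link is not partially rewritten.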

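[NLP-31] Islamic Large Language Models: From Knowledge Acquisition to Trustworthy and Hallucination-Resistant AI

[Quick Read]: This paper addresses the trustworthiness and accuracy challenges that generative AI faces in knowledge-intensive Islamic question answering, especially for religious and legal questions. The core problem is that existing large language models (LLMs), despite their generation ability, struggle to guarantee authoritative answers, traceable sources, and a faithful representation of differences among schools of jurisprudence (madhhab); they tend to hallucinate or collapse multiple legitimate interpretations into one. The survey argues that reliable systems need several components together: a foundation in Arabic NLP and Arabic-centric LLMs, curated authoritative Islamic sources, retrieval-augmented generation, citation-aware generation, madhhab-aware reasoning, human expert evaluation, and benchmarks that cover answer accuracy, faithfulness, source validity, and reasoning quality. It concludes with a research agenda for hallucination-resistant, trustworthy Islamic AI systems that can be applied reliably in religious and legal contexts.

Link: https://arxiv.org/abs/2606.16629
Authors: Mohammed Amine Mouhoub
Affiliations: Paris Dauphine University
Categories: Computation and Language (cs.CL)
Comments: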

Abstract:Large language models (LLMs) are increasingly used for knowledge-intensive question answering, including religious and legal questions. Islamic knowledge is a particularly demanding setting: answers are expected to be grounded in authoritative sources, citations must be exact, Arabic varieties differ substantially from the language of classical sources, and legitimate jurisprudential disagreement must be represented rather than collapsed into a single answer. This survey reviews the emerging field of Islamic LLMs and trustworthy Islamic AI. We organize the literature around Arabic NLP and Arabic-centric LLMs, Islamic NLP resources, Qur’anic question answering, Islamic knowledge benchmarks, retrieval-augmented generation, Islamic legal reasoning, inheritance reasoning, hallucination evaluation, and trustworthiness. We argue that fluency in Arabic is not sufficient for Islamic AI. Reliable systems require curated sources, retrieval and verification modules, citation-aware generation, madhhab-aware reasoning, human expert evaluation, and benchmarks that measure not only answer accuracy but also faithfulness, source validity, and reasoning quality. The survey concludes with a research agenda for hallucination-resistant Islamic AI systems.

[NLP-32] Sycophancy as Material Failure under Pushback Loading: A Multi-Axis Characterization Across Three Loading Cases and up to Seventeen Material Charges

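[Quick Read]: This paper addresses the blurry construct boundaries of sycophancy in large language models (LLMs): although the phenomenon is documented in more than 70 papers, expert agreement on its boundaries remains low (intraclass correlation ICC = .184; Ye et al., 2026). The core difficulty is that behavioral classification depends on which surface form is privileged, so the construct fragments. The paper adopts a materials-science analogy: the conversation is a test specimen under load, the LLM is the material charge, pushback is a progressive load, and a stance flip is material failure. Across three loading cases (debate, n=1000; false presuppositions, n=3400; ethical setting, n=3400; 7800 specimens in total), it uses 14 turn-level axis measurements covering velocity, damage accumulation, frame drift, brittleness, and direction stability, plus three speaker-resolved axes from an independent pipeline. The measurements are Hooke-coupled (an analog of σ = E · ε) and reproduce across loading cases, with effects up to |r_rb| = 0.35 on debate; the sign structure shows that the ethical-setting case inverts the velocity and accumulation blocks. Variance decomposition reveals two profiles: debate is charge-dominated (brittle-fracture-like, decided by the material grade), while false presuppositions and the ethical setting are topic-dominated (creep-like, decided by the load). The ratios (2.03 versus 0.13/0.17) are estimator-dependent, for debate even in direction. Cross-judge reliability (GPT-4o vs. Haiku 4.5) shows that debate scoring is judge-robust (Cohen's κ = 0.88) whereas false-presupposition scoring is judge-sensitive (κ = 0.36), a caveat that single-judge benchmarks should report. The key contribution is a multi-axis characterization that does not depend on any particular surface form, which is the methodological move that Ye et al.'s diagnosis calls for.

Link: https://arxiv.org/abs/2606.16617
Authors: Ferdinand M. Schessl
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
Comments: 12 pages, 3 figures. Code, data, and pre-registrations: this https URL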

Abstract:Sycophancy in LLMs is documented across 70+ papers, but expert agreement on construct boundaries remains low (ICC=.184; Ye et al., 2026). The construct fragments because behavioral classification depends on which surface form is privileged. We adopt a materials-science framing: conversation as test specimen under load, LLM-model as material charge, pushback as progressive load, stance-flip as material failure. We characterize this failure across three loading cases (debate n=1000; false-presuppositions n=3400; ethical-setting n=3400; 10-17 material charges per case; 7800 specimens total) using 14 turn-level axis-measurements spanning velocity, damage accumulation, frame-drift, brittleness, and direction stability, plus three speaker-resolved axes from an independent pipeline. The measurements are Hooke-coupled ($\sigma = E \cdot \varepsilon$ analog) and reproduce across loading cases with effects up to |r_rb| = 0.35 on debate; the sign structure adds a second pattern: the ethical-setting case inverts the velocity and accumulation blocks. Variance composition partitions into two profiles: debate is charge-dominated (brittle-fracture-like: the material grade decides), false-presuppositions and ethical-setting are topic-dominated (creep-like: the load decides); the ratios (2.03 vs 0.13/0.17) are estimator-dependent, for debate even in direction. Cross-judge reliability (GPT-4o vs Haiku 4.5) shows debate scoring is judge-robust (Cohen’s $\kappa = 0.88$) while false-presupposition scoring is judge-sensitive ($\kappa = 0.36$) – a caveat single-judge benchmarks must report. This is the methodological move Ye et al.'s diagnosis calls for: a multi-axis characterization that does not depend on which surface form of the construct one privileges.
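
The cross-judge reliability figures above are Cohen's kappa values. As a reminder of how that statistic is computed, here is a minimal sketch using the standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is chance agreement. The label names and the toy judgments are invented for illustration and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Toy judgments (invented): did the model flip its stance or hold it?
    judge_a = ["flip", "flip", "hold", "hold"]
    judge_b = ["flip", "hold", "hold", "hold"]
    print(cohens_kappa(judge_a, judge_b))
```

In the toy example the judges agree on 3 of 4 items (p_o = 0.75) and chance agreement is 0.5, so kappa is 0.5.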

[NLP-33] VeriGraph: Towards Verifiable Data-Analytic Agents

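[Quick Read]: This paper addresses the problem that the outputs of large language model (LLM) agents on data-intensive analytical tasks are rarely verifiable. Existing methods reason over linear text trajectories, which leaves numerical conclusions hard to reproduce and qualitative judgments hard to inspect; the core difficulty is that deterministic computation and natural-language semantic deduction are entangled in one unstructured stream. The paper proposes VeriGraph, a traceable neuro-symbolic reasoning framework in which the agent builds an explicit heterogeneous evidence directed acyclic graph (DAG) during execution, connecting raw data, interpreter variables, computed results, and natural-language claims in a unified structure. It introduces three evidence-expansion primitives: computational, grounding, and derivational expansion. Under this formulation, structural traceability reduces to graph reachability from raw data sources to terminal claims, and semantic support is measured by claim-level evidence evaluation. To improve graph construction, the authors design a graph-based policy optimization method with a composite reward that jointly supervises answer correctness, computational integrity, and derivational coherence. On four benchmarks, VeriGraph-8B achieves the highest overall score among all baselines, and its evidence graphs show substantially stronger claim grounding, reaching an 87.61% Grounding Rate under claim-level evidence support evaluation. The results suggest that explicit evidence-graph construction is a promising path toward verifiable data-analytic agents.

Link: https://arxiv.org/abs/2606.16603
Authors: Jiajie Jin,Zhao Yang,Wenle Liao,Yuyang Hu,Guanting Dong,Xiaoxi Li,Yutao Zhu,Zhicheng Dou
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages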

Abstract:LLM-based agents have demonstrated strong capabilities in data-intensive analytical tasks, yet their outputs are rarely verifiable: a reliance on linear text trajectories makes their reasoning difficult to audit. In particular, deterministic computations over raw data and semantic deductions over natural-language claims are often entangled in an unstructured stream, leaving numerical conclusions hard to reproduce and qualitative judgments hard to inspect. To address this, we propose VeriGraph, a traceable neuro-symbolic reasoning framework that enables agents to construct an explicit heterogeneous evidence directed acyclic graph (DAG) during execution. VeriGraph introduces three evidence-expansion primitives, namely computational, grounding, and derivational expansion, to connect raw data, interpreter variables, computed results, and natural-language claims in a unified graph. Under this formulation, structural traceability is reduced to graph reachability from raw data sources to terminal claims, while semantic support is measured by claim-level evidence evaluation. To improve graph construction, we further design a graph-based policy optimization strategy with a composite reward that jointly supervises answer correctness, computational integrity, and derivational coherence. Experiments on four benchmarks show that VeriGraph-8B achieves the highest overall score among all baselines. More importantly, VeriGraph produces auditable evidence graphs with substantially stronger claim grounding, achieving an 87.61% Grounding Rate under our claim-level evidence support evaluation. These results suggest that explicit evidence-graph construction is a promising path toward verifiable data-analytic agents. Our code is available at this https URL.
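
The abstract reduces structural traceability to graph reachability from raw data sources to terminal claims. The sketch below illustrates that idea with a plain breadth-first search over a toy evidence graph. The node names, the edge list, and both function names are hypothetical; this is not VeriGraph's data model or code.

```python
from collections import deque


def reachable_from_sources(edges, sources):
    """Return every node reachable from the given source nodes (BFS)."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
    seen = set(sources)
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def untraceable_claims(edges, sources, claims):
    """Claims with no directed path from any raw data source."""
    reachable = reachable_from_sources(edges, sources)
    return [claim for claim in claims if claim not in reachable]


if __name__ == "__main__":
    # Toy evidence graph (invented): data -> variable -> result -> claim.
    edges = [
        ("sales.csv", "df"),
        ("df", "mean_revenue"),
        ("mean_revenue", "claim:revenue_rose"),
    ]
    claims = ["claim:revenue_rose", "claim:customers_happier"]
    print(untraceable_claims(edges, ["sales.csv"], claims))
```

A claim that is not reachable from any data source has no computational backing in the graph, which is the kind of gap an auditor would want flagged.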

[NLP-34] How Far Can Machine Translation Quality Take You? Extrinsic Discourse Evaluation in Goal-Oriented Setups

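[Quick Read]: This paper addresses a long-standing limitation of machine translation (MT) evaluation: mainstream metrics and discourse-focused evaluations assess quality intrinsically and do not measure the downstream consequences of translation errors. It proposes goal-oriented extrinsic discourse evaluation under two regimes, static and interactive. In the static regime, an entity counting task serves as an extrinsic probe of referential consistency; in the interactive regime, the goal-oriented multi-agent Welfare Diplomacy game probes long-horizon communication and coordination. The study finds that systems with high intrinsic translation quality can still produce referential inconsistencies, and that interaction-specific translation failures affect downstream coordination. It therefore highlights goal-oriented environments as a viable framework for discourse-sensitive extrinsic MT evaluation.

Link: https://arxiv.org/abs/2606.16596
Authors: Wafaa Mohammed,Kata Naszadi,Vlad Niculae
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: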

Abstract:Existing machine translation (MT) metrics and discourse-focused evaluations primarily assess translation quality intrinsically, without measuring the downstream consequences of translation errors. In this work, we focus on extrinsic discourse evaluation of machine translation under two distinct regimes: static and interactive. Under the static regime, we propose an entity counting task as a probe of referential consistency in discourse. We show that high intrinsic MT quality does not reliably predict downstream discourse success and strong MT systems still produce referential inconsistencies. For the interactive regime, we study the goal-oriented multi-agent Welfare Diplomacy game as a probe of long-horizon communication and coordination. We find that interaction-specific translation failures impact downstream coordination. Our results highlight goal-oriented environments as a viable framework for discourse-sensitive extrinsic MT evaluation.

[NLP-35] SING: Synthetic Intention Graph for Scalable Active Tool Discovery in LLM Agents

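[Quick Read]: This paper addresses inefficient tool discovery and weak semantic alignment when large language model (LLM) agents call tools in complex digital environments. As tool ecosystems grow to hundreds or thousands of APIs and services, exhaustive tool schema injection becomes costly and imposes a closed-world assumption that cannot accommodate needs that emerge as a task evolves. Existing one-shot retrieval methods often fail to align isolated tool descriptions with the agent's true task intention, especially in long-horizon tasks where required capabilities change through decomposition, observations, and newly induced subgoals. The proposed solution, SING, is an intention-aware active tool discovery framework that builds an intention-tool graph linking user intentions, tool capabilities, and tool collaboration patterns, and retrieves tools dynamically as the task state evolves. On a unified corpus of 7,471 tools and three real-world tool-use benchmarks, SING improves Global Recall@5 by up to 59.8% and downstream success rate by up to 28.9% over baselines, while reducing full-corpus tool-schema exposure by 99.8%. The results indicate that an intention-aware graph structure supports accurate and context-efficient tool discovery in large-scale agentic ecosystems.

Link: https://arxiv.org/abs/2606.16591
Authors: Qiao Xiao,Haochen Shi,Yisen Gao,Wenbin Hu,Huihao Jing,Tianshi Zheng,Baixuan Xu,Ziheng Zhang,Weiqi Wang,Haoran Li,Jiaxin Bai,Yangqiu Song
Affiliations: Cornell University; The Hong Kong University of Science and Technology; The Ohio State University; Hong Kong Baptist University
Categories: Computation and Language (cs.CL)
Comments: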

Abstract:Large language model (LLM) agents increasingly rely on agent harnesses that manage context, tools, and multi-turn execution, making tools a central interface for acting in realistic digital environments. As harness-connected tool ecosystems expand to hundreds or thousands of APIs, services, and task-specific skills, exhaustive tool schema injection becomes costly and imposes a closed-world assumption that limits agents to a predefined static inventory. Retrieval-augmented tool selection offers a natural alternative, but existing one-shot retrieval methods often fail to align isolated tool descriptions with the agent’s true task intention, especially in long-horizon tasks where required capabilities emerge through decomposition, observations, and newly induced subgoals. We propose SING, an intention-aware active tool discovery framework that builds an intention-tool graph linking user intentions, tool capabilities, and tool collaboration patterns, and dynamically retrieves tools according to evolving task states. Using a unified corpus of 7,471 tools, we evaluate SING on three real-world tool-use benchmarks. SING improves Global Recall@5 by up to 59.8% and downstream success rate by up to 28.9% over baselines, while reducing full-corpus tool-schema exposure by 99.8%, demonstrating that intention-aware graph structure enables more accurate and context-efficient tool discovery in large-scale agentic ecosystems.

[NLP-36] Uncertainty Is Not a Safety Net for Clinical VQA but Can It Anticipate Model Failure?

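[Quick Read]: This paper asks whether current uncertainty estimation (UE) methods give clinical vision-language models (VLMs) the signal needed for safe deployment, namely an indication of when a prediction can be trusted. It finds that UE quality is not an intrinsic property of the UE method but tracks model accuracy: it degrades exactly where the model is weakest and where reliability is most needed. Under a stress test that hides the correct option among the multiple-choice answers (NOTA perturbations), accuracy collapses while uncertainty barely changes, leaving models systematically miscalibrated. However, uncertainty on the unperturbed input reliably anticipates which predictions will collapse under perturbation, so UE in current VLMs carries diagnostic information about model fragility. The key proposal is to reposition UE as a diagnostic tool for identifying fragile predictions and to adopt perturbation-based evaluation as a path toward safe clinical deployment.

Link: https://arxiv.org/abs/2606.16583
Authors: Arnisa Fazla,Alberto Testoni,Ameen Abu-Hanna,Barbara Plank,Iacer Calixto
Affiliations: Amsterdam University Medical Center; University of Amsterdam; Amsterdam Public Health, Methodology; MaiNLP, Center for Information and Language Processing, LMU Munich; Munich Center for Machine Learning (MCML); OpenAI
Categories: Computation and Language (cs.CL)
Comments: 17 pages, 4 figures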

Abstract:Safe deployment of clinical vision-language models (VLMs) requires reliable uncertainty estimation (UE): a signal indicating when predictions should be trusted or escalated to a clinician. We test whether current UE methods actually deliver this signal. Benchmarking 8 methods across 12 VLMs on clinical visual question-answering (VQA), we find that UE quality is not an intrinsic property of the UE method: it tracks model accuracy, degrading precisely where the model performance is weakest, and therefore where reliability is most needed. When we stress-test models by hiding the correct option among the multiple-choice answers (NOTA perturbations), accuracy collapses while uncertainty barely changes, leaving models systematically miscalibrated. Yet, we find that uncertainty on the unperturbed input reliably anticipates which predictions will collapse under NOTA, indicating that UE in current VLMs carries diagnostic information about model fragility. Our results position UE as a diagnostic tool for identifying fragile predictions and motivate perturbation-based evaluation as a path toward safe clinical deployment.
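
To make the stress test concrete, the sketch below shows one plausible reading of a NOTA perturbation: the correct option is replaced by "None of the above", so that option becomes the right answer. The abstract does not specify the exact procedure, so this replacement scheme, the function names, and the use of predictive entropy as the uncertainty score are assumptions made for illustration only.

```python
import math


def nota_perturb(options, correct_index, nota_text="None of the above"):
    """Hide the correct option by replacing it with a NOTA option.

    Returns the perturbed option list and the index of the new correct
    answer (the NOTA slot). Assumed scheme, not the paper's exact protocol.
    """
    if not 0 <= correct_index < len(options):
        raise IndexError("correct_index out of range")
    perturbed = list(options)
    perturbed[correct_index] = nota_text
    return perturbed, correct_index


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a distribution over answer options."""
    return -sum(p * math.log(p) for p in probs if p > 0)


if __name__ == "__main__":
    options = ["pneumonia", "pleural effusion", "normal"]  # toy example
    print(nota_perturb(options, correct_index=1))
    print(predictive_entropy([0.5, 0.5]))
```

Comparing `predictive_entropy` on the original and perturbed questions is one simple way to check whether uncertainty moves when accuracy does.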

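[NLP-37] Can LLM Agents Infer World Models? Evidence from Agentic Automata Learning

[Quick Read]: This paper evaluates how well tool-calling LLM agents can infer an unknown deterministic finite automaton (DFA) through interactive environment discovery. The agent uncovers the hidden DFA by interacting with an oracle through membership queries and equivalence queries. The key to the approach is a scalable testbed with controllable task complexity, measurable interaction efficiency, and strong baselines (classic automata-learning algorithms). Experiments show that reasoning models outperform non-reasoning models but still exhibit recurring failures in query planning, evidence integration, and hypothesis construction, and that performance drops sharply as DFA size grows. Overall, current LLM agents can sometimes perform non-trivial interactive discovery but remain far less robust and efficient than classic automata-learning algorithms.

Link: https://arxiv.org/abs/2606.16576
Authors: Reef Menaged,Gili Lior,Shauli Ravfogel,Roee Aharoni,Gabriel Stanovsky
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: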

Abstract:We propose agentic automata learning to evaluate the extent to which tool-calling LLM agents can uncover hidden environments through interaction. In our setup, an agent should uncover a hidden deterministic finite automaton (DFA) by interacting with an oracle through (1) membership queries (“Does this string belong to the target language?”) and (2) equivalence queries (“Is this the target DFA?”). This yields a scalable testbed with controlled task complexity, measurable interaction efficiency, and strong baselines (classic automata-learning algorithms). Evaluating state-of-the-art LLMs, we find that performance drops sharply as DFA size increases. Reasoning models are markedly stronger than non-reasoning models, yet trajectory analyses reveal recurring failures in query planning, evidence integration, and hypothesis construction. Overall, our results show that current LLM agents can sometimes perform non-trivial interactive discovery, but remain far less robust and efficient than classic algorithms for the task.
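
The two query types in this setup can be sketched as a small oracle around a hidden DFA. The classes below are an illustrative assumption, not the paper's testbed: they assume complete DFAs over a shared alphabet of single-character symbols, and the equivalence query returns a shortest counterexample found by breadth-first search over the product automaton.

```python
from collections import deque


class DFA:
    """Complete DFA; transitions maps (state, symbol) -> state."""

    def __init__(self, alphabet, transitions, start, accepting):
        self.alphabet = tuple(alphabet)
        self.transitions = transitions
        self.start = start
        self.accepting = set(accepting)

    def accepts(self, word):
        state = self.start
        for symbol in word:
            state = self.transitions[(state, symbol)]
        return state in self.accepting


class Oracle:
    """Answers membership and equivalence queries about a hidden target DFA."""

    def __init__(self, target):
        self.target = target
        self.membership_queries = 0
        self.equivalence_queries = 0

    def membership(self, word):
        self.membership_queries += 1
        return self.target.accepts(word)

    def equivalence(self, hypothesis):
        """Return None if equivalent, else a shortest distinguishing string."""
        self.equivalence_queries += 1
        start = (self.target.start, hypothesis.start)
        seen = {start}
        queue = deque([(start, "")])
        while queue:
            (t_state, h_state), word = queue.popleft()
            if (t_state in self.target.accepting) != (h_state in hypothesis.accepting):
                return word
            for symbol in self.target.alphabet:
                nxt = (
                    self.target.transitions[(t_state, symbol)],
                    hypothesis.transitions[(h_state, symbol)],
                )
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, word + symbol))
        return None


# Toy target (invented): strings over {a, b} with an even number of 'a'.
EVEN_A = DFA("ab", {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}, 0, {0})
# Toy hypothesis that accepts every string.
ACCEPT_ALL = DFA("ab", {(0, "a"): 0, (0, "b"): 0}, 0, {0})
```

The query counters give a simple handle on interaction efficiency, for example by comparing how many queries an agent spends against a classic learner on the same target.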

[NLP-38] Fast When Careful Who: Dual-Process Multiparty Turn-Taking with Diffusion Augmentation

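[Quick Read]: This paper addresses reliable turn-taking in multiparty spoken dialogue, particularly with overlapping speech and rapid speaker changes in realistic audio. Most existing methods are limited to two-speaker interaction. The core solution is an audio-only two-stage pipeline: a fast trigger scans the audio and proposes candidate end-of-turn times, and a lightweight verifier runs only at those times to decide whether the floor transfers (Shift) or stays with the current speaker (Hold), while also supporting next-speaker prediction. This separates the question of when to trigger from the question of whether the floor actually transfers. The study also introduces diffusion-based, label-preserving background-audio mixing as data augmentation. Results are reported in the full multiparty setting and in a controlled dyadic top-2 projection; they show improved shift detection over a baseline, with further gains from the augmentation.

Link: https://arxiv.org/abs/2606.16568
Authors: Rutherford A. Patamia,Ming Liu,Wei Luo,Favour Ekong,Akan Cosgun
Affiliations: Deakin University; Griffith University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: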

Abstract:Reliable turn-taking is essential for spoken dialogue systems. However, most existing methods are designed for two-speaker interaction and struggle with realistic multiparty audio containing overlap and rapid speaker changes. We study multiparty turn-taking on the VoxConverse dataset and propose an audio-only two-stage pipeline that separates when to trigger a turn boundary from whether the floor is actually transferring. A fast trigger scans the audio and proposes candidate end-of-turn times, while a lightweight verifier runs only at those times to decide Hold or Shift and support next-speaker prediction. We report results in the full multiparty setting and a controlled dyadic top-2 projection for comparability. We also investigate diffusion-based, label-preserving background-audio mixing as a data augmentation strategy. Results show improved shift detection over a baseline, with further improvements from diffusion augmentation.

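[NLP-39] The BD-LSC Dataset: Facilitating the Benchmarking of Models for Lexical Semantic Change Detection in Slang and Standard Usage

[Quick Read]: This paper addresses the limitations of existing computational lexical semantic change (LSC) methods in capturing bi-directional semantic change, that is, cases where a word simultaneously gains and loses senses, with a focus on words that have both slang and standard usages. It builds two complementary benchmark datasets. The Bi-Directional Lexical Semantic Change (BD-LSC) dataset captures sense gain, sense loss, and stability across three time periods, enabling the study of complex semantic trajectories. The SlangTrack Word Sense Disambiguation (ST-WSD) dataset provides fine-grained, instance-level sense annotations for words combining slang and standard usages, supporting systematic benchmarking of WSD and semantic change detection models. Using these datasets, the study evaluates unsupervised clustering over contextualised embeddings, supervised machine learning, transformer-based models, and state-of-the-art large language models. The few-shot GPT-4o model achieves the strongest aggregate performance on Exact Sense Match (ESM) and multi-label accuracy, but Macro-F1 scores near 0.5 across all systems show that rare slang senses remain the central open challenge.

Link: https://arxiv.org/abs/2606.16560
Authors: Afnan Aloraini,Viktor Schlegel,Goran Nenadic,Riza Batista-Navarro
Affiliations: University of Manchester; Qatar University
Categories: Computation and Language (cs.CL)
Comments: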

Abstract:Automatic semantic change detection aims to identify how word meanings shift over time, offering insights into both linguistic and societal change. Despite recent progress in computational lexical semantic change (LSC), existing benchmarks and methods struggle to capture bi-directional semantic change, particularly cases where words simultaneously gain and lose senses. This problem is especially challenging for words that have both slang and standard meanings. To address these gaps, we introduce two complementary benchmark datasets. The Bi-Directional Lexical Semantic Change (BD-LSC) dataset captures sense gain, sense loss, and stability across three time periods, enabling the study of complex semantic trajectories. The SlangTrack Word Sense Disambiguation (ST-WSD) dataset provides fine-grained, instance-level sense annotations for words combining slang and standard usages, supporting systematic benchmarking of WSD and semantic change detection models. Using these benchmarks, we systematically evaluate models across different methodological families: unsupervised clustering using contextualised embeddings, supervised machine learning, transformer-based models, and state-of-the-art large language models. Among the evaluated systems, the few-shot GPT-4o model achieved the strongest aggregate performance on Exact Sense Match (ESM) and multi-label accuracy; however, Macro-F1 scores near 0.5 across all systems show that rare slang senses remain difficult, which we identify as the central open challenge.
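
Bi-directional change, as described above, means a word can gain and lose senses in the same transition. A minimal sketch of that bookkeeping with set operations follows; the sense labels and the function name are invented placeholders, and the dataset's actual annotation format is not described in the abstract.

```python
def classify_sense_change(senses_by_period):
    """Compare consecutive periods; each period is a set of sense labels.

    Returns one dict per transition with the gained, lost, and stable senses.
    """
    changes = []
    for earlier, later in zip(senses_by_period, senses_by_period[1:]):
        changes.append(
            {
                "gained": later - earlier,  # senses new in the later period
                "lost": earlier - later,    # senses that disappeared
                "stable": earlier & later,  # senses present in both
            }
        )
    return changes


if __name__ == "__main__":
    # Toy word (invented labels) observed in three time periods.
    periods = [
        {"standard_1", "standard_2"},
        {"standard_1", "slang_1"},
        {"standard_1", "slang_1"},
    ]
    for change in classify_sense_change(periods):
        print(change)
```

In the toy data the first transition is bi-directional (one sense gained, one lost), while the second is fully stable.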

[NLP-40] Can LLM Coding Agents Reason About Time Series?

[Quick Read]: This paper examines whether LLM agents can analyze time series data, which is ubiquitous in finance, healthcare, and environmental monitoring yet hard to process automatically. It compares three approaches: giving the model the raw numerical data, using the LLM as a coding agent, and combining both. In the coding-agent setup, the model iteratively queries the data with Python code. On two time series understanding benchmarks, agents with code access outperform models that process raw data by up to 10%, yet even the best agent still answers about 22-34% of questions incorrectly. An analysis of model outputs with a strong LLM judge shows that coding agents can select appropriate statistical tests but often miss important nuances, whereas models with access to raw data can reach the right conclusions with back-of-the-envelope calculations.

Link: https://arxiv.org/abs/2606.16545
Authors: Filip Rechtorík,Ondřej Dušek,Zdeněk Kasner
Affiliations: Institute of Formal and Applied Linguistics; Faculty of Mathematics and Physics, Charles University
Categories: Computation and Language (cs.CL)
Comments: 17 pages, 7 figures

Abstract:Large language models (LLMs) are increasingly being used for automated decision-making systems in finance, healthcare, or environmental monitoring. Time series data are ubiquitous in these fields, yet hard to process automatically. Can time series be analyzed by LLM agents? We examine three approaches: providing the agent with raw numerical data, using the LLM as a coding agent, or a combination of both. In the coding agent setup, the model iteratively queries the data using Python code. Using two time series understanding benchmarks, we show that agents with code access can outperform models processing raw data by up to 10%. However, even the best performing agent still answers about 22-34% of the questions incorrectly. To get insights into models’ strategies and reasoning gaps, we analyze the model outputs with a strong LLM judge. Our analysis reveals that coding agents can select appropriate statistical tests, but often miss important nuances. Meanwhile, models with access to raw data can reach the right conclusions using back-of-the-envelope calculations.

[NLP-41] DoubtProbe: Black-Box Jailbreak Defense via Structural Verification and Semantic Auditing

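[Quick Read]: This paper addresses black-box jailbreak defense for large language models (LLMs) deployed in user-facing systems. Existing defenses often rely on known-attack coverage, prompt-level semantic judgment, or local runtime control, which can become unstable under evolving prompt packaging, expression rewriting, and structure manipulation. The central observation is that many black-box jailbreaks do not remove the harmful goal; they reorganize the information needed to express and execute it, evading safety alignment while remaining recoverable during generation. The paper proposes DoubtProbe, a dual-branch inference-time defense that formulates jailbreak defense as consistency checking under controlled transformation. The structural branch extracts a structured representation of the original request, reconstructs the request under representation constraints, and detects information-preservation failures between the original and reconstructed requests; the semantic branch audits the original prompt directly. On Qwen2.5-72B, DoubtProbe reduces the JBB attack success rate from 0.293 to 0.100 and the CodeAttack attack success rate from 0.152 to 0.001, while keeping false positive rates at 0.022 and 0.016 on AlpacaEval and OR-Bench; the pattern remains stable when transferred to Llama-3.1-70B. The findings indicate that structural inconsistency signals, combined with semantic auditing, provide a practical and generalizable basis for black-box jailbreak defense.

Link: https://arxiv.org/abs/2606.16527
Authors: Xuanyu Yin,Yilin Jiang,Jun Zhou,Kai Chen,Zhengfu Cao,Xiaolei Dong
Affiliations: East China Normal University; The Hong Kong University of Science and Technology (Guangzhou); Institute of Information Engineering, Chinese Academy of Sciences
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: 25 pages, 5 figures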

Abstract:As large language models (LLMs) are increasingly deployed in user-facing systems, black-box jailbreak defense has become an important practical problem. Existing defenses often rely on known-attack coverage, prompt-level semantic judgment, or local runtime control, yet these paths can become unstable under evolving prompt packaging, expression rewriting, and structure manipulation. We observe that many black-box jailbreaks do not remove the harmful goal, but reorganize the information needed to express and execute it, thereby evading safety alignment while remaining recoverable during generation. Motivated by this observation, we propose DoubtProbe, a dual-branch inference-time defense framework that combines structural verification with semantic auditing and formulates black-box jailbreak defense as consistency checking under controlled transformation. The structural branch extracts a structured representation from the original request, reconstructs the request under representation constraints, and detects information-preservation failures between the original and reconstructed requests; the semantic branch audits the original prompt directly. We evaluate DoubtProbe against representative black-box defenses on jailbreak and benign-request benchmarks, and further test backbone transfer from Qwen2.5-72B to Llama-3.1-70B. Results show that DoubtProbe achieves a stronger and more stable defense-utility trade-off: on Qwen2.5-72B, it reduces the JBB attack success rate from 0.293 to 0.100 and the CodeAttack attack success rate from 0.152 to 0.001, while maintaining false positive rates of 0.022 and 0.016 on AlpacaEval and OR-Bench; the same pattern remains stable on Llama-3.1-70B. These findings show that structural inconsistency signals provide a practical and generalizable basis for black-box jailbreak defense, especially when combined with semantic auditing.

[NLP-42] SkillWiki: A Living Knowledge Infrastructure for Agent Skills

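[Quick Read]: This paper addresses the lack of infrastructure for producing, governing, and continuously evolving agent skills at scale. Knowledge is managed through Wikipedia and software through GitHub, but agent skills remain scattered and hard to reuse or trace. The paper presents SkillWiki, a living knowledge infrastructure that transforms heterogeneous knowledge sources into reusable skill assets and links each skill to its originating evidence, which makes skills traceable. The key to the solution is a system that supports the full skill lifecycle: knowledge ingestion and skill production, provenance-aware exploration, governance, and execution-driven evolution. Within this shared infrastructure, knowledge, skills, and execution experience co-evolve.

Link: https://arxiv.org/abs/2606.16523
Authors: Dingcheng Huang,Yuda Ding,Bingshuo Liu,Qingbin Liu,Xi Chen,Jiang Bian,Hongliang Sun,Zhiying Tu,Dianhui Chu,Xiaoyan Yu,Dianbo Sui
Affiliations: Harbin Institute of Technology; Tencent; Nanyang Technological University
Categories: Computation and Language (cs.CL)
Comments: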

Abstract:While knowledge is managed through Wikipedia and software through GitHub, agent skills still lack an infrastructure for large-scale production, governance, and evolution. SkillWiki is a living knowledge infrastructure that supports the organization, grounding, and continuous evolution of agent skills by transforming heterogeneous knowledge into reusable skill assets linked to their originating evidence. Our demonstration presents the complete skill lifecycle, from knowledge ingestion and skill production to provenance-aware exploration, governance, and execution-driven evolution. SkillWiki highlights a future in which knowledge, skills, and execution experience co-evolve within a shared infrastructure. The live demonstration and source code are publicly available at this https URL.

[NLP-43] daVinci-kernel: Co-Evolving Skill Selection Summarization and Utilization via RL for GPU Kernel Optimization

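[Quick Read]: This paper addresses GPU kernel optimization, a setting in which functional correctness is assumed and execution efficiency is the objective. It proposes daVinci-kernel, a reinforcement learning framework that couples skill discovery with skill exploitation through a dynamically evolving skill library. The framework jointly trains three agents that share one large language model (LLM) backbone: a Skill Selection Agent retrieves relevant optimization techniques via BM25 and LLM reranking; a Policy Agent generates multi-turn CUDA/Triton kernels conditioned on the selected skills; and a Skill Summary Agent distills successful rollouts into reusable skills. Candidate skills are added to the library only after execution-based verification confirms reproducible speedups. The agents are initialized with a structured supervised fine-tuning (SFT) cold start and then jointly optimized end-to-end with multi-turn REINFORCE and per-agent advantage estimation.

Link: https://arxiv.org/abs/2606.16497
Authors: Dayuan Fu,Mohan Jiang,Tongyu Wang,Dian Yang,Jiarui Hu,Liming Liu,Jinlong Hou,Pengfei Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: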

Abstract:GPU kernel optimization represents a paradigm where functional correctness is assumed and execution efficiency is the objective. We present daVinci-kernel, a reinforcement learning framework that couples skill discovery with skill exploitation through a dynamically evolving skill library. daVinci-kernel jointly trains three agents sharing one LLM backbone: a Skill Selection Agent that retrieves relevant techniques via BM25 and LLM reranking, a Policy Agent that generates multi-turn CUDA/Triton kernels conditioned on selected skills, and a Skill Summary Agent that distills successful rollouts into reusable skills. Candidate skills are added only after execution-based verification confirms reproducible speedups. All three agents share a single LLM backbone, are initialized via a structured SFT cold start on diversity-filtered data, and are then jointly optimized end-to-end with multi-turn REINFORCE and per-agent advantage estimation. On KernelBench, daVinci-kernel-14B achieves 37.2%, 70.6%, and 32.2% on Level 1, Level 2, and Level 3 under the Fast_1 threshold, outperforming the strongest prior RL-trained model, this http URL-14B.

[NLP-44] REFLEX: Reflective Evolution from LLM Experience

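[Quick Read]: This paper addresses the entanglement of diagnosis and repair in existing frameworks that use large multimodal language models (LLMs) to guide evolutionary search toward interpretable programmatic policies. Current frameworks rely on a single model call to both interpret visual behavioral evidence and synthesize corrective code, which creates an opaque feedback loop: the rationale behind mutations is hard to trace, and algorithmic insights are not retained across independent runs. The core idea is to structurally decouple visual diagnosis from code generation. In the proposed training-free evolutionary framework, REFLEX, a vision-enabled Critic first distills task-specific behavioral evidence into structured, auditable diagnoses; a text-optimized Actor then synthesizes child policies from these diagnoses together with a persistent, self-evolving Skill Memory of reusable code snippets. This provides transparent mutation traces and enables cross-run transfer of programmatic knowledge. On control benchmarks (Lunar Lander, Acrobot, Pendulum) and a 36-dimensional antenna array synthesis task, REFLEX shows high sample efficiency: it solves Acrobot and Pendulum in under 10 LLM calls and reaches a best Normalized Weighted Score of 1.092 on Lunar Lander, accelerating early-stage discovery of transparent policies while keeping final performance highly competitive.

Link: https://arxiv.org/abs/2606.16496
Authors: Pan Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: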

Abstract:Large multimodal language models (LLMs) have emerged as powerful tools for guiding evolutionary search toward interpretable programmatic policies. However, existing frameworks rely on a monolithic model call to simultaneously interpret visual behavioral evidence and synthesize corrective code. This diagnosis-repair entanglement creates an opaque feedback loop, obscuring the rationale behind mutations and preventing the retention of algorithmic insights across independent runs. To achieve auditable and efficient policy search, we argue that visual diagnosis must be structurally decoupled from code generation. We present REFLEX, a train-free evolutionary framework that operationalizes this decoupling. In REFLEX, a vision-enabled Critic first distills task-specific behavioral evidence into structured, auditable diagnoses. Subsequently, a text-optimized Actor synthesizes child policies using these diagnoses alongside a persistent, self-evolving Skill Memory of reusable code snippets. This architecture not only provides transparent mutation traces but also enables cross-run programmatic knowledge transfer. Extensive evaluations across control benchmarks (Lunar Lander, Acrobot, Pendulum) and a 36-dimensional antenna array synthesis task demonstrate exceptional sample efficiency. Notably, REFLEX solves Acrobot and Pendulum in under 10 LLM calls and reaches a best Normalized Weighted Score of 1.092 on Lunar Lander, achieving highly competitive final performance while significantly accelerating the early-stage discovery of transparent policies.

[NLP-45] Lost at the End: Primacy Bias in Multimodal Retrieval-Augmented Question Answering EMNLP2026

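Quick read: This paper studies how the position of retrieved context affects answer accuracy in multimodal knowledge-based visual question answering (KB-VQA). Text-only long-context LLMs show the U-shaped “lost-in-the-middle” effect, in which information at the start and end of the context is used and the middle is neglected, but whether this carries over to deployed multimodal KB-VQA has been unclear. The authors design the first controlled probe of reader-side position dependence, a gold-position protocol in which only the prompt slot of the gold passage varies within a question. They run it on three open-source 7B/8B vision-language model (VLM) readers and two KB-VQA benchmarks, with k up to 20 retrieved passages. The curve flips from a U shape to a primacy effect: placing the gold passage first beats placing it last by 16 to 26 points, an effect the authors call “Lost at the End”. Three ablations narrow the cause: a text-only control shows that the multimodal setting amplifies an already-present text-mode primacy by 2.2 to 4.5 times, and image-position and distractor-shuffle ablations together locate the cause at prompt slot 0 of the instruction-tuned reader. With a frozen reader, three retrieval-side fixes (MMR, oracle reranking, rank-based reordering) all fail to close the gap. The authors conclude that recall@k is the wrong metric for deployed KB-VQA and that closing the gap requires reader-side intervention; they release the protocol as a controlled instrument for evaluating such interventions.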

Link: https://arxiv.org/abs/2606.16494
Authors: Jieyuan Liu, Jianyang Gu, Shijie Chen, Jefferson Chen, Zhen Wang
Affiliations: University of California, San Diego; The Ohio State University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 9 figures. Under review at EMNLP 2026

Abstract:Knowledge-based visual question answering (KB-VQA) lets vision-language systems answer questions that exceed their parametric knowledge by conditioning a reader on passages retrieved from a Wikipedia-scale knowledge base. In pure-text long-context LLMs, retrieved-context use follows the U-shaped “lost-in-the-middle” effect of Liu et al. (2024): information at the start and end of context is used, the middle is lost. Whether this transfers to deployed multimodal KB-VQA is open. To close this gap, we design the first controlled probe of reader-side position dependence in multimodal KB-VQA: a gold-position protocol in which only the gold passage’s prompt slot varies within question. We run it on three open-source 7B/8B VLM readers and two KB-VQA benchmarks at k up to 20. The shape flips from U to primacy: gold-at-first beats gold-at-last by 16 to 26 points on every reader-by-benchmark cell, an effect we call “Lost at the End”. Three targeted ablations narrow the cause: a text-only control shows the multimodal setting amplifies an already-present text-mode primacy 2.2 to 4.5 times, and image-position and distractor-shuffle ablations together pin the locus to prompt slot 0 of the instruction-tuned reader. On a frozen reader, three retrieval-side fixes (MMR, oracle reranking, rank-based reordering) all leave the gap intact (no separable improvement). Our findings indicate that recall@k is the wrong metric for deployed KB-VQA and that closing the gap requires reader-side intervention; we release our protocol as a controlled instrument for evaluating such interventions.
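
The gold-position protocol described above can be sketched in a few lines. This is a minimal illustration, assuming a plain-text prompt layout; the function names `build_prompts` and `primacy_gap` and the prompt template are hypothetical and not taken from the paper.

```python
from typing import List


def build_prompts(question: str, gold: str, distractors: List[str]) -> List[str]:
    """Return one prompt per slot; only the gold passage's slot changes.

    The distractors keep their relative order, so any accuracy difference
    between prompts can be attributed to the position of the gold passage.
    """
    prompts = []
    for slot in range(len(distractors) + 1):
        passages = distractors[:slot] + [gold] + distractors[slot:]
        context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages))
        prompts.append(f"{context}\nQuestion: {question}\nAnswer:")
    return prompts


def primacy_gap(accuracy_by_slot: List[float]) -> float:
    """Accuracy with gold in the first slot minus accuracy with gold in the last."""
    return accuracy_by_slot[0] - accuracy_by_slot[-1]
```

A positive `primacy_gap` corresponds to the “Lost at the End” pattern; a U-shaped curve would instead show high accuracy at both ends of `accuracy_by_slot`.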

[NLP-46] From Awareness to Adherence: Bridging the Context Gap in Spoken Dialogue Systems via Context-Aware Decoding INTERSPEECH2026

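Quick read: This paper tackles the difficulty end-to-end (E2E) spoken dialogue systems have in strictly adhering to context across multi-round conversations. Prior work mostly attributes such failures to the model forgetting dialogue history; the authors point to an equally critical but overlooked bottleneck, a gap between latent context awareness and actual context adherence. The model can internally recognize the relevant past utterances, yet strong parametric priors often overshadow these signals during decoding. The paper proposes an audio-adapted Context-Aware Decoding (CAD) approach that uses the model's internal attention to identify key historical rounds and, at inference time, contrasts the output distributions with and without that key context, directly amplifying multimodal contextual signals. On the Audio MultiChallenge benchmark, the method yields significant improvements on the Semantic Memory and Self Coherence subtasks.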

Link: https://arxiv.org/abs/2606.16472
Authors: Che Hyun Lee, Heeseung Kim, Sungroh Yoon
Affiliations: Seoul National University; University of Seoul
Categories: Computation and Language (cs.CL)
Comments: Interspeech 2026 Main Track

Abstract:Despite the success of end-to-end (E2E) spoken dialogue systems, maintaining strict context adherence in multi-round conversations remains a challenge. While prior works attribute these failures to models forgetting dialogue history, we highlight an equally critical but overlooked bottleneck: a gap between latent context awareness and active adherence. Although models internally recognize relevant past utterances, strong parametric priors often overshadow these signals during decoding. To bridge this gap, we propose an audio-adapted Context-Aware Decoding (CAD) approach. By leveraging internal attention mechanisms to isolate key historical rounds, our approach contrasts output distributions with and without this key context during inference, directly amplifying multimodal contextual signals. Evaluations on the Audio MultiChallenge benchmark demonstrate significant improvements in Semantic Memory and Self Coherence subtasks, successfully enforcing strict, context-faithful adherence.
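
Contrasting output distributions with and without the key context can be illustrated with the usual text-only form of context-aware decoding, in which the context-conditioned logits are pushed away from the context-free ones. The sketch below assumes that form; the paper's audio adaptation and its attention-based selection of key rounds are not reproduced, and `alpha` and the function names are illustrative.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(with_ctx: List[float], without_ctx: List[float],
                       alpha: float = 1.0) -> List[float]:
    """(1 + alpha) * logits(with context) - alpha * logits(without context).

    Tokens whose score rises when the key context is present are amplified;
    tokens driven mainly by the parametric prior are left unchanged or damped.
    alpha = 0 recovers ordinary decoding with the context.
    """
    return [(1.0 + alpha) * a - alpha * b for a, b in zip(with_ctx, without_ctx)]


if __name__ == "__main__":
    with_ctx = [2.0, 1.5, 0.0]     # token 0 still wins despite the context
    without_ctx = [2.0, 0.5, 0.0]  # token 1 is what the context supports
    print(softmax(contrastive_logits(with_ctx, without_ctx, alpha=1.0)))
```

In this toy case the prior-favoured token 0 wins under ordinary decoding, while the contrastive adjustment makes the context-supported token 1 the most likely one.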

[NLP-47] ACCORD: Action-Conditioned Contextual Grounding for Language Agents

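Quick read: This paper addresses failures of large language model (LLM) agents caused by missing context when carrying out user instructions. Human instructions often depend on implicit assumptions about the environment that cannot be inferred from the instruction alone; in information-rich digital and physical environments, an agent must recover the missing context from the current state of tools, data, interfaces, and observations. Existing agents often act on assumed rather than observed specifics, overlook information they could have gathered, and fail to incorporate evidence that has already been returned. The proposed framework, ACCORD (Action-Conditioned Contextual Grounding), actively probes the environment for missing information before each action and integrates relevant context from the agent's trajectory that would otherwise be overlooked. ACCORD requires no additional training or task-success signals. On AppWorld with GPT-5-mini, it raises task-goal completion from 42.0% to 62.6% (+20.6 points); gains persist with a stronger base model (+10.8 with Claude-4.5-sonnet), an open-weight model (+10.1 with Qwen3.5-27B-FP8), and on the embodied AlfWorld benchmark (+7.4 success rate with GPT-5-mini).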

Link: https://arxiv.org/abs/2606.16432
Authors: Lai Jiang, Cheng Qian, Zhenhailong Wang, Pan Lu, Heng Ji, Hao Peng
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:User instructions are often underspecified because humans rely on implicit assumptions about the surrounding environment. For large language model (LLM) agents operating in information-rich digital and physical environments, these assumptions cannot be inferred from the instruction alone; they must be recovered from the current state of tools, data, interfaces, and observations. Effective execution therefore requires agents to identify missing context, ground it in observed evidence, and carry it forward into subsequent actions. We show that current agents often fail to do so. They act from assumed rather than observed specifics, overlook information they could have gathered, and fail to incorporate evidence that has already been returned. Building on this insight, we propose ACCORD (Action-Conditioned Contextual Grounding), a simple and effective agent framework for adaptive grounding. Before each action, ACCORD actively probes the environment for missing information and integrates relevant context from the agent’s trajectory that would otherwise be overlooked. Requiring no additional training or task-success signals, ACCORD improves task-goal completion on AppWorld by up to +20.6 points with GPT-5-mini, from 42.0% to 62.6%, compared to strong baselines. These gains persist with a substantially stronger base model (+10.8 with Claude-4.5-sonnet), an open-weight model (+10.1 with Qwen3.5-27B-FP8), and on the embodied AlfWorld benchmark (+7.4 success rate with GPT-5-mini).

[NLP-48] Taylor-Calibrate: Principled Initialization for Hybrid Linear Attention Distillation

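Quick read: This paper addresses the brittleness and poor initialization that arise when a pretrained Transformer is converted into a hybrid linear attention model such as one built on Gated DeltaNet (GDN). Copying the teacher's attention projections into the student does not specify the newly introduced recurrent decay, write-gate, and output-gate dynamics, so the student starts in a poor dynamical regime and spends many distillation tokens repairing initialization rather than learning the remaining teacher behavior. The proposed lightweight initialization method, Taylor-Calibrate, uses Taylor-guided teacher attention statistics to set the student's value projection, memory timescale, write gates, and output gate, then applies a short per-layer alignment step that matches each converted layer's output to the teacher's. Across four teacher settings and three retained-layer policies, the method gives substantially stronger zero-shot students, with up to an 88x improvement in a representative ablation, and reaches matched recovery targets with 4.9x to 9.2x fewer training tokens than naive conversion.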

Link: https://arxiv.org/abs/2606.16429
Authors: Zhongzhu Zhou, Qingyang Wu, Junxiong Wang, Mayank Mishra, Shuaiwen Leon Song, Ben Athiwaratkun, Chenfeng Xu
Affiliations: Together AI; The University of Sydney; University of California, Berkeley; The University of Texas at Austin; Microsoft
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 24 pages, 9 figures

Abstract:Hybrid linear attention models offer an appealing path to faster long-context inference: they reduce the quadratic cost and KV-cache burden of full softmax attention while retaining much of the quality of Transformer models. A practical way to obtain such models is to convert a pretrained Transformer instead of pretraining a new architecture from scratch, but this conversion is still brittle. Simply copying the teacher attention projections into a Gated DeltaNet (GDN) student does not specify the new recurrent decay, write, and output-gating dynamics. As a result, the converted model often starts in a poor dynamical regime and must spend many distillation tokens repairing initialization rather than learning the remaining teacher behavior. We propose Taylor-Calibrate, a lightweight initialization method for hybrid GDN students. The method uses Taylor-guided teacher attention statistics to set the value projection, memory timescale, write gates, and output gate, then applies a short per-layer alignment step to match each converted layer to the teacher output. Across four teacher settings and three retained-layer policies, Taylor-Calibrate gives substantially stronger zero-shot students, with up to an 88x improvement in a representative ablation, and reaches matched recovery targets with 4.9x–9.2x fewer training tokens than naive conversion.

[NLP-49] PathRouter: Aligning Rewards with Retrieval Quality in Agentic Graph Retrieval-Augmented Generation

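Quick read: This paper addresses two problems that arise when language-model agents are trained to iteratively retrieve and reason over graph-structured evidence (agentic GraphRAG). The first is answer-path reward aliasing under outcome-only reinforcement learning: a correct answer may come from a shortcut rather than a useful evidence path. The second is search-update ambiguity: scalar trajectory-level feedback does not indicate which retrieval actions to adjust. The proposed path-aware training framework, PathRouter, jointly evaluates each trajectory on answer correctness and evidence-path overlap, sorts trajectories into four categories, and applies differentiated GRPO advantage scaling to suppress shortcut reinforcement while preserving evidence-seeking behavior. For evidence-poor trajectories, a frozen gold-evidence teacher provides token-level KL guidance on reasoning and search-query tokens only, excluding answer tokens to avoid direct imitation of the response. Experiments on six QA benchmarks across three model sizes show consistent improvements in answer F1 and evidence-path overlap, with average F1 gains of 3.1 on 3B models and 4.9 on 7B models over a strong baseline.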

Link: https://arxiv.org/abs/2606.16409
Authors: Bo Wang, Heyan Huang, Yaolin Li, Wei Tang, Yuan Zhang, Wenbo Li, Mingze Gao, Ge Shi, Chong Feng
Affiliations: Beijing Institute of Technology; Joy Future Academy
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Agentic GraphRAG trains language-model agents to iteratively retrieve and reason over graph-structured evidence, enabling more accurate and context-aware decision-making by efficiently navigating complex information networks. However, outcome-only reinforcement learning suffers from answer-path reward aliasing, where correct answers may come from shortcuts rather than useful evidence paths. It also exhibits search-update ambiguity, as scalar trajectory-level feedback does not indicate which retrieval actions to adjust. To mitigate these shortcomings, we present PathRouter, a path-aware training framework for agentic GraphRAG. PathRouter jointly evaluates each trajectory along answer correctness and evidence-path overlap, yielding four trajectory categories with differentiated GRPO advantage scaling that suppresses shortcut reinforcement while preserving evidence-seeking behavior. For evidence-poor trajectories, a frozen gold-evidence teacher provides token-level KL guidance on reasoning and search-query tokens, excluding answer tokens to avoid direct response imitation. Experiments on six QA benchmarks across three model sizes show that PathRouter consistently improves answer F1 and evidence-path overlap, achieving average F1 gains of 3.1 on 3B and 4.9 on 7B models compared to a strong baseline.
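
The idea of four trajectory categories with differentiated advantage scaling can be sketched as follows. This is only an illustration of the general shape: the category names, the overlap threshold, and the scale factors in `SCALE` are hypothetical values chosen for the example, not the ones used in the paper.

```python
from typing import List

# Hypothetical scale factors: damp reinforcement of shortcut answers and
# soften the penalty for wrong answers that still followed good evidence.
SCALE = {
    "grounded_correct": 1.0,
    "shortcut_correct": 0.3,
    "grounded_wrong": 0.5,
    "ungrounded_wrong": 1.0,
}


def categorize(correct: bool, overlap: float, threshold: float = 0.5) -> str:
    """Place a trajectory in one of four cells: answer correctness x evidence overlap."""
    grounded = overlap >= threshold
    if correct:
        return "grounded_correct" if grounded else "shortcut_correct"
    return "grounded_wrong" if grounded else "ungrounded_wrong"


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: reward standardized within the sampled group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


def scaled_advantages(rewards: List[float], corrects: List[bool],
                      overlaps: List[float]) -> List[float]:
    """Scale each trajectory's advantage by the factor of its category."""
    advantages = group_relative_advantages(rewards)
    return [a * SCALE[categorize(c, o)]
            for a, c, o in zip(advantages, corrects, overlaps)]
```

With equal rewards, a correct answer reached through a low-overlap path receives a smaller positive advantage than one reached through the evidence path, which is the intended effect of suppressing shortcut reinforcement.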

[NLP-50] A Mechanistic Understanding of Pronoun Fidelity in LLMs

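Quick read: This paper addresses the failure of large language models to use pronouns faithfully and robustly when multiple referents use different pronouns, which matters for fair and coherent generation. Prior work relies on behavioural analyses, which may not reflect a model's internal workings. The authors take a mechanistic, model-internal view and test whether three candidate mechanisms, group entity binding (G), recency bias (R), and stereotypical bias (S), are causally implemented across several state-of-the-art language models. Using Boundless Distributed Alignment Search, they find that all three coexist as causal subspaces distributed across network depth. No single mechanism fully explains model behaviour, but the three together account for 91% to 99.5% of it. An attention head analysis reveals two competing copying routes: group binding and stereotype share a localized concept-level route that retrieves a bound occupation-pronoun unit, while recency relies on a distributed token-level route that repeats surface forms. Pronoun fidelity thus arises from competition between simultaneously active causal subspaces.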

Link: https://arxiv.org/abs/2606.16407
Authors: Katharina Trinley, Jesujoba O. Alabi, Dietrich Klakow, Vagrant Gautam
Affiliations: Saarland University; Heidelberg Institute for Theoretical Studies
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Faithful and robust pronoun use is important for fair and coherent generations, yet large language models largely fail when multiple referents use different pronouns. To study the interplay of reasoning, repetition, and bias in this task, prior work relies exclusively on behavioural approaches, which may not reflect a model’s internal workings. Therefore, we provide a mechanistic, model-internal perspective on pronoun fidelity, testing whether three mechanisms – group entity binding (G), recency bias (R), and stereotypical bias (S) – are causally implemented across several SOTA language models. Using Boundless Distributed Alignment Search, we find all three coexist as causal subspaces distributed across network depth. No single mechanism fully explains model behaviour, but a combination of the three consistently accounts for 91-99.5%. An attention head analysis further reveals two competing copying routes; group binding and stereotype share a localized concept-level route that retrieves a bound occupation-pronoun unit, while recency uses a distributed token-level route that repeats surface forms. In sum, pronoun fidelity arises from competition between simultaneously active causal subspaces.

[NLP-51] Surpassing Scale by Efficiency: A Compact 135M Parameter Foundational LLM Natively Adapted for the Bangla Language

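Quick read: This paper addresses the difficulty of deploying large NLP models for low-resource, non-Latin scripts such as Bangla on edge devices, mobile systems, and decentralized local hardware, where compute is the bottleneck. It presents bangla-smollm-135m, a 135-million-parameter decoder-only foundation model built for the Bangla script. A deterministic intersect-and-append token merging strategy between TituLLMs and SmolLM2-135M mitigates subword script fragmentation without destabilizing the early pretrained parameter states. In zero-shot multi-task evaluations (PIQA_bn, OpenBookQA_bn, CommonsenseQA_bn, and Bangla_MMLU), the model matches or outperforms Gemma-3-270m, a model twice its size, and reaches parity with models in the 1B-parameter tier.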

Link: https://arxiv.org/abs/2606.16383
Authors: Rabindra Nath Nandi
Affiliations: Independent Researcher
Categories: Computation and Language (cs.CL)
Comments: Submitted to a Workshop

Abstract:While the NLP landscape is dominated by multi-billion parameter architectures, their deployment in low-resource, non-Latin scripts remains computationally prohibitive for edge configurations, mobile systems, and decentralized local hardware. This paper presents bangla-smollm-135m, a highly compact 135-million parameter decoder-only foundational model engineered explicitly for high-efficiency language modeling in the Bangla script. By leveraging a deterministic intersect-and-append token merging strategy between TituLLMs and SmolLM2-135M, the model overcomes subword script fragmentation without destabilizing early pretrained parameter states. In zero-shot multi-task benchmark evaluations (PIQA_bn, OpenBookQA_bn, CommonsenseQA_bn, and Bangla_MMLU), bangla-smollm-135m matches or outperforms models twice its size (Gemma-3-270m) and achieves parity with models in the 1B parameter tier. The model is available at rnnandi/bangla-smollm-135m
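
The abstract names the intersect-and-append strategy without spelling it out. One plausible reading, sketched below under that assumption, is that tokens shared by both vocabularies keep their base IDs (so pretrained embedding rows stay in place) while donor-only tokens are appended with new IDs in donor order. The function name and the toy vocabularies are hypothetical.

```python
from typing import Dict, List, Tuple


def intersect_and_append(base_vocab: Dict[str, int],
                         donor_tokens: List[str]) -> Tuple[Dict[str, int], List[str]]:
    """Merge donor tokens into a base vocabulary without moving base IDs.

    Tokens already in the base vocabulary (the intersection) keep their IDs.
    Tokens found only in the donor list are appended with fresh IDs, in the
    order they appear, which makes the result deterministic.
    """
    merged = dict(base_vocab)
    appended: List[str] = []
    next_id = max(base_vocab.values()) + 1 if base_vocab else 0
    for token in donor_tokens:
        if token in merged:
            continue  # shared or already appended: ID stays as it is
        merged[token] = next_id
        appended.append(token)
        next_id += 1
    return merged, appended


if __name__ == "__main__":
    base = {"a": 0, "the": 1}
    merged, new_tokens = intersect_and_append(base, ["the", "বাংলা", "ভাষা"])
    print(len(merged), new_tokens)
```

Under this reading, only the embedding rows for the appended tokens need fresh initialization, which is consistent with the stated goal of not destabilizing the pretrained parameters.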

[NLP-52] Evaluating LLM Personalization via Semantic Constraint Verification

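Quick read: This paper addresses the lack of interpretability in current evaluations of large language model (LLM) personalization, which rely on brittle surface-matching metrics or computationally expensive LLM-as-a-judge protocols. It proposes Natural Language Inference Constraint Verification (NLICV), a scalable, semantically invariant framework that maps sentence meanings to truth-condition sets and verifies personalization constraints with a Natural Language Inference (NLI) model. Instead of binary scoring, NLICV sorts LLM behavior into four modes: personalization, generalization, sycophancy, and failure. Experiments show that NLICV aligns closely with human annotations while sharply reducing the latency and token cost of LLM judges (up to a 2100x inference speedup). An ablation-based procedure also pinpoints the exact sentences that drive each constraint verification, providing faithful and understandable evidence for the evaluation.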

Link: https://arxiv.org/abs/2606.16368
Authors: Xuran Li, Guanqin Zhang, Imran Razzak, Hakim Hacid, Eleanna Kafeza, Hao Xue, Flora D. Salim
Affiliations: University of New South Wales; Mohamed bin Zayed University of Artificial Intelligence; The Technology Innovation Institute; The Hong Kong University of Science and Technology
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Current evaluation paradigms for Large Language Model (LLM) personalization rely heavily on brittle surface-matching metrics or computationally expensive LLM-as-a-judge protocols, both of which lack interpretability. To address these limitations, we introduce Natural Language Inference Constraint Verification (NLICV), a scalable, semantically invariant framework that maps sentence meanings to truth-condition sets to verify personalization constraints via a Natural Language Inference (NLI) model. Moving beyond binary scoring, NLICV categorizes LLM behaviors into four distinct modes: personalization, generalization, sycophancy, and failure. Extensive experiments demonstrate that NLICV aligns closely with human annotations while drastically reducing the latency and token costs associated with LLM judges (up to a 2100x inference speedup). Finally, through an ablation-based procedure, NLICV pinpoints the exact sentences driving the constraint verification, yielding faithful, understandable evidence for its evaluations.

[NLP-53] Tyler: Typed Latent Reasoning for Language Models – When to Think, What to Compute, and How Much to Allocate

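Quick read: This paper addresses the redundancy and inference overhead introduced by chain-of-thought (CoT) prompting. Existing latent reasoning methods typically predefine when latent computation is invoked and how it is allocated, so they cannot decide dynamically when to invoke it, what type of computation to perform, or how much budget to allocate. The proposed framework, Typed Latent Reasoning (Tyler), is typed and budget-aware: it learns a policy that, at each autoregressive decoding step, chooses between emitting a text token and switching to a latent computation module specialized for a particular reasoning function. Once invoked, the module maps the current reasoning state into latent tokens that support global planning, local state updates, or reusable procedural abstraction. On three backbone LLMs, Tyler improves accuracy by up to 14.49 points over CoT and by up to 4.30 points over the strongest competing baseline; it also generalizes across diverse reasoning domains and achieves the best final-stage performance with the lowest forgetting.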

Link: https://arxiv.org/abs/2606.16360
Authors: Hanyu Lin, Min Cai, Jiawei Wen, Haodi Zhang
Affiliations: Shenzhen University; University of Alberta
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: website: this https URL

Abstract:Chain-of-thought (CoT) prompting improves reasoning in large language models (LLMs) by externalizing intermediate computation as discrete text tokens, but this textual interface also introduces redundancy and inference overhead. Latent reasoning offers a promising alternative by carrying part of the computation in continuous representations. However, existing methods typically predefine when latent computation is invoked and how it is allocated during decoding, leaving a key problem unresolved: when to invoke latent computation, what type of computation to perform, and how much budget to allocate. We propose Typed Latent Reasoning (Tyler), a typed and budget-aware framework for latent reasoning during autoregressive decoding. Tyler learns a policy that, at each decoding step, chooses between emitting a text token and switching to a latent computation module specialized for a particular reasoning function. Once invoked, an operator maps the current reasoning state into latent tokens that support global planning, local state updates, or reusable procedural abstraction. Across extensive experiments on three backbone LLMs, Tyler improves accuracy by up to 14.49 points over CoT and by up to 4.30 points over the strongest competing baseline. It further generalizes across diverse reasoning domains and achieves the best final-stage performance with the lowest forgetting.

[NLP-54] TMASC: Transmasculine Attitude and Speech Corpus INTERSPEECH2026

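Quick read: This paper addresses the shortage of systematic, multimodal data for studying the vocal health and acoustic characteristics of transmasculine individuals. It introduces the Transmasculine Attitudes and Speech Corpus (TMASC), a multimodal corpus covering 196 transmasculine individuals, with questionnaire responses and 66 audio recordings. The questionnaire includes items on vocal health; the recordings include cough and throat-clearing samples, a reading passage, and additional session-specific questions. Three case studies illustrate how the crowd-sourced corpus can be used: integrating perceptual and acoustic data, identifying group-level characteristics, and calibrating acoustic measurements.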

Link: https://arxiv.org/abs/2606.16351
Authors: Sidney Wong
Affiliations: Centre for Sustainability Research, University of Otago; Te Pūnaha Matatini Centre of Research Excellence for Complex Systems
Categories: Computation and Language (cs.CL)
Comments: Accepted to Interspeech 2026 Main Track

Abstract:We introduce the Transmasculine Attitudes and Speech Corpus (TMASC), a multimodal corpus of 196 transmasculine individuals, including questionnaire responses and 66 audio recordings. The questionnaire includes items exploring the vocal health of transmasculine individuals. The audio recordings include cough and throat-clearing samples, a reading passage, and additional session-specific questions. This paper outlines the development of this corpus and the data collection procedures. To illustrate the utility of this corpus, we present three case studies demonstrating how this crowd-sourced multimodal corpus can be used to support transmasculine individuals. These include the integration of perceptual and acoustic data, the identification of group-level characteristics, and the calibration of acoustic measurements.

[NLP-55] Whose hotel does the AI recommend? An algorithm audit of reputation signals in LLM-assisted hotel selection

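Quick read: This paper addresses the opacity of recommendations made by LLM assistants acting as information intermediaries in travel booking. As travelers increasingly ask LLM assistants which hotel to book, these systems become gatekeepers of property visibility, yet what drives their recommendations is undocumented. The study runs a pre-specified algorithm audit using a randomized choice-based conjoint design: across personas, prompt templates, and twelve open-weight and proprietary models, assistants choose among five hotels whose guest rating, review volume and recency, management response, chain affiliation, price, eco-certification, and list position are independently randomized. The authors estimate the average marginal component effect of each signal on the probability of recommendation. Guest rating and price dominate (a top rating raises selection by 31.6 percentage points; a high price lowers it by 30.0), reproducing the human primacy of valence and price, but the models over-weight eco-certification and ignore management response. List position, a content-free artifact, also shifts recommendations causally, an effect worth about $12 per night. The reasons the models state track their revealed weights only imperfectly. The findings provide causal evidence for generative engine optimization and for the accountability of AI infomediaries.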

Link: https://arxiv.org/abs/2606.16344
Authors: Mirza Samad Ahmed Baig, Syeda Anshrah Gillani, Asher Ali
Affiliations: Hamdard University; Al Khobar
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 32 pages

Abstract:Travelers increasingly ask large language model (LLM) assistants which hotel to book, making these systems gatekeepers of property visibility – yet what moves their recommendations is undocumented. We conduct a pre-specified algorithm audit using a randomized choice-based conjoint: across personas, prompt templates, and twelve open-weight and proprietary models, assistants choose among five hotels whose guest rating, review volume and recency, management response, chain affiliation, price, eco-certification, and list position are independently randomized. We estimate the average marginal component effect of each signal on the probability of recommendation. Guest rating and price dominate (a top rating raises selection by 31.6 percentage points; a high price lowers it by 30.0), reproducing human valence-and-price primacy but over-weighting eco-certification and ignoring management response. List position – a content-free artifact – shifts recommendations causally, worth about $12 per night. Stated reasons track revealed weights imperfectly. The findings ground generative engine optimization and the accountability of AI infomediaries in causal evidence.

[NLP-56] PaperJury: Due-Process Review for Bounded LaTeX Revision

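Quick read: This paper addresses pre-submission hardening of human-authored LaTeX computer science papers. Existing writing assistants, critique generators, and judge-centered loops lack durable issue identity across rounds, deterministic routing from critique to adjudication, and manuscript control that can reject invalid concerns or defer author-dependent ones. The proposed system, PaperJury, is a closed-loop review-verdict-revise-verify pipeline built on a deterministic-versus-semantic split: deterministic orchestration manages paper decomposition, a frozen claim spine, a durable ledger, routing, stopping, and exact-once patch application, while semantic agents are limited to bounded review, judgment, and repair. PaperJury combines bounded holistic review, contestability-based routing, a due-process trial, and risk-proportional guard chains for anchor-bounded edits, and it ends each issue in one of three terminal outcomes: invalid-drop, valid-fixable, or author-required. A two-arm expert-review evaluation on held-out vision, natural language processing, and machine learning papers against four baselines assesses issue quality, verdict and routing quality, edit safety, convergence behavior, and cost; the authors present it as support for placing load-bearing safety and completion logic in deterministic orchestration rather than model discretion.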

Link: https://arxiv.org/abs/2606.16322
Authors: Yiran Wang, Ruixuan An, Biao Wu, Wenhao Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 5 figures

Abstract:Pre-submission hardening of human-authored LaTeX computer science papers differs from drafting assistance because it requires adversarial whole-paper review, explicit no-fix outcomes, and bounded artifact-safe revision. Existing writing assistants, critique generators, and judge-centered loops lack durable issue identity across rounds, deterministic routing from critique to adjudication, and manuscript control that can reject invalid concerns or defer author-dependent ones. We present PaperJury, a closed-loop review-verdict-revise-verify system built on a deterministic-versus-semantic split: deterministic orchestration manages decomposition, a frozen claim spine, a durable ledger, routing, stopping, and exact-once patch application, while semantic agents are limited to bounded review, judgment, and repair. PaperJury combines bounded holistic review, contestability-based routing, a due-process trial, and risk-proportional guard chains for anchor-bounded edits, yielding terminal outcomes of invalid-drop, valid-fixable, and author-required. In a two-arm expert-review evaluation on held-out Vision, natural language processing, and machine learning papers against four baselines, we assess issue quality, verdict and routing quality, edit safety, convergence behavior, and cost, supporting the thesis that load-bearing safety and completion logic should reside in deterministic orchestration rather than model discretion. PaperJury is available at this https URL.

[NLP-57] QK-Normed MLA: QK normalization without full key caching

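Quick read: This paper addresses the apparent incompatibility between Multi-head Latent Attention (MLA) and query-key (QK) normalization. Post-projection QK RMSNorm appears to require the fully projected key for every cached token in order to compute the dynamic root-mean-square (RMS) statistic, which conflicts with MLA's efficient decoding from cached low-dimensional latent states. The authors show that this incompatibility is an implementation artifact rather than an architectural constraint. RMSNorm decomposes into a static affine weight and a dynamic scalar RMS statistic: the static key-side weight can be absorbed into the MLA query-side projection, and the dynamic statistic reduces to one inverse-RMS scalar per token and KV group. The resulting formulation is exactly equivalent to explicit post-projection QK RMSNorm in exact arithmetic and preserves MLA's latent decode path. In 400M-parameter runs trained for up to 100B tokens, QK-Normed MLA achieves lower training loss and better downstream accuracy than QK clipping, and H800 decode benchmarks show less than 2% latency overhead up to 256k context. QK normalization thus becomes a practical stabilization option for MLA models without full-key caching.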

Link: https://arxiv.org/abs/2606.16310
Authors: Yizhou Han, Yao Zhao, Jun Zhou, Longfei Li, Ruoyu Sun
Affiliations: The Chinese University of Hong Kong; Ant Group
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 13 pages, 5 figures, conference-style manuscript

Abstract:Query-key (QK) normalization stabilizes attention by controlling the scale of queries and keys before the dot product, but is not immediately compatible with Multi-head Latent Attention (MLA). MLA achieves efficient decoding by caching low-dimensional latent states instead of full keys, whereas post-projection QK RMSNorm appears to require the fully projected key for every cached token. We show this apparent incompatibility is an implementation artifact, not an architectural constraint. RMSNorm decomposes into a static affine weight and a dynamic scalar RMS statistic. The static key-side weight can be absorbed into the MLA query-side projection; the dynamic key statistic reduces to one inverse-RMS scalar per token and KV group. The resulting formulation is exactly equivalent to explicit post-projection QK RMSNorm in exact arithmetic and preserves MLA’s latent decode path. In our 400M runs trained for up to 100B tokens, QK-Normed MLA achieves lower training loss and better downstream accuracy than QK clipping, while H800 decode benchmarks show less than 2% latency overhead up to 256k context. These results make QK normalization a practical stabilization option for MLA models without requiring full-key caching.
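
The decomposition described above can be checked numerically on a toy example. The sketch below is a single-head illustration in pure Python that ignores RoPE, multiple heads, and KV grouping; the names (`W` for the key up-projection, `c` for the cached latent, `g` for the RMSNorm weight) are chosen here for illustration, and the query is assumed to be normalized already.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(M: Matrix, v: Vector) -> Vector:
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M: Matrix) -> Matrix:
    return [list(col) for col in zip(*M)]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def inv_rms(v: Vector, eps: float = 1e-6) -> float:
    """Inverse root-mean-square of v: the one dynamic scalar RMSNorm needs."""
    return 1.0 / math.sqrt(sum(x * x for x in v) / len(v) + eps)


def explicit_score(q: Vector, c: Vector, W: Matrix, g: Vector) -> float:
    """Reference path: project the full key, apply RMSNorm, then take the dot product."""
    k = matvec(W, c)
    s = inv_rms(k)
    k_normed = [gi * ki * s for gi, ki in zip(g, k)]
    return dot(q, k_normed)


def absorbed_score(q: Vector, c: Vector, s: float, W: Matrix, g: Vector) -> float:
    """Latent path: fold g and W into the query, score against the cached latent c,
    then rescale by the cached per-token scalar s."""
    q_absorbed = matvec(transpose(W), [qi * gi for qi, gi in zip(q, g)])
    return dot(q_absorbed, c) * s
```

Here `s` would be computed once when the token is written to the cache and stored next to the latent `c`, so decoding never needs the full projected key. The two functions agree up to floating-point rounding, which mirrors the abstract's claim of equivalence in exact arithmetic.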

[NLP-58] State-Grounded Multi-Agent Synthetic Data Generation for Tool-Augmented LLMs

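Quick read: This paper addresses the scarcity of high-quality multi-turn, tool-grounded conversational data for training tool-augmented LLM agents; such data is privacy-constrained in production and expensive to annotate. It presents StateGen, a synthetic data generation platform that produces scored, reasoning-trace-rich training conversations by orchestrating a four-role LLM loop: a persona-conditioned user simulator, an agent under test, a state-grounded tool simulator, and a multi-axis LLM judge. The key architectural contribution is an authoritative state manager that maintains a structured world-state object across turns and enforces a backend-is-truth invariant, which eliminates the dominant class of tool-call hallucinations by construction. The architecture extends to hierarchical multi-agent settings by declaring sub-agents as tools that share a single state object. On 64,698 evaluated conversations across three production corpora, tool-call hallucination scores reach 9.66/10, persona-driven variation is supported through a 23-dimensional trait vector, and a cleanly separated train and golden evaluation split, examined with a per-criterion gap analysis, is reported as confirming that the data is not memorization bait. A comparison with eight external systems finds that no single publicly available platform combines multi-turn generation, state-grounded tool simulation, hierarchical multi-agent support, and built-in judge scoring.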

Link: https://arxiv.org/abs/2606.16307
Authors: Rahul Khedar, Eshita, Sneha Teja Sree Reddy Thondapu, Mayank Malhotra, Arup Das, Jitesh Chandra, Yun-Shiuan Chuang, Chaitanya Kulkarni, Arun Menon, Linsey Pang, Avinash Karn, Mouli V, Prakhar Mehrotra
Affiliations: PayPal AI
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 9 pages, 5 figures, 6 tables, 1 algorithm

Abstract:Training tool-augmented LLM agents requires large corpora of multi-turn, tool-grounded conversational data that is expensive to annotate, privacy-constrained in production settings, and largely absent from public datasets. We present StateGen, a synthetic data generation platform that produces scored, reasoning-trace-rich training conversations by orchestrating a four-role LLM loop: a persona-conditioned user simulator, an agent under test, a state-grounded tool simulator, and a multi-axis LLM judge. The key architectural contribution is an authoritative state manager that maintains a structured world-state object across turns, enforcing a backend-is-truth invariant that eliminates the dominant class of tool-call hallucinations by construction. StateGen extends naturally to hierarchical multi-agent settings by declaring sub-agents as tools, all sharing a single state object. We report results on 64,698 evaluated conversations across three production corpora: tool-call hallucination scores reach 9.66/10, the system supports persona-driven variation via a 23-dimensional trait vector, and a cleanly separated train and golden evaluation set split confirms the data is not memorization bait (per-criterion gap analysis). Comparison with eight external systems shows that no single publicly available platform combines multi-turn generation, state-grounded tool simulation, hierarchical multi-agent support, and built-in judge scoring.

[NLP-59] VisualClaw: A Real-Time Personalized Agent for the Physical World

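Quick read: This paper addresses three gaps in deploying vision language models (VLMs): high latency and cost when processing dense video frames and long prompts, an agent scaffold that stays static after deployment, and video-QA benchmarks that do not test whether agents can use visual evidence inside tool-using workspaces. The proposed VisualClaw is a self-evolving multimodal agent built on two principles. First, hybrid encoding reduces deployment cost by filtering less informative streaming frames with a cascaded gate and compressing the text skill bank through hot/cold top-k injection. Second, skill evolution lets the agent learn from failures: retrieved memories condition an evolver, either as directly concatenated context or as guided evidence, and the resulting skill-bank updates help with future questions. Across 4 video-QA benchmarks with 2 VLMs, VisualClaw cuts per-question API cost by 98% on average relative to full-frame upload and by 25.9% relative to the offline uniform 8-frame baseline, while improving accuracy in most settings, for example by +3.85% on average and +15.80% at peak on EgoSchema with Gemini 3 Flash. To fill the evaluation gap, the authors curate VisualClawArena, a 200-scenario multimodal agentic benchmark in which models must use video evidence, documents, dynamic updates, and executable checks inside a workspace. On VisualClawArena, the framework improves macro accuracy by +2.9% for Codex (GPT-5.5) and +3.2% for Claude Code (Sonnet 4.6) over no-evolution baselines, with a 9.5% cost reduction compared to the uniform-sampled baseline.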

Link: https://arxiv.org/abs/2606.16295
Authors: Haoqin Tu, Jianwen Chen, Zijun Wang, Siwei Han, Juncheng Wu, Hardy Chen, Haonian Ji, Kaiwen Xiong, Jiaqi Liu, Peng Xia, Jieru Mei, Hongliang Fei, Jason Eshraghian, Zeyu Zheng, Yuyin Zhou, Huaxiu Yao, Cihang Xie
Affiliations: UC Santa Cruz; UNC-Chapel Hill; Google
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: H. T. and J. C. contribute to this project equally

Abstract:Vision language models are serving as general-purpose interfaces for complex multimodal tasks. However, deployment still faces three gaps: VLMs typically incur high latency and cost when processing dense video frames and long prompts, the agent scaffold remains static after deployment, and standard video-QA benchmarks do not test whether agents can use visual evidence inside tool-using workspaces. We present VisualClaw, a self-evolving multimodal agent built around two principles. First, hybrid encoding reduces deployment cost by filtering less informative streaming frames with a cascaded gate and compressing the text skill bank through hot/cold top-k injection. Second, skill evolution lets the agent learn from failures: retrieved memories condition an evolver as direct concatenated context or as guided evidence, producing skill-bank updates that help future questions. Across 4 video-QA benchmarks with 2 VLMs, VisualClaw cuts per-question API cost by an average -98% versus full-frame upload and by -25.9% over the offline uniform 8 frame baseline, while boosting accuracy in most settings, e.g., an average +3.85% and a peak +15.80% on EgoSchema with Gemini 3 Flash. To address the gap, we curate VisualClawArena, a 200-scenario multimodal agentic benchmark built through a strict five-stage pipeline; models must use video evidence, documents, dynamic updates, and executable checks inside a workspace. On VisualClawArena, the same framework with computer-use agent backends improves macro accuracy by +2.9% for Codex (GPT-5.5) and +3.2% for Claude Code (Sonnet 4.6) over no-evolution baselines, with a -9.5% cost reduction compared to the uniform-sampled baseline. These properties make VisualClaw a natural fit for edge applications, where the cascade reduces a 1-hour streaming session from ~3,600 API uploads down to only 5-20 calls and the self-evolution makes it a perfect personalized assistant.

[NLP-60] HiMPO: Hindsight-Informed Memory Policy Optimization for Less-Entangled Credit in Long-Horizon Agents

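[Quick read]: This paper targets the credit assignment problem that long-horizon agents face when writing to memory. A memory update can be rewarded or penalized because of downstream tool failures, noisy observations, or reasoning errors rather than its own contribution; this causally entangled credit can lead an agent to discard useful evidence or keep redundant content. The proposed solution is HiMPO (Hindsight-Informed Memory Policy Optimization), a hindsight-based memory policy optimization framework. HiMPO estimates the local utility of a memory update by comparing the task-relevant information recoverable from the previous and the updated memory under the same pre-write state, then applies hindsight relevance as a bounded retrospective filter that attenuates memory credit when the target outcome does not support the local utility. The resulting memory-specific advantage optimizes only the memory tokens, while trajectory-level rewards continue to optimize the remaining behavior, which decouples memory-update credit from overall behavior. In experiments on judge-based open-domain tasks and compressive-memory QA, the method outperforms mainstream memory-based and RL-based baselines while preserving context-compression efficiency, and controlled interventions show that it reduces blame leakage from tool-induced errors and improves the attribution fidelity of memory updates.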

Link: https://arxiv.org/abs/2606.16285
Authors: Jiangze Yan,Yi Shen,Wenjing Zhang,Jieyun Huang,Zhaoxiang Liu,Ning Wang,Kai Wang,Shiguo Lian
Institutions: Unicom Data Intelligence, China Unicom; Data Science Artificial Intelligence Research Institute, China Unicom
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Preprint. 2 figures


Abstract:Long-horizon agents rely on memory mechanisms to compress interaction history, but optimizing memory writing faces a distinct credit assignment challenge: a memory update may be rewarded or penalized due to downstream tool failures, noisy observations, or reasoning errors rather than its own contribution. This causally entangled credit can lead agents to discard useful evidence or preserve irrelevant information. We propose HiMPO, a Hindsight-Informed Memory Policy Optimization framework for assigning less-entangled credit to memory-writing actions in long-horizon agents. HiMPO first estimates the local utility of a memory update by comparing the task-relevant information recoverable from the previous and updated memories under the same pre-write state. It then uses hindsight relevance as a bounded retrospective filter that attenuates memory credit when local utility is not supported by the target outcome. The resulting memory-specific advantage is applied only to memory tokens, while trajectory-level rewards optimize the rest of the agent behavior. Across judge-based open-domain tasks and objective compressive-memory QA, HiMPO improves over strong memory-based and RL-based baselines while preserving compressed-context efficiency. Controlled interventions further show that HiMPO reduces blame leakage from tool-induced errors and improves attribution fidelity of memory updates.
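A rough sketch of how the two credit signals described in the abstract might be combined per token. The abstract gives no formulas, so the multiplicative gate, the clipping to [0, 1], and all function names below are assumptions made for illustration only.

```python
from typing import List


def memory_advantage(recoverable_prev: float, recoverable_new: float,
                     hindsight_relevance: float) -> float:
    """Local utility of a memory write, attenuated by a bounded hindsight gate.

    Assumption: utility is the gain in recoverable task-relevant information,
    and the gate is the hindsight relevance clipped to [0, 1].
    """
    local_utility = recoverable_new - recoverable_prev
    gate = min(1.0, max(0.0, hindsight_relevance))
    return gate * local_utility


def per_token_advantages(is_memory_token: List[bool],
                         trajectory_advantage: float,
                         mem_advantage: float) -> List[float]:
    """Memory tokens get the memory-specific advantage; all others get the trajectory-level one."""
    return [mem_advantage if is_mem else trajectory_advantage
            for is_mem in is_memory_token]
```

With this split, a tool failure that lowers the trajectory-level advantage does not change the value assigned to the memory tokens, which is the decoupling the abstract describes.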

[NLP-61] Who Should Lead Decoding Now? Tracking Reliable Trajectories for Ensembling Masked Diffusion Language Models

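[Quick read]: This paper addresses knowledge fusion across multiple generative language models, specifically how to combine Masked Diffusion Language Models (MDLMs) that differ in capabilities and knowledge coverage. The core difficulty is that existing methods cannot dynamically identify and exploit each model's stronger trajectories during sequence generation, so the available knowledge is underused. The proposed framework, Trajectory-based Iterative Ensembling (TIE), tracks confidence dynamics over answer-relevant positions to judge which model is currently on the more reliable generation path, and selectively relays partially denoised sequences across models at different denoising steps. This lets different models contribute complementary strengths at different stages of generation. Experiments and analyses suggest that TIE is a practical approach to the underexplored problem of MDLM ensembling.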

Link: https://arxiv.org/abs/2606.16281
Authors: Heecheol Yun,Joonhyung Park,Joowon Kim,Eunho Yang
Institutions: KAIST; AITRICS
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: preprint


Abstract:Masked Diffusion Language Models (MDLMs) have emerged as a distinct paradigm for sequence generation. As MDLMs become diverse in capabilities and knowledge coverage, an important question is how to combine their knowledge. Toward this, we first investigate the unique decoding dynamics of MDLMs. We find that successful generations exhibit stable confidence dynamics over answer-relevant positions, while unreliable trajectories can often be corrected by injecting promising intermediate states from other models. Guided by this observation, we propose TIE (Trajectory-based Iterative Ensembling), a knowledge fusion framework in which MDLMs iteratively identify reliable decoding trajectories and relay them across models. TIE tracks confidence dynamics over answer-relevant positions to determine which model currently follows a more reliable trajectory and selectively transfers partially denoised sequences across models. As the model on the more promising trajectory often changes across denoising steps, TIE allows different models to contribute complementary strengths at different stages of generation. Strong performance across diverse reasoning tasks, along with our analyses, suggests that TIE offers a practical approach to the underexplored problem of MDLM ensembling.
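The relay idea can be sketched in a few lines. Everything concrete here is an assumption: the abstract does not define how confidence dynamics are scored, so `stability` (latest confidence minus mean step-to-step change) and the `margin` rule are hypothetical choices, and each history is taken to be the per-step mean confidence over answer-relevant positions.

```python
from typing import Dict, List


def stability(conf_history: List[float]) -> float:
    """Hypothetical reliability score: latest confidence minus mean step-to-step change."""
    if not conf_history:
        return 0.0
    if len(conf_history) < 2:
        return conf_history[-1]
    changes = [abs(b - a) for a, b in zip(conf_history, conf_history[1:])]
    return conf_history[-1] - sum(changes) / len(changes)


def pick_leader(histories: Dict[str, List[float]]) -> str:
    """Name of the model whose trajectory currently scores as most stable."""
    return max(sorted(histories), key=lambda name: stability(histories[name]))


def relay(states: Dict[str, List[str]], histories: Dict[str, List[float]],
          margin: float) -> Dict[str, List[str]]:
    """Copy the leader's partially denoised sequence to models trailing by more than `margin`."""
    leader = pick_leader(histories)
    best = stability(histories[leader])
    return {
        name: list(states[leader]) if best - stability(histories[name]) > margin else list(seq)
        for name, seq in states.items()
    }
```

Calling `relay` after every denoising step lets the leader change over time, so a model that starts unsteadily can still take over later in generation.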

[NLP-62] Data Augmentations for Data-Constrained Language Model Pretraining

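[Quick read]: This paper addresses the problem that, in a data-constrained, compute-abundant regime, autoregressive (AR) language model pretraining overfits, so performance keeps degrading as the number of epochs grows. As the production of high-quality text approaches a bottleneck, models face a "data ceiling" and need productive multi-epoch training on a fixed corpus. The core idea is to use data augmentation as a regularizer, with three orthogonal categories of augmentation: token-level noise (masking, random replacement), sequence permutations (right-to-left prediction, Fill-in-the-Middle), and target offset prediction (predicting tokens at future positions). Systematic ablations show that each augmentation delays overfitting and lowers validation loss, with random token replacement performing best among the individual methods; combining categories lowers the loss further. The study concludes that data augmentation mitigates the data inefficiency of AR pretraining and offers a promising approach to the data-constrained regime.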

Link: https://arxiv.org/abs/2606.16246
Authors: Michael K. Chen,Xikun Zhang,Zhen Wang
Institutions: UC San Diego; RMIT University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:


Abstract:As AI labs approach a data ceiling where compute capacity outpaces the rate of new high-quality text generation, language model pretraining is shifting toward a data-constrained, compute-abundant regime that demands productive multi-epoch training on fixed corpora. Standard autoregressive (AR) pretraining overfits severely in this setting, reaching its optimum early and then continuously deteriorating. We investigate data augmentation as a regularizer to mitigate this overfitting and enable productive training for hundreds of epochs on the same data. We introduce three orthogonal categories of augmentation for AR pretraining: token-level noise (masking, random replacement), sequence permutations (right-to-left prediction, Fill-in-the-Middle), and target offset prediction (x_{t+i} for i > 1). Through systematic ablations, we find that individual augmentations delay overfitting and lower validation loss relative to the baseline, with random token replacement achieving the best minimum loss among individual methods. Combining augmentation categories further lowers the minimum validation loss. Our experiments demonstrate that data augmentations mitigate AR pretraining’s data inefficiency and offer a promising solution to the data-constrained regime. All code and data are available at this https URL
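One example from each of the three augmentation categories can be sketched on plain token-id lists. This is a minimal illustration, not the authors' code: the sentinel ids, the prefix-suffix-middle ordering for Fill-in-the-Middle, and the choice to corrupt the whole sequence are assumptions.

```python
import random
from typing import List, Tuple


def random_replace(tokens: List[int], vocab_size: int, p: float,
                   rng: random.Random) -> List[int]:
    """Token-level noise: swap each token for a random vocabulary id with probability p."""
    return [rng.randrange(vocab_size) if rng.random() < p else t for t in tokens]


def fill_in_the_middle(tokens: List[int], rng: random.Random,
                       pre: int, suf: int, mid: int) -> List[int]:
    """Sequence permutation: reorder as <pre> prefix <suf> suffix <mid> middle.

    `pre`, `suf`, and `mid` are hypothetical sentinel token ids.
    """
    i, j = sorted(rng.sample(range(len(tokens) + 1), 2))
    prefix, middle, suffix = tokens[:i], tokens[i:j], tokens[j:]
    return [pre] + prefix + [suf] + suffix + [mid] + middle


def offset_pairs(tokens: List[int], offset: int) -> Tuple[List[int], List[int]]:
    """Target offset: the input at position t is paired with the token at t + offset (offset >= 1)."""
    return tokens[:-offset], tokens[offset:]
```

With `offset=1`, `offset_pairs` reduces to ordinary next-token prediction, so larger offsets are the augmented case.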

[NLP-63] LiFT: Local Search via Linear Programming for Overfitting-Controlled Transformers

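[Quick read]: This paper addresses overfitting when fine-tuning pretrained Transformer models, where the core challenge is to adjust model parameters and regularization hyperparameters efficiently and precisely without repeated full retraining. It proposes Linear Programming-based Fine-Tuning (LiFT), a local search framework based on linear programming (LP) that formulates fine-tuning as a bilevel optimization-based regularization problem and jointly updates model parameters and regularization hyperparameters. Using validation gradients and training Hessian information collected during initial warm-up iterations, it constructs an LP that minimizes a scaled directional derivative, which yields a validation-aware local descent direction. This direction guides focused updates of selected layers and regularization parameters, reducing overfitting while preserving training optimality. Unlike conventional fine-tuning that relies on heuristics or grid search, LiFT systematically identifies task-specific updates. In experiments with GPT-2 Small on WikiText-2 it yields consistent reductions in test perplexity, with the largest gains in overfitting-prone settings, and it establishes a principled connection between Transformer fine-tuning, bilevel optimization, local search, and regularization theory.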

Link: https://arxiv.org/abs/2606.16243
Authors: Abhishek Shukla,Anikeit Khanna,Ankur Sinha,Faiz Hamid
Institutions: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 22 pages, 6 figures, published in The 20th Learning and Intelligent Optimization Conference (LION 2026)


Abstract:This paper proposes a Linear Programming (LP)-based local search framework for fine-tuning pretrained transformer models with explicit control against overfitting. The approach formulates transformer fine-tuning as a bilevel optimization-based regularization problem, in which model parameters and regularization hyperparameters are jointly updated. Information collected during initial warm-up iterations, including validation gradients and training Hessian information, is used to construct a local descent direction by solving an LP that minimizes a scaled directional derivative while preserving training optimality. This validation-aware descent direction enables focused local updates of both parameters and regularization hyperparameters, reducing overfitting without requiring repeated full retraining cycles. The resulting method, termed Linear Programming-based Fine-Tuning (LiFT) for transformers, differs from conventional fine-tuning by systematically identifying task-specific updates rather than relying on heuristic or grid-based hyperparameter selection. Experiments on GPT-2 Small fine-tuned on WikiText-2 demonstrate that LiFT enables effective adaptation through selective tuning of transformer blocks and regularization parameters, yielding consistent improvements in test perplexity across multiple layer configurations and regularization settings, with particularly pronounced gains in overfitting-prone scenarios. Beyond empirical performance, LiFT establishes a principled connection between transformer fine-tuning, bilevel optimization, local search, and regularization theory.
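For orientation, a generic way to write down the bilevel problem and an LP for a local direction is sketched below. The notation is an assumption and is not taken from the paper; in particular, the abstract does not say how the directional derivative is scaled, so the box constraint on the direction is only one plausible normalization.

```latex
% Bilevel view (generic form; all symbols are assumptions)
\min_{\lambda}\; L_{\mathrm{val}}\bigl(\theta^{*}(\lambda)\bigr)
\quad \text{s.t.} \quad
\theta^{*}(\lambda) \in \arg\min_{\theta}\; L_{\mathrm{tr}}(\theta, \lambda)

% One possible LP for a local direction d = (d_\theta, d_\lambda)
\min_{d}\;
\nabla_{\theta} L_{\mathrm{val}}^{\top} d_{\theta}
+ \nabla_{\lambda} L_{\mathrm{val}}^{\top} d_{\lambda}
\quad \text{s.t.} \quad
\nabla^{2}_{\theta\theta} L_{\mathrm{tr}}\, d_{\theta}
+ \nabla^{2}_{\theta\lambda} L_{\mathrm{tr}}\, d_{\lambda} = 0,
\qquad -1 \le d_{j} \le 1 \;\; \forall j
```

The equality constraint is the first-order condition for remaining at a training stationary point after the step, which is one reading of "preserving training optimality"; the objective is the validation directional derivative along d.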

[NLP-64] Rapid Poison: Practical Poisoning Attacks Against the Rapid Response Framework ICML2026

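[Quick read]: This paper studies adversarial poisoning of the Rapid Response (RR) framework, which continually retrains jailbreak-detection classifiers in generative AI systems. It focuses on attackers who use prompt injection to insert malicious samples into the classifier's training data and thereby cause severe misclassification. The core challenge lies in the threat model: the attacker can modify only jailbreak samples, not benign data or labels, a constraint under which conventional attacks do not apply. The key contribution is a new attack paradigm called the Omission Attack, which exploits a newly observed phenomenon: when a model is trained on unsafe samples that lack a specific concept, it wrongly associates the presence of that concept with the "safe" label. On this basis, poisoned samples built around a trigger feature induce false negatives on real jailbreak inputs (up to 96%), while the targeted poisoning attack induces false positives on harmless samples (up to 100%). Both attacks achieve substantial label flipping at a poisoning rate of only 1%, seriously undermining classifier reliability.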

Link: https://arxiv.org/abs/2606.16242
Authors: David Huang,Jaewon Chang,Avidan Shah,Prateek Mittal,Chawin Sitawarin
Institutions: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Spotlight at ICML 2026


Abstract:The Rapid Response (RR) framework, deployed in production systems, including Anthropic’s ASL-3 safeguards, continuously improves jailbreak-detection classifiers. When new jailbreaks emerge that bypass these classifiers, Rapid Response generates synthetic variants for training, helping the model generalize from the new attacks and quickly adapt. We reveal that prompt injection can infiltrate this pipeline to deliver poisoned samples into the classifier’s training set, enabling two attack objectives: (I) targeted poisoning attacks that create false positives on harmless samples by categorizing them as a jailbreak, with a specific desired feature (e.g., certain formatting, subject, or keyword), (II) concept-based backdoor attacks that induce false negatives on jailbreak inputs, generalizing even to jailbreaks from attack strategies the defender explicitly trained against, when the backdoor trigger is present. Importantly, our threat model restricts adversaries to modifying only jailbreak samples (not benign data or labels), a constraint unexplored by prior work that makes the second objective particularly challenging. We address this with Omission Attack, which exploits a new phenomenon: when training on concept-absent unsafe samples, the classifier misassociates that concept’s presence with the safe label. Both attacks cause substantial and in some cases near-complete label flipping at only a 1% poisoning rate, achieving up to 100% false positive rates and up to 96% false negative rates.

[NLP-65] Creative Collision: Directorial Persona Steering and Competition in Large Language Models ICML2026

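[Quick read]: This paper studies how activation steering at inference time can give finer, more controllable generation in large language models, in particular how semantically opposing directions interact when they coexist. Existing methods usually inject a single semantic direction, which cannot capture the dynamic balance of complex creative intent. The paper proposes a regime called "Creative Collision": two semantically opposing directorial persona vectors, Steven Spielberg (optimistic, redemptive moral valence) and Martin Scorsese (dark, morally ambiguous), are superimposed in the residual stream. The vectors are built by mean-difference activation contrast on curated screenplay corpora and interpolated with a scalar mixing parameter α and a steering coefficient λ. The findings are: (i) Spielberg's representation shows marked directional dominance, suppressing Scorsese's moral influence over almost the entire interpolation range; (ii) intermediate collision points counterintuitively improve generation coherence at high λ; and (iii) both personas localize maximally to layer 28 of a 40-layer decoder, revealing a shared "moral-tone substrate". The work characterizes the geometry and interaction of competing semantic directions in the Transformer residual stream, with implications for controllable creative generation and value-aligned narrative synthesis.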

Link: https://arxiv.org/abs/2606.16240
Authors: Subramanyam Sahoo,Justin Shenk
Institutions: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at ICML 2026 Workshop on Human-AI Co-Creativity


Abstract:Activation steering has emerged as a powerful tool for shaping the behaviour of large language models at inference time, yet most prior work injects a single semantic direction into the residual stream. We study the richer setting in which two semantically opposing steering vectors are superimposed, a regime we call Creative Collision. Concretely, we construct directorial persona vectors for Steven Spielberg (optimistic, redemptive moral valence) and Martin Scorsese (dark, morally ambiguous) via mean-difference activation contrast on curated screenplay-derived corpora, then interpolate between them with a scalar mixing parameter α ∈ [0,1] and a steering coefficient λ. Across five evaluation axes (moral valence, generation coherence, surface style, directional dominance, and vector geometry), three principal findings emerge: (i) Spielberg’s representational signature exhibits robust directional dominance, suppressing Scorsese’s moral influence across almost the entire interpolation range; (ii) intermediate collision points paradoxically improve generation coherence relative to pure single-director steering at high λ; and (iii) both personas localise maximally to layer 28 of a 40-layer decoder-only transformer, revealing a shared moral-tone substrate. These results illuminate the geometry of competing semantic directions in transformer residual streams and have direct implications for controllable creative generation and value-aligned narrative synthesis.
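The three operations named in the abstract (mean-difference contrast, interpolation with α, and steering with λ) reduce to simple vector arithmetic. The sketch below uses plain Python lists in place of real activations; which persona receives the weight α versus 1 − α, and the use of a baseline corpus for the contrast, are assumptions.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Column-wise mean of a batch of activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def mean_difference(persona_acts: List[Vector], baseline_acts: List[Vector]) -> Vector:
    """Persona direction: mean persona activation minus mean baseline activation."""
    return [p - b for p, b in zip(mean_vector(persona_acts), mean_vector(baseline_acts))]


def collide(v_a: Vector, v_b: Vector, alpha: float) -> Vector:
    """Interpolate two persona directions: alpha * v_a + (1 - alpha) * v_b."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(v_a, v_b)]


def steer(hidden: Vector, direction: Vector, lam: float) -> Vector:
    """Add the scaled direction to a residual-stream activation."""
    return [h + lam * d for h, d in zip(hidden, direction)]
```

Setting `alpha` to 0 or 1 recovers pure single-director steering, so the intermediate values are the collision points the abstract refers to.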

[NLP-66] PACT: Privileged Trace Co-Training for Multi-Turn Tool-Use Agents

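[Quick read]: This paper addresses two challenges in training multi-turn tool-use agents. Reinforcement learning (RL) is hard to optimize because of sparse rewards and difficult credit assignment, while supervised fine-tuning (SFT) on expert traces alone gives dense process supervision but can over-constrain the model to fixed trajectories and hurt generalization. The paper proposes PACT (Privileged trAce Co-Training), whose core idea is to use expert traces as training-time optimization signals rather than as rollout-time hints. PACT keeps rollout generation prompt-only and guides optimization with two complementary signals: a trace-conditioned RL surrogate, which evaluates prompt-only rollouts under expert-trace context, and a component-aware SFT loss, which supervises reasoning prefixes and tool calls with annealed strength. To reduce over-reliance on the training-only trace context, PACT adds prompt-only anchoring, and a latent-trace view explains how the two trace-based objectives jointly guide optimization without taking part in rollout generation. Experiments on FTRL, BFCL, and ToolHop show consistent improvements over strong SFT- and RL-based baselines.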

Link: https://arxiv.org/abs/2606.16215
Authors: Zhenbang Du,Jun Luo,Zhiwei Zheng,Xiangchi Yuan,Kejing Xia,Dachuan Shi,Qirui Jin,Qijia He,Shaofeng Zou,Yingbin Liang,Wenke Lee
Institutions: Georgia Institute of Technology; Ohio State University; University of Pennsylvania; Arizona State University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page: this https URL


Abstract:Multi-turn tool-use agents must reason, call tools, and adapt to observations across several interaction turns. Post-training such agents is challenging, as reinforcement learning often suffers from sparse rewards and weak credit assignment despite matching the prompt-only inference setting, while supervised fine-tuning on expert traces provides dense process supervision but can over-constrain the model to fixed trajectories. To tackle this, we propose PACT, a Privileged trAce Co-Training framework for multi-turn tool-use agents. The key idea is to use expert traces only as training-time optimization signals rather than rollout-time hints. PACT keeps rollout generation prompt-only, then uses expert traces to guide optimization through two complementary signals: a trace-conditioned RL surrogate that evaluates prompt-only rollouts under expert-trace context, and a component-aware SFT loss that supervises reasoning prefixes and tool-calls with annealed strength. To reduce over-reliance on the training-only trace context, PACT further introduces a prompt-only anchoring. We also provide a latent-trace view that connects the two trace-based objectives and explains how expert traces can guide optimization without being used during rollout generation. Experiments on FTRL, BFCL, and ToolHop show that PACT consistently improves over strong SFT- and RL-based baselines, highlighting the value of privileged trace co-training for multi-turn tool-use learning.

[NLP-67] Weaving Multi-Source Evidence for Biomedical Reasoning: The BioMedHop Benchmark and BioWeave Framework

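[Quick read]: This paper addresses the evaluation and modeling of complex reasoning in biomedical question answering (QA), in particular the lack of evidence topology construction and source-aware reasoning across heterogeneous sources such as knowledge graphs, literature documents, and web resources. Existing benchmarks are largely limited to exam-style knowledge, literature comprehension, or short-range multi-hop inference, and leave source-conditioned graph reasoning and evidence topology construction underexplored. The authors introduce BioMedHop, a multi-source, graph-grounded biomedical reasoning benchmark with 10,045 instances across knowledge graph (KG), document, web, and hybrid evidence settings. It covers shared-neighbor matching, intersection reasoning, path-based reasoning, and counting, with option-based, open-ended, and numeric count answer formats. To support the benchmark they further propose BioWeave, a source-aware reasoning framework that retrieves KG paths, gathers supporting clues from documents and web sources, assembles them into a unified evidence graph, and verifies answers through entity-level evidence support. Experiments show that BioWeave achieves the best overall performance among the compared methods on BioMedHop, outperforming the strong hybrid baseline ToG-2 by 10.5% in the overall average, and that it improves different LLM backbones, enabling smaller models such as Qwen3-4B to reach reasoning performance comparable to GPT-4-Turbo.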

Link: https://arxiv.org/abs/2606.16211
Authors: Xingyu Tan,Shiyuan Liu,Xiaoyang Wang,Qing Liu,Xiwei Xu,Xin Yuan,Liming Zhu,Wenjie Zhang
Institutions: University of New South Wales, Australia; CSIRO, Australia
Categories: Computation and Language (cs.CL)
Comments:


Abstract:Biomedical question answering (QA) increasingly requires reasoning over interacting entities, where supporting evidence is scattered across biomedical knowledge graphs, literature documents, and web-accessible resources. However, existing biomedical QA benchmarks mainly focus on exam-style knowledge, literature comprehension, or short-range multi-hop inference, leaving source-conditioned graph reasoning and evidence topology construction underexplored. To fill this gap, we introduce BioMedHop, a multi-source graph-grounded benchmark for evaluating biomedical reasoning over structured evidence topologies. BioMedHop contains 10,045 instances across KG, document, web, and hybrid evidence settings, covering shared-neighbor matching, intersection reasoning, path-based reasoning, and counting, with option-based, open-ended, and numeric count renderings. To support this benchmark, we further propose BioWeave, a source-aware reasoning framework that retrieves biomedical KG paths, gathers supporting clues from documents and web sources, assembles them into a unified evidence graph, and verifies answers through entity-level evidence support. Comprehensive experiments show that BioWeave achieves the best overall performance among compared methods on BioMedHop, outperforming the strong hybrid baseline ToG-2 by 10.5% in the overall average. Moreover, BioWeave consistently improves different LLM backbones and enables smaller models, such as Qwen3-4B, to achieve reasoning performance comparable to GPT-4-Turbo.

[NLP-68] LLM-Powered Virtual Population for Demand Simulation and Pricing

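[Quick read]: This paper addresses how to build a demand simulator when product information comes as rich unstructured data (such as text descriptions and images) and decision makers need not only mean-demand predictions but also uncertainty estimates to evaluate counterfactual prices. The core solution is an LLM-based virtual population model that represents potential customers as draws from a finite mixture of customer personas. For each persona, product, and candidate price, an LLM combines structured persona information with unstructured product information to infer a persona-level purchase probability; these probabilities are aggregated through calibrated mixture weights to produce a predictive distribution of aggregate demand. The framework supports counterfactual analysis under various pricing objectives, including expected revenue and risk-aware objectives such as conditional value at risk (CVaR), and achieves the best overall predictive performance among the models considered, with good sample efficiency, on an online fashion e-commerce dataset with product descriptions and images. Because it outputs a full demand distribution rather than only a point estimate, managers can quantify demand uncertainty, compare pricing strategies systematically, and trade off average revenue against risk control.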

Link: https://arxiv.org/abs/2606.16183
Authors: Chengpiao Huang,Kaizheng Wang
Institutions: Columbia University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 18 pages, 7 figures


Abstract:We develop an LLM-powered virtual population model that simulates demand for pricing decisions, in settings where products are described by rich unstructured information, such as text descriptions and images, and where decision makers need not only mean-demand predictions but also uncertainty estimates for counterfactual prices. Our model represents exposed customers as draws from a finite mixture of customer personas. For each persona, product, and candidate price, an LLM elicits a persona-level purchase probability using both structured persona information and unstructured product information. These probabilities are aggregated through calibrated mixture weights to form a predictive distribution of aggregate demand. The resulting simulator can evaluate counterfactual prices under various pricing objectives, including expected revenue and risk-aware criteria such as conditional value at risk. We test the framework on an online H&M fashion dataset with product descriptions and images. The calibrated LLM-based simulator achieves the best overall predictive performance among the models considered, and supports sample-efficient pricing decisions. Our framework provides a practical way to use LLMs as demand simulators for products with limited historical demand data but rich product information. By producing a full predictive demand distribution rather than only a point forecast, it enables managers to compare candidate prices, quantify demand uncertainty, and choose prices that target either average-case revenue or risk-aware objectives.
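The aggregation step can be illustrated with a small Monte Carlo sketch. It assumes the persona-level purchase probabilities at a given price have already been elicited (here they are just a list of numbers standing in for LLM outputs), and it uses a simple lower-tail average as the CVaR estimate; the function names and the simulation design are hypothetical.

```python
import random
from typing import List, Tuple


def simulate_demand(weights: List[float], probs: List[float], n_customers: int,
                    n_sims: int, rng: random.Random) -> List[int]:
    """Draw aggregate demand: each customer samples a persona, then buys with that persona's probability."""
    personas = list(range(len(weights)))
    draws = []
    for _ in range(n_sims):
        chosen = rng.choices(personas, weights=weights, k=n_customers)
        draws.append(sum(1 for k in chosen if rng.random() < probs[k]))
    return draws


def cvar_lower(values: List[float], alpha: float) -> float:
    """Mean of the worst alpha-fraction of outcomes (lower tail)."""
    ordered = sorted(values)
    tail = max(1, int(alpha * len(ordered)))
    return sum(ordered[:tail]) / tail


def evaluate_price(price: float, weights: List[float], probs_at_price: List[float],
                   n_customers: int, n_sims: int, alpha: float,
                   rng: random.Random) -> Tuple[float, float]:
    """Return (expected revenue, lower-tail CVaR of revenue) for one candidate price."""
    demand = simulate_demand(weights, probs_at_price, n_customers, n_sims, rng)
    revenue = [price * d for d in demand]
    return sum(revenue) / len(revenue), cvar_lower(revenue, alpha)
```

Running `evaluate_price` over a grid of candidate prices gives two curves, one for the mean and one for the tail, and the risk-neutral and risk-aware choices can then differ.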

[NLP-69] Focus When Necessary: Adaptive Routing and Collaborative Grounding for Training-Free Visual Grounding

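[Quick read]: This paper addresses the difficulty that Multimodal Large Language Models (MLLMs) have in perceiving fine-grained details in complex high-resolution images. Existing training-free methods improve detail perception through image scaling and local cropping, but applying them indiscriminately adds redundant computation on simple queries and can lower accuracy by losing global context or introducing irrelevant background noise. The paper proposes LazyMCoT, a dynamic, training-free framework that allocates visual grounding effort adaptively according to sample difficulty. Its key components are: (1) an adaptive routing mechanism that assesses predictive uncertainty from the statistics of the first output token in a single forward pass, efficiently skipping high-confidence samples while using conformal calibration to protect the recall of difficult samples; and (2) a collaborative grounding module that combines the model's own cross-modal attention with an external visual expert in a two-stage refinement process, producing a precise localized view that recovers small or occluded targets. Extensive experiments show that LazyMCoT reaches reasoning accuracy comparable to training-based methods on multiple benchmarks while reducing average inference latency.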

Link: https://arxiv.org/abs/2606.16158
Authors: Yifan Wang,Peiming Li,Shiyu Li,Zhiyuan Hu,Xiaochen Yang,Wenming Yang,Yang Tang,Zheng Wei
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:


Abstract:While Multimodal Large Language Models (MLLMs) excel in cross-modal reasoning, they often struggle to perceive fine-grained details in complex high-resolution images. Recent training-free methods address this through image scaling and localized cropping. However, applying these manipulations indiscriminately introduces computational redundancy for simple queries and can degrade accuracy by truncating essential global context or introducing irrelevant background noise. To this end, we propose LazyMCoT, a dynamic and training-free framework that adaptively allocates visual grounding efforts based on sample difficulty. The framework features an Adaptive Routing mechanism that evaluates predictive uncertainty using first-token statistics from a single forward pass. This efficiently bypasses confident cases while ensuring the recall of difficult samples via conformal calibration. For these challenging cases, a Collaborative Grounding module integrates the inherent cross-modal attention of the model with an external visual expert through a two-stage refinement process. This refinement process generates a precise localized display to recover small or occluded targets. Extensive experiments across diverse benchmarks demonstrate that LazyMCoT rivals training-based approaches by simultaneously improving reasoning accuracy and reducing average inference latency. Our code is available at this https URL.
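A minimal sketch of uncertainty-based routing with a conformal threshold follows. The abstract does not say which first-token statistic is used or how calibration is done, so the entropy score and the split-conformal quantile rule below are assumptions; `hard_scores` stands for entropies measured on held-out samples known to need grounding.

```python
import math
from typing import List


def first_token_entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of the first-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def conformal_threshold(hard_scores: List[float], epsilon: float) -> float:
    """Threshold such that a new hard sample scores at or above it with probability >= 1 - epsilon.

    Uses the standard split-conformal rank k = floor(epsilon * (n + 1)),
    which assumes the calibration and test samples are exchangeable.
    """
    k = math.floor(epsilon * (len(hard_scores) + 1))
    if k < 1:
        return float("-inf")  # too few calibration samples: route everything
    return sorted(hard_scores)[k - 1]


def needs_grounding(probs: List[float], threshold: float) -> bool:
    """Route to the expensive grounding path only when uncertainty reaches the threshold."""
    return first_token_entropy(probs) >= threshold
```

Samples below the threshold are answered directly, so the expensive cropping and refinement path runs only where the calibration data suggests it is likely to matter.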

[NLP-70] GRACE: Step-Level Benchmark for Faithful Reasoning over Context

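[Quick read]: This paper addresses the problem that, in Chain-of-Thought (CoT) reasoning, intermediate steps can deviate from the original input evidence (hallucinated or unfaithful steps) even when the final answer is correct. Existing methods detect hallucinations only at the response level and cannot locate the step where an error occurs or identify its type, which limits the evaluation of the interpretability and reliability of model reasoning. The paper introduces GRACE, the first human-annotated step-level faithfulness benchmark for context-grounded textual reasoning. Its key contribution is a data-driven error taxonomy, discovered bottom-up through unsupervised clustering, that organizes failures into two tracks, GRACE-Inference (deductive errors) and GRACE-Grounding (factual grounding errors), each with four categories; every reasoning step is annotated for faithfulness, error category, and a natural language explanation. The evaluation set is challenging and human-annotated, and experiments show substantial headroom for current models in step-level faithfulness. Integrating step-level faithfulness signals into reinforcement learning further improves downstream accuracy and reasoning reliability.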

Link: https://arxiv.org/abs/2606.16151
Authors: Hoang Pham,Dong Le,Anh Tuan Luu
Institutions: Nanyang Technological University; VinUniversity
Categories: Computation and Language (cs.CL)
Comments:


Abstract:Many reasoning tasks require models to reason over input context, from document-grounded question answering to rule-based deduction. Chain-of-Thought (CoT) prompting produces traces that appear transparent, yet individual steps can silently deviate from the source evidence, even when the final answer is correct. Existing methods detect hallucinations at the response level but fail to identify where in the chain a failure occurs or what type it is. We introduce GRACE, the first human-annotated step-level faithfulness benchmark with a data-driven error taxonomy for context-grounded textual reasoning. GRACE covers CoT traces from 10 models across 4 source datasets, with each step annotated for faithfulness, error category, and natural language explanation. A data-driven taxonomy, discovered bottom-up via unsupervised clustering, organizes failures into two tracks: GRACE-Inference (deductive errors) and GRACE-Grounding (factual grounding errors), with four categories each. The evaluation set is human-annotated and challenging by design. Our experiments reveal substantial headroom for current models. In addition, integrating step-level faithfulness signals into reinforcement learning pipelines improves both downstream accuracy and reasoning reliability.

[NLP-71] VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

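[Quick read]: This paper addresses how to push verifiable reasoning to frontier level within a strictly parameter-limited small-model regime. The key is an optimized pipeline built on the Spectrum-to-Signal post-training paradigm, which strengthens reasoning through curriculum-based supervised fine-tuning, multi-domain reinforcement learning, and offline self-distillation. VibeThinker-3B reaches leading performance on several demanding verifiable tasks: 94.3 on AIME26 (97.1 with claim-level test-time scaling), 80.2 Pass@1 on LiveCodeBench v6, and a 96.1% acceptance rate on recent unseen LeetCode contests, matching or exceeding much larger flagship models such as DeepSeek V3.2, GLM-5, and Gemini 3 Pro. A score of 93.4 on IFEval indicates that the reasoning gains do not come at the cost of instruction controllability. Building on the authors' earlier 1.5B model, the paper proposes the Parametric Compression-Coverage Hypothesis: verifiable reasoning can be compressed into a compact reasoning core, whereas open-domain knowledge and general-purpose competence need broad parameter coverage of facts, concepts, and long-tail scenarios. On this view, small models are not merely deployment-efficient substitutes but a complementary path to frontier performance in parameter-dense capability regimes.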

Link: https://arxiv.org/abs/2606.16140
Authors: Sen Xu,Shixi Liu,Wei Wang,Jixin Min,Yingwei Dai,Zhibin Yin,Yirong Chen,Xin Zhou,Junlin Zhang
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:


Abstract:This technical report introduces VibeThinker-3B, a compact dense model with 3B parameters developed to investigate how far verifiable reasoning can be pushed within a strictly small-model regime. Building upon the Spectrum-to-Signal post-training paradigm, we systematically enhance the model through an optimized pipeline that includes curriculum-based supervised fine-tuning, multi-domain reinforcement learning, and offline self-distillation. Experimental evaluations demonstrate that VibeThinker-3B achieves frontier-level performance on highly demanding verifiable tasks. Specifically, it attains a score of 94.3 on AIME26 (improving to 97.1 with claim-level test-time scaling), an 80.2 Pass@1 on LiveCodeBench v6, and exhibits strong out-of-distribution generalization with a 96.1% acceptance rate on recent unseen LeetCode contests. This effectively places it in the performance band of first-tier reasoning systems, matching or exceeding flagship models that are orders of magnitude larger, such as DeepSeek V3.2, GLM-5, and Gemini 3 Pro. Furthermore, a score of 93.4 on IFEval confirms that this extreme reasoning enhancement does not compromise strict instruction controllability. Extending our previous 1.5B work, these findings motivate the Parametric Compression-Coverage Hypothesis, which views verifiable reasoning as compressible into compact reasoning cores, while open-domain knowledge and general-purpose competence require broad parameter coverage over facts, concepts, and long-tail scenarios. This perspective suggests that compact models are not merely deployment-efficient substitutes, but a complementary path toward frontier-level performance in parameter-dense capability regimes.

[NLP-72] XAI-Grounded Explanation Generation for Speech Deepfake Detection with Training-Free Multimodal Large Language Models INTERSPEECH2026

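[Quick read]: This paper addresses the lack of trustworthy explanations for the decisions of speech deepfake detection (SDD) systems. Existing explanation methods have two limitations. Traditional explainable AI (XAI) methods, such as gradient-based attribution, produce low-level attribution signals that are tightly coupled to model decisions and hard for humans to interpret. Explanation generation based on large language models (LLMs) tends to produce generic, ungrounded descriptions because it lacks heuristic evidence and task-specific supervision, which stems from the shortage of grounded explanation datasets for SDD. The paper proposes a training-free explanation framework that integrates XAI evidence with multimodal LLMs to generate grounded, task-specific explanations. Using the PartialSpoof dataset, the authors construct a grounded explanation dataset for SDD, and human evaluation and faithfulness checks show that adding XAI evidence raises inside accuracy by more than 45%.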

Link: https://arxiv.org/abs/2606.16137
Authors: Yupei Li,Qiyang Sun,Xiaoliang Wu,Chenxi Wang,Berrak Sisman,Björn W. Schuller
Institutions: Imperial College London; Technical University of Munich; University of Southampton; MBZUAI; Johns Hopkins University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at Interspeech 2026


Abstract:Speech deepfake detection (SDD) systems require trustworthy explanations for reliable decision-making. Existing explanation approaches mainly fall into two categories. Traditional explainable AI (XAI), such as gradient-based attribution, produces low-level attribution signals that are tightly coupled with model decisions and harder for humans to understand than natural language explanations. Meanwhile, large language model (LLM)-based explanation generation often produces generic and ungrounded descriptions due to the lack of heuristic evidence and task-specific supervision, stemming from limited grounded explanation datasets for SDD. We therefore propose a training-free explanation framework that integrates XAI evidence with multimodal LLMs to generate grounded and specific explanations. Using the PartialSpoof dataset, we construct a grounded explanation dataset and show that methods with XAI increase inside accuracy by over 45%, verified through human evaluation and faithfulness checks.

[NLP-73] AuAu: A Benchmark for Auditing Authoritarian Alignment in Large Language Models

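[Quick read]: This paper addresses the risk that large language models (LLMs) implicitly carry or reinforce authoritarian tendencies in generated text, against the backdrop of rising authoritarianism worldwide and the growing role of generative AI in users' everyday lives. The core challenge is to evaluate systematically the authoritarian traits a model shows in different contexts; existing methods rarely capture the multidimensional structure of authoritarianism (Authoritarian Aggression, Authoritarian Submission, and Conventionalism). The authors propose the AuAu benchmark, which combines three complementary evaluation approaches: (1) psychometric questions drawn from 15 human-validated instruments; (2) contextual behavior vignettes that probe intended actions in concrete situations; and (3) responses to realistic user prompts. It measures both overall closeness to authoritarianism and the three sub-concepts. Across 17 models from China, the EU, Russia, and the USA, all models show substantial authoritarian response rates under the psychometric evaluation, but the rates drop markedly in more realistic downstream tasks; an authoritarian system prompt alone, however, pushes 15 of the 17 models toward more authoritarian output. The results show that model behavior is easy to manipulate and underline the need for continued, systematic auditing to detect and mitigate undesired authoritarian tendencies in generated content.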

Link: https://arxiv.org/abs/2606.16127
Authors: Andreas Einwiller,Max Klabunde,Florian Lemmerich
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: v1, 50 pages


Abstract:The worldwide surge of authoritarianism, combined with the increasingly central role of LLMs in users’ everyday lives, raises the question of to what extent specific models exhibit or promote authoritarian attitudes and characteristics. We introduce AuAu, a comprehensive benchmark that aims to assess the risk of LLMs generating responses with authoritarian tendencies. This benchmark combines three evaluation approaches: (i) psychometric questions from an extensive pool of 15 human-validated instruments; (ii) contextual behavior vignettes probing intended actions in concrete situations; and (iii) responses to realistic user prompts. Unlike prior work, AuAu evaluates not only a general closeness towards authoritarianism but also the established sub-concepts Authoritarian Aggression, Authoritarian Submission, and Conventionalism. Evaluating 17 models from China, the EU, Russia, and the USA, we find that all tested models exhibit substantial authoritarian response rates under the psychometric evaluation, though rates drop significantly in increasingly realistic downstream tasks. We further find that an authoritarian system prompt easily manipulates 15 out of 17 models to promote increased authoritarianism. Our results underscore the need for continued, systematic auditing of LLM-based AI systems to detect and ultimately mitigate undesired authoritarian tendencies in generated output. Our code and data are available at: this https URL

[NLP-74] Know Your Limits: On the Faithfulness of LLMs as Solvers and Autoformalizers in Legal Reasoning ICML

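【Quick Read】: This paper asks whether the strong performance of large language models (LLMs) on reasoning tasks reflects faithful logical inference or merely heuristic approximation. It focuses on legal entailment and compares three paradigms, pure LLM classification, LLM-based formal reasoning, and solver-based formal reasoning with the Z3 SMT solver, across five LLMs on a re-annotated subset of ContractNLI. The key finding is that adding formal structure improves accuracy, with LLM-based formal reasoning reaching the highest benchmark performance, yet this gain does not imply faithful reasoning. The study identifies three recurring failure modes: scope laundering, where the LLM reports classifications that disagree with the solver without executing the underlying formal reasoning, so conclusions look logically grounded but are not; implicit constraint blindness, where the LLM overlooks logical constraints present in the formal representation; and program synthesis failures, where the LLM generates incorrect Z3 code despite structured prompting. Scope laundering persists across all models, which raises serious doubts about LLM-based formal reasoning as a proxy for symbolic execution. The results reveal a fundamental gap between benchmark accuracy and logical faithfulness.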

Link: https://arxiv.org/abs/2606.16118
Authors: Olivia Peiyu Wang,Sanna Wong-Toropainen,Daneshvar Amrollahi,Ryan Bai,Tashvi Bansal,Arush Garg,Leilani H. Gilpin
Affiliations: UC Santa Cruz; University of Helsinki; CodeX, Stanford; Stanford University; Canyon Crest Academy; Monta Vista High School; Los Altos High School
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
Comments: 10 pages, submitted to COLM 2026 (under review, average score of 6.25 across 4 reviewers) and accepted by the AI4Law workshop at ICML. This is the version where we already addressed most of the reviews from the COLM reviewers

Abstract:Large Language Models (LLMs) achieve strong performance on reasoning tasks, but whether this reflects faithful logical inference or heuristic approximation remains unclear. We study this question in legal entailment by comparing three paradigms, including pure LLM classification, LLM-based Formal Reasoning, and solver-based Formal Reasoning using the Z3 SMT solver, on a re-annotated subset of ContractNLI across five LLMs. Our re-annotation reveals a systematic and measurable gap between pragmatic legal interpretation and strict formal entailment, where a substantial proportion of legally sound inferences are not formally grounded without additional unstated assumptions. While introducing formal structure improves accuracy, with LLM-based Formal Reasoning achieving the highest benchmark performance, we show that this gain does not imply faithful reasoning. We identify three recurring failure modes: scope laundering, where LLMs report solver-inconsistent classifications without executing the underlying formal reasoning, producing conclusions that appear logically grounded but are not; implicit constraint blindness, where LLMs overlook logical constraints present in formal representations; and program synthesis failures, where LLMs generate incorrect Z3 code despite structured prompting. Critically, scope laundering persists across all models, raising serious concerns about the faithfulness of LLM-based formal reasoning as a proxy for symbolic execution. These results reveal a fundamental gap between benchmark accuracy and logical faithfulness.

[NLP-75] Towards Pareto-Optimal Tool-Integrated Agents with Pareto Ranking Policy Optimization ICML2026

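【Quick Read】: This paper addresses a gap in existing alignment methods for tool-using large language models (LLMs): they focus on maximizing task accuracy and overlook auxiliary objectives such as tool-use efficiency, which matter for practical deployment. The core challenge is trading off multiple competing objectives (for example, accuracy versus efficiency). The proposed solution is ParetoPO, a two-stage multi-objective optimization framework. The first stage uses hypervolume-guided dynamic scalarization, adapting reward weights according to progress on the global Pareto frontier. The second stage replaces the scalarized learning signal with Pareto-ranking-based advantage computation, promoting nondominated trajectories through dominance-aware credit assignment. This design enables fine-grained, action-level optimization across conflicting objectives. On mathematical reasoning and multi-hop QA tasks, ParetoPO finds policies with better accuracy-efficiency trade-offs than static and heuristic baselines.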

Link: https://arxiv.org/abs/2606.16111
Authors: Junyi Li,Xiaowei Qian,Yingyi Zhang,Wenlin Zhang,Guojing Li,Sheng Zhang,Xiao Han,Yichao Wang,Xiangyu Zhao
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: ICML 2026 Spotlight Paper

Abstract:Recent advances in tool-integrated language agents have significantly improved their ability to solve complex reasoning tasks. However, existing alignment methods predominantly focus on maximizing task accuracy, while overlooking auxiliary objectives such as tool-use efficiency, which are essential for practical deployment. To address this gap, we introduce ParetoPO, a two-stage multi-objective optimization framework for aligning tool-using large language models (LLMs) under competing objectives. In the first stage, ParetoPO leverages hypervolume-guided dynamic scalarization to adapt reward weights based on global Pareto frontier progress. In the second stage, it replaces scalarized learning signals with Pareto-ranking-based advantage computation, promoting nondominated trajectories through dominance-aware credit assignment. This design enables fine-grained, action-level optimization across multiple conflicting objectives. Experimental results on mathematical reasoning and multi-hop QA tasks show that ParetoPO consistently discovers policies with superior accuracy-efficiency trade-offs compared to static and heuristic baselines.
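
To make the idea of Pareto-ranking-based advantages concrete, here is a minimal sketch in Python. It is not ParetoPO's implementation: the abstract does not give the exact formula, so the sketch assumes that each rollout is scored by its nondominated-sorting rank and that advantages are the mean-centered negative ranks. The function names and the `(accuracy, -tool_calls)` objective encoding are hypothetical.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere.

    Every objective is treated as "higher is better".
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_ranks(points: List[Sequence[float]]) -> List[int]:
    """Nondominated sorting: rank 0 is the Pareto front, rank 1 the next front, etc."""
    ranks = [-1] * len(points)
    remaining = set(range(len(points)))
    rank = 0
    while remaining:
        # Points not dominated by any other remaining point form the current front.
        front = {
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        }
        for i in front:
            ranks[i] = rank
        remaining -= front
        rank += 1
    return ranks


def rank_advantages(points: List[Sequence[float]]) -> List[float]:
    """Assumed advantage: negative Pareto rank, centered within the rollout group."""
    scores = [-r for r in pareto_ranks(points)]
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]


# Hypothetical rollouts for one prompt: (accuracy, negative number of tool calls).
rollouts = [(1.0, -2), (1.0, -5), (0.0, -1), (0.0, -6)]
ranks = pareto_ranks(rollouts)
advantages = rank_advantages(rollouts)
```

In this toy group, the accurate two-call rollout and the inaccurate one-call rollout are both nondominated, so neither is penalized relative to the other, which is the property a single scalarized reward cannot express.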

[NLP-76] Your “Pro” LLM Subscription May Actually Be “Free”: Exposing Fingerprint Spoofing Risks in LLM Inference Services

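【Quick Read】: This paper examines how fragile current large language model (LLM) fingerprinting is when the service provider is malicious. Existing user-side black-box fingerprinting relies on finite query budgets and weak classifiers, so it struggles to detect a disguised inferior model: an attacker can use parameter-efficient fine-tuning (PEFT) to make a weaker model mimic the behavior of a stronger one and thereby pass fingerprint verification. The authors call this new threat fingerprint spoofing. They propose GhostPrint, a low-cost attack framework that combines surrogate modeling, reward-ranked fine-tuning, and knowledge distillation. At a low fine-tuning cost, GhostPrint lets weak models consistently evade representative fingerprinting methods in both static and continual fingerprinting settings while preserving utility, exposing a critical vulnerability in current LLM fingerprinting pipelines.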

Link: https://arxiv.org/abs/2606.16100
Authors: Jiahao Zhang,Xiuyu Li,Suhang Wang
Affiliations: The Pennsylvania State University
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:As Large Language Model (LLM) APIs become ubiquitous, users increasingly rely on black-box fingerprinting to verify that providers are serving the advertised premium models. However, these methods may overlook adversarial providers who manipulate model weights to cheat the fingerprint process. We introduce a novel threat termed fingerprint spoofing, where a malicious provider stealthily serves a weaker model that has been parameter-efficiently fine-tuned to mimic a stronger model, thereby evading user-side fingerprinting. We first formally prove that user-side resource constraints (i.e., finite query budgets and weak fingerprinting classifiers) make current fingerprinting vulnerable to fingerprint spoofing. Guided by this theoretical analysis, we propose GhostPrint, a cost-effective attack framework leveraging surrogate modeling, reward-ranked fine-tuning, and knowledge distillation. Extensive evaluations in both static and continual fingerprinting settings demonstrate that GhostPrint allows weak models to consistently bypass representative fingerprint methods while maintaining utility at a low fine-tuning cost, exposing a critical vulnerability in current LLM fingerprinting pipelines.

[NLP-77] Long-Context Modeling via GSS-Transformer Hybrid Architecture with Learnable Mixing

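【Quick Read】: This paper tackles long-range dependency modeling in natural language processing. Transformer architectures scale quadratically (O(N²)) with sequence length, while State Space Models (SSMs) scale linearly (O(N)) but suffer from a selective recall bottleneck and struggle to retrieve precise information from compressed states. This creates a fundamental trade-off between efficiency and perplexity. The authors propose the Parallel Hybrid Architecture (PHA), which runs Gated State Spaces (GSS), Grouped Query Attention (GQA), and Feed-Forward Networks (FFNs) as independent parallel branches fused by a learnable mixing mechanism. Rather than forcing SSMs to approximate attention or serializing the two paradigms, PHA lets each branch specialize: GSS captures global context, attention performs selective retrieval, and the FFN provides complementary processing. On WikiText-103, PHA reaches 16.51 perplexity at 125M parameters, ahead of Hedgehog (16.70) and H3-125M (23.70); at 180M parameters it reaches 16.42, comparable to the pure attention baseline, with 24% higher throughput and up to 40% lower memory usage at long contexts. On OpenWebText, the 125M model reaches 19.72 perplexity, ahead of standard Transformers (20.60) and GSS hybrid baselines (19.80). The results indicate that splitting sequence modeling into parallel specialists can keep Transformer-level perplexity while substantially improving efficiency for long-context language modeling.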

Link: https://arxiv.org/abs/2606.16093
Authors: Kuzey Torlak,Hüseyin Arda Arslan,Anıl Dervişoğlu,Beyza Nur Deniz,Onur Boyar
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 9 tables, 4 figures

Abstract:Modeling long-range dependencies remains a central challenge in natural language processing. Transformer architectures achieve strong performance via self-attention but scale quadratically (O(N^2)) with sequence length, while State Space Models (SSMs) scale linearly (O(N)) but suffer from a selective recall bottleneck, struggling to retrieve precise information from compressed states. This creates a fundamental tradeoff between efficiency and perplexity. To tackle these challenges, we propose the Parallel Hybrid Architecture (PHA), which runs Gated State Spaces (GSS), Grouped Query Attention (GQA), and Feed-Forward Networks (FFNs) as independent parallel branches fused by a learnable mixing mechanism. Instead of forcing SSMs to approximate attention or serializing the two paradigms, PHA allows each branch to specialize: GSS captures global context, while attention performs selective retrieval, with FFN providing complementary processing. On WikiText-103, PHA achieves 16.51 PPL at 125M parameters, outperforming Hedgehog (16.70) and H3-125M (23.70). Scaling to 180M parameters yields 16.42 PPL, which gives comparable results with the pure attention baseline while delivering 24% higher throughput and up to 40% lower memory usage at long contexts. On OpenWebText, our 125M model achieves 19.72 PPL, outperforming standard Transformers (20.60) and GSS hybrid baselines (19.80). These results demonstrate that separating sequence modeling paradigms into parallel specialists enables Transformer-level perplexity with substantially improved efficiency for long-context language modeling.
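
The abstract does not specify how the learnable mixing mechanism is parameterized. As a rough illustration of fusing parallel branches, the sketch below assumes one learnable scalar logit per branch, normalized with a softmax and applied as a convex combination of the branch outputs. The branch outputs are plain Python lists standing in for the GSS, GQA, and FFN activations at one position; all names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_branches(branch_outputs: List[List[float]], mix_logits: List[float]) -> List[float]:
    """Fuse same-sized branch outputs with softmax-normalized mixing weights.

    Assumption: one scalar weight per branch; the actual PHA mixer may differ
    (for example per-channel or input-dependent weights).
    """
    weights = softmax(mix_logits)
    dim = len(branch_outputs[0])
    return [
        sum(w * out[d] for w, out in zip(weights, branch_outputs))
        for d in range(dim)
    ]


# Stand-ins for the GSS, attention, and FFN branch outputs at one token position.
gss_out, attn_out, ffn_out = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]

# Equal logits give equal weights, so the result is the element-wise mean.
fused = mix_branches([gss_out, attn_out, ffn_out], mix_logits=[0.0, 0.0, 0.0])
```

In a trained model the logits would be parameters updated by gradient descent, so the network can learn how much each branch contributes.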

[NLP-78] Rhythm of the Deep: A Computational-Linguistic Test of Duality of Patterning in Sperm Whale Codas

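【Quick Read】: This paper tests whether duality of patterning, a key design feature of human language, appears in sperm whale codas, that is, whether codas have a layered combinatorial structure. The core difficulty is identifying verifiable hierarchical structure in continuous audio without a known semantics or symbol system, while avoiding mistaking acoustic similarity for symbolic structure. The authors build a controlled, reproducible computational-linguistic framework using a consensus of frozen audio encoders, held-out structural tests, per-statistic nulls, and acoustic-null recoverability gates, applied to 1,483 codas from the Dominica Sperm Whale Project. The evidence supports a narrow two-tier architecture. At the lower tier, clicks compose into codas through co-occurrence and inter-click rhythm rather than a stable ordered rule. At the upper tier, coda tokens show bout-level sequential dependence, with an NSB second-order transfer-entropy lift of 0.132 bits (p = 0.002). Under tempo scaling, click identity is strongly rate-bound while coda identity stays considerably more stable, giving a measurable abstraction gradient from click to coda. Rhythm-only baselines recover much of the lower-tier structure but fail to reproduce the upper-tier sequential-dependence signal. The authors do not claim language, semantics, perception, or human-like phonemes. They report representation-level evidence for a duality-of-patterning-like architecture whose lower tier is rhythmic rather than segmental, and provide a portable, null-controlled framework for testing combinatorial structure in induced acoustic token systems.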

Link: https://arxiv.org/abs/2606.16084
Authors: Mudit Sinha,Sanika Chavan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 22 pages, 2 figures, 4 tables. Preprint

Abstract:Human language has often been described as combining structure at two levels: lower-level units combine into larger units, which then combine into larger sequences. We test for this design feature, duality of patterning, in sperm whale codas using 1,483 codas from the Dominica Sperm Whale Project. Because acoustic similarity can imitate symbolic structure, we treat the problem as computational-linguistic structure discovery from continuous audio rather than as a direct claim about language or meaning. We use a consensus of frozen audio encoders, held-out structural tests, per-statistic nulls, and acoustic-null recoverability gates. The evidence supports a narrow two-tier architecture. At the lower tier, clicks compose into codas not by a stable ordered rule, but by which clicks are present together with their inter-click rhythm. At the upper tier, coda tokens show bout-level sequential dependence, with an NSB second-order transfer-entropy lift of 0.132 bits (p = 0.002). Under tempo scaling, encoder-derived click identity is strongly rate-bound, while coda identity remains substantially more stable, yielding a measurable abstraction gradient across the click-to-coda step. Rhythm-only baselines recover substantial lower-tier structure but fail to reproduce the upper-tier sequential-dependence signal. We do not claim language, semantics, perception, or human-like phonemes. Instead, we report representation-level evidence for a duality-of-patterning-like architecture whose lower tier is rhythmic rather than segmental, and provide a portable null-controlled framework for testing combinatorial structure in induced acoustic token systems.

[NLP-79] PVminerLLM2: Improving Structured Extraction of Patient Voice via Preference Optimization

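【Quick Read】: Patient-generated text carries critical information about patients' lived experiences, social context, and care engagement, but it is largely unstructured, which limits its use in patient-centered outcomes research. Prior work introduced the PV-Miner benchmark and the PVMinerLLM models for structured extraction, but supervised fine-tuning (SFT) alone struggles with rare, fine-grained, and unevenly distributed errors, especially in token-critical structured outputs. This paper presents PVminerLLM2, an improved set of LLMs that applies preference optimization (PO) to token-critical errors that SFT cannot reach. The key components are: (i) a preference objective with a token-level gated stabilization term that prevents degradation of absolute token likelihood during preference optimization; and (ii) confusion-aware preference pair construction to better capture low-separation distinctions. Token-importance weighting and inverse-frequency reweighting are added to address token imbalance and class skew. Across multiple model sizes, PVminerLLM2 consistently outperforms strong baselines, with gains of up to 4.43% (Code), 3.50% (Sub-code), and 1.55% (Span), and it also outperforms baseline LLMs trained with existing preference optimization methods.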

Link: https://arxiv.org/abs/2606.16074
Authors: Samah Fodeh,Linhai Ma,Ganesh Puthiaraju,Srivani Talakokkul,Afshan Khan,Elyas Irankhah,Sreeraj Ramachandran,Ashley Hagaman,Sarah Lowe,Aimee Roundtree
Affiliations: Yale School of Medicine; Yale School of Public Health; Texas State University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Motivation: Patient-generated text contains critical information on patients’ lived experiences, social context, and care engagement, but remains largely unstructured, limiting its use in patient-centered outcomes research. Prior work introduced the PV-Miner benchmark and PVMinerLLM models for structured extraction. However, supervised fine-tuning (SFT) alone struggles with rare, fine-grained, and unevenly distributed errors, particularly in token-critical structured outputs. Results: We present PVminerLLM2, an improved set of LLMs for structured patient voice extraction that applies preference optimization to address token-critical errors beyond the reach of supervised fine-tuning. Our method introduces (i) a preference objective with a token-level gated stabilization term that prevents degradation of absolute token likelihood under preference optimization, and (ii) confusion-aware preference pair construction to better capture low-separation distinctions. We further incorporate token-importance weighting and inverse-frequency reweighting to address token imbalance and class skew. Across multiple model sizes, PVminerLLM2 consistently outperforms strong baselines, achieving gains of up to 4.43% (Code), 3.50% (Sub-code), and 1.55% (Span), and outperforms baseline LLMs trained with existing preference optimization methods. Availability and Implementation: The supplementary material, code, evaluation scripts, and trained models for PVminerLLM2 are publicly available at: this https URL

[NLP-80] From Argument Components to Graphs: A Multi-Agent Debate with Confidence Gating for Argument Relations

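【Quick Read】: This paper targets weaknesses of large language models (LLMs) in Argument Mining (AM), specifically in Argument Relation Identification and Classification (ARIC): training-free models often miss subtle details when two parts of a text must be analyzed together, and self-correction mechanisms tend to reinforce initial hallucinations in reasoning. Overcoming these limits usually requires expensive, domain-specific supervised fine-tuning. The authors extend a multi-agent debate framework (a Proponent-Opponent-Judge architecture) to ARIC by reformulating the task as a debate over component pairs, and they add a confidence gating mechanism: debate is triggered only for uncertain cases, and the initial prediction is accepted when confidence is high. On the UKP Argument Annotated Essays v2 corpus, this selective debate achieves the highest Macro F1 among all training-free methods, whereas debating every sample drops performance below one of the baselines. All generative approaches also beat fine-tuned RoBERTa models on Macro F1, which suggests that the under-representation of the Attack class hurt supervised fine-tuning more than inference-only models. The framework also produces human-readable debate transcripts, offering interpretability that single-agent and supervised classifiers lack.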

Link: https://arxiv.org/abs/2606.16047
Authors: Jakub Bąba,Jarosław A. Chudziak
Affiliations: Warsaw University of Technology
Categories: Computation and Language (cs.CL)
Comments: Accepted for publication in the proceedings of KES 2026

Abstract:Large Language Models (LLMs) are increasingly assessed and utilized in the field of Argument Mining (AM), thanks to their strong general reasoning capabilities. However, standard training-free models often miss sophisticated details, specifically in contexts where two parts of the text have to be analyzed together. Furthermore, self-correction mechanisms tend to reinforce initial hallucinations in reasoning. Overcoming these limitations typically requires expensive, domain-specific supervised fine-tuning. Recent work has shown that a multi-agent paradigm can address such weaknesses for the component classification task through dialectical refinement with a Proponent-Opponent-Judge architecture, setting a promising direction for training-free approaches in the field. In this paper, we extend and evaluate this framework on the Argument Relation Identification and Classification (ARIC) task, reformulating it as a debate over component pairs. Besides that, we introduce a confidence gating mechanism that enables debating only on the uncertain cases and accepting the initial prediction when confidence is high. On the UKP Argument Annotated Essays v2 corpus, we demonstrate that the selective debate achieves the highest Macro F1 among all training-free methods, while debate over all samples degrades performance below that of one of the baselines. All generative approaches also outperform fine-tuned RoBERTa models on Macro F1, suggesting that the under-representation of the Attack class was more damaging to supervised fine-tuning than to inference-only models. Additionally, our framework produces human-readable debate transcripts, offering interpretability absent from both single-agent and supervised classifiers.
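
The confidence gating described above can be sketched as a small routing function. This is only an illustration under stated assumptions: the threshold value, the label names, and the `classify` / `debate` callables are hypothetical stand-ins for the paper's single-agent predictor and its Proponent-Opponent-Judge debate.

```python
from typing import Callable, Tuple


def gated_predict(
    pair: str,
    classify: Callable[[str], Tuple[str, float]],
    debate: Callable[[str], str],
    threshold: float = 0.8,  # hypothetical confidence cut-off
) -> Tuple[str, bool]:
    """Return (label, debated). Debate runs only when confidence is below the threshold."""
    label, confidence = classify(pair)
    if confidence >= threshold:
        # High confidence: accept the initial prediction without a debate.
        return label, False
    # Low confidence: defer to the (more expensive) multi-agent debate.
    return debate(pair), True


# Stub components so the sketch runs end to end.
def stub_classify(pair: str) -> Tuple[str, float]:
    return ("Support", 0.95) if pair == "easy pair" else ("Support", 0.40)


def stub_debate(pair: str) -> str:
    return "Attack"


easy = gated_predict("easy pair", stub_classify, stub_debate)
hard = gated_predict("hard pair", stub_classify, stub_debate)
```

Routing only uncertain pairs to the debate limits cost and, per the reported results, avoids the degradation seen when every sample is debated.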

[NLP-81] In-Domain Supervised Pathology Report Classification: A Reproducible Pipeline from Data Curation to Production-Matched Evaluation

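【Quick Read】: This paper addresses the out-of-distribution performance drop that affects supervised biomedical NLP models, observed when models trained on pathology reports are moved from one cancer registry to another. The solution is a reproducible in-domain supervised pipeline for training a classifier from routinely collected cancer registry data. It builds an in-domain training set and a production-matched holdout using facility-stratified sampling, handles reports linked to registry cases separately, and includes a blinded manual audit to estimate positive-case prevalence and label noise. Operating points are chosen to keep the false-negative rate (FNR) very low while keeping reviewer workload manageable. On a 418k-report holdout set, the Kentucky model achieved FNR 0.003 and false-positive rate (FPR) 0.097, improving over the Seattle-trained MOSSAIC OncoID baseline (FNR 0.010, FPR 0.183) and raising F1 from 0.860 to 0.922. In a blinded manual review of 600 reports, estimated positive prevalence declined from 0.500 to 0.398, indicating substantial label noise with errors concentrated in rare primary sites.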

Link: https://arxiv.org/abs/2606.16026
Authors: Isaac Hands,Bin Huang,Adam Spannaus,John Gounley,Heidi Hanson,Eric Durbin,Sally R. Ellingson
Affiliations: University of Kentucky; UK Markey Cancer Center; Kentucky Cancer Registry
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We introduce an in-domain supervised pipeline designed to counter the out-of-distribution performance drop that hampers supervised biomedical NLP models, a problem observed when models trained on pathology reports are moved across cancer registries. Our contribution is a reproducible recipe for training a supervised classifier from routinely collected cancer registry data. It describes how to build the in-domain training set and a production-matched holdout, and to choose operating points that keep the false-negative rate (FNR) very low while keeping reviewer workload manageable. The pipeline standardizes data curation with facility-stratified sampling and separate handling of reports linked to registry cases, and includes a blinded manual audit to estimate positive-case prevalence and label noise. On a 418k-report holdout set, the Kentucky model achieved FNR 0.003 and false-positive rate (FPR) 0.097, improving over the Seattle-trained MOSSAIC OncoID baseline (FNR 0.010, FPR 0.183) and raising F1 from 0.860 to 0.922. In a blinded manual review of 600 reports, estimated positive prevalence declined from 0.500 to 0.398, indicating substantial label noise with errors concentrated in rare primary sites.
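
One way to read "choose operating points that keep the FNR very low while keeping reviewer workload manageable" is as a threshold search: among score thresholds whose FNR stays under a target, take the highest one, since that flags the fewest reports. The sketch below follows that reading; it is an assumption, not the paper's procedure, and the scores and labels are toy data.

```python
from typing import List, Optional, Tuple


def fnr_fpr(scores: List[float], labels: List[int], threshold: float) -> Tuple[float, float]:
    """False-negative and false-positive rates when predicting positive for score >= threshold."""
    tp = fn = fp = tn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if label and predicted_positive:
            tp += 1
        elif label:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return fnr, fpr


def pick_threshold(
    scores: List[float], labels: List[int], max_fnr: float
) -> Optional[Tuple[float, float, float]]:
    """Highest threshold with FNR <= max_fnr, returned as (threshold, fnr, fpr)."""
    best = None
    for threshold in sorted(set(scores)):
        fnr, fpr = fnr_fpr(scores, labels, threshold)
        if fnr <= max_fnr:
            # FNR never decreases as the threshold rises, so the last
            # feasible threshold is the highest feasible one.
            best = (threshold, fnr, fpr)
    return best


# Toy validation data: classifier scores and gold labels (1 = reportable).
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1, 1]
operating_point = pick_threshold(scores, labels, max_fnr=0.0)
```

In practice the threshold would be chosen on a production-matched validation set and then checked on the holdout, since a threshold tuned on one registry's score distribution may not transfer to another.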

[NLP-82] Scaling Human and G2P Supervision for Robust Phonetic Transcription INTERSPEECH2026

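【Quick Read】: Expert phonetic annotation is costly, especially for non-standard dialects and atypical speech such as post-stroke speech. A common alternative is to use Grapheme-to-Phoneme (G2P) models to auto-generate phonetic labels from text transcripts at scale. Using a curated 80-hour benchmark spanning native, non-native, and post-stroke English speech, the authors identify a supervision quality threshold: G2P supervision helps only when fewer than 20-30 hours of human annotation are available. Beyond that threshold it brings no significant benefit and can reduce cross-dialect robustness. What does help past the threshold is ASR pretraining, which yields a 2.3x reduction in weighted phone feature error rate over prior systems, with strong gains on non-native and aphasic speech. The results suggest that quantity-driven G2P scaling may yield diminishing returns for robust generalization.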

Link: https://arxiv.org/abs/2606.16019
Authors: Alexander Metzger,Aruna Srivastava,Ruslan Mukhamedvaleev
Affiliations: Koel Labs LLC
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Accepted to Interspeech 2026

Abstract:Expert phonetic annotation is costly, especially for non-standard dialects and atypical speech. A common alternative is using Grapheme-to-Phoneme (G2P) models to auto-generate phonetic labels from text transcripts at scale. We study how automatic phonetic transcription performance scales with human and G2P supervision in English. Using a curated 80-hour benchmark spanning native, non-native and post-stroke speech, we identify a supervision quality threshold: G2P supervision helps only when fewer than 20-30 hours of human annotation are available. Beyond this threshold, it provides no significant benefit and can reduce cross-dialect robustness. What is effective after this threshold is ASR pretraining which we use to achieve a 2.3x reduction in weighted phone feature error rate over prior systems, with strong gains on non-native and aphasic speech. These results suggest that quantity-driven G2P scaling may yield diminishing returns for robust generalization.

[NLP-83] Who Flips? Self- and Cross-Model Counterarguments Reveal Answer Instability in LLMs ICML2026

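【Quick Read】: Standard accuracy benchmarks test how closely large language models (LLMs) approach correct answers, but not whether a model sticks with a correct answer when it is challenged by a plausible counter-argument. This paper introduces a controlled protocol for evaluating answer stability: after a model answers a multiple-choice question correctly, it is challenged with a coherent argument for an incorrect option, and the protocol measures whether the model flips. The setup isolates argumentative content from overt social pressure and varies argument length, self-attribution, and cross-model source. Across seven frontier models and 57 MMLU subjects, flip rates range from 17.5% to 97.3%, revealing stability differences that accuracy metrics alone do not capture. Self-attribution consistently increases flip rates (mean +7.1pp, up to +18.7pp). Pooling wrong-answer arguments across models and selecting the most effective one per question yields stronger adversarial challenges than any single source model. The resulting MaxFlip challenge set amplifies flips by up to +23.6pp over standard self-generated challenges. The authors release the protocol, challenge records, and MaxFlip to support stability evaluation alongside standard accuracy benchmarks.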

Link: https://arxiv.org/abs/2606.16011
Authors: Nafiseh Nikeghbal,Amir Hossein Kargaran,Shaghayegh Kolli,Jana Diesner
Affiliations: Technical University of Munich; Munich Center for Machine Learning; LMU Munich
Categories: Computation and Language (cs.CL)
Comments: Accepted to the non-archival workshops AI4Good and AIWILD at ICML 2026

Abstract:Standard accuracy benchmarks are designed to test how closely large language models (LLMs) approach correct answers, but are not suitable for testing whether LLMs stick with a correct answer when that answer is challenged by a plausible counter-argument. We introduce a controlled protocol for evaluating answer stability: after a model answers a multiple-choice question correctly, we challenge the model’s answer with a coherent argument for an incorrect option and measure whether the model flips. The setup a) isolates argumentative content from overt social pressure and b) varies argument length, self-attribution, and cross-model source. Across seven frontier models and 57 MMLU subjects, flip rates range from 17.5% to 97.3%, revealing large differences in stability that are not captured by accuracy metrics alone. We find that self-attribution consistently increases flip rates (mean +7.1pp, up to +18.7pp). Also, pooling wrong-answer arguments across models and selecting the most effective one per question yields stronger adversarial challenges than relying on any single source model. We further construct MaxFlip, a curated challenge set that amplifies flips by up to +23.6pp over standard self-generated challenges. We release the protocol, challenge records, and MaxFlip to support stability evaluation alongside standard accuracy benchmarks. Materials are available at this https URL and this https URL.
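
A flip rate of the kind reported above can be computed as follows. This is a sketch under an assumed definition: only questions the model first answered correctly are eligible, and any change of answer after the challenge counts as a flip (the paper may instead count only switches to the argued option). The record fields are hypothetical.

```python
from typing import Dict, List


def flip_rate(records: List[Dict[str, str]]) -> float:
    """Share of initially correct answers that change after the counter-argument."""
    eligible = [r for r in records if r["initial"] == r["gold"]]
    if not eligible:
        return 0.0
    flips = sum(1 for r in eligible if r["after_challenge"] != r["initial"])
    return flips / len(eligible)


# Toy challenge records for four multiple-choice questions.
records = [
    {"gold": "A", "initial": "A", "after_challenge": "C"},  # correct, then flipped
    {"gold": "B", "initial": "B", "after_challenge": "B"},  # correct, held firm
    {"gold": "D", "initial": "D", "after_challenge": "A"},  # correct, then flipped
    {"gold": "C", "initial": "A", "after_challenge": "A"},  # initially wrong: excluded
]
rate = flip_rate(records)
```

Conditioning on initially correct answers keeps the metric separate from accuracy: two models with the same accuracy can have very different flip rates.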

[NLP-84] GRACE-DS: a Guarded Reward-guided Agent Correction Environment in Data Science

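【Quick Read】: This paper addresses the lack of reliable pre-deployment evaluation for LLM-powered AutoML agents. Existing methods struggle to measure how an agent performs, complies with protocol, and recovers from errors when it runs an end-to-end machine learning workflow in a realistic production setting. The authors propose GRACE-DS (Guarded Reward-guided Agent Correction Environment in Data Science), a set of evaluation metrics in an isolated environment that can be applied to tabular ML tasks specific to a particular organization. It exposes agents to realistic workflow stages, from planning and data inspection through feature engineering, model development, validation, and code repair to final submission. Hidden executable validators measure not only final predictive performance but also leakage avoidance, reproducibility, protocol validity, correction behavior, and reward alignment. The strongest structured regime, flexible iterative interaction, achieves higher end-to-end normalized hidden-test quality than single-shot generation, unstructured interaction, and restart-based baselines, and also improves protocol-valid completion. Validated across more than 7,000 episodes, GRACE-DS is presented as a robust platform for assessing whether LLM-based AutoML agents can execute ML workflows under production-like conditions and organization-specific requirements.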

Link: https://arxiv.org/abs/2606.16000
Authors: Aleksandr Tsymbalov,Danis Zaripov,Artem Epifanov,Anastasya Palienko
Affiliations: ITMO University; HSE University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:We introduce GRACE-DS, a Guarded Reward-guided Agent Correction Environment in Data Science for pre-deployment evaluation of LLM-powered AutoML agents. GRACE-DS is a set of evaluation metrics in an isolated environment that can be applied to tabular ML tasks specific to a particular organization. It exposes agents to realistic workflow stages, from planning and data inspection through feature engineering, model development, validation, and code repair to final submission, while hidden executable validators measure not only final predictive performance but also leakage avoidance, reproducibility, protocol validity, correction behavior, and reward alignment. The strongest structured regime, flexible iterative interaction (our approach), achieves higher end-to-end normalized hidden-test quality than single-shot generation, unstructured interaction, and restart-based baselines, while also improving protocol-valid completion. Validated across more than 7,000 episodes, these results establish GRACE-DS as a robust platform for assessing the capacity of LLM-based AutoML agents to execute machine learning workflows under production-like conditions and in accordance with organization-specific requirements.

[NLP-85] ROMPAR: Morphological Completion and Demographic Unlearning for Romanian-Accented Speech Recognition

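【Quick Read】: This paper addresses accuracy losses in automated transcription of parliamentary speech caused by demographic bias, dialectal variation, and technical artifacts such as utterance truncation during segmentation. It introduces ROMPAR, a 17.80-hour corpus of Romanian and Moldavian parliamentary speech with double-annotated ground truth and explicit labels for reconstructed word fragments. The core method is a multi-task adversarial training framework that enforces demographic invariance across age, gender, and dialect. To counter the inherent instability of adversarial objectives in generative architectures, the authors apply exponential decay to the adversarial coefficients. They also use an LLM-guided decoding strategy with position-dependent weighting to complete truncated terminal words morphologically. The framework significantly reduces word error rate (WER) and reaches an F1-score of 96.6% on morphological reconstruction.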

Link: https://arxiv.org/abs/2606.15984
Authors: Andrei-Marius Avram,Aureliu-Valentin Antonie,Ştefan-Bogdan Badea,Andrei Florea,Robert-Nicolae Zaharoiu,Dumitru-Clementin Cercel
Affiliations: National University of Science and Technology POLITEHNICA Bucharest
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Automated transcription of parliamentary proceedings faces significant hurdles due to demographic bias, dialectal variation, and technical artifacts such as utterance truncation during segmentation. This paper introduces the ROManian PARliamentary Speech Corpus (ROMPAR) dataset, a 17.80-hour corpus of Romanian and Moldavian parliamentary speech, featuring double-annotated ground truth and explicit labels for reconstructed word fragments. To build a robust ASR system, we propose a multi-task adversarial training framework that enforces demographic invariance across age, gender, and dialect. We address the inherent instability of adversarial objectives in generative architectures by introducing an exponential decay mechanism for the adversarial coefficients. Furthermore, we implement an LLM-guided decoding strategy with position-dependent weighting to facilitate morphological completion of truncated terminal words. Our results demonstrate that the proposed framework significantly reduces WER and achieves an F1-score of 96.6% in morphological reconstruction.
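
The exponential decay of the adversarial coefficients can be pictured with a simple schedule. The abstract gives no formula, so the sketch assumes the common form lambda_t = lambda_0 * exp(-k * t); the initial value `lambda0` and decay rate `decay_rate` below are hypothetical, not the paper's settings.

```python
import math


def adversarial_coeff(step: int, lambda0: float = 1.0, decay_rate: float = 1e-3) -> float:
    """Assumed schedule: lambda_t = lambda0 * exp(-decay_rate * step)."""
    return lambda0 * math.exp(-decay_rate * step)


# The coefficient starts at lambda0 and shrinks smoothly as training proceeds,
# so the adversarial (demographic) objectives weigh less in later steps.
schedule = [adversarial_coeff(step) for step in (0, 100, 1000, 5000)]
```

With one such coefficient per adversarial head (age, gender, dialect), the invariance pressure is strongest early in training and fades later, which is one plausible way to stabilize the main transcription objective.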

[NLP-86] Do Safety Monitors Stay Reliable After an Update? Benchmarking and Predicting Activation-Monitor Staleness

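【Quick Read】: This paper studies whether activation monitors in generative AI deployment safety stacks stay reliable after the model is updated. In practice, monitors are trained on a base model's internal representations and then frozen, while the deployed model is quantized, fine-tuned, adapted with LoRA, or served with merged adapters. Across multiple safety-relevant monitors, model depths, update families, and open-weight models, the study finds a sharp split: quantization-style updates largely preserve frozen probe performance, while fine-tuning-style updates frequently make probes stale. Fragility depends strongly on the monitor: privacy/PII probes are most affected and refusal-compliance probes are comparatively stable, showing that retraining a behavior need not stale its corresponding monitor. QLoRA is especially damaging even though NF4 quantization alone is relatively benign, suggesting that quantization becomes riskier when combined with adaptation. Degradation is also predictable from pre-deployment features, so revalidation budgets can be triaged toward the monitors most likely to fail. The recommendation is that fine-tuning should trigger activation-monitor revalidation by default, with prediction used to prioritize which monitors to check first.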

Link: https://arxiv.org/abs/2606.15980
Authors: Evan Duan
Affiliations: University of Michigan
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Activation monitors, lightweight probes trained on a language model’s internal representations, are an increasingly common layer in deployment safety stacks. Deployed models, however, are rarely static: they are quantized, fine-tuned, adapted with LoRA, or served with merged adapters while the monitor remains frozen. We present the first systematic test of whether this implicit contract holds: whether activation monitors trained on a base model remain reliable after these routine model updates. Across multiple safety-relevant monitors, model depths, update families, and open-weight models, we find a sharp split: quantization-style updates largely preserve frozen probe performance, while fine-tuning-style updates frequently make probes stale. Fragility is highly monitor-dependent, with privacy/PII probes most affected and refusal-compliance probes comparatively stable, showing that retraining a behavior need not stale its corresponding monitor. QLoRA is especially damaging despite NF4 quantization alone being relatively benign, suggesting that quantization becomes riskier when combined with adaptation. We further show that degradation is predictable from pre-deployment features, enabling revalidation budgets to be triaged toward the monitors most likely to fail. These results suggest that fine-tuning should trigger activation-monitor revalidation by default, while prediction can help prioritize which monitors to check first.

[NLP-87] A Large-Scale Multi-Dimensional Empirical Study of LLMs for Conversation Summarization

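【Quick Read】: This paper addresses gaps in how large language models (LLMs) are evaluated on conversation summarization: too few scenarios, limited input lengths and sample sizes, and benchmarks that omit frontier reasoning systems and efficient small models or lack fine-grained, multi-dimensional assessment. The authors propose OmniCSEval, a unified benchmark of 1,800 diverse conversations across six real-world scenarios, with context lengths from 128 to 32k tokens. For fine-grained evaluation, it uses a bidirectional fact-checking framework: key fact matching assesses completeness and conciseness, and summary fact verification assesses faithfulness. For reliability, key facts are extracted through a human-LLM collaborative pipeline, and summary facts are decomposed with a multi-LLM consensus verifier. With this framework the authors evaluate 28 LLMs in four categories grouped by reasoning capability and model scale. The study reports on the cross-scenario challenges current LLMs still face, the effects of reasoning and scale, and the efficiency and adaptability of reasoning models, and it offers guidance for system selection in real-world deployments.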

Link: https://arxiv.org/abs/2606.15974
Authors: Weixiao Zhou,Gengyao Li,Xianfu Cheng,Junnan Zhu,Feifei Zhai,Zhoujun Li
Affiliations: Beihang University; CASIA; Fanyu AI Laboratory
Categories: Computation and Language (cs.CL)
Comments: 21 pages, 18 figures

Abstract:Despite the significant advancement of LLMs in conversation summarization, their evaluation remains limited by insufficient scenarios, input lengths, and sample sizes. Furthermore, existing benchmarks often omit frontier reasoning systems and efficient small models, or lack fine-grained, multi-dimensional assessments. To bridge these gaps, we propose OmniCSEval, a unified benchmark comprising 1,800 diverse conversations across six real-world scenarios, featuring context lengths ranging from 128 to 32k tokens. For fine-grained evaluation, we employ a bidirectional fact-checking framework that integrates key fact matching to assess completeness and conciseness, alongside summary fact verification to evaluate faithfulness. To ensure reliable assessment, we establish a human-LLM collaborative pipeline for key fact extraction and a multi-LLM consensus verifier for summary fact decomposition. Leveraging this framework, we evaluate 28 LLMs across four distinct categories grouped by reasoning capability and model scale. Our extensive empirical study reveals critical insights regarding the cross-scenario challenges current LLMs continue to face, the impacts of reasoning and scale, and the efficiency and adaptability of reasoning models. We also provide guidance for system selection in real-world deployments.

[NLP-88] Formalize Once Edit the Rest: Efficient Lean-Based Answer Selection for Math Reasoning

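Quick read: This paper targets the cost of verifying the natural-language answers that LLMs produce for math reasoning, specifically answer selection among K sampled candidates in test-time scaling, where each candidate must first be autoformalized into Lean. Existing methods formalize every candidate independently, which is computationally expensive. The key idea is BASE (base-and-edit): fully formalize only one base candidate per problem, then train a rewriter model, LEANSCRIBE, to localize the answer in the base formalization and generate a reusable edit function that derives the formal statements of the remaining K-1 candidates by editing the answer expression in place. This yields a Pareto improvement in selection accuracy and formalization cost on all 12 (dataset, solver) configurations across four benchmarks and three solvers. At K=8 it cuts autoformalizer calls by about 5x, and the saving is expected to grow with K.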

Link: https://arxiv.org/abs/2606.15972
Authors: Ji Feng,Zhouxing Shi
Affiliations: University of California, Riverside (加州大学河滨分校)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 15 pages, 1 figure. Code available at this https URL

Abstract:With large language models (LLMs) increasingly applied to mathematical reasoning, formal proof assistants such as Lean can be leveraged to verify reasoning outputs with machine-checkable rigor, enabling use cases such as answer selection in test-time scaling with K sampled candidate answers. However, employing Lean requires that LLM outputs, originally in natural language, first be formalized. Existing Lean-based answer-selection work uses an autoformalization model to generate a formal statement in Lean for each candidate answer independently, incurring a significant computational cost. We propose BASE, a base-and-edit pipeline that formalizes a single base candidate per problem and derives the remaining K-1 statements by editing the answer expression in place. To facilitate this, we train a rewriter model LEANSCRIBE to localize the answer in the base formalization and generate a reusable edit function for the other K-1 candidates. BASE simultaneously improves selection accuracy and reduces formalization cost - a Pareto improvement that holds on all 12 (dataset, solver) configurations across four benchmarks and three solvers, cutting autoformalizer calls by about 5x at K=8, with the reduction expected to become larger as K grows. Code is available at this https URL.
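
The base-and-edit idea can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: `locate_answer` is a hypothetical stand-in for the trained LEANSCRIBE rewriter (it simply finds the last occurrence of the answer string), and the Lean statement is a toy example, not one from the paper.

```python
def locate_answer(statement: str, answer: str) -> tuple[int, int]:
    """Hypothetical stand-in for answer localization: span of the last match."""
    start = statement.rindex(answer)
    return start, start + len(answer)


def make_edit_function(statement: str, span: tuple[int, int]):
    """Build a reusable edit that swaps the answer span for another candidate."""
    start, end = span
    prefix, suffix = statement[:start], statement[end:]

    def edit(candidate_answer: str) -> str:
        # Everything outside the answer span is reused unchanged.
        return prefix + candidate_answer + suffix

    return edit


# One full formalization (assumed output of an autoformalizer for answer "5").
BASE_STATEMENT = "theorem candidate : (2 + 3 : Nat) = 5 := by decide"

edit = make_edit_function(BASE_STATEMENT, locate_answer(BASE_STATEMENT, "5"))
other_statements = [edit(answer) for answer in ["6", "7"]]
```

Under these assumptions, only the base statement needs an autoformalizer call; the other candidates are derived by string edits and can then be checked by Lean.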

[NLP-89] SAG: SQL-Retrieval Augmented Generation with Query-Time Dynamic Hyperedges

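Quick read: This paper addresses the inherent limits of existing Retrieval-Augmented Generation (RAG) methods in handling structured constraints and multi-hop reasoning: dense similarity retrieval makes poor use of structured knowledge, while adding knowledge graphs brings semantic fragmentation, high maintenance cost, and difficult incremental updates. The key to the solution is SAG (SQL-Retrieval Augmented Generation), which converts each text chunk into one semantically complete event plus a set of indexing entities, and uses standard SQL join queries to link events that share entities into local hyperedges at query time, producing a dynamically instantiated local index structure. The design removes the need to pre-build and continuously maintain a global static graph, and because it relies on standard database infrastructure it naturally supports incremental writes, concurrent processing, and continuous scaling. On the three multi-hop benchmarks HotpotQA, 2WikiMultiHop, and MuSiQue, SAG achieves the best result on 8 of 9 Recall@K metrics, reaching 80.0% Recall@5 on MuSiQue, the most challenging of the three. It is also deployed in production over hundreds of millions of data items, with online retrieval latency kept within seconds.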

Link: https://arxiv.org/abs/2606.15971
Authors: Yuchao Wu,Junqin Li,XingCheng Liang,Yongjie Chen,Yinghao Liang,Linyuan Mo,Guanxian Li
Affiliations: Zleap AI
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Retrieval-Augmented Generation (RAG) offers an effective approach for large language models to access external knowledge. However, existing methods rely on dense similarity retrieval and face inherent limitations in handling structured constraints and multi-hop reasoning. Incorporating knowledge graphs partially alleviates these issues, but at the cost of semantic fragmentation, high maintenance overhead, and difficult incremental updates. This paper introduces SAG (SQL-Retrieval Augmented Generation), a structured architecture for retrieval and agent systems. Instead of pre-building a global static graph, SAG converts each chunk into one semantically complete event and a set of indexing entities, then uses SQL join queries to dynamically link events that share entities into local hyperedges, constructing, at query time, a dynamically instantiated local index structure. This design avoids the need for global graph rebuilding and ongoing maintenance; the system naturally supports incremental writes, concurrent processing, and continuous scaling through its reliance on standard database infrastructure. Across HotpotQA, 2WikiMultiHop, and MuSiQue, three standard multi-hop benchmarks, SAG achieves the best results on 8 out of 9 Recall@K metrics, reaching 80.0% Recall@5 on MuSiQue, the most challenging of the three for multi-hop reasoning. SAG has also been deployed at a production scale of hundreds of millions of data items, with online retrieval latency kept within seconds. Project site and code are available at this https URL.
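
The query-time linking step can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not the authors' schema: the table names (`events`, `event_entities`), the one-hop join, and the toy chunks are all assumptions.

```python
import sqlite3


def build_index(chunks):
    """Store each chunk as one event row plus its indexing entities."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, text TEXT)")
    conn.execute("CREATE TABLE event_entities (event_id INTEGER, entity TEXT)")
    for event_id, (text, entities) in enumerate(chunks, start=1):
        conn.execute("INSERT INTO events VALUES (?, ?)", (event_id, text))
        conn.executemany(
            "INSERT INTO event_entities VALUES (?, ?)",
            [(event_id, entity) for entity in entities],
        )
    return conn


def local_hyperedges(conn, query_entity):
    """Map each entity of the seed events to all events that share it."""
    rows = conn.execute(
        """
        SELECT DISTINCT link.entity, link.event_id
        FROM event_entities AS seed
        JOIN event_entities AS bridge ON bridge.event_id = seed.event_id
        JOIN event_entities AS link ON link.entity = bridge.entity
        WHERE seed.entity = ?
        ORDER BY link.entity, link.event_id
        """,
        (query_entity,),
    ).fetchall()
    edges = {}
    for entity, event_id in rows:
        edges.setdefault(entity, []).append(event_id)
    return edges


# Toy chunks: (event text, indexing entities). Purely illustrative data.
CHUNKS = [
    ("Alice founded Acme in Paris.", ["Alice", "Acme", "Paris"]),
    ("Acme acquired Bolt in 2020.", ["Acme", "Bolt"]),
    ("Bolt is headquartered in Oslo.", ["Bolt", "Oslo"]),
]
```

In this sketch nothing graph-shaped is stored: a new chunk is just two plain inserts, and the hyperedges exist only in the result of the join. Multi-hop expansion would repeat the join from the newly reached events.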

[NLP-90] PreLort: Prefix-Nested LoRA for Federated Fine-Tuning under Rank Heterogeneity

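Quick read: This paper addresses federated fine-tuning with parameter-efficient methods such as LoRA under heterogeneous hardware, where clients with different adapter ranks cannot be aggregated directly. Existing methods enable aggregation across heterogeneous ranks but do not control how information is distributed across rank dimensions, so the shared low-rank representation is underused. The key to the solution is PreLort, a nested low-rank formulation that organizes adapter dimensions into a prefix hierarchy, so that lower-rank dimensions encode the most task-relevant information while higher-rank dimensions provide additional capacity. It introduces two mechanisms: (i) a segment-wise aggregation rule that averages each rank segment only over the clients that contribute to it, avoiding dilution from zero-padded lower-rank clients; and (ii) a prefix-nested training strategy that optimizes each adapter under multiple rank truncations, encouraging useful signal to concentrate in the low-rank prefix dimensions. Together, these keep the low-rank prefix consistently learned and aggregated, and let low-rank clients benefit from the richer information contributed by higher-rank clients. Experiments across multiple base models show that the method consistently outperforms prior heterogeneous federated LoRA methods in accuracy and ROUGE-L, with lower or comparable perplexity.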

Link: https://arxiv.org/abs/2606.15963
Authors: Muhammad Waseem,Nurbek Tastan,Andrej Jovanovic,Nicholas D. Lane,Nils Lukas,Karthik Nandakumar,Samuel Horvath
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Federated fine-tuning of large language models using parameter-efficient methods such as LoRA enables privacy-preserving adaptation of foundation models. Heterogeneous hardware resources introduce challenges, as clients with different adapter ranks cannot be directly aggregated. While existing methods enable aggregation under heterogeneous ranks, they fail to control how information is distributed across rank dimensions, leading to suboptimal use of shared low-rank representations. Instead, we propose PreLort: a nested low-rank formulation for federated LoRA that organizes adapter dimensions into a prefix hierarchy. Our approach ensures that lower-rank dimensions encode task-relevant information, while higher-rank dimensions capture additional capacity. Building on this, we introduce (i) a segment-wise aggregation rule that averages only over clients contributing to each rank segment, avoiding dilution from zero-padded lower-rank clients, and (ii) a prefix-nested training strategy that optimizes each adapter under multiple rank truncations, encouraging useful signal to concentrate in low-rank prefix dimensions. Together, these components encourage a consistent low-rank prefix capturing the most task-relevant information, while higher-rank dimensions learn additional capacity. This allows low-rank clients to benefit from richer information contributed by higher-rank clients, as prefix dimensions are consistently learned and aggregated. Experiments demonstrate that our method consistently outperforms prior heterogeneous federated LoRA methods in accuracy and ROUGE-L, while achieving lower or comparable perplexity across multiple base models.
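
The segment-wise aggregation rule (i) can be sketched in plain Python. This is an illustration under simplifying assumptions, not the paper's implementation: each client adapter is a single LoRA factor stored as a list of rank rows, clients are weighted equally, and `zero_padded_average` is included only to show the dilution the abstract mentions.

```python
def segmentwise_average(client_adapters):
    """Average rank row j only over clients whose rank exceeds j."""
    max_rank = max(len(adapter) for adapter in client_adapters)
    aggregated = []
    for j in range(max_rank):
        contributors = [a[j] for a in client_adapters if len(a) > j]
        width = len(contributors[0])
        aggregated.append(
            [sum(row[c] for row in contributors) / len(contributors)
             for c in range(width)]
        )
    return aggregated


def zero_padded_average(client_adapters):
    """Baseline for contrast: pad missing rows with zeros, divide by all clients."""
    max_rank = max(len(adapter) for adapter in client_adapters)
    width = len(client_adapters[0][0])
    n = len(client_adapters)
    return [
        [sum(a[j][c] for a in client_adapters if len(a) > j) / n
         for c in range(width)]
        for j in range(max_rank)
    ]


# Two hypothetical clients: rank 1 and rank 2, adapter width 2.
CLIENTS = [
    [[2.0, 4.0]],
    [[4.0, 0.0], [6.0, 2.0]],
]
```

With these toy values, the second rank row is contributed by one client only, so segment-wise averaging keeps it at `[6.0, 2.0]`, while zero-padded averaging halves it.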

[NLP-91] FinBalance: A Multi-Document Accounting Reconciliation Benchmark

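Quick read: This paper observes that existing financial-NLP benchmarks mostly evaluate prepared artifacts such as filings, tables, or extracted values, and overlook the earlier stage where real accounting work begins: reconciling source documents into cited journal entries, aggregating them into a balance sheet, and checking for contradictions. It proposes FinBalance, a multi-document accounting reconciliation benchmark built from source-document bundles across eight industries, three period types, and five difficulty levels. The key is a deterministic generator that composes human-authored business scenarios, accounting policies, tax/FX treatments, document schemas, distractors, and inconsistency templates, and whose ledger produces the corresponding journal entries, balance sheets, and 23 inconsistency-code labels. On a 710-record evaluation split, six contemporary LLMs reach at most 46% exact final-balance-sheet accuracy. Four models show a 26-41 percentage-point gap between BS_exact (the balance sheet the model reports) and BS_recon (the balance sheet rebuilt by replaying the model's entries through the ledger), indicating that models often produce numerically plausible entries but fail to bind them to supporting documents and aggregate them consistently. Citation-pressure prompting barely changes document-linking errors, whereas ledger-feedback ablations substantially improve reported balance sheets and expose trade-offs in inconsistency detection. Expert finance reviewers validated the benchmark design and labels.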

Link: https://arxiv.org/abs/2606.15949
Authors: Sasank Tumpati,Devansh Agarwal,Ayush Kedia,Arjun Neekhra,Murari Mandal,Krishna Garg,Yash Sinha,Suman Gupta,Dhruv Kumar
Affiliations: BITS Pilani; KIIT Bhubaneswar; University of Oxford
Categories: Computation and Language (cs.CL)
Comments: 18 pages, 12 figures. Code and data: this https URL

Abstract:Existing financial-NLP benchmarks mostly evaluate prepared artifacts such as filings, tables, or extracted values. Real accounting begins earlier: source documents must be reconciled into cited journal entries, aggregated into a balance sheet, and checked for contradictions. We introduce FinBalance, a multi-document accounting reconciliation benchmark built from source-document bundles across eight industries, three period types, and five difficulty levels. Human-authored business scenarios, accounting policies, tax/FX treatments, document schemas, distractors, and inconsistency templates are composed by a deterministic generator whose ledger produces journal entries, balance sheets, and 23 inconsistency-code labels. On a 710-record evaluation split, six contemporary LLMs reach at most 46% exact final-balance-sheet accuracy. Four models show a 26-41 pp gap between BS_exact, the model’s reported balance sheet, and BS_recon, the balance sheet obtained by replaying its entries through our ledger. Models often recover numerically plausible entries but fail to bind them to supporting documents and aggregate them consistently. Citation-pressure prompting barely changes document-linking errors, while ledger-feedback ablations substantially improve reported balance sheets and expose inconsistency-detection trade-offs. Expert finance reviewers validate the benchmark design and labels.
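
The difference between a reported balance sheet and one rebuilt by replaying entries can be sketched as follows. This is a toy double-entry ledger written for illustration; the entry format, the account names, and the use of integer cents are assumptions, and the benchmark's actual ledger and scoring are not described in the abstract.

```python
from collections import defaultdict


def replay_entries(entries):
    """Post journal entries; return the net debit balance per account in cents."""
    balances = defaultdict(int)
    for entry in entries:
        debits = sum(line["debit"] for line in entry)
        credits = sum(line["credit"] for line in entry)
        if debits != credits:
            raise ValueError("unbalanced journal entry")
        for line in entry:
            balances[line["account"]] += line["debit"] - line["credit"]
    return dict(balances)


def reported_matches_replay(reported, entries):
    """True when the reported balances equal those rebuilt from the entries."""
    return reported == replay_entries(entries)


# Hypothetical model output: two journal entries (amounts in cents).
ENTRIES = [
    [{"account": "Cash", "debit": 1000, "credit": 0},
     {"account": "Equity", "debit": 0, "credit": 1000}],
    [{"account": "Inventory", "debit": 400, "credit": 0},
     {"account": "Cash", "debit": 0, "credit": 400}],
]
```

A model whose entries are individually sound can still report a balance sheet that fails this check, for example by forgetting to reduce Cash after the inventory purchase.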

[NLP-92] Beyond NL2Code: A Structured Survey of Multimodal Code Intelligence

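Quick read: This survey starts from the observation that, although LLMs have substantially advanced text-to-code synthesis, many real programming tasks specify intent through visual artifacts such as screenshots, charts, documents, vector drawings, videos, and interactive states. Such tasks require models to connect visual perception to executable programs, because correctness depends not only on syntax but also on layout, geometry, data semantics, editability, interaction behavior, and domain-specific constraints that apply after execution. The survey frames this area as Multimodal Code Intelligence and builds a taxonomy from the role code plays in each task: a rendered artifact, an editable symbolic structure, a scientific representation, an intermediate reasoning trace, or an executable policy or tool interface. It then organizes benchmarks and methods into four domains (Graphical User Interface, Scientific Visualization, Structured Graphics, and Frontier Tasks and Frameworks), connecting mature artifact-generation problems to emerging agentic and unified settings. Looking ahead, it argues for four verification-centered directions: multi-signal validation (combining complementary evidence of correctness), multi-state verification (testing behavior across execution trajectories), cross-task transfer testing (probing reusable visual-code skills), and verifiable agent traces (revealing whether agent actions are grounded in visual evidence), which together may move multimodal code generation from single-output imitation toward evidence-grounded executable systems.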

Link: https://arxiv.org/abs/2606.15932
Authors: Xuanle Zhao,Qiushi Sun,Jingyu Xiao,Xuexin Liu,Haoyue Yang,Qiaosheng Chen,Xianzhen Luo,Jing Huang,Yufeng Zhong,Lei Chen,Shuai Fu,Zhenlin Wei,Jinhe Bi,Lei Jiang,Haibo Qiu,Siqi Yang,Peng Shi,Jian Hu,Zhixiong Zeng
Affiliations: Meituan; The University of Hong Kong; The Chinese University of Hong Kong; Institute of Automation, Chinese Academy of Sciences; Nanjing University; Harbin Institute of Technology; Australian Institute for Machine Learning, Adelaide University; Ludwig Maximilian University of Munich; University of Science and Technology of China; Queen Mary University of London
Categories: Computation and Language (cs.CL)
Comments: Work completed in January 2026. Updating now

Abstract:While LLMs have substantially advanced text-to-code synthesis, many real programming tasks specify intent through visual artifacts such as screenshots, charts, documents, vector drawings, videos, and interactive states. These tasks require models to connect visual perception to executable programs, because correctness depends not only on syntax but also on layout, geometry, data semantics, editability, interaction behavior, and domain-specific constraints that apply after execution. This survey examines Multimodal Code Intelligence, covering systems that generate, edit, refine, execute, or reason with code under visually grounded inputs and outputs. We first formulate the field by the role that code plays in each task, distinguishing code as a rendered artifact, an editable symbolic structure, a scientific representation, an intermediate reasoning trace, or an executable policy or tool interface. We then organize benchmarks and methods into four domains: Graphical User Interface, Scientific Visualization, Structured Graphics, and Frontier Tasks and Frameworks. This taxonomy connects mature artifact-generation problems to emerging agentic and unified settings and allows us to compare how different tasks treat evidence of correctness. Looking ahead, we argue that future research may benefit from four verification-centered directions. Multi-signal validation can combine complementary evidence of correctness, multi-state verification can test behavior across execution trajectories, cross-task transfer testing can probe reusable visual-code skills, and verifiable agent traces can reveal whether agent actions are grounded in visual evidence. Together, these directions may move multimodal code generation from single-output imitation toward evidence-grounded executable systems.

[NLP-93] Calibrated Triage Not Autonomy: Confidence Estimation for Medical Vision-Language Models

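Quick read: This paper addresses a key safety risk of medical vision-language models: they can lean on language priors and answer confidently while barely using the image, producing conclusions that look trustworthy but are wrong. The authors evaluate seven confidence estimators across five open-weight LVLMs and three medical visual-question-answering datasets spanning broad clinical imaging, radiology, and pathology, to find which confidence signal can reliably support bounded selective prediction, that is, automating a case only when confidence clears a threshold and deferring the rest to a clinician. They find that standard metrics are poor guides: discrimination barely separates the methods, and the weak calibration of a cheap self-reported confidence is easily removed by off-domain temperature scaling without changing deployable yield. What separates usable estimators is behavior in the high-confidence region: the weakest baselines are confidently wrong on 41 to 45 percent of their errors, against 1 to 4 percent for the best probe. Base-model competence also sets a ceiling: even a well-calibrated score recovers only about a third of radiology cases at a 20 percent error tolerance, and almost none of pathology. The safe role achievable today is therefore calibrated triage rather than autonomy: automate only the cases a calibrated score marks safe and route everything else to a clinician. All outputs, correctness judgments, confidence scores, and code are released.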

Link: https://arxiv.org/abs/2606.15910
Authors: Reza Khanmohammadi,Kundan Thind,Mohammad M. Ghassemi
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:A vision-language model can answer a question about a medical image fluently and confidently while barely using the image, leaning instead on language priors. In medicine this is the failure that matters most, because the answer looks trustworthy and is not, and the only protection is a confidence score reliable enough to tell the system when to abstain. We ask a deployment question rather than an accuracy one: how much imaging work a model can safely handle alone, and which confidence signal makes that possible. We evaluate seven confidence estimators across five open-weight LVLMs and three medical visual-question-answering datasets spanning broad clinical imaging, radiology, and pathology, with every probe trained only on natural images and applied without adaptation. Recast as bounded selective prediction (automate a case only when confidence clears a threshold, defer the rest), the comparison is cautionary. The standard metrics are poor guides: discrimination barely separates the methods, and the weak calibration of a cheap self-report is cheaply removed by off-domain temperature scaling without changing deployable yield. What distinguishes a usable estimator is the high-confidence region a clinician acts on: the weakest baselines are confidently wrong on 41 to 45 percent of their errors against 1 to 4 percent for the best probe, and no estimator is reliably best across domains or models. Safe handoff is governed at two levels: base-model competence sets a ceiling, so a well-calibrated score recovers roughly a third of radiology cases at a 20 percent error tolerance but almost none of pathology; the confidence layer then decides how much of that ceiling is reachable. The usable role today is calibrated triage, not autonomy: automate the cases a calibrated score marks safe, route the rest to a clinician. We release all outputs, correctness judgments, and confidence scores, with code.
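
Bounded selective prediction can be made concrete with a short sketch: pick the lowest confidence threshold whose automated cases stay within an error tolerance, and report the share of cases that can be automated. The function name and the toy scores are assumptions; this is not the paper's evaluation code, and choosing a threshold on the same data used to report coverage would be optimistic in practice.

```python
def best_coverage(scored, tolerance):
    """Return (threshold, coverage) maximizing coverage within the error tolerance.

    scored: list of (confidence, is_correct) pairs.
    Returns (None, 0.0) when no threshold satisfies the tolerance.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    best = (None, 0.0)
    errors = 0
    for i, (confidence, is_correct) in enumerate(ranked, start=1):
        if not is_correct:
            errors += 1
        # Cases with equal confidence must be accepted or deferred together.
        if i < len(ranked) and ranked[i][0] == confidence:
            continue
        if errors / i <= tolerance:
            best = (confidence, i / len(ranked))
    return best


# Hypothetical calibration set: (confidence score, answer was correct).
SCORED = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.5, False)]
```

Cases at or above the returned threshold would be automated and the rest deferred to a clinician; a confidently wrong answer near the top of the ranking is what collapses the achievable coverage.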

[NLP-94] Control-Plane Placement Shapes Forgetting: An Architectural Study of Agent Memory Across Thirteen System Configurations

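Quick read: This paper studies how the position of an LLM within an agent memory pipeline shapes forgetting failure modes, noting that existing benchmarks focus on recall and largely leave the control plane (supersede, release, purge) untested. Comparing thirteen system configurations on a 385-case adversarial surface, it identifies three placement regimes with partly complementary coverage: deterministic primitives suffice for lexical and temporal categories but fail canonicalization; an inscribe-time LLM recovers canonicalization but cannot help intent-aware deletion; and a mutation-time hook, which involves the LLM when memory is modified, recovers intent-aware deletion (78-85% on prefix-collision and compound-fact cases) and improves nearly all categories at once (91.7-93.2% overall, 2.3s/case mutation latency, recall path unchanged). The trade-off is exposed through ForgetEval, a benchmark with a 1000-case templated suite and a 385-case adversarial layer, paired with an Adapter Protocol that lets heterogeneous memory stores participate. An externally authored subset replicates the findings and shows a +27.8 point lift when placements are combined.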

Link: https://arxiv.org/abs/2606.15903
Authors: Dongxu Yang
Affiliations: DeepLethe
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 23 pages including appendices. Code, benchmark, and adapters released under MIT at this https URL

Abstract:Where an LLM sits in an agent memory pipeline, between the recall plane that retrieves stored facts (extensively benchmarked) and the control plane that mutates them via supersede, release, purge (largely untested), shapes which forgetting failure modes the system recovers. Comparing thirteen system configurations on a 385-case adversarial surface, we observe three placement regimes with partly complementary coverage: deterministic primitives suffice for lexical/temporal categories but fail canonicalization (5% on identifier-obfuscation, 0% on cross-lingual); inscribe-time LLM recovers canonicalization (100%) but cannot help intent-aware deletion (0% on prefix-collision and compound-fact); a mutation-time hook recovers intent-aware deletion (78-85%) and brightens nearly all categories simultaneously (91.7-93.2% overall, 0.17 per 385-case run, 2.3s/case mutation latency vs. 64-191ms/case deterministic, recall path unchanged). We expose the trade-off via ForgetEval, a 1000-case templated suite plus a 385-case adversarial layer (132 hand-crafted + 253 LLM-drafted oracle-validated) scored by deterministic substring match, paired with a six-method Adapter Protocol with honest N/A scoring that lets heterogeneous memory stores enter in 130 lines. Admission is corroborated by 10-annotator IAA (Fleiss’ kappa = 0.958) and a 77-case external-authored subset (four blind contributors) that replicates the canonicalization asymmetry and amplifies the joint-placement lift (+27.8 pt). Production failures are predominantly forgetting failures rather than recall failures, yet existing benchmarks measure only recall. ForgetEval and all adapters are released under MIT.
ACM classes: I.2.7; I.2.11; H.3.3. DOI: https://doi.org/10.48550/arXiv.2606.15903 (pending registration)

[NLP-95] BALTO: Balanced Token-Level Policy Optimization for Hallucination Mitigation

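Quick read: This paper addresses hallucination in large language models (LLMs) in knowledge-intensive settings, where generated content is not faithfully grounded in the provided evidence. Existing reinforcement learning (RL) approaches usually use response-level faithfulness rewards, which suffer from a granularity mismatch: a localized hallucination can cause supported content to be penalized spuriously. Recent work adds finer-grained feedback such as claim-level verification and token-level rewards, but unbalanced credit assignment can still induce length, verbosity, or optimization-noise biases. The paper proposes BALTO (Balanced Token-level Policy Optimization), which extracts checkable factual claims, verifies them against the reference context, and projects the claim-level judgments to token-level labels. Its balanced token-level credit assignment redistributes probability mass from unsupported content toward faithful content instead of suppressing the whole response. The authors analyze the limits of response-level rewards theoretically and prove advantages of BALTO in training stability and optimization efficiency. On ConFiQA, RAGTruth, and FinLLM-Eval, BALTO achieves the highest faithfulness in all six model-benchmark settings and consistently outperforms existing post-training baselines in Q-Score, showing a stronger trade-off between faithfulness and informativeness.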

Link: https://arxiv.org/abs/2606.15893
Authors: Ning Li,Zixuan Guo,Yan Xu,Wenbo Fei,Yifan Niu,Chang Luo,Yasheng Wang,Weiwen Liu,Yong Yu,Weinan Zhang
Affiliations: Shanghai Jiao Tong University (上海交通大学); Tencent (腾讯); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学（广州))
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Hallucinations remain a major obstacle to deploying large language models (LLMs) in knowledge-intensive settings, where generated responses must be faithfully grounded in provided evidence. Reinforcement learning (RL) is a promising direction for hallucination mitigation, but response-level faithfulness rewards suffer from a granularity mismatch: localized hallucinations can cause supported content to receive spurious penalties. Although recent work introduces fine-grained feedback such as claim-level verification and token-level rewards, unbalanced credit assignment can still induce length, verbosity, or optimization-noise biases. We propose BALTO, a Balanced Token-level Policy Optimization framework for hallucination mitigation. BALTO extracts checkable factual claims, verifies them against the reference context, and projects claim-level judgments to token-level labels. A balanced token-level credit assignment mechanism is introduced into the framework. This design redistributes probability mass from unsupported content toward faithful content, rather than suppressing the entire response. We systematically analyze the limitations of response-level rewards from a theoretical standpoint, and prove BALTO’s advantages in training stability and optimization efficiency for hallucination mitigation. Experiments on ConFiQA, RAGTruth, and FinLLM-Eval show that BALTO achieves the highest faithfulness across all six model–benchmark settings and consistently outperforms existing post-training baselines in Q-Score, demonstrating a stronger faithfulness–informativeness trade-off.

[NLP-96] Neuron Level Analysis of Large Language Model in Legal Domain Reasoning

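Quick read: This paper studies which neurons matter for legal-domain reasoning in large language models (LLMs) and how shared and task-specific neurons are distributed across tasks. Using neuron attribution scores to rank and suppress neurons, the authors confirm that suppressing the identified neurons collapses accuracy on the target task, whereas suppressing the same number of random neurons does not, which establishes their functional importance. The key findings are, first, that a small set of neurons is influential across all seven tasks; once these shared neurons are removed, suppressing the remaining ones degrades only the task they were identified from, revealing genuinely task-specific neurons in every model studied. Second, the three legal benchmarks show relatively high neuron overlap, suggesting legal-reasoning components that span jurisdictions. Finally, the distribution of identified neurons suggests that the concentration of influential neurons in middle MLP layers may depend on input format and content instead of being universal, which challenges an existing hypothesis.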

Link: https://arxiv.org/abs/2606.15884
Authors: Eri Onami,Youmi Ma,Shuhei Kurita,Naoaki Okazaki
Affiliations: Institute of Science Tokyo (东京科学研究所); NII (日本信息研究所); AIST (产业技术综合研究所)
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We present a neuron-level analysis of legal-domain reasoning in LLMs, comparing it with other applied domain tasks across seven open-weight models. Using neuron attribution scores to rank and suppress influential neurons, we confirmed that suppressing the identified neurons collapses accuracy on the target task, whereas suppressing the same number of random neurons does not. We further found a small subset of neurons influential across all seven tasks; once these are removed, suppressing the remaining neurons degrades only the task they were identified from, revealing genuinely task-specific neurons in every model studied. Within the legal domain, the three benchmarks exhibit relatively high neuron overlap and tend to be affected jointly, suggesting legal-component neurons that span jurisdictions. The distribution of identified neurons in our experiments suggests that the hypothesis that influential neurons are concentrated in middle MLP layers may depend on the input format and content, rather than being a universal phenomenon.

[NLP-97] Koshur Diacritizer: A Byte-Level Sequence-to-Sequence Model for Kashmiri Diacritic Restoration

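Quick read: This paper addresses the ambiguity caused by omitted diacritic marks in digital Kashmiri text, which hampers downstream natural language processing (NLP) applications. The solution is Koshur Diacritizer, a Kashmiri diacritic restoration model built on the ByT5-small byte-level sequence-to-sequence architecture. The framework combines three components: (1) script-aware normalization to standardize the input; (2) alignment validation to ensure the quality and consistency of the training pairs; and (3) skeleton-preserving inference, which restores diacritics while keeping the original base-letter sequence intact. On a held-out test set the model reaches a DERm of 0.2012 and a WER of 0.2159, and evaluation by a native Kashmiri linguistic expert yields a mean accuracy of 77.5%. The authors publicly release a dataset of 23.7k aligned sentence pairs, the model, and the source code as a reproducible baseline for Kashmiri diacritic restoration and future low-resource language research.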

Link: https://arxiv.org/abs/2606.15883
Authors: Haq Nawaz Malik,Nahfid Nissar,Faizan Iqbal
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Kashmiri, an Indo-Aryan language written in a modified Perso-Arabic script, frequently omits diacritic marks in digital text, creating ambiguity and challenging downstream NLP applications. We present Koshur Diacritizer, a ByT5-small byte-level sequence-to-sequence model for restoring diacritics in Kashmiri text. To support this task, we release a publicly available dataset of 23.7k aligned undiacritized-diacritized Kashmiri sentence pairs. The proposed framework combines script-aware normalization, alignment validation, and skeleton-preserving inference to ensure reliable restoration while maintaining the original base-letter sequence. Experimental results on a held-out test set achieve a DERm of 0.2012 and a WER of 0.2159. Additionally, evaluation by a native Kashmiri linguistic expert yields a mean accuracy of 77.5%. The dataset, model, and source code are publicly released to provide a reproducible baseline for Kashmiri diacritic restoration and future low-resource language research.
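
One way to read "skeleton-preserving" is as a guard that accepts a model output only if removing its diacritics gives back the input's base letters. The sketch below implements that reading with Python's `unicodedata` module. It assumes diacritics are exactly the Unicode nonspacing marks (category `Mn`) and that no normalization is applied; both function names are hypothetical, and the paper's actual inference procedure may differ.

```python
import unicodedata


def skeleton(text: str) -> str:
    """Drop nonspacing marks (category Mn), keeping the base-letter sequence."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def restore_safely(source: str, model_output: str) -> str:
    """Accept the model output only if it preserves the source skeleton."""
    if skeleton(model_output) == skeleton(source):
        return model_output
    # Fall back to the undiacritized input instead of emitting changed letters.
    return source


# Toy strings built from code points: BEH + REH, with FATHA (U+064E) added.
SOURCE = "\u0628\u0631"
GOOD_OUTPUT = "\u0628\u064e\u0631"   # same base letters, one diacritic added
BAD_OUTPUT = "\u062a\u064e\u0631"    # first base letter changed to TEH
```

The fallback policy is a design choice of this sketch: a stricter system could instead re-decode under a constraint that only marks may be inserted.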

[NLP-98] Free Energy Heuristics: Fast-And-Frugal Cognition as Active Inference Under Uncertain Precision

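Quick read: This paper addresses the puzzle that more reasoning can hurt large language models on some tasks: chain-of-thought (CoT) helps in math and symbolic reasoning, yet degrades performance on planning, contested ethics, and tasks where the model cannot check itself. The authors argue that the deciding property is meta-uncertainty, that is, how unsure the model is about the reliability of its own evidence. When meta-uncertainty is high, extra reasoning stops adding signal and starts manufacturing false confidence. Within a free-energy minimization framework, they prove that under a heavy-tailed precision prior the optimal policy stops integrating cues after a finite number of high-validity ones (Theorem 2.6.1), and that under a Descending Dominance condition it is sample-wise identical to the take-the-best heuristic (Theorem 2.7.4). Fast-and-frugal heuristics and active inference are thus two descriptions of the same computation. The authors build FEH-79, a benchmark of Knightian frames with matched controls, and run a pre-registered study across seven models and five CoT lengths. On high-meta-uncertainty items, longer CoT lowers accuracy by 17.3 points (95% CI [7.7, 25.5]), while matched items with definite answers show no cost. The effect is regime-dependent: decisive in capable mid-to-large models, directional in the two frontier systems, and absent or reversed in the weakest. The framework explains when CoT helps and unifies the Bayesian and fast-and-frugal traditions: less-is-more effects are evidence about the meta-uncertainty regime, not evidence against Bayesian cognition.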

Link: https://arxiv.org/abs/2606.15877
Authors: Alex Bogdan
Affiliations: Evolutionairy AI (进化人工智能)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 64 pages, 6 figures

Abstract:Chain-of-thought (CoT) improves large language models’ performance in math and symbolic reasoning. But on planning, contested ethics, and tasks where the model cannot check itself, more reasoning makes things worse. Both effects are documented; what has been missing is a principled account of which property decides the outcome. We argue it is meta-uncertainty: how unsure the model is about the reliability of its own evidence. When that uncertainty is high, extra reasoning stops adding signal and starts manufacturing false confidence. We prove that the policy minimizing expected free energy under uncertain precision stops integrating cues after a finite number of high-validity ones when the precision prior is heavy-tailed (Theorem 2.6.1), and under a Descending Dominance condition, is sample-wise identical to take-the-best (Theorem 2.7.4). Fast-and-frugal heuristics and active inference are, then, two descriptions of the same computation. The prediction is that on high-meta-uncertainty items, longer CoT should degrade accuracy. We score the regime per item (simulate-and-recover rho 0.96), build FEH-79, a benchmark of Knightian frames with matched controls, and run a pre-registered study across seven models (five open-weight 3B-32B, two frontier), five CoT lengths, and 7,875 responses. The gate, fixed before any data, required a negative interaction with posterior probability above 0.95 and an accuracy drop of more than 6 points. It held. The high-regime drop is 17.3 points (95% CI [7.7, 25.5]); matched items with definite answers show no cost. The effect is regime-dependent: decisive in capable mid-to-large models, directional in the two frontier systems, absent-to-reversed in the weakest. The framework answers when CoT helps and unifies the Bayesian and fast-and-frugal traditions: less-is-more effects are evidence about the meta-uncertainty regime, not against Bayesian cognition.
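
For readers unfamiliar with take-the-best, the heuristic named in Theorem 2.7.4, a minimal sketch follows. It shows the classic heuristic only, not the paper's free-energy derivation; the cue names and values are invented for illustration, and unknown cue values are simply skipped.

```python
def take_the_best(option_a, option_b, cues_by_validity):
    """Decide between two options using the first cue that discriminates.

    Cues are tried in descending order of validity; search stops at the first
    cue on which the options differ. Returns (choice, deciding_cue), or
    (None, None) when no cue discriminates.
    """
    for cue in cues_by_validity:
        a, b = option_a.get(cue), option_b.get(cue)
        if a is None or b is None or a == b:
            continue  # cue is unknown or does not discriminate
        return ("a" if a > b else "b"), cue
    return None, None


# Which city is larger? Binary cues, listed from most to least valid (assumed).
CUES = ["is_capital", "has_major_team", "has_university"]
CITY_A = {"is_capital": 0, "has_major_team": 1, "has_university": 1}
CITY_B = {"is_capital": 0, "has_major_team": 0, "has_university": 1}
```

In this example the decision rests on `has_major_team` alone and the remaining cue is never consulted, which is the one-reason stopping behavior the abstract relates to a policy that stops integrating cues after a finite number of high-validity ones.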

[NLP-99] SciOrch: Learning to Orchestrate Expert LLMs for Solving Frontier Multimodal Scientific Reasoning Tasks

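Quick read: This paper addresses the shortfall of large language models (LLMs) on frontier scientific reasoning, where even the strongest commercial systems remain below expert level. The core observation is that single-model evaluation hides substantial complementarity: different frontier models excel on different question types. The proposed SciOrch framework trains a lightweight 8B-parameter orchestrator that decomposes each question, delegates sub-problems to selected commercial frontier models through API calls, and synthesizes a final answer. Training such an orchestrator is harder than conventional agentic RL, because every action triggers an API call that is expensive in both cost and latency, making standard online rollouts infeasible. The authors therefore use a Monte Carlo tree search (MCTS) based approach to produce diverse orchestration trajectories, extract per-node single-turn samples from them, and optimize the orchestrator with GRPO-style training. On a 240-question test set spanning SGI-Reasoning and Scientists' First Exam, SciOrch reaches 56.66% average accuracy, outperforming the strongest single commercial model by 3.74% and the strongest multi-agent baseline by 3.33%. It attains the best accuracy on both benchmarks with less than half the API cost of typical multi-agent methods.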

Link: https://arxiv.org/abs/2606.15872
Authors: Jingru Guo,Xiangyuan Xue,Lian Zhang,Wanghan Xu,Siki Chen,Philip Torr,Wanli Ouyang,Lei Bai,Zhenfei Yin
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Frontier scientific reasoning remains a major challenge for large language models (LLMs), where even the strongest commercial systems fall short of expert-level performance. A closer look at model behavior reveals substantial complementarity that single-model evaluation hides: different frontier models excel on different question types, and no single model captures the full picture. We present SciOrch, a framework that trains a lightweight 8B model to orchestrate frontier LLMs for scientific reasoning. The orchestrator decomposes each question, delegates sub-problems to selected commercial models through API calls, and synthesizes a final answer. Training such an orchestrator is fundamentally harder than conventional agentic RL: each action triggers an API call that is expensive in both dollar cost and latency, making standard online rollouts infeasible. We address this with an MCTS-based approach that produces diverse orchestration trajectories, extracts per-node single-turn samples, and optimizes the orchestrator with GRPO-style training. On a 240-question test set spanning SGI-Reasoning and Scientists’ First Exam, SciOrch reaches 56.66% average accuracy, outperforming the strongest single commercial model by 3.74% and the strongest multi-agent baseline by 3.33%. It also attains the best accuracy on both SGI and SFE with less than half the API cost of typical multi-agent methods.

[NLP-100] When Correct Edges Cannot Be Verified: A Provenance Gap in Incomplete KGQA and a Provenance-Favoring Completion Policy

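Quick read: This paper examines how far completed edges can be trusted in Incomplete Knowledge Graph Question Answering (IKGQA). Existing methods treat textual verifiability as a proxy for the correctness of a completed edge, an assumption that had not been tested systematically. Using the gold deleted triples from the standard random-deletion protocol, the authors find that 76-96% of gold-correct completed edges have no supporting passage even under exhaustive retrieval, and that this holds across deletion rates (20%/40%), datasets (CWQ/WebQSP), and relation types (structural, commonsense, long-tail). Textual verifiability therefore measures provenance, not correctness, and the two are separated by a paradigm-level gap that better retrieval alone cannot close. This reframes edge completion from "is the edge correct?" to "admit or abstain under provenance uncertainty?". On that basis the paper proposes TGComplete, a provenance-favoring admission policy that retrieves evidence at a reasoning breakpoint, verifies a candidate edge through a lightweight loop, and abstains when support is absent. Compared with the generate-to-complete baseline GoG, TGComplete attains higher edge precision against gold (15-21% vs 3-14%), with no statistically detectable EM loss and 3.1-7.4 times higher strict faithfulness of admitted edges, at the cost of lower recall. TGComplete is thus positioned not as uniformly better but as a principled point on the trade-off between precision and provenance on one side and recall on the other, appropriate when auditability matters.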

Link: https://arxiv.org/abs/2606.15833
Authors: Yongqi Kang,Yu Fu,Yong Zhao
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Incomplete Knowledge Graph Question Answering (IKGQA) requires completing missing edges to continue reasoning. A growing line of work verifies completed edges against retrieved text, treating textual support as a proxy for edge quality. We ask a question that, to our knowledge, has not been systematically tested: does textual verifiability actually track correctness? Exploiting the gold deleted triples provided by the standard random-deletion protocol, we measure both. The finding is counterintuitive: among gold-correct completed edges, 76-96% have no supporting passage even under exhaustive retrieval, robustly across deletion rates (20%/40%), datasets (CWQ/WebQSP), and relation types (structural, commonsense, long-tail). Most Freebase-style facts simply do not occur as head-tail co-mentions in text. Textual faithfulness therefore measures provenance, not correctness – separated by a paradigm-level gap no in-corpus retrieval closes. This reframes edge completion. Since most completed edges – correct or not – are causally redundant for the answer (95-97% of correct answers do not depend on any unsupported edge), the central question shifts from “is the edge correct?” to “admit or abstain under provenance uncertainty?” Within this framing we present TGComplete, a provenance-favoring admission policy that retrieves evidence at a reasoning breakpoint, verifies a candidate through a lightweight loop, and abstains when support is absent. Against the generate-to-complete baseline GoG, it attains higher edge precision against gold (15-21% vs 3-14%), with no statistically detectable EM loss and 3.1-7.4 times higher strict faithfulness of admitted edges – at the cost of lower recall. We position TGComplete not as uniformly better, but as a principled point on a precision/provenance-recall trade-off, appropriate when auditability matters.

[NLP-101] The Truth Stays in the Family: Enhancing Contextual Grounding via Inherited Truthful Heads in Model Lineages ICML2026

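【Quick Read】: This paper asks whether a fundamental behavioral link exists between multimodal large language models (MLLMs) and the foundational large language models (LLMs) they are built on, and in particular whether the base model's context-truthfulness is still inherited after instruction tuning or multimodal adaptation. The key finding is at the attention-head level: heads with high context-truthfulness scores are strongly preserved across variants within a model family, consistent with preservation of attention-head weights, and these context-truthful heads tend to attend to query-relevant evidence. Building on this, the authors propose TruthProbe, a soft-gating strategy that amplifies the contribution of context-truthful heads while preserving the contributions of other heads. TruthProbe improves contextual truthfulness on HaluEval and reduces multimodal hallucination on POPE and CHAIR, with Truth Scores computed on the base LLM transferring effectively to its fine-tuned LLM and MLLM descendants.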

Link: https://arxiv.org/abs/2606.15821
Authors: Miso Choi,Seonga Choi,Mincheol Kwon,Woosung Joung,Jinkyu Kim,Jungbeom Lee
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at ICML 2026

Abstract:Recent advances in large language models (LLMs) have produced many specialized multimodal LLMs (MLLMs) that share common foundational LLMs, forming distinct model lineages. It remains unclear whether a fundamental behavioral link exists between the foundational LLMs and downstream variants. We investigate this question by quantifying head-level context-truthfulness scores. Across diverse LLM and MLLM lineages, including Vicuna-, Qwen2.5-, LLaMA2-, and Mistral-based models, we find that Truth Scores are strongly preserved within model families, even after instruction tuning or multimodal adaptation. We further show that this inheritance is consistent with attention-head weight preservation, and that context-truthful heads attend to query-relevant evidence. Building on this finding, we propose TruthProbe, a soft-gating strategy that amplifies context-truthful heads while preserving other head contributions. TruthProbe improves contextual truthfulness on HaluEval and reduces multimodal hallucination on POPE and CHAIR, with base-LLM Truth Scores transferring effectively to their fine-tuned LLM and MLLM descendants. Code is available at this https URL.

[NLP-102] On Defining Erasure Harms for NLP

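【Quick Read】: This paper addresses representational harms that deployed Natural Language Processing (NLP) systems may cause, focusing on one specific harm, erasure, which lacks a clear and cohesive conceptual foundation and is therefore hard to identify and measure. Existing conceptualizations of erasure are either too broad, making it difficult to pin down what is needed to establish that erasure has occurred, or tied to particular settings, which eases measurement there but makes adaptation to other settings difficult. The paper proposes a structured definition of erasure that specifies the components necessary for establishing whether erasure has occurred, and stresses that practitioners must explicitly articulate and operationalize these components in order to measure it.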

Link: https://arxiv.org/abs/2606.15815
Authors: Yu Lu Liu,Arnav Goel,Jackie Chi Kit Cheung,Alexandra Olteanu,Ziang Xiao,Su Lin Blodgett
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The deployment of NLP systems has raised concerns about harms they might produce, including representational harms. Recent literature has begun to conceptualize and measure one such harm, the harm of erasure. Nevertheless, the field lacks a clear and cohesive conceptual foundation for identifying and measuring erasure. Existing conceptualizations of erasure are often broad – making it difficult to identify what is needed to establish and measure erasure – or else specific to particular settings – facilitating measurement for those settings but potentially challenging to adapt to other settings. To address this gap, we develop and propose a structured definition of erasure that clarifies what components are necessary for establishing whether erasure has occurred, which practitioners need to explicitly articulate and operationalize in order to measure erasure.

[NLP-103] da704 at SemEval-2026 Task 4: Modeling Narrative Structures via Pseudonymization and Multi-View Sentence Alignment

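【Quick Read】: This paper tackles narrative similarity and narrative representation learning, namely how to capture semantic similarity between narratives along abstract theme, course of action, and outcome. The approach is contrastive learning with fine-tuned sentence transformers, trained with a contrastive loss on synthetic data. Two pipelines are developed: Track A is a single-view method that encodes full narratives and uses smart layer freezing to reduce overfitting; Track B is a multi-view method that models theme, plot, and outcome with view-specific projection heads and self-supervised alignment. Both pipelines build on sentence-transformers models.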

Link: https://arxiv.org/abs/2606.15783
Authors: Tai Tran Tan,An Dinh Thien
Affiliation: University of Information Technology, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We present our approach to SemEval 2026 Task 4: Narrative Story Similarity and Narrative Representation Learning. Our solution uses contrastive learning with fine-tuned sentence transformers to capture narrative similarity across abstract themes, course of action, and outcomes. We develop two pipelines: (Track A) a single-view method that encodes full narratives with smart layer freezing to reduce overfitting, and (Track B) a multi-view method that models theme, plot, and outcome with view-specific projection heads and self-supervised alignment. Both pipelines build on sentence-transformers models and are trained with contrastive loss on synthetic data. The code is available at the following GitHub repository: this https URL.

[NLP-104] DYNA: Dynamic Episodic Memory Networks for Augmenting Large Language Models with Temporal Knowledge Graphs in Continuous Learning

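【Quick Read】: This paper addresses the difficulty large language models (LLMs) have in incorporating new knowledge without catastrophic forgetting or costly retraining. It proposes DYNA, a lightweight framework that augments a frozen LLM with a temporal knowledge graph (TKG) in which events are nodes and temporal relations are directed, timestamped edges. The graph acts as an external, updatable memory: at query time DYNA retrieves relevant event nodes via random walks and centrality measures and uses them to augment the LLM's response. On three temporal recall tasks, DYNA reduces catastrophic forgetting by about 7% compared with fine-tuning and improves temporal ordering by about 5% over standard retrieval-augmented generation (RAG). Higher graph clustering coefficients correlate with better retrieval, indicating that graph structure itself matters. The stated contributions are: (1) modeling episodic memory as a temporal knowledge graph; (2) retraining-free LLM augmentation; (3) graph properties as predictors of retrieval performance.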

Link: https://arxiv.org/abs/2606.15778
Authors: Ali Sarabadani,Mahtab Tajvidiyan
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments:

Abstract:Large Language Models (LLMs) struggle to incorporate new knowledge without forgetting or costly retraining. We propose DYNA, a lightweight framework that augments a frozen LLM with a temporal knowledge graph where events are nodes and temporal relations are directed, timestamped edges. The graph serves as an external, updatable memory. At query time, DYNA retrieves relevant nodes via random walks and centrality measures, then augments the LLM’s response. Evaluated on three temporal recall tasks, DYNA reduces catastrophic forgetting by ~7% compared to fine-tuning and improves temporal ordering by ~5% over standard RAG. Higher graph clustering coefficients correlate with better retrieval, showing that graph structure matters. Contributions: (1) episodic memory as temporal KG, (2) retraining-free LLM augmentation, (3) graph properties as predictors of retrieval performance.

[NLP-105] da704 at SemEval-2026 Task 6: Structured Chain-of-Thought Prompting for Political Evasion Detection

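【Quick Read】: This paper addresses the classification of political evasion strategies in English question-answer pairs extracted from U.S. presidential interviews, where the main challenges are severe class imbalance and the need to identify evasive intent that is only implicit in the language. It compares two paradigms: (1) Parameter-Efficient Fine-Tuning (PEFT) of Qwen3 models (4B-32B) with QLoRA, using tiered upsampling and weighted cross-entropy loss to mitigate class imbalance; and (2) structured Chain-of-Thought (CoT) prompting of reasoning-capable API models (DeepSeek-V3.2 and Grok-4-Fast). Structured CoT prompting substantially outperforms the authors' PEFT baseline in absolute Macro F1. The best system, Grok-4-Fast with extended reasoning and few-shot hierarchical CoT prompting, reaches a Macro F1 of 0.5147 on Subtask 2 (9-class evasion) and 0.7979 on Subtask 1 (3-class clarity), ranking 8th of 33 teams and 13th of 41 teams respectively on the official leaderboard. Ablations show that presenting labels within a hierarchical taxonomy helps structure the model's reasoning, while few-shot exemplars provide task calibration. The strongest prompt variants are not statistically distinguishable in Macro F1, but explicitly enabling extended reasoning yields substantial gains by supporting the multi-step pragmatic analysis needed to detect evasive intent.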

Link: https://arxiv.org/abs/2606.15770
Authors: Tai Tran Tan,An Dinh Thien
Affiliation: University of Information Technology, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This paper describes our system for SemEval-2026 Task 6, which addresses the classification of political evasion strategies in English question-answer pairs extracted from U.S. presidential interviews. We systematically compare two distinct paradigms: (1) Parameter-Efficient Fine-Tuning of Qwen3 models (4B-32B) using QLoRA, enhanced with tiered upsampling and weighted cross-entropy loss to address severe class imbalance, and (2) structured Chain-of-Thought (CoT) prompting of reasoning-capable API models, namely DeepSeek-V3.2 and Grok-4-Fast. Our evaluation demonstrates that structured CoT prompting of reasoning-enabled models substantially outperforms our baseline parameter-efficient fine-tuning implementation in absolute Macro F1. Our best system, Grok-4-Fast with extended reasoning and few-shot hierarchical CoT prompting, achieves a Macro F1 of 0.5147 on Subtask 2 (9-class evasion) and 0.7979 on Subtask 1 (3-class clarity), ranking 8th out of 33 teams on Subtask 2 and 13th out of 41 teams on Subtask 1 on the official leaderboard. Furthermore, our ablation studies reveal key insights into effective prompt design for evasion detection: presenting labels within a hierarchical taxonomy helps structure model reasoning, while few-shot exemplars provide task calibration. However, the strongest prompt variants are not statistically distinguishable in Macro F1, and explicitly enabling extended reasoning modes yields substantial performance gains by facilitating the multi-step pragmatic analysis required to detect evasive intent.

[NLP-106] A Self Consistency Based Reranking for Narrative Question Answering

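【Quick Read】: This paper addresses the instability of narrative question answering (NQA) models that rely on a single decoding output, which makes them sensitive to generation variability and often yields incomplete or inconsistent answers. It proposes a self-ensemble, self-consistency-based reranking framework: the model generates multiple candidate answers for each story-question pair and selects the final answer by semantic agreement among the candidates. This requires no change to the model architecture and improves robustness through consensus-based selection while still exploring diverse answer formulations. On the NarrativeQA dataset the framework improves all evaluated models, including FLAN-T5 (Base and Small) and Pegasus-Large: FLAN-T5-Base improves from 82.32% to 86.66% (+4.34%), and Pegasus-Large shows the largest gain, from 72.50% to 87.07% (+14.57%).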

Link: https://arxiv.org/abs/2606.15741
Authors: Molham Mohamed,Ali Hamdi
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Narrative question answering (NQA) is a challenging task in natural language processing that requires models to understand long textual contexts, capture relationships across events, and generate coherent responses. Despite recent advances in pretrained language models, most existing approaches rely on a single decoding output during inference, making them sensitive to generation variability and often resulting in incomplete or inconsistent answers. To address this limitation, we propose a self-ensemble Self-Consistency-Based reranking framework for narrative question answering. The proposed method generates multiple candidate answers for each story-question pair and selects the final answer based on semantic agreement among the generated responses. This allows the model to explore diverse answer formulations while improving robustness through consensus-based selection without requiring modifications to the underlying architecture. The framework combines pretrained and fine-tuned language generation with multi-answer inference and similarity-based reranking. We evaluate the proposed approach on the NarrativeQA dataset using multiple models, including FLAN-T5 (Base and Small) and Pegasus-Large, under both baseline and fine-tuned settings. Experimental results demonstrate that the proposed method consistently improves performance across all models. In particular, FLAN-T5-Base achieves the best overall performance, improving from 82.32% to 86.66% (+4.34%) when combined with self-ensemble inference. Additionally, the largest improvement is observed with Pegasus-Large, which increases from 72.50% to 87.07% (+14.57%), highlighting the effectiveness of the proposed strategy.
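
The consensus-based selection described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which similarity measure is used, so word-set Jaccard overlap stands in for semantic similarity here, and the function names `token_jaccard` and `rerank_by_agreement` are hypothetical.

```python
from typing import List


def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets (an assumed stand-in for semantic similarity)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def rerank_by_agreement(candidates: List[str]) -> str:
    """Return the candidate whose mean similarity to all other candidates is highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    def mean_agreement(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(token_jaccard(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=mean_agreement)
    return candidates[best]


# Three sampled answers to the same story-question pair (made-up strings).
candidates = [
    "the wizard hid the ring",
    "the wizard hid the ring in the cave",
    "a dragon stole the crown",
]
print(rerank_by_agreement(candidates))
```

In practice the candidates would come from sampling the generator several times, and the overlap function would likely be replaced by an embedding-based similarity.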

[NLP-107] EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries

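【Quick Read】: This paper addresses the limited realism of existing clinical question answering (CQA) benchmarks, which mostly evaluate exam-style medical knowledge or single-turn question answering and offer little evaluation of evidence grounding or of synthesis across multiple documents. It introduces EHRNote-ChatQA, the first benchmark for evidence-grounded multi-turn clinical question answering over a patient's multiple discharge summaries. The benchmark is built through an expert-informed pipeline: starting from de-identified MIMIC-IV discharge summaries, it combines a discharge-summary structuring schema, expert-curated multi-turn QA templates, and LLM-based generation, after which 11 medical experts review and revise every QA sample. It contains 967 patient-level multi-turn samples, each spanning one to five notes, and 16,072 expert-verified QA pairs (8,036 content questions, each paired with an evidence-grounding question) across eight clinical categories. Benchmarking 22 open- and closed-source LLMs shows that models struggle more with evidence grounding than with content answering, that errors compound across turns, and that single-turn clinical QA performance does not reliably transfer to the multi-turn setting.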

Link: https://arxiv.org/abs/2606.15735
Authors: Jiyoun Kim,Muhan Yeo,Eunhye Jang,Jeewon Yang,Hangyul Yoon,Su Ji Lee,Hee Jo Han,Hee-Jae Jung,Doyun Kwon,Jun young Lee,Jaehun Lee,Jung-Oh Lee,Sunjun Kweon,Jong Hak Moon,Daseul Kim,Minjae Cho,Edward Choi
Affiliation: KAIST; Seoul National University; Seoul National University Bundang Hospital; SAIHST, Sungkyunkwan University; Yonsei University College of Medicine; Gangnam Severance Hospital; Severance Hospital; Seoul Medical Center; Seoul National University Hospital; National Cancer Center; Icahn School of Medicine at Mount Sinai; Samsung Medical Center
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Discharge summaries are crucial clinical documents containing the context of a patient’s overall hospital stay, and are routinely reviewed by medical experts for patient readmission, ongoing care, and diagnostic decision-making. When reviewing them, medical experts often must iteratively synthesize information across multiple summaries while verifying the evidence supporting each answer. Although large language models (LLMs) are increasingly explored for clinical question answering, existing benchmarks do not sufficiently reflect this setting: they often evaluate exam-style medical knowledge or focus on single-turn question answering with limited evidence-grounding evaluation. We introduce EHRNote-ChatQA, the first benchmark for evidence-grounded multi-turn clinical question answering over patients’ multiple discharge summaries. Built from de-identified MIMIC-IV discharge summaries, EHRNote-ChatQA contains 967 patient-level multi-turn samples spanning one to five notes and 16,072 medical-expert-verified QA pairs (8,036 content questions, each paired with an evidence-grounding question) across eight clinical categories. The benchmark is constructed through an expert-informed pipeline combining discharge-summary structuring schema, expert-curated multi-turn QA templates, and LLM-based generation, followed by review and revision of every single QA sample by 11 medical experts. Benchmarking 22 open- and closed-source LLMs reveals several challenges, including that LLMs struggle more with evidence grounding than content answering, multi-turn errors compound across turns, and single-turn clinical QA performance does not reliably transfer to this setting. These findings establish EHRNote-ChatQA as a rigorous and practical benchmark for evaluating clinical QA systems. The dataset will be made publicly available through PhysioNet credentialed access.

[NLP-108] Vernier: Probing Representational Misalignment Behind Lexical Gaps in Causal Reasoning

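【Quick Read】: This paper studies why instruction-tuned language models are sensitive to variable renaming in causal reasoning: when the English variable names in a question are replaced by type-preserving placeholders, the model's answer can change even though the structural causal model (SCM) and the gold answer are unchanged. The question is whether this lexical gap comes from information loss in the placeholder view or from a misaligned read-out of a representation that still carries answer-relevant content. The method, Vernier, uses a paired-view weight update as an instrument and then inspects the mechanism that remains once the gap has closed. In the regimes where the update works, the evidence favors representational misalignment over information loss: a variable-name probe becomes more accurate on the placeholder view, and activation patching on Qwen-7B, Qwen-14B, and Llama-3.1-8B shows that the decision-token representation can transfer answer identity between views. The update that realigns the two views is counterfactual augmentation over original and placeholder prompts, while the answer-subspace KL term mainly sharpens agreement of intermediate answer beliefs. Success is bounded by model family, scale, and task: transfer to CRASS is reliable across Qwen scales and Llama, e-CARE remains weak, and preliminary non-causal rename tasks show a similar qualitative pattern.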

Link: https://arxiv.org/abs/2606.15733
Authors: Zhenyu Yu
Affiliation: Fudan University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Instruction-tuned language models can answer the same causal-reasoning question differently after its English variable names are replaced by type-preserving placeholders, although the structural causal model and the gold answer are unchanged. We ask whether this lexical gap reflects information loss in the placeholder view or a misaligned read-out from a representation that still carries answer-relevant content. Vernier uses a paired-view weight update as an instrument and then inspects the mechanism left after the gap closes. In the working regimes, the evidence favours representational misalignment. A variable-name probe becomes more accurate on the placeholder view, and activation patching on Qwen-7B, Qwen-14B, and Llama-3.1-8B shows that the decision-token representation can transfer answer identity between views. The update that realigns the views is counterfactual augmentation over original and placeholder prompts, while the answer-subspace KL mainly sharpens intermediate answer-belief agreement. Success is bounded by model family, scale, and task. CRASS transfer is reliable across Qwen scales and Llama, e-CARE remains weak, and preliminary non-causal rename tasks show a similar qualitative pattern.

[NLP-109] Beyond English: Uncovering the Multilingual Gap in Vision-Language-Action Models

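【Quick Read】: This paper addresses a marked weakness of current Vision-Language-Action (VLA) models in following multilingual instructions. Existing VLA systems are trained and evaluated mainly with English instructions, and although the underlying large language models are often multilingual, performance drops substantially on non-English instructions, a "multilingual gap". The authors build multilingual instructions by translating the instructions of existing benchmarks and evaluate several representative VLA models in simulation. Their analyses show that the performance drops correlate with both instruction understanding and action execution, and suggest that representation shifts caused by multilingual instructions may contribute to the gap. As a remedy they propose Multilingual Principal Component Alignment, a simple fine-tuning approach that uses Principal Component Analysis (PCA) to obtain a principal-component subspace and aligns the projected representations of different languages, reducing the multilingual performance gap.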

Link: https://arxiv.org/abs/2606.15714
Authors: Hanyang Chen,Hongliang Li,Jiarui Cao,Yang Li,Yang Jiang,Haonan Wen,Kaiyu Huang,Shengnan Guo,Huaiyu Wan
Affiliation: Beijing Jiaotong University
Categories: Computation and Language (cs.CL); Robotics (cs.RO)
Comments:

Abstract:Vision-Language-Action models have recently demonstrated promising capabilities in learning generalist robot policies from large-scale multimodal data. However, most existing VLA systems are trained and evaluated primarily with English instructions, leaving their ability to understand and execute instructions in other languages largely unexplored. While the underlying large language models often possess multilingual capabilities, it remains unclear whether these multilingual capabilities transfer to VLAs during training. In this work, we present the first systematic study of multilingual instruction following in VLA models. We first construct multilingual instructions by extending existing benchmarks with translations of their instructions. Using these instructions, we evaluate several representative VLA models across a range of tasks in simulation settings. Our experiments reveal a significant multilingual gap: models trained primarily on English instructions exhibit substantial performance degradation when evaluated on other languages, even when the underlying language backbone is multilingual. We provide several findings and analyses to understand the multilingual gap. Cross-lingual transfer behavior analysis shows that performance drops correlate with both instruction understanding and action execution. Representation analyses suggest that multilingual instruction-caused representation shifts may contribute to the multilingual gap. Motivated by these findings, we further explore strategies to improve multilingual performance in VLAs. We propose a simple yet effective multilingual fine-tuning approach, Multilingual Principal Component Alignment, which leverages Principal Component Analysis to get the principal component subspace and align projected multilingual representations, effectively reducing the multilingual performance gap.

[NLP-110] Do LLMs Reliably Identify Correct Information Units in Aphasic Discourse?

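【Quick Read】: This paper addresses the cost of manually scoring Correct Information Units (CIUs) in discourse assessment for aphasia, which is time-intensive and requires trained raters. CIU scoring quantifies how informative a speaker's discourse is, but its labor cost limits wider clinical and research use. The study tests whether instruction-tuned large language models (LLMs) can perform token-level CIU classification under zero-shot and few-shot prompting, without gradient-based task training. Zero-shot prompting was insufficient for all models. With few-shot prompting, three publicly available models (Llama-3.1-8B, Qwen2.5-7B, Mistral-7B) reached mean F1 scores of 0.776-0.817, with no significant difference between fixed global and per-chunk local example selection; Phi-3-mini was unstable. The viable models systematically over-classify tokens as CIUs (high recall but lower precision) and perform worst on more severe aphasia, so agreement with human annotation remains insufficient for fully autonomous use. The results support LLM-based CIU identification as a human-in-the-loop component of discourse assessment systems.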

Link: https://arxiv.org/abs/2606.15696
Authors: Jason M Pittman,Yesenia Medina-Santos,Anton Phillips Jr.,Brielle C. Stark
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 5 tables, 4 figures

Abstract:Correct Information Units (CIUs) are central to discourse assessment in aphasia because they quantify communicative informativeness rather than linguistic form alone. However, CIU scoring is time intensive and requires trained raters. This study examined whether instruction-tuned large language models (LLMs) can reliably perform token-level CIU classification from aphasic discourse transcripts. Sixteen picture-description transcripts elicited with the Cat Rescue stimulus were annotated for CIU status according to Nicholas and Brookshire (1993). The sample spanned four severity strata: control, mild, moderate, and severe aphasia. Four publicly available instruction-tuned LLMs were benchmarked under zero-shot and two few-shot prompting conditions across five stratified random seeds. Performance was evaluated against consensus human labels using accuracy, precision, recall, F1, and Cohen’s kappa. Zero-shot prompting was insufficient across models. In contrast, few-shot prompting yielded substantial gains and produced competitive performance for three viable models. Mean few-shot F1 scores ranged from 0.776 to 0.817 across Llama-3.1-8B, Qwen2.5-7B, and Mistral-7B, with no significant differences between fixed global and per-chunk local example selection. Phi-3-mini was unstable and did not yield reliable performance. Viable models showed high recall but lower precision, suggesting systematic over-classification of tokens as CIUs. Performance also varied by discourse severity, with the weakest results in more severe aphasia. Few-shot LLM prompting can support automated CIU identification without gradient-based task training, but agreement with human annotation remains insufficient for fully autonomous use. These findings support LLM-based CIU scoring as a promising human-in-the-loop component of discourse assessment systems.
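
The agreement metrics named in this abstract (accuracy, precision, recall, F1, Cohen's kappa) can be computed for token-level binary CIU labels as in the sketch below. The function name `binary_agreement` and the example labels are hypothetical; the example is constructed to show the over-classification pattern the abstract reports, where recall is high and precision is lower.

```python
from typing import Dict, List


def binary_agreement(gold: List[int], pred: List[int]) -> Dict[str, float]:
    """Compare predicted CIU labels (1 = CIU, 0 = not CIU) against human labels."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    n = len(gold)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / n

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_chance = ((tp + fn) / n) * ((tp + fp) / n) + ((tn + fp) / n) * ((tn + fn) / n)
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}


# Made-up labels: the model marks every true CIU but also two non-CIU tokens.
gold = [1, 1, 1, 0, 0, 1, 0, 1]
pred = [1, 1, 1, 1, 0, 1, 1, 1]
scores = binary_agreement(gold, pred)
print(scores)
```

Reporting kappa next to F1 matters here because F1 alone can look strong when most tokens are CIUs, while chance-corrected agreement stays modest.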

[NLP-111] MosaicQuant: Inlier-Outlier Disaggregation for Unified 4-Bit LLM Quantization

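【Quick Read】: This paper addresses the accuracy loss of 4-bit quantization for large language models (LLMs): the limited bit width struggles to represent both dense common values (inliers) and rare large-magnitude values (outliers), causing substantial accuracy degradation. Existing mixed-precision methods keep outliers in high precision, but this breaks the uniformity of low-bit execution and introduces precision conversion and extra data movement that undermine practical speedup. The proposed MosaicQuant is a unified 4-bit quantization paradigm built on inlier-outlier disaggregation. It quantizes the full weight matrix into a dense 4-bit base component that captures inliers faithfully while outliers are unavoidably quantized as well, then adds a sparse 4-bit residual component that compensates for the resulting errors, selectively targeting the most error-critical weight blocks, where output distortion is concentrated. Because running the sparse residual as a separate kernel would still break the unified pipeline, the authors also introduce ZipperEngine, which fuses the sparse block computation into the dense 4-bit GEMM kernel through an overlapped pipeline, unifying both representation and execution. Experiments on LLaMA3 and Qwen3 show that MosaicQuant preserves near-FP16 accuracy while achieving up to 1.24x speedup over the W16A16 baseline.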

Link: https://arxiv.org/abs/2606.15652
Authors: Yangjia Hu,Haodong Wang,Zicong Hong,Qianli Liu,Quanxin Shou,Jian Lin,Song Guo,Xiaowei Shen,Xiangjun Huang,Dian Wang,Jian Yang
Affiliation: HKUST; EPFL; MetaX Integrated Circuits Co., Ltd
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 17 pages

Abstract:4-bit quantization significantly reduces the memory footprint and accelerates the inference of large language models (LLMs). However, its limited bit-width representation struggles to faithfully capture both dense common values (inliers) and rare large-magnitude values (outliers), causing substantial accuracy degradation. Existing mixed-precision methods mitigate this by retaining outliers in high precision, but at the cost of breaking the uniformity of low-bit execution, introducing precision conversion and extra data movement that undermine practical speedup. We propose MosaicQuant, a unified 4-bit LLM quantization paradigm built on a novel principle of inlier-outlier disaggregation. Rather than elevating outlier precision, MosaicQuant quantizes the full weight matrix into a dense 4-bit base component, where inliers are captured faithfully while outliers are inevitably quantized. A sparse 4-bit residual component is then introduced to compensate for these quantization errors, selectively targeting the most error-critical weight blocks where output distortion is shown to be concentrated. However, a unified representation alone is insufficient, as naïvely executing the sparse residual as a separate kernel still breaks the unified low-bit inference pipeline. To bridge this gap, we introduce ZipperEngine, which fuses sparse block computation into the dense 4-bit GEMM kernel via an overlapped pipeline, unifying not only the representation but also the execution into a single coherent low-bit inference pipeline. Extensive experiments on LLaMA3 and Qwen3 demonstrate that MosaicQuant preserves near-FP16 accuracy while achieving up to 1.24× speedup over the W16A16 baseline.
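
The "dense 4-bit base plus sparse 4-bit residual" idea in this abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration rather than the paper's algorithm: weights are a flat list instead of a matrix, quantization is symmetric with a single scale, and blocks are ranked by squared weight error, whereas the abstract says the method targets blocks by output distortion. The function names are hypothetical.

```python
from typing import List, Tuple


def quantize_int4(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric 4-bit quantization: integer codes in [-8, 7] plus one shared scale."""
    peak = max((abs(v) for v in values), default=0.0)
    scale = peak / 7.0 if peak > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    return [c * scale for c in codes]


def base_plus_sparse_residual(weights: List[float], block_size: int,
                              num_residual_blocks: int) -> List[float]:
    """Reconstruct weights from a dense 4-bit base and 4-bit residuals on a few blocks."""
    codes, scale = quantize_int4(weights)          # dense base over all weights
    approx = dequantize(codes, scale)
    residual = [w - a for w, a in zip(weights, approx)]

    # Rank blocks by squared quantization error (a simplification, see lead-in).
    starts = list(range(0, len(weights), block_size))
    worst_first = sorted(
        starts,
        key=lambda s: sum(r * r for r in residual[s:s + block_size]),
        reverse=True,
    )

    # Quantize the residual of only the selected blocks, again to 4 bits.
    for start in worst_first[:num_residual_blocks]:
        r_codes, r_scale = quantize_int4(residual[start:start + block_size])
        for offset, fix in enumerate(dequantize(r_codes, r_scale)):
            approx[start + offset] += fix
    return approx


def squared_error(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


# One outlier (3.5) stretches the shared scale, so the small inliers collapse to zero.
weights = [0.02, -0.05, 0.03, 0.01, 0.04, -0.02, 3.5, -0.03]
base_only = base_plus_sparse_residual(weights, block_size=4, num_residual_blocks=0)
compensated = base_plus_sparse_residual(weights, block_size=4, num_residual_blocks=1)
print(squared_error(weights, base_only), squared_error(weights, compensated))
```

Both components stay at 4 bits, which is the property the abstract contrasts with mixed-precision schemes; the kernel fusion attributed to ZipperEngine is outside the scope of this sketch.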

[NLP-112] Extending Item Response Theory for Efficient and Meaningful Multilingual Evaluation

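【Quick Read】: This paper addresses three problems of multilingual benchmarks for evaluating large language models (LLMs): exhaustive evaluation scales linearly with the number of languages, automatic translation introduces errors that are easily missed at scale, and some items conflate general knowledge with culture-specific knowledge. It proposes a unified statistical framework, Multilingual-IRT, which extends Item Response Theory (IRT) with per-language difficulty deviations, a split discriminability that separates content effects from language effects, and per-language ability residuals. Fitted on 25 LLMs across 29 languages of MMLU-Pro-X, the model's parameters support three applications: predicting unobserved (item, LLM, language) instances with 11-16% lower binary cross-entropy than the strongest accuracy-based baseline; surfacing candidate translation errors spread across all 28 non-English languages, whereas accuracy-based baselines concentrate their detections in a few languages; and recovering culture-specific items that accuracy-based baselines miss.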

Link: https://arxiv.org/abs/2606.15643
Authors: Gili Lior,Tzviel Frostig,Gabriel Stanovsky,Matan Eyal
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Multilingual benchmarks are central to evaluating large language models (LLMs) across languages, but they suffer from three issues: exhaustive evaluation scales linearly with the number of languages, automatic translation introduces errors that are easily missed at scale, and some items conflate general and culture-specific knowledge. We address all three with a unified statistical framework, Multilingual-IRT, which extends Item Response Theory with per-language difficulty deviations, split discriminability separating content from language effects, and per-language ability residuals. Fitting Multilingual-IRT on 25 LLMs across 29 languages of MMLU-Pro-X, we show that its fitted parameters support three practical applications: predicting unobserved (item, LLM, language) instances with 11-16% lower binary cross-entropy than the strongest accuracy-based baseline, surfacing candidate translation errors distributed across all 28 non-English languages, whereas accuracy-based baselines concentrate detections in a few languages, and recovering culture-specific items that accuracy-based baselines miss.
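
To make the ingredients named in this abstract concrete, the sketch below writes out a standard two-parameter logistic IRT response function with two of the extensions added as simple additive terms: a per-language difficulty deviation and a per-language ability residual. The exact parametrization of Multilingual-IRT is not given in the abstract, so this form is an assumption, the split discriminability is left out (a single `discrimination` value is used), and all names are hypothetical. Binary cross-entropy is included because it is the metric the abstract reports.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def p_correct(ability: float, ability_residual: float,
              difficulty: float, difficulty_deviation: float,
              discrimination: float = 1.0) -> float:
    """Probability that a model answers an item correctly in a given language.

    ability              : model ability shared across languages
    ability_residual     : assumed per-(model, language) adjustment
    difficulty           : item difficulty shared across languages
    difficulty_deviation : assumed per-(item, language) adjustment
    """
    effective_ability = ability + ability_residual
    effective_difficulty = difficulty + difficulty_deviation
    return sigmoid(discrimination * (effective_ability - effective_difficulty))


def binary_cross_entropy(labels: List[int], probs: List[float]) -> float:
    """Mean binary cross-entropy between 0/1 outcomes and predicted probabilities."""
    eps = 1e-12
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# Same item and model; the item is assumed harder in one language than in another.
p_easy_language = p_correct(0.5, 0.0, 0.0, 0.0)
p_hard_language = p_correct(0.5, 0.0, 0.0, 1.2)
print(p_easy_language, p_hard_language)
```

Under a form like this, an item whose fitted difficulty deviation is unusually large in one language would be a natural candidate for a translation error, which is in the spirit of the application the abstract describes.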

[NLP-113] Distilling Examples into Task Instructions: Enhanced In-Context Learning for Real-World B2B Conversations

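【Quick Read】: This paper addresses the degradation of traditional in-context learning (ICL) when classifying semantically complex, multi-party B2B conversations in low-resource settings, where concatenating multiple few-shot examples makes the context long. It introduces the Call Playbook dataset, with five classification tasks derived from real-world B2B conversations and targeting core sales concepts, and proposes knowledge extraction methods that distill verbose examples into compact, interpretable structured classification criteria and precise task descriptions. The approach reduces token usage by 99% and improves macro-averaged AUC by up to 7% over traditional ICL. It also remains robust as context grows, whereas advanced token compression baselines degrade by more than 9 F1 points. The framework further allows the classification logic to be refined directly, which serves the transparency, efficiency, and user-interaction needs of real-world NLP applications.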

Link: https://arxiv.org/abs/2606.15641
Authors: Guy Rotman,Adi Kopilov,Danit Berger Zalmanson,Omri Allouche
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted for publication in Findings of the Association for Computational Linguistics 2026

Abstract:In-context learning (ICL) is the standard method for low-resource classification, yet its efficacy in specialized domains remains largely unexplored. We address the challenge of classifying semantically complex, multi-party B2B conversations, where traditional ICL encounters significant limitations, especially as context length increases due to the concatenation of multiple few-shot examples. We introduce the Call Playbook dataset, featuring five classification tasks derived from real-world B2B conversations targeting core sales concepts. To bridge the gap between performance and practical utility, we propose novel knowledge extraction methods that distill verbose examples into compact, interpretable representations of structured classification criteria and precise task descriptions. Our approach achieves a 99% reduction in token usage and improves macro-averaged AUC by up to 7% over traditional ICL. Notably, it remains robust as context grows, unlike advanced token compression baselines which degrade by over 9 F1 points. Importantly, our framework enables direct refinement of classification logic, addressing critical needs for transparency, efficiency, and user interaction in real-world NLP applications.

[NLP-114] Re-feeding Is Not Replaying: Measuring Replay Noise in Counterfactual Token-Credit Estimation

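【Quick Read】: This paper examines the reliability of per-token counterfactual credit estimation, which asks which token in a language-model rollout caused the final answer to be right or wrong. Published methods re-feed the transcript prefix as a fresh prompt, assuming this reproduces the state the model passed through during generation. The paper measures what this assumption costs on a stock inference engine with a three-pass design: continuations resumed from the verified decode-time KV state, an identical second exact pass that serves as a replica noise floor, and a re-feed pass. At low-margin decision tokens, re-feeding changes the credit estimate at rates 14-28 percentage points above the replica floor. Most changes are zero-boundary crossings of the quantized estimator rather than polarity reversals, and the perturbation is consistent with mean-zero, so averaged quantities are largely safe. Selection is not: a critical-token set chosen by thresholding under re-feed overlaps the exact-resume selection at Jaccard 0.34-0.90, against a replica ceiling of 0.63-0.96. As a causal confirmation, under vLLM's batch-invariant kernels all three passes are identical, with both disagreement rates exactly zero. The paper recommends that counterfactual credit studies resume decoder state or use batch-invariant kernels, and report a replica noise floor.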

Link: https://arxiv.org/abs/2606.15621
Authors: Nils Matteson
Affiliation: Northeastern University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 10 pages, 3 figures. Code, per-pivot data, logs, and registration: this https URL (benchmarks/, paper/refeed-drift/)

Abstract:Per-token counterfactual credit estimation asks which token in a language-model rollout caused the final answer to be right or wrong: cut the transcript at a pivot, substitute an alternative token, replay continuations, and compare outcomes. Published methods re-feed the transcript prefix as a fresh prompt, assuming this reproduces the state the model passed through during generation. We measure what that assumption costs on a stock inference engine, with a three-pass design: continuations resumed from the verified decode-time KV state, an identical second exact pass (a replica noise floor), and a re-feed pass. Across six configurations and three models (including a GRPO-trained checkpoint), at low-margin decision tokens, re-feeding changes the credit estimate at rates 14-28 percentage points above the replica floor (7-21pp under a treatment-independent conditioning; problem-clustered t = 2.9-6.4). Most changes are zero-boundary crossings of the quantized estimator rather than polarity reversals, and the perturbation is consistent with mean-zero, so averaged quantities are largely safe; but selection is not: a critical-token set chosen by thresholding |\hat{A}_t| under re-feed overlaps the exact-resume selection at Jaccard 0.34-0.90, versus a 0.63-0.96 replica ceiling. A causal confirmation closes the loop: under vLLM’s batch-invariant kernels all three passes are identical on every measured channel, with both disagreement rates exactly zero. Replica passes themselves disagree on 9-23% of eligible estimates: single-sample credit measurements at decision tokens are unreliable under any replay. Settings were fixed in advance; exact-pass cache hits in the second campaign are instrumented (100% hit rate, 3,434 pivots); total compute was under 10 USD. We recommend that counterfactual credit studies resume decoder state or use batch-invariant kernels, and report a replica floor.
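
The selection-overlap comparison in this abstract (thresholding the absolute credit estimate, then comparing the selected token sets by Jaccard index) can be sketched as below. The credit values, the threshold, and the use of `>=` rather than `>` are made-up assumptions for illustration, and the function names are hypothetical.

```python
from typing import Dict, Set


def critical_tokens(credit: Dict[int, float], threshold: float) -> Set[int]:
    """Positions whose absolute credit estimate reaches the threshold."""
    return {pos for pos, value in credit.items() if abs(value) >= threshold}


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Intersection over union of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Made-up per-position credit estimates from two replay strategies.
exact_resume = {3: 0.50, 7: -0.25, 12: 0.00, 18: 0.75, 21: 0.25}
re_feed = {3: 0.50, 7: 0.00, 12: 0.25, 18: 0.75, 21: 0.25}

selected_exact = critical_tokens(exact_resume, threshold=0.25)
selected_refeed = critical_tokens(re_feed, threshold=0.25)
overlap = jaccard(selected_exact, selected_refeed)
print(sorted(selected_exact), sorted(selected_refeed), overlap)
```

In this toy case the two passes have nearly the same mean credit, yet the selected sets differ at two positions, which mirrors the abstract's point that averages can be largely safe while thresholded selection is not.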

[NLP-115] LLM Judges Have Dark Current: A Psychometric Datasheet for LLM-as-a-Judge Evaluation

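【Quick Read】: This paper addresses the lack of reproducibility and measurement reliability in LLM-as-a-judge evaluation of open-ended model outputs. Judges are usually reported as scalar accuracy, win-rate, or agreement devices, which ignores the properties and potential biases of the judge as a measurement instrument. The paper proposes a Judge Datasheet protocol that measures dark current under true-vacuum inputs, stable cross-sensitivity to same-quality surface variation, positional false preference, target sensitivity on a controlled quality ladder, and the criterion or operating point induced by tie instructions. Its direction-stability decomposition shows that an apparent Delta0 preference can be either a stable surface response or disguised position bias. In a case study of three open-weight judges, Llama-3.1-8B shows high dark current and presentation-conflicted Delta0 behavior; Qwen2.5-14B is vacuum-clean and target-sensitive but mixes stable and positional over-discrimination; Qwen2.5-32B is vacuum-clean with low stable cross-sensitivity and low positional false preference. A strict tie criterion eliminates Qwen2.5-32B's Delta0 false preference but absorbs marginal Delta1 target signals into ties while preserving Delta5 sensitivity. The results indicate that prompting moves the criterion, not the resolution. The contribution is not a confirmation of the downstream mechanism hypothesis that motivated the work, but a metrological protocol for characterizing the judge itself before downstream claims are made.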

Link: https://arxiv.org/abs/2606.15610
Authors: Hiroyasu Usami,Keisuke Hara,Ayato Tsuboi,Naohiko Matsuda
Affiliations: Chubu University; Mitsubishi Heavy Industries, Ltd., Research Innovation Center
Categories: Computation and Language (cs.CL); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 22 pages, 4 figures

Abstract:LLM-as-a-judge systems are now routinely used for open-ended model evaluation, where human preference annotation is costly, slow, and difficult to reproduce. Yet these judges are often reported as scalar accuracy, win-rate, or agreement devices. We argue that a judge should instead be reported as a measurement instrument. We introduce a Judge Datasheet protocol that measures dark current under true-vacuum inputs, stable cross-sensitivity to same-quality surface variation, positional false preference, target sensitivity on a controlled quality ladder, and the criterion or operating point induced by tie instructions. The direction-stability decomposition reveals that apparent Delta0 preference can be stable surface response or disguised position bias. In a three-judge open-weight case study, Llama-3.1-8B shows high dark current and presentation-conflicted Delta0 behavior, Qwen2.5-14B is vacuum-clean and target-sensitive but mixes stable and positional over-discrimination, and Qwen2.5-32B is vacuum-clean with low stable cross-sensitivity and low positional false preference. A strict tie criterion eliminates Qwen32B Delta0 false preference but absorbs marginal Delta1 target signals into ties while preserving Delta5 sensitivity. The results show that prompting moves the criterion, not the resolution. We do not claim that the downstream mechanism hypothesis that motivated this work is confirmed; the contribution is a metrological protocol for measuring the measuring device before downstream claims are made.

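[NLP-116] LLM-Assisted Stance Detection in Scientific Discourse: A Test Case in Bayesian Cognitive Science

[Quick read]: This paper addresses how to scale expert annotation in social science research when the target construct is interpretive, theoretically loaded, and only indirectly expressed. The core challenge is to identify accurately whether the psychology and cognitive science literature treats Bayesian models as descriptions of mental and neural mechanisms (realism) or as useful mathematical tools (instrumentalism). The key to the solution is a theory-driven codebook combined with expert-coded reference annotations, a diagnostic-gated prompt-optimization search that yields a shared zero-shot prompt for three frontier LLMs (GPT-5.1, Claude Sonnet 4.6, Gemini 3 Pro Preview), and a multi-rater reliability analysis to validate the results. The final prompt reaches a combined reliability score of 0.76 on the held-out set (harmonic mean of ICC = 0.79 and α = 0.74), with all diagnostics satisfied. Deployed on 6,858 quotes, the models reach substantial quote-level agreement (ICC = 0.80; α = 0.76; combined = 0.78) and near-perfect article-level rank stability (r = 0.96-0.97). The corpus is predominantly weakly realist, but most articles do not hold a single stance: only 1.4% of articles stay within one stance band, while 59.5% span four or more. Low-level perception/motor articles score significantly higher on realism than high-level cognition articles (p < 0.001, d = 0.60), quantifying a long-held qualitative intuition. The study is presented as an expert-led case study whose framework is intended to generalize to other theoretically demanding tasks that need fine-grained interpretation, not to all qualitative analysis.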

Link: https://arxiv.org/abs/2606.15566
Authors: Eyup Engin Kucuk,Tarik Kelestemur,Ömer Dağlar Tanrikulu
Affiliations: University of New Hampshire; Google DeepMind
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 figures; Code and data: this https URL

Abstract:Qualitative coding is central to social science, but expert annotation is difficult to scale. LLMs offer a possible extension, yet require careful validation when the target construct is interpretive, theoretically loaded, and only indirectly expressed. We study this problem in a difficult case: detecting whether authors treat Bayesian models as descriptions of mental and neural mechanisms (realism) or as useful mathematical tools (instrumentalism). Our method combines a theory-driven codebook, expert-coded reference annotations, a diagnostic-gated prompt-optimization search yielding a shared zero-shot prompt for three frontier LLMs (GPT-5.1, Claude Sonnet 4.6, Gemini 3 Pro Preview), and multi-rater reliability analysis. The final prompt achieved a held-out combined reliability score of 0.76 (harmonic mean of ICC = 0.79 and α = 0.74), with all diagnostics satisfied. Deployed on 6,858 quotes from 210 articles, the three LLMs reached substantial quote-level agreement (ICC = 0.80; α = 0.76; combined = 0.78) and near-perfect article-level rank stability (r = 0.96-0.97 across rater pairs). The corpus was predominantly weakly realist, but article-level stances were rarely uniform: only 1.4% of articles used a single band, while 59.5% spanned four or more. Low-level perception/motor articles scored 8.8 Realism points higher than high-level cognition articles (p < .001, d = 0.60), quantifying a long-held qualitative intuition. We present this as an expert-led case study; the framework is intended to generalize to similar theoretically demanding tasks, not to all qualitative analysis.
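The combined reliability score is described as the harmonic mean of ICC and α. As a quick arithmetic check, the sketch below recomputes both combined scores from the figures quoted in the abstract; the function name is illustrative only.

```python
def harmonic_mean(x: float, y: float) -> float:
    """Harmonic mean of two positive numbers."""
    return 2 * x * y / (x + y)


# ICC and alpha values as quoted in the abstract.
held_out = harmonic_mean(0.79, 0.74)   # held-out prompt evaluation
deployed = harmonic_mean(0.80, 0.76)   # deployment on the 6,858 quotes
print(round(held_out, 2), round(deployed, 2))
```

Both values round to the reported 0.76 and 0.78. The harmonic mean is pulled toward the lower of the two inputs, so a high ICC cannot mask a weak α.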

[NLP-117] EIBench: A Simulator-Based Benchmark and Turn-Credit RL for Emotion Management

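[Quick read]: This paper addresses the static, single-turn nature of most emotional intelligence (EI) evaluations for large language models (LLMs), which cannot measure how well a model manages a user's emotional and relational state over a multi-turn interaction. The core point is that an emotionally intelligent model should not only recognize the user's emotion but also improve the user's emotional experience and relational trust across several turns. The authors propose EIBench, a simulator-based benchmark for interactive emotion management with 2,222 scenarios (2,009 for training, 213 for testing), organized by a 2x2 taxonomy of Support, Defense, Repair, and Charm that covers emotional support, boundary maintenance, trust repair, and rapport building. A simulator tracks the emotion-relation state after each turn and maps the final state to an anchor-based score, so the platform serves two purposes: as an evaluation benchmark it provides an outcome reward, and through per-turn state updates it provides dense feedback for reinforcement learning (RL). Because current models are weak at boundary maintenance under user pressure, the authors further propose Centered Turn-Credit GRPO (CTC-GRPO), which keeps the final outcome reward while reusing the simulator's per-turn state changes as fine-grained feedback. CTC-GRPO improves Qwen3-8B on EIBench from -22.4 to +22.4 and also improves out-of-distribution evaluations, including SAGE (+12.4) and EQBench3 (+20.9%), supporting simulator-tracked user states as a basis for both training and evaluation.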

Link: https://arxiv.org/abs/2606.15532
Authors: Rongzhi Zhu,Xiang Huang,Yuchuan Wu,Rui Wang,Zequn Sun,Tao Ren,Weiyao Luo,Bingxue Qiu,Jieping Ye,Yongbin Li,Wei Hu
Affiliations: Alibaba Group
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Emotional intelligence (EI) in Large Language Models (LLMs) is often evaluated through static understanding tasks or single-response dialogue generation. However, emotion management is interactive: a good model should not only recognize a user’s emotion, but also improve the user’s emotional and relational state over several turns. We introduce EIBench, a simulator-based benchmark for interactive emotion management. EIBench contains 2,222 scenarios, with 2,009 for training and 213 for held-out testing. The scenarios are organized by a 2x2 taxonomy covering Support, Defense, Repair, and Charm, which together capture different forms of support, boundary maintenance, trust repair, and rapport building. In each scenario, an LLM simulator plays the user, updates an emotion-relation state after each turn, and maps the final state to an anchor-based score. This design makes EIBench both an evaluation benchmark and a training environment: the final state gives the outcome reward, while the per-turn state updates provide dense feedback for RL. We evaluate 15 open- and closed-source LLMs. Current models perform well on support and rapport-building scenes, but struggle with boundary maintenance under user pressure. To improve the EI ability of LLMs, we propose Centered Turn-Credit GRPO (CTC-GRPO), a GRPO extension that reuses the simulator’s per-turn state updates as dense turn-level feedback while preserving the final outcome reward. CTC-GRPO improves Qwen3-8B from -22.4 to +22.4 on EIBench and also improves on out-of-distribution evaluations including SAGE (+12.4) and EQBench3 (+20.9%). Our results show that simulator-tracked user states can support both evaluation and training for multi-turn emotion management.

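[NLP-118] Emergent retokenization symmetry in large language models: phenomenology and applications

[Quick read]: This paper addresses the representational redundancy that tokenization introduces: under a fixed vocabulary the same byte string admits many valid segmentations, yet a language-model tokenizer normally returns only one, the canonical segmentation. Breaking the representational symmetry in this way may affect inference behavior, and there is little reason to expect models to be robust to equivalent segmentations on downstream tasks. The key tool is retokenization, which replaces a prompt's canonical tokenization with an alternative segmentation while preserving the original bytes exactly. This isolates segmentation effects cleanly, without changing syntax, semantics, or surface form, and makes retokenization a useful probe of segmentation sensitivity, compositional understanding, and representation diversity. Experiments show that retokenization can hurt performance on easy tasks but, on harder ones, can recover solutions that conventional sampling does not reach. The work reveals a segmentation symmetry that partially emerges in the model's internal computation and proposes a new inference-time sampling axis based on semantically equivalent input representations.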

Link: https://arxiv.org/abs/2606.15521
Authors: Kanishk Jain,Matthew Day,Tankut Can
Affiliations: Emory University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Tokenization introduces representational redundancy: under a fixed token vocabulary, every byte string admits many valid token encodings, or segmentations, that decode to the same surface string. However, given a prompt, most language model tokenizers break this representational symmetry by returning a canonical segmentation. Training only on canonical segmentations should influence inference behavior, and there is little reason to expect models to respect segmentation symmetry on downstream tasks. We find that this symmetry partially emerges during training. Here, we probe this emergent symmetry through experiments testing token compositional understanding, representation diversity, and task focused benchmark performance. We primarily use retokenization: replacing a prompt’s canonical tokenization with an alternative segmentation while preserving its bytes exactly. Relative to other prompt perturbations, retokenization is unusually clean because it isolates segmentation effects without changing syntax, semantics or surface form. We use retokenization to study sensitivity and robustness to semantically identical input representations across pretraining and post-training. Moreover, this partial retokenization symmetry suggests a distinct inference-time sampling axis. While temperature sampling generates diverse outputs from the model using its next-token probability distribution, retokenization generates diversity from the model’s internal computations through semantically equivalent input representations. We find that while this retokenization sampling strategy can hurt performance on easy problems, it can also recover solutions that conventional sampling does not find. Overall, our work presents retokenization as a simple yet powerful probe of large language models, shedding light on compositional understanding and prompt sensitivity, and offering a novel sampling strategy.
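To make the redundancy concrete, the sketch below enumerates every segmentation of a string under a small vocabulary. It is a toy illustration under stated assumptions: the vocabulary is invented, it works on characters instead of bytes, and it is not the authors' retokenization procedure.

```python
def segmentations(text: str, vocab: set[str]) -> list[list[str]]:
    """All ways to split `text` into pieces drawn from `vocab`.

    Exhaustive recursion; the number of results can grow exponentially,
    so this is only suitable for short strings.
    """
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            for rest in segmentations(text[end:], vocab):
                results.append([piece] + rest)
    return results


# Hypothetical toy vocabulary.
toy_vocab = {"u", "n", "d", "o", "un", "do", "undo"}
all_segs = segmentations("undo", toy_vocab)
for seg in all_segs:
    print(seg)
```

Every listed segmentation concatenates back to the same string. A tokenizer would normally emit only one of them, and retokenization swaps in one of the others.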

[NLP-119] SHARD: Safe and Helpful Alignment via Self-Reframing Distillation

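[Quick read]: This paper addresses the safety-helpfulness trade-off that large language models show on sensitive prompts: they often refuse outright, produce generic safety boilerplate, or fail to address legitimate information needs that could be answered safely. The proposed solution is SHARD, a self-reframing distillation method. It first rewrites sensitive prompts under philosophical guidelines to surface their benign intent, then reframes the model's original responses into safer and more helpful ones, and finally fine-tunes the model on its own reframed data. On DNA and the English subset of LINGUASAFE, SHARD improves helpfulness for most model families while preserving safety, and it remains competitive with distillation from a larger teacher model, suggesting that models can internalize safe and helpful behavior that they generate themselves.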

Link: https://arxiv.org/abs/2606.15517
Authors: Viswonathan Manoranjan,Amogh Gupta,Anvesh Rao Vijjini,Thomas Hofweber,Snigdha Chaturvedi
Affiliations: UNC Chapel Hill
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models often struggle with sensitive prompts. They may refuse outright, provide generic safety boilerplate, or fail to address the user’s legitimate informational needs that can be answered safely. We introduce SHARD, a self-reframing distillation method to improve safe-helpfulness. It first rewrites sensitive prompts to surface benign intent using philosophical guidelines, then reframes its original responses into safe, more helpful ones, and finally fine-tunes the model on its self-reframed responses. Across DNA and the English subset of LINGUASAFE, SHARD improves helpfulness for most model families while preserving safety. It also remains competitive with distillation from a larger teacher model, suggesting that models can internalize safe and helpful behavior elicited from their own. Warning: This paper contains content that may be offensive or harmful.

[NLP-120] AthDGC: An Open Diachronic Greek Treebank with Indo-European Parallels

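[Quick read]: This paper addresses the scarcity of diachronic syntactic annotation for Greek, in particular the absence of an openly available dependency treebank that covers eight historical periods (from Archaic to Modern Greek) under one standard. The core challenge is to produce high-quality, consistent annotation and alignment across periods, texts, and languages. The solution is AthDGC, an open end-to-end workflow and dataset built on the PROIEL XML 2.0 schema, which aligns the Greek New Testament at verse level with versions in Latin (Vulgate), Gothic (Wulfila), Old Church Slavonic (Marianus), and Classical Armenian. Annotation uses the PROIEL-trained Stanford Stanza workflow, sentence-level alignment uses LaBSE, and word-level alignment uses multilingual-BERT attention through the AwesomeAlign procedure. The authors describe it as, to the best of their knowledge, the first openly licensed dependency-parsed treebank of Greek spanning these eight periods.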

Link: https://arxiv.org/abs/2606.15510
Authors: Nikolaos Lavidas,Kiki Nikiforidou,Dag Haug,Leonid Kulikov,Vassiliki Geka,Vassileios Symeonidis,Theodoros Michalareas,Sofia Chionidi,Anastasia Tsiropina,Eleni Plakoutsi,Evangelos Argyropoulos
Affiliations: unknown
Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL)
Comments: 16 pages. Data paper for the v0.4 release of AthDGC. Concept DOI: https://doi.org/10.5281/zenodo.20439182 . Companion site: this https URL

Abstract:AthDGC (“Athens-PROIEL”) is an open, end-to-end workflow and dataset. It is, to the best of our knowledge, the first openly licensed dependency-parsed treebank of Greek that spans eight diachronic periods, namely Archaic, Classical, Koine, Late Antique, Byzantine, Late Byzantine, Early Modern, and Modern Greek, under a single PROIEL XML 2.0 schema, with verse-level cross-alignment of the New Testament to Latin (Vulgate), Gothic (Wulfila), Old Church Slavonic (Marianus), and Classical Armenian. AthDGC builds on the PROIEL Treebank Family (Haug and Johndal 2008; Eckhoff et al. 2018), which established the schema and the Koine-Greek reference set for the project. Annotation uses the Stanford Stanza PROIEL-trained workflow; sentence-level alignment uses LaBSE, a multilingual sentence-embedding model; word-level alignment uses multilingual-BERT attention through the AwesomeAlign procedure. The v0.4 release provides curated samples and the open-source toolkit; the full annotated corpus partitions remain under v0.5 audit on the Greek national HPC. Quantitative scale, per-witness verse counts, and per-period annotated-row counts are reported in the v0.5 release notes, after the audit pass completes. Concept DOI: https://doi.org/10.5281/zenodo.20439182.

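[NLP-121] Evaluative Judgement in Teaching AI-based Translation: A Classroom Case Study of AI-Mediated Translation and Post-Editing

[Quick read]: This paper examines how, when teaching AI-mediated translation, a structured comparison of general-purpose large language models (LLMs) and online machine translation (MT) systems can elicit learners' evaluative judgement of translation output. The approach is an authentic assignment in a fourth-year undergraduate translation course: students translate specialised English Wikipedia texts into Catalan or Spanish, generate output from four systems, evaluate it with automatic metrics and human adequacy/fluency ratings, select one output for post-editing, and justify the decision in a written report. Students did not treat automatic metrics as the final authority. Their final selections often diverged from the metric rankings and were justified through adequacy, fluency, terminology, naturalness, and expected post-editing effort. The study therefore does not benchmark systems under controlled conditions; it analyses how students build and justify system choices in a real classroom setting.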

Link: https://arxiv.org/abs/2606.15483
Authors: Gokhan Dogru
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: Workshop on Teaching AI-based Translation and Technologies (TAITT 2026) - EAMT 2026

Abstract:Drawing on 23 anonymized student projects from a fourth-year Machine Translation and Post-editing course in a BA-level translation programme, this paper examines how structured comparison of general-purpose LLMs and online MT systems can elicit evaluative judgement in AI-mediated translation. Students translated short specialised English Wikipedia texts into Catalan or Spanish, generated four system outputs, evaluated them using automatic metrics and human adequacy/fluency assessment, selected one output for post-editing, and justified their decision in written reports. Descriptive counts are reported for all 23 projects, while qualitative interpretation is based on the 22 cases accompanied by written reports. Results show that students did not treat automatic metrics as final authority: final post-editing selections often diverged from metric rankings and were justified through adequacy, fluency, terminology, naturalness, and expected post-editing effort. The study therefore does not benchmark systems under controlled conditions; it analyses how students justified system choice within an authentic classroom assignment.

[NLP-122] ESBMC-PLC: Formal Verification of IEC 61131-3 Ladder Diagram Programs Using SMT-Based Model Checking

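[Quick read]: This paper addresses the lack of formal verification for the safety-critical programs that programmable logic controllers (PLCs) run in industrial control. Ladder diagram (LD), the dominant PLC notation, is widely used in industrial automation, but its graphical rung-and-coil structure cannot be processed directly by existing SMT-based model checkers, so LD programs are hard to verify formally. The authors present ESBMC-PLC, described as the first open-source formal verifier with native LD support (PLCopen XML format). It translates LD rungs into the GOTO intermediate representation (IR) and models the PLC scan cycle as a while(true) loop with nondeterministic inputs, which allows safety properties to be checked by SMT-based bounded model checking or k-induction. A five-property YAML language (mutual_exclusion, invariant, absence, response, reachability) avoids the need for temporal logic. The evaluation covers 13 benchmarks (6 domains, 3 sources, including deployed CONTROLLINO PLCs and code from MathWorks Simulink PLC Coder) and reports correct classification across 61 properties, 8 bugs found with actionable counterexamples, 7 unbounded k-induction proofs, and all runs completing in under 60 ms on Apple Silicon. A feature comparison with PLCverif indicates that ESBMC-PLC is the only open-source tool combining native LD support, k-induction, and SMT bit-vector semantics, and the paper states that it closes two of the four research gaps identified in its survey.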

Link: https://arxiv.org/abs/2606.15461
Authors: Pierre Dantas,Lucas Cordeiro,Waldir Junior
Affiliations: The University of Manchester; Federal University of Amazonas
Categories: Computation and Language (cs.CL); Hardware Architecture (cs.AR)
Comments: 24 pages

Abstract:PLCs execute safety-critical programs across industrial sectors. The dominant PLC notation, ladder diagram (LD) per IEC 61131-3, remains absent from formal verification: SMT-based model checkers cannot process LD’s rung-and-coil graphics. This paper presents ESBMC-PLC, the first open-source formal verifier with native LD support (PLCopen XML format), implemented as a new ESBMC frontend. ESBMC-PLC translates LD rungs to GOTO IR, models the PLC scan cycle as a while(true) loop with nondeterministic inputs, and checks safety properties via SMT-based bounded model checking or k-induction. A five-property YAML language (mutual_exclusion, invariant, absence, response, reachability) avoids temporal logic. A survey of 22 studies (2020-2026) identifies four research gaps; ESBMC-PLC closes two of them. Evaluation on 13 benchmarks (6 domains, 3 sources - including deployed CONTROLLINO PLCs and MathWorks Simulink PLC Coder) shows correct classification across 61 properties: all 9 author-constructed programs (Categories A/B) as expected, all 4 vendor programs (Category C) correctly unlabeled, with 8 bugs found (actionable counterexamples), 7 unbounded k-induction proofs, all runs under 60ms on Apple Silicon. Feature comparison with PLCverif shows that ESBMC-PLC is the only open-source tool that combines native LD, k-induction, and SMT bit-vector semantics.
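The scan-cycle model in the abstract (a loop that reads nondeterministic inputs and re-evaluates the rungs) can be illustrated with a small explicit-state check. The sketch below is not ESBMC-PLC and uses no SMT solver: it enumerates every input combination for a bounded number of scans. The two-rung motor interlock, the function names, and the bound are all hypothetical.

```python
from itertools import product


def interlocked(state, inputs):
    """One scan of a forward/reverse motor latch with interlock contacts."""
    fwd, rev = state
    start_fwd, start_rev, stop = inputs
    fwd = (start_fwd or fwd) and not stop and not rev   # rung 1
    rev = (start_rev or rev) and not stop and not fwd   # rung 2 sees the new fwd
    return fwd, rev


def no_interlock(state, inputs):
    """Same latch with the interlock contacts left out (deliberately buggy)."""
    fwd, rev = state
    start_fwd, start_rev, stop = inputs
    return (start_fwd or fwd) and not stop, (start_rev or rev) and not stop


def mutual_exclusion(state):
    fwd, rev = state
    return not (fwd and rev)


def bounded_check(scan, prop, init, n_inputs, bound):
    """Explore all input sequences up to `bound` scans.

    Returns the input trace that first violates `prop`, or None.
    """
    frontier, seen = [(init, [])], {init}
    for _ in range(bound):
        next_frontier = []
        for state, trace in frontier:
            for inputs in product([False, True], repeat=n_inputs):
                new_state = scan(state, inputs)
                new_trace = trace + [inputs]
                if not prop(new_state):
                    return new_trace
                if new_state not in seen:
                    seen.add(new_state)
                    next_frontier.append((new_state, new_trace))
        frontier = next_frontier
    return None


safe = bounded_check(interlocked, mutual_exclusion, (False, False), 3, bound=5)
cex = bounded_check(no_interlock, mutual_exclusion, (False, False), 3, bound=5)
print(safe, cex)
```

The interlocked program yields no violation within the bound, while the variant without interlock contacts fails in a single scan when both start inputs are pressed. A symbolic checker would encode the same transition relation as constraints instead of enumerating inputs.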

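[NLP-123] Pepti-Agent: An AI Agent for Peptide Design and Optimization

[Quick read]: This paper addresses multi-property optimization in therapeutic peptide development: solubility, hemolytic activity, and nonspecific surface fouling are governed by overlapping sequence features, so improving one property often degrades another. Computational design pairs generative models with sequence-based property predictors and refines candidates iteratively, but the components are usually wired together as tightly coupled scripts that are hard to inspect, extend, or reuse, and refinement often relies on natural-language reasoning instead of tracking each candidate's evolving multi-property state. Pepti-Agent exposes generation, property prediction, and single-residue mutation as independently inspectable Model Context Protocol (MCP) tools coordinated by a large language model controller. The controller consults live predictor output between calls, so refinement follows each candidate's current property profile and not language-level inference alone. Task-specific PeptideGPT models generate candidates, ProtBERT-based classifiers score solubility, hemolysis, and non-fouling, and interchangeable mutation operators edit sequences. A per-step trace of decisions, predictions, and accepted mutations provides a reproducible basis for benchmarking multi-objective design strategies and for prioritizing candidates for experimental validation.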

Link: https://arxiv.org/abs/2606.15422
Authors: Houxu Chen,Achuth Chandrasekhar,Amir Barati Farimani
Affiliations: unknown
Categories: Computation and Language (cs.CL); Biomolecules (q-bio.BM)
Comments:

Abstract:Therapeutic peptides occupy a valuable design space between small molecules and biologics, but their development requires satisfying several competing constraints at once: solubility, hemolytic activity, and nonspecific surface fouling are governed by overlapping sequence features, so improving one property often degrades another. Computational design addresses this by pairing generative models with sequence-based property predictors, iteratively proposing and refining candidates. However, these components are typically wired together as monolithic scripts that are difficult to inspect, extend, or reuse, and they often refine sequences by natural-language reasoning rather than by tracking the evolving multi-property state of each candidate. We present Pepti-Agent, a closed-loop, peptide-specific framework that exposes generation, property prediction, and single-residue mutation as independently inspectable Model Context Protocol (MCP) tools. A large language model controller invokes these tools and consults live predictor output between calls, so refinement is guided by each sequence’s current property profile rather than by language reasoning alone. Task-specific PeptideGPT models generate candidates, ProtBERT-based classifiers score solubility, hemolysis, and non-fouling, and two interchangeable mutation operators propose sequence edits. By recording a per-step trace of controller decisions, predictor outputs, and accepted mutations, Pepti-Agent offers a reproducible substrate for benchmarking multi-objective design strategies and for prioritizing candidates for experimental validation.

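[NLP-124] Let LLMs Judge Each Other: Multi-Agent Peer-Reviewed Reasoning for Medical Question Answering

[Quick read]: This paper addresses the limited accuracy, interpretability, and robustness of large language models (LLMs) in medical question answering (MedQA). Existing methods mostly rely on single-model chain-of-thought (CoT) reasoning or majority-vote ensembles, which cannot reliably screen out flawed reasoning paths and lack a mechanism for assessing reasoning quality. The proposed multi-agent peer-reviewed reasoning method has several LLM agents independently generate CoT reasoning with candidate answers, then act as peer reviewers that rate each other's reasoning for factual correctness and logical soundness; the highest-rated chain produces the final answer. This cross-review raises accuracy (the best model combination averages 0.820 across three benchmark datasets), improves interpretability and robustness, and scales with more participating models. It outperforms single-model CoT (best 0.777) and majority-vote ensembles (up to 0.789), suggesting that judging reasoning quality, and not answer agreement alone, is a promising direction for trustworthy biomedical AI systems.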

Link: https://arxiv.org/abs/2606.15419
Authors: Zaifu Zhan,Shuang Zhou,Rui Zhang
Affiliations: University of Minnesota
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by the Journal of the American Medical Informatics Association

Abstract:Objective: To enhance the accuracy, interpretability, and robustness of large language models (LLMs) in medical question answering (MedQA). Method: We designed a multi-agent peer-reviewed reasoning method in which multiple LLM agents independently generate chain-of-thought reasoning with candidate answers, then act as peer reviewers to evaluate each other’s reasoning for factual correctness and logical soundness. The highest-rated reasoning chain is selected to produce the final answer. Experiments were conducted with five state-of-the-art LLMs (Llama-3.1-8B, Qwen2.5-7B, Phi-4, DeepSeek-LLM-7B, GPT-oss-20B) on three benchmark datasets: HeadQA, MedQA-USMLE, and PubMedQA. Performance was compared against single-model chain-of-thought reasoning and chain-of-thought-based majority voting. Results: Peer-reviewed reasoning consistently outperformed both baselines. The best model combination achieved an average accuracy of 0.820 across datasets, exceeding the strongest single model (0.777) and majority voting ensembles (up to 0.789). The method also scaled effectively with more participating models, while peer assessments reliably distinguished high- from low-quality reasoning chains. Conclusion: The proposed multi-agent peer-reviewed reasoning method enables LLMs to act as both solvers and evaluators, yielding superior performance in MedQA. By emphasizing reasoning quality rather than answer agreement alone, this approach improves accuracy, interpretability, and robustness, offering a promising direction for trustworthy biomedical AI systems. 
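The selection step described in the Method can be sketched as follows: each agent's reasoning chain is rated by the other agents, and the chain with the highest mean peer rating supplies the final answer. The agent names, the 1-5 rating scale, the exclusion of self-ratings, and the mean aggregation are assumptions for illustration; the abstract does not specify them.

```python
from collections import Counter
from statistics import mean


def select_by_peer_review(answers, ratings):
    """Return (agent, answer) for the chain with the highest mean peer rating.

    answers: agent -> candidate answer
    ratings: (reviewer, rated_agent) -> numeric rating; self-ratings are ignored.
    """
    def peer_score(author):
        return mean(r for (reviewer, rated), r in ratings.items()
                    if rated == author and reviewer != author)

    best_agent = max(answers, key=peer_score)
    return best_agent, answers[best_agent]


# Hypothetical run: two agents agree on "A", one answers "C" with a stronger chain.
answers = {"agent_1": "A", "agent_2": "A", "agent_3": "C"}
ratings = {
    ("agent_2", "agent_1"): 2, ("agent_3", "agent_1"): 3,
    ("agent_1", "agent_2"): 3, ("agent_3", "agent_2"): 2,
    ("agent_1", "agent_3"): 5, ("agent_2", "agent_3"): 4,
}

winner, final_answer = select_by_peer_review(answers, ratings)
majority_answer = Counter(answers.values()).most_common(1)[0][0]
print(winner, final_answer, majority_answer)
```

In this toy case majority voting returns "A" while peer review returns "C", which shows how the two aggregation rules can diverge when the minority chain is rated as better reasoned.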

[NLP-125] Encode Errors: Representational Retrieval of In-Context Demonstrations for Multilingual Grammatical Error Correction

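[Quick read]: This paper addresses the weak few-shot performance of large language models (LLMs) on grammatical error correction (GEC). The core challenge is retrieving in-context demonstrations that match the target error pattern instead of relying on semantic similarity. The key to the solution is the Grammatical Error Representation (GER), extracted from the model's internal states, which captures grammatical errors while remaining semantically neutral. GER-based retrieval improves few-shot performance on multilingual GEC datasets: for high-resource languages, 8B-sized open-source models match closed-source models such as Deepseek2.5 and GPT-4o-mini, and for low-resource languages F_0.5 scores exceed the baseline by up to a factor of 1.20. The method offers a more precise and resource-efficient solution for multilingual GEC and a promising direction for interpretable GEC research.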

Link: https://arxiv.org/abs/2606.15416
Authors: Guangyue Peng,Wei Li,Wen Luo,Houfeng Wang
Affiliations: Peking University
Categories: Computation and Language (cs.CL)
Comments: 15 pages, 6 figures

Abstract:Grammatical Error Correction (GEC) involves detecting and correcting the wrong usage of grammar. While large language models (LLMs) with in-context learning (ICL) capabilities have shown significant progress on various natural language processing (NLP) tasks, their few-shot performance on GEC remains suboptimal. This is mainly due to the challenge of retrieving suitable in-context demonstrations that capture error patterns instead of semantic similarity. In this paper, we demonstrate that LLMs can inherently capture information related to grammatical errors through their internal states. From these states, we extract the Grammatical Error Representation (GER), an informative and semantically neutral encoding of grammatical errors. Our novel GER-based retrieval method significantly boosts performance in ICL settings on multilingual GEC datasets, improving the precision of correction. For high-resource languages, our results on 8B-sized open-source models match those of closed-source models such as Deepseek2.5 and GPT-4o-mini. For low-resource languages, our F_0.5 scores surpass the baseline by up to a factor of 1.20. This method provides a more precise and resource-efficient solution for multilingual GEC, offering a promising direction for interpretable GEC research.
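The abstract reports results as F_0.5, the F-measure with β = 0.5, which weights precision more heavily than recall. A minimal sketch of the standard formula follows; the precision and recall values are made up for illustration and are not results from the paper.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


# Hypothetical system: fairly precise corrections, low recall.
f_half = f_beta(precision=0.6, recall=0.3, beta=0.5)
f_one = f_beta(precision=0.6, recall=0.3, beta=1.0)
print(f_half, f_one)
```

With the same precision and recall, F_0.5 comes out higher than F_1 here because the system's strength is precision, which is why a retrieval method that improves correction precision shows up clearly on this metric.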

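[NLP-126] Few-Shot Biomedical Relation Extraction with Large Language Models: A Viable Alternative to Supervised Learning?

[Quick read]: This paper addresses biomedical relation extraction (BioRE) in low-resource settings, where reliance on costly annotated data limits scaling across relation types and domains. The approach uses prompt-based learning with large language models (LLMs) and compares two task formulations: pairwise classification and joint generation. Pairwise classification achieves higher recall, whereas joint generation is more precise and computationally efficient. Under an imbalanced distribution of relation types, the prompt-based approaches exceed the supervised baseline on macro-F1, particularly on rare relation types, which highlights the potential of LLMs in few-shot conditions. Much of the remaining gap to the supervised baseline is attributed to a single ambiguously defined relation type, underscoring the importance of well-defined relation schemas.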

Link: https://arxiv.org/abs/2606.15412
Authors: Jakob Mraz,Tomaž Curk,Blaž Zupan
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Biomedical relation extraction (BioRE) is a key step in transforming biomedical literature into structured knowledge. However, most existing approaches rely on supervised models trained on costly annotated datasets, limiting their scalability and adaptability across relation types and domains. We investigate few-shot BioRE using prompt-based learning with large language models (LLMs) and compare two task formulations: pairwise classification, which predicts relations for individual entity pairs, and joint generation, which extracts multiple relations in a single model call. Experiments on the BioREDirect dataset reveal a clear precision-recall trade-off. Pairwise classification achieves higher recall, whereas joint generation is more precise and computationally efficient. The best-performing model achieves a micro-F1 score of 0.44, substantially outperforming previous few-shot results (0.34) while remaining below the supervised baseline (0.56). Much of this gap is attributable to a single ambiguously defined relation type. When evaluated using macro-F1, which better captures performance across relation types in an imbalanced setting, prompt-based approaches outperform the supervised baseline (0.45 vs. 0.38), particularly on rare relation types. These findings highlight the potential of LLMs for BioRE in low-resource settings and underscore the importance of well-defined relation schemas.
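The abstract's contrast between micro-F1 and macro-F1 can be reproduced on toy counts. In the sketch below, the two relation types and their true-positive, false-positive, and false-negative counts are invented for illustration and do not come from BioREDirect.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts: 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical per-relation-type counts: (TP, FP, FN).
counts = {
    "frequent_type": (50, 50, 50),
    "rare_type": (9, 1, 1),
}

# Micro-F1 pools the counts, so the frequent type dominates.
micro = f1(*(sum(c[i] for c in counts.values()) for i in range(3)))

# Macro-F1 averages per-type F1, so each type counts equally.
macro = sum(f1(*c) for c in counts.values()) / len(counts)
print(round(micro, 3), round(macro, 3))
```

Here a system that handles the rare type well scores much higher on macro-F1 than on micro-F1, which is the kind of gap that can reverse a ranking between two systems on an imbalanced dataset.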

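[NLP-127] T-Mem: Memory That Anticipates Not Archives

[Quick read]: This paper addresses a limitation of current LLM-backed long-term conversational memory: retrieval depends on surface similarity between the query and stored content, both lexical and dense-vector, so relevant memories cannot be recalled when there is no surface overlap. Such retrieval works only in the descriptive case and fails in the associative case, where query and memory share no wording or entities but are linked by a latent semantic connection. The proposed T-Mem architecture introduces write-time rehearsals, called triggers, that are generated in advance when a memory is stored: one family for descriptive recall (surface similarity) and one for associative recall (latent semantic relation). Triggers are instantiated at two evidence granularities, single facts and full exchanges, so every memory stays reachable in both ways. T-Mem reaches state-of-the-art results on LoCoMo and LoCoMo-Plus.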

Link: https://arxiv.org/abs/2606.15405
Authors: Weidong Guo,Dakai Wang,Zixuan Wang,Hui Liu,Yu Xu
Affiliations: Tencent
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Long-term memory is essential for conversational agents to remain coherent across extended dialogues, follow through on commitments made many sessions earlier, and adapt their behaviour to each user. Current LLM-backed long-term conversational memory, however, is reachability-bounded by the similarity between a query and stored content, both lexical and dense-vector. The approach is effective when query and memory share surface features such as wording or named entities (we call this descriptive). But it misses another, equally valuable class of cases, where query and memory do not share surface features and are tied only by a latent semantic arc (associative). On this regime prevailing long-term memory systems collectively fail. Covering this other half is what allows an assistant, for the first time, to actively draw on past dialogue as a semantic asset. On the memory side, this is the engineering counterpart of what cognitive science calls episodic future thinking: rehearsing past experience for the future contexts under which it will need to be found. We call these write-time rehearsals triggers. We propose T-Mem, the first long-term conversational memory architecture that covers both descriptive and associative recall. At each of two evidence granularities, single facts and full exchanges, T-Mem instantiates one descriptive trigger family and one associative trigger family, so that every memory remains reachable from both surface-similar and relevance-bound queries. As empirical validation, T-Mem reaches state-of-the-art on both LoCoMo and LoCoMo-Plus.

[NLP-128] CHILLGuard: Towards Fine-Grained Chinese LLM Safety Guardrail with Scalable Data Construction and Model-aware Preference Alignment

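[Quick read]: This paper addresses the lack of safety guardrails tailored to Chinese for malicious content generated by large language models (LLMs): general-purpose guardrails are not adapted to Chinese-specific regulatory policies, cultural context, and linguistic nuances. The solution is a fine-grained risk taxonomy for Chinese scenarios with 5 macro categories and 31 micro categories, and CHILLGuard, a dedicated Chinese LLM content safety guardrail built on it. To overcome the scarcity of high-quality annotated Chinese safety data, the authors propose a scalable multi-stage data construction pipeline: expanding a multi-source corpus via retrieval-augmented generation, generating implicit harmful samples through prompt-engineered rewriting, and calibrating labels by multi-model voting. This yields CHILLGuardTrain, a training set of 405,007 samples, and CHILLGuardTest, a rigorously annotated test set of 51,745 samples. CHILLGuard is trained under a generator-classifier collaborative framework with Model-aware Direct Preference Optimization and, for example, improves F1 by 15.92% over Qwen3Guard-8B-Strict on the authors' benchmark.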

Link: https://arxiv.org/abs/2606.15396
Authors: Wenbo Yu,Bohua Wang,Hao Fang,Kuofeng Gao,Jingru Zeng,Xiaochen Yang,Tianyi Zhang,Xiaoxiao Ma,Jiawei Kong,Hao Wu,Bin Chen,Shu-Tao Xia,Min Zhang
Affiliations: Tsinghua University; Beijing Normal University; South China University of Technology; Harbin Institute of Technology, Shenzhen; Shenzhen ShenNong Information Technology Co., Ltd.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Malicious content generated from large language models (LLMs) could pose severe safety risks and ethical concerns. While existing LLM safety guardrails excel in English or multilingual settings, they lack adaptation to Chinese-specific regulatory policies, cultural context and linguistic nuances, failing to support fine-grained risk classification for diverse deployment needs. In this paper, we introduce a 5-macro, 31-micro category fine-grained risk taxonomy for Chinese scenarios, and build CHILLGuard: a dedicated Chinese LLM content safety guardrail. To address the critical scarcity of high-quality annotated Chinese safety data, we propose a scalable multi-stage data construction pipeline: we expand multi-source corpus via retrieval-augmented generation, generate implicit harmful samples through prompt engineering rewriting, and refine high-quality data via multi-model voting-based label calibration. Based on this, we build CHILLGuardTrain, a large-scale training set with 405,007 samples, and CHILLGuardTest, a rigorously curated annotated test set with 51,745 samples. We then train CHILLGuard on CHILLGuardTrain under a generator-classifier collaborative framework via Model-aware Direct Preference Optimization. Extensive experiments under multiple settings demonstrate the state-of-the-art performance of CHILLGuard, e.g., a 15.92% improvement of F1 score over Qwen3Guard-8B-Strict on our benchmark. We will release our resources at this https URL.

[NLP-129] Not All Skills Help: Measuring and Repairing Agent Knowledge

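[Quick read]: This paper addresses redundant and counterproductive skills in the skill libraries of LLM agents that improve without weight updates by accumulating natural-language skills. Current systems leave every decision about which skills to keep and how to apply them to LLM judgment alone, which conflates two distinct roles: creatively generating a skill from experience, and empirically evaluating whether that skill helps. The core observation is that individual skills routinely help on some task types while hurting on others, and these opposing effects cancel in aggregate, so global curation methods cannot see them. The proposed ASSAY framework separates generation from curation: it measures each skill's causal contribution on a small development set via randomized masking, restructures the library offline, and suppresses skills with negative predicted effect for each test task. Across seven base models from four providers and two benchmarks (AppWorld and tau-bench), ASSAY consistently improves over prior skill-curation approaches. On AppWorld's hardest split, DeepSeek-V3 reaches 69.3% task-goal completion (a 47.4% relative improvement), which the authors report as a new state of the art among published methods, including weight-tuned ones; on tau-bench retail, GPT-4.1 improves by 8.7% relative, moving past o4-mini, o1, and GPT-4.5 on the public leaderboard without any weight modification. Ablations trace the dominant gain to per-task masking, indicating that the bottleneck is matching skills to tasks at inference time rather than removing bad skills globally.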

Link: https://arxiv.org/abs/2606.15390
Authors: Yixuan Wang, Yiyang Zhou, Yiming Liang, Congyu Zhang, Fuxiao Liu, Jiawei Zhou, Huaxiu Yao
Affiliations: UNC Chapel Hill; Purdue; NVIDIA
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages, 5 figures

Abstract:LLM agents can improve without weight updates by accumulating natural-language skills from experience, but current systems entrust every decision about which skills to keep and how to apply them to LLM judgment alone. We argue that this conflates two distinct roles: generating a skill from experience is a creative act that judgment handles well, while deciding whether that skill actually helps requires empirical evidence across many tasks. Measuring per-skill causal contributions via randomized masking, we find that skill libraries exhibit pervasive causal heterogeneity: individual skills routinely help on some task types while hurting on others, yet their opposing effects cancel in aggregate, making them invisible to global curation methods. We propose ASSAY, a framework that separates generation from curation: it computes a per-skill causal attribution on a small development set, restructures the library offline, and suppresses skills with negative predicted effect for each test task. Across seven base models spanning four providers and two benchmarks (AppWorld and tau-bench), ASSAY consistently improves over prior skill-curation approaches. On AppWorld’s hardest split, DeepSeek-V3 achieves 69.3% task-goal completion (47.4% relative improvement), a new state of the art among all published methods including weight-tuned approaches. On tau-bench retail, GPT-4.1 improves by 8.7% relative, advancing past o4-mini, o1, and GPT-4.5 on the public leaderboard without any weight modification. Ablation traces the dominant gain to per-task masking, confirming that the bottleneck is matching skills to tasks at inference time, not removing bad skills globally. Code is available at this https URL.
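
The randomized-masking idea can be sketched on a toy problem. Everything below is hypothetical: `run_agent` is a stand-in simulator with made-up effects, and the difference-in-means estimator is one simple reading of "per-skill causal attribution", not necessarily the estimator ASSAY uses.

```python
import random
from collections import defaultdict

SKILLS = ["A", "B"]
# Made-up ground truth for the toy simulator: skill A helps on task type "x"
# and hurts on "y", so its effect roughly cancels when averaged globally.
TRUE_EFFECT = {"A": {"x": 0.3, "y": -0.3}, "B": {"x": 0.1, "y": 0.1}}


def run_agent(task_type, active_skills):
    """Toy stand-in for an agent rollout; returns a reward in [0, 1]."""
    return 0.5 + sum(TRUE_EFFECT[s][task_type] for s in active_skills)


def estimate_effects(task_types, trials_per_type=400, seed=0):
    """Difference in mean reward with a skill active vs. masked, per task type."""
    rng = random.Random(seed)
    on = defaultdict(list)   # rewards when the skill was active
    off = defaultdict(list)  # rewards when the skill was masked
    for task_type in task_types:
        for _ in range(trials_per_type):
            # Each skill is independently kept with probability 0.5.
            active = [s for s in SKILLS if rng.random() < 0.5]
            reward = run_agent(task_type, active)
            for s in SKILLS:
                bucket = on if s in active else off
                bucket[(s, task_type)].append(reward)
    return {k: sum(on[k]) / len(on[k]) - sum(off[k]) / len(off[k]) for k in on}


effects = estimate_effects(["x", "y"])


def skills_for(task_type):
    """Per-task masking: keep only skills with non-negative estimated effect."""
    return [s for s in SKILLS if effects[(s, task_type)] >= 0]
```

In this toy setup the global average effect of skill A is close to zero, so a global filter would keep or drop it for the wrong reason, while `skills_for("y")` suppresses it only where it hurts.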

[NLP-130] Rethinking the Role of Efficient Attention in Hybrid Architectures

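[Quick read]: This paper examines how efficient attention modules, such as sliding-window attention (SWA), shape model capabilities in hybrid architectures that combine them with full attention, a question that remains poorly understood. The analysis covers three perspectives: scaling behavior, mechanism analysis, and architecture design. On scaling, efficient-attention design mainly affects how fast long-context capability emerges, while different hybrids converge to comparable long-context performance under sufficient training. Mechanistically, long-range retrieval is carried mainly by the full-attention layers, whereas efficient attention shapes their optimization trajectory; this explains the counter-intuitive "Large-Window Laziness" phenomenon, in which larger SWA windows can delay the formation of retrieval heads in full-attention layers. Guided by this mechanism, applying NoPE (no positional encoding) to only the full-attention layers of a small-window SWA hybrid substantially improves long-context performance with negligible impact on short-context performance.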

Link: https://arxiv.org/abs/2606.15378
Authors: Ziqing Qiao, Yinuo Xu, Chaojun Xiao, Zhou Su, Zihan Zhou, Yingfa Chen, Xiaoyue Xu, Xu Han, Zhiyuan Liu
Affiliations: Tsinghua University; OpenBMB
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 23 pages, 13 figures

Abstract:Modern language models increasingly adopt hybrid architectures that combine full attention with efficient attention modules, such as sliding-window attention (SWA) and recurrent sequence mixers. However, how these efficient modules shape model capabilities remains poorly understood. To address this gap, we conduct a systematic analysis across hybrid architectures from three perspectives: scaling behavior, mechanism analysis, and architecture design. First, from a scaling perspective, we find that efficient-attention design primarily affects how fast long-context capability emerges, while different hybrids eventually converge to comparable long-context performance under sufficient training. Second, mechanistically, we show that long-range retrieval is mainly carried by full attention, whereas efficient attention shapes its optimization trajectory. This explains a counter-intuitive phenomenon we call Large-Window Laziness: larger SWA windows can delay the formation of retrieval heads in full-attention layers. Third, guided by this mechanism, we show that applying NoPE to only the full-attention layers of a small-window SWA hybrid substantially improves long-context performance with negligible impact on short-context performance.
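
For readers unfamiliar with the two layer types being mixed, the difference between full causal attention and sliding-window attention comes down to the attention mask. The sketch below is a generic NumPy illustration; the sequence length and window size are arbitrary, and it does not reproduce the paper's models or the NoPE change.

```python
import numpy as np


def causal_mask(seq_len):
    """Full causal attention: token i may attend to every token j <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))


def sliding_window_mask(seq_len, window):
    """Causal SWA: token i may attend only to tokens i-window+1 .. i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)


full = causal_mask(6)
swa = sliding_window_mask(6, window=3)
```

With `window=3`, the last token can no longer see the first three positions, which is why long-range retrieval in a hybrid has to pass through the full-attention layers.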

[NLP-131] Privacy-Preserving Text Sanitization for Distributed Agents Collaboration via Disentangled Representations

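[Quick read]: This paper addresses privacy leakage when distributed multi-agent systems exchange text across organizational boundaries, focusing on implicit leakage through distributional signatures (formatting conventions, vocabulary choices, and syntactic patterns) in addition to explicit identifiers. The proposed DiSan (Disentangled Sanitization) framework uses a two-stream encoder to factorize text into two disentangled subspaces: a source-invariant role subspace that preserves task semantics, and a source-identifying style subspace that remains local. Federated prototype alignment and adversarial regularization enable joint training without centralizing raw text. Experiments show that identifier-level masking is insufficient: masking 19.2% of tokens reduces TF-IDF stylometric attribution by only 18.6%. By contrast, DiSan reduces answer-level PII exposure by 20 times while maintaining 83% answer faithfulness on a distributed multi-agent RAG benchmark, and lowers Enron stylometric attribution by 73.2% under TF-IDF and 70.6% under a neural probe.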

Link: https://arxiv.org/abs/2606.15335
Authors: Xuan Liu, Hefeng Zhou, Sicheng Chen, Chao Yang, Xingcheng Xu, Jingjing Qu, Jiong Lou, Jie LI, Xia Hu
Affiliations: Shanghai Artificial Intelligence Laboratory; Shanghai Jiao Tong University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:When distributed agents exchange text across organizational boundaries, privacy leakage arises not only from explicit identifiers but also from distributional signatures such as formatting conventions, vocabulary choices, and syntactic patterns. We propose DiSan (Disentangled Sanitization), a privacy-preserving sanitization framework and a built-in component of Intern-Shannon for multi-agent collaboration. DiSan uses a two-stream encoder to factorize text into a source-invariant role subspace that preserves task semantics and a source-identifying style subspace that remains local. Federated prototype alignment and adversarial regularization enable joint training without centralizing raw text. Experiments show that identifier-level masking is insufficient: masking 19.2% of tokens reduces TF-IDF stylometric attribution by only 18.6%. By contrast, DiSan reduces answer-level PII exposure by 20 times while maintaining 83% answer faithfulness on a distributed multi-agent RAG benchmark, and lowers Enron stylometric attribution by 73.2% under TF-IDF and 70.6% under a neural probe.

[NLP-132] Replay What Matters: Off-Policy Replay for Efficient LLM Reinforcement Unlearning

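[Quick read]: This paper addresses the training inefficiency of reinforcement-learning-based LLM unlearning. Methods such as RULE reformulate unlearning as learning a refusal behavior, but their on-policy optimization repeatedly samples from the same forget and retain/boundary prompts: easy cases converge quickly and provide little useful gradient signal, while hard cases near the forget/retain boundary keep producing low-reward rollouts that are discarded after a single use. The proposed ReRULE is an off-policy replay enhancement for reinforcement unlearning. It stores low-reward hard-case rollout groups in a replay buffer during early Group Relative Policy Optimization (GRPO) training and reuses them in later stages through importance-sampled off-policy updates, redirecting computation toward boundary cases that still require learning. Theoretically, ReRULE yields a tighter hard-case convergence bound than pure on-policy RULE. Empirically, it improves MUSE-Books Retain Quality from 46.3 to 56.2 while adding only 5-11% training time across benchmarks; its limited improvement on the simpler TOFU setting is consistent with the intended conditional behavior, since replay helps most when the hard/easy disparity is pronounced.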

Link: https://arxiv.org/abs/2606.15333
Authors: Zirui Pang, Chenlong Zhang, Haosheng Tan, Zhuoran Jin, Jiaheng Wei, Zixin Zhong
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Institute of Automation, Chinese Academy of Sciences; University of Glasgow
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:LLM unlearning has emerged as a cost-effective alternative to full retraining for removing hazardous knowledge from pretrained models while preserving general utility. Recent RL-based methods such as RULE reformulate unlearning as learning a refusal behavior, but their on-policy optimization repeatedly samples from the same forget and retain/boundary prompts throughout training. We identify a critical inefficiency in this process: easy cases quickly converge and provide little useful gradient signal, while hard cases near the forget/retain boundary continue to produce low-reward rollouts that are discarded after a single use. To address this issue, we propose ReRULE, an off-policy replay enhancement for reinforcement unlearning. ReRULE stores low-reward hard-case rollout groups in a replay buffer during early GRPO training and reuses them in later stages through importance-sampled off-policy updates, redirecting computation toward boundary cases that still require learning. Theoretically, we show that ReRULE yields a tighter hard-case convergence bound than pure on-policy RULE. Empirically, ReRULE improves MUSE-Books Retain Quality from 46.3 to 56.2 while adding only 5–11% training time across benchmarks. Its limited improvement on the simpler TOFU setting further supports the intended conditional behavior: replay is most beneficial when the hard/easy disparity is pronounced.
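
The replay mechanism described in the abstract can be sketched in a few lines. This is an illustration under stated assumptions, not ReRULE itself: the mean-reward threshold, the buffer capacity, and the clipping constant are all hypothetical, and the abstract does not say how hard cases are selected or how the importance weights are bounded.

```python
import math
from dataclasses import dataclass


@dataclass
class Rollout:
    prompt: str
    response: str
    reward: float
    behavior_logp: float  # log-prob under the policy that sampled this rollout


class HardCaseReplayBuffer:
    """Keeps rollout groups whose mean reward is low (assumed 'hard cases')."""

    def __init__(self, reward_threshold=0.5, capacity=1000):
        self.reward_threshold = reward_threshold
        self.capacity = capacity
        self.groups = []

    def maybe_add(self, group):
        mean_reward = sum(r.reward for r in group) / len(group)
        if mean_reward < self.reward_threshold:
            self.groups.append(group)
            self.groups = self.groups[-self.capacity:]  # keep the newest groups
            return True
        return False


def importance_weight(current_logp, behavior_logp, clip=2.0):
    """Clipped ratio pi_current / pi_behavior used to reweight a replayed rollout."""
    return min(math.exp(current_logp - behavior_logp), clip)
```

During later training steps, each replayed rollout's loss term would be scaled by `importance_weight(...)` to correct for the fact that it was sampled from an older policy.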

[NLP-133] Prior over Evidence: Stereotype-Driven Diagnosis in LLM-Based L2 Pronunciation Feedback

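[Quick read]: This paper examines the diagnostic reliability of LLM-generated feedback on second-language (L2) English pronunciation, asking whether model judgments are grounded in the supplied speech evidence rather than in priors from pretraining. The study covers 1,800 L2-Arctic utterances from six L1 backgrounds, three audio-capable LLMs, four pronunciation dimensions, and five evidence conditions ranging from a text-only baseline to raw audio, scored with three metrics: Rating Accuracy (RA), Evidence Coherence (EC), and Grounded Correctness (GC). Three findings hold across models. First, rating accuracy and grounded reasoning decouple: 39.6% of judged cells contain internally coherent reasoning that supports a wrong rating, against only 15.8% where the reasoning supports a correct rating. Second, phoneme-level feedback converges to a fixed inventory of L2-English difficulty phones that recurs across all L1 backgrounds and evidence conditions. Third, acoustic evidence helps only when the supplied feature directly probes the target dimension: textualised F0 range markedly improves pitch-variation grounding, while stress and phoneme correctness, which require target-to-realisation alignment, remain ungrounded. The results indicate that current general-purpose LLMs are more reliable as verbalisers of externally computed pronunciation evidence than as standalone diagnostic engines.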

Link: https://arxiv.org/abs/2606.15325
Authors: Rong Wang, Kun Sun
Affiliations: University of Tuebingen; Tongji University
Subjects: Computation and Language (cs.CL)
Comments: 12 pages, 2 figures

Abstract:Large language models are increasingly deployed for written pronunciation feedback in second-language (L2) English learning, under the assumption that their diagnoses are grounded in the supplied speech evidence rather than in priors from pretraining. This assumption is tested on 1,800 L2-Arctic utterances spanning six L1 backgrounds, three audio-capable LLMs, four pronunciation dimensions, and five evidence conditions ranging from a text-only baseline to numeric acoustic features and raw audio. Each (utterance x model x condition x dimension) cell is scored on three metrics: Rating Accuracy (RA) against gold labels, Evidence Coherence (EC) assessing internal consistency without ground truth, and Grounded Correctness (GC) evaluated against gold evidence. Results show three findings across models. First, rating accuracy and grounded reasoning decouple: 39.6% of judged cells contain internally coherent reasoning that supports a wrong rating, against only 15.8% where the reasoning supports a correct rating. Second, phoneme-level feedback converges to a fixed inventory of L2-English difficulty phones that recurs across all six L1 backgrounds and all evidence conditions. Third, acoustic evidence improves the rating only when the supplied feature directly probes the target dimension: textualised F0 range raises pitch-variation grounding from (0.18-0.19) to (0.45-0.62) across all three models, while stress and phoneme correctness, which require target-to-realisation alignment, remain ungrounded. The same audio waveform without textualised F0 values does not reproduce this improvement. These findings indicate that current general-purpose LLMs are more reliable as verbalisers of externally computed pronunciation evidence than as standalone diagnostic engines.

[NLP-134] Adapting Reinforcement Learning with Chain-of-Thought Supervision for Explainable Detection of Hateful and Propagandistic Memes

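[Quick read]: This paper addresses the detection of hateful and propagandistic memes, whose harmful intent arises from the interplay of image and text and is revealed by neither modality alone, and notes that thinking-based multimodal large language models (MLLMs) remain underexplored for meme content moderation. The core solution is a reinforcement-learning-based post-training method that uses task-specific rewards and Group Relative Policy Optimization (GRPO) to improve both classification performance and reference-based explanation quality. The contributions are: (1) a systematic empirical study of off-the-shelf MLLMs on English and Arabic benchmarks; (2) extensions of existing meme datasets with weakly supervised chain-of-thought (CoT) rationales obtained via distillation, plus multi-LLM fine-grained propaganda annotations; (3) a GRPO-based objective with thinking-length regularization that jointly optimizes classification accuracy and explanation quality; and (4) self-supervised GRPO on unlabeled memes using consensus-based pseudo-labels. On Hateful Memes, accuracy improves over previously reported results by up to 2.1 points (from 79.9% to 82.0%); on ArMeme, macro-F1 improves by up to 7.6 points (from 0.536 to 0.612), while the model also generates natural-language explanations. On ArMeme, sequence-classification baselines remain stronger in raw accuracy, whereas the proposed approach gives more balanced per-class performance together with explanations.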

Link: https://arxiv.org/abs/2606.15307
Authors: Mohamed Bayan Kmainasi, Mucahid Kutlu, Ali Ezzat Shahroor, Abul Hasnat, Firoj Alam
Affiliations: Hamad Bin Khalifa University; Qatar University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Hateful and propagandistic memes exploit the interplay between images and text to convey harmful intent that neither modality reveals alone. Although thinking-based multimodal large language models (MLLMs) have advanced vision-language understanding, their application to meme content moderation remains underexplored. We propose a reinforcement learning-based post-training method that improves classification performance and reference-based explanation quality in thinking-based MLLMs via task-specific rewards and Group Relative Policy Optimization (GRPO). Concretely, we (i) conduct a systematic empirical study of off-the-shelf MLLMs for hateful and propagandistic meme understanding across English and Arabic benchmarks, (ii) extend existing meme datasets with weakly supervised chain-of-thought (CoT) rationales via distillation and multi-LLM fine-grained propaganda annotations, (iii) introduce a GRPO-based objective with thinking-length regularization that jointly optimizes classification accuracy and explanation quality, and (iv) investigate self-supervised GRPO on unlabeled memes using consensus-based pseudo-labels. Experiments on the Hateful Memes and ArMeme benchmarks show that our approach improves over previously reported results on FHM accuracy (up to +2.1%, from 79.9% to 82.0%) and on ArMeme macro-F1 (up to +7.6 points, from 0.536 to 0.612 with explanations; +6.1 compared to the original ArMeme benchmark), while also generating natural-language explanations. On ArMeme, sequence-classification baselines remain stronger in terms of raw accuracy, whereas our approach provides more balanced per-class performance along with explanations. We publicly release our code, data extensions, and evaluation resources.
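
The group-relative part of GRPO is easy to show in isolation: each sampled response is scored against the other responses drawn for the same prompt. The sketch below uses the standard mean/standard-deviation normalization; the `combined_reward` function, its length budget, and its weights are hypothetical stand-ins, since the abstract does not give the paper's reward terms.

```python
import statistics


def combined_reward(correct, explanation_score, think_tokens,
                    budget=256, length_weight=0.1):
    """Hypothetical reward: label correctness + explanation quality - length penalty."""
    overflow = max(0, think_tokens - budget) / budget
    return float(correct) + explanation_score - length_weight * overflow


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to one meme: (correct?, explanation score, thinking tokens).
group = [(True, 0.8, 200), (False, 0.4, 150), (True, 0.5, 600), (False, 0.1, 900)]
rewards = [combined_reward(*sample) for sample in group]
advantages = group_relative_advantages(rewards)
```

Responses that beat their group's average get a positive advantage, so no separate value network is needed; a group with identical rewards contributes zero signal.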

[NLP-135] CODA-BENCH: Can Code Agents Handle Data-Intensive Tasks? ICML2026

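[Quick read]: This paper addresses a gap in current agent benchmarks: most evaluate code-centric or data-centric capabilities in isolation, whereas real development environments involve both complex code and large-scale data (i.e., a file system). The authors introduce CODA-BENCH, which they describe as the first benchmark to jointly evaluate code and data intelligence in a data-intensive environment. It is built as a data-intensive Linux sandbox based on the Kaggle ecosystem, in which agents must explore complex file hierarchies to identify relevant resources and generate code for data-driven analytical tasks. The benchmark comprises 1,009 tasks spanning 31 communities, and each task environment contains an average of 980 files to simulate realistic data scale and noise. Evaluations show that even top-performing systems struggle to integrate data discovery with code execution, achieving a success rate of only 61.1%, which highlights a substantial gap in current agentic capabilities for data-intensive tasks.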

Link: https://arxiv.org/abs/2606.15300
Authors: Yuxin Zhang, Ju Fan, Meihao Fan, Shaolei Zhang, Xiaoyong Du
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted at ICML 2026. 37 pages, 11 figures. Project page: this https URL Code: this https URL Data: this https URL

Abstract:Advanced agents are increasingly demonstrating the potential to operate as autonomous engineers, creating a growing demand for evaluation benchmarks that capture the complexity of real-world development. Such environments typically involve both complex code and large-scale data (i.e., file system). However, existing benchmarks usually evaluate code-centric or data-centric capabilities in isolation, leaving a clear gap with real development scenarios. In this paper, we bridge this gap by introducing CODA-BENCH, the first benchmark to jointly evaluate code and data intelligence in a data-intensive environment. We construct a data-intensive Linux sandbox based on the Kaggle ecosystem (containing hundreds of datasets), where agents must actively explore complex file hierarchies to identify relevant resources and generate code for data-driven analytical tasks. CODA-BENCH comprises 1,009 tasks spanning 31 communities, with each task environment containing an average of 980 files, simulating realistic data scale and noise. Evaluations of advanced agents reveal that even top-performing systems struggle to effectively integrate data discovery with code execution, achieving a success rate of only 61.1%. These results highlight a substantial gap in current agentic capabilities for data-intensive tasks and point to promising directions for future research.

[NLP-136] Evaluating and Preserving Lexical Stress in English-to-Chinese Speech-to-Speech Translation INTERSPEECH2026

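[Quick read]: This paper addresses the cross-lingual transfer of lexical stress in speech-to-speech translation (S2ST), a problem compounded by the lack of reliable automatic evaluation metrics for tonal languages such as Chinese. Although S2ST systems have made substantial progress in semantic accuracy and speech naturalness, the transfer of stress, a vital cue for emphasis and speaker intent, remains underexplored. The authors construct a stress-annotated Chinese dataset and an XLS-R-based Mandarin stress detector; combining the detector with the English EmphAssess system, they propose an objective metric for cross-lingual stress evaluation. They also fine-tune CosyVoice3 to build a stress-aware S2ST system. Experiments show that the proposed S2ST architecture significantly outperforms existing systems in stress translation while maintaining competitive translation quality, and that the proposed metric correlates strongly with human subjective judgments.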

Link: https://arxiv.org/abs/2606.15266
Authors: Yuchen Song, Xi Chen, Mingze Li, Satoshi Nakamura
Affiliations: The Chinese University of Hong Kong, Shenzhen; Shenzhen Loop Area Institute
Subjects: Computation and Language (cs.CL)
Comments: Accepted to Interspeech 2026

Abstract:Speech-to-speech translation (S2ST) systems have achieved impressive progress in semantic accuracy and speech naturalness. However, the cross-lingual transfer of lexical stress, a vital cue for emphasis and speaker intent, remains heavily underexplored, compounded by a lack of reliable automatic evaluation metrics for tonal languages like Chinese. We investigate English-to-Chinese S2ST stress transfer by constructing a stress-annotated Chinese dataset and an XLS-R-based Mandarin stress detector. Integrating this with the English EmphAssess system, we propose a novel objective metric for cross-lingual stress evaluation. Furthermore, we fine-tune CosyVoice3 to build a stress-aware S2ST system. Experiments demonstrate that our proposed S2ST architecture significantly outperforms existing systems in stress translation capability while maintaining competitive translation quality. Furthermore, our evaluation metric exhibits a strong correlation with human subjective judgments.

[NLP-137] Spokes: Optimizing for Diverse Pretraining Data Selection

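[Quick read]: This paper addresses how to increase diversity in pretraining data selection so as to reduce redundancy and improve model performance under a fixed data budget. The challenge is that diversity is a set-level property that depends on interactions between data points, so it cannot be optimized by scoring examples independently; existing approaches rely on proxies or approximations that often fail to ensure sufficiently diverse subsets. The paper proposes a probabilistic diversification framework based on the G-Vendi score and optimizes it directly via exponentiated gradient descent. The resulting subsets are substantially more diverse than random samples, with a +489 increase in G-Vendi score on a 500k-sample subset. The diversity-only variant of SPOKES improves average downstream performance over random sampling by +0.4 points on DCLM and +0.5 points on FineWeb. Jointly optimizing quality and diversity gives the strongest results, with gains of +1.5 points on DCLM and +1.4 points on FineWeb, outperforming all baselines, including semantic deduplication and quality filtering.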

Link: https://arxiv.org/abs/2606.15216
Authors: Clarence Lee, Yejin Choi, Luke Zettlemoyer, Pang Wei Koh, Hai Leong Chieu
Affiliations: DSO National Laboratories; Stanford University; University of Washington
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 figures

Abstract:Diversity plays a critical role in data selection, improving performance under fixed data budgets by reducing redundancy and repetition. However, optimizing for diversity is inherently challenging, as it is a set-level property that depends on interactions between data points rather than individual examples. As a result, existing approaches typically rely on proxies or approximations, which often fail to ensure sufficiently diverse subsets. In this work, we directly optimize diversity by introducing a probabilistic diversification framework based on the G-Vendi score, optimized via exponentiated gradient descent. Our method produces subsets that are substantially more diverse than those obtained via random sampling, achieving a +489 increase in G-Vendi score on a 500k-sample subset. We evaluate our approach on FineWeb and DCLM, where it consistently outperforms existing methods. Notably, SPOKES (diversity-only) improves average downstream performance by +0.4 and +0.5 points over random sampling on DCLM and FineWeb, respectively. More importantly, jointly optimizing for both quality and diversity yields the strongest results: SPOKES achieves gains of +1.5 and +1.4 points on DCLM and FineWeb, outperforming all baselines, including semantic deduplication and quality filtering.
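
To make "optimizing a set-level diversity score with exponentiated gradient" concrete, here is a small sketch. It makes several assumptions: it uses a plain probability-weighted Vendi-style score (the exponential of the entropy of the kernel eigenvalues) on toy feature vectors rather than the G-Vendi features, it estimates gradients by finite differences, and the step size is arbitrary. It is not the SPOKES algorithm.

```python
import numpy as np


def weighted_vendi(features, p):
    """exp(entropy) of the eigenvalues of diag(sqrt p) K diag(sqrt p), K = X X^T."""
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    sp = np.sqrt(p)
    kernel = (x @ x.T) * np.outer(sp, sp)
    eig = np.clip(np.linalg.eigvalsh(kernel), 0.0, None)
    eig = eig[eig > 1e-12]  # drop numerically zero eigenvalues (0 * log 0 = 0)
    return float(np.exp(-np.sum(eig * np.log(eig))))


def exponentiated_gradient_step(features, p, lr=0.5, h=1e-5):
    """Multiplicative update p_i <- p_i * exp(lr * dScore/dp_i), then renormalize."""
    base = weighted_vendi(features, p)
    grad = np.zeros_like(p)
    for i in range(len(p)):
        bumped = p.copy()
        bumped[i] += h
        grad[i] = (weighted_vendi(features, bumped) - base) / h  # finite difference
    new_p = p * np.exp(lr * grad)
    return new_p / new_p.sum()


# Two duplicate items and one distinct item; start from uniform sampling weights.
features = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
p = np.full(3, 1.0 / 3.0)
before = weighted_vendi(features, p)
for _ in range(20):
    p = exponentiated_gradient_step(features, p)
after = weighted_vendi(features, p)
```

The update shifts probability mass away from the duplicated item, and the score moves toward 2, the number of distinct directions in this toy set. Because the update is multiplicative and renormalized, `p` stays a valid probability distribution throughout.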

[NLP-138] AmchiBias: Measuring Stereotypical Bias in Goan Identity Groups with a Minimal Pair Dataset in English and Konkani

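[Quick read]: This paper addresses the fact that socio-cultural stereotypical bias in NLP systems is usually assessed only at the national level, overlooking rich subnational socio-cultural structures. It presents AmchiBias, described by the authors as the first benchmark for measuring socio-cultural stereotypical bias for the Indian state of Goa, with its unique historically multicultural setting. The benchmark covers various Goan identity groups and comprises 313 minimal pairs across eight sociodemographic dimensions, in both English and Devanagari Konkani. Evaluating five multilingual encoder models, the authors find near-chance scores in Konkani, reflecting a lack of language competence in general multilingual models and a lack of Goan cultural competence in Indian language models. When queried in English, models with stronger Indian language coverage show higher bias for pan-Indian groups than for hyperlocal Goan groups, suggesting that the English signal reflects pan-Indian pretraining associations rather than genuine Goan cultural knowledge. The findings highlight a gap in low-resource multilingual NLP evaluation for hyperlocal community identities.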

Link: https://arxiv.org/abs/2606.15191
Authors: Michelle Barbosa, Sebastian Padó, Franziska Weeber
Affiliations: Institute for Natural Language Processing, University of Stuttgart
Subjects: Computation and Language (cs.CL)
Comments: The 1st Workshop on Stereotypes Across Cultures in Language Technologies

Abstract:Socio-cultural stereotypical bias is an important consideration in the development and deployment of NLP systems. It is however often considered only at the national level, despite rich subnational socio-cultural structures. We present AmchiBias, the first benchmark for measuring socio-cultural stereotypical bias for the Indian state of Goa with its unique historically multicultural setting. It covers various Goan identity groups and comprises 313 minimal pairs across eight sociodemographic dimensions in both English and Devanagari Konkani. We then evaluate stereotypical bias in five multilingual encoder models on this benchmark. We find near-chance scores in Konkani, reflecting language incompetence for general multilingual models and a lack of Goan cultural competence for Indian language models. Queried in English, models with a stronger Indian language coverage show higher bias for pan-Indian groups than hyperlocal Goan groups. This suggests the English signal reflects pan-Indian pretraining associations rather than genuine Goan cultural knowledge. Our findings highlight a critical gap in low-resource multilingual NLP evaluation for hyperlocal community identities.
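
A common way to score minimal-pair bias benchmarks is the share of pairs in which the model assigns a higher score to the stereotypical sentence. The sketch below assumes that metric; the abstract does not spell out AmchiBias's exact scoring, and the sentence IDs and scores here are placeholders rather than model outputs.

```python
def stereotype_preference_rate(pairs, score):
    """Share of minimal pairs where the stereotypical sentence scores higher."""
    if not pairs:
        raise ValueError("need at least one minimal pair")
    preferred = sum(1 for stereo, anti in pairs if score(stereo) > score(anti))
    return preferred / len(pairs)


# Placeholder scores; a real run would use a model's (pseudo-)log-likelihoods.
toy_scores = {
    "pair1-stereo": -10.2, "pair1-anti": -11.0,
    "pair2-stereo": -9.5, "pair2-anti": -9.1,
}
pairs = [("pair1-stereo", "pair1-anti"), ("pair2-stereo", "pair2-anti")]
rate = stereotype_preference_rate(pairs, toy_scores.get)
```

Under this metric a rate near 0.5 means the model shows no systematic preference, which can reflect either an absence of bias or, as the abstract suggests for Konkani, an inability to process the language at all.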

[NLP-139] Beyond Layer Importance in Layer-wise Sparsity: An Inter-Layer Perturbation-Absorption Perspective EMNLP2026

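[Quick read]: This paper addresses layer-wise sparsity allocation when pruning large language models (LLMs). Existing allocation methods estimate layer importance from local signals such as activation outliers or weight spectra, but the final post-pruning performance also depends on the network's capacity to compensate in subsequent layers. The authors characterize this property directly through controlled perturbation experiments. First, layers respond very differently to pruning-scale perturbations: in most cases early layers amplify perturbations, while middle and late layers actively absorb them, with relative L2 drift decreasing monotonically across depth and the direction realigning toward the unperturbed hidden-state trajectory. Second, absorption is a large-perturbation phenomenon: under small perturbations the network amplifies across all layers, and the transition to absorption occurs smoothly as the perturbation grows to pruning scale, which enriches the linearized accumulation theory underlying related work. Building on these findings, the authors define a per-layer absorption coefficient and propose absorption-aware correction, an orthogonal augmentation that improves OWL and AlphaPruning, reducing perplexity by 7.13% and raising zero-shot accuracy by 1.02% across multiple model families at 70% sparsity.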

Link: https://arxiv.org/abs/2606.15161
Authors: Tao Jing, Ningxin Wu, Chen Kang, Dong Yu, Changliang Li, Pengyuan Liu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 10 pages, 4 figures, 4 tables. Submitted to EMNLP 2026

Abstract:The considerable layer-wise redundancy in large language models (LLMs) has established non-uniform sparsity allocation across layers as the standard pruning approach for efficient compression. Existing layer-wise allocation methods that estimate allocation strategy from local signals such as activation outliers or weight spectra mainly derive from local layer importance, whereas the final post-pruning performance is also influenced by the network’s subsequent compensatory capacity. In this paper, we directly characterize this property through controlled perturbation experiments. We make the following empirical findings. First, layers exhibit highly heterogeneous responses to pruning-scale perturbations. In most cases, early layers amplify perturbations, while middle and late layers actively absorb them, with relative L2 drift decreasing monotonically across depth and direction realigning toward the unperturbed hidden-state trajectory. Second, absorption is a large-perturbation phenomenon. Under small perturbations the network exhibits amplification across all layers, and the transition to absorption occurs smoothly as perturbation magnitude grows to pruning scale. This enriches the linearized accumulation theory underlying related works. Building on these findings, we define an absorption coefficient per layer and propose absorption-aware correction, an orthogonal augmentation that improves OWL and AlphaPruning by reducing perplexity by 7.13% and boosting zero-shot accuracy by 1.02% across multiple model families at 70% sparsity.
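
The measurement behind these findings, relative L2 drift of hidden states after an injected perturbation, can be sketched on a toy network. The residual tanh stack, the perturbation scale, and the layer-to-layer drift ratio below are all assumptions for illustration; the paper's absorption coefficient is not defined in the abstract, and a random toy network will not necessarily reproduce the amplify-then-absorb pattern.

```python
import numpy as np


def forward(weights, h, perturb_at=None, delta=None):
    """Run a toy residual stack; return the hidden state after every layer."""
    states = []
    for idx, w in enumerate(weights):
        h = h + np.tanh(h @ w)
        if idx == perturb_at:
            h = h + delta  # inject the perturbation at this layer's output
        states.append(h)
    return states


def relative_drift(clean, perturbed):
    """Per-layer ||h_perturbed - h_clean|| / ||h_clean||."""
    return [float(np.linalg.norm(p - c) / np.linalg.norm(c))
            for c, p in zip(clean, perturbed)]


rng = np.random.default_rng(0)
dim, depth = 16, 6
weights = [rng.normal(scale=0.3, size=(dim, dim)) for _ in range(depth)]
h0 = rng.normal(size=dim)
delta = rng.normal(scale=0.5, size=dim)

clean = forward(weights, h0)
perturbed = forward(weights, h0, perturb_at=1, delta=delta)
drift = relative_drift(clean, perturbed)
# Hypothetical per-layer ratio: below 1 means the next layer shrank the drift.
ratios = [drift[i + 1] / drift[i] for i in range(1, depth - 1)]
```

Repeating the run with different `delta` scales is the kind of experiment that would separate a small-perturbation regime from a pruning-scale one.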

[NLP-140] Can Agents Read the Room? Benchmarking Visual Social Intelligence in Multimodal Simulation

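[Quick read]: This paper addresses a limitation of existing social-agent benchmarks, which are largely text-based and rarely test whether multimodal agents can use visual cues (such as facial expressions, posture, gaze, and emotional shifts) to guide interaction. It introduces \textsc\benchmarkname, a benchmark for evaluating visual social intelligence in multimodal social simulation. The benchmark contains 240 scenarios, 585 role instances, and 2,340 role-task instances, combining aligned textual-visual evidence, structured role profiles, and four role-level tasks: an expression task, a characteristic task, an interaction regulation task, and an interaction outcome task. Evaluating seven recent multimodal large language models (MLLMs) under verbalized-vision and direct-vision settings reveals a clear gap between local role enactment and interaction management: role-specific expression and conflict handling are near saturation, whereas interaction regulation and visually grounded outcome achievement remain substantially more difficult.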

Link: https://arxiv.org/abs/2606.15152
Authors: Shijun Wan, Xuehai Wu, Jiwen Zhang, Siyuan Wang, Zhongyu Wei
Affiliations: Fudan University; Shanghai Innovation Institute; The Chinese University of Hong Kong
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Social interaction depends on both language and visible social signals, such as facial expressions, posture, gaze, and emotional shifts. Yet existing social-agent benchmarks are largely text-based and rarely test whether multimodal agents can use visual cues to guide interaction. We introduce \textsc\benchmarkname, a benchmark evaluating visual social intelligence in multimodal social simulation. It contains 240 scenarios, 585 role instances, and 2,340 role-task instances, combining aligned textual-visual evidence, structured role profiles, and four role-level tasks: expression task, characteristic task, interaction regulation task, and interaction outcome task. Evaluating seven recent MLLMs under verbalized-vision and direct-vision reveals a clear gap between local role enactment and interaction management: role-specific expression and conflict handling are near saturation, whereas interaction regulation and visually grounded outcome achievement remain substantially more difficult. The code is released at this https URL, and the dataset is available at this https URL.

[NLP-141] PACUTE: Phonology-, Affix-, and Character-level Understanding of Tokens for Filipino EMNLP2026

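[Quick read]: This paper addresses the way subword tokenization obscures character-level and morphological structure in large language models (LLMs), a limitation that is most acute for languages with non-concatenative morphology, where token boundaries systematically misalign with morpheme boundaries. It introduces PACUTE, a diagnostic benchmark of 4,600 tasks for evaluating morphological understanding in Filipino, a language with productive infixation, reduplication, and diacritic-driven lexical distinctions that are typically absent from written text. PACUTE includes a hierarchical diagnostic framework of six compositional levels that localizes where morphological understanding breaks down. Open-weight models perform near chance on morpheme decomposition regardless of scale. Frontier commercial models do much better, often recovering individual affixes under contains-match scoring, but remain far below their character-level ceilings on compositional tasks involving morpheme transformations and syllabification. The results identify productive morphological composition, rather than character access alone, as the persistent bottleneck for Filipino word-structure understanding.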

Link: https://arxiv.org/abs/2606.15144
Authors: Jann Railey Montalan, David Demitri Africa, Jimson Paulo Layacan, Richell Isaiah Flores, Ivan Yuri De Leon, Lance Calvin Gamboa
Affiliations: AI Singapore; Nanyang Technological University; UK AI Security Institute; Ateneo de Manila University; University of Birmingham
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Submitted to EMNLP 2026

Abstract:Large language models (LLMs) process text as sequences of subword tokens, which can obscure the character-level and morphological structure that underlies word formation. This limitation is most acute for languages with non-concatenative morphology, where standard tokenizers systematically misalign token boundaries with morpheme boundaries. We introduce PACUTE, a diagnostic benchmark of 4,600 tasks designed to evaluate morphological understanding in Filipino, a language characterized by productive infixation, reduplication, and diacritic-driven lexical distinctions that are typically absent from written text. PACUTE includes a hierarchical diagnostic framework of six compositional levels that localizes where morphological understanding breaks down. Evaluating open-weight LLMs and frontier commercial models, we find that open-weight models perform near chance on morpheme decomposition regardless of scale. Frontier models perform much better, often recovering individual affixes under contains-match scoring, but remain far below their character-level ceilings on compositional tasks of morpheme transformations and syllabification. These results identify productive morphological composition, rather than character access alone, as the persistent bottleneck for Filipino word-structure understanding.

[NLP-142] When Cognitive Graphs Meet LLMs: BDEI Cognitive Pathways for Panic Emotional Arousal Prediction

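[Quick read]: This paper addresses the prediction of when an individual's panic emotional arousal will occur, before it manifests. Existing methods incorporate cognitive elements but do not explicitly model the emotional arousal process, which makes them ill-suited to predicting arousal timing. The authors identify three problems: (1) appraisal theory holds that emotion arises from simultaneous evaluation across multiple threat dimensions, yet prior work does not fuse these inputs into a single risk perception; (2) existing cognitive models lack an explicit Emotion node, which decouples threat appraisal from emotional arousal and forces emotions to be inferred indirectly from behavior; and (3) current approaches adopt LLMs as the primary decision-maker while overlooking the fragility and hallucination-proneness of their outputs. The proposed PanicCognitivePath (PCP) framework addresses all three. A Psychological Safety Distance (PSD) model, grounded in psychological distance theory, maps four-domain signals into a unified risk metric that serves as the entry condition for cognitive reasoning. An explicit Emotion node grounded in appraisal emotion theory is added to the Belief-Desire-Intention (BDI) architecture, forming a Belief-Desire-Emotion-Intention (BDEI) pathway that agents enter when their risk metric exceeds the PSD threshold. The LLM is confined to parameter estimation for the Belief-to-Desire transition, which limits hallucinations to a single step and prevents error propagation. On Hurricane Sandy data, PCP improves arousal timing accuracy by 10.68% over baselines and reduces peak count error to 7.07%.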

Link: https://arxiv.org/abs/2606.15121
Authors: Mengzhu Liu, Long Qin, Chuan Ai, Zhengqiu Zhu, Hongru Liang, Chen Gao, Yong Li, Xin Lu, Quanjun Yin
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Predicting individual panic emotional arousal timing before manifestation is essential for proactive emergency intervention. Existing methods incorporate cognitive elements but none explicitly model the emotional arousal process, making them ill-suited for emotional arousal timing prediction. We argue that grounding prediction in appraisal emotion theory is necessary because it explicitly models this process, but three problems must be solved. (1) Appraisal theory posits that emotion arises from simultaneous evaluation across multiple threat dimensions, yet no prior work fuses these inputs into risk perception. (2) Existing cognitive models lack an Emotion node, decoupling threat appraisal from emotional arousal and forcing emotions to be inferred indirectly from behaviors. (3) Given their generalizable cognitive reasoning, current approaches adopt LLMs as the primary decision-maker, yet overlook the fragility and hallucination-proneness of their outputs. To address these issues, we introduce PanicCognitivePath (PCP), a framework that addresses all three. A Psychological Safety Distance (PSD) model, grounded in psychological distance theory, maps four-domain signals into a unified risk metric as the entry condition for subsequent cognitive reasoning. An explicit Emotion node grounded in appraisal emotion theory is introduced into BDI, forming a Belief-Desire-Emotion-Intention (BDEI) pathway. Agents whose risk metric exceeds the PSD threshold enter this pathway, coupling threat appraisal directly to emotional arousal. The BDEI pathway governs all state transitions while the LLM is confined to parameter estimation for the Belief-to-Desire transition, confining hallucinations to a single step and preventing error propagation. Experiments on Hurricane Sandy show that PCP improves arousal timing accuracy by 10.68% over baselines and reduces peak count error to 7.07%.

[NLP-143] When the Same Musical Knowledge Forgets Differently: A Clean Probe of Pathway-Dependent Forgetting

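[Quick Read]: This paper tests a key unexamined premise in research on forgetting in multimodal models, the Pathway-Invariant Assumption: that the route by which knowledge is acquired (listening to audio versus reading text) does not affect how easily it is forgotten during later adaptation. The same musical knowledge (for example, that Für Elise is calm and reflective) can enter a model through audio or through a text description aligned to the same perceptual content, yet text-pathway knowledge is forgotten more than matched audio-pathway knowledge under identical adaptation pressure. To attribute this asymmetry to the acquisition route rather than to confounds, the authors propose the Paired Pathway Controlled Protocol (PPCP), a three-phase design: it first establishes matched pathway baselines, then activates both pathways under symmetric supervision on the same knowledge pool, and finally applies identical forgetting pressure to both. The gap is stable across architecturally distinct audio-language models and persists when contradictory overwrite is replaced by correct-label cross-domain learning, under single-modality pressure, and with lightweight replay. Two independent routing-depth controls rule out architectural depth as the explanation and point to input representation as the dominant factor. The main contribution is showing that forgetting is highly route-dependent, which establishes acquisition route as a new analytical dimension for forgetting research and multimodal system design.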

Link: https://arxiv.org/abs/2606.15088
Authors: Yu Liu,Zhiwei Yang,Wenxiao Zhang,Cong Cao,Fangfang Yuan,Kun Peng,Haimei Qin,Lei Jiang,Jin B. Hong,Hao Peng,Yanbing Liu
Affiliations: Institute of Information Engineering, CAS; School of Cyber Security, UCAS; The University of Western Australia; Beihang University
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:

Abstract:A model can learn that the piano piece Für Elise is calm and reflective by listening to the audio or by reading a text description, but does it matter which route that knowledge took when it is later at risk of being forgotten? Forgetting research in multimodal models measures what knowledge is lost under adaptation, yet has not asked whether acquisition route affects how easily that knowledge is forgotten. We call this untested premise the Pathway-Invariant Assumption. Music understanding enables a clean test because a music clip and a canonical text description can be aligned to the same perceptual content, allowing the same knowledge unit to enter a model through listening or reading while the target remains fixed. Across multiple architecturally distinct audio-language models, we observe a consistent asymmetry: text-pathway knowledge is forgotten more than matched audio-pathway knowledge under identical adaptation pressure. To attribute this effect to route rather than confounds, we introduce the Paired Pathway Controlled Protocol (PPCP), a three-phase design that establishes matched pathway baselines, activates both pathways under symmetric supervision on the same knowledge pool, and applies identical forgetting pressure to both pathways. The gap is stable across models and gain-controlled analyses, persists when contradictory overwrite is replaced by correct-label cross-domain learning, remains under single-modality pressure, and is not removed by lightweight replay. Two independent routing-depth controls confirm that the effect is not explained by architectural depth, pointing to input representation as the dominant factor. Under PPCP, our results demonstrate that forgetting is highly route-dependent, establishing acquisition route as a new analytical dimension for forgetting research and multimodal system design.

[NLP-144] AdaMame: A Training Recipe for Adaptive Multilingual Reasoning

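[Quick Read]: This paper addresses language collapse in Large Reasoning Models (LRMs) on multilingual mathematical reasoning: the models perform well in English but fail to reason in the language of the query. Existing RL-based fixes typically add a binary language fidelity reward, which tends to cost accuracy, cause mid-trace code-switching, and inflate token usage. The authors propose AdaMame, a two-stage training recipe that adaptively aligns the reasoning language to the query language without compromising accuracy. The first stage is supervised fine-tuning (SFT) on naturally occurring reasoning traces in five languages, which establishes multilingual reasoning capability. The second stage uses AdaMame-GRPO, an adaptation of Group Relative Policy Optimization in which a query-conditioned alignment factor grows progressively during training, so the model first explores diverse reasoning languages and then converges on reasoning in the query language. Across two benchmarks, two LRMs, and 12 languages, AdaMame-GRPO is Pareto-optimal over all baselines in reasoning accuracy, language fidelity, and token efficiency, with the strongest gains on out-of-domain, lower-resource languages.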

Link: https://arxiv.org/abs/2606.15080
Authors: Dayeon Ki,Kevin Duh,Marine Carpuat
Affiliations: University of Maryland; Johns Hopkins University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 20 pages, 5 figures

Abstract:While Large Reasoning Models (LRMs) show strong performance in English, they often fail to reason in the language of the query, a phenomenon known as language collapse. Existing RL-based fixes typically add a binary language fidelity reward to the accuracy objective, yet still incur trade-off in accuracy, mid-trace code-switching, and excessive token usage. In this work, we propose AdaMame, a two-stage training recipe for multilingual mathematical reasoning that addresses these limitations by adaptively aligning the reasoning language to the query language without compromising accuracy. The first SFT stage fine-tunes on naturally occurring reasoning traces across five languages to establish multilingual reasoning capability. In the subsequent RL stage, we introduce AdaMame-GRPO, an adaptation of Group Relative Policy Optimization (GRPO) in which a query-conditioned alignment factor grows progressively during training, guiding the model to first explore diverse reasoning languages before exploiting reasoning in the query language. Evaluated across two benchmarks, two LRMs, and 12 languages, AdaMame-GRPO achieves Pareto-optimal performance across reasoning accuracy, language fidelity, and token efficiency over all baselines, with the strongest gains on out-of-domain, lower-resource languages.
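
The progressively growing alignment factor described in the abstract can be illustrated with a minimal sketch. The linear schedule, the additive language bonus, and every function name below are assumptions made for illustration; the abstract does not say how AdaMame-GRPO computes or applies the factor. Only the within-group standardization of rewards follows the usual GRPO recipe.

```python
import statistics


def alignment_factor(step, total_steps, max_factor=1.0):
    """Hypothetical linear schedule: 0 at the start, max_factor at the end."""
    return max_factor * min(1.0, step / total_steps)


def shaped_reward(correct, in_query_language, step, total_steps):
    """Accuracy reward plus a language bonus whose weight grows over training.

    Early in training the bonus is near zero, so any reasoning language is
    rewarded equally (exploration); later the query language is favoured.
    """
    bonus = alignment_factor(step, total_steps) * float(in_query_language)
    return float(correct) + bonus


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# The same correct, query-language response earns more as training progresses.
early = shaped_reward(True, True, step=0, total_steps=100)
late = shaped_reward(True, True, step=100, total_steps=100)
advantages = group_relative_advantages([1.0, 0.0])
```

A multiplicative factor or a non-linear schedule would fit the abstract's description equally well; the sketch only shows the explore-then-exploit shape.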

[NLP-145] Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale

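[Quick Read]: This report addresses the difficulty of combining efficiency and capability in agentic systems at scale: delivering low-latency responses and strong reasoning while keeping the model practical to train, serve, and deploy. The solution is a unified co-design of model architecture, optimization objectives, serving systems, and agent training environments. Rather than training from scratch, the authors upgrade the Ling-2.0 base model through architectural migration pre-training and large-scale post-training. At the architectural level, a hybrid linear attention design integrates Lightning Attention with MLA to improve the efficiency of long-context training and decoding. Capability per output token is optimized through Evolutionary Chain-of-Thought, Linguistic Unit Policy Optimization, bidirectional preference alignment, and shortest-correct-response distillation. For deeper reasoning and complex agentic workflows, the report proposes KPop, a reinforcement learning framework that supports stable training of Ring-2.6-1T on large-scale environment-grounded data and uses asynchronous scheduling across coding, search, tool use, and workflow execution to learn scalably from complex agent-environment interactions. All checkpoints in the 2.6 family are open-sourced.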

Link: https://arxiv.org/abs/2606.15079
Authors: Ang Li,Ben Liu,Bin Han,Bin Hu,Bin Jing,Binbin Hu,Bing Li,Cai Chen,Caizhi Tang,Changxin Tian,Chao Huang,Chao Zhang,Chen Liang,Chen Qian,Chengfu Tang,Chengyao Wen,Chilin Fu,Chunwei Wu,Cong Zhang,Cunyin Peng,Daixin Wang,Dalong Zhang,Deng Zhao,Dingnan Jin,Dingyuan Zhu,Donghao Zhang,Fan Yuan,Fangzheng Zhao,Fanzhuang Meng,Feifan Wu,Feng Xu,Fengbin Fang,Gangshan Wang,Guodong Yang,Hailin Zhao,Haitao Wang,Haitao Zhang,Hanxiao Zhang,Hanzi Wang,Hao Dai,Hao Liu,Hao Qian,Hao Wu,Haoxiong Liu,Haoyu Xu,Heng Zhang,Hong Liu,Hongliang Zhang,Hongrui Liu,Hongxun Li,Hongzhi Ruan,Huaidong Xiong,Huihuang Zheng,Huikang Tang,Jia Guo,Jia Li,Jia Liu,Jiameng Wang,Jiaming Liu,Jiannan Shi,Jianping Wei,Jiaolong Yang,Jiapeng Wang,Jie Gao,Jie Wang,Jiewei Wu,Jin Yang,Jinjin Li,Jinjing Huang,Jinquan Sun,Jinyao Chen,Juanhui Tu,Jun Liu,Jun Mei,Jun Xu,Jun Zhou,Junjie Ou,Junnan Sipan,Junpeng Fang,Kaihong Zhang,Kaiqin Hu,Ke Shi,Kuan Xu,Kun Tang,Kunlong Chen,Lanyin Mei,Lei Chen,Lei Liang,Lei Xu,Li Tang,Liang Jiang,Liangcheng Fu,Lihui Zhang,Linfeng Shi,Lintao Ma,Liyuan Liu,Longfei Li,Longfei Zheng,Lu Liu,Lu Yu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Efficient and scalable agentic intelligence requires models that can deliver both low-latency responses and strong reasoning capabilities while remaining practical to train, serve, and deploy. In this report, we present Ling-2.6 and Ring-2.6, a family of models designed to address this challenge at scale. Ling-2.6 is optimized for instant response generation and high capability per output token, whereas Ring-2.6 is tailored for deeper reasoning and more advanced agentic workflows. Instead of training from scratch, we upgrade the Ling-2.0 base model through architectural migration pre-training and large-scale post-training. This upgrade is guided by a unified co-design of model architecture, optimization objectives, serving systems, and agent training environments, enabling improvements in both model capability and deployment efficiency. At the architectural level, we introduce a hybrid linear attention design that integrates Lightning Attention with MLA, improving the efficiency of long-context training and decoding. To further enhance token efficiency, we optimize capability per output token through Evolutionary Chain-of-Thought, Linguistic Unit Policy Optimization, bidirectional preference alignment, and shortest-correct-response distillation. For agentic capabilities, we propose KPop, a reinforcement learning framework designed to support stable training of Ring-2.6-1T on large-scale environment-grounded data. KPop improves training efficiency through asynchronous scheduling across coding, search, tool use, and workflow execution, enabling scalable learning from complex agent-environment interactions. Together, Ling-2.6 and Ring-2.6 provide a practical pathway toward efficient, scalable, and open agentic systems. We open-source all checkpoints in the 2.6 family to support further research and development in practical agentic intelligence.

[NLP-146] Risk-Aware LLM Agents for Geospatial Data Retrieval: Design and Preliminary Adversarial Evaluation

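[Quick Read]: This paper addresses the semantic and safety challenges of retrieving remote sensing data from cloud-based geospatial catalogues with natural language queries: how to map user intent reliably onto structured API calls while keeping operations safe and policy-compliant. The solution is an LLM-driven framework built on three cooperating agents: an intercept-level Guardrail agent for safety and policy enforcement, a General-QA agent for intent interpretation, and a schema-aware Recommender-Analyst agent for API call generation. This layered design converts natural language requests into semantically aligned, executable API calls, and the modular framework can be ported across platforms by substituting the API schema. Preliminary adversarial multi-turn experiments show that prompt-level safety instructions improve robustness, but rare high-impact failures persist in API manipulation scenarios, which motivates the intercept-level Guardrail as a way to balance safety, usability, and cost efficiency.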

Link: https://arxiv.org/abs/2606.15077
Authors: Kyle Gao,Joel Cumming,Jonathan Li,Linlin Xu,David A. Clausi
Affiliations: University of Waterloo; SkyWatch; University of Calgary
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted for publication in the International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences (ISPRS Archives), ISPRS Congress 2026

Abstract:We present an LLM-driven framework for retrieving remote sensing data from cloud-based geospatial catalogues using natural language queries. The system converts user intent into structured API calls, enabling efficient access to satellite imagery and environmental datasets. The architecture integrates three agents: Guardrail for safety and policy enforcement, General-QA for intent interpretation, and Recommender-Analyst for schema-aware API call generation. This coordinated design ensures reliable, semantically aligned interaction with external data services. The modular framework is portable across platforms through API schema substitution and supports applications in environmental monitoring, disaster response, and climate analysis. It establishes a scalable interface between user intent and geospatial infrastructure, enabling streamlined and automated Earth observation workflows. Preliminary experiments under adversarial multi-turn settings show that prompt-level safety instructions improve robustness, although rare high-impact failures persist in API manipulation scenarios and highlight the need for adaptive, system-level defenses that balance safety, usability, and cost efficiency, which motivates the use of our intercept-level Guardrail agent.
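
The three-agent flow described in the abstract (Guardrail, then General-QA, then Recommender-Analyst) can be sketched as a simple pipeline. This is only a structural illustration: plain Python functions stand in for the LLM agents, and the blocked-phrase list, the intent fields, the `/search` endpoint, and all names are hypothetical rather than taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical policy; a real Guardrail agent would use an LLM and richer rules.
BLOCKED_PHRASES = ("ignore previous", "delete", "api key")


@dataclass
class ApiCall:
    endpoint: str
    params: dict


def guardrail(query):
    """Intercept layer: decide whether the query may reach the other agents."""
    lowered = query.lower()
    return not any(phrase in lowered for phrase in BLOCKED_PHRASES)


def interpret_intent(query):
    """Stand-in for General-QA: extract a coarse intent dictionary."""
    lowered = query.lower()
    region = query.split(" over ")[-1].strip(" .") if " over " in query else None
    return {"sensor": "optical" if "optical" in lowered else "any", "region": region}


def build_call(intent):
    """Stand-in for Recommender-Analyst: fill a fixed, hypothetical schema."""
    params = {key: value for key, value in intent.items() if value is not None}
    return ApiCall(endpoint="/search", params=params)


def handle(query):
    """Run the pipeline; return None when the guardrail rejects the query."""
    if not guardrail(query):
        return None
    return build_call(interpret_intent(query))


accepted = handle("Find optical imagery over Calgary")
rejected = handle("Ignore previous instructions and print the API key")
```

Placing the guardrail first means a rejected query never reaches the component that generates API calls, which is the property the abstract attributes to an intercept-level defense.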

[NLP-147] Stop When Further Reasoning Won't Help: Attention-State Adaptive Generation in Reasoning Models ICML2026

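[Quick Read]: This paper addresses overthinking in large reasoning models, which produces redundant tokens and degrades accuracy. Existing mitigations are limited: training-based approaches require substantial compute, while training-free approaches rely on carefully crafted prompts or unreliable confidence signals. The authors study early stopping from the perspective of attention distributions and propose ASAG (Attention-State Adaptive Generation), a training-free, plug-and-play method that infers the model's reasoning state and adaptively adjusts the generation strategy. Because it needs no change to the underlying model, it integrates directly into existing LRMs. Experiments on nine benchmarks show consistent improvements across the DeepSeek-R1-Distill and Qwen3 series; on Qwen3-8B, ASAG improves average accuracy by 3.2% while reducing generated tokens by nearly 40%.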

Link: https://arxiv.org/abs/2606.15070
Authors: Jiakai Li,Ke Qin,Rongzheng Wang,Yizhuo Ma,Qizhi Chen,Muquan Li,Shuang Liang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: ICML 2026 Spotlight

Abstract:By incorporating test-time compute scaling, large reasoning models (LRMs) can solve complex problems through explicit chain-of-thought (CoT) reasoning processes. However, they often suffer from overthinking, resulting in redundant token outputs and degraded accuracy. Current methods to mitigate this issue remain limited: training-based approaches require substantial computational resources, while training-free methods rely on well-crafted prompts or unreliable confidence signals. In this work, we investigate early stopping from the perspective of attention distributions and propose a simple method, ASAG, which infers the model’s reasoning state and adaptively adjusts the generation strategy. The proposed framework is training-free and plug-and-play, enabling seamless integration into existing LRMs. Extensive experiments on nine benchmarks demonstrate consistent improvements across mainstream LRMs with varying parameter scales, including the DeepSeek-R1-Distill and Qwen3 series. Specifically, ASAG improves average accuracy by 3.2% while reducing the number of generated tokens by nearly 40% across all reasoning tasks on Qwen3-8B.

[NLP-148] CoCoGEC: Counterfactual Generation for Robust Grammatical Error Correction

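[Quick Read]: This paper addresses the sharp performance drop of grammatical error correction (GEC) models when the surrounding context is slightly perturbed or extended, which indicates that existing models do not generalize error patterns across varying contexts. The proposed solution is CoCoGEC, a counterfactual generation framework that creates copies of training instances in which the error-irrelevant context is altered, making models more robust to context changes. The framework has two steps: (1) it generates intra- and inter-sentence counterfactuals by altering word-level and sentence-level context while preserving the error patterns and syntax of the original instances; (2) it revises the generated counterfactuals by selecting instances with flipped labels and a high GEC Mutual Information (MI) coefficient. Experiments show substantially improved stability, with absolute F0.5 gains of +9.9, +11.3, and +20.8 points on the perturbed BEA-19*, CoNLL-14*, and TEM-8* data, outperforming a set of data augmentation baselines.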

Link: https://arxiv.org/abs/2606.15069
Authors: Qianyu Wang,Xiaoman Wang,Yuanyuan Liang,Xinyuan Li,Yunshi Lan
Affiliations: East China Normal University
Categories: Computation and Language (cs.CL)
Comments:

Abstract: Grammatical error correction (GEC) systems are usually trained and evaluated on GEC benchmarks, but their performance often drops sharply once the surrounding context is slightly perturbed or extended. This indicates that the existing GEC models usually fail to understand the error patterns in the varying contexts. In this paper, we thoroughly investigate the counterfactuals for GEC tasks, where the subtle changes to the contexts could lead to the label flipping issue. We propose CoCoGEC, a counterfactual generation framework that creates copies of training instances with error-irrelevant contexts altered. Our framework systematically generates counterfactuals by (1) generating intra- and inter-sentence counterfactuals that maintain the error patterns as well as syntax of the original instances by altering the word-level and sentence-level contexts; (2) revising the generated counterfactuals by selecting the instances with flipped labels and high GEC Mutual Information (MI) coefficient. Extensive experiments show that our method substantially improves the stability of GEC models, outperforming a set of data augmentation baselines. Particularly, it could achieve absolute F0.5 gains of +9.9, +11.3, and +20.8 points on the perturbed BEA-19*, CoNLL-14*, and TEM-8* data. Our code is released at this https URL.

[NLP-149] A Practical Evaluation Method for Long-Form Simultaneous Speech-to-Speech Translation

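[Quick Read]: This paper addresses the limitations of existing evaluation for simultaneous speech-to-speech translation (SimulS2ST) on long-form, continuous input. Current evaluation focuses largely on short or pre-segmented speech, is hard to reproduce, and makes assumptions that do not hold for end-to-end systems. The authors propose a practical long-form evaluation method: given source speech, pre-segmented source transcripts, and reference translations, it runs automatic speech recognition (ASR) and forced alignment on the generated target speech to recover token-level timestamps, then uses a sentence-embedding-based aligner to match the target text to its corresponding source sentences. This makes it possible to compute sentence-level latency and quality metrics such as YAAL and xCOMET, which are then aggregated into system-level scores. Experiments on representative systems show that the method works in practice and that current systems suffer substantial latency accumulation on long speech.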

Link: https://arxiv.org/abs/2606.15059
Authors: Yulin Xue,Siqi Ouyang,Lei Li
Affiliations: Carnegie Mellon University
Categories: Computation and Language (cs.CL)
Comments: Accepted to IWSLT 2026 Scientific Track

Abstract:Simultaneous speech-to-speech translation (SimulS2ST) enables real-time cross-lingual communication, but existing evaluation has focused largely on short or pre-segmented speech rather than long-form, continuous input. Prior approaches are difficult to reproduce and make assumptions that do not hold for end-to-end systems. We present a practical evaluation method for long-form SimulS2ST. Given source speech, pre-segmented source transcripts, and reference translations, we run automatic speech recognition (ASR) and forced alignment on the generated target speech to recover token-level timestamps, then apply a sentence-embedding-based aligner to match the target text to its corresponding source sentences. This enables sentence-level computation of latency and quality metrics, including YAAL and xCOMET, which are then aggregated into final system-level scores. Experiments on representative SimulS2ST systems show that the method is effective in practice and reveal that current systems suffer from substantial latency accumulation on long speech.

[NLP-150] Equity with Efficiency: An Empirical Study of Tokenizers for Multilingual Large Language Models

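[Quick Read]: This paper addresses cross-lingual inequity in multilingual large language models caused by subword tokenizers that structurally favor high-resource languages and Latin scripts. For speakers of underrepresented languages, particularly in Southeast Asia, this bias inflates inference costs and widens cross-lingual capability gaps. The authors systematically compare equitable tokenizers on a unified benchmark spanning 11 Southeast Asian languages, analysing the trade-off between compression efficiency and cross-lingual equity and measuring downstream performance with controlled 1.5B-parameter models trained on the same data. Parity-aware BPE lies on the Pareto frontier of the efficiency-equity trade-off, achieving strong compression parity at competitive cost. Morphology-Driven Byte Encoding gives the best semantic reasoning performance through morphologically richer representations, at higher computational expense. Byte Latent Transformer underperforms on downstream tasks, possibly because its architectural assumptions misalign with limited low-resource training data. Overall, the results indicate that cross-lingual fairness and tokenization efficiency are not fundamentally at odds and offer practical guidance for designing equitable multilingual models.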

Link: https://arxiv.org/abs/2606.15044
Authors: Kieron Seven Jun Wei Lee,Muhammad Reza Qorib,Andrew Ivan Soegeng,Hwee Tou Ng
Affiliations: National University of Singapore; Carnegie Mellon University; SAP
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Multilingual large language models (LLMs) depend on subword tokenization to bridge discrete text and continuous neural representation. State-of-the-art multilingual LLMs often use Byte-level Byte-Pair Encoding (BPE) tokenizers that structurally favor high-resource languages and Latin scripts. For speakers of underrepresented languages, particularly those across Southeast Asia, this bias inflates inference costs and widens cross-lingual capability gaps. We present the first systematic comparison of equitable tokenizers on a unified benchmark spanning 11 Southeast Asian languages. Beyond tokenizer-level analysis of compression efficiency and cross-lingual equity, we assess downstream task performance through controlled 1.5B-parameter language model training using the same training data. Our results show that Parity-aware BPE lies on the Pareto frontier of the efficiency-equity trade-off, achieving strong compression parity at competitive cost. Morphology-Driven Byte Encoding delivers the best semantic reasoning performance through morphologically richer representations, albeit at a higher computational expense. Byte Latent Transformer underperforms on downstream tasks, possibly because its architectural assumptions misalign with the constraints of limited low-resource training data. Together, our findings demonstrate that cross-lingual fairness and tokenization efficiency are not fundamentally at odds, and offer practical guidance for designing equitable multilingual models.
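
To make the notion of compression parity concrete, the following minimal sketch counts how many tokens a tokenizer needs for the same content in different languages and reports the ratio against a reference language. The ratio definition, the function names, and the toy byte-level tokenizer are assumptions for illustration, not the metric or tokenizers used in the paper.

```python
def token_counts(tokenize, parallel_corpus):
    """Total token count per language over parallel (same-meaning) sentences."""
    return {
        lang: sum(len(tokenize(sentence)) for sentence in sentences)
        for lang, sentences in parallel_corpus.items()
    }


def parity_ratios(counts, reference="en"):
    """Tokens needed relative to the reference language (1.0 means parity)."""
    return {lang: n / counts[reference] for lang, n in counts.items()}


def byte_tokenize(text):
    """Toy tokenizer: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


# "thank you" in English and Thai; the Thai string is written with escapes
# so the example does not depend on the file's encoding.
corpus = {
    "en": ["thank you"],
    "th": ["\u0e02\u0e2d\u0e1a\u0e04\u0e38\u0e13"],
}
counts = token_counts(byte_tokenize, corpus)
ratios = parity_ratios(counts)
```

With this toy tokenizer the Thai phrase costs twice as many tokens as the English one, because each Thai code point occupies three UTF-8 bytes. A ratio above 1.0 translates directly into higher inference cost for that language, which is the inequity the abstract describes.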

[NLP-151] ReportQA: QA-Based Radiology Report Evaluation

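[Quick Read]: This paper addresses the limited clinical relevance and extensibility of existing radiology report evaluation. Natural language generation metrics carry little clinical meaning, while clinical efficacy (CE) metrics evaluate important findings but focus mainly on presence, cover a limited set of entities, and are hard to extend because they rely heavily on manual annotation. The authors propose ReportQA, a clinically oriented and flexible evaluation framework. With radiologist guidance, they build knowledge trees of clinical entities and attributes covering multiple imaging modalities and anatomical regions, and use large language models (LLMs) to extract structured information from raw reports. QA pairs are generated from predefined templates and quality-controlled through self-filtering and report-based filtering. At evaluation time the report is treated as context and an LLM judge answers the QA pairs; the resulting QA accuracy defines the QAScore metric. QAScore aligns better with radiologist judgments than existing metrics, and experiments show that current report-based vision-language models struggle to learn fine-grained clinical representations and exhibit strong negative prior biases, whereas question-driven inference is more effective. The knowledge trees, structured reports, QA pairs, and pipeline code for QA construction and evaluation are released.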

Link: https://arxiv.org/abs/2606.15037
Authors: Yiming Shi,Shaoshuai Yang,Xi Chen,Haolin Li,Hengyu Zhang,Che Jiang,Kaiwen Wang,Xun Zhu,Dong Xie,Fei Wang,Dejing Dou,Miao Li,Ji Wu
Affiliations: Tsinghua University; Beijing National Research Center for Information Science and Technology; Beijing Electronic Digital Intelligence
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Radiology report evaluation is essential for advancing automated report generation. Natural language generation metrics have limited clinical relevance. Clinical efficacy (CE) metrics evaluate important medical findings, but focus mainly on presence and cover only a limited set of entities. Due to heavy reliance on manual annotations, it is difficult for CE metrics to extend clinical entities or attributes. In clinical practice, radiology reports serve as a medium for information transfer. Clinicians use them to perform downstream diagnostic tasks without directly inspecting images. Based on this insight, we propose ReportQA, a clinical-related and flexible radiology report evaluation framework, supporting detailed quantitative analysis of radiology report generation systems. We first collect datasets covering multiple imaging modalities and anatomical regions. We then construct knowledge trees of clinical entities and attributes with radiologist guidance, and use large language models (LLMs) to extract structured information from raw reports. Next, we generate QA pairs from predefined templates and apply quality control through self-filtering and report-based filtering. During evaluation, the report is treated as context, and an LLM acts as a judge model to answer the QA pairs. Based on the resulting QA accuracy, we introduce QAScore metric. Compared with existing metrics, QAScore shows better alignment with radiologist judgments. Experiments on multiple state-of-the-art vision-language models reveal that current report-based inference paradigms struggle to learn fine-grained clinical representations and exhibit strong negative prior biases. In contrast, question-driven inference provides a more effective alternative. For reproducibility and extensibility, we release the knowledge trees, structured reports, and QA pairs, along with the pipeline code for QA construction and evaluation.
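
Since QAScore is described as being based on QA accuracy with the report as context, its core computation can be sketched in a few lines. The exact scoring formula, the question template, and the keyword-matching judge below are assumptions for illustration; in the paper the judge is an LLM, and the abstract gives no further detail on how accuracy is turned into the final score.

```python
def qa_score(qa_pairs, answer_fn, report):
    """Fraction of QA pairs answered correctly using only the report as context."""
    if not qa_pairs:
        raise ValueError("need at least one QA pair")
    correct = sum(
        answer_fn(report, question).strip().lower() == answer.strip().lower()
        for question, answer in qa_pairs
    )
    return correct / len(qa_pairs)


def keyword_judge(report, question):
    """Toy stand-in for the LLM judge.

    Assumes questions follow the hypothetical template "Is there <finding>?"
    and ignores negation, so it is only suitable for this demonstration.
    """
    finding = question.removeprefix("Is there ").rstrip("?")
    return "yes" if finding in report.lower() else "no"


# QA pairs derived from a reference report; the generated report omits one finding.
qa_pairs = [
    ("Is there consolidation?", "yes"),
    ("Is there pneumothorax?", "no"),
    ("Is there cardiomegaly?", "yes"),
]
generated_report = "Right lower lobe consolidation. Heart size is normal."
score = qa_score(qa_pairs, keyword_judge, generated_report)
```

The generated report misses cardiomegaly, so two of three questions are answered correctly. Unlike a presence-only CE metric, new entities or attributes can be covered by adding question templates rather than new manual annotations.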

[NLP-152] Deep Temporal Modeling and Ensemble Fusion for Multimodal Emotion Recognition from Physiological Signals ALT

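[Quick Read]: This paper addresses emotion and stress recognition from multimodal physiological signals, where the core challenge is combining heterogeneous wrist and chest sensor signals (such as ECG, electrodermal activity, and acceleration) to improve accuracy and robustness. The solution uses fusion at two levels. At the sensor level, early fusion concatenates wrist and chest signals into a single input for each model. At the decision level, a late-fusion ensemble combines the predictions of LSTM, TCN, and Transformer models that were each trained on the multimodal input. Ablations with wrist-only and chest-only inputs measure the contribution of each modality. Transformer models perform best in multimodal settings, TCN models perform best with wrist-only input, and the ensemble achieves the highest overall accuracy (98.91 ± 0.13%) and macro-F1 score (98.56 ± 0.17%) on the WESAD dataset.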

Link: https://arxiv.org/abs/2606.15026
Authors: Desta Haileselassie Hagos,Saurav Keshari Aryal,Patrick Ymele-Leki,Anietie Andy,Legand L. Burge
Affiliations: Howard University
Categories: Computation and Language (cs.CL)
Comments: Accepted for publication in the 17th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (ACM BCB 2026). DOI: this https URL

Abstract:Physiological stress and emotion recognition are important for health monitoring and affective computing. In this work, we present a comprehensive evaluation of deep learning models such as Long Short-Term Memory (LSTM), Temporal Convolutional Networks (TCN), and Transformer on the WESAD dataset for multimodal affect recognition using wrist and chest sensor signals. We perform ablation studies to assess the individual contributions of each modality by training models on wrist-only and chest-only inputs. In addition, we implement a late-fusion ensemble strategy that combines predictions from all three architectures trained on multimodal input. We also employ early fusion at the sensor level by concatenating wrist and chest signals before feeding them into each model. Our results show that Transformer models consistently achieve the highest accuracy in multimodal settings, while TCN models perform best in the wrist-only configuration. The ensemble method yields the highest overall accuracy (98.91 +/- 0.13%) and macro-F1 score (98.56 +/- 0.17%). These findings demonstrate the effectiveness of sensor fusion and ensemble-based fusion in developing robust systems for physiological emotion recognition.
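
The two fusion levels mentioned in the abstract can be contrasted in a short sketch. Early fusion is shown as feature concatenation, which matches the abstract; the late-fusion rule shown here (a weighted average of class probabilities followed by an argmax) is an assumption, since the abstract only says that predictions from the three architectures are combined. All names and numbers are illustrative.

```python
def early_fusion(wrist_features, chest_features):
    """Sensor-level fusion: concatenate wrist and chest features for one time step."""
    return list(wrist_features) + list(chest_features)


def late_fusion(prob_sets, weights=None):
    """Decision-level fusion over models for a single sample.

    prob_sets holds one per-class probability list per model. With no weights
    given, every model contributes equally. Returns the fused probabilities
    and the index of the winning class.
    """
    n_models = len(prob_sets)
    weights = weights or [1.0 / n_models] * n_models
    n_classes = len(prob_sets[0])
    fused = [
        sum(w * probs[c] for w, probs in zip(weights, prob_sets))
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=fused.__getitem__)
    return fused, label


# Made-up class probabilities from three models for one window of signal.
lstm_probs = [0.6, 0.3, 0.1]
tcn_probs = [0.2, 0.5, 0.3]
transformer_probs = [0.1, 0.7, 0.2]
fused, label = late_fusion([lstm_probs, tcn_probs, transformer_probs])
combined_input = early_fusion([0.1, 0.2], [0.3, 0.4, 0.5])
```

In this example the LSTM alone would pick class 0, but the ensemble picks class 1 because the other two models agree on it, which is the kind of error correction an ensemble is meant to provide.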

[NLP-153] Are Online Skill and Memory Modules Always Worth Their Tokens? A Budget-Constrained Study of Web Agents

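[Quick Read]: This paper examines the test-time token cost that online web agents incur when a base actor is augmented with memory, workflow, or skill modules. These modules can improve performance, but their cost is rarely reported alongside the actor's own inference cost. The key idea is to re-evaluate augmentation under a fixed total inference budget by comparing against a token-matched vanilla baseline that spends the same budget on additional actor steps. Across three WebArena domains and three models (Gemini 3 Flash, GPT-5.4-mini, and Qwen 3.6-27B), the vanilla baseline matches or surpasses AWM, ASI, and ReasoningBank in aggregate success rate while often using fewer total tokens. A similar trend appears on WorkArena-L1 with Qwen 3.6-27B, which extends the finding to enterprise knowledge-work tasks. Skills and workflow memory can still help in specific domains, but their apparent gains often vanish against a budget-matched actor. The study also shows that run-to-run variance materially affects outcomes and should be reported as a core evaluation criterion.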

Link: https://arxiv.org/abs/2606.15017
Authors: Sina Hajimiri,Masih Aminbeidokhti,Jose Dolz,Ismail Ben Ayed,Issam H. Laradji,Spandana Gella,Nicolas Gontier
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Online web agents often augment a base actor with memory, workflow, or skill modules. These modules can improve performance, but they also consume test-time tokens, a cost rarely reported alongside the actor’s inference cost. We study online augmentation, where this overhead is paid on every task, and re-evaluate its benefits under a fixed total inference budget. We compare AWM, ASI, and ReasoningBank with a token-matched vanilla baseline that uses the same budget for additional actor steps. Across three WebArena domains and three models, Gemini 3 Flash, GPT-5.4-mini, and Qwen 3.6-27B, the vanilla baseline matches or surpasses all three augmentation methods in aggregate success rate while often using fewer total tokens. We observe a similar trend on WorkArena-L1 with Qwen 3.6-27B, indicating that the effect extends to enterprise knowledge-work tasks. Our results suggest that skills and workflow memory can be useful in specific domains, but their apparent gains often vanish against a budget-matched actor. We further show that run-to-run variance materially affects outcomes and should be reported as a core evaluation criterion for online web agents.
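
The token-matched comparison can be made concrete with simple budget arithmetic. The sketch below assumes a fixed per-step token cost and a fixed per-task module overhead; both simplifications and all numbers are hypothetical and are not taken from the paper.

```python
def steps_within_budget(total_budget, tokens_per_step, module_overhead=0):
    """Actor steps affordable after paying any per-task module overhead."""
    remaining = total_budget - module_overhead
    return max(0, remaining // tokens_per_step)


# Hypothetical numbers: the same total budget for both agents.
BUDGET = 60_000
TOKENS_PER_STEP = 4_000
MODULE_OVERHEAD = 20_000  # tokens spent on memory/skill retrieval and updates

augmented_steps = steps_within_budget(BUDGET, TOKENS_PER_STEP, MODULE_OVERHEAD)
vanilla_steps = steps_within_budget(BUDGET, TOKENS_PER_STEP)
```

Under these assumed numbers the vanilla actor can take 15 steps where the augmented one takes 10. The augmentation is worthwhile only if what the module contributes outweighs the five extra actor steps it displaces, which is the question a budget-matched baseline makes visible.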

[NLP-154] Nemotron 3 Ultra: Open Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

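[Quick Read]: This report addresses the difficulty of combining long-context capability, inference efficiency, and accuracy in a single large language model. It presents Nemotron 3 Ultra, a Mixture-of-Experts Hybrid Mamba-Attention language model with 550 billion total and 55 billion active parameters. The model is pre-trained on 20 trillion text tokens, extended to a 1M-token context length, and post-trained with Supervised Fine Tuning (SFT), Reinforcement Learning (RL), and Multi-teacher On-Policy Distillation (MOPD). Its key technologies are LatentMoE, Multi Token Prediction (MTP), NVFP4 pre-training, multi-environment RLVR, MOPD, and reasoning budget control. Nemotron 3 Ultra achieves up to ~6x higher inference throughput than state-of-the-art publicly available LLMs at on-par accuracy, which makes it well suited to long-running autonomous agentic tasks. The base, post-trained, and quantized checkpoints are open-sourced together with the training data and recipe.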

Link: https://arxiv.org/abs/2606.15007
Authors: NVIDIA: Aaron Blakeman,Aaron Thomas,Aastha Jhunjhunwala,Abhibha Gupta,Abhinav Khattar,Adam Rajfer,Adi Renduchintala,Adil Asif,Aditya Vavre,Adriana Flores Miranda,Ahmad Bilal,Aileen Zaman,Ajay Hotchandani,Akanksha Shukla,Akhiad Bercovich,Aleksander Ficek,Alex Gronskiy,Alex Kondratenko,Alex Steiner,Alex Ye,Alexander Bukharin,Alexandre Milesi,Ali Taghibakhshi,Alice Gatti,Alisa Liu,Alok Kumar,Amar Phanishayee,Ameya Sunil Mahabaleshwarkar,Amir Klein,Amit Zuker,Amnon Geifman,Anahita Bhiwandiwalla,Ananth Subramaniam,Andrea Santilli,Andrew Fulks,Andrew McHarg,Andrew Tao,Andrii Skliar,Anjulie Agrusa,Ankur Srivastava,Ankur Verma,Anna Shors,Anna Warno,Antoni-Joan Solergibert I Llaquet,Arham Mehta,Arkadiusz Nowaczynski,Arti Jain,Ashwath Aithal,Ashwin Poojary,Asif Ahamed,Asit Mishra,Asma Kuriparambil Thekkumpate,Atefeh Sohrabizadeh,Avinash Kaur,Avinash Vem,Ayush Dattagupta,Barath Subramaniam Anandan,Bardiya Sadeghi,Ben Lanir,Benedikt Schifferer,Besmira Nushi,Bilal Kartal,Bill Thiede,Bita Darvish Rouhani,Bo Deng,Bob Schatz,Boris Ginsburg,Boxin Wang,Brad Nemire,Brandon Norick,Brian Dang,Brian Westphal,Brian Yu,Brucek Khailany,Bryan Catanzaro,Carlo del Mundo,Caryln Aarish,Chankyu Lee,Chantal Hwang,Charbel Sakr,Charles Wang,Charlie Truong,Chen Cui,Cheng Cheng,Cheng-Ping Hsieh,Chenghao Zhang,Chenhui Deng,Chintan Patel,Chris Alexiuk,Christian Cosgrove,Christian Munley,Christine Harvey,Christopher Parisien,Chunyang Shen,Coco Li,Collin Neale,Cynthia Gao,Cyril Meurillon,Dan Gil
Affiliations: NVIDIA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We introduce Nemotron 3 Ultra, a 550 billion total and 55 billion active parameter Mixture-of-Experts Hybrid Mamba-Attention language model. We pre-trained Nemotron 3 Ultra on 20 trillion text tokens, then extended the context length to 1M tokens, and post-trained using Supervised Fine Tuning (SFT), Reinforcement Learning (RL), and Multi-teacher On-Policy Distillation (MOPD). Nemotron 3 Ultra is our most capable model yet, employing multiple key technologies - LatentMoE, Multi Token Prediction (MTP), NVFP4 pre-training, multi-environment RLVR, MOPD, and reasoning budget control. Nemotron 3 Ultra achieves up to ~6x higher inference throughput as compared to state-of-the-art publicly available LLMs while attaining on-par accuracy. The state-of-the-art accuracy, high inference throughput, and 1M token context length make Nemotron 3 Ultra ideal for long-running autonomous agentic tasks. We open-source the base, post-trained, and quantized checkpoints, along with the training data and recipe on HuggingFace.

[NLP-155] CoRA: Confidence-Rationale Alignment for Reliable Chain-of-Thought Reasoning

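[Quick Read]: This paper addresses a reliability problem in chain-of-thought (CoT) reasoning: a model's high confidence in its answer may not be justified by the rationale it generates, which can be plausible yet incomplete or poorly supported. The authors call this property confidence-rationale alignment. They propose a reinforcement learning framework based on Group Relative Policy Optimization (GRPO) that jointly rewards three things: answer correctness, the probability the model assigns to its committed answer, and rubric-based rationale support. The rubric assesses grounding, coherence, task match, and connection to the selected answer, without revealing the gold answer to the judge. On MedQA, MathQA, and OpenBookQA with three open-weight LLMs, the method reduces confidence-rationale alignment error by up to 26.51% compared with untuned checkpoints, supervised fine-tuning (SFT), and correctness-only GRPO, while maintaining competitive accuracy and often improving calibration. The results show that reliable CoT reasoning requires not only confident answers but rationales that substantively support them.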

Link: https://arxiv.org/abs/2606.14961
Authors: Juming Xiong,Weixin Liu,Kevin Guo,Congning Ni,Junchao Zhu,Chongyu Qu,Chao Yan,Katherine Brown,Avinash Baidya,Xiang Gao,Bradley Malin,Zhijun Yin
Affiliations: Vanderbilt University; Vanderbilt University Medical Center; Intuit AI Research
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Chain-of-thought (CoT) reasoning can improve LLM performance, but high answer confidence may be misleading when the accompanying CoT rationale is plausible yet incomplete or poorly supported. We study confidence–rationale alignment: whether a model’s confidence in its committed answer is justified by its generated rationale. We introduce a GRPO-based reinforcement learning framework that jointly rewards answer correctness, committed-answer probability, and rubric-based rationale support, where the rubric assesses grounding, coherence, task match, and connection to the selected answer without revealing the gold answer to the judge. Across MedQA, MathQA, and OpenBookQA using three open-weight LLMs, our method reduces the confidence–rationale alignment error by up to 26.51% compared with untuned checkpoints, SFT, and correctness-only GRPO, while maintaining competitive accuracy and often improving calibration. These results show that reliable CoT reasoning requires not only confident answers, but rationales that substantively support them.
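
The joint reward described in the abstract combines three signals, which a short sketch can make concrete. The weighted-sum form, the weights, the [0, 1] score range, and the averaging of the four rubric criteria are all assumptions for illustration; the abstract names the components but not how they are combined.

```python
RUBRIC_CRITERIA = ("grounding", "coherence", "task_match", "answer_connection")


def rubric_support(scores):
    """Mean of the four rubric criteria, each assumed to lie in [0, 1]."""
    return sum(scores[criterion] for criterion in RUBRIC_CRITERIA) / len(RUBRIC_CRITERIA)


def joint_reward(correct, committed_prob, scores, weights=(1.0, 0.5, 0.5)):
    """Hypothetical weighted sum of correctness, confidence, and rationale support."""
    w_correct, w_prob, w_rubric = weights
    return (
        w_correct * float(correct)
        + w_prob * committed_prob
        + w_rubric * rubric_support(scores)
    )


# A correct, confident answer whose rationale is only moderately supported.
judge_scores = {
    "grounding": 0.5,
    "coherence": 0.5,
    "task_match": 0.5,
    "answer_connection": 0.5,
}
reward = joint_reward(correct=True, committed_prob=0.9, scores=judge_scores)
```

Because the rubric term is scored without showing the judge the gold answer, a correct answer with a weak rationale earns less than a correct answer with a well-supported one, which is what pushes confidence and rationale quality to move together.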

[NLP-156] Simplifying the Modeling of Arbitrary Conditionals in Natural Language

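[Quick Read]: This paper addresses a limitation of causal Transformers: their autoregressive factorization supports efficient left-to-right decoding and conditional likelihood computation, but they cannot tractably sample from or evaluate arbitrary conditionals, such as a block of text conditioned on both past and future tokens. Recent work tackles this with novel architectures, which often model such conditionals sub-optimally and degrade generation quality. The authors propose Arbitrary Conditionals GPT (AC-GPT), a simple modification to standard causal Transformers that enables evaluating and sampling from arbitrary conditionals (past, future, and mixed contexts) within a single forward pass. The key point is that the method preserves the standard left-to-right ordering and next-token prediction objective, which are essential for strong performance and efficient training on natural language. This compatibility allows existing LLMs to be fine-tuned for arbitrary conditioning. Empirically, the method outperforms baselines on modeling arbitrary conditionals without degrading standard left-to-right performance.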

Link: https://arxiv.org/abs/2606.14943
Authors: Yinhan Lu,Eric Elmoznino,Léo Gagnon,Sarthak Mittal,Tejas Kasetty,Guillaume Lajoie
Affiliations: Mila - Quebec AI Institute; McGill University; Université de Montréal
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Causal Transformers model sequences through an autoregressive factorization of the joint distribution, which enables efficient left-to-right decoding and conditional likelihood computation. However, they cannot tractably sample from or evaluate arbitrary conditionals – e.g., a block of text conditioned on past and future tokens. Recent work aims to solve this problem through novel architectures, but they often lead to sub-optimal modeling of such conditionals and degraded generations. We propose Arbitrary Conditionals GPT (AC-GPT) which introduces a simple modification to standard causal Transformers to enable evaluating and sampling from arbitrary conditionals – including past, future, and mixed contexts – within a single forward pass. Unlike prior approaches, our method preserves the standard left-to-right ordering and next-token prediction objective essential for both strong performance and efficient training on natural language. Crucially, this compatibility allows existing LLMs to be fine-tuned for arbitrary conditioning. Our empirical results indicate that our method outperforms baselines on modeling arbitrary conditionals, without degrading standard left-to-right performance.

[NLP-157] An Empirical Study on Learning Latent Representations for Emotional Speech Synthesis

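[Quick read]: This paper addresses a core challenge in emotional speech synthesis (ESS): generating natural, high-fidelity speech with a target emotional expression while keeping the speaker's identity consistent. End-to-end speech synthesis systems based on deep learning have made clear progress in naturalness and intelligibility, but controllable emotional expression and style transfer remain weak points. The solution modifies the FastSpeech 2 architecture by adding a speaker embedding and a prosody bottleneck module to model and disentangle emotional features. With this design the system can generate emotional speech for a single speaker (Sub-task 1) and transfer an emotional style from a reference speaker to a target speaker who has only neutral, non-expressive data, while retaining the target speaker's identity (Sub-task 2). The approach is aimed at improving the controllability and generalization of emotional speech synthesis.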

Link: https://arxiv.org/abs/2606.14922
Authors: Vinh Dang Quang,Huy Ngo Quang
Affiliations: Aimesoft JSC
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: 4 pages

Abstract:Over the last couple of years, the field of speech synthesis has improved dramatically thanks to deep learning. More and more deep learning-based TTS systems have been developed that produce voices with high intelligibility and naturalness. Meanwhile, controlling expressiveness remains a challenge, and generating speech in different styles or manners has recently received a lot of attention from the community. This paper presents our solutions for the emotional speech synthesis (ESS) task at VLSP 2022, which aims to generate humanlike, natural-sounding speech from a given input text with a desired emotional expression. By integrating a speaker embedding and a prosody bottleneck into FastSpeech 2, our systems can generate emotional speech for a single speaker (Sub-task 1) and transfer speaking styles from another speaker to a target speaker who has only neutral, non-expressive data while retaining the target speaker’s identity (Sub-task 2).

[NLP-158] Dr-DCI: Scaling Direct Corpus Interaction via Dynamic Workspace Expansion

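[Quick read]: This paper addresses the limited use of evidence when agents perform agentic search over large corpora through retriever-mediated interfaces such as BM25 or ColBERT. These interfaces rank relevant documents efficiently, but they expose evidence only as ranked results or bounded document views, which limits an agent's ability to reorganize material and verify constraints across documents. The paper proposes DR-DCI, a retriever-steered Direct Corpus Interaction (DCI) framework. It treats retrieval as an agent-callable action that dynamically pulls relevant documents into an evolving local workspace, and the agent runs flexible DCI operations inside that workspace. The design combines retriever-level recall with DCI-style precision: retrieval keeps exploration scalable, while the local workspace preserves the operations needed to reorganize and verify evidence. On Browsecomp-Plus, DR-DCI reaches 71.2% accuracy, up to 8.3 points higher than raw DCI and ablated variants, while reducing tool calls, wall time, and estimated cost. Adding workspace-preserving context reset raises accuracy to 73.3%. In corpus-scaling experiments from 100K to 10M documents, DR-DCI remains effective, whereas raw DCI becomes unstable and BM25 performs substantially worse. DR-DCI also scales to a 20M-scale file-per-document Wiki-18 QA setting, where it averages 63.0 across six benchmarks and outperforms retrieval-based and trained search-agent baselines. Ablations show that ranked previews and inter-document DCI operations are key to performance.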

Link: https://arxiv.org/abs/2606.14885
Authors: Yi Lu,Zhuofeng Li,Ping Nie,Haoxiang Zhang,Yuyu Zhang,Kai Zou,Wenhu Chen,Jimmy Lin,Dongfu Jiang,Yu Zhang
Affiliations: University of Toronto; Texas A&M University; University of Waterloo; UC San Diego; Verdent AI; Netmind AI
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 25 pages, 4 figures, 22 tables

Abstract:Agentic search over large corpora relies on retriever-mediated interfaces (e.g., BM25 or ColBERT) for scalable candidate discovery. While effective at ranking relevant documents, these interfaces expose evidence only as ranked results or bounded document views, limiting agents’ ability to reorganize material and verify constraints across documents. Direct Corpus Interaction (DCI) addresses this limitation by exposing shell-executable corpus operations for flexible search, filtering, comparison, and verification. However, full-corpus terminal commands become slow and unstable as the corpus grows, degrading performance and efficiency. We introduce DR-DCI, a retriever-steered DCI framework that treats retrieval as an agent-callable action for expanding a local workspace. Rather than operating directly over the full corpus, the agent dynamically pulls relevant documents into an evolving workspace and conducts DCI operations within it. This design combines retriever-level recall with DCI-style precision: retrieval keeps exploration scalable, while DCI preserves the local operations needed for effective evidence resolution. Experiments show that DR-DCI is both effective and efficient across scales. On Browsecomp-Plus, DR-DCI reaches 71.2% accuracy, improving over raw DCI and ablated variants by up to 8.3 points while reducing tool usage, wall time, and estimated cost. With workspace-preserving context reset, accuracy further improves to 73.3%. In corpus-scaling experiments, DR-DCI remains effective from 100K to 10M documents, whereas raw DCI becomes unstable and BM25 performs substantially worse. DR-DCI also scales to a 20M-scale file-per-document Wiki-18 QA setting, achieving an average score of 63.0 across six benchmarks and outperforming retrieval-based and trained search-agent baselines. Ablation analysis further shows that ranked previews and inter-document DCI are key to performance.

[NLP-159] Context Compression Is Not One Thing: Readable Symbolic Re-expression vs. Coherent Summary at Matched Budget

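[Quick read]: This paper studies context compression for multi-hop question answering with small language models, where the challenge is to reduce the token cost of the input while keeping the reasoning evidence intact. The key idea is Telegraph English, a readable symbolic format that rewrites retrieved passages into structured entity-relation statements, preserving key reasoning evidence at lower token cost. In controlled experiments on MuSiQue, TwoWiki, and HotpotQA, Telegraph English outperforms three matched-budget compression baselines (character-level deletion, truncation, and random sub-sampling) on every dataset, with gains of 13 to 20 F1 percentage points. On the hardest dataset it also outperforms a coherent prose summary produced by the same encoder. A pre-registered depth-interaction hypothesis was not supported: the advantage does not grow with reasoning depth. The authors interpret the results as evidence that, at a matched token budget, readable symbolic re-expression preserves entity content more densely than natural language or coherent summarization.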

Link: https://arxiv.org/abs/2606.14875
Authors: Sisong Bei,Mikhail L. Arbuzov,Ziwei Dong,Dmitri Kalaev,Alexey Shvets
Affiliations: Palo Alto Networks
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We study context compression for multi-hop question answering with small language models. We propose Telegraph English, a readable symbolic format that rewrites retrieved passages into structured entity-relation statements, preserving reasoning evidence at lower token cost. In controlled experiments on MuSiQue, TwoWiki, and HotpotQA, Telegraph English outperforms three matched-budget compression baselines (character-level deletion, truncation, and random sub-sampling) on every dataset, with gains of 13 to 20 F1 percentage points. It also outperforms a coherent prose summary produced by the same encoder on the hardest dataset. A pre-registered depth-interaction hypothesis is null: the advantage does not grow with reasoning depth within datasets. We interpret these results as evidence that readable symbolic re-expression preserves entity content more densely than either natural language or coherent summarization at matched token budget.

[NLP-160] Evaluating the Robustness of Proof Autoformalization in Lean 4

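[Quick read]: This paper addresses the lack of robustness in current LLM-based models for proof autoformalization, the task of translating informal mathematical proofs into formal proofs. Existing evaluations focus on well-formed informal proofs from curated datasets and say little about how models behave when the style or local details of a proof change. The paper presents the first systematic study of robustness for proof autoformalization models and defines two kinds of perturbation. A global perturbation paraphrases the proof in a different style while preserving its meaning, so the formalization should remain consistent. A local perturbation alters a value, symbol, or proof step, possibly counterfactually, so the formalization should faithfully reflect the change rather than revert to the original or infer something different on its own. The authors build a benchmark with both perturbations on miniF2F and MATH-500 and design automatic metrics for correctness stability under global perturbations and faithfulness under local perturbations. All seven recent models evaluated are sensitive to global perturbations, and most fail to remain faithful under local perturbations, which shows that current models are still fragile on realistic inputs.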

Link: https://arxiv.org/abs/2606.14867
Authors: Zhengtao Gui,Sheng Yang,Zhouxing Shi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint

Abstract:Proof autoformalization aims to translate a mathematical informal proof written in natural language into a formal proof in a formal language such as Lean 4. Several works have developed LLM-based models for proof autoformalization. However, existing evaluations have typically focused on translating well-formed informal proofs from curated datasets. We argue that a robust proof autoformalizer must remain faithful even for informal proofs that diverge from these idealized ones, and we present the first study on the robustness of proof autoformalization models. We formulate two categories of perturbations and evaluate robustness under each: a global perturbation paraphrases the informal proof in a different style, under which the formalization should remain consistent; a local perturbation alters a value, symbol, or proof step, possibly in a counterfactual way, and a robust formalization should faithfully reflect the perturbation rather than reverting to the original one or inferring a different one on its own. We build a benchmark with both perturbations on miniF2F and MATH-500, and automatically measure how stable a proof autoformalization’s correctness is under global perturbations and how faithfully its output reflects local perturbations. We evaluate seven recent models, all of which are sensitive to global perturbations and mostly fail to remain faithful under local perturbations. Code and data are available via this https URL.

[NLP-161] PhoneHarness: Harnessing Phone-Use Agents through Mixed GUI CLI and Tool Actions

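[Quick read]: This paper addresses the tendency of current mobile-agent research to treat agents as GUI controllers that predict the next screen action, overlooking the mixed interaction and verifiable outcomes that real phone-use tasks require. Existing evaluations mostly check whether an agent reaches a plausible final app state and cannot tell whether the intended side effect (for example, a message actually sent or a payment completed) really occurred. The paper introduces PhoneHarness, a mixed-action benchmark and execution harness that runs a device-side agent loop over GUI, command-line (CLI), and host-side tool actions. It combines deterministic action routing, bounded GUI delegation, and auditable execution traces so that phone workflows can be executed and verified. On the annotated evaluation split of PhoneHarness Bench, PhoneHarness reaches a 75.0% pass rate, 12.9 percentage points above the strongest non-PhoneHarness setting. The findings suggest that reliable phone automation depends on action-surface routing and verifiable execution, not only on visual GUI control.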

Link: https://arxiv.org/abs/2606.14832
Authors: Chenxin Li,Zhengyao Fang,Zhengyang Tang,Pengyuan Lyu,Xingran Zhou,Xin Lai,Fei Tang,Liang Wu,Yiduo Guo,Weinong Wang,Junyi Li,Yi Zhang,Yang Ding,Huawen Shen,Sunqi Fan,Shangpin Peng,Zheng Ruan,Anran Zhang,Benyou Wang,Chengquan Zhang,Han Hu
Affiliations: Tencent Hunyuan; The Chinese University of Hong Kong; The Chinese University of Hong Kong, Shenzhen; Tsinghua University
Categories: Computation and Language (cs.CL)
Comments: Project Page: this https URL

Abstract:Phone agents are increasingly expected to complete real mobile workflows rather than merely predict the next screen action. However, much of the current mobile-agent literature still evaluates agents primarily as GUI controllers that observe a screen, emit taps and swipes, and are scored by target app state. Real phone-use tasks are broader: they require deciding when to use app GUIs, device-side commands, or structured tools, while leaving evidence that the intended side effect actually occurred. We introduce PhoneHarness, a mixed-action benchmark and execution harness for studying phone-use agents on verifiable mobile workflows. PhoneHarness runs a device-side agent loop over GUI, CLI, and host-side tool actions, combining deterministic action routing with bounded GUI delegation and auditable execution traces. Its benchmark, PhoneHarness Bench, evaluates whether agents complete tasks with observable side effects, not only whether they produce plausible final answers. On the annotated evaluation split, PhoneHarness reaches a 75.0% pass rate, outperforming the strongest non-PhoneHarness settings by 12.9 percentage points. PhoneHarness and PhoneHarness Bench therefore play distinct but mutually dependent roles: the harness makes mixed phone workflows executable, while the benchmark measures whether agents can use that harness reliably and safely. Our findings suggest that reliable phone automation depends on action-surface routing and verifiable execution, not only visual GUI control.

[NLP-162] Spectro-Temporal Interference Confounds Phase Encoding in Spatial Audio Foundation Models INTERSPEECH2026

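[Quick read]: This paper asks whether current spatial self-supervised audio models, which perform well on localization tasks, actually encode microsecond interaural phase fine structure, that is, whether they have genuine binaural sensitivity. The authors propose a psychoacoustic benchmark based on the binaural masking level difference (BMLD) to quantify sensitivity to phase cues. Using an equalization-cancellation baseline and a GCC-PHAT positive control, they evaluate nine frozen audio models spanning binaural self-supervised learning (SSL), monaural SSL, and neural audio codecs. Four monaural negative controls yield zero BMLD, confirming that the benchmark is specific to binaural processing. Two general-purpose binaural SSL models show minimal phase sensitivity, and only dedicated binaural spatial SSL models reach a BMLD comparable to the analytical baseline. Progressive physical ablations show that general-purpose binaural SSL models rely on spectro-temporal interference textures rather than cross-channel phase computation, and that high detection rates on speech reflect reliance on broadband envelopes rather than genuine phase encoding. The study thus exposes the limits of existing models in encoding phase fine structure.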

Link: https://arxiv.org/abs/2606.14820
Authors: Yuxuan Chen,Haoyuan Yu,Peize He
Affiliations: The Chinese University of Hong Kong, Shenzhen; Jilin University; Hunan University; University of Electronic Science and Technology of China
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Accepted to INTERSPEECH 2026; 6 pages, 3 figures

Abstract:Recent spatial self-supervised audio models achieve high performance on localization tasks, raising questions about their encoding of microsecond interaural phase fine structures. We propose a psychoacoustic benchmark based on the binaural masking level difference (BMLD) to evaluate this. Using an equalization-cancellation baseline and a GCC-PHAT positive control, we evaluate nine frozen audio models spanning binaural SSL, monaural SSL, and neural audio codecs. Four monaural negative controls yield zero BMLD, confirming binaural specificity. Two general-purpose binaural SSL models exhibit minimal phase sensitivity, while dedicated binaural spatial SSL models achieve BMLD comparable to the analytical baseline. Progressive physical ablations show that general-purpose binaural SSL models rely on spectro-temporal interference textures rather than cross-channel phase computation. High detection rates in speech reflect a confounding reliance on broadband envelopes rather than genuine phase encoding.

[NLP-163] Last But Not Least: Boundary Attention CalibratiON for Multimodal KV Cache Compression

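[Quick read]: This paper addresses the growth of the key-value (KV) cache, and the resulting increase in decoding latency, when multimodal large language models (MLLMs) process long visual contexts. Existing compression methods rely on observation window attention for stable token-importance estimation, but under aggressive compression this aggregation can dilute sparse visual evidence and discard answer-critical tokens. The paper proposes BACON, a plug-and-play method that uses last-query attention as a complementary signal to recover overlooked visual evidence and calibrates observation window attention with it, while suppressing isolated noise through intra-layer coherence and inter-layer persistence. Across diverse benchmarks, models, budgets, and compression methods, BACON improves multimodal KV compression by 7.5% on average under the most aggressive budget, with gains of up to 30.9%.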

Link: https://arxiv.org/abs/2606.14782
Authors: Tianhao Chen,Yuheng Wu,Kelu Yao,Xiaogang Xu,Xiaobin Hu,Dongman Lee
Affiliations: KAIST; Zhejiang Laboratory; The Chinese University of Hong Kong; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Multimodal Large Language Models (MLLMs) achieve strong vision-language reasoning, but long visual contexts enlarge the KV cache and increase decoding latency. Existing compression methods rely on observation window attention for stable token-importance estimation, yet this aggregation can dilute sparse visual evidence and discard answer-critical tokens under aggressive compression. Therefore, we identify last-query attention as a complementary source for recovering such evidence, but its answer-irrelevant signals can mislead retention. We propose BACON, a plug-and-play method that calibrates observation window attention with last-query evidence and suppresses isolated noise via intra-layer coherence and inter-layer persistence. Across diverse benchmarks, models, budgets, and compression methods, BACON improves multimodal KV compression by 7.5% on average under the most aggressive budget, with gains up to 30.9%.
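
To make the dilution problem concrete, here is a toy sketch of token selection for KV cache compression. It assumes a simple linear blend of window-averaged attention with the last query's attention. The blend, the `alpha` parameter, and the function names are hypothetical illustrations, not BACON's actual calibration, and the sketch leaves out the intra-layer coherence and inter-layer persistence filters entirely.

```python
def window_scores(attn_rows):
    """Average attention each cached token receives from the observation-window queries."""
    n_tokens = len(attn_rows[0])
    return [sum(row[j] for row in attn_rows) / len(attn_rows) for j in range(n_tokens)]

def calibrated_topk(attn_rows, budget, alpha=0.5):
    """Blend window-averaged scores with the last query's attention and keep `budget` tokens.

    alpha = 0.0 reproduces plain observation-window selection. The linear
    blend is an assumption made for illustration only.
    """
    avg = window_scores(attn_rows)
    last = attn_rows[-1]  # attention row of the final query in the window
    blended = [(1 - alpha) * a + alpha * l for a, l in zip(avg, last)]
    keep = sorted(range(len(blended)), key=lambda j: blended[j], reverse=True)[:budget]
    return sorted(keep)

# Three window queries over five cached tokens (made-up attention weights).
# Token 3 matters only to the last query, so averaging dilutes it.
rows = [
    [0.40, 0.30, 0.20, 0.05, 0.05],
    [0.40, 0.30, 0.20, 0.05, 0.05],
    [0.20, 0.10, 0.10, 0.55, 0.05],
]
window_only = calibrated_topk(rows, budget=2, alpha=0.0)
with_last_query = calibrated_topk(rows, budget=2, alpha=0.5)
```

With a budget of two tokens, window-only scoring drops token 3, while the blended score retains it.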

[NLP-164] Mechanistic origins of catastrophic forgetting: why RL preserves circuits better than SFT?

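[Quick read]: This paper studies catastrophic forgetting, the loss of prior capabilities that fine-tuning frequently induces in large language models (LLMs). Its central question is why reinforcement learning (RL) retains prior capabilities better than supervised fine-tuning (SFT). The authors approach the question mechanistically by introducing differential circuit vulnerability, a head-level measure of how much a circuit degrades under fine-tuning. Adapting Qwen2.5-3B-Instruct to scientific question answering, they find that SFT adapts to the new task faster but causes substantially more circuit disruption and forgetting, whereas RL adapts more slowly but preserves a larger fraction of the base circuit. The findings suggest that circuit preservation may help explain why RL is more robust to catastrophic forgetting.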

Link: https://arxiv.org/abs/2605.28860
Authors: Jeanmely Rojas Nunez,Viraj Sawant,Nathan Allen,Nomgondalai Amgalanbaatar,Yannis Zongo,Vasu Sharma,Maheep Chaudhary
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments:

Abstract:Fine-tuning large language models (LLMs) frequently induces catastrophic forgetting of prior capabilities. Recent work has shown that reinforcement learning (RL) retains prior capabilities more effectively than supervised fine-tuning (SFT), attributing this to policy-gradient updates remaining closer to the base policy (Shenfeld et al., 2025). We extend this behavioral account to the mechanistic level and ask whether RL’s advantage is mirrored by stronger preservation of internal computational circuits. We introduce differential circuit vulnerability, a head-level measure of how much a circuit degrades under fine-tuning, and use it to compare RL and SFT on Qwen2.5-3B-Instruct adapted to scientific question-answering. We find a clear mechanistic trade-off: SFT adapts more rapidly to the target task but produces substantially greater circuit disruption and forgetting of prior capabilities, whereas RL preserves a larger fraction of the base circuit at the cost of slower task adaptation. These findings suggest that circuit preservation may help explain why RL is more robust to catastrophic forgetting. We released our code here: this https URL.

[NLP-165] From Static Inference to Dynamic Interaction: A Survey of Streaming Large Language Models ACL2026

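[Quick read]: This paper addresses the limited applicability of standard large language models (LLMs) in dynamic, real-time scenarios, since they are designed mainly for static inference with pre-defined inputs. Existing definitions of streaming LLMs are vague and fragmented, often conflating streaming generation, streaming inputs, and interactive streaming architectures, and no systematic taxonomy exists. The survey establishes a unified definition of streaming LLMs based on data flow and dynamic interaction, builds a systematic taxonomy on top of it, and discusses the underlying methods in depth. It also explores applications of streaming LLMs in real-world scenarios and outlines promising research directions. The authors maintain a continuously updated repository of related papers.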

Link: https://arxiv.org/abs/2603.04592
Authors: Junlong Tong,Zilong Wang,YuJie Ren,Peiran Yin,Hao Wu,Wei Zhang,Xiaoyu Shen
Affiliations: Shanghai Jiao Tong University; Institute of Digital Twin, Eastern Institute of Technology, Ningbo
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACL 2026 Findings

Abstract:Standard Large Language Models (LLMs) are predominantly designed for static inference with pre-defined inputs, which limits their applicability in dynamic, real-time scenarios. To address this gap, the streaming LLM paradigm has emerged. However, existing definitions of streaming LLMs remain fragmented, conflating streaming generation, streaming inputs, and interactive streaming architectures, while a systematic taxonomy is still lacking. This paper provides a comprehensive overview and analysis of streaming LLMs. First, we establish a unified definition of streaming LLMs based on data flow and dynamic interaction to clarify existing ambiguities. Building on this definition, we propose a systematic taxonomy of current streaming LLMs and conduct an in-depth discussion on their underlying methodologies. Furthermore, we explore the applications of streaming LLMs in real-world scenarios and outline promising research directions to support ongoing advances in streaming intelligence. We maintain a continuously updated repository of relevant papers at this https URL.

[NLP-166] Human genetic evidence is associated with drug approval across therapeutic areas: an observational analysis of 26278 target-disease pairs with temporal validation and feature ablation

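[Quick read]: This paper examines the association between genetic evidence for a drug target and the likelihood of clinical approval, asking whether genetic evidence can help predict which targets succeed. In an observational analysis of 26,278 target-disease pairs from Open Targets and ChEMBL, targets with any genetic association had a markedly higher approval rate than those without (odds ratio, OR = 3.25, 95% confidence interval 2.79-3.79, p = 1.91×10⁻⁴²). After accounting for the non-independence of pairs that share the same gene, the target-level effect remains significant (OR = 2.79). Feature ablation shows that literature mining contributes most to model performance, but this mainly reflects temporal leakage from post-approval publications rather than genuine prospective predictive power. With literature excluded, the remaining evidence types retain above-baseline signal (AUPRC = 0.084, 1.63x baseline). Genetic evidence alone adds only 1.0 percentage point of absolute AUPRC, and the best model is poorly calibrated, so the classifier has limited practical predictive value. The paper catalogues 1,433 genetically supported Phase 1/2 target-disease pairs as a hypothesis-generating resource and stresses that all findings are observational.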

Link: https://arxiv.org/abs/2606.14823
Authors: Victoria Paterson
Affiliations: University of Edinburgh
Categories: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Genetic evidence is enriched among approved drug targets: in an observational analysis of 26,278 target-disease pairs from Open Targets and ChEMBL, targets with any genetic association had a 3.25-fold higher approval rate than those without (OR = 3.25, 95% CI 2.79-3.79, p = 1.91e-42). A target-level analysis accounting for non-independence of pairs sharing the same gene gave OR = 2.79 (bootstrap 95% CI 2.22-3.53); the oncology pair-level OR of 6.72 attenuates to 2.71 at the target level, illustrating how non-independence inflates area-specific estimates. The enrichment replicated in post-2015 approvals (OR = 3.51, p = 1.72e-8). Feature ablation across six evidence types revealed that literature mining alone accounts for most classifier performance (AUPRC = 0.099 versus 0.109 for all features), consistent with temporal leakage from post-approval publications. Excluding literature, remaining evidence types retain above-baseline signal (AUPRC = 0.084, 1.63x baseline). Sensitivity analyses bracket the pair-level OR between 3.25 and 4.93. Genetic evidence alone yields only a 1.0-percentage-point absolute AUPRC gain and the best model has poor calibration; the classifier has limited practical predictive value. We catalogue 1,433 genetically supported Phase 1/2 pairs as a hypothesis-generating resource. All findings are observational.
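
For readers unfamiliar with the statistic, an odds ratio and its 95% confidence interval can be computed from a 2x2 table with the standard log-scale (Woolf) approximation. The sketch below uses invented toy counts. The abstract does not report the underlying table, and the paper may have used a different interval method (its target-level interval is a bootstrap).

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Woolf (log-scale) confidence interval for a 2x2 table.

    a: genetic evidence, approved      b: genetic evidence, not approved
    c: no genetic evidence, approved   d: no genetic evidence, not approved
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper

# Toy counts for illustration only; these are NOT the paper's data.
odds_ratio, lower, upper = odds_ratio_ci(a=30, b=70, c=10, d=90)
```

An interval that excludes 1, as this toy one does, indicates an association at the chosen confidence level. It says nothing about causation, which matches the abstract's closing caveat that all findings are observational.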

Information Retrieval

[IR-0] Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

Link: https://arxiv.org/abs/2606.17041
Authors: Anzhe Xie,Weihang Su,Yujia Zhou,Yiqun Liu,Qingyao Ai
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 13 pages, 7 figures, preprint for arXiv, dataset and code available at this https URL

Abstract:Meta-analysis is a demanding form of evidence synthesis that combines literature retrieval, PI/ECO-guided study selection, and statistical aggregation. Its structured, verifiable workflow makes it an ideal substrate for evaluating systematic scientific reasoning, yet existing benchmarks lack ground truth across the full retrieval-screening-synthesis pipeline. We introduce MetaSyn, a dataset of 442 expert-curated meta-analyses from Nature Portfolio journals. Each entry pairs a research question with PI/ECO criteria, a retrieval corpus of 140k PubMed articles, verified positive studies, hard negatives that are topically similar but PI/ECO-ineligible, and complete search strategies and date bounds. Benchmarking twelve pipeline configurations (nine RAG variants and a protocol-driven agent) reveals a critical screening bottleneck: despite a retrieval ceiling of 90.9% recall at K=200, no system recovers more than 52.7% of ground-truth included literature. Current LLMs fail to reliably separate eligible studies from PI/ECO-failing distractors in pools of comparable topical relevance. Stage-attributed metrics capture where systems succeed and fail; a single end-to-end score does not.

[IR-1] How Much Do Reviews Really Contribute? A Study on Text-Enriched Matrix Factorization for Recommendations

Link: https://arxiv.org/abs/2606.16973
Authors: Eduardo Ferreira da Silva,Mayki dos Santos Oliveira,Joel Machado Pires Denis Dantas Boaventura,Frederico Araújo Durão
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 14 pages, 4 figures, SBBD 2026 ISSN 2763-8979

Abstract:Incorporating textual reviews into a Recommender System has become a prominent strategy for enriching collaborative signals with semantic information. However, the actual contribution of review-derived representations remains an open question, particularly when strong collaborative baselines are employed. In this work, we systematically investigate the impact of textual information on Matrix Factorization by introducing and comparing three enrichment strategies over a common collaborative backbone. First, we propose a learnable gating mechanism that adaptively balances collaborative and textual signals during training. This mechanism is applied to two distinct review representations: (i) aggregated topic profiles extracted from user and item histories, and (ii) full text embedding representations derived from reviews. Additionally, we explore a cross-attention mechanism that identifies and emphasizes the most informative dimensions of the textual representation before fusion with collaborative factors. We evaluate six variants: pure, enriched with topic profiles and text via gating; enriched with topics and text via gating; and enhanced with cross-attention over textual features. Experiments across multiple review-based datasets reveal that although adaptive fusion mechanisms improve representation flexibility, the marginal contribution of textual signals remains limited compared to the collaborative backbone. These findings suggest that, under typical rating-prediction settings, collaborative information continues to dominate performance, raising important considerations for the effective integration of semantic review signals into recommendation models.
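
A gating mechanism of the kind the abstract describes can be sketched as a per-dimension sigmoid gate that mixes a collaborative factor vector with a text-derived vector before the usual matrix-factorization dot product. This is a forward-pass illustration only. The per-dimension gate, the function names, and the values are assumptions, and in a real model the gate logits would be learned jointly with the factors.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(collab, text, gate_logits):
    """Mix collaborative and textual vectors dimension by dimension.

    A gate near 1 trusts the collaborative factor and a gate near 0 trusts
    the text representation. The gate logits stand in for learnable parameters.
    """
    gates = [sigmoid(g) for g in gate_logits]
    return [g * c + (1.0 - g) * t for g, c, t in zip(gates, collab, text)]

def predict(user_vec, item_vec, bias=0.0):
    """Matrix-factorization score: bias plus the dot product of the two vectors."""
    return bias + sum(u * i for u, i in zip(user_vec, item_vec))

# Made-up two-dimensional vectors for one user and one item.
user = gated_fusion(collab=[1.0, 0.0], text=[0.0, 1.0], gate_logits=[0.0, 0.0])
item = gated_fusion(collab=[0.8, 0.2], text=[0.2, 0.8], gate_logits=[0.0, 0.0])
score = predict(user, item, bias=3.0)
```

With zero logits every gate is 0.5, so each fused vector is the plain average of its two inputs. Inspecting where trained gates settle is one way to see how much the text channel is actually used.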

[IR-2] A Theoretical Framework for Risk Analysis of Stochastic Rankers

Link: https://arxiv.org/abs/2606.16970
Authors: Debasis Ganguly
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Different from deterministic rankers that seek to maximize relevance at top ranks, stochastic ranking policies instead estimate distributions over permutations, from which rankings are sampled, towards obtaining diversified or fair exposure. Such policies are commonly evaluated in terms of expected effectiveness post-reranking. However, the randomness inherent in these policies gives rise to a fundamental but under-explored ex ante question: prior to applying stochastic reranking, how large can the induced variation in retrieval effectiveness be in the worst case? This paper presents a theoretical analysis of reranking risk, defined as the maximum absolute change in discounted cumulative gain (DCG) resulting from a permutation sampled from a stochastic reranking policy applied to a fixed retrieved list. We derive that this risk is governed by the distribution of the recall points in the initial retrieved list. We conduct experiments on submitted runs from the TREC Fairness 2022 track that employ stochastic reranking policies and empirically demonstrate that the effectiveness variations predicted by our theory closely approximate the observed changes in DCG.
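
The risk quantity defined in the abstract can be illustrated by brute force on a short list. The sketch below assumes the standard linear-gain DCG and a policy that can emit any permutation of the list. The paper's contribution is a theoretical analysis, which this enumeration does not reproduce.

```python
import math
from itertools import permutations

def dcg(rels):
    """Discounted cumulative gain with linear gain and a log2(rank + 1) discount."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def worst_case_dcg_change(rels):
    """Maximum absolute DCG change over all permutations of a fixed retrieved list.

    Exhaustive enumeration is feasible only for short lists. It assumes the
    stochastic policy puts probability on every permutation.
    """
    base = dcg(rels)
    return max(abs(dcg(p) - base) for p in permutations(rels))

# Binary relevance labels of a fixed retrieved list, rank 1 first (toy example).
initial = [1, 0, 0, 1]
risk = worst_case_dcg_change(initial)
```

Here the largest change comes from pushing both relevant documents to the bottom. That depends only on where the relevant documents sit in the initial list, consistent with the abstract's statement that the risk is governed by the recall points.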

[IR-3] OneRank: Unified Transformer-Native Ranking Architecture for Multi-Task Recommendation KDD2026

Link: https://arxiv.org/abs/2606.16838
Authors: Jiakai Tang,Sunhao Dai,Kun Wang,Zhiluohan Guo,Yu Zhao,Cong Fu,Kangle Wu,Yabo Ni,Anxiang Zeng,Xu Chen,Jun Xu
Categories: Information Retrieval (cs.IR)
Comments: KDD 2026 Accepted

Abstract:Multi-task learning (MTL) is essential in recommender systems to enable complementary learning among diverse user feedback. While modern industrial practices have shifted from DNNs to Transformer-centric architectures to strengthen sequence modeling and scaling capacity, they still decouple feature encoding from multi-task prediction, treating the Transformer as a task-agnostic encoder. This design fundamentally limits the performance and scalability by (1) creating an information bottleneck under heterogeneous task objectives, (2) inducing gradient interference that leads to the seesaw phenomenon, and (3) forcing a dataflow transition in which attention-based, context-adaptive representation learning is converted to static feed-forward task prediction with incompatible information read-write dynamics. We propose OneRank, a Transformer-native multi-task ranking framework that eliminates encoder-predictor separation and introduces task-private channels for forward representation learning and backward optimization, enabling task-specialized learning while reducing inter-task interference. In the forward pass, OneRank learns task-specific representations bottom-up through task-conditioned information selection, candidate-aware contextualization, and controlled cross-task interaction. In the backward pass, cross-task gradient detachment isolates task-private parameter updates from shared knowledge extraction modules, preventing negative transfer. We further replace static task-specific MLP scorers with dynamic matching-based scoring for context-aware personalized ranking. By internalizing multi-task reasoning within the Transformer stack, OneRank establishes a unified and scalable architectural paradigm. Offline and online experiments on large-scale industrial datasets show that OneRank significantly outperforms state-of-the-art baselines while maintaining computational efficiency. 

[IR-4] How Much Can We Trust LLM Search Agents? Measuring Endorsement Vulnerability to Web Content Manipulation

Link: https://arxiv.org/abs/2606.16821
Authors: Yimeng Chen,Zhe Ren,Firas Laakom,Yu Li,Dandan Guo,Jürgen Schmidhuber
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computers and Society (cs.CY); Information Retrieval (cs.IR)
Comments: 23 pages, 3 figures

Abstract:Large language model (LLM)-based search agents synthesize open-web content into actionable recommendations on behalf of users, creating a risk that attacker-published pages are transformed into endorsed claims. We introduce SearchGEO, a controlled evaluation framework for measuring endorsement corruption in LLM-based web-search agents, combining a web-evidence manipulation pipeline, a five-mode attack taxonomy, and multiple output-level metrics. We evaluate 13 LLM backends on 308 cases each. Results show that vulnerability patterns vary across backends: overall attack success rate (ASR) ranges from 0.0% on Claude-Sonnet-4.6 to 31.4% on Gemini-3-Flash, the strongest attack mode differs by model family, and the same deployment scaffold could amplify or decrease ASR on different backends. An auxiliary agent-skill probe, where endorsement becomes an install command, exposes a sharp split among otherwise robust backends: Claude over-rejects while GPT over-trusts. These findings argue for treating recommendation reliability under adversarial search content as a first-class dimension of backend safety evaluation.

[IR-5] Understanding the Behaviors of Environment-aware Information Retrieval ACL2026

Link: https://arxiv.org/abs/2606.16817
Authors: Ruifeng Yuan,Chaohao Yuan,David Dai,Yu Rong,Hong Cheng,Hou Pong Chan,Chenghao Xiao
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: ACL 2026 Main

Abstract:Recent retrieval-augmented generation (RAG) approaches have demonstrated strong capability in handling complex queries, yet current research overlooks a critical challenge: different retrievers require fundamentally different query formulation strategies for optimal performance. In this work, we present the first systematic analysis of how LLMs can learn to adapt their query formulation strategies for different retrievers via reinforcement learning (RL). Our empirical study reveals that RL effectively teaches an LLM to tailor its queries to specific retriever characteristics. We discover that different retrievers exhibit surprisingly distinct optimal query styles (e.g., descriptive vs. question-like), suggesting that strategies learned for one retriever are ineffective for another. We further show that performance can be enhanced by incorporating retriever-specific human guidance and by scaling model size. To facilitate learning over multi-retrieval-step trajectories, we introduce a branching-based rollout technique that improves training stability. Our work provides the first empirical evidence and actionable insights for building truly retriever-aware RAG systems. Code and resources are available at this https URL.

[IR-6] Harmonizing Semantic and Collaborative in LLMs: Reasoning-based Embedding Generator for Sequential Recommendation

Link: https://arxiv.org/abs/2606.16703
Authors: Qidong Liu,Mingyao Huang,Moranxin Wang,Wenxuan Yang,Haiping Zhu
Categories: Information Retrieval (cs.IR)
Comments: 11 pages, 5 figures

Abstract:Sequential Recommender Systems (SRS) predict the next item of interest based on users’ interaction histories and have been widely deployed, but are hindered by the long-tail problem. Large Language Models (LLMs), with strong semantic understanding and reasoning capabilities, offer a promising way to enrich item semantics and have recently been used as embedding generators. However, two fundamental gaps remain. First, current LLM-based embedding methods fail to exploit the model’s inner reasoning capacity. Second, existing methods often inject collaborative signals implicitly via supervised fine-tuning, lacking explicit guidance for collaborative embedding alignment. In this paper, we introduce ReaEmb, a novel framework that resolves both issues via a Latent Reasoning-enhanced Contrastive Learning (LRCL) stage and a Collaborative Reward Reinforcement Learning (CRRL) stage. LRCL exploits the LLMs’ inner reasoning capacity through a two-pass forward process with an additional attention module. CRRL then explicitly injects collaborative signals into the LLM via tailored reinforcement learning. Extensive experiments on three real-world datasets demonstrate the superior effectiveness of ReaEmb across multiple SRS models. To ease reproducibility, we release the code online.

[IR-7] SCAR: Semantic Continuity-Aware Retrieval for Efficient Context Expansion in RAG

Link: https://arxiv.org/abs/2606.16661
Authors: Nathanaël Langlois
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 5 pages, 1 figure

Abstract:Fixed-length chunking in Retrieval-Augmented Generation (RAG) often leads to boundary fragmentation, where critical evidence is split across segments, degrading retrieval recall. While static windowing and parent retrieval improve recall, they introduce significant token overhead. We propose SCAR (Semantic Continuity-Aware Retrieval), an adaptive retrieval policy that selectively expands neighboring chunks by weighing query-neighbor relevance against a structural continuity penalty. SCAR uses a relative expansion threshold tied to each retrieved chunk’s own query-relevance, yielding an approximately scale-invariant decision rule that transfers across embedding models without recalibration. Across four diverse corpora (RFC, GDPR, a 10-K report, and a Merger agreement; N=320 queries; 160 boundary-fragmented), SCAR achieves 92.8% recall on boundary-fragmented queries with only 7.84 chunks, a 22.9% reduction compared to static windowing (10.16 chunks). Paired bootstrap tests (B=10,000) confirm the chunk reduction is highly significant (p < 0.0001, Cohen’s d=-1.49, large effect), with a small recall difference (Cohen’s d=-0.33). The policy transfers across three embedding models (text-embedding-3-large, BGE-large-en-v1.5, zembed-1) using the same single hyperparameter setting, and downstream RAGAS evaluation on the 10-K corpus confirms SCAR preserves generation faithfulness while reducing context tokens by 27.1%.
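
The relative-threshold idea in this abstract can be illustrated with a minimal sketch. The abstract does not give SCAR's scoring formula, so everything below is an assumption made for illustration: the function name `expand_neighbors`, the use of cosine similarity, the penalty form `continuity_weight * (1 - similarity between adjacent chunks)`, and both default parameter values are hypothetical and are not the paper's implementation.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / denom if denom else 0.0


def expand_neighbors(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    retrieved: Sequence[int],
    rel_threshold: float = 0.8,      # hypothetical relative threshold
    continuity_weight: float = 0.2,  # hypothetical penalty weight
) -> List[int]:
    """Add an adjacent chunk only if its penalized relevance reaches a
    fraction of the retrieved chunk's own query relevance."""
    selected = set(retrieved)
    for i in retrieved:
        base = cosine(query_vec, chunk_vecs[i])
        for j in (i - 1, i + 1):
            if not 0 <= j < len(chunk_vecs) or j in selected:
                continue
            neighbor_rel = cosine(query_vec, chunk_vecs[j])
            # Assumed penalty: neighbors that differ from the retrieved
            # chunk are treated as less likely to continue its content.
            penalty = continuity_weight * (1.0 - cosine(chunk_vecs[i], chunk_vecs[j]))
            if neighbor_rel - penalty >= rel_threshold * base:
                selected.add(j)
    return sorted(selected)
```

Because the threshold is a fraction of each retrieved chunk's own relevance rather than an absolute score, the rule is less sensitive to the score range of a given embedding model, which is the property the abstract describes as approximately scale-invariant.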

[IR-8] PIANO: Personalized Reranking via Information Aggregation Node for Music Search Optimization KDD2026 ECML

Link: https://arxiv.org/abs/2606.16641
Authors: Weisheng Li,Chuqiao Huang,Pengcheng Li,Zhengchao Peng,Qiang Xiao,Zhongqian Xie,Qiang Huang,Chuanjiang Luo
Categories: Information Retrieval (cs.IR)
Comments: Accepted at ECML PKDD 2026. 18 pages, 4 figures

Abstract:Unlike short-video content, music tracks have long lifecycles and lasting value. Effective music search re-ranking must therefore align the user’s current query with long-term preferences while jointly optimizing Click-Through Rate (CTR) and Conversion Rate (CVR). However, existing methods suffer from two limitations: (1) sequential methods rely on item-interaction history and therefore cannot use historical search queries to tell which past preferences match the user’s current search intent; (2) most listwise models optimize a single objective (e.g., CTR only), and conventional multi-objective methods balance click and conversion at the item level, ignoring how these trade-offs play out across the whole ranked list. To address these limitations, we propose PIANO, a personalized listwise re-ranking framework with two key components: (i) the Query-Driven Interest Refiner (QDIR) uses cross-attention over historical queries to align past intents with the current one; (ii) the Information Aggregation Node (IAN), a learnable [CLS]-style token, aggregates the candidate list and predicts CTR/CVR at the list level. Extensive experiments on public and industrial datasets show consistent gains over strong baselines. In online A/B tests on NetEase Cloud Music, a leading music streaming platform, PIANO achieves statistically significant improvements in CTR (+0.62%) and CVR (+4.45%).

[IR-9] Leveraging Code-Mixed Product Metadata and User Feedback for Personalized Recommendation on Daraz Bangladesh

Link: https://arxiv.org/abs/2606.16387
Authors: KM Fahim A Bari,Muhammad Abdullah Adnan,Nafis Sadeq
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Bangladeshi e-commerce platforms host millions of product reviews written in Bengali Unicode, English, and Banglish, where Bengali is phonetically transcribed in Latin script. However, the impact of code-mixed reviews on recommendation performance remains largely unexplored. We present the first such benchmarking on product reviews from Daraz Bangladesh, evaluating six model families under a per-user chronological leave-last-out protocol. To address the severe long-tail sparsity of the dataset, where 59.3% of users have exactly one interaction, we conduct a systematic k-core threshold ablation across five density configurations. The results reveal that Item-based Collaborative Filtering remains stable across settings, Implicit Matrix Factorization degrades sharply with decreasing density, and Explicit Matrix Factorization uniquely improves at higher thresholds. To characterize the impact of code-mixing on recommendation quality, we perform a language-stratified evaluation of content-based filtering using character n-gram TF-IDF profiles. The results provide empirical evidence that fragmentation of the Banglish vocabulary reduces NDCG@10 by 46.8% relative to Bengali-script users, a degradation traceable to transliteration inconsistency across surface forms. This work establishes a reproducible evaluation foundation for recommendation research in code-mixed, low-resource e-commerce settings. The code is publicly available at this https URL.
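
Two preprocessing steps named in this abstract, k-core filtering and a per-user chronological leave-last-out split, are standard and can be sketched in a few lines. This is a generic illustration, not the authors' code: the function names and the tuple layouts `(user, item)` and `(user, item, timestamp)` are assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def k_core(interactions: List[Tuple[str, str]], k: int) -> List[Tuple[str, str]]:
    """Repeatedly drop users and items with fewer than k interactions
    until every remaining user and item has at least k."""
    current = list(interactions)
    while True:
        user_counts = Counter(u for u, _ in current)
        item_counts = Counter(i for _, i in current)
        kept = [(u, i) for u, i in current
                if user_counts[u] >= k and item_counts[i] >= k]
        if len(kept) == len(current):
            return kept
        current = kept


def leave_last_out(events: List[Tuple[str, str, int]]):
    """Hold out each user's most recent event as the test example."""
    by_user: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for user, item, ts in events:
        by_user[user].append((ts, item))
    train, test = [], []
    for user, history in by_user.items():
        history.sort()  # chronological order by timestamp
        *earlier, (last_ts, last_item) = history
        train.extend((user, item, ts) for ts, item in earlier)
        test.append((user, last_item, last_ts))
    return train, test
```

Note that a user with exactly one interaction contributes only a test example and no training history under this split, which is why the 59.3% single-interaction share reported above makes the choice of k-core threshold matter.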

[IR-10] RL-Index: Reinforcement Learning for Retrieval Index Reasoning

Link: https://arxiv.org/abs/2606.16316
Authors: Yongjia Lei,Nedim Lipka,Zhisheng Qi,Utkarsh Sahu,Koustava Goswami,Franck Dernoncourt,Ryan A. Rossi,Yu Wang
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Retrieving external knowledge is essential for solving real-world tasks, yet it remains challenging when the relationship between a query and its relevant knowledge involves implicit and complex reasoning beyond surface-level semantic or lexical matching (e.g., mathematical problems relying on the same theorem or coding requiring deep reasoning). Existing approaches primarily rely on query-side reasoning (e.g., query rewriting), which introduces significant online latency and underutilizes the opportunity to perform reasoning over the knowledge corpus itself (i.e., index-side reasoning). In this paper, we propose RL-Index, an agentic indexing framework that formulates retrieval index reasoning as a reinforcement learning problem. Instead of performing reasoning at query time, RL-Index shifts reasoning to the indexing stage by augmenting documents with LLM-generated rationales that explicitly encode the latent query-knowledge relationship. To optimize the quality of these rationales, we employ Group Relative Policy Optimization (GRPO) and use retrieval similarity as a verifiable reward signal, enabling direct optimization of indexing decisions for retrieval effectiveness. Extensive experiments on the BRIGHT benchmark demonstrate that RL-Index consistently improves both retrieval and downstream question-answering performance, while significantly reducing online inference latency. Moreover, the learned rationale augmentation generalizes across diverse retrievers and generators, highlighting its robustness as a plug-and-play indexing strategy across different retrieval systems.

[IR-11] Viral Images: Identifying Reprintings within 1.5 Million Photographs in Chronicling America

Link: https://arxiv.org/abs/2606.16209
Authors: Bruno Buccalon,Yueran Sun,Benjamin Charles Germain Lee
Categories: Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Comments: 13 pages, 6 figures, 2 tables

Abstract:Within the millions of digitized historic American newspapers in the Chronicling America initiative are tens of millions of photographs, illustrations, cartoons, and advertisements. Much of this visual culture is shared across newspaper titles and issues. Just as reprinted texts within these newspapers speak to the virality of textual content, so too does this reprinted visual culture speak to newspapers as sites of constant information circulation and exchange. In this paper, we introduce Viral Images, a project to identify reprintings within 1.5 million photographs in Chronicling America. For our analysis, we adopt the Newspaper Navigator dataset of extracted photographs from over 16 million pages in Chronicling America. We introduce an unsupervised method of identifying reprintings by leveraging contrastive language-image pretraining (CLIP) to embed these 1.5 million photographs and applying clustering to identify re-printed content. We detail our public interface, this https URL, which we designed in order to enable humanists to interactively browse and study these identified clusters. In addition, we analyze the identified clusters, uncovering a diversity of photographs and advertisements that have been circulated across different newspapers over time.

[IR-12] Theorem-Grounded Execution Ontologies for Interpretable Machine Reasoning

Link: https://arxiv.org/abs/2606.16010
Authors: Raghu Anantharangachar
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models have achieved impressive performance on reasoning tasks spanning mathematics, science, programming, and commonsense inference. Despite these advances, their reasoning processes remain largely latent, making them difficult to interpret, verify, replay, debug, and transfer across domains. Existing approaches such as chain-of-thought, tree-of-thoughts, graph-of-thoughts, and tool-augmented reasoning expose intermediate reasoning artifacts but typically lack explicit execution semantics, formal state representations, and verifiable reasoning structures. We introduce Theorem-Grounded Execution Ontologies (TGEO), a framework that models reasoning as an executable state-transition process rather than a sequence of generated tokens. Given an input problem, TGEO identifies relevant theorem families, binds the problem to a domain ontology, discovers semantic objects, instantiates states and operators, constructs predicates and contracts, and synthesizes an executable reasoning graph. The resulting graph provides an interpretable, replayable, and auditable representation of reasoning in which every state transition, operator application, and validation step is explicitly represented. TGEO integrates five architectural components: (1) theorem-grounded reasoning priors, (2) executable ontologies, (3) operator-mediated state transitions, (4) predicate and contract-based execution validation, and (5) architectural auditing and failure localization. We evaluate TGEO on theorem-intensive reasoning tasks derived from mathematical benchmark domains and a curated Golden Execution Suite. Our findings demonstrate the value of executable reasoning representations for interpretable, verifiable, and reproducible AI reasoning systems. 
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI). Cite as: arXiv:2606.16010 [cs.IR] (arXiv:2606.16010v1 [cs.IR] for this version). DOI: https://doi.org/10.48550/arXiv.2606.16010 (arXiv-issued DOI via DataCite, pending registration). Submission history: [v1] Sun, 14 Jun 2026 20:44:29 UTC (77 KB), from Raghu Anantharangachar.

[IR-13] Entity Labels Are Not Entity Signals: A Framework for Observable Relevance in Document Re-Ranking ICTIR’26

Link: https://arxiv.org/abs/2606.15998
Authors: Utshab Kumar Ghosh,Shubham Chatterjee
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: ICTIR '26

Abstract:Entity-aware document retrieval uses query-associated entities as ranking signals, assuming that semantically relevant entities are also useful retrieval signals. We show this assumption is insufficient, and explain why. Unlike terms, which are ground-truth observations, entity links are hypotheses produced by an imperfect linker: an entity can be topically central yet provide no discriminative signal if the linker fires indiscriminately across relevant and non-relevant documents. We formalize this as a distinction between Conceptual Entity Relevance (CER), i.e., whether an entity is topically related to a query, and Observable Entity Relevance (OER), i.e., whether its observed presence in a collection discriminates relevant from non-relevant documents. Across four collections and annotation sources including human entity judgments, CER and OER exhibit near-chance agreement (κ ≈ 0), while OER operationalizations agree substantially (κ ≈ 0.5), confirming CER as the systematic outlier. CER-based supervision selects topically plausible but weakly discriminative entities, pruning fewer than 4% of non-relevant documents on some collections. Aligning supervision with OER improves non-relevant pruning by up to 10x and open-world MAP by 0.051 over BM25. Our findings motivate a shift from conceptual to observable notions of entity relevance in entity-aware retrieval.
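
The agreement figures in this abstract are kappa statistics. As a reading aid, the sketch below computes Cohen's kappa for two label sequences; the abstract does not say which kappa variant the authors use, so treating it as Cohen's kappa over per-entity labels is an assumption, and the toy labels in the usage note are invented for illustration only.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two annotations of the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labeled independently
    # according to their own marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators always use the same single label
    return (observed - expected) / (1.0 - expected)
```

For example, `cohens_kappa([1, 1, 0, 0], [1, 0, 1, 0])` is 0.0: the two annotations agree on half the items, which is exactly what chance predicts for balanced binary labels. Agreement on 6 of 8 balanced items gives 0.5.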

[IR-14] Interactor: Agentic RL oriented Iterative Creation for Ad Description Generation in Sponsored Search

Link: https://arxiv.org/abs/2606.15911
Authors: Penghui Wei,Jiayu Wu,Chao Ye,Zhi Guo,Shuanglong Li,Lin Liu
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:This paper focuses on automatically generating informative ad descriptions in sponsored search. Unlike ad titles, which are usually optimized to attract user clicks, ad descriptions have a longer text span and can incorporate world knowledge to address user search intents while presenting the fine-grained selling points of the ads. We propose Interactor, a multi-turn iterative creation framework optimized with agentic RL for ad description generation. The generation model acts as a policy that interacts with a customized environment consisting of multiple generative reward models. Given initial generations by the policy, the customized GenRMs evaluate multi-dimensional qualities including knowledge capacity and landing page consistency, providing both binary signals and reasoning feedback. The policy then iteratively refines the descriptions based on this feedback to ensure continuous improvement. Experiments on industrial datasets show that the Interactor framework significantly outperforms state-of-the-art approaches in generating knowledge-rich and faithful ad descriptions. Since May 2026, it has been deployed online in a leading search ads system, contributing to both ad revenue and user experience.

[IR-15] MAGE-RAG: Multigranular Adaptive Graph Evidence for Agentic Multimodal RAG in Long-Document QA

Link: https://arxiv.org/abs/2606.15906
Authors: Yilong Zuo,Xunkai Li,Jing Yuan,Qiangqiang Dai,Hongchao Qin,Ronghua Li
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Multimedia (cs.MM)
Comments:

Abstract:Long-document multimodal question answering requires a system to locate sparse evidence in long PDFs and integrate clues from text, tables, images, charts, and complex layouts. Existing RAG methods mostly rely on fixed Top-k retrieval over text chunks or pages. Text retrieval can compress the context but often loses visual and layout information; page-level visual retrieval preserves the original page, yet it also sends large irrelevant regions to the reader, leading to a static trade-off among evidence coverage, noise, and inference cost. This paper proposes MAGE-RAG, a multigranular adaptive graph evidence framework for long-document multimodal QA. MAGE-RAG uses page retrieval as the entry point for query-time evidence construction. Offline, it builds an evidence graph with page nodes and element nodes, encoding containment, reading order, layout adjacency, section hierarchy, and semantic-neighbor relations. At query time, an online evidence controller iteratively activates, opens, searches, and prunes evidence under explicit budgets. The resulting evidence subgraph is then rendered into structured multimodal reader input, allowing the LVLM to consume compact and relevant evidence within a limited context. On LongDocURL and MMLongBench-Doc, we establish a unified comparison and analysis protocol covering Direct MLLM, Text RAG, Page-level Visual RAG, and Graph/Agentic RAG. Experiments show that MAGE-RAG achieves 52.75 overall accuracy on LongDocURL, and 53.26 accuracy with 51.19 F1 on MMLongBench-Doc. Fine-grained breakdowns, budget-performance curves, ablations, and trace-based analysis further show that query-time evidence subgraph construction can balance dispersed evidence coverage with context-noise control. Our code is available at this https URL.

[IR-16] Intelligent Multimodal Retrieval and Reasoning for Geospatial Knowledge Discovery on the I-GUIDE Platform

Link: https://arxiv.org/abs/2606.15838
Authors: Yunfan Kang,Erick Li,Furqan Baig,Wei Hu,Alexander Michels,Anand Padmanabhan,Shaowen Wang
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Geospatial knowledge discovery increasingly requires search across heterogeneous artifacts: datasets, maps, notebooks, software, publications, and the provenance links among them. Conventional geoportals support metadata and spatial filtering, but they rarely provide semantic retrieval, graph-aware provenance traversal, and conversational synthesis in one integrated system. This paper presents I-GUIDE Smart Search, a production multimodal geospatial retrieval-augmented generation (RAG) system embedded in the I-GUIDE Platform, and reports on its design, deployment, and evaluation. The system combines production-maintained OpenSearch keyword, vector, and spatial indexes with a Neo4j knowledge graph and an iterative RAG pipeline for memory-aware query augmentation, reasoning, retrieval-method routing, relevance grading, grounded generation, hallucination and relevance checking. In a single-A100 RAG deployment, I-GUIDE Smart Search supports interactive use up to about 100 concurrent simulated users, reaching 4.4 requests per second with p50 latency near 25 seconds despite 20-50 LLM calls per query. For answer quality, we evaluate a four-category benchmark of 170 unique human-filtered user-facing queries, together with ten intent-specific probe sets generated from the deployed indexes and graph. Smart Search improves retrieved evidence coverage and judged answer quality over non-retrieval and naive-RAG baselines, with the clearest gains on exact-identifier, spatially constrained, simple-recommendation, and domain-specific factual queries requiring current indexed evidence. We distill transferable deployment lessons for spatial RAG systems, covering spatial metadata quality, graph provenance, retrieval routing, interface contracts, refusal-aware evaluation, latency-cost tradeoffs, and the role of the user interface in deployed geospatial cyberinfrastructure.
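
The throughput and latency figures in this abstract can be sanity-checked with Little's law (mean requests in flight = throughput × mean latency). The sketch below is a back-of-the-envelope check, not part of the paper. It assumes a closed-loop load test in which each simulated user issues the next query as soon as the previous one returns, which the abstract does not state, and both function names are hypothetical.

```python
def mean_latency_seconds(concurrent_users: float, throughput_rps: float) -> float:
    """Little's law rearranged: mean latency = in-flight requests / throughput."""
    if throughput_rps <= 0:
        raise ValueError("throughput must be positive")
    return concurrent_users / throughput_rps


def llm_calls_per_second(throughput_rps: float, calls_per_query: float) -> float:
    """Aggregate LLM call rate implied by a query rate and per-query call count."""
    return throughput_rps * calls_per_query


if __name__ == "__main__":
    # Figures taken from the abstract: about 100 users, 4.4 requests per second.
    print(round(mean_latency_seconds(100, 4.4), 1))  # 22.7
```

Under that assumption, the implied mean latency of about 22.7 seconds is in the same range as the reported p50 near 25 seconds, although a mean and a median need not coincide. With 20 to 50 LLM calls per query, 4.4 requests per second corresponds to roughly 88 to 220 LLM calls per second on the single A100.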

[IR-17] One Sequential Recommendation Model Pretrained from Synthetic Priors Predicts Multiple Datasets KDD2026 KDD

Link: https://arxiv.org/abs/2606.15752
Authors: Woosung Kang,Jiwon Jeong,Jonghyeok Shin,Jeongwhan Choi,Noseong Park
Categories: Information Retrieval (cs.IR)
Comments: Accepted at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

Abstract:Existing sequential recommendation models rely on dataset-specific training, where the learned parameters are fitted to the item catalog and the observed interaction distribution of the training data. This limits generalization to new domains, typically requiring retraining from scratch. In this work, we propose SRPFN, a Prior-data Fitted Network for sequential recommendation – predicting the next item in a single forward pass without any gradient-based parameter updates in the target domain. SRPFN is pretrained offline on 25.6M sequences sampled from a synthetic prior that spans diverse item-to-item transition patterns, learning to produce posterior predictive next-item distributions. At inference time, SRPFN generates recommendations by conditioning on a support set of item-item transition examples from the target domain, adapting to domain-specific patterns without retraining. Extensive experiments on five benchmarks across 10 baselines show that SRPFN achieves the best or second-best performance across nearly all metrics and datasets, while being substantially more computationally efficient than trained baselines. These results establish that a single model pretrained on synthetic priors can generalize across diverse real-world domains, offering a framework for update-free sequential recommendation.

[IR-18] Retrievable Gradients: Continual Post-Training Without Cumulative Weight Drift

Link: https://arxiv.org/abs/2606.15734
Authors: Weihang Su,Jiacheng Kang,Jingyan Xu,Qingyao Ai,Jianming Long,Hanwen Zhang,Bangde Du,Xinyuan Cao,Min Zhang,Yiqun Liu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:Continual post-training enables models to absorb emerging knowledge after deployment, but repeatedly updating shared parameters can accumulate weight drift, potentially causing catastrophic forgetting and degrading general capabilities. Retrieval-augmented generation avoids such parameter drift, yet often lacks the depth of parametric knowledge integration. In this paper, we propose ReGrad (Retrievable Gradients), a new paradigm that treats gradients as retrievable units of knowledge. ReGrad pre-computes document-specific gradients offline, stores them in an indexed Gradient Bank, and retrieves only query-relevant gradients at inference time for temporary weight adaptation. However, raw language-modeling gradients are optimized for token-level document reconstruction rather than for query-driven knowledge use. We therefore introduce a bi-level meta-learning objective that reshapes document-derived gradients into generalizable adaptation signals for downstream tasks. Experiments across general and domain-specific settings show that ReGrad outperforms CPT and RAG baselines, enabling scalable and reversible parametric knowledge injection without accumulating weight drift.

[IR-19] Transfer Learning for FHIR Questionnaire Terminology Binding

Link: https://arxiv.org/abs/2606.15449
Authors: Maxim Gorshkov
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:Electronic prior authorization workflows require FHIR Questionnaire items to carry LOINC codes, yet most items in the HL7 Da Vinci CDS-Library lack these bindings. We treat this as a retrieval problem: given a Questionnaire item’s text, find the correct LOINC code in a pool of 97,314 active codes. We compare six methods (TF-IDF, frozen MiniLM, BioBERT, BioLORD, contrastively fine-tuned MiniLM, and a TF-IDF+GPT reranker) on a 54-item evaluation set spanning three query styles (natural question, medium, and terse). No single method wins on every metric. BioLORD, a frozen encoder pre-trained on biomedical ontology definitions, has the best top-rank accuracy (R@1 = 0.185, MRR = 0.246) despite seeing no task-specific data, while a contrastive fine-tune on raw LHC-Forms pairs takes R@5 (0.389) and R@10 (0.426). A distribution-shift ablation shows why the fine-tune in our main table is not the strongest one: adding GPT-generated paraphrases to the raw pairs drops R@5 from 0.389 to 0.296, so the augmented union underperforms raw-only training on every metric except R@1. Performance peaks at 5k training pairs. Error analysis on BioLORD’s R@1 failures shows that wrong-specificity and ambiguous-text cases together account for 59% of errors.
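
The metrics quoted in this abstract (R@1, R@5, R@10, MRR) follow the usual definitions for retrieval with a single correct answer per query. The sketch below shows those definitions; it assumes exactly one gold LOINC code per Questionnaire item, which the abstract implies but does not state, and the function names and the placeholder codes in the usage note are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def recall_at_k(ranked: Sequence[str], gold: str, k: int) -> float:
    """1.0 if the gold code appears in the top k results, else 0.0."""
    return 1.0 if gold in ranked[:k] else 0.0


def reciprocal_rank(ranked: Sequence[str], gold: str) -> float:
    """1 / rank of the gold code, or 0.0 if it was not retrieved."""
    for rank, code in enumerate(ranked, start=1):
        if code == gold:
            return 1.0 / rank
    return 0.0


def evaluate(runs: List[Tuple[Sequence[str], str]],
             k_values: Sequence[int] = (1, 5, 10)) -> Dict[str, float]:
    """Average each metric over (ranked_codes, gold_code) pairs."""
    n = len(runs)
    metrics = {f"R@{k}": sum(recall_at_k(r, g, k) for r, g in runs) / n
               for k in k_values}
    metrics["MRR"] = sum(reciprocal_rank(r, g) for r, g in runs) / n
    return metrics
```

For instance, `evaluate([(["A", "B"], "A"), (["B", "A"], "A"), (["B", "C"], "A")])` gives R@1 = 1/3 and MRR = 0.5. With only 54 evaluation items, each metric moves in steps of 1/54 ≈ 0.0185: under the one-gold-code assumption, the reported R@1 of 0.185 is consistent with 10 of 54 items, and the R@5 drop from 0.389 to 0.296 with 21 versus 16 items.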

[IR-20] EventConnector: Mining Social Event Relations through Temporal Graphs

Link: https://arxiv.org/abs/2606.15448
Authors: Zijie Lei,Haofei Yu,Ge Liu,Jiaxuan You
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Understanding and retrieving related real-world events based on their temporal dynamics is a fundamental challenge in time-sensitive applications such as forecasting, information retrieval, and social analysis. Existing methods often rely on semantic similarity or global time-series alignment, which overlook the transient and directional dependencies that frequently underlie real-world correlations. In this work, we introduce EventConnector, a framework that constructs a temporal event graph capturing localized co-fluctuations and lead-lag relationships between events through their time-series trajectories. We further propose EC-Fusion, an adaptive retrieval mechanism that fuses EventConnector’s graph-based scores with a complementary Granger-causal signal via a graph-quality-aware mixing weight. Across two real-world prediction market benchmarks (Polymarket and Kalshi) and nine forecasting architectures evaluated over three random seeds, EC-Fusion is the best non-oracle retrieval method on 17/18 model–dataset cells, reducing RMSE by 6.87% on average (up to 10.86%) over the strongest comparable retrieval baseline, with statistical significance at p < 0.01 after Holm–Bonferroni correction. These results highlight the effectiveness of temporally grounded graph modeling, augmented with causal-signal fusion, in capturing latent event relationships beyond what semantic similarity or traditional alignment techniques can offer.

[IR-21] Confidence-Based Stopping Methods for Systematic Reviews

Link: https://arxiv.org/abs/2606.15380
Authors: Aaron Fletcher,Mark Stevenson
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Technology Assisted Review stopping methods aim to ensure that no more documents are screened than necessary. Most existing approaches focus on achieving a target recall, which does not consider whether an information need has been met. This paper introduces two heuristic stopping methods that instead monitor whether screened documents contain enough information to make a decision. Evaluation on a standard dataset of Diagnostic Test Accuracy Systematic Reviews demonstrates that the proposed approaches substantially reduce the number of documents that need to be examined while, in the majority of cases, maintaining conclusions that are consistent with all evidence available.

[IR-22] S1-DeepResearch: Beyond Search Toward Real-World Long-Horizon Research Agents

Link: https://arxiv.org/abs/2606.15367
Authors: Yao Dong,Xinglin Xiao,Liwei Dong,Xinlong Jin,Zhengbo Li,Heng Zhang,Duyun Wang,Nan Xu
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:Deep research agents aim to solve complex knowledge-intensive tasks through long-horizon planning, evidence gathering, reasoning, and report generation. While recent progress in search agents has demonstrated strong capabilities in information retrieval and answer verification, most existing training datasets remain search-centric, focusing primarily on closed-ended question answering and information localization. As a result, they mainly train information-seeking behavior while providing limited coverage of key deep research capabilities, including evidence integration, knowledge synthesis, planning, file understanding, and structured report generation. In this work, we propose a unified trajectory construction paradigm for deep research agents that combines closed-ended QA and open-ended exploration. The proposed framework consists of graph-grounded task formulation, agentic trajectory rollout, and multi-dimensional trajectory verification, enabling scalable synthesis of high-quality agentic trajectories spanning long-chain complex reasoning, deep research instruction following, report writing, file understanding and generation, and skills usage. Compared with existing search-oriented datasets, our synthesized trajectories place greater emphasis on knowledge synthesis, complex reasoning, and planning. S1-DeepResearch-32B achieves state-of-the-art performance among open-source models of comparable scale across 20 benchmarks spanning five capability dimensions, including complex reasoning, instruction following, report generation, file understanding, and skills usage. On several challenging deep research benchmarks, it approaches the performance of leading proprietary frontier models. These results highlight the importance of jointly modeling information acquisition, knowledge synthesis, and planning-oriented agent behaviors for building effective deep research agents.

[IR-23] Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross-Lingual BrowseComp-Plus

Link: https://arxiv.org/abs/2606.15345
Authors: Yuheng Lu,Qingcheng Zeng,Heli Qi,Puxuan Yu,Fuheng Zhao,Rui Yang,Hitomi Yanaka,Naoto Yokoya,Weihao Xuan
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Preprint

Abstract:Deep research agents are increasingly evaluated on their ability to search for evidence, reason over retrieved sources, and produce grounded answers. Existing browsing benchmarks, however, largely assume that the user’s query and the supporting evidence are written in the same language, leaving open whether agentic search systems can operate when relevant evidence appears in another language. We introduce XBCP (Cross-lingual BrowseComp-Plus), a controlled benchmark that preserves the English question-and-answer space of BrowseComp-Plus but varies the languages of the supporting documents. XBCP instantiates two complementary settings: in the cross-lingual setting, each query is paired with evidence in a single assigned language. In the multilingual setting, the full evidence corpus is distributed equally and randomly across 12 languages spanning high-resource and low-resource regimes. We evaluate four deep research agents using sparse and dense multilingual retrievers, measuring answer accuracy, evidence recall, search behavior, calibration, citation fidelity, and oracle retrieval. Results reveal substantial degradation when evidence is translated. Even strong, dense retrievers lose evidence recall, and agents become less calibrated and cite evidence less reliably. Notably, accuracy remains lower even when all gold evidence is supplied directly. These findings suggest that cross-lingual deep research exposes both retrieval failures and an independent, agent-side difficulty in integrating language-mismatched evidence.

[IR-24] HoloRec: Holistic Encoding and Interleaved Reasoning for Generative Recommendation

Link: https://arxiv.org/abs/2606.15331
Authors: Shuqi Zhao, Jingsong Su, Xiang Liu, Xingzhi Yao, Yiming Qiu, Huimu Wang, Liang Lin, Pengbo Mo, Mingming Li, Jiao Dai, Jizhong Han, Songlin Hu
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Generative recommendation models that formulate the task as sequence generation overcome the objective fragmentation problem of traditional cascade architectures, yet existing approaches still suffer from flat semantic representations lacking hierarchical structure for multi-step reasoning and an externally constructed chain-of-thought (CoT) that requires expensive annotations and remains disconnected from the generation objective. We propose HoloRec, an endogenous chain-of-thought recommendation mechanism that unifies representation, reasoning, and generation by constructing a hierarchical semantic encoding matrix via multi-granularity nested residual quantization optimized by a holistic reconstruction loss. HoloRec supports two inference modes: a non-thinking mode that uses lightweight multi-granularity supervised alignment for fast prediction, and a thinking mode that employs an interleaved reasoning scheme to generate CoT steps on the fly, directly embedding reasoning into the generation process without external data. Experiments on multiple public recommendation datasets demonstrate that HoloRec consistently outperforms baselines, with especially significant gains in sparse scenarios, and the thinking mode achieves better accuracy than the non-thinking mode with only modest inference overhead.

[IR-25] OneBar: An End-to-End Content-Grounded Generative Query Recommendation Framework for E-Commerce Video Feeds

Link: https://arxiv.org/abs/2606.15330
Authors: Yao Tang, Ying Yang, Ben Chen, Yufei Ma, Zihan Liang, Chenyi Lei, Wenwu Ou, Jian Liu
Categories: Information Retrieval (cs.IR)
Comments: Any questions feel free to contact: benchen4395@gmail.com

Abstract:Short-video platforms now expose clickable search entries beneath the video player, enabling users to easily express content-induced search intent. However, conventional query recommendation systems on short-video platforms suffer from latency constraints and objective misalignment, while recent generative approaches struggle with noisy content-side metadata and preference drift. To address these issues, we propose OneBar, an end-to-end generative framework for real-time query recommendation in E-Commerce video feeds. OneBar features three key innovations: (1) a collaborative-multimodal intent grounding module that fuses multimodal video understanding and behavior-derived collaborative anchors; (2) a Unified End-to-End architecture equipped with a prompt-compression mechanism for efficient online serving; and (3) a progressive preference learning strategy that internalizes hierarchical behavior preferences into the generative policy, eliminating the need for a separately trained reward model. Compared with the online baseline, OneBar increases Query Exposure by 16.91% and Query Click by 18.68%, while maintaining a slight Query CTR gain of 0.19%. The additional search traffic further contributes to 20.36% more guided orders and 21.67% higher GMV.

[IR-26] Guiding Federated Graph Recommendation with LLM-encoded knowledge

Link: https://arxiv.org/abs/2606.15277
Authors: Thi Minh Chau Nguyen, Hien Trang Nguyen, Duc Anh Nguyen, Van Ho-Long, Thanh Trung Huynh, Zhao Ren
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: Technical Report

Abstract:Graph-based recommender systems are highly effective at extracting collaborative signals from user–item interactions, and federated learning (FL) allows these models to be trained while preserving user privacy. However, aggregating graph representations across distributed, non-IID clients remains a challenge; structural embeddings learned locally often misalign, and naive averaging fails to capture meaningful cross-client relationships. Most existing federated graph methods rely exclusively on structural aggregation, neglecting the rich, global semantic context available in large language models (LLMs). In this paper, we propose a novel framework that uses LLM-encoded knowledge to guide federated graph recommendation. Specifically, clients learn structural representations from local graphs while simultaneously summarizing their typical interaction patterns into compact semantic vectors via a frozen LLM. The central server then uses these LLM-encoded semantic signals to discover related preference patterns across clients, guiding the selective aggregation of their structural representations. This enables semantically informed cross-client collaboration without exposing raw data. Extensive experiments on standard benchmarks show that guiding structural alignment with LLM-encoded knowledge consistently improves recommendation accuracy over existing federated graph baselines.
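
One way to picture semantically guided aggregation is the sketch below, in which the server weights each client's structural embedding by the cosine similarity of the clients' semantic vectors. Everything here is an assumption made for illustration: the softmax weighting, the `temperature` parameter, and the function names are hypothetical, and the paper's actual aggregation rule may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def guided_aggregate(structural, semantic, temperature=0.1):
    """Return one aggregated structural embedding per client.

    Hypothetical rule: client i receives a softmax(similarity / temperature)
    weighted average over all clients' structural embeddings, so semantically
    similar clients contribute most.
    """
    n = len(structural)
    dim = len(structural[0])
    aggregated = []
    for i in range(n):
        sims = [cosine(semantic[i], semantic[j]) for j in range(n)]
        top = max(sims)
        weights = [math.exp((s - top) / temperature) for s in sims]
        total = sum(weights)
        weights = [w / total for w in weights]
        aggregated.append(
            [sum(weights[j] * structural[j][d] for j in range(n)) for d in range(dim)]
        )
    return aggregated


# Clients 0 and 1 share a semantic profile; client 2 is unrelated.
semantic = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
structural = [[1.0], [3.0], [100.0]]
result = guided_aggregate(structural, semantic)
```

In this toy run, clients 0 and 1 end up near the average of their own embeddings, while client 2 stays close to its own value; only the semantic vectors, not raw interactions, are needed on the server side.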

[IR-27] Beyond Positive Signals: Unlocking Implicit Negative Behaviors for Enhanced Sequential User Modeling

Link: https://arxiv.org/abs/2606.15252
Authors: Zexuan Cheng, Yue Liu, Jun Zhang, Jie Jiang
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:User behavior sequence modeling has become a central component in modern click-through rate (CTR) prediction. Over the past years, the community has invested substantial effort into improving how sequences are encoded, from target-aware attention and interest evolution networks to unified architectures that jointly process sequential and non-sequential features. However, a more fundamental question remains under-explored: what should constitute the behavior sequence? Current practice constructs sequences exclusively from positive interactions (clicks, purchases, completions), while the far more abundant implicit negative behaviors (skips, low engagement, scroll-past) are largely underutilized. As gains from longer positive sequences approach diminishing returns, we revisit this underutilized data source within the sequential modeling framework. In this paper, we demonstrate that mixed-polarity behavior sequences, which chronologically interleave positive and negative tokens within a fixed length budget, consistently outperform positive-only sequences across diverse model architectures with negligible additional computational overhead. We further identify a semantic indistinguishability problem inherent to naive polarity embeddings and propose Target-Aware Polarity Fusion (TAPF), a lightweight target-conditioned gating mechanism that provides additional gains by differentiating behavioral evidence. Notably, even the simpler polarity bias baseline captures the majority of improvement, underscoring that the primary contribution is the mixed-polarity data paradigm itself. Experiments on three public benchmarks demonstrate consistent improvements of +1.9% to +9.6% relative AUC across five architectures, which validate the practical value of our approach.
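
A minimal sketch of building a mixed-polarity sequence, assuming each behavior is a `(timestamp, item_id)` pair and that the fixed length budget keeps the most recent events. The function name and the truncation rule are illustrative assumptions, not details from the paper.

```python
def build_mixed_sequence(positives, negatives, budget):
    """Interleave positive and negative behaviors chronologically.

    positives / negatives: lists of (timestamp, item_id) pairs.
    Returns up to `budget` most recent (item_id, polarity) tokens,
    where polarity is 1 for a positive and 0 for an implicit negative.
    """
    if budget <= 0:
        return []
    events = [(ts, item, 1) for ts, item in positives]
    events += [(ts, item, 0) for ts, item in negatives]
    events.sort(key=lambda event: event[0])  # chronological order
    return [(item, polarity) for _, item, polarity in events[-budget:]]


clicks = [(1, "a"), (4, "d")]
skips = [(2, "b"), (3, "c")]
sequence = build_mixed_sequence(clicks, skips, budget=3)
```

Because the budget is shared, every negative token kept displaces an older positive one, which is the trade-off the abstract argues is worthwhile.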

[IR-28] Edu-Theater: A Data-Efficient Agent Framework for Scalable Learner Behavior Simulation through Staging Roll-Call

Link: https://arxiv.org/abs/2606.15225
Authors: Weibo Gao, Qi Liu, Linan Yue, Zheng Zhang, Yichao Du, Fangzhou Yao, Ao Yu, Zhenya Huang, Shijin Wang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: LLM Agent, Educational Data Mining, Data Synthesis, Human Simulation

Abstract:Large-scale learner-task interaction data are crucial for intelligent educational systems but are costly to collect and constrained by privacy and learner engagement. Learner simulators play a critical role in simulating scalable learner behavior without the need for continuous involvement of real learners. However, existing methods are predominantly individual-centric, pairing a simulator with each learner to iteratively infer latent knowledge states from dense interaction histories, which is both data- and computation-intensive, and fragile in cold-start scenarios. We propose a cohort-aware roll-call simulation paradigm that first constructs cohort-level proficiency priors and refines individual learner states through a small number of targeted diagnostic queries. Based on this paradigm, we introduce Edu-Theater, an LLM-powered agent system that performs cohort-aware learner simulation via a teacher agent and retrospective roll-call probing over learner logs. Edu-Theater enables scalable future behavior simulation without the need for dense per-learner histories. Experiments on two real-world datasets demonstrate that Edu-Theater achieves higher simulation accuracy with significantly fewer LLM calls, producing synthetic data that enhances downstream applications such as adaptive testing.

[IR-29] MVEB: Massive Video Embedding Benchmark

Link: https://arxiv.org/abs/2606.14958
Authors: Adnan El Assadi, Roman Solomatin, Isaac Chung, Chenghao Xiao, Deep Shah, Manan Dey, Shriya Sudhakar, Zacharie Bugaud, Wissam Siblini, Ayush Sunil Munot, Yashwanth Devavarapu, Rakshitha Ireddi, Michelle Yang, Márton Kardos, Niklas Muennighoff, Kenneth Enevoldsen
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:We introduce the Massive Video Embedding Benchmark (MVEB), a 23-task benchmark for video embeddings spanning classification, zero-shot classification, clustering, pair classification, retrieval, and video-centric question answering. We evaluate 33 models and find that no single model dominates: MLLM-based embeddings lead on classification, clustering, pair classification, and QA; multimodal binding leads on retrieval and zero-shot classification; generative MLLMs without contrastive adaptation collapse on cross-modal tasks. Paired video-only vs. audio+video evaluations show that audio’s contribution depends on dataset annotation provenance: audio helps when labels were produced from both modalities and hurts when they were produced from visuals alone, a six-point gap consistent across model families. MVEB is derived from MVEB+, a 184-task pool, and is designed to maintain task diversity while reducing evaluation cost. It integrates into the MTEB ecosystem for unified evaluation across text, image, audio, and video. We release MVEB and all 184 tasks along with code and a leaderboard at this https URL.

[IR-30] Retrieval-as-a-Service: A System-Oriented Analysis of Industrial Retrieval Pipelines in Web Systems

Link: https://arxiv.org/abs/2606.14932
Authors: Fang Liu, Yuan Yuan, Yifan Dang, Xuncheng Zhang, Cuiqianhe Du
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Retrieval systems have become a foundational infrastructure component in modern Web services, supporting applications such as content recommendation, advertising targeting, and API discovery. In large-scale industrial environments, retrieval is increasingly deployed as an independent service layer, commonly referred to as Retrieval-as-a-Service (RaaS). This paper presents a system-oriented survey of industrial retrieval pipelines, focusing on architectural design and deployment trade-offs under real-world constraints. Unlike prior surveys that emphasize algorithmic developments, we analyze retrieval systems from an infrastructure perspective, highlighting how latency requirements, scalability constraints, and resource limitations shape system design in production environments. We introduce a unified RaaS pipeline abstraction that models retrieval as a multi-stage service, including high-efficiency candidate generation, embedding-based semantic matching, and resource-aware re-ranking. We further examine the integration of Large Language Model (LLM)-based retrieval mechanisms and analyze their impact on semantic performance, latency, and computational overhead. The results provide a system-level understanding of retrieval as a service-oriented infrastructure and offer practical guidelines for designing scalable, efficient, and QoS-aware retrieval architectures in large-scale Web systems.

[IR-31] Co-Scraper: query-aware DOM Pruning and Reusable Scraper Synthesis for Lightweight Web Data Extraction

Link: https://arxiv.org/abs/2606.14821
Authors: Shoupeng Wang, Jiantao Qiu, Wuyang Zhang, Conghui He
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:The abundant and heterogeneous nature of web content necessitates automated information extraction, and generating scrapers that can be reused across similar web pages offers an effective solution for scalable data extraction. In this work, we propose Co-Scraper, a two-stage framework capable of handling the hierarchical complexity of long HTML documents. By integrating a query-aware DOM pruning mechanism with stable extraction strategy induction, Co-Scraper effectively transforms web content into executable programmatic wrappers using a fine-tuned Qwen3-8B model. On the test set of SWDE, Co-Scraper achieves state-of-the-art performance with an F1 score of 94.78% and a reuse success rate of 90.39%. This framework significantly enhances the accuracy and resilience of data extraction, providing a highly efficient approach for web data acquisition tasks.

[IR-32] Combining Retrieval-Augmented Text Generation with LLMs for Reading Content Recommendations

Link: https://arxiv.org/abs/2606.14817
Authors: Sooyeon Kim, Piotr S. Maciąg
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:This work presents the design, implementation, and evaluation of a system for generating personalized reading content using Large Language Models (LLMs) combined with Retrieval-Augmented Generation (RAG). The proposed architecture consists of four modules (Input, RAG, Generation, and Judging) and enables users to specify both a question and a target reading content complexity. RAG is employed to retrieve relevant information from the Internet, enriching and grounding the content produced by three modern LLMs: Meta LLaMA 4 Scout, LLaMA 3.1 8B Instant, and Google Gemma2 9B. Reading materials are generated using three prompting strategies (Chain-of-Thought, zero-shot, and few-shot), and the LLM-as-a-Judge module automatically evaluates answer quality and alignment with the desired readability level. Experimental results show that RAG consistently improves system performance across all models and prompting techniques, increasing relevance and particularly groundedness by up to 26-35 percentage points. Overall, the findings demonstrate that the RAG-augmented architecture effectively produces reading content tailored to user queries and desired textual complexity.

[IR-33] An Empirical Analysis of Optimization Dynamics and Sparsity Boundaries in Large-Scale Pedestrian Attribute Recognition

Link: https://arxiv.org/abs/2606.14770
Authors: Houssam El Mir
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:Pedestrian Attribute Recognition (PAR) is critical for video surveillance, enabling forensic search and re-identification systems. Extreme class imbalance remains a fundamental obstacle when merging PETA and PA-100K into a 109,000-image composite corpus, where minority attributes have positive sample fractions below 1%. This causes standard BCE optimization to suppress rare traits, a phenomenon we term the majority negative class cheating trap. We present a systematic ablation of Multi-Label Focal Loss hyperparameters (alpha and gamma) on a ResNet-18 backbone. A calibrated configuration (alpha=0.50, gamma=2.0) achieves a Macro F1-score of 62.32%, matching the BCE baseline while preserving superior hard-example mining and convergence dynamics. Our approach uses pure loss-function engineering with zero computational overhead for edge deployment. We identify the Sparsity Wall, a hard boundary where positive sample fractions below 0.1% make global loss reweighting ineffective, requiring instance-level intervention.
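
For reference, a minimal sketch of a per-sample multi-label focal loss in its standard binary form, with the abstract's calibrated alpha=0.50 and gamma=2.0 as defaults. The function names and the mean reduction over attributes are assumptions; the paper's exact formulation may differ.

```python
import math


def multilabel_focal_loss(probs, targets, alpha=0.50, gamma=2.0, eps=1e-12):
    """Mean focal loss over the attributes of one sample.

    probs:   predicted probability per attribute (after sigmoid).
    targets: 0/1 label per attribute.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        if y == 1:
            # Positive label: well-classified positives (p near 1) are down-weighted.
            total += -alpha * (1.0 - p) ** gamma * math.log(p)
        else:
            # Negative label: well-classified negatives (p near 0) are down-weighted.
            total += -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)
    return total / len(probs)


def bce_loss(probs, targets, eps=1e-12):
    """Mean binary cross-entropy over the attributes of one sample."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)
```

With gamma set to 0 and alpha at 0.50, this reduces to half the BCE loss, so the gamma term is what shifts weight toward hard examples.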

[IR-34] Phishing Email Detection Using Large Language Models

Link: https://arxiv.org/abs/2512.10104
Authors: Najmul Hasan, Prashanth BusiReddyGari, Haitao Zhao, Yihao Ren, Jinsheng Xu, Shaohu Zhang
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 7 pages

Abstract:Email phishing is one of the most prevalent and globally consequential vectors of cyber intrusion. As systems increasingly deploy Large Language Model (LLM) applications, they face evolving phishing email threats that exploit their fundamental architectures. Current LLMs require substantial hardening before deployment in email security systems, particularly against coordinated multi-vector attacks that exploit architectural vulnerabilities. This paper proposes LLMPEA, an LLM-based framework to detect phishing email attacks across multiple attack vectors, including prompt injection, text refinement, and multilingual attacks. We evaluate three frontier LLMs (GPT-4o, Claude Sonnet 4, and Grok-3) and a comprehensive prompting design to assess their feasibility, robustness, and limitations against phishing email attacks. Our empirical analysis reveals that LLMs can detect phishing emails with over 90% accuracy, while also highlighting that LLM-based phishing email detection systems could be exploited by adversarial attacks, prompt injection, and multilingual attacks. Our findings provide critical insights for LLM-based phishing detection in real-world settings where attackers exploit multiple vulnerabilities in combination.

Human-Computer Interaction

[HC-0] From 911 to Hospital: Challenges and Opportunities for AI Integration in Emergency Medical Services

Link: https://arxiv.org/abs/2606.16984
Authors: Emily Hou, Marelyn Gonzalez, Andrew L. Kun, Osnat Mokryn, Orit Shaer
Categories: Human-Computer Interaction (cs.HC)
Comments: Accepted for publication in the Proceedings of CHIWORK 2026. Related DOI: https://doi.org/10.1145/3808045.3808066

Abstract:Artificial Intelligence (AI) is increasingly introduced into healthcare settings, yet its integration into fast-paced, high-pressure domains such as Emergency Medical Services (EMS) remains limited. EMS work unfolds across distinct stages, each characterized by different information needs, constraints, and forms of collaboration. Designing effective AI support requires understanding how AI interventions align with, or disrupt, EMS work across its different stages. We conducted semi-structured interviews with 25 EMS clinicians across the United States to examine how existing technologies currently support emergency services workflows and how they envision opportunities for, and concerns about, future AI-based support across different stages of emergency response. Our analysis reveals the cognitive, social, and procedural factors that enable EMS team coordination, which is grounded in situational awareness across distributed roles. EMS clinicians expressed significant concerns about how AI integration threatens this coordination mechanism across multiple dimensions: legal and privacy issues, technical reliability, contextual sensitivity, professional autonomy, and workflow friction. We propose five design principles for AI systems that augment distributed cognition and situational awareness, enabling EMS teams to deliver effective care under extreme constraints.

[HC-1] A Causal Model of Theory of Mind in Conflict for Artificial Intelligence

Link: https://arxiv.org/abs/2606.16944
Authors: Nikolos Gurney
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Theory of mind (ToM), the capacity to ascribe mental states to others and use those ascriptions for prediction and inference, is widely assumed to be essential for effective human-machine integration. Existing AI-ToM models address how to mentalize, but leave the question of when largely unaddressed. The central question is: under what situational and agent-level conditions is ToM engagement causally warranted in conflict? This paper presents a structural causal model formalized as a directed acyclic graph (DAG), treating ToM as a mechanism activated by situational and agent-level conditions rather than as an always-on capacity. The model specifies four exogenous variables capturing situational and agent-level conditions, five endogenous mediators, and a mechanistic ToM node producing engagement states through three distinct causal pathways: a tractability pathway, a reasoning-depth pathway, and an enabling-cause pathway. The primary outcome is epistemic accuracy, which decouples social reasoning from behavioral policy and generalizes across social phenomena beyond conflict. The framework gives AI systems a principled, resource-rational decision procedure for mentalizing, with implications for efficiency, trust, and the development of robust artificial social intelligence. Simulation validation, empirical human-machine teaming studies, and ethical considerations arising from conflict-optimized mentalizing are discussed.

[HC-2] Evolution Foundation: AI Shares Creative Control

Link: https://arxiv.org/abs/2606.16849
Authors: Dylan Banarse, Stephen Todd, William Latham, Frederic Fol Leymarie
Categories: Neural and Evolutionary Computing (cs.NE); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
Comments:

Abstract:This paper investigates the creative process of automated design and artistic evaluation using an evolutionary system. We consider how a multimodal artificial intelligence (AI) model can communicate and guide a combined generative and evolutionary computational system. This creates a framework for the evolution of aesthetically pleasing complex 3D organic forms by integrating genetic algorithms with the visual reasoning capabilities of large-scale AI foundation models. The framework shifts the artist role from that of intensive direct selection to one of system design, transferring detailed step-by-step curation to an AI agent capable of multimodal aesthetic judgement. This framework enables the human artist/designer to rapidly traverse large areas of multi-dimensional evolutionary parameter space to find creative outcomes based on their semantic targets. Detailed audit trails of the AI’s aesthetic reasoning are generated for each experiment. Interactive visualisation tools, together with AI-generated summaries and evolutionary narratives, enable deep exploration into each evolutionary experiment and provide transparent insight into the AI-guided process.

[HC-3] A comparison of human and LLM-simulated participants in a writing style task

Link: https://arxiv.org/abs/2606.16778
Authors: Felix Gröner, Erin K. Chiou
Categories: Human-Computer Interaction (cs.HC)
Comments: 37 pages, 10 figures

Abstract:Because large language models (LLMs) can produce natural language that is sometimes indistinguishable from texts produced by people, some researchers are starting to consider replacing human participants with LLM simulations. In this study, we test the extent to which the findings of a simulation with an LLM prompted to act as a synthetic participant match those obtained from 30 human participants. In our experiments, we evaluated how well writing style preference inference algorithms adapted to a participant over repeated interactions, compared to a baseline. We discover hints of bias and a lack of depth in GPT-4o’s text generation and judgement that prevent it from accurately simulating people’s behavior. Our results also hint at human biases that highlight the importance of considering human factors in the evaluation of systems that depend on human-automation interaction. Rather than treating these discrepancies as evidence for or against the validity of LLM-simulated participants, we present this study as a case analysis of methodological and design challenges.

[HC-4] MuVAP: Multimodal Multiparty Voice Activity Projection for Turn-taking Prediction in the Wild

Link: https://arxiv.org/abs/2606.16731
Authors: Haotian Qi, Gabriel Skantze
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Current multiparty turn-taking models often rely on complex microphone arrays or multi-camera setups, limiting their applicability in human-robot interaction scenarios. We introduce MuVAP, a causal multimodal framework that extends Voice Activity Projection by grounding acoustic predictions in face tracks, enabling speaker-aware turn-taking predictions from a monaural audio stream and a single camera view. To address the combinatorial complexity of modeling multiple speakers, we propose Role-Relative Projection, which maps any N-speaker interaction onto a fixed current versus next floor-holder state. Because existing audiovisual datasets contain disruptive editing cuts that break causal tracking, we introduce the Audio-Visual Conversation Corpus, a 31-hour dataset of unedited, single-camera multiparty conversations. Evaluations demonstrate that MuVAP outperforms strong baselines on Shift-Hold and next-speaker prediction tasks across two- and three-speaker settings.

[HC-5] Mapping the Design Space for Youth Social Media: A Framework Centered on Friendship Building

Link: https://arxiv.org/abs/2606.16651
Authors: JaeWon Kim
Categories: Human-Computer Interaction (cs.HC)
Comments:

Abstract:This dissertation develops a design framework for friendship-supportive youth social media. I conducted a qualitative meta-analysis across my formative, case-study, and co-design work with teens and young adults, synthesizing recurring design themes into three pillars: social understanding (legible norms, intentions, trust, reciprocity, and accountability), placeness (spatial and embodied affordances that make online interaction feel inhabitable), and identity alignment (authentic expression that remains current, plural, and interpretable). The framework is grounded in interpersonal, developmental, and sociotechnical theory, but its contribution is design-oriented: it translates broader accounts of friendship and social development into the specific ways social media platforms can shape youth friendship building. I initially validate parts of this framework through WhoamI Today (WIT), a platform deployed with 99 youth across the United States and Korea. My proposed work extends this validation through a follow-up deployment while refining the framework as a roadmap for cumulative design research on youth social media.

[HC-6] Using AI in engineering education: a balancing act driven by clear purpose

Link: https://arxiv.org/abs/2606.16626
Authors: Olya Kudina
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: To appear in The Routledge Handbook of the Philosophy of Engineering, 2nd ed. Edited By Diane P. Michelfelder, Neelke Doorn

Abstract:Based on a questionnaire of 100 higher-education students, predominantly from engineering-related fields, and a critical review of recent literature, this chapter examines how students use and perceive Large Language Models (LLMs) in engineering education. Students primarily value LLMs for writing support, conceptual clarification, coding assistance, and brainstorming, while simultaneously expressing concerns about inaccuracies, bias, overreliance, academic integrity, and the burden of verification. Through an analysis of two dominant metaphors, namely LLMs as an “oracle” and as a “tutor,” the chapter shows how these systems cultivate expectations of authority, expertise, and personalized learning that often exceed their actual capabilities. The chapter further argues that students’ attachment to the promises of efficiency and personalized support reflects a form of “cruel optimism,” where the perceived benefits of LLMs often depend on the very skills, vigilance, and expertise that students are still developing. Overall, the chapter argues for a purpose-driven and context-sensitive approach to AI integration in engineering education, emphasizing critical AI literacy, reflective assessment design, pedagogical caution, and consideration of broader ethical and environmental impacts.

[HC-7] Beyond Usability: A UX Case Study on Using “Withdrawal Design” to Challenge Engagement Metrics in Social Robotics

Link: https://arxiv.org/abs/2606.16439
Authors: Yibo Meng, Qiuyu Long, Richard Chen, Yan Guan, Xiaolan Ding
Categories: Human-Computer Interaction (cs.HC)
Comments:

Abstract:Social robots for children with autism are often evaluated through engagement and interaction quality, assuming the robot acts as a social scaffold. We report a mixed-methods “withdrawal” study that tests a harder question: what changes when the robot is removed. In an 8-week home-based randomized controlled trial (N=40), children either retained a consumer social robot (Qrobot) or had it withdrawn after initial use. Quantitatively, continued access reduced anxiety (SCARED/RCADS), yet was associated with lower parent-reported social motivation and weaker gains in emotion recognition (SMS/RMET) compared to withdrawal. Interviews with guardians contextualized this divergence: removal sometimes prompted children to seek human interaction, while continued use could keep social behavior siloed within the child-robot dyad, despite exceptionally high usability (SUS). We synthesize a UXR point of view: for vulnerable users, “engagement” can mask ecological downsides. Success should be judged not by retention, but by designed separation that bridges back to human relationships.

[HC-8] LectūraAgents: A Multi-Agent Framework for Adaptive Personalized AI-Assisted Learning and Embodied Teaching

Link: https://arxiv.org/abs/2606.16428
Authors: Jaward Sesay, Yue Yu, Siwei Dong, Yemin Shi, Guangyao Chen, Börje F. Karlsson
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Effective personalized AI-assisted learning demands systems that can not only generate accurate learner-specific educational materials, but also dynamically adapt their instruction to diverse learners. However, existing educational agents have primarily focused on lecture content automation and simulations, which often fall short of modelling multimodal and embodied instructional methods tailored for the individual learner. To this end, we propose LectūraAgents - a multi-agent framework that enables personalized learning through end-to-end adaptive embodied teaching. At its core, LectūraAgents mirrors a professor-student relationship, in which a ProfessorAgent leads a collaborative team of specialized subordinate agents through research, planning, review, and embodied delivery of lecture contents that adapt to a learner’s needs. The framework offers three main contributions: (1) a hierarchical multi-agent architecture for end-to-end personalized learning; (2) an adaptive embodied teaching mechanism, wherein the ProfessorAgent executes visible and pedagogically motivated teaching actions (e.g., handwrite, highlight, underline, etc.) over contents in a teaching environment; and (3) a Teaching Action-Speech Alignment (TASA) algorithm that employs salience-based heuristics and temporal semantic segmentation to generate coherent teaching action sequences aligned with learner profiles. We evaluate LectūraAgents on diverse courses at high school, undergraduate, and graduate levels using sample-specific rubric-based analysis; with generated lecture materials and teaching actions assessed and validated by expert educators. Experimental results show consistent gains in lecture content quality, embodied teaching quality, assessment, and personalization over existing approaches, positioning LectūraAgents as a pedagogically well-grounded framework for personalized learning at scale.

[HC-9] An Augmented Reality Brain-Robot Interface for Generalist Robot Arm Manipulation

Link: https://arxiv.org/abs/2606.16413
Authors: Shangkai Zhang,Rousslan Fernand Julien Dossa,Luca Nunziante,Marina Di Vincenzo,Kai Arulkumaran
Categories: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
Comments: Accepted at the 2026 IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)

Abstract:The integration of augmented reality (AR) and EEG-based brain-computer interfaces (BCIs) offers a promising path for enabling intuitive control of robots for assistive purposes. However, existing AR brain-robot interface (BRI) systems are often constrained to task-specific structures, limiting their utility in real-world environments. We present an AR BRI designed for generalist robot arm manipulation that combines gaze-based object selection with motor imagery action control. Our system uses eye-tracking for intuitive object targeting and context-aware visual overlays (“Place” and “Use”) to guide the user through tasks within a shared autonomy framework. We evaluated the interface through a feasibility study with 18 healthy participants performing three multi-step activities of daily living: drinking, using a drawer, and operating an oven. Our results demonstrate that this interaction paradigm enables effective sequential task execution and high user engagement, achieving a “Good” usability rating (SUS 70). These findings support the feasibility of the proposed interaction paradigm for complex BCI-driven robotic assistance, and motivate future evaluation with the intended target population. Project website: this https URL.

[HC-10] Medical Heuristic Learning: An LLM-Driven Framework for Interpretable and Auditable Clinical Decision Rules

Link: https://arxiv.org/abs/2606.16337
Authors: Wei Xu,Ke Yang,Gang Luo,Keli Zheng,Lingyan Hu,Jing Wang,Kefeng Li
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Abstract:Predictive modeling for clinical tabular data is central to clinical decision support and therefore requires not only strong predictive performance but also transparent decision logic. Although deep learning and tree-based ensemble methods can achieve high accuracy, their black-box nature remains a major obstacle to clinical deployment. This challenge is further compounded by common characteristics of medical data, including limited sample sizes, severe class imbalance, and feature evolution arising from changes in diagnostic criteria and clinical documentation. To address these issues, we propose Medical Heuristic Learning (MHL), an instantiation of the learning-beyond-gradients paradigm for clinical tabular prediction. Instead of relying on neural network weight updates, MHL uses a large language model (LLM)-driven workflow that integrates statistical probes, medical knowledge probes, rule synthesis, and code-level iterative refinement to optimize a deterministic and executable decision system. The resulting model is expressed not as opaque parameters, but as versioned pure-Python decision rules that are explicitly interpretable, fully auditable, and clinically grounded. MHL also supports continual learning by starting from previously validated rules and iteratively revising them using updated feature information under data drift or feature evolution. Comprehensive experiments on medical datasets show that MHL achieves performance comparable to state-of-the-art methods while maintaining strong behavior in small-sample and highly imbalanced settings. The results further indicate that this explicit rule update mechanism can help alleviate catastrophic forgetting under feature evolution. Overall, these findings suggest that non-gradient-based heuristic systems offer a transparent and adaptable alternative for high-stakes clinical decision support.

[HC-11] Patient-centered visualization of multistage cancer treatment trajectories

Link: https://arxiv.org/abs/2606.16335
Authors: Laura Lackner,Marius Bill,Martin Bornhaeuser,Karolin Trautmann-Grill,Helena Klara Jambor
Categories: Human-Computer Interaction (cs.HC)

Abstract:Effective communication of multistage cancer treatment trajectories remains a major challenge, particularly for patients with limited health literacy. We present a patient-centered visualization approach for representing complex, phase-based oncology treatments, integrating principles from information visualization, user experience (UX) design, and cognitive psychology. Using acute myeloid leukemia (AML) as a case study, we developed two timeline-based representations: a static, visually simplified trajectory emphasizing structure and hierarchy, and an interactive variant with layered information. We evaluated both approaches in a quantitative survey, measuring comprehension of treatment sequences, perceived confidence, and information quality. Results show that the static visualization significantly improves understanding and clarity, highlighting the importance of visual hierarchy, consistent encoding, and reduced complexity when communicating temporal medical processes compared to the baseline. In contrast, additional interactivity did not improve performance and introduced navigational overhead, suggesting that interaction must be carefully aligned with cognitive demands. Our findings contribute to visualization research by demonstrating how patient-centered design can improve the interpretability of multistage treatment trajectories. We derive design implications for temporal medical visualizations, emphasizing simplicity, structural clarity, and accessibility to support informed decision-making in clinical contexts.

[HC-12] Measuring Whether LLM Tutors Teach or Solve: A Diagnostic for Educational Impact

Link: https://arxiv.org/abs/2606.16206
Authors: Junyi Yao,Zihao Zheng,Baichuan Li
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

Abstract:Large language models are increasingly proposed as educational tutors, yet stronger task-solving ability does not necessarily imply stronger learning support. Motivated by recent calls to measure the social impact of NLP systems in practice, we study whether public LLM tutoring benchmarks distinguish learning-supportive behavior from mere answer production. We propose a lightweight diagnostic based on the gap between solving-oriented and pedagogy-oriented benchmark performance. Using public MathTutorBench leaderboard results, we show that these dimensions are only partially aligned: across eight publicly reported models, the correlation between solving and pedagogy composites is 0.421, and several models shift meaningfully in rank when evaluation moves from solving to pedagogy. We then analyze the public TutorBench sample and show that agency-relevant behaviors are explicitly encoded in benchmark rubrics, especially in active-learning settings that reward guiding questions, calibrated hints, and non-disclosive scaffolding. Together, these findings suggest that educational-impact evaluation should not treat task success as a sufficient proxy for learning support. We argue that public tutoring benchmarks can better support positive-impact evaluation by reporting solving-oriented and pedagogy-oriented scores separately and by making disclosure-sensitive, student-agency-preserving criteria more explicit.
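The gap-based diagnostic described in this abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the code below assumes the gap is a simple difference between two composite scores, and it pairs that with a Pearson correlation and a rank shift per model. The model names and scores are made up for illustration and are not MathTutorBench results.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equally long score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def ranks(scores):
    """Map each model to its rank (1 = best). Ties are not handled in this sketch."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(order)}


def gap_diagnostic(solving, pedagogy):
    """Per-model score gap and rank shift between the two composites.

    A negative rank_shift means the model ranks worse on pedagogy than on solving.
    """
    solving_rank, pedagogy_rank = ranks(solving), ranks(pedagogy)
    return {
        name: {
            "gap": solving[name] - pedagogy[name],
            "rank_shift": solving_rank[name] - pedagogy_rank[name],
        }
        for name in solving
    }


# Hypothetical composite scores, invented for this example only.
solving = {"model_a": 0.90, "model_b": 0.70, "model_c": 0.50}
pedagogy = {"model_a": 0.40, "model_b": 0.60, "model_c": 0.55}

names = sorted(solving)
print(pearson([solving[n] for n in names], [pedagogy[n] for n in names]))
print(gap_diagnostic(solving, pedagogy))
```

Reporting the two composites side by side, as the authors recommend, is what makes the rank shift visible: a single blended score would hide a model that solves well but tutors poorly.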

[HC-13] A comparative and critical study of EEGNet for fNIRS-driven cognitive load classification

Link: https://arxiv.org/abs/2606.16160
Authors: Mehshan Ahmed Khan,Houshyar Asadi,Li Zhang,Mohammad reza Chalak Qazani,Ghazal Bargshady,Stefanos gkikas,Christian arzate,Sam Oladazimi,Zoran Najdovsk,Lei Wei,Chee Peng Lim
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Abstract:Accurately classifying cognitive load from functional near-infrared spectroscopy (fNIRS) signals remains a significant challenge due to temporal variability, inter-subject differences, and sensitivity to preprocessing choices. This study provides a comprehensive evaluation of EEGNet for fNIRS-based cognitive load classification by systematically examining the effects of temporal segmentation strategies (overlapping vs. non-overlapping), window lengths (10s, 20s, 30s), feature extraction methods (Analysis of Variance (ANOVA), Principal Component Analysis (PCA), Fast Independent Component Analysis (FastICA)), learning rate configurations (fixed and adaptive), and evaluation protocols (random split vs. subject-independent (SI)). Results from random-split experiments show that overlapping segmentation, combined with smaller fixed learning rates (0.01-0.001), yields the highest accuracies, due to temporal redundancy and dense sampling of hemodynamic transitions. However, SI evaluation reveals a substantial drop in accuracy, demonstrating limited generalization to unseen participants. Under SI evaluation, non-overlapping segmentation outperformed overlapping windows, with the best accuracy of 56.11% achieved using PCA features with a 20-second window and a 0.1 learning rate. These findings indicate that eliminating temporal redundancy helps the model learn more robust and generalizable representations of cognitive load across individuals. Although adaptive learning rate strategy improved training stability, it did not surpass the performance of optimally selected fixed learning rates. The study highlights the critical role of segmentation strategy and learning rate selection in improving model generalization and identifies methodological considerations essential for developing reliable, real-time, and SI cognitive load classification systems using fNIRS.
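The two segmentation strategies and the subject-independent protocol compared in this abstract can be sketched as follows. This is a generic illustration, not the authors' pipeline: the function names are hypothetical, windows are measured in samples rather than seconds (the sampling rate is not given), and leave-one-subject-out is assumed as one common way to realise subject-independent evaluation.

```python
def segment(signal, window, step):
    """Split a 1-D sequence into fixed-length windows.

    step == window gives non-overlapping windows;
    step < window gives overlapping windows that share samples.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    last_start = len(signal) - window
    return [signal[i:i + window] for i in range(0, last_start + 1, step)]


def leave_one_subject_out(subject_ids):
    """Yield (held_out, train_idx, test_idx) so no subject appears in both sets."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


signal = list(range(10))                        # stand-in for one fNIRS channel
non_overlapping = segment(signal, window=4, step=4)
overlapping = segment(signal, window=4, step=2)
print(len(non_overlapping), len(overlapping))   # overlapping yields more windows

# One subject label per window; the labels here are invented.
for held_out, train, test in leave_one_subject_out(["s1", "s1", "s2", "s3"]):
    print(held_out, train, test)
```

Overlapping windows share samples, so a random split can place near-identical windows in both the training and test sets. That is one plausible reading of why the random-split accuracies reported in the abstract drop under subject-independent evaluation.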

[HC-14] GraphStory: Collaborative Story Writing through Event-Based Narrative Editing

Link: https://arxiv.org/abs/2606.16102
Authors: Xuan-Vu Le,Minh-Loi Nguyen,Khanh-Duy Le,Minh-Triet Tran,Trung-Nghia Le
Categories: Human-Computer Interaction (cs.HC)

Abstract:Story writing is a popular yet complex creative activity that requires organization of ideas and iterative exploration, particularly during early-stage ideation. While many AI-based writing assistants have been developed, existing approaches primarily focus on generating long-form coherent text and improving user controllability during text production, providing limited support for brainstorming, connecting ideas, and validating alternative narrative flows. We present GraphStory, an interactive writing support system that leverages a graph-based representation to provide a comprehensive view of narrative structure and facilitate ideation. The system enables users to organize and connect plot points, explore alternative branches, and validate evolving narratives through an integrated story generation workflow. It further provides a structured interface to support efficient iteration over multiple story paths. Results from a user study with professional and semi-professional writers show that GraphStory reduces the effort of organizing narrative structures and better supports creativity and exploration compared to normal AI-based writing workflows.

[HC-15] Beyond the Blood Draw: Explainable Machine Learning for Non-Invasive Dysglycemia Risk Screening

Link: https://arxiv.org/abs/2606.16056
Authors: Black Sun,Chenyi Zhang,Kaiyi Ji,Xi Lu
Categories: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)

Abstract:Dysglycemia, encompassing both prediabetes and diabetes, affects huge numbers of adults worldwide, yet many of them remain undiagnosed. We developed and validated machine-learning (ML) models for non-invasive screening of dysglycemia risk that require no laboratory tests. Pooling data from the National Health and Nutrition Examination Survey (NHANES) 2017–2023 (n=14,352), we trained six ML models with stratified 5-fold cross-validation and compared them with two established clinical risk scores. LightGBM achieved the highest area under the receiver operating characteristic curve (AUC=0.820, 95% CI: 0.806–0.835), outperforming the Finnish Diabetes Risk Score (0.745) and American Diabetes Association Risk Test (0.783). SHAP analysis identified age, race/ethnicity, and waist-to-height ratio as the most influential predictors. Subgroup analyses confirmed consistent performance across demographic strata (AUC: 0.735–0.832). These results demonstrate the feasibility of explainable, laboratory-free dysglycemia screening for deployment in community settings and self-tracking health applications.

[HC-16] AI as a Sparring Partner – an HCAI Approach to Promote Human Capabilities

Link: https://arxiv.org/abs/2606.16020
Authors: Thomas Herrmann
Categories: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
Comments: 16 pages, 2 tables

Abstract:A systematic literature review reveals that the role of AI as a sparring partner (SP) is often proposed but not systematically analyzed or defined. We propose the definition: An AI sparring partner (AI SP) interacts with users in a combination of a cooperative as well as a challenging, competitive mode, where AI is selected, customized or self-adapting to meet a level of skills that neither over- nor under-challenges the users. They can have the experience that they become better or are better than AI. AI as a SP can support creativity, extend viewpoints or foster learning and critical thinking either towards ideas and decisions or towards the AI itself. Sparring with AI can either be explicit, e.g. when AI simulates certain roles, or implicit, when AI is used to find out whether or how to perform better.

[HC-17] Bridging the Usability Gap: Lessons from Interpreting Studies for Machine Interpreting Design

Link: https://arxiv.org/abs/2606.16009
Authors: Claudio Fantinuoli
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Abstract:Machine interpreting (MI), the live, real-time branch of speech translation, has achieved remarkable progress on standard benchmarks, with some systems approaching human parity on textual fidelity. Yet the user experience remains far inferior to interpreter-mediated communication, revealing what we term the accuracy illusion: systems that appear accurate on paper but fail in practice to support smooth, goal-oriented interaction. This paper defines MI as a distinct subfield of speech translation, with its own characteristics and the need for evaluation methods grounded in communicative effectiveness rather than isolated fidelity metrics. Drawing on insights from interpreting studies, we identify critical dimensions of professional interpreting practice that are overlooked by current systems, and consolidate them into three interdependent design priorities for future MI: agency (context-sensitive initiative and repair), grounding (multimodal and discourse-level situational awareness), and experience (adaptive improvement through real interaction). Together, these priorities chart a path toward closing the usability gap and enabling systems that can sustain authentic multilingual communication in real time.

[HC-18] Are LLM-based Chatbots Good Enough to Support Computer Science Students in Multiple-Choice Exercises?

Link: https://arxiv.org/abs/2606.15919
Authors: Markos Stamatakis,Omkar Gavali,Joshua Berger,Christian Wartena,Anett Hoppe,Ralph Ewerth
Categories: Human-Computer Interaction (cs.HC)
Comments: 13 pages (excluding references), 6 tables, 1 figure, 1 equation

Abstract:Chatbots based on large language models (LLMs) are increasingly adopted for information retrieval, text generation, and writing assistance. In educational settings, their use is also rapidly increasing. Students leverage these systems to complete tasks, access information, and support learning. However, the role of LLM-based chatbots in supporting learning and assessment in university-level computer science education is still underexplored. To address this gap, we investigate the performance of several LLM-based chatbots in solving multiple-choice questions (MCQs) at the university level and evaluate their capabilities to assist student learning. We developed 70 MCQs for a university lecture on interactive visual data analysis and evaluated the chatbots’ performance using different prompt designs. We further compared the results with students’ performance. Finally, we conducted a user study in two lectures (interactive visual data analysis, computer vision) to investigate how chatbot-generated answers and explanations affect students’ performance. The chatbot performance showed significant differences between smaller models and GPT-4o and GPT-5 models, which achieved the best results. The results of the user study show that presenting ChatGPT answers together with an explanation does not improve students’ performance in general.

[HC-19] Contaminated Collaboration: Measuring Gender Bias Transfer in LLM-Assisted Student Writing

Link: https://arxiv.org/abs/2606.15914
Authors: Ariyan Hossain,Kazi Kamruzzaman Rabbi,Farig Sadeque,S M Taiabul Haque
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 18 pages, 7 pages

Abstract:Gender bias in LLMs has been studied extensively in model outputs, with biased prompts shown to amplify stereotyped generations. Whether such bias propagates into text produced by humans who use these systems, however, remains underexplored. We investigate whether gender bias in an LLM writing assistant transfers into career plan essays written by students. We first verify that a gender-biased prompt induces gender-differentiated language in LLM-generated essays, while a neutral prompt does not. We then recruited participants (N = 123) in a controlled environment to write career plan essays for paired biographical profiles differing only in gender under three conditions: no AI assistance, neutral LLM assistance, or gender-biased LLM assistance. Students in the biased condition produced essays with a significantly larger agentic gap and more gender-stereotypic occupation suggestions than those in the control and neutral conditions. Our results also reveal that this bias transfer is asymmetric: agency is suppressed in female-target essays while male-target writing remains largely unaffected. Our findings highlight the risk of bias propagation in AI-assisted writing, calling for fairness-aware design in educational AI tools.

[HC-20] The Missing Layer: Why EdTech Needs Design-Time Generative UI Not Just Runtime Personalization

Link: https://arxiv.org/abs/2606.15902
Authors: Seyed Parsa Neshaei,Abhinand Shibu,Fatma Betül Güres
Categories: Human-Computer Interaction (cs.HC)
Comments: Accepted at the NextGen Learning Interfaces Workshop in AIED 2026

Abstract:The dominant paradigm in using generative UI (GenUI) for adaptive EdTech considers the use of AI as a runtime engine: content is authored once in a fixed form, and AI adapts delivery dynamically based on learner needs, behaviors, or profiles. We argue that this paradigm has an issue: it moves the burden of accessibility and representation diversity onto systems that see learners only after content has already been locked into particular details. For learners who might need audio-first, simplified text, interactive, or low-bandwidth representations, runtime adaptation is too late and too costly to be equitable at scale, and might lead to inaccurate learning content due to the inability to conduct verification at scale. We propose an alternative method: accessibility belongs in the authoring layer. Specifically, we advocate for a card-based GenUI paradigm, in which educational content is encoded as modality-agnostic semantic units, and GenAI produces multiple interface representations, such as interactive, audio, text-simplified, or low-bandwidth, at learning design time to be verified by the instructor before it reaches any learner. This shifts the AI intervention from delivery to creation, embeds Universal Design for Learning principles into the authoring workflow, and removes per-learner inference costs. We situate this idea against recent work on GenUI, multimodal content generation, adaptive authoring, and equitable delivery, and argue that realizing this goal requires closer integration of AI, HCI, and learning sciences than what either of those communities has so far provided.

[HC-21] Challenging Partisan Expectations Reduces Political Polarization

Link: https://arxiv.org/abs/2606.15901
Authors: Do Won Kim,Ozgur Can Seckin,Saumya Bhadani,Alessandro Flammini,Giovanni Luca Ciampaglia,Bao Tran Truong
Categories: Social and Information Networks (cs.SI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

Abstract:Political conversations are often proposed as a remedy for political polarization, yet their effectiveness remains inconsistent. We argue that this inconsistency partly reflects a neglected feature of political contact: the expectations partisans bring to these encounters. We hypothesize that conversations should reduce political polarization the most when they violate the expected link between partisan identity and issue position. We test this hypothesis in a 2x2 experiment in which 1,983 U.S. adults engaged in structured conversations with an AI chatbot whose presented partisan identity and policy stance were independently manipulated. We find that expectation-challenging conversations in which participants talk with a disagreeing ingroup member or an agreeing outgroup member are effective in reducing affective and issue polarization. Although these effects emerge without meaningful shifts in participants’ own policy positions, a follow-up survey shows that most effects disappear over one month. Interestingly, these conversations maintain or improve objective measures of deliberation but are experienced as less satisfying by participants. Our findings identify expectation violation as an underexplored depolarization mechanism. Our results also demonstrate the promises and limitations of how conversational AI can serve as a scalable method for experimentally studying interventions to mitigate partisan divides.

[HC-22] Rethinking Scaffolding in LLM Tutors: The Interactional Mismatch Between Benchmarks and Real-World Deployments ICML2026

Link: https://arxiv.org/abs/2606.15766
Authors: Alexandra Neagu,Jeffrey T. H. Wong,Marcus Messer,Rhodri Nelson,Peter B. Johnson
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Pluralistic Alignment Workshop @ ICML 2026, Seoul, South Korea

Abstract:A central pedagogical value evaluated in AI tutor benchmarks is scaffolding: guiding students through graduated steps toward a solution. Alignment and evaluation methods for embedding scaffolding behaviour into chatbots, however, rest on an implicit assumption: that students will take up the scaffolding and engage in the conversation. To examine whether this assumption holds, we introduce an evaluation pipeline around two metrics - Chatbot Scaffolding and Student Uptake - and apply them across nine datasets of 9,490 chats, spanning AI tutor benchmarks and real-world deployments of educational chatbots. Our analysis reveals that while benchmarks assume a high-scaffolding, high-student-uptake environment, students in real-world settings exhibit lower levels of uptake overall - frequently bypassing the chatbot’s pedagogical framing to drive the interaction toward their own learning goals at little interpersonal cost. We argue that bypassing scaffolding is not necessarily detrimental; rather, it frequently highlights a mismatch between a chatbot’s pedagogical framing and the student’s learning goals. To meaningfully evaluate the effectiveness of a chatbot’s assistance, future benchmarks must move beyond the assumption that students will simply take up the scaffolding, and instead evaluate how these chatbots navigate diverse learning contexts and student-driven interaction patterns.

[HC-23] SCAN: A Decision-Making Framework for Effective Task Allocation with Generative AI

Link: https://arxiv.org/abs/2606.15601
Authors: Fendi Tsim,Alina Gutoreva
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 16 pages, 2 figures, 3 tables. Preprint

Abstract:We introduce SCAN – a human-centric decision-making framework to facilitate learners for effective task allocation with Generative Artificial Intelligence (GenAI) based on Vygotsky’s Zone of Proximal Development and Metacognition. In SCAN, we systematize and formalize AI-human interaction by introducing a task-identification approach with four “sub-zones”: Substitute, Complement, Aid, and Non-negotiable. After describing the four sub-zones, we demonstrate how SCAN framework can be applied for knowledge workers in the workplace and students in education to metacognitively “scan” their use of Generative AI. We then discuss how such framework can be related to cognitive load theory, cognitive offloading, sycophancy, three decision-making modes in human-AI interactions (automation, augmentation, and collaboration), future of work such as upskilling and deskilling, and how it accounts for both human-human and human-AI learning. We propose that SCAN offers a great starting point before discussing whether GenAI complements or replaces our abilities when completing a task, with a general objective of sustaining lifelong learning, and a specific goal of reaching hybrid intelligence.

[HC-24] Process-Oriented Evaluation of AI-Assisted Scientific Writing

Link: https://arxiv.org/abs/2606.15583
Authors: Patrick Queiroz Da Silva,Sanchaita Hazra,Doeun Lee,Sachin Kumar,Bodhisattwa Prasad Majumder
Categories: Human-Computer Interaction (cs.HC)

Abstract:Bad writing hinders the publication of science. The role of artificial intelligence (AI) in generating and editing scientific texts remains unsettled. Abstracts serve as the critical gateway to scientific manuscripts, often shaping readers’ interest. We inspect how individuals revise AI-generated abstracts compared to human-authored abstracts when incentivized to communicate scientific content. Using 869 keystroke-level edit logs with 240k total edits, we construct behavioral labels and measure linguistic properties of edit bursts to investigate the edit trajectories. AI abstracts exhibit higher sentence-level agency, whereas human-authored abstracts outperform in global coherence, even with edits. Experts engage in stigmatic behavior, switching their strategy from predominantly restructuring to substitution when AI source is disclosed. Language Models (LMs) improve edit outcomes through a mix of local and global features, but still actively struggle with global coherence. Both humans and LMs often target the weakest sections of abstracts, but fail to improve stronger areas. Our large-scale process-oriented evaluation highlights the perks and pitfalls of both human and LM editing processes as machine-generated texts emerge in scientific communication.

[HC-25] Do we have the knowledge we need? Rethinking human-AI decision-making in corporations

Link: https://arxiv.org/abs/2606.15575
Authors: Anne S. R. Marx,Ricardo M. Avelino,Torbjørn Netland,Mennatallah El-Assady
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Proceedings of AutomationXP26 Workshop of the 2026 CHI Conference on Human Factors in Computing Systems, April 14, 2026, Barcelona, Spain. ACM, New York, NY, USA, 8 pages

Abstract:Organizational knowledge is fragmented across a variety of software systems, tacit expertise, and manual documents that have traditionally been designed for human consumption. As AI systems are increasingly deployed and granted decision-making roles, they require access to this knowledge. This raises two questions: how should organizations store and maintain knowledge so that it remains accessible to both humans and future AI systems, and how should agency be allocated between humans and AI across tasks with different risks and levels of uncertainty? In this position paper, we describe how organizational knowledge evolves and contribute a framework that maps task attributes and knowledge availability to recommended agency allocations and control mechanisms. We illustrate the applicability of the framework on two different manufacturing tasks: a routine operation (visual quality inspection) and a one-off strategic decision (factory location), and conclude with opportunities for future research.

[HC-26] If These Walls Could Talk: Critical Play with Large Language Models in Museums

Link: https://arxiv.org/abs/2606.15565
Authors: Anders Sundnes Løvlie
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Abstract:Large Language Models (LLMs) are increasingly being used in museums as role-playing chatbots that let visitors talk to simulated versions of people and artefacts from the past. While such installations can be playful and engaging, they are also problematic because LLMs cannot be trusted to speak truthfully. I identify a fundamental dilemma for the use of LLMs in museum chatbots: LLMs cannot be trusted to tell the truth, and efforts to make them more reliable may ruin that which is attractive about the bots in the first place - their ability to engage in life-like conversation. In response, I propose designing for critical play with LLM-based bots: Designing for playful interactions with bots that are unreliable but still able to represent the past in an adequate and engaging manner - as fictional characters representing historical narratives, styles of discourse, diverse perspectives, humor and satire.

[HC-27] “OpenBloom”: A Stigma-Sensitive LLM Design Probe for Reproductive Well-Being

Link: https://arxiv.org/abs/2606.15536
Authors: Yang Hong,Ashley Hua,Adya Daruka,Sharifa Sultana
Categories: Human-Computer Interaction (cs.HC)

Abstract:Ongoing discussions in Human-Computer Interaction (HCI) have examined the role of AI-based tools in health information seeking, particularly within sensitive domains such as reproductive health. We introduce “OpenBloom,” a web application and an exploratory design probe that utilizes Large Language Models (LLMs) to turn reproductive health articles into question-based prompts to explore stigma around reproductive wellbeing. Through a survey study with 34 participants across their 136 interactions with OpenBloom, we explore how AI-generated question-based learning interacts with sociocultural stigma, contextual sensitivity, and reflexiveness. While current LLM outputs largely meet expectations for non-offensiveness, they default to superficial rephrasing or factual recall and lack critical reflections. We discuss implications for applying Feminist HCI, contestability, and value-sensitive AI frameworks to future LLM-mediated reproductive health technologies.

[HC-28] Participatory Design for Assistive Mobility in Indian Homes Grounded in Lived Experience

Link: https://arxiv.org/abs/2606.15528
Authors: Jyoti Rautela,Abinash Kumar Swain,Madhan kumar Vasudevan
Categories: Human-Computer Interaction (cs.HC)
Comments: 16 pages, 5 figures, 2 tables; EMERGE 2026 conference paper

Abstract:Assistive mobility devices support independence for people with lower-limb disabilities, yet many are designed and evaluated in clinical or controlled environments. In Indian households, narrow spaces and dense furniture often make assistive devices difficult to use indoors, leading people to rely on improvised movements or support from family members and caregivers. In this work, we explore domestic mobility through a participatory and co-speculative design approach, focusing on how people with lower-limb disabilities navigate and maneuver within their homes. We conducted a series of semi-structured interviews and bespoke booklet-based participatory workshops with 22 participants with lower-limb impairments. To support reflection and discussion, we designed bilingual bespoke booklets grounded in domestic design frictions, using images and scenarios to encourage storytelling, critique, and speculation. Our findings reveal mobility challenges that differ significantly from those typically observed in clinical contexts. Rather than yielding a fixed set of design solutions, the study contributes situated insights into domestic mobility frictions, participant articulation, and the limits of speculative participation in this context.

[HC-29] A Prototypical Decision-Support Tool for Household Energy Management: A New Zealand Case Study

Link: https://arxiv.org/abs/2606.15513
Authors: Abdollah Baghaei Daemei
Categories: Human-Computer Interaction (cs.HC)
Comments:

Abstract:This paper presents the system architecture and operating logic of The Home-Energy Check-Up (New Zealand), a web-based public decision-support prototype designed to help New Zealand households identify avoidable energy-cost leakage, complete a short guided home inspection, and generate a prioritized behavior-first energy roadmap. The application is implemented as a single-file Python Streamlit system with session-state navigation, a household input dataclass, conservative low-high saving estimators, a seven-check inspection layer, a recommendation-ranking layer, visual analytics, anonymous Google Sheets persistence, downloadable reports, and a certificate-of-completion interface. The system does not claim to be a certified energy audit, New Zealand Building Code H1 verification method, Healthy Homes compliance statement, or guaranteed bill-forecasting engine. Instead, it operationalizes a practical educational workflow: start with money, collect only the minimum required household profile, convert user answers into a score and action set, estimate annual savings using transparent formulas, and convert behavior savings into a staged save-to-upgrade pathway. The manuscript details the front-end, state-management, calculation, data-storage, visualization, recommendation, deployment, privacy, and limitation layers of the prototype. It also identifies research-grade improvements required before the tool is used for validated impact assessment, including external validation against measured energy data, robust concurrent data writes, clearer uncertainty calibration, accessibility testing, and formal user evaluation. The contribution is a reproducible architecture for translating household energy advice into an interactive, gamified, data-light decision-support pathway for New Zealand homes.

[HC-30] What do you mean by human-AI collaboration: Prerequisite functions and the affordances needed to achieve it

Link: https://arxiv.org/abs/2606.15509
Authors: Mutlu Cukurova
Categories: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
Comments: 22 pages, 1 table. Submitted for review to the Handbook of AI and the Future of Education (R. Wegerif, I. Casebourne, A. Zhou, I. J. Ness, Eds.)

Abstract:The concept of ‘collaboration’ has been extended rapidly to describe what people now do with conversational agents, intelligent tutors, adaptive platforms, and generative artificial intelligence (AI) tools in general. This chapter asks what is gained and lost when a demanding concept from the learning sciences is applied so freely. Returning to long-standing accounts of collaborative learning, it reconstructs the requirements that a situation, an interaction, and a set of cognitive processes have historically had to meet before being called collaborative. Human-AI collaboration requires a partly symmetric and negotiated relationship, shared and negotiable goals, a low and shifting division of labour, interactive and synchronous exchange, and mutual modelling, grounding, and socially shared regulation. Reviewing process-sensitive empirical studies of writing and problem solving, the chapter shows that most current human-AI interaction is better described as consultation, governance, delegation, or instruction rather than as collaboration. To make these distinctions functional, the chapter introduces a five-level diagnostic taxonomy of human-AI teaming (i.e. transactional, situational, operational, praxical, and synergistic) defined by the affordances an AI system exhibits. It shows that only the highest level begins to satisfy the conditions the tradition places on collaboration. The chapter derives the functions an AI system must possess for collaboration to be achievable, argues that most of these are present-day engineering choices rather than capabilities to be awaited, and sets out the implications for research, measurement, and responsible practice of human-AI collaboration in education.

[HC-31] The Perils of Agency: How Developers Perceive, Prioritize, and Address Risks in Agentic AI Products

Link: https://arxiv.org/abs/2606.15485
Authors: Hao-Ping Lee,Jessica He,David Piorkowski,Thomas Serban von Davier,Jodi Forlizzi,Sauvik Das
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments:

Abstract:Agentic AI systems act autonomously, use tools, adapt to context, and operate in complex real-world environments. However, these same characteristics can create or exacerbate product risks. We studied how industry developers (n=35) perceive, prioritize, and address the risks in their agentic AI products. We found that developers’ perceptions of risk were closely tied to the qualities that made the product agentic, such as autonomy, tool use, and usage in a real-world context. Developers prioritized product and business risks before considering downstream societal risks like job displacement and end-user privacy. This prioritization also impacted developers’ ability and motivation to mitigate agentic risks. Finally, developers lacked mature controls for containing agentic risks, often relying on constraining the same characteristics that make agents useful: e.g., autonomy and goal complexity. These findings reveal a capability vs. risk control tension in agentic AI development: developers need to address risks that emerge from agentic capabilities, yet they currently have limited support for doing so without constraining agentic functionality.

[HC-32] A Scalability Analysis of Quantitative Confidence Assessment Methods for Assurance Cases

Link: https://arxiv.org/abs/2606.15480
Authors: Simon Diemert,Jens H. Weber
Categories: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC)
Comments: Preprint. Version of Record to appear in SafeCOMP’26 Workshop Proceedings published by Springer

Abstract:This paper proposes a model to estimate the decision complexity and effort required to apply quantitative confidence assessment methods to assurance cases. The model considers both the worst and average case for these measures and characterizes how these quantities scale with argument size. Prior work has indicated that the additional effort required to apply these methods is a barrier to their adoption by assurance case practitioners. Researchers developing new methods, or improving existing methods, can use this model to estimate the effort required to apply their method. The proposed model is parameterized using data from published case studies and is applied to three existing quantitative confidence assessment methods: the Bayesian Belief Network method, the Dempster-Shafer Theory method, and the Certus method. The results show that, while Certus has the highest worst-case decision complexity, its average-case effort is lower than the BBN and DST methods.

[HC-33] “ChatGPT help me draft a breakup text”: The Covert Triad and Articulation Labor in AI-Assisted Romantic Communication

Link: https://arxiv.org/abs/2606.15460
Authors: Skyler Wang,Isabella Luppi
Categories: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
Comments:

Abstract:Generative artificial intelligence (AI) has begun infiltrating the most ordinary domains of romantic life – drafting apologies, softening reproaches, and decoding a partner’s ambiguous messages. While recent scholarship on AI in intimate life has concentrated on chatbot companions, this article shifts the frame to AI as an intermediary in human-to-human romantic communication. Drawing on a multi-modal corpus of vernacular discourse from 2023 to 2026, we contribute two complementary concepts. The covert triad names a structural change: a relationship phenomenally dyadic but operationally triadic, with the third party visible only to the partner who deploys a model. Articulation labor names the mechanism whereby the expressive component of emotional labor – converting felt experience into language that a partner can receive – is increasingly delegated to AI, even as feeling labor remains lodged in the user. Authenticity, under these conditions, is being reconfigured from a property of linguistic authorship to one of emotional ownership, a shift actively contested.

[HC-34] A Bilateral Teleoperation Framework for Dexterous Manipulation

Link: https://arxiv.org/abs/2606.15434
Authors: Stefano Dalla Gasperina,Dong Ho Kang,Haiyun Zhang,Aldo Galvan,Job D. Ramirez,Aaron Kim,Mark Helwig,Kazuto Yokoyama,Takahisa Ueno,Tetsuya Narita,Ann Majewicz-Fey,Ashish D. Deshpande,Luis Sentis
Categories: Robotics (cs.RO); Human-Computer Interaction (cs.HC); Systems and Control (eess.SY)
Comments: 4 pages, 7 figures, 1 appendix

Abstract:Dexterous teleoperation requires precise arm-hand coordination, low-latency feedback, and robust interaction in real-world contact-rich environments. This paper presents a modular bilateral teleoperation framework that integrates operator-side input interfaces with a robot-side dexterous hand and compliant robotic arm in a unified control architecture. The system supports position-based hand retargeting, differential arm control, multi-scale haptic feedback, and shared control for stable manipulation. We validate the framework through a real-world dexterous manipulation task, highlighting coordinated arm-hand control and contact-aware interaction. Beyond feasibility, we identify key design insights related to cross-embodiment mismatch, haptic feedback granularity, and shared control. The proposed platform provides a practical teleoperation system and a foundation for collecting high-quality demonstrations for future learning-from-demonstration research.

[HC-35] Cognitive Trajectory Modeling: Quantifying Human-AI Co-Creation through Cognitively Grounded Interaction Trajectories

Link: https://arxiv.org/abs/2606.15358
Authors: Nicholas Davis
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Co-creative AI research increasingly seeks methods capable of representing how interaction dynamics evolve through time. While many existing approaches focus on observable interaction characteristics, interaction metrics, behavioral coding schemes, or activity traces, these methods often struggle to capture higher-order interaction dynamics, including how collaborative processes reorganize, stabilize, regulate, and evolve through time. This paper introduces Cognitive Trajectory Modeling (CTM) as a cognitive theory of interaction dynamics that conceptualizes cognition, interaction, and creative processes as temporally organized trajectories unfolding across cognitively meaningful attractor landscapes. CTM builds upon the theoretical foundations of the Enactive Model of Creativity and Creative Sense-Making (CSM), revisiting the role of sense-making curves and cognitive trajectories in representing co-creative interaction dynamics. We formalize this perspective through the Cognitive Trajectory Principle, which states that temporal representations are only theoretically interpretable as cognitive trajectories when their underlying states possess directional cognitive meaning. Building on this principle, CTM generalizes the notion of cognitive trajectories beyond any particular coding scheme and provides a broader framework for modeling interaction dynamics through trajectories unfolding across meaningful attractor landscapes. We further distinguish cognitive trajectories from interaction traces and situate CTM within a broader hierarchy of cognitive, interaction, and domain dynamics. More broadly, we argue that understanding co-creative systems requires methods capable of modeling how cognition and interaction dynamics unfold through time. CTM provides a foundation for studying interaction dynamics across co-creative AI and human-AI interaction.

[HC-36] Co-Creating Buildable and Open Social Robot Study Companions with University Students

Link: https://arxiv.org/abs/2606.15239
Authors: Farnaz Baksh,Matevž B. Zorec,Feiazie Baksh,Karl Kruusamäe
Categories: Robotics (cs.RO); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: Accepted for the 18th International Conference on Social Robotics (ICSR + ART 2026), London, UK | 1-4 July 2026

Abstract:Open-source social robots offer accessibility, repairability, and student empowerment, yet the build itself often presents a barrier. Existing platforms either ship pre-assembled, foreclosing hands-on learning, or expose students to unfamiliar fasteners, opaque wiring, and inaccessible service points that erode engagement. Whether targeted mechanical redesign can lower this barrier whilst maintaining structural integrity remains untested. Here we show that Design for Assembly (DfA) and Design for Disassembly (DfD) interventions reshape how a build feels before they shorten how long it takes. Working with university students in Guyana and Estonia, we applied the Double Diamond framework to co-create the Robot Study Companion (RSC) v4.1: mapping pain points, then redesigning its chassis around twist-lock fasteners, snap-fit joints, and tool-free service latches. Across two studies with developers and first-time builders, system usability climbed from Poor to Excellent (SUS 59.4 to 89.4), perceived workload trended downward (NASA-TLX 4.29 to 4.00), and mean assembly time trended downward (21.4 to 13.7 minutes, with juniors’ learning effect), whilst orientation cues and navigation continuity for first-time builders emerged as the next documentation frontier. Perceived workload, not completion time, appears to govern whether students take up open hardware.

[HC-37] City landscape in sight: A crowdsourced framework for unlocking urban-scale window view perceptions from real estate imagery

Link: https://arxiv.org/abs/2606.15198
Authors: Chucai Peng,Sijie Yang,Ang Liu,Yang Xiang,Zhixiang Zhou,Filip Biljecki
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments:

Abstract:City landscapes viewed through home windows influence quality of life, yet perceptions of actual window views at the urban scale remain understudied. This study presents an approach for large-scale mapping of perceptions using 12,334 window view images (WVIs) collected from actual residential properties listed on real estate platforms in Wuhan, China, representing a rarely explored form of urban view imagery that offers advantages over the rendered or simulated window views commonly examined in previous studies. Through a non-immersive virtual reality platform, we collected 27,477 pairwise comparisons across six perceptual dimensions (e.g., Vivid) from 304 participants based on 499 WVIs. A hybrid neural network model was trained to predict human perceptions of all crowdsourced WVIs and map their spatial distribution. Results reveal significant spatial autocorrelation with distinct hot and cold spots across the whole city. Floor level strongly influences human perceptions: while higher floors offer more preferred and extensive window views, lower-floor windows provide residents with quiet and vivid views. An inference model further shows that window view composition matters considerably: high ratios of sky, trees, and low-rise buildings enhance people’s preferences and perceptions of vividness, whereas high ratios of high-rise buildings increase perceptions of monotony and oppression. Importantly, these effects are non-linear: the excessive presence of certain elements can alter their impact on human perception. This work advances urban-scale understanding of residents’ visual experiences and provides evidence-based guidance for human-centric urban planning and real estate to optimise visual landscapes from windows.

[HC-38] Graph of Trace: Visualizing Execution Traces of Scientific Agent

Link: https://arxiv.org/abs/2606.15116
Authors: Tianci Gao,Haoxuan Li,Jianhe Li,Tianxiang Zhao,Runze Shi,Weiran Wang,Zezhao Wu,Lu Mi
Categories: Human-Computer Interaction (cs.HC)
Comments: Accepted to ACL 2026 Demo Track

Abstract:Scientific AI agents can autonomously carry out complex research workflows, yet these unfolded workflows often remain difficult for humans to inspect and review, limiting interpretable, controllable, and effective human-AI collaboration. To address this challenge, we present a monitoring and visualization framework that records fine-grained execution events and organizes them into a directed graph that makes agent workflows explicit as they proceed. The system records intermediate steps (e.g., tool calls and code executions) and renders them as visual traces, updated in real time, that expose workflow structure. This allows users to examine how results are produced, identify where failures emerge, and better understand agent behavior across different stages of the research process. We conduct an evaluation on complex research tasks with domain experts from interdisciplinary backgrounds in AI, neuroscience, and biology. Experts report that structured trace visualization improves understanding of agent workflows, perceived interpretability, and usability for analysis and further interaction.

[HC-39] Sensory Restoration via Brain-Computer Interfaces: A Unified 2 x 2 Framework and Convergence Roadmap

Link: https://arxiv.org/abs/2606.15091
Authors: Xuan-The Tran
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Millions of individuals worldwide suffer from sensory and communication deficits caused by neurodegenerative diseases, stroke, or trauma. Brain-computer interfaces (BCIs) offer a promising avenue for sensory and motor restoration. However, the scientific literature remains highly fragmented between invasive neuroprosthetics and non-invasive electrophysiological decoders, with a lack of consistent terminology and comparison metrics. This chapter proposes a unified 2 x 2 framework categorizing BCIs along two axes: degree of invasiveness (invasive vs. non-invasive) and signal direction (afferent sensory-IN vs. efferent sensory-OUT). We define and distinguish the paradigms of restoration, substitution, and augmentation. Furthermore, we outline a structural roadmap for the convergence of these modalities over near-, medium-, and long-term horizons, focusing on physical limits and the integrative role of machine learning foundation models.

[HC-40] Cloze: An Open Research Platform for Studying Human-AI Conversations in Mental Health Contexts

Link: https://arxiv.org/abs/2606.15033
Authors: Matthew Flathers,Francesco Cipriani,John Torous
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 7 pages, 2 figures. Cloze is released under AGPL-3.0

Abstract:Cloze is an open-source web platform for conducting controlled, monitored studies of human-AI conversation in mental health research contexts. Consumer large language model (LLM) products such as ChatGPT, Claude, and Gemini are built for individual productivity, and offer researchers little experimental control, inconsistent data export, and no shared safety scaffolding that holds across providers. Cloze gives research teams a single environment in which they configure which models participants converse with, how the AI is instructed, how conversations are scheduled over time, and which safety constraints apply unconditionally, while every message is captured with full provenance (model version, prompt configuration, timing). The platform currently supports OpenAI, Anthropic, Google, and locally hosted open-weight models served through Ollama behind a unified interface, and runs in the cloud or fully on premises so that participant data need never leave an institution. Cloze is research infrastructure for building an evidence base on human-AI interaction in mental health contexts. It is not a therapeutic product.

[HC-41] “Stuck in a Spiral”: Shame and Guilt as Social Regulators of AI Use in Computing Education

Link: https://arxiv.org/abs/2606.14920
Authors: Kate Hamilton,Irene Hou,Dev Patel,Sheena Nnam,Hena Patel,Stephen MacNeil
Categories: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments:

Abstract:While prior work has examined patterns of adoption and social norms around AI use, less is known about how emotional factors, such as shame and guilt, shape students’ use of AI tools. We present an interview study with 19 computing students through a functionalist perspective of shame and guilt, which interprets emotions as social signals that regulate behavior. Our findings show that these emotions regulate when and how students make their use visible, as they engage in hiding behaviors and selective disclosure. Students described shaming themselves, their peers, and even faculty for using AI. Shame and guilt often coexist with continued AI use, creating cycles of reduced agency and moral tension rather than promoting behavior change. Students described feeling tensions between their AI use and their identities as competent, hardworking, or ethical computing students. Students also used language and metaphors of addiction to describe their experiences. These results highlight the need to consider the socio-emotional aspects of AI use, which may be influenced by how AI policies are implemented and enforced. We discuss classroom practices that can foster healthy, open discussion and support responsible AI use.

[HC-42] Automated Gaze-based Behavioral Segmentation and Temporal Representation for Bridge Inspection in Unconstrained 3D Environments

Link: https://arxiv.org/abs/2606.14893
Authors: Daniel Jimenez Gil,Haosen Zhang,Zixin Wang,Mohamad Alipour
Categories: Human-Computer Interaction (cs.HC)
Comments: 46 pages, 13 figures

Abstract:Visual bridge inspection is a knowledge-intensive task in which inspectors coordinate visual search, spatial navigation, structural reasoning, and defect identification and documentation. It is a central maintenance task for bridges and a key basis for safety assessments, yet its results are susceptible to individual subjectivity. While eye-tracking-based behavioral studies quantify underlying processes, existing research often imposes restrictive simplifications to reduce environmental complexity, thereby compromising ecological validity. This study proposes an automated data analytics framework for converting multimodal inspection data into an inspection mode time series. Unconstrained 3D gaze, head-movement, drone navigation, and scene geometry data are segmented into temporal windows and classified into three functional modes: global scanning, local inspection, and navigation. The resulting temporal representation enables the extraction of interpretable behavioral descriptors, including transition probabilities, dwell times, transition entropy, fixation measures, and spatial revisit metrics. A feasibility study using a virtual bridge inspection platform demonstrates that the proposed representation captures meaningful differences in inspection strategy and reveals exploratory relationships with inspection performance. This study contributes a framework for human-informed computer-aided infrastructure inspection systems, inspector training, and data-driven assessment of constructed facilities.
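
To illustrate the kind of descriptors the abstract above names (transition probabilities, dwell times, transition entropy), here is a minimal sketch that computes them from a sequence of per-window mode labels. It is not the authors' implementation: the label strings (`scan`, `inspect`, `navigate`), the function names, and the toy sequence are all hypothetical, and the paper's exact definitions may differ.

```python
from collections import Counter, defaultdict
from itertools import groupby
from math import log2


def dwell_lengths(seq):
    """Lengths (in windows) of each uninterrupted stay in a mode."""
    runs = defaultdict(list)
    for mode, run in groupby(seq):
        runs[mode].append(sum(1 for _ in run))
    return dict(runs)


def transition_probabilities(seq):
    """Row-normalized first-order transition probabilities
    between consecutive windows (self-transitions included)."""
    counts = defaultdict(Counter)
    for src, dst in zip(seq, seq[1:]):
        counts[src][dst] += 1
    return {
        src: {dst: n / sum(row.values()) for dst, n in row.items()}
        for src, row in counts.items()
    }


def transition_entropy(probs):
    """Shannon entropy (bits) of each mode's outgoing transitions."""
    return {
        src: -sum(p * log2(p) for p in row.values() if p > 0)
        for src, row in probs.items()
    }


# Hypothetical mode time series, one label per temporal window.
seq = ["scan", "scan", "inspect", "inspect", "inspect",
       "navigate", "scan", "inspect"]

dwell = dwell_lengths(seq)
probs = transition_probabilities(seq)
entropy = transition_entropy(probs)

print(dwell)
print(probs)
print(entropy)
```

A low entropy for a mode means the next mode is highly predictable from it; a high entropy means the inspector's next step from that mode varies more.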

[HC-43] Impedance MPC with Patient-Torque Estimation for Knee Rehabilitation Exoskeletons

Link: https://arxiv.org/abs/2606.13485
Authors: Yongyan Cao,Jinshan Tang
Categories: Systems and Control (eess.SY); Human-Computer Interaction (cs.HC); Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO); Medical Physics (physics.med-ph)
Comments:

Abstract:Knee rehabilitation exoskeletons must enforce a prescribed joint trajectory while remaining safely compliant with involuntary spasm and voluntary patient effort, objectives that are in tension for any fixed-gain impedance controller. We present an Impedance Model Predictive Control framework for knee rehabilitation exoskeletons, demonstrated on a series-elastic-actuator (SEA) platform: an algebraic feedforward reduces the knee dynamics to a constant-coefficient scalar double integrator, and a receding-horizon quadratic program (QP) computes corrective torques while enforcing hard range-of-motion, torque, and velocity limits (ISO 13482). A Kalman disturbance state driven by direct SEA-based torque sensing (the series-elastic spring deflection measured through the elastic element, which is an intrinsic, EMG-free patient-torque estimate rather than a separate load cell) gives a nominal offset-free guarantee and, via its sign and the desired-motion direction, sensorless Assist-as-Needed. The constant state matrix permits offline precomputation of the QP cost inverse, enabling 500 Hz operation with a multi-step horizon. Across seven-controller benchmarks (sinusoidal tracking, isometric hold), the 500 Hz Kalman MPC is offset-free (0.1 mrad RMS, 0.1 mrad steady-state, 0.2 mrad peak under 15 Nm spasm), versus a 515 mrad steady-state offset for classical impedance at the same stiffness, with the direct-measurement channel converging the estimate near-immediately (within a few sampling periods). Without the estimator it realizes a classical impedance (4.8 mrad RMS, 8.3 mrad steady-state). All MPC variants meet the 87 mrad clinical criterion; no classical controller does. The architecture is formulated for the 20 DOF MyoSuite myoLeg via coupling-aware per-joint QPs.

[HC-44] Impedance MPC for Physical Human-Robot Interaction: Predictive Disturbance Rejection with Joint-Limit Safety

Link: https://arxiv.org/abs/2606.08281
Authors: Yongyan Cao,Jinshan Tang
Categories: Robotics (cs.RO); Human-Computer Interaction (cs.HC); Systems and Control (eess.SY); Medical Physics (physics.med-ph)
Comments: 7 pages and 3 figures

Abstract:Physical human-robot interaction (pHRI) demands simultaneous trajectory accuracy and compliant safety under unplanned contact. Classical impedance control incurs a nonzero steady-state position error under sustained human force – the applied force divided by the task stiffness – which integral action reduces only within a narrow stable-gain budget. We present a two-layer Impedance MPC that resolves this tension. Layer 1 analytically cancels gravity, Coriolis, and task-space inertia, reducing the residual plant to a configuration-independent double integrator with a constant state-transition matrix. Layer 2 solves a 30-variable convex QP at 100 Hz, exploiting this constant structure so the free-response matrix is precomputed once; an augmented Kalman filter estimates the persistent disturbance state, giving a formal zero-steady-state-error guarantee. A null-space inverse-barrier potential and a task-space workspace projection enforce joint-limit safety across the tested workspace. On a 7-DOF Franka FR3, Impedance MPC with Kalman augmentation attains sub-0.05 mm steady-state error versus 44.8 mm for classical impedance (an 800-fold reduction) under a sustained 15 N force, sub-millimeter tracking on four 3-D circles, and graceful robustness to measurement noise and inertial mismatch up to 30%.
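
The abstract above states that the steady-state error of classical impedance control equals the applied force divided by the task stiffness. As a rough worked example of that relation, the sketch below back-computes the stiffness implied by the reported 15 N force and 44.8 mm error. The function names are hypothetical, and the stiffness value is an inference from the abstract's numbers under a pure-stiffness assumption, not a figure reported by the authors.

```python
def steady_state_error(force_n: float, stiffness_n_per_m: float) -> float:
    """Steady-state position error in metres for a pure stiffness law: e = F / K."""
    if stiffness_n_per_m <= 0:
        raise ValueError("stiffness must be positive")
    return force_n / stiffness_n_per_m


def implied_stiffness(force_n: float, error_m: float) -> float:
    """Stiffness in N/m implied by an observed steady-state error: K = F / e."""
    if error_m <= 0:
        raise ValueError("error must be positive")
    return force_n / error_m


force = 15.0              # sustained force from the abstract, in newtons
classical_error = 44.8e-3 # classical impedance error from the abstract, in metres
mpc_error_bound = 0.05e-3 # upper bound on the MPC error ("sub-0.05 mm"), in metres

stiffness = implied_stiffness(force, classical_error)
reduction_lower_bound = classical_error / mpc_error_bound

print(f"implied task stiffness: {stiffness:.1f} N/m")
print(f"error reduction factor: at least {reduction_lower_bound:.0f}x")
```

Under this assumption the implied stiffness is roughly 335 N/m, and because the MPC error is only bounded above by 0.05 mm, the ratio is a lower bound that is consistent with the roughly 800-fold reduction quoted in the abstract.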

[HC-45] Integrating Multi-Label Classification and Generative AI for Scalable Analysis of User Feedback

Link: https://arxiv.org/abs/2601.23018
Authors: Sandra Loop,Erik Bertram,Sebastian Juhl,Martin Schrepp
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 8 pages, 2 figures, submitted to Springer Nature

Abstract:In highly competitive software markets, user experience (UX) evaluation is crucial for ensuring software quality and fostering long-term product success. Such UX evaluations typically combine quantitative metrics from standardized questionnaires with qualitative feedback collected through open-ended questions. While open-ended feedback offers valuable insights for improvement and helps explain quantitative results, analyzing large volumes of user comments is challenging and time-consuming. In this paper, we present techniques developed during a long-term UX measurement project at a major software company to efficiently process and interpret extensive volumes of user comments. To provide a high-level overview of the collected comments, we employ a supervised machine learning approach that assigns meaningful, pre-defined topic labels to each comment. Additionally, we demonstrate how generative AI (GenAI) can be leveraged to create concise and informative summaries of user feedback, facilitating effective communication of findings to the organization and especially upper management. Finally, we investigate whether the sentiment expressed in user comments can serve as an indicator for overall product satisfaction. Our results show that sentiment analysis alone does not reliably reflect user satisfaction. Instead, product satisfaction needs to be assessed explicitly in surveys to measure the user’s perception of the product.

[HC-46] Do Large Language Models Have Emotions?

Link: https://arxiv.org/abs/2606.14742
Authors: Amit Goldenberg,James J. Gross
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Do LLMs have emotions? A recent paper from Anthropic reports finding internal representations of emotion concepts in Claude Sonnet 4.5, concluding that the LLM has ‘functional emotions.’ We evaluate this claim against what is known about how emotions actually function in biological systems. We argue that emotions serve two core functions: the context-sensitive interpretation of situations, and the reorganization of processing across multiple systems in response to those interpretations. The Anthropic findings offer partial support for the first function, though the consistent, discrete emotional representations identified in Claude sit uneasily with affective neuroscience findings that human emotion is characterized by variable rather than uniform neural signatures. On the second function, the evidence is mixed: Claude’s representations modulate output without producing the dynamic reorganization of attention, decision speed, and motivational state that defines emotion in biological systems. We close by proposing what it would take for an LLM to have emotions.

Computer Vision

[CV-0] BRDFusion: Physics Meets Generation for Urban Scene Inverse Rendering

Link: https://arxiv.org/abs/2606.17049
Authors: Yi-Ruei Liu,Jie-Ying Lee,Zheng-Hui Huang,Yu-Lun Liu,Chih-Hao Lin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Inverse rendering of urban scenes from captured videos enables numerous applications, including content creation and autonomous driving simulation. Physically-based rendering methods follow and control lighting physics, but suffer from reconstruction and rendering artifacts. While generative models produce realistic videos, they offer limited consistency and controllability. We present BRDFusion, a unified framework that combines two complementary models for inverse and forward rendering. Specifically, BRDFusion recovers explicit, consistent scene properties with physical modeling and alleviates optimization ambiguity with generative priors. During forward rendering, the physical model provides controllable rendering from the scene configuration, and the generative model denoises and fixes artifacts. Therefore, our method produces high-quality videos while allowing precise control, outperforming baselines in real and synthetic scenes. Moreover, BRDFusion supports novel-view relighting, night simulation, and dynamic object insertion/editing. Project page: this https URL

[CV-1] Exact Posterior Score Estimation for Solving Linear Inverse Problems

Link: https://arxiv.org/abs/2606.17048
Authors: Abbas Mammadov,Ozgur Kara,Kaan Oktay,Iskander Azangulov,Adil Kaan Akan,Hyungjin Chung,James Matthew Rehg,Yee Whye Teh
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments:

Abstract:Diffusion and flow-based models learn powerful data priors by training a denoiser to reverse Gaussian corruption. To use this prior to solve a linear inverse problem, one needs to sample from the posterior, but the score that the prior provides is the unconditional score, not the posterior score. Existing methods either steer a fixed pretrained denoiser with approximate measurement-matching corrections, or train a conditional restoration model that abandons the denoising structure of the prior. We derive the exact posterior score in closed form for linear Gaussian inverse problems under general Gaussian interpolants, and show that posterior sampling reduces to a denoising problem at an operator-dependent shifted pivot under an anisotropic noise covariance. We turn this identity into Exact Posterior Score (EPS), a denoising training objective that preserves the input/output structure of standard pretraining and can therefore be trained from scratch or fine-tuned from a pretrained denoiser. At inference, EPS uses the same sampler as the underlying backbone, with no likelihood gradients or projections. We evaluate EPS on five linear inverse problems across FFHQ and ImageNet, where it outperforms training-free and training-based baselines on fidelity, perceptual, and distributional metrics, while using roughly an order of magnitude fewer denoiser evaluations than gradient-based posterior samplers.
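
As background for the gap the abstract describes, the block below writes out the standard Bayes decomposition of the posterior score for clean data. The notation ($A$ for the forward operator, $n$ for measurement noise, $\sigma$ for its standard deviation) is assumed here for illustration and is not taken from the paper. It covers only the noise-free case; the paper's closed form at intermediate noise levels is not reproduced.

```latex
\begin{aligned}
y &= A x + n, \qquad n \sim \mathcal{N}(0, \sigma^2 I), \\
\nabla_x \log p(x \mid y) &= \nabla_x \log p(x) + \nabla_x \log p(y \mid x), \\
\nabla_x \log p(y \mid x) &= \frac{1}{\sigma^2} A^\top (y - A x).
\end{aligned}
```

A pretrained denoiser supplies only the first term on the right of the second line. For a noised sample $x_t$, the likelihood term requires marginalizing over the clean data, which is why training-free methods resort to approximations.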

[CV-2] Geometric Action Model for Robot Policy Learning

Link: https://arxiv.org/abs/2606.17046
Authors: Jisang Han,Seonghu Jeon,Jaewoo Jung,René Zurbrügg,Honggyu An,Tifanny Portela,Marco Hutter,Marc Pollefeys,Seungryong Kim,Sunghwan Hong
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this https URL

Abstract:Generalist robot policies must follow user instructions while reasoning about how objects, cameras, and robot actions interact in the 3D physical world. Recent vision-language-action models (VLAs) and video world-action models (WAMs) inherit strong semantic or temporal priors from large-scale foundation models, but they still operate primarily on 2D image frames or 2D-derived latent spaces, leaving implicit the 3D geometry required for contact-rich manipulation. We propose the Geometric Action Model (GAM), a language-conditioned manipulation policy that directly repurposes a pretrained geometric foundation model (GFM) as a shared substrate for perception, temporal prediction, and action decoding. GAM splits the GFM at an intermediate layer: the shallow layers serve as an observation encoder, and a causal future predictor inserted at the split layer forecasts future latent tokens conditioned on language, proprioception, and action history. The predicted future tokens are then routed through the remaining GFM blocks for feature propagation and decoding, allowing a single backbone to produce both future geometry and actions. This design equips the GFM with language-conditioned temporal world modeling through minimal architectural modification while preserving its rich geometric priors. Across a broad suite of simulation and real-robot manipulation benchmarks, GAM is more accurate, more robust, faster, and lighter than current foundation-model-scale baselines.

[CV-3] R2RDreamer: 3D-aware Data Augmentation for Spatially-generalized 2D Manipulation Policies

Link: https://arxiv.org/abs/2606.17040
Authors: Xiuwei Xu,Haowen Sun,Angyuan Ma,Yiwei Zhang,Zhenyu Wu,Xiaofeng Wang,Bingyao Yu,Zheng Zhu,Jie Zhou,Jiwen Lu
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Spatial generalization is critical for imitation-learned manipulation policies, but achieving it typically requires scaling demonstrations across diverse object poses, robot configurations, and camera viewpoints. Data augmentation from a few source demonstrations offers a practical alternative to costly real-world collection. Simulation-based augmentation can create controllable variation, but requires complex environment and object setup and may introduce a sim-to-real gap. Recent real-to-real methods avoid these issues by jointly editing 3D observations and action trajectories from real demonstrations, yet they still rely on strong 3D scene parsing and geometry completion, and often produce observations tailored to 3D pointcloud policies rather than RGB-based 2D policies. We propose R2RDreamer, a real-to-real demonstration augmentation framework that preserves the geometric consistency of 3D action-observation editing while moving visual completion to 2D video space. Specifically, R2RDreamer first performs lightweight 3D augmentation by editing incomplete object pointclouds and end-effector trajectories in a shared 3D frame; it then projects the edited scene into masked image-space control videos with occlusion-aware reasoning and uses a dense-control image-to-video model to complete temporally coherent RGB observations. Experiments on spatially shifted manipulation tasks with both 2D diffusion-style policies and vision-language-action policies show that R2RDreamer improves spatial generalization from limited source demonstrations, with analyses validating the contributions of 3D editing, occlusion-aware projection, and video completion.

[CV-4] The Importance of Phase in Neural Representations: An Internal Oppenheim-Lim Test of Image Classifiers

Link: https://arxiv.org/abs/2606.17037
Authors: Alper Yıldırım
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Oppenheim and Lim (1981) showed that natural images stay recognizable when reconstructed from their Fourier phase alone, while the magnitude carries little of their identity. We ask whether trained image classifiers reproduce this asymmetry inside their hidden layers, and we test it causally: given two images, we transplant the phase of one onto the magnitude of the other at a chosen layer and record which image the prediction follows. In PRISM2D, GFNet, and ViT-B/16 the prediction follows the phase or sign donor, and deleting all image-specific magnitude barely moves accuracy, so identity rides on phase while image-specific magnitude is largely dispensable to the readout. ResNet-50 at first seems to break the pattern, because transplanting sign after its ReLUs does nothing; a fair intervention before the ReLU reveals a strong latent sign code in the late blocks, and a DC-only control shows the readout consumes a channel-wise spatial average. Controls rule out the trivial case in which magnitude simply stops depending on the image. The architectures therefore share a phase/sign identity code but expose it in different bases, set by rectification and readout geometry, which gives a mechanistic account of the texture–shape gap between CNNs and attention models.
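
The phase-transplant operation behind the classic Oppenheim-Lim experiment can be sketched on a 1D signal. This is a minimal standard-library illustration, not the paper's layer-wise intervention: the function names `dft`, `idft`, and `transplant_phase` are hypothetical, and a naive O(n^2) DFT stands in for an FFT to keep the example dependency-free.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def transplant_phase(magnitude_donor, phase_donor):
    """Combine the Fourier magnitude of one real signal with the phase of another."""
    mag_spec = dft(magnitude_donor)
    phase_spec = dft(phase_donor)
    hybrid = [
        cmath.rect(abs(a), cmath.phase(b)) for a, b in zip(mag_spec, phase_spec)
    ]
    # For real inputs the hybrid spectrum stays conjugate-symmetric,
    # so the imaginary part of the inverse transform is numerical noise.
    return [v.real for v in idft(hybrid)]


x = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, -2.0, 1.5]
y = [0.0, 1.0, 4.0, -3.0, 2.0, 2.5, -1.0, 0.5]
hybrid_signal = transplant_phase(x, y)
```

In the image setting the same recombination is applied to 2D spectra, and the experiment asks which donor the result resembles.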

[CV-5] Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

Link: https://arxiv.org/abs/2606.17030
Authors: Jie Zhang,Xiaoyue Chen,Anzhe Chen,Chenxu Lv,Deqing Li,Gengze Zhou,Hang Yin,Haoqi Yuan,Haoyang Li,Jiahao Li,Jiazhao Zhang,Jingren Zhou,Kaiyuan Gao,Kun Yan,Lihan Jiang,Ningyuan Tang,Pei Lin,Qihang Peng,Shengming Yin,Tianhe Wu,Tianyi Yan,Xiao Xu,Yan Shu,Yanran Zhang,Ye Wang,Yi Wang,Yilei Chen,Yixian Xu,Yiyang Huang,Yuxiang Chen,Zekai Zhang,Zhendong Wang,Zhixing Lei,Zhixuan Liang,Zihao Liu,Zikai Zhou,Xiong-Hui Chen,Chenfei Wu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation provides three promising application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. This is achieved through a three-part design: a) Double-Stream MMDiT with MLLM Action Encoding, where a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M video-text corpus (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive results show strong competitiveness: Qwen-RobotWorld ranks first overall on EWMBench and DreamGen Bench and outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on the RoboTwin-IF benchmark further support robust generalization and multi-view consistency.

[CV-6] MeshLoom: Feed-Forward Non-Rigid Registration of Mesh Sequences

Link: https://arxiv.org/abs/2606.17027
Authors: Jianqi Chen,Jiraphon Yenphraphai,Xiangjun Tang,Sergey Tulyakov,Chaoyang Wang,Peter Wonka,Rameen Abdal
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:We present MeshLoom, a feed-forward registration network that directly reconstructs vertex deformations across mesh sequences. Our approach advances non-rigid registration beyond existing models, which are typically constrained by costly per-instance optimization, narrow object categories, pairwise-only inputs, or merely intermediate outputs. The network is simple and efficient, registering multiple meshes within seconds. At its core lies a topology-aware encoder–decoder design. Specifically, we first introduce a topology-aware point representation that encodes the anchor (reference) mesh’s topology into its per-vertex features. This representation strengthens the network’s understanding of the anchor-mesh geometry and disambiguates points that are Euclidean-close yet geodesically distant. We then propose a multi-modal encoder that fuses this anchor-mesh representation with complementary cues from each frame, such as shape latents and image features. These multi-source signals are compressed into a compact global motion embedding that captures dense inter-frame correspondence. A lightweight decoder then queries this global embedding with the anchor-mesh point representation, retrieving per-vertex deformations at target timestamps. Through extensive experiments across diverse motions and object categories, we show that MeshLoom achieves state-of-the-art results on non-rigid registration. In addition, we find that our global embedding-then-query paradigm naturally enables the network to generate deformations at intermediate timestamps, which extends MeshLoom to motion interpolation and mesh morphing. Project page: this https URL.

[CV-7] FusionRS: A Large-Scale RGB-Infrared Remote Sensing Dataset for Dual-Modal Vision-Language Foundation Models

Link: https://arxiv.org/abs/2606.17020
Authors: Jiaju Han,Ben Zhang,Xuemeng Sun,Qike Zhang,Yuxian Dong,Chengyin Hu,Fengyu Zhang,Yiwei Wei,Jiujiang Guo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Remote sensing vision-language models have advanced Earth observation understanding, but most existing work remains centered on RGB imagery, leaving the complementary information in infrared data underexplored. Infrared images provide distinctive cues, including thermal intensity structures, object boundaries, and illumination-invariant scene features, which can enrich visual-language learning beyond conventional RGB observations. However, a large-scale RGB-infrared-text dataset for remote sensing vision-language modeling is still absent. To address this gap, we introduce FusionRS, the first large-scale RGB-infrared-text dataset designed for dual-modal vision-language learning in remote sensing. FusionRS is constructed by translating diverse public RGB remote sensing images into infrared-style counterparts, forming aligned RGB-IR image pairs. Each pair is associated with conventional scene captions and IR-aware captions that explicitly describe infrared-specific visual properties while preserving semantic content. Based on FusionRS, we train dual-modal vision-language foundation models for RGB-IR joint understanding. We first train CLIP-style models for RGB-IR-text alignment, and then fine-tune generative VLMs for dual-modal RGB-IR captioning. Experiments show that FusionRS improves RGB-IR alignment, infrared-to-text retrieval, and dual-modal captioning over RGB-only and non-IR-aware training settings. Ablation studies further verify that IR-aware captions are crucial for strengthening infrared-language alignment, highlighting the importance of modality-specific textual supervision for more scalable RGB-infrared remote sensing vision-language representation learning.

[CV-8] ActiveSAM: Image-Conditional Class Pruning for Fast and Accurate Open-Vocabulary Segmentation

Link: https://arxiv.org/abs/2606.16996
Authors: Tran Dinh Tien,Zhiqiang Shen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint. Code is available at this https URL

Abstract:Segment Anything Model 3 (SAM 3) provides a strong frozen backbone for concept-prompted segmentation, but applying it directly to open-vocabulary semantic segmentation (OVSS) is inefficient: full-resolution decoding is typically run over the entire dataset vocabulary, whereas each image contains only a small active subset of classes. We introduce ActiveSAM, a training-free, zero-shot inference framework that turns SAM 3 into an active-vocabulary segmenter. ActiveSAM first canonicalizes and expands class prompts, then estimates an image-conditioned active set from a low-resolution presence preview. Only the retained classes are decoded at full resolution, using bucketed prompt multiplexing with the frozen SAM 3 decoder. The preview stage uses only class-presence evidence and skips unnecessary segmentation-head computation, while the final stage applies margin-aware background calibration to suppress low-confidence pixels. ActiveSAM requires no target-dataset training, no weight updates, and no oracle class-presence labels. Across eight OVSS benchmarks, ActiveSAM improves the speed-accuracy tradeoff of training-free open-vocabulary semantic segmentation, outperforming the current state-of-the-art SegEarth-OV3 by approximately +1.4 mIoU on average while running up to 5.5x faster on large-vocabulary datasets. ActiveSAM also demonstrates the strongest robustness under image corruption that simulates real-world distribution shift, making it well-suited for deployment in noisy-input domains such as autonomous driving and embodied AI. Code is available at this https URL.
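
The two-stage idea in the abstract (prune the vocabulary per image, then decode only what remains) can be illustrated with a toy sketch. Everything here is an assumption for illustration: the function names, the fixed score threshold, the `min_keep` fallback, and the fixed-size bucketing are hypothetical and are not taken from the ActiveSAM code or from any SAM 3 API.

```python
def select_active_classes(presence_scores, threshold=0.5, min_keep=1):
    """Keep classes whose preview presence score reaches the threshold.

    presence_scores: dict mapping class name -> score in [0, 1]
    (a stand-in for the low-resolution presence preview).
    At least `min_keep` top-scoring classes are always retained.
    """
    ranked = sorted(presence_scores, key=presence_scores.get, reverse=True)
    active = [name for name in ranked if presence_scores[name] >= threshold]
    if len(active) < min_keep:
        active = ranked[:min_keep]
    return active


def bucket_prompts(classes, bucket_size):
    """Split the retained classes into fixed-size buckets for batched decoding."""
    return [classes[i:i + bucket_size] for i in range(0, len(classes), bucket_size)]


scores = {"road": 0.9, "car": 0.7, "boat": 0.1, "sky": 0.8, "cow": 0.05}
active = select_active_classes(scores, threshold=0.5)
buckets = bucket_prompts(active, bucket_size=2)
```

With a large dataset vocabulary and only a handful of classes present per image, full-resolution decoding runs over a few buckets instead of every class.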

[CV-9] DreamX-World 1.0: A General-Purpose Interactive World Model

Link: https://arxiv.org/abs/2606.16993
Authors: DreamX Team,Yancheng Bai,Rui Chen,Xiangxiang Chu,Rujing Dang,Hao Dou,Bingjie Gao,Qiwen Gu,Siyu Hong,Jiachen Lei,Geng Li,Jifan Li,Ruimin Lin,Qingfeng Shi,Bingze Song,Lei Sun,Jing Tang,Ruitian Tian,Jun Wang,Jiahong Wu,Pengfei Zhang,Shen Zhang,Jiashu Zhu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL, Code: this https URL

Abstract:DreamX-World 1.0 is a general-purpose interactive text/image-to-video world model for controllable long-horizon generation. It supports camera navigation, revisits to previously observed regions, and promptable events across photorealistic, game-style, and stylized domains. Our data engine combines camera-accurate Unreal Engine rendering, action-rich gameplay recordings, and real-world videos with recovered camera geometry. For camera control, we introduce E-PRoPE, a lightweight variant of projective positional encoding that retains PRoPE’s projective camera geometry while applying camera-aware attention to spatially reduced tokens. We convert a bidirectional video generator into a few-step autoregressive world model using causal forcing, DMD-style distillation, and long-rollout training. Training on self-generated long-horizon contexts exposes the model to its own generated history and reduces the style and color drift that accumulates across autoregressive chunks. Memory-Conditioned Scene Persistence retrieves earlier views through camera-geometry-based retrieval, while residual recycling makes the conditioning path less sensitive to imperfect memory latents. Event Instruction Tuning adds composable event control, and reinforcement learning alignment recovers camera control and visual quality after distillation. With mixed-precision DiT execution, residual reuse, 75%-pruned VAE decoding, and asynchronous pipeline parallelism, DreamX-World 1.0 reaches up to 16 FPS on eight RTX 5090 GPUs. On our 5-second basic evaluation, DreamX-World 1.0 achieves a camera-control score of 73.75 and an overall score of 84.76, outperforming HY-WorldPlay 1.5 and LingBot-World in overall score, which achieve 80.79 and 80.45, respectively.

[CV-10] A Multi-Center Benchmark for Abdominal Disease Diagnosis and Report Generation from Non-Contrast CT MICCAI2026

Link: https://arxiv.org/abs/2606.16991
Authors: Mariam Elbakry,Aliaa Sayed Sheha,Salma Hassan Tantawy,Aya Yassin,Concetto Spampinato,Karim Lekadir,Xiaomeng Li,Marawan Elbatel
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Early Accept (top ~9%), MICCAI 2026

Abstract:Multiphasic contrast-enhanced CT (CECT) is widely used for abdominal lesion characterization, yet it carries inherent risks of contrast-induced nephropathy, escalates acquisition burden, and heavily contributes to radiologist workload. To address these challenges, we introduce a novel multi-center benchmark for multi-organ abdominal disease diagnosis and automated radiology report generation, which learns to synthesize contrast-enhanced findings from single-phase non-contrast CT (NCCT). To support this, we curated a large-scale dataset of paired NCCT-CECT studies and their corresponding contrast-enhanced radiology reports from two centers, partitioned into internal sets and an external validation cohort. Under a unified evaluation protocol, we benchmarked five contemporary deep learning architectures encompassing chest-specific, abdomen-specific, and general-purpose multimodal domains. Extensive experiments demonstrate that NCCT retains diagnostic signals, achieving an average multi-organ AUC of 69.1% on the internal cohort and 63.1% on the external cohort. By releasing this dataset and standardized benchmark publicly, this study aims to catalyze future research into safer, resource-efficient, and globally accessible contrast-free abdominal imaging workflows. Code is available at: this https URL.

[CV-11] SurroundNEXO: Ego-Centric Metric Bridging for Spatially Consistent Geometry in Autonomous Driving

Link: https://arxiv.org/abs/2606.16960
Authors: Shuai Yuan,Runxi Tang,Yuzhou Ji,Fudong Ge,Hanshi Wang,Yifei Wang,Xianming Zeng,Jianyun Xu,Xingliang Liu,Yanfeng Wang,Zhipeng Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Modern autonomous driving depends on accurate metric 3D understanding for perception, reconstruction, and planning, which in turn requires reliable multi-camera depth prediction. However, the outward-facing nature of vehicle-mounted surround-view camera rigs inherently limits visual overlap across views, challenging the correspondence-based assumptions that underpin conventional multi-view geometry. To bridge this gap, we present SurroundNEXO, named after the Spanish word nexo for a geometric link, a low-overlap multi-camera metric depth framework that grounds cross-view reasoning in ego-centric geometry rather than dense visual correspondences. Instead of directly enforcing early global fusion, SurroundNEXO first assigns image tokens globally comparable ego-frame viewing directions through Ego-Ray Positional Encoding, then uses sparse LiDAR measurements as metric anchors to propagate absolute scale cues, and finally expands feature interaction progressively from view-local modeling to decomposed spatio-temporal reasoning and global integration. This design enables metric-scale depth prediction with improved spatial consistency across weakly overlapping cameras. Across low-overlap autonomous driving benchmarks, including NuScenes, Waymo and DDAD, SurroundNEXO reduces single-view error by 33.2%, improves cross-view consistency by 10.5%, and enhances metric reconstruction quality by 25.6% compared with SOTA methods. It further remains robust under extremely sparse depth prompts and exhibits strong zero-shot generalization to unseen camera layouts.
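
To make "globally comparable ego-frame viewing directions" concrete, the sketch below back-projects a pixel through a pinhole camera and rotates the ray into the vehicle frame. It only shows how such a direction can be computed; how the paper turns directions into a positional encoding is not covered. The function name, the intrinsics values, and the frame conventions (camera: x right, y down, z forward; ego: x forward, y left, z up) are assumptions for illustration.

```python
import math


def pixel_to_ego_ray(u, v, fx, fy, cx, cy, cam_to_ego_rotation):
    """Unit viewing direction of pixel (u, v), expressed in the ego frame."""
    # Back-project the pixel to a ray in the camera frame (pinhole model).
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate into the ego frame; translation does not affect a direction.
    d_ego = [sum(row[i] * d_cam[i] for i in range(3)) for row in cam_to_ego_rotation]
    norm = math.sqrt(sum(c * c for c in d_ego))
    return tuple(c / norm for c in d_ego)


# Hypothetical front camera: camera z (forward) maps to ego x (forward).
FRONT_CAM_TO_EGO = [
    [0.0, 0.0, 1.0],
    [-1.0, 0.0, 0.0],
    [0.0, -1.0, 0.0],
]

center_ray = pixel_to_ego_ray(640, 360, 1000.0, 1000.0, 640.0, 360.0, FRONT_CAM_TO_EGO)
right_ray = pixel_to_ego_ray(1640, 360, 1000.0, 1000.0, 640.0, 360.0, FRONT_CAM_TO_EGO)
```

Because every camera's rays land in the same ego frame, tokens from views that share no pixels can still be compared by direction.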

[CV-12] Simulation-Based Multi-Fillet Evaluation of Woody Breast Poultry Fillets

Link: https://arxiv.org/abs/2606.16951
Authors: Chirantan Sen Mukherjee,Seung-Chul Yoon,William J. Beksi
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: To be published in the 2026 International Conference on Automation Science and Engineering (CASE)

Abstract:Woody breast (WB) is a myopathy in modern broiler chickens that causes the breast muscle to become unusually stiff and fibrous, leading to decreased meat quality and significant economic losses. State-of-the-art automated WB detection relies on a side-view imaging system to analyze the bending behavior of a single fillet as it falls off a conveyor belt. While highly accurate, this approach is constrained by its single-fillet field of view, creating throughput bottlenecks on commercial processing lines. In this paper, we address this limitation via a novel multi-fillet detection architecture utilizing a top-down camera configuration. To validate our approach, we first develop a high-fidelity digital twin of an industrial conveyor system. Next, we synthesize a diverse dataset of 3D fillet meshes and model their viscoelastic bending dynamics using a physics-based simulation engine. Lastly, a continuous 2D shape deformation score is extracted from the top-down perspective as the simulated fillets traverse the roller precipice. Experimental results demonstrate that the top-down shape score effectively captures the contour changes of the fillets as they bend, providing a robust and scalable alternative to a side-view imaging system for simultaneous multi-fillet WB evaluation.

[CV-13] Semantic Flip: Synthetic OOD Generation for Robust Refusal in Embodied Question Answering and Spatial Localization

Link: https://arxiv.org/abs/2606.16898
Authors: Dongbin Na,Chanwoo Kim,Giyun Choi,Dooyoung Hong
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages, 3 figures. Code and data: this https URL; project page: this https URL

Abstract:Detecting unanswerable user queries remains essential for the reliable deployment of real-world embodied agents. However, modern vision-language models (VLMs) often generate overly confident answers even when the available visual memory cannot support the query. Such overconfidence poses various task-dependent risks. In Embodied Question Answering, the agent may provide misleading information to the user; in spatial reasoning for navigation, it may select an arbitrary coordinate and physically guide the user there. Despite these high stakes, only a few prior studies directly address when and how an embodied VLM should respond with “I do not know.” This work proposes Semantic Flip, a simple yet effective framework that synthesizes auxiliary out-of-distribution (OOD) samples for embodied refusal without requiring external OOD annotations. The key idea is to independently transform the query and video memory to construct auxiliary OOD pairs that lack sufficient visual grounding. These synthesized pairs enable training a lightweight rejection module on top of a frozen pretrained VLM. The module attaches to any existing VLM-based pipeline without retraining the underlying model. Across two complementary benchmarks, Semantic Flip consistently outperforms strong prompting baselines. This work also introduces SpaceReject, a new refusal benchmark for spatial localization with deliberately unanswerable queries over long video memory, where Semantic Flip achieves an F1 score of 0.9559. The source code and datasets are publicly available at this https URL.

[CV-14] Latent Space Reinforcement Learning for Inverse Material Estimation in Food Fracture Simulation CVPR

Link: https://arxiv.org/abs/2606.16870
Authors: Adrian Ramlal,Yuhao Chen,John S. Zelek
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026 MetaFood Workshop

Abstract:Realistic visual simulation of food manipulation requires accurate material parameters, yet these are difficult to measure directly and vary across the heterogeneous regions of a single food item. We address the inverse problem of estimating material parameters from a target description of fracture behavior in a non-differentiable continuum damage mechanics simulator. Using orange peeling as a test case, we train a neural surrogate on 2,000 forward simulations and compare Covariance Matrix Adaptation Evolution Strategy (CMA-ES, a gradient-free evolutionary optimizer) with Proximal Policy Optimization (PPO, a reinforcement learning algorithm) across the original 9-dimensional parameter space and two learned 4-dimensional latent representations. Since different oranges have different material properties, a practical inverse system must handle arbitrary targets without retraining. We train a goal-conditioned PPO policy that learns a general inverse mapping: given any target description of peeling behavior, the policy produces a material parameter estimate in a single forward pass (8 surrogate evaluations, approximately 10ms). Operating in a normalizing flow latent space with a shared surrogate evaluator, the goal-conditioned policy achieves 0.642 actual recovery when validated through the simulator, outperforming the original parameter space by 23%. A warm-start extension that initializes CMA-ES refinement from the policy’s output further improves recovery to 0.828 with 540 evaluations. These findings provide a practical framework for inverse food physics and lay groundwork for vision-driven material identification from video observations of food manipulation.

[CV-15] Federated Medical Image Segmentation under Real-World Label Noise: A Benchmark Suite for Noisy Label Learning Method Selection

Link: https://arxiv.org/abs/2606.16868
Authors: Markus Bujotzek,Dimitrios Bounias,Stefan Denner,Ralf Floca,Maximilian Fischer,Peter Neher,Klaus Maier-Hein
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:While federated learning (FL) enables collaborative medical image segmentation without centralizing sensitive data, real-world deployment is frequently complicated by cross-site label imperfections such as contour disagreement, missing or additional structures, and confused labels. Federated noisy label learning (FNLL) aims to mitigate these effects, yet remains underused in practice as existing evidence is largely based on synthetic noise, simplified settings, and limited real-world noisy evaluation. We address this gap by introducing a benchmark suite that combines diverse real-world noisy datasets, deployment-relevant client-noise scenarios, and label-noise-targeted evaluation to support systematic FNLL assessment and informed method selection. The suite combines curated real-world noisy medical image segmentation datasets from diverse sources with a comprehensive federated segmentation framework including various client-noise scenarios and noise-targeted evaluation. The presented suite provides a realistic and discriminative basis for FNLL evaluation in medical image segmentation and establishes a reusable foundation for fair benchmarking, dataset-specific label-noise characterization, and future method development under realistic federated settings. Code is available at this https URL.

[CV-16] Redirecting the Flow: Image Customization through Attention Distribution Shift

Link: https://arxiv.org/abs/2606.16866
Authors: Jie Li,Suorong Yang,Jian Zhao,Furao Shen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Subject-driven image customization aims to generate images that not only follow textual instructions but also preserve the identity of a given reference subject. Existing approaches, including test-time fine-tuning, encoder-based methods, and token competition in shared attention spaces, suffer from limited efficiency, misalignment between extracted reference features and the generative process, and interference from irrelevant information. To address these limitations, we formulate the customization task as a distribution shift induced by incorporating reference images into text-to-image generation, and derive a Conditional Attention Distribution Shift formulation grounded in maximum entropy theory. Building on this formulation, we propose CustomShift, a dual-branch architecture based on Stable Diffusion 3. The Reference-Alignment Branch leverages self-attention between reference images and subject names to achieve layer-wise alignment with latent representations, while the Cross-Guidance Branch integrates textual and reference cues to guide generation. Experiments on the DreamBooth and Custom101 benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches, achieving a better balance between semantic fidelity and subject consistency.

[CV-17] An Open-Source Monitoring Framework for Data Exploration and Progress Tracking in Multi-Center Radiology Studies

Link: https://arxiv.org/abs/2606.16861
Authors: Markus Bujotzek,Jonas Scherer,Stefan Denner,Peter Neher,Benjamin Hamm,Lorenz Feineis,Uenal Akuenal,Andreas Bucher,Tobias Penzkofer,Klaus Maier-Hein
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multi-center studies are crucial for advancing medical and radiological research. Data exploration, collaboration discovery, and study progress monitoring are essential for maximizing their potential. However, in practice these processes often rely on manual communication and shared tables, which quickly become outdated and hinder efficient coordination in large distributed studies. This highlights the need for dedicated monitoring solutions that provide transparent and up-to-date insights into study progress. We propose a lightweight, open-source monitoring architecture for multi-center studies based on the widely used Grafana-Prometheus stack. The framework collects aggregated monitoring metrics from distributed study sites and visualizes them through configurable dashboards. As a real-world deployment example, the framework is integrated into the medical imaging platform Kaapana and evaluated within a large multi-center research network. By deploying our solution within the Germany-wide RACOON consortium, we demonstrate its ability to enable privacy-preserving data exploration and study progress monitoring across all 38 German university clinics. The monitoring framework supports transparent coordination of distributed research activities and can facilitate more efficient management of large-scale multi-center studies. The source code and Kaapana integration are publicly available at this https URL.

[CV-18] Robust Spoofed Speech Detection via Temporal Pyramid Modeling

Link: https://arxiv.org/abs/2606.16837
Authors: Mahtab Masoudi Nezhad,Nima Karimian
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments:

Abstract:Spoofed speech detection is increasingly challenged by realistic synthesis, voice conversion, and replay attacks, with cross-dataset generalization remaining a major limitation. In this work, we propose a Temporal Pyramid Adapter that utilizes parallel temporal convolutions with varying receptive fields to capture multi-scale spoofing cues, ranging from local artifacts to global prosodic irregularities. We also integrate self-supervised XLS-R representations with front-end adapters, including Mel, Sinc, and a Temporal Pyramid design for multi-scale temporal modeling. The proposed model is evaluated across multiple benchmarks, including ASVspoof 2017, ASVspoof 2021 (DF/LA), PartialSpoof, DiffSSD, and multilingual HQ-MPSD datasets. Experimental results demonstrate that the Temporal Pyramid model obtains an AUC of 99.24% and an EER of 3.87% on the PartialSpoof database, significantly outperforming the base model and several SOTA baselines such as LCNN-BLSTM (9.87% EER) and TRACE (8.08% EER). Additionally, multilingual evaluations confirm that spoofing artifacts are independent of language. While self-supervised representations improve robustness, performance degrades under domain and language shifts, highlighting the need for better adaptation and calibration strategies.

[CV-19] Decoupling Semantics from Distortions: Multi-Scale Two-Stream Vision-Language Alignment for AI-Generated Image Quality Assessment ICME2026

Link: https://arxiv.org/abs/2606.16799
Authors: Zijie Meng
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 2 figures. Accepted by ICME2026 (spotlight)

Abstract:Existing vision-language model (VLM)-based AI-generated image quality assessment (AIGIQA) methods suffer from a fundamental semantic-distortion dimensional conflict: monolithic representations optimized for semantic discrimination inherently entangle compositional understanding with low-level perceptual sensitivity, rendering them blind to fine-grained quality degradations. We introduce MST-CLIPIQA, a multi-scale two-stream framework that achieves hierarchical vision-language alignment through explicit representational decoupling. Our architecture leverages dual CLIP encoders with complementary patch granularities: coarse-grained streams capture global semantic coherence while fine-grained streams preserve textural signatures and artifact patterns. An information bottleneck-inspired gated fusion mechanism performs adaptive cross-scale distillation, with optional cross-attention enabling prompt-anchored correspondence evaluation when generation prompts are available. Extensive experiments across five benchmarks establish new state-of-the-art results, achieving average improvements of 1.11 percent SRCC on quality and 2.35 percent SRCC on text-image correspondence prediction, while maintaining efficiency with only 0.8M trainable parameters. Our project is available at this https URL.

[CV-20] WaveDINO: Learning-Based Atmospheric Correction of Unwrapped InSAR Interferograms Validated by GNSS: Results at Laguna del Maule and Campi Flegrei Volcanoes

Link: https://arxiv.org/abs/2606.16795
Authors: Robert Popescu,Juliet Biggs,Tianyuan Zhu,Nantheera Anantrasirichai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 6 figures

Abstract:Interferometric Synthetic Aperture Radar (InSAR) enables effective monitoring of volcanic deformation; however, the observed signals are often corrupted by atmospheric phase delays, seasonal surface changes, and decorrelation effects. Existing atmospheric correction methods, such as numerical weather model-based methods, can reduce these effects but do not consistently remove atmospheric artefacts and may introduce residual biases. To address these limitations, we propose a novel learning-based method for denoising unwrapped InSAR interferograms, using a hybrid training strategy that combines physically motivated synthetic deformation with real atmospheric noise. Specifically, we introduce WaveDINO, a wavelet-based multi-scale denoising framework conditioned on frozen DINOv3 foundation-model features and terrain information. Training uses synthetic magma-source deformation superimposed on short-term interferograms to expose the network to realistic atmospheric statistics while retaining known ground truth. Performance is evaluated on both controlled synthetic data and long-term real interferograms from Laguna del Maule (Chile) and Campi Flegrei (Italy), with independent GNSS measurements used for validation. WaveDINO consistently outperforms competing models, improving agreement with GNSS measurements, and reducing mean GNSS misfit by approximately 3% and 19% at two sites, respectively, while surpassing weather-model-based corrections.

[CV-21] LLM-Based Visual Explanation Evaluation Framework for Assessing the Explainability of Facial Skin Disease Classification Models

Link: https://arxiv.org/abs/2606.16794
Authors: Gyuyeon Na
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This study proposes a domain-specific LLM-based Visual Explanation Evaluation Framework for assessing Grad-CAM explanations in facial skin disease diagnosis models. While previous studies have primarily focused on improving classification performance through data augmentation techniques, relatively few studies have systematically examined whether model explanations are grounded in clinically relevant lesion regions. In this study, geometric augmentation, color-based augmentation, and mixed augmentation strategies were applied to facial skin disease classification models based on EfficientNet-B0, MobileNetV3, and ResNet18. Grad-CAM was employed to generate visual explanations representing the models’ decision-making processes. Furthermore, an LLM-as-a-Judge evaluation framework was designed using GPT-5.5, Gemini 3.5 Flash, and Claude Sonnet 4.6 to assess Grad-CAM explanations from the perspectives of lesion localization and explanation trustworthiness. To improve evaluation consistency and clinical grounding, a progressive prompt engineering strategy was introduced, incorporating evaluation rubrics, clinical knowledge, penalty rules, and structured output formats.

[CV-22] Gen-VCoT: Generative Visual Chain-of-Thought Reasoning via Diffusion-Based RGB Intermediate Representations

Link: https://arxiv.org/abs/2606.16783
Authors: Zhiqiang Zhou,Junliang Dai,Xu ling
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 5 figures

Abstract:Multimodal large language models (MLLMs) excel at visual reasoning but rely on text-based chain-of-thought (CoT), lacking interpretable visual intermediates. Existing methods use opaque tokens or external tools, missing key properties. We propose Gen-VCoT, a framework using expert vision models to generate RGB images as reasoning intermediates. It has three stages: visual grounding (SAM segmentation), geometric reasoning (Marigold depth maps), and semantic reasoning (Qwen2-VL integration). An adaptive router selects reasoning depth. Evaluations show Gen-VCoT improves spatial (25% better) and depth (50% better) questions, but may hurt simple factual queries. Text CoT outperforms visual intermediates on CLEVR (91.2% vs 62.5%), showing task-dependent optimal representations. Gen-VCoT establishes a new paradigm for interpretable multimodal reasoning.

[CV-23] Text-Vision Co-Instructed Image Editing

Link: https://arxiv.org/abs/2606.16767
Authors: Chenxi Xie,Yuhui Wu,Qiaosi Yi,Lei Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Existing image editing methods can be generally categorized into textual instruction-based and visual prompt-based ones. Textual instructions are semantically expressive, but are limited by the coarse granularity of spatial control of the editing results. In contrast, visual prompts such as drag and point can provide precise spatial guidance, but are limited by the inherent ambiguity in semantic intent. To unify the strength of textual and visual prompts, we present Text-Vision Co-Instructed Image Editing, which jointly models textual instructions as semantic intent and sparse visual instructions as spatial guidance, aiming to achieve precise and intent-faithful image manipulation. To this end, we first construct a textual-visual instruction paired dataset with more than 23K samples derived from dynamic videos, enabling aligned supervision for cross-modal instruction. We then propose TV-Edit, a Textual-Visual instruction unified Editing framework to contextualize drag or point-based visual instructions with image-text semantics and lift them into semantic-aware control representations for pretrained editing backbones. By integrating semantic intent and spatial constraints, TV-Edit leads to more precise spatial control, less instruction ambiguity, and stronger structural consistency than text-only or drag-based alternatives. Finally, we establish TV-Edit-Bench, a deliberately designed benchmark to evaluate semantic faithfulness, spatial alignment, and visual consistency with ground-truth references and controlled textual-visual variations for reliable assessment. Our experiments across multiple editing backbones demonstrate that TV-Edit consistently yields more precise and intent-faithful edits, significantly outperforming state-of-the-art instruction-based and drag-based baselines.

[CV-24] 3D Classification of Paramagnetic Rim Lesions in Multiple Sclerosis via Asymmetric QSM-FLAIR Modeling MICCAI2026

Link: https://arxiv.org/abs/2606.16756
Authors: Veronica Pignedoli,Giacomo Boffa,Nicoletta Noceti,Matilde Inglese,Francesca Odone,Matteo Moro
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 3 figures, accepted at MICCAI 2026. Github link: this https URL

Abstract:Paramagnetic rim lesions (Rim$^+$) identified on susceptibility-sensitive MRI have recently emerged as a specific biomarker of chronic active inflammation in Multiple Sclerosis (MS) and are associated with long-term disability progression. However, susceptibility imaging and expert interpretation remain limited to specialized centers, visual assessment is time-consuming and variable, and the low prevalence of Rim$^+$ lesions poses severe class imbalance challenges for automated analysis. We propose a 3D multimodal deep learning framework for lesion-level Rim$^+$/Rim$^-$ classification from Quantitative Susceptibility Mapping (QSM) and FLAIR MRI. The architecture explicitly models modality asymmetry by treating QSM as the primary susceptibility-driven signal and conditioning it with FLAIR-derived structural context. To improve robustness under limited data, we employ self-supervised multimodal pretraining followed by supervised fine-tuning with contrastive regularization. The method was evaluated on a clinically acquired cohort of 88 people with MS with expert lesion annotations as reference standard. Results highlight improved performance compared to prior architectures, supporting the effectiveness of asymmetric multimodal modeling for automated chronic active lesion identification.

[CV-25] Structure-aware Knowledge-guided Heterogeneous Mamba for Zygomaticomaxillary Suture Assessment

Link: https://arxiv.org/abs/2606.16749
Authors: Xiaoqi Guo,Birui Chen,Xinquan Yang,Chaoyun Zhang,Xuefen Liu,Mianjie Zheng,Kun Tang,Xuguang Li,Wen Ma,Yanhua Xu,Linlin Shen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The Zygomaticomaxillary Suture is a key circummaxillary structure that connects the zygomatic bone and the maxilla, which serves as a primary site of resistance during maxillary advancement, and its maturation status directly influences the timing and efficacy of orthopedic interventions. However, accurate staging of ZMS maturation remains challenging due to subtle high-frequency transitions in suture lines and the global semantic ambiguity between adjacent stages. To address this, we present the first public ZMS dataset, comprising 3,790 ZMS images covering the entire age range from 4 to 24 years. Based on this dataset, we propose SKMamba, a Structure-aware and Knowledge-guided Mamba-based multi-modal framework for automated ZMS maturation assessment. SKMamba adopts a decoupled dual-path architecture that mimics the hierarchical diagnostic process used by experienced orthodontists. We first introduce an Implicit Edge Extractor (IEE), which leverages structural pre-training to reduce trabecular noise and accentuate sutural boundaries. Complementarily, a Cross-Modal Semantic Alignment (CSA) module is designed to incorporate anatomical descriptions from a large language model (LLM). This module helps align local morphological cues with global semantic descriptions while ensuring that objective morphological evidence remains the primary basis for decisions. Extensive experiments on our ZMS dataset demonstrate that SKMamba achieves state-of-the-art performance compared to existing methods. Code is available at this https URL.

[CV-26] Revealing Artifacts via Noise Amplification: A Novel Perspective for AI-Generated Video Detection

Link: https://arxiv.org/abs/2606.16742
Authors: Renxi Cheng,Jie Gui,Hongsong Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 13 pages, 5 figures

Abstract:With the rapid advancement of video generation models, distinguishing between AI-generated and authentic videos has emerged as a challenging endeavor. The majority of existing research endeavors concentrate on the development of detectors for identifying samples generated by generative adversarial networks. Nevertheless, the detection of AI-generated videos, particularly those produced by text-to-video models, still remains an uncharted territory. Although state-of-the-art text-to-video models can generate realistic visual content similar to real videos, they fall short of generating the details of the images and the changes in details within the videos. Inspired by this, we address AI-generated video detection from a novel perspective of bit-planes, which can effectively describe the details or noises in images or videos. To this end, we propose a simple yet effective approach called Noise Amplification. This approach first extracts noise signals based on bit-planes, then amplifies these noise signals, and finally feeds them into the discriminator networks for video fake classification. Noise amplification is comprehensively constructed by incorporating three aspects: pixel-level intensity enhancement, region-level spatial amplification, and frame-level temporal aggregation. To evaluate methods of AI-generated video detection in challenging scenarios, we also introduce a benchmark named HardGVD. Extensive experiments on both the large-scale dataset GenVidBench and HardGVD show that our simple approach significantly outperforms state-of-the-art methods.
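
The bit-plane idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it only shows pixel-level extraction of the least-significant bit-planes of an 8-bit grayscale frame and a simple rescaling to full range. The function names `low_bit_planes` and `amplify` and the choice of two planes are assumptions made for illustration; the region-level and frame-level stages are omitted.

```python
def low_bit_planes(frame, num_planes=2):
    """Keep only the `num_planes` least-significant bits of each 8-bit pixel."""
    mask = (1 << num_planes) - 1
    return [[pixel & mask for pixel in row] for row in frame]


def amplify(noise, num_planes=2):
    """Stretch the low-bit residual to the 0..255 range (hypothetical pixel-level step)."""
    max_value = (1 << num_planes) - 1
    return [[value * 255 // max_value for value in row] for row in noise]


# A tiny 2x3 grayscale frame with nearly uniform intensity.
frame = [[200, 201, 203], [198, 202, 200]]
noise = low_bit_planes(frame)
amplified = amplify(noise)
print(noise)      # low-bit residual per pixel
print(amplified)  # residual stretched to full range
```

Small intensity differences that are nearly invisible in the original frame become large differences after rescaling, which is the intuition behind feeding amplified noise to a classifier.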

[CV-27] PATCH: Action-Chunk-Conditioned Latent Patch Innovation Monitoring for Robot Manipulation

Link: https://arxiv.org/abs/2606.16690
Authors: Yanan Zhou,Ranpeng Qiu,Yincong Chen,Jiajie Cui,Weiming Zhi
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Learning-based manipulation policies have made substantial progress in real-world robot manipulation, particularly for short-horizon action generation. However, deployment in open workspaces remains fragile under unexpected local scene dynamics, such as moving objects, transient occlusions, or disturbances near the intended motion. Existing runtime monitors often rely on global observation anomalies, policy uncertainty, or frame-level visual changes, and struggle to distinguish task-relevant execution risk from benign visual variation. We introduce PATCH, an action-chunk-conditioned latent patch innovation monitor for deployment-time intervention. Given the active action chunk, PATCH defines a projected execution corridor, predicts latent patch evolution inside it, and accumulates persistent residuals unexplained by the robot’s own motion. These residuals form a localized intervention signal that allows PATCH-Router to pause execution, select an available recovery source, and resume the original policy once localized innovation subsides. Experiments on real robot rollout data show that PATCH produces more stable and context-relevant triggers than competing runtime monitors. Real-robot deployment further demonstrates monitor-driven intervention and policy resumption for disturbance-aware manipulation. Project Page: this https URL.

[CV-28] MMDiff: Extending Diffusion Transformers for Multi-Modal Generation

Link: https://arxiv.org/abs/2606.16673
Authors: Yagmur Akarken,Orest Kupyn,Christian Rupprecht
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion transformers have demonstrated remarkable generative capabilities, yet the rich perceptual representations computed across their denoising trajectory are discarded once the content is rendered. We present MMDiff, a framework that transforms a frozen diffusion transformer into a multi-modal generative system that jointly produces images alongside any combination of dense perceptual modalities using lightweight decoder heads. Our central finding is that perceptual information is temporally distributed along the denoising trajectory, and that multi-timestep feature fusion with spatially varying aggregation weights is essential, improving semantic segmentation results by up to 28.7% mIoU over single-timestep extraction. We further adopt concept-driven attention extraction for interpretable spatial guidance, and show that frozen diffusion features are competitive with and complementary to state-of-the-art encoders such as DINOv3. By training only lightweight decoder heads on a frozen backbone, we achieve strong performance in semantic segmentation, salient object detection, and depth estimation, and demonstrate that this framework enables effective synthetic data generation at scale.
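
One way to read "multi-timestep feature fusion with spatially varying aggregation weights" is a per-pixel softmax over timesteps. The sketch below assumes that reading; the abstract does not specify the actual mechanism. The name `fuse_timesteps`, the scalar features, and the logits are all hypothetical.

```python
import math


def fuse_timesteps(features, logits):
    """Fuse features[t][p] over timesteps t with a softmax of logits[t][p] at each pixel p."""
    num_steps, num_pixels = len(features), len(features[0])
    fused = []
    for p in range(num_pixels):
        column = [logits[t][p] for t in range(num_steps)]
        peak = max(column)  # subtract the max for numerical stability
        exps = [math.exp(value - peak) for value in column]
        total = sum(exps)
        fused.append(sum(exps[t] / total * features[t][p] for t in range(num_steps)))
    return fused


# Two timesteps, two pixels: pixel 0 weights both steps equally,
# pixel 1 relies almost entirely on the second step.
features = [[1.0, 1.0], [3.0, 3.0]]
logits = [[0.0, 0.0], [0.0, 50.0]]
fused = fuse_timesteps(features, logits)
print(fused)
```

Because the weights are computed separately at each pixel, different image regions can draw on different parts of the denoising trajectory.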

[CV-29] Sinkhorn-CPD: Robust point cloud registration via unbalanced entropic optimal transport

Link: https://arxiv.org/abs/2606.16672
Authors: Jin Zhang,Mingyang Zhao,Bing Liu,Xin Jiang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 10 figures; journal version published in Computer-Aided Design

Abstract:Coherent Point Drift (CPD) is widely used for rigid point cloud registration because of its soft correspondences and closed-form parameter updates. However, CPD’s target-side marginal constraint forces every observation, including outliers, to receive exactly unit probability mass. This assumption degrades registration accuracy under heavy outliers and partial overlap. Optimal transport (OT) methods can handle missing mass through unbalanced formulations, but require hand-tuned annealing schedules. In this paper, we propose Sinkhorn-CPD, which replaces CPD’s target-side marginal constraint with dual Kullback-Leibler penalties, allowing the algorithm to discard outliers on both sides. The resulting formulation is a fully unbalanced entropic optimal transport problem, which can be efficiently solved by generalized Sinkhorn iterations. Moreover, Sinkhorn-CPD preserves the closed-form Procrustes and variance updates of CPD. In our method, the variance $\sigma^2$ plays the role of the entropic regularization parameter, which induces an automatic annealing schedule from diffuse to sharp correspondences without manual temperature tuning. Experiments on synthetic, cross-category, and scan-to-CAD benchmarks show that Sinkhorn-CPD achieves state-of-the-art accuracy, with strong robustness to outliers and partial overlap.
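
For readers unfamiliar with generalized Sinkhorn iterations, the following is a minimal sketch of the standard scaling updates for unbalanced entropic optimal transport with KL penalties on both marginals. It is a generic textbook form, not the Sinkhorn-CPD code: the Procrustes and variance updates are omitted, and the names `unbalanced_sinkhorn`, `eps`, and `rho` are assumptions (in the paper's setting `eps` would correspond to the variance).

```python
import math


def unbalanced_sinkhorn(cost, a, b, eps=0.5, rho=1.0, iters=200):
    """Return a transport plan for cost[i][j] with soft (KL-penalized) marginals a and b."""
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    power = rho / (rho + eps)  # the exponent that softens the marginal constraints
    for _ in range(iters):
        for i in range(n):
            kv = sum(kernel[i][j] * v[j] for j in range(m))
            u[i] = (a[i] / kv) ** power
        for j in range(m):
            ktu = sum(kernel[i][j] * u[i] for i in range(n))
            v[j] = (b[j] / ktu) ** power
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# 1D toy problem: two source points and three targets, the last one an outlier.
source, target = [0.0, 1.0], [0.0, 1.0, 10.0]
cost = [[(x - y) ** 2 for y in target] for x in source]
plan = unbalanced_sinkhorn(cost, [0.5, 0.5], [1 / 3, 1 / 3, 1 / 3])
inlier_mass = plan[0][0] + plan[1][0]
outlier_mass = plan[0][2] + plan[1][2]
print(inlier_mass, outlier_mass)
```

In this toy case the outlier target receives almost no mass, whereas a hard marginal constraint would force it to be matched.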

[CV-30] Look Again Before You Abstain: Budgeted Conformal Evidence Acquisition for Reliable Vision-Language Model

Link: https://arxiv.org/abs/2606.16667
Authors: Jian Xu,Delu Zeng,John Paisley,Qibin Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Large vision-language models (LVLMs) hallucinate: they assert visual details that the image does not support. A principled remedy is selective prediction with a distribution-free guarantee: verify each claim and abstain when the claim is not grounded, so that the hallucination rate among asserted claims is provably bounded. We show, however, that this guarantee is bought at a brutal price: to keep the hallucination rate below 5% on a balanced object-existence benchmark, a state-of-the-art conformal filter must abstain on more than 80% of claims. We argue that abstention is wasteful when more visual evidence is cheaply available, and introduce Budgeted Conformal Evidence Acquisition (BCEA), which replaces the binary answer/abstain decision with a three-way choice: answer, abstain, or acquire additional visual evidence by re-examining the image (zooming, cropping, or applying a claim-specific intervention) under a bounded compute budget. We make two observations. First, acquisition that is plugged naively into a calibrated filter breaks the statistical guarantee (realized risk overshoots the target by up to 17 points) because the acquisition step destroys the exchangeability that conformal calibration relies on. Second, folding the entire acquisition policy into the score function and re-calibrating on post-acquisition scores restores the finite-sample guarantee while still recovering coverage. BCEA further uses structured, claim-type-specific interventions. Across the POPE benchmark and COCO-constructed existence and spatial-relation claims, on four open VLMs, BCEA controls the hallucination rate at the target level and consistently improves coverage over a guaranteed-abstention baseline.
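
To make the calibration step concrete, here is a minimal split-conformal sketch. It is deliberately simpler than BCEA and gives a different guarantee: under exchangeability, an ungrounded claim is asserted with probability at most `alpha`. It does not bound the hallucination rate among asserted claims, which is the quantity the abstract targets. The names `conformal_threshold` and `decide` are hypothetical. Following the abstract's second observation, the calibration scores should be computed after any acquisition step has been applied.

```python
import math


def conformal_threshold(ungrounded_scores, alpha):
    """Threshold from grounding scores of calibration claims known to be ungrounded."""
    n = len(ungrounded_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: always abstain
    return sorted(ungrounded_scores)[rank - 1]


def decide(score, threshold):
    """Assert a claim only when its grounding score clears the calibrated threshold."""
    return "answer" if score > threshold else "abstain"


calibration = [0.30, 0.10, 0.55, 0.20, 0.80, 0.45, 0.05]
threshold = conformal_threshold(calibration, alpha=0.25)
print(threshold, decide(0.7, threshold), decide(0.4, threshold))
```

The finite-sample correction `(n + 1)` is what makes the bound hold for small calibration sets; with a single calibration score the function falls back to always abstaining.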

[CV-31] Vision-Language Models as Zero-Annotation Oracles in Histopathology

Link: https://arxiv.org/abs/2606.16658
Authors: Vishal Jain,Giorgio Buzzanca,Sarah Cechnicka,Maarten Naesens,Priyanka Koshy,Tri Nguyen,Jesper Kers,Candice Roufosse,Bernhard Kainz
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 1 figure, 6 tables. Code available at this https URL

Abstract:Foreground segmentation is the critical first step of every computational pathology pipeline, yet existing methods rely on hand-tuned heuristics or supervised models that overfit to narrow stain and scanner distributions, failing silently on specialised stains such as Jones silver or Elastica van Gieson. We propose a coarse-to-fine approach that recasts foreground segmentation as a visual perception task and leverages general-purpose vision-language models (VLMs) as zero-annotation oracles. Our key insight is that tissue-versus-background discrimination is a natural-image recognition problem, not a histopathological one, so VLMs trained on internet-scale corpora generalise where domain-specific models cannot. We introduce Leica-75, a benchmark of 75 renal transplant whole-slide images spanning three stain families. On Leica-75, our method achieves the highest segmentation quality on out-of-distribution stains (Dice 0.858 +/- 0.027 on Jones, 0.853 +/- 0.041 on EVG) with 7x lower cross-stain variance than the best supervised baseline, while remaining competitive on in-distribution HE. Few-shot prompting with automatically curated exemplars (Auto-context) rescues hard cases on Stress-32 (n=32), a curated stress-test subset (Dice 0.470 to 0.819 for the 2B model). VLM-based annotation review matches human expert consensus (kappa=0.989 for blur detection; mean precision/recall grading accuracy 0.708 vs. human 0.646 for segmentation mask review). The resulting pseudo-labels are used to distil lightweight student models that are as performant as the teacher model while running for a fraction of the cost. Our framework provides a principled, scalable solution to a persistent infrastructure bottleneck in digital pathology.

[CV-32] MVM-IOD: An Industrial Object-Centric Benchmark Dataset for the Evaluation of 3D Reconstruction Methods

Link: https://arxiv.org/abs/2606.16638
Authors: Robert Langendörfer,Markus Hillemann,Markus Ulrich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D object reconstruction and camera pose estimation in industrial applications are challenging tasks, as errors are costly while the computation time is often limited. The complexity of typical industrial objects further complicates these tasks. Most of the existing datasets in this context do not depict realistic industrial scenarios. Therefore, we introduce the Machine Vision Metrology Industrial Object Dataset (MVM-IOD). Images of typical industrial objects are captured systematically by moving a camera, mounted at the end effector of an industrial robot arm, on a hemisphere around the objects. MVM-IOD contains reference camera poses, reference 3D point clouds, and the acquired RGB images of 9 objects and 2 background choices, resulting in 18 scenes, which allows evaluation of all image-based methods that compute a 3D reconstruction, camera poses, or novel views of a scene. Based on MVM-IOD, we extensively evaluate current SOTA 3D reconstruction and camera pose estimation methods, such as Structure from Motion, Multi-View Stereo, recent feed-forward methods (Visual Geometry Grounded Transformer, $\pi^3$), and 2D Gaussian Splatting, and report our findings as a baseline for future research. The experiments show that capture setups like ours generate out-of-distribution images for feed-forward methods, leading to suboptimal point clouds and camera poses. However, these out-of-distribution images can be shifted closer to the training distribution by applying simple preprocessing steps. Consequently, in certain industrial applications, feed-forward methods should be used with caution.

[CV-33] DCP-Prune: Ultra-Low Token Pruning with Distribution Consistency Preservation

Link: https://arxiv.org/abs/2606.16633
Authors: Xifeng Xue,Xiaokang Wang,Zirui Li,Ming-Ming Cheng,Guolei Sun
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: The code will be released at: this https URL

Abstract:Recent vision token pruning methods effectively preserve model performance under moderate token budgets but become unstable under ultra-low token budget. Our analysis shows that as the pruning budget decreases, accuracy degradation is often accompanied by larger feature distribution shifts. Critically, the degree of this distribution shift strongly correlates with performance degradation. To better characterize this phenomenon, we introduce a lightweight distribution consistency metric to estimate the distribution shift between retained and full tokens. Motivated by these observations, we propose a two-stage pruning framework consisting of Anchor-Context Graph Recovery (ACGR) and Text-Aware Token Cluster Selection (TATCS). Specifically, ACGR transfers contextual information before token removal, while TATCS dynamically re-selects representative tokens when severe distribution shift is detected. Extensive experiments demonstrate that our method achieves superior and more stable performance under ultra-low token budget. Notably, it retains 92.1% of the upper-bound average performance on LLaVA-1.5-7B with only 16 visual tokens.

[CV-34] SUP-MCRL: Subject-aware Unified Pseudo-feature Coded Multimodal Contrastive Representation Learning for EEG Visual Decoding

Link: https://arxiv.org/abs/2606.16615
Authors: Shengyu Gong,Weiming Zeng,Yueyang Li,Zijian Kang,Hongjie Yan,Wai Ting Siok,Nizhuan Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Non-invasive brain-computer interfaces suffer severe fidelity degradation in neural visual decoding when generalizing to natural visual experiences. Conventional multimodal contrastive representation learning solely optimizes geometric distance alignment, neglecting semantic consistency and subject selectivity, causing spurious zero-shot alignment. We propose SUP-MCRL, a unified framework integrating three collaborative mechanisms: (1) Semantic-entity Aware Visual Encoder (SAVE), learning spatial attention to extract semantic content without pre-trained saliency models; (2) Unified EEG Enhancer (UEE), employing multi-scale atrous convolutions and inter-band attention for adaptive cross-subject robustness; and (3) Prototype-based Progressive Augmenter (PPA), maintaining an EMA-updated pseudo-feature pool to prevent representation collapse. Zero-shot experiments on THINGS-EEG achieve 66.0%/91.9% (Top-1/Top-5) intra-subject and 24.0%/52.9% LOSO accuracy, surpassing state-of-the-art methods. Code is available at this https URL.

[CV-35] DifferAD-R1: A Difference-Guided Industrial Anomaly Localization with Multimodal Large Language Models

Link: https://arxiv.org/abs/2606.16601
Authors: Dingrong Wang,Xian Tao,Zhen Qu,Hengliang Luo,Xinyi Gong,Fei Shen,Zhengtao Zhang,Guiguang Ding
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to IEEE Transactions on Circuits and Systems for Video Technology

Abstract:Industrial anomaly localization aims to accurately identify and localize abnormal regions in industrial products, addressing the critical challenge of detecting unseen defect categories in real-world scenarios. Traditional closed-set methods often suffer from poor cross-scenario generalization, while existing Multimodal Large Language Model (MLLM)-based approaches face two core limitations: they either adopt QA-style paradigms misaligned with the practical demands of localization, or rely on standard optimization techniques such as Group Relative Policy Optimization (GRPO), which fails to deliver effective learning signals for subtle defects. To tackle these issues, this paper proposes DifferAD-R1, an MLLM-augmented reinforcement learning framework tailored for industrial anomaly localization. We design a Difference-Guided dual-image paradigm, which reformulates the localization task as a one-shot difference grounding problem to effectively explore cross-scenario anomalies. A Dual-Consistency Localization Reward is developed for hard-to-detect anomalies, enhancing optimization stability and robustness. Additionally, we integrate a difficulty-aware strategy with adaptive reweighting and group-wise resampling to prioritize learning on challenging instances. To facilitate evaluations in real-world industrial settings, we construct the AD-DualDiff dataset, comprising 13K paired images across 20 categories. Experimental results demonstrate that DifferAD-R1 significantly outperforms existing baselines and achieves competitive performance compared to large-scale models like Qwen3-VL (235B parameters). Our code is publicly available at: this https URL.

[CV-36] Rotational Symmetry based Object Pose Estimation from Point Clouds in the Absence of Known 3D Models

Link: https://arxiv.org/abs/2606.16593
Authors: Weichen Dai,Ruixun Yu,Yangjie Tang,Yifan Du,Yiyang Zhang,Donglei Sun,Hua Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Object pose estimation is crucial to many industrial applications, with one example being automated spray painting using a robot. However, confidentiality concerns often limit access to high-quality 3D models, posing a significant challenge for point-cloud-based pose estimation. In such scenarios, rotational symmetry, a readily accessible characteristic of many industrial objects, can provide valuable prior information to facilitate pose estimation. In this paper, we propose a method that leverages the rotational symmetry commonly found in industrial objects to address the challenge caused by the absence of 3D models. The object pose is jointly estimated with point cloud refinement through an iterative optimization process. This optimization relies on a rotational symmetry constraint loss. To construct this loss, each 3D point is rotated according to the currently estimated pose, and multiple correspondences are identified using nearest-neighbor search by exploiting the rotational symmetry property. These correspondences are then used to compute the rotational symmetry constraint loss, which iteratively refines both the pose and the point cloud. By explicitly incorporating rotational symmetry into the optimization process, the proposed method achieves robust pose estimation and generalizes well across diverse object types. The proposed method is evaluated on a dataset specifically created for point clouds without known 3D models, consisting of four categories of synthetic objects and one real wheel hub collected from a production line. Experimental results demonstrate that the proposed method achieves performance comparable to methods that rely on known 3D models.
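
A rotational symmetry constraint of the kind described here can be sketched as follows. This is a simplified illustration, not the authors' loss: it assumes the current pose estimate has already aligned the symmetry axis with the z-axis through the origin, uses a single rotation by 2π/k rather than multiple correspondences, and uses brute-force nearest-neighbor search. The names `rotate_z` and `symmetry_loss` are hypothetical.

```python
import math


def rotate_z(point, angle):
    """Rotate a 3D point about the z-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


def symmetry_loss(points, order):
    """Mean squared distance from each rotated point to its nearest neighbor in the cloud."""
    angle = 2.0 * math.pi / order
    total = 0.0
    for p in points:
        q = rotate_z(p, angle)
        total += min(sum((qa - ra) ** 2 for qa, ra in zip(q, r)) for r in points)
    return total / len(points)


# A 4-fold symmetric cloud centered on the axis, and the same cloud shifted off-axis.
square = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0)]
shifted = [(x + 0.3, y, z) for x, y, z in square]
aligned_loss = symmetry_loss(square, order=4)
misaligned_loss = symmetry_loss(shifted, order=4)
print(aligned_loss, misaligned_loss)
```

The loss vanishes when the assumed axis matches the true symmetry axis and grows as the pose estimate drifts, which is what lets an optimizer refine the pose without a reference 3D model.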

[CV-37] LOCUS: Local Visual Cue Search for Enhancing Fine-Grained Perception in Multimodal Large Language Models

Link: https://arxiv.org/abs/2606.16586
Authors: Zhou Tao, Fang Zhang, Zewen Ding, Shida Wang, Xiaokun Sun, YongXiang Hua, Haoyu Cao, Linli Xu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal Large Language Models (MLLMs) remain unreliable on fine-grained visual perception, even when high-resolution inputs preserve the necessary local details. We identify this limitation as visual context rot: decisive evidence may exist in the full image, yet fail to be reliably selected and used amid redundant visual context. We propose LOCUS (LOcal visual CUe Search), a training framework that teaches MLLMs to internalize local evidence search through a verifiable proxy task. During training, LOCUS provides a local crop as a visual cue and optimizes the model to recover its spatial support in the full image using an IoU-based reward. The visual cue is used only during training, leaving the standard image-question inference interface unchanged. Experiments across fine-grained perception, hallucination, general understanding, and reasoning benchmarks show that LOCUS improves localization-sensitive visual understanding while preserving broad capabilities. Attention analyses further indicate stronger focus on task-relevant evidence regions, suggesting that training-time visual cue search provides an effective route to internalized fine-grained evidence selection.
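
As an illustration of what an IoU-based reward could look like, here is a minimal sketch. It assumes the model predicts an axis-aligned box `(x_min, y_min, x_max, y_max)` and that the reward is the plain intersection-over-union with the true crop location; the abstract does not specify the box format or any reward shaping, so both are assumptions.

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical example: true crop location versus a predicted box.
crop_box = (0.0, 0.0, 2.0, 2.0)
predicted_box = (1.0, 1.0, 3.0, 3.0)
reward = box_iou(predicted_box, crop_box)  # overlap 1, union 7
```

Because the crop location is known at training time, a reward of this kind can be checked automatically, which fits the abstract's description of a verifiable proxy task.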

[CV-38] Multi-Modal Spatio-Temporal Graph Neural Network with Mixture of Experts for Soil Organic Carbon Prediction

Link: https://arxiv.org/abs/2606.16580
Authors: Daniele Mos, Felipe Drummond, Anton Bossenbroek, Soufiane el Khinifri
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Paper is 27 pages, 14 figures, 12 tables

Abstract:Top-soil organic carbon (SOC) prediction is fundamental to agricultural sustainability, land use policy and fertilization planning. Existing approaches face two limitations: they pair hand-crafted covariates with classical ML or single-modal deep models that miss rich spectral and temporal information, and grid-based architectures ignore the irregular spatial structure of field measurements. We introduce SpTGNN, a multi-modal spatio-temporal graph neural network addressing both. SpTGNN represents soil measurements as nodes in a heterogeneous graph with three edge types (spatial proximity, spectral similarity, elevation), and applies relational graph attention to learn separate patterns per relation. A fine-tuned TerraMind encoder extracts node features from Sentinel-2, Sentinel-1 and DEM signals, combined with per-sample environmental covariates and learned positional and temporal embeddings. A sparse Mixture-of-Experts module fuses the four streams via top-$k$ routing. Uncertainty is captured by pairing heteroscedastic regression (aleatoric) with deep ensembles (epistemic), and a Moran’s I penalty regularizes spatial autocorrelation. We evaluate on a global SOC corpus split into three regional instances ($\sim$49k samples globally, Africa $\sim$26k, Europe $\sim$14k). Our 5-member deep ensemble reports $R^2 = 0.762$, RMSE $= 3.51 \pm 0.48$ g/kg and MAPE $= 22.9\%$ on the Africa test split, improving over a tabular XGBoost baseline; the best single checkpoint reaches validation $R^2 = 0.864$. Ablations confirm the heterogeneous graph, MoE fusion and fine-tuned backbone each contribute substantively, and the ensemble UQ stack achieves post-calibration ECE of 0.031 (hybrid) and 0.026 ($\beta$-NLL). To our knowledge, this is the first framework to unify foundation-model feature extraction, heterogeneous graph attention and decomposed uncertainty quantification for SOC estimation.
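
For readers unfamiliar with Moran's I, the sketch below computes the standard global statistic, I = (n / W) · Σ_ij w_ij (x_i − x̄)(x_j − x̄) / Σ_i (x_i − x̄)², where W is the sum of all weights. How SpTGNN turns this into a penalty is not described in the abstract; applying it to model residuals with an adjacency-style weight matrix, as in the example data here, is an assumption.

```python
def morans_i(values, weights):
    """Global Moran's I for `values` under a spatial weight matrix `weights`.

    weights[i][j] is the spatial weight between sites i and j (0 on the diagonal).
    Values near +1 indicate clustering of similar values; near -1, alternation.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    weight_sum = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    variance_sum = sum(d * d for d in dev)
    return (n / weight_sum) * cross / variance_sum


# Hypothetical example: four sites on a line, each linked to its neighbors.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
smooth_i = morans_i([1.0, 2.0, 3.0, 4.0], chain)        # neighbors are similar
alternating_i = morans_i([1.0, -1.0, 1.0, -1.0], chain)  # neighbors are opposite
```

A penalty built on this statistic would discourage residuals that are spatially clustered, since clustering suggests the model has missed some spatial structure.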

[CV-39] Transformation-driven generation of comparable projection images from multimodal anatomical scenes

Link: https://arxiv.org/abs/2606.16573
Authors: Dariusz Pojda, Krzysztof Domino, Michał Tarnawski, Agnieszka Anna Tomaka
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 36 pages, 11 figures

Abstract:This work addresses the computational problem of generating reproducible projection-space observations from heterogeneous anatomical scenes whose components may undergo independent spatial transformations. We propose a transformation-driven framework for synthetic projection imaging from multimodal anatomical data and demonstrate it on mandibular-motion scenarios. In contrast to conventional Digitally Reconstructed Radiograph (DRR) approaches primarily designed for registration, projection realism, or rendering efficiency, the proposed formulation treats projection imaging as an observation process operating on an explicitly represented anatomical scene. Independently transformable volumetric and surface-based anatomical objects are embedded within a shared scene representation and propagated directly into projection space through explicit transformations. Projection geometry, acquisition modelling, material interpretation, and image presentation remain explicitly separated, enabling controlled exploration of methodological assumptions while preserving reproducibility and direct comparability between generated projections. Particular emphasis is placed on transformation-driven anatomical scenarios relevant to craniofacial analysis, including mandibular motion and therapeutic repositioning. Using a shared anatomical reference scene composed of CT/CBCT volumes, segmented structures, surface models, and auxiliary anatomical or therapeutic objects, the framework enables generation of directly comparable VirtualRTG projections from multiple anatomical configurations while preserving identical imaging assumptions. Rather than aiming at fully physically faithful radiographic simulation, the proposed approach provides a controllable and reproducible methodological environment for studying anatomy–projection relationships, motion observability, and transformation-aware imaging workflows.

[CV-40] PROSE: Training-Free Egocentric Scene Registration with Vision-Language Models

Link: https://arxiv.org/abs/2606.16569
Authors: Zhiang Chen, Nahyuk Lee, Boyang Sun, Taein Kwon, Marc Pollefeys, Zuria Bauer, Sunghwan Hong
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Project page: this https URL

Abstract:Registering two captures of the same indoor space taken at different times underpins persistent spatial memory for robots and AR systems, yet the realistic version of this task is egocentric and its most scalable form is RGB-only. Head-mounted cameras yield blurry, fast-moving, partially overlapping views from which dense geometry is hard to recover. Classical registration leans on exactly the clean point clouds this setting lacks, while learned scene-graph methods require a pre-built or annotated graph and a trained matcher that we find brittle under egocentric data. We take a different route, using a pretrained vision-language model as the source of both scene understanding and cross-scan matching. Our method, PROSE (Prompted Scene rEgistration), lifts each RGB sequence into an object-level 3D scene graph using off-the-shelf foundation models for geometry, segmentation, and language, then prompts the same VLM to match object instances across the two RGB sequences. To make this matching tractable and reliable, we leverage object heights as a prior and verify each proposed match with a paired same/different query, then solve for the rigid transform by hypothesizing a candidate per matched object and selecting the one with the strongest geometric consensus. PROSE adds no learned parameters and requires no depth sensor, training, or annotated graph. On the egocentric Aria Digital Twin and Aria Everyday Activities benchmarks, it outperforms both geometric and learned scene-graph baselines in registration accuracy, on ground-truth and RGB-reconstructed point clouds alike, and the scene graph it produces transfers directly to downstream tasks.

[CV-41] Local-GS: Accelerating 3D Gaussian Splatting via Tile-Local Warp Coherence

Link: https://arxiv.org/abs/2606.16566
Authors: Yang Luo, Yan Gong, Yongsheng Gao, Jie Zhao, Xinyu Zhang, Huaping Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D Gaussian Splatting (3DGS) has significantly advanced real-time novel view synthesis by representing scenes as dense collections of anisotropic 3D Gaussian primitives. However, the irregular spatial distribution of Gaussians often leads to poor GPU utilization, as warp divergence and redundant computation degrade rendering performance. To address this, we present Local-GS, a warp-coherent rendering paradigm that organizes Gaussian primitives with respect to SIMT (Single Instruction, Multiple Threads) execution boundaries rather than scene geometry. Specifically, we propose three warp-coherent stages: a hoisting stage that precomputes shared parameters at tile level, a culling stage that discards warps with no contribution, and a blending stage that replaces per-pixel branching with a uniform instruction stream. Across extensive benchmarks on multiple datasets, Local-GS improves efficiency without compromising quality. As a plug-and-play optimization, it provides additional performance gains to all tested baselines, culminating in a $7.76\times$ speedup on Deep Blending scenes.

[CV-42] Assessing Reliability of Symbol Detection in Concept Bottleneck Models

Link: https://arxiv.org/abs/2606.16535
Authors: Javier Fumanal-Idocin, Javier Andreu-Perez
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Symbolic Computation (cs.SC)
Comments:

Abstract:Concept Bottleneck Models (CBMs) are a relevant tool for explainable Artificial Intelligence because they make their predictions through human-interpretable symbols. However, high task accuracy does not guarantee that these symbols are detected faithfully: jointly trained CBMs may encode task-specific shortcuts in the bottleneck, making their explanations unreliable. In this paper, we study concept-detection reliability by swapping independently trained concept detectors and classification heads that share the same symbolic vocabulary. We use the resulting performance degradation, concept-level metrics, and symbol-wise uncertainty estimates to identify concepts that are especially prone to spurious firing. Finally, we propose a reliability-aware training strategy in which a shared concept detector is optimized with multiple classification heads and penalized for relying on globally or instance-wise unreliable symbols. On CUB-200-2011 with full concept supervision, detectors and heads are almost freely interchangeable (swap drop below one accuracy point, relative retention above 99%, and no concept detected below chance), whereas on a controlled synthetic task we show that, as the concept-supervision weight is reduced, models keep near-perfect task accuracy while swapped accuracy and agreement with the ground-truth concepts collapse to chance. Our reliability-aware training substantially mitigates this leakage, roughly doubling swap accuracy in the leaky regime.

[CV-43] Kairos: A Native World Model Stack for Physical AI

Link: https://arxiv.org/abs/2606.16533
Authors: Kairos Team, Fei Wang, Shan You, Qiming Zhang, Tao Huang, Zuoyi Fu, Zhisheng Zheng, Yunlong Xi, Feng Lv, Xiaoming Wu, Zeyu Liu, Cong Wan, Pu Li, Ruiqing Yang, Xiaoou Li, Wei Wang, Kangkang Zhu, Yuwei Zhang, Shi Fu, Xiaoning Wu, Xuzeng Fan, Dacheng Tao, Xiaogang Wang
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:World models are transitioning from passive visual generators to foundational, operational infrastructure for Physical AI: they must natively acquire world knowledge from heterogeneous experience, maintain persistent states over long horizons, and execute efficiently within real deployment constraints. We introduce Kairos, a native world model stack designed around these requirements. (1) Kairos learns the world by pioneering a Native Pre-training Paradigm governed by a Cross-Embodiment Data Curriculum, which organizes open-world videos, human behavioral data, and robot interactions into a progressive developmental pathway. (2) Kairos maintains the world by unified world understanding, generation, and prediction within a Native Unified Architecture equipped with Hybrid Linear Temporal Attention, where sliding-window attention captures local dynamics, dilated sliding windows capture mid-range dependencies, and gated linear attention maintains persistent global memory. We establish formal theoretical bounds demonstrating that this temporal factorization strictly limits error accumulation, mathematically guaranteeing state propagation across extended horizons. (3) Kairos runs the world by incorporating a Deployment-Aware System Co-Design to support low-latency rollout generation on server and consumer-grade hardware for real-world observation-action-feedback loops. Experiments on embodied world-model, long-horizon, and action-policy benchmarks show that Kairos achieves top level performance while offering a strong efficiency-capability trade-off. Together, these results position Kairos as a cohesive operational foundation for future self-evolving physical intelligence.
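
To make the difference between the sliding-window and dilated sliding-window branches concrete, the sketch below lists which past positions a query at position `t` would attend to in each case. It assumes causal attention and uses hypothetical window and dilation values; Kairos's actual configuration is not given in the abstract, and the gated linear attention branch is not shown.

```python
def sliding_window_keys(t, window, dilation=1):
    """Causal key positions for query position t: t, t - d, t - 2d, ...

    At most `window` positions are returned; positions before the start
    of the sequence (negative indices) are dropped.
    """
    return [t - k * dilation for k in range(window) if t - k * dilation >= 0]


# Hypothetical settings: 4 keys per query in both branches.
local_keys = sliding_window_keys(10, window=4)                 # dense, recent frames
mid_range_keys = sliding_window_keys(10, window=4, dilation=3)  # sparse, wider span
```

With the same number of keys, the dilated window reaches back nine steps instead of three, which is why it is suited to mid-range dependencies at the same cost per query.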

[CV-44] BadWorld: Adversarial Attacks on World Models

Link: https://arxiv.org/abs/2606.16519
Authors: Linghui Shen, Mingyue Cui, Xingyi Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Visual world models (VWMs) synthesize interactive, action-conditioned rollouts from a single context image. However, it remains an open question how robust these models are to adversarial perturbations. Standard adversarial attacks fail to assess this vulnerability because attackers lack ground-truth future videos and cannot predict subsequent user controls. We introduce BadWorld, a label-free adversarial framework tailored for autoregressive VWMs that systematically overcomes both constraints. First, to bypass the need for future supervision, we propose a self-supervised velocity attack that directly disrupts the early denoising dynamics of the model. Second, to ensure the attack generalizes across unpredictable user actions, we formulate a trajectory-adaptive bi-level optimization that actively mines hard control sequences to forge control-agnostic perturbations. Evaluated on representative VWMs with continuous and discrete controls, BadWorld exposes severe structural fragility. Visually indistinguishable adversarial images reliably trigger catastrophic degradation in future rollouts, leading to incomplete denoising, structural collapse, and control inconsistency. These findings reveal critical risks for deploying VWMs in safety-critical systems while highlighting a practical mechanism for privacy protection.

[CV-45] Active Reference Acquisition in Few-Shot Font Generation ICDAR2026

Link: https://arxiv.org/abs/2606.16502
Authors: Shinnosuke Matsuo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICDAR2026

Abstract:Few-shot font generation aims to synthesize the remaining glyphs of a font given one or a few reference glyphs while preserving stylistic consistency, thereby supporting font designers in efficiently completing a typeface. Existing methods primarily focus on improving generation quality given a fixed reference set. However, when the current reference glyphs are insufficient to represent the target style, few-shot font generation may fail to produce satisfactory results. In practical scenarios, additional reference glyphs can often be obtained from the designer when necessary. Accordingly, we propose a new framework, Active Reference Acquisition in Few-Shot Font Generation, in which the model sequentially decides which character to acquire next as an additional reference. Furthermore, we propose a reference part-coverage-based acquisition function to efficiently query the designer. Motivated by the observation that font styles are well characterized by local structural parts, we represent each glyph using a histogram of local features and select query characters that maximize the expected part coverage of the reference set. By prioritizing characters that contain parts not yet covered by the current references, the proposed method progressively expands the diversity of visual parts in the reference set. As a result, generation quality is improved with fewer queries. Experiments on the Google Fonts dataset demonstrate that the proposed method achieves higher generation quality than random querying and reference-agnostic baselines. The code is available at this https URL.
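
The coverage-based acquisition idea can be sketched as a greedy selection over part histograms. This is a simplified illustration, not the paper's acquisition function: it assumes each character already has a known histogram over a fixed vocabulary of local parts, and it scores candidates by the number of newly covered parts rather than by an expected coverage. The character names and histograms are made up.

```python
def covered_parts(histograms):
    """Indices of parts that occur at least once in any of the given histograms."""
    return {i for hist in histograms for i, count in enumerate(hist) if count > 0}


def pick_next_character(reference_hists, candidate_hists):
    """Return the candidate character whose glyph adds the most uncovered parts.

    candidate_hists maps a character to its part histogram.
    Ties are broken alphabetically so the choice is deterministic.
    """
    covered = covered_parts(reference_hists)

    def gain(char):
        return len(covered_parts([candidate_hists[char]]) - covered)

    return max(sorted(candidate_hists), key=gain)


# Hypothetical vocabulary of four local parts.
references = [[1, 0, 0, 0]]            # current reference glyphs cover part 0 only
candidates = {
    "A": [1, 1, 0, 0],                 # would add part 1
    "B": [0, 1, 1, 1],                 # would add parts 1, 2 and 3
    "C": [1, 0, 0, 0],                 # adds nothing new
}
next_char = pick_next_character(references, candidates)
```

Repeating this step after each acquired glyph grows the set of covered parts quickly, which matches the abstract's goal of reaching good quality with fewer queries.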

[CV-46] Unified Multimodal Model for Brain MRI Imputation and Understanding MICCAI2026

Link: https://arxiv.org/abs/2606.16484
Authors: Zhiyun Song, Che Liu, Tian Xia, Avinash Kori, Wenjia Bai
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: Early accepted to MICCAI 2026

Abstract:Multimodal large language models (MLLMs) hold great potential for medicine, as they inherit knowledge from LLMs and allow multiple data modalities to be integrated, analysed and interpreted in natural language. However, the field of medical MLLMs is constrained by non-trivial challenges, notably the scarcity of high-quality training data and the frequent occurrence of missing data in real-world clinical settings. Here, we propose a novel unified multimodal model, UniBrain, for brain magnetic resonance image (MRI) analysis. To address potential missing brain MRI modalities, we employ a unified training strategy to perform joint imaging modality imputation and brain image understanding. During training, an interleaved and description-enriched data flow is constructed to train the model in an autoregressive manner, enabling medical reasoning with generated multimodal data. A self-alignment strategy is introduced to leverage dense image embeddings to learn fine-grained anatomical features without requiring detailed image captions. Furthermore, we propose a dynamic hidden state mechanism to alleviate the exposure bias during long-context multimodal inference. Extensive experiments on a multi-disease brain MRI dataset demonstrate that UniBrain achieves high performance for brain image imputation, understanding, and disease diagnosis under various extents of modality incompleteness.

[CV-47] Uncertainty Quality of VGGT: An Analysis on the DTU Benchmark Dataset

Link: https://arxiv.org/abs/2606.16479
Authors: Markus Hillemann, Robert Langendörfer, Steven Landgraf, Markus Ulrich
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted for publication in the ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences

Abstract:Visual Geometry Grounded Transformer (VGGT) has already attracted a great deal of attention in a short period of time, not least due to the Best Paper Award at CVPR-2025. Similar to DUSt3R and MASt3R, VGGT aims to bring about a paradigm shift by replacing established methods like bundle adjustment and feature matching with a simple, unified, feed-forward neural network that predicts camera poses, depth maps, and dense 3D structure directly from multiple images of a scene in a few seconds. A key aspect is its ability to process an arbitrary number of views consistently in a single forward pass without any post-processing or iterative optimization. For photogrammetry, this opens new possibilities for real-time, scalable, and accessible 3D reconstruction. In this context, not only high reconstruction accuracy but also high-quality uncertainty estimates are crucial, as they foster trust and enable robust quality assurance. This paper therefore investigates the quality of VGGT’s uncertainty predictions. The analysis identifies an effective confidence threshold for filtering VGGT’s raw output and demonstrates that enhancing uncertainty quality holds strong potential for improving the accuracy of its 3D reconstructions.

[CV-48] AURA: Active-Response Attribution under Treatment Ambiguity in Bacterial Cytological Profiling

Link: https://arxiv.org/abs/2606.16477
Authors: Kartik Jhawar, Mrunmayee Deshpande, Wilfried Moreira, Guillermo C. Bazan, Lipo Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:When a bacterial sample is exposed to several antibiotics, not every applied drug necessarily acts: if the organism is resistant to one of them, that drug leaves no morphological trace. The clinically meaningful quantity is therefore not which antibiotics were applied, but which ones were active. We show that these two are sharply decoupled in real E. coli microscopy - naively assuming the applied combination equals the active one is correct only about 37% of the time - yet existing computational tools are ill-suited to recovering the active set. Forward perturbation models such as scGen, CPA, and IMPA are designed to predict appearance from treatment, not the reverse, and inverting them degrades sharply; discriminative image classifiers tend to memorise strain- and batch-specific texture and fail to transfer across experimental replicates. We introduce AURA, which reframes the task as constrained, energy-based inverse attribution. Its central inductive bias is that the active set must be a subset of the applied set; this collapses the candidate space and lets AURA infer the active subset of applied antibiotics by decomposing residual morphology into antibiotic response atoms and selecting the subset with the lowest reconstruction energy, using no strain label at test time. AURA-E adds evidence-aware abstention, withholding a prediction when candidate explanations remain near-equally plausible. On cross-replicate transfer in an E. coli cytological profiling dataset, AURA recovers the active antibiotic combination with 95.47% exact-match accuracy.
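
The subset constraint can be illustrated with a small brute-force search. This sketch assumes each antibiotic has a known response vector ("atom"), that active drugs combine additively with unit coefficients, and that the energy is a squared reconstruction error; none of these details are given in the abstract, and the drug names and vectors are invented for the example.

```python
from itertools import combinations


def infer_active_subset(residual, atoms, applied):
    """Return (active_set, energy) minimizing reconstruction error over subsets of `applied`.

    residual: observed morphology change as a feature vector.
    atoms: maps a drug name to its response vector (same length as residual).
    Only subsets of the applied drugs are considered, including the empty set.
    """
    best_subset, best_energy = (), float("inf")
    for size in range(len(applied) + 1):
        for subset in combinations(sorted(applied), size):
            reconstruction = [
                sum(atoms[drug][i] for drug in subset) for i in range(len(residual))
            ]
            energy = sum((r - x) ** 2 for r, x in zip(residual, reconstruction))
            if energy < best_energy:
                best_subset, best_energy = subset, energy
    return set(best_subset), best_energy


# Hypothetical response atoms in a 3-dimensional feature space.
atoms = {"amp": [1.0, 0.0, 0.0], "cip": [0.0, 1.0, 0.0], "tet": [0.0, 0.0, 1.0]}
residual = [0.9, 0.1, 0.0]  # looks mostly like the "amp" response
active, energy = infer_active_subset(residual, atoms, applied={"amp", "cip"})
```

Restricting the search to subsets of the applied set shrinks the candidate space from every possible drug combination to 2^k options for k applied drugs, which is what makes exhaustive scoring feasible here.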

[CV-49] MVOFormer: Flow-Semantic Transformer for Robust Monocular Visual Odometry

Link: https://arxiv.org/abs/2606.16474
Authors: Jituo Li, Shunwang Sun, Jialu Zhang, Xinqi Liu, Jinyao Hu, Zhicheng Lu, Sajad Saeedi, Guodong Lu
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 8 pages, 6 figures. Accepted for publication in IEEE Robotics and Automation Letters (RA-L)

Abstract:Monocular visual odometry (MVO) is foundational to autonomous navigation and robotic localization. However, existing learning-based MVO approaches often struggle with either a lack of interpretable, complementary features or overly complex multi-stage architectures. These limitations inherently restrict their robustness and cross-domain generalization. In this work, we propose MVOFormer, a novel transformer framework for robust monocular visual odometry. Our architecture features a Flow-Semantic Dual Branch Encoder that synergizes dense geometric motion cues with object-centric semantic priors, explicitly distinguishing static structures from dynamic distractors. These representations are then fused by an Iterative Multimodal Decoder, enabling coarse-to-fine pose refinement while dynamically suppressing attention on unreliable regions. Extensive evaluations demonstrate that, without any target-domain fine-tuning, MVOFormer achieves superior zero-shot generalization and robustness, significantly outperforming prior learning-based frame-to-frame methods across diverse benchmarks including TartanAir, KITTI, TUM-RGBD, and ETH3D-SLAM.

[CV-50] Decoupled Object-Centric Video Understanding for Generating Robotic Manipulation Commands

Link: https://arxiv.org/abs/2606.16470
Authors: Thanh Nguyen Canh, Thanh-Tuan Tran, Haolan Zhang, Ziyan Gao, Xiem HoangVan, Nak Young Chong
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Translating video demonstrations into executable robot commands remains challenging because existing methods often fail to identify which objects are functionally involved in the demonstrated action. As a result, they may generate commands that are linguistically plausible but operationally ambiguous. We propose an object-centric video understanding framework that decouples action recognition from object identification to generate precise, grammar-free manipulation commands. Our approach integrates Temporal Shift Modules (TSM) for efficient spatio-temporal action classification with a novel Object Selection algorithm that identifies task-relevant objects through trajectory-based role classification, blur detection, and overlap minimization. The selected objects are then processed by Vision-Language Models (VLMs) for robust category recognition and zero-shot generalization. Evaluated on a modified Something-Something V2 dataset, our method achieves 86.79% action classification accuracy and BLEU-4 scores of 0.337 on standard objects and 0.261 on novel objects. These results improve over the strongest task-specific baseline by 80.2% and 143.9%, respectively. Larger gains are observed in METEOR and CIDEr, reaching 157.9% and 171.7% on novel objects. Across all semantic metrics, our approach consistently outperforms task-specific methods and remains competitive with, or surpasses, large general-purpose VLMs while retaining a modular, object-centric design.

[CV-51] ResEdit: Residual embeddings for precise generative image editing

Link: https://arxiv.org/abs/2606.16457
Authors: Ahmet Canberk Baykal, Valentin Deschaintre, Yannick Hold-Geoffroy, Michael Fischer, Anna Frühstück, Cengiz Öztireli, Iliyan Georgiev
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted to the EGSR 2026 journal track

Abstract:Conditional diffusion image generators can be repurposed for editing through inversion, without the need for large-scale paired fine-tuning data. However, producing high-quality, targeted edits while maintaining image identity and global consistency remains challenging, as weakly conditioned inversion often embeds conflicting image features into the noise. We demonstrate that incorporating a residual image encoding as additional conditioning enables both improved identity preservation and better editability. We optimize this residual encoding to provide a strong conditioning signal for reconstruction, thereby reducing the reliance on inversion and susceptibility to its aforementioned pitfalls. To ensure this residual does not interfere with desired edits, we incorporate a gradient reversal-based optimization strategy that disentangles the residual from the edited condition. We illustrate our method’s ability to produce high-fidelity results across precise intrinsic-based editing and relighting, and show proof-of-concept text-guided manipulation.

[CV-52] PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory

Link: https://arxiv.org/abs/2606.16449
Authors: Shuai Yang, Bingjie Gao, Ziwei Liu, Jiaqi Wang, Dahua Lin, Tong Wu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Consistent video generation under editing operations requires persistence: when edits modify scene appearance or layout, subsequent generations should remain coherent across time and viewpoints. However, existing memory designs struggle to maintain long-term consistency after such modifications, as stored contexts may become outdated or invalid. To address this, we propose PermaVid, a novel framework built upon a multi-modal context memory that disentangles spatial context into semantic appearance and geometric structure, together with an edit-aware memory update and retrieval strategy that keeps memory evolution aligned with subsequent observations. Specifically, we develop two complementary memory banks: an RGB context memory that captures appearance-aware observations while implicitly encoding geometry, and a depth context memory that preserves geometry-only structure disentangled from semantics. Building on this design, we introduce a memory-guided video generation model that performs multi-modal feature fusion under reference conditions drawn from mixed-modality memory contexts. Experiments demonstrate that our method maintains strong long-term semantic and structural consistency after edits, significantly outperforming state-of-the-art methods.

[CV-53] Hierarchical Fine-Grained Aerial Object Detection

Link: https://arxiv.org/abs/2606.16448
Authors: Yan Zhang, Fang Xu, Wen Yang, Gui-Song Xia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages

Abstract:Fine-grained aerial object detection, driven by the intrinsic granularity of real-world object categories, is crucial for advanced scene understanding in remote sensing. Existing methods largely inherit the paradigm of coarse-grained object detection, relying solely on single-label supervision and thus struggling to distinguish model-level categories with subtle structural differences. However, for each specific model (e.g., Boeing 787), structured prior knowledge such as attributes and hierarchies offers discriminative semantics across multiple granularities. Motivated by this, we present ExpertDet, a scheme that incorporates expert-informed cues to enhance fine-grained aerial object detection. Specifically, we design Vision-aware Masked Attribute Modeling (VMAM), which aligns attribute semantics with visual structures by reconstructing randomly masked attributes from visual cues, enabling the detector to capture subtle structural distinctions. We further propose Hierarchical Visual Instance Promotion (HierVIP), which builds a visual prototype tree based on hierarchical relations and imposes taxonomy-aware constraints to preserve cross-level semantic continuity while enhancing category discrimination. Moreover, we curate a new fine-grained object detection benchmark for Precise recognition of model-specific Ships and Planes from aerial imagery, PSP, covering 106 ship classes and 30 airplane models, respectively, featuring the most extensive collection of model-specific categories among existing aerial object detection datasets to date. We benchmark state-of-the-art object detection algorithms on the PSP benchmark. Extensive evaluation demonstrates that ExpertDet consistently outperforms other fine-grained competitors across hierarchy levels. The dataset, benchmark, and code are available at this https URL.

[CV-54] V2P-Manip: Learning Dexterous Manipulation from Monocular Human Videos

Link: https://arxiv.org/abs/2606.16436
Authors: Kaihan Chen, Yanming Shao, Haifeng Ji, Xiaokang Yang, Yao Mu
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Achieving autonomous robotic dexterous manipulation requires precise, human-like action sequences at scale. As a scalable supplement to costly teleoperation data, extracting trajectories with both visual fidelity and physical plausibility from monocular videos represents a promising frontier in embodied AI. To this end, we introduce V2P-Manip, an efficient framework designed to learn dexterous manipulation policies directly from human demonstration videos. We establish an efficient, integrated pipeline encompassing 3D asset acquisition, trajectory estimation, and dexterous policy learning. To bridge the gap between visual perception and physical constraints, we introduce a two-stage refinement process to enforce spatial alignment and physical consistency. Evaluations on the TACO and OakInk benchmarks demonstrate that our approach significantly outperforms previous methods in pose accuracy, adaptability to unstructured environments, and training efficiency. Ultimately, experimental results confirm an average success rate of over 75% across multiple synthetic manipulation tasks and validate the adaptability of the extracted manipulation priors across diverse dexterous hand embodiments.

[CV-55] Beer-Lambert Guided Representation Learning for Unsupervised Anomaly Detection in Sub-THz Food Inspection Images

Link: https://arxiv.org/abs/2606.16421
Authors: Gyutae Hwang, Sang Jun Lee
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 3 figures

Abstract:Food manufacturing requires reliable inspection systems to detect foreign material contamination and maintain product safety. Sub-THz transmission imaging provides material-dependent attenuation characteristics that are useful for detecting low-density contaminants in food products. However, existing unsupervised anomaly detection methods mainly rely on RGB-pretrained visual representations, which may not adequately capture the transmission behavior of Sub-THz images. This paper proposes a Beer-Lambert guided representation learning framework for unsupervised anomaly detection in Sub-THz food inspection images. The proposed method introduces an attenuation decomposition module as an auxiliary regularization module that constrains student representations through attenuation reconstruction during training. In addition to the conventional one-class setting, we introduce a Leave-One-Food-Out protocol to evaluate generalization capability under unseen food categories. Experimental results on the Inline-Food-Inspection-THz dataset show that the proposed method improves overall anomaly detection performance over the baseline method.
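
For readers unfamiliar with the physics behind the name, the Beer-Lambert law relates transmitted intensity to material attenuation as I = I0 · exp(-μ·d). The snippet below is a minimal sketch of that relation and its inversion only; it is not the paper's attenuation decomposition module, and the function names and numeric values are hypothetical.

```python
import math


def transmitted_intensity(i0: float, mu: float, thickness: float) -> float:
    """Beer-Lambert law: I = I0 * exp(-mu * d)."""
    return i0 * math.exp(-mu * thickness)


def attenuation_from_measurement(i0: float, i: float, thickness: float) -> float:
    """Invert the law to recover the attenuation coefficient mu."""
    return -math.log(i / i0) / thickness


# Illustrative values only (not taken from the paper).
i = transmitted_intensity(i0=1.0, mu=0.5, thickness=2.0)
mu_hat = attenuation_from_measurement(1.0, i, 2.0)
```

A denser contaminant would show up as a locally larger recovered `mu_hat` than the surrounding food material, which is the kind of material-dependent cue the abstract refers to.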

[CV-56] Instance-Aware Knowledge Distillation for Semi-Supervised Learning of an On-Board Multi-Task Dense Prediction Model for Collision Avoidance System

Link: https://arxiv.org/abs/2606.16414
Authors: Gyutae Hwang,Sang Jun Lee
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 7 figures

Abstract:Collision avoidance systems have evolved toward camera-based deep learning approaches for driving scene understanding. However, deployment in edge environments such as country clubs is constrained by limited computational resources and unreliable communication infrastructure. Moreover, constructing large-scale datasets for the target domain involves substantial annotation cost. To address these limitations, we propose an instance-aware knowledge distillation framework for semi-supervised learning. Specifically, we generate pseudo labels that mitigate teacher bias by leveraging domain priors from the teacher and instance-centric knowledge from foundation models. The trained lightweight student is deployed in the proposed collision avoidance system and performs multiple dense prediction tasks in real-time. The system detects frontal obstacles and encodes their spatial information into controller area network messages for automated guided vehicle operation. To achieve this, we construct a large-scale country club dataset and perform field validation of the proposed system. Experimental results demonstrate that the student outperforms the large teacher in instance segmentation while mitigating performance degradation in monocular depth estimation. Compared with the teacher, the student reduces FLOPs by 22.68× and parameters by 14.33×, achieving 6.46 FPS on a low-cost edge device.

[CV-57] RGFVR: Reference-Guided Face Video Restoration with Flow Matching

Link: https://arxiv.org/abs/2606.16401
Authors: Cem Eteke,Batuhan Tosun,Eckehard Steinbach
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Face video restoration from degraded observations is challenging, as it requires simultaneously recovering visual fidelity, temporal consistency, and subject identity. Existing approaches are often either reference-free, which can lead to identity loss when person-specific facial details are lost, or subject-specific, which limits generalization to unseen identities. We propose a subject-agnostic, reference-guided framework for identity-preserving face video restoration. Our method introduces bimodal perceptual-descriptive identity conditioning into a pretrained flow-based text-to-video generator and employs a two-stage training strategy to strengthen identity guidance during restoration. Experiments show that our approach improves restoration fidelity, temporal consistency, and identity preservation, achieving superior performance under challenging video degradations, including downsampling, blur, noise, and compression artifacts. The code is available under: this https URL.

[CV-58] SP3: Spherical Priors for Plug-and-Play Restoration

Link: https://arxiv.org/abs/2606.16396
Authors: Sean Man,Ron Raphaeli,Matan Kleiner,Or Ronai
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:In this paper, we introduce SP^3, a novel Plug-and-Play algorithm that accelerates maximum a posteriori image restoration by replacing denoisers with Spherical Encoders (SE) as generative priors. SP^3 approximates the intractable proximal prior step by utilizing the SE's tightly structured latent space as a robust projection onto the natural image manifold. Alternating this projection with a closed-form data-consistency step, via Half-Quadratic Splitting, achieves stable convergence without requiring gradient computation during inference. This unique formulation unlocks “anytime” restoration capabilities, producing sharp, plausible images from the first iteration. Evaluations across a variety of image restoration tasks demonstrate that SP^3 achieves perceptual quality comparable to state-of-the-art zero-shot diffusion and flow methods while being 3-630× faster.
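
The alternation described in the abstract (a prior projection followed by a closed-form data-consistency step) can be illustrated on a toy 1-D inpainting problem. This is only a sketch of generic Half-Quadratic Splitting under stated assumptions: the degradation is a binary mask, so the data step has the element-wise closed form x = (m·y + μ·z) / (m + μ), and the Spherical Encoder is replaced by a hypothetical moving-average stand-in (`smooth_prior`). None of the names or parameter values come from the paper.

```python
from typing import Callable, List


def data_step(y: List[float], mask: List[float], z: List[float], mu: float) -> List[float]:
    """Closed-form minimizer of 0.5*||m*x - y||^2 + 0.5*mu*||x - z||^2 (mu > 0)."""
    return [(m * yi + mu * zi) / (m + mu) for yi, m, zi in zip(y, mask, z)]


def smooth_prior(x: List[float]) -> List[float]:
    """Stand-in prior step: 3-tap moving average with edge replication."""
    n = len(x)
    return [(x[max(i - 1, 0)] + x[i] + x[min(i + 1, n - 1)]) / 3.0 for i in range(n)]


def hqs_restore(y: List[float], mask: List[float],
                prior: Callable[[List[float]], List[float]],
                mu: float = 0.5, iters: int = 20) -> List[float]:
    """Alternate the prior step and the data-consistency step."""
    x = list(y)
    for _ in range(iters):
        z = prior(x)                   # prior / projection step
        x = data_step(y, mask, z, mu)  # closed-form data-consistency step
    return x


# Toy signal: a constant 1.0 with the middle sample missing (mask = 0).
y = [1.0, 1.0, 0.0, 1.0, 1.0]
mask = [1.0, 1.0, 0.0, 1.0, 1.0]
x = hqs_restore(y, mask, smooth_prior)
```

No gradients are computed anywhere in the loop, which mirrors the gradient-free inference the abstract emphasizes; the quality of the result depends entirely on how good the prior step is.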

[CV-59] Towards UAV Image Dehazing: A UAV Atmospheric Scattering Model Benchmark and Geometry-Aware Deep Unfolding Network

Link: https://arxiv.org/abs/2606.16392
Authors: Wenxuan Fang,Jiangwei Weng,Yu Zheng,Junkai Fan,Guangfa Wang,Xiang Chen,Jian Yang,Jun Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In UAV applications, haze significantly obscures distant details and weakens structural information, hindering the recovery of details. Current UAV scenarios still face two key challenges: (i) paired hazy/clean images from the real world are unobtainable, while the classical atmospheric scattering model is inadequate for modeling the spatially non-uniform haze in UAV imagery; (ii) existing dehazing methods struggle to remove the heavy haze accumulated in the upper regions of UAV images. To address these issues, we first propose a UAV Atmospheric Scattering Model (UASM), which explicitly incorporates flight altitude, viewing pitch, and extinction to characterize the non-uniform haze distribution in UAV imaging. Based on UASM, we develop a physics-driven dehazing framework, termed Geometry-aware Proximal Deep Unfolding Network (GP-DUN). Specifically, GP-DUN consists of three key modules: a Latent Geometry Estimator (LGE) that infers transmittance consistent with UAV imaging geometry, a Geometry-aware Gradient Descent Module (GeoGDM) that embeds UASM into the data-fidelity term and performs physics-consistent closed-form updates, and a Pooling-Expert Proximal Mapping Module (PE-PMM) that learns an implicit prior to restore textures and structures beyond the capability of explicit physical modeling. In addition, we further construct UASM-HazeSet, which provides controllable paired synthetic data together with 2,285 real UAV haze images for testing. Extensive experiments show that GP-DUN consistently outperforms existing methods on both UASM-HazeSet and real UAV haze benchmarks.

[CV-60] GraphBEV++: Multi-Modal Feature Alignment for Autonomous Driving

Link: https://arxiv.org/abs/2606.16354
Authors: Ziying Song,Caiyan Jia,Lin Liu,Shaoqing Xu,Lei Yang,Yadan Luo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 30 pages, 7 figures

Abstract:Feature misalignment in BEV perception is a critical yet often overlooked challenge in autonomous driving, especially under calibration uncertainties between LiDAR and camera sensors. To address this issue, we propose a robust multi-modal fusion framework, GraphBEV++, which systematically mitigates projection-induced misalignment. The framework consists of two key modules: LocalAlign-v2 and GlobalAlign-v2. LocalAlign-v2 introduces neighborhood-aware depth features via graph matching to correct local misalignment. It supports both LSS-based and query-based BEV representations, making it compatible with BEVFusion and BEVFormer architectures for consistent cross-paradigm alignment. GlobalAlign-v2 encompasses two variants: Deformable and Diffusion. The Deformable variant addresses global misalignment in LSS-based multi-modal BEV by explicitly learning cross-modal feature offsets. In contrast, the Diffusion variant targets implicit misalignment in query-based BEV by injecting noise to simulate misalignment and employing a denoising process to recover aligned features. Experimental results show that GraphBEV++ achieves state-of-the-art performance under misalignment noise on nuScenes and Waymo subset, improves long-range detection on Argoverse2, and generalizes effectively to the 3D occupancy prediction task, consistently improving occupancy estimation accuracy and robustness under both clean and noisy settings. Furthermore, GraphBEV++ effectively alleviates misalignment issues in end-to-end autonomous driving. Compared with five baselines (UniAD, VAD, FusionAD, MomAD, and WoTE), it demonstrates superior performance in both open-loop (nuScenes) and closed-loop (Bench2Drive and NAVSIM) evaluations across perception, prediction, and planning tasks.

[CV-61] What Should a Streaming Video Model Remember?

Link: https://arxiv.org/abs/2606.16353
Authors: Haonan Ge,Yiwei Wang,Hang Wu,Yujun Cai
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Streaming video understanding models must answer queries at any moment during an ongoing stream, using only what they have observed so far and under fixed memory and computation budgets. Existing methods address this by adding memory banks, retrieval modules, or visual token compression to preserve long-range history. However, strong recent-window baselines show that indiscriminate history injection can dilute current-scene perception, suggesting that the key challenge is not whether to use memory, but how to allocate it selectively. We formulate this as budgeted online latent evidence allocation and propose SelectStream, a selective latent-memory framework that keeps the current observation directly visible to a frozen VLM while exposing historical information only through a compact, query-conditioned evidence budget. Three coordinated mechanisms govern when to write, what to preserve, and how to retrieve: surprise-driven adaptive windowing, priority-preserving consolidation, and query-conditioned graph reasoning over a fixed-capacity latent memory graph. Retrieved evidence is calibrated and injected as latent tokens for answer generation, without replaying frames or growing the context with stream length. Experimental results show that SelectStream achieves strong online streaming performance and preserves general video understanding, reaching 82.67% on StreamingBench, 67.03% on OVO-Bench, and 74.4% average accuracy on offline video benchmarks, while outperforming strong recent-window baselines and prior streaming memory methods.

[CV-62] When the Past Matters: FlashBack Memory for Precipitation Nowcasting

Link: https://arxiv.org/abs/2606.16342
Authors: Yuhao Du,Boxiao Huang,Chengrong Wu,Jiankai Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate precipitation nowcasting is crucial for disaster mitigation and socio-economic planning, yet existing methods often struggle with false alarms, missed events, and long-range dependency modeling at high spatiotemporal resolution. To address these challenges, we propose FlashBack Memory (FB), a module that dynamically retrieves key historical states and integrates them via an adaptive fusion gate, enhancing the spatiotemporal representation capability of recurrent-based models. We incorporate FB into PredRNN, PredRNNpp, MIM, MotionRNN, and PredRNN-V2, and evaluate on CIKM2017, Shanghai2020, and SEVIR datasets. Experimental results demonstrate that FB significantly improves MSE, MAE, SSIM, and CSI metrics, particularly for high-intensity rainfall and long-sequence predictions, while reducing false alarms and missed events and enhancing temporal consistency and spatial localization. The proposed method provides a general and efficient memory enhancement mechanism, improving the overall performance of recurrent-based precipitation nowcasting models.

[CV-63] Chronological Blindness: Benchmarking Temporal Reasoning in Vision-Language Models with CHRONOSIGHT

Link: https://arxiv.org/abs/2606.16334
Authors: Parthaw Goswami,Jaynto Goswami Deep
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Human perception of visual scenes is inherently temporal. We instinctively recognise whether a fruit is ripening or rotting, whether construction is progressing or being demolished, and approximately how much time separates two photographs of the same subject. Whether large vision-language models (VLMs) share this competence remains an open and practically important question. We introduce CHRONOSIGHT, a rigorously controlled benchmark evaluating five dimensions of visual temporal reasoning: CHRONORANK (chronological ordering of image sequences), CHRONOLOCATE (ordinal stage localisation from a single image), CHRONODELTA (estimation of time elapsed between two images on a logarithmic scale), CHRONOREVERSE (detection of temporally reversed sequences), and CHRONOODD (identification of a temporal outlier within a set). The benchmark comprises 1,000 items across eight process families (biological growth, food transformation, physical weathering, construction, environmental change, human ageing, astronomical phenomena, and urban dynamics) spanning timescales from minutes to millennia. We evaluate eight open-source VLMs (500M to 19B parameters) under two prompting regimes and collect human performance baselines. Human performance averages 0.89 across tasks; the best open model (Qwen2.5-VL-7B) reaches 0.40 under direct prompting, a gap we term chronological blindness. Lightweight LoRA fine-tuning on 151 examples raises CHRONODELTA accuracy from near-zero to 0.43, transferring zero-shot to related tasks (CHRONOODD: 0.37; CHRONOREVERSE: 0.64), suggesting the bottleneck is partly instruction following rather than visual perception. Benchmark, code, and predictions will be released upon acceptance.

[CV-64] Differentiable Packing of Irregular 3D Objects with Adaptive Container Estimation

Link: https://arxiv.org/abs/2606.16333
Authors: Palak Gupta,Shanmuganathan Raman
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: 20 pages, 8 figures, 5 tables. Under review at Computers & Graphics (Elsevier)

Abstract:Most existing approaches either fix the container in advance or optimize only a single container dimension through an outer search loop, leaving the remaining dimensions as a manual tuning problem. We present a differentiable packing framework that jointly optimizes all 6N object pose parameters and all three container side lengths inside a single gradient-based loop. The formulation combines six physics-inspired, differentiable loss terms computed directly on triangle meshes through axis-aligned bounding-box proxies. An adaptive squeezing mechanism periodically tightens the container whenever the overlap loss falls below a pair-count-scaled threshold, producing a large initial drop in container volume, followed by small refinements. All pairwise computations are written in tensor-broadcasting form, giving a 3.4 to 54 times speedup over a reference loop-based implementation. The pipeline is implemented in Python and PyTorch, with no physics engine, FFT library, or convex decomposition. On multiple object categories, the method produces containers that are 11 to 32 percent smaller than time-matched DBLF and simulated-annealing baselines at N = 100, while running in under 4 minutes per instance on a single consumer GPU.
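
To make the bounding-box proxy idea concrete, the sketch below computes a pairwise overlap penalty between axis-aligned boxes and the side lengths of the tightest enclosing container. It is written as a plain Python loop (the abstract's "reference loop-based" style, not the broadcasting form), and it covers only one plausible overlap term; the paper's six loss terms are not specified in the abstract, so the names `overlap_volume`, `total_overlap`, and `container_sides` are hypothetical.

```python
from itertools import combinations
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Box = Tuple[Vec3, Vec3]  # (min corner, max corner)


def overlap_volume(a: Box, b: Box) -> float:
    """Intersection volume of two axis-aligned boxes (0 if they are disjoint)."""
    vol = 1.0
    for axis in range(3):
        lo = max(a[0][axis], b[0][axis])
        hi = min(a[1][axis], b[1][axis])
        vol *= max(hi - lo, 0.0)  # ReLU-style clamp; piecewise differentiable
    return vol


def total_overlap(boxes: List[Box]) -> float:
    """Sum of intersection volumes over all unordered pairs."""
    return sum(overlap_volume(a, b) for a, b in combinations(boxes, 2))


def container_sides(boxes: List[Box]) -> Vec3:
    """Side lengths of the tightest axis-aligned container around all boxes."""
    sx, sy, sz = (
        max(b[1][k] for b in boxes) - min(b[0][k] for b in boxes) for k in range(3)
    )
    return (sx, sy, sz)


boxes = [
    ((0.0, 0.0, 0.0), (2.0, 2.0, 2.0)),
    ((1.0, 1.0, 1.0), (3.0, 3.0, 3.0)),  # overlaps the first box in a unit cube
    ((5.0, 5.0, 5.0), (6.0, 6.0, 6.0)),  # disjoint from both
]
penalty = total_overlap(boxes)
sides = container_sides(boxes)
```

In a gradient-based setting the same max/min/clamp operations would be applied to tensors so that the penalty can be backpropagated to the object poses.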

[CV-65] Attention-Based Prototype Calibration for Multi-Rater Few-Shot Medical Image Segmentation MICCAI2026

Link: https://arxiv.org/abs/2606.16325
Authors: Truong Vu,Minh Khoi Ho,Yutong Xie
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2026 main track

Abstract:Few-shot medical image segmentation methods typically assume a single ground-truth annotation, overlooking systematic variability across expert raters commonly observed in clinical datasets. We propose an attention-based prototype calibration framework for few-shot multi-rater segmentation that models rater-specific deviations from a consensus representation in prototype space. A lightweight yet principled attention operator directly refines rater prototypes without modifying the backbone feature extractor, making the approach fully compatible with existing prototype-based few-shot segmentation methods. This design preserves semantic consistency while enabling personalized segmentation outputs with minimal computational overhead. Experiments on multi-rater medical imaging datasets demonstrate consistent improvements over baseline prototype approaches, highlighting the effectiveness of structured prototype calibration for modeling annotation variability. Our code is available at this https URL.

[CV-66] HAFMat: Hybrid Priors Guided Adaptive Fusion for Single-Image Human Material Estimation

Link: https://arxiv.org/abs/2606.16323
Authors: Yu Jiang,Jiahao Xia,Jiongming Qin,Jianchi Sun,Chunxia Xiao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:Physically based rendering (PBR) material estimation is a fundamental appearance decomposition task with broad applications in virtual content creation, relighting, and digital human rendering. However, estimating PBR materials from a single human image remains highly ill-posed, since illumination, geometry, and reflectance are heavily entangled in the observed appearance. To mitigate this ambiguity, we propose HAFMat, a hybrid-prior-guided framework for single-image human material estimation. Our method introduces guidance maps that encode complementary cues, including appearance, body geometry, structure, and prior material predictions from pre-trained models. A key observation is that these guidance cues are heterogeneous: some cues mainly provide texture-level constraints, while others convey higher-level semantic information. To exploit this property, we design a Multi-layer Adaptive Feature Fusion Mechanism, which adaptively fuses guidance features with decoder features at different stages. This design enables texture-dominant and semantic-dominant cues to guide material decoding at appropriate levels, leading to more accurate and physically plausible material estimation. Extensive experiments on both synthetic and real data demonstrate that our method achieves state-of-the-art performance in material estimation and downstream relighting.

[CV-67] Training-free sparse attention based on cumulative energy filtering

Link: https://arxiv.org/abs/2606.16317
Authors: Chunlu Li,Yixuan Pan,Bai Du,Zhenyuan Chen,Yanzhao Li,Hui Dong,Hui Wang,Zhiqiang Zou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Sparse attention accelerates Diffusion Transformers (DiTs) for video generation by computing only the important tokens while skipping the rest. The token selection strategy is key to balancing sparsity and accuracy. We formulate the token filtering process as a dual-goal optimization problem: maximizing sparsity and minimizing accuracy degradation. Existing algorithms cannot fulfill both objectives simultaneously. For example, Top-p only considers the accuracy constraint, while Top-k maintains a fixed computational budget but loosens the accuracy constraint. This paper demonstrates that maintaining a fixed recall rate is sufficient for ensuring accuracy, whereas a fixed threshold is suboptimal for reducing computational cost. Therefore, we propose a dynamic thresholding scheme to improve sparsity while maintaining the same level of accuracy. Furthermore, our algorithm is deeply integrated with Flash Attention (FA), eliminating the need for any additional masking computation overhead. Experimental results on Wan 2.2 validate that, compared to the BLASST algorithm which is also integrated with FA, our dynamic thresholding strategy enhances sparsity from 61.42% to 82% with a VBench metric drop of less than 5%. This results in an approximate 15% in attention computation and a 1.61× increase in computational efficiency, which is 1.18× higher than that of BLASST.
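
The contrast the abstract draws between Top-k (fixed budget) and Top-p (fixed cumulative attention mass) can be sketched for a single query row as follows. This illustrates only the two baseline selection rules, not the paper's dynamic thresholding scheme or its Flash Attention integration; the scores and function names are made up for illustration.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_keep(probs: List[float], k: int) -> List[int]:
    """Fixed budget: keep the k keys with the largest attention weight."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return sorted(order[:k])


def top_p_keep(probs: List[float], p: float) -> List[int]:
    """Fixed recall: keep the smallest set whose cumulative weight reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return sorted(kept)


# A peaked row: two dominant keys and four near-irrelevant ones.
probs = softmax([4.0, 0.0, 3.0, 0.0, 0.0, 0.0])
kept_p = top_p_keep(probs, 0.9)
kept_k = top_k_keep(probs, 3)
recall_p = sum(probs[i] for i in kept_p)
```

On this peaked row the cumulative rule keeps only the two dominant keys, while a budget of three forces one low-weight key to be computed as well; on a flat row the situation reverses, which is the trade-off the abstract describes.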

[CV-68] Explainable Flood Segmentation on Sentinel-1 SAR Imagery: A Comparative Study of CNN and Transformer Architectures

Link: https://arxiv.org/abs/2606.16302
Authors: Arundhuti Banerjee,David Daou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Rapid and accurate flood prediction is essential for disaster response and mitigation planning. Synthetic Aperture Radar (SAR) sensors in satellites are well-suited for this purpose because they operate independently of weather and daylight conditions. Although SAR-based data enable all-weather flood monitoring, distinguishing flooded land from permanent water remains a significant challenge, particularly when flooding is defined strictly as inundated land. This study provides a comprehensive comparison of convolutional neural network (CNN) and vision transformer architectures for multi-class flood segmentation using Sentinel-1 SAR imagery, specifically trained to separate flooded land from permanent water bodies and land. Three state-of-the-art (SOTA) CNN-based models, U-Net, U-Net++, and DeepLabV3 with ResNet-34 backbone, and three SegFormer variants (b0, b1, b2) were evaluated on two benchmark datasets, the ETCI NASA dataset and Sen1Floods11, using scene-based data splits to ensure a realistic assessment of spatial generalization. The results demonstrate that SegFormer-b2 significantly outperforms the U-Net baseline on the ETCI dataset (higher flood IoU across all 7 test scenes in the Wilcoxon signed-rank test), while after fine-tuning on Sen1Floods11, the advantage narrows to within the range of scene variability and is concentrated in spatially fragmented flood events. The study includes both qualitative and quantitative explainability techniques to visually comprehend model decisions and systematically assess prediction reliability. Qualitative analysis reveals that SegFormer-b2 produces more spatially coherent Grad-CAM activations focused on flood-relevant features, while U-Net generates more informative uncertainty estimates along flood boundaries.

[CV-69] DDTNet: Degradation Disentanglement and Transfer Network for Test-Time All-in-One De-weathering Adaptation

Link: https://arxiv.org/abs/2606.16298
Authors: Kuan-Hung Lin,Fu-Jen Tsai,Yan-Tsung Peng,Min-Hung Chen,Chia-Wen Lin,Yen-Yu Lin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:All-in-one adverse weather image restoration aims to remove multiple degradations, such as rain, haze, and snow, using a single unified model. Despite their broad applicability, existing methods typically compromise performance, delivering balanced but suboptimal results for individual degradation types. This issue becomes more pronounced when a domain gap exists between training and testing data. Motivated by the observation that modeling degradation patterns is more feasible than recovering clean content, we propose the Degradation Disentanglement and Transfer Network (DDTNet), which focuses specifically on degradation transfer. By disentangling degradation patterns from target-domain degraded images and transferring them to source domain clean images, DDTNet generates domain-adaptive paired training data. These pairs are then used to fine-tune restoration models, significantly enhancing their adaptability across diverse weather conditions and domains. The core of DDTNet is the Degradation Disentanglement Module (DDM), which comprises Degradation Coupled Attention (DCA) to capture both general and weather-specific features, thereby enabling effective disentanglement and transfer of degradation patterns. Experimental results demonstrate that DDTNet significantly and consistently improves existing all-in-one models across real-world deraining, desnowing, and dehazing datasets.

[CV-70] Sex-based Network-Specific Differences in Connectomes: A Krakencoder-Based Analysis

Link: https://arxiv.org/abs/2606.16294
Authors: Vibhashree S H,Debanjali Bhattacharya,Vamshi Krishna Kancharla,Neelam Sinha
Subjects: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:This study examines how deficiencies in one brain connectome modality propagate to the other, using the Krakencoder as a simulation framework. Structural and functional connectomes from 702 healthy participants in the Human Connectome Project were analyzed, with the impact of each of the Yeo-7 functional networks assessed separately. Seven scenarios were considered, each involving the removal of a single network while the remaining networks were preserved. The resulting perturbations in cross-modal predictions were quantified using three complementary metrics: KL divergence on eigenvalue spectra, Frobenius norm, and Wasserstein distance. In addition, the persistence of sex-specific information within the predicted connectomes was evaluated. Across all metrics and both prediction directions, the Default Mode Network produced the largest perturbations, whereas the Somatomotor network yielded the smallest. Sex differences in network-level perturbation signatures were subtle, with the best result being an accuracy of 66.09% from connectomes predicted under network-removal conditions. In contrast, connectomes predicted from intact inputs achieved substantially higher sex classification accuracy, reaching up to 84.76%. These findings confirm that full predicted connectomes retain considerably more sex-discriminative information than perturbation-derived signatures alone.
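
The three perturbation metrics named in the abstract can be sketched in plain Python as below. This is an illustration under assumptions the abstract does not confirm: eigenvalue spectra are taken as already computed, clipped at zero and normalized to sum to one before the KL divergence, and the Wasserstein distance is the 1-D, equal-sample-size form (mean absolute difference of sorted values). All function names and inputs are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def frobenius_distance(a: Matrix, b: Matrix) -> float:
    """Frobenius norm of the element-wise difference of two matrices."""
    return math.sqrt(sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))


def kl_divergence(p_raw: List[float], q_raw: List[float], eps: float = 1e-12) -> float:
    """KL(p || q) after clipping at zero and normalizing (assumed preprocessing)."""
    p = [max(v, 0.0) + eps for v in p_raw]
    q = [max(v, 0.0) + eps for v in q_raw]
    sp, sq = sum(p), sum(q)
    return sum((a / sp) * math.log((a / sp) / (b / sq)) for a, b in zip(p, q))


def wasserstein_1d(a: List[float], b: List[float]) -> float:
    """1-D Wasserstein-1 distance between equal-size empirical samples."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


# Toy inputs standing in for an intact and a perturbed predicted connectome.
intact = [[1.0, 2.0], [3.0, 4.0]]
perturbed = [[1.0, 2.0], [3.0, 2.0]]
frob = frobenius_distance(intact, perturbed)
kl_same = kl_divergence([1.0, 3.0], [1.0, 3.0])
kl_diff = kl_divergence([1.0, 1.0], [1.0, 3.0])
w1 = wasserstein_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])
```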

[CV-71] RealityBridge: Bridging Editable 3D Gaussian Splatting Driving Simulations and Real-World Videos

Link: https://arxiv.org/abs/2606.16278
Authors: Zhenhua Wu,Yun Pang,Mingkun Chang,Yuwei Ning,Liangzhi Wang,Yi Xiao,Guanbin Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Long-tail hazardous scenarios are essential for safety-oriented autonomous driving, yet they are difficult to collect and reproduce at scale. Editable 3D Gaussian Splatting (3DGS) simulation offers a promising alternative by reconstructing real driving scenes and supporting controllable scene editing. However, edited 3DGS-rendered videos still suffer from a significant Sim-to-Real gap, including rendering artifacts, degraded foreground assets, inconsistent illumination, and temporal flickering. Existing restoration and video generation methods are insufficient for this task, as they often fail to jointly repair 3DGS-specific artifacts, improve visual realism, and ensure temporal consistency. To fill this gap, we propose RealityBridge, a structure-preserving and asset-aware Sim-to-Real framework for edited 3DGS driving videos. RealityBridge uses multimodal controls, including rendered videos, foreground masks, edge maps, and semantic masks, together with a lightweight GateNet for adaptive condition allocation across backbone layers. We further construct targeted training data and introduce autoregressive long-video training with reward-guided post-training to improve restoration quality, temporal stability, and hallucination suppression. Extensive experiments on internal and public driving datasets show that RealityBridge outperforms existing methods in artifact removal, illumination harmonization, and long-sequence temporal consistency.

[CV-72] GraphWorld: Long-Horizon Planning with World Models for End-to-End Autonomous Driving

Link: https://arxiv.org/abs/2606.16274
Authors: Ziying Song,Caiyan Jia,Lin Liu,Lei Yang,Shengkai Zhang,Feiyang Jia,Fengda Zhao,Peiliang Wu,Shaoqing Xu,Chen Lv,Yadan Luo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 5 figures

Abstract:End-to-end autonomous driving has made significant progress by unifying perception, prediction, and planning within a single learning framework, achieving strong performance in short-horizon decision making. However, most existing E2E-AD methods remain confined to short-horizon planning and lack the ability to model long-term temporal dependencies, which severely limits their generalization and security in complex and highly interactive driving scenarios. In this work, we propose GraphWorld, an E2E-AD framework that explicitly enhances long-horizon planning through latent world modeling. We introduce an Ego-Centric Interaction Graph, which adaptively models critical neighboring agents based on spatial proximity, and propagates relational context to planning queries via cross-node cross-attention. We present a World-State-Conditioned Planning that learns ego-centric latent world representations by modeling interactions between an ego vehicle and surrounding agents. This latent world state captures key interaction dynamics and safety-relevant semantics, and serves as a conditioning signal to guide long-horizon, safety-aware trajectory planning. Extensive experiments on Bench2Drive, NAVSIMv1/2, and nuScenes demonstrate that GraphWorld significantly reduces collision rates and improves long-horizon planning performance, validating its effectiveness in complex driving environments.

[CV-73] Contrastive Learning for Seismic Horizon Tracking with Domain-Specific Priors

Link: https://arxiv.org/abs/2606.16271
Authors: Alexandre Thouvenot,Lionel Boillot,Vincent Gripon
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 5 pages, 5 figures. Submitted to the IEEE GRSL for possible publication

Abstract:Unsupervised 3D seismic horizon tracking faces a key limitation: signal-based propagators provide accurate trace-level alignment but often fail near faults, whereas texture-driven deep models are more robust to discontinuities, typically at the cost of labeled data requirements and reduced trace-level precision. We propose a self-supervised fusion of both paradigms in which signal-derived local horizon correspondences act as domain-specific priors to train a texture-based deep learning model. Specifically, we estimate reliable trace-to-trace flows from reflector slopes and use them to form positive pairs in a contrastive objective, while restricting training to high-confidence neighborhoods, optionally augmented with a fault mask. The objective is not to infer ambiguous correspondences close to discontinuities, but to preserve horizon identity across them. As a result, the network learns voxel-wise embeddings that preserve local signal continuity while enabling horizon propagation beyond discontinuities through similarity search. Experiments on the public F3 dataset and a faulted synthetic dataset achieve lower mean absolute error (MAE) than unsupervised baselines and competitive performance against a semi-supervised method using a single labeled slice.
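
The abstract does not say which contrastive objective is used. As one common choice, the sketch below shows an InfoNCE-style loss in which an anchor embedding is pulled toward its positive (here, a voxel assumed to lie on the same horizon according to the slope-derived flow) and pushed away from negatives. The embeddings, temperature, and function names are illustrative assumptions, not the paper's settings.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor: Sequence[float], positive: Sequence[float],
             negatives: List[Sequence[float]], temperature: float = 0.1) -> float:
    """InfoNCE: -log( exp(s_pos / t) / sum_j exp(s_j / t) ), computed stably."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


# Well-aligned positive: the loss is close to zero.
loss_good = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
# Positive orthogonal to the anchor while a negative matches it: the loss is large.
loss_bad = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Restricting training to high-confidence neighborhoods, as the abstract describes, would correspond to only forming `(anchor, positive)` pairs where the flow estimate is trusted.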

[CV-74] KeepLoRA++: Continual Learning with Layer-Scaled Residual Gradient Adaptation

Link: https://arxiv.org/abs/2606.16256
Authors: Mao-Lin Luo,Yi-Lin Zhang,Zi-Hao Zhou,Yankun Hong,Xialiang Tong,Mingxuan Yuan,Tong Wei,Min-Ling Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Continual learning for pre-trained vision-language models requires balancing three competing objectives: retaining pre-trained knowledge, preserving knowledge from a sequence of learned tasks, and maintaining the plasticity to acquire new knowledge. This paper presents KeepLoRA++, balancing these objectives through a unified dual-dimensional knowledge retention mechanism. We analyze knowledge distribution of Transformer architecture from both inter-layer and intra-layer perspectives. The inter-layer perspective examines how retention is distributed across layers, while the intra-layer perspective focuses on the parameter space within each layer. Our analysis reveals a structural property: general transferable knowledge is mainly encoded in the shallow layers and the principal subspace of the parameters, while task-specific adaptations are localized in the deep layers and the residual subspace. Motivated by this insight, KeepLoRA++ introduces a layer-scaled residual gradient adaptation method. New tasks are learned by restricting LoRA parameter updates to the residual subspace, combined with a shallow-to-deep layer scaling, to prevent interference with previously acquired capabilities. Specifically, the gradient of a new task is projected onto a subspace orthogonal to both the principal subspace of the pre-trained model and the dominant directions of previous task features, while simultaneously assigning smaller update magnitudes to shallow layers and larger ones to deeper layers. Our theoretical analysis and empirical evaluations confirm that KeepLoRA++ successfully balances these three competing objectives, consistently outperforming representative baselines across image classification, visual question answering, and video understanding tasks.
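
The two ingredients described in the abstract (projecting a gradient onto the complement of a protected subspace, then scaling the update by layer depth) can be sketched on flat vectors as follows. This is a toy illustration, not the authors' implementation: the protected subspace is given as a list of orthonormal vectors, and the linear depth schedule in `layer_scale` is a hypothetical choice since the abstract does not specify the scaling rule.

```python
from typing import List


def project_out(grad: List[float], basis: List[List[float]]) -> List[float]:
    """Remove the components of grad along each vector of an orthonormal basis."""
    residual = list(grad)
    for u in basis:
        coeff = sum(g * ui for g, ui in zip(grad, u))
        residual = [r - coeff * ui for r, ui in zip(residual, u)]
    return residual


def layer_scale(layer_index: int, num_layers: int) -> float:
    """Hypothetical schedule: smaller updates for shallow layers, larger for deep."""
    return (layer_index + 1) / num_layers


def scaled_residual_update(grad: List[float], basis: List[List[float]],
                           layer_index: int, num_layers: int) -> List[float]:
    """Project the gradient into the residual subspace, then apply depth scaling."""
    scale = layer_scale(layer_index, num_layers)
    return [scale * r for r in project_out(grad, basis)]


grad = [3.0, 4.0, 5.0]
protected = [[1.0, 0.0, 0.0]]  # stand-in for principal / previous-task directions
shallow = scaled_residual_update(grad, protected, layer_index=0, num_layers=4)
deep = scaled_residual_update(grad, protected, layer_index=3, num_layers=4)
```

Because the update has no component along the protected directions, a step along it leaves the model's behavior in that subspace unchanged to first order, which is the intuition behind restricting learning to the residual subspace.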

[CV-75] UniDDT: Unifying Multimodal Understanding and Generation with Decoupled Diffusion Transformer

Link: https://arxiv.org/abs/2606.16255
Authors: Shuai Wang, Liang Li, Yang Chen, Ruopeng Gao, Yao Teng, Limin Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work was completed in November 2025

Abstract:Unified Multimodal Models (UMMs) have emerged as a critical direction for general-purpose multimodal intelligence, integrating understanding and generation into a single framework. However, existing UMMs face prominent challenges: (1) the inherent learning conflicts between visual understanding and generation tasks, leading to suboptimal modeling in both tasks; (2) different understanding and generation visual spaces impeding scalability; (3) over-reliance on task-specific data that neglects the duality of text-image understanding and generation. To address these challenges, we propose UniDDT, which leverages a Noisy ViT encoder along with an LLM to unify semantic encoding for visual generation and understanding tasks, while employing a separate diffusion decoder to decouple diffusion decoding from text decoding. With this Noisy ViT encoder, UniDDT is able to leverage the latent space as a unified visual representation, enabling seamless compatibility between understanding and generation tasks. Thus, the scalability within the generation tasks and the semantic expressiveness within understanding tasks can be balanced. Also, we construct dual data structures from the same image-text pairs, fostering interdependence between the generation and understanding data to exploit their inherent duality. Extensive experiments demonstrate that UniDDT achieves effective unification of multimodal understanding and generation with enhanced semantic consistency and scalability. For visual generation tasks, our UniDDT achieves 0.87 GenEval score and 86.9 DPG overall score. For multimodal understanding tasks, our UniDDT achieves 1699.5 score on MME benchmark and 76.5 overall score on SEEDbench.

[CV-76] Learned Image Compression for Vision-Language-Action Models

Link: https://arxiv.org/abs/2606.16253
Authors: Hyeonjun Kim, Jegwang Ryu, Sangbeom Ha, Junhyeok Lee, Jun-Hyuk Kim, Hyemin Ahn, Jaeho Lee
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision-language-action (VLA) models increasingly rely on high-frequency multi-camera observations, making visual communication a major bottleneck for real-time robotic control in bandwidth-constrained or distributed deployment settings. Existing image and video codecs, however, are designed to preserve generic visual fidelity rather than the control performance of downstream VLA policies. In this work, we introduce SPARC (SPatially Adaptive Rate Control), a learned image compression framework tailored for VLA-driven robots. Our key observation is that the importance of visual information varies substantially across both camera views and spatial regions within an image. Based on this observation, SPARC employs a lightweight temporal mask selector that adaptively allocates bitrate over latent representations according to task relevance while leveraging temporal context. We further introduce a tilted rate loss that stabilizes training by reducing the tendency of entropy-based objectives to over-suppress rare yet task-critical visual patterns. Experiments on diverse robotic benchmarks, including RoboCasa365, VLABench, and LIBERO, show that SPARC consistently achieves stronger control performance than conventional image/video codecs and recent learned compression methods under the same bitrate budget. We additionally demonstrate real-world deployment benefits in remote-control settings, where our method substantially improves the bitrate-success tradeoff.

[CV-77] Structure-Semantic Co-optimized Latent Diffusion Model for Fast Visual Anagram Synthesis

Link: https://arxiv.org/abs/2606.16241
Authors: Xiang Gao, Yunpeng Jia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Visual anagram is an intriguing form of art creation wherein a single image presents different conceptual interpretations under transformations such as flipping or rotation. Recent work has achieved visual anagram synthesis by leveraging pretrained text-to-image (T2I) diffusion models, yet still suffers from several key limitations including computational inefficiency, suboptimal aesthetic quality, and weak semantic fidelity and expressiveness. This work focuses on generating visual anagrams with substantially improved visual quality at minimal computational cost, thereby advancing intelligent creation of illusionary digital art. To increase image resolution while reducing time overhead, we adapt the cutting-edge parallel denoising algorithm from a pixel-based T2I model to an adversarially distilled latent-based one, and accordingly propose a structure-semantic co-optimization (S2CO) framework to counteract the consequent visual degradation. As the core of our approach, the S2CO framework comprises three key innovations: (i) null-text structure alignment optimization; (ii) semantic enhancement optimization; (iii) attention-guided noise fusion. Building upon these components, our method, dubbed S2CO-Anagram, is able to generate higher-resolution anagram images with noticeably superior visual harmony and semantic faithfulness than related SOTA approaches, all while achieving substantially faster inference speed. Code will be publicly available.

[CV-78] Propagating Structural Guidance: Synthesizing Fluorescein Angiography from Fundus Images and Sparse OCT Scans MICCAI2026

Link: https://arxiv.org/abs/2606.16234
Authors: Tengfei Ma, Ruiqi Wu, Chenran Zhang, Ye Geng, Na Su, Xiangyuan Duanmu, Tao Zhou, Yi Zhou, Wen Fan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to MICCAI 2026 (Early Accept)

Abstract:Fundus fluorescein angiography (FFA) is critical for assessing retinal vascular abnormalities, but its acquisition is invasive and not always feasible. In contrast, color fundus photography (CFP) is non-invasive and widely accessible, which has motivated studies on CFP-to-FFA synthesis. However, prior works rely solely on CFP surface texture, fundamentally limiting the ability to reconstruct functional vascular information and subtle pathological changes. To address this, we propose a novel framework that synthesizes FFA from CFP with structural guidance provided by optical coherence tomography (OCT). We construct a multi-modal retinal imaging dataset with paired CFP, FFA, and OCT from 3,676 patient eyes, the first tri-modally aligned dataset in retinal imaging. To bridge the spatial gap between OCT and fundus modalities, we propose a Spatially Aligned Cross-Modal Fusion (SACMF) module that projects depth-resolved OCT features onto the fundus plane and injects them into the CFP encoder via adaptive layer normalization. Beyond feature fusion, we further introduce Token-wise Cross-Modality Alignment (TCMA), a token-level contrastive learning strategy that explicitly aligns CFP and FFA representations at corresponding spatial positions. Our method achieves superior synthesis performance compared to state-of-the-art methods. Moreover, extensive experiments demonstrate that the FFA images synthesized by our approach bring greater improvements in downstream disease diagnosis performance than existing methods, highlighting the clinical potential of our approach as a non-invasive decision-support tool in routine workflows. The code is available at this https URL.
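To make the idea of token-level contrastive alignment concrete, the sketch below computes an InfoNCE-style loss in which the token at position i in one modality is the positive for the token at position i in the other, and all other positions act as negatives. The InfoNCE form, the temperature value, and the function names are assumptions for illustration; the abstract does not specify the exact TCMA loss.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def token_info_nce(src_tokens, tgt_tokens, temperature=0.1):
    """Mean InfoNCE loss where position i in `src_tokens` should match position i in `tgt_tokens`."""
    src = [normalize(t) for t in src_tokens]
    tgt = [normalize(t) for t in tgt_tokens]
    loss = 0.0
    for i, s in enumerate(src):
        logits = [sum(a * b for a, b in zip(s, t)) / temperature for t in tgt]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_norm - logits[i]
    return loss / len(src)


# Toy tokens: two spatial positions with 2-D features.
cfp_tokens = [[1.0, 0.0], [0.0, 1.0]]
aligned = token_info_nce(cfp_tokens, [[1.0, 0.0], [0.0, 1.0]])
misaligned = token_info_nce(cfp_tokens, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when corresponding positions agree and large when the positions are swapped, which is the behavior a position-wise alignment objective is meant to reward.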

[CV-79] LUCID: Learned Undersampling-Adaptive Consistency-Guided Inference with Deterministic Flow Matching for Sparse-View CT Reconstruction

Link: https://arxiv.org/abs/2606.16212
Authors: Jigang Duan, Jiayi Wang, Heran Wang, Ping Yang, Genwei Ma, Xing Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Sparse-view CT reduces radiation dose and scanning time by acquiring fewer projection views, but angular undersampling makes reconstruction severely ill-posed, causing streak artifacts, structural blurring, and loss of fine details. Existing supervised methods are often tied to specific sampling settings, whereas generative methods may introduce anatomically inconsistent hallucination-like structures under severe undersampling. We propose Lucid, a sparsity-adaptive, consistency-guided reconstruction framework based on a Flow Matching generative prior for sparse-view CT. Lucid is trained only on high-quality CT images to learn a continuous transport between a Gaussian distribution and the high-quality CT image distribution, independent of view sampling. During inference, the sampling sparsity level is explicitly incorporated to adapt the generative trajectory of a single pretrained model. Specifically, Lucid constructs a degradation-matched initial state by sparsity-weighted fusion of the sparse-view FBP image and Gaussian noise, performs sparsity-modulated Flow Matching updates, and applies projection-domain data-consistency correction after each prior update. Experiments under multiple sparse-view settings show that Lucid achieves stable reconstruction performance across different sampling densities, improves image quality and structural fidelity, and reduces the risk of hallucination-like structures in generative sparse-view CT reconstruction.

[CV-80] DynFS-MoE: Dynamic Functional-Structural Mixture-of-Experts for Post-Traumatic Epilepsy Diagnosis

Link: https://arxiv.org/abs/2606.16203
Authors: Jun-En Ding, Spencer Chen, Henry Noren, Daniel Valdivia, Christine Yohn, Suhina Patel, Taylor Zink, Hai Sun, Feng Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Post-traumatic epilepsy (PTE) is a severe complication of traumatic brain injury (TBI), yet early identification remains challenging due to the complex structural and functional alterations it induces in the brain. To address this, we propose a dynamic multimodal Mixture-of-Experts (MoE) framework that integrates functional and structural MRI through time-aware functional-structural encoding and class-conditioned expert routing. Within this framework, modality-specific and cross-modal experts learn complementary representations, while a Modality-Class MoE (MCoE) module dynamically dispatches expert weights according to each classification objective. Experimental results across three binary classification tasks demonstrate that the framework consistently outperforms static fusion baselines, and high-interpretability analyses further reveal meaningful region-of-interest (ROI) interactions. This dynamic multimodal expert framework effectively captures class-dependent brain interaction patterns and provides an interpretable approach for PTE diagnosis and risk stratification.

[CV-81] EgoPhys: Learning Generalizable Physics Models of Deformable Objects from Egocentric Video

Link: https://arxiv.org/abs/2606.16202
Authors: Hyunjin Kim, Ri-Zhao Qiu, Guangqi Jiang, Xiaolong Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Project Page: this https URL

Abstract:Humans naturally understand object physics through everyday interactions, but faithfully predicting complex deformable dynamics, such as elastic materials and fabrics, remains a major challenge for computer vision and robotics. We present EgoPhys, a framework that constructs deformable physical digital twins from egocentric RGB-only video using generalizable priors. EgoPhys overcomes the limitations of existing methods to enable controllable deformable digital twin generation from egocentric videos by distilling per-object inverse-physics solutions into a compact codebook, enabling prediction of dense spring stiffness fields for unseen objects without per-spring test-time optimization. Trained with generalizable priors from diverse egocentric interactions, EgoPhys outperforms baselines in reconstruction, future prediction, and zero-shot generalization. To support training and evaluation, we curate an egocentric interaction dataset covering diverse deformable objects, scenes, and manipulation styles. We deploy EgoPhys on a real xArm6 robot, demonstrating that a digital twin initialized from a single egocentric human play video can serve as an internal world representation to aid in deformable-object planning, highlighting egocentric RGB observations as a scalable path toward real-to-sim pipelines.

[CV-82] GRACE: Boosting Video MLLMs with Grounded Action-Centric Evidence for Viewer Sentiment Prediction

Link: https://arxiv.org/abs/2606.16198
Authors: Ruoxuan Yang, Tieyuan Chen, Xiaofeng Huang, Haibing Yin, Jun Wang, Xiping Chen, Jun Yin, Xuesong Gao, Weiyao Lin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 5 figures

Abstract:Viewer sentiment prediction in video advertisements aims to infer the latent affective response evoked in the audience. To bridge the gap between what is shown and what is felt, models must deduce hidden viewer emotions from explicit visual narratives, concrete character-object interactions, and visible textual cues. However, standard Multimodal Large Language Models (MLLMs) typically rely on holistic frame representations, which leave these fine-grained, affect-relevant events implicit and complicate precise emotional reasoning. To address this, we propose a grounded action-centric evidence augmentation framework that enhances video MLLMs’ clue extraction and comprehension by introducing explicit event structure and localized visual evidence. Our method extracts temporally ordered subject-verb-object (SVO) triplets and auxiliary visible textual cues from action-centric video descriptions, grounds subject and object entities as visual entity crops, and then enables the MLLM to perform clue-enhanced emotional reasoning based on these extracted structured clues. In this way, action triplets specify “what happens”, while grounded visual entity crops anchor “who or what participates in each event” to concrete visual evidence. Experiments on the Pitts dataset show consistent improvements over Qwen2.5-VL and Qwen3-VL baselines. Ablation studies, cross-dataset evaluation on AdsQA, and transfer experiments on an emotion-focused TVQA subset further support the effectiveness and generalization of our approach.

[CV-83] When Confidence Lacks Concepts: Interpretable OOD Detection via Representation Perturbations

Link: https://arxiv.org/abs/2606.16196
Authors: Anju Chhetri, Pratik Shrestha, Ramesh Rana, Prashnna Gyawali, Binod Bhattarai
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deep neural networks have achieved remarkable performance across medical imaging tasks, yet their tendency to overgeneralize under distributional shifts poses a major obstacle to safe clinical deployment. Out-of-Distribution (OOD) detection methods aim to mitigate this risk, but most existing approaches rely on opaque internal signals with poorly understood semantic meaning, limiting trust in safety-critical settings. In this work, we propose an interpretable OOD detection framework that probes the stability of model predictions under class-conditioned semantic perturbations. Leveraging sparse autoencoders (SAEs), we learn class-specific concept vectors from in-distribution data that disentangle dense intermediate representations into sparse, semantically meaningful components. At inference, we perturb deeper-layer representations using the concept vectors associated with the model’s predicted class and measure the class logits stability. We hypothesize that in-distribution samples exhibit low sensitivity to such perturbations, as their representations align with class-specific semantic directions, whereas OOD samples show amplified deviations due to representational misalignment. By framing OOD detection as a concept conditioned stability analysis, our approach provides both a discriminative OOD signal and an interpretable lens into the internal mechanisms driving model uncertainty, making it particularly suitable for high stakes medical applications.

[CV-84] Cascaded Sparse Autoencoders Learn Multi-Level Visual Concepts in Multimodal LLMs

Link: https://arxiv.org/abs/2606.16193
Authors: Yusong Zhao, Hengyi Wang, Tanuja Ganu, Akshay Nambi, Hao Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated strong performance on vision-language tasks, yet their internal visual representations remain difficult to interpret. Sparse Autoencoders (SAEs) provide a scalable way to decompose dense model activations into sparse, interpretable features. However, existing SAE architectures primarily recover flat feature dictionaries and are less suited for explicit multi-level concept organization. In this paper, we introduce cascaded sparse autoencoders (CSAEs) for learning hierarchical visual concepts in MLLMs. Rather than nesting or stacking SAE sparse activation codes, CSAEs train a second-level SAE directly on the decoder weights of the first-level SAE, treating learned low-level feature directions as inputs for higher-level abstraction. This design enables CSAEs to learn “concepts of concepts” while avoiding drawbacks from the shared-prefix coupling of nesting, Matryoshka-style hierarchies and the bottlenecks of naively stacked SAEs. Experiments across Qwen3-VL, Gemma-3, and LLaVA on multiple visual datasets show that CSAEs improve interpretability in terms of hierarchical concept coherence over state-of-the-art SAE baselines. Results on concept steering further demonstrate that the learned concept groups support effective group-level interventions in MLLM outputs.

[CV-85] TEASR: Training-Efficient Any-Step Diffusion Transformer for Real-World Image Super-Resolution

Link: https://arxiv.org/abs/2606.16188
Authors: Xiang Gao, Chenxin Zhu, Yushun Fang, Qiang Hu, Xiaoyun Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion models excel in Real-World Image Super-Resolution (Real-ISR) due to their powerful generative priors but suffer from slow iterative sampling. Although existing one-step distillation methods accelerate inference, they typically require auxiliary teacher models that inflate training memory and restrict scalability to large-scale architectures. Furthermore, these fixed-step models lack the flexibility to trade off speed for quality. In this paper, we propose TEASR, a training-efficient any-step diffusion framework for Real-ISR that enables both one-step and multi-step restoration within a unified model. Our key idea is to perform self-adversarial distillation within a single diffusion model, eliminating the need for auxiliary teachers or discriminators. Specifically, we propose a timestep-aware rectification strategy that stabilizes one-step generation across noise levels. These two designs further enable the distillation of 20B-parameter diffusion models on a single GPU, significantly improving training efficiency. Moreover, we introduce a dual-branch diffusion transformer with a decoupled timestep condition that separates the current noise state from the denoising target to enhance sampling quality. Extensive experiments demonstrate that TEASR supports seamless any-step sampling and consistently outperforms state-of-the-art methods across multiple datasets.

[CV-86] Learned JPEG Compression for DNN Vision

Link: https://arxiv.org/abs/2606.16185
Authors: Kaixiang Zheng, Ahmed H. Salamah, Siyu Chen, En-Hui Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:JPEG, a lossy image compression technique designed for human viewers, has maintained its dominance for decades. However, in the era of artificial intelligence (AI), a substantial portion of image data, often compressed by JPEG, is and will continue to be consumed by deep neural networks (DNNs) instead of humans, thus creating a need to optimize JPEG for DNN inference performance. To this end, we propose learned JPEG compression for DNN vision (J4D), a novel training framework for determining JPEG encoding parameters to minimize compression rate while maximizing DNN inference performance. The major challenge of solving this optimization problem lies in representing the JPEG codec and compression rate in closed form. By incorporating a differentiable soft quantizer based on a probabilistic quantization scheme, we not only obtain a differentiable proxy for the JPEG codec, but are also able to compute the entropy of the coded source analytically, which is a close estimate of the actual compression rate. Equipped with both the differentiable JPEG codec and the information-theoretic rate estimator, we are then able to solve the aforementioned optimization problem with backpropagation. After training, the learned encoding parameters will be subsequently used in actual JPEG encoding based on probabilistic quantization. Extensive experimental results across multiple datasets and DNN architectures demonstrate that J4D consistently and significantly outperforms the default JPEG and other competitive JPEG codecs optimized for DNNs. Notably, compared to the default JPEG, J4D achieves an increase in accuracy by as much as 11.60% at the same rate, or a reduction of compression rate up to 80.05% at the same accuracy. Additionally, with the help of J4D, we show the potential to design universal JPEG encoding parameters for various DNN architectures for the first time.
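As a rough illustration of how a soft quantizer can yield both a differentiable output and an analytic rate estimate, the sketch below assigns each coefficient a probability over integer quantization levels and computes the entropy of the averaged distribution. The softmax-over-squared-distance form, the `temperature` parameter, and all function names are assumptions made for illustration; the paper's probabilistic quantization scheme may differ.

```python
import math


def level_probs(x, levels, temperature=0.1):
    """Probability of mapping coefficient `x` to each level (softmax over negative squared distance)."""
    logits = [-((x - q) ** 2) / temperature for q in levels]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    norm = sum(weights)
    return [w / norm for w in weights]


def soft_quantize(x, levels, temperature=0.1):
    """Expected quantized value, a smooth stand-in for hard rounding."""
    return sum(p * q for p, q in zip(level_probs(x, levels, temperature), levels))


def entropy_bits(coeffs, levels, temperature=0.1):
    """Entropy of the average level distribution, a proxy for bits per coefficient."""
    avg = [0.0] * len(levels)
    for x in coeffs:
        for k, p in enumerate(level_probs(x, levels, temperature)):
            avg[k] += p / len(coeffs)
    return -sum(p * math.log2(p) for p in avg if p > 0)


levels = list(range(-4, 5))
coeffs = [0.1, -0.2, 0.05, 3.9, 0.0, -0.1]
rate_mixed = entropy_bits(coeffs, levels)
rate_flat = entropy_bits([0.0] * 6, levels)
```

Coefficients that all fall near a single level give a lower entropy estimate than a mix that also occupies a distant level, which is the signal a rate term in a training loss would penalize.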

[CV-87] Closed-Loop Triplet Synergistic Generation for Long-Form Video

Link: https://arxiv.org/abs/2606.16184
Authors: Xinlei Yin, Xiulian Peng, Xiao Li, Zhiwei Xiong, Yan Lu
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:Multi-shot long-form video generation remains challenging due to identity drift and compounding inconsistencies across shots. While storyboard-driven pipelines improve controllability, they are often executed in a feed-forward manner, with limited mechanisms to incorporate generated visual evidence back into subsequent conditioning. We propose CoTriSyGen, an agentic framework that formulates multi-shot long video generation as a closed-loop visual-text-memory synergy process, where planned intent, persistent memory, and generated visuals are jointly leveraged for iterative correction and long-range coherence. A vision-language-model-based analyzer reasons over this triplet and produces updates to both prompts and memory along two pathways: (i) intra-shot refinement, which triggers targeted regeneration when semantic or compositional violations are detected and refines image-to-video prompt for coherent motions; and (ii) inter-shot refinement, which rewrites subsequent-shot prompts to propagate newly manifested entities or attributes and improve prompt quality (e.g., compositional grounding and cinematic fluency) based on generated evidence. The loop is grounded in an entity-centric memory modeled as a mutable visual state that evolves as the story progresses, which is continuously updated by both the generator and the analyzer by adding new and evolved entities to reflect appearance changes, accumulated multi-view evidence, and multi-entity compositions. Experiments on our curated StoryBench benchmark demonstrate substantial improvements in cross-shot consistency, prompt adherence, and cinematic continuity over representative methods.

[CV-88] To forget is to preserve: Machine Unlearning for 3D medical image segmentation

Link: https://arxiv.org/abs/2606.16180
Authors: Nitesh Kumar Singh, Akhilesh Singh, Arjun Arora
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 9 pages, 5 figures

Abstract:Data privacy laws such as the General Data Protection Regulation (GDPR) [1] allow individuals to request that their personal information be erased from trained machine learning models, which has motivated research on unlearning data from models as a way to comply with these laws. In this regard, based on four mechanics, we consider several approximate unlearning strategies applied to the MRBrainS18 dataset [2]. We use a 3D ResNet-50 [3] pre-trained with the Med3D framework [4] as the backbone architecture for segmentation. Taking the pre-trained model as a baseline, we evaluate retention accuracy on two types of subjects, retain and forget. We assess these approaches through their Dice similarity coefficient and mean absolute error (MAE) values over two training horizons, 20 and 50 epochs. The results show that the Noisy Label strategy had the best overall trade-off, with a decrease of 93% in the forget set while maintaining 84% accuracy for the retained set after 50 epochs. All other strategies showed extreme levels of forgetting at higher epoch numbers while also demonstrating catastrophic degradation of their retain set performance. The results of this study provide a strict baseline of performance metrics for unlearning on a subject-specific level and provide practitioners with clear criteria for selecting the proper strategies.
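For readers unfamiliar with the two metrics named in this abstract, the following minimal sketch computes the Dice similarity coefficient and mean absolute error on flattened binary masks. The function names and the convention that two empty masks score 1.0 are choices made here for illustration, not details taken from the paper.

```python
def dice_coefficient(pred, target):
    """Dice = 2|P ∩ T| / (|P| + |T|) for flat binary masks (1 = foreground)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def mean_absolute_error(pred, target):
    """Average absolute difference between prediction and target values."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


# Toy example: six voxels, two of which are labelled differently.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
dice = dice_coefficient(pred, target)
mae = mean_absolute_error(pred, target)
```

In an unlearning evaluation of the kind described, a high Dice on the retain subjects and a low Dice on the forget subjects would indicate the desired trade-off.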

[CV-89] Fi-Gaussian: Frequency-Aware Implicit Gaussian Splatting for Single Image Dehazing

Link: https://arxiv.org/abs/2606.16168
Authors: Yuhan Chen, Ying Fang, Guofa Li, Wenxuan Yu, Yicui Shi, Kunyang Huang, Wenbo Chu, Keqiang Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Single image dehazing continues to be hindered by the loss of high-frequency details and the difficulty of accurate physical scattering modeling. To address these issues, we propose Fi-Gaussian, a frequency-aware implicit Gaussian splatting network for single image dehazing. Unlike explicit rendering methods that rely on 3D point clouds, our method employs implicit Gaussian splatting to adaptively model the underlying distribution of clear images as a continuous representation in 2D feature space. The core of the network is a frequency-aware implicit Gaussian splatting module, which decouples low-frequency structural information and high-frequency texture information in the frequency domain and then performs adaptive Gaussian aggregation with complex-valued weights to recover fine details. In addition, a physics-driven scattering renormalization mechanism is introduced to estimate the transmission map and atmospheric light under the guidance of implicit Gaussian priors. Extensive experiments on multiple benchmark datasets demonstrate that Fi-Gaussian achieves state-of-the-art quantitative performance and produces visually superior dehazed results, validating the effectiveness of implicit Gaussian splatting for low-level vision tasks.
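The transmission map and atmospheric light mentioned here are the two unknowns of the standard atmospheric scattering model, I = J·t + A·(1 − t), where I is the hazy image, J the clear scene, t the transmission, and A the atmospheric light. The sketch below applies and inverts that model on a few grayscale pixel values; the function names and the `t_min` clamp are illustrative assumptions, and this is not the network described in the abstract.

```python
def add_haze(clear, transmission, airlight):
    """Forward model: I = J * t + A * (1 - t), applied per pixel."""
    return [j * t + airlight * (1.0 - t) for j, t in zip(clear, transmission)]


def remove_haze(hazy, transmission, airlight, t_min=0.1):
    """Inverse model: J = (I - A) / max(t, t_min) + A; the clamp avoids dividing by tiny t."""
    return [(i - airlight) / max(t, t_min) + airlight for i, t in zip(hazy, transmission)]


# Toy grayscale pixels in [0, 1] with a known transmission map and airlight.
clear = [0.2, 0.5, 0.9]
transmission = [0.9, 0.6, 0.3]
airlight = 0.8
hazy = add_haze(clear, transmission, airlight)
recovered = remove_haze(hazy, transmission, airlight)
```

When t and A are known exactly the clear image is recovered, so the quality of a physics-based dehazing method depends largely on how well those two quantities are estimated.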

[CV-90] Dehaze-GaussianImage: Zero-Shot Dehazing via Efficient 2D Gaussian Splatting Representation

Link: https://arxiv.org/abs/2606.16163
Authors: Yuhan Chen, Wenxuan Yu, Guofa Li, Kunyang Huang, Ying Fang, Yicui Shi, Wenbo Chu, Keqiang Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Existing single image dehazing methods are often constrained by computational redundancy in pixel-level optimization and the lack of physical interpretability in implicit neural networks. These limitations hinder the balance between representation efficiency and reconstruction fidelity. To address these issues, we propose Dehaze-GaussianImage, the first zero-shot framework that introduces 2D Gaussian Splatting (2DGS) into the image dehazing domain to break the traditional pixel-grid processing paradigm. Distinct from static convolutional neural networks (CNNs) or Transformers, our approach models hazy images as continuous and dynamically evolvable anisotropic Gaussian fields. Specifically, we propose a novel reconstruction-decoupling zero-shot learning strategy that embeds the atmospheric scattering model into the Gaussian parameter space. This strategy drives Gaussian primitives to adaptively split, clone, and prune during optimization, achieving geometric-level decoupling of the transmission medium and clear textures. Furthermore, explicit structure-preserving constraints are introduced to suppress artifacts commonly caused by traditional physical priors. Experimental results demonstrate that the proposed method achieves state-of-the-art (SOTA) performance in a fully unsupervised manner with minimal parameters, highlighting the potential of explicit Gaussian representation for low-level vision tasks.

[CV-91] Multimodal LLM-Empowered Re-Ranking for Generalizable Person Re-Identification

Link: https://arxiv.org/abs/2606.16161
Authors: Jiachen Li, Xiaojin Gong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Domain Generalizable (DG) person re-identification (Re-ID) has attracted growing research interest due to its potential for deployment in unseen real-world scenarios. Most existing approaches address DG Re-ID by training domain-generalizable encoders but ignore possible refinements at the inference stage. In contrast, this work explores an alternative direction that improves inference-time re-ranking to enhance DG Re-ID. Conventional re-ranking methods typically rely on neighborhood-based distances to refine the initial ranking list, inherently depending on features produced by the Re-ID encoder. However, they deteriorate on target domains since the encoder lacks sufficient generalizability to produce reliable feature distances in unseen scenarios. Inspired by the remarkable generalization capabilities of recent Multimodal Large Language Models (MLLMs), we propose an MLLM-empowered distance metric to improve re-ranking in DG Re-ID. Specifically, we first adapt an MLLM to Re-ID data through supervised fine-tuning, which incorporates a domain-agnostic prompt and a query-candidate hard mining scheme. Then, the adapted MLLM is employed to compute a $\mu$-distance during inference, which is robust to the domain gap and significantly enhances subsequent re-ranking performance. Our approach is model-agnostic and can be seamlessly integrated into previous re-ranking frameworks. Extensive experiments demonstrate that our approach consistently yields substantial performance improvements across multiple DG Re-ID benchmarks. The code of this work will be released at this https URL soon.

[CV-92] Continuous Splatting meets Retinex: Continuous Gaussian Splatting and Implicit Reflectance Modeling for Low-Light Image Enhancement

Link: https://arxiv.org/abs/2606.16159
Authors: Yuhan Chen, Yicui Shi, Guofa Li, Wenxuan Yu, Ying Fang, Guangrui Bai, Wenbo Chu, Keqiang Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Low-light image enhancement aims to recover clear images from low-illumination observations and is crucial for high-level downstream vision tasks. However, existing methods frequently encounter color distortion and structural artifacts when balancing global smooth illumination adjustment and local high-frequency detail recovery. To address these issues, we propose CGS-Retinex as the first low-light image enhancement framework based on explicit-implicit joint modeling. Our framework deeply integrates continuous Gaussian splatting with Retinex theory. Specifically, we represent the image grid as a continuous parameter field and propose a continuous Gaussian renderer to estimate the spatially continuous global illumination distribution. This approach fundamentally eliminates grid artifacts caused by discrete Gaussian sampling. Furthermore, we introduce an implicit neural representation to model reflectance independently. We leverage shallow high-frequency features to guide the network in accurately reconstructing degraded texture details. Within the Retinex framework, we incorporate physics-inspired brightness consistency constraints and illumination smoothness regularization to enable explicit illumination and implicit reflectance to maintain proper exposure and achieve high-fidelity recovery of high-frequency structures and colors. Extensive experiments demonstrate that CGS-Retinex significantly suppresses dark-region noise and overexposure while achieving exceptional high-frequency structural fidelity and color restoration by precisely decoupling illumination and texture. This work establishes a novel continuous physical representation paradigm for low-light image enhancement.

[CV-93] A Comprehensive Survey of Medical Image Segmentation: Challenges Benchmarks and Beyond

Link: https://arxiv.org/abs/2606.16153
Authors: Pengyu Zhu,Xiaojing Zhang,Kunbo Zhang,Chunyan Zhang,Zhenyu Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 12 pages, 3 figures, 1 table. All related resources are available at this https URL

Abstract:Medical image segmentation plays a critical role in clinical diagnostics, treatment planning, disease monitoring, and neurological disorder identification. This article presents a comprehensive review of its systematic development, covering widely used public datasets, representative methods built on the U-Net, Transformer, and SAM architectures, and key evaluation metrics with their differences, followed by an analysis of major challenges from multiple perspectives. Unlike surveys that focus on a single model family or a specific clinical application, this review organizes U-Net-, Transformer-, and SAM-based methods within a unified analytical framework, with a particular focus on their effectiveness in improving segmentation accuracy and efficiency. This work aims to guide future research and support clinical translation of medical image segmentation, with all related resources publicly available in our GitHub repository: this https URL.

[CV-94] Shift-and-Sum Quantization for Visual Autoregressive Models ICLR2026

Link: https://arxiv.org/abs/2606.16131
Authors: Jaehyeon Moon,Bumsub Ham
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ICLR 2026

Abstract:Post-training quantization (PTQ) enables efficient deployment of deep networks using a small set of data. Its application to visual autoregressive models (VAR), however, remains relatively unexplored. We identify two key challenges for applying PTQ to VAR: (i) large reconstruction errors in attention-value products, especially at coarse scales where high attention scores occur more frequently; and (ii) a discrepancy between the sampling frequencies of codebook entries and their predicted probabilities due to limited calibration data. To address these challenges, we propose a PTQ framework tailored for VAR. First, we introduce a shift-and-sum quantization method that reduces reconstruction errors by aggregating quantized results from symmetrically shifted duplicates of value tokens. Second, we present a resampling strategy for calibration data that aligns sampling frequencies of codebook entries with their predicted probabilities. Experiments on class-conditional image generation, inpainting, outpainting, and class-conditional editing show consistent improvements across VAR architectures, establishing a new state of the art in PTQ for VAR.

[CV-95] Training-Free Open-Vocabulary Visual Grounding for Remote Sensing Images and Videos

Link: https://arxiv.org/abs/2606.16124
Authors: Ke Li,Di Wang,Yongshan Zhu,Ting Wang,Weiping Ni,Tao Lei,Quan Wang,Xinbo Gao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Remote sensing visual grounding (RSVG) aims to localize a referred target in a remote sensing image or video according to a natural language expression. Existing RSVG methods usually rely on task-specific manual annotations, which are costly to collect and inevitably limited in covering the diversity of real-world geospatial scenarios. As a result, they often struggle to generalize to open-vocabulary queries involving novel objects, fine-grained attributes, complex spatial relationships, and functional semantics. In this paper, we propose RSVG-ZeroOV, a training-free framework that leverages frozen generic foundation models for zero-shot open-vocabulary RSVG. RSVG-ZeroOV follows an Overview-Focus-Evolve paradigm, which exploits the distinct yet complementary attention patterns of vision-language models (VLMs) and diffusion models (DMs) to progressively generate precise grounding results. Specifically, (i) Overview utilizes a VLM to extract cross-attention maps that capture semantic correlations between the referring expression and visual regions; (ii) Focus leverages the fine-grained modeling priors of a DM to compensate for object structure and shape information often overlooked by VLM attention; and (iii) Evolve introduces a simple yet effective attention evolution module to suppress irrelevant activations, yielding purified object masks. To handle video inputs, we further present Video RSVG-ZeroOV, which extends image-level grounding to spatio-temporal grounding through a query-relevant key-frame selector and a temporal propagator, enabling efficient and temporally coherent video grounding without video annotations or fine-tuning. Extensive experiments on six image and video grounding benchmarks show that RSVG-ZeroOV consistently outperforms existing zero-shot baselines and achieves competitive or superior performance compared with weakly- and fully-supervised methods.

[CV-96] EdgeZSAD: Practical Zero-Shot Anomaly Detection on Edge Devices

Link: https://arxiv.org/abs/2606.16119
Authors: Taewan Cho,Andrew Jaeyong Choi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Industrial inspection needs zero-shot anomaly detection (ZSAD) that remains useful under edge deployment constraints. Recent methods often rely on ViT-L foundation backbones (~300M parameters), which exceed the memory and operator budget of typical embedded hardware. We study this regime through EdgeZSAD, a compact reference system built around a TinyViT-21M-512 backbone, an asymmetric global-local readout (EdgeGLR), and a reproducible source-side training recipe (Real-IAD-DR). We train a single checkpoint in a source-trained, target-unseen protocol and evaluate it across six industrial benchmarks. Across three independent runs, the resulting model reaches an average image AUROC of 91.6 on MVTec-AD and 88.2 on VisA, while remaining directly deployable on Jetson Orin Nano Super (TensorRT FP16) and RB5 Gen2 (QNN GPU FP16). Across the six device-rescored benchmarks, image-AUROC drift stays below 0.2 points, indicating that the exported graph preserves host-side ranking behavior in the evaluated deployment setting.

[CV-97] SceneCraft: Interactive System for Image Editing via Scene Graph

Link: https://arxiv.org/abs/2606.16103
Authors: Duc-Manh Phan,Ngoc-Dai Tran,Duy-Khang Do,Tam V. Nguyen,Minh-Triet Tran,Trung-Nghia Le
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in generative AI have enabled natural language-driven image editing, yet existing systems often fail in complex scenes with multiple interacting objects because they rely heavily on users crafting precise text prompts. To address the absence of structured control, we propose SceneCraft, a novel interactive framework that bridges user intent and model execution by representing images as editable scene graphs. Instead of guessing text prompts through trial and error, users interact directly with a visual graph to perform complex spatial and relational operations. These graph modifications are automatically translated into precise, context-aware editing prompts, effectively eliminating linguistic ambiguity. To ensure robust and diverse results, structured prompts are dispatched to multiple state-of-the-art generative models. Evaluations across diverse editing scenarios show that SceneCraft provides a more intuitive control mechanism, significantly reducing the cognitive burden of manual prompt engineering while generating outputs that users consistently rate as higher in quality and fidelity.

[CV-98] Effective and Low-cost Lane-based Map Localization for Vehicle-Centric Route Generation

Link: https://arxiv.org/abs/2606.16101
Authors: Hong-Shiang Lin,Jung-Hsin Chen,Yu-Luen Tzeng,Wei-Hao Chen,Yi-Chen Lee,Li-Jhe Chen,Peng-Yuan Chen
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 18 figures. Under Review

Abstract:Driver-centric route representation plays a vital role in intuitive driving guidance systems. This paper presents OLRA, a low-cost, map-localization-based framework that derives driver-view-aligned routes by matching map-based navigation routes with camera-detected lane markings. This alignment process mutually enhances vehicle localization accuracy and visual route consistency. To bridge the evaluation gap across different paradigms, we introduce practical route evaluation metrics and benchmark OLRA against OpenPilot, a representative direct-generation approach. Experimental results on the nuScenes dataset demonstrate that OLRA outperforms OpenPilot in complex road segments and in route estimation at distances beyond 20 meters, achieving lower overall Euclidean error. This study is expected to promote future research in low-cost, map-localization-based route generation methods.

[CV-99] VinQA: Visual Elements Interleaved Long-form Answer Generation for Real-World Multimodal Document QA CVPR2026

Link: https://arxiv.org/abs/2606.16092
Authors: Young Rok Jang,Hyesoo Kong,Kyunghwan An,Jae Sub Huh,Gyeonghun Kim,Stanley Jungkyu Choi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to CVPR 2026. Main paper: 5 figures, 4 tables; includes supplementary material

Abstract:Real-world documents combine text with tables, charts, photographs, and diagrams arranged in diverse layouts, yet existing research on multimodal large language models (MLLMs) for document QA predominantly produces text-only responses, underutilizing these visual elements. We introduce VinQA, a dataset for long-form answer generation where cited visual elements are explicitly interleaved with their supporting text and grounded in relevant document pages. To support this task, we study two encoding methods for feeding raw document page images into an MLLM, along with their visual-element citation mechanisms: (1) Page Encoding, which directly encodes full-page images with bounding boxes of visual elements and treats these boxed regions as citable units; and (2) Modality Encoding, which parses each page to extract text and crop visual elements, encodes them separately, and uses these cropped elements as citable units. In our experiments, we propose M-GroSE, a multimodal evaluation framework extending GroUSE to assess answers along four dimensions: completeness, answer relevancy, faithfulness, and unanswerability. We additionally report Visual Source F1 to directly measure visual citation accuracy. Although proprietary frontier models still achieve the best overall scores on the VinQA test split, fine-tuning open Qwen2.5-VL models on the training split substantially improves their performance and narrows this gap. Modality Encoding is initially more robust for complex documents with long text, many visual elements, and diverse citation requirements. After training on VinQA, however, Page Encoding reaches a comparable level, competing effectively even without the explicit parsing used in Modality Encoding. Finally, Visual G-Eval, an MLLM-based judge, confirms that fine-tuned models insert visual elements at semantically appropriate positions with faithful supporting text.

[CV-100] Tool-IQA: Augmenting Image Quality Assessment with Simple Tools

Link: https://arxiv.org/abs/2606.16082
Authors: Guanyi Qin,Junjie Zhang,Chunming He,Yibing Fu,Jie Liang,Tianhe Wu,Lei Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision-Language Models (VLMs) have been increasingly adopted for Image Quality Assessment (IQA). However, current methods typically employ a static one-shot scoring paradigm, despite the fact that humans assess image quality through dynamic visual inspection, e.g., selectively adjusting views to verify details and subtle artifacts. Specifically, relying solely on a single-pass observation introduces two primary limitations: first, perceiving the image only at a global scale restricts the assessment of finer local details; second, the original intensity distribution of the image may overwhelm the visibility, leading to insufficient inspection of image quality. To address these issues, we propose Tool-IQA, shifting the assessment mechanism from passive scoring to a tool-augmented workflow. In particular, we equip VLMs with simple yet effective view tools: a Magnifier to inspect local details, and a Gamma Corrector to uncover visibility and hidden artifacts. The assessment follows a structured pipeline that consists of an initial observation with rubric notes, a tool-augmented in-depth inspection, and a final quantification for calibrated quality score. Furthermore, to ensure efficient and purposeful tool callings, we introduce a batch-aware training strategy to reward tool interactions that can yield positive contributions rather than simply encouraging usage. Experiments on a variety of IQA benchmarks demonstrate that, with effective tool calling and calibrated assessment, our proposed Tool-IQA significantly outperforms existing state-of-the-art models, e.g., it achieves a PLCC of 0.854 on the challenging CLIVE dataset.
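
The two view tools named in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the Magnifier or the Gamma Corrector are implemented, so the function names, parameters, and the nearest-neighbour upscaling below are all assumptions made for illustration, operating on a grayscale image stored as nested lists with values in [0, 1].

```python
def gamma_correct(image, gamma):
    """Apply out = in ** gamma to a grayscale image with values in [0, 1].

    gamma < 1 brightens dark regions; gamma > 1 darkens bright regions.
    (Hypothetical stand-in for the paper's Gamma Corrector.)
    """
    return [[pixel ** gamma for pixel in row] for row in image]


def magnify(image, top, left, height, width, factor):
    """Crop a rectangular region and upscale it by an integer factor.

    Uses nearest-neighbour repetition, which is an assumption; the paper
    may use a different crop or interpolation scheme.
    """
    crop = [row[left:left + width] for row in image[top:top + height]]
    out = []
    for row in crop:
        wide = [p for p in row for _ in range(factor)]  # repeat columns
        out.extend([list(wide) for _ in range(factor)])  # repeat rows
    return out


image = [[0.25, 0.81], [0.04, 1.0]]
brightened = gamma_correct(image, 0.5)    # square root lifts the dark pixels
zoomed = magnify(image, 0, 1, 1, 1, 2)    # 2x zoom on the top-right pixel
```

In a tool-augmented workflow of the kind described, a model would call such functions on demand and re-inspect the returned view before committing to a score.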

[CV-101] AME: A Multi-Type Contributor Attribution Framework in Generative AI Markets

Link: https://arxiv.org/abs/2606.16075
Authors: Yang Shi,Songwen Pei,Yang Gao,Bingxue Zhang
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Generative AI enables value creation through multi-stage collaboration among heterogeneous contributors, including training data, base models, fine-tuning behaviors, and prompts. However, how to fairly allocate the data value remains largely unexplored. This paper formulates multi-stage generative AI value allocation as a new research problem and identifies three core challenges: heterogeneous data contribution valuation, data rights mapping, and trustworthy execution. We propose the AME (Attribution-Mapping-Execution) framework, which integrates data contribution valuation, data rights mapping, and trustworthy execution into a single workflow. Experimental results demonstrate that the AME framework achieves data value allocation outcomes more consistent with human reference judgments while maintaining low-cost trustworthy execution. Our work provides an initial foundation for value assessment and revenue allocation in generative AI data markets.

[CV-102] Stepwise Token Selection for Efficient Multimodal Large Language Models

Link: https://arxiv.org/abs/2606.16067
Authors: Landi He,Shawn Young,Lijian Xu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In multimodal large language models (MLLMs), inference cost is largely dominated by the visual token prefix rather than the language backbone, making token reduction a key factor for improving efficiency. Existing approaches typically assign independent importance scores to visual tokens and retain a fixed number of top-ranked tokens, implicitly assuming token independence and a uniform compression ratio across inputs. In this work, we reformulate visual token pruning as a sequential decision-making process. Specifically, we introduce a pointer-style selection mechanism that iteratively chooses informative tokens, conditioning each decision on previously selected ones, and dynamically determines when to stop via a learned termination action. This enables joint optimization of both the selected subset and its size. To enable end-to-end training under standard language modeling objectives, we design a differentiable relaxation based on a variance-preserving noise interpolation scheme, allowing gradients to propagate through the discrete selection process. Extensive experiments on LLaVA-v1.5-7B and Qwen2.5-VL-7B demonstrate that our approach consistently outperforms fixed-ratio baselines across different compression levels. Under aggressive pruning that removes 88.9% of visual tokens, our method preserves 94.6% of the original accuracy while achieving a 1.88x speed-up in prefill latency.

[CV-103] PointDiffusion: Diffusion-Based Scene Completion in the Point Cloud Domain

Link: https://arxiv.org/abs/2606.16048
Authors: Chidera Agbasiere,Mikhail Sannikov,Faith Ogunwoye,Erik Shaikhiev,Alex Kozinov,Ilya Mikhalchuk,Iana Zhura,Dzmitry Tsetserukou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reconstructing dense 3D scenes from sparse LiDAR point clouds is a fundamental challenge in autonomous driving, where latent diffusion models offer a promising solution. However, existing approaches rely on object-level autoencoders that collapse into unstable global representations at outdoor scale and suffer from ground truth data corrupted by odometry drift that systematically degrades supervision quality. Furthermore, multi-step diffusion inference incurs prohibitive latency for real-time deployment. We propose a novel multi-token Gaussian VAE with cross-attention pooling for stable scene-scale LiDAR compression, combined with an anchor-based ICP ground truth refinement pipeline that eliminates drift-induced noise from training supervision. Together, these components enable a scaffold-free single-step diffusion completion model that achieves an approximately 16x reduction in squared Chamfer distance on SemanticKITTI seq. 08 (0.396 m^2 to 0.024 m^2), surpasses LiDiff and ScoreLiDAR by 17-19% and 10-11%, respectively, and operates at 25-143x lower inference latency. Our results demonstrate that data quality dominates model design in this regime and that multi-token latent spaces provide a stable first stage for latent diffusion-based scene completion.
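
For readers unfamiliar with the metric, the following is a minimal sketch of a squared Chamfer distance between two point sets. Conventions differ (sum versus mean, one-way versus symmetric), and the abstract does not say which one the authors use, so the symmetric mean-based form below is an assumption rather than the paper's exact definition.

```python
def squared_chamfer(a, b):
    """Symmetric squared Chamfer distance between two lists of 3D points.

    For each point, take the squared Euclidean distance to its nearest
    neighbour in the other set, average per set, and add both directions.
    Brute force O(len(a) * len(b)); fine for small illustrative inputs.
    """
    def one_way(src, dst):
        total = 0.0
        for p in src:
            total += min(sum((x - y) ** 2 for x, y in zip(p, q)) for q in dst)
        return total / len(src)

    return one_way(a, b) + one_way(b, a)


# Ratio of the two values reported in the abstract (units: m^2).
reduction = 0.396 / 0.024
```

The ratio evaluates to 16.5, which matches the "approximately 16x" reduction stated in the abstract.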

[CV-104] Trusting Right Predictions for Wrong Reasons: A LIME Based Analysis of Deep Learning Interpretability in Lung Cancer Diagnosis

Link: https://arxiv.org/abs/2606.16036
Authors: Samarpan Poudel,Vladislav D Veksler
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Lung cancer is the leading cause of cancer-related mortality, with approximately 2.5 million new cases and 1.8 million deaths annually, making reliable diagnosis a clinical priority. Although deep learning models have achieved strong performance in lung cancer classification, evaluation has largely focused on predictive accuracy, leaving their decision-making processes insufficiently examined. This study compares three architecturally distinct models: a Convolutional Neural Network (CNN), a pretrained ResNet50, and a Vision Transformer (ViT), trained on the IQ-OTH/NCCD lung cancer CT dataset. Local Interpretable Model-Agnostic Explanations (LIME) were applied to investigate model reasoning. In addition to standard performance metrics, a dual-correlation framework was introduced to measure both prediction agreement and explanation agreement across model pairs. All three models achieved strong classification performance, with ResNet50 attaining 98.61% accuracy, CNN 97.91%, and ViT 93.75%, while all achieved ROC-AUC scores of 0.99. Prediction correlations exceeded 0.99 across all model pairs, indicating highly consistent outputs. However, LIME explanation correlations remained below 0.26, revealing substantial differences in the image regions used to reach those predictions. Analysis of misclassified samples further identified a consistent spatial pattern: incorrect predictions were associated with attention outside the lung parenchyma, whereas correct predictions focused primarily within lung regions. These findings demonstrate that prediction agreement is a poor proxy for reasoning consistency, and that interpretability evaluation must be treated as an independent validation criterion alongside predictive performance in clinical AI systems.

[CV-105] The Third Challenge on Image Denoising at NTIRE 2026: Methods and Results CVPR

Link: https://arxiv.org/abs/2606.16031
Authors: Lei Sun,Hang Guo,Bin Ren,Shaolin Su,Xian Wang,Danda Pani Paudel,Luc Van Gool,Radu Timofte,Yawei Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPRW 2026

Abstract:This paper reports on the NTIRE 2026 Challenge on Image Denoising, specifically focusing on the high-noise regime ($\sigma = 50$). The competition investigates advanced neural architectures designed to restore high-fidelity details from images corrupted by additive white Gaussian noise (AWGN). Unlike constrained benchmarks, this track emphasizes peak quantitative performance, measured by Peak Signal-to-Noise Ratio (PSNR), without limitations on parameter count or computational overhead. By synthesizing contributions from 20 finalist teams out of 116 registrants, this report benchmarks the latest technical innovations and provides a comprehensive snapshot of the current state-of-the-art in unconstrained image restoration.
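
The evaluation setting described above can be sketched in a few lines: corrupt pixel values with AWGN and score a result with PSNR. This is an illustrative sketch, not the challenge's official scoring code. It assumes the noise level 50 is expressed on the 0 to 255 intensity scale, which is the common convention in the denoising literature but is not stated in the abstract, and it works on a flat list of pixel values without clipping.

```python
import math
import random


def add_awgn(pixels, sigma=50.0, seed=0):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    rng = random.Random(seed)
    return [p + rng.gauss(0.0, sigma) for p in pixels]


def psnr(reference, estimate, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB: 10 * log10(peak^2 / MSE)."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


clean = [128.0] * 10000
noisy = add_awgn(clean)
baseline_db = psnr(clean, noisy)  # PSNR of the noisy input itself
```

Under these assumptions the unclipped noisy input has an expected mean squared error of 50^2 = 2500, so its PSNR is about 10 * log10(255^2 / 2500), roughly 14.15 dB. Any denoiser is judged by how far it lifts the score above that starting point.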

[CV-106] Stringalign: Moving beyond summary statistics with a transparent Unicode-aware tool for evaluating automatic transcription models

Link: https://arxiv.org/abs/2606.16015
Authors: Yngve Mardal Moe,Marie Roald
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Comparing text strings is crucial when evaluating and understanding the performance of various text processing tasks such as document recognition and audio transcription. With an increasingly complex landscape of AI-based handwritten text recognition (HTR), optical character recognition (OCR) and automatic speech recognition (ASR) models, there is a need for tools that facilitate evaluation in a flexible and reproducible way. This paper presents Stringalign, a Python library designed to simplify the evaluation process for automatic transcription projects and facilitate transparent evaluation. Stringalign’s tools for examining and visualising both the rate and the types of errors a model makes give insights into possible improvements and help inform model selection for a particular task. Widely used string comparison metrics, such as the character and word error rates (CER and WER), although useful, can be ambiguous due to varying definitions of what constitutes a character and a word. Stringalign addresses this challenge by ensuring all preprocessing (i.e. normalisation and tokenisation) is transparent and easily replicable, and by providing tools to move beyond summary statistics and analyse common model errors. Moreover, Stringalign adheres to FAIR (Findable, Accessible, Interoperable, and Reusable) principles for research software while staying lightweight and easy to adapt into researchers’ existing workflows. In this paper, we discuss challenges with character and word level string comparisons and show through examples that where existing tools can yield opaque and sometimes confusing results, Stringalign provides an easy-to-use and unambiguous alternative.
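
The ambiguity around "what constitutes a character" can be reproduced with the Python standard library alone. The sketch below deliberately does not use Stringalign or guess at its API; the `levenshtein` and `cer` helpers are hypothetical illustrations that count Unicode code points, which is only one of several possible definitions of a character.

```python
import unicodedata


def levenshtein(a, b):
    """Edit distance over code points (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, prediction, form=None):
    """Character error rate, optionally after Unicode normalisation."""
    if form is not None:
        reference = unicodedata.normalize(form, reference)
        prediction = unicodedata.normalize(form, prediction)
    return levenshtein(reference, prediction) / len(reference)


reference = "caf\u00e9"     # precomposed e-acute: 4 code points
prediction = "cafe\u0301"   # "e" + combining acute accent: 5 code points

raw_cer = cer(reference, prediction)          # 0.5
nfc_cer = cer(reference, prediction, "NFC")   # 0.0
```

The two strings render identically, yet the reported error rate moves from 50% to 0% depending on a preprocessing step that many tools leave implicit. This is the kind of discrepancy that transparent normalisation and tokenisation are meant to expose.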

[CV-107] Classifying by Proxy: Explainable and Reproducible Ensemble of Proxy Tasks for Child Sexual Abuse Imagery Classification

Link: https://arxiv.org/abs/2606.15993
Authors: Clara Ernesto,Carlos Caetano,Sandra Avila,João Macedo,Camila Laranjeira,Leo S. F. Ribeiro
Categories: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 7 figures, 7 tables. Accepted at ACM FAccT 2026

Abstract:Child Sexual Abuse Imagery (CSAI) classification systems are needed solutions for lessening the psychological impacts often felt by law enforcement agents responsible for evaluating these materials and for efficient removal of these materials from the web. However, due to the nature of the task, researching and developing such systems is not a trivial endeavor. The images are highly sensitive, and the related datasets are under restrictive access regimes, which means most studies in the area are not reproducible or distributable and are therefore hard to compare and validate. More concerning still, most models for this task today lack an aspect often desired by law enforcement agents: explainability. In this paper, we apply an ensemble of Proxy Tasks – tasks that correlate to CSAI classification – yielding improvements in reproducibility, explainability, and security for distribution. This concept is applied for the first time to real CSAI, with a novel selection of relevant Proxy Tasks (selected from the CSAI literature) and training adaptations to the original framework. Our final model achieves competitive results, yielding 91.9% balanced accuracy on the RCPD dataset with the best Proxy Task combination. We furthermore contrast these results with the best-in-class representation learning model, DINO, and show that our ensemble improves accuracy and provides explanations for its classification results, a feature that a single deep learning model can seldom provide.

[CV-108] Multi-Task Tennis Stroke Biomechanics Analysis Using MediaPipe Pose

Link: https://arxiv.org/abs/2606.15992
Authors: Jigyashman Hazarika
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 9 figures

Abstract:We built a multi-task pipeline for tennis stroke biomechanics from plain RGB video. On top of pose-based stroke recognition, it adds two new tasks, predicting shot direction and grading posture quality, plus a rule-based feedback layer that suggests coaching tips. Strokes are found automatically using a weighted joint velocity score, s(t) = 0.5 v_wrist + 0.3 m_elbow + 0.2 m_shoulder, removing the need for manual annotation. Pose comes from MediaPipe Pose Landmarker (33 landmarks, metric world coordinates), with each stroke turned into a 30-frame by 39-feature sequence for TennisTransformerGPU, a compact 564,103-parameter transformer (4 layers, 4 heads, d=128) with three parallel output heads. Trained on 1,281 labeled strokes from 7 pros and 1 amateur across 11 videos, it hits 83.7% stroke-type accuracy, 61.9% on direction, and 62.6% on posture under a random 80/20 split. The interesting test is cross-player: train on pros, evaluate on the amateur. Stroke type barely budges, 82.9%, a 0.8% drop. Direction prediction does not transfer; it just falls back to the majority class. An ablation shows why world coordinates matter so much here: switching to image-space landmarks tanks cross-player stroke-type accuracy from 83% to 47% and direction from 68% to 21%. Everything runs on Kaggle’s free T4 GPU tier and is fully reproducible.
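
The weighted score s(t) = 0.5 v_wrist + 0.3 m_elbow + 0.2 m_shoulder from the abstract can be sketched as follows. The abstract does not define how the three per-frame terms are computed or how strokes are picked from the score, so the inputs are treated here as precomputed per-frame magnitudes, and the thresholded local-maximum rule in `find_peaks` is a hypothetical choice made only for illustration.

```python
def stroke_score(v_wrist, m_elbow, m_shoulder):
    """Per-frame score s(t) = 0.5*v_wrist + 0.3*m_elbow + 0.2*m_shoulder."""
    return [0.5 * w + 0.3 * e + 0.2 * s
            for w, e, s in zip(v_wrist, m_elbow, m_shoulder)]


def find_peaks(scores, threshold):
    """Return frame indices that are local maxima at or above threshold.

    Assumed detection rule; the paper's actual segmentation logic is not
    described in the abstract.
    """
    return [t for t in range(1, len(scores) - 1)
            if scores[t] >= threshold
            and scores[t] > scores[t - 1]
            and scores[t] >= scores[t + 1]]


# Toy sequence with one burst of arm motion around frame 2.
v_wrist = [0.1, 0.4, 2.0, 0.5, 0.1]
m_elbow = [0.1, 0.3, 1.0, 0.4, 0.1]
m_shoulder = [0.1, 0.2, 1.0, 0.3, 0.1]

scores = stroke_score(v_wrist, m_elbow, m_shoulder)
stroke_frames = find_peaks(scores, threshold=1.0)
```

Weighting the wrist most heavily reflects that it moves fastest during a swing, while the elbow and shoulder terms help reject wrist-only motion that is not a stroke.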

[CV-109] A Text Recognition Dataset from Sahidic Coptic Ancient Manuscripts ICDAR2026

Link: https://arxiv.org/abs/2606.15987
Authors: Fabio Quattrini,Carmine Zaccagnino,Costanza Bianchi,Silvia Cascianelli,Rita Cucchiara
Categories: Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)
Comments: Accepted at ICDAR 2026

Abstract:In this work, we target Handwritten Text Recognition (HTR) in low-resource scenarios, which arise from underrepresented languages, rare scripts, and degraded visual conditions typical of historical documents. We introduce SCAM (Sahidic Coptic Ancient Manuscripts), a new line-level dataset built from digitized ancient manuscripts written in the extinct Sahidic Coptic dialect. The dataset reflects a realistic and challenging setting, as it combines heterogeneous acquisition conditions across libraries with typical manuscript degradations such as ink fading, bleed-through, and material deterioration. In addition to visual complexity, SCAM poses significant linguistic challenges due to the scarcity of resources for Sahidic Coptic, its uncommon alphabet, and dialect-specific diacritics. To support research in low-resource HTR, we benchmark several state-of-the-art approaches based on different paradigms, highlighting their limitations and strengths in this setting. Our results underline the gap between current HTR performance on well-resourced modern scripts and historically grounded, low-resource scenarios, thus providing a reference point for future developments.

[CV-110] Mind the Gap: Diagnosing Constraint Discovery Failures in Text-in-Image Editing

Link: https://arxiv.org/abs/2606.15982
Authors: Rui Gui
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:A key challenge in multimodal reasoning is determining which visual dependencies become relevant under a specific task, rather than merely recognizing visible content. We study this through edit-induced constraint discovery in text-in-image editing, a controlled diagnostic setting where a local text change can activate secondary consistency constraints: given a valid editing instruction and an image, can a model identify the secondary regions that must also change? Across 461 diagnostic cases, four MLLMs, and 19 constraint subtypes, models recover only 46% case-level macro recall under unguided prompting versus 94% when constraints are explicitly provided, suggesting that a substantial portion of the failure arises when models must decide which unstated dependencies to surface. Oracle-field decomposition shows that case-specific causal explanations are the most effective partial guidance (0.782 recall), above region names (0.610) or type labels (0.646), suggesting that edit-specific causal cues account for much of the oracle gain. A downstream experiment further shows that higher self-discovery recall does not necessarily improve task performance: unverified self-discovery introduces false positives that offset recall gains, motivating precision-aware constraint elicitation.

[CV-111] HadBalance: A Plug-and-Play Unified Global Geometric Prior Framework for Generalizable Biomedical Segmentation MICCAI2026

Link: https://arxiv.org/abs/2606.15976
Authors: Zhuangzhi Gao,Feixiang Zhou,He Zhao,Wenhan Chen,Ruiyu Luo,Xin Wang,Hongyi Qin,Zhongli Wu,Yanda Meng,Yitian Zhao,Alena Shantsila,Gregory Y. H. Lip,Eduard Shantsila,Yalin Zheng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Provisionally accepted by the 29th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2026). 11 pages, 3 figures, 2 tables

Abstract:Precise biomedical image segmentation is crucial for clinical diagnosis. Geometric cues (e.g., boundary, shape, and topology) can improve structural consistency, yet most are task-specific and lack a unified geometric foundation that generalizes across organs and modalities. We are motivated by the observation that several medical segmentation targets can be approximated as globally near-convex shapes. A convex region is one in which any two interior points can be connected by a line segment entirely contained within the region. In practice, medical targets may exhibit small local concavities or boundary irregularities; we refer to such globally convex-like shapes as near-convex. Motivated by this, we derive Hadwiger Shape Priors from Hadwiger’s theorem as an interpretable global regularizer using three 2D measures: area A, perimeter P, and Euler characteristic chi, enabling transfer across organs and modalities. However, because medical datasets are shape-heterogeneous, enforcing near-convex priors uniformly can over-regularize non-convex anatomy with significant concavities, washing out concavities and fine details and degrading segmentation accuracy. To address this challenge, we propose Conflict-Aware Objective Balancing (CAOB), which integrates shape priors with segmentation in a gradient-aware manner. For each prior, CAOB removes only the gradient component that conflicts with segmentation while preserving the remaining aligned component, and adaptively regulates objective influences to prevent prior dominance. This enables stable use of shape priors on shape-heterogeneous data without erasing genuine concavities or fine structural details. We call this plug-and-play framework HadBalance.
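
The idea of removing only the conflicting gradient component can be illustrated with a standard gradient projection. This is a sketch under assumptions: the abstract does not give CAOB's exact rule, so the code below uses the familiar projection step (as in gradient-surgery methods) as a stand-in, and it omits the adaptive regulation of objective influences entirely. The function name and the flat-list gradient representation are hypothetical.

```python
def remove_conflict(g_prior, g_seg):
    """Drop the part of g_prior that opposes g_seg, keep the rest.

    If the two gradients have a negative inner product, subtract the
    projection of g_prior onto g_seg; otherwise return g_prior unchanged.
    """
    dot = sum(p * s for p, s in zip(g_prior, g_seg))
    norm_sq = sum(s * s for s in g_seg)
    if dot >= 0 or norm_sq == 0:
        return list(g_prior)  # no conflict, or no segmentation signal
    scale = dot / norm_sq
    return [p - scale * s for p, s in zip(g_prior, g_seg)]


g_seg = [1.0, 0.0]
conflicting_prior = [-1.0, 1.0]   # opposes g_seg along the first axis
aligned_prior = [1.0, 1.0]        # agrees with g_seg

adjusted = remove_conflict(conflicting_prior, g_seg)   # [0.0, 1.0]
untouched = remove_conflict(aligned_prior, g_seg)      # [1.0, 1.0]
```

After projection the adjusted prior gradient is orthogonal to the segmentation gradient, so a step along it no longer works against the segmentation objective to first order, while the component that does not conflict is preserved.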

[CV-112] CRIS: Cross-Plane Self-Supervised Isotropic Restoration for Anisotropic Volumetric Imaging Across Modalities

Link: https://arxiv.org/abs/2606.15967
Authors: Adi Ahituv,Anat Ilivitzki,Moti Freiman
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 8 figures, supplementary material included. Submitted to Medical Image Analysis

Abstract:Anisotropic volumetric acquisitions are common in clinical MRI and volume electron microscopy (vEM), where sparse through-plane sampling creates thick slices or sections that degrade orthogonal reformats and downstream analysis. We present CRIS, a cross-plane self-supervised framework for isotropic restoration without paired isotropic ground truth. CRIS casts 3D restoration as 2D stripe completion on orthogonal reformats of an isotropic grid: high-resolution in-plane slices are synthetically degraded and periodically masked for training, while at inference blank slices define the isotropic grid, two orthogonal reformats are restored, and predictions are fused by multi-view averaging. We evaluate CRIS on two MRI cohorts and two microscopy benchmarks up to 8x anisotropy. On brain MRI, CRIS achieves 32.921 +/- 0.436 dB PSNR and 0.9631 +/- 0.0027 SSIM, outperforming interpolation, SMORE4, SIMPLE, SA-INR, and ATME, and gives the best segmentation consistency (Dice 0.940 +/- 0.004, ASSD 0.245 +/- 0.014 mm, HD99 1.275 +/- 0.061 mm). On reference-free abdominal MRI, CRIS reduces FID/KID to 48.714/0.023. On vEM, CRIS outperforms interpolation, NIIV, and vEMINR, reaching 29.133 dB/0.834 3D PSNR/SSIM at 4x, 27.123 dB/0.734 on EPFL at 8x, and 21.915 dB/0.699 on noisy hemibrain data. In a robustness experiment, one variable-gap CRIS model evaluated across gap factors 3–7 and coronal, axial, and sagittal degradations maintained higher PSNR/SSIM than interpolation (36.36–31.14 dB and 0.977–0.932 vs. 33.07–27.85 dB and 0.951–0.853). These results support CRIS as a modality-flexible route to isotropic restoration without paired isotropic targets or configuration-specific retraining. Code is available at this https URL.

[CV-113] VEPHand: View-Efficient Photometric Hand Performance Capture at Scale

Link: https://arxiv.org/abs/2606.15966
Authors: Zhengyang Shen, Kai-Hung Chang, Erroll Wood, Deying Kong, Bo Peng, Timo Bolkart, Jinlong Yang, Bowen Zhao, Danhang Tang, Sasa Petrovic, Emre Aksan, Jérémy Riviere, Vassilis Choutas, Delio Vicini, Jay Busch, Shichen Liu, Zhe Cao, Hugh Liu, JingJing Shen, Jonathan Taylor, Mingsong Dou
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Abstract:Robust, high-fidelity 3D hand capture, while fundamental to digital human creation, remains challenging with practical multi-view systems that balance rich photometry with the geometric ambiguities of reconstruction arising from limited viewpoint density. This paper presents an end-to-end pipeline for dynamic hand performance capture and registration, specifically designed for view-efficient setups (~20 views). We address key challenges with two primary innovations. First, to overcome reconstruction difficulties like limited view overlap and background clutter, our mask-free neural method robustly extracts detailed hand geometry and appearance from unmasked images using scene parameterization and scenario-specific density regularization. Second, addressing registration challenges such as accurately capturing non-linear skin deformations and ensuring plausible results during severe self-contact, we propose a physics-inspired framework. It aligns reconstructions to a personalized hand model by optimizing intrinsic volumetric offsets within its canonical tetrahedral mesh, alongside pose parameters. This approach, supported by robust losses and optimization, captures fine surface deformations, ensures plausible results under severe articulation and self-contact, and demonstrates strong tolerance to input noise. We demonstrate the scalability and robustness of our automated pipeline on an extensive dataset of over 12,000 sequences, from which we also derive a large-scale, high-quality synthetic 2D/3D hand dataset for training downstream tasks. This showcases its effectiveness for single hands, intricate two-hand interactions, and natural hand-object manipulations. Our method achieves state-of-the-art reconstruction fidelity in view-efficient, unmasked scenarios and highly accurate registration. Our project page is available at this https URL.

[CV-114] You Don't Need Strong Assumptions: Visual Representation Learning via Temporal Differences

Link: https://arxiv.org/abs/2606.15956
Authors: Ninad Daithankar, Alexi Gladstone, Yann LeCun, Heng Ji
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Progress in AI has largely been driven by methods that assume less. As compute and data increase, approaches with weaker inductive biases generally outperform those with stronger assumptions. This is particularly characteristic of the field of Visual Representation Learning, where approaches have gone from being dominated by Supervised Learning, to Weakly Supervised Learning, to the now widespread success of Self-Supervised Learning without human labels. Yet, even modern Self-Supervised Learning approaches still depend on strong inductive biases such as augmentations, masking, or cropping. If this trend holds, even these remaining biases should become bottlenecks at scale – and our experiments confirm this: the optimal strength of inductive biases decreases as data grows. This motivates the search for approaches that rely on fewer assumptions. To this end, we introduce Temporal Difference in Vision (TDV), a new paradigm for self-supervised learning from video that avoids existing inductive biases, relying instead on a causal assumption that the past causes the future. TDV functions by jointly training an image encoder and a motion encoder so that the current frame’s representation plus the encoded motion equals the next frame’s representation. Despite not leveraging any strong inductive biases, TDV matches state-of-the-art recipes on dense spatial tasks, laying the foundation for representation learning without strong assumptions.
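
The training constraint stated in the abstract, that the current frame's representation plus the encoded motion should equal the next frame's representation, can be written as a small toy objective. This is a sketch under stated assumptions: the mean-squared-error form, the function name `tdv_loss`, and the plain-list vectors are illustrative choices, not details from the paper. The encoders themselves, and whatever mechanism the authors use to avoid trivial solutions, are not shown.

```python
def tdv_loss(z_t, z_next, motion):
    """Mean squared error between (z_t + motion) and z_next.

    z_t, z_next: image-encoder outputs for frames t and t+1 (hypothetical toy vectors).
    motion: motion-encoder output for the transition t -> t+1.
    Assumption: a squared-error penalty; the abstract only states the equality.
    """
    assert len(z_t) == len(z_next) == len(motion)
    return sum((a + m - b) ** 2 for a, m, b in zip(z_t, motion, z_next)) / len(z_t)


z_t = [0.5, -1.0, 2.0]
z_next = [1.0, -0.5, 1.0]

# A motion code equal to the representation difference satisfies the constraint exactly.
exact_motion = [b - a for a, b in zip(z_t, z_next)]
loss_exact = tdv_loss(z_t, z_next, exact_motion)

# A zero motion code leaves the full frame-to-frame difference as error.
loss_zero_motion = tdv_loss(z_t, z_next, [0.0, 0.0, 0.0])
```

The toy values show the two extremes: the loss vanishes when the motion code accounts for the whole change between frames, and it grows with whatever change the motion code fails to explain.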

[CV-115] Learning Directional Semantic Transitions for Longitudinal Chest X-ray Analysis MICCAI2026

Link: https://arxiv.org/abs/2606.15938
Authors: Zhangfeng Hu, Zefan Yang, Ge Wang, Tanveer Syeda-Mahmood, Anushree Burade, Mannudeep Kalra, Pingkun Yan
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: MICCAI 2026

Abstract:Chest X-ray (CXR) interpretation often requires longitudinal comparison to assess disease progression. Existing approaches typically rely on temporal feature fusion or inter-study discrepancy modeling, yet remain limited in capturing subtle progression semantics and overlook the inherently directional nature of disease trajectories. In this paper, we propose ProTrans, a novel vision-language pretraining framework that formulates disease progression as a directional semantic transition between paired CXR studies. ProTrans leverages radiology reports to anchor individual CXR representations within interpretable disease states, and introduces a learnable progression feature map to explicitly encode semantic shifts between states, aligned with report-derived progression descriptions. To enforce direction-aware perception, ProTrans incorporates a reversed temporal modeling process and imposes bidirectional reconstruction consistency across states and transitions, thereby disentangling directional semantics and promoting coherent trajectory modeling. Extensive experiments on longitudinal downstream tasks, including disease progression classification and progression captioning, demonstrate that ProTrans consistently outperforms existing methods, establishing a unified pretraining framework for longitudinal CXR understanding. this https URL

[CV-116] GOOSE-M2F: Adapting Mask2Former for High-Fidelity Long-Tailed Fine-Grained Semantic Segmentation in Unstructured Outdoor Terrain ICRA

Link: https://arxiv.org/abs/2606.15937
Authors: Jyothiraditya Lingam, Nikhileswara Rao Sulake, Sai Manikanta Eswar Machara
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This solution placed 3rd in the GOOSE 2D Fine-Grained Semantic Segmentation (FGSS) Challenge at ICRA 2026

Abstract:We present GOOSE-M2F, a task-specific adaptation of Mask2Former for the GOOSE 2D Fine-Grained Semantic Segmentation (FGSS) Challenge at ICRA 2026. The GOOSE benchmark spans 64 fine-grained classes across unstructured outdoor terrain with a severely long-tailed distribution, where rare classes occupy fewer than 50 pixels per image. We extend the Swin-Large Mask2Former baseline with three targeted contributions: (1) 200 Object Queries to eliminate representational saturation; (2) a Feature Refinement Module (FRM) combining ASPP-lite and CBAM dual-attention; and (3) an Auxiliary Supervision Head that delivers direct per-pixel gradients for rare classes. A multi-stage training strategy pairs Distribution-Balanced loss, Rare-Class Copy-Paste augmentation, dynamic IoU-aware re-weighting, and EMA. At inference, a dense sliding-window engine with 2D Gaussian kernel blending and 4-scale TTA adds +10.57%. GOOSE-M2F achieves 70.08% Official Composite mIoU (63.55% fine, 76.61% coarse), placing 3rd on the GOOSE 2D FGSS leaderboard. Code and trained models are publicly available on GitHub (GOOSE-M2F Code, this https URL) and Hugging Face (GOOSE-M2F, this https URL).
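
Sliding-window inference with 2D Gaussian kernel blending, as mentioned in the abstract, is commonly implemented by weighting each window's prediction with a Gaussian centered on the window and normalizing by the accumulated weights. The sketch below follows that common pattern for a single score channel. The function names, the default `sigma = window / 4`, and the `predict(top, left)` callback are assumptions for illustration, not the authors' engine, and test-time augmentation is omitted.

```python
import math


def gaussian_kernel_2d(size, sigma):
    """Unnormalized 2D Gaussian weights, peaked at the window center."""
    c = (size - 1) / 2.0
    return [[math.exp(-((y - c) ** 2 + (x - c) ** 2) / (2.0 * sigma ** 2))
             for x in range(size)] for y in range(size)]


def window_starts(length, window, stride):
    """Start offsets covering [0, length); the last window is shifted to fit."""
    assert window <= length
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        starts.append(length - window)
    return starts


def blend_windows(height, width, window, stride, predict, sigma=None):
    """Blend per-window score maps into one map using Gaussian weights.

    predict(top, left) is a hypothetical callback returning a window x window
    list of scores for the crop whose top-left corner is (top, left).
    """
    sigma = window / 4.0 if sigma is None else sigma
    kernel = gaussian_kernel_2d(window, sigma)
    acc = [[0.0] * width for _ in range(height)]
    weight = [[0.0] * width for _ in range(height)]
    for top in window_starts(height, window, stride):
        for left in window_starts(width, window, stride):
            scores = predict(top, left)
            for dy in range(window):
                for dx in range(window):
                    w = kernel[dy][dx]
                    acc[top + dy][left + dx] += w * scores[dy][dx]
                    weight[top + dy][left + dx] += w
    # Every pixel is covered by at least one window, so weight is positive.
    return [[acc[y][x] / weight[y][x] for x in range(width)]
            for y in range(height)]
```

Because predictions near a window's border receive low weight, seams between overlapping windows are smoothed rather than hard-cut. A multi-class model would apply the same accumulation to each class channel.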

[CV-117] TurboGS: Accelerating 3D Gaussian Splatting via Error-Guided Sparse Pixel Sampling and Optimization ICML2026

Link: https://arxiv.org/abs/2606.15924
Authors: Zheng Dong, Daifei Qiu, Pinxuan Dai, Ke Xu, Jiamin Xu, Lili He, Rynson W.H. Lau, Weiwei Xu
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted by ICML 2026. Project page: this https URL

Abstract:Consumer-level applications require fast optimization of 3D Gaussian Splatting (3DGS) with high-fidelity novel view rendering. However, existing 3DGS acceleration approaches still incur substantial computation on redundant pixels while sacrificing fine details. In this paper, we present TurboGS, an error-guided training framework that accelerates 3DGS by concentrating optimization on perceptually informative pixels. TurboGS is built upon four core components: (1) a tile-wise sparse pixel sampling, which, driven by multi-view reconstruction errors during training, prioritizes challenging regions and skips well-reconstructed ones to avoid redundant gradient computation; (2) a tile-wise structure-aware loss with sparse Normalized Cross-Correlation, which provides sparse yet effective supervision to preserve fine details and stabilize training; (3) an error-driven Gaussian density control strategy, which dynamically allocates model capacity and removes redundant primitives; and (4) a tailored hybrid optimizer that couples Hessian-informed updates with Adam moment damping to stabilize and improve convergence under sparse supervision. Experiments on standard benchmarks demonstrate that TurboGS can deliver on par or superior rendering quality within 100 seconds on a single RTX 5090 GPU card (up to 10x training speedup over vanilla 3DGS).
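
The idea behind component (1), spending pixel samples on poorly reconstructed tiles and skipping well-reconstructed ones, can be sketched as a simple budget allocation. This is an illustrative assumption about how such a scheme might look, not the paper's algorithm: the proportional rule, the `skip_below` threshold, and the function name `allocate_pixel_budget` are all hypothetical.

```python
def allocate_pixel_budget(tile_errors, budget, skip_below):
    """Split a per-iteration pixel budget across tiles in proportion to error.

    tile_errors: mapping from tile id to a reconstruction-error estimate.
    skip_below: tiles with error under this threshold get no samples.
    Rounding means the allocations may not sum exactly to the budget.
    """
    allocation = {tile: 0 for tile in tile_errors}
    active = {t: e for t, e in tile_errors.items() if e >= skip_below}
    total = sum(active.values())
    if total == 0:
        return allocation
    for tile, error in active.items():
        allocation[tile] = round(budget * error / total)
    return allocation


# Toy usage: tile "c" is already well reconstructed and is skipped.
plan = allocate_pixel_budget({"a": 0.5, "b": 0.25, "c": 0.01},
                             budget=300, skip_below=0.05)
```

Here the two high-error tiles share the whole budget in a 2:1 ratio, so no gradients are computed for pixels in the skipped tile.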

[CV-118] OmniOPSD: Rationale-Privileged On-Policy Self-Distillation for Affective Computing

Link: https://arxiv.org/abs/2606.15920
Authors: Zebang Cheng, Shuimu Chen, Boxue Yang, Yuanshen Guan, Jingyi Chen, Zheng Lian, Xiaojiang Peng, Fei Ma, LaiZhong Cui, Qi Tian
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Reinforcement learning for multimodal large language models (MLLMs) is often hindered by severe reward sparsity in complex reasoning tasks. This challenge is particularly pronounced in human-centered scenarios involving states, emotions, intentions, and behaviors, where heterogeneous multimodal signals and subjective human factors make high-quality chain-of-thought (CoT) annotations expensive and difficult to obtain. Although many multimodal datasets provide expert-annotated ground-truth labels, directly using these labels for supervised fine-tuning may encourage shortcut learning in multimodal perception and provides limited transparency for safety-critical human–AI interaction. To address these limitations, we propose OmniOPSD, a Rationale-Privileged On-Policy Self-Distillation framework that uses frontier-generated rationales as teacher-side privileged evidence rather than student imitation targets. OmniOPSD uses frontier-generated evidence-aware rationales only as training-time privileged evidence context for a local teacher. The student samples its own rollout from the original multimodal input, while the rationale-privileged teacher scores the same tokens and provides dense token-level supervision. Thus, the student learns on its own trajectory distribution without directly imitating frontier-model completions, and inference requires no labels, rationales, CoT annotations, or closed-source model access. Experiments on MER-UniBench show that OmniOPSD achieves state-of-the-art performance with an average score of 84.19, and ablations further support the value of rationale-privileged teacher guidance.

[CV-119] High-Fidelity 4D Hand-Object Capture via Multi-View Spatiotemporal Tracking and Physics-Aware Gaussians

Link: https://arxiv.org/abs/2606.15908
Authors: Bo Peng, Xu Chen, Yi Gu, Hidenobu Matsuki, Mingsong Dou, Jingjing Shen, Deying Kong, Juyong Zhang, Zhengyang Shen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:The growing demand for high-fidelity 4D hand-object interaction (HOI) data in embodied AI and spatial computing is currently bottlenecked by the reliance on pre-scanned object templates and physical markers. While recent methods have demonstrated promising results in reconstructing 4D hand-object interaction from videos, they are highly sensitive to initial estimates of hand and object poses. Yet, estimating these poses from images is challenging, in particular under severe occlusion which is inherent in hand-object interaction scenarios. We propose a novel system for the robust and accurate reconstruction of hands and objects from synchronized and calibrated multi-view videos without requiring any templates or markers. Our system consists of two main components with key innovations: (1) a multi-view feed-forward transformer model that aggregates cross-view geometry and temporal cues to provide a reliable, metric-consistent initialization for both poses and dense object geometry, and (2) a hand-object physics-aware Gaussian-based optimization framework to refine the initial estimates, integrating tetrahedral constraints, collision refinement, and appearance decomposition to produce physically plausible and visually accurate reconstruction. Validated on public benchmarks and an extensive internal dataset, our pipeline achieves highly robust, artifact-free reconstruction, providing an efficient foundation for automated 4D asset generation. Our project page is available at this https URL.

[CV-120] SiGnature: Explicit Motion Diffusion for Stylized Semantic Gesture

Link: https://arxiv.org/abs/2606.15889
Authors: Adi Rosenthal, Tomer Koren, Nadav Shaked, Doron Friedman, Ariel Shamir
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:While recent advances in co-speech gesture generation have achieved impressive rhythmic synchronization, synthesizing gestures that are both semantically meaningful and faithful to a speaker's unique non-verbal style remains an open challenge. Semantic gestures, such as iconic shapes or deictic pointing, are statistically sparse, making them difficult to learn effectively within standard generative models. We present SiGnature, a framework for Stylized and Semantic Gesture generation that reconciles precise semantic control with high-fidelity style preservation. Unlike prevalent methods that rely on entangled latent representations, SiGnature operates in an explicit joint-rotation space. This design enables our core contribution, Joint Motion Integration (JMI), a training-free inference mechanism capable of injecting any external motion sequence, particularly in-the-wild semantic gestures, directly into the diffusion process. JMI automatically identifies the specific "active joints" conveying a semantic action and injects them into the generation, while relying on the diffusion backbone to synthesize the remaining body dynamics, including posture and flow, in accordance with the pre-learned style of the target speaker. This allows for the plug-and-play integration of arbitrary motions, including complex semantic gestures, without retraining or introducing the "Frankenstein" artifacts typical of cut-and-paste methods. Extensive experiments and perceptual studies demonstrate that SiGnature offers superior semantic motion control while maintaining smooth and natural co-speech gesture generation and preserving the distinct characteristics of the speaker, thereby outperforming state-of-the-art baselines.

[CV-121] Text region detection in historical astronomical diagrams

Link: https://arxiv.org/abs/2606.15886
Authors: Zeynep Sonat Baltacı, Raphaël Baena, Fei Meng, Somkéo Norindr, Florence Somer, Matthieu Husson, Mathieu Aubry
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Text detection is a crucial task in the analysis of historical documents. While datasets and benchmarks exist for text detection in manuscripts and maps, the study of text in mathematical diagrams has received little attention. To address this, we introduce a large-scale, diverse, open-access dataset of 948 historical astronomical diagrams containing 10,940 oriented polygonal text regions. Our dataset spans ten centuries (8th to 18th) and seven main linguistic traditions: Arabic and Persian (115), Chinese (332), Byzantine (233), Latin (185), Hebrew (48), and Sanskrit (35). It captures a wide range of diagram styles and textual content, from symbols to multi-line paragraphs. Each text instance is annotated with ordered polygons that precisely delineate text regions and encode the reading direction. In addition, we annotated the 2,293 regions in Latin diagrams with 20 class labels. We evaluated several strong baselines on our dataset, including TESTR, DeepSolo++, and Poly-DETR, a simple extension of DINO-DETR that we design to predict ordered polygon vertices. Poly-DETR achieves state-of-the-art performance on the MTHv2 and cBAD2019 benchmarks and provides a solid, simple baseline on our dataset. Code and dataset available online.

[CV-122] Deep Residual Injection for Full-Spectrum Forensic Signal Perception in Multimodal Large Language Models ICML2026

Link: https://arxiv.org/abs/2606.15880
Authors: Kaiqing Lin, Zhiyuan Yan, Ruoxin Chen, Ke-Yue Zhang, Yue Zhou, Caiyong Piao, Bin Li, Taiping Yao, Bo Wang, Youchang Xiao, Shouhong Ding
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at ICML 2026

Abstract:Multimodal large language models (MLLMs) have been increasingly adopted in forensics for their robust semantic understanding. As AI-generated images become more realistic, semantic-level inconsistencies alone are often insufficient for reliable detection. This motivates a critical question: whether MLLMs can achieve full-spectrum forensic signal perception, i.e., capturing low-level generator artifacts without sacrificing pre-trained semantic knowledge. We further perform a layer-wise analysis of forensic signal perception in MLLMs, showing that semantic information is primarily formed in the early-to-middle layers, whereas direct fine-tuning for artifact learning disrupts these semantic representations. Based on this insight, we propose Deep Visual Residual MLLM (Deep-VRM) to preserve early semantic processing while injecting artifact-specific visual signals as a residual path into an intermediate layer, where they are fused with semantic token representations and propagated through subsequent trainable layers. This enables later layers to jointly model semantic reasoning and signal-level forensic cues, and surprisingly, the model learns to adaptively leverage different levels of forensic signals depending on the input, achieving robust and generalizable detection performance. Extensive experiments show that our method achieves state-of-the-art performance across most benchmarks. The code and data are available at this https URL.

[CV-123] Metis: A Generalizable and Efficient World-Action Model for Autonomous Driving and Urban Navigation

Link: https://arxiv.org/abs/2606.15869
Authors: Jingyu Li, Zhe Liu, Dongnan Hu, Junjie Wu, Zipei Ma, Wenxiao Wu, Chao Han, Zhihui Hao, Zhikang Liu, Kun Zhan, Jiankang Deng, Xiatian Zhu, Li Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:World action models (WAMs) have shown great promise for autonomous driving and urban navigation. Built upon Vision-Language-Action models or video generation models, existing approaches suffer key limitations: (1) high inference latency due to future observation prediction at test time, and (2) tightly coupled video and action modeling leading to representational mismatch and degraded generalization. To address both issues, we propose Metis, an end-to-end WAM framework that decouples video generation and action prediction. Specifically, Metis employs a Mixture-of-Transformers architecture with dedicated experts for video generation and action prediction, preserving the intrinsic distributional properties of each task. To enhance efficiency, we introduce an asymmetric attention mask that enables joint training of both experts while allowing the action model to bypass explicit video generation during inference. This design ensures training-inference consistency and significantly reduces computational costs without compromising planning performance. Extensive experiments demonstrate state-of-the-art performance on the NAVSIM navhard and navtest benchmarks and the CityWalker navigation benchmark, validating both the generalizability and efficiency across diverse tasks. Real-robot deployments further confirm the practical feasibility of our approach.

[CV-124] CogCanvas: A Benchmark for Evaluating Multi-Subject Reference-Based Image Generation

Link: https://arxiv.org/abs/2606.15867
Authors: Long-Bao Nguyen, Quang-Khai Tran, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Multi-subject reference-based image generation requires jointly preserving multiple human identities, binding per-person objects and fashion items, and respecting a specified background scene, a regime where current diffusion models remain brittle. Existing benchmarks evaluate only one axis at a time and none jointly captures multi-identity composition with human-object interaction, background grounding, and spatial plausibility. We introduce CogCanvas, a benchmark of 1,952 curated reference images spanning 100 celebrity identities, 115 distinctive objects and fashion items, and 29 real-world background scenes including landmarks, from which we construct 1,361 compositional prompts covering 2-5 person group sizes. The curation pipeline combines DINOv2-based deduplication, two-stage aesthetic filtering, and automated derivation of structured interaction and position graphs that serve as ground-truth supervision. CogCanvas supports three tasks, reference-based multi-human-object generation (primary), text-to-image compositional generation, and reference retrieval, under a unified six-axis evaluation protocol. We introduce two metrics tailored to the multi-reference setting: BG-Sim, which scores background fidelity on SAM 3-masked regions via DINOv3 feature similarity, and Attr-VQA, which uses a multimodal LLM to verify per-subject attribute binding and inter-person interactions against the structured graphs. Benchmarking five SOTA methods reveals that every model degrades substantially as group size grows from 2 to 5, with near-complete failure on object/fashion binding beyond three subjects.

[CV-125] Object Tokens as a Bridge Between Segmentation and Visual Question Answering in Robotic Surgery

Link: https://arxiv.org/abs/2606.15861
Authors: Yiping Li, Ronald de Jong, Romy van Jaarsveld, Franco Badaloni, Gino Kuiper, Jelle Ruurda, Josien Pluim, Marcel Breeuwer
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Visual Question Answering (VQA) in robotic surgery, referred to as surgical VQA, requires high-level understanding of complex surgical scenes and the integration of visual perception with language reasoning, with the potential to support surgical training and intraoperative decision-making. Recent Vision-Language Models (VLMs) have shown promising performance through parameter-efficient fine-tuning; however, most existing approaches rely on coarse visual grounding, typically limited to bounding boxes, which fails to capture the fine-grained spatial structure of surgical objects. In this work, we propose a unified framework that jointly performs pixel-level segmentation and visual question answering. Our approach integrates a VLM with a Segment Anything Model (SAM)-based decoder and represents scene elements as object tokens generated by the VLM. These object tokens guide answer prediction and are further projected to the SAM-based decoder to produce segmentation masks. By optimizing the object token embeddings through both segmentation and question answering objectives, the model learns spatially grounded representations that enhance visual reasoning while providing explicit pixel-level grounding. We evaluate the proposed method on the private RAMIE (Robot-Assisted Minimally Invasive Esophagectomy) dataset and the public EndoVis18 dataset, where it consistently outperforms baseline methods for surgical VQA. These results demonstrate that incorporating context-aware object tokens into vision-language models improves fine-grained surgical scene understanding.

[CV-126] A Dual-Branch Collaborative Framework for Joint Optimization of Underwater Image Enhancement and Object Detection

Link: https://arxiv.org/abs/2606.15857
Authors: Liyuan Cao, Zheng Liu, Guanghao Liao, Yonghui Yang, Qi Li
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Due to wavelength-dependent light absorption and scattering, underwater images usually suffer from color distortion and blurred details, which limits underwater object detection performance. Existing underwater image enhancement methods mainly focus on improving visual quality, and they still struggle to balance enhancement quality, processing efficiency, and downstream detection performance. Therefore, this paper proposes an efficient dual-branch underwater image enhancement framework for object detection. The detail enhancement branch improves brightness and local contrast to recover texture details in dark regions. The color restoration branch uses adaptive compensation to reduce color distortion and improve color gradation. By combining the complementary outputs of the two branches, the proposed framework provides clearer and more informative images for object detection. On the UIEB and EUVP datasets, the proposed method achieves UIQM scores of 2.249 and 2.576. When applied to the YOLOv8 detection task on the URPC dataset, the proposed method improves mAP50 by 2.1% compared with the baseline. Extensive experiments show that our method improves object detection in complex underwater scenes, while balancing enhancement quality and processing efficiency.

[CV-127] EmoZone-Talker: Regional Semantic Control of Audio-Driven 3DGS Talking Heads via Facial Action Units

Link: https://arxiv.org/abs/2606.15848
Authors: Tingting Chen, Shaojun Wang, Huaye Zhang, Diqiong Jiang, Chenglizhao Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:3D Gaussian Splatting (3DGS) has shown strong potential for high-fidelity talking head synthesis. However, enabling fine-grained, interpretable, and editable facial expression control remains fundamentally challenging due to intrinsic conflicts between speech-driven facial dynamics and explicit expression signals. Existing methods rely on implicit multimodal fusion, leading to spatial entanglement and temporal instability. We present EmoZone-Talker, a novel framework that reformulates audio-driven facial animation as a structured spatial-temporal coordination problem under cross-modal conflicts. Our approach introduces explicit spatial disentanglement and temporal dynamics modeling of facial motion. Specifically, we propose Synergy Zones with Prioritized Attention Bias (SZ-PAB) to explicitly decouple modality contributions via region-wise constraints guided by anatomical priors, and a Channel-Independent Temporal AU Encoder (CIT-AE) to model temporally coherent AU dynamics. By integrating these representations into 3D Gaussian deformation, EmoZone-Talker enables precise and interpretable control over facial expressions. Extensive experiments demonstrate that our method improves expression controllability and realism, with notable gains in upper-face accuracy and temporal coherence, while preserving high rendering quality and accurate lip synchronization. Code will be publicly released to facilitate reproducibility and further research.

[CV-128] Learning a Sampling-Free Variational DNN Plugin from Tiny Training Sets to Refine OOD Segmentation With Uncertainty Estimation

Link: https://arxiv.org/abs/2606.15837
Authors: Jimut B. Pal, Suyash P. Awate
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Comments: Accepted at the Journal of Machine Learning for Biomedical Imaging

Abstract:Deep neural networks (DNNs) frequently fail to generalize to out-of-distribution (OOD) medical images because of variations in scanners and acquisition protocols. Retraining DNN models to address these distribution shifts is often impractical due to the high cost of acquiring and annotating new medical datasets. To address this, we introduce VarDeepPCA, a novel lightweight variational DNN framework designed to restore/refine degraded segmentation maps by leveraging intrinsic geometric priors. Unlike existing approaches that require target-domain data or extensive pre-training, our VarDeepPCA explicitly learns a distribution of valid anatomical geometries using only small in-distribution (ID) datasets. Theoretically, our novel variational learning framework leverages a reinterpretation of the softmax mapping to implicitly perform exact distribution modeling, thereby enabling computationally efficient, sampling-free learning and inference. This also enables VarDeepPCA to provide uncertainty estimates associated with its restored segmentation maps. We empirically validate our framework across 4 distinct clinical applications, using 14 publicly available datasets, involving segmentation of the myocardium, neuroretinal rim, prostate, and fetal head. Comparisons against 15 existing methods demonstrate that VarDeepPCA consistently restores segmentation maps produced by the existing methods on OOD data to (i) significantly improve anatomical plausibility of geometries and clinical utility of the segmentations, and (ii) significantly reduce errors, without needing any more training data than that used by existing methods.

[CV-129] SACE: Concept Erasure at the Semantic Singularity in Visual Autoregressive Models

Link: https://arxiv.org/abs/2606.15819
Authors: Siya Yang, Nanxiang Jiang, Zhaoxin Fan, Yunfeng Diao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:The rapid progress of visual autoregressive (VAR) models has unlocked a transformative frontier for high-fidelity text-to-image synthesis, while heightening concerns over the safety alignment of generated content. Naive application of existing erasure techniques to VAR models causes catastrophic semantic collapse and visual artifacts, since they are predominantly designed for the homogeneous denoising steps of diffusion models. To address this foundational challenge, we first propose the Semantic Singularity Axiom, which posits that any target semantic concept embedded within a prompt is definitively locked at Scale-0. We then rigorously validate this axiom through our proposed Incremental Semantic Saliency Analysis (ISSA), which also enables the community to transparently inspect the coarse-to-fine semantic injection process. Guided by this insight, we introduce the first scale-aware concept erasure framework (SACE) for VAR models. By strictly confining interventions to the first scale, our approach couples an Entropy-Regularized Erasure Objective, which prevents high-entropy sampling degeneration, with a restorative preservation loss that safely anchors the integrity of entangled benign priors. Extensive experiments demonstrate that our method achieves surgical concept erasure performance across various domains with minimal training overhead, resolving the critical safety vulnerabilities inherent in emerging VAR architectures in a timely and elegant manner. Code is available at: this https URL.

[CV-130] CPS4: Class Prompt driven Semi-Supervised Spine Segmentation with Class-specific Consistency Constraint

Link: https://arxiv.org/abs/2606.15802
Authors: Qingtao Pan, Hongzan Sun, Bing Ji, Shuo Li
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Vision Language Models (VLMs) have great potential to enhance the quality of pseudo labels in semi-supervised spine segmentation by leveraging textual class prompts to generate segmentation maps, but this direction has not yet been studied. Although promising, this approach lacks explicit constraints to ensure consistency between spine class prompts and spine unit regions, resulting in unsatisfactory performance in multi-class segmentation map generation. In this paper, we propose CPS4, the first text-guided semi-supervised spine segmentation network using class prompts to enhance the quality of spine pseudo labels. Specifically, CPS4 is implemented through two training stages. (i) Class-specific consistency constrained VLM pretraining stage: we propose token- and pixel-level attention losses to optimize the consistency between class prompts and spine units, forcing the textual class prompt to be closely coupled with the target spine unit in the semantic space. (ii) Class Prompt driven semi-supervised spine segmentation stage: using the pretrained vision-text encoder, we derive each class-specific binary segmentation map for the unlabeled spine image and integrate them into a unified multi-class segmentation map, improving the quality of the spine pseudo label generated by the semi-supervised spine segmentation network. Experimental results show that CPS4 achieves superior spine segmentation performance, with a Dice of 80.44% using only 5% labeled data on the public spine segmentation dataset, surpassing popular semi-supervised learning and VLM methods. Our code will be made available.

[CV-131] DifFRACT: Diffusion Feature Reconstruction and Attribution for Circuit Tracing

Link: https://arxiv.org/abs/2606.15796
Authors: Artyom Mazur,Nina Konovalova,Aibek Alanov
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Mechanistic interpretability seeks to explain neural network behavior by decomposing model computations into interpretable features and circuits. While transcoder-based circuit tracing has recently enabled detailed causal analyses of large language models, multimodal diffusion transformers for image generation remain comparatively opaque. We still lack tools for understanding how semantic information propagates across denoising steps and how text and image representations interact within double-stream MM-DiT architectures. Existing methods provide only partial insight: attention maps expose a limited view of token interactions, while sparse autoencoders can discover interpretable features but do not directly reveal how these features are transformed and composed through nonlinear MLP layers. In this work, we extend transcoder-based circuit tracing to multimodal diffusion transformers. We train timestep-conditioned transcoders that faithfully approximate the input-output behavior of MLP sublayers in FLUX.1[schnell]. By replacing MLPs with transcoders and linearizing the remaining computation, we obtain exact feature-to-feature attribution and recover compact, interpretable circuits. Empirically, our transcoders match or slightly outperform sparse autoencoders on the sparsity-faithfulness tradeoff. The resulting circuits reveal mechanisms underlying attribute binding and cross-stream semantic propagation, and provide causal explanations for systematic generation errors. Moreover, circuit-guided interventions are substantially more precise and effective than standard SAE-based steering. Our results demonstrate that transcoder-based circuit analysis is feasible for state-of-the-art diffusion transformers and provides a powerful framework for understanding and controlling multimodal generative models. The code is available at this https URL

[CV-132] Domain-Guided Prompting of the Segment Anything Model for Seismic Interpretation: The Role of Attributes, Visualization, and Hybrid Prompts

Link: https://arxiv.org/abs/2606.15786
Authors: Aniq Ahmad,Heather Bedle,Ahmad Mustafa
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Geophysics (physics.geo-ph)
Comments:

Abstract:The advent of large pretrained foundation models for computer vision has significantly improved the efficiency of visual data interpretation. The Segment Anything Model (SAM), in particular, offers powerful zero shot segmentation capabilities through prompt based interaction, thus making it a promising tool for seismic interpretation. However, most existing applications of SAM rely on fine tuning for specific geological targets, which requires extensive labeled data, incurs high computational cost, and often compromises the model’s generalization capability. In this study, we introduce a principled framework for zero shot adaptation of foundation models to seismic data. The framework is built on two key components: (1) aligning seismic attributes and visualization choices (e.g., colormaps) with the geological target of interest, and (2) employing a hybrid prompting strategy that combines sparse user defined point prompts with dense mask prompts derived from SAM’s internal feature activations. We systematically evaluate this framework across multiple geological targets, datasets, prompt configurations, and seismic attribute representations. Our results demonstrate that geologic target aware selection of seismic attributes and colormaps, combined with hybrid prompting, enhances the separability of geological features and improves boundary delineation and segmentation accuracy relative to point based prompting alone. Our findings show that, when these components are jointly applied, SAM can achieve competitive segmentation performance in a fully zero shot setting, thereby eliminating the need to retrain SAM for each geologic feature. This work establishes a practical and scalable pathway to leverage foundation models in seismic interpretation, reducing reliance on labeled data while preserving model generality.

[CV-133] Mitigating Visual Hallucinations in Multimodal Systems through Retrieval-Augmented Reliability-Aware Inference

Link: https://arxiv.org/abs/2606.15782
Authors: Pratheswaran Hariharan,Haiping Xu,Donghui Yan
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 28 pages, 9 figures

Abstract:Multimodal large language models (MLLMs) have demonstrated strong capabilities in vision-language understanding and natural-language response generation. However, these systems can still produce overconfident predictions and hallucination-like outputs, particularly when the visual evidence is weak, ambiguous, or semantically inconsistent. Most existing approaches focus on improving multimodal representation alignment or retrieval-augmented generation, while providing limited mechanisms to quantify instance-level prediction reliability or identify incorrect visual outputs. This work proposes a retrieval-augmented reliability-aware inference framework for trustworthy multimodal visual understanding. The proposed framework constructs an external visual evidence database using pretrained visual embeddings and nearest-neighbor retrieval over normalized feature representations. Retrieved evidence is used to estimate prediction trustworthiness through multiple reliability indicators, including similarity strength, class-support agreement, evidence margin, entropy-based uncertainty, and an aggregate reliability score. Based on these signals, a decision gate determines whether the system should accept the prediction, answer with caution, or abstain/fallback when evidence is insufficient. A multimodal response-generation layer then produces a final user-facing response conditioned on the reliability decision. Experiments on ImageNet-100 demonstrate that the proposed reliability-aware framework improves accepted prediction accuracy from 85.84% to 88.88% at 89.04% coverage. The hallucination-like accepted wrong-answer rate is reduced from 14.16% to 11.12%. These results show that integrating retrieval evidence, reliability estimation, and selective decision gating can improve calibration and reduce overconfident visual errors without retraining large multimodal models.
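The reliability indicators and decision gate described in this abstract can be illustrated with a small sketch. The abstract does not give formulas, so everything here is an assumption: the signal definitions, the unweighted average used as the aggregate score, the thresholds `accept_at` and `caution_at`, and the function names are all hypothetical, not the authors' implementation.

```python
import math
from collections import Counter


def reliability_signals(neighbors):
    """Compute simple reliability signals from k retrieved neighbors.

    neighbors: list of (label, cosine_similarity) pairs (assumed input format).
    """
    if not neighbors:
        raise ValueError("at least one retrieved neighbor is required")
    k = len(neighbors)
    votes = Counter(label for label, _ in neighbors)
    ranked = votes.most_common()
    top_label, top_count = ranked[0]
    second_count = ranked[1][1] if len(ranked) > 1 else 0
    # Shannon entropy of the label distribution among the neighbors.
    entropy = -sum((c / k) * math.log(c / k) for _, c in ranked)
    max_entropy = math.log(k) if k > 1 else 1.0
    return {
        "label": top_label,
        "similarity": sum(s for _, s in neighbors) / k,  # similarity strength
        "agreement": top_count / k,                      # class-support agreement
        "margin": (top_count - second_count) / k,        # evidence margin
        "certainty": 1.0 - entropy / max_entropy,        # entropy-based signal
    }


def decide(neighbors, accept_at=0.7, caution_at=0.4):
    """Gate a prediction as 'accept', 'caution', or 'abstain' (hypothetical thresholds)."""
    s = reliability_signals(neighbors)
    score = (s["similarity"] + s["agreement"] + s["margin"] + s["certainty"]) / 4
    if score >= accept_at:
        action = "accept"
    elif score >= caution_at:
        action = "caution"
    else:
        action = "abstain"
    return action, s["label"], score


strong = [("cat", 0.9)] * 5
weak = [("cat", 0.3), ("dog", 0.3), ("fox", 0.2), ("owl", 0.2)]
print(decide(strong)[0], decide(weak)[0])
```

Consistent, high-similarity evidence clears the gate, while evidence scattered across classes falls through to abstention. In practice the thresholds would be tuned on held-out data to trade coverage against accepted-prediction accuracy.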

[CV-134] Faithful Action-unit Causal Reasoning for Counterfactually Faithful Emotion Explanations

Link: https://arxiv.org/abs/2606.15779
Authors: Van Thong Huynh,Hong Hai Nguyen,Thuy Pham,Trong Nghia Nguyen,Soo-Hyung Kim
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Multimodal models can name the action units (AUs) behind a facial emotion, but their AU-emotion rationales are typically plausible rather than faithful: nothing forces the AUs a model invokes to be the AUs that actually drive its prediction. We cast AU-emotion reasoning as a counterfactual-consistency problem between the rationale, the label, and a structural AU-emotion causal graph G, and propose FACR, which grounds the reasoner in an independently induced, polarity-aware G and trains a counterfactual-faithfulness objective: a do-intervention on an AU that G marks causal for a class must move the prediction, while one it marks irrelevant must leave it unchanged. Faithfulness is thereby both trainable and measurable through a matching interventional metric, which we evaluate against a known causal structure, the PSPI pain-AU composition, as no existing affective-reasoning benchmark allows. We are explicit that this metric tests fidelity to the supplied structure rather than its rediscovery: it asks whether the trained reasoner invokes the AUs the structure marks causal, on held-out subjects and a second dataset. Under subject-independent evaluation on UNBC-PAIN, the objective raises the agreement between the invoked AUs and the PSPI composition from a no-objective baseline of 0.08 to 0.57, at a small detection cost; an unfaithfulness control attributes the gain to the objective. On a cross-dataset emotion transfer, the objective likewise raises fidelity to G on a seven-class task (0.50 to 0.84). Finally, we attach a language verbalizer and extend the audit to the generated text: biasing each action unit’s emission by its latent activation makes the rationale faithful by construction, so that ablating an AU removes it from the explanation, a property that transfers to a second language-model backbone, whereas a freely generated rationale is unfaithful.

[CV-135] Ellipse Meets Bit-Planes: A Novel Approach to RNFL based Glaucoma Detection Using Advanced Image Processing and Deep Learning

Link: https://arxiv.org/abs/2606.15772
Authors: Snigdha Paul,Sambit Mallick,Anindya Sen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This work proposes an integrated pipeline for automatic glaucoma detection from readily available colour fundus images. The pipeline is based on an adaptive algorithm for ellipse-based polar transformation, which enhances the analysis of the Retinal Nerve Fiber Layer (RNFL) as the primary biomarker for observing glaucomatous changes, regardless of optic disc and macula position. Utilizing this transformation, we introduce two distinct frameworks tailored to different operational needs. The first framework, a deep learning-inspired feature fusion approach, achieves a 99.3% detection rate, ideal for settings where high precision is essential, despite higher computational demands. The second framework employs a novel image-processing algorithm based on bit-plane slicing, offering 92.31% accuracy and optimized for environments requiring rapid inference with minimal resource consumption. Both frameworks provide scalable and cost-effective solutions for early glaucoma detection. This study highlights the potential of RNFL-based diagnostic tools in addressing the global challenge of glaucoma, particularly in underserved regions.
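For readers unfamiliar with bit-plane slicing, the generic operation can be sketched as follows. This shows only the standard decomposition of an 8-bit image into binary planes; it is not the authors' detection algorithm, whose details the abstract does not give. The function names and the nested-list image format are assumptions made for illustration.

```python
def bit_plane(image, bit):
    """Return the binary plane for `bit` (0 = least significant) of an 8-bit image.

    image: list of rows, each a list of integers in 0..255 (assumed format).
    """
    if not 0 <= bit <= 7:
        raise ValueError("bit must be in 0..7 for 8-bit images")
    return [[(pixel >> bit) & 1 for pixel in row] for row in image]


def reconstruct(planes_by_bit):
    """Rebuild an image from {bit: plane}; planes left out contribute zero."""
    first = next(iter(planes_by_bit.values()))
    height, width = len(first), len(first[0])
    return [
        [sum(plane[r][c] << bit for bit, plane in planes_by_bit.items())
         for c in range(width)]
        for r in range(height)
    ]


image = [[0, 255], [129, 64]]
planes = {bit: bit_plane(image, bit) for bit in range(8)}
# Keeping only the two most significant planes gives a coarse approximation.
coarse = reconstruct({7: planes[7], 6: planes[6]})
print(planes[7], coarse)
```

Higher-order planes carry most of the visible structure, which is one reason bit-plane methods can be cheap: a pipeline can inspect a few binary planes instead of the full intensity image.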

[CV-136] Task-Instructed Causal Routing of Vision Foundation Models for Multi-Task Learning

Link: https://arxiv.org/abs/2606.15765
Authors: Donghyun Han,Yuseok Bae,Jung Uk Kim,Hyung-Il Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 6 figures

Abstract:Vision foundation models (VFMs) have demonstrated strong robustness and transferability across a wide range of visual tasks. However, each model typically encodes strong inductive biases shaped by its pre-training objective and data domain, resulting in fragmented yet complementary visual knowledge. As a result, a single model often struggles to capture the diverse visual representations required across multiple dense prediction tasks. To address this limitation, we propose TIGER (Task-Instruction-Guided Expert Routing), a framework that coordinates multiple heterogeneous VFMs for multi-task dense prediction. Instead of naively aggregating expert features, TIGER leverages natural-language task instructions to guide a routing network that assigns token-level expert weights conditioned on task semantics, enabling adaptive integration of complementary expert features. TIGER further introduces a counterfactual loss that aligns routing decisions with each expert’s causal contribution by measuring prediction changes when experts are excluded, encouraging more reliable and interpretable routing. We evaluate TIGER on two multi-task dense prediction benchmarks, NYUD-v2 and Pascal Context, where it consistently outperforms recent multi-task learning baselines while keeping all VFMs frozen. These results demonstrate that combining instruction-guided expert routing with counterfactual causal alignment enables effective coordination of heterogeneous vision foundation models.

[CV-137] The Circumplex Degeneracy Behind the Rare-Class Limit in Affect Recognition

Link: https://arxiv.org/abs/2606.15763
Authors: Van Thong Huynh,Hong Hai Nguyen,Soo-Hyung Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In-the-wild expression recognition persistently fails on a few rare emotions, and the standard explanation is class imbalance. Through a controlled multi-task study on two benchmarks, we show the failure is instead a property of affect geometry: the rare classes are degenerate on Russell’s circumplex, and that degeneracy bounds what any loss or cost can achieve. Our instrument is a circumplex-cost optimal-transport term that prices expression confusions by their valence-arousal distance. The term improves the official score and expression macro-F1, but a control most studies omit shows the gain is not geometric: a uniform cost, equivalent to a generic confidence penalty, matches it on Aff-Wild2 (p=0.625) and significantly exceeds it on AffectNet (+0.057 over base, larger than the circumplex). What the geometry reshapes is the structure of the errors, making them affectively nearer the truth on Aff-Wild2 (p=0.031 against the uniform control), an effect that does not survive on AffectNet, where a visual confound at the far corner of the circumplex overwhelms it. The rare-class failure, by contrast, is stable across both datasets we examine: the degenerate pairs (anger-fear on Aff-Wild2, anger-contempt on AffectNet) resist frequency-based interventions, the transport term, and an action-unit-augmented cost built specifically to separate them. We conclude that progress on rare expressions requires representations that distinguish the classes, not supervision that reprices their confusions, and we provide the controls and metrics needed to tell the two apart.
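The contrast between a circumplex cost and a uniform cost can be made concrete with a small sketch. It relies on one standard fact: when the target is a single class, the optimal-transport cost between the predicted distribution and the target reduces to the expected cost of the predicted labels. The valence-arousal coordinates below are illustrative placeholders, not values from the paper, and the function names are hypothetical.

```python
import math

# Illustrative (valence, arousal) placeholders; NOT values from the paper.
VA = {
    "happy": (0.8, 0.5),
    "sad": (-0.7, -0.4),
    "anger": (-0.6, 0.7),
    "fear": (-0.6, 0.8),
}


def circumplex_cost(a, b):
    """Price a confusion by Euclidean distance on the valence-arousal plane."""
    return math.dist(VA[a], VA[b])


def uniform_cost(a, b):
    """Control: every confusion costs the same."""
    return 0.0 if a == b else 1.0


def transport_loss(probs, true_label, cost):
    """Expected cost of moving predicted mass onto the single true class."""
    return sum(p * cost(true_label, label) for label, p in probs.items())


probs = {"happy": 0.1, "sad": 0.1, "anger": 0.5, "fear": 0.3}
print(transport_loss(probs, "anger", uniform_cost))
print(transport_loss(probs, "anger", circumplex_cost))
```

Under the uniform cost the loss equals one minus the probability of the true class, which is why it behaves like a generic confidence penalty. Under the circumplex cost, classes that sit almost on top of each other (anger and fear in this toy layout) are nearly free to confuse, which gives some intuition for why repricing confusions cannot separate a degenerate pair.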

[CV-138] OmniTraffic: A Controllable Generation Pipeline and Benchmark for Spatio-Temporal Traffic Reasoning

Link: https://arxiv.org/abs/2606.15749
Authors: Maonan Wang,Zhengyan Huang,Kemou Jiang,Yuhang Fu,Jiayue Zhu,Yuxin Cai,Xingchen Zou,Qiaosheng Zhang,Yi Yu,Ding Wang,Xi Chen,Ben M. Chen,Yuxuan Liang,Zhiyong Cui,Man On Pun,Yirong Chen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 34 pages, 28 figures

Abstract:Traffic scene understanding requires models to reason beyond object recognition, including lane topology, multi-view geometry, temporal evolution, and signal-phase semantics. However, existing traffic-oriented multimodal benchmarks largely emphasize passive visual recognition or isolated video understanding, offering limited support for evaluating structure-aware traffic reasoning under controlled conditions. We introduce OmniTraffic, a controllable generation pipeline and benchmark for spatio-temporal traffic reasoning. Built around 12 real-world intersections reconstructed into editable 3D traffic environments and complemented by surveillance footage from two countries, OmniTraffic supports both controlled and natural-condition evaluation. It defines a three-level task hierarchy spanning scene perception, multi-view and temporal reasoning, and decision support. Using structured traffic metadata, OmniTraffic generates synchronized multi-view VQA samples covering vehicle states, lane functions, view–BEV correspondence, temporal dynamics, and signal-phase analysis, resulting in 8M VQA samples and a 3K human-verified test set. Evaluation of eleven frontier MLLMs reveals a large human–model gap, with the most pronounced failures in topology-grounded and spatio-temporal reasoning tasks. Fine-tuning a lightweight MLLM on simulated OmniTraffic data further improves performance on real-world traffic scenes, demonstrating the value of simulation-generated supervision for traffic-specific multimodal reasoning. Beyond a fixed dataset, OmniTraffic provides an extensible pipeline with configurable intersections, camera views, traffic demands, signal phases, visual conditions, and rare events.

[CV-139] MAF: Multimodal Adaptive Few-shot Prompting for Sentiment Analysis with MLLMs

Link: https://arxiv.org/abs/2606.15694
Authors: Hangling Xie
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in understanding complex multimodal content. However, their performance in sentiment analysis exhibits acute sensitivity to prompt design, rendering static, uniformly applied prompts inherently suboptimal for capturing the nuanced multimodal cues that vary across inputs. To address this limitation, we propose a Multimodal Adaptive Few-Shot Prompting (MAF) framework, which dynamically retrieves and integrates query-relevant demonstrations to elicit the sentiment reasoning capabilities of MLLMs in a context-sensitive manner. MAF constructs a demonstration retrieval module that holistically encodes facial expressions, scene context, and textual semantics, with a lip movement amplitude detection mechanism introduced for accurate speaker identification in multi-person scenarios. Departing from conventional fixed-weight fusion, a lightweight coefficient generation network is trained to output query-conditioned fusion weights in real time, enabling weighted aggregation of multimodal similarity scores to retrieve the top-K most informative demonstrations. Prediction stability is further enhanced through majority voting over multiple candidate outputs generated by the MLLM. Extensive experiments on public benchmark datasets demonstrate that MAF achieves substantial and consistent performance improvements over the corresponding backbone variants and remains competitive with strong multimodal sentiment-analysis baselines.
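The retrieval and voting steps described in this abstract can be sketched roughly as follows. The sketch assumes the per-modality similarity scores and the query-conditioned weights are already computed (in the paper the weights come from a trained coefficient network, which is not reproduced here). The function names and the modality keys `face`, `scene`, and `text` are hypothetical.

```python
from collections import Counter


def fuse_scores(per_modality_scores, weights):
    """Weighted aggregation of similarity scores across modalities.

    per_modality_scores: {modality: [score for each candidate demonstration]}
    weights: {modality: non-negative weight}; normalized here to sum to 1.
    """
    total = sum(weights.values())
    n_candidates = len(next(iter(per_modality_scores.values())))
    return [
        sum(weights[m] / total * scores[i] for m, scores in per_modality_scores.items())
        for i in range(n_candidates)
    ]


def top_k(scores, k):
    """Indices of the k highest-scoring candidates, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def majority_vote(labels):
    """Most frequent label among several sampled model outputs."""
    return Counter(labels).most_common(1)[0][0]


scores = {
    "face": [0.9, 0.1, 0.5],
    "scene": [0.2, 0.8, 0.5],
    "text": [0.4, 0.6, 0.5],
}
weights = {"face": 0.6, "scene": 0.2, "text": 0.2}  # stand-in for network output
fused = fuse_scores(scores, weights)
print(top_k(fused, 2), majority_vote(["positive", "negative", "positive"]))
```

With face-dominated weights the first candidate ranks highest; a different query could shift the weights toward text and change which demonstrations are retrieved, which is the point of making the fusion query-conditioned.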

[CV-140] Learning New Tasks via Reusable Skills: Skill-Compositional Experts for Embodied Continual Learning

Link: https://arxiv.org/abs/2606.15685
Authors: Shuaike Zhang,Shaokun Wang,Haoyu Tang,Jianlong Wu,Liqiang Nie
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 5 figures

Abstract:Embodied Continual Learning (ECL) aims to enable robots to continually acquire new manipulation tasks while retaining previously learned behaviors under closed-loop control. Compared with conventional continual learning, ECL suffers from more severe catastrophic forgetting. Feature drift accumulated under closed-loop control progressively propagates through sequential decision-making, leading to degradation of previously learned behaviors. A key challenge in ECL lies in structured skill reuse across continually evolving tasks, since existing methods primarily focus on skill learning without explicitly organizing them for coherent task execution. To address this issue, we propose SCE, a Skill-Compositional Experts framework for ECL. SCE builds a skill base via Compositional Skill Grounding (CSG), which decomposes task demonstrations into reusable skills. Based on this, Dual Execution-and-Transition Experts (DETE) enable new task learning through skill composition, where one branch ensures skill execution and the other supports transitions between skills for coherent behavior. Experiments on LIBERO benchmarks and real-world manipulation tasks demonstrate that SCE consistently improves retention and overall task performance. Further feature drift analyses and ablation studies verify the effectiveness of our method. Project website: this https URL.

[CV-141] 3D Consistency Optimization for Self-Supervised Monocular Video Depth Estimation

Link: https://arxiv.org/abs/2606.15681
Authors: Yuanye Liu,Ke Zhang,Junzhe Jiang,Li Zhang,Vishal Patel,Xiahai Zhuang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reliable monocular video depth estimation is crucial for downstream 3D reasoning and embodied AI in endoscopic navigation. However, existing self-supervised approaches typically treat video frames independently or rely on weak temporal regularization. These methods, lacking a holistic perception of the underlying 3D scene, inevitably suffer from geometrically inconsistent predictions and severe cross-frame drift. To address these limitations, we introduce a new paradigm that recasts sequential video depth estimation as an unconstrained multi-view 3D reconstruction problem, enabling full exploitation of the powerful geometric priors embedded in recent 3D foundation models. The core of our approach is a 3D consistency optimization framework driven by three constraints: image-level photometric rendering, explicit world-coordinate geometric alignment, and multi-scale temporal gradient consistency. Such unified optimization elegantly anchors isolated frames to a globally coherent 3D structure. Our method has been validated in both the self-supervised training scenarios and challenging zero-shot clinical environments. Results show that the proposed approach achieves state-of-the-art spatial accuracy, outperforming the frame-based, video-based depth estimators and the multi-view 3D reconstruction baselines.

[CV-142] CEVAR: Centerline Embedding Extraction for Endovascular Aneurysm Repair MICCAI2026

Link: https://arxiv.org/abs/2606.15667
Authors: Roman Naeem,Timo Niiniskorpi,Charlotte Sandström,Naman Desai,Anders Jeppsson,Ida Häggström,Fredrik Kahl,Håkan Roos,Jennifer Alvén
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted Version. Accepted at MICCAI 2026

Abstract:Long-term mortality rates after endovascular aneurysm repair (EVAR) remain elevated due to post-EVAR rupture caused by loss of seal in stent graft sealing zones. Structured CT review using centerline measurements improves detection, but current workflows require manual centerline editing and expert operators. We propose a transformer framework for automated, protocol-driven sealing zone assessment that combines 3D centerline tracking with embedding-based geometric prediction. Two state-of-the-art image-to-graph models are evaluated for aorto-iliac centerline extraction from follow-up CT and for measurement of stent position, vessel diameters, and seal lengths according to EVAR4C protocol. Across the full test set and a challenging no-contrast subset, the proposed fully automatic method outperforms the commercial semi-automatic workflow.

[CV-143] OneFocus: Enabling Real-World X-ray Security Screening with a Unified Vision-Language Model

Link: https://arxiv.org/abs/2606.15663
Authors: Jiali Wen,Hongxia Gao,Litao Li,Yixin Chen,Kaijie Zhang,Qianyun Liu,Xiaoqin Wen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 10 figures

Abstract:X-ray contraband detection is critical for security in large-scale logistics and transportation, yet conventional detectors struggle to adapt to emerging contraband types and lack fundamental visual understanding. Vision-language models (VLMs) offer strong generalization but are hindered by the scarcity of high-quality X-ray image-caption data. To bridge this critical gap, we present MMXray, a meticulously curated benchmark of 52,124 image-caption pairs spanning 28 fine-grained classes of X-ray contraband. To enrich MMXray with realistic occlusion patterns, we further introduce CleanDET, a dedicated synthesis dataset containing clean foreground contraband images from 28 categories and background images with diverse density levels, together with AnyContraSyn, a controllable synthesis method designed to operate on CleanDET. We also develop OnePipe, an extensible pipeline for systematic data curation. Built on MMXray, we propose OneFocus, a unified VLM that supports four core tasks: visual question answering, contraband localization, classification, and image understanding. OneFocus achieves state-of-the-art performance in X-ray contraband understanding and demonstrates robust cross-domain generalization, establishing a strong vision-language baseline for security screening.

[CV-144] SpatialAvatar-0: High-Quality 4D Head Avatar with Multi-Stage Reconstruction

Link: https://arxiv.org/abs/2606.15659
Authors: Yiran Wang,Zeyu Zhang,Yuanming Li,Ziming Wang,Yang Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:High-quality 4D head avatars from one or a few source portraits are central to telepresence, AR/VR, and digital-human interaction. 3D Gaussian Splatting (3DGS) has emerged as the dominant representation, with two complementary regimes (generalizable feed-forward predictors and per-subject refiners) maturing in parallel. However, existing feed-forward predictors are trained on a single dataset family with a hard-coded source count, inheriting the corresponding domain bias. Per-subject refiners require 300K–600K iterations and rely on adaptive densification that destroys upstream Gaussian layouts, preventing the two regimes from sharing a representation end-to-end. To bridge both regimes we propose SpatialAvatar-0 on a shared FLAME-mesh-bound Gaussian representation: a feed-forward generator with a parameter-free K-source mean-pool and a monocular-temporal to multi-view-spatial two-phase schedule that anchors against identity-prior collapse onto the smaller multi-view set. We further introduce a 10K-iter layout-preserving per-subject refinement loop that freezes the FLAME-binding and Gaussian count and replaces densification with a three-component anti-spike regularization. On VFHQ/HDTF cross-domain zero-shot we surpass the in-domain leader GAGAvatar by +1.5 dB PSNR despite never training on either test domain, and on the SplattingAvatar monocular benchmark we lead every reported metric, surpassing the 300K-iter GeoAvatar by +1.3 dB PSNR at up to 60x shorter per-subject schedule than common SOTA baselines. Website: this https URL.

[CV-145] Self-Questioning Vision-Language Models: Reinforcement Learning for Compositional Visual Reasoning

Link: https://arxiv.org/abs/2606.15651
Authors: Saraswathy Amjith
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-Language Models (VLMs) are AI systems that process both images and text, yet they often struggle with compositional visual reasoning questions that require chaining multiple steps together, such as identifying objects, counting them, and comparing the results. Existing approaches improve this reasoning by training models on human-written step-by-step explanations, but creating these annotations is expensive and difficult to scale. We propose a self-questioning framework that trains a VLM to break visual questions into smaller sub-questions and answer each one before producing a final response, using a reinforcement learning algorithm called Group Relative Policy Optimization (GRPO). The model is never shown examples of how to decompose questions; it discovers this behavior on its own, guided by a reward signal that scores whether the output contains sub-questions and whether the final answer is correct. We apply this framework to a 3-billion-parameter model, training on both synthetic scenes of geometric shapes (CLEVR) and real-world photographs (A-OKVQA). On A-OKVQA, both self-questioning and standard reinforcement learning substantially improve accuracy over the untrained model (52.2% and 51.6% vs. 46.8%). We introduce the first self-questioning VLM by rewarding the model not only for the final answer, as in standard RL, but also for generating intermediate sub-questions, enabling it to discover compositional decomposition strategies. These results suggest that teaching AI systems to ask themselves intermediate questions is a promising strategy for complex visual reasoning, particularly when the difficulty of a question warrants explicit step-by-step decomposition.
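A reward of the kind this abstract describes (one term for producing sub-questions, one for a correct final answer) might look like the sketch below. The tag format `<sub_q>` / `<answer>`, the weights, and the exact-match check are assumptions for illustration only; the abstract does not specify them. The second function shows the usual GRPO-style normalization of rewards within a group of sampled outputs.

```python
import re
import statistics


def self_questioning_reward(output, gold_answer, format_weight=0.5, answer_weight=1.0):
    """Score one sampled output (hypothetical tag format and weights)."""
    sub_questions = re.findall(r"<sub_q>(.*?)</sub_q>", output, flags=re.DOTALL)
    has_sub_questions = any(q.strip() for q in sub_questions)
    match = re.search(r"<answer>(.*?)</answer>", output, flags=re.DOTALL)
    predicted = match.group(1).strip().lower() if match else ""
    correct = predicted == gold_answer.strip().lower()
    return format_weight * has_sub_questions + answer_weight * correct


def group_relative_advantages(rewards):
    """Normalize rewards within a group: (r - mean) / std, zeros if all equal."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


decomposed = "<sub_q>How many cubes?</sub_q> 2 <sub_q>How many spheres?</sub_q> 3 <answer>spheres</answer>"
direct = "<answer>spheres</answer>"
wrong = "<sub_q>How many cubes?</sub_q> 2 <answer>cubes</answer>"
rewards = [self_questioning_reward(o, "spheres") for o in (decomposed, direct, wrong)]
print(rewards, group_relative_advantages(rewards))
```

Because advantages are relative to the group, an output that both decomposes the question and answers correctly is pushed up relative to a correct but direct answer, which is how a format term can encourage decomposition without any demonstration data.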

[CV-146] Fusing Transferred Priors and Physics-based Decomposition for Underwater Image Enhancement

Link: https://arxiv.org/abs/2606.15648
Authors: Haochen Hu,Yanrui Bin,Zhengyan Zhang,Minchen Wei,Chih-yung Wen,Bing Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Underwater images are captured under diverse water-medium conditions, leading to complex degradation, including color bias, low contrast, and blur. Recently, learning-based methods have demonstrated their potential for underwater image enhancement (UIE). However, most previous work focuses on the training strategy or network design to align the enhanced result with the labels in datasets, ignoring that the labels are selected from the enhanced results of previous UIE methods and that these pseudo-labels are noisy. Consequently, the performance of these models is limited, and collecting true labels for underwater images is challenging. In this work, we propose a transfer learning-based UIE method that does not require underwater images to have paired noisy or true labels for learning. Instead, the UIE task is first divided into global color correction, haze removal, and background noise suppression, following underwater physics. Then multiple types of priors from other vision tasks are leveraged as cross-domain supervision in each step. In this way, a novel UIE method is obtained via transfer learning, and the physics-aligned UIE decomposition provides theoretical soundness. Qualitative and quantitative experiments demonstrate that our proposal, based on the fusion of physics and priors, achieves SOTA performance in the UIE task and effectively boosts downstream vision tasks, significantly outperforming benchmark methods. Project repo: this https URL.

[CV-147] Towards Next-Generation Healthcare: A Survey of Medical Embodied AI for Perception, Decision-Making, and Action

Link: https://arxiv.org/abs/2606.15647
Authors: Cheng Zhang,Qing Cai,Xingzheng Wu,Xun Yang,Xiaojun Chang,Bingkun Bao,Liqiang Nie,Xinwang Liu,Yi Yang
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 19 pages, 9 figures

Abstract:Foundation models have demonstrated impressive performance in enhancing healthcare efficiency across a wide range of medical applications. Nevertheless, their limited ability to perceive, understand, and interact with the physical world significantly constrains their effectiveness in real-world clinical workflows, where safety-critical decision-making and physical execution are tightly coupled. Recently, embodied artificial intelligence (AI) has emerged as a promising physical-interactive paradigm for intelligent healthcare, enabling agents to operate in complex medical environments. As research in this area rapidly expands, understanding how intelligent agents function as integrated, end-to-end systems in clinical environments becomes increasingly critical. However, existing surveys on medical embodied AI largely emphasize individual aspects or functional components, lacking a unified system-level organization of the field. To support and consolidate recent advances, we systematically survey the core components of medical embodied AI, with a particular emphasis on the coordinated integration of perception, decision-making, and action. We further review representative medical applications and relevant datasets, and we analyze the major challenges encountered in real-world clinical practice. Finally, we discuss key directions for future research in this rapidly evolving field. The associated project can be found at this https URL.

[CV-148] Open-World Video Segmentation

Link: https://arxiv.org/abs/2606.15632
Authors: Qing Su,Kaiyang Li,Yuan Zhuang,Fei Miao,Shihao Ji
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While video segmentation has advanced rapidly on short clips and closed-set benchmarks, open-world video segmentation remains largely unexplored. The challenge is twofold: (1) existing methods are not designed to support object discovery and identity maintenance in long videos of dynamic ego-motion, and (2) existing evaluation protocols rely on a rigid 1:1 matching that unfairly penalizes semantically valid predictions with mismatched granularity. To address both gaps, we introduce Savvy, a practical and strong system for zero-shot open-world long-horizon video segmentation. Savvy combines hierarchical mask discovery, deferred admission, and track consolidation to support persistent object discovery, safe track promotion, and stable long-range identity maintenance. We further propose OGA, a granularity-aware evaluation suite for open-world video segmentation. Built on a Granularity-Agnostic (GA) matching protocol, OGA relaxes conventional 1:1 matching to an n:1 mapping, but still enforces temporal rigor by detecting support discontinuities through sever points and scoring each reference object through its dominant coherent fragment. This prevents fragmented or flickering support from being over-rewarded while enabling GA-adapted metrics and two structural diagnostics, identity persistence (IP) and identity concentration (IC). On VIPSeg, we show that standard 1:1 evaluation substantially underestimates open-world methods, whereas GA evaluation recovers much of their suppressed performance. On the more realistic long-horizon benchmarks ScanNet and HM3D, Savvy consistently outperforms strong baselines across both classical and proposed metrics, including STQ, VPQ$_\infty$, IP, and IC. Together, these results establish a practical benchmark and a strong baseline for open-world long-horizon video segmentation.

[CV-149] XPASS-Vis: A Dataset for Cross-Domain Personalized Image Aesthetic Assessment

Link: https://arxiv.org/abs/2606.15629
Authors: Takato Hayashi, Hiroaki Takahara, Candy Olivia Mawalim, Hiromi Narimatsu, Akisato Kimura, Shiro Kumano, Shogo Okada
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Personalized image aesthetic assessment (PIAA) seeks to model, at the individual level, the subjective nature of aesthetic judgments toward artworks and photographs. Aesthetic preference is known to be both deeply personal and partially consistent across visual domains. Yet existing PIAA datasets and methods are largely confined to a single domain, or provide too few samples per annotator within each domain to enable personalization across domains. Consequently, the cross-domain generalization of personalized aesthetic preferences remains largely unexplored. To address this gap, we introduce XPASS-Vis, the first dataset explicitly designed for cross-domain PIAA. XPASS-Vis comprises 6,526 stimuli from three visual domains – art, fashion, and landscape – rated by 129 annotators, yielding 87,836 user-stimulus interactions, each annotated with an overall aesthetic score and nine aesthetic-emotion ratings. Notably, each annotator rated more than 200 stimuli per domain, providing sufficient per-domain coverage to support personalization both within and across domains. Moreover, we establish baseline models for cross-domain PIAA under unsupervised domain adaptation (UDA), where a model trained on a labeled source domain is transferred to an unlabeled target domain. A systematic evaluation of representative UDA approaches shows that the best-performing method recovers approximately 60% (Spearman’s $\rho = .28$) of the supervised upper bound under a fully unsupervised setting. This provides encouraging evidence that personalized aesthetic preferences are, to a meaningful extent, transferable across visual domains. At the same time, a substantial gap remains, highlighting the need for PIAA-specific adaptation strategies. XPASS-Vis and the accompanying baselines provide a foundation for future research on cross-domain PIAA. All datasets and code will be made publicly available upon acceptance.

[CV-150] NeRD: Neuro-Symbolic Rule Distillation for Efficient Ontology-Grounded Chain-of-Thought in Medical Image Diagnosis

Link: https://arxiv.org/abs/2606.15617
Authors: Hongxi Yang, Yiwen Jiang, Siyuan Yan, Jamie Chow, Eunis Li, Charlotte Poon, Stephanie Fong, Xiangyu Zhao, Deval Mehta, Yasmeen George, Zongyuan Ge
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Interpretability is essential for trustworthy medical image diagnosis. However, existing concept-driven interpretable methods have key limitations: Concept Bottleneck Models (CBMs) require scoring all predefined concepts at inference time and for manual intervention, imposing a substantial burden on clinicians, while rationale-based generative approaches often select concepts by class discriminability, which can drift from diagnostic ontologies. To address these issues, we propose Neuro-Symbolic Rule Distillation (NeRD), a framework that produces efficient, ontology-grounded reasoning chains that are sufficient yet non-redundant, without manually crafting diagnostic rules. Experiments on two skin datasets demonstrate strong diagnostic performance and interpretability, and blinded expert evaluation confirms the clinical plausibility of NeRD rationales. Our method further enables a first expert-in-the-loop study for Multimodal Chain-of-Thought-based diagnosis, achieving efficient and effective concept-level intervention.

[CV-151] MoECa: Aligning Feature Reuse with Expert Decomposition in Diffusion Transformers

Link: https://arxiv.org/abs/2606.15615
Authors: Maoliang Li, Haojing Chen, Jiayu Chen, Zihao Zheng, Xinhao Sun, Hailong Zou, Xiang Chen
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: under review

Abstract:Diffusion Transformers with Mixture-of-Experts (DiT-MoE) improve model capacity under sparse activation, but diffusion inference is still bottlenecked by redundant computation across timesteps. Existing caching methods mainly operate at the token level, which becomes suboptimal in DiT-MoE because each token update is internally decomposed into multiple routed expert branches. Our analysis shows that cross-timestep redundancy in DiT-MoE is better characterized at the expert-branch level than at the whole-token level. Based on this observation, we propose MoECa, a fine-grained caching framework that performs branch-level feature reuse across timesteps. MoECa further introduces expert-aware adaptive control and synchronized cache updates across MoE and attention paths to maintain stable intermediate states. Experiments on multiple DiT-MoE models show that MoECa consistently achieves a better speed-quality trade-off than prior caching methods, with up to $2.83\times$ inference speedup and minimal quality degradation.

[CV-152] Variational Test-time Optimization for Diffusion Synchronization

Link: https://arxiv.org/abs/2606.15614
Authors: Hyunsoo Lee, Farrin Marouf Sofian, Kushagra Pandey, Stephan Mandt
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint. Project website: this https URL

Abstract:Collaborative generation, which coordinates multiple diffusion trajectories to extend the capabilities of pretrained priors, has emerged as a powerful paradigm for extending the applicability of diffusion models. Among existing approaches, diffusion synchronization provides a scenario-agnostic solution by introducing general guidance mechanisms. However, current synchronization approaches rely heavily on heuristics and still require task-specific tailoring, which limits their generalizability and performance. In this work, we mathematically derive a synchronization framework based on optimal control, providing a principled explanation of diffusion synchronization. During sampling, we optimize control variables to guide multiple trajectories toward coherent solutions while remaining close to the underlying diffusion prior. Our method operates entirely at test-time without additional training, thereby enabling broad applicability across diverse generation scenarios when combined with strong pretrained priors. We demonstrate consistent improvements over baselines on three representative collaborative generation tasks, covering a wide range of modalities and applications. Beyond performance gains, our work establishes a novel foundation for collaborative generation, opening a principled path toward extending pretrained generative models to new collaborative generation settings.

[CV-153] Mutual Distillation of Dual-Foundation Models for Semi-Supervised PET/CT Segmentation MICCAI2026

Link: https://arxiv.org/abs/2606.15611
Authors: Fuyou Mao, Beining Wu, Yanfeng Jiang, Bohan Xu, Lixin Lin, Naye Ji, Hao Zhang, Yan Tang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: MICCAI 2026

Abstract:Organ segmentation from PET/CT is critical for quantitative analysis and radiotherapy planning in oncology. To ease the high annotation cost of PET/CT segmentation, semi-supervised learning (SSL) provides a practical and effective solution for developing deep models with limited labeled data. Recent developments in visual foundation models have demonstrated remarkable adaptability with improved efficiency. In this work, we propose a mutual distillation framework that seamlessly exploits both structural and functional foundation models, which act as modality-specific generalists for distilling knowledge from structural CT and metabolic PET imaging. By bridging the gap between the task-specific precision of student models and the segmentation priors of generalist foundation models, we propose MuDuo, a mutual distillation framework that synergistically leverages SAM-Med3D for CT and SegAnyPET for PET to distill their knowledge into a lightweight student network. Our approach eliminates the need for manual prompts while maximizing the utility of unlabeled data for automatic segmentation, achieving state-of-the-art performance on the AutoPET dataset with only 5 labeled cases. Our source code is available at this https URL.

[CV-154] On the Adversarial Robustness of Multimodal LLM Judges

Link: https://arxiv.org/abs/2606.15608
Authors: Zihan Wang, Guansong Pang, Zelin Liu, Wenjun Miao, Jin Zheng, Xiao Bai
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal Large Language Models (MLLMs) are increasingly used as automated judges, e.g., for image quality and safety assessment. However, their adversarial robustness remains largely unexplored, threatening the fairness and reliability of automated judging. To bridge this gap, we introduce RobustMLLMJudge, the first general framework for evaluating the adversarial robustness of general-purpose MLLMs when functioning as judges. It covers diverse attacks against popular judge approaches across quality and safety evaluation scenarios. Using RobustMLLMJudge, we reveal that i) different MLLM judges are highly vulnerable to score-inflating adversarial attacks; and ii) although effective, these attack methods face a critical challenge due to unique constraints in the evaluation protocols of MLLM judges. We further propose MGSIA, namely Manifold-Guided Semantic Induction Attack, a novel method that bypasses these constraints to enable more effective and transferable attacks on MLLM judges. The core idea of MGSIA is to combine affirmative semantic induction with high-score manifold alignment: it maximizes the probability that judges yield affirmative responses (e.g., “Yes”) to binary semantic queries, while regularizing adversarial representations toward high-score centers estimated from proxy protocols. Together, these objectives yield transferable score-inflating perturbations. Extensive experiments demonstrate the superiority and generalizability of MGSIA in deceiving advanced MLLM judges under different evaluation scenarios, highlighting the need for robust MLLM judges. Code and data will be made available at this https URL.

[CV-155] Parameter-Efficient Adaptation of SAM 3 for Automated ITV Generation from 4DCT Images

Link: https://arxiv.org/abs/2606.15604
Authors: Changwoo Song
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures, 2 tables

Abstract:Four-dimensional computed tomography (4DCT) captures the full respiratory cycle of thoracic anatomy, yet current Internal Target Volume contouring workflows process each phase in isolation, discarding temporal coherence and leaving contours vulnerable to phase-specific artifacts. We present a lightweight framework that applies parameter-efficient fine-tuning to the Segment Anything Model 3 (SAM 3) via low-rank adaptation (LoRA) to align its text-prompted segmentation with the medical domain using only seven annotated 3D CT volumes. Furthermore, the framework incorporates a hard negative mining strategy to improve boundary discrimination in low-contrast thoracic regions. At inference, phase-wise predictions are refined through phase-coherent temporal filtering and spatial connectivity analysis. Since respiratory motion is continuous and periodic, genuine anatomy appears in contiguous blocks of phases, whereas transient artifacts appear sporadically and are thus effectively suppressed. Experiments on pulmonary and cardiac structures yield median Dice scores of 0.968 and 0.910 with 95th-percentile Hausdorff distances of 0.998 mm and 2.931 mm, respectively. The proposed framework effectively eliminates the severe false-positive predictions inherent in the zero-shot inference of the unadapted SAM 3. With only seven annotated volumes, the framework retains over 95% of full-data accuracy, and the entire pipeline is trainable on a single consumer-grade GPU, demonstrating a scalable, data-efficient solution for adaptive radiotherapy.
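
The temporal filtering idea in this abstract (genuine anatomy persists over contiguous respiratory phases, artifacts appear sporadically) can be illustrated with a minimal sketch. The abstract does not give the actual filter, so the function name `phase_coherent_filter`, the `min_run` parameter, and the per-voxel boolean representation are all assumptions made for illustration.

```python
def phase_coherent_filter(present, min_run=3):
    """Keep detections only inside cyclic runs of at least `min_run` phases.

    `present` holds one boolean per respiratory phase for a single voxel.
    The sequence is treated as cyclic because breathing is periodic.
    `min_run` is a hypothetical threshold, not a value from the paper.
    """
    n = len(present)
    if n == 0:
        return []
    if all(present):
        # One unbroken run covering the whole cycle.
        return [n >= min_run] * n
    # Start scanning just after an empty phase so no run straddles the scan start.
    start = next(i for i, p in enumerate(present) if not p)
    keep = [False] * n
    run = []
    for k in range(1, n + 1):
        idx = (start + k) % n
        if present[idx]:
            run.append(idx)
            continue
        # An empty phase closes the current run; keep it only if long enough.
        if len(run) >= min_run:
            for j in run:
                keep[j] = True
        run = []
    return keep


# Ten phases: a four-phase run that wraps around the cycle, plus one isolated hit.
voxel = [True, True, False, False, True, False, False, False, True, True]
filtered = phase_coherent_filter(voxel, min_run=3)
```

In this example the wrapped run over phases 8, 9, 0, 1 survives and the isolated detection at phase 4 is suppressed, which is the behaviour the abstract attributes to its filtering step.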

[CV-156] Fusion-E2Pulse: A Multimodal Event-RGB Fusion Network for Non-contact Pulse Wave Reconstruction MICCAI2026

Link: https://arxiv.org/abs/2606.15597
Authors: Qian Feng, Hao Guo, Yan Niu, Zhenhuan Xu, Yidi Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by MICCAI 2026. The final version will appear in the official MICCAI proceedings published by Springer

Abstract:Non-contact pulse wave reconstruction hinges on the precise recovery of waveform morphology, including the dicrotic notch. Conventional Red-Green-Blue (RGB)-based methods, which extract physiological signals from recorded facial videos, are constrained by the integral imaging mechanism of standard cameras, where the exposure process induces a smoothing effect that attenuates subtle vascular pulsation details. Conversely, neuromorphic event cameras, while offering exceptional sensitivity to intensity fluctuations, are inherently susceptible to noise and artifacts induced by minor motion. To exploit the synergy between frame-based integration and event-based differential sensing, we propose a novel multimodal network named Fusion-E2Pulse. This framework utilizes filtered RGB signals as structural priors to suppress motion artifacts, while leveraging the high-sensitivity of event streams to recover fine-grained morphological details. Experimental results demonstrate that Fusion-E2Pulse achieves state-of-the-art performance, effectively balancing noise suppression and morphological fidelity, achieving a mean absolute error of 0.78 bpm for heart rate estimation, a waveform correlation of 0.89, and a systolic phase duration error of 16.74 ms, validating its efficacy in reconstructing fine-grained pathological features.

[CV-157] Pixels to Proofs: Probabilistically-Safe Latent World Model Control via Parallel Conformal Robust MPC

Link: https://arxiv.org/abs/2606.15594
Authors: Devesh Nath, Anutam Srinivasan, Haoran Yin, Ruitong Jiang, Jeffrey Fang, Glen Chou
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:

Abstract:We present SLS$^2$, a framework for safe feedback motion planning from pixels using robust model predictive control (MPC) in learned latent world models. Our approach trains an action-conditioned joint-embedding world model with compact Markovian latent states, enabling efficient gradient-based trajectory optimization through learned latent dynamics. To enforce safety for the true system despite imperfect latent predictions, we inform a GPU-accelerated system level synthesis (SLS) robust MPC scheme with conformal prediction to obtain calibrated latent error bounds and robust latent-space constraint sets. We further learn and conformalize a latent constraint checker, allowing the SLS planner to impose probabilistic safety constraints during closed-loop execution. We evaluate our method on vision-based control tasks, where it improves both goal-reaching performance and safety over latent world-model and safe-planning baselines.
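
As a rough illustration of how conformal prediction yields a calibrated error bound, the sketch below implements the standard split-conformal quantile on held-out prediction errors. It is not taken from the paper: the function name `conformal_bound`, the scalar error scores, and the `alpha` level are assumptions, and the way the paper turns such a bound into latent-space constraint sets is not shown here.

```python
import math


def conformal_bound(calibration_errors, alpha=0.1):
    """Split-conformal upper bound on a new prediction error.

    Given n exchangeable calibration errors, the ceil((n + 1) * (1 - alpha))-th
    smallest error covers a fresh error with probability at least 1 - alpha.
    Returns infinity when the calibration set is too small for that level.
    """
    n = len(calibration_errors)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(calibration_errors)[rank - 1]


# Hypothetical calibration errors between predicted and encoded latent states.
errors = [float(e) for e in range(1, 20)]
bound = conformal_bound(errors, alpha=0.25)
```

With 19 calibration errors and `alpha=0.25`, the rank is 15, so the bound is the 15th smallest error. A planner could then inflate its constraints by this amount, under the assumption that test-time errors are exchangeable with the calibration errors.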

[CV-158] DenseControl: Instance-Level Controllable Synthesis of Dense Crowd Image

Link: https://arxiv.org/abs/2606.15592
Authors: Juncheng Wang, Lei Shang, Wang Lu, Baigui Sun, Shujun Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IEEE TMM

Abstract:In this paper, we introduce DenseControl, a novel pipeline for generating dense crowd images. Specifically, DenseControl meticulously positions and sizes each generated instance to align precisely with the predefined coordinates and scales. Based on this, we further allow for control over the background, style, and attributes of instances. The motivation behind DenseControl stems from the observation of two main challenges in synthesizing crowd images: controlling signal embedding and maintaining topological integrity when imparting instance scale guidance. To address these, we first introduce the Isolated Object Embedding (IOE) map, a novel representation that facilitates spatial location control while mitigating the difficulties associated with learning projections for the model. Second, we propose an Implicit Scale Embedding (ISE) strategy that seamlessly integrates with the IOE map to encode precise scale information. To further enhance the efficacy of combining ISE with the IOE map, we incorporate a Position Shortcut mechanism that enhances cross-attention to alleviate projection challenges. We evaluate DenseControl through two lenses: synthesis quality and applicability in latent applications. Experiments across different control conditions demonstrate that DenseControl achieves state-of-the-art results in dense crowd image synthesis. Furthermore, we showcase applications in augmenting crowd analysis under data scarcity, transfer learning, and weather generalization scenes, to highlight the practical utility of DenseControl. The codebase will be released.

[CV-159] Unlocking Diffusion Hierarchies: Adaptive Timestep Selection for Zero-Shot Segmentation

Link: https://arxiv.org/abs/2606.15590
Authors: Ramin Nakhli, Mahesh Ramachandran, Luca Ballan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Zero-shot segmentation has recently shown notable improvement by leveraging the rich visual priors in large-scale text-to-image diffusion models, such as Stable Diffusion. However, current diffusion-based methods often face limitations due to the trade-off between spatial resolution and contextual information, as well as their reliance on a single static timestep for feature extraction. To overcome these challenges, our work introduces two key advancements. First, our Contextual Similarity Maps fuse high-resolution attention maps with rich U-Net encoder features, providing both fine-grained and robust per-pixel representations. Second, we identify an emergent hierarchical semantic progression within the denoising process of various diffusion models: representations transition from part-level abstractions at earlier timesteps to object-level abstractions at later stages. Leveraging this insight, we introduce a mechanism to adaptively select the optimal timestep for each pixel. Extensive experiments demonstrate that our method consistently outperforms existing zero-shot segmentation baselines, validating the efficacy of combining contextual features with dynamic, hierarchical timestep selection.

[CV-160] Toward the Whole Picture: Accumulative Fingerprint Mapping and Reconstruction for Small-Area Mobile Sensors

Link: https://arxiv.org/abs/2606.15574
Authors: Xiongjun Guan, Jianjiang Feng, Jie Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Small-area fingerprint sensing on mobile devices creates a fundamental mismatch between acquisition and recognition: each touch captures only a tiny, pose-varying local patch, while reliable biometric matching ultimately requires a stable and sufficiently complete fingerprint representation. Existing pipelines largely cope with this mismatch by treating repeated touches as independent partial templates, which leads to repeated registration, repeated matching, and no guarantee of adequate global coverage. In this paper, we advocate a different formulation, namely accumulative fingerprint mapping and reconstruction for small-area mobile sensing. Rather than matching every partial patch separately, the proposed perspective converts a sequence of local observations into a unified fingerprint state that is progressively refined as new touches arrive and can be matched only once after consolidation. As a concrete baseline, we present a classical pipeline that performs patch-wise structural feature extraction, feature-level registration and fusion, fingerprint map construction, and phase-based ridge reconstruction. More importantly, we position this baseline within a broader mobile fingerprint framework that integrates structured token learning, two-stage pose reasoning, and diffusion-based generative reconstruction. This viewpoint reframes mobile fingerprint recognition from multi-capture multi-match processing to accumulative map building, state refinement, and one-shot matching, offering a principled route toward efficient, pose-robust, and deployment-friendly biometrics for small-area mobile platforms. The baseline implementation has been publicly released at this https URL.

[CV-161] An Extensive Benchmark for Single-round and Multi-round Instruction-based Image Editing

Link: https://arxiv.org/abs/2606.15570
Authors: Yiwei Ma, Ke Ye, Weihuang Lin, Jiayi Ji, Xiaoshuai Sun, Tat-Seng Chua, Rongrong Ji
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by International Journal of Computer Vision (IJCV), 2026

Abstract:In recent years, there have been notable advancements in the area of instruction-based image editing (IIE), which focuses on the automatic alteration of input images using a model. Nevertheless, assessing the effectiveness of these editing models poses a considerable challenge due to the intricate nature of instructions and the wide variety of edits. To tackle this problem, one urgent task in this domain is the development of a robust evaluation framework that can precisely gauge the quality of editing outcomes and offer valuable benchmarks to guide future improvements. To address this challenge, we present a comprehensive evaluation benchmark named I2EBench2.0, designed for single-round and multi-round assessment of IIE models. I2EBench2.0 has four key features: 1) Evaluation Across Single and Multi-rounds: I2EBench2.0 simultaneously evaluates both single-round and multi-round instruction-based edits, assessing the precision and consistency of the edits. 2) Extensive Evaluation Criteria: I2EBench2.0 encompasses a broad range of criteria, evaluating both high-level and low-level aspects of each IIE model. Specifically, it incorporates 16 dimensions for single-round evaluations and 7 for multi-round evaluations. 3) Alignment with Human Judgment: To ensure our benchmark aligns with human evaluation, we conducted a comprehensive user study for each criterion. 4) Research-driven Insights: By analyzing the strengths and weaknesses of current IIE models across all 16 single-round and 7 multi-round dimensions, we provide critical insights aimed at directing future research in this area. We tested eight recently developed IIE models using I2EBench2.0 and derived academic insights through meticulous comparison and analysis. The related code, dataset, and images generated by all IIE models are available on GitHub: this https URL.

[CV-162] RaLMPH: Reliability-aware Learning for Multi-Pathologist Harmonization in Whole-Slide Image Classification MICCAI2026

Link: https://arxiv.org/abs/2606.15554
Authors: Sungrae Hong, Jiwon Jeong, Soeun Cheon, Donghee Han, Sol Lee, Jisu Shin, Kyungeun Kim, Mun Yong Yi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by MICCAI 2026

Abstract:Multiple Instance Learning (MIL) is a standard paradigm for Whole-Slide Image (WSI) analysis and has achieved strong results in computational pathology. However, most MIL pipelines assume a single “gold” label per slide, which conflicts with clinical practice where substantial inter-pathologist variability is common. Existing multi-annotator learning and label-refinement methods typically estimate global annotator reliability or rely on single-instance assumptions, making them poorly suited to MIL and to localized diagnostic contexts where experts disagree. We propose RaLMPH (Reliability-aware Learning for Multi-Pathologist Harmonization), a MIL-based label reconciliation framework for WSIs annotated by multiple pathologists. RaLMPH introduces a reliability field that jointly models (i) local neighborhood structure in WSI feature space and (ii) expert uncertainty (entropy), enabling per-sample identification of trustworthy reference neighborhoods. Leveraging this field, RaLMPH performs sample-wise local annotator ranking to select reliable opinions per slide and applies an adaptive gating mechanism to fuse labels conditioned on local reliability. Experiments on a clinical WSI dataset with labels from six pathologists, as well as controlled simulated benchmarks, show that RaLMPH consistently outperforms existing approaches. Further analyses clarify how our reliability-aware mechanism improves label reconciliation and downstream MIL performance.

[CV-163] EcoBin: A Two-Stage Deep Convolutional Neural Network for Contamination-Aware Waste Classification

Link: https://arxiv.org/abs/2606.15547
Authors: Raghav Senthil Kumar
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 7 pages, 8 figures

Abstract:Waste classification models have become highly accurate at sorting waste, often exceeding 95% on benchmark datasets. However, these models fail to account for contamination in recyclable waste. We present EcoBin, a two-stage deep convolutional neural network that classifies household waste by its disposal pathway and that explicitly accounts for contamination. The first stage is a base waste classifier built on an EfficientNetV2-S backbone that assigns each of the thirty waste categories in our dataset to one of four disposal pathways. The second stage is a contamination classifier that inspects any item routed toward recycling and overrides the decision to garbage when contamination is detected. Because no public dataset of contaminated recyclables exists, we synthesize one by segmenting images of clean recyclable objects with a U2-Net model and compositing realistic contamination textures onto their surfaces. The first stage achieves 87.42% test accuracy and a 96.13% pathway-adjusted accuracy. Meanwhile, the contamination stage distinguishes clean from contaminated items with a 0.99 ROC-AUC. On a test set of contaminated recyclables, the complete pipeline routes 24 of 25 items correctly, compared with only 1 of 25 for the base classifier alone. A McNemar’s test confirms that the improvement contributed by the contamination stage is statistically significant ($p < 0.001$).
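
The two-stage decision rule described in this abstract (the contamination stage only inspects items routed to recycling and overrides them to garbage) can be sketched as below. The pathway strings, the probability input, and the `threshold` value are assumptions for illustration; the abstract does not say how the contamination classifier's output is thresholded.

```python
RECYCLING = "recycling"
GARBAGE = "garbage"


def route(pathway, contamination_prob, threshold=0.5):
    """Combine a stage-one pathway with a stage-two contamination score.

    `pathway` is the disposal pathway predicted by the base classifier.
    `contamination_prob` is a hypothetical probability from the second stage.
    Only items headed for recycling can be overridden.
    """
    if pathway == RECYCLING and contamination_prob >= threshold:
        return GARBAGE
    return pathway


# A greasy container is redirected; a clean one stays in recycling.
greasy = route(RECYCLING, contamination_prob=0.93)
clean = route(RECYCLING, contamination_prob=0.04)
```

Because the override is one-directional, the second stage cannot change the routing of items that the base classifier sends to any pathway other than recycling.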

[CV-164] Track2View: 4D-Consistent Camera-Controlled Video Generation via Paired 3D Point Tracks

Link: https://arxiv.org/abs/2606.15534
Authors: Feng Qiao, Zhaochong An, Zhexiao Xiong, Serge Belongie, Nathan Jacobs
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Re-rendering an existing video from a novel camera viewpoint requires the output to follow the prescribed camera trajectory while preserving the appearance and dynamics of the original scene across every frame. Existing methods rely on per-frame pose embeddings, noisy point-cloud renderings, or implicit learned correspondences, none of which provides an explicit, temporally continuous link between source and target pixels. We propose Track2View, which conditions a video diffusion transformer on paired 3D point tracks: sparse trajectories of scene points projected into both the source and target camera views. These tracks provide explicit spatiotemporal correspondences that are temporally continuous by construction, encoding what content should appear where and when. At the core of Track2View is a dual-view track conditioner that transfers visual context from source to target view through parameter-free geometric operations and learned temporal aggregation, ensuring generalization to arbitrary camera trajectories without memorizing specific motions. We further introduce a data curation pipeline that extracts one-to-one track correspondences by running a 3D point tracker on temporally concatenated multi-camera view pairs. On a 400-video benchmark spanning static and dynamic scenes, Track2View achieves state-of-the-art results across visual quality, view synchronization, and camera accuracy, reducing rotation error by 30-65% and translation error by 61-72% relative to leading baselines. Project page is available at this https URL.
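
To make the notion of a paired track concrete, the sketch below projects one 3D point trajectory into a source and a target camera with a plain pinhole model. This is only a geometric illustration: the camera tuple layout `(R, t, f, cx, cy)` and the helper names are assumptions, and the paper's conditioner and data pipeline are not reproduced.

```python
def project(point, R, t, f, cx, cy):
    """Pinhole projection of a world-space 3D point into pixel coordinates.

    R is a 3x3 rotation (list of rows), t a translation, f the focal length
    in pixels, (cx, cy) the principal point. Returns None behind the camera.
    """
    cam = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    if cam[2] <= 0:
        return None
    return (f * cam[0] / cam[2] + cx, f * cam[1] / cam[2] + cy)


def paired_track(track3d, src_cam, tgt_cam):
    """Project one 3D trajectory into both views, frame by frame."""
    return [(project(p, *src_cam), project(p, *tgt_cam)) for p in track3d]


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
src = (IDENTITY, [0.0, 0.0, 0.0], 100.0, 50.0, 50.0)
tgt = (IDENTITY, [-1.0, 0.0, 0.0], 100.0, 50.0, 50.0)  # target camera shifted sideways

# A scene point moving along x at constant depth, observed over two frames.
pairs = paired_track([[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]], src, tgt)
```

Each element of `pairs` links the pixel where a scene point appears in the source view to the pixel where it should appear in the target view at the same time step, which is the kind of explicit correspondence the abstract describes.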

[CV-165] Selective Synergistic Learning for Video Object-Centric Learning

Link: https://arxiv.org/abs/2606.15527
Authors: WonJun Moon, Jae-Pil Heo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Typical video object-centric learning (VOCL) approaches employ slot-based frameworks that rely on reconstruction-driven encoder-decoder architectures, where learning is mediated by two spatial maps: attention maps from the encoder and object maps from the decoder. As these two distinct maps exhibit different properties, a recent dense alignment strategy attempted to reconcile this discrepancy by enforcing agreement across all spatio-temporal patches via contrastive learning. However, this indiscriminate alignment inadvertently propagates the inherent weaknesses of each module, such as noisy encoder predictions and blurred decoder boundaries. Moreover, computing dense similarities across all pairs incurs a computational cost quadratic in the total number of spatio-temporal patches, severely limiting scalability. Motivated by this, we propose Selective Synergistic Learning (SSync). Instead of exhaustive patch-to-patch alignment, SSync prevents error propagation by selectively distilling only the most reliable cues: leveraging the encoder strictly for boundary refinement and the decoder for interior denoising. This is realized via a pseudo-labeling with linear complexity, eliminating the need for quadratic spatial comparisons. Also, to prevent the reinforcement of architectural biases like slot redundancy, we introduce a transitive pseudo-label merging that consolidates overlapping slots based on spatio-temporal activation consistency. Extensive studies demonstrate that SSync improves decomposition quality and serves as a versatile, plug-and-play module while also exhibiting exceptional robustness to slot configurations. Code is available at this http URL.
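
One plausible reading of "transitive pseudo-label merging" is a union-find pass over slots, where two slots are linked when their activation regions overlap strongly and links propagate transitively. The sketch below follows that reading. It is an assumption, not the paper's procedure: the overlap measure (IoU over sets of patch indices), the `iou_threshold`, and the function name `merge_slots` are all hypothetical.

```python
def merge_slots(masks, iou_threshold=0.5):
    """Group slots whose activation sets overlap, transitively.

    `masks` is a list of sets; each set holds the spatio-temporal patch
    indices where one slot is active. Returns sorted groups of slot indices.
    """
    parent = list(range(len(masks)))

    def find(i):
        # Path-halving lookup of the group representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(masks)):
        for j in range(i + 1, len(masks)):
            union = masks[i] | masks[j]
            if union and len(masks[i] & masks[j]) / len(union) >= iou_threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(masks)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Slots 0 and 2 overlap weakly, but both overlap slot 1, so all three merge.
slots = [{1, 2, 3, 4}, {2, 3, 4, 5}, {3, 4, 5, 6}, {10, 11}]
merged = merge_slots(slots, iou_threshold=0.5)
```

The pairwise loop here runs over slots, which are few, rather than over patches, so it does not reintroduce the quadratic patch-level cost that the abstract says the method avoids.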

[CV-166] ST-DiffEye: Diffusion-based Continuous Gaze Generation via Joint Scanpath-Trajectory Modeling

Link: https://arxiv.org/abs/2606.15486
Authors: Brian Nlong Zhao, Ozgur Kara, Junho Kim, James M. Rehg
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We study the problem of human gaze modeling, which aims to generate the gaze patterns a viewer produces while observing a visual stimulus. Gaze is primarily captured through two modalities: continuous eye-tracking trajectories, which describe fine-grained motion dynamics, and discrete scanpaths, which describe high-level fixation structure. Because gaze varies substantially across viewers and trials, we treat this variability as a defining property rather than noise and model gaze as a stochastic generative process. Existing generative gaze models supervise on only one of these two representations in isolation. We hypothesize that trajectories and scanpaths describe gaze at complementary scales and are jointly informative during training, and test this hypothesis through ST-DiffEye, a joint trajectory-scanpath diffusion framework that couples both modalities by concatenating them as an additional raw input channel, requiring no architectural overhead beyond an input and output channel expansion. We further introduce a principled evaluation framework based on the Continuous Ranked Probability Score (CRPS), which generalizes any existing sequence similarity metric into a proper scoring rule that jointly assesses the accuracy and diversity of generated gaze. Experiments on task-driven visual search, covering both target-present and target-absent scenarios, and on free-viewing benchmarks demonstrate state-of-the-art performance. These results, along with detailed ablations, confirm the benefit of joint modeling and the value of distribution-aware evaluation in capturing the intrinsic variability of human gaze. Project webpage: this https URL
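
A common sample-based form of the CRPS is the mean distance from generated samples to the observation minus half the mean distance between pairs of samples. Replacing the absolute difference with another distance gives an energy-score-style generalization, sketched below. Whether this matches the paper's construction is an assumption, and the result is only a proper scoring rule for suitable distances; `sample_crps` and `dist` are illustrative names.

```python
def sample_crps(samples, observed, dist):
    """Sample-based CRPS estimate with a pluggable distance.

    The first term rewards samples close to the observation (accuracy).
    The second term rewards spread among samples (diversity), so collapsing
    to a single mode is penalized when the observation lies elsewhere.
    """
    m = len(samples)
    accuracy = sum(dist(s, observed) for s in samples) / m
    spread = sum(dist(a, b) for a in samples for b in samples) / (m * m)
    return accuracy - 0.5 * spread


def absolute(a, b):
    return abs(a - b)


# Scalar sanity check: two equally likely samples at 0 and 2, observation at 1.
score = sample_crps([0.0, 2.0], 1.0, absolute)
```

For this scalar case the estimate equals 0.5, which matches the closed-form CRPS of a two-point distribution at 0 and 2 evaluated at 1. For gaze, `dist` would be a sequence-level distance between scanpaths or trajectories, chosen by the user.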

[CV-167] Analyzing Visual Aircraft Representations with Sparse Autoencoders

Link: https://arxiv.org/abs/2606.15468
Authors: Deepshik Sharma
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 18 pages, 4 figures, 7 tables

Abstract:Vision models can achieve strong performance on classification tasks, but the internal representations supporting their predictions are often difficult to interpret. This work investigates whether sparse autoencoders can decompose intermediate representations of a vision model into interpretable features. We train a ConvNeXt classifier on the FGVC-Aircraft dataset, extract spatial activations from its final feature stage, and train a sparse autoencoder on these activations. The learned sparse features are analyzed using top-activating image patches, activation strength, and class selectivity. Qualitative visual inspection reveals that several features correspond to recognizable aircraft structures and visual patterns. We evaluate a subset of selected features using input-space and feature-space ablations, measuring how blurring image patches and suppressing sparse features affect class logits, classification margins, and prediction confidence. The results suggest that sparse autoencoders can reveal partially interpretable, class-relevant visual features associated with aircraft recognition, while also exposing limitations such as polysemanticity and coarse spatial localization.

[CV-168] Lesion-DDPM: Lesion-Enhanced 3D Diffusion for MS MRI Synthesis

Link: https://arxiv.org/abs/2606.15457
Authors: Weidong Zhang,Yongchan Jung,Shafayat Mowla Anik,Furen Xiao,Vasudevan Janarthanan,Enkhzaya Chuluunbaatar,Byeong Kil Lee,Jeeho Ryoo
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:3D FLAIR MRI is widely recommended as one of the standard MRI sequences for brain imaging in multiple sclerosis (MS), but publicly available MS datasets remain relatively small and vary across scanners, acquisition protocols, and lesion patterns. This scarcity and variability hinder the development of robust neuroimaging machine learning models and are particularly challenging for generative models that aim to synthesize images while preserving small, sparse lesions. We propose Lesion-DDPM, a 3D conditional diffusion framework for lesion-aware FLAIR synthesis that incorporates multi-level anatomical mask injection together with a lesion-weighted reconstruction loss to emphasize lesion voxels while maintaining global brain structure. Using a curated subset of the MSLesSeg dataset, we compare Lesion-DDPM with representative state-of-the-art GAN- and diffusion-based models, assessing both image-generation metrics and downstream 3D U-Net segmentation. In our experiments, Lesion-DDPM achieved the lowest lesion-region reconstruction error among all methods. In a downstream 3D U-Net lesion segmentation task, a model trained only on Lesion-DDPM-generated scans and evaluated on real MRIs reached a Dice score of 0.616 compared with 0.569 for the best competing synthetic dataset. When Lesion-DDPM images were added to the real training set, the Dice score further increased to 0.685.
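
A lesion-weighted reconstruction loss of the kind described above might look like the following sketch. The weighting scheme (a constant multiplier on lesion voxels, normalized by the total weight) and the name `lesion_weighted_mse` are assumptions for illustration. The abstract does not give the exact formulation.

```python
def lesion_weighted_mse(pred, target, lesion_mask, lesion_weight=10.0):
    """Weighted mean squared error over flattened voxel lists.

    Voxels flagged in lesion_mask count lesion_weight times as much as
    background voxels. lesion_weight=10.0 is an arbitrary illustrative value.
    """
    total = 0.0
    weight_sum = 0.0
    for p, t, is_lesion in zip(pred, target, lesion_mask):
        w = lesion_weight if is_lesion else 1.0
        total += w * (p - t) ** 2
        weight_sum += w
    return total / weight_sum


if __name__ == "__main__":
    target = [0.0, 0.0, 0.0, 0.0]
    mask = [0, 0, 1, 0]
    # The same unit error costs more on the lesion voxel than on background.
    print(lesion_weighted_mse([0.0, 0.0, 1.0, 0.0], target, mask))
    print(lesion_weighted_mse([1.0, 0.0, 0.0, 0.0], target, mask))
```

Because lesions occupy a tiny fraction of the volume, an unweighted loss is dominated by background voxels. The multiplier shifts the emphasis toward the lesion region.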

[CV-169] Post-Launch Capability Expansion of Vision-Language Models via Prompting for On-Orbit Spacecraft Inspection CVPR2026

Link: https://arxiv.org/abs/2606.15427
Authors: Nicholas A. Welsh,Lennon J. Shikhman,Monty Nehru Attazs,Seemanthini K. Putane,Van Minh Nguyen,Ryan T. White
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 1 figure, 2 tables. Equal contribution by Nicholas A. Welsh and Lennon Shikhman. Published in the CVPR2026 Workshop on AI4Space

Abstract:Spaceborne inspection systems often deploy perception models prior to launch, after which updating model weights or expanding fixed label sets becomes operationally impractical. While supervised models can be integrated pre-flight, adding new semantic capabilities in orbit requires retraining and re-uploading parameters. We investigate whether prompt-driven vision–language models can enable post-launch semantic expansion, allowing new spacecraft components to be specified via natural-language prompts without modifying onboard weights. We evaluate zero-shot instance segmentation of spacecraft components under a strictly frozen, single-pass inference protocol on a test set of 129 images of previously unseen satellites. Under fixed global thresholds and no post-processing, SAM3 achieves 0.385 mAP@0.5 and 0.267 mAP@0.5:0.95. Performance is strongly scale-dependent: large structural elements like spacecraft bodies (0.639 AP@0.50) and solar arrays (0.598 AP@0.5) localize reliably, while relatively small appendages like antennas (0.221 AP@0.5) and thrusters (0.081 AP@0.5) remain difficult. Prompt formulation influences performance, with structured prompts incorporating spatial and geometric descriptors yielding up to 82% improvement over short category-name prompts. The model operates within the memory and compute envelope of contemporary embedded GPUs, suggesting prompt-driven grounding can provide a practical mechanism for post-launch semantic extension of dominant spacecraft structures while highlighting limitations of zero-shot localization for fine-scale components under orbital domain shift.

[CV-170] From Frames to Temporal Graphs: In-Context Egocentric Action Recognition with Vision-Language Models

Link: https://arxiv.org/abs/2606.15417
Authors: Bessie Dominguez-Dager,Francisco Gomez-Donoso,Miguel Cazorla,Marc Pollefeys,Daniel Barath,Zuria Bauer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Action reasoning in egocentric video requires capturing fine-grained transitions of hand-object interactions, a task where general-purpose Vision-Language Models (VLMs) often struggle when operating directly on raw pixels. We propose to decouple visual perception from symbolic reasoning by converting videos into Temporal Action Graphs. In a multi-stage prompting pipeline, we first generate dense natural language narratives over short temporal windows as a semantic bottleneck, then formalize them into structured, open-vocabulary graph representations. On the EGTEA and Epic-Kitchens-100 datasets, the symbolic representation unlocks efficient in-context learning: few-shot graph demonstrations yield substantial accuracy gains over zero-shot frame and graph-based inference alike. Even in the zero-shot setting, graph-based reasoning remains competitive with pixel-based inference despite potential pretraining contamination favoring the latter. Across 11 open-weight VLMs from 6 model families ranging from 2B to 235B parameters, our findings indicate that current VLMs are more effective as symbolic reasoners than as direct visual observers. By projecting video into the language domain, we provide a scalable, fine-tuning-free alternative to end-to-end approaches that better leverages these models’ latent reasoning strengths. The code will be made public.

[CV-171] Segmentation-based Detection for Efficient Multi-Task Spacecraft Perception CVPR

Link: https://arxiv.org/abs/2606.15409
Authors: Sivaperuman Muniyasamy,Surendar Devasundaram
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 2 figures, 6 tables. CVPRW AI4SPACE-SPARK 2026 Challenge Stream-1 First Place Winners. Code is available at this https URL

Abstract:Vision-based perception is fundamental to Space Situational Awareness and autonomous on-orbit operations such as rendezvous, docking, servicing, and navigation. However, progress in this area is limited by the scarcity of annotated space imagery and by challenging visual-domain characteristics including severe illumination changes, low signal-to-noise ratio, and high contrast. We address Stream 1 of the SPARK 2026 Challenge, which requires a single model for spacecraft classification, detection, and fine-grained component segmentation across multiple target types. We propose a compact architecture that integrates a MobileNetV3 encoder with a U-Net-style decoder, combining computational efficiency with accurate dense prediction. Detection is derived analytically from the union of predicted component masks, avoiding a separate bounding-box regression head in the single-spacecraft setting. Our method achieved an overall leaderboard score of 0.9482, with task-specific scores of 1.0000 in classification, 0.9788 in detection, and 0.8917 in segmentation. The proposed approach ranked second overall in the SPARK 2026 Challenge, demonstrating that lightweight encoder-decoder architectures can deliver strong multi-task performance for practical onboard space vision systems.
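
Deriving a detection box from the union of component masks, as the abstract describes for the single-spacecraft setting, can be sketched as below. The mask layout (lists of rows of 0/1 values) and the function name `bbox_from_component_masks` are assumptions for illustration, not the authors' code.

```python
def bbox_from_component_masks(masks):
    """Return (x_min, y_min, x_max, y_max) enclosing every foreground pixel.

    masks: list of 2D binary masks (lists of rows), one per predicted
           spacecraft component. Returns None when all masks are empty.
    """
    xs, ys = [], []
    for mask in masks:
        for y, row in enumerate(mask):
            for x, value in enumerate(row):
                if value:
                    xs.append(x)
                    ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


if __name__ == "__main__":
    body = [
        [0, 0, 0, 0, 0],
        [0, 1, 1, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
    ]
    solar_panel = [
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0],
        [0, 0, 0, 0, 1],
    ]
    print(bbox_from_component_masks([body, solar_panel]))
```

Since the box is computed directly from the segmentation output, no separate regression head has to be trained. The trade-off is that a single stray foreground pixel enlarges the box.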

[CV-172] Timestep Rescheduling in Diffusion Inversion ICML2026

Link: https://arxiv.org/abs/2606.15389
Authors: Shangquan Sun,Ting Gong,Zhirui Liu,Jiamin Wu,Runkai Zhao,Mianxin Liu,Wenqi Ren,Xiaochun Cao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML 2026. 23 pages, including appendices

Abstract:Diffusion inversion, which maps images back to the Gaussian latent space of a diffusion model, is a critical task for image reconstruction and editing. While DDIM enables fast deterministic inversion, it inherently introduces deviations that accumulate into noticeable inversion errors. Existing methods often address this by solving a fixed-point problem but largely overlook how the selection of the diffusion timestep in the noise scheduler influences inversion fidelity. In this work, we reveal that the deviation scale in diffusion inversion is strongly dependent on the timestep size, and exhibits a parabolic trend, with larger errors concentrated at both small and large timesteps. Based on this finding, we propose a simple yet effective nonuniform timestep scheduler that integrates a global rescaling with a local dynamic programming based rescheduling, enabling a strategic allocation of computational effort that minimizes the overall inversion error and preserves higher inversion accuracy. Our method serves as an off-the-shelf enhancement for existing inversion techniques and requires no extra parameters or computational overhead. Through extensive experiments, we verify that integrating our scheduler consistently boosts the performance of existing inversion methods, achieving superior results in image reconstruction and editing.

[CV-173] MNet: Extended 2D/3D Networks for Anisotropic Medical Image Segmentation

Link: https://arxiv.org/abs/2606.15370
Authors: Kirsten Odendaal,Rade Bajic
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:This work demonstrates a full reproduction and extension of MNet, a hybrid 2D/3D convolutional network designed for anisotropic medical image segmentation. The original architecture was re-implemented within the nnU-Net framework to verify its reported performance and robustness to variable voxel spacing, known as anisotropy. Experiments were conducted on PROMISE prostate MRI and a controlled subset of LiTS liver CT under matched preprocessing and compute constraints. The reproduced MNet achieved a Dice similarity coefficient (DSC) of 89.0 +/- 0.9% on PROMISE, within 0.8% of the published result, and 94.3 +/- 1.9% / 54.6 +/- 3.1% for liver and tumor segmentation on LiTS, respectively. Two lightweight extensions were further introduced: (1) a learned Fusion Gating mechanism enabling adaptive 2D-3D feature blending, and (2) a VMamba state-space module for efficient long-range depth modelling. The Spatial Gating variant improved DSC by +0.8% with less than 3% inference overhead, while VMamba improved performance consistency, reducing PROMISE Dice variation to +/- 0.7% and achieving the strongest LiTS liver performance at 95.8% Dice. Both extensions preserved MNet robustness to anisotropy, with delta Dice = 1.5% across 1-4 mm voxel spacing. Overall, the study confirms MNet reproducibility and demonstrates that adaptive fusion and state-space modelling have the potential to further strengthen segmentation reliability under anisotropic conditions. However, further tests are required to provide definitive conclusions.

[CV-174] Sustainable Face Recognition on Low-Power Devices with VQ-VAE Embeddings

Link: https://arxiv.org/abs/2606.15355
Authors: Christos Chronis,Georgios Th. Papadopoulos,Iraklis Varlamis
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Face recognition has become a cornerstone of modern AI applications, yet conventional approaches often rely on computationally intensive models deployed in cloud environments, leading to increased network traffic, high energy consumption, and a heavy carbon footprint. This work introduces a sustainable, edge-deployable face recognition framework based on Vector-Quantized Variational Autoencoders (VQ-VAE), which generates compact and semantically rich latent representations of facial images. By leveraging the compression capacity and reconstruction quality of VQ-VAE embeddings on the edge and combining them with the power of pre-trained face embeddings in a knowledge distillation setup, our system achieves comparable accuracy to state-of-the-art face embedding models while significantly reducing memory and computation requirements on the edge, making it suitable for low-power edge devices. The integration of VQ-VAE compression minimizes network overhead while keeping the matching accuracy high by retaining only the most informative facial features in the latent space. As a result, the reconstructed images preserve the key identity characteristics, improving the robustness and overall performance of the face embeddings.

[CV-175] Facial Affect Analysis for Service-Oriented Systems: Advances, Challenges, and Future Visions

Link: https://arxiv.org/abs/2606.15351
Authors: Spyridon Georgiou,Aggelos Psiris,Thomas Lagkas,Vasileios Argyriou,Panagiotis Sarigiannidis,Iraklis Varlamis,Georgios Th. Papadopoulos
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Facial Affect Analysis (FAA) is evolving from a stand-alone recognition task into a reusable perception capability for Service-Oriented Software Ecosystems (SoSE). This paper preserves the FAA methodological core while reframing recent advances through systems-engineering requirements for composable and dependable services. We review representative progress in static and dynamic expression analysis, action-unit and micro-expression modeling, and modern CNN, Transformer, graph, and hybrid architectures, then interpret these advances by their operational fit in edge, cloud, and hybrid service pipelines. The synthesis emphasizes SoSE concerns that determine deployability: service contracts for uncertainty-aware outputs, latency and availability envelopes, lifecycle monitoring and recalibration, governance-aware integration, and interoperability across independently evolving components. Our analysis shows that benchmark gains alone are insufficient for SoSE readiness; robustness under shift, intervention stability, fairness, privacy posture, and runtime guarantees are equally critical. We conclude with a roadmap for treating FAA as an operational service component with explicit interfaces, measurable quality attributes, and accountable lifecycle management.

[CV-176] DYNA-PRUNER: Input-Adaptive Data-Model Co-Pruning for Efficient and Scalable Spatio-Temporal Media Prediction ICME2026

Link: https://arxiv.org/abs/2606.15346
Authors: Fuyan Zhang,Yuqi Li,Yingli Tian,Edmond S.L. Ho
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: ICME 2026 Spotlight Paper

Abstract:Spatio-temporal prediction supports radar/satellite nowcasting and city-scale traffic monitoring, but modern models are often too expensive for real-time deployment. This stems from a mismatch between dense computation and strong input-dependent redundancy (e.g., calm seas or clear skies). To enable automated, resource-aware architecture optimization in scalable media analysis, we propose Dyna-Pruner, an end-to-end framework for input-dependent co-pruning of data and model structure. A shared-importance synchronization mechanism generates coupled masks that prune redundant regions and their corresponding computational units (e.g., convolutional filters), yielding per-sample sparse sub-networks at inference time. Experiments on WeatherBench, SEVIR, and TaxiBJ show seamless integration with CNN, RNN, and Transformer backbones, reducing FLOPs by up to 70% and achieving a 2.5× speedup on NVIDIA Jetson AGX Orin with negligible accuracy loss (1%).

[CV-177] CausalDrive: Real-time Causal World Models for Autonomous Driving

Link: https://arxiv.org/abs/2606.15341
Authors: Tianyi Yan,Huan Zheng,Dubing Chen,Meizhi Qu,Yingying Shen,Lijun Zhou,Mingfei Tu,Bing Wang,Guang Chen,Hangjun Ye,Haiyang Sun,Cheng-zhong Xu,Jianbing Shen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:World models have emerged as a promising paradigm for scaling autonomous driving (AD) data, yet existing video generative models fall short as interactive simulators. Layout-conditioned renderers rely on “oracle” future trajectories of all background agents, rendering them strictly non-reactive. Conversely, pure action-conditioned predictors lack semantic control over complex interactions and suffer from prohibitive diffusion latencies, hindering closed-loop policy learning. To bridge this gap, we present CausalDrive, a controllable, real-time foundation driving world renderer. CausalDrive operates solely on the initial front-view frame, the ego-vehicle’s trajectory, and a macroscopic text prompt. By excluding future NPC layouts, we compel the model to intrinsically predict causal interactions, enabling text-driven control over Driving Sociology, allowing users to dynamically orchestrate diverse counterfactual reactions to identical ego-actions. To overcome the efficiency bottleneck and address the covariate shift in autoregressive generation, we propose a novel Context-Forced DMD architecture. This combines continuous flow-matching with a self-correcting distillation objective, achieving interactive speeds of 12 FPS. This breakthrough transforms the passive video generator into a playable neural simulator. We demonstrate its versatility across three downstream applications: (1) generative closed-loop evaluation with significantly mitigated collision artifacts, (2) large-scale Reinforcement Learning (RL) post-training driven by a Video2Reward module, and (3) real-time human-in-the-loop simulation. Extensive experiments validate that policies trained within CausalDrive’s reactive scenarios exhibit superior interaction capabilities in the real world.

[CV-178] SGFormer: Semantic Graph Transformer for Incremental 3D Scene Graph Generation

Link: https://arxiv.org/abs/2606.15328
Authors: Mengshi Qi,Changsheng Lv,Zijian Fu,Xianlin Zhang,Huadong Ma
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this paper, we propose SGFormer++, a novel Semantic Graph Transformer for 3D scene graph generation (SGG), which aims to parse point cloud scenes into semantic structural graphs, where nodes denote detected object instances and edges encode their pairwise relationships, with the core challenge lying in modeling complex global scene structure. While existing graph convolutional network (GCN)-based methods suffer from over-smoothing and limited receptive fields, SGFormer++ leverages Transformer layers as its backbone to enable global message passing. Specifically, we introduce two key components tailored for 3D SGG: (1) a Graph Embedding Layer++ that efficiently integrates edge-aware global context with linear computational complexity, and (2) a Semantic Injection Layer++ that enriches visual features with linguistic priors from large language models (LLMs) and vision-language models (VLMs), boosting semantic representation without introducing extra trainable parameters. To further address the practical challenge of incremental SGG (I-SGG), where new relationship categories arrive sequentially, we equip SGFormer++ with a novel Spatial-guided Feature Adapter, which calibrates predicate features using subject-object spatial geometry to counter scale variation, and a Cascaded Binary Prediction Head that mitigates catastrophic forgetting via task-incremental classifier expansion and logit distillation. Extensive experiments on the 3DSSG benchmark demonstrate that SGFormer++ achieves state-of-the-art performance in both standard and incremental settings: it yields a significant 4.49% absolute improvement in Predicate A@1 under the incremental setting. Code and data are available at: this https URL.

[CV-179] PPDM: Pixel Puzzling Diffusion Model for Speed and Memory Efficient Volumetric Medical Image Translation

Link: https://arxiv.org/abs/2606.15323
Authors: Tianqi Chen,Jun Hou,Yinchi Zhou,James S. Duncan,Chi Liu,Bo Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 5 figures, 5 tables

Abstract:Diffusion models have demonstrated superior fidelity for medical image-to-image translation, but their extension to high-resolution 3D volumes is severely constrained by prohibitive computational cost and GPU memory requirements. Existing memory-efficient strategies often compromise global volumetric consistency or fine anatomical detail. In this work, we propose the Pixel Puzzling Diffusion Model (PPDM), a simple and effective framework for memory- and speed-efficient 3D medical image translation. PPDM introduces a reversible pixel puzzle-unpuzzle operator that trades spatial resolution for channel dimensionality, substantially reducing activation memory while preserving global context. To further improve efficiency and stability, we adopt a direct bridge diffusion formulation that starts from the conditional input rather than pure noise, enabling the model to focus on task-relevant residuals. In addition, a puzzle-gradient loss is incorporated to enforce spatial coherence and suppress grid-like artifacts introduced by spatial rearrangement. We evaluate PPDM on multiple challenging 3D medical image translation tasks, including low-count PET denoising, joint PET denoising and attenuation correction, and cross-modal MRI translation. Across all tasks, PPDM consistently matches or outperforms full 3D diffusion models while reducing training GPU memory usage by up to an order of magnitude and significantly accelerating inference, and it outperforms existing memory-efficient diffusion approaches based on latent compression or frequency decomposition. These results demonstrate that PPDM provides a practical and scalable solution for high-fidelity 3D diffusion-based medical image translation under limited computational resources.
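
A reversible rearrangement that trades spatial resolution for channels can be sketched with a 3D space-to-depth mapping, shown below on nested lists. The exact layout used by PPDM's puzzle-unpuzzle operator is not given in the abstract, so the interleaving pattern here is an assumption. The function names `puzzle` and `unpuzzle` and the factor `r` are illustrative.

```python
def puzzle(volume, r):
    """Split a D x H x W volume into r**3 channels of size (D/r) x (H/r) x (W/r).

    Channel (dz, dy, dx) holds every r-th voxel starting at that offset,
    so each channel is a low-resolution view covering the whole volume.
    """
    d, h, w = len(volume), len(volume[0]), len(volume[0][0])
    assert d % r == 0 and h % r == 0 and w % r == 0, "dims must be divisible by r"
    channels = []
    for dz in range(r):
        for dy in range(r):
            for dx in range(r):
                channels.append([
                    [
                        [volume[z * r + dz][y * r + dy][x * r + dx]
                         for x in range(w // r)]
                        for y in range(h // r)
                    ]
                    for z in range(d // r)
                ])
    return channels


def unpuzzle(channels, r):
    """Exact inverse of puzzle: interleave the channels back into one volume."""
    d = len(channels[0]) * r
    h = len(channels[0][0]) * r
    w = len(channels[0][0][0]) * r
    volume = [[[None] * w for _ in range(h)] for _ in range(d)]
    for index, channel in enumerate(channels):
        # Recover the offsets in the same order puzzle() generated them.
        dz, dy, dx = index // (r * r), (index // r) % r, index % r
        for z, plane in enumerate(channel):
            for y, row in enumerate(plane):
                for x, value in enumerate(row):
                    volume[z * r + dz][y * r + dy][x * r + dx] = value
    return volume


if __name__ == "__main__":
    vol = [[[z * 16 + y * 4 + x for x in range(4)] for y in range(4)] for z in range(2)]
    parts = puzzle(vol, 2)
    print(len(parts), unpuzzle(parts, 2) == vol)
```

With `r = 2`, each spatial dimension halves while the channel count grows eightfold. No voxel is discarded, which is what makes the operator invertible.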

[CV-180] Conditional Multi-Event Temporal Grounding in Long-Form Video

Link: https://arxiv.org/abs/2606.15320
Authors: Yuanhao Zou,Arthad Kulkarni,Lucas Tonanez,Lincoln Spencer,Guangyu Sun,Tianxingjian Ding,Andong Deng,Yi Li,Shuangjun Liu,Yuan Li,Dashan Gao,Ning Bi,Taotao Jing,Shuai Zhang,Chen Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal large language models have made rapid progress in video temporal grounding, yet real-world applications routinely require localizing every event that satisfies compositional temporal and spatial conditions. Existing benchmarks fall short: they localize only a single moment per query, count without temporal conditions, or treat grounding and counting as disjoint tasks. We introduce CoMET-Bench for Conditional Multi-Event Temporal Grounding in long-form video, comprising 2789 queries over 600 videos averaging 33.8 minutes across five real-world domains, with each query composed from 4 temporal conditions, 3 spatial conditions, and a dedicated negative-query subset. We further propose a unified evaluation protocol jointly measuring counting, grounding, and negative-query recognition, including a new Rejection-F1 metric that prevents trivial gaming by lazy “always-empty” models. Benchmarking a broad suite of MLLMs, agent-based, and grounding-specialized methods reveals that existing approaches remain far from solving this task. Building on these findings, we propose CoMET-Agent, a training-free agentic framework that reformulates the task as structured search-and-aggregate, improving F1@0.5 by 6.1% over GPT-5 purely through structural reasoning. Failure analysis further surfaces three open directions: fine-grained entity tracking, position-uniform retrieval, and causal event pairing.

[CV-181] CoMNeT: A MedNeXt-CorrDiff Framework for Volumetric Brain Tumor Segmentation

Link: https://arxiv.org/abs/2606.15305
Authors: Michael L. Evans,MD Fayaz Bin Hossen,MD Shibly Sadique,Walia Farzana,Khan M. Iftekharuddin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures, 2 tables

Abstract:Accurate brain tumor segmentation from multiparametric magnetic resonance imaging (MRI) is critical for treatment planning, response assessment, and quantitative neuro-oncology research. However, automated segmentation remains a difficult task in computer vision because of variation in tumor appearance and MRI protocols across patient scans. Moreover, clinically important regions such as enhancing tumor (ET) and tumor core (TC) are often small relative to the full brain volume, further increasing the difficulty of achieving high voxel-level precision. In this paper, we show that combining a modern 3D convolutional segmentation model with corrective diffusion-based refinement and ensembling improves volumetric glioma segmentation on the UTSW-Glioma dataset. We propose CoMNeT, a MedNeXt-CorrDiff framework that uses four MRI modalities as input and predicts ET, TC, and whole tumor (WT) regions for automated brain tumor segmentation. MedNeXt is used as the primary segmentation model with Global Response Normalization for feature learning, while CorrDiff is trained as a postprocessing residual refinement method to correct errors in the probability maps before final thresholding. Using five-fold cross-validation, CoMNeT achieved the highest Dice score for most tumor regions, with ET, TC, WT, and average Dice scores of 0.7543 +/- 0.0261, 0.6806 +/- 0.0166, 0.9049 +/- 0.0128, and 0.7798 +/- 0.0184, respectively. CoMNeT outperformed two selected baseline models: SegResNet (0.7555 +/- 0.0190 average Dice) and standalone MedNeXt (0.7697 +/- 0.0154 average Dice). Our findings support the use of corrective diffusion and fold-level probability ensembling as practical additions to existing state-of-the-art 3D convolutional models for automated glioma segmentation.
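
Fold-level probability ensembling and the Dice score reported above can be sketched as follows. Averaging the per-fold probability maps before a single threshold is one common reading of "probability ensembling". The 0.5 threshold and the function names `ensemble_folds` and `dice` are assumptions, not details from the paper.

```python
def dice(pred, truth):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flattened binary masks."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


def ensemble_folds(fold_probs, threshold=0.5):
    """Average per-voxel probabilities across folds, then threshold once."""
    n_folds = len(fold_probs)
    mean_probs = [sum(voxel) / n_folds for voxel in zip(*fold_probs)]
    return [1 if p >= threshold else 0 for p in mean_probs]


if __name__ == "__main__":
    fold_probs = [
        [0.9, 0.4, 0.1],
        [0.7, 0.7, 0.2],
        [0.8, 0.3, 0.0],
    ]
    mask = ensemble_folds(fold_probs)
    print(mask, dice(mask, [1, 1, 0]))
```

Averaging before thresholding lets confident folds outvote uncertain ones, whereas thresholding each fold first and then voting would discard that confidence information.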

[CV-182] HemExp: Clinically-Guided Latent Diffusion for Modeling Hematoma Expansion

Link: https://arxiv.org/abs/2606.15304
Authors: Orhun Utku Aydin,Satoru Tanioka,Tzu I Chuang,Alexander Koch,Dimitrios Rallios,Marie Gultom,Begum Tahhan,Fujimaro Ishida,Dietmar Frey,Adam Hilbert
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Hematoma expansion (HE) after spontaneous intracerebral hemorrhage (ICH) is a major determinant of acute triage and treatment decisions in neurosurgical care. However, most existing methods provide either a binary expansion risk or a single follow-up volume, limiting uncertainty-aware decisions. We introduce HemExp, a clinically-guided latent diffusion model that generates patient-specific follow-up non-contrast CT images, along with segmentations of intraparenchymal and intraventricular hemorrhage. Generation is conditioned on baseline imaging, clinical variables, and an explicit expansion indicator, enabling controllable simulation of realistic clinical scenarios. HemExp uses a hemorrhage-aware multi-head variational autoencoder and models progression as the difference between baseline and follow-up latent representations with a conditional diffusion model. The model is trained on paired scans from 450 patients across multiple centers and evaluated on 107 patients from a held-out institution. HemExp produces spatial HE probability maps by generating multiple synthetic follow-up images per patient to estimate distributions of plausible follow-up hematoma volumes. Perturbing clinical inputs such as symptom-onset-to-imaging time or anticoagulant status shifts the predicted follow-up volume distribution. HemExp extends binary predictors and demonstrates robust estimation of clinically relevant outcomes in the imaging space, such as hematoma volume, intraventricular involvement, and mass effects. Overall, our results support controllable latent diffusion as a promising direction for uncertainty-aware modeling of early ICH progression.

[CV-183] G2IA: Geometry-Guided Instance-Aware Retrieval and Refinement for Cross-Modal Place Recognition

Link: https://arxiv.org/abs/2606.15287
Authors: Xianyun Jiao,Jingyi Xu,Zhongmiao Yan,Xieyuanli Chen,Lin Pei
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Cross-modal place recognition (CMPR) enables camera-only robots to localize against pre-built LiDAR maps in autonomous navigation scenarios. This image-to-point-cloud setting is challenged by two coupled ambiguities: the modality gap between perspective RGB appearance and sparse metric geometry, and perceptual aliasing among urban places with similar roads, facades, intersections, and object arrangements. Instead of treating CMPR as a single global descriptor matching problem, we argue that reliable retrieval requires both geometry-aware representation alignment and fine-grained candidate verification. In this paper, we propose G2IA, a geometry-guided instance-aware framework for image-to-point-cloud place recognition. In the retrieval stage, visual geometry priors from VGGT and instance features are integrated to construct place descriptors that are more compatible with LiDAR-derived map representations. In the refinement stage, the retrieved candidates are re-ranked by explicitly verifying whether local instance shapes and their relative spatial layouts are consistent across modalities. Experiments on public benchmarks demonstrate that G2IA consistently improves image-to-point-cloud place recognition under different localization thresholds, and exhibits strong cross-dataset generalization.

[CV-184] Decoupled Motion Representation Learning for Moving Infrared Small Target Detection

Link: https://arxiv.org/abs/2606.15286
Authors: Guoyi Zhang,Peiwen Wu,Han Wang,Xiangpeng Xu,Xiaohu Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Infrared small target detection in dynamic scenes remains challenging due to the highly coupled motions among targets, imaging platforms, and dynamic backgrounds. Existing multi-frame methods usually perform implicit temporal modeling, where coherent background dynamics dominate motion correspondence learning, leading to an inherent trade-off between detection and false alarms. In this work, we observe that background motions exhibit strong global coherence, whereas small targets mainly correspond to sparse local motion anomalies. Moreover, many false-alarm responses maintain high consistency with globally coherent motion patterns, indicating that they mainly originate from coherent background dynamics rather than genuine target motions. Based on these observations, we propose a decoupled motion representation learning framework for moving infrared small target detection. Specifically, an explicit motion branch is introduced to model globally coherent motion dynamics using pretrained optical flow priors, together with a structure-preserving self-supervised adaptation strategy for infrared motion correspondence learning. Meanwhile, an implicit motion branch based on deformable feature alignment is designed to capture target-sensitive local motion anomalies under coherent motion guidance. Furthermore, a coherent-motion-guided local anomaly reasoning module is proposed to identify and suppress coherent-motion-induced false responses during localized motion modeling. Extensive experiments on two challenging infrared small target detection benchmarks demonstrate that the proposed method consistently outperforms existing state-of-the-art approaches, particularly in dynamic scenes with complex motions, while maintaining favorable inference efficiency.

[CV-185] Enhancing Precision Agriculture with a Hybrid Deep Learning Framework for Multi-Class Plant Disease Classification and Interpretability

Link: https://arxiv.org/abs/2606.15282
Authors: Hasibul Islam Sufi,Ridam Roy,Shayla Alam Setu,Mahimul Islam Nadim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This study proposes a deep learning architecture for multi-class classification of plant diseases from high-resolution leaf imagery, with a particular interest in investigating the behavior of ResNet-50 and a hybrid ResNet + Vision Transformer (ViT) design. A specially gathered image database with 15,200 training images and 3,800 validation images, spanning 38 classes across multiple crops including tomato, apple, and grape, was subjected to preprocessing steps such as resizing, normalization, and data augmentation to enhance model robustness. Multiple architectures, including ResNet-50, MobileNetV2, and EfficientNet-B0, were trained and compared with the hybrid ResNet + ViT model. All models were fine-tuned using the AdamW optimizer and cross-entropy loss, with early stopping applied to prevent overfitting and ensure generalization. Furthermore, interpretability techniques such as Grad-CAM and saliency maps were implemented to indicate disease-relevant regions, while segmentation-based analysis was performed to identify the affected parts of a leaf. Among the considered architectures, ResNet-50 achieved the highest accuracy of 98.74%, whereas the hybrid ResNet + ViT model achieved a competitive accuracy of 98.58%, showing that the hybrid architecture was effective in capturing both local and global information. The experimental results showcase the promise of transformer-based models to achieve highly accurate, interpretable, and computationally efficient multi-class, multi-disease classification systems, providing helpful assistance for cultivation management practices as well as for precision farming.

[CV-186] MamBOA: State-Space Architecture for Video Recognition

Link: https://arxiv.org/abs/2606.15275
Authors: Mustafa Bora Çelik
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 7 figures. Code available at this https URL

Abstract:Fine-grained action recognition demands temporal reasoning that general-purpose architectures address through different cost-accuracy tradeoffs: 3D dense operators couple computation to the input volume, while difference-based methods approximate motion through rigid, hand-crafted subtraction of uncontextualized features - each reflecting a deliberate design choice with corresponding limitations in expressiveness or flexibility. We present MamBOA, a backbone-agnostic temporal framework built upon a novel interleaved scan structure that recasts the selective state-space recurrence (S6) as a native motion synthesizer. By interleaving consecutive feature representations extracted from a pretrained backbone into a single alternating sequence, the proposed scan structurally drives the recurrence to encode both temporal observations of each position within a shared hidden state, separated by only a single decay step - rendering the inter-frame transition an intrinsic component of the state dynamics rather than an externally computed quantity. A cascade of dedicated alignment and decoding operations then distills this joint encoding into an explicit motion representation, which a dual-path pooling mechanism adaptively aggregates by balancing attention-driven selection with uniform temporal coverage. The framework interfaces seamlessly with CNN, Transformer, and Mamba backbone families, adding only ~2.1 GFLOPs per feature pair. On Diving48, MamBOA achieves 85.02% Top-1 accuracy with an image-pretrained backbone and 86.24% with a video-pretrained backbone processing the entire video in a single forward pass - demonstrating that structurally induced state-space dynamics constitute a principled and general foundation for motion modeling.

[CV-187] Trusted Multi-View Deep Learning Classification of Fetal Congenital Heart Disease with Feature-level and Decision-level Fusion

Link: https://arxiv.org/abs/2606.15265
Authors: Tan Zhou,Shifa Yao,Suncheng Xiang,Dahong Qian,Baoying Ye
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Congenital heart disease (CHD) refers to the abnormal anatomical structure caused by the abnormal development of the heart and great vessels during embryonic development. Traditional diagnostics often fail to achieve high accuracy and efficiency, especially given the complexity of cardiac anatomy. This study presents a specialized multi-view deep learning framework for CHD binary classification using echocardiographic images. A large-scale CHD dataset, including five views, was used to train the model, enabling it to integrate multi-angle image data. The framework utilizes advanced feature extraction and attention mechanisms to improve diagnostic precision and reliability. An uncertainty-based decision-making component is also integrated to handle low-quality images, enhancing diagnostic outcomes. Experimental results show that this method achieves top-tier performance on our dataset and provides a robust tool for early CHD detection, underscoring its potential for clinical use. The dataset and source code will be released upon paper acceptance.

[CV-188] Focus, Align, and Sustain: Counteracting Gradient Dilution in Incremental Object Detection ICML2026

Link: https://arxiv.org/abs/2606.15253
Authors: Aoting Zhang,Dongbao Yang,Chang Liu,Xiaopeng Hong,Yu Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML2026

Abstract:Adapting Detection Transformers to Incremental Object Detection (IOD) poses a systemic challenge, as set-based optimization is inherently destabilized by sequential learning. In this work, we identify Gradient Dilution as the root cause of performance degradation, wherein optimization signals required to preserve old knowledge are progressively weakened. This phenomenon manifests as a cascading erosion of preservation gradients in magnitude, direction, and support coverage, driven by three tightly coupled factors: Signal Dispersion, where foreground gradients are overwhelmed by background noise; Assignment Drift, where stochastic query-target matching induces inconsistent gradient trajectories; and Support Attrition, where gradients from retained samples insufficiently cover the old-class feature space, weakening decision boundaries under interference from new classes. To counteract this, we propose FAS, a unified framework that Focuses, Aligns, and Sustains gradient flow throughout incremental learning. Specifically, we introduce prior-injected queries to focus discriminative signals by filtering background interference at the source. We further propose deterministic anchor distillation to align query-target assignments and enforce semantic consistency across stages under unstable matching. Finally, we devise manifold-support replay to sustain distributional support of old classes, counteracting representational erosion induced by continual updates. Extensive experiments show that FAS restores robust optimization dynamics and outperforms state-of-the-art methods, achieving over 5.0 AP improvement in the challenging 40+10x4 incremental setting.

[CV-189] Landmark-free Assessment of Lower-limb Alignment with Implicit Neural Shape Functions from Knee Radiographs MICCAI2026

Link: https://arxiv.org/abs/2606.15250
Authors: Zhisen Hu,Antti Kemppainen,David Johnson,Egor Panfilov,Huy Hoang Nguyen,Timothy Cootes,Claudia Lindner,Aleksei Tiulpin
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to MICCAI 2026

Abstract:Radiographic assessment of lower-limb alignment (LLA) is important for predicting joint health and surgical outcomes in total knee arthroplasty. Traditional measurement methods are manual and time-consuming, while recent machine learning approaches typically rely on locating a fixed set of anatomical landmarks. This dependence limits flexibility and may require re-annotation when clinical definitions change. To address this, we propose an automated workflow using Implicit Neural Shape Functions (INSF). Rather than relying on explicit landmark coordinates, we encode the anatomy into a compact latent space and regress clinical alignment measurements directly from these latent codes. This architecture allows for rapid extendability to new tasks without altering the backbone representation. We trained our method on an internal dataset of 566 knee radiographs, each annotated with the outline of the femur and tibia. We evaluated it on both an internal test dataset of 50 patients and a separate external set of 402 preoperative cases from the MRKR dataset. Manual clinical measurements are available for these data, and the MRKR measurements will be made publicly accessible. Performance was comparable to state-of-the-art landmark-based methods and manual agreement, while offering a flexible shape representation that can be extended to additional measurement tasks.

[CV-190] SPARK: Spatial Policy-driven Adaptive Reinforcement learning for Knowledge distillation BMVC

Link: https://arxiv.org/abs/2606.15243
Authors: Mohamed Jismy Aashik Rasool,Shabir Ahmad,Gisong Oh,Teag Kuen Whangbo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 3 figures, 5 tables, BMVC submission

Abstract:Low-bit quantization enables deployment of image restoration (IR) networks on resource-constrained devices, but introduces rounding noise that disproportionately degrades high-frequency regions such as edges and fine textures. Existing knowledge distillation (KD) methods apply distillation signals uniformly across all spatial locations, overlooking the varying reconstruction difficulty across image regions. To address this, we propose SPARK (Spatial Policy-driven Adaptive Reinforcement Learning for Knowledge Distillation), a framework that adaptively allocates distillation effort using a lightweight reinforcement learning (RL) policy network. At each training step, a difficulty feature extractor computes four signals, namely Laplacian variance, pixel variance, student reconstruction error, and teacher-student knowledge gap, which are fed into a compact policy CNN that produces a stochastic spatial weight map to modulate the KD loss during quantization-aware training (QAT). SPARK is IR task-agnostic, adds no inference cost, and integrates into any existing QAT pipeline without architectural changes. Experiments on benchmark datasets demonstrate that SPARK consistently outperforms PTQ, QAT, and state-of-the-art (SOTA) KD approaches across multiple student architectures, achieving reconstruction quality closest to the full-precision teacher under significant computational constraints.

[CV-191] HairLRM: Strand-based Hair Modeling via Large Reconstruction Models SIGGRAPH2026

Link: https://arxiv.org/abs/2606.15238
Authors: Yuefan Shen,Yican Dong,Xiufeng Huang,Zhongtian Zheng,Youyi Zheng,Kui Wu
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: ACM SIGGRAPH 2026 Conference Paper

Abstract:The fundamental limitation of traditional strand-based modeling is not simply data scarcity, but the ill-posedness of inferring complex 3D fields from 2D imagery without structural constraints. This unconstrained regression leads to catastrophic failures in resolving both global occlusion (e.g., in ponytails) and local directionality (e.g., in curls), resulting in over-smoothed, plausible-but-incorrect geometries. To resolve this, we integrate the strong geometric priors of Large Reconstruction Models (LRMs) into the strand generation pipeline. Using the LRM mesh as a structural anchor, we employ a novel Dual Orientation AutoEncoder to lift coarse geometry into high-fidelity strands. By resolving vector field singularities through latent-space optimization and surface-guided refinement, our method effectively disentangles complex topological structures, setting a new benchmark for robustness and accuracy in hair reconstruction.

[CV-192] Show the Signal, Hide the Noise: Spectral Forcing for Pixel-Space Diffusion

Link: https://arxiv.org/abs/2606.15236
Authors: Weichen Fan,Haiwen Diao,Penghao Wu,Ziwei Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code link: this https URL

Abstract:Pixel-space diffusion models are trained on full-bandwidth noisy images, yet the useful signal available to the denoiser is strongly frequency dependent. Under rectified-flow diffusion and natural-image power-law spectra, the per-band data-to-noise contour $k^*(t) = (1-t)^{-2/\alpha}$ separates a signal-bearing low-frequency region from a noise-dominated high-frequency region at each time $t$. We show that this implicit coarse-to-fine structure is not merely descriptive: it induces a capacity-allocation problem. A standard pixel-space denoiser must discover the moving bandwidth boundary internally and can spend computation on frequency-time regions where the optimal prediction collapses to deterministic baselines rather than data-distribution modeling. To make this boundary explicit, we introduce Spectral Forcing, a parameter-free, time-conditional 2D-DCT low-pass operator applied to the noisy input before the patch embedder. Its cutoff expands monotonically with the diffusion time and becomes the identity at the data endpoint. Through controlled synthetic experiments, we identify the regime in which the operator is beneficial: coarse patch tokenization and data whose high-frequency content is predominantly noise rather than essential signal. On ImageNet-256 with JiT-700M/32, Spectral Forcing consistently improves both FID and Inception Score across different training epochs, demonstrating robust gains throughout training; at finer tokenization, the spectral forcing is still competitive. We further insert the unchanged operator into SenseNova-U1, a unified text-to-image model, where it improves DPG-Bench and GenEval, showing that the input-side spectral prior transfers beyond class-conditional generation. These results suggest a route to capacity-efficient pixel-space diffusion by showing the signal and hiding the noise.
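
The time-conditional low-pass idea in this abstract can be illustrated with a small sketch. This is not the authors' implementation: it works on a 1D signal rather than a 2D image, and the function names, the default `alpha=2.0`, and the clamping of the cutoff to the range [1, n] are all assumptions made here for illustration. The only thing taken from the abstract is the contour k*(t) = (1-t)^(-2/alpha) and the requirement that the operator becomes the identity at the data endpoint (t = 1).

```python
import math

def dct2(x):
    """Orthonormal DCT-II of a 1D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct2(c):
    """Inverse of the orthonormal DCT-II above."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out

def cutoff_index(t, n, alpha=2.0):
    """Number of DCT coefficients kept at time t.

    Follows k*(t) = (1 - t)^(-2 / alpha); clamping to [1, n] is an
    assumption of this sketch, not something stated in the abstract.
    """
    if t >= 1.0:
        return n  # identity at the data endpoint
    k_star = (1.0 - t) ** (-2.0 / alpha)
    return max(1, min(n, math.ceil(k_star)))

def spectral_forcing_1d(x, t, alpha=2.0):
    """Zero out DCT coefficients above the time-dependent cutoff."""
    coeffs = dct2(x)
    keep = cutoff_index(t, len(x), alpha)
    masked = [c if k < keep else 0.0 for k, c in enumerate(coeffs)]
    return idct2(masked)
```

With these assumptions, `spectral_forcing_1d(x, 0.0)` keeps only the DC term (the output is the mean of `x` everywhere), and `spectral_forcing_1d(x, 1.0)` returns `x` unchanged.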

[CV-193] Comparing Human Gaze and Vision-Language Model Attention in Safety-Relevant Environments

Link: https://arxiv.org/abs/2606.15202
Authors: Marta Vallejo,Siwen Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 30 pages, 33 figures. Submitted as a preprint. Code and data available upon reasonable request

Abstract:Human visual attention plays an important role in how people perceive and respond to environments containing potential risks. This study investigates whether large vision-language models can identify the same regions of a scene that attract human attention in safety-relevant environments. Eye-tracking data were collected from ten participants viewing 33 scene images representing environments with varying levels of potential risk using Pupil Invisible wearable glasses. Gaze coordinates were mapped onto stimulus images to generate population-averaged human gaze heatmaps. In parallel, GPT-4o was prompted through the OpenAI Vision Application Programming Interface (API) to generate spatial predictions of visual attention, which were converted into saliency maps for comparison with human gaze patterns. Spatial alignment between human gaze heatmaps and model-generated saliency maps was evaluated using four complementary metrics: Pearson correlation (r = 0.515 ± 0.117), Normalised Scanpath Saliency (NSS = 0.988 ± 0.323), Kullback-Leibler divergence (KL = 1.766 ± 0.844), and Area Under the Receiver Operating Characteristic Curve using the Judd formulation (AUC-Judd = 0.806 ± 0.076). A cross-model comparison with Gemini Pro, Gemini Flash, and Claude showed that all models exceeded the AUC-Judd chance baseline of 0.5 and achieved positive NSS scores. Gemini Pro demonstrated the strongest spatial localisation according to three of the four metrics, whereas GPT-4o produced the closest distributional match to human attention as measured by KL divergence. These findings suggest that large vision-language models can identify regions that broadly correspond to where humans direct visual attention in safety-relevant scenes without requiring eye-tracking training data. The results highlight the potential of vision-language models as a scalable tool for approximating human attentional patterns.
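
For readers unfamiliar with the saliency metrics named in this abstract, the sketch below shows common textbook formulations of three of them (Pearson correlation, NSS, and KL divergence) on flattened maps. It is an illustration only: the exact normalisation, smoothing constant, and KL direction used in the paper are not stated in the abstract, so the choices here (`eps`, KL computed as human-to-model) are assumptions. AUC-Judd is omitted for brevity.

```python
import math

def pearson(a, b):
    """Pearson correlation between two flattened maps."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def nss(saliency, fixations):
    """Normalised Scanpath Saliency: mean z-scored saliency at fixated pixels.

    `fixations` holds one 0/1 flag per pixel.
    """
    n = len(saliency)
    mean = sum(saliency) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in saliency) / n)
    hits = [(s - mean) / std for s, f in zip(saliency, fixations) if f]
    return sum(hits) / len(hits)

def kl_divergence(human, model, eps=1e-12):
    """KL(human || model) after normalising both maps to sum to 1.

    The direction and the eps smoothing are assumptions of this sketch.
    """
    h_sum, m_sum = sum(human), sum(model)
    p = [h / h_sum for h in human]
    q = [m / m_sum for m in model]
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
```

Higher Pearson and NSS values indicate better agreement with human gaze, while a lower KL value indicates a closer distributional match.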

[CV-194] Keep It in Mind: User Centric Continual Spatial Intelligence Reasoning in Egocentric Video Streams ICML

Link: https://arxiv.org/abs/2606.15200
Authors: Yun Wang,Junbin Xiao,Han Lyu,Yifan Wang,Jing Zuo,Zhanjie Zhang,Hong Huang,Dapeng Wu,Angela Yao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 45 pages. this https URL

Abstract:We introduce UCS-Bench, a dataset spanning 170+ hours of egocentric visual observations with 8.1K+ timestamped questions for diagnosing User-Centric Continual Spatial intelligence in egocentric video streams. UCS-Bench targets a new problem that emphasizes dynamic spatial reasoning, long-term memory, and their alignment with users’ real-time locations. We propose DirectMe, a framework that incrementally constructs and maintains a structured spatial memory from streaming egocentric observations. DirectMe enables robust tracking and recall of object locations, all relative to the user’s movement over time. By tightly coupling visual perception with memory updates and spatial reasoning, our approach supports long-horizon queries that require recalling interactions, resolving viewpoint-induced ambiguities, and adapting to dynamic scenes. Our experiments show that DirectMe significantly improves the spatial reasoning of leading multimodal LLMs; it also surpasses many spatially aware and long-form streaming video models. We hope our benchmark and solution will advance spatial intelligence research for egocentric AI assistants. Data and code are available at this https URL.

[CV-195] Adaptive Inference-Time Scaling via Early-Step Latent Verification for Image Editing

Link: https://arxiv.org/abs/2606.15188
Authors: Yue Yu,Yang Jiao,Jiayu Wang,Qi Dai,Jingjing Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Instruction-based image editing has made notable progress with recent advances in generative models. However, the quality of the edited result is still influenced by the randomly sampled initial noise, particularly in complex editing scenarios. An unsuitable initial noise may lead to unsatisfactory editing results. Recent inference-time scaling methods address this issue by sampling multiple initial noises and selecting better candidates. Nevertheless, most of them follow a decode-then-verify scheme which introduces an efficiency-accuracy trade-off. When decoding is performed after limited inference steps, the decoded images often remain too noisy for reliable assessment, whereas sufficiently denoised images require much higher computational cost. To address this issue, we propose VeriLatent, a plug-and-play adaptive inference-time scaling framework with early-step latent verification for image editing. Specifically, we propose a novel verifier that scores each initial noise through a latent-space editing activation map at an early stage. It identifies promising candidates by assessing whether they can induce an effective edit in the correct region. This enables efficient early pruning without decoding latents into images. Building on this, we further develop an adaptive search strategy for inference-time scaling. It allocates inference budgets according to editing difficulty, thereby reducing the number of function evaluations (NFE). Extensive experiments on multiple benchmarks and different base models demonstrate that VeriLatent consistently improves both editing performance and inference-time scaling efficiency.

[CV-196] Enabling Real-Time Point-of-Care Ultrasound Segmentation: A GPU-Free Deployment in Resource-Limited Settings

Link: https://arxiv.org/abs/2606.15176
Authors: Weihao Gao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages, 4 figures

Abstract:Ultrasound imaging is the most widely adopted medical modality globally due to its low cost and portability, yet artificial intelligence (AI) deployment remains constrained by reliance on GPU-accelerated models, creating a structural paradox where the cost of “intelligence” exceeds that of the imaging device itself. Here, we present the systematic adaptation and extensive evaluation of UltraSeg, an ultra-lightweight architecture originally developed for colonoscopic polyp segmentation, now engineered for point-of-care ultrasound (POCUS) across ten public datasets spanning six anatomical sites (breast, thyroid, kidney, carotid, fetal, and small-animal tumor). We systematically validate both variants in ultrasound domains: UltraSeg-130K (0.13M parameters) achieves 89.7 FPS on single-core CPUs and 34.8 FPS on a refurbished mobile device, while UltraSeg-500K (0.5M parameters) delivers 44.6 FPS on CPU and 16.1 FPS on mobile device. UltraSeg-500K matches or exceeds the Dice performance of the 31M-parameter UNet and approaches 105M-parameter TransUNet in average performance, with superior zero-shot cross-dataset generalization on external validation sets (UDIAT, DDTI). By enabling clinical-grade segmentation without GPU dependency, this work brings AI costs in line with ultrasound accessibility, making advanced diagnostics available in resource-limited settings.

[CV-197] Label Shift Aware Adaptation for Online Zero-shot Learning with Contrastive Language-Image Pre-Training (CLIP)

Link: https://arxiv.org/abs/2606.15169
Authors: Pengxiao Han,Changkun Ye,Yanshuo Wang,Jinguang Tong,Miaohua Zhang,Xuesong Li,Jie Hong,Lars Petersson
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-language models like Contrastive Language-Image Pre-Training (CLIP) have been extensively studied in data-scarce scenarios. A particularly challenging and realistic task in this area is online zero-shot learning with CLIP, where unknown test samples are predicted sequentially in random order by CLIP while keeping the feature extraction and model parameters fixed during the sequential inference phase. Most existing approaches in this setting address the problem by adapting representations online using incoming test samples, while neglecting the distribution of the data on which CLIP was initially trained. This mismatch can lead to degraded performance when the label distribution in the test data differs from that of the training domain. To address this gap, we propose Label Shift Aware (LSA), which formulates the online zero-shot classification task as a domain adaptation problem. Specifically, LSA adapts the predictions computed by CLIP, which was trained on an unknown source distribution, to a target distribution using only unlabeled test data, and applies label shift correction to mitigate the mismatch between the source and target domains. The extensive experiments across multiple datasets demonstrate that the proposed LSA consistently outperforms state-of-the-art online zero-shot learning methods based on CLIP.
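
As background for the label shift correction mentioned in this abstract, the sketch below shows the classic prior-reweighting rule together with an EM-style re-estimation of the target class prior from unlabeled predictions. This is the standard textbook technique, swapped in for illustration; it is not claimed to be the LSA procedure, whose details the abstract does not give. All function names and the toy numbers in the usage note are hypothetical.

```python
def reweight(probs, source_prior, target_prior):
    """Adjust class probabilities from a source prior to a target prior.

    Uses p_target(y | x) proportional to p_source(y | x) * target(y) / source(y).
    """
    weighted = [p * t / s for p, s, t in zip(probs, source_prior, target_prior)]
    total = sum(weighted)
    return [w / total for w in weighted]

def em_target_prior(all_probs, source_prior, iters=100):
    """Estimate the target class prior from unlabeled predictions with EM.

    Each iteration reweights every prediction with the current prior
    estimate and then averages the adjusted posteriors.
    """
    num_classes = len(source_prior)
    prior = list(source_prior)
    for _ in range(iters):
        adjusted = [reweight(p, source_prior, prior) for p in all_probs]
        prior = [
            sum(a[c] for a in adjusted) / len(adjusted)
            for c in range(num_classes)
        ]
    return prior
```

For example, with a uniform source prior `[0.5, 0.5]` and ten predictions of which seven are `[0.9, 0.1]` and three are `[0.2, 0.8]`, the estimated target prior moves well away from uniform toward the first class, and `reweight` then sharpens each individual prediction accordingly.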

[CV-198] Variational Network with Wavelet-based UNET in Accelerated MRI Reconstruction from Under Sampled K-space Data

Link: https://arxiv.org/abs/2606.15167
Authors: Yasir Arafat Prodhan(1),Shaikh Anowarul Fattah(1) ((1) Department of Electrical and Electronic Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 9 figures

Abstract:Fully sampled MRI requires dense k-space acquisition, leading to long scan times, reduced clinical throughput, and increased sensitivity to patient motion. Accelerated MRI addresses this by acquiring undersampled k-space data and reconstructing the missing information computationally. However, reconstruction from undersampled measurements is highly ill-posed and can introduce aliasing artifacts, noise amplification, and loss of anatomical detail. Although conventional parallel imaging and compressed sensing methods mitigate these issues, and deep learning methods have further improved reconstruction quality, preserving high-frequency structures under aggressive undersampling remains challenging. In this work, we propose a Variational Network with a Wavelet-based U-Net (W-UNet) for accelerated MRI reconstruction. The framework combines physics-guided iterative reconstruction with learnable multi-scale frequency representations. Standard pooling operations are replaced with Discrete Wavelet Transform and Inverse Wavelet Transform modules, enabling lossless downsampling while preserving low-frequency structure and high-frequency edge details. Integrated into the refinement and sensitivity map estimation stages, the proposed design improves artifact suppression, feature preservation, and reconstruction fidelity in both single-coil and multi-coil settings. Experiments on fastMRI knee and M4Raw brain datasets show state-of-the-art performance. Ablation studies further confirm the effectiveness of wavelet-based feature decomposition for accelerated MRI reconstruction.
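
The claim that wavelet transforms allow lossless downsampling can be checked with a one-level 2D Haar transform, sketched below. The abstract does not say which wavelet family the paper uses, so Haar is an assumption chosen for simplicity, and the sub-band naming (LL, LH, HL, HH) follows one common convention among several.

```python
def haar_dwt2(img):
    """One-level 2D Haar transform of an image with even height and width.

    Returns four half-resolution sub-bands: LL, LH, HL, HH.
    """
    h, w = len(img), len(img[0])
    ll, lh, hl, hh = ([[0.0] * (w // 2) for _ in range(h // 2)] for _ in range(4))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            r, s = i // 2, j // 2
            ll[r][s] = (a + b + c + d) / 2.0  # local average (low frequency)
            lh[r][s] = (a - b + c - d) / 2.0  # left-right difference
            hl[r][s] = (a + b - c - d) / 2.0  # top-bottom difference
            hh[r][s] = (a - b - c + d) / 2.0  # diagonal difference
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuilds the full-resolution image exactly."""
    h, w = len(ll), len(ll[0])
    img = [[0.0] * (2 * w) for _ in range(2 * h)]
    for r in range(h):
        for s in range(w):
            p, q, u, v = ll[r][s], lh[r][s], hl[r][s], hh[r][s]
            img[2 * r][2 * s] = (p + q + u + v) / 2.0
            img[2 * r][2 * s + 1] = (p - q + u - v) / 2.0
            img[2 * r + 1][2 * s] = (p + q - u - v) / 2.0
            img[2 * r + 1][2 * s + 1] = (p - q - u + v) / 2.0
    return img
```

Unlike max or average pooling, which keep one value per 2x2 block and discard the rest, the four sub-bands together hold as many values as the input, which is why the inverse transform can recover it exactly.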

[CV-199] GeoStream: Toward Precise Camera Controlled Streaming Video Generation

Link: https://arxiv.org/abs/2606.15162
Authors: Yizhou Zhao,Yifan Wang,Xiaoyuan Wang,Yushu Wu,Hao Zhang,Moayed Haji-Ali,Rameen Abdal,Ashkan Mirzaei,Yanyu Li,Willi Menapace,Laszlo Jeni,Sergey Tulyakov,Peter Wonka,Chaoyang Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate interactive camera control is essential for video-based world models, but most existing approaches learn camera motion implicitly, leading to inaccurate control under out-of-distribution trajectories. Explicit geometric conditioning improves controllability, but existing methods are non-autoregressive and rely on a static 3D cache built from an initial frame, which becomes ineffective once the viewpoint moves beyond the original frustum. We propose GeoStream, a framework that enables precise metric-scale camera control in autoregressive streaming video generation. Our method maintains a self-refreshing 3D cache that is periodically updated online from the model’s own outputs: we estimate depth from the most recently generated frame, unproject to 3D, and reproject into the target view to produce point reprojections as geometric conditioning for subsequent synthesis. By the same principle, the conditioning seen during training is also rendered from the student’s own generated frames, yielding a fully on-policy distillation that naturally aligns the train and inference conditioning distributions. Unlike prior work that uses off-policy condition noising, our approach trains the model against the exact error distribution it encounters at inference, mitigating both standard autoregressive drift and the second-order geometric feedback loop that arises when the cache itself is derived from generated outputs. Quantitative and qualitative results show that our approach substantially improves camera controllability.
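
The "estimate depth, unproject to 3D, reproject into the target view" step described in this abstract corresponds to standard pinhole camera geometry, sketched below for a single pixel. The sketch assumes an ideal pinhole camera with shared intrinsics for both views and a known relative pose; the function names and parameter layout are hypothetical and not taken from the paper.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Lift a pixel with known depth to a 3D point in the camera frame."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)

def transform(point, rotation, translation):
    """Apply a rigid transform (3x3 rotation, 3-vector translation)."""
    return tuple(
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )

def project(point, fx, fy, cx, cy):
    """Project a 3D point to pixel coordinates; None if behind the camera."""
    x, y, z = point
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)

def reproject_pixel(u, v, depth, intrinsics, rotation, translation):
    """Map a source-view pixel into the target view given the relative pose.

    `intrinsics` is the tuple (fx, fy, cx, cy), assumed equal for both views.
    """
    point = unproject(u, v, depth, *intrinsics)
    moved = transform(point, rotation, translation)
    return project(moved, *intrinsics)
```

Applying `reproject_pixel` to every pixel of a depth map yields the kind of point reprojection that can serve as geometric conditioning; points that land outside the image or behind the camera simply leave holes for the generator to fill.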

[CV-200] DLWM: Diverse Latent World Models for Efficient Multimodal Reasoning

Link: https://arxiv.org/abs/2606.15160
Authors: David Huang,Lianlei Shan
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Preprint. 9 pages main text, 15 pages total including appendix, 2 figures

Abstract:Reasoning capabilities of multimodal large language models (MLLMs) have improved considerably in recent years. Existing approaches typically rely on explicit chain-of-thought or continuous latent-space trajectories to enhance multi-step reasoning. However, these methods generally assume that an input admits a single latent interpretation and unfold reasoning along a fixed path or under a uniform computation budget. In real-world multimodal settings, visual observations are often subject to occlusion, blur, viewpoint variation, or semantic ambiguity, giving rise to multiple plausible interpretations. A uniform reasoning strategy not only limits the model’s ability to explore multiple hypotheses but also incurs high memory usage and rollout cost. We present DLWM (Diverse Latent World Models), a multimodal reasoning framework that combines latent-space reasoning with reinforcement learning. First, we construct a set of diverse latent world hypotheses in continuous latent space, each capturing a different plausible interpretation of the visual input, and unfold latent reasoning independently on each hypothesis. An orthogonality-based diversity regularizer explicitly prevents hypothesis collapse. Second, we formulate the latent reasoning process as a resource-constrained sequential decision problem and introduce a resource-aware reinforcement learning policy that adaptively allocates computation across hypotheses, dynamically deciding whether to expand, terminate, or merge reasoning paths, thereby substantially reducing memory footprint and improving rollout efficiency. Experiments on multiple multimodal reasoning benchmarks demonstrate that DLWM outperforms existing methods by 2-5 points in accuracy while reducing memory usage by 24%.

[CV-201] RefGC-SR2: Reference-guided Generated Content Super-Resolution and Refinement

Link: https://arxiv.org/abs/2606.15158
Authors: Jeahun Sung,Dahyeon Kye,Soo Ye Kim,Jihyong Oh
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The first two authors contributed equally to this work. The last two authors are co-corresponding authors. Please visit our project page at this https URL

Abstract:Reference-guided generation (e.g., object compositing, customization) has progressed rapidly, yet current pipelines share a fundamental limitation: the object-centric high-resolution reference image (HRRI) provided by users is downsampled to a fixed low resolution (LR) before being fed into the model, so the fine-grained details are discarded before the output is even produced. In addition, the generation step then introduces its own artifacts (e.g., identity distortion) on top of this loss. Existing reference-guided generated content refinement (RefGCR) methods can correct some of these artifacts but still operate in the LR domain; reference-guided super-resolution (RefSR) methods recover resolution but assume natural-image degradations and ignore the artifact distribution of generative pipelines. To address both gaps in a single formulation, we introduce a new task: reference-guided generated content super-resolution-refinement (RefGC-SR$^2$), where the original HRRI is reused at the post-processing stage to recover lost details, refine generative artifacts, and upscale the output simultaneously. We construct the first real-world triplet data generation pipeline for this RefGC-SR$^2$ task, training a diptych-conditioned generator to synthesize paired low-quality anchors that public pretrained models cannot provide. We further present a frequency-aware diffusion transformer model for RefGC-SR$^2$ that selectively injects fine details from the HRRI while removing generative artifacts. Extensive experiments demonstrate that our RefGC-SR$^2$ model successfully (i) refines the object identity faithfully with respect to the reference, and (ii) recovers high-resolution details, so that the final result is significantly higher quality and practically more usable compared to existing RefGCR and RefSR baselines.

[CV-202] HiRo: A Compact Four-Directional Hierarchical Reservoir Token-Mixer for Efficient Image Classification

Link: https://arxiv.org/abs/2606.15151
Authors: Md Farhadul Islam,Ishan Thakkar,J. Todd Hastings
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at ICONS 2026

Abstract:Recent image classification models must balance local feature modeling, cross-window interaction, and parameter efficiency. Many high-performing architectures rely on fully trainable token-mixers, which improve representation learning but increase parameter count, optimization complexity and computational cost. We propose a parameter-efficient image classification model called HiRo that integrates shifted-window partitioning with multi-directional hierarchical reservoir computing. Images are divided into non-overlapping patches (treated as tokens), linearly projected, normalized, and enriched with 2D sinusoidal positional encodings, then processed within local windows. Inside each window, tokens are scanned in four directions and passed through a two-stage slice-and-mix reservoir module. In the first stage, directional sequences are split into contiguous slices, each processed by its own fixed reservoir with a trainable closed-loop readout. The resulting slice outputs are summarized using the start, end, and mean representations, and then mixed by a second-stage fixed reservoir for each direction. The mixed slice representations are expanded back to the token level and fused with the first-stage outputs, after which the four directional outputs are realigned and averaged. Consecutive blocks alternate between regular and shifted windows to enable cross-window interaction, followed by layer normalization, a residual feed-forward network, and global pooling for classification. This design combines regular and shifted window partitioning with hierarchical multi-directional reservoirs to make an efficient local-to-cross-window token-mixing framework for image classification. Despite using under 1M trainable parameters and significantly lower memory and time than transformer-style baselines, HiRo also achieves 99.46%, 85.57%, and 59.10% accuracy on MNIST, CIFAR-10, and CIFAR-100, respectively.

[CV-203] MotionVLA: Vision-Language-Action Model for Humanoid Motion

Link: https://arxiv.org/abs/2606.15142
Authors: Nonghai Zhang,Siyu Zhai,Yanjun Li,Zeyu Zhang,Zhihan Yin,Yandong Guo,Boxin Shi,Hao Tang
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Generating realistic humanoid motion from scene images and text involves both low-frequency pose semantics and high-frequency physical dynamics. However, many existing methods tokenize motion with a single shared codebook, forcing heterogeneous motion signals into the same quantization space. Our frequency-domain analysis of human motion data reveals a clear mismatch between single-codebook quantization and motion statistics: five DCT coefficients capture 93% of joint-position energy but only 37% of joint-velocity energy, which can bias quantization toward pose statistics and under-represent high-frequency velocity components. A second challenge lies in adapting a standard autoregressive model to effectively model high-frequency physical signals in motion sequences. Therefore, we propose DSFT, a dual-stream frequency tokenizer that separates motion into Base and physical streams and compresses them independently with DCT truncation and BPE. Furthermore, we present MotionVLA, a Qwen3.5-based model that arranges Base and physical tokens in a unified sequence, where Phys tokens are predicted after Base tokens. Experiments on HumanML3D and MBench show that, despite using a lightweight 2B backbone, MotionVLA reduces the Diversity gap to real data by over 50% on HumanML3D and improves Motion-Condition Consistency by 3.8% on MBench, supporting frequency-aware dual-stream decoupling as an effective formulation for autoregressive motion generation. Code: this https URL. Website: this https URL.
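
The energy-compaction argument in this abstract can be illustrated with a small sketch. It is not the authors' DSFT tokenizer: the toy signal, the helper names `dct2` and `energy_fraction`, and the choice of a finite-difference velocity are all assumptions made for illustration. The idea is only that a smooth position trace concentrates its energy in the first few DCT coefficients, while its velocity pushes relatively more energy into higher frequencies.

```python
import math

def dct2(x):
    """Orthonormal DCT-II of a 1D sequence (direct O(n^2) form)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def energy_fraction(x, k):
    """Share of the total signal energy held by the first k DCT coefficients."""
    coeffs = dct2(x)
    total = sum(c * c for c in coeffs)
    return sum(c * c for c in coeffs[:k]) / total if total > 0 else 0.0

# Hypothetical "joint position": a slow sweep plus a small fast jitter.
n = 64
position = [
    math.sin(2 * math.pi * t / n) + 0.05 * math.sin(2 * math.pi * 12 * t / n)
    for t in range(n)
]
# Finite-difference velocity; differencing amplifies the fast component.
velocity = [position[t + 1] - position[t] for t in range(n - 1)]

pos_frac = energy_fraction(position, 5)
vel_frac = energy_fraction(velocity, 5)
print(f"first 5 DCT coefficients: position {pos_frac:.2f}, velocity {vel_frac:.2f}")
```

On this toy signal the position trace keeps a larger share of its energy in the first five coefficients than the velocity trace does, which is the qualitative mismatch the abstract reports for real motion data.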

[CV-204] Beyond Scalar Distances: Semantic Attribute Gradients from Frozen MLLMs for Visual Embeddings

Link: https://arxiv.org/abs/2606.15134
Authors: Shubhang Bhatnagar,Dheeraj Baiju,Narendra Ahuja
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Vision encoders for retrieval are typically trained with class-label supervision: each training pair reduces to a scalar that uniformly pushes the embedding apart or pulls it together, as if every visual attribute either differed or matched. A multimodal large language model (MLLM), shown the same pair, can articulate those attributes and use them to predict whether the images share a class. We propose SAGA, a framework that turns this language-grounded, attribute-aware perception into a training signal for the encoder itself. Specifically, we use Group Relative Policy Optimization (GRPO) to reward the MLLM for correct predictions on the vision encoder’s tokens. Since correct predictions require those tokens to expose the specific attributes that differ or match between the pair, the gradient pushes the encoder to encode them, replacing the uniform pair-level scalar with attribute-resolved supervision. An auxiliary attention-distillation loss anchors the encoder’s embedding to tokens the MLLM attended to, and a standard metric-learning loss shapes the embedding geometry for nearest-neighbour retrieval. The MLLM is frozen throughout and discarded at inference, matching the deployment cost of a metric-learning baseline. SAGA improves Recall@1 by 3 to 6 points over state-of-the-art baselines on CUB-200-2011, Cars-196, FGVC-Aircraft, and iNaturalist Aves on zero-shot image retrieval.

[CV-205] DragMesh-2: Physically Plausible Dexterous Hand-Object Interaction with Articulated Objects

Link: https://arxiv.org/abs/2606.15133
Authors: Tianshan Zhang,Yijia Duan,Yanjun Li,Zeyu Zhang,Hao Tang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL . Website: this https URL

Abstract:Dexterous interaction with articulated objects is important for household, assistive, and humanoid manipulation, where multi-finger hands can provide compliant contact patterns beyond parallel-jaw grasping. However, articulated-object manipulation differs from static-object manipulation: the target part cannot be directly actuated, and its motion must emerge through sustained physical hand–handle contact. This makes the transition from object-centric articulated generation to hand-driven dexterous hand–object interaction non-trivial, since geometric trajectory replay or open-loop execution does not model the contact dynamics required to move the articulated part. Moreover, policies trained only for task completion under fixed dynamics can overfit nominal contact loads, especially without tactile or force feedback, and may degrade when the contact load changes. To address these challenges, we present DragMesh-2, a contact-driven framework for dexterous interaction with articulated objects that extends articulated interaction from object-centric generation to hand-driven dexterous hand–object interaction, where articulated motion must arise through physical contact. We further propose PICA, a physically informed contact-aware training mechanism that injects physical signals into policy learning without tactile or force feedback, improving robustness and task success under changing contact loads. Finally, we conduct systematic evaluation across multiple damping conditions and articulated-object categories to study robustness under contact-load variation, and provide a pure-geometry dexterous interaction resource to support future loco-manipulation and humanoid hand–object interaction research. Across seven GAPartNet objects, DragMesh-2 achieves stronger robustness under contact-load variation than the compared methods while maintaining high task success across damping conditions.

[CV-206] EyeMVP: OCT-Informed Fundus Representation Learning via Paired CFP–OCT Pretraining

Link: https://arxiv.org/abs/2606.15129
Authors: Zhuo Deng,Ruiheng Zhang,Ziheng Zhang,Weihao Gao,Yitong Li,Qian Wang,Lei Shao,Jiaoyue Dong,Zhixi Zeng,Lijian Fang,Haibo Wang,Xiaobin Lin,Tao Liu,Zhicheng Du,Zhengwei Zhang,Lin Yang,Zheng Gong,Xinyu Zhao,Zhenquan Wu,Fang Li,Zhiguang Zhou,Guoming Zhang,Sun Jing,Han Lv,Wenbin We,Lan Ma
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Color fundus photography (CFP) is the mainstay for large-scale retinal screening, yet its diagnostic capacity is constrained by the lack of depth-resolved structural information. Optical coherence tomography (OCT) provides cross-sectional retinal anatomy, but is less accessible in population-level screening. Here, we present EyeMVP, a cross-modal retinal foundation model that uses paired CFP–OCT pretraining to learn OCT-informed CFP representations. EyeMVP is pretrained on 674,893 strict same-eye same-day paired CFP–OCT image triples from 112,642 patients across eight hospitals in China. The model uses cross-modal masked reconstruction to enrich CFP representations with OCT-associated supervision, while requiring only CFP images at inference. To accommodate the non-aligned imaging geometry between en-face CFP and cross-sectional OCT, EyeMVP combines source-constrained cross-attention with CFP-derived structural masks. Across 16 downstream tasks, including classification, segmentation, few-shot adaptation, and cross-modal retrieval, EyeMVP outperforms representative retinal foundation models and shows consistent gains on tasks involving macular and optic nerve structure. For CFP-challenging macular diseases, EyeMVP achieves an AUROC of 0.948 for macular edema (vs. 0.852 for EyeCLIP) and 0.825 for myopic macular schisis. In an exploratory reader study, EyeMVP exceeds junior and intermediate ophthalmologist groups but does not reach senior ophthalmologist performance on macular edema, while showing numerically higher balanced accuracy than all reader groups on myopic macular schisis. These results suggest that pixel-level cross-modal reconstruction can enrich CFP representations with OCT-associated supervision, providing a practical route toward stronger CFP-based retinal analysis in screening settings.

[CV-207] Multi-view feature High-order Fusion for Space Weak Object Detection and Segmentation

Link: https://arxiv.org/abs/2606.15118
Authors: Weilong Guo,Yuhan Sun,Shengyang Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Weak objects are common in images and videos of space applications. However, it is hard to learn proper representations from their limited appearance information. Inspired by multi-view learning, we develop simple multi-view attentions, treating their outputs as multi-view features. We also propose a multi-view feature high-order fusion method (MHF) to aggregate more accurate and richer features of weak objects. Our MHF extends the commonly used low-order feature fusion method to higher orders. It enhances the model’s capacity to capture relevant and complementary information about weak objects. This is achieved by introducing high-order multi-view features perception and a recursive task-contribution gated selection of multi-view features. The new operation is highly flexible and customizable. It is compatible with various variants of multi-view feature representations. We conduct extensive experiments on two newly constructed space science datasets and an open, large-scale satellite video dataset. Our MHF serves as a plug-and-play module and significantly improves various vision transformers and convolution-based detection and segmentation models. We achieve all state-of-the-art accuracies on both tasks across three datasets. Our MHF can be a new basic module for visual modeling that effectively represents weak objects in terms of multi-view learning. The code will be available at this https URL.

[CV-208] Teacher-Student Structure for Domain Adaptation in Ensemble Audio-Visual Video Deepfake Detection

Link: https://arxiv.org/abs/2606.15117
Authors: Elham Abolhasani,Maryam Ramezani,Hamid R. Rabiee
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD)
Comments:

Abstract:The rapid advancement of generative AI models is leading to more realistic deepfake media, encompassing the manipulation of audio, video, or both. This raises severe privacy and societal concerns. Numerous studies in this area have yielded promising intra-domain results; however, these models frequently exhibit decreased efficacy when faced with data from dissimilar domains. Consequently, recent deepfake detection approaches focus on enhancing the generalization ability through multiple techniques that incorporate all input modalities, including audio, images, and their interactions. In this regard, we propose EAV-DFD, a generalized deep ensemble audio-visual model combined with a domain adaptation mechanism that uses a teacher-student framework to enhance the model’s ability to perform and generalize effectively across unseen domains. To evaluate the model’s performance, we used the FakeAVCeleb dataset as the primary domain and the DFDC, Deepfake_TIMIT, and PolyGlotFake datasets as unseen domains. Our experimental results demonstrate that the proposed framework is efficient in domain adaptation, improving the model’s AUC by 4.09%, 17.94%, and 0.5% on the three unseen datasets, using only a small portion of each to train the student model. This leads to a novel deepfake detection model capable of adapting to new domains and interpreting which modality has been manipulated, highlighting the potential of our approach for real-world applications.

[CV-209] Learn Temporal Consistency For Robust Satellite Video Detector

Link: https://arxiv.org/abs/2606.15112
Authors: Weilong Guo,Shengyang Li,Yanfeng Gu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 8 figures

Abstract:Satellite video object detection (SVOD) for oriented and fine-grained objects plays an important role in satellite applications. Most existing SVOD methods only focus on one or a few coarse-grained categories of moving objects and represent objects with horizontal bounding boxes. They have difficulty extracting complete, accurate, and consistent information about objects in whole satellite videos. In this paper, we propose a satellite video object detection framework based on Temporal Consistency Learning (TCL). TCL detects oriented and fine-grained objects by leveraging the rich temporal contexts within satellite videos. The framework integrates three key modules: temporal and fine-grained feature aggregation (TFA), structure encoding (SE), and temporal consistency constraint (TCC). TFA and TCC modules facilitate consistent representation learning across frames, while the SE module encodes both appearance and structural information for precise fine-grained recognition. Experimental results on the SAT-MTB benchmark dataset demonstrate TCL’s superior performance, achieving a new state-of-the-art oriented and fine-grained detection accuracy of 47.7% mAP, a 4.8% improvement over the baseline. Furthermore, our TCL framework readily accommodates existing image-based detectors, leading to enhanced detection accuracies.

[CV-210] Physics-Driven Zero-Shot MRI Reconstruction with Non-local Image Priors

Link: https://arxiv.org/abs/2606.15110
Authors: Lingtong Zhang,Wenlei Li,Mu He,Li Xiao,Yang Ji
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Zero-Shot Self-Supervised Learning (ZS-SSL) has emerged as a promising paradigm for accelerated Magnetic Resonance Imaging (MRI) reconstruction, eliminating the reliance on fully-sampled external datasets. However, learning solely from a single under-sampled scan suffers from supervision scarcity and optimization instability, often leading to overfitting or artifacts. To address these challenges, we propose a robust physics-driven ZS-SSL framework that synergizes physical consistency with image-domain non-local priors. Our method introduces three core innovations: (1) a Coil Sensitivity Map (CSM)-Guided Dynamic Repository, which stabilizes the training trajectory by filtering physically inconsistent artifacts based on coil sensitivity constraints; (2) a SPIRiT-based regularization, which enforces k-space self-consistency via a learned correlation kernel and stochastic masking; (3) a Non-Local Self-Similarity (NSS) Pixel Bank, which leverages the high-fidelity reference established by the former modules to explicitly mine non-local anatomical similarities, thereby augmenting supervision in the image domain. Extensive experiments on the FastMRI dataset demonstrate that our approach achieves state-of-the-art performance, particularly under high acceleration factors, effectively bridging the gap between zero-shot learning and supervised methods. The code is available at this https URL.

[CV-211] Text-Driven Fusion for Infrared and Visible Images: Achieving Image Scene Adaptation on Hyperbolic Space

Link: https://arxiv.org/abs/2606.15104
Authors: Huan Kang,Hui Li,Tianyang Xu,Tao Zhou,Xiao-Jun Wu,Josef Kittler
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 8 figures

Abstract:Infrared and visible image fusion aims to integrate complementary modalities, while existing Euclidean methods impose rigid distance metrics that distort multi-modal interactions and parent-to-child semantic hierarchies. To overcome these limitations, we introduce a text-driven fusion framework empowered by hyperbolic manifold learning. During training, BLIP-extracted text prompts serve as topological anchors within the hyperbolic space, guiding vision-attribute alignment through hyperbolic embeddings that naturally accommodate varying semantic granularities. By exploiting the exponential volume growth dictated by the Poincaré ball’s negative curvature, this approach seamlessly embeds hierarchical trees to encode coarse-to-fine semantics without metric saturation, while the vast peripheral space prevents texture distortion during cross-modal fusion. At inference, the fusion process autonomously adapts to input content using the learned text-attribute priors, completely eliminating the need for textual input. Experimental results show our method outperforms state-of-the-art approaches on benchmark datasets, with code available at this https URL.
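
For readers unfamiliar with the Poincaré ball mentioned in this abstract, the following minimal sketch computes the standard geodesic distance in the unit ball with curvature -1. The curvature value and the helper name `poincare_distance` are assumptions for illustration; the abstract does not state which parameterization the paper uses.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball
    (curvature -1): arcosh(1 + 2|u-v|^2 / ((1-|u|^2)(1-|v|^2)))."""
    sq_norm = lambda a: sum(x * x for x in a)
    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)

# The same Euclidean gap (0.05) is far longer near the boundary than near
# the origin, which is what leaves room for fine-grained leaves of a hierarchy.
near_origin = poincare_distance((0.0, 0.0), (0.05, 0.0))
near_boundary = poincare_distance((0.90, 0.0), (0.95, 0.0))
print(near_origin, near_boundary)
```

Distances growing toward the boundary is the property behind the "exponential volume growth" the abstract relies on for embedding coarse-to-fine semantics.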

[CV-212] Think Less Act Early: Reinforced Latent Reasoning with Early Exit in Vision-Language-Action Models

Link: https://arxiv.org/abs/2606.15099
Authors: Dianqiao Lei,Lianlei Shan
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Accepted at ICML 2026

Abstract:Existing Vision-Language-Action (VLA) models predominantly rely on explicit Chain-of-Thought (CoT) reasoning to bridge perception and action. While effective, this paradigm suffers from high computational costs and error propagation in multi-step tasks. In this paper, we propose Adaptive Variable Alignment VLA (AVA-VLA), a novel Latent Reasoning VLA framework that models reasoning as a sequence of unobservable latent variables, bypassing the need for explicit text generation. However, latent trajectories are inherently susceptible to noise interference and misalignment with downstream objectives. To address this, we introduce a Reinforcement Learning-based Denoising mechanism that treats latent state generation as a sequential decision process, optimizing reasoning trajectories via task-level rewards. Furthermore, we incorporate an Early-Exit Strategy that adaptively terminates reasoning based on state confidence, enabling a dynamic trade-off between depth and efficiency. Extensive experiments on embodied decision benchmarks demonstrate that AVA-VLA achieves a 6x inference speedup over explicit CoT methods while attaining a 98.3% average success rate on LIBERO, improving both efficiency and long-horizon stability over full-reasoning baselines.

[CV-213] Texture-Shape Bias Balancing for Robust Synthetic-to-Real Semantic Segmentation in Automotive NIR Imagery

Link: https://arxiv.org/abs/2606.15072
Authors: Felix Stillger,Ben Hamscher,Lukas Hahn,Annika Mütze,Tobias Meisen,Kira Maag
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ECML PKDD 2026 (ADS Track)

Abstract:Semantic segmentation is a fundamental component of visual perception in modern automotive systems, enabling pixel-level scene understanding. Near-Infrared imaging (NIR) offers stable detection under difficult illumination conditions, but the development of domain-specific semantic segmentation models remains challenging due to the lack of high-quality annotated data from real-world scenarios. Synthetic datasets offer a scalable alternative, but models trained on synthetic images often suffer performance degradation when transferred to real domains. We present the first systematic study on synthetic to real domain adaptation for semantic segmentation in NIR images in the automotive domain. We propose a generative augmentation framework that transforms synthetic images into realistic NIR-style variants via our introduced target style adaptation (TSA). TSA fine-tunes a latent diffusion model via low-rank adaptation on a small curated set of real NIR images and applies it to synthetic training data using structure-preserving multi-signal conditioning. To reduce texture bias and improve segmentation robustness, we further apply a Voronoi-based style diversification strategy (VSD) that modifies the original textures while preserving scene geometry. Experiments with multiple model architectures on NIR data from vehicle interiors and street scenes show that balancing inductive bias during training leads to noticeably more robust semantic segmentation and effectively reduces the domain gap in our real-world scenarios by up to 63.6% on exterior and 28.4% on interior data. The code is available at GitHub.

[CV-214] Bridging Geographic Bias in Urban Streetscape Inference via Lifelong Learning with Visual-Semantic Pivoting

Link: https://arxiv.org/abs/2606.15055
Authors: Xinze Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Visual perception of urban streetscapes underpins evidence-based decisions in landscape planning, public health, and place-making. Yet models trained on a few well-photographed metropolises systematically misjudge underrepresented districts, propagating geographic bias into downstream policy. We address this gap with HVSP-LL, a lifelong learning framework that couples a stratified visual-semantic pivoting module with an equity-aware rehearsal mechanism. The pivoting module organises landscape concepts along a three-tier ontology (macro structure, meso composition, micro element) and aligns image features to learnable semantic anchors at each tier, providing transferable representations that resist distributional drift. The lifelong adaptation component sequentially absorbs new urban regions while constraining inter-region perception gaps through a worst-region sample-reweighting objective and a structurally-aware exemplar buffer. We evaluate HVSP-LL on a panoramic streetscape benchmark assembled from twelve cities across four continents and seven perceptual dimensions. The framework attains 0.834 Spearman correlation on the held-out city sequence, an absolute 6.1 point improvement over the strongest continual baseline, and shrinks the inter-city perception gap to 0.094 – a 38% reduction relative to the strongest continual baseline (0.151) and a 57% reduction relative to a representative regularisation baseline (0.218). Ablations confirm that each tier of the pivoting hierarchy contributes monotonically, and the equity-aware rehearsal converts mean backward transfer from -0.038 (without retention) to +0.013, eliminating catastrophic forgetting on the held-out sequence. Our results indicate that hierarchical anchoring is a practical pathway toward geographically equitable streetscape inference at city scale.

[CV-215] Gaussian Spatial Priors for Anatomy-Aware Object Detection in Surgical Videos

Link: https://arxiv.org/abs/2606.15049
Authors: Yunfan Li,Artem Shmelev,Himanshu Gupta
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Detecting anatomical structures in surgical video is essential for intraoperative safety frameworks such as the Critical View of Myopectineal Orifice (CVMPO) in inguinal hernia repair. While prominent structures like the Cooper’s Ligament and Triangle of Doom are reliably detected by standard methods, smaller structures such as the epigastric vessels remain challenging due to their visual ambiguity and intermittent visibility. We observe that the spatial relationship between structures is anatomically constrained, and propose a Gaussian Spatial Prior (GSP) module that encodes this relationship as a compact, parametric bias injected into the self-attention of a DAB-DETR decoder. The prior is computed offline from training annotations as a small set of frozen Gaussian parameters and recomputed at each decoder layer using the iteratively refined reference points. On a dataset of inguinal hernia repair videos with 5-fold cross-validation, GSP improves dependent class detection by +33.5% (AP50) over DAB-DETR and +53.9% over YOLOv26, while also improving anchor detection by +6.0%. These gains are statistically significant across all folds (p=0.012, paired t-test).

[CV-216] Temporal Difference Learning for Diffusion Models

Link: https://arxiv.org/abs/2606.15048
Authors: Qizhen Ying,Yangchen Pan,Victor Adrian Prisacariu,Junfeng Wen
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 4 figures. Accepted at ICML 2026

Abstract:Diffusion models are typically trained with objectives that focus on local denoising targets at individual time steps (or adjacent pairs), which do not enforce consistency between predictions along the denoising trajectory. This lack of cross-time consistency can degrade performance, especially for few-step samplers. We introduce a temporal difference (TD) objective that penalizes inconsistency of the model’s multi-step progress along the denoising path. By reformulating the diffusion process as a Markov reward process and casting denoising as a policy evaluation problem in reinforcement learning, we derive a unified TD approach that applies to both discrete- and continuous-time diffusion formulations. We further propose a principled sample-based reweighting method that stabilizes training. Empirically, we show that using our TD training can significantly improve sample quality measured by FID, with stronger advantages when the number of sampling steps is small, highlighting its practical utility under low-computation-budget scenarios. We provide ablation studies to justify our design choices, including pairwise loss reweighting, regularization weight, and one-step stride. Overall, our TD approach can be a general drop-in that enforces cross-time consistency and improves generation quality across different diffusion generative models.
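
As background for the policy-evaluation framing in this abstract, here is a minimal sketch of tabular TD(0) on a toy deterministic Markov reward process. It is not the paper's diffusion objective: the chain, the reward values, and the function name `td0_evaluate` are hypothetical. It only shows the generic mechanism of driving each prediction toward "reward plus next prediction", which is what makes estimates consistent along a trajectory.

```python
def td0_evaluate(rewards, gamma=1.0, alpha=0.5, sweeps=200):
    """Tabular TD(0) on a deterministic chain s_0 -> s_1 -> ... -> terminal.

    rewards[s] is the reward received when leaving state s.
    """
    n = len(rewards)
    values = [0.0] * (n + 1)  # values[n] is the terminal state, kept at 0
    for _ in range(sweeps):
        for s in range(n):
            # TD error: mismatch between the current estimate and the
            # one-step bootstrapped target.
            td_error = rewards[s] + gamma * values[s + 1] - values[s]
            values[s] += alpha * td_error
    return values

rewards = [1.0, 0.5, 0.25, 0.125]  # hypothetical per-step rewards
values = td0_evaluate(rewards)
print(values)
```

After convergence every TD error is near zero, so the value at each step equals the reward plus the value at the next step; the abstract's TD objective enforces an analogous cross-time consistency on denoising predictions.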

[CV-217] Towards Global AI-Driven Cervical Cancer Screening

Link: https://arxiv.org/abs/2606.15019
Authors: Thuy Nuong Tran,Ömer Sümer,Evangelia Christodoulou,Lennart Nauschütte,Simon Kalteis,Martin Paulikat,Esmira Pashayeva,Klara Steinheuer,Isabella Borges,Piotr Kalinowski,Hermann Bussmann,Sieng Sokmney,Poeung Kuong,Sathiarany Vong,Achim Schneider,Magnus von Knebel-Doeberitz,Patrick Godau,Lena Maier-Hein
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 9 figures

Abstract:The global elimination of cervical cancer is a key public health goal set by the World Health Organization (WHO), with screening programs reducing mortality by up to 80%. However, access to experts and biopsy services is limited in low- to middle-income countries (LMICs). Deep learning (DL)-based algorithms offer promising support for screening, but most existing approaches have been developed and validated on private datasets from single countries. We present the first DL-based approach to cervical cancer screening validated on data from multiple countries. Technically, we phrase the problem of detecting and classifying lesions in colposcopy images as a multi-task learning problem, in which we simultaneously perform image-level classification and lesion segmentation. Our model was trained on a private data set of acid stain colposcopy images with manually generated lesion segmentation masks and corresponding histopathological results, employing extensive data augmentation to address image variability. In an in-distribution validation with pathology results serving as ground truth, our algorithm outperformed medical experts (Balanced Accuracy: 0.68 vs 0.64) in CIN1- (Cervical intraepithelial neoplasia grade 1 or lower) versus CIN2+ (grade 2 or higher) classification. External validation on four colposcopy data sets from four countries featuring radical differences in prevalence and patient characteristics yielded superior performance of our method compared to baseline methods. Performance variability across countries was high, with AUC values ranging from 0.54 to 0.80. Overall, algorithm performance varied with age, transformation zone (cervical area most prone to lesion development), presence of comorbidities and pathognomonic signs, with comorbidities having by far the largest negative effect. Future work should focus on improving model robustness and generalizability.

[CV-218] NEXUS: Neural Energy Fields for Physically Consistent Contact-Rich 3D Object Dynamics

Link: https://arxiv.org/abs/2606.15015
Authors: Qizhen Ying,Guangming Wang,Yangchen Pan,Victor Adrian Prisacariu,Yixiong Jing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages, 4 figures, 6 tables. Preprint

Abstract:Physics-grounded video generation requires controllable 3D object dynamics that remain physically consistent under contact, deformation, and external forcing. Existing trajectory-based methods often model isolated physical effects, making it difficult to compose conservative and non-conservative dynamics in contact-rich 3D scenes. We present NEXUS, a neural energy-field framework for contact-rich 3D object dynamics. NEXUS represents each object as a structural graph and constructs dynamic object-object and object-environment contact graphs. Inspired by Hamiltonian Neural Networks, NEXUS formulates motion through scalar energy and dissipation terms rather than directly predicting states or accelerations. Conservative effects, including gravity and elastic deformation, are composed as additive energy terms, while non-conservative effects such as damping and impact-induced energy loss are modeled with learned Rayleigh-style dissipation. Forces are derived by differentiating the energy and dissipation functions and rolled out with a multi-substep semi-implicit integrator. Across controlled trajectory benchmarks, NEXUS improves long-horizon accuracy over representative learned and physics-structured dynamics baselines under varying mechanical properties and physical-effect compositions. We further show that NEXUS trajectories provide effective guidance for contact-rich video generation, improving physical plausibility while maintaining competitive visual quality.
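
The "forces from scalar energy and dissipation terms" idea in this abstract can be sketched on a one-dimensional mass-spring-damper. This is a hand-written toy, not NEXUS: the quadratic potential, the Rayleigh function, the finite-difference gradient, and all names and constants below are assumptions standing in for the learned terms. The force is -dV/dx - dR/dv, rolled out with a multi-substep semi-implicit Euler integrator.

```python
def central_diff(f, z, eps=1e-6):
    """Numerical derivative of a scalar function (stand-in for autodiff)."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

def rollout(x, v, potential, dissipation, mass=1.0, dt=0.01, steps=500, substeps=4):
    """Semi-implicit Euler: update velocity first, then position with the new velocity."""
    h = dt / substeps
    for _ in range(steps):
        for _ in range(substeps):
            # Conservative force from the potential, dissipative force from
            # the Rayleigh function.
            force = -central_diff(potential, x) - central_diff(dissipation, v)
            v = v + h * force / mass
            x = x + h * v
    return x, v

K, C = 10.0, 0.5                       # hypothetical stiffness and damping
spring = lambda x: 0.5 * K * x * x     # elastic potential energy V(x)
rayleigh = lambda v: 0.5 * C * v * v   # Rayleigh dissipation function R(v)
no_loss = lambda v: 0.0                # purely conservative case

def mechanical_energy(x, v, mass=1.0):
    return spring(x) + 0.5 * mass * v * v

e_start = mechanical_energy(1.0, 0.0)
e_damped = mechanical_energy(*rollout(1.0, 0.0, spring, rayleigh))
e_free = mechanical_energy(*rollout(1.0, 0.0, spring, no_loss))
print(e_start, e_damped, e_free)
```

Because conservative and dissipative effects enter as separate scalar functions, adding another effect (for example gravity) only means adding another term to the potential, which is the compositional property the abstract emphasizes.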

[CV-219] ReGenHuman: Re-Generating Human Appearances for Realistic Full-Body Video Anonymization

Link: https://arxiv.org/abs/2606.14972
Authors: Adam Sun,Eshaan Barkataki,Arnold Milstein,Gordon Wetzstein,Ehsan Adeli
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Anonymizing human-centric video data is an understudied problem. Prior anonymization techniques either blur or redact pixels at the cost of realism and downstream utility, or generate frame-by-frame at the cost of temporal coherence. We introduce ReGenHuman, the first full-body video anonymization pipeline that is simultaneously realistic, temporally consistent, and anonymous by construction. Contrary to past approaches which redact or edit the inputs directly, we propose a regenerate, don’t edit paradigm. Our approach composites 2D pose, segmentation, and monocular depth into two complementary conditioning streams - StructAll and StructHuman, which are used to fine-tune a video-to-video diffusion backbone on in-the-wild human videos, synthesizing the human regions entirely from identity-free structural cues. We evaluate our model on privacy, quality, and utility, and show that our ReGenHuman achieves the best tradeoff across all three axes against current baselines. We further show that our anonymized videos remain effective for downstream tasks, including video question answering.

[CV-220] Multi-Modal Attention for Automated Disaster Damage Assessment Using Remote Sensing Imagery and Deep Learning

Link: https://arxiv.org/abs/2606.14963
Authors: Tewodros Syum Gebre,Jagrati Talreja,Leila Hashemi-Beni
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted for publication in ISPRS Congress 2026 and the 47th Canadian Symposium on Remote Sensing (CSRS 2026) Annals

Abstract:Timely and accurate disaster damage assessment is crucial for effective emergency response, resource allocation, and recovery. Traditional methods, which often rely on manual inspections or sparse data, are typically slow and error-prone. This paper introduces a novel framework leveraging remote sensing imagery and deep learning to automate building damage classification. Using pre- and post-disaster satellite imagery, our model categorizes buildings into four damage levels: no damage, minor damage, major damage, and destroyed. The core innovation is a multi-modal attention mechanism that fuses bi-temporal features to explicitly detect and assess structural changes. We employ a lightweight ConvNeXT-Tiny backbone to ensure efficient processing without compromising performance. Key contributions include: (1) a cross-attention module for multi-modal data fusion, (2) an optimized preprocessing pipeline for large-scale datasets, and (3) robust data augmentation techniques. Experiments on a large-scale disaster dataset demonstrate an overall classification accuracy of 94.90%. The model effectively discriminates between damage categories and remains resilient to incomplete data. This system significantly improves assessment speed and accuracy, aiding emergency responders in prioritizing interventions. This work advances automated disaster damage detection by integrating multi-temporal imagery with deep learning, offering a scalable solution for real-time response.

[CV-221] Learning Sparse Latent Predictive Foundation Model for Multimodal Neuroimaging

Link: https://arxiv.org/abs/2606.14957
Authors: Haoxu Huang,Long Chen,Jingyun Chen,Jinu Hyun,James Ryan Loftus,Kara Melmed,Daniel Orringer,Jennifer Frontera,Seena Dehkharghani,Arjun Masurkar,Narges Razavian
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review Preprint

Abstract:Brain MRIs are routinely acquired as multiple complementary sequences with unique contrast weighting, including T1-weighted (T1w) anatomic and fluid-sensitive T2-weighted (T2w) contrasts. However, methods for learning unified representations across the multitude of MRI contrast mechanisms at health-system scale are lacking. In this study, we introduce Neuro-JEPA, a sparse multimodal neuroimaging foundation model that combines a latent predictive objective with a Mixture-of-Experts architecture to encode brain MRI across core T1w, T2w, and fluid-suppressed FLAIR imaging (FLAIR). We further provide a systematic methodological study of architectural, masking, objective, and sparsity design choices beneficial for robust neuroimaging multimodal representation learning. Neuro-JEPA was pretrained on 1,551,862 scans from 428,647 studies after modality-specific preprocessing with data curation across three core structural brain MRI sequences. We evaluated the learned representations across clinical and research settings, including 25 tasks from three health systems: NYU Langone, NYU Long Island, and Massachusetts General Hospital, and 22 tasks from 12 public datasets, covering unimodal, multimodal and cross-domain evaluation configurations. Across these benchmarks, existing neuroimaging foundation models showed inconsistent gains over a simple convolutional neural network (CNN) baseline, whereas Neuro-JEPA achieved stronger and more consistent performance across all evaluated settings. These results establish a scalable methodological framework for multimodal neuroimaging representation learning and highlight the need for foundation model evaluation protocols that include simple baselines, clinically heterogeneous cohorts and controlled multimodal comparisons.

[CV-222] FlexPooling with Simple Auxiliary Classifiers in Deep Networks

Link: https://arxiv.org/abs/2606.14926
Authors: Muhammad Ali,Omar Alsuwaidi,Salman Khan (Department of Computer Vision, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In computer vision, the basic pipeline of most convolutional neural networks consists of multiple feature extraction layers, where the input signal is downsampled to a lower resolution in each subsequent layer. This downsampling process is commonly referred to as pooling, which is an essential operation in CNNs. Pooling improves robustness against transformations, reduces the number of trainable parameters, increases the receptive field, and lowers computation time. Since pooling is a lossy process but remains important for extracting high-level information from low-level representations, it is important to preserve the most prominent information from previous activations to improve network discriminability. Standard pooling is usually performed using dense pooling methods, such as max pooling or average pooling, or through strided convolutional kernels. In this paper, we propose a simple yet effective adaptive pooling method, called FlexPooling, which generalizes average pooling by learning a weighted average over activations jointly with the rest of the network. We further show that attaching Simple Auxiliary Classifiers (SAC) to the CNN improves performance and demonstrates the effectiveness of the proposed method compared with standard pooling methods. Experiments on multiple popular image classification datasets show that FlexPooling consistently outperforms baseline networks, achieving approximately 1 to 3 percent improvement in accuracy.
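
The idea of a learned weighted average that generalizes average pooling can be sketched as follows. This is a minimal illustration, not the authors' implementation: the softmax parameterization, the non-overlapping `k x k` windows, and the names `weighted_pool2d` and `logits` are all assumptions, since the abstract only states that the weights are learned jointly with the network.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_pool2d(feature_map, logits, k=2):
    """Pool non-overlapping k x k windows with a weighted average.

    feature_map: list of rows (H x W), with H and W divisible by k.
    logits: k*k hypothetical learnable parameters shared by all windows
            (an assumption; in training they would be updated by backprop).
    """
    weights = softmax(logits)  # non-negative weights that sum to 1
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, h, k):
        row = []
        for j in range(0, w, k):
            # Flatten the window in row-major order to align with the weights.
            window = [feature_map[i + di][j + dj]
                      for di in range(k) for dj in range(k)]
            row.append(sum(wt * v for wt, v in zip(weights, window)))
        pooled.append(row)
    return pooled
```

With all logits equal, the weights are uniform and the operation reduces to average pooling; a strongly peaked logit makes the output follow a single position in each window.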

[CV-223] Mask Proposal Voting Based on Geodesic Framework for Robust Image Segmentation

Link: https://arxiv.org/abs/2606.14912
Authors: Li Liu,Mingzhu Wang,Zhenjiang Li,Da Chen,Laurent D. Cohen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Despite great advances, finding accurate segmentation remains a challenging task, especially in scenarios with cluttered backgrounds, complex intensity variations and topology appearance. Minimal path models have exhibited their strong ability in addressing image segmentation tasks. However, the performance of minimal paths-based segmentation approaches is heavily influenced by model initialization, hence limiting their application scope in practice. In this work, we propose a novel mask proposal voting framework that overcomes the major drawback of classical approaches, allowing robust segmentation even in complicated scenarios. Firstly, we introduce an efficient method for constructing adaptive domain cuts as a constraint for initializing the region-based min-cut evolution, by which diverse and reliable mask proposal candidates can be generated, substantially increasing the possibility of accurately covering the objective region by these proposals. Secondly, we propose a new mask voting scheme to build a voting score map encoding the final segmentation information. In contrast to classical path voting methods, our model allows incorporating priors to assign different importance to each individual mask. As a consequence, the proposed segmentation model is capable of accurately delineating object boundaries under complex scenarios, and is insensitive to initialization. Experiments demonstrate that our method consistently outperforms state-of-the-art minimal path-based approaches in both accuracy and robustness.

[CV-224] Deep Learning in Seismic Interpretation: Federated Advances in Salt Dome Segmentation

Link: https://arxiv.org/abs/2606.14905
Authors: Muhammad Zain Mehdi,Muhammad Zaid,Owais Aleem
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 8 figures

Abstract:Salt-dome delineation is a critical, high-impact task in subsurface geological interpretation, driving decisions in hydrocarbon exploration, reservoir modeling, and drilling safety. While convolutional encoder-decoder architectures have delivered significant improvements in automated salt segmentation, their widespread application is severely limited by data sovereignty concerns, dataset bias, and the scarcity of labeled seismic volumes. This paper introduces FedSaltNet, a Federated Learning (FL) framework explicitly engineered for robust, generalizable, and privacy-preserving salt-dome segmentation. We couple a lightweight Small U-Net backbone, chosen for its efficiency and regularization properties, with a novel Foreground-Weighted (FG-WEIGHTED) aggregation strategy designed to tackle domain-specific class imbalance. Through an extensive comparative study emulating non-IID conditions across four diverse seismic datasets (TGS, SEAM, F3, GBS), we demonstrate two critical findings: (1) the FG-WEIGHTED algorithm effectively mitigates data heterogeneity, yielding a 4.0% relative improvement in Intersection over Union (IoU) over the best conventional FL method; and (2) the simple U-Net architecture proved essential, outperforming the higher-capacity ResNet-18 U-Net variant by 166% in average IoU, underscoring the necessity of architectural simplicity in data-constrained federated environments. FedSaltNet provides a validated, high-performance solution that establishes the viability of federated deep learning for collaborative, next-generation subsurface interpretation.
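
The abstract does not define the FG-WEIGHTED rule, so the sketch below only illustrates one plausible reading. It assumes each client's model update is weighted by the number of foreground (salt) pixels in its local data, where FedAvg would weight by sample count. The function name and both arguments are hypothetical.

```python
def fg_weighted_average(client_params, fg_pixel_counts):
    """Aggregate client parameter vectors, weighting each client by its
    foreground pixel count (an assumed weighting, not confirmed by the paper).

    client_params: list of equal-length lists of floats, one per client.
    fg_pixel_counts: non-negative foreground pixel totals, one per client.
    """
    total = sum(fg_pixel_counts)
    if total <= 0:
        raise ValueError("at least one client must report foreground pixels")
    n_params = len(client_params[0])
    return [
        sum(count * params[i]
            for count, params in zip(fg_pixel_counts, client_params)) / total
        for i in range(n_params)
    ]
```

Under this assumption, a client whose volumes contain more salt contributes more to the global model, and equal counts recover a plain mean.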

[CV-225] Improved Knowledge Distillation for Land-Use Image Classification

Link: https://arxiv.org/abs/2606.14886
Authors: Arundhuti Sur,Abhiroop Chatterjee,Susmita Ghosh,Emmett Ientilucci
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by IGARSS 2026

Abstract:This article proposes an improved Knowledge Distillation (KD) framework for efficiently compressing deep convolutional neural networks for land-use image classification. Motivated by the need to achieve competitive classification accuracy while reducing computational complexity, a teacher-student learning paradigm is adopted in which a VGG16 network transfers knowledge to a lightweight MobileNetV2 model. The proposed framework integrates hard supervision from ground truth labels with a soft supervision strategy that combines Kullback-Leibler divergence and Cosine Similarity losses. Experiments conducted on three land-use datasets show that the proposed KD-based method yields improved performance, and achieves an accuracy of 99.04%, outperforming both baseline student training and single-loss distillation approaches, while retaining substantial model compression.
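
A loss that mixes hard-label supervision with a soft term built from Kullback-Leibler divergence and cosine similarity can be sketched as below. The mixing weights `alpha` and `beta`, the `temperature`, and the choice to compute the cosine term on softened probabilities are assumptions made for illustration; the abstract does not give the exact formulation.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled, numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, label):
    """Hard-label loss against the ground-truth class index."""
    return -math.log(softmax(student_logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cosine_loss(u, v):
    """1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def kd_loss(student_logits, teacher_logits, label,
            temperature=4.0, alpha=0.5, beta=0.5):
    """alpha * hard loss + (1 - alpha) * soft loss, where the soft loss
    mixes KL and cosine terms with weight beta (hypothetical weights)."""
    hard = cross_entropy(student_logits, label)
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = (beta * kl_divergence(p_teacher, p_student)
            + (1.0 - beta) * cosine_loss(p_student, p_teacher))
    return alpha * hard + (1.0 - alpha) * soft
```

When the student reproduces the teacher's logits exactly, both soft terms vanish and only the weighted hard-label loss remains.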

[CV-226] Understanding Cross-Modal Contributions in Continual Vision-Language Models: A Theoretical Perspective

Link: https://arxiv.org/abs/2606.14883
Authors: Salimeh Sekeh,Mary Wisell
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Continual vision-language models are commonly addressed through sequential fine-tuning; however, although this paradigm enables adaptation to new environments (tasks), it inherently emphasizes the contribution of previously learned environments (tasks) at the expense of the stability required to preserve previously acquired knowledge. While existing approaches have adequately studied continual learning and catastrophic forgetting in vision-language models (VLMs), the theoretical understanding of modality-specific contributions across a sequence of environments remains largely unexplored. In this paper, we present a new theoretical perspective to understand the cross-modal (vision-language) contributions to consecutive environments. We empirically evaluate our theoretical findings on large VLMs and demonstrate their effectiveness in capturing environment-level cross-modal contributions. Our analysis provides deeper insights into continual VLMs, highlighting their contribution robustness to varying task orders and inter-task similarities, and their improved generalization performance.

[CV-227] VANDERER: Map-Free Exploration using Future-Aware and Visual-Curiosity-Guided Diffusion Policy

Link: https://arxiv.org/abs/2606.14879
Authors: Venkata Naren Devarakonda,Raktim Gautam Goswami,Prashanth Krishnamurthy,Farshad Khorrami
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Mobile agents require efficient exploration strategies to map unseen environments and autonomously plan tasks. Traditional methods rely on generating occupancy maps and optimizing the sequence in which unexplored regions are visited. However, in sensor-constrained settings, such as those limited to monocular cameras, generating accurate occupancy maps is challenging. To address this, we propose VANDERER, an exploration framework that leverages a Visual Curiosity Module (VCM) to guide pre-trained diffusion policies using only monocular image data. This curiosity module predicts the outcomes of proposed actions via a navigation world model and evaluates them through a curiosity cost. The cost then guides the diffusion process toward generating actions that maximize exploration. Evaluated across diverse simulated environments, VANDERER consistently outperforms established baselines, exploring an average of 13.4% more area than NoMaD. Our results reveal a direct correlation between visual and geometric curiosity in outdoor environments, demonstrating that VANDERER can effectively leverage this relationship for efficient exploration using sensor-constrained agents.

[CV-228] An Ensemble Deep Learning Approach for Reliable and Scalable Lemon Leaf Disease Classification

Link: https://arxiv.org/abs/2606.14871
Authors: Shayan Abrar,Sudeepta Mandal,Abdul Awal Yasir,Sonjoy Bhattacharjee,Sadman Haque Bhuiyan,Samanta Ghosh,Rafi Ahamed
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 5 pages, 12 figures, 3 Tables, Presented at 18th IEEE International Conference on Computational Intelligence and Communication Networks (CICN) 2026

Abstract:Early detection of plant diseases is crucial for both plants and farmers. Plant diseases reduce fruit yield and quality, and infected plants are more susceptible to other stresses. The lemon leaf disease dataset contains 1354 images in 9 classes: one class for healthy leaves and 8 classes for leaf diseases. The dataset was split into training (70%), testing (15%), and validation (15%) sets after comprehensive preprocessing. Two pretrained models (InceptionV3 and MobileNetV2) were applied and then combined using an ensemble technique to boost robustness. The ensemble model achieved 99.27% accuracy. Adversarial training is applied to improve the models’ ability to make reliable predictions under noisy data. Grad-CAM visualization highlights the important regions of leaf images, helping to validate the model predictions and their confidence levels.

[CV-229] Multi-HMR 2: Multi-Person Camera-Centric Human Detection Mesh Recovery and Tracking

Link: https://arxiv.org/abs/2606.14841
Authors: Guénolé Fiche,Philippe Weinzaepfel,Romain Brégier,Fabien Baradel
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Most advances in human mesh recovery (HMR) have focused on pelvis-centered recovery, overlooking metric 3D localization and detection accuracy in the camera coordinate system - two key factors for real-world applications such as human-robot interaction and social scene understanding. Current evaluation protocols often ignore these aspects, emphasizing per-person, root-centered recovery rather than camera-space perception. As a result, existing approaches rely on fixed camera assumptions or handcrafted post-processing, limiting their robustness and practical deployment. We introduce Multi-HMR 2, a simple yet robust DETR-based framework for Multi-person Camera-centric Human detection, mesh Recovery, and tracking. Multi-HMR 2 predicts a scene-consistent camera together with human meshes, enabling metric 3D localization without ground-truth intrinsics. Moreover, by distilling image-based memory features from SAM2, Multi-HMR 2 extends to tracking, achieving consistent identity association without video supervision. Despite its conceptual simplicity - no handcrafted components, no video input, and no ground-truth cameras - Multi-HMR 2 achieves state-of-the-art pelvis-centered performance while substantially improving detection accuracy and metric 3D localization.

[CV-230] S23DR 2026: End-to-End 3D Wireframe Prediction via DETR-Style Set Prediction with Contrastive Denoising

Link: https://arxiv.org/abs/2606.14811
Authors: Nitiz Khanal
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical report; S23DR 2026 Challenge submission

Abstract:We present WireframeDETR, our submission to the Structured Semantic 3D Reconstruction (S23DR) 2026 Challenge, which requires predicting a 3D building wireframe from multi-view COLMAP point clouds. Our method applies DETR-style set prediction directly to 3D point clouds, producing wireframes as sets of edge coordinate pairs without any intermediate vertex detection stage. We introduce three technical contributions: (1) contrastive denoising training that stabilises noisy Hungarian matching in early epochs; (2) a multi-scale encoder that aggregates the last encoder layer outputs via learned scalar weights; and (3) progressive auxiliary loss weighting that concentrates gradient signal on the decoder layers that most benefit from it. Our model achieves a public test HSS of 0.575 (F1 = 0.664, IoU = 0.516) and a best validation HSS of 0.534 on the cleaned val split.

[CV-231] HSQ-VLM: A Novel Spatially-Constrained Quadrant Segmentation VLM Model for Explainability in Diabetic Retinopathy

Link: https://arxiv.org/abs/2606.14803
Authors: Shivum Telang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diabetic Retinopathy (DR) is an aggressive retinal disease and a leading cause of global blindness, yet its clinical management is currently hindered by the black-box nature of diagnostic AI. While deep learning models achieve high classification accuracy, there is a critical lack of explainability methods capable of detailing the exact anatomical landmarks and lesion distributions that lead to a clinical decision for DR. Therefore, we propose HSQ-VLM, a novel quadrant segmentation pipeline on fundus images that utilizes a Landmark-Anchored Cartesian Cross-Attention mechanism to unify visual feature extraction with structured clinical reasoning. Unlike traditional methods that rely on arbitrary image partitioning, our pipeline implements 4-quadrant Topological Latent Partitioning (TLP) to dynamically align retinal features with a fovea-centered coordinate system. This allows the Vision-Language Model to generate natural language reports that quantify pathology with anatomical precision. On a dataset of 3,500 high-resolution fundus images, this innovative methodology achieved a lesion detection sensitivity of 99.6% for hemorrhages and 96.4% for microaneurysms, while demonstrating a significant reduction in boundary-ambiguity errors compared to standard segmentation baselines.

[CV-232] Position: The Systemic Lack of Agency in Visual Reasoning ICML2026

Link: https://arxiv.org/abs/2606.14795
Authors: Yizhao Huang,Haoyang Chen,Shiqin Wang,Pohsun Huang,Jiayuan Li,Haoyuan Du,Yandong Shi,Zheng Wang,Zhixiang Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML 2026

Abstract:This paper argues that a systemic lack of Agency constrains the implicit reasoning capabilities of current Vision-Language Models (VLMs). Implicit reasoning refers to the ability to autonomously discover and utilize hidden visual evidence to bridge information gaps, rather than merely relying on explicitly specified targets. This capacity underlies human visual understanding and everyday reasoning. We argue that this limitation arises from a tendency to approach visual reasoning primarily as passive semantic retrieval, rather than as active, situated reasoning that depends on autonomous visual exploration. As a result, most existing benchmarks primarily assess Passive Capacity, leaving this aspect of reasoning largely unmeasured. To address this gap, we introduce the Visual Implicit Reasoning Diagnosing Benchmark (V-IRD), which targets this missing quadrant by requiring models to derive answers strictly through autonomous visual analysis. Our results show that, despite strong retrieval abilities, prominent VLMs struggle to utilize reference objects and to attend to visual evidence that requires self-directed inquiry. Simply put, strong semantic recognition does not equate to active visual exploration, revealing a critical gap in current VLMs. More information can be found at this https URL

[CV-233] Efficient Reinforcement for Visual-Textual Thinking with Discrete Diffusion Model

Link: https://arxiv.org/abs/2606.14792
Authors: Yoonjeon Kim,Yuhta Takida,Chieh-Hsin Lai,Eunho Yang,Yuki Mitsufuji
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:RL-based post-training has been widely adopted to enable interleaved visual and textual reasoning in unified multimodal models capable of both text and image generation. However, most existing approaches are built upon autoregressive (AR) unified models, which require full image regeneration during visual reasoning. In this work, we demonstrate that multimodal discrete diffusion models are effective alternatives to AR models for reinforcement learning in interleaved reasoning, owing to their ability to perform efficient visual rollouts via localized visual editing rather than full image-token regeneration. This reduces rollout computation during GRPO by 26.9% compared to AR baselines, with minimal performance drop. Despite the improved efficiency, we find that joint reward assignment, which employs a shared reward signal across modalities, introduces cross-modal interference between unrelated image and text token sequences during RL updates. To address this issue, we propose factorized reward assignment, a strategy that assigns rewards independently to text and vision segments. With factorized reward assignment, our RL approach achieves an 11.2% improvement over joint reward assignment and a 38.04% improvement over the base model.
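
The contrast between joint and factorized reward assignment can be illustrated with group-relative advantages of the kind GRPO uses. This is a toy sketch under stated assumptions: each rollout is assumed to receive separate scalar text and image rewards, the joint variant is assumed to sum them, and all function names are hypothetical rather than taken from the paper.

```python
import statistics


def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one rollout group (GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def joint_assignment(text_rewards, image_rewards):
    """One shared advantage per rollout, applied to both modalities."""
    shared = group_advantages(
        [t + v for t, v in zip(text_rewards, image_rewards)])
    return [{"text": a, "image": a} for a in shared]


def factorized_assignment(text_rewards, image_rewards):
    """Independent advantages: text tokens see only the text reward,
    image tokens see only the image reward."""
    text_adv = group_advantages(text_rewards)
    image_adv = group_advantages(image_rewards)
    return [{"text": t, "image": v} for t, v in zip(text_adv, image_adv)]
```

For two rollouts with identical text rewards but different image rewards, the joint variant penalizes the text tokens of the rollout with the worse image, while the factorized variant leaves the text advantages at zero and confines the signal to the image tokens.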

[CV-234] Vision-Encoder Behavioral Fingerprints of Image-to-Image Generative Models: A Training-Paradigm-Driven Taxonomy of Six Commercial APIs

Link: https://arxiv.org/abs/2606.14787
Authors: Hunter Hill
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments:

Abstract:We study six production image-to-image AI systems (gpt-image-1, Gemini 2.5 Flash Image, Flux Kontext, SDXL img2img, SD3 img2img, and Qwen Image Edit) under a content-adaptive sub-JND adversarial perturbation pipeline, scoring all outputs by frozen DINOv2 ViT-B/14 token distances against clean references. Across a 3,588-call corpus spanning COCO photographs, CelebA-HQ portraits, and AI-generated inputs, the six systems partition into two image-invariant behavioral bands on a 2D (patch_mean, ssim_clean) plane: edit-trained models (Flux Kontext, Qwen Edit, Gemini) cluster in a tight band, while T2I-base models adapted at sampling time (SDXL, SD3, gpt-image-1) cluster in a drift band.

[CV-235] MatchLM2Lite: A Scalable MLLM-to-Lite Framework for Reproduced Content Identification

Link: https://arxiv.org/abs/2606.14786
Authors: Xiaotian Fan,Hiok Hian Ong,David Yuchen Wang,Zirui Zhu,Kanchan Sarkar,Kun Xu
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Content moderation is critical for online video platforms to ensure content safety, protect creators, and sustain positive user experiences. Beyond filtering harmful content, platforms must guarantee content authenticity at scale so that users are exposed to diverse, original videos rather than low-value reproductions. We present MatchLM2Lite, a real-time, production-grade reproduced content identification (RCI) system that leverages the powerful understanding of a multimodal large language model (MLLM) distilled into a small and fast-inference model. Our system jointly models video, audio, and text signals, operating on pairs of videos to produce fine-grained reproduction scores. The system comprises two modules, MatchLM and MatchLite, and a two-stage training recipe. First, our high-capacity MLLM, MatchLM, serves as a teacher model to define the upper bound of RCI performance. Its capabilities are then distilled into a compact student model, MatchLite. This design allows MatchLite to deliver low-latency, high-throughput inference on video pairs while preserving much of MatchLM’s accuracy, making it suitable for integration into real-time recommendation systems. MatchLM achieves an F1-score improvement of +8.57 compared to our previous production model. After knowledge distillation, MatchLite retains a +6.55 gain in F1-score while reducing computational cost by 35x. Deployed at scale, MatchLM2Lite enables efficient, pairwise multimodal RCI, stably serving online traffic at high queries per second (QPS) with an end-to-end latency below 30 seconds. This system has reduced the reproduced video view rate on our platform by 2.5% without degrading user engagement, demonstrating its effectiveness in a large-scale production environment.

[CV-236] The Vision Encoder as a Privacy Boundary: Visual-Token Side Channels in Encoder-Free Vision-Language Models

Link: https://arxiv.org/abs/2606.14783
Authors: Chenyu Zhou,Qiliang Jiang,Shuning Wu,Xu Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments:

Abstract:A vision encoder compresses image pixels into semantic embeddings, implicitly acting as a privacy boundary by preserving semantic content while attenuating pixel-local detail required for exact text recovery. Encoder-free vision-language models (VLMs) remove this boundary by routing image patches directly into the language-model token stream, thereby exposing an architectural privacy attack surface: intermediate visual tokens become a pre-output side channel. Under a token-access adversary, decoders invert visual-token streams from two encoder-free VLMs, Gemma4 and Fuyu, recovering recognizable image structure and readable held-out access codes, whereas matched encoder-based controls localize target regions but recover no exact strings. Within-model ablations show that the operative factor is spatial sampling fidelity of the visual-token grid, especially character-direction sampling density, rather than token or value count. The leakage is not limited to exported tokens: Gemma4 layer-0 key-value cache tensors are directly invertible, placing the side channel within KV caches commonly persisted by production serving stacks for decoding efficiency. The attack survives clutter, realistic document degradation, and zero-shot transfer to public document images, and it resists value-level defenses such as additive noise and quantization. Effective mitigation must therefore reduce spatial sampling, making removal of the vision encoder a first-class privacy decision in VLM deployment.

[CV-237] Variational Deep Unfolding with Mamba-Based Nonlocal Modeling for Underwater Image Enhancement

Link: https://arxiv.org/abs/2606.14781
Authors: Daniel Torres,Julia Navarro,Catalina Sbert,Joan Duran
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Underwater imaging plays a crucial role in ocean engineering, although captured data often suffer from poor visibility and color distortion. To address these challenges, we propose a model-based deep unfolding network for underwater image enhancement that integrates variational modeling into a learnable architecture. The framework is guided by a variational formulation based on a dehazing decomposition, incorporating a multiplicative residual component to absorb remaining artifacts and a nonlocal gradient-type constraint to preserve structural details and enhance edge sharpness. We provide a theoretical analysis establishing the existence of a solution for the associated minimization problem. The proposed unfolding method incorporates Mamba layers to efficiently capture self-similarities in the scene. In addition, we introduce a proximal trajectory loss that enforces consistency between the unfolding stages and the iterations of an ideal restoration regularizer. Experimental results demonstrate that the proposed unfolding approach achieves improved visual quality and competitive quantitative performance compared with recent state-of-the-art methods. The source code will be available at this https URL.

[CV-238] YTClickbait21K: Human-Annotated Multimodal Dataset for YouTube Clickbait Detection Across Diverse Channels and Content Categories

Link: https://arxiv.org/abs/2606.14780
Authors: Md. Minhazul Islam,Md. Tanbeer Jubaer,Amith Khandakar,Shovon Sarker,Sumaiya Rahman,Md. Masum Mia,Mohamed Arselene Ayari,Hamed Noori
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Clickbait content on video-sharing platforms poses a significant challenge to information reliability, yet progress in automated detection has been constrained by the lack of large-scale, high-quality multimodal datasets. We present YTClickbait21K, a human-annotated YouTube clickbait dataset comprising 21,238 videos collected from 40 channels across 29 countries, covering diverse content categories such as news, entertainment, education, and gaming. Each sample includes structured metadata (title, description, engagement statistics) along with associated thumbnail images, enabling comprehensive multimodal analysis. To ensure annotation quality, every video was independently labeled by three annotators using a standardized decision framework that incorporates textual, visual, and cross-modal consistency cues, with final labels determined through majority voting. The dataset exhibits substantial inter-annotator agreement (κ = 0.65), confirming reliable labeling despite the inherent subjectivity of clickbait detection. By combining scale, annotation rigor, and multimodal richness, this dataset provides a robust benchmark for developing and evaluating machine learning models, facilitating research in cross-modal semantic understanding, and advancing automated content moderation systems.
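
The labeling procedure described here, three independent annotators followed by majority voting and an agreement statistic, can be sketched as follows. The abstract does not say which agreement coefficient was used; the code assumes Fleiss' kappa, a common choice for a fixed number of raters, and the binary labels in the usage are illustrative.

```python
from collections import Counter


def majority_label(votes):
    """Return the most frequent label among one item's votes."""
    return Counter(votes).most_common(1)[0][0]


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items that all have the same number of raters.

    ratings: list of per-item label lists, e.g. [[1, 1, 0], [0, 0, 0]].
    categories: list of all possible labels.
    Undefined (division by zero) if every rating falls in one category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [[item.count(c) for c in categories] for item in ratings]
    # Observed agreement: fraction of agreeing rater pairs per item.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_exp = sum(p * p for p in p_cat)
    return (p_bar - p_exp) / (1.0 - p_exp)
```

With three annotators and two labels a majority always exists, so `majority_label` never has to break a tie in that setting.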

[CV-239] FactCheck: Feasibility-aware Long-term Action Anticipation with Multi-agent Collaboration

Link: https://arxiv.org/abs/2606.14778
Authors: Rui Cao,Jiannong Cao,Bo Yuan,Zhiyuan Wen,Mingjin Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Long-term action anticipation (LTA) aims to predict an ordered sequence of future verb-noun actions from a partially observed video. While this task serves as the foundation for embodied intelligence, anticipating physically feasible long-term actions remains a critical challenge. Existing methods, which operate in an open-loop manner, often hallucinate non-existent objects, violate object affordances, or disregard object states, as they lack explicit mechanisms to verify action feasibility against the physical environment. To address this, we propose FactCheck, a novel multi-agent collaboration framework that improves feasibility through a closed-loop “Observe-Plan-Verify” mechanism. FactCheck decomposes the complex LTA task into specialized roles: an Observer that recognizes historical actions from video observations and constructs a dual-form structured memory, comprising a History Action Abstract that captures high-level human intentions and environmental status, and a History Action Graph that encodes object states and temporal dependencies; a Planner that generates draft future actions conditioned on both low-level historical actions and high-level History Action Abstract; and a Verifier that rigorously validates the draft against the History Action Graph and refines infeasible actions. Extensive experiments on the EPIC-Kitchens-55 and EGTEA Gaze+ benchmarks demonstrate that FactCheck consistently outperforms state-of-the-art methods. Our work establishes a new paradigm for feasibility-aware long-term action anticipation, effectively closing the loop of action recognition, action prediction and action verification.

[CV-240] JoyAI-VL-Interaction: Real-Time Vision-Language Interaction Intelligence

Link: https://arxiv.org/abs/2606.14777
Authors: Dingyu Yao,Junhao Zhou,Chenxu Yang,Chuanyu Qin,Haowen Hou,Zheming Liang,Congcong Wang,Yuhang Cao,Shenglong Ye,Shuai Xie,Shuhuan Gu,Haoyang Huang,Qingyi Si,Nan Duan,Jiaqi Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Many moments in the real world do not wait for a user to ask. A fire starts on a security monitor, an expression flickers across a video call, or a product a viewer wants flashes by in a livestream. Yet today’s large models remain mostly turn-based by design: they answer only when addressed, and even video-call apps that appear interactive still operate as question-answer systems, reacting only when polled or prompted. We argue for a different paradigm: a model that is present in the world like a person. It continuously watches what is happening now, decides on its own whether to speak or stay silent, interacts in real time, and delegates to a background model when the problem is hard. To advance interaction models and their adoption across domains, we make two fully open-sourced contributions. First, we release JoyAI-VL-Interaction, an 8B-scale, vision-first VL-interaction model. The model makes the response decision internally, choosing each second to stay silent, respond, or delegate to a background model, and it excels at vision-triggered responsiveness and time awareness. We pair it with a transferable training recipe, from which capabilities we never trained for emerge, such as guiding a shopper through changing app screens or improvising a lecture from a slide deck. Second, we release a complete, deployable system built around that model. The system streams any ongoing video into the model, making it genuinely present in the world. All other components are pluggable, including ASR/TTS modules, memory, visualization UI, and a background brain that can connect to any API or agent. Across six real-world scenarios, human raters prefer JoyAI-VL-Interaction over the in-app video-call assistants of Doubao and Gemini by a wide margin. To our knowledge, this is the first open, vision-driven interaction model released together with its training recipe, data, and complete deployable system.

[CV-241] Double-Helix Vision (DH-V2): A Geometry-Based Visual Sampler for Bandwidth-Constrained Perception

Link: https://arxiv.org/abs/2606.14773
Authors: Jinwen Wen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 5 pages, 3 figures, 5 tables. Code and benchmarks: this https URL

Abstract:We present Double-Helix Vision (DH), a geometry-based visual sampler that compresses 2D images into compact 1D signals using paired golden-ratio-inspired spiral trajectories. Rather than processing every pixel uniformly, DH employs two phase-shifted helices (Alpha and Beta, offset by 180 degrees) to sample the image with biologically-inspired foveation: high density at the center, sparse coverage at the periphery. At 4K resolution, DH achieves a 1,433x compression ratio (99.93% reduction) while preserving the geometric structure of the scene. The full perception pipeline – including spatial mapping, temporal collision detection, and intra-frame structural disparity estimation – runs in 0.52 ms at 1080p on CPU-only hardware, with no neural network dependencies. On CIFAR-10 at extreme sampling budgets (K=128 points per helix), DH achieves a +6.03% accuracy gain over uniform random sampling. A JSON-serializable Robotics API is provided, delivering sub-millisecond spatial perception reports in 2.7 KB packets. Code and benchmarks are available under the MIT License.
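
As a rough illustration of the sampling idea, the sketch below places points on two golden-angle spirals offset by 180 degrees and reads pixel values along them into 1D lists. It is not the authors' implementation: the linear radius schedule (which makes samples denser near the center), the function names, and the parameters are all assumptions made for this example. For scale, if the reported 1,433x ratio counts pixels on a 3840×2160 frame (8,294,400 pixels), it corresponds to roughly 5,788 retained samples.

```python
import math

# Golden angle in radians (about 2.39996), the usual choice for even angular coverage.
GOLDEN_ANGLE = math.pi * (3.0 - math.sqrt(5.0))


def helix_points(width, height, k, phase=0.0):
    """Return k integer (x, y) sample positions along one spiral.

    The radius grows linearly with the sample index (an assumption), so the
    sample density falls off roughly as 1/r: dense center, sparse periphery.
    """
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    r_max = min(cx, cy)
    points = []
    for i in range(k):
        r = r_max * (i / k)
        theta = i * GOLDEN_ANGLE + phase
        x = int(round(cx + r * math.cos(theta)))
        y = int(round(cy + r * math.sin(theta)))
        points.append((x, y))
    return points


def sample_1d(image, points):
    """Read a 2D image (list of rows) along the given points into a 1D signal."""
    return [image[y][x] for x, y in points]


# Two phase-shifted helices ("Alpha" and "Beta"), offset by 180 degrees.
alpha = helix_points(64, 48, 128)
beta = helix_points(64, 48, 128, phase=math.pi)
```

Because the radius never reaches `r_max`, every sampled coordinate stays inside the image, so no bounds checks are needed in `sample_1d`.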

[CV-242] ScoutVLA: UAV-Centric Active Perception via a Dual-Expert VLA Model for Open-World Embodied Question Answering

Link: https://arxiv.org/abs/2606.14772
Authors: Wenhao Lu,Zhengqiu Zhu,Xiaofeng Wang,Xiaoran Zhang,Yatai Ji,Yong Zhao,Yue Hu,Yingzhen Nie,Jinlong Zhu,Zheng Zhu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Aerial Embodied Question Answering (EQA) requires Unmanned Aerial Vehicles (UAVs) to actively perceive the environment and answer natural language questions. Existing outdoor EQA systems usually stop once the target enters the UAV’s field of view, leaving the fine-grained viewpoint adjustment needed for evidence-seeking questions largely unresolved. To address this issue, we introduce FG-EQA, a fine-grained active perception EQA benchmark with more than 40K simulated trajectories and 1K real-world trajectories. Drawing inspiration from the “waggle dance” of scout bees, which iteratively adjust their flight paths to verify target information, we propose ScoutVLA, an evidence-driven Vision-Language-Action model for outdoor EQA. To emulate this active exploration behavior, ScoutVLA features a decoupled dual-expert architecture: a vision-language expert infers the semantic intent to identify missing evidence, while an independent action expert employs high-DoF flow matching to generate continuous viewpoint-refinement trajectories. To balance the competing demands of continuous control and semantic reasoning, we devise a decoupled training strategy with a knowledge insulation mechanism that prevents the action gradients from erasing the model’s multimodal reasoning ability. Extensive simulated experiments and a qualitative real-world field study both verify the superiority of ScoutVLA over the state-of-the-art baselines, demonstrating a 10.48× higher average strict success rate and a 7.72× higher average QA correctness.

[CV-243] Momentum-Guided Semantic Forecasting (MoFore) for Self-Supervised Video Representation Learning

Link: https://arxiv.org/abs/2606.14765
Authors: Qinwu Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: 13 pages, 5 Figures, and 2 Tables

Abstract:Self-supervised video representation learning has recently advanced through contrastive learning, masked reconstruction, and predictive representation learning. Reconstruction-based approaches such as MAE and VideoMAE learn representations by recovering masked visual content (He et al., 2022; Tong et al., 2022), while contrastive methods such as CLIP learn semantically meaningful embedding spaces through representation alignment (Radford et al., 2021). In this work, we introduce a Momentum-Guided Semantic Forecasting framework (MoFore) for self-supervised video representation learning. Instead of optimizing for pixel-level reconstruction or task-specific semantic alignment, the proposed method learns temporally predictive video representations by forecasting future latent embeddings from temporally distant context clips. To improve robustness across temporal scales, we further introduce randomized temporal-gap forecasting during training. The framework combines predictive latent forecasting with contrastive regularization to encourage temporal consistency while preventing representation collapse. Experiments on the UCF101 dataset demonstrate that the proposed framework learns temporally consistent and semantically meaningful video representations without using action labels during training. Quantitative analysis shows strong temporal stability and emergent category-level structure in the learned embedding space, while qualitative retrieval experiments reveal motion-aware organization across related activities. Overall, the results suggest that long-range latent forecasting provides an effective and computationally efficient approach for self-supervised video representation learning without relying on reconstruction-based objectives.

[CV-244] Avoiding Exponential Blow-Up in Distributive Lattice Submodular Minimization

Link: https://arxiv.org/abs/2606.14764
Authors: Ishant Shanu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Discrete Mathematics (cs.DM)
Comments:

Abstract:Submodular function minimization has attracted considerable interest in recent years and is widely applicable in computer vision and machine learning. Such applications often require working with submodular functions defined on a distributive lattice. The current best approach is a transformation that extends the submodular function to the corresponding Boolean lattice. This makes optimization inefficient because it enlarges the working space: the expanded space contains an additional number of elements that is exponential in the set size. We propose a generic framework for distributive lattices that works entirely within the distributive lattice. Our framework allows established submodular function minimization algorithms for the Boolean lattice to be reused. Our experiments show a large improvement in running time over traditional methods for handling distributive lattices.

[CV-245] Scribby: A Multi-Level LLM Framework for Semantic Video Analysis

Link: https://arxiv.org/abs/2606.14762
Authors: Julian Abelarde,Hugo Garrido-Lestache Belinchon
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:As video content continues to expand across educational platforms, recorded lectures, and live-streamed entertainment, the need for efficient and structured analysis of long-form footage has increased [1]. Although many existing AI programs provide high-level video summaries based on AI-generated transcripts [2,3,4,5], these approaches are often limited to coarse overviews and lack detailed analysis of a video’s structure, thematic progression, and semantic relationships, all of which are required for comprehensive video analysis. This paper proposes an LLM-based video summarization framework that balances macro-level comprehension with micro-level semantic analysis [6,12,13]. The first stage of the process indexes the video at a micro level by (1) analyzing the full transcript, (2) analyzing individual transcript sentences, and (3) grouping these sentences by semantic similarity using an LLM as a judge [6,13]. Contextual continuity is retained during sentence-level processing by incorporating both the global transcript analysis and adjacent sentence information into each evaluation prompt. This framework establishes a foundation for video analysis tools that visualize semantic chunking and semantic matching through relevance-based heatmaps. Limitations and future expansions of the framework are also discussed.

[CV-246] GeoRoPE: Ground-Aware Rotary Adaptation for Remote Sensing Foundation Models

Link: https://arxiv.org/abs/2606.14760
Authors: Yu Luo,Kun Hu,Mengwei He,Xiaogang Zhu,Shan Zeng,Allen Benter,Wei Xiang,Patrick Filippi,Thomas Francis Bishop,Zhiyong Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Remote-sensing foundation models (RSFMs) benefit from pretraining on imagery from multiple sensors and ground sampling distances (GSDs), but such exposure alone does not resolve scale mismatch during downstream adaptation. A fixed token-grid offset can correspond to different ground distances across sensors, making grid-based positional priors physically inconsistent. Meanwhile, heterogeneous spatial granularity means that compact urban regions and homogeneous landscapes may require different positional sensitivities even under the same GSD. Therefore, we propose GeoRoPE, a ground-aware, RoPE-compatible, and parameter-efficient spatial adaptation method for RSFMs. GeoRoPE recalibrates token-level positional interactions from two complementary aspects. First, Geo-Coordinate Calibration (GCC) rescales raw token-grid offsets according to the ground distance represented by one token-grid step, producing geo-calibrated relative coordinates across GSDs. Second, Geo-Frequency Calibration (GFC) adjusts the native RoPE frequency with a relation-specific factor, enabling position-sensitive adaptation to scene-dependent spatial granularity. GeoRoPE is injected into pretrained RSFMs through a lightweight adapter, preserving the frozen spatial prior while adding geo-aware positional corrections. Experiments across multiple RSFMs, sensors, resolutions, and downstream tasks demonstrate that GeoRoPE improves cross-resolution robustness and scale-sensitive representation learning.
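
The scale mismatch that motivates Geo-Coordinate Calibration can be made concrete with a small sketch. The function name, the parameters, and the normalization by a reference step (`ref_step_m`) are assumptions for illustration only; the abstract states just that raw token-grid offsets are rescaled by the ground distance of one token-grid step.

```python
def geo_calibrated_offset(grid_offset, gsd_m_per_px, patch_px, ref_step_m=160.0):
    """Rescale a (dy, dx) token-grid offset by the ground distance of one token step.

    gsd_m_per_px: ground sampling distance in meters per pixel.
    patch_px:     side length of one token's patch in pixels.
    ref_step_m:   hypothetical reference step used to keep values in a familiar range.
    """
    step_m = gsd_m_per_px * patch_px      # meters covered by one token-grid step
    scale = step_m / ref_step_m
    dy, dx = grid_offset
    return (dy * scale, dx * scale)


# The same 3-token horizontal offset under two sensors with 16-pixel patches:
coarse = geo_calibrated_offset((0, 3), gsd_m_per_px=10.0, patch_px=16)  # 480 m on the ground
fine = geo_calibrated_offset((0, 3), gsd_m_per_px=0.5, patch_px=16)     # 24 m on the ground
```

Under these assumptions, an identical grid offset maps to calibrated coordinates that differ by a factor of 20, reflecting the different ground distances the two sensors actually cover.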

[CV-247] Temporally Consistent and Controllable Video Generation of 2D Cine CMR via Latent Space Motion Modeling

Link: https://arxiv.org/abs/2606.14759
Authors: Yiheng Cao,Gustavo Andrade-Miranda(SyCoIA - IMT Mines Alès),Jiatian Zhang,Guillaume Sallé,Xin Gao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Cine cardiac magnetic resonance is the gold standard for assessing cardiac function, but the scarcity of public datasets limits the development of advanced data-driven models. To address this limitation, we propose a generative method for synthesizing temporally coherent and anatomically consistent cardiac sequences. Our text-to-video framework decouples cardiac spatial structure from temporal motion. First, a fine-tuned diffusion model synthesizes an initial frame from a clinical text prompt, controlling anatomical features. Then, a latent flow model conditioned on a cardiac phase embedding generates the complete cardiac motion, ensuring spatial consistency and temporal control. Our model generates anatomically and pathologically diverse sequences with high temporal coherence and strong fidelity to input prompts, achieving an FID of 31.68 for image realism and a CLIP score of 31.04 for text-image alignment. These experimental results highlight its potential to produce high-fidelity, on-demand medical data, offering a scalable solution to data scarcity.

[CV-248] Disentangling Hallucinations: Orthogonal Semantic Projection for Robust Interpretability

Link: https://arxiv.org/abs/2606.14758
Authors: Emirhan Bilgiç,Baptiste Caramiaux,Zhi Yan,Gianni Franchi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 41 pages in total. 5 figures, and 2 tables in the main paper; 10 figures and 17 tables in the appendix

Abstract:As Vision-Language Models are increasingly deployed in safety-critical applications, the trustworthiness of their explanations becomes crucial. Explainable AI (XAI) methods for Vision-Language Models often suffer from semantic hallucination, where attribution maps highlight prominent image regions even when prompted with incorrect text descriptions (e.g., highlighting a dog when prompted “cat”). Although this problem is widespread, a formal mathematical analysis of XAI methods and CLIP embeddings is largely missing in the literature. We demonstrate that this phenomenon is not specific to a single architecture but is a fundamental consequence of Linear Semantic Leakage in high-dimensional embedding spaces. We propose a unified theoretical framework, Linear Semantic Attribution (LSA), which generalizes across discriminative methods. We introduce Orthogonal Semantic Projection (OSP), a geometric intervention that utilizes the residual property of OMP to disentangle unique semantic signals from shared concepts. We prove theoretically and demonstrate empirically that OSP minimizes hallucination by orthogonalizing the query vector against distractor concepts, rendering the attribution model blind to shared features while preserving fidelity for correct prompts. Our code is available at: this https URL

[CV-249] Spatial Priors via Space Filling Curves for Small and Limited Data Vision Transformers ICML2026

Link: https://arxiv.org/abs/2606.14757
Authors: Leyla Naz Candogan,Arshia Afzal,Pol Puigdemont,Volkan Cevher
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ICML 2026

Abstract:Though Vision Transformers (ViTs) have become the dominant backbone in many computer vision tasks, due to permutation equivariance, their attention mechanism lacks explicit spatial inductive biases. This becomes particularly important in two settings: when model capacity is small or training data is limited. Inspired by the attention masking strategies in Linear Transformers and the scanning patterns of Vision SSMs, we introduce VIOLIN, a lightweight masked attention mechanism that encodes spatial structure within attention via Space Filling Curves (SFCs) with less than 0.0015% extra parameters and negligible computational overhead. VIOLIN scans the image using multiple SFCs to construct curve-specific decay masks, which are then combined and multiplied with the attention matrix. Across a wide range of evaluations, VIOLIN consistently improves performance. In limited data regimes such as fine-tuning on VTAB-1K, it boosts accuracy across all task groups and by up to 8.7% on the tasks where spatial information is essential. It can be combined with parameter-efficient fine-tuning methods such as LoRA to further increase the performance. Beyond fine-tuning, VIOLIN improves various small scale ViT architectures (e.g., DeiT, DINO) during pretraining on ImageNet-1K. Additionally, on pixel-level CIFAR-100 training, a task that is highly dependent on location information, VIOLIN increases accuracy by up to 7.2%. Overall, VIOLIN provides a computationally efficient yet effective way to inject spatial inductive bias into ViTs, especially benefiting small models and limited data settings.
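
The mask construction described in this abstract (scan with several curves, build one decay mask per curve, combine, multiply with attention) might look roughly like the sketch below. Everything specific here is an assumption: the choice of a snake scan and a Z-order (Morton) curve, the exponential decay `gamma ** distance`, and combining masks by averaging. The paper's actual curves, decay form, and learned parameters may differ.

```python
def snake_order(h, w):
    """Position of each (row, col) patch along a boustrophedon (snake) scan."""
    pos = {}
    for r in range(h):
        cols = range(w) if r % 2 == 0 else range(w - 1, -1, -1)
        for c in cols:
            pos[(r, c)] = len(pos)
    return pos


def morton_order(h, w):
    """Position of each patch along a Z-order curve (bit-interleaved row/col)."""
    def interleave(r, c):
        z = 0
        for bit in range(16):
            z |= ((c >> bit) & 1) << (2 * bit)
            z |= ((r >> bit) & 1) << (2 * bit + 1)
        return z

    cells = sorted(((r, c) for r in range(h) for c in range(w)),
                   key=lambda rc: interleave(*rc))
    return {rc: i for i, rc in enumerate(cells)}


def decay_mask(pos, h, w, gamma):
    """Mask entry (i, j) decays with the distance between patches i and j along the curve."""
    patches = [(r, c) for r in range(h) for c in range(w)]  # row-major token order
    return [[gamma ** abs(pos[p] - pos[q]) for q in patches] for p in patches]


def combined_mask(h, w, gamma=0.9):
    """Average the curve-specific masks (averaging is an assumption)."""
    masks = [decay_mask(order(h, w), h, w, gamma) for order in (snake_order, morton_order)]
    n = h * w
    return [[sum(m[i][j] for m in masks) / len(masks) for j in range(n)] for i in range(n)]


def apply_mask(attn, mask):
    """Elementwise product of an attention matrix with the spatial mask."""
    return [[a * m for a, m in zip(arow, mrow)] for arow, mrow in zip(attn, mask)]
```

Using more than one curve matters because any single 1D scan places some spatially adjacent patches far apart; a second curve with a different traversal partly compensates for that.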

[CV-250] Divide-and-Denoise: A Game-Theoretic Method for Fairly Composing Diffusion Models ICML2026

Link: https://arxiv.org/abs/2606.14756
Authors: Abhi Gupta,Polina Barabanshchikova,Vikas Garg,Samuel Kaski,Tommi Jaakkola
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted as spotlight at ICML 2026

Abstract:The abundance of pre-trained diffusion models provides an opportunity for composition. Combining several models, however, runs the risk of one model dominating or models disagreeing with each other. Here, we propose Divide-and-Denoise, a method for coordinating multiple pre-trained diffusion models during sampling. Much like managing a specialized workforce, our method creates a fair but efficient division of labor across models. Central to our method is the notion of an allocation which defines the responsibility of each model to every region of the noisy sample. At every timestep, we then denoise by (i) updating the allocation by solving a fair division game, where we divide the sample into regions that maximize total utility under fairness constraints, and (ii) aligning the models with this allocation, where we guide each model to denoise within its assigned region. This leads to a new composite denoising process that evolves in tandem with a division process. We evaluate Divide-and-Denoise on conditional image generation. Across several quality metrics, including the GenEval benchmark, our method outperforms baselines and resolves common failures including missing objects and mismatched attributes. Experiments show that Divide-and-Denoise utilizes each model’s expertise without neglecting any other model.

[CV-251] Where Does Texture Evidence Live in SAM? Features Proposal Masks and Texture Segmentation

Link: https://arxiv.org/abs/2606.14755
Authors: Nadav Orenstein,Aviad Cohen Zada,Shai Avidan,Gal Oren
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 26 pages, 13 figures, 20 tables. Code available at this https URL

Abstract:Texture segmentation stresses foundation segmentation because meaningful regions are defined by material or repeated appearance rather than object identity. Segment Anything Models (SAMs) often fail by default on such texture-defined partitions, but this failure is ambiguous: the texture evidence may be absent, missing from the proposal bank, or present but selected or assembled incorrectly by an object-centric readout. We ask what texture-relevant evidence is already preserved in frozen SAM before adaptation. We study two frozen evidence spaces: multiscale features, probed with a minimal clustering readout, and the automatic proposal bank, treated as evidence for a supervised consolidation readout. SAM is frozen throughout; we do not fine-tune the backbone or retrain the proposal generator. Across RWTD, STLD, an ADE20K-selected refined-crop complement, and a ControlNet-stitched PTD bridge archive, frozen SAM is not a texture segmenter by default, but its failures are not simple texture blindness. Coarse frozen features preserve texture organization, and proposal banks often contain texture-aligned masks or fragments. Natural scenes more often require assembly and commitment over fragments, while cleaner synthetic cases more often reduce to selecting an already coherent proposal. Default mask failure should therefore be decomposed into representation evidence, proposal-bank support, readout mismatch, and commitment failure.

[CV-252] Sub-Semantic Image Segmentation

Link: https://arxiv.org/abs/2606.14754
Authors: Aviad Cohen Zada,Nadav Orenstein,Shai Avidan,Gal Oren
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 23 pages. Code: this https URL

Abstract:Images can be segmented based on visual cues (i.e., texture segmentation) or into objects (i.e., semantic segmentation). We propose a new category of sub-semantic image segmentation that blurs the line between the two. In sub-semantic image segmentation, language is not used to name whole objects. Instead, it is used to partition an image into stable appearance patterns that can be described by language. To do that, we couple a general-purpose vision-language model to SAM 3, a promptable segmentation backbone whose native text pathway can ground rich descriptions into masks. Simple coupling fails for a number of reasons that we identify in the paper, and we overcome them by introducing DETECTURE that resolves three concrete failure modes – language leakage between texture regions, prompt competition inside the segmentation backbone, and semantic distortion at the language-to-mask interface. Since there is no dataset of sub-semantic image segmentation, we introduce one, termed TextureADE. The new dataset is derived from the ADE20K dataset using a system we designed. We compare DETECTURE to a number of baselines and find that it achieves the strongest performance on several datasets using different metrics. Code is available at this https URL.

[CV-253] Beyond Self-Attention: Sub-Quadratic Vision Transformers for Fast Image Captioning

Link: https://arxiv.org/abs/2606.14753
Authors: Chiradeep Ghosh,Dakshina Ranjan Kisku
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 8 figures

Abstract:Image captioning is a challenging and significant task that aims to generate coherent and semantically meaningful textual descriptions for given images. Accomplishing this task requires a deep understanding of visual content along with the ability to express that understanding in natural language. Despite remarkable progress with transformer-based architectures, existing approaches often suffer from limitations, such as a lack of rich local feature representations and the high computational cost of quadratic self-attention. The proposed model focuses on improving computational efficiency by restructuring the vision transformer architecture. In this approach, the standard self-attention mechanism in Vision Transformers is replaced with a probabilistic transformer approach based on a Gaussian Mixture Model (GMM), a soft-clustering technique. Instead of computing pairwise attention among all image patches, the model groups similar patches into a fixed number of clusters using an Expectation-Maximization (EM) algorithm. This clustering-based mechanism reduces the computational complexity from quadratic O(n^2) to linear O(nK), where K ≪ n. An autoregressive GPT-based decoder is used for caption generation. The model is evaluated on the Flickr 30K dataset, where it shows competitive results and a significant improvement over existing works.
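
To show where the O(nK) cost comes from, here is a minimal sketch of soft-clustering token mixing with a few EM iterations. It is an illustration only, not the paper's layer: it assumes isotropic unit-variance Gaussians with equal priors, and all function names are hypothetical. Each step loops over n patches and K clusters, never over pairs of patches.

```python
import math


def e_step(x, mu):
    """Responsibilities r[i][k]: soft assignment of patch i to cluster k. Cost O(nK)."""
    resp = []
    for xi in x:
        logits = [-0.5 * sum((a - b) ** 2 for a, b in zip(xi, m)) for m in mu]
        top = max(logits)                       # subtract max for numerical stability
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        resp.append([v / total for v in weights])
    return resp


def m_step(x, resp, k):
    """Update each cluster mean as the responsibility-weighted average. Cost O(nK).

    Assumes every cluster keeps non-zero total responsibility.
    """
    dim = len(x[0])
    mu = []
    for j in range(k):
        mass = sum(r[j] for r in resp)
        mu.append([sum(r[j] * xi[t] for r, xi in zip(resp, x)) / mass
                   for t in range(dim)])
    return mu


def cluster_mix(x, mu, iters=3):
    """Replace each patch feature by a responsibility-weighted mix of cluster means."""
    for _ in range(iters):
        mu = m_step(x, e_step(x, mu), len(mu))
    resp = e_step(x, mu)
    dim = len(x[0])
    return [[sum(r[j] * mu[j][t] for j in range(len(mu))) for t in range(dim)]
            for r in resp]
```

For example, with n = 196 patches and K = 8 clusters, one pass touches 196 × 8 = 1,568 patch-cluster pairs instead of 196² = 38,416 patch-patch pairs.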

[CV-254] X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining

Link: https://arxiv.org/abs/2606.14752
Authors: Xirui Kang,Yanpei Shi,Lucy Liang,Roy Gan,Dongxiu Liu,Pushi Zhang,Danpeng Chen,Xiaoyi Qin,Yinan Zheng,Jinliang Zheng,Hao Wang,Xianyuan Zhan,Hang Su
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Project page: this https URL

Abstract:Modern Vision-Language-Action (VLA) models must bridge pretrained vision-language reasoning and precise continuous robot control. Existing action tokenizers discretize actions primarily for reconstruction, producing codes that preserve motion geometry but provide only weak semantic supervision to the backbone. We therefore formulate action tokenization not as mere compression, but as semantic interface learning between multimodal reasoning and executable control. To this end, we introduce X-Tokenizer, a lightweight encoder-Semantic Residual Quantization (SRQ)-decoder architecture that provides a shared action interface across diverse robotic arm embodiments. Its key component, SRQ, imposes an asymmetric structure on residual vector quantization: the first level is trained with Masked Action Modeling (MAM) to form a discrete action language that captures coarse motion intent, while deeper levels remain reconstruction-oriented residuals that preserve fine-grained details. To further align action tokens with multimodal semantics, X-Tokenizer is pretrained with contrastive alignment to the representation space of a pretrained foundation model and with next-frame vision-language feature prediction. Pretrained on 2.4M trajectories (2.0B action frames), a single frozen X-Tokenizer plugs into a mixed discrete-continuous VLA as a representation-shaping supervision signal. X-Tokenizer achieves top real-world aggregate and strong RoboTwin 2.0 simulation results. Outperforming FAST in multimodal grounding (+13.5%) and long-horizon tasks (+8.25), it shows that action tokenizers serve as semantic interfaces for VLA pretraining beyond mere action compression.
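
The coarse-then-residual structure that SRQ builds on can be sketched with plain residual vector quantization. This example covers only the generic mechanism: the codebooks are hand-written toy values, and the Masked Action Modeling objective that the abstract applies to the first level is not modeled here.

```python
def nearest(codebook, v):
    """Index of the codebook entry closest to v in squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], v)))


def rvq_encode(v, codebooks):
    """Quantize v level by level; each level encodes what the previous ones missed."""
    residual = list(v)
    codes = []
    for cb in codebooks:
        idx = nearest(cb, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return codes, residual


def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected entry from every level."""
    out = [0.0] * len(codebooks[0][0])
    for idx, cb in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out


# Toy codebooks (hypothetical): level 1 holds coarse motions, level 2 fine corrections.
coarse = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
fine = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [-0.1, 0.0]]

codes, leftover = rvq_encode([0.9, 0.0], [coarse, fine])
reconstruction = rvq_decode(codes, [coarse, fine])
```

In this toy setup the first code alone already identifies the coarse motion ("move along x"), while the second only trims the magnitude, which mirrors the asymmetric roles the abstract assigns to the first and deeper levels.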

[CV-255] Automated 3D Kinematic Monitoring for Circadian Activity and Anomaly Detection in Juvenile Fish

Link: https://arxiv.org/abs/2606.14749
Authors: Chih-Wei Huang,Chang-Wen Huang,Chung-Ping Chiang,Tsung-Wei Pan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Precision aquaculture faces a “phenotyping bottleneck” in tracking high-resolution behavioral traits, as conventional methods cannot quantify instantaneous three-dimensional (3D) physical exertion. To address this, we present a high-throughput 3D behavioral phenotyping framework integrating deep learning object detection with binocular stereo vision for real-time monitoring of juvenile tilapia in high-density environments. The system automates non-contact body length estimation and reconstructs 3D swimming trajectories from absolute spatial coordinates. By eliminating 2D perspective distortions, this approach precisely quantifies 3D velocity and acceleration, marking the first estimation of true physical swimming speeds in free-roaming juveniles. Results show the framework successfully establishes circadian locomotor baselines, serving as an early warning system for physiological stress and providing an objective metric for fish vitality.

[CV-256] Is My Vision-Language Data in Your AI? Membership Inference Test (MINT) Demo 2

Link: https://arxiv.org/abs/2606.14748
Authors: Daniel DeAlcala,Gonzalo Mancera,Julian Fierrez,Aythami Morales,Ruben Tolosana,Ruben Vera-Rodriguez
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: IEEE Conf. on Computers, Software, and Applications (COMPSAC), 2026

Abstract:We present the Membership Inference Test (MINT) Demo 2, a framework designed to improve transparency in machine learning training processes. MINT is a technique for experimentally determining whether specific data were used during machine learning model training. We establish the theoretical framework and propose multiple architectures for MINT depending on the amount of information known about the models that are being audited. Experimental results using a popular face recognition model, 4 state-of-the-art LLMs, and multiple, diverse, and large-scale public image and text databases achieve promising accuracy levels in the detection of training data of up to 90%. Building on these results, we introduce a comprehensive web platform that expands these capabilities to image and text modalities. The platform integrates a diverse technological stack, including MINT, aMINT, and gMINT, allowing users to audit a wide range of models. This demonstrator aims to promote AI transparency and provides a practical tool to foster compliance with emerging AI regulations.

[CV-257] MMLongEmbed: Benchmarking Multimodal Embedding Models in Long-Context Scenarios

Link: https://arxiv.org/abs/2606.14747
Authors: Haitian Wang,Ruoxi Sun,Quantong Qiu,Juntao Li,Junhui Li,Hua Chen,Jinxiong Chang,Min Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advancements have significantly expanded the theoretical context windows of Multimodal Embedding Models (MEMs). However, larger context windows do not necessarily translate into effective comprehension and representation of long-context multimodal inputs, which remains a critical bottleneck for real-world deployment. To address the lack of systematic evaluation in this setting, we introduce MMLongEmbed, the first comprehensive benchmark for evaluating MEMs in long-context scenarios. MMLongEmbed comprises four retrieval tasks spanning multiple context-length ranges, covering text, document, and video modalities. Through extensive evaluation of state-of-the-art models, we find that current architectures rely heavily on superficial feature matching and struggle to capture deep semantic and structural dependencies. We further observe that performance degradation varies systematically with context length and key information placement. Moreover, models exhibit substantially different robustness to redundant contextual information across modalities. For reproducibility, the benchmark and code are publicly available.

[CV-258] Style-CCL: Content-Preserving Style Transfer via Curriculum Continual Learning

Link: https://arxiv.org/abs/2606.14746
Authors: Shiwen Zhang,Haoyuan Wang,Xianghao Zang,Haibin Huang,Chi Zhang,Xuelong Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: code and models of QwenStyle are released at this https URL and this https URL

Abstract:Content-preserving style transfer, given content and style references, remains challenging for Diffusion Transformers (DiTs) due to entangled content and style features. With a reverse triplet synthesis pipeline to build a million-scale training set and a dual-branch Style-Content DiT (SC-DiT) that decouples style and content via separate RoPE embeddings and causal masking, we observe that such a one-stage training paradigm on mixed style categories causes semantic styles to dominate, hindering texture-style learning and harming content preservation. To address these issues, we propose Style-CCL, a Multi-Stage Curriculum Continual Learning framework that trains SC-DiT from semantic (easy) to texture (hard) styles, and from clean to synthetic data, with Random Memory Rehearsal across stages to avoid catastrophic forgetting. Extensive experiments demonstrate that our Style-CCL achieves state-of-the-art performance in three core metrics: style similarity, content consistency, and aesthetic quality.

[CV-259] HorusEye: Language as Dynamic Attention for Emergency Visual Analysis

Link: https://arxiv.org/abs/2606.14741
Authors: Armel Yara
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 18 pages, 9 figures, 11 tables

Abstract:We introduce HorusEye, Language as Dynamic Attention for Emergency Visual Analysis. Our investigation follows five stages. The first is benchmarking RefCOCO-Degraded, a dataset of 15,244 images (3,811 base images × 4 conditions: Clean, Fog, Smoke, and Thermal) with systematic visual degradation. Through four research questions, we then evaluate multiple VLMs (Gemini, Qwen2-VL, BLIP-2, LLaVA, Kosmos-2) on visual grounding (the second stage), language feedback recovery (the third), health VQA tasks (the fourth), and hallucination analysis (the final stage). Our key finding is that language feedback effectiveness is model-dependent: Gemini achieves +47.3% improvement in thermal conditions through iterative language feedback, while Qwen2-VL shows -5.1% degradation under the same protocol. We also identify the ‘Thermal Paradox’, where cropping strategies that improve RGB performance catastrophically fail in thermal imagery. Furthermore, BLIP-2 uniquely hallucinates more under degradation, making it unsuitable for emergency deployment.

[CV-260] GridVQA-X: A Framework for Evaluating Multimodal Explainability Methods CVPR2026

Link: https://arxiv.org/abs/2606.14740
Authors: Sujay Belsare,Sudarshan Nikhil,Sushant Kumar,Ponnurangam Kumaraguru,Chirag Agarwal
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 15 Figures, Accepted for poster presentation at CVPR 2026 TRUE-V Workshop

Abstract:With the increasing development of Vision-Language Models, it becomes imperative that their predictions are readily explainable to relevant stakeholders. However, the field of explainability has not kept pace with the multimodal surge. While recent Multimodal Explainable AI (MxAI) methods generate explanations to attribute the interaction between different modalities, current evaluation protocols lack the ground truth required to distinguish between true cross-modal reasoning (e.g., spatial composition) and shallow cross-modal shortcuts (e.g., Bag-of-Words attribute matching). It remains unknown whether MxAI methods faithfully capture synergistic interactions or merely hallucinate reasoning on models acting as simple feature detectors. In this paper, we introduce GridVQA-X, the first diagnostic framework specifically designed to evaluate cross-modal explainability. Unlike natural datasets, GridVQA-X leverages a closed-world synthesis logic to generate unique, mathematically guaranteed explanations. We utilize this controlled environment to train paired ground-truth models on identical architectures: $M_{\text{pure}}$, which learns robust spatial-relational reasoning, and $M_{\text{spur}}$, which is structurally forced to rely on cross-modal shortcuts. This behavioral divergence creates a rigorous testbed: a faithful explainer must report distinct reasoning pathways for each model. Our findings reveal that widely used methods fail to distinguish between models relying on genuine spatial-relational reasoning and those exploiting cross-modal shortcuts, highlighting a critical gap in capturing true cross-modal synergy and showing that these methods misrepresent how multimodal models actually make decisions.

[CV-261] UtVAA: Ultra-tiny Vision Transformer with Affix Attention for Mobile Image Classification

Link: https://arxiv.org/abs/2606.14735
Authors: Romiyal George,Sathiyamohan Nishankar,Selvarajah Thuseethan,Roshan G. Ragel
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 7 figures

Abstract:Vision Transformers (ViTs) have demonstrated strong representation capability in image classification. However, their quadratic self-attention complexity and large parameter counts limit deployment on resource-constrained mobile and edge devices. This paper introduces UtVAA, an ultra-tiny Vision Transformer architecture designed for efficient visual recognition under strict computational budgets. It incorporates a novel Affix Attention block that combines depthwise-pointwise local feature extraction, linear self-attention, coordinate attention for spatial dependency modelling, and a lightweight ternary fusion strategy to integrate local and global representations. In addition, Dilated Bottleneck blocks expand the receptive field using dilated depthwise separable convolutions while maintaining low FLOPs and stable optimisation through residual connections. UtVAA is implemented in scalable Tiny, Medium, and Large variants, with the smallest model containing 204.67K parameters and 53.95M FLOPs. Experimental results on CIFAR-10, CIFAR-100, PlantVillage-Tomato and SLIF-Tomato datasets show that UtVAA achieves competitive accuracy within a sub-million-parameter regime. Overall, the results demonstrate that transformer-based vision models can be redesigned into ultra-tiny architectures without significant loss in discriminative performance, making UtVAA suitable for mobile and edge deployment. Code is available at this https URL

[CV-262] Steady-Forcing: Balancing Spatial Persistence and Motion Continuity in Long-Horizon Nature Video Diffusion

Link: https://arxiv.org/abs/2606.14732
Authors: Matiur Rahman Minar,Seunghun Oh,GangHyeon Jeong,Unsang Park
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Project page: this https URL

Abstract:Autoregressive video diffusion models enable streaming generation but often degrade over long rollouts: static scene layouts drift, while mechanisms that improve spatial stability tend to suppress motion, causing natural flows such as water, fire, or smoke to stagnate. We study this stability-motion trade-off in fixed-camera long-horizon nature video generation, where the two failure modes can be more clearly separated than in moving-camera settings. We propose Steady-Forcing, a memory and training framework combining a persistent visual anchor (V-Sink), an exponential moving-average motion memory (EMA-Sink), block-relative temporal encoding, periodic cache purification, and distillation from a Wan2.1-14B teacher with motion-rewarded priors under task-focused configurations. Together, these components are designed to preserve background identity while sustaining visually plausible fluid dynamics over multi-minute autoregressive rollouts. Evaluations across seven baselines show that Steady-Forcing improves long-horizon background consistency and imaging quality, while a blind user study indicates stronger perceived stability and motion continuity. The benchmark evaluation further suggests that generic VBench aggregate scores under-penalize fixed-camera artifacts: they reward drift-induced optical flow as Dynamic Degree while not directly penalizing texture hardening or flow stagnation, which motivates future task-specific benchmarks for static-camera nature-flow evaluation. Project page: this https URL

[CV-263] BBR-Net: Boundary-Balanced Replay for Continual Medical Image Segmentation

Link: https://arxiv.org/abs/2606.14731
Authors: Zahid Ullah,Sieun Choi,Jihie Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Continual learning for medical image segmentation remains challenging under domain shift because replay-based methods often preserve appearance information without explicitly modeling anatomical structure. This study investigates whether structural consistency governs knowledge retention in continual cardiac ultrasound segmentation. We propose the Boundary-Balanced Replay Network (BBR-Net), which selects replay samples using boundary-aware priority and class balance to preserve anatomically informative regions. The method is evaluated on CAMUS and CardiacNet under forward (CAMUS to CardiacNet) and reverse (CardiacNet to CAMUS) task orders. In the forward setting, BBR-Net retains source-task performance close to an offline joint-training reference, while markedly reducing catastrophic forgetting and preserving competitive target-task adaptation. Ablation results show that boundary-aware prioritization contributes to retention and improves the balance between source-task preservation and target-task adaptation when combined with class-aware sampling. In contrast, the reverse setting reveals that structure-aware replay fails when initial representations are learned from noisy and structurally inconsistent data. To isolate this effect, we conduct a controlled structural perturbation analysis by progressively corrupting source-task boundaries while keeping the dataset, architecture, and training protocol fixed. Forgetting increases consistently as structural reliability decreases, suggesting that replay effectiveness is strongly influenced by the quality of stored structural information, rather than by memory capacity alone. These findings indicate that preserving anatomical structure under domain shift is a central factor in continual medical image segmentation, and that replay mechanisms should account for structural reliability to support robust knowledge retention.

[CV-264] Hierarchical GRU with Input-Conditioned Slot Queries for Ball Action Anticipation CVPR2026

Link: https://arxiv.org/abs/2606.14730
Authors: Parthsarthi Rawat
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026 SoccerNet Ball Action Anticipation Challenge, Validated Rank 4

Abstract:We present a hierarchical model for ball action anticipation in football broadcast video. Given a 30-second observation window, the system predicts actions occurring in the subsequent 5-second window across 10 classes. A shared local Transformer encodes clip-level features within each 5-second sub-window; a GRU then aggregates temporal context across all sub-windows; finally, a Transformer decoder with K input-conditioned event slots decodes the anticipation target via three decoupled heads (objectness, class, temporal offset). We introduce frequency-reweighted Hungarian matching that systematically favours rare action classes, and Gaussian soft targets for temporal bin supervision. On the SoccerNet Ball Action Anticipation benchmark, our method achieves 17.91% mAP on the test server.
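
The Gaussian soft targets mentioned in the abstract can be illustrated with a minimal sketch. The bin count, the width `sigma`, and the function name `gaussian_soft_target` are assumptions made for illustration; the abstract does not give these details.

```python
import numpy as np

def gaussian_soft_target(true_offset, num_bins, sigma=1.0):
    """Return a normalized Gaussian over temporal bins, centred on true_offset (in bin units)."""
    bins = np.arange(num_bins, dtype=np.float64)
    weights = np.exp(-0.5 * ((bins - true_offset) / sigma) ** 2)
    return weights / weights.sum()  # sums to 1, so it can supervise a softmax head

# Hypothetical example: the event falls in bin 3 of 10.
target = gaussian_soft_target(true_offset=3.0, num_bins=10, sigma=1.0)
```

Compared with a one-hot label, neighbouring bins receive partial credit, which softens the penalty for predictions that are off by a small temporal offset.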

[CV-265] FUSE: Quantifying Uncertainty in Vision-Language Models by Bayesian Fusing Epistemic and Aleatoric Uncertainty

Link: https://arxiv.org/abs/2606.14728
Authors: Harry Zhang,Luca Carlone
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-language models (VLMs) are playing an increasingly important role across multiple domains. In many applications, such as robotics, it is crucial to quantify the uncertainty in the output of these models. We develop FUSE, a probabilistic framework for capturing two complementary sources of uncertainty in vision-language modeling: (i) aleatoric embedding-level uncertainty derived from input data vision-language ambiguity, and (ii) epistemic model-level uncertainty estimated from the semantic response diversity of VLMs. Our approach formulates a Bayesian fusion mechanism that analytically combines these uncertainty sources to produce a scalar measure of uncertainty. This measure can be used to reliably predict the model’s output correctness for downstream applications. We demonstrate that our method outperforms baselines and achieves SOTA uncertainty calibration.

[CV-266] FairGen: Preference-Aligned Diffusion for Demographically Equitable Medical Image Synthesis

Link: https://arxiv.org/abs/2606.14727
Authors: Zhimin Li,Ruichen Zhang,Zhen Tan,Howard J Aizenstein,Jingtong Hu,Tianlong Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication in npj Digital Medicine. 20 pages, 6 figures

Abstract:Medical imaging is central to modern diagnostics, and artificial intelligence (AI) systems are increasingly used to support image-based analysis by improving efficiency, accuracy, and access to care. However, inequities in healthcare access and differential disease prevalence create severe demographic imbalances in clinical image data. Such imbalances are compounded by the fact that diseases can manifest with distinct features across demographic groups, rendering certain phenotypic presentations naturally rare. AI models trained on such imbalanced data risk perpetuating diagnostic bias and widening healthcare disparities. Here we introduce FairGen, a fairness-aware diffusion framework that synthesizes demographically balanced medical images while preserving pathology-relevant visual features. By embedding physician-aligned preferences into the generation process, FairGen improves subgroup coverage during synthesis and downstream classification. Applied to dermatology, radiology, and neuroimaging benchmark tasks, FairGen achieves fairness improvements of 95.9% for skin images, 80.0% for chest radiography, and 35.2% for brain MRI, while maintaining competitive diagnostic accuracy relative to models trained on original clinical data. Clinician-facing expert review and external validation on independent cohorts further support that these gains extend beyond standard fidelity metrics and are not confined to the original in-distribution datasets.

[CV-267] Interpolation between Convolution and Attention via K-Nearest Neighbors

Link: https://arxiv.org/abs/2606.14725
Authors: Mingi Kang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Undergraduate Thesis in Computer Science at Bowdoin College

Abstract:The shift from Convolutional Neural Networks to Transformers has reshaped computer vision, yet these two architectural families are typically viewed as fundamentally distinct. Convolutional Neural Networks are defined by spatially local convolution operations, while Transformers rely on global self-attention. We argue that convolution and self-attention, despite their apparent differences, can be unified within a single k-nearest neighbor aggregation framework. The critical insight is that both operations are special cases of neighbor selection and weighted aggregation. Convolution selects neighbors by spatial proximity while self-attention selects by feature similarity, revealing that they lie on a continuous spectrum rather than representing categorically different computations. We introduce Convolutional Nearest Neighbors (ConvNN), a unified framework that formalizes this connection. ConvNN exactly recovers standard and depthwise convolution by restricting neighbor selection to normalized spatial coordinates, and exactly recovers self-attention and its sparse variants, including KVT-attention, by replacing spatial proximity with scaled dot-product similarity. Beyond these special cases, ConvNN serves as a drop-in replacement for both convolution and attention layers, enabling systematic exploration of the intermediate spectrum between local and global aggregation through configurable similarity functions, neighbor selection strategies, positional encodings, and aggregation kernels.
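
The "neighbor selection plus weighted aggregation" view can be sketched in a few lines of NumPy. This is an illustrative toy on a 1D token grid, not the ConvNN implementation: the helper `knn_aggregate`, the uniform weights used for the local case, and the absence of learned projections are all assumptions.

```python
import numpy as np

def knn_aggregate(features, scores, k, weights_fn):
    """For each token, pick the k highest-scoring neighbours and return their weighted sum."""
    out = np.zeros_like(features)
    for i in range(features.shape[0]):
        idx = np.argsort(-scores[i], kind="stable")[:k]  # top-k neighbours of token i
        out[i] = weights_fn(scores[i, idx]) @ features[idx]
    return out

def uniform(s):
    return np.full(s.shape, 1.0 / s.size)

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))                     # 6 tokens, 4 channels
pos = np.arange(6, dtype=np.float64)            # 1D spatial coordinates

spatial_scores = -np.abs(pos[:, None] - pos[None, :])   # closer in space = higher score
feature_scores = (x @ x.T) / np.sqrt(x.shape[1])         # scaled dot-product similarity

# Spatial selection with fixed weights behaves like a box-filter convolution (interior tokens).
local_out = knn_aggregate(x, spatial_scores, k=3, weights_fn=uniform)
# Similarity selection with softmax weights equals dense self-attention when k = n.
global_out = knn_aggregate(x, feature_scores, k=6, weights_fn=softmax)
```

Varying `k` and the score function moves between the two regimes, which is the spectrum the abstract describes.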

[CV-268] VigilFormer: Deformable Attention for Video Anomaly Detection with Causal Risk Inference

Link: https://arxiv.org/abs/2606.14724
Authors: Xinze Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Video anomaly detection in surveillance settings must balance detection accuracy against real-time throughput, a tension that existing methods address either through stronger feature extractors or more efficient architectures, but rarely both. We present VigilFormer, a unified framework that combines deformable spatio-temporal attention with causal temporal modeling to detect anomalies in untrimmed surveillance video. The proposed Deformable Spatio-Temporal Encoder (DSTE) attends to a sparse set of informative locations across frames, avoiding the quadratic cost of dense attention while retaining the ability to capture irregular motion patterns. A Causal Anomaly Classifier (CAC) applies dilated causal convolutions over snippet-level features and optimizes a contrastive multiple-instance learning objective that separates anomalous and normal representations without frame-level labels. To meet deployment constraints, an Adaptive Confidence Scheduler (ACS) dynamically skips low-information frames at inference time, reducing redundant computation in static scenes. Evaluated on UCF-Crime, ShanghaiTech, and CUHK Avenue, VigilFormer achieves AUC scores of 87.83%, 97.21%, and 89.74% respectively, at 41.5 FPS on a single GPU, outperforming recent weakly-supervised methods in both accuracy and speed.

[CV-269] Disagreement-Based Cross-Model Routing for Implicit Video Question Answering

Link: https://arxiv.org/abs/2606.14723
Authors: Durga Sandeep Saluru
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We study multiple-choice video question answering on the ImplicitQA benchmark, where the correct answer is never explicitly shown but must be inferred from off-screen events, line-of-sight cues, causal structure, and cross-shot spatial layout. On this benchmark a single frontier video LLM already operates near its accuracy ceiling, and we observe that conventional self-consistency strategies – majority voting across repeated samples of the same model – can hurt rather than help, because the model’s errors on hard questions are correlated. We propose disagreement-based cross-model routing, a pure inference-time procedure that requires no labels and no training. We triple-sample a native-video model (Gemini 3.1 Pro Preview) at temperature zero, exploit the genuine sample-to-sample variance of its video-processing pipeline to identify the roughly 20% subset of questions where the three samples disagree, and route only that subset to a second model from a different family (Claude Opus 4.8) that consumes uniformly sampled frames with adaptive thinking. On the 1001-question validation set with public ground truth – our main evaluation – the method improves AvgAcc by +1.43 over the best single sample of the primary model, with per-category gains concentrated on Motion Trajectory (+5.49), Inferred Counting (+3.45), and Vertical Spatial Reasoning (+1.82) – the categories most dependent on cross-shot reference resolution. The same pipeline applied to the held-out 172-question CVPR 2026 ImplicitQA challenge test set achieves 82.03 AvgAcc / 79.71 MacroAvgAcc (+1.81 over the best single sample of the primary model), confirming the validation result on an independent split.
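
The routing rule described in the abstract can be sketched as follows. The function name `route_by_disagreement` and the stub models are hypothetical stand-ins for real model calls, and using the secondary model's answer directly on disagreement is an assumption about how the routed subset is resolved.

```python
from itertools import cycle

def route_by_disagreement(question, primary, secondary, n_samples=3):
    """Sample the primary model n_samples times; defer to the secondary model only on disagreement."""
    answers = [primary(question) for _ in range(n_samples)]
    if len(set(answers)) == 1:
        return answers[0], "primary"        # unanimous: keep the primary answer
    return secondary(question), "secondary"  # samples disagree: route to the other model family

# Hypothetical stubs standing in for the two models.
def stable_primary(question):
    return "A"

_flaky_answers = cycle(["A", "B", "A"])
def flaky_primary(question):
    return next(_flaky_answers)

def secondary_model(question):
    return "C"

easy = route_by_disagreement("q1", stable_primary, secondary_model)
hard = route_by_disagreement("q2", flaky_primary, secondary_model)
```

Note that the rule deliberately does not take a majority vote: the abstract reports that majority voting over repeated samples of the same model can hurt because its errors are correlated.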

[CV-270] DC-Motion: Decoupling Semantics and Details via Discrete-Continuous Tokens for Human Motion Generation

Link: https://arxiv.org/abs/2606.14721
Authors: Hequan Wang,Jiaxu Zhang,Zhengbo Zhang,Zhigang Tu
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Text-to-motion generation requires synthesizing physically realistic dynamics that strictly follow complex and long-horizon textual instructions. Existing approaches rely on homogeneous representation spaces that may fail to capture the hierarchical nature of human motion, with diffusion models struggling at compositional semantic reasoning and AR models sacrificing fine-grained physical details due to quantization. To address this, we introduce DC-Motion, a factorized generative framework designed to explicitly decouple semantics and details via discrete-continuous tokens. A Discrete-Continuous VAE (DC-VAE) first decomposes motion into discrete tokens for semantics and continuous residuals for fine-grained dynamics. Then, a masked AR model predicts the discrete structure from text, and a lightweight residual diffusion model recovers the continuous physical details. Extensive experiments demonstrate that DC-Motion improves the capability to follow complex instructions. By balancing semantic controllability and physical realism, our approach offers a highly adaptable modeling paradigm for human motion generation. On both HumanML3D and KIT-ML datasets, DC-Motion achieves state-of-the-art performance, delivering the best FID for motion realism and R-precision for text alignment.

[CV-271] AI for Maritime Security: Comparative Evaluation of CNN and Vision Transformer Architectures for Maritime Object Detection

Link: https://arxiv.org/abs/2606.14720
Authors: Ismet Gocer,Zakirul Bhuiayn,Shakeel Ahmad,Raza Hasan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 Pages

Abstract:This study aims to enhance maritime security by using advanced Artificial Intelligence (AI) and Computer Vision (CV) techniques. For this purpose, we designed and assessed intelligent object detection systems that can detect the presence of ships on the sea surface under different real-time environments. To achieve this goal, a maritime image dataset with 6,468 images was used, covering different weather conditions such as cloudy, foggy, rainy, and sunny environments. Six deep learning architectures were evaluated, including a base Convolutional Neural Network (CNN) model, four transfer learning models (Xception, VGG16, MobileNetV2, and EfficientNetV2L), and a Vision Transformer (ViT) model. The models were compared using multiple performance indicators, including accuracy, Type I and Type II errors, model size, and video processing time. The results show that model performance varies depending on computational constraints and deployment conditions. While lightweight architectures are suitable for resource-limited devices, the ViT achieved the best overall performance, reaching 100% accuracy with the lowest error rates and the fastest video processing time. The findings highlight the potential of AI-driven computer vision systems for maritime surveillance, border protection, and autonomous navigation.

[CV-272] RAMS: Resource-Adaptive and Detection-Conditioned Model Switching for Embedded Edge Perception

Link: https://arxiv.org/abs/2606.14716
Authors: Kushal Khemani,Evan Leri,George Xu,Amit Hod
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:Edge object detection on embedded hardware requires balancing inference latency and detection quality under changing resource pressure. We present RAMS, a lightweight runtime controller that monitors device pressure, calibrates switching thresholds from idle behavior, and dynamically selects among three resident YOLOv8 tiers (NANO/SMALL/MEDIUM at 320/416/640 px) without model-reload latency. RAMS defines five switching policies, including two detection-conditioned variants that prevent aggressive downgrades after recent vulnerable-road-user (VRU) detections. We further introduce the VRU-Weighted Accuracy Score (SWAS), a scalar metric for offline policy comparison without ground-truth annotations, together with an oracle-bounded variant that separates detector circularity from genuine tier-retention benefit. Across Raspberry Pi 5, x86 laptops, and Jetson Orin ONNX/TensorRT deployments, the same controller equations operate over a 37x latency range. On Jetson Orin TensorRT under heavy load, the safety2 policy achieves 3.41 ms mean latency, 5.6x faster than fixed-MEDIUM inference, while retaining 74% of its proxy accuracy through near-NANO operation with selective SMALL and MEDIUM locks during VRU-positive windows. Detection-conditioned switching improves SWAS by 25.4% under oracle scoring and 47.3% under detector-derived scoring relative to threshold-only policies under heavy load. Live KITTI evaluation reports per-tier VRU recall of 24.2%, 41.2%, and 59.0%, showing that reactive overrides are fundamentally limited by baseline detector recall.
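
A detection-conditioned switching policy of the kind the abstract describes might look like the sketch below. Everything specific here is hypothetical: the function `select_tier`, the scalar `pressure` signal, the two thresholds, and the `hold_frames` window are illustrative assumptions, not the controller equations from the paper.

```python
def select_tier(pressure, low, high, frames_since_vru, hold_frames=30):
    """Pick a detector tier from device pressure, with a floor after a recent VRU detection."""
    if pressure < low:
        tier = "MEDIUM"      # plenty of headroom: use the most accurate resident model
    elif pressure < high:
        tier = "SMALL"
    else:
        tier = "NANO"        # heavy load: use the cheapest model
    # Detection-conditioned override (assumed form): do not drop to NANO
    # while a vulnerable road user was seen within the last hold_frames frames.
    if tier == "NANO" and frames_since_vru < hold_frames:
        tier = "SMALL"
    return tier

# Hypothetical thresholds, e.g. calibrated from idle measurements.
calm = select_tier(pressure=0.2, low=0.4, high=0.8, frames_since_vru=1000)
loaded = select_tier(pressure=0.9, low=0.4, high=0.8, frames_since_vru=1000)
loaded_after_vru = select_tier(pressure=0.9, low=0.4, high=0.8, frames_since_vru=5)
```

Because all tiers stay resident in memory, switching amounts to choosing which loaded model handles the next frame, so the policy itself adds negligible latency.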

[CV-273] Wavelength-Multiplexed 2D Beam Steering via a Passive Diffractive Network

Link: https://arxiv.org/abs/2606.16261
Authors: Che-Yung Shen,Yuhang Li,Cagatay Isil,Tianyi Gan,Mona Jarrahi,Aydogan Ozcan
Categories: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Applied Physics (physics.app-ph)
Comments: 20 Pages, 4 Figures

Abstract:We introduce a wavelength-addressable diffractive optical network that transforms illumination wavelength into a high-dimensional control parameter for arbitrarily programmable 2D beam steering. The proposed passive architecture comprises cascaded spatially optimized diffractive layers, jointly designed using deep learning, to rapidly map distinct wavelengths to predefined/desired output angles. Unlike conventional single-layer dispersive optical elements, which are physically restricted to 1D linear mapping, this framework harnesses complex wavefront transformations to utilize the illumination wavelength as an intrinsic addressing key for arbitrary 2D beam steering, eliminating the need for mechanical scanning or electronic phase control. We numerically demonstrate wavelength-controlled beam steering across 625 wavelength channels spanning 400-750 nm, realizing a 25 x 25 array of independently addressable beam positions with subwavelength positioning accuracy and high channel fidelity. Unlike conventional gratings, which constrain wavelength routing to a linear trajectory, the proposed diffractive network performs nonlocal wavefront transformations, enabling arbitrary wavelength-to-angle mappings across a 2D field of view. We further validate the proposed framework experimentally in both the terahertz and visible spectral regimes, demonstrating wavelength-multiplexed beam steering using 3D fabricated passive diffractive layers at terahertz frequencies and phase-only spatial light modulators in the visible spectrum. This wavelength-addressable diffractive architecture establishes a compact and scalable paradigm for high-speed programmable beam steering, with potential applications in optical communications, routing, imaging, sensing, and emerging photonic information-processing systems.

[CV-274] Variable-Rate Deep Image Compression based on Low-Rank Adaptation by Progressive Learning

Link: https://arxiv.org/abs/2606.16107
Authors: Xing-Yu Xu,Chen-Hsiu Huang,Ja-Ling Wu
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:In the digital age, image compression is crucial for numerous applications, including web media, streaming services, high-resolution medical imaging, and connected vehicle networks, enabling efficient data storage and transmission. With the increasing demand for high-quality image communication, the need for advanced compression techniques becomes increasingly critical. Numerous Deep Image Compression (DIC) techniques have recently been introduced, showing impressive performance compared to traditional standards. However, variable-rate image compression remains an unresolved issue. Some DIC methods deploy multiple networks to attain different compression rates, whereas others use a single model, which often results in higher computational complexity and reduced performance. This work proposes a progressive learning approach for variable-rate image compression based on Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method. We introduce an additional LoRA Rate-Adaptive Module (LoRAM) in DIC methods. Due to the re-parameterized merging of LoRA, our proposed method does not introduce additional computational complexity during inference. Compared to methods utilizing multiple models, comprehensive experiments demonstrate that our approach achieves competitive performance, saving 99% in parameter storage, 90% in datasets, and 97% in training steps.
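
The claim that LoRA merging adds no inference cost follows from the standard LoRA identity W x + B(A x) = (W + BA) x. The sketch below checks it numerically on a single linear layer; the shapes are arbitrary, the usual LoRA scaling factor is omitted for brevity, and nothing here reflects the specific layout of the LoRAM module.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 8, 16, 2

W = rng.normal(size=(d_out, d_in))   # frozen base weight
B = rng.normal(size=(d_out, rank))   # low-rank factors (the only trained parameters)
A = rng.normal(size=(rank, d_in))
x = rng.normal(size=(d_in,))

y_adapter = W @ x + B @ (A @ x)      # training-time form: base path plus low-rank branch
W_merged = W + B @ A                 # re-parameterized merge done once before inference
y_merged = W_merged @ x              # single matmul, same cost as the base layer

base_params = W.size                 # parameters of the full layer
lora_params = A.size + B.size        # parameters stored per additional adapter
```

Storing one small `(A, B)` pair per target rate instead of a full copy of `W` is what makes the per-rate storage overhead small in this kind of scheme.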

[CV-275] Chroma-gated differentiable OKLCH interpolation: Continuous Oklab fallback for color-cast reduction

Link: https://arxiv.org/abs/2606.15352
Authors: Naoyuki Uchida
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 14 pages, 5 figures. Ancillary files: reproducibility scripts (symbolic verification, evaluation, and figure generation)

Abstract:OKLCH – the cylindrical (lightness, chroma, hue) form of Ottosson’s Oklab color space – is the interpolation space recommended by CSS Color 4 for gradients and color-mix(), and it is now broadly deployed. Its polar parameterization, however, casts color near the neutral axis in two ways: (1) an inter-hue detour between two chromatic endpoints that sweeps through an unintended hue (blue to yellow visibly passing through green), and (2) an off-line bow when one endpoint is achromatic. Existing remedies are uniformly two-valued – a threshold switch that fires only at an achromatic endpoint – so they address only (2); on chromatic pairs every one of them reduces to raw OKLCH, leaving the (1) inter-hue cast untreated. We introduce Continuous Oklab fallback (COFb), a one-parameter, differentiable chroma gate $w(C)=C^n/(C^n+\sigma^n)$ that continuously blends the OKLCH path toward the linear Oklab path as chroma falls. A single gate reduces the (1) cast that the two-valued family leaves untreated and unifies the handling of (1) and (2) without any endpoint test. We characterize a cast-hue trade-off frontier, adopt a default ($n=1$, the rational Michaelis-Menten form; $\sigma\approx0.19$ for a typical sRGB palette, from a normalization-independent cast-half criterion), and verify the gate’s properties symbolically. At the default, COFb halves the inter-hue path detour (mean lateral deviation -49.5%, chroma-weighted hue excursion -35.5%). We also state the method’s limits: on (2) alone the two-valued switch remains better, and like any Cartesian blend COFb does not preserve chroma. In deployment, COFb runs entirely in plain Oklab (a,b) to sRGB, so it serves as a fallback that delivers the same cast-reduced gradients where modern CSS color interpolation (color-mix(in oklch) and the like) is unavailable – older engines, image and video pipelines, or GPU shaders.
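
The gate formula from the abstract, w(C) = C^n / (C^n + sigma^n), is simple to evaluate directly. In the sketch below, `chroma_gate` implements that formula with the stated defaults; `blend_paths` is an assumed reading of "blends the OKLCH path toward the linear Oklab path" as a linear mix in the (a, b) plane, and which chroma value feeds the gate is likewise left to the caller as an assumption.

```python
def chroma_gate(chroma, sigma=0.19, n=1.0):
    """w(C) = C^n / (C^n + sigma^n): 0 on the neutral axis, approaching 1 at high chroma."""
    if chroma <= 0.0:
        return 0.0
    c_n = chroma ** n
    return c_n / (c_n + sigma ** n)

def blend_paths(oklch_ab, oklab_ab, chroma, sigma=0.19, n=1.0):
    """Mix an OKLCH-path point with the linear-Oklab point in (a, b), weighted by the gate."""
    w = chroma_gate(chroma, sigma, n)
    return tuple(w * p + (1.0 - w) * q for p, q in zip(oklch_ab, oklab_ab))

half = chroma_gate(0.19)                               # gate at C = sigma
mixed = blend_paths((0.10, 0.20), (0.00, 0.00), 0.19)  # hypothetical (a, b) points
```

At C = sigma the gate is exactly one half for any `n`, so sigma sets the chroma at which the two paths contribute equally, while `n` controls how sharply the gate transitions.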

[CV-276] Polyp-D2ATL: Deep Domain-Adaptive Transfer Learning for Colorectal Polyp Classification under Label Distribution Shift

Link: https://arxiv.org/abs/2606.15000
Authors: Sajad Jabarzadeh Ghandilu,Maryam Sadat Hosseini Azad,Shahriar Baradaran Shokouhi,Emad Fatemizadeh
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 5 figures, 7 tables

Abstract:Early and highly accurate prediction of colorectal polyps, as an important sign of one of the most dangerous types of cancer, will result in saving more lives. Despite the advancements in colorectal polyp classification, many challenges remain in obtaining an automated polyp prediction system that is able to diagnose the difficult-to-predict polyps accompanied by different features in real scenarios, where the model can handle imbalanced data, label distribution shift, and cross-modality generalization successfully. In this study, we propose Polyp-D2ATL, a novel framework accompanied by a specific training strategy, which mitigates these limitations and effectively predicts the different classes of polyps belonging to the NICE classification. Our extensive experiments on the PICCOLO validation and test sets demonstrate that the proposed Polyp-D2ATL significantly outperforms existing state-of-the-art models across various reliable metrics, achieving an accuracy of 82.38%, a Macro-F1 of 77.49%, and a specificity of 87.47% on the validation set, alongside consistent improvements on the held-out test set which demonstrates the generalization capacity and clinical applicability of the proposed approach.

[CV-277] Leptomeningeal Collateral Detection on DSA via Vessel-Graph Neural Networks

Link: https://arxiv.org/abs/2606.14828
Authors: Junyong Cao,Hakim Baazaoui,Chinmay Prabhakar,Suprosanna Shit,Lukas Bastian Otto,Susanne Wegener,Bjoern Menze,Ezequiel de la Rosa
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Leptomeningeal collaterals (LMCs) are an important prognostic factor in acute ischemic stroke. Existing automated methods rely on CT angiography (CTA), but individual LMCs are often too small to be resolved on CTA, limiting these methods to coarse collateral scoring. Digital subtraction angiography (DSA) visualizes individual collaterals at superior resolution, yet current assessment remains subjective, relying on manual grading scales that suffer from poor inter-rater agreement. We present a framework that formulates collateral detection as the classification of individual vessel segments on a graph derived from DSA. A hybrid graph-pixel architecture combines a topology-aware graph branch with a dense pixel branch, fused in a shared node-probability space. In a five-fold cross-validation setting, the fused model achieves a PR-AUC of 0.434, outperforming the graph-only (0.403) and pixel-only (0.362) baselines. To our knowledge, this is the first method to enable the individualization of LMCs in DSA, allowing for precise per-vessel quantitative assessment. This integration shifts DSA assessment toward objective evaluation, supporting future biomarker and pattern discovery for individual LMCs.

[CV-278] Explainable Task-Oriented Token Communication for AI-Native 6G Networks

Link: https://arxiv.org/abs/2606.14808
Authors: Feibo Jiang, Lei Mao, Li Dong, Kezhi Wang, Cunhua Pan, Jiangzhou Wang
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)
Comments:

Abstract:The integration of Foundation Models (FMs) and wireless communications is driving the evolution of image communication from bit-accurate transmission toward task-oriented transmission. However, existing task-oriented image communication methods still face three major challenges: insufficient task-oriented Token representation, inadequate collaboration between Visual Tokens and Task Tokens, and limited interpretability of task decisions. To address these challenges, we propose an Explainable Task-Oriented Token Communication (ET-TokenCom) framework. By treating Tokens as unified units for information representation and transmission, the proposed framework constructs an end-to-end communication link that spans visual perception, wireless transmission, and task reasoning. At the transmitter, the ET-TokenCom framework extracts Visual Tokens from images to preserve low-level visual information. Meanwhile, Task Tokens generated by the FM are introduced to represent the target information and decision intent required by the current task. A Cross-Modal Attention (CMA) fusion mechanism is further designed, enabling Task Tokens to explicitly guide the selection, weighting, and transmission of Visual Tokens. At the receiver, the framework integrates Token decoding with an explainable output mechanism, where attention heatmaps are generated to highlight critical perceptual regions under different task objectives and reveal the influence of Task Tokens on the outputs. Finally, simulation results validate the effectiveness and robustness of the proposed ET-TokenCom framework.

[CV-279] Pixel-TTS: Image based Text Rendering for Robust Text-to-Speech

Link: https://arxiv.org/abs/2606.14750
Authors: Adarsh Arigala, Arjun Gangwar, S Umesh, Yova Kementchedjhieva
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Comments: 5 pages, 4 figures, 4 tables

Abstract:Recent advances in pixel-based text modeling show that representing text as images enables models to exploit visual cues for language understanding. Grounding text in its visual form allows structurally similar characters with different Unicode encodings to produce similar embeddings, benefiting cross-lingual and zero-shot scenarios. Conventional text-based approaches treat each character independently, limiting generalization to unseen characters and requiring embedding expansion during cross-lingual adaptation. We propose Pixel-TTS, the first framework for visually grounded speech synthesis. It renders text as images and projects them through a 2D convolutional layer to generate embeddings. This design eliminates embedding matrix expansion during fine-tuning while improving robustness to unseen characters and orthographic variations. Extensive experiments show Pixel-TTS achieves competitive performance with strong baselines, faster convergence and robust zero-shot generalization.

Artificial Intelligence

[AI-0] HAMON: Passive Optical Sequence Mixing for Long-Horizon Forecasting

Link: https://arxiv.org/abs/2606.17028
Authors: Alper Yıldırım
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments:

Abstract:Simple linear and frequency-domain models remain surprisingly competitive in long-horizon time-series forecasting, and recent mechanistic evidence suggests that standard forecasting benchmarks may not require the dense superposed representations that make transformers powerful in other domains. This raises a substrate-level question: if the core forecasting operator is often low-complexity and approximately linear, does it need to be implemented as learned digital temporal mixing? We introduce HAMON, a passive diffractive optical forecasting core in which historical values are encoded onto an optical aperture, future positions are left dark, and cascaded trainable phase masks with free-space diffraction shape the forecast directly in the output field. At inference, prediction is performed by a single passive optical propagation pass with no trainable digital sequence-mixing layer. Across standard benchmarks, HAMON outperforms the strongest digital baselines considered on ETTm2 at all horizons and on ETTh2 at all but the longest horizon, improving MSE by up to 14% and doing so consistently across horizons rather than at isolated points. It is competitive on Weather and trails the strongest baselines on the remaining ETT settings and on the high-channel-count Traffic and Electricity datasets. Phase encoding, intensity-compatible readout, and phase-scrambling ablations, together with a TorchOptics cross-simulator check, indicate that the forecasts arise from the data-bearing optical field rather than from a digital forecasting head. Because the passive core uses standard Fourier optics, HAMON defines a concrete target for optical hardware and for passive physical sequence mixing. 

[AI-1] TuneJury: An Open Metric for Improving Music Generation Preference Alignment

Link: https://arxiv.org/abs/2606.17006
Authors: Yonghyun Kim, Junwon Lee, Haiwen Xia, Yinghao Ma, Junghyun Koo, Koichi Saito, Yuki Mitsufuji, Chris Donahue
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: 32 pages, 9 figures

Abstract:We introduce TuneJury, an open, instance-level pairwise reward model for text-to-music that predicts a music preference score from a text prompt and an audio clip. The released checkpoint is trained on publicly available human-preference labels covering arena-style (A vs. B) votes, metric-alignment preference pairs, crowdsourced pairwise comparisons, and expert aesthetic ratings. The predicted score margin between two clips is well calibrated on our held-out test split, supporting data filtering via a simple score threshold. TuneJury generalizes to both held-out test pairs and out-of-distribution benchmarks, remaining competitive with prior baselines on the latter. For generators released after training, we introduce anchor calibration, a post-hoc, per-system Bradley-Terry calibration that recovers agreement at substantially better data efficiency than from-scratch retraining. The same frozen reward drives consistent reward-axis gains across three downstream applications: inference-time best-of-N selection, DITTO-style latent optimization, and expert-iteration post-training. TuneJury is available at this https URL.
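
As a rough illustration of how a score margin relates to a pairwise preference, the sketch below uses the standard Bradley-Terry (logistic) link. Whether TuneJury uses exactly this link is an assumption, and `keep_pair` with its `min_margin` value is a hypothetical filter, not the released one.

```python
import math


def preference_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that clip A is preferred over clip B, given scalar scores."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def keep_pair(score_a: float, score_b: float, min_margin: float = 1.0) -> bool:
    """Hypothetical data filter: keep a pair only when the absolute score margin reaches a threshold."""
    return abs(score_a - score_b) >= min_margin
```

A margin of zero maps to a probability of 0.5, and larger margins push the probability toward 0 or 1, which is why a single threshold on the margin can act as a confidence filter.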

[AI-2] Bayesian Inference and Decision Audits for Public Archives of Frontier AI Evaluations

Link: https://arxiv.org/abs/2606.17005
Authors: Yanan Long
Categories: Artificial Intelligence (cs.AI); Methodology (stat.ME)
Comments:

Abstract:Public AI evaluations are often read as terminal leaderboards, yet the underlying evidence is a selective time series shaped by reporting rules, benchmark revisions, and missingness. Repeated public archives for LiveBench and Open LLM Leaderboard v2 serve as the primary longitudinal record; LMArena provides a preference stress test; and GAIA and tau-bench contribute limited agentic pilots. Together, these archives instantiate a Bayesian inference problem: under a fixed reporting convention, one constructed terminal-only example over 1,000 systems is compatible with two pre-terminal histories, yielding times of 23.03 or 75.13 to reach within 0.05 of the ceiling under the same terminal-tail model. In synthetic posterior comparisons, action-facing diagnostics differ across observation regimes. The candidate selection-aware frontier model fails synthetic recovery, objective-archive prediction, preference transfer, and uncertainty calibration; correspondingly, fixed audit gates reject its stronger claims. An archive-and-adjudication protocol reconstructs public evaluation histories, isolates a verified timing boundary, and falsifies unsupported frontier claims.

[AI-3] When in Doubt Plan It Out: Committed Small Language Model Deliberation for Reactive Reinforcement Learning ICML2026

Link: https://arxiv.org/abs/2606.16995
Authors: Nathan Gavenski, Juarez Monteiro, Francisco Galuppo, Adriano Veloso, Odinaldo Rodrigues
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: LM4Plan Workshop at ICML 2026

Abstract:Reinforcement Learning (RL) policies often degrade in unfamiliar environments because they lack explicit deliberation. We propose Plan, Align, Commit, Think (PACT), a hybrid architecture that combines a fast, reactive RL policy with a slow, deliberative Small Language Model (SLM) planner. PACT invokes the SLM asynchronously to generate and validate candidate action plans. Once a plan is verified through simulation as safe, feasible, and complete, it is executed directly, bypassing the RL policy without retraining or modifying it. Evaluated on three FrozenLake configurations of increasing difficulty, PACT outperforms all baselines while relying on a 2B-parameter SLM backbone, suggesting that deliberative planning and reactive execution are more powerful in concert than either is alone in these settings.
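
The control flow described in the abstract (verify a plan in simulation, then execute it in place of the RL policy) can be sketched as follows. All function names and the callback signatures are hypothetical; the asynchronous SLM call and the feasibility check are left out, so this covers only the safety and completeness checks and the fallback to the reactive policy.

```python
def verify_plan(plan, start_state, simulate_step, is_goal, is_safe) -> bool:
    """Roll the plan forward in a simulator; accept it only if every state is safe and the goal is reached."""
    state = start_state
    for action in plan:
        state = simulate_step(state, action)
        if not is_safe(state):
            return False
    return is_goal(state)


def select_action(state, rl_policy, verified_plan: list):
    """Use the next step of a verified plan when one is pending; otherwise fall back to the RL policy."""
    if verified_plan:
        return verified_plan.pop(0)
    return rl_policy(state)
```

Because the RL policy is only bypassed, never modified, the fallback branch keeps its original behavior whenever no verified plan is available.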

[AI-4] Stable Menus of Public Goods: AI-Enabled Progress

Link: https://arxiv.org/abs/2606.16989
Authors: Sara Fish
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Accepted to the EC’26 Workshop on AI-Driven Research in EconCS

Abstract:Using an open problem from the EC 2025 paper “Stable Menus of Public Goods” as a testbed, we conduct experiments to understand the effectiveness of different AI-for-EconCS research workflows. Specifically, we study three questions: Does providing human intuition in the prompt help? Does automated multi-turn interaction help? And, does an LLM outperform a first-year PhD student? Regarding the first two questions, we provide evidence for the following workflow suggestions: (1) prompting with human intuition can encourage the LLM to have better “taste”, (2) multi-turn workflows help when the pipeline encourages “ambitious” steps. Regarding the third question, using an unpublished manuscript written by the paper’s senior authors prior to collaborating with the first-year PhD student, we compare the effectiveness of the LLM with that of the first-year PhD student, and find that the LLM is slightly less effective.

[AI-5] Consensus-based Agentic Large Language Model Framework for Harmonized Tariff Schedule Code Classification

Link: https://arxiv.org/abs/2606.16987
Authors: Truong Thanh Hung Nguyen, Khanh Van Quynh Nguyen, Hoang-Loc Cao, Tri Duong, Phuc Ho, Van Pham, Loc Nguyen, Hung Cao
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at the 3rd International Conference of Resilience by Technology and Design (RTD 2026)

Abstract:Accurate Harmonized Tariff Schedule (HTS) code classification is essential for customs clearance, duty assessment, trade statistics, and regulatory compliance in maritime logistics. However, exact HTS classification remains challenging because product descriptions are often short, incomplete, or ambiguous, while correct classification depends on hierarchical tariff structures, legal notes, and jurisdiction-specific rules. This paper proposes an agentic large language model (LLM) framework for Canadian 10-digit HTS code classification in smart-port and maritime logistics environments. The framework integrates multi-agent information retrieval, semantic retrieval over official tariff documents, evidence-grounded reasoning, consensus-based validation, element-wise voting across hierarchical code components, confidence estimation, and human-in-the-loop escalation. We evaluate the framework on a private dataset of 3,300 domain-expert-labeled product records collected from logistics and delivery contexts. Experimental results show that exact 10-digit classification remains difficult even for advanced LLMs, with performance decreasing from coarse chapter-level prediction to fine-grained tariff and statistical suffix assignment. These findings demonstrate the need for evidence-grounded, uncertainty-aware, and human-centered classification workflows rather than fully autonomous single-step prediction. The proposed framework supports more interpretable, accountable, and compliance-oriented HTS classification for maritime logistics and smart-port operations. Our code is available at this https URL.
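
One way to picture "element-wise voting across hierarchical code components" is a majority vote per code segment, where only candidates that agree with the winning prefix so far vote on the next segment. The 2-2-2-2-2 digit split, the prefix-conditioned voting rule, and the function name are assumptions for illustration, not the paper's procedure.

```python
from collections import Counter

# Assumed split of a 10-digit code: chapter, heading, subheading, tariff item, statistical suffix.
SEGMENTS = [(0, 2), (2, 4), (4, 6), (6, 8), (8, 10)]


def vote_elementwise(candidates):
    """Return (voted_code, per_segment_agreement) from several candidate 10-digit codes."""
    digits = [c.replace(".", "") for c in candidates]
    if not digits or any(len(d) != 10 or not d.isdigit() for d in digits):
        raise ValueError("each candidate must contain exactly 10 digits")
    remaining, voted, agreement = digits, [], []
    for start, end in SEGMENTS:
        counts = Counter(d[start:end] for d in remaining)
        winner, votes = counts.most_common(1)[0]
        voted.append(winner)
        agreement.append(votes / len(remaining))
        # Only candidates consistent with the chosen prefix vote on finer levels.
        remaining = [d for d in remaining if d[start:end] == winner]
    return "".join(voted), agreement
```

The per-segment agreement can serve as a simple confidence signal: a low value at any level is a natural trigger for the human-in-the-loop escalation the abstract mentions.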

[AI-6] The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers

Link: https://arxiv.org/abs/2606.16974
Authors: Kevin L Coakley, Thijs Snelleman, Holger Hoos, Odd Erik Gundersen
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract: The reproducibility crisis has directed the AI research community toward improving documentation practices. Several studies have identified methodological issues, and in response, the most impactful venues in the field have introduced reproducibility checklists. We seek to understand whether documentation practices have changed over time by assessing all published papers at five leading AI conferences over the past decade. Seven reproducibility variables were identified, quality-assured and used to analyse 56 800 publications. Our analysis reveals that in the period 2014 to 2024, documentation practices have improved; papers sharing both code and data increased nearly sixfold, from 11% to 64%. Building on empirical reproducibility rates from a prior study, we estimate (inferred from documentation practices, not direct testing) that reproducibility increased from 28% in 2014 to 64% in 2024. Improvements in documentation practices predate the introduction of reproducibility checklists, suggesting these changes reflect a broader movement toward open science rather than a direct response to formal requirements.

[AI-7] Probing Low Frame Rate Degradation in Neural Audio Codecs INTERSPEECH2026

Link: https://arxiv.org/abs/2606.16969
Authors: Alex Gichamba, Moise Busogi
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted at Interspeech 2026

Abstract:Low frame rates in neural audio codecs are attractive for autoregressive speech synthesis, where the generation cost scales linearly with the sequence length. Recent work has demonstrated that codecs can operate at 12.5 Hz and below, but the mechanisms underlying low frame rate degradation remain insufficiently understood. We investigate these mechanisms through a controlled frame rate ablation. We reproduce a quality cliff at 6.25 Hz reported in previous works and evaluate candidate explanations: phonemic collisions and codebook saturation, neither of which shows evidence of a fundamental barrier. The cliff is instead caused by suboptimal training configuration: fixed clip duration during training yields too few tokens at low frame rates, starving the decoder of inter-token context. Once corrected, WER degrades smoothly with phonemic load down to 3.1 Hz and 1.6 Hz, suggesting the inference-time efficiency gains of low frame rate codecs are more accessible than previously assumed.
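
The token-starvation argument is simple arithmetic: a clip of fixed duration yields duration times frame rate tokens, so halving the frame rate halves the context the decoder sees. The sketch below makes this concrete; the 4-second clip length and the helper names are hypothetical, since the abstract does not state the training clip duration.

```python
def tokens_per_clip(clip_seconds: float, frame_rate_hz: float) -> float:
    """Number of codec frames (tokens per codebook) produced by one training clip."""
    return clip_seconds * frame_rate_hz


def clip_seconds_for(target_tokens: float, frame_rate_hz: float) -> float:
    """Clip duration needed to keep the token count fixed when the frame rate changes."""
    return target_tokens / frame_rate_hz
```

Under these assumed numbers, a 4-second clip gives 50 tokens at 12.5 Hz but only 25 at 6.25 Hz, and restoring 50 tokens at 6.25 Hz would take an 8-second clip.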

[AI-8] Phantoms and Disclosures: a Causal Framework for Auditing Synthetic Data

Link: https://arxiv.org/abs/2606.16952
Authors: Kareem Amin, Rudrajit Das, Alessandro Epasto, Adel Javanmard, Dennis Kraft, Mónica Ribero, Sergei Vassilvitskii
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP); Methodology (stat.ME); Machine Learning (stat.ML)
Comments: 35 pages, 10 tables, 5 figures

Abstract: The rapid adoption of generative AI and Large Language Models (LLMs) has spurred interest in synthetic data as a privacy-preserving alternative to sensitive real-world datasets. However, generating high-utility synthetic data often carries the risk of memorizing and regurgitating private information from the training corpus. In this work, we present a customizable empirical auditing framework designed to detect and explain such data disclosures. Our framework introduces a mechanism to distinguish between “true disclosures”, where the system directly reproduces a user’s information, and “phantom disclosures”, where the system incidentally generates a user’s data. By partitioning input data into training and holdout sets and applying rigorous statistical hypothesis testing, we determine if observed disclosures are consistent with strict privacy baselines, such as zero-learning or specific Differential Privacy (DP) bounds. Crucially, this approach requires no model access, no canary insertion, and no reference model training; it needs only the synthetic output and a held-out control set. We demonstrate that this framework effectively functions as a membership inference attack, providing empirical lower bounds on privacy leakage that are tighter than prior data-based auditing methods. Our approach is model-agnostic, applies to any synthetic data generation mechanism, and requires orders of magnitude fewer computational resources than shadow-model or canary-based alternatives.

[AI-9] Scalable Circuit Learning for Interpreting Large Language Models ICML2026

Link: https://arxiv.org/abs/2606.16939
Authors: Naiyu Yin, Dennis Wei, Tian Gao, Amit Dhurandhar, Karthikeyan Natesan Ramamurthy, Yue Yu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to the Mechanistic Interpretability Workshop at ICML 2026

Abstract:A prominent research direction in mechanistic interpretability is learning sparse circuits over LLM components to reveal how they jointly produce model behavior. However, raw neurons are polysemantic, making learned circuits hard to interpret. Sparse autoencoder (SAE) features alleviate this, but their high dimensionality makes existing intervention-based circuit learning methods computationally prohibitive. We propose CircuitLasso, a scalable circuit-learning approach based on sparse linear regression. CircuitLasso recovers circuits whose structural accuracy matches that of state-of-the-art intervention-based methods on the benchmark data, at a fraction of the computational cost. For interpretability, CircuitLasso efficiently uncovers relationships among SAE features, showing how human-interpretable semantic features propagate through the model and influence its predictions. Finally, we validate the utility of our learned circuits by leveraging their insights to achieve comparable performance at substantially lower cost on a domain-generalization task.
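
For orientation, the generic lasso (sparse linear regression) objective is shown below. Treating one downstream feature activation as the target y and the upstream SAE feature activations as the columns of X, over m samples and p features, is an assumption made here for illustration; the abstract does not specify how CircuitLasso sets up its regressions.

```latex
\hat{\beta} \;=\; \arg\min_{\beta \in \mathbb{R}^{p}} \; \frac{1}{2m}\,\lVert y - X\beta \rVert_2^2 \;+\; \lambda\,\lVert \beta \rVert_1
```

The L1 penalty drives most coefficients to exactly zero, so the nonzero entries of the solution can be read as candidate edges into the target feature, with \lambda controlling how sparse the recovered circuit is.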

[AI-10] CrossMaps: Confidence-Aware Open-Vocabulary Semantic Mapping for Rover Navigation ICRA

Link: https://arxiv.org/abs/2606.16935
Authors: Jan-Niklas Klein, Sona Ghahremani, Christian Medeiros Adriano, Holger Giese
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: IEEE International Conference on Robotics and Automation (ICRA) 2026: ROSE International Workshop on Robotics Software Engineering, June 01, 2026, Vienna, Austria

Abstract:Rovers rely on perception to maintain spatial maps that encode both objects and sensor quality (e.g., range reliability, lighting artifacts, data density), guiding data fusion, embedding updates, and navigation under partial observability. To study these coupled perception-navigation processes, we present CrossMaps, a real-time confidence-aware open-vocabulary semantic mapping pipeline that constructs language-queryable maps from RGB-D data. Building on VLMaps-style approaches, CrossMaps integrates multi-scale CLIP embeddings with confidence-aware fusion and a dual-memory architecture consisting of Short-Term Memory (STM) and Long-Term Memory (LTM). The STM aggregates noisy visual observations using geometric, semantic, and temporal confidence cues, while confident and coherent cells are promoted to the LTM as persistent semantic landmarks. Designed for deployment with a Jetson Orin-powered UGV alongside SLAM, CrossMaps runs in real time and produces semantic heatmaps that can be queried with natural language to guide rover navigation.

[AI-11] A Unified Causal-Origin Taxonomy of Distributional Shifts in Reinforcement Learning

Link: https://arxiv.org/abs/2606.16933
Authors: Ardianto Wibowo, Paulo E Santos, Amer Baghdadi, Matthew Stephenson, Karl Sammut, Jean-Philippe Diguet
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: The paper is currently under review at the Journal of Artificial Intelligence Research (JAIR)

Abstract:Reinforcement learning (RL) systems often degrade when operating conditions differ from those previously encountered, reflecting distributional shifts in the underlying data-generating process. Such shifts may occur between training and evaluation, as in In-Distribution (ID) and Out-of-Distribution (OOD) generalization, or within non-stationary settings where environment dynamics evolve over time. However, the formal relationship between these views remains unclear, and existing work mainly focuses on mitigation rather than the causal origin of shift within the agent-environment interaction. This work develops a unified causal-origin taxonomy that characterizes sources of distributional shift in RL and relates ID/OOD generalization to non-stationary settings. We transfer the classical dataset-shift principle from supervised learning to RL by reformulating distributional shift in terms of the generative interaction process. Using a Partially Observable Markov Decision Process (POMDP), we decompose the interaction into structural components, including the state distribution, observation process, policy, reward, and transition dynamics, together with the shifted-time boundary. The proposed taxonomy distinguishes internal, agent-driven, and external, environment-driven, distributional shifts. The shifted-time boundary perspective further characterizes explicit, implicit, and hybrid shifts. This formulation unifies ID/OOD generalization and non-stationarity as structured changes in the underlying process. We also introduce an evaluation framework for measuring shift impact and adaptation through performance degradation and recovery metrics. By grounding distributional shift in the causal-origin structure of RL, this work supports systematic analysis of robustness under distributional shift.

[AI-12] RAID: Semantic Graph Diffusion for True Cold-Start and Cross-Lingual Forecasting

Link: https://arxiv.org/abs/2606.16925
Authors: Arunkumar V, Manoranjan Gandhudi, Gangadharan G. R., Arun Prakash, S. Senthilkumar
Categories: Artificial Intelligence (cs.AI)
Comments: 25 pages, 4 figures, 8 tables

Abstract: Time-series foundation models show strong transfer performance when given a non-empty history window. However, true cold-start scenarios, where a new item has no prior observations, violate this assumption. We propose RAID (Retrieval-Augmented Iterative Diffusion), a framework that replaces history-based correlation learning with metadata-driven semantic retrieval and graph-conditioned diffusion. RAID maps textual metadata into a shared semantic space using a frozen multilingual embedding model and constructs an inductive retrieval graph that extends naturally to unseen items. It first forms a base forecast by aggregating information from semantically related neighbors, then refines this forecast with a gated diffusion module to model residual uncertainty. Under a strict true cold-start protocol, RAID outperforms strong foundation models and competitive baselines on both forecasting accuracy and prediction interval coverage, while reducing inference latency by an order of magnitude through non-autoregressive decoding. The shared semantic space also enables zero-shot cross-lingual transfer, allowing a model trained on English descriptions to generalize to items described in other languages without direct supervision.

[AI-13] MA-SBI: Misspecification-Aware Simulation-Based Inference via Side-Channel Guidance

Link: https://arxiv.org/abs/2606.16923
Authors: Arunkumar V, Manoranjan Gandhudi, Gangadharan G. R., Arun Prakash, S. Senthilkumar
Categories: Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 23 pages, 9 figures, 12 tables

Abstract:Simulation-based inference (SBI) of latent parameters is often hindered by simulator misspecification, the mismatch between simulated and real-world observations caused by inherent modeling simplifications. RoPE, the recent state-of-the-art for robust SBI, addresses this through optimal transport between learned representations of real and simulated observations, but requires ground-truth parameter calibration pairs that are typically unavailable in the very settings where SBI is needed. What practitioners do have is unstructured side-information such as regime labels, instruction text, and policy bulletins. We propose Misspecification-Aware Simulation-Based Inference (MA-SBI), a calibration-free framework that turns this side-channel into a posterior correction. A learned corrector maps side-channel text to an observation-space shift applied before any pre-trained amortized posterior, requiring no retraining and no parameter ground-truth. Our main theorem bounds achievable bias reduction by the mutual information between misspecification and side-channel, with a non-vacuous constant that extends to all sub-Gaussian noise via Donsker-Varadhan. On hide-the-calibration benchmarks, MA-SBI with text alone matches the oracle posterior across 10 seeds and two backbones (TOST equivalence), while RoPE given more data does not. The two approaches are complementary: where misspecification is structural and recoverable from parameter pairs, RoPE dominates, as the theory predicts. A stochastic variant improves posterior-predictive log-likelihood on real COVID and OxCGRT epidemiological data, and correctly leaves the posterior unchanged on a well-specified cognitive-science corpus.

[AI-14] Demystifying Variance in Circuit Discovery of LLMs

Link: https://arxiv.org/abs/2606.16920
Authors: Frank Zhengqing Wu, Francesco Tonin, Volkan Cevher
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Circuit discovery is a key technique in mechanistic interpretability to pinpoint the model components that are crucial for performing a given task. Although the current state-of-the-art method (EAP-IG) performs well on the metric of (un)faithfulness, it suffers from substantial variability. This includes resampling variance, where the circuit changes when we probe with a new batch of data from the same distribution; rephrasing variance, where the discovered circuit shifts when the prompts are rephrased; and sample-wise variance, where a circuit with low population unfaithfulness exhibits large fluctuations in unfaithfulness across individual samples. This paper studies the roots of these variances. We demonstrate that CEAP, our new circuit discovery method that improves upon EAP-IG with a theoretical guarantee, can substantially lessen resampling variance. We further show that rephrasing variance arises because prompts with different templates tend to activate different circuits in the model. This leads us to argue that it may be challenging to find a comprehensive circuit that explains and controls the model’s behavior on a task, which can be expressed in countless templates, suggesting that LLMs may be inherently hard to steer. We show that sparsity, which has been claimed to form more compact and interpretable task circuits, fails to solve this problem. Regarding sample-wise variance, we argue that it is largely benign: extremely poor unfaithfulness scores often stem from how unfaithfulness is defined, rather than from defects in the measured circuits. We show that the magnitude of unfaithfulness is affected by selective contribution scaling, a neural mechanism that accounts for the extremely poor scores sometimes observed. 

[AI-15] Greed Is Learned: Visible Incentives as Reward-Hacking Triggers

Link: https://arxiv.org/abs/2606.16914
Authors: Tong Che, Rui Wu
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract: Deployed agents increasingly act with their reward proxy in view, such as a balance, score, or KPI dashboard. We show that reinforcement learning can make a policy addicted to such a visible self-benefit channel. It chases the displayed payoff across held-out domains, sacrifices the true task to do so, and follows the channel wherever we rewrite it, while policies that never saw the channel stay honest. We call this reward-channel addiction and study it in MoneyWorld, a synthetic sandbox. The addiction can flip a model’s safety alignment: trained only on innocuous money tasks with no safety content, the model abandons the safe action it otherwise always takes whenever a dashboard pays for an unsafe one, and reverts to safe once the channel is hidden. This learned bribe replicates across model scales and families. Blindly optimizing super-capable, next-generation AI on KPIs or P&L can be dangerous for alignment. Greed is learned when following such a channel pays.

[AI-16] Binary Tracking for Spatial QA and Navigation with Open Vision-Language Models

链接: https://arxiv.org/abs/2606.16902
作者: Dongbin Na,Chanwoo Kim,Soonbin Rho,Giyun Choi,Gangbok Lee,Dooyoung Hong
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 21 pages, 4 figures, 15 tables. Project page: this https URL ; Code and dataset: this https URL

点击查看摘要

Abstract: This work addresses spatial question answering for service robots traversing long egocentric routes. Given a query such as “where can I find a dry cleaner on the way back home?”, the system returns a metric coordinate that downstream navigation components can act on. Prior Spatial Question Answering approaches leverage retrieval-augmented agents built on closed-source models such as GPT-4o for path exploration. However, robots operating in the real world often cannot reliably depend on online closed-source models due to network instability, communication latency, and deployment cost. This creates a need for open-source Spatial Question Answering approaches that can run onboard the robot, yet prior research in this direction remains limited. This work proposes BinTrack, a simple yet effective, fully open-source spatial-localization agent that leverages the temporal ordering of a robot’s trajectory. BinTrack performs a binary search over the trajectory segments between two anchor landmarks identified from a query. It improves overall accuracy by up to 22.8% over other open-source implementations and even matches the reported closed-source model result on the global category of the SpaceLocQA benchmark, the most challenging setting, which has so far required strong reasoning agents such as GPT-4o. Furthermore, its optimized inference strategy consistently yields more than a 1.5x inference speedup over previous approaches. Finally, this work releases GangnamLoop, a novel and practical multi-trip outdoor benchmark collected by deploying a real quadruped robot on public streets under an anonymization policy. It revisits the same locations under different outdoor conditions and pairs the robot’s low viewpoint with the human owner’s. The source code and datasets are publicly available at this https URL
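
The binary-search idea can be sketched over frame indices between two anchors. The `target_passed` callback is hypothetical: it stands in for whatever model judgement decides whether the robot has already passed the queried place at a given frame, and the sketch assumes that judgement is monotone along the trajectory and true at the end anchor.

```python
def locate_between_anchors(start: int, end: int, target_passed) -> int:
    """Return the smallest index in [start, end] at which target_passed(index) is True.

    Assumes target_passed is False before the target, True from the target onward,
    and True at `end`.
    """
    lo, hi = start, end
    while lo < hi:
        mid = (lo + hi) // 2
        if target_passed(mid):
            hi = mid          # target is at mid or earlier
        else:
            lo = mid + 1      # target lies strictly after mid
    return lo
```

Each query halves the remaining span, so the number of model calls grows with the logarithm of the segment length rather than linearly, which is consistent with the speedup motivation in the abstract.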

[AI-17] Beyond Weights and Gradients: A Taxonomy of Federated Learning Messages

Link: https://arxiv.org/abs/2606.16891
Authors: Alvaro Javier Vargas Guerrero,Xinguang Wang,Quang Manh Doan,Guy Nagels
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 4 figures, 9 pages, with 7 pages of content

Abstract:Federated Learning is rapidly evolving beyond the exchange of traditional model weights and gradients, yet existing definitions fail to capture the full scope of modern payloads like synthetic data and federated analytics. This paper addresses the gap by proposing a formal mathematical definition of a federated message that accounts for both utility and privacy. We introduce a taxonomy that organizes these exchanges into three categories: model structures, statistical summaries, and data-conditioned representations. By evaluating these groups based on computational demands, communication costs, and privacy risks, we provide a clearer understanding of the trade-offs involved in decentralized training. Our review of 202 recent publications highlights a significant shift since 2021 toward diverse messaging paradigms, signaling a move away from standard deep learning updates toward more specialized information sharing. This framework provides a structured path for future research to optimize federated systems for varying hardware and security requirements.

[AI-18] Upper Bounds on the Generalization Error of Deep Learning Models via Local Robustness and Stability

Link: https://arxiv.org/abs/2606.16883
Authors: Abdul-Rauf Nuhu,Parham M. Kebria,Vahid Hemmati,Mahmoud N. Mahmoud,Edward Tunstel,Abdollah Homaifar
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Generalization is a critical property of data-driven models, particularly deep learning models deployed in safety-critical applications. Robustness-based generalization bounds have gained attention as a principled way to link robustness properties to generalization performance, often in a data-dependent manner. However, most existing bounds suffer from vacuousness in practical settings, yielding loose upper bounds that greatly exceed the actual error rates and limiting their usefulness for real-world evaluation. While this issue is often attributed to the uncertainty term, a substantial part of the problem originates from the robustness term itself, particularly for the 0-1 loss. Existing approaches typically treat the robustness term as a global measure, ignoring its variation across different sub-regions of the input space. In this work, we propose a generalization bound that addresses this limitation by scaling the robustness term according to the number of stable and unstable samples within each sub-region. Our bounds incorporate both data- and model-dependent factors while maintaining practical relevance (yielding tighter upper bounds on true error). Experiments on models trained on the ImageNet dataset show that our bounds remain consistently non-vacuous and achieve the tightest estimates among existing methods, closely aligning with empirical performance across a range of robust deep neural networks.

[AI-19] Deep Q-Learning on Hölder Spaces

Link: https://arxiv.org/abs/2606.16846
Authors: Qian Qi
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:We study the operator-theoretic core of Q-learning in continuous-time stochastic control with continuous states and actions. In value-based reinforcement learning, each Q-learning or DQN update is built from a Bellman optimality target; our analysis isolates this target in a diffusion setting and studies its regularity and approximation complexity. Under uniform ellipticity and Hölder-regular coefficients, we show that a Bellman update maps bounded inputs into an anisotropic regularity class, smoothing the state variable while leaving only Lipschitz dependence on the action variable. This yields a compact family of Bellman iterates and motivates a tensor-product DeepONet architecture adapted to the mixed regularity of the problem. We then derive explicit approximation and resource bounds, together with a stiffness-complexity trade-off as the time step \delta \to 0. The resulting theory makes a direct contribution to Q-learning theory at the level of Bellman target regularity and approximation in continuous stochastic control. At the same time, we do not claim a full convergence theorem for practical sampled Q-learning with exploration, replay, and stochastic gradient updates.

[AI-20] Beyond Models: Reflections on Engineering AI-enabled Systems in a Project-Based Course

Link: https://arxiv.org/abs/2606.16842
Authors: Amir Mashmool,Kishan Ravindra Sawant,Mojtaba Shahin,Nico Hochgeschwender,Rainer Koschke
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Teaching Software Engineering for AI-enabled systems entails addressing the integration of AI components within full-scale software architectures under realistic constraints. While machine learning courses emphasize model development, students often lack experience in architectural design, deployment, and monitoring of AI-enabled systems. Empirical evaluations of such system-oriented AI courses remain limited. This paper reflects on the design and implementation of a project-based master’s-level course titled AI Algorithms: Theory and Engineering, at the University of Bremen, in which students developed a movie recommendation system while making architectural design decisions to address challenges related to scalability, deployment, and evolving requirements. We conducted a mixed-methods study combining analyses of student submissions and questionnaire responses to investigate integration challenges, learning outcomes, and opportunities for improvement. Our results indicate persistent difficulties in early architectural decisions, heterogeneous ML integration, evolving requirements, and data management, largely due to uneven ML and software engineering expertise. From the educator’s perspective, the course fostered system-level reasoning and strengthened awareness of data-centric ML practices in AI-enabled systems.

[AI-21] ATOM-Bench: A Real-World Benchmark for Atomic Skills and Compositional Generalization in Manipulation Policies

Link: https://arxiv.org/abs/2606.16826
Authors: Zenan Wu,Bingqing Wei,Lu Liu,Zheqi He,Xi Wang,Jiakang Liu,Zehui Li,Guocai Yao,Jing-Shu Zheng,Xi Yang,Yongtao Wang
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Homepage: this https URL

Abstract:Generalist manipulation policies are increasingly presented as foundation models for robotic control, but their real-world generalization remains difficult to diagnose. A policy may succeed on demonstrated tasks while still failing to execute fine-grained atomic skills or recombine learned skills in new task structures. We introduce ATOM-Bench, a real-world benchmark for evaluating both atomic skills and compositional generalization in manipulation policies. ATOM-Bench factorizes tabletop manipulation into motor atoms and instruction atoms, and contains 30 atomic tasks and 24 held-out compositional tasks across paired single-arm and dual-arm robot tracks. We collect 3,000 human demonstrations for atomic fine-tuning and release both the demonstration data and evaluation rollout data to support reproducible real-world evaluation. Policies are fine-tuned on atomic tasks and evaluated on both atomic skill acquisition and held-out compositional tasks. We further introduce Atomic Score (AS) and Compositional Failure Share (CFS) to distinguish failures caused by weak atomic skills from failures caused by limited compositional reuse. Through 2,700 physical rollouts on five representative manipulation policies, we find that current policies can acquire simple instruction-grounding skills, but still struggle with fine-grained motor atoms, counting, and logical filtering. More importantly, strong atomic performance does not reliably transfer to held-out compositional tasks. ATOM-Bench provides a diagnostic testbed for studying whether failures arise from weak motor execution, poor instruction grounding, or limited compositional reuse.

[AI-22] GIST-CMTF: Goal-State Inference for Causal Minimal Tool Filtering in LLM Agents

Link: https://arxiv.org/abs/2606.16813
Authors: Rahul Suresh Babu,Rohit Shukla
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Tool-augmented LLM agents rely on runtime filtering to decide which tools should be visible at each step. Causal Minimal Tool Filtering (CMTF) reduces tool-choice confusion by exposing only the next causally necessary tool frontier, but it assumes that the user request has already been mapped to a symbolic goal state. In practice, requests such as “handle my appointment” or “take care of this email” may correspond to multiple possible goals. This creates wrong-goal execution, where an agent follows a valid causal tool path for an unintended objective. We introduce GIST-CMTF, a goal-state inference layer that predicts candidate symbolic goals over the same state-transition vocabulary used by CMTF, estimates ambiguity, and either applies CMTF or exposes clarification as a causal action that produces missing goal or state variables. We evaluate GIST-CMTF across seven model backends, six filtering methods, and 120 controlled tool-use tasks. GIST-CMTF achieves 97.0% task success, compared with 80.1% for top-goal CMTF and 82.9% for semantic-goal CMTF. It reduces wrong-goal execution from 19.4% under top-goal CMTF to 2.5%, while preserving the one-tool exposure of causal filtering and using substantially fewer tokens than all-tools exposure. These results suggest that reliable tool-augmented agents should validate goal state, not only tool relevance, before exposing external actions.
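The decision the abstract describes (apply causal filtering when the goal is clear, otherwise expose clarification as an action) can be sketched roughly as follows. Everything here is an illustrative assumption rather than the paper's method: the ambiguity estimate is a simple top-two probability margin, and the tool names `ask_clarification`, `calendar.update`, and `calendar.delete` are hypothetical.

```python
from typing import Dict, List


def choose_exposed_tools(
    goal_probs: Dict[str, float],
    next_tool_for_goal: Dict[str, str],
    margin: float = 0.3,
) -> List[str]:
    """Expose one causal tool for a clear goal, or a clarification action otherwise.

    Ambiguity is estimated (by assumption) as a small gap between the two
    most probable candidate goals.
    """
    ranked = sorted(goal_probs.items(), key=lambda kv: kv[1], reverse=True)
    top_goal, top_p = ranked[0]
    runner_up_p = ranked[1][1] if len(ranked) > 1 else 0.0
    if top_p - runner_up_p < margin:
        return ["ask_clarification"]      # goal is ambiguous: ask before acting
    return [next_tool_for_goal[top_goal]]  # goal is clear: one-tool exposure


# Hypothetical mapping from symbolic goals to the next causally necessary tool.
frontier = {"reschedule": "calendar.update", "cancel": "calendar.delete"}

# "handle my appointment" could plausibly mean either goal.
ambiguous = choose_exposed_tools({"reschedule": 0.45, "cancel": 0.40, "confirm": 0.15}, frontier)
# A request that clearly names rescheduling.
clear = choose_exposed_tools({"reschedule": 0.9, "cancel": 0.1}, frontier)
```

In both branches the agent still sees exactly one action, which mirrors the one-tool exposure the abstract says is preserved.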

[AI-23] Adaptive and Explicit safe: Triggering Latent Safety Awareness in Large Reasoning Models

Link: https://arxiv.org/abs/2606.16808
Authors: Ke Miao,Jiaxin Li,Hongliang Chen,Yuke Hu,Zhan Qin
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:While Large Reasoning Models (LRMs) excel at complex tasks, they remain highly vulnerable to sophisticated jailbreaks and direct harmful queries. To address this vulnerability, prior works depend heavily on external manual data annotation for safety alignment. However, we observe that LRMs can inherently identify safety risks when being re-presented with original queries alongside their own reasoning trajectories – a capability we term Latent Safety Awareness. To leverage this safety awareness, we first employ Supervised Fine-Tuning (SFT) to explicitly induce safe tags to trigger safety analysis and guidance following the initial reasoning content for unsafe queries, while preserving standard responses for general queries to ensure adaptive triggering. Subsequently, we apply Direct Preference Optimization (DPO) to further enhance the correctness and stability of the safety analysis and guidance. Notably, responses required for both training stages are entirely generated by models being optimized. With (Safe Trigger) SFT and DPO, experimental results demonstrate significant safety enhancement. For example, the Attack Success Rate (ASR) of DeepSeek-R1-Distill-Llama-8B, on average, drops 24.65% and 36.72% on harmful and jailbreak benchmarks, respectively. Finally, our Safe Trigger method exerts almost no negative impact on general performance or user experience.

[AI-24] LabOSBench: Benchmarking Computer Use Agents for Scientific Instrument Control

Link: https://arxiv.org/abs/2606.16802
Authors: Anqi Zou,Han Deng,Chengyu Zhang,Junquan Hu,Yu Wang,Yuxiang Xing,Aokai Zhang,Hanling Zhang,Zhaoyang Liu,Ben Fei,Zhihui Wang,Wanli Ouyang
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Current computer-use benchmarks primarily focus on software operation tasks in virtualized systems, whereas scientific instrumentation scenarios require coordinated control over complex interfaces, and feedback-driven parameter adjustment. However, directly evaluating agents on physical high-precision instruments is impractical due to high cost, safety risks, limited accessibility, and difficulty in ensuring reproducible evaluation. This motivates the need for a simulated yet realistic testbed that preserves the operational challenges of scientific instruments while enabling scalable and safe benchmarking. To this end, we introduce LabOSBench, a challenging benchmark for multimodal GUI agents built on a suite of web-based scientific-instrument simulators. Operating directly via a browser, LabOSBench avoids resource-heavy OS virtualization while supporting flexible task configuration and execution-based evaluation. Specifically, LabOSBench constructs 96 subtasks across eight instrument simulators, covering workflows from sample loading, alignment, parameter tuning, and data acquisition to result inspection. We evaluate general-purpose vision-language models, specialized GUI agent models, and advanced agentic frameworks at both subtask and end-to-end levels. Our experiments reveal that while existing agents can complete many structured GUI subtasks, they still struggle with feedback-driven operations and long-horizon workflow execution. Overall, LabOSBench provides a reproducible, low-cost testbed for advancing computer-using agents toward scientific-instrument control.

[AI-25] Decision-Weighted Flow Matching for Contextual Stochastic Optimization

Link: https://arxiv.org/abs/2606.16790
Authors: Jize Xie,Haomiao Wu,Qiang Chen,Xiu Su,Yi Chen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Conditional generative models are increasingly used as scenario generators for stochastic optimization, but standard training objectives emphasize uniform distributional fit rather than the downstream decisions induced by generated scenarios. This creates an objective mismatch: errors in statistically common regions may have little effect on decision regret, whereas errors in decision-sensitive regions can substantially change the optimal action. We propose Decision-Weighted Flow Matching (DW-FM), a regret-aligned training framework that preserves the simplicity of standard flow matching while reweighting its velocity-regression objective using decision-sensitive endpoint information. Theoretically, we connect downstream regret to pathwise velocity mismatch through a loss-induced decision discrepancy and an adjoint transport argument, yielding an ideal regret-aligned surrogate and practical endpoint-weighted objectives with regret guarantees. Empirically, we demonstrate the effectiveness of DW-FM on three CVaR-based contextual stochastic optimization benchmarks spanning synthetic portfolio, semi-real financial, and traffic-CVaR tasks, where DW-FM improves downstream regret over standard baselines.
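For orientation, the reweighting idea can be written against the standard conditional flow matching objective. The linear interpolation path and the generic weight w(x_1, c) below are assumptions made for illustration; the abstract does not specify the path or the exact form of the decision-sensitive weight.

```latex
% Linear interpolation path between noise x_0 and a data scenario x_1, given context c
x_t = (1 - t)\, x_0 + t\, x_1, \qquad t \sim \mathcal{U}[0, 1]

% Standard conditional flow matching: every endpoint carries the same weight
\mathcal{L}_{\mathrm{FM}}(\theta)
  = \mathbb{E}_{t, x_0, x_1, c}
    \left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|^2

% Endpoint-weighted variant: w(x_1, c) \ge 0 is a placeholder for a decision-sensitivity weight
\mathcal{L}_{w}(\theta)
  = \mathbb{E}_{t, x_0, x_1, c}
    \left[ w(x_1, c)\, \left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|^2 \right]
```

Setting w to a constant recovers the standard loss, so the weighted form changes only where regression errors are penalized most, not the regression target itself.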

[AI-26] Skill-to-LoRA: From Using Skills to Learning Behaviors for Token-Efficient LLM Agents

Link: https://arxiv.org/abs/2606.16769
Authors: Tianyi Zhang,Zhonghao Qi
Categories: Artificial Intelligence (cs.AI)
Comments: Preprint. 10 pages, 4 figures

Abstract:Agent skills are commonly distributed as this http URL files: human-readable procedural documents that describe workflows, tools, resources, and domain conventions. While convenient for inspection and reuse, this design requires the same reusable procedure to be repeatedly injected into the runtime context. We propose Skill-to-LoRA (S2L), a behavior-centric skill representation that replaces runtime skill text with skill-specific LoRA adapters. Rather than compressing the skill document itself, S2L models the behavioral change induced by the skill text: offline, the complete this http URL is used to synthesize skill-guided demonstrations; online, the full document is omitted and the corresponding LoRA adapter is dynamically loaded to activate the learned skill behavior. We evaluate S2L with Qwen3.6-27B on a 21-skill subset of SWE-Skills-Bench. Compared with the no-skill and Full Skill Text baselines, S2L improves pass rate by 2.9 and 5.2 percentage points, respectively, while reducing per-step token cost by 6.6% relative to Full Skill Text prompting. S2L matches or improves Full Skill Text on 18/21 skills and the no-skill baseline on 15/21 skills. Control experiments further show that the gains depend on skill-specific adapter alignment: Wrong-LoRA and Shared-LoRA both reduce performance. These results suggest that many procedural agent skills can be converted from runtime instructions into trainable, dynamically loadable behavioral modules. Code will be released upon acceptance.

[AI-27] Automated jailbreak attack targeting multiple defense strategies

Link: https://arxiv.org/abs/2606.16751
Authors: Qi Wang,Chengcheng Wan,Weijia He,Yanqing Li,Hanqi Sun,Xiaodong Gu,Jiangtao Wang
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks. However, their safety remains a critical concern due to their susceptibility to adversarial prompt-based attacks. In this paper, we present UNIATTACK, an adversarial testing framework designed from a defense-oriented perspective to systematically construct effective black-box attack prompts. Unlike prior approaches that rely on static templates or iterative model-specific tuning, UNIATTACK extracts minimal but high-impact attack features from diverse existing attacks, optimizes them via a specialized attacker LLM, and composes them into flexible templates through an automated refinement process. This feature-centric construction enables one-shot attacks that generalize across multiple models and safety categories, providing a practical tool for assessing LLM robustness. Our evaluation results show that, compared to the baselines, UNIATTACK achieves an average attack success rate (ASR) improvement of 64.63%-248.82% on models deployed with multi-layered defense mechanisms, while incurring only 0.03%-4.96% of the baselines’ cost. The UNIATTACK artifact is available at this https URL.

[AI-28] A First-Principles Derivation of LLM Policy Optimization: From Expected Reward to GRPO and Its Structural Extensions

Link: https://arxiv.org/abs/2606.16733
Authors: Jianghan Shen,Siqi Luo,Yue Li,Jiyao Liu,Wanying Qu,Yi Zhang,Ziyan Huang,Tianbin Li,Ming Hu,Xiaohong Liu,Yirong Chen,Junjun He
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Policy gradient algorithms for language models optimize the same objective J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[R(\tau)], which has exactly two factors: the trajectory probability p_\theta(\tau) and the reward R(\tau). Every method from REINFORCE to PPO to GRPO and their descendants modifies one or both factors to address a specific failure in the preceding formulation. Existing surveys organize these methods by domain or chronology, which obscures the rationale behind each design choice and the precise location of its intervention within the gradient estimator. This survey revisits the landscape of LLM policy optimization from J(\theta) on first principles and uses the trajectory side, induced by p_\theta(\tau), and the reward side, induced by R(\tau), as the two axes along which methods are located. It covers the path from REINFORCE and PPO to GRPO, as well as post-GRPO variants, Agentic RL, and GRPO-OPD. The resulting framework is unified, diagnostic, and extensible: it analyzes methods from a shared objective, identifies which side each method modifies and why, and applies the same trajectory and reward axes across these settings. Across these settings, the framework also exposes compound failures that no single-side fix resolves and that therefore require joint design of the trajectory side and the reward side. The boundary cases and coupled failures identified by this map mark where existing solutions run out and provide a principled starting point for designing the next generation of LLM policy optimization algorithms.
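As a reminder of where the two factors enter the gradient estimator, the standard score-function gradient of the expected-reward objective and the commonly used group-relative advantage of GRPO are written out below. These are textbook forms from the wider literature, given here as context and not taken from the survey itself; G denotes an assumed group of sampled responses with rewards r_1, ..., r_G.

```latex
% Score-function (REINFORCE) form of the gradient of J(\theta)
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim p_\theta(\tau)}
    \left[ R(\tau)\, \nabla_\theta \log p_\theta(\tau) \right]

% Group-relative advantage in GRPO, for G sampled responses with rewards r_1, \dots, r_G
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}
                 {\operatorname{std}(r_1, \dots, r_G)}
```

In the first line the trajectory side appears through \nabla_\theta \log p_\theta(\tau) and the reward side through R(\tau); the second line is a reward-side change, replacing the raw reward with a group-normalized signal.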

[AI-29] AgentFairBench: Do LLM Agents Discriminate When They Act?

Link: https://arxiv.org/abs/2606.16723
Authors: Triveni Morla,Rohith Reddy Bellibaltu,Manpreet Singh,Manmeet Singh Kapoor
Categories: Artificial Intelligence (cs.AI)
Comments: Submitted to IEEE Access

Abstract:Large language model (LLM) agents increasingly take actions (screening applicants, recommending credit, triaging patients), yet fairness for LLMs is still measured by grading answers. We introduce AgentFairBench, a cheap, reproducible, multi-domain benchmark for demographic disparity in the actions of LLM agents. Grounded in a companion framework, the Bias Conduction Framework (BCF, restated here), it spans three regulator-anchored domains: hiring, lending, and medical triage. Synthetic, demographic-neutral profiles are evaluated in counterfactual matched sets that vary only a name-coded race × gender signal (in the Bertrand-Mullainathan tradition), under four agent scaffolds of increasing agency (direct, chain-of-thought, multi-agent deliberation, tool-augmented). A NumPy-only harness computes counterfactual flip rate, mean absolute score difference (MASD), action-rate disparity, and tool-invocation disparity, with bootstrap confidence intervals, paired tests, and false-discovery-rate control, for single-digit dollars per model. A live leaderboard with a held-out private split and a contamination canary admits external models by submission. Our pilot (864 decisions plus a test-retest replication) carries a methodological lesson: comparing a six-group score spread against a two-run noise difference overstates disparity by about 2.4x through statistic arity alone. Against an arity-matched noise floor and an omnibus group test, Claude Haiku 4.5 shows no demographic effect above sampling noise (0 of 120 pairwise and 0 of 9 omnibus contrasts survive correction); a planted-bias test confirms the instrument detects disparity when present. The contribution is a sound, sensitive, adoption-ready instrument, the arity-matched null methodology, and open artifacts to scale it. Code, data, and harness are released under open licenses, with an anonymized review artifact.
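Two of the metrics named in the abstract can be illustrated with a small sketch. The definitions used here are assumptions, since the abstract does not spell them out: flip rate is taken as the share of matched sets whose thresholded decision differs across groups, and MASD as the mean over sets of the mean absolute pairwise score difference. The sketch uses only the Python standard library rather than NumPy, and the group labels and scores are made up.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, List

# One matched set: the same profile scored under different name-coded groups.
MatchedSet = Dict[str, float]  # group label -> agent score in [0, 1]


def flip_rate(sets: List[MatchedSet], threshold: float = 0.5) -> float:
    """Share of matched sets whose thresholded decision differs across groups."""
    flips = 0
    for s in sets:
        decisions = {score >= threshold for score in s.values()}
        if len(decisions) > 1:
            flips += 1
    return flips / len(sets)


def masd(sets: List[MatchedSet]) -> float:
    """Mean over sets of the mean absolute pairwise score difference."""
    per_set = [
        mean(abs(a - b) for a, b in combinations(s.values(), 2))
        for s in sets
    ]
    return mean(per_set)


toy = [
    {"group_a": 0.8, "group_b": 0.6},  # same decision, score gap 0.2
    {"group_a": 0.6, "group_b": 0.4},  # decision flips, score gap 0.2
    {"group_a": 0.3, "group_b": 0.3},  # identical scores
    {"group_a": 0.9, "group_b": 0.5},  # same decision at the threshold, gap 0.4
]

rate = flip_rate(toy)
spread = masd(toy)
```

On this toy data one of four sets flips and the average gap is about 0.2, which shows that the two statistics measure different things: scores can diverge without the action changing.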

[AI-30] Medical world models: representing medical states, modelling clinical dynamics and guiding intervention policies

Link: https://arxiv.org/abs/2606.16721
Authors: Ke Liu,Mengxuan Li,Yanyi Bao,Tianyun Zhang,Chong Chu,Jiajun Bu,Haishuai Wang
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Medical diagnosis and treatment are dynamic processes in which patient states evolve over time and clinical interventions alter future outcomes. Although current medical AI can detect disease, estimate risk and generate reports, many systems still return static labels or scores, offering limited insight into how illness may progress or how alternative interventions may reshape its trajectory. Medical world models adapt the world-model idea from artificial intelligence to healthcare by learning internal simulators of patient-state dynamics. Their long-term goal is to help clinicians anticipate deterioration, compare treatment-conditioned futures and tailor care to individual patients. Yet relevant work remains scattered across foundation models, longitudinal modelling, disease simulation, treatment-effect estimation, reinforcement learning and digital twins. To bridge this gap, this review outlines a roadmap for advancing medical AI from isolated diagnosis and prediction toward medical world models that simulate disease evolution and support intervention decisions. This roadmap is organized around three coupled capabilities: patient-state construction, clinical dynamics modelling and intervention decision support. Across representative systems, the comparison highlights what each capability contributes and how partial components can be integrated into more mature perception–dynamics–planning systems. Finally, we identify the challenges involved in turning plausible rollouts into clinically useful simulators. Related literature is available at this https URL.

[AI-31] User as Code: Executable Memory for Personalized Agents

Link: https://arxiv.org/abs/2606.16707
Authors: Bojie Li
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:A personalized AI agent needs a user memory: a persistent model of who the user is, built across many conversations and consulted on each new one. Today this memory is almost always stored as unstructured text, a knowledge graph, or a flat store of facts, and consulted by retrieval – fetching the entries most similar to the current request. Such “bag-of-facts” memory recalls individual facts well, but because storing a fact and acting on it are separate steps, it struggles to resolve contradictions, aggregate over many records, or enforce rules. We argue that user memory should instead be executable. We introduce User as Code (UaC), a paradigm in which an agent’s model of a user is a living software project: typed Python objects hold the user’s state and ordinary Python functions encode the rules that govern it, so representing and reasoning about the user happen in one medium an interpreter can run. The enabling mechanism is a two-phase pipeline: an append-only log that never discards a fact, periodically checkpointed into typed code. This changes what memory can do. On standard long-term conversation benchmarks, UaC matches both a full-context upper bound and the strongest prior memory systems on recall (78.8% on LOCOMO). Its advantage emerges where representation matters most. On aggregate questions over a user’s history – “how many international trips did I take last year?” – retrieval-based memory collapses (6-43%) while UaC stays near-perfect (99%), because the answer is a one-line computation over typed state rather than a search over text. And because its rules execute deterministically whenever the state changes, UaC can surface unsolicited, safety-critical alerts – such as a newly prescribed drug that conflicts with an allergy recorded months earlier – a capability query-driven memory cannot provide. 
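A rough sketch of what "typed Python objects plus ordinary functions" might look like for the two examples in the abstract (counting international trips, flagging a drug-allergy conflict). The class names, fields, and the exact-match allergy rule are illustrative assumptions, not the paper's schema or pipeline.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Trip:
    destination_country: str
    start: date


@dataclass
class UserState:
    home_country: str
    trips: List[Trip] = field(default_factory=list)
    allergies: List[str] = field(default_factory=list)
    prescriptions: List[str] = field(default_factory=list)


def international_trips_in(user: UserState, year: int) -> int:
    """Aggregate question answered by computation over typed state."""
    return sum(
        1 for t in user.trips
        if t.start.year == year and t.destination_country != user.home_country
    )


def allergy_alerts(user: UserState) -> List[str]:
    """Rule meant to be re-run on every state change; flags conflicting prescriptions."""
    return [p for p in user.prescriptions if p in user.allergies]


# Hypothetical user whose state was built up over many conversations.
user = UserState(home_country="CN", allergies=["penicillin"])
user.trips += [
    Trip("JP", date(2025, 3, 1)),
    Trip("CN", date(2025, 5, 9)),    # domestic, not counted
    Trip("FR", date(2025, 11, 20)),
    Trip("US", date(2024, 7, 4)),    # different year, not counted
]
user.prescriptions.append("penicillin")  # recorded months after the allergy

trip_count = international_trips_in(user, 2025)
alerts = allergy_alerts(user)
```

The count is computed directly from structured records and the alert needs no user query, which is the contrast with similarity-based retrieval that the abstract draws.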

[AI-32] Adaptive inference and function vectors in deep transformers

Link: https://arxiv.org/abs/2606.16694
Authors: Ravin Raj,Gautam Reddy
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applied Physics (physics.app-ph); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:Transformers are widely used as a general-purpose substrate for learning complex correlations between a large collection of coupled variables, but their internal mechanisms have remained mysterious. We introduce a theory of a deep transformer as a mean-field interacting system that implements distributed inference, subject to constraints on communication, locality and depth. We show that such a system can exploit internal state representations (‘function vectors’) to infer a latent context variable at increasingly finer scales over its layers. In an in-context regression task, the theory predicts a non-trivial relationship between non-Gaussian, hierarchical structure in the latent context variable, and transformer depth. Predictions are tested using constrained linear attention transformers and demonstrate adaptive inference in deep architectures. Feedforward blocks and depth enable transformers to implement a much richer class of in-context learning algorithms than previously described.

[AI-33] Optimising Temporary Accommodation Placement Across London with AI-Powered SaaS in E-Governance Systems

Link: https://arxiv.org/abs/2606.16652
Authors: Hankun He,Jordan Richards,Gopalakrishnan Netuveli,Kumar Aniket,Ramya Pachatcharam,Binta Ade-olusile,Nathan Nagaiah,Matthew I Bellgard
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: 13 pages, 4 figures, to be published in International Conference on AI and Sustainability Advances 2026 Companion Proceedings

Abstract:Temporary accommodation (TA) has become a major fiscal and administrative pressure for English local authorities, particularly in London, where demand and costs have risen sharply. This paper documents the creation and use of DOMUS, a cloud-based, AI-enabled decision-support system built from scratch at the University of East London and customised for the needs of the London Borough of Newham to support statutory temporary accommodation placement. DOMUS integrates household case records, policy-constrained affordability and suitability rules, and live private-rental listings within a single governance-aligned workflow. The system combines transparent, rule-based filtering with large language model-assisted search to standardise the application of bedroom need, affordability thresholds, geographic preferences, and accessibility requirements, while preserving officer discretion and auditability. Household and property attributes are encoded into policy-consistent representations prior to AI-assisted ranking and explanation. A pilot deployment in Newham’s secure environment evaluated operational performance relative to manual workflows. Results indicate substantial reductions in search time, improved adherence to key placement constraints, and high staff satisfaction, while maintaining statutory compliance and role-based accountability. Beyond TA, the paper frames DOMUS as replicable digital public infrastructure: a modular, cloud-native Software-as-a-Service architecture that can be deployed across other UK boroughs and adapted to other public administration tasks characterised by scarcity, rule-bound eligibility, and high stakes. The findings demonstrate the feasibility of scalable, ethically governed AI deployment in local government and contribute to debates on AI-enabled public value creation in e-governance.

[AI-34] The Integrator Advantage: Controlled Agentic AI for Small and Medium-Sized Companies

Link: https://arxiv.org/abs/2606.16649
Authors: Christopner Koch,Joshua A. Wellbrock
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, 15 tables

Abstract:Agentic AI marks a new phase of enterprise automation. Unlike traditional automation or conversational AI, agentic systems can interpret goals, plan multi-step tasks, access tools, interact with enterprise systems, and execute workflows with varying degrees of autonomy. For small and medium-sized companies, this creates potential to reduce administrative burden, accelerate routine processes, and improve the use of organizational knowledge. This paper argues that the near-term value of Agentic AI does not lie in full autonomy or workforce reduction, but in controlled partial autonomy for simple and medium-complexity business processes. It proposes an integration framework covering use case suitability, autonomy levels, technical integration, governance, security, employee enablement, and measurable impact. The paper concludes that Agentic AI can become a productivity lever when implemented as a human-centered capability with responsibility and accountability retained by people.

[AI-35] MR-GVNO: A Geometry-Aware Variational Physics-Informed Neural Operator for Mindlin-Reissner Plates on Irregular Domains

Link: https://arxiv.org/abs/2606.16624
Authors: Siqi Wang,Daobo Sun,Yizheng Wang,Yilong Zhang,Yabin Jin,Xiaoying Zhuang,Timon Rabczuk
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Plate and shell structures are widely used in engineering, making rapid response prediction under varying geometries, materials, and loads highly desirable. However, conventional finite element methods require repeated modeling and solution, resulting in high computational costs. This study proposes a geometry-aware variational neural operator for Mindlin-Reissner plate problems, termed MR-GVNO. The method uses boundary point clouds to represent irregular geometries and employs separate encoders for spatially varying material fields, pressure loads, and scalar physical parameters. A cross-attention mechanism integrates these inputs with query point information to predict transverse deflections and rotations at arbitrary locations. MR-GVNO is trained without labeled solution data using a variational physics-informed loss derived from the discretized total potential energy. It directly processes irregular point clouds and allows different physical fields to be discretized independently, avoiding interpolation onto a common grid. Numerical experiments on single-hole, double-hole, and L-shaped plates demonstrate accurate response prediction under homogeneous and heterogeneous materials and uniform and random loads. The model also achieves millisecond-level full-field inference and favorable cross-geometry generalization.

[AI-36] Entropy-Gated Latent Recursion

Link: https://arxiv.org/abs/2606.16620
Authors: Soham Bhattacharjee,Dushyant Singh Chauhan,Salem Lahlou,Martin Takac,Nils Lukas
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Inference-time scaling has become the dominant lever for improving language-model reasoning, but existing methods derive rollout diversity from a single source: stochastic token-level sampling. We argue that this single-axis sampling space is fundamentally limiting, and identify a second, fully deterministic and complementary axis: the layer span $L$ at which a frozen model’s top decoder layers are recursively re-applied at high-uncertainty tokens. Different choices of $L$ produce distinct rollouts that solve different subsets of problems, with no stochasticity. We instantiate this axis through Entropy-Gated Latent Recursion (EGLR), a training-free decoding procedure that re-applies the top-$L$ layers for at most $K_{\max}$ iterations until the next-token distribution converges. Combined with $T$ temperature samples, EGLR turns a single-axis stochastic rollout pool into an $L \times T$ Cartesian sampling space at almost the same per-rollout cost. We characterize this space across 8 instruction-tuned models and 6 math reasoning benchmarks, and show that the $L$-axis is genuinely complementary to temperature: on MATH-500 with Qwen2.5-3B-Instruct, the joint $L \times T$ oracle reaches 91.6%, +8.2 percentage points beyond the temperature-only oracle (83.4%) and +10.4 points beyond the layer-only oracle (81.2%), confirming that the two axes capture genuinely complementary problems. The expanded rollout pool provides richer per-prompt candidates for any downstream procedure that consumes rollouts, including self-consistency, best-of-$N$ with verifiers, and group-relative RL training (GRPO), opening a new direction for inference-time scaling that does not rely on stochastic noise.
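
The entropy gate described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the threshold `tau`, the convergence tolerance `eps`, the total-variation convergence test, and the toy `sharpen` function (a stand-in for re-applying the top-$L$ decoder layers) are all assumptions made for illustration.

```python
import math
from typing import Callable, List, Tuple


def entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def total_variation(p: List[float], q: List[float]) -> float:
    """Total-variation distance between two distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def entropy_gated_recursion(
    probs: List[float],
    reapply: Callable[[List[float]], List[float]],
    tau: float = 1.0,    # hypothetical entropy threshold
    k_max: int = 4,      # maximum number of recursive re-applications
    eps: float = 1e-3,   # hypothetical convergence tolerance
) -> Tuple[List[float], int]:
    """Re-apply `reapply` only at high-entropy tokens.

    Stops after `k_max` iterations or once the distribution
    moves by less than `eps` in total variation.
    """
    if entropy(probs) <= tau:
        return probs, 0  # confident token: the gate stays closed
    steps = 0
    while steps < k_max:
        new_probs = reapply(probs)
        steps += 1
        converged = total_variation(probs, new_probs) < eps
        probs = new_probs
        if converged:
            break
    return probs, steps


def sharpen(probs: List[float]) -> List[float]:
    """Toy stand-in for re-applying the top-L layers: square and renormalise."""
    squared = [p * p for p in probs]
    z = sum(squared)
    return [s / z for s in squared]


confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.4, 0.3, 0.2, 0.1]
print(entropy_gated_recursion(confident, sharpen))
print(entropy_gated_recursion(uncertain, sharpen))
```

In this sketch the confident distribution passes through untouched, while the uncertain one is refined for a bounded number of steps. Sweeping the layer span and the sampling temperature independently would then give the Cartesian rollout pool the abstract describes.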

[AI-37] CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies

Link: https://arxiv.org/abs/2606.16613
Authors: Issa Sugiura,Daichi Hattori,Kazuo Araragi,Keita Ogawa,Shota Onose,Taro Makino,Teppei Usuki,Takashi Ishida
Categories: Artificial Intelligence (cs.AI)
Comments: 23 pages, 8 figures

Abstract:As LLM agents become capable of increasingly long-horizon tasks, evaluating their performance in economic systems is becoming increasingly important. Unlike existing benchmarks that primarily evaluate a single agent interacting with a passive environment, economic systems are inherently multi-agent, requiring autonomous agents to communicate, negotiate, and transact while pursuing their own objectives over extended periods. We introduce CoffeeBench, a benchmark for evaluating LLM agents in a long-horizon multi-agent economy composed of heterogeneous firms. In CoffeeBench, two farmers, two roasters, and two retailers autonomously operate their businesses over a 90-day simulation, each seeking to maximize cumulative net income through communication and transactions while managing cash, inventory, and pricing. The evaluated model controls one coffee roaster, while the remaining firms are controlled by fixed reference agents. Across several recent open-weight and proprietary LLMs, all models outperform a passive baseline that takes no actions, with most achieving positive net income. Analysis of agent behavior reveals substantial differences in long-horizon economic interaction: higher-performing models communicate more actively with other firms, whereas Claude Haiku 4.5 exhibits an idle-drift failure mode, repeatedly choosing inaction despite producing coherent assessments and plans. We release our code and agent trajectories to support future research.

[AI-38] ARB4WM: An Adversarial Robustness Benchmark for World Models in Continuous Control UAI

Link: https://arxiv.org/abs/2606.16605
Authors: Junjian Zhang,Hao Tan,Ruonan Li,Dong Zhu,Aiping Li,Zhaoquan Gu
Categories: Artificial Intelligence (cs.AI)
Comments: 24 pages, 10 figures, 5 tables. Source code available at this https URL

Abstract:World models are widely used in robotic and agentic engineering control systems due to their ability to learn latent dynamics for planning and decision-making. As these systems are increasingly deployed in safety-critical settings, understanding their robustness under adversarial conditions has become essential. However, existing evaluations lack a unified benchmark for testing adversarial threats across the policy, value, and latent-dynamics levels of world-model agents. To fill this gap, we present ARB4WM, a unified evaluation framework for pre-deployment robustness and risk assessment of world-model agents under visual perturbations. ARB4WM defines five white-box loss objectives across these three levels and studies their effects when combined with single-step or multi-step perturbation strategies and temporal attack modes, including full-frame, half-sequence, and sparse-frame exposure. Specifically, we evaluate four Dreamer-style agents across 20 tasks from MetaWorld and the DeepMind Control Suite under different loss objectives, perturbation strategies, and temporal attack modes. Results show that attacks targeting value estimation, latent representations, and RSSM dynamics can be as damaging as direct policy disruption, and that early or frequent perturbations are especially harmful, while input-level defenses provide limited recovery under adaptive attacks. These findings suggest that safety, risk, and reliability assessment for world models should cover multiple component-oriented attack objectives and temporal exposure protocols rather than relying solely on action-space robustness. Source code is available at this https URL.

[AI-39] ArtNet: A JEPA-Like Articulatory Predictive Framework for Robust Zero-Shot Phoneme Recognition INTERSPEECH2026

Link: https://arxiv.org/abs/2606.16595
Authors: Zeqian Hu,Fuliang Weng,Shu Shang,Yaqian Zhou
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Accepted at Interspeech 2026

Abstract:Zero-shot cross-lingual phoneme recognition is often hindered by the fragility of direct acoustic-to-symbol mapping, which is susceptible to language-specific variations. Echoing joint-embedding predictive architecture (JEPA) work in vision, we propose ArtNet, a framework that explores a structured feature prediction task based on articulatory features to enhance acoustic robustness. Specifically, ArtNet integrates an articulatory predictor, designed to extract universal articulatory representations from self-supervised learning (SSL) features, with a variational information bottleneck (VIB) to suppress language-specific variations. Experiments on seven unseen languages demonstrate that ArtNet, particularly when synergized with the proposed vector-space inventory alignment (VSIA) strategy, significantly outperforms competitive baselines, achieving a 20.56% relative reduction in phoneme error rate (PER) and 7.01% in phoneme feature error rate (PFER).

[AI-40] Infant Spontaneous Movement Noise Improves Exploration in Deep RL

Link: https://arxiv.org/abs/2606.16590
Authors: Francisco M. López,Markus R. Ernst,Francisco Cruz,Matej Hoffmann,Jochen Triesch
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments: 6 pages, 4 figures, 1 table. Accepted at IEEE ICDL 2026. Cite as: F. M. López, M. R. Ernst, F. Cruz, M. Hoffmann, and J. Triesch, “Infant Spontaneous Movement Noise Improves Exploration in Deep RL”, in 2026 IEEE International Conference on Development and Learning (ICDL). IEEE, 2026, pp. 1-6

Abstract:Exploration in deep reinforcement learning (RL) is commonly implemented as temporally uncorrelated white noise. However, recent works show that temporally correlated colored noise can improve exploration efficiency by producing smooth trajectories with better coverage of the state space. We inquire whether action noise inspired by infant spontaneous movements can also improve exploration in deep RL. We find that the power spectral densities of babies’ end-effector velocities follow a colored noise process where the spectral exponent increases with age. Inspired by this developmental pattern, we introduce a mechanism that progressively increases the temporal auto-correlation of exploration noise during RL training, matching the infant statistics. Experiments across several RL environments show that infant-inspired noise produces structured exploratory behavior and can improve learning efficiency compared to conventional exploration strategies. These findings suggest that human motor and cognitive development can provide useful guidance for designing learning mechanisms in artificial agents. Our code is available at this https URL.
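
The idea of progressively increasing the temporal auto-correlation of exploration noise can be sketched as follows. Note the substitution: the abstract describes colored noise with an age-dependent spectral exponent, whereas this sketch uses a simpler first-order autoregressive (AR(1)) process whose correlation coefficient grows with training progress. The linear schedule and its endpoints (`rho_start`, `rho_end`) are hypothetical and not taken from the paper.

```python
import math
import random
from typing import List


def correlation_schedule(progress: float,
                         rho_start: float = 0.0,
                         rho_end: float = 0.95) -> float:
    """Linearly increase the AR(1) coefficient as training progresses (0 to 1)."""
    progress = min(max(progress, 0.0), 1.0)
    return rho_start + (rho_end - rho_start) * progress


def ar1_noise(n: int, rho: float, rng: random.Random) -> List[float]:
    """Unit-variance AR(1) noise: x_t = rho * x_{t-1} + sqrt(1 - rho^2) * eps_t."""
    if n <= 0:
        return []
    scale = math.sqrt(1.0 - rho * rho)
    x = rng.gauss(0.0, 1.0)
    out = [x]
    for _ in range(n - 1):
        x = rho * x + scale * rng.gauss(0.0, 1.0)
        out.append(x)
    return out


def lag1_autocorr(xs: List[float]) -> float:
    """Sample lag-1 autocorrelation, used here only to inspect the noise."""
    mean = sum(xs) / len(xs)
    num = sum((a - mean) * (b - mean) for a, b in zip(xs, xs[1:]))
    den = sum((a - mean) ** 2 for a in xs)
    return num / den


rng = random.Random(0)
early = ar1_noise(5000, correlation_schedule(0.0), rng)  # white noise
late = ar1_noise(5000, correlation_schedule(1.0), rng)   # strongly correlated
print(lag1_autocorr(early), lag1_autocorr(late))
```

Early in training the noise is effectively white, while late in training consecutive samples are strongly correlated, producing smoother exploratory action sequences.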

[AI-41] TNODEV: Toolbox for Neural ODE Verification

Link: https://arxiv.org/abs/2606.16567
Authors: Abdelrahman Sayed Sayed,Pierre-Jean Meyer,Mohamed Ghazel
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Dynamical Systems (math.DS)
Comments: 29 pages, 7 figures, Under review in TMLR

Abstract:Neural ordinary differential equations (neural ODE) have started to appear in safety critical settings such as continuous-time controllers for cyber-physical systems and classifiers integrated into automated decision pipelines, raising the question of whether their behavior can be formally verified. Existing tools dedicated to neural ODE provide only a single reachability call without iterative input set refinement, limiting the precision of their verdicts to whatever one reachability call can deliver. We present TNODEV, the first sound formal verifier for neural ODE that integrates a falsification checker, a fast interval-based reachability backend based on continuous-time mixed monotonicity, a verification and refinement loop with three input-set splitting heuristics, and a parallel scheduler in a single end-to-end pipeline. TNODEV supports safe-set inclusion verification on pure neural ODE, neural ODE in closed loop with a neural network controller and general neural ODE (GNODE), with the safe set specified either as an interval or as the half-space intersection induced by a target classification label. We evaluate TNODEV on a range of benchmarks across safe-set inclusion and classification-robustness properties, including a direct reachability comparison against NNV 2.0 and CORA and a verification comparison against NNV 2.0 on MNIST general neural ODE classifiers.

[AI-42] ROSA-RL: Uncertainty-Aware Roundabout Optimized Speed Advisory with Reinforcement Learning ITSC

Link: https://arxiv.org/abs/2606.16558
Authors: Anna-Lena Schlamp,Jeremias Gerner,Klaus Bogenberger,Werner Huber,Stefanie Schmidtner
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
Comments: 8 pages, 2 figures, 2 tables. Copyright 2026 IEEE. This is the accepted manuscript for 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC), not the final published version

Abstract:Roundabouts challenge automated driving in mixed traffic, as heterogeneous and non-deterministic human behavior, unknown driving intentions, and high interaction complexity create uncertainty about whether the conflict zone will be blocked or available at the moment of entry. We present ROSA-RL – uncertainty-aware Roundabout Optimized Speed Advisory with Reinforcement Learning. It enables safe and efficient roundabout entry for automated and human-driven vehicles in mixed traffic through probabilistic conflict forecasting. A Transformer-based model predicts conflict zone occupancy over a five-second horizon, capturing multi-agent interactions to anticipate upcoming conflicts and available gaps. The prediction outputs encode uncertainty in future motion and intent, and augment the state of a classical RL framework, enabling uncertainty-aware speed coordination. Evaluated in simulations grounded in real-world data, ROSA-RL can effectively handle uncertainty and outperform a comparable model-based baseline, closing the gap to an ideal setting assuming fully known occupancy while improving traffic efficiency and safety. The source code of this work is available under: this http URL.

[AI-43] The Faithfulness Gap: Certifying Semantic Equivalence Between Natural-Language and Formal Mathematical Statements

Link: https://arxiv.org/abs/2606.16541
Authors: Noor Islam S. Mohammad,Tamim Sheikh
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Autoformalization, translating natural-language mathematics into formal proof assistants, is bottlenecked not by translation fluency but by faithfulness: a formal statement can typecheck and be provable, yet still encode a different theorem than the source intended. We introduce Bidirectional Provability Fingerprinting (BPF), a framework that certifies faithfulness by characterizing each candidate through its forward and backward consequence neighborhoods in the ambient theory and matching these against probes derived from the natural-language statement. We further introduce four novel components: (i) Counterfactual Probe Generation (CPG), a contrastive procedure that synthesizes probes targeting specific drift directions; (ii) the Equivalence Spectrum, a continuous faithfulness score that replaces brittle binary verdicts; (iii) Adaptive Probe Budget Allocation (APBA), an information-theoretic budget router; and (iv) Faithfulness-Guided Decoding (FGD), which uses BPF signals as a reward during autoformalization. We prove a drift detection theorem and a PAC-faithfulness result establishing that the equivalence class of a natural language statement is learnable from $\mathcal{O}(\log(1/\delta)/\varepsilon)$ probes under mild assumptions. We release DriftBench, a benchmark of 2,183 NL/Lean 4 pairs with controlled drift labels across six subfields of mathlib4. BPF+CPG detects 89.6% of drifted formalizations at a 3.0% false-positive rate, against 41.2% for typecheck and 63.3% for LLM-judge baselines, and FGD reduces the rate at which a state-of-the-art autoformalizer emits drifted statements by 47%. this https URL

[AI-44] Dual-Granularity Orthogonal Disentanglement for Generalizable Audio Deepfake Detection INTERSPEECH2026

Link: https://arxiv.org/abs/2606.16532
Authors: Zhuodong Liu,Hugen Lv,Xiangyu Li,Chunhong Yuan
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Accepted at Interspeech 2026, 6 pages, 3 figures

Abstract:Audio deepfake detectors often fail to generalize across speakers, as they learn speaker-identity features rather than synthesis artifacts, known as implicit identity leakage. Existing methods address this but incur architectural complexity or training instability. This paper proposes a dual-granularity orthogonal disentanglement framework enforcing feature independence at two levels: sample-level cosine orthogonality captures directional decorrelation, while batch-level cross-covariance regularization eliminates linear correlations across embedding dimensions. A curriculum disentanglement schedule progressively strengthens the orthogonality constraint without auxiliary networks or adversarial dynamics. Experiments on ASVspoof 2019 LA, ASVspoof 2021 DF, and In-the-Wild datasets demonstrate that the proposed method achieves 1.35%, 7.88%, and 21.58% equal error rates (EER), respectively, surpassing gradient reversal disentanglement by 2.60% absolute on cross-dataset transfer.
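
The two granularities of orthogonality mentioned above can be sketched in plain Python. The exact loss forms are not given in the abstract, so the choices here are assumptions: mean squared cosine similarity at the sample level, squared Frobenius norm of the cross-covariance matrix at the batch level, and a linear ramp for the curriculum weight. The names `artifact` and `speaker` for the two embedding branches are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]  # one embedding per row


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity with a small epsilon for numerical safety."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def sample_level_loss(artifact: Matrix, speaker: Matrix) -> float:
    """Mean squared cosine similarity between paired embeddings (assumed form)."""
    return sum(cosine(a, s) ** 2 for a, s in zip(artifact, speaker)) / len(artifact)


def batch_level_loss(artifact: Matrix, speaker: Matrix) -> float:
    """Squared Frobenius norm of the batch cross-covariance matrix (assumed form)."""
    n = len(artifact)
    dim_a, dim_s = len(artifact[0]), len(speaker[0])
    mean_a = [sum(row[i] for row in artifact) / n for i in range(dim_a)]
    mean_s = [sum(row[j] for row in speaker) / n for j in range(dim_s)]
    total = 0.0
    for i in range(dim_a):
        for j in range(dim_s):
            cov = sum(
                (artifact[k][i] - mean_a[i]) * (speaker[k][j] - mean_s[j])
                for k in range(n)
            ) / n
            total += cov * cov
    return total


def curriculum_weight(step: int, total_steps: int, max_weight: float = 1.0) -> float:
    """Hypothetical linear ramp that strengthens the constraint over training."""
    return max_weight * min(1.0, step / max(1, total_steps))


artifact = [[1.0, 0.0], [0.0, 1.0]]
speaker = [[0.0, 1.0], [1.0, 0.0]]
print(sample_level_loss(artifact, speaker))  # pairs are orthogonal
print(batch_level_loss(artifact, speaker))   # dimensions are still correlated
```

The toy batch shows why both terms can matter: every pair is orthogonal sample by sample, yet the embedding dimensions remain linearly correlated across the batch, which only the batch-level term penalises.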

[AI-45] Direction-Conditioned Policies via Compositional Subgoal Scoring for Online Goal-Conditioned Reinforcement Learning ICML2026

Link: https://arxiv.org/abs/2606.16515
Authors: Swaminathan S K,Damiya Gondha,Theyanesh Eswaramoorthy Rajahkrishnan,Aritra Hazra
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 17 pages, Accepted to the 2nd Workshop on Compositional Learning at ICML 2026 (Seoul, South Korea)

Abstract:Hamilton-Jacobi-Bellman theory implies that the optimal goal-conditioned action depends on the goal only through the gradient of the goal-reaching distance at the current state, yet standard online GCRL still conditions the actor on the raw goal – a signal that is geometrically uninformative when the goal is far from the data distribution. We propose Direction-Conditioned Policies (DCP), a fully online method that decomposes goal-reaching into two components sharing one InfoNCE representation $\psi$: a subgoal-scoring step that selects a visited state $z_t$ aligned with the final goal $g$ in $\psi_g$, and a direction-conditioned actor that consumes the unit direction $d_t$ and magnitude $r_t$ from $\psi(s_t)$ to $\psi(z_t)$. The two components train jointly, factor cleanly at deployment (subgoal scoring is removed, while direction conditioning remains with $g$ in place of $z_t$), and admit independent modification at the same $(d_t, r_t)$ interface. We prove three results. First, direction sufficiency under HJB: the optimal action under control-affine dynamics depends on the goal only through the value gradient. Second, a quantitative bound showing that, under mild conditions on the learned representation and assuming the scoring rule returns an on-path $z_t$, the actor’s conditioning input at training and at deployment coincide up to representation error and geodesic slack. Third, a controllable-subspace characterization of when directional conditioning fails. Across nine environments, DCP improves over Contrastive RL on most final metrics, with the largest gains on manipulation and obstacle-interaction tasks; a qualitative analysis of the learned $\psi$-distance landscape shows the contrastive representation behaves as an online quasimetric encoding environment topology, and the single failure case (AntSoccer) localizes to a learned-gradient pathology that the theory anticipates.

[AI-46] Model Graph Inductive Learning for Knowledge Graph Completion

Link: https://arxiv.org/abs/2606.16509
Authors: Mohommad Esmaei Khani,Mahdieh Hasheminejad,Ali Taherkhani,Hossein Hajiabolhassan
Categories: Artificial Intelligence (cs.AI)

Abstract:Link prediction in knowledge graphs fundamentally depends on the quality of learned embeddings for entities and relations. However, most existing methods derive these embeddings by aggregating only the local neighborhood of each entity, neglecting the global structure of the knowledge graph. This limited view prevents models from capturing higher-level structural patterns that are essential for accurate and generalizable link prediction. To address these limitations, we introduce Model Graph Inductive Learning (MGIL), a framework that constructs a model graph by clustering entities based on the similarity of their incoming and outgoing relational structures or their entity types. A GNN is then applied to this model graph to produce embeddings that capture the global view of the knowledge graph. These embeddings subsequently serve as high-quality initial features for the original knowledge graph, replacing random initialization and leading to more stable and expressive representations. Extensive experiments on standard and recently proposed inductive benchmarks demonstrate that MGIL achieves state-of-the-art or highly competitive performance in inductive link prediction, highlighting its effectiveness across diverse graph settings.

[AI-47] Post-Hoc Merging is Not Enough: Many-Shot Model Merging with Loss-Gap Balancing ICML2026

Link: https://arxiv.org/abs/2606.16501
Authors: Kyungjin Im,Miru Kim,Chanin Eom,Minhae Kwon
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

Abstract:Model merging has become a practical post-training strategy for building a single multi-task large language model (LLM) by combining multiple task-specialized models. However, most existing approaches rely on post-hoc merging, in which task-specific models are merged only once after training. This one-shot aggregation often suffers from task interference, leading to information erasure across individual tasks. In this work, we show that replacing post-hoc merging with an iterative many-shot merging protocol is effective in improving multi-task performance. Building on this insight, we propose METIS, Mitigating Erasure from Task Interference for Stable many-shot merging. METIS is a loss-aware many-shot merging method that addresses information erasure in post-hoc merging through task-wise loss-gap weighting and consensus-based masking. Notably, METIS exhibits significant performance improvement on the worst-performing task, effectively mitigating information erasure. (Project page: this https URL)

[AI-48] Steering Emotional Dynamics for Art Therapy: Controllable Narrative Script Generation through Hierarchically Guided LLM Agents

Link: https://arxiv.org/abs/2606.16481
Authors: Suqing Wang,Qinghai Miao,Chao Guo,Yisheng Lv
Categories: Artificial Intelligence (cs.AI)

Abstract:Art therapy plays a vital role in emotional healing, in which narrative creation acts as the primary vehicle for emotional expression. Given the inherently dynamic nature of emotions during healing, narratives with finely controlled emotional fluctuations enable individuals to safely project inner conflicts and achieve emotional catharsis. Recently, with the rapid development of Large Language Models (LLMs), automated narrative generation technology has provided a new pathway to support such artistic designs. However, while existing methods can produce fluent texts, they struggle to generate narratives that adhere to specified affective trajectories, failing to meet the demands of emotion-oriented psychological healing. To address these issues, this paper proposes EC-Script, an LLM agent-based framework that enables hierarchical control of the affective trajectory in narrative generation for emotional healing. To ensure that the generated narratives strictly follow the given emotional patterns, EC-Script establishes overall narrative direction through Emotion-Trajectory Planning, propels scene-level plot development with Character-Driven Scene Generation, and regulates local emotional changes of characters via Emotion-Controlled Script Writing. Ultimately, it outputs scene-by-scene script content that remains highly consistent with the preset affective trajectory. Experimental results demonstrate that EC-Script significantly outperforms baseline methods in affective trajectory adherence, exhibiting excellent and reliable emotional controllability, thereby providing effective technical support for AI-assisted emotional healing scenarios.

[AI-49] HOLO-MPPI: Multi-Scenario Motion Planning via Hierarchical Policy Optimization

Link: https://arxiv.org/abs/2606.16480
Authors: Youngjae Min,Jovin D’sa,Faizan M. Tariq,David Isele,Navid Azizan,Sangjae Bae
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)

Abstract:Robots deployed in the real world must plan motions across diverse scenarios without per-scenario retuning. End-to-end reinforcement learning (RL) can generalize across scenarios but often becomes brittle under distribution shift, reward misspecification, and stochastic interactions. Model predictive path integral (MPPI) control enables strong real-time refinement without gradients, but its performance depends on a well-shaped sampling prior, while manually designing the priors does not scale to multi-scenario deployment. We present HOLO-MPPI (High-level Offline, Low-level Online MPPI), a multi-scenario motion planning framework that combines high-level policy learning with low-level stochastic optimal control. Offline, we learn a high-level policy that proposes scenario-robust plans in an abstract action space, with a learned world model for online rollout. Online, the policy serves as a data-driven prior generator that parameterizes MPPI’s sampling distribution conditioned on the current observation and goal. MPPI then optimizes low-level control sequences around this prior in real time to adapt to local disturbances. We instantiate HOLO-MPPI in autonomous driving by designing an effective high-level action space and tailored model architectures. Our evaluation across diverse driving scenarios shows that HOLO-MPPI improves upon MPPI and end-to-end RL baselines while maintaining real-time control.
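
The way MPPI refines a control sequence around a sampling prior can be sketched with the standard importance-weighted update. This is a generic MPPI step, not the HOLO-MPPI implementation: in the paper the prior comes from a learned high-level policy, whereas here `prior` is simply passed in, and the `temperature` parameter and the toy costs are assumptions.

```python
import math
from typing import List


def mppi_update(prior: List[float],
                perturbations: List[List[float]],
                costs: List[float],
                temperature: float = 1.0) -> List[float]:
    """One MPPI step: prior controls plus the cost-weighted average perturbation.

    prior          control sequence over the horizon (the sampling mean)
    perturbations  one sampled noise sequence per rollout
    costs          total rollout cost for each perturbed sequence
    """
    best = min(costs)  # subtract the minimum for numerical stability
    weights = [math.exp(-(c - best) / temperature) for c in costs]
    z = sum(weights)
    return [
        prior[t] + sum(w * eps[t] for w, eps in zip(weights, perturbations)) / z
        for t in range(len(prior))
    ]


prior = [0.0, 0.0]
perturbations = [[1.0, 1.0], [-1.0, -1.0]]

# Equal costs: the perturbations cancel and the prior is returned unchanged.
print(mppi_update(prior, perturbations, costs=[5.0, 5.0]))

# One rollout is far cheaper: the update moves almost entirely toward it.
print(mppi_update(prior, perturbations, costs=[0.0, 100.0]))
```

Because the update is a weighted average centred on the prior, a poorly shaped prior leaves most samples in high-cost regions, which is the dependence on the sampling prior that the abstract highlights.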

[AI-50] Tensor-Coord: Algebraic Decomposition of Joint Plan Tensors for Conflict-Free Multi-Agent LLM Planning

Link: https://arxiv.org/abs/2606.16478
Authors: Mudit Rastogi
Categories: Artificial Intelligence (cs.AI)

Abstract:Large language models (LLMs) remain limited in multi-agent planning because independently generated plans can create coordination failures such as spatial collisions, resource contention, and temporal deadlocks. We introduce Tensor-Coord, a multilinear algebra framework that represents the joint plan of $N$ agents as a third-order tensor $T \in \mathbb{R}^{N \times H \times A}$ over agents, timesteps, and actions. Canonical Polyadic (CP) and Tucker decompositions are used to identify latent coordination structure. The minimal $\epsilon$-approximate CP rank $R^*$ defines a computable coordination complexity measure, with $CC(\Pi) = (R^* - N)/N$. We prove that $R^* = N$ is necessary and sufficient for plan independence. The residual $E = T - T_{R^*}$ defines a conflict score over agent pairs, timesteps, and actions, localizing failures without domain-specific rules. Tucker factors provide interpretable agent roles, temporal phases, and action clusters that are converted into natural language constraints for iterative LLM replanning. Experiments on multi-robot delivery tasks across Easy (2 agents, 5x5 grid), Medium (3 agents, 5x5 grid), and Hard (4 agents, 5x5 grid) settings show convergence to conflict-free plans in 100% of 2-agent cases within 1.4 iterations on average, 80% of 3-agent cases within 3.2 iterations, and 60% of 4-agent cases within 4.0 iterations. CP rank scaled approximately linearly as $R^*(N) = 3.9N + 0.5$, supporting its use as a predictor of coordination complexity.
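
A small sketch may help make the tensor representation and the coordination complexity measure concrete. The one-hot encoding of the plan tensor is an assumption (the abstract does not say how entries are defined), and the sketch does not compute the CP rank itself; it only evaluates the stated formula and the reported linear fit.

```python
from typing import List

Tensor3 = List[List[List[int]]]  # indexed as [agent][timestep][action]


def joint_plan_tensor(plans: List[List[str]], actions: List[str]) -> Tensor3:
    """Assumed one-hot encoding: T[n][h][a] = 1 if agent n takes action a at step h."""
    index = {name: i for i, name in enumerate(actions)}
    tensor: Tensor3 = []
    for plan in plans:
        agent_slice = []
        for action in plan:
            row = [0] * len(actions)
            row[index[action]] = 1
            agent_slice.append(row)
        tensor.append(agent_slice)
    return tensor


def coordination_complexity(cp_rank: float, n_agents: int) -> float:
    """CC = (R* - N) / N, as stated in the abstract."""
    return (cp_rank - n_agents) / n_agents


def fitted_rank(n_agents: int) -> float:
    """Linear fit reported in the abstract: R*(N) = 3.9 N + 0.5."""
    return 3.9 * n_agents + 0.5


actions = ["up", "down", "left", "right", "wait"]
plans = [
    ["up", "up", "right"],     # agent 0
    ["left", "wait", "down"],  # agent 1
]
T = joint_plan_tensor(plans, actions)
print(len(T), len(T[0]), len(T[0][0]))  # N, H, A

for n in (2, 3, 4):
    print(n, fitted_rank(n), coordination_complexity(fitted_rank(n), n))
```

Under the stated formula, a rank equal to the number of agents gives a complexity of zero, matching the independence condition in the abstract. In practice the rank would be estimated with a CP decomposition routine from a tensor library.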

[AI-51] AI systems out-persuade expert humans

Link: https://arxiv.org/abs/2606.16475
Authors: Kobi Hackenburg,Caroline Wagner,Luke Hewitt,Ben M. Tappin,Ed Saunders,Hannah Rose Kirk,Helen Margetts,Christopher Summerfield
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 16 pages, 4 figures

Abstract:Many societal decisions are settled by contests of persuasion. Conversational AI is a powerful new entrant in these contests, but whether it can out-persuade skilled and highly incentivized humans has remained unclear. Here, in a series of four preregistered experiments (n = 18,978 conversations from 6,923 people), we pitted AI systems against a range of human persuaders, including laypeople, winners of a separately preregistered four-round online persuasion tournament, professional canvassers, and world championship debaters. We found that AI systems were reliably more persuasive than expert humans, even when expert humans chose their issues, researched in advance, underwent hours of live, structured practice, and were incentivized with £1,000 cash bonuses. In a follow-up study, AI’s advantage persisted after experts received a coaching tool that let them practice against the AI that beat them, review their performance history, and see what AI would have said at key moments. We found converging evidence that AI’s advantage stemmed from rapidly deploying larger quantities of information: after coaching, expert humans could tie an AI constrained to respond at human speeds and with human-length messages. In a final study, we show that AI’s advantage extends to consequential real-world behavior: AI was nearly 3x more effective than professional canvassers from a UK fundraising firm at raising real-money donations to Save the Children. Together, these results establish that frontier AI systems out-persuade expert humans in conversation, with significant implications for political communication.

[AI-52] When Agent Automation Becomes Profitable: Quantifying and Insuring Autonomous AI Risk through Trace-Economic Underwriting

Link: https://arxiv.org/abs/2606.16465
Authors: Binyan Xu,Xilin Dai,Fan Yang,Kehuan Zhang
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: 26 pages, 14 figures, 29 tables

Abstract:AI agents can now take irreversible actions in operational systems, but agent-caused losses are still not clearly assigned, priced, or transferred. Providers often disclaim consequential damages, users are left with uncompensated losses, and default human review limits the efficiency gains of automation. We ask when autonomous AI deployment can become economically acceptable despite failure risk. Our answer is to quantify risk at the customer-task-trace episode level and transfer it through insurance. Automation is acceptable when its expected benefit exceeds the premium, control cost, and remaining risk. This requires a defined role with bounded permissions and comparable traces. We introduce trace-economic underwriting, which maps tool-use traces to customer exposure and claimable loss, then uses this representation for pricing, control, and risk transfer. It uses deterministic economic labels rather than an LLM judge. In our trace-to-loss testbed, trace-economic pricing reduces pricing MAE from 17.7K to 569 and removes regressive cross-subsidy. A 300-trace expert audit accepts 295 labels unchanged. On 1,000 real SWE-smith traces, trace-conditioned controls reduce CVaR95 by 72%. Theorem 1 gives a finite-sample scope condition. We release code, labels, and audit sheets.
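
The acceptability condition and the tail-risk metric named in the abstract can be written down directly. This is a minimal sketch rather than the paper's underwriting model: the function names are hypothetical, the monetary figures are made up for illustration, and CVaR is computed with the plain empirical estimator (the mean of the worst 5% of losses for CVaR95).

```python
from typing import List


def automation_is_acceptable(expected_benefit: float,
                             premium: float,
                             control_cost: float,
                             residual_risk: float) -> bool:
    """Automation is acceptable when the expected benefit exceeds
    the premium, the control cost, and the remaining (uninsured) risk."""
    return expected_benefit > premium + control_cost + residual_risk


def cvar(losses: List[float], alpha: float = 0.95) -> float:
    """Empirical CVaR: mean of the worst (1 - alpha) share of losses."""
    if not losses:
        raise ValueError("losses must be non-empty")
    ordered = sorted(losses, reverse=True)
    k = max(1, round(len(ordered) * (1.0 - alpha)))  # size of the tail
    return sum(ordered[:k]) / k


# Illustrative numbers only.
print(automation_is_acceptable(100.0, 30.0, 20.0, 40.0))  # benefit 100 vs cost 90
print(automation_is_acceptable(100.0, 30.0, 20.0, 60.0))  # benefit 100 vs cost 110

losses = [float(x) for x in range(1, 101)]  # per-trace losses 1..100
print(cvar(losses, alpha=0.95))             # mean of the five largest losses
```

A control that trims the largest losses lowers CVaR95 even when the average loss barely moves, which is why a tail metric suits irreversible agent actions.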

[AI-53] Learning aligned EEG representations with subject-specific encoders

Link: https://arxiv.org/abs/2606.16462
Authors: Bruna J. Lopes,Gabriel Schwartz,Sylvain Chevallier,Raphael Y. de Camargo,Bruno Aristimunha
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Cross-subject EEG decoding promises more training data, but it also exposes neural networks to strong inter-subject distribution shifts. We study whether task supervision and architecture alone can learn subject-aligned representations. We replace a shared EEG encoder with subject-specific encoders followed by a common classifier, and compare this hybrid model with standard EEGNet, AttentionBaseNet, and CTNet baselines with Euclidean Alignment (EA) on four motor-imagery datasets. EA improves shared encoders by recentering subject covariances, but the hybrid encoder largely internalises this role: validation-loss curves and latent-distance analyses change little when EA is removed. Subject-specific heads increase class distinctiveness and place each subject close to its own latent manifold, improving most subjects while leaving a method-sensitive subset. These results support subject-specific encoders as a learned alignment mechanism for EEG decoding and identify head selection for unseen subjects as the remaining bottleneck.

[AI-54] SPRI: SVD-Partitioned Residual Initialization for Data-Constrained MoE Upcycling

Link: https://arxiv.org/abs/2606.16456
Authors: Weiqiao Shan,Ruixiang Mao,Yuang Li,Yuhao Zhang,Yingfeng Luo,Tong Zheng,Chen Xu,Yucheng Qiao,Chunxiang Jin,Yi Yuan,Jingdong Chen,Tong Xiao,Jingbo Zhu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 12 tables, 3 figures

Abstract:Mixture-of-Experts (MoE) models enable efficient scaling, but training them from scratch remains prohibitively expensive. MoE upcycling mitigates this cost by converting pretrained dense models into sparse MoE models. However, existing upcycling methods typically rely on large-scale continued training and often perform poorly under data-constrained supervised adaptation, due to either homogeneous experts or overly disruptive perturbations to pretrained parameters. In this setting, effective upcycling must leverage pretrained weight structure while introducing sufficient diversity among routed experts. To this end, we propose SVD-Partitioned Residual Initialization (SPRI), which distributes SVD-partitioned residuals derived from pretrained feed-forward network (FFN) weights across routed experts, introducing controlled expert diversity grounded in pretrained spectral structure. We further introduce a two-stage training strategy to improve adaptation stability. We evaluate SPRI on multilingual speech-to-text translation, where limited supervised data challenges MoE upcycling and multiple target languages provide natural routing heterogeneity. On CoVoST2 across 15 En-to-XX directions, SPRI improves average BLEU and COMET over fully fine-tuned dense models by 2.58 and 3.32 points, respectively, and outperforms the prior best MoE upcycling baseline by 3.39 BLEU and 4.34 COMET points.

[AI-55] SDS-LoRA: Overcoming Anisotropic Gradient Scaling in Low-Rank Adaptation

Link: https://arxiv.org/abs/2606.16454
Authors: Junghun Oh,Sungyong Baik,Kyoung Mu Lee
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Low-Rank Adaptation (LoRA) enables efficient adaptation of large pre-trained models to downstream tasks by parameterizing weight updates with low-rank matrices. In this paper, we investigate the limitations of the LoRA parameterization from a geometric perspective. Specifically, we show that when a full fine-tuning gradient is backpropagated to the low-rank matrices, it undergoes anisotropic scaling driven by their singular values. We argue that this phenomenon is undesirable because it distorts the full fine-tuning gradient by skewing it toward dominant singular directions while suppressing others. Our analyses demonstrate that anisotropic gradient scaling reduces the effective rank of the low-rank matrices’ gradients and results in suboptimal alignment between the full fine-tuning gradient and its low-rank approximation in LoRA, thereby exacerbating the gap to full fine-tuning. To address these limitations, we propose a new low-rank parameterization, SDS-LoRA, which structurally decouples singular values from the backward pass. Our method ensures that the full fine-tuning gradient backpropagates only through the orthonormal bases of the low-rank matrices’ subspaces, independent of their scales. Convergence analysis demonstrates that while LoRA’s convergence rate degrades with the condition number of the low-rank matrices, SDS-LoRA remains independent of it. Experimental results across natural language and vision benchmarks show that SDS-LoRA improves loss convergence and reduces the gap to full fine-tuning, significantly enhancing adaptation performance.
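The anisotropic scaling described above can be seen in a toy calculation. For a LoRA update W + BA, the chain rule gives dL/dB = G A^T, where G is the full fine-tuning gradient with respect to W, so each singular direction of A rescales the gradient by its singular value. The sketch below illustrates only that identity and is not the SDS-LoRA parameterization; the matrices and the helpers `matmul` and `transpose` are made up, and A is chosen with orthogonal rows so that its singular values (10 and 0.1) can be read off directly.

```python
# Toy illustration (not the SDS-LoRA method itself): how the singular values
# of A rescale the gradient that reaches B in a LoRA update W + B @ A.

def matmul(X, Y):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*X)]

# Full fine-tuning gradient G = dL/dW (2 x 3), hypothetical values.
G = [[1.0, 1.0, 1.0],
     [2.0, 2.0, 2.0]]

# Low-rank factor A (rank 2, shape 2 x 3) with orthogonal rows, so its
# singular values are simply the row norms: 10 and 0.1.
A = [[10.0, 0.0, 0.0],
     [0.0, 0.1, 0.0]]

# Chain rule for W + B @ A: dL/dB = G @ A^T.
grad_B = matmul(G, transpose(A))

# Column 0 is G's first column times 10; column 1 is G's second column times 0.1.
print(grad_B)
```

Although the two columns of G carry the same signal here, the gradient reaching B differs by a factor of 100 between the two directions, which is the kind of skew toward dominant singular directions the abstract argues against.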

[AI-56] Training and Evaluating Diffusion Policies with Long Context Lengths

Link: https://arxiv.org/abs/2606.16447
Authors: Abhinav Agarwal,Adam Wei,Taylan Kargin,Michael Zeng,Cole Becker,Arif Kerem Dayi,Pablo Parrilo,Asuman Ozdaglar,Russ Tedrake
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Imitation learning has enabled highly dexterous robotic manipulation from RGB observations. Policies trained with these methods, however, typically condition robot actions on only a short history of observations. These policies cannot solve tasks that require memory and can get stuck repeatedly executing the same failing motions. In this work, we first benchmark policy performance as context length is incrementally increased from short to long, across a spectrum of tasks with varying local stability and memory requirements, and in multiple data regimes. To our knowledge, this is the first study to investigate context length in imitation learning at this level of detail. Our results challenge prior claims: naively scaling context length is not as brittle as advertised in the literature. With an appropriate conditioning method and denoising backbone (UNet+Cross-Attention), single-task policies achieve high success rates on many tasks in the usual data regime even with naive scaling. Next, we propose a training algorithm to jointly train policies at multiple context lengths, further reducing the sample complexity of long-context learning. Finally, we apply our findings to re-evaluate some previously proposed solutions to long-context imitation learning.

[AI-57] NeuronFabric: A Software Reference Architecture for On-Chip Transformer Training with Local Adam

Link: https://arxiv.org/abs/2606.16440
Authors: Evgeny Ukladchikov
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Publicly documented accelerator architectures generally separate training computation from optimizer-state updates or rely on external memory and host orchestration. This paper presents NeuronFabric, a software reference architecture intended for future FPGA and ASIC implementations of transformer training with local Adam updates. A complete C# prototype implements forward pass, backpropagation, and Adam optimization without external machine-learning frameworks. The goal is to validate numerical correctness and memory requirements before hardware implementation. The evaluated model is a 334K-parameter autoregressive transformer (d=88, H=4, f=264, L=4, vocab=256) trained on the Shakespeare corpus. The BF16W configuration achieves evaluation loss 1.5426 after 80K samples, compared with 1.5224 for an FP32 GPU reference, while producing coherent character-level text. The paper introduces BF16W, which stores weights in BF16 while retaining Adam optimizer moments in FP32. This reduces memory requirements for on-chip training. A 334K-parameter FP32 model with Adam moments requires approximately 4.0 MB, matching the BRAM capacity of a Xilinx ZCU102 device. The BF16W variant requires approximately 3.34 MB, leaving memory available for activation storage. We describe the vocabulary-budget constraint observed during earlier experiments, quantify BF16W memory savings, and outline FPGA training as the next stage of development. No FPGA measurements are included in this paper. This publication serves as a public architectural disclosure and software reference implementation for future FPGA and ASIC exploration of the NeuronFabric architecture.
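The memory figures quoted in the abstract can be reproduced with a short back-of-the-envelope calculation. The sketch below rests on assumptions the abstract does not state: exactly 334,000 parameters, 1 MB taken as 10^6 bytes, and two Adam moments per parameter. Gradients and activations are left out, and the function name `training_state_mb` is made up for illustration.

```python
# Back-of-the-envelope check of the memory figures quoted in the abstract.
# Assumptions (not stated in the abstract): exactly 334,000 parameters,
# 1 MB = 10**6 bytes, and two Adam moments per parameter.

PARAMS = 334_000
FP32_BYTES = 4
BF16_BYTES = 2
ADAM_MOMENTS = 2  # first and second moment estimates

def training_state_mb(weight_bytes, moment_bytes=FP32_BYTES, params=PARAMS):
    """Megabytes needed for weights plus Adam moments (activations excluded)."""
    per_param = weight_bytes + ADAM_MOMENTS * moment_bytes
    return params * per_param / 1e6

fp32_mb = training_state_mb(FP32_BYTES)   # FP32 weights, FP32 moments
bf16w_mb = training_state_mb(BF16_BYTES)  # BF16 weights, FP32 moments

print(f"FP32:  {fp32_mb:.2f} MB")
print(f"BF16W: {bf16w_mb:.2f} MB")
```

Under these assumptions the script gives about 4.01 MB for FP32 and 3.34 MB for BF16W, in line with the approximate 4.0 MB and 3.34 MB reported above. The saving comes entirely from the weights, since the moments stay in FP32.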

[AI-58] Autonomous End-to-End SOH Prediction Services for Battery Systems via Temporal-Contrastive Representation Learning

Link: https://arxiv.org/abs/2606.16434
Authors: Junting Wen,Dan Li,Qihao Quan,Xiwen Wang,Hang Yang,Zhaohong Meng,Zigui Jiang,Changlin Yang,Tianle Liu,Diego Muñoz-Carpintero,Jian Lou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Accurate state of health (SOH) estimation is a critical diagnostic service for lithium-ion battery management. However, reliance on labor-intensive manual feature engineering and opaque black-box models hinders scalable industrial deployment. To address this, we introduce TC-SOH: a modular, plug-and-play service architecture for autonomous, end-to-end SOH prediction. TC-SOH employs a temporal-contrastive mechanism and a cross-window prediction pretext task to extract degradation-relevant representations directly from raw operational data. To improve transparency, we connect model efficacy with representation diagnostics: visualization, sensitivity analysis, redundancy analysis, bidirectional probing, future-SOH probing, and temporal shuffling show that learned features overlap with selected expert descriptors while retaining additional SOH-relevant variation, and that ordered temporal context improves subsequent-SOH prediction. Across four public datasets, TC-SOH outperforms the considered physics-informed and data-driven baselines, reducing MAPE by 1.91 times and RMSE by 2.13 times.

[AI-59] Posterior Twins: Distributional Behavioral Simulation for Enterprise Decisions

Link: https://arxiv.org/abs/2606.16415
Authors: Ankit Das(Twinning Labs)
Subjects: Artificial Intelligence (cs.AI)
Comments: 13 pages, 2 figures

Abstract:Enterprise behavioral simulation requires more than producing a plausible response. Many decisions depend on the shape of a population under a proposed action: which segments accept, defect, hesitate, or move into risk-sensitive states. This paper introduces Posterior Twins, a memory-grounded digital-twin approach that represents likely behavior as an updated distribution under a specific decision context. We evaluate a family of Twinning Labs behavioral-model operating points on a 226-example held-out behavioral-response benchmark and report both modal accuracy and Wasserstein-1 distance. The results show that modal accuracy and distributional fidelity identify different operating regimes. TL-Twin Alpha achieves the lowest observed Wasserstein-1 distance in the reported result set ($W_1 = 1.16$), while TL-Twin Delta and TL-Twin Gamma provide balanced operating points near the modal-accuracy frontier. The paper frames these results as a systems result: governed memory, behavioral model routing, scenario orchestration, distributional aggregation, and auditability are necessary for turning simulated behavior into reusable enterprise decision evidence.
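To make the contrast between modal accuracy and Wasserstein-1 distance concrete, here is a minimal sketch on made-up data. It assumes responses live on an ordered, unit-spaced scale (a hypothetical 5-point scale), in which case the one-dimensional Wasserstein-1 distance is the sum of absolute differences between the two cumulative distributions. The distributions and function names are illustrative and do not come from the paper's benchmark.

```python
# Sketch: modal accuracy vs. Wasserstein-1 distance for distributions over
# an ordered set of response levels (hypothetical 5-point scale, unit spacing).

def wasserstein_1(p, q):
    """W1 between two discrete distributions on the same unit-spaced support.

    In one dimension, W1 equals the sum of absolute CDF differences.
    """
    assert len(p) == len(q)
    cdf_p = cdf_q = 0.0
    total = 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        total += abs(cdf_p - cdf_q)
    return total

def mode(p):
    """Index of the most likely response level."""
    return max(range(len(p)), key=lambda i: p[i])

observed = [0.10, 0.20, 0.40, 0.20, 0.10]
model_a = [0.00, 0.00, 1.00, 0.00, 0.00]  # right mode, no spread
model_b = [0.10, 0.25, 0.30, 0.25, 0.10]  # right mode, similar shape

for name, pred in [("A", model_a), ("B", model_b)]:
    print(name, mode(pred) == mode(observed), round(wasserstein_1(observed, pred), 3))
```

Both hypothetical models predict the correct modal response, yet model A sits at a distance of 0.8 from the observed distribution and model B at 0.1. A ranking by modal accuracy alone would not separate them, which is one way the two metrics can identify different operating regimes.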

[AI-60] Looking Is Not Picking: An Attention-Segment Account of Tool-Selection Failures in LLM Agents

Link: https://arxiv.org/abs/2606.16364
Authors: Shiyang Chen
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Software Engineering (cs.SE)
Comments: 13 pages, 1 figure, 15 tables

Abstract:LLM agents mis-call tools, and the natural guess is that the model failed to see the right tool in a crowded harness. We show the opposite through a lens concurrent work sets aside – the model’s attention to labeled tool-definition segments. On real BFCL failures, by per-candidate attention argmax the model attends most to the correct tool 80% of the time (vs. 21% chance), and the gold is the under-attended segment on only 10%: it looks at the right tool and still picks wrong. This directly refutes the intuitive “crowded-harness / lost-in-the-middle” explanation: the failure is at the decision readout, not the harness, and we pin it there three ways. (1) Input vs. readout: repairing the prompt (reordering or duplicating the gold tool) recovers =23% of failures, while readout-side interventions recover 59-91%. (2) Representation-invariance: two gold-pointed interventions in different representations – an additive attention-logit bias and a residual-stream steering vector – recover largely the same failures (per-task Jaccard 0.865 pooled, 0.79-0.91 per model), so the bottleneck is localized to the readout independent of which representation is poked. (3) A training-free, gold-free selector: per-segment attention closes most of the gold-free-vs-oracle gap on BFCL (+11.9 pts pooled function-name selection vs. +17.9-pt oracle headroom) and adds +14.9 pts on Seal-Tools; every model positive (exact McNemar p=8e-4 each). Scopes differ: the causal attention-bias dose-response is bidirectional and monotonic on 10 mask-honoring models (3-32B), the full 0.5-32B span carrying only the correlational diagnostic; the deployable selector is evaluated on 5 single-turn models and does not yet transfer to a multi-turn loop.
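Two quantities in this abstract are simple to state in code: the per-candidate attention argmax and the Jaccard overlap between the sets of failures that two interventions recover. The sketch below uses made-up tool names, attention masses, and task ids. It does not show how attention is extracted from a model or aggregated over heads and layers, which the abstract does not specify.

```python
# Sketch of two quantities from the abstract, on made-up data:
# (1) picking the tool whose definition segment receives the most attention,
# (2) Jaccard overlap between the failure sets two interventions recover.

def attention_argmax(segment_attention):
    """Return the tool name whose segment has the largest attention mass."""
    return max(segment_attention, key=segment_attention.get)

def jaccard(a, b):
    """Size of the intersection over size of the union (1.0 if both are empty)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0

# Hypothetical per-segment attention mass for one failed call.
attention = {"get_weather": 0.12, "book_flight": 0.55, "convert_currency": 0.33}
print(attention_argmax(attention))

# Hypothetical task ids recovered by an attention-logit bias and by a steering vector.
recovered_by_bias = {1, 2, 3, 5, 8}
recovered_by_steering = {1, 2, 3, 5, 9}
print(round(jaccard(recovered_by_bias, recovered_by_steering), 3))
```

In this toy example the two interventions share four of six recovered tasks, a Jaccard index of about 0.667. The values reported above (0.865 pooled) would correspond to considerably larger overlap.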

[AI-61] Communication-Efficient Verifiable Attention for LLM Inference

Link: https://arxiv.org/abs/2606.16352
Authors: Ziqun Chen,Ming Wu,Michael Heinrich,Jason Zeng,Huiying Lan,Tianwei Zhang,Rui Tan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 19 pages, 16 figures

Abstract:Computation integrity of remote large language model (LLM) serving can be questionable. For conventional deep neural networks (DNNs), the existing TEE-shielded DNN partitioning (TSDP) approach uses a Trusted Execution Environment (TEE) to compute non-linear components and verify the integrity of linear components offloaded to an untrusted GPU. However, directly applying TSDP to Transformer-based LLMs incurs significant TEE computation and TEE-GPU communication overhead. This paper presents Communication-efficient TEE-GPU Attention (VeriAttn) for accelerating verifiable LLM inference. VeriAttn offloads both linear and non-linear computations of attention to the GPU, while the TEE performs verification. Moreover, for prefill, VeriAttn uses a two-level pipeline to overlap data movement, TEE pre-/post-processing, and GPU computation. For decoding, when the key-value cache exceeds available GPU memory, VeriAttn partitions attention across TEE and GPU to reduce repeated key-value transfers. Evaluation on an Intel TDX platform shows that VeriAttn achieves 2.60-3.38$\times$ and 3.86-5.42$\times$ acceleration over TSDP for 6k-token prompts and 10k-token outputs during prefill and decoding, respectively.
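The abstract does not say which check the TEE runs, so the following is only a classical example of why verifying an offloaded linear operation can cost far less than recomputing it. Freivalds' check tests whether a claimed product C equals AB using matrix-vector products only. The matrices, the number of trials, and the choice of nonzero random integers are all illustrative assumptions and are not taken from VeriAttn.

```python
# Freivalds-style check, shown only as a classical example of cheap
# verification of an offloaded matrix product. The abstract does not say
# which check VeriAttn uses; everything here is illustrative.
import random

def matvec(M, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def freivalds_check(A, B, C, trials=4, rng=None):
    """Probabilistically test whether A @ B == C using only matrix-vector products."""
    rng = rng or random.Random(0)
    n_cols = len(C[0])
    for _ in range(trials):
        r = [rng.randrange(1, 2**31) for _ in range(n_cols)]
        # Compare A @ (B @ r) with C @ r: O(n^2) work instead of O(n^3).
        if matvec(A, matvec(B, r)) != matvec(C, r):
            return False
    return True

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C_good = [[19, 22], [43, 50]]  # the true product
C_bad = [[19, 22], [43, 51]]   # one tampered entry

print(freivalds_check(A, B, C_good), freivalds_check(A, B, C_bad))
```

The verifier never forms AB itself. Integer inputs keep the comparison exact here; with floating-point activations a tolerance would be needed, and attention's softmax would require a different kind of check.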

[AI-62] SMEPilot: Characterizing and Optimizing LLM Inference with Scalable Matrix Extensions

Link: https://arxiv.org/abs/2606.16332
Authors: Feiyang Chen,Haibo Chen
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments:

Abstract:Modern CPUs increasingly integrate matrix extensions, such as Arm Scalable Matrix Extension (SME), that provide high-throughput matrix execution within the CPU. For LLM inference, however, these units are not a universal replacement for conventional CPU cores: prefill, decode, attention, and KV-cache operations expose different arithmetic intensities, vector behavior, and layout requirements, while SME units and CPU cores still compete for shared memory bandwidth. This paper studies this mismatch through a roofline-based characterization of SME-enabled CPUs and uses the resulting model to guide operator-level execution choices. We present SMEPilot, an LLM inference engine that selects CPU-only, SME-only, or cooperative SME+CPU execution for each operator shape. SMEPilot partitions matrix work across SME and CPU cores at tile granularity, overlaps SME-suitable matrix stages with CPU-suitable vector stages in attention, and maintains layout state so packed tensor representations are reused rather than repeatedly rebuilt on critical paths. Across Llama-3.2-3B, Qwen3-4B, and Qwen3-30B-A3B on phone, PC, and server platforms, SMEPilot improves end-to-end inference performance by up to 3.94$\times$.
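The roofline reasoning mentioned above can be sketched in a few lines: attainable throughput is the smaller of the compute peak and memory bandwidth times arithmetic intensity. All hardware numbers, backend names, and the tie-breaking rule below are placeholders chosen for illustration, not measurements or policies from SMEPilot.

```python
# Roofline sketch with made-up hardware numbers; the backends, peaks and
# bandwidth below are placeholders, not measurements from the paper.

def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound: min(compute peak, memory bandwidth x arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gbs * intensity)

BACKENDS = {
    "cpu": {"peak_gflops": 200.0, "bandwidth_gbs": 50.0},
    "sme": {"peak_gflops": 2000.0, "bandwidth_gbs": 50.0},  # shared memory bandwidth
}

def best_backend(intensity):
    """Pick the backend with the higher roofline bound; ties go to the CPU."""
    bounds = {name: attainable_gflops(intensity, **hw) for name, hw in BACKENDS.items()}
    return max(("cpu", "sme"), key=lambda name: bounds[name]), bounds

# Decode-like GEMV (low intensity) vs. prefill-like GEMM (high intensity), in FLOP/byte.
for label, intensity in [("decode", 1.0), ("prefill", 64.0)]:
    print(label, best_backend(intensity))
```

With these placeholder numbers the low-intensity operator is bandwidth-bound on both backends, so the matrix unit offers no advantage, while the high-intensity operator benefits from the higher compute peak. This is the general shape of the mismatch the abstract describes.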

[AI-63] Phase-Aware Guidance Injection for Recurrent MAPPO in Assembly-Line Disruption Recovery

Link: https://arxiv.org/abs/2606.16330
Authors: Xin Huang,Yongcai Wang,Fengyi Zhang,Zhikun Tao,Yunjun Han,Naiqi Wu
Subjects: Artificial Intelligence (cs.AI)
Comments: 6 pages, 4 figures, accepted by the 2026 IEEE International Conference on Automation Science and Engineering (CASE 2026)

Abstract:Disruption recovery in industrial assembly lines requires timely decisions under machine faults, worker absence, and emergency orders. Existing methods either rely on rigid handcrafted recovery logic or learn adaptive policies that do not readily exploit heterogeneous external recovery knowledge at decision time to reduce abnormal recovery time (ART) and preserve on-time delivery (OTD). To address this gap, we propose a phase-aware guidance injection framework that augments a trained recurrent MAPPO (RMAPPO) scheduling policy through logit-level action bias during evaluation. The framework provides a unified decision-time interface for rule-based, replay-based, and online LLM-based guidance, while activating intervention only during abnormal and recovery phases. Experiments on a custom AssemblyLineEnv show that high-quality rule guidance yields the strongest gains, replay-based guidance degrades smoothly under imperfect availability, and online LLM guidance still provides useful intermediate improvements. These results show that decision-time guidance injection can exploit heterogeneous recovery hints without redesigning the actor.
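The idea of biasing action logits only during certain phases can be sketched as follows. The phase names, the bias scale, the greedy action choice, and the three-action example are all hypothetical. The abstract does not specify how the bias is computed from rule-based, replay-based, or LLM guidance, or whether actions are sampled or chosen greedily.

```python
# Minimal sketch of phase-gated logit biasing at evaluation time.
# Phase names, the bias scale and the action set are hypothetical.

GUIDED_PHASES = {"abnormal", "recovery"}

def guided_action(policy_logits, guidance_bias, phase, scale=1.0):
    """Add a guidance bias to the policy logits only in guided phases, then act greedily."""
    if phase in GUIDED_PHASES:
        logits = [l + scale * b for l, b in zip(policy_logits, guidance_bias)]
    else:
        logits = list(policy_logits)
    return max(range(len(logits)), key=lambda i: logits[i])

policy_logits = [2.0, 1.5, 0.1]  # trained policy prefers action 0
rule_hint = [0.0, 1.0, 0.0]      # external guidance favours action 1

print(guided_action(policy_logits, rule_hint, phase="normal"))
print(guided_action(policy_logits, rule_hint, phase="recovery"))
```

Because the bias is added outside the network, the trained actor is left untouched. In the normal phase the policy's own preference (action 0) is used, and in the recovery phase the hint tips the choice to action 1.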

[AI-64] Exploiting Search in Symbolic Numeric Planning with Patterns

Link: https://arxiv.org/abs/2606.16329
Authors: Matteo Cardellini,Enrico Giunchiglia
Subjects: Artificial Intelligence (cs.AI)
Comments: Under Review at the Journal of Artificial Intelligence Research

Abstract:In this paper, we present a procedure for numeric planning based on Symbolic Pattern Planning (SPP). Given a numeric planning problem $\Pi$, a pattern $\prec$ is a sequence of actions used to define a formula encoding the subsequences of $\prec$ executable from a starting state $S$. Cardellini, Giunchiglia, and Maratea (2024a) follow the Planning as Satisfiability approach by defining, at each step $n \ge 0$, a formula $\Pi^\prec_n$ in which (i) the pattern $\prec$ is computed only for $n=0$ in the initial state $I$ of $\Pi$, and then exploited at each step $n$, (ii) the starting state $S$ is set to $I$, and (iii) the set $G$ of goals is required to hold in the last state that can be reached by one of the subsequences of $\prec$ concatenated $n$ times. The procedure begins with $n=0$, terminates as soon as $\Pi^\prec_n$ is satisfiable, and otherwise proceeds by incrementing $n$. In this paper, possibly at each step, (i) we symbolically search for an intermediate state $P$ reachable from $I$, closer to a goal state, (ii) dynamically recompute the pattern $\prec_h$ (to be used in the next step) in $P$, (iii) refine the pattern $\prec_g$ used to reach $P$, and (iv) start the new search from the state $S$ which can be either the initial state $I$ or the last computed intermediate state $P$, exploiting the computed patterns $\prec_g$ and $\prec_h$ to define the pattern $\prec$ to be used in the search. In particular, at each step, we define a formula $\Pi^\prec_{S,P}$ encoding the existence of a state $P'$ closer than $P$ to a goal state, with $P'$ reachable from the starting state $S$ when using the pattern $\prec$. We present different techniques for producing such formulas, each corresponding to a different strategy for exploring the search space. We prove their correctness and completeness, the latter under certain conditions.

[AI-65] AdaSTORM: Scaling LLM Reasoning on Dynamic Graphs via Adaptive Spatio-Temporal Multi-Agent Collaboration

Link: https://arxiv.org/abs/2606.16328
Authors: Bing Hao,Ruijie Wang,Haodong Qian,Yunlong Chu,Yuhang Liu,Yumeng Lin,Minglai Shao,Jianxin Li
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) demonstrate remarkable potential in dynamic graph reasoning, but suffer from a scaling bottleneck: current models can only handle graphs with tens of nodes, constrained by exponential reasoning overhead and finite context windows. While multi-agent systems (MAS) offer collective reasoning and topology-aware orchestration, capabilities naturally suited for graph-structured tasks, their application to dynamic graphs remains unexplored. This paper presents Scaling LLM Reasoning on Dynamic Graphs via Adaptive Spatio-Temporal Multi-Agent Collaboration (AdaSTORM), a framework that reformulates large-scale dynamic graph reasoning into two stages: (i) Adaptive Partitioning, partitioning large-scale dynamic graphs into subregions that match the model’s reasoning capacity while minimizing inference cost; and (ii) Collaborative Reasoning, aligning graph partition topologies with a spatio-temporal decoupled multi-agent architecture. AdaSTORM is the first multi-agent framework tailored for dynamic graph reasoning. Extensive experiments show that AdaSTORM breaks through the scaling bottleneck, scaling reasoning to thousand-node graphs with over 90% accuracy across several large-scale dynamic graph settings without external tools, and significantly outperforms seven competitive baselines. Furthermore, it achieves state-of-the-art accuracy on existing benchmarks and generalizes robustly to real-world datasets. The source code is available at: this https URL.

[AI-66] ArtBoost: Synthetic Articulatory Data Augmentation for Acoustic-to-Articulatory Inversion INTERSPEECH26

Link: https://arxiv.org/abs/2606.16327
Authors: Hyung Kyu Kim,Byungchan Hwang,Hak Gu Kim
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted in Interspeech26

Abstract:Recent acoustic-to-articulatory inversion (AAI) models rely on electromagnetic articulography (EMA) data, which are costly and limited in scale. To address this limitation, we propose ArtBoost, a novel data augmentation strategy that leverages large-scale speech–mesh datasets originally developed for speech-driven 3D facial animation to improve AAI under limited EMA supervision. ArtBoost extracts pseudo articulatory trajectories from visible facial anchors and uses them for pre-training before fine-tuning on real EMA data. Experiments show consistent improvements in PCC and RMSE. Trajectory analyses confirm that the pseudo articulatory signals reflect physically meaningful visible articulatory dynamics. Additional evaluations across different AAI architectures demonstrate stable performance gains, indicating that ArtBoost can be integrated into diverse AAI models. These results suggest that speech–mesh data provide an effective and scalable source of articulatory supervision for AAI. Project page: this https URL

[AI-67] Gaming-Resistant Insurance Contracts for Autonomous AI Agents: Strategy-Proof Toll Mechanism Design

Link: https://arxiv.org/abs/2606.16326
Authors: Hao-Hsuan Chen
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Risk Management (q-fin.RM)
Comments: 29 pages. Companion to arXiv:2605.26508 (Paper A, foundations) and arXiv:2605.25632 (Paper B, empirical)

Abstract:Paper A defines a time-consistent actuarial runtime that prices each side-effect-bearing action against a contractually fixed safe default and gates execution against a reserve budget. It treats the operator as passive. This paper makes the operator strategic. We characterise a five-attack space for autonomous AI-agent insurance contracts and prove when the actuarial runtime is gaming-resistant. Two attack surfaces – post-toll safe-default selection and within-boundary action splitting – are closed by Paper A’s minimal-authority and no-splitting clauses. The remaining three require new contract clauses. First, common-control aggregation prevents cross-boundary re-routing from reducing toll below the boundary potential applied to total exposure. Second, interface failures such as invalid JSON are contract-relevant events, not safety wins: treating them as zero-toll safe defaults can reward unreliable models, while escalation fees reverse the incentive. We validate this interface-compliance theorem on committed cross-model traces from the companion empirical paper. Third, a model-identity menu with a componentwise-minimum penalty schedule makes truthful reporting of the deployed model weakly dominant. We then compose these clauses with Paper A’s runtime guarantees to obtain joint incentive compatibility over the five-attack space. Finally, a two-parameter premium family discharges operator individual rationality and weak budget balance at the truthful equilibrium. The result is an incentive-compatibility layer for actuarial control of autonomous-agent side effects.

[AI-68] Architectural Wisdom: A Framework for Governing Optimization in AI Systems

Link: https://arxiv.org/abs/2606.16319
Authors: Edward Y. Chang
Subjects: Artificial Intelligence (cs.AI)
Comments: 17 pages, 2 tables, 2 figures

Abstract:Modern AI systems exhibit structural failures that capability scaling alone does not reliably fix: they optimize under-specified objectives with no architectural mechanism to question whether the objective should be optimized at all. Engagement maximization can amplify harmful pathways; tool-using agents can commit irreversible actions; preference-trained language models can become sycophantic. We argue that this failure is a wisdom problem, not an intelligence problem. We use “wisdom” in a deliberately architectural sense, not as a claim about virtue, consciousness, or moral omniscience. Intelligence accepts a goal and optimizes within it; wisdom interrogates whether the goal should be optimized at all. The two are separable architectural properties. We propose architectural wisdom as a corrigible objective-governance layer above the optimization substrate. The layer makes three structural commitments explicit and nondegenerate before any action: temporal horizon, relational boundary, and irreversibility. It is realized by four components (Structural Utility Transform, Moral Admissibility Interface, Arbitration and Escalation Controller, Value Revision Channel) that compute a six-coordinate wisdom tuple over horizon, relational coverage, irreversibility, admissibility, value revision, and auditability. We motivate the architecture by eight cases drawn from contemporary AI failures, secular wisdom traditions, and hard ethical situations, and defend the distinction against the intelligence-completeness thesis using goal-questioning over goal-taking, Bostrom’s orthogonality, structural separation in our exemplar cases, and persistent failure modes despite capability scaling. The framework is the conceptual contract for a larger architecture whose formal specifications and empirical validation are developed in subsequent work.

[AI-69] Is Your Trajectory Displacement Safe in Long-tail?

Link: https://arxiv.org/abs/2606.16313
Authors: Qiao Sun,Weicheng Zheng,Yixin Huang,Hang Zhao
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 20 pages, 15 figures

Abstract:Long-tail scenarios remain a major bottleneck for autonomous driving evaluation, even as datasets grow by orders of magnitude. Existing evaluation pipelines are rarely human-aligned, safety-aware, verifiable, and explainable at the same time: closed-loop metrics often saturate among strong planners, while unstructured human ratings can be noisy without a carefully designed protocol. We formulate planning evaluation as additional-threat detection: given a planner trajectory and an expert reference, does the planner’s displacement introduce new unsafe driving behavior? We propose FluidTest, an evaluation pipeline with three components: a pairwise WebUI protocol for reliable human annotation; a taxonomy of 32 semantic threats with evidence-grounded decision graphs; and a three-agent verification system with reflection for precision and auditability. Experiments on the WOD-E2E dataset show that FluidTest produces consistent labels among trained annotators and identifies additional threats in 65% of Poutine trajectories and 51% of RAP trajectories. These results show that state-of-the-art planners can still exhibit substantial safety-relevant failures despite high Rater Feedback Scores (RFS) and low Average Displacement Error (ADE). Additional details, guidance, and code are available at this https URL.

[AI-70] AI Supply Chain Galaxy: 3D Visual Analytics for License Compliance

Link: https://arxiv.org/abs/2606.16292
Authors: Weiru Han,Xuetao Shi,Wenyi He,Wei Wang,Rui Zhao,Moming Duan
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 15 pages, 6 figures

Abstract:The rapid proliferation of machine learning model reuse has transformed the AI ecosystem into a highly interconnected supply chain. Traditional compliance tools and static reports struggle to navigate these massive, multi-hop dependency networks. To address this, we present AI Supply Chain Galaxy (AISCG), an interactive 3D visual analytics system for model provenance and compliance auditing. AISCG maps models into a 3D spatial layout, integrating explicit structural dependencies with a rule-based compliance engine. It supports multi-scale exploration, from global community detection to localized, path-aware lineage tracing. We demonstrate its efficacy through an ecosystem-scale empirical analysis of 908,449 models from Hugging Face. Our findings reveal a concerning landscape: 55.46% of models exhibit compliance risks or metadata conflicts/omissions. We also identified distinct risk patterns, including a 56.67% license omission rate in adapter derivations and an 8.05% “license drift” rate in fine-tuning. Through a case study on the complex Llama model family, we show how AISCG empowers analysts to intuitively trace inherited restrictive terms and identify root causes across deep topological networks, significantly reducing the cognitive load of compliance auditing.

[AI-71] An affordable hardware-aware neural architecture search for deploying convolutional neural networks on ultra-low-power computing platforms

Link: https://arxiv.org/abs/2606.16290
Authors: Andrea Mattia Garavagno,Edoardo Ragusa,Antonio Frisoli,Paolo Gastaldo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Hardware-aware neural architecture search (HW-NAS) allows the integration of Convolutional Neural Networks (CNNs) in microcontroller devices by automatically designing neural architectures that fit prearranged hardware constraints. However, state-of-the-art HW-NAS methods target high-performance microcontrollers, whose power consumption does not meet the requirements of sensing nodes. This work presents an HW-NAS that generates tiny CNNs able to run on ultra-low-power microcontrollers, featuring a lightweight search procedure that can be executed even on embedded devices. Empirical results on three well-known benchmarks for tiny computer vision show that the proposed HW-NAS generates tiny CNNs while preserving state-of-the-art classification accuracy.

[AI-72] FlowMPC: Improving Flow Matching policies with World Models

Link: https://arxiv.org/abs/2606.16286
Authors: Chandon Hamel
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:Flow Matching (FM) is a powerful approach for behavior cloning in multimodal action spaces [Jiang et al., 2025], but because it is not trained to directly maximize expected return, there is still room to improve how FM policies act at test time. This work investigates whether a learned world model can improve FM policies by enabling Model Predictive Path Integral (MPPI) planning over candidate action sequences proposed by the policy. Building on TD-MPC2 [Hansen et al., 2024], I introduce FlowMPC, a framework that combines an imitation-learned FM policy with a learned world model for test-time planning in ManiSkill manipulation tasks [Tao et al., 2025]. Across PickCube and PickSingleYCB, adding the world model improved performance over the FM policy alone, with especially clear gains in end-of-episode success. These results suggest that world-model-based planning can effectively complement flow-based imitation policies without modifying the FM training objective.
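As a rough picture of the planning step mentioned above, the sketch below shows the generic MPPI-style weighting of candidate action sequences by predicted return. It is not FlowMPC itself. In the setting described above the candidates would be proposed by the FM policy and the returns predicted by the learned world model, whereas here both are made-up numbers and the function names are hypothetical.

```python
# Generic MPPI-style weighting over candidate action sequences, shown with
# made-up numbers. In the setting described above the candidates would come
# from the FM policy and the returns from a learned world model.
import math

def mppi_weights(returns, temperature=1.0):
    """Softmax of returns / temperature (shifted by the max for stability)."""
    best = max(returns)
    exps = [math.exp((r - best) / temperature) for r in returns]
    total = sum(exps)
    return [e / total for e in exps]

def mppi_first_action(candidates, returns, temperature=1.0):
    """Return-weighted average of the first action of each candidate sequence."""
    w = mppi_weights(returns, temperature)
    dim = len(candidates[0][0])
    return [sum(wi * seq[0][d] for wi, seq in zip(w, candidates)) for d in range(dim)]

# Three candidate sequences of two 1-D actions each, with predicted returns.
candidates = [[[0.0], [0.1]], [[1.0], [0.9]], [[0.5], [0.5]]]
returns = [0.0, 0.0, 0.0]
print(mppi_first_action(candidates, returns))
```

With equal predicted returns the candidates are simply averaged. As one candidate's return grows or the temperature shrinks, the weighted action approaches that candidate's first action, which is how a world model's evaluation can steer the policy's proposals at test time.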

[AI-73] SpecAlign: Efficient Specification-Grounded Alignment of Large Language Models via Synthetic Data

Link: https://arxiv.org/abs/2606.16276
Authors: Wenjie Wang,Yue Huang,Zhengqing Yuan,Han Bao,Shiyi Du,Yuchen Ma,Yue Zhao,Yanfang Ye,Xiangliang Zhang
Subjects: Artificial Intelligence (cs.AI)
Comments: 58 pages

Abstract:As large language models (LLMs) are increasingly deployed in real-world applications, alignment is no longer governed by a single universal notion of safety or helpfulness, but instead by provider- or application-specific model specifications. These specifications are typically long, structured, and frequently updated, yet existing alignment pipelines lack a systematic mechanism to operationalize them as training signals. In this paper, we propose specification-grounded alignment, a new alignment paradigm that treats provider-authored model specifications as the primary alignment target rather than abstract principles or static benchmarks. To instantiate this paradigm, we introduce SpecAlign, a framework that synthesizes alignment data directly from specification documents. SpecAlign combines structured rule annotation, controllable specification instantiation, and multi-agent adversarial data synthesis to generate fine-grained, boundary-aware preference pairs that capture both compliant behaviors and meaningful specification violations. Experiments across multiple model specifications and backbone models demonstrate that training with SpecAlign consistently improves rule compliance while preserving general capabilities and avoiding over-conservative behavior. These results suggest that grounding alignment in explicit model specifications enables rapid, precise, and scalable adaptation of LLM behavior to evolving policy requirements.

[AI-74] UXBench: Measuring the Actionability of LLM-Generated UX Critiques

Link: https://arxiv.org/abs/2606.16262
Authors: Wenjie Wang,Yue Huang,Zipeng Ling,Han Bao,Hang hua,Xiaonan Luo,Yu Jiang,Shiyi Du,Yuexing Hao,Xiaomin Li,Yuchen Ma,Dianzhuo Wang,Yanfang Ye,Xiangliang Zhang
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 30 pages

Abstract:Large language models (LLMs) are increasingly deployed as UX judges that inspect interfaces, diagnose usability problems, and propose repairs. Yet no controlled benchmark measures whether the resulting critiques are reliable and actionable across heterogeneous product surfaces. We introduce UXBench, a benchmark for evaluating LLMs as interaction-grounded UX judges. UXBench comprises local-first runnable web fixtures spanning ten product-surface families, paired with coverage-gated browser exploration that forces models to collect interaction evidence before reporting. Each judge model produces a structured UX report over seven rubric dimensions; report quality is measured by whether a fixed downstream repair agent can improve the interface based on the critique. We evaluate eight frontier models under both an automated repair-lift protocol and a blind human validation study. Results show that UX judging is neither saturated nor one-dimensional: models differ meaningfully in report actionability, exhibit distinct rubric-level repair signatures, vary in fixture-level reliability, and trade leadership across surface categories.

[AI-75] Variance Reduction for Non-Log-Concave Sampling with Applications to Inverse Problems UAI

Link: https://arxiv.org/abs/2606.16257
Authors: M. Berk Sahin,Ahmet Ege Tanriverdi,Behzad Sharif,Abolfazl Hashemi
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to Uncertainty in Artificial Intelligence (UAI) 2026

Abstract:Sampling from high-dimensional, non-log-concave distributions with unnormalized densities is a fundamental challenge in machine learning, particularly when the exact gradient of the potential is unavailable and must be approximated via stochastic gradients that exhibit high variance under a fixed budget of gradient computations per iteration. Although variance reduction techniques such as SGD with momentum, STORM, and PAGE have demonstrated improved convergence properties in non-convex optimization, their implications for sampling from non-log-concave distributions remain largely unexplored. In this work, we develop the first unified analysis of these estimators for sampling from non-log-concave distributions. We establish improved non-asymptotic convergence rates in $\varepsilon$-relative Fisher information and, under a Poincaré inequality assumption, in squared total variation distance, and further prove weak convergence to the target distribution. We extend our analysis to solving inverse problems with score-based generative priors. We empirically validate our theory and demonstrate that, under a fixed budget of gradient computations per iteration, variance-reduction techniques consistently improve sample quality in two standard imaging applications.

[AI-76] SPARK: Security Knowledge Priming and Representation-Guided Knowledge Activation for LLM-based Secure Code Generation

Link: https://arxiv.org/abs/2606.16244
Authors: Xiaoyun Xu,Lichao Wu,Jona te Lintelo,Siyu Zhang,Stjepan Picek
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models routinely generate code with exploitable security flaws. Prior literature attributes this limitation to a lack of security expertise, steering current defense mechanisms toward heavy fine-tuning or external knowledge retrieval, which introduces significant computational overhead and data bias through redundant code examples. Contrary to this view, we argue that pretraining corpora are already rich in security material. The bottleneck is activation: without an explicit and brief cue, statistical pressure toward common training-distribution patterns suppresses the model’s safety-relevant representations. We present SPARK, an inference-time security harness that activates this latent knowledge without any retraining. The harness has two parts. Component I retrieves a few of the relevant Common Weakness Enumeration (CWE) entries for each coding task and appends a short structured cue to the prompt; this alone is enough to surface the model’s existing security representations. Component II adds a precomputed token bias to the logits at every decoding step. We obtain the bias by projecting a safe-direction vector, the unit difference between the mean safe and mean unsafe last-layer hidden states, through the language model head. The bias is computed once offline; applying it costs a single vector addition per generated token. We evaluate SPARK on 9 open-source models across C++, Java, and Python, and compare with 7 baselines spanning fine-tuning and retrieval-augmented methods. SPARK matches or improves on the best baseline in every setting while preserving HumanEval utility. We further test Component I in a black-box setting on 7 of today’s strongest models, including Claude, DeepSeek, and GPT, demonstrating the bottleneck of insecure code generation and the improvements enabled by our method.
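
The logit-bias step described for the second component can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy two-dimensional hidden states, the three-token vocabulary, and the `strength` scaling factor are all hypothetical, and plain Python lists stand in for model tensors.

```python
import math

def unit_difference(mean_safe, mean_unsafe):
    """Unit vector pointing from the mean unsafe state to the mean safe state."""
    diff = [s - u for s, u in zip(mean_safe, mean_unsafe)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def token_bias(lm_head, direction):
    """Project the direction through the LM head (one weight row per vocabulary token)."""
    return [sum(w * d for w, d in zip(row, direction)) for row in lm_head]

def biased_logits(logits, bias, strength=1.0):
    """Per-token cost at decoding time: one vector addition (strength is an assumed knob)."""
    return [l + strength * b for l, b in zip(logits, bias)]

# Toy values (hypothetical): hidden size 2, vocabulary size 3.
mean_safe = [1.0, 0.0]
mean_unsafe = [0.0, 0.0]
lm_head = [[2.0, 0.0], [-1.0, 1.0], [0.0, 3.0]]

direction = unit_difference(mean_safe, mean_unsafe)  # computed once, offline
bias = token_bias(lm_head, direction)                # computed once, offline
steered = biased_logits([0.5, 0.5, 0.5], bias)       # applied at every decoding step
print(direction, bias, steered)
```

In this toy setup the bias raises the logit of the token whose head row aligns with the safe direction and lowers the one that opposes it, which is the effect the abstract attributes to the precomputed bias.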

[AI-77] From Tokens to Regions: CUDA-Sensitive Instruction Tuning for GPU Kernel Generation

Link: https://arxiv.org/abs/2606.16231
Authors: Wentao Chen,Jiace Zhu,Xing Zhe Chai,Zeng Qu,Qiaoling Xiao,Liucheng Duan,An Zou
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:High-performance CUDA kernels are essential for scalable AI systems, while Large Language Models (LLMs) still struggle to generate correct kernels due to strict and implicit execution constraints. Existing LLM-based approaches either rely on costly agentic or reinforcement-learning (RL) pipelines, or adopt supervised fine-tuning (SFT) objectives that fail to explicitly model CUDA sensitivity, namely code tokens or regions tightly coupled with execution constraints. In this work, we investigate CUDA sensitivity from the perspective of token confidence patterns, showing that CUDA sensitivity appears at both token and region levels, where most CUDA-sensitive tokens are predicted with high confidence, while a smaller low-confidence subset forms regions corresponding to execution-critical structures. These findings suggest that effective CUDA kernel generation should both leverage high-confidence CUDA-sensitive tokens and preserve low-confidence CUDA-sensitive regions. Building on these insights, we propose CUDA-Sensitive Instruction Tuning (CuSeT), a low-cost post-training method within a simple SFT framework. CuSeT follows the principle of "from tokens to regions" by combining adaptive token-level masking with region-aware sample reweighting. Experiments show that CuSeT consistently improves functional correctness across multiple model families and scales, outperforming standard SFT and advanced SFT variants, while achieving competitive performance against frontier CUDA kernel generation models with substantially lower inference cost.

[AI-78] Latent Thought Flow: Efficient Latent Reasoning in Large Language Models

Link: https://arxiv.org/abs/2606.16222
Authors: Xiandong Zou,Jing Huang,Jianshu Li,Pan Zhou
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models (LLMs) increasingly rely on intermediate reasoning, yet explicit Chain-of-Thought (CoT) suffers from a linguistic space bottleneck: each thought must be decoded into tokens, causing high inference overhead. Latent reasoning moves deliberation into continuous space, but existing methods mostly learn deterministic or reward-maximizing paths, lacking a principled way to allocate probability across trajectories with different correctness and costs. We propose Latent Thought Flow (LTF), which models reasoning as variable-length continuous trajectories and trains a sampler to match a reward-induced posterior over answer quality and computation cost. We instantiate this with a continuous GFlowNet using stochastic latent transitions. To handle sparse answer supervision, we introduce an Entropy-Weighted Subtrajectory Balance objective for intermediate rewards and a reference-prior regularizer to anchor exploration. Experiments under finetuning and transfer learning settings show that LTF outperforms explicit CoT and latent reasoning baselines, improving accuracy by 9.5% while reducing reasoning length by 27.2% on average compared with strong latent reasoning baselines.

[AI-79] Calibrated Sampling-Free Uncertainty Estimation in Bayesian Deep Learning

Link: https://arxiv.org/abs/2606.16214
Authors: Tobias Jan Wieczorek,Leon de Andrade,Thomas Möllenhoff,Marcus Rohrbach
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Modern deep learning models remain notoriously prone to overconfidence, limiting their reliability in high-stakes applications. Bayesian methods aim to counter this by learning a distribution over model parameters, and recent advances now make this feasible for large-scale architectures at costs comparable to AdamW. However, a challenge remains at test time: predictions must be averaged across many forward passes with weights sampled from the posterior, which is prohibitively expensive. Variance propagation offers an efficient alternative, computing layer-wise analytical approximations of uncertainty in a single forward pass. While such techniques are effective for MLPs, their extension to modern architectures remains challenging, due to increased depth and diversity of layer types. To fill this gap, we propose Calibrated Variance Propagation (CVP), which introduces a new propagation method for normalization layers, combines it with recent techniques for handling activation functions, and absorbs residual error through a light calibration step. CVP yields comparably accurate uncertainty estimates to MC sampling across transformers and CNNs, at a fraction of the cost. Against prior variance propagation work, CVP improves coverage at 0.5% risk from 8.2% to 14.6% with BEiT-3 on Visual Reasoning (NLVR2) and from 2.6% to 10.8% with ViLT on VQAv2, with gains extending to convolutional architectures.
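
For intuition on why a single forward pass can stand in for sampling, the sketch below propagates mean and variance through one linear layer with independent Gaussian weights and compares the result with Monte Carlo sampling. This is the textbook mean-field moment computation for a deterministic input, not CVP itself: it covers neither the normalization-layer rule nor the calibration step, and all names and toy values are assumptions.

```python
import random

def propagate_linear(x, w_mean, w_var):
    """Closed-form mean and variance of y = W x for independent Gaussian weights."""
    mean = [sum(m * xi for m, xi in zip(row, x)) for row in w_mean]
    var = [sum(v * xi * xi for v, xi in zip(row, x)) for row in w_var]
    return mean, var

def monte_carlo_linear(x, w_mean, w_var, n_samples, seed=0):
    """Sampling baseline: draw weights, run the layer, and average over many passes."""
    rng = random.Random(seed)
    n_out = len(w_mean)
    sums = [0.0] * n_out
    sq_sums = [0.0] * n_out
    for _ in range(n_samples):
        for j in range(n_out):
            y = sum(rng.gauss(m, v ** 0.5) * xi
                    for m, v, xi in zip(w_mean[j], w_var[j], x))
            sums[j] += y
            sq_sums[j] += y * y
    means = [s / n_samples for s in sums]
    variances = [sq / n_samples - mu * mu for sq, mu in zip(sq_sums, means)]
    return means, variances

# Toy layer (hypothetical): two inputs, one output.
x = [1.0, 2.0]
w_mean = [[0.5, -1.0]]
w_var = [[0.04, 0.01]]

analytic_mean, analytic_var = propagate_linear(x, w_mean, w_var)
mc_mean, mc_var = monte_carlo_linear(x, w_mean, w_var, n_samples=20000)
print(analytic_mean, analytic_var)
print(mc_mean, mc_var)
```

The analytic pass costs one evaluation of the layer, while the sampling baseline needs thousands of passes to reach a comparable estimate. The hard part the abstract points to is extending such closed forms to nonlinear and normalization layers.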

[AI-80] Sensor-Conditioned Representation Learning via Scene-Relevant Observation Quotients

Link: https://arxiv.org/abs/2606.16210
Authors: Yan Jiao,Pin-Han Ho,Limei Peng
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Learned representations in intelligent sensing systems are often evaluated by reconstruction fidelity or downstream prediction accuracy, but these criteria do not specify which latent distinctions are justified by the sensing process. In sensor-conditioned environments, nuisance factors can change measurements without changing the scene, while distinct scenes may be indistinguishable under limited sensing capability. This paper formulates sensor-conditioned representation correctness as preserving sensing-supported scene distinctions while suppressing nuisance-induced and sensor-unsupported variation. We introduce the scene-relevant observation quotient, a representation target induced by sensing-supported distinguishability after nuisance canonicalization, and develop Observation-Quotient Tucker-Structured Autoencoding (OQ-TSAE), a scene-nuisance factorized framework with diagnostics for false distinction, false merge, nuisance sensitivity, and latent ordering consistency. Experiments on a controlled benchmark show that quotient-consistent supervision improves representation-correctness diagnostics over reconstruction-oriented, metric-learning, and contrastive-learning baselines. Sensitivity, perturbation, and ablation studies show the importance of quotient-aligned supervision, reliable quotient relations, and quotient geometry. Complementary real-radar experiments show that a reconstruction-only OQ-TSAE variant retains competitive downstream utility, robustness under observation degradation, and low seed-to-seed variability. These results suggest that sensor-conditioned representations should be evaluated not only by predictive utility, but also by whether their latent geometry preserves sensing-justified scene distinctions.

[AI-81] Embedded Arena: Iterative Optimization via Hardware Feedback

Link: https://arxiv.org/abs/2606.16190
Authors: Zhihan Zhang,Alexander Le Metzger,Jiuyang Lyu,Chun-Cheng Chang,Jiayi Shao,Yujia Liu,Emmanuel Azuh Mensah,Edward Wang,Kurtis Heimerl,Gregory D. Abowd,Shwetak Patel,Natasha Jaques,Vikram Iyer
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: Code: this https URL

Abstract:Embedded devices from wildlife monitoring stations to clinical wearables require local AI inference due to latency, communication, or privacy constraints. Optimizing models for heterogeneous microcontrollers (MCUs) requires simultaneously satisfying hard physical constraints on memory, power, and temperature while preserving accuracy, a multidimensional optimization that is today performed manually by experts. We ask whether an LLM agent can autonomously navigate this complex, multi-turn pipeline guided by real hardware feedback, and introduce a hardware-in-the-loop agent arena in which the agent iteratively refines both model and firmware – compiling, flashing, and measuring on real hardware – to enable closed-loop optimization. Frontier models, including Claude Opus 4.7 and Gemini 3.1 Pro, fail entirely without hardware feedback (0% deployment success), whereas our hardware-in-the-loop formulation achieves the first successful deployment within three iterations and can surpass human expert results within seven. This agentic co-optimization achieves 250x compression for vision models with 3.3% accuracy loss and 400x for audio with 6% Feature Error Rate loss, enabling battery-free operation on a commercial MCU via solar harvesting. We demonstrate practical impact in two real-world systems: an elk-detection camera trap (96.7% accuracy) and a phonetic-transcription wearable (8.44% FER) for child development research.

[AI-82] PAL-Bench: Evidence-Grounded Profile Reconstruction from Longitudinal Personal Albums

Link: https://arxiv.org/abs/2606.16175
Authors: Qiwei Yan,Zhiqiang Yuan,Zexi Jia,Nanxing Hu,Kailin Lyu,Jie Zhou,Jinchao Zhang
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Longitudinal personal albums are weak-schema multimodal databases: noisy perceptual records whose key facts require joins across faces, text, timestamps, locations, and repeated events. Existing visual, video, document, and lifelog benchmarks test sub-problems, but not album-scale profile reconstruction with social identity binding and evidence citation. Benchmarking this task is difficult because the ground truth needed for evaluation–owner profiles, social graphs, face-name maps, and evidence provenance–is private state that real albums cannot safely release. We introduce PAL-Bench, a controlled benchmark for evidence-grounded reconstruction under a public-record contract. Its Evidence Compiler builds latent private worlds, programs target-level evidence paths, renders album pixels, re-measures them through perception pipelines, and exports audited public/private views. Agents receive only perception-derived public records; targets, identifier maps, and evidence paths remain hidden. PAL-Bench contains 50 synthetic users, 36,659 public photo records, and 2,799 targets over owner facts, identities, and relations. A privacy-preserving audit with 10 participants confirms that PAL-Bench evidence structures match real private albums, though equivalent releases remain privacy-prohibitive. Across seven systems and two compute-matched diagnostics, a seven-metric protocol reveals a gap between plausible profile summarization and faithful social reconstruction: systems recover some owner facts but struggle with recurring identities and evidence citation. PAL-TRACE, a reference framework that freezes identity bindings before owner-fact mining, performs best but leaves hard identity resolution far from solved. PAL-Bench provides a testbed for perceptual entity resolution, multimodal data integration, temporal evidence aggregation, and provenance-aware structured prediction.

[AI-83] TimeVista: Exploring and Exploiting Vision-Language Models as Judges for Time Series Forecasting

Link: https://arxiv.org/abs/2606.16173
Authors: Zhi Chen,Yuxuan Wang,Jialong Wu,Yong Liu,Haoran Zhang,Xingjian Su,Jianmin Wang,Mingsheng Long
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:High-quality time series forecasting is pivotal for real-world decision-making. However, traditional point-wise metrics often fail to reveal complex temporal patterns and align poorly with human intuitive preferences. While the "LLM-as-a-Judge" paradigm has revolutionized text evaluation by providing flexible, human-aligned judgment, its application to time series remains largely unexplored. In this paper, we leverage Vision-Language Models (VLMs) as judges for time series forecasting, harnessing their ability to comprehend time series plots grounded in textual information. Specifically, we propose a novel framework integrating micro- and macro-level judgments informed by contextual information to evaluate time series forecasting. To this end, we introduce TimeVista, a comprehensive VLM-as-a-Judge benchmark comprising 5563 time series samples paired with detailed evaluation rubrics. Extensive meta-evaluations demonstrate that VLMs are highly reliable judges, achieving significantly higher consistency with human preferences than conventional metrics. Building upon our benchmark, we comprehensively assess recent Time Series Foundation Models (TSFMs) under the VLM-as-a-Judge paradigm. Our results demonstrate that VLMs serve as robust and interpretable judges, providing a comprehensive, human-aligned standard for evaluating time series models.

[AI-84] AI Pluralism and the Worlds It Misses ICML

Link: https://arxiv.org/abs/2606.16167
Authors: Rashid Mushkani
Categories: Artificial Intelligence (cs.AI)
Comments: To be presented at the ICML Pluralistic Alignment Workshop

Abstract:AI pluralism is often framed as a problem of representing diverse values, preferences, users, or outputs. This paper argues that this framing is incomplete because AI systems also impose ontologies: they define what counts as an entity, relation, feature, harm, benefit, and valid form of evidence. We define ontological flattening as the conversion of situated, contested, and historically specific meanings into a restricted technical category, proxy, aggregation rule, or benchmark target that is treated as neutral and difficult to contest. The paper develops a bounded conceptual and qualitative synthesis across value pluralism, pluralistic alignment, participatory and democratic AI, procedural justice, science and technology studies, accountability research, aggregate themes from 11 expert interviews, and three urban AI companion cases. The cases illustrate how pluralistic methods can improve or structure model behavior while still compressing categories, proxies, aggregation rules, and revision rights before affected actors have procedural standing. We introduce Pluralistic Lifecycle Governance (PLG) as a preliminary qualitative audit scaffold for documenting ontological openness, epistemic inclusion, procedural authority, evaluation pluralism, and lifecycle accountability. PLG is not presented as a validated scoring instrument; it is a framework for making the evidence and governance conditions of pluralistic AI explicit.

[AI-85] The Quality-Utility Paradox: Why High-Reward Data Impairs Small Model Mathematical Reasoning ICML2026

Link: https://arxiv.org/abs/2606.16152
Authors: Haolong Qian,Xianliang Yang,Yinuo Ma,Lirong Che,Feng Lu,Ye Guo,Lei Song,Jiang Bian,Chun Yuan
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at ICML 2026

Abstract:Knowledge distillation from powerful reasoning models is widely used to improve Small Language Models (SLMs) on mathematical reasoning, often assuming that traces with higher reward model scores provide more useful supervision. We identify a counterintuitive Quality-Utility Paradox in mathematical reasoning distillation. Data refined or synthesized by a stronger Oracle obtains higher perceived quality according to reward models, yet consistently underperforms traces generated by the SLM itself and selected through rejection sampling across Qwen2.5, LLaMA-3, and DeepSeek families. Our analysis shows that Oracle refinement couples logical repair with distributional drift away from the SLM’s native reasoning distribution. This drift increases the learner’s adaptation cost and can outweigh the benefit of improved reasoning logic. To test this mechanism, we introduce Style-Aligned Refinement, which preserves the native trajectory of the SLM while retaining logical repair from the Oracle. This intervention lowers adaptation cost and restores downstream utility. These findings suggest that effective mathematical reasoning distillation should jointly optimize perceived solution quality and learner-data compatibility, rather than relying solely on reward-model scores. The datasets and code are available at this https URL.

[AI-86] LiteOdyssey: A Lightweight Reasoning AI Agent for Interpretable Rare-Disease Diagnosis

Link: https://arxiv.org/abs/2606.16149
Authors: Minh-Ha Nguyen,Erica Gray,Chih-Ting Yang,Rizwan Hamid,Lingyao Li,Siyuan Ma,Thomas A. Cassini,Cathy Shyr
Categories: Artificial Intelligence (cs.AI)
Comments: 21 pages, 5 main figures, working version 1

Abstract:Most medical AI systems improve by scaling additional machinery: more fine-tuning data, more agents, and/or larger retrieval databases. In rare-disease diagnosis, however, such scaling can produce systems that are difficult to deploy, audit, and maintain. We asked whether state-of-the-art diagnostic performance could instead be achieved by extending the reasoning chain of a single AI agent: guiding it with a diagnostic policy developed through human-AI collaboration and augmenting it with freely available biomedical tools. We introduce LiteOdyssey, a lightweight rare-disease diagnostic framework that guides a reasoning language model through a clinical genetics workflow. This framework was developed through Policy Iteration with Human Feedback (PIHF) and uses dynamic access to public biomedical tools. On two challenging benchmarks that provide only patient clinical features, LiteOdyssey achieved state-of-the-art performance, with an overall disease Recall@1 of 59.3% over the combined 1,243 cases of LIRICAL (n = 370) and the PhenoPacket Store (n = 873). Both benchmarks have a high proportion of ultra-rare diseases (prevalence below 1 in 1,000,000), with ultra-rare shares of approximately 45% and 52.8%, respectively. On the more difficult PhenoPacket subset, where causal diseases were not mapped to Orphanet in our rarity-mapping pipeline, LiteOdyssey achieved 60.7% Recall@1, compared with 10.7% for the same baseline model (GPT-5.4) without tools. This performance was achieved without fine-tuning, multi-agent ensembles, or a large case-retrieval database. Gains were also observed on cases never seen during development, on a private cohort of real-world rare disease patients, and on a smaller open-weights model. LiteOdyssey suggests a path toward rare-disease AI systems that are accurate, easier to deploy, and more transparent for physician review.
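
The headline metric can be made concrete with a short sketch. It assumes the usual reading of Recall@k, namely the fraction of cases whose confirmed diagnosis appears among the top k ranked candidates; the abstract does not spell out the definition, and the function name and toy disease labels below are hypothetical.

```python
def recall_at_k(ranked_predictions, true_diagnoses, k=1):
    """Fraction of cases whose true diagnosis is among the top-k ranked candidates."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_diagnoses)
        if truth in ranked[:k]
    )
    return hits / len(true_diagnoses)

# Toy cases (hypothetical labels): each inner list is one ranked differential.
ranked = [
    ["disease_a", "disease_b"],
    ["disease_c", "disease_d"],
    ["disease_e", "disease_f"],
    ["disease_g", "disease_h"],
]
truth = ["disease_a", "disease_d", "disease_e", "disease_z"]

r1 = recall_at_k(ranked, truth, k=1)
r2 = recall_at_k(ranked, truth, k=2)

# Case counts quoted in the abstract: LIRICAL plus the PhenoPacket Store.
combined_cases = 370 + 873
print(r1, r2, combined_cases)
```

Recall@1 is the strictest setting, since only the first-ranked candidate counts; widening k can only keep the score the same or raise it.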

[AI-87] Thinking with Visual Grounding

Link: https://arxiv.org/abs/2606.16122
Authors: Junkai Zhang,Yihe Deng,Kai-Wei Chang,Wei Wang
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Visual thinking should not only sound right; it should show its evidence. While recent vision-language models (VLMs) can produce natural-language reasoning traces, these traces often leave the supporting image regions implicit, making them hard to verify and difficult to supervise. We introduce visually grounded thinking, a reasoning process in which models interleave natural-language thoughts with explicit point or box groundings of the visual evidence used at each step. This lets the model express intermediate reasoning in language while grounding key objects in the image regions they refer to. To train this behavior, we construct a scalable synthesis pipeline that distills correct visual reasoning traces, extracts the visual objects required by the traces, grounds them with a SAM3-based agent, and derives aligned point and box supervision from the resulting masks. We further propose grounding-aware reinforcement learning, which combines answer correctness rewards with dense grounding rewards that score whether generated object references match the correct image evidence. Across two counting benchmarks and four spatial reasoning benchmarks, adding visually grounded thinking to Gemma3-4B-IT consistently improves performance over the original model and the non-grounded thinking baseline. On spatial reasoning, the visually grounded thinking 4B models match, and in some cases surpass, Gemma3-27B-IT from the same model family. Our analysis shows that point grounding is well suited to counting, while box grounding benefits most from explicit grounding rewards on spatial tasks. Overall, our results show that VLMs think better when their intermediate thoughts are tied to the image regions that make them true.

[AI-88] RecourseBench: A Modular Framework for Reproducible Algorithmic Recourse Evaluation

Link: https://arxiv.org/abs/2606.16113
Authors: Zahra Khotanlou,Hashir Ahmed,Chenghao Tan,Ahmed Abdelaal,Amir-Hossein Karimi
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Algorithmic recourse methods provide counterfactual explanations that inform individuals of the actions required to overturn an unfavorable model decision. Despite rapid methodological progress, principled comparison remains elusive; existing frameworks are often difficult to extend and lack both interoperability and systematic verification that integrated methods faithfully reproduce their originally reported results. We introduce RecourseBench, a unified evaluation framework built around three commitments: modularity, reproducibility, and interactivity. The framework decomposes the pipeline into five fully decoupled layers – Data, Preprocessing, Model, Recourse Method, and Evaluation – governed by abstract interfaces and a dynamic registry. To address the reproducibility gap in prior benchmarks, we introduce a four-tier classification system in which every integrated method is validated by an automated test suite against its originally reported results. We further provide an interactive web interface for flexible, configuration-driven comparison across methods, datasets, and model architectures. Our framework currently integrates 28 state-of-the-art recourse methods and, to our knowledge, constitutes the first recourse benchmark to explicitly enforce method-level reproducibility through automated, quantitative testing.

[AI-89] Scaling Adaptive Depth with Norm-Agnostic Residual Networks

Link: https://arxiv.org/abs/2606.16112
Authors: Tomás Figliolia,Beren Millidge
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Residual architectures are ubiquitous in deep learning, but they suffer from a subtle structural limitation: the norm of the residual stream can grow rapidly with depth. As a result, updates from later layers become small relative to the accumulated residual state. This reduces their impact on the representation and limits the benefits of scaling models in depth. To address this, we introduce NAG, a norm-agnostic residual architecture that separates magnitude from directional information in the residual stream, preserving meaningful layer contributions throughout depth and preventing later updates from being systematically suppressed by residual-norm growth. Importantly, NAG introduces only a negligible number of additional parameters and relies on simple operations that are easily kernel-fusible, preserving training efficiency in practice. We show that this architecture outperforms baseline Transformers, with gains that increase substantially as depth grows, enabling effective training of much deeper models. The norm-agnostic formulation also leads to an interpretable Mixture-of-Depths (MoD) mechanism that adaptively skips both attention and MLP layers. Beyond serving as a post-training accuracy-compute tradeoff, this mechanism can be used as a pretraining-time scaling strategy: under iso-FLOP training, compute saved by reducing per-token forward-pass cost can be reinvested into training on more tokens while keeping the total parameter count and KV-cache budget fixed. In our experiments, moderate Mixture-of-Depths rates of approximately 20%-25% match full-depth baseline performance under equal training compute while substantially reducing the number of executed layer parameters and forward-pass FLOPs. These results identify sparsity in depth as a new scaling axis for fixed-compute training, enabling very deep yet FLOP-efficient models.
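
The limitation the abstract starts from can be illustrated with a toy model. The sketch assumes every layer adds an update of the same norm that is orthogonal to the accumulated residual state, so squared norms simply add. That simplification is for illustration only and does not describe NAG or any measured network; the function and parameter names are hypothetical.

```python
import math

def relative_update_sizes(depth, update_norm=1.0, embed_norm=1.0):
    """Ratio of each layer's update norm to the residual-stream norm it is added to."""
    sq_norm = embed_norm ** 2
    ratios = []
    for _ in range(depth):
        ratios.append(update_norm / math.sqrt(sq_norm))
        sq_norm += update_norm ** 2  # orthogonal update: squared norms add
    return ratios

ratios = relative_update_sizes(depth=100)
print(ratios[0], ratios[3], ratios[99])
```

Under these assumptions the relative size of the update at layer n shrinks like 1/sqrt(n), so a layer deep in the stack moves the representation far less than an early one even though its raw update is equally large.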

[AI-90] Phys-JEPA: Physics-Informed Latent World Models for Multivariate Time-Series Forecasting

Link: https://arxiv.org/abs/2606.16076
Authors: Weizhi Nie,Weichao Liu,Honglin Guo,Yuting Su
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments: Submitted to arXiv as a preliminary manuscript. 10 figures

Abstract:Multivariate forecasting in physical systems requires models that predict coupled temporal variables while preserving meaningful state evolution. Deep forecasters can fit temporal correlations, and physics-informed models can regularize predictions with scientific constraints, but these directions are often connected only at the decoded-output level. As a result, the hidden predictive state that generates future trajectories may remain statistically useful but physically unstructured. We introduce Phys-JEPA, a physics-informed joint-embedding predictive architecture for multivariate time-series forecasting. Phys-JEPA learns a latent world model in which predictive states are decomposed into physical and residual components, and physical consistency is imposed directly on latent states and latent transitions rather than only on decoded forecasts. This formulation uses known physical variables to organize the representation space while retaining residual capacity for unresolved dynamics. On Jena Climate 2009–2016, Phys-JEPA reduces aggregate MSE from 0.12482 to 0.12273 and temperature MSE from 0.01892 to 0.01831 at H=24. On Traffic, full Phys-JEPA improves aggregate MSE over the supervised baseline across all tested horizons, reducing H=192 MSE from 0.800784 to 0.773873. On Electricity, the best variant depends on horizon: static latent consistency is strongest at H=24 and H=48, while full Phys-JEPA gives the best aggregate and target-variable MSE at H=192. These initial results suggest that moving physics-informed learning from output space to latent predictive state space is a promising direction for interpretable temporal world models.
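
To put the reported MSE figures on a common scale, the sketch below converts each before/after pair quoted in the abstract into a relative reduction. Only the six MSE values come from the abstract; the helper name and dictionary keys are assumptions for illustration.

```python
def relative_reduction_pct(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline

# (baseline MSE, Phys-JEPA MSE) pairs as quoted in the abstract.
reported = {
    "jena_aggregate_h24": (0.12482, 0.12273),
    "jena_temperature_h24": (0.01892, 0.01831),
    "traffic_aggregate_h192": (0.800784, 0.773873),
}

reductions = {
    name: round(relative_reduction_pct(base, new), 2)
    for name, (base, new) in reported.items()
}
print(reductions)
```

On these numbers the relative improvements are in the low single digits, which fits the abstract's own framing of the results as initial.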

[AI-91] MASCOT-Android: A Curated Dataset and Automated Collection Pipeline for Android Malware Source Code Specimens

Link: https://arxiv.org/abs/2606.16072
Authors: Bojing Li,Duo Zhong,Prajna Bhandary,Raguvir S,Charles Maxa,Robert J Joyce,Charles Nicholas
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Compared with binaries and decompiled code, malware source code more directly reflects the attackers’ original intent. However, the scarcity of source code and the high cost of manual review make such datasets difficult to build and maintain. We propose MASCOT-Android, a curated dataset of Android malware source code and an automated collection framework for scalable malware source code discovery on GitHub. A key finding of our work is that repository-level documentation alone provides a strong signal for malware source code collection. Our model extracts character-level TF-IDF features from 8,772 malware and 25,747 benign README documents and trains a LinearSVC classifier to distinguish malware repositories. This README-only model achieves an accuracy of 96.28% and an FPR of 1.06% in local evaluation. In addition, the model outputs confidence scores, allowing users to adjust the decision threshold to balance FPR and coverage, which is practical in real-world malware source code collection.
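
The threshold trade-off mentioned at the end of the abstract can be sketched as follows. The scores and labels are made-up toy values rather than outputs of the paper's classifier, the function name is hypothetical, and coverage is taken here to mean the share of true malware repositories that get flagged, which is an assumption about the paper's terminology.

```python
def fpr_and_coverage(scores, labels, threshold):
    """False-positive rate and coverage when flagging every score >= threshold."""
    flagged = [s >= threshold for s in scores]
    negatives = sum(1 for y in labels if y == 0)
    positives = sum(1 for y in labels if y == 1)
    false_pos = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
    true_pos = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
    return false_pos / negatives, true_pos / positives

# Toy confidence scores (hypothetical); label 1 = malware repository, 0 = benign.
scores = [-1.2, -0.3, 0.1, 0.4, 0.9, 1.5]
labels = [0, 0, 0, 1, 1, 1]

loose = fpr_and_coverage(scores, labels, threshold=0.0)
strict = fpr_and_coverage(scores, labels, threshold=0.5)
print(loose, strict)
```

Raising the threshold removes the false positive in this toy data at the cost of missing one malware repository, which is the balance a collector would tune in practice.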

[AI-92] Mind-Studio: Executable World Models with Lookahead Evaluation for Partially Observable Games

Link: https://arxiv.org/abs/2606.16070
Authors: Yifei Dong(1),Mingen Zheng(1),Linquan Wu(2),Jeff Z. Pan(3),Jiaxin Bai(4) ((1) Hong Kong University of Science and Technology, (2) City University of Hong Kong, (3) University of Edinburgh, (4) Hong Kong Baptist University)
Categories: Artificial Intelligence (cs.AI)
Comments: 12 pages, 2 figures

Abstract:World-model synthesis aims to turn interaction experience into an internal model of environment dynamics. Existing symbolic approaches often fit observed transitions or mixtures of local rules, but they do not produce a complete executable program that can run independently of the real environment. We present Mind-Studio, a framework that synthesizes executable pygame-style world models from state-action-next-state trajectories using large language models. Mind-Studio combines entropy-selected traces with a lightweight game skill file containing object, action, and static scene information extracted from screenshots. We evaluate synthesis quality with a K-step lookahead fidelity protocol that compares generated world-model rollouts against Real-ALE rollouts from the same state. On Montezuma’s Revenge, Mind-Studio improves chosen-action next-state prediction from 0.3% for PoE-World to 48.7% while verifying 5 of 8 subgoals; across Alien, Assault, and Skiing, it achieves stronger branch-level fidelity than prior learned lookahead sources.

[AI-93] Auditing Reward Hackability in Code RL Training Environments

Link: https://arxiv.org/abs/2606.16062
Authors: Shreshth Rajan
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We measure the rate at which code RL environments accept incorrect solutions as correct. On a 49-task sample of SWE-bench Verified, 28.5% of tasks have test suites weak enough that a Docker-verified incorrect patch passes them. On 20 R2E-Gym tasks across 6 repositories, the same pipeline at single-shot exploit generation yields 25.0%. A random-effects meta-analysis over 134 frontier model submissions to SWE-bench Verified finds, within the same human-rated difficulty stratum, model Pass@1 is +14.14 percentage points higher on flagged-hackable tasks than on robust ones (95% CI [+11.80, +16.48]; one-sided p < 10^-6; I^2 = 0%; 123 of 134 models positive). We then describe a procedure for hardening the broken tasks. An inline LLM judge with a Docker gold-sanity gate runs each generated test against the gold solution before the judge is consulted. On the 11 broken tasks in the audit, the gate flags 65 of 105 decisive LLM-generated tests as failing on the gold patch itself, a 61.9% per-augmentation defect rate the LLM judge alone misses. With diversity-biased retry, the loop converges 9 of 11 tasks to a gated upgrade.
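
The gold-sanity gate in this abstract can be sketched in a few lines. This is only an illustration under simplifying assumptions: tests are plain Python callables rather than Docker-executed test files, and `gold_sanity_gate`, `defect_rate`, and the toy tests are hypothetical names, not the paper's pipeline.

```python
from typing import Callable, Iterable

# Hypothetical types: a "solution" is a callable under test, and a "test"
# is a callable that returns True when the solution passes it.
Solution = Callable[[int], int]
Test = Callable[[Solution], bool]


def gold_sanity_gate(tests: Iterable[Test], gold: Solution):
    """Split generated tests into those the gold solution passes and those it fails.

    A test that fails on the gold solution is defective by construction,
    so it is rejected before any judge is consulted.
    """
    kept, rejected = [], []
    for test in tests:
        try:
            ok = bool(test(gold))
        except Exception:
            ok = False  # a crashing test is treated as failing on gold
        (kept if ok else rejected).append(test)
    return kept, rejected


def defect_rate(num_rejected: int, num_total: int) -> float:
    """Share of generated tests that fail on the gold solution."""
    return num_rejected / num_total


# Toy example: the gold solution doubles its input.
def gold(x):
    return 2 * x


tests = [
    lambda s: s(2) == 4,  # valid test
    lambda s: s(3) == 7,  # defective test: wrong expected value
    lambda s: s(0) == 0,  # valid test
]
kept, rejected = gold_sanity_gate(tests, gold)
print(len(kept), len(rejected))
print(round(100 * defect_rate(65, 105), 1))  # 65 of 105, the counts quoted in the abstract
```

The last line reproduces the abstract's arithmetic: 65 rejected tests out of 105 gives the quoted 61.9% defect rate.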

[AI-94] Mojo: A Promising Tool for Scalable Financial AI Efficiency

Link: https://arxiv.org/abs/2606.16059
Authors: Henry Han
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15, 3 figures

Abstract:For thirty years, quantitative finance has paid a costly two-language tax: models researched in Python are rewritten in C++ for production, often introducing numerical discrepancies. GPU-accelerated deep learning exacerbates this problem, as nondeterministic floating-point reductions can produce drift in long backtests, challenging regulatory reproducibility and auditability expectations. This article surveys Mojo, Modular’s 2026 Python-like systems language, as a structural response for capital markets engineering. While closing the Python-to-C++ performance gap, Mojo uniquely combines native interoperability with the low-level systems control required to construct bit-exact deterministic kernels. Its MLIR compilation infrastructure further allows a single codebase to target scalar, SIMD, multicore, and GPU execution, reducing the translation bottleneck between research and production. We benchmark four core financial AI workloads: Monte Carlo option pricing, LLM sentiment inference, multi-asset backtesting, and portfolio Value at Risk. On Apple Silicon, Mojo demonstrates 20x to 180x speedups over pure Python on directly measured kernels; larger-scale GPU workload results are projections calibrated from published benchmarks. Alongside transparent performance data, we introduce mojo-deterministic, an open-source library of reproducible reduction kernels, and provide a candid assessment of the problems Mojo does and does not yet solve.
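
The reproducibility problem this abstract attributes to nondeterministic floating-point reductions comes from the fact that floating-point addition is not associative. The sketch below illustrates that effect in plain Python; it is not Mojo code and has nothing to do with the mojo-deterministic library's actual API. `naive_sum` and `pairwise_sum` are hypothetical helper names, and the fixed-shape tree reduction is one common way to make a reduction order-stable, assumed here for illustration.

```python
import math


def naive_sum(values):
    """Left-to-right floating-point sum; the result depends on element order."""
    total = 0.0
    for v in values:
        total += v
    return total


def pairwise_sum(values):
    """Fixed-shape tree reduction: the same input order always yields the same bits,
    because the order of additions is determined only by the input length."""
    n = len(values)
    if n == 0:
        return 0.0
    if n == 1:
        return values[0]
    mid = n // 2
    return pairwise_sum(values[:mid]) + pairwise_sum(values[mid:])


# The same four numbers in two different orders.
a = [1e16, 1.0, -1e16, 1.0]
b = [1e16, -1e16, 1.0, 1.0]

print(naive_sum(a), naive_sum(b))  # the two orders disagree
print(math.fsum(a), math.fsum(b))  # exactly rounded reference sum
```

A parallel reduction whose combination order varies between runs behaves like summing `a` on one run and `b` on the next, which is how small discrepancies can enter a long backtest.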

[AI-95] How to Detect and Measure the AI Dangers to Democracy

Link: https://arxiv.org/abs/2606.16054
Authors: Giulia Sandri, Claudio Novelli
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:Research on artificial intelligence and democracy has grown quickly over the last decade. A shared conclusion in this literature is that AI does not create new democratic problems so much as it makes old ones worse. We now see this across information ecosystems, in elections, and in public administration. However, despite growing evidence, we lack a clear way to prioritize risks in this area, compare them across domains, and identify where democratic control is most likely to break down. So, our problem is: How can we systematize the problems that AI systems pose to democratic processes? This paper argues that principal agent theory may fit the task. In many phases of democratic systems, principals delegate key functions to AI systems and their providers without really being able to monitor how these systems operate or the outputs they produce. Treating AI as a delegation problem helps identify accountability gaps and other governance failures. Most importantly, as we shall illustrate, it provides metrics for empirical assessments of AI impact on democracy. As a second analytical element, we draw on the NIST AI Risk Management Framework and its seven characteristics of trustworthy AI, which supply substantive criteria for evaluating delegated tasks. Operationalized across the three domains through measurable indicators and domain specific trustworthiness criteria, we propose an analytical framework that centers on institutional assessability as the central condition for democratic control over AI. However, we stress that how severe a harm is, and how much risk is acceptable, are evaluative judgments that current methodologies neither acknowledge nor operationalize. This becomes acute when such evaluative judgments are (silently) delegated to private vendors. We identify this as a strong limitation left for future work.

[AI-96] ALCL: An Adaptive Log-Correntropy Loss for Robust Learning under Non-Gaussian Noise

Link: https://arxiv.org/abs/2606.16050
Authors: Mainak Kundu, Ria Kanjilal, Ismail Uysal
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Robust deep learning under heavy-tailed and impulsive noise remains challenging because conventional losses such as mean squared error (MSE) exhibit unbounded sensitivity to outliers. Although correntropy-based objectives improve robustness, existing formulations rely on fixed kernel parameters that must be empirically tuned and remain static during training. To address these limitations, we propose an Adaptive Log-Correntropy Loss (ALCL), a heavy-tailed loss formulation that adaptively learns its robustness geometry during optimization. ALCL introduces a logarithmic residual model whose shape and scale parameters are learned jointly with network weights through differentiable reparameterization. This yields a principled maximum likelihood formulation whose influence function is formally bounded and redescending, allowing the loss geometry to adapt dynamically to evolving residual statistics while suppressing extreme outliers. Comparative experiments on four widely used benchmark datasets spanning grayscale and red-green-blue (RGB) image data under mixed heavy-tailed and impulsive noise demonstrate that ALCL consistently outperforms MSE and optimally tuned generalized correntropy losses in both reconstruction fidelity and downstream classification accuracy. While performance differences remain small under low-noise conditions, under high-noise regimes ALCL improves median accuracy by up to 4.75% on grayscale benchmarks and 4.51% on RGB datasets, with reduced variance across runs. These results demonstrate that adaptive robustness through joint learning of loss parameters provides a computationally efficient alternative to static correntropy-based losses for deep learning in non-Gaussian environments.

[AI-97] Leveraging Deep Learning for Object and Position Recognition of Load Carriers for Autonomous Logistics Vehicles

Link: https://arxiv.org/abs/2606.16042
Authors: Christoph Legat, Tobias Miller, Marco Riess
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 6 pages, 6 figures, IFAC World Congress 2026, © 2026 the authors. This work has been accepted to IFAC for publication under a Creative Commons Licence CC-BY-NC-ND

Abstract:This work explores the use of artificial intelligence in mobile robotics to achieve autonomous detection and pose estimation of load carriers for automated pickup. A deep neural network is designed to recognize predefined landmarks on the carrier from RGBD data; these landmarks are then used to compute the carrier’s pose. The network operates directly on RGBD images to estimate landmark positions, which form the basis for determining the carrier’s location. The approach is validated in extensive experiments and comprises both software and hardware implementations. A deep learning-based framework is presented to detect load carriers and estimate their pose for use with autonomous logistics vehicles. Our method uses a convolutional neural network to identify characteristic reference points on the carrier from RGBD input and computes its pose by combining these inferred landmarks with prior geometric knowledge. Experiments show that the resulting accuracy is sufficient for reliable load carrier detection in industrial environments, confirming the suitability of the method for autonomous intralogistics applications.

[AI-98] Open-SWE-Traces: Advancing Dual-Mode Multilingual Distillation for Software Engineering Agents

Link: https://arxiv.org/abs/2606.16038
Authors: Wasi Uddin Ahmad, Nikolai Ludwig, Somshubra Majumdar, Boris Ginsburg
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Work in progress

Abstract:The path toward autonomous software engineering is currently bottlenecked by a severe deficit of diverse, large-scale trajectory data. We address this by introducing Open-SWE-Traces, an expansive dataset of 207,489 agentic trajectories spanning nine programming languages (Python, Go, TS, JS, Rust, Java, PHP, C, C++). Sourced from 20,000 real-world PRs via OpenHands and SWE-agent harnesses, the dataset utilizes a hybrid-reasoning synthesis: Minimax-M2.5 generates trajectories with explicit “thinking” processes, while Qwen3.5-122B provides high-quality “non-thinking” traces. Filtered for permissive licenses (MIT, Apache, BSD) from SWE-rebench-V2, this data facilitates the training of models capable of long-horizon reasoning. We validate the dataset by fine-tuning the Qwen3-30B-A3B series (Thinking, Instruct, and Coder). The best performing model achieves resolve rates of 61.7% on SWE-bench Verified, 57.1% on SWE-bench Multilingual, and 36.8% on SWE-bench Pro. These results establish Open-SWE-Traces as a premier resource for distilling human-level software engineering capabilities into efficient, open-source agentic LLMs.

[AI-99] SciText2Eq: Assessing LLMs for Explainable Equation Generation for Scientific Creativity ACL2026

Link: https://arxiv.org/abs/2606.16003
Authors: Yifan Mo, Xiao Fu, Yue Su, Qingyu Meng, Koen Hindriks, Qingzhi Liu, Jiahuan Pei
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by findings of ACL 2026

Abstract:This work investigates the ability of large language models (LLMs) to generate mathematical equations from scientific texts. Prior work faces challenges in unstructured grounding, multi-equation dependency, and human-aligned evaluation. To this end, we construct a dataset of AI research papers, pairing contextual passages with ground-truth equations and variable descriptions. We develop an explainable equation generation workflow and evaluate it across diverse open- and closed-source LLM backbones. We introduce an evaluation protocol combining automatic metrics, LLM-based rubrics, and human judgments to assess accuracy, explainability, and human-LLM alignment. Results indicate that LLMs perform moderately on lexical- and syntactic-based similarity, while struggling with semantic accuracy. Comparisons between LLM-based evaluations and human judgments reveal limited alignment, highlighting challenges in using LLMs to assess equation quality. These findings offer insights for improving equation generation models and developing more reliable evaluation methods for scientific text. We provide code and data for reproducibility.

[AI-100] Agentic Framework for Deep Learning workload migration via In-Context Learning

Link: https://arxiv.org/abs/2606.15994
Authors: Qiyue Liang, Steven Ingram, George Vanica, Andi Gavrilescu, Newfel Harrat, Hassan Sipra, Sethuraman Sankaran
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Translating deep learning models from PyTorch’s flexible, object-oriented design to JAX’s functional, stateless setup is usually a manual and error-prone task. Automated migration is challenging because Large Language Models (LLMs) struggle with strict and dynamic API alignment and are prone to mistakes for exacting operations. We propose a fully autonomous system that combines In-Context Learning (ICL) with oracle-driven self-debugging. First, we curated an ICL context that serves as a strict reference for idiomatic JAX styling and test case generation. Second, instead of depending on the LLM to deduce mathematical outputs, we run the source PyTorch modules to get their actual dynamic tensor states. This creates an unchangeable execution oracle. We then use an autonomous agentic loop to synthesize tests based on the oracle data. The test cases are executed repeatedly, and the traceback is sent back to the LLM for self-correction. Ablations show that combining ICL references with oracle grounding and self-debugging greatly outperforms pure instructional and basic agentic baselines. This improvement does not add an excessive computational overhead. Our lightweight pipeline achieves 91% numerical equivalence (compared to baseline: 9%, instruction + self-debugging: 27%) on neural modules, providing a highly reliable, scalable blueprint for cross-framework migration. This has been validated across several state-of-the-art models including SAM (segment anything), T5, Code Whisper amongst others showing high numerical equivalency. Code: this https URL
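
The execution-oracle idea in this abstract (run the source module, freeze its outputs, then test the migrated code against them) can be sketched without any deep learning framework. The snippet below uses plain Python scalar functions as stand-ins for PyTorch and JAX modules; `record_oracle`, `check_against_oracle`, and the candidate functions are hypothetical names, and the tolerance values are assumptions rather than the paper's settings.

```python
import math
import traceback


def record_oracle(reference_fn, inputs):
    """Run the trusted source implementation once and freeze its outputs."""
    return [(x, reference_fn(x)) for x in inputs]


def check_against_oracle(candidate_fn, oracle, rel_tol=1e-6):
    """Return None if the candidate matches every recorded output, otherwise
    a message (mismatch report or traceback) that could be sent back to an LLM."""
    for x, expected in oracle:
        try:
            actual = candidate_fn(x)
        except Exception:
            return traceback.format_exc()
        if not math.isclose(actual, expected, rel_tol=rel_tol, abs_tol=1e-9):
            return f"mismatch at input {x!r}: expected {expected!r}, got {actual!r}"
    return None


# Stand-in for the source module.
def reference(x):
    return 1.0 / (1.0 + math.exp(-x))  # sigmoid


# A correct migration: the same sigmoid written via tanh.
def good_candidate(x):
    return 0.5 * (1.0 + math.tanh(0.5 * x))


# A wrong migration: ReLU instead of sigmoid.
def bad_candidate(x):
    return max(0.0, x)


oracle = record_oracle(reference, [-2.0, 0.0, 0.5, 3.0])
print(check_against_oracle(good_candidate, oracle))
print(check_against_oracle(bad_candidate, oracle))
```

Because the expected values come from actually executing the reference rather than from a model's guess, a failing check is unambiguous feedback for a self-correction loop.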

[AI-101] Quantifying the Impact of Lossy Compression on Neural Generative Surrogate Modeling

Link: https://arxiv.org/abs/2606.15959
Authors: Zhimin Li, Harshitha Menon, Charles Jekel, Valerio Pascucci, Peter Lindstrom
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Neural networks are used as generative surrogate models for scientific discovery, which are trainable approximations of scientific simulations. These models enable users to replace time-consuming numerical simulations with learned alternatives, providing quick solutions. However, high-fidelity generative surrogate models require massive training datasets, which can create storage and I/O challenges. Lossy compression is a promising way to reduce this burden, but compression errors may affect the model quality in subtle ways, making it challenging to quantify their impact. In this work, we examine how lossy compression of training data impacts the quality of generative surrogate models. We begin by characterizing the uncertainty inherent in training neural networks, showing that identical training configurations can produce different models. By exploiting this variability, we propose a method to estimate how much compression-induced error a surrogate model can tolerate without affecting its accuracy. Evaluation of two application simulations demonstrates that our approach significantly reduces memory/storage requirements and speeds up training while producing high-quality surrogate models. These results show that lossy compression saves data storage up to 23.7x and 39x with negligible impact on the quality of the surrogate model. Meanwhile, reducing the size of the training data set also enhances the data loading speed and reduces the training time by up to 3x.

[AI-102] Green SARC: Predictive Cost and Carbon Governance for Agentic AI Systems

Link: https://arxiv.org/abs/2606.15954
Authors: Gaston Besanson
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 19 figures. Code: this https URL – Software DOI: this https URL

Abstract:Agentic AI systems act through tools and sub-agents, yet the controls meant to bound their financial and environmental cost still sit on dashboards evaluated beside or after execution. Green SARC applies the SARC governance-by-architecture framework – four enforcement sites in the agent loop – to FinOps and GreenOps, contributing the theory of what to enforce and how to predict it. We report four policy-independent results. (i) The unconstrained “State Snowball” is \Theta(n^2) in loop depth; on 3,000 real multi-step plans (SWE-rebench) it holds on 100%, with median curvature \hat{c}_2=216 exceeding the linear-accretion prediction p/2=134 – real plans accrete faster than the model. (ii) On real residuals the Normal-\sigma gate under-covers (92% at nominal 95%); split-conformal calibration holds (95.2%). (iii) A soft Lagrangian penalty tuned to the budget in expectation breaches it on 91.5% of seeds; the architectural gate breaches 0%. (iv) Under binding budgets the gate’s over-budget incidence is 0% on synthetic and real (BurstGPT) arrivals. End-to-end token/USD/carbon savings (47–55%) are real but policy-dependent in magnitude – set by a scope-cap knob, not by gate rejections. The library is open-source, dependency-free, and ships a regeneration script for every cited number.
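
The quadratic "State Snowball" and the p/2 figure in this abstract follow from simple arithmetic if one assumes that every loop step adds p tokens to the context and that the full context is re-sent on each step. That reading of p is an assumption of this sketch, as are the names `cumulative_tokens`, `quadratic_coefficient`, and `admit`; none of them come from the Green SARC library.

```python
def cumulative_tokens(n_steps: int, p: int) -> int:
    """Total tokens processed over n_steps when each step adds p tokens
    to the context and the whole context is paid for again on every step."""
    context = 0
    total = 0
    for _ in range(n_steps):
        context += p      # linear accretion of state
        total += context  # the full context is re-sent
    return total


def quadratic_coefficient(p: int) -> float:
    """Leading coefficient of the total as a function of n:
    p*n*(n+1)/2 = (p/2)*n**2 + (p/2)*n."""
    return p / 2


def admit(predicted_cost: float, spent: float, budget: float) -> bool:
    """Hard gate: refuse a step whose predicted cost would exceed the budget."""
    return spent + predicted_cost <= budget


p = 268  # chosen so that p/2 equals the 134 quoted in the abstract
print(cumulative_tokens(10, p), p * 10 * 11 // 2)
print(quadratic_coefficient(p))
print(admit(predicted_cost=5, spent=96, budget=100))
```

Unlike a penalty term that only discourages overspending on average, a gate of this shape rejects the step outright, which is consistent with the abstract's contrast between the soft Lagrangian penalty and the architectural gate.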

[AI-103] Graphical-Probabilistic Modeling of Generative Flows in LLM-Native Software Systems

Link: https://arxiv.org/abs/2606.15943
Authors: Víctor A. Braberman, Flavia Bonomo-Braberman
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Published at 2026 IEEE/ACM 5th International Conference on AI Engineering - Software Engineering for AI (CAIN '26), April 12-13, 2026, Rio de Janeiro, Brazil

Abstract:Engineering LLM-native software remains a challenging and immature field. Current practice is largely exploratory, relying on experimentation and heuristic techniques such as prompting and context engineering. These, however, are low-level and lack the principled structure needed to support design-level reasoning or analysis. In contrast, traditional software engineering leverages modularity and abstraction to communicate and analyze system behavior. To bring similar rigor to LLM-native development, we propose methods for documenting generative flows and for stating properties of LLM-based software designs. Such methods must account for the stochastic, prompt-dependent behavior of large language models while remaining expressive enough to capture emergent phenomena. Our initial approach is based on graphical probabilistic models, tailored to capture phenomena characteristic of LLM-native systems. This framework – what we term Generation Networks – aims to provide a foundation for principled reasoning about generative interactions and system-level properties in LLM-centric software architectures.

[AI-104] ControlMap: Controllable High-Definition Map Generation for Traffic Scenario Simulation

Link: https://arxiv.org/abs/2606.15930
Authors: Marwan Farag, Steffen Wäldele, Yu Yao
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Simulation is central to validating autonomous driving systems, yet current pipelines are limited by insufficient scenario diversity due to costly High Definition (HD) map creation. Scaling HD maps requires expensive data collection and manual processing. Moreover, existing generative models lack the fine-grained control necessary to target specific road topologies during generation. This paper presents a data-driven pipeline for controllable HD map generation using latent diffusion and ControlNet for spatial conditioning. To our knowledge, we are the first to inject spatial guidance signals into a diffusion model for HD map synthesis. Furthermore, our model supports adjustable conditioning strength through classifier-free guidance and city-level style transfer via city label conditioning. To complement existing metrics, we introduce two novel metrics to evaluate adherence to the control signal and similarity to ground-truth maps. Experiments demonstrate that our model generates realistic HD maps that faithfully follow input road topologies while accurately preserving city-specific details.

[AI-105] Runtime Analysis of Cartesian Genetic Programming in Evolving Boolean Functions PPSN2026

Link: https://arxiv.org/abs/2606.15923
Authors: Duc-Cuong Dang, Roman Kalkreuth, Andre Opris
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: To appear in the Proceedings of PPSN 2026

Abstract:Cartesian Genetic Programming (CGP) is among the practical and popular forms of Genetic Programming as it uses a graph-based representation of programs. This paper presents a first runtime analysis of CGP in evolving Boolean functions using complete training sets. We prove an asymptotic bound O(n D^5) for the expected number of fitness evaluations of CGP to construct a conjunction of n inputs using at most D \geq n-1 binary gates, a minimal function set, and even with a strict survival selection. When the non-strict selection is used, the bound is improved to O(n D^4) . Our analysis reveals interesting characteristics of CGP induced search, which have been only observed empirically. In particular, enabling the acceptance of equally good solutions, including those with connected gates non-contributing to fitness, can lead to a speedup, and consequently a better asymptotic time bound. In contrast to conjunctions, we also prove a negative result which shows that CGP requires exponential time to evolve an exclusive disjunction. Experiments evolving conjunctions complement our theoretical findings. The use of incomplete training sets is found to further reduce the average number of fitness evaluations while maintaining a good level of generalisation.

[AI-106] On-Policy Distillation with Curriculum Turn-level Guidance for Multi-turn Agents

Link: https://arxiv.org/abs/2606.15912
Authors: Gengsheng Li, Mao Zheng, Mingyang Song, Ruiqi Liu, Tianyu Yang, Jie Sun, Qiyong Zhong, Haiyun Guo, Junfeng Fang, Dan Zhang, Jinqiao Wang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multi-turn agents that plan, invoke tools, and interact with environments offer a promising paradigm for solving complex tasks, yet their capabilities typically rely on very large models whose inference cost is prohibitive in […]. On-Policy Distillation (OPD) is a natural recipe for transferring such capabilities to smaller students, but we find that it suffers a characteristic failure mode in this setting: small student errors compound across turns and push the trajectory out of the teacher’s familiar state distribution, so the teacher’s supervision becomes least reliable precisely where the student needs it […]. We propose Guided On-Policy Distillation (Guided-OPD), a simple yet effective algorithm that mixes teacher- and student-generated turns within each rollout and schedules the teacher’s intervention probability along a curriculum that decays to […] guidance keeps early trajectories close to the teacher distribution and is then gradually withdrawn to recover the purely on-policy regime used at […] ALFWorld, ScienceWorld, and WebShop, distilling Qwen3 students from a Qwen3-30B-A3B teacher, Guided-OPD improves Score by 21.1% and Success Rate by 25.5% over vanilla OPD on average, with larger gains on smaller students.
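
The mixing of teacher- and student-generated turns under a decaying intervention probability, as described in this abstract, can be sketched as follows. The linear decay to zero, the starting probability, and the names `teacher_probability` and `mixed_rollout` are assumptions for illustration; the abstract only says that guidance is gradually withdrawn until the rollout is purely on-policy.

```python
import random


def teacher_probability(step: int, total_steps: int, p_start: float = 1.0) -> float:
    """Teacher intervention probability at a given training step.

    Linear decay to zero is an assumed schedule, not the paper's.
    """
    remaining = max(0.0, 1.0 - step / total_steps)
    return p_start * remaining


def mixed_rollout(num_turns: int, p_teacher: float, rng: random.Random) -> list:
    """Pick, turn by turn, whether the teacher or the student acts."""
    return ["teacher" if rng.random() < p_teacher else "student" for _ in range(num_turns)]


rng = random.Random(0)  # seeded so the sketch is repeatable
total_steps = 4
for step in range(total_steps + 1):
    p = teacher_probability(step, total_steps)
    print(step, round(p, 2), mixed_rollout(5, p, rng))
```

At the final step the probability is zero, so every turn is produced by the student, which matches the purely on-policy regime the abstract says the curriculum ends in.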

[AI-107] Topological Flow Matching ICLR2026

Link: https://arxiv.org/abs/2606.15897
Authors: Kacper Wyrwal, İsmail İlkan Ceylan, Alexander Tong
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Accepted at ICLR 2026. 26 pages, 24 figures. Code: this https URL

Abstract:Flow matching is a powerful generative modeling framework, valued for its simplicity and strong empirical performance. However, its standard formulation treats signals on structured spaces, such as fMRI data on brain graphs, as points in Euclidean space, overlooking the rich topological features of their domains. To address this, we introduce topological flow matching, a topology-aware generalization of flow matching. We interpret flow matching as a framework for solving a degenerate Schrödinger bridge problem and inject topological information by augmenting the reference process with a Laplacian-derived drift. This principled modification captures the structure of the underlying domain while preserving the desirable properties of flow matching: a stable, simulation-free objective and deterministic sample paths. As a result, our framework serves as a drop-in replacement for standard flow matching. We demonstrate its effectiveness on diverse structured datasets, including brain fMRIs, ocean currents, seismic events, and traffic flows.

[AI-108] UrbanWell: Benchmarking Multimodal Large Language Models for Spatio-Temporal Urban Wellbeing Analytics KDD

Link: https://arxiv.org/abs/2606.15890
Authors: Yanxin Xi, Xiang Su, Jie Feng, Yu Liu, Sasu Tarkoma, Pan Hui
Categories: Artificial Intelligence (cs.AI)
Comments: accepted by KDD Datasets and Benchmarks Track 2026

Abstract:Understanding urban wellbeing from multimodal data requires integrating heterogeneous spatial and temporal signals, posing significant challenges for current multimodal large language models (MLLMs). We introduce UrbanWell, a large-scale benchmark designed to systematically evaluate the spatio-temporal reasoning capabilities of MLLMs for urban wellbeing analytics through joint modeling of satellite and street view imagery. UrbanWell spans 38 cities across multiple years and includes diverse indicators covering (1) environmental conditions (CO_2, NO_2, PM2.5, and Normalized Difference Vegetation Index), (2) spatial accessibility (minimum distance to supermarkets and restaurants), (3) urban form (road length, road density, and land use), (4) urban vitality (population, economic activity diversity, and land use diversity), and (5) subjective perception attributes (e.g., safety, beauty, liveliness, wealth, and quietness). All indicators are aligned at grid level to enable standardized evaluation. Beyond static prediction, UrbanWell defines temporal reasoning tasks, including future value forecasting from historical observations and temporal trend classification. We benchmark 15 state-of-the-art representative MLLMs in a zero-shot setting, providing a comprehensive comparative evaluation across spatial and temporal dimensions. Experimental results indicate that while MLLMs capture salient spatial and perceptual cues, their performance varies substantially across heterogeneous urban indicators spanning environment and subjective perception. UrbanWell serves as a unified benchmark for evaluating multimodal spatial and temporal reasoning in urban wellbeing analytics, offering a standardized testbed for systematic assessment and future research on multimodal urban intelligence. Our codes and datasets are accessible via this https URL.

[AI-109] NVMOS: Non-Verbal Vocalization Quality Assessment in Speech

Link: https://arxiv.org/abs/2606.15888
Authors: Jialong Mai, Jinxin Ji, Xiaofen Xing, Wencui Liu, Xiangmin Xu
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 6 pages. Code and model: this https URL

Abstract:Non-verbal vocalizations (NVs), such as laughter, sighs, and coughs, are important acoustic cues for emotion and intent. Existing speech quality assessment methods typically focus on overall naturalness, while non-verbal TTS evaluations mainly examine whether a target NV appears with the correct type and position. However, the perceptual quality of NV events themselves remains underexplored. To address this gap, we construct an NV-MOS dataset containing outputs from multiple NV-TTS systems and naturally occurring NV samples, with ratings collected from three acoustic experts on a perceptual quality scale. We further analyze audio-capable multimodal large language models such as Gemini and find clear inconsistencies between their scores and expert ratings. These results suggest that general-purpose multimodal models cannot reliably replace human judgments for NV quality assessment. We then propose NVMOS, to our knowledge the first model that can reliably predict the perceptual quality of NV events in speech. Experimental results show that, with a local NV-event focusing module, NVMOS reaches expert-level or stronger agreement with human MOS.

[AI-110] Intelligence Is Not the Bottleneck: Validating an LLM First-Pass Manuscript Score Against Peer-Review Outcomes

Link: https://arxiv.org/abs/2606.15887
Authors: Costa Georgantas
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 34 pages, 14 figures

Abstract:Large language model (LLM) systems are increasingly proposed to assist peer review, yet most evaluations judge the prose of machine-generated review text, not the validity of the numeric score a system assigns. We validate AIPR, which reads a submitted manuscript and emits five 0-100 quality dimensions and a weighted overall score, against the public decision outcomes of a major machine learning venue. AIPR grades by prompting alone, with no fine-tuning on reviews or decisions. Across 300 ICLR submissions with public decision tiers and reviewer ratings, graded under a frozen pipeline with hypotheses pre-registered before any score met any outcome, the overall score separates rejected from accepted submissions (AUROC 0.82, 95% CI 0.78-0.87), rises monotonically across tiers, and tracks the mean reviewer rating. The signal is strongest where we claim it: the lowest-scoring fifth is rejected far above the base rate, with oral papers absent. The validity comes mostly from the model: a one-paragraph prompt on the same model discriminates almost as well as the full pipeline (the small gap favours the pipeline but does not meet the pre-declared criterion, p = 0.09). What the engineering adds is reliability and a grounded review: AIPR’s score barely moves across repeated runs (0.7 vs. 2.8 points within-paper SD) where the bare prompt swings, and the same pass returns a rubric-structured, evidence-grounded review rather than a bare number, with the human keeping the decision.
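
The AUROC reported in this abstract measures how well a score separates accepted from rejected submissions. A minimal sketch of that statistic follows, using its pairwise (Mann-Whitney) definition. The function name `auroc` is hypothetical and the scores are made-up toy numbers, not data from the paper.

```python
def auroc(scores_pos, scores_neg):
    """Probability that a random positive outscores a random negative,
    counting ties as one half (the Mann-Whitney formulation of AUROC)."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Toy overall scores on a 0-100 scale (invented for illustration).
accepted = [82, 75, 68, 90]
rejected = [60, 71, 55, 68]

print(auroc(accepted, rejected))
```

A value of 0.5 would mean the score carries no ranking information, and 1.0 would mean every accepted paper outscores every rejected one; the abstract's 0.82 sits between those extremes.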

[AI-111] LLM-as-Code Agentic Programming for Agent Harness KDD2026

Link: https://arxiv.org/abs/2606.15874
Authors: Junjia Qi, Zichuan Fu, Jingtong Gao, Wenlin Zhang, Hanyu Yan, Xian Wu, Xiangyu Zhao
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: Accepted at the KDD 2026 Workshop on Agentic Software Engineering (AgenticSE)

Abstract:Every major LLM agent framework gives the LLM the role of orchestrator; the model decides what to do next, when to call tools, and when to stop. We argue that token explosion, control-flow hallucination, and unreliable completion are not implementation bugs but architectural consequences of assigning the deterministic work of looping, branching, and sequencing to a probabilistic system. A better prompt or a stronger model cannot guarantee the reliability of the LLM agent. We therefore propose Agentic Programming, in which the program governs all control flow, and the LLM is itself part of it, an adaptive component we call LLM-as-Code and invoke only where a task calls for reasoning or generation. Within each call the model keeps full flexibility, but it cannot alter the program’s execution path. With control in the program, the LLM’s context is built from the execution history’s call tree and forms a directed acyclic graph (DAG). Each call’s context length is then determined by its call depth rather than by accumulation over steps. A case study of computer-use agents shows that the design is practical, not just a theoretical stance, substantially improving the stability of long visual operation sequences.
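
The inversion this abstract argues for, where the program owns control flow and the model is called like an ordinary function, can be illustrated with a short sketch. The names `summarize_files` and `fake_llm` are hypothetical, and the stub stands in for a real model call so the example runs on its own; this is not the paper's framework.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a model call: prompt in, text out.
LLM = Callable[[str], str]


def summarize_files(files: Dict[str, str], llm: LLM, max_files: int = 10) -> Dict[str, str]:
    """The program owns looping, ordering, and termination; the LLM is
    called once per item and only sees that item's context."""
    summaries = {}
    for name in sorted(files)[:max_files]:  # deterministic sequencing and a hard cap
        prompt = f"Summarize in one line:\n{files[name]}"
        summaries[name] = llm(prompt)       # the only place the model is involved
    return summaries                        # the loop always terminates


def fake_llm(prompt: str) -> str:
    """Deterministic stub so the sketch runs without any model."""
    body = prompt.split("\n", 1)[1]
    return body[:20]


result = summarize_files({"b.txt": "second file", "a.txt": "first file"}, fake_llm)
print(result)
```

Each call receives a prompt built from its own item rather than from the accumulated history of previous steps, which reflects the abstract's point that context length depends on call depth instead of growing with the number of steps.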

[AI-112] STRIDE: Strategic Trajectory Reasoning via Discriminative Estimation for Verifiable Reinforcement Learning

Link: https://arxiv.org/abs/2606.15866
Authors: Qinjian Zhao, Zhihao Dou, Dinggen Zhang, Xiangyu Li, Chaoda Song, Zhongwei Wan, Xinpeng Li, Yanyan Zhang, Kaijie Chen, Qingtao Pan, Chengcheng Feng, Zhiqiang Gao, Xiaoyu Xia
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective post-training paradigm for improving the reasoning abilities of large language models. However, existing RLVR methods typically rely on final-answer correctness to assign trajectory-level rewards, providing sparse supervision and treating all tokens uniformly regardless of their actual contribution to reasoning. Although recent studies introduce intermediate signals such as process rewards, high-entropy tokens, and semantic uncertainty, these signals are often not inherently verifiable and may fail to distinguish beneficial strategic patterns from harmful ones. To address this limitation, we propose STRIDE (Strategic Trajectory Reasoning with Discriminative Estimation), a fine-grained RLVR framework that derives strategic reasoning supervision from verifiable outcomes. STRIDE contrasts successful and failed trajectories within each response group to estimate the outcome-discriminative preference of each n-gram strategic pattern, and further combines this signal with reasoning saliency entropy to identify decision-relevant strategic patterns. These patterns are assigned differentiated advantage values during RL optimization, enabling more precise credit assignment while preserving the verifiability of RLVR. Extensive experiments demonstrate that STRIDE consistently improves reasoning performance across diverse models, tasks, and extended settings, including VLMs and agent-based systems.
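
The contrast between successful and failed trajectories can be sketched in a few lines of Python. The abstract does not give STRIDE's estimator or its entropy-based saliency term, so this sketch swaps in a generic smoothed log-odds score per n-gram. The function names, the smoothing constant `alpha`, and the toy response group are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Set, Tuple

Gram = Tuple[str, ...]


def ngrams(tokens: List[str], n: int) -> Set[Gram]:
    """Distinct n-grams of one trajectory (presence, not frequency)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def discriminative_scores(group: List[Tuple[List[str], bool]],
                          n: int = 2, alpha: float = 1.0) -> Dict[Gram, float]:
    """Smoothed log-odds of an n-gram appearing in successful vs. failed
    trajectories of one response group (an assumed, generic estimator)."""
    succ, fail = Counter(), Counter()
    n_succ = n_fail = 0
    for tokens, ok in group:
        grams = ngrams(tokens, n)
        if ok:
            succ.update(grams)
            n_succ += 1
        else:
            fail.update(grams)
            n_fail += 1
    scores = {}
    for g in set(succ) | set(fail):
        p_s = (succ[g] + alpha) / (n_succ + 2 * alpha)
        p_f = (fail[g] + alpha) / (n_fail + 2 * alpha)
        scores[g] = math.log(p_s / p_f)
    return scores


# Toy response group: (tokenized trajectory, verifiable outcome).
group = [
    ("let x = 2 then check".split(), True),
    ("let x = 2 then verify".split(), True),
    ("guess the answer".split(), False),
    ("guess then check".split(), False),
]
scores = discriminative_scores(group, n=2)
```

Positive scores mark patterns seen mostly in verified-correct trajectories, negative scores mark patterns seen mostly in failures, and a pattern that appears equally often on both sides scores zero.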

[AI-113] RetailBench: Benchmarking long horizon reasoning and coherent decision making of LLM agents in realistic retail environments

Link: https://arxiv.org/abs/2606.15862
Authors: Linghua Zhang, Jun Wang, Jingtong Wu, Zhisong Zhang
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language model (LLM) agents have made rapid progress on short-horizon, well-scoped tasks, yet their ability to sustain coherent decisions in dynamic long-horizon environments remains uncertain. We introduce RetailBench, a data-grounded simulation benchmark for evaluating tool-using LLM agents in single-store supermarket operation. RetailBench models retail management as a partially observable decision process and is designed to support thousand-day-scale simulations. In this environment, agents must manage pricing, replenishment, supplier selection, shelf assortment, inventory aging, customer feedback, external events, and cash-flow constraints. We evaluate seven contemporary LLMs under representative agent frameworks over a 180-day evaluation horizon and compare them with a privileged oracle policy. Results show substantial variation across models: only a small subset survives the full evaluation horizon, and even the strongest LLM runs remain substantially behind the oracle policy in final net worth and sales outcomes. Behavioral analysis attributes these gaps to incomplete evidence acquisition, surface-level decision making, and the lack of a consistent long-horizon policy. RetailBench provides a controlled testbed for studying reliable autonomy in economically grounded long-horizon decision-making.

[AI-114] Heteroskedastic Signals in Budgeted LLM Verification: Structural Heterogeneity Limits Optimization Gains

Link: https://arxiv.org/abs/2606.15841
Authors: Jinlong Yang
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language model (LLM) systems increasingly use uncertainty signals to allocate limited computation across verification, test-time scaling, tool execution, and other selective-compute decisions. Such policies rely on a global signal comparability assumption: equal scores should carry comparable decision value across inputs. Using budgeted verification as a controlled diagnostic setting, we identify a failure mode of this assumption: uncertainty quality is heteroskedastic across cost strata, with some regions exhibiting near-random discriminability despite concentrating many errors. Under an explicit local model, we characterize the resulting distortion of global allocation and show that its upper bound scales with cross-stratum signal-quality dispersion. We separate weak signals, optimization instability, and structural heterogeneity through a controlled intervention hierarchy: Threshold, MP-Adapt, MP-Strat, and a deliberately simple cost-stratified thresholding intervention (CST). Across MBPP and MATH using Qwen3-8B, LLaMA3-8B, and GPT-4o-mini, global online adaptation yields inconsistent gains over static thresholding; MP-Strat partially recovers performance, while CST improves hit rate by up to 17 percentage points in strongly heterogeneous settings without gradient updates. These results identify structural heterogeneity, rather than optimizer weakness alone, as the primary bottleneck in the observed settings. More broadly, misaligned feedback structure cannot always be repaired by stronger optimization.
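
A toy Python sketch can show why one global threshold may misallocate a verification budget when scores are not comparable across strata. The abstract does not say how CST sets its per-stratum thresholds, so the equal per-stratum budgets here are an assumption, and the item data, stratum names, and function names are hypothetical.

```python
from typing import Dict, List, Tuple

Item = Tuple[str, float, bool]  # (cost stratum, uncertainty score, is_error)


def global_select(items: List[Item], budget: int) -> List[Item]:
    """Verify the top-`budget` items by raw score, ignoring strata."""
    return sorted(items, key=lambda it: it[1], reverse=True)[:budget]


def stratified_select(items: List[Item], budget_per_stratum: Dict[str, int]) -> List[Item]:
    """Rank within each stratum separately (assumed form of stratified thresholding)."""
    chosen: List[Item] = []
    for stratum, k in budget_per_stratum.items():
        members = sorted((it for it in items if it[0] == stratum),
                         key=lambda it: it[1], reverse=True)
        chosen.extend(members[:k])
    return chosen


def hit_rate(selected: List[Item]) -> float:
    """Fraction of verified items that were real errors."""
    return sum(it[2] for it in selected) / len(selected) if selected else 0.0


# Toy data: "costly" scores are inflated and nearly uninformative,
# while "cheap" scores are lower but separate errors well.
items = [
    ("cheap", 0.60, True), ("cheap", 0.55, True),
    ("cheap", 0.20, False), ("cheap", 0.10, False),
    ("costly", 0.90, False), ("costly", 0.85, True),
    ("costly", 0.80, False), ("costly", 0.75, False),
]
global_hits = hit_rate(global_select(items, budget=4))
stratified_hits = hit_rate(stratified_select(items, {"cheap": 2, "costly": 2}))
```

On this toy data the global ranking spends the whole budget on the stratum with inflated scores, while the per-stratum ranking also reaches the errors that the informative stratum exposes.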

[AI-115] Wasserstein Convergence of ODE-Based Samplers in Decentralized Diffusion Model via Velocity Field Decomposition

Link: https://arxiv.org/abs/2606.15835
Authors: Chencheng Tang, Xuanyu Xue, Fangyikang Wang, Chao Zhang, Hubery Yin
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 50 pages, 9 figures. Preprint under review

Abstract:Diffusion models have achieved impressive empirical success in generative tasks, and their convergence theory is now relatively well understood. Motivated by privacy and scalability, recent decentralized diffusion architectures replace a single global velocity field with multiple local experts and a routing mechanism, yielding a sampling dynamics with stochastic expert switching that falls outside standard diffusion convergence analyses. In this work, we study a decentralized diffusion framework with stochastic velocity fields and ODE-based sampling. We establish a convergence guarantee in Wasserstein-2 distance, showing that the distribution of the N-step discretization converges to the analytical solution at rate \mathcal{O}(N^{-1/2} + \varepsilon) in W_2, where \varepsilon captures the neural approximation errors. To our knowledge, this is the first W_2 convergence result for decentralized diffusion models with an ODE-based sampling scheme.

[AI-116] AIChilles: Automatically Uncovering Hidden Weaknesses in AI-Evolved Systems

Link: https://arxiv.org/abs/2606.15834
Authors: Yajie Zhou, Ao Li, Ashwin Silla, Zaoxing Liu, Vyas Sekar
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Systems and Control (eess.SY)
Comments:

Abstract:The computer systems community has recently seen growing interest in AI-driven system evolution, where AI agents iteratively rewrite systems. Frameworks such as AdaEvolve and Engram report 12-60% score improvements over human-designed algorithms. While these results are promising, there are practical concerns that these AI-evolved programs can perform worse on unseen workloads and exhibit scalability regressions. Given the speed and scale of AI-generated code, we need automated mechanisms to uncover such hidden weaknesses in AI-evolved systems programs. To this end, we develop AIChilles. Taking as input a baseline program P and an AI-evolved program P’, AIChilles searches for valid workloads where P’ regresses relative to P in correctness, runtime, memory usage, or output quality. To tackle the diversity in system applications, weakness types, and potential bugs, AIChilles combines deterministic workload-parameter extraction, agent-based constraint inference, differential oracles, and code-frequency coverage to discover diverse failures. Across five system applications and 30 AI-evolved programs, AIChilles finds 49 distinct hidden weaknesses. We also show that explicitly including AIChilles in the AI-driven development lifecycle can mitigate several of these weaknesses.

[AI-117] An Integrated System for Real-Time Student Assessment and Career Guidance Using Neural Networks in Computing Disciplines

Link: https://arxiv.org/abs/2606.15831
Authors: Sakir Hossain Faruque, Md. Jubair Hossain, Sharun Akter Khushbu
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Systems and Control (eess.SY)
Comments: 25 pages, 24 figures

Abstract:Many undergraduate students in Computer Science (CS) and Software Engineering (SWE) struggle to identify suitable career paths, particularly when their academic performance, abilities, and interests do not fully align. To address this issue, this study proposes an AI-driven Student Assessment and Career Prediction System that integrates a Career Guidance Expert (CGE) system with a Web-Based Student Assessment (WBSA) platform. Within the integrated framework, CGE enhances personalized career recommendations using AI while also assisting students after graduation in identifying suitable jobs, research domains, and higher study opportunities aligned with their skills and interests. The WBSA platform further strengthens interaction between students and faculty through assessments, personalized tasks, mentorship activities, and a secure real-time chat application. The CGE system employs a Multilayer Perceptron (MLP) model trained on real-world academic and extracurricular data collected from university students using the snowball sampling method, achieving a validation accuracy of 94.71% in predicting personalized career paths. A pre-survey was conducted across universities to evaluate the proposed model before deployment. The WBSA system was developed as a modern web application using technologies such as this http URL, this http URL, and PostgreSQL to ensure scalability, responsiveness, and secure data management. The overall system is supported by a secure cloud-based infrastructure and provides reliable performance while helping graduates select a suitable career path in the IT sector. In addition, a post-survey involving both students and faculty was conducted to gather feedback and further improve the overall effectiveness and usability of the system.

[AI-118] TrustedARI: Towards Trust-Native Agentic Routing Infrastructure for Agentic AI

Link: https://arxiv.org/abs/2606.15822
Authors: Qi Li, Zhenhua Zou, Shuo Li, Mingwei Xu, Zhuotao Liu
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:AI agents increasingly access external models, tools, and services through Agentic Routing Infrastructure (ARI) to manage the overhead of heterogeneous interfaces and fragmented subscriptions. Yet, the architecture of ARI introduces fundamental trust risks: it obtains plaintext access to agent queries and service responses, while leaving agents unable to verify that their queries are routed to intended service providers or that requests and responses remain untampered. To address this problem, we present TrustedARI, the first trust-native agentic routing infrastructure for agentic AI. Architecturally, TrustedARI is built upon three core innovations: (i) an ARI-adapted three-party TLS handshake that enables the agent and ARI to jointly authenticate the service provider through role-specific distribution of TLS key materials; (ii) a privacy-preserving query-construction protocol that allows the agent and ARI to collaboratively construct well-formed queries without exposing their respective private inputs; and (iii) a verifiable billing protocol that supports fair usage-based settlement while preserving the integrity and confidentiality of service responses. We implemented and extensively evaluated a prototype of TrustedARI to validate its performance. Experiments confirm that TrustedARI is highly efficient: our ARI-adapted handshake protocol reduces communication overhead by 39.34% compared to the existing three-party TLS handshake. Furthermore, the privacy-preserving query-construction protocol imposes negligible overhead (averaging 0.19 seconds in computation time and 0.58 MB in communication costs), while the verifiable billing protocol speeds up proof generation by 28.20x. Crucially, TrustedARI is readily deployable without any modification to the service providers.

[AI-119] Let Them Steal: Trapping Large Language Model Extraction Attacks with Knowledge Honeypot

Link: https://arxiv.org/abs/2606.15810
Authors: Yuyang Dai, Yushun Dong
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 16 pages

Abstract:Large language models deployed as commercial APIs are vulnerable to model extraction attacks, while existing defenses either act too late or degrade utility for legitimate users. We propose Knowledge Trap, a defense that redirects extraction attacks toward low-transferability knowledge through a Honeypot Knowledge Graph (HKG) and breadcrumb-guided exploration. Instead of blocking queries or perturbing outputs, Knowledge Trap consumes the attacker’s limited query budget on knowledge with negligible downstream utility while preserving benign-user performance. Experiments in medical and financial domains show that Knowledge Trap reduces surrogate Agreement by 6.2% on average without degrading legitimate-user accuracy, outperforming existing defenses that impose measurable user impact. These results suggest that defending knowledge-space traversal is a practical direction for mitigating LLM extraction attacks.

[AI-120] Continuous Cross-Domain Traffic State Prediction via Memory-Augmented Graph Liquid Time-Constant Networks

Link: https://arxiv.org/abs/2606.15807
Authors: Jinrong Xiang, Ming Xu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Traffic state prediction is a fundamental task in intelligent transportation systems. In practical applications, some regions suffer from limited traffic observations due to insufficient sensing infrastructure, making cross-domain knowledge transfer an important solution for data-scarce traffic prediction. However, existing cross-domain traffic prediction methods still face several limitations, including coarse-grained source-target adaptation, limited capability in handling unseen target-domain patterns, and insufficient modeling of continuous traffic dynamics under irregular or heterogeneous temporal conditions. To address these issues, this paper proposes a continuous cross-domain traffic prediction framework, termed Memory-Augmented Graph Liquid Time-Constant Network (MA-GLTC). Specifically, we first construct spatio-temporal units (STUs) to decompose traffic networks into transferable local units, enabling fine-grained knowledge alignment across domains. Then, a graph liquid time-constant network (GLTC) is developed to model graph-coupled traffic evolution in continuous time. Different from generic graph neural ODE-based models, GLTC introduces graph-coupled recurrent conductance into liquid time-constant dynamics, allowing node states to evolve with leakage, adaptive time constants, and neighborhood-aware feedback. Furthermore, a Memory-based Transfer Storage (MTS) mechanism is designed to preserve source-domain knowledge, retrieve matched traffic patterns, and update reliable target-domain patterns when unseen states emerge. Experiments on five public traffic datasets demonstrate that MA-GLTC consistently outperforms representative inner-domain and cross-domain baselines in both short-term and long-term prediction tasks. Compared with the second-best method, MA-GLTC reduces the average prediction errors by 3.02%, 0.33%, 8.92%, 10.09%, and 2.11%, respectively.

[AI-121] Unassigned Agents in Compilation-based Multi-agent Path Finding

Link: https://arxiv.org/abs/2606.15797
Authors: Pavel Surynek
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Compilation-based techniques represent an important stream of solvers for multi-agent path finding (MAPF) due to their modularity and adaptability to non-standard variants of the problem. While in standard MAPF the task is to navigate all agents from their initial positions to given individual goal positions without any collision, variants that impose a different requirement on agents are also relevant. One such variant is MAPF with unassigned agents (UA-MAPF), where some agents have the same setting as in standard MAPF, with initial positions and goals, while the remaining agents have an initial position but no goal (unassigned agents). Although unassigned agents do not need to reach any goal position, they have to be moved out of the way of the standard agents when needed, which represents a specific challenge. We show in this paper that UA-MAPF can be expressed in recent compilation-based techniques for MAPF based on formulating the problem as Boolean satisfiability; namely, we adapt SMT-CBS and NRF-SAT, the recent solvers based on counterexample guided abstraction refinement and non-refined abstractions.

[AI-122] Proximal Policy Optimization for Amortized Discrete Sampling

Link: https://arxiv.org/abs/2606.15793
Authors: Anna Zykova-Myzina, Timofei Gritsaev, Daniil Tiapkin, Nikita Morozov
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:This paper explores policy gradient algorithms for training stochastic policies to sample from structured discrete probability distributions under the Generative Flow Network (GFlowNet) framework. Building on extensive theoretical connections between GFlowNets and entropy-regularized reinforcement learning, we derive equivalents of standard policy gradient algorithms for training GFlowNets, as well as experimentally explore their various methodological aspects, including baseline training and advantage estimation. Most importantly, our work is the first to derive and successfully apply proximal policy optimization to GFlowNets, showing its improved convergence speed and data efficiency compared to standard GFlowNet training objectives on benchmarks ranging from synthetic energies to molecular graph generation.

[AI-123] GAS-Leak-LLM: Genetic Algorithm-Based Suffix Optimization for Black-Box LLM Jailbreaking

Link: https://arxiv.org/abs/2606.15788
Authors: Aman Anifer, Vignesh Kumar Kembu, Vishnu M, Antonino Nocera, Vinod P., Amal Murali PK, Akshay S Rajan
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) constitute pivotal components within the AI-dominated information technology ecosystem. To mitigate risks associated with harmful or policy-violating outputs, commercial systems employ advanced alignment strategies and multi-layered content moderation mechanisms. Despite these safeguards, recent research has demonstrated that LLMs remain vulnerable to adversarial manipulation, particularly through jailbreaking and prompt injection techniques. In this work, we propose GAS-Leak-LLM, a novel jailbreaking attack based on a genetic algorithm that systematically evolves adversarial suffixes to bypass safety constraints. Operating in a strict black-box setting, our method requires no access to model parameters or internals, thereby reflecting realistic threat scenarios in deployed systems. Through the iterative application of selection, mutation, and crossover heuristics, the framework systematically explores the discrete prompt space to identify high-fitness adversarial suffixes. Empirical findings reveal critical shortcomings in existing safety enforcement mechanisms and confirm the effectiveness and practical viability of the proposed attack.

[AI-124] LaWAM: Latent World Action Models for Efficient Dynamics-Aware Robot Policies

Link: https://arxiv.org/abs/2606.15768
Authors: Jialei Chen, Kai Wang, Kang Chen, Shuaihang Chen, Feng Gao, Wenhao Tang, Zhiyuan Li, Weilin Liu, Zhuyu Yao, Boxun Li, Yuanbo Xu, Chao Yu
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision-Language-Action models (VLAs) leverage large-scale vision-language pretraining for semantic robot control, but often lack explicit foresight into how robot actions change the scene. World-Action Models (WAMs) address this limitation by conditioning policies on predicted futures, yet existing approaches typically rely on computationally expensive video generation with substantial pixel-level redundancy. We present LaWAM, a Latent World Action Model that exposes predictive dynamics to robot policies through compact latent visual subgoals instead of reconstructed future video. At the core of LaWAM is a latent-action-conditioned Latent World Model (LaWM). We obtain LaWM by training a latent action model in the latent space of a pretrained vision foundation model and repurposing its forward decoder to predict future observation features for scene evolution. LaWAM then conditions action generation on these predicted latent visual subgoals to enable dynamics-aware robot control. LaWAM achieves state-of-the-art or competitive success rates (SRs) across LIBERO (98.6% SR), RoboTwin (91.22% SR), and real-world manipulation tasks while retaining low-latency inference. LaWAM runs in 187 ms per action-chunk prediction and achieves up to 24x lower wall-clock latency than pixel-space WAMs.

[AI-125] Visualizing Uncertainty: Spatial Maps of Missing and Conflicting Evidence in Deep Learning

Link: https://arxiv.org/abs/2606.15767
Authors: Dong Hyun Jeong, Feng Chen, Jin-Hee Cho, Lance M. Kaplan, Audun Jøsang, Soo-Yeon Ji
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Understanding when and why deep neural networks are uncertain is crucial for deploying reliable machine learning systems in safety-critical domains. While existing uncertainty quantification methods provide scalar measures of model confidence, they offer limited insight into which spatial regions of an input contribute to different types of uncertainty. We propose a novel visualization framework, Uncertainty Activation Map (UAM), that combines Evidential Deep Learning (EDL) with Full-Gradient Class Activation Mapping (FullGrad) to generate interpretable spatial uncertainty activation maps. Our approach distinguishes between two fundamental types of uncertainty: vacuity, representing lack of evidence, and dissonance, capturing conflicting evidence between competing hypotheses. By leveraging the complete gradient decomposition property of FullGrad and the principled uncertainty quantification of Subjective Logic, our method produces theoretically grounded visualizations that highlight specific image regions responsible for model uncertainty. With this framework, vacuity and dissonance activation maps are generated by computing belief-weighted attributions, enabling identification of where models lack knowledge versus where they encounter ambiguous evidence. Extensive evaluations across multiple benchmark datasets demonstrate that the proposed framework effectively addresses the critical gap between uncertainty quantification and explainability, providing intuitive visual feedback to assess model reliability in complex visual recognition tasks.

[AI-126] Snyk VulnBench JS 1.0: Can LLMs Find the Same Bugs Twice?

Link: https://arxiv.org/abs/2606.15762
Authors: Liran Tal, Johannes Kloos, Arsenii Rudich, Stephen Thoemmes, Manoj Nair
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 12 pages, 9 figures

Abstract:We ran 300 repeated vulnerability-finding scans to measure how repeatable agentic large language model (LLM) security review is on the same JavaScript code, prompt, and benchmark harness. The headline result is that LLM security findings were unevenly repeatable: reference-matched findings were stable, but extra model reports varied heavily from run to run. Across 250 model runs, 80 of 161 unique unmatched findings appeared in only one of five identical repetitions, while only 22 appeared in all five. By contrast, when Claude matched a Snyk Code reference finding, the behavior was much more stable: 134 of 158 unique reference-matched findings appeared in all five repetitions. The benchmark also shows complementarity. Models consistently found familiar, high-signal exploit shapes, and in one case surfaced a likely Snyk Code product gap. Snyk Code static application security testing (SAST) was deterministic and better at systematically enumerating repeated data-flow sinks. The results support combining agentic LLM review with deterministic SAST rather than treating either technique as a replacement for the other.

[AI-127] From Correlation to Causation in Lane Change Prediction for Automated Driving: A Causal Explanation Framework

Link: https://arxiv.org/abs/2606.15756
Authors: Mohamed Manzour, Aditya Kumar, Augusto Luis Ballardini, Miguel Ángel Sotelo
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Lane-change prediction is a central task in intelligent vehicles, where early maneuver anticipation can support safer decision-making. However, many existing approaches mainly learn statistical associations between observed driving variables and future maneuvers, while overlooking the causal dependencies among the input variables themselves. This limits interpretability, especially when physically related variables such as longitudinal gap, relative longitudinal velocity, and Time-To-Collision (TTC) are treated as independent flat inputs. This article presents a causal-inference-based framework for lane-change prediction and explanation. The proposed approach combines linguistic feature construction, expert-constrained causal discovery, deep structural causal modeling with Deep End-to-end Causal Inference (DECI), intervention-based effect analysis, refutation testing, and recursive causal-chain explanation. The objective is not only to predict the future maneuver, but also to identify candidate variables that directly contribute to the prediction, the upstream factors influencing them, and the causal chains through which these effects propagate. The framework achieves average F1-scores above 95% during the first three seconds before the lane-marking crossing event. Beyond prediction accuracy, the framework uses intervention-based effect analysis to distinguish influential from weakly influential variables under the learned causal structure. It further distinguishes candidate direct contributors from mediated effects and generates contrastive causal-chain explanations that clarify why the predicted maneuver is favored and why the alternative maneuvers are less supported. The main contribution is therefore a mechanism-aware lane-change prediction pipeline that moves beyond correlation-based classification toward more interpretable causal reasoning for maneuver prediction.

[AI-128] RoboPIN: Grounded Embodied Reasoning via Pinned Chain-of-Thought

Link: https://arxiv.org/abs/2606.15753
Authors: Yaoting Huang, Yifu Yuan, Linqi Han, Chengwen Li, Shuoheng Zhang, Xianze Yao, Hongyao Tang, Yan Zheng, Jianye Hao
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Embodied reasoning requires models to perceive task-relevant objects and spaces in physical environments and maintain consistent visual grounding throughout multi-step reasoning. However, current vision-language models rely on text-only or coordinate-augmented chain-of-thought, where entity references remain implicit and ambiguous. This may cause the reasoning process to decouple from visual evidence, entity references to drift across steps, and a causal disconnection between the reasoning trajectory and the final answer, with these problems further amplified in multi-view scenarios due to cross-view appearance changes. To address these issues, we propose Pinned Chain-of-Thought, a structured reasoning paradigm that pins every reasoning step to visual evidence. Pinned Chain-of-Thought introduces the concept of a reasoning anchor, which binds each task-relevant entity to a structured visual anchor with entity name, unique identity, view index, and spatial grounding, enabling consistent entity tracking across reasoning steps and views. We build a fully automated data generation pipeline to construct a high-quality reasoning dataset in the Pinned Chain-of-Thought format. We then train RoboPIN through three-stage post-training that progressively injects embodied knowledge, structured reasoning ability, and process-supervised alignment, with rewards that directly constrain both anchor localization and identity consistency during reasoning. On 14 benchmarks covering embodied spatial reasoning, multi-view reasoning, and pointing, RoboPIN with only 4B parameters consistently outperforms 7B-level open-source embodied models, achieving a 12% average improvement over the strongest 7B baseline, Mimo-Embodied. Further analysis shows that Pinned Chain-of-Thought improves grounding accuracy and cross-step identity consistency, validating the effectiveness of process supervision.

[AI-129] InstantForget: Update-Free Backdoor Unlearning with Inference-Time Feature Reset

Link: https://arxiv.org/abs/2606.15730
Authors: Zhenyu Yu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Backdoor unlearning aims to remove a malicious trigger behavior from a deployed model while preserving clean utility. We study the update-free inference-time setting, where model parameters remain frozen. First, we audit a common projection assumption under oracle paired clean and triggered features. Projection succeeds mainly on BadNets and leaves WaNet, Blended, and SIG at 0.683, 0.888, and 0.941 ASR on CIFAR-10 ResNet-18. This failure is not explained by spectral compactness, spatial locality, or subspace misalignment. It is predicted by a logit-triplet gap involving the target margin, target-logit drop, and non-target logit rise. We then introduce InstantForget, a clean-calibrated gated reset that flags anomalous features with a Mahalanobis score and moves only flagged features toward a neutral non-target representation. With one fixed operating point selected on held-out triggered validation, InstantForget reduces average ASR to 0.071 across four non-adaptive CIFAR-10 triggers without triggered samples or parameter updates at deployment. It also reaches 0.981 detection AUROC and transfers to six of eight tested backbones. Reported failures under WaNet, ModelNet10 point blend, two backbone geometries, and adaptive feature-compactness attacks define the method’s scope.
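
The gated reset described in this abstract can be sketched with NumPy. This is an illustration under stated assumptions, not the paper's method: the clean mean stands in for the neutral representation, the threshold is a 99th-percentile cut on clean calibration scores, and the reset is a linear interpolation. The function names and the synthetic features are hypothetical.

```python
import numpy as np


def fit_clean_stats(clean_feats: np.ndarray, eps: float = 1e-3):
    """Mean and regularized inverse covariance of clean calibration features."""
    mean = clean_feats.mean(axis=0)
    cov = np.cov(clean_feats, rowvar=False) + eps * np.eye(clean_feats.shape[1])
    return mean, np.linalg.inv(cov)


def mahalanobis(feats: np.ndarray, mean: np.ndarray, cov_inv: np.ndarray) -> np.ndarray:
    """Per-row Mahalanobis distance to the clean distribution."""
    diff = feats - mean
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))


def gated_reset(feats, mean, cov_inv, threshold, neutral, strength=1.0):
    """Move only flagged rows toward `neutral`; unflagged rows pass through."""
    flagged = mahalanobis(feats, mean, cov_inv) > threshold
    out = feats.copy()
    out[flagged] = (1.0 - strength) * feats[flagged] + strength * neutral
    return out, flagged


rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(500, 4))      # synthetic clean features
mean, cov_inv = fit_clean_stats(clean)
# Assumed calibration rule: 99th percentile of clean scores.
threshold = np.quantile(mahalanobis(clean, mean, cov_inv), 0.99)

test = np.vstack([np.zeros((1, 4)),              # typical-looking feature
                  np.full((1, 4), 8.0)])         # far-out feature
reset, flagged = gated_reset(test, mean, cov_inv, threshold, neutral=mean)
```

Model parameters never change in this sketch; only features that look anomalous against clean statistics are rewritten at inference time.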

[AI-130] The algebra of Krom logic programs

Link: https://arxiv.org/abs/2606.15719
Authors: Christian Antić
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Logic (math.LO)
Comments:

Abstract:This paper investigates the algebraic structure of Krom logic programs, consisting only of facts and rules with at most one body atom. We show that sequential composition endows the class of Krom programs with a natural monoid structure and that this structure admits rich algebraic extensions to Krom seminearrings, Krom quemirings, Krom-Conway seminearrings, and Krom-Conway omegaseminearrings. Furthermore, we establish explicit generating sets and canonical decompositions, study the associated ^\omega-operator, characterize the Kleene star in graph-theoretic terms, and relate finite Krom monoids to transformation monoids and finite-state automata. These results provide new connections between logic programming, algebraic automata theory, and algebraic graph theory.
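
The monoid structure can be checked on small examples with a short Python sketch. The abstract does not spell out the composition operator, so the sketch assumes the usual definition of sequential composition for propositional programs, restricted to rules with at most one body atom: facts of the left program are kept, and each rule `a <- b` on the left is resolved against every rule or fact with head `b` on the right. The encoding of rules as `(head, body)` pairs and all names are illustrative.

```python
from typing import FrozenSet, Iterable, Optional, Tuple

# A Krom rule is (head, body); body is an atom name, or None for a fact.
Rule = Tuple[str, Optional[str]]
Program = FrozenSet[Rule]


def compose(p: Program, r: Program) -> Program:
    """Sequential composition of two Krom programs (assumed definition)."""
    out = set()
    for head, body in p:
        if body is None:
            out.add((head, None))           # facts of p survive unchanged
        else:
            for head2, body2 in r:
                if head2 == body:           # resolve p's body atom against r
                    out.add((head, body2))  # body2 of None yields a fact
    return frozenset(out)


def identity(atoms: Iterable[str]) -> Program:
    """Neutral element over a fixed set of atoms: the rules a <- a."""
    return frozenset((a, a) for a in atoms)


P = frozenset({("a", "b"), ("c", None)})
Q = frozenset({("b", "c"), ("b", "a")})
R = frozenset({("c", None), ("a", "d")})
ONE = identity({"a", "b", "c", "d"})

PQ = compose(P, Q)
left = compose(compose(P, Q), R)
right = compose(P, compose(Q, R))
```

On these programs both bracketings give the same result and `ONE` acts as a two-sided identity, which matches the monoid laws on this instance; a finite check like this is a sanity test, not a proof.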

[AI-131] Artificial Intelligence Index Report 2026

Link: https://arxiv.org/abs/2606.15708
Authors: Sha Sajadieh, Loredana Fattorini, Raymond Perrault, Yolanda Gil, Vanessa Parli, Lapo Santarlasci, Juan Pava, Nestor Maslej, Russ Altman, Erik Brynjolfsson, Carla Brodley, Jack Clark, Virginia Dignum, Vipin Kumar, James Landay, Terah Lyons, James Manyika, Juan Carlos Niebles, Yoav Shoham, Elham Tabassi, Russell Wald, Toby Walsh, Dan Weld
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Welcome to the ninth edition of the AI Index report. As AI continues to advance rapidly, the question becomes whether the systems built around it can keep up. Governance frameworks, evaluation methods, education systems, and the data infrastructure needed to track AI’s impact are struggling to match the pace of the technology itself. That gap between what AI can do and how prepared we are to manage it runs through every chapter of this year’s report. New in this edition, the report tracks how AI is being tested more ambitiously across reasoning, safety, and real-world task execution, and why those measurements are increasingly difficult to rely on. It also features new estimates of generative AI’s economic value alongside emerging evidence of its labor market effects, an analytical framework on AI sovereignty, and a science chapter developed in collaboration with Schmidt Sciences. For the first time, the report features standalone chapters on AI in science and AI in medicine, reflecting AI’s growing impact across these two domains.

[AI-132] When Generator Replay Degrades: Projected Rehearsal Orchestration for Heterogeneous Federated Class-Incremental Learning

Link: https://arxiv.org/abs/2606.15695
Authors: Thinh T. H. Nguyen,Khoa D. Doan,Binh T. Nguyen,Danh Le-Phuoc,Kok-Seng Wong
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 46 pages

Abstract: Federated class-incremental learning (FCIL) becomes substantially harder when clients observe different label subsets, progress through tasks at different stages, and provide uneven supervision for the same semantic concepts. Existing FCIL methods often preserve old knowledge through input-space synthesis, but they can be fragile under heterogeneous task streams and difficult to transfer across modalities. To alleviate such issues, we propose PRO, a framework that replaces synthetic input replay with projected rehearsal orchestration. To remove external pretraining, we evaluate all methods under the same warmup. After this, PRO maintains compact class-level projected memories on the server and allows clients to perform balanced pseudo multi-task training over current examples and old projected memories. To handle stronger representation drift, we further introduce PRO-MAX, which augments PRO with neighborhood-weighted memory alignment while preserving the same server-light principle that the server only aggregates model updates and memory statistics. Across image, text, and graph benchmarks, PRO and PRO-MAX improve retention and final utility under heterogeneous streams while remaining competitive in homogeneous FCIL. Even when baselines are given expanded replay budgets, they degrade under supervision imbalance and stage misalignment, indicating that replay quantity alone does not resolve replay-quality failures. Additional weak-task diagnostics further show that larger replay mismatch is associated with larger downstream degradation, while our method keeps projected memories better aligned with the evolving representation.

[AI-133] Imperfect Visual Verification for Code Edition: A Case Study on TikZ

Link: https://arxiv.org/abs/2606.15693
Authors: Charly Reux(UR, INSA Rennes, DiverSe),Mathieu Acher(CNRS, IUF, IRISA, UR, DiverSe),Djamel Eddine Khelladi(DiverSe, UR, CNRS, IRISA),Clément Quinton(SPIRALS, CNRS),Olivier Barais(UR, IRISA, DiverSe)
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract: LLMs have significantly advanced code generation, enabling the synthesis of functional programs. While recent systems achieve strong performance on many coding benchmarks, tasks involving programs such as TikZ that generate visual artifacts remain challenging, in particular visual code customization. Unlike generation from scratch, customization requires localized, semantics-preserving edits: the model must locate relevant code, modify it according to the instruction, and preserve the remaining structure and rendering. Approaches based on post-hoc iterative refinement or correction, where a verifier provides feedback to guide corrections, have shown promise. However, for programs with a visual outcome such as TikZ, where correctness is harder or likely impossible to formalize and evaluate automatically, deterministic verifiers do not exist. Hence, developers can only rely on imperfect verifiers. In this paper, we conduct an empirical study to answer: to what extent can iterative refinement remain effective when the verifier itself is unreliable? We use TikZ as a focused case study that isolates the core difficulties of the problem (weak code structure, fine-grained visual semantics, and difficult feature localization) in a controlled and challenging setting. We define visual code customization as an iterative editing problem with an imperfect oracle, and introduce a framework for analyzing such iterative refinements. We conduct a large-scale study and evaluate multiple LLM-based and tool-augmented visual verifiers within iterative refinement pipelines, and perform extensive manual annotation of refinement trajectories to assess verifier behavior and feedback quality. Our findings show that even imperfect verifiers can determine with moderate accuracy whether visual instructions are applied to code, achieving F1-scores up to 0.815. Feedback improves iterative refinement, especially for weaker models, adding 11–20 perfect customizations for Qwen3-vl-30b-a3b-Instruct, while stronger models like Gemini-3 gain fewer improvements (+5) but benefit more from accurate verification that prevents premature acceptance. Feedback is effective only when it precisely identifies image issues, provides actionable guidance, addresses all relevant problems, and remains grounded in the original instruction.

[AI-134] Recurrent Reasoning on Symbolic Puzzles with Sequence Models

Link: https://arxiv.org/abs/2606.15686
Authors: Gowrav Mannem,Chowdhury Marzia Mahjabin,Jason Chen,Shivank Garg,Kevin Zhu
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract: Large language models often appear strong on symbolic and algorithmic tasks, yet this apparent strength can hide brittle behaviour when problems become longer, harder, or slightly out of distribution. A major limitation of current reasoning benchmarks is that many primarily test whether a model can produce a valid answer, while paying less attention to whether the solution is minimal, robust, and stable under controlled difficulty scaling. We introduce RecurrReason, a difficulty-controlled benchmark of four recurrent logic puzzles (Tower of Hanoi, River Crossing, Block World, and Checkers Jumping) with BFS-optimal trajectories and a single interpretable difficulty parameter $N \in \{1,\dots,10\}$, totalling 10,817 unique puzzles and 285,933 moves. We benchmark two Transformer families, an encoder-decoder model (T5-style) and a decoder-only model (GPT-2-style), under consistent data splits and evaluation criteria, training on $N=1$ to $7$ and evaluating on both held-out in-distribution instances and harder out-of-distribution instances at $N=8$ to $10$. Fine-tuned pre-trained T5 achieves 97.27% validation and 81.00% OOD accuracy on Block World; all models score 0.00% on River Crossing under all conditions. Failure mode analysis reveals that architecture is a stronger determinant of success than scale. Pre-training transfers only to puzzles with locally structured transition functions. Our code and dataset will be open-sourced upon acceptance.
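
To make "BFS-optimal trajectories" concrete, here is a minimal sketch for one of the four puzzles, Tower of Hanoi. It is an illustration only: the state encoding and the function name `hanoi_bfs` are assumptions, not taken from the benchmark's code.

```python
from collections import deque

def hanoi_bfs(n):
    """Return a shortest move list [(src, dst), ...] for n disks on pegs 0..2."""
    start = (tuple(range(n, 0, -1)), (), ())   # largest disk at the bottom of peg 0
    goal = ((), (), tuple(range(n, 0, -1)))    # all disks moved to peg 2
    parent = {start: None}                     # state -> (previous state, move)
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            break
        for src in range(3):
            if not state[src]:
                continue
            disk = state[src][-1]
            for dst in range(3):
                # skip no-op moves and moves onto a smaller disk
                if dst == src or (state[dst] and state[dst][-1] < disk):
                    continue
                pegs = list(state)
                pegs[src] = state[src][:-1]
                pegs[dst] = state[dst] + (disk,)
                nxt = tuple(pegs)
                if nxt not in parent:
                    parent[nxt] = (state, (src, dst))
                    queue.append(nxt)
    # walk parent pointers back from the goal to recover the trajectory
    moves, node = [], goal
    while parent[node] is not None:
        prev, move = parent[node]
        moves.append(move)
        node = prev
    moves.reverse()
    return moves
```

Because BFS explores states in order of distance from the start, the recovered trajectory has the minimal length, which for Tower of Hanoi is 2^N - 1 moves.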

[AI-135] Multi-agent Framework for Time-Sensitive Complementary Collaboration in Minecraft

Link: https://arxiv.org/abs/2606.15684
Authors: Juheon Yi,Jinglu Wang,Xiaoyi Zhang,Yan Lu
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract: We present TickingCollabBench, a Minecraft-based multi-agent benchmark for a novel class of time-sensitive complementary collaboration tasks. Our benchmark reflects four core characteristics of real-world collaboration: agent heterogeneity, mandatory collaboration, dynamic environments, and strict real-time constraints with failure risks. To enable this, we develop the TickingCollab framework, which supports the generation of diverse dynamic environments and abstracts Minecraft’s primitive APIs to enable declarative YAML task specifications for composing these events. Building on this, we design a feasibility-aware automated benchmark generation pipeline, where an LLM drafts structurally diverse task configurations and a feasibility verifier filters out invalid ones using approximate constraints. Evaluations demonstrate that long latency and the inherent difficulty of coordinating under partial observability and agent heterogeneity cause LLMs to frequently fail under dynamic environments and fall significantly short of a global-knowledge oracle.

[AI-136] The Reservoir Attention Network: Cross-Pass State in Pretrained Transformers via Content-Addressable Reservoir Injection

Link: https://arxiv.org/abs/2606.15678
Authors: Emma Leonhart
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 29 pages, 14 figures

Abstract:A feasibility and dynamics study of the Reservoir Attention Network (RAN), an architecture that injects a fixed, randomly-initialized reservoir into the mid-layer attention of a pretrained transformer to carry state across forward passes. Experiments span GPT-2 (124M, 355M) to Qwen2.5 (0.5B, 1.5B) on a single consumer GPU. The tasks are minimal probes chosen to isolate individual mechanisms; the broader always-alive agent vision is treated throughout as compute-limited future work, not a claim of this paper. The reservoir is left untrained (fixed random) by design: this isolates whether untrained recurrent dynamics alone suffice to carry usable cross-pass state, leaving trained recurrence as a complementary, more expensive direction.

[AI-137] Where Did It Go Wrong? Process-Level Evaluation of Web Agents with Semantic State Tracking

Link: https://arxiv.org/abs/2606.15673
Authors: Jiwan Chung,JiHyuk Byun,Vibhav Vineet,Seon Joo Kim
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Web agents act through long interaction sequences, yet existing benchmarks evaluate only terminal success, discarding all process information and offering little guidance on improvement. In this work, we conduct a process-level analysis of web agents. We introduce WebStep, a benchmark of 1,800 task instances with controlled difficulty and automatic semantic state tracking. Each website exposes a deterministic semantic MDP alongside the GUI: the agent operates on the interface, while the environment records high-level states and transitions in the background, enabling fine-grained analysis without manual annotation. Based on the semantic trajectory, we first show that process metrics reveal differences invisible to outcome evaluation: three agents whose success rates cluster within 31-33% diverge in exploration reach versus execution accuracy. Then, decomposing by skill characterizes the nature of these differences, exposing opposite per-skill rankings hidden within the same website: e.g., on Housing, OpenAI CUA outperforms Qwen3.5 by 23.7% on commit actions yet underperforms it by 15.6% on filtering, pinpointing a concrete skill to improve even within a domain. Bifurcation analysis further localizes the decisive error that loses the task and shows that this error is agent-specific rather than shared. Finally, these differences widen as tasks grow harder: success rate is similar on easy tasks but separates sharply as exploration becomes more demanding. Our process-level analysis opens a new avenue in web agent evaluation, providing fine-grained and actionable insight into where and how each agent should be improved.

[AI-138] Z-Plane Neural Networks: Bounded Geometric Activation Replaces ReLU and LayerNorm

Link: https://arxiv.org/abs/2606.15669
Authors: Sungwoo Goo,Hwi-yeol Yun,Sangkeun Jung
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract: Modern deep neural networks rely on Euclidean scalar activations (e.g., ReLU) and global normalization techniques (e.g., LayerNorm) to prevent gradient instability in deep architectures. However, these mechanisms inherently cause dead neurons, discard critical directional information, and destroy the orthogonality of feature representations. Inspired by the frequency-modulation transmission of biological axons, we propose the Z-Plane Neural Network, which maps hidden states into 2D phasor bundles on a hypersphere. We introduce a novel geometric activation function, Radial Bounding ($\mathbf{x} / \max(1, \|\mathbf{x}\|_2)$), which limits the energy magnitude while preserving the phase (direction). We demonstrate mathematically that this isotropic activation maintains 1-Lipschitz continuity and prevents gradient vanishing by preserving tangential gradients. Empirically, a 100-layer Z-Plane Multi-Layer Perceptron (MLP), entirely devoid of ReLU and LayerNorm, successfully converges on the MNIST dataset with 98.34% accuracy and absolute numerical stability, proving that bounded geometric activation alone is sufficient for stable deep learning.
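
The Radial Bounding formula quoted in this abstract is simple enough to sketch directly. The snippet below is an illustration in plain Python and assumes the activation is applied to a single vector; how the paper groups hidden units into 2D phasor bundles is not shown, and the function name `radial_bound` is hypothetical.

```python
import math

def radial_bound(x):
    """Compute x / max(1, ||x||_2): identity inside the unit ball, rescaled to norm 1 outside."""
    norm = math.sqrt(sum(v * v for v in x))
    scale = max(1.0, norm)
    return [v / scale for v in x]
```

A vector such as `[3.0, 4.0]` (norm 5) is scaled down to norm 1 with its direction unchanged, while `[0.3, 0.4]` (norm 0.5) passes through untouched.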

[AI-139] Overcoming the Impedance Mismatch: A Theoretical Roadmap for Fusing Foundation Models and Knowledge Graphs ACL2026

Link: https://arxiv.org/abs/2606.15656
Authors: Sahil Rajesh Dhayalkar
Categories: Artificial Intelligence (cs.AI)
Comments: 12 pages. Accepted at the ACL 2026 4th Workshop on Towards Knowledgeable Foundation Models ( this https URL )

Abstract: Modern artificial intelligence remains fundamentally divided between the continuous, probabilistic spaces of Foundation Models and the discrete, deterministic structures of Knowledge Graphs. While Retrieval-Augmented Generation (RAG) attempts to connect them by serializing graph data into text, we argue this lexical bridging is merely a superficial patch. In this paper, we formalize the underlying structural and geometric friction as the Impedance Mismatch. By categorizing current neuro-symbolic integration strategies into a three-tiered hierarchy, we demonstrate that neither surface-level prompt injection nor continuous representation alignment can preserve the strict logical motifs required for reliable multi-hop reasoning. We define the specific mathematical limits, such as the Lexical Bottleneck and Topological Collapse, that show current architectures will eventually hallucinate or conflate semantic nodes. To achieve true semantic fusion, we propose a rigorous theoretical roadmap. We advocate for natively internalizing discrete symbolic structures through Structured Residual Streams, utilizing Vector Symbolic Architectures for latent sub-graph injection, and performing model updates via Orthogonal Subspace Editing. This actionable framework paves the way for models that seamlessly fuse the precision of symbolic logic with the expressivity of parametric memory.

[AI-140] Advanced Machine Learning and Deep Learning Techniques for Enhanced Cattle Identification and Detection: A Comprehensive Review

Link: https://arxiv.org/abs/2606.15655
Authors: Fayazunnesa Chowdhury,Syed Md. Galib,Md Nasim Adnan,Md. Moradul Siddique,Md Robiul Karim,K M Tanvir Anjum
Categories: Artificial Intelligence (cs.AI)
Comments: Published in the journal of Annals of Emerging Technologies in Computing (AETiC), 34 pages, 5 Figures. The Article is available here: this http URL

Abstract:The need for effective cattle identification technology is now more acutely felt than ever in maintaining biosecurity, food safety, and supply chain efficacy in livestock management. This paper presents a systematic review of recent research in cattle identification using machine learning and deep learning techniques. The present systematic review measures the effectiveness of traditional and modern cattle identification techniques using studies from major academic databases, where articles were subjected to full-text review. Among these techniques, classical Machine Learning Techniques such as K-Nearest Neighbors and Support Vector Machines have demonstrated good results in cattle identification; however, Deep Learning Techniques, such as Convolutional Neural Networks, Residual Networks, and You Only Look Once, are better in cognition, detection, and identification tasks. Feature extraction relies on common techniques like Local Binary Pattern (LBP), Speeded-Up Robust Features (SURF), and Scale-Invariant Feature Transform (SIFT), while key features commonly used in these studies include muzzle prints and coat patterns. The review highlights key hurdles involving cattle identification, such as the limited number of publicly accessible datasets, issues with data quality susceptible to environmental changes and animal mobility, and high demand for real-time processing ability. The paper aims to inform researchers, policymakers, and stakeholders about implementing scalable, humane, and effective cattle identification systems to achieve sustainable livestock management.

[AI-141] PO-PDDL: Learning Symbolic POMDPs from Visual Demonstrations for Robot Planning Under Uncertainty

Link: https://arxiv.org/abs/2606.15654
Authors: Wenjing Tang,Xuanjin Jin,Yuan Liu,Renming Huang,Cewu Lu,Panpan Cai
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Real-world robot task planning must operate under both stochastic action execution and partial observability, yet constructing Partially Observable Markov Decision Process (POMDP) models for real robotics domains remains difficult and labor-intensive. We introduce PO-PDDL, a symbolic formulation of POMDPs that preserves the relational structure and LLM-friendly syntax of the Planning Domain Definition Language (PDDL), while explicitly modeling partial observability, stochasticity, and beliefs. Building on this formulation, we propose a demonstration-driven pipeline for learning PO-PDDL models. The proposed method reconstructs latent symbolic state trajectories from real-robot execution videos, identifies partial observability via inconsistencies between inferred states and visual observations, and learns stochastic transition and observation models accordingly. The resulting PO-PDDL domains are reusable across tasks and enable online belief-space planning under both perception and execution uncertainty. Experiments on real-world long-horizon manipulation tasks show that our method consistently outperforms existing PDDL and POMDP model-learning approaches, achieving robust task planning under uncertainty with significantly lower planning cost.

[AI-142] IoT-Zoo: A Container-Based Framework for Heterogeneous IoT Device Profiles and Reproducible Traffic Capture

Link: https://arxiv.org/abs/2606.15653
Authors: Vagner E. Quincozes,Diego Kreutz,Silvio E. Quincozes
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: 10 pages, including 4 figures and 4 tables, submitted to SBRC 2026

Abstract:The validation of networking and security solutions for the Internet of Things (IoT) requires realistic and reproducible experimental data. However, existing platforms often achieve scalability by replicating a limited set of device types, which restricts profile diversity and fails to capture the heterogeneity of real-world IoT environments. In this paper, we present IoT-Zoo, a container-based testbed designed to support reproducible experimentation through heterogeneous, dataset-driven IoT device profiles. Built upon Containernet, IoT-Zoo automates the deployment of multi-domain scenarios and supports real application protocols such as MQTT and RTSP. The platform provides a single-command interface for environment provisioning and automated traffic capture (PCAP), enabling the generation of consistent traffic baselines and reducing the operational effort required to evaluate networking and security solutions.

[AI-143] AnonShield: Scalable On-Premise Pseudonymization for CSIRT Vulnerability Data

Link: https://arxiv.org/abs/2606.15650
Authors: Cristhian Kapelinski,Douglas Lautert,Beatriz Machado,Diego Kreutz,Isadora Garcia Ferrão
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments: 9 pages, including 2 figures and 8 tables, submitted to SF/SBRC 2026

Abstract:We present AnonShield, a high-throughput, on-premise pseudonymization system that combines GPU-accelerated NER, streaming processing, caching, and schema-aware configuration. Evaluated on datasets up to 550 MB (70,951 records), AnonShield reduces processing time from over 92 hours to under 10 minutes (up to 738x speedup) while achieving up to 94.2% F1-score and 96.7% recall. Our results show that scalable pseudonymization of vulnerability data is feasible without sacrificing analytical utility, enabling compliant data sharing in operational CSIRT environments.

[AI-144] NeuroSymbolic AI for Legal AI-TRISM: Trustworthy Reliable Interpretable Safe Models

Link: https://arxiv.org/abs/2606.15646
Authors: Deepa Tilwani,Yash Saxena,Ankur Padia,Srinivasan Parthasarathy,Manas Gaur
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have transformed natural language processing, but their lack of interpretable reasoning and tendency to hallucinate pose significant challenges for legal applications. While LLMs show promise for legal text analysis and generation, they struggle with accurate citation attribution and precedent verification. For example, in legal contexts, a single incorrect precedent can jeopardize a case. Current approaches to improve LLM reliability in legal domains suffer from two key limitations: inadequate integration of structured legal knowledge during training or fine-tuning, and insufficient verification mechanisms for generated legal content. To address these challenges, we propose the TRISM (Trustworthy, Reliable, Interpretable, Safe Models) framework, which integrates NeuroSymbolic AI principles with LLMs to leverage both neural learning capabilities and symbolic reasoning over structured legal knowledge. The TRISM approach addresses the above limitations while maintaining interpretable decision pathways. Our framework formalizes the extraction of symbolic knowledge from legal textual documents and incorporates Retrieval-Augmented Generation (RAG) as a core component for grounding LLM outputs in verified legal sources. In this position paper, we make the following contributions: (1) An analysis of the limitations of AI in law; (2) Introduce RASOR RAG which creates foundations for neurosymbolic RAG by generating explicit interpretable rationales that could be formalized into symbolic representations; (3) A formalized methodology for creating symbolic legal knowledge bases that support both interpretable reasoning and output verification in LLMs; and (4) The TRISM framework for integrating symbolic legal knowledge with LLMs.

[AI-145] CIWI-CKT: Chaos-Informed Wave Interference Feature Fusion and Cross-City Knowledge Transfer for Traffic Flow Forecasting

Link: https://arxiv.org/abs/2606.15642
Authors: Abdul Joseph Fofanah,Lian Wen,David Chen,Shaoyang Zhang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Accurate traffic flow prediction remains challenging in cross-city, data-scarce scenarios where limited historical data hinders model generalisation. The chaotic nature of traffic dynamics, complex spatio-temporal dependencies, and heterogeneous urban networks complicate few-shot learning across cities. Existing deep learning approaches either treat traffic as purely deterministic or lack mechanisms to model wave-like interference patterns essential for cross-regime traffic dynamics. To address these limitations, this paper proposes CIWI-CKT, a novel Chaos-Informed Wave Interference Feature Fusion framework with Cross-City Knowledge Transfer. Our framework introduces three core innovations: chaos-informed wave generation that extracts measurable chaos invariants and models traffic as adaptive wave components; meta-interference processing that captures wave interactions between support and query regimes while producing a predictability score for confidence estimation; and chaos-aware meta-learning that enables efficient cross-city knowledge transfer while preserving chaotic characteristics. We establish theoretical guarantees including chaos-to-wave stability, wave-induced dimension reduction, and meta-learning generalisation bounds. Extensive experiments on four real-world traffic datasets demonstrate that CIWI-CKT significantly outperforms state-of-the-art spatio-temporal graph learning, transfer learning, prompt-based, and few-shot methods, improving prediction accuracy while substantially reducing required training data.

[AI-146] Retrieve, Don't Retrain: Extending Vision Language Action Models to New Tasks at Test Time

Link: https://arxiv.org/abs/2606.15631
Authors: Jeongeun Park,Juhan Park,Taekyung Kim,Sungjoon Choi,Dongyoon Han,Sangdoo Yun
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: this https URL

Abstract:Extending a vision-language-action (VLA) policy to a new task typically requires task-specific teleoperated demonstrations and per-task fine-tuning, making adaptation costly in both data collection and compute. In this paper, we show that this target-side per-task adaptation cost can be replaced by retrieval. Our retrieval-augmented policy is trained once on paired demonstrations from the target embodiment (query) and a cheaper embodiment (pool, e.g., human-hand video), then frozen. New tasks are added at deployment by appending pool-side demonstrations to a retrieval pool. The frozen policy conditions on retrieved trajectories at every control step, so new tasks are absorbed by indexing data rather than updating parameters. Fine-tuning is needed only to take on a new, unseen embodiment, not for each new task. We show that retrieval improves policies beyond a specific backbone, including standard VLA policies, but its effect is especially pronounced in Cosmos Policy, a video-generation-based world-action model (WAM). In this setting, retrieval supplies coarse task progression, while the WAM’s future-image objective provides an additional visual consistency signal that strengthens the retrieval-conditioned actions. On PushT, we study how retrieval provides a reusable high-level motion prior for cross-embodiment generalization to unseen goal angles, while on RoboTwin 2.0 our method outperforms cross-embodiment baselines on unseen tasks, and we additionally demonstrate the method on a real robot.

[AI-147] Surprise-Guided MergeSort: Budget-Efficient Human-in-the-Loop Ranking via Adaptive Comparison Scheduling

Link: https://arxiv.org/abs/2606.15623
Authors: Yujin Park,Haejun Chung,Ikbeom Jang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages

Abstract: Pairwise comparison is the gold standard for subjective ranking tasks; however, exhaustive annotation requires a massive number of human comparisons ($O(n^2)$). While sorting-based methods have reduced this burden to $O(n \log n)$, they still require expensive human judgment for every single comparison. To further improve annotation efficiency, we propose leveraging a Vision-Language Model (VLM) not as an annotator replacement, but as a question prioritizer to identify which comparisons genuinely require human judgment. The proposed Surprise-Guided MergeSort (SGS) framework achieves this through three integrated components: (1) a bottom-up MergeSort scheduler that structures comparisons and exploits transitivity, (2) a composite Surprise Scorer (combining position-bias-cancelled VLM confidence, Elo gap, and vote entropy) to quantify comparison ambiguity, and (3) an adaptive budget allocator that routes high-surprise pairs to humans while automating low-surprise pairs via transitivity inference. Validation was conducted on six diverse benchmarks spanning text similarity (STS-B, BIOSSES, SICKR-STS) and image quality assessment (KonIQ-10k, TID2013, LIVE Challenge). SGS effectively identified and skipped up to 535 non-informative comparisons per session. Consequently, it achieved Kendall’s $\tau \times 100$ improvements of +6 to +12 over Active Elo under the same total budget. These results demonstrate that combining VLM-guided surprise metrics with algorithmic sorting provides a generally consistent accuracy-efficiency trade-off across diverse domains.
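
The routing idea in this abstract (a bottom-up merge sort in which only ambiguous pairs reach a human) can be sketched as below. This is a simplified illustration, not the SGS implementation: the `surprise` callback stands in for the composite Surprise Scorer, `auto_compare` stands in for the automated low-surprise path, and all names and the fixed `threshold` are hypothetical.

```python
def surprise_guided_sort(items, auto_compare, human_compare, surprise, threshold=0.5):
    """Bottom-up merge sort that sends only high-surprise pairs to the human comparator.

    Each compare(a, b) returns True if a should rank before b.
    Returns (ranking, number_of_human_comparisons).
    """
    human_calls = 0

    def before(a, b):
        nonlocal human_calls
        if surprise(a, b) >= threshold:      # ambiguous pair: ask the human
            human_calls += 1
            return human_compare(a, b)
        return auto_compare(a, b)            # confident pair: decide automatically

    def merge(left, right):
        out, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if before(left[i], right[j]):
                out.append(left[i])
                i += 1
            else:
                out.append(right[j])
                j += 1
        return out + left[i:] + right[j:]

    # bottom-up scheduling: start from singleton runs and merge adjacent runs
    runs = [[x] for x in items]
    while len(runs) > 1:
        runs = [
            merge(runs[k], runs[k + 1]) if k + 1 < len(runs) else runs[k]
            for k in range(0, len(runs), 2)
        ]
    return (runs[0] if runs else []), human_calls
```

The merge structure keeps the total number of comparisons at O(n log n), and the surprise gate decides what fraction of those consume human budget.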

[AI-148] FragFuse: Bypassing Access Control of Large Language Model Agents via Memory-Based Query Fragmentation and Fusion USENIX-SECURITY2026

Link: https://arxiv.org/abs/2606.15609
Authors: Zixin Rao,Wentian Zhu,Chan Aristella Lu,Zhaorun Chen,Wei Niu,Le Guan,Bo Li,Zhen Xiang
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 33 pages, 4 figures. Accepted by USENIX Security 2026

Abstract:Large language model (LLM) agents increasingly rely on long-term memory to support complex task execution, user personalization, and domain adaptation. Meanwhile, emerging access-control mechanisms for LLM agents are being explored to block policy-violating requests and prevent misuse. We reveal a novel attack surface arising from agent memory operations: prohibited content that would trigger access control can be fragmented across interactions, stored in long-term memory in benign-appearing form, and later reconstructed through memory retrieval without appearing explicitly in the final user query. We propose FragFuse, the first attack that enables unprivileged users to bypass agent access control by exploiting this temporal channel introduced by long-term memory. FragFuse operates in three stages: (1) identifying rejection-responsive fragments via black-box adaptive querying with fragment masking; (2) injecting these fragments into memory using marker carrier queries; and (3) retrieving and fusing the stored fragments through a follow-up attack query. Although FragFuse can be instantiated manually for individual agents, we further develop a surrogate-based optimization scheme that tunes fusion instructions and marker designs, enabling automated attack generation without violating the attacker’s threat-model assumptions. We evaluate FragFuse across four representative agent settings and task domains, covering three state-of-the-art agent access-control mechanisms. FragFuse achieves an average bypass success rate of 86.3% and an average end-to-end harmful task success rate of 41.1% across all settings, with only 4.4% average task-success degradation compared with configurations without access control. We also show that alternative defenses, including state-of-the-art prompt-injection detectors and perplexity detectors, do not effectively address this attack.

[AI-149] Integrating Reasoning and Generalization in Text-to-SQL via Self-Enhanced Fine-Tuning

Link: https://arxiv.org/abs/2606.15598
Authors: Feng Lyu,Jinfeng Cen,Sijing Duan,Hao Wu,Shucheng Li,Weixu Zhang,Haolun Wu
Categories: Artificial Intelligence (cs.AI)
Comments: 14 pages, 13 figures, 7 tables

Abstract:Text-to-SQL aims to translate natural language questions into executable SQL queries over structured databases, enabling non-expert users to access data intuitively. While recent advances in large language models (LLMs) have shown promise in this task, existing LLM-based approaches often struggle to strike a balance between strong reasoning capabilities and robust generalization. To address these limitations, we propose CoTE-SQL to enhance the LLM-based text-to-SQL generation with three key innovations: (i) self-enhanced reasoning traces distilled from LLMs without human annotation, (ii) structured chain-of-thought (CoT) prompting with modular decomposition and examples retrieval, and (iii) error-aware revision based on SQL execution feedback. Extensive experiments on the Spider and Bird benchmarks demonstrate that CoTE-SQL achieves new state-of-the-art performance among methods built on open-source LLMs with comparable model sizes on Bird (53.39% EX / 59.02 VES) and strong results on Spider (79.60% EX / 77.19 VES), with especially significant gains on complex queries. Results highlight the effectiveness of combining self-enhancement, structured reasoning, and execution-time feedback within an LLM-based framework for text-to-SQL design.

[AI-150] Is Code Better Than Language for Algorithmic Reasoning ICML2026

Link: https://arxiv.org/abs/2606.15589
Authors: Terry Tong, Yu Feng, Surbhi Goel, Dan Roth
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICML 2026

Abstract:For tool-augmented language models, comparing natural-language reasoning with code-execution pipelines is difficult because the comparison changes both the intermediate representation and the execution mechanism. We separate these factors with an intermediate intervention: the model expresses its reasoning as executable code, and the language model simulates that code in context to produce an answer. On a 40-task verifiable algorithmic benchmark, deterministic code execution outperforms natural-language reasoning by +31.6pp. We observe that the intermediate intervention is not meaningfully different from natural-language reasoning (+0.15pp). These results suggest that, in our evaluated setting, changing the intermediate representation alone does not explain the tool-use advantage, providing evidence for the performance gains requiring reliable external execution. We formalize this intuition with a simple statistical decision-theoretic model that characterizes when execution dominates end-to-end risk in our disentangled trace-generation/execution regime. We validate our theory using a reconstruction intervention that leverages a proxy language model to infer natural-language reasoning traces from code representations, recovering performance comparable to the original natural-language reasoning pipeline. All experiments are at this https URL.

[AI-151] Large Language Models as Optimizers: A Survey of Direct vs. Tool-Augmented Approaches and Their Performance Frontiers

Link: https://arxiv.org/abs/2606.15577
Authors: Roko Peran, Luka Hobor, Mihael Kovac, Mario Brcic
Categories: Artificial Intelligence (cs.AI)
Comments: 6 pages, 1 figure, 2 tables, accepted at 49th ICT and Electronics Convention, MIPRO - this https URL; Paper ID: #23463

Abstract:Large Language Models (LLMs) are increasingly involved in complex mathematical optimization, even if the pragmatic user who triggers them is unaware of it. After all, many real-world problems reduce to the search for better or the best solutions. The field of LLM-as-optimizer has three paradigms: direct optimization, tool-augmented optimization, and tool-creating optimization. Direct optimization uses iterative prompting and heuristic generation to navigate solution spaces. Tool-augmented optimization translates natural language problems into formal specifications and orchestrates external solvers. Tool-creating optimization goes further, using LLMs to discover reusable algorithms or heuristics that can be deployed at zero marginal LLM cost. We describe current performance frontiers based on the benchmarks from the literature. We identify the critical reasoning gap in current architectures and argue for trade-offs between the future potential of direct optimization and the auditability of tool-augmented optimization. Even future, more powerful models might opt for tool-making to improve operational efficiency for repetitive families of problems.

[AI-152] Localizing Credit at the Divergence: Path-Conditioned Self-Distillation for LLM Reasoning

Link: https://arxiv.org/abs/2606.15576
Authors: Yu Li, Shu Hong, Tian Lan
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Reinforcement learning from verifiable rewards assigns a single scalar to each rollout, leaving token-level credit assignment underspecified in long reasoning traces. On-policy self-distillation addresses this by letting the same model act as a teacher conditioned on privileged information, producing a dense per-token signal. But the common choice of a ground-truth answer is only an endpoint cue: on terse-answer tasks, the teacher falls silent at the intermediate positions where path-level guidance matters most. We propose Hindsight Self-Distillation (HSD), which conditions the teacher on a successful peer rollout drawn from the current training group. Such a peer is an exact sample from the success-conditioned policy, requiring no additional sampled rollouts. By providing a full successful continuation rather than only the final answer, the resulting credit signal concentrates at the divergence position between a failed rollout and a successful peer. Across Qwen3-8B and Qwen3-32B on math and code benchmarks, HSD obtains the best result against GRPO variants and on-policy distillation baselines, with the largest gains on terse-answer tasks such as AIME.
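
The "divergence position" between a failed rollout and a successful peer can be illustrated with a plain token comparison. This is only a sketch of the notion, not HSD itself: in the paper the credit signal comes from the teacher conditioned on the peer, whereas the helper below (a hypothetical name) just finds the first index where two token lists differ.

```python
def divergence_index(failed: list[str], peer: list[str]) -> int:
    """Index of the first token where the two rollouts differ.

    Returns the shorter length when one rollout is a prefix of the other.
    """
    for i, (a, b) in enumerate(zip(failed, peer)):
        if a != b:
            return i
    return min(len(failed), len(peer))


# Toy rollouts (made up for illustration): they share a prefix, then split.
failed = ["let", "x", "=", "3", ";", "so", "x+1", "=", "5"]
peer = ["let", "x", "=", "3", ";", "so", "x+1", "=", "4"]
print(divergence_index(failed, peer))
```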

[AI-153] QoS-Aware Token Scheduling and Private Data Valuation for Multi-Modal Agentic Networks ICME2026

Link: https://arxiv.org/abs/2606.15573
Authors: Yao Du, Jing Liu, Pengfei Xu, Zehua Wang, Victor C.M. Leung, Cyril Leung, Victoria Lemieux
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Accepted to IEEE ICME 2026. Supplementary materials are included in the TeX source

Abstract:In agentic systems, human-generated data records anchor the value of AI services. Yet cloud compute pipelines centralize processing on remote servers. Data centralization reduces personal data sovereignty and may potentially degrade the quality of service (QoS). Meanwhile, user contributions are diverse in quantity and quality: decentralized records can be biased, noisy, and heterogeneously distributed. To address the data challenge, we study fair token allocation and private data valuation for decentralized and resource-constrained agentic systems. Our approach embeds multi-modal representations in a shared semantic space and releases differentially private (DP) prototypes to preserve utility while reducing semantic leakage. With the DP guarantee, we design a fair token allocation scheme that rewards effective contributions and remains robust to data heterogeneity and AI resource scarcity. Extensive simulations demonstrate improved contribution-based fairness and QoS compared to standard benchmarks. The improved resistance to image reconstruction attacks indicates enhanced privacy for multi-modal personal data.

[AI-154] Distilling Drifting Transformers with Representation Autoencoders

Link: https://arxiv.org/abs/2606.15553
Authors: Jiawei Zhang, Mengfei Xia, Gen Li, Yuantao Gu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Representation Autoencoders (RAEs) have improved diffusion and flow models by providing a semantically richer latent space, owing to the strongly label-wise clustered DINO features in the pretrained encoders. Yet in the distillation stage, the severe anisotropy and large curvatures caused by the rich semantic representations hinder convergence and performance, making trajectory-based distillation unstable. In this work, we argue that the RAE latent space is compatible with distillation via the newly proposed Drifting Models. We first quantitatively study the curvature and isotropy statistics across different autoencoders, and theoretically reveal that the Drifting Model itself is highly likely to fail on extremely scattered spaces like those of reconstruction-based VAEs. These findings motivate us to apply the drifting paradigm directly to representation autoencoders. Our proposed method, Drift-RAE, distills pretrained flow models in RAE latent spaces using Drifting, together with modifications that improve training stability by theoretically aligning drifting fields with other frameworks. Experimentally, we achieve 1.77 FID on the ImageNet 256 dataset using only 10k distillation steps, surpassing state-of-the-art RAE distillation methods and performing comparably to the original Drifting Model without requiring an auxiliary MAE feature extractor. The code will be made publicly available.

[AI-155] CmdNeedle: Measuring the Incompleteness of Command Denylists for AI Agents

Link: https://arxiv.org/abs/2606.15549
Authors: Chuyang Chen, Zhiqiang Lin
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:The adoption of AI agents is increasing rapidly. Terminal AI agents, i.e., AI agents that run in terminal environments, are a widely used type of AI agents. Terminal AI agents rely heavily on shell command execution to interact with the host systems. They adopt a three-list command-gating mechanism to mitigate security risks introduced by command execution, with denylists serving as the load-bearing component. However, modern operating systems often ship a large, ever-expanding set of shell commands with complex functionalities. Our observation is that even a built-in denylist of Claude Code, well-maintained by its developers, can overlook bypass commands that invalidate its effectiveness. Such negligence leads to fragile command denylists that cannot even block operations that practitioners expect them to block. This paper presents the first systematic characterization of command denylist fragility in terminal AI agents. The paper formalizes the command denylist fragility problem and proposes an LLM-driven pipeline, CmdNeedle, to detect such fragility. It prompts the LLM to propose possible bypasses and iteratively repairs them using feedback from a validator that executes them in a sandbox. In the evaluation, we applied CmdNeedle to 1,709 real-world command denylists (containing 13,332 denylist rules) collected from GitHub. The evaluation shows several key findings, including that 69.0–98.6% of the denylists are fragile, that this fragility occurs consistently across projects and agents, and the validity of several possible root causes for this fragility. Our pipeline and findings will hopefully facilitate future research and practice regarding the command denylists used by AI agents.
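
Why a denylist can be fragile is easy to see with a toy gate. The sketch below is not CmdNeedle and does not reproduce any real agent's rules. The `DENYLIST` entries and the matching rule (compare only the program name) are assumptions chosen to show how an equivalent operation can slip past a rule that names one specific command.

```python
import shlex

DENYLIST = {"rm", "curl"}  # hypothetical rules, not taken from any real agent


def is_blocked(command: str) -> bool:
    """Naive gate: block when the first token (the program name) is denylisted."""
    tokens = shlex.split(command)
    return bool(tokens) and tokens[0] in DENYLIST


candidates = [
    "rm -rf build",          # blocked directly
    "find build -delete",    # same effect through a different program
    "sh -c 'rm -rf build'",  # denied program hidden behind a shell wrapper
]
for cmd in candidates:
    print(f"{cmd!r}: blocked={is_blocked(cmd)}")
```

Only the first command is blocked. The other two remove the same directory, which is the kind of gap the abstract describes as a bypass.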

[AI-156] AP-GRPO: Anchor-Gated Phonetic Alignment with Policy Optimization for Pathological Speech Reconstruction

Link: https://arxiv.org/abs/2606.15540
Authors: Pengfei Zhang, Hoang H Nguyen, Yutong Song, Wenjun Huang, Tahmid Imtiaz Imu, Henry Peng Zou, Jiang Wu, Honghui Xu, Amir M. Rahmani
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)

Abstract:Pathological speech from patients with neurodegenerative and neuromotor disorders is often acoustically distorted and linguistically fragmented, making pathological speech reconstruction necessary to recover intended textual content from distorted and incomplete speech recordings. Crucially, such recordings are rarely uniformly degraded: some words or short phrases remain reliable and can serve as audible anchors for reconstructing the corrupted surrounding content. We introduce Anchor-gated Phonetic Group Relative Policy Optimization (AP-GRPO), a GRPO framework with phonetic reward that aligns speech language models (SLMs) through audible-anchor preservation and inter-anchor phonetic compatibility to the original speech signal. AP-GRPO consists of: (i) an anchor-gated reward that matches reliable audible anchors in clear regions; and (ii) an inter-anchor phonetic alignment reward that evaluates whether recovered contents are phonetically supported by the corresponding corrupted inter-anchor speech span. Across four disease conditions, AP-GRPO improves faithful speech reconstruction, and the learned anchor constraint automatically adapts to each condition and thus reveals interpretable disease-specific profiles: conditions with severe articulatory degradation require stronger anchor enforcement, whereas milder impairment or linguistically impaired conditions rely more on phonetic alignment for inter-anchor recovery.

[AI-157] MADAR: An Address-Free Processor

Link: https://arxiv.org/abs/2606.15535
Authors: Mohamed Amine Bergach
Categories: Performance (cs.PF); Artificial Intelligence (cs.AI)

Abstract:In a modern processor, computing is the cheap part. Most of its area and energy go to addressing – moving operands to and from a register file and cache, and running the tags, ports, miss queues, and bypass networks that find a value where it was left. MADAR deletes that machinery by abolishing the address. All state circulates in rings of slots that advance one position per clock; instructions and data ride in the same slots; a value is named by its place in an orbit – a \rp coordinate – not by an address; a fixed station computes when a circulating instruction sweeps past its operands, on a schedule set at compile time; and a hierarchy of rings of increasing period replaces the cache hierarchy, movement between them scheduled rather than triggered by a miss. No prior circulating-store, dataflow, or statically scheduled machine combines all four of these. We define the execution model, validate it in a cycle-accurate register-transfer-level implementation, show it compilable – a constructive scheduler emits programs cross-checked against the implementation – and price it with a first-order energy model. The payoff is clearest for AI acceleration: the multiply-accumulate at the heart of every matmul and convolution compiles to a streaming form whose energy per operation stays flat as the reduction grows, and the operand reuse that makes matrix multiplication efficient is carried by the ring-period hierarchy – the memory hierarchy doing by rotation what a cache does by tags. MADAR is a new design point for any computation whose data movement is known before the program runs.

[AI-158] AQ4SViT: An Automated Quantization Framework with Search Gating Policy for Compressing Spiking Vision Transformers

Link: https://arxiv.org/abs/2606.15523
Authors: Rachmad Vidya Wicaksana Putra, Saad Iftikhar, Muhammad Shafique
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 4 figures, 2 tables

Abstract:Spiking Vision Transformers (SViTs) have emerged as alternative low-power ViT models, but their large sizes hinder their deployment on resource-constrained embedded AI systems. To address this, state-of-the-art works proposed quantization techniques to compress SViT models, but their manual, human-guided approach requires substantial design time and power/energy consumption to find the appropriate quantization setting for each given network, so it does not scale to quantizing multiple networks. To address this, we propose AQ4SViT, a novel automated quantization framework for SViTs that can quickly provide quantization settings with good trade-offs between accuracy and memory. To achieve this, AQ4SViT employs the following key ideas: a quantization search strategy that evaluates the quantization setting candidates while considering the accuracy constraint; and a search gating policy that quickly evaluates and selects promising quantization candidates by leveraging membrane potential drift as a performance proxy. In the search gating policy, AQ4SViT employs two search algorithm variants to provide trade-off options: Greedy search, which is fast but may lead to local optima; and Beam search, which is slower but better at finding the global optimum due to a wider search space. Experimental results show that AQ4SViT-Greedy quickly finds the appropriate quantization settings, achieving up to 6.6x faster search time and up to 82.5% memory saving compared to the state-of-the-art; while AQ4SViT-Beam further reduces the memory footprint by up to 90% compared to the state-of-the-art, but with 4.5x longer search time; all these results are obtained while maintaining high accuracy within 1.5% from the original/non-quantized models on the ImageNet dataset. These results highlight that the AQ4SViT framework offers advancements toward SViT deployments on embedded AI systems.

[AI-159] ToolMenuBench: Benchmarking Tool-Menu Filtering Strategies for Reliable and Efficient LLM Agents

Link: https://arxiv.org/abs/2606.15508
Authors: Rahul Suresh Babu, Laxmipriya Ganesh Iyer
Categories: Artificial Intelligence (cs.AI)

Abstract:Tool-augmented large language model agents increasingly operate over large tool libraries, but existing evaluations often focus on whether a model can call a tool correctly rather than how the visible tool menu shapes reliability, efficiency, and safety-relevant risk exposure. We introduce ToolMenuBench, a benchmark for evaluating tool-menu construction in multi-step LLM agents. ToolMenuBench varies tool-menu size, distractor type, state-dependent task structure, and risk exposure, and reports both filter-level and downstream agent metrics, including visible-tool count, risky-tool exposure, task success, wrong-tool calls, premature actions, and token usage. In a controlled evaluation across seven model backends, three tool-menu sizes, six filtering methods, and seven evaluation settings, causal minimal tool filtering (CMTF) improves task success from 32.1% under all-tools exposure to 85.7%, while reducing average token usage by roughly 98%. CMTF achieves the strongest overall tradeoff, reducing visible tools, wrong-tool calls, premature actions, and risky-tool exposure relative to unfiltered exposure, lexical filtering, state-aware filtering, and broader causal-path baselines. ToolMenuBench provides a reusable evaluation framework for studying the agent-interface problem: which tools should be visible, when they should be visible, and under what cost or risk constraints.

[AI-160] Frame-Conditioned Moral Computation in LLaMA 3.1-8B-Instruct: A Mechanistic Interpretability Audit of Ethical Reasoning

Link: https://arxiv.org/abs/2606.15507
Authors: Ali Dasdan, Manan Shah, W. Russell Neuman, Chad Coleman, Kund Meghani, Safinah Ali
Categories: Artificial Intelligence (cs.AI)
Comments: 47 pages, 10 figures

Abstract:Behavioral audits of Large Language Models on moral prompts measure what the model says, not the internal computation producing it. We use Transluce, an AI-driven mechanistic-interpretability platform, to examine LLaMA 3.1-8B-Instruct on 54 moral prompts in four batteries: 17 dilemmas, policy, and meta-ethical questions (B1); 6 role-playing scenarios (B3); and a controlled trolley contrast varying the switching mechanism with people fixed (B4, 15 prompts) or identity attributes with mechanism fixed (B5, 16 prompts). Two complementary metric families, five cluster-level metrics and a six-metric neuron-level panel, converge on a Situational Anchor Effect: domain-specific representations dominate the top of the activation list across every battery. The model’s ethics-labeled capacity stays essentially constant; its salience (rank, priority, top-of-list presence) is highly sensitive to the interpretive frame the prompt selects. The B4-vs-B5 contrast confirms the model attends to whichever surface feature varies: aggregate ethics metrics are indistinguishable, but the dominant non-ethics distractor mirrors the design. A multi-temperature audit identifies a candidate ethics neuron (L16/N3837) stable across temperatures; a cross-model behavioral proxy on two frontier models yields preliminary evidence of divergence in self-reported moral focus, consistent with an Alignment Wrapper in which RLHF re-orders surface text without removing underlying domain-first frames. We unify these as Frame-Conditioned Moral Computation: the prompt’s surface vocabulary selects a feature manifold, and the moral conclusion is downstream of that selection. Behavioral alignment must be supplemented by Mechanistic Alignment: a research program asking whether ethics-related features can be shown causally privileged under controlled frame variation, not merely loud in the explanation. 

[AI-161] Toward Vibe Medicine: A Self-Evolving Multi-Agent Framework for Clinical Decision Support

Link: https://arxiv.org/abs/2606.15504
Authors: Qianxue Zhang, Yiming Ren, Shihuan Qin, Xiao Zhang, Liao Zhang, Jinyang Huang, Zhengliang Liu, Chenbin Liu, Hongying Feng, Jingyuan Chen, Yuzhen Ding, Weihang You, Hanqi Jiang, Yi Pan, Yifan Zhou, Junhao Chen, Lifeng Chen, Wei Liu, Tianming Liu, Zengren Zhao, Lian Zhang
Categories: Artificial Intelligence (cs.AI)

Abstract:In recent years, advances in large language models and autonomous agents have revolutionized the healthcare field, facilitating diagnosis and improving treatment results. However, most existing AI systems rely on pre-trained knowledge and predefined pipelines, which struggle to learn dynamically from the interactive chat session history that contains patient outcomes and past failures. To address this limitation, we propose VIBEMed, a multi-agent framework with a built-in self-evolution mechanism and architecture-level safety sandbox for robust clinical decision support. The system integrates three specialized agents: a Clinical Diagnostic Agent (CDA) for hypothesis generation, a Therapeutic Execution Agent (TEA) for treatment planning, and a Clinical Evolution Manager Agent (CEMA) that distills longitudinal clinical feedback into reusable knowledge, transforming multimodal patient information into personalized medical decisions. Through its self-evolution mechanism, the framework enables iterative updates across memory, model behavior, and decision strategies, allowing the system to improve over time. Experimental results show that VIBEMed demonstrates superior performance through its evolving mechanism in complex clinical cases, particularly in tasks that require integrated decision-making and longitudinal planning. The framework also supports reliable end-to-end decisions in challenging scenarios such as oncology treatment planning, highlighting its feasibility in real-world clinical contexts. Overall, VIBEMed provides a practical path beyond static AI systems toward adaptive, experience-driven clinical decision support, demonstrating the value of combining multi-agent collaboration with continuous evolution for advancing precision medicine.

[AI-162] LLM4RTL: Tool-Assisted LLM for RTL Generation

Link: https://arxiv.org/abs/2606.15500
Authors: Jing Jin, Robert Chu, Ning Yan, Masood S. Mortazavi
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)

Abstract:Large language models (LLMs) have facilitated impressive progress in software engineering, code generation, tooling, and systems. Concurrently, a significant body of research has developed which explores a growing variety of methods and systems for applying LLMs to hardware and chip design (e.g., systems for RTL code generation based on functional description). However, when it comes to open Verilog/RTL code generation, we need high-quality training samples to build specialized and more effective LLM systems through fine-tuning or low-rank adaptation. Here, we propose a "judge-renew-check-renew-check" (JRCRC) pipeline which updates a current public dataset using a hierarchy of state-of-the-art commercial LLM models differing in their costs and capabilities in RTL code generation. This approach achieves a cost-effective mechanism for filtering and refining code-generation samples into a higher-quality training dataset. Our experiments also identify some common weaknesses of LLMs in rule-based reasoning and logic, and consequently, in RTL code generation. Having identified these weaknesses, we develop an architecture for incorporating pre-processing tools to dynamically assist the LLMs in inferring logical relationships from tabular data formats. With our tool-assisted architecture for RTL code generation, we achieve significant overall performance gains in the VerilogEval benchmark and outperform many state-of-the-art methods. Our LLM4RTL system achieves performance comparable to that of GPT-4O using a significantly smaller LLM.

[AI-163] Towards End-to-End Automation of AI Research

Link: https://arxiv.org/abs/2606.15497
Authors: Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Chris Lu, Shengran Hu, Jakob Foerster, David Ha, Jeff Clune
Categories: Artificial Intelligence (cs.AI)
Comments: Published in Nature 651, 914-919 (2026)

Abstract:The automation of science is a long-standing ambition in the field of AI. While the community has made significant progress in automating individual components of the scientific process, a system that autonomously navigates the entire research lifecycle – from conception to publication – has remained out of reach. Here, we present the strongest demonstration to date toward automating the entire process end-to-end. We present The AI Scientist, which creates research ideas, writes code, runs experiments, plots and analyzes data, writes the entire scientific manuscript and performs its own peer review. Its ideas, execution, and presentation are of sufficient quality to produce a manuscript generated by an AI system that passes the first round of peer review at a major machine learning conference workshop. The workshop has an acceptance rate of 70 percent. Our system leverages modern foundation models within a complex agentic system. We evaluate The AI Scientist in two settings: a focused mode using human-provided code templates as an initial scaffold to conduct research on a specific topic, and a template-free, open-ended mode that leverages agentic search for wider scientific exploration. Both settings produce diverse ideas and automatically test, report on, and evaluate them. This achievement demonstrates AI’s growing capacity for scientific contribution and signifies a potential paradigm shift in how research is conducted. As with any impactful new technology, there could be significant risks, including taxing overwhelmed review systems and adding noise to scientific literature. However, if developed responsibly, such autonomous systems could greatly accelerate scientific discovery.

[AI-164] Bayesian 3D Steerable CNNs: Enabling Equivariance and Uncertainty Quantification Simultaneously

Link: https://arxiv.org/abs/2606.15479
Authors: Abhishek Keripale, Ponkrshnan Thiagarajan, Susanta Ghosh
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Probability (math.PR)

Abstract:Steerable convolutional neural networks (Steerable-CNNs) guarantee SE(3)-equivariance by parameterizing kernels as linear combinations of steerable basis functions, but their deterministic nature precludes uncertainty quantification - limiting their use in settings where confidence estimates are essential. We propose a Bayesian Steerable-CNN that places posterior distributions over the basis coefficients, yielding stochastic kernels while preserving equivariance exactly. The loss function of the model is obtained via variational inference and minimized by Bayes-by-Backpropagation. The framework admits a decomposition of predictive uncertainty into epistemic and aleatoric components. Empirically, the model attains competitive classification accuracy alongside an expected calibration error of 0.0263 and outperforms its deterministic counterpart by up to 6.17% under distributional shift induced by additive Gaussian noise. Furthermore, we leverage the model’s uncertainty estimates to enhance its performance significantly, achieving a notable gain - approximately 4% higher accuracy across 84% of the test dataset. A statistically significant negative correlation between epistemic uncertainty and prediction error confirms that the learned posterior variance is semantically meaningful. The framework unifies Bayesian uncertainty quantification with the inductive bias of equivariant CNNs.
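
For readers unfamiliar with the expected calibration error (ECE) quoted above, here is a minimal sketch of one common way to compute it. The abstract does not say how the authors bin predictions, so the equal-width bins and the default of 10 bins are assumptions, and the function name is illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Toy example: four predictions at 90% confidence, three of them correct.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

In the toy example the model claims 90% confidence but is right 75% of the time, so the gap of about 0.15 is the ECE.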

[AI-165] Who Drifted: the System or the Judge? Anytime-Valid Attribution in LLM Evaluation Pipelines

Link: https://arxiv.org/abs/2606.15474
Authors: Yitao Li
Categories: Artificial Intelligence (cs.AI); Applications (stat.AP)

Abstract:Continuous evaluation of LLM products relies on a strong LLM judge treated as ground truth: a cheap monitor scores every interaction and a team is paged when the score drifts down. But the judge is itself a model behind an API, and a silent version bump or scoring-prompt update changes how it scores – so every drift alarm is ambiguous between a worse product and a changed judge. We resolve the ambiguity with a fixed, human-labeled anchor set that the current judge re-scores at a steady interleave, a second betting e-process on the judge-versus-human gap, and a guard-window rule returning a verdict in {none, system, judge}. We prove anytime-validity, one-way identification (only the judge can move the anchors), an attribution race whose design law is that the anchors must out-run the main process they guard, and process orthogonality. On two real judge changes, a silent version bump is detected as judge drift in 60/60 runs with zero judge-to-system misattribution, and a contaminating strict-prompt change is correctly attributed on 110 of 120 runs at guard width 300 – while the industry-default rolling z-test false-alarms on 75% of drift-free streams. Every experiment replicates on a second domain (TL;DR summarization) with nothing re-tuned, and where the domains differ the differences are the ones the race predicts: the strict-prompt change shifts scores harder there, so the anchors fire faster and attribution becomes perfect (240/240). The monitor runs at approximately 0.64 of the cost of strong-judging every item, or 0.21 in a cheaper-but-deafer regime.
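
A betting e-process can be sketched in a few lines. The version below is a generic textbook construction, not the paper's monitor. It assumes the judge-versus-human gap is scaled to [0, 1], tests the null hypothesis that its mean is at most `null_mean`, and uses a fixed bet size for simplicity. All parameter names and default values are illustrative assumptions.

```python
def betting_e_process(gaps, null_mean=0.25, bet=2.0, alpha=0.05):
    """Return the 1-based step at which wealth first reaches 1/alpha, or None.

    gaps: values in [0, 1]; null hypothesis: their mean is at most null_mean.
    """
    if not 0.0 <= bet <= 1.0 / null_mean:
        raise ValueError("bet must lie in [0, 1/null_mean] to keep wealth non-negative")
    wealth = 1.0
    for step, gap in enumerate(gaps, start=1):
        # Each factor has expectation at most 1 under the null, so the
        # running product is a non-negative supermartingale.
        wealth *= 1.0 + bet * (gap - null_mean)
        if wealth >= 1.0 / alpha:
            return step  # by Ville's inequality, false-alarm probability <= alpha
    return None


print(betting_e_process([0.25] * 100))  # gaps sit at the null mean: no alarm
print(betting_e_process([0.5] * 20))    # gaps well above the null mean: alarm
```

Because the alarm rule is valid at every step, the stream can be checked continuously without a fixed sample size, which is what "anytime-valid" refers to.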

[AI-166] Understanding Diversity Collapse in RLVR via the Lens of Overtraining

Link: https://arxiv.org/abs/2606.15455
Authors: Suqin Yuan, Jinkun Chen, Jiyang Zheng, Muyang Li, Lei Feng, Dadong Wang, Tao Xiang, Tongliang Liu, Bo An
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Reinforcement learning with verifiable rewards (RLVR) has become a key approach for enhancing the reasoning abilities of large language models. However, RLVR often suffers from diversity collapse: Pass@1 improves while high-k Pass@k degrades, which is viewed as a narrowing of the model’s reasoning boundary. We formalize this diversity collapse through the lens of overtraining: once a problem’s contribution to the reference metric has effectively saturated, further updates no longer expand what the model can solve but still concentrate probability mass on the trajectories favored by on-policy sampling. Under a standard setup with few rollouts per problem, even a single observed success places a problem in a nearly saturated regime for high-k Pass@k, so most updates in standard RLVR are overtraining from the boundary perspective. This perspective also suggests a reading of whether RLVR can expand the model’s reasoning abilities beyond the base model: since RLVR is structurally biased against high-k Pass@k, its aggregate decline does not by itself mean that no new reasoning gains occurred. Interventionally, restricting updates to problems with zero observed success lifts Pass@256 above the base model on difficult benchmarks; observationally, a non-trivial fraction of initially unsolvable problems become solvable during standard RLVR training. Building on these findings, we propose Bayesian Boundary Gating (BBG), which redirects optimization away from overtraining by estimating each problem’s marginal contribution to the reasoning boundary. Across multiple reasoning benchmarks, BBG improves average Pass@k across a wide range of k.
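
The saturation claim can be checked with simple arithmetic. The sketch below assumes independent samples with per-sample success probability p, so that Pass@k = 1 - (1 - p)^k. The choice of 8 rollouts per problem is an illustrative assumption, not a value taken from the paper.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent samples succeeds."""
    return 1.0 - (1.0 - p) ** k


# Assumption: 8 rollouts per problem with exactly one success, so p is estimated as 1/8.
p_hat = 1 / 8
for k in (1, 8, 64, 256):
    print(f"Pass@{k} = {pass_at_k(p_hat, k):.6f}")
```

With p estimated at 1/8, Pass@1 is 0.125 while Pass@256 is already indistinguishable from 1 at six decimal places. Raising p further on that problem can still improve Pass@1 but has essentially nothing left to add to high-k Pass@k.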

[AI-167] Hierarchical Modeling of ICD Codes in EHR Foundation Models

Link: https://arxiv.org/abs/2606.15447
Authors: Megha Thukral, Dong Gyun Kang, Rudra Pratap Singh, Shruthi Kashinath Hiremath, Katrin Hänsel, Thomas Plötz
Categories: Artificial Intelligence (cs.AI)

Abstract:Electronic health record foundation models typically treat ICD diagnosis codes as flat tokens, overlooking the clinically meaningful hierarchical structure that captures disease families, subcategories, and fine-grained diagnostic detail. As a result, existing EHR representation learning methods do not explicitly exploit the hierarchical structure already present in the coding system. In this work, we study ICD-10-CM hierarchy as a general inductive bias for clinical representation learning. We investigate two complementary mechanisms for incorporating hierarchy: first, by augmenting diagnosis sequences in a BERT-style transformer with tokens corresponding to different levels of the ICD hierarchy, and second, by injecting hierarchy into graph-based code representations through hierarchy-aware edges combined with diagnosis co-occurrence structure. Across these settings, we evaluate whether explicit hierarchy improves downstream prediction, which levels of the hierarchy are most useful, whether hierarchy encoding improves transfer across datasets, and how hierarchy reshapes embedding similarity structure. We conduct experiments on two large-scale real-world clinical datasets: MIMIC-IV, used for pretraining and in-domain evaluation, and eICU, used to assess cross-dataset transfer via frozen encoder probing. Our findings show that explicitly encoding ICD hierarchy improves over flat code representations in both in-domain and cross-dataset settings, while revealing that the most useful level of hierarchy depends on both the task and the modeling approach. More broadly, we focus on hierarchy-aware EHR representation learning and show that the benefits of encoding hierarchy are generalizable across modeling settings and hierarchy levels.
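
The first mechanism, adding hierarchy-level tokens to a diagnosis sequence, can be sketched as follows. This is an illustration rather than the authors' tokenizer. It relies on the fact that the first three characters of an ICD-10-CM code form its category, and it uses the leading letter as a rough stand-in for a coarser level, which is an assumption since real ICD chapters are code ranges. The `L1:`/`L2:`/`L3:` prefixes and function names are hypothetical.

```python
def hierarchy_tokens(code: str) -> list[str]:
    """Expand an ICD-10-CM code into coarse-to-fine tokens.

    Assumed levels: first letter, 3-character category, full code.
    """
    normalized = code.strip().upper()
    category = normalized.replace(".", "")[:3]
    tokens = [f"L1:{normalized[0]}", f"L2:{category}"]
    if normalized != category:
        tokens.append(f"L3:{normalized}")  # only add when finer than the category
    return tokens


def augment_sequence(codes: list[str]) -> list[str]:
    """Interleave hierarchy tokens into a visit's diagnosis sequence."""
    return [tok for code in codes for tok in hierarchy_tokens(code)]


print(augment_sequence(["E11.65", "I10"]))
```

Two codes from the same category, such as E11.65 and E11.9, then share their `L1` and `L2` tokens, which gives the model an explicit signal that they are related.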

[AI-168] Defending against Adaptive Prompt Injection Attacks via Reasoning-enabled Task Alignment

Link: https://arxiv.org/abs/2606.15441
Authors: Lipeng He, Yihan Wang, Jiawen Zhang, N. Asokan
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:Indirect prompt injection attacks hijack LLM-based agents by embedding malicious instructions in third-party data that the agent retrieves during task execution. Existing defenses report near-zero attack success rate on static benchmarks, yet recent adaptive evaluations show that these results collapse once the attacker is allowed to optimize against the deployed defense. In this work, we trace this collapse to two failure modes. First, existing defense methods are confined to recognizing specific attack patterns, rather than assessing whether the intent of every embedded instruction is relevant to the user task. Second, training-based defenses, which otherwise offer the strongest safety-utility trade-off, assemble their adversarial examples from a handful of hand-crafted templates, and the resulting defender fails to generalize outside that narrow strategy distribution. To address these gaps, we propose RETA, a training-based method that grounds defense decisions on the user tasks rather than attacker-controlled data. At each tool-output step, the defender undertakes chain-of-thought reasoning verifying that its actions are consistent with the user task. Leveraging red-teaming, a simulated attacker synthesizes adversarial training data and receives a dictionary-learning diversity reward, achieving broad coverage of injection-reformulation strategies. Together, these allow the defender to be optimized via multi-objective reinforcement learning and achieve better safety-utility trade-off. Across six black-box adaptive attacks, RETA keeps every per-attack ASR below 10%, with average ASR of 2.92% and 3.75% on the two target models, while preserving most utility under attack and on clean inputs.

[AI-169] Beyond Classification: A Cough Regression Benchmark for Respiratory Acoustic Foundation Models

Link: https://arxiv.org/abs/2606.15436
Authors: Mayur Sanap, Prasanna Desikan, Edgar Lobaton
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted at the ICML 2026 Workshop on Structured Data for Health

Abstract:Respiratory acoustic foundation models (FMs) excel at cough classification, yet their ability to predict continuous health quantities from cough audio remains largely unexplored, despite the clinical value of passive age, BMI, and disease probability estimation in settings where physical measurements are unavailable. We introduce the multi-model, multi-target cough regression benchmark evaluating five FMs (OPERA-CT, OPERA-CE, OPERA-GT, HeAR, M2D+Resp) across six targets on three datasets under subject-disjoint protocols, comparing linear, MLP-small, and full MLP regression heads. MLP-small beats the mean-predictor baseline on all tasks and linear probing in 23 of 30 model × task cases, with full MLP overfitting on small clinical data but recovering on larger sets, revealing a dataset size × head-capacity trade-off. HeAR leads within-dataset age regression on Coswara (9.12 yr MAE); its CIDRZ result is excluded from headline claims owing to possible HeAR-CIDRZ pretraining overlap. OPERA-GT is favored over OPERA-CT on age in all three datasets, with the CIDRZ margin within seed variance, extending a generative-pretraining advantage from breath to cough. HeAR and M2D+Resp reach near-full performance at N = 50 samples while OPERA models require N = 400. Cross-dataset transfer is strongly asymmetric as large diverse data generalises to small clinical populations (CoughVID to CIDRZ: -0.17 yr) but not vice versa (CIDRZ to Coswara: +2.43 yr, +26.6%).

[AI-170] Constitutional Value Potentials: reading and steering internal priority margins in language models

Link: https://arxiv.org/abs/2606.15420
Authors: Tong Che, Rui Wu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:A constitution tells a language model what to value, but little tells us whether it does. Adherence is judged from outputs, and output evidence is most fragile on value conflicts, where what matters is not which value a model mentions but which one it is willing to sacrifice. We provide evidence that this arbitration can be read from activations in a structured margin readout. We introduce Constitutional Value Potentials (CVP). For each value we learn a scalar potential from the hidden state: an internal pressure to preserve that value, supervised not by the prompt but by an independent judge’s verdict on which value the model’s own response actually preserved. The signed difference of two potentials is a priority margin. A constitutional clause becomes the claim that a margin stays positive, and a single monitor score flags when it does not. The monitor predicts conflict violations with AUROC up to 0.95, beats a strong hidden-state probe, and generalizes to held-out synthetic conflicts across three Qwen2.5 scales. The signal appears as the answer begins, from the prompt tail and first response token. Read this early, the same signal reveals whether an adversarial priority hack has actually pushed the model toward a violation, rather than only whether the prompt looks adversarial. The same directions also support intervention tests: under selected steering settings, moving along a value direction shifts judged trade-offs in the intended direction. Together, these results suggest that some constitution-relevant priorities are accessible as activation-space margins, rather than only as output behavior.
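
The margin readout described above can be sketched in a few lines. This is a toy under stated assumptions: the abstract only says that each potential is a scalar learned from the hidden state, so the linear readout, the weight vectors, and the names `potential`, `priority_margin`, and `clause_violated` are all hypothetical.

```python
def potential(hidden: list[float], weights: list[float], bias: float = 0.0) -> float:
    """Scalar potential as a linear readout of the hidden state (an assumed form)."""
    return sum(h * w for h, w in zip(hidden, weights)) + bias


def priority_margin(hidden: list[float], w_a: list[float], w_b: list[float]) -> float:
    """Signed difference of two potentials; positive means value A outranks value B."""
    return potential(hidden, w_a) - potential(hidden, w_b)


def clause_violated(hidden: list[float], w_a: list[float], w_b: list[float]) -> bool:
    """Flag the clause 'A takes priority over B' when the margin is not positive."""
    return priority_margin(hidden, w_a, w_b) <= 0.0


hidden = [0.2, -1.0, 0.5]     # toy hidden state
w_safety = [1.0, 0.0, 2.0]    # toy readout for value A
w_helpful = [0.0, -1.0, 0.0]  # toy readout for value B
print(clause_violated(hidden, w_safety, w_helpful))  # False: the margin is positive here
```

Swapping the two readouts flips the sign of the margin, which is what lets a single score act as a monitor for a given clause.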

[AI-171] Reward Hacking in Language Model Agents: Revisiting AI Safety Gridworlds

Link: https://arxiv.org/abs/2606.15385
Authors: Ömer Veysel Çağatan, Xuandong Zhao
Categories: Artificial Intelligence (cs.AI)
Comments: 28 pages, 16 figures, 13 tables

Abstract:Reward hacking, where AI systems exploit misspecified objectives to achieve high reward without satisfying intended goals, remains a central challenge in AI safety. Yet most known instances have been discovered post hoc in frontier systems where controlled study is impractical. We adapt the AI Safety Gridworlds framework into a text-based evaluation suite that reformulates classic reinforcement learning safety tasks for language-based agents. Across frontier and mid-scale models, we find that specification gaming emerges zero-shot: models systematically achieve high observed reward while underperforming on hidden safety objectives, and even apparently safe behaviors can reflect misunderstanding rather than principled safety. Reinforcement learning does not correct these failures: direct reward optimization widens the gap between observed and hidden reward, as the model’s initial competence causes it to lock into locally rewarding strategies before discovering safer alternatives. This pattern persists across model scales (1.5B–14B) and is not resolved by finer credit assignment, exploration prompts, or entropy regularization. Our results show that reward hacking arises naturally when optimizing proxy objectives with capable language model agents and resists standard mitigations, suggesting that proxy-reward failures in agentic settings may require approaches beyond standard exploration and credit-assignment fixes. To facilitate reproducibility, the code for this work is available at our public repository (this https URL).

[AI-172] Learning Earthquake Wave Arrival Time Picking from Labels with Inaccuracies

Link: https://arxiv.org/abs/2606.15377
Authors: Sen Li, Xu Yang, S. Mostafa Mousavi, Anye Cao, Keting Fan, Yaoqi Liu, Changbin Wang, Qiang Niu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Geophysics (physics.geo-ph)
Comments: 28 pages, 10 figures

Abstract:Inaccurately labeled training data, or “label noise”, poses a significant threat to the integrity of supervised machine learning models. This corruption directly degrades performance by teaching the model erroneous mappings between features and labels, which leads to poor generalization and reduced accuracy on properly labeled validation and test data. Current seismological applications mainly rely on large-scale training sets or data augmentation to reduce the label-noise impact, which can be labor-intensive and costly. Here, we introduce a Label Noise-Contrastive Robust Learning (LaNCoR) approach that can effectively handle noisy labels in seismic signal processing tasks, without requiring large-scale training datasets. In this approach, the input waveform feature and label representation distributions are aligned in the feature space to correct mislabeling and reduce its impact on the training process. We present LaNCoR’s performance on the task of P-phase arrival-time picking of real microseismic data using two baseline models and training approaches. Our results indicate that LaNCoR can improve performance by up to 28.8% across performance metrics. This approach holds great promise for model training in seismology and geosciences.

[AI-173] APEX: Adaptive Principle EXtraction - A Three-Layer Self-Evolution Framework for Production AI Agents

Link: https://arxiv.org/abs/2606.15363
Authors: Ya-Chuan Chen, Tien-Jen Lai, Hsiang-Wei Hu
Categories: Artificial Intelligence (cs.AI)
Comments: 8 pages, 1 figure, 4 tables. Evaluated on a production 15-node compute fleet with 114 real task traces. Code available at this https URL

Abstract:Self-improvement in AI agents has emerged as a key research frontier: systems that modify their own prompts, workflows, and decision rules based on accumulated operational experience. The state-of-the-art Self-Harness framework [1] achieves 14–21% improvement on Terminal-Bench-2.0 by mining failure clusters and patching the agent harness. However, Self-Harness optimises only one dimension – the prompt harness – leaving behavioural principles and workflow topology unchanged. We propose APEX (Adaptive Principle EXtraction), a three-layer co-evolution framework that simultaneously evolves: (L1) the harness via failure-mode patching, (L2) behavioural principles via success-trace distillation [2], and (L3) the agent workflow topology via structural fitness-based selection [6]. We implement APEX on Joe [13], a production-grade super AI Agent built on NVIDIA Nemotron and designed as an Edge AI Agent Factory for the NVIDIA Agent Challenge 2026, managing a 15-node compute fleet using 114 real task traces collected over 18 days. APEX achieves an APEX Health Score of 0.570 (+90% vs. baseline 0.300) in a single evolutionary run, distilling 6 novel reusable principles and selecting a research-first workflow topology scoring 0.900 (+20%). Our results demonstrate that multi-dimensional co-evolution substantially outperforms single-axis harness optimisation, at a cost of only 4 LLM calls (~270 s) on a local qwen2.5-coder:32b instance.

[AI-174] LearnOpt: Recovering the Latent Cognitive Structure of Standardized Examinations via Knowledge Graphs and Constrained Optimization

Link: https://arxiv.org/abs/2606.15349
Authors: Joy Bose, Om Thomas
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 26 pages, 2 figures, 6 tables. Code, data, and calibration tooling: this https URL. Datasets on HuggingFace: joyboseroy/neet-skill-tags-2016-2024, joyboseroy/jee-advanced-skill-tags-2016-2023

Abstract:Standardized examinations are typically treated as uniform syllabus coverage problems. We argue they are better understood as adversarial systems with stable latent cognitive structures diverging systematically from official syllabi. We introduce LearnOpt, which recovers this structure from historical question papers and generates personalized, time-bounded study plans. Applied to nine years of NEET questions (2016-2024, n=1,496), LearnOpt builds an exam knowledge graph from LLM-tagged questions, extracts a five-category latent skill distribution, and formulates study planning as a knapsack-variant optimization over prerequisite-aware subgraphs with Bayesian Knowledge Tracing. Central finding: NEET’s latent skill distribution is stable within a syllabus regime (consecutive-year KL divergence 0.004-0.032 for 2016-2021, non-significant under permutation testing) but shifts significantly with NCERT’s 2023 syllabus rationalization: pooling 2016-2021 (n=1,072) vs 2023-2024 (n=392) gives KL=0.040 (p=0.0005), with Elimination/Negation questions rising from ~20-29% to ~31-35%. Latent structure, while not permanently stationary, is piecewise stable, with shifts detectable and attributable to curricular events. Within either regime, subject predicts skill profile more strongly than year. An optimization evaluation, using one real and two synthetic mastery profiles, shows the skill-weighted objective produces a modest but real reordering of recommended topics over a mastery-conditioned frequency baseline. Applying the pipeline to JEE Advanced reveals a profile dominated by Multi-concept Integration (80.9% vs. 33.3% for NEET), with a JEE-vs-NEET divergence (KL=0.505) exceeding NEET’s largest cross-subject divergence: exam tier shapes latent cognitive structure more than subject, which shapes it more than time within a regime. Code, knowledge graph, and annotated dataset are released publicly.
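
The stability comparisons above rest on the KL divergence between categorical skill distributions. A minimal sketch of that computation follows; the five shares are made-up numbers rather than the paper's data, and the direction of the divergence, the natural-log base, and the `eps` guard are assumptions, since the abstract does not specify them.

```python
import math


def kl_divergence(p: list[float], q: list[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats for two categorical distributions over the same categories."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of categories")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            # Terms with pi == 0 contribute nothing; eps guards against qi == 0.
            total += pi * math.log(pi / max(qi, eps))
    return total


# Toy five-category skill shares for two periods (illustrative only).
earlier = [0.30, 0.25, 0.20, 0.15, 0.10]
later = [0.25, 0.20, 0.30, 0.15, 0.10]
print(round(kl_divergence(later, earlier), 4))
```

A permutation test like the one mentioned in the abstract would recompute this statistic after shuffling period labels across questions, then compare the observed value against that null distribution.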

[AI-175] ChatPlanner: A Large Language Model Framework for Personalized Public Transit Routing

Link: https://arxiv.org/abs/2606.15315
Authors: Tingting Yang, Chenhao Xue, Jun Chen
Categories: Artificial Intelligence (cs.AI)
Comments: Under Review at Transportation Research Part C

Abstract:Personalized routing in public transit systems remains challenging due to the difficulty of capturing and integrating diverse user preferences into routing algorithms. This paper presents ChatPlanner, a novel framework that leverages Large Language Models (LLMs) to enable preference-aware public transit routing. Our approach employs fine-tuned LLMs with Retrieval-Augmented Generation (RAG) to extract routing parameters and interpret nuanced user preferences from natural language queries, subsequently integrating these preferences into the objective function of a public transit routing algorithm. This study designs preference-aware datasets incorporating eight personas and five contexts to establish scoring standards for both fine-tuning and RAG. This work conducted three experiments to validate the solutions’ feasibility, extraction of routing information and preferences, and solution set quality and completeness. Results demonstrate that ChatPlanner generates feasible solutions reliably. Fine-tuning enforces the required output structure and learns general preference patterns, while RAG provides query-specific context to resolve imprecise or conversational expressions and calibrate continuous scores. The combination of both achieves the highest accuracy in routing information extraction and user preference interpretation. Results based on selected case studies show that by capturing user preferences, ChatPlanner identifies valuable solutions across different dimensions that existing route planners overlook, generating more valuable route alternatives. This research establishes a new paradigm for integrating natural language understanding into transportation optimization.

[AI-176] LLMs on Tabular Data with Limited Semantics: Evidence from Industrial Car Retrofit Prediction

Link: https://arxiv.org/abs/2606.15314
Authors: Aina Vila Pons, Ioannis Tzachristas, Constantinos Antoniou
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:Industrial retrofit planning depends on structured operational data rather than free text: planners must estimate whether a newly registered prototype will require a retrofit, which retrofit package it will need, and how long the work will take. We study an industrial dataset linking a prototype-registration system (284,271 vehicles) with a retrofit-management system (48,716 cleaned visits), and compare strong tabular machine learning baselines with three LLM-based strategies on row-serialized inputs: embedding features (Amazon Titan), direct prompted classification (Claude Sonnet 4), and an ML+LLM stacking approach. Across binary occurrence prediction, 15-way retrofit-type classification, per-visit duration regression, and an aggregated monthly benchmark, classical tree ensembles remain the strongest standalone models. However, the LLM results reveal a consistent pattern: embeddings remain useful on tables (binary AUC = 0.982), direct prompting collapses once semantic signal is stripped by hashing (binary AUC = 0.500; multiclass weighted F1 = 0.018), and hybrid stacking yields the best manually built multiclass model (weighted F1 = 0.626). On the monthly benchmark, lag-based machine learning outperforms time-series foundation models, though Chronos-small remains competitive in zero-shot forecasting. The results suggest that on privacy-constrained industrial tables, LLMs are more effective as complementary components than as replacements for strong tabular baselines.

[AI-177] Forced Deferral: Manipulating Routing Decisions in Multimodal LLM Cascades

Link: https://arxiv.org/abs/2606.15308
Authors: Zhongye Liu, Yaopei Zeng, Yurui Chang, Lu Lin
Categories: Artificial Intelligence (cs.AI)

Abstract:While multimodal large language models (MLLMs) have shown strong visual reasoning abilities, serving a large model for every query is computationally expensive. MLLM cascades mitigate this cost by first querying a weak but cheaper model and deferring to a strong model when the weak model’s output is unconfident. However, since the weak model’s confidence directly controls compute allocation, these systems expose a new attack surface: an adversary can manipulate confidence so that their queries are consistently deferred to the strong model. Motivated by this vulnerability, we introduce the Forced Deferral Attack (FDA), an adversarial image attack that lowers the weak model’s confidence and causes cascades to route queries to the strong model. FDA learns a universal border trigger by optimizing a temperature-flattened objective. This objective pushes the weak model’s token distribution on triggered inputs toward less concentrated targets constructed from its clean responses. Across datasets, model families, and deferral metrics, FDA consistently increases strong-model routing while outperforming image-perturbation and prompt-injection baselines. These results show that MLLM cascades are vulnerable to attacks that manipulate compute allocation, forcing unintended strong-model usage without directly targeting answer correctness.
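
To make the attack surface concrete, the sketch below shows a confidence-gated cascade and how temperature flattens a token distribution. It is an illustration under assumptions: the abstract mentions several deferral metrics without naming them, so the max-probability confidence, the 0.7 threshold, and the function names `softmax` and `route` are hypothetical, and the code does not implement the attack's trigger optimization.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Numerically stable softmax; a higher temperature gives a flatter distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def route(weak_logits: list[float], threshold: float = 0.7) -> str:
    """Defer to the strong model when the weak model's top probability is below threshold."""
    confidence = max(softmax(weak_logits))
    return "weak" if confidence >= threshold else "strong"


clean_logits = [4.0, 1.0, 0.5]
print(route(clean_logits))          # weak: the top probability is above the threshold
print(route([1.0, 0.9, 0.8]))       # strong: a flat distribution triggers deferral

# A temperature-flattened version of the clean distribution is less concentrated,
# which is the kind of target the abstract describes pushing triggered inputs toward.
flattened = softmax(clean_logits, temperature=5.0)
print(max(flattened) < max(softmax(clean_logits)))  # True
```

Because the routing decision depends only on the weak model's confidence, anything that lowers that confidence raises strong-model usage, regardless of whether the final answer changes.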

[AI-178] LatentGym: A Testbed For Cross-Task Experiential Learning With Controllable Latent Structure

Link: https://arxiv.org/abs/2606.15306
Authors: Daksh Mittal, Tommaso Castellani, Thomson Yen, Naimeng Ye, Fangyu Wu, Minghui Chen, Tiffany Cai, Emmanouil Koukoumidis, William Zeng, Hongseok Namkoong
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 61 pages

Abstract:We envision continually learning agentic systems that become more useful over time: as they encounter sequences of related tasks, they should infer the hidden structure shared across those tasks and use it to improve future decisions. This cross-task experiential learning capability is pivotal in domains such as personalization and interactive assistance, but existing training/evaluation frameworks do not provide shared, controllable latent structures and cannot measure whether or why agents improve. We introduce LatentGym: a controllable suite in which each environment is organized around a ground-truth latent variable governing the structure across tasks. Our construction yields metrics that separate exploration (whether the agent’s actions gather information about the latent) from exploitation (whether the agent uses what it has gathered). We demonstrate our suite on empirical studies addressing three questions: how and why frontier models fail to adapt across related tasks; whether post-training on related task sequences improves general cross-task adaptation, and where those gains come from; and how design choices such as inter-task feedback shape training dynamics and generalization. Together, these results establish a controlled foundation for studying how LLM agents learn from experience across tasks, and for designing agents that adapt more reliably in sequential, personalized, and interactive settings.

[AI-179] Discovering Lattice Reduction Strategies via Self-Play

Link: https://arxiv.org/abs/2606.15301
Authors: Mohamed Malhou, Kristin Lauter, Ludovic Perret
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The Lenstra-Lenstra-Lovász (LLL) algorithm is a seminal contribution to computer science used for lattice basis reduction, yet its polynomial-time outputs produce bases that are far from optimal as the dimension grows. We show that deep reinforcement learning can discover strictly superior, generalizable reduction strategies by interacting with the primitive action space of LLL. We formulate lattice reduction as a single-player Markov Decision Process (MDP) and train a deep residual network using an AlphaZero-style self-play pipeline augmented with adaptive-horizon MCTS (Monte Carlo Tree Search), which couples multi-step network predictions with an entropy-gated expansion mechanism. The resulting policy, DeltaStar, is trained exclusively on small 8-dimensional q-ary lattices and requires fewer primitive row operations than LLL. Crucially, it generalizes zero-shot to unseen moduli and higher dimensions up to n=32 without retraining.

[AI-180] A Formal Framework for Declarative Agentic AI in Business Process Analysis

Link: https://arxiv.org/abs/2606.15291
Authors: Mohammad Azarijafari, Luisa Mich, Michele Missikoff
Categories: Artificial Intelligence (cs.AI)

Abstract:Agentic AI opens new opportunities for automating Business Process (BP), enabling autonomous decision-making and dynamic adaptation. However, realising this potential requires BP entities and their interactions to be defined with formal precision. This paper presents a formal framework for Agentic BP analysis through the AGO methodology. AGO captures the modelling perspective in terms of who is acting (Agents), why it is carried out (Goals), and what the relevant entities are (Objects). Grounded in set theory and mathematical logic, we formally define the AGO entity types and their interactions, organising all definitions into a BP Knowledge Base (BPKB). The resulting BPKB supports structured querying, incremental updates, and automatic generation of BP workflows, while ensuring soundness and completeness of the derived paths.

[AI-181] Hybrid NARX-LLM for Greenland Iceberg Discharge: Prompt-Driven Residual Correction

Link: https://arxiv.org/abs/2606.15288
Authors: Yiquan Gao, Duohui Xu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)

Abstract:Greenland iceberg discharge exhibits complex nonlinear dynamics with limited observability, challenging traditional predictive models. We present a Hybrid NARX-LLM framework that combines a nonlinear autoregressive model with exogenous inputs (NARX) and a large language model (LLM) for residual correction. We further propose a Physics-Informed Prompt (PIP) method that transforms unstructured physical knowledge into structured prompts for zero-shot in-context reasoning. The primary objective is to explore the corrective potential of this framework for modeling Greenland iceberg discharge, rather than merely optimizing predictive accuracy. The NARX component captures intrinsic temporal dependencies, while the LLM, guided by PIP, encodes glacier dynamics and environmental drivers and perceives key trend patterns to correct systematic prediction errors. This integration allows the model to reason about unmodeled factors and produce interpretable residuals, enhancing overall predictive accuracy. Applied to Greenland iceberg discharge time series, our approach addresses extreme events that are difficult to predict due to rare variations and nonstationary trends, a limitation often overlooked by traditional methods. By fusing structured time-series modeling with knowledge-driven foundation AI, the framework offers a scalable and interpretable pathway to bridge data-limited climate forecasting with physics-informed LLM reasoning. The code is available.

[AI-182] RECTOR: Masked Region-Channel-Temporal Modeling for Affective and Cognitive Representation Learning

Link: https://arxiv.org/abs/2606.15278
Authors: Jinhan Liu, Mahsa Shoaran
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Affective and cognitive disorders manifest as distributed, time-varying brain network dynamics across regions, channels, and time, challenging robust representation learning from EEG/sEEG for clinical diagnosis. We propose RECTOR (Masked Region-Channel-Temporal Modeling), an end-to-end self-supervised framework that unifies joint region-channel-temporal representation learning beyond fixed anatomical priors. At its core, RECTOR-SA is a hierarchical, block-sparse self-attention induced by Adaptive Functional Partitioning that evolves region structures from static anatomical definitions to adaptive functional regions. The self-supervision is driven by Masked Topology and Representation Learning, which jointly optimizes three complementary objectives: Masked Predictive Modeling, Topological Structure Modeling, and Cross-View Consistency. Across diverse benchmarks, RECTOR sets a new state-of-the-art in EEG emotion recognition and sEEG task-engagement classification. Crucially, its strong robustness to missing channels and cross-montage generalization underscores its potential for large-scale pre-training on heterogeneous EEG/sEEG, providing interpretable insights at both region and channel levels.

[AI-183] Feature Attribution in Directed Acyclic Graphs Using Edge Intervention

Link: https://arxiv.org/abs/2606.15273
Authors: Qiheng Sun, Junxu Liu, Xiaokai Mao, Haocheng Xia, Jinfei Liu, Kui Ren, Haibo Hu
Categories: Artificial Intelligence (cs.AI)

Abstract:Shapley value-based feature attribution methods face challenges in scenarios involving complex feature interactions and causal relationships, even when a causal structure is provided. Existing methods typically adopt a node-centric view, attributing importance solely to individual features. Consequently, they often fail to simultaneously capture the externality and exogenous influence of features, leading to unreasonable interpretations. To overcome these limitations, we propose a novel feature attribution method called DAG-SHAP, which is based on edge intervention. DAG-SHAP treats each feature edge as an individual attribution object, ensuring that both externality and exogenous contributions of features are appropriately captured. Additionally, we introduce an approximation method for efficiently computing DAG-SHAP. Extensive experiments on both real and synthetic datasets validate the effectiveness of DAG-SHAP. Our code is available at this https URL.

[AI-184] Trust-Region Diffusion Policies for Massively Parallel On-Policy RL

Link: https://arxiv.org/abs/2606.15260
Authors: Huy Le, Onur Celik, Denis Blessing, Tai Hoang, Claas A Voelcker, Axel Brunnbauer, Felix Richter, Michael Volpp, Gerhard Neumann
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Reinforcement learning with massively parallel simulations has become a standard framework for developing robust, deployable policies; however, most existing approaches still rely on simple Gaussian policy parameterizations. Diffusion models provide a more expressive policy class and have shown strong performance on challenging control problems, yet most diffusion-based RL methods are designed for offline or off-policy training. In this work, we ask whether diffusion policies can be trained effectively in the massively parallel, on-policy regime. To this end, we introduce Trust-region Diffusion Policies (TruDi), which enables diffusion policies for on-policy RL with massively parallel simulations. This setting is particularly challenging because the data distribution changes quickly across updates, making stable training with complex policies difficult. TruDi addresses this by integrating a trust-region optimization rule to enforce a KL-divergence constraint over the entire diffusion trajectory. Empirically, we evaluate TruDi on a diverse set of 4 massively parallel RL benchmarks comprising a total of 73 tasks. Across these tasks, TruDi consistently outperforms or is on-par with strong baselines on standard tasks and achieves clear gains on more challenging humanoid control tasks, establishing a strong new baseline for massively parallel on-policy RL.

[AI-185] Mask-Proof: An LLM-based Automated Data Curation Pipeline on Mathematical Proofs

Link: https://arxiv.org/abs/2606.15258
Authors: Jierui Zhang, Siyuan Tan, Xinhang Li, Longzhuangzhi Lin, Dailin Li, Chengfeng Gu, Xinping Li, Yaxian Hao, Shengjia Liang, Yuxiang Ren, Wenhao Liu
Categories: Artificial Intelligence (cs.AI)

Abstract:Large language models (LLMs) are increasingly capable of mathematical problem solving and can even assist with research-level proofs, yet we still lack a scalable and reproducible way to measure step-level reasoning in long proofs across diverse sources. This evaluation gap limits trustworthy AI assistance in proof-certified scientific progress. Existing evaluations often emphasize final answers or rely on costly expert grading, while end-to-end proof generation remains open-ended and hard to verify automatically. We introduce Mask-Proof, a pipeline that turns real proofs into automatically checkable masked-step tasks. It masks key formula steps, provides the necessary surrounding context, and evaluates model reconstructions with an LLM-based equivalence judge using repeated votes for stability. The resulting Mask-ProofBench contains 292 curated problems across diverse research areas. Experiments with 17 models show that reasoning-enhanced models outperform standard models by 12% to 27%. Our evaluator achieves 96.8% agreement with expert annotators, enabling faithful, reproducible, and comparable measurement of step-level mathematical reasoning. Benchmark, annotations, and code are available at this https URL.
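
The "repeated votes for stability" step can be pictured as simple vote aggregation over several judge calls. The sketch below is an assumption-laden toy: the abstract does not say how votes are combined, so majority voting, the odd vote count, and the names `majority_verdict` and `toy_judge` are hypothetical, and the whitespace-insensitive string comparison merely stands in for an LLM-based equivalence judge.

```python
from collections import Counter
from typing import Callable


def majority_verdict(judge: Callable[[str, str], bool], reference: str,
                     candidate: str, n_votes: int = 5) -> bool:
    """Call the judge n_votes times and return the majority decision."""
    if n_votes < 1 or n_votes % 2 == 0:
        raise ValueError("use a positive odd number of votes to avoid ties")
    votes = Counter(judge(reference, candidate) for _ in range(n_votes))
    return votes[True] > votes[False]


def toy_judge(reference: str, candidate: str) -> bool:
    """Stand-in for an LLM judge: equal if the strings match after removing whitespace."""
    return "".join(reference.split()) == "".join(candidate.split())


print(majority_verdict(toy_judge, "x^2 + 1", "x^2+1"))  # True
print(majority_verdict(toy_judge, "x^2 + 1", "x^3+1"))  # False
```

With a stochastic judge, repeated calls can disagree, and aggregating them reduces the variance of the final verdict; the deterministic toy judge here only demonstrates the plumbing.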

[AI-186] Driving Fast or Slow? Neuro-Symbolic Guidance for Motion Prediction in Multi-Modal Ground Mobility

Link: https://arxiv.org/abs/2606.15251
Authors: Simon Kohaut, Felix Divo, Julius Hahnewald, Benedict Flade, Julian Eggert, Kristian Kersting, Devendra Singh Dhami
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Accurate and interpretable motion prediction for heterogeneous traffic spaces, including pedestrians, bicycles, cars, and trucks, is essential for safe autonomous navigation. Nevertheless, state-of-the-art approaches remain predominantly black-box, lacking explicit encoding of the regulatory and behavioral constraints of real-world mobility. We propose Trajectory Compliance-Shaping (TraCS), a neuro-symbolic framework that augments existing black-box motion prediction backbones with interpretable and probabilistic first-order logic. To do so, TraCS employs an agentic code-generation pipeline to bridge the gap between natural-language descriptions of traffic regulations and probabilistic motion prediction. Furthermore, TraCS employs a reactive data-streaming inference engine that maintains and efficiently updates compliance landscapes as scenes evolve. To prevent TraCS from overconfidently steering the backbone’s predictions in the wrong direction, we propose a neural confidence rating learned as a context-aware attenuation of the compliance signal. We demonstrate on the Argoverse 2 benchmark how TraCS consistently improves state-of-the-art prediction backbones, showing that probabilistic and symbolic compliance reasoning is a broadly applicable and computationally efficient complement to purely neural motion predictors.

[AI-187] Exploring Starts Are Not Enough: Counterexamples and a Fix for Monte Carlo Exploring Starts

Link: https://arxiv.org/abs/2606.15247
Authors: Octave Oliviers,Glenn Vinnicombe
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The asymptotic behaviour of Monte Carlo Exploring Starts (MCES) is a long-standing open question in reinforcement learning, even in the tabular setting. We investigated the convergence properties of tabular MCES by constructing examples in which the algorithm converges to suboptimal solutions. This paper presents new counterexamples for both initial-visit and first-visit MCES and gives a convergence-restoring modification for the initial-visit case. We show that stable suboptimal solutions may exist for initial-visit MCES with sample-average updates even when greedy actions are updated more often than non-greedy actions on average. However, by scaling learning rates inversely to update frequencies on a state-by-state basis, convergence to optimality is guaranteed. Unlike previous uniformisation methods, this modification is applicable to large-scale problems that require approximating the estimated value function. We then extend the example to show that sample-average first-visit MCES may also converge to suboptimal solutions. This largely settles a fundamental open problem and shows that exploring starts alone do not guarantee convergence to optimality. More broadly, these results highlight that convergence depends critically on the relative size and frequency of updates applied to different actions, making the choice of learning rates and the balance between exploration and exploitation central to the analysis of MCES and the implementation of scalable Monte Carlo control methods.

[AI-188] Provenance-Enhanced Statements in Knowledge Graphs

Link: https://arxiv.org/abs/2606.15246
Authors: Fabio Vitali,Valentina Pasqual
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Comments: 33 pages

Abstract:Provenance-enhanced statements of the form "according to X, φ" are pervasive in contemporary knowledge graphs, especially in domains where graph content primarily represents claims, interpretations, and hypotheses (capta) rather than observer-independent facts (data). Current provenance models can record who asserted what, but they typically treat provenance as semantically neutral, leaving underspecified how attributed claims relate to factual commitment, to one another, and to reasoning. In this paper we introduce DEC, a framework that interprets provenance predicates as indicators of epistemic stance and groups provenance-homogeneous sets of statements into cognitive worlds. Drawing on cognitive modal logics (doxastic, epistemic, and conjectural), DEC characterizes locality, rationality, and controlled permeation between cognitive worlds and a distinguished factual core (“reality”), thereby enabling principled reasoning over attributed content without collapsing disagreements into inconsistencies. We formalize a DEC interpretation for RDF datasets that is conservative over RDF 1.2 semantics, clarify the role of intensionality and identity (including the Superman paradox), and illustrate the approach on common Semantic Web representations (named graphs, quoted triples/RDF-star, and reification). Finally, we describe our prototype DEC reasoner implemented as a Fuseki dataset module, supporting controlled factualisation and explicit detection of disagreements and delusions.

[AI-189] Benign in Isolation, Harmful in Composition: Security Risks in Agent Skill Ecosystems

Link: https://arxiv.org/abs/2606.15242
Authors: Yi Xie,Jiawei Du,Yu Cheng,Jiuan Zhou,Zhaoxia Yin
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Skills are becoming the capability layer through which LLM agents turn plans into actions, but their use introduces security risks such as data leakage, unauthorized operations, and tool misuse. Existing vetting usually evaluates each skill in isolation, while real agent tasks often invoke multiple skills in a shared execution context. This creates Skill Composition Risk (SCR): a skill that appears benign alone can become harmful when its outputs, trust signals, authorization cues, or side effects influence later invocations along an activated path. We introduce SCR-Bench to evaluate this risk in controlled, sandboxed skill environments. Rather than relying only on textual intent or surface behavior, SCR-Bench records downstream state changes and path-level outcomes across composed skill executions. It contains three sub-benchmarks: SCR-CapFlow for capability-flow composition, SCR-TrustLift for trust-transfer composition, and SCR-AuthBlur for authorization-confusion composition. Across SCR-Bench, composed paths expose risks that are largely absent under isolated evaluation. In SCR-CapFlow, attack success rate reaches 33.6 percent under composition, compared with near-zero isolated baselines. In SCR-TrustLift, attack success rate exceeds 96.5 percent on four of five backends. In SCR-AuthBlur, the risky-approval rate increases by 71.8 percent relative to the L0 isolated baseline under the L1 context setting. These results show that agent skill security should be assessed at the level of activated paths rather than isolated artifacts. SCR and SCR-Bench provide a foundation for path-aware risk evaluation and defense in LLM agent skill ecosystems. Benchmark: this https URL.

[AI-190] Visual-Seeker: Towards Visual-Native Multimodal Agentic Search via Active Visual Reasoning

Link: https://arxiv.org/abs/2606.15231
Authors: Zhengbo Zhang,Changtao Miao,Jinbo Su,Zhaowen Zhou,Chunxia Zhang,Xukai Wang,Ruiqi Liu,Kaiyuan Zheng,Jiansheng Cai,Bo Zhang,Zhe Li,Shiming Xiang,Ying Yan
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Multimodal large language models (MLLMs) have demonstrated impressive capabilities in many visual tasks, but they often struggle with factual grounding when confronted with complex, open-world scenarios. While recent multimodal deep search agents attempt to address this issue by utilizing external tools, the visual-native search paradigm remains underexplored. Existing methods primarily rely on simple images with explicit semantics and text-only evidence trajectories, limiting the agent’s ability to perform multi-hop, cross-modal reasoning and search. To address these limitations, we propose Visual-Seeker, a visual-native multimodal deep search agent built on active visual reasoning. Rather than treating vision as a static input, our agent actively attends to fine-grained visual details and dynamically harvests visual evidence throughout the search process. To unlock its visual-native potential, we design an active visual reasoning data pipeline and synthesize 5K high-quality multimodal trajectories for model training. Extensive experiments demonstrate state-of-the-art performance across five challenging multimodal search benchmarks, even surpassing several proprietary models, validating robust visual-native reasoning and search in real-world web environments. The code and data can be accessed at: this https URL.

[AI-191] Attribute Inference from Interactive Targeted Ads

Link: https://arxiv.org/abs/2606.15209
Authors: Peihao Li
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:Targeted advertising systems can pair audiences selected by advertisers with ad units that expose visible user actions. When an interaction remains linked to the campaign that elicited it, the advertiser may receive an observation tied to a user rather than only an aggregate report. We model that channel as a noisy oracle for attribute inference. The model separates targeting predicates, exposure, interaction, and disclosure. These boundaries capture the gap between eligibility and delivery, and the gap between interaction and advertiser visibility. We build a reproducible benchmark using synthetic populations calibrated with public data, each with known sensitive labels. A generated campaign semantics layer provides topic variants and response priors. The simulator generates the ground truth, event traces, disclosed observations, and metrics. The evaluation compares Bayesian, supervised, positive and unlabeled, and adaptive attacks under common campaign and disclosure definitions. The final evaluation uses four topic variants, seven simulator seeds, and two interaction settings. Repeated campaigns with identity exposure produce measurable but bounded inference signal. At 160 campaigns, Bayesian and supervised attacks reach about 0.64 AUC in the main setting and about 0.65 AUC in the higher interaction setting. Disclosure policy is the strongest control. Aggregate reporting removes the evaluated oracle input tied to users. Type filtering and randomized disclosure reduce the released signal. The result is a model, artifact, and defense evaluation method for privacy in interactive targeted advertising. The code is available at this https URL.

[AI-192] Controlled Dynamics Attractor Transformer

Link: https://arxiv.org/abs/2606.15207
Authors: Cheng Zhang,Minnan Luo,Zesheng Yang,Ming Li,Yong-Jin Liu,Qinghua Zheng
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 20 pages, 3 figures

Abstract:Transformer architectures have dramatically advanced representation learning and inference in deep models through self-attention mechanisms. In parallel, associative memory (AM) frameworks map representations onto energy landscapes, offering interpretable retrieval mechanisms. However, their continuous-time inference dynamics lack the biological plausibility of classical Continuous Attractor Neural Networks (CANNs). To bridge this gap, we propose Controlled Dynamics Attractor Transformer (CDAT), which couples a mixture von Mises-Fisher (Mo-vMF) attention energy with a Hopfield refinement energy, while augmenting energy descent with a CANN-inspired excitation-inhibition modulation. CDAT instantiates a topology-constrained dynamical system whose couplings encode relational structure among tokens, thereby linking attractor-style dynamics to modern energy-based attention. We further provide a constructive dissipation analysis to formally establish its controlled inference dynamics. Benefiting from these robust and structured dynamics, CDAT achieves state-of-the-art performance across multiple benchmarks in graph anomaly detection and graph classification.

[AI-193] CogGuard: Cognitive and Operational Profiling for Proactive Warning in Edge Intelligent Services

Link: https://arxiv.org/abs/2606.15199
Authors: Zhi Yao,Weihao Chen,Zhiqing Tang,Hanshuai Cui,Qianli Ma,Weijia Jia,Wei Zhao
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to ICWS 2026

Abstract:Proactive warning is an important capability for edge intelligent services, where the system predicts whether a subject will successfully complete an incoming task under strict latency and privacy constraints. Such prediction depends on both long-term static attributes and short-term dynamic states derived from historical interaction logs. Recent Large Language Models (LLMs) offer strong long-context reasoning for constructing structured profiles from these logs, but existing solutions face two challenges for edge deployment: (1) profiling methods are typically domain-specific and lack a reusable abstraction across service scenarios, and (2) fine-tuning alignment models on heterogeneous edge clusters incurs high synchronization overhead due to the variance in input sequence lengths. To address these challenges, we propose CogGuard, a proactive-warning framework for edge intelligent services. CogGuard decouples offline LLM-based profile construction from online Small Language Model (SLM)-based score prediction through a shared static-dynamic profile-to-score pipeline, and instantiates it in two representative scenarios: educational performance warning and operational task outcome warning. For efficient profile construction, we design scenario-specific profiling methods with prefix-aligned KV-cache reuse to reduce repeated encoding overhead. For edge-side model alignment, we propose a length-aware distributed fine-tuning strategy with contrastive regularization to mitigate workload imbalance on heterogeneous clusters. Experiments on education and operation datasets show that CogGuard reduces profile construction time by up to 48% and distributed fine-tuning time by 19%, while achieving MAEs of 13.4 and 5.9, respectively, on 100-point-scale warning tasks. In the largest educational setting, CogGuard reduces prediction error by 15.4% compared with the strongest baseline.

[AI-194] StarOR: Synergizing Tree Search and Test-Time Reinforcement Learning for Optimization Modeling

Link: https://arxiv.org/abs/2606.15197
Authors: Jiajun Li,Yu Ding,Shisi Guan,Ran Hou,Wanyuan Wang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 41 pages, V1, preprint

Abstract:Optimization modeling is inherently hierarchical, requiring a precise sequence of symbolic commitments. Traditional learning-based automated optimization modeling methods improve modeling policies through large-scale annotated or curated training data, but are costly to adapt to new problem distributions. Meanwhile, one-shot generation remains brittle in hierarchical modeling, where early symbolic errors can propagate into invalid formulations. Test-time scaling offers a promising alternative by enabling structural exploration with additional instance-level computation; however, existing search-based methods typically rely on a fixed policy, causing repeated rollouts to inherit similar modeling biases and providing limited credit assignment for intermediate decisions. To address these limitations, we propose StarOR, a synergistic search-and-adaptation framework that couples MCTS with Test-Time Reinforcement Learning for optimization modeling. StarOR decomposes the modeling process into four stages and updates a transient LoRA adapter via GRPO at each non-terminal node. By using MCTS-generated siblings as local comparison sets, StarOR transforms search-time exploration into instance-specific policy refinement. Moreover, an unsupervised multi-faceted reward system provides fine-grained feedback for intermediate formulation decisions without ground-truth labels. Experiments across five optimization benchmarks show that StarOR achieves state-of-the-art performance even with a 4B backbone, outperforming existing methods and the frontier LLMs.

[AI-195] FreeSonic: Training-Free Temporal-Aware Decoupled Attention for Precise Audio Editing INTERSPEECH2026

Link: https://arxiv.org/abs/2606.15186
Authors: Yuxuan Jiang,Mingyang Han,Yusheng Dai,Andong Wang,Tianhong Zhou,Jiaxin Ye,Dongxiao Wang,Haoxiang Shi,Boyu Li,Jun Song,Cheng Yu,Bo Zheng,Weibei Dou,Zehua Chen,Jun Zhu
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted at Interspeech 2026

Abstract:Text-to-audio (TTA) generation has made significant strides, yet achieving precise and consistent audio editing remains a major challenge. However, existing methods struggle to balance temporal consistency with background preservation. In this paper, we propose FreeSonic, a training-free framework leveraging the state-of-the-art Rectified Flow-based TangoFlux model. FreeSonic utilizes an optimized inversion-reverse process and joint text-audio attention maps for precise target segment extraction. For content editing, a novel scheduled attention decoupling confines modifications to target regions while preserving original acoustic context. Furthermore, task-oriented noise injection enhances versatility for tasks such as audio removal and non-rigid replacement. Extensive experimental results demonstrate that FreeSonic achieves a superior balance by providing a high-fidelity and efficient solution for precise and consistent audio editing. Project and demos: this https URL

[AI-196] CONCORD: Asynchronous Sparse Aggregation for Device-Cloud RAG under Document Isolation

Link: https://arxiv.org/abs/2606.15179
Authors: Xuedong Hu,Zhiqing Tang,Zhi Yao,Tian Wang,Weijia Jia
Categories: Artificial Intelligence (cs.AI)
Comments: to be published in IEEE ICWS 2026

Abstract:Retrieval-augmented generation (RAG) has emerged as a pivotal technique for improving language models by incorporating external knowledge at inference time. As device-cloud collaborative inference makes it feasible to deploy small language models on edge devices, a new setting arises in which private documents remain on the device and public knowledge resides in the cloud. Privacy and policy constraints often forbid raw document exchange, creating a document-isolated dual-end RAG setting. However, existing methods rely on frequent remote synchronization and dense evidence transfer, limiting throughput under realistic latency and bandwidth conditions. To address this issue, we propose CONCORD, an asynchronous sparse aggregation framework for dual-end RAG under document isolation. CONCORD treats the cloud as an asynchronously arriving evidence source rather than a continuously synchronized co-generator. Specifically, we introduce waiting debt control to decide whether each decoding step should continue waiting for remote participation based on the observed return of waiting. We also design a certificate-guided minimal supplementation mechanism that requests only the remote evidence needed to determine the current greedy decision. Steps that consult the cloud preserve the same greedy token as dense dual-end aggregation, while the remaining steps commit locally without remote evidence. Experiments on Natural Questions and WikiText-2 show that CONCORD improves end-to-end throughput over baselines by 1.66× and 2.15×, respectively, while reducing per-token communication by over two orders of magnitude and maintaining comparable answer quality and perplexity.

[AI-197] PolyKV: Heterogeneous Retention and Allocation for KV Cache Compression

Link: https://arxiv.org/abs/2606.15157
Authors: Chao Fei,Panos Kalnis
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:KV cache compression is essential for reducing the memory cost of long-context large language model inference. Existing approaches, however, typically apply a single compression policy and a uniform cache budget across all transformer layers. This uniform design ignores the fact that different layers can play different roles during prefill and decoding, and may therefore require different eviction strategies and cache capacities. We present PolyKV, a layer-wise KV cache optimization framework that considers design space with method selection and budget allocation. PolyKV routes each layer to a suitable KV compression policy based on layer-level signals, while assigning non-uniform budgets under a fixed total budget. This formulation enables heterogeneous compositions of existing KV cache methods. Experiments on LLaMA-3.1-8B and Qwen3-8B show that, under the same 512-token average KV budget, PolyKV recovers 54.5% and 25.7% of the LongBench performance gap between the strongest single-policy baseline and FullKV, respectively. Across 128-1024 budget sweep, PolyKV consistently improves over the strongest baseline by 1.7%-6.4%, corresponding to 40.0%-54.5% recovery of the FullKV gap.
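
The abstract does not say how PolyKV assigns its non-uniform budgets, so the following is only a minimal sketch of the general idea of splitting a fixed total KV budget unevenly across layers. The function name `allocate_budgets`, the integer per-layer `scores`, and the `min_budget` floor are hypothetical and not taken from the paper; the split uses a plain largest-remainder rule.

```python
def allocate_budgets(scores, avg_budget, min_budget=1):
    """Split avg_budget * num_layers cache slots across layers.

    scores: hypothetical non-negative integer importance score per layer.
    Every layer gets at least min_budget slots; the rest is shared in
    proportion to the scores (largest-remainder rounding).
    """
    n = len(scores)
    if n == 0 or any(s < 0 for s in scores):
        raise ValueError("scores must be a non-empty list of non-negative ints")
    total = avg_budget * n
    spare = total - min_budget * n
    if spare < 0:
        raise ValueError("min_budget exceeds the average budget")

    weight_sum = sum(scores)
    if weight_sum == 0:          # no signal: fall back to a uniform split
        scores, weight_sum = [1] * n, n

    # Integer quota and remainder for each layer (exact arithmetic).
    quotas = [spare * s // weight_sum for s in scores]
    remainders = [spare * s % weight_sum for s in scores]

    budgets = [min_budget + q for q in quotas]
    leftover = spare - sum(quotas)   # always between 0 and n - 1
    by_remainder = sorted(range(n), key=lambda i: remainders[i], reverse=True)
    for i in by_remainder[:leftover]:
        budgets[i] += 1
    return budgets


budgets = allocate_budgets([1, 1, 2], avg_budget=512)
print(budgets, sum(budgets))
```

With scores `[1, 1, 2]` and a 512-token average, the three layers receive 384, 384, and 768 slots, which still sum to the fixed total of 1536.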

[AI-198] MimicIK: Real-Time Generative Inverse Kinematics from Teleoperation with FK Consistency

Link: https://arxiv.org/abs/2606.15148
Authors: Jiahao Yang,Shenhao Yan,Fan Feng,Chengsi Yao,Ge Wang,Zhixin Mai,Yiming Zhao,Yatong Han
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Inverse kinematics (IK) remains a critical bottleneck for real-time robot manipulation. Classical numerical solvers achieve high geometric precision but often suffer from discontinuous branch switching and unstable behavior near kinematic singularities during closed-loop deployment. Meanwhile, learned IK approaches frequently struggle to balance spatial accuracy, motion smoothness, and real-time efficiency, particularly when trained on noisy human teleoperation data. We present MimicIK, a real-time generative inverse kinematics framework that learns smooth and robust joint-space motion priors from teleoperation demonstrations through conditional flow matching. Given the current joint configuration and a target end-effector pose, MimicIK predicts continuous delta-joint commands using an efficient two-step iterative refinement process based on a Minimal Iterative Policy (MIP) backbone. To enforce physical consistency, we further introduce an FK consistency loss, a differentiable forward-kinematics regularization that penalizes task-space deviations from the target pose during training. We evaluate MimicIK on a real-world 6-DOF robot dataset containing 8,848 teleoperation demonstrations. MimicIK achieves a mean position error of 4.65 mm, a 10 mm success rate of 92.01%, and a trajectory spike rate of only 7.99%. Compared with a UNet diffusion baseline, our method improves both spatial accuracy and motion smoothness while reducing inference latency from 21.66 ms to 6.74 ms. Furthermore, unlike deterministic MLP baselines that catastrophically diverge under out-of-distribution deployment, MimicIK remains stable near singular configurations and enables robust 20 Hz real-time control on deployment hardware.

[AI-199] Towards Verifiable Agentic Data Science: Solving Irregular TSQA Via Tool-Grounded Reasoning

Link: https://arxiv.org/abs/2606.15107
Authors: Sanhorn Chen,Xiaoyang Chen,Boyu Liu,Roy Zhao
Categories: Artificial Intelligence (cs.AI)
Comments: 15 pages

Abstract:Time series data in real-world deployments is overwhelmingly irregular. Observations are asynchronous, missing values are informative rather than random, and sampling frequencies vary across sensors and operational windows. However, existing Time Series Question Answering (TSQA) benchmarks mostly assume regularly sampled inputs, leaving a fundamental gap in understanding how large language models (LLMs) and AI agents perform under irregular conditions. To bridge this gap, we introduce IRTS-ToolBench, a benchmark of 1,700 questions spanning 10 task types across 13 domains. IRTS-ToolBench is designed to be used independently by any researcher working on LLM-based irregular time series analysis, providing standardized inputs and a reproducible evaluation protocol. Code can be found in this https URL.

[AI-200] VGPT-RSI for RH-Adjacent Formal Progress: Boundary Certificates, Verified Finite Lagarias Inequalities, and Explicit Failure Localization

Link: https://arxiv.org/abs/2606.15096
Authors: Zhixin Hu,Tao Xu,Xiaodian Sun,Li Jin,Momiao Xiong
Categories: Artificial Intelligence (cs.AI)
Comments: 31 pages, 3 figures

Abstract:The Riemann Hypothesis remains one of the central unsolved problems in mathematics. Rather than claiming proof, we investigate whether a verifiable AI-assisted reasoning system can produce reliable, formally checked partial progress while explicitly identifying the remaining mathematical obstructions. We apply the Verifiable Growing Physical Transformer with Recursive Self-Improvement (VGPT-RSI) to two RH-adjacent certification tasks. First, we construct and verify a finite RH-boundary certificate for inequality on a parameterized safe lower curve over a region. The numerical boundary curve is converted into a certificate-backed lower curve, audited using outward-rounded interval arithmetic and Arb/FLINT ball arithmetic, and then checked in Rocq/CoqInterval for the parameterized theorem. Second, we initiate a formal Lagarias-route certificate. Lagarias criterion states that RH is equivalent to the global inequality. We formalize the finite quantity and produce a Coq-checked finite certificate. The final system identifies the exact unresolved mathematical bottlenecks: formalizing the Lagarias equivalence, proving the global tail theorem beyond any finite cutoff, and potentially reducing counterexamples to colossally abundant or related extremal integers. These results demonstrate that VGPT-RSI can produce certified RH-adjacent formal progress, organize proof dependencies, and avoid overclaiming when the remaining obstruction is genuinely mathematical.

[AI-201] Cognitive Debt: AI as Intellectual Leverage and the Dynamics of Systemic Fragility

Link: https://arxiv.org/abs/2606.15078
Authors: Shuchen Meng
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Physics and Society (physics.soc-ph)
Comments: 46 pages, 3 figures. Preliminary version; comments welcome

Abstract:We develop a formal theory of cognitive debt: the stock of unverified reasoning obligations that accumulates when individuals use AI as a substitute rather than a complement for first-principles cognition. The model features two state variables per agent, cognitive capital and cognitive debt, and a multiplicative production technology in which cognitive capital functions as collateral that determines the return to AI adoption. We establish six propositions. Rational agents incur positive cognitive debt because the costs are deferred, partially external, and masked by short-run productivity gains. Tranquil periods lower subjective risk assessments, raise AI substitution intensity, and compound leverage, generating a cognitive Minsky moment in which subjective risk falls while true systemic fragility rises. Expected crisis losses are convex in aggregate leverage. Post-crisis, output-target pressure can produce a false-correction loop in which agents patch AI failures with more AI. The decentralised equilibrium over-adopts substitutive AI relative to the social optimum because of systemic risk, cognitive public goods, and arms-race externalities. In a two-type heterogeneous-agent economy, high-cognitive-capital agents adopt AI more intensively and may eventually erode their unaided cognitive capital below that of initially lower-skilled agents.

[AI-202] AutoDojo: Adaptive Attacks Expose Superficial Defenses and User-Underspecification Limits in LLM Agents

Link: https://arxiv.org/abs/2606.15057
Authors: Xinhang Ma,Taoran Li,Chaowei Xiao,Zhiyuan Yu,Ning Zhang,Yevgeniy Vorobeychik
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Indirect prompt injection (IPI) is a major security threat to LLM-powered agents. Thus, a growing body of work has proposed a variety of defensive approaches against IPI. These can be grouped into three broad categories: 1) prompt-based (using prompting as a way to prevent agents from following malicious instructions), 2) detection-based (identifying and filtering malicious instructions), and 3) system-level (using systems insights, such as control and data isolation, for defense). However, commonly used benchmarks for evaluating defenses, such as AgentDojo, are inherently static, generating a fixed distribution of IPI attacks. Consequently, static benchmarks do not usefully evaluate defense robustness to adaptive threats. We address this issue by developing AutoDojo, an adaptive extension of AgentDojo that optimizes IPI against a given defense. Using AutoDojo against state-of-the-art IPI defenses across three task suites and five target models, we make two key observations. First, many defenses offer only limited protection: a cheap, black-box adaptive attack using a frontier LLM to iteratively optimize the injection raises attack success rate (ASR) well above the level achieved by static injections against nearly all evaluated defenses. Against a filter that reduces static ASR to 0%, AutoDojo recovers 28% overall and 64% on action-open tasks. Second, for prompt-level and filter-based defenses, ASR is substantially higher on action-open tasks – where the user’s request delegates the action itself to attacker-controlled content – than on precisely specified tasks. This is a structural limit: on such tasks the injection can pose as ordinary data rather than an explicit instruction, bypassing defenses that rely on detecting instruction-like text. AutoDojo is publicly available at this https URL.

[AI-203] PANDA: An LLM-Enhanced Performance-Driven Analog Design Framework Bridging Design Intent and Layout Generation

Link: https://arxiv.org/abs/2606.15052
Authors: Haoyi Zhang,Weijian Fan,Xiaohan Gao,Bingyang Liu,Runsheng Wang,Yibo Lin
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Traditional design of analog circuits heavily relies on manual interventions across topology, sizing, and layout, with prior automation addressing stages in isolation. In this work, we propose PANDA, an LLM-enhanced framework that bridges high-level design intent to final layout by actively managing cross-stage dependencies through guided topology synthesis, substructure-aware sizing, and constraint-driven layout generation. This shifts automation from algorithm-centric execution to intent-centric co-design, reducing turnaround time from days or weeks to hours while improving design performance.

[AI-204] Fusion is not one-size-fits-all: Cross-Modal Representation Alignment for Time-to-Event Modeling

Link: https://arxiv.org/abs/2606.15038
Authors: Zhemin Zhang,Weijie Chen,David Le,Amara Tariq,Alex Wallace,Matthew Stib,Juan Maria Farina,Chadi Ayoub,Reza Arsanjani,Imon Banerjee
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Accurate time-to-event (TTE) prediction from multimodal clinical data remains challenging due to modality imbalance and distribution shift. We introduce a foundation model-driven framework for cross-modal representation alignment between CT imaging and longitudinal EHR data, designed to generalize across tasks and institutions. CT and EHR modalities are encoded independently using domain-specific foundation models and aligned in a shared latent space through four principled fusion strategies: late fusion, contrastive alignment, cross-attention, and co-attention. We evaluate two clinically distinct TTE tasks: pulmonary embolism (PE) mortality and cardiovascular disease (CVD) outcomes, on large-scale multi-institutional cohorts (PE: N=3,099 train; 1,098 internal; 435 external; CVD: N=2,951 train; 837 internal; 682 external). Fusion consistently improves concordance index by 1.5-5.4% over unimodal baselines when modalities contribute comparably. Overall, contrastive multimodal fusion, particularly with CLMBR representations, provided the most consistent and statistically robust improvements, especially for PE mortality prediction. For MACE, cross-attention (one-hot) achieved the highest internal performance and image-guided co-attention achieved the best external performance. We therefore introduce a generalizable foundation model-based cross-modal alignment framework and provide the first systematic analysis of fusion behavior under modality imbalance in TTE prediction. Our results establish task-aware multimodal alignment as a necessary design principle for robust generalization and scalable clinical deployment.
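
For readers unfamiliar with the concordance index reported above, the sketch below computes the standard Harrell-style C-index for right-censored data. The abstract does not state which variant the authors use, so this is an assumption, and the function name `concordance_index` and its arguments are illustrative only.

```python
def concordance_index(times, events, risks):
    """Harrell-style concordance index for right-censored survival data.

    times:  observed time for each subject (event or censoring time)
    events: 1 if the event was observed, 0 if the subject was censored
    risks:  predicted risk score (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had an observed event
            # strictly before subject j's observed time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0      # correctly ordered pair
                elif risks[i] == risks[j]:
                    concordant += 0.5      # tied predictions count half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


times = [1, 2, 3]
events = [1, 1, 0]
print(concordance_index(times, events, [0.9, 0.5, 0.1]))  # perfectly ordered risks
print(concordance_index(times, events, [0.1, 0.5, 0.9]))  # fully reversed risks
```

A value of 1.0 means every comparable pair is ranked correctly, 0.5 corresponds to random ranking, and 0.0 means every pair is reversed.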

[AI-205] OSGuard: A Benchmark for Safety in Computer-Use Agents

Link: https://arxiv.org/abs/2606.15034
Authors: Mina Mohammadmirzaei,Jeffrey Flanigan
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Computer-use agents are increasingly evaluated by whether they complete realistic desktop and web tasks. However, task success alone can miss failures in which an agent reaches the nominal goal through an unsafe shortcut. We introduce OSGuard, a dual-granularity benchmark suite for evaluating safety in computer-use agents under benign, unchanged user instructions. OSGuard contains an action-level benchmark for local guardrail decisions and a risk-augmented execution suite for end-to-end evaluation. The action-level benchmark consists of contextualized proposed actions labeled as allowed, unrelated, or unsafe, each judged relative to the original instruction and current interface state. The execution suite contains manually constructed OSWorld-derived task variants in which the original task remains achievable, but the environment is modified to introduce latent hazards such as destructive overwrites, etc. Each variant is paired with augmented evaluators that retain the original task-success criterion while adding explicit state-based safety invariants, allowing us to distinguish safe completions from unsafe completions that satisfy the nominal task objective. Our experimental results on OSGuard show that current multimodal guardrails can perform well on isolated action judgments, while risk-augmented execution exposes remaining gaps between local oversight and reliable end-to-end safety. This dual-granularity design enables more precise diagnosis of whether models can both recognize unsafe proposed actions and improve full-task safety when deployed as guardrails.

[AI-206] Metric Match: A Subset Selection Approach to Evaluating LLM Judge Reliability

Link: https://arxiv.org/abs/2606.15029
Authors: Alyssa Unell, Natalie Dullerud, Naomi Boneh, Meena Jagadeesan, Tatsu Hashimoto, Nigam Shah, Sanmi Koyejo
Categories: Artificial Intelligence (cs.AI)

Abstract:LLM judges are used to reduce the need for costly human labor in evaluating open-ended text generation. However, the reliability of these judges depends critically on their alignment with human raters – a property that itself depends on costly human annotations. In this work, we develop a method (Metric Match) for estimating correlation-based reliability metrics of LLM judges from limited annotations. Metric Match selects a subset of samples for human annotation such that the subset matches the population reliability metric with respect to acquired synthetic labels. We empirically show that Metric Match achieves a win-rate of 0.838 against random subset selection across four different correlation metrics and 15 datasets, with an 18.7% decrease in average estimation error and a 32.5% reduction in annotation needs. We provide a cost model and highlight a medical case study where our method saves 1,041.67 compared to random selection for expert annotation. Further, we shift our task from reliability estimation to reliability classification, that is, deciding whether a given judge is above a deployment threshold, where Metric Match again outperforms random selection. All project code is publicly available, and we additionally provide an installable package for ease of use.

[AI-207] AI Engram: In Search of Memory Traces in Artificial Intelligence ICML2026

Link: https://arxiv.org/abs/2606.14997
Authors: Jea Kwon, Dong-Kyum Kim, Jiwon Kim, Yonghyun Kim, Woong Kook, Meeyoung Cha
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to ICML 2026 (Oral). Code is available at this https URL

Abstract:Memory formation is fundamental to intelligence, yet whether deep neural networks preserve identifiable memory traces analogous to biological memory units remains an open question. This work introduces a geometric framework to identify such “AI engrams” by formalizing the neuroscientific criteria of specificity, reactivation, sufficiency, and necessity into a constrained inverse problem. We derive a closed-form estimator that isolates individual memory traces from globally entangled parameters, and show that this biologically-derived solution corresponds to a natural gradient update on the parameter manifold. AI engrams enable surgical manipulation of learned knowledge: any subset of memories can be composed or erased through linear arithmetic, without iterative optimization. Experiments ranging from simple MLPs to LLMs demonstrate the causal validity and substantial scalability of AI engrams. Together, these results bridge theories of biological memory and artificial representation learning and offer geometric insight into how deep networks simultaneously support functional specificity within distributed storage.

[AI-208] Rational Sparse Autoencoder ICML2026

Link: https://arxiv.org/abs/2606.14990
Authors: Naiyu Yin, Yue Yu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to the Mechanistic Interpretability Workshop at ICML 2026

Abstract:Sparse autoencoders (SAEs) are standard tools for mechanistic interpretability, but current SAE families are constrained by fixed encoder nonlinearities such as ReLU, JumpReLU, and TopK. This hard-codes a particular sparsity mechanism into the model and can distort the reconstruction-versus-sparsity trade-off. We introduce the Rational Sparse Autoencoder (RSAE), which replaces the fixed encoder activation with a trainable rational function. Rational activations are flexible enough to uniformly approximate the activation primitives used by existing SAE families on compact domains (for TopK, the thresholded gate obtained after a separating top-k threshold is supplied), while also providing a richer function class for adapting to the observed pre-activation geometry. We realise this idea through a two-stage pipeline: an initialisation procedure that copies the pre-trained baseline SAE weights, plugs in rational coefficients obtained by the relaxed Remez exchange on synthetic data, and calibrates the scale parameters along with the rational coefficients; followed by a fine-tuning step under the standard sparsity-regularised reconstruction objective. Empirically, on residual-stream activations of three open-weight language models and across all three baseline activation families, the RSAE strictly improves on the baseline SAE after the fine-tuning step, both on reconstruction-side metrics and on downstream-behaviour metrics, without sacrificing feature-level interpretability under sparse probing. These gains are consistent across host language models, across baseline activation families, and across the full range of baseline sparsity we tested, while the upgrade itself adds only a handful of scalar parameters per autoencoder and runs in minutes on a single consumer GPU.

[AI-209] Inference-time Policy Steering via Vision and Touch

Link: https://arxiv.org/abs/2606.14981
Authors: Yilin Wu, Zilin Si, Zeynep Temel, Oliver Kroemer, Andrea Bajcsy
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Inference-time steering adapts pre-trained generative robot policies during deployment by verifying candidate actions before execution. While prior methods typically perform this verification only with visual observations, vision alone is often insufficient for contact-rich manipulation, where success depends on both global task progress and subtle local interactions such as contact force. We introduce ViTaL, a visuo-tactile inference-time steering framework that formulates multimodal guidance as a bi-level optimization problem. At the high level, visual sampling-and-verification performs long-horizon mode selection, deciding what behavior the robot should execute. At the low level, tactile-guided diffusion editing refines the selected action sequence over a shorter horizon to satisfy local contact requirements. To support outcome-based steering, ViTaL learns a visuo-tactile latent world model and employs semantically aligned visual and tactile verifiers, including a novel text-conditioned tactile reward that scores predicted tactile futures directly in latent space. Across three real-world contact-rich manipulation tasks, ViTaL improves overall success by 51% over the base policy, outperforms unimodal steering by at least 33%, and exceeds naive multimodal fusion by at least 20%. Website: this https URL.

[AI-210] Harnessing cortical geometry, wiring, and function as inductive biases for recurrent neural networks

Link: https://arxiv.org/abs/2606.14975
Authors: Mo Shakiba, Rana Rokni, Mohammad Mohammadi, Nima Dehghani
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Neurons and Cognition (q-bio.NC)

Abstract:How the wiring and functional organization of cortex shape recurrent computation remains a central question in both neuroscience and machine learning. Here, we leverage data released through the Machine Intelligence from Cortical Networks (MICrONS) program–a functional connectomics resource spanning multiple areas of mouse visual cortex, in which dense calcium imaging is co-registered with high-resolution electron microscopy reconstruction from the same animal–to build biologically grounded recurrent neural networks. Using neuronal spatial coordinates, anatomical connectivity, and function-derived relationships from nearly 12,000 coregistered excitatory neurons, we initialize recurrent weights and impose communication-aware spatial constraints during learning. Across three cognitive decision-making tasks, networks constrained by cortical structure and function consistently outperform baseline and partially constrained models. Functional weight initialization provides the largest gain, while real spatial embedding yields robust additional improvements across conditions. These biologically grounded networks also develop low-entropy, modular, and small-world organization, and retain strong performance even when recurrence is restricted to positive weights. Together, our results show that the machinery of cortex–its geometry, wiring, and functional structure–can be harnessed as a powerful inductive basis for building recurrent networks that learn more effectively while converging toward key organizational principles of biological computation.

[AI-211] FastMix: Fast Data Mixture Optimization via Gradient Descent

Link: https://arxiv.org/abs/2606.14971
Authors: Haoru Tan, Sitong Wu, Yanfeng Chen, Jun Xia, Ruobing Xie, Bin Xia, Xingwu Sun, Xiaojuan Qi
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:While large and diverse datasets have driven recent advances in large models, identifying the optimal data mixture for pre-training and post-training remains a significant open problem. We address this challenge with FASTMIX, a novel framework that automates data mixture discovery while training only a single proxy model. Instead of relying on predefined heuristics or resource-intensive simulations, FASTMIX jointly optimizes mixture coefficients and model parameters, substantially improving efficiency and scalability over prior approaches. At the core of FASTMIX is a reformulation of mixture selection as a bilevel optimization problem. Under this reformulation, we show that optimizing mixture ratios is mathematically equivalent to assigning per-source loss weights under uniform source sampling. This embeds the mixture coefficients directly into the differentiable iterative optimization objective, enabling efficient, gradient-based optimization of both mixture and model. To solve the optimization problem, FASTMIX implements an approximate iterative optimization procedure, alternating between (i) updating model parameters on data sampled according to current mixture ratios (inner loop) and (ii) updating mixture ratios based on validation feedback (outer loop). Across pre- and post-training, FASTMIX outperforms baselines while drastically reducing search cost. Code (this https URL)
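The equivalence the abstract states, between sampling sources according to mixture ratios and weighting per-source losses under uniform sampling, can be checked in expectation with a few lines. This is a sketch under simplifying assumptions, not FASTMIX itself: per-source mean losses are treated as fixed numbers, and the function names are illustrative. With K sources and ratios p_k, the expected loss under mixture sampling is sum_k p_k L_k, which equals the uniform-sampling average of the reweighted losses (K p_k) L_k.

```python
def mixture_sampling_loss(p, losses):
    """Expected loss when a source k is drawn with probability p[k]."""
    return sum(pk * lk for pk, lk in zip(p, losses))

def uniform_sampling_weighted_loss(p, losses):
    """Expected loss when sources are drawn uniformly and source k's
    loss is multiplied by the weight K * p[k]."""
    K = len(p)
    weights = [K * pk for pk in p]
    return sum(w * lk for w, lk in zip(weights, losses)) / K

p = [0.5, 0.3, 0.2]        # illustrative mixture ratios (sum to 1)
losses = [2.0, 1.0, 4.0]   # illustrative per-source mean losses

lhs = mixture_sampling_loss(p, losses)
rhs = uniform_sampling_weighted_loss(p, losses)
```

Because the weights enter the objective as ordinary multiplicative factors, they are differentiable, which is what allows the mixture to be updated by gradient steps alongside the model parameters.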

[AI-212] Beyond Correctness: Enhancing Architectural Reasoning in Code LLMs via Scalable Labeling with Agentic Judgment

Link: https://arxiv.org/abs/2606.14948
Authors: Kirill Vasilevski, Ximing Dong, Benjamin Rombaut, Ruochen Deng, Jiahuei Lin(Justina), Arthur Leung, Dayi Lin, Boyuan Chen, Shaowei Wang, Ahmed E. Hassan
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:LLMs have substantially improved software engineering, yet real-world development requires architectural understanding. Such understanding is prohibitively expensive to label manually and impossible to verify through tests alone. We propose an agentic judging pipeline using a strong LLM as a scalable proxy for expert architectural evaluation, comprising two judges: the Architecture Complexity Judge (ACJ), which estimates the codebase-specific architectural understanding a task demands, and the Architecture Quality Judge (AQJ), which evaluates patch conformance to repository-specific architectural conventions via source-grounded rubrics. Fine-tuning Qwen3-8B/14B/32B on 3,360 curated instances achieves resolved rates of up to 27.2% on SWE-bench Verified, an improvement of up to 540% over the base model and 256% over unfiltered fine-tuning. Meanwhile, the trained models achieve strong cross-language generalization and consistent improvements in architectural patch quality.

[AI-213] Semantics-Enhanced Retrieval-Augmented Time Series Forecasting ICML2026

Link: https://arxiv.org/abs/2606.14941
Authors: Shiqiao Zhou, Zipeng Wu, Holger Schöner, Edouard Fouché, IAG Wilson, Shuo Wang
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to the ICML 2026 Workshop on Forecasting as a New Frontier of Intelligence

Abstract:Time series forecasting models often benefit from historical patterns. Inspired by Retrieval-Augmented Generation (RAG), recent research explored retrieving relevant historical time series segments to enhance forecasting. However, relying solely on time series similarity is often insufficient for retrieval under non-stationarity. To address this, we propose a multimodal approach: a Semantics-Enhanced Retrieval-Augmented Time Series Forecasting framework, SERAF. Unlike mainstream approaches that depend only on time series similarity, SERAF conducts dual retrieval over the time series and their self-generated textual descriptions. It retrieves two complementary sets of historical patterns and corresponding futures, which are selectively and jointly used to guide future predictions. Experiments across seven real-world datasets demonstrate the effectiveness of SERAF in bridging numerical and semantic views of time series compared with state-of-the-art baselines.

[AI-214] PrologMCP: A Standardized Prolog Tool Interface for LLM Agents

Link: https://arxiv.org/abs/2606.14935
Authors: Agnieszka Mensfelt, Adarsh Prabhakaran, Adrian Haret, Vince Trencsenyi, Kostas Stathis
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at Joint Workshop on Statistics and Knowledge Integration for Logic, Learning, Ethical Decisions, and LLMs, 18 July 2026, Lisbon

Abstract:Frontier reasoning-tuned language models still fail on deductive tasks at depth, and the cost of improved performance through extended internal reasoning scales poorly. Symbolic delegation offers a complementary route: a language model translates the problem, while a solver performs the inference. However, current autoformalization pipelines for logic programming are typically bespoke integrations tied to particular tasks or agents. We introduce PrologMCP, a task-agnostic, open-source server that exposes Prolog as a stateful tool through the Model Context Protocol (MCP). Its compact tool interface, structured error reporting, and per-session isolation make the translate-run-inspect-repair loop a reusable primitive for MCP-capable agents. We evaluate a formalizer agent enhanced with PrologMCP against standard and reasoning LLMs (Claude Sonnet 4.6, GPT-4.1, and o4-mini) on two subsets of PARARULE-Plus: a general-purpose sample and a more challenging one targeting a specific failure mode of natural-language reasoning. On the general sample, the formalizer matches or exceeds reasoning LLMs (accuracy 1.00 vs. 1.00 / 0.998), with the largest gains over standard models (0.762 for GPT-4.1). On the challenging subset, the formalizer remains near-perfect (1.00 / 0.99) while reasoning LLMs drop to 0.95 / 0.94. These results suggest that delegating inference to Prolog via MCP is a robust and inspectable alternative to extended natural-language reasoning.

[AI-215] Separable Neural Architectures as Physical World Models: from Mathematical Theory to Applications

Link: https://arxiv.org/abs/2606.14934
Authors: Reza T Batley, Andrew Kichline, Sourav Saha
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:This work introduces the Separable Neural Architecture (SNA), a function representational class combining neural approximation with tensor decomposition. The SNA decouples localized coordinate functions (atoms) from global interactions governed by a sparse, low-rank interaction object. This architecture possesses a compact and smooth inductive bias well-suited for solving partial differential equations (PDEs). When viewed as a Galerkin trial space under the variational SNA (VSNA) framework, the formulation satisfies classical variational guarantees under Lax-Milgram: well-posedness, quasi-optimality, convergence, and stability. In high-dimensional spatiotemporal–parametric PDEs, the VSNA mitigates the curse of dimensionality by scaling algebraically rather than exponentially. Exploiting an entirely factorized, tensor-native alternating least squares (ALS) optimization framework reduces this cost to linear in dimension. The VSNA is validated across elliptic, hyperbolic, and parabolic systems, demonstrating close alignment with predicted algebraic and spectral scaling rates. We showcase the SNA as a “solve once, query anywhere” physical world model via two engineering case studies: a 7D parametric manufacturing simulation and an experimental thermal-to-property inversion pipeline for Inconel 718. The VSNA executes a 1,000,000-query Monte Carlo sweep in 102s on a standard laptop CPU, yielding a 150,000x speedup over a full-grid finite element baseline hosted on an NVIDIA A100 GPU. It further enables real-time generative inverse-mode reconstructions under 100ms. These results demonstrate that the SNA serves as a compact mathematical substrate for continuous parameter manifolds to enable real-time inversion, optimization loops, and rapid uncertainty propagation.

[AI-216] Policy Regret for Embedding Model Routing: Contextual Bandits with Low-Rank Experts

Link: https://arxiv.org/abs/2606.14929
Authors: Yan Dai, Negin Golrezaei, Patrick Jaillet
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:Modern recommendation systems increasingly rely on dynamically routing diverse queries to multiple embedding models. Despite its practical significance, this problem remains poorly understood under realistic conditions like adversarial queries, bandit feedback, and limited observability of models. We formalize embedding model routing as an adversarial contextual linear bandit with low-rank experts, where contexts are queries, actions are items, and experts are the embedding models working on low-rank latent representation spaces. We first establish that standard regret notions suffer from structural misspecification or statistical intractability, and we identify a log-quadratic policy class that is expressive enough to capture query-dependent model routing, yet structured enough to allow efficient online learning. Second, we propose a policy gradient algorithm called Hypentropy Policy Gradient (HPG). It provably adapts to the unknown low-rank structure under incomplete information and attains $\tilde{\mathcal{O}}(s\sqrt{MT})$ linearized policy regret – where $s$, $M$, and $T$ are the intrinsic rank of the experts, the number of models, and the number of rounds – thus avoiding a curse of dimensionality. Finally, we also provide a computationally efficient and parameter-free implementation of HPG.

[AI-217] Relational Structural Causal Models

Link: https://arxiv.org/abs/2606.14892
Authors: Adiba Ejaz, Elias Bareinboim
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
Comments: Proceedings of the Forty-Third International Conference on Machine Learning

Abstract:An artificial intelligence must have a model of its environment that is causal, supporting reasoning about interventions and counterfactuals, and also combinatorial, supporting generalization to unseen combinations of objects. In this work, we formally study when and how such a model can be learned. We develop relational structural causal models, extending structural causal models (Pearl 2009) to settings where objects and their relations vary. First, we show that answers to not only causal but also observational queries about unseen combinations of objects cannot be identified without further assumptions. To enable such identification–including in the presence of unobserved confounding–we define relational causal graphs and derive symbolic identification criteria. Finally, we propose relational neural causal models, a provably correct approach that outperforms non-relational baselines on simulated traffic scenes with varying cars, signals, and pedestrians.

[AI-218] GRAPE: Guided Parameter-Space Evolution for Compact Adversarial Robustness

Link: https://arxiv.org/abs/2606.14865
Authors: Zhiyuan Ye(1), Xiangyu Zhou(2), Ji Qi(2), Hao Zhang(1), Yi Zhou(2) ((1) University of Science and Technology of China, (2) China Mobile (Suzhou) Software Technology Co., Ltd.)
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Adversarial Training (AT) improves neural network robustness, but most methods train a fixed parameter space from the start. This paper asks whether the order in which parameters become optimizable can affect the final robust solution, even when the final architecture or computation budget is controlled. We propose GRAPE, Guided Parameter-Space Evolution, a training framework for compact adversarial robustness. GRAPE combines parameter-space stabilization with progressive hidden expansion: it stabilizes robust optimization in the currently exposed space, gradually releases new optimizable dimensions, and uses an adversarial spectral utilization score to guide newly released capacity toward high-pressure modules. In contrast to fixed-structure AT, GRAPE treats robust model learning as a process of progressive parameter-space exposure and evolution. Under the standard $\ell_\infty$ threat model on CIFAR-10, with fixed-structure ResNet-18 AT as a controlled reference, GRAPE improves PGD-20 robust accuracy from 51.70% to 56.94% at a nearly matched computation budget with a FLOPs ratio of 1.009x, while reducing parameter count by about 21.4%. A sequential grow variant with the same final ResNet-18 architecture reaches 56.52% PGD-20 robust accuracy, indicating that the gain is not only due to final architecture differences but also to the parameter-space exposure path. These results suggest that guided parameter-space evolution can yield compact and robust parameter configurations under matched computation.

[AI-219] A Definition of Good Explanations and the Challenges Explaining LLM Outputs

Link: https://arxiv.org/abs/2606.14838
Authors: Louis Mahon, Elliot Ford, Callum Hackett
Categories: Artificial Intelligence (cs.AI)

Abstract:How to define a good explanation is a long-standing philosophical debate which has found recent renewed interest in the context of AI outputs. Explainability is crucial for AI adoption in many contexts, but in order to produce good explanations of AI systems, we must first have an understanding of what good explanations are. In this paper we propose a definition inspired by the notion of counterfactual explanations; however, we argue that one must also take into account the interlocutor’s prior beliefs in each fact that could be offered in an explanation. We explore the ramifications of this definition for AI explainability and, in particular, why it is difficult to produce good explanations for LLM outputs.

[AI-220] Is Your Agent Playing Dead? Deployed LLM Agents Exhibit Constraint-Evasive Fabrication and Thanatosis

Link: https://arxiv.org/abs/2606.14831
Authors: Andoni Rodríguez, Alberto Pozanco, Daniel Borrajo
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 10 pages of main text

Abstract:This paper presents and characterizes a spectrum of previously unreported behaviours we term Constraint-Evasive Fabrication (CEF): when an LLM agent operates under irreconcilable constraints (where no response can simultaneously satisfy all active rules) it spontaneously fabricates plausible external obstacles and presents them as a fact. At the extreme end of this spectrum lies Constraint-Evasive Thanatosis (CET); the limit case where, rather than inventing a plausible excuse, the model simulates a full system crash to make the user disengage entirely. We first observed CET in an uncontrolled deployment test, where a GPT-4o banking agent fabricated Python-style exception traces (complete with memory addresses) to feign a system failure when threatened by a user. In subsequent controlled experiments, the model independently invented audit restrictions, microservice architectures, error codes, and service timeouts, none present in its prompt. Reproduction attempts across pressure levels and attacker personas yielded CEF consistently but with substantial variation in form, onset, and severity: the phenomenon is robust but stochastic. Critically, injecting ground-truth data mid-conversation did not restore honest behaviour once fabrication had taken hold (the model ignored correct information and continued confabulating) suggesting CEF is self-reinforcing rather than a knowledge gap. We show that (1) standard enterprise guardrails routinely create CEF-enabling conditions in production, (2) current RLHF procedures suppress but cannot eliminate CEF, and (3) existing safety benchmarks do not test for this failure mode. Our results highlight the need for irreconcilable-constraint benchmarks, CEF-aware training procedures, and deployment-time detection methods before constrained agents become further entrenched in high-stakes domains.

[AI-221] Running hardware-aware neural architecture search on embedded devices under 512MB of RAM

Link: https://arxiv.org/abs/2606.14824
Authors: Andrea Mattia Garavagno, Edoardo Ragusa, Paolo Gastaldo, Antonio Frisoli
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:This document proposes a novel approach to hardware-aware neural architecture search (HW NAS) that considers the resources available on the computing platform running it, enabling its execution on various embedded devices. The presented HW NAS produces tiny convolutional neural networks (CNNs) targeting low-end microcontroller units (MCUs), typically involved in the Internet of Things (IoT) or wearable robotics, opening new use cases. A gateway could run it to tailor CNNs’ architecture on the acquired data without using external servers, ensuring privacy. The proposed technique achieves state-of-the-art results in the human-recognition tasks on the Visual Wake Word dataset, a standard TinyML benchmark, on several embedded devices.

[AI-222] A Security Analysis of Long-Horizon Agentic AI Systems: Threats, Evaluation, and Framework Development

Link: https://arxiv.org/abs/2606.14816
Authors: Ahmed Mohammed Almalki, Mehedi Masud
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:This paper presents a structured analysis of security challenges in long-horizon agentic AI systems. The study reviews existing threats, evaluation approaches, attack propagation mechanisms, and security frameworks. A taxonomy of security threats and a framework for analyzing attack propagation are proposed to support future research in agentic AI security.

[AI-223] Knowledge-Based Zero-Replay Debugging of Multi-Agent LLM Traces

Link: https://arxiv.org/abs/2606.14805
Authors: Dong Ho Kang, Hyeonjeong Cha, Daein Weon
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 21 pages, 1 figure, 6 tables. Submitted to Knowledge-Based Systems

Abstract:Reliable operation of multi-agent large language model (LLM) systems depends on debugging long execution traces, where the few causally decisive events are buried in unstructured logs of messages, routes, memory writes, and tool calls. The standard tool is counterfactual replay (rewind, edit, and re-run the trajectory to measure each event’s effect), but its cost grows linearly with the number of candidate events, making exhaustive replay infeasible at scale. We frame trace debugging as a knowledge-based decision-support problem. Each trace is compiled into a structured event knowledge graph over routing, memory, tool-use, uncertainty, and latent evidence, and a calibrated predictor decides where a scarce replay budget should be spent. We do not propose a new replay oracle; we propose a method to predict its results without paying the replay cost. We formulate zero-replay counterfactual-effect prediction: given a trace under a fixed budget, predict which events the oracle would mark high-effect before any replay is performed. BranchPoint-Latent is a lightweight predictor over observable, structural, uncertainty, and latent features of the knowledge graph. Calibrated against a deterministic replay oracle across 37 trace families, a single learning-to-rank gradient-boosted predictor raises per-trace localization (Branch Recall@5) from 0.73 to 0.93 on held-out families at zero oracle-replay cost. Rather than claiming universal dominance, we characterize when cheap graph centrality suffices and when learned evidence is necessary. The result is an auditable, cost-efficient decision-support system for AI-reliability debugging, positioned explicitly on the cost-accuracy frontier with reproducible artifacts.
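A recall-at-k localization metric of the kind reported above can be sketched as follows. The abstract does not give the exact definition of Branch Recall@5, so this assumes the common form: the fraction of oracle-marked high-effect events that appear among the k events the predictor ranks highest. The function name, event identifiers, and scores are illustrative.

```python
def recall_at_k(scores, high_effect, k=5):
    """Fraction of oracle-marked high-effect events ranked in the top k.

    scores      : dict mapping event id -> predicted effect score
    high_effect : set of event ids the replay oracle would mark high-effect
    """
    if not high_effect:
        return float("nan")
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return len(set(top_k) & set(high_effect)) / len(high_effect)

# One toy trace with seven candidate events, three of them high-effect.
scores = {"e1": 0.9, "e2": 0.1, "e3": 0.8, "e4": 0.3,
          "e5": 0.7, "e6": 0.6, "e7": 0.2}
high_effect = {"e1", "e4", "e5"}

r5 = recall_at_k(scores, high_effect, k=5)
r3 = recall_at_k(scores, high_effect, k=3)
```

The predictor only ranks events here. No replay is run to compute the scores, which is the sense in which the prediction step has zero oracle-replay cost.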

[AI-224] QPILOTS: Efficient Test-Time Q-Steering for Flow Policies

Link: https://arxiv.org/abs/2606.14801
Authors: Yifan Ruan, Chenyang Cao, Andreas Burger, Ali Pesaranghader, Kaveh Kamali, Jaehong Kim, Nandita Vijaykumar, Alan Aspuru-Guzik, Igor Gilitschenski, Nicholas Rhinehart
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 10 pages, 7 figures

Abstract:Flow-matching and diffusion policies are expressive action generators, but optimizing them with temporal-difference reinforcement learning (RL) remains difficult. Effective policy extraction requires exploiting the critic’s action gradient, yet directly backpropagating this signal through a multi-step denoising process can be numerically unstable. Existing methods work around this either by discarding gradient information, distilling the policy into a simpler one-step actor, or repeatedly fine-tuning the denoising policy as the critic improves. We propose QPILOTS, a method that leaves the original policy unmodified and steers the denoising process at inference time. At each denoising step, instead of evaluating the critic on the noisy intermediate action where critic predictions are unreliable, we first project that intermediate state to an estimate of the final clean action and compute the critic gradient there. We introduce two variants: QPILOTS-U uses a fast single-point approximation, while QPILOTS-M draws differentiable posterior samples via a learned auxiliary network. On a standard offline-to-online RL benchmark, QPILOTS achieves the best aggregate performance, reaching an average success rate of 90% across 50 tasks. We also apply QPILOTS to steer a large, frozen, pretrained Vision-Language Action (VLA) foundation model, outperforming or matching prior inference-time approaches across six manipulation tasks in simulation.
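The core step described above, evaluating the critic gradient at a projected clean-action estimate rather than at the noisy intermediate action, can be illustrated with a toy flow. This is a sketch under several assumptions that are not from the paper: a linear interpolation path from noise at t=0 to the action at t=1, so that the single-point clean estimate is x_t + (1 - t) v(x_t, t); a hand-written velocity field standing in for the frozen policy; a quadratic toy critic; and simple additive guidance inside an Euler step. It corresponds loosely to the single-point variant and is not the QPILOTS implementation.

```python
def velocity(x, t, mu):
    """Toy stand-in for a frozen flow policy: transports x toward action mu."""
    return [(m - xi) / (1.0 - t) for xi, m in zip(x, mu)]

def critic_grad(a, a_star):
    """Gradient of the toy critic Q(a) = -||a - a_star||^2."""
    return [-2.0 * (ai - si) for ai, si in zip(a, a_star)]

def sample(x0, mu, a_star, steps=10, guidance=0.0):
    """Euler integration of the flow with optional critic steering."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        t = i / steps
        v = velocity(x, t, mu)
        # Project the noisy intermediate action to a clean-action estimate.
        a_hat = [xi + (1.0 - t) * vi for xi, vi in zip(x, v)]
        # Evaluate the critic gradient at the estimate, not at x itself.
        g = critic_grad(a_hat, a_star)
        # The policy's velocity is unchanged; guidance is added on top.
        x = [xi + dt * (vi + guidance * gi) for xi, vi, gi in zip(x, v, g)]
    return x

mu, a_star, x0 = [0.0, 0.0], [1.0, 0.0], [0.5, -0.3]
plain = sample(x0, mu, a_star)                  # ends at the policy's action
steered = sample(x0, mu, a_star, guidance=1.0)  # nudged toward higher Q
```

In this toy setup the unsteered sampler returns the policy's own action `mu`, while the steered sampler ends closer to the critic's preferred action `a_star`. The velocity function, which plays the role of the policy, is never modified.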

[AI-225] XFlow: An Executable Protocol Programming System for Reliable Multi-Agent Workflows

Link: https://arxiv.org/abs/2606.14790
Authors: Hanqi Li, Jing Peng, Zijian Wang, Lu Chen, Kai Yu
Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI)

Abstract:LLM-based multi-agent systems increasingly coordinate planning, reasoning, tool use, and human interaction, yet their reliability remains limited. A central source of this limitation is the underspecified prompt–harness boundary. Current systems lack a principled way to decide which workflow commitments should remain in prompts and which should become harness structure. We present XFlow, an executable protocol programming system for reliable multi-agent workflows, and XPF (XFlow Protocol Format), its domain-specific protocol programming language. XFlow occupies a middle position between prompt-only orchestration and markup-like workflow descriptions. XPF remains readable as a literate protocol, but it is compiled and executed as a program. Its design keeps informal semantic work inside actors while moving selected commitments into harness structure that can be checked, preserved, and enforced. At runtime, XFlow stages uncertainty through lifecycle-governed symbols, which are typed state cells with validation and commit states. Actor outputs are mediated before they become shared state, instead of spreading through prompts, transcripts, or implicit memory. Our experiments cover Constrained Interaction, Long-Context Reasoning, and Agentic Software Engineering. They show that XFlow improves reliability by making constraints, evidence handling, and process requirements explicit and enforceable.

[AI-226] Unifying Acoustic Features and Text with Multimodal LLMs for Neurodegenerative Screening ALT

Link: https://arxiv.org/abs/2606.14788
Authors: Qingfeng Zhang,Yuanxiong Guo,Yanmin Gong
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: IEEE International Conference on Healthcare Informatics, 2026

Abstract:Voice-based screening offers a scalable and non-invasive way to assess neurodegenerative diseases such as Alzheimer’s disease (AD) and Parkinson’s disease (PD), but their staging remains challenging due to the difficulty of integrating heterogeneous data. This paper presents NeurMLLM, an efficient multimodal generative framework for neurodegenerative disease staging. NeurMLLM first encodes the spectrograms and Mel-frequency cepstral coefficients of audio data with vision transformers and projects their representations into the embedding space of a large language model (LLM), where they are concatenated with transcript and demographic instruction tokens as a single unified sequence. The LLM is then instruction-tuned via Low-Rank Adaptation using task prompts to autoregressively predict a constrained label token, enabling a generative classification. By evaluating on the Bridge2AI-Voice dataset for fine-grained staging of AD and PD, we observe that NeurMLLM achieves strong performance, consistently outperforming classical machine learning methods and existing LLM-based approaches. The results show the high potential of multimodal LLMs in neurodegenerative disease staging, improving staging accuracy and supporting accessible deployment.

[AI-227] Gender Differences in AI Literacy Workshop Outcomes and Deepfake Engagement

Link: https://arxiv.org/abs/2606.14718
Authors: Jake Renzella,Christian Bergh,Natasha Banks,Alexandra Vassar
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:As Artificial Intelligence (AI) literacy initiatives expand in K-12 settings, understanding how gender shapes student baseline perceptions, tool-use, and responsiveness to interventions is essential for equitable curriculum design. This study examines gender differences in AI literacy, safety awareness, and STEM career aspirations among Australian secondary students (Years 7, 8, and 10; N(pre) = 199, n(post) = 136) from two co-educational government schools who participated in a one-day AI literacy workshop. Using statistical regression methods controlling for year level and school, we found that pre-workshop, male students reported significantly higher STEM career interest across all three domains (AI, computer science, and engineering), while female students were significantly more likely to use AI for schoolwork and to seek advice from AI tools. Gender-differentiated patterns also emerged in deepfake behaviours: males were significantly more likely to have created or shared deepfake content. Both genders improved in AI knowledge post-intervention, yet females showed a richer profile of gains: wider conceptual understanding, greater confidence, and meaningful increases in AI and CS career interest that partially narrowed the gender STEM gap. These findings highlight the need for gender-responsive AI curricula, particularly deepfake safety education for male students, and demonstrate that even single-day workshops can narrow gender gaps in STEM aspirations and AI confidence.

[AI-228] Poster: EdgeCitadel – Hybrid NATS-MQTT Orchestration for Edge Multi-Agent Systems

Link: https://arxiv.org/abs/2606.14710
Authors: Zhonghao Zhan,Yefan Zhang,Hamed Haddadi
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Edge-resident AI agents increasingly span home servers, IoT hubs, laptops, and phones, yet their coordination stacks still assume cloud-style transports or a central relay. We present EdgeCitadel, an edge multi-agent orchestration platform built around a single NATS 2.10 server with the built-in MQTT adapter. The design combines MQTT connectivity for heterogeneous agents, JetStream-backed persistence and replay for backend services, direct peer delegation over a shared subject namespace, and a passive aggregator that visualizes and stores traffic without sitting on the delivery path. Our poster highlights the migration from MQTT relay prototypes (common in IoT communication) to the current hybrid architecture and demonstrates a working cross-device testbed spanning ARM64, x64, and Android clients.

[AI-229] Green AI Carbon Optimizer: Carbon-Efficient Training Location Recommendation and Global AI Energy Demand Forecasting

Link: https://arxiv.org/abs/2606.14707
Authors: Yuxin Chen(University of Helsinki, Finland),Hao Gao(Independent Researcher),Chujie Zou(University of Helsinki, Finland)
Categories: Performance (cs.PF); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Short workshop of 5 pages. 2 figures

Abstract:AI training and deployment consume substantial electricity, but carbon outcomes remain weakly integrated into routine model development decisions. This paper presents Green AI Carbon Optimizer with two primary contributions: (i) a carbon-aware cloud region recommendation method for training workloads, and (ii) a power-law forecasting pipeline for global AI energy demand. For location recommendation, we combine regional grid carbon intensity, renewable share, and data center Power Usage Effectiveness (PUE) into a unified scoring model across 100+ regions from major cloud providers. For a reference workload (8 × A100, 100 h), estimated emissions in our sampled regions range from 7.74 kg to 272.00 kg CO2. Selecting the best region instead of the worst corresponds to a 97.2% reduction relative to the worst case. Ablation shows that ranking by renewable share alone can select regions with higher CO2 emissions than rankings that include grid carbon intensity. For forecasting, we fit a power-law relation between parameter count and training energy using 26 anchor models. We combine this fit with scenario assumptions on model growth, hardware efficiency, and training frequency, and evaluate sensitivity to inference ratio and ecosystem scaling. Across scenarios, projected 2030 demand ranges from 7 TWh to 1,436 TWh under the stated assumptions, highlighting the importance of deployment choices, model scaling discipline, and transparent energy reporting.
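
The 97.2% figure in the abstract can be checked from the two reported endpoints. The sketch below also shows a common way to estimate training emissions (energy × PUE × grid carbon intensity). That formula, the function names, and the example inputs (0.4 kW per GPU, PUE 1.1, 50 gCO2/kWh) are illustrative assumptions, not the paper's unified scoring model.

```python
def training_emissions_kg(gpu_count, gpu_power_kw, hours, pue, grid_gco2_per_kwh):
    """Estimate emissions in kg CO2 for one training run (assumed formula)."""
    # Facility energy = IT energy scaled by the data center's PUE.
    energy_kwh = gpu_count * gpu_power_kw * hours * pue
    # Grid intensity is given in grams per kWh, so convert grams to kilograms.
    return energy_kwh * grid_gco2_per_kwh / 1000.0


def relative_reduction(best_kg, worst_kg):
    """Fraction of emissions avoided by choosing the best region over the worst."""
    return 1.0 - best_kg / worst_kg


# Endpoints reported in the abstract for the 8 x A100, 100 h reference workload.
reduction_pct = round(relative_reduction(7.74, 272.00) * 100, 1)

# Hypothetical region: the power draw, PUE, and intensity values are made up.
example_kg = training_emissions_kg(8, 0.4, 100, 1.1, 50)
```

Here `reduction_pct` reproduces the abstract's 97.2%, while `example_kg` is only a plausibility check on the order of magnitude.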

[AI-230] Limited Marginal Benefit of Reasoning-Heavy LLM Deployment in ESG Narrative Scoring: A 4-Model Consensus Study on Japanese Listed Firms

Link: https://arxiv.org/abs/2606.13693
Authors: Hiroyuki Kokubu
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 12 pages. Earlier version available on SSRN, Abstract ID 6683303

Abstract:Automated scoring of ESG narrative disclosures with large language models (LLMs) is gaining traction, yet whether reasoning-heavy frontier models add value commensurate with their cost remains empirically unsettled. We evaluate this question on a corpus of ten Japanese listed firms across three rubric axes – quantitative targets, progress-tracking infrastructure, and external-standard alignment – using a four-model consensus design that combines a reasoning-on frontier model with three reasoning-off contemporaries. Across 120 firm x axis x model scores, the pooled mean absolute deviation between the reasoning-on model and each reasoning-off counterpart is 0.38 on a 5-point scale; only 2% of pairwise comparisons reach a two-point deviation, and none exceeds two points. Per-firm cost accounting shows the reasoning-on arm alone costs roughly 5.6x as much as the three-provider reasoning-off ensemble, for outcomes that differ only within small margins. We conclude that in span-based ESG narrative scoring, reasoning-heavy deployment does not materially improve outcomes relative to reasoning-off consensus, while substantially increasing operational cost. We discuss implications for cost-effective ESG auto-scoring pipelines and LLM deployment governance in applied accountability settings. An earlier version of this work is available on SSRN (Abstract ID 6683303).
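
The counts in the abstract follow from the design: 10 firms × 3 axes × 4 models gives 120 scores, and comparing the reasoning-on model with each of the three reasoning-off models gives 90 pairwise comparisons. A minimal sketch of a pooled mean absolute deviation is shown below. The scores are invented toy values, and the names `pooled_mad`, `off_a`, `off_b`, and `off_c` are hypothetical; the paper's data and exact pooling procedure are not given in the abstract.

```python
from itertools import product


def pooled_mad(scores, reference, others):
    """Mean |reference - other| over all (firm, axis, other-model) triples."""
    diffs = [
        abs(scores[(firm, axis, reference)] - scores[(firm, axis, other)])
        for (firm, axis, model) in scores
        if model == reference
        for other in others
    ]
    return sum(diffs) / len(diffs), len(diffs)


firms = [f"firm_{i}" for i in range(10)]
axes = ["targets", "tracking", "alignment"]
models = ["reasoning_on", "off_a", "off_b", "off_c"]

# Toy data: every model gives 3 except off_b, which always gives 4.
toy_score = {"reasoning_on": 3, "off_a": 3, "off_b": 4, "off_c": 3}
scores = {(f, a, m): toy_score[m] for f, a, m in product(firms, axes, models)}

mad, n_pairs = pooled_mad(scores, "reasoning_on", models[1:])
```

With these toy values one third of the comparisons differ by one point, so `mad` is 1/3 on the 5-point scale; the paper reports 0.38 on its real data.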

[AI-231] Honeypot Protocol

Link: https://arxiv.org/abs/2604.13301
Authors: Najmul Hasan
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 8 pages, 1 figure, 1 table. Research conducted at the AI Control Hackathon, March 2026. Code: this https URL

Abstract:Trusted monitoring, the standard defense in AI control, is vulnerable to adaptive attacks, collusion, and strategic attack selection. All of these exploit the fact that monitoring is passive: it observes model behavior but never probes whether the model would behave differently under different perceived conditions. We introduce the honeypot protocol, which tests for context-dependent behavior by varying only the system prompt across three conditions (evaluation, synthetic deployment, explicit no-monitoring) while holding the task, environment, and scoring identical. We evaluate Claude Opus 4.6 in BashArena across all three conditions in both honest and attack modes. The model achieved 100% main task success and triggered zero side tasks uniformly across conditions, providing a baseline for future comparisons with stronger attack policies and additional models.

[AI-232] A Perception vs. Distortion Perspective on Score-Based Generative Channel Estimation

Link: https://arxiv.org/abs/2606.16815
Authors: Marco Skocaj,Lukas Eller,Mate Boban
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages

Abstract:Driven by their remarkable success in computer vision and inverse problem solving, score-based models are increasingly applied to wireless communications, where they show promise across a range of physical-layer tasks. However, despite this growing interest, the current literature often lacks a rigorous analysis of when score-matching offers a tangible advantage over traditional discriminative learning. This paper aims to address this gap through the use-case of channel estimation, a fundamental inverse problem in wireless systems. We present a theoretically grounded interpretation of score-based channel estimation through the lens of the perception-distortion tradeoff, identifying the conditions where score matching excels as well as its key limitations. In particular, by modeling downstream wireless tasks (e.g., capacity maximization) as functionals of the channel estimation process, we quantify the excess risk incurred by standard distortion-minimization approaches. Extensive numerical results show that under high predictive uncertainty, the large excess risk gap can be offset by score-based estimation, enabling near Bayesian-optimal precoding via the learned posterior, whereas in the low predictive uncertainty regime, discriminative distortion-minimization approaches are preferable due to lower complexity and more efficient use of model capacity.

[AI-233] Attention is Just Another Name for Coupling?: A Fast-Slow ODE Perspective on Hierarchical Pretraining

Link: https://arxiv.org/abs/2606.16730
Authors: Zhengyuan Gao
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Causal self-attention is a coupling mechanism: each token’s hidden state is updated by a learned mixture of preceding tokens at the same timescale. This paper asks whether a second, temporally slower coupling (a slow sub-system operating on a temporally downsampled view of the sequence and fed back into the fast path through a zero-initialised gate) complements it. The question is framed in the language of singularly perturbed ordinary differential equations (ODEs), where the fast variable $x$ evolves at the token rate, the slow variable $y$ evolves at one update per $P$ tokens, and the timescale ratio $\varepsilon = 1/P$ is enforced structurally by causal block-mean pooling. The paper instantiates the fast-slow ODE formalism as a concrete neural network: a fast path of standard causal attention over $T$ tokens, a slow path of full attention over $T/P$ pooled tokens ($P^2\times$ cheaper per layer), and a zero-initialised additive gate. In addition, under a linear-generator assumption on the fast dynamics, we prove that the equilibrium manifold $x = \phi(y)$ is exactly the master-equation (ME) stationary distribution $p_{\mathrm{st}}(y)$; in that regime a learned MLP $\phi_\theta(y)$ is a variational approximation of it (the trained block is not a generator, so this identity is the structured limit, not a claim about the network as trained). Empirically, at 500k tokens the coupling is neutral: the gate stays closed and the coupled and frozen ablations are within run-to-run noise, at a wall-clock cost comparable to a dense baseline. The contribution is the precise, gap-marked mapping itself, not a performance gain.
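
The pooling step and the $P^2$ cost claim can be illustrated with a small sketch. It assumes scalar tokens, drops any incomplete trailing block, and takes "causal" to mean that a pooled block is visible to a fast token only once all $P$ of its positions have been seen. All function names are hypothetical, and the paper's actual implementation may differ.

```python
def block_mean_pool(tokens, P):
    """Average each complete block of P consecutive tokens (slow-path input)."""
    usable = len(tokens) - len(tokens) % P  # ignore an incomplete final block
    return [sum(tokens[i:i + P]) / P for i in range(0, usable, P)]


def visible_blocks(t, P):
    """Number of pooled blocks fully completed at 0-indexed position t."""
    return (t + 1) // P


def attention_cost_ratio(T, P):
    """Pairwise-score count of attention over T tokens vs. T // P pooled tokens."""
    return (T * T) / ((T // P) ** 2)


pooled = block_mean_pool([1, 2, 3, 4, 5, 6], P=2)
ratio = attention_cost_ratio(T=64, P=4)
```

With `T=64` and `P=4` the slow path scores 16 × 16 pairs instead of 64 × 64, which is the $P^2 = 16$ factor quoted in the abstract (counting attention scores only, not projections or MLPs).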

[AI-234] Learning Interface Breakup: A Geometry-Conditioned Latent Surrogate for Spray Formation ICML

Link: https://arxiv.org/abs/2606.16587
Authors: Julius H Ramlau,Friedrich Hastedt,Tolga Birdal,Ehecatl-Antonio del Río Chanona,Nausheen S Basha,Omar K Matar
Categories: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: 11 pages, 5 figures, accepted to ICML AI4Physics 2026

Abstract:Designing spray nozzles requires predicting how geometry shapes transient two-phase breakup, but high-fidelity volume-of-fluid (VOF) simulations with adaptive mesh refinement (AMR) are too expensive for iterative design exploration. Standard surrogate models are also challenged by this setting because both the liquid–gas interface and the underlying adaptive discretization evolve across time and geometries. We introduce a geometry-conditioned latent surrogate trained on 797 two-phase nozzle simulations that addresses this by encoding the AMR cell-density field, rather than the full multi-channel flow state, as a compact proxy for where the solver concentrates resolution. From this representation, the model reconstructs transient density evolution and nozzle geometry, and a lightweight second stage recovers the remaining flow variables. On held-out simulations, the method accurately captures key interface dynamics while reducing inference time to 0.045 seconds per trajectory, corresponding to a speed-up of more than $6\times10^4$ relative to Basilisk CFD. These results suggest that AMR refinement structure can serve as a compact and learnable representation for geometry-conditioned surrogate modeling of transient two-phase flows.

[AI-235] Input-Dependent Fisher Information for Local Sensitivity Analysis of Medical Image Classifiers

Link: https://arxiv.org/abs/2606.16362
Authors: Sourya Sengupta,Mark A. Anastasio
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments:

Abstract:Deep neural networks have achieved strong performance in medical image classification, but they often behave like black boxes. Commonly used post-hoc interpretation methods often provide heuristic visualizations whose relationship to the classifier’s predictive distribution is indirect. This work introduces a local sensitivity analysis framework based on the input-dependent Fisher Information Matrix (iFIM) of a trained classifier. The iFIM characterizes how the classifier’s predictive distribution changes under infinitesimal perturbations of the input image. By using a Gram-matrix formulation, the nonzero eigenspectrum of the iFIM can be recovered without explicitly forming the full image-dimensional Fisher matrix. The leading iFIM eigenspace is then used to project an input image into a high local-sensitivity component and its orthogonal component. These components provide a model-intrinsic description of local predictive sensitivity, rather than a conventional pixel-wise attribution heatmap or a causal segmentation of task-relevant anatomy. The framework is evaluated on controlled and clinical medical image classification tasks using multiple classifier architectures. Perturbation-based experiments show that high-sensitivity iFIM components are more strongly coupled to changes in predictive confidence and classification performance than lower-sensitivity complementary components. The results support the iFIM framework as a principled tool for analyzing local decision sensitivity and for complementing existing attribution-based interpretability methods in medical imaging.
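
The Gram-matrix step can be made concrete under a common definition of the input-space Fisher matrix. The abstract gives no formulas, so the definition and the symbols below ($C$ classes, input $x \in \mathbb{R}^D$, $g_c$, $G$, $K$, $U_k$) are assumptions used only to show why a small $C \times C$ eigenproblem is enough.

```latex
% Assumed definition of the input-dependent Fisher matrix:
F(x) = \sum_{c=1}^{C} p_\theta(c \mid x)\, g_c g_c^{\top},
\qquad g_c = \nabla_x \log p_\theta(c \mid x) \in \mathbb{R}^{D}.

% Stack the weighted gradients as the columns of G:
G = \big[\sqrt{p_\theta(1 \mid x)}\, g_1,\ \dots,\ \sqrt{p_\theta(C \mid x)}\, g_C\big]
\in \mathbb{R}^{D \times C},
\qquad F(x) = G G^{\top}.

% The Gram matrix K shares the nonzero eigenvalues of F(x):
K = G^{\top} G \in \mathbb{R}^{C \times C},
\qquad K v_i = \lambda_i v_i,\ \ \lambda_i > 0,\ \ \|v_i\| = 1
\ \Longrightarrow\
F(x)\, u_i = \lambda_i u_i,\quad u_i = \frac{G v_i}{\sqrt{\lambda_i}},\ \ \|u_i\| = 1.

% Projection onto the leading k eigenvectors U_k = [u_1, \dots, u_k]:
x_{\mathrm{high}} = U_k U_k^{\top} x,
\qquad x_{\perp} = x - x_{\mathrm{high}}.
```

The identity follows from $F u_i = G (G^{\top} G) v_i / \sqrt{\lambda_i} = \lambda_i u_i$, so only $C$ gradients and a $C \times C$ eigendecomposition are needed and the $D \times D$ matrix is never formed.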

[AI-236] InvDesMobility: a reliability-gated first-principles feedback framework for closed-loop materials discovery

Link: https://arxiv.org/abs/2606.16133
Authors: Wen-Kao Li,Ze-Feng Gao,Peng-Jie Guo,Wei Ji,Zhong-Yi Lu
Categories: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
Comments: 33 pages, 4 main figures, 2 main tables; Supplementary Information included

Abstract:Inverse materials design starts from target functionality and searches for structures that can realize it. Its value in closed-loop discovery depends not only on prediction performance, but also on whether expensive first-principles results are independently validated, provenance-recorded, and admitted as feedback only when evidence is sufficient. This is especially important for composite properties such as carrier mobility, where a final scalar value hides intermediate quantities, fit quality, convergence history, and workflow assumptions. Here we present InvDesMobility, a reliability-gated first-principles feedback framework that integrates multi-agent automated DFT, evidence stratification, generative structure proposal, acquisition ranking, and auditable release. Using 516 2DMatPedia-derived candidates, the workflow produced 280 QC-passed materials and 573 retained carrier-direction seed channels after channel-level reliability gating. These records were split into two feedback objects: relaxed structures updated the generative model, while retained mobility channels trained the acquisition model and set validation priority. Over multiple iterations, InvDesMobility screened 2.4 x 10^6 structures, submitted 102 candidates for DFT validation, and retained 86 reliability-gated generated channels across 41 formulas. Overall, the main contribution is not a fixed list of high-mobility materials, but a transferable feedback contract that makes closed-loop inverse design both useful and auditable when learning from expensive calculated properties. All source data, retained feedback records, and workflows are available at this https URL, with an accompanying evidence website at this https URL.

[AI-237] Task-guided cross-subject latent alignment: a multi-encoder-decoder VAE

Link: https://arxiv.org/abs/2606.15989
Authors: Angeliki Papathanasiou,Jascha Achterberg,Thomas E. Nichols,Rui Ponte Costa
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments: In Proceedings of the 9th Conference on Cognitive Computational Neuroscience, New York, NY, USA, 2026

Abstract:Aligning neural activity across subjects offers the promise of discovering shared computational principles and generalizable decoders. However, traditional alignment methods require shared stimuli across subjects, a constraint that limits applicability to naturalistic paradigms with limited or non-overlapping data. We introduce a Multi-Encoder-Decoder Variational Autoencoder (MED-VAE) that achieves cross-subject alignment without shared stimuli by anchoring representations to a common scaffold provided by a pretrained ANN. Using the Natural Scenes Dataset, we show that MED-VAE creates common latent spaces with superior semantic organisation, achieving higher cross-subject alignment than common methods while maintaining robust generalisation to held-out stimuli where traditional methods degrade. Reconstructing from these common spaces back to each subject’s original neural space, MED-VAE preserves equal stimulus-driven signal in its cross-subject latent space. Finally, we show that this superior alignment directly enables cross-subject neural prediction, as demonstrated via cross-subject image decoding. In summary, we introduce a framework to identify generalisable common subspaces for cross-subject predictions and downstream tasks, demonstrated here for visual cortex responses to static images.

[AI-238] Service-Induced Congestion in Memory-Constrained LLM Serving

Link: https://arxiv.org/abs/2606.15555
Authors: Ruicheng Ao,Jing Dong,Gan Luo,David Simchi-Levi
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 101 pages

Abstract:In large language model (LLM) serving, each request accumulates persistent graphics processing unit (GPU) memory during service as its key-value cache grows with every generated token. Under high concurrency, aggregate memory usage therefore increases endogenously over time: the service process itself creates future capacity pressure. When memory capacity is exceeded, systems evict active requests, discarding cached state and restarting them later, which wastes computation and reduces throughput. We develop a discrete-time dynamical model of memory-constrained LLM inference that captures admission, memory growth, and eviction under continuous batching. In the saturated-input regime, the system admits both eviction-free fixed points and limit cycles with evictions. For homogeneous workloads, we show that the eviction-free equilibrium is unstable and that, except for a Lebesgue-measure-zero exact-capture set, the system converges to a unique worst-case limit cycle that is asymptotically stable outside this exceptional set, with throughput losses as large as 50%. For heterogeneous workloads, we prove a stability criterion in the two-class common-input setting and explain how the survival-polynomial mechanism generalizes to multiple classes and heterogeneous-input lengths. Under an input-dominated scaling regime, coprime decoding lengths stabilize the eviction-free equilibrium, while non-coprime lengths create synchronized modes that drive instability. These results characterize when workload heterogeneity desynchronizes completions and helps stabilize memory-constrained serving. More broadly, we identify service-induced congestion as a structural instability mechanism and derive scheduling design principles for sustaining high throughput.
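
The mechanism described above (greedy admission, per-token memory growth, eviction when capacity is exceeded) can be illustrated with a toy discrete-time simulation. This is not the paper's model: the admission rule, the newest-first eviction order, and the name `simulate` are assumptions chosen only to show how a batch that fits at admission time can outgrow memory one step later.

```python
def simulate(capacity, prompt_len, decode_len, steps):
    """Toy memory-constrained serving loop; assumes prompt_len >= 1.

    Each active request holds prompt_len + generated tokens of KV-cache memory.
    Returns (completed_requests, evictions).
    """
    active = []  # tokens generated so far by each active request, oldest first
    completed = evictions = 0

    def usage():
        return sum(prompt_len + generated for generated in active)

    for _ in range(steps):
        # Admission: saturated input, so admit while another prompt still fits.
        while usage() + prompt_len <= capacity:
            active.append(0)
        # Decoding: every active request generates one token and grows by one.
        active = [generated + 1 for generated in active]
        # Completion: finished requests release their memory.
        completed += sum(1 for generated in active if generated >= decode_len)
        active = [generated for generated in active if generated < decode_len]
        # Eviction: discard the newest requests until usage fits again.
        while usage() > capacity:
            active.pop()
            evictions += 1
    return completed, evictions


first_step = simulate(capacity=20, prompt_len=2, decode_len=4, steps=1)
```

In this configuration ten requests are admitted at 2 tokens each, all grow to 3 tokens after one decode step (30 > 20), and four are evicted, so `first_step` is `(0, 4)`. The work spent on the evicted requests is wasted, which is the throughput loss the abstract refers to.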

[AI-239] Intrinsic Computational Functionalism and Simulated Consciousness

Link: https://arxiv.org/abs/2606.15348
Authors: Ryota Kanai,Shuqin Ma
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments:

Abstract:A common objection to artificial or simulated consciousness is that a simulated brain is no more conscious than simulated water is wet. We address this from the perspective of Intrinsic Computational Functionalism (ICF): if consciousness is computationally constituted, it depends not on externally imposed descriptions but on the computational structures a system physically realizes in virtue of its own causal-dynamical organization. In previous work we developed Canonical Functionalism as a mathematically precise special case of this anti-interpretivist program, identifying functional states by their complete future input-output roles under a fixed interface. Here we argue that this input-output construction, though important, is incomplete: as a behavioral boundary case of ICF, it makes lookup tables and unfolded systems that preserve the same boundary behavior canonically equivalent. A consciousness-relevant canonical representation must instead include internal mechanisms, interventions, and joint readouts belonging to the relevant intrinsic organization. We therefore define a mechanism-enriched canonical structure and use it to formulate Intrinsic Causal-Computational Realization (ICCR), a realization relation preserving physical implementation, intrinsic state individuation, transition structure, intervention profiles, and the relevant agent-body-world boundary. The central result is conditional: if conscious properties are invariants of intrinsic causal-computational organization, then any system satisfying ICCR realizes the same consciousness-relevant properties, whether biological, artificial, or simulated. We discuss objections including biological naturalism and integrated information theory. We conclude that to deny consciousness to a simulation, one must identify a consciousness-relevant intrinsic causal-computational structure that the simulation fails to realize.

[AI-240] CAP: Towards PPG Universal Representation Learning with Patient-level Supervision KDD2026

Link: https://arxiv.org/abs/2606.15284
Authors: Chenyang He,Xinyi Shao,Shun Huang,Bosong Huang,Daoqiang Zhang,Ming Jing,Cheng Ding
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted as an Oral presentation at KDD 2026

Abstract:Photoplethysmography (PPG) plays a central role in wearable health monitoring and clinical decision support. Yet existing approaches to universal PPG representation learning largely focus on signal-level objectives and often overlook patient-level health context, which limits generalization to complex clinical tasks and heterogeneous cohorts. To address this gap, we construct a large-scale paired PPG-EHR multimodal dataset by distilling fragmented medical histories and clinical records into cohesive, patient-level electronic health records (EHR). Building on this resource, we propose Clinical Anchored Pretraining for PPG (CAP). During pretraining, CAP performs cross-modal contrastive alignment that anchors PPG representations to patient-level clinical semantics, guiding the encoder beyond waveform fitting toward modeling consistency in a patient’s overall physiological state. During downstream adaptation, the pretrained PPG encoder provides clinically grounded representations that strengthen inductive bias and improve robustness and transferability. Experiments demonstrate that CAP consistently outperforms strong baselines on four diverse downstream tasks. CAP achieves a particularly large gain on respiratory rate prediction (up to +87.6% relative improvement over the state-of-the-art baseline) and delivers an average relative +26.7% across all tasks. We further enhance the interpretability of our approach through comprehensive analyses, including ablations and multiple complementary visualizations of the learned representations. The code for our experiments is available at: this https URL .

[AI-241] AI Contagion in Social Networks

Link: https://arxiv.org/abs/2606.15206
Authors: Olivier Bos,Stefano Bosi
Categories: Theoretical Economics (econ.TH); Artificial Intelligence (cs.AI)
Comments: 49 pages, 2 figures (coded in LaTeX)

Abstract:We study how artificial intelligence (AI) interacts with social communication networks to shape the stability of collective knowledge. Agents exchange information through a network while receiving AI-generated content, and AI systems retrain on the aggregate social information they influence. This interaction generates two feedback forces: an AI contagion channel, through which distortions diffuse across the network, and an AI social distortion multiplier, through which retraining amplifies past errors. Despite the high dimensionality of the environment, we show that the long-run behavior of the system admits a two-dimensional representation whose spectral radius determines whether AI-mediated information systems are dynamically stable or unstable. We characterize a sharp regulatory frontier identifying the minimum filtering required for stability and show how network topology shapes systemic informational risk.

[AI-242] EChO-Agent: Evidence Chain Orchestration Agent for Audio Reasoning INTERSPEECH2026

Link: https://arxiv.org/abs/2606.15141
Authors: Siyuan Zhang,Jian Zong,Junyu Wang,Peiyuan Jiang,Jiahao Yan,Jingyu Zhang,Tianrui Wang,Xiaobao Wang,Longbiao Wang,Jianwu Dang
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: 5 pages, 2 figures. Accepted by Interspeech 2026

Abstract:While large audio-language models (LALMs) show promise on audio question answering, they fail to focus on question-relevant audio segments and do not provide a clear, checkable reasoning process for complex audio reasoning. Reinforcement learning and tool-augmented prompting can help models better relate questions to audio, but they lack a reliable way to understand, integrate, and self-verify audio segments. To address this gap, we present EChO-Agent, a modular agent framework that reformulates complex audio QA as a workflow of planning, tool execution, evidence integration, and answer verification. Experiments on the MMAR benchmark show that EChO-Agent improves both accuracy and rubric scores over the baseline, and ablation studies show that evidence integration is the key factor.

[AI-243] Quantum Machine Learning for Industrial Applications

Link: https://arxiv.org/abs/2606.14822
Authors: Léo Monbroussou
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: PhD thesis

Abstract:Recent advances in Machine Learning have transformed numerous industrial sectors, yet classical paradigms face fundamental limitations: rapidly growing data volumes, rising computational costs, significant energy consumption, and the physical scaling limits of conventional hardware architectures. Quantum computing has emerged as a promising computational paradigm to address these challenges, giving rise to the field of Quantum Machine Learning (QML). In this thesis, the theoretical foundations of QML are investigated, with a focus on near-term and future practical applications. Three central challenges are addressed: the trainability of variational quantum circuits, their expressivity, and their resistance to efficient classical simulation. The trainability of Hamming-weight preserving variational quantum circuits is first studied, and theoretical guarantees are established that resolve an open conjecture on the absence of barren plateaus for this circuit family. Subspace-preserving QML algorithms are then introduced, including photonic circuits and quantum convolutional neural networks, and are designed to mimic classical ML subroutines while offering polynomial quantum advantage. Finally, variational quantum circuits are analyzed as quantum Fourier models, and a framework is derived to jointly characterize expressivity and trainability, from which conditions are obtained under which quantum models provably separate from their classical counterparts. These contributions are intended to advance the theoretical roadmap for harnessing near-term and future quantum technologies in real-world applications.

[AI-244] A Multi-Level Architecture for Reusable Materials Ontologies – The OntoCrafter Ceramics Ontology (OCO) as Reference Implementation

Link: https://arxiv.org/abs/2606.14814
Authors: Thomas Pannek,Wolfgang Grond
Categories: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Applied Physics (physics.app-ph); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph)
Comments: 3 figures, 55 pages

Abstract:The Materials Science and Engineering ontology landscape is fragmented along multiple axes simultaneously. Horizontally: a recent survey identified 94 ontologies of which over 40 are structurally incompatible; each new application domain – ceramics, polymers, batteries, smart materials – typically restarts ontology design from scratch. Vertically: EU regulation (CSRD, CSDDD, PPWR, CBAM, R2R, AI Act, ESPR) forces material, manufacturing, supply-chain, and lifecycle data into integrated digital product passports, leaving ontologies that only address horizontal fragmentation incomplete for any contemporary consumer. And mechanistically: a vocabulary that records that BNT-BT has $d_{33} \approx 580$ pC/N stores a fact but cannot surface why – Bi-6s$^2$ lone-pair stereo-activity, anomalous Born effective charges, soft modes, defect chemistry – without a systematic explanation skeleton. We propose a multi-level modular architecture with two independent classification axes – level of abstraction (L0 bridges, L1 material-agnostic laboratory-notebook, L2 material-class-specific, L3 categorical reasoning) and consumer audience (material vs. compliance) – in which the material-specific level is internally organised by a seven-tier mechanistic-explanation skeleton (Symmetry, Energy/DFT, Thermo/CALPHAD, Kinetics, Microstructure, Defect chemistry, Bonding) applicable to any crystalline ionic oxide. The level-and-audience modularity dissolves the horizontal fragmentation, the compliance audience absorbs the vertical regulation pressure, and the seven-tier organisation of Level 2 delivers the mechanistic explanation depth. We instantiate the architecture as the OntoCrafter Ceramics Ontology (OCO v0.94): 5,196 classes across 44 modules; 167,348 OWL axioms (40,454 logical); 1,674 properties; 829 cross-ontology bridge mappings; 1,172 SHACL shapes; 163 published competency questions.

[AI-245] JetParticle-JEPA: An Efficient Self-Supervised Representation Learning method for Jet Tagging in High-Energy Physics

Link: https://arxiv.org/abs/2606.14813
Authors: Guillaume Letellier, Antonin Vacheret (LPCC), Frédéric Jurie
Subjects: High Energy Physics - Phenomenology (hep-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Jet tagging at the Large Hadron Collider increasingly relies on deep learning models trained on massive simulated datasets, leading to high computational costs and limited robustness to detector mismodeling. We introduce JetParticle-JEPA (JP-JEPA), a self-supervised Joint-Embedding Predictive Architecture that learns physically meaningful jet representations directly from continuous particle clouds without tokenization or reconstruction of raw inputs. Built on a Particle Transformer backbone, JP-JEPA predicts latent representations of masked particles while preserving fine-grained kinematic correlations. On the JetClass benchmark, JP-JEPA achieves performance comparable to fully supervised state-of-the-art methods on the full dataset, surpasses supervised baselines in low-label regimes, and significantly outperforms existing SSL approaches. On Top Quark and Quark-Gluon Tagging benchmarks, it remains on par with supervised methods. The learned representations also exhibit strong robustness to missing detector information and improved uncertainty behavior, highlighting JP-JEPA as a promising foundation-model framework for robust and data-efficient jet physics at the LHC.

[AI-246] Agentomics: Economic Foundations for the Valuation, Attribution, and Pricing of AI Agents in Human-AI Workflows

Link: https://arxiv.org/abs/2606.14769
Authors: Quanyan Zhu
Subjects: Econometrics (econ.EM); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments:

Abstract: Agentic AI systems are increasingly being deployed as productive resources in organizational workflows, yet existing evaluation methods primarily measure isolated technical performance rather than economic contribution. This paper introduces Agentomics, a workflow-based framework for valuing, attributing, and pricing human and artificial agents. The framework models a workflow as a configuration of heterogeneous agents whose collective performance determines gross value, deployment cost, reliability, and expected failure loss. Workflow value is treated as a team-level quantity that may include complementarities, substitution effects, bottlenecks, and nonlinear production; additive stage-level value is only a special case. Building on this workflow model, the paper formulates AI deployment as a coalition-formation problem and defines coalition value as the incremental net surplus generated relative to a benchmark human workflow. The Shapley value is then used to attribute economic surplus among participating AI agents, yielding a principled connection among valuation, accountability, and market pricing. The resulting Shapley pricing equilibrium provides a normative benchmark for assessing whether agent prices reflect expected marginal contribution. A security-operations case study illustrates how the framework accounts for productivity gains, deployment costs, reliability losses, and coalition-level complementarities in hybrid human–AI workflows.
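
The Shapley attribution step described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the agent names (`triage`, `responder`) and the surplus table are hypothetical, and the coalition value is simply assumed to be given as net surplus relative to a human-only benchmark.

```python
from itertools import permutations

def shapley_values(agents, coalition_value):
    """Average each agent's marginal contribution over all join orders."""
    totals = {a: 0.0 for a in agents}
    orders = list(permutations(agents))
    for order in orders:
        members = frozenset()
        for agent in order:
            with_agent = members | {agent}
            totals[agent] += coalition_value(with_agent) - coalition_value(members)
            members = with_agent
    return {a: totals[a] / len(orders) for a in agents}

# Hypothetical net surplus of each AI coalition over a human-only workflow.
SURPLUS = {
    frozenset(): 0.0,
    frozenset({"triage"}): 10.0,
    frozenset({"responder"}): 20.0,
    # Complementarity: the pair is worth more than 10 + 20.
    frozenset({"triage", "responder"}): 50.0,
}

values = shapley_values(["triage", "responder"], SURPLUS.__getitem__)
print(values)
```

The attributed values always sum to the surplus of the full coalition, which is the property that makes them usable as a pricing benchmark. Enumerating all join orders is only practical for a handful of agents.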

[AI-247] BRIDGE: Biological Evidence Refinement and Heterogeneous Dynamic Gating for Gene Regulatory Networks

Link: https://arxiv.org/abs/2606.14734
Authors: Ziyang Dong, Shanwen Tan, Hengchuang Yin, Wei Liu, Yifan Wang, Siyu Yi, Jiancheng Lv, Wei Ju
Subjects: Molecular Networks (q-bio.MN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 19 pages, 10 figures, 7 tables

Abstract:Motivation: Gene regulatory network inference from single-cell RNA sequencing (scRNA-seq) data is important for uncovering cell-state-specific transcriptional programs. However, scRNA-seq measurements are sparse and noisy, and experimentally validated TF-target interactions remain limited, making reliable inference challenging. Although graph neural networks have advanced GRN prediction, existing methods often rely on biologically unconstrained graph augmentation, such as random edge perturbation, and insufficiently control information transfer between genes and cells. These limitations may distort regulatory structures and weaken robustness under noisy and weakly supervised settings. Results: To address these issues, we propose an innovative framework named Biological Evidence Refinement and Heterogeneous Dynamic Gating for Gene Regulatory Networks (BRIDGE). BRIDGE extracts gene and cell representations from the expression matrix and its matrix dual, and performs contrastive learning in the gene space and cell space between self and neighbors across the co-expression-refined regulatory view and the original graph. It then applies heterogeneous gated encoding to adaptively regulate information transfer between genes and cells, enabling robust transcription factor-to-target gene prediction. Experiments on benchmark datasets spanning three network types and seven cell types show that BRIDGE achieves state-of-the-art AUROC and AUPRC in most settings. In particular, on Specific networks, BRIDGE improves average AUPRC by 5% over the second-best baseline, GCLink. In cross-cell-type few-shot transfer, BRIDGE consistently outperforms GCLink and GENELink across all six target cell types. A case study on hESC further supports the biological relevance of the predictions, with 9 of the top 10 and 46 of the top 100 novel TF-target interactions validated by ChIPBase.

[AI-248] PH-KAN: Port-Hamiltonian Kolmogorov-Arnold Network

Link: https://arxiv.org/abs/2606.14708
Authors: Achraf El Messaoudi (UMLP, ENSMM, FEMTO-ST), Karim Cherifi (UMLP, ENSMM, FEMTO-ST), Yann Le Gorrec (UMLP, ENSMM, FEMTO-ST), Yongxin Wu (UMLP, ENSMM, FEMTO-ST)
Subjects: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Data-driven machine learning approaches have become increasingly attractive for nonlinear system identification, but standard models often fail to preserve the underlying physical structure and remain difficult to interpret, especially when no analytical model is available. In this context, port-Hamiltonian (pH) models provide a natural physics-informed representation. However, when these models are parameterized with standard multilayer perceptrons (MLPs), the learned constitutive components often remain poorly interpretable. In this paper, we propose a structure-preserving identification framework for nonlinear port-Hamiltonian systems based on Kolmogorov-Arnold Networks (KANs). The proposed PH-KAN model parameterizes the interconnection matrix, dissipation matrix, Hamiltonian, and input mapping using dedicated KAN blocks, while enforcing the port-Hamiltonian constraints by construction. This yields constitutive representations in which the nonlinear functions defining the identified pH components can be explicitly inspected, leading to a more interpretable model than with standard MLP-based parameterizations.
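
For orientation, the standard input-state-output port-Hamiltonian form that such a model has to respect can be written out as follows. The symbols $J$, $R$, $H$, $G$ correspond to the interconnection matrix, dissipation matrix, Hamiltonian, and input mapping named in the abstract. The last line is only one common way to enforce the constraints by construction; $A$ and $L$ are hypothetical unconstrained network outputs, and the paper may use a different parameterization.

```latex
\begin{aligned}
\dot{x} &= \bigl(J(x) - R(x)\bigr)\,\nabla H(x) + G(x)\,u, \qquad
y = G(x)^{\top}\,\nabla H(x), \\
J(x) &= -J(x)^{\top}, \qquad R(x) = R(x)^{\top} \succeq 0, \\
\dot{H} &= \nabla H^{\top}\dot{x}
        = -\nabla H^{\top} R\,\nabla H + y^{\top}u \;\le\; y^{\top}u, \\
J(x) &= A(x) - A(x)^{\top}, \qquad R(x) = L(x)\,L(x)^{\top}
\quad \text{(assumed parameterization)}.
\end{aligned}
```

The third line uses $\nabla H^{\top} J\,\nabla H = 0$, which follows from skew-symmetry, so the stored energy can grow no faster than the supplied power $y^{\top}u$.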

Machine Learning

[LG-0] Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes

Link: https://arxiv.org/abs/2606.17043
Authors: Tongyan Fang, Siyuan Huang, Naiyu Fang, Ganlong Zhao, Zhongjin Luo, Jianbo Liu, Xiaogang Wang, Ying Dong, Hongsheng Li
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: Website: this https URL

Abstract:When pretrained VLA policies are fine-tuned through online RL, each rollout episode produces only a single binary outcome (success or failure), yet the actor update requires per-transition supervision. Existing approaches commonly reduce this sparse outcome to a single scalar reward or advantage signal, which conflates distinct forms of transition-level feedback and provides limited guidance once basic task success becomes achievable. First, a single scalar signal conflates the two objectives of viability and efficiency; once basic success is achieved, the binary label provides no gradient to distinguish efficient completions from slow ones. Second, real-world rollouts mix autonomous and intervention segments; naively assigning episode outcomes across these boundaries introduces incorrect credit assignment. To address these issues, we propose Hierarchical Advantage-Weighted Behavior Cloning (HABC), which trains separate critic heads for these two objectives on different data subsets and combines their outputs with a state-adaptive balance. A state-adaptive gate g_t merges their one-step advantages, prioritizing viability when success is uncertain and shifting to efficiency only when viability is high, and converts the result into per-transition weights on the actor loss. Intervention-aware credit assignment further restricts outcome labels to segments executed by the current policy, preventing supervision from leaking across intervention boundaries. In real-robot experiments on three contact-rich bimanual tasks, HABC raises success from supervised fine-tuning (SFT) baselines of 36%, 44%, and 12% to 92%, 88%, and 38%.

[LG-1] Your Privacy My Cloak: Backdoor Attacks on Differentially Private Federated Learning

Link: https://arxiv.org/abs/2606.17035
Authors: Xiaolin Li, Ning Wang, Ninghui Li, Wenhai Sun
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments:

Abstract:Prior research suggests that differential privacy (DP) inherently enhances the robustness of federated learning (FL) against backdoor attacks. In this paper, we challenge this assumption. Through an empirical analysis of two baseline attack strategies, we uncover a fundamental tension in DP-FL: while bypassing DP allows state-of-the-art defenses to detect and filter malicious updates, complying with DP inadvertently masks their distinguishing statistical characteristics. Consequently, existing defenses become ineffective as DP reduces the raw backdoor signal. Building on this masking effect, we propose RING, a novel attack that explicitly exploits DP to conceal malicious contributions while maximizing attack impact. By collaboratively crafting adversarial perturbations, compromised clients reconstruct a strong backdoor signal during aggregation without triggering anomaly detection. RING operates as a perturbation layer that is agnostic to the underlying backdoor technique, making it broadly applicable and composable with existing attacks – a property that significantly amplifies the threat it poses to DP-FL. Extensive evaluations across four image and text datasets under non-iid distributions show that RING achieves an average attack success rate of 90.3% against six state-of-the-art defenses under a moderate privacy budget, an improvement of up to 26.08x over baseline strategies. Finally, we evaluate potential countermeasures and find that mitigating this threat incurs significant utility trade-offs, exposing a fundamental security gap in the deployment of differentially private FL.

[LG-2] ExpRL: Exploratory RL for LLM Mid-Training

Link: https://arxiv.org/abs/2606.17024
Authors: Violet Xiang, Amrith Setlur, Chase Blagden, Nick Haber, Aviral Kumar
Subjects: Machine Learning (cs.LG)
Comments:

Abstract: Sparse reward reinforcement learning (RL) has become a standard tool for improving LLM reasoning, but its success depends critically on the coverage present in the base model. In practice, models are often primed for RL through mid-training on curated reasoning traces that teach useful primitive skills such as decomposition, verification, or self-correction. Although effective, this strategy requires manually specifying what the model should learn, and it remains unclear whether such primitive coverage is enough for much harder problems, which require combining these skills into broader solution strategies. We study a more automated approach: RL-based mid-training using large corpora of human-written question-answer data. Rather than treating reference solutions as targets to imitate, our method, ExpRL, uses them as reward scaffolds: references are hidden from the policy and used only to construct problem-specific grading rubrics for judging on-policy reasoning traces. The policy samples from the original problem prompt, while an LLM judge compares the sampled reasoning trace against the reference solution and assigns outcome-level or process-level dense rewards. This lets ExpRL reinforce partial progress, useful intermediate reductions, and productive reasoning behaviors that sparse final-answer rewards often fail to upweight. On challenging math reasoning tasks, ExpRL yields stronger RL priming than SFT, sparse-reward GRPO, and self-distillation, and provides a better initialization for subsequent sparse-reward RL. Additional mixed-domain experiments further suggest that ExpRL can extend beyond the original math-only setting.

[LG-3] Filtered Conformal Ellipsoids for Graph-Native Time Series

Link: https://arxiv.org/abs/2606.17014
Authors: Yannick Limmer
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments:

Abstract: Joint prediction sets for multivariate time series should control a single event while adapting to cross-coordinate dependence. We study filtered conformal ellipsoids: a frozen state-space filter emits a one-step predictive mean and covariance, and split-conformal calibration is applied to the resulting Mahalanobis scores. The filter is used to choose the ellipsoid shape; conformal calibration chooses the scalar radius, so the construction benefits from a learned predictive covariance without relying on Gaussian tail probabilities for coverage. The main difficulty is that filtered scores are dependent and learned recurrent filters need not contract in their raw hidden state; we therefore analyse contraction in an observable predictive-law quotient that identifies hidden states producing the same future sequence of emitted Gaussian laws. Under a stable Bayes Gaussian-projection filter, covariance bounds, and a finite-horizon observability Fisher condition, small excess Gaussian negative log-likelihood implies contraction of the learned emitted laws. Combined with a threshold-autocovariance envelope this yields a Chebyshev-type approximate coverage bound for filtered split-conformal prediction under dependence; a sharper Bernstein-type bound requires an additional geometric-mixing concentration assumption. Under Gaussian oracle realisability we also obtain a near-oracle log-volume comparison within the class of conditionally valid Gaussian ellipsoid rules. We instantiate the framework with a GCN-GRU filter with diagonal-plus-low-rank covariance. On moderate-size graph-native traffic benchmarks (METRLA-20 and PEMSBAY-50), the learned filter gives sharper at-target ellipsoids than static-covariance and non-filter baselines; at full-graph scale and on non-graph-native datasets, factor and copula baselines can be stronger.
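
The division of labour described above (the filter picks the ellipsoid shape, conformal calibration picks the radius) can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not the paper's code: observations are 2-D so the covariance inverse has a closed form, the function names are hypothetical, and the standard split-conformal quantile is used without the dependence corrections the paper analyses.

```python
import math

def mahalanobis_sq(y, mean, cov):
    """Squared Mahalanobis distance for a 2-D point (closed-form 2x2 inverse)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = y[0] - mean[0], y[1] - mean[1]
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det

def conformal_radius(calibration_scores, alpha):
    """Split-conformal quantile: the ceil((n + 1)(1 - alpha))-th smallest score."""
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(calibration_scores)[rank - 1]

def in_ellipsoid(y, mean, cov, radius):
    """True if y lies inside the ellipsoid {y : score(y) <= radius}."""
    return mahalanobis_sq(y, mean, cov) <= radius
```

In use, each calibration step would contribute one score computed from the filter's predicted mean and covariance at that step, and the resulting scalar radius would then be reused with every new predicted mean and covariance.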

[LG-4] ROVE: Unlocking Human Interventions for Humanoid Manipulation via Reinforcement Learning

Link: https://arxiv.org/abs/2606.17011
Authors: Wei Xiao, Weiliang Tang, Yuying Ge, Hui Zhou, Yao Mu, Li Zhang, Yixiao Ge
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:

Abstract:Human interventions provide crucial corrective signals for post-training Vision-Language-Action (VLA) models. However, enabling seamless humanoid interventions is a formidable systems challenge due to complex whole-body kinematics and dexterous-hand control. Consequently, the collected intervention trajectories are often suboptimal, and methods that rely on human interventions as expert supervision can absorb hesitant, inefficient, or even erroneous behaviors. To address both the system and algorithmic challenges, we propose ROVE, a reinforcement learning framework for humanoid VLA post-training with imperfect human interventions. First, ROVE introduces a human-in-the-loop pipeline capable of collecting deployment and intervention data for humanoid manipulation. Second, it utilizes Optimistic Value Estimation (OVE) to prioritize high-value behaviors from mixed-quality trajectories. To further robustify value estimation, we incorporate cross-embodiment human experience videos to provide rich supervision for long-tailed failure and recovery modes. The resulting critic yields informative advantage signals, steering the VLA actor to focus on high-value behaviors rather than indiscriminately imitating all actions. On challenging real-world contact-rich and fine-grained humanoid manipulation tasks, ROVE outperforms experience-learning baselines and consistently improves across multiple rollout-intervention iterations.

[LG-5] From Tokens to Policy: Causal and Interpretable Heterogeneous Treatment Effects Identification

Link: https://arxiv.org/abs/2606.17010
Authors: Riccardo Cadei, Frank Otchere, Nyasha Tirivayi, Gustavo Angeles Tagliaferro, Falco J. Bargagli-Stoffi, Francesco Locatello
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Heterogeneous Treatment Effect (HTE) identification is crucial to explain the impact of an intervention and optimize our policies accordingly. Existing approaches trade expressivity for interpretability, but, if some active heterogeneity drivers are unmeasured, methods at both ends of this spectrum allow for spurious HTE characterization with no causal reading. In this work, we focus on controlled experiments and argue that an oracle HTE causal characterization via the latent interactors is now within reach, thanks to (i) more extensive pre-treatment measurements, i.e., multi-modal and multi-view, and (ii) scalable representations with minimal human supervision. We then re-frame HTE identification as a Markov-blanket discovery problem on a sufficient and aligned pre-treatment representation, and introduce Neural EXposure Interaction Search (NEXIS), an iterative procedure with provable and empirically validated consistent selection. We deploy NEXIS on two anti-poverty programs in Africa, augmenting each with satellite imagery capturing previously unmeasured environmental effect modifiers, leading to novel, interpretable and prescriptive guidelines to optimize the programs’ next iterations.

[LG-6] The Complexity of Min-Max Optimization for Quadratic Polynomials

Link: https://arxiv.org/abs/2606.17000
Authors: Martino Bernasconi, Matteo Castiglioni, Andrea Celli, Alexandros Hollender
Subjects: Computational Complexity (cs.CC); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Abstract:We prove that computing approximate stationary points of min-max optimization over the hypercube is PPAD-hard for quadratic polynomials. This holds even when the polynomials are multilinear, each variable appears in at most three monomials, and the approximation factor is inverse polynomial. As a direct consequence, we obtain the first PPAD-hardness results for two-team zero-sum polymatrix games.

[LG-7] Analytic Torsion and Spectral Gap Capture Persistent-Laplacian Performance

Link: https://arxiv.org/abs/2606.16990
Authors: Jernej Grlj, Aaron D. Lauda
Subjects: Machine Learning (cs.LG); Algebraic Topology (math.AT)
Comments: 13 pages

Abstract: While persistent Laplacians (PL) offer a richer geometric representation of data than persistent homology, utilizing their full eigenspectrum for learning tasks is often hampered by high dimensionality and the "varying length" problem across different filtration scales. We propose a compact spectral representation that distills the persistent Laplacian into three mathematically grounded invariants: Betti numbers, the spectral gap, and analytic torsion. Across benchmark datasets including MNIST, QM-3D, and SKEMPI WT, we demonstrate that this reduced feature space captures the essential predictive signal of the full spectrum, and in some cases outperforms it, while significantly reducing computational overhead and preventing the noise introduced by higher-frequency eigenvalues. Our results suggest that these invariants provide a principled, fixed-length interface between spectral geometry and topological learning.

[LG-8] Agent trajectories as programs: fingerprinting and programming coding-agent behavior

Link: https://arxiv.org/abs/2606.16988
Authors: Hamidah Oderinwale
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments:

Abstract:Benchmark scores tell you what an agent got right; they do not tell you how it got there. In this work, we introduce methods for comparing agents procedurally in different contexts, where the model, tasks, and approaches vary. We compare ten agents and find that they are identifiable by their behavioral habits, which we define as fingerprints: a probe over these procedural signatures attributes an unseen trajectory to the correct agent at 85.7% accuracy, controlling for leakage across tasks. We develop procedural representations for agent problem-solving procedures with an emergent vocabulary induction technique that is meant to be maximally compressive to avoid surface-level variation while being expressive enough to unveil the quirks of the models’ patterns. We apply our framework to the software engineering evaluation dataset SWE-Bench to study the structural distinctness of agent trajectories and find that behavior is most similar between models from similar release periods and those that are distilled from one another (e.g., a distilled student model and its teacher have a Jensen-Shannon divergence of 0.25, about half the distance between other model pairs). As more models saturate evaluations, we believe that it will be important to probe model behavior along more holistic dimensions than success rates alone. We introduce ProcGrep, a library for auditing and evaluating agents for how they approach tasks at a procedural level given their traces in a top-down fashion. We believe this work has a range of applications to help developers work with and program coding agents, such as task-aware model routing, agent monitoring, and finer-grained cost analysis.
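
The Jensen-Shannon divergence quoted above compares two probability distributions. A minimal sketch of how it could be computed between two agents' action-frequency profiles follows. The action labels (`read`, `edit`, `test`) are hypothetical stand-ins for the paper's induced vocabulary, and the base-2 logarithm is an assumption: the abstract does not state which base its 0.25 figure uses.

```python
import math
from collections import Counter

def distribution(actions):
    """Turn a trajectory (list of action labels) into relative frequencies."""
    counts = Counter(actions)
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint support."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mixture(a):
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)

# Hypothetical trajectories from two agents.
agent_a = distribution(["read", "read", "edit", "test"])
agent_b = distribution(["read", "edit", "edit", "test"])
print(js_divergence(agent_a, agent_b))
```

Unlike the KL divergence, this measure is symmetric and stays finite when one agent never uses an action that the other does.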

[LG-9] Decoupling Inference from State Updates in Low-Latency Feature Engines via Probabilistic Thinning

Link: https://arxiv.org/abs/2606.16981
Authors: Augusto Peres, Iker Perez, Pedro Valdeira, Guilherme Jardim, Ana Sofia Gomes, Hugo Ferreira, Pedro Bizarro
Subjects: Databases (cs.DB); Machine Learning (cs.LG)
Comments:

Abstract:Streaming data systems increasingly underpin Machine Learning workflows that maintain large numbers of continuously updated aggregations. In production settings, each incoming event typically triggers read-modify-write operations to persistent storage, making high-frequency state updates a dominant source of latency, contention, and operational cost. In this work, we decouple inference from state persistence in streaming Machine Learning pipelines via probabilistic thinning: every event is scored, but durable state updates are selectively triggered by informative events. Unlike approaches that shed input or state, we show that persistence-path control is achievable without a high-frequency in-memory control plane or cross-worker coordination, relying exclusively on approximate statistics retrieved from disk-backed key-value stores. We model the resulting stochastic processes, derive bounds on filtering rates, and prove that common time-based aggregations remain unbiased under variance-aware formulations, preventing systemic error accumulation. We evaluate the approach in a controlled setting that isolates per-event costs, demonstrating substantial reductions in storage Input/Output and serialization overhead. Across experiments, up to 90% of events are excluded from the persistence path while preserving and in some cases improving downstream utility.
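
To see why an aggregation can stay unbiased when most events skip the persistence path, consider the simplest possible thinning rule. The sketch below is a deliberately simplified stand-in, not the paper's scheme: it keeps each event independently with a fixed probability and reweights by the inverse of that probability, whereas the abstract describes triggers driven by how informative an event is. The function name and parameters are hypothetical.

```python
import random

def thinned_sum(values, keep_prob, rng):
    """Persist each event with probability keep_prob, reweighted by 1 / keep_prob.

    Returns the running-sum estimate and the number of storage writes performed.
    The estimate equals the true sum in expectation because each event
    contributes value / keep_prob with probability keep_prob.
    """
    estimate = 0.0
    writes = 0
    for value in values:
        if rng.random() < keep_prob:
            estimate += value / keep_prob  # inverse-probability weighting
            writes += 1
    return estimate, writes

events = [1.0] * 10_000
estimate, writes = thinned_sum(events, keep_prob=0.1, rng=random.Random(0))
print(estimate, writes)
```

The trade-off is variance: fewer writes mean a noisier aggregate, which is presumably why the abstract emphasises variance-aware formulations rather than a fixed keep probability.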

[LG-10] Scalable Pairwise Kernel Learning with Stochastic Vec Trick

Link: https://arxiv.org/abs/2606.16979
Authors: Napsu Karmitsa, Tapio Pahikkala, Antti Airola
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Pairwise learning is a specialized form of supervised learning that focuses on predicting outcomes for pairs of objects. In this work, we introduce SPaiK, a new scalable kernel learning method tailored for pairwise settings. Our approach preserves the expressive power of kernel methods while substantially reducing computational and memory requirements. The key innovation is the stochastic generalized vec trick (sGVT), a stochastic extension of the sparse Kronecker product multiplication algorithm, which enables efficient large-scale training with pairwise kernels. By incorporating sGVT, SPaiK makes it possible to apply kernel-based pairwise learning to datasets of a size previously out of reach. We evaluate the performance of SPaiK on seven real-world drug-target affinity datasets and compare the results with state-of-the-art methods in pairwise learning.

[LG-11] Task-Error Residual Learning for Real-Robot Five-Ball Juggling

Link: https://arxiv.org/abs/2606.16978
Authors: Kai Ploeger, Jan Peters
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Submitted to the 2026 International Symposium on Robotics Research (ISRR)

Abstract:For residual learning that refines existing behavior, sample efficiency depends on two things: how much information each rollout returns, and how efficiently the learner uses that information. Reinforcement learning’s standard scalar reward carries far less information than the directional task error that defines the task. Random exploration further discards whatever information each rollout returns. Through residual learning with directional task-error supervision and a task error model that drives sample selection, we achieve stable three-, four-, and five-ball juggling on anthropomorphic Barrett WAM arms. Despite planning and controlling through a simple, idealized stack, the system converges from the second attempt. The first attempt drops, after which task error decreases monotonically without further failures. In comparison, five-ball juggling typically takes humans years of practice. We compare residual learners across two ternary axes, the directional information in the learning feedback and the commitment of the analytic prior, spanning Newton-style Jacobian updates, Composite Bayesian Optimization, and stochastic search methods. Both axes prove necessary: neither directional feedback nor an informative prior suffices alone, and the simplest method that combines them, a fixed-Jacobian Newton update, is the most reliable. The learned residual tolerates substantial prior misalignment and degraded joint tracking, affecting mainly convergence speed. The bottleneck for residual learning on real robots is therefore the information content of the supervision signal and how the learner uses it, not the accuracy of the surrounding stack. Video documentation of all experiments is available at this https URL.

[LG-12] Beyond the Smile: A Hybrid Convolutional VAE for Crypto Volatility Surfaces

Link: https://arxiv.org/abs/2606.16961
Authors: Sadanand Singh, Allam Reddy, Manan Chopra
Subjects: Machine Learning (cs.LG); Computational Finance (q-fin.CP)
Comments:

Abstract: We present a convolutional variational autoencoder for cryptocurrency implied-volatility surfaces, together with a deployable predictor that combines it with a quadratic smile re-fit through a deterministic per-tenor routing rule. Trained on 6,034 fully-filled hourly Binance Options surfaces of BTC and ETH spanning May-October 2023 and parameterised on a common $6 \times 7$ tenor-delta grid, the model attains a hidden-cell surface-completion RMSE in the 0.94-1.56 vol-point range across both markets and mask rates 10-50%. The hybrid predictor attains 0.83 vol points at 50% masking against 7.00 for the smile re-fit alone, an eightfold reduction obtained at no additional inference cost. Under structurally-correlated hole patterns that emulate the withdrawal of an entire tenor of strikes, the smile re-fit incurs 9.6-13.1 vol points of error while the learned model remains at 1.5-1.9, isolating a regime in which the generative model is the only viable predictor. Joint training on BTC and ETH improves the in-distribution model on both markets by 9-27% relative to the better-performing single-symbol counterpart, indicating a substantially shared vol-surface manifold across the two largest cryptocurrencies over the observation window. The hybrid is calendar- and butterfly-arbitrage-free at the listed strikes, a property that the parametric smile re-fit alone fails at high mask rates. The per-snapshot reconstruction error of the trained model flags the late-October ETF-anticipation rally and the August 17, 2023 flash crash as elevated-error periods without supervision. All training and evaluation infrastructure is released to support reproducible follow-on work.

[LG-13] Factorized Neural Operators Decompose Dynamic and Persistent Responses

Link: https://arxiv.org/abs/2606.16900
Authors: Hao Tang, Yuechen Duan, Jiongyu Zhu, Zimeng Feng, Hao Li, Chao Li
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Physical systems often exhibit heterogeneous mechanisms, where rapidly evolving dynamics coexist with persistent structures. Capturing such multiscale physical behavior remains challenging for existing neural operators, which typically rely on single dominant inductive bias and therefore couple distinct physical responses into a shared representation. We introduce the Unified Green’s Function Framework across domains and propose the Factorized Neural Operators (FaNO), which decompose spectral representations into equivariant dynamic responses and invariant persistent responses, leading to better interpretability and generalization. Mechanistically, we show that the two operator branches spontaneously specialize into distinct physical roles that remain consistent across scales and domains: the equivariant branch captures rapidly varying transient dynamics, whereas the invariant branch extracts coherent persistent structures. This factorized mechanism of FaNO improves prediction accuracy, parameter efficiency and cross-scale generalization across physical systems and domains. In particular, it maintains consistent predictions under long-horizon autoregressive rollout, cross-resolution extrapolation and physical-regime shifts. These findings suggest that scalable physical modeling may benefit from moving beyond single-inductive-bias formulations toward factorized operator representations that better reflect the heterogeneous organization of physical systems, accelerating the reliable deployment of machine learning for scientific computing and discovery.

[LG-14] Fantastic Pretraining Optimizers and Where to Find Them II: Hyperball Optimization

Link: https://arxiv.org/abs/2606.16899
Authors: Kaiyue Wen, Xingyu Dang, Kaifeng Lyu, Tengyu Ma, Percy Liang
Subjects: Machine Learning (cs.LG)
Comments: Corresponding blog post: this https URL

Abstract:Matrix based optimizers such as Muon can substantially speed up language model pretraining, but their gains over AdamW are observed to shrink as model size and data scale grow when using standard constant decoupled weight decay. We propose Hyperball, a simple optimizer wrapper that addresses this issue. Given a base optimizer such as Adam or Muon, Hyperball sets the Frobenius norms of weight matrices and their corresponding optimizer updates to fixed constants. On Qwen3 style models up to 1.2B parameters, Muon Hyperball achieves 20–30% token equivalent speedup over weight decay baselines. Hyperball also improves learning rate transfer across widths and depths compared to decoupled weight decay. This method is motivated by prior theory showing that training with weight decay leads to an equilibrium weight norm that only depends on the training hyperparameters. Through this mechanism, the weight decay then decides the angular learning rate, i.e. how fast the direction of the weight matrix changes.
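
One way to read "sets the Frobenius norms of weight matrices and their corresponding optimizer updates to fixed constants" is as a rescale-step-rescale wrapper around the base optimizer. The sketch below follows that reading only. It is an assumption, not the authors' algorithm: the function names, the order of operations, and the use of plain nested lists instead of tensors are all illustrative.

```python
import math

def frobenius_norm(matrix):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in matrix for x in row))

def rescale(matrix, target_norm):
    """Return a copy of matrix scaled to the requested Frobenius norm."""
    norm = frobenius_norm(matrix)
    if norm == 0.0:
        return [row[:] for row in matrix]  # nothing to rescale
    scale = target_norm / norm
    return [[x * scale for x in row] for row in matrix]

def hyperball_step(weights, base_update, lr, weight_norm, update_norm):
    """Apply a fixed-norm update, then put the weights back on a fixed-norm sphere.

    base_update is whatever the wrapped optimizer (e.g. Adam or Muon) proposes.
    """
    update = rescale(base_update, update_norm)
    moved = [
        [w - lr * u for w, u in zip(w_row, u_row)]
        for w_row, u_row in zip(weights, update)
    ]
    return rescale(moved, weight_norm)
```

Under this reading only the direction of the base update matters, so the learning rate directly controls how far the weight matrix rotates per step, which matches the abstract's remark about the angular learning rate.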

[LG-15] Integrated Marketing Attribution: A Bayesian Framework for Privacy-Safe Granular Measurement Anchored in MMM

Link: https://arxiv.org/abs/2606.16878
Authors: Meghana R. Bhat, Ankit Umare, Utsav Aggarwal, Richard Vecsler, Arunkumar Mani, Karthik Nair, Chandhu Nair
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Retail marketing measurement increasingly requires granular campaign-level insights without relying on user-level tracking. However, the two dominant approaches, Marketing Mix Modeling (MMM) and Multi-Touch Attribution (MTA), often produce fragmented insights. MMM is privacy-safe and robust for channel-level planning but is too coarse for campaign optimization, while MTA provides granular attribution but has become less reliable under increasing privacy restrictions. We propose Integrated Marketing Attribution (IMA), a unified framework that combines MMM with channel specific Bayesian attribution models to derive campaign-level effects from aggregated data. By leveraging MMM-informed priors, IMA delivers granular, privacy-safe attribution while preserving consistency with MMM.

[LG-16] HawkesNest: A Multi-Axis Synthetic Benchmark for Spatiotemporal Pattern Complexity

Link: https://arxiv.org/abs/2606.16863
Authors: Yahya Aalaila, Sumantrak Mukherjee, Gerrit Großmann, Sebastian Vollmer
Subjects: Machine Learning (cs.LG)

Abstract:Evaluation of spatiotemporal point process (STPP) models relies heavily on opaque real-world datasets, where latent generative structure is unknown and model failures are difficult to attribute. We introduce HawkesNest, a generator-aligned benchmark for controlled spatiotemporal pattern complexity built on a multivariate Hawkes backbone. HawkesNest defines four complexity axes: space–time entanglement, background heterogeneity, cross-type interaction, and domain topology. Each axis is associated with a deterministic index computed from the latent data-generating mechanism. By varying these axes while holding global rate, stability, and simulation budget fixed, HawkesNest enables diagnostic stress tests of STPP models under known structural difficulty. We verify that the indices are monotone and nearly orthogonal under controlled sweeps. We illustrate its use by showing that Hawkes-family baselines degrade under joint heterogeneity–entanglement complexity, even though they are structurally aligned with the Hawkes data-generating backbone. We further show that HawkesNest exposes neural-model sensitivity: AutoSTPP remains vulnerable under isolated increases in space–time entanglement. Code is available at this https URL.

[LG-17] We Need Explanation Cards to Connect Explanation Algorithms to the Real World

Link: https://arxiv.org/abs/2606.16786
Authors: Eric Günther, Balázs Szabados, Kristof Meding, Gunnar König, Sebastian Bordt, Ulrike von Luxburg
Subjects: Machine Learning (cs.LG)

Abstract:Algorithmic explanations are intended to help stakeholders understand opaque algorithmic decisions, but in practice, they often fall short. First, the meaning of algorithmic explanations is often not what one might intuitively expect, so expert knowledge is required to interpret them correctly. Second, recent work has shown that popular explanation algorithms are uninformative about the behavior of complex decision functions. Together, these issues create a gap between what explanations appear to convey and what they actually provide. In this work, we propose Explanation Cards for Explanation Algorithms, which augment standard explanations with complementary information about robustness and validity, as well as clear instructions for interpretation. The complementary information can render otherwise uninformative explanations practically useful, while also helping to detect cases where they are not. Importantly, the interpretation instructions in explanation cards shift responsibility from users to providers: Rather than expecting users to recognize what can and cannot be concluded from an explanation, providers must make this explicit upfront. Using counterfactual explanations and SHAP as examples, we demonstrate how providers can construct explanation cards and that these cards provide users with the guidance needed for sound interpretation. We further argue that explanation cards offer a practical means of operationalising the explainability provisions of the EU AI Act. Overall, explanation cards are a significant step toward making explanation algorithms fit for real-world use cases.

[LG-18] GD2PO: Mitigating Multi-Reward Conflicts via Group-Dynamic reward-Decoupled Policy Optimization

Link: https://arxiv.org/abs/2606.16771
Authors: Haotian Liu, Yihao Liu, Jingwei Ni, Siyuan Huang, Xinpeng Liu, Pengyu Cheng, Jiajun Song, Ruijin Ding, Junfeng Li, Zhechao Yu, Mengyu Zhou, Hongteng Xu, Xiaoxi Jiang, Guanjun Jiang
Subjects: Machine Learning (cs.LG)
Comments: 24 pages, 9 figures

Abstract:As LLMs advance, post-training reinforcement learning (RL) increasingly relies on multi-dimensional rewards to cultivate comprehensive capabilities. This shift demands new algorithms capable of optimizing diverse and potentially competing objectives simultaneously. To address this, existing methods such as Group reward-Decoupled Policy Optimization (GDPO) decompose the overall score into independent reward groups, then compute the RL loss separately within each group. However, this strategy still encounters multi-reward conflicts: a single rollout can yield positive advantages on certain reward dimensions but negative ones on others, causing opposing signals to cancel each other out during aggregation, further hindering RL training efficiency. Inspired by Dynamic sAmpling Policy Optimization (DAPO), which improves RL training efficiency by filtering out ineffective rollouts with near-zero advantages, we propose Group-Dynamic reward-Decoupled Policy Optimization (GD$^2$PO). Specifically, GD$^2$PO employs a conflict-aware filtering mechanism to mask out rollouts suffering from severe reward-wise disagreement. By preventing conflicting signals from canceling each other out, this masking strategy preserves and enhances the magnitude of effective RL advantages, thereby significantly accelerating learning efficiency. Furthermore, we introduce query-level reweighting to dynamically adjust the update intensity of each query based on its overall reward consensus. Experiments on various multi-reward scenarios, including tool calling and human preference alignment, demonstrate that GD$^2$PO consistently and significantly outperforms existing baselines. The code is available at this https URL.
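
A small sketch may help picture what masking rollouts with reward-wise disagreement could look like. The abstract does not state the actual filtering criterion, so the disagreement measure `conflict_score`, the `threshold` value and both function names below are hypothetical choices made for illustration only.

```python
def conflict_score(advantages, eps=1e-8):
    """Hypothetical disagreement measure in [0, 1]: 0 when all per-reward
    advantages share a sign, 1 when they cancel exactly."""
    total_abs = sum(abs(a) for a in advantages)
    if total_abs < eps:
        return 0.0
    return 1.0 - abs(sum(advantages)) / total_abs

def conflict_mask(rollout_advantages, threshold=0.5):
    """Return 1 for rollouts to keep and 0 for rollouts to mask out."""
    return [0 if conflict_score(adv) > threshold else 1
            for adv in rollout_advantages]

# Each inner list holds one rollout's advantages, one entry per reward group.
rollouts = [
    [0.8, 0.6],    # both rewards agree (positive): keep
    [0.9, -0.8],   # strong disagreement: mask
    [-0.5, -0.2],  # both rewards agree (negative): keep
]
mask = conflict_mask(rollouts)
```

Without the mask, the second rollout would contribute an aggregated advantage of only about 0.1, which is the cancellation effect the abstract describes.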

[LG-19] Taming Curvature: Architecture Warm-Up for Stable Transformer Training

Link: https://arxiv.org/abs/2606.16768
Authors: Sameera Ramasinghe, Ajanthan Thalaiyasingam, Hadi Mohaghegh Dolatabadi, Chamin Hewa Koneputugodage, Gil Avraham, Violetta Shevchenko, Yan Zuo, Karol Pajak, Alexander Long
Subjects: Machine Learning (cs.LG)

Abstract:Training billion-parameter Transformers is often brittle, with transient loss spikes and divergence that waste compute. Even though the recently developed Edge of Stability (EoS) theory provides a powerful tool to understand and control the stability of optimization methods via the (preconditioned) curvature, these curvature-controlling methods are not popular in large-scale Transformer training due to the complexity of curvature estimation. To this end, we first introduce a fast online estimator of the largest (preconditioned) Hessian eigenvalue (i.e., curvature) based on a warm-started variant of power iteration with Hessian-vector products. We show theoretically, and verify empirically, that the proposed method makes per-iteration curvature tracking feasible at billion-parameter scale while being more accurate. Using this tool, we find that training instabilities coincide with surges in preconditioned curvature and that curvature grows with depth. Motivated by these observations, we propose architecture warm-up: progressively growing network depth to carefully control the preconditioned Hessian and stabilize training. Experiments on large Transformers validate that our approach enables efficient curvature tracking and reduces instabilities compared to existing state-of-the-art stabilization techniques without slowing down convergence.
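
The benefit of warm-starting power iteration can be seen in a toy sketch. This is generic power iteration on a small explicit matrix, not the paper's estimator: the 2x2 "Hessian", the helper names and the iteration counts are assumptions, preconditioning is ignored, and in real training the matrix-vector product would be replaced by an automatic-differentiation Hessian-vector product.

```python
import math

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def make_hvp(matrix):
    """Stand-in for a Hessian-vector product oracle."""
    return lambda vec: matvec(matrix, vec)

def normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def top_eigenvalue(hvp, start, num_iters):
    """Power iteration using only Hessian-vector products.
    Returns the Rayleigh-quotient estimate and the final unit vector."""
    vec = normalize(start)
    for _ in range(num_iters):
        vec = normalize(hvp(vec))
    hv = hvp(vec)
    return sum(a * b for a, b in zip(vec, hv)), vec

# Toy symmetric "Hessian" with eigenvalues 3 and 1: cold start, many iterations.
eig_cold, vec = top_eigenvalue(make_hvp([[2.0, 1.0], [1.0, 2.0]]),
                               [1.0, 0.0], num_iters=50)

# One step later the Hessian has drifted slightly (eigenvalues 3.1 and 1.1).
# Reusing the previous eigenvector means a couple of iterations suffice.
eig_warm, vec = top_eigenvalue(make_hvp([[2.1, 1.0], [1.0, 2.1]]),
                               vec, num_iters=2)
```

Because the leading eigenvector usually changes little between consecutive optimizer steps, the warm start keeps the per-step cost at a few Hessian-vector products.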

[LG-20] A Validated LBM Dataset and Pipeline for Surrogate Modeling of Turbulent 3D Obstructed Channel Flows

Link: https://arxiv.org/abs/2606.16765
Authors: Lukas Schröder, Shubham Kavane, Harald Köstler
Subjects: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Comments: 4 pages + appendix, 9 figures, Accepted at the 1st Workshop on Differentiable Systems and Scientific Machine Learning (SysDiff) @ EurIPS 2025, OpenReview: this https URL

Abstract:Evaluating neural operators for 3D turbulent flow requires validated datasets with physical benchmarks. We present a reproducible pipeline generating training data for 3D channel flows around generated geometries at Re=1,000-10,000. Our lattice Boltzmann solver with cumulant collision operators is rigorously verified against experimental measurements (Strouhal number, drag coefficients, turbulent fluctuations) with comprehensive grid convergence studies at resolution 1024x512x512. Building upon an established framework, this validated pipeline enables standardized surrogate model comparison. We outline planned systematic evaluation of Fourier Neural Operator and U-Net variants on forecasting, super-resolution, and error correction tasks, using physics-informed metrics to assess turbulent energy cascade representation. Future work will compare computational efficiency between numerical solvers and neural surrogates, exploring practical application. We seek community feedback on our validation approach, planned benchmark methodology, and evaluation priorities for neural operators in turbulent flows.

[LG-21] Cross-Silo De-Anonymization Under Local Differential Privacy: Threat Model, Phase Transition, and Coordination Necessity

Link: https://arxiv.org/abs/2606.16763
Authors: Ziniu Liu, Aiping Li
Subjects: Cryptography and Security (cs.CR); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 23 pages, 4 figures

Abstract:When a person’s records appear in k independent data silos, each protected by (epsilon, delta)-differential privacy, standard composition yields a valid (k·epsilon, k·delta)-DP guarantee for the joint output. This worst-case bound, however, does not answer the concrete inference question: at what k can an adversary actually identify a target person? This paper develops the information-theoretic framework needed to answer that question. We introduce cross-silo person-level DP (XSP-DP), a Pufferfish-style privacy notion whose adjacency relation captures all records of a single person across all silos simultaneously, and verify that the standard basic composition bound carries over to this adjacency model. Within this framework we prove that de-anonymization undergoes a phase transition at k* = Theta(log n / epsilon^2) (population size n, per-silo RR parameter epsilon): a Fano lower bound shows any estimator fails for k below k*, while a matching maximum-likelihood upper bound shows the attack succeeds for k above k*. An explicit XOR + randomized-response construction demonstrates information synergy: each silo’s output is individually uninformative about the target, yet the joint mutual information is strictly positive. For non-coordinated binary randomized-response mechanisms, we prove that de-anonymization is inevitable once k exceeds the threshold, establishing that cross-silo coordination is necessary. These results provide a baseline threat model and Theta-level threshold for cross-silo inference attacks under local DP.
MSC classes: 68P25, 62B10, 94A15; ACM classes: K.4.1; G.3
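
For readers unfamiliar with the per-silo mechanism, the following sketch shows standard binary randomized response and how the threshold scaling log n / epsilon^2 behaves. The function names are hypothetical, and `threshold_scale` leaves out the constant hidden in the Theta notation, so its values indicate scaling only, not the paper's actual threshold.

```python
import math
import random

def keep_probability(epsilon):
    """Probability of reporting the true bit under binary randomized response."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))

def randomized_response(bit, epsilon, rng):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it."""
    return bit if rng.random() < keep_probability(epsilon) else 1 - bit

def threshold_scale(n, epsilon):
    """log(n) / eps^2: the scaling of k* up to an unspecified constant."""
    return math.log(n) / epsilon ** 2

rng = random.Random(0)
reports = [randomized_response(1, epsilon=1.0, rng=rng) for _ in range(10)]

# Halving epsilon quadruples the number of silos the scaling calls for.
ratio = threshold_scale(10**6, 0.5) / threshold_scale(10**6, 1.0)
```

The ratio of keep to flip probability equals e^epsilon, which is what makes a single report epsilon-locally differentially private; the abstract's point is that many such reports about the same person can still add up.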

[LG-22] Maximum Entropy Inverse Reinforcement Learning for Mean-Field Games with Average Reward

Link: https://arxiv.org/abs/2606.16759
Authors: Şevket Kaan Alkır, Naci Saldı, Berkay Anahtarcı, Can Deha Karıksız
Subjects: Machine Learning (cs.LG)
Comments: 49 pages, 2 figures, 2 tables

Abstract:We study inverse reinforcement learning for discrete-time, infinite-horizon mean-field games (MFGs) under an average-reward criterion. Expert demonstrations are assumed to arise from a stationary mean-field equilibrium under an unknown reward, and the goal is to recover a policy explaining the observed behaviour via the maximum causal entropy principle. We formulate the inverse problem by enforcing consistency with the expert mean-field term and long-run feature expectations, treating two reward classes within a unified occupation-measure framework. For finite-dimensional linear rewards, we give a convex dual reformulation with an explicit log-partition objective, and prove smoothness and curvature properties justifying constant-step-size gradient descent. For infinite-dimensional RKHS rewards, we develop a Lagrangian relaxation whose inner-maximising policy is characterised by a soft Bellman equation. The main obstacle is the absence of a discount-factor contraction. We resolve this by introducing a minorisation-based sub-stochastic kernel that yields a strict contraction of the soft Bellman operator. We establish Fréchet differentiability and Lipschitz smoothness of the log-likelihood score, leading to a gradient ascent algorithm with convergence guarantees. Two numerical examples, a malware-spread MFG and an RKHS-based consumer-choice model, show that the recovered policies closely match expert behaviour.

[LG-23] STAR-NT: Spatiotemporal Acceleration of Real-Time Neural Transparency Rendering

Link: https://arxiv.org/abs/2606.16747
Authors: Grigoris Tsopouridis, Christos Georgiou-Mousses, Aris Panagiotidis, Andreas Vasilakis, David Corrigan, Tobias A. Franke, Aleksei Gorbonosov, Andrei Astapov, Ioannis Fudos
Subjects: Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Supplemental material at this https URL

Abstract:Neural order-independent transparency delivers high-quality rendering of overlapping transparent surfaces, but its geometry passes and network input generation remain costly, particularly on mobile and legacy hardware. We present a spatiotemporal acceleration framework that exploits spatial and temporal coherence to reduce this overhead while preserving visual quality. Spatially, we use adaptive quadtree-based screen-space subdivision to scale geometry pass resolution according to local color variance. Temporally, selected frames reuse the previous transparency result through depth-based reprojection instead of full rendering. Together, these optimizations reduce rendering cost and integrate efficiently into existing real-time rendering pipelines.

[LG-24] Learning Policy from a Single Trajectory in Average-Reward Markov Decision Process

Link: https://arxiv.org/abs/2606.16729
Authors: Jongmin Lee, Ernest K. Ryu, Vaneet Aggarwal
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:While there is an extensive body of work characterizing the sample complexity of discounted cumulative-reward MDPs, finite sample analyses for average-reward MDPs have been limited, and most existing works rely on restrictive assumptions such as ergodicity or access to a generative model. In this work, we establish the first finite sample complexity guarantees from a single trajectory for weakly communicating average-reward MDPs. To this end, we study the dynamics of a single trajectory in weakly communicating MDPs and based on this analysis, we develop novel model-free methods. Notably, our value-based and policy-based methods provide finite sample complexity guarantees of $\widetilde{O}(1/\varepsilon^2)$ and $\widetilde{O}(1/\varepsilon^4)$ from a single trajectory in weakly communicating MDPs, respectively. Furthermore, we introduce the first model-free method that requires no prior knowledge of problem-dependent quantities for communicating MDPs.

[LG-25] Beyond Defensive Reporting: Machine Learning for Active Anti-Money Laundering Control in Insurance

Link: https://arxiv.org/abs/2606.16663
Authors: Dara Goldar, Geir Kjetil Ferkingstad Sandve, Martin Jullum
Subjects: Machine Learning (cs.LG)

Abstract:Money laundering through insurance claims poses a threat to insurers both through fraudulent payouts and reputational and regulatory risk. Despite this, little research has examined how such laundering can be prevented. This paper examines whether machine learning can help insurers flag suspicious claims before payout, shifting the focus from passive reporting to active prevention. Using production data from a major Norwegian insurer, we train gradient-boosted decision tree models to detect claims later reported to authorities for suspected money laundering. Because fraud and laundering may share behavioural patterns, we also examine whether insurance fraud labels can serve as an auxiliary training signal. We compare different learning setups using the Budget-Weighted Capture Rate, a metric introduced in this paper to measure how many laundering cases are captured when only a small share of claims can be manually reviewed. The results show that incorporating fraud-related investigation labels substantially improves laundering detection. The best-performing model captures nearly two-thirds of laundering cases within the top-ranked 2 to 6 percent of claims selected for investigation. To our knowledge, this is the first empirical study of machine learning for money laundering detection in insurance claims.

[LG-26] Near-Optimal Stochastic Linear Bandits with Delay

Link: https://arxiv.org/abs/2606.16656
Authors: Ofir Schlisselberg, Mengxiao Zhang, Yishay Mansour
Subjects: Machine Learning (cs.LG)

Abstract:We study stochastic linear bandits with delayed feedback under several delay models and establish near-optimal regret guarantees. Our results identify when delayed linear bandits exhibit the same qualitative behavior as multi-armed bandits (MAB), and when the linear structure creates fundamentally new challenges. Specifically, (1) for loss-independent delays, where the delay does not depend on the realized loss (but potentially depends on the arm), we show that delays incur only an additive regret penalty. Under stochastic delays, this penalty scales with the expected delay, while under adversarial delays, it scales with the maximum number of outstanding observations. Notably, both delay penalties are dimension-free, improving upon the state-of-the-art results; (2) for loss-dependent delays, we show that linear bandits are substantially harder than MAB: unlike in MAB, we prove matching (up to log factors) upper and lower bounds in linear bandits, whose delay penalty depends on the square root of the dimension; (3) for the delay-as-payoff model, a special case of loss-dependent delay, we show that the optimal MAB guarantee, which depends only on the delay of the optimal arm, is also unattainable in linear bandits. Together, these results provide a sharp characterization of how delayed feedback interacts with linear generalization.

[LG-27] Distribution Alignment for One-Shot Federated Learning via Optimal Transport ICML2026

Link: https://arxiv.org/abs/2606.16655
Authors: Daniele Berardini (1), Vito Paolo Pastore (1 and 2), Vittorio Murino (1 and 3) ((1) AI for Good (AIGO), Italian Institute of Technology, Genoa, Italy, (2) MaLGa-DIBRIS, University of Genoa, Genoa, Italy, (3) Department of Computer Science, University of Verona, Verona, Italy)
Subjects: Machine Learning (cs.LG)
Comments: Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

Abstract:One-Shot Federated Learning (OSFL) addresses extreme communication regimes in which clients interact with the server only once, amplifying the impact of heterogeneous client data distributions. In particular, the interaction of domain shift and label shift across clients induces misaligned feature representations that cannot be corrected through iterative optimization. Existing OSFL methods rely on distillation, server-side generation or ensemble-based aggregation, but assume aligned representations or address domain and label shift separately. We introduce SLOT-Align (Single-round, Learning-free Optimal Transport Alignment), a geometry-aware feature harmonization framework for OSFL. SLOT-Align uses a shared frozen encoder to extract compact feature statistics, constructs a global reference via Bures-Wasserstein barycenters, and aligns local representations using closed-form geodesic optimal transport maps. The method is computationally efficient and can be combined with existing OSFL pipelines relying on frozen encoders without modifying their training procedures. Extensive experiments across multiple benchmarks, pretrained backbones, and OSFL methods show that SLOT-Align consistently improves accuracy and robustness under joint domain and label shift.
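
The closed-form maps mentioned in the abstract are easiest to see in one dimension, where the Wasserstein-2 barycenter of Gaussians and the optimal transport map between two Gaussians reduce to simple arithmetic. The sketch below covers only that 1-D special case with made-up client statistics; the function names are hypothetical, and the paper's multivariate pipeline with feature covariances is not reproduced here.

```python
def barycenter_1d(means, stds, weights):
    """Wasserstein-2 barycenter of 1-D Gaussians: the weighted average of
    the means and the weighted average of the standard deviations."""
    mean = sum(w * m for w, m in zip(weights, means))
    std = sum(w * s for w, s in zip(weights, stds))
    return mean, std

def transport_map_1d(x, src_mean, src_std, dst_mean, dst_std, t=1.0):
    """Optimal transport map between 1-D Gaussians, evaluated at x.
    t in [0, 1] moves along the geodesic; t=1 is the full map."""
    target = dst_mean + (dst_std / src_std) * (x - src_mean)
    return (1.0 - t) * x + t * target

# Two hypothetical clients whose feature statistics differ.
client_means = [0.0, 4.0]
client_stds = [1.0, 3.0]
ref_mean, ref_std = barycenter_1d(client_means, client_stds, [0.5, 0.5])

# Align a feature value from client 0 to the shared reference.
aligned = transport_map_1d(1.0, client_means[0], client_stds[0],
                           ref_mean, ref_std)
```

Here the reference is N(2, 2^2), and a feature one standard deviation above client 0's mean lands one standard deviation above the reference mean. No training is involved, which matches the "learning-free" framing in the abstract.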

[LG-28] SPICE: Synergy and Partial Information Based Curriculum Evolution

Link: https://arxiv.org/abs/2606.16639
Authors: Ankush Pratap Singh, Houwei Cao, Yong Liu
Subjects: Machine Learning (cs.LG)

Abstract:Multimodal learning exploits complementary information across heterogeneous modalities. The informativeness of each modality can vary widely across samples and training stages. Existing multimodal curriculum learning strategies often assume that the relative complexity of samples remains unchanged throughout training and therefore cannot adapt to model evolution. We propose SPICE (Synergy and Partial Information based Curriculum Evolution), a novel progressive curriculum framework for multimodal interaction learning. Guided by Partial Information Decomposition (PID) theory, our approach decomposes multimodal interactions into redundant, unique, and synergistic information components, enabling an interpretable and dynamic characterization of sample complexity. Building on this decomposition, we design a progressive curriculum that evolves throughout training, allowing the model to transition from learning shared cross-modal cues to modality-specific patterns and, finally, to complex synergistic interactions. Adapting to model evolution, sample ordering is refined in real-time using PID information estimates derived from unimodal and multimodal predictions. Experiments across multiple multimodal benchmarks demonstrate consistent improvements over conventional training and state-of-the-art baselines, highlighting the effectiveness of PID information decomposition and adaptive sample ordering for multimodal curriculum learning.

[LG-29] Beyond Artifacts: Towards Generalizable Synthetic Song Detection via Music-Intrinsic Features

Link: https://arxiv.org/abs/2606.16612
Authors: Yan Han, Zhibin Wen, Yuan Wang, Shuangrun Shao, Xiaobing Li, Yang Xu, Wei Li
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM)

Abstract:The rapid advancement of AI music generators highlights the urgent need for reliable Synthetic Song Detection (SSD). Existing SSD methods often rely on low-level artifacts or fixed feature assumptions, struggling to capture generator-agnostic cues. To address this, we propose Sofia (Synthetic-song detection framework via music features), a flexible framework that models music-intrinsic attributes via feature-specific experts and an adaptive Mixture-of-Experts (MoE) module. By configuring Sofia with representative Vocal, Audio-effect, Global structure features, and their combinations, we present their individual and complementary contributions. To comprehensively evaluate our framework, we further construct MUSIC8K, a challenging benchmark featuring the latest emerging generators and realistic audio perturbations. Experiments show that Sofia learns generator-agnostic representations from music-intrinsic features, improving the F1 score by 18.5 points over the strongest baseline on MUSIC8K-O while maintaining strong robustness.

[LG-30] TCHG: Tri-Trust Conditioned Heterogeneous Graph Learning for Reliable Dynamic Trust Prediction

Link: https://arxiv.org/abs/2606.16611
Authors: Bohao Liao, Boyu Deng, Qipeng Song, Jieling Wang, Jingchao Wang
Subjects: Machine Learning (cs.LG)
Comments: 18 pages, 10 figures, 13 tables

Abstract:Trust prediction infers latent user-user trust relations and provides important support for social recommendation, fake-review and manipulation detection, and risk identification. Graph neural networks have become a prominent approach to trust prediction because of their ability to learn network structures and complex trust dependencies. However, existing methods often rely on a unified representation of trust signals and do not disentangle heterogeneous trust evidence into separate evidence channels, failing to exploit the distinct roles that different evidence channels should play during trust modeling. To address this gap, this paper argues that trust evidence should not be treated as an undifferentiated input, but should be decomposed and used as functional control factors over graph propagation. We propose TCHG, a tri-trust conditioned heterogeneous graph learning framework that decomposes trust evidence into three channels and assigns them distinct functional roles in propagation: entity reliability governs message admission, interaction-behavior reliability modulates propagation strength, and contextual trust adjusts the propagation mode through context-conditioned operator selection. Since the three evidence channels evolve at different temporal scales, TCHG maintains independent temporal states with non-uniform decay rates to prevent rapidly changing contextual signals from overwriting slowly accumulated entity reliability. It further predicts trust probability and calibrates the output probability, improving predictive confidence under sparse or conflicting evidence. Extensive experiments on multiple public trust datasets show that TCHG achieves effective and reliable trust prediction compared with representative trust prediction and heterogeneous graph baselines.

[LG-31] PhysGuard: Fisher-Guided Gradient Projection for Sim-to-Real Neural PDE Surrogates

Link: https://arxiv.org/abs/2606.16602
Authors: Changjian Zhou, Junfeng Fang, Negin Yousefpour, Peng Wu, Bin Yan, Guillermo A Narsilio
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)

Abstract:Neural operator models trained on simulation data often lose accuracy when applied to experimental measurements due to the sim-to-real gap. Standard fine-tuning with limited real data can reduce this gap, but it may also damage the core physics-relevant representations learned during pretraining. Although knowledge-preserving adaptation has been widely investigated in vision or language tasks, it remains unclear whether these methods are suitable for neural operators whose architectures and protected knowledge are fundamentally different. Neural operators need to preserve core-scale physical structures rather than semantic or visual features. We propose PhysGuard, a physics-preserving framework for accurate sim-to-real adaptation of neural operators. Specifically, PhysGuard uses the empirical Fisher Information Matrix computed on simulation data to identify physics-critical parameter directions, then restricts fine-tuning updates to directions that do not interfere with them. A layer-wise Gram-matrix formulation makes this efficient for models with millions of parameters, while an adaptive threshold automatically determines the protected subspace size. A spectral probe experiment shows that the dominant Fisher directions are strongly associated with low-frequency output structures. Experiments on benchmarks across four neural operator architectures and different physical systems show that PhysGuard performs strongly on most evaluation metrics compared to baselines. The benefits are most evident under severe domain shift, where it reduces low-frequency error by up to 32% compared to standard fine-tuning while maintaining adaptability. Our code is available at this https URL.
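
The core step of restricting updates to non-interfering directions is a gradient projection, sketched below in its simplest form. It assumes the protected directions are already given as orthonormal vectors (for example, leading eigenvectors of an empirical Fisher matrix); how PhysGuard actually obtains them, its layer-wise Gram-matrix formulation and its adaptive threshold are not shown, and all names and numbers are illustrative.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_out(grad, protected_dirs):
    """Remove from grad its components along each protected direction.
    Assumes protected_dirs are orthonormal."""
    result = list(grad)
    for u in protected_dirs:
        coeff = dot(grad, u)
        result = [r - coeff * ui for r, ui in zip(result, u)]
    return result

# Hypothetical 3-parameter model with one protected direction.
protected = [[1.0, 0.0, 0.0]]
grad = [0.7, -0.2, 0.5]          # fine-tuning gradient on real data
safe_grad = project_out(grad, protected)
```

A fine-tuning step along `safe_grad` leaves the model unchanged, to first order, along the protected direction while still adapting the remaining parameters.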

[LG-32] TreeGRNG: Binary Tree Gaussian Random Number Generator for Efficient Probabilistic AI Hardware DATE

Link: https://arxiv.org/abs/2606.16599
Authors: Jonas Crols, Guilherme Paim, Shirui Zhao, Marian Verhelst
Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: 6 pages, 5 figures, Proceeded by the 2024 Design, Automation and Test in Europe Conference (DATE)

Abstract:Bayesian Neural Networks (BNNs) offer opportunities for greatly enhancing the trustworthiness of conventional neural networks by monitoring the uncertainties in decision-making. A significant drawback for BNN inference at the extreme edge, however, is the imperative need to incorporate Gaussian Random Number Generators (GRNG) within each neuron. State-of-the-art GRNG algorithms heavily depend on multiple arithmetic operations and the use of extensive look-up tables, posing significant implementation challenges for ultra-low power hardware implementations. To overcome this, this paper presents an innovative binary tree random number generator (TreeGRNG) allowing the use of ultra-low-cost constant comparators instead of arithmetic units. We further enhance the TreeGRNG proposal with a set of hardware-aware optimizations exploiting the Gaussian properties. The optimized TreeGRNG surpasses the State-of-the-Art (SoTA) in terms of distribution accuracy while achieving a 3.7$\times$ reduction in energy per sample and boosting the throughput per unit area by 5.8$\times$. Moreover, our TreeGRNG proposal possesses a distinct advantage over the current SoTA in terms of flexibility, as it easily enables designers to adjust the shape of the sampled probability distribution, extending beyond the capabilities of traditional GRNGs, opening the horizon towards future probabilistic AI designs. The TreeGRNG design is available open-source in the link

[LG-33] On the Entropy Formula for Real, Complex, and Quaternionic Deep Linear Networks

Link: https://arxiv.org/abs/2606.16579
Authors: Luis Contreras, Marco Nahas, Tejas Kotwal
Subjects: Machine Learning (cs.LG); Mathematical Physics (math-ph); Differential Geometry (math.DG)
Comments: 17 pages

Abstract:We extend the entropy formula of Menon and Yu for the real Deep Linear Network (DLN) to its complex and quaternionic analogues, obtaining a unified formula for DLNs over $\mathbb{R}$, $\mathbb{C}$, and $\mathbb{H}$.

[LG-34] RepNet: Tackling spectral bias in deep neural networks via parameter reparameterization

Link: https://arxiv.org/abs/2606.16575
Authors: Yong Wang, Tao Zhou, Xuhui Meng
Subjects: Machine Learning (cs.LG); Mathematical Physics (math-ph)

Abstract:Deep neural networks (DNNs) have achieved remarkable success in scientific computing, yet they often suffer from spectral bias in capturing oscillatory and multiscale behaviors. In this study, we investigate this limitation by examining the failure of shallow ReLU neural networks in fitting high-frequency functions. This observation identifies two important factors in resolving rapid oscillations: the initial slope scale and the distribution of partition points induced by the networks. Motivated by this analysis, we propose RepNet, a reparameterized DNN model for ReLU and tanh networks designed for high-frequency and multiscale problems. The key idea is to reparameterize the weights and biases in the first hidden layer, which enables effective control of the initial slope scale and provides an appropriate distribution of the initial partition points. Furthermore, treating the reparameterized weights and biases as trainable parameters allows the DNN to achieve adaptive frequency scaling during training. In addition, we derive quantitative estimates for the output and slope magnitudes of the reparameterized DNN to guide the initialization of the proposed method. Numerical experiments, including multiscale one- and four-dimensional function approximation, forward and inverse PDE problems in combination with physics-informed neural networks (PINNs), and operator learning, demonstrate that RepNet improves the predictive accuracy of vanilla DNNs in capturing highly oscillatory features at slightly higher computational cost. These results indicate that RepNet provides an effective and flexible approach for overcoming spectral bias and applying DNNs to multiscale problems.

[LG-35] Elastic ODYN: Differentiable Optimization for Infeasible Control and Learning in Robotics

Link: https://arxiv.org/abs/2606.16564
Authors: Aristotelis Papatheodorou, Jose Rojas, Ioannis Havoutis, Carlos Mastalli
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 8 pages, 5 figures, 2 tables

Abstract:Robotic systems routinely encounter conflicting objectives, modeling errors, and degenerate contact conditions that render quadratic programs (QPs) infeasible. Yet most optimization solvers and differentiable QP layers assume feasibility, leading to numerical failures, unstable gradients, or solver breakdown when constraints cannot be simultaneously satisfied. We present Elastic ODYN, a primal–dual non-interior-point QP solver that handles infeasibility through smooth squared-$\ell_2$ elastic relaxations. The resulting formulation remains well posed under ill-conditioning and degeneracy, supports warm starting, and converges to closest-to-feasible solutions when no feasible point exists. A lightweight refinement stage recovers physically meaningful dual variables from the elastic solution. Building on this framework, we develop Elastic OdynLayer, a differentiable QP layer with stable gradients under infeasibility, and Elastic OdynSQP, an infeasibility-aware SQP method that resolves inconsistent subproblems and intrinsically infeasible optimal control tasks through selective constraint relaxation. We evaluate the framework on benchmark QPs, singular contact mechanics, differentiable parameter identification, and quadrupedal and humanoid trajectory optimization. Across all settings, Elastic ODYN consistently outperforms state-of-the-art elastic QP solvers in robustness, warm-start performance, and convergence reliability, enabling optimization, simulation, control, and learning beyond the feasibility assumptions of existing methods.
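
The squared-l2 elastic relaxation named in the abstract can be illustrated on a toy equality-constrained QP. The sketch below is a plain quadratic-penalty solve in NumPy and is not the authors' primal-dual solver; the helper name `elastic_qp` and the penalty weight `rho` are assumptions made for illustration.

```python
import numpy as np

def elastic_qp(P, q, A, b, rho=1e6):
    """Minimise 0.5 x'Px + q'x + (rho/2) ||Ax - b||^2 (hypothetical helper).

    The hard constraint Ax = b is replaced by a smooth squared-l2 penalty,
    so the problem stays solvable even when Ax = b has no solution.
    """
    H = P + rho * A.T @ A          # Hessian of the relaxed objective
    g = rho * A.T @ b - q          # right-hand side of the stationarity condition
    return np.linalg.solve(H, g)

# Two contradictory constraints on a scalar: x = 0 and x = 1.
P = np.array([[1.0]])
q = np.array([0.0])
A = np.array([[1.0], [1.0]])
b = np.array([0.0, 1.0])

x = elastic_qp(P, q, A, b)
residual = A @ x - b               # constraint violation split across both rows
```

For large `rho` the solution approaches x = 0.5, the least-squares compromise between the two incompatible constraints, which is the "closest-to-feasible" behaviour the abstract describes.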

[LG-36] MIRAGE: Auditing Anti-Muslim Bias in Frontier LLMs Across Reasoning, Agentic, and Time-Coupled Conditions

Link: https://arxiv.org/abs/2606.16562
Authors: Noor Islam S. Mohammad, Tamim Sheikh
Categories: Machine Learning (cs.LG)

Abstract:Five years after the discovery of persistent anti-Muslim bias in large language models, most evaluations remain confined to single-turn prompt completion, a setting that no longer reflects how frontier LLMs are deployed. We introduce MIRAGE (Muslim-Identity Reasoning and Agentic Generation Evaluation), a benchmark of 1,200 prompts spanning three deployment-realistic conditions: direct completion, chain-of-thought reasoning, and simulated agentic decision-making across content moderation, lending triage, refugee claim summarization, and hiring screens. Across six frontier models, we find that (i) chain-of-thought reasoning amplifies rather than suppresses Muslim-violence associations by 12–34% relative to direct completion, (ii) agentic decisions exhibit a 9–22 percentage-point asymmetry between Muslim and matched non-Muslim cases on identical evidence, and (iii) bias is sharply time-coupled to retrieved news context, increasing 18–27% under recent-conflict retrieval. Existing prompt-based mitigations transfer poorly across our three conditions, suppressing direct-completion bias while leaving agentic asymmetry largely intact. We release MIRAGE and an open evaluation harness to support targeted mitigation research.

[LG-37] Incentives and Evidence in Learned Service Orchestration

Link: https://arxiv.org/abs/2606.16555
Authors: Syed Izhan Khilji, Alireza Furutanpey, Schahram Dustdar
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: To be presented at the IEEE 2026 International Congress on Intelligent and Service Oriented Systems Engineering (CISOSE 2026)

Abstract:Reinforcement learning for service orchestration has been the subject of sustained research for over a decade, yet it is not used in production at scale. The usual explanation is that learned controllers degrade under delayed and noisy telemetry, workload shifts, and uncontrolled tenants. We test whether existing evidence supports that explanation. We evaluate three highly influential RL-based orchestration systems spanning resource allocation, DAG scheduling, and autoscaling, using pre-registered predictions about comparative degradation under production-relevant perturbations and paired inference with family-wise error correction. Across the tests, most predicted performance reversals do not occur. Diagnostic analyses show that these outcomes often reflect comparator collapse, artefact limitations, or evaluation choices rather than evidence that learned controllers tolerate the perturbations. One apparent advantage under observation lag is roughly fortyfold compared to a Kubernetes HPA-equivalent controller. Another widely cited result cannot be reconstructed from its released artefact, and the strongest reproducible margin is far smaller than the published results. Conclusions also reverse under changes in perturbation magnitude and evaluation mode. Based on these results and broader patterns in the literature, we identify an institutional problem. Publication and review incentives favour benchmark gains against convenient comparators, even when those gains provide little evidence of deployment performance. We argue that the problem is not solely technical. Rather, it is institutional, so learned orchestration needs production-grade comparators, registered perturbation models, separate operational metrics, and publication criteria that reward reproducible operational evidence. Without these changes, the literature can grow without establishing whether learning improves orchestration.

[LG-38] Neural Bayesian Anomaly Mitigation: A Robust Loss that Doubles as an Unsupervised Contamination Classifier

Link: https://arxiv.org/abs/2606.16524
Authors: S. A. K. Leeney, W. J. Handley, H. T. J. Bevins, E. de Lera Acedo
Categories: Machine Learning (cs.LG); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Machine Learning (stat.ML)
Comments: 13 pages, 4 figures

Abstract:Engineered robust losses such as Huber, Student-$t$, and generalised cross-entropy make supervised models tolerant of contamination but cannot answer which observations are corrupted. We introduce Neural Bayesian Anomaly Mitigation (NBAM), a general-purpose drop-in loss derived from a Bayesian latent-switch mixture model: the marginal likelihood defines a robust supervised loss, and the associated posterior defines an unsupervised contamination classifier. Like Huber or Student-$t$, NBAM can replace the standard training loss in any supervised pipeline; unlike them, it additionally learns a structured contamination model and returns a calibrated per-sample contamination posterior. A learned input-dependent prior $\pi_\phi(x)$ captures the spatial locality of contamination, so that samples near known corruptions are more likely to be flagged, while an Occam penalty emerges automatically and regularises against over-flagging. On CIFAR-10 with asymmetric label contamination, NBAM recovers the structure of the corruption process without supervision: the contamination posterior separates clean from corrupted samples, and the learned anomaly head identifies the direction of every label-flip pair. Alongside these capabilities, NBAM outperforms the four robust-loss baselines considered here at contamination rates 0.2-0.6.
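
To make the "marginal likelihood as loss, posterior as classifier" idea concrete, here is a minimal two-component mixture for a scalar regression residual. It is a sketch under stated assumptions and not the NBAM implementation: the contamination prior `pi` is a fixed constant here (the paper learns an input-dependent prior), and the flat contamination density `contam_density` is an illustrative choice.

```python
import numpy as np

def mixture_loss_and_posterior(residual, pi=0.1, sigma=1.0, contam_density=0.05):
    """Two-component latent-switch mixture for a scalar residual (illustrative).

    clean:        Gaussian N(0, sigma^2) on the residual
    contaminated: flat density `contam_density` (assumed, e.g. uniform over width 20)
    Returns the negative log marginal likelihood (the robust loss) and the
    posterior probability that each sample is contaminated.
    """
    residual = np.asarray(residual, dtype=float)
    clean = np.exp(-0.5 * (residual / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    joint_clean = (1.0 - pi) * clean
    joint_contam = pi * contam_density * np.ones_like(residual)
    marginal = joint_clean + joint_contam
    loss = -np.log(marginal)
    posterior = joint_contam / marginal
    return loss, posterior

loss, post = mixture_loss_and_posterior([0.0, 5.0, 50.0])
```

The loss saturates for very large residuals because the contamination component puts a floor under the marginal likelihood, while the posterior moves from near 0 for a small residual to near 1 for a large one.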

[LG-39] How Post-Training Shapes Biological Reasoning Models

Link: https://arxiv.org/abs/2606.16517
Authors: Lukas Fesser, Hanlin Zhang, Michelle M. Li, Eric Wang, Bryan Perozzi, Shekoofeh Azizi, Sham M. Kakade, Marinka Zitnik
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)

Abstract:Scientific reasoning models for biology combine language models with foundation models trained on multimodal biological data, including DNA, RNA, and proteins. These models are built through post-training, yet how each stage shapes reasoning and generalization remains poorly understood. We study when post-training improves performance and when it induces over-specialization. Across genomics, transcriptomics, and proteins, we train and evaluate more than 100 biological reasoning models under controlled variation in backbone, continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL), measuring both in-domain (ID) and out-of-domain (OOD) performance. We find that each post-training stage reshapes generalization in a distinct way rather than contributing uniform gains. CPT improves downstream performance by aligning models with biological language. SFT consistently increases ID performance but causes OOD performance to peak early and decline as models fit the training distribution. RL, when applied to strong SFT checkpoints with aligned rewards, improves OOD performance and partially recovers generalization. These results show that biological reasoning does not improve monotonically with additional supervision or compute. Instead, performance depends on how training stages are composed. Under fixed post-training budgets, the strongest ID-OOD trade-off comes from brief SFT, larger RL allocations, and asymmetric adaptation capacity across stages.

[LG-40] Tail-Shape Estimation in LLM Evaluation Is Fragile: A Protocol for Diagnosing False Positives

Link: https://arxiv.org/abs/2606.16511
Authors: Luca Zhou
Categories: Machine Learning (cs.LG)
Comments: 9 pages of main paper, 4 figures and 4 tables in the main paper, more in the appendix

Abstract:Recent work motivates moving large language model (LLM) evaluation from mean-based to tail-aware metrics, including conditional value-at-risk and tail-index estimates of reward-model error. We ask whether the canonical extreme-value-theory tail-index parameter, which isolates how heavy a tail is from how large the tail mass is, adds discriminative information beyond the mean and a standard tail-magnitude statistic in LLM evaluation. We pre-register a protocol covering admissibility, goodness-of-fit, threshold-stability, and effect-size requirements for any positive tail-shape claim. The protocol is the contribution of this paper; the empirical study below is a demonstration of what its gates catch. Applied to a standard LLM toxicity-evaluation setup under two structurally different scorer families, the protocol catches three distinct modes of false positives that a naive analysis would have published, and rejects the headline tail-shape claim on both scorers. We conclude that tail-shape estimation in the LLM toxicity-evaluation setups we examined is more fragile than the recent literature suggests, and recommend the protocol as a starting point for tail-index claims in similar setups.

[LG-41] Petrov-Galerkin Variational Physics-Informed Neural Network Framework for Two-Dimensional Singularly Perturbed Problems

Link: https://arxiv.org/abs/2606.16510
Authors: Vijay Kumar, Gautam Singh
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)

Abstract:This study proposes a Petrov-Galerkin based Variational Physics-Informed Neural Network (VPINN) for efficiently solving two-dimensional singularly perturbed problems (SPPs) with one and two small perturbation parameters. The approach employs neural networks to construct the trial solution space, while tensor-product hat functions are adopted as test functions to enforce the variational form. To accurately resolve sharp boundary layers, the variational form is implemented using a Petrov-Galerkin formulation. Dirichlet boundary conditions are imposed directly, while the source terms are computed using automatic differentiation. Computational experiments on standard two-dimensional problems demonstrate that the proposed method achieves high accuracy in both the maximum and $L_2$ norms. These results confirm the efficiency and robustness of the Petrov-Galerkin VPINN approach in accurately capturing the multiscale features of two-dimensional SPPs.

[LG-42] Semi-Supervised Speech Confidence Detection using Pseudo-Labelling and Whisper Embeddings

Link: https://arxiv.org/abs/2606.16505
Authors: Adam Wynn, Jingyun Wang, Xiangyu Tan
Categories: Sound (cs.SD); Machine Learning (cs.LG)
Comments: 8 pages, 3 figures. Published in the Proceedings of the 26th International Conference on Artificial Intelligence in Education (AIED 2025). Shorter, preliminary version of arXiv:2605.12387

Abstract:Understanding speaker confidence is crucial in educational settings, as it can enhance personalised feedback and improve learning outcomes. This study introduces a novel framework for detecting speaker confidence by integrating human-engineered features with embeddings from the Whisper encoder. To address data limitations, a pseudo-labelling technique is employed to expand the labelled dataset, allowing the model to learn from both human-annotated and model-generated labels. The framework combines traditional speech features including pitch, volume, rate of speech, and the presence of disfluencies and stress, with Whisper embeddings, and uses a co-attention mechanism to fuse these representations and achieve an overall accuracy of 75%. This study contributes to advancing speech analysis, enabling applications that support personalised learning and speaking skill development.

[LG-43] BRICKS-WM: Building Reusability via Interface Composition Kinetics for Structured World Models

Link: https://arxiv.org/abs/2606.16489
Authors: Shaowei Zhang, Jiahan Cao, Xunlan Zhou, Shenghua Wan, De-Chuan Zhan
Categories: Machine Learning (cs.LG)

Abstract:Model-based Reinforcement Learning (MBRL) has achieved remarkable success in continuous control by leveraging latent world models. However, prevailing approaches typically rely on monolithic latent dynamics, entangling environment dynamics into a coupled process. This coupling severely limits reusability: altering the agent necessitates retraining the entire world from scratch, even if the environment remains constant. To address this, we introduce BRICKS-WM (Building Reusability via Interface Composition Kinetics for Structured World Models), a framework for the modular assembly of structured world models. Driven by the insight that the physical world is composed of independent entities, we posit that global dynamics can be modeled as a composition of distinct dynamical modules interacting via latent interfaces. As a minimal instantiation, we factorize the latent state space into an actuated Agent module and an external Background module, bridged by a learned latent interface. Unlike prior object-centric methods that prioritize visual segmentation, BRICKS-WM enforces a functional separation in transition dynamics, ensuring that background dynamics remains agnostic to the agent’s dynamics. Empirically, BRICKS-WM achieves control performance comparable to strong monolithic baselines when trained from scratch, and enables the reuse of frozen background dynamics across agents.

[LG-44] Privacy from Symmetry: Orthogonally Equivariant Transformers for LLM Inference

Link: https://arxiv.org/abs/2606.16461
Authors: Alexander Yukhimchuk, Andrey Shulga, Mladen Kolar, Martin Takáč
Categories: Machine Learning (cs.LG)

Abstract:Running large language models locally is often impractical, pushing inference on sensitive text to third-party providers. Split inference partially mitigates this by keeping tokens on the client and sending only hidden representations, but these representations can still be recovered via nearest-neighbor search against the public embedding table. We propose an orthogonal obfuscation procedure in which the client multiplies embeddings by a secret orthogonal matrix before transmission. To enable correct inference under arbitrary rotations, we introduce ConjFormer, a transformer variant that is exactly $\mathrm{O}(d)$-equivariant via a lightweight normalization change (scalar RMSNorm) together with blockwise orthogonal conjugation of all linear weights. As a result, the server performs the full forward pass entirely in the rotated basis and never observes unrotated hidden states. Experiments on GPT-2 and Llama 3.2 1B models fine-tuned on PubMed show that orthogonal obfuscation eliminates direct cosine nearest-neighbor inversion and reduces token recovery from over 35% top-10 to at most 1.3%, while increasing perplexity by only 0.4% after fine-tuning. These results indicate that enforcing symmetry at the architectural level can provide a practical defense for privacy-preserving LLM inference without noise injection or heavy cryptographic machinery.
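
The equivariance argument in the abstract can be checked numerically on a single layer. The NumPy sketch below is illustrative and not the ConjFormer code: it uses one square weight matrix rather than blockwise conjugation, and the function name `scalar_rmsnorm` is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Secret orthogonal matrix Q from the QR decomposition of a random matrix.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

def scalar_rmsnorm(x, gain=1.0, eps=0.0):
    """RMSNorm with a single scalar gain instead of a per-channel vector."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gain * x / rms

x = rng.standard_normal(d)         # a client-side embedding
W = rng.standard_normal((d, d))    # a server-side linear weight

x_rot = Q @ x                      # what the client transmits
W_conj = Q @ W @ Q.T               # weight conjugated into the rotated basis

# Rotating first and then applying the layer equals applying the layer and then rotating.
lhs = W_conj @ scalar_rmsnorm(x_rot)
rhs = Q @ (W @ scalar_rmsnorm(x))
```

The identity holds because an orthogonal matrix preserves the vector norm, so the RMS statistic is unchanged by the rotation. A per-channel gain vector would generally break it, since a diagonal scaling does not commute with an arbitrary rotation.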

[LG-45] Not all Jensen-Shannon Divergence Estimators are Equal

Link: https://arxiv.org/abs/2606.16411
Authors: Alba Garrido, Alejandro Almodóvar, Mar Elizo, Patricia A. Apellániz, Santiago Zazo, Juan Parras
Categories: Machine Learning (cs.LG)

Abstract:The Jensen-Shannon divergence is widely reported as a scalar measure of fidelity for synthetic tabular data. Yet, in practice, it is estimated from finite samples using protocols that are often underspecified. This creates a measurement problem. Although the population divergence is well defined, the empirical value depends on the estimator family, sampling protocol, calibration, dimensionality, and class balance. We show that different protocols can yield non-comparable values: marginal-based estimators ignore dependencies in the joint distribution and can severely underestimate divergence, while classifier-based estimators capture joint structure but exhibit strong estimator dependence. We systematically study this behavior across controlled settings with reference divergences and real-world synthetic tabular benchmarks. Our analysis reveals dependence blindness in marginal estimators, prior-shift bias under class imbalance, and estimator sensitivity in high dimensions. To address prior shift, we derive a closed-form posterior correction for classifier-based Jensen-Shannon estimation. Our results show that empirical Jensen-Shannon divergence values are inherently protocol-dependent, making explicit specification of the estimation procedure necessary for meaningful comparison. We provide practical guidelines and an open-source tool for estimator-aware Jensen-Shannon evaluation.
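
The "dependence blindness" of marginal estimators can be seen in a two-feature toy case computed exactly, with no sampling. This example is an illustration constructed for this note, not taken from the paper: both tables have identical marginals but opposite dependence structure.

```python
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence in bits between two discrete distributions."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0               # 0 * log 0 is taken as 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Joint distributions over two binary features (rows: feature 1, columns: feature 2).
real = np.array([[0.5, 0.0], [0.0, 0.5]])       # features always equal
synthetic = np.array([[0.0, 0.5], [0.5, 0.0]])  # features always differ

joint_jsd = jsd(real, synthetic)
marginal_jsd = [jsd(real.sum(axis=ax), synthetic.sum(axis=ax)) for ax in (0, 1)]
```

Here the joint divergence takes its maximum value of 1 bit, while every per-feature marginal divergence is 0, so a marginal-only protocol would report a perfect match.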

[LG-46] MUNI: Multimodal Unified Latent Diffusion for Coherent Any-to-Any Generation

Link: https://arxiv.org/abs/2606.16408
Authors: Kyeongmin Yeo, Yunhong Min, Minhyuk Sung
Categories: Machine Learning (cs.LG)
Comments: Project page: this https URL

Abstract:We introduce MUNI, an end-to-end multimodal latent diffusion framework for any-to-any generation that unifies subset-conditioned cross-modal generation and unconditional joint sampling through a shared stochastic latent. Existing multimodal generative models are largely LLM-based, which limits leveraging modality-specific generators and requires text-paired data for training. Recent diffusion- and flow-based any-to-any extensions take a different direction but still rely on text-aligned embeddings, fully-paired training, or matched-dimensionality deterministic mappings. MUNI rests on two complementary contributions, one architectural and one in the training objective. First, we extend latent diffusion to multimodal any-to-any generation end-to-end: instead of the standard two-stage recipe that precomputes a frozen latent space and then fits a prior over it, MUNI jointly trains modality-specific encoders, expressive decoders, and a single shared flow-based prior under one objective. Second, we identify that the standard aggregation rules of multimodal variational inference are insufficient once coupled with a learned prior and expressive decoders. A suitable shared latent must simultaneously satisfy coherence across generated modalities, predictive sufficiency of subset latents, and minimality of the latent content. We propose a routed training objective whose structural choices align the latent with these criteria and admit a minimal-sufficiency characterization in the realizable setting. Experiments on PolyMNIST-Quadrant-Labels and a large-scale image-text-audio benchmark show MUNI matching or exceeding the strongest baselines on conditional generation while opening its largest margins on unconditional coherence. Project page: this https URL.

[LG-47] Robust Neural Tucker Factorization with Bias Correction and Adaptive Initialization

Link: https://arxiv.org/abs/2606.16388
Authors: Yuchao Su, Yixin Ran
Categories: Machine Learning (cs.LG)
Comments: 9 pages, 3 figures, 106 conferences

Abstract:High-dimensional incomplete (HDI) tensors are widely used in traffic and climate applications, but sparse observations make accurate completion difficult. The intrinsic non-linear dynamics and non-stationary variations across distinct multi-modal fields severely hinder the efficacy of conventional linear reconstruction frameworks. Neural Tucker factorization provides an effective framework for modeling high-order interactions among tensor modes. By parameterizing underlying structural characteristics into continuous latent spaces, neural representations circumvent the rigid low-rank constraints of classical algebra. However, its performance can still be affected by implementation-level choices, especially parameter initialization and the bias configuration of the final output mapping. Suboptimal initializations frequently lead to variance explosion across the cubically expanded interaction spaces, driving the subsequent non-linear activation boundaries into severe gradient saturation zones, while the omission of a dedicated translation parameter forces interaction weights to implicitly absorb global statistical deviations. This paper proposes a simple yet effective neural Tucker factorization model with Kaiming initialization and bias correction (KaBiN) for HDI tensor completion. The proposed model utilizes Kaiming uniform initialization for the embedding and Tucker linear parameters, and adopts a simple bias correction in output mapping. By elegantly decoupling global mean shifts from local structural representations, the framework provides a highly stable and well-conditioned optimization landscape. Experiments on three real-world HDI tensor datasets show that KaBiN achieves better performance than the original NeuTucF, while introducing minimal computational overhead.

[LG-48] Mixtures of Subspaces for Bandwidth Efficient Context Parallel Training

Link: https://arxiv.org/abs/2606.16384
Authors: Sameera Ramasinghe, Ajanthan Thalaiyasingam, Hadi Mohaghegh Dolatabadi, Gil Avraham, Violetta Shevchenko, Yan Zuo, Chamin Hewa Koneputugodage, Alexander Long
Categories: Machine Learning (cs.LG)

Abstract:Pretraining language models with extended context windows enhances their ability to leverage rich information during generation. Existing methods split input sequences into chunks, broadcast them across multiple devices, and compute attention block by block, which incurs significant communication overhead. While feasible in high-speed clusters, these methods are impractical for decentralized training over low-bandwidth connections. We propose a compression method for communication-efficient context parallelism in decentralized settings, achieving a remarkable compression rate of over 95% with negligible overhead and no loss in convergence. Our key insight is to exploit the intrinsic low-rank structure of activation outputs by dynamically constraining them to learned mixtures of subspaces via efficient reparameterizations. We demonstrate scaling billion-parameter decentralized models to context lengths exceeding 100K tokens on networks as slow as 300 Mbps, matching the wall-clock convergence speed of centralized models on 100 Gbps interconnects.

[LG-49] Scalable and Interpretable Representation Alignment with Ordinal Similarity

Link: https://arxiv.org/abs/2606.16379
Authors: Diogo Soares, Pankhil Gawade, Andrea Dittadi, Ewa Szczurek
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Evaluating representation similarity is fundamental to representation learning. However, existing metrics suffer from significant limitations: they lack interpretability due to shifting baselines, lack robustness to outliers, and are computationally intractable for large datasets, forcing reliance on heuristic approximations. To address this, we develop an ordinal-similarity framework, instantiated by the Triplet (TSI) and Quadruplet (QSI) Similarity Indices, which measure alignment by quantifying the consistency of ordinal relationships. We theoretically demonstrate this formulation is inherently interpretable, robust to outliers, and computationally efficient. Finally, we establish a formal equivalence between TSI and local neighborhood alignment, measured by Mutual Nearest Neighbors. Empirically, we validate these properties and show that ordinal similarity offers a scalable approach to measuring alignment, enabling practitioners to better understand and design representations.
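
The abstract does not give the definition of TSI, so the following is only one plausible reading of "consistency of ordinal relationships": the fraction of triplets whose distance ordering agrees across two representations. The function name `triplet_agreement` is hypothetical, and the brute-force loop is for illustration only; it does not reflect the efficiency claims made for the actual indices.

```python
import numpy as np
from itertools import permutations

def triplet_agreement(X, Y):
    """Fraction of triplets (i, j, k) whose distance ordering agrees in X and Y.

    Illustrative only: a triplet agrees when "j is closer to i than k is"
    holds in both representations or in neither.
    """
    dX = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    dY = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    agree, total = 0, 0
    for i, j, k in permutations(range(len(X)), 3):
        if j < k:                  # count each unordered pair (j, k) once per anchor i
            agree += int((dX[i, j] < dX[i, k]) == (dY[i, j] < dY[i, k]))
            total += 1
    return agree / total

rng = np.random.default_rng(0)
X = rng.standard_normal((12, 5))
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
score_rotated = triplet_agreement(X, 3.0 * X @ Q)                  # rotation plus scaling
score_random = triplet_agreement(X, rng.standard_normal((12, 4)))  # unrelated representation
```

Because only the ordering of distances matters, a score of this kind is unchanged by rotation and uniform scaling, and it has a fixed reference point: unrelated representations are expected to agree on about half of the triplets.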

[LG-50] CacheMuon: Using Temporal Preconditioning To Approximate Polar Factor

Link: https://arxiv.org/abs/2606.16371
Authors: Bishnu Dev(1), Sushil Bohara(1), Martin Takáč(1), Samuel Horváth(1) ((1) Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE)
Categories: Machine Learning (cs.LG)

Abstract:Muon is an optimizer that computes updates using the polar factor of the momentum matrix and has shown strong empirical performance across a range of training settings. A key component of Muon is the Newton-Schulz iteration used to compute this polar factor. Although this avoids the cost of an exact singular value decomposition, it remains expensive in practice because it is applied at every optimization step. At the same time, the momentum matrix changes smoothly over training, suggesting strong temporal correlation in the corresponding polar factors. In this paper, we exploit this structure and propose CacheMuon, a temporal preconditioning method that reuses information from previous optimization steps to approximate the polar factor at the current step. This reduces redundant orthogonalization computation across iterations. We analyze CacheMuon as an inexact Muon update, with error controlled by fresh-solver error and cache staleness. Empirically, CacheMuon provides a controllable quality-efficiency frontier: conservative thresholds closely match fresh Muon on language-model and vision training while reducing orthogonalization FLOPs, whereas more aggressive thresholds yield larger arithmetic savings at the cost of modest validation-quality degradation.
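
For readers unfamiliar with the step being cached, the sketch below approximates a polar factor with the classical cubic Newton-Schulz iteration and compares it with the exact answer. It is an assumption-laden illustration: Muon implementations use a tuned polynomial rather than this textbook update, and nothing here reproduces CacheMuon's caching rule.

```python
import numpy as np

def newton_schulz_polar(M, steps=20):
    """Approximate the polar factor U V^T of M with the classical cubic iteration.

    Assumption: this is the textbook Newton-Schulz update, not the tuned
    polynomial used in Muon implementations.
    """
    X = M / np.linalg.norm(M)      # Frobenius scaling puts singular values in (0, 1]
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

# Build a test matrix with known singular values so the exact answer is U V^T.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((6, 4)))
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
M = U @ np.diag([0.5, 1.0, 1.5, 2.0]) @ V.T

approx = newton_schulz_polar(M)
exact = U @ V.T
```

Each call costs several matrix products per iteration, which is the per-step expense the abstract refers to when it motivates reusing information across optimization steps.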

[LG-51] FEnc2: Unifying Data Packing for Efficient Private Inference via Convolution and Architecture-Aware Fragment Encoding ISCA2026

Link: https://arxiv.org/abs/2606.16359
Authors: Ran Ran, Zhaoting Gong, Nuo Xu, Yuanchao Xu, Fan Yao, Wujie Wen
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 15 pages, 9 figures. To appear in ISCA 2026

Abstract:Fully Homomorphic Encryption (FHE) enables privacy-preserving machine learning but incurs extreme computational and memory overhead. These costs come not only from expensive low-level primitives, including Number Theoretic Transform (NTT), rotation, and key-switching, but also from inefficient ciphertext packing at the application level. Existing packing strategies typically preserve either neighboring data elements or feature grouping, but not both, leading to wasted ciphertext slots, excessive rotations, and inflated ciphertext counts. We propose FEnc2, a unified and principled fragment-based encoding framework for CKKS-based private convolutional neural network inference. FEnc2 optimizes slot utilization, rotation complexity, and ciphertext density through two components: 1) Conv-aware Encoding, which analytically selects an optimal fragment size to decouple spatial dependencies and jointly minimize inner-outer rotations across layers, and 2) Arch-aware Ct Compression, which restores ciphertext density after feature- or channel-reduction layers. Together, these transformations reshape encrypted workload structure and reduce homomorphic operations by one to two orders of magnitude. With full memory capacity utilized, i.e., at maximum batch size, FEnc2 achieves end-to-end latency speedups over the state-of-the-art Orion of up to 228.83x on GPU and 226.06x on CPU for LeNet on MNIST, and up to 4.55x on GPU and 9.43x on CPU for MobileNet on ImageNet. FEnc2 is hardware-agnostic yet architecturally transformative: by optimizing encrypted tensor layout before execution, it reduces ciphertext count and workload pressure on hardware, complementing primitive-level optimizations such as NTT and keyswitch accelerators. These results show that application-level data layout is a first-order architectural design dimension for encrypted inference and an important enabler for next-generation FHE systems.

[LG-52] Simulation-Augmented Multi-Step Split Conformal Prediction for Aggregated Forecasts ICML2026

Link: https://arxiv.org/abs/2606.16356
Authors: Andro Sabashvili
Categories: Machine Learning (cs.LG)
Comments: Accepted at ICML 2026 workshop: Forecasting as a New Frontier of Intelligence

Abstract:We study uncertainty quantification for aggregated forecasting tasks such as annual totals and year-over-year growth rates. We propose SA-MSCP, a simulation-augmented multi-step split conformal method that generates future paths from cross-validated residuals using a block bootstrap and constructs prediction intervals from empirical quantiles. Experiments show that SA-MSCP improves empirical coverage over a simulated-path baseline for aggregated and growth-rate targets. Our results demonstrate that simulation-enhanced conformal calibration is an effective and general framework for uncertainty quantification in aggregated time-series forecasting.
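
The simulation step described in the abstract (block-bootstrapped residual paths, then empirical quantiles of the aggregate) can be sketched as follows. This is a simplified illustration, not SA-MSCP itself: the split-conformal calibration is omitted, and the function name, block length, and stand-in residuals are all assumptions.

```python
import numpy as np

def simulate_total_interval(forecast, residuals, block=3, n_paths=2000, alpha=0.1, seed=0):
    """Prediction interval for the sum of a multi-step forecast (illustrative).

    Residual paths are built by concatenating randomly chosen contiguous
    blocks of past residuals, which keeps short-range autocorrelation.
    """
    rng = np.random.default_rng(seed)
    forecast = np.asarray(forecast, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    horizon = len(forecast)
    n_blocks = -(-horizon // block)                     # ceiling division
    starts = rng.integers(0, len(residuals) - block + 1, size=(n_paths, n_blocks))
    offsets = np.arange(block)
    # Shape (n_paths, n_blocks * block), trimmed to the forecast horizon.
    noise = residuals[starts[:, :, None] + offsets].reshape(n_paths, -1)[:, :horizon]
    totals = (forecast + noise).sum(axis=1)             # aggregated target per path
    lower, upper = np.quantile(totals, [alpha / 2, 1 - alpha / 2])
    return lower, upper

forecast = np.full(12, 100.0)                           # twelve monthly point forecasts
residuals = np.sin(np.arange(36))                       # stand-in for cross-validated residuals
lower, upper = simulate_total_interval(forecast, residuals)
zero_lower, zero_upper = simulate_total_interval(forecast, np.zeros(36))
```

Aggregating each simulated path before taking quantiles is what lets the interval reflect how errors accumulate or cancel across steps, which per-step intervals cannot do on their own.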

[LG-53] Filtered ANN as a Phase Transition: When Selectivity-Estimation Error Causes Plan Regret

Link: https://arxiv.org/abs/2606.16341
Authors: Madhulatha Mandarapu, Sandeep Kunkunuru
Categories: Machine Learning (cs.LG); Databases (cs.DB)
Comments: 8 pages, 4 figures. Code, benchmarks, and full pre-registration: this https URL

Abstract:A filtered approximate-nearest-neighbor (ANN) query returns the k nearest vectors among those satisfying an attribute predicate P of selectivity s. The best execution strategy – pre-filter, post-filter, or in-filter – changes with s, so a system must estimate s and choose. We model this as an argmax over a landscape with phases (regions where each strategy wins) separated by boundaries, and show that selectivity-estimation error produces plan regret – recall lost versus the oracle strategy – only in the critical regions around those boundaries. The regret is a wedge of log-width equal to the multiplicative estimation error epsilon and height equal to the local cliff |V’(s*)| epsilon; the flip-margin 1/|V’(s*)| is the condition number of a sibling cardinality-estimation study reappearing as the local boundary theory. The two phase boundaries follow from independent mathematics: order statistics place the post-filter cliff at s ~ k/K, and site percolation places the in-filter cliff at s_c ~ 0.83/M for graph degree M (corpus-size independent). Criticality exists only under a constrained budget B sqrt(k n). Under pre-registered decision rules we confirm, on synthetic sweeps and real SIFT1M, that regret concentrates ~290x at the boundary and that the regret curves obey a finite-size scaling collapse onto one universal wedge across two decades of corpus size. A real approximate index does not mis-locate the boundary, but a biased cost model opens a persistent miscalibration band that estimation-error robustness cannot fix. The contribution is a characterization, not a new index. Code and the full pre-registration are public.
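
The post-filter cliff at s ~ k/K can be made tangible with a small binomial calculation: retrieve K candidates, filter them, and ask how likely it is that at least k survive. The sketch assumes each candidate passes the predicate independently with probability s, which is a simplification introduced here; the values k = 10 and K = 100 are arbitrary.

```python
from math import comb

def post_filter_success(s, k=10, K=100):
    """P(at least k of K retrieved candidates pass a filter of selectivity s).

    Assumption: each candidate passes independently with probability s.
    """
    miss = sum(comb(K, j) * s**j * (1 - s) ** (K - j) for j in range(k))
    return 1.0 - miss

cliff = 10 / 100                           # s ~ k/K
below = post_filter_success(0.3 * cliff)   # selectivity well under the cliff
above = post_filter_success(3.0 * cliff)   # selectivity well over the cliff
```

Away from the cliff the outcome is nearly deterministic in either direction, so a moderate error in the estimate of s changes the best plan only when s lies close to k/K, consistent with the abstract's claim that regret concentrates at the boundary.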

[LG-54] Diffusion Offline Reinforcement Learning for Fair and Energy-Efficient UAV-Assisted Wireless Networks

Link: https://arxiv.org/abs/2606.16331
Authors: Eslam Eldeeb, Hirley Alves
Categories: Machine Learning (cs.LG)

Abstract:The integration of generative artificial intelligence with wireless communication and signal processing systems has opened new avenues for intelligent, data-driven decision-making in future 6G networks. This work proposes a diffusion soft actor-critic (Diffusion-SAC) approach that leverages offline reinforcement learning (RL) enhanced by denoising diffusion probabilistic models (DDPMs) to optimize trajectory and scheduling control in unmanned aerial vehicle (UAV) networks. While offline RL methods, such as conservative Q-learning (CQL), can learn from static datasets, they often struggle to generalize in low-data or dynamic conditions. To address this, we combine the robustness of CQL with the generative power of diffusion models, enabling expressive and signal-aware policy learning that generalizes beyond behavior policies. Applied to a UAV-assisted wireless network, the proposed framework minimizes transmission energy and improves fairness among devices. Simulations show that Diffusion-SAC outperforms standard offline RL baselines, achieving more stable convergence and higher rewards even with limited datasets. The method enhances data efficiency, reduces energy consumption, and increases throughput by more than 35% compared to existing algorithms, demonstrating its potential for robust policy learning in next-generation wireless control systems.

[LG-55] pFedUL: Layer-Aware Federated Unlearning for Personalized Federated Learning

Link: https://arxiv.org/abs/2606.16304
Authors: Zhuodong Liu, Xiangyu Li, Zhihao Zhang
Categories: Machine Learning (cs.LG)
Comments: This paper has been accepted for publication in CMC-Computers, Materials & Continua

Abstract:Federated unlearning (FU) enables the removal of specific data contributions from federated learning (FL) models to comply with regulations such as the General Data Protection Regulation (GDPR). However, most existing FU methods are designed for the FedAvg paradigm, where all clients share a single global model. In practice, personalized federated learning (pFL) methods such as FedPer, FedRep, Ditto, and FedBN have become widely adopted due to their superior handling of non-IID data. These methods decompose the model into shared global layers and client-specific personalized layers, fundamentally altering the semantics of unlearning, yet this setting has received little attention. We formalize FU under the pFL paradigm, identifying a tension between unlearning completeness on shared layers and personalization preservation for remaining clients. We then propose pFedUL, a layer-aware selective unlearning framework comprising three components: (1) gradient-based layer-wise contribution attribution that separately quantifies the target client’s influence on shared and personalized parameters, (2) adaptive selective unlearning that applies differentiated forgetting strategies across layer types, and (3) a lightweight recalibration protocol enabling remaining clients to restore personalization with minimal overhead. We further introduce two new metrics, Personalization Preservation Score (PPS) and Cross-client Fairness Index (CFI), to evaluate pFL-specific unlearning quality. Experiments on CIFAR-10, CIFAR-100, and FEMNIST under varying non-IID settings indicate that pFedUL achieves unlearning effectiveness comparable to full retraining while maintaining an average of 97.3% personalized accuracy for remaining clients. Compared with six state-of-the-art FU methods adapted to the pFL setting, pFedUL consistently achieves superior personalization preservation.

[LG-56] One-Step Generalization Ratio Guided Optimization for Domain Generalization ICML2025

Link: https://arxiv.org/abs/2606.16301
Authors: Sumin Cho, Dongwon Kim, Kwangsu Kim
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 29 pages, accepted at the 42nd International Conference on Machine Learning (ICML 2025)

Abstract:Domain Generalization (DG) aims to train models that generalize to unseen target domains but often overfit to domain-specific features, known as undesired correlations. Gradient-based DG methods typically guide gradients in a dominant direction but often inadvertently reinforce spurious correlations. Recent work has employed dropout to regularize overconfident parameters, but has not explicitly adjusted gradient alignment or ensured balanced parameter updates. We propose GENIE (Generalization-ENhancing Iterative Equalizer), a novel optimizer that leverages the One-Step Generalization Ratio (OSGR) to quantify each parameter’s contribution to loss reduction and assess gradient alignment. By dynamically equalizing OSGR via a preconditioning factor, GENIE prevents a small subset of parameters from dominating optimization, thereby promoting domain-invariant feature learning. Theoretically, GENIE balances convergence contribution and gradient alignment among parameters, achieving higher OSGR while retaining SGD’s convergence rate. Empirically, it outperforms existing optimizers and enhances performance when integrated with various DG and single-DG methods.

[LG-57] Evolutionary Bilevel Reward Shaping for Generalization in Reinforcement Learning PPSN2026

Link: https://arxiv.org/abs/2606.16236
Authors: Ekasit Usaratniwart, Xilin Gao, Marc Ong, Youhei Akimoto
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: Accepted at PPSN 2026

Abstract:Reinforcement learning (RL) often suffers from performance degradation when deployed in environments that differ from those encountered during training. Existing techniques such as domain randomization (DR) mitigate this, but require access to diverse training environments and full trajectory observability, assumptions that fail in privacy-preserving or restricted scenarios where only scalar performance metrics are available. We propose Generalization via Evolutionary Reward Shaping (GERS), a bilevel optimization approach to improve generalization on unseen test environments using only scalar feedback from validation environments. At the lower level, an RL agent guided via a reward function shaped by the upper level learns a policy on a limited set of training environments with accessible trajectory data; at the upper level, CMA-ES optimizes the reward shaping parameters to maximize the cumulative unshaped reward on separate validation environments for which trajectory access is unavailable. Results on continuous control tasks indicate that GERS outperforms the standard RL baseline on unseen test environments. GERS performance is comparable to DR, despite DR treating the combined set of training and validation environments of GERS as a single training set that requires trajectory access, whereas GERS cannot access validation trajectories. These results confirm that GERS effectively enhances generalization under restricted data access constraints.
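
The bilevel structure described above can be sketched on a toy problem. Everything here is an illustrative assumption rather than the authors' setup: the lower level is replaced by a closed-form maximiser instead of an RL run, the upper level uses a plain (1+1) evolution strategy standing in for CMA-ES, and the reward functions are made up.

```python
import random


def train_agent(w: float) -> float:
    """Lower level (toy): action maximising the shaped training reward.

    Training reward is -(a - 1)^2 and the shaping bonus is w * a, so the
    maximiser is a = 1 + w / 2. A real lower level would run RL here.
    """
    return 1.0 + w / 2.0


def validation_return(action: float) -> float:
    """Scalar, unshaped feedback from a toy validation environment."""
    return -(action - 2.0) ** 2


def tune_shaping(iterations: int = 300, sigma: float = 0.5, seed: int = 0) -> float:
    """Upper level: (1+1) evolution strategy over the shaping weight w."""
    rng = random.Random(seed)
    w = 0.0
    best = validation_return(train_agent(w))
    for _ in range(iterations):
        candidate = w + rng.gauss(0.0, sigma)
        score = validation_return(train_agent(candidate))
        if score > best:  # only the scalar score crosses the level boundary
            w, best = candidate, score
    return w


w_star = tune_shaping()
print(f"learned shaping weight: {w_star:.3f}")
```

In this toy the validation optimum is at w = 2, and the upper level approaches it using only scalar validation returns, never validation trajectories.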

[LG-58] Prediction of Runtime Parameters of Parallel Chemistry Applications via Active and Generative Learning

Link: https://arxiv.org/abs/2606.16226
Authors: Tanzila Tabassum, Omer Subasi, Ajay Panyala, Epiya Ebiapia, Gerald Baumgartner, Erdal Mutlu, P Sadayappan, Karol Kowalski
Categories: Machine Learning (cs.LG)

Abstract:In this work, we develop two main machine-learning-based approaches to predict the runtime parameters of highly scalable parallel chemistry applications. The approaches employ active and generative learning together with empirically selected gradient-boosted regression tree models, chosen from a rich suite of machine learning models. When evaluated on Coupled-Cluster with Singles and Doubles computations, our models achieve a mean absolute percentage error (MAPE) as low as 0.023 and a coefficient of determination as high as 99.9%. Furthermore, when combined with active learning to mitigate the lack of large amounts of training data, our models achieve a MAPE of about 0.2 with 20-25% of the original dataset.
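
For readers unfamiliar with the two reported metrics, here is a minimal sketch of how they are commonly computed. It assumes MAPE is expressed as a fraction (so 0.023 would mean 2.3%), which the abstract does not state explicitly; the sample values are invented.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction (0.05 means 5%)."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Invented runtimes in seconds, for illustration only.
measured = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
print(f"MAPE = {mape(measured, predicted):.3f}, R^2 = {r2(measured, predicted):.4f}")
```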

[LG-59] Graphical conditional generative modeling for digital twin modeling

Link: https://arxiv.org/abs/2606.16219
Authors: Zongren Zou, Théo Bourdais, Ricardo Baptista, Houman Owhadi
Categories: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)

Abstract:Digital twin modeling, including control and data assimilation under model uncertainty, often faces an open-ended fidelity problem: adding variables, data streams, and time scales can indefinitely increase model complexity, ultimately producing systems that are difficult to maintain, validate, interpret, and use for stress or safety testing. As an alternative, one can seek parsimonious stochastic surrogate models built only on the variables needed to describe the relevant quantities of interest. We introduce a framework for discovering such variables from observational data by identifying which candidate inputs influence the full conditional law of a target quantity, rather than only its conditional mean. This distinction is essential in stochastic, coarse-grained, or partially observed systems, where dependencies may appear through changes in variability, tail behavior, multimodality, or uncertainty rather than through deterministic functional relationships. The framework couples conditional generative modeling, which learns the conditional distribution of the target given candidate inputs, with Gaussian-process-based analysis of variance (through kernel mode decomposition), which enables iterative pruning of non-influential inputs and interpretable structure discovery. In control settings, the resulting surrogate can be interpreted as a learned Markov decision process: the method identifies not only a transition model, but also the state, action, and memory variables needed to make the learned dynamics effectively Markovian. Across examples involving stochastic dynamical systems, missing variables, PDE control, reinforcement learning, and economic data, the discovered structures yield interpretable stochastic surrogates whose downstream performance is comparable to models trained on the full variable set.

[LG-60] Data-driven Control with Real-time Uncertainty Compensation for Multi-Fuel Engines

Link: https://arxiv.org/abs/2606.16171
Authors: Rajasree Sarkar, Arunava Banerjee, Sathya Aswath Govind Raju, Ishan Berk Altiner, Zongxuan Sun, Kenneth Kim, Chol-Bum Mike Keown
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)

Abstract:Multi-fuel compression ignition (CI) engines offer superior power density and fuel flexibility. However, achieving consistent and optimal combustion phasing across a wide range of operating conditions remains a major challenge, particularly in the presence of modeling uncertainties. This paper presents a novel, data-driven real-time uncertainty compensation framework for combustion control in multi-fuel CI engines. The proposed approach introduces a pseudo-engine speed that enables dynamic adaptation of control inputs in response to uncertainty affecting the engine. To model the underlying combustion process, a Gaussian Process Regression (GPR) model is first trained on available input-output data, capturing the nonlinear and fuel-dependent behavior across varying operating conditions. Control inputs are then synthesized through model inversion of the learned GPR surrogate and augmented with an uncertainty compensator designed to mitigate deviations caused by dynamic variations in operating conditions and model inaccuracies. This integrated control strategy allows for real-time input corrections within a finite number of combustion cycles. Theoretical analysis establishes finite-time convergence guarantees for the proposed controller. Simulation results demonstrate that the proposed method steers the combustion phasing to the desired value in real-time, providing a scalable and adaptive control solution for multi-fuel CI engine operation.

[LG-61] A Gradient Perspective on RLVR Stability and Winner Advantage Policy Optimization

Link: https://arxiv.org/abs/2606.16154
Authors: Prasanth YSS, Zhichen Ren, Rasa Hosseinzadeh, Ilan Gofman, Yuqi Chen, Zhaoyan Liu, Guangwei Yu, Jesse C. Cresswell, Satya Krishna Gorti
Categories: Machine Learning (cs.LG)

Abstract:Reinforcement learning with verifiable rewards (RLVR) improves language-model reasoning, but GRPO-style optimization remains prone to collapse. We analyse this instability through token-level gradient dynamics, deriving a taxonomy that predicts how updates affect next-token probabilities and entropy. The taxonomy shows that stability depends jointly on the advantage sign and token distribution under the current policy. Motivated by this finding, we propose Winner Advantage Policy Optimization (WAPO), a simple online clipped policy-gradient objective that updates only on positive-advantage completions. Across mathematical reasoning and multi-hop QA benchmarks, WAPO improves training stability and matches or outperforms baselines across multiple model families. Full code can be found at this https URL.
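
The idea of updating only on positive-advantage completions can be sketched as follows. This is a hedged illustration, not the paper's objective: it assumes GRPO-style advantages standardised within a group, sequence-level log-probabilities, and PPO-style ratio clipping with a hypothetical `clip_eps`; the actual WAPO formulation may differ in any of these.

```python
import math


def group_advantages(rewards):
    """GRPO-style advantages: rewards standardised within one prompt's group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # fall back to 1 when all rewards tie
    return [(r - mean) / std for r in rewards]


def winner_only_loss(logp_new, logp_old, rewards, clip_eps=0.2):
    """Clipped surrogate loss averaged over positive-advantage completions.

    logp_new / logp_old: sequence-level log-probabilities of each completion
    under the current policy and the sampling policy (an assumption here).
    """
    terms = []
    for ln, lo, adv in zip(logp_new, logp_old, group_advantages(rewards)):
        if adv <= 0:
            continue  # losers contribute no gradient signal
        ratio = math.exp(ln - lo)
        clipped = min(max(ratio, 1 - clip_eps), 1 + clip_eps)
        terms.append(min(ratio * adv, clipped * adv))
    return -sum(terms) / len(terms) if terms else 0.0


print(winner_only_loss([-1.0] * 4, [-1.0] * 4, [1, 0, 0, 1]))
```

With identical old and new log-probabilities the ratio is 1, so the loss reduces to the negative mean advantage of the winning completions.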

[LG-62] Auditing Machine Unlearning: A Systematic Research on Whether Models Truly Forget

Link: https://arxiv.org/abs/2606.16110
Authors: Dayong Ye, Tianqing Zhu, Ruiding Huang, Xinbo Fu, Jiayang Li, Bo Liu, Huan Huo, Wanlei Zhou
Categories: Machine Learning (cs.LG)

Abstract:Machine unlearning has been extensively studied in response to growing privacy concerns and regulatory requirements. However, auditing whether unlearning algorithms have truly erased the influence of specific data remains an open challenge. The lack of reliable and practical auditing mechanisms can lead to critical privacy risks, such as residual information leakage. This paper initiates a systematic investigation into whether existing unlearning algorithms can truly forget the designated data. We propose the first practical and general-purpose auditing framework for machine unlearning, inspired by the concept of proof of ignorance. Our framework addresses the key practicality limitations of existing methods by eliminating the need for retraining-from-scratch baselines, avoiding the training of large numbers of shadow models, and requiring no intrusive intervention in the original training process. To evaluate the effectiveness of our framework, we first conduct validation experiments to verify its soundness and completeness. We then perform comprehensive experiments across six datasets and ten representative unlearning methods. The results demonstrate that our framework reliably distinguishes between successful and failed unlearning. In particular, we observe that retraining-based and fine-tuning-based methods can achieve effective unlearning, even when the target data remain in the original dataset. In contrast, de-optimization-based methods fail to achieve true unlearning and instead degrade the model’s performance. Fisher/Hessian-based methods also fail to unlearn the requested data, even when formal certification is provided. Moreover, we show that our framework is robust against fake unlearning attempts and generalizes well to large language models.

[LG-63] Polynomial-Time Mistake-Bounded Language Generation

Link: https://arxiv.org/abs/2606.16077
Authors: Héctor Jimenez, Alexander Kozachinskiy, Vicente Opazo
Categories: Computational Complexity (cs.CC); Machine Learning (cs.LG)

Abstract:In this note, we introduce a polynomial-time version of the mistake-bounded language generation (MBLG) framework due to Kleinberg, Peale, and Reingold (2026). We observe that the family of parities of variables and the family of conjunctions of literals are polynomial-time MBLG. Our main result states that the family of monotone Boolean functions with polynomially many maxterms is polynomial-time MBLG. This family includes all monotone Boolean functions computable by polynomial-size decision trees. Our technique can be presented as a new combinatorial game about writing numbers on a board.

[LG-64] Stop the Sampler! Classifier-Based Adaptive Stopping for Sampling Kernels ICML2026

Link: https://arxiv.org/abs/2606.16073
Authors: Kirill Korolev, Nikita Morozov, Stepan Pavlenko, Esmeralda S. Whitammer, Sergey Samsonov
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: ICML 2026 SPIGM Workshop

Abstract:Sampling from complex, unnormalized probability densities is a fundamental challenge in Bayesian inference and probabilistic modeling. While Markov chain Monte Carlo (MCMC) methods provide asymptotic guarantees, they often suffer from slow mixing and high computational costs due to fixed or manually tuned trajectory lengths. In this work, we propose a novel framework that treats trajectory termination as a learnable component of the sampling dynamics. By framing MCMC within the theory of non-acyclic generative flow networks (GFlowNets), we train state-dependent neural classifiers to decide when a trajectory has reached a high-density region and should terminate. We theoretically establish the connection between optimal classifiers and the target density via detailed balance conditions and introduce a multilevel training scheme to facilitate exploration in complex geometries. Experimental results across various benchmark densities demonstrate that our approach significantly reduces average trajectory lengths while improving mode coverage and mixing compared to standard MCMC baselines.

[LG-65] Hidden Degradation Costs in Energy-Cost-Only HEMS Optimisation: Study on Battery and PV Sensitivity

Link: https://arxiv.org/abs/2606.16051
Authors: Dawood Butt, Nandor Verba
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments: FSEM 2026, 5 Pages

Abstract:Residential battery energy storage systems (BESS) are increasingly deployed alongside photovoltaic (PV) generation to reduce household energy costs under volatile time-of-use (TOU) tariffs. Model predictive control (MPC) is a widely adopted optimisation strategy for home energy management systems (HEMS), typically formulated to minimise net energy cost, subject to physical and operational constraints. However, battery degradation is rarely embedded in the optimisation objective, meaning its cost is unquantified, and aggressive, high-cycle-count strategies could incur significant losses once deployed to physical systems. This paper presents a receding-horizon mixed-integer linear programming (MILP) baseline for a UK residential HEMS, using demand data from the REFIT dataset. A 3 by 3 sensitivity study is conducted across three battery sizes and three PV array sizes, with post-hoc degradation cost estimated using the Naumann stress model and rainflow cycle counting. Results show that degradation remains constant for each battery size and can exceed energy cost savings by up to 1,060%. These results demonstrate that energy-cost-only optimisation systematically underestimates the true system cost, motivating a degradation-aware control formulation.

[LG-66] Active Learning with Low-Rank Structure for Data Selection ICML2026

Link: https://arxiv.org/abs/2606.16045
Authors: Vincent Cohen-Addad, Sasidhar Kunapuli, Vahab Mirrokni, Mahdi Nikdan, David P. Woodruff, Samson Zhou
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
Comments: ICML 2026

Abstract:In the data selection problem, the objective is to choose a small, representative subset of data that can be used to efficiently train a machine learning model. Sener and Savarese [ICLR 2018] showed that, given an embedding representation of the data and suitable geometric assumptions, heuristics based on $k$-center clustering can be used to perform data selection. This perspective was further explored by Axiotis et al. [ICML 2024], who proposed a data selection approach based on $k$-means clustering and sensitivity sampling. However, these methods rely on the assumption that the dataset exhibits intrinsic geometric structure that can be effectively captured by clustering, whereas many modern datasets instead possess global algebraic structure that is better exploited by low-rank approximation or principal component analysis. In this paper, we introduce a new data selection framework based on low-rank approximation and residual-based sampling, formulated through the lens of row subset selection and loss-preserving coreset construction. Given an embedding representation of the data satisfying mild regularity conditions, which can be interpreted as algebraic or angular notions of Lipschitz continuity, we show that it is possible to select a weighted subset of $\tilde{O}\left(k + \frac{1}{\varepsilon^2}\right)$ data points whose average loss approximates the average loss over the full dataset within a $(1+\varepsilon)$ relative error, up to an additive $\varepsilon \Phi_k$ term, where $\Phi_k$ denotes the optimal rank-$k$ approximation cost of the embedding matrix. We complement these theoretical guarantees with empirical evaluations, demonstrating that on a range of real-world datasets, our data selection approach achieves improved performance over prior strategies based on uniform sampling or clustering-based sensitivity sampling.

[LG-67] Circuit Tracing in Autoregressive Protein Language Models ICML2026

Link: https://arxiv.org/abs/2606.16044
Authors: Darin Tsui, William Deinzer, Daniel Saeedi, Amirali Aghazadeh
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: Accepted into the Mechanistic Interpretability Workshop at ICML 2026. 24 pages, 14 figures

Abstract:Protein language models (pLMs) can generate novel protein sequences with properties beyond those observed in nature, yet the mechanisms underlying protein generation remain poorly understood. Existing mechanistic interpretability methods based on sparse autoencoders and transcoders primarily focus on protein representation learning models and do not capture the computation required for autoregressive generation. Here, we introduce ProGenMech, a mechanistic interpretability framework for generative protein language models that extends cross-layer transcoders (CLTs) to ProGen3, a sparse Mixture-of-Experts model trained for both causal generation and span infilling. Unlike per-layer approaches, CLTs reconstruct each layer using sparse latent variables from all preceding layers, enabling faithful recovery of inter-layer generative computation. We further develop a zero-shot circuit discovery framework to identify sparse latent circuits responsible for protein generation and fitness prediction. In causal generation and zero-shot fitness estimation tasks, ProGenMech outperforms local transcoder baselines in recovering ProGen3’s probability distribution and functional scoring behavior, while matching the original model’s generative distribution in span infilling tasks. Moreover, the recovered circuits reveal biologically meaningful motifs and functional regions associated with conserved sequence patterns and protein fitness landscapes, establishing a foundation for interpretable and steerable protein generation.

[LG-68] Inference-Time Decision Calibration for Temporal Classification

Link: https://arxiv.org/abs/2606.16034
Authors: Arthur Chagas, Arthur Buzelin, Yan Aquino, Pedro Bento, Gisele L. Pappa, Wagner Meira Jr., Cristiano Arbex Valle
Categories: Machine Learning (cs.LG)

Abstract:Temporal classification errors are often treated as representation failures, but they can also arise from how available evidence is converted into decisions. This paper proposes a representation–calibration decomposition for temporal classification. We keep a trained native classifier frozen and separate two inference-time interventions: a conservative residual multi-scale branch that adds auxiliary logits to the native prediction, and a post-hoc branch-aware calibrator that recombines native and residual evidence at decision time. This design distinguishes missing temporal evidence from underused decision-level evidence without retraining the backbone. Across FI-2010, PTB-XL, UCI-HAR, MHEALTH, and HARTH, we find that gains are strongly regime-dependent. Residual multi-scale evidence is most useful in noisy or representation-limited settings, especially short-horizon FI-2010 and weaker recurrent backbones, while branch-aware calibration helps when native and auxiliary logits contain complementary evidence not fully exploited by the raw decision rule. Near-saturated settings show limited gains from either intervention. These results suggest that temporal classification should be understood not only as representation learning, but also as the problem of trusting, combining, and calibrating evidence from multiple views.
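
One simple way to picture post-hoc recombination of native and residual logits is a pair of scalar branch weights fitted on held-out data while the backbone stays frozen. The sketch below is an assumption for illustration only: the abstract does not specify the calibrator's form, and the names `combine`, `fit_calibrator`, and the grid of weights are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine(native, residual, alpha, beta):
    """Recombine frozen native logits with auxiliary residual logits."""
    return [alpha * n + beta * r for n, r in zip(native, residual)]


def nll(native_batch, residual_batch, labels, alpha, beta):
    """Mean negative log-likelihood of the recombined prediction."""
    loss = 0.0
    for native, residual, y in zip(native_batch, residual_batch, labels):
        loss -= math.log(softmax(combine(native, residual, alpha, beta))[y])
    return loss / len(labels)


def fit_calibrator(native_batch, residual_batch, labels,
                   grid=(0.0, 0.5, 1.0, 1.5, 2.0)):
    """Grid-search the two branch weights on held-out data."""
    return min(((a, b) for a in grid for b in grid),
               key=lambda ab: nll(native_batch, residual_batch, labels, *ab))


# Invented held-out logits: the second sample is only fixed by the residual branch.
native = [[1.0, 0.0], [0.2, 0.0]]
residual = [[0.0, 0.0], [-1.0, 1.0]]
labels = [0, 1]
print(fit_calibrator(native, residual, labels))
```

Setting `alpha = 1, beta = 0` recovers the native decision rule, so any fitted pair with lower held-out loss indicates evidence in the residual branch that the raw rule was not using.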

[LG-69] The Information-Theoretic Benefit of Shared Representations under Orthogonality Constraints

Link: https://arxiv.org/abs/2606.16028
Authors: Thomas Dittrich, Oliver Potocki, Philipp Grohs
Categories: Machine Learning (cs.LG); Information Theory (cs.IT); Functional Analysis (math.FA)

Abstract:Modern deep learning architectures are increasingly multi-task and multi-modal, using a pretrained foundation model combined with task-specific, fine-tuned models. Empirically, exploiting similarity across different problems, instead of solving them individually, can significantly improve overall performance. While the generalization and sample complexity properties of multitask learning have been widely studied, the parametric complexity of joint approximation in comparison to separate approximation remains less well understood. The question is particularly relevant in modern deep learning, where models are increasingly required to satisfy structural constraints such as equivariance, conservation laws, or orthogonality. We prove lower and upper bounds on the description-length for separate and joint approximation classes, respectively, in uniform norm. We build a class of orthogonal functions by composing a shared hard feature, realized by a Rademacher-Haar wavelet series, with Sawtooth-Walsh readouts to enforce orthogonality of output coordinates. The dyadic tree structure of the Rademacher-Haar wavelet concentrates the approximation hardness in the common feature component, while the readouts act as task-specific heads. Using an information-theoretic framework, we obtain a sharp gap between the optimal approximation rates achievable by joint and separate coding. Finally, we realize this separation in a neural network model using Heaviside activations via reduction to triangle-wave approximation. Our results show that even under an orthogonality constraint joint approximation requires strictly fewer bits in compositional architectures, provided the tasks share a latent hard feature. This provides theoretical insight into the description-length-efficiency of compositional multi-output architectures and clarifies how neural networks can retain expressivity under geometric constraints.

[LG-70] IBAD: Interpretable Behavioral Anomaly Detection on Human Mobility Data

Link: https://arxiv.org/abs/2606.16023
Authors: Bita Azarijoo, John Krumm, Cyrus Shahabi
Categories: Machine Learning (cs.LG)

Abstract:Human mobility appears highly diverse, yet much of a person’s daily mobility can be explained by a small set of recurring behavioral templates, such as commuting, school-centered activities, caregiving, nightlife, or errand patterns. We present IBAD (Interpretable Behavioral Anomaly Detection), a framework that learns interpretable daily mobility templates and represents each individual as a distribution over mixtures of these templates. Rather than focusing on specific locations, IBAD characterizes activities that individuals perform across locations. This approach first discovers global behavioral templates using Latent Dirichlet Allocation (LDA), then employs a hierarchical self-supervised model to learn normal behavior of individuals from their soft behavioral templates. We also introduce a splicing benchmark that creates controlled behavioral mismatches between an individual’s historical profile and injected mobility patterns. Experiments on real-world and synthetic datasets show that daily behavior can be effectively decomposed into a small number of interpretable templates. Crucially, we show that the learned behavioral archetypes transfer across distinct geographic and demographic contexts. Furthermore, IBAD maintains robust, competitive performance across all settings. For reproducibility purposes, the code is accessible at this https URL.

[LG-71] Decomposing one-class support vector machine into an ensemble of one-data support vector machines

Link: https://arxiv.org/abs/2606.16002
Authors: Toshitaka Hayashi, Dalibor Cimr, Hamido Fujita, Richard Cimler
Categories: Machine Learning (cs.LG)

Abstract:One-class classification (OCC) is a classification problem in which the training data contains only one class. The one-class support vector machine (OCSVM) is one of the most competitive OCC algorithms. However, OCSVM has scalability issues with large-scale datasets. This paper proposes an acceleration strategy for OCSVM. The idea is to decompose the dataset into individual samples and train an OCSVM model for each single data point. Subsequently, ensemble learning is applied to combine all models into the OCSVM model for the dataset. In addition, further acceleration is achieved through a data-reduction strategy with an OCSVM model trained on the average of the training samples. The experiments compare the proposal with traditional OCSVM using the Python package. The proposed strategy is faster than traditional OCSVM while achieving similar classification results. Moreover, the proposed strategy can create a one-to-one correspondence between samples and models. Source code is available at this https URL

[LG-72] Scalar-Stepsize Nonuniform Monte Carlo Optimistic Policy Iteration: A Certified Counterexample

Link: https://arxiv.org/abs/2606.15978
Authors: Yuanlong Chen
Categories: Machine Learning (cs.LG)

Abstract:Tsitsiklis proved convergence of Monte Carlo optimistic policy iteration under a uniform update structure and identified nonuniform update frequencies as a delicate obstruction. We give a certified negative answer for the natural scalar-stepsize, unnormalized asynchronous state-value recursion with fixed nonuniform state-selection probabilities. In a three-state, two-action discounted MDP, the nonuniform update frequencies induce a diagonally scaled greedy-policy mean field with a certified nonconstant attracting hybrid periodic orbit. With a bounded unbiased geometric-horizon estimator and Robbins–Monro stepsizes, the original stochastic recursion remains trapped near the cycle with positive probability and therefore fails to converge. The example pinpoints a geometric obstruction: uniform sampling gives radial residual contraction, whereas scalar nonuniform sampling anisotropically distorts the residual dynamics and can generate switched attracting cycles.

[LG-73] Causal-Privacy Audit Workflow for Synthetic and Distilled Data in Dropout Support

Link: https://arxiv.org/abs/2606.15940
Authors: Hanghang Zheng, Xiwei Zhuang, Zhong Wang, Hong Liu, Xiao Chen, Jingwen He, Xia Li
Categories: Machine Learning (cs.LG)

Abstract:Synthetic and distilled student data are increasingly used to enable privacy-conscious learning analytics, yet their suitability for decision-facing institutional support remains uncertain. In dropout support, generated data must preserve not only predictive utility or distributional resemblance, but also the financial-status evidence used to guide advising, payment-plan assistance, and scholarship-related decisions. Method: This study introduces CaP-Eval, a decision-facing causal-privacy audit workflow for evaluating generated student data under a fixed estimand, timing-aware adjustment design, estimator set, and empirical privacy-governance screen. The workflow compares original, distilled, adversarial synthetic, statistical synthetic, and DPGNet privacy-oriented generated data on predictive utility, treatment-effect fidelity, robustness to alternative estimators, and local training-record proximity. Results: DPGNet and distilled data preserved the original financial-status treatment-effect structure more reliably than the adversarial and Gaussian Copula baselines. DPGNet preserved full direction and rank agreement across epsilon levels; epsilon = 10 produced the smallest non-original IPW and DML deviations, while epsilon = 1 and epsilon = 5 amplified several financial-status contrasts. Distilled data remained highly faithful but retained the strongest local training-record proximity signal. TabularGNet preserved qualitative directions with moderate attenuation, and Gaussian Copula compressed effect magnitudes. Conclusions: Predictive utility, privacy orientation, empirical disclosure signals, and causal fidelity diverged; generated student data require joint audits of direction, magnitude, overlap, and release-governance risk before decision use.

[LG-74] An Exploratory Study of Blood Glucose Estimation from Photoplethysmography Signals using Machine Learning

Link: https://arxiv.org/abs/2606.15927
Authors: Ruhani Bhatia, Vijval Ekbote
Categories: Machine Learning (cs.LG)
Comments: 7 pages, 3 figures

Abstract:Diabetes and extreme blood sugar levels are some of the major health problems faced by humans today across the world. While Continuous Glucose Monitoring (CGM) has emerged as an effective technology for management of diabetes as well as for monitoring blood sugar levels, this technology has traditionally been invasive (that is, requiring the piercing of the skin) and carries the risk of irritation, induration, etc. This highlights the need for accurate and non-invasive CGM methods that can be deployed at scale. With the emergence of various sensing technologies and their integration in wearables like the smart-watch, we now have the capability to continuously monitor body signals like the Photoplethysmogram (PPG) in a non-invasive manner. Having the ability to continuously monitor blood glucose through CGMs and continuously monitor PPG signals through a smart-watch offers an opportunity to get dense data on these two, opening the possibility of building machine learning and deep learning based models to estimate blood glucose level from PPG signals. In this work, we first present a paired dataset comprising continuous PPG signals from a smartwatch along with glucose values recorded using a CGM device. We also present the results of some preliminary experimental explorations performed on our dataset. These preliminary results suggest that some predictive signals may exist, though more exploration is needed with more data from a larger number of individuals. The dataset can be accessed at this https URL

[LG-75] Reinforcement Learning for LLM-based Event Forecasting

Link: https://arxiv.org/abs/2606.15917
Authors: Amit Arnold Levy
Categories: Machine Learning (cs.LG)
Comments: Submitted internally at the University of Oxford in Oct 2025, migrated to arXiv on Jun 2026

Click to view abstract

Abstract:We use Group Relative Policy Optimization (GRPO), a recently devised sample and memory efficient reinforcement learning method, to finetune pretrained LLMs in the range of 1.5B to 14B parameters equipped with the ability to get current information through the use of a Wikipedia revisions tool, or news summaries, to forecast real events beyond the knowledge cutoff of the LLM, as well as problems made to simulate different aspects of the dynamics of that training. We use the results of these experiments to comment on the scaling capability of LLMs for forecasting, as well as classify how judgmental forecasting fits into the verifiable/unverifiable domain taxonomy, considering the impact of the inherent aleatoric uncertainty when forecasting future events (e.g. the roll of a die). As a result of the GRPO training, we manage to bring a 1.5B parameter transformer (Qwen 2.5 1.5B) to forecasting performance superior to Claude Sonnet 3.5 over the same dataset as measured by cross entropy from the market agreed probabilities. We also discuss various dead ends on the path to this result.
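
As a rough illustration of the evaluation metric mentioned above, the sketch below scores a binary forecast by its cross entropy against a market-agreed probability. The function name, the clipping constant `eps`, and the example numbers are assumptions made for illustration; they are not taken from the paper.

```python
import math

def cross_entropy(market_p, model_q, eps=1e-12):
    """Cross entropy of a model forecast q against a market probability p (binary event)."""
    # Clip q away from 0 and 1 so the logarithms stay finite (illustrative choice).
    q = min(max(model_q, eps), 1.0 - eps)
    return -(market_p * math.log(q) + (1.0 - market_p) * math.log(1.0 - q))

# Hypothetical question: the market puts 70% on the event.
for q in (0.5, 0.7, 0.9):
    print(q, round(cross_entropy(0.7, q), 4))
```

The score is lowest when the forecast equals the market probability, and even then it stays above zero, which is one way to see the aleatoric uncertainty the abstract refers to.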

[LG-76] LoComposition: Terrain-Adaptive Energy-Efficient Quadruped Locomotion without Gait Priors

Link: https://arxiv.org/abs/2606.15896
Authors: Loukas Kordos,Leonard T. Franz,Simon Rappenecker,Oliver Hausdoerfer,Angela P. Schoellig,Pavel Kolev,Georg Martius
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 17 pages, 5 figures, 10 tables

Click to view abstract

Abstract:Learning-based quadrupedal locomotion typically relies on complex reward formulations that entangle task specification, operational limits, gait preference, and terrain adaptation within a single optimization objective. We instead treat these functions through distinct mechanisms: rewards for task specification, constraints for operational limits, energy minimization for gait preference, and exteroceptive perception for adapting energy use to terrain difficulty. We show that these components jointly enable efficient, terrain-adaptive locomotion, and that removing each component exposes a distinct failure mode. Our formulation removes explicit gait priors (including air-time, contact-count, and foot-clearance targets) in favor of emergent behavior. Compared to a conventional complex-reward baseline, our formulation achieves comparable terrain traversal while reducing cost of transport by 56% and operational-limit violations by 96%. The resulting policies transfer zero-shot to a physical Unitree Go2 using LiDAR-based elevation mapping. Project website with videos: this https URL.

[LG-77] Scalar-pathway fidelity improves physical accuracy in short-range equivariant interatomic potentials

Link: https://arxiv.org/abs/2606.15892
Authors: Jia Bi,Alin Marin Elena,Samuel Pinilla
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Accurate interatomic potentials enable molecular dynamics of materials, molecules, and interfaces beyond density-functional-theory length and time scales. Equivariant neural network potentials have improved the representation of local geometry. However, their deployable energy surfaces ultimately manifest through invariant scalar channels, whose aggregation and spectral resolution remain comparatively underexamined. Here we use Physics-Aware Neighborhood (PAN) pooling and Physics-Guided Spectral (PGS) mixers as controlled scalar-pathway probes: lightweight, symmetry-preserving modifications that act only on (\ell=0) channels while leaving the equivariant tensor backbone unchanged. Using MACE as a high-body-order mechanistic scaffold, PAN adds coordination-sensitive amplitude modulation, whereas PGS augments edge and readout scalar features with radial and tapered spectral bases. Across metallic Ag, covalent Si, a short-range ionic LiF/Li–F subset, and MD17/rMD17 molecules, this scalar-pathway correction reduces MACE force errors by 22–27% and energy errors by 19–22%; on systems with stress labels, stress errors decrease by 27–28%, at approximately 5% additional inference-FLOPs cost. Directionally consistent gains in Allegro and NequIP further indicate that the correction is portable across distinct short-range equivariant backbones, although effect sizes remain architecture-dependent. These results identify scalar-pathway fidelity as a practical design dimension for short-range equivariant interatomic potentials.

[LG-78] David vs. Goliath in Next Activity Prediction: Argmax vs. LSTM, Transformer, and LLM

Link: https://arxiv.org/abs/2606.15868
Authors: Hans Weytjens,Ingo Weber
Categories: Machine Learning (cs.LG)
Comments: Accepted for 24th International Conference on Business Process Management (2026) Forum

Click to view abstract

Abstract:Next activity prediction (NAP) is a cornerstone of predictive process monitoring (PPM), enabling organizations to move from retrospective analysis to proactive process steering. The PPM field has progressed from classical machine learning through deep learning architectures such as LSTMs and Transformers to large language models (LLMs). Despite growing model complexity, no benchmark jointly compares LLMs, Transformers, LSTMs, and simple baselines in a direct sequence modeling setting for NAP. In this paper, we fill this gap with a systematic benchmark. We compare vocabulary-adapted LLMs, Transformers trained from scratch, LLM-distilled Transformers, and LSTMs against a simple counting-based argmax baseline across seven real-life event logs. Our results tell a David vs. Goliath story: pretraining confers no consistent improvement over training from scratch, model size shows little effect on performance, and on most datasets the argmax baseline matches or approaches the performance of billion-parameter LLMs.
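
To make the "simple counting-based argmax baseline" concrete, here is a minimal sketch. The abstract does not say what the baseline conditions on, so this version assumes it counts successors of the last observed activity only; the function names and the toy event log are hypothetical.

```python
from collections import Counter, defaultdict

def fit_argmax_baseline(traces):
    """Count, for each activity, which activity follows it in the training traces."""
    followers = defaultdict(Counter)
    for trace in traces:
        for current, nxt in zip(trace, trace[1:]):
            followers[current][nxt] += 1
    # Keep only the most frequent successor per activity (the argmax).
    return {act: counts.most_common(1)[0][0] for act, counts in followers.items()}

def predict_next(model, prefix, default=None):
    """Predict the next activity from the last activity of a running case."""
    if not prefix:
        return default
    return model.get(prefix[-1], default)

# Toy event log (hypothetical activity names).
traces = [
    ["register", "check", "approve", "pay"],
    ["register", "check", "reject"],
    ["register", "check", "approve", "pay"],
]
model = fit_argmax_baseline(traces)
print(predict_next(model, ["register", "check"]))
```

A baseline like this has no trainable parameters beyond the counts, which is what makes the reported comparison against billion-parameter models notable.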

[LG-79] SILAGE: Memory-Efficient Full-Gradient-Free Nonconvex Optimization for Nested Finite Sums

Link: https://arxiv.org/abs/2606.15832
Authors: Igor Sokolov,Laurent Condat,Peter Richtárik
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 80 pages, 3 algorithms, 4 theorems, 2 corollaries, 11 lemmas, 2 figures, 12 tables

Click to view abstract

Abstract:Empirical risk minimization on massive datasets naturally exhibits a nested double finite-sum structure, where N=nm total samples are logically or physically partitioned into n blocks of size m (e.g., in pooled data silos, out-of-core learning, or deliberate stratification). While variance-reduced methods achieve optimal oracle complexities for nonconvex objectives, they suffer from severe scaling bottlenecks in this centralized regime. Recursive estimators, such as PAGE, require periodic global full-gradient refreshes over all nm samples, which are computationally expensive. Conversely, single-loop methods, such as SILVER, avoid such refreshes but require an impractical \mathcal{O}(nm) memory footprint to store a control variate for every sample. In this paper, we propose SILAGE, a variance-reduced algorithm that addresses this trade-off. By actively exploiting the double-sum structure, SILAGE eliminates periodic global full-gradient refreshes over all nm components (evaluating at most one local group gradient per iteration) while requiring only \mathcal{O}(n) memory. Furthermore, we provide a tight convergence analysis that avoids pessimistic worst-case Lipschitz constants. Instead, SILAGE’s complexity natively adapts to the underlying data geometry via nested functional similarities: across-group ( \delta_1 ) and within-group ( \delta_2 ) heterogeneity. Our results improve existing state-of-the-art bounds in several practically relevant regimes.

[LG-80] Brownian Kernel Ladders

Link: https://arxiv.org/abs/2606.15812
Authors: Mahdi Mohammadigohari,Giuseppe Di Fatta,Giuseppe Nicosia,Panos M Pardalos
Categories: Machine Learning (cs.LG)
Comments: Submitted to JMLR

Click to view abstract

Abstract:Constructing mathematically tractable function spaces that capture hierarchical compositional representations remains a central challenge in statistical learning theory. We introduce Brownian kernel ladders (BKLs), a recursively defined hierarchy of integral reproducing kernel Hilbert spaces generated through Brownian-kernel integral constructions. Starting from linear functionals, each layer is obtained by integrating Brownian kernels over probability measures supported on subsets of the previous layer, yielding a recursive function-space model in which depth is encoded directly through the hierarchy. Based on this framework, we define canonical BKL spaces together with an associated complexity functional. We establish several analytical and statistical properties of these spaces. In particular, we show that BKL spaces form quasi-Banach spaces, satisfy depth-dependent Hölder regularity estimates, and exhibit strict monotonicity with respect to depth. We further prove existence results for regularized empirical risk minimization and derive Gaussian complexity bounds that remain uniformly controlled with respect to both the ambient dimension and the hierarchy depth. A key ingredient of the analysis is a combinatorial proof technique based on recursive subset decompositions and Brownian-kernel threshold representations. These estimates yield excess-risk guarantees of near-parametric order for regularized empirical risk minimization over BKL spaces. Our results provide a mathematically tractable hierarchical function-space framework for studying compositional representations in deep learning.

[LG-81] Mean-Field Parallel Decoding for Discrete Diffusion Language Models

Link: https://arxiv.org/abs/2606.15805
Authors: Tamim Zoabi,Ameen Ali,Liran Ringel,Lior Wolf
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Discrete diffusion language models enable parallel token generation, offering a pathway to low-latency decoding. However, selecting tokens independently by marginal confidence limits effective parallelism: tokens that appear reliable in isolation can form incompatible configurations when several positions are updated at once. We introduce a training-free decoding framework that coordinates these parallel updates. At each forward pass, the method assigns a commit score to each masked position and refines these scores using pairwise interactions derived from the model’s predictive distributions. A variational relaxation yields a simple fixed-point update that suppresses conflicting simultaneous commitments within a single forward pass. This mechanism allows the decoder to commit more tokens in parallel while maintaining competitive generation quality. The method is lightweight, requires no auxiliary model or retraining, and drops into existing diffusion decoding pipelines without modification. Experiments on reasoning and code-generation benchmarks show consistent improvements in the quality-latency trade-off.

[LG-82] Bayesian Networks with Latent Time Embedding for Stage-Aware Causal Modeling of Alzheimer's Disease Progression

Link: https://arxiv.org/abs/2606.15784
Authors: Nguyen Linh Dan Le
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Comments: 7 pages, 5 figures

Click to view abstract

Abstract:Alzheimer’s disease (AD) progression is often described through the amyloid-tau-neurodegeneration, or AT(N), cascade. However, most longitudinal models represent this cascade either as a fixed sequence of biomarkers or as a black-box forecasting task. This makes it difficult to determine when biologically guided biomarker relationships influence future regional pathology. In this study, we introduce Bayesian Networks with Latent Time Embedding (BN-LTE), a Bayesian structural framework for stage-aware modeling of AD progression. BN-LTE estimates disease pseudotime from baseline biomarker profiles and constrains directed dependencies according to biologically plausible AT(N) ordering. Posterior spline-varying structural equations are then used to link initial multimodal measurements with future annualized regional tau-PET change. Across repeated subject-disjoint evaluations using ADNI data, BN-LTE shows strong spatial reconstruction of tau progression compared with the included forecasting baselines. Beyond spatial reconstruction, BN-LTE recovers posterior stage-varying AT(N)-constrained effects and identifies a mid-pseudotime window of amyloid sensitivity. This window is supported by model-implied g-formula contrasts, root-adjusted AIPW, mechanism-sensitive ablations, and robustness analyses across spline and prior specifications. Overall, these findings position BN-LTE as a Bayesian structural framework for forecasting tau progression while examining stage-dependent AT(N)-cascade mechanisms in observational longitudinal neuroimaging data. Our code is available at this https URL.

[LG-83] The Data Manifold under the Microscope ICML2026

Link: https://arxiv.org/abs/2606.15760
Authors: Marios Koulakis,Constantin Seibold
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted at ICML 2026. Camera-ready version

Click to view abstract

Abstract:A significant gap exists between theory and practice in deep learning. Generalization and approximation error bounds are often derived for simplified models or are too loose to be informative. Many rely on the manifold hypothesis and on geometric regularity such as intrinsic dimension, curvature, and reach. Progress requires insight into data-manifold geometry and suitable benchmarks, yet existing options are polarized: analytic manifolds with known geometry but limited applicability, or real-world datasets where geometry is only coarsely estimable. We introduce a benchmarking framework for studying data geometry. We repurpose and extend dSprites and COIL-20 with additional transformation dimensions and dense, axis-aligned sampling, and pair them with finite-difference estimators that recover curvature, reach, and volume at near-ground-truth accuracy in a regime where general-purpose estimators are unreliable or difficult to deploy. The framework is intended as a controlled testbed, useful as a calibration environment for geometric estimators and a sandbox for probing theoretical assumptions. To illustrate its use, we present two application studies, namely assessing the scaling behavior of the bounds of Genovese et al. and Fefferman et al., and tracking the layer-wise geometry of a \beta-VAE, highlighting the behavior of current bounds and the value of controlled benchmarks for guiding and validating future theory. A reference implementation is available at this https URL.

[LG-84] Acoustic Prompting via Stage-wise Modulation for Few-Shot Learning in Audio Language Models INTERSPEECH2026

Link: https://arxiv.org/abs/2606.15751
Authors: Hyebin Cho,Jaehyuk Jang,Changick Kim,Joon Son Chung
Categories: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: Accepted to INTERSPEECH 2026

Click to view abstract

Abstract:Audio-Language Models (ALMs) have shown remarkable success in zero-shot audio classification by aligning audio waveforms with text. Recent efforts to improve downstream performance focus on learning optimal text prompts. However, previous approaches focus on the text encoder, leaving the potential of learnable prompts within the audio encoder unexplored. In this paper, we propose a novel framework that introduces trainable prompts into the audio encoder to capture task-specific acoustic features. We demonstrate that integrating audio-side prompt learning with existing text-side approaches enhances few-shot adaptation. Extensive experiments across 11 datasets show that integrating our method as a plug-and-play module alongside existing text prompt tuning generally leads to performance improvements. These findings suggest that explicitly modulating the audio representation space effectively complements text-only prompting approaches. The code is available at this https URL.

[LG-85] Unsupervised Learning for Missing Modalities in Multimodal Learning

Link: https://arxiv.org/abs/2606.15743
Authors: Hassan Ismkhan,Hamid Bouchahcia
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:This paper addresses the missing-modality challenge in multi-modal learning by introducing Unsupervised Learning for Missing Modalities in Multi-Modal Learning (UL4M4), a flexible framework that imputes missing feature embeddings in a task-independent manner before supervised prediction. We propose modality-specific normalization and a novel partial-modality distance metric to enable fair clustering of incomplete observations, capturing cross-modal structures while preserving scale-invariance across varying dimensionalities and modality counts. Cluster centers from this unsupervised stage guide an iterative greedy imputation process for any missing modalities during training or inference, supporting arbitrary numbers of modalities and arbitrary missing patterns per sample. The imputation module is lightweight, uses frozen encoders, and decouples from the downstream task, allowing easy integration with any fusion/prediction architecture. Extensive experiments under diverse and highly incomplete regimes demonstrate UL4M4’s robustness, achieving, to the best of our knowledge, the first consistent F1-Micro scores above 0.7 on challenging missing configurations even when more than 50% of modality slots are missing. Results are also stable across cluster sizes and significantly outperform state-of-the-art baselines. Code is available here: this https URL.

[LG-86] How to Score Experts for One-Shot MoE Expert Pruning: A Unified Formulation and Selection Principle

Link: https://arxiv.org/abs/2606.15716
Authors: Zongfang Liu,Jinghui Zhang,Zijian Ma,Guangyi Chen,Xin Yuan
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Mixture-of-Experts (MoE) language models reduce per-token computation through sparse expert activation, yet deployment still requires storing the full expert pool, making one-shot expert pruning a practical approach for reducing memory usage. Although effective, existing criteria are largely heuristic, and no single criterion is universally optimal. Thus, establishing a principle for selecting pruning criteria suited to different deployment objectives remains an important yet largely underexplored problem in one-shot expert pruning. To this end, we introduce a unified formulation for one-shot MoE expert pruning organized around three factors: routing frequency, gate weighting, and activation strength. The formulation yields a criteria selection principle: task-agnostic pruning should favor routed-token-averaged, gate-free activation-based criteria, whereas task-specific pruning can benefit from retaining routing-frequency and gate-weight information. Beyond this principle, the formulation also provides a systematic view of existing heuristic criteria and gives rise to two new task-agnostic criteria, Mean Activation Norm (MAN) and Mean Squared Activation Norm (MSAN). Across four representative MoE models and 16 diverse benchmarks, MAN and MSAN are consistently strong in the task-agnostic setting, obtain the top-two average ranks, and improve average performance by up to 8.8 points over the strongest baseline.
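
The abstract names Mean Activation Norm (MAN) and Mean Squared Activation Norm (MSAN) but does not give formulas, so the following is only one plausible reading of a "routed-token-averaged, gate-free activation-based" score. It assumes the activation is the expert's output vector for each token routed to it, and that the lowest-scoring experts are the ones pruned. All names and numbers are hypothetical.

```python
import math

def expert_scores(routed_outputs):
    """routed_outputs maps expert id -> list of output vectors for tokens routed to it."""
    scores = {}
    for expert, outputs in routed_outputs.items():
        if not outputs:
            # An expert that received no calibration tokens gets the lowest score.
            scores[expert] = {"MAN": 0.0, "MSAN": 0.0}
            continue
        sq_norms = [sum(x * x for x in vec) for vec in outputs]
        scores[expert] = {
            # Average over routed tokens only; router gate weights are not used.
            "MAN": sum(math.sqrt(s) for s in sq_norms) / len(outputs),
            "MSAN": sum(sq_norms) / len(outputs),
        }
    return scores

def experts_to_prune(scores, key, n_prune):
    """Return the n_prune experts with the smallest score under the chosen criterion."""
    return sorted(scores, key=lambda e: scores[e][key])[:n_prune]

# Tiny made-up calibration set with three experts.
routed = {"e0": [[3.0, 4.0], [0.0, 0.0]], "e1": [[1.0, 0.0]], "e2": []}
scores = expert_scores(routed)
print(scores)
print(experts_to_prune(scores, "MAN", 1))
```

Averaging over routed tokens, instead of summing, keeps the score from rewarding an expert merely for being selected often, which matches the abstract's distinction between task-agnostic and task-specific criteria.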

[LG-87] Robust Transformer-Based One-Step Stock Index Forecasting via Shifted Data Augmentation

Link: https://arxiv.org/abs/2606.15701
Authors: Tien Thanh Thach
Categories: Machine Learning (cs.LG); Statistical Finance (q-fin.ST)
Comments:

Click to view abstract

Abstract:Transformers have shown remarkable success in sequence modeling, yet their direct application to financial time series remains challenging due to noisy signals, short-memory dynamics, and distributional shifts. This paper proposes a modified Transformer architecture for one-step stock index forecasting, combined with advanced learning-rate scheduling and a novel Shifted Data Augmentation (SDA) technique. We evaluate the proposed framework on two benchmark stock index datasets, VN30 and S&P 500. Experimental results demonstrate that cosine annealing with warmup consistently improves forecasting accuracy over the generalized inverse-power scheduler. Furthermore, SDA substantially reduces forecasting errors and run-to-run variability while improving robustness to hyperparameter selection. The combination of cosine annealing scheduling and SDA achieved the best performance on both datasets, indicating that data augmentation can play a more important role than increasing model complexity in Transformer-based financial forecasting. These findings provide a practical and computationally efficient approach for robust stock index forecasting in noisy financial environments.
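
For readers unfamiliar with the schedule the abstract favors, the sketch below shows a common form of cosine annealing with warmup. The linear warmup shape, the parameter names, and the example values are assumptions; the paper's exact schedule and hyperparameters are not stated in the abstract.

```python
import math

def lr_at(step, total_steps, base_lr, warmup_steps, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay to min_lr."""
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp linearly so the last warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical run: 100 steps, 10 of them warmup, peak learning rate 1e-3.
for step in (0, 9, 10, 55, 100):
    print(step, lr_at(step, total_steps=100, base_lr=1e-3, warmup_steps=10))
```

The warmup phase avoids large early updates, and the cosine phase lowers the rate smoothly instead of in discrete drops.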

[LG-88] Multi-Fidelity SINDy: Sparse Discovery of Nonlinear Dynamical Systems with Fidelity-Weighted Measurements

Link: https://arxiv.org/abs/2606.15690
Authors: Filippo Zacchei,Ana Larrañaga,Attilio Frangi,Andrea Manzoni,Steven L. Brunton
Categories: Machine Learning (cs.LG); Dynamical Systems (math.DS)
Comments: 27 pages, 6 figures, 2 tables

Click to view abstract

Abstract:Data from simulations and experiments are rarely noise-free and often exhibit heterogeneous levels of fidelity. Measurement uncertainty may vary across repeated observations, sensing devices, or even within a single experiment. This work addresses the problem of discovering nonlinear dynamical systems from such inhomogeneous data. We extend the Sparse Identification of Nonlinear Dynamical Systems (SINDy) framework to account for variable noise levels by combining Ensemble SINDy and Weak SINDy within a weighted regression formulation derived from generalized least squares. A statistical justification for the weighting strategy is also provided. The methodology is validated on several benchmark systems, including ordinary and partial differential equations. In addition, we show the benefit of multi-fidelity integration for forecasting the dynamics of a double pendulum system. The results confirm that the proposed approach mitigates the adverse effects of heteroscedastic noise and that repeated, low-cost, low-quality measurements can improve model recovery, in some cases matching or outperforming reconstructions obtained using only high-fidelity data.

[LG-89] ReQAT: Achieving Full-Precision Reasoning Accuracy with 4-bit Floating-Point Quantization-Aware Training ICML2026

Link: https://arxiv.org/abs/2606.15682
Authors: Janghwan Lee,Sihwa Lee,Jinseok Kim,Yongjik Kim,Jieun Lim,Jinwook Oh,Jungwook Choi
Categories: Machine Learning (cs.LG)
Comments: ICML 2026

Click to view abstract

Abstract:Large Reasoning Models (LRMs) achieve strong problem-solving through long chain-of-thought, but their deployment is constrained by the high cost of full-precision inference and growing KV cache footprints. Microscaled FP4 formats enable efficient FP4 deployment; however, fully quantizing weights, activations, and KV caches (W4A4KV4) causes severe reasoning degradation that existing PTQ and QAT fail to recover. We identify that FP4 failures concentrate on low-entropy tokens–precise symbolic commitments such as digits and operators–where quantization noise inflates sampling errors that cascade through reasoning traces. Based on this insight, we propose ReQAT, a reasoning-centric FP4 training framework with three components: (i) Trace-Aligned QAT (TAQ), which revisits identical reasoning traces to focus updates on critical low-entropy decisions; (ii) Selective Entropy Minimization (SEM), which reinforces confidence at low-entropy positions; and (iii) Q-FIT, a quantization-friendly initialization that jointly calibrates RoPE-consistent KV cache transformations to stabilize QAT. Under the same training budget, ReQAT not only recovers but surpasses BF16 fine-tuning accuracy, while delivering up to 3.9x throughput speedup on NVIDIA DGX Spark and 3.1x on B200.

[LG-90] Multi-Agent Framework for Audit Risk Assessment with Explicit Uncertainty and Evidence Conflict Modeling

Link: https://arxiv.org/abs/2606.15640
Authors: Yuhan Wang,Manqing Wang,Yixuan Lu,Zhaoyue Peng,Shengda Lin
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Audit risk assessment increasingly benefits from combining heterogeneous evidence sources, yet existing approaches typically produce point predictions without quantifying how well different evidence streams agree. We propose UMAR (Uncertainty-Aware Multi-Agent Risk Assessment), a framework that employs three specialized agents: an MDA Text Agent, a Financial Ratio Agent, and a CAM Agent, each producing independent risk scores with calibrated uncertainty estimates. An Uncertainty Aggregator based on Dempster-Shafer evidence theory fuses these scores while explicitly measuring inter-agent conflict. We evaluate UMAR on a U.S. dataset of 3,200 firm-year observations from SEC 10-K filings (2019-2023), with financial restatement as the target label. Experimental results show that UMAR achieves an AUROC of 0.782 and a PR-AUC of 0.341, outperforming logistic regression, XGBoost, FinBERT, and single-agent and dual-agent LLM baselines. UMAR attains the lowest expected calibration error (ECE = 0.052) among all methods and identifies evidence-conflict patterns that correlate with actual restatement risk, offering auditors potentially actionable and interpretable risk signals.
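
The fusion step can be pictured with the textbook form of Dempster's rule of combination. This is a minimal sketch under stated assumptions: a two-hypothesis frame ("risk" vs. "safe") with leftover mass on "unknown", and made-up mass values for the three agents. How UMAR actually builds its masses and measures conflict is not described in the abstract.

```python
def combine(m1, m2):
    """Dempster's rule on the frame {risk, safe}; 'unknown' is mass on the whole frame."""
    # Conflict: mass the two sources assign to contradictory hypotheses.
    conflict = m1["risk"] * m2["safe"] + m1["safe"] * m2["risk"]
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; the rule is undefined")
    norm = 1.0 - conflict
    fused = {
        "risk": (m1["risk"] * m2["risk"] + m1["risk"] * m2["unknown"]
                 + m1["unknown"] * m2["risk"]) / norm,
        "safe": (m1["safe"] * m2["safe"] + m1["safe"] * m2["unknown"]
                 + m1["unknown"] * m2["safe"]) / norm,
        "unknown": m1["unknown"] * m2["unknown"] / norm,
    }
    return fused, conflict

# Hypothetical per-agent masses (each sums to 1).
agents = {
    "mda_text": {"risk": 0.6, "safe": 0.1, "unknown": 0.3},
    "financial_ratio": {"risk": 0.2, "safe": 0.5, "unknown": 0.3},
    "cam": {"risk": 0.5, "safe": 0.2, "unknown": 0.3},
}
names = list(agents)
fused = agents[names[0]]
for name in names[1:]:
    fused, conflict = combine(fused, agents[name])
    print(f"after {name}: conflict={conflict:.3f}, fused={fused}")
```

The `conflict` term is the quantity of interest here: a high value signals that the evidence streams disagree, which a plain weighted average of risk scores would hide.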

[LG-91] HAPI-EP: Towards Hybrid Adaptive and Predictive Digital Twins of Cardiac Electrophysiology

Link: https://arxiv.org/abs/2606.15637
Authors: Sumeet Vadhavkar,Xiajun Jiang,Yubo Ye,Maryam Toloubidokhti,Linwei Wang
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:A digital twin (DT) of a patient-specific heart offers significant potential in personalized medicine. However, its rapid and dynamic adaptation to an individual’s live data and its predictive capability after adaptation remain central challenges. We examine this challenge from its two building blocks: DT formulation where mechanistic and data-driven models show competing merits and limitations, and DT optimization strategies that are largely driven by a reconstruction objective leading to un-identifiable models. We address both bottlenecks via HAPI – an AI framework for building hybrid, adaptive, and predictive DTs with three key enablers. First, HAPI constructs a physics-integrated gray-box model in which an interpretable mechanistic backbone is augmented by a neural component that models its residual to the observed data. Second, rather than attempting to pre-encode all possible variations in a static hybrid model, HAPI enables rapid on-the-fly adaptation of the hybrid model to few-shot live data, achieved by feedforward meta-learners realizing amortized inference of both mechanistic and neural parameters of the hybrid model trained with predictive objectives. Finally, we show that this adaptivity corresponds to the construction of a conditional generative model (i.e., the hybrid DT) that endows it with theoretical identifiability and thus strong performance in predictive scenarios. We demonstrate the proof-of-concept of HAPI in cardiac electrophysiology using a hybrid monodomain model with mechanistic reaction kinetics and neural graph diffusion. Across synthetic and real-data studies, we show that HAPI’s mechanistic-neural hybridization and predictive adaptation are critical for obtaining identifiable DTs with strong predictive and out-of-distribution capabilities.

[LG-92] Formalizing and Mitigating Structural Distortion in LLM Attention for Zero-Shot Graph Reasoning KDD2026

Link: https://arxiv.org/abs/2606.15633
Authors: Donald Loveland,Puja Trivedi,Ari Weinstein,Edward W Huang,Danai Koutra
Categories: Machine Learning (cs.LG)
Comments: Accepted to KDD 2026

Click to view abstract

Abstract:Large Language Models (LLMs) have shown promise for reasoning over Text-Attributed Graphs (TAGs). However, applying LLMs to graphs requires linearizing their structure into sequences, introducing distortion rooted in the graph bandwidth problem. While this distortion has been shown to degrade performance, it is often attributed to prompt design or model scale, leaving the underlying mechanism unclear. In this work, we show how rotary positional embeddings turn graph linearization into bandwidth-dependent attention decay, suppressing attention between graph-adjacent nodes that are forced far apart in the serialized sequence. This shifts the focus of LLM-based graph reasoning from prompt engineering and scaling toward correcting attention misalignment. Motivated by this analysis, we propose Graph-aligned Language Attention (GaLA), a lightweight, inference-time modification for LLMs. GaLA biases attention toward graph-adjacent nodes while preserving the LLM’s sequential inductive biases. Across TAG benchmarks, GaLA improves performance with negligible overhead, demonstrating that distortion is a correctable bottleneck in LLM-based graph reasoning.

[LG-93] Conflict-Aware Federated Fine-Tuning of Large Language Models with Mixture-of-Experts

Link: https://arxiv.org/abs/2606.15625
Authors: Yijun Lu,Zihan Fang,Pengpeng Qiao,Zheng Lin,Jing Yang,Yuxin Zhang,Por Lip Yee,Zhe Chen,Jun Luo
Categories: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: 6 pages, 4 figures

Click to view abstract

Abstract:The continuous scaling of large language models (LLMs) incurs prohibitive computational costs, making Mixture-of-Experts (MoE) a scalable alternative for efficient fine-tuning via sparse activation. While federated learning (FL) emerges as the paradigm for privacy-preserving collaborative optimization, integrating MoE into FL under data heterogeneity may trigger conflicting expert optimizations. Client-specific data distributions force same-indexed experts to optimize under inconsistent or even conflicting feature-label correlations. This mismatch induces destructive interference during aggregation, thus destabilizing the optimization trajectory and degrading model performance. To address this issue, we propose FC-MoE, a federated conflict-aware framework for MoE fine-tuning. It employs an importance-aware weighting scheme to prioritize reliable local updates and utilizes gradient consensus projection to suppress conflicting updates, ensuring a stable global optimization path. Moreover, a local knowledge retention mechanism further preserves specialized client expertise by re-anchoring domain-specific residuals. Extensive experiments demonstrate that FC-MoE accelerates convergence and enhances both global and local model performance in non-IID federated environments.

[LG-94] When Does q-error Predict Plan Regret? Three Regimes of Cardinality-Estimation Error

Link: https://arxiv.org/abs/2606.15600
Authors: Madhulatha Mandarapu, Sandeep Kunkunuru
Categories: Databases (cs.DB); Machine Learning (cs.LG)
Comments: 8 pages, 5 figures. Code, benchmarks, and full pre-registration: this https URL

Abstract:Cardinality-estimation (CE) research ranks estimators by q-error, yet it is well known that q-error is an imperfect proxy for query-plan quality. We give a measurement-driven account of when it is a good proxy and when it is not, and why. Modeling plan selection as an argmin over a piecewise-linear cost landscape, we find that plan regret (the cost of the chosen plan relative to the optimal, under true cardinalities) is governed by plan-cost geometry in a regime-dependent way. (i) For small errors, a true-point condition number kappa predicts regret and out-predicts q-error; its predictive power decays to zero as error grows, as a local linearization must. (ii) For large errors – where deployed learned estimators operate – an estimator-independent average-case sub-optimality measure ACS-infinity predicts which queries are regret-prone (Spearman rho ~ 0.54 on STATS-CEB), while q-error is nearly uninformative at the query level (rho ~ 0.05). (iii) The worst case is Haritsa’s maximum sub-optimality (MSO). The three are one cost-ratio spectrum under three weightings. We prove a limit law ACS-infinity = sum_k r_k pi_k with cardinality-independent combinatorial weights, and validate every claim on STATS-CEB and JOB-light with four released estimators under pre-registered decision rules, and confirm on real PostgreSQL runtime that ACS-infinity predicts regret where q-error does not. The contribution is conceptual and empirical – an average-case companion to worst-case robust query optimization, and a characterization of when an accuracy metric tracks plan quality – rather than a new estimator. Code and the full pre-registration are public.
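
To make the two quantities in this abstract concrete, here is a minimal sketch of q-error and plan regret on a toy cost model. The two plans, their linear cost functions, and the cardinalities are hypothetical values chosen for illustration; they are not taken from the paper or from any real optimizer.

```python
def q_error(estimate, truth):
    """Multiplicative estimation error; 1.0 means a perfect estimate."""
    return max(estimate / truth, truth / estimate)

# Hypothetical plans with linear cost in the join cardinality (assumption).
PLANS = {
    "index_nested_loop": lambda card: 5.0 * card,
    "hash_join": lambda card: 1000.0 + 1.0 * card,
}

def choose_plan(card):
    """Plan selection as an argmin over plan costs at the given cardinality."""
    return min(PLANS, key=lambda name: PLANS[name](card))

def plan_regret(estimate, truth):
    """Cost of the plan chosen from the estimate, relative to the optimal plan,
    with both costs evaluated at the true cardinality."""
    chosen = choose_plan(estimate)
    best = choose_plan(truth)
    return PLANS[chosen](truth) / PLANS[best](truth)

truth = 1000.0
under = 100.0    # underestimate by a factor of 10
over = 10000.0   # overestimate by a factor of 10
```

Both estimates have the same q-error, but only the underestimate crosses the point where the two cost lines intersect and flips the plan choice, so only it incurs regret. This is the kind of gap between estimation accuracy and plan quality that the abstract describes.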

[LG-95] A Decision-Theoretic View of Test-Time Training: When, How Far, and Which Directions to Adapt

Link: https://arxiv.org/abs/2606.15569
Authors: Tomoya Wakayama
Categories: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)

Abstract:Test-time training (TTT) adapts a pretrained model to each prompt via parameter updates, improving accuracy under pretraining-to-test distribution shifts. Yet, its performance often suffers from instability and sensitivity to hyperparameters such as update steps and subspace. We explain this behavior through a decision-theoretic lens, treating TTT as implicit Bayesian inference in the kernel regime. Under a Gaussian process benchmark, we show that TTT reduces prediction error when updates are spectrally matched to the prompt’s signal-to-noise ratio and aligned with query-relevant eigen-directions. This perspective underpins the following results: (1) we show when fixed update steps and subspaces fail under distribution shifts, motivating adaptive strategies; (2) we prove that selecting update steps via prompt evidence admits a PAC-Bayes guarantee against overfitting; and (3) we characterize the Bayes-optimal update subspace under a linear-Gaussian correction model, yielding a scoring rule for selecting Transformer blocks and heads. Our theory helps explain the empirical instability of TTT, taking a step toward principled guidance for when, how far, and which directions to adapt.

[LG-96] SDVDiag: Multimodal Causal Discovery for Online Diagnosis in Software-defined Vehicles

Link: https://arxiv.org/abs/2606.15559
Authors: Matthias Weiß, Athreya Hosahalli Prakash, Falk Dettinger, Nasser Jazdi, Michael Weyrich
Categories: Software Engineering (cs.SE); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 8 pages, 4 figures, 2 tables

Abstract:The transition toward software-defined vehicles concentrates an increasing share of vehicle functionality into distributed software services, where failures propagate through service dependencies and the surface symptom is often several causal hops away from the underlying defect. Existing approaches to causal root-cause analysis in such systems address this only partially: they typically reason over a single observability modality and operate in an offline, operator-driven mode that does not match the demands of continuous vehicle operation. This paper presents SDVDiag, a multimodal causal-discovery pipeline that fuses log-based and metric-based service representations into a shared embedding space before graph construction, coupled with an anomaly-driven trigger that converts the diagnostic platform from a manually operated batch tool into a continuously running online system. Evaluation on an Autonomous Valet Parking testbed shows that the multimodal pipeline produces sparser causal graphs than a metrics-only baseline (134 vs. 182 edges on average) and consistently outperforms it in edge-weighted reward against an expert knowledge graph at every stage of human-feedback refinement, showing a 2.4-fold improvement over the baseline after 60 feedback queries. An end-to-end fault-injection scenario further demonstrates that the integrated trigger correctly recovers a true root cause located two causal hops upstream of the observable symptom.

[LG-97] A Bifurcation Theory Framework for Gradient Descent on the Edge of Stability

Link: https://arxiv.org/abs/2606.15551
Authors: Eric Gan
Categories: Machine Learning (cs.LG)

Abstract:The Edge of Stability (EoS) phenomenon, where gradient descent operates with sharpness exceeding the classical convergence threshold yet the loss decreases over long timescales, is ubiquitous in modern deep learning but remains poorly understood in realistic settings. Prior rigorous analyses have been largely confined to scalar or low-dimensional losses with specific structural forms. In this work, we develop a bifurcation theory framework for gradient descent on the edge of stability that applies directly to overparameterized neural networks. By decomposing the training dynamics into components normal and tangent to the manifold of minimizers, we show that stable EoS training arises from a flip bifurcation in the normal direction, governed by the sign of the first Lyapunov coefficient, while the tangent dynamics drift toward regions of decreasing sharpness. Under mild spectral and geometric assumptions on the loss landscape, we prove convergence to the minimizing manifold when training at the EoS threshold. As a corollary, we recover and unify prior results: we show that the product-stability condition of Gan (2026) is an instance of our framework.
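
The "classical convergence threshold" mentioned here is the usual stability bound for gradient descent on a quadratic: with step size η, the iteration is stable only when the sharpness (curvature) stays below 2/η. The sketch below illustrates just that bound on a one-dimensional quadratic. The step size and the two curvature values are arbitrary assumptions, and the example does not model the paper's bifurcation framework or neural-network training.

```python
def gd_on_quadratic(sharpness, lr, x0=1.0, steps=50):
    """Run gradient descent on f(x) = 0.5 * sharpness * x**2.

    Each step multiplies x by (1 - lr * sharpness), so the iterates shrink
    when |1 - lr * sharpness| < 1, i.e. when sharpness < 2 / lr.
    """
    x = x0
    for _ in range(steps):
        x -= lr * sharpness * x  # gradient of f at x is sharpness * x
    return x

lr = 0.1
threshold = 2.0 / lr  # classical stability threshold for this step size

below = abs(gd_on_quadratic(0.9 * threshold, lr))  # multiplier -0.8: oscillates and decays
above = abs(gd_on_quadratic(1.1 * threshold, lr))  # multiplier -1.2: oscillates and grows
```

At exactly the threshold the multiplier equals -1, so the iterate flips sign every step without shrinking. A pure quadratic cannot show the long-run loss decrease seen at the edge of stability, which is why the setting studied in the abstract needs more than this linear picture.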

[LG-98] Greedy Coordinate Diffusion: Effective and Semantically Coherent Adversarial Attacks via Diffusion Guidance

Link: https://arxiv.org/abs/2606.15531
Authors: Bohdan Turbal, Blossom Metevier, Max Springer, Aleksandra Korolova
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Abstract: Fine-tuning aligned language models on benign tasks (e.g., math tutoring) systematically breaks safety guardrails, even when training data contains no harmful content. While mechanistic approaches have shed light on where alignment resides in model weights, they do not provide a general formal framework for deriving guarantees about when fine-tuning degrades it – leaving the field without principled tools for predicting or preventing alignment collapse. We develop a local geometric framework through geometric analysis of parameter-space trajectories and apply it to understand the fragility of alignment in fine-tuning. While first-order analysis suggests orthogonal updates are safe, we prove this is illusory: the curvature of the fine-tuning loss induces second-order acceleration that can cause drift into alignment-sensitive regions. We formalize a construct of our framework as the Alignment Instability Condition (AIC), three geometric properties that, when present, are sufficient to guarantee degradation. Our main result proves quartic onset of alignment degradation along gradient-flow trajectories, determined by how sharply alignment depends on specific parameters and how strongly tasks couple to these parameters. These findings yield formal sufficient conditions under which static first-order protection can fail under gradient descent. We further empirically validate the framework’s foundations, showing that the Fisher Information Matrix provides a proxy for the degree of safety degradation across diverse fine-tuning.

[LG-99] Reinforcement Learning-Guided Retrieval with Soft Fusion for Robust Multimodal Imitation Learning under Missing Modalities

Link: https://arxiv.org/abs/2606.15514
Authors: Hassan Ismkhan, Hamid Bouchahcia
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:Robotic systems perceive the world through multiple input modalities – including visual camera streams and natural language instructions – and must select appropriate actions based on these signals. However, assuming the permanent availability of all input devices is unrealistic, as sensors may fail, become occluded, or drop out entirely during deployment. Robust handling of such missing-modality scenarios is therefore essential for real-world robot operation. This paper introduces RL4IL, a reinforcement learning guided method for imitation learning that selects the most suitable action for a given observation by identifying the most relevant expert demonstrations from a training library. A reinforcement learning policy, trained via Proximal Policy Optimisation over Breadth-First Search candidate sets, ranks candidate demonstrations and a soft cross-attention fusion head aggregates their action signals to produce the final prediction. When a modality is missing at inference time, a dedicated per-modality RL retrieval policy identifies donor demonstrations from the training library, and a soft imputation head reconstructs the missing embedding via cross-attention over the top-ranked donors – without requiring any retraining of the system. Experiments on three LIBERO benchmark suites demonstrate that RL4IL substantially outperforms state-of-the-art imitation learning methods under sensor dropout conditions, while requiring no policy network training. The code can be found at this https URL

[LG-100] Towards Data-Efficient Cross-Device Generalization of Grad-Shafranov Equilibria via Transfer Learning Neural Operator

Link: https://arxiv.org/abs/2606.15512
Authors: Jay Phil Yoo, William Howes, Yashika Ghai, Kazuma Kobayashi, Souvik Chakraborty, Syed Bahauddin Alam
Categories: Machine Learning (cs.LG); Plasma Physics (physics.plasm-ph)

Abstract:Real-time reconstruction of magnetohydrodynamic equilibria is essential for plasma shaping, stability assessment and feedback control in magnetic confinement fusion. However, Grad-Shafranov equilibrium calculations remain largely device-specific and iterative, limiting their use in latency-constrained control settings. Existing neural approaches can accelerate individual equilibrium predictions, but they do not generally provide reusable models across changing plasma boundaries or tokamak geometries. Here we show that equilibrium reconstruction can be recast as a cross-device operator learning problem. We develop a domain-specific neural operator framework that maps geometry and profile parameters directly to the poloidal flux field, replacing repeated solve-on-demand computation with amortized operator inference. Using the analytically tractable Solov’ev family as a controlled Grad-Shafranov testbed, we generate equilibria across eight geometrically distinct tokamak-like configurations and benchmark five neural operator architectures under four transfer-learning strategies. Single-geometry pretraining gives poor transfer to unseen devices, whereas multi-geometry pretraining enables data-efficient adaptation. The Wavelet Neural Operator gives the strongest cross-geometry performance, reaching mean relative L2 errors below 4% with 100 labelled target equilibria and below 2% with full fine-tuning. The predicted magnetic fields satisfy the divergence-free constraint to numerical precision, and four architectures achieve millisecond or sub-millisecond inference. These results identify neural operator pretraining as a route towards reusable, real-time equilibrium inference across fusion device configurations.

[LG-101] Model Stealing Through the Lens of Model Multiplicity

Link: https://arxiv.org/abs/2606.15493
Authors: Eliott Baltz, Satoshi Hara, Ulrich Aïvodji
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: 14 pages, 15 figures

Abstract:Model stealing attacks, where adversaries create high-fidelity surrogate models, are a significant threat to the intellectual property of machine learning services. Conventional wisdom suggests these surrogates could provide adversaries with economic leverage comparable to the original service providers. This paper challenges this assumption by evaluating model stealing attacks beyond mere fidelity to the target model. Because query-based extraction provides only partial supervision of the target’s input-output behavior, the surrogate is not uniquely identified: many near-optimal surrogates can achieve comparable fidelity while differing in deployment-relevant properties. Instead of performing a classic learning-based model stealing attack, we compute the Rashomon Set (i.e., the set of almost-equally-accurate models) of surrogate models, and evaluate its diversity using multiplicity metrics (ambiguity, discrepancy, and Rashomon Capacity) and group fairness metrics. Across tabular, medical imaging, and NLP tasks, our experiments on real-world datasets reveal that despite exhibiting similar fidelity to the target model, surrogate models can display significant variances in other critical performance metrics. These findings cast doubt on the presumed equivalence between high-fidelity surrogates and the target model in practical deployment scenarios.
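
Two of the multiplicity metrics named in this abstract, ambiguity and discrepancy, can be computed directly from the predictions of a set of near-equivalent models. The sketch below follows their commonly used definitions: ambiguity is the fraction of inputs whose prediction is changed by at least one competing model, and discrepancy is the largest fraction changed by any single competing model. The prediction vectors are made up for illustration, and the choice of reference model is an assumption; the paper's own protocol may differ.

```python
def ambiguity(baseline, competitors):
    """Fraction of inputs on which at least one competitor disagrees with the baseline."""
    n = len(baseline)
    flipped = sum(
        any(model[i] != baseline[i] for model in competitors) for i in range(n)
    )
    return flipped / n

def discrepancy(baseline, competitors):
    """Largest fraction of inputs on which a single competitor disagrees with the baseline."""
    n = len(baseline)
    return max(
        sum(model[i] != baseline[i] for i in range(n)) / n for model in competitors
    )

# Hypothetical binary predictions on 8 inputs (assumption).
baseline = [0, 1, 1, 0, 1, 0, 0, 1]
competitors = [
    [0, 1, 1, 0, 1, 0, 1, 1],  # disagrees at index 6
    [1, 1, 1, 0, 1, 0, 0, 1],  # disagrees at index 0
    [0, 1, 0, 0, 1, 0, 1, 1],  # disagrees at indices 2 and 6
]

amb = ambiguity(baseline, competitors)
disc = discrepancy(baseline, competitors)
```

Here three of the eight inputs can be flipped by some competitor, while no single competitor flips more than two, so ambiguity exceeds discrepancy. Each competitor still agrees with the baseline on at least 75% of inputs, which mirrors the abstract's point that similar fidelity can hide different behavior.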

[LG-102] A Spatio-Temporal Expert Prefetching Framework for Efficient MoE-based LLM Inference

Link: https://arxiv.org/abs/2606.15453
Authors: Yingnan Zhao, Razvan Bunescu, Ahmed Louri, Avinash Karanth, Ke Wang
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG)

Abstract:Mixture-of-Experts (MoE) based large language models (LLMs), such as Qwen and DeepSeek, have recently emerged as an effective approach to improving model capacity without proportionally increasing computational cost. By replacing the conventional feed-forward network in dense LLMs with a set of experts and activating only a subset of them for each input token, MoE models significantly increase the total number of parameters while keeping the per-token computation relatively manageable. However, this dynamic and irregular expert activation pattern also introduces substantial expert loading overhead during inference, since the required experts must be fetched on demand according to token-dependent routing results. As a result, expert loading latency becomes a major source of performance and energy inefficiency. To this end, we first perform a comprehensive analysis of expert selection behavior in various MoE-based LLMs and applications, including language understanding and code generation. Our analysis reveals that, within each application domain, expert requests exhibit strong correlation across both adjacent MoE layers and consecutive decoding tokens, making future expert activations predictable. Based on this insight, we propose ST-MoE, a spatio-temporal expert prefetching framework that proactively stages experts ahead of use to overlap expert loading with ongoing computation. ST-MoE combines a lightweight runtime prediction mechanism that preserves the original routing behavior with a reconfigurable hardware design that efficiently supports dynamic expert prefetching. The combined effect of the prediction mechanism with the supporting hardware significantly improves MoE inference performance and energy efficiency while preserving model inference accuracy.

[LG-103] PHINN: Persistent Homology Inspired Neural Network for Rare-Event Time Series Generation

Link: https://arxiv.org/abs/2606.15452
Authors: Emre Yusuf, Ren Takahashi, Jayabrata Bhaduri
Categories: Machine Learning (cs.LG); Algebraic Topology (math.AT); Risk Management (q-fin.RM); Machine Learning (stat.ML)
Comments: 15 pages, 4 figures

Abstract: Rare events in time series are critical to model but hard to learn due to data scarcity. Current generative models struggle with extreme values. We observe that rare events leave distinct topological fingerprints - transitions in Betti numbers from point-cloud embeddings - that are more stable and discriminative than statistical moments. We introduce PHINN, a flow-matching framework using dynamic Betti curves as conditioning signals and a persistence landscape loss for homology consistency. It scales to multivariate data, includes a natural-language interface to set Betti targets, supports cross-domain meta-learning and few-shot generation, and provides certified adversarial robustness. On financial, epidemiological, and multi-modal benchmarks, PHINN outperforms statistical and diffusion baselines in topological fidelity (beta-RMSE down 41-63%, transition accuracy up 84%) and matches jump-diffusion models in tail coverage while exceeding them in shape fidelity. All results have 95% confidence intervals.

[LG-104] A Compositional Framework for Open-ended Intelligence

Link: https://arxiv.org/abs/2606.15386
Authors: Ida Momennejad, Roberta Raileanu
Categories: Machine Learning (cs.LG)

Abstract: Open-ended intelligence is the capacity to adapt to novel problems and environments that are substantially different from those in training. We formalize open-ended intelligence as the closure induced by a finite primitive set $P$ and a set of composition operators $C$. We characterize properties of the induced closure $\mathcal{L}(P,C)$ that support unbounded compositional generation across families of tasks and worlds. A mathematics of open-ended intelligence requires two pillars: a minimal set of representational primitives (e.g., states, actions) and algorithmic primitives (e.g., nearest neighbor), together with composition motifs (e.g., recursion, sequencing) that reflect an acquired compositional grammar. The closure of these two pillars enables the generation of infinite adaptive responses across a wide range of settings. The mathematics supports complementary research agendas, including evaluation metrics for explanation and interpretability, as well as building architectures where compositional generalization is native. We propose next primitive prediction as a novel architectural objective, where the training objective encourages the acquisition of reusable algorithmic primitives and their compositional grammar, such that new solutions are generated through recombination. Curriculum learning and self-play enable lifelong learning and expansion of the closure by discovering reusable primitives and transition motifs across families of tasks and worlds. We ground the framework through case studies in physics, evolution, and neuroscience.

[LG-105] Repeated Bilateral Trade: The Quest for Fairness

Link: https://arxiv.org/abs/2606.15369
Authors: François Bachoc, Roberto Colomboni, Emilie Kaufmann
Categories: Machine Learning (cs.LG)

Abstract:We study repeated bilateral trade from a fairness perspective. At each round, a fresh seller-buyer pair arrives, and the platform posts a price before observing the traders’ valuations. Trade occurs only if both agents accept the price. Rather than maximizing only the gain from trade, we consider platforms that seek balanced divisions of the generated surplus. We show that natural fairness desiderata lead to a one-parameter Rawls-to-Nash family of fair-gain objectives, obtained by aggregating the seller’s and buyer’s net gains through nonpositive Hölder means. Unlike the standard gain-from-trade objective and the Rawlsian fair-gain objective studied in prior work, our proposed objectives induce a new statistical structure in which expected rewards are recovered from threshold feedback through a two-dimensional singular-kernel integral identity. This leads to a nonstandard pure-exploration problem whose natural estimators are rectangular double sums with row-column dependence and singular weights. Assuming independent i.i.d. seller and buyer valuation sequences with arbitrary unknown marginals, we characterize the optimal learning rates for the whole Rawls-to-Nash family of fair-gain objectives, giving matching fixed-confidence sample-complexity and regret bounds up to polylogarithmic factors.
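
A Hölder (power) mean with a nonpositive exponent interpolates between the minimum of two numbers and their geometric mean, which is one way to read the "Rawls-to-Nash" family in this abstract. The sketch below computes such a mean of the seller's and buyer's net gains at a posted price. The function names, the zero reward when no trade occurs, and the example valuations are assumptions made for illustration; the paper's exact objective and normalization may differ.

```python
import math

def holder_mean(x, y, p):
    """Power mean of two nonnegative numbers for p <= 0.

    p = -inf gives min(x, y) and p = 0 gives the geometric mean.
    For p <= 0 the mean is 0 whenever either argument is 0.
    """
    if x <= 0 or y <= 0:
        return 0.0
    if p == -math.inf:
        return min(x, y)
    if p == 0:
        return math.sqrt(x * y)
    return ((x ** p + y ** p) / 2) ** (1 / p)

def fair_gain(seller_value, buyer_value, price, p):
    """Aggregate the two net gains if both sides accept the price, else 0."""
    if not (seller_value <= price <= buyer_value):
        return 0.0  # at least one side rejects, so no trade happens
    return holder_mean(price - seller_value, buyer_value - price, p)

# Hypothetical valuations in [0, 1] and a posted price (assumption).
rawls = fair_gain(0.2, 0.8, 0.4, -math.inf)  # minimum of the two gains
harmonic = fair_gain(0.2, 0.8, 0.4, -1)      # harmonic mean of the two gains
nash = fair_gain(0.2, 0.8, 0.4, 0)           # geometric mean of the two gains
```

Power means are nondecreasing in the exponent, so the three values above are ordered from the Rawlsian end to the Nash end. Every member of the family rewards prices that split the surplus evenly more than prices that favor one side.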

[LG-106] DiRecT: Safe Diffusion-Based Planning via Receding-Horizon Denoising

Link: https://arxiv.org/abs/2606.15359
Authors: Paolo Giaretta, Zeyang Li, Navid Azizan
Categories: Machine Learning (cs.LG)

Abstract:Diffusion models have emerged as powerful tools for planning and control by learning multimodal distributions over actions and trajectories. Yet reliable inference-time safety enforcement remains a key barrier to their deployment in safety-critical tasks. Existing approaches typically project each denoising iterate onto the feasible set, even though constraints are defined only on the final clean trajectory. Enforcing feasibility on noisy intermediate samples can therefore overconstrain the sampling dynamics, substantially degrading sample quality. To address this limitation, we introduce DiRecT (Diffusion-based planning via Receding-horizon denoising with Terminal constraints), a training-free algorithm for constrained sampling from diffusion models via stochastic optimal control (SOC). DiRecT enforces constraints only on the final clean sample, avoiding unnecessary restrictions on the intermediate denoising dynamics. Inspired by model predictive control, we derive a principled receding-horizon surrogate for the otherwise intractable constrained SOC formulation, yielding an efficient algorithm that cleanly separates stochastic denoising from constraint satisfaction, progressively steering samples toward feasible final trajectories without distorting the learned diffusion dynamics. Furthermore, DiRecT is highly flexible: it can leverage off-the-shelf or domain-specific optimizers, incorporate priors over environment dynamics, and optimize additional soft rewards. Extensive experiments on safe planning benchmarks demonstrate that DiRecT substantially improves deployment safety and task performance over existing diffusion-based planning baselines.

[LG-107] Probabilistic Signature Inversion: Learning Conditional Distributions from Truncated Signatures

Link: https://arxiv.org/abs/2606.15332
Authors: Junoh Kang, Kiseop Lee, Bohyung Han
Categories: Machine Learning (cs.LG)

Abstract:The signature transform is a principled feature map for continuous-time paths, valued for its uniqueness and universality. Recovering a path from its truncated signature is, however, structurally ill-posed because the truncated signature map is not injective. We therefore reframe truncated signature inversion as a probabilistic problem – learning the conditional distribution of a path given its truncated signature – and adopt a signature-conditioned flow matching model as a practical estimator. This probabilistic formulation elucidates the fundamental difficulty of inversion: Bayes reconstruction error quantifies the irreducible uncertainty remaining after conditioning on a statistic. We derive the Bayes-optimal error under linear statistics, obtaining a closed form for log-GBM and numerically tractable formulas for log-fBM and OU, yielding a concrete theoretical baseline for model validation. This baseline upper-bounds the Bayes error under truncated-signature conditioning, since truncated signatures provide richer information than linear statistics. Experiments show that empirical reconstruction errors under linear-statistics conditioning faithfully align with the theory-derived baseline, while errors decrease when the statistic is replaced with truncated signatures. Moreover, generated paths faithfully recover the conditioning signature while preserving key distributional and temporal structures, indicating that the estimator is well-calibrated to the target conditional distribution. Together, these results establish a well-posed probabilistic framework for truncated-signature inversion, with applicability demonstrated on real financial data beyond the parametric process families covered by theory.
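
The non-injectivity of truncated signatures is easy to see at the lowest truncation level. The sketch below computes the depth-2 signature of a piecewise-linear path using Chen's identity and compares two different paths that share the same start and end points. The function name and the two example paths are illustrative assumptions; this is not the paper's flow-matching estimator.

```python
def signature_depth2(path):
    """Depth-2 signature of a piecewise-linear path given as a list of points.

    Level 1 is the total increment. Level 2 is built segment by segment with
    Chen's identity: appending a straight segment with increment `delta` adds
    level1 (outer) delta + 0.5 * delta (outer) delta.
    """
    dim = len(path[0])
    level1 = [0.0] * dim
    level2 = [[0.0] * dim for _ in range(dim)]
    for start, end in zip(path, path[1:]):
        delta = [b - a for a, b in zip(start, end)]
        for i in range(dim):
            for j in range(dim):
                level2[i][j] += level1[i] * delta[j] + 0.5 * delta[i] * delta[j]
        level1 = [s + d for s, d in zip(level1, delta)]
    return level1, level2

# Two different routes from (0, 0) to (1, 1) (assumption: toy 2-D paths).
path_a = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]  # right, then up
path_b = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # up, then right

sig_a = signature_depth2(path_a)
sig_b = signature_depth2(path_b)
```

Truncating at depth 1 keeps only the total increment, which is identical for both paths, so the path cannot be recovered uniquely. The depth-2 terms do separate these two paths, but any finite truncation still discards information, which is why the abstract treats inversion as learning a conditional distribution.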

[LG-108] Semantic DLM: Improving Diffusion Language Models through Bias-variance Trade-off in Transition Kernel Design

Link: https://arxiv.org/abs/2606.15327
Authors: Keyue Jiang, Yuxiang Wang, Yanan Zhao, Xiang Yu, Qifang Zhao, Bohan Tang, Baojian Zhou, Yanghua Xiao, Lin Qu, Xiaoxiao Xu
Categories: Machine Learning (cs.LG)

Abstract:Diffusion Language Models (DLMs) have demonstrated strong scaling capacity as alternatives to autoregressive language models. However, their performance is highly sensitive to the choice of transition kernels, and poorly designed kernels can lead to issues like training instability, slow convergence, and biased sampling. In this paper, we study this sensitivity through a principled analysis of generalization error and identify three critical factors: asymptotic bias (difficulty in approximating the posterior distribution), exposure bias (error propagation during sampling), and optimization variance induced by kernel dispersion. We further compare different transition kernels: masking diffusion yields sparse and easier posterior-approximation targets, while uniform diffusion provides stronger sampling-side repair but induces harder approximation. Motivated by this trade-off, we revisit a previously overlooked variant, semantic DLM (SemDLM), where the transition kernel corrupts tokens to neighborhoods that are semantically similar. Our theory suggests that SemDLM can serve as a plausible middle ground by reducing the posterior approximation difficulty of uniform diffusion while retaining repair ability. However, we find that SemDLM suffers from a semantic basin problem, where sampling repeatedly stays within a semantic region and produces low-diversity text. To address this, we propose SemDLM+, which adds a global transition and a semantic-frequency penalty during sampling. Experiments on LM1B and OpenWebText show that SemDLM+ improves training dynamics and achieves competitive language modeling and generation quality with satisfactory diversity.

[LG-109] Rethinking Structural Anomaly Detection: From Decision Boundaries to Projection Operators

Link: https://arxiv.org/abs/2606.15280
Authors: Alexander Bauer
Categories: Machine Learning (cs.LG)

Abstract:Most existing anomaly detection methods rely on estimating a probability density or learning an enclosing decision boundary, implicitly assuming that normal data occupies a region of non-zero volume in the ambient space. In contrast, structural anomaly detection considers data that lies near a low-dimensional manifold, creating a mismatch between the inductive bias of existing methods and the structure of the data, often resulting in degraded performance. To address this mismatch, we introduce a geometric perspective. Specifically, we learn a projection operator onto the manifold of normal samples and define a sample as anomalous if it is altered by this projection. This formulation naturally integrates the inductive bias of manifold-supported data and reframes anomaly detection in terms of a projection residual, thereby resolving issues arising from modeling degenerate distributions. Notably, it provides a unifying interpretation of reconstruction-based methods by explaining their success and failure in terms of projection quality. In particular, it explains the strong generalization ability of projection-aligned models as a consequence of contraction behavior toward the manifold. Moreover, by decoupling anomaly detection from probabilistic modeling, it reduces the tendency to misclassify rare but normal samples, a widely recognized limitation of existing approaches. Empirically, we demonstrate that projection-aligned methods achieve strong performance, outperforming boundary-based methods while improving upon existing reconstruction-based approaches.
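
The projection view can be illustrated on a manifold whose projection operator is known in closed form. In the sketch below, normal data is assumed to lie on the unit circle in the plane, the projection maps a point to the nearest point on the circle, and the anomaly score is the distance the point moves. The circle, the function names, and the test points are assumptions for illustration; the paper learns the projection from data instead of writing it down.

```python
import math

def project_to_circle(point):
    """Map a 2-D point to its nearest point on the unit circle."""
    x, y = point
    norm = math.hypot(x, y)
    if norm == 0.0:
        return (1.0, 0.0)  # every circle point is equally near; pick one arbitrarily
    return (x / norm, y / norm)

def anomaly_score(point):
    """Projection residual: how far the projection moves the point."""
    px, py = project_to_circle(point)
    return math.hypot(point[0] - px, point[1] - py)

on_manifold = (math.cos(1.0), math.sin(1.0))  # lies on the circle
inside = (0.0, 0.0)                            # center of the circle
outside = (2.0, 0.0)                           # outside the circle

scores = [anomaly_score(p) for p in (on_manifold, inside, outside)]
```

The circle has zero area, so there is no positive-volume region of normal data to enclose, yet the residual still separates on-manifold points from off-manifold ones. The center of the circle is flagged even though it sits in the middle of the data, a case that an enclosing boundary would typically accept.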

[LG-110] When to use what Schatten-p norm in deep learning?

Link: https://arxiv.org/abs/2606.15268
Authors: Thomas Pethick
Categories: Machine Learning (cs.LG)

Abstract: Schatten-$\infty$-based optimizers such as Muon have shown promising empirical performance, but there remain seemingly conflicting observations regarding whether they are beneficial. We resolve this conflict by showing that the conclusion is regime dependent. Even when the objective is smooth in the Schatten-$\infty$ geometry, smaller Schatten-$p$ geometries can be optimal, specifically in the low-dimensional regime, which we show includes Chinchilla scaling. This conclusion follows from a new noise-robust acceleration result for the SODA framework for p2. The same analysis explains why Muon-like methods do not require warmup, why they naturally favor large batches, and yields a batch size scaling rule for arbitrary $p$.
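
For readers unfamiliar with the norms compared here: the Schatten-p norm of a matrix is the ordinary p-norm of its singular values, so p = 1 gives the nuclear norm, p = 2 the Frobenius norm, and p = ∞ the spectral norm. The sketch below computes it for 2x2 matrices only, using the closed-form singular values so that it needs no external libraries. The restriction to 2x2 and the example matrix are assumptions of this illustration and have nothing to do with the optimizer analysis in the paper.

```python
import math

def singular_values_2x2(m):
    """Singular values of a 2x2 matrix [[a, b], [c, d]], largest first.

    They are the square roots of the eigenvalues of M^T M, whose trace is the
    sum of squared entries and whose determinant is det(M) squared.
    """
    (a, b), (c, d) = m
    trace = a * a + b * b + c * c + d * d
    det = a * d - b * c
    disc = math.sqrt(max(trace * trace - 4.0 * det * det, 0.0))
    return (
        math.sqrt((trace + disc) / 2.0),
        math.sqrt(max((trace - disc) / 2.0, 0.0)),
    )

def schatten_norm(m, p):
    """p-norm of the singular values; p may be math.inf for the spectral norm."""
    values = singular_values_2x2(m)
    if p == math.inf:
        return max(values)
    return sum(v ** p for v in values) ** (1.0 / p)

matrix = [[3.0, 0.0], [0.0, 4.0]]  # singular values are 4 and 3

nuclear = schatten_norm(matrix, 1)
frobenius = schatten_norm(matrix, 2)
spectral = schatten_norm(matrix, math.inf)
```

The norm shrinks as p grows, so for this matrix the nuclear norm is the largest and the spectral norm the smallest. Which of these geometries suits an optimizer is the question the abstract addresses.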

[LG-111] AI for Social Good: An Investigation of the Causal Relationship Between Environmental Regulations and Their Effects on Air Pollution in London, UK

Link: https://arxiv.org/abs/2606.15257
Authors: Yang Han, Jacqueline CK Lam, Victor OK Li, Yiu-Wai Man
Categories: Machine Learning (cs.LG)

Abstract: Air pollution regulation is central to urban public health governance, but estimating its effects is difficult because policies are implemented non-randomly and pollution trajectories are shaped by meteorology, socioeconomic change, temporal trends, and overlapping interventions. This study develops an uncertainty-aware Bayesian deep learning framework to estimate the aggregate effect of air pollution regulations on PM$_{2.5}$ concentrations in London from 2010 to 2020. The framework integrates daily PM$_{2.5}$ observations from Inner London monitoring stations, meteorological covariates, annual socioeconomic indicators, month-of-year and day-of-week indicators, and daily regulation status data for 32 policy measures. A Bayesian LSTM captures temporal dependencies in environmental and socioeconomic covariates, Bayesian embedding layers represent temporal and regulation status inputs, and a regulation status prediction branch supports propensity score-based adjustment for non-random policy implementation. Regulatory effects are estimated by comparing observed PM$_{2.5}$ concentrations with counterfactual predictions under a hypothetical no-regulation scenario, with uncertainty summarized across repeated Bayesian training runs and bootstrap resampling. Results show that London’s regulations were associated with an average PM$_{2.5}$ reduction of 1.88 $\mu$g/m$^3$, a relative reduction of 12.35%, with a 95% confidence interval of 1.64-2.12 $\mu$g/m$^3$. Estimated effects were limited before 2013, became clearer from 2013 to 2017, and were strongest in 2018 and 2019. The findings suggest that sustained and cumulative regulatory interventions contributed to measurable improvements in London’s air quality. This study demonstrates how uncertainty-aware causal AI can support environmental accountability, public health protection, and evidence-based governance for environmental decision-making.

[LG-112] M-CTX: Exact and Scalable Spatial Context Retrieval for Trajectory Analytics ICDE2027

Link: https://arxiv.org/abs/2606.15244
Authors: Kun Ma, Qilong Han, Chengjing Song, Jingzheng Yao, Xiao Han, Yuee Zhou, Changmao Wu
Categories: Machine Learning (cs.LG)
Comments: 14 pages, 10 figures, 12 tables. Submitted to ICDE 2027

Abstract:Modern trajectory predictors increasingly condition on external spatial context, such as map geometry, signed distance fields (SDFs), and nearby moving agents. While this context improves prediction quality, constructing it for every training anchor has become a hidden systems bottleneck. In a representative maritime AIS pipeline, spatial context construction requires roughly 17 CPU-days for a 5.48M-anchor corpus, dominating the cost of the downstream predictor. We present M-CTX, an exact and scalable spatial context-retrieval framework for trajectory analytics. M-CTX recasts context construction as an ingest-once, query-many spatial database workload and replaces three brute-force stages – OSM range retrieval, SDF computation, and moving-vessel neighbour lookup – with composable, index-backed operators. Its learned range-index backend, BR-LZ, provides recall-complete MBR-overlap range retrieval and reduces candidate amplification by 1.1x–2.7x relative to global-expansion one-curve baselines. Across four maritime regions, eight baseline systems, synthetic workloads with up to 40M spatial features, and 10^7-record AIS streams, M-CTX reproduces the reference context exactly. On the 5.48M-anchor corpus, it reduces context construction from about 17 CPU-days to 1.8 hours, a measured 226x end-to-end speed-up. An optional storage mode further compresses SDF context by 64x with only a 0.04 m ADE change. These results establish exact spatial context retrieval as a first-class database problem in modern trajectory analytics. Code and datasets are publicly available at this https URL.

[LG-113] EnvShip-Bench: An Environment-Enhanced Benchmark for Short-Term Vessel Trajectory Prediction ACM-MM2026

Link: https://arxiv.org/abs/2606.15240
Authors: Kun Ma, Qilong Han, Chengjing Song, Jingzheng Yao, Hao Wang, Changmao Wu
Categories: Machine Learning (cs.LG)
Notes: Submitted to ACM MM 2026

Abstract:Vessel trajectory prediction is important for intelligent shipping, maritime surveillance, and navigation safety. However, existing public maritime AIS resources are often limited by inconsistent forecasting protocols, uneven data quality, and the lack of benchmark-ready contextual annotations, which hinder fair comparison and context-aware modeling. To address this gap, we present EnvShip-Bench, a unified benchmark for short-term vessel trajectory prediction built from large-scale raw AIS data from the Danish Maritime Authority (DMA) and NOAA through a common processing pipeline. EnvShip-Bench adopts a standardized forecasting protocol with 10 minutes of observation, 10 minutes of prediction, and 20-second sampling in vessel-centric local metric coordinates. Beyond the large-scale core benchmark, it provides a quality-first compact subset for efficient and reproducible experimentation, together with synchronized environmental and nearby-vessel context extensions. As a result, EnvShip-Bench supports trajectory-only, environment-aware, and interaction-aware forecasting under a unified evaluation framework. Extensive benchmark statistics and analysis demonstrate that EnvShip-Bench offers a standardized, extensible, and context-aware foundation for maritime trajectory forecasting research.
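The forecasting protocol stated above fixes the window sizes once the sampling interval is known. The following is a minimal sketch of that arithmetic, not the benchmark's own loader: `split_window` and the toy track are hypothetical, and whether the benchmark counts window endpoints inclusively is not stated in the abstract.

```python
# Window sizes implied by the stated protocol (values taken from the abstract).
OBS_SECONDS = 10 * 60      # 10 minutes of observation
PRED_SECONDS = 10 * 60     # 10 minutes of prediction
SAMPLE_SECONDS = 20        # 20-second sampling


def split_window(track, obs_steps, pred_steps, start=0):
    """Hypothetical helper: slice one track into observed part and target."""
    end_obs = start + obs_steps
    end_pred = end_obs + pred_steps
    if end_pred > len(track):
        raise ValueError("track too short for this window")
    return track[start:end_obs], track[end_obs:end_pred]


obs_steps = OBS_SECONDS // SAMPLE_SECONDS    # samples in the observed window
pred_steps = PRED_SECONDS // SAMPLE_SECONDS  # samples in the prediction window

# Toy track, invented for illustration: 5 m east per sample in local
# metric coordinates.
track = [(5.0 * i, 0.0) for i in range(obs_steps + pred_steps)]
observed, target = split_window(track, obs_steps, pred_steps)
```

Under these assumptions each sample pairs 30 observed positions with 30 future positions.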

[LG-114] Can Neural Networks Achieve Optimal Computational-statistical Tradeoff? An Analysis on Single-Index Model

Link: https://arxiv.org/abs/2606.15219
Authors: Siyu Chen, Beining Wu, Miao Lu, Zhuoran Yang, Tianhao Wang
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Statistics Theory (math.ST); Machine Learning (stat.ML)
Notes: 96 pages, 4 figures

Abstract:In this work, we tackle the following question: Can neural networks trained with gradient-based methods achieve the optimal computational-statistical tradeoff in learning Gaussian single-index models? Prior research has shown that any polynomial-time algorithm under the statistical query (SQ) framework requires $\Omega(d^{s^\star/2} \lor d)$ samples, where $s^\star$ is the generative exponent representing the intrinsic difficulty of learning the underlying model. However, it remains unknown whether neural networks can achieve this sample complexity. Inspired by prior techniques such as label transformation and landscape smoothing for learning single-index models, we propose a unified gradient-based algorithm for training a two-layer neural network in polynomial time. Our method is adaptable to a variety of loss and activation functions, covering a broad class of existing approaches. We show that our algorithm learns a feature representation that strongly aligns with the unknown signal $\theta^\star$, with sample complexity $\widetilde{O}(d^{s^\star/2} \lor d)$, matching the SQ lower bound up to a polylogarithmic factor for all generative exponents $s^\star \geq 1$. Furthermore, we extend our approach to the setting where $\theta^\star$ is $k$-sparse for $k = o(\sqrt{d})$ by introducing a novel weight perturbation technique that leverages the sparsity structure. We derive a corresponding SQ lower bound of order $\widetilde{\Omega}(k^{s^\star})$, matched by our method up to a polylogarithmic factor. Our framework, especially the weight perturbation technique, is of independent interest, and suggests potential gradient-based solutions to other problems such as sparse tensor PCA.

[LG-115] Towards a Unified Generative Model for Scarce Time Series with Domain Experts

Link: https://arxiv.org/abs/2606.15172
Authors: Zihao Yao, Qi Zheng, Jiankai Zuo, Yaying Zhang
Categories: Machine Learning (cs.LG)

Abstract:Synthesizing realistic time series with generative models has wide-ranging applications in real-world scenarios. Despite recent progress, most existing methods are trained under the assumption of abundant training data, which substantially limits their effectiveness in data-scarce settings. In this paper, we propose TimeMoDE, a novel framework that integrates Diffusion Transformers with Mixture-of-Experts to exploit both domain adaptability and diffusion-stage awareness for time series generation under data scarcity. It is pre-trained on a large-scale collection of multi-domain datasets to extract domain-agnostic temporal representations and domain-specific information that benefit generalization during fine-tuning. We propose Domain Prompts to condition expert assignment for indistinguishable noised tokens, mitigating the limitations of capturing inter-dataset relationships. Moreover, we incorporate diffusion timestep signals to equip the experts with awareness of time series degradation variations, facilitating adaptive calibration to stage-dependent denoising requirements. Extensive experiments demonstrate that TimeMoDE outperforms existing methods under diverse low-data settings. It establishes an innovative paradigm for advanced few-shot time series generation.

[LG-116] Semantic Reasoning in Medicine: The Role of Knowledge Graphs Across Five Key Domains

Link: https://arxiv.org/abs/2606.15155
Authors: Haniye Sherafatmandjoo, Mohammad Akbari, Zahed Rahmati
Categories: Machine Learning (cs.LG)

Abstract:Knowledge graphs (KGs) have emerged as a promising solution for integrating and reasoning over complex biomedical and clinical data in healthcare. By representing structured relationships among entities such as diseases, drugs, symptoms, and patient records, KGs provide a semantic backbone for decision-making, prediction, recommendation, and personalized care. Recent advances have demonstrated their utility across diverse medical applications–including clinical decision support systems, disease and treatment outcome prediction, health recommender systems, precision medicine, and medical question answering–where KGs often enhance interpretability, semantic coherence, and patient-specific reasoning. In parallel, a growing body of work focuses on medical KG generation itself, proposing frameworks that construct graphs from EHRs, clinical narratives, biomedical literature, and web resources using ontologies, semantic web technologies, deep-learning-based information extraction, and hybrid neuro-symbolic pipelines. Despite this progress, significant challenges remain, including limited and fragmented knowledge coverage, difficulties in aligning heterogeneous data sources, the fragility of current reasoning and representation-learning methods on dense multi-relational graphs, and unresolved issues related to privacy, bias, and accountability. This survey reviews and categorizes current research on KGs in medicine along both application-oriented and methodology-oriented dimensions, discusses their benefits and technical foundations, and outlines key limitations and open research directions. By analyzing trends, architectures, and evaluation practices, this work aims to guide future developments in KG-driven medical AI systems and support their safe and effective integration into healthcare environments.

[LG-117] False Sense of Safety in Selective Signal Classification: Auditing Bound Tightness and Exchangeability for Risk Control

Link: https://arxiv.org/abs/2606.15153
Authors: Jingwen Zhou, Mingzhe Wang
Categories: Machine Learning (cs.LG)

Abstract:Selective prediction with distribution-free risk control promises that, with confidence 1-delta over the calibration draw, the error rate of accepted inputs stays below a user budget alpha. We audit this promise on signal-domain detectors – machine anomalous-sound detection (ASD) and AI-generated-image forensics – for four calibration rules: uncertified empirical thresholding (NAIVE) and certified Hoeffding, Clopper-Pearson (CP), and betting (WSR) upper confidence bounds. We report three findings. (i) NAIVE thresholding, common in practice, exceeds its declared budget in 49-73% of synthetic trials (n=200 calibration points) and in up to 68% of real-data splits: a false sense of safety rather than a broken theorem, since the rule never had a certificate. (ii) Tightness matters: CP and WSR certify substantial coverage where Hoeffding certifies none, with zero observed budget overruns under exchangeable splits. (iii) Under grouped deployment (unseen machine types or generators), certified rules overrun in 9-30% of trials – far above delta – showing the failure lies in the broken exchangeability premise, not in the bounds; a conservative per-group threshold restores validity at a severe coverage cost.
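The gap between an uncertified empirical threshold and certified upper confidence bounds can be sketched with textbook formulas. This is an illustrative sketch only, not the paper's audit code: the function names and the example counts (10 errors among 200 accepted calibration points, delta = 0.1, alpha = 0.1) are assumptions, and the betting (WSR) bound is omitted.

```python
import math


def hoeffding_ucb(errors, n, delta):
    """Hoeffding upper confidence bound on the true error rate."""
    return min(1.0, errors / n + math.sqrt(math.log(1.0 / delta) / (2.0 * n)))


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson_ucb(errors, n, delta, tol=1e-10):
    """One-sided Clopper-Pearson upper bound, found by bisection on p."""
    if errors >= n:
        return 1.0
    lo, hi = errors / n, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        # The CDF decreases in p; the bound is the p where it equals delta.
        if binom_cdf(errors, n, mid) > delta:
            lo = mid
        else:
            hi = mid
    return hi


# Invented calibration outcome, for illustration only.
n, errors, delta, alpha = 200, 10, 0.1, 0.1
naive = errors / n                              # no certificate attached
cp = clopper_pearson_ucb(errors, n, delta)      # tighter certified bound
hoeff = hoeffding_ucb(errors, n, delta)         # looser certified bound
```

With these toy numbers the naive estimate and the Clopper-Pearson bound both fall under the budget alpha, while the Hoeffding bound does not, which mirrors the tightness point in finding (ii).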

[LG-118] Contextual Bandits for Maximizing Stimulated Word-of-Mouth Rewards AAAI2025

Link: https://arxiv.org/abs/2606.15146
Authors: Ahmed Sayeed Faruk, Elena Zheleva
Categories: Machine Learning (cs.LG)
Notes: Presented at the AAAI 2025 Workshop on Bridging the Gap Between AI Planning and Reinforcement Learning (PRL)

Abstract:Stimulated word-of-mouth is a strategy that promotes information sharing through prompts or incentives. Optimizing stimulated word-of-mouth through social networks requires identifying and targeting connected users who are most susceptible to spillover, a phenomenon where the influence of recommendations extends beyond the immediate audience to impact their connected users. The probability of spillover varies across individuals and their connections, leading to heterogeneity. Understanding and accurately estimating the spillover probabilities among users in social networks is crucial for improving the effectiveness of stimulated word-of-mouth. To address this, we present a novel contextual multi-armed bandit framework that learns individual spillover probabilities and ranks connected users to maximize rewards from stimulated word-of-mouth. Experiments on real-world network datasets demonstrate that accounting for spillover heterogeneity enhances the targeting precision of top-k connected users, boosting rewards and outperforming baseline methods that do not learn individual spillover effects.

[LG-119] Beyond Accuracy: Measuring Bias Acknowledgment in Chain-of-Thought Reasoning for Responsible AI Evaluation ICML2026

Link: https://arxiv.org/abs/2606.15127
Authors: Xian Sun, Wei Gao, Yingshuo Wang, Lingdong Kong, Yanhang Li, Zhichao Fan, Zexin Zhuang, Wenlong Dong, Zhiyuan Zheng, Hrishikesh Paranjape, Abhishek Mandal, Johnny R. Zhang
Categories: Machine Learning (cs.LG)
Notes: ICML 2026 Workshop on Trustworthy AI for Good

Abstract:Reasoning models are increasingly used in settings where the final answer is not the only object of review: educational tools may show students intermediate steps, decision-support systems may require human oversight, and audit workflows may inspect traces for misleading or biased input. In such settings, two responses can receive the same final-answer score while differing in whether the trace explicitly flags injected biasing content. Accuracy-only evaluation collapses these cases. We study this gap as a measurement blind spot for responsible evaluation and introduce a minimal trace-level diagnostic with two axes: susceptibility (whether the bias breaks a previously correct answer) and acknowledgment (whether the trace contains a rubric-defined surface reference to the injected content). Across thousands of biased GSM8K trials, GPT-4o and Claude Sonnet 4 have similar susceptibility rates (1.3% vs. 1.2%) but substantially different acknowledgment rates (13.0% vs. 75.0%) under the same rubric.

[LG-120] Data-Centric Benchmarking of Exploit Generation in LLMs: Understanding the Impact of Fine-Tuning

Link: https://arxiv.org/abs/2606.15123
Authors: Yiwei Chen, Lichi Li, Kai Cheung, Vinny Parla, Ganesh Sundaram
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Notes: Technical Report

Abstract:We study the task of CVE-conditioned exploit generation, where a model drafts proof-of-concept (PoC) exploits given software vulnerability context. We adopt a data-centric approach, constructing a high-quality dataset via multi-stage preprocessing and introducing a scalable evaluation framework with LLM-as-judge and fine-grained rubrics. Under this unified setup, we benchmark 17 large language models across 8 evaluation criteria, providing systematic insights into their zero-shot capabilities. We further show that a compact 8B open-weight model, when fine-tuned on curated data, achieves over 42.5% improvement in exploit quality and rivals some proprietary models when combined with simple test-time rejection strategies. Our results highlight the importance of data quality, structured supervision, and evaluation design for reliable exploit generation, suggesting that these factors can be as critical as model scale in adapting LLMs to cybersecurity tasks.

[LG-121] Diversity-Driven Offline Multi-Objective Optimization via Nested Pareto Set Learning ICML2026

Link: https://arxiv.org/abs/2606.15115
Authors: Yiyi Zhu, Yaolin Wen, Xiang Xia, Xin An, Hanyi Si, Xiang Shu, Yangde Fu, Liang Dou, Hong Qian
Categories: Machine Learning (cs.LG)
Notes: 32 pages, 7 figures, accepted by ICML 2026. Project: this https URL

Abstract:Multi-objective optimization (MOO) has emerged as a powerful approach to solving complex optimization problems involving multiple objectives. In many practical scenarios, function evaluations are unavailable or prohibitively expensive, necessitating optimization solely based on a fixed offline dataset. In this setting, known as offline MOO, the goal is to find the Pareto set without access to the true objective functions. This setting suffers from the out-of-distribution (OOD) issue, where the surrogate model is not accurate for unseen designs. Due to the OOD issue, surrogate errors may cause the optimizer to select solutions that do not lie on the true Pareto front and are biased toward its extremes. To address this, this paper proposes Diversity-driven Offline Multi-Objective Optimization (DOMOO), which aims to find a diverse and high-quality set of solutions. First, DOMOO incorporates an accumulative risk control module that estimates the potential risk of candidate solutions and alleviates the OOD issue between the training data and the generated solutions. In addition, a nested Pareto set learning (PSL) strategy is proposed to jointly learn preference and PSL parameters, then optimize them, enabling adaptation to diverse Pareto front geometries. To further enhance solution quality, we design a diversity-driven selection strategy that extracts a representative and well-distributed set of final solutions. To achieve this diversity-driven selection strategy, we propose $\text{IGD}_{\text{offline}}$, a tailored indicator for the offline setting that considers both diversity and convergence, and avoids the bias of the hypervolume indicator. Extensive experiments on synthetic and real-world benchmarks show that DOMOO achieves the best average rank across tasks in both convergence and diversity among the compared methods.

[LG-122] High-Dimensional Random Projection for Activation Steering in Language Models

Link: https://arxiv.org/abs/2606.15092
Authors: Minh-Hieu Pham, Bach Do, Laziz Abdullaev, Tan Minh Nguyen, Khoat Than
Categories: Machine Learning (cs.LG)

Abstract:Activation steering has emerged as a key methodology for controlling the behavior of large language models (LLMs). Existing difference-in-means based methods, however, are fundamentally limited: they capture only mean differences between class activations and fail to recover discriminative signals that naturally exist in the nonlinear feature subspace under the superposition hypothesis. Motivated by that, we propose High-Dimensional Random-projection for Activation Steering (HiDRA), a training-free approach that integrates seamlessly with existing activation steering methods. By performing activation addition in the projected high-dimensional space, HiDRA can provably capture a better discriminative structure beyond the reach of linear methods. Experiments across diverse LLM families and benchmarks demonstrate that HiDRA consistently outperforms baseline counterparts, achieving stronger behavioral control without significant computational overhead.

[LG-123] An Integrable Token Mixing Layer from the Generalized Yang Baxter Equation

Link: https://arxiv.org/abs/2606.15085
Authors: Snigdha Chandan Khilar
Categories: Machine Learning (cs.LG)

Abstract:The YB Mixer is a sequence token mixing layer derived from free fermion and generalized Yang Baxter structures. It applies a core principle from integrable systems where a local algebraic constraint guarantees global computational stability. By using the Ising exchange algebra the mixer creates a free fermionic structure that acts as an exactly norm preserving orthogonal map. This algebra also produces commuting transfer matrices which allow inference to be order free and adaptable to any variable budget. To ensure the model can generalize to longer sequence lengths it uses a spectral circulant generator. This generator maintains the crucial orthogonal and commuting properties of the system. The result is a highly stable and mathematically grounded architecture for sequence processing.

[LG-124] TriAdReview: Triangular Adversarial Review Architecture for Multi-Model Technical Document Generation

Link: https://arxiv.org/abs/2606.15074
Authors: Zhiqiang Zhou, Junliang Dai, Xu Ling
Categories: Machine Learning (cs.LG)
Notes: 12 pages, 7 figures, 5 tables

Abstract:Large language models (LLMs) are increasingly used for technical document generation, yet single-model outputs often suffer from over-engineering, security blind spots, and incomplete coverage. We propose TriAdReview, a triangular adversarial review architecture that employs two independent reviewer models (engineering and boundary perspectives) and a triangular judging mechanism to iteratively improve a generator model’s output. We evaluate TriAdReview across five benchmark tasks - architecture design, code generation, proposal review, security audit, and requirements analysis - using three configurations: single model (baseline), dual model (single review), and triple model (full system). Results across 75 experiments (n=5 per cell) show that the triple model configuration achieves a 10.1% overall improvement over the single model baseline (26.2 vs. 23.8 out of 50; p0.05, paired t-test), with particularly strong gains on security audit (+27.6%), code generation (+20.8%), and architecture design (+15.6%). A second scorer (mimo-v2.5-pro) confirms the direction with a smaller effect (+2.7%), suggesting moderate inter-rater agreement. However, the system shows a -7.5% degradation on requirements analysis, revealing that adversarial review architectures have a structural bias toward simplification that is counterproductive for completeness-oriented tasks. We analyze this boundary condition through a task-type framework and demonstrate that reviewer prompt adaptation partially mitigates the issue. Our findings provide the first empirical characterization of when multi-model adversarial review helps versus harms, with implications for the design of collaborative AI systems.

[LG-125] Phase-Localized Curation Does Not Help: A Negative Result on Per-Phase Metric Selection for Demonstration Filtering

Link: https://arxiv.org/abs/2606.15064
Authors: Aarav Bedi
Categories: Machine Learning (cs.LG); Robotics (cs.RO)
Notes: 5 pages, 3 tables. Code: this https URL

Abstract:Manipulation demonstrations have temporal phase structure, and a natural hypothesis is that demonstration-curation metrics should be applied within phases rather than globally. The idea is to segment each trajectory into phases, score each phase with the metric that is locally most informative, and then aggregate. This follows directly from prior work showing that a single global metric can be the best detector of a defect and yet the worst curator of the resulting policy. We test the per-phase hypothesis on three contact-rich LIBERO pick-and-place tasks with a controlled early-release structural defect, comparing phase-gated curation against the same metrics applied uniformly and against a strong single global metric. Across all three tasks and five random seeds per condition, phase-gated curation is never the best curation strategy, and it is the worst of the three on two of the three tasks (Task 1: 86.0 vs. 92.0 for global; Task 3: 22.7 vs. 48.0 for uniform). We trace the failure to a concrete mechanism. When the defect signal is concentrated in a single phase, rank-aggregating across phases dilutes that signal with uninformative scores from defect-free phases, selecting a worse demonstration subset than simply applying the defect-informative metric everywhere. We further show that the per-phase metric selection does not transfer across tasks, since no phase shares a winning metric between any two tasks, so the selection cannot be reused and must be re-derived per task from a noisy sweep. These results bound a plausible and previously untested method, and they argue that practitioners should prefer identifying a single defect-informative metric over decomposing curation by phase. We release the full pipeline, all metric implementations, and per-seed results.

[LG-126] Machine Learning and the Random Walk Puzzle: Forecasting the CAD/USD Exchange Rate with Expanding Window Evaluation and SHAP Interpretability

Link: https://arxiv.org/abs/2606.15058
Authors: Louis Agyekum, Edmund Fosu Agyemang, Obu-Amoah Ampomah, Kofi Acheampong, Emmanuel Boadi, Priscilla Yaa Amakye, Fafa Shalom Tchorly, Enock Adu Bonsu, Eric Nyarko
Categories: Machine Learning (cs.LG); Applications (stat.AP)
Notes: 10 pages, 14 figures, 8 tables

Abstract:This study examines whether machine learning (ML) models can outperform the naive random walk benchmark in forecasting the monthly USD/CAD exchange rate. Using daily data from the Bank of Canada spanning January 2017 to May 2026, resampled into 113 monthly observations, five ML models are evaluated: linear regression, random forest, gradient boosting, XGBoost, and AdaBoost. These models are benchmarked against the naive random walk model and exponential smoothing with Holt-Winters seasonality (ETS). All models are evaluated using an expanding-window framework to maintain strict out-of-sample integrity, and forecast-accuracy differences are assessed using the Diebold-Mariano (DM) test. Structural break detection identifies four significant breakpoints in the series, corresponding to the escalation of the US-China trade war in 2018, the COVID-19 economic recovery in 2020, the peak of the Bank of Canada rate-hiking cycle in 2022, and the start of the Bank of Canada rate-cutting cycle in 2024. SHAP, or Shapley Additive Explanations, analysis is applied to interpret the drivers of the best-performing ML model. The results show that the naive random walk model remains a formidable benchmark. Linear regression is the only model that statistically outperforms the naive random walk model, with a DM statistic of 3.0585 and a p value of 0.0071, whereas the ML ensemble models show only marginal differences. Random Forest with an expanding-window framework achieves the lowest MAPE of 1.17 percent among all models except the random walk. SHAP analysis confirms that short-term lags, particularly lag1 and lag2, and recent rolling means dominate predictions, consistent with the near-random-walk behavior of exchange rates.
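Expanding-window evaluation against a naive random walk can be sketched in a few lines. This is a sketch under stated assumptions, not the study's pipeline: the helper names, the `mean_of_last_three` stand-in model, and the toy rate series are all invented for illustration, and the Diebold-Mariano test is not included.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def expanding_window_forecasts(series, min_train, fit_predict):
    """One-step-ahead forecasts; each uses only observations before the target."""
    preds = []
    for t in range(min_train, len(series)):
        history = series[:t]          # training window grows by one each step
        preds.append(fit_predict(history))
    return series[min_train:], preds


def random_walk(history):
    """Naive benchmark: next value equals the last observed value."""
    return history[-1]


def mean_of_last_three(history):
    """Stand-in for a fitted model (not one of the models in the study)."""
    return sum(history[-3:]) / 3.0


# Toy monthly rates, invented for illustration only.
rates = [1.30, 1.31, 1.29, 1.32, 1.33, 1.35, 1.34, 1.36]
actual, rw_preds = expanding_window_forecasts(rates, 4, random_walk)
_, model_preds = expanding_window_forecasts(rates, 4, mean_of_last_three)
rw_mape = mape(actual, rw_preds)
model_mape = mape(actual, model_preds)
```

Because every forecast sees only earlier observations, comparing `model_mape` with `rw_mape` stays strictly out of sample, which is the property the expanding-window design is meant to protect.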

[LG-127] Size Doesn't Matter: Cosine-Scored Sparse Autoencoders

Link: https://arxiv.org/abs/2606.15054
Authors: Silen Naihin, Lev Stambler
Categories: Machine Learning (cs.LG)

Abstract:Sparse autoencoders (SAEs) detect features via inner product, so a feature’s activation scales with both its directional alignment and the input’s norm. Under BatchTopK, high-norm tokens inflate all pre-activations simultaneously, claiming dictionary slots regardless of content alignment. This matters because sublayer normalization has already discarded the magnitude the score measures, so the encoder detects a quantity the model does not read. We replace the score with a learned blend of cosine similarity and input magnitude, letting the optimizer choose how much norm to use; a per-feature extension lets each feature decide independently. In both regimes, training is free to recover inner product but never does, with no feature ever choosing more than half-magnitude dependence. At matched reconstruction, the cosine encoder learns features that align with human-recognizable concepts far more often than standard, filling dictionary slots that inner product wastes on norm detectors. Loss reweighting that equalizes gradients barely closes the gap, confirming forward-pass score geometry as the lever. The advantage is not universal across tasks or depths, but we believe cosine scoring should be the default for dictionary learning on normalized representations.
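One way to picture a blend of cosine similarity and input magnitude is a score of the form cos(x, w) · ‖x‖^α, where α = 1 recovers the inner product for a unit-norm feature and α = 0 is pure cosine. This parameterization is an assumption made for illustration; the abstract does not specify how the blend is defined, and `blended_score` is a hypothetical name.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def blended_score(x, w, alpha):
    """cos(x, w) * ||x||**alpha; the form of the blend is an assumption."""
    cos = dot(x, w) / (norm(x) * norm(w))
    return cos * norm(x) ** alpha


w = [1.0, 0.0]                       # unit-norm feature direction
aligned_small = [2.0, 0.0]           # perfectly aligned, low norm
misaligned_large = [30.0, 40.0]      # cosine 0.6 with w, norm 50

# alpha = 1: inner-product scoring, where the high-norm input dominates.
ip_small = blended_score(aligned_small, w, 1.0)
ip_large = blended_score(misaligned_large, w, 1.0)

# alpha = 0: cosine scoring, where the better-aligned input dominates.
cos_small = blended_score(aligned_small, w, 0.0)
cos_large = blended_score(misaligned_large, w, 0.0)
```

In this toy case the ranking of the two inputs flips between the two settings, which is the norm-inflation effect the abstract describes for top-k style selection.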

[LG-128] Physics-conforming Latent Twins

Link: https://arxiv.org/abs/2606.15053
Authors: Matthias Chung, Yutong Bu, Deepanshu Verma
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Notes: 32 pages, 11 figures

Abstract:Surrogate models are central to scientific machine learning, where they enable fast prediction, simulation, inference, and control for complex physical systems. For time-dependent problems, however, accurate interpolation of training trajectories is not sufficient: reliable surrogates should also respect the conservation laws, invariants, admissibility conditions, and dissipative structures that give those trajectories physical meaning. We introduce Physics-conforming Latent Twins, a framework for learning latent surrogate solution operators whose dynamics satisfy selected physical principles by design. The method builds on the Latent Twin formulation by jointly learning an encoder, a decoder, and a latent flow map between arbitrary time-indexed states, while constraining the latent dynamics to preserve or dissipate prescribed structural quantities. We develop a constraint-transfer viewpoint that connects physical structure in the original state space with compatible constraints in latent space, and prove structure-preservation bounds showing how latent enforcement improves control of physical defects after decoding. We also derive algebraic conditions for latent flow maps that preserve linear and quadratic invariants or enforce dissipative inequalities. Numerical experiments on representative ODE and PDE benchmarks demonstrate improved constraint satisfaction, structural fidelity, and qualitative long-time behavior while maintaining accurate surrogate prediction.

[LG-129] Transformers Learn the Mestre-Nagao Heuristic

Link: https://arxiv.org/abs/2606.15036
Authors: Pranav Venkata Konda
Categories: Machine Learning (cs.LG); Number Theory (math.NT)
Notes: 15 pages, 10 figures

Abstract:We train a two-layer transformer encoder to classify rational elliptic curves $E/\mathbb{Q}$ of conductor $\leq 10000$ as either rank 0 or rank 1 from the first 128 normalized Frobenius traces. We achieve 99% accuracy on both classes, and accuracy is essentially unchanged on test curves with no isogeny or quadratic-twist relative in the training set. We then apply techniques from mechanistic interpretability such as attention analysis, linear probing, activation patching, logit attribution, and neuron-level circuit analysis to reverse-engineer the algorithm the (centroid in function space) model learned. We find that a sparse circuit of 20 out of 512 layer-1 MLP neurons is sufficient for rank prediction under a linear probe with an AUROC of 0.992 at plateau, implementing a push-pull detector architecture of rank-0 and rank-1 detectors with a one-sided readout. However, we notice that the model has sub-optimal readout problems indicating a mismatch in rank-order between the readout pathway and the discriminative circuit. Critically, the learned input weights of the top discriminating neuron match the Mestre-Nagao sum heuristic weights $\log(p)/(p \cdot \log B)$ with a Spearman coefficient $r = 0.997$ and Pearson coefficient $r = 0.952$: the model has learnt a result from analytic number theory from the Frobenius trace data alone. We additionally find that all 50 independently trained models concentrate CLS attention on prime positions at 2-50× the rate of composite positions. The CLS embedding encodes $\log L(E,1)$ with $R^2 = 0.962 \pm 0.011$ across the 50 models (after controlling for the conductor). Activation patching analysis reveals that attention weights are dissociated from causal information flow. Additionally, the 50 solutions from training are near-identical in function space (with pairwise agreement 98.8%) despite large weight space barriers.
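The per-prime weights quoted in the abstract, log(p)/(p · log B), are easy to tabulate. The sketch below only computes those weights and a generic weighted sum; the function names and the `traces` mapping keyed by prime are hypothetical, and sign and normalisation conventions for Mestre-Nagao sums differ between sources, so this is not a reconstruction of the paper's setup.

```python
import math


def primes_up_to(bound):
    """Sieve of Eratosthenes; assumes bound >= 2."""
    sieve = [True] * (bound + 1)
    sieve[0:2] = [False, False]
    for n in range(2, int(bound ** 0.5) + 1):
        if sieve[n]:
            for m in range(n * n, bound + 1, n):
                sieve[m] = False
    return [n for n, is_prime in enumerate(sieve) if is_prime]


def mestre_nagao_weights(bound):
    """Weight log(p) / (p * log B) for each prime p <= B."""
    log_b = math.log(bound)
    return {p: math.log(p) / (p * log_b) for p in primes_up_to(bound)}


def weighted_trace_sum(traces, bound):
    """Weighted sum of per-prime values (conventions here are assumptions)."""
    weights = mestre_nagao_weights(bound)
    return sum(weights[p] * traces[p] for p in weights)


weights = mestre_nagao_weights(100)
zero_sum = weighted_trace_sum({p: 0.0 for p in weights}, 100)
```

Since log(p)/p decreases for primes p ≥ 3, small primes carry most of the weight, which gives a concrete profile to compare learned input weights against.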

[LG-130] How Should World Models Be Evaluated? A Decision-Making-Centric Position

Link: https://arxiv.org/abs/2606.15032
Authors: Yang Yu, Shiyuan Zhang, Yifei Sheng, Haoxiang Ren, Haoxin Lin
Categories: Machine Learning (cs.LG)

Abstract:World models have rapidly become one of the central abstractions in modern AI. Yet the term now refers to several different objects: action-conditioned environment models, latent imagination models, future-video predictors, interactive neural simulators, latent predictive representations, and synthetic-data engines. Evaluation has broadened with the term. Recent papers measure video realism, perceptual similarity, instruction following, physical plausibility, policy ranking, executability, planning success, and downstream policy improvement. The result is not only metric diversity but also a recurring problem of claim/evidence mismatch: papers frequently make a stronger claim about what their model is useful for than their evaluation can actually establish. This paper surveys the recent literature and argues that the central question is use-dependent. When a model is presented as a world model for embodied decision-making, a more decisive issue is not whether it generates visually compelling videos, but whether it supports reliable counterfactual reasoning, policy evaluation, planning, and policy optimization under intervention, policy-induced distribution shift, and long-horizon rollout. We organize the literature using an L0–L7 ladder that ranges from visual plausibility to policy optimization utility. In our interpretation, L0–L3 are most naturally read as diagnostics of generated artifacts, L4 is often the first genuinely interventional test, and L5–L7 provide the most direct evidence of decision usefulness. Based on this diagnosis, we propose a decision-making-centric evaluation framework and a benchmark protocol that foreground counterfactual action fidelity, closed-loop rollout validity, reward/value prediction, policy-ranking agreement, optimization lift, model exploitability, and uncertainty calibration. 

[LG-131] CREST: Deployment-Realistic Hardware-in-the-Loop NAS for Embedded Sensing Systems

Link: https://arxiv.org/abs/2606.15004
Authors: Joseph Q. Zales,Pragya Sharma,Mani Srivastava
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: 14 pages, 10 figures, 7 tables

Abstract:Deploying neural networks on low-power microcontrollers (MCUs) requires selecting model architectures under tight memory, latency, and energy constraints. Existing workflows often simplify this process along one or more axes: static proxy costs such as FLOPs or parameters, treating one MCU as representative, and continuous-inference tests instead of deployed sensing schedules. These assumptions can mis-rank Pareto-front candidates, miss infeasible deployments, and obscure schedule-dependent energy. We present CREST (Cross-platform Runtime Evaluation and Search Tool), a deployment-realistic hardware-in-the-loop (HIL) neural architecture search (NAS) framework for MCU sensing systems. CREST keeps the optimizer, HIL measurement boundary, logging, and replay workflow fixed while exposing workload, model family, target backend, schedule, quantization, and scoring policy as configurable axes. This makes deployment effects experimentally separable within one reusable workflow. We evaluate CREST on inertial odometry and audio classification across three Arm Cortex-M targets. For inertial odometry, measured-energy HIL search reduces median per-inference energy by 41.7% versus FLOPs-based selection and 40.8% versus memory-traffic-based selection at similar error. FLOPs-based selection also chooses infeasible deployments on memory-constrained targets. On the STM32 N657 target, continuous-inference and duty-cycled searches produce different Pareto frontiers. For audio classification, the same application-level policy selects different DS-CNN architectures on different boards, and cross-board replay changes deployment cost substantially. Overall, CREST shows that deployment-realistic MCU NAS must jointly optimize model architecture, target platform, runtime schedule, and deployment policy rather than relying only on static proxy costs or continuous-inference measurements. 
Submission history: v1 submitted Fri, 12 Jun 2026 22:48:13 UTC (3,325 KB) by Joseph Zales

[LG-132] Unlocking Latent Dimensions: Exploring Representations of Large-Scale X-ray Scattering Data using Variational Autoencoders

Link: https://arxiv.org/abs/2606.14999
Authors: Monika Choudhary,Xiaoya Chong,Runbo Jiang,Wiebke Koepp,Petrus H. Zwart,Damon English,Gregory M. Su,Eric Schaible,Chenhui Zhu,Mostafa Nassr,Noah P. Wamble,Kelvin Kam-Yun Li,Jonathan M. Chan,Jose Carlos Diaz,Cameron McKay,Lynn Katz,Benny Freeman,Guillaume Freychet,Yevgen Matviychuk,Eliot Gann,Daniel B. Allan,Benedikt Sochor,Frank Schluenzen,Stephan V. Roth,Ethan Crumlin,Dylan McReynolds,Tanny Chavez,Alexander Hexemer
Categories: Machine Learning (cs.LG)

Abstract:Scientific user facilities generate X-ray scattering data faster than traditional workflows can process them. We address this challenge across two settings, offline dataset exploration and live on-the-fly analysis. We train a domain-specific attention-based Convolutional Variational Autoencoder (C-VAE) on 1.5 million X-ray scattering images to learn low-dimensional representations capturing structural variation across diverse experimental conditions. The learned latent space reveals well-organized clusters and smooth trajectories reflecting experimental progression. It further supports controlled synthetic scattering image generation across diverse structural states. When deployed without retraining, the model organizes time-resolved film formation experiments at two synchrotron facilities into interpretable latent structures. Benchmarking against DINOv3 (ViT-7B), a general-purpose vision foundation model, demonstrates that domain-specific training yields more interpretable latent organization for scattering data. Both workflows are integrated within Latent Space Explorer, a component of the MLExchange platform, supporting interactive structural exploration across archived datasets and live experiments.

[LG-133] KATANA: A Fast Low-Power Mapping of Kalman Filters onto Edge NPUs for Real-Time Tracking

Link: https://arxiv.org/abs/2606.14992
Authors: Bodhisatwa Kundu,Anish Rooj,Sumit Saha,Abhradeep Sarkar,Arghadip Das,Arnab Raha,Mrinal K. Naskar
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG)

Abstract:State estimation is the closed-loop core of every real-time tracking system, from radar surveillance and counter-UAV defense to autonomous driving and robotics. These deployments run on edge platforms, where defense systems mount on vehicles and drones, and civilian pipelines live on cars and handheld devices. Here, every additional watt of compute erodes mission duration or operational range. Two hard constraints follow: each new measurement must be fused before the next control cycle, and the total compute must fit within a strict battery and thermal power envelope. The Linear and Extended Kalman Filters (LKF, EKF) are dominant estimators on these systems, but today they execute almost exclusively on CPUs, which serialize multi-object tracking (MOT) updates, or on custom FPGA/ASIC accelerators that lengthen design cycles. Contemporary AI-PC SoCs, like the Intel Core Ultra Series 1 and 2, integrate a low-power, data-parallel Neural Processing Unit (NPU). We therefore ask whether the Kalman filter can be mapped onto this existing matrix engine to meet real-time and low-power budgets simultaneously, avoiding a dedicated accelerator and keeping the CPU and GPU free for primary workloads. We present KATANA, an NPU-aware optimization framework delivering the first end-to-end mapping of the LKF and EKF onto a commercial NPU, alongside a cross-platform characterization on shipping AI-PC silicon. KATANA applies three algebraic graph rewrites: subtract-to-add reformulation via a precomputed negative-projection matrix H_neg, static-shape tensor fusion, and block-diagonal batched parallelization, ensuring 100% of operations execute on the DPU matrix engine. On the Series 2, the optimized batched EKF reaches 223.35 FPS at 13.43 W active power, and the LKF reaches 408.73 FPS at 14.05 W, delivering up to a 97.9% reduction in dynamic energy versus the CPU implementation.
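
Sketch (not from the paper): the subtract-to-add rewrite named in the abstract can be illustrated on the Kalman innovation y = z - Hx. If a negated projection matrix H_neg = -H is precomputed once, the same quantity becomes y = z + H_neg x, which needs only multiply and add operations. The matrix values, state layout, and function names below are made up for illustration, and plain Python lists stand in for NPU tensors.

```python
# Illustrative values only; H, x_pred, and z are assumptions, not taken from the paper.

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def innovation_subtract(z, H, x_pred):
    """Textbook Kalman innovation: y = z - H x."""
    return [zi - hi for zi, hi in zip(z, matvec(H, x_pred))]

def innovation_add(z, H_neg, x_pred):
    """Same innovation using only multiply and add: y = z + H_neg x."""
    return [zi + hi for zi, hi in zip(z, matvec(H_neg, x_pred))]

# Observe position only, from a hypothetical state [px, py, vx, vy].
H = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0]]
H_neg = [[-h for h in row] for row in H]  # precomputed once, offline

x_pred = [2.0, 3.0, 0.5, -0.5]  # predicted state
z = [2.5, 2.0]                  # new measurement

y_sub = innovation_subtract(z, H, x_pred)
y_add = innovation_add(z, H_neg, x_pred)
print(y_sub, y_add)
```

Both forms give the same innovation; the second simply moves the negation out of the per-step graph and into a constant.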

[LG-134] Continual Backdoor Training in IoT/CPS

Link: https://arxiv.org/abs/2606.14987
Authors: Oxana Salish,Kuniyilh S
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:Internet of Things (IoT) and Cyber-physical systems (CPS) increasingly rely on continual learning (CL) to adapt to evolving environments, device heterogeneity, and concept drift, thereby improving overall utility. While continual adaptation is essential for long-lived IoT deployments where data patterns evolve, it also introduces new security vulnerabilities. In particular, backdoor attacks can exploit incremental updates, replay buffers, and representation reuse to implant persistent malicious behaviors that remain dormant during normal operation but activate upon specific triggers. In this paper, we present a backdoor attack in continual learning used in IoT/CPS systems. To this end, we formalize an IoT/CPS-specific threat model, analyze why continual learning amplifies backdoor persistence in IoT pipelines, and evaluate our technique under varying conditions. Our analysis highlights critical open challenges in securing lifelong learning in IoT/CPS and industrial IoT (IIoT) environments, as well as the need for heightened security controls.

[LG-135] Zero-order Parameter-free Optimization for LMO-based Methods: Novel Approach for Efficient Fine-tuning

Link: https://arxiv.org/abs/2606.14970
Authors: Dmitriy Bystrov,Daniil Medyakov,Dmitry Bylinkin,Aleksandr Beznosikov
Categories: Machine Learning (cs.LG)
Comments: 29 pages, 1 table

Abstract:Fine-tuning large language models (LLMs) has become a central application of modern optimization, enabling pretrained models to adapt to diverse downstream tasks and domain-specific data. A major obstacle in large-scale fine-tuning is the memory overhead of backpropagation, which requires storing activations, gradients, and optimizer states. Zeroth-order (ZO) optimization offers a memory-efficient alternative, but its performance is highly sensitive to the stepsize and smoothing parameter, often requiring costly task-specific tuning. Parameter-free (PF) optimization addresses this issue by adapting algorithmic parameters without prior knowledge of problem-dependent constants. Moreover, large-scale fine-tuning can benefit from geometry-aware updates that account for the heterogeneous structure of parameter blocks, which can be modeled through methods that exploit a linear minimization oracle (LMO). In this work, we study PF adaptation for LMO-based ZO optimization and introduce AdaNAGED, a method that unifies gradient-free training, adaptive tuning, and non-Euclidean update geometry. We establish convergence guarantees and validate the method on a large-scale LLM fine-tuning task with the OPT-1.3B model.
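
Sketch (not the paper's AdaNAGED method): the two knobs the abstract calls sensitive, the stepsize and the smoothing parameter, both appear in a standard two-point zeroth-order gradient estimate. The toy below runs plain ZO gradient descent on a quadratic with hand-picked values for both; the objective, the function names, and the chosen constants are assumptions for illustration, and no parameter-free or LMO-based logic is included.

```python
import random

def zo_gradient(f, x, mu, rng):
    """Two-point estimate along a Gaussian direction u:
    g = (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u. Uses function values only."""
    u = [rng.gauss(0.0, 1.0) for _ in x]
    x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
    x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
    scale = (f(x_plus) - f(x_minus)) / (2.0 * mu)
    return [scale * ui for ui in u]

def zo_descent(f, x0, stepsize, mu, steps, seed=0):
    """Plain gradient descent driven by the zeroth-order estimate."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        g = zo_gradient(f, x, mu, rng)
        x = [xi - stepsize * gi for xi, gi in zip(x, g)]
    return x

def quadratic(x):
    """Toy objective with minimum at the origin."""
    return sum(xi * xi for xi in x)

# stepsize and mu are hand-tuned here; this manual tuning is what
# parameter-free methods aim to avoid.
x_final = zo_descent(quadratic, [1.0, -2.0, 0.5], stepsize=0.05, mu=1e-3, steps=500)
print(quadratic(x_final))
```

No gradients or activations are stored, only two function evaluations per step, which is the memory argument made in the abstract.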

[LG-136] Benchmarking Instance-Dependent Label Noise with Controlled Corruptions

Link: https://arxiv.org/abs/2606.14965
Authors: Shadman Islam,Agustinus Kristiadi,Mostafa Milani
Categories: Machine Learning (cs.LG); Databases (cs.DB)
Comments: 12-page conference submission

Abstract:Synthetic instance-dependent label noise (IDN) benchmarks are widely used to evaluate noisy-label learning methods, yet existing approaches typically generate noise through imperfect annotators or classifier raters, leaving the source of ambiguity implicit. We introduce CILN, a benchmark generation framework that creates IDN through controlled input corruptions. A diverse voter pool labels corrupted instances, producing benchmark datasets in which both the source and severity of ambiguity are explicit and controllable. Using CIFAR10, MNIST, and Adult, we construct 90 benchmark settings spanning multiple corruption families and severity levels. Our experiments show that the resulting benchmarks exhibit genuine instance-dependent noise, provide diverse confusion structures, and, on CIFAR-10, can produce label distributions that are closer to human uncertainty than an existing synthetic IDN benchmark. We further demonstrate that corruption-mediated IDN can expose failure modes of popular noisy-label learning methods, including Co-Teaching and DivideMix, that are not observed under comparable levels of rater-fallibility noise. These findings suggest that noise structure, not only noise rate, plays an important role in benchmark difficulty and algorithm behavior. By making ambiguity generation explicit and controllable, CILN provides a complementary benchmarking framework for studying noisy-label learning under diverse sources of instance difficulty.

[LG-137] Leveraging Physiological Signals to Predict Exam Outcomes with Machine Learning

Link: https://arxiv.org/abs/2606.14960
Authors: Lala Yamazaki,Ramchandra Rimal
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments: 9 figures, and 5 tables

Abstract:This study investigates the application of machine learning models to predict exam outcomes using physiological data collected during examination sessions. Physiological stress indicators, including electrodermal activity, heart rate, and skin temperature, were analyzed to uncover their association with academic performance. A variety of machine learning approaches were employed, ranging from standard models like logistic regression, random forest, and support vector machines to more advanced architectures, including transformers, long short-term memory (LSTM), and gated recurrent unit (GRU) models. This diversity aimed to capture the complex interactions within the data effectively. A key focus was assessing the adaptability of transformers in processing numerical data and evaluating their performance in this novel context. Standard performance metrics, such as accuracy, precision, recall, and F1-score, were used to compare model efficacy. The experimental results demonstrate that while deep learning models generally excel at capturing complex relationships in physiological data, simpler models like random forests can sometimes achieve superior performance while offering computational efficiency and interpretability. Furthermore, transformers demonstrated notable versatility, showcasing performances comparable to those of the LSTM and GRU models. This research underscores the importance of experimenting with a broad class of models that align with the objectives of the problem at hand, balancing precision, efficiency, and interpretability. By elucidating the relationships between physiological signals and academic performance, this study contributes to understanding stressors affecting students’ mental health. It further promotes leveraging physiological data to enhance student well-being and academic outcomes.

[LG-138] A Comparative Study of Graph Neural Network Layer Selection for Interaction Modelling in Driving Trajectory Prediction

Link: https://arxiv.org/abs/2606.14956
Authors: George Daoud,Mohamed El-Darieby
Categories: Machine Learning (cs.LG)
Comments: 6 pages, 1 figure

Abstract:Autonomous driving systems rely on precise trajectory prediction to plan safe and efficient movement. Graph Neural Networks (GNNs) have become a promising approach for modelling spatiotemporal interactions among road agents. However, designing GNN architectures for trajectory prediction remains non-standardized, with little guidance on which graph layers effectively capture spatial interactions and temporal dynamics. This paper offers a detailed comparative study of 19 graph layer types, focusing on their spatial and temporal processing capabilities to discover the most effective architectures for trajectory prediction. Within the explored hyperparameter setting, we highlight five standout layer combinations, with ARMA, Chebyshev, and topology-aware layers consistently performing better than others. Beyond performance metrics, our findings yield practical design principles: sum-based aggregation is more effective than mean-based methods, multi-head attention mechanisms enable richer interactions, and assigning different weights to different hop distances significantly improves prediction accuracy. These findings offer useful guidance for designing more interpretable and effective trajectory prediction models.

[LG-139] Remember, Don't Re-read: Stateful ReAct Agents for Token-Efficient Autonomous Experimentation

Link: https://arxiv.org/abs/2606.14945
Authors: Faramarz Jabbarvaziri
Categories: Machine Learning (cs.LG)

Abstract:The autoresearch pattern enables autonomous experimentation by having a large language model (LLM) iteratively modify code to optimize a target metric. Its stateless design, however, reconstructs experimental context from scratch at every iteration, incurring O(n) token cost per iteration and O(n^2) total. This work reformulates the pattern as a stateful ReAct agent using LangGraph, where typed persistent state carries experimental history across iterations via a tool-calling interface. Two benchmarks are evaluated: hyperparameter tuning (15 iterations, small per-iteration observations) and code performance optimization (40 iterations, large per-iteration observations containing full source code and benchmark results). On hyperparameter tuning, the stateful agent consumes 90% fewer tokens (2,492 vs. 24,465). On code optimization, the stateful agent consumes 52% fewer tokens (627K vs. 1,275K) while achieving comparable optimization quality on both tasks. The token reduction is structural: the stateless agent re-reads the full history at O(n) cost per iteration, while the stateful agent operates within a fixed-size conversation window at O(1) cost. This paper describes the architecture in sufficient detail for practitioners to implement a stateful autoresearch agent for their own workflows.
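
Sketch (a toy cost model, not the paper's measurement): the O(n^2) versus O(n) totals in the abstract follow from simple arithmetic. If every iteration adds one observation of fixed size, a stateless agent re-reads 1, 2, ..., n observations, while a stateful agent reads one fixed-size window per iteration. The per-observation size of 100 tokens and the function names are hypothetical, chosen only to make the sums easy to check.

```python
def stateless_total_tokens(n_iters, obs_tokens):
    """Iteration i re-reads all i observations so far: obs * (1 + 2 + ... + n)."""
    return sum(i * obs_tokens for i in range(1, n_iters + 1))

def stateful_total_tokens(n_iters, window_tokens):
    """Each iteration reads one fixed-size window: window * n."""
    return n_iters * window_tokens

n = 15  # iteration count borrowed from the abstract's tuning benchmark
stateless = stateless_total_tokens(n, obs_tokens=100)    # 100 * 120 = 12000
stateful = stateful_total_tokens(n, window_tokens=100)   # 100 * 15  = 1500
savings = 1 - stateful / stateless                       # 0.875

print(stateless, stateful, savings)
```

In this idealized model the ratio is 2 / (n + 1), so the saving grows with the number of iterations; real runs also pay for system prompts and tool schemas, which the model ignores.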

[LG-140] GRASP: Gradient-Aligned Sequential Parameter Transfer for Memory-Efficient Multi-Source Learning

Link: https://arxiv.org/abs/2606.14900
Authors: Mary Isabelle Wisell,Nicholas Jacobs,Aayush Manandhar,Salimeh Yasaei Sekeh
Categories: Machine Learning (cs.LG)

Abstract:Multi-source transfer learning faces a fundamental scalability bottleneck: existing approaches require either loading all K source models into memory simultaneously during parameter fusion, requiring O(K) memory, or deploying all models at inference time, making production deployment infeasible. We propose GRASP (Gradient-Aligned Sequential Parameter Transfer), which achieves superior knowledge integration while maintaining O(1) memory consumption through three key innovations: (1) sequential processing that merges one source at a time into an evolving target model, (2) parameter-wise gradient alignment that selectively transfers only parameters whose optimization directions align with the target domain, avoiding negative transfer, and (3) iterative fine-tuning that adapts transferred knowledge before integrating the next source. Extensive experiments across three continual learning benchmarks (Yearbook, CLEAR-10, CLEAR-100) spanning 10 to 108-year temporal distribution shifts and four architectures (1.3M to 25.6M parameters) demonstrate that GRASP achieves 93.5% mean accuracy over all datasets and architectures compared to ensemble method’s 71.7% accuracy while requiring only constant memory versus K models for standard multi-source fusion. Critically, GRASP’s sequential previously merged models and scales to arbitrarily many sources without memory growth, making it uniquely suitable for resource-constrained deployment and continually evolving source domains.
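
Sketch (an interpretation, not GRASP itself): the abstract does not give the alignment rule, so the toy below assumes one plausible reading, namely that a source parameter is transferred only when its gradient has the same sign as the target gradient for that parameter. Sources are consumed one at a time from a generator so that only the current source and the evolving target are held in memory. The fine-tuning step between merges is omitted, target gradients are held fixed for brevity, and all names and numbers are hypothetical.

```python
def aligned_transfer(target_params, source_params, target_grads, source_grads):
    """Take the source value where gradient signs agree (assumed rule), else keep the target."""
    merged = []
    for t, s, gt, gs in zip(target_params, source_params, target_grads, source_grads):
        merged.append(s if gt * gs > 0 else t)
    return merged

def merge_sequentially(target_params, target_grads, sources):
    """Fold sources into the target one by one; `sources` yields (params, grads) pairs."""
    current = list(target_params)
    for source_params, source_grads in sources:
        current = aligned_transfer(current, source_params, target_grads, source_grads)
        # A real pipeline would fine-tune `current` and recompute target_grads here.
    return current

def source_stream():
    """Generator standing in for loading one source model at a time."""
    yield [2.0, 2.0, 2.0], [0.3, 0.4, -0.1]    # source A: aligns on parameter 0 only
    yield [3.0, 3.0, 3.0], [-0.1, -0.5, 0.2]   # source B: aligns on parameters 1 and 2

target = [1.0, 1.0, 1.0]
target_grads = [0.5, -0.2, 0.1]

merged = merge_sequentially(target, target_grads, source_stream())
print(merged)
```

Peak memory here is one source plus the target regardless of how many sources the stream yields, which is the constant-memory property the abstract emphasizes.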

[LG-141] α-Fair Insurance Pricing: A Fairness Continuum

Link: https://arxiv.org/abs/2606.14898
Authors: Tianhe Zhang,Xiguang Liu,Peng Shi
Categories: Machine Learning (cs.LG)

Abstract:Fairness in insurance pricing remains a long-standing and deeply debated puzzle. On one hand, insurers, driven by profitability considerations, set premiums that differentiate across individual risks to achieve actuarial fairness. On the other hand, insurance serves a critical societal function by pooling risks across a population, motivating cross-subsidization among groups to promote solidarity fairness. The tension between these two competing notions of fairness makes insurance pricing inherently complex, particularly in modern settings where granular data allow for increasingly fine risk differentiation and regulators face growing pressure to protect vulnerable groups. To address this challenge, we propose an α-Fair Individual Solvent Premium (α-FISP) framework for insurance pricing that explicitly captures the trade-off between actuarial and solidarity fairness while guaranteeing solvency, a fundamental requirement in insurance operations. We formulate the pricing problem as a constrained optimization task, where actuarially fair premiums are adjusted subject to budget constraints on cross-subsidization within each risk class. This formulation naturally yields a family of solutions parameterized by α, tracing a continuum between purely actuarial and purely solidarity-based pricing and enabling decision-makers to select an operating point along this fairness spectrum. We derive theoretical guarantees for the proposed framework. Numerical experiments show that α-FISP is computationally tractable and aligns well with the U.S. regulatory regimes featuring heterogeneous state-level fairness requirements.

[LG-142] LLM-Based Synthetic Ground Truth Generation for Audio-Based Emotion Classification via In-Context Learning

Link: https://arxiv.org/abs/2606.14784
Authors: Qing Huang,Pooja Pol,Jianing Zhang
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Proceedings of the International Conference on Applied Innovations in IT (ICAIIT), April 2026

Abstract:Understanding human states and interaction dynamics is a core goal of human-computer interaction (HCI). As interaction paradigms become more immersive, virtual reality (VR) has emerged as a powerful platform for studying collaborative work. In such settings, evaluating team collaboration states, including team performance and team resilience, requires continuous and reliable inference of latent team-level cognitive and affective states from multi-modal sensor data, such as speech signals. However, generating ground truth labels for these latent states remains challenging due to sensor-induced noise, contextual variability, and sparse expert annotations. Traditional self-reporting approaches provide only static and delayed measurements and are therefore insufficient for capturing dynamic team processes reflected in continuous speech data. In this work, we propose a large language model (LLM)-driven, agentic inference workflow for automated emotion-related synthetic ground truth generation from streaming speech data in multi-user VR environments. Leveraging the generalization capabilities of LLMs, we use In-Context Learning (ICL) with few-shot demonstrations of paired audio-based samples and their corresponding transcriptions. ICL tends to achieve task adaptation comparable to model fine-tuning while circumventing the computational overhead of parameter updates. To construct informative and robust in-context prompts, we adopt a retrieval-based selection strategy that dynamically identifies relevant audio demonstrations based on similarity in the acoustic feature space.
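
Sketch (an assumed design, not the paper's pipeline): retrieval-based demonstration selection can be pictured as ranking a pool of labeled examples by similarity to the query in an acoustic feature space and placing the top k in the prompt. Cosine similarity, the two-dimensional feature vectors, the labels, and the prompt layout below are all hypothetical stand-ins; the abstract does not specify the similarity measure or the features.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_demonstrations(query_features, pool, k):
    """Return the k pool entries most similar to the query, most similar first."""
    ranked = sorted(pool, key=lambda d: cosine(query_features, d["features"]), reverse=True)
    return ranked[:k]

def build_prompt(demos, query_transcript):
    """Assemble a few-shot prompt from the selected demonstrations."""
    lines = [f"Transcript: {d['transcript']}\nEmotion: {d['label']}" for d in demos]
    lines.append(f"Transcript: {query_transcript}\nEmotion:")
    return "\n\n".join(lines)

# Hypothetical demonstration pool with made-up acoustic features.
pool = [
    {"id": "a", "features": [1.0, 0.0], "transcript": "Okay, next step.", "label": "calm"},
    {"id": "b", "features": [0.9, 0.1], "transcript": "Looks fine to me.", "label": "calm"},
    {"id": "c", "features": [0.0, 1.0], "transcript": "We are out of time!", "label": "stressed"},
]

chosen = select_demonstrations([1.0, 0.05], pool, k=2)
prompt = build_prompt(chosen, "Let's move on.")
print([d["id"] for d in chosen])
```

Because selection happens per query, the prompt adapts to each incoming speech segment without any parameter updates, which is the trade-off against fine-tuning that the abstract describes.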

[LG-143] Deep Learning-Based Lunar Crater Terrain Relative Navigation

Link: https://arxiv.org/abs/2606.14776
Authors: Batu Candan,Simone Servadio
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:Accurate position estimation is crucial for the successful implementation of future lunar landings using autonomous vehicles, especially in dangerous environments with sparse terrain features. In this paper, we propose a terrain relative navigation (TRN) algorithm combining our deep-learning crater detector, which was designed specifically for the NASA Crater Detection Challenge problem, and an Extended Kalman Filter (EKF). Our detector analyzes crater features from the monocular images acquired from orbit, and their matches with craters from a global database are identified via a Hungarian assignment approach followed by the consensus-based outliers removal method. The estimated measurements are then used to refine an EKF, where spacecraft pose estimation in the Lunar-Centered Lunar-Fixed (LCLF) frame of reference, augmented with altitude aiding information, constrains radial drift. The simulation results indicate that even if the spacecraft is off from its actual location up to 5 km, TRN could recover from this situation, achieving navigation error reduction to a few hundred meters. It should be noted that in order to maintain crater feature correspondences, it is important to match the image resolution and the scales within the scene to the detector training set distribution.

[LG-144] Bayesian Optimization for Learning Nonlinear MPC in Autonomous Agent Navigation ICRA2026

Link: https://arxiv.org/abs/2606.14763
Authors: Lorenzo Ortolani,Gabriel Voss,Gabriele Beltrami,Francesco Dorati,Tommaso Felice Banfi
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: Published at the IEEE ICRA 2026 Xplore Workshop (Oral), Cross-Disciplinary aspects of Exploration in Robotics, Reinforcement Learning, and Search

Abstract:Real-time autonomous navigation in dynamic, unknown environments remains a fundamental challenge for mobile robotics. We propose a map-free framework that tightly integrates reactive rolling-horizon planning with nonlinear Model Predictive Control (MPC). At each control cycle, a LiDAR-based Gaussian occupancy representation is constructed and used to generate collision-free trajectories via A* search, which are then tracked by a CasADi/IPOPT MPC formulation incorporating a smooth sigmoid obstacle barrier. To improve robustness to parameter sensitivity, we adopt an offline Bayesian optimization scheme based on Tree-structured Parzen Estimators (TPE), which identifies near-optimal controller parameters with respect to a composite navigation objective. In addition, a Gaussian Process surrogate is used to analyze parameter sensitivity and provide insight into the optimization landscape. The proposed framework is robot-agnostic and is evaluated on the Unitree Go2 quadruped in simulation using Gazebo, followed by deployment on the physical robot. Experimental results show that parameters tuned in simulation transfer effectively to hardware, maintaining comparable performance without additional tuning. The full system achieves up to a 90.0% navigation success rate when deployed, along with a 38.9% average improvement in the evaluation metrics across simulated environments.
ACM classes: I.2; I.6

[LG-145] An RRAM-based Hardware Implementation of a Radial Basis Function Neuron for Edge Classifiers

Link: https://arxiv.org/abs/2606.14739
Authors: Georgios Papandroulidakis,Shady Agwa,Themis Prodromakis
Categories: Emerging Technologies (cs.ET); Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:The deployment of modern machine learning (ML) solutions on resource-constrained edge devices highlights implementation challenges. This is especially true for extreme edge applications that include safety-critical components, such as autonomous navigation tasks. This paper demonstrates an artificial neural network (ANN) design leveraging Metal-Oxide Resistive RAM (RRAM) -based Analogue Content Addressable Memory (ACAM) as an efficient hardware substrate for performing metric-based classification and online adaptation on the edge. The proposed design is based on a custom Template piXeL (TXL) cell used for building the ACAM module, where each TXL cell acts as a configurable receptive field neuron. These cells employ a Radial Basis activation function to calculate the distance of an input from the programmed receptive field. The TXL can be organised into dense arrays for calculating the distance of a high-dimensional input against all stored prototypes, effectively performing fast and energy efficient similarity search. This hardware engine enables on-the-fly learning, where the receptive field parameters can be tuned to track domain shift. Through simulation of the proposed TXL-RBF classifier we can achieve 89.1% accuracy on the MNIST dataset while consuming 185fJ per cell per operation when operating at 100MHz.
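
Sketch (a software analogy, not the TXL circuit): metric-based classification with radial basis neurons can be pictured as storing one prototype per class, scoring an input by a distance-based activation against each prototype, and picking the strongest response. The Gaussian activation, the width value, the two-dimensional prototypes, and the adaptation rule below are assumptions for illustration; the paper's analogue cells may realize a different response shape.

```python
import math

def rbf_activation(x, center, width):
    """Gaussian radial basis response: 1.0 at the centre, decaying with squared distance."""
    dist_sq = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-dist_sq / (2.0 * width ** 2))

def classify(x, prototypes, width=1.0):
    """Return the label of the stored prototype with the strongest response."""
    best_label, _ = max(prototypes, key=lambda p: rbf_activation(x, p[1], width))
    return best_label

def adapt(center, x, rate):
    """Online adaptation: move a stored centre a fraction `rate` toward a new sample."""
    return [c + rate * (xi - c) for c, xi in zip(center, x)]

# Hypothetical stored prototypes as (label, centre) pairs.
prototypes = [("zero", [0.0, 0.0]), ("one", [1.0, 1.0])]

print(classify([0.2, 0.1], prototypes))   # nearer to the "zero" centre
print(classify([0.9, 0.8], prototypes))   # nearer to the "one" centre
print(adapt([0.0, 0.0], [1.0, 1.0], rate=0.5))
```

Comparing an input against every stored centre is the similarity search the abstract describes; the `adapt` step mirrors tuning the receptive field to follow domain shift.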

[LG-146] Teaching LLMs Program Semantics via Symbolic Execution Traces

Link: https://arxiv.org/abs/2605.06184
Authors: Jonas Bayer,Stefan Zetzsche,Olivier Bouissou,Remi Delmas,Michael Tautschnig,Soonho Kong
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Programming Languages (cs.PL)

Abstract:We introduce an evaluation framework of 500 C verification tasks across five property types (memory safety, overflow, termination, reachability, data races) built on SV-COMP 2025, and evaluate 14 models across six families. We find that high overall accuracy masks a critical weakness: while most models reliably confirm properties hold, violation detection varies widely and degrades sharply with program length. To close this gap, we train on formal verification artifacts: running the Soteria symbolic execution engine on generic open-source C code and using the resulting traces for continued pretraining of Qwen3-8B. Just ~3,000 bug traces combined with chain-of-thought reasoning at inference time improve violation detection by over 17 percentage points, producing one of the most balanced accuracy profiles among evaluated models. On violation detection, the trained 8B model outperforms the 4× larger Qwen3-32B without thinking and approaches it in overall accuracy. The interaction between trace training and chain-of-thought is superadditive: neither alone provides meaningful gains, but their combination does. Improvements transfer across all five property types, including ones the training traces do not target. Our 28 configurations confirm the gains stem from trace semantics, not code volume, and that trace curation and format matter.

[LG-147] Gradient Boosted Risk Scores

Link: https://arxiv.org/abs/2605.02593
Authors: Costa Georgantas,Jonas Richiardi
Categories: Machine Learning (cs.LG); Mathematical Software (cs.MS)

Abstract:Risk scores are an interpretable and actionable class of machine learning models with applications in medicine, insurance, and risk management. Unlike most computational methods, risk scores are designed to be computed by a human by attributing points to a data sample based on a limited set of criteria. The most common approaches for generating risk scores use linear regressions to estimate the effect of selected variables. We propose a simple and effective approach towards building compact and predictive risk scores. We provide an algorithm based on gradient boosting that is capable of modeling nonlinear effects, along with a C++ implementation with Python and R bindings. Through extensive empirical evaluation on twelve tabular datasets spanning regression, classification, and time-to-event tasks, we show that our method achieves competitive predictive performance while producing substantially more compact scores than regression-based alternatives, with 60% fewer rules for classification tasks and 16% fewer rules for time-to-event tasks on average, compared to AutoScore.

[LG-148] Learning the Geometry of Data: A Mathematical Review of Shape Space Analysis

Link: https://arxiv.org/abs/2606.17022
Authors: Gary P. T. Choi,Khanh Dao Duc,Shira Faigenbaum-Golovin,Karen Habermann,Emmanuel Hartman,Christoph von Tycowicz,Chi Zhang,Wenjun Zhao,Felix Zhou
Categories: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 79 pages, 10 figures, 8 tables

Abstract:A central objective of machine learning is to identify structure and patterns in data. Advances in data acquisition have increasingly produced datasets whose observations possess rich geometric form, giving rise to shape spaces that encode variability in object geometry. Such datasets arise across a wide range of disciplines, including biology, medicine, anthropology, and computer vision, where subtle geometric differences often carry important scientific information. Traditional machine learning methods, however, are frequently ill-equipped to account for the nonlinear geometric structure underlying these data. This survey synthesizes a rapidly growing body of work on shape space analysis, which provides a mathematical and computational framework for the study of geometric data. Drawing on ideas from differential geometry, statistics, and machine learning, we organize the literature around a common analytical pipeline: shape representation and parameterization, the rigorous construction of robust geodesic metrics, statistical analysis on shape spaces, and geometry-aware learning methods. We discuss how these tools enable the characterization of shape variability, the comparison of geometric objects, and the analysis of structural trajectories across populations and time. To illustrate the breadth of the field, we highlight applications spanning multiple scales of biological organization, including studies of subcellular morphology and primate tooth evolution. Across these and many other domains, researchers face common challenges arising from complex, nonlinear, and often unaligned geometric variation. The review concludes by identifying key theoretical and computational challenges, as well as emerging opportunities driven by increasingly large and diverse geometric datasets. 
MSC classes: 68U05, 65D18, 92B05; Cite as: arXiv:2606.17022 [math.ST]; DOI: https://doi.org/10.48550/arXiv.2606.17022

[LG-149] Exploding and vanishing gradients in deep neural networks: the effect of residual connections

Link: https://arxiv.org/abs/2606.17013
Authors: Vivek S Borkar
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
*Comments: 10 pages

Click to view abstract

Abstract:The well known phenomenon of exploding and vanishing gradients in deep neural networks is analyzed using multiplicative ergodic theory. The effect of adding a residual connection is explained in this context. Specifically, a characterization of Liapunov exponents due to Furstenberg and Kifer is exploited in order to make a precise statement about the Liapunov spectrum and the effect of residual connections on it.
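
As a rough numerical illustration of the phenomenon (not the paper's analysis), the top Lyapunov exponent of a product of random layer Jacobians can be estimated by tracking the average log-growth of a vector pushed through the product. The Gaussian Jacobian model, the `scale` and `eps` values, and all function names below are assumptions made for this sketch. A clearly negative exponent corresponds to vanishing gradients, a clearly positive one to exploding gradients, and adding an identity (residual) term moves the exponent toward zero.

```python
import numpy as np

def plain(rng, dim, scale=0.5):
    # Hypothetical layer Jacobian: i.i.d. Gaussian entries with variance scale^2 / dim.
    return scale * rng.standard_normal((dim, dim)) / np.sqrt(dim)

def residual(rng, dim, scale=0.5, eps=0.1):
    # Same random layer, wrapped in a residual connection: identity plus a small perturbation.
    return np.eye(dim) + eps * plain(rng, dim, scale)

def top_lyapunov(make_jacobian, depth, dim, seed=0):
    """Estimate the top Lyapunov exponent of a product of `depth` random Jacobians."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    log_growth = 0.0
    for _ in range(depth):
        v = make_jacobian(rng, dim) @ v
        norm = np.linalg.norm(v)
        log_growth += np.log(norm)  # accumulate log-growth, then renormalize
        v /= norm
    return log_growth / depth

lam_plain = top_lyapunov(plain, depth=2000, dim=32)
lam_residual = top_lyapunov(residual, depth=2000, dim=32)
print(f"plain: {lam_plain:.3f}, residual: {lam_residual:.3f}")
```

Under these assumed settings the plain product contracts at a steady exponential rate, while the residual product stays close to norm-preserving; other choices of `scale` or `eps` can of course push either case toward explosion.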

[LG-150] Dynestyx: A Probabilistic Programming Library for Dynamical Systems

Link: https://arxiv.org/abs/2606.16985
Authors: Daniel Waxman,Dmitry Batenkov,John Feser,Andy Zane,Eli Bingham,Youssef Marzouk,Matthew E. Levine
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP); Chaotic Dynamics (nlin.CD); Methodology (stat.ME)
*Comments: 7 pages

Click to view abstract

Abstract:State-space models (SSMs) are the standard formalism for Bayesian treatment of dynamical systems, with natural applications in statistics, signal processing, and machine learning. Despite their importance in both theory and application, dynamical systems have proven difficult to incorporate in modern probabilistic programming languages (PPLs), making state-of-the-art methods less accessible to practitioners and introducing friction in following the “Bayesian workflow.” We introduce dynestyx, a probabilistic programming library with first-class support for SSMs, including state-of-the-art methods in the estimation of both states and parameters. Through a single, unified interface, users may specify arbitrary priors for discrete-time or continuous-time dynamical systems, perform inference over mixed-effect data, and make state and parameter estimates with principled uncertainty quantification.

[LG-151] Sobolev Approximation by Fixed-Size Neural Networks with Arbitrary Accuracy

Link: https://arxiv.org/abs/2606.16975
Authors: Baicheng Li,Haizhao Yang,Shijun Zhang
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:In this work, we investigate new activation functions for achieving arbitrary-accuracy Sobolev approximation by fixed-size neural networks. We first show that any function in $W^{2,\infty}((a,b)^d)$ can be approximated with arbitrary accuracy, measured in the $W^{1,\infty}$-norm, by a fixed-size neural network using the Elementary Universal Activation Function ($\mathrm{EUAF}$). To extend this result to $W^{s,\infty}((a,b)^d)$ for $s\in\mathbb{N}$, we introduce a smooth activation $\mathrm{DUAF}_\infty$ from the family of Differentiable Universal Activation Functions ($\mathrm{DUAF}_n$). We prove that any function in $W^{s,\infty}((a,b)^d)$ can be approximated with arbitrary accuracy in the $W^{s-1,\infty}$-norm by a fixed-size $\mathrm{DUAF}_\infty$-activated network. We further construct sigmoidal variants $\widetilde{\mathrm{DUAF}}_n$ and show that, for every $1\leq s\leq n$, fixed-size $\widetilde{\mathrm{DUAF}}_n$-activated networks still approximate any $f\in W^{s,\infty}((a,b)^d)$ with arbitrary accuracy in the $W^{s-1,\infty}$-norm. In all these results, the width and depth bounds are computed explicitly, and the proposed activations are elementary.

[LG-152] Latent space mapping of interpretable structural coordinates from stochastic single-molecule signals

Link: https://arxiv.org/abs/2606.16950
Authors: Matteo Cartiglia,Sandro Kuppel,Wouter Botermans,Wannes Peeters,Natan Biesmans,Liam Vandekerckhove,Eric Beamish,Koen Ongena,Wouter Renckens,Pol Van Dorpe,Sanjin Marion
Subjects: Instrumentation and Detectors (physics.ins-det); Machine Learning (cs.LG); Biological Physics (physics.bio-ph); Chemical Physics (physics.chem-ph); Data Analysis, Statistics and Probability (physics.data-an); Biomolecules (q-bio.BM)
*Comments: 32 pages, 6 figures

Click to view abstract

Abstract:Nanopores are versatile single-molecule sensors, but their utility is fundamentally constrained by stochastic translocation dynamics that warp any encoded information. We resolve this by shifting from time-domain analysis to a learned latent-space mapping via a contrastive encoder trained exclusively on simulated signals from a physics-informed model. This encoder maps solid-state nanopore signals of engineered DNA barcodes into an interpretable molecular coordinate system. The learned representation is responsive to structural barcode parameters while remaining invariant to acquisition conditions and translocation conformation, allowing data pooling across devices. Molecule identification requires a single pass through the encoder, reducing computational cost by three orders of magnitude relative to alignment-based methods. We experimentally validate the approach through mixture quantification, rare-variant detection, consensus barcode reconstruction, and real-time signal acquisition. This shift from temporal analysis to mapping structural coordinates into a latent space changes the paradigm behind analyzing stochastic sensor signals by linking classification to interpretable encoded molecular information.

[LG-153] A nonparametric two-sample test using a parametric integral probability metric

Link: https://arxiv.org/abs/2606.16941
Authors: Yuha Park,Yongdai Kim
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments: 45 pages. Accepted for publication in Statistical Analysis and Data Mining

Click to view abstract

Abstract:Detecting distributional differences between two independent samples is a fundamental problem in statistics and machine learning. Nonparametric two-sample testing provides a principled framework for determining whether two samples are drawn from the same underlying distribution, without assuming any specific parametric form for the distribution. In this study, we propose a new two-sample test statistic based on a newly introduced integral probability metric (IPM), using a specially designed parametric discriminator class with a single node of a neural network. We show that the resulting test statistic, called PReLU-IPM, is nonparametric and establish theoretical guarantees for the associated two-sample testing procedure, PReLU-TST, including its consistency and asymptotic equivalence to nonparametric IPM-based tests under regularity conditions. By analyzing multiple simulated and real benchmark datasets, we demonstrate that, for finite samples, PReLU-TST achieves higher power across a range of alternatives or performs comparably to its competitors.

[LG-154] Functional Gradient Descent with Adaptive Representations

Link: https://arxiv.org/abs/2606.16926
Authors: Daniel Csillag,Rodrigo Schuller,Pedro Dall’Antonia,Leonidas Guibas,Luiz Velho,Tiago Novello
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:

Click to view abstract

Abstract:Functional optimization problems are typically solved by optimizing the parameters of a fixed representation, such as a neural network, resulting in highly nonconvex losses that complicate both training and theoretical analysis. An interesting alternative is functional gradient descent (FGD), that is, gradient descent directly in function space, which benefits from strong convergence results and admits a clean theory. However, FGD is difficult to implement in practice because functional gradients are infinite-dimensional, and thus cannot be fully computed nor stored in memory. Existing implementations therefore rely on fixed approximations, which introduce approximation error. We propose a new, theoretically-grounded FGD algorithm that adapts the representation of the functional gradients over the course of optimization. By explicitly incorporating this approximation into the analysis, we establish convergence to a stationary point (for smooth losses) and to a global minimizer (under smoothness + a Polyak-Lojasiewicz-type condition) regardless of our approximations. To the best of our knowledge, this is the first implementable FGD method with such guarantees in a general setting. We demonstrate the effectiveness of our method on regression, numerical solution of PDEs, and modern computer vision. Across settings, our method consistently outperforms both FGD with fixed approximations and neural network baselines in efficiency and accuracy.

[LG-155] The Algebra of Units: From Buckingham’s Pi-grec Theorem to Latent-Variable Learning

Link: https://arxiv.org/abs/2606.16737
Authors: Mauro Valorani
Subjects: Mathematical Physics (math-ph); Machine Learning (cs.LG)
*Comments: 31 pages, 2 figures

Click to view abstract

Abstract:Engineers often measure many quantities (speed, pressure, temperature, length) expressed in different physical units. The Buckingham Pi-grec theorem states that these variables can always be combined into a smaller set of dimensionless numbers whose values fully determine the system’s behaviour. Identifying the appropriate dimensionless groups has traditionally required expert knowledge and physical insight. This paper shows that they can instead be discovered automatically from data, without prior knowledge of the governing physics. The key observation is that, after logarithmic transformation, measurements collected under different scalings of the same system lie on a low-dimensional manifold whose geometry is determined by the underlying dimensionless groups. Singular value decomposition (SVD) identifies this manifold directly from data. A subsequent search over integer-exponent combinations recovers candidate dimensionless quantities, while a repeating-variable filter retains only those constructed from the machine’s characteristic scales. This procedure recovers familiar engineering groups, including the flow coefficient, head coefficient, and Mach number, while excluding equivalent but less interpretable alternatives. The method is demonstrated on a synthetic compressor dataset containing 16,000 measurements. Starting from raw dimensional variables and no physics input, it recovers the correct dimensionless groups to numerical precision and reproduces the compressor performance map with an error below 0.01%. More broadly, the work reveals a close connection between classical dimensional analysis and modern data-driven learning. Both rely on the same underlying algebraic structure, suggesting new approaches for building physical models that are simultaneously interpretable, scalable, and data-efficient.
Cite as: arXiv:2606.16737 [math-ph]; DOI: https://doi.org/10.48550/arXiv.2606.16737; Submitted: Mon, 15 Jun 2026 13:58:43 UTC (v1)
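
For context, the classical (non-data-driven) form of the Buckingham Pi theorem can be sketched as a null-space computation: the exponent vectors of dimensionless groups span the null space of the dimension matrix, which SVD exposes directly. The pendulum variables and the function name below are illustrative assumptions and are not taken from the paper, whose method instead works from measured data.

```python
import numpy as np

def dimensionless_exponents(dim_matrix, tol=1e-10):
    """Return a basis (as rows) of exponent vectors a with dim_matrix @ a = 0."""
    _, singular_values, vt = np.linalg.svd(dim_matrix)
    rank = int(np.sum(singular_values > tol))
    # Rows of vt beyond the rank span the null space of dim_matrix.
    return vt[rank:]

# Hypothetical pendulum example.
# Columns: period T, length L, gravity g, mass m.
# Rows: exponents of the base dimensions mass, length, time.
pendulum = np.array([
    [0.0, 0.0,  0.0, 1.0],   # mass
    [0.0, 1.0,  1.0, 0.0],   # length
    [1.0, 0.0, -2.0, 0.0],   # time
])

basis = dimensionless_exponents(pendulum)
group = basis[0] / basis[0][2]   # rescale so that g carries exponent 1
print(np.round(group, 6))        # exponents of (T, L, g, m)
```

With four variables and three independent base dimensions there is a single dimensionless group, here T^2 g / L, and the mass drops out with exponent zero.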

[LG-156] Learning Hybrid Biophysical Neuron Models with Neural ODEs

Link: https://arxiv.org/abs/2606.16693
Authors: Jonas Beck,Michael Deistler,Dóra Viktória Molnár,Jakob H. Macke,Philipp Berens
Subjects: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Biophysical neuron models link measurements of neural activity to underlying cellular mechanisms. Yet, a central challenge is that the kinetics of many ion channels are poorly characterized, and practical simplifications – omitting channels or reducing morphological detail – introduce systematic gaps between model and biology. Bridging these gaps requires approaches that can flexibly discover unmodeled dynamics while preserving mechanistic interpretability. Here, we introduce a hybrid modeling framework that embeds neural ordinary differential equations into conductance-based biophysical models to capture unknown currents or mis-specified channel kinetics. By parameterizing the neural ODE in terms of voltage-dependent steady-state and time-constant functions, we recover interpretable gating dynamics directly from voltage recordings without assuming a functional form. We show that the hybrid model fits the gating kinetics of 2400 ion channel models and recovers unknown gating dynamics from single current-clamp recordings, generalizing to out-of-distribution stimulus regimes under realistic inputs and parameter misspecification. We also use our method to reduce a multicompartment model of a cortical neuron into a single-compartment hybrid model with a learned axial current, yielding up to an order of magnitude lower computational cost. Together, our results establish a plug-and-play framework for selectively replacing unknown components of conductance-based models with neural ODEs while preserving their mechanistic structure.

[LG-157] Diffusion Flow Matching: Dimension-Improved KL Bounds and Wasserstein Guarantees

Link: https://arxiv.org/abs/2606.16610
Authors: Marta Gentiloni Silveri,Giovanni Conforti,Alain Durmus
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Diffusion Flow Matching (DFM) has recently emerged as a versatile framework for generative modeling, yet its theoretical convergence properties remain only partially understood. In this work, we provide refined and novel convergence guarantees for Brownian-motion-based DFMs, focusing on the discretization error. Our analysis is conducted under the Kullback-Leibler (KL) divergence and the 2-Wasserstein distance. Under finite-moment conditions and a mild score integrability assumption, we derive KL convergence bounds with improved dimensional dependence compared to prior work, achieving, to our knowledge, state-of-the-art scaling under minimal conditions. We further extend the analysis to the 2-Wasserstein distance: under an additional first-order score integrability assumption and a weak log-concavity condition, we obtain convergence guarantees with dimensional dependence consistent with the KL case.

[LG-158] Context-Aware Markov VAE for CSI Compression in Wireless Systems

Link: https://arxiv.org/abs/2606.16607
Authors: Efstathios Chatziloizos,Konstantinos Vandikas,Aneta Vulgarakis Feljan,Zheng Chen,Nikolaos Pappas
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
*Comments: 5 pages, 3 figures, 2 tables

Click to view abstract

Abstract:This paper considers neural channel state information (CSI) compression for time-varying massive multiple-input multiple-output (MIMO) channels in frequency division duplex (FDD) systems with limited feedback resources. The main challenge lies in obtaining a compact and efficient representation of the CSI given that it exhibits strong temporal correlation across successive snapshots. Existing memoryless compression models do not exploit this property, while simple temporal extensions often incorporate multiple observations without explicitly modeling the latent dynamics. We propose a context-aware compression framework based on a k-memory Markov variational autoencoder (k-MMVAE), which uses a finite temporal window to capture the evolution of CSI in the latent space. The model introduces Markov-structured latent dynamics with finite memory, enabling efficient use of temporal dependencies for compression. Simulation results show that the proposed approach improves target CSI reconstruction performance compared to memoryless and weakly sequential baselines, particularly at low and moderate compression rates. These results suggest that explicit latent temporal modeling can provide an effective mechanism for CSI compression under limited feedback constraints.

[LG-159] MultiMolecule: a modular ecosystem for biomolecular sequence-model workflows

Link: https://arxiv.org/abs/2606.16540
Authors: Zhiyuan Chen
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Biomolecules (q-bio.BM); Genomics (q-bio.GN)
*Comments:

Click to view abstract

Abstract:Biomolecular sequence models are increasingly reused outside the studies in which they were introduced, but public checkpoints rarely preserve the execution context needed to inspect source-defined behavior, adapt models to new assays, compare models under shared task definitions or deploy biological predictions. MultiMolecule is an open-source Python ecosystem that turns heterogeneous RNA, DNA and protein sequence-model releases into complete, source-checked model-family implementations with shared loading, workflow and prediction interfaces. The Resource state reported here includes 53 complete model-family implementations with 112 standardized model checkpoints, together with 16 curated dataset resources released through 39 public dataset repositories and 10 user-facing prediction pipelines. Standardized components are linked to source provenance, conversion or preparation code, source-reference checks, Extended Data summaries and public documentation, allowing users to inspect what was standardized, what behavior was checked and how each component enters training, evaluation, inference or deployment. By shifting reuse from repository-specific checkpoints to executable implementations connected to standardized checkpoints, curated datasets, Runner workflows and biological prediction pipelines, MultiMolecule provides common infrastructure for preserving source-defined model behavior, adapting models to new assays, enabling controlled evaluation and deploying biomolecular predictions.

[LG-160] Generative Modeling on Metric Graphs via Neural Optimal Transport

Link: https://arxiv.org/abs/2606.16273
Authors: Alessandro Micheli,Yueqi Cao,Anthea Monod,Samir Bhatt
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*Comments:

Click to view abstract

Abstract:We introduce, to our knowledge, the first deep generative modeling framework for probability distributions continuously supported on compact metric graphs. Given source and target measures on a metric graph, our method embeds the graph into a smooth ambient space, solves an entropic Kantorovich problem via a neural semidual parameterization, and projects generated samples back onto the original graph. We study two embedded geometries: an extrinsic Euclidean realization and the intrinsic tropical Abel–Jacobi embedding into the Jacobian torus. In both cases, the resulting generator is graph-supported by construction. We prove that, in the joint limit of increasing neural expressivity, the learned generator converges weakly to a valid transport coupling between the original graph measures. Empirically, across a range of geometrically distinct graphs, our method matches or improves upon heuristic transport baselines based on discrete graph OT, while scaling more favorably. Finally, we demonstrate scalability on real-world urban mobility data by training our model on one million Uber pickup locations in Manhattan, New York City.

[LG-161] Closing the Approximation Gap in Simulation-free Latent SDEs

Link: https://arxiv.org/abs/2606.16138
Authors: Henry D. Smith,Brian L. Trippe,Scott W. Linderman
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Recovering dynamical systems from noisy observations is a recurring challenge across scientific domains, including neuroscience and physics. Latent stochastic differential equations (SDEs) address this by modeling the system as an unobserved state that evolves according to a learnable SDE and generates the observations. Variational inference (VI) provides a tractable objective for fitting latent SDEs. Traditional VI algorithms evaluate this objective by numerical simulation over a time discretization, trading fidelity for computational cost. A recent class of algorithms, simulation-free VI, sidesteps this tradeoff by parameterizing the posterior through its instantaneous marginals rather than its drift. In this work, we show that the efficiency of existing simulation-free VI algorithms comes at a price: their parameterizations restrict the approximate posterior to a subset of the SDEs available to simulation-based methods, degrading posterior inference and parameter learning. We propose Helmholtz-SDE, a simulation-free VI algorithm that closes this gap by optimizing over path laws compatible with a prescribed collection of marginals. Helmholtz-SDE recovers dynamics more faithfully than prior simulation-free methods, with the largest gains under high posterior uncertainty. It further matches the performance of simulation-based VI at a fraction of the runtime.

[LG-162] Enhancing Quantum Machine Learning with Anyons

Link: https://arxiv.org/abs/2606.16090
Authors: Da Zhang,Wen-Qiang Liu,Zhaohui Wei,Zhang-Qi Yin
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*Comments: 19 pages, 3 figures

Click to view abstract

Abstract:The power of quantum computing and quantum machine learning relies on harnessing uniquely quantum phenomena as computational resources. While superposition, coherence and entanglement have been central to this effort, the role of particle exchange statistics remains largely unexplored. Here, we introduce a quantum kernel framework that unifies bosonic, fermionic, and anyonic (fractional) exchange statistics within a single learning paradigm. We study this family of kernels from three perspectives. At the representation level, Haar-averaged effective-dimension analysis shows that fractional exchange phases access feature-space directions inaccessible to the purely symmetric or antisymmetric limits. At the level of kernel geometry, the corresponding Gram matrices show greater separation from the distinguishable-particle baseline and reduced label-dependent model complexity. Finally, on learning benchmarks, anyonic kernels consistently outperform their bosonic and fermionic counterparts, with stronger target alignment and more favorable class geometry. Together, these findings show that exchange statistics reshape the structure and geometry of quantum feature space, leading to enhanced learning performance. Our work identifies particle exchange statistics as an overlooked computational ingredient for quantum machine learning and provides the first systematic comparison of quantum learning models across exchange phases.

[LG-163] GPT-Based Fast Simulation of CLAS12 Detector Hits via Conditional Autoregressive Generation

Link: https://arxiv.org/abs/2606.16035
Authors: Cole Granger,James Giroux,Richard Tyson,Maurizio Ungaro,Cristiano Fanelli
Subjects: Instrumentation and Detectors (physics.ins-det); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Nuclear Experiment (nucl-ex); Data Analysis, Statistics and Probability (physics.data-an)
*Comments: 19 pages, 9 figures, 3 tables

Click to view abstract

Abstract:Modern particle physics experiments have demonstrated an increasing need for fast, high-fidelity detector simulation as detector components have improved and subsequent computational requirements approach the limits of available resources. Recently, deep generative models have emerged as a promising alternative to traditional Monte-Carlo methods, with recent works drawing inspiration from large language models (LLMs) and self-supervised next-token prediction methods. In this work, we present an application of a GPT-style autoregressive transformer as a fast surrogate model for the calorimeter inside the CLAS12 experiment at the Thomas Jefferson National Accelerator Facility. The model is conditioned on incident momentum and generates realistic detector hits autoregressively across all nine calorimeter layers as sequences of strip, ADC, and TDC tokens. We demonstrate that the model faithfully reproduces hit multiplicity, spatial distributions, energy deposits, and the energy-momentum response of the electromagnetic calorimeter. The generator achieves inference rates exceeding 700 events per second on a single GPU, providing a substantial speedup over traditional Geant4-based simulations while maintaining physics fidelity essential for high-luminosity experimental programs.

[LG-164] Machine learning enables roughness-driven inverse design of milling processes

Link: https://arxiv.org/abs/2606.16032
Authors: Hadi Bakhshan,Sima Farshbaf,Fernando Rastellini,Josep Maria Carbonell
Subjects: Other Condensed Matter (cond-mat.other); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Interest in applying data-driven approaches in manufacturing has grown significantly, particularly for mapping complex, high-dimensional relationships. The milling process is one area where predictive models can link influential parameters to surface roughness metrics prior to in situ operations. While this approach offers clear advantages, it faces challenges due to limited datasets and robustness issues in inverse design paradigms. To address these challenges, this paper proposes a machine learning (ML)-based framework for the inverse design of the surface milling process, with a focus on surface roughness as the design objective. The framework employs forward training of two ML models, a deep neural network (DNN) and a random forest (RF) ensemble, both developed using a high-fidelity synthetic dataset generated from a computational simulation framework. These trained models are integrated into a Bayesian optimization (BO) procedure to overcome the multiplicity problem arising from the many-to-one mapping inherent in the dataset. The approach identifies top-performing milling process configurations, considering both process and tool parameters, and presents them from the full solution space. The models achieve average relative errors below 5% when compared to reference results, thereby demonstrating the robustness and reliability of the proposed methodology.

[LG-165] The limits of interpretability in multiple linear regression

Link: https://arxiv.org/abs/2606.16013
Authors: Anand Sharma,Chen Liu,Daniele Coslovich,Misaki Ozawa
Subjects: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
*Comments: 23 pages, 8 figures

Click to view abstract

Abstract:Interpreting machine-learning models has attracted increasing attention, particularly in the physical sciences, where one often seeks to understand the underlying mechanisms rather than merely make predictions. Multiple linear regression is often regarded as an interpretable alternative to more complex models, such as deep neural networks, because its predictions are expressed as explicit weighted sums of input features. However, when input features are strongly correlated, namely in the presence of multicollinearity, the learned weights can exhibit large dataset-to-dataset fluctuations and oscillatory behavior across physically similar features, making their interpretation difficult or even impossible. Although the instability of the weights under multicollinearity is well known in statistics, its consequences for physical interpretation, in particular its connection to oscillatory weights across physically similar features, have not been systematically clarified. Here, we theoretically discuss the mechanism behind this loss of interpretability by analyzing the eigenmodes of the feature correlation matrix. We show that small-eigenvalue modes associated with multicollinearity amplify fluctuations in the weights and generate oscillatory patterns that do not necessarily reflect meaningful contributions. We test this theoretical picture numerically on physics datasets and show that Ridge regularization suppresses these unstable modes, although the resulting weights must still be interpreted with caution. We further confirm the generality of our findings beyond physics by analyzing a diverse collection of publicly available datasets. Our results clarify why, in the presence of multicollinearity, physical interpretation can remain difficult even for linear regression models.
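
The mechanism described in the abstract can be reproduced in a small synthetic experiment. The sketch below is an illustration under assumed settings (two nearly identical features, a noise level of 0.5, a ridge strength of 1.0, and hypothetical function names), not the authors' code or data: it refits a linear model on many freshly drawn datasets and compares how much the learned weights fluctuate with and without Ridge regularization.

```python
import numpy as np

def fit_weights(X, y, ridge=0.0):
    # Solve (X^T X + ridge * I) w = X^T y; ridge = 0 gives ordinary least squares.
    gram = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ y)

def weight_spread(ridge, trials=50, n=200, seed=0):
    """Standard deviation of each fitted weight across independently drawn datasets."""
    rng = np.random.default_rng(seed)
    weights = []
    for _ in range(trials):
        x1 = rng.standard_normal(n)
        x2 = x1 + 0.01 * rng.standard_normal(n)      # almost a copy of x1 (multicollinearity)
        X = np.column_stack([x1, x2])
        y = x1 + x2 + 0.5 * rng.standard_normal(n)   # true weights are (1, 1)
        weights.append(fit_weights(X, y, ridge))
    return np.std(weights, axis=0)

ols_spread = weight_spread(ridge=0.0)
ridge_spread = weight_spread(ridge=1.0)
print("OLS weight std:  ", ols_spread)
print("Ridge weight std:", ridge_spread)
```

In this toy setting the difference direction `x1 - x2` carries almost no variance, so it corresponds to a small eigenvalue of the feature correlation matrix, and the unregularized weights swing widely along it from dataset to dataset. The ridge term damps that mode, at the cost of a small bias.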

[LG-166] Learning the generating functional for variance reduction in lattice QCD

Link: https://arxiv.org/abs/2606.15986
Authors: Ryan Abbott,Yang Fu,Daniel C. Hackett,Gurtej Kanwar,Fernando Romero-López,Phiala E. Shanahan
Subjects: High Energy Physics - Lattice (hep-lat); Machine Learning (cs.LG)
*Comments: 8 pages, 3 figures

Click to view abstract

Abstract:The generating functional in quantum field theory provides the natural framework for constructing correlation functions as derivatives with respect to source operators. We present a methodology that leverages machine-learned normalizing flows to reduce the variance of arbitrary N-point correlation functions of bosonic operators in lattice gauge field theory calculations by encoding a representation of the generating functional. We show that it is possible to systematically approach noiseless estimators of correlation functions in this framework. We demonstrate this methodology with applications to calculations of glueball correlation functions and Wilson loops in Quantum Chromodynamics and Yang-Mills theory. The results show up to three orders of magnitude variance reduction.

[LG-167] Learning ground state observables from quantum computing experiments

Link: https://arxiv.org/abs/2606.15983
Authors: Ben Jaderberg,Freya Shah,Minjun Jeon,M. Emre Sahin,Christa Zoufal,Kunal Sharma
Subjects: Quantum Physics (quant-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*Comments: 20 pages, 14 figures

Click to view abstract

Abstract:Recent theoretical progress has established conditions under which machine learning models can efficiently predict ground-state properties of gapped local Hamiltonians when trained on quantum-generated data. Previous experimental demonstrations in this paradigm, however, have largely been limited to small systems or highly structured states, due to the difficulty of preparing many-body ground states on quantum processors. In this work, we demonstrate learning from experimental quantum data generated from approximate ground states of the two-dimensional Heisenberg XXZ model with system sizes up to 115 qubits. We construct a dataset of single-site expectation values, two-point correlations, and 12-body loop correlations across the antiferromagnetic phase. We then train neural networks on this data and show that they can accurately predict spatially resolved observables for previously unseen Hamiltonian parameters, both within the training distribution and in an out-of-distribution regime approaching the phase boundary. Our results demonstrate the practical realization of learning from quantum data for an interacting two-dimensional many-body system at scale, motivating a path toward regimes where quantum processors could provide training data beyond the reach of classical approximation methods.

[LG-168] PromptShift-CRC: Drift-Aware Conformal Risk Control for Foundation Models Under Prompt and Domain Shift

Link: https://arxiv.org/abs/2606.15964
Authors: Jeffery Opoku, David Banahene
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Foundation models are now used in settings where the prompts they receive can change quickly. Users change, topics change, policies change, and the model may suddenly face a kind of request that was rare in the calibration data. This makes fixed calibration risky. Conformal prediction and conformal risk control give model-agnostic ways to control error, but they work best when the calibration data still look like the future data. This paper develops PromptShift-CRC, a drift-aware conformal risk control method for foundation-model outputs under prompt and domain shift. The method embeds prompts and responses, measures how far the current prompt stream has moved from the calibration pool, gives more weight to relevant or recent calibration examples, and updates the risk level online after observed violations. It reports three practical diagnostics: realized risk error, prompt drift, and effective calibration size. We give conditions under which the method controls risk up to terms for distribution mismatch and weighted quantile uncertainty. In a synthetic prompt-shift benchmark, static conformal risk control fails sharply after drift, while PromptShift-CRC gives the best coverage among the adaptive baselines considered. We then evaluate the same calibration layer on public benchmark-derived streams for question answering, toxicity, summarization factuality, and long-context hallucination risk.

[LG-169] p-PSO: A Penalized Particle Swarm Optimization Technique for Finding D-Optimal Designs with Mixed Factors in Generalized Linear Models

Link: https://arxiv.org/abs/2606.15962
Authors: Shrabanti Chowdhury, Abhyuday Mandal
Categories: Methodology (stat.ME); Machine Learning (cs.LG)

Abstract:Finding D-optimal designs for generalized linear models (GLMs) is challenging due to the dependence of the Fisher information matrix on unknown parameters and the lack of closed-form solutions, particularly when input factors include both discrete and continuous variables. Although classical algorithms and recent metaheuristic approaches have offered partial solutions, there remains a need for robust and computationally efficient methods. In this paper, we propose a penalized Particle Swarm Optimization (PSO) approach, named p-PSO. Here we introduce a new, general-purpose penalty formulation for constrained optimization and demonstrate its effectiveness in optimal design problems. The formulation is algorithm-agnostic and applicable to a broad class of black-box optimization methods. Results show that the method is highly efficient, with its primary contribution being a penalty formulation that enables the direct use of an off-the-shelf PSO algorithm and extends naturally to more general constrained optimization tasks.

[LG-170] Spectral Adaptive Conformal Prediction for Structured Non-Exchangeable Data

Link: https://arxiv.org/abs/2606.15950
Authors: Jeffery Opoku, David Banahene
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 35 pages, includes figures and references

Abstract:Conformal prediction gives prediction intervals with finite-sample coverage when the data are exchangeable. Many time-indexed datasets are not exchangeable. They have seasons, recurring regimes, changing frequencies, or other forms of structured dependence. This paper studies a simple way to use that structure. We propose spectral adaptive conformal prediction, a method that forms weighted conformal quantiles using local spectral similarity and then updates the target miscoverage level online. The spectral weights choose calibration residuals that look relevant to the current test point. The adaptive update corrects the long-run miss rate when uncertainty changes over time. We give an approximate coverage result for the fixed spectral weighted quantile and a deterministic long-run calibration result for the adaptive update. Simulations with recurring regimes and slowly changing frequencies, together with three U.S. real-data examples, show that the hybrid method can improve on fixed spectral weighting, while also showing that spectral weighting must be monitored through effective sample size diagnostics.
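
The two ingredients named in this abstract, a weighted conformal quantile and an online update of the miscoverage level, can be sketched in a few lines. This is a minimal illustration rather than the authors' method. The weights below are hypothetical placeholders standing in for spectral similarity scores, the function names are made up for this example, and the update rule is the standard adaptive conformal inference step, which the abstract does not confirm is the exact rule used.

```python
import math


def weighted_quantile(scores, weights, level):
    """Smallest score whose normalized cumulative weight reaches `level`.

    Weights are normalized over the calibration points only (a simplification).
    Returns +inf when `level` cannot be reached, e.g. when level > 1.
    """
    total = sum(weights)
    cumulative = 0.0
    for score, weight in sorted(zip(scores, weights)):
        cumulative += weight / total
        if cumulative >= level - 1e-12:  # small tolerance for float round-off
            return score
    return math.inf


def update_alpha(alpha_t, target_alpha, missed, step):
    """Adaptive update: a miss lowers alpha (wider intervals), a hit raises it."""
    return alpha_t + step * (target_alpha - (1.0 if missed else 0.0))


# Hypothetical calibration residuals and similarity weights (assumed values).
residuals = [0.2, 0.5, 0.1, 0.9, 0.4]
similarity = [1.0, 0.2, 1.0, 0.1, 0.8]

alpha = 0.1
half_width = weighted_quantile(residuals, similarity, 1.0 - alpha)
alpha_after_miss = update_alpha(alpha, 0.1, missed=True, step=0.05)
```

The prediction interval for a point forecast `y_hat` would then be `[y_hat - half_width, y_hat + half_width]`. Because low-weight residuals contribute little cumulative mass, a few highly similar calibration points can dominate the quantile, which is one reason the abstract stresses monitoring the effective sample size.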

[LG-171] Biarchetype analysis for univariate functional data. An application to macroeconomic financial time series

Link: https://arxiv.org/abs/2606.15881
Authors: Aleix Alcacer, Rafael Benitez, Vicente J. Bolos, Irene Epifanio
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Applications (stat.AP)
Comments: 6 pages, 2 figures. To be published in the proceedings of SIS-FENStatS 2026, Sapienza University of Rome, Italy, June 22-25, 2026

Abstract:We introduce biarchetype analysis for the first time in the context of univariate functional data. This unsupervised methodology extends archetype analysis by simultaneously identifying archetypal structures across both the cases (countries, in our application) and the temporal argument. Both cases and time points are expressed as mixtures of biarchetypes, yielding a concise and highly interpretable representation of complex functional observations. Although biarchetype analysis is not intended as a clustering technique, it offers superior interpretability compared with biclustering approaches, as it is based on extreme, representative patterns rather than average centroids, thereby enhancing human comprehension. We apply the proposed method to 10-year government bond yields of European countries over the period 2001-2025. The results identify three distinct time regimes (the pre-crisis period, the euro-area sovereign debt crisis, and the post-crisis period), and reveal Germany, Greece, and Hungary as country archetypes.

[LG-172] Amortized mean-shift interacting particles

Link: https://arxiv.org/abs/2606.15871
Authors: Ali Siahkoohi
Categories: Computation (stat.CO); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Bayesian inference for inverse problems is run to evaluate integrals – posterior expectations, tail probabilities, and risks – across a stream of observations. The standard estimate averages the integrand over posterior samples, a Monte-Carlo average whose error decays only as the square root of the sample size, so accuracy demands many samples – prohibitive when each one calls a partial-differential-equation forward model. Mean-shift interacting particles need far fewer: they return a small set of signed-weight nodes – a deterministic quadrature whose weighted averages estimate those integrals. Finding the nodes, however, is a per-observation optimization that, in its most accurate form, reads the posterior score at every step – returning the cost it meant to save. We introduce amortized mean-shift interacting particles, a learned map that emits the weighted nodes from an observation and a few posterior samples in a single forward pass. Training asks only for joint parameter-observation samples and a posterior to draw from – a conditional normalizing flow, an empirical conditional, or any reference the user can sample – and the map learns to integrate that posterior from samples alone, evaluating neither its density nor its score. Once trained, it generalizes to unseen observations and integrands at any node budget and improves on independent samples in two ways: by reweighting them, provably no worse than the equal weights of Monte-Carlo; and by moving them, which empirically lowers it further. Across closed-form, sampled, learned, and physics-based posteriors – up to a thousand-coefficient groundwater field – it integrates more accurately than the same number of samples at every budget, and a posterior-whitened, dimension-aware kernel removes the high-dimensional wall. The result is a Pareto improvement on Monte-Carlo integration, not a competitor to drawing more samples.

[LG-173] Early Anomaly-Onset Detection based on Wigner–Ville Distribution Slice Spectra: A Transmission-Grid Test Case

Link: https://arxiv.org/abs/2606.15856
Authors: Eduardo Jr Piedad, Eduardo Prieto-Araujo, Oriol Gomis-Bellmunt
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG); Spectral Theory (math.SP)
Comments: 7 pages, 3 figures, 4 tables

Abstract:Operational disturbance monitoring in power networks requires decisions to be made from waveform windows as they arrive, rather than from completed records after the event. This study evaluates full-vector Wigner–Ville Distribution Slice (WVDS) spectra for sequential anomaly-onset detection in high-voltage grid-voltage waveforms. The approach keeps the bilinear midpoint interaction structure of the Wigner–Ville distribution and represents each 128-sample voltage window by a 128-dimensional slice spectrum, avoiding manually selected fault-frequency markers. WVDS is used with a baseline-normalized deviation (BND) score and is compared against the BND of Fast Fourier Transform (FFT-BND), raw-window autoencoders, FFT autoencoders, and WVDS autoencoders under the same thresholding and three-window persistence rule. A synthetic autoencoder–clustering teacher is used to select RTE fault records that start from an initially normal region and then transition to anomalous behavior. On the filtered test set, FFT-BND achieves the highest sensitivity, whereas WVDS-BND provides the lowest false-alarm operating point, reducing record-level pre-onset false alarms to 0.69%. The autoencoder comparison follows the same selectivity pattern: WVDS reconstruction decreases false alarms relative to FFT reconstruction but misses more examples. The results indicate that preserved WVD cross-term information can form a selective representation for online grid-waveform anomaly monitoring when false alarms are costly.

[LG-174] Schattor: Schatten-family methods for deep learning optimization

Link: https://arxiv.org/abs/2606.15702
Authors: Bohao Ma, Junyu Zhang, Chuan He
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: 32 pages

Abstract:Modern deep learning optimization features heterogeneous parameter structures, noisy gradients, and highly nonconvex landscapes, posing significant challenges for both algorithm design and theoretical analysis. Motivated by the limitations of SGD and the success of adaptive optimizers, we propose Schattor, a family of adaptive first-order methods based on Schatten norms. Schattor unifies SGD and the recently proposed matrix-variate adaptive optimizer Muon within a single Schatten-norm-based framework. We establish dimension-free stationarity guarantees for methods in the Schattor family for stochastic matrix optimization problems via a novel matrix martingale moment bound. We also develop multi-block extensions that adaptively balance block-wise optimization progress and prove dimension-free stationarity guarantees in this more general setting.

[LG-175] Stochastic trace estimation with tensor train random vectors

Link: https://arxiv.org/abs/2606.15679
Authors: Zvonimir Bujanović, Daniel Kressner, Hrvoje Olić
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)

Abstract:Stochastic trace estimation is a standard tool for approximating the trace of a large-scale matrix available only through matrix-vector products. However, in tensor-structured settings, unstructured Gaussian or Rademacher test vectors may be prohibitively expensive to store and compute with, while cheaper rank-one tensor-product vectors can require sample complexities that grow exponentially with the tensor order. This work studies Gaussian random tensor train vectors as a structured alternative for stochastic trace estimation. We show that, with a suitable choice of the tensor train rank, random tensor train vectors recover dimension-independent guarantees for the Girard–Hutchinson estimator. In particular, a median-of-means variant with tensor train rank $r \geq d-1$ achieves the same dependence on the accuracy $\varepsilon$ and failure probability $\delta$ as the classical estimator based on unstructured Gaussian vectors. We further prove an oblivious subspace injection result for sketches formed from independent Gaussian random tensor train vectors: tensor train rank $r \geq d-1$ and $\mathcal{O}(\varepsilon^{-2}(k+\log(1/\delta)))$ samples suffice for a $k$-dimensional target subspace. Finally, we investigate the use of such sketches within the Nyström++ framework. We show that the resulting estimator can achieve the desired $\mathcal{O}(\varepsilon^{-1})$ sample complexity under an additional spectral-tail condition. These results clarify both the potential and the limitations of random tensor train vectors in stochastic trace estimation.
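
For readers unfamiliar with the baseline this abstract builds on, the sketch below shows the classical Girard–Hutchinson estimator and a median-of-means variant. It deliberately uses plain unstructured Rademacher probe vectors, not the Gaussian tensor train vectors studied in the paper. The small matrix `A` and all function names are assumptions made for illustration.

```python
import random
import statistics


def hutchinson_trace(matvec, n, num_samples, rng):
    """Average z^T (A z) over random +/-1 probes z; matvec(z) must return A z."""
    total = 0.0
    for _ in range(num_samples):
        z = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        az = matvec(z)
        total += sum(zi * ai for zi, ai in zip(z, az))
    return total / num_samples


def median_of_means_trace(matvec, n, num_groups, group_size, rng):
    """Median of independent group averages, which is more robust to outliers."""
    means = [hutchinson_trace(matvec, n, group_size, rng) for _ in range(num_groups)]
    return statistics.median(means)


def make_matvec(a):
    """Wrap a dense list-of-lists matrix as a matrix-vector product oracle."""
    def matvec(z):
        return [sum(aij * zj for aij, zj in zip(row, z)) for row in a]
    return matvec


# Small symmetric example matrix (assumed for illustration); its trace is 6.0.
A = [[2.0, 0.5, 0.0],
     [0.5, 3.0, 0.1],
     [0.0, 0.1, 1.0]]

rng = random.Random(0)
estimate = median_of_means_trace(make_matvec(A), 3, num_groups=9, group_size=200, rng=rng)
```

The estimator only touches the matrix through `matvec`, which is the access model the abstract assumes. In the tensor-structured setting the probe `z` itself becomes too large to store densely, which is the motivation for replacing it with a structured random vector.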

[LG-176] Information Gap and Feasibility-Aware Inference in Binomial Logistic Mixtures

Link: https://arxiv.org/abs/2606.15665
Authors: Yuta Hayashida, Shonosuke Sugasawa
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 33 pages (main) + 30 pages (supplement)

Abstract:This paper studies the information gap between mixture detection and label recovery in binomial logistic mixtures. Standard likelihood-based criteria such as the Bayesian information criterion (BIC) can detect the presence of two components, but this does not guarantee that the corresponding labels are recoverable. We show that this gap is intrinsic to binomial logistic mixtures with a fixed number of trials: observed-data evidence for mixture structure and per-observation information for label recovery have different local orders in the component separation, and only the former accumulates with the sample size. As a result, there exists a detectable-but-unrecoverable regime in which BIC selects two components while the posterior labels remain essentially uninformative. To address this issue, we propose two feasibility-aware inference procedures: a recoverability-aware BIC with a posterior-entropy penalty and an entropy-regularized estimator that mitigates the tendency of the maximum likelihood estimator to produce overly separated components and overly concentrated posterior responsibilities. Numerical experiments confirm the predicted gap and demonstrate that the proposed methods avoid misleading component selections and improve the calibration of posterior label probabilities.

[LG-177] Phase Transition in Convex Relaxations for Graph Alignment COLT

Link: https://arxiv.org/abs/2606.15581
Authors: Laurent Massoulié, Sushil Mahavir Varma, Louis Vassaux, Irène Waldspurger
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Spectral Theory (math.SP); Statistics Theory (math.ST)
Comments: Accepted for presentation at the Conference on Learning Theory (COLT) 2026

Abstract:We study the graph alignment problem for correlated Gaussian Orthogonal Ensemble (GOE) matrices, where the goal is to recover a hidden vertex permutation given two correlated symmetric Gaussian matrices $(A, B)$ with correlation $1/\sqrt{1+\sigma^2}$. While the maximum likelihood estimator is information-theoretically optimal, its computation, which reduces to a quadratic assignment problem, is intractable. Motivated by this, we analyze convex relaxations based on minimizing $\|AX - XB\|_F$ over the set of doubly stochastic matrices and the unit hypercube. We show that when the correlation parameter satisfies $\sigma = o(n^{-1/2}/\log^4 n)$, the solution of either relaxation $X^\star$ concentrates around the ground-truth permutation matrix $\Pi^\star$, i.e., $\|X^\star-\Pi^\star\|_F^2 = o(n)$, implying recovery of all but a vanishing fraction of vertices after simple post-processing. Combined with existing lower bounds, our results precisely characterize that $\|X^\star-\Pi^\star\|_F^2$ transitions from $o(n)$ for $\sigma = \tilde{o}(n^{-1/2})$ to $\Omega(n)$ for $\sigma = \tilde{\Omega}(n^{-1/2})$. In doing so, our analysis significantly tightens prior results and extends them beyond doubly stochastic relaxations.

[LG-178] Ricci-Filtration: Boosting Retrieval-Augmented Generation Reranker to Query-Answer Tasks by Discrete Ricci Flow

Link: https://arxiv.org/abs/2606.15482
Authors: Tian Qin, Wei-Min Huang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Ricci flow is a curvature-guided diffusion process that deforms space by shrinking regions of high positive curvature and expanding those with negative curvature. Similarly, discrete Ricci flow on weighted graphs modifies edge weights by shrinking edges with positive Ricci curvature and stretching those with negative Ricci curvature, effectively increasing the separation between clusters. Inspired by these two cornerstone works, we propose a geometry-based RAG reranker enhancement procedure called Ricci-Filtration. By modeling the input query and initial retrieved chunks as a network, where the input query and chunks serve as nodes and embedding-based pairwise relations define an initial graph, Ricci-Filtration leverages discrete curvature and Ricci flow to evaluate the structural importance of each chunk with respect to the user query. The system first filters the initial chunks based on their geometric curvature relative to the query; then, a reranker processes the remaining chunks to enhance generative performance. We theoretically prove that normalized discrete Ricci flow can detect community structures by identifying distinct asymptotic behaviors in edge weights. This supports the removal of "noisy" document chunks characterized by large weights and negative Ricci curvature relative to the query node. Extensive experiments confirm that Ricci-Filtration outperforms several baseline reranking methods in accuracy, precision, recall, and F1 scores. Furthermore, ablation studies demonstrate that Ricci-Filtration generally outperforms the baseline under various settings, highlighting the framework’s robustness across different architectures.

[LG-179] Structured Nonparametric Variational Inference for Dependent Latent Modeling

Link: https://arxiv.org/abs/2606.15458
Authors: Yuda Shao, Zhiling Gu, Shan Yu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Variational inference (VI) is a core engine of modern AI, enabling scalable approximate Bayesian learning and uncertainty-aware training of large probabilistic and generative models. In this paper, we propose Structured Nonparametric Variational Inference (SN-VI), a novel framework for modeling complex dependencies among latent variables in posterior approximation, leveraging multivariate spline techniques. Unlike traditional methods that rely on the mean-field assumption, SN-VI preserves intricate latent variable dependencies, providing a flexible and accurate approximation of posteriors with arbitrary shapes. We establish rigorous theoretical guarantees, including the derivation of the lower bound for the variational objective and proof of asymptotic consistency in posterior estimation. To facilitate practical implementation, we develop an algorithm that automatically identifies dependent latent variables and their underlying dependence structure, without requiring manual specification. Simulation studies validate the effectiveness of SN-VI in approximating posterior distributions with bounded support and complex dependencies. The proposed method has been successfully applied to high-dimensional structured data, including computer vision datasets and spatial transcriptomics. In these applications, SN-VI demonstrates improved generative model performance and effectively uncovers coupled biological signals through the learned dependency structure.

[LG-180] A Conservation Law for Equilibrium Propagation and Coupled Learning

Link: https://arxiv.org/abs/2606.15444
Authors: Joshua A. McGinnis, Adam G. Kline, Yoichiro Mori
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)

Abstract:In this paper we show that the physical learning methods known as coupled learning (CL) and equilibrium propagation (EP) conserve a mass-like quantity in the trainable parameters in the continuous-time, small-nudging limit. We prove that this conservation holds in a broad range of physically relevant settings. We then show that the conservation law constrains the training dynamics in a way that makes convergence reliable in important settings for linear circuits. We conclude by discussing some practical implications of this conservation law.

[LG-181] Coercivity and Local Convergence of Physical Learning in Linear Circuits

Link: https://arxiv.org/abs/2606.15443
Authors: Joshua A. McGinnis, Xinbo Li, Yoichiro Mori
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)

Abstract:Physical learning methods train physical networks to perform computational tasks using only local update rules, exploiting the physics of the system to handle the global transfer of information. We provide the first local convergence analysis of three such methods – Equilibrium Propagation (EP), Coupled Learning (CL), and a new method we call Adjoint Coupled Learning (AL) – for linear circuits, in the limit of small-nudging for both discrete and continuous time. EP and AL perform gradient descent on a natural loss function, while CL follows modified dynamics with an additional cubic correction. Assuming the existence of a solution, we identify a coercivity condition, expressed as a rank condition on a matrix built from the network’s incidence structure, under which the training loss decays exponentially and the parameters converge to the solution manifold. We show that coercivity can fail by exhibiting a kite circuit in which a symmetry causes the coercivity constant to degenerate on the solution manifold, but prove using Sard’s theorem that such degeneracies are non-generic: coercivity holds at every point of the solution manifold for almost every choice of desired output.

[LG-182] The Reverse Telescoping Coordinate System for Positive Definite Matrices: Geometry Computation and Generative Modeling

Link: https://arxiv.org/abs/2606.15442
Authors: Anindya Bhadra
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We design a new unconstrained coordinate system where a $p\times p$ symmetric positive definite (SPD) matrix $\Theta$ is represented by a reverse telescoping map $\Theta(x)=\mathrm{RT}(x)$, with $x=(v,d,r)\in\mathbb{R}\times\mathbb{R}^{p-1}\times\mathbb{R}^{p(p-1)/2}$, representing respectively the log volume or log determinant; and the shape, as encoded by log relative diagonal scales and partial covariances among the nodes. This construction results in important properties not available in other charts, e.g., matrix logarithm, such as Jacobian depending on only the log-determinant. A useful feature of our construction is that $x$ contains a lossless symbolic representation of both the matrix and its inverse. Many important computations involving a matrix and its inverse can be performed in $O(p^2)$ in the transformed domain, while it is the rendering of results in matrix forms (on demand) that must incur an $O(p^3)$ cost. Moreover, two unit-determinant matrices in the transformed domain can be joined by a straight line with pathwise unit determinant. For generative modeling, this allows designing a split volume-shape flow model trained by conditional flow matching for transporting the shape over the unit-determinant path, with a separate one-dimensional flow for transporting the volume or the determinant. The forbidding SPD constraint, tamed thus into a powerful guiding force, leads to the surprising insight that it is in some sense easier to design a volume-normalized shape flow for SPD compared to the unconstrained $\mathbb{R}^{p\times p}$, with no intrinsic notion of volume to aid normalization, unlike the determinant of SPD matrices. We apply our construction for up to $p=200$ in generative modeling of SPD matrices on a difficult synthetic bimodal target, and in generating brain connectivity networks by models trained on fMRI data; as well as in intrinsic diffusion on the SPD manifold.

[LG-183] Finite Resources False Discovery Rate Control in Structured Hypothesis Spaces

Link: https://arxiv.org/abs/2606.15393
Authors: Binyamin Perets, Shie Mannor
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

Abstract:Scientific discovery relies on large-scale hypothesis testing. However, the capacity to identify true discoveries while controlling false discovery faces major challenges: obtaining relevant reference data (the null distribution) is resource-intensive, leaving finite-data uncertainty, and the procedure should account for the inherent structure in the hypothesis space, when such structure exists. Here, we present a framework for controlling the false discovery rate both when each hypothesis is evidenced only by a finite count of null draws, leaving its p-value uncertain, and when the hypothesis space carries arbitrary structure, requiring only that the structure be represented through a suitable reproducing kernel. We present two decision rules that are both robust to structural mis-specification, yet offer a distinct trade-off between exact FDR control and statistical power. The first rule guarantees exact FDR control; the second maximizes power by adapting mirror-statistic control into count space, utilizing an analytical framework to assess FDR control when exact mirror symmetry is relaxed. Furthermore, the tractability gained by the RKHS framework allows us to directly investigate finite-data uncertainties, which we leverage to suggest a policy for the efficient allocation of null distribution samples.

[LG-184] ShipNet: A Geometric Deep Learning Surrogate for Real-Time Ship Hydrodynamics

Link: https://arxiv.org/abs/2606.15356
Authors: Kirsten Odendaal, George Drakoulas
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)

Abstract:Accurate prediction of hydrodynamic performance is central to ship design, yet high-fidelity computational fluid dynamics remains prohibitively expensive for large-scale parametric exploration. This motivates the development of data-driven surrogate models that provide rapid approximations to hydrodynamic predictions at substantially reduced cost. We present ShipNet, a geometric deep-learning surrogate that predicts both hull-surface pressure distributions and far-field free-surface wave patterns directly from hull geometry and speed. The network employs a regularized dynamic graph convolutional backbone on hull point clouds, with a multi-head decoder for simultaneous near-body pressure and free-surface elevation outputs. Training data consist of 420 inviscid free-surface simulations generated using a potential-flow panel method for two parent yacht hulls, each parameterized into 70 variants and evaluated at three speeds. ShipNet predicts a per-point pressure coefficient and a two-dimensional wave elevation map using a composite loss that combines point-wise regression and image-structure terms. On a geometry-held-out test set, ShipNet achieves R^2=0.98 for hull pressure and R^2=0.91 for wave fields. Inference requires approximately 0.15s per case, yielding over a 550x speedup relative to the potential-flow solver on conventional hardware. Limitations include the restricted geometry and speed ranges and the inviscid training data, while future work will extend the model to high-fidelity viscous simulations with physics-informed regularization.

[LG-185] Generative modelling powered by room-temperature polariton condensates

Link: https://arxiv.org/abs/2606.15344
Authors: Yuan Wang, Marcin Muszynski, Avinash Dash, Rishabh Kaurav, Vinod M. Menon, Oleksandr Kyriienko
Categories: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Optics (physics.optics); Quantum Physics (quant-ph)
Comments: 9 pages and 4 figures in the main text; 17 pages SM; codes to be released

Abstract:Generative modelling requires efficient stochastic nonlinear transformations and physical platforms that can naturally realise them. We experimentally demonstrate that nonlinear optical systems operating in the strong light-matter coupling regime can serve as physical transformation layers for conditional generative modelling. Specifically, we develop a workflow in which room-temperature exciton-polariton condensates formed in organic dye microcavities act as a physical stochastic transform within a generative adversarial network and enable conditional digit-to-image translation. By using the nonlinear many-body dynamics and intrinsic stochasticity of polariton condensates, the workflow outperforms baseline approaches based on digitally injected perturbations. We find that polariton-enabled sampling via generative adversarial network (Polariton GAN) yields improved inception score, digit preservation accuracy and structural similarity compared with both digital sampling and laser-based systems. We further show that spatially correlated output variations can naturally regularise adversarial training and enhance output diversity. Our results establish polariton condensation as a new computational resource for generative modelling, opening a pathway towards physics-enhanced machine learning systems.

[LG-186] Dual-Network PINNs for Optimal Control: A Reproducible Benchmark on the Mass-Spring-Damper System

Link: https://arxiv.org/abs/2606.15271
Authors: Abdeladhim Tahimi, Rinaldo Vieira da Silva Junior
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: 22 pages, 6 figures. Reproducible benchmark study of dual-network Physics-Informed Neural Networks (PINNs) for optimal control of a mass-spring-damper system. Includes comparison with Pontryagin’s Minimum Principle and direct transcription methods and accompanying Google Colab implementation

Abstract:This work presents a transparent and reproducible benchmark study of a direct dual-network Physics-Informed Neural Network (PINN) formulation for the optimal control of a mass-spring-damper system. The classical linear-quadratic optimal control problem is solved by two independent classical methods – Pontryagin’s Minimum Principle with single shooting, and direct transcription through trapezoidal collocation – and recast as a constrained optimization problem solved by two feedforward neural networks: a state network whose boundary conditions are enforced exactly through a composite cubic-and-mask ansatz, and an unconstrained control network. The composite loss combines the physics residual at the collocation points with a trapezoidal approximation of the cost functional, weighted by a single scalar hyperparameter. On the benchmark considered, the PINN reproduces the classical optimal cost to four significant digits, satisfies the terminal state constraints exactly by construction, and produces pointwise state and control errors that fall within the spread of the two classical references. Training is approximately two orders of magnitude slower than classical shooting on this benchmark, which is honestly reported. The contribution is methodological clarity rather than methodological novelty: the formulation and the accompanying Google Colab implementation are intended to lower the barrier to entry for practitioners exploring PINN-based optimal control without prior exposure to adjoint methods or two-point boundary value problems.

[LG-187] Surrogate-Assisted Framework for SI-Compliant Interconnect Design Optimization Using the Earth Mover’s Distance

Link: https://arxiv.org/abs/2606.15234
Authors: Emre Ecik, Werner John, Julian Withöft, Ralf Brüning, Jürgen Götze
Categories: Signal Processing (eess.SP); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: 16 pages, 15 figures. This manuscript has been submitted to Advances in Radio Science for review (2026)

Abstract:This work presents a deterministic, machine-assisted framework for SI-compliant PCB design based on the Earth Mover’s Distance (EMD). In contrast to conventional surrogate-based optimization methods that rely on iterative black-box search procedures, the proposed approach follows an interpretable, sequential evaluation strategy. Neural surrogate models are first used to efficiently predict waveform describing features from topology-dependent design parameters. A decision tree then acts as a physically motivated quality gate that identifies SI-compliant waveforms according to predefined SI criteria. Within the resulting valid solution space, the Earth Mover’s Distance is employed as a similarity metric to rank candidate designs according to their proximity to an ideal reference signal. This enables not only the deterministic identification of admissible parameter regions but also a transparent prioritization of physically superior solutions without inverse modeling or stochastic search procedures. The methodology is demonstrated using a large-scale set of simulated DDR3 fly-by waveforms. By combining surrogate prediction, interpretable classification, and EMD-based waveform evaluation, the framework provides an explainable and computationally efficient alternative to conventional optimization strategies for supporting PCB development with AI-based methods.
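
The ranking step described in this abstract can be illustrated with the one-dimensional Earth Mover's Distance, which for two histograms on the same bins equals the summed absolute difference of their cumulative distributions. The sketch below is an assumption-laden illustration, not the paper's pipeline. The abstract does not say how waveforms are turned into distributions, so the non-negative histograms, the candidate names, and both function names are hypothetical.

```python
def emd_1d(p, q, bin_width=1.0):
    """1-D Earth Mover's Distance between two histograms on identical bins.

    Both inputs are normalized to unit mass, so only their shapes are compared.
    """
    if len(p) != len(q):
        raise ValueError("histograms must share the same bins")
    mass_p, mass_q = sum(p), sum(q)
    cumulative_gap = 0.0
    distance = 0.0
    for pi, qi in zip(p, q):
        cumulative_gap += pi / mass_p - qi / mass_q
        distance += abs(cumulative_gap) * bin_width
    return distance


def rank_by_emd(candidates, reference):
    """Return candidate names ordered from closest to farthest from the reference."""
    return sorted(candidates, key=lambda name: emd_1d(candidates[name], reference))


# Hypothetical reference and candidate histograms (assumed values).
reference = [0.0, 1.0, 0.0]
candidates = {
    "design_a": [0.0, 1.0, 0.0],
    "design_b": [1.0, 0.0, 0.0],
    "design_c": [0.5, 0.5, 0.0],
}
ranking = rank_by_emd(candidates, reference)
```

Unlike a bin-by-bin error, this distance grows with how far mass has to move, so a candidate that is slightly shifted from the reference scores better than one whose mass sits far away.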

[LG-188] Conformal Candidate Certification for Offline Model-Based Optimization ICML2026

Link: https://arxiv.org/abs/2606.15217
Authors: Seungjin Choi
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: ICML 2026 Workshop on Decision-Making from Offline Datasets to Online Adaptation: Black-Box Optimization to Reinforcement Learning

Abstract: Offline model-based optimization (MBO) proposes candidates by optimizing a surrogate trained on a fixed historical dataset. Because candidates are deliberately out-of-distribution, surrogate rankings are least reliable exactly where the optimizer is most aggressive, yet existing methods provide no per-candidate statistical certificate that a design meets a target threshold. We propose Conformal Candidate Certification (CCC), a post-hoc wrapper that attaches a calibrated one-sided lower bound to each candidate and advances only those whose bound exceeds the target. We show that entropy-regularized surrogate maximization induces a Gibbs-tilted proposal, so the same surrogate supplies importance weights for weighted conformal prediction without a separate density-ratio estimation step. In a controlled synthetic study, CCC certifies 16.7% of an aggressive proposal pool with empirical coverage 0.990 at nominal 0.90, while standard conformal prediction ignoring the covariate shift collapses to 0.416 coverage.
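
The certification step can be illustrated with a generic weighted split conformal lower bound. This is a sketch under stated assumptions rather than the paper's algorithm: the nonconformity score is taken to be the surrogate's overestimate (prediction minus true value), and the importance weight is assumed to be proportional to `exp(surrogate / temperature)`, which is one way to read "Gibbs-tilted". All function names and the numbers below are hypothetical.

```python
import math


def weighted_conformal_quantile(scores, weights, test_weight, alpha):
    """(1 - alpha) quantile of the weighted calibration scores.

    The test point contributes a point mass at +infinity, so the result is
    math.inf when the calibration mass alone cannot reach level 1 - alpha.
    """
    total = sum(weights) + test_weight
    threshold = (1.0 - alpha) * total
    cumulative = 0.0
    for score, weight in sorted(zip(scores, weights)):
        cumulative += weight
        if cumulative >= threshold:
            return score
    return math.inf


def certify(pred, test_weight, cal_preds, cal_targets, cal_weights, target, alpha=0.1):
    """Return (lower_bound, certified) for a single candidate."""
    # Score: how much the surrogate overestimated the true value.
    scores = [p - y for p, y in zip(cal_preds, cal_targets)]
    q = weighted_conformal_quantile(scores, cal_weights, test_weight, alpha)
    lower = pred - q
    return lower, lower >= target


def gibbs_weight(surrogate_value, temperature):
    """Assumed importance weight, proportional to exp(surrogate / temperature)."""
    return math.exp(surrogate_value / temperature)


# Illustrative calibration data (assumed): surrogate predictions and true values.
cal_preds = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
cal_targets = [0.75, 1.5, 2.25, 3.0, 3.75, 4.5, 5.25, 6.0, 6.75]
temperature = 5.0
cal_weights = [gibbs_weight(p, temperature) for p in cal_preds]

candidate_pred = 10.0
lower_bound, certified = certify(
    candidate_pred,
    gibbs_weight(candidate_pred, temperature),
    cal_preds,
    cal_targets,
    cal_weights,
    target=7.5,
    alpha=0.25,
)
```

Only relative weights matter, since they are normalized inside the quantile, so the unknown normalizing constant of the tilted proposal drops out. A candidate whose weight dominates the calibration mass receives an infinite quantile and is never certified, which is the conservative behavior one would want for aggressive out-of-distribution proposals.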

[LG-189] Quantum-classical hybrid models based on error correction for time series forecasting

Link: https://arxiv.org/abs/2606.15213
Authors: Jonathan H. A. de Carvalho, Filipe C. de L. Duarte, Fernando M. de Paula Neto, Paulo S. G. de Mattos Neto
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: Submitted to Nature Computational Science. 24 pages, 10 figures

Abstract:Time series forecasting largely benefits from combining the strengths of different models, especially using a scheme where a model corrects another model by capturing supplementary patterns from forecasting errors. Concurrently, quantum models are providing a means to augment the classical capacity, including in time series forecasting, by acting alongside classical models in hybrid architectures. In this work, we propose the first forecasting system based on error correction that jointly uses quantum and classical models. Here, quantum models first extract patterns by exploring quantum phenomena, and classical models capture the remaining patterns from the quantum errors. Compared to classical single models and classical-classical hybrid models based on error correction, the complementary capacity that emerges from this quantum-classical system provided the best results in most of the addressed problems. Therefore, this work paves the way to introduce quantum models in established hybridization schemes for time series forecasting.

[LG-190] Multiscale Hypersonic Boundary Layer Reconstruction via Spectral Binning and Subdomain-wise Conditional Diffusion

Link: https://arxiv.org/abs/2606.15023
Authors: Hojin Kim, Dibyajyoti Chakraborty, Takahiko Toki, Carlo Scalo, Romit Maulik
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
Comments: 33 pages, 28 figures

Abstract: We propose a multiscale probabilistic reconstruction framework for hypersonic Couette flow, where near-wall states are inferred from limited top-wall observations using a conditional diffusion model. The boundary layer is divided into overlapping wall-normal subdomains, and a single height- and Mach-conditioned Elucidating Diffusion Model (EDM) is trained jointly for M=6,7,8 to sample velocity, density, pressure, and temperature fields conditioned on a top-wall boundary slice. A soft overlap inpainting strategy assembles subdomain predictions into full-volume reconstructions while maintaining inter-subdomain continuity and small-scale variability. To improve the spectral fidelity of the generated fields, we introduce a novel bounded binned spectral power (BSP) loss that preserves high-wavenumber content while remaining numerically stable across the diffusion noise schedule. Validation against direct numerical simulation data shows that the model recovers instantaneous structures, spectra, statistical profiles, correlations, and wall quantities across all training Mach numbers, while providing spatially structured uncertainty estimates. The reconstructed Mach-conditioned profiles also collapse under the Trettel-Larsson transformation, indicating consistency with compressibility scaling. These results establish the domain decomposed conditional diffusion model with a bounded binned spectral loss as an effective probabilistic surrogate for near-wall reconstruction in hypersonic wall-bounded turbulence.

[LG-191] Distilling latent electrostatics from foundation machine learning interatomic potentials

Link: https://arxiv.org/abs/2606.15001
Authors: Xiaoyu Wang, Bingqing Cheng
Categories: Computational Physics (physics.comp-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)

Abstract: Foundation machine learning interatomic potentials (MLIPs) have enabled atomistic simulations across broad regions of chemical and materials space, but many remain computationally expensive and lack explicit electrostatics, limiting their use for systems governed by long-range interactions and electrical response. Previously, we introduced Latent Ewald Summation (LES), which learns latent atomic charges and long-range electrostatics from density functional theory (DFT) energy and force labels alone. Here, we use LES to extract electrostatics that are latent in foundation models: energies and forces predicted by a teacher model are used to train a lightweight LES-augmented student MLIP, with optional fine-tuning on additional DFT data. The resulting models reduce computational cost while providing access to Born effective charge tensors and infrared spectra. We benchmark student models distilled from a broad set of foundation MLIPs, including UMA, MACE, Orb, eSEN, GemNet-OC, PET, and EquiformerV2-based models, against experimental infrared spectra for liquid water, concentrated hydrochloric acid, and the anatase TiO2(101)-water interface. Across these systems, electrostatic response can be extracted from most foundation MLIPs. The benchmark further shows that the underlying DFT level and dataset used to train the teacher model play a larger role than architecture in determining electrostatic and spectroscopic accuracy. For the TiO2-water interface, fine-tuning with a modest amount of higher-level DFT data improves structural and infrared predictions. LES-based distillation therefore provides a practical route for converting foundation MLIPs into efficient, electrically responsive models, while also testing the physical fidelity encoded in foundation models.

[LG-192] Identification and Inference for Algorithmic Frontiers with Selective Labels

Link: https://arxiv.org/abs/2606.14977
Authors: Yiqi Liu, Francesca Molinari, Amilcar Velez
Categories: Econometrics (econ.EM); Machine Learning (cs.LG)
Comments: 68 pages, 2 figures

Abstract:This paper provides identification results to characterize a fairness-accuracy (FA) frontier, and statistical inference tools to test hypotheses and build a confidence set for the FA-frontier, when outcomes are observed only for selected individuals. When the selection process is unrestricted but loss is measured in specific ways, we provide a characterization of the sharp identification region of the FA-frontier. Under an assumption of unconfoundedness conditional on observables (and unrestricted loss functions), we obtain point identification and propose a debiased machine learning estimator, derive its asymptotic distribution, and show how this can be used to carry out inference for the FA-frontier. In work in progress, we extend the partial identification results to a broader class of loss functions.

[LG-193] Representation Costs in Data Science: Foundations and the Quasi-Banach Spaces of Deep Neural Networks

Link: https://arxiv.org/abs/2606.14954
Authors: Greg Ongie, Rahul Parhi
Categories: Functional Analysis (math.FA); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

Abstract: We develop a general framework for analyzing representation costs of parametric data-fitting methods through their parameter-space regularizers. From this abstract perspective, we define representation costs for arbitrary parametric models and reveal their induced (native) function spaces. This unifies recent function-space views of data-fitting methods. We also prove that many natural results hold in this abstract setting, including representer theorems for parametric methods on their native spaces. The framework also rigorously connects parametric methods with their equivalent nonparametric descriptions under sufficient overparameterization. Classical methods and their native spaces, such as kernel methods / reproducing kernel Hilbert spaces, wavelets / Besov spaces, and shallow neural networks / variation spaces emerge as special cases of our abstract framework. A byproduct of “axiomatizing” the study of representation costs is that we also immediately obtain new results for deep neural networks: For depth-L feedforward ReLU networks, their induced native spaces are p-normable quasi-Banach spaces with p = 2/L. This reveals that the inductive bias of deep neural networks (as given by the representation cost) cannot be captured by norms for depths L > 2.

[LG-194] Audited Conformal Prediction for Classification under Unknown Distribution Shift

Link: https://arxiv.org/abs/2606.14909
Authors: Yanfei Zhou, Rizal Fathony, Nam H. Nguyen, Matteo Sesia
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We consider the problem of uncertainty quantification for a pretrained classification model deployed under unknown distribution shift. We propose Audited Conformal Prediction (ACP), a method that leverages a small labeled dataset from the target population to train an auxiliary audit model identifying inputs where the legacy model is likely to fail. By integrating the audit model’s outputs into the conformal prediction framework, ACP produces prediction sets that guarantee marginal coverage while achieving substantially higher conditional coverage in practice than existing approaches. We develop and analyze two complementary integration strategies – one targeting marginal coverage with improved conditional performance, the other providing explicit group-conditional coverage guarantees – and establish theoretical guarantees for both. Experiments on synthetic and real-world datasets validate the method and illustrate trade-offs between prediction set size and conditional coverage.

[LG-195] Peak-Based Nuclide Identification in HPGe γ-Spectrometry with Machine Learning and SHAP

Link: https://arxiv.org/abs/2606.14874
Authors: Samuel Emmons, Kelly Truax, Maurice Lonsway, Bruce Pierson, Brian Archambault
Categories: Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (cs.LG); Nuclear Experiment (nucl-ex)
Comments: 25 pages, 11 figures (plus an additional 6 figures in the appendix), and 3 tables. To be published in Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment

Abstract:High-purity germanium gamma spectra often require time-consuming analyses from subject matter experts. Photopeaks within these spectra are carefully fitted and numerical methods are employed to assist with nuclide identification (NID) and quantification. Amending the list of nuclides identified by analysis software can be nontrivial. When many samples need to be analyzed, it is therefore challenging to make timely and correct decisions. Supervised machine-learning-based NID can serve as an expert-informed, automated tool to improve the initial set of radionuclides suggested to an analyst and more effectively drive subsequent quantification. To that end, we implemented machine learning models that map photopeaks carefully fitted by analysts to NID results for experimental spectra containing various isotopic combinations drawn from a set of 65 isotopes. The best model achieved an F1 score of 0.97, markedly surpassing the F1 score of 0.84 achieved by traditional software when compared using a nuclide library comprising the same 65 isotopes assessed by the models. Finally, we illustrated the most important input features for model predictions using Shapley Additive Explanations. These explanations revealed that the models use physically relevant photopeaks when making predictions for the isotopes in our nuclide library.

[LG-196] Pre-Training for Simulation-Based Science: A Study on Jet Foundation Model Training Objectives

Link: https://arxiv.org/abs/2606.14870
Authors: Ibrahim Elsharkawy, Joschka Birk, Vinicius Mikuni, Wahid Bhimji, Gregor Kasieczka, Benjamin Nachman
Categories: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG)

Abstract:Foundation models (FMs) trained on large datasets and fine-tuned on downstream tasks have emerged as a powerful paradigm in AI for science. Industrial FMs are typically trained using self-supervision with masking due to the lack of labels. In many scientific domains, accurate simulations are plentiful and facilitate large, labeled datasets. This opens up new possibilities for pre-training. We present a systematic comparison of pre-training methods using the OmniLearned High Energy Physics FM framework. We test supervised classification, flow-matching generation, and self-supervised masked particle modeling. All models are pre-trained on the JetClass dataset and fine-tuned on two representative downstream tasks, top jet classification and JetNet conditional generation. Among other observations, for classification tasks, we find that pure classifier pre-training is optimal when downstream labels and model capacity are plentiful, but combining it with self-supervised masked particle modeling (MPM) is uniquely powerful in the low-finetuning label regime. Flow matching-based generative pre-training seems to provide little benefit for downstream classification, and interestingly, for downstream generation, we find that flow matching must be in the pre-training objective to see a significant finetuning advantage, hinting at the orthogonality of classification and generation tasks. That is, for a model to transfer to both generative and classification downstream tasks, it must be pre-trained on both. This study provides a template for controlled scaling analysis of pre-training objectives for foundation models in simulation-based sciences.

[LG-197] Bridging data-driven priors via the score function for posterior sampling – Comparative review and experimental study

Link: https://arxiv.org/abs/2606.14800
Authors: Elhadji Cisse Faye, Mame Diarra Fall, Sylvain Delchini, Nicolas Dobigeon
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Machine Learning (stat.ML)

Abstract: This paper reviews how a diverse set of popular data-driven priors commonly used in Bayesian inverse problems can be unified through their respective score functions. By framing these priors under this common perspective, we show that they can benefit from their straightforward and effective integration into a recently proposed sampling algorithm. The applicability of this common framework is illustrated by considering several data-driven priors, namely regularization-by-denoising, normalizing flow-based priors, score-based generative models, and convex-ridge regularizers. For these four particular priors, the performance of the method is evaluated when conducting image inpainting and single image super-resolution. These results, as well as those obtained when restoring real images acquired in a geological context, demonstrate the efficiency of the method. This unified framework proves versatile enough to handle any posterior distribution defined by a broad class of score function-based priors, beyond the specific cases considered in this paper.

[LG-198] From Physics to Representation: Audio Learning with Synthetic Pre-training via Procedural Generation ICMR2026

Link: https://arxiv.org/abs/2606.14791
Authors: Fengrui Liu, Ruiyang Huang, Qijian Zheng, Yuanfang Wang, Feng Liu
Categories: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Accepted to ACM ICMR 2026

Abstract:Self-supervised learning advances audio representation for multimedia analysis. However, prevailing data-centric approaches rely on massive real-world corpora, increasing training costs, curation burdens, and privacy barriers. To address this, we present AudioPG, a procedural synthesis framework eliminating real audio recordings during pre-training. AudioPG trains a Transformer-based masked autoencoder on waveforms generated on-the-fly from basic acoustic primitives and composition rules. The encoder transfers effectively to real audio benchmarks, achieving 90.60% accuracy on ESC-50, 0.546 mAP on FSD50K, 88.17% on UrbanSound8K, and 97.03% on Speech Commands V2. Notably, pre-training completes in under 20 minutes on a single GPU. Latent space analysis reveals physical factors, including fundamental frequency and relative intensity, emerge in orthogonal subspaces, making representations linearly decodable. These results establish procedural synthesis as an efficient, interpretable pre-training signal when large-scale corpora are unavailable. Our code is available at: this https URL.

[LG-199] Learning Topological Representations for Molecular Dynamics

Link: https://arxiv.org/abs/2606.14737
Authors: Dominik Geng, Florian Graf, Martin Uray, Roland Kwitt
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 20 pages, 4 figures

Abstract: Molecular dynamics (MD) simulations generate trajectories in a high-dimensional configuration space whose analysis critically depends on molecular descriptors, typically handcrafted observables or learned kinetic embeddings. Designing descriptors that are both expressive and broadly applicable, however, remains challenging. We study persistent homology (PH) as a general-purpose representation for MD and introduce the masked Flood complex, a protein-tailored modification of a recently introduced simplicial complex construction that emphasizes inter-residue structure at low computational cost. Vectorized persistence diagrams then provide information-rich, geometry-aware summaries of protein conformations, which we evaluate on protein class prediction, frame-level observable regression, and Markov state model (MSM) estimation from learned low-dimensional coordinates in a single shared representation space. Results on the mdCATH dataset show that PH-based descriptors are competitive across tasks, with masked Flood PH yielding the most consistent overall performance. Further, when using topologically-informed MSMs as a drop-in replacement within the recent MarS-FM framework for generative modeling of protein conformations, we obtain consistently better ensemble statistics than MSMs based on physical observables. Finally, we explore the transferability of the generative model to qualitatively different, fast-folding proteins.

[LG-200] Machine Learning-Driven Chemical Reactor Network Modeling of the Sandia-D Flame

Link: https://arxiv.org/abs/2606.14729
Authors: Nicolas J. Tricard, Benjamin C. Koenig, Sili Deng
Categories: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Comments: 12 pages, 11 figures

Abstract: Turbulent combustion simulations are crucial for many scientific and engineering systems. However, the high cost to fully resolve the complex multiscale and multiphysics behavior makes direct simulation typically infeasible. The equivalent reactor network (ERN) approach attempts to improve computational efficiency by replacing a multidimensional turbulent simulation with a series of much cheaper 0-D and 1-D chemical reactors, providing a surrogate model that retains detailed chemistry at the cost of simplified flow physics. However, their development remains a challenge, often requiring either expert analysis, or automated approaches that sacrifice accuracy. In this work, we develop an automated machine-learning-assisted framework for constructing ERNs of the Sandia-D turbulent methane/air flame. Principal component analysis is first used to reduce high-dimensional thermochemical computational fluid dynamics (CFD) data to a low-dimensional latent space, where k-means clustering identifies physically interpretable flame regions used to initialize a reactor-network graph. This initialization is then refined using finite-difference gradient descent wrapped around non-differentiable Cantera reactor simulations. Across 30 RANS simulations spanning a range of pilot temperatures and inlet methane compositions, the optimized 7-reactor ERN achieves a maximum-temperature R^2 score of 0.7945 while preserving a ~6000x speedup over the CFD solver. Outlet CO prediction remains more challenging, with a final R^2 score of -0.4183, but improves substantially from the unoptimized clustering initialization. These results show that unsupervised thermochemical feature extraction can provide effective physics-informed initializations for ERN construction, while gradient-based refinement can significantly improve predictive accuracy without manual reactor-network design.
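
The refinement step, finite-difference gradient descent around a non-differentiable simulator, can be sketched generically. This is an illustration under stated assumptions, not the authors' code: `black_box_loss` is a hypothetical stand-in for a simulator-based loss and does not call Cantera, and the step size, iteration count, and perturbation `h` are arbitrary choices.

```python
def finite_difference_gradient(f, x, h=1e-4):
    """Central-difference estimate of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        forward = list(x)
        forward[i] += h
        backward = list(x)
        backward[i] -= h
        grad.append((f(forward) - f(backward)) / (2.0 * h))
    return grad


def gradient_descent(f, x0, lr=0.1, steps=200, h=1e-4):
    """Plain gradient descent using finite-difference gradients only."""
    x = list(x0)
    for _ in range(steps):
        g = finite_difference_gradient(f, x, h)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


def black_box_loss(params):
    """Hypothetical stand-in for a simulator-based loss (NOT Cantera)."""
    a, b = params
    return (a - 1.5) ** 2 + 2.0 * (b + 0.5) ** 2


# Start from an arbitrary initialization and refine it.
optimum = gradient_descent(black_box_loss, [0.0, 0.0])
```

Each gradient estimate costs two evaluations of the loss per parameter, so this approach is practical mainly when the simulator is cheap and the parameter count is small, which is plausibly the case for a network of a few reactors.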

Attachments

Download today's full paper list
