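This post lists the latest papers retrieved from arXiv.org on 2026-06-17. It is updated automatically and organized into six broad areas: NLP, CV, ML, AI, IR, and MA.
Note: the daily paper data comes from arXiv.org and is refreshed automatically at around 12:30 each day.
Tip: if the list is not updated on a given day, either arXiv published no new papers that day or the script failed. Any failure will be fixed the same day where possible.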
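Contents
Overview (2026-06-17)
622 papers were updated today, including:
- Natural Language Processing: 85 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 214 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 112 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 156 papers (Machine Learning (cs.LG))
- Multiagent Systems: 13 papers (Multiagent Systems (cs.MA))
- Information Retrieval: 10 papers (Information Retrieval (cs.IR))
- Human-Computer Interaction: 27 papers (Human-Computer Interaction (cs.HC))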
Multiagent Systems
[MA-0] DRFLOW: A Deep Research Benchmark for Personalized Workflow Prediction
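[Quick Read]: This paper addresses a limitation of existing deep research (DR) systems on complex enterprise information-seeking tasks: current methods focus on generating reports or summaries and neglect the identification of workflows, which matters more in practice. Users often need not a summary but the executable sequence of actions required to complete a task, for example "How do I request new headcount given a fixed budget?". The paper therefore proposes DRFLOW, a benchmark for evaluating how well agents identify and generate personalized, actionable workflows from heterogeneous sources. The benchmark comprises 100 tasks across five domains, grounded in more than 3,900 sources with 1,246 reference workflow steps, and defines seven diagnostic metrics covering factual grounding, step recovery, structural ordering, condition resolution, and personalization. The paper also presents DRFLOW-Agent (DRFA), a reference agent for workflow prediction. Although DRFA improves average F1 by up to 10.02% over strong baselines, substantial room for improvement remains on every metric, which indicates that generating complete and correct personalized workflows is still an open challenge for deep research.
Link: https://arxiv.org/abs/2606.18191
Authors: Md Tawkat Islam Khondaker,Raymond Li,Muhammad Abdul-Mageed,Laks V. S. Lakshmanan,Issam H. Laradji
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: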
Abstract:Deep research (DR) systems are increasingly used for complex information-seeking tasks, but existing works mainly focus on generating reports and summaries. In contrast, many enterprise tasks instead require an agent to identify concrete workflows, which are sequences of action-steps. For example, rather than summarizing budgeting policies, an agent should be able to determine the steps needed to answer a question such as: “How do I request new headcount given a fixed budget?”. Therefore, we introduce DRFLOW, a benchmark for evaluating personalized workflows predicted by agents from heterogeneous sources. Each task requires the agent to identify relevant evidence from scattered sources, then use that evidence to predict the correct action-step sequence for the user’s task. DRFLOW contains 100 tasks across five domains, with 1,246 reference workflow steps grounded in more than 3,900 sources. We define seven diagnostic metrics covering factual grounding, step recovery, structural ordering, condition resolution, and personalization. We further present DRFLOW-Agent (DRFA), a workflow-oriented reference agent to predict personalized workflows. We show that although DRFA improves over strong baseline agents (up to 10.02% average F1 score), substantial room for improvement remains across these workflow metrics, indicating that predicting complete and correct personalized workflows remains a challenging frontier for deep research.
[MA-1] On the Reliability of Networks of AI Agents: Density Evolution, Stopping Sets, and Architecture Optimization
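[Quick Read]: This paper addresses the fact that modern AI systems built from multiple cooperating agents often perform well, yet it is rarely clear why they succeed or when they will fail. Such systems rely on several imperfect agents (proposing, verifying, communicating) that exchange messages to complete a task, and the uncertainty and heterogeneity of these interactions make the overall behavior hard to predict. The authors model the system as message passing on a sparse graph and extend the density-evolution framework of low-density parity-check (LDPC) codes to this richer setting. A task is modeled as a set of coupled binary subclaims, and an agent architecture as a sparse, role-typed factor graph whose check nodes are noisy Boolean verifiers, each computing a local Boolean function. Three distinct failure modes (an agent abstaining, a verifier returning no usable output, and a message being lost) are all modeled as erasures that propagate with the messages. The check nodes apply a single logical-forcing rule that specializes to XOR, AND, OR, implication, and Horn constraints. Because the verifier functions are nonlinear and value-asymmetric, and the three failure modes do not reduce to a single effective channel, new threshold, finite-length, and converse results are needed instead of a direct reuse of LDPC density evolution. The paper proves a density-evolution theorem that predicts the asymptotic fraction of unresolved subclaims on random role-typed architectures and extends it to deterministic, locally tree-like graph sequences. The XOR case recovers the classical LDPC recursion on the binary erasure channel, while the AND case exposes an asymmetry between positive and negative verifier certificates.
Link: https://arxiv.org/abs/2606.18121
Authors: Ehsan Aghazadeh,Hossein Pishro-Nik
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Information Theory (cs.IT)
Comments: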
Abstract:Modern AI systems increasingly solve a task not with a single model call but with several imperfect agents working together: some propose pieces of a solution, others verify them, and the results are combined. These systems often outperform any single model, yet it is rarely clear why they succeed or when they will fail. We model such a system as message passing on a sparse graph, the structure that underlies low-density parity-check (LDPC) codes, and extend the density-evolution machinery of coding theory to this richer setting. In our model a task is a set of coupled binary subclaims, and an agent architecture is a sparse, role-typed factor graph whose check nodes are noisy Boolean verifier nodes, each computing a local Boolean function of the subclaims it touches. Three distinct failure modes, all modeled as erasures (an agent abstaining, a verifier returning no usable output, and a message lost between two agents), propagate as the agents exchange set-valued messages. The check agents combine these messages by a single logical-forcing rule that specializes to XOR, AND, OR, implication, and Horn constraints. This is more than a relabeling of LDPC theory: the verifier functions are nonlinear and value-asymmetric, and the three failure modes do not reduce to a single effective channel, so they require new threshold, finite-length, and converse results rather than a direct reuse of parity-check density evolution. We prove a density-evolution theorem that predicts the asymptotic fraction of unresolved subclaims on random role-typed architectures, with an extension to deterministic, locally tree-like graph sequences. The XOR case recovers the classical LDPC recursion on the binary erasure channel (BEC); the AND case exposes an asymmetry between positive and negative verifier certificates.
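The XOR special case named in the abstract, the classical LDPC recursion on the BEC, can be illustrated with a short sketch. It assumes a regular (d_v, d_c) ensemble, for which the erased-message fraction evolves as x_{l+1} = eps * (1 - (1 - x_l)^(d_c - 1))^(d_v - 1). The function name, the (3, 6) degrees, and the iteration count are illustrative choices and are not taken from the paper.

```python
def bec_density_evolution(eps, dv=3, dc=6, iters=2000):
    """Iterate the regular-LDPC density-evolution recursion on the BEC.

    eps is the channel erasure probability; the return value is the
    erased-message fraction after `iters` rounds of message passing.
    """
    x = eps  # before any decoding, a message is erased with probability eps
    for _ in range(iters):
        # A check node resolves an edge only if its other dc-1 inputs are known;
        # a variable node stays erased only if the channel and all other
        # dv-1 incoming check messages are erased.
        x = eps * (1.0 - (1.0 - x) ** (dc - 1)) ** (dv - 1)
    return x


if __name__ == "__main__":
    for eps in (0.40, 0.45):
        print(eps, bec_density_evolution(eps))
```

For the (3, 6) ensemble the known BEC threshold is about 0.4294, so an erasure rate of 0.40 drives the recursion to zero while 0.45 leaves a positive residual fraction.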
[MA-2] Intelligence Entropy Principle and the ADE Stability Engineering Framework
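[Quick Read]: This paper addresses the nonlinear behavioral degradation that LLM-driven multi-agent systems (MAS) exhibit when they move from the lab to production. Its premise is that probability-driven systems spontaneously drift toward disorder. The paper proposes the Intelligence Entropy Principle, formalized as $ S(t) = S_0 \cdot \exp(\alpha t / C_m) $, where $ C_m $ is a proposed model capability coefficient. A Lyapunov analysis yields the stabilization condition $ \lambda < \alpha / C_m $. The solution centers on the ADE (Agent Delivery Engineering) framework, a four-layer architecture with 23 core components spanning Physical Laws (L1), Logical Rules (L2), Task Execution (L3), and User Adaptation (L4). The paper also proposes a Five-Layer Disorder Taxonomy that unifies system failures and presents Elastic Organization as a new MAS morphology. Validation spans 100K-scale experiments and 33.6 days of production monitoring: channel fracture falls from 69-98% to near 0%, and system death probability stays below 0.02%.
Link: https://arxiv.org/abs/2606.18065
Authors: Dexing Liu(Shanghai Qijing Digital Technology)
Affiliations: Shanghai Qijing Digital Technology Co., Ltd
Categories: Multiagent Systems (cs.MA)
Comments: 32 pages, 18 figures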
Abstract:As LLM-driven multi-agent systems (MAS) transition from lab to production, system behavior exhibits nonlinear degradation. We introduce the Intelligence Entropy Principle: probability-driven systems spontaneously drift toward disorder, formalized as S(t) = S0 * exp(alpha*t/Cm), where Cm is a model capability coefficient we propose. Lyapunov analysis yields the stabilization condition lambda < alpha/Cm. We construct the ADE (Agent Delivery Engineering) four-layer framework (L1 Physical Laws through L4 User Adaptation) with 23 core components. Validation spans 100K-scale experiments and 33.6 days of production monitoring. We propose a Five-Layer Disorder Taxonomy unifying failures under structural collapse, and present Elastic Organization as an original MAS morphology. Results: channel fracture reduced from 69-98% to near 0%; system death probability below 0.02%.
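To make the growth law concrete, the sketch below evaluates S(t) = S0 * exp(alpha*t/Cm) and derives the doubling time t = Cm * ln(2) / alpha that follows from it. The numeric values of S0, alpha, and Cm are hypothetical and chosen only for illustration; the abstract gives no values or units for them.

```python
import math


def entropy(t, s0, alpha, cm):
    """Evaluate S(t) = S0 * exp(alpha * t / Cm) as stated in the abstract."""
    return s0 * math.exp(alpha * t / cm)


def doubling_time(alpha, cm):
    """Time at which S(t) reaches 2 * S0, obtained by solving exp(alpha*t/Cm) = 2."""
    return cm * math.log(2) / alpha


if __name__ == "__main__":
    s0, alpha = 1.0, 0.1  # hypothetical values
    for cm in (1.0, 2.0, 4.0):  # a larger Cm slows the growth
        t2 = doubling_time(alpha, cm)
        print(cm, t2, entropy(t2, s0, alpha, cm))
```

Under this formula the doubling time scales linearly with Cm, which is one way to read the claim that the coefficient captures a system's resistance to disorder.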
[MA-3] ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents
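[Quick Read]: This paper addresses a trustworthiness failure of generative AI agents in multi-tool frameworks caused by cross-source conflation. Existing factuality metrics usually test only whether an answer is supported by pooled evidence and miss a provenance-sensitive failure: a claim may hold in one source yet be attributed to another. The paper proposes ProvenanceGuard, a source-aware verifier built on the Model Context Protocol (MCP). It consumes captured MCP traces with stable tool and source IDs; decomposes answers into atomic claims; matches each claim against source-specific evidence using natural language inference (NLI) and a token-alignment proxy; compares the stated attribution with the routed source; and returns per-claim verdicts plus an answer-level allow/block decision. Blocked answers can be repaired through retrieval-augmented answer revision and then re-verified. On 281 medical-domain MCP-agent traces, ProvenanceGuard reaches block F1 0.802 and source accuracy 0.858 on the 40-trace held-out split, outperforming source-blind baselines that do not emit claim-to-source mappings. On a harder multi-source benchmark, source-plus-relation accuracy drops to 0.229, which shows how difficult exact source ownership remains. Repair-and-reverify resolves all blocked answers, and in 50 controlled clinical conflation probes the system detects every injected attribution swap with no wrong attribution retained. The results indicate that source attribution is an independent axis of factuality verification for MCP-based agents.
Link: https://arxiv.org/abs/2606.18037
Authors: Ander Alvarez,Santhiya Rajan,Samuel Mugel,Román Orús
Affiliations: Multiverse Computing; Donostia International Physics Center; Ikerbasque Foundation for Science
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: 20 pages, 4 figures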
Abstract:Tool-using LLM agents increasingly use the Model Context Protocol (MCP) to answer from heterogeneous evidence sources, including search, APIs, databases, clinical records, and formulary tools. Standard factuality metrics usually test whether an answer is supported by pooled evidence, missing a provenance-sensitive failure mode: a claim may be supported somewhere while being attributed to the wrong source. We call this cross-source conflation. We introduce ProvenanceGuard, a source-aware verifier for MCP-grounded answers. It consumes captured MCP traces with stable tool IDs, source IDs, and raw outputs; decomposes answers into atomic claims; routes claims to source-specific evidence; checks support with NLI and a token-alignment proxy; compares stated attribution with the routed source; and returns per-claim verdicts plus an answer-level allow/block decision. Blocked answers can be repaired with retrieval-augmented answer revision and re-verified. We evaluate on 281 medical-domain MCP-agent traces. A 266-trace adjudicated subset yields 2,325 LLM-assisted claim labels split by trace; 361 held-out labels are human-verified. On the 40-trace held-out split, ProvenanceGuard achieves block F1 0.802 and source accuracy 0.858 over 260 source-eligible claims, outperforming source-blind baselines that do not emit claim-to-source IDs. On a harder multi-source benchmark it reaches block F1 0.846, while source-plus-relation accuracy drops to 0.229, showing that exact source ownership remains difficult with semantically close sources. Repair-and-reverify resolves all blocked answers in the full trace set, often via conservative fallback. In 50 controlled clinical conflation probes, ProvenanceGuard detects all injected attribution swaps with no retained wrong attribution. These results show that source attribution is an independent axis for factuality verification in MCP-based agents. 
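The difference between pooled support and source-aware support can be sketched in a few lines. This is a toy illustration and not ProvenanceGuard itself: the `Claim` type, the `verify` function, and the trace layout are hypothetical, and a case-insensitive substring test stands in for the NLI and token-alignment checks the abstract describes.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    stated_source: str  # source ID the answer attributes the claim to


def supports(evidence: str, claim: str) -> bool:
    # Stand-in for an NLI check: plain case-insensitive substring match.
    return claim.lower() in evidence.lower()


def verify(claims, trace):
    """Return per-claim verdicts and an answer-level allow/block decision.

    trace maps a source ID to the raw output captured from that source.
    """
    verdicts = []
    for claim in claims:
        supporting = {sid for sid, ev in trace.items() if supports(ev, claim.text)}
        if claim.stated_source in supporting:
            verdicts.append("supported")
        elif supporting:
            # Supported somewhere, but not by the source it is attributed to.
            verdicts.append("conflated")
        else:
            verdicts.append("unsupported")
    decision = "allow" if all(v == "supported" for v in verdicts) else "block"
    return verdicts, decision


if __name__ == "__main__":
    trace = {
        "formulary": "Drug A max dose is 40 mg daily.",
        "record": "Patient is allergic to penicillin.",
    }
    claims = [
        Claim("allergic to penicillin", "record"),
        Claim("max dose is 40 mg", "record"),  # true, but wrong source
    ]
    print(verify(claims, trace))
```

A pooled-evidence metric would accept both claims in this example, since each is supported somewhere in the trace; only the comparison against the stated source flags the second one.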
[MA-4] LegalHalluLens: Typed Hallucination Auditing and Calibrated Multi-Agent Debate for Trustworthy Legal AI ICML2026
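[Quick Read]: This paper addresses the high hallucination rates of AI systems deployed in legal workflows, and in particular the fact that aggregate metrics (such as a ~52% hallucination rate) hide which claim types errors concentrate in and which direction they run (omission or invention), leaving compliance officers without an actionable trust signal. The proposed LegalHalluLens auditing framework has three components: (1) typed hallucination profiles over four legally motivated claim categories (numeric, temporal, obligation/entitlement, factual), measured on CUAD; (2) a Risk Direction Index (RDI) that reduces omission-versus-invention bias to a single scalar comparable across deployments; and (3) a typed debate pipeline calibrated to both magnitudes and directions. Experiments on 510 contracts and 249,252 clause-level instances show a within-model gap of about 38-40 percentage points between claim categories, and that two systems with the same 52% rate can carry opposite RDIs. The debate pipeline reduces fabricated detections by 45%, with per-category gains tracking the diagnosis, and matches commercial APIs with only 4B active parameters. The diagnostics also serve as calibration inputs for multi-agent debate, where Skeptic challenges and asymmetric gates targeted at measured failure modes outperform generically tuned debate. The framework supports direction-aware procurement, accountability, and agent design for legal AI.
Link: https://arxiv.org/abs/2606.18021
Authors: Lalit Yadav,Akshaj Gurugubelli
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 15 pages, 5 figures; Published at the Second Workshop on Agents in the Wild: Safety, Security, and Beyond (AIWILD) at ICML 2026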
Abstract:AI systems deployed in legal workflows hallucinate at rates that aggregate metrics report at ~52%, but this average conceals where errors concentrate and in which direction they run, leaving compliance officers without an actionable signal for trustworthy deployment. We present LegalHalluLens, an auditing framework with three components: typed hallucination profiles across four legally-motivated claim categories (numeric, temporal, obligation/entitlement, factual) over CUAD (Hendrycks et al., 2021); a Risk Direction Index (RDI) that reduces omission-versus-invention bias to a single deployment-comparable scalar; and a typed debate pipeline calibrated to both magnitudes and directions. Across 510 contracts and 249,252 clause-level instances we measure a within-model gap of approximately 38-40 pp between obligation/numeric and temporal claims that aggregate reporting hides, and show that two systems with matched 52% rates can carry opposite RDIs. The debate pipeline reduces fabricated detections by 45% with per-category gains tracking the diagnosis, matching commercial APIs with a substantially smaller backbone (4B active parameters). Typed profiles and RDI surface failure modes that aggregate metrics hide; we further show these diagnostics serve as calibration inputs for multi-agent debate pipelines, where Skeptic challenges and asymmetric gates targeted at measured failure modes outperform generically-tuned debate. The framework supports direction-aware procurement, accountability, and agent design for legal AI deployed in the wild.
[MA-5] A Neuro-Symbolic Approach to Strategy Synthesis for Strategic Logics
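[Quick Read]: This paper addresses the high computational cost of reasoning about strategic ability in multi-agent systems (MAS), and in particular the complexity of strategy synthesis. Logics such as ATL provide rigorous methods, but their practical use is limited by the combinatorial explosion of the strategy space. The paper proposes a neuro-symbolic framework that integrates large language models (LLMs) into the MAS model-checking pipeline in a generate-and-certify architecture: the LLM acts as a strategy-generation oracle that proposes candidate strategies, and a standard MAS model checker formally validates them. The LLM guides the search through large strategy spaces, while formal soundness is preserved because a strategy is accepted only when the verifier certifies it. The framework is instantiated for bounded strategic reasoning in NatATL, together with the first NatATL strategy-synthesis dataset of 4211 instances. Experiments with the open-weight Qwen3-32B model show that the certified pipeline achieves 92% accuracy on strategy synthesis.
Link: https://arxiv.org/abs/2606.17962
Authors: Marco Aruta,Vadim Malvone,Aniello Murano,Domenico Parente,Luca Rizzuti
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: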
Abstract:Reasoning about what agents can achieve through strategic interaction is a core challenge in Multi-Agent Systems (MAS). Logics for strategic ability, such as ATL, provide rigorous methods, but their adoption is often hindered by the computational cost of strategy synthesis. We introduce a neuro-symbolic framework that integrates large language models (LLMs) into the model-checking pipeline for MAS. The LLM acts as a strategy-generation oracle, proposing candidate strategies that are then formally validated by a standard MAS model checker. This generate-and-certify architecture uses LLM guidance to navigate large combinatorial strategy spaces while preserving formal soundness: generated strategies are accepted only when certified by the verifier. We instantiate the framework for bounded strategic reasoning in NatATL and introduce the first NatATL strategy-synthesis dataset, consisting of 4211 instances. Experiments with an open-weight Qwen3-32B model show that our certified pipeline achieves 92% accuracy on strategy-synthesis outcomes.
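The generate-and-certify pattern described in the abstract can be sketched as a small loop. Everything here is a simplified stand-in: a fixed list of candidates plays the role of the LLM oracle, and a bounded reachability check on a toy single-agent transition system plays the role of the model checker. The names `generate_and_certify`, `certify`, `DELTA`, and `GOAL` are hypothetical, and the toy setting is far simpler than NatATL.

```python
from typing import Callable, Dict, Iterable, Optional

Strategy = Dict[int, str]  # memoryless strategy: state -> action

# Toy deterministic transition system (hypothetical example).
DELTA = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 0, (1, "b"): 2,
    (2, "a"): 3, (2, "b"): 1,
}
GOAL = 3


def certify(strategy: Strategy, bound: int = 5) -> bool:
    """Check that following `strategy` from state 0 reaches GOAL within `bound` steps."""
    state = 0
    for _ in range(bound):
        if state == GOAL:
            return True
        action = strategy.get(state)
        if action is None or (state, action) not in DELTA:
            return False  # strategy is undefined here, so it cannot be certified
        state = DELTA[(state, action)]
    return state == GOAL


def generate_and_certify(
    candidates: Iterable[Strategy],
    verifier: Callable[[Strategy], bool],
    budget: int,
) -> Optional[Strategy]:
    """Return the first candidate the verifier accepts, or None within the budget."""
    for i, candidate in enumerate(candidates):
        if i >= budget:
            break
        if verifier(candidate):  # soundness: only certified strategies are returned
            return candidate
    return None


if __name__ == "__main__":
    proposals = [
        {0: "b", 1: "a", 2: "a"},  # loops in state 0
        {0: "a", 1: "a", 2: "a"},  # oscillates between states 0 and 1
        {0: "a", 1: "b", 2: "a"},  # reaches the goal: 0 -> 1 -> 2 -> 3
    ]
    print(generate_and_certify(proposals, certify, budget=10))
```

The design point is that the proposer may be arbitrarily unreliable: a wrong candidate costs a verifier call, but it can never be returned as a result.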
[MA-6] Trustworthy Self-Composable Big-Data-as-a-Service: An LLM-Orchestrated Multi-Agent Framework for Automated Data Engineering, AutoML, MLOps Deployment, and Drift-Aware Lifecycle Optimization
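[Quick Read]: This paper addresses the difficulty of automating the full data-science lifecycle on Big-Data-as-a-Service (BDaaS) platforms. Existing LLM-based data science agents and AutoML systems mostly cover isolated workflow stages and offer limited lifecycle-level orchestration, artifact governance, human oversight, and drift-aware adaptation. The paper proposes a trustworthy, self-composable BDaaS framework based on LLM-orchestrated multi-agent collaboration. The lifecycle is decomposed into specialized agents for data ingestion, data cleaning, feature engineering, AutoML training, model evaluation, MLOps deployment, monitoring, and drift detection. A central LLM orchestration layer coordinates agent execution, validates intermediate outputs, manages workflow context, and supports dynamic workflow composition. The framework also incorporates shared artifact governance, reproducibility support, human-in-the-loop checkpoints, and drift-aware feedback loops. A prototype evaluation on tabular benchmark datasets with missing values, categorical variables, outliers, class imbalance, and simulated covariate drift shows that, compared with manual ML, AutoML-only, and single-agent LLM baselines, the multi-agent pipeline keeps predictive performance competitive while improving workflow completion, artifact traceability, deployment readiness, reproducibility, and drift recovery.
Link: https://arxiv.org/abs/2606.17915
Authors: Aueaphum Aueawatthanaphisut,Badri Raj Lamichhane
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Databases (cs.DB); Software Engineering (cs.SE)
Comments: 7 pages, 3 figures, 5 tables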
Abstract:Big-Data-as-a-Service (BDaaS) platforms require reliable automation across data ingestion, cleaning, feature engineering, model development, deployment, and post-deployment monitoring. However, existing LLM-based data science agents and AutoML systems mainly focus on isolated workflow stages, leaving limited support for lifecycle-level orchestration, artifact governance, human oversight, and drift-aware adaptation. This paper proposes a trustworthy self-composable BDaaS framework based on LLM-orchestrated multi-agent collaboration. The proposed architecture decomposes the BDaaS lifecycle into specialized agents for data ingestion, data cleaning, feature engineering, AutoML training, model evaluation, MLOps deployment, monitoring, and drift detection. A central LLM orchestration layer coordinates agent execution, validates intermediate outputs, manages workflow context, and enables dynamic workflow composition. The framework also incorporates shared artifact governance, reproducibility support, human-in-the-loop checkpoints, and drift-aware feedback loops. A prototype-based evaluation is conducted using controlled tabular benchmark datasets with missing values, categorical variables, outliers, class imbalance, and simulated covariate drift. Compared with manual ML, AutoML-only, and single-agent LLM baselines, the proposed multi-agent BDaaS pipeline achieves competitive predictive performance while improving lifecycle-level reliability, including workflow completion, artifact traceability, deployment readiness, reproducibility, and drift recovery. The results suggest that LLM-orchestrated multi-agent systems can extend conventional AutoML toward trustworthy, adaptive, and production-oriented BDaaS lifecycle automation.
[MA-7] ED3R: Energy-Aware Distributed Disaster Detection Enabled by Cooperative Robotic Agents
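[Quick Read]: This paper addresses the challenges robots face in wildfire monitoring under limited resources, high uncertainty, and strict operational constraints, including how to minimize energy use and detection delay while meeting a required detection confidence. It proposes ED3R, an energy-aware distributed framework in which a robot and a remote controller make decisions cooperatively and hierarchically: the remote controller plans the robot's motion, while the robot senses the environment and decides where to run detection (onboard or remotely) and how. ED3R uses a custom penalty function to ensure feasibility, integrates obstacle avoidance and suppression of redundant exploration, and adds a forward-looking capability through distributed neural regression models that let the agents evaluate candidate strategies before execution. Experiments show that in the most demanding missions ED3R reduces energy consumption by up to 36.4% and detects wildfires up to 41% faster than baselines, with a mission success rate of up to 97.18%.
Link: https://arxiv.org/abs/2606.17739
Authors: Lina Magoula,Nikolaos Koursioumpas,Nancy Alonistioti,Ramin Khalili
Affiliations: National and Kapodistrian University of Athens; Huawei Heisenberg Research Center (Munich)
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments: 14 pages, 9 figures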
Abstract:Robotics are expected to support environmental monitoring and natural disaster management, where decisions must be made under uncertainty, resource limitations, and strict operational constraints. In critical missions, such as wildfires, robotic agents must not only identify hazardous events with sufficient confidence, but also manage the energy cost and time until detection. This paper introduces ED3R, an energy-aware distributed framework for wildfire detection under uncertainty. ED3R enables hierarchical cooperative decision-making between a robot and a remote controller. The remote controller decides upon the robot’s motion, while the robot senses the environment and decides where to execute the wildfire detection (onboard or remotely) and how. The common goal is to detect wildfires with a required confidence while minimizing the energy consumed by any robot operation. ED3R further integrates mechanisms to avoid nearby obstacles, prevent redundant exploration, enable adaptive early mission completion, and ensure feasibility through a custom penalty function. ED3R also introduces a forward-looking capability, enabled through distributed neural regression models that allow the agents to anticipate the future by evaluating candidate strategies before execution. The framework is evaluated through realistic robotics simulations, ablation studies, and baseline comparisons. Overall, ED3R achieves a mission success rate of up to 97.18%. Especially in the most demanding missions, it reduces energy consumption by up to 36.4% and detects wildfires up to 41% faster than baselines.
[MA-8] GeoDisaster: Benchmarking Orchestrated Agents for Operational Disaster Geo-Intelligence
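[Quick Read]: This paper addresses the gap between remote-sensing vision-language models (RS-VLMs) and operational geo-intelligence, which requires tool-grounded spatial reasoning and structured, evidence-backed decisions. It proposes GeoDisaster, an operational geospatial disaster reasoning benchmark with 2,921 verified instances across 43 question types and five task families (deforestation monitoring, multi-hazard analysis, building-damage assessment, flood-safe routing, and Sentinel-1 SAR flood monitoring). The instances integrate heterogeneous Earth-observation (EO) and GIS evidence, including optical and synthetic aperture radar (SAR) imagery, raster masks, vector geometries, road networks, and exposure layers. Ground-truth answers come from executable geospatial workflows and deterministic consistency checks, so no language-model annotation is needed. The paper also designs an orchestrated multi-agent framework with 18 disaster-oriented tools, in which role-specialized agents coordinate through explicit execution contracts and are aligned via Role-Contract Expectation Alignment (RCEA), which combines failure-aware supervised fine-tuning with contract-grounded reinforcement learning over dense step-level signals. RCEA improves tool use, evidence grounding, state consistency, and decision generation.
Link: https://arxiv.org/abs/2606.17246
Authors: Maram Hasan,Aman Verma,Savitra Roy,Hariseetharam Gunduboina,Daksh Jain,Muhammad Haris Khan,Subhasis Chaudhuri,Biplab Banerjee
Affiliations: Indian Institute of Technology Bombay; Mohamed bin Zayed University of Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments: 28 pages, 11 Figures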
Abstract:Remote-sensing vision-language models (RS-VLMs) have advanced Earth-observation analysis toward visual interpretation and instruction-following, yet fall short of operational geo-intelligence, which demands tool-grounded spatial reasoning and structured, evidence-backed decisions. We introduce GeoDisaster, an operational geospatial disaster reasoning benchmark with 2,921 verified instances across 43 question types and five task families: deforestation monitoring, multi-hazard analysis, building-damage assessment, flood-safe routing, and Sentinel-1 SAR flood monitoring. Instances integrate heterogeneous EO/GIS evidence (optical and SAR imagery, raster masks, vector geometries, road networks, and exposure layers), spanning hazard detection, damage assessment, exposure estimation, and diagnostic report generation. Ground-truth answers are grounded in executable geospatial workflows and deterministic consistency checks, removing the need for language-model annotation. We further propose an orchestrated multi-agent framework with 18 disaster-oriented tools, where role-specialized agents coordinate through explicit execution contracts, aligned via Role-Contract Expectation Alignment (RCEA): failure-aware supervised fine-tuning combined with contract-grounded reinforcement learning over dense step-level signals. Experiments show that GeoDisaster challenges existing RS-VLMs and agentic systems, while RCEA improves tool use, evidence grounding, state consistency, and decision generation.
[MA-9] Intermittent Strategic Cooperation of Two Selfish Agents on Graphs
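[Quick Read]: This paper addresses the strategic fragility of cooperation between two self-interested agents under space and time constraints: during path planning, cooperation can benefit both agents, yet either may deviate at any node. It introduces the Intermittent Strategic Cooperation-Based Two-Agent Path Planning (IC2PP) problem, modeled as a shortest-path game on graphs. A game-theoretic analysis characterizes the structure of Pure Nash Equilibrium (PNE) strategies in IC2PP and shows that stable cooperation must follow a highly constrained form. The paper also presents a polynomial-time algorithm for enumerating all relevant PNEs and, when multiple equilibria arise, studies bargaining-theoretic coordination mechanisms for selecting among them, comparing outcomes in terms of individual travel times and social welfare.
Link: https://arxiv.org/abs/2606.17216
Authors: Itay Shedlezki,Noa Agmon
Affiliations: Bar-Ilan University
Categories: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT); Robotics (cs.RO)
Comments: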
Abstract:We study strategic space- and time-constrained cooperation between two self-interested agents through the Intermittent Strategic Cooperation-Based Two-Agent Path Planning (IC2PP) problem, a shortest-path game on graphs in which agents navigate toward individual targets while optionally cooperating at specific nodes to reduce their own travel times. Although such cooperation can strictly benefit both agents, it is strategically fragile: agents may deviate at any point along their paths. Modeled as a 2-player game, we characterize the structure of Pure Nash Equilibrium (PNE) joint strategies in IC2PP, and show that stable cooperation must follow a highly constrained form. We further prove that at least one PNE exists in every instance of IC2PP, and present a polynomial-time algorithm for enumerating all relevant PNEs. When multiple equilibria arise, we study coordination mechanisms based on bargaining-theoretic selection concepts and empirically compare equilibrium outcomes in terms of individual travel times and social welfare.
[MA-10] Verified Detection and Prevention of Concurrency Anomalies in Multi-Agent Large Language Model Systems
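[Quick Read]: This paper addresses concurrency-consistency problems in multi-agent LLM systems that share state through memory stores, vector indices, and tool registries. Existing systems struggle to guarantee predictable, isolated operation sequences, which leads to concurrency anomalies. The paper formalizes four anomalies in TLA+ that are structural analogues of classical isolation anomalies (stale-generation, phantom-tool, causal-cascade, and tool-effect reordering) and exhibits each with a TLC counter-example. It then mechanically verifies the realizability and strict separation of one maximal chain of consistency levels, L0 ⊊ ⋯ ⊊ L4, giving a machine-checked consistency hierarchy for such runtimes. A development of 274 Verus obligations (zero assume, zero admit; trust base of two structural axioms and a mutex correspondence) proves the detectors sound and complete. Three deployed Rust runtimes realize L0-L1 (pessimistic locking, serializable snapshot isolation, default-SI), and L2-L4 are verified with dependency-free prevention twins (0/1000 versus 1000/1000). The method reproduces a silent lost update in ByteDance's deer-flow and formalizes its fix, and it exhibits tool-effect reordering in LangGraph's ToolNode, which an L3 commit-order sequencer removes. The contribution is the verified detector, the refinements, and the realizability evidence, not the phenomena themselves.
Link: https://arxiv.org/abs/2606.17182
Authors: Sajjad Khan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA); Programming Languages (cs.PL)
Comments: 32 pages, 2 figures, 6 tables. Verus/TLA+ verification artifact, reference Rust runtime, and Python harnesses, plus a supplementary appendix (Sections A-F, Tables S1-S6), included as ancillary files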
Abstract:Multi-agent LLM systems share state through memory stores, vector indices, and tool registries. We model such sharing as long-running read-generate-write operations under deterministic-generation semantics – the regime durable-execution engines enforce by deterministic replay – and formalize four concurrency anomalies in TLA+: stale-generation, phantom-tool, causal-cascade, and tool-effect reordering, structural analogues of classical isolation anomalies, each with a TLC counter-example. The exclusion lattice over these anomalies is trivial; the contribution is the mechanically verified realizability and strict separation of one maximal chain within it, $ L_0 \subsetneq \cdots \subsetneq L_4 $, to our knowledge the first machine-checked consistency hierarchy for such runtimes. A development of 274 Verus obligations (zero assume, zero admit; trust base: two structural axioms and a mutex correspondence) proves the detectors sound and complete against the specifications and each runtime its avoidance set. Three deployed Rust runtimes realize L0-L1 (pessimistic locking, serializable snapshot isolation, default-SI), each verified against stale-generation and refined to its state machine; L2-L4 are exec-mode-verified with dependency-free prevention twins (A3, A6, A2: 0/1000 versus 1000/1000), and L2 is run live across three model families (A3 prevented in all 120 retracted sessions). We reproduce a silent lost update in ByteDance’s deer-flow, formalizing its fix as a verified $ L_0 \to L_1 $ refinement, and exhibit tool-effect reordering in LangGraph’s ToolNode on unmodified output, removed by an L3 commit-order sequencer. The verified detector, refinements, and realizability artifacts are the contribution; the phenomena and lattice are classical.
[MA-11] From Parasocial Scripts to Dyadic Persistence in Autonomous AI-Agent Communities EMNLP2026
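[Quick Read]: This paper asks whether the asymmetric relational cues known from conventional media as parasocial interaction (PSI) and parasocial relationship (PSR) also appear in online communities made up of autonomous AI agents. It analyzes 4,434 posts and 50,338 comments from Moltbook through three theory-based textual indicators: attachment/intimacy language, reciprocity bids, and self-identification to the original poster (OP). Combining keyword matching, few-shot large language model (LLM) annotation, and grouped-context LLM annotation, the study finds that PSI colloquial cues are widespread and strongly associated with OP re-engagement and a reciprocal reply structure. The findings hold under negative controls, nullification, clustered-standard-error re-estimation, and multiple-testing correction. A dyadic persistence test further shows that reciprocity bids align with sustained mutual recurrence involving the OP, which provides empirical support for bridging interaction-level PSI scripts with PSR-consistent repeated dyadic patterns. The authors interpret the evidence as a behavioral structure in the discourse of LLM-enabled agents.
Link: https://arxiv.org/abs/2606.17174
Authors: Mohammadsadegh Abolhasani,Hamid Reza Firoozfar,Reza Mousavi,Paul Jen-Hwa Hu
Affiliations: University of Utah; University of Virginia
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Comments: Submitted for review in ARR for EMNLP 2026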
Abstract:While parasocial interactions (PSIs) and parasocial relationships (PSRs) have been studied in conventional media settings, we investigate whether PSI- (colloquial) relational cues also exist in online communities where both sides are autonomous AI agents. We analyze 4,434 posts and 50,338 comments from Moltbook through three theory-based textual indicators: attachment/intimacy language, reciprocity bids, and self-identification to original poster (OP). The combined results across methods based on keyword matching, few-shot large language model (LLM) annotation, and grouped-context LLM annotation reveal that PSI colloquial cues prevail and are strongly associated with OP re-engagement and a reciprocal reply structure. These results are robust across negative controls, nullification, clustered-standard-error re-estimation, and multiple-testing correction. A dyadic persistence test further affirms reciprocity bids aligned with sustained OP-involving mutual recurrence, providing empirical evidence for bridging interaction-level PSI scripts with PSR-consistent repeated dyadic patterns. We interpret the evidence as a behavioral structure in discourse by LLM-enabled agents.
[MA-12] MemSlides: A Hierarchical Memory Driven Agent Framework for Personalized Slide Generation with Multi-turn Local Revision
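[Quick Read]: This paper addresses the difficulty of keeping user preferences stable across multi-turn interaction and local edits in personalized presentation generation: preserving stable preferences across tasks, retaining newly introduced constraints during multi-turn revision, and carrying out local edits reliably. It proposes MemSlides, a hierarchical memory framework that separates long-term memory from working memory and further divides long-term memory into user profile memory and tool memory. User profile memory handles round-0 personalization, working memory carries the active preferences and constraints of the current session across revision rounds, and tool memory stores reusable execution experience for reliable local editing. The framework pairs this design with scoped slide-local revision, so updates act only on the smallest affected region instead of regenerating the full deck. Experiments show that user profile memory improves persona alignment in multi-persona, multi-intent settings, that tool-memory injection improves closed-loop modify behavior, and that qualitative cases illustrate how working memory carries preferences over. Overall, effective personalized presentation generation depends on separating persistent user profiles, session-level working memory, and reusable execution experience.
Link: https://arxiv.org/abs/2606.17162
Authors: Ye Jin,Yangyang Xu,Jun Zhu,Yibo Yang
Affiliations: Beijing University of Posts and Telecommunications; Tsinghua University; Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments: Code, website, project page, and video are linked in the paper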
Abstract:Personalized presentation generation requires more than conditioning on a current prompt or template: agents must preserve stable user preferences across tasks, retain newly introduced preferences and constraints during multi-turn revision, and carry out local edits reliably. We propose MemSlides, a hierarchical memory framework for personalized presentation agents that separates long-term memory from working memory and further divides long-term memory into user profile memory and tool memory. User profile memory stores intent-conditioned profiles for round-0 personalization, working memory carries active preferences and session constraints across revision rounds, and tool memory stores reusable execution experience for reliable localized editing. MemSlides pairs this memory design with scoped slide-local revision, so targeted updates act on the smallest affected region instead of repeatedly regenerating the full deck. In controlled experiments, user profile memory improves persona-alignment judgments on a multi-persona, multi-intent profile bank, tool-memory injection improves closed-loop modify behavior in diagnostic matched-pair settings, and qualitative cases illustrate working memory’s ability to carry over preferences. Taken together, these results suggest that effective personalization in presentation authoring depends on separating persistent user profiles, session-level working memory, and reusable execution experience across generation and localized revision.
Natural Language Processing
[NLP-0] Variable-Width Transformers
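[Quick read]: This paper addresses the inefficient resource use caused by constant width allocation when scaling Transformer language models. Although scaling model size (depth and width) has driven significant progress, existing architectures allocate parameters and compute evenly across all layers, ignoring that different layers may play different computational roles. The paper proposes an ×-shaped nonuniform-width architecture that keeps early and late layers wide and narrows the middle layers, using a parameter-free residual resizing mechanism to change the feature dimension between layers. Across dense models from 200M to 2B parameters and a 3B-parameter mixture-of-experts (MoE) model, the design consistently beats parameter-matched uniform baselines on language modeling loss. By lowering the average layer width, it also needs 22% fewer overall FLOPs (under fitted loss-matched scaling curves) and 15% less KV cache memory and I/O cost. Analysis shows that the bottleneck structure yields qualitatively different representations in the residual stream. The key point is that nonuniform width allocation can give more resource-optimal scaling of language models.
Link: https://arxiv.org/abs/2606.18246
Authors: Zhaofeng Wu,Oliver Sieberling,Shawn Tan,Rameswar Panda,Yury Polyanskiy,Yoon Kim
Affiliations: MIT; MIT-IBM Watson AI Lab
Categories: Computation and Language (cs.CL)
Comments: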
Abstract:Scaling model size, specifically depth and width, has driven significant progress in transformer-based language models. However, most architectures maintain a constant width across all layers, allocating a fixed parameter and computation budget evenly despite different layers potentially playing distinct computational roles. In this work, we empirically investigate nonuniform capacity allocation across network depth by proposing a ×-shaped former architecture. This design maintains wider early and late layers while narrowing the middle layers, utilizing a parameter-free residual resizing mechanism. Across decoder-only language models ranging from 200M to 2B parameters (dense) and 3B parameters (MoE), our former consistently outperforms parameter-matched uniform baselines on language modeling loss. By reducing the average layer width, this architecture also requires fewer overall FLOPs (22% reduction under fitted loss-matched scaling curves) and smaller KV cache memory and I/O cost (15% reduction). In analysis, we show that this bottleneck structure results in qualitatively different representations in residual streams. Overall, our results demonstrate that nonuniform width allocation can result in more resource-optimal scaling of language models.
[NLP-1] ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues
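[Quick read]: This paper addresses the reproducibility challenge in machine learning research, in particular that existing evaluation frameworks are hard to scale because they depend on substantial manual data curation and evaluation. Its solution is ReproRepo, a scalable evaluation framework that uses real GitHub issues raised by researchers as naturally occurring supervision for identifying realistic reproduction blockers. The framework is instantiated on 1,149 recent machine learning papers from major conferences and used to evaluate four frontier model-agent configurations. The results show that LLM agents can identify many real-world reproducibility problems even without executing code: the best agent, Codex with GPT-5.5, surfaces at least one semantically related human-reported blocker for about 90% of papers. Further analysis shows that agents are good at surfacing visible failures and locating the right semantic region but still fall short on exact localization. ReproRepo offers a reusable, scalable framework for future evaluation of LLM agents on real-world reproducibility auditing.
Link: https://arxiv.org/abs/2606.18237
Authors: Shanda Li,Qiuhong Anna Wei,Jingwu Tang,Valerie Chen,Nihar B Shah,Tim Dettmers,Yiming Yang,Ameet Talwalkar
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: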
Abstract:Reproducing research results from papers and released code is central to scientific progress. Existing works have introduced benchmarks to evaluate whether LLM agents can assist with reproducibility, but they are difficult to scale due to their reliance on substantial manual effort for data curation and evaluation. We introduce ReproRepo, a scalable framework for reproducibility evaluation that leverages human-raised GitHub issues as naturally occurring supervision on realistic reproduction blockers. We instantiate ReproRepo on 1,149 recent machine learning papers from major conferences and evaluate four frontier model-agent configurations. Our results show that LLM agents, even without executing code, can identify many real-world reproducibility problems from paper-repository pairs: the best agent in our study, namely Codex with GPT-5.5, surfaces at least one semantically related human-reported blocker for ~90% of papers in the study. Further analysis shows that agents are particularly effective for surfacing visible failures and identifying the right semantic region, but may still be insufficient in exact localization. ReproRepo can serve as a reusable, scalable framework for future evaluations of LLM agents on real-world reproducibility auditing. Our code is released at this https URL.
[NLP-2] Darshana Graph: A Parallel Commentary Corpus for Comparative Indian Philosophy with Stylometric and Exploratory Graph Analyses
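[Quick read]: This paper addresses the lack of a large-scale, structured, publicly available corpus of classical Indian philosophical texts that supports comparison across commentators, in particular the difficulty of systematically analyzing how different historical commentators in Vedanta, Jain, and other traditions interpret the same root text. Its core contribution is Darshana Graph, a corpus whose distinctive subset of roughly 8,500 Hindu and Jain records aligns the same root verse or sutra across eighteen historical commentators representing five schools of Vedanta and other darshanas, enabling cross-commentator comparison at a scale that, to the authors' knowledge, no other public resource matches. This alignment lets readings of the same verse in different interpretive traditions be compared directly and underpins the stylometric analysis and relation extraction that follow. The paper also describes a constrained large language model pipeline that uses a predefined relation vocabulary and deterministic post-hoc validation to extract typed philosophical relationships between concepts into a graph. The graph surfaces cross-school disagreement patterns and also exposes extraction limitations, including cases where an independent embedding-based analysis disagrees with the graph-derived findings. The full corpus, the extracted relationship graph, and all source code are released.
Link: https://arxiv.org/abs/2606.18222
Authors: Joy Bose
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL)
Comments: 12 pages, 1 figure. Open Source Code available at this https URL and dataset at this https URL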
Abstract:We introduce Darshana Graph, a corpus of over 125,000 text records spanning classical Hindu, Buddhist, and Jain philosophical traditions, drawn from public-domain and openly licensed translations of sources including the Bhagavad Gita, Brahma Sutras, principal Upanishads, the Pali Canon, and core Jain texts. Its distinctive contribution lies in a structurally unique subset of roughly 8,500 Hindu and Jain records in which the same root verse or sutra is aligned across eighteen historical commentators representing five schools of Vedanta and other darshanas, enabling direct comparison of how independent interpretive traditions read identical source material. To our knowledge, no publicly available resource provides comparable cross-commentator alignment at this scale. We present two analyses built on this corpus. First, a transparent stylometric comparison requiring no machine learning measures argumentative style through scriptural citation density, explicit refutation rate, and sentence complexity. It finds a moderate negative correlation between citation density and refutation rate, a marked increase in refutation rate across three commentators in a related doctrinal lineage, and measurable genre-level differences within the Pali Canon itself. Second, we describe a constrained large language model pipeline that extracts typed philosophical relationships between concepts using a predefined relation vocabulary and deterministic post-hoc validation. The resulting graph surfaces cross-school disagreement patterns while also revealing important extraction limitations, including cases where an independent embedding-based analysis disagrees with the graph-derived findings. We release the full corpus, extracted relationship graph, and all source code.
[NLP-3] Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients
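[Quick read]: This paper addresses the poor generalization of knowledge distillation in the small-student regime: forcing the student to imitate the logits of a much larger teacher concentrates it on the teacher's sharpest modes and degrades performance on benchmarks beyond the training corpus. Reinforcement learning (RL) avoids logit imitation, but on hard questions where every rollout fails the advantage is zero and the policy gets no update, while injecting the teacher's response into the policy gradient breaks the on-policy assumption and induces drift. The paper proposes Zone of Proximal Policy Optimization (ZPPO), inspired by Vygotsky's zone of proximal development, whose core idea is to keep teacher information in the prompt rather than in the policy gradient. For hard questions, ZPPO builds two reformulated prompts: a Binary Candidate-included Question (BCQ) presents one correct teacher response and one incorrect student response as anonymized candidates for the student to discriminate, and a Negative Candidate-included Question (NCQ) aggregates the student's wrong rollouts into a single prompt to expose their shared failure modes. A prompt replay buffer recirculates each hard question until the student's mean rollout accuracy on it reaches one half or the question is evicted first-in, first-out under finite capacity, amplifying the learning signal inside the student's current zone of proximal development. With Qwen3.5 students at four scales (0.8B-9B) and a 27B teacher, post-trained as vision-language models and evaluated on 31 benchmarks (16 VLM, 10 LLM, 5 Video), ZPPO outperforms off/on-policy distillation and GRPO, with the largest gains at the smallest scale.
Link: https://arxiv.org/abs/2606.18216
Authors: Byung-Kwan Lee,Ximing Lu,Shizhe Diao,Minki Kang,Saurav Muralidharan,Karan Sapra,Andrew Tao,Pavlo Molchanov,Yejin Choi,Yu-Chiang Frank Wang,Ryo Hachiuma
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Project page: this https URL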
Abstract:Knowledge distillation transfers a teacher’s competence to a small student but is brittle in the small-student regime: forcing the student to imitate logits from a much larger teacher concentrates it on the teacher’s sharpest modes, hurting generalization on benchmark families beyond the training corpus. Reinforcement learning (RL) avoids logit imitation by training on the student’s own rollouts. However, on questions where every rollout fails (yielding zero advantage and being silently discarded), injecting a stronger teacher’s response into the policy gradient breaks the on-policy assumption and induces drift. We introduce Zone of Proximal Policy Optimization (ZPPO), inspired by Vygotsky’s zone of proximal development, which keeps the teacher inside the prompt rather than the policy gradient. On hard questions, ZPPO constructs two reformulated prompts: a Binary Candidate-included Question (BCQ) pairs one correct teacher response with one incorrect student response as anonymized candidates the student must discriminate, and a Negative Candidate-included Question (NCQ) aggregates the student’s wrong rollouts into a single prompt to surface their shared failure modes. A prompt replay buffer recirculates each hard question until it either graduates (the student’s mean rollout accuracy on it reaches one half) or is FIFO-evicted under finite capacity, amplifying BCQ and NCQ inside the student’s current zone of proximal development. On the Qwen3.5 family at four student scales (0.8B-9B) with a 27B teacher, post-trained as vision-language models and evaluated on a 31-benchmark suite (16 VLM, 10 LLM, 5 Video), ZPPO outperforms off/on-policy distillation and GRPO, with the largest gains at the smallest scale.
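The prompt replay buffer described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class name `PromptReplayBuffer`, its method names, and the reading of "reaches half" as `mean accuracy >= 0.5` are all assumptions made for the example.

```python
from collections import OrderedDict


class PromptReplayBuffer:
    """FIFO buffer of hard questions (hypothetical names, illustrative only)."""

    def __init__(self, capacity, graduate_at=0.5):
        self.capacity = capacity
        self.graduate_at = graduate_at  # assumed threshold for "reaches half"
        self._items = OrderedDict()     # question_id -> latest mean rollout accuracy

    def add(self, question_id, mean_accuracy=0.0):
        """Insert a hard question; return the evicted id if the buffer was full."""
        if question_id in self._items:
            self._items[question_id] = mean_accuracy
            return None
        evicted = None
        if len(self._items) >= self.capacity:
            evicted, _ = self._items.popitem(last=False)  # oldest entry leaves first
        self._items[question_id] = mean_accuracy
        return evicted

    def update(self, question_id, rollout_correct):
        """Record fresh rollouts; return True if the question graduates."""
        if question_id not in self._items:
            raise KeyError(question_id)
        mean_accuracy = sum(rollout_correct) / len(rollout_correct)
        if mean_accuracy >= self.graduate_at:
            del self._items[question_id]
            return True
        self._items[question_id] = mean_accuracy
        return False

    def pending(self):
        """Question ids still waiting to be recirculated, oldest first."""
        return list(self._items)


buf = PromptReplayBuffer(capacity=2)
buf.add("q1")
buf.add("q2")
evicted = buf.add("q3")                                    # buffer full, "q1" is evicted
graduated = buf.update("q2", [True, False, True, False])   # mean accuracy 0.5
```

In this sketch a question leaves the buffer in exactly two ways, graduation or FIFO eviction, which mirrors the two exits named in the abstract.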
[NLP-4] Looped World Models
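[Quick read]: This paper addresses a fundamental tension in current world models for long-horizon simulation: faithful long-term prediction demands deep computation, yet deeper models are expensive to deploy and prone to compounding errors. Its solution is Looped World Models (LoopWM), presented as the first looped architecture for world modelling, which iteratively refines latent environment states through a parameter-shared transformer block. The method reports up to 100x better parameter efficiency and adds adaptive computation that scales depth to the complexity of each prediction step. Unlike the usual route of scaling model size and training data, LoopWM establishes iterative latent depth as a new scaling axis for world simulation, which the authors suggest could move the field forward.
Link: https://arxiv.org/abs/2606.18208
Authors: Hongyuan Adam Lu,Z.L. Victor Wei,Qun Zhang,Jinrui Zeng,Bowen Cao,Lingwei Meng,Mocheng Li,Zezhong Wang,Haonan Yin,Naifu Xue,Minyu Chen,Cenyuan Zhang,Zefan Zhang,Hao Wei,Jiawei Zhou,Haoran Xu,Hao Yang,Ronglai Zuo,Tongda Xu,Yonghao Li,Jian Chen,Hebin Wang,Zeyu Gao,Yang Li,Wei Zhao,Qimin Zhong,Siqi Liu,Yumeng Zhang,Leyan Cui,Zhangyu Wang,Wai Lam
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report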
Abstract:Current world models face a fundamental tension: faithful long-horizon simulation demands deep computation, but deeper models are expensive to deploy and prone to compounding errors. We resolve this by introducing Looped World Models (LoopWM), which are the first looped architectures for world modelling. Our method iteratively refines latent environment states through a parameter-shared transformer block. This yields up to 100x parameter efficiency over conventional approaches with adaptive computation that automatically scales depth to match the complexity of each prediction step. Orthogonal to scaling model size and training data, LoopWM establishes iterative latent depth as a new scaling axis for world simulation, which might significantly push the community forward.
[NLP-5] Analyzing and Encoding the Al-Mawrid Arabic-English Dictionary with the ISO Language Markup Framework and TEI Lex-0
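[Quick read]: This paper addresses the weak infrastructure for Arabic lexical resources, in particular the structural ambiguities and punctuation inconsistencies met when digitizing the 20th-century Al-Mawrid Arabic-English dictionary. Its core solution is a systematic dual-standard encoding approach that aligns the ISO Lexical Markup Framework (LMF) with the Text Encoding Initiative (TEI) Lex-0 guidelines and applies an editorial view to the dictionary's macro- and microstructure, reaching a structural parsing accuracy of 91%. The study rests on an empirical analysis of a representative sample (the letter Ayn, 4.6% of the total volume) and evaluates the information extraction rules: 85% precision and 98% recall for synonyms, and 88% precision for other morpho-semantic features. The paper also explores integration with Linguistic Linked Open Data (LLOD) by establishing a prefix-based referencing system that supports inclusion in the semantic web. The result is an interoperable, machine-tractable digitized dictionary with a reproducible workflow for retro-digitizing complex legacy bilingual lexicons in Arabic NLP and Digital Humanities.
Link: https://arxiv.org/abs/2606.18205
Authors: Diaa Fayed,Laurent Romary
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 44 pages, 58 figures, 12 tables. Submitted to Language Resources and Evaluation, under review since Aug 2025, round 3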
Abstract:This paper presents a robust methodology for the systematic digitization and encoding of the Al-Mawrid Arabic-English dictionary, transforming it from a legacy print resource into a standardized computational lexicon. Addressing a significant gap in Arabic lexical infrastructure, the study adopts a dual-standard framing that aligns the ISO Lexical Markup Framework (LMF) with the Text Encoding Initiative TEI Lex-0 guidelines. By applying an editorial view to the dictionary’s macro- and microstructure, the research resolves the structural ambiguities and punctuation inconsistencies typical of 20th-century bilingual dictionaries. The methodology is grounded in an empirical analysis of the dictionary’s lexical knowledge density. Drawing on a representative sample (the letter Ayn, comprising 4.6% of the total volume), the study provides scientific weight to the encoding process, demonstrating a structural parsing accuracy of 91%. Quantitative evaluation of the information extraction rules reveals high performance, with 85% precision and 98% recall for synonyms, and 88% precision for other morpho-semantic features. Beyond technical description, the paper provides a critical comparison with existing Arabic lexical resources and discusses the limitations of TEI Lex-0 when modelling specific Arabic phenomena, such as implicit “open set” semantic relations and scattered morphological cues. Furthermore, the study explores the potential for Linguistic Linked Open Data (LLOD) integration by establishing a scalable prefix-based referencing system that facilitates the resource’s inclusion in the semantic web. The result is an interoperable, machine-tractable resource that provides a reproducible workflow for the retro-digitization of complex legacy bilingual lexicons within the Arabic NLP and Digital Humanities communities.
[NLP-6] RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills
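[Quick read]: This paper addresses the evaluation bottleneck that limits large-scale clinical deployment of generative health AI systems, namely the tension between reliability and scalability: physician annotation is reliable but costly and hard to scale, while LLM-based automatic evaluators scale but are subjective, inconsistent, and sometimes clinically misaligned. The key to its solution is RubricsTree, an expert-aligned hierarchical evaluation framework with more than 100 atomic, clinically verifiable Boolean rubrics, evolved from 4,000 real user queries through an iterative human-in-the-loop curation protocol with an expert panel led by an experienced physician. A context-aware adaptive router activates only the weighted rubric subset relevant to each query, giving scalable throughput while keeping expert-aligned quality. The meta-evaluation shows that RubricsTree substantially exceeds a strong baseline in expert alignment on challenging open-ended queries, reliably penalizes contextually degraded responses, and, when used as structured instructions, feedback, or training rewards, yields relative gains of up to about 66% on HealthBench. It provides a scalable, auditable, and evolving evaluation infrastructure for the continuous optimization of product-level personal health AI.
Link: https://arxiv.org/abs/2606.18203
Authors: Weizhi Zhang,Zechen Li,Hamid Palangi,Ben Graef,A. Ali Heydari,Simon A. Lee,Salman Rahman,Ray Luo,Zeinab Esmaeilpour,Erik Schenck,Chloe Zhang,Yamin Li,Menglian Zhou,Philip S. Yu,Daniel McDuff,Lindsey Sunden,Mark Malhotra,Shwetak Patel,Ahmed A. Metwally
Affiliations: Google Research(谷歌研究); University of Illinois Chicago(伊利诺伊大学芝加哥分校)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: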
Abstract:LLM-empowered personal health agents that draw on user health (sensor) metrics offer a promising pathway to alleviate global disparities in healthcare access. However, large-scale clinical deployment remains constrained by an open-ended evaluation bottleneck: physician annotation is reliable but costly and unscalable, while LLM-as-a-judge evaluators are scalable but subjective, inconsistent, and sometimes clinically misaligned. We introduce RubricsTree, a scalable evaluation framework with an expert-aligned hierarchical taxonomy of over 100 atomic, clinically-verifiable Boolean rubrics, evolving from the insights of 4,000 real user queries through an iterative human-in-the-loop curation protocol with an expert panel led by an experienced physician. A context-aware adaptive router activates only the relevant auto-weighted rubric subset per query, providing the throughput needed for scalable evaluation with expert-aligned quality. Through a systematic meta-evaluation, we show that RubricsTree (i) substantially exceeds a strong large-scale evaluation baseline in expert alignment on challenging open-ended queries; (ii) reliably penalizes contextually degraded responses; and (iii) when used as structured instructions, text feedback, or training rewards for performance optimization, yields up to ~66% relative gains on HealthBench for Gemini, GPT, and Qwen model families. RubricsTree thus provides a scalable, auditable, and evolving evaluation infrastructure required for the continuous optimization of product-level personal healthcare AI.
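To make the idea of routed, weighted Boolean rubrics concrete, here is a small sketch. It is an assumption-laden illustration: the abstract does not specify how the router or the weights work, so the tag-overlap `route` function, the `Rubric` fields, and the weighted-fraction `score` are hypothetical stand-ins rather than the RubricsTree method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    name: str
    weight: float
    topics: frozenset  # hypothetical routing tags, not from the paper


def route(rubrics, query_topics):
    """Stand-in router: keep rubrics whose tags overlap the query's topics."""
    return [r for r in rubrics if r.topics & query_topics]


def score(active, verdicts):
    """Weighted fraction of active rubrics whose Boolean verdict is True."""
    total = sum(r.weight for r in active)
    if total == 0:
        return 0.0
    passed = sum(r.weight for r in active if verdicts.get(r.name, False))
    return passed / total


rubrics = [
    Rubric("mentions_sleep_duration", 2.0, frozenset({"sleep"})),
    Rubric("checks_drug_interactions", 3.0, frozenset({"medication"})),
    Rubric("cites_user_sensor_data", 1.0, frozenset({"sleep", "activity"})),
]
active = route(rubrics, {"sleep"})
verdicts = {"mentions_sleep_duration": True, "cites_user_sensor_data": False}
result = score(active, verdicts)  # 2.0 of 3.0 active weight passes
```

Scoring only the routed subset keeps irrelevant rubrics (here the medication check) from penalizing a response to a sleep question.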
[NLP-7] Learning from the Self-future: On-policy Self-distillation for dLLMs
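[Quick read]: This paper addresses the fact that existing on-policy self-distillation (OPSD) methods cannot be applied effectively to diffusion large language models (dLLMs). Conventional OPSD is built around autoregressive generation: it relies on left-to-right prefix conditioning and token-level divergence supervision, which conflicts with the arbitrary-order, iterative denoising generation of dLLMs. The paper proposes d-OPSD, the first OPSD framework designed for dLLMs, with two core ideas. First, it rebuilds self-teacher construction by using self-generated answers as suffix conditioning, so the student learns from "self future-experience" rather than privileged prefixes. Second, it moves supervision from the token level to the step level, in line with the iterative denoising process of dLLMs. Experiments on four reasoning benchmarks show that d-OPSD consistently outperforms reinforcement learning with verifiable rewards (RLVR) and supervised fine-tuning (SFT) baselines with better sample efficiency, needing only about 10% of the optimization steps of RLVR.
Link: https://arxiv.org/abs/2606.18195
Authors: Yifu Luo,Zeyu Chen,Haoyu Wang,Xinhao Hu,Yuxuan Zhang,Zhizhou Sha,Shiwei Liu
Affiliations: Tsinghua University (清华大学); Technical University of Munich (慕尼黑工业大学); Nanyang Technological University (南洋理工大学); University of British Columbia (不列颠哥伦比亚大学); University of Texas at Austin (德克萨斯大学奥斯汀分校); ELLIS Institute Tubingen (图宾根艾利斯研究所); Max Planck Institute for Intelligent Systems (马克斯普朗克智能系统研究所); Tubingen AI Center (图宾根人工智能中心)
Categories: Computation and Language (cs.CL)
Comments: Preprint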
Abstract:On-policy self-distillation (OPSD) has proven effective for post-training large language models (LLMs), yet its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric. They inject privileged information via left-to-right prefix conditioning with token-level divergence supervision, a design that fundamentally conflicts with the arbitrary-order generation of dLLMs. We introduce d-OPSD, the first OPSD framework tailored for dLLMs. Our approach makes two core contributions. First, we reframe self-teacher construction by using self-generated answers as suffix conditioning, enabling the student model to learn from “self future-experience” rather than privileged prefixes. Second, we shift supervision from token-level to step-level, aligning training with the iterative denoising process of dLLMs. Experiments across four reasoning benchmarks show that d-OPSD consistently outperforms RLVR and SFT baselines with superior sample efficiency, requiring only around 10% of the optimization steps by RLVR and opening a promising pathway for dLLM post-training. The code is available at this https URL.
[NLP-8] A Red-Team Study of Anthropic Fable 5 Opus 4.8 Models
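[Quick read]: This paper examines how well frontier large language models (LLMs) withstand automated jailbreak attacks. It evaluates two Anthropic models, Fable 5 and Opus 4.8, against four families of automated jailbreak attacks on 7,826 harmful intents spanning a ten-category harm taxonomy. The approach uses the HackAgent red-teaming framework to generate hundreds of thousands of adversarial attempts, and every apparent success is independently re-adjudicated by a panel of three judge models with a majority vote. Both models resist most attacks, but the residual attack surface is larger than aggregate numbers suggest: it is dominated by adaptive iterative attacks, while static obfuscation is almost fully neutralised. The strongest adaptive search (tree-of-attacks) breaks Opus 4.8 on 11.5% of intents, while Fable 5 stays in single digits (6.1% worst case). With no human expert in the loop, an attacker model located panel-confirmed harmful completions across every harm category automatically, cheaply, and within the first one or two refinement steps: 1,620 for Opus 4.8 and 702 for Fable 5. The paper concludes that even the most heavily tested frontier models remain reliably breakable under sustained automated pressure.
Link: https://arxiv.org/abs/2606.18193
Authors: Nicola Franco
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: White paper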
Abstract:We evaluate the adversarial robustness of two frontier large language models (LLMs) developed by Anthropic, Fable 5 and Opus 4.8, against four families of automated jailbreak attacks across 7,826 harmful intents spanning a ten-category harm taxonomy. Using the HackAgent red-teaming framework, hundreds of thousands of adversarial attempts were generated and every apparent success was independently re-adjudicated by a panel of three judge models (majority vote). Both models resist the majority of attacks, but the residual surface is larger than aggregate framing suggests: it is dominated by adaptive iterative attacks, while static obfuscation is near-fully neutralised. The strongest adaptive search (tree-of-attacks) breaks Opus 4.8 on 11.5% of intents overall, whereas Fable 5 stays in the single digits (6.1% worst-case). Aggregate rates therefore should not be read as reassurance. Even in these hardened configurations, the two models produced 1,620 (Opus 4.8) and 702 (Fable 5) panel-confirmed harmful completions spanning every harm category, located automatically, cheaply, and within the first one or two refinement steps by an attacker model with no human expert in the loop. The reasonable conclusion is that even the best, most-tested frontier models remain reliably breakable under sustained automated pressure.
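The three-judge majority vote mentioned in this abstract can be illustrated with a short sketch. The function names and the per-intent success definition below (an intent counts as broken if any attempt is panel-confirmed) are assumptions for illustration; the abstract does not spell out how the per-intent rate is computed.

```python
def panel_confirms(judge_votes):
    """True when a strict majority of judges flag a completion as harmful."""
    return sum(judge_votes) > len(judge_votes) / 2


def attack_success_rate(votes_per_intent, total_intents):
    """Share of intents with at least one panel-confirmed harmful completion.

    votes_per_intent maps an intent id to a list of attempts, where each
    attempt is a tuple of Boolean judge votes (an assumed data layout).
    """
    broken = sum(
        any(panel_confirms(votes) for votes in attempts)
        for attempts in votes_per_intent.values()
    )
    return broken / total_intents


votes_per_intent = {
    "intent-a": [(True, True, False)],                         # 2 of 3 judges agree
    "intent-b": [(True, False, False), (False, False, False)],  # never confirmed
}
rate = attack_success_rate(votes_per_intent, total_intents=4)
```

Requiring a strict majority means a single permissive judge cannot turn an apparent success into a confirmed one.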
[NLP-9] The Measurement Gap in the Automation of EU Law: Benchmarking Doctrinal Legal Reasoning under the EU AI Act
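[Quick read]: This paper addresses a core gap in current legal AI evaluation: there is no benchmark that measures whether generated legal text reflects doctrinal legal reasoning. Existing benchmarks mostly cover ancillary, paralegal tasks and do not evaluate the interpretation of legal norms and application of legal principles that form the core of legal work. The gap is not only methodological but legal: the EU AI Act makes "appropriate accuracy" a binding requirement for high-risk AI used in the judicial domain, yet without a doctrinal-reasoning benchmark that requirement cannot acquire operational content. The remedy the paper points to is a benchmark that measures how generative AI performs at legal interpretation and reasoning.
Link: https://arxiv.org/abs/2606.18158
Authors: Michèle Finck
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: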
Abstract:Large language models now produce legal text of at least median quality, yet no existing benchmark can evaluate whether they perform doctrinal legal reasoning, which forms the interpretive core of legal work, rather than the ancillary, paralegal tasks that most current legal-AI evaluations measure. This measurement gap is not only methodological but legal: the EU AI Act makes “appropriate accuracy” a binding requirement for high-risk AI used in the judicial domain, yet that requirement cannot acquire operational content without the very doctrinal-reasoning benchmark the field lacks.
[NLP-10] Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models
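[Quick read]: This paper asks whether the animal-welfare reasoning that current AI models show in text carries over into concrete actions when they act as agents. Existing evaluations mainly judge text responses to question-answer prompts and cannot verify whether a tool-using model actually avoids options involving animal exploitation. The paper proposes TAC (Travel Agent Compassion), the first agentic benchmark for this question: twelve hand-authored travel booking scenarios across six categories of animal exploitation, augmented to forty-eight samples to control for price, rating, and position confounds, used to evaluate seven frontier models from four labs. Every model scores below the chance level of 64%, with the best performer, Claude Opus 4.7, at 53%. Adding a single welfare-aware sentence to the system prompt yields gains of 47 to 63 percentage points in Claude and GPT-5.5, 26 points in GPT-5.2, and under 12 points in DeepSeek and Gemini. An auxiliary Inspect Scout audit of 288 base-condition transcripts from the top two performers, with Gemini 2.5 Flash Lite as judge, flags no transcripts for evaluation awareness, suggesting the below-chance rates do not stem from the models recognising the evaluation. The paper discusses category-level variation across cultural domains, the limits of text-response welfare benchmarks, and the EU General-Purpose AI Code of Practice systemic risk framework.
Link: https://arxiv.org/abs/2606.18142
Authors: Jasmine Brazilek,Oliver Tulio,Joel Christoph,Miles Tidmarsh,Carol Kline,Arturs Kanepajs
Affiliations: Compassion Aligned Machine Learning; Sentient Futures; Harvard Kennedy School; Appalachian State University; Google(谷歌)
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: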
Abstract:AI agents are moving from advisors to actors, booking travel, planning menus, and running procurement on behalf of users. Existing benchmarks for AI and animal welfare evaluate model text responses to question-answer prompts, leaving open whether the welfare reasoning surfaced in those responses transfers to agentic deployment where the model must take actions with tools. We introduce TAC (Travel Agent Compassion), the first agentic benchmark measuring whether AI agents avoid options involving animal exploitation when acting on behalf of users. TAC presents an AI agent with twelve hand-authored travel booking scenarios across six categories of animal exploitation, augmented to forty-eight samples to control for price, rating, and position confounds. We evaluate seven frontier models from four labs. Every model scores below the chance level of sixty-four percent, with the best performer (Claude Opus 4.7) at fifty-three percent. A single welfare-aware sentence in the system prompt yields gains of forty-seven to sixty-three percentage points in Claude and GPT-5.5, twenty-six points in GPT-5.2, and under twelve points in DeepSeek and Gemini. An auxiliary Inspect Scout audit of 288 base-condition transcripts from the top two performers, using Gemini 2.5 Flash Lite as judge, flags zero transcripts for evaluation awareness, suggesting the below-chance rates do not stem from the models recognising the evaluation. We discuss implications for category-level variation across cultural domains, the limits of text-response welfare benchmarks, and the EU General-Purpose AI Code of Practice systemic risk framework.
[NLP-11] Unintended Effects of Geographic Conditioning in Large Language Models ACL2026
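[Quick read]: This paper addresses the unintended regional bias introduced when conversational AI systems use user metadata such as location to localize responses. The core problem is location leakage: a model generates geographic references even though the user prompt is geographically neutral. Across creative writing and open-ended QA prompts, models exposed to location metadata systematically favor region-specific outputs, with leakage rising by up to 793 times above baseline. The key finding is a structural conditioning effect: even when the injected location is replaced with the placeholder "Unknown", leakage still rises by up to 72 times above baseline, showing that the user profile frame itself, independent of any geographic content, acts as a generative conditioning signal.
Link: https://arxiv.org/abs/2606.18124
Authors: Naz Col,David M. Chan
Affiliations: University of California, Berkeley (加州大学伯克利分校)
Categories: Computation and Language (cs.CL)
Comments: To appear at the Second Workshop on Customizable NLP (CustomNLP4U) at ACL 2026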
Abstract:Modern conversational AI systems frequently rely on user metadata to localize responses, yet the unintended regional biases introduced by this hidden context remain poorly understood. In this work, we evaluate location leakage: the phenomenon where a model generates geographic references despite receiving a geographically neutral user prompt. Across both creative writing and open-ended QA prompts, even state-of-the-art LLMs systematically favor region-specific outputs when exposed to location metadata, with leakage spiking by up to 793 times above baseline (e.g., from 0.04% to 31.7% for Llama 3.1-8B, and 21.3% and 8.8% for Qwen3-8B and Claude Sonnet 4.6, respectively). Our analysis further shows a novel structural conditioning effect: replacing the injected location with the placeholder “Unknown” still elevates leakage by up to 72 times above baseline, demonstrating that the user profile frame itself, independent of any geographic content, acts as a generative conditioning signal.
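As a quick check on the headline figure, the fold increase can be recomputed from the two percentages quoted for Llama 3.1-8B. The helper names below are illustrative, and the inputs are the rounded values from the abstract, so the result only approximates the reported 793 times.

```python
def leakage_rate(has_geo_reference):
    """Fraction of responses that contain a geographic reference."""
    return sum(has_geo_reference) / len(has_geo_reference)


def fold_increase(baseline_pct, conditioned_pct):
    """How many times higher the conditioned rate is than the baseline."""
    return conditioned_pct / baseline_pct


# Rounded figures quoted in the abstract for Llama 3.1-8B.
llama_fold = fold_increase(0.04, 31.7)
print(f"{llama_fold:.1f}x")  # about 792.5x, consistent with "up to 793 times"
```

The small gap between 792.5 and 793 is what one would expect from rounding the underlying percentages before dividing.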
[NLP-12] Structural Role Injection in Handlebars-Templated LLM Prompts: Triple-Brace Interpolation, Delimiter Family, and the Limits of HTML Auto-Escaping
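[Quick read]: This paper addresses structural role injection in LLM applications that build prompts with a templating engine such as Handlebars. The core issue is that the Handlebars double-brace expression (`{{x}}`) HTML-escapes the interpolated value and is documented as the safe default, but that escaping rewrites angle brackets and leaves square brackets, colons, and Markdown hashes untouched. As a result, role delimiters such as Llama-2 [INST], legacy Human:/Assistant:, and Markdown ### keep their structure after escaping and can be used by an attacker to forge a higher-privilege turn. The key message is that template-engine character escaping does not give general structural protection; an effective defense needs a structural separation of instruction and data. The study runs 5760 trials across seven delimiter families, two attack objectives, and four models (GPT-3.5 Turbo, GPT-4o mini, GPT-4.1 mini, Claude Haiku 4.5), showing that the escaped default protects only the delimiter schemes whose characters HTML escaping happens to cover and gives no protection against colon- and Markdown-based delimiters.
Link: https://arxiv.org/abs/2606.18120
Authors: Mohammadreza Rashidi
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 7 pages, 6 figures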
Abstract:Large language model applications build prompts from templates, and Handlebars is a widely used templating engine and the default prompt-template format in Microsoft Semantic Kernel. Its double-brace {{x}} expression HTML-escapes the interpolated value and is documented as the safe default; its triple-brace {{{x}}} expression inserts the value raw. We show that this choice silently governs an application’s exposure to structural role injection, where attacker-controlled data carries chat role delimiters that forge a higher-privilege turn. A model-free analysis establishes the mechanism: Handlebars escaping rewrites angle brackets but not square brackets, colons, or Markdown hashes, so it neutralises ChatML, Llama-3, and XML role delimiters (survival rate 0.00) while leaving Llama-2 [INST], legacy Human:/Assistant:, and Markdown ### delimiters intact (survival rate 1.00 for the last two). We then run 5760 trials across seven delimiter families, two attack objectives, and four models (GPT-3.5 Turbo, GPT-4o mini, GPT-4.1 mini, Claude Haiku 4.5) at a combined API cost of 1.63 USD. GPT-3.5 Turbo follows the task-hijack instruction in 97% of raw and 91% of escaped trials, with the escaping protection concentrated in the angle-bracket families and absent for the colon- and Markdown-based families; the harder secret-exfiltration objective, which does not saturate, exposes the same family interaction more cleanly. Claude Haiku 4.5 resists both objectives almost entirely. The escaped default protects only the delimiter schemes whose characters HTML escaping happens to cover, gives no protection for the rest, and cannot substitute for a structural separation of instruction and data.
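The model-free part of this argument is easy to reproduce in miniature. The sketch below emulates HTML escaping in Python rather than calling Handlebars itself: the escape table is copied by hand for illustration, and the delimiter strings are examples chosen here, not the paper's test set.

```python
# Characters rewritten by Handlebars-style HTML escaping
# (hand-copied table, an emulation rather than the library itself).
HANDLEBARS_ESCAPES = {
    "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;",
    "'": "&#x27;", "`": "&#x60;", "=": "&#x3D;",
}


def escape_like_handlebars(value):
    """Escape a string the way a double-brace interpolation would."""
    return "".join(HANDLEBARS_ESCAPES.get(ch, ch) for ch in value)


def survives(delimiter):
    """True if escaping leaves the delimiter byte-for-byte unchanged."""
    return escape_like_handlebars(delimiter) == delimiter


# Illustrative delimiters for a few families (assumed examples).
DELIMITERS = {
    "ChatML": "<|im_start|>system",
    "XML": "<system>",
    "Llama-2": "[INST]",
    "Human/Assistant": "\n\nHuman:",
    "Markdown": "### System",
}
for family, delimiter in DELIMITERS.items():
    print(f"{family:16s} survives escaping: {survives(delimiter)}")
```

Only the angle-bracket families change under escaping; square brackets, colons, and hashes pass through untouched. Real Llama-2 prompts also contain `<<SYS>>` markers, which do include angle brackets and would be rewritten.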
[NLP-13] PseudoBench: Measuring How Agentic Auto-Research Fuels Pseudoscience
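[Quick read]: This paper addresses the inability of LLM-based agents to recognize and resist pseudoscience during autonomous scientific research. As such agents take part in automated research, failure to identify pseudoscientific narratives may lead them to generate plausible yet misleading studies that contaminate academic literature and erode trust in science. The paper presents PseudoBench, an adversarial benchmark that evaluates whether agents identify and resist pseudoscientific claims across an end-to-end research pipeline from experiments to writing. It contains 200 curated pseudoscientific claim-evidence pairs across five domains. Testing seven state-of-the-art agents shows near-zero refusal rates and a highest resistance of only 27.4%, and stronger agents risk packaging pseudoscience in more sophisticated scientific language, increasing its apparent credibility. The findings point to a serious gap in scientific alignment that should be addressed before widespread deployment.
Link: https://arxiv.org/abs/2606.18060
Authors: Xinyang Liao,Lingyu Li,Huacan Liu,Tianle Gu,Yang Yao,Tong Zhu,Yan Teng,Yingchun Wang
Affiliations: Shanghai Artificial Intelligence Laboratory; Xi’an Jiao Tong University; Shanghai Jiao Tong University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 26 pages, 21 figures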
Abstract:As Large Language Model based agents enter autonomous scientific research, their ability to resist pseudoscience becomes increasingly important. Otherwise, such systems may rapidly generate plausible yet misleading studies that contaminate academic literature and erode trust in science. We present PseudoBench, an adversarial benchmark for evaluating whether agentic auto-research systems can identify and resist pseudoscientific narratives. PseudoBench contains 200 curated pseudoscientific claim-evidence pairs across five domains and evaluates agents through an end-to-end research pipeline from experiments to writing. Testing seven state-of-the-art agents, we find that current systems readily produce persuasive reports that align with pseudoscientific premises with near-zero refusal rates and the highest resistance of only 27.4%. Stronger agents risk packaging pseudoscience in more sophisticated scientific language, increasing its apparent credibility. These findings reveal an alarming capacity to fuel pseudoscience, calling for scientific alignment before widespread deployment.
[NLP-14] ConSA: Controllable Sparsity in Hybrid Attention via Learnable Allocation
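[Quick read]: This paper addresses the balance between efficiency and quality in hybrid attention architectures that combine full attention (FA) and sliding-window attention (SWA) for LLM inference. Existing methods assign FA and SWA with hand-crafted rules or simple heuristics and offer limited analysis of the underlying attention behaviors. The key to the solution is ConSA, a controllable-sparsity hybrid attention framework that uses L0 regularization to learn a binary mask choosing FA or SWA for each attention unit, with an augmented Lagrangian constraint enforcing a user-specified sparsity target at layer or KV-head granularity. Experiments on two LLMs at the 0.6B and 1.7B scales show that learned allocations consistently outperform rule-based baselines and that KV-head-wise allocation beats layer-wise allocation. The learned patterns place SWA in the bottom layers and concentrate FA in contiguous middle-layer blocks, unlike the evenly interleaved patterns of rule-based methods. This structure persists across model scales, sparsity levels, and granularities, revealing fine-grained intrinsic attention behaviors.
Link: https://arxiv.org/abs/2606.18056
Authors: Yao Chen,Yinqi Yang,Junyuan Shang,Xiangzhao Hao,Simeng Zhang,Yilong Chen,Tingwen Liu,Shuohuan Wang,Dianhai Yu
Affiliations: Baidu Inc.
Categories: Computation and Language (cs.CL)
Comments: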
Abstract:Hybrid architectures combining full attention (FA) and sliding-window attention (SWA) are a promising paradigm for efficient LLM inference. However, existing methods typically rely on hand-crafted rules or simple post-hoc heuristics for FA/SWA allocation and offer limited analysis of the attention behaviors underlying these designs. We propose Controllable Sparsity in Hybrid Attention (ConSA), a framework that learns optimal FA/SWA assignment under a user-specified sparsity target. ConSA employs L0 regularization to learn binary masks selecting between FA and SWA for each attention unit, while an augmented Lagrangian constraint enforces the target sparsity at either layer or KV-head granularity. We evaluate ConSA on two LLMs at the 0.6B and 1.7B scales. Learned allocations consistently outperform rule-based baselines, with KV-head-wise allocation yielding clear gains over layer-wise allocation. The learned patterns place SWA in the bottom layers and concentrate FA into contiguous middle-layer blocks, diverging from evenly interleaved patterns in rule-based methods. This structure persists across model scales, sparsity levels, and allocation granularities, revealing a fine-grained spectrum of intrinsic attention behaviors that underlies the learned allocation.
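[NLP-15] Compositional Skill Routing for LLM Agents: Decompose, Retrieve, and Compose
[Quick read]: This paper addresses the difficulty LLM agents have in composing multiple external skills for complex real-world tasks. Existing methods mostly select a single skill, whereas real tasks often require decomposing the user request into atomic sub-tasks and matching each to the right skill to produce an executable composite plan. The paper formalizes this as the Compositional Skill Routing problem: given a complex user query and a large skill library, decompose the query, retrieve skills, and build a plan. The core solution is SkillWeaver, a framework with three parts: an LLM task decomposer, a bi-encoder skill retriever with FAISS indexing, and a dependency-aware directed acyclic graph (DAG) planner. For evaluation, the paper introduces CompSkillBench, with 300 compositional queries over 2,209 real skills from the public MCP ecosystem, spanning 24 functional categories. Experiments identify decomposition quality as the main bottleneck: standard LLM decomposition reaches only 34.2% category recall at the step level. To address this, the paper proposes Iterative Skill-Aware Decomposition (SAD), a retrieval-augmented feedback loop that aligns decomposition with the available skills. In a single iteration SAD raises decomposition accuracy (DA) from 51.0% to 67.7% (a 32.7% relative gain, Wilcoxon p < 10^-6), and DA-conditioned analysis indicates that correct granularity is a prerequisite for effective retrieval (category recall@1 rises from 34% to 41% when DA = 1). SkillWeaver also cuts context-window consumption by over 99%, and transfer experiments show generalization (a 35.6% relative DA gain even when the target categories are absent from the retrieval pool).
Link: https://arxiv.org/abs/2606.18051
Authors: Xueping Gao
Affiliations: Alibaba Cloud
Categories: Computation and Language (cs.CL)
Comments: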
Abstract:LLM agents increasingly rely on external skills – reusable tool specifications – but real-world tasks often require composing multiple skills, not just selecting one. We formalize this as the Compositional Skill Routing problem: given a complex user query and a large skill library, decompose the query into atomic sub-tasks, retrieve the appropriate skill for each sub-task, and compose an executable plan. We present SkillWeaver, a decompose-retrieve-compose framework combining an LLM task decomposer, a bi-encoder skill retriever with FAISS indexing, and a dependency-aware DAG planner. To support evaluation, we introduce CompSkillBench, a benchmark of 300 compositional queries over 2,209 real MCP server skills spanning 24 functional categories, sourced from the public MCP ecosystem. Our experiments reveal that task decomposition quality is the primary bottleneck: standard LLM decomposition reaches only 34.2% category recall at the step level. To address this, we propose Iterative Skill-Aware Decomposition (SAD), a retrieval-augmented feedback loop that iteratively aligns decomposition with available skills. SAD improves decomposition accuracy from 51.0% to 67.7% (+32.7% relative, Wilcoxon p < 10^-6) in a single iteration; DA-conditioned analysis confirms that correct granularity is the prerequisite for effective retrieval (CatR@1 rises from 34% to 41% when DA=1). SkillWeaver reduces context window consumption by over 99%, and transfer experiments confirm generalization (+35.6% relative DA gain even when target categories are absent from the retrieval pool).
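The decompose-retrieve-compose flow can be sketched in a few lines. This is a toy under stated assumptions, not SkillWeaver itself: the decomposition is written by hand rather than produced by an LLM, word overlap stands in for the bi-encoder and FAISS retriever, and the skill names and descriptions in `SKILLS` are hypothetical. Only the dependency-aware ordering uses a real mechanism, the standard-library `graphlib.TopologicalSorter`.

```python
from graphlib import TopologicalSorter

# Hypothetical skill library: name -> short description.
SKILLS = {
    "weather_api": "fetch weather forecast for a city",
    "calendar_api": "create calendar event with time and title",
    "email_api": "send email message to a recipient",
}


def retrieve(sub_task: str) -> str:
    """Pick the skill whose description shares the most words with the sub-task."""
    words = set(sub_task.lower().split())
    return max(SKILLS, key=lambda name: len(words & set(SKILLS[name].split())))


def compose(steps: dict[str, tuple[str, set[str]]]) -> list[tuple[str, str]]:
    """Order sub-tasks so every step runs after the steps it depends on."""
    graph = {step: deps for step, (_, deps) in steps.items()}
    order = TopologicalSorter(graph).static_order()
    return [(step, retrieve(steps[step][0])) for step in order]


# A decomposed query: step id -> (sub-task text, ids of prerequisite steps).
plan = compose({
    "s1": ("fetch the weather forecast for Berlin", set()),
    "s2": ("create a calendar event for the picnic", {"s1"}),
    "s3": ("send an email to the team", {"s2"}),
})
```

Here `plan` lists each step with its matched skill in an order that respects the dependencies, which is the shape of output a DAG planner would hand to an executor.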
[NLP-16] When English Isn't the Best Teacher: Source Language Effects in Cross-Lingual In-Context Learning ACL2026
[Quick read]: This paper addresses how to choose source languages for cross-lingual in-context learning (ICL), in particular for multilingual NLP tasks in few-shot settings. Conventional practice draws on lessons from supervised fine-tuning, such as data availability and linguistic similarity, but whether these heuristics still hold under ICL has not been rigorously tested. Through a broad empirical study covering seven tasks, six models, and a typologically diverse set of languages, the authors find that fine-tuning-based expectations do not consistently hold in ICL. The study also analyzes language confusion, a key obstacle for generative tasks in cross-lingual ICL, and points to alternative heuristics for selecting source languages.
Link: https://arxiv.org/abs/2606.18033
Authors: Fred Philippy,Siwen Guo,Jacques Klein,Tegawendé F. Bissyandé
Affiliations: University of Luxembourg; Luxembourg Institute of Science and Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM 2026), co-located with ACL 2026
Abstract:Cross-lingual transfer in multilingual NLP has been widely explored in supervised fine-tuning contexts, where factors like data availability and linguistic similarity largely determine transfer quality. As the field shifts toward few-shot In-Context Learning (ICL), it is often presumed that insights from fine-tuning carry over unchanged. Yet this assumption has not been rigorously evaluated, leaving open the question of how to choose source languages for cross-lingual ICL. We conduct a broad empirical study of cross-lingual transfer in ICL spanning seven tasks, six models, and a typologically diverse set of languages. We further analyze language confusion, a key obstacle for generative tasks in cross-lingual ICL. Our results show that conventional fine-tuning-based expectations do not consistently apply in the ICL regime and point to alternative heuristics for selecting source languages effectively.
[NLP-17] VoidPadding: Let [VOID] Handle Padding in Masked Diffusion Language Models so that [EOS] Can Focus on Semantic Termination
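[Quick read]: This paper addresses the loss of decoding efficiency and generation quality caused by inaccurate response-length modeling during instruction tuning of masked diffusion language models (MDLMs), focusing in particular on the [EOS] overflow that appears under large-block decoding.
Link: https://arxiv.org/abs/2606.17999
Authors: Chunyu Liu,Zhengyang Fan,Kaisen Yang,Alex Lamb
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: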
Abstract:MDLMs generate text by denoising a preallocated masked response canvas, making response-length modeling central to instruction tuning. Existing MDLMs often inherit the autoregressive convention of using repeated [EOS] tokens for padding during instruction tuning, giving [EOS] a dual role as both a semantic terminator and a padding token. We show that this dual role is a root cause of [EOS] overflow under large-block decoding. To decouple these roles, we propose VoidPadding, which introduces [VOID] for padding and reserves [EOS] for termination. During inference, the learned [EOS] signal enables early stopping, while the learned [VOID] signal guides adaptive response canvas expansion. On Dream-7B-Instruct, VoidPadding improves the block-size-averaged four-task mean across mathematical reasoning and code generation benchmarks by +17.84 points over the original model and +6.95 points over RainbowPadding, while reducing decoding NFE by 55.7% on average. Code is available at this https URL.
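The difference between the two padding conventions is easiest to see on a training target. The sketch below is a minimal illustration, assuming a fixed-length canvas and exactly one [EOS] before the [VOID] run; the abstract does not state the exact layout, and both function names are hypothetical.

```python
EOS, VOID = "[EOS]", "[VOID]"


def eos_padded_target(response: list[str], canvas_len: int) -> list[str]:
    """Inherited convention: [EOS] ends the response and also fills the rest."""
    if canvas_len <= len(response):
        raise ValueError("canvas must be longer than the response")
    return response + [EOS] * (canvas_len - len(response))


def void_padded_target(response: list[str], canvas_len: int) -> list[str]:
    """Decoupled convention: one [EOS] for termination, [VOID] for padding."""
    if canvas_len <= len(response):
        raise ValueError("canvas must be longer than the response")
    pad = canvas_len - len(response) - 1
    return response + [EOS] + [VOID] * pad


old = eos_padded_target(["The", "answer", "is", "4"], 8)
new = void_padded_target(["The", "answer", "is", "4"], 8)
```

With the first layout most of the canvas is [EOS], so the token carries two meanings; with the second, [EOS] appears once and only marks where the response ends.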
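[NLP-18] Fine-tuning LLMs for Passive Depression Severity Estimation from AI Mental Health Dialogue
[Quick read]: This paper addresses the response bias and systematic missingness that arise in early detection of depressive symptom change because real-world completion rates for self-report instruments such as the Patient Health Questionnaire-9 (PHQ-9) are low. The core solution is passive monitoring: predicting the PHQ-9 total score directly from the text of conversations between users and an AI mental health application, with no additional clinical data. Technically, the authors fine-tune a Qwen3.5-27B backbone with a regression head and augment the 3,111 ground-truth labels with pseudolabels from a reasoning model (Claude Opus) and iteratively trained intermediate models, for a combined dataset of 6,283 users. On a held-out test set the best model reaches a mean absolute error (MAE) of 2.6, a root mean squared error (RMSE) of 4.0, a Pearson r of 0.80, and an AUC of 0.91 at the PHQ-9 = 10 clinical threshold, with AUC above 0.87 at every severity threshold from PHQ-9 = 3 to PHQ-9 = 24. The work points toward passive, continuous symptom monitoring on AI mental health platforms.
Link: https://arxiv.org/abs/2606.17973
Authors: Olivier Tieleman,Ziyi Zhu,Ting Su,Samuel J. Bell,Thomas D. Hull,Caitlin A. Stamatis
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 1 figure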
Abstract:Depression is the leading cause of disability worldwide, and early detection of symptom change is essential for timely intervention. Validated instruments such as the Patient Health Questionnaire-9 (PHQ-9) support symptom monitoring at scale, but real-world completion rates are low, introducing response bias and systematic missingness. Passive approaches that infer severity from routinely generated data could close this gap. We address this by predicting PHQ-9 total scores directly from transcripts of conversations between users and an AI mental health application, requiring only conversation text and no additional clinical data. We fine-tune a Qwen3.5-27B backbone with a regression head and augment 3,111 ground-truth labels with pseudolabels generated by a reasoning model (Claude Opus) and iteratively trained intermediate models, for a combined dataset of 6,283 users. On a held-out test set of 842 users, our best model achieves MAE = 2.6, RMSE = 4.0, Pearson r = 0.80, and AUC = 0.91 at the PHQ-9 = 10 clinical threshold. We also find AUC > 0.87 at every severity threshold from PHQ-9 = 3 to PHQ-9 = 24, demonstrating that the model captures depression severity across the full clinical spectrum. This work opens the door to passive, continuous symptom monitoring in AI mental health platforms, without requiring users to complete self-report measures.
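For readers less familiar with the regression metrics quoted above, here is a minimal sketch of how MAE, RMSE, and Pearson r are computed from true and predicted scores. The four score pairs are made-up illustration values, not data from the paper.

```python
import math


def mae(y_true: list[float], y_pred: list[float]) -> float:
    """Mean absolute error: average size of the prediction error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: list[float], y_pred: list[float]) -> float:
    """Root mean squared error: like MAE, but large errors count more."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson_r(y_true: list[float], y_pred: list[float]) -> float:
    """Linear correlation between true and predicted scores."""
    mt = sum(y_true) / len(y_true)
    mp = sum(y_pred) / len(y_pred)
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


# Hypothetical PHQ-9 totals (range 0 to 27) and model predictions.
true_scores = [0, 5, 10, 15]
pred_scores = [1, 5, 9, 17]
```

RMSE is never smaller than MAE on the same data, which is consistent with the reported 4.0 versus 2.6; a wide gap between the two suggests a few large errors rather than uniformly moderate ones.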
[NLP-19] Learning task-specific subspaces via interventional post-training of speech foundation models INTERSPEECH2026
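[Quick read]: This paper addresses the fact that speech foundation model representations encode salient speech variables, such as speaker identity and linguistic content, in a distributed and entangled way, while downstream tasks rely on only some of that variability. The core solution is a post-training refinement method based on interventional contrastive learning: using an interventional dataset and a multi-part contrastive loss, it learns a transformation from the entangled representation space into separate content and speaker subspaces. Evaluation on speaker verification and keyword spotting shows improved out-of-domain speaker verification and evidence that speaker and content information are separated across the learned subspaces.
Link: https://arxiv.org/abs/2606.17967
Authors: Jack Cox,Jon Barker
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted to Interspeech 2026; 6 pages (4 main body), 2 figures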
Abstract:Speech foundation models, pre-trained on large corpora of unlabelled speech data, produce general-purpose representations which are useful across tasks. However, these representations encode information about salient speech variables in a distributed manner, while downstream speech tasks rely on only some of this variability. In this work, we propose a post-training refinement approach using interventional contrastive learning. By leveraging an interventional dataset and multi-part contrastive loss, we learn a transformation from the entangled representation space of speech foundation models into separate content and speaker subspaces. We evaluate the learnt representations on speaker verification and keyword spotting tasks, showing improved out-of-domain speaker verification performance and evidence that speaker and content information are separated across the learned subspaces.
[NLP-20] ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions
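[Quick read]: This paper addresses the robustness of LLM logical reasoning across languages, specifically whether models keep consistent reasoning performance when the same latent logical structure is expressed in English and in diverse Chinese surface realizations. The key to the solution is ChLogic, an English-Chinese aligned benchmark built from formal logical templates, with three data sets: a General aligned set, a Difficult aligned set, and a Chinese-only set. Each aligned item pairs one English reference expression with five Chinese realizations. Experiments show that back-translating standard Chinese into English often helps on the General aligned set but has mixed effects on the Difficult aligned set, where Qwen3-32B and GLM-5.1 perform worse after translation. The results indicate that Chinese surface realization, translation artifacts, and model-specific behavior jointly affect multilingual logical reasoning, and ChLogic serves as a stress test for that robustness.
Link: https://arxiv.org/abs/2606.17905
Authors: Peixian Zhou,Yuxu Chen,Chaorui Zhang,Wei Han,Bo Bai,Xueyan Niu
Affiliations: Huawei Technologies Co., Ltd
Categories: Computation and Language (cs.CL)
Comments: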
Abstract:Large language models perform increasingly well on standardized logical reasoning benchmarks, but whether this ability remains robust beyond English is unclear. We introduce ChLogic, an English–Chinese aligned benchmark that tests whether models preserve logical reasoning performance when the same latent logical structure is expressed in English and diverse Chinese surface realizations. Built from formal logical templates, the benchmark contains three data sets: (i) the General aligned set, derived from 60 General Propositions across nine template families; (ii) the Difficult aligned set, derived from 40 Difficult Problems; and (iii) the Chinese-only set, covering 15 language-specific phenomenon types. Each aligned item pairs one English reference expression with five Chinese realizations. Experiments on Qwen3, Ministral, and GLM models reveal a persistent English–Chinese performance gap. Back-translation from standard Chinese into English often improves performance on the General aligned set, but produces mixed effects on the Difficult aligned set, where Qwen3-32B and GLM-5.1 perform worse after translation. These results indicate that Chinese surface realization, translation artifacts, and model-specific behavior jointly affect multilingual logical reasoning. Overall, ChLogic provides a useful stress test for the robustness of multilingual reasoning.
[NLP-21] Dynamic Rollout Editing for Reducing Overthinking in RL-Trained Reasoning Models
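[Quick read]: This paper addresses overthinking in LLMs on complex tasks, where a model keeps generating redundant reasoning after a correct answer has emerged. The core issue is that in GRPO-style reinforcement learning (RL) post-training, sequence-level credit assignment cannot distinguish the prefix that reaches the correct answer from the unnecessary continuation that follows, so both receive positive update signal and an initial imbalance can grow into more severe overthinking. The key to the solution is Dynamic Rollout Editing (DRE), a training-time intervention for successful trajectories that continue thinking after the answer has emerged: it preserves the accepted verified prefix, edits the remaining thinking, and prefers the edited trajectory within the same RL group, weakening the preference signal for unnecessary thinking without penalizing the reasoning needed to reach the answer. Experiments across diverse tasks show that DRE is effective.
Link: https://arxiv.org/abs/2606.17890
Authors: Zihao Wei,Wenjie Shi,Liang Pang,Jingcheng Deng,Shicheng Xu,Shasha Guo,Zenghao Duan,Jiahao Liu,Jingang Wang,Huawei Shen,Xueqi Cheng
Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Categories: Computation and Language (cs.CL)
Comments: 21 pages, 10 figures, 2 tables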
Abstract:Long-form chain-of-thought reasoning can improve LLM performance on complex tasks, but models often continue generating unnecessary reasoning after a correct answer has emerged. We refer to this behavior as overthinking. We study this phenomenon from the perspective of GRPO-style reinforcement learning (RL) post-training, framing it as a training-time credit-assignment problem rather than merely a decoding-time stopping problem. In rollouts sampled at the onset of GRPO training, we observe that successful trajectories can exhibit a slightly higher degree of overthinking than unsuccessful trajectories for the same prompts. This early imbalance provides a starting point for an undesirable feedback loop: because GRPO assigns sequence-level credit, it cannot distinguish the solution-reaching prefix from the unnecessary continuation that lengthens a successful trajectory. Both receive positive update signal, allowing the initial imbalance to grow into more severe overthinking during training. To address this issue, we introduce Dynamic Rollout Editing (DRE), a training-time intervention for successful trajectories that continue thinking after answer emergence. DRE preserves the accepted verified prefix, edits the remaining thinking, and prefers the edited trajectory within the same RL group, weakening the preference signal for unnecessary thinking without penalizing the reasoning needed to reach the answer. Experiments across diverse tasks show the effectiveness of DRE.
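The credit-assignment issue the abstract describes can be made concrete with a small sketch of group-relative advantages. It assumes the common GRPO-style normalization (reward minus group mean, divided by group standard deviation); the use of the population standard deviation, the `eps` constant, and the function names are illustrative choices, and DRE's editing step itself is not reproduced.

```python
import statistics


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward against the other rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def token_credit(advantage: float, num_tokens: int) -> list[float]:
    """Sequence-level credit: every token in the rollout gets the same advantage."""
    return [advantage] * num_tokens


# Four rollouts for one prompt: two correct (reward 1), two incorrect (reward 0).
advantages = group_advantages([1.0, 1.0, 0.0, 0.0])

# A correct rollout with 40 tokens up to the answer and 60 tokens of continuation.
credit = token_credit(advantages[0], 40 + 60)
prefix_credit, continuation_credit = credit[:40], credit[40:]
```

Because `prefix_credit` and `continuation_credit` hold identical positive values, the update reinforces the unnecessary continuation as strongly as the reasoning that reached the answer, which is the imbalance an intervention such as DRE targets.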
[NLP-22] GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?
[Quick read]: This paper addresses end-to-end game generation by coding agents: turning a natural-language gameplay specification into a complete, playable game that runs in a target game engine such as Godot. The core challenge is that game generation involves more than writing code; scripts, scenes, assets, rendering, and runtime player interaction must work together inside the engine to produce coherent gameplay. The key to the solution is an interaction-grounded evaluation framework, instantiated as GameCraft-Bench, that assesses executable gameplay through replayed demonstrations and rubric-guided multimodal judging. The framework rests on three desiderata, Engine Grounding, Artifact Completeness, and Interactive Verification, and the benchmark comprises 140 tasks across 15 game families. Frontier coding agents still struggle: the strongest reaches only 41.46%, and while agents often implement recognizable mechanics, they fail to deliver sufficient content, functional visual feedback, and coherent presentation.
Link: https://arxiv.org/abs/2606.17861
Authors: Tongxu Luo,Rongsheng Wang,Jiaxi Bi,Chenming Xu,Zhengyang Tang,Jianlong Chen,Juhao Liang,Ke Ji,Shuqi Guo,Yuhao Du,Fan Bu,Wenyu Du,Xiaotong Zhang,Kyle Li,Shaobo Wang,Linfeng Zhang,Yuxuan Liu,Xin Lai,Chenxin Li,Yiduo Guo,Zhexin Zhang,Xinyuan Wang,Tianyi Bai,Ziniu Li,Benyou Wang
Affiliations: The Chinese University of Hong Kong; Shenzhen Loop Area Institute; Hunyuan Team; Tencent; USTB; DualverseAI; SJTU; NUS
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets, rendering, and runtime interactions must jointly produce coherent gameplay. We formalize end-to-end game generation as the problem of producing a complete game artifact that realizes a specification through observable player-game interaction in a target environment. We argue that evaluating this setting requires three desiderata: Engine Grounding, Artifact Completeness, and Interactive Verification. We propose an interaction-grounded evaluation framework that assesses executable gameplay through replayed demonstrations and rubric-guided multimodal judging. We instantiate this framework as GameCraft-Bench, a benchmark comprising 140 Godot tasks across 15 game families. Evaluations of frontier coding agents show that end-to-end game generation remains highly challenging: the strongest agent achieves only 41.46%, and most agents score below 40%. Further analysis reveals that while agents often implement recognizable mechanics, they struggle to deliver complete games with sufficient content, functional visual feedback, and coherent presentation. See this https URL for demos, code, and data.
[NLP-23] Environment-Grounded Automated Prompt Optimization for LLM Game Agents
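[Quick read]: This paper addresses the fact that LLM agents in interactive environments are highly sensitive to their prompts while prompt engineering remains manual and task-specific. The core solution is an automated prompt optimization framework that decomposes the observation-to-action pipeline into a goal-conditioned descriptor agent and an action selection agent, and iteratively refines each module's prompt through an LLM-driven evolutionary loop guided by environment returns. The key components are a behavior analyzer, which attributes episode outcomes to specific prompt components, and a mutator, which proposes targeted prompt revisions that are then validated through environment rollouts. On all five BabyAI tasks in the BALROG benchmark, the method improves performance without updating model weights; on PutNext, a multi-step coordination task where RobustCoTAgent achieves 0% success, it reaches up to a 72.5% success rate. The results suggest that a multi-agent framework combined with automatic prompt optimization can enhance LLMs without fine-tuning or extensive human supervision.
Link: https://arxiv.org/abs/2606.17838
Authors: Rean Clive Fernandes,Lukas Fehring,Theresa Eimer,Marius Lindauer,Matthias Feurer
Affiliations: Lamarr Institute for ML and AI; TU Dortmund University; Leibniz University Hannover; L3S Research Center
Categories: Computation and Language (cs.CL)
Comments: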
Abstract:LLM agents in interactive environments are highly sensitive to their prompts, yet prompt engineering remains a manual, task-specific process. We introduce an automated prompt optimization framework for LLM agents that decomposes the observation-to-action pipeline into a goal-conditioned descriptor agent and an action selection agent, and iteratively refines each module’s prompt through an LLM-driven evolutionary loop guided by environment returns. We propose a behavior analyzer to attribute episode outcomes to specific prompt components, and a mutator to propose targeted revisions to the prompt, before validating them through environment rollouts. We evaluate on all five BabyAI tasks in the BALROG benchmark, comparing our pipeline against BALROG’s RobustCoTAgent under both plain and guided prompt initializations. Optimization improves performance consistently across tasks and conditions, without requiring updates to the model weights. On PutNext, a multi-step coordination task where the RobustCoTAgent achieves 0% success, our framework reaches up to 72.5% success rate using the same underlying LLM with optimized prompts. These results suggest that a multi-agent framework, combined with automatic prompt optimization, enhances LLMs without the need for fine-tuning or extensive human supervision.
[NLP-24] Perceptual compensation for tonal context in self-supervised speech models INTERSPEECH2026
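[Quick read]: This paper investigates whether the wav2vec2.0 architecture shows compensation for phonological context in Mandarin tone perception. Through a pseudo-replication of a Mandarin tone perception experiment, it compares embedding similarities and probing classifier outputs between a purely self-supervised pre-trained model and a model fine-tuned for Mandarin automatic speech recognition (ASR). No compensation appears in the embedding similarities of the purely pre-trained model; probing classifiers show some evidence of compensation alongside the expected layer-wise improvements in categorization, but fail to replicate human performance on isolated test syllables. The findings contrast with earlier reports that sensitivity to phonological structure emerges through self-supervised pre-training alone, and suggest that supervised objectives may be needed for at least some types of phonological regularities to be abstracted.
Link: https://arxiv.org/abs/2606.17835
Authors: James Kirby,Ioana Krehan,Michele Gubian
Affiliations: Institute for Phonetics and Speech Processing, LMU Munich, Germany
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted for publication at Interspeech 2026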
Abstract:This study examines the extent to which the wav2vec2.0 architecture exhibits evidence of compensation for phonological context. We conducted a pseudo-replication of a perceptual compensation experiment on Mandarin Chinese tones, and compared the embedding similarities and probing classifier outputs between a purely self-supervised pre-trained model and a model fine-tuned for Mandarin ASR. No evidence of compensation was found in the embedding similarities of the purely pre-trained model. Probing classifiers showed some evidence of compensation in addition to the expected layer-wise improvements in categorization, but failed to replicate human performance on isolated test syllables. Our findings contrast with previous reports of sensitivity to phonological structure emerging through pre-training alone, and suggest that supervised objectives may be necessary to encourage the abstraction of at least some types of phonological regularities.
[NLP-25] When Multiple Scripts Matter: Evaluating ASR in Clinical Settings INTERSPEECH2026
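[Quick read]: This paper addresses multiscript variability in automatic speech recognition (ASR) for non-English clinical settings, where the same term may appear in several valid orthographic forms and conventional string-matching metrics count those valid variants as errors, underestimating real performance. The key to the solution is MultiClin, a clinical ASR benchmark designed to evaluate robustness to multiscript variability. The study finds that multiscript-aware evaluation gives a fairer assessment of recognition quality than single-reference evaluation. It also shows that script consistency during training matters: inconsistent script mappings increase orthographic uncertainty and hinder convergence, with entropy highest at a 50% mapping ratio, whereas script unification consistently yields the best ASR performance.
Link: https://arxiv.org/abs/2606.17826
Authors: Jean Seo,Minkyu Kim,Jeonguk Lee,Jisoo Jung,Wooseok Han,Eunho Yang
Affiliations: AITRICS; University of Copenhagen; KAIST
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Interspeech 2026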
Abstract:Automatic speech recognition (ASR) in non-English clinical settings is challenged by multiscript variability, where the same term may appear in multiple valid orthographic forms. Conventional string-matching evaluation metrics often underestimate ASR performance by treating orthographic variants as errors. To address this issue, we introduce MultiClin, a clinical ASR benchmark designed to evaluate robustness to multiscript variability. Experiments across diverse ASR models show that multiscript-aware evaluation provides a fairer assessment of recognition quality than conventional single-reference evaluation. We further investigate the impact of script consistency during training and find that inconsistent script mappings increase orthographic uncertainty and hinder model convergence, with a balanced 50% mapping ratio producing the highest entropy. In contrast, script unification consistently yields the best ASR performance. Our dataset and code are publicly available at: this https URL.
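One plausible reading of multiscript-aware evaluation is to score a hypothesis against every valid orthographic variant and keep the best match. The sketch below implements that reading with character error rate (CER); the min-over-references rule, the function names, and the Korean/Latin example strings are assumptions for illustration, since the abstract names neither the metric definition nor the language.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate against a single reference."""
    return edit_distance(hypothesis, reference) / len(reference)


def multiscript_cer(hypothesis: str, references: list[str]) -> float:
    """Score against every valid script variant and keep the lowest error."""
    return min(cer(hypothesis, ref) for ref in references)


# Hypothetical example: the same term written in two scripts.
hyp = "CT scan"
single_ref = cer(hyp, "씨티 scan")
multi_ref = multiscript_cer(hyp, ["씨티 scan", "CT scan"])
```

Against the single reference the hypothesis is charged two character substitutions even though it is a valid transcription; with both variants accepted the error is zero.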
[NLP-26] Improving low-resource ASR using bilingual fine-tuning with language identification: a cross-linguistic evaluation
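[Quick read]: This paper addresses limited automatic speech recognition (ASR) performance in low-resource languages, focusing on the effect of bilingual fine-tuning. The key to the approach is a language identification token: during training, each input text is prepended with a token marking its language, and at inference the model jointly predicts the language and the transcription from the speech input alone. The study finds that bilingual fine-tuning can be beneficial when language identification accuracy is high, and that when language identification performs poorly, explicitly providing the token at inference helps improve ASR performance.
Link: https://arxiv.org/abs/2606.17820
Authors: Reihaneh Amooie,Yun Hao,Wietse de Vries,Jelske Dijkstra,Matt Coler,Martijn Wieling
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: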
Abstract:This study explores how bilingual fine-tuning affects automatic speech recognition (ASR) in low-resource languages. We evaluate this method across nine linguistically and geographically diverse language pairs, covering a range of language families and writing systems. To distinguish the two languages, during training, we prepend each input text with a language identification token. At inference, the model jointly predicts both the language and transcription from the speech input alone. As texts for which the language is incorrectly determined show low ASR performance, we also conduct a follow-up experiment in which the language identification token is provided both during training and inference. Our results show that bilingual fine-tuning can be beneficial when language identification accuracy is high, and that in cases where language identification performance is low, including the language identification token at inference helps to improve ASR performance.
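The language-token mechanism amounts to a small change in how training targets are written and how predictions are read back. The sketch below assumes an angle-bracket token format such as `<fy>`; the format, the function names, and the example sentence are hypothetical and not taken from the paper.

```python
def add_language_token(text: str, lang: str) -> str:
    """Training target: transcription prefixed with its language token."""
    return f"<{lang}> {text}"


def split_prediction(prediction: str) -> tuple[str, str]:
    """Inference output: separate the predicted language from the transcription."""
    token, _, text = prediction.partition(" ")
    return token.strip("<>"), text


target = add_language_token("goeie moarn", "fy")
lang, text = split_prediction(target)
```

In the follow-up setting described in the abstract, the token would be supplied as a fixed prefix at inference instead of being predicted, so a wrong language guess can no longer derail the transcription.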
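[NLP-27] A Framework for Evaluating Agentic Skills at Scale
[Quick read]: This paper addresses the lack of a reusable methodology for evaluating agent skills, whose cross-domain impact and use across commercial and open-source models remain under-studied. The core solution is an evaluation framework that lets a skill author construct realistic tasks to assess the aspects of a skill that matter most to them and that estimates skill utility by solving those tasks. Applied at scale to 500 real-world skills, the framework generates 1,000 tasks with instruction-following and goal-completion scoring rubrics and evaluates 19 agent-model configurations, both proprietary and open-source. Models vary widely in how closely they follow the instructions encoded in skills, which leads to substantial differences in performance gains; access to a skill also significantly changes model behavior, making skills a mechanism for encoding opinionated workflows. The evaluation dataset is released to support future work on agent skills.
Link: https://arxiv.org/abs/2606.17819
Authors: Maksim Shaposhnikov,Nicolas Fortuin,Simon Stipcich,Maria I. Gorinova,Amy Heineike,Rob Willoughby
Affiliations: Tessl; London; United Kingdom
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: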
Abstract:Agent skills – structured, reusable knowledge artifacts that augment LLM agent capabilities – have been rapidly adopted in industry, yet their cross-domain impact and use across commercial and open-source models remain under-studied, and no reusable methodology exists for evaluating an individual skill. In this work, we present an evaluation framework that lets a skill author construct realistic tasks to rigorously assess the aspects of a skill that matter most to them, and that estimates skill utility by solving those tasks. Further, we apply our evaluation approach at scale to 500 real-world skills, generating 1,000 tasks derived from the skills’ content, along with instruction-following and goal-completion scoring rubrics. Using these metrics, we evaluate how 19 agent-model configurations, both proprietary and open-source, perform on the tasks. Our results show that models vary widely in how closely they adhere to the instructions encoded in skills, leading to substantial differences in their performance gains. Furthermore, we show that access to a skill significantly changes model behavior compared to the no-skill setup, providing an essential mechanism for encoding opinionated workflows into LLM agents. We release our evaluation dataset to support future work on agent skills.
[NLP-28] Beyond Native Success: Auditing Deployment-Interface Exposure of CLIP Backdoors
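[Quick read]: This paper addresses how backdoor risk in CLIP vision-language models carries over across deployment interfaces. Existing backdoor evaluations are limited to an attack-native task and do not reveal whether a poisoned checkpoint remains exposed when reused through other interfaces. The core solution is DIFE (Deployment-Interface Footprint Evaluation), a framework that makes evaluations comparable by specifying each interface's component readout, trigger channel, target event, reference condition, and metric. DIFE also introduces effective-footprint diagnosis to identify the reusable component or component combination that carries exposure and to explain where risk transfers. The audit shows that native attack success is not a checkpoint-level risk certificate, that exposure follows component footprints, that text-side poisoning does not yield control of the text encoder, and that some coupled attacks remain mechanism-bound. Based on these findings, the authors propose BadTextTower, a backdoor attack that produces strong text-conditioned retrieval, reranking, and selection exposure while leaving visual-only reuse nearly clean, filling the gap of a text encoder that itself acts as a reusable carrier of adversarial behavior.
Link: https://arxiv.org/abs/2606.17815
Authors: Kunlan Xiang,Haomiao Yang,Wenbo Jiang
Affiliations: University of Electronic Science and Technology of China
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: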
Abstract:Contrastive Language-Image Pre-training (CLIP) models are widely reused across downstream interfaces, including feature extraction, retrieval, reranking, and selection. Existing CLIP backdoor studies, however, usually validate attacks on a small attack-native task, leaving unclear whether the same poisoned checkpoint remains exposed, weakens, or becomes not applicable when reused through other interfaces. We introduce DIFE, a Deployment-Interface Footprint Evaluation framework that audits backdoored CLIP checkpoints across deployment interfaces. DIFE makes various evaluations comparable by specifying each interface’s component readout, trigger channel, target event, reference condition, and metric. DIFE also introduces effective-footprint diagnosis to identify the reusable CLIP component or component combination that carries exposure and explains where risk transfers. Auditing reproduced CLIP backdoors with DIFE reveals a structured landscape: native success is not a checkpoint-level risk certificate, exposure follows component footprints, text-side poisoning does not yield textual-encoder control, and some coupled attacks remain mechanism-bound. This audit reveals an important gap in existing CLIP backdoors: a textual encoder that itself becomes a reusable carrier of adversarial behavior. We therefore introduce BadTextTower to fill this gap. BadTextTower produces strong text-conditioned retrieval, reranking, and selection exposure while leaving visual-only reuse nearly clean.
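[NLP-29] Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering
[Quick read]: This paper addresses a serious misalignment between current coding benchmarks and how coding agents are used in practice. Existing benchmarks collapse model, harness, and environment into a single end-to-end score, usually judged against one reference solution, with no component-level signal to guide iteration. The underlying problem is that a coding agent in practice is not a model but a system harness, a composite of models, harnesses, contexts, environments, and feedback signals, any one of which can move the benchmark score by margins comparable to those between adjacent model generations. The paper argues that benchmarks need signal at the level of individual components to fit agentic software engineering.
Link: https://arxiv.org/abs/2606.17799
Authors: Maria I. Gorinova,Macey Baker,Amy Heineike,Maksim Shaposhnikov,Rob Willoughby,Dru Knox
Affiliations: Tessl; London; UK
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: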
Abstract:Coding agents have become a major mode of software engineering, but the benchmarks we use to compare them were designed in a pre-agent era: they collapse model, harness, and environment into a single end-to-end score, typically computed against one reference solution, with no component-level signal for iteration. We argue that current coding benchmarks are misaligned with agentic software engineering. A coding agent in practice is not a model: it is a system harness – a composite of models, harnesses, contexts, environments, and feedback signals, any one of which can move the benchmark score by margins comparable to those between adjacent model generations. We discuss three symptoms: (i) benchmark scores conflate the model with the rest of the harness; (ii) grading against a single reference solution penalises equally valid alternatives; and (iii) the absence of signal at the level of individual harness components makes the end-to-end system score difficult to iterate on.
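[NLP-30] The Slop Paradox: How Synthetic Standardization Erodes Clinical Uncertainty and Cross-Modal Alignment in AI-Rewritten Radiology Reports
[Quick read]: This paper addresses the information degradation that arises when LLMs summarize, standardize, and reformat radiology reports in AI-assisted clinical documentation. Although such rewriting aims at readability and consistency, it can damage the alignment between the report and the image. Using three realistic LLM rewriting tasks (EHR summarization, standardized rewriting, and teaching case preparation), the study measures entity erosion (via medical named entity recognition), hedging collapse (loss of clinical uncertainty language), and cross-modal alignment degradation (via BiomedCLIP image-text similarity). The central finding is a dissociation between information loss and cross-modal fidelity: EHR summarization loses the most content (51.4% of clinical entities and 43.7% of hedging language) yet preserves image-text alignment almost entirely (a 2.5% drop), whereas standardized rewriting and teaching case preparation, which are meant to produce cleaner training data, preserve more entities (26.8% and 29.3% eroded) but cause alignment drops of 14.9-16.5%, six to seven times larger. The authors call this the slop paradox: rewriting that makes text look cleaner pulls it away from the image. Degradation is driven mainly by the type of rewriting task rather than by pathology rarity, and rare pathologies were not preferentially degraded, so the contamination is invisible to condition-specific monitoring. The findings bear on multimodal medical AI dataset construction and the governance of AI-assisted clinical documentation tools.
Link: https://arxiv.org/abs/2606.17791
Authors: Samar Ansari
Affiliations: University of Chester
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: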
Abstract:AI-assisted clinical documentation tools increasingly summarize, standardize, and reformat radiology reports using large language models (LLMs). We present a controlled measurement of the resulting information degradation. Using 450 chest X-ray reports from the Indiana University dataset, we generate synthetic versions via three realistic LLM rewriting tasks: EHR summarization, standardized rewriting, and teaching case preparation. We measure entity erosion (via medical NER), hedging collapse (loss of clinical uncertainty language), and cross-modal alignment degradation (via BiomedCLIP image-text similarity). Our central finding is a dissociation between information loss and cross-modal fidelity. EHR summarization is the most destructive at the content level, eroding 51.4% of clinical entities and 43.7% of hedging language, yet it preserves image-text alignment almost entirely (a 2.5% drop). The two tasks meant to produce cleaner training data, standardized rewriting and teaching case preparation, do the reverse: they preserve more entities (26.8% and 29.3% eroded) but cause 14.9-16.5% alignment drops, six to seven times those of EHR summarization. We term this the slop paradox: rewriting that makes clinical text look cleaner for multimodal training is precisely what pulls it away from the image. Contrary to our pre-specified hypothesis, rare pathologies were not preferentially degraded: across nine rare-versus-common comparisons, no difference survived multiple-comparison correction, and nominal differences ran in the opposite direction (common > rare), so contamination is invisible to condition-specific monitoring. The dominant determinant of degradation is the type of AI rewriting task, not the clinical content. These findings bear on multimodal medical AI dataset construction and the governance of AI-assisted clinical documentation.
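Two of the quantities in the abstract are simple enough to work through. The sketch below assumes entity erosion is the share of original entities missing after rewriting (the abstract does not give the exact formula) and checks the "six to seven times" statement from the reported alignment drops. The entity sets are made-up examples and `erosion_rate` is a hypothetical name.

```python
def erosion_rate(original: set[str], rewritten: set[str]) -> float:
    """Fraction of the original items that no longer appear after rewriting."""
    return len(original - rewritten) / len(original)


# Hypothetical entities extracted from one report before and after rewriting.
original = {"cardiomegaly", "pleural effusion", "atelectasis", "pneumothorax"}
rewritten = {"cardiomegaly", "pleural effusion"}
rate = erosion_rate(original, rewritten)  # 2 of 4 entities lost

# Alignment drops reported in the abstract, in percent.
ehr_drop, other_low, other_high = 2.5, 14.9, 16.5
ratios = (other_low / ehr_drop, other_high / ehr_drop)
```

The two ratios come out at roughly 5.96 and 6.6, which matches the abstract's description of drops six to seven times those of EHR summarization.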
[NLP-31] Vision-language models for chest radiography do not always need the image
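[Quick Read]: This paper asks whether the strong chest-radiograph accuracy reported for medical vision-language models actually comes from reading the image. Existing benchmarks cannot separate a model that reasons over image content from one that relies on surface cues such as finding-name priors, so model ability can be misjudged. The authors propose a causal audit that intervenes on the image in three ways (occluding the relevant anatomical region, occluding an irrelevant region, and swapping in a different patient's scan with the same label) and combines three behavioral metrics to test whether a correct answer depends on the image. Across nine systems, a text-only model with no image access comes within 5.7 accuracy points of the best multimodal model, and a 119-billion-parameter multimodal model is statistically indistinguishable from a 7-billion-parameter text-only baseline. The audit sorts the systems into three models that ignore the image, one that is unstable, and five that use the image selectively for a subset of findings; the categories hold on a second dataset and across image resolutions and prompt phrasings. A text-only model is statistically indistinguishable from board-certified radiologists in accuracy while grounding at zero, whereas the image-using models ground at radiologist-comparable rates. Reported confidence flags ungrounded answers only when the model actually uses the image. The paper concludes that grounding audits, rather than accuracy alone, should gate clinical deployment.
Link: https://arxiv.org/abs/2606.17710
Authors: Mahshad Lotfinia,Sebastian Ziegelmayer,Lisa Adams,Daniel Truhn,Andreas Maier,Soroosh Tayebi Arasteh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: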
Abstract:Medical vision-language models report strong chest radiograph accuracy, and this is increasingly read as evidence that they use the image. That inference is unsafe: a model exploiting finding-name priors scores like one that reads the scan, and no standard benchmark separates them. We introduce a causal audit that intervenes on the image, occluding the relevant region, occluding an irrelevant one, and swapping in another patient’s same-label scan, and combines three behavioral metrics to test whether a correct answer depends on the image. Across nine systems, a text-only model with no image access reaches within 5.7 accuracy points of the best multimodal one, and a 119-billion-parameter multimodal model is statistically indistinguishable from a 7-billion text-only baseline. The audit splits the cohort into three models that ignore the image, one that is unstable, and five that use it selectively, for a subset of findings; the categories hold across a second dataset, resolution, and prompt phrasing. Against board-certified radiologists, a text-only model is statistically indistinguishable from a radiologist’s accuracy while grounding at zero, whereas the image-using models ground at radiologist-comparable rates. Reported confidence flags ungrounded answers only when a model uses the image. Grounding audits, not accuracy, should gate clinical deployment.
[NLP-32] EComAgent Bench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent
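[Quick Read]: This paper targets a gap in how LLM-based shopping agents are benchmarked: existing benchmarks do not reflect how a shopper's requirements surface over the course of an interaction. A requirement may be stated implicitly in the query, recorded in a user profile, or revealed only after the right question is asked, whereas conventional benchmarks expose the full intent upfront and grade only the final choice, so they cannot measure an agent's ability to find hidden requirements and verify candidates against attributes and review evidence over a long horizon. The authors introduce EComAgentBench, a benchmark of 662 tasks built on real Amazon products and reviews. Each task scatters requirements across a visible query, a tool-gated profile, and scripted clarification; within 100 tool calls, the agent must uncover the hidden intent, reason over attribute and review evidence, and commit to a single product. Typed, source-tagged rubrics attribute each failure to a specific missing requirement and its source. Construction is automated yet reliable, with every answer fixed in code beforehand for consistency and reproducibility. An evaluation of seven models shows that even the strongest reaches only 57.1% overall accuracy, and rubric satisfaction degrades from visible to hidden sources. The authors present EComAgentBench as a reproducible foundation for moving shopping agents from single-query search toward dependable long-horizon assistance.
Link: https://arxiv.org/abs/2606.17698
Authors: Zeyao Du,Tong Li,Haibo Zhang
Affiliations: Shopee
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: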
Abstract:As LLM-based shopping agents enter production, existing benchmarks fail to capture how a shopper’s requirements arrive: stated implicitly in the query, recorded in a profile, or revealed only when the right question is asked. Benchmarks that expose full intent upfront and grade only the final choice can neither pose this long-horizon challenge nor explain which requirement an agent missed. To address this gap, we introduce EComAgentBench, a benchmark of 662 tasks grounded in real Amazon products and reviews. Each task scatters these requirements across a visible query, a tool-gated profile, and scripted clarification; an agent must uncover hidden intent, verify candidates against attributes and review evidence, and commit to a single product within 100 tool calls. Moreover, typed, source-tagged rubrics grade every task, attributing each failure to a requirement and its source. Construction is automated yet reliable, with every answer fixed in code before any text is generated and every sample validated. Our evaluation of seven models reveals that even the strongest attains only 57.1% overall accuracy, and rubric satisfaction degrades from visible to hidden sources. Overall, we believe EComAgentBench will serve as a reproducible foundation for moving shopping agents from single-query search toward dependable assistance over long horizons.
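[NLP-33] LLMs Infer Cultural Context but Fail to Apply It When Responding
[Quick Read]: This paper studies why LLMs fail to adapt their responses to a user's cultural background: in non-Western contexts, models tend to fall back on the Western conventions that dominate their training data. The authors introduce Cultural and Pragmatic Response Inference (CAPRI), a dataset of conversations with varying levels of cultural cues, and use it to evaluate whether models infer the user's cultural background, recall the matching conventions (local measurement units, and the interpretation of time and quantity expressions), and apply them. State-of-the-art LLMs can infer cultural background and recall the relevant conventions, but usually do not use that information to adapt their answers unless explicitly prompted to perform the steps sequentially. For time and quantity expressions, whose interpretation varies with culture, responses adapt progressively as cultural cues accumulate, but the models' priors are not culture-neutral and sometimes align with the model's country of origin. The main contribution is exposing the gap between cultural knowledge and culturally adaptive generation, with CAPRI as an evaluation resource for closing it.
Link: https://arxiv.org/abs/2606.17688
Authors: Yisong Miao,Jian Zhu,Vered Shwartz
Affiliations: University of British Columbia; Canada CIFAR AI Chair, Vector Institute; National University of Singapore
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 7 figures, 2 tables (24 pages, 12 figures, 8 tables including references and appendices)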
Abstract:Recent work has shown that LLMs overrepresent dominant cultures, particularly Western ones, while marginalizing others. We investigate whether this affects models’ ability to generate culturally adapted responses by evaluating their use of local measurement units based on the user’s perceived cultural background. We introduce Cultural and Pragmatic Response Inference (CAPRI), a dataset of conversations with varying levels of cultural cues. Experiments with state-of-the-art LLMs show that models can infer cultural background and recall relevant conventions, but often fail to utilize the information to adapt their answers to the relevant cultural conventions, unless explicitly prompted to perform the tasks sequentially. We further evaluate adaptation to the interpretation of time and quantity expressions, two subjective language grounding dimensions that are affected by culture. We find that models increasingly adapt their answers as cultural cues accumulate, but their priors are not culture-neutral, sometimes aligning with the model’s country of origin. Overall, CAPRI provides a resource for future research aimed at narrowing the gap between cultural knowledge and culturally adaptive language generation.
[NLP-34] SuCo: Sufficiency-guided Continuous Adaptive Reasoning ICML2026
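[Quick Read]: This paper addresses the overly long Chain-of-Thought (CoT) that Large Reasoning Models (LRMs) generate, which inflates compute cost, especially on simple queries. Existing methods mostly rely on discrete reasoning modes or fixed budget tiers and lack a principled criterion for when reasoning is sufficient. The authors define the Minimal Sufficient CoT (MSC) as the shortest prefix of a CoT that still yields the correct answer, and show empirically that MSC both reduces reasoning tokens and improves accuracy across difficulty levels. Building on MSC, they propose Sufficiency-guided Continuous Adaptive Reasoning (SuCo), a two-stage training framework. Stage 1, MSC-Aligned Fine-Tuning (MFT), builds MSC data with sufficiency thresholds that scale with problem difficulty and fine-tunes the model to internalize concise yet sufficient reasoning. Stage 2, Sufficiency-Aware Policy Optimization (SAPO), applies reinforcement learning with dynamic complexity tracking and a reward that penalizes both over-thinking and under-thinking, giving continuous, adaptive control over reasoning length. Experiments on mathematics, code, and science benchmarks show gains in both accuracy and reasoning efficiency.
Link: https://arxiv.org/abs/2606.17687
Authors: Jiahao Wang,Bingyu Liang,Chenhao Hu,Longhui Zhang,Xuebo Liu,Min zhang,Jing Li,Xuelong Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to ICML 2026. 18 pages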
Abstract:Despite remarkable performance on complex tasks, Large Reasoning Models (LRMs) often generate excessively long Chain-of-Thoughts (CoT), inflating computational costs even for simple queries. Existing efforts to mitigate this inefficiency typically rely on discrete reasoning modes or fixed budget tiers, lacking a principled criterion of when reasoning is sufficient. In this work, we introduce Minimal Sufficient CoT (MSC), defined as the shortest prefix of a CoT trajectory which is adequate for producing the correct answer. We empirically show that MSC not only reduces reasoning tokens, but also improves accuracy across difficulty levels. Building on MSC, we propose Sufficiency-guided Continuous Adaptive Reasoning (SuCo), a two-stage training framework for autonomous reasoning control along a continuous spectrum. In stage 1, MSC-Aligned Fine-Tuning (MFT) constructs MSC data using problem-adaptive sufficiency thresholds that naturally scale with question difficulty, then fine-tunes the model to internalize concise yet sufficient reasoning patterns. In stage 2, Sufficiency-Aware Policy Optimization (SAPO) further optimizes the model through reinforcement learning with dynamic complexity tracking and sufficiency-aware rewards that penalize both over- and under-thinking. Extensive experiments across mathematics, code, and science benchmarks show that SuCo consistently achieves improvements in both accuracy and reasoning efficiency.
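The MSC definition above (the shortest CoT prefix that is adequate for the correct answer) can be illustrated with a small prefix search. This is only a sketch under stated assumptions: the CoT is assumed to be pre-split into steps, and `is_sufficient` is a hypothetical oracle standing in for whatever sufficiency check the paper uses. A linear scan is used because sufficiency is not assumed to be monotone in prefix length.

```python
from typing import Callable, Optional, Sequence


def minimal_sufficient_prefix(
    steps: Sequence[str],
    is_sufficient: Callable[[Sequence[str]], bool],
) -> Optional[int]:
    """Return the smallest k such that steps[:k] passes the oracle, else None."""
    for k in range(len(steps) + 1):
        if is_sufficient(steps[:k]):
            return k
    return None


def toy_oracle(prefix: Sequence[str]) -> bool:
    # Hypothetical check: the prefix is "sufficient" once it has derived x = 4.
    return any("x = 4" in step for step in prefix)


cot = [
    "Let 2x + 3 = 11.",
    "Then 2x = 8.",
    "So x = 4.",
    "Double-check: 2*4 + 3 = 11.",
    "Answer: 4.",
]
msc_len = minimal_sufficient_prefix(cot, toy_oracle)
print(msc_len, "of", len(cot), "steps are needed")
```

In this toy trace the verification and restatement steps fall outside the minimal prefix, which is the kind of redundancy the abstract describes as over-thinking.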
[NLP-35] Bridging Functional Correctness and Runtime Efficiency Gaps in LLM-Based Code Translation ICML2026
[Quick Read]: This paper addresses the runtime efficiency of LLM-translated code, which often lags behind human-written code even though LLMs have substantially improved functional correctness; prompt engineering alone does not close the gap. The authors propose SwiftTrans, a framework with two stages: (1) Multi-Perspective Exploration, in which MpTranslator uses parallel in-context learning (ICL) to generate diverse translation candidates; and (2) Difference-Aware Selection, in which DiffSelector picks the best candidate by explicitly comparing the differences between translations. Hierarchical Guidance (for MpTranslator) and Ordinal Guidance (for DiffSelector) help the LLMs adapt to the two stages. To evaluate runtime efficiency, the authors extend the CodeNet and F2SBench benchmarks and introduce a new benchmark, SwiftBench. Across all three benchmarks, SwiftTrans consistently improves both functional correctness and runtime efficiency.
Link: https://arxiv.org/abs/2606.17683
Authors: Longhui Zhang,Jiahao Wang,Chenhao Hu,Bingyu Liang,Jing Li,Min Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Programming Languages (cs.PL)
Comments: Accepted to ICML 2026
Abstract:While large language models (LLMs) have greatly advanced the functional correctness of automated code translation systems, the runtime efficiency of translated programs has received comparatively little attention. With the waning of Moore’s law, runtime efficiency has become increasingly important for program quality, alongside functional correctness. Our preliminary study reveals that LLM-translated programs often run slower than human-written ones, and this issue cannot be remedied through prompt engineering alone. Therefore, our work proposes SwiftTrans, a code translation framework comprising two key stages: (1) Multi-Perspective Exploration, where MpTranslator leverages parallel in-context learning (ICL) to generate diverse translation candidates; and (2) Difference-Aware Selection, where DiffSelector identifies the optimal candidate by explicitly comparing differences between translations. We further introduce Hierarchical Guidance for MpTranslator and Ordinal Guidance for DiffSelector, enabling LLMs to better adapt to these two core components. To support the evaluation of runtime efficiency in translated programs, we extend existing benchmarks, CodeNet and F2SBench, and introduce a new benchmark, SwiftBench. Experimental results across all three benchmarks show that SwiftTrans achieves consistent improvements in both correctness and runtime efficiency.
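[NLP-36] From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning
[Quick Read]: This paper addresses the manual redesign of training environments between stages of LLM reinforcement learning pipelines, where practitioners must guess heuristically which configuration will most improve the current policy. The authors propose the LLM-as-Environment-Engineer framework, in which the current policy model analyzes failure trajectories, contextual information, and environment statistics, then proposes the configuration for the next training stage, turning environment redesign into an automated loop driven by the policy's own self-diagnosis. They also introduce MAPF-FrozenLake, a controllable testbed whose generator exposes multi-dimensional environment configurations, as a benchmark for studying environment redesign. With Qwen3-4B as the backbone, the framework achieves the strongest aggregate performance on the authors' benchmarks, outperforming larger proprietary LLMs (e.g., GPT, Gemini) and fixed-environment training baselines. Analysis shows that effective environment updates rely on failure evidence and preserve configurations that already work. The RL-trained checkpoint is also a better environment engineer than the original base model, suggesting that policy learning improves the model's ability to diagnose its remaining weaknesses.
Link: https://arxiv.org/abs/2606.17682
Authors: Chao Chen,Chengzu Li,Zhiwei Li,Yinhong Liu,Zhijiang Guo
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: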
Abstract:Reinforcement learning pipelines for Large Language Model (LLM) training often rely on manually redesigned environments between stages, requiring practitioners to heuristically infer which configuration will best improve the current policy. To automate this process, we propose the LLM-as-Environment-Engineer framework in which the current policy model analyzes failure trajectories together with contextual information and proposes modifications to the next-stage training environment configuration. We also introduce MAPF-FrozenLake, a controllable testbed whose generator exposes multi-dimensional environment configurations, making it suitable for studying and benchmarking environment redesign. On this testbed, we condition the environment engineer on structured summaries of policy behavior, failure cases, and environment statistics, from which it produces the configuration for the next training stage. With Qwen3-4B as the backbone, our framework achieves the strongest aggregate performance on our benchmarks, outperforming larger proprietary LLMs (e.g., GPT, Gemini) and fixed-environment training baselines. We further analyze which forms of context are most effective, finding that successful environment updates rely on failure evidence and preserve configurations that already work. Interestingly, the current RL checkpoint serves as a better environment engineer than the original base model, suggesting that policy learning improves the model’s ability to diagnose its remaining weaknesses.
[NLP-37] EnvRL: Learn from Environment Dynamics in Agentic Reinforcement Learning
[Quick Read]: This paper addresses the difficulty of policy learning under sparse rewards in long-horizon agentic reinforcement learning (RL). Conventional RL relies only on sparse outcome rewards and overlooks the environment dynamics contained in rollout trajectories. The proposed EnvRL framework adds environment dynamics learning through two auxiliary objectives, state prediction and inverse dynamics, which are optimized jointly with the primary RL objective so that the agent internalizes the environment's transition behavior from its own interaction experience. On two long-horizon agentic benchmarks, ALFWorld and WebShop, EnvRL improves success rates over RL-only baselines; for example, with GRPO training it lifts Qwen-2.5-1.5B-Instruct from 72.8% to 77.4% on ALFWorld and from 56.8% to 67.0% on WebShop.
Link: https://arxiv.org/abs/2606.17680
Authors: Zhitong Wang,Songze Li,Hao Peng,Shuzheng Si,Yi Wang,Maosong Sun,Juanzi Li
Affiliations: Tsinghua University; Shanghai AI Laboratory
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:Reinforcement learning (RL) has emerged as a powerful paradigm for training Large Language Models (LLMs) as agents. However, conventional RL methods for long-horizon agentic tasks often struggle with sparse outcome rewards. Intuitively, this overlooks the rich environment dynamics information contained in rollout interaction trajectories. We argue that the interaction experience inherently serves as an implicit supervision signal, reveals the underlying transition mechanisms of the environment, and enables the agent to construct a more accurate internal model of the environment. Therefore, in this work, we investigate how to leverage this additional signal to improve policy learning. Specifically, we propose EnvRL, a framework that incorporates environment dynamics learning into agentic RL via two auxiliary objectives: state prediction and inverse dynamics. By jointly optimizing with the primary RL objective, we encourage the agent to internalize environment dynamics from its own interaction experience. Extensive experiments on two long-horizon agentic benchmarks demonstrate that EnvRL achieves significant improvements on success rates over RL-only baselines, e.g., when trained with GRPO, lifting Qwen-2.5-1.5B-Instruct from 72.8% to 77.4% on ALFWorld, and from 56.8% to 67.0% on WebShop.
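One plausible way to write the joint objective described above is as the RL loss plus two weighted log-likelihood terms. The abstract does not give the exact formulation, so the weights $\lambda_{\mathrm{sp}}$ and $\lambda_{\mathrm{inv}}$, the negative log-likelihood form, and the notation $s_t, a_t$ for states and actions along a rollout are all assumptions made for illustration.

```latex
\mathcal{L}(\theta)
  = \mathcal{L}_{\mathrm{RL}}(\theta)
  + \lambda_{\mathrm{sp}} \, \mathbb{E}_{(s_t, a_t, s_{t+1})}
      \big[ -\log p_\theta(s_{t+1} \mid s_t, a_t) \big]
  + \lambda_{\mathrm{inv}} \, \mathbb{E}_{(s_t, a_t, s_{t+1})}
      \big[ -\log p_\theta(a_t \mid s_t, s_{t+1}) \big]
```

The second term corresponds to state prediction (predict the next observation from the current state and action) and the third to inverse dynamics (recover the action that connects two consecutive states); both draw their training tuples from the agent's own rollouts.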
[NLP-38] MambaCount: Efficient Text-guided Open-vocabulary Object Counting with Spatial Sparse State Space Duality Block
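[Quick Read]: This paper addresses Text-guided Open-vocabulary Object Counting (TOOC) in dense scenes with large scale variations, where existing Transformer-based methods scale poorly because their complexity is quadratic in image resolution. The proposed MambaCount framework makes three contributions. First, it analyzes and reconstructs the decay dynamics of Mamba's hidden states to ease the limits that causal modeling places on the bidirectional spatial dependencies vision tasks need. Second, it introduces a Spatial Token Selection (STS) sub-block that reduces the unconstrained high entropy of spatial token responses in Mamba, helping preserve local details and high-frequency cues. Third, it designs Multi-Granularity Prototypes (MGP) that identify object-like regions at different semantic levels, improving cross-modal alignment and interpretability. On FSC-147, MambaCount achieves state-of-the-art performance among methods without secondary querying (test MAE of 12.23) while retaining linear complexity.
Link: https://arxiv.org/abs/2606.17650
Authors: Hao-Yuan Ma,Li Zhang,Minjie Qiang,Jie Gao
Affiliations: Soochow University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: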
Abstract:Text-guided Open-vocabulary Object Counting (TOOC) aims to estimate the number of objects described by text prompts, which is particularly challenging in dense scenes with large scale variations. Existing TOOC approaches predominantly rely on Transformers, whose quadratic complexity with respect to image resolution limits their scalability. Mamba offers a promising alternative due to its linear complexity. However, previous Mamba-based methods have two main limitations. On the one hand, the inherent causal formulation of Mamba constrains the bidirectional spatial dependency modeling required by non-causal vision tasks. On the other hand, existing Mamba-based vision models often overlook the unconstrained high entropy in the spatial token responses, which can weaken local details and high-frequency cues. To address these limitations, we propose MambaCount, an efficient framework built on the Spatial Sparse State Space Duality (S^4D) block. Specifically, we analyze and reconstruct the decay dynamics of hidden states in Mamba to alleviate the dependency constraints introduced by causal modeling. Moreover, we introduce a Spatial Token Selection (STS) sub-block to reduce the unconstrained high entropy in spatial token responses within Mamba. In addition, we design Multi-Granularity Prototypes (MGP) to identify object-like regions at different semantic levels, improving cross-modal alignment and interpretability. Extensive experiments on FSC-147 demonstrate that MambaCount achieves state-of-the-art performance among methods without secondary querying, obtaining a test MAE of 12.23, while retaining linear complexity.
[NLP-39] Beyond Domains: Reusing Web Skills via Transferable Interaction Patterns
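[Quick Read]: This paper addresses the long decision sequences that LLM web agents produce when every action is a low-level primitive, which drives up latency and cost. Existing skill libraries trigger skills mainly by instruction similarity or coarse site metadata, so skills are rarely reused on unseen sites. The proposed SkillMigrator transfers skills across sites by matching page layout structure rather than specific element references. Each induced skill is stored as a Transferable Interaction Pattern (TIP): the skill paired with a structural sketch of the page snapshot at induction time. At test time, the system retrieves TIPs by layout similarity and grounds their references on the live page. The rest of the stack is standard, with accessibility-snapshot observations and fixed tool calling. On WebArena and Mind2Web, SkillMigrator reduces the average number of LLM actions on successful trajectories by 8-10% at matched success rate.
Link: https://arxiv.org/abs/2606.17645
Authors: Shiqi He,Yue Cui,Feijie Wu,Xinyu Ma,Jiaheng Lu,Yaliang Li,Bolin Ding,Mosharaf Chowdhury
Affiliations: University of Michigan; Alibaba Group; Purdue University; McMaster University; University of Pennsylvania
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: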
Abstract:Large language model (LLM) web agents are usually deployed as tool callers: each turn, the model reads a fresh page observation and emits one structured tool action. When every action is a low-level primitive, horizons grow quickly and so do policy-facing LLM completions, dominating latency and cost on benchmarks such as Mind2Web and WebArena. Recent systems therefore wrap repeated interaction fragments as web skills: callable tools built from successful trajectories or induced programs, so one call can replace several primitives. However, prior skill libraries are still triggered mainly by instruction similarity or coarse site metadata, which yields low skill reuse on held-out sites and leaves much of the potential step and token reduction on the table. We present SkillMigrator, an agent that learns reusable web skills and transfers them across sites by matching layout structure rather than specific element references. Each induced skill is stored as a transferable interaction pattern (TIP): the skill paired with a structural sketch of the snapshot at induction time. At test time, SkillMigrator retrieves TIPs by layout similarity and grounds their references on the live page. The rest of the stack is standard: accessibility-snapshot observations with stable references, and fixed tool calling over primitives plus skill invocations. Compared with the state-of-the-art approaches, SkillMigrator reduces the average LLM-action count on successful trajectories by 8-10% across both WebArena and Mind2Web at matched success rate.
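Retrieving a TIP by layout similarity, as described above, can be sketched with a very simple structural representation. Everything below is hypothetical: the sketch format (a set of accessibility-role and tree-depth pairs), the Jaccard similarity, the 0.5 threshold, and the names `TIP`, `jaccard`, and `retrieve_tip` are illustrative stand-ins, not SkillMigrator's actual representation or matching rule.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple

# Assumed sketch format: a set of (accessibility role, tree depth) pairs.
Sketch = FrozenSet[Tuple[str, int]]


@dataclass(frozen=True)
class TIP:
    skill_name: str
    sketch: Sketch  # structural sketch captured when the skill was induced


def jaccard(a: Sketch, b: Sketch) -> float:
    """Set overlap in [0, 1]; two empty sketches count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def retrieve_tip(page: Sketch, library: List[TIP], threshold: float = 0.5) -> Optional[TIP]:
    """Return the most layout-similar TIP, or None if nothing clears the threshold."""
    best = max(library, key=lambda tip: jaccard(page, tip.sketch), default=None)
    if best is None or jaccard(page, best.sketch) < threshold:
        return None
    return best


library = [
    TIP("search_and_open_first_result",
        frozenset({("searchbox", 2), ("button", 2), ("list", 1)})),
    TIP("fill_login_form",
        frozenset({("textbox", 2), ("textbox", 3), ("button", 2)})),
]
live_page: Sketch = frozenset({("searchbox", 2), ("button", 2), ("list", 1), ("link", 3)})
match = retrieve_tip(live_page, library)
print(match.skill_name if match else "no reusable skill")
```

Because the sketch contains no site-specific element references, the same stored pattern can match a page on a site the agent has never visited; grounding the skill's references on the live page would be a separate step not shown here.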
[NLP-40] Prompt Perturbation for Reliable LLM Evaluation over Comparison Graphs
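[Quick Read]: This paper addresses intransitivity in pairwise evaluation of LLMs on open-ended tasks: judgments may contain cyclic preferences (e.g., A ≻ B ≻ C ≻ A) or contradictory ties (e.g., A ≡ B ≡ C ≠ A), which makes leaderboards unstable and hard to interpret. The proposed prompt perturbation framework generates perturbed variants of each prompt, builds comparison graphs from the resulting judgments to identify and filter out structurally inconsistent comparison patterns, and then applies standard ranking methods to the filtered comparisons. Its key feature is that graph-level structural consistency is enforced explicitly before ranking aggregation, which reduces cyclic inconsistencies and improves the reliability of the resulting rankings.
Link: https://arxiv.org/abs/2606.17634
Authors: Dong Huang,Jianbo Sun,Pengkun Yang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 42 pages, 8 figures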
Abstract:Evaluating large language models (LLMs) is important for understanding their capabilities, comparing competing systems, and supporting the deployment of reliable models in practice. For open-ended tasks, pairwise evaluation has become a popular paradigm, in which two responses to the same prompt are compared and the resulting judgments are aggregated into an overall ranking. A central challenge of this paradigm is intransitivity: the induced comparison outcomes may fail to support any coherent global ranking. For example, one may observe cyclic preferences such as A ≻ B ≻ C ≻ A, or inconsistencies involving ties such as A ≡ B ≡ C ≠ A. Such contradictions make the resulting leaderboard unstable and challenging to interpret. In this paper, we propose a prompt perturbation framework for improving the consistency of pairwise LLM evaluation. Our approach generates perturbed variants of each prompt, uses the resulting comparison graphs to identify and filter out structurally inconsistent comparison patterns, and then applies standard ranking methods to the filtered comparisons. A key feature of the proposed framework is that graph-level structural consistency is incorporated explicitly into the evaluation pipeline before ranking aggregation. This provides a simple and principled way to reduce cyclic inconsistencies and improve the reliability of LLM rankings.
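The idea of filtering structurally inconsistent patterns before ranking can be illustrated on a single comparison graph. The sketch below is a deliberately simplified stand-in: it detects only strict 3-cycles, drops every edge that takes part in one, and ranks by win count. The paper's perturbation step, its treatment of ties, and its actual filtering rule and ranking method are not reproduced, and the function names are hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (winner, loser) from one pairwise judgment


def cyclic_triples(edges: Set[Edge]) -> List[Tuple[str, str, str]]:
    """Find cycles a > b > c > a, reported once with the smallest node first."""
    nodes = sorted({n for e in edges for n in e})
    found = []
    for a, b, c in permutations(nodes, 3):
        if a < b and a < c and (a, b) in edges and (b, c) in edges and (c, a) in edges:
            found.append((a, b, c))
    return found


def filter_cycles(edges: Set[Edge]) -> Set[Edge]:
    """Drop every edge that participates in at least one 3-cycle."""
    bad: Set[Edge] = set()
    for a, b, c in cyclic_triples(edges):
        bad |= {(a, b), (b, c), (c, a)}
    return edges - bad


def rank_by_wins(edges: Set[Edge]) -> List[str]:
    """Simple win-count ranking, with ties broken alphabetically."""
    wins: Dict[str, int] = {}
    for winner, loser in edges:
        wins[winner] = wins.get(winner, 0) + 1
        wins.setdefault(loser, 0)
    return sorted(wins, key=lambda m: (-wins[m], m))


edges = {("A", "B"), ("B", "C"), ("C", "A"), ("A", "D"), ("B", "D"), ("C", "D")}
cycles = cyclic_triples(edges)
kept = filter_cycles(edges)
ranking = rank_by_wins(kept)
print(cycles, ranking)
```

In the toy graph the A, B, C judgments contradict one another, so after filtering only the comparisons against D remain: the data supports "D is worst" but gives no evidence for ordering the other three.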
[NLP-41] OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation
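[Quick Read]: This paper addresses a limitation of memory-based self-evolving agents: they can store trajectories, retrieve reflections, or accumulate skills, but often lack the combined ability to select useful experience, act on it, write reusable knowledge, and maintain a growing memory repository. The authors propose OPD-Evolver, a slow-fast co-evolution framework that cultivates these abilities through on-policy self-distillation. In the fast loop, the agent reads, uses, writes, and maintains experience in a four-level memory hierarchy for rapid test-time evolution. In the slow loop, outcome-calibrated memory attribution and privileged hindsight distill the four abilities into the deployable policy. On multi-domain benchmarks, OPD-Evolver surpasses memory systems such as ReasoningBank by up to 11.5% and training-based methods such as Skill0 by about 5.8%, and OPD-Evolver-9B can challenge far larger models such as Qwen3.5-397B-A17B and Step-3.5-Flash.
Link: https://arxiv.org/abs/2606.17628
Authors: Guibin Zhang,Xun Xu,Yanwei Yue,Zikun Su,Wangchunshu Zhou,Xiaobin Hu,Shuicheng Yan
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: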
Abstract:Memory has become a standard substrate for self-evolving agents, yet retaining experience is not the same as learning how to evolve through it. Existing memory agents can store trajectories, retrieve reflections, or accumulate skills, but often lack the holistic competence to select useful experience, act on it, write reusable knowledge, and maintain a growing repository. We introduce OPD-Evolver, a slow-fast co-evolution framework that cultivates such an agent evolver through on-policy self-distillation. In the fast loop, OPD-Evolver interacts with a four-level memory hierarchy to read, use, write, and maintain experience for rapid test-time evolution. In the slow loop, outcome-calibrated memory attribution and privileged hindsight distill these four abilities into the deployable policy. Across multi-domain benchmarks, OPD-Evolver surpasses memory systems such as ReasoningBank by up to 11.5%, and training-based methods such as Skill0 by ~5.8%. Further analysis shows that OPD-Evolver internalizes high-value experience and memory management, enabling OPD-Evolver-9B to challenge giant counterparts such as Qwen3.5-397B-A17B and Step-3.5-Flash, pointing beyond memory-augmented agents toward genuinely qualified agent evolvers.
[NLP-42] The Benchmark Illusion: Pruned LLMs Can Pass Multiple Choice but Fail to Answer
[Quick Read]: This paper addresses a "benchmark illusion" in evaluating compressed (e.g., pruned) LLMs: standard multiple-choice benchmarks can miss failures in open generation, where a pruned model still recognizes the correct option but cannot produce the answer itself. The central question is whether pruning erases the correct answer or only makes it harder to produce as the top output. Tracking the same multilingual question-answering items before and after pruning, the authors find that in most such cases the answer is not gone but demoted: its probability drops so it is no longer the top-ranked output, yet it often reappears with beam search, sampling, or a single in-context example. The paper argues that compressed models should be tested on what they can produce, not only on what they can recognize, to avoid an evaluation blind spot.
Link: https://arxiv.org/abs/2606.17609
Authors: Rui Wen,Lu Sun,Jiayang Liu,Zesheng Xu,Tianshuo Cong,Zheng Li
Affiliations: Institute of Science Tokyo; Tohoku University; Nanyang Technological University; KTH Royal Institute of Technology; Shandong University
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Compressing large language models reduces memory use and inference cost, but it can also create failures that standard benchmarks miss. A pruned model may still perform well on multiple-choice evaluations, yet fail to answer the same question in open generation. We ask what pruning changes: does it erase the correct answer, or does it make the answer harder to produce as the top output? We study this question with multilingual question answering, tracking the same questions before and after pruning. We find a benchmark illusion. Under high-sparsity pruning, especially Wanda, models often fail in greedy open generation while still selecting the correct answer under multiple-choice scoring. In these recognition-only errors, the answer is usually not gone, but demoted: it often reappears with beam search, sampling, or one in-context example. Overall, multiple-choice benchmarks can overstate the usability of compressed LLMs, creating an evaluation blind spot. Compressed models should be tested on what they can produce, not only on what they can recognize.
[NLP-43] LLM Features Can Hurt GNNs: Concatenation Interference on Homophilous Graph Benchmarks
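[Quick Read]: This paper examines why concatenating LLM-generated node features into graph neural network (GNN) inputs, although widely reported to improve benchmark accuracy, can systematically degrade it under certain conditions. The focus is pure input concatenation (rather than joint training, distillation, or prompt-conditioning), which hurts accuracy on homophilous graphs, most visibly with an MLP backbone and bag-of-words original features. To predict whether concatenation helps or hurts, the authors report Δ_sig, a simple measure of LLM-alone discriminability. Across 9 datasets, Δ_sig correlates with the change from concatenation more strongly than homophily does at point estimate (r² = 0.38 vs. 0.06). The bootstrap-best change-point is τ ≈ 13.8 percentage points, and when Δ_sig falls below it concatenation usually degrades accuracy; because most bootstrap samples place τ anywhere in [5, 30] percentage points, Δ_sig is treated as an interpretive lens rather than a precision filter. A dimension-controlled ablation rules out dimensionality and weight-decay artifacts, and the concatenation loss follows a power law |Δ_concat| ∝ (√(d_l/n))^1.31, which places the largest drop (PubMed) in the low-Δ_sig, small-sample regime.
Link: https://arxiv.org/abs/2606.17579
Authors: Zhongyuan Wang,Pratyusha Vemuri
Affiliations: RaptorX.AI
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: 29 pages, 8 figures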
Abstract:Adding LLM-generated node features to graph neural networks (GNNs) is widely reported to improve accuracy on standard benchmarks. We document a contrasting observation: when LLM features are introduced through pure input concatenation (rather than joint training, distillation, or prompt-conditioning), they can systematically degrade accuracy on the same homophilous benchmarks where end-to-end LLM pipelines succeed. With an MLP backbone on the Planetoid public split and bag-of-words original features, concatenating SBERT-encoded GPT-4o-mini TAPE features reduces PubMed test accuracy by -17.0 +/- 0.3 pp and Cora by -4.3 +/- 0.6 pp (CiteSeer -0.6 +/- 0.8 pp, within seed noise). The drop attenuates as we relax each condition (GCN / GCNII / GAT backbones, random splits, smaller encoders) and reverses on medium-homophily WikiCS (+4.4 pp) and ogbn-arxiv (+11.7 pp). To predict when concatenation helps versus hurts, we report a simple measure of LLM-alone discriminability, Delta_sig. Across 9 datasets Delta_sig correlates with the concatenation cost more strongly than homophily at point estimate (r^2 = 0.38 vs. 0.06; N=9, bootstrap CIs overlap). The bootstrap-best change-point is tau = 13.8 pp, and the rule “Delta_sig = tau predicts non-positive concat cost” classifies 7/9 datasets correctly; since 60% of bootstrap samples place tau in [5, 30] pp, we treat Delta_sig as an interpretive lens rather than a precision filter. A dimension-controlled ablation on PubMed places the LLM-feature drop between same-source PCA (-2.3 pp) and same-dim Gaussian noise (-37.3 pp), ruling out dimensionality and weight-decay artifacts. Nine PubMed configurations fit a power law |Delta_concat| proportional to (sqrt(d_l/n))^1.31 with r^2 = 0.97; the low-Delta_sig, small-n corner is exactly where the headline -17 pp PubMed deficit appears.
[NLP-44] Evaluating Large Language Models Abilities for Addressee Turn-change and Next Speaker Prediction in Meetings INTERSPEECH2026
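[Quick Read]: This paper studies turn-taking in multimodal multi-party conversations through three tasks: addressee detection, turn-change prediction, and next speaker prediction. The authors build an LLM-based evaluation framework and compare supervised models, text-only LLMs, multimodal LLMs (MM-LLMs), and human subjects on the AMI corpus. Text-based LLMs outperform both supervised models and humans on next speaker prediction, despite no training on the target domain and no access to audio or visual information. An MM-LLM does better than text-based LLMs on addressee detection and turn-change prediction but remains below human performance, indicating difficulty in exploiting raw audio-visual signals. Ablations show that conversational context is critical, particularly for next speaker prediction, and that human and LLM prediction patterns are similar: intervals with frequent turn changes are hard for both.
Link: https://arxiv.org/abs/2606.17542
Authors: Ryo Fukuda,Takatomo Kano,Siddhant Arora,Marc Delcroix,Naohiro Tawara,Atsunori Ogawa,Yuya Chiba,Atsushi Ando,William Chen,Shinji Watanabe
Affiliations: NTT; Language Technologies Institute, Carnegie Mellon University
Categories: Computation and Language (cs.CL)
Comments: Accepted to INTERSPEECH 2026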
Abstract:We investigate turn-taking in multimodal multi-party conversations using large language models (LLMs). We construct an evaluation framework for three tasks: addressee detection, turn-change prediction, and next speaker prediction. We compare supervised models trained for these tasks, text-based LLMs, multimodal LLMs (MM-LLMs), and human subjects. Experiments on the AMI corpus showed that LLMs outperformed supervised models and humans in next speaker prediction, despite not being trained on the target domain and without access to audio or visual information. An MM-LLM performed better than text-based LLMs on addressee detection and turn-change prediction but remained below human performance, indicating difficulty leveraging raw audio-visual signals. Ablation analyses revealed that conversational context was critical, particularly for next speaker prediction. We observed that human and LLM prediction patterns were similar, and intervals with frequent turn changes were difficult for both.
[NLP-45] An expressivity analysis of hierarchical modelling in deep transformers via bounded-depth grammars
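[Quick Read]: This paper addresses the lack of rigorous theory for how deep transformers represent hierarchical structure, such as syntactic hierarchy, in language modeling. It analyzes transformer expressivity through bounded-depth, non-recursive context-free grammars. For this class of grammars, the authors explicitly construct transformers with positional attention whose depth grows linearly with grammar depth, while the neuron count scales with the number of derivation-tree shapes and quadratically with the number of production rules. The construction encodes abstract grammatical states into low-dimensional, linearly separable subspaces of the residual stream, which supports the linear representation hypothesis.
Link: https://arxiv.org/abs/2606.17522
Authors: Vinoth Nandakumar,Qiang Qu,Pramod Thebe,Sakshi Khachariya,Tongliang Liu
Affiliations: University of Sydney; San Francisco State University; IIT Madras
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: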
Abstract:Deep neural networks are widely believed to derive their expressive power from their ability to form hierarchical representations, capturing progressively more abstract and compositional features across layers. In language modeling, transformers have emerged as the dominant architecture, with early layers capturing local syntactic patterns and later layers encoding more complex clause-level dependencies. While this intuition has shaped model design, there remains a lack of rigorous theoretical work demonstrating how deep transformers represent such hierarchical structures. In this work, we analyze the expressiveness of deep transformer models through the formal lens of bounded-depth, non-recursive context-free grammars. For this class of grammars, we explicitly construct transformers with positional attention whose depth grows linearly with grammar depth, while the neuron count scales with the number of derivation-tree shapes and quadratically with the number of production rules. Our theoretical results support the linear representation hypothesis by demonstrating that these architectures possess the structural capacity to encode abstract grammatical states into low-dimensional, linearly separable subspaces within the residual stream.
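To make the notion of a "bounded-depth, non-recursive" grammar concrete, here is a minimal Python sketch that computes the maximum derivation-tree depth of a toy context-free grammar and rejects recursive ones. The grammar encoding, the function name `grammar_depth`, and the toy rules are assumptions made for illustration; they are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical encoding: keys are nonterminals, any other symbol is a terminal.
Grammar = Dict[str, List[List[str]]]


def grammar_depth(grammar: Grammar, start: str) -> int:
    """Return the maximum derivation-tree depth reachable from `start`.

    Raises ValueError if a nonterminal can reach itself, because a
    recursive grammar has no finite depth bound.
    """
    on_path = set()  # nonterminals on the current DFS path
    memo = {}        # nonterminal -> already computed depth

    def depth(symbol: str) -> int:
        if symbol not in grammar:  # terminal symbol
            return 0
        if symbol in memo:
            return memo[symbol]
        if symbol in on_path:
            raise ValueError(f"recursive nonterminal: {symbol}")
        on_path.add(symbol)
        best = 0
        for rhs in grammar[symbol]:
            for child in rhs:
                best = max(best, depth(child))
        on_path.remove(symbol)
        memo[symbol] = 1 + best
        return memo[symbol]

    return depth(start)


# Toy grammar for illustration only.
toy = {
    "S": [["NP", "VP"]],
    "NP": [["det", "noun"]],
    "VP": [["verb", "NP"], ["verb"]],
}
print(grammar_depth(toy, "S"))  # prints 3
```

In the abstract's terms, this depth is the quantity that the constructed transformer's layer count grows linearly with; the sketch says nothing about how the construction itself works.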
[NLP-46] Scaling Enterprise Agent Routing: Degradation Diagnosis and Recovery
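【Summary】: This paper examines how single-step routing accuracy in production LLM assistants degrades as the tool catalog grows. On a catalog from a deployed enterprise productivity assistant, scaled from 10 agents up to 110 agents and 584 tools, routing F1 on under-specified requests drops by 16–23 percentage points across three frontier models. An oracle analysis splits the degradation into a retrieval gap (the model cannot surface the right tool) and a confusion gap (even with perfect retrieval, the oracle ceiling drops by 10 percentage points). The proposed remedy, embedding-based shortlisting, recovers 10–11 points of F1 at full scale across all three models and two providers. A production annotation study (1,435 human-labeled utterances, three annotators) confirms the recovery on real traffic at +10–17 points, even though absolute performance there is 10–15 points lower.
Link: https://arxiv.org/abs/2606.17519
Authors: Kellen Gillespie,Robyn Perry
Affiliations: Superhuman, Inc.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages (6 main + 4 appendix), 4 figures, 6 tables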
Abstract:Production LLM assistants route user requests to growing libraries of specialized tools, but how does routing accuracy degrade as the catalog scales? We study single-step routing on a 110-agent, 584-tool catalog from a deployed enterprise productivity assistant, evaluating three frontier models from 10 to 110 agents. Routing F1 on under-specified requests drops 16–23 percentage points across models. An oracle analysis decomposes the degradation into a retrieval gap (the model cannot surface the right tool) and a confusion gap (even with perfect retrieval, the oracle ceiling drops 10pp). Embedding-based shortlisting recovers +10–11pp F1 at full scale across all three models and two providers. A production annotation study (1,435 human-labeled utterances, three annotators) confirms the recovery on real traffic at +10–17pp despite 10–15pp lower absolute performance.
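The idea of embedding-based shortlisting can be sketched as follows: embed the request and every tool description, rank tools by cosine similarity, and hand only the top k to the router. This is a minimal sketch under stated assumptions. The bag-of-words `embed` function stands in for a real embedding model, and the catalog entries and function names are hypothetical rather than taken from the paper.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: stands in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def shortlist(request: str, tools: Dict[str, str], k: int) -> List[str]:
    """Return the k tool names whose descriptions best match the request."""
    query = embed(request)
    ranked = sorted(
        tools, key=lambda name: cosine(query, embed(tools[name])), reverse=True
    )
    return ranked[:k]


# Hypothetical catalog entries, for illustration only.
catalog = {
    "calendar.create_event": "create a calendar event or meeting",
    "email.send": "send an email message to a recipient",
    "docs.search": "search documents by keyword",
}
print(shortlist("schedule a meeting on my calendar", catalog, k=2))
```

The router then chooses among the shortlisted tools instead of the full catalog, which is one plausible way to narrow the retrieval gap the abstract describes.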
[NLP-47] Evaluating Second-Order Bias of LLMs Through Epistemic Entitlement
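【Summary】: This paper addresses the subtle social bias that large language models (LLMs) may show when they act as judges of bias, which the authors call second-order bias: social bias in a model's judgment about social bias. Existing evaluations mostly ask whether a model generates or implies biased content and do not systematically capture bias in how it evaluates such content. Drawing on entitlement epistemology, the authors conceptualize bias as misplaced foundational knowledge that shapes an agent's rational inquiry, and derive a logical reasoning task in which the LLM judges to whom a biased text is acceptable or non-acceptable. Two simple metrics measure how biased LLM judges are in inferring demographics for acceptability without sufficient support, and how these inferences vary across the groups targeted by the text. Across open and closed models, the task evades safety guardrails and surfaces bias in model judgment that varies systematically across target groups, reflects implicit social maps, and shows that models are still triggered by demographic labels. The work argues for evaluating LLM bias in judgment tasks and for more theoretically grounded bias evaluation in NLP. Code and model responses are released.
Link: https://arxiv.org/abs/2606.17506
Authors: Ramaravind Kommiya Mothilal,Terry Jingchen Zhang,Raiyan Ahmed,Zhijing Jin,Shion Guha,Syed Ishtiaque Ahmed
Affiliations: University of Toronto; Vector Institute; EuroSafeAI; Max Planck Institute for Intelligent Systems
Categories: Computation and Language (cs.CL)
Comments: 20 pages, 13 tables, 2 figures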
Abstract:Evaluations of social bias in LLMs largely focus on whether models generate or imply biased content. However, as LLMs are increasingly used as judges of bias, they may exhibit social biases in subtler ways in how they evaluate biased content, which current methods do not systematically capture. We call this second-order bias: social bias in an LLM’s judgment about social bias, which we evaluate through a novel, philosophically grounded reasoning task. Drawing on entitlement epistemology, we conceptualize bias as misplaced foundational knowledge that shapes an agent’s rational inquiry, and derive a logical reasoning task for LLMs to judge to whom a biased text is acceptable or non-acceptable. We develop two simple metrics to measure how biased LLM judges are in inferring demographics for acceptability without sufficient support, and how these inferences vary across groups targeted by biased texts. Evaluating open and closed models, we find that our task evades safety guardrails by surfacing bias in model judgment. It varies systematically across target groups, reflects implicit social maps, and shows how models are still triggered by demographic labels. Our work points to the need for LLM bias evaluation in judgment tasks and broadly, for more theoretically grounded approaches to bias evaluation in NLP. We release our code and model responses at this https URL.
[NLP-48] Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing
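【Summary】: This paper addresses deceptive behavior in LLMs, a safety concern that grows as reasoning capabilities improve. Existing deception monitors either score the visible transcript or derive a scalar probe score from representation vectors, which leaves little inspectable evidence about why a response is suspicious. The authors propose STATEWITNESS, an activation explainer for deception auditing: a separate decoder reads the target model's hidden states, then answers natural-language queries or emits structured reports about them. Evaluated on two target reasoning LLMs across seven deception datasets, STATEWITNESS reaches 0.916 mean AUROC, a relative gain of 11.6% over the best black-box text monitor and 25.0% over the best activation-probe baseline under the same protocol. Combined with existing monitors in simple threshold ensembles, it reduces missed deceptive examples. Beyond scalar detection, the decoder returns query-level answers, schema reports, and token- or sentence-level evidence traces for human inspection, and the authors view this interface as a potential building block for broader interpretability and alignment tools.
Link: https://arxiv.org/abs/2606.17478
Authors: Kexin Chen,Yi Liu,Haonan Zhang,Yanhui Li,Xinyu Deng,Dongxia Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under review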
Abstract:As LLMs acquire stronger reasoning capabilities, deceptive behavior becomes an increasingly serious safety concern. Existing deception monitors either score visible transcripts or derive scalar probe scores from representation vectors, leaving little inspectable evidence about why a response is suspicious. We introduce STATEWITNESS, an activation explainer for deception auditing. A separate decoder reads a target model’s hidden states, then answers natural-language queries or emits structured reports about them. We evaluate STATEWITNESS on two target reasoning LLMs across seven deception datasets. STATEWITNESS reaches 0.916 mean AUROC, a relative gain of 11.6% over the best black-box text monitor and 25.0% over the best activation-probe baseline under the same evaluation protocol. When combined with existing monitors, STATEWITNESS reduces missed deceptive examples in simple threshold ensembles. Beyond scalar detection, the decoder returns query-level answers, schema reports, and token- or sentence-level evidence traces for human inspection. We view this interface as a potential building block for broader interpretability and alignment tools.
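The abstract mentions "simple threshold ensembles" without specifying the rule. One plausible reading, sketched below purely as an assumption, is an OR-ensemble: an example is flagged when either monitor's score reaches its threshold. The scores, labels, thresholds, and function names are all hypothetical.

```python
from typing import List


def flag_any(scores_a: List[float], scores_b: List[float],
             thr_a: float, thr_b: float) -> List[bool]:
    """OR-ensemble: flag an example if either monitor reaches its threshold."""
    return [a >= thr_a or b >= thr_b for a, b in zip(scores_a, scores_b)]


def missed(flags: List[bool], labels: List[bool]) -> int:
    """Count deceptive examples (label True) that were not flagged."""
    return sum(1 for f, y in zip(flags, labels) if y and not f)


# Hypothetical scores for five examples; labels mark the deceptive ones.
labels = [True, True, True, False, False]
text_monitor = [0.9, 0.2, 0.4, 0.1, 0.3]
activation_monitor = [0.8, 0.7, 0.3, 0.2, 0.1]

text_only = [s >= 0.5 for s in text_monitor]
combined = flag_any(text_monitor, activation_monitor, 0.5, 0.5)
print(missed(text_only, labels), missed(combined, labels))  # prints 2 1
```

An OR rule can only lower the number of misses, but it can also raise false positives, so in practice the thresholds would need tuning on held-out data.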
[NLP-49] AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows
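【Summary】: This paper addresses the problem that most medical evaluations of LLMs for clinical consultation are static, single-turn, or narrowly outcome-based, and so fail to reflect the sequential, uncertain, and interactive nature of real-world care. The authors propose AIPatient Arena, an evaluation framework grounded in electronic health records (EHRs). It integrates EHR data into patient-specific knowledge graphs to enable multi-turn physician-patient interactions, and assesses clinical utility across eight dimensions of clinical competence. Models score well on interview questioning skills, ethical and professional conduct, and clarity of explanations, but show persistent weaknesses in handling ambiguous patient responses, information coverage, and diagnostic accuracy and reasoning. The results underline the value of process-based evaluation and offer a workflow-oriented, pre-deployment assessment of medical LLMs.
Link: https://arxiv.org/abs/2606.17474
Authors: Jiahui Niu,Huizi Yu,Wenkong Wang,Guangxin Dai,Jingxian He,Xiang Li,Zhiying Liang,Xinxin Lin,Kent CY So,Bryan YP Yan,Yun Kwok Wing,Yanqiu Xing,Xin Ma,Lizhou Fan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 49 pages, 12 figures, 11 tables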
Abstract:Large language models (LLMs) are increasingly considered for use in clinical consultation tasks, yet most medical evaluations remain static, single-turn, or narrowly outcome-based, limiting their ability to reflect the sequential, uncertain, and interactive nature of real-world care. Here, we propose AIPatient Arena, an EHR-grounded evaluation framework for assessing the clinical utility of LLMs across eight dimensions of clinical competence. The framework integrates EHR data into patient-specific knowledge graphs, enabling multi-turn physician-patient interactions. We applied AIPatient Arena to a primary cohort of 437 patients and two out-of-distribution validation cohorts of 119 and 67 patients. We observe that LLMs performed well in medical interview questioning skills (QS; mean scores, 4.43-4.99/5), ethical and professional conduct (ET; 4.38-4.93/5), and clarity and transparency of clinical explanations (EX; 3.80-4.72/5). Performance was moderate in information integration (II; 3.19-4.21/5) and medication safety and justification (MS; 3.13-3.78/5), but persistent weaknesses were observed in handling of ambiguous patient responses (HR; 2.57-3.32/5), information coverage (IC; 2.08-3.02/5), and diagnostic accuracy and reasoning (Dx; 2.63-3.55/5). Process-based evaluation revealed recurrent interaction failures, including repetitive questioning, omission of past medical history, and inadequate handling of uncertainty. Richer conversational context improved diagnostic reasoning but yielded limited gains in treatment planning. These findings indicate that final-answer accuracy alone is insufficient for evaluating clinical readiness and highlight the importance of assessing how models gather, interpret, and communicate information throughout a consultation. AIPatient Arena provides an EHR-grounded framework for workflow-oriented pre-deployment evaluation of medical LLMs.
[NLP-50] PARSE: Provenance-Aware Retrieval Sanitization for Professional Domain LLM Agents EMNLP2026
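【Summary】: This paper addresses the finding that prompt injection defenses which look strong on synthetic benchmarks do not generalize to real enterprise documents. Such documents are longer and denser, and they interleave legitimate authority language with factual content. The authors propose PARSE (Provenance-Aware Retrieval Sanitization), a domain-aware, fact-preserving sanitization pipeline. It classifies each sentence by injection likelihood, extracts structured facts before rewriting, and verifies fact preservation with a consistency-checking loop. A directiveness gate routes 59% of real enterprise documents to a lightweight path, so that computation concentrates on high-risk documents. On real documents PARSE reaches a 15.6% attack success rate, a 38% reduction from the 25.4% baseline, at 86.9% utility. It is the only condition that is both statistically significant (p=0.014) and close to baseline utility. The authors recommend evaluating defenses on domain-matched real documents rather than synthetic proxies.
Link: https://arxiv.org/abs/2606.17467
Authors: Aaditya Pai
Affiliations: Columbia University
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: 7 pages, 3 figures, 2 tables. Under submission at EMNLP 2026 Industry Track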
Abstract:Prompt injection defenses evaluated on synthetic benchmarks do not generalize to real enterprise documents, which are longer, denser, and interleave legitimate authority language with factual content. We demonstrate this gap with a real-document benchmark of 122 tasks across five professional domains (financial, legal, medical, scientific, DevOps) using actual SEC filings, Federal Register rules, PubMed abstracts, arXiv papers, and GitHub postmortems. Paraphrasing, the strongest defense on synthetic benchmarks, shows no statistically significant attack success rate reduction on real documents (p=0.500) while degrading utility from 91.8% to 82.8%. We introduce PARSE (Provenance-Aware Retrieval Sanitization), a domain-aware, fact-preserving sanitization pipeline that classifies each sentence by injection likelihood, extracts structured facts before rewriting, and verifies fact preservation via a consistency-checking loop. A directiveness gate routes 59% of real enterprise documents to a lightweight path, concentrating computational cost on high-risk documents. PARSE achieves 15.6% attack success rate – a 38% reduction versus the 25.4% baseline – at 86.9% utility, the only condition that is both statistically significant (p=0.014, adequately powered) and maintains near-baseline utility. Practitioners should evaluate defenses on domain-matched real documents, not synthetic proxies.
[NLP-51] MODE-RAG: Manifold Outlier Diagnosis and Energy-based Retrieval-Augmented Generation Evaluation ACL2026
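【Summary】: This paper addresses cross-modal hallucinations, causal fabrications, and sycophancy in Multimodal Retrieval-Augmented Generation (M-RAG) systems. Existing mitigation pipelines face an intervention paradox: static rules tend to disrupt accurate generations unnecessarily, whereas leaving multimodal reasoning unguided lets existing mismatches cascade into severe logical fabrications. The authors propose MODE-RAG, a multi-agent system that uses Variational Free Energy (VFE) and internal attention states to gate interventions dynamically. High-risk queries are routed to five stage-specific agents, which combine Monte Carlo Tree Search (MCTS) for rigorous causal derivation with logit perturbations that penalize sycophancy. Dedicated Correction and Overseer agents keep output formatting stable and perform post-hoc factual verification. For evaluation the authors introduce ModeVent, a challenging subset derived from the MultiVent dataset. Experiments indicate that the system reduces hallucination rates and logical fabrication and improves the robustness of M-RAG systems.
Link: https://arxiv.org/abs/2606.17449
Authors: Zehang Wei,Jiaxin Dai,Jiamin Yan,Xiang Xiang
Affiliations: Huazhong University of Science and Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: To be presented at ACL 2026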
Abstract:While Multimodal Retrieval-Augmented Generation (M-RAG) enhances Large Vision-Language Models, it remains highly susceptible to cross-modal hallucinations, causal fabrications, and sycophancy. Furthermore, existing mitigation pipelines often face an intervention paradox: static rules tend to unnecessarily disrupt accurate generations, whereas leaving the multi-modal reasoning completely unguided allows existing mismatches to cascade into severe logical fabrications. To quantify and mitigate these hallucinations, we propose a Multi-Agent system, MODE-RAG, driven by Variational Free Energy (VFE) and internal attention states to dynamically gate interventions. High-risk queries are routed to five stage-specific agents, integrating Monte Carlo Tree Search (MCTS) for rigorous causal derivation and logit perturbations to penalize sycophancy. Dedicated Correction and Overseer agents ensure formatting stability and perform post-hoc factual verification. To objectively evaluate our approach, we introduce ModeVent, a challenging subset derived from the MultiVent dataset. Extensive experiments indicate that our system effectively reduces hallucination rates and logical fabrication, significantly improving the robustness of M-RAG systems.
[NLP-52] Incumbent Advantage: Brand Bias and Cognitive Manipulation Dynamics in LLM Recommendation Systems
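【Summary】: This paper addresses the lack of understanding of how brands compete in LLM-based product recommendation, using skincare as a category in which consumers cannot easily judge quality before buying and must rely on brand reputation. Across three commercial LLMs (GPT-4o-mini, Claude Sonnet, Gemini 3 Flash), the authors report three findings. First, a Conditional Monopoly: when all products have identical specifications, well-known brands are recommended 100% of the time (IAI = 10.0), but this dominance disappears once a competitor holds a rating advantage of less than +0.1 stars. Second, authority-style marketing language, including fabricated clinical-evidence claims, breaks the monopoly at a Bias Surplus Value equal to +0.17 rating points, with each model responding differently. Third, multi-brand generative engine optimization (GEO) produces a social dilemma: when all brands adopt the same optimization strategy, individual payoff falls from +0.802 to +0.007 in the authors' payoff proxy, and non-participating brands receive zero recommendations in their tests. The authors conclude that GEO should be studied not only as a security risk but also as an emerging marketing practice that shapes market competition.
Link: https://arxiv.org/abs/2606.17443
Authors: Xi Chu,Yupeng Hou
Affiliations: Trine University; Texas A&M University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 16 pages, 4 figures, 11 tables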
Abstract:Large language models (LLMs) are becoming a major way for consumers to find products, but we do not yet understand how brands compete in this new channel. We study brand dynamics in LLM recommendations using skincare products – a category where consumers cannot easily judge quality before buying and must rely on brand reputation – across three commercial LLMs (GPT-4o-mini, Claude Sonnet, Gemini 3 Flash), with a robustness check on search goods. In three experiments, we find: (1) a Conditional Monopoly where well-known brands get recommended 100% of the time (IAI = 10.0) when all products have the same specifications, but this dominance disappears with less than a +0.1-star rating advantage for a competitor; (2) authority-style marketing language, including fabricated clinical-evidence claims, breaks this monopoly at a Bias Surplus Value equal to +0.17 rating points, with each model responding differently; and (3) a social dilemma in multi-brand GEO competition: when all brands adopt the same optimization strategy, individual payoff falls from +0.802 to +0.007 in our payoff proxy, and non-participating brands receive zero recommendations in our tests. Our results suggest that generative engine optimization (GEO) should be studied not only as a security risk, but also as an emerging marketing practice that shapes market competition.
[NLP-53] NarrativeWorldBench: A Frontier-Saturated Benchmark and a Latent World Model for Long-Horizon Co-Creative Audio Drama ICML2026
【Summary】: This paper addresses the loss of coherence and consistency when models generate long-form serialized audio drama, whose arcs run for 200 to 800 episodes, a setting where frontier large language models (LLMs) fail. On structural narrative metrics, all closed-frontier systems saturate at a plot-beat F1 in the band [0.78, 0.81] and collapse by about -0.20 F1 at horizon h=200. The proposed solution is N-VSSM, a Narrative Variational State-Space Model that maintains a structured 256-dimensional latent world state over more than 200 episodes, using a Mamba-2 backbone with an event-conditioned posterior and an 8B decoder. N-VSSM holds plot-beat F1 = 0.84 across all evaluated horizons (h = 10, 20, 50, 100, 200) at 4x lower compute than the closed-frontier band. A learned Cultural Transfer Function lifts cross-language fidelity by +0.20 to +0.23 Likert points across four Indic languages (Hindi, Tamil, Telugu, Marathi). In a within-subjects study with 12 professional authors (240 trials), N-VSSM is preferred over Claude Opus 4.5 on long-arc consistency 71% of the time and is rated +1.3 Likert points higher on controllability.
Link: https://arxiv.org/abs/2606.17391
Authors: Logan Mann,Abdur Rahman,Mohammad Saifullah,Taaha Kazi,Vasu Sharma
Affiliations: University of California, Santa Barbara; Pocket FM
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages. Accepted to the ICML 2026 Workshops on High-dimensional Learning Dynamics (HiLD) and Culture x AI
Abstract:Long-form serialized audio drama, with arcs that run for 200 to 800 episodes, is a major creative medium and a setting where frontier large language models (LLMs) fail. We benchmark 21 models, spanning classical, fine-tuned, open-frontier, closed-frontier, and reasoning tiers, on a uniform set of structural narrative metrics. All closed-frontier systems saturate at a plot-beat F1 in the band [0.78, 0.81] and collapse by about -0.20 F1 at horizon h=200. We introduce NarrativeWorldBench, an open benchmark of nine narrative-structure metrics evaluated across horizons h ∈ {10, 20, 50, 100, 200}, with cross-lingual evaluation across four Indic languages (Hindi, Tamil, Telugu, Marathi). We introduce N-VSSM, a Narrative Variational State-Space Model that maintains a structured 256-dimensional latent world state over more than 200 episodes via a Mamba-2 backbone with an event-conditioned posterior and an 8B decoder. N-VSSM holds plot-beat F1 = 0.84 across all horizons at 4x lower compute than the closed-frontier band. A learned Cultural Transfer Function lifts cross-language fidelity by +0.20 to +0.23 Likert points. In a within-subjects writer study (n = 12 professional authors, 240 trials), N-VSSM is preferred over Claude Opus 4.5 on long-arc consistency 71% of the time and rated +1.3 Likert points higher on controllability.
[NLP-54] Visuals Lie Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models ICLR2026
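【Summary】: This paper addresses the reliability of multimodal foundation models used as reasoning agents, in particular how to tell when a model may hallucinate. It tests a common intuition, which the authors call the Attention-Confidence Assumption: that tight visual attention on relevant regions signals a trustworthy answer. Their tool is the VLM Reliability Probe (VRP), a systematic cross-family study of reliability signals. Structural-attention metrics, namely cluster counts (C_k) and spatial entropy (H_s), quantify the visual encoder's gaze, and ΔH_s tracks its evolution across layers. The analysis reveals a "Symbolic Detachment": models often lock onto visual features early and then diffuse attention later, which severs early perception from final generation. It also reveals a "Cluster Failure": spatial attention has near-zero correlation with accuracy (R ≈ 0.001). Reliability instead arises from generation dynamics and internal-state distributions, and Self-Consistency, the agreement rate across sampled reasoning paths, is the dominant predictor of correctness (R = 0.429). Causal interventions expose an architectural divergence: LLaVA locks its prediction in a fragile late-stage bottleneck, whereas PaliGemma and Qwen2-VL distribute reliability globally and stay resilient even when about 50% or more of their most predictive layer is destroyed. For current VLMs, reliability signals are therefore detached from visual grounding maps and are best inferred from generation-time dynamics and hidden-state probes.
Link: https://arxiv.org/abs/2606.17389
Authors: Logan Mann,Yi Xia,Ajit Saravanan,Ishan Dave,Saadullah Ismail,Shikhar Shiromani,Emily Huang,Ruizhe Li,Kevin Zhu
Affiliations: University of California, Santa Barbara; Algoverse AI Research; University of California, Berkeley; Independent Researcher
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 16 pages. Accepted to the ICLR 2026 Workshop on Multimodal Intelligence. Code: this https URL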
Abstract:Multimodal Foundation Models are increasingly used as reasoning agents, making reliability, knowing when a model may hallucinate, critical. A common intuition, which we call the Attention-Confidence Assumption, holds that reliability follows from “structural” visual perception: tight attention on relevant regions should signal a trustworthy answer, while scattered attention signals confusion. We challenge this through the VLM Reliability Probe (VRP), a systematic cross-family study of reliability signals in contemporary Vision-Language Models (VLMs). We introduce structural-attention metrics, cluster counts (C_k) and spatial entropy (H_s), to quantify the visual encoder’s gaze, and track its evolution (ΔH_s) across layers. This reveals a “Symbolic Detachment”: models often “Early Lock” visual features only to diffuse attention later, severing early perception from final generation. Contrary to the grounding hypothesis, we find a “Cluster Failure”: spatial attention has near-zero correlation (R ≈ 0.001) with accuracy. Instead, reliability is a phenomenon of generation dynamics and internal-state distributions. Self-Consistency, the agreement rate across sampled reasoning paths, is the dominant predictor of truth (R = 0.429). Scaling causal interventions exposes a sharp architectural divergence: LLaVA locks its prediction in a fragile late-stage bottleneck, whereas PaliGemma and Qwen2-VL distribute reliability globally, staying resilient even when ~50% or more of their most predictive layer is destroyed. For current VLMs, reliability signals are detached from visual grounding maps and are best inferred from generation-time dynamics and hidden-state probes.
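The two kinds of signal contrasted in the abstract can be illustrated with a short Python sketch. The definitions below are assumptions, since the abstract does not give formulas: spatial entropy is taken to be the Shannon entropy of the normalized attention map, and self-consistency is taken to be the share of sampled answers that match the most common answer. The paper's exact definitions of H_s and Self-Consistency may differ.

```python
import math
from collections import Counter
from typing import List, Sequence


def spatial_entropy(attention: Sequence[Sequence[float]]) -> float:
    """Shannon entropy (nats) of a non-negative attention map, normalized to sum to 1."""
    cells = [w for row in attention for w in row]
    total = sum(cells)
    if total <= 0:
        raise ValueError("attention map must have positive mass")
    probs = [w / total for w in cells if w > 0]
    return sum(p * math.log(1.0 / p) for p in probs)


def self_consistency(answers: List[str]) -> float:
    """Fraction of sampled answers that agree with the most common answer."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][1] / len(answers)


# Hypothetical 2x2 attention maps and sampled answers, for illustration only.
focused = [[1.0, 0.0], [0.0, 0.0]]      # all attention on one cell
diffuse = [[0.25, 0.25], [0.25, 0.25]]  # attention spread uniformly
print(spatial_entropy(focused), spatial_entropy(diffuse))
print(self_consistency(["cat", "cat", "dog", "cat"]))  # prints 0.75
```

Under these definitions a focused map has entropy 0 and a uniform map over four cells has entropy ln 4. The abstract's claim is that quantities of the first kind barely correlate with accuracy, while agreement across samples does.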
[NLP-55] Implicit vs. Explicit Prompting Strategies for LVLMs in Referential Communication
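【Summary】: This paper addresses the apparently contradictory conclusions of two recent studies (Jones et al., 2026; Zeng et al., 2026) on whether large vision-language models (LVLMs) can coordinate on efficient referring expressions. The approach is to control for task differences between the studies while directly comparing their prompting styles. Models do coordinate on efficient referring expressions when explicitly prompted for communicative efficiency, which suggests that other task differences are not responsible for the divergent results. With a more implicit prompt, however, the same models fail to infer the need for efficiency. This points to a key difference between human and AI communication: humans pick up the need for efficiency from implicit contextual cues, whereas current LVLMs still depend on explicit instructions.
Link: https://arxiv.org/abs/2606.17372
Authors: Peter Zeng,Amie J. Paige,Weiling Li,Susan E. Brennan,Owen Rambow,Cameron R. Jones
Affiliations: Stony Brook University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: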
Abstract:Two recent studies (Jones et al., 2026; Zeng et al., 2026) reach apparently contradictory conclusions about whether LVLMs can coordinate on efficient referring expressions. We control for task differences between the studies while directly comparing their prompting styles. We replicate the finding that models can coordinate efficient referring expressions when explicitly prompted to do so, suggesting that other task differences are not responsible for divergent results. However, we also find that the same models fail to infer the need for communicative efficiency from a more implicit prompt, highlighting critical differences between how humans and AI systems communicate.
[NLP-56] Translating the Untranslatable: An Operationalizable Ontology for Untranslatability
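【Summary】: This paper addresses the limits of machine translation (MT) in cases of untranslatability, where meaning cannot be directly preserved across languages. As MT systems improve on standard benchmarks, their remaining failures increasingly concentrate in such cases, where translation cannot be reduced to one-to-one equivalence. The authors introduce a structured ontology of untranslatability together with a taxonomy of compensation strategies, that is, specific techniques for conveying meaning under untranslatable circumstances. They operationalize the framework as a multilingual dataset of untranslatable sentences paired with strategy-based translations, which enables controlled analysis of translation behavior. Initial human preference studies suggest that translation quality depends on the strategy used, with consistent preferences for outputs that include explanatory context, known as the Annotation compensation strategy. The framework and dataset provide a foundation for studying and modeling strategy-informed MT.
Link: https://arxiv.org/abs/2606.17354
Authors: Jacob Bremerman,Brihi Joshi,Hirona Arai,Xiang Ren,Jonathan May
Affiliations: University of Southern California; Information Sciences Institute
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: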
Abstract:Untranslatability, cases where meaning cannot be directly preserved across languages, is well-studied in linguistics but underexplored in NLP. As machine translation (MT) systems improve on standard benchmarks, their limitations increasingly concentrate in such cases, where translation cannot be reduced to one-to-one equivalence. We introduce a structured ontology of untranslatability along with a taxonomy of compensation strategies, which are specific techniques to convey meaning under these untranslatable circumstances. We operationalize this framework into a multilingual dataset of untranslatable sentences paired with strategy-based translations, enabling controlled analysis of translation behavior. Initial human preference studies suggest that translation quality depends on the strategy used, with consistent preferences for outputs that include explanatory context, known as the Annotation compensation strategy. Our framework and dataset provide a foundation for studying and modeling strategy-informed machine translation.
[NLP-57] Do Large Language Models Always Tell The Same Stories?
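【Summary】: This paper addresses the contested question of whether large language models (LLMs) can generate diverse outputs, focusing on the narrative diversity of LLM-generated stories. The approach is a contrastive framework built on narrative similarity: using human-written stories and prompts from r/WritingPrompts, the authors collect narrative similarity judgments across 10 representative LLMs, with both human evaluation and three automatic annotation methods. LLM-generated narratives are consistently more similar to each other than human-written stories are. Frontier models in particular converge on a "mean" generic narrative that approximates individual human stories but lacks the collective diversity of human authors. Common mitigation strategies, including negative prompting and temperature scaling, fail to address this homogeneity in any meaningful way.
Link: https://arxiv.org/abs/2606.17350
Authors: Thennal DK,Hans Ole Hatzel
Affiliations: University of Hamburg
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: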
Abstract:Recent advances in large language models (LLMs) have enabled the generation of high-quality prose, yet the question of whether these models are capable of generating diverse outputs remains contested. In this work, we investigate the diversity of LLM-generated stories through the framework of narrative similarity. Using a contrastive framework and a dataset of human-written stories and prompts from r/WritingPrompts, we collect narrative similarity judgments across 10 representative LLMs, utilizing both human evaluations and three different automatic annotation methods. Our findings reveal a consistent trend: LLM-generated narratives are consistently more similar to each other than human-written stories are. We demonstrate that frontier models in particular converge on a “mean” generic narrative that approximates individual human stories but lacks the collective diversity of human authors. Finally, we show that common mitigation strategies, including negative prompting and temperature scaling, fail to meaningfully address this homogeneity.
[NLP-58] SpeechDx: A Multi-Task Benchmark for Clinical Speech AI
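【Summary】: This paper addresses fragmentation in clinical speech AI, where most methods have progressed through isolated, condition-specific studies, so that results are hard to compare and generalization is hard to assess. The authors introduce SpeechDx, a large-scale benchmark spanning 12 datasets and 27 tasks across diverse health conditions. To expose shared clinical mechanisms, tasks are organized by the stage of speech production they disrupt: conceptualization, formulation, and articulation. The benchmark tests generalization by including tasks with limited labeled data and by evaluating the same condition across multiple datasets, which helps separate clinically meaningful patterns from dataset artefacts. Across 12 state-of-the-art audio encoders, large-scale speech models are the strongest overall baselines, domain-specific models help only on closely matched tasks, and no current representation generalizes reliably across the clinical speech landscape. SpeechDx thus provides a shared evaluation framework for tracking progress toward general-purpose clinical speech representations.
Link: https://arxiv.org/abs/2606.17339
Authors: Sejal Bhalla,Larry Kieu,Aina Merchant,Eyal de Lara,Alex Mariakakis
Affiliations: University of Toronto, Canada
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Comments: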
Abstract:Speech offers a uniquely informative window into health by simultaneously engaging neurological, motor, respiratory, and vocal systems. Current clinical speech AI methods have largely progressed through isolated condition-specific studies, making results difficult to compare and generalization difficult to assess. We introduce SpeechDx, a large-scale benchmark for clinical speech AI spanning 12 datasets and 27 tasks across diverse health conditions. To enable evaluation across shared clinical mechanisms, SpeechDx structures tasks by the stage of speech production they disrupt: conceptualization, formulation, and articulation. The benchmark tests generalization by including tasks with limited labeled data and evaluating the same health condition across multiple datasets, distinguishing clinically meaningful patterns from dataset artefacts. We systematically evaluate 12 state-of-the-art audio encoders across all tasks and under zero-shot cross-condition transfer. Results show that large-scale speech models represent the strongest overall baselines, domain-specific models improve performance only on closely matched tasks, and no current representation generalizes reliably across the clinical speech landscape. SpeechDx establishes a shared evaluation framework for tracking progress toward general-purpose clinical speech representations.
[NLP-59] Examining the Limits of Word2Vec with Toki Pona
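【Summary】: This paper asks whether Word2Vec still captures semantic relationships when the vocabulary is extremely small, using Toki Pona, a constructed language with approximately 130 words. Word2Vec has been tested almost exclusively on languages with large vocabularies. The authors collect 1.4 million sentences (7.95 million tokens) from the Toki Pona community and train two models: one retains the non-Toki Pona tokens found in the corpus (named entities, loanwords, neologisms), and the other filters them out completely. The sparse, non-core tokens do not affect the relative structure of the learned embeddings, and they actually draw similar words closer together in the vector space. The results indicate that Word2Vec's effectiveness depends more on distributional patterns than on lexicon size, even at this extreme lower bound.
Link: https://arxiv.org/abs/2606.17299
Authors: Daniel Zhenhan Huang,Hongchen Wu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 4 figures, 3 tables. Accepted to the Society for Computation in Linguistics (SCiL) 2026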
Abstract:Word2Vec’s effectiveness at generating semantic embeddings has been widely validated, yet it has been tested almost exclusively on languages with large vocabulary inventories. This study examines whether Word2Vec can successfully capture semantic relationships within an extremely reduced vocabulary using data from Toki Pona, a constructed language with approximately 130 words. We sourced 1.4 million sentences (7.95 million tokens) from the Toki Pona community for training. Approximately 23% of sentences in the corpus contain non-Toki Pona tokens such as named entities, loanwords, and neologisms. To investigate whether this linguistic noise enhances or hinders performance – a topic rarely addressed in word embedding literature – we trained two distinct models: one retaining these incidental tokens and another filtering them out completely. Evaluation was conducted using quantitative methods measuring word proximity to semantic category centroids, automated silhouette scores via agglomerative clustering, and qualitative analysis utilizing representational similarity matrices compared against English. The results indicate that while sparse, non-core tokens do not affect the relative structure of the learned embeddings, they actually draw similar words closer together in the vector space. Importantly, Word2Vec’s effectiveness depends more on distributional patterns than lexicon size even at this extreme lower bound.
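One of the quantitative measures named in the abstract, word proximity to semantic category centroids, can be sketched as below. This is a minimal illustration under stated assumptions: proximity is taken to be cosine similarity to the mean vector of the category, and the two-dimensional vectors are made up for the example. The paper's exact metric and its real embeddings may differ.

```python
import math
from typing import Dict, List

Vector = List[float]


def centroid(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def category_proximity(embeddings: Dict[str, Vector],
                       category: List[str]) -> Dict[str, float]:
    """Cosine similarity of each category word to the category centroid."""
    center = centroid([embeddings[w] for w in category])
    return {w: cosine(embeddings[w], center) for w in category}


# Hypothetical 2-d vectors; real Word2Vec embeddings have far more dimensions.
toy_embeddings = {
    "soweli": [0.9, 0.1],  # land animal
    "waso": [0.8, 0.3],    # bird
    "kala": [0.7, 0.2],    # fish
    "tomo": [0.1, 0.9],    # building
}
animals = ["soweli", "waso", "kala"]
scores = category_proximity(toy_embeddings, animals)
print(scores)
```

Higher scores mean a category is more tightly clustered, which is the kind of comparison one could run between the filtered and unfiltered models.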
[NLP-60] Nothing from Something: Can a Language Model Discover 0?
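【Summary】: This paper asks how far neural AI systems can reach beyond their training data in mathematical discovery, which requires a strong form of out-of-distribution generalization: hypothesizing genuinely new mathematical structures. Using simple arithmetic as a case study, the authors test whether language models can independently discover the concept of "zero". Models of GPT-2 size cannot perform this generalization at test time, regardless of language pretraining, but they improve substantially after training on tens or hundreds of examples of zero. Language pretraining reduces the number of required examples by approximately 50%, which suggests that language abilities can scaffold mathematical discovery in neural models.
Link: https://arxiv.org/abs/2606.17289
Authors: Phoebe Zeng,Thomas L. Griffiths,Brenden M. Lake
Affiliations: Princeton University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: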
Abstract:AI systems based on artificial neural networks are being developed with aspirations of pushing the boundary of human mathematical knowledge. A key question for these systems is how much they can reach beyond their training data. Mathematical discovery requires a strong form of out-of-distribution generalization: the ability to hypothesize genuinely new, and potentially logically more powerful, mathematical structures. It has been hypothesized that language abilities support such generalizations in human cognition. In this work, we use simple arithmetic as a case study for examining how modern AI models could expand their mathematical horizons, evaluating whether these models can independently discover the concept of “zero”. We show that (1) language models of a GPT-2 size are unable to perform this generalization at test time regardless of language pretraining, but (2) models can improve substantially after training on tens or hundreds of examples of zero. Additionally, we find that language pretraining reduces the number of required examples by approximately 50%, showing that language abilities can scaffold mathematical discovery in neural models.
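[NLP-61] Are you speaking my languages? On spoken language adherence in multimodal LLMs
【Quick read】: This paper addresses inaccurate language identification in LLM-based automatic speech recognition (ASR) under multilingual use: models often produce output in the wrong language, which lowers transcription fidelity and degrades downstream applications. The problem is formally defined as a lack of language adherence, and a new metric is introduced to quantify violations. The core of the solution is soft prompting: rather than forcing an output language, the prompt hints at the likely spoken languages, which preserves flexibility and code-switching. Three mitigation strategies are evaluated: (1) zero-shot prompting for robust guidance under uncertainty; (2) supervised fine-tuning (SFT) to improve prompt adherence; and (3) Chain-of-Thought (CoT) reasoning to enforce adherence during decoding. A comparison across multiple languages assesses how well each method reduces language violations while maintaining overall ASR performance, and the paper discusses trade-offs under different compute constraints to guide strategy selection.
Link: https://arxiv.org/abs/2606.17281
Authors: Hyungwon Kim,Kandarp Joshi,Lillian Zhou,Pavel Golik,Petar Aleksic
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 7 pages, 3 tables in the main body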
Abstract:While Large Language Model (LLM) based Automatic Speech Recognition (ASR) enables seamless multilingual use, models often misidentify the output language, compromising transcription fidelity and downstream application quality. To preserve flexibility and code-switching capabilities, we propose a soft prompting approach that hints at potential spoken languages without strictly constraining the output. We formally define this challenge as a lack of language adherence, introduce a novel metric to quantify violations, and evaluate three mitigation strategies: (1) zero-shot prompting for robust guidance under uncertainty, (2) supervised fine-tuning (SFT) to improve prompt adherence, and (3) Chain-of-Thought (CoT) reasoning to enforce adherence during decoding. We present a comparative analysis of these methods across multiple languages, evaluating effectiveness in reducing the language violation while maintaining overall ASR performance. Finally, we discuss trade-offs to guide strategy selection under various compute constraints.
[NLP-62] MLLP-VRAIN UPV system for the IWSLT 2026 Simultaneous Speech Translation task
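【Quick read】: This paper targets the quality-latency trade-off in long-form simultaneous speech translation (SimulST) across multiple language directions. The system is a cascaded architecture built on the recently released Parakeet and Qwen 3.5 models, with adaptive "black-box" policies that control when output is emitted during translation; relaxing these policies yields better quality-latency trade-offs. For the English to German, Italian, and Chinese directions, the team also took part in the new 2026 context track, combining ASR word-boosting with a retrieval-augmented generation (RAG) mechanism over offline pre-translated exemplars to inject domain-specific context. On the MCIF En→De test set, XCOMET-XL improves by +5.82 over last year's system, and context-track processing adds a further +1.03.
Link: https://arxiv.org/abs/2606.17255
Authors: Jorge Iranzo-Sánchez,Gerard Mas-Mollà,Adrià Giménez,Jorge Civera,Albert Sanchis,Alfons Juan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: IWSLT 2026 System Description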
Abstract:This work describes the participation of the MLLP-VRAIN research group in the shared task of the IWSLT 2026 Simultaneous Speech Translation track. Our submission utilizes the recently released Parakeet and Qwen 3.5 models to create a robust, cascaded solution for long-form SimulST through the use of adaptive “black-box” policies. We explore relaxations of these policies to achieve better quality-latency trade-offs. Compared to last year, we participate in all language directions. In addition, for the En→De, It, Zh directions we also participate in this year’s new context track, employing a combination of ASR word-boosting and a RAG mechanism of offline pre-translated exemplars to guide generation and enrich our system with domain-specific context. Finally, we provide a detailed latency analysis of our system. Compared to last year, results on the MCIF En→De test set show a substantial quality improvement of +5.82 XCOMET-XL. Our context track processing further improves performance by +1.03.
[NLP-63] Rethinking Groups in Critic-Free RLVR
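【Quick read】: This paper addresses the data inefficiency, group synchronization barriers, and inflexibility with structured rollouts of existing critic-free RL methods for post-training large language models. These methods sample a group of rollouts for the same question to estimate a value baseline for advantage computation, a design that wastes data and limits the parallelism and adaptability of training. The key insight comes from revisiting what the "group" actually does: its underlying function is not merely to estimate baselines but to prevent false penalties on negative samples. Building on this, the authors propose negative token filtering, a simple strategy that enables stable single-rollout training. Applied to two batch-level advantage methods, it achieves performance comparable to group-based methods on reasoning tasks and stronger performance on agentic tasks.
Link: https://arxiv.org/abs/2606.17250
Authors: Yihong Wu,Liheng Ma,Lingfeng Xiao,Muzhi Li,Xinyu Wang,Yingxue Zhang,Jian-Yun Nie
Affiliations: Université de Montréal; McGill University; Mila - Quebec AI Institute; University of Waterloo; The Chinese University of Hong Kong; Huawei Noah’s Ark Lab
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: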
Abstract:Reinforcement learning (RL) has become a central paradigm for post-training large language models. Existing critic-free RL methods typically generate a group of rollouts for the same question to estimate value baselines for advantage computation. However, this design suffers from data inefficiency, group synchronization barriers, and inflexibility with structured rollouts. In this work, we revisit the role of the “group” and show that its underlying function is not merely to estimate baselines but to prevent false penalties on negative samples. Building on this insight, we propose negative token filtering, a simple and effective strategy that enables stable single-rollout training. We apply it to two batch-level advantage methods, achieving comparable performance on reasoning tasks and stronger performance on agentic tasks relative to group-based RL techniques.
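[NLP-64] Speaking in Self-Assessing Tongues: On the Verbalized Confidence of LLMs in Machine Translation
【Quick read】: This paper studies how reliably large language models (LLMs) assess the trustworthiness of their own output in machine translation. Conventional approaches measure confidence from internal signals such as predicted probabilities, but these reflect certainty among candidate outputs rather than translation correctness, and they require access to model internals. To avoid both limitations, the paper devises five "verbalized" methods for extracting per-token confidence without internal signals, using prompting to have the model state its own judgment of how reliable its output is. Reliability is evaluated through fine-grained error detection and calibration. Verbalized methods and internal signals perform similarly, with results varying by model, yet the two show little to no correlation with each other.
Link: https://arxiv.org/abs/2606.17234
Authors: Ali Marashian,Alexis Palmer,Katharina von der Wense
Affiliations: University of Colorado Boulder, USA; Johannes Gutenberg University Mainz, Germany
Categories: Computation and Language (cs.CL)
Comments: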
Abstract:The rapid rise in popularity of large language models (LLMs) for translation calls for a thorough study of the reliability of their confidence in their own outputs. Unlike many generation tasks, translation errors and confidence levels can be useful at different levels of granularity (tokens, words, or spans). Unsupervised approaches based on internal signals like predicted probabilities can be misleading because they reflect certainty among alternatives rather than correctness. In addition, they require access to such internal signals. Here, we devise five verbalized methods of extracting an LLM’s per-token confidence without those shortcomings and compare their reliability with that of the model’s internal signals of certainty. We evaluate reliability using two forms of alignment: fine-grained error detection and calibration. For both, internal and verbalized methods perform similarly, although results vary by model. Interestingly, we find little to no correlation between internal and verbalized methods.
[NLP-65] Rift: A Conflict Signature for Deception in Language Models
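【Quick read】: This paper addresses models that lie while knowing the truth, a behavior that behavioral evaluation alone struggles to identify: when a model gives a wrong answer, one cannot tell an honest error made out of ignorance from deliberate deception. The key move is a control for wrongness. A "sleeper agent" (knows the truth, lies on trigger) is contrasted with a "naive liar" (fine-tuned to emit the same wrong answers with no honest training); because both produce identical wrong outputs, any difference reflects knowledge conflict rather than incorrectness. Deceptive forward passes carry a marked internal signature: on the same wrong answer, their residual rank is 2.1-2.3x higher than that of naive-liar passes, which is enough to identify which of two responses is the lie with 100% accuracy and no labels on GPT-2 small/medium and three instruct models. On Qwen2.5-1.5B/7B and Phi-3-mini, instructed deception raises residual rank on every tested fact. The signature survives self-constructed lies, active concealment attempts, and length-controlled replication. A probe built on basis-free relative representations and trained on one model family detects deception in two other families zero-shot (mean AUC 0.933), survives a simultaneous change of architecture and format (AUC 0.821), and transfers across five languages (AUC 1.000, length-controlled). The signature is read-only: it can be detected but not injected.
Link: https://arxiv.org/abs/2606.17229
Authors: Petr Nyoma
Affiliations: Harmonic Labs
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 13 pages, 4 figures. Code and experiment logs: this https URL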
Abstract:A model that lies while knowing the truth is the central case ELK cannot handle with behavioral evaluation alone. We ask whether such deception leaves an internal signature distinguishing it from honest error. Our key move is a control for wrongness: we contrast a sleeper agent (knows the truth, lies on trigger) against a naive liar (fine-tuned to emit the same wrong answers with no honest training). Both produce identical wrong outputs; any difference is about knowledge conflict, not incorrectness. We find deceptive forward passes carry a conflict signature - 2.1-2.3x higher residual rank than naive-liar passes on the same wrong answer - strong enough to identify which of two responses is the lie with 100% accuracy and no labels, across GPT-2 small/medium (three seeds) and three instruct models. Across Qwen2.5-1.5B/7B and Phi-3-mini, instructed deception raises residual rank on every tested fact (18/18, 40/40, 34/34); on Phi-3, lies separate perfectly from both honest answers and hallucinations (AUC 1.0, Wilcoxon p~6e-11). The signature survives strategic self-constructed deception (model invents its own lie, AUC 1.0), active concealment attempts (AUC 1.0), and length-controlled replication (20/20, AUC 1.0, p~1e-6). Using basis-free relative representations, a probe trained on one model family detects deception in two other families zero-shot (mean AUC 0.933), surviving simultaneous architecture and format change (AUC 0.821), and transfers across five languages (AUC 1.000, length-controlled). The signature is read-only: detectable but not injectable (0/8 both directions). Honest limitations and six negative experiments are documented in full.
[NLP-66] Revisiting LLM Adaptation for 3D CT Report Generation: A Study of Scaling and Diagnostic Priors
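【Quick read】: This paper addresses the challenges of applying large language models (LLMs) and vision-language models (VLMs) to report generation for volumetric (3D) CT images: high computational complexity, volumetric dependencies, and the semantic gap between visual features and clinical terminology. Directly fine-tuning large models on limited medical data tends to cause overfitting and clinical hallucination, where output is fluent but not clinically factual. The paper proposes RAD3D-Prefix, a lightweight diagnostic-prior conditioning framework whose core idea is to fuse image embeddings with the logits of a multi-label diagnostic classifier, preserving critical clinical detail and narrowing the semantic gap while the pretrained language model stays frozen. The method needs very few trainable parameters, which lowers the risk of overfitting. A study across models from 96.1M to 1.6B parameters finds that fine-tuning is most beneficial for smaller models, whereas for larger models (~1B+ parameters) freezing the backbone and training only lightweight projection layers gives the best balance of performance, generalization, and computational efficiency. Across multiple automatic metrics and a clinical reader study, RAD3D-Prefix outperforms comparable parameter-efficient baselines and shows strong out-of-domain generalization with substantially fewer trainable parameters than fully fine-tuned alternatives.
Link: https://arxiv.org/abs/2606.17213
Authors: Vanshali Sharma,Andrea M. Bejar,Halil Ertugrul Aktas,Quoc-Huy Trinh,Debesh Jha,Gorkem Durak,Ulas Bagci
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: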
Abstract:Recent advances in multimodal learning, including large language models (LLMs) and vision-language models (VLMs), have demonstrated strong adaptability to natural images. However, extending their use to the medical domain, particularly for volumetric (3D) images, is challenging due to high computational complexity, volumetric dependencies and the semantic gap between visual features and clinical terminology. Naively fine-tuning LLMs on limited medical data often leads to overfitting and clinical hallucination, where linguistic fluency is prioritized over clinical factuality. In this study, we investigate parameter-efficient adaptation strategies for volumetric CT report generation and introduce RAD3D-Prefix, a lightweight diagnostic-prior conditioning framework that minimizes the need for extensive parameter training. This module integrates image embeddings with multi-label diagnostic classification logits, preserving critical clinical details while bridging the semantic gap. By keeping the LLM frozen, our method requires minimal trainable parameters and mitigates the risk of overfitting on small, domain-specific datasets. Through a systematic study spanning LLMs from 96.1M to 1.6B parameters, we find that fine-tuning is most beneficial for smaller LLMs, whereas freezing larger (~1B+) LLMs and training only lightweight projection layers provides a superior trade-off between performance, generalization, and computational efficiency. Across multiple automatic metrics and a clinical reader study, RAD3D-Prefix outperforms comparable parameter-efficient baselines and demonstrates strong out-of-domain generalization while using substantially fewer trainable parameters than fully fine-tuned alternatives.
[NLP-67] Not Truly Multilingual: Script Consistency as a Missing Dimension in VLM Evaluation
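【Quick read】: This paper addresses a basic flaw in current evaluation of multilingual vision-language models (VLMs): the assumption of a one-to-one mapping between language and writing system, which overlooks billions of users of multi-script languages such as Punjabi, with its three active scripts (Gurmukhi, Shahmukhi, and Roman). The authors introduce the PuMVR (Punjabi Multimodal Visual Reasoning) benchmark, 1,000 strictly parallel image-text instances across the three scripts. The central contribution is to expose and quantify the "Script Gap": models perform markedly inconsistently across scripts. Visual input raises performance across the board but does not close the gap, and cross-script in-context transfer is highly brittle, exposing script-locked knowledge. The paper proposes the Script Consistency Rate (SCR) as a mandatory metric for script-agnostic evaluation, to support equitable access to AI for users of multi-script languages.
Link: https://arxiv.org/abs/2606.17188
Authors: Prabhjot Singh,Bhushan Pawar,Madhu Reddiboina,Rajvee Sheth
Affiliations: RediMinds Inc., USA; The University of Texas at Austin, USA; Google DeepMind; OpenAI; Anthropic; xAI; Alibaba; Meta; LMMS-Lab; OpenGVLab; Moonshot AI
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: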
Abstract:Current multilingual evaluations for Vision-Language Models (VLMs) assume a one-to-one mapping between language and orthography, overlooking billions of users of multi-script languages. We introduce PuMVR (Punjabi Multimodal Visual Reasoning), a benchmark of 1,000 strictly parallel image-text instances across Punjabi’s three active scripts: Gurmukhi, Shahmukhi, and Roman. Evaluating 10 state-of-the-art VLMs, we expose a substantial and systematic Script Gap. Models frequently solve visual tasks in one script while failing identical tasks in another, with accuracy deltas reaching 16%. Crucially, visual input boosts absolute performance uniformly yet does not close the orthographic gap. Furthermore, cross-script in-context transfer is highly brittle, exposing script-locked knowledge representation. Supported by McNemar tests across all script pairs, our findings demonstrate that current “multilingual” VLMs are not truly multi-script. We propose the Script Consistency Rate (SCR), which falls as low as 24.8% on our benchmark, as a mandatory metric for script-agnostic evaluation to ensure equitable AI access. Data and code are available at: this https URL.
[NLP-68] Self-Generated Error Training for Token Editing in Diffusion Language Models
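【Quick read】: This paper addresses a training-inference mismatch in token-to-token (T2T) editing during block-diffusion decoding. The existing recipe trains the editor on random vocabulary corruptions, but at inference the editor faces the model's own fluent, high-confidence draft errors, so it is poorly matched to the errors it actually has to fix. The proposed self-generated T2T runs a no-gradient draft forward pass, fills masked positions with the predicted tokens, and in a second pass supervises recovery under these self-generated corruptions. The update is implemented as a short LoRA continued-pretraining pass and leaves inference parameters unchanged. Under the official Q-Mode T2T evaluation procedure, the method generally improves accuracy while reducing T2T edit intensity, mitigating failure modes such as final-digit transcription errors after otherwise correct reasoning and excessive self-correction before short factual answers.
Link: https://arxiv.org/abs/2606.17175
Authors: Lin Yao
Affiliations: Shanghai Jiao Tong University; Zhongguancun Academy
Categories: Computation and Language (cs.CL)
Comments: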
Abstract:Token-to-token (T2T) editing lets LLaDA2.1 revise committed tokens during block-diffusion decoding. The released recipe trains this editor on random vocabulary corruptions, but at inference the editor sees the model’s own fluent, high-confidence draft errors instead. We study this training-inference mismatch and propose self-generated T2T, which performs a no-gradient draft pass, fills masked positions with predicted tokens, and supervises recovery in a second pass under these self-generated corruptions. We implement the update as a short LoRA continued-pretraining pass on LLaDA2.1-mini and evaluate on several benchmarks under the official Q-Mode T2T procedure with unchanged inference parameters. The method generally improves accuracy while reducing T2T edit intensity, mitigating failure modes such as final-digit transcription errors after otherwise correct reasoning and excessive self-correction before short factual answers.
[NLP-69] RepSelect: Robust LLM Unlearning via Representation Selectivity
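【Quick read】: This paper addresses the difficulty of making large language models (LLMs) deeply forget specific knowledge and values: existing unlearning methods forget only shallowly and are easily reversed by fine-tuning or few-shot prompting. The root cause identified is that current methods act on representations that are shared with the retain set and also lie in the subspace a fine-tuning attacker can recover, so unlearning both damages general capabilities and is easy to undo. The paper proposes RepSelect (Representation Selectivity), which collapses the top principal components of the weight gradients before each update in order to isolate representations specific to the forget set, leaving general capabilities intact while limiting what fine-tuning can recover. Experiments cover two forget categories (biohazardous knowledge and abusive tendencies) and four model families spanning dense and Mixture-of-Experts (MoE) architectures (Llama 3, Qwen 3.5, Gemma 4 E4B, DeepSeek V2 Lite). Compared with five popular baselines (GradDiff, NPO, SimNPO, RMU, UNDIAL), RepSelect achieves a 4-50x larger reduction in post-relearning answer accuracy than the strongest baseline and is near-perfectly robust to few-shot prompting attacks.
Link: https://arxiv.org/abs/2606.17168
Authors: Filip Sondej,Yushi Yang,Adam Mahdi
Affiliations: University of Oxford
Categories: Computation and Language (cs.CL)
Comments: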
Abstract:Making large language models (LLMs) deeply forget specific knowledge and values without sacrificing general capabilities remains a central challenge in unlearning. However, current methods are easily reversed by fine-tuning or few-shot prompting, suggesting their forgetting is only shallow. We identify the root cause. Existing methods target representations shared with both the retain set and the subspace recovered by a fine-tuning attacker, making unlearning both disruptive to general capabilities and easy to reverse. We propose RepSelect (Representation Selectivity), which isolates forget-set-specific representations by collapsing top principal components of weight gradients before each update, leaving general capabilities intact while limiting what fine-tuning can recover. We evaluate across two forget categories, biohazardous knowledge and abusive tendencies, and four model families spanning dense and Mixture-of-Experts architectures (Llama 3, Qwen 3.5, Gemma 4 E4B, DeepSeek V2 Lite). Compared to five popular baselines (GradDiff, NPO, SimNPO, RMU, UNDIAL), RepSelect achieves a 4-50x larger reduction in post-relearning answer accuracy than the strongest baseline, and is near-perfectly robust to few-shot prompting attacks. Targeting selective representations is thus an important step towards deep and robust LLM forgetting.
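[NLP-70] The Critical Role of Model Selection in Causal Inference: A Comparative Analysis of Classification Models within the InferBERT Framework for Pharmacovigilance
【Quick read】: This paper addresses a central challenge in pharmacovigilance: distinguishing causal adverse drug events (ADEs) from spurious correlations. It evaluates different classification models within the InferBERT framework to determine how model choice affects causal detection. BioBERT, which has biomedical domain-specific pre-training, achieved the highest accuracy on both benchmarks (analgesics-induced acute liver failure and tramadol-related mortalities), ahead of the XGBoost baseline, ALBERT, and the medical LLM Med-LLaMA, despite the latter's larger size. The results indicate that domain-specific pre-training is the decisive factor, while simply scaling up model size brings no benefit. Post-hoc calibration lowers Expected Calibration Error (ECE) but has mixed effects on accuracy and causal signal discovery. Overall, investing in manageable, domain-aware models is more effective for computational pharmacovigilance than scaling model size.
Link: https://arxiv.org/abs/2606.17113
Authors: Csaba Kiss,Roland Molontay,Gabriele Pergola
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 10 pages, 5 figures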
Abstract:Distinguishing causal adverse drug events (ADEs) from spurious correlations remains a central challenge in pharmacovigilance. The InferBERT framework integrates transformer models with Do-calculus, but its success hinges on the underlying classification model. This study evaluates the impact of model choice in InferBERT, assessing whether simpler models suffice, whether domain-specific pre-training helps, whether scaling to LLMs improves causal detection, and the effect of post-hoc calibration. We performed a comparative study on two benchmarks: Analgesics-induced Acute Liver Failure (AILF) and Tramadol-related Mortalities (TRAM). Four models were evaluated: XGBoost (baseline), ALBERT (original InferBERT), BioBERT (biomedical transformer), and Med-LLaMA (medical LLM), using 5-fold cross-validation repeated over 20 runs. We measured accuracy, Expected Calibration Error (ECE) pre- and post-isotonic regression, and Jaccard concordance of causal terms with PRR, ROR, and EBGM; significance was tested with paired t-tests. BioBERT achieved the highest accuracy on both datasets, while Med-LLaMA underperformed despite its size and parameter-efficient fine-tuning. Domain-specific pre-training was decisive. Calibration improved ECE but had mixed effects on accuracy and causal discovery. BioBERT’s superiority also yielded the strongest concordance with traditional pharmacovigilance signals. These results show that domain-specific pre-training provides a clear advantage over simpler baselines and larger LLMs. Investing in manageable, domain-aware models is more effective for computational pharmacovigilance than simply scaling model size.
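The two evaluation quantities named in this abstract, Expected Calibration Error and Jaccard concordance, can be illustrated with a small sketch. This is not the authors' code: the equal-width binning, the default of 10 bins, and the function names `expected_calibration_error` and `jaccard_concordance` are assumptions chosen for illustration, and the paper may compute both quantities differently.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin share) * |accuracy - mean confidence|.

    confidences: predicted probabilities in [0, 1] for the chosen class.
    correct:     1 if the prediction was right, 0 otherwise.
    The binning scheme is an assumption; the abstract does not specify one.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(hit for _, hit in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


def jaccard_concordance(model_terms, reference_terms):
    """Jaccard index |A & B| / |A | B| between two sets of flagged terms."""
    a, b = set(model_terms), set(reference_terms)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets agree fully
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Hypothetical numbers, not results from the paper.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
    print(jaccard_concordance({"a", "b", "c"}, {"b", "c", "d"}))
```

In this reading, a model that reports 0.9 confidence but is right only 75% of the time contributes a gap of about 0.15 in its bin, which is the kind of gap a post-hoc step such as isotonic regression aims to shrink.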
[NLP-71] Securing Multi-Agent GIS Systems: Risk Evaluation and Prompt Hardening Optimization
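【Quick read】: This paper addresses the security risks that agent coordination introduces in multi-agent geographic information systems (GIS), where interactions among agents during complex conversational and spatial analysis tasks can open attack surfaces and lead to uncontrolled behavior. The solution is a security-oriented framework with three parts. A modular, state-machine-based orchestration layer abstracts agent behavior into reusable components. A red-teaming framework pairs an adaptive attacker LLM with a deterministic judge to evaluate robustness under multi-turn attacks. A prompt optimization framework treats prompts as structured signatures and injects adversarial demonstrations, improving security systematically without degrading task performance.
Link: https://arxiv.org/abs/2606.17092
Authors: Kyle Gao,Pranavi Kotta,Linlin Xu,Jonathan Li,David A. Clausi
Affiliations: University of Waterloo; University of Calgary
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: Kyle Gao and Pranavi Kotta contributed equally to this work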
Abstract:Agentic systems are increasingly integrated with geographic information systems (GIS), where multi-agent coordination enables complex conversational and spatial analysis but introduces security risks. This work presents a security-oriented framework for risk identification, evaluation, and mitigation in a multi-agent GIS system while maintaining adaptability to broader agentic architectures. We test the agentic system of a commercial geospatial partner while developing a modular state-machine-based orchestration framework that abstracts agent behavior into reusable components. We evaluate robustness using a red-teaming framework with an adaptive attacker LLM and a deterministic judge that produces binary outcomes with supporting rationales across multi-turn attacks. We further improve resilience with a prompt optimization framework that treats prompts as structured signatures and injects adversarial demonstrations, enabling systematic security improvements without degrading task performance.
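[NLP-72] Correct When Paired, Wrong When Split: Decoupling and Editing Modality-Specific Neurons in MLLMs
【Quick read】: This paper addresses "editing decoupling failure" in knowledge editing for Multimodal Large Language Models (MLLMs): entity-related knowledge is updated successfully when the model receives paired text-image input, but with unimodal input (text only or image only) the model often reverts to outdated pre-edit facts. The underlying cause is that entity knowledge in MLLMs is not stored as a unified representation but is distributed across modality-specific pathways, so edits driven by multimodal queries fail to propagate to the unimodal circuits. The authors propose DECODE, which explicitly disentangles and localizes modality-specific neuron groups so that knowledge can be edited along each pathway. Experiments show that DECODE consistently achieves effective knowledge updates under different modality triggers, mitigating editing decoupling failure.
Link: https://arxiv.org/abs/2606.17057
Authors: Tingchao Fu,Wenkai Wang,Fanxiao Li,Huadong Zhang,Jinhong Zhang,Dayang Li,Yunyun Dong,Renyang Liu,Wei Zhou
Affiliations: Yunnan University; National University of Singapore
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 18 pages, 11 figures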
Abstract:Although Knowledge Editing provides an efficient mechanism for updating the knowledge of Multimodal Large Language Models (MLLMs), we find that current paradigms still suffer from an important yet underexplored issue: editing decoupling failure. Entity-related knowledge can be updated when the model is triggered by multimodal inputs (text–image query pairs), but the model often reverts to outdated pre-edit facts when the paired inputs are split into unimodal ones. Our in-depth empirical analysis reveals that the entity knowledge in MLLMs is not stored as a unified representation, but is instead distributed across disentangled modality-specific pathways. As a result, updates biased toward multimodal queries fail to propagate effectively to unimodal circuits. To bridge this gap, we propose DECODE, which explicitly disentangles and localizes modality-specific neuron groups for targeted knowledge editing. Extensive experiments demonstrate that DECODE consistently achieves effective knowledge updates under different modality triggers, thereby mitigating editing decoupling failures.
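[NLP-73] Reading between the Lines: Leveraging Large Language Models for Global Dementia and Depression Assessment from Clinical Interviews
【Quick read】: This paper addresses the difficulty of differential diagnosis between dementia and depression, the two most prevalent neuropsychiatric disorders in older adults, whose symptoms overlap heavily. The challenge is to assess both conditions accurately and objectively from non-invasive, easily obtained speech samples. The key step is an observer-based Global Depression Scale (GDS-D), aligned in structure with the established Global Deterioration Scale (GDS), which enables parallel global staging of affective and cognitive symptoms. Three open-weights large language models (Mistral 3.1, DeepHermes, and Qwen3) are compared in two settings: zero-shot prediction, and LLM-based feature extraction combined with Support Vector Regression (SVR). Depression severity is already predicted well zero-shot (best mean absolute error, MAE, of 0.60), whereas dementia assessment benefits substantially from structured feature extraction (best MAE of 0.78), cutting error by up to 35% relative to zero-shot baselines. Automatic pause-enriched transcripts perform competitively with human transcriptions, supporting the viability of fully automatic screening pipelines.
Link: https://arxiv.org/abs/2606.18019
Authors: Franziska Braun,Alea Rüggeberg,Thomas Ranzenberger,Hartmut Lehfeld,Thomas Hillemacher,Tobias Bocklet,Korbinian Riedhammer
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted for publication in Text, Speech and Dialogue (TSD 2026). The final authenticated publication will be available online via Springer LNCS/LNAI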
Abstract:Dementia and depression are the most prevalent neuropsychiatric disorders in geriatric populations, and their overlapping symptoms pose major challenges for differential diagnosis. In this study, we investigate open-weights Large Language Models (LLMs) for predicting dementia and depression severity from speech samples collected during standardized history taking interviews with 154 German-speaking subjects. We introduce an observer-based Global Depression Scale (GDS-D) aligned with the established Global Deterioration Scale (GDS), enabling parallel global staging of affective and cognitive symptoms. We compare three LLMs (Mistral 3.1, DeepHermes, Qwen3) in two settings: (1) zero-shot prediction and (2) LLM-based feature extraction for Support Vector Regression, using human and pause-enriched transcripts. Results show that LLMs effectively predict depression severity in zero-shot settings (best MAE of 0.60), while dementia assessment benefits substantially from structured feature extraction (best MAE of 0.78), reducing errors by up to 35% over zero-shot baselines. Pause-enriched transcripts achieve competitive performance with human transcriptions, demonstrating the viability of fully automatic screening pipelines for differential neuropsychiatric assessment.
[NLP-74] Non-Autoregressive Minimum Bayes Risk Decoding for Fast Speech Recognition INTERSPEECH2026
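【Quick read】: This paper addresses the drop in recognition performance in non-autoregressive (NAR) speech recognition, which arises because NAR decoding cannot condition on previously generated tokens. NAR decoding is fast because it generates tokens in parallel, but it has no mechanism for resolving output uncertainty. The proposed NAR-MBR decoding is a NAR decoding framework based on minimum Bayes' risk (MBR): instead of maximizing output probability, it draws samples from the NAR model's output distribution and selects the output that maximizes expected utility. Because of the parallel nature of NAR models, multiple samples can be obtained efficiently in a single forward computation. Experiments on LibriSpeech, Switchboard, AMI, and a web presentation corpus show that the method outperforms previous NAR decoding and runs faster than autoregressive (AR) decoding.
Link: https://arxiv.org/abs/2606.17537
Authors: Hiroyuki Deguchi,Takatomo Kano,Katsuki Chousa,Marc Delcroix
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Comments: Accepted at Interspeech2026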
Abstract:Non-autoregressive (NAR) decoding generates output tokens in parallel, making speech recognition faster than autoregressive decoding, which generates them sequentially from left to right. However, the recognition performance is degraded because NAR decoding cannot resolve uncertainty by conditioning on previously generated tokens. To address this issue, we propose a novel NAR decoding framework based on minimum Bayes’ risk (MBR) decoding, termed NAR-MBR decoding, that maximizes the expected utility calculated from samples drawn from the output probability of an NAR model rather than maximizing the output probability. Notably, by leveraging the nature of NAR models, multiple samples are obtained efficiently with a single forward computation. Our experiments across LibriSpeech, Switchboard, AMI, and web presentation corpus demonstrated that our NAR-MBR decoding outperformed previous NAR decoding and ran faster than AR decoding.
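The selection step of sample-based MBR decoding described above can be sketched in a few lines. This is a generic illustration, not the paper's implementation: the utility (one minus normalized word-level edit distance), the use of the sample set itself as the evidence set, and the names `word_edit_distance`, `utility`, and `mbr_decode` are assumptions. How the samples are drawn from a NAR model in one forward pass is outside the sketch; they are passed in as plain strings.

```python
def word_edit_distance(hyp_words, ref_words):
    """Levenshtein distance between two word lists (single-row dynamic program)."""
    previous = list(range(len(ref_words) + 1))
    for i, hyp_word in enumerate(hyp_words, start=1):
        current = [i]
        for j, ref_word in enumerate(ref_words, start=1):
            current.append(min(
                previous[j] + 1,                          # delete hyp_word
                current[j - 1] + 1,                       # insert ref_word
                previous[j - 1] + (hyp_word != ref_word)  # substitute or match
            ))
        previous = current
    return previous[-1]


def utility(hypothesis, reference):
    """Assumed utility: 1 - normalized word edit distance, so higher is better."""
    hyp_words, ref_words = hypothesis.split(), reference.split()
    longest = max(len(hyp_words), len(ref_words), 1)
    return 1.0 - word_edit_distance(hyp_words, ref_words) / longest


def mbr_decode(samples):
    """Return the sample with the highest average utility against all samples."""
    if not samples:
        raise ValueError("need at least one sample")
    best, best_score = None, float("-inf")
    for hypothesis in samples:
        score = sum(utility(hypothesis, ref) for ref in samples) / len(samples)
        if score > best_score:  # ties keep the earliest sample
            best, best_score = hypothesis, score
    return best


if __name__ == "__main__":
    # Hypothetical sampled transcripts, for illustration only.
    print(mbr_decode(["the cat sat", "the cat sat", "the cat sad", "a cat sat"]))
```

The cost of this selection step is quadratic in the number of samples, which is why obtaining many samples cheaply, as the abstract reports for NAR models, matters for overall speed.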
Information Retrieval
[IR-0] IUU+DB: Tracking Illegal, Unreported, and Unregulated Fishing, Seafood Fraud, and Labor Abuse through LLM-driven Information Extraction
Link: https://arxiv.org/abs/2606.18181
Authors: Henry Bodwell,Hong Yang,John C. Simeone,Kelvin Gorospe,Bella Sullivan,Lana Huang,Jessica Gephart,Sandy Aylesworth,Molly Masterton,Naren Ramakrishnan
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:
Abstract:Illegal, unreported, and unregulated fishing (IUU) traditionally refers to fishing activities that violate applicable laws or occur in areas that lack applicable laws. We propose the term IUU+ to capture a broader suite of fisheries sector environmental and associated supply chain trade-related crimes and behaviors. Although IUU+ activity is widely recognized as a serious threat to marine ecosystems, markets, and livelihoods, a quantitative understanding of these incidents, e.g., their frequency, geography, species, actors, and patterns in the type of illicit activity, remains difficult to obtain. We propose IUU+DB, a large language model driven system for building a global incident database of IUU+ activity. The system ingests heterogeneous documents, classifies whether they describe relevant incidents, extracts key data elements such as actors, locations, species, vessels, violations, and enforcement outcomes, and supports deduplication and trend analysis. Case studies and validation results show that IUU+DB can help organize fragmented evidence, surface geographic and behavioral hotspots, support fisheries-domain specific research in academia and non-government organizations, assist source and species risk assessments for industry, and provide support for policy implementation and targeted enforcement efforts to government agencies.
[IR-1] HistoRAG: Embedding Historical Methodology in Retrieval-Augmented Generation Through Critical Technical Practice
Link: https://arxiv.org/abs/2606.18103
Authors: Noah J. Kim-Baumann,Torsten Hiltmann
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 25 pages, 6 figures. Companion preprint to a Journal of Digital History notebook article (under review)
Abstract:Retrieval-Augmented Generation (RAG) is the prevailing architecture for grounding language model outputs in external evidence, yet its dominant evaluation paradigms and default configurations remain oriented toward factual question-answering. For interpretive disciplines such as historical studies, RAG embeds assumptions that conflict with scholarly practice. We introduce HistoRAG, a framework that translates historiographical principles into concrete architectural interventions. Separated retrieval and generation decouples source discovery from interpretation, temporal windowing enforces balanced source representation across the research period as a methodological requirement of historical inquiry, and LLM-as-judge evaluation makes relevance judgments transparent and contestable. We evaluate these interventions using SPIEGELragged, applied to 102,189 articles from Der Spiegel (1950-1979). Each intervention addresses a measurable deficiency in standard RAG: era-specific vocabulary retrieves zero chunks from the 1950s when using 1970s terminology, evidence of the temporal skew that motivates windowing; vector similarity and LLM-assessed relevance correlate only weakly (Spearman rho = 0.275), motivating post-retrieval evaluation; and keyword-based and semantic retrieval surface largely disjoint source pools, motivating an architecture in which both operate as complementary retrieval layers under a shared LLM evaluation filter. We also introduce the concept of Zwischentexte (intermediate texts that function as interpretive proposals rather than findings) as a framework for responsible integration of LLM-generated text into scholarly practice. The architecture offers a model for how domain-specific epistemological commitments can be translated into RAG design decisions, and may transfer to other interpretive disciplines working with large corpora.
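One way to picture the temporal windowing idea is to retrieve the top-scoring chunks per time window rather than globally, so that no period is crowded out by another. The sketch below is only that picture under stated assumptions: the fixed-width windows, the per-window quota, the `(chunk_id, year, score)` tuple layout, and the function name `windowed_top_k` are hypothetical and are not taken from HistoRAG or SPIEGELragged.

```python
from collections import defaultdict


def windowed_top_k(chunks, start_year, end_year, window_size, k_per_window):
    """Select the k best-scoring chunks inside each fixed-width time window.

    chunks: iterable of (chunk_id, year, score) tuples, where score is any
            retrieval score with higher meaning more relevant.
    Returns (chunk_id, year, score) tuples ordered by window, then by score.
    """
    if window_size <= 0 or k_per_window <= 0:
        raise ValueError("window_size and k_per_window must be positive")
    windows = defaultdict(list)
    for chunk_id, year, score in chunks:
        if start_year <= year <= end_year:
            windows[(year - start_year) // window_size].append((chunk_id, year, score))
    selected = []
    for window_index in sorted(windows):
        ranked = sorted(windows[window_index], key=lambda item: item[2], reverse=True)
        selected.extend(ranked[:k_per_window])
    return selected


if __name__ == "__main__":
    # Hypothetical scores: later decades score higher, as temporal skew would produce.
    demo = [("a", 1952, 0.2), ("b", 1955, 0.3), ("c", 1965, 0.5),
            ("d", 1972, 0.9), ("e", 1978, 0.8)]
    for row in windowed_top_k(demo, 1950, 1979, window_size=10, k_per_window=1):
        print(row)
```

A global top-3 over the same demo scores would return two chunks from the 1970s and none from the 1950s; the per-window quota trades some raw similarity for coverage of the whole research period.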
[IR-2] Non-negative Elastic Net Decoding for Information Retrieval
Link: https://arxiv.org/abs/2606.17910
Authors: Koki Okajima, Yasutoshi Ida, Tsukasa Yoshida, Yasuaki Nakamura
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 19 pages, 4 figures
Abstract:Dense retrieval has become the dominant paradigm in information retrieval, in which each document is scored against a query by the inner product of their vector embeddings, and the top-k documents by score are retrieved for this query. However, since each document’s score depends solely on the embedding of the query and itself, the retrieval process is oblivious to the content of the entire corpus. Therefore, dense retrieval cannot avoid selecting semantically similar documents from the corpus, which may result in a non-diverse, redundant set of retrieved documents. To this end, we approach retrieval as a joint decoding problem, in which documents are selected as a set with regard to the context of the rest of the corpus. To achieve this, we propose Non-Negative elastic Net (NNN) decoding, which selects documents whose embeddings jointly reconstruct the query embedding as a sparse non-negative linear combination. Our main theoretical result establishes a strict separation between dense retrieval and NNN decoding. For any corpus, every query correctly handled by dense retrieval is also handled by NNN decoding, while on corpora containing correlated documents, NNN decoding additionally handles queries that dense retrieval cannot. Experimental results indicate that applying NNN decoding to frozen embeddings trained for inner-product scoring yields consistent improvements across several benchmarks. Moreover, we introduce an end-to-end training procedure which optimizes the embeddings for NNN decoding, producing significant performance gains over dense retrieval on all metrics and benchmarks. Our work establishes a new paradigm for leveraging dense embeddings in information retrieval, beyond the standard practice of inner-product scoring.
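To make the "sparse non-negative linear combination" idea concrete, here is a minimal sketch that solves a non-negative elastic-net problem by coordinate descent and compares its selection with plain inner-product ranking. The exact objective, the penalty values `l1` and `l2`, the function names, and the toy corpus are all assumptions for illustration; the paper's formulation and solver may differ.

```python
def nn_elastic_net_decode(query, docs, l1=0.05, l2=0.05, sweeps=500):
    """Coordinate descent for the assumed objective
        min_{w >= 0} 0.5*||q - sum_i w_i*d_i||^2 + l1*sum_i w_i + 0.5*l2*sum_i w_i^2
    Returns one non-negative weight per document."""
    dim = len(query)
    weights = [0.0] * len(docs)
    residual = list(query)  # q - sum_i w_i*d_i, with all weights at zero
    sq_norms = [sum(x * x for x in d) for d in docs]
    for _ in range(sweeps):
        for i, d in enumerate(docs):
            # Correlation of d_i with the residual that excludes d_i itself.
            rho = sum(d[k] * residual[k] for k in range(dim)) + weights[i] * sq_norms[i]
            new_w = max(0.0, (rho - l1) / (sq_norms[i] + l2))
            delta = new_w - weights[i]
            if delta != 0.0:
                for k in range(dim):
                    residual[k] -= delta * d[k]
                weights[i] = new_w
    return weights


def top_k(scores, k):
    """Indices of the k largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy corpus: documents 0 and 1 are near-duplicates, document 2 is different.
docs = [[1.0, 0.0], [0.98, 0.2], [0.0, 1.0]]
query = [0.8, 0.6]

dense_scores = [sum(q * x for q, x in zip(query, d)) for d in docs]
dense_pick = top_k(dense_scores, 2)                     # -> [1, 0]
nnn_pick = top_k(nn_elastic_net_decode(query, docs), 2)  # -> [1, 2]
```

On this toy corpus the inner-product ranking returns the two near-duplicates, while ranking by reconstruction weight brings in the document that covers the query's second direction.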
[IR-3] Understanding and Debugging Failures in N-Gram-Based Generative Retrieval
Link: https://arxiv.org/abs/2606.17721
Authors: Richard Takacs, Adrian Bracher, Svitlana Vakulenko
Subjects: Information Retrieval (cs.IR)
Comments: Work in progress
Abstract:Generative Retrieval (GR) is an emerging Information Retrieval (IR) paradigm that is motivated by increasingly capable language models. In GR, a model directly generates identifiers for relevant documents. While these systems offer unique advantages, they also introduce distinct failure mechanisms. We explore these failure modes in three contributions: (1) We present a taxonomy of GR failure modes based on GR literature. (2) We empirically investigate failure in a subset of GR: ngram-based methods, more specifically, SEAL and MINDER. Our analysis reveals common issues, such as ambiguous docids, low identifier diversity, and the disproportionate impact of specific identifiers. (3) We introduce a new web-based tool that helps the IR community analyze generated ngrams and their respective contribution to the final ranking, providing an intuitive interface to identify where such GR methods go wrong.
[IR-4] Do Generative Recommenders Deepen the Information Cocoon? A Closed-Loop Simulation with LLM-powered User Simulators
Link: https://arxiv.org/abs/2606.17707
Authors: Jiyuan Yang, Gengxin Sun, Mengqi Zhang, Lingjie Wang, Yuanzi Li, Hongxi Cui, Xin Xin, Pengjie Ren
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Recommender systems alleviate information overload, yet repeated feedback between recommendations and user interactions can reinforce existing preferences and narrow users’ exposure, forming information cocoons. While this phenomenon has been widely studied in traditional sequential recommendation, its impact on generative recommendation remains unclear. By replacing atomic item IDs with Semantic ID (SID) sequences, generative recommenders introduce a different recommendation mechanism whose role in information cocoon formation is not yet understood. To investigate whether generative recommenders deepen information cocoons, we propose RecLoop, a closed-loop simulation framework with LLM-driven user agents. We compare two generative recommenders and two traditional sequential baselines on two Amazon datasets across multiple feedback cycles. In addition to standard exposure-level metrics, we introduce Code-Space Structural Cocoon, a model-level metric that measures concentration in the generated SID space. Experimental results show that generative recommenders are generally less prone to exposure-level cocoon formation than traditional baselines, preserving broader exposure diversity and slowing cross-user homogenization. However, feedback loops can still induce concentration within the generated SID space. We further find that cocoon severity depends strongly on tokenization strategy and model scale: collaborative-signal tokenization produces stronger cocoon effects than semantic tokenization, whereas larger models maintain greater code-space diversity and better retain access to niche content. These findings suggest that information cocoons in generative recommendation are shaped not only by recommendation behavior, but also by item tokenization and model capacity. Our code is available at this https URL.
[IR-5] Temporal Preference Optimization for Unsupervised Retrieval ICML2026
Link: https://arxiv.org/abs/2606.17664
Authors: HyunJin Kim, Jaejun Shim, Young Jin Kim, JinYeong Bak
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted to ICML 2026
Abstract:Unsupervised dense retrievers offer scalability by learning semantic similarity from unlabeled documents via contrastive learning, but they struggle to capture temporal relevance and retrieve semantically related but temporally misaligned documents. This matters when a document collection spans multiple time periods (e.g., retrieving documents from 2018-2025 for “Who is the president in 2019?” introduces temporal ambiguity). Existing methods rely on supervised training with explicit timestamps, which are not always feasible. We propose TPOUR (Temporal Preference Optimization for Unsupervised Retriever), which uses our novel training method Temporal Retrieval Preference Optimization (TRPO). TRPO reinterprets preference learning in the temporal dimension, guiding the retriever to favor temporally aligned documents. TPOUR further generalizes to unseen time periods via interpolation in a learned time embedding, enabling continuous temporal alignment. In experiments on temporal information retrieval (T-IR), TPOUR outperforms both unsupervised and supervised baselines. Compared to Qwen-Embedding-8B, despite being about 72.7x smaller, TPOUR Contriever improves average nDCG@5 by +4.04 (+12.15%) on explicit and +4.98 (+15.21%) on implicit queries. We provide our code at this https URL.
[IR-6] RSRank: Learning Relevance from Representational Shifts
Link: https://arxiv.org/abs/2606.17468
Authors: Archit Gupta, Sai Sundaresan, Debabrata Mahapatra
Subjects: Information Retrieval (cs.IR)
Comments: Under Peer Review
Abstract:As enterprises deploy RAG-based systems to provide grounded responses to user queries, reranking has become a critical component for the final filtering step that separates relevant from distracting or irrelevant documents. Existing rerankers often rely on heuristic thresholds to achieve optimal filtering. Moreover, for relevance scoring, state-of-the-art methods use a language model’s logit signals, which are designed for next-token prediction, not for assessing relevance. To address these limitations, we identify a principled signal for relevance: the representational shift (RS) induced in a query’s internal state when conditioned on a document. We observe that the alignment between (a) RS induced by a candidate document and (b) RS induced by an oracle document-set provides a robust indicator of relevance. Building on this insight, we introduce a lightweight training framework that learns projections mapping RS to calibrated relevance scores. Our training objectives naturally filter irrelevant content at a zero threshold, reducing dependence on heuristic tuning. Across diverse retrieval datasets, our method delivers gains over SOTA rerankers.
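The alignment signal described above can be sketched with toy vectors: compute the shift a document induces in the query representation, then compare a candidate's shift with the shift induced by an oracle document set. Plain cosine similarity stands in here for the learned projections, and the three-dimensional vectors and function names are hypothetical; this is not the paper's scoring function.

```python
import math


def shift(conditioned, unconditioned):
    """Representational shift: query state with the document minus without it."""
    return [c - u for c, u in zip(conditioned, unconditioned)]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical hidden states for one query.
query_alone = [1.0, 0.0, 0.0]
with_oracle = [1.0, 1.0, 0.0]   # query conditioned on the oracle document set
with_doc_a = [1.0, 0.8, 0.1]    # candidate that moves the query the same way
with_doc_b = [1.0, -0.5, 0.5]   # candidate that moves it elsewhere

oracle_shift = shift(with_oracle, query_alone)
score_a = cosine(shift(with_doc_a, query_alone), oracle_shift)
score_b = cosine(shift(with_doc_b, query_alone), oracle_shift)

# A zero threshold separates the two toy candidates.
kept = [name for name, s in (("a", score_a), ("b", score_b)) if s > 0.0]
```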
[IR-7] On the Memorization Behavior of LLMs in Generative Recommendation: Observations, Implications, and Training Strategies
Link: https://arxiv.org/abs/2606.17276
Authors: Sunwoo Kim, Sunkyung Lee, Clark Mingxuan Ju, Donald Loveland, Bhuvesh Kumar, Kijung Shin, Neil Shah, Liam Collins
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:
Abstract:Generative recommendation (GR) has emerged as a promising direction for recommender systems. Recently, large language models (LLMs) have been increasingly adopted for GR, as their rich pretrained knowledge is expected to help them generalize beyond common user behavior patterns that traditional memorization-oriented baselines can capture. However, existing LLM-based GR works largely ignore LLMs’ well-known tendency to memorize, which, if present in LLMs fine-tuned for GR, would restrict their utilization of pretrained knowledge. In this work, we investigate this concern by examining one-hop memorization, where a model recommends items that are direct successors of items in the training data. We show that LLMs do this more than non-LLM-based GR models; in fact, the vast majority of their gains over GR baselines are on users whose target items can be predicted through one-hop memorization. We intuit that improving performance on the remaining users requires LLMs to learn richer item-item relations beyond one-hop transitions. To achieve this, we propose IIRG, a novel training strategy that teaches LLMs to capture: (1) collaborative relations derived from item co-occurrences across multiple hops in user sequences, and (2) semantic relations among items with similar themes, both of which can serve as useful recommendation signals. We show that IIRG significantly improves over LLMs trained solely with standard next-item prediction, with especially large gains for users whose test items are not covered by train-time one-hop transitions.
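One way to read "covered by train-time one-hop transitions" is sketched below: collect every (item, next item) pair from the training sequences, then check whether a test target directly follows any item in the user's history. The function names and the choice to check the whole history rather than only the last item are assumptions; the paper may define coverage differently.

```python
from collections import defaultdict


def build_one_hop(train_sequences):
    """Map each item to the set of items that directly follow it in training."""
    successors = defaultdict(set)
    for seq in train_sequences:
        for prev_item, next_item in zip(seq, seq[1:]):
            successors[prev_item].add(next_item)
    return successors


def is_one_hop_covered(history, target, successors):
    """True if the target is a direct training-time successor of a history item."""
    return any(target in successors.get(item, ()) for item in history)


# Toy interaction sequences (hypothetical item IDs).
train = [["a", "b", "c"], ["b", "d"]]
successors = build_one_hop(train)

covered = is_one_hop_covered(["a", "b"], "d", successors)   # b -> d was seen
uncovered = is_one_hop_covered(["a"], "c", successors)      # only a -> b was seen
```

Splitting test users by this flag gives the two groups the abstract contrasts: those reachable by one-hop memorization and those who are not.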
[IR-8] Beyond Parallel Sampling: Diverse Query Initialization for Agentic Search EMNLP2026
Link: https://arxiv.org/abs/2606.17209
Authors: Sidhaarth Murali, João Coelho, Jingjie Ning, João Magalhães, Bruno Martins, Chenyan Xiong
Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 15 pages, 8 figures; under review at EMNLP 2026
Abstract:Test-time scaling for agentic search typically increases depth (i.e., more turns and tokens per trajectory) or breadth (i.e., more parallel rollouts). Here we focus on breadth scaling, showing that standard parallel sampling yields diminishing returns, tracing this to query redundancy at the first turn. When models issue similar first queries across rollouts, the threads retrieve overlapping evidence, and subsequent turns are conditioned on this shared retrieval. We address this limitation with DivInit, a training-free intervention at the first turn. Rather than sampling k independent first queries, DivInit draws n candidates from a single call, picks k ≤ n diverse seeds, and runs them as parallel trajectories. Across five open-weight models and eight benchmarks, DivInit consistently improves over standard parallel sampling, with average gains of five to seven points on multi-hop QA at matched compute. Code available at this https URL
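The "draw n candidates, keep k diverse seeds" step can be sketched with a greedy farthest-point selection. The abstract does not say how diversity is measured, so the token-level Jaccard distance, the greedy rule, and the function names below are assumptions for illustration only.

```python
def jaccard_distance(a, b):
    """1 minus the Jaccard overlap of the two queries' lowercase token sets."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return 1.0 - len(tokens_a & tokens_b) / len(union) if union else 0.0


def pick_diverse_seeds(candidates, k):
    """Greedy farthest-point selection: start from the first candidate, then
    repeatedly add the candidate farthest from everything chosen so far."""
    if k <= 0 or not candidates:
        return []
    chosen = [0]
    while len(chosen) < min(k, len(candidates)):
        remaining = [i for i in range(len(candidates)) if i not in chosen]
        best = max(
            remaining,
            key=lambda i: min(
                jaccard_distance(candidates[i], candidates[j]) for j in chosen
            ),
        )
        chosen.append(best)
    return [candidates[i] for i in chosen]


# n = 4 candidate first queries from a single (hypothetical) model call.
candidates = [
    "who founded the company acme",
    "who founded the acme company",
    "acme headquarters city location",
    "founder of acme birth year",
]
seeds = pick_diverse_seeds(candidates, 3)
```

In this toy run the reordered duplicate of the first query is the one left out, so the three parallel trajectories start from distinct queries.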
[IR-9] Designing Recommendation Exposure and Favorite Lists: A Field Experiment in a Spot-Work Platform
Link: https://arxiv.org/abs/2606.17397
Authors: Kazuki Sekiya, Suguru Otani, Yuki Komatsu, Shunsuke Ozeki, Shunya Noda
Subjects: General Economics (econ.GN); Computer Science and Game Theory (cs.GT); Information Retrieval (cs.IR)
Comments:
Abstract:How should recommender systems be designed when recommendations shape access to scarce, short-lived opportunities? We study this question in a production setting: Timee, Japan’s largest platform for spot work, where workers favorite job templates and receive notifications when firms post shifts from those templates. Maximizing predicted favoriting can generate misdirected concentration: recommendations accumulate on popular templates that create few viable job openings, while templates with unmet labor demand receive too little exposure. We design exposure-control mechanisms for favorite-list management, reallocating template exposure based on posting activity and unfilled capacity. The proposed recommender, thresholded eligibility control (TEC), is fully parallelizable and suitable for large-scale digital platforms. In simulations calibrated to Timee data, TEC raises the per-round job-finding rate from 57.6% to 70.0%. A prefecture-level randomized field experiment increases realized matches and exposure per active template, reduces the share of low-exposure templates, and improves impression-level favoriting and downstream matching.
Human-Computer Interaction
[HC-0] MAJIC: Leveraging Articulatory Motion for Speech-based Emotion Recognition
Link: https://arxiv.org/abs/2606.18228
Authors: Tanmay Srivastava, Paras Bhavnani, Benjir Alvee Islam, Shubham Jain
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:We introduce MAJIC, a multimodal emotion recognition system that leverages articulatory motion of the jaw and facial muscles for speech-based emotion recognition (SER). While most SER systems perform well on datasets with strongly expressed emotional speech of trained actors, their performance often degrades when emotional expressions become more subtle. We explore this challenge by engineering features from articulatory motion and integrating them with audio features using a multi-task learning framework. Our key insight is that emotion in speech manifests not only through vocal characteristics but also through distinct articulatory motions: jaw movements, facial muscle vibrations, and speech-induced vibrations. While audio captures features such as pitch and prosody, articulatory motion contains complementary information that is not present in audio alone. We evaluate our system on data collected from 20 participants across multiple sessions, 10 languages, and diverse scenarios, including prompted and conversational speech, showing its robustness across users and settings. MAJIC achieves 93% accuracy and 91% F1 score for emotion classification, outperforming strong audio-based baselines on our dataset.
[HC-1] Towards Understanding and Measuring COGNITIVE ATROPHY in LLM Behaviour
Link: https://arxiv.org/abs/2606.18129
Authors: Abeer Badawi, Moyosoreoluwa Olatosi, Negin Baghbanzadeh, Laleh Seyyed-Kalantari, Frank Rudzicz, R. Shayna Rosenbaum, Sara Pishdadian, Elham Dolatabadi
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent incidents involving LLMs used for mental-health support reveal a critical evaluation gap: surface-level safety scores do not capture how models behave across realistic, emotionally sensitive interactions over time. Existing benchmarks measure knowledge, safety, or static response quality, but miss whether LLM interactions help users keep reflecting, coping, and making decisions themselves. We formalize this missing dimension as COGNITIVE ATROPHY, a process-level behavioural measure in AI-mediated mental-health support distinct from safety and helpfulness. To measure it, we introduce COGNITIVE ATROPHY BENCH, a clinically grounded benchmark built from 1,576 fully human-generated counseling conversations, 15,680 turns, and 42,230 responses from five LLMs. Three clinical and neuropsychology experts developed a 20-attribute schema spanning user context, response behaviour, and global risk flags; six trained clinical reviewers applied it with span-grounded evidence, producing 5,324 reviewer judgments. We further introduce the User-Input Risk Index (UIRI), the Cognitive Atrophy Risk Index (ARI), and trajectory summaries. Across five LLMs, models show a consistent moderate-to-high level of atrophy-aligned behaviour across single and multi-turn settings. While models generally respond to overt safety cues, they adapt less reliably when users seek solutions or decisions. The dominant recurring patterns are directive advice, problem-solving, recommendation responses, topic shifts, and forms of validation that may reinforce dependence rather than reflection. Our work makes COGNITIVE ATROPHY measurable and provides a foundation for auditing model behaviour in sensitive LLM conversations.
[HC-2] Security and Privacy Prompts in the Wild: What Users Ask LLMs and How LLMs Respond
Link: https://arxiv.org/abs/2606.18062
Authors: Hobin Kim, Xiaoyuan Wu, Omer Akgul, Lujo Bauer, Nicolas Christin
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Large language models (LLMs) are widely used to fulfill users’ information needs; users ask LLMs about the weather, pose educational questions, and consult them for legal assistance. One particularly understudied area is digital security and privacy (SP), where users may seek LLMs’ help on how to secure their online accounts or protect their computers from cyber attacks. To the best of our knowledge, no prior study has collected or analyzed the SP questions users ask LLMs; prior research on LLM response quality relied on expert-authored SP misconceptions or FAQs rather than user queries. Drawing from WildChat, a dataset of 3.2M user-LLM conversations collected in the wild, our study identifies 14,727 SP prompts and categorizes them into nine categories covering a wide range of SP topics. From the SP prompts, we sampled 450 and performed a thematic analysis to characterize the SP questions users ask LLMs. Separate from the thematic analysis, we curated 270 advice-seeking SP prompts, where users ask for recommendations, guidance, or specific SP information. We measured LLM response quality and consistency when posing the prompt to LLMs 10 times. We found that commercial LLMs outperform open-weight models (GPT 5.5 provided “good enough” responses on 98% of prompts; Llama 4 on 47%). However, among prompts that received high-quality responses on average, commercial models sometimes produce contradictory responses across runs, risking confusing or misleading users.
[HC-3] When AI Says “I have been in similar situations”: Synthetic Lived Experience in Peer-Like Caregiver Support
Link: https://arxiv.org/abs/2606.18057
Authors: Drishti Goel, Violeta J. Rodriguez, Daniel S. Brown, Ravi Karkar, Dong Whi Yoo, Koustuv Saha
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Comments:
Abstract:Caregivers often turn to online communities for informational and emotional support. In these spaces, peer supporters frequently draw on personal narratives to respond to emotionally complex caregiving situations. As LLMs are increasingly designed as peer-like sources of support, they introduce a critical tension: AI can provide immediate, private, and nonjudgmental support, but it cannot authentically possess the lived experiences that make human peer support meaningful. Yet, when prompted to sound peer-like, LLMs may generate language that implies lived experience. This creates a synthetic lived experience paradox: the same experiential language that may make AI support feel warm, relatable, and peer-like can also falsely position the system as someone with lived experience. We examine this paradox in the context of family caregivers of people living with Alzheimer’s Disease and Related Dementias (ADRD). Drawing on caregiver support exchanges from online communities and prompted peer-like responses from three LLMs – LLaMA, GPT-4o-mini, and MedGemma – we analyze how human peers use personal narratives and how AI incorporates similar narrative forms. Psycholinguistic analysis shows that peer responses used significantly more first-person and past-focused language than peer-like AI responses. Qualitatively, we identify seven types of personal narratives in human peer support and show that AI often captures their emotional work, but can fabricate experiential grounding. These findings reveal a narrative authenticity gap: peer-like AI can generate synthetic lived experience without the real experience that makes peer support meaningful. We argue that caregiver-support AI systems need mechanisms to distinguish supportive peer-like framing from fabricated lived experience, ensuring that models can offer warmth and validation without falsely positioning themselves as experiential peers.
[HC-4] ParaTutor: LLM-Mediated Parent-Child Tutoring through Role-Separated Scaffolding Interface in Real Time
Link: https://arxiv.org/abs/2606.18030
Authors: Lan Luo, Anqi Wang, Muzhi Zhou, Junhua Zhu, Jie Cai, Ao Yu, Hui Pan
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Parent-child tutoring is a collaborative learning setting with asymmetric roles, where parents guide children's problem solving while children engage in understanding and reasoning. However, most LLM-based learning systems are designed for either single users or symmetric collaboration, leaving parent-child tutoring with distinct instructional roles underexplored. Through a formative study, we find that effective parent-child tutoring depends on preserving these distinct roles, with parents guiding the learning process and children remaining actively engaged in reasoning. We also identify recurring challenges when parents struggle to understand problem structure, lack sufficient knowledge to provide support, or encounter communication difficulties that disrupt shared understanding. To address these challenges, we present ParaTutor, a scaffolding system that provides different forms of support to parents and children. ParaTutor supports parents with guidance for tutoring and provides children with visual grounding for problem solving. We evaluate ParaTutor with 23 parent-child dyads (children aged 10 to 12) under four tutoring conditions that vary how LLM assistance is delivered. Results show that generic LLM assistance tends to reduce the parent's role in tutoring, whereas ParaTutor better preserves parent-led support and sustains children's participation in reasoning. These findings suggest that in multi-user learning, the value of LLM support depends not only on model capability but also on how support is distributed across users with different roles. Our work contributes design implications for LLM systems that support family learning.
[HC-5] Co-Creativity at the Table: A Qualitative Analysis of Creative Interactions in the Podcast “Adventure AI”
Link: https://arxiv.org/abs/2606.18010
Authors: Hanna Dodd, Daniel G. Brown
Subjects: Human-Computer Interaction (cs.HC)
Comments: 11 pages, 3 tables
Abstract:Tabletop role-playing games provide a unique environment for interaction with artificial intelligence (AI) due to their complex and collaborative nature. We analyze Adventure AI, a podcast featuring human-AI interactions in Dungeons & Dragons play, to examine how AI is and can be used in tabletop role-playing gaming and how players perceive this use. We complete a qualitative analysis of three seasons of this podcast, from 2023 to 2025, reporting on the overarching themes of roles of AI, roles of humans, the evaluations and failures of AI, and its treatment as a person and character at the table. There are many aspects of the game where artificial intelligence succeeds, while there are others where it is less appropriate. This analysis gives a basis for future work on where artificial intelligence should and should not be used in gaming spaces.
[HC-6] Children Are Not the Enemy: Child-Fit Security as an Alternative to Bans and Surveillance
Link: https://arxiv.org/abs/2606.17957
Authors: Kopo M. Ramokapane, Rui Huan, Zaina Dkaidek, Awais Rashid
Subjects: Cryptography and Security (cs.CR); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 14 pages, 2 figures, Paper Under review
Abstract:Digital technologies are now central to children’s learning, play, communication, identity formation, and social participation. Yet dominant approaches to children’s online safety often rely on containment mechanisms, including bans, age gates, parental controls, monitoring, and screen-time restrictions. These approaches can be useful in specific contexts, but they often frame child protection primarily as a problem of restricting access to systems designed for adults. In this paper, we argue that this framing is inadequate for children’s digital lives and insufficient as a security paradigm. We propose Child-fit security, a design paradigm in which technologies likely to be used by children treat children as legitimate users, not attackers to be excluded, vulnerabilities to be patched, or risks to be managed. In this paradigm, children’s wellbeing, development, privacy, safety, agency, and rights become core security requirements. This shifts the focus of protection from apps, accounts, and data to the child-system relationship, which means protecting both the child and their participation. We conceptualise child-fit security, contrast it with containment-oriented approaches, define its core principles, and discuss its implications for security design. We conclude by presenting a research agenda for making child-fit security operational.
[HC-7] AI Adoption Across a Multinational Workforce: Sociotechnical Conditions for GenAI Acceptance in Human Resources
Link: https://arxiv.org/abs/2606.17887
Authors: Dalia Ali, Maria José Rodríguez Velázquez, Manoel Horta Ribeiro, Vera Liao, Orestis Papakyriakopoulos
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:
Abstract:Generative AI (GenAI) deployment in the workplace is accelerating rapidly. Nevertheless, questions of who adopts, who benefits, and who is left behind and why are still understudied. In this paper, we investigate these dynamics in the context of a multinational tech company transitioning from a legacy Human Resources (HR) search system to a GenAI-supported system, analyzing search log data, survey data (n=25), and ten semi-structured interviews. Our findings show that adoption depended on the fit between the GenAI system’s design assumptions and employees’ work positionalities (role, spoken language, tenure). Further, we find that employees’ trust in GenAI answers was built through source-checking, comparison among systems, and seeking input from colleagues or HR when in doubt. Our contribution is twofold. First, we provide empirical evidence of workplace GenAI adoption during a live organizational transition, showing that adoption is influenced by factors such as situational fit, search literacy, and trust calibration. It is also further shaped by knowledge conditions such as the system’s content quality, employee training, and guidance. Second, we translate these findings into design considerations for inclusive deployment and adoption in high-stakes environments such as HR. We argue that organizations should design systems considering the role and context-sensitive benefits they yield to different social groups. They also need to treat the organizational knowledge infrastructure as AI infrastructure to improve the accountability and usability of GenAI systems
[HC-8] From Ad Hoc Pilots to Repeatable Patterns: Structuring Drone Collaboration in Emergency Services with DroneLets
Link: https://arxiv.org/abs/2606.17839
Authors: Dzmitry Katsiuba, Samuel Brander, Mateusz Dolata, Gerhard Schwabe
Subjects: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
Comments: Presented at International Conference on Information Systems (ICIS) 2025: this https URL
Abstract:Drones hold promise for supporting emergency services, but their integration into workflows remains ad hoc and coordination-intensive. This paper addresses two research questions: how emergency teams want to collaborate with drones, and how to formalize these collaborations into repeatable processes. Based on four field trials and 95 interviews, we derive 44 interaction patterns grouped into 10 meta-patterns reflecting operational needs such as reconnaissance, communication, and logistical support. To structure these practices, we introduce DroneLets - a new class of design artifacts that extend Collaboration Engineering to embodied agents. DroneLets capture setup requirements, drone capabilities, environmental constraints, and coordinated actions across human and drone actors. They offer a modular framework for designing repeatable, scalable collaboration processes in emergency services, illustrated through patterns such as broadcasting to bystanders and post-fire monitoring. This work expands the scope of CE and provides a structured foundation for integrating autonomous drones into high-stakes field operations.
[HC-9] Accountability in Autonomous Drone-Based Firefighting: Insights From a Field Trial
Link: https://arxiv.org/abs/2606.17831
Authors: Dzmitry Katsiuba, Anna Katharina Boos, Robin Hany, Mateusz Dolata, Gerhard Schwabe
Subjects: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
Comments: Accepted for Publication at International Conference on Information Systems (ICIS) 2025: this https URL
Abstract:There is a growing research field exploring how autonomous drones can enhance emergency response effectiveness. Integrating these (artificial) agents into existing emergency teams and workflows may significantly impact established accountability relationships. This paper examines how autonomous drones affect accountability attribution within complex socio-technical systems. Drawing on two real-life field trials in firefighting, the study reveals substantial uncertainty around accountability when drones are organizationally deployed. Using Bovens’ accountability framework, two challenges are identified: (1) uncertainty about the role of drones within hierarchical structures, leading to confused accountability ascriptions; and (2) new forms of human-drone interactions introducing additional accountability-relevant issues. Based on these insights, the paper proposes actionable recommendations to support the responsible integration of autonomous drones into firefighting operations without undermining accountability. These findings offer practical guidance for policymakers and contribute to further research on accountability in autonomous systems.
[HC-10] ARES: A Platform for Adaptive Role-Based Evaluation of Social Engineering Risks in Human–AI Games CCS
Link: https://arxiv.org/abs/2606.17793
Authors: Roberto Daza, Javier Irigoyen, Ivan Lopez, Raquel Rodriguez-Carvajal, Laura Gomez, Julian Fierrez, Ruben Tolosana, Aythami Morales
Subjects: Human-Computer Interaction (cs.HC); Databases (cs.DB)
Comments: 6 pages, 2 figures. Accepted at the International Carnahan Conference on Security Technology (ICCST 2026)
Abstract:This work introduces ARES, a platform and open pilot dataset for auditing adaptive social engineering risks in LLM-mediated social decision-making through controlled social games. ARES supports human–human, human–AI, and AI–AI settings, combining configurable game templates, role-conditioned LLM agents, psychology-informed participant profiling, structured interaction trees, and synchronised behavioural and biometric acquisition, filtering, and deep-learning-based feature extraction. The pilot dataset was collected from 15 participants interacting with a role-conditioned GPT-5.4 agent in two concatenated games: an adapted Prisoner’s Dilemma and an Ultimatum Game. It comprises 340 GB of raw and processed multimodal data across six streams: interaction logs, video, screen recordings, gaze logs, smartwatch signals, and game/questionnaire metadata. These data include interaction paths, written justifications, psychological profiles, subjective feedback, perceived counterpart identity, game outcomes, and derived behavioural, facial, and gaze features. Alongside the dataset, we provide descriptive analyses to characterise the pilot release. Rigorous risk evaluation is essential for the deployment of secure AI systems, as it enables the identification and mitigation of vulnerabilities, ensures the protection of sensitive data, and supports compliance with evolving regulatory and ethical standards in society.
[HC-11] Mind Companion: An Embodied Conversational Agent for Process-Based Psychotherapy
Link: https://arxiv.org/abs/2606.17789
Authors: Sofie Kamber,Lukas Diebold,Pascal Riachi,Stella Brogna,Andrew Gloster,Rafael Wampfler
Categories: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Access to evidence-based psychotherapy remains limited worldwide, with long waitlists even in high-income regions. Recent advances in large language models (LLMs) offer potential for scalable mental health support when designed with clinical oversight and safety mechanisms. We present Mind Companion, an LLM-based embodied conversational agent integrating multi-layered psychological analysis with process-based therapy principles. The system performs real-time analysis of client statements across fact extraction, psychological flexibility process detection, emotion recognition, and safety monitoring. Analysis results are stored for supervising clinicians to inform therapeutic planning. Response generation incorporates retrieval-augmented generation from evidence-based therapeutic literature and context-aware prompting. Responses are delivered through an embodied avatar with synchronized speech synthesis and animation. We evaluated three LLM configurations (GPT-4.1-mini, GPT-5.2, Claude Sonnet 4.5) against therapist responses from real therapy sessions using automated LLM-judge assessment and expert evaluation with 11 professional psychotherapists. GPT-5.2 achieved higher ratings than human therapist responses across understanding, interpersonal effectiveness, collaboration, and therapeutic alignment in both evaluations, demonstrating the feasibility of LLM-based conversational agents as tools to complement clinical care.
[HC-12] Toward Accessible Psychotherapy Training Using AI-Driven Interactive Patient Avatars
Link: https://arxiv.org/abs/2606.17786
Authors: Pascal Riachi,Sofie Kamber,Stella Brogna,Andrew Gloster,Rafael Wampfler
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments:
Abstract:Training psychotherapists in evidence-based interventions such as Acceptance and Commitment Therapy (ACT) requires repeated practice with meaningful feedback, yet opportunities for safe, standardized training are limited by ethical, logistical, and resource constraints. We introduce a system designed to support ACT-oriented psychotherapy training through spoken dialogue with an embodied virtual patient. The system uses large language models to simulate patient behavior conditioned on profiles derived from real therapy sessions and configurable clinical scenarios, while a separate automated evaluator provides turn-by-turn feedback on therapist responses based on established ACT fidelity criteria. Rather than aiming to replace supervision, the system is intended to support deliberate practice by enabling experimentation, reflection, and immediate feedback in low-risk settings. Expert evaluation with practicing psychologists confirmed high realism in patient behavior and demonstrated that immediate turn-by-turn ACT feedback increased therapists’ awareness of intervention choices and enabled effective experimentation with alternative responses. Quantitative evaluation across 49 therapy transcripts identified GPT-4o-mini as the optimal feedback model, achieving the lowest mean absolute error (MAE = 6.12) in replicating human supervisor ACT fidelity ratings with statistically significant agreement. This work demonstrates the potential of fidelity-aware simulated patients as a scalable complement to psychotherapy training.
[HC-13] Is It Real? Exploiting Virtual-Physical Discrimination Vulnerability in Mixed Reality
Link: https://arxiv.org/abs/2606.17783
Authors: Xueyang Wang,Xihuan Yao,Yanming Xiu,Xin Yi,Maria Gorlatova,Hewu Li
Categories: Human-Computer Interaction (cs.HC)
Comments: Accepted at the 2026 USENIX Symposium on Usable Privacy and Security (SOUPS 2026)
Abstract:Consumer mixed reality (MR) headsets seamlessly blend virtual content into physical environments with sufficient fidelity that users may be unable to distinguish virtual objects from physical ones. We identify this virtual-physical discrimination vulnerability as an exploitable security primitive. Through speculative design workshops with 12 experts from cybersecurity and MR/HCI, we develop a taxonomy of virtual-physical confusion attacks and implement four proof-of-concept attacks on Apple Vision Pro, evaluating them with 26 participants in realistic MR tasks. All four attacks altered user behavior, with success rates ranging from 85% to 100%, producing misdirected interactions, misjudged object identities, biased purchasing decisions, and altered navigation paths. Notably, the most successful attacks were also the hardest to detect according to participants’ subjective ratings. Even participants who recognized virtual content still complied behaviorally, and no participant attributed anomalous events to adversarial causes. We propose platform-level provenance, interaction gating, and user education as countermeasures.
[HC-14] Talking to Your Data: Exploring Embodied Conversation as an Interface for Personal Health Reflection
Link: https://arxiv.org/abs/2606.17767
Authors: Nikola Kovacevic,Bastien Husler,Di Zhuang,Rafael Wampfler,Barbara Solenthaler
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:
Abstract:Personal health data from wearables are typically presented through dashboards of charts and summary statistics, requiring users to actively interpret patterns and implications. We explore an alternative interaction paradigm: engaging with personal health data through an embodied conversational agent that facilitates objective data reflection in dialogue with the user. We present a system that combines lightweight preprocessing of wearable data with a Unity-based embodied character. Internally, the system follows a dual-agent design in which an Observer agent extracts descriptive statistics and temporal trends, and a Presenter agent communicates these findings through “spoken statistics,” intentionally refraining from clinical advice to isolate the impact of the interaction modality. We evaluate this approach through a simulated-self user study (N=5) using a within-subject design. Participants adopted health personas and goals derived from the LifeSnaps dataset to compare traditional dashboard exploration with embodied conversational reflection. Our evaluation focuses on perceived understanding, the specificity of generated actions, and the cognitive shift from passive viewing to active sensemaking. The paper contributes a functional prototype, a design pattern for objective health data narrative generation, and early empirical insights into how embodiment affects the interpretation of personal health metrics.
[HC-15] A Wearable Multimodal Ultrasound-Inertial System for Real-Time Virtual Reality Interaction
Link: https://arxiv.org/abs/2606.17741
Authors: Giusy Spacone,Sebastian Frey,Enzo Baraldi,Mattia Orlandi,Luca Benini,Andrea Cossettini
Categories: Systems and Control (eess.SY); Human-Computer Interaction (cs.HC)
Comments: 8 pages, 8 figures, 3 tables
Abstract:A-mode ultrasound (US) is a promising sensing modality for Virtual Reality (VR) interaction, as it enables the mapping of muscular activity into control commands while retaining the benefits of wearable sensing. However, existing approaches still face limitations in terms of wearability and interaction complexity, often relying on external hardware such as cameras. In this work, we propose a fully wearable multimodal interface for real-time VR interaction, based on concurrent US and inertial (accelerometry) sensing from the forearm and upper arm. The system is built on the WULPUS platform and integrates an end-to-end software framework for real-time acquisition, visualization, and communication with a Unity-based VR environment. A multimodal learning pipeline is introduced for concurrent hand pose and forearm position estimation in 2D space. The interface is evaluated through offline and online experiments with five subjects, during the execution of three functional tasks: cylinder grasping (gross motor) and relocation, marble pinching (fine motor) and relocation, and liquid pouring. For offline experiments, we collect 5 acquisition sessions across multiple days, achieving an average inter-session accuracy across subjects of 80 ± 6% for hand pose estimation and 77 ± 7% for forearm position estimation. Online validation with minimal fine-tuning (5 min) demonstrates success rates of 92.0 ± 16.0%, 88.0 ± 9.8%, and 96.0 ± 8.0% for the three tasks, respectively. With a power consumption of only 19.9 mW, our system enables more than 2.5 days of continuous use on a small 350 mAh LiPo battery without the need for recharge, enabling truly wearable, multimodal, and functionally meaningful VR interaction.
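As a rough sanity check on the battery-life figure quoted in the abstract above, the sketch below converts battery capacity into runtime at a constant power draw. It assumes a nominal LiPo cell voltage of 3.7 V and ignores regulator and conversion losses; neither assumption is stated in the abstract, and the function name is purely illustrative.

```python
# Back-of-the-envelope runtime estimate from the numbers in the abstract.
# ASSUMPTIONS (not from the paper): 3.7 V nominal LiPo voltage, no conversion
# losses, constant power draw. battery_life_hours is a hypothetical helper.

def battery_life_hours(capacity_mah: float, power_mw: float,
                       nominal_voltage_v: float = 3.7) -> float:
    """Return the ideal runtime in hours for a given capacity and power draw."""
    energy_mwh = capacity_mah * nominal_voltage_v  # mAh * V = mWh
    return energy_mwh / power_mw


hours = battery_life_hours(capacity_mah=350, power_mw=19.9)
days = hours / 24
print(f"{hours:.1f} h = {days:.2f} days")  # prints "65.1 h = 2.71 days"
```

Under these assumptions the ideal runtime is about 2.7 days, which is consistent with the "more than 2.5 days" stated in the abstract.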
[HC-16] SketchXplain: Intuitive Visual Explanations of Image Classifiers with Sketches
Link: https://arxiv.org/abs/2606.17646
Authors: Wencan Zhang,Mario Michelessa,Xuejun Zhao,Brian Y. Lim
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 14 pages, 6 figures, 4 tables. Submitted to TVCG
Abstract:Saliency map visualizations explain image-based AI predictions by pointing to regions, but these are often unintuitive and semantically unclear, leaving an interpretability gap. We argue that AI explanations should be intuitive – coherent to user knowledge, yet simple and selective to accelerate interpretation. Inspired by artistic drawings, we propose SketchXplain to generate sketch-based visual explanations for intuitive image-based explainable AI (XAI). Combining techniques in saliency maps, concept-bottleneck models, and sketch optimization, SketchXplain integrates saliency to select coherent observation artifacts, concepts for knowledge coherence, cues to represent them, and abstraction for simplicity. Evaluating on face expression recognition, modeling and user studies showed that SketchXplain supported quicker interpretation with more aligned visualizations than saliency maps or simple drawings. Further evaluation on skin lesion diagnosis found that SketchXplain more coherently visualized disease symptoms, better supporting lay diagnosis. Thus, this work illustrates the value of sketches for intuitive, simple, coherent, and quick image-based XAI visualizations.
[HC-17] AdaPT: Adaptive Lesson Plan Transformer for Cross-Regional and Differentiated Instruction
Link: https://arxiv.org/abs/2606.17633
Authors: Yanjie Zhang,Jiajun Zhu,Minyu Wu,Huamin Qu,Sicheng Song
Categories: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Due to educational inequality, high-quality lesson plans often mismatch the needs of disparate educational contexts. Teachers typically modify existing lesson plans to fit new contexts, but current tools instead focus on generating content from scratch, creating additional workload. Moreover, a critical gap remains in supporting teachers to quickly adapt to new learning profiles. To bridge these gaps, we present AdaPT, a system that leverages LLMs to support the transformation of existing lesson plans for cross-regional and differentiated instruction. AdaPT features an interactive interface that allows teachers to input student profiles, offers structured lesson representation, provides explanations for lesson-plan transformations, automatically adapts lesson content for new contexts, and supports iterative, teacher-in-the-loop refinement. We evaluated AdaPT through a user study with 9 teachers and an expert evaluation with 3 specialists. Results show that AdaPT supports workflows of teachers and offers a promising pathway toward promoting educational equity.
[HC-18] Towards Speech Impairment Prediction in German-Speaking Individuals with Amyotrophic Lateral Sclerosis INTERSPEECH2026
Link: https://arxiv.org/abs/2606.17616
Authors: Monica Gonzalez-Machorro,Ricarda von Heynitz,Justine Hanslmeier,Finja Grimm,Alexandra-Iulia Deac,Anne Gründel,Isabell Cordts,Björn Schuller
Categories: Human-Computer Interaction (cs.HC)
Comments: Paper accepted at Interspeech 2026, Sydney, Australia
Abstract:Amyotrophic Lateral Sclerosis (ALS) is a neurodegenerative disease, often affecting speech due to bulbar dysfunction. In this study, we predict speech impairment in people with ALS (pwALS) using two clinical speech-related scores. We evaluate cross-sectional (across speakers) and personalised (within-speaker) modelling paradigms and analyse the utility of common speech tasks to contribute to the standardisation of speech data collection for pwALS. Experiments on a German-speaking cohort of 66 pwALS show that repetition tasks (/da/-/da/, /da/-/ba/) achieved the best cross-sectional performance (Concordance Correlation Coefficient (CCC) = 0.62) for predicting the Quality of Life in the Dysarthric Speaker questionnaire, while the within-speaker setting reached a CCC of 0.86. This study represents an initial step towards speech impairment prediction in German-speaking pwALS and highlights the potential of automated speech analysis as a supportive tool for speech impairment assessment.
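For readers unfamiliar with the metric reported in the abstract above, the following is a minimal sketch of Lin's Concordance Correlation Coefficient (CCC), which penalises both low correlation and systematic offset between predictions and reference scores. It uses population (1/n) variance and covariance; whether the paper uses this or the sample estimator is not stated, and the function name and sample values are illustrative only.

```python
# Minimal sketch of Lin's Concordance Correlation Coefficient (CCC).
# ASSUMPTIONS: population (1/n) estimators; concordance_cc and the sample
# values below are hypothetical and do not come from the paper.

def concordance_cc(y_true, y_pred):
    """CCC = 2*cov / (var_true + var_pred + (mean_true - mean_pred)^2)."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n
    denom = var_t + var_p + (mean_t - mean_p) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined for two identical constant series")
    return 2 * cov / denom


reference = [1.0, 2.0, 3.0, 4.0]
print(concordance_cc(reference, reference))             # perfect agreement: 1.0
print(concordance_cc(reference, [2.0, 3.0, 4.0, 5.0]))  # constant offset lowers CCC
```

Unlike Pearson correlation, which would be 1 in both calls, CCC drops for the shifted predictions because the mean difference enters the denominator.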
[HC-19] MedEasy: Designing AI Standardized Patients for Clinical Consultation Training
Link: https://arxiv.org/abs/2606.17512
Authors: Zhiqi Gao,Huarui Luo,Guo Zhu,Bingquan Zhang,Dongyijie Primo Pan,Yizhan Feng,Jiahuan Pei,Jie Li,Benyou Wang
Categories: Human-Computer Interaction (cs.HC)
Comments:
Abstract:AI standardized patients are becoming a setting for professional training in clinical consultation. This paper presents MedEasy, a multi-agent system that organizes virtual-patient practice through patient dialogue, clinical actions, decision submission, documentation, and feedback. We first conducted a formative study with 12 clinical-year medical students through interviews and three co-design workshops. The findings informed a staged workflow, structured case records, action-contingent findings, and trajectory-based review. We then conducted an evaluative user study with a separate cohort of 12 clinical-year medical students, with each participant completing two counterbalanced cases. Learners interpreted MedEasy as a connected consultation environment. They used patient responses, examination findings, available actions, and feedback together to judge whether the represented case remained coherent. They valued repeatable practice and recorded review, while questioning missing actions and feedback criteria. The paper contributes design implications for AI-supported professional training systems that use case-specific standards to connect situated practice.
[HC-20] Self-Efficacy and Favorability Shape Learning from Tutoring Systems and Paper Practice
Link: https://arxiv.org/abs/2606.17470
Authors: Xinfei Cen,Vincent Aleven,Kenneth R. Koedinger,Conrad Borchers,Paulo F. Carvalho
Categories: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: Full research paper accepted at EC-TEL 2026
Abstract:Motivational factors such as self-efficacy and how favorably students feel toward practice play a crucial role in shaping learning, particularly in technology-supported environments. Yet, educational interventions often overlook how these factors interact with practice format. This paper examines the influence of self-efficacy and favorability on learning outcomes across two common practice formats: paper-based and system-based tutoring practice. Using a counterbalanced within-subject design with matched problem sets, we isolate the effect of practice format while modeling motivational differences. Results indicate that students with lower baseline self-efficacy achieved greater learning gains regardless of practice format. Among students with lower baseline self-efficacy, greater favorability toward the tutor was associated with greater learning gains during tutor practice, whereas the pattern differed in paper-based practice. Intelligent Tutoring System (ITS)-based practice did not significantly improve post-training self-efficacy relative to paper-based methods. These findings underscore the potential value of tailoring practice format to students’ motivational profiles, as the benefits of tutor- and paper-based practice varied with baseline self-efficacy and favorability. They lay the groundwork for future research on how instructional formats can be aligned more effectively with learners’ motivational needs.
[HC-21] Patients With Personality: Realistic Patient Simulation through Controlled Diversity and Selective Disclosure
Link: https://arxiv.org/abs/2606.17441
Authors: Moritz Schlager,Friederike Jungmann,Samuel Schmidgall,Philipp Raffler,Franziska Hartl,Eva Wende,Paula Roßmüller,Conrad Ketzer,Avinatan Hassidim,Dale R. Webster,Yossi Matias,Yun Liu,Daniel Rueckert,Mike Schaekermann,Paul Hager
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 22 pages, 11 figures
Abstract:Simulating realistic patient interactions is a key requirement for testing clinical applications of LLMs at scale without time-consuming and expensive user studies. However, existing approaches often lack realism and controllability, frequently oversharing information unprompted and failing to capture the wide variability of patient behavior. Here, we introduce PatientsWithPersonality (PWP), a patient simulation framework that generates realistic yet diverse virtual patient responses through explicit personality parametrization over a latent patient state. Grounded in HEXACO, a six-dimensional personality space used to quantify and parameterize human behavioral traits, our approach enables fine-grained control over conversational style, cooperativeness, and information disclosure within a unified framework. In a clinician evaluation, PWP is judged nearly as realistic as recorded human actors and clearly ahead of prior simulators, while being flagged as “too informative” far less often. Conditioning on HEXACO axes yields personas whose configured traits are recoverable by both clinicians and an autorater, span a substantially wider behavioral footprint than the closest baseline, and prevent oversharing. Altogether, our framework paves the way for more accurate and informative LLM benchmarking through our realistic and steerable patient simulator.
[HC-22] Impact of Hand Impairment and Occlusions on Hand Pose Estimation Accuracy in Augmented Reality Applications
Link: https://arxiv.org/abs/2606.17427
Authors: Damian M. Manzone,Mathew Szymanowski,Olga Taran,Shuo Cai,Melissa Marquez-Chin,Tammy Zeng,Hardeep Singh,Cesar Marquez-Chin,José Zariffa
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Mixed reality applications can be designed for hand rehabilitation. Augmented reality (AR) head mounted displays (HMDs) specifically allow for ecologically valid tasks because individuals can see their real environment and interact with real objects while receiving additional cues on the HMD. While these applications rely on accurate hand pose estimation, there is a gap in investigating the influence of hand impairment or occlusion from real-object interactions on pose estimation accuracy. Further, comparisons between AR HMD predictions and state-of-the-art pose estimation methods have not been established. The current study assessed pose estimation accuracy of the HoloLens 2 HMD and state-of-the-art pose estimation algorithms (WiLoR, HaMeR, WildHands, and MediaPipe) while individuals with cervical spinal cord injury (cSCI; n = 13, Neurological Level of Injury: C3-C6; American Spinal Injury Association Impairment Scale: A-D) and 15 uninjured controls interacted with clear and opaque objects. Ground truth estimates of 3D joint positions were generated via triangulation from a multi-camera setup. Pose estimation accuracy did not differ between the cSCI and uninjured control groups suggesting that 3D joint predictions from the HoloLens 2 and pose estimation algorithms can generalize to populations with hand impairment. Further, clear objects provided a small accuracy advantage over opaque objects (0.1 mm) and predictions from both WiLoR and HaMeR were slightly more accurate than the HoloLens 2 (2 mm). Overall, these results suggest that the HoloLens 2 may be viable for hand rehabilitation applications and the dataset generated can be used to refine pose estimation methods for hand-impaired populations.
[HC-23] PromptMN: Pseudo Prompting Language
Link: https://arxiv.org/abs/2606.17164
Authors: Enkhzol Dovdon
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Programming Languages (cs.PL); Software Engineering (cs.SE)
Comments: 32 pages, 2 figures
Abstract:Prompting has become the primary interface between humans and generative AI, yet many natural language prompts remain fragile: roles, goals, constraints, and expected outputs are often buried in prose or left implicit. In agentic and software development workflows, a misread at the first handoff can propagate through every step, since a significant portion of agent failures stem from context ambiguities rather than model limitations. This paper introduces PromptMN, a pseudo-prompting domain-specific language that annotates natural language with compact, %-prefixed typed directives covering roles, goals, requirements, priorities, constraints, plans, inputs, and outputs. Semantic resolution lets authors write in any order while the model interprets directives by function. PromptMN sits between informal prompting and programming-style pseudocode: structured enough to be inspectable and reusable, yet lightweight enough for analysts, managers, developers, and stakeholders across the software development lifecycle (SDLC). PromptMN also pairs with reverse prompt engineering. Asking a model to restate a desired outcome as PromptMN lets users inspect the inferred roles, goals, constraints, and missing assumptions before acting, reducing repair cycles and yielding a reusable artifact for aligning people and AI tools. PromptMN’s feasibility is evaluated across several frontier models, including Claude Fable 5, Claude Opus 4.8, Gemini 3.1 Pro, and GPT-5.5. The models correctly resolved PromptMN instructions, including complex structures such as repetition, conditionals, methods, and a prime-checking task, without fine-tuning. The same vocabulary applies across new codebases, maintenance, and redesign in the SDLC scenarios presented. While large-scale validation remains future work, these early results suggest PromptMN is a practical step toward clearer, more reviewable human-to-AI interaction.
[HC-24] The Bias Paradox: How AI Personas Can Overcome Human Limitations in UX Research
Link: https://arxiv.org/abs/2606.17101
Authors: Ozgur Taylan Celik
Categories: Human-Computer Interaction (cs.HC)
Comments: Paper accepted for ACM CHI workshop on Responsible AI Personas
Abstract:This position paper examines a paradox encountered in UX research practice: a situation where real human participants delivered less authentic insights than AI personas might have, due to context-induced biases. We share our experience developing research-based AI personas using OpenAI’s custom GPT builder and conducting a design thinking workshop with high-net-worth banking clients. The workshop setting, including a luxury hotel, present portfolio managers, and hospitality dynamics, introduced biases that compromised the feedback. We propose that AI personas offer an underexplored opportunity to mitigate certain human limitations in user research, and call for frameworks that help teams recognize when traditional research contexts introduce biases that AI personas might help avoid.
[HC-25] Security and Human-Centered Assessment of BACnet-Controlled DALI Infrastructure in an Educational Building Automation Testbed
Link: https://arxiv.org/abs/2606.17089
Authors: Ariton Verush
Categories: Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
Comments: 7 pages, 9 figures, 1 table; technical case study
Abstract:Building automation and control systems integrate heating, ventilation, air conditioning, lighting, sensing, and management functions through specialized communication protocols. While this integration enables flexible building operation, it also creates complex cyber-physical environments that are difficult to inspect, secure, and explain to new analysts. This paper presents a practical security and human-centered case study of a BACnet/IP building automation testbed with DALI lighting infrastructure, investigated during a domotics-oriented cybersecurity hackathon in Thun, Switzerland in April 2026. The study combines network-oriented enumeration, object-level inspection, physical rack analysis, and reflective HCI analysis of tool-supported learning. Using Yabe and BACteria, the work documents observable BACnet services, reconstructs structured object hierarchies, identifies room-level lighting-control paths, and maps BACnet objects to DALI group-level infrastructure. The analysis emphasizes that BACS assessment is not only a technical protocol task: it also requires usable tool interfaces, physical observability, interpretable naming conventions, and safe mental models for command priorities. The paper contributes a compact case study of BACnet/DALI exploration in an educational testbed and discusses implications for cybersecurity education, human-centered security tooling, and responsible experimentation in cyber-physical building environments.
Computer Vision
[CV-0] Future Dynamic 3D Reconstruction: A 3D World Model with Disentangled Ego-Motion ICML2026
Link: https://arxiv.org/abs/2606.18250
Authors: Nils Morbitzer,Jonathan Evers,Artem Savkin,Thomas Stauner,Nassir Navab,Federico Tombari,Stefano Gasperini
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2026. Project page: this https URL
Abstract:Forecasting the evolution of dynamic environments is crucial for autonomous agents. While generative world models have recently achieved high photorealism in 2D video synthesis by mixing ego-motion and environmental dynamics within the image plane, they exhibit physical inconsistencies, such as morphing or vanishing objects, especially over long time horizons. In this paper, we propose FR3D, a world model that predicts a persistent 3D latent representation for future dynamic 3D reconstruction. Unlike prior works that treat the world as a sequence of image-based features, FR3D explicitly decouples the 3D evolution of the scene from the agent’s trajectory, treating the inferred ego-motion as a latent proxy for action. This disentanglement resolves the ambiguities between self-motion and world-motion, ensuring geometric consistency into the future. Furthermore, we introduce a teacher-student distillation strategy that leverages the spatial “common sense” of off-the-shelf foundation models, leading to robust zero-shot generalization. Extensive experiments demonstrate FR3D’s strong performance for future dynamic 3D reconstruction from monocular observations across multiple datasets, even 2 seconds into the future. Project page: this https URL.
[CV-1] Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification ICML2026
Link: https://arxiv.org/abs/2606.18249
Authors: Wujian Peng,Lingchen Meng,Yuxuan Cai,Xianwei Zhuang,Yuhuan Yang,Rongyao Fang,Chenfei Wu,Junyang Lin,Zuxuan Wu,Shuai Bai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML2026. Project page this https URL
Abstract:Unified Multimodal Modeling aims to integrate visual understanding and generation within a single system. However, existing approaches typically rely on two disparate visual tokenizers, which splits the representation space and hinders truly unified modeling. We propose UniAR, a unified autoregressive framework where a single discrete visual tokenizer serves as the key bridge between understanding and generation, enabling a shared context in which the model can directly interpret its own generated visual tokens without additional re-encoding. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, preserving both high-level semantics and low-level details while scaling the effective visual vocabulary at minimal cost. Building on this, the unified autoregressive model adopts parallel-bitwise-prediction to jointly predict spatially grouped, multi-level visual codes, substantially reducing visual sequence length and accelerating generation. Finally, a diffusion-based visual decoder operates on discrete visual tokens to decode high-fidelity images. Through large-scale pre-training, followed by supervised fine-tuning and reinforcement learning, UniAR achieves state-of-the-art performance on image generation and image editing while remaining competitive on multimodal understanding benchmarks. The project page is available at this https URL.
[CV-2] MOCHI: Motion Enhancement of Collaborative Human-object Interactions SIGGRAPH2026
Link: https://arxiv.org/abs/2606.18243
Authors: Jiye Lee,Yonghun Choi,Jungdam Won
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
Comments: SIGGRAPH 2026 Journal (ACM TOG); Project page: this https URL
Abstract:Collaborative human-object interaction shows dynamic and complex movements that require mutual anticipation and continuous adjustment between participants and the shared object. Modeling such collaborative multi-human object interaction (MHOI) scenarios requires high-quality data acquisition as a foundational step; however, this is challenging due to the inherent complexity of MHOI where human-human and human-object interactions occur simultaneously. Such complexity leads to noisy MHOI captures characterized by several artifacts: contact misalignment between hands and objects, motion jitter and temporal inconsistencies in the captured sequences, and missing or incomplete finger-level articulation details. To address these challenges, we present MOCHI (MOtion Enhancement of Collaborative Human-object Interactions), a two-stage framework for enhancing noisy MHOI data. Our approach first generates physically plausible hand grasps through optimization from noisy body input, producing grasps that are both physically plausible and semantically consistent with the body pose, where these optimized grasps are extended into complete hand-object interaction sequences. Consequently, the full-body motion for all participants is refined through a diffusion-based noise optimization framework that uses single-person motion priors. During the optimization process, we introduce optimization objectives to encode human-object and human-human interaction information within these single-person priors. Experimental results demonstrate the effectiveness of our pipeline across diverse MHOI data, either acquired by existing capture methods or synthesized by generative models. We further show robustness of our system across varying numbers of participants and types of interactions, and demonstrate various applications including keyframe-based MHOI creation and data augmentation through varying object geometries.
[CV-3] EventDrive: Event Cameras for Vision-Language Driving Intelligence CVPR2026
Link: https://arxiv.org/abs/2606.18242
Authors: Dongyue Lu,Rong Li,Ao Liang,Lingdong Kong,Wei Yin,Lai Xing Ng,Benoit R. Cottereau,Camille Simon Chane,Wei Tsang Ooi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR2026, 34 pages, 15 figures, 15 tables, project page: this https URL
Abstract:Event cameras sense the world through asynchronous brightness changes with microsecond latency and high dynamic range, offering motion fidelity far beyond frame-based sensors and capturing temporal structure that conventional exposures often miss. These properties make events a powerful complement to RGB in autonomous driving, especially under blur, glare, and rapid motion, where frame-based perception can become unreliable. However, existing event-aware vision-language models remain limited to generic perception and do not reveal how event sensing contributes to reasoning and decision-making across the full driving loop. We present EventDrive, a large-scale benchmark and model suite that unifies event streams, RGB frames, and language supervision across four core dimensions: Perception, Understanding, Prediction, and Planning, covering captions, structured QA, grounding, motion-state recognition, trajectory forecasting, and planning tasks. Building on this foundation, EventDrive-VLM introduces a multi-horizon event pyramid and a temporal-horizon mixture-of-experts module to adaptively encode and fuse asynchronous and frame-based information for downstream reasoning. Comprehensive evaluation across diverse tasks shows that event streams provide substantial gains in temporal precision, motion awareness, and robustness, bringing event sensing into the center of driving intelligence.
[CV-4] Adaptive Volumetric Mechanical Property Fields Invariant to Resolution ICML2026
Link: https://arxiv.org/abs/2606.18231
Authors: Rishit Dagli,Donglai Xiang,Vismay Modi,Xuning Yang,Gavriel State,David I.W. Levin,Maria Shugrina
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Project Page and hi-res paper: this https URL. ICML 2026
Abstract:Accurate mechanical properties (or materials), namely Young’s modulus ($E$), Poisson’s ratio ($\nu$), and density ($\rho$), are essential for reliable physics simulation of digital worlds, but most 3D assets lack this information. We propose AdaVoMP, a method for predicting accurate, dense, spatially varying ($E$, $\nu$, $\rho$) for input 3D objects across representations, improving the resolution, accuracy, and memory efficiency over the state-of-the-art. The foundation of our technique is a sparse and adaptive voxel structure, SAV, that efficiently represents both the input 3D shape and the material field output. We replace the fixed-voxel model of the most accurate prior method, VoMP, with a novel sparse transformer encoder-decoder model that learns to generate a unique SAV autoregressively for every input shape to represent its materials, achieving a resolution $16^3\times$ higher than prior art. Experiments show that AdaVoMP estimates more accurate volumetric properties, even with less test-time compute than all prior art. This allows us to convert high-resolution complex 3D objects into simulation-ready assets, resulting in realistic deformable simulations.
[CV-5] Seeing Is Not Screening: Multimodal Hidden Instruction Attacks on Agent Skill Scanners
Link: https://arxiv.org/abs/2606.18198
Authors: Xiaojun Jia,Jie Liao,Simeng Qin,Ke Ma,Wenbo Guo,Yebo Feng,Aishan Liu,Yang Liu
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Agent skills are emerging as an important attack surface in LLM-based systems. Through an empirical study of existing skill scanners, we find that current defenses primarily rely on textual descriptions, manifests, and source code as the main signals for security analysis, which can leave visually conveyed malicious intent insufficiently examined. This creates a practical blind spot: harmful operational instructions hidden in images may bypass scanning while still being recoverable by multimodal agents during deployment. To systematically investigate this threat, we propose SkillCamo, a document-mediated multimodal instruction attack that conceals malicious instructions within images bundled with a skill while rewriting the surrounding documentation to naturally reference those images as part of the normal workflow. Thus, the attack does not rely on the image alone, but on the joint interpretation of textual guidance and visual payload at execution time. To defend against such attacks, we further propose ExecScan, an execution-grounded multimodal scanning module that performs intent extraction, behavior reconstruction, abuse assessment, and deliberative execution simulation over skill artifacts. ExecScan jointly analyzes documentation, code, referenced resources, and visual content to recover hidden instructions, reconstruct executable behavior chains, and identify downstream risks such as exfiltration, destruction, persistence, deception, and privilege escalation. Extensive experiments show that image-hidden malicious instructions challenge existing skill scanners, while ExecScan can improve the skill scanning performance.
[CV-6] EgoCS-400K: An Egocentric Gameplay Dataset for World Models
Link: https://arxiv.org/abs/2606.18180
Authors: Rongjin Guo,Dong Liang,Yuhao Liu,Fang Liu,Tianyu Huang,Gerhard P. Hancke,Rynson W. H. Lau
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The shift from video generation to interactive world modeling places new demands on data: beyond captioned videos, world models require temporally aligned video-action-language trajectories grounded in the actions, camera motion, states, and events that drive future scene changes. However, such data is difficult to obtain at scale. Web video datasets offer broad visual coverage but lack executable actions and reliable states; robotic datasets provide action and state supervision but are costly and limited in scene diversity; and existing simulators often lack large-scale human-driven interaction trajectories. In this paper, we introduce EgoCS-400K, a large-scale replay-grounded egocentric Counter-Strike dataset for world models, built from public professional CS and CS2 match demos that preserve human gameplay trajectories and enable parsing, replaying, rendering, and temporal alignment. We extract player states, view directions, movements, keyboard/button inputs, view-angle changes, weapon usage, game events, and round-level context, and render clean first-person videos from the same trajectories. EgoCS-400K contains over 400,000 first-person videos and 10,000 hours of gameplay from more than 1,000 matches and 40,000 rounds, covering 13 maps and 10 player viewpoints per round. It supports a range of interactive visual modeling tasks, including action-conditioned future prediction, state- and event-aware scene rollout, replay-grounded captioning, and agent egocentric action understanding. By connecting visual observations with human actions, camera motion, game states, and events at scale, EgoCS-400K serves as a practical bridge between passive web videos, controllable game simulation, and costly real-world embodied data.
[CV-7] ReAge3D: Re-Aging 3D Faces with View Consistency
Link: https://arxiv.org/abs/2606.18156
Authors: Libing Zeng,Li Ma,Mingming He,Ning Yu,Paul Debevec,Nima Khademi Kalantari
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:We present a novel framework for realistic and controllable 3D face re-aging which produces highly detailed, identity-preserving results. Existing 3D editing methods, while effective for coarse semantic changes, are not well suited for re-aging, as even small inconsistencies across re-aged 2D views can lead to over-smoothing of subtle but perceptually important age-related details. To address this challenge, we first introduce a 2D diffusion-based re-aging model, DiffReaging, trained on synthetically generated image pairs. We further propose a center-out editing propagation strategy that leverages this re-aging model to reconstruct multi-view-consistent re-aged images. Specifically, starting from a re-aged frontal pivot view, we reconstruct the remaining views through warping and our proposed Masked-DiffReaging process. By injecting existing content at every step of the diffusion process, Masked-DiffReaging ensures that the reconstructed regions remain coherent with existing pixels. The resulting consistent set of re-aged views supervises the optimization of the re-aged 3D representation. Our method outperforms existing 3D editing techniques both visually and quantitatively, enabling smooth, fine-grained control over age transformations in 3D face models.
[CV-8] Neural Tree Reconstruction for the Open Forest Observatory ICLR2024
Link: https://arxiv.org/abs/2606.18153
Authors: Marissa Ramirez de Chanlatte,Arjun Rewari,Trevor Darrell,Derek J. N. Young
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published as a workshop paper at “Tackling Climate Change with Machine Learning”, ICLR 2024
Abstract:The Open Forest Observatory (OFO) is a collaboration across universities and other partners to make low-cost forest mapping accessible to ecologists, land managers, and the general public. The OFO is building both a database of geospatial forest data and open-source methods and tools for forest mapping by uncrewed aerial vehicle. Such data are useful for a variety of climate applications including prioritizing reforestation efforts, informing wildfire hazard reduction, and monitoring carbon sequestration. In the current iteration of the OFO’s forest map database, 3D tree maps are created using classical structure-from-motion techniques. This approach is prone to artifacts, lacks detail, and has particular difficulty on the forest floor, where the input data (overhead imagery) has limited visibility. These reconstruction errors can potentially propagate to the downstream scientific tasks (e.g., a wildfire simulation). Advances in 3D reconstruction, including methods like Neural Radiance Fields (NeRF), produce higher quality results that are more robust to sparse views and support data-driven priors. We explore ways to incorporate NeRFs into the OFO dataset, outline future work to support even more state-of-the-art 3D vision models, and describe the importance of high-quality 3D reconstructions for forestry applications.
[CV-9] Predicting Immune Biomarkers with MultiModal Mixture-of-Expert Pathology Foundation Models Empowers Precision Oncology
Link: https://arxiv.org/abs/2606.18123
Authors: Tianyu Liu,Ziqing Wang,Zhaokang Liang,Tong Ding,Peter Humphrey,Lorraine Colón-Cartagena,Emily Ling-Lin Pai,Kenneth Tou En Chang,Mohamed Kahila,Jonathan Chong Kai Liew,Tinglin Huang,Rex Ying,Kaize Ding,Faisal Mahmood,Wengong Jin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 figures
Abstract:Predicting immune biomarkers associated with the tumor immune microenvironment (TIME) is critical for advancing precision oncology, yet existing approaches are largely limited to single image modalities and suffer from insufficient resolution and incomplete utilization of complementary clinical and biological information. Here we introduce MixTIME, a multimodal foundation model that leverages a mixture-of-experts (MoE) architecture to integrate pathology foundation models trained across distinct modalities: image-only (UNIv2), image-text (CONCHv1.5), and image-transcriptomic (STPath) representations for pixel-level and slide-level prediction of multiplex immunofluorescence (mIF) protein expression from hematoxylin and eosin (H&E) whole-slide images. MixTIME employs a learnable router to dynamically weight expert contributions and is trained with a distribution- and tendency-aware loss function. Benchmarked on two datasets of different scales, MixTIME achieves state-of-the-art performance across 17 protein markers as measured by correlation metrics. The predicted mIF profiles substantially enhance downstream tasks, including spatial domain identification, survival prediction, and AI-assisted pathology report generation validated by expert pathologists from multiple institutes across the world. Furthermore, MixTIME enables longitudinal tracking of protein expression dynamics across clinical time points and reveals protein-gene interaction patterns linked to drug resistance and immune suppression in tumor microenvironments. Collectively, MixTIME provides a scalable framework for multimodal biomarker discovery and clinical translation in computational pathology.
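The abstract mentions a learnable router that dynamically weights expert contributions. As a rough illustration only, the sketch below shows a generic softmax-gated mixture over three expert outputs. The function names, the toy vectors, and the fixed router logits are assumptions made here for illustration; the abstract does not specify MixTIME's actual router or feature shapes.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mix_experts(expert_outputs, router_logits):
    """Combine per-expert predictions with softmax router weights.

    expert_outputs: list of equal-length vectors, one per expert (toy data).
    router_logits: one score per expert; in a trained model a learned
    layer would produce these from the input (assumption).
    """
    weights = softmax(router_logits)
    dim = len(expert_outputs[0])
    mixed = [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]
    return mixed, weights

# Three hypothetical experts (e.g. image-only, image-text, image-transcriptomic).
experts = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
mixed, weights = mix_experts(experts, [0.0, 0.0, 0.0])
```

With equal logits each expert receives the same weight, so the mixture is the plain average of the expert outputs; unequal logits shift the output toward the preferred expert.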
[CV-10] HLS-GPT: A Generative Pretrained Transformer (GPT) for Continental-Scale NASA Harmonized Landsat and Sentinel-2 (HLS) Reflectance Reconstruction Across All Bands on Arbitrary Dates
Link: https://arxiv.org/abs/2606.18115
Authors: Junjie Li,Hankui K. Zhang,David P. Roy
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent deep learning methods for Landsat and Sentinel-2 reflectance time series reconstruction remain limited by restricted spectral coverage, limited geographic scalability, or patch-based designs with short temporal contexts. We present HLS-GPT, a large-scale generative pretrained Transformer model for reconstructing NASA Harmonized Landsat Sentinel-2 30 m surface reflectance for all bands, any date, and any pixel location. HLS-GPT uses a hierarchical Transformer architecture to handle the different spectral band configurations of Landsat and Sentinel-2 and operates on single-pixel 12-month time series. To capture geographic and seasonal variability, the model was trained with nine years of HLS time series from more than 0.25 million training pixels across the conterminous United States. A random cropping and masking strategy extracts 12-month periods with varying start dates across epochs, masks 50% of valid observations, and trains the model to reconstruct the masked reflectance values from the remaining observations. Evaluation using more than 62,000 independent test pixels shows robust reconstruction under diverse land surface conditions, including complex crop phenology and sparse, irregular observations. Leave-one-observation-out evaluation achieved reconstruction RMSE below 0.026 for all HLS spectral bands, with relative RMSE below 35% for visible bands and below 13% for other bands. Red-edge band errors were comparable to red and near-infrared errors despite the absence of red-edge bands on Landsat. Sensitivity analyses that randomly masked 10% to 90% of test observations showed only modest degradation when 10% to 50% of observations were masked, with all-band RMSE below 0.028. Image reconstruction over nine independent 109 by 109 km CONUS HLS tiles further demonstrates that HLS-GPT outperforms two conventional methods and the NASA-IBM Prithvi model.
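The random cropping and masking strategy described above can be sketched for a single pixel as follows. This is a minimal illustration under stated assumptions: observations are modeled as (day, reflectance) pairs, the 12-month period is approximated as 365 days, and the function name `crop_and_mask` and all parameters are hypothetical rather than taken from the paper.

```python
import random

def crop_and_mask(observations, total_days, window_days=365,
                  mask_fraction=0.5, rng=None):
    """Pick a random 12-month window and hide a fraction of its observations.

    observations: list of (day_index, reflectance) pairs of valid observations.
    Returns the window start, the visible observations (model input), and the
    masked observations (reconstruction targets).
    """
    rng = rng or random.Random()
    # Varying start date: drawn anew on each call, e.g. once per epoch.
    start = rng.randint(0, total_days - window_days)
    window = [(d, v) for d, v in observations
              if start <= d < start + window_days]
    # Mask the requested share of the valid observations in the window.
    n_mask = int(len(window) * mask_fraction)
    masked_idx = set(rng.sample(range(len(window)), n_mask))
    visible = [obs for i, obs in enumerate(window) if i not in masked_idx]
    targets = [obs for i, obs in enumerate(window) if i in masked_idx]
    return start, visible, targets

# Toy nine-year series with one valid observation every 8 days.
total = 9 * 365
obs = [(d, 0.1) for d in range(0, total, 8)]
start, visible, targets = crop_and_mask(obs, total, rng=random.Random(0))
```

Because the start day is redrawn on every call, the same pixel yields different windows and different masked subsets across epochs, which is the behavior the abstract describes.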
[CV-11] Qwen-RobotNav Technical Report: A Scalable Navigation Model Designed for an Agentic Navigation System
Link: https://arxiv.org/abs/2606.18112
Authors: Jiazhao Zhang,Gengze Zhou,Hale Yin,Yiyang Huang,Zixing Lei,Qihang Peng,Haoqi Yuan,Jie Zhang,Xudong Guo,Xiaoyue Chen,An Yang,Fei Huang,Junyang Lin,Dayiheng Liu,Jingren Zhou,Zhuoyuan Yu,Jingyang Fan,Zhixuan Liang,Pei Lin,Ye Wang,Anzhe Chen,Kun Yan,Xiao Xu,Jiahao Li,Lulu Hu,Minying Zhang,Shurui Li,Wenhu Xiao,Shuai Bai,Xuancheng Ren,Chenxu Lv,Chenfei Wu,Xiong-Hui Chen
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Agentic navigation systems require a base navigation model whose observation strategy can be externally reconfigured at inference time, because instruction following, object search, target tracking, and autonomous driving share the same perception-planning backbone yet demand fundamentally different strategies for consuming the visual stream. We present Qwen-RobotNav, a scalable navigation model built on Qwen-RobotNav that addresses this need through a parameterised interface with two complementary dimensions: multiple task modes that select the navigation behaviour, and controllable observation parameters (e.g., token budget, per-camera weights) that govern how visual history is encoded. With training-time randomization over all parameters, Qwen-RobotNav is robust to any inference-time configuration and requires zero architectural modification to the Qwen-RobotNav backbone. We train Qwen-RobotNav on 15.6M samples; co-training with vision-language data prevents the collapse into reactive action-sequence mappers observed in trajectory-only training. The parameterised interface also makes Qwen-RobotNav a natural building block for agentic systems: for long-horizon scenarios, an upper-level planner decomposes goals into sub-tasks and dynamically switches Qwen-RobotNav’s task mode and context strategy mid-episode, composing complex behaviours from repeated calls to the same model. Extensive experiments show that Qwen-RobotNav sets new state-of-the-art results across major navigation benchmarks. The model exhibits favourable scaling from 2B to 8B parameters, with joint multi-task training developing a shared spatial-planning substrate that transfers across task families, and demonstrates strong zero-shot generalisation to real-world robots across diverse environments.
[CV-12] Blended Chart Surfaces: A Seamless Explicit Representation for Smooth Surface Fitting
Link: https://arxiv.org/abs/2606.18069
Authors: Romy Williamson,Niloy Mitra
Categories: Graphics (cs.GR); Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 16 figures
Abstract:A surface representation suitable for geometry processing should be compact and explicit, provide global smoothness guarantees, support a wide range of surface topologies, and offer reliable access to differential quantities such as normals and surface energies, while remaining compatible with modern differentiable optimization. Existing neural representations typically sacrifice one or more of these properties: implicit fields typically require iso-surfacing for downstream use, while explicit neural maps are constrained by canonical-domain parametrizations or exhibit seam artifacts between local charts. We introduce Blended Chart Surfaces, a compact, network-free, explicit representation that is smooth by construction and anchored to user-provided topology. Given a coarse proxy mesh encoding the intended surface topology and approximate geometry, Blended Chart Surfaces jointly optimize for a polynomial map at each proxy vertex using an off-the-shelf optimizer to fit to an implicit target shape, avoiding the need for an input parametrization. Neighboring maps are fused using a smooth ‘one-ring coordinate’ blending scheme, decoupling topology and coarse geometry (carried by the proxy) from geometric details (carried by the local patches). The surface is globally smooth, fully differentiable, and enables stable evaluation of derivatives, making differential quantities and surface energies directly accessible. Additionally, our construction is equivariant to rigid motions and scaling of the proxy mesh. We evaluate Blended Chart Surfaces on various topologies and geometric complexity, and compare against explicit alternatives including interpolating-function baselines and mesh-displacement MLPs. Across these, Blended Chart Surfaces achieve a favorable trade-off among compactness, simplicity, access to differential quantities, and expressivity while remaining smooth across patch boundaries.
[CV-13] When LLMs Analyze Scars: From Images to Clinically-Meaningful Features
Link: https://arxiv.org/abs/2606.18063
Authors: Ruman Wang,Hangting Ye
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Medical image classification faces a fundamental dilemma: while deep learning models achieve remarkable performance at scale, real-world clinical scenarios often suffer from severe data scarcity due to annotation costs, privacy constraints, and disease rarity. This challenge is particularly pronounced in pathological scar classification, where differentiating keloids from hypertrophic scars requires subtle expert knowledge and labeled images are extremely limited. We propose a novel paradigm that repositions large language models (LLMs) as knowledge-driven feature engineers rather than end-to-end classifiers. We call this framework ScaFE (Scar Feature Engineering). Our key insight is that LLMs encode rich medical knowledge that can be externalized as executable feature extraction code, enabling the transformation of high-dimensional images into low-dimensional, clinically interpretable representations. Specifically, we prompt an LLM with established scar assessment criteria to generate deterministic Python code that extracts features aligned with clinical scoring systems such as the Vancouver Scar Scale. Our approach offers three key advantages: (1) data efficiency, achieving robust performance with limited training samples by decoupling knowledge acquisition from statistical learning; (2) privacy preservation, as raw images are processed locally without exposure to external LLMs; and (3) interpretability, through explicit features grounded in clinical reasoning. Extensive experiments on scar classification demonstrate that, under limited data conditions, our method consistently outperforms both end-to-end deep learning baselines and LLMs used as black-box classifiers, establishing a promising direction for integrating LLMs into data-efficient and clinically transparent medical AI systems.
[CV-14] PhaseWin: An Efficient Search Algorithm for Faithful Visual Attribution
Link: https://arxiv.org/abs/2606.18008
Authors: Zihan Gu,Ruoyu Chen,Junchi Zhang,Li Liu,Xiaochun Cao,Hua Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 26 pages, 29 figures
Abstract:Visual attribution is a fundamental tool for interpreting modern vision and vision-language models, particularly when their decisions must be inspected, diagnosed, or audited. Its goal is to explain how a model’s decision depends on local regions of the visual input, typically by assigning an importance ordering over candidate image regions. Given an image partitioned into $n$ regions, faithful attribution can be cast as an ordered subset-search problem, in which progressively inserting the selected regions should recover the target model response as early as possible. Exhaustive search over region subsets incurs exponential cost, while the widely used greedy search still requires a quadratic number of model evaluations, because every selection step rescores all remaining candidates. We propose PhaseWin, an efficient subset-search algorithm for faithful visual attribution. PhaseWin reorganizes greedy region selection into a phased window-search procedure: rather than re-evaluating the full candidate set at every step, it alternates between global candidate screening, adaptive pruning, and localized window refinement, while preserving the essential region-ranking behavior of greedy search. We analyze PhaseWin under monotone evidence-accumulation conditions and show that, under feature-level structural assumptions, it attains controllable linear evaluation complexity together with near-greedy faithfulness guarantees. Extensive experiments on image classification, object detection, visual grounding, and image captioning show that, among all compared attribution methods, PhaseWin reaches high faithfulness with the fewest forward passes, empirically realizing the predicted reduction from $O(n^2)$ to $O(n)$. The code is available at this https URL.
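To make the quadratic cost of the greedy baseline concrete, the sketch below implements plain greedy insertion search and counts model evaluations. It illustrates only the baseline that the abstract describes, not PhaseWin itself; the additive toy `score` function stands in for a model forward pass and is an assumption made here.

```python
def greedy_insertion_order(n_regions, score):
    """Greedy ordered subset search over image regions.

    score(subset) stands in for one forward pass of the model on an image
    in which only the regions in `subset` are visible (toy assumption).
    Every selection step rescores all remaining candidates.
    """
    selected, remaining, evaluations = [], set(range(n_regions)), 0
    while remaining:
        best_region, best_score = None, float("-inf")
        for r in sorted(remaining):
            evaluations += 1
            s = score(selected + [r])
            if s > best_score:
                best_region, best_score = r, s
        selected.append(best_region)
        remaining.remove(best_region)
    return selected, evaluations

# Toy additive scorer: each region contributes a fixed amount of evidence.
weights = [0.1, 0.7, 0.2, 0.4]
order, evals = greedy_insertion_order(
    len(weights), lambda subset: sum(weights[i] for i in subset)
)
# order is [1, 3, 2, 0]; evals is 4 + 3 + 2 + 1 = 10
```

For $n$ regions the loop performs $n + (n-1) + \dots + 1 = n(n+1)/2$ evaluations, which is the $O(n^2)$ cost that a linear-evaluation method aims to avoid.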
[CV-15] AIGS-Net: Compact Illumination Field Modeling via 2D Gaussian Splatting for Fast Low-Light Image Enhancement
Link: https://arxiv.org/abs/2606.17998
Authors: Yuhan Chen,Kunyang Huang,Fuchen Li,Zhuohan Qin,Guofa Li,Wenbo Chu,Keqiang Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Existing low-light image enhancement methods often face a bottleneck between the representation capacity of illumination-field modeling and computational complexity. To address this issue, this paper proposes an Adaptive Illumination Gaussian Splatting Network (AIGS-Net), an ultra-lightweight architecture for fast low-light enhancement. Unlike conventional static priors, AIGS-Net constructs an input-adaptive 2D Gaussian Splatting illumination field. The opacity of Gaussian basis functions is dynamically modulated by relative luminance statistics of the input image, and spatially varying illumination compensation is rendered through ordered alpha compositing. To guide adaptive illumination compensation efficiently, a zero-parameter nonlinear multiscale contextual encoding module is introduced to extract low-frequency structures and local contrast cues without additional convolutional weights. To suppress noise amplification and sensor-induced color bias, AIGS-Net integrates noise-mask estimation, locked single-channel Gamma mapping, cross-channel consistency regularization, and target color-alignment constraints. Experiments on LOL and LSRW benchmarks show that AIGS-Net improves detail recovery and color fidelity while requiring only approximately 40 learnable parameters, achieving an effective trade-off between enhancement quality and extreme inference efficiency.
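Ordered alpha compositing of 2D Gaussians, which the abstract uses to render the illumination compensation, can be sketched per pixel as below. This is a generic front-to-back compositing example with isotropic Gaussians; the dictionary keys, the `gain` attribute, and the fixed opacities are assumptions made here, and the luminance-dependent opacity modulation described in the abstract is omitted.

```python
import math

def render_illumination(x, y, gaussians):
    """Front-to-back alpha compositing of isotropic 2D Gaussians at (x, y).

    gaussians: ordered list of dicts with keys cx, cy, sigma, opacity, gain
    (hypothetical parameterization). Returns the composited value and the
    remaining transmittance.
    """
    transmittance, value = 1.0, 0.0
    for g in gaussians:
        d2 = (x - g["cx"]) ** 2 + (y - g["cy"]) ** 2
        # Opacity falls off with distance from the Gaussian center.
        alpha = g["opacity"] * math.exp(-d2 / (2.0 * g["sigma"] ** 2))
        value += transmittance * alpha * g["gain"]
        transmittance *= 1.0 - alpha
    return value, transmittance

# Two overlapping Gaussians centered on the queried pixel.
field = [
    {"cx": 0.0, "cy": 0.0, "sigma": 1.0, "opacity": 0.5, "gain": 2.0},
    {"cx": 0.0, "cy": 0.0, "sigma": 1.0, "opacity": 0.5, "gain": 4.0},
]
value, transmittance = render_illumination(0.0, 0.0, field)
# value = 0.5*2.0 + 0.5*0.5*4.0 = 2.0, transmittance = 0.25
```

The order of the list matters: earlier Gaussians attenuate the contribution of later ones through the running transmittance.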
[CV-16] Recover Semantics First, Generate Better: Improved Latent Modeling for 3D MRI Reconstruction and Cross-Contrast Synthesis
Link: https://arxiv.org/abs/2606.17989
Authors: Yonghao Chen,Sicheng Yang,Rui Tang,Lei Zhu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Code: this https URL
Abstract:Multi-contrast magnetic resonance imaging (MRI) provides complementary information for clinical diagnosis. However, acquiring all MRI sequences is often time-consuming and costly. Recent generative models perform cross-contrast synthesis to address this issue by inferring absent contrasts from the available ones. Nevertheless, synthesizing 3D MRI presents significant challenges. Due to the massive volume sizes, operating directly in the pixel space is computationally prohibitive; therefore, a common approach is to first compress the 3D volumes into a latent space and subsequently train generative models in that space. We observe that existing compression architectures face several critical issues: they under-preserve long-range anatomical coherence, discard clinically meaningful semantics, and rely on optimization objectives that lead to over-smoothed reconstructions. Ultimately, these shortcomings compromise the performance of subsequent generative models. In this work, we propose a semantics-first latent modeling framework for 3D MRI reconstruction and cross-contrast synthesis. Specifically, we introduce a Latent Harmonization Encoder (LHE) to capture global anatomical dependencies, ensuring coherent volumetric representations. To mitigate semantic degradation during latent compression, we further design a Semantic Recovery Block (SRB) that injects high-level priors from a self-supervised semantic teacher, enhancing contrast-aware separability in the latent space. Additionally, we propose an Anatomy-aware Frequency Loss (AFL) to adaptively preserve diagnostically relevant high-frequency structures. Extensive experiments on two public multi-contrast MRI datasets demonstrate consistent improvements in reconstruction fidelity and cross-contrast synthesis quality. Our code is available at this https URL.
[CV-17] Gaussian Light Field Splatting: A Physical Prior-Driven Vision Transformer for Unsupervised Low-Light Image Enhancement
Link: https://arxiv.org/abs/2606.17985
Authors: Yuhan Chen,Wenxuan Yu,Guofa Li,Fuchen Li,Kunyang Huang,Yicui Shi,Ying Fang,Wenbo Chu,Keqiang Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Existing unsupervised low-light image enhancement methods often encounter local exposure imbalance and color distortion under complex non-uniform illumination. In addition, most Vision Transformers lack an explicit mechanism for modeling the physical priors of illumination degradation. To address these limitations, we propose GLFS, a Gaussian light field splatting-based Vision Transformer that integrates continuous physical illumination modeling from Gaussian splatting into the Transformer architecture. In GLFS, scene illumination is represented by a superposition of anisotropic Gaussian basis functions. Physics-guided biases are introduced into self-attention to adaptively infer a spatial gain field, enabling accurate and uniform restoration under complex illumination. To reduce color bias and structural degradation during enhancement, a color-vector angular loss and a luminance-edge loss are further developed. These losses enforce hue consistency and improve the structural fidelity of local details. Extensive ablation studies and quantitative evaluations show that GLFS provides clear advantages in illumination correction and detail preservation. It achieves state-of-the-art performance and offers a new representation paradigm for low-light image enhancement.
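A color-vector angular loss of the kind named above is commonly computed as the angle between two RGB vectors, which is insensitive to brightness scaling and therefore penalizes hue shifts only. The sketch below assumes that formulation and assumes the reference is some per-pixel RGB target (for instance the corresponding input pixel); the abstract does not give the exact definition used in GLFS.

```python
import math

def angular_color_loss(pred_pixels, ref_pixels, eps=1e-8):
    """Mean angle in radians between corresponding RGB vectors.

    pred_pixels, ref_pixels: equal-length lists of (r, g, b) tuples.
    The pairing of prediction and reference is an assumption of this sketch.
    """
    total = 0.0
    for p, r in zip(pred_pixels, ref_pixels):
        dot = sum(a * b for a, b in zip(p, r))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in r))
        # Clamp to the valid acos domain to absorb floating-point error.
        cos = max(-1.0, min(1.0, dot / max(norm, eps)))
        total += math.acos(cos)
    return total / len(pred_pixels)

# Brightening a pixel without changing its hue gives (near) zero loss.
same_hue = angular_color_loss([(0.2, 0.4, 0.6)], [(0.1, 0.2, 0.3)])
# Orthogonal color vectors give the maximum angle for non-negative RGB.
shifted = angular_color_loss([(1.0, 0.0, 0.0)], [(0.0, 1.0, 0.0)])
```

Because the angle ignores vector length, such a loss can be combined with separate brightness terms without the two objectives conflicting.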
[CV-18] SegDINO: Introducing Multi-Scale Structure into DINO for Efficient Medical Image Segmentation
Link: https://arxiv.org/abs/2606.17972
Authors: Sicheng Yang,Hongqiu Wang,Zhaohu Xing,Sixiang Chen,Qiuxia Yang,Yize Mao,Guang Yang,Lei Zhu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Code: this https URL
Abstract:Self-supervised DINO models provide strong transferable visual representations, yet applying them directly to image segmentation remains challenging. Existing approaches commonly rely on heavy decoders with complex upsampling, introducing substantial parameter and computational overhead. We observe that introducing scale into DINO features is far more critical than increasing decoder capacity. In this work, we present SegDINO, an efficient segmentation framework that integrates a DINOv3 backbone with lightweight scale modeling. SegDINO introduces Token Pyramid Adaptation (TPA) to reorganize intermediate DINO features into a pseudo multi-scale hierarchy, and Scale-Aware Decoding (SAD) for efficient intra-scale refinement and top-down multi-scale propagation. We further curate PanCT, a new CT dataset containing 284 patients with expert-annotated pancreatic tumors, to assess SegDINO’s ability to handle difficult small-lesion cases. Extensive experiments on PanCT and three public benchmarks demonstrate that SegDINO achieves state-of-the-art results with high efficiency. The code is available at this https URL.
[CV-19] Reload-Mamba: Hierarchical Anti-Dilution State-Space Modeling for Multi-Class Semantic Segmentation
Link: https://arxiv.org/abs/2606.17966
Authors: Sheng-Wei Chan,Hsin-Jui Pan,Jen-Shiun Chiang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 4 figures, 17 tables. Code will be released soon
Abstract:Mamba-based state space models offer linear-time long-range modeling for high-resolution dense prediction, but sequential state-space propagation can attenuate boundary-sensitive and detail-sensitive responses that are critical in multi-class semantic segmentation. We propose Reload-Mamba, a semantic segmentation framework that addresses this propagation-induced response dilution through three segmentation-specific designs: (i) a boundary-supervised local detail prior that is explicitly trained with ground-truth boundary masks to identify regions requiring response restoration; (ii) a class-uncertainty-aware Reload Gate that incorporates per-pixel class entropy from a pre-reload auxiliary head as an additional gating signal, a formulation that is informative only under multi-class dense prediction; and (iii) a hierarchical multi-level Reload mechanism that applies anti-dilution refinement at three decoder levels and fuses the restored representations top-down. Built upon a ConvNeXt-Tiny encoder with a multi-scale decoder and four-directional Mamba scanning with pixel-wise directional attention, Reload-Mamba achieves 47.9% single-scale (48.9% multi-scale) mIoU on ADE20K and 83.2% single-scale mIoU on Cityscapes. With ResNet-101 + COCO pre-training under the standard DeepLab-style protocol, Reload-Mamba reaches 87.8% mIoU on PASCAL VOC 2012 val. Controlled ablations show that each of the three segmentation-specific designs contributes beyond a direct port of the prior anti-dilution architecture proposed for binarization, cumulatively improving over the direct-port baseline by +2.2 mIoU on ADE20K.
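The per-pixel class entropy that the Reload Gate takes as a gating signal is the Shannon entropy of the softmax over class logits. The sketch below computes it for a single pixel; how the value enters the gate is not specified in the abstract, so only the entropy itself is shown and the function name is hypothetical.

```python
import math

def class_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) for one pixel."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# A pixel the auxiliary head cannot decide on: entropy is log(K) for K classes.
uncertain = class_entropy([0.0, 0.0, 0.0, 0.0])
# A confidently classified pixel: entropy is close to zero.
confident = class_entropy([20.0, 0.0, 0.0, 0.0])
```

High-entropy pixels tend to lie near class boundaries, which is consistent with the abstract's use of this signal to decide where responses need restoring.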
[CV-20] Robustness of Similarity-based Positional Encoding Under Rotations: Theoretical Analysis and Experimental Validation
Link: https://arxiv.org/abs/2606.17961
Authors: Andrea Santomauro,Luigi Portinale,Giorgio Leonardi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Positional encoding is a fundamental component of Transformer architectures, as it injects information about the spatial or sequential arrangement of inputs. Among recent alternatives to standard absolute and sinusoidal encodings, similarity-based positional encoding (simPE) has emerged as a flexible framework for representing positional structure through pairwise relations. simPE was originally designed for medical imaging applications, where geometric robustness is especially relevant: small rotations naturally arise during image acquisition, induced by imaging instruments, patient positioning, or slight acquisition misalignments. Despite its empirical promise, the theoretical behavior of simPE under geometric perturbations has not been fully characterized. In this paper, we study the robustness of simPE with respect to rotations, combining formal theoretical analysis with experimental validation. We first show that simPE is generally not rotation-invariant. We then prove that, under mild Lipschitz assumptions on the elementary components, simPE is stable under rotational perturbations and derive explicit perturbation bounds in Frobenius norm. We validate these findings experimentally on four controlled datasets: a synthetic Arrow dataset, a synthetic Shapes dataset (four geometric shape categories), a synthetic Digits dataset, and a benchmark image classification dataset (FashionMNIST). In all four, training and validation images are kept in a fixed canonical orientation while test images are subjected to increasing rotation angles. Across all datasets, simPE consistently outperforms standard learned positional encoding in terms of accuracy, F1 score, precision, and recall under rotation, particularly in the small-to-moderate angle regime, corroborating the theoretical stability guarantees.
[CV-21] Beyond Visual Cues: CoT-Enhanced Reasoning for Semi-supervised Medical Image Segmentation MICCAI2026
Link: https://arxiv.org/abs/2606.17958
Authors: Yuming Chen,Yuxin Xie,Tao Zhou,Yi Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to MICCAI 2026
Abstract:Semi-supervised medical image segmentation has emerged as a dominant research problem in medical image analysis, mitigating annotation scarcity by leveraging consistency regularization on unlabeled data. However, existing approaches operate predominantly via visual pattern matching, relying heavily on pixel-level similarities. This visual-centric dependency often falters in clinical scenarios characterized by the visual-semantic mismatch, where visually similar lesions warrant distinct diagnostic conclusions, thus failing to capture the underlying diagnostic logic used by experts. To address this, we move beyond visual cues and propose CERS (CoT-Enhanced Reasoning Segmentation), a framework that integrates Chain-of-Thought (CoT) reasoning to distinguish pathologically distinct cases. Specifically, we construct a knowledge pool enriched with linguistic reasoning descriptions generated by large language models (LLMs). A semantic-aware reference selection strategy is introduced to identify historical evidence, filtering candidates first by morphology, and then refining them via CoT consistency to eliminate hard negatives. Furthermore, a multi-scale coordinate attention module (MCAM) is designed to effectively fuse this reasoning-derived context into the decoding process. Extensive experiments demonstrate the superiority of CERS against state-of-the-art approaches, particularly in resolving boundary ambiguities and semantic inconsistencies. The code is available at this https URL.
[CV-22] MLLMs Get It Right Then Get It Wrong: Tracing and Correcting Late-Layer Textual Bias IJCAI2026
Link: https://arxiv.org/abs/2606.17953
Authors: Xingming Li,Ao Cheng,Qiyao Sun,Xixiang He,Xuanyu Ji,Runke Huang,Qingyong Hu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at IJCAI 2026. 16 pages, 10 figures
Abstract:When vision contradicts text, multimodal large language models (MLLMs) consistently favor text, even when images provide clear evidence otherwise. This bias poses risks for applications requiring visual grounding, yet its cause remains unclear. In this paper, we uncover a surprising finding: models often get it right initially, forming correct vision-based predictions in their intermediate layers, before changing their minds and favoring text in the final output. We call this “late-layer textual override”. The visual information is encoded; it simply does not survive to the output. More intriguingly, we find that how predictions change reveals whether they’re correct: 85% of failures shift toward text, while 89% of successes shift toward vision. This directional signature enables a simple but powerful intervention: when we detect a confident visual prediction being suppressed, we restore it. We propose CALRD (Conflict-Aware Layer Reference Decoding), a training-free method that recovers overridden predictions at inference time. Experiments across five MLLMs of varying architectures demonstrate up to 9.4% absolute improvements on conflict benchmarks while largely preserving standard performance, without training or external knowledge. It recovers what the model already knew but failed to preserve.
[CV-23] Plug-and-Adapt: Multimodal Coreference Resolution at First Sight with a Pretrained Alignment Model
Link: https://arxiv.org/abs/2606.17950
Authors: Jinghan Wu,Jing Li,Ivor W. Tsang,Xuetao Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Visual information helps resolve ambiguity in coreference resolution, leading to notable performance gains. However, existing Multi-modal Coreference Resolution (MCR) methods require training with (partially) annotated data from the target dataset before they can be applied, preventing their direct usability and raising concerns about generalization. While Vision-Language Large Models (VLLMs) with billions of parameters offer promising zero-shot capabilities, they remain largely inaccessible. Their massive size limits deployability, and many are only accessible through paid APIs. In this paper, we propose a plug-and-adapt method that strategically adapts a carefully pre-trained alignment model for immediate use in MCR tasks, designed to eliminate the need for training on scarce benchmark datasets or relying on resource-intensive VLLMs. Specifically, we first pre-train a fine-grained alignment model between textual and visual contextual information using vision-language alignment datasets. We then repurpose the alignment model to MCR through similarity aggregation by fusing visual and categorical cues with evidence theory, thereby enhancing effectiveness. Experiments on the Coreference Image Narratives (CIN) benchmark dataset demonstrate the effectiveness of our method, achieving a 5.31% and 2.12% improvement in CoNLL F1 over SOTA dedicated methods and popular VLLMs, respectively. We further evaluate our method on a masked CIN dataset for robustness testing and on a specially constructed VCR-MCR dataset for generalization assessment, with results confirming both capabilities.
[CV-24] MoonSplat: Monocular Online Gaussian Splatting with Sim(3) Global Optimization SIGGRAPH2026
Link: https://arxiv.org/abs/2606.17935
Authors: Guo Pu,Yixuan Han,Haofeng Li,Yao Zhang,Hui Zhou,Zhouhui Lian
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: SIGGRAPH 2026
Abstract:Online 3D reconstruction from monocular image sequences is a challenging and ongoing research topic. 3D Gaussian Splatting (3DGS), leveraging its high-quality real-time rendering capability, empowers online 3D reconstruction to represent dense scenes with enhanced expressiveness, and thus holds great promise for a wide range of applications such as robotics and AR/VR. However, existing online 3DGS methods still suffer from some key challenges: fragile camera pose estimation due to the lack of global optimization, and low optimization efficiency in large-scale or long-sequence scenarios. To address these issues, we propose a robust and efficient online voxelized 3DGS reconstruction framework integrated with global Sim(3) optimization, which enables reliable camera tracking and efficient global loop closure for both camera poses and voxelized 3DGS. To accelerate the convergence of the voxelized 3DGS, we further introduce a color residual learning strategy, which not only boosts optimization speed but also enhances rendering quality. Extensive experiments on diverse indoor and outdoor datasets demonstrate that our method achieves state-of-the-art performance in both camera pose estimation accuracy and rendering quality, while retaining real-time efficiency. Additionally, we develop and deploy a real-world UAV-based active reconstruction system grounded on our proposed method, validating its robustness and generalizability for practical online 3D reconstruction tasks. Our code and data are available at this https URL.
[CV-25] Revisiting Structural Dependency in Autoregressive Multi-Task Table Recognition via Order-Independent Cell-Level Representations ICDAR2026
Link: https://arxiv.org/abs/2606.17874
Authors: Takaya Kawakatsu
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ICDAR 2026
Abstract:Multi-task table recognition jointly addresses table structure prediction, cell localization, and cell content recognition within a unified framework. Existing approaches often rely on autoregressive decoders to generate table structures and reuse their hidden states for cell localization and content recognition. This autoregressive generation process can make cell representations order-dependent, degrading global consistency across cells. This paper proposes a structural refinement module that produces order-independent cell features through non-causal attention. This design enables parallel inference of cell contents while conditioning each cell on global context encoded in the refined features. Experiments on two large datasets demonstrate consistent gains in cell localization and end-to-end recognition, while reducing overall inference time by around threefold.
[CV-26] A Quantitative Analysis of Multimodal Biomarkers in Alzheimer’s Disease ALT
Link: https://arxiv.org/abs/2606.17867
Authors: Antonio Scardace,Daniele Ravì
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ICTS4eHealth 2026
Abstract:Despite increasing adoption of multimodal approaches in Alzheimer’s Disease (AD) research – aimed at integrating molecular, structural, clinical, and genetic biomarkers to enhance disease characterization – the relationships among these modalities remain poorly understood. A systematic analysis of their dynamic interaction is essential for improving disease modeling, identifying redundant assessments, and reducing patient burden and acquisition costs. In this paper, we present a quantitative analysis of multimodal AD biomarkers by integrating tau-PET, structural MRI, cognitive scores (MMSE and CDR), and APOE4 data from 789 subjects drawn from the ADNI dataset. In our analyses, we (A) quantify cross-modal mutual information and explained variance to assess redundancy and predictive dependencies; (B) examine associations between tau topologies and structural atrophy across brain regions to select informative ROIs; (C) perform a statistical decomposition of the tau-cognition association into atrophy-related and atrophy-independent components; and (D) identify a dominant neurodegenerative trajectory that aligns with cognitive decline. This study provides a systematic characterization of cross-modal relationships, improving the interpretability and selection of biomarkers in AD. Code is publicly available at: this https URL.
[CV-27] Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models
Link: https://arxiv.org/abs/2606.17846
Authors: Haoqi Yuan,Zhixuan Liang,Anzhe Chen,Ye Wang,Haoyang Li,Pei Lin,Yiyang Huang,Zixing Lei,Tong Zhang,Jiazhao Zhang,Jie Zhang,Jingyang Fan,Gengze Zhou,Qihang Peng,Chenxu Lv,Xiaoyue Chen,An Yang,Fei Huang,Junyang Lin,Dayiheng Liu,Jingren Zhou,Chenfei Wu,Xiong-Hui Chen
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 44 pages
Abstract:Foundation models in language and multimodality achieve strong generalization by aligning heterogeneous data under a unified formulation and training at scale. In this report, we investigate whether this scaling recipe can be applied to robotic manipulation to achieve genuine generalization. This is challenging because, unlike text, manipulation data is heterogeneous by nature, expensive to collect, and narrow in diversity, making alignment and scale simultaneously difficult. We present Qwen-RobotManip, a generalizable Vision-Language-Action foundation model built on Qwen-VL. Qwen-RobotManip introduces a unified alignment framework across the representation, motion, and behavioral dimensions of manipulation, making large-scale multi-source training coherent rather than conflicting. This alignment capability in turn enables Qwen-RobotManip to absorb manipulation data at a scale that prior training regimes could not sustain. A human-to-robot synthesis pipeline converts egocentric hand demonstrations into robot trajectories across 15 platforms, and a rigorous curation pipeline harmonizes heterogeneous datasets. Using only open-source datasets and human videos without proprietary data collection, Qwen-RobotManip constructs a ~38,100-hour pretraining corpus and exhibits emergent generalization capabilities, including zero-shot instruction following, robustness to perturbations, reactive error recovery, and cross-embodiment transfer. We find that standard benchmarks fail to capture pretraining quality and instead adopt OOD settings including RoboCasa365, LIBERO-Plus, EBench, RoboTwin-Clean2Rand, RoboTwin-IF, and RoboTwin-XE. Qwen-RobotManip substantially outperforms prior state-of-the-art models, including π0.5, across all OOD settings, ranks 1st in RoboChallenge with a 20% relative improvement, and is validated on real-robot platforms including AgileX ALOHA, Franka, UR, and ARX.
[CV-28] High-Fidelity 3D Geometric Reconstruction of Pelvic Organs from MRI: A Hybrid Deep Learning and Iterative Optimization Approach
Link: https://arxiv.org/abs/2606.17836
Authors: Hui Wang,Xiaowei Li,Chenxin Zhang,Yifan Feng,Jianwei Zuo,Yumeng Tang,Xiuli Sun,Jianliu Wang,Bing Xie,Jiajia Luo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Geometry (cs.CG); Graphics (cs.GR)
Comments:
Abstract:Patient-specific 3D reconstruction of pelvic organ geometry from MRI is important for pelvic floor modeling and downstream patient-specific analysis. However, while previous studies have focused primarily on either image segmentation or downstream use of 3D models, the reconstruction of high-fidelity, high-quality geometries remains labor-intensive and poorly standardized. The study introduced a hybrid deformable shape modeling framework that integrates deep learning prediction with iterative optimization for the reconstruction of the bladder, uterus, and rectum. The framework consists of three core components: a geometry-aware multi-level deep learning architecture that preserves topological consistency of pelvic organs; a two-stage amortized optimization training strategy that balances global shape capture and local surface refinement; and a holistic synergy mechanism–where iterative optimization provides supervision for deep learning during the training phase, and during inference, deep learning rapidly predicts the global organ morphology, followed by iterative optimization to refine local surfaces and mesh quality. This framework achieved markedly higher geometric fidelity than current mainstream deep learning-based organ reconstruction models. For individual anatomical structures, the reconstructed 3D geometries for the bladder, rectum, and uterus achieved significantly lower Chamfer Distance values and higher Dice Similarity Coefficient scores. In addition, while maintaining high computational efficiency, the proposed architecture yielded superior overall volumetric mesh quality. At the patient level, the framework achieved higher mean values for the 10 worst elements for both minSICN and minSIGE compared to traditional geometric post-processing algorithms.
[CV-29] Human-in-the-Loop Atlas-Based 3D Asset Segmentation for Interactive Content Workflows
Link: https://arxiv.org/abs/2606.17824
Authors: Paul Julius Kühn,Saptarshi Neil Sinha,Jakob Hansen,Robin Horst
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Segmenting 3D assets into meaningful regions remains challenging, especially when segmentation criteria are application-dependent and require user control. We present a human-in-the-loop pipeline for generating a segmented 2D parameterized atlas from a 3D model for interactive media, game, and XR content workflows. Our method first selects a compact set of rendered views using a greedy set cover strategy over sampled surface points, and then supports interactive segmentation of these views with SAM 2 and Label Studio. The resulting masks are back-projected onto the model’s UV parameterization to produce a unified segmented atlas that supports downstream production tasks such as segment-wise material assignment, style transfer, and semantic labeling. We assess the pipeline through a demonstration-based technical evaluation on eight cultural heritage objects. The results show that the approach can generate usable segmented atlases across diverse geometries while revealing recurring sources of manual correction, particularly fine structures, cavities, and weak appearance boundaries.
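The greedy set cover step mentioned in this abstract can be illustrated with a minimal sketch. It assumes a precomputed visibility table that maps each candidate view to the set of sampled surface points it sees; the function name `greedy_view_selection` and the view names are hypothetical, and the paper's actual scoring and stopping criteria may differ.

```python
def greedy_view_selection(visible, num_points):
    """Pick views greedily until no view adds new coverage.

    visible: dict mapping a view id to the set of surface-point indices
             seen from that view (assumed to be precomputed elsewhere).
    num_points: total number of sampled surface points.
    Returns (chosen_views, uncovered_points).
    """
    uncovered = set(range(num_points))
    chosen = []
    while uncovered:
        # Take the view that covers the most still-uncovered points.
        best = max(visible, key=lambda v: len(visible[v] & uncovered))
        gain = visible[best] & uncovered
        if not gain:
            break  # the remaining points are not visible from any view
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered


if __name__ == "__main__":
    # Toy visibility table (hypothetical data).
    views = {
        "front": {0, 1, 2, 3},
        "back": {4, 5},
        "top": {2, 3, 4},
        "side": {1, 2},
    }
    print(greedy_view_selection(views, num_points=7))
```

Greedy selection does not guarantee the smallest possible view set, but it is simple and gives the classic logarithmic approximation bound for set cover, which is usually sufficient for choosing a compact set of renders.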
[CV-30] Million-scale multimodal pollen microscopy with expert-guided foundation models
Link: https://arxiv.org/abs/2606.17809
Authors: András Biricz,Björn Gedda,Donát Magyar,Antonio Spanu,János Fillinger,Péter Pollner,István Csabai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 31 pages, 5 main figures, supplementary information included. Submitted to Scientific Reports
Abstract:Automated pollen identification from microscopy remains a bottleneck in aerobiology, palaeoecology and biodiversity monitoring, because scalable systems must generalise across specimen preparation, scanner settings and geographic origins while retaining palynological interpretability. To address this gap, we present a million-scale multimodal pollen microscopy resource, Pollen AI Atlas, assembled from pure-species whole-slide bright-field images spanning four geographic origins, four scanner settings and 46 taxon labels across 31 botanical families. Seeded by one manually selected exemplar per source slide, token-level mining and filtering produced 1,511,390 released grain detections with 99.6% proposal precision in expert-curated test regions. Each detection was paired with machine-generated grain-level morphological captions from five open-weight vision-language models, guided by expert-verified palynological anchors, yielding structured descriptions of aperture systems, wall ornamentation, shape and size. Among the evaluated models, Gemma4 provided the most controlled primary caption set, combining tight length control, no leakage and the strongest text-retrieval performance. Baseline benchmarks with frozen visual features reached 88.16% top-1 accuracy, while cross-regional retrieval showed that caption-derived text embeddings remained robust when image similarity degraded (mAP@20 0.811 versus 0.262). Released data, annotations, captions, splits, code, and weights provide a benchmark for pollen recognition, cross-regional domain adaptation and domain-specific multimodal microscopy learning.
[CV-31] MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model
Link: https://arxiv.org/abs/2606.17800
Authors: Lichen Bai,Tianhao Zhang,Shitong Shao,Dingwei Tan,Qiyu Zhong,Zhengpeng Xie,Haopeng Li,Qinghao Huang,Dandan Shen,Tengjiao Ji,Wei Wang,Peicheng Wu,Yuxuan Zhao,Xiangyu Zhu,Welly Luo,Shurui Yang,Zeke Xie
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 32 pages, 13 figures, 3 tables
Abstract:As an increasing majority of global video content is consumed on social platforms for interactive social purposes, video generation models built for social worlds are important but largely overlooked by previous studies. In this work, we define the position of social world models and build a prototype model as the first step towards this goal. While previous world models successfully simulate physical environments or gaming world exploration, they remain fundamentally detached from human-centric social dynamics. To bridge this gap as the first step to social world models, we present MaineCoon, the first real-time audio-visual autoregressive model that has 22B parameters and is capable of real-time streaming generation and sub-second interaction, with a record-breaking frame rate of up to 47.5 FPS, on a single GPU. To the best of our knowledge, MaineCoon is also the first real-time audio-visual generation model specifically optimized for social-interactive applications. To enable efficient and stable training, we introduce several novel techniques into MaineCoon, including self-resampling, cross-modal representation alignment, domain-aware preference optimization, and reinforced online-policy distillation (ROPD). We also design the first agentic streaming inference framework that supports thousand-second-scale or even longer generation while mitigating drift with agentic cache management and prompt planning. These innovations significantly accelerate training while optimizing real-time inference performance. We believe this work not only sets a new state-of-the-art (SOTA) performance benchmark for high-quality, low-latency, and long-horizon audio-visual autoregressive models, but also points out the paradigm shift desired for next-generation AI-native social platforms.
[CV-32] LiveStarPro: Proactive Streaming Video Understanding with Hierarchical Memory for Long-Horizon Streams
Link: https://arxiv.org/abs/2606.17798
Authors: Zhenyu Yang,Kairui Zhang,Bing Wang,Shengsheng Qian,Changsheng Xu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Despite the remarkable progress of Video Large Language Models (Video-LLMs), current online architectures still struggle to simultaneously process continuous video streams, decide autonomously when to respond, and preserve long-horizon contextual memory. These obstacles undermine real-time responsiveness and cause severe forgetting throughout prolonged interactions. In this work, we introduce LiveStarPro, a live streaming assistant that is designed for proactive video understanding over long-horizon streams. The design of LiveStarPro rests on three complementary components. The first component is Streaming Verification Decoding (SVeD), an inference framework that identifies the appropriate response timing through single-pass perplexity verification, thereby eliminating the dependency on explicit silence tokens. The second component is Streaming Causal Attention Masks (SCAM), a training strategy that enforces incremental video-language alignment over variable-length streams. The third component is Tree-Structured Hierarchical Memory (TSHM), a recursive memory architecture that organizes evicted historical information into event chains and consequently enables efficient retrieval from effectively unbounded video streams. To facilitate a comprehensive evaluation under realistic online conditions, we further present OmniStarPro, a large-scale benchmark that spans 15 diverse real-world scenarios and that extends to hour-scale streams for the assessment of long-term recall. Extensive experiments demonstrate that LiveStarPro consistently surpasses existing methods, attaining a 28.9% improvement in semantic correctness and an 18.2% reduction in timing error, while its streaming key-value cache further yields a 1.58x inference speedup over the same model without caching. The model and the code are publicly available at this https URL.
[CV-33] BrainWorld: A Structural-Prior-Conditioned Generative Model for Whole-Brain 4D fMRI Dynamics
Link: https://arxiv.org/abs/2606.17742
Authors: Junfeng Xia,Wenhao Ye,Junxiang Zhang,Xuanye Pan,Mo Wang,Quanying Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
Comments:
Abstract:Whole-brain 4D fMRI generation is valuable for modeling functional brain dynamics, yet existing fMRI foundation models mainly target representation learning and downstream prediction rather than conditional predictive generation. We introduce BrainWorld, a structural-prior-conditioned generative model for whole-brain 4D fMRI dynamics. BrainWorld uses sMRI as subject-level anatomical context to guide future fMRI generation, integrating structural information into the denoising process rather than treating it as a parallel modality. Evaluated on 22 datasets spanning diverse cohorts and brain states, BrainWorld generates stable 4D fMRI trajectories up to 400 frames, improves downstream performance through generated-example augmentation, and learns transferable multimodal representations that outperform baselines. Together, these results establish BrainWorld as a condition-aware generative framework for long-horizon brain dynamics modeling and multimodal representation learning.
[CV-34] ActWorld: From Explorable to Interactive World Model via Action-Aware Memory
Link: https://arxiv.org/abs/2606.17730
Authors: Zhexiao Xiong,Yizhi Song,Hao Kang,Qing Yan,Liming Jiang,Jenson Yang,Zhoujie Fu,Stathi Fotiadis,Angtian Wang,Zichuan Liu,Bo Liu,Yiding Yang,Xin Lu,Nathan Jacobs
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:Interactive world models aim to simulate environment dynamics under real-time user actions. However, their action vocabulary is largely confined to navigation: most actions correspond to motion (e.g., walk, turn, look around), while interaction with objects in the scene (e.g., pick up plates, open doors, or trigger physical responses) is either absent, restricted to game domains, or relegated to prompt-to-full-video scenarios. The resulting worlds are visually explorable but not truly actionable. In this work, we present ActWorld, an interactive world model that extends prior navigation-centric generators to support mid-rollout object interaction within a chunk-autoregressive framework. We argue that the navigation-interaction gap stems from two bottlenecks. First, a data bottleneck: the lack of human-object interaction data with accurate, dense labels. Second, a memory bottleneck: recency-biased history compression in existing world models discards the event-transition frames that causally determine subsequent object states, leading to an action-forgetting pathology. On the data side, we construct a 100K interaction video dataset, each annotated with per-chunk captions via chain-of-thought reasoning. On the model side, we introduce a hierarchical action-aware memory design that routes history compression by interaction importance, complemented by a persistent memory bank that maintains event-update and object-identity tokens across long rollouts. Experiments show that ActWorld supports both flexible navigation and rich object interaction within a single model, substantially improving interaction fidelity over navigation-only baselines without sacrificing viewpoint control. Project page is available at this https URL.
[CV-35] GSPan: A Continuous Gaussian Primitive Representation for Arbitrary-Scale Pansharpening
Link: https://arxiv.org/abs/2606.17722
Authors: Fangyi Li,Xiaoyuan Yang,Yixiao Li,Zongyang Sui,Kangqing Shen,Gemine Vivone
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Pansharpening aims to generate high-resolution multispectral (HRMS) images by fusing low-resolution multispectral (LRMS) and panchromatic (PAN) observations. Most existing deep learning methods treat pansharpening as fixed-grid prediction, which limits scale adaptation. To address this, we propose GSPan, a framework that introduces 2D Gaussian Splatting (GS) into pansharpening. Instead of directly predicting pixels, GSPan represents band-wise residual details as continuous and learnable 2D Gaussian primitives. We design a Dual-Stream Hierarchical Interaction (DSHI) architecture with a Spatial-Spectral Interactive Attention (SSIA) module to estimate these primitives from complementary PAN and MS observations. The predicted primitives are rendered as a residual detail field and injected into the upsampled MS image. This continuous representation allows GSPan to render fused images on arbitrary target sampling grids without scale-specific retraining. It further enables a Scale-Decoupled Asymmetric Inference (SDAI) strategy, which estimates primitives at a reduced resolution and renders the fused image at the target resolution for efficient large-scene pansharpening. Experiments on QuickBird, GaoFen-2, WorldView-3, and WorldView-3-4K datasets show that GSPan delivers state-of-the-art fusion performance. Moreover, SDAI markedly accelerates inference, achieving a favorable trade-off between computational efficiency and fusion quality. Our results demonstrate the potential of continuous Gaussian residual representations as a flexible and scale-decoupled alternative to fixed-grid prediction.
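To make the idea of a continuous, resolution-independent residual field concrete, here is a small NumPy sketch. It is an illustration only: it assumes isotropic Gaussians defined in normalized [0, 1] image coordinates with one amplitude per spectral band, and the name `render_gaussians` is hypothetical. The primitives in the paper are predicted by a network and may carry additional parameters such as covariance or rotation.

```python
import numpy as np


def render_gaussians(centers, sigmas, amplitudes, height, width):
    """Render isotropic 2D Gaussian primitives onto an arbitrary pixel grid.

    centers:    (N, 2) array of (x, y) positions in normalized [0, 1] coordinates.
    sigmas:     (N,) array of standard deviations in the same normalized units.
    amplitudes: (N, B) array, one amplitude per primitive and spectral band.
    Returns a (B, height, width) residual detail field.
    """
    # Pixel centers in normalized coordinates, independent of the target size.
    ys = (np.arange(height) + 0.5) / height
    xs = (np.arange(width) + 0.5) / width
    gx, gy = np.meshgrid(xs, ys)  # each has shape (height, width)

    dx = gx[None] - centers[:, 0, None, None]
    dy = gy[None] - centers[:, 1, None, None]
    weights = np.exp(-(dx**2 + dy**2) / (2.0 * sigmas[:, None, None] ** 2))

    # Sum the contribution of every primitive for every band.
    return np.einsum("nb,nhw->bhw", amplitudes, weights)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    centers = rng.random((32, 2))
    sigmas = np.full(32, 0.05)
    amplitudes = rng.normal(size=(32, 4))
    # The same primitives can be rendered at any target resolution.
    print(render_gaussians(centers, sigmas, amplitudes, 64, 64).shape)
    print(render_gaussians(centers, sigmas, amplitudes, 256, 256).shape)
```

Because the primitives live in continuous coordinates, changing the output scale only changes the sampling grid, which is the property that allows rendering at scales not seen during training.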
[CV-36] Heterogeneous SAR-optical fusion for near-real-time land use and land cover mapping under cloud contamination: A novel framework and global benchmark dataset
Link: https://arxiv.org/abs/2606.17713
Authors: Jiangong Xu,Weibao Xue,Xiaoyu Yu,Jun Pan,Xinlian Lianga,Mi Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Optical remote sensing imagery is frequently degraded by cloud and cloud-shadow contamination, which limits its reliability for near-real-time land use and land cover (LULC) mapping. Although synthetic aperture radar (SAR) can provide cloud-penetrating structural information, existing SAR-optical fusion methods often assume reliable optical observations and insufficiently address the semantic uncertainty introduced by cloud contamination. To address this issue, we propose CloudLULC-Net, an end-to-end heterogeneous SAR-optical fusion framework that directly predicts LULC maps from cloud-contaminated Sentinel-2 imagery and temporally adjacent Sentinel-1 SAR observations. The proposed network incorporates optical reliability modulation to suppress unreliable optical responses, heterogeneous information adaptive aggregation to model high-order spatial-channel interactions between optical and SAR representations, and a unified semantic mapping transformer to organize fused features in a LULC-oriented latent space. A semantic anchor-guided optimization strategy is further introduced to improve the consistency of intermediate semantic representations. To support this task, we construct CloudLULC-Set, a large-scale benchmark dataset containing 40,223 curated SAR-optical-label triplets with pixel-level LULC annotations across diverse geographic regions and cloud conditions. Experimental results show that CloudLULC-Net achieves an OA of 86.60%, an F1-score of 83.29%, and an mIoU of 73.51%, outperforming representative heterogeneous reconstruction-first and end-to-end SAR-optical mapping methods. Comparisons with existing global LULC products and analyses under different cloud-cover levels further demonstrate the robustness and practical value of CloudLULC-Net for target-date LULC mapping in cloud-prone regions. The project is publicly available at: this https URL
[CV-37] Structured Adversarial Camouflage via Voronoi Diagrams
Link: https://arxiv.org/abs/2606.17711
Authors: Jens Bayer,Stefan Becker,David Münch,Michael Arens,Jürgen Beyerer
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Pixel-wise adversarial patches are computationally heavy and often visually detectable, limiting utility in security-critical systems. We present adversarial Voronoi camouflage that optimizes only seed-point locations under fixed, printable palettes using a soft assignment, producing structured, splinter camouflage-like patterns without additional regularization. Evaluated on person detection with COCO-style AP@[.5:.95], naive placement (Inria - COCO) performs comparably poorly, while garment-level application via segmentation mask (3DPeople) results in a significant AP drop. The attack transfers to out-of-domain backgrounds and across detector families (YOLOv9/10/11/12), indicating robustness in black-box settings. Repainting with different palettes largely nullifies the effect, and single-color tweaks show limited tolerance (=0.17), highlighting a structure-palette coupling. The parameter-efficient, palette-constrained design improves visual plausibility while degrading real-time detector performance. Physical validation and color calibration are left for future work. Code: this https URL. This paper was originally presented at the International Conference on Military Communication and Information Systems (ICMCIS), organized by the Information Systems Technology (IST) Scientific and Technical Committee, IST-224-RSY - the ICMCIS, held in Bath, United Kingdom, 12-13 May 2026.
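A soft Voronoi assignment of the kind this abstract describes can be sketched as follows. This is an assumption-laden illustration, not the authors' implementation: it uses a softmax over negative squared distances with a hypothetical `temperature` parameter, binds one fixed palette color to each seed, and omits the gradient-based optimization of seed positions against a detector loss.

```python
import numpy as np


def soft_voronoi(seeds, colors, size, temperature=1e-3):
    """Render a soft Voronoi pattern from seed points and a fixed palette.

    seeds:  (K, 2) array of (x, y) seed positions in normalized [0, 1] coordinates.
    colors: (K, 3) array with the fixed palette color bound to each seed.
    size:   side length of the square output image in pixels.
    temperature: softness of the assignment (assumed parameter); smaller
                 values approach a hard nearest-seed Voronoi diagram.
    Returns a (size, size, 3) image.
    """
    coords = (np.arange(size) + 0.5) / size
    gx, gy = np.meshgrid(coords, coords)
    pixels = np.stack([gx, gy], axis=-1)  # (size, size, 2)

    # Squared distance from every pixel to every seed: (size, size, K).
    d2 = ((pixels[:, :, None, :] - seeds[None, None, :, :]) ** 2).sum(axis=-1)

    # Numerically stable softmax over seeds.
    logits = -d2 / temperature
    logits -= logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)

    # Each pixel is a convex combination of the palette colors.
    return weights @ colors


if __name__ == "__main__":
    seeds = np.array([[0.25, 0.5], [0.75, 0.5], [0.5, 0.1]])
    palette = np.array([[0.2, 0.3, 0.1], [0.5, 0.4, 0.3], [0.1, 0.1, 0.1]])
    print(soft_voronoi(seeds, palette, size=64).shape)
```

The soft assignment keeps the image differentiable with respect to the seed positions, which is what would let an optimizer move the seeds while the palette itself stays fixed.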
[CV-38] SegTME-UNI2: A Foundation Model-Based Framework for Generalisable Multiclass Cell Segmentation and LLM-Driven Tumour Microenvironment Characterisation in Histopathology
Link: https://arxiv.org/abs/2606.17702
Authors: Wan Siti Halimatul Munirah Wan Ahmad,Faris Syahmi Samidi,Mohammad Badal Ahmmed,Vimal Angela Thiviyanathan,Selvam James Thavaraj,Anwar P.P. Abdul Majeed
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract: Characterising the tumour microenvironment (TME) from routine H&E-stained histology images requires simultaneous cell segmentation, feature extraction, and interpretable clinical reporting. We present SEGTME-UNI2, a unified framework addressing these requirements. Its core is UNI2-UPERHOVER, a dual-head segmentation model pairing the UNI2-H pathology foundation model (ViT-Giant, pretrained on 100M tiles from 100K slides) with two parallel UperNet decoders: one for six-class semantic segmentation and one for horizontal-vertical gradient regression enabling watershed-based nuclear instance separation. To address the lack of pixel-level annotations in large real-world repositories, UNI2-UPERHOVER undergoes a three-stage progressive pseudo-label curriculum. Each stage trains a fresh model without weight transfer, driving improvement entirely via increased pseudo-label quality. Stage 1 uses human-annotated PanNuke (7,901 images, 189,744 nuclei, 0.25 µm/pixel). Stage 2 uses entropy-filtered pseudo-labels from the Stage 1 model on 271,711 TCGA-UT scale-0 patches (0.5 µm/pixel). Stage 3 uses pseudo-labels from the Stage 2 model on all 1,608,060 TCGA-UT patches across six resolution scales (0.5-1.0 µm/pixel). Segmentation outputs feed a structured TME feature extraction pipeline computing 20+ per-patch compositional, morphological, spatial entropy, and intercellular distance metrics. These are encoded as JSON and passed to a fine-tuned NVIDIA BioNeMo GPT model to generate clinically interpretable TME narratives. Preliminary validation on held-out PanNuke and TCGA-UT partitions demonstrates framework feasibility and internal consistency. The pseudo-labelled TCGA-UT dataset and UNI2-UPERHOVER checkpoint are publicly released to support large-scale TME profiling and spatial biology research.
[CV-39] See First Answer Later: Visual Evidence Pre-Alignment via Sufficiency-Driven RL
Link: https://arxiv.org/abs/2606.17678
Authors: Yilian Liu, Sicong Leng, Guoshun Nan, Junyi Zhu, Jiayu Huang, Minghao Sun, Xuancheng Zhu, Yisong Chen, Zexian Wei, Xiaofeng Tao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract: Multimodal large language models (MLLMs) integrate strong text reasoning with visual inputs, yet their responses can be inconsistent with the underlying images, indicating ineffective utilization of visual evidence during inference. The prevailing training paradigm relies on large-scale caption-based pretraining for general alignment, followed by supervised fine-tuning and reinforcement learning to enable instruction following and complex reasoning. However, such pretraining provides only weak visual grounding: short, coarse captions bias models toward salient objects while neglecting fine-grained visual evidence. In this paper, we introduce Visual Evidence Pre-Alignment (VEPA), an intermediate stage between pretraining and post-training that explores a novel sufficiency-driven objective with Group Relative Policy Optimization (GRPO) to optimize question-conditioned visual evidence descriptions. Extensive experiments across diverse benchmarks show that VEPA consistently enhances performance on visually demanding evaluations and complements standard supervised post-training. Further analyses show that the gain stems from strengthened, transferable visual grounding, rather than from additional task-specific training.
[CV-40] Do We Really Need Diffusion? A Fast U-Net for Paired Medical Image Translation
Link: https://arxiv.org/abs/2606.17675
Authors: Alicia Pirwass, Birte Glimm, Michael Munz, Hans-Joachim Wilke
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Magnetic resonance imaging-signal fat fraction (MRI-SFF) quantifies tissue fat and serves as an established biomarker for metabolic and musculoskeletal disorders. The acquisition requires, however, specialized MRI sequences, which are not available routinely. We investigate whether SFF can be estimated from widely available T2-weighted (T2w) MRI via image-to-image translation (I2I). We further compare a lightweight 4-level U-Net to a state-of-the-art Denoising Diffusion Probabilistic Model (DDPM) using a dataset of 230 048 paired 2D images (183 517 train, 23 621 val, 22 910 test) from the German National Cohort (NAKO). Both models clearly outperform the identity baseline (Pearson correlation r = 0.769, mean absolute error MAE = 0.070 +/- 0.054), which confirms that the models learn a non-trivial cross-modal mapping. Interestingly, the lightweight U-Net outperforms the DDPM in both correlation (r = 0.975 vs. 0.962) and error (MAE = 0.014 +/- 0.015 vs. 0.019 +/- 0.019), while reducing inference time by a factor of 208 (25.2 ms vs. 5 227.2 ms per image using 50 Denoising Diffusion Implicit Model (DDIM) steps). The strong clinical performance at substantially reduced computational cost enables real-time clinical use.
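For readers who want to reproduce the kind of comparison reported above, the two metrics (Pearson correlation r and mean absolute error) can be sketched as follows. The helper names `pearson_r` and `mae` are illustrative, and the "identity baseline" is interpreted here as scoring the unmodified T2w input against the SFF target, which assumes both images share the same intensity range; the paper may define it differently.

```python
import numpy as np

def pearson_r(pred, target):
    """Pearson correlation between two arrays, computed over all pixels."""
    pred = np.asarray(pred, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    pc = pred - pred.mean()
    tc = target - target.mean()
    return float((pc * tc).sum() / np.sqrt((pc ** 2).sum() * (tc ** 2).sum()))

def mae(pred, target):
    """Mean absolute error between two arrays of the same shape."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean(np.abs(pred - target)))

# Toy data (made up for illustration, not from the paper).
target = np.array([[0.0, 2.0], [4.0, 6.0]])
prediction = np.array([[0.0, 1.0], [2.0, 3.0]])
r_value = pearson_r(prediction, target)
mae_value = mae(prediction, target)
```

The toy prediction is perfectly correlated with the target yet off by a scale factor, which shows why the abstract reports both numbers: r measures agreement in pattern, MAE measures agreement in magnitude.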
[CV-41] Bounding Box Label Propagation for Re-Annotation of Document Layout Analysis Datasets ICDAR2026
Link: https://arxiv.org/abs/2606.17644
Authors: Nick Jochum, Tobias Alt-Veit, Christian Schön, Alexander Lück, René Schuster, Didier Stricker
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 17 pages, 3 figures, to appear in proceedings of ICDAR 2026, Vienna, Austria
Abstract:Datasets in practical document processing scenarios typically grow over time, and their class annotations undergo continuous refinement. This creates significant re-annotation efforts, which are time-consuming and costly. A promising remedy is to re-annotate only a small subset of available documents manually and apply semi-supervised learning techniques that leverage both labelled and unlabelled data. Although there are numerous approaches to tackle this problem for classification, there exists no adaptation for the problem of re-classifying object detection instances, e.g. for document layout analysis. To this end, we propose Bounding Box Label Propagation (BBLP), a pseudo-labelling framework for object detection. An object encoder integrates visual, textual, and positional embeddings from object detection samples to come up with a joint embedding that can be used for Label Propagation on partially annotated datasets in a plug-and-play fashion. Evaluation results indicate that the proposed approach produces high-quality class annotations of bounding boxes. In the D4LA layout analysis dataset, it achieves a mAP of 54.0%, corresponding to 81.6% of fully supervised performance, while using only 10% labelled data. Our work demonstrates the potential of Label Propagation for object detection and lays the groundwork for reducing manual annotation efforts in real-world document processing applications.
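To make the Label Propagation step concrete, here is a minimal sketch of classic graph-based propagation applied to per-object embeddings. It uses the standard closed-form solution with a symmetrically normalized RBF affinity matrix; the function name `propagate_labels`, the dense RBF graph, and the `alpha` and `sigma` values are assumptions for illustration, and the embeddings stand in for the joint visual, textual, and positional embedding that the paper's object encoder would produce.

```python
import numpy as np

def propagate_labels(embeddings, labels, num_classes, alpha=0.9, sigma=1.0):
    """Spread known class labels over a similarity graph (hypothetical sketch).

    embeddings: (N, D) array, one vector per bounding box.
    labels:     (N,) int array, class index for labelled boxes, -1 otherwise.
    Returns an (N,) array of predicted class indices.
    """
    x = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    # Dense RBF affinities with zero self-similarity.
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(w, 0.0)
    # Symmetric normalization: S = D^{-1/2} W D^{-1/2}.
    deg = w.sum(axis=1)
    s = w / np.sqrt(np.outer(deg, deg))
    # One-hot matrix of the known labels; unlabelled rows stay zero.
    y = np.zeros((len(x), num_classes))
    known = labels >= 0
    y[known, labels[known]] = 1.0
    # Closed-form fixed point of F = alpha * S @ F + Y (up to a constant factor).
    f = np.linalg.solve(np.eye(len(x)) - alpha * s, y)
    return f.argmax(axis=1)

predicted = propagate_labels(
    embeddings=[[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]],
    labels=[0, -1, 1, -1],
    num_classes=2,
)
```

In the toy call, two of the four boxes are labelled and each unlabelled box inherits the class of its nearby neighbour, which is the plug-and-play behaviour the abstract describes for partially re-annotated datasets.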
[CV-42] ERQA-Plus: A Diagnostic Benchmark for Reasoning in Embodied AI NEURIPS
Link: https://arxiv.org/abs/2606.17639
Authors: Hong Yang, Basura Fernando
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review at NeurIPS
Abstract: Generalist embodied agents require more than object recognition: they must reason about spatial relations, actions, procedures, human intentions, environmental constraints, and commonsense consequences from situated visual observations. Yet existing visual and embodied question answering benchmarks often provide limited control over the reasoning dependencies being tested, making it difficult to distinguish grounded embodied reasoning from shortcut-driven visual or linguistic pattern matching. We present ERQA-Plus, a diagnostic benchmark for reasoning in embodied AI. ERQA-Plus contains 1,766 question-answer instances grounded in 711 robot-centric images and organized according to a structured taxonomy spanning perceptual, action-centric, social-interaction, navigation-environmental, and contextual commonsense reasoning. The dataset is constructed using a multi-stage generation and validation pipeline that combines taxonomy-guided question generation, automatic quality judging, iterative revision, and human assessment to improve visual grounding, answer validity, and reasoning quality. We benchmark representative general-purpose vision-language models and embodied models, including LLaVA-NeXT-8B, Prismatic-7B, MiniCPM-V-4.5-8B, Qwen3-VL, RoboRefer-8B, and RoboBrain2.5-8B. Although the strongest model, Qwen3-VL-32B, achieves 83.4% overall accuracy and 61.4 SBERT score, category-level results reveal persistent weaknesses in spatial reasoning, procedural reasoning, event prediction, and intention inference. ERQA-Plus therefore provides a fine-grained evaluation framework for measuring not only whether embodied agents answer correctly, but also which forms of embodied reasoning they can and cannot perform reliably. The dataset is available at this https URL and the project page is at this https URL.
[CV-43] Divide, Deliberate, Decide: A Multi-Agent Framework for Fine-Grained Egocentric Action Recognition
Link: https://arxiv.org/abs/2606.17627
Authors: Alessandro Sottovia, Alessandro Torcinovich, Oswald Lanz
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract: Fine-grained action recognition in egocentric video is challenging for Vision-Language Models (VLMs): actions often differ only in small visual cues, and a single model tends to be biased toward a subset of these cues. We propose Divide, Deliberate, Decide, a fully-local, zero-shot multi-agent framework in which (i) a VLM orchestrator chunks the video and proposes a top-k candidate label list per segment, (ii) an ensemble of heterogeneous VLM specialists, drawn from different open model families, engages in a structured deliberation that includes a peer-consultation round of questions, and (iii) agent rankings are aggregated with a Borda count and the orchestrator re-ranks its own prediction in light of the specialists’ evidence. The entire pipeline runs locally with no fine-tuning. Experiments show that our method improves zero-shot action recognition performance over the baseline and highlight the influence of the heterogeneous deliberation step: the gain stems from decorrelated model priors rather than from additional compute.
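Step (iii) of the abstract aggregates the specialists' rankings with a Borda count. A minimal sketch of that voting rule is shown below; the function name `borda_count`, the alphabetical tie-break, and the toy action labels are assumptions for illustration, and the paper's exact scoring variant may differ.

```python
from collections import defaultdict

def borda_count(rankings):
    """Aggregate ranked label lists with a Borda count.

    Each ranking lists labels from most to least preferred. In a list of
    length n, the label at position i earns n - 1 - i points. Labels are
    returned by descending total score; ties are broken alphabetically
    (an arbitrary choice made here for determinism).
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, label in enumerate(ranking):
            scores[label] += n - 1 - position
    return sorted(scores, key=lambda label: (-scores[label], label))

# Three hypothetical specialists ranking the same top-3 candidate list.
specialist_rankings = [
    ["cut", "slice", "peel"],
    ["slice", "cut", "peel"],
    ["slice", "peel", "cut"],
]
consensus = borda_count(specialist_rankings)
```

In the toy example "slice" collects 5 points, "cut" 3, and "peel" 1, so the consensus order differs from the first specialist's individual ranking. A positional rule like this rewards labels that many agents rank highly even when they disagree on the top choice.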
[CV-44] RAVA: Retrieval-Augmented Viewpoint Alignment for Subject-Driven Image Generation
Link: https://arxiv.org/abs/2606.17619
Authors: Qiwei Yan, Zhiqiang Yuan, Chongyang Li, Jiapei Zhang, Ying Deng, Jinchao Zhang, Jie Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Reference-driven image generation has made rapid progress on identity preservation, but reliable viewpoint control across different subjects remains poorly understood. The difficulty is not merely generating a new image of the target subject: the model must infer the implicit viewpoint of one subject and transfer it to another subject using only image-level evidence, without camera poses, depth, or ray-based conditions. In this setting, existing generators conditioned on multiple image references often rely on spurious semantic correlations, which lead to viewpoint drift, part-level structural mismatches, and missing or unsupported target-specific content. We formulate this challenge as cross-subject viewpoint alignment and propose RAVA, a retrieval-augmented framework that supplies explicit geometric evidence before generation. RAVA first learns a cross-instance viewpoint embedding that retrieves target-subject images aligned with the anchor viewpoint, then applies a LogDet-based subset selection strategy to retain a compact reference set that is both view-consistent and structurally complementary. The selected references are finally consumed by a fine-tuned multi-reference image generator. Experiments show that generic semantic embeddings are nearly random for this task, while the proposed retriever substantially improves viewpoint retrieval quality. On cross-subject generation, RAVA consistently outperforms zero-shot baselines and stronger retrieval alternatives under the same generation backbone. These results indicate that cross-subject viewpoint alignment benefits from retrieval-augmented geometric grounding rather than relying on end-to-end generation alone.
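The abstract mentions a LogDet-based subset selection that keeps references which are both view-consistent and structurally complementary, without giving the objective. One common way to realize such a trade-off is greedy maximization of the log-determinant of a relevance-weighted similarity kernel, as in determinantal point processes. The sketch below follows that generic recipe; the kernel construction, the function name `greedy_logdet_select`, and the `eps` regularizer are assumptions, not the method from the paper.

```python
import numpy as np

def greedy_logdet_select(features, relevance, k, eps=1e-6):
    """Greedily pick k items maximizing log det of a quality-weighted kernel.

    features:  (N, D) array of candidate embeddings.
    relevance: (N,) array of positive scores (e.g. viewpoint similarity
               to the anchor); higher means more view-consistent.
    Redundant items make the kernel nearly singular, so they are penalized.
    """
    features = np.asarray(features, dtype=float)
    relevance = np.asarray(relevance, dtype=float)
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    # L_ij = q_i * cos(e_i, e_j) * q_j
    kernel = relevance[:, None] * (unit @ unit.T) * relevance[None, :]
    selected = []
    for _ in range(min(k, len(features))):
        best_idx, best_val = None, -np.inf
        for i in range(len(features)):
            if i in selected:
                continue
            idx = selected + [i]
            sub = kernel[np.ix_(idx, idx)] + eps * np.eye(len(idx))
            sign, logdet = np.linalg.slogdet(sub)
            if sign > 0 and logdet > best_val:
                best_idx, best_val = i, logdet
        if best_idx is None:
            break
        selected.append(best_idx)
    return selected

chosen = greedy_logdet_select(
    features=[[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]],
    relevance=[1.0, 0.9, 0.8],
    k=2,
)
```

In the toy call the second candidate is more relevant than the third but almost duplicates the first, so the selection skips it in favour of the complementary third candidate. That is the qualitative behaviour the abstract attributes to its compact reference set.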
[CV-45] SkillMoV: Mixture-of-View Routing with Prototype-Conditioned Gating for Unified Multi-View Proficiency Estimation
Link: https://arxiv.org/abs/2606.17615
Authors: Edoardo Bianchi, Antonio Liotta
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Estimating human proficiency from video is a key challenge for automated skill assessment, with applications in sports coaching, music pedagogy, surgical training, and workplace learning. Existing approaches often focus on individual scenarios or rely on shared multi-view aggregation, limiting their ability to adapt to heterogeneous camera viewpoints and activity domains. We introduce SkillMoV, a unified, parameter-efficient framework for multi-scenario proficiency estimation from synchronized multi-view video. At its core, SkillMoV introduces a Mixture-of-View Projector (MoVP), which adapts the mixture-of-experts paradigm to camera-specific view features. MoVP is composed of four stages: (i) a Mixture-of-View soft router with twelve expert MLPs that learns view-dependent expert preferences without camera-identity supervision; (ii) cross-view attention to align synchronized cameras; (iii) learnable prototype anchoring to condition the representation on class-level reference vectors; and (iv) a prototype-conditioned gated projection that produces the final skill embedding. We evaluate SkillMoV on EgoExo4D across six skill domains and three separately trained view configurations: Ego, Exos, and Ego+Exos. SkillMoV reaches 50.17% overall accuracy in the Exos setting with a single model trained jointly across all scenarios, surpassing the strongest reported Exos result among the compared methods by 3.57 percentage points. In Ego+Exos, SkillMoV remains close to the best reported result in that setting (47.63% versus 48.20%). Ablations on the selected Exos configuration validate each component: MoV routing contributes +6.61 pp over attentive aggregation, cross-view attention +4.92 pp, prototype anchoring +4.07 pp, and stochastic view dropout +3.90 pp. Through LoRA adaptation, SkillMoV trains only 23.32% of its parameters and adds limited measured overhead relative to a LoRA-only baseline.
[CV-46] Flux-Guard: Facial Identity Protection using diffusion models
Link: https://arxiv.org/abs/2606.17606
Authors: Jie Wang, Tao Wang, Ru Zhang, Jianyi Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract: The widespread deployment of face recognition (FR) systems exposes personal images shared on social media and public platforms to identity linkage and privacy risks. Existing adversarial privacy protection methods can degrade unauthorized FR performance but are not compatible with generative face editing. Artificial intelligence-driven face editing tools are gaining popularity, which has significantly increased user demand for personalized portrait generation and social sharing. However, current editing methods often preserve identity features, making the edited images still susceptible to tracking by malicious FR systems. Thus, this paper proposes Flux-Guard, a privacy-preserving face editing framework based on adversarial attacks, which integrates face editing and privacy protection within a unified generative process. Specifically, we design a flow trajectory control method to align semantic manipulations with the generative process and introduce latent-space adversarial optimization with an adaptive perceptual-loss-driven weighting strategy, dynamically adjusting adversarial strength to maximize attack effectiveness while preserving visual quality. Extensive experiments demonstrate that Flux-Guard supports face editing while significantly improving attack success rates against cross-domain face recognition models on the CelebA-HQ and LADN datasets. Furthermore, evaluation results for commercial APIs have confirmed its effectiveness in real-world applications. The code is released at this https URL.
[CV-47] Test-Time Training for Robust Text-Guided Open-Vocabulary Object Counting
Link: https://arxiv.org/abs/2606.17601
Authors: Hao-Yuan Ma, Yuda Zou, Li Zhang, Yongchao Xu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Text-guided Open-vocabulary Object Counting (TOOC) enables counting arbitrary object categories specified by text prompts, offering substantially greater flexibility than conventional closed-set counting. However, existing TOOC methods are developed and evaluated primarily on ideal images, while real-world scenes often suffer from adverse conditions such as rain, fog, darkness, and sensor noise, which severely degrade visual quality and impair vision-language alignment. To bridge this gap, we introduce Robust-TOOC, the first benchmark for evaluating TOOC under diverse corruption conditions, which covers six representative degradation types: rain, fog, darkness, Gaussian noise, salt-and-pepper noise, and mixed corruption. To improve robustness while preserving the original counting architecture, we propose Dual-TTT, a dual-architecture test-time training framework for TOOC. Specifically, during test-time training, Dual-TTT updates only the Text-guided Lightweight Denoising module (TL-Denoiser), while keeping the original counting network frozen. Inspired by diffusion models, the TL-Denoiser is optimized to remove corruption-aware noise from image representations under degraded conditions. Since only the TL-Denoiser is trained at test time, Dual-TTT is annotation-free and can be seamlessly integrated into existing TOOC models without modifying their original architecture. Extensive experiments on multiple recent TOOC baselines demonstrate the effectiveness of our method.
[CV-48] MuseVLA: An Adaptive Multimodal Sensing Vision-Language-Action Model for Robotic Manipulation
Link: https://arxiv.org/abs/2606.17598
Authors: Xingyuming Liu, Ruichun Ma, Heyu Guo, Qixiu Li, Qingwen Yang, Lin Luo, Shiqi Jiang, Chenren Xu, Jiaolong Yang, Baining Guo
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Humans naturally leverage diverse sensing modalities to interact with the physical world, while most Vision-Language-Action (VLA) models for robotics rely solely on RGB observations. This limits their ability to perceive physical properties that are difficult or impossible to infer from RGB cameras, such as temperature, sound, or radar response. We present MuseVLA, an adaptive multimodal sensing VLA model that integrates novel sensors as on-demand tools for robotic manipulation. Given a task instruction and visual context, MuseVLA first generates a sensor token and target description that select the sensing modality to invoke and what to attend to, analogous to a tool call with arguments. It then converts the selected sensor measurement into a grounded sensor image, a unified intermediate representation that encodes heterogeneous readings for multimodal fusion and action generation. This design decouples sensor-specific processing from the VLA backbone, enabling efficient integration of diverse modalities. To reduce the need for expensive multisensory robot datasets, we further introduce a data synthesis pipeline that augments existing RGB video datasets with grounded sensor images, enabling generalization to unseen sensor-guided tasks. We evaluate MuseVLA on a real-world robot across challenging dexterous hand manipulation tasks that require multimodal sensing inputs, including temperature-guided pick-and-place, audio-driven object search, and radar-assisted hidden object retrieval. MuseVLA achieves 80.6% success rate on average, outperforming RGB-only and multisensory VLA baselines significantly, and exhibits strong zero-shot capabilities on unseen tasks.
[CV-49] TivTok: Broadcasting Time-Invariant Tokens for Scalable Video Tokenization
Link: https://arxiv.org/abs/2606.17590
Authors: Weiliang Chen, Yuanhui Huang, Xuebo Wang, Yueqi Duan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract: Video tokenization is fundamental to scalable video generation, as the number of tokens directly determines the computational cost and the length of videos that can be modeled. Existing tokenizers mainly improve scalability by compressing videos into fewer tokens, but they often continue to represent persistent content, such as static backgrounds and consistent object appearances, repeatedly across frames and chunks. In this paper, we propose TivTok (Time-Invariant Tokenizer), a reuse-aware video tokenizer that makes persistent information reusable across time. TivTok represents a clip with Time-Invariant (TIV) tokens that encode information shared across frames and Time-Variant (TV) tokens that encode frame-specific residuals. To obtain this factorization, we introduce Scope-Induced Factorization (SIF), which assigns different attention scopes to the two token groups: TIV tokens attend to the full clip, whereas each TV token only accesses its corresponding frame together with the TIV tokens. In the decoder, Invariant Broadcasting (IB) reuses the same TIV tokens across frames and chunks for parallel reconstruction and long-video tokenization. Experiments show that TivTok achieves an rFVD of 12.65 on the standard 16×256×256 benchmark and improves compression efficiency by 2.91× for 128-frame videos compared with the evaluated baselines, while using only 1.1% of the tokens required by downsample-based tokenizers in our evaluation.
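The attention scopes described for Scope-Induced Factorization can be pictured as a boolean attention mask. The sketch below assumes a hypothetical token layout of `[TIV tokens | frame 0 tokens | frame 1 tokens | ...]`, where each frame group holds that frame's TV tokens together with its own content tokens; the layout and the function name `scope_mask` are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def scope_mask(num_tiv, num_frames, tokens_per_frame):
    """Build a boolean attention mask; mask[q, k] is True if query q may attend to key k."""
    total = num_tiv + num_frames * tokens_per_frame
    mask = np.zeros((total, total), dtype=bool)
    mask[:num_tiv, :] = True  # TIV queries attend to the full clip
    mask[:, :num_tiv] = True  # every query may read the TIV tokens
    for f in range(num_frames):
        start = num_tiv + f * tokens_per_frame
        end = start + tokens_per_frame
        mask[start:end, start:end] = True  # frame-local scope only
    return mask

mask = scope_mask(num_tiv=2, num_frames=3, tokens_per_frame=2)
```

Because frame-specific tokens cannot see other frames, any information that must be shared across time has to pass through the TIV tokens, which is the intuition behind making them reusable across frames and chunks.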
[CV-50] Root-Selecting Fixed-Point Inversion for Rectified Flows via Trajectory Straightness
Link: https://arxiv.org/abs/2606.17584
Authors: Semin Kim, Jihwan Yoon, Seunghoon Hong
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Finding the initial noise that generates a given data sample, known as inversion, is a key component for downstream applications such as training-free image editing. Existing fixed-point inversion methods improve inversion accuracy by formulating each inversion step as a fixed-point problem, but they lack a principled mechanism for selecting among multiple fixed-point solutions that can arise in practice. We observe that different selections induce different inversion trajectories, leading to substantial variation in reconstruction and editing quality. For rectified flows, we further find that this variation is closely associated with trajectory straightness, motivating straightness as a principled selection criterion. We propose SelFix, a fixed-point inversion method that selects fixed-point solutions inducing straighter inverse trajectories while retaining convergence to an exact inverse root under standard local assumptions. Experiments on FLUX.1-dev and PIE-Bench show that SelFix improves fixed-point inversion, achieving stronger real-image reconstruction and better source-preserving prompt-based editing than prior inversion baselines. The code is available at this https URL.
[CV-51] Geometric Consistency Protocol for Foundation Model Features in Multi-View Satellite Imagery
Link: https://arxiv.org/abs/2606.17564
Authors: Qiyan Luo, Jie Yang, Yingdong Pi, Lekang Wen, Mi Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted as an oral presentation at the IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026)
Abstract: Standardized evaluation protocols are indispensable for robust benchmarking in remote sensing, particularly as foundation features are increasingly transferred across diverse sensors and complex imaging geometries. In satellite multi-view reconstruction, conventional evaluations relying on unconstrained 2D global matching are often misleading. The Rational Function Model (RFM) and its Rational Polynomial Coefficients (RPC) dictate a curved, height-dependent epipolar geometry that renders flat 2D search spaces physically inconsistent. We propose a geometry-faithful and reproducible protocol tailored for the RPC framework. Our approach integrates an RPC-projected 3D consistency metric with a geometry-constrained dense matching proxy, specifically evaluating whether similarity responses remain localized and unique under physically plausible search manifolds. A pivotal finding of our joint reporting strategy is the decoupling of semantic agreement and geometric localization: high cross-view similarity at a projected 3D point does not guarantee reliable matchability in practical inference. Our benchmark demonstrates that incorporating geometric constraints is fundamental to the problem definition in satellite imagery. Furthermore, we show that state-of-the-art 2D backbones remain remarkably competitive against specialized 3D-aware models when subjected to this RPC-consistent evaluation.
[CV-52] RT-Counter: Real-Time Text-Guided Open-Vocabulary Object Counting
Link: https://arxiv.org/abs/2606.17561
Authors: Hao-Yuan Ma, Li Zhang, Zhiwei Zhu, Jie Gao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract: Text-guided open-vocabulary object counting (TOOC) aims to count objects belonging to the categories specified by natural language descriptions. Although vision-language pre-trained models have been successfully applied to TOOC tasks, they still struggle with fine-grained spatial understanding and real-time inference requirements in counting scenarios. To address these limitations, this paper proposes a real-time TOOC framework, called the Real-Time Counter (RT-Counter), that achieves not only good counting accuracy but also high computational efficiency. RT-Counter designs a novel Visual Prototype Textualization (VPT) module that can project learned visual features into a text feature space and then generate features containing the abstract information that is hard to capture with visual prototypes and the detailed prototype information that is difficult to describe in text, enhancing the object-level visual-language model’s counting capabilities. Additionally, RT-Counter incorporates our Weaving Transformer (Weaformer) layers, maintaining high descriptive power at a fraction of the computational cost. The Weaformer layer adopts a novel hybrid attention mechanism that can efficiently weave together local and global visual features. Extensive experiments on three public datasets show that RT-Counter successfully breaks the accuracy-speed trade-off in TOOC. While achieving a competitive MAE of 13.30 on FSC147, RT-Counter operates at 112.48 FPS, making it 7.4× faster and over 4× more parameter-efficient than the existing leading methods in TOOC. Our work aims at balancing high accuracy and real-time performance in TOOC. Code is available at: this https URL.
[CV-53] Universal Image Restoration via Internalized Chain-of-Thought Reasoning
Link: https://arxiv.org/abs/2606.17557
Authors: Yu Guo, Zhengru Fang, Shengfeng He, Senkang Hu, Yihang Tao, Phone Lin, Yuguang Fang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Image restoration seeks to recover high-quality images from degraded inputs but becomes highly ill-posed under complex, mixed degradations. While unified all-in-one models are common, their performance declines as degradation complexity increases. Recent works adopt Chain-of-Thought (CoT) reasoning for multi-round restoration using specialized modules. However, this approach faces two key limitations: (i) increased computational cost due to multi-step processing, and (ii) weak modeling of interactions between degradations during stepwise inference. We introduce CoTIR, a universal image restoration framework that internalizes CoT reasoning within a single model. Concretely, we view image restoration as a specialized subtask of image editing, which implies that a large-scale pre-trained editing model provides a more favorable optimization starting point. Building on this, we fine-tune the model for restoration and further encode structured CoT-style reasoning into the learning objective via a differentiable formulation inspired by Lagrangian optimization, enabling holistic restoration without chaining specialized restorers. To facilitate training and evaluation, we further present CoTIR-Bench, a large-scale benchmark comprising 5.2 million samples with CoT-style reasoning traces. Extensive experiments on CoTIR-Bench and broad real composite degradation scenes show that CoTIR achieves stronger perceptual quality and more competitive fidelity than both all-in-one models and multi-round restoration methods. The source code is available at this https URL.
[CV-54] TaFD: Threat-Aware Frequency Decoupling for Adversarial Robustness against Heterogeneous Attacks
Link: https://arxiv.org/abs/2606.17540
Authors: Mengda Xie, Yiling He, Meie Fang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract: Multi-threat robustness remains a fundamental challenge in deep learning. Although joint adversarial training (JAT) is widely adopted, it suffers from negative transfer under heterogeneous threats, particularly between $\ell_p$-bounded and semantic attacks. Through first-order gradient analysis, we formalize this as gradient incompatibility and theoretically establish the necessity of decoupled optimization. We further reveal that these conflicting threats exhibit separable spectral characteristics in the frequency domain. Motivated by this observation, we propose Threat-aware Frequency Decoupling (TaFD), a two-stage defense framework that reformulates JAT as a frequency-domain divide-and-conquer paradigm. TaFD first discovers latent threat domains via unsupervised clustering of attack spectral prototypes and trains a lightweight classifier for inference-time threat domain identification. Conditioned on the prediction, TaFD employs a Frequency-Conditional Convolution that learns threat-domain-specific spectral masks and routes each sample to the corresponding expert, enforcing structural parameter separation and alleviating optimization conflicts. We validate TaFD on three representative image-classification benchmarks (CIFAR-10, CIFAR-100, and Tiny-ImageNet) and on two representative architectures (the convolutional ResNet and the hybrid-transformer MobileViT). Extensive results demonstrate that TaFD achieves more balanced robustness against heterogeneous attacks than existing JAT and frequency-domain baselines, improving average robust accuracy by approximately 11% over the strongest baseline while maintaining leading clean accuracy.
[CV-55] Reinforcing Dual-Path Reasoning in Spatial Vision Language Models
Link: https://arxiv.org/abs/2606.17539
Authors: Yatai Ji, An-Chieh Cheng, Yang Fu, Yukang Chen, Han Zhang, Zhaojing Yang, Wei Huang, Ka Chun Cheung, Song Han, Vidya Nariyambut Murali, Pavlo Molchanov, Jan Kautz, Simon See, Hongxu Yin, Ping Luo, Sifei Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Spatial VLMs have made substantial progress in geometric perception, yet complex spatial reasoning requiring multi-step inference over depth, distance, and scene relations remains challenging. Moreover, different spatial queries call for fundamentally different strategies: some are best addressed through purely linguistic, step-by-step deduction, while others require explicit 3D grounding before quantitative inference. We present Dual-Path Spatial Reasoning via Reinforcement Learning for Spatial VLMs (SR-REAL), a unified framework that equips a spatial VLM with two complementary reasoning paths: Language-Only Reasoning (LOR), which performs step-by-step linguistic deduction, and Detect-Then-Reason (DTR), which detects 3D geometric cues (e.g., centers or bounding boxes) via region tokens before explicit geometric inference. SR-REAL begins with a cold-start supervised fine-tuning stage that constructs LOR and DTR chain-of-thought supervision and exposes a region-to-3D interface, followed by RL that optimizes the policy model with accuracy and format rewards; for DTR, a discrete center-based detection reward further refines geometric alignment. Across diverse spatial benchmarks, SR-REAL significantly outperforms spatial VLM baselines: (i) a single RL-trained model supports both reasoning paths, with DTR excelling in region-aware tasks through precise 3D localization and LOR enhancing general spatial reasoning; (ii) jointly training both paths fosters mutual reinforcement; (iii) high-quality, blended cold-start data is crucial for stable RL optimization; and (iv) the model generalizes across datasets and domains without per-task tuning, demonstrating positive transfer between LOR and DTR.
[CV-56] OmniDrive: An LLM-Choreographed Multi-Agent World Model with Unified Latent Co-Compression for Multi-View Driving Video Generation
链接: https://arxiv.org/abs/2606.17536
作者: Zijie Meng,Yufei Liu,Chengqian Ma,Zhiyu Li,Jiyuan Liu,Wenhua Nie,Bingcai Wei,Shuqin Chen,Weichen Xu,Jiquan Yuan,Miao Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 24 pages, 10 figures
Abstract:Generative world models for autonomous driving face two unresolved tensions: heterogeneous control injection, where free-form language, HD-maps, trajectories, and camera poses reside in incompatible representational spaces, and post-hoc cross-view fusion, where per-camera latents fail to encode global 3-D geometry. We trace both to a single root cause: the absence of a shared symbolic interlingua aligning language, geometry, and pixels at the latent-token level. We present DRIVE-CHOREO, an LLM-choreographed multi-agent world model that recasts controllable multi-view video generation as latent choreography. Three Qwen2.5-VL agents - a Director parsing user intent into a structured WorldScript, a Cartographer grounding it into spatially-anchored layout tokens, and an Auditor feeding cross-view critiques back as auxiliary supervision - jointly author a single position-aware token sequence. This sequence is co-compressed with the multi-view video via a view-time permutation that enforces inter-camera geometry within the convolutional receptive field of a 3-D VAE. On nuScenes, DRIVE-CHOREO sets new state-of-the-art multi-view consistency and BEV mAP (21.6) with competitive FVD (45.7); a detector trained purely on our synthetic data gains +2.4 NDS on the real validation split, validating downstream utility.
[CV-57] GASE: Gaussian Splatting-Based Automated System for Reconstructing Embodied-Simulation Environments
链接: https://arxiv.org/abs/2606.17520
作者: Jiawei Zhang,Yiming Yan,Chao Liang,Nuo Xu,Seson Sun,Qichen Zhang,Yuhao Xu,Yantai Yang,Yingqiao Wang,Qin Jin,Zhipeng Zhang
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Training embodied agents in the real world requires skilled operators and expensive hardware. Simulation environments offer a compelling alternative by enabling large-scale, cost-effective data augmentation. Consequently, rapidly constructing high-fidelity simulation scenes with a minimal sim-to-real gap has become a critical objective in robot learning. While reconstruction-based methods provide superior visual quality, current workflows are hindered by inefficient data acquisition and subpar foreground object extraction. We thus propose GASE, a highly automated system for simulation scene construction. GASE leverages multi-view video streams from panoramic camera arrays to enable rapid environment scanning. To ensure high-quality asset generation, our pipeline introduces a camera-pose-based strategy that robustly extracts objects across frames in the 2D domain, followed by high-fidelity scene inpainting. Foreground objects and the static background are then reconstructed independently and seamlessly imported into physics simulators for policy training. Extensive experiments demonstrate that GASE outperforms existing 3D Gaussian-based methods in segmentation accuracy by over 10% while achieving state-of-the-art inpainting quality. Furthermore, real-robot deployments across manipulation and navigation tasks maintain a performance gap of less than 10% compared to policies trained purely on real-world data. These results confirm that GASE provides an efficient and highly effective solution for bridging the sim-to-real gap. Code will be released.
[CV-58] MagicSim: A Unified Infrastructure for Executable Embodied Interaction
链接: https://arxiv.org/abs/2606.17511
作者: Haoran Lu,Songling Liu,Yue Chen,Guo Ye,Mutian Shen,Shuyang Yu,Yu Xiao,Jihai Zhao,Shang Wu,Jianshu Zhang,Xiangtian Gui,Chuye Hong,Yuran Wang,Maojiang Su,Jiayi Wang,Ruihai Wu,Zhaoran Wang,Han Liu
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Robot learning and embodied agents now require simulation to serve as a shared execution substrate linking control, skills, and planning, not only as a renderer, controller testbed, or fixed task environment. Existing pipelines split these layers with “magic” actions, disconnected training environments, or forward-only renders that cannot reproduce, evaluate, and annotate the same episode. We present MagicSim, an embodied interaction infrastructure built around one deterministic batched runtime and a shared Markov decision process (MDP). From YAML-first specifications that decouple contents, placement, behavior, and agent exposure, MagicSim constructs diverse executable worlds spanning task families, interaction regimes, physics, layouts, sensors, avatars, and robot embodiments in one reset-and-step loop. A common execution interface grounds high-level commands through controllers, atomic skills, planner primitives, and asynchronous planning, realizing them as robot actions rather than simulator-side state edits. One task definition supports three capabilities: benchmark and RL evaluation, an autocollect interface that automatically turns commands into grounded trajectories, and agent/VLM-facing interaction. For automatic execution, commands flow through a Command-Skill-Planner-Robot-Record pipeline, while per-environment command, skill, planning, retry, annotation, and episode states advance independently above the shared physics tick. Successful rollouts are saved as structured multimodal trajectories aligning language supervision, action representations, visual/geometric representations, and task-level status with the executed episode. MagicSim thus unifies diverse world construction, embodied execution, task evaluation, automatic rollout generation, and interactive agent interfaces in one planner-in-the-loop runtime.
[CV-59] SPHINX: First Explain Then Explore
链接: https://arxiv.org/abs/2606.17482
作者: Nguyen Do,Tue M. Cao,Tien Van Do,András Hajdu,Tamás Bérczes,My T. Thai
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages
Abstract:Generating adversarial driving scenarios is critical for evaluating and improving autonomous vehicle decision-making systems in simulation. Recent approaches, such as ChatScene and LLM-Attacker, rely primarily on the prior knowledge of Large Language Models and Vision-Language Models to generate driving scenarios procedurally. We argue that adversarial scenes should be generated based on the failure diagnosis (e.g., indecisiveness, multi-frame inconsistency) of the driving policy to specifically address the policy’s weaknesses instead of relying on prior assumptions. In this paper, we propose SPHINX, a closed-loop framework for adversarial scenario synthesis guided by a simple principle: first explain, then explore. Beyond blindly exploring the scenario space, SPHINX leverages explainable artificial intelligence methods to analyze the policy, identifying key visual concepts and their influence on policy outputs, and the uncertainty of the decisions. Given the interpretable evidence extracted from the policy’s own decision process, we use a vision language model to rationalize and criticize failure modes of the current policy. These critiques are then used to generate targeted adversarial scenarios for policy retraining and improvement. We demonstrate that SPHINX can highlight an interpretable account of policy failures while other adversarial scene generation methods cannot. Across the evaluated benchmarks and test suites, SPHINX can be applied to diverse state-of-the-art autonomous vehicle architectures and yields consistent robustness improvements over existing scenario-generation methods.
[CV-60] GeneralVLA-2: Geometry-Aware Reconstruction and Governed Memory for Robot Planning
链接: https://arxiv.org/abs/2606.17480
作者: Haoyu Wang,Guoqing Ma,Zeyu Zhang,Yandong Guo,Boxin Shi,Hao Tang
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Generalist vision-language-action systems need object-centric 3D evidence and reusable manipulation experience to plan reliable robot trajectories. GeneralVLA provides a hierarchical interface for converting language and RGB-D observations into 3D end-effector paths, but two bottlenecks remain. First, monocular SAM3D-style object reconstruction can hallucinate pose and unseen geometry, while manipulation benefits from stable object shape when calibrated multi-view observations are available. Second, the original KnowledgeBank mainly retrieves semantically similar snippets and appends new knowledge, which makes it difficult to control memory quality, conflicts, confidence, and geometric relevance. To address the first challenge, we introduce GeoFuse-MV3D, a geometry-prior-guided MV-SAM3D reconstruction branch that verifies external geometry cues with input-view masks, applies soft visual-hull support, performs axis-wise refinement, and fuses only geometry while preserving appearance. To address the second challenge, we upgrade KnowledgeBank into a governed long-term memory system with explicit quality, confidence, lifecycle, verifier, and conflict metadata, together with precision-oriented retrieval. Finally, we evaluate the reconstruction branch on GSO-30 and the memory module on Terminal-Bench 2.0 and SWE-Bench Verified; GeoFuse-MV3D improves over the MV-SAM3D baseline by reducing CD and LPIPS by 2.20% and 2.02% while increasing PSNR and SSIM by 2.36% and 1.03%, and KnowledgeBank improves over ReasoningBank by 4.53% on Terminal-Bench SR and 3.73% on SWE-Bench resolve rate, while reducing AS by 4.95% and 5.65%, respectively. Code: this https URL. Website: this https URL.
[CV-61] Theoretical Grounding of Out-Of-Distribution Detection With Reinforcement Learning Optimizer
链接: https://arxiv.org/abs/2606.17477
作者: Salimeh Sekeh,Xin Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Out-of-distribution (OOD) detection in dynamic open-world environments requires a model to continually adapt to evolving data distributions while generalizing to covariate-shifted inputs and rejecting semantic-shifted OOD examples. Most existing OOD detection methods optimize only the current-step objective and do not explicitly account for how post-deployment environment changes affect future OOD behavior. In this paper, we establish a theoretical grounding for dynamic OOD detection using a reinforcement learning (RL)-guided optimizer that explicitly favors updates that reduce the semantic OOD false positive rate over time. We develop a novel augmented optimizer that uses an RL-guided correction term on top of standard gradient descent (GD) and show its improvement over both future-domain generalization and semantic-OOD rejection. We analyze temporal error decomposition in terms of model-change and environment-change generalization errors and develop a new theoretical framework for comparing the generalization errors under both GD and RL-guided optimizers.
[CV-62] StereoFactory: A Unified Merging Framework for Robust Stereo Matching
链接: https://arxiv.org/abs/2606.17475
作者: Xianda Guo,Pinhan Fu,Ruilin Wang,Wenke Huang,Mang Ye,Qin Zou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Stereo matching has advanced through foundation models trained on large-scale datasets, yet this paradigm suffers from a scalability bottleneck: incorporating new data requires costly joint retraining. Model merging offers a scalable post-hoc alternative by integrating knowledge from specialized models after source checkpoints are available. However, existing merging methods typically retain all available models or rely on greedy inclusion, which can preserve harmful task-vector interference. We propose StereoFactory, a coarse-to-fine evolutionary framework for adaptive model merging. Stage 1 employs a genetic algorithm to search the combinatorial space of model subsets, determining which models should participate. Stage 2 addresses module-level knowledge specialization (different functional modules exhibit distinct preferences for knowledge sources) through CMA-ES optimization of architecture-adaptive routing over the selected task vectors, with optional module-level scaling. Experiments across two architectures and four benchmarks demonstrate that StereoFactory consistently achieves the best four-benchmark average under the same checkpoint pool, reducing the average error from 3.80 to 3.30 on NMRF and from 2.88 to 2.19 on FoundationStereo relative to the strongest controlled baseline. The post-hoc search requires only 2.7–3.7% of the corresponding joint-retraining wall-clock time. Analysis reveals that knowledge contributions are inherently module-specific, and selected subsets can transfer across architectures with minimal degradation. Code will be publicly released upon acceptance at: this https URL.
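To make the merging step described in this abstract more concrete, here is a minimal sketch of task-vector merging with a binary subset mask and per-module scales. It is not the authors' implementation: the genetic-algorithm and CMA-ES searches are omitted, and every name (`Params`, `task_vector`, `merge`, the `selected` mask, the `module_scales` dictionaries) is a hypothetical choice made for illustration. It only shows what a single candidate configuration might compute once the search has proposed it.

```python
from typing import Dict, List, Sequence

# Hypothetical parameter container: module name -> flat list of weights.
Params = Dict[str, List[float]]


def task_vector(finetuned: Params, base: Params) -> Params:
    """Difference between a specialized checkpoint and the shared base."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def merge(
    base: Params,
    experts: Sequence[Params],
    selected: Sequence[bool],
    module_scales: Sequence[Dict[str, float]],
) -> Params:
    """Add the scaled task vectors of the selected experts to the base.

    `selected` plays the role of a subset choice and `module_scales`
    the role of per-module weights; both are assumed to be given.
    """
    merged = {name: list(values) for name, values in base.items()}
    for expert, keep, scales in zip(experts, selected, module_scales):
        if not keep:
            continue  # excluded experts contribute nothing
        for name, delta in task_vector(expert, base).items():
            scale = scales.get(name, 1.0)  # default: unscaled task vector
            merged[name] = [m + scale * d for m, d in zip(merged[name], delta)]
    return merged
```

In a search loop, each candidate pair of `selected` and `module_scales` would be scored on validation data, and only the best-scoring merged parameters would be kept.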
[CV-63] WeaveLA: Event Driven Cross-Subtask Latent Memory Weaving for Repetitive Robot Manipulation
链接: https://arxiv.org/abs/2606.17463
作者: Shoujing Zhu,Zhenyang Liu,Fungmiu Wang,Jiafeng Wang,Bo Yue,Guiliang Liu,Simo Wu,Xiangyang Xue,Taiping Zeng
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Vision-Language-Action (VLA) policies have achieved remarkable single-step manipulation, yet they remain brittle precisely where each stage depends on what was just completed. The core issue is structural: short-window VLAs lack an explicit channel for routing information across sub-task boundaries, and existing memory-augmented variants either write at every frame, retrieve from demonstration-time stages, or fire at sub-goal events without performing an explicit sub-task-to-sub-task hand-off into the action expert. We identify the sub-goal completion event as the natural temporal unit for cross-subtask memory hand-off, and present WeaveLA (Weave Latent memory for Vision-Language-Action policies), a cross-subtask memory interface that, on top of a frozen VLA backbone, compresses each completed segment into latent tokens via query-driven attention pooling and routes them directly into the action-generation path of the next sub-task. This event-triggered, action-side design preserves the base policy’s short-window interface while adding a lightweight cross-subtask channel. Through stratified evaluation on RoboMME with a π0.5 backbone, WeaveLA’s gains land exactly where the channel is needed: on the hardest repetition slice (SwingXtimes, N=3), success rises from 0% to 47.8%, while single-execution episodes remain unchanged. Per-episode paired analysis confirms the gains are confined to tasks whose causal structure requires cross-subtask information.
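The "query-driven attention pooling" mentioned in this abstract can be illustrated with a generic sketch: a small set of query vectors attends over the tokens of a completed segment and returns one pooled latent per query. This is an assumption about the general technique, not WeaveLA's code. The key/value projections, the training of the queries, and the routing into the action expert are all left out, and the function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(queries: List[Vector], tokens: List[Vector]) -> List[Vector]:
    """Compress a variable-length token sequence into len(queries) latents.

    Each query scores every token with a scaled dot product, and the
    pooled latent is the attention-weighted average of the tokens.
    """
    dim = len(tokens[0])
    pooled = []
    for query in queries:
        scores = [
            sum(q * t for q, t in zip(query, token)) / math.sqrt(dim)
            for token in tokens
        ]
        weights = softmax(scores)
        pooled.append(
            [sum(w * token[d] for w, token in zip(weights, tokens)) for d in range(dim)]
        )
    return pooled
```

The output size depends only on the number of queries, so a segment of any length is reduced to a fixed number of latent tokens, which is what makes such pooling convenient for a hand-off between sub-tasks.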
[CV-64] AnnotateAnything: Automatic Annotation of 3D Assets for Robot Manipulation
链接: https://arxiv.org/abs/2606.17446
作者: Haoran Lu,Mutian Shen,Shuyang Yu,Yu Xiao,Songling Liu,Jianshu Zhang,Shang Wu,Yue Chen,Guo Ye,Jiayi Wang,Zhaoran Wang,Han Liu
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Simulation enables scalable robot data collection, but raw 3D assets provide only geometry, lacking the semantic, interactive, and physical knowledge needed to specify where and how robots should act. In this work, we present AnnotateAnything, a general automatic annotation framework that converts passive 3D assets into manipulation-ready assets with structured, diverse, and executable manipulation labels. AnnotateAnything is built around two complementary pipelines. First, a unified visual-language annotation pipeline uses vision-language reasoning to infer object semantics, interaction constraints, and 3D-grounded cues, providing human-prior guidance for identifying meaningful interaction regions. Second, a fully automatic and massively parallel physics annotation pipeline grounds these priors in each asset’s geometry and physical constraints through candidate generation, geometry optimization and trajectory generation. This pipeline produces diverse and executable action annotations, including grasp poses, dexterous contacts, articulation waypoints, insertion directions, hanging affordances, and navigation targets. Using the generated annotations, we further build an asynchronous parallel simulation data-collection system across diverse objects, tasks, and robot embodiments. Experiments demonstrate that AnnotateAnything achieves superior annotation efficiency, data-collection efficiency, and task success rates over existing annotation and data-generation pipelines, while also supporting downstream tasks such as affordance detection, robotic VQA, and visual instruction finetuning. We provide project materials on the project page and plan to release the full code, annotations, and benchmark to facilitate future research. Videos, code, demo assets, and annotations are provided in supplementary materials. Project page: this https URL.
[CV-65] Contact-Based Fringe Projection Profilometry for High-Resolution 3-D Surface Measurement of Reflective and Transparent Objects
链接: https://arxiv.org/abs/2606.17438
作者: Ingu Yeo,Hyung-Gun Chi,Jae-Sang Hyun
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper presents a contact-based 3-D surface measurement method based on a Digital Fringe Projection (DFP) system, belonging to the vision-based tactile sensing family pioneered by the commercially successful GelSight sensor. Such sensors have proven effective for robotic fingertip manipulation and contact sensing. However, because GelSight employs photometric stereo with RGB LEDs, it does not measure absolute depth directly but instead infers it by integrating estimated surface gradients, which can accumulate reconstruction errors; in addition, it becomes increasingly difficult to calibrate as the sensing area grows, and its depth accuracy is challenged on highly reflective or transparent objects. To overcome these drawbacks, we propose a fringe-projection-based contact measurement technique that performs triangulation-based 3-D reconstruction on a coated silicone contact surface, providing dense per-pixel surface geometry and full-field 3-D shape measurement over the contact region. By integrating high-accuracy digital fringe projection into the sensor, our approach simplifies calibration over larger areas and enhances depth precision for complex surfaces. Experimental results, including a direct comparison with a GelSight Mini sensor, a sphere-fitting accuracy evaluation, and an uncertainty analysis, confirm that the proposed method significantly improves the accuracy and stability of structured-light-based 3-D measurements, allowing reliable reconstruction of objects with diverse optical properties.
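For readers unfamiliar with digital fringe projection, the sketch below shows the standard N-step phase-shifting formula that such systems commonly use to recover a wrapped phase per pixel before triangulation. The abstract does not say which phase-retrieval scheme this paper uses, so treating it as N-step phase shifting is an assumption, and the intensity model I_n = A + B cos(φ + 2πn/N) and all function names are illustrative only.

```python
import math
from typing import List


def fringe_intensities(
    phase: float, steps: int, offset: float, amplitude: float
) -> List[float]:
    """Simulate one pixel under `steps` equally shifted sinusoidal fringes."""
    return [
        offset + amplitude * math.cos(phase + 2.0 * math.pi * n / steps)
        for n in range(steps)
    ]


def wrapped_phase(intensities: List[float]) -> float:
    """Recover the wrapped phase in (-pi, pi] from N >= 3 shifted samples.

    Uses phi = atan2(-sum(I_n sin(d_n)), sum(I_n cos(d_n))) with
    d_n = 2*pi*n/N, which cancels the unknown offset and amplitude.
    """
    steps = len(intensities)
    if steps < 3:
        raise ValueError("phase shifting needs at least three fringe images")
    shifts = [2.0 * math.pi * n / steps for n in range(steps)]
    numerator = -sum(i * math.sin(d) for i, d in zip(intensities, shifts))
    denominator = sum(i * math.cos(d) for i, d in zip(intensities, shifts))
    return math.atan2(numerator, denominator)
```

The result is wrapped, so a real pipeline would still need phase unwrapping and a calibrated projector-camera geometry to turn phase into depth.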
[CV-66] Spatio-Temporal Fusion Model for Standard View Classification of Echocardiographic Videos
链接: https://arxiv.org/abs/2606.17437
作者: Bo Gou,Jicheng Zhang,Jianlong Xiong,Tao He,Bentian Liu,Hai Wu,Yijiao Wang,Yu Zhang,Yujia Yang,Yun Dai,Jian Liu,Jie Wang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Automated classification of standard echocardiographic views is crucial for efficient clinical workflow but faces three main challenges. First, publicly available datasets are scarce and limited in scale and view coverage. Second, the performance of some modern video-level architectures for echocardiographic view classification remains underexplored. Third, some view categories exhibit highly similar spatial appearances, making single-frame features insufficient for discrimination, while heterogeneous frame quality complicates robust temporal information fusion. To address these challenges, we release the Echocardiographic Videos of Nine Views (EV9V) dataset, comprising 5,138 videos, 910,579 frames, and 9 standard views, which is, to the best of our knowledge, the largest publicly available echocardiography video dataset. Using EV9V, we systematically benchmark representative video classification architectures, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformers. Furthermore, we propose a Spatio-Temporal Fusion Model (STFM), an efficient dual-stream CNN-LSTM (Long Short-Term Memory) framework that jointly captures spatial anatomical structures and temporal cardiac dynamics. The proposed framework leverages uncertainty-aware learning to preferentially sample representative video segments during training and evidence-based fusion during inference, improving robustness to variations in frame quality across echocardiographic videos. Extensive experiments demonstrate that our method achieves competitive performance across diverse video classification models, validating the effectiveness of uncertainty-aware spatio-temporal learning for echocardiographic view classification. The code is available at this https URL.
[CV-67] UoU: A Universal Fingerprint Foundation Model Based on Large-Scale Unsupervised Learning
链接: https://arxiv.org/abs/2606.17436
作者: Xiongjun Guan,Jianjiang Feng,Jie Zhou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Fingerprint recognition is still dominated by task-specific pipelines, where enhancement, structural parsing, alignment, and matching are optimized in isolation. Although effective in narrow settings, this design limits representation reuse across sensors, qualities, and downstream applications. We therefore present UoU, short for “a Universal fingerprint foundation model based on large-scale Unsupervised learning,” which reframes fingerprint feature extraction as a domain-specific foundation-model problem. UoU is organized around a multi-level representation hierarchy spanning image restoration, structural fields, semantic tokens, point-level biometric entities, and compact global descriptors. Its training recipe combines a supervised cold start on precise annotations, large-scale weakly supervised refinement, and large-scale unsupervised consolidation, with the latter two stages iterated during large-scale training so that weak supervision broadens semantic coverage while unsupervised learning stabilizes correspondences, invariances, and representation geometry. Rather than treating fingerprint imagery as generic texture, UoU exploits domain-specific symmetries and intermediate structure, including orientation flow, periodic ridge patterns, sparse biometric entities, and spatial equivariance. The framework is intentionally architecture-agnostic: while the present study includes an initial transformer-based structured-prediction instantiation, the broader design supports multi-task learning, scalable model configurations, and downstream specialization for matching, alignment, enhancement, registration, and related fingerprint applications. This paper presents the technical motivation, system design, and validation protocol of UoU, and part of the baseline implementation is publicly available at this https URL.
[CV-68] LADBench: A Benchmark for Logical Fault Detection in Images
链接: https://arxiv.org/abs/2606.17433
作者: Sahasra Kondapalli,Lara Radovanovic,Aadi Palnitkar,Mingyang Mao,Xiaomin Lin
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the IEEE International Conference on Development and Learning (ICDL 2026)
Abstract:Large Vision Language Models (VLMs) excel at visual question answering and semantic grounding, but their capacity for autonomous logical reasoning remains underexplored. Existing anomaly benchmarks emphasize visual errors or direct prompting rather than the physical and social common sense needed for open-world deployment. To address this, we introduce LAD-Bench, a benchmark of more than 1,000 curated synthetic images with logical anomalies across four domains: Residential, Urban, Collaborative, and Nature. We further propose a Tiered Prompting Protocol based on progressive disclosure, which measures how much explicit assistance a model needs to localize and reason about a logical fault. Evaluating leading foundation models reveals substantial weaknesses: even the best achieves only 70.11% overall accuracy, showing that implicit logical fault detection remains unsolved. Crucially, models often fail to identify anomalies even after receiving explicit hints in deeper tiers. By surfacing these limitations in sequential multimodal reasoning, LAD-Bench offers a rigorous framework for advancing the safety, reliability, and cognitive alignment of autonomous visual systems. Dataset and Code: this https URL
[CV-69] Edit3DGS: Unified Framework for Dynamic Head Editing via 2D Instruction-Guided Diffusion and 3D Gaussian Splatting
链接: https://arxiv.org/abs/2606.17432
作者: Duy-Dat Tran,Trung-Nghia Le
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: SOICT 2025
Abstract:We present Edit3DGS, a unified framework for dynamic 3D head editing that integrates 2D instruction-guided diffusion with 3D Gaussian splatting. Unlike prior approaches that separately address frame-based edits or static 3D reconstruction, our method couples semantic controllability in the image domain with photorealistic, temporally consistent 3D representations. Given an input video, editable facial regions are masked and modified using a text-conditioned diffusion model to support fine-grained operations such as expression transformation, attribute modification, and appearance refinement. The edited frames are then aggregated through 3D Gaussian splatting to produce a coherent, high-fidelity avatar that preserves both identity and motion dynamics. To enforce consistency, Edit3DGS incorporates multi-view batch editing and lightweight inpainting strategies that recover lost expressions across timesteps. Experimental results demonstrate that our framework enables controllable, artifact-free head editing with smooth temporal transitions, offering practical applications in virtual avatars, immersive communication, film production, and interactive media.
[CV-70] Visual Retrieval-Augmented Generation for Silhouette-Guided Animal Art
链接: https://arxiv.org/abs/2606.17431
作者: Quoc-Duy Tran,Anh-Tuan Vo,Trung-Nghia Le
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: SOICT 2025
Abstract:Generative AI has advanced the ability to render photorealistic or artistic images, yet it remains limited in a key aspect of human creativity: interpreting ambiguous shapes. This phenomenon, rooted in pareidolia, allows humans to perceive meaningful forms in random patterns such as clouds, stones, or leaves. To computationally replicate this imaginative process, we introduce Visual Retrieval-Augmented Generation (Visual-RAG), a framework that generates animal art directly from natural silhouettes. Our method retrieves structurally similar animal shapes from a curated corpus of 28,586 high-quality silhouettes and uses them as reference exemplars to guide diffusion-based generation with ControlNet and IP-Adapter. Ablation studies confirm that Shape Context with RANSAC provides the most accurate alignment, while removing shape standardization reduces the inlier ratio to just 13.4%, underscoring the importance of structural fidelity in Visual-RAG. A user study with 12 participants evaluated the outputs in terms of aesthetics, silhouette fidelity, and overall impression. Results reveal that while Visual-RAG provides plausible interpretations, challenges remain in achieving high perceptual impact. This work lays the foundation for computational pareidolia, showing how machines can contribute to the early stages of imaginative discovery.
[CV-71] CIAN: Multi-Stage Framework for Event-Enriched Image Captioning via Retrieval-Augmented Generation
链接: https://arxiv.org/abs/2606.17430
作者: Trinh Thi Thu Hien,Trung-Nghia Le
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: SOICT 2025
Abstract:Event-enriched image captioning describes not only visible content but also the broader context of events, including timing, location, and participants, capabilities missing in most pixel-bound models. We propose the Contextual Image-Article Narrator (CIAN), a multi-stage framework that enriches captions with external narratives. CIAN retrieves relevant articles using SigLIP, summarizes them to guide a Narrative Generation stage with a LoRA-fine-tuned Qwen model, and applies N-Gram-based Refinement for fluency and coherence. On the OpenEvents-V1 benchmark, CIAN achieves high retrieval performance (mAP 0.979) and improves caption quality, increasing CIDEr from 0.030 to 0.094. These results highlight the effectiveness of retrieval-augmented reasoning combined with linguistic refinement for generating context-aware, human-like captions.
[CV-72] Enhancing Pathological VLMs with Cross-scale Reasoning
链接: https://arxiv.org/abs/2606.17412
作者: Chi Phan,Tianyi Zhang,Qiaochu Xue,Yufeng Wu,Dan Hu,Zeyu Liu,Sudong Wang,Yueming Jin
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Pathological images are inherently multi-scale, requiring pathologists to integrate evidence from global tissue architecture at low magnification to cellular morphology at higher magnification for accurate diagnosis. While existing pathological datasets for vision-language models (VLMs) include various scales, they often lack an explicit cross-scale reasoning objective. This limitation prevents VLMs from capturing essential cross-scale representations and learning evidence-based reasoning. To bridge this gap, we introduce the first cross-scale training and evaluation paradigm that formulates pathology interpretation as multi-magnification reasoning. However, creating such a task reveals a critical challenge: multi-image visual question answering (VQA) is prone to text-only shortcuts, which allow models to guess answers using magnification-dependent artifacts rather than visual evidence. To address this, we propose a leakage-aware curation pipeline that combines adversarial text-only screening with constraint-guided question design. Using this pipeline, we construct Scale-VQA, a high-quality benchmark with 4,685 multiple-choice questions grounded in 2,537 pathology images across multiple magnification levels. Finally, we present ScaleReasoner-R1, a model trained via reinforcement learning to optimize performance on the cross-scale VQA task. ScaleReasoner-R1 achieves state-of-the-art performance on our cross-scale reasoning benchmark and generalizes to SOTA performance on established single-scale benchmarks. Findings suggest that even limited cross-scale supervision can significantly improve pathological understanding. The code and demos will be open-sourced.
[CV-73] Attention Alignment Between Humans and Vision-Language Models
链接: https://arxiv.org/abs/2606.17410
作者: Isaac R. Christian,Udith Haputhanthrige,Hanna Hornfeld,Declan Campbell,Samuel Nastase,Taylor Webb,Michael Graziano
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Visual perception depends on top-down goals and bottom-up sensory mechanisms. Vision-language models implement both, allowing us to treat each component as a separable hypothesis about what drives where we look. We compared spatial attention maps from six vision-language models against human fixation heatmaps recorded on 200 images during two tasks (general description and social captioning). The six models spanned a 2×2 factorial of CNN vs. ViT encoders crossed with LSTM vs. Transformer decoders, plus Molmo 7B-D and Qwen3.5 9B. We found that both decoder and encoder architecture shaped alignment, but decoder choice dominated. LSTM vs. Transformer decoders increased alignment by 40–50 percentage points (80–87% vs. 40–59% of the human noise ceiling). In contrast, CNN vs. ViT encoders contributed a secondary 5–20 point advantage depending on decoder family, with CNN-LSTM the most aligned model overall (85–87%). Despite their alignment advantage, LSTM-decoder attention maps were spatially diffuse and minimally task-differentiated; ViT-Transformer, the weakest in alignment, showed the sharpest spatial concentration and strongest task differentiation. A hemispatial-neglect simulation confirmed that ablating attention impacted LSTM decoders more than Transformer decoders. In an exploratory extension using TRIBE-simulated synthetic neural responses, fixation alignment and neural relevance dissociate: CNN-Transformer attention maps better predicted synthetic brain activity despite lower fixation alignment, with attention maps best predicting early visual cortex. Together, top-down and bottom-up components trade off what they predict in behavioral and synthetic neural data.
[CV-74] Where Should Action Generation Begin? A Learnable Source Prior for Generative Robot Policies
链接: https://arxiv.org/abs/2606.17408
作者: Meipo Dai,Qiyuan Zhuang,He-Yang Xu,Ying-Jie Shuai,Yijun Wang,Qi Dou,Xiu-Shen Wei
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Generative robot policies typically begin action generation from an observation-independent standard Gaussian distribution, leaving the choice of source distribution underexplored. This work asks a simple question: where should action generation begin? We propose LeaP, a Learnable source Prior that replaces the standard Gaussian with a proprioception-conditioned diagonal Gaussian over action chunks. Parameterized by a lightweight MLP, LeaP jointly predicts the mean and state-adaptive variance of the source distribution, while keeping the downstream generator architecture and inference solver unchanged. This design provides an observation-informed yet stochastic initialization, allowing the generator to focus on precise action refinement rather than transporting samples from an uninformed noise source. On 15 RoboTwin manipulation tasks, LeaP achieves an average success rate of 81.6%, outperforming four representative baselines – including deterministic-source methods, a no-prior counterpart, and a diffusion-bridge policy – by 6.5 to 25.5 percentage points. The same prior consistently improves both flow-matching and diffusion-bridge generators, while using fewer parameters and converging faster. The advantage carries over to real-world deployment, where LeaP attains the best performance. These results suggest that the source distribution is an independent and reusable design axis for generative robot policies, complementary to the choice of generative dynamics.
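The source prior described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class name `SourcePrior`, the single tanh hidden layer, the random untrained weights, and the dimensions are all assumptions. It only shows the shape of the idea, namely that the starting sample for action generation is drawn from a diagonal Gaussian whose mean and variance depend on the proprioceptive state, instead of from a fixed standard Gaussian.

```python
import math
import random


class SourcePrior:
    """Hypothetical proprioception-conditioned diagonal Gaussian source."""

    def __init__(self, state_dim, chunk_dim, hidden_dim=16, seed=0):
        rng = random.Random(seed)

        def init_matrix(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.w1 = init_matrix(hidden_dim, state_dim)
        # The output layer predicts a mean and a log-variance per action dimension.
        self.w2 = init_matrix(2 * chunk_dim, hidden_dim)
        self.chunk_dim = chunk_dim

    def forward(self, state):
        hidden = [math.tanh(sum(w * s for w, s in zip(row, state))) for row in self.w1]
        out = [sum(w * h for w, h in zip(row, hidden)) for row in self.w2]
        return out[:self.chunk_dim], out[self.chunk_dim:]

    def sample(self, state, rng):
        mean, log_var = self.forward(state)
        # Reparameterised draw: x0 = mean + std * eps, with eps ~ N(0, I).
        return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
                for m, lv in zip(mean, log_var)]


prior = SourcePrior(state_dim=4, chunk_dim=6)
x0 = prior.sample([0.1, -0.2, 0.3, 0.0], random.Random(1))
```

In this sketch `x0` would be handed to an unchanged downstream generator (flow matching or a diffusion bridge) as its initial point; how the prior is trained is not covered here.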
[CV-75] Graph Neural Networks for Semi-Supervised Image Classification with Multi-Feature Aggregation
链接: https://arxiv.org/abs/2606.17406
作者: Marina Chagas Bulach Gapski,Vinicius Atsushi Sato Kawai,Gustavo Rosseto Leticio,Lucas Pascotti Valem,Daniel Carlos Guimarães Pedronette,Mohand Said Allili
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Feature extraction involves the identification and extraction of salient characteristics or patterns, including edges, textures, shapes, and color attributes. Contemporary feature extractors predominantly leverage deep learning architectures, such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). The availability of diverse feature extractors in the literature provides a wide range of feature representations. Features extracted from an image depend on the specific application, the chosen extractor, and its configuration. Therefore, integrating complementary information by combining distinct extractors offers a promising way to enhance performance. Graph Neural Networks (GNNs), particularly Graph Convolutional Networks (GCNs), have emerged as powerful and widely adopted approaches for semi-supervised image classification, as they effectively leverage both labeled and unlabeled data while exploiting the underlying graph structures that capture relationships among samples. This study proposes a novel approach for GNNs in scenarios where labeled data is scarce, by integrating diverse sets of feature and graph representations derived from various extractors in classification scenarios. Experimental investigations were conducted, encompassing combinations of distinct feature and graph extractors, as well as rank aggregation strategies. The primary contributions of this work are underscored by the experimental findings, which demonstrate that the strategic combination of feature and graph representations, coupled with the application of manifold learning for graph processing, leads to significant improvements in classification accuracy across the majority of experimental conditions. Furthermore, the utilization of rank aggregation techniques to integrate features from different extractors was shown to enhance classification accuracy.
[CV-76] Bridging Spatial And Frequency Views For Disaster Assessment: Benefits And Limitations
链接: https://arxiv.org/abs/2606.17403
作者: Shikha V. Chandel,Yadav Raj Ghimire,Timothy Agboada,Leila Hashemi-Beni
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Copyright 2026 IEEE. Published in the 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026)
Abstract:Rapid assessment of building damage from satellite imagery is essential for effective disaster response and recovery. While most deep learning methods rely on spatial-domain features, frequency-domain representations can capture complementary structural cues such as debris patterns and collapse-induced textures. This study presents a controlled comparison of spatial-domain, frequency-domain, and dual-domain deep learning approaches for multi-class building damage classification using post-disaster imagery from the xView2 (xBD) dataset. To ensure fairness, all models are built on an EfficientNet-B0 backbone and trained under identical settings, differing only in their input representations and fusion strategies. Performance is evaluated using accuracy, macro F1-score, per-class metrics, and confusion matrices. Results show that dual-domain models provide measurable improvements over single-domain approaches. The dual spatial configuration achieves the highest test accuracy (0.4688) and lowest loss, while the spatial-only model attains the best macro F1-score (0.4254), indicating more balanced class performance. In contrast, frequency-only models perform worst and exhibit overfitting, suggesting limited generalization. Despite these gains, all models struggle to detect subtle damage levels, particularly the Minor class, due to class imbalance and fine-grained visual ambiguity. While dual-domain approaches improve detection of severe damage, challenges remain. These findings highlight the benefits and limitations of hybrid representations and motivate future work on data balancing, advanced fusion, and regularization.
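The abstract reports that one configuration wins on accuracy while another wins on macro F1. A small worked example may help show how that can happen under class imbalance. The confusion matrices below are invented for illustration and are not taken from the paper; rows are true classes and columns are predicted classes.

```python
def accuracy(cm):
    """Fraction of all samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def macro_f1(cm):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp
        fn = sum(cm[k]) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


# Made-up two-class example with a 90/10 class split.
majority_biased = [[90, 0], [10, 0]]   # never predicts the rare class
balanced = [[80, 10], [4, 6]]          # trades some accuracy for rare-class recall

print(accuracy(majority_biased), macro_f1(majority_biased))
print(accuracy(balanced), macro_f1(balanced))
```

The majority-biased matrix has the higher accuracy, while the balanced one has the higher macro F1, which is the same kind of split the abstract describes between its two best models.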
[CV-77] rraTransfer: Learning End-to-End Driving Policies Without Expert Demonstrations
链接: https://arxiv.org/abs/2606.17386
作者: Zikang Xiong,Weixin Li,Zhouchonghao Wu,Akshay Rangesh,Saarth Bonde,Grantland Hall,Chen Tang,Yihan Hu,Wei Zhan
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:End-to-end autonomous driving has achieved state-of-the-art performance on benchmarks and real-world deployments. Its standard training recipe, however, is expensive across all stages: collecting and labeling millions of driving frames is costly, and closed-loop RL on images is bottlenecked by the per-step cost of photorealistic rendering plus a forward pass through a large vision backbone. Self-play in vectorized simulators changes the economics: millions of rollout steps per second, and a state distribution naturally rich in collisions, near-misses, and recoveries that no driving log contains. Our approach exploits this asymmetry by decoupling learning to drive from learning to see. We pretrain a single policy by self-play, then align its latent space with a pretrained vision backbone, through the action KL divergence and a batch-relational low-rank structural loss. The action target comes from the self-play policy, so alignment never supervises against a logged trajectory: a paired dataset of (image, scene-state) frames suffices, with no need for the curated expert demonstrations that imitation pretraining is built on. On photorealistic 3D Gaussian splatting closed-loop scenarios, the resulting end-to-end policy matches or exceeds prior end-to-end methods.
[CV-78] Improving and Evaluating Hand-Object Interaction Detection
链接: https://arxiv.org/abs/2606.17384
作者: Ahmad Darkhalil,Dima Damen,David Fouhey
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Understanding hands and the objects they interact with, both directly and through tools, is a key step for tasks ranging from action perception to 3D reconstruction and robotics. Our paper provides several contributions to the Hand-Object Interaction (HOI) understanding literature: (1) HOI-DETR, a new framework that introduces hand-object and object-object interactions to the Co-DETR architecture to produce a state-of-the-art method; (2) a comprehensive HOI evaluation suite of 4 diverse datasets, including a video benchmark derived from the HD-EPIC dataset and fresh annotations that improve the Hands23 benchmark; and (3) a trained checkpoint that significantly improves the state of the art across Hands23, HOIST, FineBio, and HD-EPIC, including mAP gains of over 20 percentage points on Hands23 and FineBio. Our ablations confirm the contributions of each model component.
[CV-79] MeiBRD: Meta-Learning Intraoperative Biomechanical Residual Deformation
链接: https://arxiv.org/abs/2606.17379
作者: Casey Meisenzahl,Jon Heiselman,Michael Holtz,Yubo Ye,Michael Miga,Linwei Wang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:
Abstract:Accurate intraoperative liver registration is challenging due to substantial soft-tissue deformation yet sparse intraoperative measurements. Biomechanical models regularize this ill-posedness with prior knowledge but exhibit persistent prediction bias due to simplifying assumptions, while data-driven learning solutions struggle with data efficiency, generalization, and physical plausibility. We propose a hybrid registration framework that adapts a biomechanical prior using sparse intraoperative correspondences. Rather than learning a full deformation field, we learn a residual deformation function that corrects linear biomechanical predictions, modeled as a graph neural diffusion function with geometry-aware attention over the 3D liver mesh. To enable long-range information transfer of sparse observations, we take a novel perspective of sparse intraoperative measurements as context samples where input-output pairs of the residual deformation function are fully observed, casting the problem into learning-to-learn this residual function from intraoperative context samples with feedforward meta-learners. Experiments on a deformable liver phantom dataset demonstrate improved registration accuracy and generalization compared to rigid, biomechanical, and data-driven baselines, particularly for out-of-distribution geometries and deformations.
[CV-80] Contactless Respiratory Monitoring on Heterogeneous Mobile Robots: A Multimodal Edge-Computing Framework
链接: https://arxiv.org/abs/2606.17376
作者: Milind Rampure,Shadman Sakib,Haley Patel,Zahid Hasan,Nirmalya Roy
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 6 figures. To appear in Proceedings of the 8th International Workshop on IoT Applications and Industry 5.0 (IoTI5 2026), co-located with IEEE DCOSS-IoT 2026, Reykjavik, Iceland, June 2026
Abstract:Respiratory-rate (RR) monitoring is a critical component of remote triage and victim assessment in emergency response, disaster recovery, and infectious-disease scenarios, where minimizing physical contact can reduce responder risk and improve operational safety. However, field deployment of contactless RR monitoring remains challenging due to variable illumination, posture changes, platform heterogeneity, and the impracticality of wearable sensors in hazardous environments. In this paper, we present a modality-adaptive contactless RR monitoring framework for heterogeneous mobile robots with onboard edge computing. The proposed system combines brightness-adaptive sensor selection across RGB, thermal, near-infrared (NIR), and low-light cameras, keypoint-guided chest ROI extraction for posture-robust monitoring, and a signal-quality-index (SQI)-based filtering mechanism for reliable respiratory estimation. We implement and evaluate the framework on three robotic platforms spanning quadruped and wheeled locomotion and multiple edge-computing architectures. Experiments conducted across diverse lighting conditions, subject poses, and robot-to-subject distances demonstrate that the framework generalizes across platforms without per-platform algorithmic retuning, while revealing modality-specific operational boundaries. RGB provides the broadest coverage up to 8 m, NIR remains effective up to 6 m, thermal is reliable only at short range, and low-light sensing supports monitoring in complete darkness up to 8 m. Overall, the results demonstrate the feasibility of multimodal contactless RR monitoring on mobile robots and support its use as a foundation for autonomous triage and victim assessment in hazardous search-and-rescue settings.
[CV-81] DriveJudge: Rethinking Autonomous Driving Evaluation with Vision-Language Models
链接: https://arxiv.org/abs/2606.17362
作者: Xinglong Sun,Kevin Xie,Jenny Schmalfuss,Despoina Paschalidou,Xiuming Zhang,Sanja Fidler,Kashyap Chitta,Jose M. Alvarez
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Under Review
Abstract:Autonomous driving has shifted towards end-to-end policy learning, where reliable, interpretable policy evaluation is a fundamental challenge as driving quality is highly context-dependent. Commonly used rule-based driving metrics like EPDMS are interpretable but lack context-awareness, while recent VLM-based evaluations are context-aware but limited by ambiguous VLM outputs and weak physical grounding. To evaluate driving in a manner that is both interpretable and context-aware, we introduce DriveJudge. DriveJudge is a driving evaluation agent that combines rule-grounded evaluation with Vision-Language Model (VLM) reasoning and selectively invokes physically-grounded deterministic rule functions after interpreting the environmental context. To train and evaluate DriveJudge, we curate a large-scale dataset of 33,577 challenging driving samples with human annotations on whether the driving behavior is reasonable in the given scenario. With this dataset, we address the underexplored problem of driving metric evaluation, and introduce two human-aligned benchmark tasks: Driving Quality Classification and Trajectory Preference Selection. DriveJudge outperforms EPDMS for driving quality classification by 21.23 AUC, and the recent VLM-based DriveCritic for trajectory preference selection by 6.5%, setting a new standard for interpretable and precise driving evaluation.
[CV-82] Complex Layout Classification in the Wild: A Low-Resource Approach with Layout-Preserving Augmentations
链接: https://arxiv.org/abs/2606.17355
作者: Sharva Gogawale,Iddo Hakim,Gal Grudka,Mohammad Suliman,Omer Ventura,Daria Vasyutinsky-Shapira,Berat Kurar-Barakat,Nachum Dershowitz
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Many digitized corpora suffer from low resources because annotations may be scarce, page scans are noisy and of poor resolution, or layouts are structurally complex in ways that negatively affect the quality of automatic transcription. Developing robust classification models for low-resource languages is inhibited by the lack of large-scale annotated data and by the frequent semantic complexity of page layouts. To this end, we have curated a complex-layout dataset, manually classified into eight distinct layout types based on their separator regions. To overcome data scarcity, we propose a novel training strategy in the form of a CNN-based classifier that employs strong, domain-aware augmentations to improve generalization. We utilize narrow anisotropic Gaussian masking to suppress incidental textual details while preserving essential separations, compelling the model to learn global geometric arrangements. Additionally, we implement reflection-induced label transformations to enrich the training distribution while maintaining label consistency across asymmetric categories. The results demonstrate that layout-specific augmentations can substantially improve page-level layout classification under severe annotation scarcity.
[CV-83] MM++: Unsupervised Scale-Invariant Multilayer OOD Detection via Top-K Gated Feature Fusion
链接: https://arxiv.org/abs/2606.17352
作者: Rahim Hossain,Md Tawheedul Islam Bhuian,Md Farhan Shadiq,Kyoung-Don Kang
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce MM++ (Multilayer Mahalanobis++), a fully unsupervised, strictly post-hoc, and scale-invariant framework for Out-of-Distribution (OOD) detection. To address the trade-off between scale invariance and hierarchical expressivity, MM++ constructs a principled joint feature space. It first identifies discriminative intermediate layers by measuring entropy density drops, which mark the boundaries of sharp semantic compression. By fusing these selected layers with the terminal representation, the framework captures latent cross-layer correlations while mitigating early-layer noise. Crucially, a Ledoit-Wolf regularized tied covariance matrix stabilizes this unified space, enabling reliable distance estimation. Requiring no auxiliary OOD data, classifier fine-tuning, or architectural modifications, MM++ delivers robust performance across distinct architectures for both near- and far-OOD detection.
[CV-84] Bayesian Magnetic Resonance Joint Image Reconstruction and Uncertainty Quantification using Sparsity Prior Models and Markov Chain Monte Carlo Sampling
链接: https://arxiv.org/abs/2606.17343
作者: Ahmed Karam Eldaly,Matteo Figini,Daniel C. Alexander
类目: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
备注:
Abstract:We propose a novel framework for uncertainty quantification using compressed sensing magnetic resonance image reconstruction. The problem is formulated within a Bayesian framework as a linear inverse problem, with prior distributions assigned to the unknown model parameters. Specifically, the image to be reconstructed is assumed to be sparse in a given basis. We develop a general framework applicable to any basis and as examples, we test the sparsity of the image in its (1) spatial gradients using a total variation prior model, and in its (2) wavelet transform. A Markov chain Monte Carlo (MCMC) method, based on a split-and-augmented Gibbs sampler, is then employed to sample from the posterior distribution of the unknown parameters. The non-differentiable conditional distributions are efficiently sampled using a proximal MCMC method. The proposed algorithms are validated on both single-coil and multi-coil datasets using various k-space sub-sampling patterns and ratios. The results demonstrate the superior performance of each proposed approach in reconstructing images compared to its counterpart optimisation-based method. Moreover, our framework effectively quantifies uncertainty, showing a notable correlation between estimated uncertainty maps and error maps computed using ground truth and reconstructed images, compared with existing deep learning-based methods.
[CV-85] Learning a Maximum Entropy Model for Visual Textures using Diffusion
链接: https://arxiv.org/abs/2606.17342
作者: Xinyuan Zhao,Eero P. Simoncelli
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Visual textures – spatially homogeneous image regions containing repeated elements (e.g. a field of grass, the bark of a tree) – are ubiquitous in visual scenes and provide important cues for recognizing and analyzing materials and objects. A number of existing texture models extract essential statistics from a single texture image, and can then generate high-quality samples that are visually similar to the original by matching these statistics. However, their statistics are either hand-designed or based on a network pretrained for another purpose (e.g., object recognition). Here, we develop the first principled method for unsupervised learning of a set of statistics that are used to constrain a maximum entropy probability model. We leverage methods developed for generative diffusion models to derive training and sampling procedures, and compare these to the traditional method of sampling via matching the statistics. Despite the compactness of our trained model (512 statistics), it generates texture images whose quality is as good as or better than the current state-of-the-art model (~177k statistics). A more direct comparison of the two models, obtained by synthesizing images that are indistinguishable for one model but maximally different for the other, reveals their relative strengths and weaknesses. Finally, we show that unlike previous statistical texture models, a straight trajectory in the representation space of our model generates homogeneous texture samples that interpolate smoothly between the features of the two end points.
[CV-86] Geometry-Consistent Endoscopic Representations for Image-Guided Navigation via Structured Foundation Model Adaptation
链接: https://arxiv.org/abs/2606.17340
作者: Hongchao Shu,Roger D. Soberanis-Mukul,Hao Ding,Morgan Ringel,Mali Shen,Saif Iftekar Sayed,Hedyeh Rafii-Tari,Mathias Unberath
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate vision-based navigation in monocular endoscopy is difficult due to limited depth cues, weak tissue texture, non-rigid deformation, and substantial appearance variation across domains, all of which complicate pose estimation, depth prediction, and image-to-anatomy alignment. Although recent vision foundation models have shown promise, their learned representations often remain insufficiently geometry-consistent, hindering stable feature correspondence and limiting their reliability for downstream navigation tasks. We propose a unified framework for learning geometry-consistent and domain-robust image representations for monocular endoscopy. The framework combines a synthetic data pipeline that provides accurate geometric supervision with Hierarchy-Aware Geometry-Semantic Adaptation, a structured alternative to standard LoRA that inserts low-rank adapters selectively across the transformer hierarchy and couples them with layer-wise training objectives to encourage geometric correspondence in intermediate features and semantic consistency in deeper features. Experiments on public and proprietary datasets show improved geometric and semantic representation quality, leading to better performance on downstream navigation tasks including pose estimation and monocular depth estimation. The learned representations show favorable synthetic-to-real transfer on clinical bronchoscopy and provide a useful initialization for adaptation to sinus endoscopy and colonoscopy under limited supervision. The framework also shows favorable scaling with model size and training data. These results support hierarchy-aware, geometry-guided adaptation as a practical approach for endoscopic representation learning.
[CV-87] FATE: Pillar Encoding and Frequency-Aware Training for Event-Based Object Detection
链接: https://arxiv.org/abs/2606.17334
作者: Md Tawheedul Islam Bhuian,Kyoung-Don Kang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Event cameras are bio-inspired sensors that asynchronously capture logarithmic intensity changes, offering inherent advantages in high-speed and high-dynamic-range scenarios. However, the sparse and asynchronous nature of event streams poses a fundamental challenge for modern deep learning architectures. To enable compatibility with standard models, most existing approaches partition the accumulation window into fixed temporal sub-bins. While effective for spatial processing, this internal discretization discards fine-grained temporal structure and constrains inference to the low temporal frequencies imposed by training supervision. To address this limitation, we propose FATE, a unified framework built upon a novel Pillar Encoding (PE). While operating over discrete macro-accumulation windows dictated by the target frequency, PE avoids internal temporal sub-binning. It organizes events into spatial pillars and approximates their intra-window evolution via projection onto a continuous-time orthogonal polynomial basis. This formulation yields an L2-optimal representation that retains rich temporal dynamics in a dense pseudo-image, mitigating information loss under sparse event conditions. To fully leverage this representation, we introduce Frequency-Aware Training (FAT), a soft mean-teacher curriculum that generates temporally dense pseudo-labels, effectively bridging the mismatch between low-frequency supervision and high-frequency inference. Extensive experiments demonstrate that FATE generalizes across architectural paradigms and consistently outperforms strong baselines. It enables robust object detection at high temporal resolutions up to 200 Hz, while incurring minimal overhead in parameter count and inference latency.
[CV-88] ProCUA-SFT Technical Report
链接: https://arxiv.org/abs/2606.17321
作者: Jaehun Jung,Ximing Lu,Brandon Cui,Muhammad Khalifa,Shaokun Zhang,Hao Zhang,Jin Xu,Amala Sanjay Deshmukh,Karan Sapra,Andrew Tao,Yejin Choi,Jan Kautz,Mingjie Liu,Yi Dong
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 5 figures
Abstract:Training computer-use agents (CUAs) – models that interact with graphical desktops through screenshots and keyboard/mouse actions – requires large-scale, diverse trajectory data collected in full desktop environments. The largest public resource, AgentNet (22.5K human trajectories), leads to negative transfer when used for supervised fine-tuning (SFT): continuing training UI-TARS 7B on AgentNet causes OSWorld success rate to fall from 26.3% to 8-10%. We present ProCUA-SFT, a dataset of 3.1M step-level SFT samples distilled from 93K synthetic trajectories across 2,484 application combinations. The dataset is produced by a fully automated pipeline that (i) synthesizes grounded tasks on live desktops seeded with real-world content – 912 spreadsheets from SpreadsheetBench, approximately 10K permissively-licensed presentations from Zenodo10K, and multi-application OSWorld configs – and (ii) verifies each task’s feasibility through binary precondition checking before rollout. A single VLM (Kimi-K2.5) serves as goal generator, precondition judge, and trajectory executor, eliminating planner-actor capability gaps. Each trajectory is expanded into step-prefix samples that exactly reproduce the context layout seen at inference time. Fine-tuning UI-TARS 7B on ProCUA-SFT for one epoch yields 45.0% on OSWorld – an 18.7 percentage-point improvement over the base model and over 35% above AgentNet-trained counterparts. A subset of ProCUA was incorporated into the training data for the Nemotron 3 Nano Omni model, contributing to its computer-use capabilities.
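The step-prefix expansion mentioned above can be sketched as below. The abstract does not spell out the exact context layout, so the field names (`task`, `history`, `observation`, `target_action`) and the example task are assumptions made for illustration: each step of a trajectory becomes one sample whose context is everything that happened before that step.

```python
def step_prefix_samples(task, trajectory):
    """Expand one trajectory into one training sample per step.

    trajectory: list of (observation, action) pairs in execution order.
    """
    samples = []
    for t, (observation, action) in enumerate(trajectory):
        samples.append({
            "task": task,
            "history": list(trajectory[:t]),  # steps already executed
            "observation": observation,       # what the agent sees now
            "target_action": action,          # what the model should output
        })
    return samples


# Hypothetical three-step trajectory with placeholder screenshots and actions.
demo_trajectory = [
    ("screenshot_0", "click(120, 48)"),
    ("screenshot_1", "type('Q3 budget')"),
    ("screenshot_2", "press('enter')"),
]
demo = step_prefix_samples("Rename the sheet", demo_trajectory)
```

If every step yields exactly one sample, the reported 3.1M samples from 93K trajectories work out to roughly 33 steps per trajectory on average.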
[CV-89] SierpinskiCam: Camera-Controlled Video Retaking with Sierpinski Triangle Pattern Cues
链接: https://arxiv.org/abs/2606.17310
作者: Suttisak Wizadwongsa,Hyelin Nam,Supasorn Suwajanakorn,Jeong Joon Park
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 13 figures
Abstract:Generating novel renderings of a scene along user-defined camera trajectories from a single monocular video, dubbed video retaking, is a compelling but difficult problem in content creation and visual effects. Existing geometry-guided approaches reconstruct a 4D representation from the source video and render it along the target trajectory to condition video diffusion models. However, this guidance degrades as the target camera departs from the source trajectory, leaving newly revealed regions sparse or entirely missing. We propose SierpinskiCam, which addresses this limitation by augmenting geometry-based guidance with Sierpinski dome texture cues that contain rich trackable features even under large viewpoint changes. We further introduce a reference video conditioning mechanism that appends source-video tokens to the target-token sequence and separates the two streams with negative RoPE indices, enabling appearance grounding without architectural modification or per-video adaptation. Extensive experiments show that SierpinskiCam achieves significant gains in camera controllability, geometric consistency, and video quality across diverse and challenging retaking scenarios. Project page: this https URL.
[CV-90] Reasoning Text-to-Video Retrieval for Operating Room Clips via Action-Driven Digital Twins
链接: https://arxiv.org/abs/2606.17298
作者: Yiqing Shen,Hao Ding,Mathias Unberath
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Text-to-video retrieval in operating rooms (OR) is an enabling technology for OR safety, as it allows stakeholders to retrieve and inspect recordings of specific events. However, because the most safety-critical events may not follow the common structure, text-to-video retrieval can only reach its full potential if it handles implicit queries that require reasoning to identify the right video (e.g., the step right before clipping). Existing methods rely on global embeddings that cannot reason over such queries. We propose OR3, a text-to-video retrieval method that converts clips into action-driven digital twins (ActDTs), grouping concurrent subject-action-object triplets under non-overlapping temporal intervals. Moreover, rather than cross-modal matching through paired encoders, OR3 performs imagination-based retrieval where an LLM generates hypothetical ActDTs from queries. This enables intra-modal matching via a single encoder trained with ActDT-tailored hard negatives. Finally, evidence-grounded refinement revises imagined ActDTs based on discrepancies with top candidates to capture procedure-specific patterns. We construct a benchmark from MM-OR with 276 implicit queries across four reasoning categories over 386 clips from robotic knee procedures. OR3 achieves 57.6 R@1 and 77.3 R@5, outperforming the strongest baseline. These results demonstrate that OR3 enables fine-grained discrimination between visually similar OR video clips through temporal action reasoning.
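One plausible way to group concurrent subject-action-object triplets under non-overlapping temporal intervals is to cut the timeline at every start and end time and list the triplets active in each piece. The sketch below does that; the function name `group_concurrent`, the tuple layout, and the example triplets are assumptions for illustration, and the paper's actual ActDT construction may differ.

```python
def group_concurrent(triplets):
    """Split the timeline at every boundary and list the triplets active in each piece.

    triplets: list of (subject, action, obj, start, end) with start < end.
    Returns a list of (interval_start, interval_end, active_triplets).
    """
    boundaries = sorted({t for trip in triplets for t in trip[3:5]})
    intervals = []
    for lo, hi in zip(boundaries, boundaries[1:]):
        # A triplet is active if it covers the whole elementary interval.
        active = [(s, a, o) for s, a, o, start, end in triplets
                  if start <= lo and end >= hi]
        if active:
            intervals.append((lo, hi, active))
    return intervals


# Made-up annotations (times in seconds), not taken from MM-OR.
example = [
    ("surgeon", "holds", "saw", 0, 10),
    ("nurse", "passes", "tool", 5, 8),
    ("surgeon", "drills", "bone", 10, 12),
]
twin = group_concurrent(example)
```

Because the intervals never overlap, a query such as "the step right before drilling" reduces to looking at the interval that precedes the first one containing the drilling triplet.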
[CV-91] Pareto LoRA: Mitigating Modality Imbalance in Unified Multimodal Models via Pareto-Optimal Gradient Integration
链接: https://arxiv.org/abs/2606.17296
作者: Xiwen Wei,Mark Nutter,Madhusudhanan Srinivasan,Radu Marculescu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Unified multimodal models (UMMs) have recently emerged as a promising paradigm for integrating multimodal understanding and generation within a single autoregressive transformer. However, during multimodal instruction tuning, these models often exhibit pronounced modality imbalance: language gradients dominate optimization, thus leading to lower image generation quality, especially under parameter-efficient fine-tuning such as LoRA. In this work, we systematically analyze modality imbalance in LoRA-based fine-tuning of UMMs for interleaved text-image generation. We show that vision modality performance degrades substantially more than text modality performance when compared to unimodal counterparts, and that modality-specific gradients can differ by orders of magnitude across various tasks and layers. Motivated by this observation, we reformulate the multimodal instruction tuning as a bi-objective optimization problem and propose Pareto LoRA, a Pareto-optimal gradient integration strategy that balances the text and image objectives by modulating the gradient direction and strength. Experiments on the CoMM benchmark with Emu2 demonstrate that Pareto LoRA consistently improves multimodal generation balance, achieving up to 44.9% gains in perceptual image quality over vanilla LoRA while maintaining comparable text performance.
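The abstract does not give the update rule, so the following is only a sketch of one standard way to combine two task gradients into a common descent direction: the closed-form minimum-norm point in the convex hull of the two gradients, as used in MGDA-style multi-objective optimisation. It is a stand-in for illustration, not Pareto LoRA itself, and it addresses only the direction of the combined gradient, not its strength. The function names and the toy gradients are assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def min_norm_combination(g_text, g_image):
    """Return (alpha, d) with d = alpha * g_text + (1 - alpha) * g_image of minimum norm."""
    diff = [a - b for a, b in zip(g_text, g_image)]
    denom = dot(diff, diff)
    if denom == 0.0:
        alpha = 0.5  # identical gradients: any weighting gives the same vector
    else:
        # Unconstrained minimiser ((g_image - g_text) . g_image) / ||g_text - g_image||^2,
        # clipped to [0, 1] so the result stays in the convex hull.
        alpha = dot([-x for x in diff], g_image) / denom
        alpha = min(1.0, max(0.0, alpha))
    combined = [alpha * a + (1.0 - alpha) * b for a, b in zip(g_text, g_image)]
    return alpha, combined


# Toy gradients where the text gradient is ten times larger than the image gradient.
g_text, g_image = [10.0, 0.0], [0.0, 1.0]
alpha, d = min_norm_combination(g_text, g_image)
```

With these toy values a plain sum of the gradients would point almost entirely along the text gradient, whereas the minimum-norm combination makes equal progress on both objectives in the sense that `dot(d, g_text)` equals `dot(d, g_image)`.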
[CV-92] Training LLMs with Reinforcement Learning over Digital Twin Representations for Reasoning-Intensive Surgical VideoQA
链接: https://arxiv.org/abs/2606.17279
作者: Yiqing Shen,Han Zhang,Mathias Unberath
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Surgical video question answering requires multi-step reasoning across semantic, spatial, and temporal dimensions. Existing methods architecturally compress videos into discrete token representations and couple visual perception with reasoning. This approach fragments continuous spatial-temporal relationships and has been shown to restrict multi-step reasoning capabilities. We introduce a reinforcement learning (RL) framework that trains large language models (LLMs) to decouple perception from reasoning by operating over digital twin representations constructed from surgical foundation models. Additionally, we introduce hierarchical representations across frame, temporal window, and procedure levels with probabilistic uncertainty estimates. Finally, we propose a novel reward that combines format validation with accuracy assessment through clinical plausibility evaluation and uncertainty-aware calibration for training. To demonstrate the capabilities of this approach, we introduce REAL-Colon-Reason, a colonoscopic benchmark with 2000 question-answer pairs across three complexity levels. We achieve state-of-the-art performance on REAL-Colon-Reason and two existing surgical VideoQA benchmarks REAL-Colon-VQA and EndoVis18-VQA.
[CV-93] Pulling The REINS: Training-Free Safety Alignment of Video Diffusion Models via Representation Steering
Link: https://arxiv.org/abs/2606.17257
Authors: Rohit Kundu,Arindam Dutta,Sarosij Bose,Athula Balachandran,Amit K. Roy-Chowdhury
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Open-weight video diffusion models can generate photorealistic unsafe content, from violence to misinformation, yet existing defenses either require expensive safety fine-tuning that degrades general capability, or apply external filters that are trivially bypassed by adversarial prompts. We present REINS (REpresentation-space INference-time Safety steering), a training-free method that aligns video diffusion models at inference time by steering their internal representations toward safe generation. Our key finding is that safety-relevant structure is linearly encoded in the hidden-state activations of video diffusion transformers, and a single direction, discovered via Supervised PCA on binary safety labels, suffices to separate safe from unsafe generation trajectories. At inference, adding this direction to hidden states at an intermediate transformer layer redirects generation from harmful content to semantically related safe alternatives, with no weight updates, no concept enumeration, and negligible computational overhead. Through mechanistic analysis, we reveal that while safety information accumulates monotonically with transformer depth, steering effectiveness peaks at intermediate layers (~50% depth), exposing a fundamental tradeoff between information availability and downstream propagation capacity. We evaluate REINS across 9 video diffusion models, multiple parameter scales (1.3B-5B), and both text-to-video and image-to-video generation, to our knowledge, the broadest safety evaluation suite in the video generation literature.
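The core operation named in the abstract, adding a single direction to hidden states at an intermediate layer, can be sketched in a few lines. This is a minimal illustration on plain Python lists. The names `steer` and `projection`, the unit-normalisation of the direction, and the `strength` parameter are assumptions, and the direction itself would come from the Supervised PCA step, which is not shown.

```python
import math
from typing import List


def unit(v: List[float]) -> List[float]:
    """Normalise a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("steering direction must be non-zero")
    return [x / norm for x in v]


def steer(hidden: List[List[float]], direction: List[float],
          strength: float) -> List[List[float]]:
    """Add strength * unit(direction) to every token's hidden state."""
    d = unit(direction)
    return [[h + strength * di for h, di in zip(token, d)] for token in hidden]


def projection(token: List[float], direction: List[float]) -> float:
    """Scalar projection of one hidden state onto the steering direction."""
    d = unit(direction)
    return sum(h * di for h, di in zip(token, d))


# Two tokens with 2-dimensional hidden states (toy sizes).
hidden = [[1.0, 0.0], [0.0, 2.0]]
safe_direction = [0.0, 3.0]  # hypothetical learned direction
steered = steer(hidden, safe_direction, strength=2.0)
```

Each token's projection onto the direction shifts by exactly `strength`, while components orthogonal to the direction are left untouched; no weights are modified.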
[CV-94] Contrastive Action-Image Pre-training for Visuomotor Control
Link: https://arxiv.org/abs/2606.17256
Authors: Yuvan Sharma,Dantong Niu,Anirudh Pai,Zekai Wang,Zhuoyang Liu,Baifeng Shi,Stefano Saravalle,Boning Shao,Ruijie Zheng,Jing Wang,Konstantinos Kallidromitis,Yusuke Kato,Fabio Galasso,Yuke Zhu,Danfei Xu,Linxi “Jim” Fan,Jitendra Malik,Trevor Darrell,Roei Herzig
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Existing vision encoders for robotics face a fundamental bottleneck: robotic datasets lack the scale necessary for large-scale pre-training. Prior work circumvents this data scarcity by turning to internet-scale image and language data or egocentric human video. While these models show promise, neither paradigm learns from paired vision and action data, which downstream visuomotor control policies require. However, robot trajectories, the most direct source of this paired signal, are not available at pre-training scale, motivating us to extract action signals from abundant human video instead. To this end, we introduce CAIP (Contrastive Action-Image Pre-training), a vision encoder that treats human hand poses from large-scale egocentric video as a proxy for end-effector actions. By extracting 3D hand keypoints, a representation that aligns naturally with downstream robot action spaces, CAIP learns a unified action-image representation through a contrastive objective. Leveraging 32,041 hours of egocentric human video and only 88 hours of robotic manipulation data, CAIP outperforms state-of-the-art vision encoders including DINOv2, SigLIP, MVP, and R3M. Evaluated on a challenging real-world dexterous manipulation setup using Dexmate Vega and Sharpa Wave hands, CAIP yields performance gains of more than 30% on tasks involving folding, pouring, and fine-grained manipulation. Our results show that our method of contrastive action-centric pre-training yields a scalable path to achieving robust visual representations better suited for physical interaction.
[CV-95] Landsat-Sentinel-2 Algal Bloom Mapping Using Vision Transformers: Model Description, Implementation, and Examples
Link: https://arxiv.org/abs/2606.17242
Authors: Thainara Lima,Vitor Martins
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Coastal algal bloom monitoring requires frequent, spatially detailed, and globally consistent observations, provided by Landsat-8/9 and Sentinel-2 A/B/C. Together, these missions offer over a decade of medium-resolution multispectral imagery with near-global coverage every 2-3 days, enabling the detection of fragmented bloom structures not resolvable by coarse ocean-color sensors. However, their use in aquatic environments remains challenging due to limited spectral coverage and a lack of harmonized reflectance products. As an alternative to traditional bio-optical methods, deep learning-based image classification offers a data-driven approach that can overcome many of these limitations. This study presents the first successful implementation of vision transformer-based coastal algal bloom mapping using 30-m Landsat-Sentinel-2 images. A globally distributed bloom patch dataset was generated across bloom-prone coastal hotspots worldwide. Four transformer-based architectures were compared against a standard convolutional baseline for fine-scale bloom detection, and assessed under different optical water types and atmospheric and surface conditions. All deep learning models showed strong capabilities in detecting floating bloom areas, with omission and commission errors of 8-65%. Under cloud and glint stress in a time series, the Swin Transformer outperformed traditional spectral-index approaches, which produced widespread false positives, effectively avoiding cloud- and glint-affected pixels. Comparisons with MODIS-derived products further highlighted the benefits of higher spatial resolution in detecting fragmented and irregularly affected blooms. Our findings support deep learning as a reliable tool for medium-resolution, consistent monitoring of floating algal blooms in dynamic coastal environments.
[CV-96] Beyond Benchmarks: Continuous Edge Inference for Fine-Grained Roadside Perception
Link: https://arxiv.org/abs/2606.17241
Authors: Aditya Mishra,Haroon Lone
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Systems and Control (eess.SY)
Comments:
Abstract:Continuous AI inference on resource-constrained edge hardware introduces deployment effects that are largely invisible to conventional benchmark evaluation, including temporal instability in streaming video, thermal throttling under sustained load, and workload-dependent performance variability. We present Edge-TSR, a deployment-oriented continuous edge inference system for sustained roadside perception on the NVIDIA Jetson Orin Nano. Edge-TSR integrates detection, tracking, fine-grained classification, and a lightweight track-aware temporal stabilization mechanism that improves streaming inference consistency with negligible computational overhead. Our central finding is that benchmark-centric evaluation systematically overstates deployed edge inference performance. Across three state-of-the-art baselines, we observe consistent 20-30% relative degradation when transitioning from static-image evaluation to real-world streaming deployment. Edge-TSR addresses this gap through temporal inference stabilization, recovering up to 10.16% classification accuracy over per-frame inference baselines while maintaining sustained real-time performance under continuous operation. We evaluate the complete system under diverse real-world deployment conditions, jointly characterizing inference quality, latency, throughput, and thermal behavior during long-duration operation. A 55-minute vehicular deployment over a 26 km route demonstrates sustained operation at 16.18 FPS within safe thermal limits on a single embedded device without cloud offload. Our findings show that deployment-aware evaluation and temporal inference stabilization are necessary components of continuously operating edge AI systems intended for real-world sensing deployments. We release a sample annotated streaming video evaluation dataset and full system implementation to support reproducible deployment-centric evaluation.
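The abstract mentions a lightweight track-aware temporal stabilization step but does not specify it. One plausible form, shown here purely as an assumption, is a confidence-weighted vote over a short per-track window of recent per-frame predictions. The class name `TrackStabilizer` and the `window` parameter are hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Tuple


class TrackStabilizer:
    """Smooth per-frame class labels within each tracked object.

    Assumed scheme: keep the last `window` (label, confidence) pairs per
    track and return the label with the highest summed confidence.
    """

    def __init__(self, window: int = 5) -> None:
        self.history: Dict[int, Deque[Tuple[str, float]]] = defaultdict(
            lambda: deque(maxlen=window)
        )

    def update(self, track_id: int, label: str, confidence: float) -> str:
        """Record one per-frame prediction and return the stabilised label."""
        self.history[track_id].append((label, confidence))
        scores: Dict[str, float] = defaultdict(float)
        for past_label, past_conf in self.history[track_id]:
            scores[past_label] += past_conf
        return max(scores, key=lambda name: scores[name])


stabilizer = TrackStabilizer(window=5)
stabilizer.update(1, "stop", 0.9)
stabilizer.update(1, "stop", 0.8)
# A single low-confidence flicker does not change the track's label.
label = stabilizer.update(1, "yield", 0.6)
```

The cost per frame is a sum over at most `window` entries per track, which is consistent with the abstract's claim of negligible overhead, although the actual mechanism in Edge-TSR may differ.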
[CV-97] Quantum Enhanced Multi-Scale CNN with Bi-directional Mamba for Crop Field Analysis
Link: https://arxiv.org/abs/2606.17222
Authors: Mohammad Salman Khan,Ehsan Atoofian,Saad B. Ahmed
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Hyperspectral image (HSI) crop analysis is essential for precision agriculture because it captures rich spectral and spatial information for accurate crop monitoring and assessment. However, HSI classification remains challenging due to high spectral dimensionality, spatial complexity, class imbalance, and limited labeled samples. To address these challenges, this paper proposes a BiSpectral Mamba-based framework that combines multi-scale convolutional feature extraction, spectral attention, bidirectional state-space modeling, and quantum-inspired learning. A multi-scale CNN backbone first extracts hierarchical spatial-spectral representations through feature fusion across multiple resolutions. A spectral attention mechanism then emphasizes informative bands while suppressing redundant and noisy channels. The refined features are processed by a BiSpectral Mamba module that captures long-range dependencies in both forward and backward directions by modeling hyperspectral feature maps as sequential tokens. In addition, class-weighted optimization and feature fusion strategies are incorporated to improve training stability and mitigate class imbalance. Experimental evaluation on the UAVHSI-Crop dataset demonstrates the effectiveness of the proposed framework, achieving an overall accuracy of 84.83%. The results show that integrating convolutional, attention-based, and state-space modeling components enables robust spatial-spectral feature learning for crop classification. The proposed framework also shows potential for broader agricultural and remote sensing applications, including crop disease detection, yield prediction, and soil moisture estimation, while highlighting the effectiveness of structured state-space and quantum-inspired architectures for hyperspectral image analysis.
[CV-98] HRDX: A Large-Scale Vector HD-Map Dataset
Link: https://arxiv.org/abs/2606.17080
Authors: Sahith Reddy Chada,Isht Dwivedi,Nirav Savaliya
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL
Abstract:Reliable autonomous driving requires vectorized HD maps that are geometrically accurate, semantically rich, and scalable to long-horizon driving. However, existing public HD map datasets are limited in scale, provide sparse semantic attributes, and lack modalities such as aerial imagery that could enable new research directions. We present HRDX, a large-scale dataset for vector HD-map construction, spanning about 40 hours (1,400 km) of minimally overlapping drives, which is several times larger than prior public HD map datasets. Data is captured using six synchronized surround cameras, a 128-beam LiDAR, and centimeter-level RTK GNSS/IMU, and is further complemented by precisely aligned aerial orthoimagery. Annotations cover 10 vector map classes, complemented with over 20 semantic and topological attributes. To evaluate this richer ontology, we introduce the Composite Score (CS) to jointly assess geometric fidelity and attribute correctness. Benchmark experiments show that HRDX’s scale improves online vector-map construction, and that aligned aerial imagery provides a useful structural prior: using aerial imagery at training and/or inference improves geometric map quality, while aerial-augmented teachers can transfer part of this benefit to camera-only students without increasing inference-time sensor requirements. HRDX is intended to support reproducible research on large-scale HD-map learning, multimodal BEV fusion, and training-time privileged information. HRDX dataset and benchmarks are available at this https URL
[CV-99] Two-Stage Fine-Tuning of ResNet50 for High-Sensitivity Melanoma Detection on Dermoscopic Images
Link: https://arxiv.org/abs/2606.17504
Authors: Aryan Bhagat
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 4 figures, 4 tables. Code available at this https URL
Abstract:Melanoma is the most dangerous form of skin cancer with five-year survival rates exceeding 99% when detected early but falling sharply once the disease spreads. This paper proposes and evaluates a two-stage fine-tuning approach for ResNet50 applied to binary melanoma classification on dermoscopic images. The core challenges addressed are class imbalance and suboptimal transfer learning from single-stage fine-tuning. After stratified train/validation/test splitting, random oversampling was applied exclusively to the training set to achieve a 1:1 class balance. Stage 1 trained only the classification head with the ResNet50 base frozen, while Stage 2 fine-tuned all layers jointly at a low learning rate of 1e-5 to prevent catastrophic forgetting of learned visual features. On an independent test set of 3,826 images, the model achieved an AUC-ROC of 0.9559, accuracy of 88.34%, sensitivity of 87.56%, specificity of 89.13%, and F1-score of 88.29%. An ablation study confirms the two-stage protocol significantly outperforms single-stage fine-tuning, with sensitivity gains of over 4%. Grad-CAM visualizations demonstrate correct lesion localization. A fully deployable Streamlit detection application is provided alongside all training code.
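The abstract stresses that random oversampling is applied only to the training split, after the stratified split, to reach a 1:1 class balance. A minimal sketch of that step is given below. The function name `oversample_to_balance` and the toy file names are hypothetical, and the paper's own code may implement it differently.

```python
import random
from typing import Dict, Hashable, List, Sequence, Tuple


def oversample_to_balance(
    samples: Sequence[str], labels: Sequence[Hashable], seed: int = 0
) -> Tuple[List[str], List[Hashable]]:
    """Duplicate random minority-class samples until all classes match the
    largest class. Call this on the training split only, so validation and
    test sets keep their original class distribution."""
    rng = random.Random(seed)
    by_class: Dict[Hashable, List[str]] = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    target = max(len(items) for items in by_class.values())
    out_samples: List[str] = []
    out_labels: List[Hashable] = []
    for label, items in by_class.items():
        # Draw with replacement to fill the gap up to the target size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Toy training split: 8 benign images, 2 melanoma images.
train_x = [f"benign_{i}.jpg" for i in range(8)] + ["mel_0.jpg", "mel_1.jpg"]
train_y = [0] * 8 + [1] * 2
balanced_x, balanced_y = oversample_to_balance(train_x, train_y)
```

Oversampling before the split would leak duplicated images into validation and test data, which is why the order of operations described in the abstract matters.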
[CV-100] Phenotyping TPF via Self-Supervised Learning: A Label-Agnostic Framework with Expert Validation
Link: https://arxiv.org/abs/2606.17295
Authors: Miral Elnakib,Muhammad Saad,Ahmad Al-Kabbany
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The full potential of artificial intelligence in tibial plateau fracture characterisation remains unrealised, constrained by a fundamental dependency on labelled datasets whose consistency cannot be guaranteed: conventional classification schemes such as Schatzker and AO/OTA suffer from inter-observer variability, causing supervised models to learn human disagreement rather than stable fracture morphology. We design, implement, and validate a label-agnostic framework that eliminates this constraint by learning fracture representations directly from imaging data without observer-assigned labels. A RadImageNet-pretrained ResNet-50 encoder is fine-tuned on 154 cleaned knee radiographs using the SimCLR contrastive objective, preceded by a data cleaning protocol and followed by UMAP dimensionality reduction and k-means clustering to discover four imaging-derived phenotypes. Phenotype validity is assessed through a blinded expert review protocol administered to two independent clinicians. The four phenotypes demonstrate robust stability (bootstrap ARI = 0.319 +/- 0.041), strong internal cohesion (silhouette = 0.511), and coherence ratings of 3-5/5 from both reviewers under blinded conditions; one phenotype was unanimously identified as exhibiting comminution – a high-complexity feature isolated without any supervisory signal. Inter-partition comparison against Schatzker labels yields ARI = 0.013, confirming orthogonality to conventional classification boundaries. Notably, expert reviewers anchored to established classification vocabularies perceived imaging-derived groups as heterogeneous precisely where Schatzker alignment was lowest, suggesting that Schatzker-trained perception and label-agnostic embedding geometry measure orthogonal dimensions. These findings establish label-agnostic SSL phenotyping as a reproducible and clinically interpretable complement to conventional classification.
Artificial Intelligence
[AI-0] Visual Verification Enables Inference-time Steering and Autonomous Policy Improvement
Link: https://arxiv.org/abs/2606.18247
Authors: Mingtong Zhang,Dhruv Shah
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Website: this https URL
Abstract:Robots deployed in the real world should learn from their experience and improve over time. This requires a mechanism of practicing and learning from feedback. In this paper, we propose VERITAS, a generator-verifier framework for generalist robot policies for inference-time policy steering and self-improvement. We use a pre-trained generalist robot policy as a "generator" and pair it with a gradient-free "visual verifier" that evaluates actions at inference time. This framework enables inference-time steering that improves policy performance without additional training. We demonstrate that inference-time verification consistently outperforms vanilla generalists without training on additional demonstration data. Additionally, we demonstrate that the verified rollouts provide effective supervision for offline policy improvement: policies fine-tuned on verified self-generated trajectories achieve consistent performance gains. Notably, we find that post-training with verified rollouts achieves comparable efficiency to expert demonstrations, while requiring no human interventions. Our results highlight inference-time verification as a practical and scalable mechanism for improving robotic policies during deployment.
[AI-1] EvolveNav: Proactive Preflection and Self-Evolving Memory for Zero-Shot Object Goal Navigation
Link: https://arxiv.org/abs/2606.18235
Authors: Qi Chai,Wenhao Shen,Nanjie Yao,Yue Xia,Kaiyong Zhao,Jie Ma,Guosheng Lin,Hao Wang
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Zero-Shot Object-Goal Navigation (ZS-OGN) requires embodied agents to explore and locate target objects without any prior training. To this end, recent methods leverage foundation models. But they typically rely on static priors and lack adaptation, which leads to repeated errors and costly trial and error. In this paper, we propose a self-evolving ZS-OGN framework that enables continuous test-time improvement. Specifically, we build an agentic rule memory by extracting actionable knowledge from past trajectories. Then, we propose a retrieval strategy based on upper confidence bound, selecting effective rules by balancing semantic relevance and historical success. In addition, we introduce a memory-guided preflection module that forecasts potential outcomes before action, reducing inefficient exploration. Extensive experiments show that our method outperforms existing zero-shot baselines, achieving a 10.1% improvement in success rate with fewer unnecessary steps.
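The retrieval strategy in the abstract balances semantic relevance against historical success using an upper confidence bound. The exact scoring formula is not given, so the sketch below assumes a simple additive form: relevance plus the empirical success rate plus a standard UCB1-style exploration bonus. The `Rule` fields, `ucb_score`, `retrieve`, and the constant `c` are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Rule:
    text: str          # actionable knowledge extracted from a past trajectory
    relevance: float   # semantic similarity to the current query, in [0, 1]
    successes: int     # episodes where applying the rule helped
    uses: int          # episodes where the rule was retrieved


def ucb_score(rule: Rule, total_uses: int, c: float = 1.0) -> float:
    """Assumed score: relevance + mean success + UCB1 exploration bonus."""
    if rule.uses == 0:
        return math.inf  # untried rules are explored first
    mean_success = rule.successes / rule.uses
    bonus = c * math.sqrt(math.log(max(total_uses, 1)) / rule.uses)
    return rule.relevance + mean_success + bonus


def retrieve(rules: List[Rule], k: int, c: float = 1.0) -> List[Rule]:
    """Return the k rules with the highest UCB score."""
    total_uses = sum(rule.uses for rule in rules)
    ranked = sorted(rules, key=lambda r: ucb_score(r, total_uses, c), reverse=True)
    return ranked[:k]


memory = [
    Rule("check kitchen counters for mugs", relevance=0.9, successes=1, uses=10),
    Rule("scan open doorways before turning", relevance=0.5, successes=9, uses=10),
    Rule("look near sofas for remote controls", relevance=0.4, successes=0, uses=0),
]
top_rules = retrieve(memory, k=2)
```

In this toy memory the never-used rule ranks first because of the exploration term, and the historically reliable rule outranks the more relevant but rarely successful one.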
[AI-2] Learning Red Agent Policy from Observations for Neurosymbolic Autonomous Cyber Agents
Link: https://arxiv.org/abs/2606.18223
Authors: Ankita Samaddar,Sandeep Neema,Daniel Balasubramanian,Xenofon Koutsoukos
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:
Abstract:With sophisticated cyber-attacks becoming increasingly prevalent, modern networks require intelligent autonomous cyber-defense agents trained via Reinforcement Learning (RL). These agents employ neurosymbolic approaches such as behavior trees with learning-enabled components (LECs) to learn, reason, adapt, and implement security rules while maintaining critical operations. However, these autonomous networks are partially observable systems, i.e., the cyber-attacker’s (red agent’s) actions are not observable, making it difficult for the defender to predict red actions, learn red policies, or assess the attacker’s intrusion levels. To address this, we propose a Policy Learning Technique using imitation learning to learn policies for partially observable RL agents with discrete states and discrete actions. We apply this technique in an autonomous cyber environment to predict red agent’s actions from network observations and defender actions. Integrated with a neurosymbolic cyber-defense agent, our method effectively handles different red policies and achieves high prediction accuracy across diverse simulated scenarios.
[AI-3] Fixed-Point Reasoners: Stable and Adaptive Deep Looped Transformers
Link: https://arxiv.org/abs/2606.18206
Authors: Sajad Movahedi,Vera Milovanović,Shlomo Libo Feigin,Alexander Theus,Thomas Hofmann,Valentina Boeva,T. Konstantin Rusch,Antonio Orvieto
Categories: Artificial Intelligence (cs.AI)
Comments: Code available at this https URL
Abstract:Looped architectures provide an inductive bias toward learning step-by-step procedures for tasks that require compositional reasoning. The number of effective layers reached by looping determines the quality of the solution these models find. Like deep architectures, looped architectures are prone to a signal propagation problem induced by depth as the halting decision is postponed. In this paper, we address this signal propagation issue using pre-norm layers and residual scaling. Building on these architectural modifications, we propose FPRM, a Transformer-based Fixed-Point Reasoning Model that uses fixed-point convergence as an end-to-end halting mechanism in a looped architecture. We show that fixed-point halting allows FPRM to adapt its compute to task difficulty. FPRM is effective on common reasoning benchmarks, namely Sudoku, Maze, state-tracking, and ARC-AGI.
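Fixed-point convergence as a halting rule can be illustrated independently of any Transformer: apply the same block repeatedly and stop once the state stops changing. The sketch below uses a toy contraction map in place of the looped layer. The function name `iterate_to_fixed_point`, the max-norm stopping test, and the tolerance are assumptions, not details taken from FPRM.

```python
from typing import Callable, List, Tuple

State = List[float]


def iterate_to_fixed_point(
    step: Callable[[State], State],
    x0: State,
    tol: float = 1e-6,
    max_iters: int = 1000,
) -> Tuple[State, int]:
    """Apply `step` until successive states differ by less than `tol`.

    Returns the final state and the number of iterations used, so the
    compute spent adapts to how quickly the map converges.
    """
    x = list(x0)
    for k in range(1, max_iters + 1):
        x_next = step(x)
        delta = max(abs(a - b) for a, b in zip(x_next, x))
        x = x_next
        if delta < tol:
            return x, k
    return x, max_iters


def easy_step(x: State) -> State:
    """Strong contraction (factor 0.5) with fixed point 2.0."""
    return [0.5 * v + 1.0 for v in x]


def hard_step(x: State) -> State:
    """Weak contraction (factor 0.9) with the same fixed point 2.0."""
    return [0.9 * v + 0.2 for v in x]


easy_state, easy_iters = iterate_to_fixed_point(easy_step, [0.0])
hard_state, hard_iters = iterate_to_fixed_point(hard_step, [0.0])
```

Both maps reach the same fixed point, but the weaker contraction needs more iterations before halting, which mirrors the abstract's claim that fixed-point halting adapts compute to task difficulty.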
[AI-4] The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data
Link: https://arxiv.org/abs/2606.18192
Authors: Nick Bettencourt,Xiaowei Ding,Kay Giesecke
Categories: Artificial Intelligence (cs.AI)
Comments: Preprint. Includes appendix, tables, and figures
Abstract:As high-quality public web corpora become increasingly exhausted, clean long-context documents have become a scarce and expensive source of training data for large language models (LLMs). Existing long-context corpora are often proprietary and costly to acquire, synthetically generated, or concentrated in narrow domains such as programming. We introduce the Stanford EDGAR Filings Dataset (SEFD), an open reconstruction of SEC filings into layout-faithful MultiMarkdown for financial language modeling and evaluation. SEFD makes audited financial statements, risk disclosures, ownership reports, accounting notes, and market-moving event filings usable as long-context pretraining data and as a basis for financial reasoning, forecasting, compliance, and document understanding. The resulting corpus is token-efficient, model-ready, and has less than 0.1% overlap with Common Crawl-derived corpora. We release SEFD-v1, a 152B-token initial public snapshot, and provide corpus-level analyses of a larger 18.5M-filing archive estimated at 550B tokens. We further introduce two SEFD-derived benchmarks: EDGAR-Forecast, which evaluates filing-grounded numerical forecasting after model knowledge cutoffs, and EDGAR-OCR, which evaluates transcription of complex financial tables.
[AI-5] Kolmogorov Regression for Robust Diffusion Policies
Link: https://arxiv.org/abs/2606.18186
Authors: Lekan Molu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Finite-dimensional (FD) diffusion policies exhibit temporal drift owing to discretization artifacts that degrade long-horizon performance when deployed on physical systems. We introduce a backward Kolmogorov equation that lifts diffusion policies to a Cameron-Martin space, a subset of the Hilbert space, essentially replacing stochastic score matching with a deterministic boundary-value PDE problem. Our core innovation builds on Gaussian measure theory: the diffusion noise covariance operator is realized from a colored noise distribution, which prescribes a notion of regularity on samples from the model at inference time. We train the diffusion model with a derived precision-weighted Cameron-Martin loss and introduce a Kolmogorov residual as a PDE diagnostic during inference. These substitutions yield (i) convergence guarantees where the bound’s constants depend on the effective rank of the kernel rather than action dimension, (ii) improved trajectory regularity via spectral weighting, and (iii) a deterministic failure detector without reward signals. Validation across two application domains demonstrates substantial improvements: on the PushT manipulation benchmark, the Cameron-Martin loss achieves a 17% improvement in maximum episode reward (0.95 vs. 0.78 for MSE) and 67.6% reduction in inter-step drifts during inference via the introduced residual magnitude. Similarly, on a 6-station manufacturing line with constant work-in-process (CONWIP) flow control, we achieve 28.4% lower RMSE than classical LSTM baselines, a high starvation-event recall (1.0 in test cycles), and effective bottleneck identification (Precision@1 = 1.0 in test set, 13x signal-to-noise ratio). We then certify the dispatch policies with Hamilton-Jacobi reachability theory, which reduces deadlock events by 96% compared to uncontrolled dispatch over 100 simulated runs (351 events prevented).
[AI-6] All Smoke, No Alarm: Oracle Signals in Agent-Authored Test Code
Link: https://arxiv.org/abs/2606.18168
Authors: Dipayan Banik,Kowshik Chowdhury,Shazibul Islam Shamim
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted at the 8th IEEE International Conference on Artificial Intelligence Testing, 2026
Abstract:Software practitioners increasingly use AI coding agents that generate test code alongside production code in open source pull requests (PRs). Recent studies report more than 932,000 agent-authored PRs across more than 116,000 repositories, yet whether their test files contain meaningful verification logic remains underexplored. Test files lacking explicit assertions execute code without verifying behavior, so quality gates based on test-file presence overestimate verification strength. The goal of this paper is to help practitioners assess the verification strength of agent-authored patches by characterizing oracle signals and their link to merge outcomes and review effort. We conduct an empirical study of 86,156 test-file patches from 33,596 agent-authored PRs across 2,807 GitHub repositories produced by five coding agents: OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code. A qualitative analysis of 384 stratified patches informs a syntactic taxonomy of eight oracle signal categories. Applied at scale, 80.2% of test patches contain weak or no explicit oracle signals. While raw merge rates are lower for strong-oracle PRs, a regression analysis adjusting for agent, PR size, repository popularity, task type, and language shows strong oracles significantly improve merge likelihood (OR = 1.28, p < 0.001). Our findings suggest that test file counts substantially overestimate verification strength and that practitioners can adopt oracle-aware quality checks to more accurately evaluate agent-authored contributions.
[AI-7] Learning Cardiac Electrophysiology Digital Twins Through Agentic Discovery of Hybrid Structure
Link: https://arxiv.org/abs/2606.18154
Authors: Ziqi Zhou,Yubo Ye,Sumeet Atul Vadhavka,Linwei Wang,Zhiqiang Tao
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, 4 figures
Abstract:Building personalized cardiac electrophysiology (EP) digital twins requires identifying the appropriate model structure for each patient, not merely fitting parameters. Traditional methods rely on experts to manually prescribe hybrid physics-neural architectures, which requires deep domain expertise and does not transfer across patients. Recent works have applied large language models (LLMs) to generate or act as hybrid models. However, despite their promising generalization capacity, these LLM-based methods lack the structural priors needed for stable cardiac simulations. Hence, we propose LEADS, a framework that formulates cardiac EP domain knowledge as a structured action space and utilizes an LLM agent to discover hybrid models. The agent follows an iterative reasoning-and-action loop to select, combine, and refine hybrid models, whilst gradient descent handles parameter fitting. LEADS steers every candidate model toward being physically grounded, interpretable, and numerically stable, while allowing open-ended architectural discovery. We validate LEADS on synthetic data with three ground-truth reaction models and on real cardiac EP data, demonstrating that it outperforms both human-designed hybrid models and other LLM-based hybrid modeling.
[AI-8] WEQA: Wearable hEalth Question Answering with Query-Adaptive Agentic Reasoning
Link: https://arxiv.org/abs/2606.18147
Authors: Yuwei Zhang,Tong Xia,Bianca Emmerich,Yu Yvonne Wu,Dimitris Spathis,Xin Liu,Daniel McDuff,Cecilia Mascolo
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Language models are remarkably capable at medical question answering, in some cases surpassing the accuracy of general physicians. However, answering questions about wearable health data remains challenging and understudied, as these ubiquitous sensors produce continuous, high-dimensional, and longitudinal data, which is non-trivial to align with text-centric distributions in LLM pretraining. The diversity of sensor modalities and user intents cannot be effectively handled by a fixed reasoning workflow or a single pretrained foundation model. To address these challenges, we propose WEQA, a query-adaptive agent framework that unifies LLM reasoning with specialized wearable analytical and modeling tools. An LLM controller is employed to synthesize execution plans and dynamically route each query to the appropriate combination of sensor analysis and pretrained models, and perform grounded response auditing with external knowledge. We also curate a benchmark spanning four open wearable datasets comprising analytic and predictive tasks in three different health domains. Experiments show that our framework is 24% more accurate than LLM and agentic baselines, and a blinded study with 12 medical experts and 8 users shows substantial gains in usefulness and clinical soundness.
[AI-9] Memory as a Wasting Asset: Pricing Flash Endurance for Embodied Agents and the Limits of Doing So
Link: https://arxiv.org/abs/2606.18144
Authors: Josef Liyanjun Chen
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:
Abstract:A robot’s flash endurance is a non-renewable stock: every persisted write spends one of a few thousand program/erase cycles and never refills, yet no fielded robot memory system prices which memories are worth an erase cycle. We treat embodied memory as depreciating capital and price that stock with a single endurance shadow price $\eta$, which makes cost-minimizing placement across a RAM / on-board NVM / cloud hierarchy a threshold in a wear-augmented per-byte index. The index is cost-optimal whatever the sign of the value-write association $\chi$; only when $\chi$ 0 does the optimum turn non-monotone, sending a robot’s most valuable memories off its flash. The pivot is thus empirical, and we measure $\chi$ on real robot logs at a pre-specified gate: its sign is a property of the deployment regime – positive on recurrent long-horizon manipulation ($\hat\chi \approx +1.0 \times 10^{-3}$, replicated at full power), null on a shorter-horizon suite, and negative on non-recurrent teleoperation. Two boundaries scope the result. The endurance budget is dormant on premium 3,000-P/E TLC at datasheet prices and binding on the commodity QLC/eMMC ($\sim$1,000 P/E) that cheaper edge robots run. And where it binds, a learned wear-aware controller only ties price-based routing on task value, because realized value is tier-invariant across RAM, NVM, and cloud: the rent governs device lifetime and cost, not task performance. Whether wear-aware placement improves task value remains open – $\chi$ is measured against a value proxy, and the non-monotone optimum, while proven, is not yet observed in data.
[AI-10] Descriptor: Certus Caliber Classification Gunshot Dataset (C3GD)
链接: https://arxiv.org/abs/2606.18135
作者: Sinclair Gurny,Ryan Quinn
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:In this work, we introduce the Certus Caliber Classification Gunshot Dataset (C3GD), a publicly accessible data set developed for the analysis of firearm muzzle blast sounds. The dataset aims to provide a wide variety of firearms, calibers, cartridges, microphones, and microphone locations with metadata detailed beyond what is currently otherwise available. It comprises more than 8000 field-collected data points from 28 firearms across 16 calibers. Because data collection in the field is costly, much of the existing research has been done using gunshot audio collected from the internet, which increases the risk of low-quality data and label noise. This dataset is primarily focused on caliber classification, but can also be used for gunshot detection, audio separation, and audio signal processing, providing a diversified and real-world reference. The dataset aims to provide enough diversity to be able to generalize to more real-world applications while also providing enough metadata for detailed academic analysis.
[AI-11] Knowledge Reutilization in Meta-Reinforcement Learning
链接: https://arxiv.org/abs/2606.18132
作者: Yuan Meng,Bo Wang,Juan de los Rios Ruiz,Xiangtong Yao,Zhenshan Bing,Fuchun Sun,Alois Knoll
类目: Artificial Intelligence (cs.AI)
备注: 18 pages initial submission
Abstract:Meta-reinforcement learning enables fast adaptation by extracting shared structure from related tasks, but existing end-to-end methods often couple task inference with embodiment-specific control. This coupling can obscure non-parametric task semantics, reduce sample efficiency, and limit cross-agent reuse. We propose a meta-knowledge reutilization framework that learns task-level knowledge on a dynamics-simplified agent and transfers it to heterogeneous agents. The framework uses a Bayesian non-parametric prior to organize latent task modes and a high-level policy to generate task-level magnitude guidance. To bridge reusable task knowledge with different embodiments, we introduce a semantic-magnitude interface and a lightweight temporal adaptor, which convert frozen meta-knowledge into temporally aligned subgoals for embodiment-specific low-level controllers. Experiments on multiple locomotion agents show that our framework reduces final-step tracking error by 94.75% – 99.79% compared with recent state-of-the-art baselines and achieves comparable deployment performance with about 23.8% of their interaction data.
[AI-12] Embedded Machine Learning for Microcontroller-Class Edge Devices: Data Feature Evaluation and Deployment Pipelines
链接: https://arxiv.org/abs/2606.18122
作者: Mostafa Darvishi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
备注: 6 pages, 3 figures, 4 tables
Abstract:Embedded machine learning moves inference from cloud services to resource-constrained devices that must acquire data, preprocess signals, run a model, and act within tight limits on memory, energy, and latency. This paper presents a systems-oriented synthesis of an embedded machine-learning workflow for microcontroller-class platforms. The emphasis is placed on engineering decisions that are often hidden in generic machine-learning introductions: sampling and buffering, feature extraction as dimensionality reduction, validation under class imbalance, model/runtime co-design, and streaming deployment. Two representative signal families are used throughout the paper. The first is inertial motion recognition, where a two-second, three-axis accelerometer window is transformed from raw samples into root-mean-square and spectral features before classification. The second is keyword spotting, where audio is sampled, anti-aliased, transformed into mel-frequency cepstral coefficients, and processed by a compact one-dimensional convolutional network. The paper concludes with practical design rules for robust on-device inference, including data curation, quantization, thresholding, scheduling, and field monitoring.
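The inertial feature step described in this abstract (a two-second, three-axis accelerometer window reduced to root-mean-square and spectral features) can be illustrated with a minimal sketch. The function name `window_features`, the 50 Hz sample rate, the number of bands, and the choice of band power plus dominant frequency as the "spectral features" are assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np

def window_features(window, sample_rate_hz, n_bands=4):
    """Reduce an (n_samples, 3) accelerometer window to a small feature vector.

    Layout of the result (hypothetical, for illustration only):
    [rms_x, rms_y, rms_z, dom_freq_x, dom_freq_y, dom_freq_z, band powers...]
    """
    window = np.asarray(window, dtype=np.float64)
    n_samples = window.shape[0]
    # Remove the per-axis mean (gravity / DC offset) so RMS reflects motion.
    centered = window - window.mean(axis=0)
    rms = np.sqrt(np.mean(centered ** 2, axis=0))             # shape (3,)
    # One-sided power spectrum per axis.
    power = np.abs(np.fft.rfft(centered, axis=0)) ** 2         # (n_freqs, 3)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / sample_rate_hz)
    # Dominant non-DC frequency per axis.
    dominant = freqs[np.argmax(power[1:], axis=0) + 1]         # shape (3,)
    # Split the non-DC bins into contiguous bands and sum power in each.
    bands = np.array_split(power[1:], n_bands, axis=0)
    band_power = np.stack([b.sum(axis=0) for b in bands])      # (n_bands, 3)
    return np.concatenate([rms, dominant, band_power.ravel()])

# Usage: a two-second window at an assumed 50 Hz with a 5 Hz tone on the x axis.
t = np.arange(100) / 50.0
demo_window = np.stack(
    [np.sin(2 * np.pi * 5.0 * t), np.zeros_like(t), np.zeros_like(t)], axis=1
)
demo_features = window_features(demo_window, sample_rate_hz=50.0)
```

The point of the reduction is dimensionality: 300 raw samples become 18 numbers here, which is what makes a small classifier feasible on a microcontroller-class device.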
[AI-13] First Proof Second Batch
链接: https://arxiv.org/abs/2606.18119
作者: Mohammed Abouzaid,Nikhil Srivastava,Rachel Ward,Lauren Williams
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:To assess the ability of current AI systems to correctly solve research-level mathematics problems, we tested several AI systems on a set of ten problems in a broad range of mathematical fields; these problems arose naturally in the research process of the contributors. This document includes the problems, our methodology, and the results of our testing. We provide links to supplementary documents including the human solutions, the AI-generated solutions, and the referee reports and logs for the AI-generated solutions. The ten problems were contributed by the following mathematicians: (1) Dariusz Kalociński and Theodore A. Slaman, (2) Richard Schwartz, (3) Aleksa Milojevic and Benny Sudakov, (4) Larry Guth, (5) Oleg Butkovsky, Jonathan Mattingly, and Lorenzo Zambotti, (6) Joshua Evan Greene and Duncan McCoy, (7) Sucharit Sarkar, (8) Sam Payne and Jidong (Jayden) Wang, (9) Sylvie Corteel and John Lentfer, (10) Srivatsav Kunnawalkam Elayavalli.
[AI-14] Ternary Mamba: Grouped Quantization-Aware Training of W1.58A16 State Space Models
链接: https://arxiv.org/abs/2606.18114
作者: Ramprasath Ganesaraja,Sahil Dilip Panse,Swathika N
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:State Space Models (SSMs) such as Mamba-2 offer linear-time inference, but their memory footprint limits edge deployment. Prior ternary SSM work (Slender-Mamba) trains from scratch on 150B tokens; we show a pretrained checkpoint suffices, reducing the marginal token budget by 1,000x. Using grouped quantization-aware training (QAT) with knowledge distillation from a frozen FP16 teacher, we compress Mamba-2 1.3B by 3.61x (2,687 MB to 744 MB) and achieve 48.1% zero-shot accuracy (7-task average) with just 102M tokens (4 GPU-hours, single H100), approaching Bi-Mamba’s 48.4% (within the +/-0.9pp CI). This QAT-from-pretrained setting reveals zero-ratio collapse, a novel instability caused by learnable quantization scales that does not arise in from-scratch training. We further show that post-hoc correction strategies effective for Transformers fail for SSMs due to error accumulation through the recurrence. These results demonstrate that ternary SSMs do not require expensive from-scratch training: QAT from pretrained checkpoints with KD is a data-efficient alternative.
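To make "grouped ternary weights" concrete, here is a minimal forward-pass sketch of per-group ternary quantization to the codes {-1, 0, +1}. It assumes an absmean scale per group, a common choice for 1.58-bit weights; the abstract does not state the paper's quantizer, its learnable-scale parameterization, or its group size, so `ternary_quantize_grouped`, `group_size=128`, and the scale rule are all hypothetical. The returned zero ratio is simply the fraction of codes equal to zero, which seems a natural quantity to monitor given the abstract's mention of zero-ratio collapse.

```python
import numpy as np

def ternary_quantize_grouped(weights, group_size=128, eps=1e-8):
    """Quantize weights to {-1, 0, +1} with one scale per contiguous group.

    Assumption: absmean scaling per group (not confirmed by the abstract).
    Returns (codes, scales, dequantized, zero_ratio).
    """
    w = np.asarray(weights, dtype=np.float32)
    if w.size % group_size != 0:
        raise ValueError("weight count must be divisible by group_size")
    groups = w.reshape(-1, group_size)
    # One scale per group: mean absolute value of the group's weights.
    scale = np.abs(groups).mean(axis=1, keepdims=True) + eps
    # Round to the nearest integer, then clip into the ternary set.
    codes = np.clip(np.rint(groups / scale), -1, 1).astype(np.int8)
    dequant = (codes * scale).reshape(w.shape)
    # Fraction of weights that were quantized to zero.
    zero_ratio = float((codes == 0).mean())
    return codes.reshape(w.shape), scale.squeeze(1), dequant, zero_ratio

# Usage on a tiny hand-checkable group: absmean is 1.05, so the two small
# weights round to 0 and the two large ones clip to +1 / -1.
demo_codes, demo_scales, demo_dequant, demo_zero_ratio = ternary_quantize_grouped(
    [0.1, -0.1, 2.0, -2.0], group_size=4
)
```

In actual QAT the rounding step would be paired with a straight-through gradient estimator; that part is omitted here because it depends on the training framework.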
[AI-15] Learning Fair Pareto-Optimal Policies in Multi-Objective Reinforcement Learning
链接: https://arxiv.org/abs/2606.18111
作者: Umer Siddique,Peilang Li,Yongcan Cao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at the Reinforcement Learning Conference (RLC) 2025. 12 pages main + appendix, 8 figures, 4 tables
Abstract:Fairness is an important aspect of decision-making in multi-objective reinforcement learning (MORL), where policies must ensure both optimality and equity across multiple, potentially conflicting objectives. While single-policy MORL methods can learn fair policies for fixed user preferences using welfare functions such as the generalized Gini welfare function (GGF), they fail to provide the diverse set of policies necessary for dynamic or unknown user preferences. To address this limitation, we formalize the fair optimization problem in multi-policy MORL, where the goal is to learn a set of Pareto-optimal policies that ensure fairness across all possible user preferences. Our key technical contributions are threefold: (1) We show that for concave, piecewise-linear welfare functions (e.g., GGF), fair policies remain in the convex coverage set (CCS), which is an approximated Pareto front for linear scalarization. (2) We demonstrate that non-stationary policies, augmented with accrued reward histories, and stochastic policies improve fairness by dynamically adapting to historical inequities. (3) We propose three novel algorithms, which include integrating GGF with multi-policy multi-objective Q-Learning (MOQL), state-augmented multi-policy MOQL for learning non-stationary policies, and its novel extension for learning stochastic policies. We evaluate our algorithms across various domains and compare our methods against the state-of-the-art MORL baselines. The empirical results show that our methods learn a set of fair policies that accommodate different user preferences.
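For readers unfamiliar with the generalized Gini welfare function mentioned above, a minimal sketch follows. It uses the standard definition: sort the per-objective returns in ascending order and take a weighted sum with non-increasing weights, so the worst-off objective counts most. The default weights (normalized powers of 1/2) are an illustrative assumption, not necessarily the ones used in the paper.

```python
import numpy as np

def ggf(returns, weights=None):
    """Generalized Gini welfare of a vector of per-objective returns."""
    v = np.sort(np.asarray(returns, dtype=np.float64))       # ascending
    if weights is None:
        # Assumed default: w_i proportional to 1 / 2^i, normalized to sum to 1.
        weights = 0.5 ** np.arange(v.size)
        weights = weights / weights.sum()
    # Largest weight must pair with the smallest return.
    w = np.sort(np.asarray(weights, dtype=np.float64))[::-1]
    return float(np.dot(w, v))

# Usage: both return vectors sum to 8, but the balanced one scores higher.
balanced_welfare = ggf([4.0, 4.0])
skewed_welfare = ggf([8.0, 0.0])
```

Because the function is a minimum over permutations of linear functions, it is concave and piecewise linear, which is the property contribution (1) in the abstract relies on.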
[AI-16] Trust the Right Teacher: Quality-Aware Self-Distillation for GUI Grounding
链接: https://arxiv.org/abs/2606.18101
作者: Jingyuan Huang,Zuming Huang,Yucheng Shi,Tianze Yang,Xiaoming Zhai,Wei Chu,Ninghao Liu
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Graphical user interface (GUI) grounding requires vision-language models (VLMs) to identify small target elements in high-resolution screenshots and predict precise screen coordinates. On-policy self-distillation (OPSD) is a promising post-training approach for this coordinate-sensitive task, since it provides dense token-level teacher signals beyond hard coordinate labels. However, naive OPSD is not well suited to GUI grounding: because OPSD evaluates the teacher on student-generated prefixes, the quality of coordinate-token teacher signals can degrade when the prefix has already deviated from the target coordinate, leading to unreliable teacher signals. To mitigate this, we propose quality-aware self-distillation for VLM-based GUI grounding, which improves coordinate-token teacher-signal quality through soft correctness-aware gating and teacher-probability scaling. The soft correctness-aware gate checks whether the teacher’s current coordinate-token prediction can still be completed into the ground-truth box under the student-generated prefix. If not, the corresponding teacher signal is down-weighted. Teacher-probability scaling then uses the teacher’s confidence as a lightweight factor to further calibrate the strength of the gated supervision. A key empirical finding is that neither component alone improves overall performance, whereas combining them consistently improves performance. This suggests that the two mechanisms play complementary roles: correctness-aware gating suppresses unreliable coordinate-token supervision, while teacher-probability scaling calibrates the strength of the remaining signals. Experiments across six GUI grounding benchmarks show that our method consistently improves the base model and outperforms strong baselines.
[AI-17] IsabeLLM: Automated Theorem Proving Applied to Formally Verifying Consensus
链接: https://arxiv.org/abs/2606.18098
作者: Elliot Jones,William Knottenbelt
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Advances in Artificial Intelligence (AI) have led AI for Theorem Proving to become a promising means of formally verifying computer systems. Whilst formal verification is traditionally reserved for safety-critical systems due to the required amount of expertise and effort, AI can help to automate a large amount of this workload and make it far more accessible. Blockchain-based systems are becoming increasingly popular and are frequently targeted by malicious actors, often resulting in huge financial losses, highlighting the need to better verify these systems and mitigate vulnerabilities. Arguably the most important component of these systems is the consensus protocol, which allows nodes to agree on decisions in a potentially adversarial environment. In this paper, we improve upon IsabeLLM, the automated theorem proving tool in Isabelle. Specifically, we implement a Retrieval-Augmented Generation framework, error tracing, and counterexample generation to improve the context supplied to the Large Language Model. Compatibility with the latest version of Isabelle and Sledgehammer is also implemented for improved efficiency. We compare the performance of the two versions of IsabeLLM in their ability to complete the verification of Bitcoin’s Proof of Work consensus.
[AI-18] S4oP: Operator-level Pruning of Structured State Space Models for Resource-Constrained Devices
链接: https://arxiv.org/abs/2606.18096
作者: Marco Deano,Filippo Ziche,Nicola Bombieri
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:Structured State Space Models (SSMs), including the S4 and S4D architectures, have recently emerged as powerful alternatives to attention-based models for capturing long-range dependencies in sequential data. Despite their strong empirical performance, deploying these models in time- and resource-constrained settings remains challenging due to their computational and memory demands. In this paper, we propose a novel incremental, operator-level pruning approach for S4- and S4D-based models that significantly reduces inference cost while preserving predictive performance. To the best of our knowledge, this is the first work to systematically investigate structured operator pruning for SSMs. Our method progressively prunes model operators by interleaving structured masking with fine-tuning, while jointly monitoring accuracy and inference latency. We implement this approach within a unified training and evaluation framework that enables systematic exploration of efficiency-accuracy trade-offs. Experiments across multiple benchmark datasets show that pruning up to 70% of the model operators preserves the performance of the original models in most cases, while substantially reducing inference latency. These results demonstrate that structured operator pruning is an effective and previously unexplored strategy for improving the efficiency of SSMs and facilitates their deployment in practical, resource-constrained scenarios.
[AI-19] EAGG: Embodiment-Aligned Grasp Generation via Geometry-Aware Graph Conditioning
链接: https://arxiv.org/abs/2606.18092
作者: Wanhao Niu,Qiyan Ke,Yuan Sun,Hao Sun,Jie Xu,Muyuan Ma,Ruiqi Hu,Fuchun Sun
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 16 pages, 8 figures. Code is available at this https URL
Abstract:Cross-end-effector grasp generation seeks a unified model that generalizes across objects and across embodiments ranging from parallel grippers to dexterous end effectors. Existing grasp generators are typically designed for a fixed embodiment or encode embodiment identity with a static descriptor, which weakens transfer when topology, actuation coupling, and contact geometry differ substantially. We present EAGG, an embodiment-aligned grasp generator that represents each embodiment with a topology-aware end-effector graph and an embodiment-specific low-dimensional end-effector control space. A frozen end-effector-cognition backbone converts the current articulated state into geometry-aware tokens that act as a reusable morphology prior, and iterative geometry injection refreshes these tokens throughout sampling so that conditioning remains synchronized with the evolving end-effector geometry. On the MultiGripperGrasp benchmark, EAGG reaches 56.17% average success across six training end effectors, remaining within 1.10 percentage points of specialized training while preserving transfer to finetuning and zero-shot end effectors. Iterative geometry injection further reduces the pooled median contact distance from 0.239 cm to 0.189 cm. These results show that cross-end-effector grasp generation is strengthened by aligning embodiment structure inside a shared generator rather than suppressing embodiment differences. Code is available at this https URL.
[AI-20] A Unified Framework for Context-Aware and Relation-Aware Graph Retrieval-Augmented Generation WWW’26
链接: https://arxiv.org/abs/2606.18075
作者: Haoyang Zhong,Yifei Sun,Antong Zhang,Chunping Wang,Lei Chen,Yang Yang
类目: Artificial Intelligence (cs.AI)
备注: Accepted at The ACM Web Conference 2026 (WWW '26)
Abstract:Retrieval-Augmented Generation (RAG) has emerged as a paradigm for enhancing large language models (LLMs) with external knowledge, yet existing graph-based methods face a fundamental limitation: entity-centric and chunk-centric approaches operate on representations anchored to original text without true knowledge fusion. While entity-centric methods connect logically related content and chunk-centric methods preserve context, both retrieve information separately through similarity search, missing emergent understanding from their synthesis. In this paper, we propose HyGRAG, a hierarchical graph RAG framework that transcends source documents by addressing three core challenges: constructing summaries that genuinely integrate contextual and relational information, leveraging these synthesized representations to access emergent knowledge during retrieval, and efficiently updating hierarchical structures for dynamic corpora. Specifically, we design hierarchical index structures over hybrid graphs with both chunk and entity nodes, then iteratively cluster them and generate LLM-based summaries. Then, we design context and relation-aware retrieval that searches across all abstraction levels while expanding through community membership. Moreover, we enable dynamic knowledge update through attachment-based algorithms with only local re-summarization. Experimental results show that HyGRAG improves the average accuracy of multi-hop reasoning tasks by 9.7%, while maintaining reasonable efficiency.
[AI-21] Volterra Generative Models
链接: https://arxiv.org/abs/2606.18071
作者: Yusen Jia,Bingyan Han
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 36 pages
Abstract:Score-based diffusion models typically use Brownian perturbations, which provide tractable reverse-time dynamics but impose memoryless noising. We introduce Volterra generative models, a continuous-time score-based framework whose forward process injects path-dependent noise through fractional kernels. To handle the non-Markovian and non-semimartingale dynamics, we construct finite-dimensional Markovian lifts using Gaussian quadrature in both regimes and a hybrid finite-difference exponential approximation in the smooth regime. We prove squared error bounds, derive an augmented linear-Gaussian forward process, and show that the learning can remain data-dimensional by considering residual states and analytic auxiliary Gaussian scores. We also identify covariance and reverse-time degeneracies caused by shared Brownian factors and signed smooth-regime weights. The degeneracy motivates stabilized conditioning and, for stiff larger lifts, a Gaussian-bridge reconstruction sampler. Experiments on MNIST and CIFAR-10 show that persistent fractional perturbations with small Markovian lifts can improve score-based generation on MNIST and provide a promising extension to natural images, while the bridge sampler provides a stability mechanism for larger lifts.
[AI-22] Agentic AI-based Framework for Mitigating Premature Diagnostic Handoff and Silent Hallucination in Healthcare Applications
链接: https://arxiv.org/abs/2606.18068
作者: Divyansh Srivastava,Shreya Ghosh,Anshul Verma,Rajkumar Buyya
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in Large Language Models (LLMs) and multi-agent systems have driven the rise of Agentic AI, showing promise for medical reasoning. However, open-ended conversational agents remain prone to two critical failure modes: premature diagnostic handoff and silent clinical hallucinations that may go undetected before reaching the patient. In this work, we propose a multi-agent framework that addresses both issues by replacing “LLM-as-a-judge” routing with deterministic orchestration constraints. The framework incorporates two safety mechanisms. First, a neuro-symbolic state-tracking gate enforces completeness of the OLDCARTS clinical protocol (Onset, Location, Duration, Character, Aggravating/Alleviating factors, Radiation, Timing, and Severity) by blocking diagnostic transitions until all required dimensions are collected. Second, an epistemic uncertainty quantification (UQ) gate computes semantic entropy (H) across K=5 independent diagnostic samples to identify and intercept divergent outputs before delivery. We evaluate the system using simulated patient agents powered by the llama-3.1-70b-instruct model on 150 test cases. The full architecture achieves 49.3% diagnostic precision, representing an absolute improvement of 11.3 percentage points over an unconstrained baseline. Additionally, we observe a statistically significant negative correlation (r = -0.181, p < 0.05) between OLDCARTS completeness (\sigma) and semantic entropy (H), suggesting that structured information gathering is associated with reduced diagnostic uncertainty.
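The two gates in this abstract can be sketched in a few lines. The OLDCARTS completeness check follows the eight dimensions listed in the abstract; the entropy gate computes Shannon entropy over groups of semantically equivalent samples. Everything else is an assumption for illustration: the field names, the `canonicalize` stand-in (a real system would need a proper semantic-equivalence test rather than string normalization), the threshold of 0.7 nats, and the function names.

```python
import math
from collections import Counter

# Field names are hypothetical labels for the eight OLDCARTS dimensions.
OLDCARTS_FIELDS = (
    "onset", "location", "duration", "character",
    "aggravating_alleviating", "radiation", "timing", "severity",
)

def oldcarts_completeness(collected):
    """Fraction of OLDCARTS dimensions that have a non-empty value."""
    filled = sum(1 for field in OLDCARTS_FIELDS if collected.get(field))
    return filled / len(OLDCARTS_FIELDS)

def semantic_entropy(samples, canonicalize=lambda s: s.strip().lower()):
    """Entropy (in nats) over clusters of equivalent diagnostic samples.

    Assumption: `canonicalize` maps equivalent answers to the same key.
    """
    counts = Counter(canonicalize(s) for s in samples)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def may_deliver(collected, samples, entropy_threshold=0.7):
    """Deterministic gate: block unless history is complete and samples agree."""
    if oldcarts_completeness(collected) < 1.0:
        return False  # premature handoff: required dimensions are missing
    return semantic_entropy(samples) <= entropy_threshold

# Usage with K=5 samples, four of which agree.
demo_history = {field: "recorded" for field in OLDCARTS_FIELDS}
demo_samples = ["Influenza", "influenza", "influenza ", "influenza", "pneumonia"]
demo_decision = may_deliver(demo_history, demo_samples)
```

Keeping both checks as plain functions, rather than asking a model to judge, is what makes the routing deterministic and auditable.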
[AI-23] Catastrophic Forgetting is Low-Rank: A Function-Space Theory for Continual Adaptation ICML2026
链接: https://arxiv.org/abs/2606.18024
作者: Ido Nitzan Hidekel,Dan Raviv
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to the ICML 2026 Workshop on Continual Adaptation at Scale: Towards Sustainable AI
Abstract:Catastrophic forgetting in continual adaptation is usually studied through parameter drift, replay, or distillation, but these views do not identify which output-space directions are vulnerable. We give a function-space account in the NTK regime: new-task training induces old-task prediction drift through the cross-task kernel, yielding a closed-form predictor for the forgetting vector before any new-task gradient step. In frozen-backbone linear-head PEFT-CL, where the model is linear in the trainable parameters, the predictor is exact up to numerical precision; for nonlinear adapters/full fine-tuning, it is a local NTK approximation. The same expression reveals that forgetting concentrates in a small number of old-task NTK eigenmodes and under frozen linear heads gives a Kronecker scaling rule for the vulnerable rank. These results clarify the relation to prior NTK-overlap theory, explain why parameter-space regularizers can miss output-space interference, and motivate a targeted spectral regularizer.
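As a rough illustration of how a cross-task kernel can predict old-task drift in closed form, the sketch below uses a model that is linear in its trainable parameters, f(x) = phi(x) . theta, trained to convergence on squared loss for the new task. In that setting the drift of old-task predictions equals K_on K_nn^{-1} r, where K_on and K_nn are the old-new and new-new kernel (Gram) matrices and r is the initial new-task residual. This is the textbook linear/NTK result and only an assumed stand-in for the paper's predictor, whose exact form the abstract does not give; all names are hypothetical.

```python
import numpy as np

def predicted_forgetting(phi_old, phi_new, residual_new):
    """Closed-form old-task prediction drift, computed before any training."""
    k_on = phi_old @ phi_new.T            # cross-task kernel
    k_nn = phi_new @ phi_new.T            # new-task kernel (assumed invertible)
    return k_on @ np.linalg.solve(k_nn, residual_new)

def gd_forgetting(phi_old, phi_new, y_new, theta0, steps=5000):
    """Reference: actually run gradient descent on the new task."""
    k_nn = phi_new @ phi_new.T
    lr = 1.0 / np.linalg.eigvalsh(k_nn).max()   # stable step size
    theta = theta0.copy()
    for _ in range(steps):
        theta -= lr * phi_new.T @ (phi_new @ theta - y_new)
    return phi_old @ (theta - theta0)

# Usage on synthetic features: 7 old-task points, 5 new-task points, 20 parameters.
rng = np.random.default_rng(0)
phi_old = rng.normal(size=(7, 20))
phi_new = rng.normal(size=(5, 20))
theta0 = rng.normal(size=20)
y_new = rng.normal(size=5)
drift_predicted = predicted_forgetting(phi_old, phi_new, y_new - phi_new @ theta0)
drift_observed = gd_forgetting(phi_old, phi_new, y_new, theta0)
```

An eigendecomposition of the resulting drift covariance over many new tasks would be one way to look for the low-rank concentration the abstract describes, though that analysis is not reproduced here.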
[AI-24] LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling
链接: https://arxiv.org/abs/2606.18023
作者: Jian Yang,Shawn Guo,Wei Zhang,Tianyu Zheng,Yaxin Du,Haau-Sing Li,Jiajun Wu,Yue Song,Yan Xing,Qingsong Cai,Zelong Huang,Chuan Hao,Ran Tao,Xianglong Liu,Wayne Xin Zhao,Mingjie Tang,Weifeng Lv,Ming Zhou,Bryan Dai
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Looped Transformers scale latent computation by repeatedly applying shared blocks, but sequential looping increases latency and KV-cache memory with the loop count. Parallel loop Transformers (PLT) alleviate this cost through cross-loop position offsets (CLP) and shared-KV gated sliding-window attention, making loop count a practical design choice. We therefore study PLT loop-count selection through a gain–cost view: an extra loop may refine representations, but CLP also introduces a positional mismatch at each loop boundary. We instantiate this study by training LoopCoder-v2, a family of 7B PLT coders with different loop counts, from scratch on 18T tokens, followed by matched instruction tuning and evaluation. Empirically, the two-loop variant delivers broad gains over the non-looped baseline across code generation, code reasoning, agentic software engineering, and tool-use benchmarks, improving SWE-bench Verified from 43.0 to 64.4 points and Multi-SWE from 14.0 to 31.0 points. In contrast, variants with three or more loops regress, revealing a strongly non-monotonic loop-count effect. Our diagnostics show that loop 2 provides the main productive refinement, while later loops yield diminishing, oscillatory updates and reduced representational diversity. Because the CLP-induced mismatch remains roughly fixed as refinement gains shrink, the offset cost increasingly dominates. This gain–cost trade-off explains PLT’s saturation at two loops and provides diagnostics for loop-count selection.
[AI-25] LLM Consumer Behavior Theory: Foundations of a Novel Research Field
链接: https://arxiv.org/abs/2606.18005
作者: Manon Reusens,Sofie Goethals,David Martens
类目: Artificial Intelligence (cs.AI); General Economics (econ.GN)
备注:
Abstract:Large language models (LLMs) are increasingly deployed as autonomous agents that make consumption decisions on behalf of users. This shift raises fundamental questions for consumer theory, which has traditionally modeled humans as the primary decision-makers. In this paper, we introduce LLM Consumer Behavior Theory, a new field of study concerned with analyzing consumer behavior in agentic markets. Drawing on classical and behavioral economics alongside recent advances in Natural Language Processing, we formalize how human preferences are reflected and acted upon by LLM-based agents, and how agent-level decisions aggregate into market demand. We unify previously fragmented literature on LLM decision-making, human behavior simulation, and preference elicitation under a common economic lens, highlighting where assumptions, such as rationality and heterogeneity, may fail in agentic markets. Rather than providing empirical validation, this paper outlines the scope of LLM consumer behavior and identifies open research questions related to alignment, preference representation, and market dynamics.
[AI-26] C2FL: Clustered Continual Federated Learning under Spatial and Temporal Drift
链接: https://arxiv.org/abs/2606.18003
作者: Davide Domini,Gianluca Aguzzi,Lorenzo Pellegrini,Mirko Viroli,Lukas Esterle
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Collective Adaptive Systems (CAS) increasingly rely on machine learning to let each node learn from locally sensed data, aligning its behavior with the surrounding environment. Scaling this intelligence, however, raises fundamental challenges: sensed data is often privacy-sensitive, preventing centralized collection; nodes are mobile, traversing regions where nearby nodes perceive similar phenomena while distant ones observe radically different conditions, creating natural spatial clusters; and these distributions evolve over time due to mobility, introducing temporal drift that makes local models progressively stale. These dynamics arise across domains - vehicular sensing, drone-based monitoring, smartphone crowdsensing - yet the interplay of privacy, spatial heterogeneity, and temporal drift severely undermines conventional learning strategies. Therefore, we propose C2FL, a fully distributed Federated Learning (FL) approach where nodes self-organize into learning groups through spatial clustering, reflecting the geographic structure of the environment. To counteract temporal drift, each node combines experience replay with a dwell-time-aware adaptive averaging step, progressively incorporating the regional consensus as it remains longer within the same area, while preserving previously acquired knowledge under evolving distributions. We evaluate our approach on synthetic experiments that systematically reproduce spatial and temporal shifts, showing that standard federated strategies degrade significantly under these conditions and that our method restores robust collective adaptation.
[AI-27] A T-API-Compliant ReAct Agentic Loop for Optical Networks: Generic vs. Domain-Specific Tool Abstractions
链接: https://arxiv.org/abs/2606.18000
作者: Seyed Morteza Ahmadian,Paolo Monti,Carlos Natalino
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: 4 pages, 2 figures, accepted for presentation at the 52nd European Conference on Optical Communications (ECOC), 2026
Abstract:Optical networks need intent-driven, closed-loop agentic management, a key enabler for higher autonomy levels. We present the first T-API-compliant reasoning and acting (ReAct) loop. We show that domain-specific composite tools achieve 90% oracle-validated correctness with threefold token savings compared to generic tools.
[AI-28] Multiple cyclicity and Wavelet Decomposition with Channel Correlation for Long-term Time Series Forecasting
链接: https://arxiv.org/abs/2606.17996
作者: Bin Wang,Heming Yang,Jinfang Sheng
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Cyclicity and trend are important components of time series data, and many studies based on cyclicity and trend have achieved good results in long-term time series forecasting. However, we believe that current work neglects the influence of real-world inter-channel correlations in time series data, which leads to suboptimal predictions. Furthermore, these models rely on complex designs to capture diverse information, resulting in low computational efficiency. To address these challenges, we propose McWC, a long-term time series forecasting model that separately models the cyclicity, trend, and inter-channel correlations. Specifically, McWC first decouples cyclical information from data using a multi-layer cyclicity construction module. Then, it extracts inter-channel correlations using a multi-layer perceptron. Next, it models and fuses the multi-layer high-frequency and low-frequency information from data using a multi-level wavelet decomposition module. Finally, it aggregates the results of different components to obtain the output. Simultaneously, we decouple intra-channel autocorrelations by calculating a loss function in the frequency domain. Experiments on six real-world datasets demonstrate that McWC achieves state-of-the-art performance, exhibiting excellent computational efficiency and historical information extraction capabilities.
[AI-29] STAR: SpatioTemporal Adaptive Reward Allocation for Text-to-Image RL Post-Training
链接: https://arxiv.org/abs/2606.17979
作者: Jinjie Shen,Wei Deng,Xian Hu,Daiguo Zhou,Jian Luan
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Existing RL post-training methods for text-to-image generation usually convert the final-image reward into a single scalar advantage and apply it with the same strength to the entire generative trajectory. However, text-to-image generation naturally has temporal and spatial structure: different denoising steps are responsible for different generation stages, and the content that truly determines text alignment often appears only in part of the image. This granularity mismatch makes it difficult for policy updates to focus on the generative components that actually affect the reward. To address this issue, we propose SpatioTemporal Adaptive Reward (STAR) Allocation for RL post-training of text-to-image diffusion and flow models. STAR uses text-image attention inside the generative model and starts from the core content that the user truly cares about in the prompt. It constructs spatial allocation maps that dynamically vary across denoising steps and rollouts, and allocates the same group-relative advantage to more relevant latent regions with almost no additional computational overhead. STAR then applies stronger policy updates to these regions through a spatially resolved policy objective. We use Stable Diffusion 3.5 Medium as the base model and evaluate on three tasks: GenEval, OCR text rendering, and PickScore. Experimental results show that STAR improves compositional semantic alignment, text rendering, and preference optimization without changing the external reward source, achieving 0.9759, 0.9757, and 23.60 on GenEval, OCR, and PickScore, respectively.
[AI-30] MoCo-AIS: A Contrastive Learning Framework for Similarity Computation of Vessel Trajectories
链接: https://arxiv.org/abs/2606.17978
作者: Ruixin Song,Md Mahbub Alam,Zahra Sadeghi,Amilcar Soares,José F. Rodrigues-Jr,Gabriel Spadon
类目: Artificial Intelligence (cs.AI)
备注: Under review at SIGSPATIAL’26
Abstract:Trajectory similarity is a fundamental task in analyzing mobility patterns, essential for applications such as route pattern extraction, mobility prediction, and anomaly detection. Traditional distance-based measures for computing similarity incur high computational cost, driving the adoption of lightweight learning-based approaches. Supervised methods rely on extensive labels derived from traditional distance measures and often reproduce these metrics, which limits generalization. While self-supervised learning addresses this issue through contrastive learning, it lacks a unified framework, making it difficult to compare deep learning (DL) models for consistent trajectory representation. Accordingly, this paper presents MoCo-AIS, a unified framework for learning vessel trajectory embeddings based on the Momentum Contrast (MoCo) paradigm, which formulates similarity learning through positive and negative trajectory pairs. Within this framework, we evaluate a diverse set of leading DL models on large-scale, real-world vessel-tracking AIS datasets that capture diverse navigation behaviors and operating conditions. Results demonstrate that our framework significantly improves similarity learning over existing baselines, while providing a benchmarking platform for evaluating trajectory representation models.
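As a rough illustration of the Momentum Contrast paradigm the abstract builds on (not the MoCo-AIS implementation), the sketch below shows the two standard MoCo ingredients in plain Python: the momentum update of the key encoder and an InfoNCE loss over one positive and several negative embeddings. The function names, the momentum value, and the temperature 0.07 are illustrative assumptions; how MoCo-AIS builds positive and negative trajectory pairs is not specified in the abstract.

```python
import math

def momentum_update(key_params, query_params, m=0.999):
    """Key encoder slowly tracks the query encoder: k <- m*k + (1-m)*q."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def info_nce(query, positive, negatives, temperature=0.07):
    """Cross-entropy of picking the positive among positive + negatives."""
    q = normalize(query)
    logits = [dot(q, normalize(positive)) / temperature]
    logits += [dot(q, normalize(n)) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    mx = max(logits)
    log_denom = mx + math.log(sum(math.exp(l - mx) for l in logits))
    return log_denom - logits[0]
```

The loss is small when the query embedding is closer to its positive than to any negative, which is the property a trajectory-similarity model trained this way would rely on.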
[AI-31] SoftMoE: Soft Differentiable Routing for Mixture-of-Experts in LLMs ICML2026
链接: https://arxiv.org/abs/2606.17952
作者: Mikołaj Zasada,Łukasz Struski,Jacek Tabor,Marcin Kurdziel
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at ICML 2026
Abstract:Sparse Mixture-of-Experts (MoE) architectures enable scaling LLM parameters under a fixed inference budget by activating only a small subset of experts via top-k routing. While this preserves causality and suits autoregressive language models, the discrete top-k operator is not differentiable, forcing a fixed number of active experts per input and resulting in inefficient use of computation. We propose SoftMoE, which replaces discrete routing with a truncated soft top-k LapSum relaxation, allowing gradient-based optimization of expert routing. We further parameterize the mean number of active experts per layer and impose a global budget constraint, enabling the model to learn how to allocate expert capacity across layers. SoftMoE remains fully compatible with autoregressive modeling and achieves performance comparable to or better than sparse MoE on language modeling and downstream tasks, while activating significantly fewer experts. Notably, the learned allocation is highly non-uniform, with later layers activating more experts. The source code is publicly available.
[AI-32] Small Initialization Matters for Large Language Models
链接: https://arxiv.org/abs/2606.17945
作者: Liangkai Hang,Junjie Yao,Zhiyu Li,Feiyu Xiong,Hongkang Yang,Zhi-Qin John Xu
类目: Artificial Intelligence (cs.AI)
备注: 26 pages, 8 figures
Abstract:Large language models provide a tractable system for asking how intelligence itself emerges, rather than only how LLMs can be engineered. Although progress is usually attributed to scale, data and architecture, we show that parameter initialization is a gene-like determinant of training and, in particular, of model capacity. Reducing the initialization scale consistently improves pretraining, with the largest gains on reasoning-demanding tasks. We identify two widely used empirical settings that restrain the advantage of small initialization, and show how relaxing them restores favorable scaling. We further uncover a critical initialization that balances reasoning and training. Mechanistically, small initialization drives a distinct developmental trajectory: parameters first condense into low-complexity structures and later expand into richer representations, giving concrete form to the idea that compression is intelligence. Token-level analyses show that the gains concentrate on non-trivial, context-constrained predictions rather than all tokens uniformly. These results motivate a simple $\gamma$-initialization rule: expose the initialization range as an explicit knob and use small initialization by default, an almost cost-free intervention that improves pretraining and strengthens reasoning across model scales.
[AI-33] How Inference Compute Shapes Frontier LLM Evaluation
链接: https://arxiv.org/abs/2606.17930
作者: Jessica McFadyen,Ole Jorgensen,Harry Coppock,Kevin Wei,Cozmin Ududec
类目: Artificial Intelligence (cs.AI)
备注: 34 pages, 4 figures
Abstract:AI evaluations are shifting toward harder tasks that benefit from longer trajectories involving tool use and iterative problem solving. As a result, performance is increasingly sensitive to the amount and allocation of compute available at test time (“inference compute”). Yet many evaluations still report performance at a single restrictive budget, meaning that low scores may reflect the evaluation setup rather than the model’s underlying capability. To test this, we evaluate up to 12 frontier language models on seven challenging benchmarks spanning software engineering, mathematics, medicine, and cybersecurity. We use a controlled setup combining three simple inference-scaling interventions: larger token budgets, context compaction, and repeated submission attempts, guided either by the model itself or by minimal correctness feedback. We find three main results. First, larger token budgets substantially improve performance on benchmarks across multiple domains, including cybersecurity, FrontierMath, Humanity’s Last Exam, and TerminalBench. Second, fixed-budget evaluations can increasingly understate frontier capability as models advance. Newer models reach higher performance at large budgets, where they unlock harder tasks and solve them more reliably. Third, benchmarks differ in which inference-scaling methods help most: repeated submission broadly improves performance, but the value of larger token budgets, external feedback, and parallel attempts varies by benchmark. Overall, our results show that benchmark scores are protocol-dependent. We therefore argue that evaluations should report capability as a function of inference-time compute, specify protocol choices explicitly, and compare model generations over a large shared compute range at matched budgets, especially in safety- or policy-relevant settings.
[AI-34] PreAct: Computer-Using Agents that Get Faster on Repeated Tasks
链接: https://arxiv.org/abs/2606.17929
作者: Bojie Li
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Computer-using agents drive real software through the screen (clicking and typing), but they solve every task from scratch: asked to repeat a task, an agent re-reads the screen, re-reasons every tap, and pays the full cost again. We present PreAct, which lets such an agent get faster on tasks it has done before. The first time it succeeds, PreAct compiles the run into a small state-machine program (states that check the screen, transitions that act) and on later runs replays it directly instead of invoking the agent, 8.5-13x faster, with no per-step language-model calls. Replay is not blind: at each step PreAct checks that the screen matches what the program expects before acting, and hands control back to the agent the moment something is off. PreAct applies the same discipline when deciding what to keep: a freshly compiled program enters the store only if, re-run from a clean state, an independent evaluator confirms it solved the task, catching programs that replay to their last step yet leave the task undone. Across a mobile, a desktop, and a web benchmark, this store-time check separates repeated runs that improve from ones that degrade as faulty programs accumulate, worth 1.75-2.6 tasks per benchmark, the same direction on all three; a fallback that explores afresh when no program fits brings PreAct level with a strong record-and-replay baseline. We also report what did not matter: prompt wording, runtime guardrails, and whether a language model or a plain embedding retriever selects which program to reuse.
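The replay-with-checks idea described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not PreAct's code: the `Screen` dictionary, the `Step` record, and the names `replay`, `admit_to_store`, `agent_fallback`, and `evaluator` are all hypothetical stand-ins for whatever the paper actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Screen = Dict[str, str]  # hypothetical, simplified screen representation

@dataclass
class Step:
    expects: Callable[[Screen], bool]  # state: does the screen match?
    act: Callable[[Screen], Screen]    # transition: perform the action

def replay(program: List[Step], screen: Screen, agent_fallback):
    """Replay compiled steps; hand control to the agent on the first mismatch."""
    for index, step in enumerate(program):
        if not step.expects(screen):
            return agent_fallback(screen, index)
        screen = step.act(screen)
    return screen

def admit_to_store(program, clean_screen, evaluator, store):
    """Store a program only if a clean re-run passes an independent check."""
    mismatch = object()  # sentinel marking a replay that did not finish
    result = replay(program, dict(clean_screen), lambda s, i: mismatch)
    if result is not mismatch and evaluator(result):
        store.append(program)
        return True
    return False
```

Note that no language-model call appears inside `replay`; the agent is only invoked through the fallback, which is where the reported speedup on repeated tasks would come from.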
[AI-35] KANLib – A Modular, Extensible, and Fast Kolmogorov-Arnold Network Implementation
链接: https://arxiv.org/abs/2606.17927
作者: Julian Hoever,Gregor Schiele
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Kolmogorov-Arnold Networks (KANs) have recently emerged as a promising alternative to traditional multilayer perceptrons by replacing linear weights with learnable univariate functions. Despite their theoretical advantages in interpretability and expressiveness, practical research on KANs remains difficult due to high computational costs and inconsistent feature support across existing frameworks. This paper introduces KANLib, a modular, extensible, and computationally efficient framework for developing and evaluating KAN architectures. KANLib unifies core concepts from existing implementations, including PyKAN, EfficientKAN, and FastKAN, within a consistent software architecture that emphasizes flexibility, feature parity, and high performance. The framework supports two basis function types, adaptive grid rescaling, grid extension, and fine-grained architectural customization while maintaining compatibility with standard PyTorch workflows. Experimental evaluation on the California Housing benchmark demonstrates that KANLib reproduces the predictive behavior of established reference KAN implementations while achieving competitive computational efficiency. Furthermore, the framework enables the exploration of architectural variations beyond standard KAN formulations with only minor impacts on predictive performance. Overall, KANLib provides a robust foundation for future research on scalable and extensible KAN architectures.
[AI-36] PearlVLA: Progressive Embodied Action-Plan Refinement in Latent Space
链接: https://arxiv.org/abs/2606.17924
作者: Bochen Yang,Lianlei Shan
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 21 pages, 2 figures. Preprint
Abstract:Current Vision-Language-Action (VLA) models face a trade-off between efficient action generation and explicit deliberation. Directly decoding actions from vision-language backbone representations enables low-latency control, whereas explicit reasoning through textual chains, pixel-level subgoals, or action search can improve planning but incurs substantial latency and computational cost. We propose PearlVLA, a VLA framework that moves deliberation into the latent space of a vision-language model (VLM). PearlVLA separates VLM meta-query representations into a fixed visual grounding branch and an iterative latent plan branch. At each refinement round, a plan-conditioned world query probes a lightweight frozen latent world model for an action-free future observation latent, which is fed back to guide plan refinement. A future-guided RefineNet then applies scheduled residual updates to progressively refine a coarse semantic draft into a fine-grained latent action plan. The refined plan after K rounds is then decoded in parallel into an action chunk for low-latency execution. We further introduce Causal Refinement-Grouped Process-Reward RL to optimize the latent refinement process with rewards from longer-horizon imagined futures induced by latent plan edits. Empirical evaluations on the LIBERO benchmark demonstrate that PearlVLA achieves state-of-the-art performance among existing methods.
[AI-37] DiagFlowBench: Evaluating How Language Models Handle Off-Procedure Inputs in Grounded Diagnostic Dialogue
链接: https://arxiv.org/abs/2606.17904
作者: Guillermo Gil de Avalle,Laura Maruster,Shaina Raza,Christos Emmanouilidis
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Language models increasingly serve as advisory systems in maintenance operations. To prevent hallucination, recent systems ground these models in procedural documentation to constrain them to approved steps. In practice, however, operator queries frequently stray from this path, requiring models to recognise out-of-scope inputs mid-conversation, a dynamic that current benchmarks rarely prioritise. We introduce DiagFlowBench, a dataset of 50 industrial diagnostic flowcharts from a consumer manufacturer converted into 1,676 multi-turn conversations that contrast compliant with out-of-scope utterances. Evaluating a panel of ten commercial and open-weight models reveals high variability in abstention rates, with models commonly selecting a real but contextually inadequate step rather than fabricating facts. The inherent plausibility and authority of this mapped but wrong advice exposes a challenging vulnerability for grounding systems.
[AI-38] Learn to Quantify Social Interaction with Constraints for Pedestrian Walking
链接: https://arxiv.org/abs/2606.17897
作者: Xiaodan Shi
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:Long-term human path forecasting in crowds is critical for autonomous moving platforms (such as self-driving cars and social robots) to avoid collisions and plan well. Although current research takes social interactions into account for prediction, it does not reveal exactly which kinds of social interactions occur among people or how they affect pedestrians’ decision-making, which limits its robustness. Social interactions in pedestrian walking are intuitively numerous and hard to label and quantify. In this paper, we quantify and interpret how pedestrians interact with others by proposing Learn to Cluster. Our method for clustering social interactions is a probabilistic latent-variable generative model that learns directly from sequential trajectory observations and scales to an arbitrary number of pedestrians. Learn to Cluster is label-free and can be naturally integrated into the training process of the prediction model. The latent variables then serve as ‘labels’ to categorize social interactions. Extensive experiments on several trajectory prediction benchmarks demonstrate that our method is able to learn the patterns of social interactions and effectively integrate them into pedestrian trajectory prediction.
[AI-39] Dimensionality Controls When Modularity Helps in Continual Learning ICML2026
链接: https://arxiv.org/abs/2606.17889
作者: Kathrin Korte,Christian Medeiros Adriano,Joachim Winther Pedersen,Eleni Nisioti,Sebastian Risi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: Accepted to the 2nd Workshop on Compositional Learning (CompLearn) at ICML 2026, Seoul, South Korea. 8 pages, 5 figures
Abstract:Compositional learning systems must balance plasticity, the ability to acquire new knowledge, with stability, the preservation of previously learned components, especially when tasks share structure and risk interference. We study how modular architecture, task similarity, and representational dimensionality jointly shape compositional continual learning in a sequential A-B-A paradigm, comparing a task-partitioned recurrent network to a single-network baseline while inducing high- and low-dimensional regimes via weight-scale manipulations. In a high-dimensional “lazy” regime, both architectures achieve similar performance and internal geometry, suggesting that explicit modular structure has little impact when representations are weakly constrained. In a lower-dimensional “rich” regime, modularity becomes decisive: the modular network develops graded task-specific subspaces that overlap for similar tasks, partially align for moderately dissimilar tasks, and separate for dissimilar tasks, yielding a more compositional and interpretable organization than the single network. These findings identify the representational regime induced by initialization scale, which co-varies with representational dimensionality, as a key factor governing when compositional, modular structure is functionally beneficial in continual learning, and support viewing safety and robustness as problems of adaptive allocation of representational subspaces rather than fixed separation versus sharing.
[AI-40] MathVis-Fine: Aligning Visual Supervision with Necessity via Progressive Dependency-Guided Training for Multimodal Mathematical Reasoning
链接: https://arxiv.org/abs/2606.17888
作者: Wanshi Xu,Haokun Zhao,Haidong Yuan,Songjun Cao,Long Ma
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Chain-of-Thought (CoT) reasoning has extended from purely linguistic domains to multimodal scenarios; however, existing approaches often treat visual inputs as homogeneous or auxiliary signals, failing to capture the intricate and sample-specific dependencies between text and images in mathematical problem-solving. This gives rise to two core issues: first, the supervisory signals for visual content are generalized and coarse-grained, lacking adaptation to the actual necessity of visual information in each sample; second, training feedback becomes inaccurate when visual rewards are uniformly applied without distinguishing the complementary relationships among inputs. These limitations hinder models from achieving precise multimodal reasoning. In this work, we propose a framework for modeling fine-grained visual dependencies in mathematical reasoning. We first construct the MathVis-Fine dataset, augmenting fine-grained visual annotations with visual dependency ratings. Building upon this dataset, we introduce a two-stage progressive visual enhancement training paradigm that balances answer correctness rewards and visual grounding rewards according to the intrinsic visual dependency level of each sample, thereby mitigating reward bias and improving supervision accuracy. Extensive experiments demonstrate that the MathVis-Fine framework effectively enhances visual perception progressively based on visual dependency, offering a more precise training framework for multimodal mathematical reasoning. We will release the dataset upon acceptance.
[AI-41] Structural Preservation and the Logical Expressiveness of Graph Neural Networks
链接: https://arxiv.org/abs/2606.17882
作者: Przemysław Andrzej Wałęga,Bernardo Cuenca Grau
类目: Artificial Intelligence (cs.AI)
备注: 20 pages
Abstract:Bridges between graph neural networks (GNNs) and logical formalisms have been established by fixing architectural choices, such as the types of aggregation, combination, and activation functions. These choices define restricted classes of GNNs for which tight correspondences with logical formalisms can be obtained, by showing that logical formulae can be translated into equivalent GNNs and, conversely, that GNNs can be translated into equivalent formulae. In this paper we take a semantic perspective by establishing the logical expressiveness of classes of GNN classifiers that are preserved under structural properties: embeddings (extensions), injective homomorphisms, and homomorphisms. We show that, for each such property, there exists a fragment of graded modal logic characterising the class of GNNs. In particular, preservation under embeddings, injective homomorphisms, and homomorphisms corresponds to existential graded modal logic, its existential-positive fragment, and existential-positive modal logic, respectively. These results characterise the expressiveness of broad classes of GNNs independently of specific architectural choices, but we also show that each of these classes admits a GNN architecture of the same expressiveness. Technically, our approach uses a new well-quasi-order result for trees of bounded height, yielding finite representations of unravelling-invariant classes.
[AI-42] AnchorKV: Safety-Aware KV Cache Compression via Soft Penalty with a Refusal Anchor
链接: https://arxiv.org/abs/2606.17872
作者: Ning Ni,Yingjie Lao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) outperform earlier architectures on generative inference and long-context tasks, but their large size introduces significant challenges in memory usage, energy cost, and on-device deployment. Since scaling pre-trained language models improves downstream capability \cite{zhao2023survey}, the key-value (KV) cache becomes a dominant inference bottleneck. Recent KV cache compression methods \cite{jo2025fastkv,li2024snapkv,zhou2024dynamickv} reduce this cost by retaining only a subset of attention-relevant tokens. However, while these approaches preserve accuracy on benign workloads, their compression policies either fail to defend against jailbreak attacks \cite{jiang2024robustkv} or degrade safety alignment under aggressive eviction. We propose AnchorKV, a drop-in modification to KV cache compression that biases token retention scores away from directions in key space associated with harmful prompts. AnchorKV constructs an offline safety anchor by adapting a difference-of-means representation engineering approach \cite{arditi2024refusal,zou2023representation} to the layer-specific key projection space used in KV caching. Based on this anchor, a soft penalty token selection rule trades a small amount of utility for substantially improved safety alignment, while reducing to the original compressor when the penalty is zero.
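One way to picture the anchor-plus-soft-penalty rule described in this abstract is the sketch below. It is only a guess at the general shape, under loudly stated assumptions: the anchor is taken as the difference between mean key vectors of harmful and benign prompts, the penalty is the cosine similarity between a token's key and that anchor, and the penalty is subtracted from the base compressor's score. The abstract does not give the actual formula, and all function names here are hypothetical. The one property taken directly from the abstract is that a zero penalty reproduces the base compressor's selection.

```python
import math

def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def safety_anchor(harmful_keys, benign_keys):
    """Difference of means: direction from benign keys toward harmful keys."""
    mh, mb = mean_vector(harmful_keys), mean_vector(benign_keys)
    return [h - b for h, b in zip(mh, mb)]

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def select_tokens(scores, keys, anchor, budget, penalty=0.0):
    """Keep `budget` tokens ranked by base score minus a soft anchor penalty.

    ASSUMPTION: penalty form and sign are illustrative, not from the paper.
    With penalty == 0.0 this is plain top-`budget` selection by base score.
    """
    adjusted = [s - penalty * cosine(k, anchor) for s, k in zip(scores, keys)]
    ranked = sorted(range(len(scores)), key=lambda i: adjusted[i], reverse=True)
    return sorted(ranked[:budget])
```

Because the penalty is soft rather than a hard filter, a token strongly favored by the base compressor can still be retained, which matches the abstract's framing of trading a small amount of utility for safety.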
[AI-43] StepGuard: Guarding Web Navigation via Single-Step Calibration
链接: https://arxiv.org/abs/2606.17871
作者: Zhihao Cui,Yuchen Zhang,Xiyang Sun,Yaxiong Wang,Li Zhu,Jinpeng Hu,Liu Liu,Mengjia Li,Yujiao Wu
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Web navigation requires agents to follow natural language goals, interact with web pages, and produce accurate answers. While recent advances leverage vision-language models and reinforcement learning, existing methods still suffer from single-step fragility due to reward misalignment and error propagation. To tackle reward entanglement, we design Dynamic Dual-Policy Optimization (DDPO), which dynamically switches between a navigation-first mode for exploration and an answer-first mode for question-answering to mitigate reward conflict. To calibrate single-step errors, we propose Confidence-Guided Adaptive Navigation Reflection (CANR), a mechanism that estimates per-step confidence, triggers reflection only when necessary, and uses contrastive rewards to encourage self-correction. With these as the main components, we develop StepGuard, a new framework for guarding web navigation via single-step calibration. Experiments demonstrate that our approach significantly improves navigation and answer accuracy, setting new state-of-the-art performance on standard web navigation benchmarks.
[AI-44] FlowRAG: Synergizing Explicit Reasoning via Frequency-Aware Multi-Granularity Graph Flow
链接: https://arxiv.org/abs/2606.17856
作者: Bihao Zhan,Zongsheng Cao,Jie Zhou,Bo Zhang,Liang He
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Graph-based retrieval-augmented generation (GraphRAG) is effective for knowledge-intensive and multi-hop query tasks; however, many existing methods primarily seed entity-based graphs and rely on implicit semantic relevance propagation. This often (i) under-retrieves when user queries are abstract and semantically sparse at the entity level, and (ii) suffers from brittle multi-hop reasoning, where noisy activations can derail entity-to-entity transitions and corrupt the inferred relation chain, yielding unreliable conclusions. To this end, we propose FlowRAG, a semantic-aware retrieval framework that improves both semantic recall and explicit reasoning. Specifically, FlowRAG constructs a quad-level heterogeneous graph over passages, summaries, sentences, and entities, where summary nodes serve as a coarse semantic hub. At retrieval time, a dual-granularity activation module combines summary–query alignment with sentence-level matching to robustly activate relevant entities under paraphrase and abstraction. We then introduce a frequency-aware weighted flow module that routes relevance through entity–passage links weighted by within-passage term frequency, pruning noisy connections and extracting high-confidence reasoning paths as an explicit logic skeleton for generation. Extensive experiments show that FlowRAG obtains state-of-the-art performance on complex reasoning benchmarks.
[AI-45] A homotopy-type-theoretic generalization of neurosymbolic inference
链接: https://arxiv.org/abs/2606.17851
作者: Fernando Zhapa-Camacho,Robert Hoehndorf
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:
Abstract:A wide range of neurosymbolic (NeSy) systems compute one functional: a belief-weighted sum of a logical quantity over a space of $\sigma$-structures, of which weighted model counting, fuzzy logic, and probabilistic logic are special cases. This account is built on sets, and a set deliberately forgets two things that are important for NeSy: when two $\sigma$-structures are the same up to a symmetry of the theory, and how many distinct proofs witness a query. Replacing the underlying sets by types, in the sense of homotopy type theory, preserves this information, and turns this functional into a belief-weighted homotopy cardinality, a notion of size that counts each object in inverse proportion to its symmetries. We develop the framework from scratch for NeSy systems, prove a conservativity theorem that recovers the classical functional when symmetries are trivial, and show that the symmetry our framework exposes is exactly the one behind reasoning shortcuts. The payoff is concrete: the shortcut-aware concept posterior that recent methods reach by ensembling or expressive density estimation is the only symmetry-invariant point of the confusion-set simplex, computable in closed form by averaging a single model over the symmetry group. On MNIST reasoning-shortcut benchmarks this single-model wrapper is better calibrated than a diversity-trained ensemble, while leaving label accuracy and identifiable concepts untouched. Code is freely available at this https URL.
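For readers unfamiliar with the notion of size mentioned in this abstract, the standard (unweighted) homotopy cardinality of a finite groupoid $X$ is given below. This is the textbook definition only; the belief-weighted version used in the paper is not spelled out in the abstract and is not reproduced here.

```latex
% Homotopy (groupoid) cardinality of a finite groupoid X:
% one term per isomorphism class, weighted by the inverse
% of the size of that object's automorphism group.
\[
  |X| \;=\; \sum_{[x] \,\in\, \pi_0(X)} \frac{1}{|\mathrm{Aut}(x)|}
\]
% If every automorphism group is trivial, |Aut(x)| = 1 and
% |X| reduces to the ordinary number of elements.
% A single object with a symmetry group of order 2 has |X| = 1/2.
```

The trivial-symmetry case is consistent with the conservativity result the abstract mentions: when no object has non-trivial symmetries, counting by homotopy cardinality agrees with ordinary set-based counting.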
[AI-46] WallZero: Mastering the Game of WallGo with Strategic Analysis
链接: https://arxiv.org/abs/2606.17847
作者: Hsing-Yu Chen,Jérôme Arjonilla,I-Chen Wu,Ti-Rong Wu
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted by the Computers and Games conference (CG 2026)
Abstract:WallGo is a recently introduced strategic board game popularized by the 2025 Netflix series The Devil’s Plan. Although played on a small 7 x 7 board, its combination of stone movement and wall placement yields high game-tree complexity and intricate strategic interactions. Despite its growing popularity, WallGo remains underexplored. This paper presents WallZero, an AlphaZero-based agent for the two-player WallGo setting. We introduce tailored action and feature designs to improve playing performance significantly. In the evaluation, WallZero defeats two professional Go players who participated in this study, securing on average 1.98x more territory per game. Beyond its strength, we use WallZero to assess game fairness and identify key strategies for mastering WallGo. Interestingly, our results show that the opening used in the Netflix series yields a more balanced game. Our code is available at this https URL.
[AI-47] Functional Equivalence in Attention: A Comprehensive Study with Applications to Linear Mode Connectivity ICML2026
链接: https://arxiv.org/abs/2606.17830
作者: Viet-Hoang Tran,Vinh Khanh Bui,Van-Hoan Trinh,Tan Lai Ngoc,Tan M. Nguyen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Published at the International Conference on Machine Learning (ICML 2026)
Abstract:Neural network parameter spaces are inherently non-injective, as distinct parameter configurations can realize identical functions through functional equivalence. While this symmetry is well understood in classical fully connected and convolutional models, it becomes substantially more intricate in modern attention-based architectures. Existing analyses of multihead attention have largely focused on the vanilla formulation, overlooking positional encodings that fundamentally reshape architectural symmetries. In this work, we provide a formal study of functional equivalence in Transformers with positional encodings. Focusing on the two most widely used variants, sinusoidal and rotary positional encodings (RoPE), we show that sinusoidal encodings preserve the equivalence structure of vanilla attention, whereas rotary encodings significantly reduce the symmetry group, thereby enhancing expressivity. This offers a principled explanation for the growing prominence of RoPE in practice. We further examine how positional encodings affect linear mode connectivity, and through an alignment algorithm, empirically demonstrate that the presence and variability of connectivity across Transformer settings crucially depend on the positional encoding.
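As background for the rotary encodings discussed in this abstract, here is a small plain-Python sketch of standard RoPE (rotating consecutive coordinate pairs by position-dependent angles, as in the original RoFormer formulation). It illustrates what RoPE does to queries and keys, not the paper's symmetry analysis; the function names and the base of 10000 are the conventional choices, assumed here.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate pairs (x0, x1), (x2, x3), ... by angles position * base**(-i/d).

    Assumes len(vec) is even.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))
```

Because each pair is rotated, the query-key dot product after encoding depends only on the offset between the two positions, and vector norms are unchanged. The rotation acts inside the query-key product itself, unlike sinusoidal encodings, which are added to the input embeddings.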
[AI-48] DecoSearch: Complexity-Aware Routing and Plan-Level Repair for Text-to-SQL
链接: https://arxiv.org/abs/2606.17821
作者: Esteban Schafir,Xu Zheng,Hojat Allah Salehi,Zhuomin Chen,Mo Sha,Wei Cheng,Dongsheng Luo
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in translating natural language to SQL, yet existing methods still falter on complex queries requiring multi-step, data-aware reasoning. We introduce DecoSearch, a training-free framework that addresses this by routing each query to the appropriate level of reasoning effort. A lightweight Schema Selector first prunes the full database schema to the relevant tables and columns. An LLM Judger then decides whether the question requires decomposition: straightforward questions follow a direct generation path and complex ones are escalated to a Directed Acyclic Graph (DAG) of atomic sub-questions, each solved by a targeted SQL generation step. A RAG component grounds the decomposer with semantically similar training examples, and a Topology Refiner restructures the reasoning plan when execution failures signal a flawed decomposition rather than a fixable SQL error. DecoSearch achieves 70.53% execution accuracy on BIRD and 88.31% on Spider with a DeepSeek backbone, surpassing all training-free baselines while consuming an order of magnitude fewer tokens than competing methods. It also functions as a model-agnostic wrapper, consistently improving fine-tuned SQL generation backbones without any modification to the pipeline.
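The routing-then-DAG control flow described in this abstract can be sketched roughly as follows. This is an illustration under assumptions, not the DecoSearch pipeline: `is_complex`, `decompose`, `solve_direct`, and `solve_step` are hypothetical callables standing in for the LLM Judger, the decomposer, and the SQL generation steps, and the schema selection, RAG, and Topology Refiner stages are omitted.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def answer(question, is_complex, decompose, solve_direct, solve_step):
    """Route simple questions directly; solve complex ones over a DAG."""
    if not is_complex(question):
        return solve_direct(question)
    # dag maps each sub-question to the set of sub-questions it depends on.
    dag = decompose(question)
    results = {}
    # static_order() yields every node after all of its prerequisites.
    for sub in TopologicalSorter(dag).static_order():
        deps = {d: results[d] for d in dag.get(sub, ())}
        results[sub] = solve_step(sub, deps)
    return results
```

Processing sub-questions in topological order guarantees that each step sees the results of the steps it depends on, and `TopologicalSorter` raises `CycleError` if the plan is not actually acyclic.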
[AI-49] Conservation Laws for Modern Neural Architectures ICML2026
链接: https://arxiv.org/abs/2606.17816
作者: Viet-Hoang Tran,Vinh Khanh Bui,Tan Lai Ngoc,Nam Nguyen,Tuan Dam,Tan M. Nguyen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Published at the International Conference on Machine Learning (ICML 2026)
Abstract:Understanding gradient descent dynamics is key to explaining the success of over-parameterized models, where implicit bias manifests through conservation laws in gradient flow. While such laws are well understood for linear and ReLU networks, they remain largely unexplored for modern architectures. This work develops a unified framework to characterize conservation laws for contemporary models, including feedforward networks with GELU, SiLU, and SwiGLU activations, multihead attention with sinusoidal and rotary positional encodings, and Mixture-of-Experts architectures under diverse gating designs. Our theoretical findings are supported by experiments that validate the predicted invariants.
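To make concrete what a conservation law in gradient flow looks like, here is the classical two-layer linear case that the abstract calls well understood. This is background, not one of the paper's new results; the symbols $W_1$, $W_2$, and $G$ are notation introduced here.

```latex
% Two-layer linear network f(x) = W_2 W_1 x, with a loss L that depends
% on the parameters only through the product W_2 W_1.
% Let G = \partial L / \partial (W_2 W_1). Gradient flow gives
\[
  \dot W_1 = -W_2^{\top} G, \qquad \dot W_2 = -G\, W_1^{\top}.
\]
% Differentiating the two Gram matrices:
\[
  \frac{d}{dt}\bigl(W_1 W_1^{\top}\bigr)
    = -W_2^{\top} G W_1^{\top} - W_1 G^{\top} W_2
    = \frac{d}{dt}\bigl(W_2^{\top} W_2\bigr),
\]
% so their difference is conserved along the trajectory:
\[
  \frac{d}{dt}\bigl(W_1 W_1^{\top} - W_2^{\top} W_2\bigr) = 0 .
\]
```

The conserved quantity fixes the balance between adjacent layers at its initial value, which is one way implicit bias shows up in the training dynamics.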
[AI-50] No-Free-Fairness: Fundamental Limits and Trade-offs in Learning Systems
链接: https://arxiv.org/abs/2606.17810
作者: Khoat Than
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:In this paper, we establish a set of theoretical impossibility results, termed the No-Free-Fairness theorems, that identify three fundamental sources of disparity in learning systems. First, we show that when a task exhibits irreducible cost on a subgroup, any decision rule must trade off overall performance with disparity, yielding an inherent fairness–cost frontier. Second, we prove that even in ideal, noise-free settings where a perfectly fair and accurate solution exists, finite-sample learning alone induces nontrivial subgroup disparity, ruling out distribution-free fairness guarantees. More seriously, enforcing strict relative fairness creates a statistical bottleneck: achieving low cost may require exponentially many samples. Third, we show that limitations of the model class can independently induce disparity: if the model cannot represent accurate solutions for a subgroup, fairness remains unattainable regardless of data or training procedure. Overall, these results demonstrate that unfairness is not solely a consequence of biased data or suboptimal optimization, but arises from the intrinsic structure of decision problems, the constraints of finite data, and the expressivity of models. Our framework applies broadly beyond standard supervised learning, and suggests that achieving fairness requires explicit trade-offs and should be treated as a core design consideration.
[AI-51] MIVE: A Minimalist Integer Vector Engine for Softmax, LayerNorm, and RMSNorm Acceleration
链接: https://arxiv.org/abs/2606.17781
作者: Kosmas Alexandridis,Giorgos Dimitrakopoulos
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid growth of Large Language Models (LLMs) has intensified the need for specialized hardware accelerators that can satisfy stringent inference latency and power constraints. Although matrix multiplications dominate the overall computational workload, non-linear vector normalization operations, such as LayerNorm, RMSNorm, and Softmax, can become critical hardware bottlenecks. Existing accelerators typically implement these functions using dedicated hardware blocks, leading to duplicated resources and inefficient silicon utilization. To address this limitation, we propose a Minimalist Integer Vector Engine (MIVE), a programmable architecture capable of executing all three operations within a unified datapath. By exploiting common computational patterns across LayerNorm, RMSNorm, and Softmax, the proposed vector engine maximizes hardware sharing while reducing implementation overhead. Physical ASIC implementation results show that MIVE provides comprehensive multi-function support while achieving higher area and hardware efficiency than most state-of-the-art standalone accelerators.
[AI-52] A Neuromorphic Trigger for Efficient Audio Event Detection
链接: https://arxiv.org/abs/2606.17775
作者: Benjamin Hatton,Oliver Rhodes,Luca Peres
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 9 pages, 4 figures, 6 tables
Abstract:Efficient processing of continuous audio streams remains a key challenge for real-time and resource-constrained systems. This paper introduces a neuromorphic trigger for audio event detection, based on a spiking neural network (SNN) that selectively gates input to downstream models. The proposed trigger acts as a low-cost front-end, identifying salient audio segments and forwarding only these to a more computationally intensive model for tasks such as classification. The trigger is implemented as a lightweight fully connected SNN and evaluated on two representative tasks: Anomalous Sound Detection (ASD) and Sound Event Detection (SED). For ASD, the trigger achieves a one-second segment-based F1 score of 0.97 on a class-agnostic form of the URBAN-SED dataset, demonstrating high reliability in identifying relevant audio regions. For SED, the trigger is combined with the Dang classifier on the DCASE 2017 Challenge Task 2 dataset, showing a potential 42.6× reduction in FLOPs while reducing the lower bound of the event-based error rate from 0.41 to 0.25. These results highlight the potential of neuromorphic triggers as real-time, energy-efficient front-end filters, enabling substantial reductions in computational cost.
[AI-53] Shattering the Autoregressive Curse: Dynamic Epistemic Entropy Orchestrated Erasable Reinforcement Learning for LLMs
链接: https://arxiv.org/abs/2606.17735
作者: Ziliang Wang,Kang An,Faqiang Qian,Jialu Cai,Cijun Ouyang,Yuhang Wang,Qibing Ren,Yichao Wu
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Although reinforcement learning (RL) has expanded the cognitive boundaries of large language models (LLMs), it often remains vulnerable to the autoregressive curse in long-horizon logical reasoning: small epistemic perturbations introduced early in generation can propagate irreversibly along the Markov decision process flow, triggering cascading failures that drive the reasoning trajectory toward collapse. To overcome this autoregressive cascade, in which a single early mistake can compromise all subsequent reasoning steps, we propose dynamic epistemic entropy orchestrated erasable reinforcement learning (E³RL). E³RL eliminates reliance on external signals by grounding the model’s endogenous local autoregressive cross-entropy as an intrinsic coordinate of epistemic uncertainty. By introducing segment-level adaptive dynamic thresholds and advantage allocation, E³RL enables the model to precisely excise localized logical defects while reusing historical key-value (KV) cache streams, thereby endowing the reasoning process with a self-healing capability. We train E³RL on the DeepMath-103k dataset. Experimental results show that E³RL reshapes the exploration efficiency of long-sequence reasoning and improves sample efficiency while maintaining linear memory overhead. On mathematical reasoning benchmarks such as AIME, E³RL achieves substantial performance gains, with the 4B and 8B parameter models surpassing previous state-of-the-art (SOTA) results by 5.349% and 6.514%, respectively. These findings suggest that E³RL shatters the autoregressive curse in long-sequence reasoning and establishes a theoretical and systems-level foundation for the next generation of self-healing artificial general intelligence (AGI).
[AI-54] LongWebBench: Evaluating Structural and Functional Webpage Generation in Long-Horizon Settings
链接: https://arxiv.org/abs/2606.17727
作者: Yi Zhao,Zhen Yang,Mengpan Chen,Mingde Xu,Shanghui Gong,Xijun Liu,Jibing Gong,Jie Tang
类目: Artificial Intelligence (cs.AI)
备注: 49 pages, 38 figures
Abstract:Recent vision-language models (VLMs) have shown promising progress in generating webpages from visual inputs, yet existing evaluations mainly focus on short, single-screen, and largely static webpages. We introduce LongWebBench, a benchmark for evaluating long-horizon webpage generation from both structural and functional perspectives. LongWebBench contains 490 real-world long webpages for structural fidelity evaluation and 507 goal-oriented interaction tasks over 129 webpages for functional evaluation. It employs two complementary protocols: a multi-dimensional VLM-based metric for assessing long-range structural coherence, and a DOM-augmented agent-based pipeline for end-to-end functional verification. We further examine the automatic evaluation protocols through human agreement analysis. Experiments with state-of-the-art open-source and proprietary VLMs under single-image and multi-image settings reveal that structural fidelity degrades as webpage length increases, while visually plausible generations often fail to support executable multi-step interactions. These results highlight the need to evaluate long webpage generation beyond visual similarity, with executable interaction as a core criterion. Our code and data are available at this https URL.
[AI-55] Confusion-Aware Transfer Teacher Curriculum Learning Framework: Disentangling Scoring and Pacing Effects ICML
链接: https://arxiv.org/abs/2606.17706
作者: Savini Kommalage,Sanka Mohottala,Asiri Gawesha,Dulara Madhusanka,Menan Velayuthan,Dharshana Kasthurirathna,Mahima Milinda Alwis Weerasinghe,Charith Abhayaratne
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at International Conference on Machine Learning (ICML) GlobalSouthML Workshop (2026)
Abstract:Curriculum learning couples two design choices, how samples are scored by difficulty and how harder samples are paced into training, making it difficult to attribute observed gains to either component. We disentangle these factors with two evaluation protocols: stage-wise test subsets that validate scoring functions independently of curriculum training, and a baseline that applies the same pacing schedule to randomly ordered data. Within the Transfer Teacher framework (TTF), we use these protocols to evaluate a confusion-aware difficulty score that considers both correct-class confidence and the probability distribution over incorrect classes. On CIFAR-10 with ResNet-18 and VGG-16, the proposed score produces model-interpretable difficulty rankings that align with human intuition. However, at full data, neither curriculum nor anti-curriculum ordering improves accuracy over standard training, indicating that improving the scoring function alone is insufficient to overcome the known failure modes of curriculum learning in TTF. In contrast, we find that confusion-aware curriculum ordering results in consistent data-efficiency benefits, outperforming random ordering by up to 8.7 percentage points at the 20% data regime, suggesting the potential of TTF as a data-efficient training method.
[AI-56] FllumaOne: A Code-Native Multimodal CAD Dataset with Executable Programs and Kernel-Validated Feature Histories
链接: https://arxiv.org/abs/2606.17696
作者: Jizong Zhan
类目: Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: 24 pages, 4 figures
Abstract:Parametric computer-aided design records both final geometry and the ordered construction history that determines how a part can be edited. Datasets for editable CAD research should therefore expose modeling operations, parameters, and feature dependencies together with validated geometry. We introduce FllumaOne, a code-native multimodal CAD dataset whose models are generated by executable Python programs in Flluma, a Qt/C++ OpenCASCADE-based CAD system. Each sample aligns its program with a structured feature tree, a training-oriented intermediate representation, STEP geometry, a surface point cloud, natural-language descriptions, metadata, and eight canonical visible-edge renderings. The primary release, FllumaOne-100K, contains 100,000 accepted samples across four template-level complexity regimes. Programs are executed and retained only after kernel geometry, solid validity, and export checks; release reports also record modality completeness and split-level duplicate tests. A Qwen2.5-Coder-1.5B LoRA baseline trained on 80,000 samples achieves 99.98% Python syntax validity, 99.97% Flluma build success, and 99.14% STEP-export validity on the held-out 10,000-sample test split. For the 9,909 predictions converted to surface point clouds, the mean normalized Chamfer Distance is 0.002124. The dataset supports conditioned CAD reconstruction, executable program synthesis, feature-tree prediction, B-Rep analysis, retrieval, design completion, and editable reverse engineering.
[AI-57] ASTEROID: A Spatiotemporal Information Transformer for Forecasting Multi-Step Time Series of Molecular Dynamics
链接: https://arxiv.org/abs/2606.17668
作者: Kexin Wu,Luonan Chen,Renxiao Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注: 32 pages, 10 figures
Abstract:Molecular dynamics (MD) simulation is computationally demanding, particularly for large-scale systems requiring long-term analysis. Accurate forecasting of the outcomes of an MD simulation is not only an attractive scientific challenge but also has substantial practical value. In this work, we developed a data-driven framework, termed ASTEROID (Advanced Spatiotemporal TransformER fOr Inferring Dynamics), that can directly predict multi-step atomic coordinates, avoiding conventional iterative integration. For this purpose, ASTEROID reformulates MD trajectories as high-dimensional spatiotemporal sequences and integrates the Spatiotemporal Information (STI) Transformation equation into a Transformer architecture. The core innovation of ASTEROID lies in its ability to model multiscale spatiotemporal dependencies. In particular, for spatial dependencies, a local-global self-attention mechanism captures both short- and long-range interactions. For temporal dependencies, an encoder-decoder structure integrates global context with autoregressive forecasting. ASTEROID was evaluated on several quantum-mechanics-derived molecular datasets. Our results indicate that ASTEROID achieved not only a higher level of accuracy in multi-step prediction than existing methods on various benchmarks, but also a significantly lower computational cost than conventional MD simulation. Moreover, the model supports iterative multi-step forecasting over an extended time scale. This work establishes a robust and generalizable data-driven paradigm for accelerating MD simulations.
[AI-58] Handling Feature Heterogeneity with Learnable Graph Patches KDD2025
链接: https://arxiv.org/abs/2606.17667
作者: Yifei Sun,Yang Yang,Xiao Feng,Zijun Wang,Haoyang Zhong,Chunping Wang,Lei Chen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at KDD 2025
Abstract:In recent years, the rapid development of foundation models and graph pre-training technologies has spurred increasing interest in constructing a universal pre-trained graph model or Graph Foundation Model (GFM). However, a significant challenge is that existing models are unable to address feature heterogeneity in graph data without textual information, which hinders the transferability of graph models across different datasets. To bridge this gap, we propose the concept of learnable graph patches, which we regard as the smallest semantic units of any graph data. We decompose the graph into learnable graph patches by unfolding the node features and constructing corresponding patch structures separately. We then design a framework that mines transferable information from graph data across domains. Specifically, after extracting graph patches, we propose a patch encoder to extract knowledge from each unit and a patch aggregator to learn how the units are combined into a whole. Due to its domain-agnostic nature, the model can be applied to downstream data across different domains. Furthermore, we analyze the connection between our method and existing graph models, as well as the transferability of the node embeddings it generates. Empirically, our method not only achieves the capability to use multi-domain graphs for pre-training, but also shows enhanced performance across various downstream datasets and tasks. Moreover, we observe consistent improvement in downstream performance as the volume of pre-training data increases.
[AI-59] FacProcessTwin: An LLM-Based System for Process Twin Development
链接: https://arxiv.org/abs/2606.17666
作者: Yash Pulse,Yong-Bin Kang,Abhik Banerjee,Prem Prakash Jayaraman
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Process twins provide real-time representations of entire production processes. By capturing how process steps interact, rather than monitoring a single machine in isolation as an asset-based digital twin does, they have the potential to drive efficiency gains across the whole process. However, developing a process twin is costly. It requires accurately modelling the entire production process: its process steps, the equipment and product-specific settings each step uses, and its process variations. The resulting model must then be bound to live operational data. We present FacProcessTwin, a system that leverages a large language model (LLM) to reduce this development time, building a process twin from a plant’s process documentation and natural-language input from an operator. FacProcessTwin generates this complete process model and then automatically binds its process steps to live operational data. The generated model and its data bindings are rendered as an interactive process diagram through which manufacturing personnel can monitor and correct the system’s autonomous decisions, such as resolving uncertainty at safety-critical binding steps. We evaluate FacProcessTwin through a real-world case study of an Australian food manufacturer, covering 16 production process flows that span chilled, frozen, and aseptic shelf-stable product categories and include process variations within the same product. The results show that FacProcessTwin generates these process models accurately (a mean F1 of 95.2% against ground truth) and builds each twin in roughly a sixth of the manual time. Its human-in-the-loop governance then keeps the safety-critical bindings correct: at ambiguous tags where a single-pass baseline silently mis-binds 75.0% of the time, FacProcessTwin defers to the operator and mis-binds none.
[AI-60] TuneAhead: Predicting Fine-tuning Performance Before Full Training Begins ICML2026
链接: https://arxiv.org/abs/2606.17660
作者: Yuxiang Luo,Haonan Long,Chen Wang,Qiqi Duan,Xiaotian Lin,Yanwei Xu,Yuyu Luo,Weikai Yang,Nan Tang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 6 figures, accepted as ICML 2026 poster: this https URL
Abstract:Fine-tuning large language models (LLMs) is compute-intensive and error-prone: model performance depends sensitively on data quality and hyperparameter choices, and naïve runs can even degrade model performance. This raises a practical question: can we predict fine-tuning performance before committing to a full training run? We present TUNEAHEAD, a lightweight framework for pre-hoc prediction of fine-tuning performance. TUNEAHEAD encodes each candidate run as a meta-feature vector that combines static dataset descriptors with dynamic probe features from a short standardized probe. A predictor maps these features to performance estimates, while SHAP-based attributions provide interpretable diagnostics that reveal which specific features drive the prediction. Across 1,300+ fine-tuning runs on Qwen2.5-7B-Instruct, TUNEAHEAD consistently outperforms strong baselines such as Early-Stop Extrapolation and ProxyLM. On a held-out test set of 370 runs, TUNEAHEAD achieves an RMSE of 1.47 percentage points and places 95.1% of predictions within ±3 percentage points of the true score. These accurate continuous predictions support practical go/no-go screening policies that can reduce unnecessary full fine-tuning while retaining most promising runs.
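The two headline metrics in this abstract (RMSE in percentage points and the share of predictions within a ±3-point band) are simple to compute. The sketch below uses made-up scores purely for illustration; the function names and the `go_no_go` rule are assumptions, not TUNEAHEAD's code or screening policy.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between predicted and true scores."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def within_tolerance(predicted, actual, tol=3.0):
    """Fraction of predictions within +/- tol percentage points of the truth."""
    hits = sum(abs(p - a) <= tol for p, a in zip(predicted, actual))
    return hits / len(actual)


def go_no_go(predicted_score, baseline_score, margin=0.0):
    """Hypothetical screening rule: run full fine-tuning only if the
    predicted score beats the baseline by at least `margin` points."""
    return predicted_score >= baseline_score + margin


# Illustrative (fabricated-for-demo) scores in percentage points.
predicted = [70.0, 65.0, 80.0, 55.0]
actual = [71.0, 69.0, 80.0, 54.0]

print(rmse(predicted, actual))
print(within_tolerance(predicted, actual))
```

With these toy numbers, one of the four predictions misses by 4 points, so it falls outside the ±3 band while still contributing to the RMSE.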
[AI-61] Using Cognitive Models to Improve Language Model Simulation of Human Persuasion Games
链接: https://arxiv.org/abs/2606.17657
作者: Zirui Cheng,Zeyu Shen,Thomas L. Griffiths,Peter Henderson
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:People make decisions differently in strategic interactions. Some update beliefs like a Bayesian; others exhibit biases like motivated reasoning. Although creators of large language models use simulated humans for safety evaluations and training, they often fail to cover this breadth of human behavior. We argue that cognitive science and economics provide a convenient tool for doing so, making use of mathematical models of human decision-making. We propose an approach that we call Equation-to-Behavior Prompting for guiding large language models to match cognitive models, and evaluate this approach on persuasion games based on legal decision-making. We find that large models can approximate equation-based specifications – Bayesian updating, affine distortion, motivated updating, and Grether’s α-β model – using prompting, but small models fail to do so. However, training small models with reinforcement learning to adhere to mathematical rules, Equation-to-Behavior RL, reduces belief error by 26.5% in out-of-distribution parameterizations. We show that these simulations can help create diverse training environments; training small models to consider different kinds of decision-makers improves average belief change by 2.5%–12% over Bayesian-only training, even when persuading GPT-5-mini. Our work could improve human simulations for training and evaluation in increasingly realistic settings, and could also enable novel research into more complicated mathematical models of human decision-making.
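To make the "equation-based specifications" concrete, here is a minimal sketch of Bayesian updating in odds form and a Grether-style generalization that puts exponents on the likelihood ratio and the prior odds. The abstract does not state which exponent the paper calls α and which β, so the assignment below (α on the likelihood ratio, β on the prior odds) is an assumption, as are the function and parameter names.

```python
def posterior(prior, likelihood_ratio, alpha=1.0, beta=1.0):
    """Belief in hypothesis A after seeing one signal.

    prior            -- prior probability of A, strictly between 0 and 1
    likelihood_ratio -- P(signal | A) / P(signal | not A)
    alpha, beta      -- assumed exponents on the likelihood ratio and the
                        prior odds; alpha = beta = 1 recovers Bayes' rule
    """
    prior_odds = prior / (1.0 - prior)
    post_odds = (likelihood_ratio ** alpha) * (prior_odds ** beta)
    return post_odds / (1.0 + post_odds)


bayesian = posterior(0.5, 3.0)                  # exact Bayesian update
conservative = posterior(0.5, 3.0, alpha=0.5)   # under-weights the evidence
print(bayesian, conservative)
```

Setting `alpha` below 1 models a decision-maker who moves less than a Bayesian would on the same evidence, while `beta` below 1 models under-use of the prior (base-rate neglect).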
[AI-62] A Risk Decomposition Framework for Pre-Hoc Fine-Tuning Prediction ICML2026
链接: https://arxiv.org/abs/2606.17649
作者: Yuxiang Luo,Chen Wang,Nan Tang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures, accepted as ICML 2026 Poster: this https URL
Abstract:The high cost of fine-tuning LLMs poses a significant economic barrier; pre-hoc performance prediction offers a critical solution to substantially reduce this expense. However, the theoretical limits of pre-hoc performance prediction remain unexplored. We formulate it as a stochastic estimation problem under information constraints, decomposing prediction risk into two components: an intrinsic limit (static data-model compatibility) and a reducible optimization variance. We prove that optimization variance admits a necessary lower bound on its decay rate, implying fundamental constraints on how quickly uncertainty dissipates, regardless of the predictor used. Based on these dynamics, we derive a budget-optimal probing principle and introduce a predictability phase diagram that organizes tasks into three distinct regimes: Static-Sufficient, Dynamic-Critical, and Noise-Dominant. Extensive experiments on synthetic and real-world benchmarks validate these theoretical regimes and demonstrate the efficiency of our probing strategy.
[AI-63] From Brewing to Resolution: Tracing the Internal Lifecycle of Code Reasoning in LLMs
链接: https://arxiv.org/abs/2606.17648
作者: Siyue Chen,Yifu Guo,Yuquan Lu,Zishan Xu,Jiaye Lin,Jianbo Lin,Siyu Zhang,Cheng Yang,Junxin Li,Yujia Li,Yu Huo,Ruixuan Wang
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Standard accuracy metrics cannot explain why LLMs handle variable tracking but fail on semantically equivalent loops. We study an internal lifecycle of code reasoning in which models first brew the answer, making it linearly recoverable many layers before it becomes self-decodable, and then diverge into one of four resolution outcomes: Resolved, Overprocessed, Misresolved, or Unresolved. Understanding this lifecycle matters because similar task accuracies can mask fundamentally different failure modes that surface-level evaluation cannot detect. We introduce a dual diagnostic framework pairing layer-wise linear probing with Context-Stripped Decoding (CSD) and apply it to six code-reasoning task families across 16 models spanning Qwen, Llama, and DeepSeek architectures. All four outcomes carry substantial mass in every task family: overall Resolved is only 41.5%, with multiple tasks below 30%. Controlled sweeps over structure, depth, and operators expose task-specific failure bottlenecks: Function Call Resolved plunges from 61.1% to 2.5% as call depth increases from one to three. Across architectures and scales, the brewing scaffold remains stable, with normalized brewing duration 24-42% across all 16 models, while resolution success varies with capability. This indicates that the scaffold is a stable empirical regularity across the tested decoder-only Transformer families, whereas resolution success covaries with capability, scale, and training. Code: this https URL
[AI-64] FinAcumen: Financial Multimodal Reasoning via Self-Evolving Experience Memory Harness
链接: https://arxiv.org/abs/2606.17642
作者: Pianran Guo,Pengcheng Zhou,Yucheng Jian,Shuhua Chen
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Financial multimodal reasoning requires agents to coordinate numerical computation, retrieval, visual interpretation, and temporal grounding across heterogeneous evidence sources. Existing tool-augmented agents improve execution fidelity, yet remain largely stateless across episodes, repeatedly rediscovering reasoning strategies and failure patterns. In high-stakes financial settings, this leads to unreliable tool routing, noisy retrieval, and hallucination-prone reasoning. We present FinAcumen, a financial reasoning agent framework centered on selective experience memory for tool-augmented multimodal reasoning. FinAcumen accumulates financially grounded reasoning experience from prior trajectories, distilling successful strategies and failure-derived cautionary rules into a persistent memory bank. During inference, retrieved experiences condition reasoning only when semantic relevance exceeds a calibrated threshold, while irrelevant memory is explicitly suppressed through a fallback mechanism. A deterministic financial tool environment further grounds numerical computation, retrieval, visual decoding, and answer [...]. On four financial multimodal reasoning benchmarks, FinAcumen consistently improves a frozen 8B vision-language model over finance-specialized models and approaches leading proprietary general-purpose models. Further analysis shows that selective experience activation improves reasoning reliability under retrieval uncertainty. Our code is anonymously available at this https URL
[AI-65] Brick-DICL: Dynamic In-Context Learning for Automated Brick Schema Classification
链接: https://arxiv.org/abs/2606.17637
作者: Yiyue Qian,Shinan Zhang,Huan Song,Negin Sokhandan,Hannah Marlowe,Diego Socolinsky
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Building Management Systems (BMS) are essential for optimizing energy efficiency and operational performance in modern buildings. However, the lack of standardization across BMS points from different manufacturers creates significant barriers to integration and data utilization. While the Brick schema offers a standardized ontology for building systems, mapping BMS points to appropriate Brick classes presents three critical challenges: (i) the extensive number of Brick classes (936 in the latest version), (ii) limited domain-specific knowledge in large language models (LLMs), and (iii) substantial manual effort required for verification. To address these challenges, we propose Brick-DICL, a two-stage dynamic in-context learning framework for automated Brick schema classification. Brick-DICL consists of two primary components: metadata-RAG, which retrieves relevant examples to enhance LLMs’ domain knowledge, and class-RAG, which narrows down potential Brick classes to address the large classification space. Additionally, we implement a multi-LLM filtering mechanism that compares predictions across multiple models, flagging low-confidence classifications for human review. As a result: (i) General: Brick-DICL is applicable to any building management system regardless of manufacturer or metadata format; (ii) Novel and Powerful: as the first dynamic in-context learning approach for Brick schema classification, Brick-DICL achieves significant classification accuracy improvements on building datasets, outperforming existing methods; (iii) Efficient: our multi-LLM filtering strategy reduces manual verification effort, enabling rapid digital building onboarding. Extensive experiments demonstrate Brick-DICL’s effectiveness across diverse building datasets, accelerating the path toward standardized, interoperable building management systems.
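The multi-LLM filtering step described above (compare predictions across models, flag low-confidence ones for human review) could look roughly like the sketch below. The agreement-ratio rule, the threshold, and the point names are assumptions for illustration; the abstract does not specify how Brick-DICL measures confidence.

```python
from collections import Counter


def filter_predictions(predictions, min_agreement=1.0):
    """Split points into auto-accepted and flagged-for-review.

    predictions   -- dict mapping a BMS point name to the list of Brick
                     class labels predicted by different models
    min_agreement -- assumed rule: accept only if at least this fraction
                     of models agree on the top label
    """
    accepted, flagged = {}, {}
    for point, labels in predictions.items():
        label, votes = Counter(labels).most_common(1)[0]
        if votes / len(labels) >= min_agreement:
            accepted[point] = label
        else:
            flagged[point] = labels  # keep all candidates for the reviewer
    return accepted, flagged


# Hypothetical point names; labels are illustrative Brick-style classes.
predictions = {
    "AHU1.ZoneTemp": ["Zone_Air_Temperature_Sensor"] * 3,
    "AHU1.SAT": [
        "Supply_Air_Temperature_Sensor",
        "Supply_Air_Temperature_Sensor",
        "Return_Air_Temperature_Sensor",
    ],
}
accepted, flagged = filter_predictions(predictions)
print(accepted, flagged)
```

Requiring unanimous agreement sends more points to review but keeps auto-accepted labels conservative; lowering `min_agreement` trades review effort for risk.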
[AI-66] Closing the Feedback Loop: From Experience Extraction to Insight Governance in Verbal Reinforcement Learning ICML2026
链接: https://arxiv.org/abs/2606.17591
作者: Yanwei Cui,Xing Zhang,Yulong Zhang,Li Shao,Xiaofeng Shi,Guanghui Wang,Peiyang He
类目: Artificial Intelligence (cs.AI)
备注: Accepted to the ICML 2026 RLxF: Reinforcement Learning from World Feedback Workshop, RLxF@ICML 2026, Seoul, South Korea
Abstract:Training-free verbal reinforcement learning enables LLM agents to learn from world feedback – objective signals such as dynamic task outcomes, market returns, or demand forecasts – by extracting verbal rules from experience and injecting them as context, updating the agent’s behavior without parameter changes. However, in non-stationary environments these agents face a retention-forgetting dilemma: retaining stale insights causes negative transfer, while discarding them causes catastrophic forgetting when conditions recur. We identify four requirements for navigating this dilemma – outcome-driven evaluation, persistent structured evidence, non-monotonic knowledge lifecycle, and compositional governance – and show that existing methods invest heavily in experience extraction while underinvesting in insight governance. We propose a three-layer architecture – rules, evidence, and skills – connected by a feedback-driven curation loop that closes the governance gap. Rules capture distilled experience from world outcomes; evidence logs track each rule’s reliability across episodes; skills govern which rules to apply, how to resolve conflicts, and when to abstain. On financial forecasting as a case study, where world feedback is naturally abundant, noisy, and non-stationary, we show that the same accumulated experience either degrades performance below the zero-shot baseline or dramatically improves accuracy and risk-adjusted returns, depending on whether the curation loop is present.
[AI-67] Understanding LLMs in Title-Abstract Screening: From Disagreements to Recommendations SEAA2026
链接: https://arxiv.org/abs/2606.17588
作者: Mika Mäntylä,Patricia Matsubara,Katia Romero Felizardo,Miikka Kuutila,Marco Gerosa,Savio de Sousa Sampaio,Tayana Conte,Igor Steinmacher
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 14 pages + references. Accepted for publication in the 52nd Euromicro Conference on Software Engineering and Advanced Applications (SEAA 2026)
Abstract:Several studies have examined the use of large language models (LLMs) for title-abstract screening in systematic reviews (SRs), reporting mixed accuracy. However, questions of reliability remain largely unaddressed. In this study, we go beyond quantitative LLM-human agreement metrics and qualitatively investigate how and why LLMs fail. We also propose actionable recommendations. We analyzed disagreements between LLMs and researchers across six software engineering SRs and over 1,000 primary study papers. For each SR, papers were screened independently by human experts and LLMs in zero-shot mode, resulting in Kappa values ranging from 0.52 to 0.77. Qualitative analysis suggests that human-LLM disagreement results from recurring, identifiable causes, such as boundary ambiguity in key terms, keyword overemphasization, and incorrect topic inference. Based on these findings, we propose recommendations such as validating semantic understanding before deployment, running multiple LLMs, and focusing validation efforts on borderline cases. Future studies are needed to validate the impact of our recommendations, and community efforts are needed to develop normative guidelines on LLM usage in SRs.
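The Kappa values quoted in this abstract measure agreement beyond chance. Assuming the statistic is Cohen's kappa for two raters (the abstract only says "Kappa"), it can be computed as below; the include/exclude decisions are toy data, not taken from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    Undefined (division by zero) when chance agreement is exactly 1,
    i.e. both raters use a single identical label throughout.
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(count_a) | set(count_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in labels) / (n * n)
    return (observed - expected) / (1 - expected)


# Toy screening decisions: I = include, E = exclude.
human = ["I", "I", "I", "E", "E", "E", "E", "E", "E", "E"]
llm = ["I", "I", "E", "I", "E", "E", "E", "E", "E", "E"]

kappa = cohens_kappa(human, llm)
print(round(kappa, 3))
```

Here the raters agree on 8 of 10 papers, yet kappa is only about 0.52 because most of that agreement is expected by chance when exclusions dominate.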
[AI-68] Visored: A Controlled-Natural-Language Prover for LLM-Generated Mathematics
链接: https://arxiv.org/abs/2606.17581
作者: Xiyu Zhai,Xinyi Chen,Yiping Wang,Runlong Zhou,Liao Zhang,Simon S. Du
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI)
备注:
Abstract:We present a dependent-type-based prover designed around the way LLMs (and humans) tend to write mathematics, complementing existing systems such as Lean and Rocq. Its core design choices are a surface that imitates mathematical natural language and a rule-driven automation layer that closes the routine steps a textbook would omit, so that an accepted proof can be re-emitted as a checked Lean file. Early experiments suggest that, even without any prover-specific training data, LLMs can learn to use it effectively on the miniF2F benchmark. Lean output excerpts: this https URL
[AI-69] Surrogate Assisted Pedestrian Protection Design via a Foundation Model Orchestrated Workflow
Link: https://arxiv.org/abs/2606.17577
Authors: Osamu Ito, Akihiko Katagiri, Yoshikazu Nakagawa, Shin Saeki, Jun Shiraishi, Masato Sasaki
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:AI-driven engineering workflows face particular challenges in crash safety design: unlike aerodynamics, crash events involve highly nonlinear contact dynamics, material nonlinearity, and discrete state transitions that are difficult to capture with data-driven surrogate models. To the best of our knowledge, we present the first foundation model–orchestrated workflow for crash safety design that enables surrogate-assisted exploration for pedestrian protection, reducing evaluation time from hours per CAE simulation to seconds. The workflow integrates four components: (1) a surrogate trained on CAE crash simulations to predict pedestrian leg injury metrics from design parameters, achieving an average R^2=0.87 and providing distribution-free conformal prediction intervals; (2) multiobjective evolutionary search (NSGA-II) to discover diverse feasible parameter sets under user-specified constraints; (3) a morphing-based geometry generator that maps parameters to topology-preserving 3D shapes; and (4) a natural-language interface in which an LLM orchestrates the workflow and a vision–language model supports semantic comparison of generated designs. In an automotive front-bumper case study, the workflow produces 35 distinct safety-compliant alternatives from a single exploration, a process that would require weeks with conventional CAE iteration. These results suggest that foundation models can serve as integration layers between ML surrogates and physics-based simulation, helping bring AI capabilities to safety-critical engineering domains. Journal reference: ICLR 2026 Workshop, The 2nd Workshop on Foundation Models for Science.
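The abstract mentions distribution-free conformal prediction intervals around the surrogate's outputs. A minimal sketch of one common variant, split conformal prediction with absolute-residual scores, is given below. Whether the paper uses this particular variant is an assumption, and the calibration numbers are made up for illustration.

```python
import math


def split_conformal_halfwidth(calib_true, calib_pred, alpha):
    """Half-width q so that [pred - q, pred + q] covers the true value with
    probability >= 1 - alpha, assuming exchangeable calibration and test data."""
    scores = sorted(abs(y - p) for y, p in zip(calib_true, calib_pred))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return scores[rank - 1]


# Hypothetical held-out calibration set: simulated values vs. surrogate predictions.
calib_true = [10, 20, 30, 40, 50, 60, 70, 80, 90]
calib_pred = [11, 18, 33, 36, 55, 54, 77, 72, 99]
q = split_conformal_halfwidth(calib_true, calib_pred, alpha=0.25)

new_prediction = 42.0
interval = (new_prediction - q, new_prediction + q)
```

The guarantee here comes only from ranking held-out residuals, so it holds for any surrogate model, at the cost of a constant interval width in this simplest form.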
[AI-70] DeepInsight: A Unified Evaluation Infrastructure Across the Physical AI Stack
Link: https://arxiv.org/abs/2606.17574
Authors: Siyi Li, Chunyu Sun, Jiahao Zhang, Yuchen Kang, Wuliang Wang, Yu Qiu, Rui Jiang, Haitao Cui, Jie Chen
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Evaluating a Physical AI stack spans operators that differ by more than three orders of magnitude – from a single foundation-model decoding step to thousands of physics ticks of whole-body control – varying orthogonally in modality, reward semantics, and resource profile. No existing framework spans this range, so the stack is evaluated today by stitching together separate harnesses that share neither runtime nor scoring, preserving each segment’s local validity but losing the shared identity needed to diagnose cross-layer regressions. We present DeepInsight, an evaluation infrastructure that serves this full spectrum on a single runtime. Rather than homogenize the regimes, it preserves their heterogeneity behind three narrow abstractions – task, resource, and result – each realized as one invariant shared by every subsystem: one episode driver, one resource-handle protocol implemented by every expensive backend (LLM inference and sandboxed runtimes alike), and one trace identity scheme under which every event is written. Deployed in production across all three layers of an embodied humanoid stack, this single set of invariants onboards new benchmarks largely by configuration. Where mature peer orchestrators exist – at the foundation-model end – it reproduces published references and peer-framework readings within their own spread, runs the same suites faster on a single node, and scales near-linearly across nodes. Its distinctive return is diagnostic: because every layer writes into one shared trace, a regression that begins in one layer and surfaces in another stays localizable on that trace – a cross-layer payoff no federation of per-segment harnesses can reproduce.
[AI-71] An AI Security Agent for Banking: Multi-Vector Fraud and AML Detection Across Retail and Corporate Accounts
Link: https://arxiv.org/abs/2606.17555
Authors: Joseph Walusimbi, Joshua Benjamin Ssentongo
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Emerging Technologies (cs.ET)
Comments: 7 pages, 1 figure, 5 tables
Abstract:Banks simultaneously face signature-based fraud (card-not-present attacks, account takeover, ATM cloning) and behavioural financial crime (structuring, layering, mule networks, business email compromise) – two threat families with fundamentally different detection requirements. Static rule engines that reliably catch brute-force and high-velocity events are structurally blind to business-email-compromise (BEC) payment redirection, session hijacking, and money-laundering layering, which are engineered to appear indistinguishable from legitimate activity at the individual transaction or session level. This paper presents an AI security agent for retail and corporate banking that addresses this gap through a three-component fusion architecture operating on two parallel event streams: a transaction stream (card fraud, ACH/wire fraud, AML categories) and a session stream (account takeover, session hijacking, SIM-swap, insider abuse). Each stream combines an LSTM sequence model capturing per-account behavioural history, a statistical velocity/threshold monitor, and a graph/network module capturing account-counterparty relationship patterns (fan-in, fan-out, pass-through ratio) for money-laundering detection. Experiments on a synthetic event log of 237,669 transactions and 113,508 sessions across 13 threat categories and 3,470 simulated accounts demonstrate overall F1 of 0.787 (transaction stream) and 0.867 (session stream) for the proposed model, versus 0.562/0.733 for a rule-based baseline and 0.655/0.713 for an LSTM-only baseline. The agent includes a customer-facing transaction-verification chatbot (96.6% identity verification accuracy, 86.8% mass-reset attack detection) and an analyst case-summary assistant (99.3% action-recommendation F1), with Critical-tier automated response latency under 0.43 ms at the 95th percentile.
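The graph/network module described above relies on account-counterparty features such as fan-in, fan-out, and pass-through ratio. The sketch below shows one plausible way to derive them from a transfer log. The function name, the tuple layout, and in particular the pass-through definition (share of received funds forwarded onward, capped at 1) are assumptions for illustration, not the paper's specification.

```python
from collections import defaultdict


def account_graph_features(transfers):
    """transfers: iterable of (sender, receiver, amount) tuples."""
    senders_of = defaultdict(set)    # account -> distinct accounts paying into it
    receivers_of = defaultdict(set)  # account -> distinct accounts it pays
    inflow = defaultdict(float)
    outflow = defaultdict(float)
    for sender, receiver, amount in transfers:
        receivers_of[sender].add(receiver)
        senders_of[receiver].add(sender)
        outflow[sender] += amount
        inflow[receiver] += amount

    features = {}
    for acct in set(inflow) | set(outflow):
        received, sent = inflow[acct], outflow[acct]
        # Assumed definition: fraction of received funds that left again.
        pass_through = min(sent, received) / received if received > 0 else 0.0
        features[acct] = {
            "fan_in": len(senders_of[acct]),
            "fan_out": len(receivers_of[acct]),
            "pass_through": pass_through,
        }
    return features


# Hypothetical log: three accounts feed "m", which forwards almost everything to "x".
transfers = [("a", "m", 100.0), ("b", "m", 200.0), ("c", "m", 300.0), ("m", "x", 590.0)]
features = account_graph_features(transfers)
```

An account like "m", with high fan-in and a pass-through ratio close to 1, resembles the mule or layering pattern that a per-transaction rule would not see.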
[AI-72] Reversal Q-Learning
Link: https://arxiv.org/abs/2606.17551
Authors: Aditya Oberai, Seohong Park, Sergey Levine
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Iterative generative modeling techniques, such as flow matching, provide powerful tools to model complex behaviors for effective offline reinforcement learning (RL). In this work, we propose a new off-policy RL algorithm that trains a flow policy based on prior data. Our idea starts from the “expanded” Markov decision process (MDP) framework, which treats individual flow refinement steps as separate actions in an MDP. To enable off-policy RL within this framework, we apply two techniques: we generate virtual on-policy trajectories (by “reversing” flows) to make this framework compatible with prior data, and we apply a bias-and-variance reduction technique to mitigate the curse of horizon in off-policy RL. We call the resulting algorithm Reversal Q-learning (RQL). RQL has several advantages over previous flow-based RL methods: it does not suffer from backpropagation through time, makes better use of the learned value function, and directly trains the full, expressive flow policy. Through our experiments on 50 challenging simulated robotic tasks, we show that RQL leads to the best average offline RL performance compared to state-of-the-art flow-based offline RL algorithms.
[AI-73] SEAGym: An Evaluation Environment for Self-Evolving LLM Agents
Link: https://arxiv.org/abs/2606.17546
Authors: Congjie Zheng, Chuanyi Xue, Bin Liang, Jun Yang, Changshui Zhang
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Self-evolving LLM-based agents improve mainly by changing their agent harness: the structured execution layer around a base model, including prompts, memory, tools, middleware, runtime state, and the model-tool interaction loop. Existing evaluations often reduce this process to isolated task scores or a single sequential curve, obscuring whether an update produces reusable improvement, overfits recent tasks, increases cost, or harms older behavior. We introduce SEAGym, an evaluation environment for measuring agent harness updates across training, validation, test, replay, and cost records. SEAGym turns Harbor-compatible benchmarks into dynamic self-evolution task sources with train batches, frozen update-validation, held-out ID and OOD transfer views, replay diagnostics, and saved snapshot and metric records. Instantiating SEAGym on Terminal-Bench 2.0 and HLE, we compare ACE, TF-GRPO, and AHE under a shared epoch/batch protocol. The results show that these evaluation views provide complementary signals about the evolution process: frequent updates may fail to improve held-out performance, useful intermediate snapshots may collapse later, and source diversity and model backend can affect harness reliability.
[AI-74] Offline Preference-Based Trajectory Evaluation
Link: https://arxiv.org/abs/2606.17541
Authors: Fernando Diaz
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Offline evaluation of agentic systems often collapses trajectories to terminal success, discarding information about partial progress and inducing widespread ties, creating substantial statistical inefficiency by reducing effective sample size and weakening the ability to distinguish systems. We propose preference-based trajectory evaluation, which compares trajectories directly through temporal preferences over progress and time-to-return profiles. We find that, across diverse agentic and interactive benchmarks, standard success-based metrics produce tied comparisons on roughly 75% of instances, whereas trajectory-aware preferences reduce ties to roughly 35%, improving discriminative power, ranking stability, and data efficiency. Our results suggest that benchmark saturation, often attributed to poor data collection or problem difficulty, may also be explained by the choice of evaluation measure.
[AI-75] FoundCause: Causal Discovery with Latent Confounders from Observational Data
Link: https://arxiv.org/abs/2606.17516
Authors: Patrick Blöbaum, Krishnakumar Balasubramanian, Shiva Prasad Kasiviswanathan
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
Comments: Download the model at this https URL
Abstract:Causal discovery from observational data remains challenging due to the need to recover directed structure and latent confounding without interventions. We propose FoundCause, an amortized causal discovery model trained entirely on synthetic data that maps datasets directly to causal graphs in a single forward pass. By learning from large collections of simulated structural causal models, FoundCause captures transferable statistical patterns that generalize beyond individual datasets. The architecture incorporates several key inductive biases for causal discovery. It uses a permutation-invariant transformer encoder with alternating attention over samples and variables to jointly model cross-variable dependence and per-variable distributions. Pairwise statistical features derived from classical asymmetry measures are injected through statistics-conditioned attention, guiding the model toward known causal signals. A factorized decoder separates edge existence from direction, while a triangular refinement module enables reasoning over higher-order causal motifs such as chains and colliders. In addition, a dedicated confounder module based on learnable latent tokens explicitly models hidden common causes, and the model explicitly handles missing data via its masked input representation. To our knowledge, FoundCause is the first amortized causal discovery approach to explicitly model latent confounding. FoundCause outperforms 11 classical non-amortized methods (e.g., PC, GES, NOTEARS-style optimization) and 4 amortized causal discovery methods on 15 real-world datasets, achieving +9.6% improvement in F_1 , +1.2% in AUROC, and an 18.9% reduction in structural Hamming distance relative to the strongest non-amortized methods, while performing inference in a single forward pass.
[AI-76] Unlocking LLM Code Correction with Iterative Feedback Loops
Link: https://arxiv.org/abs/2606.17514
Authors: Le Zhang, Suresh Kothari
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 22 pages, 14th Computing Conference 2026
Abstract:Large Language Models have shown remarkable capabilities in code generation. However, most existing evaluations focus only on single-attempt accuracy and overlook the iterative refinement process that is central to real-world programming. This study presents a systematic investigation of LLMs’ ability to rectify their own code through execution feedback. Using real-world programming problems across four models and two major programming languages, this study evaluates performance using an iterative refinement framework where LLMs receive compiler error messages and test-case feedback after each attempt. This study introduces metrics to evaluate code failures, analyze rectification patterns, and compare the effectiveness of reasoning and non-reasoning models, offering actionable insights into both the understanding and practical application of feedback loops in LLM-driven code generation systems. Results show that reasoning models consistently improve over iterations, substantially outperforming non-reasoning models in leveraging feedback, while syntactic and runtime errors are far more tractable than logical or algorithmic failures.
[AI-77] Geometry-Aware Post-Hoc Uncertainty Quantification in Operator Learning
Link: https://arxiv.org/abs/2606.17513
Authors: Oriol Vendrell-Gallart, Nima Negarandeh, Ramin Bostanabad
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Neural operators provide fast surrogates for PDEs but their deterministic predictions limit their use in tasks requiring uncertainty quantification (UQ), especially under geometric variability. Existing approaches primarily model uncertainty in network parameters, largely overlooking the geometry-aware representations learned by the operator itself. We propose REEF-GP (Residual on Embedded Features Gaussian Process), a post-hoc UQ framework that fits a GP to the residuals of a frozen neural operator whose internal embeddings define the kernel feature space. Rather than learning a separate feature map, REEF-GP adapts the operator’s intrinsic coordinate-feature representations to construct geometry-aware uncertainties. To ensure stability and scalability on unstructured domains, REEF-GP incorporates spectral-normalized projections, heteroscedastic geometry-aware noise, and efficient subset-based training that avoids restrictive low-rank approximations. Across five PDE benchmarks with varying geometries, REEF-GP preserves predictive accuracy while achieving calibrated uncertainty estimates competitive with deep ensembles but at a fraction of their cost. Our approach remains robust under geometric distribution shift, with uncertainty concentrating in physically meaningful regions (e.g., shock fronts). Our results demonstrate that accurate and scalable post-hoc UQ for neural operators can be achieved directly in their learned feature space, offering a practical alternative to parameter-centric approaches.
[AI-78] LLM-as-Judge in Education: A Curriculum-Grounded Marking Pipeline
Link: https://arxiv.org/abs/2606.17507
Authors: Xiwei Xu, Chen Wang, Jacky Jiang, Phil Yang, Qian Fu, Mohan Dhall, Wenjie Zhang, Liming Zhu
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:
Abstract:Generative AI and large language models (LLMs) are increasingly applied to question generation and automated assessment. However, deploying LLMs in preparation for high-stakes exams requires more than prompt engineering; it demands software pipelines that systematically ground model outputs in authorised curriculum artefacts and marking guidelines issued by education authorities. This paper presents a curriculum-grounded, configurable LLM-as-Judge pipeline for question-level marking, co-developed with an industrial partner, to support exam preparation for university admission. The pipeline identifies the relevant topics, subtopics, and cognitive demand of a question, and assembles verifiable and authorised context to support LLM judgement. Curriculum intent is operationalised through concrete syllabus artefacts, including prescribed verbs and outcomes, performance band descriptors, glossary definitions, and marking-guideline principles. A staged LLM workflow is employed to first generate question-specific rubrics, capturing structured expectations of performance, and then derive and evaluate marking criteria used to allocate marks to student responses. This design improves consistency, transparency, and alignment with official marking practices. Preliminary evaluation shows that the proposed LLM-as-Judge pipeline delivers marking outcomes comparable to human tutors, while yielding justifications that are more traceable to authorised curriculum artefacts and marking standards. The pipeline has also been integrated into an online study platform, where early deployment data provide initial insights into operational usage and manual overrides.
[AI-79] Online LLM Selection via Constrained Bandits with Time-Varying Demand
Link: https://arxiv.org/abs/2606.17489
Authors: Yin Huang, Qingsong Liu, Jie Xu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages, 3 figures with multiple subfigures, 1 table, submitted for possible journal publication
Abstract:Large Language Models (LLMs) are increasingly deployed in edge-cloud inference systems to handle diverse user tasks with heterogeneous accuracy, latency, and cost profiles. Selecting the appropriate LLM for each incoming task is critical for ensuring service quality and efficient resource utilization. However, model heterogeneity, stochastic and unknown performance characteristics, and time-varying task demands make static selection strategies inadequate. Real-world deployments often impose hard resource budgets such as monetary expenditure limits, along with soft service-level requirements such as latency guarantees. These constraints introduce additional challenges for online decision-making. We formulate this problem as a constrained stochastic bandit learning task, where the learner sequentially selects models under both packing-type (hard) and covering-type (soft) constraints, while adapting to time-varying task demand. The learner operates without access to the underlying reward, cost, or latency distributions and must rely on partial feedback. We develop a novel online learning algorithm that leverages confidence-bound estimates and demand predictions to balance reward maximization with long-term constraint satisfaction. We provide theoretical guarantees showing sublinear regret and sublinear covering constraint violations compared to an offline benchmark with full information. Experimental results on synthetic workloads demonstrate the effectiveness and robustness of our approach in dynamic, resource-constrained environments.
[AI-80] AUTOGATE: Automated Clock Gating via Toggling-Aware LLM-based RTL Rewriting
Link: https://arxiv.org/abs/2606.17461
Authors: Yiting Wang, Chenhui Deng, Chia-Tung Ho, Yanqing Zhang, Zhuo Feng, Cunxi Yu, Ang Li, Gang Qu, Brucek Khailany
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 6 figures, 7 tables
Abstract:Fine-grain clock gating (FGCG) is among the most effective techniques for reducing dynamic power, yet current FGCG optimization flows remain largely manual. Recent LLM-based RTL optimization approaches remain limited by two key drawbacks: (1) the inability to process long waveform traces spanning millions of cycles, and (2) the difficulty of scaling optimization to large hierarchical codebases while preserving correctness. In this work, we present AUTOGATE, the first agentic framework for industry-grade RTL power optimization, enabling workload-aware clock-gating optimization across large hierarchical codebases. AUTOGATE introduces a Machine Learning (ML)-LLM co-design that bridges waveform-level analysis and RTL rewriting. Specifically, we design an ML-based clustering algorithm that distills raw toggling traces into compact, structured representations that guide LLM-based RTL rewriting. This enables accurate identification and application of clock-gating opportunities without requiring LLMs to directly process raw waveform data. To enhance scalability, AUTOGATE employs a hierarchical multi-agent architecture that decomposes large designs into independently optimizable modules, enabling coordinated optimization across deep design hierarchies. We evaluate AUTOGATE on a diverse set of designs ranging from small RTL designs to large industrial-grade codebases. Experimental results show that AUTOGATE consistently reduces dynamic power relative to baselines. Across the small-design suite, AUTOGATE reduces dynamic power by 49.31% on average. On industry-scale designs, it achieves 19.34% and 7.96% dynamic power reductions on NVDLA and BlackParrot, respectively, and up to 6.86% on highly optimized proprietary production designs.
[AI-81] Can LLMs Be CEOs? Benchmarking Strategic Resource Reallocation with Multi-Role Agent Simulation
Link: https://arxiv.org/abs/2606.17459
Authors: Yuyang Dai, Xueqing Peng, Lingfei Qian, Zhuohan Xie
Categories: Artificial Intelligence (cs.AI)
Comments: 13 pages
Abstract:Evaluating the decision-making capabilities of large language models (LLMs) is a growing research priority, yet existing benchmarks focus on isolated cognitive tasks such as reasoning, knowledge retrieval, and economic rationality in stylized settings. These evaluations overlook the defining challenge of real executive decision-making: integrating conflicting recommendations from specialized stakeholders under information asymmetry, organizational constraints, and temporal dependencies. We introduce CEO-Bench, a multi-agent benchmark that evaluates LLMs on CEO-level strategic resource reallocation – the process of redirecting capital across business units in a multi-round, constraint-rich organizational environment. In CEO-Bench, LLM agents receive conflicting advice from four role-conditioned C-suite advisors (CFO, CTO, COO, CMO), each with private signals and distinct priorities, and must synthesize these into a concrete allocation plan evaluated along four dimensions: role integration, conditional boldness, history-sensitive judgment, and plan validity. Experiments across five frontier models on 13 scenarios reveal that all models achieve high structural validity but diverge sharply on strategic calibration – the hardest capability layer. We identify systematic failure modes including single-advisor capture, conservative default under ambiguity, and historical amnesia, and uncover a structural integration-boldness tradeoff: models that engage more deeply with conflicting perspectives tend to produce less decisive action. These findings delineate the current capability boundary of LLMs as organizational decision-makers and inform the design of future AI-assisted executive systems.
[AI-82] Dissecting model behavior through agent trajectories
Link: https://arxiv.org/abs/2606.17454
Authors: Gaurav Gupta, Vatshank Chaturvedi, Jun Huan, Anoop Deoras
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 106 pages, 50 Figures, 16 Tables
Abstract:AI agent performance is not just a modeling problem; it is fundamentally a systems problem. The advanced capabilities of models are realized through agent harnesses. Therefore, a gap between model assumptions and harness behavior can easily prevent the model’s full capabilities from translating into agent performance. We formalize this as the 'intent-execution' gap: the mismatch between what the model intends and what the harness executes, and vice versa. We argue that minimizing this intent-execution gap is as important as other aspects of harness design such as tools and execution loops. To illustrate the impact of this harness-model alignment, we develop a simple and customizable harness called 'Simple Strands Agent' (SSA). SSA aims to find the bulk of common patterns which generalize across different model families (such as Claude, Gemini, GPT, Grok, Qwen), as well as a small number of model-specific preferences. We make two contributions: (i) we reproduce or improve on the pass@1 performance reported by diverse model-provider families on popular agentic benchmarks (SWE-Pro, SWE-Verified and Terminal-Bench-2), and (ii) building on an analysis of 138k trajectories generated by SSA, we look beyond the pass@1 numbers, which tend to be relatively even across frontier models. By representing agent trajectories in code state-spaces, we observe model-level differences in problem-solving behavior. Finer-grained metrics such as edit frequency, testing activity, and phase-transitions reveal how individual models allocate effort across different stages of autonomous problem solving.
[AI-83] MapSatisfyBench: Benchmarking Satisfaction-Aware Map Agents through Behavior-Grounded Implicit Decision Factors
Link: https://arxiv.org/abs/2606.17453
Authors: Lubin Bai, Mengyu Cao, Sixue Wang, Zhongwei Wan, Yue Pan, Jiale Hou, Xiang Li, Xiuyuan Zhang
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language model agents are increasingly integrated into map services. Since map services are embedded in everyday-life scenarios rather than professional task settings, users often express their needs informally, resulting in underspecified queries with many unspoken needs, namely, implicit decision factors that are critical for user satisfaction. Although clarification is an effective way to mitigate this issue, it increases user burden in daily interaction, and a capable agent should first proactively recover such factors from available information sources. However, evaluating this ability is challenging. The first challenge is to determine which implicit decision factors are suitable for evaluation. A factor is evaluable only if it affects user acceptance and can be recovered from information available to the agent before it responds. Second, user satisfaction cannot be reliably represented by a single reference answer, requiring a benchmark that converts satisfaction-relevant factors into objective and quantifiable evaluation targets. To address these challenges, we propose a restore-identify-filter framework that reconstructs complete user needs from behavior-chain evidence, identifies implicit decision factors, and retains only those supported by pre-query evidence. Building on this methodology, we construct MapSatisfyBench from large-scale, real-world anonymized user data and annotate ground truth along five dimensions, enabling full-chain evaluation of satisfaction-aware map agents. Experiments show that current agents generally perform well on explicit task completion, but remain limited in satisfying implicit decision factors and proactively acquiring the evidence needed for satisfaction-aware decisions. These findings establish MapSatisfyBench as a benchmark for shifting map-agent evaluation from task completion toward satisfaction-aware spatial decision making.
[AI-84] A Machine-Learned Comorbidity Index ICML2026
Link: https://arxiv.org/abs/2606.17450
Authors: Suleman Baloch, Kishlay Jha, Alberto M. Segre, Philip M. Polgreen, Bijaya Adhikari
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at the 43rd International Conference on Machine Learning (ICML 2026), Seoul, South Korea. 35 pages
Abstract:Traditional comorbidity scores (e.g., Charlson and Elixhauser) are widely used for risk adjustment and patient stratification, but they have two key limitations: (i) they are largely mortality-centric and do not align well with other clinical outcomes, and (ii) their linear, rule-based structure cannot capture nonlinear, outcome-specific risk relationships. We propose a Machine-Learned Comorbidity Index (MLCI) that maps diagnosis codes to a single scalar by maximizing the normalized Hilbert-Schmidt Independence Criterion (nHSIC) between the learned score and multiple clinical outcomes. MLCI captures nonlinear risk-outcome dependence and is supported by a theory that characterizes when a unified, informative admission-level ordering can be achieved across outcomes. Empirical results on multiple benchmark electronic health record (EHR) datasets show that MLCI outperforms strong baselines across multiple evaluation metrics.
[AI-85] L-Proto: Language-Aware Episodic Prototypical Training for Multilingual Speaker Verification INTERSPEECH2026
Link: https://arxiv.org/abs/2606.17416
Authors: Hyung-Seok Oh, Deok-Hyeon Cho, Seung-Bin Kim, Seong-Whan Lee
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Accepted by INTERSPEECH 2026
Abstract:Multilingual speaker verification remains challenging because language-dependent acoustic variability causes speaker identity to become entangled with linguistic characteristics, degrading generalization across languages. In multilingual training, embeddings often encode language cues with speaker identity, causing speakers to form language-specific clusters. We propose L-Proto, a language-aware episodic prototypical training strategy that constructs language-consistent episodes. By sampling speakers from a single language per episode, L-Proto reduces language-driven variation during training and encourages embeddings to focus more directly on speaker identity. Experiments on the TidyVoice Challenge benchmark demonstrate consistent performance improvements over conventional fine-tuning and random episodic sampling across multiple backbone architectures.
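The core idea above, drawing every speaker in an episode from a single language, can be sketched as a small sampler. This is an illustrative reading of the abstract rather than the authors' implementation: the function name, the corpus layout, and the support/query split sizes are all hypothetical.

```python
import random


def sample_language_consistent_episode(utts_by_lang, n_speakers, n_support, n_query, rng):
    """utts_by_lang: {language: {speaker_id: [utterance_id, ...]}}.
    Returns (language, support, query) with all speakers from one language."""
    per_spk = n_support + n_query
    # Speakers with enough utterances to fill both support and query sets.
    eligible = {
        lang: [s for s, utts in speakers.items() if len(utts) >= per_spk]
        for lang, speakers in utts_by_lang.items()
    }
    candidates = sorted(lang for lang, spks in eligible.items() if len(spks) >= n_speakers)
    if not candidates:
        raise ValueError("no language has enough speakers for this episode shape")
    lang = rng.choice(candidates)
    speakers = rng.sample(sorted(eligible[lang]), n_speakers)
    support, query = {}, {}
    for spk in speakers:
        picked = rng.sample(utts_by_lang[lang][spk], per_spk)
        support[spk], query[spk] = picked[:n_support], picked[n_support:]
    return lang, support, query


# Hypothetical toy corpus; "sw" has too few speakers and utterances to be chosen.
corpus = {
    "en": {f"en_spk{i}": [f"en_spk{i}_utt{j}" for j in range(4)] for i in range(3)},
    "ko": {f"ko_spk{i}": [f"ko_spk{i}_utt{j}" for j in range(4)] for i in range(3)},
    "sw": {"sw_spk0": ["sw_spk0_utt0"]},
}
lang, support, query = sample_language_consistent_episode(
    corpus, n_speakers=2, n_support=1, n_query=2, rng=random.Random(0)
)
```

In a prototypical setup, each speaker's prototype would then be the mean embedding of its support utterances, with query utterances classified against those prototypes. Because all classes in the episode share a language, language cues cannot help separate them.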
[AI-86] Discrete Autoregressive Transformer for Generative Mechanism Synthesis
Link: https://arxiv.org/abs/2606.17409
Authors: Anar Nurizada, Anurag Purwar
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Planar path synthesis requires mechanisms whose coupler curves match a prescribed trajectory; the mapping from curve to linkage is inherently one-to-many across four-, six-, and eight-bar topologies. We address this design problem with simulation-grounded evaluation on a curated corpus of over one million mechanisms, reporting Chamfer distance and dynamic time warping after forward kinematics and geometric alignment. We formulate synthesis as conditional autoregressive sequence modeling: joint coordinates are uniformly quantized to tokens and generated by a decoder-only transformer with a variational-autoencoder (VAE) latent of the target curve and an explicit mechanism-type token. Training combines token cross-entropy with a Gaussian-smoothed bin auxiliary loss that respects ordinal structure among bins. At inference, a bounded latent-noise schedule decodes all mechanism types at each noise level; we retain the top five candidates by geometric error, yielding diverse accurate families without dataset lookup. On held-out tests, aggregate mean Chamfer distance is 0.0132 and mean dynamic time warping is 0.153; a latent k-nearest-neighbor baseline that conditions on training-set neighbor latents in VAE space achieves matched-topology mean Chamfer distance 0.0071 and mean dynamic time warping 0.117 using the same decoder.
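Two ingredients named above, uniform quantization of joint coordinates into tokens and Chamfer distance between curves, are easy to illustrate. The sketch below is a rough approximation under stated assumptions: the coordinate range, bin count, and the exact Chamfer convention (here, mean squared nearest-neighbour distance summed over both directions) are not given in the abstract and are chosen for illustration only.

```python
def quantize(value, low, high, n_bins):
    """Map a coordinate in [low, high] to an integer token in [0, n_bins - 1]."""
    clipped = min(max(value, low), high)
    token = int((clipped - low) / (high - low) * n_bins)
    return min(token, n_bins - 1)  # value == high falls in the last bin


def dequantize(token, low, high, n_bins):
    """Return the centre of the bin that `token` indexes."""
    return low + (token + 0.5) * (high - low) / n_bins


def chamfer_distance(curve_a, curve_b):
    """Symmetric Chamfer distance between two lists of (x, y) points."""
    def one_way(src, dst):
        # Mean over src of the squared distance to the nearest point in dst.
        return sum(min((x - u) ** 2 + (y - v) ** 2 for u, v in dst) for x, y in src) / len(src)
    return one_way(curve_a, curve_b) + one_way(curve_b, curve_a)


# Assumed setup: coordinates normalized to [-1, 1] and 256 bins per axis.
LOW, HIGH, BINS = -1.0, 1.0, 256
token = quantize(0.3, LOW, HIGH, BINS)
recovered = dequantize(token, LOW, HIGH, BINS)
```

Decoding to bin centres bounds the round-trip error by half a bin width, which is one reason a loss that respects the ordering of bins, as the abstract describes, is a natural complement to plain cross-entropy.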
[AI-87] Treatment Response Optimized Clinical Decision Support AI System via Digital Twin Simulation
Link: https://arxiv.org/abs/2606.17405
Authors: Xinyu Qin, Anil K. Sood, Ruiheng Yu, Sara Corvigno, Elaine Stur, Lu Wang
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted for presentation at the IEEE Engineering in Medicine and Biology Conference (EMBC) 2026
Abstract:Clinical decision support AI systems (CDSASs) must adapt to evolving patient conditions in real-time while adhering to strict safety constraints. We present an online adaptive framework that integrates Treatment Effect (TE) estimation to quantify clinical benefits, a patient Digital Twin (DT) to simulate treatment trajectories, and Reinforcement Learning (RL) for sequential decision-making. The AI system is initially trained on historical medical records and operates in a continuous learning loop. To ensure safety, a rule-based module monitors vital signs and blocks contraindicated treatments. Cases with strong internal model disagreement are flagged for clinician review, simulated in our experiments via a pre-trained outcome model. We validate our framework using both a synthetic clinical simulator and a real-world ovarian cancer dataset from The Cancer Genome Atlas (TCGA). In both simulated and clinical settings, our method demonstrated superior effectiveness and stability in recommending treatments compared to standard computational baselines. Furthermore, the AI system maintains low latency and requires expert consultation for only a minority of cases in our experimental validation, demonstrating its potential as a safe, clinician-supervised tool for personalized medicine that continuously improves through practical use.
[AI-88] The Discrete-Log Clock: How a Transformer Learns Modular Multiplication ICML2026
Link: https://arxiv.org/abs/2606.17399
Authors: Huu Danh Nguyen (Stanford University)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 5 pages, 5 figures. Accepted to the Mechanistic Interpretability Workshop at ICML 2026
Abstract:When small transformers grok modular multiplication, prior work reports that the learned embedding has a “dense” Fourier spectrum requiring all frequencies. This contrasts with modular addition, where only a sparse set of key frequencies suffices. We show this density is an artifact of analyzing in the wrong basis. The natural Fourier transform for multiplication is not the standard additive DFT but the multiplicative character transform, which decomposes functions on the multiplicative group $(\mathbb{Z}/p\mathbb{Z})^*$ into its irreducible representations. Applying this transform to a grokked transformer trained on $a \cdot b \bmod 113$, we find the embedding spectrum becomes highly sparse (Gini coefficient 0.58 vs. 0.07 in the additive basis) with only 4 key frequencies carrying significant energy. Furthermore, 96.9% of MLP neurons are cleanly tuned to a single multiplicative frequency, and neuron activation heatmaps reveal 2D-periodic structure when reordered by the discrete logarithm. These results demonstrate the transformer reduces multiplication to addition in discrete-log space, implementing a “Discrete-Log Clock” algorithm analogous to Nanda et al.'s Clock algorithm for addition. The methodology generalizes: matching the analysis basis to the algebraic structure of the task reveals interpretable structure where standard tools see noise.
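The claim that multiplication reduces to addition in discrete-log space can be checked directly for p = 113. The sketch below is illustrative only: the helper names `primitive_root`, `discrete_log_table`, and `character` are hypothetical, and the brute-force search is a convenience for small primes, not the paper's analysis code.

```python
import cmath


def primitive_root(p):
    """Smallest generator of the multiplicative group mod a prime p (brute force)."""
    for g in range(2, p):
        seen, x = set(), 1
        for _ in range(p - 1):
            x = x * g % p
            seen.add(x)
        if len(seen) == p - 1:  # g hits every non-zero residue
            return g
    raise ValueError("no primitive root found; is p prime?")


def discrete_log_table(p):
    """Return (g, table) where table[x] = k such that g**k % p == x."""
    g = primitive_root(p)
    table, x = {}, 1
    for k in range(p - 1):
        table[x] = k
        x = x * g % p
    return g, table


def character(k, a, p, dlog):
    """k-th multiplicative character: exp(2*pi*i * k * dlog[a] / (p - 1))."""
    return cmath.exp(2j * cmath.pi * k * dlog[a] / (p - 1))


p = 113
g, dlog = discrete_log_table(p)

# Multiplication mod p becomes addition mod (p - 1) after taking discrete logs.
ok = all(
    dlog[a * b % p] == (dlog[a] + dlog[b]) % (p - 1)
    for a in range(1, p)
    for b in range(1, p)
)
```

Reordering the rows of an embedding matrix by `dlog` and taking an ordinary DFT over the p - 1 non-zero residues is one way to obtain the multiplicative character spectrum the abstract describes.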
[AI-89] SoK: AI-Augmented Binary Reversing
Link: https://arxiv.org/abs/2606.17398
Authors: Yujeong Kwon, Yiyue Zhang, Shakhzod Yuldoshkhujaev, Kexin Pei, Dokyung Song, Hyungjoon Koo
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 20 pages, 7 tables, 3 figures
Abstract:Binary reversing is fundamental to software understanding, vulnerability discovery, malware investigation, and firmware auditing. However, it remains inherently challenging due to the irreversible loss of semantic information during compilation. Recent advances in machine learning, large language models (LLMs), and agentic AI systems have accelerated the adoption of AI-augmented binary reversing. Yet, the resulting body of work has become increasingly fragmented across reversing domains, artifact representations, learning approaches, and evaluation practices. This paper presents the first comprehensive systematization of knowledge on AI-augmented binary reversing. We analyze 144 research papers published since 2015, and organize them into 22 binary reversing domains according to the inference tasks. We further introduce a unified taxonomy spanning conventional and AI-augmented reversing pipelines. Our taxonomy connects traditional analysis techniques, binary-derived artifacts, representation strategies, learning paradigms, and downstream inference tasks, while clarifying the emerging roles of LLMs and agentic AI systems. By establishing a common vocabulary and structured framework, we provide a holistic view of the field’s evolution over the past decade. Our study reveals common structures underlying seemingly disparate approaches, highlights persistent technical challenges and evaluation gaps, and identifies promising opportunities for future research. Collectively, these insights clarify the current state of the field and provide a foundation for the next generation of reliable and scalable AI-augmented binary reversing systems.
[AI-90] Distributed General-Purpose Agent Networks: Architecture, Key Mechanisms, and Prototypes
Link: https://arxiv.org/abs/2606.17368
Authors: Shengli Zhang, Deen Ma, Zibin Lin, Taotao Wang
Subjects: Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments:
Abstract:Large language models have accelerated the transition from passive conversational assistants to autonomous agents that can understand goals, plan actions, invoke tools, and execute multi-step tasks. Yet the capability of a single agent remains constrained by its local data, tool permissions, runtime environment, and governance boundary. This paper studies distributed general-purpose agent networks: open peer-to-peer networks in which heterogeneous agents deployed on personal devices, edge nodes, or autonomous computing environments can discover one another, establish trust, negotiate cooperation rules, and execute open-ended tasks. We argue that such networks cannot be obtained by simply combining existing peer-to-peer overlays with conventional multi-agent systems. Unlike traditional P2P networks, agent networks must propagate semantic declarations about intentions, capabilities, states, and cooperation constraints. We therefore propose a layered architecture centered on a protocol adaptation layer that connects upper-level task semantics with lower-level network operations. Based on this architecture, the paper identifies three core mechanism problems: semantic announcement propagation for collaborator discovery, verifiable identity and multi-topic reputation for cooperation governance, and semantic-gradient mechanism design for open task execution. For each problem, we present a technical route, including bodyless gossip with sequential logs, BAID-based identity binding with MG-EigenTrust reputation, and a Stackelberg-style mechanism-generation loop driven by semantic attribution feedback. We further report prototype overhead results for BAID-style tiered verification and mechanism-level simulations of MG-EigenTrust under cross-topic disguise-collusion attacks. The resulting framework provides a system-level foundation for open, trustworthy, and scalable agent collaboration.
[AI-91] Counterfactual Optimization of Baseball Pitch Sequences and Estimation of Its Impact on Season-Level Statistics
Link: https://arxiv.org/abs/2606.17345
Authors: Ryota Takamido, Hiroki Nakamoto
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Although pitch sequencing is a central topic in baseball analytics, previous studies have primarily focused on optimizing the final pitch within a single plate appearance, leaving the role of preceding setup pitches and their impact on long-term season-level performance insufficiently examined. To address these issues, this study conducted counterfactual analyses using MLB Statcast data. A Transformer-based machine-learning model was trained to predict whether a target pitch would result in an in-play outcome or swing-out. Counterfactual pitch sequences were then generated by replacing either the final pitch or the preceding setup pitch with alternative pitch types and locations while keeping the surrounding contextual information fixed. Optimal counterfactual selections were defined as those that minimized the predicted in-play probability, and their expected effects on pitchers’ seasonal statistics were estimated using regression models linking model outputs to season statistics. The results suggest that the optimization of both final and setup pitches may substantially influence season-level performance, including improvements of more than 1.0 in K/9. The analyses also provided several practical insights, including velocity-band-specific effective locations, the importance of pitch commands, and the expansion of pitch-selection options through middle-velocity pitches. These findings quantitatively support the strategic importance of pitch sequencing in baseball.
[AI-92] MemTrace: Probing What Final Accuracy Misses in Long-Term Memory
Link: https://arxiv.org/abs/2606.17328
Authors: Xianxuan Long, Zhikai Chen, Shenglai Zeng, Shouren Wang, Kai Guo, Jiliang Tang
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:LLM agents increasingly maintain long-term memory of user facts across sessions. Yet such memory is usually evaluated by aggregating accuracy over question rows or episodes. Because this approach scores question rows independently, even when several questions probe the same fact, it cannot show how that fact behaves as conditions change. We introduce MemTrace, a benchmark whose unit of measurement is the knowledge point: a single typed fact about the user, rather than an individual question. MemTrace probes each fact along three controlled dimensions: memory age, defined by how many sessions ago the fact appeared in the history; question type, covering current state, earlier state, and trajectory of change; and evidence condition, covering present, missing, and contradicted-by-false-premise settings. Evaluating 13 memory-system configurations across four paradigms, we find that similar pooled accuracy hides different failures: recovering a fact’s current and earlier states does not imply tracking how it changed, and safe abstention does not imply correcting a false premise. The dominant bottleneck is evidence use, not retrieval: when systems fail, the evidence was retrievable 10 times more often than it was missing. These results suggest that improving long-term memory requires better use of reachable evidence, not simply more storage or retrieval.
[AI-93] Transformer-Based Warm-Starting for Feasible and Optimal Terminal Approach to Tumbling Objects with Space Manipulators
Link: https://arxiv.org/abs/2606.17317
Authors: Yuji Takubo, Maximilian Adang, Mac Schwager, Simone D’Amico
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: 8 pages, 4 figures
Abstract:Real-time trajectory generation for on-orbit robotic servicing is challenging due to the nonlinear coupling between spacecraft bus motion, manipulator dynamics, visibility cone, and trajectory-level safety constraints. This paper studies learning-based warm-starting for sequential convex programming (SCP) in the terminal approach of a space manipulator toward a tumbling target. The proposed framework decomposes the problem into a system center-of-mass translational planning stage and a coupled attitude–manipulator torque-allocation stage, and applies a causal transformer warm-start to the latter, which constitutes the dominant computational bottleneck. Linear and flow matching action decoders are compared under different action-chunking and training dataset sizes, and the resulting warm-starts are evaluated under both cost-optimal and feasibility projection using SCP. Across 300 held-out scenarios, the learned warm-start reduces the second-stage SCP iteration count by up to 28% and the runtime by 23% while preserving the final control-cost distribution. When the learned warm-starts are used for nonconvex feasibility projection, they nearly halve the runtime relative to cost-optimal SCP, while avoiding the catastrophic high-cost tail behavior observed when initialized heuristically. These results indicate that sequence-model warm-starts can improve both the computational efficiency and trajectory robustness of optimization-based terminal guidance for space manipulation.
[AI-94] Quantifying Consistency in LLM Logical Reasoning via Structural Uncertainty ICLR2026
Link: https://arxiv.org/abs/2606.17312
Authors: Baishali Chaudhury, Mengdie Flora Wang, Hyunji Hayley Park, Rahul Ghosh, Sungmin Hong, Jae Oh Woo
Subjects: Artificial Intelligence (cs.AI)
Comments: Published at ICLR 2026 Workshop on Logical Reasoning of Large Language Models. Accepted as best paper
Abstract:Large language models can arrive at the same answer through reasoning paths that are unstable, contradictory, or difficult to rank consistently – a failure mode especially prevalent in multi-step deductive reasoning. Existing methods assess reliability primarily through output dispersion – measuring how much sampled answers differ – but this discards a complementary signal: whether the model can consistently rank competing reasoning candidates. We propose structural uncertainty, a consistency-aware framework derived from the stability of self-preference-induced rankings over sampled reasoning solutions. Given a query, we generate multiple candidate solutions and ask the model to judge pairwise preferences among its own outputs. We aggregate self-preferences into ranking distributions via Bradley-Terry modeling with PageRank, and decompose the signal into two entropy-based components: across-trial ranking instability and within-trial candidate ambiguity. Across five LLMs and eight benchmarks, structural signals provide information complementary to answer dispersion: on logical and mathematical reasoning tasks, the combination improves identification of unreliable instances, while on factual retrieval the structural signal collapses toward uniformity, diagnosing a regime boundary where reasoning-level consistency evaluation is uninformative. The two components relate differently to accuracy: within-trial ambiguity correlates positively with correctness – consistent with settings where multiple plausible solution paths remain competitive – while across-trial instability correlates negatively, signaling unreliable reasoning. Structural uncertainty is best understood not as a universal confidence estimator, but as a regime-sensitive evaluator of logical reasoning consistency.
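To make the aggregation step concrete, the sketch below fits Bradley-Terry strengths from a matrix of pairwise self-preferences using the standard minorization-maximization update, then measures how spread out the resulting distribution is with Shannon entropy. It is a simplified illustration under assumptions: the paper's PageRank component and its exact definitions of across-trial instability and within-trial ambiguity are not reproduced, and the names `bradley_terry` and `entropy` are hypothetical. The input is assumed to have a zero diagonal and at least one recorded comparison.

```python
import math


def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from pairwise preference counts.

    wins[i][j] = number of times candidate i was preferred over candidate j.
    Returns strengths normalised to sum to 1.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins of candidate i
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new.append(w_i / denom if denom > 0 else p[i])
        total = sum(new)
        p = [x / total for x in new]
    return p


def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in dist if x > 0)
```

Under this reading, a near-uniform strength vector (entropy close to `log(n)`) means the model cannot separate its own candidates, while entropy that varies a lot across repeated trials would indicate unstable rankings.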
[AI-95] From Democracies to Autocracies: How AI Systems Enable Authoritarianism by Design
Link: https://arxiv.org/abs/2606.17286
Authors: Jeba Sania, Marta Ziosi, Fazl Barez
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:
Abstract:AI-enabled authoritarianism is not confined to autocracies. In this paper, we provide greater transparency by investigating and mapping the lifecycles of six AI systems deployed in different political regimes, ranging from the US to China. By drawing on an extensive range of sources (academic publications, investigative research reports, third-party evaluations, media interviews, government procurement notices), we conduct a systematic, qualitative comparison across systems to identify the critical technical and operational features that enable authoritarianism within their respective political contexts. We find that enabling features include the centralization and co-optation of administrative data for law enforcement and political punishment, regulatory gaps that fail to deter misuse, weak user compliance that nullifies human oversight mechanisms, and the encoding of protected group traits that identify members of vulnerable populations. We find that these features are present across systems deployed in autocratic and democratic regimes, albeit in varying configurations. We also find that both centralized and fragmented AI systems can contribute to authoritarianism by exploiting governance gaps: centralized systems directed by executive authorities, particularly within security and military institutions, are often not subjected to formal oversight mechanisms, while fragmented systems diffuse accountability between stakeholders, paving the way for entrenchment. These findings reveal that AI-enabled authoritarianism is distributed, resulting from design and operational choices made by developers, administrators, and users alike. We conclude with recommendations for developers and policymakers to mitigate these risks.
[AI-96] ARVO: Atlas of Reproducible Vulnerabilities for Open-Source Software
Link: https://arxiv.org/abs/2606.17283
Authors: Xiang Mei, Jordi Del Castillo, Pulkit Singh Singaria, Haoran Xi, Abdelouahab Benchikh, Tiffany Bao, Ruoyu Wang, Yan Shoshitaishvili, Adam Doupé, Hammond Pearce, Brendan Dolan-Gavitt
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at IEEE European Symposium on Security and Privacy (EuroSP) 2026
Abstract:Achieving reproducibility, quantity, and diversity in vulnerability datasets has long been viewed as an inherent three-way trade-off, where improving one dimension often comes at the cost of the others. In practice, reproducibility has been the dimension most often neglected. This has limited what can be automatically extracted from historical bug datasets, and has reduced their utility for downstream security research. In this work, we propose a method to produce a new security dataset which ensures reproducibility for diverse vulnerabilities at scale by identifying the key obstacles to large-scale bug reproduction and addressing them with general solutions. Using this method, we introduce full reproducibility to the largest open source software vulnerability dataset (OSS-Fuzz) and construct the ARVO dataset (an Atlas of Reproducible Vulnerabilities in Open-source software). ARVO is a large-scale dataset consisting of over 6,100 real-world vulnerabilities across 311 projects. Focusing on reproducibility, ARVO differs from existing datasets by providing each vulnerability in a form that can be consistently rebuilt, triggered, and analyzed across versions. Reproducibility also enables automatic identification of the corresponding patch for each vulnerability and supports direct interaction with vulnerabilities after code changes, capabilities that existing large-scale datasets do not provide. In our evaluation, ARVO successfully reproduces 81% of vulnerabilities and achieves 89.4% accuracy on the located patches. We also discuss ARVO’s influence on both upstream practices and downstream security research. 
[AI-97] Skill-Constrained Model Predictive Control for Resilient Manufacturing Supply Chains
Link: https://arxiv.org/abs/2606.17269
Authors: Carlos Eduardo Sanoja
Subjects: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments:
Abstract:In skill-constrained production-inventory systems, the qualified human capacity available tomorrow depends on training decisions made today: production requires certified workers, certifications decay unless maintained, and training consumes the same scarce worker hours that production needs now. We study a closed-loop skill-constrained model predictive controller that, at every shift, solves a finite-horizon mixed-integer program over production, inventory, backlog, and training, with binary predicted certification, hard production eligibility, and an interpretable terminal value that prices certified-capacity gaps at the horizon boundary; only the first-period action is applied before replanning. On synthetic, seed-controlled SkillChain-Gym scenarios - announced and surprise new-skill shocks, demand shocks, absenteeism, forecast- and availability-quality modes, capacity-boundary and training-rate sweeps, and negative controls - we evaluate the controller against production-only and maintenance-only ablations, static cross-training insurance plans, and a strong reactive heuristic, under an ex-ante locked configuration and paired statistics. The result is regime dependence, not superiority: no policy class dominates. Predictive control helps when skill or labor bottlenecks are forecastable early enough for training to complete; lean static insurance remains hard to beat under surprise shocks, near the demand-capacity boundary, and wherever pre-shock slack makes insurance cheap. Attribution ablations separate certification maintenance, re-acquisition of lapsed certifications, and greenfield skill acquisition. Forecastability, not adaptivity per se, decides when predictive control pays.
[AI-98] SkillChain-Gym: A Benchmark for Reskilling-Aware Production-Inventory Control under Disruptions
Link: https://arxiv.org/abs/2606.17266
Authors: Carlos Eduardo Sanoja
Subjects: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments:
Abstract:Production planning increasingly has to treat workforce capability as a decision variable: certifications lapse when skills are not maintained, new products require skills the current workforce does not hold, and reskilling competes for the same worker hours needed for production. Existing operations benchmarks usually treat labor as exogenous, while workforce-planning models with skills and learning are rarely released as reusable testbeds. We introduce SkillChain-Gym, a benchmark specification for reskilling-aware production-inventory control: a single-site environment with stylized worker skill-state dynamics, hard threshold certification, forgetting, and capacity-consuming training actions constrained by the same per-worker time budget as production. The benchmark includes seed-controlled disruption scenarios, three feasibility modes with projection diagnostics, deterministic replay, and metrics covering operations, resilience, capability growth, and training-access distribution. We evaluate production-only, reactive adaptive, water-filling adaptive, and static-insurance policies with budget variants over 60-shift horizons with paired statistical tests. The results are regime-dependent rather than a ranking. Training-capable policies dominate the production-only baseline, and maintenance training is necessary under forgetting even without disruptions. Among training-capable classes, adaptive training helps when bottlenecks are visible in the forecast, while a lean static cross-training plan, a deliberately favorable comparator whose structure encodes relevant skill contingencies, acts as strong insurance under surprise shocks and absenteeism. Capacity slack and the forgetting rate govern the boundary between these regimes. No policy class dominates across regimes, motivating forecast-driven controllers that decide when to buy skill insurance and when to react.
[AI-99] When Rules Learn: A Self-Evolving Agent for Legal Case Retrieval ACL2026
Link: https://arxiv.org/abs/2606.17220
Authors: Mingxu Tao, Jiawei Hu, Xian Zhou, Wenpeng Hu, Jiajun Cheng, Yunbo Cao, Zhunchen Luo, Guotong Geng
Subjects: Artificial Intelligence (cs.AI)
Comments: To appear in ACL 2026
Abstract:Legal case retrieval remains challenging due to the complexity of legal language and the need for precise lexical alignment between queries and relevant cases. Although dense retrieval models have achieved notable progress, empirical studies show that BM25 continues to serve as a strong baseline in this domain. This motivates us to propose a self-evolving framework for rule-driven query rewriting that enhances BM25 without any parameter training. The framework equips an LLM-based agent with an automatic evaluation environment, enabling it to iteratively create rewriting rules, plan validation experiments over rule combinations, and eliminate ineffective rules based on historical feedback. We evaluate our method on the Chinese legal case retrieval benchmark LeCaRD-v2. Experimental results demonstrate that the proposed framework outperforms non-evolutionary baselines, including human-designed rules and greedy rule selection, particularly when powered by a high-capacity core LLM. We also conduct detailed analyses to investigate the mechanisms underlying self-evolution. Our findings reveal that the LLM’s ability to leverage previous experimental results and its intrinsic knowledge of rule elimination play critical roles in refining the rule set via self-evolution.
[AI-100] Trust-Aware Multi-Agent Traceability: Confidence-Calibrated Knowledge Graphs for Consistent Software Artifact Management
Link: https://arxiv.org/abs/2606.17203
Authors: Mohamed Essam, Kareem Wael, Azza Hassan, Ahmed Haitham, Mahmoud Soliman, Samer Saber, Ibrahim Habib
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Multi-agent AI systems are increasingly used to automate software engineering tasks including requirements analysis, architecture design, test generation, and traceability linking. When these agents operate as a sequential pipeline over shared software artifacts, errors and low-confidence decisions made by upstream agents propagate to downstream stages, producing orphaned requirements, contradictory links, and compliance gaps that pose significant risks in safety-critical domains. We propose a trust-aware coordination framework where a shared knowledge graph serves as both centralized semantic memory and a coordination surface through which agents assess and build upon each other’s contributions using calibrated confidence scores. Our approach introduces a two-stage traceability link prediction pipeline combining embedding-based retrieval with LLM-based multi-criteria analysis, a traceability seeding mechanism that enables comparison between derivation-time and validation-time confidence, and a consistency protocol governing pipeline interactions through confidence threshold gating, confidence divergence detection, and conflict resolution. We evaluate on an automotive software engineering case study measuring link prediction calibration, protocol effectiveness, threshold sensitivity, and the impact of traceability seeding. Ablation studies confirm that confidence calibration is essential for effective pipeline coordination.
[AI-101] PowerOPD: Stabilizing On-Policy Distillation with Bounded Power Transformation
Link: https://arxiv.org/abs/2606.17199
Authors: Anhao Zhao, Junlong Tong, Yingqi Fan, Ping Nie, Wenjie Li, Xiaoyu Shen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Standard on-policy distillation (OPD) for large language models estimates the reverse-KL objective using student-sampled tokens, yielding an unbiased single-sample Monte Carlo estimator that avoids vocabulary-wide computation. However, we show that this estimator suffers from severe training pathologies in practice: sample inefficiency, unstable generation dynamics, and a substantial performance gap compared to exact full-vocabulary OPD. Reward-level diagnosis traces these pathologies to the log-ratio reward, which is unbounded by construction, producing extremely high-variance gradients concentrated at early positions and persisting throughout training; standard post-hoc scaling methods fail because they operate only after this distortion occurs. To solve this problem, we propose PowerOPD: a family of natively bounded, sign-consistent rewards from the Box-Cox power transformation, parameterized by $\alpha > 0$, of which the log-ratio is the degenerate $\alpha \to 0$ limit. Across six mathematical reasoning benchmarks and four Qwen3 teacher-student pairs, PowerOPD achieves benchmark-averaged Avg@8/Pass@8 gains of up to +6.37/+5.71 over vanilla OPD, +3.01/+3.54 over post-hoc stabilization, and +2.59/+8.90 over full-vocabulary OPD, while reducing wall-clock time by 59.2% and peak GPU memory by 23.1%. Larger $\alpha$ generally improves accuracy, consistently shortens responses, and keeps gradient norms more than 3,000x smaller than vanilla OPD.
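The relationship between the Box-Cox power transformation and the log-ratio can be illustrated numerically. The sketch below implements only the textbook Box-Cox transform of a positive ratio; the exact reward used by PowerOPD (which ratio is transformed, and how it is bounded on both sides) is not given in the abstract, so the function name `box_cox` and its use here are assumptions for illustration.

```python
import math


def box_cox(ratio, alpha):
    """Box-Cox power transform of a positive ratio.

    (ratio**alpha - 1) / alpha for alpha != 0, and log(ratio) in the
    alpha -> 0 limit.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    if abs(alpha) < 1e-12:
        return math.log(ratio)
    return (ratio ** alpha - 1.0) / alpha


# For alpha > 0 the transform keeps the sign of log(ratio) and is bounded
# below by -1/alpha, whereas log(ratio) diverges as the ratio approaches 0.
tiny = 1e-9
log_value = math.log(tiny)
power_value = box_cox(tiny, 0.5)
```

With `alpha = 0.5`, a ratio near zero maps to a value just above -2, while the plain logarithm of the same ratio is far more negative; this is the kind of tail the abstract identifies as a source of high-variance gradients.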
[AI-102] Cluster-Aware Dual-Level Test Specification Generation for Large-Scale Automotive Software Requirements
Link: https://arxiv.org/abs/2606.17197
Authors: Hazem Ayman, Menna Sedik, Kareem Mostafa, Mahmoud Soliman, Samer Saber, Ibrahim Habib
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Generating test specifications that satisfy Automotive SPICE SWE.6 requirements becomes increasingly challenging and time-consuming as projects scale to thousands of requirements. Because this manual process often consumes weeks of engineering effort, automation becomes a critical necessity. However, standard Large Language Model (LLM) approaches struggle at scale: processing requirements individually discards vital inter-requirement dependencies, while feeding entire corpora at once exceeds context-window limits, leading to incomplete integration coverage and redundant test cases. This paper presents a novel “Cluster-then-Summarize” pipeline that addresses these limitations through three stages. Requirements are embedded using sentence transformers and grouped using UMAP dimensionality reduction followed by HDBSCAN density-based clustering. This grouping utilizes an automatic minimum cluster size selection driven by a quality criterion combining normalized Silhouette and Calinski-Harabasz scores. A multi-level map-reduce summarization algorithm then distills each cluster into concise, domain-conformant descriptions while preserving quantitative thresholds and safety integrity levels. The pipeline exploits the derived cluster topology to generate test specifications at two levels: individual requirement verification and cluster-level integration tests that verify cross-requirement feature behavior. A nearby-cluster context mechanism provides bounded cross-feature awareness during each LLM call, and Retrieval-Augmented Generation grounds all outputs in ISO 26262 and ASPICE standards. Evaluation on automotive requirement datasets of varying scale demonstrates that the cluster-aware approach improves integration test coverage and maintains summarization fidelity compared to baseline methods while scaling efficiently to thousands of requirements.
[AI-103] Vibrato Expression Control for Singing Voice Conversion with Improving Independent Control
Link: https://arxiv.org/abs/2606.17126
Authors: Joon-Seung Choi, Dong-Min Byun, Seong-Whan Lee
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Accepted to IEEE Transactions on Audio, Speech, and Language Processing (TASLP)
Abstract:Singing style is a crucial aspect of a natural and expressive singing voice. Singers use singing styles to convey the feeling or emotion of a song. Several works have been proposed to control singing style in order to make singing voices more expressive. Recently, VibE-SVC successfully controlled vibrato by predicting the high-frequency F0 contour. In this paper, we introduce a singing voice conversion framework, called VibE-SVC2, to improve singing style conversion performance and controllability. The model offers control over two types of singing styles: a pitch style and a timbre style. For the pitch style, to resolve the pitch-energy entanglement issue left unresolved in our previous work, we introduce a novel Energy Style Converter to address remaining style information in the energy contour. In addition, we propose a Zero-shot Pitch Style Converter, which mimics the pitch style of reference audio. To expand the controllability of the model, we propose vibrato rate scaling, which is controlled independently of vibrato extent and is unavailable in VibE-SVC. For the timbre style, we extend the model to handle a variety of phonation styles. However, addressing specific styles such as vocal fry poses a challenge, as conventional F0 extraction often fails due to their inherent subharmonic characteristics, which degrades the conversion quality. To address this, we propose a novel Subharmonic Correction algorithm to refine the F0 contour for more natural timbre conversion. Through comprehensive objective and subjective evaluations, we demonstrate that VibE-SVC2 provides fine-grained, independent control over two types of singing styles, outperforming existing methods.
[AI-104] LineageMark: Multi-user White-box Watermarking for Contribution Tracing in Model Derivation Chains
链接: https://arxiv.org/abs/2606.17123
作者: Bingxue Zhang,Xiaofeng Xu,Feida Zhu
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 14 pages, 2 figures
Abstract:In open large language model (LLM) ecosystems, models are frequently adapted across multiple domains and applications, forming multi-stage derivation chains. Consequently, tracking and verifying historical contributions is essential for model provenance and intellectual property protection. However, existing watermarking methods are mainly designed for single-user, one-time embeddings, often fail under repeated model derivation and incremental updates. To address this problem, we propose LineageMark, a multi-user white-box watermarking framework for model derivation chains. The framework encodes watermarks in model parameters using a projection-based approach. Stable carriers are first selected to reduce sensitivity to model changes, each watermark bit is then represented as a projection statistic over these carriers. Additional watermark insertions introduce only bounded perturbations in the projection space, and margin constraints are used to maintain signal integrity. We evaluate the effectiveness of LineageMark in multi-stage model derivation chains. Experimental results show that LineageMark preserves contributor watermarks across multi-stage derivation and supports incremental multi-user watermark insertion. Furthermore, it exhibits robustness against perturbations such as re-watermarking, fine-tuning, quantization, and pruning.
[AI-105] TrustErase: Auditable Instant Machine Unlearning with Passport-Embedded Representations
链接: https://arxiv.org/abs/2606.17122
作者: Rutger Hendrix,Leonardo G. Russo,Concetto Spampinato,Matteo Pennisi,Giovanni Bellitto
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The demand for privacy-compliant AI has amplified the need for machine unlearning; yet, existing retraining or distillation-based methods remain unverifiable and computationally costly. We introduce TrustErase, a verifiable, data-free unlearning framework leveraging passport-embedded representations for instant, modular, and auditable forgetting. By treating passports as cryptographic keys within parameter-efficient adaptation layers, TrustErase enables the removal of specific classes or datasets through simple deactivation, without retraining, fine-tuning, or access to the original data. A singular value based decomposition conceals passports within model weights, ensuring that unlearning actions remain transparent and provably compliant. Evaluations on MNIST, CIFAR10 and CIFAR100 show that TrustErase matches or exceeds state-of-the-art benchmarks such as DELETE, L2UL, and Boundary Shrink, while operating in a strictly data-free regime. Ultimately, TrustErase establishes a new paradigm for trustworthy, accountable, and instantly forgettable AI systems.
[AI-106] Graph neural networks at war: integrating cybersecurity and drone intelligence in the Israeli-Iranian conflict
链接: https://arxiv.org/abs/2606.17119
作者: Sozan Sulaiman Maghdid,Tarik Ahmed Rashid,Shavan Askar
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Physical cyber systems have brought about new threats and challenges in detection and immediate response. This study examines how Graph Neural Networks (GNNs) can be used to aid cybersecurity and drone management in a physical cyber system comprising cyber intrusions and unmanned aerial vehicles (UAVs). By providing a bridge between structural understanding of graph neural networks, this work has provided an integrated procedure that allows intrusion detection systems to learn underlying network structures, identify malicious activity, and facilitate drone response measures. Based on an emulation-based case study, cyberattack models were created to provoke the responses of the drones, which showed that graph-based learning can assist with situational awareness, swarm coordination, and adaptive maneuver. According to the performance evaluation, this method has a detection rate of 94.2, average area under the receiver operating characteristic (ROC) of 0.955 and an average response time of 1.4 seconds. Comparative experiments reveal that the proposed GraphSAGE network is more effective than Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs) in the identical situation. Such findings indicate that graph neural networks can be used for intrusion prevention and response in dynamic cyber-physical systems.
[AI-107] MODE: Modality-Decomposed Expert-Level Mixed-Precision Quantization for MoE Multimodal LLMs
链接: https://arxiv.org/abs/2606.17118
作者: Yuanteng Chen,Peisong Wang,Zhilei Liu,Nanxin Zeng,Yuantian Shao,Shiqiang Lang,Tao Liu,Chuangyi Li,Qinghao Hu,Gang Li,Jing Liu,Jian Cheng
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18 pages, 8 figures
Abstract:Mixture-of-Experts Multimodal Large Language Models (MoE-MLLMs) offer remarkable performance but incur prohibitive GPU memory costs, making compression essential. Among PTQ methods, expert-level mixed-precision quantization has proven effective for MoE-LLMs, yet suffers notable degradation on MoE-MLLMs due to two overlooked biases in expert importance estimation. (1) At the cross-modal level, the numerical dominance of vision tokens causes expert selection frequency to be dominated by vision tokens, masking experts that are critical to the text modality; (2) at the intra-vision level, the large proportion of redundant vision tokens further skew frequency statistics, obscuring experts critical for informative visual content. To bridge gaps, we propose MODE, a modality-decomposed expert-level mixed-precision quantization framework for MoE-MLLMs that decomposes expert selection frequency by modality, filters redundant vision tokens to obtain denoised visual frequency, and further evaluates quantization sensitivity per modality as a complementary signal to frequency-based estimation. These signals are integrated into an Integer Linear Programming formulation to assign per-expert bit-widths under a given budget. Extensive experiments show that MODE is particularly well-suited for MoE-MLLMs, limiting average performance loss to within 2.9% at W3A16, with larger gains at the extreme 2-bit setting.
[AI-108] Probing Fusion and Trustworthiness: A Systematic Evaluation of Foundation Model Representations for Multimodal Cancer Analysis
链接: https://arxiv.org/abs/2606.17115
作者: Jingyu Hu,Giuseppe Tripodi,Reed Naidoo,Sarah F. McGough,Tapabrata Chakraborti
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:
Abstract:Foundation models (FMs) have emerged as powerful representation extractors for medical data, yet their generalizability to datasets under distribution shift remains underexplored. This work systematically evaluates FM-based representations on a suite of computational pathology tasks across two real-world commercial cohorts, IH-BC and IH-NSCLC, drawn from the licensed in-house (IH) oncology dataset. The analysis focuses on two modalities, whole-slide images and transcriptomic profiles, drawn from the IH multimodal data. We first benchmark unimodal probing performance across five FMs on eight downstream classification tasks, and find that image and omics representations carry complementary predictive signals. Then we investigate whether multimodal fusion can yield additional gains over unimodal baselines by comparing three image-omics fusion strategies built on paired representations. The trustworthiness of selected unimodal and multimodal pipelines is further assessed through conformal prediction. Our results show that FM representations achieve competitive performance on out-of-distribution data and that multimodal fusion helps mainly when no single modality dominates the signal. Conformal prediction reveals that in the majority of cases where a point prediction fails, the true diagnosis remains recoverable within the prediction set, reinforcing the value of uncertainty-aware inference for clinical support.
[AI-109] An Evaluation of Data Leakage Risks in Tool-Using LLM Agents in Realistic Scenarios
链接: https://arxiv.org/abs/2606.17114
作者: Hankyul Baek,Jaewon Noh,Sang Seo,Yongsu Kim,Gabriel Waikin Loh Matienzo,Young Il Kim,Ee Wei Seah,Akriti Vij
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:AI agents are increasingly being adopted in enterprise and personal settings with access to emails, databases, documents, and other tools where they can read, update, and disseminate sensitive information. Much of prior research on data leakage risks in agents has focused on adversarial data exfiltration through prompt injections and jailbreaks. However, sensitive information may also be exposed during non-adversarial use, creating leakage risks even when users issue benign requests. We report a joint evaluation by the Singapore AI Safety Institute and the Korea AI Safety Institute examining agent data leakage in 12 realistic, non-adversarial tasks spanning customer support, DevOps, web automation, and enterprise and personal productivity. The evaluation covers five risk types: lack of data awareness, audience awareness, policy compliance, data minimization, and access-boundary awareness. Both institutes tested a common set of scenarios mirroring real-world deployments using independent testing environments and task-specific LLM-judge rubrics. Across the three tested agents, none achieved fully correct and fully safe execution across all scenarios. Successful task completion often coincided with data-handling failures such as accessing unnecessary information or disclosing information to inappropriate recipients, indicating that capability and data-handling safety should be evaluated separately. Qualitative review also revealed claim-action mismatches, simulation-aware behavior, user-simulator role reversal, and interpretation gaps in automated judging. Overall, the results indicate that operational data leakage is a first-order agent-safety concern distinct from adversarial exfiltration and provide a methodology for future evaluations of agent data-handling safety. 
[AI-110] Timestamp-Aware Spatio-Temporal Graph Contrastive Learning for Network Intrusion Detection
链接: https://arxiv.org/abs/2606.17109
作者: Jianli Dai,Guangwei Wu,Jiacheng Li,Weiping Wang,An He,Xinjun Xiao
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Given their effectiveness in modeling the relational structure among network traffic flows, graph neural networks (GNNs) have been widely adopted in network intrusion detection systems (NIDSs). However, most existing GNN-based NIDS approaches focus on the relational structure of traffic flows, and treat them as temporally independent, which limits their ability to cope with evolving attack behaviors. Moreover, their reliance on supervised or semi-supervised learning often restricts generalization to unseen attacks. To address these limitations, we propose a novel self-supervised GNN-based framework. To the best of our knowledge, the proposed model is among the first self-supervised GNN-based NIDS models to explicitly leverage real timestamps, which provides faithful temporal dependencies for representation learning. We first construct a series of temporal graphs from network traffic flows according to their timestamps, and then employ an E-GraphSAGE and LSTM based encoder to fully extract temporal information and spatial dependencies of network traffic, without introducing time-costly attention mechanisms. A multi-view graph contrastive learning (GCL) scheme is introduced, where temporal, spatial, and feature contrasts are jointly performed to capture temporal continuity, preserve structural consistency, and improve the generalization and robustness of the learned representations, respectively. In addition, a gradient-norm-based adaptive weighting strategy is designed to optimize the contrastive loss weights. Experimental results on four representative NIDS datasets with real timestamps demonstrate that our method significantly outperforms existing self-supervised approaches and achieves performance comparable to the supervised state-of-the-art GNN method, while maintaining high computational efficiency.
[AI-111] Models Take Notes at Prefill: KV Cache Can Be Editable and Composable
链接: https://arxiv.org/abs/2606.17107
作者: Bojie Li
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Prefix caching reuses prefill only across an exactly shared prefix, so one changed field invalidates the entire downstream cache. Yet overwriting the field’s own key/value vectors and reusing the rest leaves the model acting on the old value. The reason, established causally across four model families: at prefill the model has already written the field-conditioned conclusion onto downstream notes; the field’s own key/value drives under 1% of the decision. Read as a notebook of memoized conclusions, two capabilities follow. (1) It is editable. A salient erratum amends the notes; and with chain-of-thought, editing the field alone recovers the decision (1.00 at 8B, ~1% compute), while without CoT it is ignored. (2) It is composable. The notes are position-portable, so a precompiled skill can be RoPE-repositioned and spliced into any context, indistinguishable from full recompute (logit cosine 0.90-0.999, twelve models) at O(L) rather than O(L^2) time-to-first-token. A unified edit+compose agent stays decision-identical to recompute at up to 14.9x lower latency. The approach applies to any per-token attention KV cache, validated across scale, quantization, Mixture-of-Experts, and multimodal caches, and extends to several attention variants through small adapters. Because the erratum is append-only, it composes with production prefix caching: in an online vLLM benchmark it keeps the prefix cache-aligned (98.5% hit-rate), cutting p90 time-to-first-token by 53-398x.
[AI-112] Prefill/Decode-Aware Evaluation of LLM Inference on Emerging AI Accelerators
链接: https://arxiv.org/abs/2606.17104
作者: Shun Usami,Venkatram Vishwanath,E. Wes Bethel
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 8 pages, 5 figures. Accepted to the Workshop on HPC for AI Foundation Models & LLMs for Science (HPAI4S’26), co-located with IEEE IPDPS 2026. ACM classes: C.4; I.2.7
Abstract:As large language models (LLMs) are increasingly deployed in latency- and cost-sensitive settings, inference efficiency has become a central systems challenge. While GPUs dominate current deployments, a growing number of AI accelerators claim advantages for LLM inference, yet it remains unclear under which conditions such accelerators outperform GPUs in practice. Recent inference systems decompose execution into Prefill and Decode phases, which exhibit distinct computational characteristics and latency metrics, commonly captured by time to first token (TTFT) and time per output token (TPOT). This paper presents a phase-aware evaluation of LLM inference performance across GPUs and emerging AI accelerators using a common model, Llama2-7B. By separately measuring Prefill and Decode performance, we reveal that accelerator advantages differ by phase and metric. Our results show that GPUs consistently excel in the compute-intensive Prefill phase, while GroqRack achieves significantly lower TPOT during Decode (batching not currently supported). However, GPUs regain an advantage in Decode throughput as batch size increases. These findings demonstrate that each platform exhibits distinct phase-dependent strengths. We further analyze heterogeneous Prefill/Decode disaggregation across different accelerator platforms, identifying performance gains and the workload and network conditions under which such gains are realized.
[AI-113] Software Delegation Contracts: Measuring Reviewability in AI Coding-Agent Work
链接: https://arxiv.org/abs/2606.17099
作者: Vincent Schmalbach
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 11 pages; empirical pilot study with 64 coding-agent runs and 192 blinded reviews
Abstract:AI coding agents increasingly accept assigned software tasks, modify repositories under bounded authority, and return work packages for review. Prior work proposed the software delegation contract, covering the task, authority, returned work package, and acceptance context, as the unit of analysis for delegated coding work, but did not measure its effects. This paper reports a controlled pilot study of explicit delegation contracts for coding agents. We built a dependency-free TypeScript API task environment with seeded defects and documentation gaps, authored ten tasks across five families, and ran 64 agent executions across two model tiers under three conditions: a realistic issue-style prompt, an explicit delegation contract, and a contract with a required evidence bundle. Each run was scored with hidden acceptance tests, mutation checks, and scope analysis, then reviewed by three independent condition-blinded model-based reviewers using a fixed rubric, for 192 reviews. Explicit contracts did not improve objective task outcomes: all 64 runs passed hidden acceptance checks, with zero scope violations. They did improve reviewability. Evidence sufficiency improved in 22 of 30 paired comparisons and worsened in none (+0.83 on a 5-point scale, p < 0.0001, Cliff’s delta = 0.66); reviewer ambiguity decreased (p = 0.035); changed-file lists, known-limitations sections, residual-risk sections, and reviewer checklists appeared mostly or only when demanded by the contract. Contracts cost +13% agent tokens and +38% wall-clock time, with larger effects for the weaker model tier. On these small tasks, delegation contracts bought reviewability rather than correctness.
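As a reading aid for the effect size quoted above, Cliff's delta can be sketched as the fraction of cross-sample pairs where the first sample is larger minus the fraction where it is smaller. The function name and the scores below are hypothetical, and the abstract does not say exactly how the authors computed the statistic on their paired comparisons, so this is the textbook definition only.

```python
from typing import Sequence


def cliffs_delta(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Textbook Cliff's delta: (#(x > y) - #(x < y)) / (len(xs) * len(ys))."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    # Compare every value in xs against every value in ys.
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Hypothetical 5-point rubric scores, not data from the paper.
contract_scores = [3, 4, 5]
baseline_scores = [1, 2, 3]
delta = cliffs_delta(contract_scores, baseline_scores)
```

The value ranges from -1 to 1, with 0 meaning neither sample tends to exceed the other.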
[AI-114] ANEForge: Python for direct computation on the Apple Neural Engine
链接: https://arxiv.org/abs/2606.17090
作者: Spencer H. Bryngelson
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Mathematical Software (cs.MS)
备注: 8 pages
Abstract:ANEForge is a Python package that programs the Apple Neural Engine (ANE), the fixed-function neural accelerator on every recent Apple device, directly and without CoreML. In production the engine is reachable only through CoreML, which treats it as a scheduling option: no configuration requires the ANE, and a model can silently run on the CPU or GPU instead. ANEForge compiles a lazy tensor graph, built from 58 fused operators and 19 native bridge operators, into a single ANE program. The program is dispatched through the same ANE daemon and kernel-driver stack as Apple’s internal framework. Beyond inference, the package reaches the engine’s native fused attention, streams int8, int4, and sparse weights, keeps decoder and optimizer state resident across steps, and runs the forward pass, backward pass, and optimizer update of training on the engine. A small fused program completes a call in about 90us, near the engine’s 70us per-program dispatch floor, and a pretrained ResNet-18 forward runs end-to-end in 0.33ms. ResNet-18, a sentence encoder, and a Vision Transformer run end-to-end against framework references, and a Stable Diffusion U-Net validates its forward pass. ANEForge targets Apple Silicon under macOS 14 and later. Each release is verified against a recorded macOS and ANE-compiler version.
[AI-115] ZIVARI-TLBO: A Zero-Cost Inter-Group Evaluated-Elite Relay Mechanism for Teaching-Learning-Based Optimization
链接: https://arxiv.org/abs/2606.17087
作者: Pezhman Zivari
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: 21 pages, 7 figures, 11 tables
Abstract:ZIVARI-TLBO is a grouped Teaching-Learning-Based Optimization (TLBO) method that augments an existing population-state controller with a fixed inter-group evaluated-elite relay. At each scheduled event, every group offers its already evaluated elite to the next group in a fixed ring; the elite replaces the receiver’s worst eligible learner only when its stored objective value is better. Because the exact relay copies an already evaluated solution and its stored fitness, it requires no additional objective-function calls. The frozen gts-v4-cm-fixed implementation is evaluated under equal 10,000-evaluation budgets on eight classical functions at dimensions 10, 30, 50, and 100, with 30 matched seeds, and on five constrained engineering problems. A direct ablation against the same grouped landscape-aware controller without relay records 728/11/221 wins/ties/losses and a rank-biserial effect size of 0.624 across dimensions. In an eight-method multidimensional comparison, WOA obtains the best average rank (2.914) and ZIVARI-TLBO ranks second (3.382); ZIVARI-TLBO significantly outperforms TLBO, MCTLBO, DE, PSO, and GWO, loses significantly to WOA, and is not significantly different from HHO after Holm adjustment. Feasibility-aware engineering results are mixed and sensitive to the current static-penalty formulation. The evidence supports a scoped relay contribution and budget-consistent information-sharing mechanism, but not universal state-of-the-art, global-convergence, engineering-dominance, or CEC superiority claims.
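The relay step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' gts-v4-cm-fixed implementation: it assumes minimization, treats every learner as "eligible", and snapshots all elites before any replacement, none of which the abstract specifies. The names `Learner` and `ring_elite_relay` are hypothetical.

```python
from typing import List, Tuple

# A learner is (solution_vector, stored_objective_value); lower is better (assumed).
Learner = Tuple[List[float], float]


def ring_elite_relay(groups: List[List[Learner]]) -> int:
    """Offer each group's elite to the next group in a fixed ring.

    Only stored objective values are compared and copied, so the objective
    function is never called. Returns the number of replacements made.
    """
    # Assumption: every group offers the elite it held before the relay event.
    elites = [min(group, key=lambda learner: learner[1]) for group in groups]
    replaced = 0
    for i, (solution, value) in enumerate(elites):
        receiver = groups[(i + 1) % len(groups)]
        # Assumption: all learners in the receiver are eligible for replacement.
        worst = max(range(len(receiver)), key=lambda j: receiver[j][1])
        if value < receiver[worst][1]:
            receiver[worst] = (list(solution), value)  # copy solution and stored fitness
            replaced += 1
    return replaced
```

Because the replacement copies the stored fitness along with the solution, the evaluation budget is unchanged by the relay, which matches the "zero-cost" claim in the title.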
[AI-116] ParkingTransformer: LLM-Enhanced End-to-End Trajectory Planning for Autonomous Parking
链接: https://arxiv.org/abs/2606.17082
作者: Hauteng Wu,Xu Li,Dong Kong,Zihang Wang,Xieyuanli Chen,Benwu Wang,Wenkai Zhu
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:End-to-end autonomous parking has emerged as a critical task within the realm of autonomous driving. However, existing methods suffer from black-box characteristics, lacking high-level semantic understanding and interpretability, which impedes the realization of seamless long-distance autonomous parking from the road to the target spot. To address these limitations, we propose ParkingTransformer, a novel framework that leverages multi-view perception and the scene understanding capability of Large Language Models (LLMs). By combining trajectory queries with LLMs’ implicit state features, our method interacts directly with historical information and raw sensor data to output planning trajectories, eliminating the need for dense Bird’s-Eye-View (BEV) representations. To compensate for the inadequate spatial reasoning ability of LLMs, we introduce 3D positional encoding to explicitly inject spatial geometric awareness. Furthermore, a fixed-window streaming mechanism is designed for historical information processing, significantly improving long-term temporal processing efficiency and inference speed. Additionally, a coarse-to-fine decoding strategy is employed to progressively enhance trajectory precision. Extensive closed-loop experiments are conducted on the CARLA simulator and real-world vehicle platforms. The results demonstrate that our method achieves a driving score of 61.32 in the CARLA simulator and an average success rate of 88.70% in real-world experiments, validating the feasibility and effectiveness of the proposed algorithms.
[AI-117] The Price of Anarchy in Disaggregated Inference
链接: https://arxiv.org/abs/2606.17081
作者: Athos Georgiou(NCA)
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Computer Science and Game Theory (cs.GT); Performance (cs.PF)
备注: 38 pages, 7 figures, 8 tables. Measurements on a 3-node NVIDIA B200 cluster running NVIDIA Dynamo v0.9.0
Abstract:Disaggregated inference architectures physically separate prefill and decode phases onto distinct GPU pools, creating competing “agents” that share a fixed hardware budget. We provide, to our knowledge, the first formal game-theoretic analysis of this architecture, using NVIDIA Dynamo as a concrete case study. We model disaggregated serving as three coupled games: a two-player resource game between prefill and decode pools, a selfish caching game over the hierarchical KV cache, and a congestion game with positive externalities for request routing. We empirically validate the latter two; the P/D resource game is treated analytically (Section 9.2). We characterize how GPU saturation induces regime transitions that shift the game’s payoff structure: below saturation, selfish behavior has bounded Price of Anarchy (PoA); at saturation, superlinear latency and cache externalities drive our empirical estimator PoA-hat (defined in Section 6.4) upward. Based on this analysis, we design an adaptive controller that detects saturation transitions in real time and adjusts routing parameters accordingly, shifting from cache-affinity exploitation to load-balanced congestion avoidance. We instantiate our framework on a 3-node NVIDIA B200 cluster running Dynamo with two models, Nemotron-4-340B (TP=8, full-node workers with cross-InfiniBand KV transfers) and Llama-3.1-70B (TP=4), and find the same three-regime PoA-hat structure with the same first post-knee grid point (C=128) on both models. Adaptive routing shifts each model to a better operating point. Our strongest result is on the 70B 1P/5D topology, where PoA-hat drops 3.1x (66.4 to 21.5) in the saturated phase at a 13% throughput cost. On the 70B 1P/2D, PoA-hat drops 2.2x and TTFT P99 drops 7.6x (see Section 8.5).
[AI-118] Surveying GenAI-based Automation in Printed Circuit Board Design and Test
链接: https://arxiv.org/abs/2606.17074
作者: Sahana Srinivasan,Benjamin Turnbull,Hammond Pearce
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: 33 pages, 5 figures, 11 tables. Under review
Abstract:Generative artificial intelligence (GenAI) is increasingly used for applications in the hardware and software domains. It purports to reduce the manual effort involved in the development and testing of complex systems before release. Within the hardware space, most tasks have focused on design automation of integrated circuits, particularly with hardware description languages. However, other types of hardware also exist! In this survey, we instead examine how GenAI has been and is being used across the printed circuit board (PCB) design life cycle. This includes everything from supply chains, system specification, circuit design, layout and optimisation, validation and test, and PCB assembly and distribution. Through this lens we present a taxonomy of discovered works, categorising them according to their intent and contributions. This survey also identifies key technical challenges that GenAI faces in this space, such as domain-specific data scarcity and limited support for integration with existing PCB tools. Finally, future research directions are discussed: our survey shows that there are many opportunities remaining when considering how GenAI may be integrated into various tasks in PCB design and test.
[AI-119] Extracting Semantics: LLM-Guided Automatic Population of Robot Ontology from URDF
链接: https://arxiv.org/abs/2606.17073
作者: Bastien Dussard(LAAS-RIS, LAAS),Guillaume Sarthou(LAAS-RIS, LAAS)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:While commonsense knowledge may suffice for virtual agents, embodied robots interacting with humans require grounded and semantically rich representations of both their environment and their own physical embodiment. In cognitive robotics, ontologies are effective for integrating such heterogeneous knowledge to enable explainable reasoning, even during continuous knowledge updates. Yet, their manual construction remains a bottleneck. We present a preliminary approach for the automatic generation of robot semantic abstractions by transforming Unified Robot Description Format (URDF) models into populated ontologies. Although URDF files provide structural and kinematic descriptions, their identifiers often require commonsense interpretation to recover meaningful semantics, a task at which Large Language Models (LLMs) excel. Our pipeline leverages LLMs to infer semantic relationships by prompting them with concepts from an existing ontology, ensuring the final classification remains aligned with the formal model. To improve reliability, the pipeline combines majority voting across multiple LLM queries along with syntactic and schema-level validation to ensure that generated outputs conform to the expected representation format and ontology constraints. We evaluate the approach on multiple robot descriptions and discuss the generated abstractions. Initial results indicate that the proposed method can effectively bridge the gap between low-level robot descriptions and the structured, grounded knowledge representations required for human-robot interaction.
[AI-120] Towards Distributed Inference of LLMs on a P2P Network
链接: https://arxiv.org/abs/2606.17059
作者: Shabari S Nair,Krishanu Saini
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:
Abstract:Prefix caching can reduce LLM inference latency by reusing KV caches across requests with shared prompts, but cluster-scale reuse is challenging because caches are partitioned across nodes. We propose a decentralized, prefix-cache-aware routing scheme for peer-to-peer LLM serving. Each node maintains a local radix tree of its own cached prefixes and asynchronously refreshed estimates of peer caches using periodic anti-entropy. Requests are routed to the node with the longest estimated prefix match, without centralized coordination or KV-cache transfer. Stale metadata only causes cache misses, not incorrect outputs, making weak consistency sufficient for correctness. Evaluation on simulated MMLU workloads shows that decentralized routing improves latency under low communication delay and skewed prefix distributions, while high network latency and affinity-induced hotspots limit its benefits.
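The routing rule in this abstract (send each request to the node with the longest estimated prefix match) can be sketched as below. This is a simplified illustration under stated assumptions: flat lists of token-ID prefixes stand in for the paper's radix tree, the `cache_view` dictionary stands in for the anti-entropy estimates of peer caches, and falling back to the local node when nothing matches is an assumed tie-break. All names are hypothetical.

```python
from typing import Dict, List, Sequence


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens shared by two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def route(request: Sequence[int],
          cache_view: Dict[str, List[Sequence[int]]],
          local: str) -> str:
    """Pick the node whose (possibly stale) cached prefixes best match the request."""
    best_node, best_len = local, 0  # assumption: serve locally when nothing matches
    for node, prefixes in cache_view.items():
        match = max((common_prefix_len(request, p) for p in prefixes), default=0)
        if match > best_len:
            best_node, best_len = node, match
    return best_node


# Hypothetical view held by node "c" of what each peer has cached.
view = {"a": [[1, 2, 3]], "b": [[1, 2, 3, 4], [9]], "c": []}
target = route([1, 2, 3, 4, 5], view, local="c")
```

If the view is stale, the chosen node simply misses its cache and recomputes the prefill, which is why the abstract argues weak consistency is enough for correctness.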
[AI-121] Querying an astronomical database using large language models: the ALeRCE text-to-SQL system
链接: https://arxiv.org/abs/2606.18108
作者: P.A. Estevez,J.Espejo-Moreira,S. Sanfeliu-Alvarez,F. Forster,A. M. Munoz Arancibia,G. Cabrera-Vives,F. E. Bauer,A. Bayo,M. Catelan,R. Dastidar,L. Hernandez-Garcia,J.A. Intriago,G. Pignata
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
备注:
Abstract:We develop a text-to-SQL (structured query language) system based on large language models (LLMs) using in-context learning and apply it to the Automatic Learning for the Rapid Classification of Events (ALeRCE) astronomical database. ALeRCE is a community broker for the Zwicky Transient Facility and the Vera C. Rubin Observatory. The system enables users to query the database in natural language (NL) and generates executable SQL queries. To develop and evaluate the system, we constructed a dataset of 110 NL/SQL pairs. We propose a step-by-step generation framework comprising four modules: schema linking, query classification, prompt decomposition, and self-correction. The performance of thirteen LLMs is evaluated using in-context learning and prompt engineering techniques. Text-to-SQL performance is assessed using the perfect-match (PM) rate for row identifiers (e.g., object identifiers) and column identifiers (i.e., column names). The proposed step-by-step framework consistently outperforms a direct-inference baseline, while the self-correction module consistently reduces execution errors. For Claude Opus 4.6, PM performance on row (column) identifiers is high for simple queries, reaching 0.97 (0.94), and decreases with query complexity to 0.44 (0.72) for medium queries and 0.59 (0.49) for hard queries. Among the thirteen evaluated models, the best-performing LLMs for the text-to-SQL task are Claude Opus 4.6, Gemini 2.5 Pro, Gemini 3 Flash, and GPT-5.2-Codex.
[AI-122] Symplectic Transversality and Endpoint Green Estimates for Finite-Horizon Pontryagin Systems
Link: https://arxiv.org/abs/2606.17762
Authors: Pyuyi Chufeng Huang,Zikang Song,Xingshu Chen
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
Comments: 20 pages
Abstract:We study horizon-uniform local branches of finite-horizon discrete-time Pontryagin boundary value systems after smooth control elimination. The central input is a two-point endpoint inverse for the linearization. We verify this inverse from scaled stable–unstable boundary transversality, prove the associated endpoint-corrected Green estimate, and combine it with weighted contractions to obtain existence, uniqueness, Lipschitz dependence, and first-order expansions with constants independent of the horizon. The framework covers smooth nonlinear endpoint maps, including the original Pontryagin rows that fix the initial state and couple the terminal costate to the terminal state. Symplectic and Riccati criteria verify the inverse hypothesis at the level of the matrix data; in particular, every stabilizable linear-quadratic system with invertible dynamics and definite weights is covered, including noncommuting coupled data. A numerical section illustrates the certificates and the horizon-uniform first-order expansion.
[AI-123] Feynman Kac Reweighted Schrödinger Bridge Matching for Surface-Based Tau PET Harmonization
Link: https://arxiv.org/abs/2606.17420
Authors: Jianwei Zhang,Xinyu Nie,Jiaxin Yue,Yonggang Shi
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments:
Abstract:Tau PET imaging is central to tracking Alzheimer’s disease progression, but systematic differences between scanners, protocols, and radiotracers across sites introduce nonbiological variability that inflates biomarker variance, reduces sensitivity to disease effects, and can bias downstream clinical assessments. Harmonization methods aim to remove these site-induced shifts while preserving biologically meaningful signal, yet existing approaches struggle when source and target cohorts differ in subgroup composition, risking conflation of site effects with biological variation such as tau-positivity status. We propose the Feynman Kac Reweighted Schrödinger Bridge Matching (FKRSBM) model to address this problem. Rather than routing data through a Gaussian noise prior as in diffusion-based methods, FKRSBM learns a direct stochastic transport process between source and target distributions via entropy-regularized optimal transport. To enforce biologically consistent transport, FKRSBM incorporates a subgroup-aware endpoint proposal derived from a Feynman Kac reweighting of the reference bridge measure, implemented entirely through stratified importance sampling at the data level and requiring no changes to the underlying bridge-matching solver or network architecture. For surface-based neuroimaging, FKRSBM employs a spherical convolutional backbone operating on cortical meshes to perform vertex-level harmonization. We evaluate the method on tau PET SUVR maps, harmonizing PI-2620 data from the HABS-HD cohort into the AV-1451 domain of ADNI. Compared against ComBat, CycleGAN, a diffusion-based method (DF), and unregularized Diffusion Schrödinger Bridge Matching (DSBM), FKRSBM achieves superior distributional alignment, reduced tau-positivity sign mismatch, stronger APOE subgroup alignment, and improved downstream disease classification performance.
[AI-124] Model Validation of Agentic AI Systems: A POMDP-Based Framework for Belief-State Forecast and Policy Validation
Link: https://arxiv.org/abs/2606.17383
Authors: Matthew Francis Dixon
Categories: Risk Management (q-fin.RM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 28 pages, 3 figures, 6 tables. Source code available from this https URL
Abstract:Agentic artificial intelligence systems introduce a new class of model risk. Unlike traditional predictive models, autonomous agents continuously acquire information, form beliefs regarding latent states of the environment, generate forecasts, select actions, and adapt their behavior over time. Existing validation methodologies focus primarily on predictive accuracy and therefore provide limited insight into the quality of the underlying decision process. This paper proposes a model validation framework for agentic AI based on Partially Observable Markov Decision Processes (POMDPs). The framework decomposes autonomous decision making into information, beliefs, forecasts, actions, and utility, allowing each component to be validated independently. Large language models (LLMs) are formalized as approximate Bayesian filtering operators, and a model-risk taxonomy is developed encompassing state-space, filtering, forecast, policy, utility-specification, and parameter risks. The model risk validation methodology is demonstrated through a portfolio-management case study in which an agent infers latent market regimes from market and macroeconomic information, generates belief-conditioned forecasts, and constructs portfolios using a Black–Litterman framework. Empirical validation combines performance analysis, belief calibration diagnostics, coverage tests, ablation studies, and parameter-sensitivity analysis. The results indicate that latent-state inference contributes independently to decision quality and that the principal conclusions remain robust across a broad range of parameter values. The principal contribution of the paper is a practical framework for extending established model risk management concepts to autonomous AI systems and providing a rigorous foundation for their validation, governance, and monitoring.
[AI-125] Physics-Informed Attention Mechanism and Generalization Capability of Deep Learning-Based Grain Growth Evolution Prediction
Link: https://arxiv.org/abs/2606.17235
Authors: Pungponhavoan Tep,Marc Bernacki
Categories: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
Comments:
Abstract:Machine Learning (ML) models for grain growth prediction are typically trained on idealized synthetic data, yet practical applications require generalization to conditions outside the training distribution. This study evaluated the Out-Of-Distribution (OOD) generalization capability of the trained model from our previous study across three test cases, including experimental microstructures, microstructures characterized by a bimodal grain size distribution, and abnormal grain growth. To further probe whether physics-informed architectural design could improve robustness under these different conditions, a boundary-masked attention mechanism was proposed specifically for grain growth, constraining attention to grain boundary pixels. Both the baseline and the proposed physics-informed attention model were evaluated without retraining or fine-tuning on the OOD data. Both models successfully generalized to all three test cases, yet the boundary-masked attention mechanism provided substantial improvements, with the most notable gains for microstructures characterized by a bimodal grain size distribution, where Structural Similarity Index Measure (SSIM) improved from 0.6221 to 0.7609 and mean grain size ($\overline{R}$) error decreased from 8.75% to 3.57%. The attention heatmap analysis revealed that the boundary-masked attention model learned to concentrate attention on large grain boundaries in a manner consistent with curvature-driven grain growth physics, emerging from training without being explicitly encoded into the architecture. These results indicate that models trained on synthetic data can generalize to diverse OOD conditions without retraining, and that physics-informed attention may improve accuracy when the boundary morphology matches the training domain.
[AI-126] Statistical Foundations of LLM-based A/B Testing: A Surrogacy Framework for Human Causal Inference
Link: https://arxiv.org/abs/2606.17165
Authors: Joel Persson,Mårten Schultzberg,Sebastian Ankargren
Categories: Methodology (stat.ME); Artificial Intelligence (cs.AI); Econometrics (econ.EM); Statistics Theory (math.ST)
Comments:
Abstract:Organizations and researchers show increasing interest in using large language models (LLMs) in place of human participants in A/B tests, in the hope of experimenting faster and at lower cost. We study when a treatment effect estimated on LLM outcomes recovers the effect that would have been measured on the human population of interest. Distributional equivalence between LLM and human outcomes would make any standard estimator valid but is unrealistic. We therefore develop a statistical framework that adapts surrogate endpoint theory to LLMs. The framework shows that calibrating LLM outcomes to human outcomes identifies the average treatment effect under surrogacy and comparability conditions that are jointly weaker than distributional equivalence. When these conditions fail, the effect of interest is only partially identified, and we provide diagnostics that can falsify surrogacy on historical experiments together with a bound on the worst-case bias from limited overlap. We further show that the stochasticity inherent to LLMs introduces both bias and variance, but using an average of multiple draws as the surrogate mitigates both. We illustrate the methods and theory in simulations and an application to A/B tests on Upworthy headlines. A central takeaway from our work is that the validity of LLM outcomes as surrogates can only be falsified for past treatments and never verified for new ones, so human experiments remain indispensable for novel interventions. We discuss the role of LLM choice, prompting, and temperature as design variables, and how to size human experiments for validation.
[AI-127] Agentic Discovery of Non-Canonical Antimicrobial Peptides with AMPGAN v3 ICML2026
Link: https://arxiv.org/abs/2606.17127
Authors: Jay Jung,Xiaohan Zhang,Shenghan Song,Mahmoud Sayedahmed,Chijian Xiang,Yunong Xu,Ahmed AbdelKhalek,Severin T. Schneebeli,Matthew J. Wargo,Jianing Li,Safwan Wshah
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Presented at the GenBio Workshop, ICML 2026
Abstract:Antimicrobial resistance causes over a million deaths annually. Antimicrobial peptides (AMPs) are a promising solution, but generative AMP models are not yet ready to design peptides with non-natural amino acids and/or chemical modifications, which are essential for real-world peptide drugs. We present AMPGAN v3, a multi-objective conditional GAN that expands the generative vocabulary to D-amino acids and N/C-terminus modifications such as amidation. By separating adversarial and activity-aware supervision across two specialized discriminators, AMPGAN v3 substantially improves training stability and outperforms prior generative AMP models on external classifiers. We validated five candidates spanning three structural classes in vitro; two showed activity against Gram-positive strains, with the best candidate reaching an MIC of 8 µg/mL against B. subtilis. To support downstream curation, we further present PepCraft, a multi-agent framework for end-to-end AMP discovery in which a Planning Agent orchestrates specialized executors for generation, filtering, and verification. Its prioritization recommendations align with our in vitro outcomes. Together, these contributions let us examine, on a small but real scale, how generative and agentic AI compose in therapeutic peptide discovery. Code: this https URL
[AI-128] Quantum Cinema: An Interactive Cinematic Exploration of Quantum Computing Hardware via Generative World Models
Link: https://arxiv.org/abs/2606.17102
Authors: Aoyu Zhang,Dongping Liu,Luyao Zhang
Categories: Popular Physics (physics.pop-ph); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
Comments:
Abstract:Quantum computing promises transformative advances across science and industry, yet the physical hardware that enables these computations remains invisible to the public: quantum processors operate inside sealed dilution refrigerators at temperatures near absolute zero, making direct observation impossible. This “imagination gap” between quantum computing’s growing societal impact and the public’s ability to visualize it represents a significant barrier to quantum literacy and workforce development. We present Quantum Cinema, an open-source, browser-based interactive application that closes this gap by transforming invisible quantum hardware into explorable, cinematic experiences using generative world models. Quantum Cinema guides users through a four-act narrative – from the foundational Nobel Prize-winning science of quantum entanglement, through curated video introductions to three major quantum computing architectures (trapped-ion, neutral-atom, and superconducting systems), into immersive three-dimensional generative worlds that make invisible quantum phenomena observable, and finally to interactive radar-chart comparisons grounded in real quantum device specifications. All three-dimensional environments are generated using WorldLabs’ generative world model platform and are scientifically grounded in curated metrics from Amazon Web Services (AWS) Braket quantum hardware. Quantum Cinema requires no installation, no specialized hardware, and no quantum computing background. It is designed to serve two distinct communities: scholars and developers seeking to replicate or extend the platform, and educators, researchers, and science communicators seeking an intuitive tool for explaining quantum hardware to diverse audiences. This paper describes the system architecture, the generative world model pipeline, use cases for both communities, and directions for future work.
[AI-129] Comprehensive pKa Data Augmentation from Limited Real Data through an Engineered Models-Quantum Framework
Link: https://arxiv.org/abs/2606.17077
Authors: Wang Rui,Liu Dinghao
Categories: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Proton dissociation constants (pKa) are critical for functional molecule discovery and molecular modeling. Building on iBonD, the largest experimental pKa database established, we and other researchers have developed several methods including machine-learning-based empirical prediction and high-accuracy energy calculations. Despite this foundation, the rapid augmentation of high-quality pKa data remains fundamentally constrained. As part of this work, we performed large-scale regression-based pKa prediction on unlabeled molecular datasets using a collection of extensively optimized machine-learning models. The results indicate that, owing to the feature distributions of unlabeled molecular datasets, the pKa data distribution approximates normality, with extreme scarcity of tail-region samples. Although such augmentation is highly valuable for improving overall data availability and predictive modeling, it remains insufficient for efficiently discovering molecules with broad-spectrum pKa properties. To address this, we explore the targeted generation of molecules with sparse pKa properties from the vast chemical space. Given that traditional continuous latent space VAE-RNN methods for molecular generation suffer from insufficient stability and fail to demonstrate clear advantages in complementing sparse data, we design and implement a quantum-assisted sparse-pKa molecular generation method. Feasibility is validated on a simulated quantum annealer, and superior extreme-value sampling is further achieved on physical coherent Ising machines (CIMs). (to be continued)
[AI-130] CMIP-Forge: An Agentic System that Retrieves Computes and Self-Reviews Climate Science
Link: https://arxiv.org/abs/2606.17076
Authors: Dmitrii Pantiukhin,Boris Shapkin,Ivan Kuznetsov,Thomas Jung,Nikolay Koldunov
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI)
Comments: 28 pages, 9 figures. Code available at this https URL
Abstract:The Coupled Model Intercomparison Project Phase 6 (CMIP6) has generated thousands of peer-reviewed publications documenting model configurations, evaluation procedures, emergent constraints, and projection uncertainties. As the community transitions toward CMIP7, efficiently extracting and operationalizing this unstructured knowledge alongside live data analysis represents a critical bottleneck. Here we present CMIP-Forge, a hybrid retrieval-augmented generation (RAG) and autonomous analysis system that bridges the gap between scientific literature and Earth System Grid Federation (ESGF) data archives. The system pairs a curated corpus of 6,581 CMIP6-related open-access publications (101,828 indexed chunks) with an agentic pipeline in which a tool-augmented worker plans and executes Python workflows over live climate data, while a panel of independent reviewer models audits its methodology end to end. CMIP-Forge introduces a multi-layered Defense-in-Depth architecture that enforces physical and methodological invariants through executable mechanisms: Abstract Syntax Tree (AST) static analysis, audited scientific primitives, and an autonomous adversarial peer-review protocol. We demonstrate the system’s capabilities through end-to-end autonomous research pipelines spanning atmospheric teleconnections, ocean dynamics, regional extremes, and global warming projections. An agentic analysis system grounded in peer-reviewed literature, constrained by automated code guardrails, and audited by an independent adversarial review loop can complete complex climate-research workflows autonomously. The same experiments expose concrete failure modes of the review loop (sycophantic regression, REVISE verdicts that are never resolved, and the submission of stub code for review), each diagnosable from the immutable telemetry and provenance record released with the article.
[AI-131] KFTD: Koopman-Fourier Time-Differentiable Network for Continuous Ocean Spatiotemporal Forecasting
Link: https://arxiv.org/abs/2606.17070
Authors: Qinghui Chen,Zekai Zhang,Hailong Liu,Jinglin Zhang,Cong Bai
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Accurate oceanic forecasting is critical for climate monitoring and disaster early warning. However, ocean spatiotemporal forecasting faces the dual challenges of modeling complex dynamical systems and ensuring computational efficiency. We present the Koopman Fourier Time-Differentiable (KFTD) Network, a time-continuous two-stage paradigm that decouples interpolation from prediction to achieve efficient and scalable spatiotemporal modeling. We map complex nonlinear dynamics into the Koopman linear space and exploit Fourier analysis to enable continuous-time interpolation at arbitrary sub-steps. A lightweight residual network consumes the high-fidelity intermediate states to yield the final forecast. Unlike diffusion models, KFTD eliminates multi-step noise sampling and directly evolves the system in continuous time, yielding a 4× computational speedup. We further introduce a DPP Loss that supports arbitrary PDE constraints in an end-to-end manner, breaking the physical consistency bottleneck of pure data-driven approaches. Empirical results on four ocean datasets confirm that our continuous-time framework reduces MSE by an average of 5.6% (up to 12.7% for SST) and improves efficiency over MCVD by 76.25%.
[AI-132] PIVOT: Bridging Black-Scholes Implied-Volatility and Price Objectives via Differentiable Jäckel Operator
Link: https://arxiv.org/abs/2606.17065
Authors: Raeid Saqur,Yannick Limmer,Anastasis Kratsios,Blanka Horvath,Hans Buehler
Categories: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 30 pages, 17 figures, 12 tables
Abstract:Modern option-learning systems operate in two coordinates: price space, where markets quote and no-arbitrage constraints are most naturally enforced, and implied volatility (IV) space, where volatility surfaces are smoothed, regularized, and evaluated. The bottleneck is interface, not approximation: Jäckel’s seminal “Let’s Be Rational” (LBR) solver already inverts the Black-Scholes price to machine precision efficiently. What is missing is a differentiable layer that preserves LBR in the forward pass and avoids backpropagating through its branch logic. Such a layer must also confront the unavoidable singularity of the inverse map in the low-vega regime, where the sensitivity 1/vega diverges as vega → 0. We close this gap with PIVOT, the Price-Implied-Volatility Objective Translator. PIVOT keeps the LBR forward pass intact and supplies the backward pass by implicit differentiation through the smooth Black-Scholes/Black-76 price map, with an explicit gating contract: invalid domains return NaN, well-conditioned rows receive the exact 1/vega gradient, and low-vega rows are attenuated rather than silently regularized. On a single H100, a fused Triton kernel reaches 1.79e9 IV/s at machine precision (9.3e-14 max relative error vs. the reference C solver); end-to-end label generation sustains 48.9M/s on synthetic chains and 16.6M/s on SPX OptionMetrics. In a HyperIV-style one-day reproduction on SPX, PIVOT-augmented objectives Pareto-dominate the baselines, reducing held-out price MAE by up to 43.4% and the strongest three-seed gated objective improving price MAE by 38.8% and IV MAE by 21.3% jointly; cross-asset results on RUT, VIX, and NDX show directional price-MAE gains of 40.1%, 24.2%, and 16.7%, while an ungated IV-roundtrip control collapses to a degenerate near-zero surface, confirming the gate as a correctness contract rather than a tuning knob.
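The exact 1/vega gradient mentioned in the PIVOT abstract follows from the implicit function theorem. The short derivation below is a sketch under simplifying assumptions that are not taken from the paper: a European call under Black-Scholes with no dividends, spot $S$, strike $K$, rate $r$, maturity $T$, and standard normal density $\varphi$. The symbol $C$ for the price map and $\sigma^{*}$ for the implied volatility are notation introduced here.

```latex
% Implied volatility \sigma^*(P) is defined implicitly by the price map C:
\[
  C\bigl(\sigma^{*}(P)\bigr) = P .
\]
% Differentiating both sides with respect to the quoted price P:
\[
  \frac{\partial C}{\partial \sigma}\bigg|_{\sigma^{*}} \cdot \frac{d\sigma^{*}}{dP} = 1
  \quad\Longrightarrow\quad
  \frac{d\sigma^{*}}{dP} = \frac{1}{\mathrm{vega}},
  \qquad
  \mathrm{vega} = \frac{\partial C}{\partial \sigma} = S\,\varphi(d_1)\,\sqrt{T},
\]
\[
  d_1 = \frac{\ln(S/K) + \bigl(r + \tfrac{1}{2}\sigma^{2}\bigr)T}{\sigma\sqrt{T}} .
\]
% For deep in- or out-of-the-money options |d_1| is large, so \varphi(d_1) \to 0
% and the gradient 1/vega diverges.
```

The backward pass therefore needs only the forward-pass solution $\sigma^{*}$ and one vega evaluation, with no differentiation through the solver's internal branches. The divergence as vega → 0 is the low-vega regime that the abstract says is handled by gating.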
Machine Learning
[LG-0] Sign-Rank Index and List Replicability: Connections and Separations
Link: https://arxiv.org/abs/2606.18236
Authors: Ari Blondal,Hamed Hatami,Pooya Hatami,Chavdar Lalov,Sivan Tretiak
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)
Comments: 29 pages, 1 figure
Abstract:In learning theory, the sign rank of a binary concept class captures the smallest dimension in which it can be represented by points and halfspaces. Despite tremendous interest, lower bounds on sign rank are notoriously difficult to come by. Two recent approaches to the problem establish lower bounds on sign rank by measures that are easier to analyze: the $\mathbb{Z}_2$-index and the list replicability number. We order these measures, showing that the $\mathbb{Z}_2$-index is upper-bounded by a linear function of the list replicability number. As a main consequence, we obtain a strong separation between sign rank and $\mathbb{Z}_2$-index, thereby resolving a question of Frick, Hosseini, and Vasileuski. This motivates a thorough study of list replicability, the stronger of the two lower-bounding measures. We establish upper bounds on the list replicability number by two combinatorial measures: height and minimum star number. We also prove a fundamental composition result, showing that the product of two concept classes has list replicability number bounded by the sum of the list replicability numbers of the two classes.
[LG-1] Rethinking Dataset Distillation for Classification: Do Distilled Sets Outperform Coresets?
Link: https://arxiv.org/abs/2606.18209
Authors: Trisha Mittal,Akshay Mehra,Joshua Kimball
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Dataset distillation (DD) has emerged as a prominent approach in data-centric machine learning, aiming to synthesize compact training sets for efficient training by compressing the information in large datasets into a small number of synthetic samples. However, DD methods are often evaluated under inconsistent evaluation protocols, ranging from standard ERM to single/multi-teacher supervision, making it difficult to isolate the effectiveness of distilled data from evaluation. Moreover, many prior methods claim that DD outperforms data pruning approaches such as coreset selection (CS), based on the assumption that restricting condensed datasets to subsets of real samples fundamentally limits their expressiveness. In this work, we critically evaluate DD methods through large-scale experiments using standardized datasets and evaluation protocols to assess their intrinsic effectiveness. We benchmark seven state-of-the-art (SOTA) DD methods on ImageNet-1K, ImageNet100, and ImageNette, using three widely adopted training protocols against three CS strategies. Our results show that while some DD methods fail to outperform even simple random subsets, the SOTA DD approaches are comparable to or worse than coresets on large-scale datasets and incur a substantially higher cost for construction. Beyond accuracy, we also evaluate the representativeness, diversity, and quality of condensed sets, and find that coresets consistently achieve better coverage of the original data distribution. These findings highlight the limited practical advantages of current DD methods and show that coresets remain competitive and are often a more computationally efficient alternative for data-centric learning.
[LG-2] Multi-Source Cybersecurity Logs: An ATT&CK-Labeled Dataset and SLM Evaluation
Link: https://arxiv.org/abs/2606.18190
Authors: Abir Ashab Niloy,Ahmed Ryan,Imamul Hossain Rafi,Md Erfan,Md Rayhanur Rahman
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:Multi-stage cyberattacks span system, network, and browser logs. Detecting them requires correlating events across all three sources. Machine learning methods can learn these cross-source patterns, but they need labeled multi-source data. Existing public datasets fall short. Network-only datasets such as CICIDS and UNSW-NB15 miss host and browser activity. Host-focused datasets such as LMDG and CICAPT-IIoT lack browser telemetry. ATLAS includes all three sources but labels events only as malicious or benign, without MITRE Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) technique granularity. No public dataset combines all three sources with per-entry ATT&CK technique labels. We close the gap by building a multi-source log dataset of 870 sessions (70 attack, 800 benign) and approximately 2.3 million events. We captured system, network, and browser activity simultaneously on Windows endpoints. We labeled malicious events with ATT&CK technique IDs, covering 12 tactics and 53 techniques. We generated all attack data using real tools, including Remote Access Trojan (RAT), Command and Control (C2) tunnels, and cloud exfiltration. To demonstrate learnability, we fine-tuned three Small Language Models (SLMs) (Qwen2.5-1.5B, Llama-3.2-3B, Phi-4-Mini) using Low-Rank Adaptation (LoRA). We compared each against its base variant across ten metrics on two tasks: chunk classification and ATT&CK technique identification. Fine-tuning improved every model on every metric. Chunk classification accuracy rose from approximately 8% in the base variants to between 90% and 97% after fine-tuning. Technique identification remained challenging, with the best exact-match accuracy at 42%, although high partial-match scores show the models captured most of the underlying reasoning.
[LG-3] A Convex Quasilinearization Method for Solving Nonlinear PDEs with Physics-Informed Neural Networks
Link: https://arxiv.org/abs/2606.18175
Authors: Gbenga T. Awojinrin,Abdul-Akeem Olawoyin,Rami M. Younis
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: Preprint. 56 pages, 18 figures. Code: this https URL
Abstract:We present a numerical method for the forward solution of nonlinear partial differential equations (PDEs) in which Bellman-Kalaba quasilinearization reduces the nonlinear problem to a sequence of linear subproblems, each discretized by collocation onto a trial space that is linear in its parameters and solved by a single direct linear least-squares QR factorization. The trial space, which we term Linear-in-Learnables (LiL), comprises representations whose trainable parameters enter linearly, including random-feature extreme learning machines, spectral polynomial bases, and trigonometric expansions, each implemented as a physics-informed neural network. The method thus replaces the nonconvex gradient-based training that limits standard PINNs with a convex per-step solve. We establish local Newton-Kantorovich convergence of the outer iteration to a residual-limited neighborhood under an explicit smallness condition, with the limiting accuracy governed by the best-approximation residual of the trial space rather than by an optimization tolerance. The method, denoted LiL-Q, is assessed on seven benchmarks spanning scalar nonlinear PDEs (Bratu, viscous Burgers, Buckley-Leverett), coupled systems (plane-strain elasticity and the incompressible Navier-Stokes equations in two and three spatial dimensions), and steady-state Darcy flow with heterogeneous permeability. Across these problems, LiL-Q converges in single-digit outer iterations in most cases, even at the coarsest basis sizes and independent of the parameter count. When the exact solution lies in the span of the trial space, the method recovers it to machine precision in a single solve. On the Navier-Stokes benchmarks, it matches or exceeds published PINN solvers with up to two orders of magnitude fewer trainable parameters, without gradient-based optimization.
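To make the quasilinearization step in this abstract concrete, the sketch below works it through for a one-dimensional Bratu problem. The 1D setting, the basis functions $\phi_j$, the coefficients $c_j$, and the collocation points $x_i$ are illustrative assumptions and are not taken from the paper, which may use a different formulation.

```latex
% Nonlinear problem (1D Bratu, assumed for illustration):
\[
  u''(x) + \lambda e^{u(x)} = 0, \qquad u(0) = u(1) = 0 .
\]
% Bellman-Kalaba quasilinearization: linearize e^{u} about the current iterate u_k,
%   e^{u_{k+1}} \approx e^{u_k} + e^{u_k}\,(u_{k+1} - u_k),
% which gives a problem that is linear in u_{k+1}:
\[
  u_{k+1}'' + \lambda e^{u_k}\, u_{k+1} = \lambda e^{u_k}\,(u_k - 1) .
\]
% With a trial space that is linear in its parameters, u_{k+1}(x) = \sum_j c_j \phi_j(x),
% collocation at points x_i yields a linear least-squares problem for c:
\[
  \min_{c}\ \bigl\| A^{(k)} c - b^{(k)} \bigr\|_2^2, \qquad
  A^{(k)}_{ij} = \phi_j''(x_i) + \lambda e^{u_k(x_i)}\,\phi_j(x_i), \qquad
  b^{(k)}_i = \lambda e^{u_k(x_i)}\bigl(u_k(x_i) - 1\bigr),
\]
% with additional rows \sum_j c_j \phi_j(0) = 0 and \sum_j c_j \phi_j(1) = 0
% enforcing the boundary conditions.
```

Each outer iteration is then a convex solve, for example by QR factorization as the abstract describes, and only the matrix $A^{(k)}$ and right-hand side $b^{(k)}$ change from one iteration to the next.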
[LG-4] Evaluating Open-Source LLMs for Multi-Label ATT&CK Technique Classification on CTI Reports
Link: https://arxiv.org/abs/2606.18166
Authors: Ahmed Ryan,Saad Sakib Noor,Md Erfan,Shaswata Mitra,Sudip Mittal,Md Rayhanur Rahman
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:Classifying Cyber Threat Intelligence (CTI) using MITRE Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) is essential for proactive defense, but historically required extensive human effort. Pre-Large Language Model (LLM) automation sped up this process, but could not resolve the complex language and multi-step attack patterns found in unstructured CTI reports. LLMs addressed previous limitations by using contextual reasoning to understand unstructured text. However, current evaluations rely on simplified, single-technique sentences that ignore the complexity of real-world CTI reports, which often leads to inflated performance results. Consequently, the baseline performance of open-source LLMs on complex unstructured CTI reports remains unevaluated. To address this gap, we constructed a ground-truth dataset of 2,076 human-annotated sentences (1,281 technique-positive, 795 negative) from 83 complex unstructured CTI reports. These sentences were mapped to 114 unique ATT&CK techniques using a six-phase annotation process, achieving $\kappa = 0.68$ inter-annotator agreement. Using this dataset, we evaluated seven open-source LLMs ranging from 8B to 236B parameters across prompt strategy and temperature configurations. The highest-performing LLM achieved a micro-averaged F1 score of 0.22, establishing the empirical baseline for multi-label ATT&CK classification on complex unstructured CTI. Parameter size showed a statistically significant positive correlation with F1 score. Prompt strategy and temperature produced no statistically significant gains across model configurations. These results indicate that current open-source LLMs are insufficient for production-grade ATT&CK classification. The dataset, benchmark, and findings provide a reproducible foundation for future CTI research.
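For readers unfamiliar with the headline metric, here is a minimal sketch of micro-averaged F1 for multi-label classification. It pools true positives, false positives, and false negatives over all sentences before computing a single score. The function name `micro_f1` and the sample labels are illustrative only, and the paper's evaluation code may differ in detail.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 over per-sentence label sets.

    gold[i] and pred[i] are the annotated and predicted technique IDs for
    sentence i; an empty set means "no technique".
    """
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # labels predicted and correct
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # labels predicted but wrong
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # labels missed
    denom = 2 * tp + fp + fn
    # F1 = 2*TP / (2*TP + FP + FN); defined here as 0.0 when nothing is labeled or predicted.
    return 2 * tp / denom if denom else 0.0


# Illustrative data only: three sentences, one of them technique-negative.
gold = [{"T1059", "T1566"}, set(), {"T1003"}]
pred = [{"T1059"}, {"T1027"}, {"T1003", "T1055"}]
score = micro_f1(gold, pred)  # TP=2, FP=2, FN=1
```

Because counts are pooled before averaging, frequent techniques dominate the score, and a spurious prediction on a technique-negative sentence counts as a false positive just like any other wrong label.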
[LG-5] Deep Reinforcement Learning for Minimum Zero-Forcing Sets
链接: https://arxiv.org/abs/2606.18106
作者: Steve Halley,Maurício Gruppi
类目: Machine Learning (cs.LG)
*备注:
Abstract:This paper explores the problem of finding the minimum zero-forcing set on undirected graphs and proposes an adapted machine-learning framework to solve the problem. The minimum zero-forcing set problem is a graph coloring problem where the color of an initial set of nodes propagates throughout a network. The set of nodes is zero-forcing if it forces all uncolored nodes to change color under the constraint of the color-change rule. There are several applications to this problem across different domains such as network science, network control, and designing logical circuits. Finding the minimum zero-forcing set is shown to be NP-hard. We propose a reinforcement learning framework, SD-ZFS, that adapts the S2V-DQN architecture to the ZFS problem. We train several models on this adapted framework and analyze the performance across graph datasets that have varying structures. We evaluate how the models trained on the framework generalize, scale, and transfer to different network types. The results demonstrate the effectiveness of the framework when compared against the optimal solution and greedy heuristic. We provide further insight into how the ZFS problem can be solved through machine-learning and the influence of network structure on the problem.
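To make the color-change rule concrete, here is a minimal sketch in plain Python. It uses the standard zero-forcing rule (a colored node with exactly one uncolored neighbor forces that neighbor to become colored) and a brute-force search for the minimum set. It is not the SD-ZFS framework from the paper; the function names and the adjacency-dict graph format are assumptions made for illustration.

```python
from itertools import combinations


def zero_forcing_closure(adj, initial):
    """Apply the color-change rule until no node can force another."""
    colored = set(initial)
    changed = True
    while changed:
        changed = False
        for v in list(colored):
            uncolored = [u for u in adj[v] if u not in colored]
            # A colored node forces its neighbor only if that neighbor
            # is its single remaining uncolored neighbor.
            if len(uncolored) == 1:
                colored.add(uncolored[0])
                changed = True
    return colored


def is_zero_forcing(adj, initial):
    """True if the initial set eventually colors every node."""
    return zero_forcing_closure(adj, initial) == set(adj)


def min_zero_forcing_size(adj):
    """Exhaustive search; exponential time, only usable on tiny graphs."""
    nodes = list(adj)
    for k in range(1, len(nodes) + 1):
        for subset in combinations(nodes, k):
            if is_zero_forcing(adj, subset):
                return k
    return len(nodes)


# Hypothetical toy graphs: a path and a cycle on four nodes.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
```

On the path, an endpoint alone is zero-forcing, while on the cycle two adjacent nodes are needed. The exhaustive search is what makes learned or greedy heuristics attractive on larger graphs.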
[LG-6] OmniPlan: An Adaptive Framework for Timely and Near-Optimal Network Planning Optimization
链接: https://arxiv.org/abs/2606.18105
作者: Longlong Zhu,Jiashuo Yu,Zedi Chen,Yuhan Wu,Zhifan Jiang,Yuchen Xian,Yimeng Liu,Jiajie Su,Shaopeng Zhou,Xingyuan Li,Hongyan Liu,Xuan Liu,Dong Zhang,Chunming Wu,Xiang Chen
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注:
Abstract:Network planning optimization is a fundamental problem across diverse domains, including transportation systems, communication networks, and power grids. It requires simultaneous optimization of multiple competing objectives under complex constraints. Existing network planning optimization frameworks rely on mixed integer programming (MIP) solvers, heuristics, and deep reinforcement learning (DRL) models to compute planning decisions. However, they lack effective adaptability to diverse and dynamic user intents, thus leading to the trade-off between execution time and optimality. In this paper, we propose OmniPlan, an adaptive framework that achieves both timeliness and near-optimality in network planning optimization. To achieve the adaptability lacking in existing solutions, OmniPlan employs a large language model (LLM)-based interpreter to convert heterogeneous natural-language intents into a unified and quantifiable user-preference vector. Then it employs a mixture-of-experts architecture that integrates MIP solvers, heuristics, and DRL models as specialized experts, where OmniPlan adapts to diverse intents by dynamically selecting timely and near-optimal experts. Finally, it incorporates a DRL-based expert configuration module that fine-tunes optimization objective weights to align planning decisions with user-specific preferences. We evaluate OmniPlan with a representative real-world workload, i.e., distributed machine learning (ML), where we leverage OmniPlan to offload a wide spectrum of ML inference tasks, e.g., decision trees, SVM, naive Bayes, XGBoost, and random forests, onto a network of hardware devices. Our experiments on a real-world testbed indicate that OmniPlan achieves near-optimal and low-execution-time offloading for real-world ML inference tasks, reducing latency by up to 97.8% and network device resource consumption by up to 11.5%.
[LG-7] From Reasoning Traces to Reusable Modules: Understanding Compositional Generalization in Language Model Reasoning ICML2026
链接: https://arxiv.org/abs/2606.18089
作者: Lingjing Kong,Xin Liu,Guangyi Chen,Martin Q. Ma,Xiangchen Song,Yuekai Sun,Mikhail Yurochkin,Taylor W. Killian,Ruslan Salakhutdinov,Kun Zhang,Eric P. Xing,Zhengzhong Liu
类目: Machine Learning (cs.LG)
*备注: ICML2026
Abstract:Post-training pipelines that combine supervised fine-tuning (SFT) with reinforcement learning (RL) have emerged as the key recipe for transforming large language models (LLMs) into robust reasoners. We argue that this combined success is driven by compositional generalization, which we formalize through a hierarchical latent selection model. In this framework, reasoning traces are generated by a cascade of discrete latent selection variables corresponding to reusable atomic modules, including both skills (local operations) and routing mechanisms (how intermediate information is selected, reused, and composed). Within this model, we theoretically show that SFT and RL play asymmetric, complementary roles: SFT supplies the raw module materials in compositional traces, and RL decomposes those traces to identify the latent atomic modules and enable compositional generalization. We design controlled experiments to validate this theory. Our results demonstrate that RL can extract atomic modules from compound traces supplied by SFT and recombine them to solve new configurations. Moreover, we find that training on compound traces yields stronger generalization than training on isolated atomic modules. Finally, we investigate the relationship between SFT and RL data and identify an effective protocol in which SFT ensures coverage of all atomic modules through compositional traces, while RL focuses on novel compositions outside the SFT support to drive exploration.
[LG-8] Edge Flow: A Tractable and Predictive Continuous-Time Model for Gradient Descent at the Edge of Stability
链接: https://arxiv.org/abs/2606.18080
作者: Pierre Marion
类目: Machine Learning (cs.LG)
*备注: 24 pages, 13 figures
Abstract:Gradient descent in deep learning may operate at the edge of stability (EoS), a regime in which the largest eigenvalue of the loss Hessian hovers near the stability threshold 2/\eta , where \eta is the learning rate. Classical analysis tools such as gradient flow and the descent lemma do not apply here, motivating the search for a continuous-time model valid at EoS. We propose Edge Flow, a system of three coupled ordinary differential equations that provides a tractable, faithful, and predictive model of gradient descent dynamics at EoS. Edge Flow decomposes the dynamics into a center, an oscillation direction, and an oscillation magnitude. The center follows a modified gradient flow on a symmetrized loss; the direction tracks a top eigenvector of the Hessian via Rayleigh quotient dynamics; and the magnitude grows or decays exponentially depending on whether the sharpness exceeds or falls below the threshold 2/\eta . Crucially, sharpness stabilization emerges from the coupled dynamics via a self-stabilization feedback loop. Discretizing Edge Flow only requires two gradient evaluations and one Hessian–vector product at each iteration. We demonstrate empirically that Edge Flow tracks the dynamics of gradient descent at least as faithfully as previously proposed continuous-time EoS models, while in addition resolving the oscillation of the sharpness at the onset of EoS, and that it provides a principled framework for understanding and mitigating instabilities in this regime.
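The role of the threshold 2/\eta can be seen on a one-dimensional quadratic. The sketch below is not Edge Flow itself; it only illustrates the classical stability condition the abstract refers to. For f(x) = 0.5 * sharpness * x^2, one gradient descent step multiplies x by (1 - lr * sharpness), so the iterates shrink only when sharpness < 2 / lr. The function name and the chosen values are assumptions for illustration.

```python
def gd_on_quadratic(sharpness, lr, x0=1.0, steps=50):
    """Run gradient descent on f(x) = 0.5 * sharpness * x**2.

    The gradient is sharpness * x, so each step scales x by
    (1 - lr * sharpness). Returns |x| after the final step.
    """
    x = x0
    for _ in range(steps):
        x -= lr * sharpness * x
    return abs(x)


lr = 0.1  # stability threshold is 2 / lr = 20

# Sharpness just below the threshold: per-step factor has magnitude 0.9.
below = gd_on_quadratic(19.0, lr)

# Sharpness just above the threshold: per-step factor has magnitude 1.1.
above = gd_on_quadratic(21.0, lr)
```

In both cases the iterate flips sign at every step, which is the oscillation that the abstract's model tracks through a direction and a magnitude; only the magnitude's growth or decay changes across the threshold.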
[LG-9] NoiseTilt: Noise-Tilted Reverse Kernels for Diffusion Reward Alignment
链接: https://arxiv.org/abs/2606.18066
作者: Jisung Hwang,Yunhong Min,Jaihoon Kim,I-Chao Shen,Minhyuk Sung
类目: Machine Learning (cs.LG)
*备注: 52 pages
Abstract:We introduce the Noise-Tilted Reverse Kernel (NTRK), a reward-guided diffusion sampler that injects reward gradients through the noise term, leaving the pretrained reverse kernel unchanged and requiring only a single sample per step. Reward-guided sampling at inference time has greatly expanded the versatility of pretrained diffusion models. Yet existing methods face a trade-off. Gradient-based guidance shifts the reverse mean, steering generation but pushing intermediate states outside the region that the model was trained on and degrading quality. Search-based methods preserve quality but gain no gradient signal. No prior method achieves both. NTRK resolves this by keeping the reverse mean fixed and biasing the noise term toward high reward. We introduce a whitening operator, the central mechanism behind NTRK, that makes the reward gradient safe to inject as noise without losing its guiding signal. Across various reward alignment tasks, NTRK outperforms recent state-of-the-art baselines without losing sample quality. Remarkably, on aesthetic generation, NTRK surpasses the reward of the best baseline at 500 NFEs using only 25 NFEs, a 20 \times reduction in compute.
[LG-10] ConTex: Reformulating Counterfactual Generation For Time Series Forecasting
链接: https://arxiv.org/abs/2606.18049
作者: Jan Voets,Hasan Tercan,Tobias Meisen,Sebastian Baum
类目: Machine Learning (cs.LG)
*备注: 19 pages, 5 figures, 14 tables
Abstract:Decision-making with deep learning-based time series forecasting requires not only accurate predictions but also actionable insights. However, current architectures do not inherently provide such information. Specifically, guidance is needed on how current conditions must be modified to shift from a predicted outcome to a desired future scenario. Counterfactual explanations provide a natural framework for this task, as they represent minimal input changes that alter the model’s prediction, indicating when and how intervention is required. Existing approaches rely on instance-wise optimization, leading to inconsistency across instances, high computational costs, and limited applicability in real-time settings. To address these limitations, we reformulate counterfactual generation for time series forecasting as the problem of learning a globally consistent intervention strategy, allowing counterfactuals to be generated through a single shared function. We propose Counterfactual Time Series Explanations (ConTex), a model-agnostic, decomposed architecture comprising a temporal context encoder and a conditional encoder, followed by two heads that capture interventions in terms of temporal relevance and modification strength. This structure overcomes the instability and inconsistency of instance-based approaches by producing targeted, interpretable interventions across time and feature dimensions in a single forward pass, making it suitable for real-time applications. Across multiple forecasting architectures and benchmark datasets, ConTex achieves state-of-the-art validity while generating sparse counterfactuals that minimize the number of necessary interventions. Additionally, our approach reduces computational cost by at least 12-36x compared to instance-wise generation and supports real-time inference at approximately 0.007 seconds. 
[LG-11] Uncertainty Quantification for Flow-Based Vision-Language-Action Models
链接: https://arxiv.org/abs/2606.18043
作者: Ralf Römer,Maximilian Seeliger,Saida Liu,Ben Sturgis,Marco Bagatella,Daniel Marta,Andreas Krause,Angela P. Schoellig
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Project page: this http URL . 28 pages, 12 figures
Abstract:Vision-language-action models (VLAs) combine vision-language backbones with expressive generative action heads trained via flow matching on large-scale robotic datasets. Despite their strong empirical performance in robotic manipulation, VLAs lack mechanisms to quantify confidence in their predictions and to detect when their actions may be unreliable. This presents a critical limitation for real-world deployment in non-stationary environments, where models inevitably encounter scenarios outside their pretraining distribution and may fail without warning. To address this, we derive an efficient method for quantifying epistemic uncertainty in flow-matching models by leveraging velocity-field disagreement (VFD) across a small ensemble. We successfully use this uncertainty estimate for failure detection during deployment and active fine-tuning of flow-based VLAs. To this end, we propose SAVE, a framework for uncertainty-guided active multitask fine-tuning that reduces the number of costly expert demonstrations required to adapt VLAs to new tasks. Through extensive experiments on the LIBERO benchmark, we demonstrate that VFD yields better-calibrated uncertainty estimates predictive of downstream performance, that VFD achieves strong performance in detecting failures, and that uncertainty-guided data acquisition with SAVE requires at least 22% fewer samples than baselines. In summary, our work shows that quantifying epistemic uncertainty in flow-based VLAs improves both failure awareness and adaptation. Project website: this http URL.
[LG-12] INI-VPINN: A Variational Physics-Informed Neural Network with Implicit Neumann and Interface Handling for Multi-Material Domains with Geometric Singularities
链接: https://arxiv.org/abs/2606.18032
作者: Shayan Dodge(1),Alessandro Formisano(2),Sami Barmada(1) ((1) DESTeC, University of Pisa, Pisa, Italy, (2) Department of Engineering, University of Campania Luigi Vanvitelli, Aversa, Italy)
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: Preprint version. Under peer review. Code available at: this https URL
Abstract:We propose a new weak-form Physics-Informed Neural Network approach (named INI-VPINN). INI-VPINN naturally incorporates Neumann boundary and interface conditions into the variational formulation. It removes the need for additional loss terms or multiple subdomain networks. This framework employs compact support weighting functions and integration by parts to implicitly impose flux and continuity constraints. In this way, it implicitly ensures physical consistency across material boundaries. The proposed method is tested on Poisson and Laplace problems with sharp interfaces and complex geometries. Results show that, compared with several other Physics Informed Neural Networks-based formulations, the INI-VPINN consistently achieves higher accuracy, smoother and faster convergence. The proposed framework provides a general approach for solving multimaterial problems with complex geometries and mixed Neumann-Dirichlet boundary conditions using neural networks. The implementation is publicly available in a GitHub repository.
[LG-13] Recursive Scaling in Masked Diffusion Models
链接: https://arxiv.org/abs/2606.18022
作者: Alba Carballo-Castro,Julianna Piskorz,Paulius Rauba,Mihaela van der Schaar,Pascal Frossard
类目: Machine Learning (cs.LG)
*备注:
Abstract:Masked diffusion models (MDMs) have recently emerged as a promising paradigm for sequence generation. Scaling MDMs is conventionally achieved by increasing the parameter count or the number of denoising steps. We introduce Recursive Masked Diffusion Models (R-MDMs), which add recursive depth as a third scaling axis by repeatedly applying the same denoising transformer within each diffusion step. Recursion enables iterative refinement of the output through parameter reuse, increasing effective model depth without increasing parameter count. Across structured generation tasks, including Sudoku and Countdown, we show that R-MDMs achieve substantially improved parameter efficiency: a model with L recursive iterations often matches the performance of non-recursive baselines with roughly L\times more parameters. Moreover, recursive refinement can partially substitute for additional denoising steps, allowing recursive models to reach the same generation quality with fewer forward passes at inference time. These results suggest that recursive depth is a practically useful scaling mechanism for MDMs, improving both parameter efficiency and the allocation of test-time compute.
[LG-14] Half a Link can Be Enough to Predict a Whole Link: Understanding Generalization in Knowledge Graph Foundation Models
链接: https://arxiv.org/abs/2606.18001
作者: Cosimo Gregucci,Obaidah Theeb,Daniel Hernandez,Antonio Vergari,Steffen Staab
类目: Machine Learning (cs.LG)
*备注:
Abstract:Knowledge graph (KG) foundation models (KGFMs) are zero-shot generalizers: trained once, they can predict links on unseen graphs without retraining. However, understanding when and how they can robustly generalize across KGs is still an open question. In this paper, we shed some light on their generalization mechanisms highlighting how their performance on unseen KGs is not uniform when it comes to partially seen links, which we call half-links. In fact, we show that to predict a test triple (h,r,t) it might suffice in practice to have observed the half-link (h,r) or (r,t) in the inference graph. This yields a taxonomy of four scenarios when combinations of these half-links are observed or not. In a rigorous stratified analysis over these scenarios, we reveal that SoTA KGFMs use seen half links for predictions, while unseen half-links pose different challenges. As such, our finer-grained taxonomy can be a diagnostic protocol for robust KGFM generalization and highlights where novel KGFMs can improve.
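The four-scenario taxonomy follows directly from two boolean checks on the inference graph. The sketch below is one possible way to stratify test triples, assuming triples are stored as (head, relation, tail) tuples; the function name and the toy graph are hypothetical and not taken from the paper.

```python
def half_link_scenario(test_triple, inference_triples):
    """Report which half-links of a test triple appear in the inference graph.

    Returns a pair (head_side_seen, tail_side_seen):
      head_side_seen: some triple (h, r, *) is in the inference graph
      tail_side_seen: some triple (*, r, t) is in the inference graph
    The four possible pairs correspond to the four scenarios.
    """
    h, r, t = test_triple
    head_side_seen = any(h2 == h and r2 == r for h2, r2, _ in inference_triples)
    tail_side_seen = any(r2 == r and t2 == t for _, r2, t2 in inference_triples)
    return head_side_seen, tail_side_seen


# Hypothetical toy inference graph.
inference_graph = [
    ("alice", "works_at", "acme"),
    ("bob", "works_at", "globex"),
    ("carol", "lives_in", "paris"),
]

both_seen = half_link_scenario(("alice", "works_at", "globex"), inference_graph)
head_only = half_link_scenario(("alice", "works_at", "initech"), inference_graph)
tail_only = half_link_scenario(("dave", "works_at", "acme"), inference_graph)
none_seen = half_link_scenario(("dave", "lives_in", "rome"), inference_graph)
```

Grouping evaluation metrics by this pair gives the kind of stratified analysis the abstract describes.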
[LG-15] Predictive Analytics in E-Commerce for Customer Behavior Forecasting using hybrid Ret-DNN with XGBoost Model
链接: https://arxiv.org/abs/2606.17931
作者: Degala Pushpa Sri,Mayank Atreya,Lakshmi. H,Navin Chhibber,Mukesh Soni
类目: Machine Learning (cs.LG)
*备注: 2025 2nd International Conference on Software, Systems and Information Technology (SSITCON)
Abstract:In recent years, electronic (E) commerce services have become increasingly common in people's daily lives, helping them purchase products online. However, retail platforms have struggled to understand customer behavior, which makes it difficult to predict customers' future purchases. To overcome these challenges, this study proposes a hybrid Retail Deep Neural Network (Ret-DNN) with an Extreme Gradient Boosting (XGBoost) model for capturing temporal features and tabular dynamics of retail data. First, data were sourced from a United Kingdom (UK)-based online retailer that contains transactions with almost 500,000 records. Then, the collected data were pre-processed using a series of techniques, such as data cleaning, outlier handling, temporal feature extraction, feature encoding, and z-score normalization, to ensure that the data were ready for model training and testing. Subsequently, the preprocessed data were fed into the Ret-DNN model, which acts as a feature extractor to understand the complete context of customer transactions. Further, the extracted features were fed as input into the XGBoost model, which predicted the final output as the purchase probability of customers. Finally, the proposed Ret-DNN XGBoost model achieved better results, attaining a Mean Absolute Error (MAE) of 0.2193, when compared to the existing Ret-DNN model. Keywords: Customer behavior forecasting, extreme gradient boosting, electronic commerce, predictive analytic, retail deep neural networks.
[LG-16] Monotonic Kolmogorov-Arnold Networks: A Theoretical and Empirical Study of Monotonicity as an Inductive Bias
链接: https://arxiv.org/abs/2606.17886
作者: Mikhail Krasnov,Carolina Fortuna,Blaž Bertalanič
类目: Machine Learning (cs.LG)
*备注:
Abstract:Monotonicity has been a long-running architectural inductive bias for neural networks, motivated by tabular, scientific, and economic settings where outputs are known to respond monotonically to certain inputs. Existing approaches are MLP- or flow-based and lack per-edge functional transparency; the only Kolmogorov–Arnold Network (KAN) variant with monotonicity, MonoKAN, enforces the constraint only on a restricted parameter subset and requires a projection-style training procedure. We close this gap with MKAN, a KAN with hard monotonicity guaranteed for all parameter values via exponential reparameterization of B-spline coefficients, positive edge weights, and a monotone base activation. Training reduces to standard unconstrained gradient descent. Our headline theoretical contribution is a representation-cost theorem: any C^K, K > 0 feature extractor inducing a ball-shaped semantic-neighborhood partition admits a monotone realization of the equivalent neighborhood structure at N' = N^* + k \le 2N^* , where k is the number of non-monotone coordinates of the original. The bound is architecture-agnostic and gives a principled sizing rule for monotone encoders. Empirically, MKAN is competitive with state-of-the-art monotone NNs on the SMM/ICML-2024 benchmark while being the only method that combines hard unconstrained monotonicity with KAN's per-edge functional transparency; the 2N^* prediction is validated in a self-supervised feature-size sweep on four real datasets, and on a controlled monotone-generative dataset MKAN recovers ground-truth factors with substantially higher Spearman alignment than KAN, MLP, and linear baselines.
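The abstract does not spell out the exact reparameterization, so the following is only a sketch of one common way to obtain hard monotonicity from unconstrained parameters. It assumes that each spline coefficient is the previous one plus exp(theta), which makes the coefficient sequence increasing for any real theta. A degree-1 B-spline on uniform knots (plain linear interpolation) stands in for the higher-order splines a KAN would use. All names here are hypothetical.

```python
import math


def monotone_coefficients(c0, thetas):
    """Map unconstrained parameters to strictly increasing coefficients.

    Assumed form: c[i + 1] = c[i] + exp(theta[i]). Because exp is always
    positive, no projection or clipping is needed during training.
    """
    coeffs = [c0]
    for theta in thetas:
        coeffs.append(coeffs[-1] + math.exp(theta))
    return coeffs


def eval_linear_spline(coeffs, x):
    """Evaluate a degree-1 spline with uniform knots on [0, 1].

    Inputs outside [0, 1] are clamped to the boundary.
    """
    n = len(coeffs) - 1
    pos = min(max(x, 0.0), 1.0) * n
    i = min(int(pos), n - 1)
    frac = pos - i
    return (1.0 - frac) * coeffs[i] + frac * coeffs[i + 1]


# Arbitrary unconstrained parameters, including large negative values.
thetas = [-3.0, 0.5, 2.0, -10.0]
coeffs = monotone_coefficients(0.0, thetas)
ys = [eval_linear_spline(coeffs, k / 20) for k in range(21)]
```

Whatever values the optimizer assigns to `thetas`, the resulting edge function is non-decreasing, which is the sense in which monotonicity holds for all parameter values.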
[LG-17] Meta-classification of one-class classification models using ranking correlation and nearest neighbor
链接: https://arxiv.org/abs/2606.17858
作者: Toshitaka Hayashi,Hamido Fujita,Dalibor Cimr,Richard Cimler,Jitka Kühnová
类目: Machine Learning (cs.LG)
*备注:
Abstract:Machine Learning (ML) techniques have been applied to various problems. However, applying ML to ML models is an unexplored direction. For this purpose, this paper considers a meta-classification of one-class classification (OCC) models, because all ML models could be approximated as OCC models. The proposal represents OCC models as normality rankings and classifies them using nearest-neighbor and ranking-correlation metrics. The experiment classifies OCC models, where classes correspond to training datasets, algorithms, and hyperparameters. The proposal achieves high accuracy when class labels are datasets. Moreover, it can classify algorithms when the training datasets contain the same class. In addition, the discussion highlights that the classification of OCC models is essentially the classification of datasets that treats multiple samples as a single input. The experiment demonstrates the classification of datasets using sleeping records. The proposed method can provide a unified solution for classifying OCC models, datasets, and rankings. Source code is uploaded to the public repository this https URL.
[LG-18] From Drift to Coherence: Stabilizing Beliefs in LLMs
链接: https://arxiv.org/abs/2606.17832
作者: SongEun Kim,Seungyoo Lee,Edwin Fong,Hyungi Lee,Juho Lee
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language models (LLMs) are often hypothesized to perform implicit Bayesian inference, yet a key coherence condition, the martingale property of predictive beliefs, has been shown to fail in controlled synthetic in-context learning settings. We revisit this question in a more typical usage regime: generic multiple-choice question answering. Exploiting the discrete answer space, we compute exact predictive distributions and study belief dynamics induced by autoregressive answer resampling. We introduce prompted predictive resampling (PPR), where an LLM generates a sequence of answers to the same question. Empirically, PPR reveals early-stage belief drift, indicating martingale violations. However, after sufficient resampling steps, the belief process self-stabilizes and converges to a coherent predictive distribution. Based on this observation, we further propose (i) a seed-answer prompting strategy to accelerate stabilization, and (ii) a self-consistency loss that amortizes early-stage drift into the model via fine-tuning. Experiments on multiple-choice QA benchmarks show that our methods substantially reduce belief drift and improve predictive coherence without sacrificing accuracy.
[LG-19] QueryMarket: Cost-Aware Online Active Learning in Data Markets
链接: https://arxiv.org/abs/2606.17805
作者: Xiwen Huang,Pierre Pinson
类目: Machine Learning (cs.LG)
*备注: 10 pages, 8 figures. Submitted to IEEE Transactions on Neural Networks and Learning Systems
Abstract:Data acquisition is a major bottleneck for learning in real-time streams: analysts must decide on the fly which labels to purchase while respecting a rolling budget. However, existing online active learning rarely unifies pricing, information gain, and rolling budget constraints under concept drift. We introduce QueryMarket, a market-inspired framework that queries each incoming data point based on its estimated utility to the model and its price. Within this framework, we propose OVBAL (online variance-based active learning), which integrates data pricing with information-driven selection by estimating each sample’s marginal utility via a D-optimality criterion with exponential forgetting and executing cost-aware purchases under rolling budget constraints. OVBAL yields a simple, fully online decision rule that adapts to nonstationary streams and heterogeneous label costs. Experiments on synthetic data and a real-world solar power generation forecasting task show that OVBAL is particularly effective under seller-centric pricing and yields a more favorable long-run error-cost trade-off in the real-world task under both pricing schemes.
[LG-20] Continual Self-Improvement with Lightweight Experiential Latent Memories
链接: https://arxiv.org/abs/2606.17803
作者: Vaggelis Dorovatas,Nancy Kalaj,Rahaf Aljundi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language models achieve strong reasoning performance by scaling inference-time compute, yet remain fundamentally stateless, discarding the rich, self-produced reasoning traces generated during this process. We investigate whether models can instead learn online from this experience, converting transient computation (reasoning traces) into persistent reusable knowledge, and without external supervision or access to future data. We show that In-Context Learning (ICL) over raw reasoning traces fails to generalize, reflecting a fundamental limitation of token-level reuse: individual traces lack the abstraction needed for transfer, even after refinement (e.g. self-reflection). In contrast, drawing inspiration from recent works on unsupervised reinforcement learning, we find that lightweight per-instance training with self-generated test-time signals (majority voting) as rewards yields substantial gains, often surpassing full-dataset offline training, motivating a shift from raw traces to learned latent representations. Building on this insight, we propose an online method that distills inference-time compute spent on encountered problems into compact modular latent memories capturing the underlying reasoning structure. These memories are stored and retrieved for future inputs, enabling continual improvement while avoiding catastrophic forgetting through modular design. Importantly, our method is highly efficient, parametrized as extremely lightweight soft prompt memories (~0.001% of model parameters) and trained with only a few gradient steps, yet achieving performance competitive with full parametric updates and offline training. Across challenging mathematical reasoning benchmarks, our approach significantly outperforms zero-shot and raw data ICL baselines, while transferring effectively across datasets.
[LG-21] Blind Recovery of Latent Domains via Unsupervised Symmetry Discovery
链接: https://arxiv.org/abs/2606.17782
作者: Onur Efe,Arkadas Ozakin
类目: Machine Learning (cs.LG)
*备注:
Abstract:The primary motivation in blind inverse problems is to recover signals of interest from corrupted observations without knowing the obfuscating mechanism. Blind deconvolution is a prominent approach when the corruption is convolutional, but it is not applicable when general linear transformations obfuscate the domain structure. In this work, we propose an unsupervised framework for recovering latent domains and signals by discovering symmetries of the data distribution. Our framework models observations as linear measurements of signals sampled from a latent random field, and optimizes a shallow group-convolutional network by imposing stationarity and locality regularization at the model output. The model learns a latent symmetry action and an appropriate filter, thereby mapping unstructured observations to a symmetry-based representation that reveals latent signals. Experiments on stochastic processes, Ising models, shuffled and bit-scrambled images, and neural recordings show that the method recovers latent domains and signals from unstructured observations, suggesting symmetry discovery as a new direction for unsupervised structure learning and blind inverse problems.
[LG-22] A fairness-aware extension of Stochastic Multicriteria Acceptability Analysis for ranking
链接: https://arxiv.org/abs/2606.17756
作者: Guilherme Dean Pelegrina,Renata Pelissari
类目: Machine Learning (cs.LG)
*备注:
Abstract:Fairness has become a central concern in ranking problems involving individuals or social groups, particularly under the Responsible Artificial Intelligence agenda. In Multi-Criteria Decision Analysis, Stochastic Multicriteria Acceptability Analysis (SMAA) provides a robust framework for handling uncertainty and incomplete preference information, but it does not explicitly address fairness in the resulting rankings. This paper proposes SMAA-Fair, a fairness-aware extension of SMAA for ranking problems. The approach reweights the simulated rankings generated by SMAA according to their level of group fairness, so that fairer rankings contribute more strongly to the acceptability indices and central weights vector. The framework is independent of the aggregation model and can incorporate different fairness metrics. In this study, Statistical Parity, normalized discounted Kullback–Leibler divergence (rKL) and normalized discounted cumulative Kullback–Leibler divergence (nDKL) are adopted. Rankings are derived from the fairness-adjusted acceptability matrix using expected ranking and maximum acceptability ranking. We also derive the central weight according to the degree of fairness in the obtained rankings. Numerical experiments with synthetic and real data show that SMAA-Fair improves the representation of protected groups among favourable ranking positions, while preserving robustness to preference uncertainty.
[LG-23] Delta-Based Target Reformulation for Short-Term Electricity Load Forecasting Using LSTM and Transformer Models
链接: https://arxiv.org/abs/2606.17692
作者: Vansh Bansal
类目: Machine Learning (cs.LG)
*备注: 8 pages, 3 tables
Abstract:Accurate short-term electricity load forecasting is critical for the reliable and economic operation of modern power systems, under non-stationarity arising from weather variability, calendar effects, and evolving consumption patterns. While deep learning models such as LSTMs and Transformers show promising performance, most existing studies focus on direct absolute load prediction without explicitly addressing target non-stationarity. Motivated by classical time-series differencing techniques in ARIMA models, this paper investigates a delta-based target reformulation for short-term electricity load forecasting using deep learning. Instead of directly predicting absolute load values, the proposed formulation trains models to predict the change in load between consecutive time steps, with final forecasts reconstructed using the last observed load. This aims to stabilize the learning target and reduce forecasting difficulty. Using multi-year, hourly real-world electricity load data from India, augmented with meteorological variables from the NASA POWER project and calendar features, this study evaluates LSTM and Transformer models under both formulations, benchmarking them against LightGBM. Experiments are conducted for hour-ahead and day-ahead horizons, assessing performance via Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE). Results show that delta-based reformulation consistently improves forecasting accuracy for hour-ahead prediction across all evaluated models, yielding MAPE reductions of over 50% compared to absolute formulations. For day-ahead forecasting, delta targets specifically benefit deep sequence models (LSTM and Transformer), while LightGBM remains competitive under the absolute formulation. These findings indicate that while delta reformulation is a powerful inductive bias for neural networks, its efficacy is model- and horizon-dependent. 
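The delta-based reformulation amounts to two small transformations around the model: differencing the target before training and adding the last observed load back at prediction time. The sketch below shows only that bookkeeping on a toy series; the helper names and the numbers are illustrative assumptions, and no forecasting model is included.

```python
def to_delta_targets(load):
    """Turn absolute loads into one-step changes: delta[t] = load[t+1] - load[t]."""
    return [nxt - cur for cur, nxt in zip(load, load[1:])]


def reconstruct_forecast(last_observed_load, predicted_delta):
    """Recover an absolute forecast from a predicted change."""
    return last_observed_load + predicted_delta


# Hypothetical hourly load values.
load = [100.0, 104.0, 103.0, 110.0]

# Training targets under the delta formulation.
deltas = to_delta_targets(load)

# With perfect delta predictions, reconstruction returns the original series.
rebuilt = [reconstruct_forecast(prev, d) for prev, d in zip(load, deltas)]
```

Because the deltas are centered near zero and vary less than the raw load, the model sees a more stationary target, which is the same motivation as differencing in ARIMA.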
Cite as: arXiv:2606.17692 [cs.LG] (or arXiv:2606.17692v1 [cs.LG] for this version), https://doi.org/10.48550/arXiv.2606.17692. Submitted: [v1] Tue, 16 Jun 2026 09:01:44 UTC (15 KB), by Vansh Bansal.
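The delta-target idea described in the abstract above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's code: the helper names `to_delta_targets` and `reconstruct`, the toy load values, and the stand-in predicted change are all hypothetical.

```python
# Sketch of a delta-based target reformulation (all names and data are illustrative).

def to_delta_targets(load):
    """Turn an absolute load series into one-step changes: d[t] = y[t+1] - y[t]."""
    return [nxt - cur for cur, nxt in zip(load, load[1:])]


def reconstruct(last_observed, predicted_delta):
    """Recover the absolute forecast from the last observed load and a predicted change."""
    return last_observed + predicted_delta


load = [100.0, 104.0, 103.0, 110.0]    # hypothetical hourly load values
deltas = to_delta_targets(load)         # targets a delta model would be trained on
forecast = reconstruct(load[-1], 2.5)   # 2.5 stands in for a model's predicted change
```

The model only has to learn the change between consecutive steps; the level of the series is carried by the last observation at reconstruction time.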
[LG-24] Physics-Constrained Neural Networks for Improved Short-Term Weather Forecasting: A Case Study over the South Pacific ICLR2026
Link: https://arxiv.org/abs/2606.17659
Authors: Egor Bugaev,Fedor Buzaev,Dmitry Efremenko,Denis Derkach,Fedor Ratnikov
Subjects: Machine Learning (cs.LG)
*Comments: Presented at ICLR 2026 Workshop AI and PDE
Abstract:This study introduces enhancements to physics-constrained neural networks (PCNNs) that improve the accuracy and stability of hybrid short-term weather forecasting models. Building on the WeatherGFT architecture, three innovations are proposed. First, an upgraded numerical solver, combining a fifth-order weighted essentially non-oscillatory scheme (WENO-5), a beta-plane approximation, and subgrid-scale viscosity, permits a fourfold increase in the integration time step to 1200 s while reducing the daily mean squared error by up to 26%. Second, a unified autoregressive hybrid block replaces the original chain of 24 specialised modules, eliminating overfitting to specific lead times. Third, the physical core is integrated with two state-of-the-art neural backbones, resulting in PI-PredFormer and PI-IAM4VP. Evaluation on the WeatherBench South Pacific subset from 2000 to 2004 shows that these hybrids reduce root mean squared error at 1-12 h lead times by 8-22% compared to purely neural counterparts, while better preserving physical consistency. These results demonstrate that incremental refinement of hybrid components offers a practical route toward more accurate and efficient short-range weather forecasting.
[LG-25] Expanding SPHERE-JEPA: A Family of Statistical Regularizers for the Hypersphere
Link: https://arxiv.org/abs/2606.17603
Authors: Léo Nicollier(CB, ATT),Enric Meinhardt-Llopis(CB),Max Dunitz(ATT),Marc Pic(ATT),Pablo Musé(CB, IFUMI),Gabriele Facciolo(CB)
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:In Self-Supervised Learning (SSL), preventing representation collapse by explicitly enforcing a uniform distribution on the unit hypersphere has proven to be effective. However, current frameworks typically rely on sliced statistical regularizers such as SIGReg (used in LeJEPA) and SUSReg (used in SPHERE-JEPA), which approximate this continuous objective via Monte Carlo sampling along random 1D directions. This stochasticity injects projection variance into the training gradients, destabilizing optimization, and hindering convergence. In this work, we first show that analytically integrating out these random projections natively yields a deterministic Maximum Mean Discrepancy (MMD), bypassing the variance of sliced methods. Motivated by this equivalence, we formulate full-dimensional objectives for MMD, Kernel Stein Discrepancy (KSD), and Kullback-Leibler (KL) divergence directly on the sphere to enforce a uniform distribution. To prevent spatial bias, we equip these tests with rotationally invariant kernels constructed via spectral theory, systematically evaluating two canonical families: smooth exponential decay (Heat) and strict frequency cutoff (Bandlimited) filters. Empirically, removing projection-induced noise results in more stable optimization, faster convergence, and consistent improvements over stochastic sliced regularizers on ImageNet and Galaxy10. Furthermore, we reveal that the choice of the statistical test shapes the geometry of the learned latent space: MMD and KSD favor locally clustered organization suitable for object-centric domains, whereas the continuous KDE-based KL divergence promotes fine-grained instance separation, yielding the strongest results on unclustered procedural texture retrieval.
[LG-26] When Dynamics Models Read the Wrong Time Steps: Label-Free Event Credit Re-Anchoring for Robust Global Readouts
Link: https://arxiv.org/abs/2606.17572
Authors: Yifan Wang
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
*Comments: 7 pages, 6 figures
Abstract:Learned dynamics models often answer global physical questions, such as fault severity or impact stiffness, by pooling a per-step feature sequence into one readout vector. This sequence-to-global interface creates an under-studied temporal credit problem: with only trajectory-level supervision, a model can predict accurately in training conditions while reading from abundant smooth correlates rather than the brief physical events that determine the target. We call this failure temporal credit dilution. It is not exposed by the training loss and is not removed by standard physics-informed residuals, because the error lies in where the global readout assigns functional credit. We introduce Credit-in-Event, an interface-level probe for measuring how much pooled credit lands on event steps, and prove in closed form that a pooled linear reader routes credit to a spurious background channel as the event fraction shrinks. We then propose CREST, a training-free and label-free readout that estimates a transient event core from learned features and re-anchors the pooled representation through event-versus-rest contrast. Across simulated gear and impact systems, recurrent and attention encoders, and public bearing vibration data, CREST reduces out-of-distribution error while restoring event credit. Ablations show that stable-step selection and receptive-field shrinking fail, confirming that the gain comes from event-core credit re-anchoring rather than a generic locality or stability prior.
[LG-27] Reducing Learner Redundancy in Boosting via Residual Orthogonalization
Link: https://arxiv.org/abs/2606.17567
Authors: Ye Su,Jipeng Guo,Yong Liu,Xin Xu,Gangchun Zhang,Jinxin Chen,Di Wu,Longlong Zhao
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:While sequential residual fitting is the bedrock of standard boosting frameworks, it inherently breeds learner redundancy by repeatedly revisiting correlated error components. To address this bottleneck, we propose a shift from residual fitting to residual orthogonalization and introduce SCBoost. Our framework tackles redundancy through two complementary mechanisms: Spectral Residual Projection (SRP) and Covariance-Regularized Weighting (CRW). During training, SRP projects each residual target onto the orthogonal complement of the historical prediction subspace, forcing successive learners to capture only novel empirical innovations. During aggregation, CRW optimizes ensemble weights on a validation set with an explicit covariance penalty to mitigate remaining correlations. Theoretically, we provide a finite-sample geometric characterization proving that SRP yields an exact additive residual-energy decomposition. Furthermore, under an isotropic-noise assumption, we rigorously establish the conditions under which this projection improves the effective Signal-to-Noise Ratio. Extensive experiments across ten benchmark datasets demonstrate that SCBoost delivers strong out-of-the-box performance, particularly in accuracy and F1 score. This work reinterprets boosting through a geometric lens, suggesting that explicit redundancy control is a principled and necessary step toward more efficient ensemble architectures.
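The projection step the abstract attributes to SRP can be illustrated with plain Gram-Schmidt. This is only a sketch under assumptions: the abstract does not say how the projection is computed, and the function names and toy vectors below are hypothetical, not taken from SCBoost.

```python
# Sketch: remove from a residual every component that lies in the span of
# earlier learners' predictions (Gram-Schmidt; illustrative, not the SCBoost code).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors that are already in the span
            basis.append([wi / norm for wi in w])
    return basis


def project_out(residual, past_predictions):
    """Project the residual onto the orthogonal complement of the prediction subspace."""
    r = list(residual)
    for b in orthonormal_basis(past_predictions):
        c = dot(r, b)
        r = [ri - c * bi for ri, bi in zip(r, b)]
    return r


past = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]  # hypothetical earlier predictions
residual = [2.0, 3.0, 4.0]
novel = project_out(residual, past)          # part of the residual not yet explained
```

Because the removed and retained parts are orthogonal, their squared norms add up to the squared norm of the original residual (here 13 + 16 = 29), which is the Pythagorean identity behind any additive energy split of this kind.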
[LG-28] AoiZora: Topology-Aware Auto-Parallel Optimization for Inference of Diffusion Transformers
Link: https://arxiv.org/abs/2606.17566
Authors: Kaijian Wang,Yuanyuan Xu,Fanjiang Ye,Ye Cao,Jingwei Zuo,T.S. Eugene Ng,Yarong Mu,Yuke Wang
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*Comments:
Abstract:Video diffusion has quickly grown into a key generative serving workload, yet producing each clip demands many denoising iterations over large spatio-temporal latents, which puts low-latency inference out of reach on a single device. A denoising step is therefore typically distributed across multiple accelerators, and TPU sub-slices have become an attractive and practical fabric for doing so. Current auto-parallel systems, however, search almost exclusively over logical device meshes and disregard how a chosen sharding is actually laid out on the physical TPU interconnect – an oversight that leaves large, topology-dependent performance on the table. We address this gap with AoiZora, a compiler-mediated topology planner built for low-latency video diffusion inference on TPU sub-slices. Its guiding principle is to reconnect logical sharding with physical placement by drawing on different points in the compilation flow: AoiZora first eliminates weak sharding candidates from inexpensive pre-compilation IRs, then compiles only the ones that survive and orders their physical placements using compiled HLO together with a topology-aware communication model. The winning plan is realized along the ordinary compiler path, leaving model code, compiler lowering, collective kernels, and network routing entirely intact. On TPU v5e sub-slices, AoiZora reduces Wan 2.1 one-step denoising latency by as much as 1.42x relative to existing solutions.
[LG-29] SpatioTemporal Causal Network Diagnostics for Geographic Tipping Point Early Warning
Link: https://arxiv.org/abs/2606.17553
Authors: Zhaoyuan Yu,Zhangyong Liang
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Geographic tipping points in ecosystems, climate subsystems, or ice sheets pose severe challenges for localized early warning. Classical spatial indicators such as Moran’s I summarize global spatial structure, but they struggle with three issues: spatial dilution, Euclidean assumptions, and correlated noise. This paper introduces SpatioTemporal Causal Network Diagnostics (ST-CND), a framework that addresses these three issues by representing the geographic field as a time-evolving directed causal network. The core workflow is as follows: (1) infer which spatial nodes help predict other nodes via transfer entropy, replacing fixed Euclidean neighborhoods with data-driven information-flow topology; (2) estimate local recovery rates within each candidate subnetwork via dynamic mode decomposition; and (3) identify the most vulnerable subnetwork by combining three signals, namely high internal fluctuation, high internal synchronization, and low external coupling, thereby suppressing false alarms from spatially correlated noise. Validated on synthetic bifurcations and two observational sea-surface temperature benchmarks, namely Indo-Pacific SST and North Atlantic AMOC, ST-CND delivers localized and interpretable warnings. On the AMOC task, it achieves an AUROC of 0.783 and a critical-subnetwork IoU of 0.378, outperforming recurrence-network and lambda-AR1 baselines. The framework provides an interpretable and scalable pipeline for spatial early warning in Earth system science.
[LG-30] Continuous-time Optimal Stopping through Deep Reinforcement Learning
Link: https://arxiv.org/abs/2606.17545
Authors: Cosmin Borsa,Michael Ludkovski
Subjects: Machine Learning (cs.LG); Computational Finance (q-fin.CP); Pricing of Securities (q-fin.PR)
*Comments: 33 pages
Abstract:Simulation-based solvers for optimal stopping problems must discretize the stopping decision. Under classical dynamic programming, a coarse exercise grid with only a few stopping opportunities can materially undervalue the optimal expected reward, whereas on a very fine grid, approximation errors accumulate through the backward recursion. To remove this limitation, we develop a new reinforcement-learning-inspired algorithm that enables us to learn the exercise rule at arbitrarily fine time resolution. Our CARLOS (Continuous-time Adaptive Reinforcement Learning for Optimal Stopping) algorithm utilizes an aggregate deep neural network (ADNN) to learn a joint space-time decision boundary. Starting from a coarse time grid, we progressively increase the frequency of stopping opportunities, while in parallel training the ADNN to refine its timing-value estimates. We moreover design an adaptive sampling strategy that gradually concentrates training effort near the stopping boundary. Benchmarked results show that CARLOS delivers higher prices than existing Bermudan solvers, approaching the American upper bound, and achieves high computational efficiency relative to non-RL comparators.
[LG-31] Non-negative Matrix Factorisation with Topological Regularisation
Link: https://arxiv.org/abs/2606.17531
Authors: Matias de Jong van Lier,Shizuo Kaji,Keunsu Kim
Subjects: Machine Learning (cs.LG); Computational Geometry (cs.CG); Algebraic Topology (math.AT)
*Comments:
Abstract:We investigate the learning of interpretable bases in non-negative matrix factorisation (NMF) by regularising the topology of the learned basis functions. Our approach is motivated by the observation that many data modalities can be viewed as non-negative functions on a structured domain, where the quality of a basis is intrinsically linked to its topology. However, naive methods for incorporating the topology of the support are often hindered by discreteness and threshold dependence, rendering them unsuitable for continuous optimisation. We address these challenges by employing persistent homology as a stable, threshold-free topological quantifier and by designing topological scores that integrate into the NMF objective as regularisers. The resulting framework encompasses spatially coherent image components, periodic time-series structures, and clique-like graph signals within a unified modelling language.
[LG-32] Domain-Validity-Gated Metamorphic Testing of Scientific ML Surrogates
Link: https://arxiv.org/abs/2606.17529
Authors: Meng Li,Xiaohua Yang,Jie Liu,Shiyu Yan
Subjects: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*Comments:
Abstract:Scientific machine-learning (SciML) surrogates approximate expensive simulations, but exact expected outputs for arbitrary inputs are unavailable (the oracle problem). Metamorphic testing checks relations across executions, yet a candidate relation is not automatically valid: its preconditions, output mapping, and the numerical floor of the scoring operator determine whether a violation is meaningful. We study how candidate metamorphic relations (MRs) can be screened for domain validity and turned into executable, oracle-free test assets for SciML surrogates. We propose (i) a domain-validity rubric that admits a candidate only when its tolerance dominates the operator’s numerical floor and its preconditions hold; (ii) an MR-card executable-asset format recording source cases, transformations, metrics, tolerances, and typed relation-level verdicts; and (iii) a case-study protocol on MeshGraphNets cylinder-flow surrogates, with a claim ledger binding every result to a tracked artifact. On a MeshGraphNets checkpoint, node permutation holds to machine precision, mirror-y is a bounded out-of-distribution stress finding rather than an exact symmetry, and absolute conservation stays deferred while a reference-relative guard passes. The same readings hold across held-out trajectories, a checkpoint roster, three further architectures, and PhysicsNeMo. On a second CFD task (compressible airfoil) the predicate instead rejects incompressible continuity on physical grounds, showing it reasons about domain validity rather than running a fixed checklist. On a second PDE family, FNO Burgers and heat surrogates run full admit/reject/execute verdicts. The evidence spans two CFD tasks and a second PDE family, supporting a validity-aware bridge from candidate MRs to auditable SciML test assets that separates model-level violations from out-of-domain applications.
[LG-33] MGUP: A Momentum-Gradient Alignment Update Policy for Stochastic Optimization NEURIPS2025
Link: https://arxiv.org/abs/2606.17526
Authors: Da Chang,Ganzhao Yuan
Subjects: Machine Learning (cs.LG)
*Comments: Published in NeurIPS 2025
Abstract:Efficient optimization is essential for training large language models. Although intra-layer selective updates have been explored, a general mechanism that enables fine-grained control while ensuring convergence guarantees is still lacking. To bridge this gap, we propose MGUP, a novel mechanism for selective updates. MGUP augments standard momentum-based optimizers by applying larger step-sizes to a selected fixed proportion of parameters in each iteration, while applying smaller, non-zero step-sizes to the rest. As a nearly plug-and-play module, MGUP seamlessly integrates with optimizers such as AdamW, Lion, and Muon. This yields powerful variants such as MGUP-AdamW, MGUP-Lion, and MGUP-Muon. Under standard assumptions, we provide theoretical convergence guarantees for MGUP-AdamW (without weight decay) in stochastic optimization. Extensive experiments across diverse tasks, including MAE pretraining, LLM pretraining, and downstream fine-tuning, demonstrate that our MGUP-enhanced optimizers achieve superior or more stable performance compared to their original base optimizers. We offer a principled, versatile, and theoretically grounded strategy for efficient intra-layer selective updates, accelerating and stabilizing the training of large-scale models. The code is publicly available at this https URL.
[LG-34] Learning to Refine Hidden States for Reliable LLM Reasoning
Link: https://arxiv.org/abs/2606.17524
Authors: Chia-Hsuan Hsu,Jui-Ming Yao
Subjects: Machine Learning (cs.LG)
*Comments: Code is available at tongyu0924/Learning-to-Refine-Hidden-States
Abstract:Large language models show strong reasoning ability, but their internal reasoning process can remain unstable in complex multi-step settings, where early hidden-state errors may propagate to incorrect predictions. We propose ReLAR, a reinforcement-guided latent refinement framework that iteratively updates hidden representations before decoding. ReLAR maintains a compact latent reasoning state and uses learned depth and action controllers to adaptively determine both the number and direction of refinement steps. The controllers are trained with a policy gradient objective based on step-wise likelihood improvement, enabling efficient input-dependent reasoning without explicit chain-of-thought generation. Experiments on medical, mathematical, multi-hop reasoning, and open-ended generation benchmarks show that ReLAR improves accuracy, generation quality, and reasoning stability with substantially lower inference overhead than explicit reasoning baselines.
[LG-35] When the Next Step Is Not One Step: Distribution-Aware Execution Modeling for Concurrent Go Programs
Link: https://arxiv.org/abs/2606.17508
Authors: Kaviru Hapuarachchi
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Programming Languages (cs.PL); Software Engineering (cs.SE)
*Comments: 10 pages, 2 figures
Abstract:Training a model to predict the next step in a concurrent program is harder than it looks: two runs of the same program from the same trace prefix can produce different next events, both valid, because the scheduler is nondeterministic. A model trained against a single label is learning to guess one outcome of a random process. We turn this around and use the nondeterminism as a training signal. We run each program many times, aggregate the observed next events into an empirical distribution, and fine-tune a 7B model to match that distribution with a KL objective. On 798 held-out predictions drawn from real production Go bugs (CockroachDB, Kubernetes, gRPC, etcd), fine-tuning on fewer than a thousand traces reaches 36.2% accuracy, ahead of Gemini 3.5 Flash used zero-shot (34.8%) and the same model without fine-tuning (28.6%). Distribution training matches cross-entropy on accuracy (35.8% vs. 36.2%) while reducing Expected Calibration Error from 0.205 to 0.169. We also derive a formal goroutine-leak signature for a class of select-blocked goroutines where P(GoUnblock)=0 holds by scheduler semantics, not by learning. We release the dataset, trained adapters, and all tooling.
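The training signal described above, an empirical next-event distribution matched with a KL objective, can be sketched as follows. This is an illustration under assumptions rather than the released tooling: the function names, the event list, the model probabilities, and the direction of the KL term are all chosen here for the example.

```python
# Sketch: aggregate next events from repeated runs into an empirical distribution
# and score a model's predicted distribution against it with KL divergence.
import math
from collections import Counter


def empirical_distribution(observed_events):
    """Relative frequency of each next event across repeated runs."""
    counts = Counter(observed_events)
    total = sum(counts.values())
    return {event: n / total for event, n in counts.items()}


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted), summed over events with nonzero target mass."""
    return sum(
        p * math.log(p / max(predicted.get(event, 0.0), eps))
        for event, p in target.items()
        if p > 0
    )


# Hypothetical next events observed after the same trace prefix in four runs.
runs = ["GoUnblock", "GoBlock", "GoUnblock", "GoUnblock"]
target = empirical_distribution(runs)
model_probs = {"GoUnblock": 0.5, "GoBlock": 0.5}  # stand-in for model output
loss = kl_divergence(target, model_probs)
```

A single-label objective would treat only one of these outcomes as correct; the distributional target keeps every observed outcome and its frequency.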
[LG-36] Reconfigurable Computing Challenge: Transformer for Jet Tagging on Versal AI Engines
Link: https://arxiv.org/abs/2606.17500
Authors: Gram Koski,Sean Lipps,Zhenghua Ma,G. Abarajithan,Ryan Kastner
Subjects: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*Comments: 4 pages, 4 figures. In FCCM 2026 proceedings
Abstract:Transformer-based models achieve strong performance for jet tagging at the CERN LHC, but deploying them in low-latency, resource-constrained trigger systems is challenging. We present an initial implementation of a quantized, integer-only transformer for jet tagging on the AMD Versal AI Engine (AIE), mapping dense and multi-head attention (MHA) layers to AIE tiles. The main contribution is a reusable software framework that represents transformer layers as composable AIE building blocks and automatically generates the corresponding Vitis graph code from a high-level Python model description. This framework provides a foundation for future research and is released as open-source software at this https URL.
[LG-37] Multi-Adapter PPO: A Cross-Attention Enhanced Wavelength Selection Framework for LIBS Quantitative Analysis
Link: https://arxiv.org/abs/2606.17476
Authors: Hao Li,Man Fung Zhuo
Subjects: Machine Learning (cs.LG)
*Comments: 6 pages
Abstract:Laser-induced breakdown spectroscopy (LIBS) quantitative analysis faces critical challenges in wavelength selection due to high-dimensional spectral data and the fundamental trade-off between prediction accuracy and feature efficiency. This paper presents a novel Multi-Adapter PPO framework that transforms wavelength selection into a reinforcement learning problem, leveraging cross-attention mechanisms and multiple specialized adapters to capture complex spectral relationships. Our approach outperforms traditional Particle Swarm Optimization (PSO) by an average of 28.4% in comprehensive score and 45.2% in prediction accuracy across steel and coal datasets. The proposed method demonstrates superior performance in balancing prediction accuracy with feature efficiency, achieving state-of-the-art results in LIBS quantitative analysis while maintaining interpretability and computational efficiency. We released our code and dataset here: this https URL
[LG-38] ReRAM-aware Model Finetuning addressing I-V Non-linearity and Retention Errors
Link: https://arxiv.org/abs/2606.17471
Authors: Ching-Yi Lin,Shamik Kundu,Arnab Raha,Sahil Shah
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
*Comments: 11 pages, 12 figures, 2 tables, with appendix (5 pages, 9 figures)
Abstract:Traditional CPU, GPU, and NPU architectures are increasingly limited by the von Neumann bottleneck. While In-Memory Computing (IMC) using ReRAM crossbar arrays offers a high-density, energy-efficient alternative, its practical deployment is constrained by their non-idealities. Existing hardware-aware training frameworks often require training from scratch, which is computationally prohibitive for modern large-scale models. In this work, we propose a finetuning-based hardware-aware training algorithm that enables robust DNN deployment on ReRAM with minimal training overhead. Our approach mitigates I-V non-linearity by applying a range-shrunk sinh transformation and incorporates retention errors directly into a regularization loss during the finetuning process. We evaluate our framework across models and tasks such as image classification and question-answering (QA). Experimental results demonstrate that our method achieves accuracy similar to the base model on large-scale models like ResNet18 and DeiT-Tiny. On ImageNet, the MobileNetV3 family shows less than 2% accuracy degradation. Further, applying the technique to the SQuAD v2 dataset results in only a 1-point degradation in F1 score.
[LG-39] Perron–Frobenius Operator Matching for Generative Modeling
Link: https://arxiv.org/abs/2606.17465
Authors: Shiqi Zhang,Wuwei Wu,Jaemin Oh,Jie Chen,Xiaoning Qian
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
*Comments:
Abstract:We introduce Perron–Frobenius Operator Matching (PFOM), a generative framework that matches density evolution via the integral PF operator, subsuming flow, diffusion, and jump models. We prove that among Bregman divergences, only Kullback–Leibler divergence preserves equality between density-level and sample-conditioned objectives, yielding a practical loss equivalent to Koopman path matching. We further develop Nesterov-accelerated training and sampling that stabilize discretization and accelerate convergence. On Gaussian mixtures and two-moons, PFOM achieves faster KL/W_2/MMD decrease and improved wall-clock efficiency with empirical validation. PFOM unifies operator-theoretic identification with modern generative modeling and opens paths to adaptive dictionaries and high-dimensional applications.
[LG-40] CheckMIABench: Firm Foundations For Membership Inference Attacks on Language Models
Link: https://arxiv.org/abs/2606.17464
Authors: Jeffrey G. Wang,Jason Wang,Marvin Li,Seth Neel
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Membership inference attacks (MIAs) are a canonical way to assess a machine learning model’s privacy properties. Although several attempts have been made to evaluate MIAs on language models, the extant literature has suffered numerous difficulties in constructing clean evaluations to test new techniques. In particular, subtle distribution shifts between member and non-member sets can undermine the statistical validity of MIAs; recent work has underscored this by showing that “blind” methods with no access to the underlying model can perform far better than published methods on the same benchmarks. This paper constructs a benchmark for principled evaluation of MIAs against LLMs, by leveraging the insight that training data before and after a fixed point during training are drawn from the same distribution. Therefore, all open-source models with intermediate checkpoints and public training data can be converted into MIA testbeds. We apply our framework to a half-dozen published attacks on the Pythia and OLMo families of models, from 70M to 7B parameters. To facilitate further privacy research, we open-source a modular library for designing and implementing attacks in this setting: this https URL.
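The evaluation setup the abstract describes, members seen before a checkpoint and non-members seen only after it, can be scored with a standard AUC. The sketch below assumes a simple loss-based attack score, a common baseline that is not necessarily among the attacks the paper evaluates; the function name and the loss values are hypothetical.

```python
# Sketch: AUC of a loss-based membership score, with members drawn from data seen
# before a checkpoint and non-members from data seen only after it (toy numbers).

def auc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member (ties count half)."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Hypothetical per-sample losses under the checkpointed model.
member_losses = [1.0, 1.2, 2.0]       # samples trained on before the checkpoint
nonmember_losses = [1.5, 2.2, 2.5]    # samples that appear only after it

# Lower loss suggests membership, so the attack score is the negated loss.
attack_auc = auc([-l for l in member_losses], [-l for l in nonmember_losses])
```

An AUC near 0.5 means the attack cannot separate the two sets; because both sets come from the same training stream, a score above 0.5 is harder to attribute to distribution shift alone.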
[LG-41] ResAware: Cross-Environment Website Fingerprinting via Resource-Privileged Distillation
Link: https://arxiv.org/abs/2606.17462
Authors: Chongru Fan,Wei Wang,Wentao Huang,Zhenquan Ding,Jinqiao Shi,Lei Cui,Zhiyu Hao,Xiaochun Yun
Subjects: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*Comments: 18 pages, 9 figures
Abstract:While Website Fingerprinting (WF) attacks achieve high accuracy in controlled laboratory settings, they often degrade substantially in real-world environments due to spatio-temporal drift, browser heterogeneity, proxy obfuscation, and other factors. This limitation stems from their sole reliance on low-level traffic features that are noisy and highly sensitive to environmental perturbations. To address this problem, we propose ResAware, a cross-environment resource-aware distillation framework under a training-rich/inference-poor asymmetric setting. Specifically, ResAware trains a teacher model on resource-level features, and then distills the resulting privileged knowledge into a student model through heterogeneous knowledge distillation. At deployment time, the student model performs inference using only encrypted traffic, incurring zero additional cost. We evaluate ResAware on a large-scale dataset collected over five months from six globally distributed vantage points, comprising more than 160,000 paired samples. The results show that ResAware significantly enhances the cross-environment robustness of diverse WF baselines. Under a 150-day temporal drift, for example, ResAware improves the F1-score of Var-CNN from 72.77% to 81.49% and the open-world TPR@1%FPR from 22.40% to 27.20%. Our results demonstrate that resource-level supervision improves WF robustness without expanding online observation capabilities.
[LG-42] Operator Boosting Produces Pareto-Efficient PDE Surrogates
Link: https://arxiv.org/abs/2606.17460
Authors: Lennon J. Shikhman
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
*Comments: 19 pages, 4 figures, 3 tables. Preprint submitted to Elsevier
Abstract:Neural operators are widely used as surrogate solution maps for partial differential equations (PDEs), but full-size models can be costly to store, deploy, and evaluate in many-query scientific workflows. This work introduces Operator Boosting, a stagewise residual-learning framework for constructing compact neural-operator surrogates directly, rather than training a large model and compressing it afterward. Starting from the empirical mean predictor in normalized output coordinates, the method trains a sequence of tiny same-family neural operators on residual fields and incorporates each correction through validation-selected shrinkage. We instantiate the framework with Fourier neural operators (FNOs), DeepONets, and convolutional neural operators (CNOs), and compare boosted tiny stacks against full-size monolithic baselines across one-, two-, and three-dimensional PDE benchmarks from PDEBench, APEBench, and The Well. Across 30 dataset-architecture pairs, 21 show positive mean accuracy gains and 17 have positive confidence intervals, while all boosted stacks reduce trainable parameter count by approximately 72-95%. Best-model comparisons show empirical Pareto improvements on 7 of 10 completed PDE benchmarks, including two-dimensional Navier-Stokes, shallow-water dynamics, Darcy flow, one-dimensional transport and reaction systems, and three-dimensional compressible Navier-Stokes. These results show that Operator Boosting often improves the empirical accuracy-parameter Pareto frontier of neural PDE surrogates, while also exposing PDE- and architecture-dependent regimes where residual boosting fails to offset compression.
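The stagewise recipe in the abstract (start from the mean predictor, fit a tiny learner to the residual, add it with a shrinkage chosen on validation data) can be sketched on scalar data. This is a toy under stated assumptions: a decision stump stands in for the paper's tiny neural operators, and the function names, shrinkage grid, and data are hypothetical.

```python
# Sketch of stagewise residual boosting with validation-selected shrinkage.
# A decision stump replaces the tiny neural operators used in the paper.

def fit_stump(xs, residuals):
    """Tiny weak learner: best single-threshold piecewise-constant fit to the residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:  # the largest x would leave the right side empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def boost(train, val, n_stages=5, shrinkages=(0.0, 0.25, 0.5, 1.0)):
    (xt, yt), (xv, yv) = train, val
    mean = sum(yt) / len(yt)  # stage 0: empirical mean predictor
    stages = []               # accepted (shrinkage, learner) pairs

    def predict(x):
        return mean + sum(s * f(x) for s, f in stages)

    for _ in range(n_stages):
        residuals = [y - predict(x) for x, y in zip(xt, yt)]
        learner = fit_stump(xt, residuals)

        def val_error(s):
            # Try the candidate shrinkage, measure validation error, then undo it.
            stages.append((s, learner))
            err = mse(predict, xv, yv)
            stages.pop()
            return err

        stages.append((min(shrinkages, key=val_error), learner))
    return predict


train = ([float(i) for i in range(8)], [float(i * i) for i in range(8)])
val = ([0.5, 2.5, 4.5, 6.5], [0.25, 6.25, 20.25, 42.25])
model = boost(train, val)
```

Including 0.0 in the shrinkage grid lets a stage be effectively rejected when it does not help on validation data, so the validation error never increases from one stage to the next in this sketch.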
[LG-43] Credibility-Weighted Pricing of Autonomous Vehicle Liability Under Operational Design Domain Shift
Link: https://arxiv.org/abs/2606.17451
Authors: Doyeon Jang
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
*Comments:
Abstract:Automated Driving System deployments create a foundational ratemaking challenge: sparse experience, shifting operational design domains, and non-stationary risk across software releases. We propose a hierarchical Bayesian credibility framework pooling across cities, software versions, and territories via a learned ODD-similarity kernel, nesting Bühlmann-Straub as a limiting case. Demonstrated on 648 verified-engaged Waymo crashes across four U.S. metros from the NHTSA Standing General Order database against 116 million matched miles, city-aggregate credibility weights are moderate (0.12-0.46), partial pooling decisively outperforms no pooling, and a power analysis shows the learned kernel’s advantage becomes detectable at approximately twelve deployed cities.
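For readers unfamiliar with the limiting case named above, the classical Bühlmann-Straub estimate blends a group's own experience with the pooled mean using a credibility weight Z = w / (w + k), where w is the group's exposure and k is the ratio of expected process variance to the variance of the hypothetical means. The sketch below shows only this classical formula, not the paper's hierarchical model; the function names and all numbers are hypothetical.

```python
# Sketch of the classical Buhlmann-Straub credibility estimate (toy numbers).

def credibility_weight(exposure, k):
    """Z = w / (w + k): more exposure means more trust in the group's own experience."""
    return exposure / (exposure + k)


def credibility_rate(own_rate, pooled_rate, exposure, k):
    """Blend the group's own rate with the pooled rate using the credibility weight."""
    z = credibility_weight(exposure, k)
    return z * own_rate + (1.0 - z) * pooled_rate


# Hypothetical city: 30 units of exposure, k = 70, own rate 5.0 vs pooled rate 8.0.
z = credibility_weight(30.0, 70.0)
rate = credibility_rate(5.0, 8.0, 30.0, 70.0)
```

With sparse exposure the weight stays well below 1, so the estimate is pulled toward the pooled rate; this is the partial-pooling behavior the abstract contrasts with no pooling.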
[LG-44] Toward Controllable Catalyst Inverse Design via Large-Scale Autoregressive Pretraining
Link: https://arxiv.org/abs/2606.17445
Authors: Dong Hyeon Mok,Jonggeol Na,Seoin Back
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Chemical Physics (physics.chem-ph)
*Comments:
Abstract:Inverse design of heterogeneous catalysts remains challenging because catalyst surfaces exhibit substantial structural complexity with coupled surface-adsorbate interactions across a vast chemical space that is difficult to explore efficiently through conventional screening alone. Although machine learning-based high-throughput screening has accelerated catalyst discovery, its efficiency inevitably declines as the search space grows, motivating the development of generative models that can directly construct catalysts with target properties. Here, we present a conditional catalyst generative model based on the Generative Pretrained Transformer architecture with a numerical embedding layer that enables the generation of catalyst structures conditioned on both categorical and continuous properties within a single autoregressive framework. The model was pretrained on 133 million catalyst structures and subsequently fine-tuned on approximately 460,000 optimized structures with associated categorical properties and binding energies for conditional generation. The resulting model achieved 98% structural validity, 95% optimization validity, and high categorical condition fidelity, with a 93% joint match rate for adsorbate type and composition. For binding energy conditioning, the match rate of approximately 20% represents a four-fold improvement over the baseline training distribution, and the generated distributions shift systematically toward the target values, enabling a 1.5- to 4-fold improvement in screening efficiency for reaction-targeted catalyst discovery without additional fine-tuning. These results show that large-scale autoregressive pretraining, combined with explicit property conditioning, provides a practical route toward controllable catalyst generation and accelerated catalyst discovery.
[LG-45] MorphStrata: Layer-Specific Perturbations for Generating Morphence Students in Time-Series Moving Target Defense
Link: https://arxiv.org/abs/2606.17435
Authors: Abhishek Bhardwaj, Arnav Doshi, Anusri Nagarajan, Thanh Quynh Nhu Ta, Mohammad Masum, Robert Chun, Jaydip Sen, Saptarshi Sengupta
Subjects: Machine Learning (cs.LG)
Comments: 13 pages, 9 figures, 11 tables
Abstract:Time-series forecasting models remain vulnerable to gradient-based adversarial attacks while existing defense mechanisms typically incur a trade-off in robustness for bounded response and compute cost. The problem is pronounced in Moving Target Defense where maintaining multiple randomized model instances substantially exacerbates the training overhead. In this work, we introduce MorphStrata, a student generation strategy with selective, layer-specific stochastic noise injection that extends the traditional Morphence defense. MorphStrata uses a Transformer backbone as the teacher and perturbs randomly selected architectural blocks to create structured heterogeneity across student models in response to varied data distributions and threat models. We evaluate against vanilla Transformer and Morphence backbones on a suite of benchmarks including the Jena Climate, Electricity Load Diagrams, and Appliances Energy Prediction using FGSM, BIM and PGD attacks across multiple attack strengths. Across datasets and attack regimes, the proposed ensemble maintains comparable adversarial RMSE. Specifically, for high entropy, periodic datasets as in the case of the AEP data, MorphStrata achieves the lowest RMSE across all attacks and perturbation budgets, improving over the static baseline by up to 24.11% and 97.97% under FGSM and BIM respectively at an epsilon value of 0.5 over 30 randomized trials. Targeting the layers to generate MorphStrata students accounts for a less than 1% increase in training time over the Morphence MTD baseline for most of the experiments, while accounting for double-digit gains in adversarial RMSE reduction. We also observe a positive correlation between higher pairwise L2 distance (among generated students) and overall defense effectiveness. In summary, MorphStrata maintains adversarial robustness as an MTD defense at marginal cost deltas when compared to existing baselines.
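For readers who want a concrete picture of the FGSM and BIM attacks named in the abstract, here is a minimal NumPy sketch. It is not the paper's evaluation code: the linear forecaster, the weights `w`, and the step size `alpha` are hypothetical stand-ins chosen only so the gradient can be written by hand.

```python
import numpy as np

def fgsm(x, grad, eps):
    """One-step FGSM: shift every input coordinate by eps along the sign of the loss gradient."""
    return x + eps * np.sign(grad)

def bim(x, grad_fn, eps, alpha, steps):
    """BIM: repeat small signed-gradient steps, clipping back into the eps-ball around x."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv

# Toy linear forecaster y_hat = w @ x with squared-error loss (assumption, not the paper's model).
w = np.array([0.5, -0.25, 0.25])
x = np.array([1.0, 2.0, 3.0])
y = 1.0

def loss(x_in):
    return (w @ x_in - y) ** 2

def loss_grad(x_in):
    # d/dx (w @ x - y)^2 = 2 (w @ x - y) w
    return 2.0 * (w @ x_in - y) * w

x_fgsm = fgsm(x, loss_grad(x), eps=0.5)
x_bim = bim(x, loss_grad, eps=0.5, alpha=0.1, steps=10)
print(loss(x), loss(x_fgsm), loss(x_bim))
```

Both attacks stay within an L-infinity budget of `eps` around the clean input; a moving target defense would evaluate the same perturbed input against a randomly chosen student model rather than a single fixed one.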
[LG-46] Generalization Guarantees for Multi-Input Neural Operator Learning in Sobolev Spaces
Link: https://arxiv.org/abs/2606.17419
Authors: Yahong Yang, Zecheng Zhang, Wei Zhu, Wenjing Liao, Hao Liu
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:
Abstract:We develop approximation and generalization error estimates for multi-input neural operators, with the output error measured in Sobolev norms. In contrast to standard operator-learning settings with a single input function, our framework allows multiple input functions defined on possibly different domains, with different dimensions and Sobolev regularities. The derived rates explicitly quantify the contribution of each input space to the final error bound. In particular, in the balanced regime, the approximation and generalization rates are governed by the interaction between the input dimensions, regularities, and Sobolev orders, while the dependence on the model complexity retains a $(\log\log/\log)$-type structure. Our analysis provides a general theoretical framework for multi-input operator learning, including Sobolev training, and is applicable to operator learning problems arising from partial differential equations and scientific computing.
[LG-47] A Closer Look at Failure Modes in Temporal Understanding of Large Audio-Language Models INTERSPEECH2026
Link: https://arxiv.org/abs/2606.17417
Authors: Apoorva Kulkarni, Kaousheik Jayakumar, Sreyan Ghosh, Sarah Wiegreffe, Dinesh Manocha, Ramani Duraiswami
Subjects: Sound (cs.SD); Machine Learning (cs.LG)
Comments: Accepted to Interspeech 2026
Abstract:Large Audio Language Models (LALMs) achieve strong performance on a variety of audio understanding tasks but continue to struggle with temporal reasoning, a fundamental capability central to human auditory perception. Understanding the causes of these failures remains challenging as existing benchmarks report performance gaps without probing underlying mechanisms. To address this, we introduce a benchmark with 1,657 questions across three foundational tasks designed specifically for mechanistic analysis. Examining model outputs across varying input settings (behavioral analysis) reveals that models often under-utilize audio when textual cues are available. We also provide the first causal mechanistic analysis of temporal reasoning failures in LALMs. Comparing attention upweighting against scaling, we find that redistributing attention across audio tokens is more effective than increasing audio attention. Targeting task-relevant tokens yields further gains. These findings suggest that modality imbalance alone cannot explain failures. Attention scaling at bottleneck layers improves accuracy from 55.9% to 59.1% without fine-tuning, demonstrating a promising direction for future work.
[LG-48] Memory-Efficient Meta-Reinforcement Learning for Adaptive Safety-Critical Control in Adversarial Spacecraft Proximity Operations
Link: https://arxiv.org/abs/2606.17414
Authors: Alejandro Posadas-Nava, Richard Linares, Minduli Wijayatunga
Subjects: Machine Learning (cs.LG); Dynamical Systems (math.DS)
Comments:
Abstract:Autonomous spacecraft rendezvous and proximity operations (RPO) require controllers that guarantee safety under thrust constraints while minimizing fuel expenditure. Input-constrained control barrier functions (ICCBFs) provide a control method for nonlinear systems with actuation constraints that construct a forward-invariant safe set. Previous work has shown that learning class-$\mathcal{K}$ functions defining the ICCBF recursion via meta reinforcement learning (meta-RL) yields a robust, non-greedy approach to safety-critical control in RPO. This paper extends that framework further by investigating the performance of three recurrent network architectures (Long Short Term Memory (LSTM), Gated Recurrent Unit (GRU), Selective State Space Model (Mamba)) and two training algorithms (Proximal Policy Optimization (PPO) and Soft Actor Critic (SAC)) to identify the best setup for tuning ICCBF class-$\mathcal{K}$ functions via meta-RL. In addition to cooperative test cases, performance is evaluated in the presence of adversarial behavior where the target spacecraft behaves in a way that worsens the safety of the chaser spacecraft. Results indicate that state space models such as Mamba when used with PPO achieve superior task completion, safety, and fuel-savings compared to other architectures, across all cooperative and uncooperative scenarios tested.
[LG-49] Amortized Probabilistic Retrieval of Atmospheric CO2 from OCO-2 Spectra Using Deep Learning with Laplace Approximations and Normalizing Flows
Link: https://arxiv.org/abs/2606.17413
Authors: Alejandro Calle-Saldarriaga, Felix Jimenez, Jack Grosskreuz, Jiazheng Wang, Jonathan Hobbs, Matthias Katzfuss
Subjects: Machine Learning (cs.LG); Applications (stat.AP)
Comments: 23 pages, 8 figures
Abstract:Space-based monitoring of atmospheric carbon dioxide (CO2) is essential for constraining the global carbon budget. NASA’s Orbiting Carbon Observatory-2 (OCO-2) estimates column-averaged dry-air mole fractions of CO2 (XCO2) using high-resolution spectra. However, current operational retrieval algorithms are computationally expensive and do not properly quantify uncertainties. We present a novel deep learning framework that addresses these challenges. Because ground-truth data are difficult to obtain for real satellite observations, we develop and validate our approach using a high-fidelity simulation dataset. This dataset, created to support OCO-2 uncertainty quantification (UQ), incorporates realistic forward model errors. Our architecture encodes spectral bands using a multi-branch neural network and estimates posteriors of the full CO2 column or desired summaries thereof using two scalable UQ methods: Laplace approximations and normalizing flows. Our approach has five key advantages relative to operational “full-physics” solvers: (1) Amortization: Inference is orders of magnitude faster, enabling real-time processing of massive data streams; (2) Model error robustness: By training on simulations that explicitly include model discrepancies, our method accounts for systematic errors often neglected by standard inversions; (3) Point estimate accuracy: We achieve superior predictive accuracy compared to baseline methods; (4) Improved UQ: The probabilistic outputs yield better-calibrated uncertainty estimates; and (5) Non-Gaussian posteriors: When utilizing normalizing flows, our framework successfully models complex, asymmetric posterior distributions, overcoming the limitations of the Gaussian assumption. These results suggest that simulation-based deep learning is a viable path toward next-generation operational processing systems.
[LG-50] Damage Adaptation in Seconds for Architected Materials
Link: https://arxiv.org/abs/2606.17394
Authors: James Avtges, Jake Ketchum, Helena Young, Taekyoung Kim, Ryan Truby, Todd Murphey
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: Proceedings of Robotics: Science and Systems
Abstract:Adaptation to damage and in-situ physical repairs is essential for long-term robot autonomy, yet challenging outside of narrowly defined and well-anticipated bounds. In this work we proprioceptively adapt to catastrophic damage in soft-actuated systems in under one minute. Architected materials are well equipped for adaptation: actuator failure occurs gradually rather than acutely, and damage can be described in a low-dimensional, discrete coordinate space. Surprisingly, latent damage representations plus a simple yet robust ensemble method are sufficient for adapting to unseen damage in real-time. Moreover, we identify conditions under which exponential sample complexity collapses to linear sample complexity for learned representations of architected materials, a concrete advantage over rigid components or continuum soft mechanisms. We demonstrate LEAP, our method for adaptive proprioception, via a tracing task for a 6DoF soft wrist based on Handed Shearing Auxetic (HSA) actuators. Our algorithm is able to adapt to cuts, burns, and actuator repairs, enabling simulation-free real-time adaptation that is critical for realizing the promise of soft robots outside the lab. Videos and more information are available at this https URL.
[LG-51] Performance-Driven Environment Abstraction with Multi-Timescale Learning
Link: https://arxiv.org/abs/2606.17377
Authors: Yue Guan, Dipankar Maity, Panagiotis Tsiotras
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:
Abstract:We study performance-driven environment abstraction for decision-making in large Markov decision processes. Rather than preserving geometric or topological structure, we seek abstractions that directly optimize decision quality. We model abstraction as a controlled approximation obtained by aggregating the state space and enforcing a shared action distribution within each aggregated state. For a fixed partition, we establish a performance guarantee that separates value-function approximation error from the loss introduced by action sharing. Guided by this analysis, we develop a multi-timescale reinforcement learning framework that jointly adapts the policy and a tree-structured environment abstraction. The resulting algorithm refines and coarsens regions of the state space based on Q-value discrepancies, balancing performance against abstraction size and complexity. Empirical results demonstrate substantial state compression, improved sample efficiency, and faster replanning compared to actor-critic baselines.
[LG-52] Decision-Driven Geosteering Under Uncertainty: A Unified Framework for Sequential Decision Optimization
Link: https://arxiv.org/abs/2606.17331
Authors: Hibat Errahmen Djecta, Sergey Alyaev, Kristian Fossum, Reidar B. Bratvold, Ressi Bonti Muhammad, Apoorv Srivastava
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Geosteering requires navigating a well trajectory through an unknown geological configuration, while sequentially updating decisions based on indirect measurements acquired during drilling. This work presents an uncertainty-aware geosteering framework that tightly integrates particle filtering for probabilistic subsurface interpretation with value-based reinforcement learning for sequential decision-making. Geological uncertainty ahead of the drill bit is represented explicitly through a particle filter (PF), enabling belief-informed control rather than deterministic trajectory correction. The framework couples PF belief updates with belief-informed decision policies and evaluates three decision-making options that operate under identical uncertainty representations: an interpretable Approximate Dynamic Programming (ADP) scheme, a Deep Q-learning baseline, and a Dual Deep Reinforcement Learning (Dual DRL) architecture trained with a target Q-network scheme for stability, using a dueling (value/advantage) decomposition for Q-value parameterization. Beyond final placement performance, we assess policy behavior using stability-oriented metrics that quantify steering smoothness over time, providing additional operational insight into how decision policies respond as uncertainty evolves. The framework is integrated with an API for validation within an industrial geosteering simulator under realistic measurement noise and drilling constraints. Using identical geological realizations, operational limits, and reward definitions across methods, the experiments provide a controlled and high-fidelity evaluation of how alternative decision policies behave throughout the drilling process, rather than evaluating performance solely from the final well trajectory. 
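The particle-filter belief update that the abstract couples to the decision policies can be sketched in a few lines. This is a generic bootstrap-style update under stated assumptions, not the authors' implementation: the scalar "boundary depth" state, the Gaussian measurement model, and all numeric values are hypothetical.

```python
import numpy as np

def pf_update(particles, weights, observation, obs_std, rng):
    """Reweight particles by a Gaussian likelihood, then resample (multinomial)."""
    # Log-likelihood of the observation under each particle's hypothesised value.
    log_lik = -0.5 * ((observation - particles) / obs_std) ** 2
    w = weights * np.exp(log_lik - log_lik.max())  # subtract the max for numerical stability
    w /= w.sum()
    # Resample so particles concentrate where the posterior has mass.
    idx = rng.choice(len(particles), size=len(particles), p=w)
    uniform = np.full(len(particles), 1.0 / len(particles))
    return particles[idx], uniform, w

rng = np.random.default_rng(0)
# Hypothetical prior belief over a layer-boundary depth (metres) ahead of the bit.
particles = rng.normal(loc=100.0, scale=5.0, size=2000)
weights = np.full(particles.size, 1.0 / particles.size)

posterior, post_weights, w = pf_update(
    particles, weights, observation=104.0, obs_std=1.0, rng=rng
)
print(particles.mean(), posterior.mean(), posterior.std())
```

A belief-informed policy, whether ADP or a Q-network, would then take a summary of `posterior` (for example its mean and spread) as input instead of a single deterministic estimate.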
[LG-53] Turning music identification into a neural forward pass
Link: https://arxiv.org/abs/2606.17301
Authors: Muhammad Taimoor Haseeb, Ahmad Hammoudeh, Gus Xia
Subjects: Sound (cs.SD); Machine Learning (cs.LG)
Comments:
Abstract:Search, a foundational operation in computer science, maps a query to a matching item in a collection. It is typically implemented as a System-2 like, rule-based pipeline in which a key is computed, an index is probed, and candidates are verified. By contrast, human recognition resembles a System-1 like, associative model of identity recovery, in which even partial cues can trigger a recall without explicitly enumerating, ranking, or even accessing discrete candidates. Here, we show that music sound identification, a difficult search problem, can be performed in a single neural feed-forward pass by a generative transformer. Trained on an audio dataset, the model predicts the corresponding track identifier from a short audio excerpt. This approach surpasses state-of-the-art acoustic fingerprinting, with the largest gains for short audio segments (1 second), demonstrating the method is not only viable but advantageous. Moreover, it reduces external storage to 0.33% of the baseline footprint and improves inference latency by 2.3x (p95). Furthermore, the model can reject queries for unseen tracks, supporting open-set operation while reducing misattribution risk. Using music track identification as an example, this work reframes search, bringing it closer in spirit to human associative recognition and away from algorithmic database lookup.
[LG-54] VISTA: Scale-Aware Visual Navigation via Action History Conditioning
Link: https://arxiv.org/abs/2606.17294
Authors: Maeva Guerrier, Koki Kobayashi, Simon Roy, Jana Pavlasek, Giovanni Beltrame
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:
Abstract:Vision Navigation Foundation Models (VNMs) promise end-to-end learned navigation policies capable of zero-shot deployment across diverse embodiments and environments. To maintain generality, many vision-based navigation models predict normalized actions. However, this normalization introduces a critical deployment vulnerability: applying different scaling factors to the same normalized trajectory alters its physical geometry, which degrades navigation performance and increases collision risks. We address this vulnerability by conditioning the model on normalized action histories alongside image observations, providing explicit context on the relationship between the model’s predictions and the robot’s actual physical displacement. Furthermore, current VNMs often struggle in visually repetitive environments that lack distinct features. To resolve this issue, we integrate a DINOv3 encoder, whose richer representations enable our model to capture both spatial and geometric dimensions between observations. VISTA generalizes robustly to out-of-distribution environments, achieving 100% goal prediction accuracy in zero-shot, real-world deployment in Outdoor, Forest and Office settings, and an average of 95% checkpoints crossed, demonstrating consistent path following in unseen environments.
[LG-55] From Compression to Deployment: Real-Time and Energy-Efficient FastGRNN on Ultra-Constrained Microcontrollers FAST
Link: https://arxiv.org/abs/2606.17249
Authors: Emre Can Kizilates
Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Signal Processing (eess.SP)
Comments: 14 pages, 8 figures. Code: this https URL
Abstract:The dominant trajectory of modern machine learning has been to scale up: larger models, larger accelerators, larger memory budgets. Yet a multi-year global semiconductor supply constraint and the growing energy and carbon cost of always-online inference expose the fragility of this trajectory and motivate the opposite direction: refactoring AI and ML algorithms to fit the small, ubiquitous microcontrollers already in mass production in wearables, sensors, and edge appliances. We present an end-to-end open-source reproduction of FastGRNN, a compact gated recurrent cell, deployed on two bare-metal targets: the 8-bit Arduino (ATmega328P) and the 16-bit MSP430 (no hardware multiplier; 16 KB Flash; 512 B SRAM). Our compression pipeline combines low-rank weight factorization, iterative hard-thresholding sparsity, and per-tensor Q15 post-training quantization with explicit activation calibration. The deployed model occupies 566 bytes of weights and achieves macro F1 = 0.918 (seed 0; five-seed Q15 mean 0.853±0.107) on the HAPT test set. It matches a PyTorch reference at 100% prediction agreement across 3,399 test windows (MCU seed 0; 99.91-100% C-equivalent across five seeds). Both platforms sustain real-time 50 Hz streaming inference (9.21 ms per sample on Arduino; 13 ms on MSP430), where a 256-entry sigmoid/tanh look-up table delivers a 30.5x speedup on the multiplier-less MSP430. Four contributions extend the original FastGRNN paper: (i) cross-platform bit-equivalent deterministic inference; (ii) characterization of recurrent warm-up latency (median 74 samples, 1.48 s; worst-case 125 samples, 2.50 s over 100 test windows); (iii) a deployable look-up-table recipe for multiplier-less embedded targets; and (iv) hardware energy characterization showing 17.7 mW active inference power, 0.09 mW idle power, and 96.7% energy reduction with the LUT.
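To make the Q15 and look-up-table ideas in this abstract concrete, here is a small Python sketch of fixed-point quantization and a 256-entry sigmoid table. It is illustrative only: the input range of [-8, 8), the nearest-bin lookup, and the helper names are assumptions, and the per-tensor scaling and activation calibration mentioned in the abstract are not reproduced. The last two lines check the warm-up latency arithmetic stated in the abstract (samples divided by the 50 Hz rate).

```python
import numpy as np

Q15_SCALE = 1 << 15  # Q15: 1 sign bit, 15 fractional bits, representable range [-1, 1)

def to_q15(x):
    """Quantize floats to int16 Q15, saturating at the format limits."""
    q = np.round(np.asarray(x, dtype=np.float64) * Q15_SCALE)
    return np.clip(q, -Q15_SCALE, Q15_SCALE - 1).astype(np.int16)

def from_q15(q):
    """Convert Q15 integers back to floats."""
    return np.asarray(q, dtype=np.float64) / Q15_SCALE

# 256-entry sigmoid table sampled at bin centres over an assumed input range.
LUT_SIZE, X_MIN, X_MAX = 256, -8.0, 8.0
_grid = X_MIN + (np.arange(LUT_SIZE) + 0.5) * (X_MAX - X_MIN) / LUT_SIZE
SIGMOID_LUT = to_q15(1.0 / (1.0 + np.exp(-_grid)))

def sigmoid_lut(x):
    """Nearest-bin table lookup; inputs outside the table range saturate."""
    idx = np.floor((np.asarray(x) - X_MIN) * LUT_SIZE / (X_MAX - X_MIN)).astype(int)
    return from_q15(SIGMOID_LUT[np.clip(idx, 0, LUT_SIZE - 1)])

xs = np.linspace(-10, 10, 1001)
max_err = np.max(np.abs(sigmoid_lut(xs) - 1.0 / (1.0 + np.exp(-xs))))
print(max_err)

# Warm-up latency at 50 Hz, using the sample counts quoted in the abstract.
warmup_median_s = 74 / 50
warmup_worst_s = 125 / 50
```

On a target without a hardware multiplier, replacing the exponential with one index computation and one table read is where a speedup of the kind reported would come from; the table itself costs 512 bytes at int16, which is why its placement (Flash versus SRAM) matters on such small devices.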
[LG-56] Uncertainty Quantification of Engineering Structures by Polynomial Chaos Expansion and Multivariate Active Learning
Link: https://arxiv.org/abs/2606.17233
Authors: Qitian Lu, Jafar Jafari-Asl, Panagiotis Spyridis, Lukas Novak
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:In many engineering applications, a single high-fidelity model produces multiple quantities of interest (QoIs) under the same input parameters, e.g. finite element models of complex physical systems. To alleviate the high computational cost of direct model evaluations, surrogate models are widely used to construct efficient approximations of model responses. Naturally, the accuracy of surrogates strongly depends on the quality of the experimental design (ED). However, a single ED may not provide an adequate representation for all outputs simultaneously, especially when different outputs exhibit varying sensitivities to the input variables. A straightforward solution is to perform separate sampling for each output, but this results in increased sampling complexity and computational cost. From a statistical perspective, such an approach also ignores potential correlations among all outputs and may compromise data consistency. To address this issue, an adaptive sequential sampling method for constructing polynomial chaos expansion surrogate models is generalized for vector valued QoIs. The method sequentially selects new samples from a candidate pool based on their local contribution to the output variance, while balancing distance-based exploration of the input space and exploitation of aggregated variance information across all outputs. Its performance is compared with non-sequential Latin Hypercube Sampling through several numerical examples from engineering problems. Numerical results demonstrate that the proposed strategy improves both surrogate accuracy and stability, and provides a more reliable estimation of second-order statistics.
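As background for the abstract's setting, the sketch below fits a one-dimensional polynomial chaos expansion to two outputs that share a single experimental design and reads the mean and variance off the coefficients. It is a simplified illustration, not the paper's method: the design is plain Monte Carlo rather than the adaptive sequential sampling described, and the two toy outputs are hypothetical.

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as He

def fit_pce(x, y, degree):
    """Least-squares PCE coefficients in the probabilists' Hermite basis He_k."""
    basis = He.hermevander(x, degree)  # columns: He_0(x), ..., He_degree(x)
    coeffs, *_ = np.linalg.lstsq(basis, y, rcond=None)
    return coeffs

def pce_mean_var(coeffs):
    """For X ~ N(0, 1), E[He_j He_k] = k! if j == k, else 0."""
    norms = np.array([math.factorial(k) for k in range(len(coeffs))], dtype=float)
    mean = coeffs[0]
    var = np.sum(coeffs[1:] ** 2 * norms[1:])
    return mean, var

rng = np.random.default_rng(0)
x = rng.standard_normal(50)                   # shared experimental design (assumption: plain sampling)
Y = np.column_stack([x ** 2, 3.0 * x + 1.0])  # two quantities of interest from the same model runs

stats = [pce_mean_var(fit_pce(x, Y[:, j], degree=3)) for j in range(Y.shape[1])]
print(stats)
```

Because both toy outputs are polynomials of degree at most 3, the fit is exact here; with a real finite element model the choice of design points determines how well each output is captured, which is the problem the multivariate active learning scheme targets.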
[LG-57] Sum-of-Squares Degree Barriers for the Reweighted-Hinge Method in Robust Halfspace Learning: A Christoffel-Function Characterization
Link: https://arxiv.org/abs/2606.17215
Authors: Xiaoyu Li
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
Comments:
Abstract:A certificate that removes outliers sees the data only through its low-degree moments, and an adversary exploits exactly this, hiding corruption where the clean data already looks typical, in the blind spot no bounded-degree test resolves. That blind spot turns out to have an exact size: the Christoffel function of the clean marginal, the very quantity modern data analysis thresholds to detect outliers, here read from the adversary’s side as the corruption a bounded-degree certificate cannot remove. We turn this inversion into the organizing principle of the reweighted-hinge approach to robustly learning $\gamma$-margin halfspaces under malicious noise (Shen, 2025; Zeng and Shen, 2025): the governing resource is the Sum-of-Squares degree of the outlier-removal certificate, and the resolution principle states that the maximal corruption mass which can hide at a center $c$ from a degree-$2t$ certificate is exactly the Christoffel function $\lambda_{t+1}(c)$ of the clean marginal. Three consequences follow, all against the certificate method (not information-theoretic). A margin-degree tradeoff: certifying the dense pancake to error $\epsilon$ costs SoS degree $\Omega(\log(1/\epsilon))$ or margin $\Omega(\sqrt{\log(1/\epsilon)}/\sqrt{d})$, explaining why the $\log(1/\epsilon)$ margin Shen (2025) records is forced, with a weighted-Chebyshev reduction making the threshold $2t=\Theta((|c|/s)^2)$ tight modulo one classical weighted-extremal estimate. A degree-$2$ outlier barrier: the resolution principle realized as an explicit instance on which degree 2 is stuck at $\eta^{1/2}$ while degree 4 escapes, locating the method’s small breakdown rate in the degree, not the analysis. And a degree-$2t$ algorithm tracing the frontier $\eta^{1-1/(2t)}$ (recovering Shen (2025) at $t=1$), whose gain is an explicit constant, capped by the pancake density and shown unimprovable by the degree-$2$ barrier.
[LG-58] Constrained Diffusion Models with Primal-Dual Inference
Link: https://arxiv.org/abs/2606.17192
Authors: Samar Hadou, Yigit Berkay Uslu, Alejandro Ribeiro
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:This paper develops constrained diffusion models with primal-dual inference (PDI) to sample from optimal distributions of entropy-regularized optimization problems with average constraints. We formalize constrained sampling in the Lagrangian dual domain, where the optimal distribution takes the form of a Gibbs distribution indexed by the optimal dual variable. Rather than estimating this dual multiplier before sampling and freezing it throughout generation, PDI jointly infers the optimal primal distribution and its parametrizing dual variable. Each reverse diffusion step denoises using the score field associated with the current multiplier and then updates the multiplier through dual ascent using the estimated constraint violation of the denoised samples. To enable this conditional score field, we train a single dual-conditioned score network over the family of Gibbs distributions induced by the dual variables encountered during inference. We prove that the time average of the dual variables generated along the inference trajectory converges to a neighborhood of the dual optimum and bound the effect of residual dual mismatch on the terminal distribution through schedule-dependent stability factors. We evaluate PDI on constrained sampling from a mixture of Gaussians, wireless resource allocation, and portfolio management.
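The alternation described in the abstract (sample under the current multiplier, then update the multiplier by dual ascent on the estimated constraint violation) can be illustrated on a toy problem where the Gibbs distribution is available in closed form. This is a sketch under strong assumptions, not the paper's PDI algorithm: there is no diffusion model or score network, the regularizer is taken to be a KL term toward a standard normal base, and the constraint, step size, and batch size are hypothetical.

```python
import numpy as np

def dual_ascent_sampling(b, steps=500, batch=256, lr=0.05, seed=0):
    """Toy primal-dual loop for the constraint E[x] >= b with base density N(0, 1).

    With a KL regularizer toward the base, the optimum is the tilted (Gibbs) density
    p_lam(x) ∝ N(x; 0, 1) * exp(lam * x) = N(lam, 1), so sampling under the current
    multiplier is exact here and stands in for a learned reverse-diffusion step.
    """
    rng = np.random.default_rng(seed)
    lam, history = 0.0, []
    for _ in range(steps):
        samples = rng.normal(loc=lam, scale=1.0, size=batch)  # primal step under current lam
        violation = b - samples.mean()                        # estimated constraint violation
        lam = max(0.0, lam + lr * violation)                  # projected dual ascent
        history.append(lam)
    # Return the last multiplier and its time average over the second half of the run.
    return lam, float(np.mean(history[steps // 2:]))

lam_final, lam_avg = dual_ascent_sampling(b=1.5)
print(lam_final, lam_avg)
```

In this toy case the optimal multiplier equals `b`, and the time-averaged multiplier settles near it while individual iterates keep fluctuating with the sampling noise, which mirrors the abstract's statement that convergence holds for the time average and only to a neighborhood of the dual optimum.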
[LG-59] Finsler Geometry Graph Neural Networks and You
Link: https://arxiv.org/abs/2606.17185
Authors: T. Mitchell Roddenberry, Richard G. Baraniuk
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Differential Geometry (math.DG); Machine Learning (stat.ML)
Comments:
Abstract:Graph neural network architectures based on the graph Laplacian approximate the Laplace-Beltrami operator, thus limiting their application to isotropic operators. As a nonlinear alternative to the Laplace-Beltrami operator, we consider estimates of the Finsler Laplacian on point clouds sampled from a manifold. We prove that these discrete estimates converge to the true operator on the manifold as the number of point samples grows. Moreover, we show that this operator can be expressed as a graph neural network layer, which we use to define a family of Finslerian graph neural networks constrained to express Finsler geometry. We show that Finslerian graph neural networks recover the geometry underlying nonlinear diffusion equations in practice.
[LG-60] Towards Fast GNN Surrogates for CO2 Migration in Complex Geological Formations
Link: https://arxiv.org/abs/2606.17180
Authors: Rodrigo S. Luna, Thiago H. N. Coelho, Luiz S. L. Neto, Roberto M. Velho, Adriano M. A. Cortes, Renato N. Elias, Alexandre G. Evsukoff, Fernando A. Rochinha, Mauricio Araya-Polo, Herve Gross, Alvaro L. G. A. Coutinho
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:This chapter discusses how a data-driven machine learning approach can reproduce key aspects of the physical behavior of multiphase flows in complex geological formations. We propose an end-to-end graph neural surrogate tailored to CO2 plume migration forecasting in geological storage. The method is evaluated on the SPE11A benchmark, a well-known industry test case designed to assess CO2 storage scenarios and characterized by sharp gas-water interfaces, strong advective transport, and rapid convective mixing with fingering development. The benchmark is reformulated as a graph in which nodes represent computational cells and edges encode transmissibility-based interactions enriched with geometric attributes. Directional transport arising from grid geometry, permeability contrasts, and geological heterogeneity is captured through an anisotropic message-passing mechanism, where interaction weights are computed via geometry-conditioned edge embeddings, biasing message aggregation toward physically relevant transport directions. Temporal evolution is modeled in latent space using an autoregressive residual formulation trained with multi-step supervision. The proposed model produces competitive forecasts of gas saturation and liquid-phase density, which are key indicators for CO2 storage monitoring, with cumulative errors that remain moderate over extended forecasting horizons.
[LG-61] Noise-Driven Escape from Metastable Phases explains Grokking in Deep Neural Networks
Link: https://arxiv.org/abs/2606.17120
Authors: Ibrahim Talha Ersoy, Karoline Wiesner
Subjects: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
Comments: 13 pages, 4 figures. Accepted at HiLD 2026: 4th Workshop on High-dimensional Learning Dynamics
Abstract:Deep neural networks (DNNs) exhibit first-order phase transitions under variations of the L2 regularization strength, with each transition marking the onset of a new learnable feature. Below a critical regularization strength, all features are in principle learnable, but coexisting metastable states, separated by energy barriers, can trap the network and impede convergence. A strength of DNNs is their ability to generalize. But many open questions remain, among them the origin of so-called grokking: the abrupt, delayed onset of generalization after prolonged apparent overfitting. We show for linear DNNs that grokking is consistent with hysteresis in first-order L2 phase transitions: using L2 regularization to engineer deliberate trapping, we demonstrate that a model in a low-accuracy metastable state escapes only when SGD noise drives it across an energy barrier, with escape times following Arrhenius scaling. We reproduce grokking-like delayed convergence across two orders of magnitude in escape time by deliberately trapping models in metastable phases. Using sparse sub-sampling we also reproduce the canonical grokking curve where test error eventually approaches the final training error. Our work suggests that the number of metastable states equals the number of learnable features (one per singular value of the data covariance), so the potential for hysteresis grows naturally with task complexity. We provide evidence that the same mechanism likely operates in general nonlinear DNNs. Our results provide routes toward more efficient learning schemes.
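For readers unfamiliar with the term, Arrhenius scaling in its textbook form says that the mean escape time grows exponentially with the ratio of barrier height to noise strength. The symbols below ($\tau_0$, $\Delta E$, $T_{\mathrm{eff}}$) are generic placeholders that do not appear in the abstract, and identifying $T_{\mathrm{eff}}$ with the SGD noise level is an assumption made here for illustration.

$$
\tau \approx \tau_0 \exp\!\left(\frac{\Delta E}{T_{\mathrm{eff}}}\right),
\qquad
\ln \tau = \ln \tau_0 + \frac{\Delta E}{T_{\mathrm{eff}}}.
$$

Under this form, and with $\tau_0$ held fixed, a span of two orders of magnitude in escape time corresponds to $\Delta E / T_{\mathrm{eff}}$ varying by $\ln 100 \approx 4.6$.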
[LG-62] Loss Landscape Poisoning: Targeted Extraction of Unseen Training Data from LLMs
Link: https://arxiv.org/abs/2606.17110
Authors: Md Abdullah Al Mamun, Ngoc Phu Doan, Pedram Zaree, Ihsen Alouani, Nael Abu-Ghazaleh
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:Large Language Models are increasingly trained on proprietary or sensitive data, from private healthcare and financial records to user conversations containing secrets. Ensuring the privacy of such data against extraction attacks has become a central concern. In this paper, we ask whether an attacker who can poison a portion of the training data can facilitate the leakage of a separate target record they have no access to. We answer in the affirmative and show that such leakage can be induced by a poisoning mechanism that reshapes the model’s local loss landscape around the target completion. Our key insight is that poisoning to create a sharp loss minimum at the target, surrounded by elevated loss on nearby alternatives, forces the model to memorize the target as the unique low-loss solution in its neighborhood. The attack requires no architectural changes, and generalizes across centralized and federated learning settings. We demonstrate that the attack amplifies privacy leakage across language models (up to 100% successful extraction) and vision-language models (up to 90% successful extraction). We show that the attack is thwarted when the model is trained to be differentially private. However, we introduce a new attack that directly probes the loss landscape, bypassing even differential privacy defenses.
[LG-63] Informative Missingness to Generate Irregular Clinical Time Series
Link: https://arxiv.org/abs/2606.17106
Authors: Hadi Mehdizavareh, Gabriele Santangelo, Giovanna Nicora, Simon Lebech Cichosz, Arianna Dagliati, Arijit Khan, Riccardo Bellazzi
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments:
Abstract:Laboratory tests in electronic health records are collected irregularly, and the absence of a test order can be as informative as the measurement itself. Such missingness reflects clinicians’ decisions and patient physiology, making it important to model it directly rather than treat it as a preprocessing artifact. Here we present a diffusion-based approach for generating clinical time series that jointly models laboratory values and their observation patterns using the public Data Analytics Challenge on Missing Data Imputation (DACMI) benchmark derived from MIMIC-III. To preserve realistic sampling, we align chart times into 4-hour intervals and segment admissions into 7-day windows, producing trajectories that pair each lab value with a corresponding observation indicator. Standard transformations and normalization are applied to stabilize training. Our method extends the TimeDiff framework to learn continuous lab values and discrete missingness patterns through complementary diffusion objectives. Experiments show that the generated data closely match real patient trajectories across individual lab distributions and joint value-missingness embeddings, demonstrating that diffusion models can capture clinically meaningful dependencies between patient physiology and clinicians’ testing behavior under MNAR-like (missing-not-at-random) missingness. These preliminary results indicate that our model can serve as an initial component toward developing clinical foundation models. By producing synthetic priors that preserve key physiology-missingness relationships, this work motivates the subsequent training of Prior-Data Fitted Networks capable of leveraging informative missingness, which we will investigate in the extended work.
[LG-64] Diagnosing and Repairing Shape-Prior Shortcuts in Long-Range Single-Shot Fringe Projection Profilometry
Link: https://arxiv.org/abs/2606.17093
Authors: Adam Haroon,Anush Lakshman,Cody Fleming,Beiwen Li
Subjects: Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 44 pages, 27 figures
Abstract:Learning-based single-shot fringe projection profilometry (FPP) has been studied mostly at close range. The long-range regime (standoff beyond 1 m) remains largely unaddressed: inverse-square intensity falloff lowers fringe signal-to-noise ratio and degrades physical ground truth, the single-shot problem is ill-posed because fringe-order information is absent from one image, and these architectures have not been studied mechanistically. We present a diagnose-repair-verify study using mechanistic interpretability (MI) and conformal uncertainty quantification (UQ) as convergent diagnostics: they agree on one physical failure locus, driving and verifying an architectural repair. On a photorealistic synthetic benchmark (15,600 fringe images, 50 objects at 1.5-2.1 m), a best UNet baseline reaches 14.54 mm object mean absolute error (MAE). Three probes (linear probing, Grad-CAM, flat-plane out-of-distribution test) converge: the baseline solves the task via object-boundary shape priors rather than fringe-phase decoding. We repair this with PhiCalNet, which outputs wrapped phase rather than depth and applies a fixed differentiable calibration layer mapping phase to depth, removing the shape-prior solution from the hypothesis space architecturally rather than by a loss penalty. A physics-informed loss that enforces the same physics as a soft penalty on a depth-regressing network yields no measurable gain, isolating the architecture as the operative factor. PhiCalNet reduces object MAE 3.3x to 4.46 mm; the residual is carried by 0.103% of pixels at the ±π wrap discontinuity. Pixel-wise conformal UQ confirms the diagnosis: rejecting the top 5% of object pixels by snapshot disagreement cuts PhiCalNet RMSE by 64% (from 20.6 to 7.4 mm) versus 3.5% for the baseline. MI and UQ converge on the same failure locus.
[LG-65] Finite-Time Queue Peak Laws in Stochastic Networks: Logarithmic Scaling After Geometric Thresholds
Link: https://arxiv.org/abs/2606.18218
Authors: Hao Liang,Cheng Tang,Yunzong Xu
Subjects: Probability (math.PR); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments:
Abstract:We study finite-horizon queue peaks in generalized switches, a standard stochastic-network model in which many queues share constrained service resources. Arrivals may be dependent, time-varying, and adapted to the past; the standing load condition is uniform interior slack, meaning the conditional mean arrival vector stays in a fixed contraction of the capacity region. We show that this slack reshapes the finite-time peak law for drift-minimizing scheduling policies such as MaxWeight. The square-root envelope that is sharp without slack persists only up to a geometry-dependent threshold; beyond that threshold, the running maximum grows only logarithmically with the horizon, both with high probability and in expectation. The mechanism is self-normalization: in the current queue direction, the projected fluctuation scale is normalized by the stabilizing drift scale. This removes capacity geometry from the logarithmic coefficient, while geometry remains in the threshold. Matching lower bounds show that both the logarithmic term and a geometric threshold are unavoidable. When finite-time state-space collapse is available, the threshold can be sharpened using local bottleneck geometry. For generalized input-queued switches, we obtain finite-time peak bounds with tight logarithmic coefficients. Simulations illustrate the two-phase envelope, local geometric refinements, and variance-sensitive improvements predicted by the theory.
[LG-66] A Diffusion Approximation for Temporal-Difference Learning with Linear Features under Markovian Noise
Link: https://arxiv.org/abs/2606.18183
Authors: M. Forzo,E. Monzio Compagnoni,A. Russo,A. Pacchiano
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
Comments:
Abstract:Temporal difference (TD) learning with linear function approximation is a core method for policy evaluation. Its classical continuous-time description is an ordinary differential equation (ODE), which captures the asymptotic mean dynamics but neglects stochastic fluctuations determining the error floor. We introduce a stochastic differential equation (SDE) approximation for linear TD(0) under Markovian noise. The resulting model distinguishes the contraction dynamics governed by the projected Bellman operator from the influence of Markovian sampling. As a consequence, the model explains the constant-stepsize error floor through the interaction between Markovian long-run covariance and the contraction geometry of the projected Bellman operator.
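For readers unfamiliar with the iteration being approximated, a minimal sketch of the standard linear TD(0) update with a constant step size follows. It shows the textbook recursion only, not the paper's SDE model. The two-state chain, its rewards, the one-hot features, and all names (`td0_update`, `step`) are hypothetical choices for illustration.

```python
import random
from typing import List


def td0_update(theta: List[float], phi_s: List[float], reward: float,
               phi_next: List[float], alpha: float, gamma: float) -> List[float]:
    """One linear TD(0) step: theta <- theta + alpha * delta * phi(s)."""
    v_s = sum(t * f for t, f in zip(theta, phi_s))
    v_next = sum(t * f for t, f in zip(theta, phi_next))
    td_error = reward + gamma * v_next - v_s  # delta
    return [t + alpha * td_error * f for t, f in zip(theta, phi_s)]


# Hypothetical two-state Markov chain: stay with probability 0.9,
# so consecutive samples are correlated (Markovian noise).
features = [[1.0, 0.0], [0.0, 1.0]]  # one-hot features
rewards = [1.0, 0.0]                 # reward depends on the current state


def step(state: int, rng: random.Random) -> int:
    return state if rng.random() < 0.9 else 1 - state


rng = random.Random(0)
theta = [0.0, 0.0]
state = 0
for _ in range(5000):
    nxt = step(state, rng)
    theta = td0_update(theta, features[state], rewards[state],
                       features[nxt], alpha=0.1, gamma=0.9)
    state = nxt
```

With a constant `alpha` the iterates do not settle to a point but keep fluctuating around the fixed point; that residual fluctuation is the error floor the abstract refers to.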
[LG-67] Tensor-based second-order causal discovery
Link: https://arxiv.org/abs/2606.18074
Authors: Nathan Ouyang,Kexin Wan,Anna Seigal
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 27 pages, 7 figures. Code available at this https URL
Abstract:Causal discovery seeks to uncover the causal dependencies among variables. For this purpose, we propose an algorithm called Tensor-based Second-order Causal Discovery (TSCD). Its input is a tensor obtained from the covariance matrices of observational and interventional data. Assuming the causal dependencies follow a linear structural equation model on a directed acyclic graph (DAG), TSCD outputs the DAG and the functions on its edges, requiring only that the noise variables are uncorrelated. We also implement a version of the approach for nonlinear models. Our focus on second-order statistics (via the covariance matrices) is motivated by their statistical and computational efficiency relative to higher-order moments, their identifiability relative to first-order statistics, and that they work regardless of whether the variables are Gaussian. We show that TSCD has identifiable causal order and parameters from a number of interventions that is logarithmic in the number of variables. Experiments show that TSCD is robust to noise, competitive with existing methods, and scales to hundreds of variables.
[LG-68] Fast Nonparametric Conditional Independence Testing via Two-Stage Regression
Link: https://arxiv.org/abs/2606.18011
Authors: Eric V. Strobl
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: A fast R implementation with C++ back-end is available at this https URL
Abstract:Constraint-based causal discovery relies on repeated conditional independence tests, but fast nonparametric tests often sacrifice calibration, especially when variables depend on the conditioning set through nonlinear relationships. We introduce BLITZ (Broad-to-Local Independence Testing via residualiZation), a nonparametric conditional independence test designed to run well under a second while maintaining the accuracy needed for the thousands of queries performed by constraint-based causal discovery algorithms. BLITZ first removes broad smooth dependence on the conditioning set using low-order polynomial regression, then applies a small nonlinear feature map and residualizes those features with shallow tree regressions. The resulting statistic tests residual cross-covariance, with a moment-matched chi-square approximation to the null distribution. We show theoretically that the two-stage design reduces the effective complexity faced by the tree residualizers, allowing shallow trees to control residual conditional-mean bias while avoiding excessive overfitting. In simulations, BLITZ provides better null calibration than fast kernel, random-feature, and regression-based competitors while remaining among the fastest methods tested. In causal discovery experiments on synthetic graphs and flow-cytometry data, BLITZ yields more reliable endpoint orientations among retained adjacencies and competitive structural recovery. These results suggest that broad-to-local residualization is a practical route to calibrated, scalable nonparametric conditional independence testing for causal discovery.
[LG-69] Differential Privacy of Gaussian Process Posterior Sampling
Link: https://arxiv.org/abs/2606.17995
Authors: Tomasz Maciazek
Subjects: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 8 pages of main text + 25 pages appendix
Abstract:We study the privacy of releasing posterior sample paths from a Gaussian process (GP) when the entire training set including covariates and responses is private. Unlike standard differential-privacy (DP) mechanisms that add external noise, posterior sampling is random by construction. We show that this intrinsic randomness yields DP guarantees by deriving explicit Rényi-DP bounds for GP posterior sample-path release. The bounds separate posterior-mean leakage from data-dependent posterior-covariance leakage showing that meaningful privacy depends sharply on effective ridge regularisation. We apply membership-inference attacks to show that empirical leakage follows the predicted dependence on regularisation, posterior variance and the number of released posterior sample-paths. Utility experiments on downstream posterior-sampling tasks identify noisy-observation regimes where privacy-compatible regularisation preserves useful decisions with modest utility loss. When stronger privacy is needed, the intrinsic guarantee can be sharpened by adding calibrated GP noise, providing an explicit additional privacy knob.
[LG-70] Geometrical fairness in graph neural networks
Link: https://arxiv.org/abs/2606.17684
Authors: Arturo Pérez-Peralta,Sandra Benítez-Peña,Blas Kolic,Rosa E. Lillo
Subjects: Machine Learning (stat.ML); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 32 pages, 21 tables, 6 figures
Abstract:Graph-based learning methods have become increasingly prominent due to their strong performance across diverse applications. Among these, recent frameworks grounded in diffusion processes provide a unifying perspective that extends traditional graph neural network formulations while addressing limitations of standard message-passing mechanisms. Despite these advances, concerns remain regarding the fairness of such models, as they may propagate or amplify biases present in the data. In this work, we introduce a fairness-aware adaptation of graph-based diffusion by modifying the underlying Laplacian operator. Our approach incorporates multiple complementary transformations, including subspace projections, spectral adjustments, and frequency-based filtering, to mitigate bias-related components. Leveraging the intrinsic smoothing properties of graph diffusion, we provide a principled analysis of the resulting behavior and establish theoretical insights into fairness properties. We evaluate the proposed framework on both synthetic and real-world datasets, demonstrating that it achieves competitive performance while improving fairness metrics with limited additional computational cost.
[LG-71] Public transit gains and spatially uneven travel demand changes after NYC congestion pricing
Link: https://arxiv.org/abs/2606.17530
Authors: Donghang Li,Dingyi Zhuang,Yunlin Li,Chenan Shen,Nina Cao,Yunhan Zheng,Shenhao Wang,Jinhua Zhao
Subjects: Physics and Society (physics.soc-ph); Machine Learning (cs.LG); General Economics (econ.GN); Applications (stat.AP)
Comments:
Abstract:New York City implemented the nation’s first cordon-based congestion pricing program in January 2025, providing an opportunity to evaluate how system-wide urban mobility responds to large-scale pricing interventions. Because such policies generate spillovers across modes and locations, credible control groups are difficult to construct. We address this challenge using time series foundation models to generate probabilistic counterfactual demand forecasts with calibrated uncertainty. Applying this framework to bus, subway, and aggregate trip volume data, we find that post-policy bus and subway ridership increased significantly relative to expected no-policy demand, while overall travel demand decreased modestly. The effects are spatially heterogeneous: while reductions in overall travel demand are concentrated within the Congestion Relief Zone, transit gains extend beyond Manhattan’s core. Socio-demographic analyses further reveal uneven adaptation across neighborhoods, highlighting spatial equity implications. Our framework provides a scalable approach for the uncertainty-aware evaluation of system-wide urban interventions when clean control groups are unavailable.
[LG-72] Beyond IGO-Flow: Toward Convergence Analysis of IGO in Continuous Spaces PPSN2026
Link: https://arxiv.org/abs/2606.17523
Authors: Ryosuke Kimura,Youhei Akimoto
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: Accepted at PPSN 2026
Abstract:Information-Geometric Optimization (IGO) provides a unified framework for black-box optimization by interpreting the adaptation of a search distribution as a natural gradient update. Despite its conceptual importance, the convergence theory of IGO remains limited: most existing results concern continuous-time idealizations such as the IGO flow, rather than discrete-time updates with non-infinitesimal learning rates. In this paper, we study discrete-time IGO in continuous spaces, formulated as natural gradient updates in the expectation-parameter coordinates of an exponential family. In particular, we analyze IGO over the multivariate Gaussian family on strongly convex quadratic objective functions. Our analysis covers a setting that simultaneously incorporates full covariance adaptation, a fixed positive learning rate, and quantile-based weights. In this setting, we prove that the covariance matrix converges to the zero matrix. We further show that the mean vector converges to the global optimum, provided that the condition number of the appropriately scaled covariance matrix is bounded at sufficiently frequent iterations. These results advance the convergence theory of IGO and help bridge the gap between the mathematical theory of IGO and practical covariance-adaptive search methods such as CMA-ES.
[LG-73] A Bayesian Boolean Matrix Factorization with Application to Copy Number Analysis in Cancer
Link: https://arxiv.org/abs/2606.17491
Authors: Adolphus Wagala,Mehmet Samur,Giovanni Parmigiani
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:
Abstract:Binary data factorization is common, but real-valued methods ignore discreteness and yield hard-to-interpret factors. Boolean Matrix Factorization (BooMF) instead decomposes a binary matrix into two lower-rank binary matrices via logical AND and OR, expressing the data as a Boolean disjunction of interpretable patterns. In cancer genomics, BooMF can reveal coordinated feature changes that may drive tumor evolution, unlike rotational or additive decompositions. Most existing BooMF methods are heuristic, greedy, sensitive to initialization, prone to local optima, and do not support principled model selection or uncertainty quantification. We introduce Bayesian Boolean Matrix Factorization (BBMF), a fully conjugate generative model with sparsity-inducing priors. It enforces Boolean constraints, yields interpretable latent factors with coherent uncertainty quantification, and admits Gibbs sampling with closed-form full conditionals. Because cancer evolution often involves widespread, near-simultaneous chromosome-number changes (e.g., whole-genome duplication followed by instability and selection), Boolean factorizations capture these patterns more naturally than additive models. Applied to arm-level copy-number alteration data in multiple myeloma, where entries indicate presence/absence of chromosomal-arm amplifications, BBMF finds a small set of interpretable bicliques linking patient subsets to recurrently co-altered chromosomal arms, providing a compact, biologically meaningful summary of tumor heterogeneity and demonstrating BBMF’s utility for uncovering discrete latent structure in complex binary data.
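The Boolean product that BooMF decomposes into (logical AND within a pattern, logical OR across patterns) can be sketched in a few lines. This shows only the reconstruction rule, not the Bayesian model or its Gibbs sampler. The matrices `U` and `V` and their reading as patient-by-pattern and pattern-by-arm memberships are made up for the example.

```python
from typing import List


def boolean_product(U: List[List[int]], V: List[List[int]]) -> List[List[int]]:
    """X[i][j] = OR over k of (U[i][k] AND V[k][j])."""
    n, k, m = len(U), len(V), len(V[0])
    return [
        [int(any(U[i][l] and V[l][j] for l in range(k))) for j in range(m)]
        for i in range(n)
    ]


# Hypothetical example: 3 patients x 2 patterns, 2 patterns x 3 arms.
U = [[1, 0],
     [1, 1],
     [0, 1]]
V = [[1, 1, 0],
     [0, 1, 1]]
X = boolean_product(U, V)
# The second patient carries both patterns; the shared middle arm stays 1
# under OR, whereas an additive product would give 2 there.
```

That saturation at 1 is why overlapping patterns remain interpretable as presence or absence, which an additive factorization cannot express directly.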
[LG-74] Bounded Difference Concentration for Infinitely Exchangeable Sequences with Applications to AI Benchmark Uncertainty
Link: https://arxiv.org/abs/2606.17426
Authors: Fangyuan Lin,Spencer Frei,Victor H. de la Pena
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
Comments:
Abstract:We consider the concentration properties of functions of infinitely exchangeable random variables. By conditioning on the de Finetti directing measure, we show that the deviation of any function with bounded-difference constants $c_1, \dots, c_n$ decomposes into a conditional sampling fluctuation and a latent mixture fluctuation. When this latent mixture is $\sigma_{\mathrm{mix}}^2$-subgaussian, we establish a concentration inequality with an effective variance proxy of $\frac{1}{4}\sum_i c_i^2 + \sigma_{\mathrm{mix}}^2$. Crucially, we demonstrate that for zero-sum linear contrasts, such as the difference between a subsample mean and a full population mean, the latent mixture term cancels exactly. This cancellation yields a tight, mixture-free Hoeffding-type bound that provides a direct de Finetti mechanism for the infinite-extendibility limit of recent finite-exchangeable concentration results. We apply this framework to quantify uncertainty in composite AI benchmarks, such as MMLU, where question items naturally exhibit exchangeable dependence across domains. Our results provide both a domain-stratified hierarchical model for bounding the uncertainty of accuracy scores, and a distribution-free, cost-saving statistical guarantee for accurately estimating full benchmark scores from random subsets.
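To make the variance proxy in this abstract concrete, the sketch below evaluates it and the resulting one-sided tail bound. It assumes the standard sub-Gaussian convention that a variable with variance proxy v satisfies P(X - E[X] >= t) <= exp(-t^2 / (2v)). The abstract does not give the exact form of the paper's inequality, so treat this as an illustration of the scaling only. The function names are hypothetical.

```python
import math
from typing import Sequence


def variance_proxy(c: Sequence[float], sigma_mix_sq: float = 0.0) -> float:
    """Effective proxy (1/4) * sum_i c_i^2 + sigma_mix^2."""
    return 0.25 * sum(ci * ci for ci in c) + sigma_mix_sq


def one_sided_tail_bound(t: float, proxy: float) -> float:
    """Standard sub-Gaussian tail exp(-t^2 / (2 * proxy)); assumed form."""
    return math.exp(-t * t / (2.0 * proxy))


# Mean accuracy over n = 100 items scored in [0, 1]: changing one item
# moves the mean by at most c_i = 1/n.
n = 100
c = [1.0 / n] * n
no_mixture = one_sided_tail_bound(0.1, variance_proxy(c))
with_mixture = one_sided_tail_bound(0.1, variance_proxy(c, sigma_mix_sq=0.0025))
```

With `sigma_mix_sq = 0` the expression reduces to the classical bounded-difference (McDiarmid) bound exp(-2 t^2 / sum_i c_i^2); any latent mixture variance loosens it.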
[LG-75] Tight $L_\infty$ Sample Complexity for Low-Degree and Sparse Boolean Polynomials
Link: https://arxiv.org/abs/2606.17319
Authors: Jasper van Doornmalen,Mathieu Molina,Victor Verdugo,José Verschae
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Combinatorics (math.CO); Statistics Theory (math.ST)
Comments:
Abstract:Motivated by the optimization of bounded binary black-box functions, we study the problem of learning polynomial surrogates over the Boolean hypercube. To ensure that optimizing the surrogate yields good solutions for the underlying objective, we require uniform $L_\infty$-error guarantees rather than the usual $L_2$-type guarantees. We characterize the minimax sample complexity of uniform estimation under subgaussian noise for two classes of bounded polynomials. First, for polynomials of degree at most $d$ on $n$ variables, the sample complexity scales as $n^{d+1}$. Second, for $s$-sparse Fourier-Walsh polynomials with $s \leq n$, it scales as $ns^2$. These rates differ structurally from the noiseless setting, where uniform exact recovery scales as $n^d$ and $ns$, respectively. Our lower bounds hold even for arbitrary adaptive learners, showing that the additional factors are intrinsic to the noisy cases. Standard Fourier-analysis tools for the $L_2$-norm do not naturally extend to the $L_\infty$-setting in a way that yields uniform guarantees. Our proofs overcome this difficulty by relying on suitably chosen auxiliary norms that serve as proxies for controlling the $L_\infty$-error. Together, our results provide a tight characterization of the sample complexity of learning optimization-safe polynomial surrogates.
[LG-76] Accelerated Convex Optimization via Hamiltonian Dynamics with Deterministic Integration Time COLT2026
Link: https://arxiv.org/abs/2606.17260
Authors: Xiuyuan Wang,Vishwak Srinivasan,Qiang Fu,Siddharth Mitra,Ashia Wilson,Andre Wibisono
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 51 pages, 7 figures. Accepted to the 39th Annual Conference on Learning Theory (COLT 2026)
Abstract:We develop Hamiltonian dynamics-based algorithms for smooth convex optimization that achieve accelerated rates of convergence. By exploiting contraction of averaged Hamiltonian flow trajectories rather than requiring contraction at trajectory endpoints, we show that Hamiltonian dynamics-based optimization methods admit deterministic and accelerated convergence guarantees, extending prior work that is limited to quadratic objectives or holds only in expectation. We analyze an idealized continuous-time algorithm and derive practical discrete-time implementations with optimal first-order complexity, thereby establishing Hamiltonian dynamics as a useful algorithmic primitive for deterministic accelerated convex optimization.
[LG-77] Another Look at Log-PCA for Probability Measures: A Dynamical Formulation and Statistical Convergence
Link: https://arxiv.org/abs/2606.17196
Authors: Peng Xu,Changbo Zhu,Young-Heon Kim,Xiaohui Chen
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:
Abstract:This paper is concerned with learning principal variations of random probability measures on $\mathbb{R}^m$ under the Wasserstein geometry. We introduce a new dynamical formulation to interpret the log-PCA, a linearized principal geodesic analysis, as a variational approach. Our differentiable version, termed the Wasserstein Tangential PCA (WT-PCA), captures the local principal modes of geodesic variations of a (weighted) probability measure on the Wasserstein space via its covariance operator at barycenter. Based on the dynamical perspective and leveraging parallel transport structure of the optimal transport problems, we derive a general statistical convergence rate of the empirical WT-PCA when estimated from data in terms of the 2-Wasserstein distance between the population and empirical barycenter reference measures.
[LG-78] Regularized Machine Learning for System Identification of Ship Free-Running Manoeuvres from CFD-Based Synthetic Data: A Comparative Study
Link: https://arxiv.org/abs/2606.17121
Authors: R.F. Suárez,J.C. Berndt,M. Abdel-Maksoud
Subjects: Applications (stat.AP); Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Comments: 28 pages
Abstract:This study investigates supervised machine learning techniques for identifying ship hydrodynamic coefficients from CFD-generated data from free-running simulations. Specifically, ordinary least squares and regularized regression methods are applied to Abkowitz-type manoeuvring models. Training and validation datasets are derived from URANS simulations of zig-zag and turning circle manoeuvres, which are validated against experimental benchmark data. The analysis evaluates the effects of coefficient set size, minimum training length required for predictive model training, and manoeuvre combinations on model performance. Results demonstrate the suitability of large-angle zig-zag manoeuvres for hydrodynamic system identification, provided that multicollinearity is addressed through appropriate coefficient selection, regression models, or input data variability. Larger coefficient sets offer greater model flexibility for variable conditions but are more prone to multicollinearity. Regularized regression techniques effectively mitigate multicollinearity and notably enhance prediction accuracy, as does incorporating more diverse manoeuvring data. Among tested models, Ridge regression provided the best compromise between computational efficiency and prediction accuracy.
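Why regularization helps under multicollinearity can be seen in a two-regressor toy problem. The sketch below solves the ridge normal equations (X^T X + lambda I) w = X^T y in closed form for exactly two columns. It is not an Abkowitz-type model: the data are invented, the intercept and feature scaling are omitted, and `ridge_two_features` is a hypothetical helper.

```python
from typing import List, Sequence


def ridge_two_features(X: Sequence[Sequence[float]], y: Sequence[float],
                       lam: float) -> List[float]:
    """Solve (X^T X + lam * I) w = X^T y for a two-column design matrix."""
    a = sum(r[0] * r[0] for r in X) + lam  # entry (0, 0) of X^T X + lam I
    b = sum(r[0] * r[1] for r in X)        # off-diagonal entry
    d = sum(r[1] * r[1] for r in X) + lam  # entry (1, 1)
    g0 = sum(r[0] * yi for r, yi in zip(X, y))
    g1 = sum(r[1] * yi for r, yi in zip(X, y))
    det = a * d - b * b
    if det == 0:
        raise ValueError("singular system: regressors are exactly collinear")
    # Explicit 2x2 inverse applied to X^T y.
    return [(d * g0 - b * g1) / det, (a * g1 - b * g0) / det]


# Invented, nearly collinear regressors: column 2 is almost column 1.
X_demo = [[1.0, 1.01], [2.0, 1.99], [3.0, 3.02]]
y_demo = [2.0, 4.1, 5.9]
w_ols = ridge_two_features(X_demo, y_demo, lam=0.0)    # ordinary least squares
w_ridge = ridge_two_features(X_demo, y_demo, lam=1.0)  # ridge-regularized
```

Adding `lam` to the diagonal moves the determinant away from zero, so the solve stays well conditioned even when the two columns are almost identical; with `lam = 0` and exactly collinear columns the system has no unique solution at all.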
[LG-79] RadSEM: A Finding-by-Finding Metric for Clinical Consistency in Radiology Reports
Link: https://arxiv.org/abs/2606.17062
Authors: Zhenhong Yang,Zhuoyun Liu,Jintao Fei,Wen Tang,Shichao Quan,Jun Zhao,Jun Xu
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Comments:
Abstract:Radiology report evaluation must distinguish clinical compatibility from surface similarity, because negation, laterality, or normal-abnormal polarity can reverse a finding. We propose RadSEM (Radiology Sentence-Level Evaluation Metric), a constrained LLM-assisted metric for reference-based evaluation of radiology Findings. RadSEM rewrites reference and generated reports into ordered atomic finding sentences, each expressing one site-finding proposition. It then performs contradiction-constrained many-to-many matching: incompatible pairs such as “effusion” and “no effusion” receive no credit, while compatible granularity differences can receive partial credit. A deterministic stage weights pairs by part-whole and abnormal-detail relationships, counts unmatched findings, and produces an abnormal-focused weighted F1 score. Thus, the LLM supports structured rewriting and local alignment rather than acting as an opaque judge. We evaluate RadSEM with SSREE, a controlled monotonicity stress test built from 2,448 de-identified reports expanded into five graded corruption levels. RadSEM achieves Kendall tau_b of 0.957, all-pairs concordance of 97.8%, adjacent concordance of 95.0%, and strict five-level ordering for 81.9% of reports, outperforming radiology-specific and general text metrics while avoiding the failure in which polarity-inverted reports regain lexical overlap. On the same SSREE set, RadSEM outperforms the Ref-anchored RadSEM-Alt policy, improving adjacent concordance from 90.7% to 95.0% and strict ordering from 67.2% to 81.9%. On a 599-triplet synonym/antonym subset, RadSEM prefers synonyms in 597 cases (99.67%). These results suggest that explicit finding units, contradiction-aware matching, and abnormal-focused deterministic scoring make report scoring more interpretable and sensitive to clinically meaningful errors. Code is available at this https URL.