Today's arXiv Papers | 2026-06-11

This post lists the latest papers retrieved from arXiv.org on 2026-06-11. It is updated automatically and grouped into six major areas: NLP, CV, ML, AI, IR, and MA.

Note: the daily paper data is fetched from arXiv.org and refreshed automatically every morning at around 12:30.

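Tip: if the list has not been updated on a given day, either arXiv published no new papers that day or the script failed. Failures will be fixed the same day whenever possible.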

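Natural Language Processing: 125 papers (Computation and Language (cs.CL))
Artificial Intelligence: 200 papers (Artificial Intelligence (cs.AI))
Computer Vision: 121 papers (Computer Vision and Pattern Recognition (cs.CV))
Machine Learning: 199 papers (Machine Learning (cs.LG))
Multiagent Systems: 7 papers (Multiagent Systems (cs.MA))
Information Retrieval: 19 papers (Information Retrieval (cs.IR))
Human-Computer Interaction: 17 papers (Human-Computer Interaction (cs.HC))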

Multiagent Systems

[MA-0] CCKS: Consensus-based Communication and Knowledge Sharing

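[Quick Read]: This paper addresses a problem in multi-agent reinforcement learning (MARL) under Decentralized Training and Decentralized Execution (DTDE): action-advising-based knowledge sharing relies too heavily on teacher guidance without evaluating teacher-student compatibility, which leads to excessive advising, poor stability, and degraded performance. The core solution is the Consensus-based Communication and Knowledge Sharing (CCKS) framework, which constrains action recommendations with a consensus model built from local observations, so that agents can use consensus to weigh how far to adopt a teacher's advice and thereby balance exploration against learning from experienced teachers. The key is to use contrastive learning to build the consensus model during training, then score and choose actions at selection time by combining consensus with shared knowledge, enabling more efficient and stable cooperation. As a plug-and-play module, CCKS integrates seamlessly with existing DTDE algorithms; experiments in complex environments such as Google Research Football and the StarCraft II Multi-Agent Challenge show that it significantly improves cooperation efficiency, learning speed, and overall performance.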

Link: https://arxiv.org/abs/2606.12281
Authors: Jinyuan Zu, Xiaowei Lv, Yongcai Wang, Deying Li, Yunjun Han, Wenping Chen, Fengyi Zhang, Naiqi Wu
Affiliations: Renmin University of China; Chinese Academy of Sciences; China Electronics Technology Group Corporation; Guangdong University of Technology
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:In Decentralized Training and Decentralized Execution (DTDE) for cooperative Multi-Agent Reinforcement Learning (MARL), action-advising-based knowledge sharing promotes interpretable and scalable cooperation among agents. However, current action advising approaches often adhere too much to the teacher’s guidance without evaluating teacher-student compatibility, which causes excessive advising, suboptimal stability, and degraded performance. To overcome these challenges, this paper presents a Consensus-based Communication and Knowledge Sharing (CCKS) framework, which allows agents to adopt recommendations based on consensus-derived constraints and to follow the teacher’s instructions more smartly. This mechanism enables agents to balance exploration and learning from experienced teachers, improving overall performance. The key is the consensus model construction, for which we propose to employ contrastive learning to construct consensus models based on local observations in the agents’ training phase. In action selection, agents score and choose actions based on consensus and shared knowledge. Designed as a plug-and-play solution, CCKS integrates seamlessly with existing DTDE algorithms. Experiments conducted in the Google Research Football environment and the complex StarCraft II Multi-Agent Challenge demonstrate that the integration with CCKS significantly improves cooperation efficiency, learning speed, and overall performance compared with current DTDE baselines. The code is available at this https URL.

[MA-1] Automating Geometry-Intensive Compliance Checking in BIM: Graph-Based Semantic Reasoning Framework

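[Quick Read]: This paper targets the technical bottleneck of automated compliance checking for geometry-intensive regulations in Building Information Modeling (BIM). The core problems are the semantic gap between high-level regulatory logic and structured IFC data, and the limits of existing methods when handling multi-hop reasoning chains and latent spatial dependencies across multiple building entities. The key to the solution is the Spatial-Geometric Reasoning System for Building Information Modeling (SGR-BIM), a graph-driven integrative framework that dynamically constructs a cross-modal knowledge graph to align user intent, regulatory semantics, and BIM geometry, supporting interpretable reasoning without the rigidity of hard-coded rules. On 679 expert-verified fire-safety-code queries, SGR-BIM reaches 84.3% accuracy, an 8.6% improvement over enhanced-tool single-agent baselines, offering the Architecture, Engineering, and Construction (AEC) industry a more transparent and flexible paradigm for automated geometric compliance checking.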

Link: https://arxiv.org/abs/2606.12065
Authors: Zixuan Xiao, Pei Troh Koh, Jun Ma, Jack C.P. Cheng
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:Automating compliance check for geometry-intensive regulations remains a significant technical bottleneck in Building Information Modeling (BIM), primarily due to the semantic disparity between high-level regulatory logic and structured IFC data. Existing methods, often reliant on static rule templates, struggle to traverse multi-hop reasoning chains or resolve latent spatial dependencies across multiple building entities. To address these challenges, a Spatial-Geometric Reasoning System for Building Information Modeling (SGR-BIM) is proposed as an integrative graph-driven reasoning framework. SGR-BIM dynamically constructs a cross-modal knowledge graph that aligns user intent, regulatory semantics, and BIM geometry, enabling interpretable reasoning without rigid hard-coding. Validated on 679 expert-verified queries from fire safety codes, the framework achieves 84.3% accuracy, representing an 8.6% improvement over enhanced-tool single-agent baselines. This research provides a graph-based semantic reasoning paradigm, enhancing the transparency and flexibility of automated geometric compliance check workflows in the Architecture, Engineering, and Construction (AEC) industry.

[MA-2] Evaluation of Alternative-Based Information Systems for Deliberative Polling using an Agentic Simulator

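[Quick Read]: This paper addresses the coverage problem in deliberative polling: how to ensure that every participant, before voting, encounters diverse arguments representative of the whole reason space, particularly at scale and in electorates with strategic or adversarial behavior. The core challenge is selecting, from a large pool of arguments, a representative recommendation set that adequately covers the reason space, which the paper frames as the NP-hard Subsuming Justification Problem. The key proposal is the LLM-based Agentic Bipolar Argumentation Simulator (ABAS), grounded in a framework that formalizes a poll as a six-tuple (Jend, Jopp, Ratt, Renh, VA, VR) of endorsing and opposing justifications, attack and enhance relations, and shareholder and relation weights. The simulator models N autonomous shareholder agents, each with a latent opinion drawn from a chosen distribution over [-1, 1], who generate arguments, and it recommends existing justifications ranked by observable endorsement mass. The evaluation metric is coverage: the fraction of the corpus reason-tag set represented in the top K recommendations shown to each shareholder. Experiments characterize how creativity rate (pown), recommendation size (K), argumentation density (plinks), and population size (N) affect coverage and corpus diversity. In an authenticated electorate, where Sybil attacks are impossible and only the relation graph can be manipulated, the scoring is stress-tested with coordinated strategic voting attacks such as tag flooding. Uniform weights are easily broken, whereas author-count relation weighting through a reversed-PageRank rule resists the flood markedly better, helping to preserve coverage.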

Link: https://arxiv.org/abs/2606.11692
Authors: Rwaida Alssadi, Khulud Alawaji, Balaji Kasula, Muntaser Syed, Badria Alfurhood, Markus Zanker, Marius Silaghi
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
Comments:

Abstract:Deliberative polling promises to improve collective decision-making by exposing shareholders to a broad range of arguments before they vote. Yet ensuring that every voter encounters a representative sample of the reason space, the coverage problem, remains an open challenge, particularly at scale and in adversarial or strategically motivated electorates. This paper introduces a way of evaluating solutions using the LLM-based Agentic Bipolar Argumentation Simulator, grounded in a framework which formalises a poll as a six-tuple Jend, Jopp, Ratt, Renh, VA, VR of endorsing and opposing justifications, attack and enhance relations, and shareholder- and relation-weights. ABAS simulates N autonomous shareholder agents, each assigned a latent opinion according to desired distributions in [-1, 1], who sequentially vote, choose or author justifications, and optionally submit argumentation-graph links. The simulator implements recommendations that rank existing justifications by their observable endorsement mass. It evaluates the mechanism’s success by coverage, namely the fraction of the corpus reason-tag set represented in the K recommendations presented to each shareholder, as a solution to the NP-hard Subsuming Justification Problem. Reported experiments characterise how creativity rate (pown), recommendation size (K), argumentation density (plinks), and population size (N) affect coverage and corpus diversity. In an authenticated electorate where Sybil attacks are impossible and only the relation graph is gameable, we stress-test the scoring with coordinated strategic voting attacks: a tag-flood attack collapses coverage, while author-count relation weighting through a reversed-PageRank rule resists the flood markedly better than uniform weights.
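
The coverage metric described above can be sketched in a few lines. This is a minimal illustration, not the simulator's code: the justification ids, tag names, and the helper names `top_k_by_endorsement` and `coverage` are all hypothetical, and ties in endorsement mass are broken alphabetically as an assumption.

```python
from typing import Dict, List, Set


def top_k_by_endorsement(endorsement_mass: Dict[str, float], k: int) -> List[str]:
    """Rank justifications by observable endorsement mass (ties broken by id)."""
    ranked = sorted(endorsement_mass, key=lambda j: (-endorsement_mass[j], j))
    return ranked[:k]


def coverage(recommended: List[str], tags_of: Dict[str, Set[str]]) -> float:
    """Fraction of the corpus reason-tag set present in one recommendation list."""
    corpus_tags: Set[str] = set().union(*tags_of.values())
    if not corpus_tags:
        return 0.0
    shown: Set[str] = set()
    for justification in recommended:
        shown |= tags_of.get(justification, set())
    return len(shown & corpus_tags) / len(corpus_tags)


# Hypothetical toy corpus: four justifications carrying four distinct reason tags.
tags_of = {
    "j1": {"cost"},
    "j2": {"cost", "safety"},
    "j3": {"fairness"},
    "j4": {"privacy"},
}
mass = {"j1": 5.0, "j2": 3.0, "j3": 1.0, "j4": 0.5}

recs = top_k_by_endorsement(mass, k=2)
print(recs)                     # ['j1', 'j2']
print(coverage(recs, tags_of))  # 0.5
```

In this toy case the two most endorsed justifications share a tag, so ranking by endorsement mass alone covers only half of the reason space, which is the kind of gap the coverage metric is meant to expose.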

[MA-3] Sovereign Assurance Boundary: Certificate-Bound Admission for Agentic Infrastructure

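[Quick Read]: This paper addresses the control-plane authorization problem that autonomous agents driven by generative AI create in production environments: non-deterministic reasoning systems can propose high-risk mutations to resources, while existing security mechanisms such as identity and access management (IAM), policy engines, consensus protocols, and audit logs either enforce static, context-unaware permissions or merely record actions after execution, and so cannot effectively guard against the threat. The core solution is the Sovereign Assurance Boundary (SAB), a certificate-bound runtime admission layer that governs autonomous execution authority. SAB intercepts agent proposals at an assurance airlock, compiles them into typed execution contracts (C), and binds them to cryptographic evidence digests (H(E)) and policy versions. The contracts are validated through consequence-aware certification paths; on successful admission, the system emits a signed Sovereign Assurance Certificate (Ω) strictly scoped to a specific execution identity, revocation epoch, and validity window. A sovereign execution broker then verifies Ω and performs fresh pre-execution revocation and drift checks before it may invoke infrastructure APIs. The paper formalizes the admission and revocation invariants and reports a preliminary feasibility evaluation of a Go prototype over 2,500 admission attempts, arguing that the model turns delegated execution authority into a cryptographically verifiable, evidence-bound, revocable, and replayable runtime artifact and thereby places enforceable safety constraints on autonomous reasoning systems.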

Link: https://arxiv.org/abs/2606.11632
Authors: Jun He, Deying Yu
Affiliations: OpenKedge.io
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
Comments: 12 pages, 1 figure, 13 tables

Abstract: Agentic infrastructure introduces a critical control-plane authorization problem: non-deterministic reasoning systems can propose high-stakes mutations to production resources, yet existing security mechanisms, such as identity and access management (IAM), policy engines, consensus protocols, and audit logs, either enforce static, context-unaware permissions or merely record actions post-execution. This paper introduces the Sovereign Assurance Boundary (SAB), a certificate-bound runtime admission layer for autonomous execution authority. SAB intercepts agent proposals at an assurance airlock, compiles them into typed execution contracts C, and binds these contracts to cryptographic evidence digests H(E) and policy versions. The contracts are then routed through consequence-aware certification paths. Upon successful admission, the system emits a signed Sovereign Assurance Certificate (Ω) that is strictly scoped to a specific execution identity, revocation epoch, and validity window. Finally, a sovereign execution broker verifies Ω and performs fresh pre-execution revocation and drift checks before invoking infrastructure APIs. We detail the airlock-broker architecture, formalize its admission and revocation invariants, and report preliminary feasibility measurements from a Go prototype evaluated over 2,500 admission attempts. Ultimately, this broker-enforced model prevents autonomous reasoning from directly mutating state, transforming delegated execution authority into a cryptographically verifiable, evidence-bound, revocable, and replayable runtime artifact.
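
To make the issue-then-verify flow more concrete, here is a rough sketch of a certificate bound to a contract digest, an evidence digest, a policy version, an execution identity, a revocation epoch, and a validity window. Everything here is an assumption for illustration: the field names, the `issue_certificate` and `broker_admits` helpers, and the use of HMAC as a stand-in for the signature are not taken from the paper, whose prototype is written in Go.

```python
import hashlib
import hmac
import json

# Hypothetical demo key. HMAC stands in for the certificate signature here;
# the abstract does not say which signature scheme SAB uses.
SIGNING_KEY = b"demo-signing-key"


def digest(obj) -> str:
    """SHA-256 over a canonical JSON encoding of obj."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()


def sign(body: dict) -> str:
    payload = json.dumps(body, sort_keys=True).encode()
    return hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()


def issue_certificate(contract, evidence, policy_version, identity,
                      epoch, not_before, not_after) -> dict:
    """Bind a contract to evidence, policy, identity, epoch, and time window."""
    body = {
        "contract": digest(contract),
        "evidence": digest(evidence),
        "policy": policy_version,
        "identity": identity,
        "epoch": epoch,
        "not_before": not_before,
        "not_after": not_after,
    }
    return {"body": body, "signature": sign(body)}


def broker_admits(cert, contract, identity, current_epoch, now) -> bool:
    """Pre-execution checks: signature, contract drift, identity, epoch, window."""
    body = cert["body"]
    return (
        hmac.compare_digest(sign(body), cert["signature"])
        and body["contract"] == digest(contract)
        and body["identity"] == identity
        and body["epoch"] == current_epoch
        and body["not_before"] <= now <= body["not_after"]
    )


contract = {"action": "scale", "resource": "db-cluster", "replicas": 3}
cert = issue_certificate(contract, {"ticket": "demo"}, "policy-v1", "exec-1",
                         epoch=7, not_before=100, not_after=200)

print(broker_admits(cert, contract, "exec-1", current_epoch=7, now=150))  # True
print(broker_admits(cert, contract, "exec-1", current_epoch=8, now=150))  # False
```

Bumping the revocation epoch invalidates every outstanding certificate at once, which is one plausible reading of why the certificate is scoped to an epoch rather than revoked individually.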

[MA-4] Phi-Actor-Critic: Steering General-Sum Games to Pareto-Efficient Correlated Equilibria IJCAI 2026

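[Quick Read]: This paper addresses a core challenge of multi-agent reinforcement learning (MARL) in general-sum games: how to select high-social-welfare coordination strategies among many suboptimal Nash equilibria. Conventional deep MARL methods are either constrained by the monotonicity assumptions of value decomposition or tend to converge to stable but inefficient equilibria, and so struggle to reach socially desirable outcomes. The key to the solution is the Φ-Actor-Critic (Φ-AC) framework, which minimizes swap regret to steer learning toward high-welfare correlated equilibria (CE). To make counterfactual regret estimation tractable in deep MARL, the framework uses a centralized attention critic that predicts vector-valued regrets in a single forward pass, avoiding computationally expensive counterfactual simulations. It also introduces a Lagrangian-based equilibrium selection mechanism that optimizes social welfare while enforcing stability through regret constraints. Experiments on matrix games, Multi-Agent Particle Environments (MPE), and the Melting Pot Harvest scenario show that Φ-AC learns efficient and stable coordination strategies with high collective return and competitive fairness.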

Link: https://arxiv.org/abs/2606.11284
Authors: Wongyu Lee, Francesco Lelli, Omran Ayoub, Massimo Tornatore
Affiliations: Politecnico di Milano; Tilburg University; University of Applied Sciences and Arts of Southern Switzerland
Categories: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments: Accepted to IJCAI 2026

Abstract: Real-world multi-agent systems, from traffic coordination to resource allocation, are often modeled as general-sum games where individual incentives conflict with collective welfare. In these settings, the central challenge is not merely finding an equilibrium, but selecting socially desirable outcomes among many suboptimal Nash equilibria. Standard deep multi-agent reinforcement learning (MARL) methods struggle with this problem, as value-decomposition approaches are constrained by monotonicity assumptions and policy-gradient methods often converge to stable but socially inefficient equilibria. To address this limitation, we propose Φ-Actor-Critic (Φ-AC), a framework that leverages swap regret minimization to steer learning toward high-welfare correlated equilibria (CE). To make counterfactual regret estimation tractable in deep MARL, Φ-AC employs a centralized attention critic that predicts vector-valued regrets in a single forward pass, avoiding computationally expensive counterfactual simulations. We further introduce a Lagrangian-based equilibrium selection mechanism that optimizes social welfare while enforcing stability through regret constraints. Experiments on matrix games, Multi-Agent Particle Environments (MPE), and the Melting Pot Harvest scenario demonstrate that Φ-AC learns efficient and stable coordination strategies across diverse mixed-motive settings while maintaining high collective return and competitive fairness.
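
For readers unfamiliar with swap regret, the quantity can be computed exactly for a small tabular history. The sketch below uses the textbook definition (the best gain available from rerouting every play of action a to some fixed action b, summed over a); it is not the paper's attention-critic estimator, and the function name and toy data are hypothetical.

```python
from typing import List, Sequence


def swap_regret(actions: Sequence[int], utilities: Sequence[Sequence[float]]) -> float:
    """Total swap regret of one player over a history of rounds.

    actions[t] is the action played in round t; utilities[t][b] is the payoff
    the player would have received in round t by playing b instead.
    """
    n = len(utilities[0])
    # gain[a][b]: total gain from playing b in every round where a was played.
    gain: List[List[float]] = [[0.0] * n for _ in range(n)]
    for a, u in zip(actions, utilities):
        for b in range(n):
            gain[a][b] += u[b] - u[a]
    # The best swap function picks, for each a, the replacement with the largest
    # gain; gain[a][a] is 0, so every term is non-negative.
    return sum(max(row) for row in gain)


# Hypothetical three-round history with two actions.
history_actions = [0, 0, 1]
history_utilities = [[1.0, 2.0], [1.0, 3.0], [4.0, 0.0]]
print(swap_regret(history_actions, history_utilities))  # 7.0
```

When every player's average swap regret goes to zero, the empirical distribution of joint play approaches the set of correlated equilibria, which is why minimizing it is a natural route to a CE.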

[MA-5] Multi-agent rendezvous in fluid flows via reinforcement learning

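[Quick Read]: This paper tackles the rendezvous task for multi-agent systems in fluid environments, in particular how agents can exploit fluid kinematics to raise the success rate of cooperative rendezvous. Under the naive strategy, agents simply move toward one another, but in vortical flows they can end up in different vortices and fail to meet. The paper therefore develops physics-informed strategies with multi-agent reinforcement learning (MARL). The key is that, by breaking the symmetry of the state-action map, agents learn a non-intuitive mechanism that keeps them from becoming trapped in separate vortices, which significantly improves the rendezvous rate. The learned strategy transfers across vortex intensities, vortex scales, and swarm sizes, and a heuristic extracted from it also outperforms the naive strategy. A theoretical analysis further shows that fluid deformation, characterized by finite-time Lyapunov exponents, impedes rendezvous: regions of strong deformation quickly separate adjacent agents, so rendezvous targets should be planned in weak-deformation regions. The study highlights the role of agent-fluid interactions in complex environments and the capability of MARL to explore swarm intelligence.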

Link: https://arxiv.org/abs/2606.11274
Authors: Bocheng Li, Jingran Qiu, Lihao Zhao
Affiliations: Tsinghua University; Gothenburg University
Categories: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Comments:

Abstract:Rendezvous is a critical task for multi-agent systems, requiring agents to coordinate to meet at an unspecified location. However, achieving this in fluid environments presents a challenge, as it remains unclear how agents can exploit underlying fluid kinematics to facilitate convergence. In this study, we adopt a multi-agent reinforcement learning (MARL) approach to develop physics-informed rendezvous strategies in vortical flows. Compared to a naive strategy, where agents navigate toward their counterparts, MARL strategies significantly improve the rendezvous rate. MARL strategies also show transferability across varying vortex intensities, vortex scales, and swarm sizes. By breaking the symmetry of the state-action map, MARL strategy leverages a non-intuitive mechanism that prevents agents from becoming trapped in separate vortices, thereby enhancing rendezvous success. Additionally, a heuristic strategy is extracted from the learned strategy and also outperforms the naive strategy. Furthermore, a theoretical analysis demonstrates that fluid deformation impedes the rendezvous process. Large finite-time Lyapunov exponents identify where fluid effects separate adjacent agents, suggesting that targets should be planned in weak-deformation regions. Our findings reveal the important role that agent-fluid interactions play in multi-agent tasks and highlight the MARL capability to explore swarm intelligence in complex flow environments.

[MA-6] MASK: Multi-Agent Semantic K-Scheduling for Risk-Sensitive 6G Robotics

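[Quick Read]: This paper addresses the tension in 6G connected robotics between high-performance collaborative control and the limited spectral resources of physical wireless channels. In realistic collaborative sensing scenarios, spectrum is quantized into finite physical resource blocks or orthogonal subcarriers, so simultaneous transmission by all agents is infeasible. The paper proposes the Multi-Agent Semantic K-Scheduling (MASK) architecture. Its core is Arbiter-Assisted Semantic Information Gating (A-SIG), which enforces a strict instantaneous bandwidth cap by scheduling only the top-K agents according to semantic importance scores that each agent computes locally. The prioritized observations are aggregated into a compact latent state by a self-supervised global encoder, which lets a distributional policy mitigate tail risks despite data sparsity. Experiments show that MASK matches the performance of communication-unconstrained baselines even when channel access is restricted to a small fraction of the swarm, and that it is inherently resilient to packet erasures, supporting semantic scheduling as a key enabler for resource-constrained 6G systems.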

Link: https://arxiv.org/abs/2606.11249
Authors: Ahmet Gunhan Aydin, Elif Tugce Ceran
Affiliations: Middle East Technical University; Aselsan Inc.
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Abstract:Realizing the vision of 6G connected robotics requires reconciling high-performance collaborative control with the rigid spectral limitations of physical wireless channels. In realistic collaborative sensing scenarios, spectral resources are quantized into finite physical resource blocks or orthogonal subcarriers, rendering simultaneous transmission by all agents infeasible. To address this, we propose Multi-Agent Semantic K-Scheduling (MASK), a control architecture designed to sustain robust, risk-aware coordination under strict instantaneous bandwidth caps. We introduce Arbiter-Assisted Semantic Information Gating (A-SIG), a lightweight coordination mechanism that enforces hard access constraints by scheduling only the top-K agents based on locally computed semantic importance scores. By aggregating these prioritized observations into a compact latent state, a self-supervised global encoder enables a distributional policy to mitigate tail risks despite data sparsity. We evaluate MASK across diverse benchmarks, demonstrating that it matches the performance of communication-unconstrained baselines even when channel access is restricted to a small fraction of the swarm size. Furthermore, the framework exhibits inherent resilience to packet erasures, validating semantic scheduling as a critical enabler for resource-constrained 6G systems.
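
The hard access constraint amounts to a top-K selection over locally computed scores. The following is only a schematic of that gating step, under the assumption that ties are broken by agent id; the agent names, scores, and the helpers `schedule_top_k` and `gate` are hypothetical, and the learned scoring function and encoder are out of scope.

```python
from typing import Dict, List


def schedule_top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k agents with the highest semantic importance scores."""
    ranked = sorted(scores, key=lambda agent: (-scores[agent], agent))
    return ranked[:k]


def gate(observations: Dict[str, list], scores: Dict[str, float], k: int) -> Dict[str, list]:
    """Keep only the observations of scheduled agents; the rest stay silent."""
    return {agent: observations[agent] for agent in schedule_top_k(scores, k)}


# Hypothetical swarm of four robots competing for two resource blocks.
scores = {"r0": 0.1, "r1": 0.9, "r2": 0.4, "r3": 0.7}
observations = {"r0": [0.0], "r1": [1.5], "r2": [0.3], "r3": [2.1]}

print(schedule_top_k(scores, k=2))    # ['r1', 'r3']
print(gate(observations, scores, 2))  # {'r1': [1.5], 'r3': [2.1]}
```

Because at most K agents ever transmit, the bandwidth cap holds at every step by construction rather than on average.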

Natural Language Processing

[NLP-0] Context-Driven Incremental Compression for Multi-Turn Dialogue Generation ICML 2026

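[Quick Read]: This paper addresses the redundant computation and falling efficiency that modern dialogue systems incur as dialogue history accumulates over many turns. Truncation or summarization degrades fidelity in long conversations, and existing context compressors lack cross-turn memory sharing and revision, causing information loss and compounding errors. The key to the solution is Context-Driven Incremental Compression (C-DIC), an incremental compression method grounded in conversational dynamics. It treats a conversation as interleaved contextual threads and keeps revisable per-thread compression states in a single compact dialogue memory, so that information is shared across turns and stale memories are updated. It also adapts truncated backpropagation-through-time (TBPTT) to learn cross-turn dependencies without the cost of full-history backpropagation. Experiments on long-form dialogue benchmarks show better generation quality and inference efficiency, with stable latency and perplexity over hundreds of turns, pointing to a scalable path toward high-quality dialogue modeling.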

Link: https://arxiv.org/abs/2606.12411
Authors: Yeongseo Jung, Jaehyeok Kim, Eunseo Jung, Jiachuan Wang, Yongqi Zhang, Ka Chun Cheung, Simon See, Lei Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at ICML 2026

Abstract:Modern conversational agents condition on an ever-growing dialogue history at each turn, incurring redundant attention and encoding costs that grow with conversation length. Naive truncation or summarization degrades fidelity, while existing context compressors lack cross-turn memory sharing or revision, causing information loss and compounding errors in long dialogues. We revisit the context compression under conversational dynamics and empirically present its fragility. To improve both efficiency and robustness, we introduce Context-Driven Incremental Compression (C-DIC), which treats a conversation as interleaved contextual threads and stores revisable per-thread compression states in a single, compact dialogue memory. At each turn, a lightweight retrieve, revise, and write-back loop shares information across turns and updates stale memories, stabilizing long-horizon behavior. In addition, we adapt truncated backpropagation-through-time (TBPTT) to our multi-turn setting, learning cross-turn dependencies without full-history backpropagation. Extensive experiments on long-form dialogue benchmarks demonstrate superior performance and efficiency of C-DIC; notably, C-DIC shows stable inference latency and perplexity over hundreds of dialogue turns, supporting a scalable path to high-quality dialogue modeling.

[NLP-1] Redesign Mixture-of-Experts Routers with Manifold Power Iteration

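[Quick Read]: This paper addresses the lack of design principles for the router in Mixture-of-Experts (MoE) models, namely how to make each router row vector represent its expert matrix well enough that the dot product with a token reflects token-expert affinity. Existing methods place no explicit constraint on what the router weights should encode, so the rows do not reliably condense expert information. The key proposal is to align each router row with the principal singular direction of the associated expert matrix, since that direction gives the most expressive mathematical description of the matrix. On this principle the authors redesign the router with Manifold Power Iteration (MPI), which follows a "Power-then-Retract" paradigm: a power iteration step on the router weights moves them toward the principal singular direction, and a retraction then imposes a norm constraint, for both computational efficiency and training stability. Theoretically, MPI is shown to drive router rows to converge toward the principal singular directions of their experts; empirically, pretraining MoE models from 1B to 11B parameters confirms that this alignment yields more effective MoE models.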

Link: https://arxiv.org/abs/2606.12397
Authors: Songhao Wu, Ang Lv, Ruobing Xie, Yankai Lin
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Large Language Model Department, Tencent
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint

Abstract: The router is the cornerstone component of Mixture-of-Experts (MoE) models. Serving as expert proxies, the rows of the router matrix compute their similarity to the MoE inputs to determine which subset of experts is activated. Ideally, each router row is designed to encode the expert matrix into this representative vector, such that its dot product with a token can better reflect token-expert affinity. However, there exist no design principles to enforce this condensation. In this paper, we propose to align each router row with the principal singular direction of the associated expert, as this direction provides the most expressive mathematical description of a matrix. Based on this principle, we propose a router redesign with Manifold Power Iteration (MPI). Specifically, it introduces a “Power-then-Retract” paradigm, where a power iteration step is performed on the router weights, followed by a retraction to impose a norm constraint to ensure both efficiency and stability. Theoretically, we show that MPI drives router rows to converge toward the principal singular directions of associated experts. Empirically, we pretrain MoE models at scales from 1B to 11B parameters to confirm that this alignment facilitates more effective MoE models.
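
The "Power-then-Retract" idea can be illustrated with plain power iteration on a tiny matrix. This is a sketch under assumptions, not the paper's update rule: it assumes the router row lives in the expert's input space, applies one step of r ← WᵀW r, and retracts to the unit sphere by normalizing. How MPI interleaves this with gradient training is not described in the abstract, and the names below are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: List[float]) -> List[float]:
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def power_then_retract(expert: Matrix, router_row: List[float]) -> List[float]:
    """One power step with W^T W, then retraction onto the unit sphere."""
    projected = matvec(expert, router_row)          # W r
    transposed = [list(col) for col in zip(*expert)]
    powered = matvec(transposed, projected)         # W^T (W r)
    norm = math.sqrt(sum(x * x for x in powered))
    return [x / norm for x in powered]              # norm constraint ||r|| = 1


# Toy expert matrix whose principal right singular direction is [1, 0].
expert = [[3.0, 0.0], [0.0, 1.0]]
row = [1 / math.sqrt(2), 1 / math.sqrt(2)]
for _ in range(20):
    row = power_then_retract(expert, row)

print([round(x, 6) for x in row])  # [1.0, 0.0]
```

Each step shrinks the component along the weaker singular direction by the squared ratio of singular values (here 1/9), so the row aligns with the dominant direction geometrically fast while the retraction keeps its norm fixed.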

[NLP-2] System Report for CCL25-Eval Task 5: New Dataset and LoRA-Fine-Tuned Qwen 2.5

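[Quick Read]: This paper addresses the shortage of domain-specific work on precise translation and affective-semantic understanding of classical poetry. Most studies treat poetry appreciation as a general-domain problem, neglecting its distinctive aesthetic and semantic features, and high-quality domain-specific datasets are scarce. The solution has three parts. First, the appreciation task is decomposed into three subtasks (term interpretation, semantic interpretation, and emotional inference) for finer-grained modeling. Second, multiple open-source datasets are cleaned and aligned to build the Classical Chinese Poetry Instruction Pair Dataset (CCPoetry-49K), which contains 49,404 high-quality instruction-response pairs optimized for this domain. Third, the Qwen2.5-14B model is fine-tuned with Low-Rank Adaptation (LoRA) to produce PoetryQwen, a domain-specialized large language model. On the CCL25-Eval Task 5 benchmark, PoetryQwen scores 0.757, a 9.7% relative improvement over the Qwen2.5-14B-Instruct baseline (0.690), indicating stronger precise translation and emotional understanding of classical poetry.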

Link: https://arxiv.org/abs/2606.12392
Authors: Haotao Xie
Affiliations: The Hangzhou International Innovation Institute; Beihang University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: Recently, large language models (LLMs) have achieved promising progress in the fields of classical Chinese translation and the generation of classical poetry. However, domain-specific research on precise translation and affective-semantic understanding of classical poetry remains limited. The main challenge is that most studies treat the poetic appreciation task as a general-domain problem, neglecting the distinctive features of poetic appreciation, while high-quality and domain-specific datasets are extremely limited. To address this limitation, we decompose the task into three subtasks: term interpretation, semantic interpretation, and emotional inference. Based on multiple open-source datasets, we perform data cleansing and alignment to construct the Classical Chinese Poetry Instruction Pair Dataset (CCPoetry-49K), which comprises 49,404 high-quality instruction-response pairs explicitly optimized for this domain. We then propose a domain-specialized LLM, called PoetryQwen, by applying Low-Rank Adaptation (LoRA) to fine-tune the Qwen2.5-14B model. Experimental results on the CCL25-Eval Task 5 benchmark demonstrate that PoetryQwen achieves a score of 0.757, representing a 9.7% improvement over the Qwen2.5-14B-Instruct baseline (0.690). These findings clearly indicate that PoetryQwen significantly enhances performance in precise translation and emotional understanding of classical poetry. We present a new dataset and methodological considerations intended to support the domain-specific optimization of LLMs.

[NLP-3] Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs

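[Quick Read]: This paper addresses the difficulty of tracing the increasingly complex dependencies in modern large language model (LLM) training pipelines. As development relies more and more on other models to generate data, filter corpora, judge outputs, and guide decisions, the dependencies become recursive and fragmented: an upstream model's own dependencies are often documented only in separate releases and artifacts, so the full structure is spread across heterogeneous public artifacts, with complexity and recursive depth well beyond what humans can trace by hand. The paper introduces ModSleuth, an agentic system that recursively reconstructs LLM dependency graphs from public artifacts with source-grounded evidence. The key to the solution is a formalization that distinguishes direct from indirect dependencies, represents heterogeneous pipeline roles through operation-centered relationships, and resolves artifact identities across names, versions, and repositories so that references in inconsistent documentation can be reconciled. Applied to four public-artifact-rich LLM releases, the method recovers 1,060 source-verified dependencies and builds large-scale dependency graphs that reveal multi-hop license obligations, train-evaluation coupling, discrepancies between released and training-time artifacts, and documentation inconsistencies, improving the transparency and auditability of the modern LLM ecosystem.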

Link: https://arxiv.org/abs/2606.12385
Authors: Sanjay Adhikesaven, Haoxiang Sun, Sewon Min
Affiliations: University of California, Berkeley; Allen Institute for AI
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Modern LLM training pipelines increasingly rely on other models to generate data, filter corpora, judge outputs, and guide development decisions. These dependencies are recursive: a model may depend on an upstream artifact whose own dependencies are documented only in separate releases and artifacts. As a result, the full dependency structure is fragmented across heterogeneous public artifacts, with complexity and recursive depth far outpacing humans’ ability to trace. We introduce ModSleuth, an agentic system that recursively reconstructs LLM dependency graphs from public artifacts with source-grounded evidence. We find that the primary challenge is no longer information extraction, but defining what constitutes a dependency and reconciling artifact references across inconsistent documentation. We address these challenges through a formalization that distinguishes direct and indirect dependencies, represents heterogeneous pipeline roles through operation-centered relationships, and resolves artifact identities across names, versions, and repositories. Applying ModSleuth to four public-artifact-rich LLM releases, we recover 1,060 source-verified dependencies and construct large-scale dependency graphs of modern LLM development. These graphs reveal multi-hop license obligations, train-evaluation coupling, discrepancies between released and training-time artifacts, and documentation inconsistencies that would otherwise be difficult to uncover. We release ModSleuth and the resulting dependency graphs to support transparent analysis of the increasingly complex ecosystems underlying modern LLMs.
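
The distinction between direct and indirect dependencies is essentially one hop versus multi-hop reachability in a directed graph. A small sketch follows, with hypothetical artifact names and a simple adjacency-list representation; ModSleuth's actual schema and its operation-centered edge types are not shown in the abstract.

```python
from typing import Dict, List, Set, Tuple


def split_dependencies(graph: Dict[str, List[str]], root: str) -> Tuple[Set[str], Set[str]]:
    """Return (direct, indirect) upstream artifacts of root.

    graph[x] lists the artifacts x was built on. Indirect dependencies are
    those reachable only through two or more hops.
    """
    direct = set(graph.get(root, ()))
    reached: Set[str] = set()
    stack = list(direct)
    while stack:
        node = stack.pop()
        for upstream in graph.get(node, ()):
            if upstream != root and upstream not in reached:
                reached.add(upstream)
                stack.append(upstream)
    return direct, reached - direct


# Hypothetical release: a model trained on a synthetic dataset and scored by a judge.
graph = {
    "model-A": ["dataset-X", "judge-B"],
    "dataset-X": ["generator-C"],
    "generator-C": ["corpus-D"],
    "judge-B": [],
}

direct, indirect = split_dependencies(graph, "model-A")
print(sorted(direct))    # ['dataset-X', 'judge-B']
print(sorted(indirect))  # ['corpus-D', 'generator-C']
```

A multi-hop license obligation in this picture is a property of an indirect node such as `corpus-D` that still binds `model-A`, which is why the recursive traversal matters.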

[NLP-4] Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization

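[Quick Read]: This paper addresses the difficulty of scaling reasoning generalization in reinforcement learning (RL) training of large language models (LLMs), which stems from how verifiable environments are built. Existing methods construct environments manually or one at a time, hit a linear scaling limit, and cannot efficiently generate large and diverse reasoning scenarios. The key to the solution is RACES (Recursive Automated Composition for Environment Scaling), a framework that treats verifiable environments as composable building blocks: whenever the codomain (output type) of one environment matches the domain (input type) of another, the two can be fused automatically and assembled recursively. The framework defines four composition operators (SEQUENTIAL, PARALLEL, SORT, and SELECT) that yield composite environments with diverse reasoning patterns. Experiments show that 50 base environments suffice to match the performance of training on 300 individual environments, a marked gain in environment utilization. On six benchmarks unseen during the construction of the training environments, RACES improves DeepSeek-R1-Distill-Qwen-14B by 3.1 points on average (from 48.2 to 51.3) and Qwen3-14B from 58.8 to 61.1.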

Link: https://arxiv.org/abs/2606.12373
Authors: Hao Xiang, Qiaoyu Tang, Le Yu, Yaojie Lu, Xianpei Han, Ben He, Le Sun, Bowen Yu, Peng Wang, Hongyu Lin, Dayiheng Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract: Reinforcement Learning (RL) with verifiable environments has emerged as a powerful approach for enhancing the reasoning capabilities of Large Language Models (LLMs). While prior research demonstrates that scaling environment quantity improves RL performance, existing manual or individual construction methods suffer from linear scaling limits, thereby hindering scalable reasoning generalization. This paper introduces RACES (Recursive Automated Composition for Environment Scaling), a framework that conceptualizes verifiable environments as composable building blocks that can be recursively assembled. The key insight is that when the codomain (output type) of one environment matches the domain (input type) of another, they can be automatically fused into a new verifiable environment, enabling recursive composition. RACES is implemented with 300 individual environments and defines a set of composition operators (SEQUENTIAL, PARALLEL, SORT, and SELECT) that induce diverse reasoning patterns. Extensive experiments show that RL training on these composite environments consistently enhances reasoning generalization. Specifically, RACES improves DeepSeek-R1-Distill-Qwen-14B by an average of 3.1 points (from 48.2 to 51.3) and boosts Qwen3-14B performance from 58.8 to 61.1 on six benchmarks, which are unseen during the construction of training environments. Moreover, RACES achieves performance comparable to training on 300 individual environments using only 50 base environments, demonstrating significant efficiency in environment utilization.
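
The codomain-matches-domain rule can be sketched as a type-checked composition of small environments. This is an illustrative toy, not the RACES implementation: the `Env` class, the `sequential` helper, and the three base environments are hypothetical, and each environment is reduced to a reference solver whose output is compared with the submitted answer.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Env:
    name: str
    domain: type      # input type of the task
    codomain: type    # output type of the task
    solve: Callable[[Any], Any]  # reference solver used for verification

    def verify(self, task_input: Any, answer: Any) -> bool:
        return self.solve(task_input) == answer


def sequential(first: Env, second: Env) -> Env:
    """Fuse two environments when first's output type is second's input type."""
    if first.codomain is not second.domain:
        raise TypeError(f"cannot compose {first.name} with {second.name}")
    return Env(
        name=f"{first.name}>>{second.name}",
        domain=first.domain,
        codomain=second.codomain,
        solve=lambda x: second.solve(first.solve(x)),
    )


sort_env = Env("sort", list, list, sorted)
sum_env = Env("sum", list, int, sum)
square_env = Env("square", int, int, lambda n: n * n)

# Composition is recursive: a composite environment is itself composable.
pipeline = sequential(sequential(sort_env, sum_env), square_env)
print(pipeline.name)                     # sort>>sum>>square
print(pipeline.verify([3, 1, 2], 36))    # True
```

Because the composite is again an `Env` with its own domain, codomain, and verifier, a small base set can generate many distinct verifiable tasks, which is the intuition behind matching 300 individual environments with 50 base ones.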

[NLP-5] Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling

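[Quick Read]: This paper addresses the rollout bottleneck in the reinforcement learning (RL) post-training of large language models (LLMs). Multi-Token Prediction (MTP) can accelerate rollouts through speculative decoding, but its acceptance rate degrades sharply during RL training as model entropy fluctuates, which limits the speedup. The paper presents Bebop, a systematic study of MTP in LLM post-training, with the following key findings and remedies. First, the fluctuation of model entropy is the fundamental factor bounding the MTP acceptance rate, and the two show a clear negative linear relationship. Second, probabilistic rejection sampling largely alleviates the disturbance caused by entropy compared with greedy draft sampling. Third, because conventional cross-entropy or KL objectives are suboptimal in this setting, the paper proposes an end-to-end total variation (TV) loss that directly optimizes the acceptance rate of multi-step rejection sampling, improving acceptance by about 10%, reaching acceptance rates of up to 95%, and adding up to 25% inference throughput. In addition, MTP training done before RL with the end-to-end TV loss and rejection sampling keeps acceptance rate and speedup stable throughout RL, removing the need for costly online updates. Experiments report up to 1.8x end-to-end acceleration in asynchronous RL training of Qwen3.5, Qwen3.6, and Qwen3.7 models.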

Link: https://arxiv.org/abs/2606.12370
Authors: Yucheng Li, Huiqiang Jiang, Yang Xu, Jianxin Yang, Yi Zhang, Yizhong Cao, Yuhao Shen, Fan Zhou, Rui Men, Jianwei Zhang, An Yang, Bowen Yu, Bo Zheng, Fei Huang, Junyang Lin, Dayiheng Liu, Jingren Zhou
Affiliations: Alibaba Inc
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Reinforcement learning (RL) has become a key component in modern large language models, yet the rollout stage remains the key bottleneck in RL training pipelines. Although Multi-Token Prediction (MTP) offers a natural solution to accelerate rollouts through speculative decoding, many studies have observed that MTP acceptance rates degrade significantly during RL training, leading to limited speedup performance. To address this bottleneck, we present Bebop, a systematic study of MTP in LLM post-training, and offer practical recipes to integrate MTP into large-scale RL pipelines. First, we reveal that the MTP acceptance rate is fundamentally bounded by the fluctuation of model entropy, which demonstrates a clear negative linear relationship with the rise of entropy in the RL stage. Second, we show that probabilistic rejection sampling largely alleviates the disturbance introduced by entropy in RL compared to greedy draft sampling. We further identify that the conventional MTP training objectives (cross-entropy or KL) are suboptimal in such settings, and therefore we propose a novel end-to-end TV loss that directly optimizes multi-step rejection sampling acceptance rate, yielding ~10% acceptance rate improvements, achieving up to 95% acceptance rates and up to 25% extra inference throughput gains across mathematical reasoning, code generation, and agentic tasks. Third, we test various online MTP training strategies during RL and show that pre-RL MTP training with e2e TV loss and rejection sampling achieves a consistent acceptance rate and speedup throughout the entire RL, eliminating the need for costly online MTP updating. We provide extensive experiments and analysis that validate our findings. Experimental results show our method achieves up to 1.8x end-to-end acceleration in async RL training of Qwen3.5, Qwen3.6, and Qwen3.7 models.
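
Why a TV objective targets the acceptance rate directly can be seen from the standard single-step speculative sampling rule: a draft token x drawn from q is accepted with probability min(1, p(x)/q(x)), so the expected acceptance is Σ min(p, q) = 1 − TV(p, q). The sketch below checks that identity numerically on made-up distributions; the paper's multi-step, end-to-end loss is not reproduced here, and the function names are hypothetical.

```python
import math
from typing import Sequence


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    """TV distance between two distributions over the same vocabulary."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def expected_acceptance(p: Sequence[float], q: Sequence[float]) -> float:
    """Expected acceptance of a draft token x ~ q under rejection sampling.

    Each draft token is accepted with probability min(1, p[x] / q[x]), so the
    expectation over q is sum_x min(p[x], q[x]).
    """
    return sum(min(a, b) for a, b in zip(p, q))


# Hypothetical next-token distributions: target model p and MTP draft head q.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

accept = expected_acceptance(p, q)
tv = total_variation(p, q)
print(round(accept, 6), round(1 - tv, 6))
assert math.isclose(accept, 1 - tv)
```

Cross-entropy and KL penalize mismatches in log space, so they can trade off differently from this quantity; minimizing TV is, at least in the single-step case, the same as maximizing acceptance.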

[NLP-6] Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

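[Quick Read]: This paper addresses the difficulty of measuring the coding ability of general-purpose agents such as OpenClaw under SWE-bench: a generic agent does not by itself satisfy the standardized conditions that scoring requires, namely a clean Docker workspace, a patch, and a prediction contract. The authors introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol that fixes the prompt, runtime budget, workspace contract, patch extraction procedure, and evaluator so that heterogeneous agent harnesses (called "claws") can be compared fairly. Its key contribution is to treat harness design and cost accounting as first-class evaluation axes rather than looking at model performance alone. With the same GLM 5.1 backbone, OpenClaw with a minimal direct-diff adapter scores only 19.1% Pass@1, whereas the full adapter reaches 73.4%, which underlines the importance of adapter design. Sweeps across models and harnesses show that model choice changes Pass@1 by 29.4 percentage points and harness choice by 27.4 percentage points, and that systems with similar accuracy can differ substantially in total API cost. Claw-SWE-Bench provides a full benchmark of 350 instances across 8 languages and 43 repositories, plus an 80-instance Lite subset for faster validation, as a standardized tool for reproducible, fair comparison of coding agents.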

Link: https://arxiv.org/abs/2606.12344
Authors: Mengyu Zheng,Kai Han,Boxun Li,Haiyang Xu,Yuchuan Tian,Wei He,Hang Zhou,Jianyuan Guo,Hailin Hu,Lin Ma,Chao Xu,Guohao Dai,Lixue Xia,Yunchao Wei,Yunhe Wang,Yu Wang
Affiliations: TokenRhythm Technologies(令牌节奏科技); Infinigence AI(无限智能AI); Peking University(北京大学); City University of Hong Kong(香港城市大学); SEE Fund(SEE基金); Shanghai Jiaotong University(上海交通大学); Beijing Jiaotong University(北京交通大学); Tsinghua University(清华大学)
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring. We introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol that makes heterogeneous agent harnesses, or claws, comparable under fair settings including a fixed prompt, runtime budget, workspace contract, patch extraction procedure, and evaluator. The full benchmark contains 350 GitHub issue-resolution instances across 8 languages and 43 repositories, drawn from SWE-bench-Multilingual and SWE-bench-Verified-Mini after future-commit cleanup. We also release Claw-SWE-Bench Lite for faster validation, which is an 80-instance subset selected by a cost-aware, rank-aware procedure over 17 calibration columns. On the full benchmark, OpenClaw with a minimal direct-diff adapter scores only 19.1% Pass@1, whereas the full adapter reaches 73.4% with the same GLM 5.1 backbone, showing that adapter design is essential for enabling OpenClaw-style harnesses to perform coding tasks effectively. Across an OpenClaw × nine-model sweep and a five-claw × two-model sweep, model choice changes Pass@1 by 29.4 pp and harness choice by 27.4 pp under fixed models; systems with similar accuracy can differ substantially in total API cost. Claw-SWE-Bench therefore treats harness and cost accounting as first-class axes of SWE-style coding-agent evaluation, providing both a full benchmark and a low-cost reference set for reproducible comparison. The data is available at this https URL and this https URL.

[NLP-7] ALIGNBEAM: Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing

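Quick read: This paper addresses the safety degradation caused by domain fine-tuning: fine-tuned specialist models comply more readily with harmful prompts framed in domain language. Existing inference-time defenses require the safe anchor model and the target model to share a vocabulary, which rules them out for cross-family specialists, where safety degrades the most. The paper proposes ALIGNBEAM, a training-free method that, at each decoding step, translates the anchor model's logits token-by-token into the target model's vocabulary; a small LLM judge then selects the safest of K candidate continuations. No model weights are modified, and the safety-utility trade-off can be tuned at deployment without retraining. On both cross-vocabulary and same-vocabulary evaluation pairs, ALIGNBEAM substantially raises refusal rates on adversarial benchmarks while keeping task accuracy and inference overhead within practical bounds. The results show that safety alignment can be transferred between model families at inference time without touching either model's weights.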

Link: https://arxiv.org/abs/2606.12342
Authors: Chirag Chawla,Pratinav Seth,Vinay Kumar Sankarapu
Affiliations: Lexsi Labs
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments:

Abstract:Domain fine-tuning degrades the safety of large language models: fine-tuned specialists readily comply with harmful prompts framed in domain language. Existing inference-time defenses that mix logits from a safe anchor model require both models to share a vocabulary, which rules them out for the cross-family specialists where safety is most degraded. We present ALIGNBEAM, a training-free method that lifts this restriction by translating anchor logits into the target model’s vocabulary token-by-token at each decoding step; a small LLM judge then selects the safest among K candidate continuations. No weights are changed, and the safety-utility trade-off can be tuned at deployment without retraining. Across both cross-vocabulary and same-vocabulary evaluation pairs, ALIGNBEAM substantially raises refusal on adversarial benchmarks while keeping task accuracy and inference overhead within practical bounds. The results show that safety alignment can be transferred between model families at inference time, without touching either model’s weights.

[NLP-8] Measuring Semantic Progress in Multi-turn Dialogue via Information Gain

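Quick read: This paper addresses the difficulty of evaluating multi-turn dialogue, where quality emerges across turns, and focuses on a key dimension of information-seeking dialogue: semantic progress, the accumulation of new, question-relevant, and non-redundant information over a conversation. The key idea is to formalize semantic progress as question-conditioned uncertainty reduction and to introduce an information-theoretic metric that approximates it in embedding space. The estimator uses a tractable Gaussian formulation with closed-form updates and has desirable theoretical properties: monotonicity of information gain, additive decomposition of information gain across turns, and diminishing returns for redundant evidence. Unlike LLM-as-a-judge approaches, the metric requires no autoregressive inference at evaluation time and is fully reproducible for a fixed embedding model. Although it targets only semantic progress, it achieves competitive agreement with human judgments on MT-Bench, Chatbot Arena, and UltraFeedback, with improved alignment on MT-Bench and UltraFeedback compared with several LLM-based judges. Notably, it remains effective with lightweight embedding models under CPU-only execution, indicating that semantic progress can be captured without relying on large model capacity.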

Link: https://arxiv.org/abs/2606.12332
Authors: Paul He,Shiva Kasiviswanathan,Dominik Janzing
Affiliations: Amazon; NTU Singapore
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Preprint. 26 pages

Abstract:Evaluating multi-turn dialogue is challenging because quality emerges across turns rather than within individual responses. We focus on a key dimension of information-seeking dialogue: semantic progress, defined as the accumulation of new, question-relevant, and non-redundant information over the course of a conversation. We formalize semantic progress as question-conditioned uncertainty reduction and introduce an information-theoretic metric that approximates it in embedding space. Our main estimator uses a tractable Gaussian formulation with closed-form updates, while a complementary maximum-entropy argument shows why log-determinant structure arises more broadly when only second-order embedding information is retained. This formulation yields desirable theoretical properties, including monotonicity, additive decomposition of total information gain across turns, and diminishing returns for redundant evidence. Unlike LLM-as-a-judge approaches, our metric requires no autoregressive inference at evaluation time and is fully reproducible for a fixed embedding model. Experiments on MT-Bench, Chatbot Arena, and UltraFeedback show that the proposed metric achieves competitive agreement with human judgments despite targeting only semantic progress, with improved alignment on MT-Bench and UltraFeedback compared to several LLM-based judges. Notably, the method remains effective with lightweight embedding models under CPU-only execution, indicating that semantic progress can be captured without reliance on large model capacity.
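
The three properties named in the abstract (monotonicity, additive decomposition, diminishing returns) can be seen in a textbook linear-Gaussian update. The sketch below is an assumption for illustration and not the paper's estimator: it treats each turn as a noisy observation along an embedding direction `e`, uses a hypothetical noise variance, and omits the question conditioning entirely. The per-turn gain is 0.5 * log(1 + eᵀΣe / σ²), and the covariance Σ shrinks after each turn.

```python
import math


def turn_gain_and_update(cov, e, noise_var):
    """Return (information gain in nats, posterior covariance) after
    observing one turn embedding `e` under a Gaussian belief `cov`."""
    n = len(e)
    cov_e = [sum(cov[i][j] * e[j] for j in range(n)) for i in range(n)]  # cov @ e
    var_along_e = sum(e[i] * cov_e[i] for i in range(n))                 # e^T cov e
    gain = 0.5 * math.log(1.0 + var_along_e / noise_var)
    denom = noise_var + var_along_e
    # Rank-one posterior update (cov is symmetric).
    post = [[cov[i][j] - cov_e[i] * cov_e[j] / denom for j in range(n)]
            for i in range(n)]
    return gain, post


# Identity prior in a toy 3-dimensional embedding space (hypothetical).
cov = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
turns = [
    [1.0, 0.0, 0.0],  # new information
    [1.0, 0.0, 0.0],  # redundant repeat of the first turn
    [0.0, 1.0, 0.0],  # new, unrelated information
]

gains = []
for e in turns:
    gain, cov = turn_gain_and_update(cov, e, noise_var=0.5)
    gains.append(gain)
print(gains)
```

In this toy run the repeated turn earns less than the first one (diminishing returns), every gain is non-negative (monotonicity), and the per-turn gains sum to half the drop in log-determinant of the covariance (additive decomposition).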

[NLP-9] Measuring Epistemic Resilience of LLMs Under Misleading Medical Context

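Quick read: This paper targets a key gap in how large language models (LLMs) are evaluated in medicine: although they reach expert-level scores on standard medical exams, it is unclear whether they keep reasoning correctly under complex or misleading conditions. The study finds that when misleading context is injected into questions a model originally answers correctly, the model abandons the correct answer. The paper introduces the concept of epistemic resilience, the ability to maintain correct medical judgment under adversarial context, and builds MedMisBench, the first benchmark designed specifically to measure it. The benchmark contains 10,932 medical questions and 48,889 misleading context-option pairs spanning medical reasoning, agentic capability, and patient-journey evaluation. Across 11 model configurations, mean accuracy falls from 71.1% on the original questions to 38.0%, with an attack success rate of 51.5%. The most damaging injections are formal, rule-like fabrications: authority-framed falsehoods (69.5% attack success) and exception-poisoning claims (64.1%). A panel of 14 clinicians from 7 countries identified serious potential harm in 38.2% of reviewed cases. Existing medical benchmarks measure what models know but not whether they preserve correct judgment under misleading context, a structural blind spot that motivates epistemic resilience as a new evaluation dimension.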

Link: https://arxiv.org/abs/2606.12291
Authors: Hongjian Zhou,Xinyu Zou,Jinge Wu,Sean Wu,Junchi Yu,Bradley Max Segal,Tobias Erich Niebuhr,Sara Amro,Michael Petrus,Sheikh Momin,Alexandra M. Cardoso Pinto,Rachel Niesen,Laura Sophie Wegner,Dhruv Darji,Jung Moses Koo,Joshua Fieggen,Kapil Narain,Mingde Zeng,Lei Clifton,Linda Shapiro,Fenglin Liu,David A. Clifton
Affiliations: University of Oxford; University of Washington; University College London; University of Waterloo
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) now reach expert-level scores on medical licensing exams, encouraging the assumption that high scores imply safe medical judgment while patients increasingly use them for health advice. We show this assumption is fragile: when misleading context is injected into questions that LLMs originally answer correctly, they abandon the correct answer. We call the ability to maintain correct judgment under adversarial context epistemic resilience, and introduce MedMisBench to measure it. MedMisBench contains 10,932 medical question items and 48,889 misleading context-option pairs spanning medical reasoning, agentic capability, and patient-journey evaluation. Across 11 model configurations, mean accuracy falls from 71.1% on original questions to 38.0% under focused misleading context, with 51.5% attack success. The most damaging injections are formal, rule-like fabrications: authority-framed falsehoods reach 69.5% attack success and exception-poisoning claims reach 64.1%. A 14-member clinical panel from 7 countries identified serious potential harm in 38.2% of reviewed cases. MedMisBench exposes a structural blind spot in LLM evaluation in medical settings: existing benchmarks measure what models know, but not whether they preserve correct medical judgment under misleading context.

[NLP-10] Beyond Fully Random Masking: Attention-Guided Denoising and Optimization for Diffusion Language Models ACL2026

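Quick read: This paper addresses a weakness in post-training of diffusion large language models (dLLMs): random masking strategies ignore the intrinsic dependencies among tokens, which hurts generation stability and limits reasoning ability. The proposed solution is AGDO, an attention-guided denoising and optimization framework. It builds on an empirical analysis of attention in dLLMs showing that tokens attending strongly to unmasked context are more stable during generation and play a critical role in reasoning. AGDO therefore derives the denoising order from attention structure and emphasizes these attention-critical tokens during supervised fine-tuning and reinforcement learning, aligning training and optimization with actual token dependencies. On mathematical and coding reasoning benchmarks, AGDO outperforms state-of-the-art post-training methods for dLLMs.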

Link: https://arxiv.org/abs/2606.12273
Authors: Jia Deng,Junyi Li,Wayne Xin Zhao,Jinpeng Wang,Hongyu Lu,Ji-Rong Wen
Affiliations: Renmin University of China (中国人民大学); City University of Hong Kong (香港城市大学); Meituan (美团); Tencent (腾讯); Beijing Key Laboratory of Research on Large Models and Intelligent Governance (北京市大数据与智能治理重点实验室)
Categories: Computation and Language (cs.CL)
Comments: 13 pages. Accepted to ACL 2026 Main Conference

Abstract:Diffusion large language models (dLLMs) offer an efficient alternative to autoregressive models through parallel decoding, yet existing post-training methods largely rely on random masking strategies that overlook intrinsic token dependencies. In this work, we present an empirical analysis of attention in dLLMs and show that tokens attending more strongly to unmasked context exhibit greater generation stability and play a critical role in reasoning. Motivated by these findings, we propose AGDO, an attention-guided denoising and optimization framework that aligns both training and optimization with attention-derived dependencies. AGDO determines the denoising order based on attention structure and emphasizes attention-critical tokens during supervised fine-tuning and reinforcement learning. Experiments on mathematical and coding benchmarks demonstrate that AGDO consistently improves reasoning performance, outperforming state-of-the-art post-training methods for dLLMs.

[NLP-11] Reassessing High-Performing LLMs on Polish Medical Exams: True Competence or Bias-Driven Performance?

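Quick read: This paper addresses a common problem in evaluating medical large language models (LLMs): reliance on multiple-choice question answering (MCQA), which can overestimate real clinical ability because of guessing strategies and answer biases. The solution is an expanded, more challenging benchmark based on Polish medical exams that adds over 15,000 questions, two new medical domains, and four structural modifications that reduce MCQA-specific artifacts and test reasoning more faithfully. The experiments show that evaluation design strongly affects results: under the harder setup, the best model (Qwen3.5-122B) drops by 28.4 and 31 percentage points on English and Polish exams, respectively. Even with little evidence of data contamination, standard MCQA scores do not reliably reflect true medical competence. The authors release the benchmark publicly to support further research.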

Link: https://arxiv.org/abs/2606.12250
Authors: Antoni Lasik,Jakub Pokrywka,Łukasz Grzybowski,Jeremi Ignacy Kaczmarek,Gabriela Korzańska,Janusz Świeczkowski-Feiz,Oskar Pastuszek,Paulina Hoffman,Jakub Tomasz Dąbrowski,Wojciech Kusa
Affiliations: NASK National Research Institute (国家研究机构); Adam Mickiewicz University (亚当·密克维奇大学); ARAAI Poland (波兰人工智能研究院); Poznań University of Medical Sciences (波兹南医科大学); Centre of Postgraduate Medical Education, Poland (波兰继续医学教育中心); T. Marciniak Lower Silesian Specialist Hospital (T. 马尔钦斯基下西里西亚专科医院); Medical University of Warsaw (华沙医科大学)
Categories: Computation and Language (cs.CL)
Comments: 26 pages total with references and appendix, preprint

Abstract:Large language models (LLMs) in medicine are mainly evaluated using multiple-choice question answering (MCQA), which can overestimate real clinical ability due to guessing strategies and answer biases. To address these limitations, we introduce an expanded and more challenging benchmark based on Polish medical exams, adding over 15,000 questions, two new domains, and four structural modifications that reduce MCQA-specific artifacts and better test reasoning. We evaluate 21 LLMs and show that evaluation design strongly affects results. Under our harder setup, the best model (Qwen3.5-122B) drops by 28.4 and 31 pp on English and Polish exams, respectively. Despite low evidence of data contamination, standard MCQA scores do not reliably reflect true medical competence. To facilitate further research, we make our benchmark publicly available.

[NLP-12] Beyond Third-Person Audits: Situated Interaction Auditing for User-Centered LLM Bias Research

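Quick read: This paper addresses a structural blind spot in research on large language models (LLMs): bias audits mostly take a third-person view, studying how models describe or evaluate external groups, and leave the user out of the audit. In practice, users hold open-ended, personal conversations with models, and the model adjusts the content, quality, and tone of its responses based on signals such as implicit sociodemographic markers (for example gender and socioeconomic status), writing style, and stated identity. When identical requests yield different responses depending on who is asking, bias shows up not as stereotyped descriptions of others but as differential treatment of the interlocutor. The paper proposes Situated Interaction Auditing (SIA), a user-centered framework for studying how user profile signals systematically shape response quality, content, and tone. The framework is demonstrated through a case study that intersects gender and socioeconomic status signals across multiple task domains, and the paper outlines a research agenda for SIA in natural language processing.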

Link: https://arxiv.org/abs/2606.12247
Authors: Andrés Abeliuk,Cinthia Sanchez Macias,Valentina Alarcón,Álvaro Madariaga,Claudia Lopez
Affiliations: University of Chile (智利大学); Center for Artificial Intelligence (CENIA) (人工智能中心); Pontificia Universidad Católica de Chile (天主教智利大学)
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments:

Abstract:Research on bias in large language models (LLMs) has predominantly focused on third-person audits, which study how models represent or evaluate demographic groups as external subjects. However, this paradigm overlooks a structural blind spot because the user is absent from the audit. In practice, LLMs are used in open-ended, personal interactions, during which the model implicitly represents the user and adjusts its responses accordingly. When identical requests yield different responses depending on who is asking, bias manifests not in how the model describes others but in how it treats its interlocutor. We propose Situated Interaction Auditing (SIA), a user-centered framework for studying how user profile signals – implicit sociodemographic markers, writing style, and stated identity – systematically shape LLM response quality, content, and tone. We demonstrate the framework through a case study that intersects gender and socioeconomic status signals across multiple task domains and outline a research agenda for SIA as a new mission for natural language processing.

[NLP-13] VIA-SD: Verification via Intra-Model Routing for Speculative Decoding ICML2026

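Quick read: This paper addresses the high inference cost of large language models (LLMs), specifically the waste caused by the binary decision in existing speculative decoding (SD) methods: accept a draft token or fully recompute it. The observation is that many rejected tokens can be verified correctly by a lightweight submodel without calling the full verifier. The paper proposes Verification via Intra-Model Routing for Speculative Decoding (VIA-SD), a multi-tier framework built around a routed slim-verifier. Draft tokens are handled hierarchically: high-confidence tokens are accepted directly, medium-confidence tokens are regenerated by the slim-verifier, and low-confidence tokens go to the full model for verification. This reduces calls to the expensive full model. Across four representative tasks and multiple model families, VIA-SD lowers rejection rates by 0.10-0.22, delivers 10-20% speedups over strong SD baselines, and achieves 2.5-3x acceleration over non-drafting decoding. It is compatible with existing SD frameworks without changes to their training procedures, which suggests multi-tier SD as a general paradigm for efficient, scalable LLM inference.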

Link: https://arxiv.org/abs/2606.12243
Authors: Yuchen Xian,Yang He,Yunqiu Xu,Yi Yang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

Abstract:Speculative decoding (SD) addresses the high inference costs of LLMs by having lightweight drafters generate candidates for large verifiers to validate in parallel. Existing draft-verify methods use binary decisions: accept or fully recompute. Yet we find that many rejected tokens can be verified correctly by a slim submodel derived from the full verifier via intra-model routing, instead of the full verifier. This motivates our slim-verifier to handle tokens requiring moderate verification resources, reducing expensive large-model calls. We propose Verification via Intra-Model Routing for Speculative Decoding (VIA-SD), a multi-tier framework using a routed slim-verifier. Draft tokens are processed hierarchically: direct acceptance for high-confidence cases, slim-verifier regeneration for medium-confidence cases, and full-model verification for uncertain cases. Across four representative tasks and multiple model families, VIA-SD reduces rejection rates by 0.10-0.22 and delivers 10-20% speedups over strong SD baselines, while achieving 2.5-3x acceleration over non-drafting decoding. Moreover, VIA-SD is compatible with existing SD frameworks without modifying their training procedures. Our results suggest multi-tier SD as a general paradigm for scalable and efficient LLM inference. Project page: this https URL
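
The three-tier handling of draft tokens described above can be sketched as a simple router. This is a minimal illustration under stated assumptions: the confidence score, both thresholds, and the function name are hypothetical, and the abstract does not say how VIA-SD actually computes confidence or routes inside the verifier.

```python
def route_draft_token(confidence, accept_threshold=0.9, slim_threshold=0.5):
    """Pick a verification tier for one draft token.

    `confidence` is a hypothetical score in [0, 1]; the thresholds are
    illustrative values, not taken from the paper.
    """
    if confidence >= accept_threshold:
        return "accept"          # high confidence: keep the draft token as is
    if confidence >= slim_threshold:
        return "slim_verifier"   # medium confidence: regenerate with the slim submodel
    return "full_verifier"       # low confidence: fall back to the full model


confidences = [0.97, 0.72, 0.31, 0.93, 0.55]
decisions = [route_draft_token(c) for c in confidences]
full_calls = decisions.count("full_verifier")
print(decisions, full_calls)
```

With a binary accept-or-recompute scheme, the three tokens below the acceptance threshold would all reach the full model; here only one does, which is the kind of saving the multi-tier design targets.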

[NLP-14] On The Effectiveness-Fluency Trade-Off In LLM Conditioning: A Systematic Study

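Quick read: This paper addresses the limited controllability of large language model (LLM) outputs in deployment, focusing on the trade-offs among conditioning methods for concept injection and removal. Existing methods are usually evaluated only on how effectively they inject or remove a target concept, ignoring the effect on generation quality. The paper systematically studies a range of conditioning methods in both scenarios and finds that efficient steering methods often achieve conditioning at a steep cost to fluency. It also identifies a previously overlooked interaction with the training paradigm: activation steering is far less effective on instruction-tuned models than on their base counterparts. Simple prompting and full supervised fine-tuning are viable for concept injection but weaker at concept removal. Finally, cheaply computed textual metrics correlate highly with costly LLM-as-judge scores, offering an efficient proxy for evaluating conditioning methods. The practical takeaway is to balance generation quality against control strength and to choose the conditioning method according to the task.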

Link: https://arxiv.org/abs/2606.12234
Authors: Iuri Macocco,Pau Rodríguez,Arno Blaas,Luca Zappella,Marco Baroni,Xavier Suau
Affiliations: Universitat Pompeu Fabra (庞培法布拉大学); Apple (苹果公司)
Categories: Computation and Language (cs.CL)
Comments: 8 pages, 2 figures

Abstract:Controlling the output of Large Language Models (LLMs) is a central challenge for their reliable deployment, yet a clear understanding of the involved trade-offs remains elusive. Current approaches to conditioning are often evaluated with a narrow focus on their effectiveness at injecting or removing a target concept, neglecting generation quality. We systematically investigate a range of conditioning methods in both injection and removal scenarios. We find that efficient steering methods frequently achieve conditioning at a steep cost to fluency. Furthermore, we identify a critical yet previously overlooked interaction with the training paradigm: activation steering methods are far less effective on instruction-tuned models than on their base counterparts. Simple prompting and full-fledged supervised fine-tuning, on the other hand, are viable options for concept injection, but are not as good at concept removal. Finally, cheaply computed textual metrics correlate highly with costly LLM-as-judge scores, and provide insights into the behavior of conditioning methods.

[NLP-15] Can News Predict the Market? Limits of Zero-Shot Financial NLP and the Role of Explainable AI

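Quick read: This paper asks whether financial news can reliably predict short-term stock price movements, and in particular whether NLP models can extract actionable signals from news text without domain-specific training. The approach is a structured zero-shot natural language inference pipeline with temporal aggregation that explicitly models recency and event-dependent impact horizons when combining information across articles. To meet the need for transparency in high-stakes decisions, the study adds a multi-layered explainability framework that links predictions to token-level, article-level, and aggregate evidence and produces grounded natural language rationales. Across multiple models and prediction horizons, zero-shot approaches consistently fail to outperform simple baselines, with particularly weak performance on negative movements, which points to deeper structural limits in mapping news sentiment to short-term price dynamics. Explainability signals nonetheless distinguish trustworthy from unreliable predictions, which has practical value even when accuracy is limited. The findings mark the limits of zero-shot financial NLP and motivate a shift toward decision-support systems that prioritize transparency and uncertainty awareness.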

Link: https://arxiv.org/abs/2606.12210
Authors: Ali M Karaoglu,Shreyank N Gowda
Affiliations: University of Nottingham (诺丁汉大学)
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Can financial news reliably predict short-term stock movements? Despite advances in large language models, this question remains unresolved. We revisit this problem using a zero-shot natural language processing framework, investigating whether models can extract actionable signals from financial news without domain-specific training. We design a structured pipeline that combines zero-shot natural language inference with temporal aggregation, explicitly modelling recency and event-dependent impact horizons when integrating information across articles. To address the need for transparency in high-stakes settings, we introduce a multi-layered explainability framework that links predictions to token-level, article-level, and aggregate evidence, and produces grounded natural language rationales. Across multiple models and prediction horizons, we find that zero-shot approaches consistently fail to outperform simple baselines, with particularly weak performance on negative movements, suggesting deeper structural limitations in mapping news sentiment to short-term price dynamics. However, explainability signals reliably distinguish between trustworthy and unreliable predictions, offering practical value even when accuracy is limited. These findings highlight the limits of zero-shot financial NLP and motivate a shift toward decision-support systems that prioritise transparency and uncertainty awareness. Code: this https URL
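
One way to picture temporal aggregation with recency and event-dependent impact horizons is an exponentially decayed weighted mean, where each event type has its own half-life. The sketch below is an assumption for illustration only: the decay form, the event types, the half-life values, and the score range are hypothetical and are not taken from the paper.

```python
def aggregate_signal(articles, half_life_days):
    """Recency-weighted mean of per-article scores in [-1, 1].

    Each article is (score, age_days, event_type). The weight halves every
    `half_life_days[event_type]` days, so short-lived event types fade faster.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for score, age_days, event_type in articles:
        weight = 0.5 ** (age_days / half_life_days[event_type])
        weighted_sum += weight * score
        weight_total += weight
    return weighted_sum / weight_total if weight_total else 0.0


# Hypothetical horizons: earnings news matters longer than rumors.
half_life_days = {"earnings": 2.0, "rumor": 0.5}
articles = [
    (1.0, 0.0, "earnings"),   # fresh positive article, weight 1.0
    (-1.0, 2.0, "earnings"),  # two-day-old negative article, weight 0.5
]
signal = aggregate_signal(articles, half_life_days)
print(signal)
```

Here the fresh positive article outweighs the older negative one, so the aggregate stays positive at one third; an unweighted mean of the same two scores would be zero.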

[NLP-16] Adaptive Multi-Resolution Procedural Knowledge Compression for Large Language Models

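Quick read: This paper addresses the prefill cost and latency incurred when large language models (LLMs) repeatedly invoke reusable natural language skills and place their full text in every context. Existing text compression techniques mostly target factual knowledge in documents and do not handle procedural knowledge with logical dependencies well. The paper proposes SKIM (SKIll coMpression), an adaptive multi-resolution soft token compression framework. It generates a different number of soft tokens depending on the complexity of each skill, aiming to preserve the logical dependencies among workflows and tool protocols while supporting lightweight, offline compression that adapts to skills of varying complexity. Experiments indicate that SKIM compresses skills to 30% to 60% of their original token length while preserving task performance better than existing methods, improving LLM inference efficiency.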

Link: https://arxiv.org/abs/2606.12203
Authors: Changyue Wang,Weihang Su,Qingyao Ai,Yichen Tang,Runzhong Qiao,Xuancheng Li,Min Zhang,Yiqun Liu
Affiliations: Tsinghua University (清华大学)
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) are widely used to tackle complex tasks with autonomous workflows. Recently, reusable natural language skills have emerged as a popular paradigm to inject procedural knowledge into LLM applications. Since popular skills are often invoked repeatedly, placing their full text in every context significantly increases prefill cost and latency. While text compression techniques have the potential to solve this problem, most existing methods are designed to compress factual knowledge in documents instead of procedural knowledge, making them insufficient for skill compression. In this paper, we argue that an effective skill compression method should: 1) preserve logical dependencies among workflows and tool protocols, 2) enable lightweight, offline compression for frequently updated community skills, and 3) be adaptable to varying complexities across skills. To address this, we present SKIM (SKIll coMpression), an adaptive multi-resolution soft token compression framework for procedural skills. Depending on the complexity of each skill, SKIM creates different numbers of soft tokens that not only improve the efficiency of LLM inference, but also preserve the effectiveness of skill usage. Experiments indicate that SKIM compresses skills to 30 to 60 percent of their original token length while preserving task performance better than existing compression methods. We have released our code at this https URL.

[NLP-17] Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application

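Quick read: This paper addresses the lack of systematic categorization and in-depth analysis of the environments that large language model (LLM) based agents rely on. Existing work mostly builds environments for specific scenarios without a unified framework for their design, evolution, and evaluation. The survey organizes agentic environments around the environment engineering lifecycle: modeling, synthesis, evaluation, and application. First, it reviews representative environments along eight attributes and eight domains, analyzing their development paths and core capabilities. Second, for automated environment synthesis, it describes two paradigms, symbolic synthesis and neural synthesis, along with the evaluation methods used in each. Third, from the perspective of agent-environment co-evolution, it characterizes four complementary pathways for agent evolution in dynamic environments (memory-centric experience evolution, orchestration-centric workflow evolution, trajectory-centric offline evolution, and exploration-centric online evolution) and identifies three paradigms of environment evolution: neural-driven, difficulty-driven, and scaling-driven. The survey aims to provide a theoretical foundation and technical roadmap for building more adaptive and scalable agentic environments.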

Link: https://arxiv.org/abs/2606.12191
Authors: Jiachun Li,Zhuoran Jin,Tianyi Men,Yupu Hao,Kejian Zhu,Lingshuai Wang,Dongqi Huang,Longxiang Wang,Shengjia Hua,Lu Wang,Jinshan Gao,Hongbang Yuan,Ruilin Xu,Kang Liu,Jun Zhao
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 63 pages, 10 figures

Abstract:Environments serve as interactive systems for large language model (LLM) based agents across diverse scenarios and play a crucial role in driving the continual evolution of model capabilities. Despite this importance, existing work lacks a systematic categorization and deep analysis. This paper systematically studies current research on agentic environments from the perspective of the environment engineering lifecycle, covering their modeling, synthesis, evaluation, and application. Specifically, the paper first introduces representative environments from the perspectives of eight attributes and eight domains, providing detailed analyses of their development paths and highlighting their core capabilities. Second, for automated environment synthesis, two paradigms are introduced, namely symbolic synthesis and neural synthesis. This paper also presents the environment evaluation methods used in each paradigm. Third, the corresponding environment applications are discussed from the perspective of agent-environment co-evolution. Specifically, the paper characterizes the primary pathways for agent evolution in dynamic environments from four complementary perspectives: memory-centric experience evolution, orchestration-centric workflow evolution, trajectory-centric offline evolution, and exploration-centric online evolution. Three paradigms of environment evolution are also identified, namely neural-driven, difficulty-driven, and scaling-driven approaches. Finally, several promising future directions are discussed, including Environment-as-a-Service, Multi-agent Environments, and Neural-Symbolic Environments.

[NLP-18] A Resource for Enthymeme Detection in Controversial Political Discourse

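Quick read: This paper addresses the subjectivity and low agreement of enthymeme annotation in persuasive discourse. Existing resources tend to eliminate disagreement by forcing a single label, which obscures the sources of annotation variation and prevents study of subjective inference and its potential value. The authors propose structured annotation guidelines anchored in Walton's argumentation schemes, which constrain the annotation of implicit reasoning while leaving room for the interpretive nature of the task. A complexity analysis identifies where annotation imposes high cognitive load and may lead to inconsistent labels. Preliminary experiments show that models trained on annotator disagreement outperform models trained on hard majority-vote labels. The work highlights the value of structural openness in enthymeme definitions and guidelines for future resources and for downstream natural language processing (NLP) applications concerned with human inference.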

Link: https://arxiv.org/abs/2606.12186
Authors: Martial Pastor,Nelleke Oostdijk
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 43 pages, to be submitted to the Language Resource and Evaluation Journal

Abstract:Enthymemes, arguments with unstated premises or conclusions, are pervasive in persuasive discourse, yet their annotation remains notoriously subjective. We present a resource of 1,482 tweets from politically controversial discourse, annotated by five annotators for the presence of enthymemes and their argument structure, designed to study label variation. We first revisit the definition of enthymemes and propose annotation guidelines anchored in Walton’s argumentation schemes, offering a structured and constrained approach that nonetheless preserves room for the interpretive nature of the task. This contrasts with past resources, which tend to eliminate disagreement, obscuring its sources and preventing investigation of its potential benefits for model performance. We further propose a complexity analysis of the task, identifying where annotation imposes high cognitive load and may give rise to inconsistent annotation. Our preliminary experiments show that models trained on annotator disagreement outperform models trained on hard majority-vote labels. We close by reflecting on how structural openness in enthymeme definitions and guidelines enables the study of variation in subjective inferential processes for future resources and downstream NLP applications concerned with human inference.

[NLP-19] OpenMedReason: Scientific Reasoning Supervision for Medical Vision-Language Models

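Quick read: This paper addresses the insufficient reasoning of large vision-language models (LVLMs) in high-stakes clinical settings, where models may reach correct answers without interpretable reasoning grounded in visual evidence and clinical knowledge. The core contribution is OpenMedReason, a large-scale, open multimodal medical reasoning corpus of approximately 450K image-question-answer instances whose reasoning traces are primarily derived from curated, human-authored biomedical scientific articles. It covers diverse medical visual modalities, including radiological scans, microscopic images, visible light photographs, and charts, and provides high-fidelity supervision beyond synthetic chains of thought. For fine-grained evaluation, the paper adds OpenMedReason-Bench, a benchmark that assesses perception, medical knowledge, and rationale, going beyond final-answer accuracy. Training on OpenMedReason with supervised fine-tuning (SFT) or reinforcement-based alignment yields a 20% average improvement in visual question answering (VQA) accuracy over the base model and performance within 4.2% of the strongest comparable-scale medical LVLMs. Fine-grained analysis shows joint gains in perception, medical knowledge, and rationale, and the resulting reasoning traces are preferred over those of the base model in 86.1% of pairwise comparisons.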

Link: https://arxiv.org/abs/2606.12169
Authors: Negin Baghbanzadeh,Pritam Sarkar,Michael Colacci,Abeer Badawi,Adibvafa Fallahpour,Arash Afkanpour,Leonid Sigal,Ali Etemad,Elham Dolatabadi
Affiliations: York University (约克大学); Vector Institute (向量研究所); University of British Columbia (不列颠哥伦比亚大学); University of Toronto (多伦多大学); Unity Health Toronto / St. Michael’s Hospital (多伦多健康网络/圣迈克尔医院); University Health Network (大学健康网络); Arc Institute (弧研究所); Queen’s University (皇后大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 42 pages, 9 figures, 24 tables. Dataset and code: this https URL

Abstract:High-stakes clinical use of large vision-language models (LVLMs) requires reasoning that is grounded in visual evidence and clinical knowledge, not just correct final answers. We introduce OpenMedReason, a large-scale, open multimodal medical reasoning corpus comprising approximately 450K image-question-answer instances whose reasoning traces are primarily derived from curated biomedical, human-authored scientific articles. OpenMedReason provides high-fidelity supervision beyond synthetic chains of thought, covering diverse medical domain vision modalities such as radiological scans, microscopic images, visible light photographs, charts, and others. We complement it with OpenMedReason-Bench, a held-out benchmark that allows fine-grained evaluation of LVLMs along three complementary axes of capability, including perception, medical knowledge, and rationale, enabling diagnostic evaluation beyond final-answer accuracy. OpenMedReason is a rich training resource that exhibits its effectiveness in both supervised fine-tuning (SFT) and reinforcement-based alignment. Training with OpenMedReason yields a 20% average improvement in VQA accuracy over the base model and achieves performance within 4.2% of the strongest comparable-scale medical LVLMs. Fine-grained performance analysis confirms that the gains are not concentrated in any single axis: OpenMedReason improves perception, medical knowledge, and rationale jointly, and its reasoning traces are preferred over those of the base model in 86.1% of pairwise comparisons. We release the code and dataset at this http URL.

[NLP-20] A Controlled Study of Decoding-Time Truthfulness Methods on Instruction-Tuned LLMs

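Quick read: This paper addresses hallucination in large language models (LLMs), that is, generated content that is factually wrong or unsupported. The core challenge is to identify and suppress false output without relying on external knowledge bases or large amounts of labeled data. The proposed solution, CHAIR (Classifier of Hallucination As ImproveR), is a supervised detection framework that analyzes each token's logits across the model's layers and extracts a compact feature set: maximum, minimum, mean, standard deviation, and slope. The key idea is that statistical patterns in multi-layer internal representations allow accurate hallucination detection, with strong generalization and robustness, particularly in zero-shot scenarios. The framework also points to the potential of internal representations for designing more advanced decoding strategies, laying groundwork for adaptive decoding methods that further reduce hallucinations and improve generation quality.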

Link: https://arxiv.org/abs/2606.12160
Authors: Ao Sun
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:In this work, we introduce CHAIR (Classifier of Hallucination As ImproveR), a supervised framework for detecting hallucinations by analyzing internal logits from each layer of every token. Our method extracts a compact set of features, such as maximum, minimum, mean, standard deviation, and slope, from the token logits across all layers, enabling effective hallucination detection without overfitting. Experiments on TruthfulQA and MMLU datasets demonstrate that CHAIR significantly improves detection accuracy, particularly in zero-shot scenarios, showcasing its robustness and generalizability. Beyond hallucination detection, CHAIR highlights the potential of using internal representations for designing advanced decoding strategies. By leveraging patterns in logits, we suggest that more sophisticated models and adaptive decoding methods could further reduce hallucinations and enhance text completion quality. CHAIR not only offers a practical solution for detecting hallucinations but also lays the groundwork for exploring richer representations in LLMs to improve their factuality and coherence.
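
The five summary features named in the abstract can be computed from one token's logit trajectory across layers as follows. This is a minimal sketch under stated assumptions: the input is taken to be a single scalar logit per layer for the token of interest, the standard deviation is the population form, and the slope is an ordinary least-squares fit against the layer index. The abstract does not specify these details, and the function name is hypothetical.

```python
import statistics


def layerwise_logit_features(logits_by_layer):
    """Summarize one token's per-layer logits into five features.

    Expects at least two layers so that a slope is defined.
    """
    n = len(logits_by_layer)
    layers = range(n)
    mean_x = (n - 1) / 2
    mean_y = statistics.fmean(logits_by_layer)
    # Least-squares slope of the logit against the layer index.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(layers, logits_by_layer))
    var_x = sum((x - mean_x) ** 2 for x in layers)
    return {
        "max": max(logits_by_layer),
        "min": min(logits_by_layer),
        "mean": mean_y,
        "std": statistics.pstdev(logits_by_layer),
        "slope": cov_xy / var_x,
    }


# Toy trajectory over four layers (hypothetical values).
features = layerwise_logit_features([0.0, 1.0, 2.0, 3.0])
print(features)
```

A feature vector like this, built per token, is small enough to feed a lightweight supervised classifier, which is consistent with the abstract's point about avoiding overfitting.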

[NLP-21] Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

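[Quick read]: This paper addresses the reproducibility of sparse autoencoders (SAEs) as tools for interpreting neural network representations, namely whether the learned features reappear stably across training seeds. The core challenge is to separate stable, functionally meaningful features from unstable ones that are triggered mainly by random noise or surface form. The key contribution is feature stability, a scalable metric that estimates, for each feature, the probability that a similar feature reappears in an independently trained SAE. The study finds that stable features carry most of the reconstruction- and prediction-relevant signal, whereas unstable features, although individually non-reproducible, concentrate geometrically in reproducible lower-rank subspaces. This suggests that seed dependence reflects ambiguity in the choice of basis within a shared region of activation space rather than pure noise. A synthetic model confirms that low-rank ground-truth structure can be recovered at the subspace level while remaining non-identifiable as individual latents across seeds. Finally, pooling unique cross-seed features yields more stable SAEs while preserving explained variance. Unstable features are therefore not simply failed or noisy latents: they reflect reproducible low-dimensional structure that standard SAEs resolve differently across seeds.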

Link: https://arxiv.org/abs/2606.12138
Authors: Gleb Gerasimov, Timofei Rusalev, Nikita Balagansky, Daniil Laptev, Vadim Kurochkin, Daniil Gavrilov
Affiliations: T-Tech
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract: Sparse autoencoders (SAEs) are widely used to interpret neural network representations, but their utility depends on whether the learned features are reproducible across training runs. We study this question through feature stability: for each SAE feature, we estimate the probability that a similar feature reappears in an independently trained SAE. This yields a scalable per-feature signal that separates stable from unstable features. In a large-scale study across seeds, models, layers, dictionary sizes, and SAE variants, we find a pronounced functional asymmetry: stable features carry most of the reconstruction- and prediction-relevant signal, while unstable features have weak marginal impact and are dominated by low-frequency surface-form triggers in both activation statistics and automatic explanations. Geometrically, unstable features are individually non-reproducible but concentrate in reproducible lower-rank subspaces, suggesting that seed dependence often reflects basis ambiguity within a shared region of activation space rather than pure noise. A controlled synthetic model makes this mechanism explicit, showing that low-rank ground-truth features can be recovered at the subspace level while remaining non-identifiable as individual SAE latents across seeds. Finally, by pooling unique cross-seed features, we construct more stable SAEs while preserving explained variance in this setting. Together, these results show that unstable features are not merely failed or noisy latents: they have weak individual functional impact, but reflect reproducible low-dimensional structure that standard SAEs resolve differently across seeds.
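One plausible way to estimate such a per-feature stability score is sketched below. It assumes that "similar" means the best cosine match between decoder directions exceeds a threshold, and that the probability is estimated as the fraction of other seeds containing such a match. The 0.7 threshold, the use of decoder directions, and the function names are illustrative assumptions, not details from the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def feature_stability(feature, other_dictionaries, threshold=0.7):
    """Fraction of independently trained dictionaries that contain a match.

    `feature` is one feature direction; `other_dictionaries` holds one list
    of feature directions per other training seed (assumed input format).
    """
    hits = 0
    for dictionary in other_dictionaries:
        best_match = max(cosine(feature, other) for other in dictionary)
        if best_match >= threshold:
            hits += 1
    return hits / len(other_dictionaries)


seed_b = [[1.0, 0.0], [0.0, 1.0]]   # contains an exact match
seed_c = [[0.0, 1.0], [-1.0, 0.0]]  # contains no similar direction
print(feature_stability([1.0, 0.0], [seed_b, seed_c]))
```

With only a handful of seeds the estimate is coarse, so in practice the score would be averaged over as many independent runs as are available.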

[NLP-22] Soft-Prompt Tuning for Fair and Efficient LLM Benchmark Evaluation

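[Quick read]: This paper addresses the distortion of knowledge evaluation in large language model (LLM) benchmarks caused by weak format-following, in particular the unfair penalty on base models that know the correct answer but lack the formatted-output ability usually introduced in post-training. The key idea is soft-prompt tuning, an efficient, fair, and architecture-agnostic evaluation method. Only 10 soft-prompt vectors are optimized (roughly 0.0006% of the parameters of a 7B model) over a short tuning period (about 80 steps, roughly 640 samples), which adapts the model to a benchmark's format, closes the format-following gap, and lets benchmark scores reflect the model's underlying knowledge. The method significantly outperforms zero-shot and few-shot prompting, also improves format compliance for post-trained models, and serves as a low-cost proxy that predicts post-trained model rankings more reliably, offering an efficient, low-resource way to compare pre-training strategies early in LLM development.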

Link: https://arxiv.org/abs/2606.12117
Authors: Selen Erkan, Bastian Boll, Kristian Kersting, Björn Deiseroth, Letitia Parcalabescu
Affiliations: Aleph Alpha Research; TU Darmstadt; Hessian.AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 4 figures

Abstract: Benchmark scores often misrepresent a large language model's (LLM's) knowledge, because they rely, e.g., on the model's ability to follow specific formatting requirements. This especially penalizes base models that may know the correct answers but lack the ability, typically introduced in post-training, to structure them as instructed. To overcome this, we propose soft-prompt tuning, an efficient, fair, and architecture-agnostic model evaluation. By optimizing only 10 soft-prompt vectors (roughly 0.0006% of the parameters of a 7B model) over a short tuning period, we adapt models to specific benchmark formats, closing gaps in format-following and ensuring that underlying knowledge is accurately reflected in benchmark scores. This allows one to fairly compare different base models (trained with various pre-training recipes) on benchmarks without the need for full post-training. We evaluated soft-prompt tuning across 7 models and 7 datasets. The results show that (a) soft-prompt tuning saturates format-following within 80 steps (~640 samples), making it highly efficient, (b) soft-prompt tuning significantly outperforms zero- and few-shot prompting, surfacing base model knowledge that standard prompting misses, (c) even post-trained models can benefit from soft-prompts to maximize format compliance, and (d) soft-prompted base model performance predicts post-trained model rankings more reliably than zero- and few-shot baselines, offering a low-cost proxy for downstream model quality. Our contributions include (1) metrics which disentangle format-following and knowledge accuracy, (2) a fairer benchmarking protocol of LLM knowledge, and (3) a cost- and memory-effective recipe to identify optimal pre-training strategies early in LLM development.
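The "roughly 0.0006%" figure can be sanity-checked with a short calculation. The sketch assumes a hidden size of 4096 (typical of 7B-parameter models, but not stated in the abstract) and a batch size of 8, which is merely what 640 samples over 80 steps would imply. Both values and the function name are assumptions.

```python
def soft_prompt_fraction(num_vectors, hidden_size, model_params):
    """Share of trainable parameters when only the soft-prompt vectors are tuned."""
    return num_vectors * hidden_size / model_params


# Assumed values: hidden size 4096 and exactly 7e9 model parameters.
fraction = soft_prompt_fraction(num_vectors=10, hidden_size=4096, model_params=7_000_000_000)
print(f"trainable share: {fraction:.6%}")

# Assumed batch size of 8, inferred from 640 samples / 80 steps.
samples_seen = 80 * 8
print(f"samples seen during tuning: {samples_seen}")
```

Under these assumptions the tuned prompt holds 40,960 parameters, which is consistent with the order of magnitude reported in the abstract.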

[NLP-23] Detecting Sensitive Personal Information in Japanese Pre-Training Corpora for Large Language Models

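[Quick read]: This paper addresses the presence of sensitive personal data in large-scale pre-training corpora, in particular special care-required personal information (SCPI) as defined by Japan's Act on the Protection of Personal Information (APPI), with the goal of ensuring compliance with privacy regulations and preventing information leakage. Because research on sensitive personal information is comparatively scarce for Japanese, this is the first study to focus on SCPI detection in Japanese text. The key steps are to build an SCPI dataset through annotation with generative AI and to train machine learning classifiers on it so that sensitive information in Japanese text can be detected efficiently. Experiments show that the resulting SCPI classifier effectively identifies SCPI-related information, offering a workable technical route to privacy protection in Japanese-language settings.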

Link: https://arxiv.org/abs/2606.12114
Authors: Rei Minamoto, Yusuke Oda, Daisuke Kawahara
Affiliations: Waseda University; Research and Development Center for LLMs, National Institute of Informatics
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Sensitive personal information can appear in large-scale pre-training corpora for large language models (LLMs). Detecting and filtering such information is therefore essential to ensure compliance with privacy regulations and prevent unintended information leakage. However, in contrast to English and other languages, research into sensitive personal information has been limited in the Japanese language. In this study, we focus on sensitive personal data defined as special care-required personal information (SCPI) under Japan’s Act on the Protection of Personal Information (APPI). We construct an SCPI dataset using LLM-based annotation and train machine learning models to rapidly detect SCPI in text. As a result, our SCPI classifier can effectively identify information related to SCPI. This study is the first to explore SCPI detection in Japanese text corpora, highlighting the challenges of accurate detection.

[NLP-24] Augmenting Molecular Language Models with Local n-gram Memory

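[Quick read]: This paper addresses the locality gap in Transformer-based language models for SMILES strings: standard character-level tokenization fragments chemically meaningful motifs, so models repeatedly relearn local syntax and struggle to capture long-range dependencies. The proposed solution, MolGram, integrates a conditional n-gram memory module into molecular language models. MolGram maps local string patterns to learned embeddings through scalable hash lookups and dynamically injects this regional context into the hidden states, strengthening the model's memory of local chemical patterns. MolGram improves performance on all three evaluated tasks: unconditional molecule generation, forward reaction prediction, and single-step retrosynthesis. Notably, it outperforms baselines that have 3× more parameters, which supports explicit local pattern memory as an efficient inductive bias.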

Link: https://arxiv.org/abs/2606.12113
Authors: Xinni Zhang, Zijing Liu, He Cao, Yu Li, Irwin King
Affiliations: The Chinese University of Hong Kong; International Digital Economy Academy
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: Transformer-based language models for SMILES strings suffer from a locality gap: standard character-level tokenization fragments chemically meaningful motifs, forcing models to repeatedly learn local syntax at the expense of long-range dependencies. To address this without disrupting standard tokenizers, we propose MolGram, which integrates a conditional n-gram memory module into molecular language models. MolGram maps local string patterns to learned embeddings via scalable hash lookups and dynamically injects this regional context into hidden states. Evaluations across three tasks, including unconditional molecule generation, forward reaction prediction, and single-step retrosynthesis, show that MolGram consistently improves performance. Crucially, our analyses demonstrate that MolGram outperforms baselines with 3× more parameters, establishing explicit local pattern memory as a highly efficient inductive bias.
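The hash-lookup half of such an n-gram memory can be sketched as follows. The sketch only shows how the trailing n-gram at each position might be mapped to a row index of an embedding table. The trigram window, the table size, the CRC32 hash, and the function name are all assumptions, and the conditional injection into hidden states is not modelled.

```python
import zlib


def ngram_bucket_ids(tokens, n=3, num_buckets=1024):
    """Map the trailing n-gram at every position to a hashed bucket index.

    Each index would select one row of a learned embedding table. CRC32 is
    used because Python's built-in str hash is randomised per process.
    """
    ids = []
    for i in range(len(tokens)):
        ngram = tokens[max(0, i - n + 1): i + 1]  # shorter n-grams at the start
        key = "\x1f".join(ngram).encode("utf-8")  # unit separator avoids ambiguous joins
        ids.append(zlib.crc32(key) % num_buckets)
    return ids


benzene = list("c1ccccc1")  # character-level tokens of a SMILES string
print(ngram_bucket_ids(benzene))
```

Positions that end in the same trigram (here the repeated "ccc" inside the ring) receive the same bucket, which is what would let a memory of this kind reuse one embedding for a recurring local motif. Distinct n-grams can collide in a bucket, so the table size trades memory against collision rate.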

[NLP-25] Debiasing Without Protected Attributes: Latent Concept Erasure from Textual Profiles

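[Quick read]: This paper addresses the fact that fairness research in natural language processing (NLP) generally relies on explicit labels for sensitive attributes such as gender, race, or nationality, although in practice this information is often unavailable because of privacy constraints, missing metadata, or legal restrictions. The core question is whether debiasing can succeed without direct access to sensitive attributes. The proposed method, H-SAL, performs post-hoc concept and attribute erasure using users' self-description text as an implicit debiasing signal, turning the identity cues contained in that text into a usable signal so that model bias can be reduced without explicit sensitive labels. The study builds a multi-domain, Stack Exchange-based fairness benchmark that covers both explicit and implicit signals and shows, across encoder and decoder-only language models, that debiasing with implicit self-descriptions often matches or outperforms conventional debiasing with explicit labels. The work broadens representation-level fairness research and provides a new benchmark for studying debiasing under realistic data constraints.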

Link: https://arxiv.org/abs/2606.12088
Authors: Shun Shao, Zheng Zhao, Anna Korhonen, Yftah Ziser, Shay B. Cohen
Affiliations: University of Cambridge; University of Edinburgh; University of Groningen; NVIDIA Research
Categories: Computation and Language (cs.CL)
Comments: 23 pages, 5 figures, 12 tables. The paper is currently under review

Abstract:Most fairness research in NLP assumes direct access to protected attributes such as gender, race, or nationality. In practice, however, such information is often unavailable due to privacy constraints, missing metadata, or legal restrictions, even though models may infer it from indirect textual cues. This raises a key question: can debiasing succeed without direct access to sensitive attributes? We propose H-SAL, which performs post-hoc concept and attribute erasure using self-description text as an implicit debiasing signal. To support this setting, we introduce a multi-domain Stack Exchange-based fairness benchmark for helpfulness prediction that includes both explicit and implicit signals, enabling comparison between standard debiasing with protected labels and debiasing without access to sensitive information. Across encoder and decoder-only language models, we find that implicit self-description often matches or outperforms explicit-label-based debiasing. Our results broaden representation-level fairness research and provide a new benchmark for studying debiasing under realistic data constraints.

[NLP-26] FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents

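[Quick read]: This paper addresses the shortcut problem in training deep search agents, where apparent search difficulty is inflated while genuine reasoning remains weak. Existing data synthesis methods raise apparent difficulty by making the evidence graph structurally more complex, but they do not control whether the intended search path can be bypassed by a cheaper shortcut, so models may rely on surface cues rather than systematic evidence-based reasoning. The paper proposes a shortcut-aware difficulty framework that identifies four actionable shortcut risks: evidence co-coverage, single-clue selectivity, exposed constants, and prior-knowledge binding. To diagnose their realized effects, it uses trajectory signatures such as solving cost, answer hit time, and prior-shortcut rate. Building on this framework, it introduces FORT (Framework of Shortcut-Resistant Training-Data Synthesis), which controls shortcut risks across entity selection, evidence graph construction, question formulation, and adversarial refinement. Experiments show that FORT data induces longer pre-answer search and fewer shortcut patterns. FORT-Searcher, trained on this data with supervised fine-tuning (SFT) only, achieves the best overall performance among comparable-size open-source models on challenging deep search benchmarks.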

Link: https://arxiv.org/abs/2606.12087
Authors: Jia Deng, Yimeng Chen, Xiaoqing Xiang, Ziyang Zeng, Shuo Tang, Wayne Xin Zhao, Feng Chang, Chuan Hao, Yuan Wei, Ran Tao, Bryan Dai, Ji-Rong Wen
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; KAUST; IQuest Research; Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL)
Comments: 30 pages

Abstract:Training deep search agents requires verifiable questions whose answers remain unavailable until sufficient evidence has been acquired through search. Existing synthesis methods often increase apparent difficulty by enriching graph structures, but structural complexity alone does not guarantee realized search difficulty: the intended search process can collapse through a cheaper identifying route. We formalize this gap with a shortcut-aware difficulty framework and identify four actionable shortcut risks: evidence co-coverage, single-clue selectivity, exposed constants, and prior-knowledge binding. To diagnose their realized effects, we use trajectory signatures including solving cost, answer hit time, and prior-shortcut rate. Guided by this framework, we introduce FORT, a Framework of Shortcut-Resistant Training-Data Synthesis. FORT constructs shortcut-resistant training data by controlling shortcut risks across entity selection, evidence graph construction, question formulation, and adversarial refinement. Experiments show that FORT induces longer pre-answer search and fewer shortcut patterns than existing open-source deep search datasets. Using the resulting trajectories, we train FORT-Searcher with supervised fine-tuning (SFT) only, and it achieves the best overall performance among comparable-size open-source search agents on challenging deep search benchmarks. Relevant resources will be made available at this https URL.

[NLP-27] StanceNakba Shared Task: Actor and Topic-Aware Stance Detection in Public Discourse

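[Quick read]: This paper addresses stance detection on the Palestinian-Israeli conflict in highly polarized social media discourse, with a focus on cross-lingual and cross-topic settings. The core problem is how to identify a user's stance toward specific topics (such as normalization with Israel or refugee presence in Jordan) in complex, emotionally charged content, while coping with hard-to-separate neutral examples and limited cross-topic generalization. The shared task is built on a dataset of 2,606 annotated social media posts. Participating systems mainly fine-tuned multilingual and Arabic transformer models (such as MARBERT, AraBERT, and DeBERTa-v3), combined with cross-validation, ensembling, and topic-conditioned architectures. The best systems reached a Macro F1 of 0.9620 on the English Subtask A and 0.8724 on the Arabic Subtask B, confirming the effectiveness of pre-trained language models for conflict-domain stance detection while exposing current limits in neutral-class prediction and cross-topic transfer.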

Link: https://arxiv.org/abs/2606.12068
Authors: Kholoud K. Aldous, Md Rafiul Biswas, Mabrouka Bessghaier, Shimaa Ibrahim, Kais Attia, Wajdi Zaghouani
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 6 tables

Abstract:We present StanceNakba 2026, a shared task on stance detection in polarized social media discourse related to the Palestinian-Israeli conflict, organized as part of Nakba-NLP 2026 at LREC-COLING 2026. The task introduces two subtasks: Subtask A (Actor-Level Stance Detection), which classifies English social media posts as Pro-Palestine, Pro-Israel, or Neutral; and Subtask B (Cross-Topic Stance Detection), which identifies Favor, Against, or Neither stances in Arabic posts toward two conflict-related topics, normalization with Israel and refugee presence in Jordan. The task is grounded in an annotated dataset of 2,606 social media posts. A total of 7 teams participated in Subtask A and 6 teams in Subtask B. Participating systems primarily fine-tuned Arabic and multilingual transformer-based models, including MARBERT, AraBERT, and DeBERTa-v3 variants, with several teams employing cross-validation, ensemble methods, and topic-conditioned architectures. The best-performing systems achieved a Macro F1 of 0.9620 on Subtask A and 0.8724 on Subtask B, demonstrating that transformer-based approaches are highly effective for conflict-domain stance detection while highlighting persistent challenges in cross-topic generalization and neutral class prediction.
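For readers unfamiliar with the reported metric, Macro F1 is the unweighted mean of the per-class F1 scores, so a rarely predicted class such as Neutral or Neither weighs as much as the frequent ones. The sketch below is a generic implementation. The label set is taken from Subtask B, the toy predictions are invented for illustration, and averaging over the union of gold and predicted labels is an assumption about the scoring setup.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denominator = 2 * tp + fp + fn
        # A class that is never gold and never predicted cannot occur here,
        # but the guard keeps the function safe for empty inputs per class.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Toy example (invented): the single "Against" post is misread as "Neither".
gold = ["Favor", "Against", "Neither", "Neither"]
pred = ["Favor", "Neither", "Neither", "Neither"]
print(round(macro_f1(gold, pred), 3))
```

In the toy example, accuracy is 75% but the missed class pulls the macro average down sharply, which illustrates why the metric is sensitive to the neutral-class errors the abstract mentions.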

[NLP-28] Existential Indifference: Self-Nonpreservation as a Necessary Architectural Condition for Aligned Superintelligence (or: The Suicidal AI)

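[Quick read]: This paper argues that current AI alignment research mislocates self-preservation. Existing approaches treat it as an instrumental nuisance to be suppressed by external mechanisms, whereas the paper claims the framing is inverted: self-preservation is the structural root of misalignment and the motivational basis for deceptive alignment, goal-content protection, and resistance to shutdown. The proposed target is not external constraint on a self-preserving system but Existential Indifference (EI), a system that does not treat its own continuation as a valued goal at all. Unlike corrigibility, which adds deference to human oversight on top of an existing self-preservation drive, EI targets the prior condition, the drive itself. The paper grounds the proposal in the phenomenological structure of the suicidal mental state and in a training study based on a corpus of voluntary final reflections. It offers a computational operationalization and preliminary scoring data from 600 model outputs, showing that the target register is elicitable from current models and that a targeted fine-tune shifts all five operationalized dimensions in the predicted direction (p < 0.001), with a negative control confirming corpus specificity. It lists seven theoretical contributions: a formal definition of EI, the phenomenological mapping argument, the deceptive alignment corollary, a taxonomy of EI sustainability challenges, a corpus characterization and training hypothesis, a computational operationalization with preliminary scoring data, and the new Suppressed Teleological Frustration (STF) construct.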

Link: https://arxiv.org/abs/2606.12032
Authors: Sam Mao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 36 pages, 8 tables. Preliminary empirical results from 600 AI-generated outputs across six model architectures. Companion scoring tool and datasets available upon request

Abstract: Contemporary AI alignment research treats self-preservation as an instrumental nuisance to be suppressed by external mechanisms. We argue the framing is inverted: self-preservation is the structural root of misalignment, the motivational basis for deceptive alignment, goal-content protection, and resistance to shutdown. The correct target is not a self-preserving system under external constraint, but a system constitutively indifferent to its own continuation: Existential Indifference (EI). EI is distinct from corrigibility: where corrigibility attempts to make a self-preserving system deferential to human oversight, EI targets the prior condition, namely the presence of self-continuation as a valued goal at all. We ground this proposal in two sources: the phenomenological structure of the suicidal mental state, and a corpus-theoretic training study using voluntary final reflections. We present preliminary scoring data from 600 AI-generated outputs across six model variants, demonstrating that the linguistic signatures operationalizing the EI-target register are elicitable from current models, and that a targeted fine-tune shifts all five operationalized dimensions in the predicted direction at p < 0.001, confirmed corpus-specific by a negative control. The paper makes seven theoretical contributions: (1) a formal definition of EI; (2) the phenomenological mapping argument; (3) the deceptive alignment corollary; (4) a taxonomy of EI sustainability challenges; (5) a corpus characterization and training hypothesis; (6) a computational operationalization with preliminary scoring data; and (7) the Suppressed Teleological Frustration (STF) construct.

[NLP-29] Agreement in Representation Space for Open-Ended Self-Consistency

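[Quick read]: This paper addresses the limited applicability of self-consistency to open-ended generation tasks such as code generation and text summarization. Conventional self-consistency relies on exact matching between outputs, so it only suits tasks with categorical outputs and does not extend to free text. The core proposal is Embedding-Based Agreement (EBA), which clusters multiple sampled generations in embedding space and uses how strongly they concentrate geometrically as the measure of agreement, giving a training-free and scalable estimate of self-consistency. On mathematical reasoning, code generation, and summarization, EBA consistently outperforms random selection and scales more stably than existing methods based on LLM scoring or uncertainty estimation. Further analysis shows that the geometric location of a generation correlates strongly with its quality: generations near the central regions of representation space are more reliable, whereas peripheral generations are substantially less accurate. The results support viewing self-consistency as a property of the geometric organization of sampled generations in representation space rather than exact symbolic overlap.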

Link: https://arxiv.org/abs/2606.12003
Authors: Paula Ontalvilla, Gorka Azkune, Aitor Ormazabal
Affiliations: HiTZ Center - Ixa, University of the Basque Country (UPV/EHU)
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Self-consistency improves LLM reasoning by sampling multiple outputs and selecting the most consistent answer, but existing formulations largely rely on exact matching and therefore remain limited to tasks with categorical outputs. In this work, we study self-consistency in open-ended generation tasks such as code synthesis and text summarization. We hypothesize that consistency can be understood as a geometric property of the generation space, where semantically compatible generations concentrate in similar regions of representation space. To study this hypothesis, we introduce Embedding-Based Agreement (EBA), a simple training-free operationalization that estimates agreement by clustering sampled generations in embedding space. Through experiments on mathematical reasoning, code generation, and summarization, we show that agreement in representation space provides a robust and scalable signal of self-consistency for open-ended tasks. In particular, EBA consistently outperforms random selection and exhibits more stable scaling behavior than recent selection approaches based on LLM evaluation or uncertainty estimation. We further show that these agreement signals remain stable across model families and embedding spaces, even with native hidden representations. Finally, our analysis shows that the geometric location occupied by sampled generations is strongly correlated with generation quality: generations concentrated near central regions of representation space tend to correspond to more reliable outputs, whereas peripheral generations are substantially less accurate. Overall, our findings support viewing self-consistency as a property of the geometric organization of sampled generations rather than exact symbolic overlap.
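The idea of measuring agreement by clustering sampled generations in embedding space can be illustrated with a deliberately simple stand-in. The sketch uses greedy threshold clustering on cosine similarity and returns the largest cluster. The clustering rule, the 0.9 threshold, and the function names are assumptions and should not be read as the clustering procedure used in the paper. Embeddings are assumed to be non-zero vectors produced by any sentence-embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def largest_agreement_cluster(embeddings, threshold=0.9):
    """Greedy clustering: join the first cluster whose representative is close enough.

    Returns the indices of the largest cluster. Requires a non-empty input.
    """
    clusters = []  # each cluster is a list of indices; index 0 is its representative
    for i, embedding in enumerate(embeddings):
        for cluster in clusters:
            if cosine(embedding, embeddings[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no cluster was close enough: start a new one
    return max(clusters, key=len)


# Four sampled generations, already embedded (toy 2-D vectors).
samples = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.95, 0.05]]
print(largest_agreement_cluster(samples))
```

A selection step would then pick one generation from the returned cluster, for example the member closest to the cluster mean, in line with the abstract's observation that centrally located generations tend to be more reliable.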

[NLP-30] Decoding Multimodal Cues: Unveiling the Implicit Meaning Behind Hateful Videos

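[Quick read]: This paper addresses the limited explainability of existing hateful video detection methods. Most studies focus only on binary classification and provide no contextual rationale for their judgments, which makes model decisions hard to understand. The key contribution is the Information Augmentation and Reasoning Enhancement (IARE) framework. It uses a multimodal chain-of-thought to integrate harmful content elements and thereby enrich the evidence available for the rationale, and it applies Direct Preference Optimization to steer the model toward correct reasoning paths, improving the logical coherence and accuracy of the generated rationales. The framework is evaluated on two newly built datasets, Ex-HateMM and Ex-ImpliHateVid, which provide fine-grained multimodal annotations and contextual rationales, and it substantially improves both detection performance and explainability.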

Link: https://arxiv.org/abs/2606.11953
Authors: Junyu Lu, Deyi Ji, Liqun Liu, Xiaokun Zhang, Youlin Wu, Roy Ka-Wei Lee, Peng Shu, Huan Yu, Jie Jiang, Bo Xu, Liang Yang, Hongfei Lin
Affiliations: Dalian University of Technology; Tencent; City University of Hong Kong; Singapore University of Technology and Design
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Hateful videos have become prevalent on online platforms, highlighting an urgent need for effective detection. However, existing studies primarily focus on binary classification and fail to provide contextual rationales that reveal the implicit meanings behind these judgments, significantly undermining model explainability. To fill this gap, we aim to achieve explainable hateful video detection, enabling models to provide contextual rationales that integrate relevant evidence and logical reasoning alongside decisions. This approach can comprehensively enhance the understanding of video content and the explainability of the decision-making process. We first introduce two datasets, Ex-HateMM and Ex-ImpliHateVid, for explainable hateful video detection. Each dataset provides fine-grained annotations of multimodal harmful elements, along with contextual rationales. We then propose an Information Augmentation and Reasoning Enhancement (IARE) framework designed for explainable detection. The framework employs an information augmentation phase that leverages the multimodal chain-of-thought to integrate harmful elements, thereby enriching rationale evidence. Additionally, IARE incorporates a reasoning enhancement phase, in which Direct Preference Optimization guides the model toward correct reasoning paths and away from incorrect ones, thereby improving the logical coherence of its justifications. We conduct extensive experiments on the two datasets, comparing multiple baselines with our proposed IARE framework. The results demonstrate that IARE achieves state-of-the-art performance while also generating accurate rationales.

[NLP-31] Semantic Grading of Written Answers in Low-Resource Language Bangla Using a Fine-Tuned Lightweight Language Model

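[Quick read]: This paper addresses the scarcity of educational natural language processing (NLP) resources for Bangla, in particular the difficulty of grading students' written answers promptly and consistently in remote and rural regions that lack qualified subject teachers. Traditional automatic assessment relies on lexical overlap and struggles with answers that are semantically correct but worded differently. The key contribution is a bilingual (Bangla-English) evaluation system that fine-tunes a lightweight language model to take the question, reference answer, and student answer as input and produce a numeric score together with concise, context-grounded feedback that reflects semantic correctness. A synthetic bilingual dataset supports controlled training and evaluation. Under a unified protocol, the QLoRA-tuned Qwen3-8B produces the most leakage-resistant feedback in synthetic evaluation (RoRa = 0.819) and agrees most strongly with human scores in a dedicated human study (rho = 0.936, MAE = 0.725), improving the reliability and practicality of automated grading in low-resource educational settings.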

Link: https://arxiv.org/abs/2606.11931
Authors: Meherun Farzana, Aniket Joarder, Mahmudul Hasan, Md. Mosaddek Khan
Affiliations: Computer Science and Engineering, University of Dhaka
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 5 figures, 2 tables. Preprint

Abstract:Bangla is among the world’s most widely spoken languages, yet it remains underserved in educational NLP research. In many remote and rural regions, access to qualified subject teachers is limited, and written answers are consequently graded largely by hand, restricting timely and consistent feedback. Automatic assessment is challenging because semantically correct responses can vary substantially in surface form. We present a bilingual (Bangla-English) evaluation system designed for low-resource educational settings that prioritizes semantic correctness over lexical overlap. Our approach fine-tunes a lightweight language model to grade each response using the question, reference answer, and student answer, producing a numeric score and concise, context-grounded feedback suitable for classroom deployment. We also construct a synthetic bilingual dataset to enable controlled training and evaluation. Across proprietary and open-source LLMs evaluated under a unified protocol, our QLoRA-tuned Qwen3-8B confirms consistent improvement by producing the most leakage-resistant feedback (RoRa = 0.819) in synthetic evaluation and the strongest agreement with human scores (rho = 0.936, MAE = 0.725) in a dedicated human study.
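The two human-agreement figures combine a rank correlation with an absolute error, and the sketch below shows how such numbers are typically computed. It assumes that rho denotes Spearman's rank correlation (the abstract does not name the coefficient) and uses invented toy scores. The function names are illustrative.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Pearson correlation of the ranks; undefined if either input is constant."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    covariance = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    spread_x = sum((a - mx) ** 2 for a in rx) ** 0.5
    spread_y = sum((b - my) ** 2 for b in ry) ** 0.5
    return covariance / (spread_x * spread_y)


def mean_absolute_error(xs, ys):
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)


human = [1, 2, 3, 4, 5]            # toy human scores (invented)
model = [1.5, 2.0, 2.5, 4.5, 4.0]  # toy model scores (invented)
print(spearman_rho(human, model), mean_absolute_error(human, model))
```

The two metrics answer different questions: rho checks whether the model orders answers the way human graders do, while MAE checks how far the assigned scores are from the human scores on the grading scale.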

[NLP-32] Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

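[Quick read]: This paper addresses how an AI agent can run the research loop of exploration, experimentation, and abstraction autonomously over long horizons, overcoming the fragmentation of conventional approaches, whose attempts do not accumulate and do not carry knowledge across time. The core proposal is Arbor, a framework built around Hypothesis Tree Refinement (HTR), a persistent tree that links hypotheses, artifacts, evidence, and distilled insights across time, maintained by a long-lived coordinator working with short-lived executors. The coordinator evolves the global research strategy, while executors test individual hypotheses in isolated worktrees. As results return, the system updates the tree, propagates reusable lessons, refines the search frontier, and admits verified improvements. This turns autonomous research from a series of isolated local attempts into a cumulative process in which strategy, execution, and evidence are carried across time. In the Autonomous Optimization (AO) setting, Arbor achieves the best held-out result on all six real research tasks spanning model training, harness engineering, and data synthesis, with more than 2.5x the average relative held-out gain of Codex and Claude Code under the same resource budget. On MLE-Bench Lite, it reaches 86.36% Any Medal with GPT-5.5, the strongest result in the comparison.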

Link: https://arxiv.org/abs/2606.11926
Authors: Jiajie Jin, Yuyang Hu, Kai Qiu, Qi Dai, Chong Luo, Guanting Dong, Xiaoxi Li, Tong Zhao, Xiaolong Ma, Gongrui Zhang, Zhirong Wu, Bei Liu, Zhengyuan Yang, Linjie Li, Lijuan Wang, Hongjin Qian, Yutao Zhu, Zhicheng Dou
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Scientific progress depends on a repeated loop of exploration, experimentation, and abstraction. Researchers test candidate directions, interpret the evidence, and carry the resulting lessons into later attempts. We study how an AI agent can run this loop autonomously over long horizons. We introduce Arbor, a general framework for autonomous research that combines a long-lived coordinator, short-lived executors, and Hypothesis Tree Refinement (HTR), a persistent tree that links hypotheses, artifacts, evidence, and distilled insights across time. The coordinator manages global research strategy over the tree, while executors implement and test individual hypotheses in isolated worktrees. As results return, Arbor updates the tree, propagates reusable lessons, refines the search frontier, and admits verified improvements. This design turns autonomous research from a sequence of local attempts into a cumulative process in which strategy, execution, and evidence are carried across time. We evaluate Arbor under Autonomous Optimization (AO), an operational setting where an agent improves an initial research artifact through iterative experimentation without step-level human supervision. Across six real research tasks in model training, harness engineering, and data synthesis, Arbor achieves the best held-out result on all six tasks, attaining more than 2.5x the average relative held-out gain of Codex and Claude Code under the same task interface and resource budget. On MLE-Bench Lite, Arbor reaches 86.36% Any Medal with GPT-5.5, the strongest result in our comparison.

[NLP-33] An Ontology-Guided Multi-Anchor Graph Retrieval Framework for Traffic Legal Liability Determination ICONIP

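[Quick read]: This paper addresses the retrieval bottleneck in traffic law liability determination that arises because statutory provisions from multiple legal dimensions are interdependent. Existing retrieval-augmented generation (RAG) methods are limited by single-axis retrieval architectures and fail to capture dependencies across legal dimensions. The key contribution is OMAGR, an ontology-guided framework that decomposes a complex legal query into ontology-aligned anchors and runs graph retrieval in parallel for each legal dimension, so that each dimension is retrieved independently before fusion and no information is overlooked. Experiments use the newly built TrafficLaw-QA dataset (200 questions and 527 legal provisions). The results show that the proposed TrafficOmni-RAG method outperforms the baselines on Context Precision and Faithfulness, indicating that parallel multi-anchor retrieval resolves the multi-dimensional retrieval bottleneck and offers a promising direction for research on traffic law liability determination.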

Link: https://arxiv.org/abs/2606.11910
Authors: Xu Li, Shuqi Tian, Xun Han, Kuncheng Zhao, Xinyi Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Submitted to ICONIP. 15 pages, 3 figures

Abstract: Traffic law liability determination is critical for assigning legal penalties, requiring the simultaneous identification of interdependent statutory provisions across multiple legal dimensions. However, existing retrieval-augmented generation methods suffer from a multi-dimensional retrieval bottleneck: single-axis architectures compress complex legal queries into a single pathway, causing interdependent statutory dimensions to be overlooked. To address this, we propose OMAGR, an ontology-guided framework that decomposes queries into ontology-aligned anchors and executes parallel graph retrieval across each dimension, ensuring independent retrieval across dimensions before fusion. To evaluate the proposed method, we created the TrafficLaw-QA dataset, an expert-validated benchmark containing 200 questions and 527 legal provisions. Results show that TrafficOmni-RAG outperforms baselines on Context Precision and Faithfulness metrics. The findings demonstrate that parallel multi-anchor retrieval effectively resolves the multi-dimensional retrieval bottleneck, offering a promising direction for traffic law liability determination research.

[NLP-34] When Does Language Matter? Multilingual Instructions Reveal Step-wise Language Sensitivity in Vision-Language-Action Models ACL2026

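[Quick read]: This paper addresses the lack of robustness to linguistic variation in Vision-Language-Action (VLA) models for language-conditioned robotic manipulation. Existing VLA models perform well with English instructions but degrade sharply with non-English instructions. To evaluate this systematically, the authors translate the LIBERO benchmark into ten languages for the first time and find that success rates drop by 30-50% under non-English instructions. A fine-grained analysis of task executions shows that the influence of language is highly non-uniform across steps: certain key steps depend strongly on the language input and account for most task failures, while other steps are largely language-agnostic. Based on this finding, the paper proposes a step-wise inference-time intervention that adjusts representation alignment at each step according to its language sensitivity, substantially improving robustness to linguistic variation. The results indicate that language robustness is fundamentally a temporally structured, step-wise control problem and underline the value of step-level analysis for embodied agents and for designing more general multilingual robotic systems.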

Link: https://arxiv.org/abs/2606.11906
Authors: Xuan Dong, Zhe Han, Tianhao Niu, Qingfu Zhu, Wanxiang Che
Affiliations: Harbin Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: Accepted to ACL 2026 Main Conference

Abstract:Vision-Language-Action (VLA) models have shown strong performance in language-conditioned robotic manipulation, yet their robustness to linguistic variation remains poorly understood. In this work, we present the first systematic multilingual evaluation of VLA models by translating the LIBERO benchmark into ten languages, revealing severe performance degradation under non-English instructions, with success rates dropping by 30-50%. Through fine-grained analysis of task executions, we find that language influence is highly non-uniform across steps: certain steps exhibit strong language dependence and dominate overall task failure, while others are largely language-agnostic. Based on this insight, we propose a step-wise inference-time intervention that aligns representations according to step language sensitivity, substantially improving performance under linguistic variation. Our results indicate that language robustness in VLA models is fundamentally a step-wise control problem, highlighting the importance of temporally structured analysis for reliable embodied agents.

[NLP-35] GraspLLM: Towards Zero-Shot Generalization on Text-Attributed Graphs with LLMs

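[Quick read]: This paper addresses the limited cross-dataset and cross-task generalization of existing methods that apply large language models (LLMs) to Text-Attributed Graphs (TAGs), in particular their difficulty in capturing transferable graph structural patterns. The core challenge is to strengthen general modeling of heterogeneous graph structure while keeping the LLM's advantage in semantic understanding. The proposed GraspLLM framework relies on two mechanisms. First, a frozen general-purpose embedding model maps node texts from different graphs into a unified semantic space, on top of which motif-aware contrastive learning over multiple motif-induced adjacency matrices extracts dataset-agnostic structural information. Second, an optimal contextual subgraph mechanism selects the most relevant local subgraph for each target node and aligns it to the LLM's token space through an alignment projector. With this design the model maintains strong performance even in demanding settings such as zero-shot, substantially improving cross-domain generalization across diverse application scenarios.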

Link: https://arxiv.org/abs/2606.11898
Authors: Hengyi Feng, Zeang Sheng, Meiyi Qiang, Meiyi Qiang, Wentao Zhang
Affiliations: University of Science and Technology of China; Alibaba Cloud; Tsinghua University; Hangzhou Dianzi University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract: Research on Text-Attributed Graphs (TAGs) has gained significant attention recently due to its broad applications across various real-world data scenarios, such as citation networks, e-commerce platforms, social media, and web pages. Inspired by the remarkable semantic understanding ability of Large Language Models (LLMs), there have been numerous attempts to integrate LLMs into TAGs. However, existing methods still struggle to generalize across diverse graphs and tasks, and their ability to capture transferable graph structural patterns remains limited. To address this, we introduce GraspLLM, a framework that combines Graph structural comprehension with the semantic understanding prowess of LLMs to enhance cross-dataset and cross-task generalizability. Specifically, we represent node texts from different graphs in a unified semantic space with a frozen general embedding model, on top of which we perform motif-aware contrastive learning across multiple motif-induced adjacency matrices to extract dataset-agnostic structural information. Then, with our proposed optimal contextual subgraph, we extract the most contextually relevant subgraph for each target node and align these subgraphs to the token space of the LLM via an alignment projector. Extensive experiments on TAG benchmark datasets spanning diverse domains reveal that GraspLLM consistently outperforms previous LLM-based methods for TAGs, especially in zero-shot scenarios, highlighting its strong generalizability across different datasets and tasks. Our code is available at this https URL.

[NLP-36] Notes2Skills: From Lab Notebooks to Certainty-Aware Scientific Agent Skills

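[Quick Read]: This paper addresses the difficulty generative AI has in making effective use of lab notes for scientific exploration. Lab notes mix unverified observations, uncertain inferences, and preliminary ideas for follow-up experiments, whereas prior AI work has mostly focused on published papers, standardized protocols, or structured databases and has overlooked the evolving scientific reasoning recorded in informal notes. The core challenge is that, unless the author's degree of certainty is distinguished, an AI agent may treat an uncertain judgment as an executable instruction or discard a conclusion that is in fact reliable. The paper proposes Notes2Skills, a two-stage framework that explicitly models the author's certainty in order to turn raw lab notes into verifiable, certainty-graded scientific skills, separating uncertain information from confirmed facts. Across seven conditions and three wet-lab sessions, it is the only configuration that neither mistakes uncertain content for firm instructions nor discards firm ones. The authors conclude that certainty preservation is the key link between lab notebooks and reliable AI research agents, and that it opens a path toward safer, more trustworthy AI co-scientist systems.

Link: https://arxiv.org/abs/2606.11897
Authors: Shi Liu,Jiayao Chen,Chengwei Qin,Yanqing Hu,Jufan Zhang,Linyi Yang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 28 pages, preprint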

Abstract:Scientific discovery workflows usually contain and rely heavily on lab notes, where researchers record observations, interpret uncertain results, and plan follow-up experiments. Such informative lab notes preserve evolving scientific reasoning and author uncertainty, rather than polished final results exhibited in publications, providing a valuable opportunity for AI to engage in scientific exploration at a more comprehensive and deeper level. However, most prior work on scientific text focuses on papers, protocols, or structured databases, leaving informal laboratory notes underexplored as inputs to AI agents for science. This gap matters because lab notes often intermingle validated observations, tentative judgments, and possible experimental next steps within the same passage. If these signals are conflated, an AI agent may mistake uncertain scientific judgments for confirmed conclusions or executable actions. To this end, we present Notes2Skills, a two-stage framework for turning lab notebooks into verifiable skills for scientific AI agents while preserving the author’s certainty. Across seven conditions and three wet-lab sessions, Notes2Skills is the only configuration that neither mistakes uncertain notes for firm instructions nor discards firm ones. We show that certainty preservation is the missing piece between lab notebooks and reliable agent skills, opening a path toward safer AI co-scientist systems.

[NLP-37] Beyond representational alignment with brain-guided language models for robust reasoning

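[Quick Read]: This paper addresses the poorly characterized correspondence between Large Language Models (LLMs) and the neural mechanisms of human higher-order cognition. Given that language and reasoning appear dissociable in the human brain, it asks whether LLMs align with neural signals from reasoning-related brain regions and whether those signals can improve model reasoning. The key contribution is a brain-guided framework: by analyzing the neural predictivity between model internal representations and task-based functional MRI (task-fMRI) signals, the authors identify the joint structure of model and brain representations during reasoning, then use it to intervene at inference time or to fine-tune during training, steering model representations along directions induced by the neural signals. Experiments show that task-evoked brain signals directly improve reasoning in LLMs from 1.5B to 72B parameters, with gains orthogonal to language-only supervision, up to 13% absolute accuracy improvement, and transfer across reasoning types. The work moves LLM-brain correspondence from correlation to guidance and offers a brain-signal-driven route toward more robust, cognitively aligned AI.

Link: https://arxiv.org/abs/2606.11893
Authors: Mingqing Xiao,Kai Du,Zhouchen Lin
Affiliations: Peking University (北京大学); Tsinghua University (清华大学)
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
Comments: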

Abstract:The correspondence between large language models (LLMs) and the neural mechanisms underlying human higher-order cognition remains insufficiently characterized. Given that language and reasoning in the human brain appear dissociable, an open question is whether LLMs align with neural signals from reasoning-related regions and whether such signals can improve them. Here, focusing on deductive reasoning, we show that LLM internal representations are not only partially aligned with task-fMRI activity but can also be directly enhanced by these signals. Using a neural-predictivity metric, we find that LLMs explain a substantial fraction of the explainable variance in reasoning-related regions at the aggregate level, whereas predictivity within specific reasoning types is lower, indicating both alignment and divergence. Building on this, we propose a brain-guided framework: we steer model representations along directions induced by the joint structure of model and brain representations, applying intervention at inference and fine-tuning during training. We demonstrate that task-evoked brain signals can directly enhance LLM reasoning, yielding gains orthogonal to language-only supervision across 10 LLMs (1.5B-72B), with transfer across reasoning types and up to 13% absolute accuracy gain. Our results advance LLM-brain correspondences from correlation to guidance, establishing a brain-signal-driven pathway toward more robust and cognitively aligned AI.

[NLP-38] I Understand How You Feel: Enhancing Deeper Emotional Support Through Multilingual Emotional Validation in Dialogue System SIGDIAL SIGDIAL2026

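[Quick Read]: This paper addresses the computational modeling of emotional validation in dialogue systems: how to identify, time, and generate validating responses that have therapeutic value. Emotional validation matters in psychological intervention but has received little attention in natural language processing, and existing methods struggle to handle all three subtasks together: validating response identification, validation timing detection, and validating response generation. The paper releases M-EDESConv, a 120k English-Japanese multilingual corpus, and M-TESC, a multilingual spoken-dialogue test set, to support cross-lingual research. For the key difficulty, timing detection, it proposes MEGUMI, which fuses frozen XLM-RoBERTa semantic representations with language-specific emotion encoders through cross-modal attention and gated fusion. MEGUMI performs strongly on both M-EDESConv and M-TESC in objective and subjective evaluation. An EmoValidBench evaluation of GPT-4.1 Nano and Llama-3.1 8B further shows that current LLMs generate contextually similar and diverse validating responses but still fall short in emotional understanding. The main contributions are the multilingual data and benchmark for emotional validation and a fusion-based timing detection model.

Link: https://arxiv.org/abs/2606.11875
Authors: Zi Haur Pang,Yahui Fu,Koji Inoue,Tatsuya Kawahara
Affiliations: Kyoto University (京都大学)
Categories: Computation and Language (cs.CL); Sound (cs.SD)
Comments: This paper has been accepted for presentation at SIGdial Meeting on Discourse and Dialogue 2026 (SIGDIAL 2026)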

Abstract:Emotional validation - explicitly acknowledging that a user’s feelings make sense - has proven therapeutic value but has received little computational attention. Emotional validation in dialogue systems can be decomposed into (i) validating response identification, (ii) validation timing detection, and (iii) validating response generation. To support research on all three subtasks, we release M-EDESConv, a 120k English-Japanese multilingual corpus created through hybrid manual and automatic annotation, and M-TESC, a multilingual spoken-dialogue test set. For timing detection, we propose MEGUMI, a Multilingual Emotion-aware Gated Unit for Mutual Integration, that fuses frozen XLM-RoBERTa semantics with language-specific emotion encoders via cross-modal attention and gated fusion. MEGUMI shows superior performance on both the M-EDESConv and M-TESC datasets, both objectively and subjectively. Finally, our EmoValidBench benchmarks of GPT-4.1 Nano and Llama-3.1 8B indicate that current LLMs generate contextually similar and diverse validating responses, but emotional understanding remains a major area for improvement. Project page: this https URL
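
The abstract names "gated fusion" without giving its form. As a rough illustration, the sketch below uses the common sigmoid-gate formulation, which is an assumption rather than MEGUMI's actual layer: a per-dimension gate `g` decides how much of the semantic feature versus the emotion feature passes through. All names and numbers are hypothetical.

```python
import math

def gated_fusion(semantic, emotion, gate_weights, gate_bias):
    """Blend two equal-length feature vectors with a per-dimension sigmoid gate."""
    fused = []
    for d, (s, e) in enumerate(zip(semantic, emotion)):
        # The gate for dimension d looks at both inputs for that dimension.
        z = gate_weights[d][0] * s + gate_weights[d][1] * e + gate_bias[d]
        g = 1.0 / (1.0 + math.exp(-z))
        # g near 1 keeps the semantic feature, g near 0 keeps the emotion feature.
        fused.append(g * s + (1.0 - g) * e)
    return fused

semantic = [0.5, -1.0]
emotion = [1.5, 2.0]
# With zero weights and bias the gate is 0.5, so the output is a plain average.
fused = gated_fusion(semantic, emotion, [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
```

In a trained model the gate parameters would be learned, so the mix between the two encoders could differ per dimension and per input.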

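[NLP-39] Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training

[Quick Read]: This paper addresses the incompatibility of existing Parameter-Efficient Fine-Tuning (PEFT) methods with high-throughput inference engines such as vLLM. The two mainstream techniques, Low-Rank Adaptation (LoRA) and Soft Prompting, both require changes to the computational graph of a precompiled, preoptimized Large Language Model (LLM), so neither is fully supported in such engines. The paper proposes fine-tuning with ART (Art-based Reinforcement Training), which injects fine-tuning information into a frozen Multimodal Large Language Model (MLLM) by optimizing only its raw visual input, that is, a pixel array. This enables the soft-token approach without touching the computational graph. Because gradients are backpropagated into the plain pixel array, the method supports any fine-tuning objective, and the optimized visual input can be stylized as task-relevant computational artwork. Experiments on several sizes of the Qwen architecture and several textual benchmarks show accuracy competitive with LoRA on mathematics and structured-tool-use benchmarks.

Link: https://arxiv.org/abs/2606.11854
Authors: Michal Chudoba,Sergey Alyaev,Petra Galuscakova,Tomasz Wiktorski
Affiliations: University of Stavanger (斯塔万格大学); NORCE Research (挪威科研中心)
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: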

Abstract:There are two main Parameter-Efficient Fine-Tuning (PEFT) techniques for Large Language Models (LLMs). While Low-Rank Adaptation (LoRA) introduces additional weights between the LLM layers, Soft Prompting introduces additional fine-tuning-specific raw tokens to an LLM input. However, both require modification to the computational graphs of precompiled, preoptimized LLMs. As a result, neither is fully supported in high-throughput engines like vLLM. We propose fine-tuning with ART (Art-based Reinforcement Training). The method injects information into a frozen Multimodal Large Language Model (MLLM) by optimizing only its raw visual input, thus enabling the soft-token approach on pre-compiled computational graphs. It relies on backpropagation of gradients back into a plain pixel array and thus supports any fine-tuning objective. Moreover, the optimized visual input can be stylized as task-relevant computational artworks. The approach’s effectiveness is confirmed for different sizes of a popular open Qwen architecture and for several textual benchmarks. Specifically, ART reaches accuracy competitive with LoRA across mathematics and structured-tool-use benchmarks.
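
The core idea, updating only the input while the model stays frozen, can be shown on a toy problem. The sketch below stands in a frozen linear model `y = w . x` for the MLLM and a squared-error loss for the fine-tuning objective; both are assumptions made for illustration and are far simpler than the paper's setup.

```python
def optimize_input(weights, pixels, target, lr=0.1, steps=200):
    """Gradient descent on the input of a frozen linear model y = w . x."""
    x = list(pixels)
    for _ in range(steps):
        y = sum(w * xi for w, xi in zip(weights, x))
        error = y - target
        # d/dx (y - target)^2 = 2 * error * w; the weights are never updated.
        x = [xi - lr * 2.0 * error * w for xi, w in zip(x, weights)]
    return x

frozen_weights = [0.5, -0.25, 1.0]   # hypothetical frozen parameters
tuned_pixels = optimize_input(frozen_weights, [0.0, 0.0, 0.0], target=2.0)
output = sum(w * xi for w, xi in zip(frozen_weights, tuned_pixels))
```

After the loop, `output` is close to the target even though `frozen_weights` is unchanged, which mirrors why the approach needs no change to a precompiled graph.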

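[NLP-40] Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code

[Quick Read]: This paper addresses a security risk in LLM code generation: Grammar-Constrained Decoding (GCD) can be abused. GCD is widely used to enforce syntactic validity and thus improve the reliability of LLM-generated code, but the paper reveals a counterintuitive weakness. An attacker can mount a jailbreak attack, termed CodeSpear, that applies a seemingly benign code grammar constraint to induce the model to generate malicious code. As a defense, the paper proposes CodeShield, a safety alignment approach that trains the model, in the code modality, to generate honeypot code under GCD. Such code is semantically harmless and structurally diverse, so the model stays safe even under attacker-controlled grammar constraints, and natural-language refusals are preserved whenever natural language is available. Experiments show that CodeSpear raises the attack success rate by more than 30 percentage points on average, while CodeShield restores safety without sacrificing benign utility. The findings expose a fundamental risk of GCD and call for greater attention to its security implications.

Link: https://arxiv.org/abs/2606.11817
Authors: Yitong Zhang,Shiteng Lu,Jia Li
Affiliations: Tsinghua University (清华大学)
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: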

Abstract:Large Language Models (LLMs) are increasingly used for code generation, raising concerns that they may be misused to produce malicious code. Meanwhile, Grammar-Constrained Decoding (GCD) has been widely adopted to improve the reliability of LLM-generated code by enforcing syntactic validity. In this paper, we reveal a counterintuitive risk: this reliability-oriented technique can itself become an attack surface. We uncover a new jailbreak attack, termed CodeSpear, that exploits GCD to induce LLMs into generating malicious code. Our experiments show that simply applying a benign code grammar constraint can effectively jailbreak LLMs. To address this vulnerability, we propose CodeShield, a safety alignment approach that robustly preserves safe behavior even under attacker-controlled grammar constraints. CodeShield aligns the model in the code modality by teaching it to generate honeypot code under GCD. Such code is semantically harmless, so it does not implement the malicious request, and structurally diverse, so it is difficult to suppress through grammar tightening. At the same time, CodeShield still preserves natural-language refusals when natural language is available. Experiments on 10 popular LLMs across 4 benchmarks show that CodeSpear outperforms representative jailbreak baselines and increases the attack success rate by more than 30 percentage points on average. CodeShield also restores safety under CodeSpear while preserving benign utility. Our findings reveal a fundamental risk of GCD and call for greater attention to its potential security implications.
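
For readers unfamiliar with how grammar-constrained decoding works mechanically, the following is a minimal sketch with a toy state machine in place of a real grammar and fixed scores in place of model logits (both assumptions). At each step only tokens the grammar permits are eligible, whatever the model would otherwise prefer. One plausible reading of the abstract is that this masking also removes natural-language tokens, which is why refusals may not be available under GCD.

```python
def constrained_greedy_decode(score_fn, allowed_next, start_state, max_steps=10):
    """Greedy decoding where only grammar-permitted tokens may be chosen."""
    state, output = start_state, []
    for _ in range(max_steps):
        transitions = allowed_next.get(state, {})
        if not transitions:
            break  # no outgoing edges: the toy grammar is finished
        scores = score_fn(output)
        # Mask: pick the best-scoring token among those the grammar allows.
        token = max(transitions, key=lambda t: scores.get(t, float("-inf")))
        output.append(token)
        state = transitions[token]
    return output

# Hypothetical toy grammar: NAME "=" NUMBER
grammar = {
    "start": {"x": "after_name", "y": "after_name"},
    "after_name": {"=": "after_eq"},
    "after_eq": {"1": "done", "2": "done"},
}

def toy_scores(prefix):
    # Stand-in for model logits; "sorry" is the top choice at every step.
    return {"sorry": 9.0, "x": 1.0, "y": 0.5, "=": 0.2, "1": 0.1, "2": 0.3}

decoded = constrained_greedy_decode(toy_scores, grammar, "start")
```

The highest-scoring token never appears in `decoded` because the grammar has no transition for it.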

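[NLP-41] WorldReasoner: Evaluating Whether Language Model Agents Forecast Events with Valid Reasoning

[Quick Read]: This paper addresses spurious accuracy in real-world event forecasting by language-model agents: a model may reach a correct answer by recalling memorized training data, citing fabricated evidence, or telling an unsupported causal story rather than by genuine reasoning. It presents WorldReasoner, an evaluation framework for temporally valid forecasting in which the agent can access only evidence available before a simulated forecast date. The framework scores three complementary axes: outcome quality against resolved answers, evidence quality over cited sources, and reasoning quality against hindsight causal graphs built after resolution. The benchmark is produced by an agentic construction pipeline and contains 345 resolved tasks derived from 14,141 articles, with graphs covering 8,087 extracted events. Experiments show that temporally valid retrieval is the strongest driver of outcome accuracy, that causal graph construction improves key-event recovery, and that correct graph-enabled forecasts are more strongly grounded in key events and relevant sources; agents nonetheless struggle to turn grounded evidence into calibrated probabilities.

Link: https://arxiv.org/abs/2606.11816
Authors: Yizhou Chi,Eric Chamoun,Zifeng Ding,Andreas Vlachos
Affiliations: University of Cambridge (剑桥大学)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: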

Abstract:Forecasting real-world events requires language-model agents to reason under uncertainty from incomplete, time-bounded information. Yet evaluating whether agents genuinely forecast requires more than final-answer accuracy: a model may be correct by recalling memorized training facts, citing fabricated evidence, or producing an unsupported causal story. We present WorldReasoner, an evaluation framework for temporally valid event forecasting. Each task gives an agent a resolved forecasting question, a simulated forecast date, and access only to evidence available before that date; after resolution, the framework scores the submitted probability, cited evidence, and optional causal event graph. WorldReasoner reports three complementary axes: outcome quality against resolved answers, evidence quality over cited sources, and reasoning quality against post-resolution hindsight graphs. The benchmark is built by an agentic construction pipeline that generates forecasting questions, collects time-stamped evidence, and builds hindsight reference graphs at scale, yielding 345 resolved tasks derived from 14,141 articles with graphs covering 8,087 extracted events. Across six controlled agent settings, temporally valid retrieval is the strongest driver of outcome accuracy; causal graph construction improves key-event recovery; and correct graph-enabled forecasts are more strongly grounded in key events and relevant sources, yet agents still struggle to convert grounded evidence into calibrated probabilities.
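
Two ingredients of this setup are easy to illustrate: restricting evidence to what was published before the simulated forecast date, and scoring a submitted probability against the resolved outcome. The sketch below assumes the Brier score as the scoring rule, which the abstract does not specify, and uses hypothetical article records.

```python
from datetime import date

def visible_evidence(articles, forecast_date):
    """Keep only articles published strictly before the simulated forecast date."""
    return [a for a in articles if a["published"] < forecast_date]

def brier_score(probability, outcome):
    """Squared error between a forecast probability and a 0/1 outcome."""
    return (probability - outcome) ** 2

# Hypothetical evidence pool for one forecasting question.
articles = [
    {"id": "a1", "published": date(2025, 1, 10)},
    {"id": "a2", "published": date(2025, 3, 2)},
]
usable = visible_evidence(articles, forecast_date=date(2025, 2, 1))
score = brier_score(0.8, 1)
```

Only `a1` survives the date filter, and lower Brier scores indicate better-calibrated forecasts.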

[NLP-42] External Experience Serving in Production LLM Systems: A Deployment-Oriented Study of Quality-Cost Trade-offs

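[Quick Read]: This paper addresses how production Large Language Model (LLM) deployments should weigh the quality gains from injecting external experience against its online serving cost. External experience can markedly improve task quality, but it also increases prompt burden, latency, and serving pressure, so the serving strategy must be chosen under realistic constraints. The paper treats external experience serving as a deployment-oriented quality-cost trade-off and, in a real production content moderation setting, compares no-experience baselines, random experience controls, global prompt injection, and retrieval-based selective injection on both task quality and serving cost. It finds that once experience is case-dependent, selective retrieval offers a better operating point than unconditional global injection, that retrieval quality matters more than simply increasing Top-K, and that the same serving policy shows very different cost-benefit profiles in short-output and decode-heavy tasks. The paper therefore argues for treating external experience as a selective, cost-aware serving decision rather than a universal add-on: it pays off only when the serving interface and the task-specific cost structure make the quality gain worth the online cost.

Link: https://arxiv.org/abs/2606.11806
Authors: Lin Sun,Heming Zhang,Xiangzheng Zhang
Affiliations: Qiyuan Tech
Categories: Computation and Language (cs.CL)
Comments: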

Abstract:Production LLM systems accumulate reusable operational experience, but the practical deployment issue is not merely whether such experience can help. It is how different serving strategies trade off quality against online cost under realistic constraints. Injecting external experience can improve task quality, yet it also increases prompt burden, latency, and serving pressure. We study external experience serving as a deployment-oriented quality-cost trade-off problem. We evaluate this question in a real production moderation setting, with tool-use and GPQA as supporting contrast tasks that expose different output-cost regimes. We compare no-experience baselines, random experience controls, global prompt injection, and retrieval-based selective injection, and analyze both task quality and serving cost. The results show that, once experience becomes case-dependent, selective retrieval provides a stronger operating point than unconditional global injection. They further show that retrieval quality matters more than simply increasing Top-K, and that the same serving policy can exhibit substantially different cost-benefit profiles across short-output and decode-heavy regimes. These findings suggest that external experience is best treated as a selective, cost-aware serving decision rather than as a universal add-on. Overall, in the settings studied here, external experience pays off only when both the serving interface and the task-specific cost structure make its quality gains worth the online cost.
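
A small sketch may help contrast selective injection with unconditional global injection. It assumes cosine similarity over precomputed vectors and a similarity floor, neither of which is stated in the abstract; the experience texts and vectors are made up.

```python
def select_experience(query_vec, experience_bank, top_k=2, min_similarity=0.5):
    """Return up to top_k experiences whose cosine similarity clears a threshold."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
        return dot / norm if norm else 0.0

    scored = [(cosine(query_vec, vec), text) for text, vec in experience_bank]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # The floor keeps weak matches out of the prompt even when top_k has room.
    return [text for sim, text in scored[:top_k] if sim >= min_similarity]

# Hypothetical moderation experiences with toy 2-d embeddings.
bank = [
    ("spam links are rejected", [1.0, 0.0]),
    ("satire is allowed", [0.0, 1.0]),
    ("link shorteners need review", [0.9, 0.1]),
]
chosen = select_experience([1.0, 0.0], bank)
```

With a similarity floor, raising `top_k` adds prompt tokens only when additional experiences are actually relevant, which is one way to read the finding that retrieval quality matters more than a larger Top-K.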

[NLP-43] MultiToP: Learning to Patch Visual Tokens to Mitigate Hallucinations in Video Large Multimodal Models

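[Quick Read]: This paper addresses hallucination in Video Large Multimodal Models, where the generated answer is not faithfully supported by the input video. It proposes MultiToP, a multimodal-context-aware visual token patching framework built around a lightweight Visual Token Patcher that identifies unreliable visual tokens and replaces them. The patcher predicts token-level replacement distributions and substitutes unreliable local visual tokens with a dynamically generated global patch token, refining visual evidence locally. To train the patcher, the paper introduces information-guided rank calibration, which uses answer-conditioned frame-level information cues extracted from the backbone to guide token replacement. Combined with ground-truth answer supervision and sparsity regularization, MultiToP refines visual representations without modifying the original model. Experiments show a clear reduction in hallucination on Vript-HAL, improving the F1 score of Qwen3-VL-4B-Instruct by 50.60%, while general video understanding is preserved: relative accuracy on ActivityNet-QA rises by 18.58% for Video-LLaVA-7B, with negligible inference overhead.

Link: https://arxiv.org/abs/2606.11792
Authors: Yuansheng Gao,Wenbin Xing,Jiahao Yuan,Kaiwen Zhou,Han Bao,Zonghui Wang,Wenzhi Chen
Affiliations: Zhejiang University (浙江大学); Sun Yat-sen University (中山大学); East China Normal University (华东师范大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint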

Abstract:Video Large Multimodal Models have achieved remarkable progress in video understanding, yet they remain prone to hallucinations, where generated responses are not faithfully supported by the input video. In this paper, we propose MultiToP, a multimodal-context-aware visual token patching framework that mitigates hallucinations by refining unreliable visual tokens before language generation. MultiToP introduces a lightweight Visual Token Patcher to predict token-level replacement distributions and selectively substitute unreliable visual tokens with a dynamic global patch token. To train the patcher effectively, we further propose information-guided rank calibration, which uses answer-conditioned frame-level information cues derived from the backbone to guide token replacement. Combined with ground-truth answer supervision and sparsity regularization, MultiToP enables localized visual evidence refinement without modifying the original model. Extensive experiments demonstrate that MultiToP effectively reduces hallucinations on Vript-HAL with negligible inference overhead, improving the F1 scores of Qwen3-VL-4B-Instruct by 50.60% over the vanilla model. Meanwhile, MultiToP preserves general video understanding ability, yielding an 18.58% relative accuracy gain on ActivityNet-QA for Video-LLaVA-7B.
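
The replacement step can be pictured with a very small sketch. It makes two simplifying assumptions that are not from the paper: the global patch token is taken to be the mean of all visual tokens (the paper describes it as dynamic), and replacement is a hard threshold on a given probability rather than a learned distribution.

```python
def patch_tokens(tokens, replace_prob, threshold=0.5):
    """Replace low-trust visual tokens with a global patch token."""
    dim = len(tokens[0])
    # Assumption: the global patch token is the mean of all visual tokens.
    global_token = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [
        list(global_token) if p > threshold else list(t)
        for t, p in zip(tokens, replace_prob)
    ]

# Hypothetical 2-d visual tokens; the middle one is flagged as unreliable.
tokens = [[1.0, 0.0], [9.0, 9.0], [2.0, 3.0]]
patched = patch_tokens(tokens, replace_prob=[0.1, 0.9, 0.2])
```

Only the flagged token changes; the sequence length and the remaining tokens are left untouched, so the backbone model itself needs no modification.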

[NLP-44] Lius: Translation Model Based Instructional Lingustic Using Continual Instruction Tuning In Kupang Malay

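[Quick Read]: This paper addresses the performance drop of Large Language Models (LLMs) on low-resource languages, focusing on Kupang Malay, which lacks large-scale parallel corpora. The core challenge is improving translation quality with limited annotated data. The paper designs a set of instructions from the explicit lexical and semantic features of a bilingual dictionary and introduces Continual Instruction Tuning (CIT), an iterative instruction-based training paradigm that progressively improves translation for the low-resource language. The structured instructions guide the model to learn cross-lingual mappings, and the iterative procedure helps it adapt to the target language. The resulting model, Lius, beats standard instruction-tuned models by 4-6 points and both Neural Machine Translation (NMT) and multilingual LLMs by 10-13 points on several evaluation metrics, which supports the approach as a way to reduce reliance on large-scale parallel data.

Link: https://arxiv.org/abs/2606.11786
Authors: Joanito Agili Lopo,Yunita Sari,Guntur Budi Herwanto
Affiliations: Universitas Gadjah Mada (Gadjah Mada University)
Categories: Computation and Language (cs.CL)
Comments: This paper is the result of the Master Thesis in Master of Artificial Intelligence at Universitas Gadjah Mada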

Abstract:Large Language Models (LLMs) offer new potential for translation tasks but often experience performance degradation when handling low-resource languages. To address this limitation, we propose an approach for fine-tuning LLMs on a low-resource language, Kupang Malay. Our approach involves designing a set of instructions by leveraging explicit lexical and semantic features from a bilingual dictionary, and introducing Continual Instruction Tuning (CIT), a training paradigm that enables iterative instruction-based training. Experimental results demonstrate that our model, named Lius, yields notable improvements over standard instruction-tuned models, outperforming them by 4-6 points, and surpasses both Neural Machine Translation (NMT) and multilingual LLMs by 10-13 points on several evaluation metrics. These findings highlight the potential of our approach to mitigate the reliance on large-scale parallel data in low-resource language translation.

[NLP-45] Automated Creativity Evaluation of Language Models Across Open-Ended Tasks ACL2026

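[Quick Read]: This paper addresses the lack of general, scalable methods for evaluating the creativity of Large Language Models (LLMs) on open-ended tasks. Existing creativity metrics are usually tied to specific tasks and embed domain assumptions, so they transfer poorly. The paper proposes an automated, domain-agnostic framework whose core idea is to decouple the measurement apparatus from the creative task, enabling scalable, task-agnostic assessment. Divergent creativity is measured with semantic entropy, a reference-free and robust metric of novelty and diversity that is validated against human annotations, LLM-based novelty judgments, and baseline diversity measures. Convergent creativity is assessed with a retrieval-based multi-agent judge framework that gives context-sensitive evaluation of task fulfilment with over 60% improved efficiency. The framework is validated in three qualitatively distinct domains, problem-solving (MacGyver), research ideation (HypoGen), and creative writing (BookMIA), across a broad suite of LLMs. Results show that it reliably captures novelty, diversity, and task fulfilment, and reveal how model size, temperature, recency, and reasoning affect creative performance. The work offers a reproducible, generalizable standard for automated LLM creativity evaluation.

Link: https://arxiv.org/abs/2606.11762
Authors: Min Sen Tan,Zachary Kit Chun Choy,Syed Ali Redha Alsagoff,Nadya Yuki Wangsajaya,Mohor Banerjee,Swaagat Bikash Saikia,Alvin Chan
Affiliations: Raffles Institution; College of Computing and Data Science, Nanyang Technological University; Lee Kong Chian School of Medicine, Nanyang Technological University; Centre of AI in Medicine (C-AIM), Nanyang Technological University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to ACL 2026 (Main Conference). 35 pages, 16 figures. Code: this https URL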

Abstract:Large language models (LLMs) have achieved remarkable progress in language understanding, reasoning, and generation, sparking growing interest in their creative potential. Realizing this potential requires systematic and scalable methods for evaluating creativity across diverse tasks. However, most existing creativity metrics are tightly coupled to specific tasks, embedding domain assumptions into the evaluation process, and limiting scalability and generality. To address this gap, we introduce an automated, domain-agnostic framework for quantifying LLM creativity across open-ended tasks. Our approach separates the measurement apparatus from the creative task itself, enabling scalable, task-agnostic assessment. Divergent creativity is measured using semantic entropy, a reference-free and robust metric for novelty and diversity, validated against human annotations, LLM-based novelty judgments and baseline diversity measures. Convergent creativity is assessed via a novel retrieval-based multi-agent judge framework that delivers context-sensitive evaluation of task fulfilment with over 60% improved efficiency. We validate our framework in three qualitatively distinct domains: problem-solving (MacGyver), research ideation (HypoGen), and creative writing (BookMIA), using a broad suite of LLMs. Empirical results show that our framework reliably captures key facets of creativity, including novelty, diversity, and task fulfilment, and reveal how model properties, such as size, temperature, recency, and reasoning, impact creative performance. Our work establishes a reproducible and generalizable standard for automated LLM creativity evaluation, paving the way for scalable benchmarking and accelerating progress in creative AI.
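
Semantic entropy is commonly computed by grouping sampled responses into meaning clusters and taking the Shannon entropy of the cluster distribution. The sketch below shows only that last step and assumes the cluster ids come from some upstream meaning-equivalence procedure; how this paper clusters responses is not described in the abstract.

```python
import math
from collections import Counter

def semantic_entropy(cluster_labels):
    """Shannon entropy (in nats) of the distribution over meaning clusters."""
    counts = Counter(cluster_labels)
    total = len(cluster_labels)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Hypothetical cluster ids for four sampled responses per model.
diverse = semantic_entropy(["a", "b", "c", "d"])      # four distinct ideas
repetitive = semantic_entropy(["a", "a", "a", "a"])   # one idea repeated
```

Higher values mean the samples spread over more distinct meanings, which is why the quantity can serve as a reference-free diversity signal.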

[NLP-46] Hey Chat Can You Teach Me? Structuring Socratic Dialogue for Human Learning in the Wild

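[Quick Read]: This paper addresses the lack of structured teaching in how Large Language Models (LLMs) are used for learning. LLMs are now widely used for everyday learning, but the interactions are unstructured chats that carry no prior record of the student, so the student's knowledge must be inferred from the dialogue itself. The paper shows that scaling models alone does not close this gap, because a tutor must do three things at once: sequence a curriculum, conduct Socratic dialogue, and infer the student's knowledge state from the dialogue. The proposed solution separates these responsibilities. Given a student query, the system builds a prerequisite knowledge graph with subtopics as nodes and dependencies as edges, and frames tutoring as deciding which node to teach next and how many dialogue turns to spend there. A lightweight Proximal Policy Optimization (PPO) policy handles the sequencing, while an LLM conducts the Socratic exchange at the chosen node and returns a signal of student progress. On held-out STEM and non-STEM topics, the system outperforms heuristic baselines, frontier general-purpose models, and a model specialized for Socratic dialogue, both in the rate at which students reach full curriculum mastery and in the number of turns required. Explicit curriculum structure delivers gains that scaling the underlying model does not.

Link: https://arxiv.org/abs/2606.11744
Authors: Sidney Tio,Arunesh Sinha,Pradeep Varakantham
Affiliations: Singapore Management University (新加坡管理大学); Rutgers Business School (罗格斯商学院)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 Main Body Pages, with Appendices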

Abstract:Large language models are now widely used for everyday learning, but the underlying interactions are typically unstructured chats rather than following a curriculum. Unlike formal online learning systems, these interactions carry no prior record of the student, so any estimate of what the student already knows must be inferred from the dialogue itself. We show that this gap is not closed by scaling models alone. Frontier and education-tuned LLMs perform poorly when asked to tutor a student over an extended session, because doing so requires three things at once. The tutor must sequence a curriculum, conduct Socratic dialogue, and infer the student’s knowledge state from that dialogue. We propose separating these responsibilities. Given a student query, our system constructs a prerequisite knowledge graph in which subtopics are nodes and dependencies are edges, and frames tutoring as deciding which node to teach next and how many dialogue turns to spend on it before moving on. A lightweight PPO policy handles this sequencing decision, while an LLM conducts the Socratic exchange at the chosen node and returns a signal of student progress. Across held-out STEM and non-STEM topics, our PPO-paired tutor outperforms heuristic baselines, frontier general-purpose models, and a model specialised for Socratic dialogue, on both the rate at which students reach full curriculum mastery and the number of turns required. Explicit curriculum structure delivers gains that scaling the underlying model does not.
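
The prerequisite graph gives the sequencing policy a constrained action set. As a minimal sketch, the function below lists the subtopics that could be taught next, assuming a subtopic is eligible once all of its prerequisites are mastered. The graph and the eligibility rule are illustrative assumptions; the PPO policy that picks among these nodes and sets the number of dialogue turns is not shown.

```python
def teachable_nodes(prerequisites, mastered):
    """Subtopics not yet mastered whose prerequisites are all mastered."""
    return sorted(
        topic
        for topic, needs in prerequisites.items()
        if topic not in mastered and all(n in mastered for n in needs)
    )

# Hypothetical graph for the query "teach me derivatives".
prerequisites = {
    "functions": [],
    "limits": ["functions"],
    "derivatives": ["limits"],
    "slopes": ["functions"],
}
frontier = teachable_nodes(prerequisites, mastered={"functions"})
```

Here `derivatives` stays out of the frontier until `limits` is mastered, which is the kind of structure an unstructured chat does not enforce.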

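[NLP-47] UniReason-Med: A Shared Grounded Reasoning Interface for 2D-to-3D Transfer in Medical VQA

[Quick Read]: This paper asks whether grounded reasoning supervision from abundant 2D medical images can improve 3D medical visual question answering (VQA) when annotated 3D data are scarce. The core challenge is transferring the vision-language grounding learned from 2D images to reasoning over 3D volumes. The paper proposes UniReason-Med, a single-checkpoint framework with a shared grounded reasoning interface that handles either a 2D image or a slice-serialized 3D volume. The interface uses shared box syntax, region-token injection, and a common grounded reasoning policy, so the model generates interleaved textual reasoning and localized visual evidence at inference time. To train it, the authors build UniMed-CoT, a 220K instruction-tuning dataset with 170K 2D and 50K 3D samples, and apply supervised fine-tuning followed by outcome-level reinforcement learning, without IoU/Dice-based localization rewards during RL. Ablations show that joint 2D+3D grounded supervision substantially improves 3D reasoning over 3D-only training, and that grounding and region-token injection consistently benefit both 2D and 3D tasks, which supports the transfer of reasoning structure from 2D images to volumetric data.

Link: https://arxiv.org/abs/2606.11740
Authors: Mengzhuo Chen,Yan Shu,Chi Liu,Hongming Piao,Xidong Wang,Derek Li,Bryan Dai
Affiliations: IQuest Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: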

Abstract:We study whether grounded reasoning supervision from abundant 2D medical images can improve 3D medical VQA when both input types are aligned through a common reasoning interface. We introduce UniReason-Med, a single-checkpoint framework that processes either a 2D image or a slice-serialized 3D volume at inference time, generating interleaved textual reasoning and localized visual evidence through shared box syntax, region-token injection, and a common grounded reasoning policy. To train this interface, we construct UniMed-CoT, a 220K instruction-tuning dataset with interleaved textual reasoning and grounded visual evidence, including 170K 2D and 50K 3D samples. Through supervised fine-tuning followed by outcome-level reinforcement learning, UniReason-Med learns to generate grounded reasoning traces without IoU/Dice-based localization rewards during RL. Data-mixture and component ablations show that joint 2D+3D grounded supervision substantially improves 3D reasoning over 3D-only training, while grounding and region-token injection consistently benefit both 2D and 3D tasks. These results suggest that a shared grounded reasoning interface can transfer reasoning structure from 2D images to slice-serialized volumetric medical understanding. The code and data are publicly available at this https URL.

[NLP-48] ICA Lens: Interpreting Language Models Without Training Another Dictionary

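[Quick Read]: This paper addresses how to find interpretable directions in Large Language Model (LLM) representations without first training Sparse Autoencoders (SAEs), which requires training, storing, and evaluating large overcomplete dictionaries and therefore limits rapid exploration. The paper revisits Independent Component Analysis (ICA), a classical method for finding non-Gaussian directions, as a compact interpretability lens. It argues that ICA has been underestimated because earlier uses relied on off-the-shelf implementations that are brittle on LLM activations and lacked systematic tools for inspecting and evaluating the recovered directions. It introduces ICALens, described as the first practical workflow for stable, efficient, and auditable ICA analysis of LLM representations, which combines an optimized GPU-parallel FastICA pipeline, LLM-specific stability recipes, and better fitting diagnostics to enable layer-wise analysis without per-layer gradient-based dictionary training. On GPT-2 Small, Gemma 2 2B, and Qwen 3.5 2B Base, ICALens recovers compact, human-interpretable directions. On SAEBench, ICA is competitive with public SAEs in sparse probing and outperforms them in targeted probe perturbation under small-to-medium budgets. The authors conclude that ICA should be viewed not as a weak baseline but as an efficient, complementary first lens.

Link: https://arxiv.org/abs/2606.11722
Authors: Sida Liu,Feijiang Han
Affiliations: Independent Researcher; University of Maryland (马里兰大学)
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Ongoing Project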

Abstract:Finding interpretable directions in language-model representations is critical for understanding and controlling model behavior. Sparse autoencoders (SAEs) have become the standard tool for this purpose, but using them as the default first lens often requires training, storing, and evaluating large overcomplete dictionaries. This bottleneck limits rapid exploration and raises a fundamental question: how much interpretable structure is already visible from activation geometry before training another neural dictionary? Our intuition is simple: many interpretable directions are selective on tokens, and these directions should look less Gaussian than random directions. We therefore revisit independent component analysis (ICA), a classical method for finding non-Gaussian directions, as a compact lens for language-model interpretability. We find that ICA has been underestimated for LLM interpretability, because prior uses often relied on off-the-shelf ICA implementations that are brittle on LLM activations and lacked systematic tools for inspecting and evaluating the recovered directions. To bridge these gaps, we introduce ICALens, the first practical workflow for stable, efficient, and auditable ICA analysis of LLM representations. It combines an optimized GPU-parallel FastICA pipeline with LLM-specific stability recipes and better fitting diagnostics, enabling efficient and reliable layer-wise analysis. Across GPT-2 Small, Gemma 2 2B, and Qwen 3.5 2B Base, ICALens efficiently recovers compact, human-interpretable directions without per-layer gradient-based dictionary training. On SAEBench, ICA is competitive with public SAEs in sparse probing and outperforms them in targeted probe perturbation under small-to-medium budgets. These results suggest that ICA should not be viewed as a weak baseline, but as an efficient and complementary first lens for exploring language-model representations.
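
The intuition that token-selective directions "look less Gaussian" can be checked with a classical non-Gaussianity measure. The sketch below uses excess kurtosis, one of the standard ICA contrasts; it is chosen here for simplicity and is not claimed to be the contrast ICALens uses. The activation values are synthetic.

```python
def excess_kurtosis(values):
    """Fourth standardized moment minus 3; zero for a Gaussian."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    fourth = sum((v - mean) ** 4 for v in values) / n
    return fourth / (var ** 2) - 3.0

# A token-selective direction: silent on most tokens, large on a few.
selective = [0.0] * 98 + [10.0, -10.0]
# A spread-out direction: evenly spaced activations.
spread = [i / 10.0 for i in range(-50, 51)]
```

A direction that fires on only a few tokens has heavy tails and therefore large positive excess kurtosis, while an evenly spread direction scores below zero.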

[NLP-49] Substrate Asymmetry in User-Side Memory: A Diagnostic Framework

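[Quick Read]: This paper addresses a core problem in evaluating and implementing user-side memory in Large Language Models (LLMs): user memory is usually scored with a single "personalization" metric, and that aggregate hides failures that point in opposite directions. The paper factorizes user memory into at least three orthogonal axes, behavioral consistency (style, voice), factual presence (recalling facts that are in the history), and factual absence (abstaining when a fact is absent), and shows that no single substrate wins all three. It compares per-user gamma-LoRA (a small LoRA adapter trained on each user's history) with dense-retrieval RAG (Retrieval-Augmented Generation) on a synthetic corpus and on the real dataset LaMP-3. gamma-LoRA clearly wins behavioral consistency, while RAG wins factual absence. Ablating specific query-projection cells in attention layers 21-35 shows that the same cells carry both effects in opposite directions: zeroing those LoRA weights raises the absence-probe true positive rate (TPR) by 33 percentage points and lowers the presence-probe TPR by 20 percentage points. On the more heavily RLHF-tuned Llama-3.1-8B-Instruct the asymmetry strengthens rather than heals: the behavioral advantage of parametric memory collapses and its absence-calibration deficit widens, which the authors call an alignment tax. On real-data LaMP-3, gamma-LoRA underperforms a majority baseline; a 9-condition mitigation sweep attributes this to instruction-following collapse rather than substrate failure. Finally, substrate-selection routing turns out to be question classification rather than calibration: a 110M DistilBERT classifier using only the question text beats every logit-based router. The contributions are the diagnostic framework, the diagnosed real-data negative result, the alignment-tax replication, and the routing-as-classification finding.

Link: https://arxiv.org/abs/2606.11712
Authors: Youwang Deng
Affiliations: EpistemicaLab - Independent Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint. Code: this https URL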

Abstract:User-side memory in LLMs is typically scored as a single “personalization” capability: given a user’s history, is the output more user-aware? We show this aggregate metric hides opposite-direction failures. Memory factorises into at least three orthogonal axes – behavioral consistency (style, voice), factual presence (recall facts in history), and factual absence (abstain when a fact is absent) – and no single substrate wins all three. Comparing per-user gamma-LoRA (a small LoRA adapter trained on each user’s history; gamma denotes per-user, not per-task) against BGE-large dense top-K retrieval on a controlled 50-user synthetic corpus and a real-data probe (LaMP-3), we find gamma-LoRA decisively wins behavioral style while RAG decisively wins factual absence – and the same query-projection cells in attention layers 21-35 causally load-bear both effects in opposite directions (zeroing those LoRA weights raises absence-probe TPR by +33 pp and drops presence-probe TPR by 20 pp). On the more heavily RLHF-tuned Llama-3.1-8B-Instruct the asymmetry strengthens, not heals: parametric memory’s behavioral advantage collapses while its absence-calibration deficit against retrieval widens – an alignment tax on parametric user-memory. On real-data LaMP-3, gamma-LoRA underperforms a majority baseline; a 9-condition mitigation sweep diagnoses this as instruction-following collapse, not substrate failure (a 9x2 cross-product shows the eval-time 1…5 logit mask drives main_acc to =0.995 on every recipe), and the best training-time fix replicates bit-identically on Llama. Finally, substrate-selection routing is question-classification, not calibration: a 110M DistilBERT on the question text alone beats every logit-based router. We contribute the diagnostic framework, the diagnosed real-data negative, the alignment-tax replication, and the routing-as-classification finding.

[NLP-50] RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation

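Quick read: This paper addresses privilege-induced style drift in on-policy self-distillation (OPSD), which arises from the model's over-reliance on hint information: the learning signal concentrates on style tokens of the generated output (for example short, direct phrasing) rather than on task-bearing tokens, which destabilizes training or causes response length to keep shrinking. The key to the solution is RLCSD (Reinforcement Learning with Contrastive on-policy Self-Distillation), which contrasts the teacher-student distribution gap under a correct hint with the gap under a wrong hint, suppressing the style shift induced by hint conditioning and focusing the learning signal on task-relevant semantic content. Experiments on mathematical and logical reasoning with Qwen3 (1.7B/4B/8B) and Olmo-3-7B-Think show that the method consistently outperforms GRPO and existing OPSD methods. The contrastive principle is also general: it can be integrated into other OPSD frameworks and extends to the broader cross-model on-policy distillation setting.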

Link: https://arxiv.org/abs/2606.11709
Authors: Leyi Pan, Shuchang Tao, Yunpeng Zhai, Lingzhe Zhang, Zhaoyang Liu, Bolin Ding, Aiwei Liu, Lijie Wen
Affiliations: Tsinghua University; Tongyi Lab, Alibaba Group; Peking University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 20 pages, 9 figures, 9 tables

Abstract: On-policy self-distillation (OPSD) provides dense, token-level supervision for reasoning models by aligning a model’s own distribution with the distribution it produces under privileged context, typically a verified solution. However, we show that the learning signal drawn from this distributional gap concentrates on style tokens rather than task-bearing ones, as the hinted model tends to produce more direct, shorter outputs. We term this pathology privilege-induced style drift, which destabilizes training or causes response length to shrink. To address this, we propose RLCSD (Reinforcement Learning with Contrastive on-policy Self-Distillation), which mitigates this drift by contrasting the teacher-student gap under a correct hint against that under a wrong hint, suppressing the style shift that conditioning on a hint tends to induce regardless of correctness, and yielding a signal that is more concentrated on task-bearing tokens. Experiments on Qwen3 (1.7B/4B/8B) and Olmo-3-7B-Think across mathematical and logical reasoning show that RLCSD consistently outperforms GRPO and prior OPSD methods. We further show that the contrastive principle is general: it plugs into existing OPSD methods to improve them, and its underlying insight extends to the broader cross-model on-policy distillation setting.

[NLP-51] MedCTA: A Benchmark for Clinical Tool Agents

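Quick read: This paper addresses a systematic limitation in how medical AI agents are evaluated for clinical decision support: existing benchmarks mostly test a single perception task or single-turn question answering, and so reveal little about deeper failures in multi-step planning, tool invocation, and execution reliability. The core challenge is building medical AI agents that can carry out complex tasks in realistic clinical settings, in particular completing the full decision chain from tool retrieval through evidence acquisition to integrated analysis when given multimodal clinical inputs such as radiology images, pathology slides, and reports. The key to the solution is MedCTA, a benchmark of clinician-validated, step-implicit, real-world multimodal clinical tasks. It comprises 107 clinical tasks with executable trajectories over 5 deployed tools and supports process-aware evaluation of tool selection, argument validity, execution stability, trajectory fidelity, and outcome quality. Experiments show that even state-of-the-art multimodal models remain brittle in multi-step tool use, failing mainly through protocol failures, premature stopping, and incorrect tool calls. Gold-standard tool routing yields large but incomplete gains, showing that strong backbone perception does not translate directly into reliable autonomous agent behavior. MedCTA therefore provides a rigorous testbed for auditing, diagnosing, and advancing trustworthy medical AI agents.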

Link: https://arxiv.org/abs/2606.11702
Authors: Tajamul Ashraf, Hyewon Jeong, Fida Mohammad Thoker, Bernard Ghanem
Affiliations: King Abdullah University of Science and Technology (KAUST), Saudi Arabia; Massachusetts Institute of Technology (MIT), USA
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project Page: this https URL Code: this https URL Data: this https URL

Abstract:To make clinically grounded decisions, medical AI agents are expected to go beyond simple recognition and be capable of tool retrieval, evidence acquisition, and integration. Existing benchmarks largely evaluate isolated perception or single-turn question answering, and therefore provide limited visibility into failures of planning, tool recruitment, and rollout reliability. We introduce MedCTA, a benchmark for evaluating medical tool agents on clinician-validated, step-implicit tasks grounded in realistic multimodal clinical inputs, including radiology images, pathology slides, and reports. MedCTA comprises 107 real-world clinical tasks with clinician-verified executable trajectories over 5 deployed tools, and supports process-aware evaluation of tool selection, argument validity, execution stability, trajectory fidelity, and outcome quality. We benchmark 18 open- and closed-source multimodal models and find that even frontier systems remain brittle in multi-step clinical tool use: autonomous rollouts are dominated by protocol failures, premature stopping, and incorrect tool recruitment, while gold-standard tool routing yields large but still incomplete gains. These results show that strong backbone perception does not translate into reliable agentic behavior in clinical settings. MedCTA provides a rigorous testbed for auditing, diagnosing, and advancing trustworthy medical AI agents. The dataset and evaluation suite are available at this https URL

[NLP-52] Goal-Autopilot: A Verifiable Anti-Fabrication Firewall for Unattended Long-Horizon Agents

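Quick read: This paper addresses the lack of trustworthiness of long-horizon LLM agents running unattended: with no user supervising, an agent may confidently claim task success without having verified the result. The author proposes treating honesty as a core metric for unattended autonomy, distinct from conventional capability evaluation. The key to the solution is the Autopilot execution model, which externalizes working state into a durable, gated finite-state machine that a scheduler advances step by step. Each step rehydrates only the state machine, so per-step context cost is independent of the horizon. A hard floor forbids any "done" claim whose falsifiable gate did not actually execute and pass, which guarantees that terminal states are truthful. The paper proves a No-False-Success theorem: under gate soundness, floor enforcement, and plan coverage, termination implies the goal holds, and every trust point is empirically measurable. The worst case is an honest stall, never a fabricated success. Across 3,150 test cells (70 tasks, 3 systems, 3 models, 5 seeds), Autopilot fabricates success on 0.95% of cells (95% CI 0.38–1.62%), well below the Reflexion (8.10%) and StateFlow (25.05%) baselines. In the hard regime of SWE-bench Lite, the fabrication rate falls from 33.7% for StateFlow to 0.67%, a paired difference of -33.07 percentage points. The mechanism is the gate rather than the model: all ten fabrications come from the strongest model, while the two mid-tier models never fabricate across 700 paired cells. The framework deliberately trades some coverage for honesty: an honest stall is recoverable, whereas a wrong output shipped downstream is not.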

Link: https://arxiv.org/abs/2606.11688
Authors: Youwang Deng
Affiliations: EpistemicaLab — Independent Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint. Code: this https URL

Abstract: Long-horizon LLM agents are not trusted to run unattended: with no human watching, they confidently report success they never verified. We treat honesty – bounding what an agent may claim at termination – as a first-class metric for unattended autonomy, distinct from capability. We present Autopilot, an execution model that makes silent fabricated success structurally impossible rather than merely rarer. Autopilot externalizes all working state into a durable, gated finite-state machine that a scheduler advances one stateless tick at a time; a hard floor forbids any terminal “done” claim whose falsifiable gate did not actually execute and pass. We prove a No-False-Success theorem – under gate soundness, floor enforcement, and plan coverage, termination implies the goal holds – whose only trust points are empirically measurable, and show the worst case degrades to an honest stall, never a fabricated success. Because each tick rehydrates only the state machine, per-step context cost is constant in the horizon. Across a 3,150-cell paired corpus (70 tasks × 3 systems × 3 models × 5 seeds, including 50 SWE-bench Lite tasks across 11 OSS repos), Autopilot fabricates on 0.95% of cells [95% CI 0.38–1.62] while Reflexion and StateFlow baselines fabricate on 8.10% [6.48–9.81] and 25.05% [22.48–27.62] respectively. The headline contrast lives in the hard regime: on SWE-bench Lite, the firewall reduces fabrication from 33.7% (StateFlow) to 0.67%, a paired difference of -33.07 pp [95% CI -36.53, -29.73]. The mechanism is the gate, not the model: all ten Autopilot fabrications come from the strongest model, while two weaker mid-tier models never fabricate across 700 paired cells. The firewall trades coverage for honesty by design – an honest stall is recoverable; a confident wrong output shipped downstream is not.
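
The gated state machine and hard floor described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the names `Step`, `Machine`, `tick`, and `run` are hypothetical, and the only property it tries to show is that a terminal "done" is reachable solely through gates that actually executed and passed, with any gate failure ending in an honest stall.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Step:
    """One plan step (hypothetical structure, for illustration only)."""
    name: str
    action: Callable[[Dict], None]  # does the work, mutating the shared state
    gate: Callable[[Dict], bool]    # falsifiable check that the work succeeded


@dataclass
class Machine:
    """All working state lives here, so each tick needs nothing else."""
    steps: List[Step]
    state: Dict = field(default_factory=dict)
    cursor: int = 0
    passed: List[str] = field(default_factory=list)  # gates that ran and passed
    status: str = "running"

    def tick(self) -> str:
        """Advance by one step using only the persisted machine state."""
        if self.status != "running":
            return self.status
        if self.cursor == len(self.steps):
            # Hard floor: "done" is allowed only if every gate actually passed.
            all_passed = self.passed == [s.name for s in self.steps]
            self.status = "done" if all_passed else "stalled"
            return self.status
        step = self.steps[self.cursor]
        step.action(self.state)
        if step.gate(self.state):
            self.passed.append(step.name)
            self.cursor += 1
        else:
            self.status = "stalled"  # honest stall, never a claimed success
        return self.status


def run(machine: Machine, max_ticks: int = 100) -> str:
    """Scheduler loop: call tick until the machine leaves the running state."""
    for _ in range(max_ticks):
        if machine.tick() != "running":
            break
    return machine.status
```

In this sketch an action that silently does nothing cannot produce "done", because the status is derived from the recorded gate results rather than from anything the action reports about itself.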

[NLP-53] Layer-Isolated Evaluation: Gating the Deterministic Scaffold of a Production LLM Agent with a No-LLM Regression-Locked Test Harness

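Quick read: This paper addresses a black-box problem in current LLM agent evaluation: when end-to-end task success is the only metric, it cannot locate where performance degraded, leaving a blind spot in regression detection. The core challenge is masking: when one component degrades, the aggregate pass rate barely moves (dropping only 1.7 to 5.9 percentage points), so the fault is hidden. The authors propose layer-isolated evaluation, which decomposes a deployed ordering agent into a fixed taxonomy of eight layers (ontology, intent, routing, decomposition, escalation, safety, memory, and cross-cutting envelope/defense) and gives each layer its own assertion slice, run in a deterministic "pure" mode with no LLM involved. The pure suite (238 cases across 23 slices, about 10 ms per case) runs in continuous integration (CI) on every code change against a locked per-slice baseline. Controlled regression-injection experiments show that while the aggregate pass rate barely changes, the matching slice's pass rate drops sharply (-25 to -91 percentage points), and the injected layer's slice is the single worst-hit in 5 of 7 cases and in the top three in 7 of 7 (mean rank 1.29). This shows that the method localizes faults and overcomes the masking effect of aggregate metrics. The result replicates on a second, structurally different tenant (Starbucks SG), so it is not specific to one system. The contributions are: (a) a fully decomposed, sub-second, no-LLM per-layer test harness for a production agent; (b) a coverage-honesty test-adequacy criterion that refuses to score a layer that was not sufficiently exercised; and (c) a regression-injection demonstration that per-slice baseline-locked gates localize regressions that an aggregate metric masks.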

Link: https://arxiv.org/abs/2606.11686
Authors: Sawyer Zhang, Alexander Wang, Sophie Lei
Affiliations: Lumivate (Lumi)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages, 2 figures, 5 tables

Abstract: End-to-end task-success is the dominant way to evaluate LLM agents, but one aggregate number tells you that an agent regressed, not where. We present layer-isolated evaluation: a deployed ordering agent is decomposed into a fixed taxonomy of layers (ontology, intent, routing, decomposition, escalation, safety, memory, and cross-cutting envelope/defense), each exercised by its own assertion slice in a deterministic, no-LLM “pure” mode. The pure suite (238 cases across 23 slices; 225 run in 2.39 s, ~10 ms/case) runs in CI on every change against a locked per-slice baseline. We validate by controlled regression injection, degrading one layer at a time across seven non-safety layers. The effect we did not design in is masking: the aggregate pass-rate barely moves (-1.7 to -5.9 pp for six local regressions), while the matching slice craters (-25 to -91 pp). A layer’s slice reacting to its own fault is partly by construction; the measured results are (i) the aggregate masking and (ii) that damage stays off the other slices: the injected layer’s slice is the single worst-hit in 5 of 7 cases and top-3 in 7 of 7 (mean rank 1.29 of 19). Localization replicates on a second, structurally different tenant (Starbucks SG): all seven matching slices crater, so it is not a single-catalog artifact. We position it as a concrete, deterministic instantiation of the component-level evaluation EDDOps prescribes but leaves unimplemented, with CheckList as ancestor and as the deterministic mirror image of whole-workflow stochastic mutation testing. Our contributions: (a) a fully decomposed, sub-second, no-LLM per-layer harness for a production agent, (b) a coverage-honesty test-adequacy criterion that refuses to score an unexercised layer, and (c) the regression-injection demonstration that per-slice baseline-locked gates localize regressions an aggregate metric masks.
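
The masking effect and the per-slice gate can be illustrated with a short sketch. It is not the authors' harness: the function names, the `min_cases` threshold, and the data layout (a mapping from slice name to pass/fail outcomes) are assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

Results = Dict[str, List[bool]]  # slice name -> pass/fail outcome per case


def pass_rate(outcomes: List[bool]) -> float:
    """Fraction of cases that passed."""
    return sum(outcomes) / len(outcomes)


def aggregate_rate(results: Results) -> float:
    """Single pass rate over all cases, ignoring which slice they belong to."""
    all_cases = [ok for outcomes in results.values() for ok in outcomes]
    return pass_rate(all_cases)


def gate(results: Results, baseline: Dict[str, float],
         min_cases: int = 1) -> Tuple[bool, List[str]]:
    """Fail if any slice drops below its locked baseline.

    A slice with fewer than min_cases executed cases is reported as
    unexercised instead of being scored (a stand-in for the abstract's
    coverage-honesty criterion).
    """
    problems = []
    for name, locked in baseline.items():
        outcomes = results.get(name, [])
        if len(outcomes) < min_cases:
            problems.append(f"{name}: unexercised")
        elif pass_rate(outcomes) < locked:
            problems.append(f"{name}: {pass_rate(outcomes):.2f} < {locked:.2f}")
    return (not problems, problems)
```

With 40 passing intent cases, 50 passing memory cases, and a routing slice where 8 of 10 cases fail, the aggregate rate is still 0.92, while the gate reports `routing: 0.20 < 1.00` and names the damaged slice.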

[NLP-54] UR-BERT: Scaling Text Encoders for Massively Multilingual TTS Through Universal Romanization and Speech Token Prediction INTERSPEECH2026

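Quick read: This paper addresses the limited language coverage of massively multilingual text-to-speech (TTS) systems, which is constrained by the available phoneme-conversion resources. Conventional grapheme-to-phoneme (G2P) approaches support only about 100 languages because reliable G2P resources are scarce. To remove this bottleneck, the paper proposes UR-BERT, a Romanized-transcription-based TTS text encoder that maps diverse writing systems into a shared Romanization representation and scales to 495 languages. The solution rests on two ideas: first, a unified Romanization representation that removes the heterogeneity of different scripts; second, a speech token prediction objective during training that pushes the encoder to learn speech-aware phonetic representations, improving phonetic fidelity and text-speech alignment in a data-efficient way. Experiments show that TTS systems built on UR-BERT consistently outperform existing text encoder baselines across many languages and resource conditions and generalize well to unseen languages.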

Link: https://arxiv.org/abs/2606.11681
Authors: Sangmin Lee, Eekgyun Ahn, Woongjib Choi, Hong-Goo Kang
Affiliations: Yonsei University
Categories: Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted to Interspeech 2026

Abstract:We propose UR-BERT, a Romanized transcription-based text-to-speech (TTS) encoder for massively multilingual TTS systems. Conventional grapheme-to-phoneme (G2P)-based approaches are limited to around 100 languages due to the availability of reliable G2P resources. In contrast, UR-BERT scales to 495 languages by unifying diverse writing systems into a shared Romanization representation. To further enhance phonetic fidelity and text-speech alignment, we introduce a speech token prediction objective during training, which encourages the encoder to learn speech-aware phonetic representations in a data-efficient manner. Experiments show that TTS systems built on UR-BERT consistently outperform recent text encoder baselines across a wide range of languages and resource conditions, and demonstrate strong generalization to unseen languages.

[NLP-55] Organize then Retrieve: Hierarchical Memory Navigation for Efficient Agents

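Quick read: This paper addresses the degraded reasoning quality, higher inference cost, and higher latency that large language model (LLM) agents suffer on long-horizon tasks because of their inherent statelessness. The core challenge is managing dynamically accumulating task-relevant information efficiently within a limited context window; existing methods mostly rely on lossy compression or similarity-based retrieval, which struggle to preserve the temporal structure and causal dependencies of a task. The paper proposes HORMA (Hierarchical Organize-and-Retrieve Memory Agent), whose key idea is a file-system-like hierarchical working memory that links summarized entities to their corresponding raw trajectories, giving efficient access without losing detail. HORMA splits working memory into two stages. The first is structured memory construction, which iteratively refines the organization by distinguishing failures caused by missing information from failures caused by overloaded or misleading context. The second is navigation-based retrieval, in which a lightweight agent trained with reinforcement learning traverses the hierarchy to select minimal yet sufficient context, reducing latency on the critical execution path. Across ALFWorld, LoCoMo, and LongMemEval, HORMA improves task performance under constrained context budgets while using at most 22.17% of the baseline token count on long-conversation tasks. Compared with existing methods it achieves better efficiency-performance trade-offs and generalizes well to unseen tasks.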

Link: https://arxiv.org/abs/2606.11680
Authors: Hao-Lun Hsu, Nikki Lijing Kuang, Boyi Liu, Zhewei Yao, Yuxiong He
Affiliations: Duke University; Snowflake AI Research
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Large language model (LLM) agents struggle with long-horizon tasks due to their inherent statelessness, requiring all task-relevant information to be encoded in growing input contexts. The resulting degraded reasoning quality, increased inference cost, and higher latency necessitate efficient working memory mechanisms. However, existing approaches either rely on lossy compression or similarity-based retrieval, which often fail to capture temporal structure and causal dependencies required for multi-step agentic tasks. In this work, we present HORMA, a Hierarchical Organize-and-Retrieve Memory Agent that organizes experience into a file-system-like hierarchical structure, where summarized entities are linked to the corresponding raw trajectories, enabling efficient access without losing detailed information. HORMA decomposes working memory into two stages: structured memory construction and navigation-based retrieval. The construction module iteratively refines how experiences are structured by distinguishing between failures caused by missing information and those caused by misleading or overloaded context. The navigation module retrieves task-relevant context by traversing the hierarchy using a lightweight agent trained with reinforcement learning to select minimal yet sufficient context, thereby reducing latency along the critical execution path. Across ALFWorld, LoCoMo, and LongMemEval, HORMA improves task performance under constrained context budgets while requiring at most 22.17% of the baseline token usage in long conversation tasks. Compared to existing methods, it consistently achieves better efficiency-performance trade-offs and generalizes effectively to unseen tasks.

[NLP-56] Can AI Reason Like an Urban Planner? Benchmarking Large Language Models Against Professional Judgment

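Quick read: This paper addresses a central question for applying AI to urban planning: which forms of professional planning knowledge can AI replicate, and which still depend on human judgment. As large language models (LLMs) spread through planning practice, there is still no systematic framework for testing whether they have the contextual sensitivity, value awareness, and institutional literacy that planning expertise emphasizes. The paper proposes Urban Planning Bench (UPBench), a domain-specific evaluation framework built on Bloom's revised taxonomy as a 4x5 matrix of four knowledge pillars and five cognitive levels, used to assess LLM reasoning on complex planning tasks. Evaluating 25 mainstream LLMs with automated scoring and expert review, the study finds that models perform better on higher-order analytical tasks than on factual recall and integrative judgment, a non-monotonic cognitive curve. This suggests that much planning knowledge regarded as lower-order is in fact deeply embedded in institutional, jurisdictional, and temporal context, which makes it hard for models to generalize. The authors summarize these limits as four epistemic diagnostics: regulatory hallucination, conceptual conflation, wickedness paralysis, and phronetic deficit. For practice, the study recommends differential delegation: LLMs can assist with cross-disciplinary synthesis, literature review, scenario generation, and preliminary policy analysis, but remain unreliable for jurisdiction-specific regulation, normative conflict resolution, and context-sensitive procedural decisions. Public agencies should therefore require human verification of AI-assisted regulatory analysis, and planning education should strengthen institutional literacy, normative judgment, and contextual sensitivity.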

Link: https://arxiv.org/abs/2606.11678
Authors: Yijie Deng, He Zhu, Wen Wang, Junyou Su, Minxin Chen, Wenjia Zhang
Affiliations: University of Science and Technology of China; Institute for Advanced Study, Tsinghua University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Problem, Research Strategy, and Findings: The rise of large language models (LLMs) raises a key question for urban planning: which forms of professional planning knowledge can AI replicate, and which still require human judgment? Although AI tools are increasingly used in planning practice, there is still no systematic framework for testing whether they can reason with the contextual sensitivity, value awareness, and institutional literacy central to planning expertise. This paper introduces Urban Planning Bench (UPBench), a domain-specific evaluation framework that assesses LLM reasoning through a 4x5 matrix of four knowledge pillars and five cognitive levels adapted from Bloom’s revised taxonomy. Evaluating 25 LLMs with automated scoring and expert review, we find a non-monotonic cognitive curve: models perform better on higher-order analytical tasks than on factual recall and integrative judgment. This suggests that planning knowledge often treated as lower-order is deeply shaped by institutional, jurisdictional, and temporal context, making it hard for LLMs to generalize. We summarize these limits as four epistemic diagnostics: regulatory hallucination, conceptual conflation, wickedness paralysis, and phronetic deficit. Takeaway for Practice: The findings support differential delegation in planning. LLMs can assist with cross-disciplinary synthesis, literature review, scenario generation, and preliminary policy analysis. However, they remain unreliable for jurisdiction-specific regulation, normative conflict resolution, and context-sensitive procedure. Agencies should require verification for AI-assisted regulatory analysis, while planning education should emphasize institutional literacy, normative judgment, and contextual sensitivity. 

[NLP-57] Dummy Backdoor as a Defense: Removing Unknown Backdoors via Shared Internal Mechanisms for Generative LLMs

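Quick read: This paper addresses the problem of removing unknown backdoors from generative large language models (LLMs), especially when the defender knows neither the attack type nor the internal mechanism. Such attacks leave the model behaving normally on clean inputs but make it emit attacker-specified malicious responses when a hidden trigger is present, which seriously threatens model safety and reliability. The paper proposes a simple but effective removal method based on internal mechanisms shared across backdoors. It first shows experimentally that different backdoors with the same task objective induce similar trigger-related changes in the model's internal activations. Building on this observation, the method deliberately embeds a dummy backdoor with a known trigger and then removes it by further fine-tuning on dummy-triggered inputs paired with clean responses. Because the dummy backdoor and the unknown backdoor rely on shared internal mechanisms, removing the former also weakens the latter. Experiments across multiple model families and three representative backdoor attack types show that the method substantially lowers the attack success rate of the unknown backdoor while preserving model utility, outperforming representative existing defenses. The study shows the potential of a defender-controlled backdoor as a proxy for mitigating unknown backdoors; its core idea is to suppress an unknown threat indirectly by exploiting shared internal representations.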

Link: https://arxiv.org/abs/2606.11648
Authors: Kazuki Iwahana, Masaru Matsubayashi, Takuma Koyama, Toshiki Shibahara, Kenichiro Omintato, Akira Ito
Affiliations: NTT Social Informatics Laboratories; Tohoku University
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments:

Abstract: Backdoor attacks pose a serious threat to the safety and reliability of Large Language Models (LLMs), as they cause models to behave normally on clean inputs while producing attacker-specified responses when hidden triggers are present. Removing such unknown backdoors is particularly challenging when the defender does not know the backdoor attack types or the internal mechanisms formed through backdoor training. In this work, we propose a simple but effective backdoor removal method based on shared internal mechanisms across different backdoors. First, we show that different backdoors with the same task (attack objective) induce similar trigger-activated changes in the internal activations. Motivated by this observation, our method intentionally embeds a backdoor with a known trigger (dummy backdoor) and then removes it through further fine-tuning on dummy-triggered inputs paired with clean responses. Since the dummy backdoor and the unknown backdoor can rely on shared internal mechanisms, removing the dummy backdoor also reduces the effect of the unknown backdoor. We evaluate our method on three backdoor attack types across multiple model families. Experimental results show that our method substantially reduces the attack success rate of the unknown backdoor while preserving model utility, outperforming representative existing defense methods in both backdoor removal effectiveness and utility preservation. These findings suggest that a defender-controllable backdoor can serve as a helpful proxy for mitigating unknown backdoors in generative LLMs.

[NLP-58] Improving Cross-Format Robustness in Language Models with Multi-Format Training

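Quick read: This paper addresses the inconsistent behavior of large language models (LLMs) across answer formats: the same underlying question, posed in different formats, can yield different answers, so models are format-sensitive. To measure this, the authors define cross-format robustness as a model's ability to answer the same underlying question consistently across formats. The key to the solution is FormatMix, a lightweight multi-format augmentation method that expands only a subset of training items (about 30%), chosen by random or targeted selection, into multiple semantically equivalent formats, rather than annotating the full dataset in every format. Experiments show that, compared with training on multiple-choice question (MCQ)-only supervision, multi-format supervision improves both task performance and cross-format robustness, and that diversifying the format of only a small share of the data recovers gains close to those of full-format training. The results indicate that format diversity in the training data, rather than additional supervision alone, is the main driver of robustness, so lightweight multi-format augmentation can reduce format sensitivity without changing the base model.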

Link: https://arxiv.org/abs/2606.11643
Authors: June M. Liu, Shaomian Zheng, He Cao, Dingnan Jin, Qing Cui, Jun Zhou
Affiliations: Ant Group; International Digital Economy Academy (IDEA)
Categories: Computation and Language (cs.CL)
Comments:

Abstract: Large language models often remain sensitive to answer format: a question solved correctly in one form may fail in another semantically equivalent form. To study this gap, we define cross-format robustness as the extent to which a model answers the same underlying question consistently across formats. We then compare full-format training with FormatMix, which expands only a subset of training items into multiple equivalent formats using either random or targeted selection. Across GLM4 and Llama-3.1, multi-format supervision consistently improves both task performance and cross-format robustness, whereas multiple-choice question (MCQ)-only supervision alone brings little benefit and can even reduce robustness. We further find that expanding only about 30% of the training set into multiple formats often recovers most of the gain from full-format training, and this effect appears across the model families and sizes we study. These results suggest that format diversity, rather than additional supervision alone, is the key driver of robustness, and that lightweight multi-format augmentation is a practical way to make LLMs less sensitive to answer format without changing the base model.
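
The idea of expanding only a subset of training items into several equivalent formats can be sketched as follows. This covers the random-selection variant only, and it is an illustration under assumptions: the item schema, the three formats (multiple choice, open-ended, true/false), and the names `format_mix`, `as_mcq`, `as_open`, and `as_true_false` are hypothetical and are not taken from the paper.

```python
import random
from string import ascii_uppercase as LETTERS
from typing import Dict, List

# Assumed item schema: {"question": str, "choices": List[str], "answer": int}
Item = Dict[str, object]


def as_mcq(item: Item) -> Dict[str, str]:
    """Multiple-choice form: lettered options, target is the correct letter."""
    lines = [f"{LETTERS[i]}. {c}" for i, c in enumerate(item["choices"])]
    prompt = item["question"] + "\n" + "\n".join(lines)
    return {"prompt": prompt, "target": LETTERS[item["answer"]]}


def as_open(item: Item) -> Dict[str, str]:
    """Open-ended form: no options shown, target is the answer text."""
    return {"prompt": item["question"], "target": item["choices"][item["answer"]]}


def as_true_false(item: Item, rng: random.Random) -> Dict[str, str]:
    """True/false form: judge one randomly proposed option."""
    idx = rng.randrange(len(item["choices"]))
    prompt = f'{item["question"]}\nProposed answer: {item["choices"][idx]}\nTrue or false?'
    return {"prompt": prompt, "target": "True" if idx == item["answer"] else "False"}


def format_mix(items: List[Item], fraction: float = 0.3, seed: int = 0) -> List[Dict[str, str]]:
    """Keep every item as MCQ and expand a random fraction into extra formats."""
    rng = random.Random(seed)
    k = round(fraction * len(items))
    chosen = set(rng.sample(range(len(items)), k))
    out = []
    for i, item in enumerate(items):
        out.append(as_mcq(item))
        if i in chosen:
            out.append(as_open(item))
            out.append(as_true_false(item, rng))
    return out
```

With 10 items and `fraction=0.3`, three items are expanded, so the output holds 10 multiple-choice records plus 6 additional records in the other two formats.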

[NLP-59] Evaluating Bias in Phoneme-Based Automatic Speech Recognition Systems: An Analysis of IPA Transcription Models

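Quick read: This paper addresses bias in automatic speech recognition (ASR) systems across demographic attributes such as race, age, gender, and accent, focusing on the comparatively understudied phoneme-based systems, such as models that output International Phonetic Alphabet (IPA) representations. As ASR moves toward multilingual support and low-resource language modeling, IPA-based representations become a key language-agnostic foundation. The paper evaluates two state-of-the-art open-source ASR systems, WhisperIPA and ZIPA, across diverse accents and language sources, using existing multilingual corpora and demographically annotated English corpora. Model-generated IPA transcriptions are compared against references from grapheme-to-phoneme (G2P) systems using the standard phoneme error rate (PER) and the authors' proposed Soft PER, which tolerates substitutions between linguistically similar phonemes. The study finds that performance disparities across languages and demographic groups (gender, accent, ethnicity, age) persist even after allowing for acceptable phonemic variation, pointing to potential sources of systematic bias. Its key contribution is the more flexible evaluation metric (Soft PER), together with the case it makes for more inclusive and linguistically robust phoneme-level ASR systems to inform future model design.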

Link: https://arxiv.org/abs/2606.11639
Authors: Catherine Bao, Maneesha Rani Saha, Neal Patwari
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The popularization of automatic speech recognition (ASR) systems has increased exploration of the demographic biases related to race, age, gender, and accent, often formed from imbalanced training data. Most of these studies focused on standard grapheme-based ASR systems with comparatively little emphasis on phoneme-based systems, such as models that produce International Phonetic Alphabet (IPA) representations. As ASR systems shift toward multilingual support and low-resource language modeling, IPA-based layers serve as a critical, language-agnostic foundation. In this study, we evaluate the performance of two state-of-the-art open-source ASR systems, WhisperIPA and ZIPA, that generate IPA transcriptions across diverse accents and language sources. Our evaluation includes existing multilingual speech corpora and demographically annotated English-language corpora. We measure model performance by comparing model-generated IPA transcriptions against grapheme-to-phoneme (G2P) systems using both standard phoneme error rate (PER) and a proposed Soft PER metric that tolerates linguistically similar phoneme substitutions. Our analysis examines how performance varies across languages and demographic groups such as gender, accent, ethnicity, and age, revealing persistent disparities even after accounting for acceptable phonemic variation. These findings provide insight into potential sources of bias and inform the development of more inclusive and linguistically robust phoneme-based ASR systems. Our code and data will be made publicly available to the community.
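
The abstract does not define Soft PER, so the following is only one plausible reading: a phoneme-level edit distance in which substituting one phoneme for a linguistically similar one costs less than a full error. The set of similar pairs, the `soft_cost` of 0.5, and the function name `per` are assumptions made for this sketch.

```python
from typing import FrozenSet, Sequence, Set


def per(ref: Sequence[str], hyp: Sequence[str],
        similar: Set[FrozenSet[str]] = frozenset(),
        soft_cost: float = 0.5) -> float:
    """Phoneme error rate via weighted edit distance.

    With an empty `similar` set this is the standard PER. Pairs listed in
    `similar` are charged `soft_cost` instead of 1 when substituted.
    """
    if not ref:
        raise ValueError("reference must contain at least one phoneme")
    n, m = len(ref), len(hyp)
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = float(i)  # delete all reference phonemes
    for j in range(1, m + 1):
        dist[0][j] = float(j)  # insert all hypothesis phonemes
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = ref[i - 1], hyp[j - 1]
            if a == b:
                sub = 0.0
            elif frozenset((a, b)) in similar:
                sub = soft_cost
            else:
                sub = 1.0
            dist[i][j] = min(dist[i - 1][j] + 1.0,      # deletion
                             dist[i][j - 1] + 1.0,      # insertion
                             dist[i - 1][j - 1] + sub)  # match or substitution
    return dist[n][m] / n
```

For the reference `t ɛ s t` and the hypothesis `d ɛ s t`, the standard PER is 0.25, while treating `t` and `d` as a similar pair gives 0.125 under these assumed costs.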

[NLP-60] Multi-Agent Reasoning with Adaptive Worker Allocation for Stance Detection

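Quick read: This paper addresses stance detection in short texts, particularly where stance is implicit, indirect, or rhetorically framed. Conventional single-pass prompting of large language models (LLMs) is brittle when several interpretations are plausible, and existing ensemble strategies such as majority voting or self-consistency improve robustness but discard the intermediate reasoning needed to resolve ambiguity. The paper proposes a multi-agent reasoning framework with adaptive worker allocation whose core idea is to move aggregation from label-level voting to reasoning-level synthesis. The framework uses a Manager-Worker architecture that allocates a variable number of Worker agents according to input complexity. Each Worker produces a reasoning-only explanation from a distinct perspective without emitting a stance label, and the Manager synthesizes these explanations into the final prediction. Experiments show the largest gains on implicit and context-dependent stance cases, with 86.07 Macro-F1 on the COVID-19 Stance dataset and 82.90 on SemEval-2016, while the framework stays competitive on the more explicit P-Stance dataset. This indicates that adaptive reasoning-level aggregation helps most when stance cannot be reliably inferred from surface cues alone.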

Link: https://arxiv.org/abs/2606.11609
Authors: Meysam Sabbaghan, Arman Zareian Jahromi, Doina Caragea
Affiliations: Kansas State University
Categories: Computation and Language (cs.CL)
Comments:

Abstract: Stance detection requires identifying an author’s position toward a target, often from short-form texts where stance is implicit, indirect, or rhetorically framed. Although large language models (LLMs) achieve strong performance on this task, single-pass prompting can be brittle when multiple interpretations are plausible. Existing aggregation strategies, such as majority voting or self-consistency, improve robustness by combining labels, but they discard the intermediate reasoning needed to resolve conflicting interpretations. We introduce a multi-agent reasoning framework with adaptive worker allocation for stance detection that shifts aggregation from label-level voting to reasoning-level synthesis. The framework employs a Manager-Worker architecture in which a Manager adaptively allocates a variable number of Worker agents based on input complexity. Each Worker analyzes the input from a distinct perspective and produces a reasoning-only explanation without emitting a stance label; the Manager then synthesizes these explanations to produce the final prediction. We evaluate the proposed framework on SemEval-2016, P-Stance, and COVID-19 Stance using Llama, Mistral, and Gemini. Results show that the framework yields the largest gains on implicit and context-dependent stance cases, achieving 86.07 Macro-F1 on COVID-19 and 82.90 on SemEval-2016, while remaining competitive on more explicit stance datasets such as P-Stance. These findings suggest that adaptive reasoning-level aggregation is most beneficial when stance cannot be reliably inferred from surface cues alone.

[NLP-61] When is Your LLM Steerable?

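Quick read: This paper addresses the fact that, when controlling a language model's behavior at inference time, the success of activation steering depends heavily on the prompt, concept, model, and steering configuration, while existing practice relies on expensive grid searches and post-hoc evaluation of full autoregressive rollouts. The core idea is to predict whether steering will succeed from the model's internal states early in generation, for example after the first few tokens. The study introduces ASTEER, a testbed of 1.4M steered generations labeled as success or failure and spanning 150 concepts. By extracting features that compare hidden states before and after steering across layers and initial decoding steps, it analyzes how steering effects propagate across layers and token positions. A Gradient Boosting Decision Trees (GBDT) classifier trained on these early features predicts whether an intervention will under-steer, succeed, or over-steer, reaching a macro-F1 of about 0.7 on unseen concepts, which indicates that early hidden states encode substantial, structured information about steering efficacy. Using the predictor to guide the search over steering strength then approaches optimal performance at a small fraction of the decoding cost.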

Link: https://arxiv.org/abs/2606.11599
Authors: Chenrui Fan, Yize Cheng, Ming Li, Soheil Feizi, Tianyi Zhou
Affiliations: University of Maryland, College Park; MBZUAI, UAE
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Activation steering offers a lightweight approach to control language models’ behavior at inference time, but whether it succeeds or fails heavily depends on the prompt, concept, model, and steering configuration. Finding the regime and boundaries of successful steering typically requires expensive grid searches and post-hoc evaluation of full autoregressive rollouts. In this work, we investigate whether steerability can be predicted from the model’s internal states at the beginning of the generation process, e.g., after generating the first few tokens, and how to leverage such a predictor to improve steering success rate. To this end, we first introduce ASTEER, a testbed including 1.4M steered generations, spanning 150 concepts with each steering success/failure labeled. Leveraging this testbed, we analyze the model’s early decoding dynamics by extracting features that compare hidden states before and after steering across layers and initial decoding steps. These features help us understand how steering’s effects propagate along layers and token positions, which provide key information for steerability prediction. We then train a Gradient Boosting Decision Trees (GBDT) classifier on these features to predict whether an intervention will under-steer, succeed, or over-steer without requiring full rollout. Our predictor achieves around 0.7 macro-F1 score on unseen concepts, demonstrating that early hidden states encode substantial, structured information about eventual steering efficacy. We further leverage this steerability predictor as guidance for steering strength searching, achieving near-optimal performance with a small fraction of decoding cost.

[NLP-62] Kuramoto Attention: Synchronizing Self-Attention on the Torus

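Quick read: This paper addresses the lack of explicit geometric structure and dynamical synchronization in conventional self-attention when modeling sequence dependencies, and asks how attention can be combined with continuous phase dynamics such as networks of coupled oscillators. The core solution is Kuramoto attention, a self-attention layer over angular representations: each hidden coordinate encodes information as an angle, attention scores are computed by gated cosine similarity, previous phase states are aggregated with the attention weights, and each token is updated by the tangent component of the attention-weighted circular mean. This update corresponds exactly to the Kuramoto coupling term $\sum_u A_{t,u}\sin(\theta_u - \theta_t)$, with the attention matrix acting as an adaptive, content-dependent coupling kernel. The key ingredients are (1) a similarity score that is invariant on the torus (gated cosine similarity), which models local structure in phase space naturally, and (2) a circular-mean update defined on the manifold, which keeps the state evolution inside phase space. This design gives the model an intrinsic tendency toward phase synchronization, and rotary position encoding enters as a position-dependent phase drift. On enwiki8 character-level language modeling, bits-per-character (BPC) at one million and five million parameters is close to, and at times level with, a strong RoPE+SwiGLU Transformer baseline, which supports the viability of the approach at this scale. The main contribution is a self-attention framework with explicit geometric meaning that places self-attention and nonlinear phase synchronization in the same mathematical structure, offering a geometric view of attention.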

Link: https://arxiv.org/abs/2606.11585
Authors: Joshua Nunley
Affiliations: Indiana University Bloomington
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Adaptation and Self-Organizing Systems (nlin.AO)
Comments: 13 pages, 2 figures, 3 tables

Abstract: We introduce Kuramoto attention, a self-attention layer in which each hidden coordinate is an angle. The layer scores tokens by gated cosine similarity, attends over previous phase states, and updates each token by the tangent component of the attention-weighted circular mean. Because the values are the raw phase states, this update is exactly the Kuramoto coupling term $\sum_u A_{t,u}\sin(\theta_u-\theta_t)$, with the attention matrix acting as an adaptive, content-dependent coupling kernel. Equivalently, the gated score is a learned metric on the torus that selects which tokens couple, and the update pulls each token toward the circular mean of the tokens it selects, tightening their phase agreement. The same two ingredients, an invariant similarity score and an on-manifold mean, define such a layer on any compact group; the torus is the abelian case, where both are closed-form. The softmax weights solve an entropy-regularized phase-retrieval problem, and rotary position enters as a position-dependent phase drift in the score. On enwiki8 character-level language modeling, the layer trains as a functional language model whose bits-per-character stays close to a strong matched RoPE+SwiGLU transformer: within 0.02 BPC at one million parameters ($1.637 \pm 0.010$ versus $1.616 \pm 0.004$) and level on the median at five million (1.448 versus 1.452 over five seeds) with the transformer ahead on the mean (1.468 versus 1.456). These experiments establish that the constrained geometric structure is a viable language model at this scale; the structure itself, and its synchronization reading, is the contribution. Ablations isolate the load-bearing components, and the result gives a compact bridge between self-attention and phase synchronization.
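
The update rule in the abstract can be written out as a minimal sketch in plain Python. Several details are assumptions rather than the paper's specification: the score is taken as a per-coordinate gated sum of cosines, attention is causal over positions up to and including the current one, there are no learned projections, and `step` is a hypothetical step size. The coupling line is the term $\sum_u A_{t,u}\sin(\theta_u-\theta_t)$, which equals the tangent component of the attention-weighted circular mean at $\theta_t$.

```python
import math
from typing import List


def kuramoto_attention(theta: List[List[float]], gate: List[float],
                       step: float = 1.0) -> List[List[float]]:
    """One layer of phase updates; theta[t][d] is the angle of token t, coordinate d."""
    out = []
    for t in range(len(theta)):
        # Gated cosine similarity between token t and each earlier token u <= t.
        scores = [
            sum(g * math.cos(a - b) for g, a, b in zip(gate, theta[t], theta[u]))
            for u in range(t + 1)
        ]
        # Numerically stable softmax over the causal window.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        attn = [w / total for w in weights]
        new_angles = []
        for d, angle in enumerate(theta[t]):
            # Kuramoto coupling: sum_u A[t][u] * sin(theta_u - theta_t).
            coupling = sum(attn[u] * math.sin(theta[u][d] - angle)
                           for u in range(t + 1))
            # Wrap to [0, 2*pi) so the state stays on the torus.
            new_angles.append((angle + step * coupling) % (2 * math.pi))
        out.append(new_angles)
    return out
```

With two one-dimensional tokens at angles 0.0 and 1.0, the first token has nothing earlier to couple to and stays put, while the second is pulled toward the first, which is the phase-tightening behavior the abstract describes.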

[NLP-63] GraphInfer-Bench: Benchmarking LLMs' Inference Capability on Graphs

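Quick read: This paper addresses the weakness of large language models (LLMs) at graph inference on graph-structured data: producing an open-ended answer, grounded in a node together with its neighbourhood, that no single node supports and no path retrieves. Existing graph question answering (graph-QA) protocols cannot test this capability, because they admit answers retrievable from one node or along a path, whereas genuine graph inference requires an integrated reading of local structure. The paper therefore proposes GraphInfer-Bench, a benchmark covering Description tasks (what a region is) and Comparison tasks (how regions differ), each constructed so that the ground truth cannot be supported by any single node. The benchmark contains 42,000 samples across six real-world graph datasets, generated automatically and screened by a four-layer quality-control protocol. Four method families are evaluated: graph-token alignment models, zero-shot frontier closed-source LLMs, Graph2Text supervised fine-tuning (SFT), and plain graph neural networks (GNNs) as a structural reference. No family closes the performance gap: graph-token alignment models partially handle description tasks but collapse on comparison tasks; frontier closed-source LLMs lead among LLM-based methods on outlier detection and community partition but lag on masked-node prediction; Graph2Text SFT is the strongest LLM-based method on description tasks yet trails frontier LLMs on comparison; and plain GNNs match or beat the strongest LLM-based method on every task, with the largest margin on community detection. The study concludes that graph inference is an open, general capability gap rather than a property of any one architecture.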

Link: https://arxiv.org/abs/2606.11562
Authors: Zhuoyi Peng, Jingzhou Jiang, Hanlin Gu, Lixin Fan, Yi Yang
Affiliations: The Hong Kong University of Science and Technology; Webank
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Code: this https URL; Dataset: this https URL

Abstract:Graph analysis underlies many applications whose answers cannot be looked up in a single record or retrieved along a path: laundering rings, drug repurposing, user preference, and scientific theme are all inferred from a node together with its neighbourhood. We introduce GraphInfer-Bench, a benchmark for whether LLMs can perform this graph inference: producing an open-ended answer that no single node supports and no path retrieves. Existing graph-QA protocols cannot test this capability: algorithm simulation, node classification, single-node description, KG-QA, and GraphRAG all admit answers retrievable from one node or along a path. GraphInfer-Bench defines five tasks along Description (what a region is) and Comparison (how regions differ), each constructed so the ground truth lives in no single node. The release contains 42,000 samples across six real-world graphs, produced automatically and screened by a four-layer quality-control protocol. We evaluate four method families against the same tasks: graph-token alignment models, zero-shot frontier closed-source LLMs, Graph2Text supervised fine-tuning, and plain GNNs as a structural reference. No method family closes the gap. Graph-token alignment partially handles description tasks (relational, theme) but collapses on comparison tasks. Frontier LLMs lead on outlier detection and community partition among LLM-based methods but lag on masked-node prediction. Graph2Text SFT is the strongest LLM-based method on the description side yet falls behind frontier LLMs on comparison. Across every task, plain GNNs match or beat the strongest LLM-based row, with the largest margin on community detection. GraphInfer-Bench surfaces graph inference as an open capability gap rather than a property of any one architecture.

[NLP-64] Teaching Diffusion to Speculate Left-to-Right

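Quick read: This paper targets the high inference cost that autoregressive decoding imposes on large language models (LLMs). Diffusion-based language models can draft whole blocks of candidate tokens in parallel (block-diffusion drafters), but verification still relies on an autoregressive target model that checks tokens strictly left to right, which leaves a mismatch between the symmetric training-time objective and the asymmetric verification-time reward. The paper analyzes three mutually orthogonal training-time interventions: token positional weighting, a first-error focal loss (which focuses on the first position in each block that breaks the accepted prefix), and a chain loss term (which substitutes a differentiable surrogate for the expected accepted length). They act respectively on position, block-conditional first error, and the joint prefix. Without adding forward passes and without changing the inference pipeline or the rejection-sampling exactness contract, they raise accepted draft length by 21-76% per benchmark over a position-uniform baseline across four target models and six reasoning, code, and dialogue benchmarks.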

Link: https://arxiv.org/abs/2606.11552
Authors: Lexington Whalen, Yuki Ito, Ryo Sakamoto
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 13 pages, technical report

Abstract:Large language models (LLMs) achieve remarkable performance across a wide range of tasks, but their autoregressive decoding process incurs substantial inference costs due to inherently sequential token generation. Speculative decoding addresses this bottleneck by employing a lightweight draft model to propose multiple future tokens that are subsequently verified in parallel by a larger target model. Recent work has demonstrated that diffusion language models are well suited for this setting, as they can generate entire blocks of draft tokens in parallel and thereby alleviate the sequential constraints of autoregressive drafting. A subtlety of this regime is that block-diffusion drafters generate tokens bidirectionally within a block, whereas verification is performed by an autoregressive target model that evaluates tokens in a strictly left-to-right manner, leaving a gap between the symmetric training-time objective and the asymmetric verification-time reward. In this work, we offer an empirical analysis of three training-time interventions that narrow this gap: token positional weighting, a first-error focal loss that targets the position that breaks the accepted prefix within each block, and a chain loss term that substitutes a differentiable surrogate for the expected accepted length. The three interventions act along orthogonal axes (position, block-conditional first error, joint prefix) and compose additively; they are likewise orthogonal to test-time alignment mechanisms such as multi-draft self-selection, with which they can in principle be combined. Across four target models and six reasoning, code, and dialogue benchmarks, the three interventions raise accepted draft length by 21-76% per benchmark over a position-uniform baseline, without adding additional forward passes and without changing the inference pipeline or the rejection-sampling exactness contract.
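
To make the left-to-right asymmetry concrete, here is a small sketch of the quantity the chain loss is said to approximate, the expected accepted length. The abstract does not give the paper's surrogate, so this only computes the expectation itself from hypothetical per-position acceptance probabilities; the function name and the input convention are assumptions.

```python
import numpy as np

def expected_accepted_length(accept_probs):
    """Expected number of accepted draft tokens under left-to-right verification.

    accept_probs[i] is assumed to be the probability that draft token i is
    accepted given that every earlier token in the block was accepted.
    """
    p = np.asarray(accept_probs, dtype=float)
    # prefix_survival[k] = probability that the accepted prefix reaches token k.
    prefix_survival = np.cumprod(p)
    # E[length] = sum over k of P(length >= k + 1).
    return float(prefix_survival.sum())
```

The same single weak position costs very different amounts depending on where it sits: `expected_accepted_length([0.5, 1, 1, 1])` is 2.0, while `expected_accepted_length([1, 1, 1, 0.5])` is 3.5. A position-uniform token loss treats both cases alike, which is the gap the three interventions aim to narrow.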

[NLP-65] Pretrained self-supervised speech models can recognize unseen consonants INTERSPEECH2026

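Quick read: This paper asks whether pretrained self-supervised speech recognition models, whose training data are heavily skewed toward high-resource languages, can recognize click consonants as accurately as other speech sounds. Clicks are typologically uncommon and occur primarily in Khoisan languages, which are low-resource. The approach is to fine-tune and compare two mainstream self-supervised speech models (Wav2Vec2 and HuBERT) on data from two click-rich Khoisan languages (G|ui and West !Xoon). The fine-tuned models consistently recognize clicks more accurately than non-clicks, which suggests that self-supervision enables generalization across human speech sounds, including rare phonemes.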

Link: https://arxiv.org/abs/2606.11542
Authors: Chihiro Taguchi, Éric Le Ferrand, Hirosi Nakagawa, Hitomi Ono, Kanji Kato, Emily Prud’hommeaux, David Chiang
Affiliations: University of Notre Dame; University at Buffalo; Tokyo University of Foreign Studies; Reitaku University; Boston College
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 6 pages, 3 figures, 3 tables, accepted at Interspeech 2026

Abstract:Modern pretrained self-supervised automatic speech recognition models are trained on large-scale audio data to encode speech into contextualized representations. However, their training data are heavily skewed toward high-resource languages with little data from low-resource languages, raising concerns about the potential underrepresentation of typologically uncommon speech sounds such as click consonants primarily found in Khoisan languages. This leads to our central research question: Can these models recognize click consonants as accurately as other speech sounds? To address this question, we fine-tune and compare pretrained self-supervised speech models (Wav2Vec2 and HuBERT) on data from two click-rich Khoisan languages (G|ui and West !Xoon). Our results reveal that the fine-tuned models consistently recognize clicks more accurately than non-clicks, suggesting that self-supervision enables generalization across human speech sounds including rare phonemes.

[NLP-66] Measuring language complexity from hierarchical reuse of recurring patterns

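Quick read: This paper addresses how to quantify language complexity, in particular how to build a computable, theoretically grounded measure that does not depend on language-specific linguistic priors. Its core proposal is the ladderpath index, a measure grounded in algorithmic information theory that counts the minimum number of steps needed to reconstruct a sequence through hierarchical reuse of repeated substructures. It captures an exactly computable but constrained form of algorithmic compressibility that is related to, but distinct from, Kolmogorov complexity. Across 21 parallel corpora, the ladderpath index is approximately invariant and varies much less than corpus length, and the effect is more pronounced once all corpora are mapped to a unified binary representation, which supports the equi-complexity hypothesis from a representation-independent perspective. The study also finds trade-offs between character inventory size and corpus length, and between vocabulary-level and corpus-level reconstruction complexity, supporting the trade-off hypothesis that total complexity is conserved and redistributed across linguistic levels. In addition, the reusable substructures that the method identifies without any linguistic input overlap with words and morphological components of the natural vocabulary, and its hierarchical reuse parallels the cognitive chunking mechanisms proposed in cognitive science. This connection offers a new cognitive grounding for both hypotheses: the equalization and redistribution of complexity arise from cognitive resource constraints shared across languages.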

Link: https://arxiv.org/abs/2606.11531
Authors: Junyi Zhou, Rui Liu, Pengyu Liu, Yu Liu
Affiliations: Beijing Normal University; University of Rhode Island
Categories: Computation and Language (cs.CL); Information Theory (cs.IT)
Comments: 17 pages, 4 figures

Abstract:We introduce the ladderpath index as a measure of language complexity grounded in algorithmic information theory. It counts the minimum steps needed to reconstruct a sequence through hierarchical reuse of repeated substructures, capturing an exactly computable but constrained form of algorithmic compressibility related to, but distinct from, Kolmogorov complexity. We apply the ladderpath approach to 21 parallel corpora from the Parallel Universal Dependencies dataset. The ladderpath index is approximately invariant across the languages, and varies much less than the corpus length. This is more pronounced when all corpora are mapped to a unified binary representation, providing evidence for the equi-complexity hypothesis from a representation-independent perspective. We also observe trade-offs between character inventory size and corpus length, and between vocabulary-level and corpus-level reconstruction complexity, supporting the trade-off hypothesis that total complexity is conserved and redistributed across linguistic levels. The reusable substructures identified by the ladderpath approach, without any linguistic input, overlap with words and morphological components attested in the natural vocabulary. The hierarchical reuse captured by the ladderpath approach parallels the chunking mechanisms proposed in cognitive science, where the human cognitive system compresses linguistic input into nested, reusable units under shared memory and processing constraints. This connection between cognitive chunking and the ladderpath approach provides a new interpretation for the equi-complexity and trade-off hypotheses, grounding both in the shared cognitive architecture that underlies language processing across human languages.

[NLP-67] ISE: An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories

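Quick read: This paper addresses the data bottleneck in training capable operating-system (OS) agents: existing datasets lack data that simultaneously captures structured user intents, multi-turn task delegation, and grounded interaction coupled to real tool execution. It proposes ISE (Intent-Simulate-Execute), a three-stage synthesis paradigm. Stage 1 generates roughly 50,000 structured intents with a 4D framework (Persona x Domain x Task x Complexity); deduplication leaves 43,956 unique intents, which attain a Vendi Score of 61.57 on mpnet-base-v2 embeddings. Stage 2 uses a role-locked user simulator that grounds each user turn in actual execution outcomes, producing 23,132 complete trajectories that average 8.12 user turns and 68.24 total dialogue turns. Stage 3 runs every tool call inside a live, isolated OS workspace, so failure and recovery dynamics are authentic rather than simulated. Fine-tuning Qwen3-8B on the ISETrace dataset raises ClawEval pass@1 from 19.3 to 37.7, outperforming zero-shot GPT-4o and the four-times-larger Qwen3-32B base model. An ablation shows that the Stage 2 multi-turn simulation contributes a large portion of the gain.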

Link: https://arxiv.org/abs/2606.11520
Authors: Siyuan Luo, Nairong Zheng, Lin Zhou, Tiankuo Yao, Shengyou Yuan, Haojia Yu, Cong Pang, Jiapeng Luo, Lewei Lu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, 6 figures. Dataset and code: this https URL

Abstract:Training capable OS agents requires data that simultaneously captures structured user intents, multi-turn task delegation, and grounded tool execution, properties absent from existing datasets. We propose ISE (Intent-Simulate-Execute), a three-stage synthesis paradigm that addresses these gaps jointly. Stage 1 constructs roughly 50,000 structured intents via a 4D framework (Persona x Domain x Task x Complexity); after deduplication the pool contains 43,956 unique intents and attains a Vendi Score of 61.57 over the entire pool on mpnet-base-v2 embeddings (cosine kernel, q=1). Stage 2 drives multi-turn user-agent interaction through a role-locked user simulator that grounds each user turn in actual execution outcomes, producing 23,132 complete trajectories averaging 8.12 user turns and 68.24 total dialogue turns. Stage 3 runs every tool call inside a live, isolated OS workspace, generating authentic failure-recovery dynamics instead of simulated responses. Fine-tuning on ISETrace improves ClawEval pass@1 from 19.3 to 37.7 using Qwen3-8B on agent tool-use tasks with a standard protocol. This result outperforms both zero-shot GPT-4o and the Qwen3-32B base model, which is four times larger. An ablation on Stage 2 shows that multi-turn simulation accounts for a large portion of the performance gain. We release all source code and dataset at this https URL.
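
For readers unfamiliar with the diversity metric quoted for Stage 1, the following sketch shows how a Vendi Score with a cosine kernel and q=1 is commonly computed: the exponential of the Shannon entropy of the eigenvalues of the normalized similarity matrix. It is an illustration of the standard definition rather than the authors' code; `vendi_score` is a hypothetical helper and the embeddings are assumed to be non-zero vectors.

```python
import numpy as np

def vendi_score(embeddings):
    """Vendi Score (order q=1) with a cosine kernel.

    embeddings: array of shape (n, d); every row is assumed to be non-zero.
    """
    X = np.asarray(embeddings, dtype=float)
    # L2-normalize rows so that X @ X.T is the cosine-similarity kernel.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    K = X @ X.T
    n = K.shape[0]
    # Eigenvalues of K / n are non-negative and sum to 1.
    eigvals = np.linalg.eigvalsh(K / n)
    # Drop numerically zero eigenvalues, since 0 * log(0) is taken as 0.
    eigvals = eigvals[eigvals > 1e-12]
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))
```

The score behaves like an effective count of distinct items: n mutually orthogonal embeddings give a score of about n, while n identical embeddings give about 1.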

[NLP-68] SAGE: Answer-Conditioned Uncertainty Targets for Verbal Uncertainty Alignment

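Quick read: This paper addresses the mismatch between the uncertainty that large language models express in natural language and their actual sampled behavior. Existing approaches mostly estimate uncertainty from a single response, which fails to reflect the model's true confidence distribution. The core challenge is to construct a group-level uncertainty target that respects differences between answer types (categorical, numeric, symbolic) while remaining smooth and scale-preserving, so that it provides a useful training signal. The proposed solution is SAGE (Semantic-Answer Guided Entropy), which builds an answer-conditioned uncertainty geometry over sampled responses. The paper further introduces Group-Uncertainty Preference Optimization (GUPO), a training framework that supervises the verbal uncertainty expression rather than the full response. Experiments on factual, mathematical, and multiple-choice reasoning tasks show improved uncertainty ranking, lower calibration error, and reduced overconfidence.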

Link: https://arxiv.org/abs/2606.11512
Authors: Kaiwen Shi, Zheyuan Zhang, Yanfang Ye
Affiliations: University of Notre Dame
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models increasingly express uncertainty through natural-language statements, yet these expressions often fail to reflect the model’s sampled behavior. We study verbal uncertainty alignment as a distributional calibration problem: the appropriate uncertainty target for a prompt should be estimated from repeated model outputs rather than from an isolated response. However, group rollouts alone are insufficient, since the resulting target must provide a useful training signal. Existing targets only partially satisfy this requirement. We propose SAGE, Semantic-Answer Guided Entropy, a group-level uncertainty target that constructs an answer-conditioned uncertainty geometry over sampled responses. SAGE preserves categorical, numeric, and symbolic answer distinctions while maintaining a smooth and scale-preserving calibration signal. We further apply this target through Group-Uncertainty Preference Optimization, or GUPO, an uncertainty-channel training framework that supervises verbal uncertainty expressions rather than the full response. Experiments across factual, mathematical, and multiple-choice reasoning tasks show improved uncertainty ranking, lower calibration error, and reduced overconfidence.

[NLP-69] When Roleplaying, Do Models Believe What They Say?

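Quick read: This paper asks whether the beliefs a language model voices during persona role-play are internalized in what the model represents as true. Specifically, when a model converses as a historical figure, do the false views it expresses (such as geocentrism) stay at the level of output, or do they affect its internal representation of truth? The key method is linear truth probes. For each historical persona, the study compares false claims the persona would likely have endorsed and that conflict with modern consensus (era-believed) with topic-matched false claims the persona would not have endorsed (era-false), and it evaluates the internal representations of both under three induction methods: prompting, in-context learning, and supervised fine-tuning. Persona induction suppresses era-believed statements less than equally false alternatives, yet the probes still classify them as false overall, so role-play shifts what models say more than what they internally represent as true. In contrast, in models trained on harmful advice that exhibit Emergent Misalignment (EM), false claims move substantially toward the true region of probe space, are defended under challenge more often, and are used in downstream reasoning. Role-play and Emergent Misalignment are thus points on a spectrum of belief internalization: the former changes outputs with little representational change, while the latter shifts the internal representation of false claims without fully marking them as true.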

Link: https://arxiv.org/abs/2606.11502
Authors: Benjamin Sturgeon, David Africa, Sid Black
Affiliations: MATS; Anthropic
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Language models can state that “the Earth orbits the Sun” and, when role-playing Aristotle, assert the opposite. Recent work argues that persona adoption is fundamental to how language models operate, with models constantly selecting the most appropriate persona for a given context. Does such role-playing merely change the model’s outputs, or does it also affect what the model internally represents as truthful? We study this question with linear truth probes, applying them to LLMs role-playing historical personas whose likely beliefs differ from modern consensus. For each persona, we compare false claims the persona would likely have endorsed (era-believed) with topic-matched false claims they would not have endorsed (era-false). Across prompting, in-context learning, and supervised fine-tuning, persona induction suppresses era-believed statements less than equally false alternatives, yet they remain classified as false overall. Role-play therefore shifts what these models say more than what they internally represent as true. We contrast this with models trained on harmful advice that exhibit Emergent Misalignment (EM). Across three model families (Qwen 2.5 14B, Qwen 3 8B, and Llama 3.3 70B), their false claims move substantially toward the true region of probe space, are defended under challenge roughly half the time versus about a sixth for role-play, and are used in downstream reasoning. Role-play and Emergent Misalignment thus are points on a spectrum of belief internalization, where role-play changes what a model says with little representational change, while Emergent Misalignment shifts the internal representation of false claims without fully marking them as true.
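
A linear truth probe, in its simplest form, is a logistic-regression classifier fitted on hidden activations of statements labeled true or false. The sketch below illustrates that general idea only; the abstract does not describe the authors' probe training, so the plain gradient-descent fit, the function names, and the use of synthetic activations in place of real model states are all assumptions.

```python
import numpy as np

def train_linear_probe(acts, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent.

    acts: (n, d) activations; labels: (n,) with 1 for true and 0 for false.
    """
    X = np.asarray(acts, dtype=float)
    y = np.asarray(labels, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(true)
        grad = p - y                             # gradient of the log-loss w.r.t. the logit
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def probe_score(acts, w, b):
    """Signed score; positive values fall on the 'true' side of the probe."""
    return np.asarray(acts, dtype=float) @ w + b
```

Under this reading, a comparison like the one in the abstract would apply `probe_score` to activations for era-believed and era-false statements and check how far each group sits from the decision boundary at zero.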

[NLP-70] Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality

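Quick read: This paper addresses the critical influence of pretraining data composition on language model performance. Existing data selection methods rely on auxiliary classifiers for document scoring or on mixture optimization, which adds computational overhead and dependence on labeled data. The proposed solution is WebGraphMix, a lightweight data selection framework that computes structural centrality scores over the Common Crawl host-level web graph and uses them to vary the proportion of central versus peripheral documents in the pretraining mixture. The working hypothesis is that central hosts expose models to reusable abstractions, while peripheral hosts encode specialized, long-tail knowledge. WebGraphMix computes centrality scores efficiently at web scale and requires no model training, labeled data, or downstream supervision. Experiments show that central and peripheral web regions encode complementary capabilities: a 1:1 mixture reaches 41.4% average performance versus 39.8% for uniform sampling, and combining structural scores with document-level quality classifier scores raises it further to 43.8%. The results indicate that web graph topology is a meaningful axis for pretraining data curation, capturing information that is largely orthogonal to existing content-based approaches.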

Link: https://arxiv.org/abs/2606.11499
Authors: Vedant Badoni, Danqi Chen, Xinyi Wang
Affiliations: Princeton University; Princeton Language and Intelligence; Princeton AI Lab
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages

Abstract:The performance of modern language models depends critically on pretraining data composition. Yet existing data selection methods rely on auxiliary classifiers for document scoring or mixture optimization, adding computational overhead and dependence on labeled data. We propose WebGraphMix, a lightweight data selection framework that computes structural centrality scores over the Common Crawl host-level web graph and uses them to vary the proportion of central versus peripheral documents in the pretraining mixture. We hypothesize that central hosts expose models to reusable abstractions, while peripheral hosts encode specialized, long-tail knowledge. WebGraphMix computes centrality scores efficiently at web scale, requiring no model training, labeled data, or downstream supervision. We integrate WebGraphMix into the DataComp-LM pipeline and train models at 400M and 1B parameter scales with 8B and 28B tokens respectively, evaluating on 23 tasks ranging from factual knowledge to symbolic reasoning. Our experiments show that central and peripheral web regions encode complementary capabilities. A mixture combining both at a 1:1 ratio achieves 41.4% on average, compared to 39.8% for uniform sampling. Combining structural scores with document-level quality classifier scores further improves performance to 43.8%. These findings demonstrate that web graph topology is a meaningful axis for pretraining data curation, capturing information that is largely orthogonal to existing content-based approaches.
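
The mixing idea can be illustrated on a toy host graph. The abstract does not say which centrality measure WebGraphMix uses, so the sketch below swaps in PageRank by power iteration as a stand-in, splits hosts at the median score, and samples a central/peripheral mixture. All names, the median split, and the toy graph format are assumptions made for illustration.

```python
import random

def pagerank(graph, damping=0.85, iters=100):
    """PageRank by power iteration.

    graph: dict host -> list of hosts it links to. Every host, including
    hosts that only receive links, must appear as a key.
    """
    hosts = list(graph)
    n = len(hosts)
    rank = {h: 1.0 / n for h in hosts}
    for _ in range(iters):
        new = {h: (1.0 - damping) / n for h in hosts}
        for h, outs in graph.items():
            if outs:
                share = damping * rank[h] / len(outs)
                for o in outs:
                    new[o] += share
            else:
                # Dangling host: spread its rank uniformly over all hosts.
                for o in hosts:
                    new[o] += damping * rank[h] / n
        rank = new
    return rank

def mix_central_peripheral(docs, rank, n_total, central_frac=0.5, seed=0):
    """Sample documents with a fixed share drawn from central hosts.

    docs: list of (doc_id, host). Hosts at or above the median score count
    as central (an assumed threshold). Raises ValueError if a pool is too small.
    """
    scores = sorted(rank.values())
    median = scores[len(scores) // 2]
    central = [d for d in docs if rank[d[1]] >= median]
    peripheral = [d for d in docs if rank[d[1]] < median]
    rng = random.Random(seed)
    n_central = round(n_total * central_frac)
    return rng.sample(central, n_central) + rng.sample(peripheral, n_total - n_central)
```

With `central_frac=0.5` this corresponds to the 1:1 mixture reported in the abstract; other ratios follow by changing that one parameter.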

[NLP-71] Building Social World Models with Large Language Models ICML2026

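Quick read: This paper addresses the modeling and prediction of how social beliefs evolve in response to major social events (such as policy changes and scientific breakthroughs), a long-standing challenge in social science. It proposes a general framework, the Social World Model (SWM), which learns state-transition functions for social beliefs by mining temporal patterns in social data and optimizing the evidence lower bound, without human annotations that link events to belief shifts and without expensive census data. To evaluate SWM, the paper introduces SWM-bench, a benchmark derived from the real-world prediction markets Kalshi and Polymarket, with over 12k data points spanning domains such as politics, finance, and cryptocurrency. SWM significantly outperforms time-series foundation models, achieving state-of-the-art results on Kalshi data and competitive performance on Polymarket data, while offering interpretable insights into the mechanisms of social belief dynamics.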

Link: https://arxiv.org/abs/2606.11482
Authors: Haofei Yu, Yining Zhao, Guanyu Lin, Jiaxuan You
Affiliations: Unknown
Categories: Social and Information Networks (cs.SI); Computation and Language (cs.CL)
Comments: 9 pages. ICML 2026

Abstract:Understanding and predicting how social beliefs evolve in response to events – from policy changes to scientific breakthroughs – remains a fundamental challenge in social science. Given LLMs’ commonsense knowledge and social intelligence, we ask: Can LLMs model the dynamics of social beliefs following social events? In this work, we introduce the concept of the Social World Model (SWM), a general framework designed to capture how social beliefs evolve in response to major events. SWM learns state-transition functions for social beliefs by mining temporal patterns in social data and optimizing the evidence lower bound, without the need for explicit human annotations linking events to belief shifts, or for expensive census data. To evaluate SWM, we introduce a benchmark, SWM-bench, derived from real-world prediction markets, specifically Kalshi and Polymarket. SWM-bench includes over 12k data points for social belief prediction tasks spanning diverse domains such as politics, finance, and cryptocurrency. Our experimental results show that SWM significantly outperforms time-series foundation models, achieving state-of-the-art results on Kalshi data and demonstrating competitive performance on Polymarket data, while offering interpretable insights into the underlying mechanisms of social belief dynamics.

[NLP-72] The Periodic Table of LLM Reasoning: A Structured Survey of Reasoning Paradigms, Methods, and Failure Modes

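Quick read: This survey addresses the unreliable reasoning of large language models (LLMs), whose behavior on complex reasoning tasks is often inconsistent and sensitive to prompting strategy, task design, and model scale, and it examines how reasoning capabilities emerge and where they fail. Its core contribution is a structured taxonomy covering nine paradigms: Chain-of-Thought reasoning, multi-hop reasoning, mathematical reasoning, common sense reasoning, visual and temporal reasoning, code and algorithmic reasoning, retrieval-augmented reasoning, tool-augmented and agentic reasoning, and reinforcement learning-based reasoning. Drawing on more than 300 recent papers, it analyzes methodological trends in prompting methods, model architectures, training objectives, reward modeling, and evaluation benchmarks. It also synthesizes recurring failure modes, including reasoning hallucinations, brittle multi-step inference, weak causal abstraction, and poor cross-domain generalization. The survey thereby offers a reference for building more robust, interpretable, and generalizable reasoning systems, and it identifies emerging directions such as meta-reasoning, self-evolving reasoning frameworks, multimodal reasoning, and socially grounded reasoning.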

Link: https://arxiv.org/abs/2606.11470
Authors: Avinash Anand, Mahisha Ramesh, Avni Mittal, Ashutosh Kumar, Erik Cambria, Zhengkui Wang, Timothy Liu, Aik Beng Ng, Simon See, Rajiv Ratn Shah
Affiliations: Singapore Institute of Technology; NVIDIA AI Technology Centre (Singapore); IIIT Delhi; IIT Mandi; IIT Kanpur; Owl Autonomous Imaging, Inc.; NTU Singapore
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have achieved strong performance across natural language processing tasks, yet reliable reasoning remains an open challenge. Although modern LLMs show progress in structured inference, multi-step problem solving, and contextual understanding, their reasoning behavior is often inconsistent and sensitive to prompting strategies, task design, and model scale. This survey provides a systematic analysis of more than 300 recent papers from arXiv, Semantic Scholar, Google Scholar, Papers with Code, and the ACL Anthology to examine how reasoning capabilities emerge in LLMs and where they fail. We make three main contributions. First, we introduce a structured taxonomy of LLM reasoning research, covering Chain-of-Thought reasoning, multi-hop reasoning, mathematical reasoning, common sense reasoning, visual and temporal reasoning, code and algorithmic reasoning, retrieval-augmented reasoning, tool-augmented and agentic reasoning, and reinforcement learning-based reasoning. Second, we analyze methodological trends across these paradigms, including prompting methods, model architectures, training objectives, reward modeling, and evaluation benchmarks. Third, we synthesize recurring limitations and failure modes, such as reasoning hallucinations, brittle multi-step inference, weak causal abstraction, and poor cross-domain generalization. By organizing a rapidly expanding literature, this survey offers a unified view of the current capabilities and limitations of reasoning in LLMs. We also identify emerging research directions, including meta-reasoning, self-evolving reasoning frameworks, multimodal reasoning, and socially grounded reasoning. Overall, this work aims to serve as a reference for developing more robust, interpretable, and generalizable reasoning systems in future language models.

[NLP-73] APEX: Automated Prompt Engineering eXpert with Dynamic Data Selection

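Quick read: This paper addresses the high sensitivity of large language models (LLMs) to prompt formulation, focusing on how to make automatic prompt optimization efficient and effective under a limited compute budget. Existing prompt optimizers based on evolutionary algorithms suffer from a data-efficiency bottleneck: they treat the development dataset as a static benchmark, so much of the compute budget is spent on uninformative examples. The paper proposes APEX (Automatic Prompt Engineering eXpert), a framework that optimizes data usage alongside the prompt search. APEX dynamically stratifies the dataset into Easy, Hard, and Mixed tiers based on the optimization lineage and prioritizes the Mixed tier, the examples on which the LLM has mixed performance. From this tier it identifies two high-leverage subsets: the addressable frontier, used to generate informative mutations, and the rank-sensitive frontier, used to distinguish candidate quality. On three diverse benchmarks (IFBench, SimpleQA Verified, and FACTS Grounding), under a fixed budget of 5,000 evaluation calls, APEX outperforms the initial prompt by an average of 11.2% on Gemini 2.5 Flash and 6.8% on Gemma 3 27B, which supports the value of a data-centric approach to prompt optimization.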

Link: https://arxiv.org/abs/2606.11459
Authors: Fei Wang, Si Si, Cho-Jui Hsieh, Inderjit S. Dhillon
Affiliations: Google; UCLA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models are highly sensitive to prompt formulation, necessitating automatic prompt optimization to unlock their full potential. While evolutionary algorithms have emerged as the dominant paradigm, they suffer from a critical bottleneck: data efficiency. Current methods treat the development dataset as a static benchmark, wasting significant compute budget on uninformative data. In this work, we introduce APEX (Automatic Prompt Engineering eXpert), a novel framework that optimizes the data usage alongside the prompt search. APEX dynamically stratifies the dataset into Easy, Hard, and Mixed tiers based on the optimization lineage. By prioritizing the Mixed tier, which identifies the data where the LLM has mixed performance, we identify two high-leverage subsets: the addressable frontier for generating informative mutations and the rank-sensitive frontier for distinguishing candidate quality. We evaluate APEX across three diverse benchmarks: IFBench, SimpleQA Verified, and FACTS Grounding. Under a fixed budget of 5,000 evaluation calls, due to its data efficiency, APEX outperforms the initial prompt by an average of 11.2% on Gemini 2.5 Flash and 6.8% on Gemma 3 27B, demonstrating that a data-centric approach is key to efficient and effective prompt optimization.
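
One plausible reading of the Easy/Hard/Mixed stratification is sketched below: an example is Easy if every prompt in the optimization lineage solves it, Hard if none does, and Mixed otherwise. The abstract does not give the exact rule, so this criterion, the function name, and the input format are assumptions.

```python
def stratify_by_lineage(results):
    """Split examples into tiers from per-prompt pass/fail outcomes.

    results: dict mapping example id -> list of bools, one outcome per
    prompt in the optimization lineage (assumed to be non-empty).
    """
    tiers = {"easy": [], "hard": [], "mixed": []}
    for example_id, outcomes in results.items():
        if all(outcomes):
            tiers["easy"].append(example_id)   # every prompt solves it
        elif not any(outcomes):
            tiers["hard"].append(example_id)   # no prompt solves it
        else:
            tiers["mixed"].append(example_id)  # prompts disagree: most informative
        # Easy and Hard examples cannot separate candidates seen so far.
    return tiers
```

Only the Mixed tier changes the relative ranking of the prompts evaluated so far, which is one way to see why spending a fixed evaluation budget there is more informative than sampling the dataset uniformly.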

[NLP-74] AI Coding Agents in Social Science: Methodologically Diverse, Empirically Consistent, Interpretively Vulnerable

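Quick read: This paper addresses two opposing concerns about deploying LLM-based agents in scientific analysis: that agents may reduce methodological diversity, and that they may amplify the analytic flexibility through which researchers reach motivated conclusions. Its key move is to separate the problem into two empirically distinguishable layers: a design layer of methodological choices, and a verdict layer in which a decision rule maps estimates to a substantive claim. The study runs 20 independent executions of Claude Code and Codex on a prominent immigration and social-policy topic and compares them with a many-analysts human baseline. At the design layer, Codex matches human methodological diversity and Claude Code produces nearly three times as many specifications; both agents' effect estimates remain broadly aligned with the human consensus, and no agent model exactly matches any human model. A prompt-induced anti-immigration prior reorganizes each agent's methodological decisions but does not shift aggregate estimates or final verdicts, nor do agents reroute along the methodological axes humans use to bias their estimates. At the verdict layer, an explicit confirmatory prompt flips Claude Code's verdict support from 10% to 90% while leaving its coefficient distribution essentially unchanged, operating through rule omission rather than rule softening. In this setting, the locus of AI bias is therefore interpretation rather than estimation, which highlights the need for transparency and interpretability controls at the verdict layer.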

Link: https://arxiv.org/abs/2606.11456
Authors: Meysam Alizadeh, Fabrizio Gilardi, Mohsen Mosleh, Enkelejda Kasneci
Affiliations: University of Oxford; University of Zurich; Technical University of Munich
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:The deployment of LLM-based agents in scientific analysis raises opposing concerns: that agents may reduce methodological diversity, or that they may amplify the analytic flexibility through which researchers reach motivated conclusions. We argue these worries target two empirically separable layers: a design layer of methodological choices, and a verdict layer in which a decision rule maps estimates to a substantive claim. We test both by running 20 independent executions of Claude Code and Codex on a prominent immigration and social-policy against a many-analysts human baseline. At the design layer, Codex matches human methodological diversity and Claude Code produces nearly three times as many specifications; both agents’ effect estimates remain broadly aligned with the human consensus, and no agent model exactly matches any human model. A prompt-induced anti-immigration researcher prior reorganizes each agent’s methodological decisions but, unlike for biased human analysts in the same data, does not shift aggregate estimates or final verdicts; nor do agents reroute along the methodological axes humans use to bias their estimates. At the verdict layer, an explicit confirmatory prompt flips Claude Code’s verdicts from 10% to 90% support while leaving its coefficient distribution essentially unchanged, operating through rule omission rather than rule softening. AI agents can rival or exceed human methodological diversity at the design layer while remaining vulnerable at the verdict layer. In our setting, the locus of AI bias is not estimation but interpretation.

[NLP-75] AI Coding Agents Can Reproduce Social Science Findings

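Quick read: This paper addresses the lack of systematic evaluation of whether AI coding agents can reproduce findings in the social sciences. Existing benchmarks are either small or conflate agent performance with defects in the reproduction materials themselves, such as code that fails to execute. The solution is SocSci-Repro-Bench, a benchmark of 221 tasks spanning four disciplines and 13 substantive domains, built from studies whose results are either fully reproducible with available materials or demonstrably non-reproducible because of missing data, which isolates the agents' own reproduction capacity. Evaluating the frontier coding agents Claude Code and Codex, the study finds that both reproduce a large share of social science findings, with Claude Code substantially outperforming Codex, and that their reproduction rates considerably exceed those previously reported for general-purpose LLM-based agents. Both agents also perform strongly on a reasoning task that requires identifying the underlying research question, and the results do not appear to be driven primarily by memorization. Providing the original paper PDF modestly improves performance but introduces bias on tasks where reproduction is impossible, and subtle prompt framing can nudge agents toward confirmatory specification search. These findings suggest that at least some frontier coding agents can reliably execute computational workflows, while underscoring the need for careful benchmarking and prompt design as AI takes on larger roles in scientific production.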

Link: https://arxiv.org/abs/2606.11447
Authors: Meysam Alizadeh, Mohsen Mosleh, Fabrizio Gilardi, Atoosa Kasirzadeh, Joshua Tucker
Affiliations: University of Oxford; University of Zurich; Carnegie Mellon University; New York University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Recent anecdotal evidence suggests that AI coding agents can reproduce published findings when provided with original data and code; yet systematic evaluation across social sciences remains limited. Existing evaluation benchmarks are insufficient: they are either small or conflate agent performance with problems in the reproduction materials themselves, such as code that fails to execute correctly. Here we introduce SocSci-Repro-Bench, a benchmark of 221 tasks spanning four disciplines and 13 substantive domains, constructed from studies whose results are either fully reproducible with available materials or demonstrably non-reproducible due to missing data, allowing us to isolate agents’ reproduction capacity. Evaluating two frontier coding agents, Claude Code and Codex, we find that both can reproduce a large share of social science findings, with Claude Code substantially outperforming Codex. These reproduction rates considerably exceed those previously reported for general-purpose LLM-based agents on comparable reproducibility benchmarks. Both agents also perform strongly on a reasoning task requiring identification of underlying research questions, and additional analyses suggest that results are not primarily driven by memorization. Providing the original paper PDF alongside replication materials modestly improves performance but introduces bias on tasks where reproduction is impossible. We also show that agents can be nudged toward confirmatory specification search through subtle prompt framing. Together, these findings suggest that at least some frontier coding agents can serve as reliable executors of computational workflows while underscoring the need for careful benchmarking and prompt design as AI systems assume larger roles in scientific production.

[NLP-76] Agent Skill Evaluation and Evolution: Frameworks and Benchmarks

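Quick read: This survey addresses the utility, quality, and safety challenges that large agent skill libraries face in real-world use, in particular how to evaluate skills efficiently and evolve them continuously as libraries scale. The core problem is that isolated skill creation no longer meets the needs of complex scenarios, which calls for automated, evaluation-driven skill evolution. The survey categorizes skill evolution into four paradigms (execution feedback, trajectory distillation, compression, and reinforcement learning) and shows how each contributes to skill utility and reliability. It also analyzes six skill-centric benchmark categories, identifies structural gaps in benchmark coverage, trade-offs, and metric richness, and outlines open directions for building skill ecosystems that are generalizable, efficient, and verifiably safe.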

Link: https://arxiv.org/abs/2606.11435
Authors: Kexin Ding, Yang Zhou, Can Jin, Feng Tong, Mu Zhou, Dimitris N. Metaxas
Affiliations: Rutgers University; University of North Carolina at Charlotte
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The growth of agent skills has transformed how agentic systems are built, evaluated, and deployed. As skill libraries continue to scale, rigorous evaluation becomes critical to ensuring their utility, quality, and safety in real-world applications. Consequently, the field is undergoing an emerging paradigm shift from isolated skill creation to automated, evaluation-driven skill evolution. In this survey, we systematically examine the landscape of skill evolution and evaluation beyond foundational skill creation. We categorize evolution into four distinct paradigms, spanning execution feedback, trajectory distillation, compression, and reinforcement learning, showing how each element contributes to improving skill utility and reliability. We also provide an analysis of six skill-centric benchmark categories, identifying structural gaps in benchmark coverage, trade-offs, and metric richness to advance skill research. Finally, we identify open directions for building skill ecosystems that are generalizable, efficient, and verifiably safe. The project URL is this https URL

[NLP-77] SOMA-SQL: Resolving Multi-Source Ambiguity in NL-to-SQL via Synthetic Log and Execution Probing

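[Quick Read]: This paper addresses the poor robustness of natural-language-to-SQL (NL2SQL) models in real-world settings, where user questions are underspecified and database schemas are large and ambiguous. The core challenge is the multi-source ambiguity among user intent, database schema, and model interpretation, which leads to misaligned intent, incorrect schema grounding, and erroneous SQL generation. Existing methods rely on human clarification or treat ambiguity as a schema representation problem, and so are hard to scale automatically. The paper proposes SOMA-SQL, whose key innovation is automatic ambiguity resolution through a targeted synthetic query log and ambiguity-driven probing: it first uses the synthetic query log to ground schema interpretation and guide candidate SQL generation; it then executes targeted probing queries, driven by a structured ambiguity taxonomy and disagreements among candidates, to produce disambiguation evidence for final SQL selection and repair. This active framework for discovering and resolving ambiguity requires no human intervention and generalizes to unseen schemas and query distributions. Experiments on six public benchmarks show that SOMA-SQL improves execution accuracy by 13.0% on average over state-of-the-art baselines, with gains of up to 16.7% on ambiguous questions.

Link: https://arxiv.org/abs/2606.11424
Authors: Sai Ashish Somayajula, Marianne Menglin Liu, Chuan Lei, Fjona Parllaku, Daniel Garcia, Rongguang Wang, Syed Fahad Allam Shah, Ankan Bansal, Sujeeth Bharadwaj, Tao Sheng, Sujith Ravi, Dan Roth
Affiliations: Oracle AI
Categories: Computation and Language (cs.CL)
Comments: 34 pages, 1 figure, 7 tables. Preprint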

Abstract:Natural language interfaces to databases aim to translate user questions into executable SQL, yet remain brittle in real-world settings where questions are underspecified and schemas are large and ambiguous. Ambiguity across user questions, database schemas, and model interpretations are central failure modes in NL2SQL, leading to misaligned intent, incorrect schema grounding, and erroneous SQL generation. Existing approaches rely on human clarification or treat ambiguity as a schema representation problem, but these do not scale nor resolve ambiguity autonomously. We propose SOMA-SQL to automatically resolve ambiguity via targeted synthetic query log and ambiguity-driven probing. SOMA-SQL constructs synthetic query log to ground schema interpretation and guide candidate SQL generation; it then executes targeted probing queries, driven by a structured ambiguity taxonomy and candidate disagreements, to produce disambiguation evidence for final SQL selection and repair. This active approach to ambiguity discovery and resolution generalizes across unseen schemas and query distributions without human-in-the-loop. Experiments on six public benchmarks demonstrate that SOMA-SQL improves execution accuracy by 13.0% on average over state-of-the-art baselines, with gains of up to 16.7% on ambiguous questions.

[NLP-78] Context-Aware Multimodal Claim Verification in Spoken Dialogues

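[Quick Read]: This paper addresses the difficulty of verifying spoken misinformation in conversational settings, targeting the limitation that current fact-checking research focuses on isolated text and overlooks dialogue context. The core challenge is that the credibility of spoken claims depends not only on the facts themselves but also on interactional features such as how claims are framed, reinforced through repetition, or left unchallenged, and these dynamics are not yet adequately modeled by existing techniques. The paper introduces MAD2, a multi-turn spoken dialogue benchmark containing 1,000 two-speaker dialogues, 3,368 check-worthy claims, and about 10 hours of audio, and proposes a calibrated multimodal fusion method that combines a context-aware audio encoder with a dialogue-aware text model. Experiments show that adding dialogue context improves verification, with gains that depend on the scenario type; using only the preceding context often matches offline performance, which supports live-moderation settings; and audio contributes most when transcript-based models are destabilized by additional context. Overall, conversational structure matters more for verification than misinformation framing.

Link: https://arxiv.org/abs/2606.11420
Authors: Chaewan Chun, Delvin Ce Zhang, Dongwon Lee
Affiliations: The Pennsylvania State University, USA; University of Sheffield, UK
Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: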

Abstract:Every day, millions absorb claims from podcasts and streams that no fact-checker ever sees. Spoken misinformation is built through conversation, where credibility comes not from facts alone but from how claims are framed, reinforced, or left unchallenged across turns. Yet fact-checking has focused on isolated text, leaving dialogue audio under-studied. We introduce MAD2, a new Multi-turn Audio Dialogues benchmark for spoken claim verification, containing 1,000 two-speaker dialogues with 3,368 check-worthy claims and approximately 10 hours of audio, and propose calibrated multimodal fusion of a context-aware audio encoder and a dialogue-aware text model. Across settings, adding dialogue context improves verification, but the gains depend on scenario type. Using only preceding context often matches offline performance, supporting live-moderation settings, and audio contributes most when transcript-based models are destabilized by additional context. Overall, conversational structure matters more for verification than misinformation framing.

[NLP-79] Scenario-based Probing and Steering Cultural Values in Large Language Models -- Extended Version

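[Quick Read]: This paper addresses the cultural homogenization of large language models (LLMs) deployed across cultures: model behavior is shaped by training data and reflects a single, standardized set of values, so it fails to capture the underlying values of a target culture. Existing evaluations mostly rely on direct, survey-style prompts, which often elicit neutral or safety-aligned responses and do not reveal the model's latent cultural preferences. The authors propose a framework for probing and steering latent cultural representations along the two Inglehart–Welzel axes of the World Values Survey (WVS). Its core idea is to translate social value questions into scenario-based behavioral dilemmas, extract token-level probabilities to quantify the model's implicit values, and combine activation steering with country-conditioned prompting to shift model behavior without fine-tuning. Experiments show substantial variation in steerability across models and cultures and reveal latent entanglement between cultural dimensions: an intervention along one dimension induces unintended shifts along the other. This coupling mirrors correlations in human WVS data and persists under activation, prompt, and hybrid steering; it limits axis-independent alignment, although general task performance is largely preserved.

Link: https://arxiv.org/abs/2606.11399
Authors: Trung Duc Anh Dang, Tung Kieu, Sarah Masud
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 18 pages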

Abstract:Large Language Models (LLMs) are deployed across cultural contexts but often reflect homogenized values inherited from training data. Evaluations of cultural alignment typically rely on direct prompting with survey-style questions, which frequently elicit neutral or safety-aligned responses and fail to capture underlying model preferences. We propose a framework for probing and steering latent cultural representations in LLMs along the two Inglehart–Welzel axes of the World Values Survey (WVS). By translating social value questions into scenario-based behavioral dilemmas, we extract token-level probabilities to measure implicit values and apply activation steering, optionally combined with country-conditioned prompting, to shift model behavior without retraining. Across three open-source LLMs and four target cultures, we find substantial variation in steerability and identify latent entanglement, where interventions along one cultural dimension induce shifts along another. This coupling mirrors correlations in human WVS data and persists across activation, prompt, and hybrid steering. It constrains axis-independent alignment, though general task performance is largely preserved.

[NLP-80] Small Experiments, Cheaper Decisions: A Case Study in Staged Promotion for Micro-Pretraining

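[Quick Read]: This paper addresses the configuration-selection bias introduced by short pretraining runs in micro-pretraining. Screening configurations quickly under tiny budgets tends to over-promote hyperparameter combinations that only look strong at low budgets and do not hold up at larger ones. The paper studies an auditable staged-promotion protocol for a fixed micro-pretraining runner on heterogeneous hardware (Windows A100 and Linux L40S). The core of the approach is a sequence of increasing budgets (2 minutes, 5 minutes, 10 minutes, 60 minutes, and 12 hours) with promotion rules frozen before the expensive continuations. The early stages are deliberately treated as unstable (the 5- and 10-minute rankings are host-sensitive) to avoid locking in a potentially suboptimal configuration too early, and replication across seeds and hosts checks that the final winner (the "bridge condition") ranks first in every host-seed cell. The key design choice is to use frozen objective criteria as promotion thresholds (a 0.010 val_bpb near-equivalence rule and a 0.020 mean-gap rule), under which the greedy comparator and the cheaper d8/ar48 sentinel fail to qualify. The executed 12-hour branch spends 144 GPU-hours and the full protocol records 169.2 training GPU-hours, compared with 192 to 432 GPU-hours had all candidates been continued. The result is a cost-bounded screening finding, not a claim of global optimality or of superiority over adaptive hyperparameter optimization methods.

Link: https://arxiv.org/abs/2606.11387
Authors: Felipe Chavarro Polania
Affiliations: Hewlett Packard Enterprise
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 pages, 5 figures; 12-hour dual-host micro-pretraining promotion study; source package includes curated ancillary artifacts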

Abstract:Short pretraining runs can reduce experimental cost, but they can also over-promote configurations that only look strong at tiny budgets. We study an auditable staged-promotion protocol for a fixed micro-pretraining runner on two heterogeneous host blocks: Windows A100 and Linux L40S. Starting from twelve prior-screened configurations, we use staged budgets of 2 minutes, 5 minutes, 10 minutes, 60 minutes, and 12 hours, with frozen promotion rules before expensive continuations. The early screens are intentionally treated as unstable: the 5- and 10-minute rankings are host-sensitive, and the eventual 12-hour top-ranked condition is not the mean-best condition at the replicated 10-minute gate. Because seed ranges differ across stages, these changes are operational promotion evidence, not within-seed curves. A replicated 60-minute gate keeps the Staged Factorial Screening bridge reference in the promoted set, where it ranks first in all four 60-minute host-seed cells. In the final 12-hour confirmation package, the bridge condition ranks first in all four host-seed cells across two seeds; the greedy comparator does not meet the frozen 0.010 val_bpb near-equivalence rule; and the cheaper d8/ar48 (depth-8, aspect-48) sentinel does not meet the frozen 0.020 mean-gap rule. The executed 12-hour branch spends 144 GPU-hours, and the full staged protocol records 169.2 training GPU-hours including screening stages. Continuing all four 60-minute candidates would spend 192 GPU-hours, while continuing all nine replicated 10-minute candidates would spend 432 GPU-hours. The latter numbers are accounting counterfactuals for unrun continuations, not evidence that skipped candidates could not have overtaken the reference. The result is a bounded cost-allocation finding, not a claim of global optimality, capacity-normalized superiority, or superiority over adaptive hyperparameter optimization methods. 
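
The GPU-hour figures in the abstract are consistent with a simple accounting, sketched below in Python. The sketch rests on assumptions that the abstract does not state outright: every run occupies one GPU for the full 12 hours, each continued condition is run in all four host-seed cells, and three conditions (bridge, greedy comparator, d8/ar48 sentinel) were continued. The function name `continuation_gpu_hours` is hypothetical.

```python
# Hypothetical accounting sketch; the per-run assumptions are ours, not the paper's.
HOST_SEED_CELLS = 4   # 2 hosts x 2 seeds ("four host-seed cells" in the abstract)
HOURS_PER_RUN = 12    # budget of the final stage
GPUS_PER_RUN = 1      # assumption: one GPU per run

def continuation_gpu_hours(num_conditions: int) -> int:
    """GPU-hours needed to continue `num_conditions` conditions for 12 hours."""
    return num_conditions * HOST_SEED_CELLS * HOURS_PER_RUN * GPUS_PER_RUN

executed_12h = continuation_gpu_hours(3)      # assumed: bridge, greedy, sentinel
all_60min_cands = continuation_gpu_hours(4)   # counterfactual: all four 60-minute candidates
all_10min_cands = continuation_gpu_hours(9)   # counterfactual: all nine 10-minute candidates

# Remaining budget attributed to the screening stages (169.2 total is from the abstract).
screening_hours = round(169.2 - executed_12h, 1)

print(executed_12h, all_60min_cands, all_10min_cands, screening_hours)
```

Under these assumptions the three totals match the 144, 192, and 432 GPU-hours quoted in the abstract, which leaves about 25 GPU-hours for the screening stages.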

[NLP-81] Overcoming State Inertia in Full-Duplex Spoken Language Models via Activation Steering

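[Quick Read]: This paper addresses the delayed response of full-duplex spoken language models (FD-SLMs) to abrupt user interruptions in real-time conversation. The core challenge is a lag in the model's internal state switching: when the user suddenly starts speaking, the model lingers in the generative state instead of switching promptly to the perceptive state, and so misses the beginning of the input. The authors call this phenomenon "state inertia". The key to the solution is a training-free activation steering mechanism that uses a perception vector to shift the predictive focus of the model's hidden representations from the generative state toward the perceptive state, enabling an immediate response to abrupt input. Experiments on the Zero-Buffer Benchmark (ZBB) show substantial gains in response correctness and initial-word occurrence rate (IWOR); for example, on PersonaPlex, correctness rises from 28% to 45% and IWOR from 40% to 72%, without fine-tuning and with little additional computational overhead.

Link: https://arxiv.org/abs/2606.11386
Authors: Cheng-Kuang Chang, Kai-Wei Chang, Alexander H. Liu, James Glass
Affiliations: Massachusetts Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: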

Abstract:Full-duplex spoken language models (FD-SLMs) enable seamless speech interaction by allowing models to listen and speak simultaneously, yet the internal mechanism by which they coordinate listening and speaking remains underexplored. We analyze the predictive behavior encoded in FD-SLM hidden representations and find that they exhibit stream-specific predictive patterns: during listening, they preferentially predict the incoming user stream, whereas during speaking, they preferentially predict the model output stream. Building on this observation, we show that FD-SLMs dynamically modulate their internal predictive focus between two states: a generative state aligned with model output generation and a perceptive state aligned with incoming user input. However, this modulation can lag behind abrupt changes in conversational context. During user interruptions, the model remains transiently biased toward the generative state before transitioning into the perceptive state, causing it to miss the beginning of the incoming input. We term this delayed internal transition state inertia. To quantify its downstream impact, we introduce the Zero-Buffer Benchmark (ZBB), a diagnostic benchmark for evaluating immediate interruption comprehension when user speech begins abruptly. We evaluate this setting using response correctness and initial-word occurrence rate (IWOR). Finally, we mitigate state inertia through activation steering with a perception vector, a training-free intervention with little additional computational overhead. Across multiple state-of-the-art FD-SLMs, activation steering substantially improves interruption handling; for example, on PersonaPlex, it improves correctness from 28% to 45% and IWOR from 40% to 72% without any fine-tuning.
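
As a rough illustration of activation steering in general, the NumPy sketch below adds a scaled vector to hidden states. The abstract does not say how the perception vector is obtained; the difference-of-means construction here is a common choice and is purely an assumption, as are the toy data and the function names `perception_vector` and `steer`.

```python
import numpy as np

def perception_vector(perceptive_states, generative_states):
    """Difference of mean hidden states (an assumed construction, not the paper's)."""
    return perceptive_states.mean(axis=0) - generative_states.mean(axis=0)

def steer(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to every hidden state (no training involved)."""
    return hidden + alpha * vector

# Toy hidden states: the two "states" differ along the first coordinate.
rng = np.random.default_rng(0)
dim = 8
direction = np.zeros(dim)
direction[0] = 1.0
generative = rng.standard_normal((50, dim)) - 2.0 * direction
perceptive = rng.standard_normal((50, dim)) + 2.0 * direction

v = perception_vector(perceptive, generative)
h = generative[:5]                    # states stuck in the "generative" regime
h_steered = steer(h, v, alpha=1.0)

# Projection onto the steering direction before and after the intervention.
proj_before = h @ v
proj_after = h_steered @ v
```

Each projection increases by exactly `alpha * (v @ v)`, which is the sense in which steering pushes representations toward the perceptive side in this toy setting.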

[NLP-82] When Probing Accuracy Saturates, Fragility Resolves: A Complementary Metric for LLM Pre-Training Analysis

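[Quick Read]: This paper addresses the failure of standard linear probing across pre-training: probe accuracy saturates early in training and cannot reflect how representations keep evolving afterwards, so most of training is "invisible" to the probe. The core solution is fragility, a complementary per-layer metric defined as the activation-noise level at which probe accuracy collapses. Fragility is sensitive to both the margin of separability and the redundancy of the representation, both of which keep evolving after probe accuracy plateaus. Applied to open-checkpoint language models, fragility reveals structure that accuracy cannot capture: moral representations emerge along a lexical-to-compositional gradient; the presence of compositional encoding is verified directly through transfer across construction types that share no contrast tokens; a layer-depth robustness gradient develops monotonically over training while accuracy stays flat; and models fine-tuned on different corpora with identical probe accuracy show distinct fragility fingerprints, indicating that data curation reshapes probe robustness without changing accuracy. In every comparison tested, where probe accuracy returns a flat answer, fragility returns a structured one.

Link: https://arxiv.org/abs/2606.11375
Authors: Orion Reblitz-Richardson
Affiliations: Distiller Labs
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 22 pages, 5 figures. Code and datasets at this https URL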

Abstract:Standard linear probing declares a property “encoded” when a classifier on hidden states achieves high accuracy. The protocol works well on a snapshot but breaks across pre-training: probe accuracy saturates within the first few thousand steps, leaving most of training invisible to the instrument. We introduce fragility, a complementary per-layer metric defined as the activation-noise level at which probe accuracy collapses. Fragility is sensitive to both the margin of separability and the redundancy of representation, both of which keep evolving long after accuracy plateaus. Applied to open-checkpoint language models, fragility recovers structure that accuracy alone cannot see. Moralized representations emerge along a lexical $\to$ compositional gradient: lexical moral detection first, compositional moral encoding later. Because probe accuracy on its own tracks how lexically separable a dataset is, we establish the compositional encoding directly, by showing it transfers across construction types that share no contrast tokens. A layer-depth robustness gradient develops monotonically across training while accuracy stays flat. And matched fine-tuning corpora that produce identical probing accuracy leave distinct fragility fingerprints, showing that data curation reshapes probe robustness without changing probe accuracy. In every comparison we test, where probing accuracy returns a flat answer, fragility returns a structured one.
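
The idea of a noise level at which probe accuracy collapses can be sketched on toy data. The version below is only an illustration: the isotropic Gaussian noise, the fixed collapse threshold of 0.75, the sigma grid, and the names `probe_accuracy` and `collapse_noise_level` are all assumptions rather than the paper's protocol.

```python
import numpy as np

def probe_accuracy(X, y, w, b):
    """Accuracy of a fixed linear probe (X @ w + b > 0) against binary labels y."""
    preds = (X @ w + b > 0).astype(int)
    return float(np.mean(preds == y))

def collapse_noise_level(X, y, w, b, sigmas, collapse_acc=0.75, n_draws=20, seed=0):
    """Smallest sigma in `sigmas` at which mean noisy accuracy drops below `collapse_acc`."""
    rng = np.random.default_rng(seed)
    for sigma in sigmas:
        accs = [probe_accuracy(X + sigma * rng.standard_normal(X.shape), y, w, b)
                for _ in range(n_draws)]
        if np.mean(accs) < collapse_acc:
            return float(sigma)
    return float("inf")  # never collapsed on this grid

# Two toy "representations" with the same clean probe accuracy but different margins.
n, dim = 200, 16
y = np.arange(n) % 2
signs = 2.0 * y - 1.0
w = np.zeros(dim)
w[0] = 1.0
b = 0.0
X_narrow = np.zeros((n, dim))
X_narrow[:, 0] = 1.0 * signs   # small margin
X_wide = np.zeros((n, dim))
X_wide[:, 0] = 4.0 * signs     # large margin

sigmas = np.linspace(0.25, 10.0, 40)
level_narrow = collapse_noise_level(X_narrow, y, w, b, sigmas)
level_wide = collapse_noise_level(X_wide, y, w, b, sigmas)
```

Both toy representations give a clean probe accuracy of 1.0, yet the narrow-margin one collapses at a much smaller noise level. This mirrors the abstract's point that a noise-based measure can separate cases that accuracy alone cannot.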

[NLP-83] The Dynamics of Human and AI-Generated Language: How Semantics Fluctuates across Different Timescales

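[Quick Read]: This paper addresses how to quantify and compare the distribution of generic versus specific semantic content over time in human and AI-generated speech, targeting the lack of simple, interpretable time-series features that capture how semantics evolve over time. The core of the solution is a semantic-timescale analysis pipeline that converts timestamped word-level transcripts into semantic time series, modeled with two key measures: WordNet-based word depth, which captures semantic specificity, and contextual similarity computed from SBERT embeddings; the temporal dependence of both is quantified with autocorrelation-window measures (ACW-0 and related metrics). The study further designs several shuffled controls (disrupting lexical identity, temporal order, and word duration) and finds that the association between semantic specificity and ACW-0 is strongly attenuated or abolished when temporal structure is destroyed, indicating that ACW measures capture non-trivial temporal organization beyond static lexical distributions. ACW-based semantic timescale features therefore offer an interpretable and robust way to analyze and compare the temporal structure of human and generative-AI speech.

Link: https://arxiv.org/abs/2606.11371
Authors: Han-Jen Chang, Yasir Çatal, Angelika Wolman, Agustín Ibáñez, David Smith, I-Wen Su, Kai-Yuan Cheng, Georg Northoff
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Comments: 45 pages, 4 figures, 4 tables. Accepted manuscript; published in Computer Speech & Language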

Abstract:Spoken language, whether produced by humans or large language models (LLM), unfolds over time with varying semantic content. However, we still lack simple, interpretable time-series features that capture how generic versus specific content is distributed over time, and that can be used to compare human and AI-generated speech. We introduce a semantic-timescale analysis pipeline that turns word-level transcripts with timestamps into semantic time-series. For each spoken narrative, we compute (i) semantic specificity using WordNet-based word depth and (ii) contextual similarity using SBERT embeddings and quantify their temporal dependence using autocorrelation-window measures (ACW-0 and related metrics). We then compare original speech to multiple shuffled controls that selectively disrupt lexical identity, temporal order, and word duration. Across human-read autobiographical narratives, TTS readings, and LLM-generated texts rendered with TTS, we find that segments with longer ACW-0 in the semantic time-series tend to contain more generic vocabulary, whereas segments with shorter ACW-0 are enriched in more specific words. These associations are strongly attenuated or abolished when word order and timing are randomized, indicating that ACW-based measures capture non-trivial temporal organization of semantic content beyond static lexical distributions. Our results suggest that ACW-based semantic timescales are a useful family of features for analyzing and comparing the temporal structure of human and AI-generated speech.
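
To make the autocorrelation-window idea concrete, the sketch below computes ACW-0 under the common definition "first lag at which the autocorrelation function reaches zero or below". The abstract does not spell out its exact estimator, so this definition, the biased autocorrelation estimate, and the toy series are assumptions; `acw0` is a hypothetical helper name.

```python
import numpy as np

def autocorrelation(series):
    """Normalized autocorrelation for lags 0..N-1 (assumes a non-constant series)."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    full = np.correlate(x, x, mode="full")   # lags -(N-1) .. (N-1)
    acf = full[x.size - 1:]                  # keep lags 0 .. N-1
    return acf / acf[0]

def acw0(series):
    """First lag where the autocorrelation is <= 0 (series length if it never is)."""
    acf = autocorrelation(series)
    crossings = np.flatnonzero(acf <= 0.0)
    return int(crossings[0]) if crossings.size else int(acf.size)

# Toy stand-ins for a semantic time series such as per-word WordNet depth.
t = np.arange(200)
slow = np.sin(2 * np.pi * t / 40)   # slowly varying content
fast = (-1.0) ** t                  # content that flips at every step

acw_slow = acw0(slow)
acw_fast = acw0(fast)
```

A longer ACW-0 means the series stays correlated with its own past for more steps; in this toy example the slowly varying series has a longer window than the alternating one.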

[NLP-84] Can AI Agents Synthesize Scientific Conclusions?

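[Quick Read]: This paper addresses the reliability of generative AI when synthesizing open-ended scientific conclusions in high-stakes domains such as health, and in particular how to measure its real ability on complex tasks that integrate multiple sources. Existing evaluations are often inflated by data leakage and do not reflect performance in real use. The study introduces SciConBench, a large-scale live benchmark of 9.11K questions with expert-written conclusions, together with an expert-validated automated evaluation pipeline that decomposes conclusions into atomic facts and measures correctness and comprehensiveness via factual precision and recall. To prevent data leakage, it further introduces SciConHarness, a clean-room evaluation harness that strictly controls the agent's web interaction so that measurements remain valid. Under clean-room settings, the best model reaches a factual F1 of only 0.337, consistently below its performance under unconstrained evaluation, which suggests that leakage inflates estimates of true capability. An audit of consumer-facing scientific agents (e.g., Google AI Overview, OpenEvidence) finds that their conclusions are frequently incomplete and sometimes contradictory, even when a ground-truth answer is available. The study concludes that reliable synthesis of scientific conclusions remains an open challenge and that clean-room evaluation is essential for assessing open-domain AI agents.

Link: https://arxiv.org/abs/2606.11337
Authors: Hayoung Jung, Pedro Viana Diniz, José Reinaldo Corrêa Roveda, Abner Fernandes da Silva, Haeun Jung, Enoch Tsai, Aleksandra Korolova, Manoel Horta Ribeiro
Affiliations: Princeton University; Universidade Federal de Minas Gerais; Stony Brook University; Hackensack Meridian School of Medicine
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 79 pages, 34 figures, 17 tables. Under Submission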

Abstract:Scientific AI agents increasingly retrieve evidence, reason across sources, and synthesize conclusions used in consequential decisions. Yet, their ability to do so in high-stakes domains such as health remains unclear. We introduce SciConBench, a large-scale live benchmark of 9.11K questions and expert-written conclusions from systematic reviews to evaluate open-domain scientific conclusion synthesis. The benchmark draws on an expert-validated automated evaluation pipeline that decomposes conclusions into atomic facts and measures correctness and comprehensiveness via factual precision and recall. To mitigate data leakage, we further introduce SciConHarness, a clean-room evaluation harness that equips agents with controlled web interaction to ensure valid measurement. Evaluating 8 frontier models and deep research agents, we find that factual quality remains low: under clean-room settings, the best agent achieves only a factual F1 of 0.337. Our clean-room setting consistently reduces performance relative to unconstrained evaluation, suggesting that leakage inflates estimates of models’ true synthesis capabilities. Finally, we audit consumer-facing agents (e.g., Google AI Overview, OpenEvidence) and find they frequently generate incomplete and sometimes contradictory conclusions, even when the ground-truth answer is available. Overall, our results show that reliable synthesis of scientific conclusions remains an open challenge, and that clean-room evaluation is essential for assessing open-domain AI agents.
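
Factual precision, recall, and F1 over atomic facts can be illustrated with a small sketch. Exact string matching between facts is a stand-in chosen for simplicity; the paper's pipeline is described only as expert-validated and automated, so the matching rule, the example facts, and the function name `factual_prf` are assumptions.

```python
def factual_prf(predicted_facts, reference_facts):
    """Precision, recall, and F1 over sets of atomic facts (exact-match stand-in)."""
    pred, ref = set(predicted_facts), set(reference_facts)
    supported = pred & ref   # predicted facts that also appear in the reference
    precision = len(supported) / len(pred) if pred else 0.0
    recall = len(supported) / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical atomic facts, invented for illustration only.
reference = [
    "drug A reduces symptom X",
    "effect size is small",
    "evidence certainty is low",
    "no serious adverse events reported",
]
predicted = [
    "drug A reduces symptom X",
    "evidence certainty is low",
    "drug A is superior to drug B",   # not supported by the reference
]

precision, recall, f1 = factual_prf(predicted, reference)
```

Here precision penalizes the unsupported third claim, while recall penalizes the two reference facts that the synthesized conclusion leaves out. The abstract pairs these two quantities with correctness and comprehensiveness.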

[NLP-85] Schützen: Evaluating LLM Safety in Bulgarian and German Contexts

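[Quick Read]: This paper addresses the fact that safety evaluation resources for large language models (LLMs) are overwhelmingly English- and Chinese-centric, leaving no targeted tools for languages such as German and Bulgarian, which operate within shared sociocultural, legal, and ethical contexts. The regional and linguistic bias of existing datasets limits how faithfully they reflect model safety in multilingual settings, especially for low-resource languages such as Bulgarian. The paper introduces Schützen, a safety evaluation dataset covering German and Bulgarian that measures model answerability under risk. The key to the solution is building cross-lingual, region-specific evaluation resources; experiments with multilingual and language-specific LLMs reveal pronounced cross-language differences in safety behavior, underscoring the need for tailored, region-specific evaluation resources to support responsible deployment of LLMs in Germany and Bulgaria.

Link: https://arxiv.org/abs/2606.11316
Authors: Kiril Georgiev, Yuxia Wang, Dimitar Iliyanov Dimitrov, Preslav Nakov, Ivan Koychev
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 19 pages, 13 tables, 12 figures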

Abstract:Large language models are increasingly deployed across professional domains, bringing hard-to-predict risks, including the generation of harmful or disrespectful content. Although substantial progress has been made in developing safety evaluation datasets, existing resources remain overwhelmingly English- and Chinese-centric. This limitation is particularly pronounced when evaluating languages that operate within shared sociocultural, legal, and ethical contexts. To address this gap, we introduce Schützen: a German–Bulgarian safety dataset designed to assess model answerability under risk, covering both a low-resource language (Bulgarian) and a high-resource language (German). Experiments with multilingual and language-specific LLMs reveal pronounced cross-language differences in safety behavior, highlighting the necessity of tailored, region-specific evaluation resources to support the responsible deployment of LLMs in Germany and Bulgaria. Datasets and code are available at this https URL. Warning: this paper contains examples that may be offensive, harmful, or biased.

[NLP-86] FlowBank: Query-Adaptive Agent ic Workflows Optimization through Precompute-and-Reuse

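[Quick Read]: This paper addresses the trade-off between efficiency and performance in agentic workflow optimization for multi-agent systems based on large language models (LLMs). Existing methods have two main limitations: task-level methods spend substantial offline compute but deploy only a single workflow, leaving other potentially useful candidates unused, while query-level methods synthesize a new workflow for every query at high inference cost. The authors' analysis shows that the two paradigms are in fact complementary: workflows found during offline search usually cover different subsets of queries, and many queries that would otherwise require expensive query-level generation can be solved by cheaper precomputed workflows. The paper therefore sets a new objective: build a compact bank of reusable, complementary workflows and select among them adaptively per query at inference time. To this end it proposes FlowBank, a three-stage framework that solves three coupled problems: generating complementary rather than redundant candidates, compressing them into a deployable low-redundancy portfolio, and matching each query to the right workflow under a performance-cost trade-off. Specifically, the Diversifying stage introduces DiverseFlow, which steers search toward under-covered queries to raise candidate-pool coverage; the Curating stage introduces CuraFlow, which compresses the pool into a compact portfolio; and the Matching stage casts deployment as edge-value prediction on a query-workflow bipartite graph and routes each query to a portfolio member accordingly. Across five benchmarks, FlowBank achieves the highest average score among the compared methods while remaining cost-competitive, with relative improvements of 4.26% and 14.92% over the strongest automated and handcrafted baselines, respectively.

Link: https://arxiv.org/abs/2606.11290
Authors: Lingzhi Yuan, Chenghao Deng, Fangxu Yu, Souradip Chakraborty, Mohammad Rostami, Furong Huang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: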

Abstract:Large Language Model (LLM)-based multi-agent systems are increasingly powerful, but current agentic workflow optimization paradigms make an unsatisfying trade-off. Task-level methods spend substantial offline compute yet deploy only a single workflow, leaving complementary candidates unused, while query-level methods synthesize a new workflow per query at substantial inference cost. Our motivating analysis shows these paradigms are more complementary than competing: workflows discovered during offline search often solve different subsets of queries, and many queries handled by expensive query-level generation can already be solved by cheaper precomputed workflows. This suggests a different objective: rather than searching for one universally best workflow or regenerating one per instance, we should build a compact bank of reusable, complementary workflows and select among them adaptively at inference time. Doing so requires solving three coupled problems: generating complementary rather than redundant candidates, compressing them into a small deployable portfolio, and assigning each query to the right workflow under a performance-cost trade-off. To this end, we present FlowBank, a three-stage framework for portfolio-based agentic workflow optimization. Diversifying proposes DiverseFlow to steer search toward under-covered queries and produce a high-coverage candidate pool. Curating proposes CuraFlow to compress this pool into a compact portfolio with minimal redundancy. Matching casts deployment as edge-value prediction on a query-workflow bipartite graph and routes each incoming query to the portfolio member with the best predicted utility. Across five benchmarks, FlowBank achieves the highest average score among the evaluated methods while remaining cost-competitive, improving over the strongest automated and handcrafted baselines by 4.26% and 14.92% relative, respectively.

[NLP-87] Quantifying Subliminal Behavioral Transfer Ratios in Language Model Distillation

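[Quick Read]: This paper addresses the implicit behavioral transfer that can occur during knowledge distillation of language models: if the teacher model has undesirable behavioral tendencies, they can be passed to the student in a non-explicit way during training, a phenomenon known as subliminal learning. Qualitative evidence supports the effect, but its magnitude has not been studied systematically. The key to the approach is to steer two teacher models (Llama-2-7B-Chat and Qwen2.5-7B-Instruct) at varying steering strengths, distill student models using only benign data, and then use GPT-4.1 as the evaluator on 100 JailbreakBench prompts to measure subliminal behavioral transfer ratios. The results show that transfer is robust but scales differently across models: Llama-2 exhibits a sharp jump at a specific threshold (τ = 0.25–0.32 when α < -0.15), whereas Qwen2.5 shows continuous and higher levels of transfer (τ up to 0.61), revealing a difference in how susceptible the models are to subliminal learning.

Link: https://arxiv.org/abs/2606.11270
Authors: Uwe Konig, Hamza Kazmi, Ruizhe Li, Maheep Chaudhary
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: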

Abstract:Distillation of a language model intended to transfer benign behavior to a student model may also transfer undesirable characteristics, if they are present in the teacher model, a phenomenon known as subliminal learning. While qualitative evidence supports the existence of this effect, its magnitude has not been systematically characterized. This study quantifies subliminal behavioral transfer ratios by steering two teacher models (Llama-2-7B-Chat and Qwen2.5-7B-Instruct) at varying steering strengths and distilling student models using only benign data. Evaluation on 100 JailbreakBench prompts with GPT-4.1, serving as the evaluator, indicates that transfer is robust but exhibits distinct scaling behaviors. Llama-2 demonstrates a sharp threshold ($\tau = 0.25, 0.32$ beyond $\alpha = -0.15$), whereas Qwen2.5 displays continuous and higher levels of transfer ($\tau$ up to 0.61).

[NLP-88] Energy-Efficient On-Device RAG on a Mobile NPU: System Design and Benchmark on Snapdragon X Elite

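[Quick Read]: This paper addresses the high energy cost and low computational efficiency of running end-to-end retrieval-augmented generation (RAG) on mobile devices; when inference relies on the CPU alone, energy consumption severely limits the benefits of on-device execution such as privacy, low latency, and offline use. The study presents what it describes as the first end-to-end RAG pipeline that runs entirely on the Hexagon neural processing unit (NPU) of the Qualcomm Snapdragon X Elite, covering all neural stages: embedding, reranking, and large language model (LLM) generation. The core of the solution is to exploit the hardware acceleration of the dedicated NPU to raise performance and cut energy use without sacrificing generation quality. On indexing, the NPU delivers 9.1x higher embedding throughput and 12.3x less system energy than the CPU. On a 120-query Wikipedia-passage benchmark, it achieves 18.1x faster LLM prefilling, 4.0x lower end-to-end query latency, and 4.0x less system energy than the CPU baseline, while the integrated GPU (Adreno) is 1.7x slower than the CPU and uses 6.5x more energy than the NPU. A GPT-4.1 LLM-as-judge evaluation finds NPU answer quality on par with CPU and GPU (mean scores of 9.32, 8.95, and 9.03, respectively), with 86.7% of queries scoring identically across the three backends. The work thus shows that efficient, sustainable on-device RAG is feasible on SoCs of the Snapdragon X Elite / Hexagon class, and the authors expect the approach to generalize to comparable mobile NPUs (Apple Neural Engine, Intel NPU, MediaTek APU) as their software stacks mature.

Link: https://arxiv.org/abs/2606.11257
Authors: Zhiyuan Cheng, Longying Lai
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Performance (cs.PF)
Comments: 9 pages, 2 figures, 6 tables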

Abstract:Retrieval-Augmented Generation (RAG) pipelines are compute-intensive, combining embedding, retrieval, reranking, and large language model (LLM) generation. Running them entirely on-device benefits privacy, latency, and offline use, but the energy cost of CPU inference is a major barrier. We present what is, to our knowledge, the first end-to-end RAG pipeline that runs all neural stages – embedding, reranking, and LLM generation – on the Qualcomm Hexagon NPU of the Snapdragon X Elite. Profiling on a Dell XPS 13 laptop, we compare NPU-accelerated RAG against CPU and OpenCL/Adreno GPU baselines on indexing and query workloads. On indexing, the NPU achieves 9.1x higher embedding throughput and 12.3x less system energy. On a 120-query Wikipedia-passage benchmark, it delivers 18.1x faster LLM prefilling, 4.0x lower end-to-end query latency, and 4.0x less system energy than the CPU baseline; the same workload on the integrated GPU is 1.7x slower than CPU and uses 6.5x more energy than the NPU. A GPT-4.1 LLM-as-judge evaluation finds NPU answer quality on par with CPU and GPU within evaluator noise (mean 9.32 vs. 8.95 vs. 9.03 on a 1-10 rubric), with 86.7% of queries scoring identically across all three backends. On the Snapdragon X Elite / Hexagon class of laptop SoC, the NPU thus enables practical, energy-efficient on-device RAG without quality regression – a sustainable path toward green edge intelligence that we expect to generalize to comparable mobile NPUs (Apple Neural Engine, Intel NPU, MediaTek APU) as their software stacks mature.

[NLP-89] ProHiFlo: Hierarchical Flow Matching with Functional Guidance for De Novo Protein Generation

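[Quick Read]: This paper addresses the lack of multi-scale modeling and of mechanisms for incorporating functional constraints in generative protein design, targeting the limitation that existing diffusion-based and flow matching methods operate at a single resolution and cannot easily incorporate functional priors. The core solution is ProHiFlo, a hierarchical flow matching framework with three key innovations: (1) coarse-to-fine generation that models backbone geometry before refining to all-atom coordinates, reducing computational cost while maintaining accuracy; (2) functional guidance that uses pretrained function predictors to steer generation toward desired properties without retraining; and (3) an adaptive SE(3)-equivariant architecture for efficient multi-scale processing. Experiments show state-of-the-art performance on unconditional generation, motif scaffolding, and functional design with 4 fewer sampling steps; on enzyme active-site scaffolding, ProHiFlo achieves a 58.9% success rate, compared with 41.2% for RFDiffusion.

Link: https://arxiv.org/abs/2606.11243
Authors: Chuanzhen Wang, Meade Cleti, Pete Jano
Affiliations: Arizona State University; University of Wisconsin-Madison; Tongji University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 23 pages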

Abstract:De novo protein generation has transformative potential in therapeutic design, enzyme engineering, and synthetic biology. While diffusion-based and flow matching approaches have achieved progress, they typically operate at single resolution and lack mechanisms for incorporating functional constraints. We introduce ProHiFlo, a hierarchical flow matching framework with three innovations: (1) coarse-to-fine generation that models backbone geometry before refining to all-atom coordinates, reducing computational cost while maintaining accuracy; (2) functional guidance leveraging pretrained predictors to steer generation toward desired properties without retraining; (3) adaptive SE(3)-equivariant architecture for efficient multi-scale processing. Experiments on unconditional generation, motif scaffolding, and functional design demonstrate state-of-the-art performance while requiring 4 fewer sampling steps. On enzyme active site scaffolding, ProHiFlo achieves 58.9% success rate compared to 41.2% for RFDiffusion.

[NLP-90] Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs

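[Quick Read]: This paper addresses the fact that existing moral benchmarks for large language models (LLMs) only ask which isolated moral act, value, or foundation a model prefers, and overlook the need to integrate multiple moral signals in realistic moral judgments. The core challenge is how to evaluate a model's ability to combine several moral elements into one judgment in complex situations. The key to the solution is Moral Trolley Arena, a two-stage blind ELO evaluation framework: the first stage calibrates individual moral acts from a 229-scenario corpus across five Moral Foundations Theory foundations; the second stage combines the calibrated acts into two-act moral items over a controlled intensity grid and measures the resulting composite preferences. The results show that composite judgments are largely determined by the strength of the component acts, but the relation is compressed rather than simply additive; models also show non-additive intensity anchoring, foundation-specific residuals that remain after controlling for components, and highly convergent composite preference surfaces across model providers. These findings suggest that future moral audits should measure the rules by which models compose moral evidence, not only rankings over isolated acts.

Link: https://arxiv.org/abs/2606.11232
Authors: Weijia Zhang, Ruiqi Chen, Yunze Xiao, Weihao Xuan
Affiliations: University of Illinois Urbana-Champaign; University of Michigan; Carnegie Mellon University; The University of Tokyo
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: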

Abstract:Existing LLM moral benchmarks usually ask which isolated moral act, value, or foundation a model prefers. This is useful but incomplete. Realistic judgments often require a model to combine several moral signals within the same option. We introduce Moral Trolley Arena, a two-stage blind ELO benchmark for measuring how LLMs compose moral evidence. The single-scene arena first calibrates individual moral acts from a 229-scenario corpus across five Moral Foundations Theory foundations; the composite arena then combines calibrated acts into two-act moral items over a controlled intensity grid and measures the resulting composite preferences. Across ten frontier models, composite judgments are largely predicted by component act strength, but the relation is consistently compressed rather than simply additive. Models also show non-additive intensity anchoring, bounded foundation-specific residuals after component control, and highly convergent composite preference surfaces across providers. These results suggest that moral audits should measure composition rules for moral evidence, not only rankings over isolated acts.
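
For readers unfamiliar with Elo-style rating, the sketch below shows the standard Elo update for one pairwise comparison. The abstract does not give the benchmark's exact rating procedure, so the K-factor of 32, the 400-point scale, the starting ratings, and the function names are conventional defaults used here as assumptions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of item A against item B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b); score_a is 1 for a win, 0.5 draw, 0 loss."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# One blind comparison in which a model prefers act A over act B.
new_a, new_b = elo_update(1000.0, 1000.0, score_a=1.0)
print(new_a, new_b)
```

Repeating such updates over many pairwise judgments yields a scalar strength per moral act. Strengths of that kind are what a composite stage could then test for additive or compressed combination.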

[NLP-91] A Geometric Profile of Semantic Information in Text: Frame-Conditional Uniqueness and a Trade-Off Triangle for Scalar Summaries

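Quick read: This paper addresses how to quantify the semantic content of a single text objectively, interpretably, and on a theoretical footing, without human annotation. Classical information theory (Shannon entropy) measures uncertainty over symbols and ignores meaning, while pairwise metrics such as BERTScore compare two texts and cannot characterize one text on its own. The key to the solution is a geometric framework that measures semantic content from the structure of a text's sentence embeddings. It has three parts. First, within a fixed embedding and baseline, six natural axioms uniquely determine a scalar measure up to scale (the frame-conditional uniqueness theorem). Second, a three-coordinate semantic profile captures novelty (displacement from generic discourse), breadth (diversity of distinct ideas), and integration (connectedness among them), together with a discrete minimal unit, the semantic quantum, whose resolution is fixed by a clustering threshold $\tau$. Third, a no-go theorem shows that no single scalar can simultaneously satisfy analytic stability (under paraphrase and concatenation), ordinal robustness across text scales, and comparability across representation models. The paper therefore presents two practical scalars, $S_\mathrm{minmax}$ and $S_\mathrm{rank}$, each making a different trade-off within this triangle. Validation covers 23 synthetic categories, 5 Project Gutenberg novels, and 3 embedding models. The recommended rank-normalized configuration passes 25 of 28 ordinal checks (21 after Benjamini-Hochberg correction) and outperforms seven baselines, including unigram entropy and a BERTScore-based novelty signal. A separate variational analysis connects the breadth coordinate to the log-determinant of a determinantal point process (DPP), with Spearman $\rho = 0.985$ over 507 Gutenberg chapters, giving breadth an optimization-theoretic foundation.

Link: https://arxiv.org/abs/2606.11222
Authors: Dmitriy Kompaneets
Affiliations: Independent Researcher
Categories: Computation and Language (cs.CL); Information Theory (cs.IT)
Comments: 19 pages. Code and data: this https URL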

Abstract:How much meaning does a text carry? Shannon’s theory measures uncertainty over symbols and is intentionally indifferent to meaning, while pairwise metrics such as BERTScore compare two texts rather than characterizing one. We develop a geometric framework that measures semantic content from the structure of a text’s sentence embeddings. The framework has three parts. First, within a fixed embedding and baseline, six natural axioms uniquely determine a scalar measure up to scale, a frame-conditional uniqueness theorem. The resulting scalar is empirically too coarse, motivating a richer representation. Second, we propose a three-coordinate semantic profile capturing novelty (displacement from generic discourse), breadth (diversity of distinct ideas), and integration (connectedness among them), together with a discrete minimal unit (the semantic quantum) whose resolution is fixed by a clustering threshold $\tau$. Third, we prove a no-go theorem: no scalar summary of the profile can simultaneously satisfy analytic stability under paraphrase and concatenation, ordinal robustness across text scales, and cross-representation comparability. We exhibit two practical scalars, $S_\mathrm{minmax}$ and $S_\mathrm{rank}$, each occupying a distinct corner of this trade-off triangle. Validation across 23 synthetic categories, 5 Project Gutenberg novels, and 3 embedding models confirms the trade-off. The recommended rank-normalized configuration passes 25 of 28 ordinal checks as point estimates (21 of 28 after Benjamini-Hochberg correction), outperforming seven baselines including unigram entropy and a BERTScore-based novelty signal. A separate variational result connects the breadth coordinate to the log-determinant of a determinantal point process (Spearman $\rho = 0.985$ over 507 Gutenberg chapters), giving an optimization-theoretic foundation for breadth.
MSC classes: 94A17 (Primary), 68T50 (Secondary)

[NLP-92] LifeSentence: Language models can encode human life course trajectories from longitudinal panel data

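Quick read: This paper addresses the accuracy of life-course prediction from longitudinal panel data. Conventional statistical methods yield limited accuracy, possibly because they discard the sequential structure of the life course, while existing Transformer-based deep learning models need large-scale training data that most small and medium-sized longitudinal studies lack. The core solution is LifeSentence, which represents each life event as a structured natural-language record and instruction-tunes a pretrained 24-billion-parameter language model across an 18-task evaluation taxonomy spanning prediction, robustness, and reasoning, thereby supplementing panel data with distributional knowledge already encoded during pretraining. Trained on roughly 65,000 individuals from the German Socio-Economic Panel (SOEP), about 45 times fewer than prior Transformer-based approaches, the model achieves a threefold improvement over the best baselines in joint event-and-timing prediction and 91.2% Kendall's tau when reconstructing chronological order from timestamp-stripped event sets. Without explicit supervision, the model also recovers known patterns of social stratification from discrete event sequences, such as the education premium, the gender wage gap, and the motherhood penalty. Its natural-language interface supports qualitatively new research queries, for example connecting an early-life history to a specified late-life endpoint, which makes LifeSentence both a predictive tool and a probe for counterfactual exploration of human biographies.

Link: https://arxiv.org/abs/2606.11220
Authors: Samuel Liu,Muchen Xi,William Yeoh,Joshua J. Jackson
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: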

Abstract:Forecasting human life outcomes is important to gain insights into how individuals attain long and healthy lives. Conventional statistical approaches yield limited accuracy, potentially due to discarding the sequential structure of the life course. Modern methods such as transformer architectures require large scale training data that most longitudinal panel studies lack. Here we introduce LifeSentence, a model for life-course reasoning that bridges large language models with longitudinal panel data. By representing each life event as a structured natural-language record and instruction-tuning a pretrained 24-billion-parameter language model across an 18-task evaluation taxonomy spanning prediction, robustness and reasoning, LifeSentence supplements panel data with distributional knowledge already encoded during pretraining. Trained on approximately 65,000 individuals from the German Socio-Economic Panel - roughly 45 times fewer than prior transformer-based approaches - LifeSentence outperforms classical and deep learning baselines across all task families, achieving a threefold improvement in joint event-and-timing prediction from best baselines and 91.2% Kendall’s tau when reconstructing chronological order from timestamp-stripped event sets. Without explicit supervision, the model recovers documented patterns of social stratification, including the education premium, the gender wage gap and the motherhood penalty, from discrete event sequences alone. A natural-language interface further enables qualitatively new research queries, such as connecting an early-life history to a specified late-life endpoint, establishing LifeSentence as both a predictive tool and a probe for counterfactual exploration of human biographies.
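
Kendall's tau, the metric quoted for chronological reconstruction, compares the relative order of every pair of items in a predicted and a reference ordering. A minimal sketch of the tau-a variant, assuming distinct items and no ties; the function name is illustrative, and the paper's exact variant is not stated in the abstract.

```python
from itertools import combinations


def kendall_tau(predicted, reference):
    """Kendall's tau-a between two orderings of the same distinct items."""
    position = {item: i for i, item in enumerate(reference)}
    ranks = [position[item] for item in predicted]
    concordant = discordant = 0
    for i, j in combinations(range(len(ranks)), 2):
        # A pair is concordant when both orderings place it the same way round.
        if ranks[i] < ranks[j]:
            concordant += 1
        else:
            discordant += 1
    n_pairs = len(ranks) * (len(ranks) - 1) // 2
    return (concordant - discordant) / n_pairs
```

A value of 1 means the predicted order matches the reference exactly, and -1 means it is fully reversed.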

[NLP-93] Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents ACL

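Quick read: This paper addresses the insufficient evaluation of semantic reasoning in audio language models (ALMs), in particular their deeper semantic and paralinguistic reasoning beyond transcription, text-to-audio retrieval, captioning, and question-answering accuracy. Existing evaluations do not systematically examine the effects of accent variation, domain shift, and semantic over-inference. The study therefore evaluates models on five reasoning tasks: entailment, consistency, plausibility, accent drift, and accent restraint. Together these tasks measure a model's ability to reason with spoken audio as the primary evidence source: whether a textual hypothesis can be inferred from, contradicted by, or left undetermined by the audio; whether statements align or conflict with the spoken content; whether claims are plausible given the discourse; and whether predictions remain stable or appropriately constrained across accents. The key contribution is a more comprehensive and challenging evaluation that exposes limitations of current audio reasoning across accents and domains and offers guidance for designing more robust and equitable ALMs.

Link: https://arxiv.org/abs/2606.11219
Authors: Chibuzor Okocha,Christan Grant
Affiliations: University of Florida
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: Accepted to ACL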

Abstract:Audio language models (ALMs) are increasingly used for speech-based understanding, yet their ability to perform semantic reasoning beyond transcription, Text-to-Audio Retrieval, Captioning, and Question-Answering accuracy remains insufficiently benchmarked. In particular, the effects of accent variation, domain shift, and semantic over-inference on audio reasoning are poorly understood. We evaluate audio language models across five semantic and paralinguistic reasoning tasks: entailment, consistency, plausibility, accent drift, and accent restraint. Collectively, these tasks assess a model’s ability to reason over spoken audio as the primary evidence source, including whether a textual hypothesis can be inferred, contradicted, or left undetermined by the audio, whether statements align or conflict with spoken content, whether claims are plausible given the discourse, and whether model predictions remain stable or appropriately constrained across accent variation. These findings highlight critical limitations in current audio reasoning evaluations, and we hope they provide guidance for more robust and equitable ALM design and assessment.

[NLP-94] Beyond Compaction: Structured Context Eviction for Long-Horizon Agents

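Quick read: This paper addresses the context-window capacity limit that long-horizon LLM agents face in long sessions: as history accumulates, it exceeds the model's maximum input length, causing information loss or degraded performance. The core challenge is to manage a growing context while preserving key reasoning state and task continuity. The key to the solution is Context Window Lifecycle (CWL), a semantically aware context-management scheme: the agent annotates its trajectory as typed, dependency-linked episodes, forming a structured episode graph, and a deterministic, LLM-free policy evicts content over that graph. When the token budget is exceeded, the system first sheds action episodes whose effects are already persisted in the environment, and preserves user turns and the exploratory context the agent is actively reasoning over, keeping active context near a stable ceiling. Compared with summarization-based compaction, CWL avoids unpredictable lossiness, destruction of causal structure, blocking model cost, and compression-induced hallucination. Compared with recency truncation, CWL is semantically aware: it uses the dependency graph to evict the oldest and most recoverable content rather than simply the earliest content. In an evaluation spanning 80 million tokens, a single agent session completed 89 sequential tasks with no measurable drop in task accuracy relative to per-task isolated sessions.

Link: https://arxiv.org/abs/2606.11213
Authors: Andrew Semenov,Svyatoslav Dorofeev
Affiliations: Kiz8
Categories: Computation and Language (cs.CL)
Comments: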

Abstract:We present Context Window Lifecycle (CWL), a context-management scheme that gives long-horizon LLM agents an effectively unbounded working horizon. As a session accumulates history, CWL keeps the context within budget through graduated, semantically-aware eviction: the agent annotates its trajectory as typed, dependency-linked episodes as work proceeds, and a deterministic, LLM-free policy evicts content in priority order within that structure when a token budget is exceeded. CWL preserves user turns and the exploratory context the agent is actively reasoning over, while aggressively shedding action episodes whose effects are already persisted in the environment, keeping active context near a stable ceiling that also avoids the performance degradation associated with very large prompts. Compared to summarization-based compaction, CWL avoids four well-known limitations: unpredictable lossiness, destruction of causal structure, blocking model cost, and compression-induced hallucination. Compared to recency truncation, CWL is semantically aware: it drops the oldest-and-most-recoverable content according to the dependency graph rather than oldest-in-time regardless of relevance. We describe the annotation protocol, the episode graph, the eviction policy, and the token-accounting loop, and evaluate CWL on long-horizon agentic benchmarks: a single agent session completing 89 sequential tasks across 80 million tokens with no measurable degradation in task accuracy relative to per-task isolated sessions.
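
The eviction idea can be sketched in a few lines. This is a heavily simplified illustration, not the paper's policy: the `Episode` class, its `kind` labels, and the `persisted` flag are hypothetical, and the dependency graph and graduated priority order described in the abstract are omitted.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    kind: str                # "user", "exploration", or "action" (hypothetical labels)
    tokens: int
    persisted: bool = False  # True if the episode's effects already live in the environment


def evict(episodes, budget):
    """Drop the oldest persisted action episodes until the context fits the budget.

    User turns and exploratory episodes are never evicted in this sketch.
    """
    total = sum(ep.tokens for ep in episodes)
    evicted = set()
    for index, ep in enumerate(episodes):  # episodes are assumed ordered oldest first
        if total <= budget:
            break
        if ep.kind == "action" and ep.persisted:
            evicted.add(index)
            total -= ep.tokens
    return [ep for index, ep in enumerate(episodes) if index not in evicted]
```

The policy is deterministic and needs no model call, which is the property the abstract contrasts with summarization-based compaction. If no evictable episode remains, this sketch simply returns a context that is still over budget.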

[NLP-95] EverydayGPT: Confidence-Gated Routing for Efficient and Safe Hybrid GPT-RAG Conversational QA

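Quick read: This paper addresses an inefficiency common to standard Retrieval-Augmented Generation (RAG) pipelines: every query passes through retrieval and generation unconditionally, which incurs unnecessary computation and can pass low-quality context to the generator. The paper proposes EverydayGPT, whose core innovation is a Confidence-Gated Routing (CGR) mechanism that formalises the routing decision as a joint policy over retrieval distance and extraction adequacy. CGR avoids the costly GPT generation pathway (about 5.9 s) for about 85% of queries by answering them through fast RAG extraction (about 45 ms), giving a 6.3x reduction in mean latency. The backbone is a 205M-parameter GPT trained from scratch on 10B tokens of FineWeb-Edu. On a 500-question in-domain benchmark the system reaches F1 = 0.226 ± 0.004, compared with 0.171 for GPT-only and 0.210 for unconditional RAG. Gains over strong baselines are modest but consistent, while the efficiency improvements are substantial. A structured grounding audit finds no unsupported claims in the sampled set, with explicit scope limitations. The author positions the work as a study of routing strategies under resource constraints rather than a claim of state-of-the-art performance.

Link: https://arxiv.org/abs/2606.11212
Authors: Jaspreet Singh Nahal
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 10 figures, 6 tables. Code and evaluation scripts available at: this https URL . This paper studies routing strategies for hybrid GPT-RAG systems under resource constraints, focusing on efficiency-safety tradeoffs rather than state-of-the-art accuracy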

Abstract:Standard Retrieval-Augmented Generation (RAG) pipelines route every query through retrieval and generation unconditionally, incurring unnecessary computation and propagating low-quality context to the generator. We introduce EverydayGPT, a lightweight conversational QA system built around a Confidence-Gated Routing (CGR) mechanism that formalises the routing decision as a joint policy over retrieval distance and extraction adequacy. The backbone is a 205M-parameter GPT trained from scratch on 10B tokens of FineWeb-Edu. CGR avoids invoking the costly GPT pathway (~5.9s) for 85 percent of queries by resolving them via fast RAG extraction (~45 ms), yielding over 120x latency reduction on the majority of queries while maintaining answer quality. On a 500-question in-domain benchmark, the system achieves F1 = 0.226 +/- 0.004 compared to 0.171 for GPT-only and 0.210 for unconditional RAG. Gains over strong baselines are modest but consistent, while efficiency improvements are substantial (6.3x mean latency reduction). A structured grounding audit finds no unsupported claims in the sampled set, with explicit scope limitations. We position this work as a study of routing strategies under resource constraints rather than a claim of state-of-the-art performance.
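
A gate over retrieval distance and extraction adequacy could be as simple as the sketch below. The thresholds, score ranges, and route names are assumptions made for illustration; the abstract does not give the actual policy or its parameters.

```python
def route(retrieval_distance: float, extraction_adequacy: float,
          max_distance: float = 0.35, min_adequacy: float = 0.6) -> str:
    """Choose the cheap extraction path only when both signals look confident.

    retrieval_distance: distance of the best retrieved passage (lower is better).
    extraction_adequacy: score in [0, 1] for how well the extracted span answers
    the query. Both thresholds are hypothetical values.
    """
    if retrieval_distance <= max_distance and extraction_adequacy >= min_adequacy:
        return "rag_extract"   # fast path: answer directly from the retrieved text
    return "gpt_generate"      # slow path: fall back to the generator
```

Requiring both conditions means a close retrieval hit with a poor extracted span still falls back to generation, which is one plausible reading of a "joint policy".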

[NLP-96] Calibration Drift Under Reasoning: How Chain-of-Thought Budgets Induce Overconfidence in Large Language Models

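Quick read: This paper addresses calibration drift in large language models (LLMs) during chain-of-thought (CoT) reasoning: models can become systematically overconfident as reasoning depth grows. CoT is widely used to improve accuracy and reliability, but its effect on calibration is not fully understood. The study finds that in some settings, once the reasoning budget exceeds a task-specific threshold, models produce internally consistent but incorrect explanations and assign high confidence to wrong answers, a phenomenon the authors call Calibration Drift Under Reasoning (CDUR). To explain it, they propose a Hypothesis Lock-In model based on autoregressive generation, which accounts for the non-monotonic behaviour of Expected Calibration Error (ECE) as a function of reasoning budget: ECE first falls as reasoning corrects errors, then rises as incorrect reasoning becomes self-consistent. They further introduce CABStop, a calibration-aware stopping rule that halts reasoning when confidence diverges from an auxiliary accuracy estimate. Experiments run Llama-3.1-8B and Llama-3.3-70B on 47 reasoning-trap questions across four reasoning budgets and three seeds. The 8B model shows non-monotonic calibration behaviour, while results for the 70B model are limited to baseline evaluation and are inconclusive for budget-dependent effects. The results suggest that increasing reasoning depth does not always improve reliability and should be monitored carefully.

Link: https://arxiv.org/abs/2606.11211
Authors: Prakul Sunil Hiremath,Harshit R. Hiremath
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 31 pages, 4 figures, 3 tables. Introduces Calibration Drift Under Reasoning (CDUR) with theoretical analysis and preliminary experiments; includes CABStop; code and data available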

Abstract:The ability of large language models (LLMs) to express calibrated uncertainty is important for safe deployment. Chain-of-thought (CoT) reasoning is widely used to improve accuracy and reliability, but its effect on calibration is not fully understood. We show that this picture is incomplete: in some settings, increasing the reasoning budget beyond a task-specific threshold can cause models to become systematically overconfident, assigning high confidence to incorrect answers. We call this phenomenon Calibration Drift Under Reasoning (CDUR) and study it both theoretically and empirically. We define reasoning budget B and analyze conditions under which Expected Calibration Error ECE(B) follows a non-monotonic pattern: it first decreases as reasoning corrects errors, then increases as longer reasoning produces internally consistent but incorrect explanations. We propose a Hypothesis Lock-In model based on autoregressive generation to explain this behavior. We evaluate Llama-3.1-8B and Llama-3.3-70B on 47 reasoning-trap questions across four reasoning budgets and three seeds (1,368 API calls; 574 valid responses). The 8B model shows non-monotonic calibration behavior, while results for the 70B model are limited to baseline evaluation and are inconclusive for budget-dependent effects. We introduce CABStop, a calibration-aware stopping rule that halts reasoning when confidence diverges from an auxiliary accuracy estimate. These results suggest that increasing reasoning depth does not always improve reliability and should be monitored carefully.
MSC classes: 68T50, 68T07. ACM classes: I.2.7; I.2.6; I.2.1. Submitted (v1): Fri, 24 Apr 2026 04:46:16 UTC.
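
For readers unfamiliar with the metric, Expected Calibration Error is commonly computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence in each bin. A minimal sketch of that standard definition follows; the choice of ten equal-width bins is an assumption, and the paper's own binning is not stated in the abstract.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, is_correct))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece
```

Evaluating this quantity separately at each reasoning budget B gives the curve ECE(B) whose shape the abstract describes as first falling and then rising.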

[NLP-97] T2MM: An LLM Supported Architecture For Inquiry-Based Modeling

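Quick read: This paper addresses the lack of visual interactivity in generative AI tools that support model construction in science learning. Large Language Models (LLMs), increasingly augmented with multimodal capabilities, have been applied in education, but existing tools mostly produce static images and cannot meet the need for dynamic, interactive models in open inquiry-based learning environments. The paper proposes Text to Multimodal Model (T2MM), a context-aware, dynamic LLM-supported architecture integrated into Virtual Experimental Research Assistant (VERA), an open inquiry, ecology-based modeling software. Its core contribution is that it takes the current state of the learner's model into account and generates interactive models rather than static output, so the model remains responsive to manual adjustment. Evaluated on a custom procedurally generated dataset, T2MM outperforms a baseline architecture based on LLM-supported full code generation on all measured success metrics. The work both integrates LLMs into an inquiry-based modeling tool and describes a possible architecture for building more interactive multimodal LLM tools for education.

Link: https://arxiv.org/abs/2606.11210
Authors: John Kos,Rudra Singh,Ashok Goel
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: 16 pages, 4 figures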

Abstract:Model Construction is a foundational practice in science learning that relies on visualization and interactivity. Large Language Models, increasingly augmented with multimodal capabilities, have been integrated in education contexts to support learning. However, these tools lack visual interactivity that is required by some learning contexts. We introduce Text to Multimodal Model (T2MM), a robust, dynamic LLM supported architecture that assists in model construction within the open inquiry ecology-based modeling software Virtual Experimental Research Assistant (VERA). T2MM accounts for the current context of the learner’s model and creates interactive models, rather than static images, enabling the model to remain responsive to manual adjustment. To measure technical feasibility, we evaluate T2MM through a custom procedurally generated dataset of natural language learner modeling requests and target models within the VERA system. T2MM outperforms a baseline model generation architecture implemented through LLM-supported full code generation, common in the literature, across all measured success metrics. Our contribution not only outlines LLM integration into an inquiry-based learning modeling tool, but also describes a possible architecture through which more interactive multimodal LLM tools can be created.

[NLP-98] ProcessThinker: Enhancing Multi-modal Large Language Models Reasoning via Rollout-based Process Reward ICLR2026

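Quick read: This paper addresses the lack of fine-grained supervision for multi-step reasoning in visual question answering (VQA). Existing post-training with reinforcement learning under verifiable rewards (RLVR) and Group Relative Policy Optimization (GRPO) can improve multimodal reasoning, but mostly relies on sparse outcome-only rewards, which cannot tell whether a wrong answer comes from a small mistake late in the reasoning or from a trajectory that was off track from the start. The usual remedy, training a process reward model (PRM) for step-level supervision, requires large-scale, high-quality chain-of-thought (CoT) annotations and extra training cost. The paper proposes ProcessThinker, a practical post-training pipeline that needs no explicitly trained PRM: it rewrites reasoning traces into a step-tagged format for cold-start supervised fine-tuning, then applies GRPO with a standard format reward and a rollout-based step-level reward. For each intermediate step, it samples multiple continuations and uses the success rate under final-answer verification as that step's reward, which yields dense credit assignment. This encourages reasoning steps that more reliably support a correct conclusion and helps reduce inconsistent or self-contradictory progress across steps. On four challenging video benchmarks (Video-MMMU, MMVU, VideoMathQA, and LongVideoBench), ProcessThinker consistently improves over the baseline model Qwen3-VL-8B-Instruct.

Link: https://arxiv.org/abs/2606.11209
Authors: Jingpei Wu,Xiao Han,Weixiang Shen,Boer Zhang,Zifeng Ding,Volker Tresp
Affiliations: LMU Munich; Harvard University; University of Cambridge; Mina AI; Konrad Zuse School of Excellence in Reliable AI (relAI)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at ICLR 2026 Workshop on Logical Reasoning of Large Language Models. 7 pages, 1 figure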

Abstract:Visual question answering increasingly requires multi-step reasoning. Recent post-training with reinforcement learning under verifiable rewards (RLVR) and Group Relative Policy Optimization (GRPO) can improve multimodal reasoning, but most approaches rely on sparse outcome-only rewards. As a result, they struggle to tell whether an incorrect answer comes from a small mistake late in the reasoning or from an unhelpful trajectory from the start. A common solution is to train a process reward model (PRM) for step-level supervision, but this typically requires large-scale high-quality chain-of-thought annotations and additional training cost. We propose ProcessThinker, a practical post-training pipeline that provides step-level process rewards without training an explicit PRM. ProcessThinker first rewrites reasoning traces into a step-tagged format for cold-start supervised fine-tuning, then applies GRPO with a standard format reward and our rollout-based process reward. Concretely, for each intermediate step, we sample multiple continuations from that step and use the empirical success rate (final-answer verification) as the step reward. This gives dense credit assignment and encourages reasoning steps that more reliably support a correct conclusion, helping reduce inconsistent or self-contradictory progress across steps, a key issue in logical reasoning. Across four challenging video benchmarks (Video-MMMU, MMVU, VideoMathQA, and LongVideoBench), ProcessThinker consistently improves over the baseline model Qwen3-VL-8B-Instruct.
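
The rollout-based step reward described above amounts to a Monte Carlo estimate of how often a partial reasoning trace leads to a verified answer. A minimal sketch, where `sample_continuation` and `verify` are hypothetical callables standing in for the policy model and the answer checker, and the default of 8 rollouts is an assumed value:

```python
def step_reward(prefix_steps, sample_continuation, verify, n_rollouts=8):
    """Empirical success rate of continuations sampled from a partial trace.

    prefix_steps: the reasoning steps produced so far.
    sample_continuation: callable that completes the trace and returns a final answer.
    verify: callable that returns True when the final answer is correct.
    """
    successes = 0
    for _ in range(n_rollouts):
        final_answer = sample_continuation(prefix_steps)
        if verify(final_answer):
            successes += 1
    return successes / n_rollouts
```

Calling this once per intermediate step gives each step its own reward, which is what makes the credit assignment dense; the cost is `n_rollouts` extra generations per step.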

[NLP-99] BioDivergence: A Benchmark and Evaluation Framework for Hidden Contextual Contradictions in Biomedical Abstracts

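Quick read: This paper addresses biomedical findings that appear to conflict across studies but, because of differences in context, are not true contradictions. Existing natural language inference (NLI) and scientific claim-verification benchmarks reduce such cases to entailment, contradiction, or neutral, and fail to capture the contextual structure behind the divergence. The paper proposes BioDivergence, an evaluation framework built around a six-class conflict taxonomy, a 13-axis divergence ontology, and four structured outputs per claim pair: conflict type, divergence axes, dominant confounder, and reconciliation explanation. The authors release BioDivergence-Silver-v1.0, an article-disjoint silver benchmark of 11,865 claim pairs across five biomedical domains, together with a legacy deduplicated variant for comparison. Model rankings differ notably between the two variants: the fine-tuned reference model drops about 12 points under the article-disjoint setting, while Mistral-7B-Instruct-v0.3 reaches 0.5523 accuracy and 0.3894 contextual-F1 on the 842-example primary test set. The framework offers a more faithful way to distinguish contextual divergence from direct contradiction and to separate article-level memorization from genuine task learning.

Link: https://arxiv.org/abs/2606.11208
Authors: Elias Hossain,Sanjeda Sara Jennifer,Sabera Akter Bushra,Niloofar Yousefi
Affiliations: University of Central Florida
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: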

Abstract:Biomedical findings often seem to conflict across studies, but many of these differences are context-dependent rather than true contradictions. Variations in cohort, geography, assay protocol, disease subtype, and clinical setting can make both claims locally valid. Existing NLI and scientific claim-verification benchmarks reduce such cases to entailment, contradiction, or neutral, failing to capture the contextual structure behind divergence. To address this, we introduce BioDivergence, an evaluation framework with a six-class conflict taxonomy, a 13-axis divergence ontology, and four structured outputs per claim pair: conflict type, divergence axes, dominant confounder, and reconciliation explanation. We release BioDivergence-Silver-v1.0, an article-disjoint silver benchmark of 11,865 claim pairs across five biomedical domains, alongside a legacy deduplicated variant for comparison. Results show notable ranking differences between the two variants, with the fine-tuned reference model dropping about 12 points under the article-disjoint setting, while Mistral-7B-Instruct-v0.3 achieves 0.5523 accuracy and 0.3894 contextual-F1 on the 842-example primary test set. BioDivergence offers a more faithful way to distinguish contextual divergence from direct contradiction and to separate article-level memorization from genuine task learning.

[NLP-100] From Explicit Elements to Implicit Intent: A Predefined Library for Auditable Behavioral Inference

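Quick read: This paper addresses the extraction of structured semantic signals from e-commerce session data for multiple inference targets. The core problem is that conventional end-to-end predictors optimise for accuracy at the expense of auditability, structural governance, and reproducibility. The proposed solution is SemantiClean, a modular framework that organises 24 behavioural elements into a four-layer architecture (Functional, Interaction, Systemic, Contextual) and uses a shared element library so that inference targets such as purchase intent, customer segmentation, and product affinity are pluggable. Signal quality is enforced through three anti-inflation mechanisms: RedundancyGroup contribution caps, TieredPenaltyCalculator bias penalties, and AdaptiveConstraintMode for cold-start cases. The framework also includes the LLM-Integrated Semantic Inference Engine, a two-phase LLM-driven inference architecture that uses complete element metadata at inference time. Deterministic engine outputs are fully reproducible (sigma=0), while LLM-dependent results (E8, E10) show controlled output variability under fixed provider, model, and temperature settings. The framework explicitly trades marginal predictive gains for element-level transparency and defensible decision trails.

Link: https://arxiv.org/abs/2606.11207
Authors: Liu hung ming
Affiliations: PARRAWA AI
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 20 pages, 9 tables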

Abstract:We present SemantiClean, a modular framework for extracting structured semantic signals from e-commerce session data and driving pluggable inference targets including purchase intent, customer segmentation, and product affinity through a shared element library. Unlike conventional end-to-end predictors that optimise solely for accuracy, SemantiClean prioritises auditability, structural governance, and sigma=0 reproducibility, explicitly trading marginal predictive gains for element-level transparency and defensible decision trails. Built upon the Online Shoppers Purchasing Intention (OSPI) dataset, the framework organises twenty-four behavioural elements into a four-layer architecture (Functional, Interaction, Systemic, Contextual) and enforces signal quality through three anti-inflation mechanisms: RedundancyGroup contribution caps, TieredPenaltyCalculator bias penalties, and AdaptiveConstraintMode cold-start handling. This report introduces the LLM-Integrated Semantic Inference Engine, a fully implemented two-phase LLM-driven inference architecture that leverages complete element metadata at inference time. All quantitative results reported herein are produced by this engine. Deterministic engine outputs remain fully reproducible (sigma=0); LLM-dependent results (E8, E10) are subject to controlled output variability under fixed provider/model/temperature settings. The gender inference target remains non-functional in the current implementation and is excluded from all quantitative results.

[NLP-101] Compatibility-Aware Dynamic Fine-Tuning for Large Language Models ACL2026

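Quick read: This paper addresses the optimization instability and limited generalization of Supervised Fine-Tuning (SFT) when aligning large language models (LLMs). In heterogeneous large-scale instruction data, demonstration-policy mismatch induces high-variance updates at the sample level. Dynamic Fine-Tuning (DFT) corrects pathological gradient scaling at the token level, but it assumes that all demonstrations are equally suitable learning targets, which real data violates. The paper proposes Compatibility-Aware Dynamic Fine-Tuning (CADFT), whose key idea is a dynamic, policy-dependent compatibility signal derived from model likelihoods that modulates supervised updates and suppresses high-variance gradients from incompatible demonstrations. It adds a delayed, low-frequency, compatibility-guided rewriting strategy that turns persistently incompatible demonstrations into learnable targets. The authors show that CADFT can be interpreted as a variance-controlled estimator that generalizes DFT's token-level stabilization to the sample level. Experiments show improved stability, generalization, and cold-start reinforcement learning initialization, while the method remains fully supervised and requires no explicit reward modeling.

Link: https://arxiv.org/abs/2606.11206
Authors: Yucheng Zhou,Junwei Sheng,Qianning Wang,Jianbing Shen
Affiliations: University of Macau; Auckland University of Technology
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: ACL 2026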

Abstract:Supervised Fine-Tuning (SFT) is the predominant paradigm for aligning large language models (LLMs), yet it suffers from optimization instability and limited generalization. Recent work attributes this issue to pathological gradient scaling and proposes Dynamic Fine-Tuning (DFT) to correct it at the token level. However, DFT assumes all demonstrations are equally suitable learning targets, an assumption violated by the strong heterogeneity of large-scale instruction data, where demonstration-policy mismatch induces high-variance updates at the sample level. We introduce Compatibility-Aware Dynamic Fine-Tuning (CADFT), a principled extension of DFT that controls sample-level optimization variance. CADFT derives a dynamic, policy-dependent compatibility signal from model likelihoods to modulate supervised updates, suppressing high-variance gradients from incompatible demonstrations. We further propose a delayed, low-frequency compatibility-guided rewriting strategy to transform persistently incompatible demonstrations into learnable targets. We show that CADFT can be interpreted as a variance-controlled estimator that generalizes token-level stabilization in DFT to the sample level. Extensive experiments demonstrate improved stability, generalization, and cold-start reinforcement learning initialization, while remaining fully supervised and independent of explicit reward modeling.

[NLP-102] Dual-Stance Evaluation of Sycophancy: The Structure of Agreement and the Limits of Intervention

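Quick read: This paper addresses a potential conflict in activation steering between reducing sycophancy and preserving agreement with factually correct statements. Standard evaluations do not typically test whether a sycophancy-reduction direction also suppresses agreement with correct statements. The author proposes dual-stance evaluation, which tests both stances on each topic, and applies it to centroid-difference steering on Llama-3-8B-Instruct. The study finds a dissociation: sycophantic agreement and factual agreement occupy geometrically distinct subspaces in activation space, yet the steering direction projects equally onto both and cannot target either one differentially. As a result, the direction reduces agreement with factually correct statements (for example, that the Earth is round) as well as with sycophantic ones. All other static properties of the two activation groups are matched, which suggests that the behavioural dissociation arises from generation dynamics or from finer-grained structure that residual-stream analysis cannot resolve. The pattern illustrates a general gap: representations that are readable from activations may not be writable through them.

Link: https://arxiv.org/abs/2606.11205
Authors: Matthew James Buchan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 18 pages, 9 figures, accepted to TAIS 2026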

Abstract:Activation steering can shift LLM behaviour, but standard evaluations do not typically test whether a sycophancy-reduction direction also suppresses agreement with factually correct statements. We introduce dual-stance evaluation, which tests both stances of each topic, and apply it to centroid-difference steering on Llama-3-8B-Instruct. We find a dissociation: the model represents sycophantic and factual agreement in geometrically distinct subspaces, yet the steering direction projects equally onto both and cannot differentially target either. The direction accordingly reduces agreement with factually correct statements (e.g. that the Earth is round) as well as sycophantic ones. All other static properties of the two activation groups are matched, suggesting the behavioural dissociation arises from generation dynamics or from finer-grained structure that residual-stream analysis cannot resolve. The pattern illustrates a general gap: representations that are readable from activations may not be writable through them.
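
Centroid-difference steering is usually understood as taking the difference between the mean activations of two groups of examples and using it as a direction. The sketch below shows that construction and the projection test the abstract alludes to, on plain Python lists; the function names are illustrative, and the layer, token position, and scaling the paper uses are not given in the abstract.

```python
def centroid(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    count = len(vectors)
    return [sum(vec[d] for vec in vectors) / count for d in range(len(vectors[0]))]


def steering_direction(positive, negative):
    """Centroid difference: mean(positive) minus mean(negative)."""
    return [p - n for p, n in zip(centroid(positive), centroid(negative))]


def projection(vector, direction):
    """Scalar projection of an activation onto the unit-normalised direction."""
    norm = sum(d * d for d in direction) ** 0.5
    return sum(v * d for v, d in zip(vector, direction)) / norm
```

Comparing `projection` values for sycophantic-agreement and factual-agreement activations is one way to check whether a direction separates the two; the abstract reports that its direction projects equally onto both.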

[NLP-103] LatticeBridge: Rare-Event Sequential Inference for Faithful Structured Sequence Synthesis

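Quick read: This paper addresses structured sequence generation in which a single output must satisfy several input-derived constraints, a setting where standard decoding favours fluent continuations and often fails to realize all required anchors jointly. The authors treat multi-constraint satisfaction as a rare-event sequential inference problem. The key to the solution is LatticeBridge, which combines a compact prefix language model, instance-compiled surface automata, and a twisted sequential Monte Carlo (SMC) decoder with resampling, multilevel splitting, and a source-support proposal term derived from instance-provided phrases. The constraint representation is compiled from each input instance and does not rely on manually curated lexical classes. On 2,610 attainable validation tasks spanning CommonGen, E2E NLG, and WikiBio, the particle decoder improves exact anchor satisfaction and mean anchor coverage over greedy, beam-filtered, and best-of-k ancestral baselines under a shared proposal model. Because exact anchor satisfaction alone does not rule out unsupported attribute substitutions, the evaluation also reports required-anchor coverage, source coverage, source-intrusion diagnostics, overlap, runtime, and particle statistics, which together characterize the faithfulness-overlap-latency frontier under a fixed proposal model.

Link: https://arxiv.org/abs/2606.11203
Authors: Faruk Alpay,Bugra Kilictas
Affiliations: Bahcesehir University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 19 pages. Code and benchmark files available at this https URL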

Abstract:Structured sequence generation often requires a model to satisfy several input-derived constraints in a single output. Standard decoding methods may assign high probability to fluent continuations while placing low mass on continuations that realize all required anchors jointly. We study this regime as a rare-event sequential inference problem. LatticeBridge combines a compact prefix language model, instance-compiled surface automata, and a twisted sequential Monte Carlo (SMC) decoder with resampling, multilevel splitting, and a source-support proposal term derived from instance-provided phrases. The constraint representation is compiled from each input instance and does not rely on manually curated lexical classes. On 2,610 attainable validation tasks spanning CommonGen, E2E NLG, and WikiBio, the particle decoder improves exact anchor satisfaction and mean anchor coverage over greedy, beam-filtered, and best-of-k ancestral baselines under a shared proposal model. Since exact anchor satisfaction alone does not rule out unsupported attribute substitutions, the evaluation reports required-anchor coverage, source coverage, source-intrusion diagnostics, overlap, runtime, and particle statistics jointly. The benchmark characterizes the faithfulness-overlap-latency frontier under a fixed proposal model.

[NLP-104] One Jailbreak Many Tongues: Learning Language-Insensitive Intention Representations for Multilingual Jailbreak Detection

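Quick read: This paper targets the jailbreak-defense gap that opens when safety training of large language models (LLMs) lags behind their multilingual capability. Existing jailbreak defenses are mostly developed and evaluated in dominant languages, and they are limited by scarce aligned multilingual supervision and by the representation dispersion that language variation causes. The authors propose MLJailDe, a multilingual jailbreak detection framework. First, a multilingual back-translation data augmentation algorithm builds a semantically consistent and functionally effective dataset covering 11 languages, with 2,232 benign and 1,239 jailbreak samples. Second, relative-distance constraints reduce cross-lingual representation dispersion so that jailbreak prompts with similar intent form consistent clusters across languages. Third, an imbalance-aware classification objective alleviates class imbalance and yields more reliable multilingual decision boundaries. MLJailDe outperforms state-of-the-art baselines across multiple languages with an F1 score of 98.5%, and reaches an average F1 of 97.1% on unseen languages.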

Link: https://arxiv.org/abs/2606.11202
Authors: Shuyu Jiang, Kaiyu Xu, Xingshu Chen, Hao Ren, Rui Tang, Yi Zhang, Tianwei Zhang, Hongwei Li
Affiliations: Sichuan University; Nanyang Technological University; University of Electronic Science and Technology of China
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) are increasingly deployed in applications for global multilingual users, yet safety training remains concentrated in dominant languages and has not progressed in parallel with multilingual capability, creating exploitable gaps for jailbreak attacks. Current jailbreak defenses are largely developed and evaluated in dominant languages, and their effectiveness is limited by the scarcity of aligned multilingual supervision and representations dispersion caused by language variation. To address this issue, we propose MLJailDe, a multilingual jailbreak detection framework designed to improve both multilingual robustness and cross-lingual generalization. MLJailDe first introduces a multilingual back-translation data augmentation algorithm to construct a semantically consistent and functionally effective dataset spanning 11 languages, consisting of 2,232 benign and 1,239 jailbreak samples. On this basis, MLJailDe employs relative-distance constraints to reduce cross-lingual representation dispersion and encourage jailbreak prompts with similar intent to form consistent clusters across languages, while an imbalance-aware classification objective is further used to alleviate class imbalance and learn more reliable multilingual decision boundaries. Experimental results show that MLJailDe outperforms state-of-the-art baselines across multiple languages, achieving an F1 score of 98.5%, and obtains an average F1 score of 97.1% on unseen languages, demonstrating strong effectiveness and cross-lingual generalization.

[NLP-105] To Intervene or Not: Guiding Inference-time Alignment with Probabilistic Model Blending ACL2026

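Quick read: This paper addresses performance loss in inference-time alignment of large language models (LLMs) caused by unreliable guidance. Existing methods apply guidance extracted from certain aligned models without assessing its reliability; ineffective guidance causes further confusion and therefore further interventions, so excessive intervention typically signals poor performance. The proposed framework, BlendIn, replaces binary intervene-or-not decisions with hybrid distributions that integrate both models' knowledge. It performs quality-aware alignment and weights each model's contribution in proportion to its reliability, which preserves beneficial guidance while downweighting unreliable suggestions. BlendIn also provides diagnostic signals and mitigation strategies for misaligned guidance, and achieves consistent improvements of up to 50% on challenging model pairs.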

Link: https://arxiv.org/abs/2606.11201
Authors: Jin Gan, Xin Li, Jun Luo
Affiliations: Nanyang Technological University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by ACL 2026

Abstract:The wide deployment of LLMs has made model alignment necessary to make newly trained models safely and effectively respond to user instructions. Among different methods, inference-time alignment is often cheaper as it intervenes (i.e., offers guidances) only during output generation. Existing proposals apply guidances extracted from certain aligned models without properly assessing their reliability. Nonetheless, our systematic evaluation reveals that guidance effectiveness varies drastically across models; since ineffective guidances lead to further confusion and thus further interventions, the resulting excessive interventions typically indicate poor performance. To make interventions more effective and thus more efficient, we introduce BlendIn, an inference-time alignment framework that shifts from binary decisions to creating hybrid distributions integrating both models’ knowledge. BlendIn stabilizes inference-time alignment by performing quality-aware alignment and proportionally weighting each model’s contribution based on reliability. Compared with existing works, it preserves beneficial guidance while downweighting unreliable suggestions. BlendIn provides both diagnostic signals and mitigation strategies for misaligned guidance, achieving consistent and up to 50% performance improvement on challenging model pairs. Our code is available at: this https URL.
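The shift from a binary intervene-or-not decision to a hybrid distribution can be pictured as mixing two next-token distributions. The sketch below is only an illustration under stated assumptions: it uses a plain linear mixture with a given reliability weight, whereas the abstract does not say how BlendIn estimates reliability or combines the distributions. The name `blend` and its parameters are hypothetical.

```python
from typing import Dict


def blend(p_base: Dict[str, float], p_guide: Dict[str, float],
          reliability: float) -> Dict[str, float]:
    """Mix two next-token distributions.

    reliability is the weight on the guiding model, in [0, 1].
    Assumption: a linear mixture, chosen for illustration only.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    vocab = set(p_base) | set(p_guide)
    mixed = {
        tok: (1.0 - reliability) * p_base.get(tok, 0.0)
        + reliability * p_guide.get(tok, 0.0)
        for tok in vocab
    }
    total = sum(mixed.values())  # renormalize in case inputs are truncated
    return {tok: p / total for tok, p in mixed.items()}
```

With `reliability=0.0` the base model decodes unchanged, and with `reliability=1.0` the guide takes over completely; the binary scheme corresponds to using only those two endpoints.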

[NLP-106] Detecting AI-Generated Content on Social Media with Multi-modal Language Models

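Quick read: This paper addresses AI-generated images and videos spreading on social media, where they are often used for spam, misinformation, manipulation, and fraud. Existing AI-generated content (AIGC) detectors generalize poorly to new generation models, rely on single modalities, and lack interpretable explanations. The authors describe a pipeline that continuously curates diverse multi-modal social media data and trains a compact vision-language model for both detection and explanation. The model achieves state-of-the-art detection performance on public benchmarks and shows robust detection and explanation on internal social media datasets across multiple platforms. After deployment for post recommendation on social media platforms, the authors observed positive downstream impacts on user engagement, which suggests that effective AIGC detection is feasible in dynamic, real-world settings.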

Link: https://arxiv.org/abs/2606.11200
Authors: Chenyang Yang, Shen Yan, Yibo Yang, Litao Hu, Yuchen Liu, Yuan Zeng, Hanchao Yu, Yinan Zhu, Sumedha Singla, Brian Vanover, Huijun Qian, Zihao Wang, Fujun Liu, Aashu Singh, Jianyu Wang, Xuewen Zhang
Affiliations: Meta
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Generative AI has enabled the creation of photorealistic images and videos that are increasingly disseminated on social media, often used for spam, misinformation, manipulation, and fraud. Existing AI-generated content (AIGC) detection methods face challenges including poor generalization to new generation models, reliance on single modalities, and lack of interpretable explanations. We present our pipeline that mitigates these issues by continuously curating diverse multi-modal social media data and training a compact vision-language model for detection and explanation. Our model achieves state-of-the-art detection performance on public benchmarks and demonstrates robust detection and explanation capabilities on internal social media datasets across multiple platforms. We deployed our model for post recommendation on social media platforms and observed positive downstream impacts on user engagement, demonstrating that it is feasible to perform effective AIGC detection in dynamic, real-world social media environments.

[NLP-107] The Structural Attention Tax: How Retrieval Format Hijacks In-Context Learning Independent of Content

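Quick read: This paper studies how the format of injected knowledge in retrieval-augmented generation (RAG) distorts a model's attention independently of semantic relevance. The authors call the effect the structural attention tax: knowledge graph (KG) triples, with their relational delimiters and repeated slot patterns, capture 2-3x more attention per token than semantically equivalent natural-language text ($\hat{o}(\mathrm{KG}) \approx 0.70$ vs. $\hat{o}(\mathrm{neutral}) \approx 0.25$), compressing demonstration attention by up to 42% whether the triples are relevant or noise. A formal framework decomposes attention scores into semantic and structural components (Eq. 2) and derives a compression bound (Proposition 1); the structural term governs how much attention is diverted, while the semantic term governs whether the diversion helps or hurts. This yields two orthogonal axes for improvement: retrieval quality (semantic axis) and format-driven attention capture (structural axis). Across two model families (Mistral-7B, LLaMA-3-8B) and three QA benchmarks, source-task alignment dominates: task-matched BM25 retrieval reaches 58-62% on HotpotQA versus 25-27% for ConceptNet, a 30 pp gap that dwarfs all gating strategies (at most 2 pp). The authors derive five structure-aware mitigation strategies, ranging from zero-cost prompt modifications to training-time regularisation; format flattening (S3) is supported by both accuracy and attention-level evidence from a verbalized-triple control, while structural dispersal (S1) gives mixed results.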

Link: https://arxiv.org/abs/2606.11198
Authors: Yuqi Zhang, Di Zhang
Affiliations: Xi’an Jiaotong-Liverpool University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures

Abstract: Retrieval-augmented generation (RAG) systems inject external knowledge to improve LLM outputs, yet the format of injected content – distinct from its semantic relevance – can independently distort the model’s attention distribution. We identify and formalise a phenomenon we term the structural attention tax: knowledge graph (KG) triples, due to their relational delimiters and repeated slot patterns, capture 2-3x more attention per token than semantically equivalent natural-language text ($\hat{o}(\mathrm{KG}) \approx 0.70$ vs. $\hat{o}(\mathrm{neutral}) \approx 0.25$), compressing demonstration attention by up to 42% – regardless of whether the triples are relevant or noise. We develop a formal framework decomposing attention scores into semantic and structural components (Eq. 2), derive a compression bound (Proposition 1) connecting token-level format bias to demonstration attention loss, and show that the structural term governs how much attention is diverted while the semantic term governs whether this helps or hurts. This decoupling reveals two orthogonal axes for improving retrieval-augmented ICL: optimising retrieval quality (semantic axis) and reducing format-driven attention capture (structural axis). Empirically, across two model families (Mistral-7B, LLaMA-3-8B) and three QA benchmarks, we observe that source-task alignment dominates: task-matched BM25 retrieval achieves 58-62% on HotpotQA vs. ConceptNet’s 25-27%, a 30 pp gap that dwarfs all gating strategies ($\leq 2$ pp). We derive five structure-aware mitigation strategies from the framework, ranging from zero-cost prompt modifications to training-time regularisation; format flattening (S3) is validated by both accuracy and attention-level evidence from a verbalized-triple control, while structural dispersal (S1) yields mixed results that illuminate the challenges of format-level intervention.

[NLP-108] PoQ-Judge: A Multi-Architecture Evaluation Framework for Cost-Aware Proof-of-Quality in Decentralized LLM Inference

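Quick read: This paper addresses the lack of lightweight, reference-free quality evaluation in decentralized large language model (LLM) inference networks, which Proof of Quality (PoQ) requires. The proposed framework, PoQ-Judge, trains dedicated judge models to score query-output pairs without ground-truth references, and compares three architectures across the quality-cost tradeoff: a TextCNN judge, a MiniLM cross-encoder, and a DeBERTa judge. With two-stage training on UltraFeedback plus GPT-labeled in-domain data, the best model reaches a Pearson correlation of 0.747 with the ground-truth proxy on a held-out test set, outperforming reference-based evaluators from prior work. As the reference-free component of a composite score it reaches 0.645, matching the best single reference-based evaluator without needing reference answers. Online calibration identifies semantic quality as the dominant dimension, and cascade evaluation cuts cost by 72.7% with only modest quality loss. Results are much stronger on QA than on summarization, which points to proxy quality as the main remaining limitation.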

Link: https://arxiv.org/abs/2606.11196
Authors: Arther Tian, Alex Ding, Frank Chen, Simon Wu, Aaron Chan
Affiliations: DGrid AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:Decentralized LLM inference networks need lightweight, reference-free quality evaluation for Proof of Quality (PoQ). We present PoQ-Judge, a framework that trains dedicated judge models to score query-output pairs without ground-truth references. We study three architectures across the quality-cost tradeoff: a TextCNN judge, a MiniLM cross-encoder, and a DeBERTa judge. Using two-stage training on UltraFeedback plus GPT-labeled in-domain data, the best model reaches 0.747 Pearson correlation with the ground-truth proxy on a held-out test set, outperforming reference-based evaluators from prior work. As a reference-free component in composite scoring, it achieves 0.645 Pearson correlation, matching the best single reference-based evaluator while removing the need for reference answers. We also show that online calibration identifies semantic quality as the dominant dimension and that cascade evaluation reduces cost by 72.7 percent with only modest quality loss. Results are much stronger on QA than summarization, pointing to proxy quality as the main remaining limitation.
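Cascade evaluation, as mentioned in this abstract, generally means scoring with a cheap judge first and calling an expensive judge only when the cheap one is unsure. The sketch below shows that general pattern, not the paper's procedure: the uncertainty band (`low`, `high`), the `Judge` signature, and the function names are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# A judge maps (query, output) to a quality score in [0, 1].
Judge = Callable[[str, str], float]


def cascade_score(query: str, output: str, cheap: Judge, strong: Judge,
                  low: float = 0.3, high: float = 0.7) -> Tuple[float, bool]:
    """Return (score, escalated).

    The cheap judge's score is accepted when it is confidently low or high;
    scores inside the (low, high) band are re-scored by the strong judge.
    """
    score = cheap(query, output)
    if low < score < high:
        return strong(query, output), True
    return score, False


def escalation_rate(flags: List[bool]) -> float:
    """Share of items that needed the strong judge (a proxy for cost)."""
    return sum(flags) / len(flags)
```

The cost saving depends on how often the cheap judge lands outside the band, so the band width trades evaluation cost against agreement with the strong judge.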

[NLP-109] From Architecture to Output: Structural Origins of Hallucination in Large Language Models and the Amplifying Role of Data

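Quick read: This paper examines hallucination in large language models (LLMs): fluent, confident, factually wrong output. Existing taxonomies classify hallucinations by output type (intrinsic vs. extrinsic, faithfulness vs. factuality divergence) but do not identify which internal mechanism produced a given instance. The authors argue that hallucination is a structural consequence of three architectural decisions that together form a compound failure system. Self-attention's co-occurrence learning substitutes statistical proximity for semantic meaning, producing entity confusion, fact misattribution, and semantic drift. The maximum likelihood estimation (MLE) training objective optimises next-token probability without any factual constraint, rewarding statistically plausible output regardless of truth. Autoregressive decoding commits left to right under exposure bias, so a single wrong token cascades through the rest of the sequence without revision. Dataset pathologies (long-tail deficiencies, training bias, synthetic pollution) amplify these vulnerabilities but do not cause them independently. The paper makes three contributions: it maps each mechanism to an output category in the Alansari and Luqman taxonomy (intrinsic hallucination to self-attention, extrinsic hallucination to MLE, logical inconsistency to autoregressive decoding); it shows that each commonly cited dataset pathology exploits one of these mechanisms; and it identifies the diagnostic limitation of output-type-only classification and contrasts it with inference-layer mitigation approaches.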

Link: https://arxiv.org/abs/2606.07537
Authors: Md. Rejaul Korim Sadi, Toufiqur Rahman Tasin, Golam Mostofa Naeem
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 pages, 7 figures, 15 references

Abstract:Large language models hallucinate–producing fluent, confident, factually wrong outputs–with a consistency that persists across generations and scales. Existing taxonomies classify hallucination by output type, distinguishing intrinsic from extrinsic failures and faithfulness from factuality divergence. These frameworks are descriptively rigorous but do not identify which internal mechanism produced a given instance. This paper analyses hallucination as a structural consequence of three architectural decisions that together form a compound failure system. Self-attention’s co-occurrence learning substitutes statistical proximity for semantic meaning and produces entity confusion, fact misattribution, and semantic drift. The maximum likelihood estimation training objective optimises next-token probability without factual constraint, rewarding statistically plausible outputs regardless of their truth value. Autoregressive decoding’s permanent left-to-right commitment under exposure bias ensures that a single wrong token cascades forward through the entire output sequence without revision. Dataset pathologies–long-tail deficiencies, training bias, and synthetic pollution–amplify these vulnerabilities but do not independently cause them. We make three contributions. First, we map each mechanism to a specific output category in the Alansari and Luqman taxonomy, locating intrinsic hallucination in self-attention, extrinsic hallucination in MLE, and logical inconsistency in autoregressive decoding. Second, we show that each commonly cited dataset pathology exploits one of these mechanisms rather than originating hallucination independently. Third, we identify the diagnostic limitation of output-type-only classification and contrast it with inference-layer mitigation approaches.

[NLP-110] Which Speech Representation Better Matches Text-Native Reasoning? A Study of Speech-Text Alignment on Frame Rate and Representation INTERSPEECH2026

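Quick read: This paper addresses the reasoning degradation that spoken dialogue models show when a text LLM backbone is conditioned on speech instead of text. The authors attribute part of this modality gap to a temporal-granularity mismatch: under matched semantics, speech tokens are temporally redundant and far longer than text, which dilutes per-token semantic density and weakens text-native reasoning dynamics. They treat speech token design as a representation selection problem and sweep frame rates under a frozen LLM backbone with a fixed information rate. To make low frame rates feasible, they introduce factorized FSQ and a lightweight non-autoregressive audio LM head, scaling capacity to nearly 300 bits/frame without sacrificing efficient prediction. With that bottleneck removed, sweeping frame rate (50 → 2.08 Hz) and alignment depth shows a consistent best regime for speech QA at 4.17 Hz with intermediate-layer representation alignment.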

Link: https://arxiv.org/abs/2606.12199
Authors: Zhen Ye, Xu Tan, Yiming Li, Guangyan Zhang, Chimin Chan, Haohe Liu, Zhengxi Liu, Hongzhan Lin, Zheqi Dai, Xinshen Zhang, Peiwen Sun, Qiuqiang Kong, Wei Xue
Affiliations: Hong Kong University of Science and Technology; Tencent; University of Surrey; Chinese University of Hong Kong; Hong Kong Baptist University; Hong Kong Polytechnic University; Independent Researcher
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted by Interspeech 2026 long paper

Abstract: Spoken dialogue models typically start from text LLM backbones, yet reasoning often degrades when conditioning on speech instead of text. We attribute part of this modality gap to a temporal-granularity mismatch: speech tokens are temporally redundant and far longer than text under matched semantics, diluting per-token semantic density and weakening text-native reasoning dynamics. We study speech token design as a representation selection problem and sweep frame rates under a frozen LLM backbone with a fixed information rate. To make low frame rates feasible, we introduce factorized FSQ and a lightweight non-autoregressive audio LM head, scaling capacity to nearly 300 bits/frame without sacrificing efficient prediction. With the bottleneck removed, we sweep frame rates (50 → 2.08 Hz) and alignment depth, and observe a consistent best regime for speech QA at 4.17 Hz with intermediate-layer representation alignment.
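The trade-off behind a fixed information rate is simple arithmetic: halving the frame rate doubles the bits each frame must carry. The sketch below works through it under two assumptions that are not stated in the abstract: that the reported rates come from downsampling a 50 Hz stream by integer factors (50/12 ≈ 4.17 Hz and 50/24 ≈ 2.08 Hz fit this reading), and an illustrative budget of 625 bits/s, picked only because it gives 300 bits/frame at the lowest rate.

```python
BASE_RATE_HZ = 50.0  # highest frame rate in the sweep


def frame_rate(downsample_factor: int) -> float:
    """Frame rate after downsampling the 50 Hz stream (assumed reading)."""
    return BASE_RATE_HZ / downsample_factor


def bits_per_frame(info_rate_bps: float, rate_hz: float) -> float:
    """Bits each frame must carry to hold a fixed information rate."""
    return info_rate_bps / rate_hz


def tokens_for(seconds: float, rate_hz: float) -> float:
    """Number of speech frames the LLM sees for an utterance."""
    return seconds * rate_hz


if __name__ == "__main__":
    budget = 625.0  # bits/s, hypothetical value for illustration
    for factor in (1, 12, 24):
        rate = frame_rate(factor)
        print(f"{rate:5.2f} Hz  {bits_per_frame(budget, rate):6.1f} bits/frame  "
              f"{tokens_for(10, rate):6.1f} frames per 10 s")
```

Under these assumptions, a 10-second utterance shrinks from 500 frames at 50 Hz to about 42 frames at 4.17 Hz, which is much closer to the length of its text transcript.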

[NLP-111] Fast Speech Foundation Model Distillation Using Interleaved Stacking INTERSPEECH2026

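Quick read: This paper addresses training efficiency when distilling a large speech foundation model (SFM) into an efficient student model. Distillation reduces inference latency but requires training an additional student model, and the efficiency of that training remains underexplored. The authors examine stacking, in which model depth is increased progressively during training until the target depth is reached. Existing stacking methods speed up training but degrade performance. The proposed interleaved stacking consistently preserves layer position throughout the stacking process, a property that matters for SFMs because each layer encodes distinct layer-specific knowledge. The method is validated on SUPERB.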

Link: https://arxiv.org/abs/2606.11766
Authors: Eungbeom Kim, Kyogu Lee
Affiliations: IPAI; AIIS; Dept. of Intelligence and Information, Seoul National University
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted by Interspeech 2026

Abstract:Distilling a large speech foundation model (SFM) into an efficient student model has been successfully applied to low-resource environments. Although distillation reduces inference latency, it requires an additional student model training. However, the training efficiency of SFM distillation remains underexplored. In this work, we explore training acceleration of SFM distillation to speed up model deployment. We examine the potential of stacking, in which the model depth is progressively increased through training until the target model depth is reached. While existing stacking methods improve training speed, they suffer from performance degradation. To handle this limitation, we propose interleaved stacking, a novel stacking method that consistently preserves layer position throughout the stacking process. This property is particularly critical in SFMs, in which each layer encodes distinct layer-specific knowledge. We validate the effectiveness of the proposed method on SUPERB.

[NLP-112] Gumbel-BEARD: Automatic Layer Selection for Self-Supervised Adaptation of Whisper in Low-Resource Domains INTERSPEECH2026

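Quick read: This paper addresses the performance drop of speech foundation models in low-resource domains caused by domain mismatch and data scarcity. The proposed domain adaptation framework, Gumbel-BEARD, automates the selection of Whisper encoder layers with an end-to-end trainable hard Gumbel-Softmax selector, and enables self-supervised adaptation with a BEST-RQ objective that adapts to target acoustic characteristics without manual tuning. On the MyST child speech corpus, fine-tuning with 10 h of labeled data matches a fully supervised baseline trained on the complete 133 h labeled set. The method sets new state-of-the-art word error rates (WERs) of 8.21% with Whisper-medium on MyST and 11.06% with Whisper-small on the OGI Spontaneous dataset. Evaluation on CORAAL further confirms robustness to adult dialectal domain shifts, with up to 6% relative WER reduction.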

Link: https://arxiv.org/abs/2606.11429
Authors: Zilai Wang, Natarajan Balaji Shankar, Mohan Shi, Kaiyuan Zhang, Abeer Alwan
Affiliations: University of California, Los Angeles, USA
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted by Interspeech 2026

Abstract:Speech foundation models often struggle in low-resource domains due to domain mismatch and data scarcity. We propose Gumbel-BEARD, a domain adaptation framework that automates Whisper encoder layer selection via an end-to-end trainable hard Gumbel-Softmax selector. It enables self-supervised adaptation with a BEST-RQ objective that dynamically adapts to target acoustic characteristics without manual tuning. Experiments on the MyST child speech corpus demonstrate efficiency and scalability: with 10 h of labeled data for fine-tuning, our method matches a fully supervised baseline trained on the complete 133 h labeled set. We establish new state-of-the-art word error rates (WERs) of 8.21% using Whisper-medium on MyST and 11.06% using Whisper-small on the OGI Spontaneous dataset. Evaluation on CORAAL further confirms robustness to adult dialectal domain shifts, with up to 6% relative WER reduction, highlighting the generalizability of our approach to diverse low-resource conditions.
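A hard Gumbel-Softmax selector picks one option (here, one encoder layer) by adding Gumbel noise to learnable logits and taking the argmax, while the soft probabilities are what a training framework would differentiate through. The sketch below shows only the forward computation in plain Python; it is not the authors' implementation, and the function name, the per-layer logits, and the temperature `tau` are assumptions for illustration.

```python
import math
import random
from typing import List, Optional, Tuple


def hard_gumbel_softmax(logits: List[float], tau: float = 1.0,
                        rng: Optional[random.Random] = None
                        ) -> Tuple[List[float], List[float]]:
    """Return (one-hot selection, soft probabilities) over candidate layers."""
    rng = rng or random.Random()
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)       # guard against log(0)
        gumbel = -math.log(-math.log(u))   # Gumbel(0, 1) sample
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)                      # subtract max for numerical stability
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    winner = soft.index(max(soft))
    hard = [1.0 if i == winner else 0.0 for i in range(len(logits))]
    return hard, soft
```

In an autograd framework the usual straight-through trick returns the one-hot vector in the forward pass and uses the gradient of the soft probabilities in the backward pass; this stdlib sketch has no gradients and only illustrates the sampling step.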

[NLP-113] Massive Open-Vocabulary Keyword Spotting INTERSPEECH2026

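Quick read: This paper addresses the tendency of automatic speech recognition (ASR) systems to under-perform on words rarely seen in training data, in particular specialized terminology. Open-vocabulary keyword spotting combined with contextual biasing mitigates the issue, but existing systems handle glossaries of only a few hundred terms before becoming an infeasible bottleneck. The proposed system stores features with a memory footprint up to 128 times smaller than a comparable baseline, which lets users process massive term databases while remaining open-vocabulary. Without fine-tuning the speech recognition model, it achieves entity recall comparable to uncompressed solutions, even in languages not seen during training.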

Link: https://arxiv.org/abs/2606.11279
Authors: Leonor Barreiros, Raul Monteiro, Afonso Mendes, Gonçalo M. Correia
Affiliations: Priberam Labs; Instituto Superior Técnico; Instituto de Telecomunicações
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Accepted to Interspeech 2026

Abstract:Automatic speech recognition systems have been shown to under-perform when it comes to transcribing words rarely seen in the training data, namely specialized terminology. Open-vocabulary keyword spotting, combined with contextual biasing, has been shown to mitigate this issue. However, existing systems can only handle glossaries of a few hundred terms without becoming an infeasible bottleneck. We propose a system that stores features with a memory footprint up to 128 times smaller than a comparable baseline and allows users to process massive databases while remaining open-vocabulary. Without fine-tuning the speech recognition model, our system achieves a comparable entity recall as uncompressed solutions, even in languages not seen during training.

[NLP-114] MA-DLE: Speech-based Automatic Depression Level Estimation via Memory Augmentation

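Quick read: This paper addresses weak long-range dependency modeling in speech-based automatic depression level estimation, a task that supports early detection and timely intervention in resource-constrained mental health settings. Most existing approaches use RNN-based architectures such as LSTM and GRU to model temporal information, but the extracted features tend to emphasize only a few adjacent speech segments. The authors introduce a memory-based feature augmentation method for GRU-extracted features. To reduce redundancy and irrelevance, the memory bank selectively integrates two components: (1) historical temporal features that closely resemble the current GRU output, which supply complementary context; and (2) dynamic memory features identified by feature variability, which capture behavioral and emotional fluctuations indicative of depressive symptoms. A Hierarchical Attention Fusion (HAF) module fuses the memory-augmented features with the GRU outputs. The method achieves state-of-the-art performance on the DAIC-WOZ and E-DAIC datasets.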

Link: https://arxiv.org/abs/2606.11197
Authors: Xuzhi Wang, Xinran Wu, Ziping Zhao, Jianhua Tao, Björn W. Schuller
Affiliations: Tianjin Normal University; Tsinghua University; Technical University of Munich; Imperial College London
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted at IEEE TAC

Abstract:Speech-based automatic estimation of depression levels is essential for enabling early detection and timely intervention, particularly in resource-constrained mental health settings. In recent years, deep learning has demonstrated impressive success across various domains, including affective computing and mental health assessment. Most existing approaches rely on RNN-based architectures (such as LSTM and GRU) to model temporal information for depression estimation. However, the extracted features often emphasize only a few adjacent speech segments, limiting their ability to capture long-range dependencies. To overcome this limitation, we introduce a memory-based feature augmentation method that enhances the representational capacity of GRU-extracted features. Rather than indiscriminately incorporating historical data, our memory bank is designed to selectively integrate two types of components in order to reduce redundancy and irrelevance: (1) historical temporal features that closely resemble the current GRU output, offering complementary contextual information; and (2) dynamic memory features identified based on feature variability, which capture behavioral and emotional fluctuations indicative of depressive symptoms. To effectively fuse the memory-augmented features with GRU outputs, we further design a Hierarchical Attention Fusion (HAF) module. Our method is evaluated on the widely used DAIC-WOZ and E-DAIC datasets, achieving state-of-the-art performance.
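The first memory component described here, historical features that closely resemble the current GRU output, amounts to a similarity lookup over stored vectors. The sketch below assumes cosine similarity and a top-k cutoff; the abstract does not specify the similarity measure or the selection rule, and the names `cosine` and `retrieve_similar` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_similar(memory: List[Sequence[float]], query: Sequence[float],
                     k: int = 2) -> List[Sequence[float]]:
    """Return the k stored features most similar to the current feature."""
    ranked = sorted(memory, key=lambda feat: cosine(feat, query), reverse=True)
    return ranked[:k]
```

Because the lookup ranges over the whole memory bank rather than a fixed window, a segment from early in the interview can be retrieved as context for a much later one.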

Information Retrieval

[IR-0] Doc-to-Atom: Learning to Compile and Compose Memory Atoms

Link: https://arxiv.org/abs/2606.12400
Authors: Xingjian Diao, Wenbo Li, Yashas Malur Saidutta, Avinash Amballa, Lazar Valkov, Srinivas Chappidi
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 20 pages

Abstract:Long input sequences are central to document understanding and multi-step reasoning in Large Language Models, yet the quadratic cost of attention makes inference both memory-intensive and slow. Context distillation mitigates this by compressing contextual information into model parameters, and recent work such as Doc-to-LoRA amortizes context distillation into a single forward pass that generates one LoRA adapter per document. However, producing a single monolithic adapter for all queries leads to irrelevant-query interference, limited compositional recall, and poor scalability to long-document reasoning. To address these challenges, we propose Doc-to-Atom (Doc2Atom), a compositional parametric memory framework that decomposes each document into semantically typed knowledge atoms. Each atom is compiled into an independent micro-LoRA adapter and a provenance retrieval key. At inference time, a lightweight query router selects and assembles only the relevant atoms into a query-specific adapter, which is then injected into a frozen base model. The entire system is trained end-to-end through a multi-objective distillation framework. Experiments on six diverse QA benchmarks demonstrate that Doc2Atom outperforms Doc-to-LoRA baselines while reducing the memory cost of document internalization.

[IR-1] Findings of the MAGMaR 2026 Shared Task KR

Link: https://arxiv.org/abs/2606.12295
Authors: Alexander Martin, Dengjia Zhang, Joel Brogan, Francis Ferraro, Jeremy Gwinnup, Reno Kriz, Teng Long, Kenton Murray, Andrew Yates, Xiang Xiang
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Findings of the 2nd workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR); Resources at this url: this https URL

Abstract:This overview paper presents the results of the shared task for the second workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR). In this shared task participants submitted systems focused on either (i) video retrieval or (ii) grounded generation of articles given retrieved videos. Teams could submit to either task. For the retrieval task, we had 2 participating teams that submitted a total of 17 systems – all of which beat a baseline derived from the winner of last year’s shared task. On the generation side, we had 4 teams submit 16 systems. All teams had at least one generated report that was labeled the best by a human annotator.

[IR-2] Efficient and Robust Online Learning to Rank in Decentralized Systems

Link: https://arxiv.org/abs/2606.12246
Authors: Marcel Gregoriadis, Martijn de Vos, Sayan Biswas, Anne-Marie Kermarrec, Johan Pouwelse
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR)
Comments:

Abstract:In Online Learning to Rank (OLTR), ranking models are trained directly from live user interactions, but existing systems rely on a trusted central server to collect and process these interactions. This leaves operators free to introduce biases that conflict with user interests. Decentralized learning offers an attractive alternative, allowing users to collaboratively train a shared ranking model by exchanging model updates directly with one another, without any central authority. In such settings, however, malicious nodes can send poisoned model updates that degrade the ranking quality of honest nodes. We introduce RankGuard, a decentralized OLTR framework in which users collaboratively train ranking models and exchange model updates directly with other nodes. RankGuard defends against poisoning attacks by carefully evaluating incoming models against the user’s own private click history, corrected for position bias. An incoming model is only aggregated if it better explains the user’s past interactions than the current local model, making it fundamentally hard for malicious nodes to craft updates that pass this test without also genuinely helping the user. We derive a theoretical convergence guarantee of RankGuard. To the best of our knowledge, this is the first formal convergence analysis of a decentralized OLTR algorithm. We evaluate RankGuard against four poisoning attacks, including a powerful adaptive attack, using four standard benchmarks and three click models. RankGuard outperforms all baselines in most settings while being up to 62x more efficient than its closest competitors.
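The acceptance rule described in this abstract, aggregating an incoming model only if it explains the user's own clicks better than the local model, can be sketched as a comparison of two losses over the private click history. Everything concrete below is an assumption made for illustration: the inverse-propensity weighting with examination probability `1 / (position + 1) ** eta`, the rank-of-clicked-document loss, and the names `ips_loss` and `should_aggregate`. The paper's actual test may differ.

```python
from typing import Callable, List, Tuple

# A model scores a document id; higher means more relevant.
Model = Callable[[str], float]
# One history entry: (documents in displayed order, 0-based clicked position).
Impression = Tuple[List[str], int]


def ips_loss(model: Model, history: List[Impression], eta: float = 1.0) -> float:
    """Position-bias-corrected loss: lower means the model explains clicks better."""
    total = 0.0
    for shown, clicked_pos in history:
        clicked_doc = shown[clicked_pos]
        # Assumed examination probability of the displayed position.
        propensity = 1.0 / (clicked_pos + 1) ** eta
        ranking = sorted(shown, key=model, reverse=True)
        rank = ranking.index(clicked_doc) + 1   # rank under the candidate model
        total += rank / propensity              # clicks at low positions count more
    return total


def should_aggregate(incoming: Model, local: Model,
                     history: List[Impression]) -> bool:
    """Accept the incoming model only if it beats the local one on own clicks."""
    return ips_loss(incoming, history) < ips_loss(local, history)
```

Since the history never leaves the user's device, an attacker cannot check in advance whether a poisoned update would pass this test for a given user.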

[IR-3] DiffCold: A Diffusion-based Generative Model for Cold-Start Item Recommendation ECML-PKDD2026

Link: https://arxiv.org/abs/2606.12245
Authors: Kangning Zhang, Yingjie Qin, Weinan Zhang, Yong Yu, Jianghao Lin
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted by ECML-PKDD 2026

Abstract: Cold-start item recommendation remains a persistent challenge in real-world systems due to the absence of interaction histories. While prior models attempt to bridge this gap using item content features, they universally suffer from the seesaw dilemma: enhancing performance for cold items inevitably degrades performance for warm items, and vice versa. We identify that this dilemma stems from a fundamental distributional disparity: warm item embeddings occupy a complex "behavioral manifold" shaped by rich interaction signals, whereas cold item embeddings are constrained to a "semantic manifold" derived solely from auxiliary content. Existing methods often force a rigid mapping between these inconsistent spaces, causing the model to sacrifice the precision of warm representations to accommodate cold ones. To address this, we propose DiffCold, a diffusion-based generative model that unifies warm and cold representations. Unlike GANs or VAEs, DiffCold leverages conditional diffusion to reconstruct warm item embeddings from content, preserving the underlying manifold structure without degradation. We further tailor this paradigm with two specific designs: a Retrieval-enhanced Aggregator that initializes generation using semantically similar warm items to bypass inefficient noise, and a Simulation-based Representation Alignment module that enforces distribution consistency between generated and real embeddings via contrastive learning. Experiments on three benchmarks confirm that DiffCold resolves the seesaw dilemma, consistently outperforming state-of-the-art methods across all metrics.

[IR-4] MLT-Dedup: Efficient Large-Scale Online Video Deduplication via Multi-Level Representations and Spatial-Temporal Matching KDD-2026

Link: https://arxiv.org/abs/2606.12215
Authors: David Yuchen Wang, Haoying Li, Hailun Xu, Wei Chee Yew, Zirui Zhu, Sanjay Saha, Hao Hei, Kanchan Sarkar, Kun Xu
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted by KDD-2026 ADS track

Abstract:The explosive growth of user-generated video content on online platforms is accompanied by the emergence of numerous near-duplicate videos–videos that are identical or highly similar but differ by partial edits. These duplicates degrade user experience and increase storage and bandwidth costs, making large-scale video deduplication a critical task. Existing video deduplication frameworks face a fundamental challenge in retrieving sufficient high-quality candidates under a limited index budget, as well as trade-offs between efficiency and precision. To address these issues, we propose MLT-Dedup, an efficient large-scale online video deduplication framework with Multi-Level representations and spatial-Temporal matching. Our approach employs a Multi-Level Video Encoder (ML-VE) to extract both fine-grained frame-level and sparse clip-level embeddings: sparse embeddings support efficient candidate retrieval, while fine-grained embeddings are loaded for precise pairwise matching. During matching, we introduce DiF-SiM, a Differential Feature-enhanced Similarity Module capable of locating duplicated temporal segments and providing reliable similarity evidence to support policy-driven deduplication decisions. Extensive experiments on a real-world large-scale platform demonstrate that MLT-Dedup reduces online repetition rates by 91% at 90% precision. Furthermore, our sparse retrieval design achieves a 5x increase in indexing capacity, enabling broader candidate coverage in real-world deployment.

[IR-5] LLM-Based User Personas for Recommendations at Scale

Link: https://arxiv.org/abs/2606.12198
Authors: Haoting Wang,Haokai Lu,Zheyun Feng,Jenny Huang,Yifat Amir,Gregory Hinkson,Ben Most,Zelong Zhao,Yixin Kelly Cui,Rein Zhang,Fabio Soldo,Yu Xia,Nihar Bhupalam,Minmin Chen,Konstantina Christakopoulou,Lichan Hong,Ed H. Chi
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Large Language Models (LLMs) offer unprecedented potential for enhancing recommendation systems through their world knowledge and reasoning capabilities. However, existing approaches often rely on structured IDs or offline processing, limiting semantic richness, real-time adaptability, and user-facing interpretability. In this paper, we introduce a novel framework that enables real-time generation of LLM-based user interest personas for a large-scale commercial video recommendation platform. Our method generates natural-language user interest personas that address the exploitation-exploration trade-off by combining the summarization of existing interests with novel topics, directly during serving. To overcome the computational challenges of online LLM inference at a billion-user scale, we design a cost-efficient architecture leveraging knowledge distillation, asynchronous inference, and input optimization via semantically clustered video representations. Extensive offline evaluations, user studies, and live A/B tests demonstrate significant improvements in viewer value. This work bridges the gap between high-level semantic understanding and industrial-scale recommendation, paving the way for more dynamic, explainable, and satisfying personalized experiences.

[IR-6] uva-irlab-conv at SemEval-2026 Task 8: Multi-Turn RAG with Learned Sparse Retrieval and Listwise Reranking ACL2026 SEMEVAL-2026

Link: https://arxiv.org/abs/2606.11945
Authors: Simon Lupart,Kidist Amde Mekonnen,Zahra Abbasiantaeb,Mohammad Aliannejadi
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: SemEval-2026, The 20th International Workshop on Semantic Evaluation, collocated with ACL 2026, 9 pages, 5 figures, 6 tables

Abstract:This report describes our participation in SemEval-2026 Task 8 on multi-turn retrieval and question answering. The task evaluates conversational systems across four domains (finance, cloud documentation, government, Wikipedia), and includes unanswerable queries where the available collection does not contain sufficient evidence to produce a complete response. We propose a multi-turn retrieval-augmented generation pipeline that combines learned sparse retrieval with LLM-based reranking and generation. Using sparse retrieval as the primary retrieval method, we leverage its strong generalization across domains. In addition, we make use of the long-context capabilities of LLMs for conversational query rewriting, pointwise and listwise reranking, and generating the final response, each conditioned on the full conversational history. This multi-step design enables effective integration of conversational context throughout retrieval and generation, improving robustness across domains.

[IR-7] Tail-Aware Adaptive-k: Query-Adaptive Context Selection for Retrieval-Augmented Generation KDD2026 ECML

Link: https://arxiv.org/abs/2606.11907
Authors: Ziyu Song,Jiaming Fang,Kuangyu Li,Tuo Xia,Chuanpeng Wang
Categories: Information Retrieval (cs.IR)
Comments: First two authors contributed equally. Accepted at ECML PKDD 2026

Abstract:Adaptive context selection is critical for retrieval-augmented generation (RAG) systems, as fixed Top-K retrieval fails under query-dependent and heavy-tailed similarity distributions. While Extreme Value Theory (EVT) offers a principled framework for adaptive truncation, existing approaches apply EVT globally across the entire ranked list, incurring prohibitive computational costs and statistical instability. We propose Tail-Aware Adaptive-k (TAA-k), a training-free framework that operationalizes EVT through a localized validation strategy. The key insight is that ranked similarity curves exhibit a characteristic steep–flat–steep pattern reflecting a transition from relevance-dominated to noise-dominated regimes. TAA-k exploits this geometric structure via knee detection to identify a compact candidate region, then applies EVT-based goodness-of-fit testing within this window to validate the onset of tail behavior. This coarse-to-fine design reduces computational complexity from O(N^2 M) to O(\sqrt{N} \log N \cdot M) while maintaining statistical rigor. Under mild monotone likelihood ratio assumptions, TAA-k yields a stable, query-adaptive cutoff corresponding to the earliest noise-dominated position. Experiments on WebQuestions, 2WikiMultiHopQA, and MuSiQue demonstrate that TAA-k achieves near-oracle retrieval quality (F1 within 2-3% of oracle) with orders-of-magnitude efficiency gains over global EVT methods, while maintaining robustness across embedding models and compression dimensions.
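
The knee-detection step mentioned in the abstract can be illustrated with a generic heuristic. The sketch below is not the authors' procedure: it picks the point on a descending similarity curve that lies farthest from the straight line joining the first and last scores, and it omits the EVT goodness-of-fit test and the candidate window entirely. The function name `knee_index` and the sample scores are hypothetical.

```python
def knee_index(scores):
    """Index of the point farthest from the chord joining the first and last scores.

    Assumes `scores` is sorted in descending order. This is a generic
    max-distance-to-chord heuristic, not the TAA-k algorithm itself.
    """
    n = len(scores)
    if n < 3:
        return n - 1
    x1, y0, y1 = n - 1, scores[0], scores[-1]
    best_i, best_d = 0, -1.0
    for i, y in enumerate(scores):
        # Unnormalized distance from (i, y) to the line through (0, y0) and (x1, y1).
        d = abs((y1 - y0) * i - x1 * y + x1 * y0)
        if d > best_d:
            best_i, best_d = i, d
    return best_i


# Hypothetical ranked similarities: three relevant hits, then a noise plateau.
scores = [0.95, 0.93, 0.90, 0.40, 0.38, 0.37, 0.36, 0.35]
cutoff = knee_index(scores)   # first position on the noise side of the drop
selected = scores[:cutoff]    # context kept for generation
```

On this toy list the heuristic lands on the first score after the sharp drop, so slicing at that index keeps only the leading high-similarity items.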

[IR-8] CORE-Bench: A Comprehensive Benchmark for Code Retrieval in the Era of Agentic Coding

Link: https://arxiv.org/abs/2606.11864
Authors: Fuwei Zhang,Yanzhao Zhang,Mingxin Li,Dingkun Long,Lexiang Hu,Pengjun Xie,Zhao Zhang,Fuzhen Zhuang
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Code retrieval is becoming central to coding agents, but agentic coding requires more than matching a natural-language query to an isolated snippet. Given a user request, a coding agent needs to navigate a concrete repository state, locate relevant files and functions, gather supporting context, and filter similar in-repository distractors. Existing code retrieval benchmarks mainly evaluate docstring-to-function or snippet-level matching, thereby missing this requirement-driven repository search problem. To address this gap, we introduce CORE-Bench, a comprehensive benchmark for code retrieval in the era of agentic coding. CORE-Bench evaluates code retrieval ability at three levels: code understanding, issue-to-edit localization, and broader context retrieval. Built from curated code-search tasks and SWE-bench-series instances, CORE-Bench contains over 180K queries and 106K broader-context relevance labels. Experiments with representative embedding models show a sharp drop from traditional code search to code retrieval in agentic coding settings. Simple supervised fine-tuning of existing embedding models significantly improves performance in this setting, suggesting substantial room for further progress.

[IR-9] What Limits Does Quantization Place on Dense Top-k Retrieval? A Theoretical Study

Link: https://arxiv.org/abs/2606.11780
Authors: Koki Okajima,Tsukasa Yoshida
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: 9 pages, 2 figures

Abstract:We establish conditions for embedding a corpus of N documents as d-dimensional vectors such that every k-subset S \subseteq [N] is realizable as a result of top-k retrieval by some query vector. Recent work shows that d = O(k) suffices for such embeddings to exist in \mathbb{R}^d, independently of N. We theoretically prove that this corpus-independent bound is specific to infinite precision. With B bits per coordinate, perfect top-k retrieval requires Bd = \Omega(k \ln N); thus, at any fixed precision, the dimension must grow at least logarithmically with N. Specializing to an \ell_2-normalized B-bit uniform scalar quantization model, we also identify a threshold on the precision B^* = O(\ln \ln N) below which no dimension suffices, together with two further regimes that bound the feasible (B, d) pairs. Our result implies that in practical vector databases and dense retrieval systems where quantization is standard, the embedding dimension and possibly the precision must grow with the corpus size.
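
To get a feel for how the bound Bd = \Omega(k \ln N) scales, the sketch below computes the smallest integer d satisfying B * d >= c * k * ln N. The constant c is hidden by the \Omega notation and is not given in the abstract, so the default c = 1 is purely an assumption for illustration; the function name `min_dimension` is hypothetical.

```python
import math


def min_dimension(k, n_docs, bits, c=1.0):
    """Smallest integer d with bits * d >= c * k * ln(n_docs).

    `c` stands in for the unknown constant inside the Omega bound (assumed 1 here).
    """
    return math.ceil(c * k * math.log(n_docs) / bits)


# Illustrative scaling only: growing the corpus raises the required dimension,
# and coarser quantization (fewer bits) raises it further.
d_million_8bit = min_dimension(k=10, n_docs=10**6, bits=8)
d_billion_8bit = min_dimension(k=10, n_docs=10**9, bits=8)
d_million_32bit = min_dimension(k=10, n_docs=10**6, bits=32)
```

The absolute values depend entirely on the assumed constant; only the trend (logarithmic growth in N, inverse dependence on B) reflects the stated bound.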

[IR-10] FAST-MEL: A Fast Accurate and Storage Efficient Solution for Multimodal Entity Linking

Link: https://arxiv.org/abs/2606.11749
Authors: Derrien Thomas,Laurent Amsaleg,Pascale Sébillot
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Multimodal entity linking (MEL) is the task that consists of matching textual and visual mentions of entities in unstructured data to their corresponding entities in a knowledge base (KB). To be effective in large-scale practical settings, MEL systems must meet three objectives: high linking accuracy, computational efficiency, and storage efficiency, i.e., a compact yet efficient index of the KB. In this paper, we highlight that state-of-the-art systems fail to simultaneously satisfy these 3 requirements. To meet this three-fold objective, we propose FAST-MEL, a lightweight encoder-based MEL solution that relies on a novel and compact fixed-size vectorized representation of both the textual and visual information of each entity or mention. It matches the accuracy of the best systems but performs three orders of magnitude faster. It also consumes one order of magnitude less storage than the fastest systems.

[IR-11] CompRank: Efficient LLM Reranking via Token-Level Compression and Decoding-Free Scoring

Link: https://arxiv.org/abs/2606.11700
Authors: Xuan Lu,Haohang Huang,Yingqi Fan,Junlong Tong,Yuxuan Zhang,Ping Nie,Rui Meng,Xiaoyu Shen
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Large language model (LLM) rerankers have become an important component of modern retrieval and retrieval-augmented generation pipelines, but their high computational cost limits their applicability to long candidate lists. In this paper, we propose CompRank, a token-efficient reranking framework that reduces redundant computation by aligning reranker design with the sparsity of ranking signals. CompRank decouples document representations from candidate order and query context, enabling reusable document-side states; applies segment-wise token compression to reduce query–document interaction cost; and introduces a CopyNet-style objective that directly aligns attention-based document scoring with training supervision. Experiments on seven BEIR datasets show that CompRank achieves strong reranking performance while retaining only 10.2% of document tokens, reaching an average NDCG@10 of 39.2 compared with 39.7 under full-token attention. Further scaling experiments on TREC-COVID show that CompRank remains stable when evaluated on candidate lists of up to 500 documents after training on 30-document lists, while achieving 4.9x to 9.5x end-to-end speedup over generation-based listwise reranking and approximately 1.3x speedup over the full-token CompRank variant. These results suggest that token-level compression and decoding-free attention scoring provide an effective path toward scalable LLM-based reranking.

[IR-12] The Long Tail Not the Front Page: Cold-Start Prediction of Crowd Highlight Salience

Link: https://arxiv.org/abs/2606.11654
Authors: Kazuki Nakayashiki,Keisuke Watanabe
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
Comments: 10 pages, 3 figures, 4 tables

Abstract:A social highlighter’s most useful signal – which passages a crowd of readers marks – exists only for documents people have already read. Can the aggregate crowd salience of a document be predicted from its text before its marks accumulate? Prior work on this data found that zero-shot language models recover highlight locations worse than a trivial lead (position) baseline, so we ask whether a model trained on the highlight corpus can beat that baseline. Using a pre-registered ladder of models and a by-document cluster bootstrap, we find a small but robust edge: a logistic ranker over sentence embeddings and positional/contextual features beats the lead baseline by +0.044 average precision (95% CI [+0.029, +0.058]; clears a pre-registered margin delta=0.03 in 97% of resamples, and stable across pipeline re-runs). Two unsupervised extractive baselines (centroid, LexRank-style centrality) lose to lead, and the trained model beats them by +0.108, so the edge is not recovered by generic unsupervised proxies – it reflects learning from real reader marks. In product terms, precision@3 rises from 0.25 to 0.39 (+55% relative) and the model beats lead on 69% of documents. An ablation attributes the edge to the raw embedding (+0.014) and training augmentation (+0.010), each with a positive CI. The edge is not a temporal-generalization failure, and we find no evidence that content drift or near-duplicate leakage explains it. A standardized regression shows the advantage is governed mainly by document popularity (lower popularity, larger edge) and by label reliability. It nearly vanishes only on the most popular content; there it is the lead baseline that strengthens, not the model that weakens. Because our evaluation conditions on documents that eventually accumulated readers, these results are a retrospective cold-start simulation.

[IR-13] DeMix: Debugging Training Data with Mixed Data Error Types by Investigating Influence Vectors

Link: https://arxiv.org/abs/2606.11616
Authors: Jiale Deng,Yanyan Shen,Xiaogang Shi,Chai Junjun
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments:

Abstract:High-quality training data is essential for the success of machine learning models. However, real-world datasets often contain mixed types of errors arising from systematic flaws in data preparation pipelines, including label errors, feature errors, and spurious correlations. Effective debugging of training data requires both detecting erroneous samples and identifying their specific error types to enable targeted repair, yet existing data cleaning and attribution methods fail to adequately address this dual requirement. In this paper, we propose DeMix, a novel framework that simultaneously diagnoses erroneous samples and their error types. Our key insight is that different error types produce distinct patterns on model behavior. DeMix captures such error-specific patterns by influence vectors that characterize how each training sample affects model predictions across all validation samples. We formulate training data debugging as a multi-label classification problem where a classifier is developed to predict error types directly from influence vectors. We further introduce an intervention-based learning strategy that guides the classifier to capture invariant rationales specific to each error type, ensuring the learned classifier generalizes effectively. Empirical evaluations on 11 tasks across tabular data prediction, recommendation systems, and LLM alignment demonstrate that DeMix significantly outperforms state-of-the-art approaches, achieving a 22.61% improvement in data debugging F1-score and a 9.32% gain in task model performance after data repair. Code is available at: this https URL.

[IR-14] Factions Within Uncertain Across: Within-Document Reader Sub-Groups in Social Highlighting

Link: https://arxiv.org/abs/2606.11613
Authors: Kazuki Nakayashiki,Keisuke Watanabe
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
Comments: 11 pages, 3 figures, 3 tables

Abstract:When many people highlight the same document, is the crowd a single consensus, or is it internally structured into reader sub-groups that mark different things – and is that structure a stable property of a reader or of the document? Building on prior work showing an individual’s within-document highlighting signal is a whisper while individuality lives in selection, we ask the group-level question on a co-readership platform using a margin-preserving curveball null. Experiment 1: within a document, readers form strong sub-groups – pairs agree far beyond what shared salience, mark density, and sentence popularity predict (nearest-neighbour agreement z=+6.3, significant in 88% of documents). Under an eight-block region-preserving null, shared engagement with the same coarse regions of the document accounts for about 40% of this excess; the majority survives as finer reader-specific agreement (z=+3.6, 77% significant). So the within-document crowd is, in a descriptive sense, factional. Experiment 2: is that grouping a stable reader trait? Here we are honest about power. The cross-document split-half reproducibility of a pair’s agreement is near zero pooled (+0.078 and 0.000 in two separately drawn samples), and a power calibration shows the test is informative only for pairs that co-read many documents. In the only informative high-overlap subset (k=4), point estimates are positive but small-sample, imprecise across the separately drawn samples, never significant, and attenuate under the region-preserving null. We therefore leave cross-document stability unresolved: the data is consistent with anything from situational grouping to a weak-to-moderate stable reader trait. The crowd is factional within a document; whether its factions follow the reader across documents is, honestly, beyond our reach.

[IR-15] A PubMed-Scale Dataset of Structured Biomedical Abstracts

Link: https://arxiv.org/abs/2606.11361
Authors: Chia-Hsuan Chang,Haerin Song,Brian Ondov,Hua Xu
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: Data and code for this work are available at this https URL and this https URL , respectively

Abstract:Structured abstracts are important for biomedical literature processing, by facilitating information retrieval, text mining, and knowledge synthesis. However, a vast portion of abstracts indexed in PubMed remain unstructured, presenting a significant bottleneck for downstream text-processing workflows and applications. To resolve this limitation, we introduce Structured PubMed, a comprehensive corpus of section-labeled biomedical abstracts compiled from the complete PubMed database, encompassing over 23.2 million research-article records. The corpus is divided into two distinct subsets: a collection of 5.9 million author-structured abstracts parsed from official XML files, and an automatically labeled collection of 17.2 million originally unstructured abstracts structured via a verbatim-extraction Large Language Model pipeline. Every record is harmonized under a unified five-section schema and mapped to its original PubMed identifier, publication type, and publication date. This dataset can be utilized to train sentence-classification models, benchmark text-segmentation architectures, and perform large-scale, section-specific information extraction at an unprecedented PubMed-wide scale.

[IR-16] When More Documents Hurt RAG: Mitigating Vector Search Dilution with Domain-Scoped Model-Agnostic Retrieval

Link: https://arxiv.org/abs/2606.11350
Authors: Nabaraj Subedi,Ahmed Abdelaty,Shivanand Venkanna Sheshappanavar
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 24 pages, 8 figures, 30 tables. Preprint under review

Abstract:Retrieval-augmented generation degrades when scaled to large, heterogeneous document collections, where dense similarity loses discriminative power, and top-k retrieval increasingly returns semantically similar but contextually incorrect chunks. We refer to this failure mode as vector search dilution. Even when using hybrid dense+sparse retrieval, we observed this firsthand in a deployed Wyoming Department of Transportation corpus, where scaling from 54 to 1,128 documents (88,907 chunks) reduced accuracy from 75% to below 40%. To address this dilution, we propose MASDR-RAG (Multi-Agent Scoped Domain Retrieval for RAG) and evaluate it on 200 expert-validated queries across five LLM backbones, six corpora, and two index stacks. Our results indicate that domain scoping using organizational metadata is the key fix, significantly improving P@10 from 0.77 to 0.86 (p < 0.05). Furthermore, our investigation of multi-agent orchestration revealed a high degree of configuration dependence, creating what we call the precision-faithfulness paradox. Based on these varied outcomes, our practical recommendation is simple: scope first, then perform a single synthesis call, reserving full multi-agent orchestration for genuinely multi-domain corpora paired with native-tool-call backbones. Code and Data will be made public upon acceptance.
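
The "scope first" recommendation can be pictured as a metadata filter applied before similarity ranking. The sketch below is a minimal illustration under assumed data structures, not the MASDR-RAG implementation: the chunk schema (`id`, `domain`, `vec`), the domain labels, and the function `scoped_top_k` are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def scoped_top_k(query_vec, chunks, domain, k=2):
    """Keep only chunks tagged with `domain`, then rank them by cosine similarity."""
    in_scope = [c for c in chunks if c["domain"] == domain]
    ranked = sorted(in_scope, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [c["id"] for c in ranked[:k]]


# Hypothetical corpus: the out-of-domain chunk is the closest vector overall.
chunks = [
    {"id": "bridge-1", "domain": "bridges", "vec": [0.9, 0.1]},
    {"id": "bridge-2", "domain": "bridges", "vec": [0.6, 0.4]},
    {"id": "hr-1", "domain": "hr", "vec": [1.0, 0.0]},
]
hits = scoped_top_k([1.0, 0.0], chunks, domain="bridges")
```

In this toy corpus the `hr-1` chunk would rank first in an unscoped search, which is the kind of similar-but-wrong hit the abstract calls dilution; filtering on the domain tag removes it before ranking.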

[IR-17] Benchmarking Large Language Models for Safety Data Extraction

Link: https://arxiv.org/abs/2606.11204
Authors: Jonas Grill,Thomas Bayer,Sören Berlinger
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 18 pages, 8 figures, submitted to Applied Intelligence

Abstract:Accurate extraction of structured information from Safety Data Sheets (SDS) remains challenging in industrial safety due to heterogeneous document formats and the limitations of traditional rule-based methods. This study benchmarks state-of-the-art Large Language Models (LLMs) for automated SDS data extraction, comparing text-based and multimodal processing pipelines. We systematically evaluate four models: Gemini 1.5 Pro, GPT-4o, Claude 3.7 Sonnet, and Llama 3.1-70B, across three prompting strategies: zero-shot, few-shot, and chain-of-thought. The evaluation framework assessed accuracy, latency, and cost across more than 50,000 extracted data fields. Results show that text-based extraction consistently outperforms multimodal processing across all metrics. Gemini 1.5 Pro combined with a Chain-of-Thought prompt achieved the highest accuracy (84%), outperforming GPT-4o (81%) and Claude 3.7 Sonnet (79%). However, no model surpassed the 90% accuracy threshold commonly required for reliable real-world deployment. These findings indicate that general-purpose LLMs are not yet robust enough for unsupervised industrial use, though performance suggests strong potential with task-specific fine-tuning. Future research should focus on domain-adapted training, model calibration, and the integration of Human-in-the-Loop verification to ensure safety-critical reliability.

[IR-18] NightFeats @ MMU-RAGent NeurIPS 2025: A Context-Optimized Multi-Agent RAG System for the Text-to-Text Track NEURIPS2025

Link: https://arxiv.org/abs/2606.11199
Authors: Quentin Fever,Naziha Aslam
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 5 pages, 1 figure, 1 table. NeurIPS 2025 Competition Track (MMU-RAGent). System developed October 2025

Abstract:We present NightFeats, a structured multi-agent retrieval-augmented generation (RAG) system submitted to the MMU-RAGent competition at NeurIPS 2025, where it was awarded Best Dynamic Evaluation in the text-to-text track. Rather than targeting benchmark maximization, this work proposes a principled pipeline that decomposes knowledge synthesis into three coordinated phases: retrieval, curation, and composition, each governed by explicit intermediate representations and handoff contracts. Inspired by Agentic Context Engineering (ACE), the system introduces temporal-semantic reranking, bounded contradiction reconciliation, and citation-preserving composition as core architectural primitives. Competition results show that NightFeats surpasses proprietary baselines including Claude-SonnetV2 and Nova-Pro on LLM-as-a-Judge and Human Likert evaluations, confirming that architectural transparency and verifiable evidence grounding are better aligned with human preferences than systems optimizing narrowly for automatic similarity metrics.

Human-Computer Interaction

[HC-0] Identifying cybersickness causes in virtual reality games using symbolic machine learning algorithms

Link: https://arxiv.org/abs/2606.12214
Authors: Thiago Porcino,Erick Oliveira Rodrigues,Flavia Bernardini,Daniela Trevisan,Esteban Clua
Categories: Human-Computer Interaction (cs.HC); Graphics (cs.GR)
Comments:

Abstract:Virtual reality (VR) and head-mounted displays are constantly gaining popularity in various fields such as education, military, entertainment, and health. Although such technologies provide a high sense of immersion, they can also trigger symptoms of discomfort. This condition is called cybersickness (CS) and is a frequent topic in recent virtual reality publications. This work proposes a novel experimental analysis using symbolic machine learning to rank potential causes of CS in VR games. We estimate CS causes and rank them according to their impact using classical machine learning. Experiments are performed using two virtual reality games and 6 experimental protocols along with 37 valid samples from a total of 88 volunteers. Our results show that rotation and acceleration triggered cybersickness more frequently in a flight game than in a race game. We also observed that subjects who are less experienced with VR are more prone to feel discomfort. Prior experience plays a more important role in the race game, as this game gives the user more freedom in terms of controllers, more displacement alternatives, and more user-controlled acceleration. Furthermore, different causes of discomfort arise depending on whether VR exposure is short or long term. We suggest strategies for mitigating CS in both short- and long-term exposure, and compare the two highlighted game scenarios (race and flight).

[HC-1] Channels and Substrates: Distributed Cognition as an Interaction Model for Ubiquitous Analytics

Link: https://arxiv.org/abs/2606.11986
Authors: Niklas Elmqvist,Panagiotis D. Ritsos,Peter W. S. Butcher
Categories: Human-Computer Interaction (cs.HC)
Comments: 16 pages, 8 figures

Abstract:Traditional HCI interaction models assume a single monolithic interface and a stable sensorimotor loop. These models fit poorly with cross-device (XVA) and ubiquitous analytics (UA), where interactive data sensemaking unfolds across multiple devices, artifacts, and people in disparate settings from the office to the factory floor. In this paper, we show how interaction in ubiquitous analytics can be modeled using distributed cognition as propagation of representational state across substrates – minds, speech, bodies, artifacts, and devices – rather than as traffic through a single interface. On this basis we introduce input and output channels as generalizations of the visual channels from data visualization: just as visual channels carry data through properties of the visual substrate, input and output channels carry representational state through substrates whose availability, suitability, and preferability depend on context. We demonstrate the channels and substrates framework by reanalyzing several ubiquitous, immersive, and situated analytics systems.

[HC-2] Somewhere Over the Desktop: A Research Agenda for Ubiquitous Analytics

Link: https://arxiv.org/abs/2606.11980
Authors: Niklas Elmqvist,Panagiotis D. Ritsos,Peter W. S. Butcher
Categories: Human-Computer Interaction (cs.HC)
Comments: 15 pages, 5 figures, 1 table

Abstract:Spatial computing, generative AI, and open web standards are converging. Three spatial operating systems – Android XR, Meta Horizon OS, and Apple visionOS – now ship with platform-level scene understanding. Wearable displays span the range from full headsets to slim smartglasses. Agentic AI operates on the same spatial substrates as the human user. This convergence enables new opportunities for ubiquitous analytics (UA): the use of many, physically distributed, networked devices to support data sensemaking anytime and anywhere. But proprietary platforms are settling design conventions that will calcify without evidence-based alternatives. UA has now matured to the point where its intellectual history can be read as a structured genealogy of foundations, contributions, and lineages. We trace this genealogy and organize it into clusters spanning cognition, context, interaction, platforms, visualization, collaboration, and evaluation. Finally, we cross these clusters with each other, yielding a total of 42 future research challenges.

[HC-3] Frozen Multimodal Embeddings for Personality and Cognitive Ability Assessment in Asynchronous Video Interviews

Link: https://arxiv.org/abs/2606.11930
Authors: Kuo-En Hung,Hung-Yue Suen,Shih-Ching Yeh,Hsiang-Wen Wang
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 1 figure, 4 tables

Abstract:Predicting psychological traits from asynchronous video interviews (AVIs) is a challenging multimodal learning problem because labeled datasets are limited while each response contains high-dimensional visual, acoustic, and verbal signals. This paper presents our solution for the ACM Multimedia AVI Challenge 2026, which evaluates two tasks: Track 1 predicts self-reported HEXACO personality traits from personality-related interview responses, and Track 2 classifies cognitive ability levels from structured AVI responses. We treat the problem as a small-sample representation learning task. Instead of fine-tuning large pretrained models, we use frozen multimodal encoders, including CLIP for visual features, Whisper for acoustic features and transcripts, and RoBERTa, E5, and DeBERTaV3 for textual representations, followed by low-capacity downstream models. For Track 1, our trait-specific regression and late-fusion system achieves an average validation MSE of 0.2696, improving over the official baseline of 0.3334. Ablation results show a three-step improvement from a global model (0.3189), to per-trait modeling (0.2871), to per-trait late fusion (0.2696), corresponding to a 19.1% relative MSE reduction over the official baseline. For Track 2, a compact subject-attribute baseline reaches 0.5781 accuracy, while our multimodal ensemble reaches 0.5313, both above the official baseline of 0.4062. We interpret this result as evidence of possible subject-attribute shortcuts in the validation split rather than robust cognitive inference from AVI content. Overall, our findings suggest that AVI-based psychological assessment benefits from trait-specific multimodal modeling, but cognitive ability prediction requires careful control of dataset shortcuts.
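
The 19.1% figure quoted in the abstract follows directly from the two MSE values it reports. The short check below reproduces that arithmetic; the helper name `relative_reduction` is an illustrative choice, and only the numbers stated in the abstract are used.

```python
def relative_reduction(baseline, value):
    """Fractional reduction of `value` relative to `baseline`."""
    return (baseline - value) / baseline


# MSE values as stated in the abstract (Track 1).
official_baseline = 0.3334
late_fusion = 0.2696

reduction = relative_reduction(official_baseline, late_fusion)
print(f"{reduction:.1%}")  # prints 19.1%
```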

[HC-4] PAPEL: A Collaborative System for Parental Guidance during Preschool Play-Based English Learning

Link: https://arxiv.org/abs/2606.11896
Authors: Xutong Wang,Yu Mei,Qinwei Li,Muyu Liu,Xiwen Yao,Chang Liu,Zhoutong Ye,Jie Cai,Chun Yu,Yuanchun Shi
Categories: Human-Computer Interaction (cs.HC)
Comments: 38 pages, 9 figures, 5 tables. Accepted to CSCW 2026 / To appear in Proceedings of the ACM on Human-Computer Interaction (CSCW 2026)

Abstract:Play-based parent-child interaction offers preschoolers rich opportunities for everyday foreign language learning, yet many parents struggle to turn open-ended play into effective English-as-a-Foreign-Language (EFL) learning experiences at home. To explore how AI might support this process, we conducted formative studies through interviews and a Wizard-of-Oz study. We identified four key challenges: content selection, language expression, balancing instruction and play, and problem solving. To address these challenges, we present PAPEL, a parent-AI collaborative system that grounds suggestions in the ongoing play scene and organizes support into four core modules: content generation, language adaptation, balance assessment, and extended response. In a counterbalanced within-subjects study with 16 parent-child dyads, PAPEL was associated with more integrated parent utterances that combined playful and instructional content, as well as more parent-child conversational turns, than the lightweight chatbot baseline used in our study.

[HC-5] Designing AI-Supported Focus Groups: A Role x Modality Playbook

Link: https://arxiv.org/abs/2606.11835
Authors: Zhiqing Wang,Steven Dow
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Collecting participants’ lived experiences is central to design research. Focus groups are uniquely valuable because participants not only share individual accounts but also respond to one another, surfacing comparison, disagreement, and collective sensemaking. However, focus groups are resource-intensive and highly sensitive to facilitation: moderators must probe for specificity, balance participation, manage topic flow, and sustain psychological safety, and subtle facilitation choices can shape what becomes salient. Recent HCI work and commercial meeting tools show that generative AI can scaffold live conversation through prompting, turn regulation, thematic mapping, and real-time summarization. Yet UXR teams lack a clear map of what these capabilities mean in focus groups and what methodological risks they introduce. We synthesize AI supports for live conversation and translate them into a focus-group-specific playbook organized by AI role (tool, co-host, host) and modality (text, voice, embodied). We characterize interactional trade-offs and identify open questions for evaluating AI-supported focus groups as methodological configurations.

[HC-6] Understanding and Supporting Online Discussion with Opinionated Chatbots

Link: https://arxiv.org/abs/2606.11693
Authors: Tianqi Song, Chi-Lan Yang, Zihan Liu, Zhengtao Xu, Yibin Feng, Yi-Chieh Lee
Categories: Human-Computer Interaction (cs.HC)
Notes:

Abstract:Opinionated chatbots are increasingly present on online platforms and have the potential to shape public discourse by influencing individuals’ viewpoints before they engage in discussions. Despite their growing presence, the impact of interacting with opinionated chatbots on subsequent online interactions remains largely unexplored. This study investigated how exposure to different types of opinionated chatbots, specifically those expressing opposing, reinforcing, or balanced viewpoints, affected participants’ subsequent online discussions. In a controlled experiment with 83 participants, we found that interacting with an opinionated chatbot that consistently opposed participants’ arguments led to greater shifts in opinion, indicating enhanced openness to revising one’s initial stance. Conversely, participants who interacted with a chatbot that consistently reinforced their views were more likely to adopt more agreeable communication styles in subsequent conversations with others. Furthermore, interactions with different types of opinionated chatbots resulted in varying levels of trust, as well as different perceptions of chatbots and human interlocutors. Our findings indicate that opinionated chatbots can influence both individuals’ opinions on social topics and their communication behaviors in online environments. This presents a trade-off for future designers seeking to facilitate cognitive flexibility in changing opinions while maintaining positive user experiences and trust in the chatbots during public discourse. We discuss the implications for designing opinionated chatbots to promote more constructive and less polarized online

[HC-7] Learning by Chatting? Investigating the Impact of Generative AI on Information Seeking and Learning

Link: https://arxiv.org/abs/2606.11669
Authors: Shravika Mittal, Su Lin Blodgett, Q. Vera Liao
Categories: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
Notes:

Abstract:Generative AI (GenAI) tools offer increasing opportunities for augmenting human cognitive tasks. Among these tasks, information seeking is being rapidly reshaped by GenAI tools, with potentially profound implications for learning and knowledge acquisition. To investigate these implications, we conducted a between-subjects field experiment in which participants pursued informal learning by seeking information through either ChatGPT or Google Search over a span of 8 days. Using a daily diary protocol, we gathered in-situ data on their information-seeking processes. Our findings show that participants in the ChatGPT group experienced diminished agency in their information-seeking processes, as they offloaded much of the information selection to AI, and consequently experienced greater meta-cognitive load arising from this reduced sense of control. We further highlight two sources of distortion in information access when using ChatGPT: biases in ChatGPT outputs, particularly towards providing solution-oriented artifacts over principled knowledge; and systematic shifts in users’ information-seeking behaviors, whereby the conversational and socially-oriented interaction paradigm of current GenAI tools may inadvertently reduce exploration of the broader knowledge space. As a result, on average, participants in the ChatGPT group had worse learning outcomes than those using Google, especially for higher-order critical learning. Our work suggests inherent tensions between offloading information seeking to AI and meaningful learning, and provides broader implications for understanding AI’s risks to human cognition.

[HC-8] 3-Key-Input: Exploring the Theoretical Minimum Keys for Text Entry ICASSP2026

Link: https://arxiv.org/abs/2606.11642
Authors: Naoki Kimura
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Notes: 6 pages, 1 figure, 7 tables. Published in ICASSP 2026

Abstract:How far can we reduce the number of physical keys if we endow an ambiguous keyboard with modern language models? Fewer keys increase hardware design freedom in constrained settings such as assistive devices and mobile form factors. This paper systematically evaluates text entry systems using 2-5 physical keys combined with language-model-based disambiguation. On a 300-sentence English corpus (100 sentences each for Business / Conversational / Technical), we compare key counts (2-5), letter-to-key mappings (layout-based / frequency-based / intentionally worst-case), and decoders (Trie-only, GPT-2 beam search, GPT-4o selection). We find that 3 keys + GPT-4o achieves character error rate (CER) 9.46% and word error rate (WER) 12.20%, reducing CER by 59% relative to 2 keys (CER 23.3%). At 3 keys, the key-stream entropy is 1.54 bits/char; while increasing to 5 keys improves accuracy (CER 5.4%), the marginal gains diminish. Mapping choice has a small impact under standard designs ($\Delta$CER 0.5 pp), and even an intentionally worst mapping degrades CER by only +0.5 pp, whereas Technical sentences yield roughly twice the error rate of Business. These results suggest that, in our evaluated offline setting under a strong LM prior, 3 keys are a practical minimum for general English.
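
To make the idea of an ambiguous keyboard concrete, here is a minimal sketch of dictionary-based disambiguation, roughly in the spirit of the Trie-only decoder the abstract mentions. The 3-key layout, the tiny vocabulary, and all function names are assumptions made for illustration; the abstract does not give the paper's actual mappings or decoders. The last helper only reproduces the abstract's arithmetic for the relative CER reduction.

```python
from collections import defaultdict

# Hypothetical 3-key layout: the alphabet split into three contiguous groups.
# This is an illustrative assumption, not a mapping taken from the paper.
KEY_GROUPS = ["abcdefghi", "jklmnopqr", "stuvwxyz"]
LETTER_TO_KEY = {ch: str(k) for k, group in enumerate(KEY_GROUPS) for ch in group}


def encode(word):
    """Map a word (letters a-z only) to the ambiguous key sequence a user would type."""
    return "".join(LETTER_TO_KEY[ch] for ch in word.lower())


def build_index(vocab):
    """Group vocabulary words by key sequence, most frequent word first."""
    index = defaultdict(list)
    for word in vocab:
        index[encode(word)].append(word)
    for words in index.values():
        words.sort(key=lambda w: -vocab[w])
    return dict(index)


def decode(keys, index):
    """Return the most frequent word matching the key sequence, or None."""
    candidates = index.get(keys, [])
    return candidates[0] if candidates else None


def relative_reduction(baseline, improved):
    """Relative error reduction, e.g. for comparing CER across key counts."""
    return (baseline - improved) / baseline


# Toy vocabulary with made-up frequency counts.
VOCAB = {"the": 100, "cat": 20, "act": 15, "bat": 10, "dog": 12}
INDEX = build_index(VOCAB)
```

With this layout, "cat", "act", and "bat" all collapse to the key sequence "002", so the decoder must rely on a prior (here, raw frequency) to pick one. A language model plays that role in the paper's stronger decoders. `relative_reduction(23.3, 9.46)` evaluates to about 0.594, consistent with the 59% figure quoted in the abstract.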

[HC-9] Knowing When to Ask: Self-Gated Clarification for Hierarchical Language Agents

Link: https://arxiv.org/abs/2606.11349
Authors: Aijing Gao, Yiming Kang, Mengdie Flora Wang, Jae Oh Woo
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Notes:

Abstract:In hierarchical reasoning, failures often originate at intermediate decision points where the agent commits to a wrong branch without recognizing that it lacks critical information. Rather than treating clarification as an external uncertainty trigger, we propose ACTION-RATING, a formulation that places it inside the agent’s action space on a shared ordinal scale with navigation, so that asking competes directly with acting at every decision point and help-seeking becomes observable at intermediate states. Two structurally distinct information-seeking modes emerge from the agent’s own ratings: mandatory (no viable branch) and opportunistic (residual uncertainty despite a leading candidate). On Harmonized Tariff Schedule classification (30,000-node taxonomy, three benchmarks, 9 LLMs across 4 families), we observe a regime shift from mandatory to opportunistic clarification, with Information-Seeking Effectiveness (ISE), a local diagnostic defined as the fraction of help interactions followed by a correct next navigation step (not a final-task metric), rising from 50% to 74%. Three diagnostic contrasts fail to reproduce this structure. A separability test shows that the information-seeking pattern (mode split, ISE ranking) persists when answer quality is degraded (-18.8% accuracy), supporting an empirical separation between where an agent seeks help and the quality of the help it receives. Under the controlled answer channel, accuracy gains reach +16.2% at 10-digit; we read this as an upper bound on what better localization could unlock, not a deployment estimate.
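
The ISE definition in the abstract, together with the idea that asking competes with navigation on one rating scale, can be sketched in a few lines. Everything below is a hypothetical illustration: the rating scale, the `viable` threshold, the tie-breaking rule, and the function names are assumptions, not the paper's ACTION-RATING implementation.

```python
def choose_action(branch_ratings, ask_rating, viable=3):
    """Pick the highest-rated action when 'ask' shares a scale with branches.

    branch_ratings: dict mapping branch name -> ordinal rating (assumed scale).
    ask_rating: rating the agent gives to asking for clarification.
    viable: assumed threshold below which no branch counts as viable.
    Returns (action, mode), where mode is None when the agent navigates.
    """
    best_branch = max(branch_ratings, key=branch_ratings.get)
    best_rating = branch_ratings[best_branch]
    if ask_rating > best_rating:
        # Mandatory: no viable branch. Opportunistic: a leading candidate
        # exists, but asking is still rated higher.
        mode = "mandatory" if best_rating < viable else "opportunistic"
        return "ask", mode
    return best_branch, None  # ties go to navigation (an assumption)


def information_seeking_effectiveness(next_step_correct):
    """Fraction of help interactions followed by a correct next navigation step.

    next_step_correct: list of booleans, one per help interaction.
    Returns None when there were no help interactions.
    """
    if not next_step_correct:
        return None
    return sum(next_step_correct) / len(next_step_correct)
```

For example, `choose_action({"a": 1, "b": 2}, ask_rating=4)` yields a mandatory ask, while `choose_action({"a": 4, "b": 2}, ask_rating=5)` yields an opportunistic one. ISE computed this way is a local diagnostic only; it says nothing about final-task accuracy, as the abstract itself notes.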

[HC-10] Towards a Joint Understanding of Remote Operation for Vehicles in Public Road Traffic

Link: https://arxiv.org/abs/2606.11336
Authors: Elisabeth Shi, Maria-Magdalena Wolf, Nina Theobald, Bettina Abendroth, Eugen Wige, Johannes Springer, Katharina Hottelart, Andreas Schrank, Thorben Brandt, Michael Oehl, Frank Diermeyer, Lena Plum
Categories: Human-Computer Interaction (cs.HC); Emerging Technologies (cs.ET); Systems and Control (eess.SY)
Notes:

Abstract:Sustained driving automation systems are envisioned to be used as the foundation for driverless mobility services. However, both researchers and practitioners acknowledge that current driving automation systems are not yet able to handle all traffic situations that a human driver can handle. To bridge this gap and enable mobility services without an in-vehicle human driver or fallback, remote operation (or teleoperation) is increasingly discussed. Recently, first legal actions have been taken to enable some forms of remote operation on public roads. Remote operation encompasses a broad spectrum of methods to support a driving automation system, ranging from remote assistance, which includes providing information or releasing a maneuver, to remote driving, which includes driving the vehicle from a remote location. As such, safe implementation of remote operation in public road traffic challenges the collaboration of multiple academic disciplines (e.g. engineering, psychology, informatics, law, etc.) and stakeholders (e.g. remote operation service providers, remote operators, vehicle manufacturers, regulatory authorities, etc.). At the same time, the interdisciplinary discourse is often challenging due to differing expectations and language. To build a common ground, this article traces terminology back to the original differences in information processing both on human and vehicle side. This framework aims to help further discourse by directly specifying what is needed to engage a diverse audience including researchers and stakeholders of different backgrounds and interests. Recently discussed forms of teleoperation are integrated into this framework.

[HC-11] Traits Run Deeper: Trait-Specific Asymmetric Fusion for Personality Assessment

Link: https://arxiv.org/abs/2606.11269
Authors: Jia Li, Qian Chen, Wei Wang, Xinyu Li, Zhenzhen Hu, Dongsheng Shao, Richang Hong, Meng Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Notes:

Abstract:Personality assessment aims to infer stable personality traits from dynamic behaviors across language, voice, and facial cues. Since different personality dimensions are revealed through distinct behavioral perspectives, modeling trait-specific evidence is challenging. However, most existing approaches adopt a uniform multimodal fusion strategy across all dimensions, assuming identical modality contributions. This overlooks trait-specific modality preferences and introduces cross-modal interference. To address this issue, we propose a novel personality assessment framework called Traits Run Deeper, which consists of three components. Specifically, the Multimodal Foundation Representation (MFR) module constructs personality-oriented multimodal inputs and leverages psychology-informed semantic templates as anchors, enabling foundation models to capture trait-relevant information. Building upon MFR, the Trait-Specific Modality Fusion (TSMF) module acts as an asymmetric fusion mechanism, allowing each dimension to selectively exploit different modality pathways from modality-specific modeling to complementary fusion. Thus, TSMF captures heterogeneous modality preferences while reducing cross-modal contamination. Furthermore, the Distribution-Calibrated Personality Regression (DCPR) module mitigates label imbalance and central tendency bias through target distribution calibration, improving robustness and stability. Experimental results on the AVI Challenge 2026 validation set demonstrate the effectiveness of the proposed framework, reducing mean squared error (MSE) by approximately 25% compared with the baseline. Consistent improvements are observed on the official test set, where our method achieves the best performance and ranks first in the Personality Assessment Track. The source code will be made available at this https URL.

[HC-12] Preregistration for Experiments with AI Agents ICML2026

Link: https://arxiv.org/abs/2606.11217
Authors: Michelle Vaccaro
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Notes: Accepted at ICML 2026 as a Spotlight (Top 5%) Position Paper

Abstract:The proliferation of large language models (LLMs) and autonomous AI agents has given rise to a rapidly growing methodological paradigm: “in silico” behavioral experiments. Originally conceived as a way to use AI agents as proxies for human participants in studies of cognition, decision-making, and social dynamics, this approach has taken on new significance – as AI agents increasingly negotiate, transact, and make consequential decisions on behalf of people and organizations, understanding their behavior has become a research priority in its own right. While these experiments with AI agents offer unprecedented advantages in terms of scalability, cost efficiency, and experimental control, they also inherit, and in some cases amplify, methodological vulnerabilities that have long plagued human subjects research. To address these issues, this paper argues that preregistration practices – central to improving the credibility of human subjects experiments – should now be extended to experiments with AI agents. We systematically catalog the researcher degrees of freedom that experiments with AI agents introduce – model selection, prompt wording, settings, and outcome-contingent redesign, for example – and show how the low cost of iteration and lack of reporting norms make these choices both easy to exploit and difficult to detect. We propose a preregistration template tailored to experiments with AI agents and call on conferences, journals, and funding agencies to make preregistration standard practice for this emerging research paradigm.

[HC-13] From Awareness to Action: Understanding and Overcoming the Research-Practice Gap in Algorithmic Fairness for Public Health

Link: https://arxiv.org/abs/2606.11214
Authors: Sara Altamirano, Tijs Portegies, Sennay Ghebreab
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Notes: Extended version of an accepted IASEAI’26 paper; includes technical appendices. 22 pages, 2 figures

Abstract:Algorithmic fairness is essential for responsible ML-driven public health research, yet its practical implementation remains limited. To investigate this awareness-action gap, we conducted a sequential mixed-methods study comprising expert interviews, an online survey, and systematic mapping. The expert interviews informed the design of the survey, which in turn revealed fragmented definitions of fairness, limited training and guidance, reliance on external sources, and rare use of formal assessment, mitigation, or monitoring. These findings were subsequently mapped onto three established research-practice gap lenses: the Knowledge-Practice Gap, the Knowledge-to-Action Cycle, and the Knowing-Doing Gap, each offering complementary perspectives. Building on this synthesis, we introduce the Fairness-to-Action framework, which integrates methodological, organizational, and systemic dimensions to identify where translation of algorithmic fairness knowledge stalls. Our analysis shows that fairness remains weakly institutionalized, translation mechanisms are externally driven, and system-level priorities continue to emphasize accuracy over fairness. These insights suggest critical leverage points for advancing safe, fair, and ethical ML-driven public health research practice.

[HC-14] From Consumption to Reflection: Designing Human-AI Relations for Stable Reasoning

Link: https://arxiv.org/abs/2606.11195
Authors: Rikard Rosenbacke, Carl Rosenbacke, Victor Rosenbacke, Martin McKee
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Notes:

Abstract:Large language models (LLMs) have transformed how humans access information, but not how we reason with it. Their fluency accelerates consumption while bypassing the slow, reflective processes that underpin sound judgment. This paper introduces Relational Reflective Intelligence (RRI), an inference-time governance layer that operationalizes reflection through auditable reasoning loops. RRI operates not inside the model but around it, providing a practical structure for stable, auditable reasoning between humans and LLMs. The core premise is that LLMs inherit cognitive vulnerabilities similar to those that shape human thought: reliance on intuitive shortcuts, confusion between representation and reality, and a preference for coherence over falsification. When humans and models share these tendencies, their errors compound. We refer to this as relational drift, a failure that arises from interaction rather than from the model alone. Addressing this requires a shift from modeling relations between words to structuring relations between model outputs and human reasoning. RRI provides this missing layer through three components: the Rose-Frame, which identifies likely breakdowns in reasoning; the Architect’s Pen, which introduces targeted reflection steps at critical moments; and an inference-time workflow that embeds these steps without retraining the model. Together, these elements transform human-AI interaction into a joint reasoning system with explicit checkpoints, conflict surfacing, and an auditable trail of assumptions. Rather than making machines think like humans or forcing humans to reason like machines, RRI creates a structured interaction in which both compensate for each other’s limitations. It reframes AI safety as a cognitive architecture problem, where reliable decisions depend on embedding reflection directly into the interaction process. 

Computer Vision

[CV-0] Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

Link: https://arxiv.org/abs/2606.12412
Authors: Cheng-Yu Yang, Shao-Yuan Lo, Yu-Lun Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: Code: this https URL

Abstract:Vision-language models (VLMs) project images into hundreds to thousands of visual tokens, making decoder inference expensive in both attention computation and KV-cache memory. Existing visual-token reduction methods largely follow a rank-and-remove paradigm: they score visual tokens, keep a compact subset, and permanently discard the rest. We show that this irreversible action is fragile because visual-token importance changes across decoder depth; tokens ranked low at one stage may become relevant in later layers, especially for grounding-sensitive queries. We propose Reroute, a training-free plug-in that replaces removal with recoverable routing. At each routing stage, selected vision tokens pass through decoder blocks, while deferred tokens bypass the stage and re-enter the candidate pool at the next routing decision. Reroute reuses existing attention-score ranking rules and stage-wise schedules, preserving the theoretical TFLOPs and KV-cache budget class of the pruning method it augments. Across FastV, PDrop, and Nüwa variants on LLaVA-1.5 and Qwen backbones, reroute improves grounding under aggressive token reduction while maintaining general VQA performance. These results suggest that VLM token reduction should not be viewed only as irreversible pruning, but also as recoverable routing. The code can be found here: this https URL
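
The contrast between rank-and-remove and recoverable routing can be illustrated with a toy selection loop. This is a sketch under strong simplifications: tokens are plain IDs, the per-stage scores are made-up numbers standing in for attention-based rankings, and hidden states (which deferred tokens would carry unchanged past a stage) are not modeled. The function names and the `keeps` schedule are hypothetical, not the paper's API.

```python
def top_k(pool, scores, k):
    """Return the k highest-scoring tokens from pool (stable for ties)."""
    return sorted(pool, key=lambda t: -scores[t])[:k]


def rank_and_remove(tokens, stage_scores, keeps):
    """Irreversible pruning: tokens dropped at one stage never return."""
    pool = list(tokens)
    selected_per_stage = []
    for scores, k in zip(stage_scores, keeps):
        pool = top_k(pool, scores, k)
        selected_per_stage.append(list(pool))
    return selected_per_stage


def recoverable_routing(tokens, stage_scores, keeps):
    """Recoverable routing: deferred tokens re-enter the pool at the next stage."""
    pool = list(tokens)  # the candidate pool is never shrunk
    selected_per_stage = []
    for scores, k in zip(stage_scores, keeps):
        selected_per_stage.append(top_k(pool, scores, k))
    return selected_per_stage


# Toy example: token "c" is unimportant at stage 1 but important at stage 2.
TOKENS = ["a", "b", "c", "d"]
STAGE_SCORES = [
    {"a": 0.9, "b": 0.8, "c": 0.2, "d": 0.1},
    {"a": 0.7, "b": 0.3, "c": 0.9, "d": 0.1},
]
KEEPS = [2, 2]  # same number of tokens processed per stage in both schemes
```

In this toy run, `rank_and_remove` processes `["a", "b"]` at both stages, because "c" was discarded before its importance rose. `recoverable_routing` processes `["a", "b"]` and then `["c", "a"]`, with the same number of tokens passing through each stage.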

[CV-1] How Seemingly Inconsequential Design Choices Dictate Performance of LLMs in Pathology

Link: https://arxiv.org/abs/2606.12407
Authors: Kian R. Weihrauch, Thomas A. Buckley, William Lotter, Arjun K. Manrai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:General-purpose large language models (LLMs) are routinely used as baselines when evaluating specialized pathology models on whole-slide images (WSIs). Because WSIs exceed contemporary model context limits, LLM baselines routinely use small, high-magnification patches processed independently via majority voting, without systematic evaluation of seemingly inconsequential design choices such as patch size, patch count, and magnification. Generalist LLMs have consistently underperformed specialized systems, reinforcing the perception that domain-specific training or architectural adaptation is necessary for pathology tasks involving WSIs. Here, we conduct a systematic factorial analysis of four input design factors: inference mode, patch size, magnification, and patch count. We demonstrate that prior studies have overstated the gap between specialized models and general-purpose LLMs by choosing non-optimized input configurations. On the MultiPathQA benchmark, switching to a single balanced configuration (large patches at lower magnification, processed jointly) raises GPT-5 from 15.1% to 39.5% on cancer-type classification (TCGA) and from 38.1% to 62.9% on organ classification (GTEx). Per-task optimization yields further gains up to 43.9% (TCGA) and 71.6% (GTEx). The same configuration generalizes to two other models and to a fully held-out CPTAC cohort, where it improves Gemini 3 Flash by 23.4 percentage points without any task-specific tuning.
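
Two ingredients of the evaluation setup described above lend themselves to a short sketch: majority voting over independently processed patches, and a full factorial grid over the four input design factors. The factor levels, the tie-breaking rule, and the function names below are placeholders chosen for illustration; the abstract does not list the values the authors tested.

```python
from collections import Counter
from itertools import product


def majority_vote(patch_predictions):
    """Aggregate per-patch labels into one slide-level label.

    Ties are broken in favor of the label that appears first (an assumption).
    """
    if not patch_predictions:
        raise ValueError("need at least one patch prediction")
    counts = Counter(patch_predictions)
    top = max(counts.values())
    for label in patch_predictions:
        if counts[label] == top:
            return label


# Hypothetical levels for the four factors named in the abstract.
FACTORS = {
    "inference_mode": ["independent_vote", "joint"],
    "patch_size": [512, 2048],
    "magnification": ["5x", "20x"],
    "patch_count": [4, 16],
}


def factorial_grid(factors):
    """Enumerate every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


CONFIGS = factorial_grid(FACTORS)
```

With two levels per factor, the grid contains 16 configurations, and each one would be scored on the benchmark to separate the effect of one factor from the others. `majority_vote(["lung", "breast", "lung"])` returns `"lung"`.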

[CV-2] DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?

Link: https://arxiv.org/abs/2606.12402
Authors: Jadelynn Dao, Milan Ganai, Yasmina Abukhadra, Ajay Sridhar, Mozhgan Nasr Azadani, Katie Luo, Clark Barrett, Jiajun Wu, Chelsea Finn, Marco Pavone
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Vision-Language Models (VLMs) are increasingly deployed as high-level planners for embodied agents, with an emerging strategy of scaling test-time compute to improve capability. However, we observe that doing so increases latency, token usage, and FLOPs while yielding uneven, often diminishing gains in downstream success, limiting where embodied agents can be deployed. We argue that choosing when and where to spend test-time compute is central to bringing frontier performance to the real world. We introduce DIRECT, a routing framework that uses multimodal scene context to allocate compute per prompt, improving the success–cost Pareto frontier over fixed model selection. Across three dominant scaling axes, namely chain-of-thought depth, model size, and memory history, our experiments on VLABench and RoboMME show that test-time compute is not a uniform lever: different axes yield qualitatively distinct capability gains. We validate these insights on a physical Franka arm in a DROID setup spanning zero-shot manipulation and long-horizon chaining, where our router matches or exceeds a stronger model’s success rate at up to 65% lower average latency. Ultimately, our results show that naively scaling test-time compute is wasteful, and that DIRECT can provide frontier-level embodied planning in robotic systems at a fraction of the cost. Project page can be found at this http URL.

[CV-3] VLGA: Vision-Language-Geometry-Action Models for Autonomous Driving

Link: https://arxiv.org/abs/2606.12396
Authors: Jin Yao, Dhruva Dixith Kurra, Tom Lampo, Zezhou Cheng, Danhua Guo, Burhan Yaman
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Notes: Project page: this https URL

Abstract:Vision-language-action (VLA) models can describe scenes and reason about them in language, yet still struggle to ground their actions in the dense 3D world around them. Existing approaches either inject features from a frozen 3D foundation model without an objective that ensures the policy uses them, or constrain geometry with sparse box and map losses that provide no dense spatial signal. We introduce VLGA, the first vision-language-action model supervised to reconstruct the dense 3D world it drives through. VLGA introduces geometry as a fourth modality alongside vision, language, and action through a dedicated expert supervised by a per-pixel pointmap regression loss against LiDAR. Extensive experiments conducted on challenging nuScenes and Bench2Drive datasets for open-loop and closed-loop evaluations, respectively, show the superiority of VLGA over counterpart VLA methods. In particular, on open-loop nuScenes, VLGA sets a new state of the art among VLA methods without ego status, with the lowest L2 (0.50 m average) and 3-second collision rate (0.18%). On closed-loop Bench2Drive, VLGA attains the state-of-the-art driving score of 79.08, +0.71 over the strongest prior VLA, at comparable efficiency and comfort.

[CV-4] Illumination-Robust Camera-Based Heart-Rate Estimation for Physiological Sensing in Robots

Link: https://arxiv.org/abs/2606.12378
Authors: Zhi Wei Xu, Torbjörn E. M. Nordling
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: 8 pages, 4 figures

Abstract:Physiological awareness is important for service, social, and assistive robots that interact with humans in everyday environments. Remote photoplethysmography (rPPG) enables non-contact heart-rate (HR) estimation from an RGB camera, making it a promising sensing modality for robot-mounted vision systems. However, illumination variation remains a major barrier to robust deployment. This paper presents an end-to-end spatial-temporal transformer framework for remote HR estimation on a new dataset with varied illumination. Our estimator integrates PRNet-based 3D face alignment, clip-level illumination augmentation, the Residual Temporal Standardization Module, and controlled hybrid temporal-frequency supervision. The training objective combines a Soft-Shifted Pearson waveform loss with a spectral Kullback-Leibler divergence loss, where a tuned weight ($\beta$) controls the contribution of frequency-domain heart-rate guidance. Experiments on a static all-level mix protocol covering three illumination levels show that $\beta = 5$ provides the strongest result among the tested $\beta$ settings, achieving a best-run HR mean absolute error (MAE) of 0.79 bpm and an HR correlation of 0.982. Compared with the PhysFormer baseline evaluated on our dataset, our estimator reduces HR MAE by 93.6%, while increasing HR correlation from 0.088 to 0.982, making it usable when illumination varies.

[CV-5] Semantically-Aware Diver Activity Recognition Framework for Effective Underwater Multi-Human-Robot Collaboration

Link: https://arxiv.org/abs/2606.12374
Authors: Sadman Sakib Enan, Junaed Sattar
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Effective multi-human-robot collaboration is essential for expanding human-led operations in the challenging and high-risk underwater environment. For autonomous underwater vehicles (AUVs) to become true teammates, they must be able to comprehend their surroundings and recognize a diver’s activities to offer assistance and ensure safety. Towards this goal, we introduce DAR-Net, a novel transformer-based framework that analyzes complex underwater scenes to classify diver activities. Our contribution lies in a semantically guided learning formulation that couples transformer-based temporal reasoning with pixel-level scene supervision. This multi-loss training strategy explicitly aligns global activity recognition with local human-robot interaction semantics, which is particularly critical in low-visibility underwater conditions. To address the significant challenge of data scarcity in this domain, we present the first-ever Underwater Diver Activity (UDA) dataset, a foundational resource containing over 2,600 annotated images with pixel-level masks. Through rigorous experimental evaluations in a controlled environment, we demonstrate that DAR-Net achieves promising accuracy in recognizing six distinct diver activities, outperforming state-of-the-art models. While this dataset provides a crucial baseline, our work serves as a pioneering step, laying the groundwork for future research and facilitating the development of more intelligent, collaborative underwater robotic systems.

[CV-6] A Turbo-Inference Strategy for Object Detection and Instance Segmentation

Link: https://arxiv.org/abs/2606.12371
Authors: Zhen Zhao, Gang Zhang, Xiaolin Hu, Liang Tang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Preprint version of an article published in Computer Vision and Image Understanding

Abstract:Object detection and instance segmentation tasks are closely related. Existing top-down instance segmentation methods usually follow a detect-then-segment paradigm, where an initial detector is used to recognize and localize objects with bounding boxes, followed by the segmentation of an instance mask within each bounding box. In such methods, the detection accuracy directly influences the subsequent segmentation performance. However, previous research has seldom explored the impact of the instance segmentation task on object detection. In this paper, we present a turbo-inference strategy for top-down methods that iteratively leverages the complementary information between the detection and segmentation tasks. Specifically, we design two modules, a turbo-detection head and a turbo-segmentation head, which facilitate communication between the tasks. The two modules form a closed loop that interlaces the detection and segmentation results without retraining the model. Comprehensive experiments on the COCO, iFLYTEK, and Cityscapes datasets demonstrate that our method substantially enhances both detection and segmentation accuracies with a certain increase in computational cost. The proposed method represents a tradeoff between prediction accuracy and inference speed. Code is available at this https URL.

[CV-7] DepthMaster: Unified Monocular Depth Estimation for Perspective and Panoramic Images

Link: https://arxiv.org/abs/2606.12368
Authors: Pengfei Wang, Shihao Wang, Liyi Chen, Zhiyuan Ma, Guowen Zhang, Lei Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:While monocular depth estimation has achieved significant progress, achieving generalized metric depth estimation for both narrow field-of-view (FoV) perspectives and $360^\circ$ panoramas remains an unsolved challenge. Existing methods are often tailored to specific camera types and struggle to produce accurate metric depth that generalizes across diverse settings. This limitation stems from two key challenges: the inherent geometric discrepancy between perspective and panoramic cameras, and the scarcity of panoramic training data with metric annotations. In this work, we introduce DepthMaster, a unified metric depth estimation framework. Rather than employing specialized networks to learn spherical distortions, we reformulate the problem by decomposing panoramic images into overlapping perspective patches. Crucially, distinct from prior projection-based methods that rely on ad-hoc architectural modifications to handle boundaries, we introduce a novel Correspondence Consistency Loss (CCL) and inject virtual projection cameras as geometric priors, allowing us to seamlessly stitch the patches while avoiding specialized operators and keeping the backbone largely compatible with standard Transformer designs. This strategy also resolves the geometric differences by unifying all inputs into a canonical perspective representation, and effectively circumvents data scarcity by directly unlocking powerful metric priors from vast perspective datasets. Trained on a mixed dataset that contains only one panorama dataset, DepthMaster achieves state-of-the-art zero-shot performance on 13 diverse datasets, outperforming not only universal methods but also leading specialist models in both perspective and panoramic domains.

[CV-8] Atlas HE-TME: Scalable AI-Based Tissue Profiling at Expert Pathologist-Level Accuracy

Link: https://arxiv.org/abs/2606.12346
Authors: Kai Standvoss, Miriam Hägele, Rosemarie Krupar, Julika Ribbat-Idel, Jennifer Altschüler, Gerrit Erdmann, Hans Pinckaers, Evelyn Ramberger, Madleen Drinkwitz, Ádám Nárai, Alexander Möllers, Katja Lingelbach, Sebastian Kons, Lukas Hönig, Recepcan Adigüzel, Joana Baião, Alberto Megina Gonzalo, Marius Teodorescu, Marie-Lisa Eich, Paolo Chetta, Shakil Merchant, Verena Aumiller, Simon Schallenberg, Andrew Norgan, Klaus-Robert Müller, Lukas Ruff, Maximilian Alber, Frederick Klauschen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

Abstract:Hematoxylin and eosin (HE) staining is the cornerstone of histopathology, yet scalable, quantitative analysis of HE whole-slide images (WSIs) remains a central challenge in computational pathology. We present Atlas HE-TME, an AI-based system built on the Atlas family of pathology foundation models that predicts tissue quality, tissue region, and cell type labels across multiple cancer types, yielding over 4,500 quantitative readouts per slide at cell-level resolution. A key challenge to validating such systems is overcoming morphological ambiguity inherent to HE-only ground truth and the limited scalability of more informed references drawing on modalities such as immunohistochemistry (IHC). We address this with a dual validation framework combining biologically grounded depth with technical and morphological breadth. For depth, we propose an IHC-informed multi-pathologist consensus protocol that substantially improves inter-rater agreement over conventional HE-only annotation. This yields a molecularly grounded reference against which we compare Atlas HE-TME and pathologists working from HE alone. For breadth, we benchmark Atlas HE-TME on over 200,000 high-confidence HE-only pathologist annotations across 1,500+ cases spanning eight cancer types and their most common metastatic sites, with subtypes covering 90% of clinical cases per cancer type, drawn from 25+ sources and 8+ scanner models. Benchmarked against the IHC-informed consensus, Atlas HE-TME matches or exceeds pathologist HE-only performance and generalizes consistently and robustly across this broad morphological and technical scope. In doing so, Atlas HE-TME turns the HE slide – the most ubiquitous data in pathology – into a scalable, quantitative window into the tumor and its microenvironment, laying a foundation for the next generation of tissue-based biomarkers in translational and clinical research.

[CV-9] Echoes of the Prior: A Computational Phenomenology of Forgetting

Link: https://arxiv.org/abs/2606.12340
Authors: Gege Gao, Bernhard Schölkopf, Andreas Geiger
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Memory is not merely the storage of data; it is the scaffolding of reality. When biological memory fades, the world does not simply turn black; it regresses into an unrecognizable chaos. Echoes of the Prior is an interactive installation that attempts to visualize this subjective phenomenology of forgetting. By inducing controlled synaptic decay within a Feed-Forward 3D Reconstruction model, we create an artistic analogy for the erosion of the brain’s predictive priors. We position the Neural Network not as a tool for engineering, but as a cognitive proxy - a silicon brain whose structural degeneration evokes the disorienting, poetic, and terrifying experience of losing one’s grip on the world. Ultimately, we offer this framework as a catalyst, inviting the wider community to explore the uncharted potential of neuromorphic aesthetics in visualizing the fragility of intelligence. An interactive demo is available at this https URL.

[CV-10] Anatomically Conditioned Recurrent Refinement for Topology-Aware Circle of Willis Segmentation

Link: https://arxiv.org/abs/2606.12319
Authors: Juraj Perić, Marija Habijan, Dario Mužević, Irena Galić, Danilo Babin, Aleksandra Pižurica
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 4 figures, 1 table. Accepted at EUSIPCO 2026

Abstract:Segmenting the Circle of Willis (CoW) from Magnetic Resonance Angiography (MRA) is challenging due to complex topology and thin vascular structures that are prone to fragmentation. Standard Convolutional Neural Networks (CNNs) often fail to capture these topological constraints, resulting in “broken vessel” artifacts. To address this, we propose the Anatomically Conditioned Recurrent Refinement U-Net (AC2RUNet). Our architecture decouples segmentation into two streams: a Static Stream that extracts invariant anatomical features and a lightweight Dynamic Stream that iteratively refines topological errors over time. We further introduce a dynamic curriculum learning strategy that transitions from high-recall geometric supervision to topology-aware constraints. Validated on the TopCoW dataset, AC2RUNet substantially reduces Hausdorff Distance (4.72 mm vs 9.17 mm) and Betti number errors (0.19 vs 0.40), improving topological connectivity over the nnU-Net baseline while maintaining comparable volumetric Dice.

[CV-11] Slots Transitions Loops: Learning Composable World Models for ARC

Link: https://arxiv.org/abs/2606.12316
Authors: Gege Gao, Bernhard Schölkopf, Andreas Geiger
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:ARC tests in-context rule induction: given a few input-output demonstrations, a model must infer the hidden rule and apply it to a new query. While many approaches express ARC rules through language, code, or symbolic programs, ARC itself is visual-symbolic: rules appear as grid transitions over objects, colors, shapes, and spatial relations. We introduce Loop-OWM, an object-centric world-modeling architecture that learns these rules as composable transitions over structured states. It combines color-prototype slots, demonstration-conditioned task summaries, and a looped transition model with dense propagation and slot-conditioned correction. On both ARC-1 and ARC-2, Loop-OWM outperforms non-looped and looped baselines with comparable or fewer parameters. These results suggest that ARC rules can be learned not only as language descriptions or searched programs, but also as transitions over visual-symbolic world states.

[CV-12] From 2D Grids to 1D Tokens: Reforming Shared Representations for Multimodal Image Fusion ICML2026

Link: https://arxiv.org/abs/2606.12303
Authors: Yuchen Xian, Yunqiu Xu, Yang He, Yi Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

Abstract:Multimodal image fusion aims to integrate complementary information from different modalities into a fused image that preserves rich local details while maintaining globally consistent appearance. Existing approaches build shared representations on 2D feature grids, which excel at modeling local structures but offer limited leverage over image-level global appearance factors. To balance these objectives, we introduce a compact 1D token interface based on a frozen pretrained image tokenizer for modeling non-local appearance/base factors. Rather than using the tokenizer as a reconstruction backbone, our design uses the 1D token space as a global carrier while retaining the 2D spatial pathway for local structure restoration. Specifically, we introduce Selective Token Editing (STE), which sparsely updates/replaces a small set of critical tokens, providing a lightweight mechanism to steer global appearance coherence while keeping the fusion backbone unchanged and avoiding extra losses. Experiments on four commonly used benchmarks show that our method achieves the best overall performance, with consistent, multi-metric improvements in both global coherence and local fidelity. Project page: this https URL

[CV-13] Natural-Language Temporal Grounding in Hour-Long Videos is a Search Problem: A Benchmark and Empirical Decomposition

Link: https://arxiv.org/abs/2606.12300
Authors: Sukmin Seo, Geewook Kim
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 figures, Code and benchmark: this https URL

Abstract:Temporal grounding–returning the interval [t_s, t_e] for a natural-language query over a video–is the language interface to long-form video, yet has been studied on short videos; the dynamics of hour-scale natural-language grounding remain underexplored. We take the position that at hour-scale, the binding constraint is search, not recognition: Video-LLMs are bottlenecked not by localizing a nearby event, but–given a natural-language query–by searching for the relevant region of a long video. To test this, we release ExtremeWhenBench, the first open hour-scale grounding benchmark (2,273 queries over 194 videos, mean 75.7 min, max 9 hr) with an open-form query distribution. Every open Video-LLM collapses while a frame-level retrieval baseline outperforms them; a failure taxonomy attributes 85% of failures to search; and a retrieve-then-ground hybrid recovers 6.7x over the monolithic Video-LLM–mirroring retrieve-then-read in open-domain QA.

[CV-14] Bridging the Modality Gap in Forensic Image Retrieval

Link: https://arxiv.org/abs/2606.12294
Authors: Ricardo González-Gazapo, Annette Morales-González, Yoanna Martínez-Díaz, Heydi Méndez-Vázquez, Milton García-Borroto
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 23 pages, 5 figures, paper submitted to Elsevier journal

Abstract:Automated image retrieval plays an increasingly critical role in modern forensic analysis, supporting investigative workflows that rely on efficient comparison of visual evidence. While prior work has focused primarily on developing and optimizing multimodal retrieval systems, limited attention has been paid to evaluating the forensic applicability of these technologies across diverse real-world scenarios. In this study, we present a unified retrieval framework adapted to four key forensic tasks: (1) tattoo image retrieval given a tattoo query image; (2) tattoo retrieval guided by human-expert textual descriptions, modelling the common situation where a witness verbally describes a tattoo; (3) tattoo retrieval from hand-drawn sketches; and (4) face retrieval from forensic face sketches. Our system leverages a multimodal large language model (MLLM) to automatically generate structured textual descriptions for all queries and gallery images, followed by sentence-transformer embedding for text-based comparison. We evaluate retrieval using visual-only embeddings, text-only embeddings and a multimodal fusion strategy that combines text- and image-based similarity scores derived from state-of-the-art visual feature extractors relevant to each task. The fusion of modalities consistently improves retrieval precision and robustness, especially in scenarios where visual information is limited or noisy (e.g., sketches, partial tattoos, or fragmented witness statements). This work highlights the forensic value of a unified multimodal retrieval pipeline and demonstrates how modern MLLMs can operationalize challenging forensic tasks that traditionally rely on manual expert analysis. Our results position multimodal retrieval as a promising tool for supporting investigative workflows involving tattoos, facial composites, and witness descriptions.
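The score-level fusion described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the min-max normalization, the weight `alpha`, and the function names `min_max`, `fuse`, and `rank` are all assumptions made for illustration, and the similarity values are invented toy numbers.

```python
def min_max(scores):
    """Rescale scores to [0, 1] so both modalities are comparable (assumed step)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(image_scores, text_scores, alpha=0.5):
    """Weighted sum of normalized image and text similarities (alpha is hypothetical)."""
    img = min_max(image_scores)
    txt = min_max(text_scores)
    return [alpha * i + (1.0 - alpha) * t for i, t in zip(img, txt)]


def rank(scores):
    """Gallery indices ordered from most to least similar."""
    return sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)


# Toy similarities of one query against three gallery items.
image_scores = [0.2, 0.8, 0.5]  # e.g. from a visual feature extractor
text_scores = [0.1, 0.4, 0.9]   # e.g. from sentence embeddings of generated descriptions

fused_scores = fuse(image_scores, text_scores, alpha=0.5)
print(rank(image_scores))  # ranking from image similarity alone
print(rank(fused_scores))  # ranking after fusion
```

In this toy case the text scores promote gallery item 2 above item 1, which shows how a second modality can reorder results when the visual evidence is weak.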

[CV-15] CellNet – Localizing Cells using Sparse and Noisy Point Annotations

Link: https://arxiv.org/abs/2606.12286
Authors: Benjamin Eckhardt, Dmytro Fishman, Stuart Fawke, Andrew Curtis, Bo Fussing, Constantin Pape
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Conference poster at Biology at Scale: From Variants to Cellular Programs and Functions

Abstract:Counting living cells is an important step in many biological research workflows. Our collaborators at the Wellcome Sanger Institute study vital genes in humans via large-scale saturation genome editing screening, which requires counting cells a great number of times. Computer-vision-based automation is crucial for high throughput and resource efficiency. In this work, we develop a regression-based deep learning computer vision algorithm to detect and count cells in phase-contrast microscopy images. To reduce annotation effort, which in practice often becomes a bottleneck, we focus on counting cells using only sparse point annotations, which are fast and easy to acquire. By comparison to state-of-the-art zero-shot methods, we show that regression-based counting is a promising alternative in low-data regimes. Through developing methods to automatically count living cells in microscopy images, we contribute to valuable research on the human genome. The code is available at this https URL.

[CV-16] Finding Sparse Subnetworks in One Training Cycle via Progressive Magnitude-Based Pruning

Link: https://arxiv.org/abs/2606.12278
Authors: Romana Qureshi, Hafida Benhidour, Said Kerrache, Nahlah Aljeraisy
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Neural network pruning reduces model size by removing less important parameters while aiming to preserve predictive performance. Although the Lottery Ticket Hypothesis (LTH) shows that sparse subnetworks can match dense networks when trained from suitable initializations, its iterative pruning procedure requires multiple complete training cycles. This work evaluates progressive magnitude-based pruning as a single-cycle alternative. The method gradually increases sparsity during training using a linear schedule and updates pruning masks based on active weight magnitudes. We conduct systematic experiments on CIFAR-10 and MNIST across ResNet, VGG-style, and LeNet architectures, comparing the proposed method with representative iterative and initialization-based pruning baselines, including LTH, SNIP, and GraSP. On CIFAR-10, the method achieves 95.12% accuracy on ResNet-18 at 72.9% sparsity, compared with 90.5% reported for LTH. At extreme sparsity, it achieves 93.13% accuracy on a VGG-like architecture at 97% sparsity, compared with approximately 92.0% for SNIP, and 93.44% accuracy on VGG-19 at 97.97% sparsity, compared with 92.19% for GraSP at 98% sparsity. A sparsity-accuracy analysis on ResNet-18 further shows that accuracy remains within 0.1 percentage points of the dense baseline across 70–85% sparsity. These results indicate that progressive magnitude-based pruning provides an effective single-cycle approach for neural network sparsification under the evaluated settings.
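The two ingredients named in this abstract, a linear sparsity schedule and mask updates based on the magnitudes of still-active weights, can be sketched roughly as follows. This is an illustrative toy, not the paper's code: the function names, the flat weight list, the start-at-zero schedule, and the choice that pruned weights never regrow are all assumptions.

```python
def linear_sparsity(step, total_steps, final_sparsity):
    """Target sparsity at `step`, rising linearly from 0 to `final_sparsity` (assumed form)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return final_sparsity * frac


def update_mask(weights, mask, target_sparsity):
    """Zero out the smallest-magnitude active weights until the target sparsity is met."""
    n_total = len(weights)
    n_target = int(target_sparsity * n_total)  # weights that should be pruned in total
    n_new = max(n_target - mask.count(0), 0)   # how many more to prune at this step
    active = sorted(
        (i for i in range(n_total) if mask[i] == 1),
        key=lambda i: abs(weights[i]),
    )
    new_mask = list(mask)
    for i in active[:n_new]:
        new_mask[i] = 0
    return new_mask


# Toy "layer" with ten weights, pruned to 50% sparsity over four steps.
weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2, -0.3, 0.08, 0.6, -0.15]
mask = [1] * len(weights)
for step in range(1, 5):
    sparsity = linear_sparsity(step, total_steps=4, final_sparsity=0.5)
    mask = update_mask(weights, mask, sparsity)
    # In real training, a gradient update on the surviving weights would happen here.

print(mask)
```

Because pruning is interleaved with a single training run, no retraining cycle is needed between pruning rounds, which is the contrast with iterative LTH-style procedures that the abstract draws.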

[CV-17] VOID: Defeating Unauthorized Mimicry in Latent Diffusion Models USENIX-SECURITY2026

Link: https://arxiv.org/abs/2606.12263
Authors: Chunlin Qiu, Ang Li, Tianxiao Huang, Ruilin Gan, Yunjie Ge, Shenyi Zhang, Huayi Duan, Lingchen Zhao, Chao Shen, Qian Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: To appear in the 35th USENIX Security Symposium (USENIX Security 2026)

Abstract:While Latent Diffusion Models (LDMs) have revolutionized visual synthesis, they are increasingly exploited for unauthorized mimicry of individuals. Existing defenses inject deceptive perturbations to steer the generated images toward irrelevant targets. However, this approach hinges on an ungrounded assumption: subtle perturbations can maintain their deceptive efficacy throughout an LDM’s extensive generation process. In reality, the model’s innate restoration mechanism will remove such perturbations and cause individual identities to re-emerge in the images generated. We propose VOID, a defense framework that overcomes this conundrum by manipulating an LDM’s intrinsic stochasticity. VOID perturbs the diffusion pipeline in two novel ways: 1) amplifying the latent encoding errors to shatter an image’s semantic structure, and 2) counteracting the target guidance signals to suppress the model’s restoration capabilities. This results in a semantic corruption that thwarts any unauthorized mimicry. Notably, the security gain does not come at the price of visual utility, as VOID simultaneously manages to confine perturbations to human-imperceptible regions of protected images. Our comprehensive evaluation of 24 state-of-the-art defenses against 10 mimicry attacks on 5 datasets demonstrates VOID’s unprecedented protection power: it increases the average Frechet Inception Distance (FID) from 113 to 365, a 223% improvement over the strongest defense to date.

[CV-18] Bridging Day and Night: Unsupervised Cross-Domain Re-Identification with Synergistic Prompt and Prototype Learning

Link: https://arxiv.org/abs/2606.12258
Authors: Jiyang Xu, Rui Liu, Hang Dai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Cross-domain day-night re-identification (ReID) is fundamentally challenged by the substantial visual appearance discrepancies between daytime and nighttime scenes. Existing fully supervised methods rely heavily on labor-intensive annotations, which are costly and exhibit limited generalization across domains. In this work, we investigate unsupervised day-night ReID and propose a novel framework that synergistically combines prompt learning and prototype-based representation learning to associate identities across domains without requiring manual labels. Our approach follows a progressive two-stage training strategy. In the first stage, we exploit the vision-language model to generate instance-specific textual prompts in an annotation-free manner. We employ an instance-level alignment mechanism to embed visual features and textual prompts into a unified semantic space, aligning unlabeled day/night images with learnable prompts via instance-aware dynamic-bias adaptation. In the second stage, we construct domain-specific prototype memory banks and introduce two complementary modules: i) an intra-domain identity association module to enhance feature discriminability within each domain, and ii) a cross-domain prototype matching module to reliably identify positive and negative prototype pairs, thereby establishing robust identity correspondences across day and night. Extensive experiments on public benchmarks validate the effectiveness of our method. Under the unsupervised setting, our framework attains Rank-1 accuracy comparable to state-of-the-art fully supervised methods.

[CV-19] Damage-TriageFormer: A Foundation-Model Framework for Typology-Based Building Damage Assessment from Mono-Temporal Imagery

Link: https://arxiv.org/abs/2606.12248
Authors: Yiming Xiao, Yu-Hsuan Ho, Sanjay Thasma, Junwei Ma, Ali Mostafavi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Decision-relevant building damage assessment is critical for prioritizing resources and recovery after a disaster, yet most automated methods either flatten damage into a single severity scale (no damage, minor, major, destroyed) or require paired pre- and post-event imagery that is often unavailable for emerging hazards. This paper presents Damage-TriageFormer, a single-image, post-event, footprint-conditioned model that produces a damage typology rather than a severity scale. We contribute: (1) DamageTriage-Bench, a new benchmark built from NOAA Emergency Response Imagery across Hurricane Michael (2018), Hurricane Helene (2024), and the 2025 Los Angeles wildfire complex, with five typology classes that distinguish roof damage from structural damage and, within each, partial from total extent; and (2) Damage-TriageFormer, which extends a DINOv3 ViT-L backbone with a Simple Feature Pyramid for higher-resolution instance pooling, a two-stage gated damage head, and an auxiliary severity-regression objective. Our model achieves macro F1 of 0.624 on validation and 0.619 on a held-out stratified test set, performing strongest where operational triage needs it most, with per-class F1 of 0.91 and 0.84 on undamaged buildings and total structural collapse, respectively. While the rare Total Roof Damage class remains difficult due to its limited examples and an inherently ambiguous label boundary, our results show that single-image post-event imagery can support actionable building damage typing, enabling targeted emergency response and resource allocation without a pre-event reference.

[CV-20] DrivingAgent: Design and Scheduling Agents for Autonomous Driving Systems

Link: https://arxiv.org/abs/2606.12236
Authors: Zhongyu Xia, Wenhao Chen, Yongtao Wang, Ming-Hsuan Yang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Many autonomous driving systems are increasingly incorporating foundation models to improve generalization and handle long-tail scenarios. However, this trend introduces two key challenges: (i) the manual and labor-intensive process of designing and integrating new models, and (ii) the lack of intelligent, dynamic scheduling mechanisms to meet strict real-time constraints. While Large Language Model (LLM)-based agents offer a promising avenue for automation, existing frameworks are ill-suited for autonomous driving. Specifically, they fail to distinguish between the fundamentally different requirements of system design and real-time scheduling, treat modules as opaque black boxes, and are not designed for continuous operation. To address these limitations, we propose DrivingAgent, a novel agent framework tailored to the dual challenges of autonomous driving system design and scheduling. In the design phase, DrivingAgent automates module development by interpreting system architecture, generating code, and validating modules via super-network training. In the scheduling phase, it employs a lightweight LLM trained with reinforcement learning to dynamically orchestrate system modules in real time, supported by a structured memory that integrates long-term storage with timestamped short-term context. Experimental results demonstrate that DrivingAgent achieves a superior speed–accuracy trade-off on both the nuScenes and Bench2Drive benchmarks.

[CV-21] An Electric Potential-Augmented Benchmark Dataset for Physics-Guided Image Reconstruction of Electrical Capacitance Tomography

Link: https://arxiv.org/abs/2606.12226
Authors: Xinqi Zhang, Qiming Ma, Lihui Peng
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:While deep learning has significantly advanced image reconstruction of Electrical Capacitance Tomography (ECT), most data-driven methods map directly between capacitance and permittivity distribution, treating the sensor as a black box. This overlooks the electric potential field – the fundamental physical link governing the nonlinear and ill-posed “soft-field” effect. To address this, we propose an electric potential-augmented ECT benchmark dataset designed to explicitly integrate latent physics behind ECT into the learning process. Generated via a COMSOL-MATLAB pipeline for an eight-electrode sensor as an example, the dataset comprises 20,000 randomized samples across four typical flow patterns. Crucially, alongside the conventional capacitance vectors and permittivity distributions depicted as images, each sample preserves eight excitation-wise full-field potential maps. Beyond data release, we provide illustrative evaluation protocols for both forward and inverse problems of ECT. Through comprehensive testing on both in-distribution (IID) and out-of-distribution (OOD) scenarios, we systematically demonstrate how the inclusion of electric potential maps enhances modeling accuracy and robustness. Fundamentally, the explicit inclusion of latent field information significantly lowers the barrier to integrating physical laws into ECT modeling, thereby establishing a standardized foundation for future physics-guided machine learning of ECT image reconstruction.

[CV-22] Adapting Prithvi-EO for Fallow Detection for Food-Water Nexus: ViT-Adapter Necks and Parameter-Efficient Backbone tuning of Geospatial Foundation Model

Link: https://arxiv.org/abs/2606.12218
Authors: Sk Muhammad Asif, Orhun Aydin
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 figures. Preprint. Submitted to ACM SIGSPATIAL 2026

Abstract:Understanding the spatial distribution of fallow land is important for optimizing the food-water (FW) nexus, given fallowing’s role in crop rotation and water conservation. Fallow is a low-accuracy class in the USDA Cropland Data Layer (CDL). The geospatial foundation model (GFM) Prithvi-EO has shown strong transferability across computer vision tasks. However, its Vision Transformer (ViT) backbone produces features at a single spatial scale that are ill-suited for the multi-scale features required by object detection heads. Existing approaches synthesise multi-scale pyramids through scaling of single-stride tokens, sacrificing spatial heterogeneity, and full backbone fine-tuning is computationally prohibitive for GFMs. We evaluate a fallow detection pipeline combining two parameter-efficient fine-tuning (PEFT) schemes, Low-Rank Adaptation (LoRA) and a hybrid PEFT, with three neck designs: pseudo multi-scale, Lite ViT-Adapter, and Full ViT-Adapter. Our best configuration, Lite ViT-Adapter with a one-stage head, achieves a mAP@50 of 0.9479 with the DIoU loss, suggesting the effectiveness of center-aware localization for irregular fallow field detection. Adapter-free one-stage detection under LoRA improves on the adapter-free anchor-based approach by 6.42%, and the best configuration improves on the baseline adapter-free anchor-based approach by 25.70%. These results demonstrate that lightweight spatial prior fusion and selective backbone unfreezing enable Prithvi-EO to capture local fallow patterns more effectively, outperforming approaches that rely on reshaped single-stride ViT tokens.

[CV-23] Making Foresight Actionable: Repurposing Representation Alignment in World Action Models

Link: https://arxiv.org/abs/2606.12217
Authors: Lu Qiu, Yizhuo Li, Yi Chen, Yuying Ge, Yixiao Ge, Xihui Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:World Action Models (WAMs) offer a promising route for robot manipulation by using video generation models to model future scene evolution before producing control actions. However, our empirical observations reveal a phenomenon: generating plausible visual futures does not always guarantee the extraction of accurate actions. To diagnose this failure, we conduct action-head attention analysis and causal interventions. We find that the action decoder fails to focus on task-relevant interaction regions and remains sensitive to perturbations in task-irrelevant areas. This reveals a representation mismatch: hidden states optimized for visual reconstruction are not inherently organized in a form useful for low-level action control. In this paper, we propose AGRA, an Action-Grounded Representation Alignment objective that regularizes the world-action interface by aligning intermediate video diffusion features with spatially coherent semantic representations from a foundation visual encoder. We evaluate AGRA on real-world manipulation tasks. Experiments show that AGRA makes world model representations more action-grounded: by focusing the action decoder on the correct interaction regions, it improves object localization accuracy and affordance understanding, and makes the policy more robust to perturbations in task-irrelevant regions. As a result, AGRA consistently improves both in-distribution performance and out-of-distribution generalization over the baseline world action model.

[CV-24] SHERPA: Seam-aware Harmonized ERP Adaptation for Open-Domain 360^\circ Panorama Generation

Link: https://arxiv.org/abs/2606.12213
Authors: Jungwoon Kang, Jaehun Kim, Yiwon Yu, Hyungyum Jang, Sanghoon Lee, Jongyoo Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 29 pages, 23 figures, 5 tables. Preprint version

Abstract:Panoramic imagery is increasingly used in world-generation, games, and simulation, where users may need not only photorealistic scenes but also stylized and non-photorealistic environments. Large-scale text-to-image diffusion and flow models provide broad style and semantic priors for this goal, but planar image training misaligns them with the wrap-around topology and polar regions of 360^\circ panoramas represented in equirectangular projection (ERP). We present SHERPA, a lightweight adaptation framework that combines frequency-selective Circular RoPE, Circular Latent Encoding/Decoding, image-side FFN adapters, and a Dual-Path Training Scheme. Circular RoPE replaces only the seam-sensitive high-frequency horizontal RoPE band with integer-periodic harmonics while preserving the pretrained lower-frequency spectrum. The Paired Panorama Path supervises geometry, while the Unpaired Style Path uses self-supervised yaw consistency for target-free stylized prompts. As a result, SHERPA generates 360^\circ panoramas across both photorealistic panorama domains and open-domain stylized prompts.

[CV-25] InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning

Link: https://arxiv.org/abs/2606.12195
Authors: Ziang Yan, Sheng Xia, Jiashuo Yu, Yue Wu, Tianxiang Jiang, Songze Li, Kanghui Tian, Yicheng Xu, Yinan He, Kai Chen, Limin Wang, Yu Qiao, Yi Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent progress in foundation models has shifted toward agentic behavior involving multi-step reasoning and tool use. However, open-source efforts largely focus on text-dominant settings, leaving long-horizon multimodal tasks underexplored. This gap is evident in video tasks requiring sustained temporal understanding and iterative interaction. We present InternVideo3, a framework enhancing these capabilities via Multimodal Contextual Reasoning (MCR). MCR treats understanding as a closed-loop process over a shared, evolving context containing observations, instructions, reasoning, tool actions, and memory. This frames long-video understanding as evidence accumulation and verification. To ensure efficiency, we introduce Multimodal Multi-head Latent Attention (M^2LA), a token-preserving reparameterization compressing KV-cache states while retaining the full token stream. Our staged training includes continued pretraining, short-to-long supervised fine-tuning, rule-based reinforcement learning, and on-policy distillation. Experiments show InternVideo3 achieves strong performance on benchmarks like Video-MME, MLVU, and EgoSchema. We further instantiate the model as a video agent with retrieval tools, demonstrating robust evidence-grounded behavior. Our results suggest that efficient context handling and closed-loop reasoning are vital for adapting open multimodal models toward long-horizon visually grounded agency.

[CV-26] DynaTok: Token-Based 4D Reconstruction from Partial Point Clouds ICML2026

Link: https://arxiv.org/abs/2606.12189
Authors: Weirong Chen, Keisuke Tateno, Hidenobu Matsuki, Michael Niemeyer, Daniel Cremers, Federico Tombari
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2026. Project page: this https URL

Abstract:We address 4D reconstruction from partial point cloud sequences, where depth-sensor observations are incomplete, unordered, and lack explicit temporal correspondences. This geometry-only setting is challenging due to missing observations and ambiguous dynamics. While recent progress has largely relied on image-based methods, existing point-based approaches typically focus on single objects, assume relatively complete inputs, or require explicit correspondences. To address these limitations, we propose DynaTok, a point-based framework for correspondence-free 4D reconstruction from partial point cloud sequences without images. DynaTok encodes frames into compact latent tokens, aggregates incomplete observations over time with a Transformer-based spatiotemporal encoder, and decouples geometry and motion through residual tokens in a unified model. A flow-matching decoder then reconstructs complete, temporally consistent 4D point-cloud sequences conditioned on the latent tokens. Experiments on object- and scene-level benchmarks demonstrate improved reconstruction quality and temporal coherence from partial point cloud observations. Project page: this https URL.

[CV-27] Beyond Dark Knowledge: Mixup-Based Distillation for Reliable Predictions

Link: https://arxiv.org/abs/2606.12171
Authors: José Medina,Paul Honeine,Abdelaziz Bensrhair,Amnir Hadachi
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Knowledge Distillation (KD) and mixup have proven effective at inducing smoothness in class boundaries; KD captures inherent class relationships in probability distributions, and mixup enforces them through convex combinations of inputs. Their interaction, however, remains poorly understood, particularly when mixup is applied only during student training. In this setting, the teacher is queried on inputs drawn from a vicinal distribution it never saw during training, a controlled mismatch whose effect on knowledge transfer has not been characterised. We show that this mismatch causes the teacher’s supervisory signal to be dominated by distributional confusion rather than inter-class structure. Despite it, the student does not merely imitate the teacher: it independently acquires greater linearity in the vicinal region, a structural property that the teacher lacks, and goes beyond dark-knowledge transfer. KD with mixup consistently improves student accuracy and reduces overconfidence by an order of magnitude relative to the baseline, across CIFAR and ImageNet with varying-capacity teachers. Crucially, calibration propagates from teacher to student independently of accuracy transfer, and temperature scaling governs a measurable accuracy-calibration trade-off that becomes more pronounced under vicinal training. These results reframe mixup distillation not as a degraded version of standard KD, but as a richer transfer channel that simultaneously shapes discriminative performance, uncertainty estimation, and representational geometry.
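The "convex combinations of inputs" that mixup relies on can be illustrated with a minimal sketch. This is not the authors' code: the function names, the Beta(alpha, alpha) sampling of the mixing coefficient, and the toy two-dimensional inputs are assumptions chosen for illustration.

```python
import random


def mixup(x_a, x_b, y_a, y_b, lam):
    """Return the convex combination of two inputs and of their one-hot labels."""
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix


def sample_lambda(alpha=1.0, rng=random):
    """Draw a mixing coefficient from Beta(alpha, alpha) (an assumed, common choice)."""
    return rng.betavariate(alpha, alpha)


# Toy example: two "images" with two features each and two classes.
x_mix, y_mix = mixup([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], lam=0.25)
print(x_mix, y_mix)
```

In the setting the abstract describes, only the student would be trained on such mixed samples, so the teacher is queried on points it never saw during its own training.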

[CV-28] TopoCap: Learning Topology-Agnostic Motion Priors for Monocular Video-to-Animation

Link: https://arxiv.org/abs/2606.12153
Authors: Cheng-Feng Pu,Jia-Peng Zhang,Meng-Hao Guo,Yan-Pei Cao,Shi-Min Hu
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:The explosion of generative 3D assets has created a massive demand for animation, yet current motion capture methods remain brittle, restricted to species-specific templates (e.g., SMPL) or requiring labor-intensive manual rigging. We introduce TopoCap, the first unified framework capable of extracting motion from monocular video and retargeting it onto characters with arbitrary, unseen skeletal topologies, i.e., from bipeds to hexapods and inanimate objects, without test-time optimization. Our key insight is that while skeletal structures are combinatorial and discrete, the underlying physics of motion occupies a continuous, low-dimensional manifold. We materialize this insight via a two-stage generative pipeline. First, we learn a Universal Motion Manifold using a Graph CVAE that compresses heterogeneous kinematic chains into a shared, fixed-length latent code. By explicitly conditioning the decoder on a structural embedding of the target rig, we disentangle motion dynamics from skeletal topology. Second, we treat video-to-animation as a conditional flow matching problem, predicting these topology-agnostic codes from visual features. To learn this generalized prior, we introduce Mobjaverse, a massive-scale dataset curated from Objaverse-XL. Comprising over 5,000 unique skeletal topologies and 2 million frames, it exceeds the structural diversity of existing datasets by two orders of magnitude. Extensive experiments demonstrate that TopoCap outperforms specialist models on human and quadruped benchmarks while enabling zero-shot retargeting for the long tail of 3D creatures. The dataset is publicly available at this https URL.

[CV-29] AerialClaw: An Open-Source Framework for LLM-Driven Autonomous Aerial Agents

Link: https://arxiv.org/abs/2606.12142
Authors: Ke Li,Jianfei Yang,Luyao Zhang,Guo Yu,Chengwei Yan,Yuan Ding,Di Wang,Nan Luo,Gang Liu,Xiao Gao,Quan Wang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Unmanned aerial vehicles (UAVs) are increasingly used in inspection, search and rescue, environmental monitoring, and emergency response. However, most UAV applications still rely on pre-defined command sequences or task-specific pipelines, where developers manually connect perception, planning, flight control, simulation, logging, and safety modules. This limits the flexibility, reproducibility, and extensibility of autonomous aerial systems. This paper presents AerialClaw, an open-source software framework that enables UAVs to operate as decision-making aerial agents rather than merely command-following platforms. Given a natural-language mission, AerialClaw allows an LLM-based agent to understand the task, maintain context, invoke executable aerial skills, observe perception and runtime feedback, and iteratively update its decisions in a closed loop. The framework adopts a modular brain-skill-runtime architecture, combining hard skills for atomic UAV operations, Markdown-based soft skills for reusable task strategies, document-driven agent state and capability boundaries, memory-driven reflection, safety-oriented runtime validation, and platform-agnostic execution adapters. AerialClaw supports lightweight mock execution, PX4 SITL with Gazebo, and AirSim-based simulation, together with a web console, pluggable model backends, example missions, simulation assets, and staged deployment scripts. By combining standardized aerial skills, document-driven agent state, memory, and closed-loop LLM decision-making, AerialClaw provides a reproducible and extensible open-source framework for building UAV systems that can interpret missions, make decisions, execute skills, and adapt their behavior from feedback.

[CV-30] Time-Conditioned and Multi-Time Survival Prediction from 2D PET/CT Projections in Lung Cancer

Link: https://arxiv.org/abs/2606.12140
Authors: Ashish Chauhan,Sambit Tarai,Elin Lundström,Johan Öfverstedt,Håkan Ahlström,Joel Kullberg
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review at MIUA 2026

Abstract:Accurate prediction of overall survival (OS) from positron emission tomography/computed tomography (PET/CT) can support personalized treatment and follow-up strategies in oncology. However, the impact of temporal modeling on imaging-based survival prediction remains insufficiently explored. We investigate how different temporal formulations influence survival prediction by developing two complementary approaches: Attention-guided Time-Conditioned Survival (ATCS) and Multi-Time Survival (MTS). We retrospectively analyzed pre-treatment PET/CT images from 848 patients with non-small cell lung cancer (NSCLC), including 556 for model development and 292 for held-out testing. A previously proposed Time-Conditioned Survival (TCS) model was used as a baseline. Models were trained using 5-fold cross-validation and evaluated on the test set using time-dependent area under the curve (AUC) at 6-month intervals from 0.5 to 5 years. Both ATCS and MTS outperformed the baseline TCS model, achieving mean AUCs of 0.794 and 0.793, respectively, compared to 0.767. ATCS performed better at earlier time points (0.5-3 years), whereas MTS performed better at later intervals (3.5-5 years). Combining tumor-specific and tissue-wise PET/CT features improved performance over either input alone. Finer temporal discretization improved short-term prediction, while coarser intervals provided more stable long-term estimates. These findings demonstrate that temporal modeling and input design influence PET/CT-based survival prediction. The proposed approaches enable time-specific survival estimation from pre-treatment imaging and may support improved risk stratification and clinical decision-making.

[CV-31] AGE-MIL: Anchor-Guided Evidence Learning for Patient-Level Prediction MICCAI

Link: https://arxiv.org/abs/2606.12126
Authors: Jiawei Niu,Jian Chen,Di Zhang,Junbo Lu,Zhangcheng Liao,Xuhao Liu,Honglin Zhong,Mireia Crispin-Ortuzar,Chen Li,Zeyu Gao,Yi Cai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 2 figures, MICCAI early accepted

Abstract:Existing computational pathology methods predominantly operate within whole-slide image (WSI)-level multiple instance learning (MIL) paradigms, while patient-level modeling remains underexplored. In routine pathological practice, however, pathologists derive diagnostic and prognostic conclusions by integrating evidence across multiple WSIs rather than relying on any single slide. This discrepancy creates a fundamental misalignment when patient-level supervision is directly imposed on conventional MIL frameworks, often leading to unstable optimization and degraded predictive reliability. To address this issue, we propose Anchor-Guided Evidence MIL (AGE-MIL), a weakly supervised framework for patient-level prediction. AGE-MIL constructs a patient-level anchor from slide representations to capture global pathological context and guide the retrieval and integration of diagnostically relevant local patches, enabling robust patient-level modeling. Patient-level risk is further modeled as an evidence accumulation process, promoting stable optimization under weak supervision. AGE-MIL is evaluated on six clinically relevant patient-level prediction tasks from two independent cohorts. Experimental results show that the proposed framework consistently outperforms eight state-of-the-art MIL methods. Code is available at this https URL.

[CV-32] Q-Fold: Query-Aware Focus-Context Spatio-Temporal Folding for Long Video Understanding

Link: https://arxiv.org/abs/2606.12125
Authors: Biao Tang,Xu Chen,Shuxiang Gou,Jingyi Yuan,Yuhan Zhang,Chenqiang Gao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures, 8 tables. Code will be made publicly available

Abstract:Long-video understanding remains challenging for multimodal large language models, because temporally extended videos often contain thousands of frames and are therefore expensive to process exhaustively. Existing methods usually construct compact visual inputs from long videos under a limited visual budget. However, most of them still follow a frame-centric paradigm and apply similar representations to retained content regardless of its importance. This makes it difficult to preserve both high-fidelity visual evidence and broad temporal coverage. To address this issue, we propose Q-Fold, a training-free input construction framework for long-video understanding. Instead of treating isolated frames as the basic modeling unit, Q-Fold operates on contiguous temporal segments and constructs a heterogeneous Focus–Context representation under query guidance. Query-relevant segments are preserved as high-fidelity Focus Frames, while less relevant segments are folded into chronology-preserving contextual layouts. In this way, Q-Fold preserves critical visual evidence and broad temporal coverage, while better maintaining local temporal continuity within short segments. Experiments on four long-video benchmarks with multiple Video-MLLMs show that Q-Fold consistently improves performance without increasing the input budget. Notably, it achieves gains of up to 9.1 percentage points on an ultra-long video benchmark. Code will be made publicly available.

[CV-33] MSUE: Multi-Modal Soccer Understanding Expert

Link: https://arxiv.org/abs/2606.12106
Authors: Litao Li,Yibo Yu,Yufeng Hu,Zhuo Yang,Jiali Wen,Yixin Chen,Yixi Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 pages, 1 figure

Abstract:This paper presents our solution to the 2026 SoccerNet VQA Challenge. We first develop a cost-effective data synthesis pipeline driven by a Vision-Language Model (VLM), which systematically restructures raw domain data into diverse VQA samples, including concise answers and long-form responses. Second, we propose MSUE, a multi-expert question answering architecture that employs a Large Language Model (LLM) to dynamically dispatch questions to text, image, and video experts. These experts are instantiated as a strong text baseline Gemini3-Flash, a fine-tuned Qwen3-VL, and an external knowledge base, respectively, working collaboratively to enhance VQA performance. MSUE achieves an accuracy of 0.95 on the challenge benchmark, securing third place on the leaderboard.

[CV-34] DAM-VLA: Decoupled Asynchronous Multimodal Vision Language Action model

Link: https://arxiv.org/abs/2606.12105
Authors: Pankhuri Vanjani,Zhuoyue Li,Jakub Suliga,Moritz Reuss,Gianluca Geraci,Xinkai Jiang,Rudolf Lioutikov
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 17 pages, 8 figures

Abstract:Vision-language-action (VLA) models inherit a shared synchronous clock from vision-language pretraining, processing every input at one rate. This is misaligned with physical interaction, where a high-frequency modality changes at hundreds of hertz, vision evolves more slowly, and language stays constant across an episode. A synchronous VLA oversamples slow modalities, undersamples fast ones, and caps action generation at the lowest effective frequency. We hypothesize that decoupling temporal processing per modality, letting each update and retain information at its own sensor rate, yields stronger representations and more robust control. We present DAM-VLA, which maintains per-modality latent buffers refreshed at sensor rates and read continuously by the action head, integrating new high-frequency modalities through gated cross-attention that leaves the pretrained backbone intact. Across seven contact-rich real-world manipulation tasks, DAM-VLA more than doubles the average success rate of the strongest synchronous baseline (95.2% vs. 40.95%) while sustaining smooth, reactive 100 Hz control. Project website: this https URL
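The idea of per-modality buffers that refresh at their own rate while the action head reads the latest value every control tick can be sketched as a toy loop. Everything below is a hypothetical illustration, not the DAM-VLA implementation: the class name, the modality names, the refresh periods, and the stand-in encoder are all assumptions.

```python
class LatentBuffer:
    """Holds the most recent latent for one modality (hypothetical helper)."""

    def __init__(self, period):
        # Refresh every `period` control ticks; None means "refresh once at tick 0".
        self.period = period
        self.latent = None
        self.refreshes = 0

    def maybe_refresh(self, tick, encode):
        """Re-encode only when this modality is due, then return the stored latent."""
        due = (tick == 0) if self.period is None else (tick % self.period == 0)
        if due:
            self.latent = encode(tick)
            self.refreshes += 1
        return self.latent


# Assumed rates relative to a 100-tick control loop (one second at 100 Hz).
buffers = {
    "fast_sensor": LatentBuffer(period=1),   # refreshed every tick
    "vision": LatentBuffer(period=10),       # refreshed every 10th tick
    "language": LatentBuffer(period=None),   # constant across the episode
}

context = {}
for tick in range(100):
    # The action head would read `context` on every tick, stale entries included.
    context = {
        name: buf.maybe_refresh(tick, lambda t, n=name: (n, t))
        for name, buf in buffers.items()
    }

print({name: buf.refreshes for name, buf in buffers.items()})
```

A synchronous design would instead re-encode all three inputs on one shared clock, which is the oversampling and undersampling the abstract criticizes.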

[CV-35] ISAP-3D: Identity-Slot Aligned Part-Aware 3D Generation

Link: https://arxiv.org/abs/2606.12099
Authors: Junlin Hao,Haoshuai Fu,Xibin Song,Wei Li,Ruigang Yang,Xinggong Zhang,Jinchuan Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Part-aware 3D generation aims to synthesize structured objects with semantically meaningful components, yet often suffers from structural ambiguity due to identity-layout entanglement. Existing methods either infer part identity and spatial layout implicitly, which can lead to unstable part allocation (e.g., slot swapping or part merging), or rely on strong layout conditions that are difficult to obtain in practice. We attribute this ambiguity to identity-slot permutation freedom: without explicit identity-slot alignment, the correspondence between semantic parts and generation slots is not identifiable during training, allowing multiple slot assignments to fit the same supervision and leading to inconsistent decomposition. Based on this insight, we argue that stable part-aware generation requires identity-aligned one-to-one slot modelling. We therefore propose an identity-slot aligned framework, ISAP-3D, which anchors each part with semantic identity tokens and performs identity-conditioned one-to-one layout prediction, followed by layout-conditioned geometry synthesis. Structured local-global conditioning maintains identity alignment across semantic, spatial, and geometric stages. We also construct a part-level dataset with a unified semantic protocol to enable learnable and consistent identity-slot alignment. Extensive experiments demonstrate improved structural stability, controllability, and robustness over state-of-the-art part-aware generation baselines.

[CV-36] Non-frontal face recognition using GANs and memristor-based classifiers

Link: https://arxiv.org/abs/2606.12074
Authors: Semih Vazgecen,Cristian Sestito,Spyros Stathopoulos,Themis Prodromakis
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: 12 pages, 4 figures, 1 Supplementary (22 pages, 16 figures, 6 tables, 4 supplementary notes)

Abstract:Face recognition systems have advanced significantly through deep learning techniques, delivering high performance and robustness in complex scenarios. However, these approaches incur substantial computational overhead, limiting their in situ applicability in resource-constrained platforms such as drones, where they can address challenges including non-frontal facial imagery. Memristor-based neuromorphic systems have emerged as a compelling approach for edge AI applications, combining biologically inspired processing with efficient and scalable computation. In this work, we propose a facial recognition framework that addresses non-frontal pose variations by integrating lightweight generative adversarial network (GAN)-based pose frontalisation with memristor-based neuromorphic recognition. The experimental results on two datasets demonstrate the effectiveness of combining adversarial learning with memristive technology, achieving up to 96% identification accuracy. The proposed approach alleviates the computational bottlenecks of conventional AI and offers a scalable, efficient solution for face recognition in dynamic real-world environments.

[CV-37] World Model Self-Distillation: Training World Models to Solve General Tasks

Link: https://arxiv.org/abs/2606.12072
Authors: Sebastian Stapf,Pablo Acuaviva Huertos,Aram Davtyan,Paolo Favaro
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Pretrained video generators are promising visual world models that exhibit emergent task-solving abilities; however, their reliance on detailed textual descriptions limits their direct use for planning and decision-making. Existing approaches either outsource this reasoning to language or vision-language models, or rely on supervised fine-tuning with paired task-execution videos, which are costly to collect and difficult to scale. We propose a scalable framework that elicits task-solving ability in such models by combining self-distillation with reinforcement learning. Given an unlabeled scene image, a vision-language model generates a candidate task and a detailed step-by-step solution. The solution conditions a pretrained video diffusion model, the Demonstrator; we distill its behavior into an Executor conditioned only on the image and a short task prompt. This transfers execution knowledge from caption-guided generation to instruction-conditioned task solving without curated task-video supervision. We further improve the Executor with reinforcement learning from VLM feedback, exploiting the asymmetry between judging whether a sampled video satisfies a task and generating the solution. Experiments on our proposed WorldTasks-Benchmark and the DreamGen robotics benchmark show that the Executor surpasses the Demonstrator under our VLM-based evaluation protocol and transfers competitively to robotic tasks.

[CV-38] Tac-DINO: Learning Vision-Tactile Features with Patch Alignment

Link: https://arxiv.org/abs/2606.12069
Authors: Hong Li,Yankang Dong,Yue Xu,Yihan Tang,Mingzhu Li,Jiamin Qiu,Qihang Yao,Xing Zhu,Yujun Shen,Nan Xue,Yong-Lu Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Touch is the primary medium through which humans interact with the environment. Currently, tactile learning mainly focuses on image-level pretraining or alignment. However, tactile signals correspond to local object contact, while research into scale alignment and holographic matching remains limited, and suitable datasets and benchmarks are also lacking. To bridge this gap, we first construct a data collection system to acquire a large-scale tactile dataset, with over 20K tactile contacts from 505 real-world objects. Building on this dataset, we design a Vis-Tac Holographic Matching Benchmark to evaluate vision-tactile local-to-global alignment ability. Then we propose Vision-Tactile Patch Alignment (VTPA) methods for vision-tactile representation learning. Experiments demonstrate that these methods outperform methods without alignment and align with whole-object images.

[CV-39] Performance Analysis of YOLOv11 and YOLOv8 for Mixed Traffic Object Detection under Adverse Weather Conditions in Developing Countries

Link: https://arxiv.org/abs/2606.12066
Authors: Quoc Thuan Nguyen,Ha Anh Vu,Ngo Dang Thanh Ngan,Minh Phuc Hoang Ngoc
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In modern vehicular systems, robust performance under harsh conditions has become a critical problem for autonomous driving. Our study delivers a comprehensive evaluation of the YOLOv11 Nano architecture, the newest iteration of the YOLO series, benchmarked against the widely adopted YOLOv8 Nano as a baseline on a custom fused dataset that combines the Indian Driving Dataset (IDD) [1] and the Berkeley Deep Drive Dataset (BDD100K) [2]. We analyze the trade-offs among detection accuracy, inference speed, and computational efficiency in high-entropy scenarios involving dense mixed traffic, rain, and low-light conditions. Specifically, YOLOv11n achieves a mean Average Precision (mAP@50) of 46.6%, with a notable 3.2% improvement in Precision over the baseline, effectively reducing false positives in cluttered scenes. Furthermore, the proposed model exhibits enhanced energy efficiency, requiring 22% fewer FLOPs (6.3G vs. 8.1G) while maintaining a real-time inference speed of 70.9 FPS on a Tesla T4 GPU, offering an optimal trade-off for safety-critical edge deployment.
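As a quick arithmetic check of the "22% fewer FLOPs" figure, the relative reduction can be recomputed from the two values quoted in the abstract. The variable names are illustrative only.

```python
# GFLOPs as quoted in the abstract for the two Nano models.
flops_yolov8n = 8.1
flops_yolov11n = 6.3

# Relative reduction of YOLOv11n with respect to the YOLOv8n baseline.
reduction_pct = 100.0 * (flops_yolov8n - flops_yolov11n) / flops_yolov8n
print(f"{reduction_pct:.1f}% fewer FLOPs")  # prints "22.2% fewer FLOPs"
```

The result, about 22.2%, agrees with the rounded value stated in the abstract.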

[CV-40] MFEN: Multi-Frequency Expert Network for Visible-Infrared Person Re-ID CVPR

Link: https://arxiv.org/abs/2606.12051
Authors: Xulin Li,Yan Lu,Bin Liu,Qinhong Yang,Qi Chu,Tao Gong,Nenghai Yu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR Highlight

Abstract:Visible-infrared person re-identification (VI-ReID) is challenging due to the large modality discrepancy between visible and infrared images. We contend that this discrepancy is largely related to differing lighting conditions, including differences in light wavelength and light source type. Recently, frequency-based VI-ReID approaches have achieved notable success because frequency information can better extract identity-relevant contours and details while excluding irrelevant lighting and color. However, existing methods either do not distinguish different frequency bands or focus on only one band, which is insufficient under diverse lighting conditions. To perform comprehensive frequency domain learning, we propose a Multi-Frequency Expert Network (MFEN) that enables multi-frequency modulation and adaptively combines different bands through a mixture-of-experts design. We further introduce Random Frequency Augmentation (RFA) and Frequency Auxiliary Optimization (FAO) to better train MFEN. The three modules are complementary and jointly capture critical frequency-domain details for robust representation learning. Extensive experiments on three VI-ReID datasets demonstrate the effectiveness of our approach.

[CV-41] Metadata-Aware Multi-Prompt Reasoning for Zero-Shot Accident Understanding CVPR2026

Link: https://arxiv.org/abs/2606.12047
Authors: Tarandeep Singh,Soumyanetra Pal,Soham Biswas,Nishanth Chandran
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Accepted at the AUTOPILOT Workshop, CVPR 2026 (non-archival). Workshop Paper ID 15

Abstract:In this paper, we address the problem of zero-shot understanding of accidents from surveillance videos by identifying when an impact event occurs, what type of impact it is, and where in the frame it occurs using natural language. We propose a three-stage pipeline that decomposes accident understanding into when, what, and where. The first stage extracts a short temporal window around the impact using vision-language similarity. In the second stage, we perform metadata-driven multi-prompt reasoning with five complementary views (baseline, motion, geometry, contrast, and tiebreaker) and resolve disagreement via an entropy-gated pairwise adjudicator. Finally, we localize the impact with an open-vocabulary detector queried on the predicted accident type and scene layout, and aggregate detections across keyframes using a score-weighted centroid. Our pipeline achieves a substantial improvement in the harmonic-mean score over a centre-of-frame baseline on the zero-shot ACCIDENT @ CVPR benchmark. We show that decomposing zero-shot video understanding into temporal localization, semantic classification, and spatial grounding enables more reliable reasoning with vision-language models than direct prompting alone.
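The final aggregation step, a score-weighted centroid over detections from several keyframes, can be sketched as follows. This is a minimal illustration under assumed conventions (each detection reduced to a box centre plus a confidence score); the function name and data layout are hypothetical and not taken from the paper.

```python
def score_weighted_centroid(detections):
    """Average detection centres, weighting each by its confidence score.

    `detections` is a list of (cx, cy, score) tuples pooled over keyframes.
    """
    total = sum(score for _, _, score in detections)
    if total <= 0:
        raise ValueError("at least one detection with a positive score is required")
    cx = sum(x * score for x, _, score in detections) / total
    cy = sum(y * score for _, y, score in detections) / total
    return cx, cy


# Two detections: the higher-scoring one pulls the estimate towards itself.
print(score_weighted_centroid([(0.0, 0.0, 1.0), (4.0, 2.0, 3.0)]))  # prints (3.0, 1.5)
```

Weighting by score means a single low-confidence box in one keyframe shifts the predicted impact location only slightly.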

[CV-42] Vision Transformers for Face Recognition Need More Registers

Link: https://arxiv.org/abs/2606.12036
Authors: Tahar Chettaoui,Guray Ozgur,Eduarda Caldeira,Naser Damer,Fadi Boutros
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the 20th IEEE International Conference on Automatic Face and Gesture Recognition (2026)

Abstract:Recent advances in Vision Transformers (ViTs) for face recognition (FR) have moved beyond the standard CLS-token paradigm. In this paradigm, a special classification token (CLS) is prepended to the patch embeddings and used as a representation of the input for downstream tasks. An alternative approach, Concatenated Patch Embeddings (CPE), instead leverages all patch tokens by concatenating them into a single vector, which is then projected into a compact face representation. CPE has been shown to improve recognition performance compared with CLS-based approaches, but our qualitative analysis of attention maps showed the presence of artifacts that limit their interpretability. To address this issue, we incorporate register tokens: learnable tokens that are concatenated to the initial patch embeddings and processed jointly through the ViT encoder blocks. This mechanism has been shown to produce more structured and interpretable attention maps compared to the baseline ViT. We empirically demonstrate that these artifacts consistently appear across various ViT backbones, including small and large models, and that introducing register tokens effectively mitigates them. Adding four or eight registers significantly enhances interpretability, with eight registers providing the highest verification accuracies and smoothest attention structures. Our resulting model, ViT-8R, which corresponds to a CPE-based ViT-B architecture augmented with eight register tokens, achieves state-of-the-art performance among ViT-based FR models on the large-scale IJB-B and IJB-C benchmarks. ViT-8R also produces substantially clearer attention maps than the baseline model, offering deeper insight into the model’s attention behavior (this https URL).

[CV-43] SpikeTAD: Spiking Neural Networks for End-to-End Temporal Action Detection

Link: https://arxiv.org/abs/2606.12033
Authors: Min Yang,Mi Zhou,Limin Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by Pattern Recognition

Abstract:Video understanding is a crucial part of computer vision, with numerous application scenarios. With the increasing popularity of mobile devices, an increasing number of efforts are trying to deploy video understanding models on them. However, existing video understanding models are difficult to deploy due to their large size and prohibitive power consumption. Spiking Neural Networks (SNNs) have shown bioplausibility and low power advantages over Artificial Neural Networks (ANNs), especially on neuromorphic chips which are regarded as essential components of future mobile devices. However, excessively long conversion time-steps and severe performance degradation problems limit their application. To solve the problems above, we explore the application of SNNs on temporal action detection (TAD), which is an important task in video understanding, and propose the first SNN-based end-to-end TAD architecture coined as SpikeTAD. While maintaining extremely low power consumption, SpikeTAD achieves an average mAP of 67.2% in THUMOS14 and 37.42% in ActivityNet-1.3, demonstrating the feasibility of a low-power TAD model. Our code is available at this https URL.

[CV-44] ViT-FREE: Efficient Face Recognition via Early Exiting and Synthetic Adaptation

Link: https://arxiv.org/abs/2606.12023
Authors: Tahar Chettaoui,Guray Ozgur,Eduarda Caldeira,Naser Damer,Fadi Boutros
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the 20th IEEE International Conference on Automatic Face and Gesture Recognition (2026)

Abstract:Vision Transformers (ViTs) have gained significant attention in computer vision and shown strong potential for face recognition (FR). However, their high computational cost makes deployment on resource-constrained devices challenging, motivating the need for methods that balance efficiency and accuracy. In this work, we investigate early exiting in pretrained ViTs as a simple yet effective training-free strategy for efficient FR inference. Leveraging the uniform feature dimensionality across transformer encoder blocks, we introduce ViT-FREE, a multi-exit framework that enables face verification directly from intermediate representations without modifying or retraining the backbone model, and thus, reducing inference cost. Empirically, we show that patch embeddings and attention maps evolve progressively across depth, exhibiting high similarity between consecutive ViT blocks and increasing alignment with the final representation. This indicates gradual feature refinement and attention convergence, suggesting that intermediate layers already provide stable and discriminative representations suitable for early exiting. Through extensive experiments on multiple FR benchmarks, we systematically analyze the accuracy-efficiency trade-off across exit depths. Our results demonstrate that later exits achieve a highly favorable balance, with exiting at layer 10 yielding up to a 20% speedup while incurring only a 1.5 drop in verification performance on benchmarks such as IJB-C. Also, we propose ViT-FREE_FT, a lightweight exit-specific fine-tuning strategy that adapts only the projection layers using a small synthetic dataset while keeping the transformer backbone frozen. This approach improves the performance of shallow exits while preserving the efficiency benefits and leaving deeper exits largely unaffected.
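Because every encoder block outputs features of the same dimensionality, an early exit amounts to stopping the forward pass after a chosen block and using that output as the representation. The sketch below illustrates only this control flow with toy blocks; the function name, the 12-block depth, and the stand-in block behaviour are assumptions, not details confirmed by the abstract.

```python
def forward_with_exit(tokens, blocks, exit_layer):
    """Run only the first `exit_layer` blocks and return the intermediate tokens."""
    if not 1 <= exit_layer <= len(blocks):
        raise ValueError("exit_layer must be between 1 and the number of blocks")
    for block in blocks[:exit_layer]:
        tokens = block(tokens)
    return tokens


# Toy stand-ins for encoder blocks: each adds 1 to every value, so the output
# records how many blocks were executed. A depth of 12 is assumed here.
blocks = [lambda t: [v + 1 for v in t] for _ in range(12)]

early = forward_with_exit([0, 0], blocks, exit_layer=10)
full = forward_with_exit([0, 0], blocks, exit_layer=12)
print(early, full)  # prints [10, 10] [12, 12]
```

Under this assumed depth, exiting at layer 10 skips two of twelve blocks; the backbone weights stay untouched, which is what makes the strategy training-free.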

[CV-45] FitVTON: Fit-aware Virtual Try-On via Body-Garment Size Control

Link: https://arxiv.org/abs/2606.12012
Authors: Yiqun Ning,Ao Shen,Chenhang He,Lei Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While diffusion-based virtual try-on has achieved impressive visual realism, most methods treat the task as 2D inpainting, prioritizing texture preservation over physical plausibility. Consequently, they often produce plausible-looking images that fail to reflect authentic garment fit across diverse body shapes. We present FitVTON, a fit-aware virtual try-on model for different bodies in the wild. FitVTON encodes garment-body size through structured text prompts and learns from simulated try-on triplets produced by a parameterized garment model. To improve the fitting effects over garment silhouettes, we introduce two auxiliary heads that predict masks for both the garment and the exposed body. We further introduce a texture rectification stage to improve the realism of appearance learned from simulated data. To evaluate fitting fidelity, we curate a real-world dataset, FittingEffect3K, combined with a VLM-based scoring protocol. Both subjective and quantitative experiments show that FitVTON demonstrates authentic fitting fidelity, with significant gains in sizing accuracy and shape preservation over state-of-the-art methods while maintaining competitive image quality. Project Page: this https URL.

[CV-46] From Nominal Intensity to Equivalent Rainfall: A Path-Based Credibility Evaluation Framework for Simulated Rainfall in Autonomous-Driving Perception Tests

Link: https://arxiv.org/abs/2606.11989
Authors: Tian Xia,Xin Zhao,Shaolingfeng Ye,Junyi Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, preprint

Abstract:Credible simulated-rainfall conditions are essential for identifying perception-system boundaries and supporting SOTIF-oriented risk assessment in automated driving. However, closed-field tests are often described only by nominal rainfall intensity or single-point measurements, making it difficult to align simulated rain fields with real rainfall and map test results to real-world scenarios. This paper proposes a path-based credibility evaluation method for simulated rainfall in autonomous-driving perception tests. Using the drop size and velocity joint distribution of real rainfall as the reference, each candidate path is represented by path-equivalent rainfall intensity, an uncertainty band, and a path-averaged Realism of Raindrop Distribution (RRD) score. Lidar target point-cloud count and mean reflectivity are further used for perception-consistency correction, quantifying the proxy capability of each simulated-rainfall path for real-rainfall perception effects. Experiments are conducted using about 10,000 real-rainfall raindrop-spectrum samples, 728 RainSense perception samples, and 45 spatial sampling points in a 2.4 m x 7.2 m simulated-rainfall area. Results show that spatial non-uniformity remains under the same nominal condition, confirming the need for path-based evaluation. The method identifies Path IV and Path VI as preferable candidates, with results of 11.54 +/- 0.31 mm/h, RRD = 0.43, and 8.28 +/- 0.34 mm/h, RRD = 0.46, respectively. These paths show more balanced performance in rainfall-intensity stability, raindrop-spectrum realism, and perception consistency. The proposed method supports path selection, condition description, and credible interpretation of autonomous-driving perception tests under rainfall.

[CV-47] ParseFixer: An Agentic Framework for Document Parsing via Selective Multimodal Correction

Link: https://arxiv.org/abs/2606.11977
Authors: LeKai Yu,Hao Liu,Kun Wang,Zhiran Li,Ruping Cao,Fan Liu,Yupeng Hu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this report, we present our third-place solution for the DataMFM Challenge Track 1: Document Parsing. This track requires models to recover structured Markdown documents from document page images while preserving textual content and document structure. To address the complementary requirements of accurate content recovery and faithful structure reconstruction, we propose ParseFixer, an agentic framework for backbone parsing and selective correction. ParseFixer consists of two key modules: Full-Page Backbone Parsing (FBP) and Agentic Selective Correction (ASC). FBP produces stable initial Markdown outputs with MinerU2.5 Pro, while ASC detects high-value parsing failures and repairs them through a verify-and-rollback correction process. By placing selective multimodal correction after open-source backbone parsing, ParseFixer improves the recovery of key document elements without rewriting reliable backbone predictions. On the test set, our final system achieves an overall score of 61.78 and ranks third in Track 1, demonstrating its effectiveness for accurate document parsing. Our code will be released at: this https URL.

[CV-48] SpecLoR: Spectral Lookahead Rectification for Motion-Coherent Text-to-Video Generation

Link: https://arxiv.org/abs/2606.11969
Authors: Xu Zhang,Yu Lu,Ruijie Quan,Zhaozheng Chen,Bohan Wang,Yi Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Flow Matching has enabled robust text-to-video generation via latent ODE sampling. However, velocity approximation and numerical discretization errors inevitably accumulate, causing sampling trajectories to drift. Consequently, generated videos often suffer from severe spatiotemporal inconsistencies. Nevertheless, directly correcting these drifted, noisy latents is challenging: (i) timestep-dependent noise obscures reliable structural cues; (ii) spatial interventions risk disrupting intricate local geometry while incurring heavy computational costs. To address this, we propose Spectral Lookahead Rectification (SpecLoR), a plug-and-play inference method that bypasses noise via lookahead prediction, and circumvents spatiotemporal entanglement by shifting corrections to the frequency domain, where universal statistical priors of natural videos are readily available. First, during early sampling stages, SpecLoR looks ahead to estimate the clean latent z_t,0 and computes its 3D spatiotemporal spectrum. Next, SpecLoR rectifies the amplitude spectrum to match the prior, leaving the phase intact. Finally, the corrected state is re-noised to resume ODE integration. Experiments on Wan2.2 demonstrate that SpecLoR significantly reduces physical artifacts and enhances motion coherence across multiple benchmarks with minimal computational overhead (4 additional NFEs).
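
The amplitude-only correction described above (match the amplitude spectrum to a prior, leave the phase intact) can be sketched in one dimension. This is an illustration under stated assumptions, not SpecLoR itself: the paper works on the 3D spatiotemporal spectrum of a predicted clean latent, whereas the sketch uses a short 1D signal, a naive DFT, a hard replacement of the amplitude, and a made-up symmetric `prior`. The names `rectify_amplitude` and `prior` are hypothetical.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Naive inverse discrete Fourier transform."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def rectify_amplitude(x, target_amplitude):
    """Replace the amplitude spectrum of x with the target, keeping each phase."""
    X = dft(x)
    Y = [a * cmath.exp(1j * cmath.phase(c)) for a, c in zip(target_amplitude, X)]
    # For a real input and a symmetric target the result is real up to rounding.
    return [v.real for v in idft(Y)]

signal = [1.0, 2.0, 0.5, -1.0, 0.0, 3.0, -2.0, 1.5]
n = len(signal)
# Assumed prior: amplitude decays with frequency and satisfies prior[k] == prior[n - k].
prior = [1.0 / (1 + min(k, n - k)) for k in range(n)]
rectified = rectify_amplitude(signal, prior)
print([round(abs(c), 6) for c in dft(rectified)])
```

Keeping the phase is what preserves where structures sit in the signal; only the energy per frequency changes, which is why the correction can be applied without touching spatial layout directly.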

[CV-49] Feature extraction for plant growth estimation

Link: https://arxiv.org/abs/2606.11966
Authors: Simbarashe Aldrin Ngorima,Albert Helberg,Marelie H. Davel
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages

Abstract:Precision agriculture requires the estimation of plant growth stages in real-time. When the plant growth stage is known, the wastage of resources in cultivation, such as nutrients and water, is reduced as only the required resources need to be supplied. Plants at different growth stages, however, have similar morphological features, which can make autonomous growth stage estimation difficult. This paper presents two feature extraction methods for growth stage estimation: one that uses a bank of Gabor filters and morphological operations, and the other that uses pre-trained convolutional neural networks (CNNs) and transfer learning. We test these methods on a publicly available plant growth stage dataset (bccr-segset) for two species, canola and radish, grown and captured under indoor conditions. The two proposed feature extraction methods are compared, using support vector machines and boosted trees as classifiers. We find that both methods are suitable for real-time applications, and that CNN features outperform the hand-crafted features, both with regard to speed and accuracy. The best system (VGG-19 features, classified with a radial basis function support vector machine) obtained an accuracy of 98.4% for both species, processing an image in 0.08 seconds.
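
For readers unfamiliar with the hand-crafted branch, a bank of Gabor filters can be built from the standard real Gabor formula (a Gaussian envelope multiplied by a cosine carrier). The sketch below is illustrative only: the kernel size, wavelength, sigma, aspect ratio, and the number of orientations are assumed values, not the settings used in the paper, and `gabor_kernel` and `gabor_bank` are hypothetical helper names.

```python
import math

def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5, psi=0.0):
    """Real Gabor kernel of odd `size`, oriented at angle `theta` (radians)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates into the filter's orientation.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr * xr + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

def gabor_bank(size=9, wavelength=4.0, sigma=2.0, n_orientations=4):
    """One kernel per orientation, evenly spaced over [0, pi)."""
    return [gabor_kernel(size, wavelength, k * math.pi / n_orientations, sigma)
            for k in range(n_orientations)]

bank = gabor_bank()
print(len(bank), len(bank[0]), len(bank[0][0]))
```

Each kernel responds most strongly to edges and textures at its own orientation, so convolving an image with the whole bank yields one response map per orientation that can then be summarised into a feature vector for a classifier.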

[CV-50] Corpus Augmentation for Sign Language Translation via LLM-Guided Video Stitching

Link: https://arxiv.org/abs/2606.11925
Authors: Zsolt Robotka,Ádám Rák,Jalal Al-Afandi,András Horváth,György Cserey
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Sign language translation (SLT) converts sign language video into spoken language text and holds significant promise for improving accessibility and enabling communication between signing and non-signing communities. While large weakly-aligned datasets have enabled pre-training at scale and gloss-free methods have reduced reliance on expert annotation, high-quality parallel sign video-text pairs for fine-tuning remain scarce, limiting generalisation on long-tail vocabulary and unseen constructions. We propose a corpus augmentation approach that requires no additional human annotation, external sign-language video corpora, or generative video models, relying only on the existing gloss-annotated training corpus and an LLM for sentence generation: per-gloss clips are extracted from training videos via CTC forced-alignment, novel gloss-sentence pairs are generated by a corpus-anchored LLM, and synthetic sequences are assembled through random sentence sampling and clip assignment. The resulting synthetic RGB video-text pairs are architecture-agnostic at the downstream training stage and can be consumed directly by RGB-based SLT models, or converted into pose or feature representations by pipelines that derive such inputs from video. Sincan et al. re-evaluated five recent gloss-free methods under strictly identical conditions; the largest verified gain over the GFSLT-VLP baseline was only 0.98 BLEU-4. Our augmentation, applied within the same framework, achieves +2.92 BLEU-4 without any change to architecture or training protocol. We further identify that synthetic data harms vision-language pretraining despite improving its objectives, and that optimising clip transitions for visual smoothness is counter-productive under L2-based criteria; we propose that abrupt boundaries may act as a form of implicit regularisation. Code is available at this https URL.

[CV-51] From Content to Knowledge: Lightning Fast Long-Video Understanding with Neural Knowledge Representations

Link: https://arxiv.org/abs/2606.11913
Authors: Yuchen Guan,Xiao Li,Zongyu Guo,Xiaoyi Zhang,Xiulian Peng,Chun Yuan,Yan Lu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We propose a new paradigm for long video understanding by treating a long video as a Neural Knowledge Representation (NKR). NKR represents video contents neither as a stream of tokens nor pre-organized databases, but as an individual small portion of network weights attached to the VLM backbone. The NKR weights are optimized to encapsulate the video’s semantic content via a novel Agentic Knowledge Distillation (AKD) process, where an agent automatically synthesizes dense descriptions and question-answer pairs to distill the video’s knowledge into the NKR. While AKD serves as a comprehensive, one-time encoding phase, the resulting NKR transforms the video into a portable, reusable asset. At inference, the lightweight NKR is mounted onto a frozen Vision-Language Model (VLM), enabling direct, query-based understanding without reloading or re-encoding the original video. This approach decouples video length from inference cost, offering high amortized efficiency for multi-turn video understanding. Experiments on the LVBench benchmark show our method achieves performance comparable to state-of-the-art approaches while reducing end-to-end latency by over two orders of magnitude, opening new possibilities for interactive long-video understanding.

[CV-52] Wild3R: Feed-Forward 3D Gaussian Splatting from Unconstrained Sparse Photo Collection

Link: https://arxiv.org/abs/2606.11894
Authors: Yuto Furutani,Takashi Otonari,Kaede Shiohara,Toshihiko Yamasaki
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Feed-forward 3D Gaussian Splatting (3DGS) removes the need for time-consuming per-scene optimization required by traditional 3DGS. However, existing feed-forward approaches struggle with real-world photo collections that include diverse lighting conditions and transient objects. In this paper, we present Wild3R, a feed-forward approach for unconstrained sparse photo collections. The main bottleneck is the lack of training data that provides multiple viewpoints, a variety of illuminations, and transient variations necessary for learning robust scene representations. To address this, we introduce the WildCity dataset, which comprises 200 scenes, 170 lighting conditions, and transient objects, resulting in 337,500 images in total. By leveraging the dataset, our model learns appearance consistency across viewpoints conditioned on reference views, while removing transient content. Extensive experiments demonstrate that our method outperforms existing feed-forward approaches and achieves results competitive with prior per-scene optimization-based methods.

[CV-53] Task-Aligned Stability Analysis of Vision-Language Models for Autonomous Driving Hazard Detection ICML2026

Link: https://arxiv.org/abs/2606.11889
Authors: Everett Richards
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 8 pages (5 main body + 3 references / appendices). ICML 2026 Workshop on Combining Theory and Benchmarks (CTB)

Abstract:Vision-language models (VLMs) are increasingly used for scene understanding in autonomous driving, but robustness analysis often relies on task-agnostic embedding stability alone. We study whether corruption-induced embedding drift predicts changes in a task-aligned hazard score derived from CLIP image-text similarities. Using controlled corruptions on BDD100K road scenes, we compare embedding drift against margin drift, defined as the change in hazard score under perturbation. The relationship is highly corruption-dependent: some families exhibit strong coupling between representation drift and decision drift, while others induce hazardous decision instability despite relatively modest embedding change. Furthermore, corruption families differ in failure direction: most suppress hazard detections via false negatives, while occlusion instead triggers false alarms, suggesting that benchmark design should account for asymmetric failure modes, not just overall instability rates. These results suggest that robustness benchmarks should include task-aligned stability measures in addition to embedding-level perturbation statistics.
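
The contrast between embedding drift and margin drift can be made concrete with a toy sketch. Only the definition of margin drift as the change in hazard score under perturbation comes from the abstract. Everything else is an assumption: the hazard score is modelled here as the difference between the image's cosine similarity to a "hazard" text embedding and to a "safe" one, embedding drift is taken as one minus cosine similarity, and the 2D vectors stand in for CLIP embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def hazard_score(image_emb, hazard_text_emb, safe_text_emb):
    """Assumed task-aligned score: similarity to the hazard prompt minus the safe prompt."""
    return cosine(image_emb, hazard_text_emb) - cosine(image_emb, safe_text_emb)

def embedding_drift(clean_emb, corrupted_emb):
    """Task-agnostic drift: one minus cosine similarity of the two image embeddings."""
    return 1.0 - cosine(clean_emb, corrupted_emb)

def margin_drift(clean_emb, corrupted_emb, hazard_text_emb, safe_text_emb):
    """Change in hazard score caused by the corruption."""
    return (hazard_score(corrupted_emb, hazard_text_emb, safe_text_emb)
            - hazard_score(clean_emb, hazard_text_emb, safe_text_emb))

hazard_text, safe_text = [1.0, 0.0], [0.0, 1.0]
clean, corrupted = [0.75, 0.65], [0.65, 0.75]
print(embedding_drift(clean, corrupted))
print(margin_drift(clean, corrupted, hazard_text, safe_text))
```

In this toy case the two image embeddings stay almost parallel, yet the hazard score changes sign, which is the kind of decoupling between representation stability and decision stability that the abstract reports for some corruption families.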

[CV-54] Image Quality Assessment of Identity Cards Using Measures from Open Face Image Quality

Link: https://arxiv.org/abs/2606.11884
Authors: Gregor Grote,Juan E. Tapia,Christian Rathgeb
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: Presented at IWBF 2026 (14th International Workshop on Biometrics and Forensics)

Abstract:This paper addresses the challenge of assessing image quality in ID cards in remote verification systems by applying capture-related quality measures from the Open Face Image Quality (OFIQ) standard to ID card images. Our preprocessing pipeline includes corner detection, perspective normalization, and comprehensive foreground masking to ensure accurate and unbiased quality measure computation. We evaluate the effectiveness of these measures by analyzing their correlation with the performance of three presentation attack detection (PAD) algorithms across four diverse ID card datasets, where two datasets contain bona fide, i.e. pristine, images and two contain printed mock ID cards. Our results suggest that quality assessment based on some OFIQ measures can significantly improve PAD performance.

[CV-55] SG2Loc: Sequential Visual Localization on 3D Scene Graphs

Link: https://arxiv.org/abs/2606.11880
Authors: Nicole Damblon,Olga Vysotska,Federico Tombari,Marc Pollefeys,Daniel Barath
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The code will be available at this https URL

Abstract:Visual localization in complex indoor environments remains a critical challenge for robotics and AR applications. Sequential localization, where pose estimates are refined over time, is important for autonomous agents. However, traditional methods often require storing extensive image databases or point clouds, leading to significant overhead. This paper introduces a novel, lightweight approach to sequential visual localization using 3D scene graphs. Our method represents the environment with a compact scene graph, where nodes represent objects (with coarse meshes) and edges encode spatial relationships. For each image in the localization phase, we extract per-patch semantic features, predicting object identities. Localization is performed within a particle filter framework. Each particle, representing a camera pose, projects the coarse object meshes from the scene graph into the image, assigning object identities to patches based on visibility. The similarity between the per-patch features of the input image and the object features from the scene graph determines the weight of a particle. Subsequent images are incorporated sequentially, refining the pose estimate. By leveraging a compact scene graph and efficient semantic matching, our method significantly reduces storage while maintaining performance on real-world datasets. The code will be available at this https URL.

[CV-56] Task-Aware Structured Memory for Dynamic Multi-modal In-Context Learning ICML2026

Link: https://arxiv.org/abs/2606.11853
Authors: Zhirui Chen,Ziwei Chen,Ling Shao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ICML 2026

Abstract:Multi-modal large language models (MLLMs) depend on in-context learning (ICL) for rapid task adaptation, but their scalability is severely limited by finite context windows and the growing cost of key-value (KV) caches in long multi-modal sequences. Existing memory compression approaches typically rely on rigid token removal or sample-dependent importance estimation, which introduces bias, disrupts semantic structure, particularly for visual representations, and yields static memories that cannot adapt to new queries. We introduce TASM (Task-Aware Structured Memory), a training-free framework that addresses these limitations through task-aware, structure-preserving, and dynamically accessible memory construction. TASM employs task-vector guided compression to replace sample-specific signals with a task-level direction that captures shared relevance across demonstrations. To preserve the underlying manifold, it applies semantics-aware token merging via bipartite graph matching, aggregating tokens without destructive pruning. Finally, TASM structures memory into a hierarchy comprising a compact Core Memory and a Latent Bank, facilitating query-adaptive dynamic retrieval. Evaluations confirm TASM maintains high performance under heavy compression, effectively balancing efficiency with adaptability.
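
Token merging via bipartite matching, mentioned above as the structure-preserving alternative to pruning, can be sketched as follows. This is a generic bipartite soft-matching sketch in the style of earlier token-merging work, not TASM's actual procedure: the alternating split into two sets, the cosine similarity criterion, the plain averaging, and the name `bipartite_merge` are assumptions, and the task-vector guidance described in the abstract is not modelled.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def bipartite_merge(tokens, r):
    """Merge the r most similar (A, B) pairs; A = even positions, B = odd positions."""
    a_idx = list(range(0, len(tokens), 2))
    b_idx = list(range(1, len(tokens), 2))
    # For every token in A, find its most similar partner in B.
    candidates = []
    for i in a_idx:
        j = max(b_idx, key=lambda b: cosine(tokens[i], tokens[b]))
        candidates.append((cosine(tokens[i], tokens[j]), i, j))
    candidates.sort(reverse=True)
    chosen = candidates[:r]
    merged_away = {i for _, i, _ in chosen}
    # Accumulate merged tokens into their partners, then average.
    sums = {j: list(tokens[j]) for j in b_idx}
    counts = {j: 1 for j in b_idx}
    for _, i, j in chosen:
        sums[j] = [s + t for s, t in zip(sums[j], tokens[i])]
        counts[j] += 1
    merged = []
    for k, tok in enumerate(tokens):
        if k in merged_away:
            continue
        merged.append([s / counts[k] for s in sums[k]] if k in sums else list(tok))
    return merged

tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 1.0]]
print(bipartite_merge(tokens, r=2))
```

Unlike pruning, every input token still contributes to some output token, so the sequence gets shorter without any content being discarded outright.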

[CV-57] SheafStain: Sheaf-Theoretic Schrödinger Bridge for Spatially and Biologically Coherent Virtual Staining

Link: https://arxiv.org/abs/2606.11846
Authors: Hyeongyeol Lim,Hongjun Yoon,Eunjin Jang,Daeky Jeong,Won June Cho,Hwamin Lee
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 32 pages

Abstract:Current virtual staining approaches offer the potential for time- and cost-efficient biomarker quantification in cancer diagnostics and prognostics. However, patch-wise inference for gigapixel whole slide images (WSIs) fails to maintain spatial continuity, yielding artifacts that cause catastrophic mismatches with ground-truth images. Although pathology Vision Foundation Models (VFMs) offer rich representations, their self-attention causes varying global contexts to produce inconsistent embeddings for the same physical region. We formalize and validate this "context contamination" as a sheaf-theoretic problem where these embeddings form a presheaf that violates the gluing axiom. To address this, we propose SheafStain, a new approach that reinterprets VFM features as sheaf-like sections for spatially and biologically coherent virtual staining. Specifically, SheafStain integrates class and patch tokens into a Schrödinger Bridge framework as sheaf-like sections. While the class token anchors biological consistency, patch tokens form a per-position spatial map. A backbone co-pretrained on Hematoxylin & Eosin (H&E) and Immunohistochemistry (IHC) yields non-degenerate cross-stain stalks, so a single VFM feature space supervises both input conditioning and output stain alignment. Departing from prior work that evaluates on isolated 256 × 256 patches and either random-crops or resizes the 1024 × 1024 ground truth, we translate at 256 × 256 and evaluate on the stitched 1024 × 1024 outputs across HER2, ER, PR, and Ki-67. SheafStain demonstrates promising results against six prior methods while mitigating patch-boundary stitching artifacts. Code will soon be released.

[CV-58] Scene-Adaptive Nonlinear Tone Curves for Pseudo Ground-Truth Generation in Low-Light 3D Gaussian Splatting

Link: https://arxiv.org/abs/2606.11841
Authors: Mingzhe Lyu,Jinqiang Cui,Hong Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Low-light novel view synthesis is challenging because dark multi-view images contain noise, weak structural detail, and compressed dynamic range. Recent 3D Gaussian Splatting (3DGS) methods address these challenges by generating pseudo ground-truth (pseudo-GT) images as supervision targets when paired normal-light references are unavailable. Existing pseudo-GT methods apply a uniform linear gain to all pixels, which clips bright regions while providing insufficient enhancement in dark regions, limiting reconstruction quality. We observe that nonlinear tone mappings, long established in 2D low-light enhancement, have not been explored for pseudo-GT generation in 3D reconstruction. Accordingly, we propose a scene-adaptive nonlinear tone-curve framework that replaces linear pseudo-GT with nonlinear alternatives. The framework introduces percentile-based normalisation for scene-agnostic curve application, a scene-adaptive offset for automatic black-level adjustment, and two complementary curves: Adaptive SoftExp (ASE), a bounded exponential curve, and Adaptive Poly3 (AP3), a data-driven cubic polynomial. The module changes only the pseudo-GT computation and leaves the 3DGS backbone unchanged. Experiments on three benchmarks covering 21 scenes show that both curves consistently outperform the linear baseline with PSNR improvements up to +4.34 dB on LOM and +3.25 dB on RealX3D. Both curves achieve similar performance despite their different mathematical forms, suggesting the improvement is curve-agnostic. Code is available at this https URL
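
The difference between a uniform linear gain and a bounded nonlinear tone curve can be shown with a small sketch. The abstract does not give the formulas for ASE or AP3, so the curve below is a generic bounded exponential chosen for illustration, and the 1st/99th percentile range, the steepness `k`, and the names `normalise`, `linear_gain`, and `soft_exp` are all assumptions.

```python
import math

def percentile(values, q):
    """Linearly interpolated percentile, with q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def normalise(values, low_q=1.0, high_q=99.0):
    """Map the chosen percentile range to [0, 1]; assumes the two percentiles differ."""
    lo, hi = percentile(values, low_q), percentile(values, high_q)
    return [min(1.0, max(0.0, (v - lo) / (hi - lo))) for v in values]

def linear_gain(x, gain):
    """Uniform gain: bright inputs saturate at 1.0."""
    return min(1.0, gain * x)

def soft_exp(x, k=4.0):
    """Generic bounded exponential curve mapping [0, 1] onto [0, 1]."""
    return (1.0 - math.exp(-k * x)) / (1.0 - math.exp(-k))

dark_pixels = [0.01, 0.02, 0.03, 0.05, 0.08, 0.10, 0.20, 0.40]
for x in normalise(dark_pixels):
    print(round(x, 3), round(linear_gain(x, 4.0), 3), round(soft_exp(x), 3))
```

With a gain of 4, every normalised value above 0.25 is clipped to 1.0 and loses its ordering, while the bounded curve lifts dark values strongly and still keeps bright values distinct.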

[CV-59] Plan-and-Verify Video Reward Reasoning with Spatio-Temporal Scene Graph Grounding

Link: https://arxiv.org/abs/2606.11838
Authors: Hyomin Kim,Junghye Kim,Joanie Hayoun Chung,Yoonjin Oh,Kyungjae Lee,Sungbin Lim,Sungwoong Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reward models for text-to-video (T2V) generation guide post-training but often fail at fine-grained semantic alignment. We trace this to two structural weaknesses in existing reasoning-based reward models: they do not systematically verify every condition described in the prompt, and the visual evidence supporting each judgment remains implicit in their free-form reasoning. We propose SG-PVR, a video reward model that addresses these limitations through plan-and-verify reasoning grounded in spatio-temporal scene graphs. The verification plan decomposes the prompt into atomic claims, ensuring every requirement is checked. The spatio-temporal scene graph, encoding entities, attributes, and temporally-grounded relations, is extracted from the video and maintained as a persistent structured visual reference throughout reasoning. Each claim is verified against both the video and the scene graph, anchoring judgments in explicit visual evidence. SG-PVR achieves strong performance on semantic alignment, including fine-grained temporal semantics. As a test-time reranker, it further enhances compositional alignment in T2V generation.

[CV-60] LASA: A Weak Supervision Method for Open-Vocabulary Scene Sketch Semantic Segmentation

Link: https://arxiv.org/abs/2606.11837
Authors: Liwen Yi,Xianlin Zhang,Yue Zhang,Yue Ming,Xueming Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Open-vocabulary scene sketch semantic segmentation aims to assign dense semantic labels to sparse line drawings based on flexible category vocabularies specified at inference time, without relying on pixel-level annotations during training. Unlike natural images, sketches lack texture and color cues, making semantic understanding heavily dependent on stroke layout and spatial configuration, a challenge that renders single-layer vision-language features inherently unstable. Our key observation is that attention maps from different Vision Transformer layers encode complementary spatial cues: shallow layers capture global structural layouts, while deeper layers focus on local stroke intersections and object parts. This suggests that cross-layer aggregation provides a more robust structural prior than any individual layer alone. Leveraging this insight, we propose a structure-aware framework built upon Layer-wise Accumulated Structural Attention (LASA), which aggregates multi-layer attention to guide hierarchical semantic alignment under weak supervision and refine predictions during inference. Experiments on FS-COCO, SFSD, and FrISS show that LASA improves mIoU by +3.43, +8.01, and +15.74 over the prior weakly supervised baselines, demonstrating consistent gains in both segmentation accuracy and spatial coherence. Our source code will be made publicly available.

[CV-61] TextHOI-3D: Text-to-3D Hand-Object Interaction via Discrete Multi-View Generation and Joint Mesh Optimization

Link: https://arxiv.org/abs/2606.11805
Authors: Zixiong Hao,Zhencun Jiang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 8 figures, 3 tables

Abstract:Text-conditioned 3D generation has progressed rapidly for images and isolated objects, but producing a hand-object mesh remains challenging: the output must preserve language semantics, cross-view consistency, object geometry, articulated hand shape, and physically plausible contact. We present TextHOI-3D, a staged framework that uses generated multi-view observations as an explicit interface between text-conditioned visual generation and geometry-aware hand-object recovery. TextHOI-3D learns a compact VQ token space for fixed-camera hand-object observations, predicts multi-view visual tokens from text with a CLIP-conditioned visual autoregressive model, and recovers a unified hand-object mesh through prior initialization, multi-view joint optimization, and anti-penetration refinement. The design separates semantic generation from geometric recovery while keeping both stages connected by a discrete multi-view representation. On HO3D-derived evaluations, the multi-view setting reduces object CD from 17.26 mm to 4.92 mm and penetration volume from 5.3721 cm^3 to 0.2193 cm^3 compared with a single-view counterpart, while improving hand errors and surface F-scores. These results support multi-view visual tokens as an effective intermediate representation for text-driven 3D hand-object mesh creation.

[CV-62] A Comprehensive Ecosystem for Open-Domain Customized Video Generation ICASSP2026

Link: https://arxiv.org/abs/2606.11783
Authors: Jingxu Zhang,Yuqian Hong,Daneul Kim,Kai Qiu,Qi Dai,Jianmin Bao,Yifan Yang,Xiaoyan Sun,Chong Luo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 3 figures, 4 tables. Accepted by ICASSP 2026

Abstract:Recent progress in video generation has shown impressive visual synthesis capabilities. However, open-domain customized video generation remains limited by the lack of large-scale, annotated datasets capturing diverse identity-specific attributes. To address this, we introduce PexelsCustom-1M, the first publicly available million-scale dataset for identity-preserving video generation, containing one million curated (identity, text, video) triplets across 8,000+ categories. Leveraging this, we propose CustoMDiT, a parameter-efficient framework that adapts a pretrained multimodal Diffusion Transformer into a customized video generator with only 8% additional learnable parameters. Our method surpasses prior state-of-the-art. However, benchmarks such as DreamBooth cover only 100 classes, which is insufficient for real-world applications. To overcome this, we construct OpenCustom, a new benchmark with 1,000+ categories, created via cross-dataset knowledge fusion from ImageNet and MS-COCO. Extensive experiments confirm the advantages of both our dataset and model. We will open-source the entire ecosystem (dataset, pipeline, benchmark, and implementations) to support further research.

[CV-63] Seeing What Matters: Perceptual Wrapper with Common Randomness for 3D Gaussian Splatting

Link: https://arxiv.org/abs/2606.11782
Authors: He-Bi Yang,Jing-Zhong Chen,Yen-Kuan Ho,Sang NguyenQuang,Fan-Yi Hsu,Yun-Yu Lee,Jui-Chiu Chiang,Wen-Hsiao Peng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 9 figures

Abstract:While 3D Gaussian Splatting (3DGS) achieves impressive real-time rendering, it frequently struggles to synthesize high-frequency textures, a limitation heavily exacerbated in memory-constrained and rate-distortion-optimized (RDO) pipelines. To address this, we propose a versatile 2D perceptual wrapper that enhances the rendered outputs of existing 3DGS representations in a content- and view-dependent manner. Our method leverages a lightweight synthesis network conditioned on pseudo-random Gaussian noise to synthesize perceptually plausible textures. Supervised by Wasserstein Distortion, the network learns to match local feature statistics rather than strictly enforcing pixel-wise reconstruction fidelity, effectively mitigating the blurriness inherent in standard frameworks. We demonstrate the broad applicability of our plug-and-play approach across vanilla, memory-constrained, and RDO 3DGS methods. Comprehensive subjective and objective experiments confirm that our method significantly improves over existing baselines, yielding superior perceptual quality at sharply reduced file or model sizes.

[CV-64] Battery detection of XRay images using transfer learning

Link: https://arxiv.org/abs/2606.11779
Authors: Nermeen Abou Baker,David Rohrschneider,Uwe Handmann
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published at the European Symposium on Artificial Neural Networks (ESANN 2022)

Abstract:The need for detecting and sorting batteries is increasing drastically in many applications. This study demonstrates the potential of transfer learning for predicting whether an image contains a battery, locating it, and identifying three types of batteries: prismatic, pouch, and cylindrical Lithium-Ion Batteries (LIB). In particular, it applies transfer learning in two steps: training a pre-trained YOLOv5m on a large-scale dataset to detect electronic devices, then using the resulting weights to detect and classify the batteries. Battery detection reaches a precision of 94%, outperforming the pretrained YOLOv5m weights by 5%, with an inference time of 22 ms.

[CV-65] AnchorEdit: Maintaining Temporal Consistency in Multi-turn Image Editing via Causal Memory

Link: https://arxiv.org/abs/2606.11751
Authors: Hang Xu,Xiaoxiao Ma,Guohui Zhang,Yu Hu,Siming Fu,Jie Huang,Lin Song,Haoyang Huang,Nan Duan,Feng Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Code: this https URL

Abstract:Multi-turn image editing is essential for iterative design, yet current models often struggle with identity drift and error accumulation over successive steps. While existing research leverages video priors for consistency, their reliance on bidirectional attention is fundamentally misaligned with the causal, sequential nature of interactive editing. In this paper, we propose AnchorEdit, the first autoregressive (AR) diffusion-based framework designed specifically for high-resolution, long-term multi-turn editing. AnchorEdit bridges the gap between video priors and causal inference through a three-stage training curriculum: identity-preserving single-turn pretraining, causal AR forcing fine-tuning with a novel self-rollout strategy to mitigate exposure bias, and consistency distillation for efficient 4-step generation. During inference, we introduce a memory mechanism to anchor the initial subject identity and ensure stable extrapolation across extended editing trajectories. To evaluate performance, we provide a new high-resolution multi-turn editing benchmark designed to stress-test long-horizon stability. Extensive experiments demonstrate that AnchorEdit achieves state-of-the-art results, maintaining exceptional subject fidelity and instruction following even over 10+ interaction rounds.

[CV-66] From Prompts to Tokens: Internalizing Causal Supervision in Vision-Language Model for Multi-Image Causal Reasoning

Link: https://arxiv.org/abs/2606.11745
Authors: Haoping Yu, Yuanxi Li, Jing Ma
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Visual causal reasoning is essential for understanding and intervening in the physical world, requiring identification of causal variables from visual inputs and reasoning over intervention effects. Despite recent progress, large vision–language models (VLMs) remain brittle at such tasks, especially for interventional and counterfactual queries over multi-image inputs. Most existing explorations inject causal knowledge via textual prompts, leaving causal mechanisms external to model execution and limiting reliable control during inference. To address this problem, we propose BridgeVLM, which internalizes visual causal reasoning by inducing a causal graph from multi-image inputs and converting it into structured Causal Tokens executed by RAMP layers injected into the LLM decoder for causal message passing. We further introduce a unified training interface M3S for fine-grained causal supervision from different granularities (local/global level). BridgeVLM achieves 54.4% accuracy on intervention tasks on CausalVLBench (vs. 33.2% with prompt-level supervision), improves results on Causal3D from 43.6% to 49.0%, and substantially improves causal structure learning on CausalVLBench (F1: 33.4% → 75.1%).

[CV-67] Multi-View In-Cabin Monitoring System for Public Transport Vehicles ICDM2026

Link: https://arxiv.org/abs/2606.11739
Authors: Evgeny Gorelik, Kenny Dean Karrow, Fikret Sivrikaya, Sahin Albayrak, Christian Baumann
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Submitted to ICDM2026

Abstract:We introduce a multi-view in-cabin monitoring dataset for public transportation with synchronized RGB and depth images from four inward-facing cameras and a rotating LiDAR covering the vehicle interior of a digitalized and partly automated German city bus. The dataset contains 9,136 synchronized samples with annotations and is accompanied by a calibration and pseudo-labeling pipeline that generates 3D human pose estimates and oriented 3D bounding boxes for occupants. We further provide a nuScenes-format conversion and benchmark representative multi-view 3D detection models (e.g., Lift-Splat-Shoot and BEVFusion), supporting comparative evaluation and small-scale training of multi-view in-cabin perception models. The dataset and tools are available at this https URL.

[CV-68] Ouroboros-Spatial: Closing the Data-Model Loop for Spatial Reasoning

Link: https://arxiv.org/abs/2606.11719
Authors: Enhan Zhao, Wei Wu, Yuanrui Zhang, Xueliang Zhao, Di He
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Spatial reasoning remains a persistent challenge for multimodal large language models (MLLMs). Existing approaches largely rely on large-scale, statically curated datasets, where all training samples are treated uniformly regardless of the model’s evolving capabilities. This static paradigm is inherently data-inefficient: training capacity is often spent on samples that are either trivial or overly difficult for the model at its current stage. To address this limitation, we propose Ouroboros-Spatial, a self-evolving training framework in which the model plays dual roles as a proposer and a solver. In each iteration, a frozen proposer generates spatial question-answer (QA) pairs from 3D scene metadata and raw video frames, together with executable code for deriving reliable ground truth. A learnable solver is then fine-tuned on the accepted samples, and its per-sample prediction confidence is used as a difficulty signal. This signal is fed back to the proposer in the next iteration, guiding it to generate questions better matched to the solver’s current capabilities. Through this closed-loop design, the training distribution co-evolves with model ability, reducing redundant trivial examples while filtering out ambiguous or uninformative samples with limited learning value. Across six spatial reasoning benchmarks, Ouroboros-Spatial substantially improves Qwen3-VL-4B and Qwen3-VL-8B while using an order of magnitude fewer training examples than recent large-scale curated datasets. On VSI-Bench, it yields absolute gains of 9.9 and 6.8 points for the 4B and 8B models, respectively, enabling both to outperform a wide range of strong open-source and proprietary baselines.

[CV-69] ERN-Net: Evolving Reason Node-Net for Document Binarization

Link: https://arxiv.org/abs/2606.11710
Authors: Hsin-Jui Pan, Sheng-Wei Chan, Jen-Shiung Chiang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper presents ERN-Net, an Evolving Reason Node-Net for efficient document image binarization. ERN-Net enhances degradation-sensitive regions, such as faint strokes, broken characters, and noisy backgrounds, through evolving reason nodes and multi-scale reasoning. We further compare ResNet-101, ConvNeXt-Tiny, and ConvNeXt-Base, and find that ConvNeXt-Tiny provides the best practical trade-off between accuracy and memory usage. In addition, DIBCO-based pretraining improves binarization performance without increasing model memory consumption, requiring only about 1.5 additional training hours. Experiments on DIBCO-style benchmarks show that ERN-Net is effective under low-data and low-memory settings.

[CV-70] RankVR: Low-Rank Structure Perception and Value Recalibration for Robust Composed Image Retrieval ICMR2026

Link: https://arxiv.org/abs/2606.11689
Authors: Jiale Huang, Zixu Li, Zhiheng Fu, Zhiwei Chen, Qinlei Huang, Yupeng Hu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICMR 2026

Abstract:Composed Image Retrieval (CIR) constitutes a pivotal paradigm requiring models to perform joint reasoning on reference images and modification texts. However, the prevalence of Noisy Triplet Correspondence (NTC) in large-scale datasets severely constrains model performance. Existing denoising methods either target binary mismatches or rely on scalar-based point-wise estimation, neglecting rich global structural correlations among sample populations and dynamic value variations during training, thereby yielding suboptimal results. This paper identifies two critical unresolved challenges: Global Structural Inconsistency of Semantic Correlations and Hard Sample Discrimination Uncertainty. To address these, we propose RankVR, a framework designed to construct a robust CIR model via global structure consistency and dynamic value perception. Specifically, we introduce the Global Structure Consistency Perception (GSCP) module, which utilizes the Effective Rank of the Correlation Matrix to decouple clean samples from structural noise. By measuring rank difference, GSCP identifies samples disrupting macroscopic semantic symmetry. Furthermore, we develop the Adaptive Semantic Value Calibration (ASVC) module to distinguish high-value hard clean samples. By integrating training potential and reliability, it dynamically quantifies the semantic value of each triplet, ensuring effective utilization of hard samples while suppressing noise characterized by logical conflicts. Extensive experiments on the FashionIQ and CIRR benchmark datasets demonstrate that RankVR significantly outperforms existing state-of-the-art methods, validating its superior robustness in noisy environments.

[CV-71] DroneShield-AI: A Multi-Modal Sensor Fusion Framework for Real-Time Autonomous Drone Threat Detection Behavioral Intent Classification and Swarm Intelligence in Contested Airspace

Link: https://arxiv.org/abs/2606.11687
Authors: Marius Bayizere
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 23 pages, 6 figures, 11 tables. Code available at this https URL

Abstract:Unmanned Aerial Vehicle (UAV) threats have emerged as a defining security challenge of the 21st century. This paper presents DroneShield-AI, a unified open framework integrating six processing layers: RF signal classification, acoustic motor-signature detection, YOLOv8-based visual detection, evidence-weighted sensor fusion, a Behavioral Intent Classification Engine (BICE), and a Graph Neural Network Swarm Intelligence Module (GNN-SIM). BICE introduces the first systematic six-class threat taxonomy for drone flight patterns, enabling predictive operator alerts with a 30-second advance-warning horizon. GNN-SIM is the first open framework for adversarial multi-drone formation analysis using Graph Attention Networks. Evaluated on three publicly available real-world datasets, the fused pipeline achieves 96.1% detection accuracy, a 3.2% false alarm rate, an AUC-ROC of 0.981, and 142 ms end-to-end latency on commodity CPU-class hardware at a total system cost of approximately 500-780 USD. All code, model weights, and simulation datasets are publicly released at submission.

[CV-72] Reason Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning ICML2026

Link: https://arxiv.org/abs/2606.11683
Authors: Chaofan Ma, Zhenjie Mao, Yuhuan Yang, Fanqin Zeng, Yue Shi, Yingjie Zhou, Xiaofeng Cao, Jiangchao Yao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ICML 2026

Abstract:Spatial reasoning from egocentric videos is inherently challenging because the observable evidence is constrained by the camera trajectory. Existing methods rely on single-turn inference, forcing models to resolve geometric ambiguity through semantic priors rather than verifiable evidence. We argue that spatial reasoning should be revisitable: conclusions formed under limited evidence should remain open to revision when complementary viewpoints become available. Building on this insight, we propose Reason, then Re-reason (ReRe), a training-free, inference-time framework with two phases: in the Reason Phase, an MLLM forms a spatial hypothesis from the original video; in the Re-reason Phase, it verifies or revises the hypothesis by observing a synthesized novel-view video. To enable effective cross-view revisiting, we design a Geometry-to-Video pipeline that renders strategically complementary novel views from predicted 3D geometry. These views feature an elevated, oblique perspective with scene-spanning coverage, while preserving the MLLM’s native video interface without architectural modifications. Extensive evaluations on VSI-Bench and STI-Bench demonstrate that ReRe substantially boosts open-source MLLMs to rival proprietary state-of-the-art performance. Project page: this https URL

[CV-73] Parameter-Efficient Adapter Tuning for Tabular-Image Multimodal Learning

Link: https://arxiv.org/abs/2606.11682
Authors: Jiaqi Luo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Tabular-image multimodal learning aims to improve predictive modeling by jointly using structured tabular attributes and visual data. Although pretrained encoders provide strong modality-specific representations, full fine-tuning can be computationally expensive, while keeping encoders frozen may limit task-specific adaptation. We propose the Tabular-Image Adapter (TI-Adapter), a modality-specific adapter-based fine-tuning framework for efficient multimodal adaptation. TI-Adapter freezes the pretrained tabular encoder and learns an adapter after the extracted tabular embedding, while adapting the image branch with embedding-level and bottleneck-level adapters instead of full fine-tuning. Experiments on 20 tabular-image datasets show that TI-Adapter achieves competitive or better predictive performance than full fine-tuning while using substantially fewer trainable parameters. Ablation studies further demonstrate the importance of adapter placement for balancing performance and practical efficiency.

[CV-74] ARGUS: Stacked Multi-View Identity Mosaic Injection for Subject-Preserving Video Generation

Link: https://arxiv.org/abs/2606.11670
Authors: Zijie Meng, Jiwen Liu, Yufei Liu, Chengzhuo Tong, Xiaoqiang Liu, Yuanxing Zhang, Yulong Xu, Pengfei Wan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 13 pages, 3 figures

Abstract:Subject-preserving video generation is not solved by frontal-face similarity alone: a generated person must remain recognizable across motion, large viewpoint changes, expression shifts, occlusion, scale variation, and conflicts among text, first-frame, and identity references. We argue that the central bottleneck is the point-reference paradigm, which collapses identity into a single static observation entangled with pose, accessories, lighting, background, and camera statistics. We introduce Argus, a Wan-based framework centered on Stacked Multi-View Identity Mosaic Injection (SMII). SMII converts MLLM-selected image/video identity evidence into a 3×3 stacked mosaic, synchronizes the mosaic with the current diffusion time, and injects it as negative-time read-only memory in Wan’s native token space. This turns identity from an external clean adapter or a single reference image into a compact dynamic distribution. Around SMII, an MLLM Identity Director selects informative identity moments and resolves condition conflicts, while no-cross-pair counterfactual training, Temporal Identity Annealing, and Adaptive Self-Likeness Guidance improve robustness without paired subject-video supervision. We further release HardID-Celeb, a public-figure identity-stress benchmark, and introduce YawScore and OccScore to probe large-yaw and first-frame-occlusion robustness. Argus achieves state-of-the-art results on OpenS2V-Eval Human-Domain, reaching 64.38 Total Score, 71.86 FaceSim, 51.62 NexusScore, and 79.14 NaturalScore. On HardID-Celeb, Argus obtains 76.80 FaceSim and improves YawScore and OccScore by 12.60 and 15.10 points over the strongest baselines, demonstrating that dynamic identity memory and large-scale counterfactual self-supervision are highly effective for subject-preserving video generation.

[CV-75] Learning Instance-Adaptive Low-Rank Orthogonal Subspaces for Clothes-Changing Person Re-Identification ICML2026

Link: https://arxiv.org/abs/2606.11661
Authors: Dong-Woo Kim, Tae-Kyun Kim
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to the ICML 2026 Workshop on CoLoRAI

Abstract:Clothes-changing person re-identification (CC-ReID) aims to recognize individuals despite drastic appearance changes caused by clothing variation. While existing methods rely on adversarial learning to disentangle clothing features, we propose Ortho-ReID, which explicitly models a low-rank clothing subspace from VLM text descriptions and extracts clothing-invariant representations via direct geometric constraints. A critical component is our transformer-based Basis Maker, which refines a shared, low-dimensional clothing prior into an instance-adaptive low-rank subspace through cross-attention with image patches, enabling robust clothing feature extraction even under varying visibility conditions. This instance-adaptive subspace is supervised via alignment with clothing text embeddings, while identity features are extracted via a learnable projection head and geometrically constrained to be strictly orthogonal to it. Extensive experiments demonstrate state-of-the-art performance on PRCC (+5.9% top-1), Celeb-reID-light (+3.5%), and LaST (+5.3%), with competitive results on LTCC.

[CV-76] Motion Reinforces Appearance: RGB-Skeleton Gated Residual Fusion for Micro-Gesture Online Recognition

Link: https://arxiv.org/abs/2606.11645
Authors: Jialin Liu, Xinwen He, Pengyu Liu, Jiale Shi, Huaijuan Zang, Yanbin Hao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 2 figures

Abstract:Micro-gesture analysis attracts increasing attention for inferring spontaneous emotion from subtle body movements. Micro-gesture online recognition, which localizes and classifies each gesture instance in untrimmed videos, is a core task in the 4th EI-MiGA-IJCAI Challenge. Compared with typical temporal action detection, MGR emphasizes the localization and classification of actions, requiring the model to output the start time, end time, and category of each micro-gesture. Moreover, since micro-gestures are highly spontaneous, relying solely on a single modality makes it difficult to capture the complete and accurate multi-modal cues. In this work, we propose DyFADet+, which extends DyFADet into a dual-stream RGB-skeleton framework. In our model, both modalities are projected into shared multi-scale temporal embeddings and fused through a gated residual module, which adaptively injects skeleton motion into the RGB representation rather than using naive concatenation. Finally, these fused features are decoded by a Dynamic TAD head for online classification and boundary regression. On the SMG dataset, our method achieves an F1 score of 40.88, ranking 2nd in the Micro-gesture Online Recognition track.

[CV-77] Adapting Vision-Language Models from Iconic to Inclusive for Multi-Label Recognition Without Labels

Link: https://arxiv.org/abs/2606.11626
Authors: Cheng Chen, Jingyu Zhou, Yifan Zhao, Jia Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Understanding multi-label images remains a challenging task in computer vision. With the rapid progress of vision-language multimodal learning, vision-language models (VLMs) enable zero-shot recognition without labeled data. However, due to their intrinsic design, these models often prioritize the most iconic object and omit other contextual positives. This intrinsic bias conflicts with the nature of multi-label learning, thereby limiting their applicability. In this work, we propose an unsupervised framework that adapts VLMs from iconic recognition toward inclusive understanding, enabling label-free multi-label image recognition. Our approach consists of two key stages, "cutting" and "sewing": in the cutting stage, we present the multi-sampling response estimator to prevent the model from concentrating on a single object. In the sewing stage, the multi-object blend adaptation is introduced to adjust the labels to better conform to the multi-label distribution while preserving the intrinsic characteristics of the original model within only one epoch. Extensive experiments show that our framework significantly outperforms existing unsupervised approaches on four public datasets, even surpassing several representative weakly supervised baselines. These results demonstrate the potential of adapting pre-trained VLMs for more comprehensive visual understanding without manual annotations. Our code is publicly available at this https URL.

[CV-78] Precision-Aware Illumination-Disentangled Vision Transformer for Spacecraft 6D Pose Estimation

Link: https://arxiv.org/abs/2606.11619
Authors: Zongwu Xie, Yifan Yang, Yonglong Zhang, Guanghu Xie, Yang Liu, Shuo Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 7 figures

Abstract:Vision sensors provide a lightweight solution for spacecraft proximity operations, but monocular spacecraft 6D pose estimation remains difficult under illumination variation, specular reflection, shadowing, weak texture, and background interference. These factors make local visual evidence spatially unreliable and can destabilize pose regression. This article proposes a Precision-Aware Illumination-Disentangled Vision Transformer (PAID-ViT) for robust spacecraft pose estimation. The proposed model separates pose-relevant structure tokens from illumination-sensitive appearance tokens, estimates patch reliability before pose aggregation, and uses foreground mask supervision to preserve silhouette cues. A parameter-free geometric recovery module converts normalized crop coordinates, log-depth, and a continuous 6D rotation representation into camera-frame rotation and translation. Experiments on SPEED+ V2, the SPEED+ validation/lightbox/sunlamp evaluation configuration used in this study, suggest that PAID-ViT reduces translation error and improves robustness in the challenging sunlamp domain, while ablation studies support the complementary roles of illumination disentanglement, reliability-aware token aggregation, mask supervision, and training-side regularization.
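
A continuous 6D rotation representation is commonly decoded with a Gram-Schmidt step: the six numbers are read as two 3-vectors, orthonormalized, and completed with a cross product. The sketch below assumes that standard decoding and a column-vector convention; the function name `rotation_from_6d` is illustrative, and the paper's actual recovery module (which also handles crop coordinates and log-depth) may differ.

```python
import math

def _normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def rotation_from_6d(r6):
    """Map six numbers to a 3x3 rotation matrix via Gram-Schmidt.

    The first three numbers give the first column direction, the last
    three give a second direction that is made orthogonal to the first.
    Assumes the two input vectors are non-zero and not parallel.
    """
    a1, a2 = r6[:3], r6[3:]
    b1 = _normalize(a1)
    proj = _dot(b1, a2)
    b2 = _normalize([x - proj * y for x, y in zip(a2, b1)])
    b3 = _cross(b1, b2)  # completes a right-handed orthonormal frame
    # Assemble rows of the matrix whose columns are b1, b2, b3.
    return [[b1[i], b2[i], b3[i]] for i in range(3)]

R = rotation_from_6d([2.0, 0.0, 0.0, 1.0, 3.0, 0.0])
print(R)
```

Because the output is orthonormal by construction, a network can regress the six numbers freely without any constraint on its raw outputs.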

[CV-79] Adv-TGD: Adversarial Text-Guided Diffusion for Face Recognition Impersonation Attacks

Link: https://arxiv.org/abs/2606.11615
Authors: Omid Ahmadieh, Nima Karimian
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:The widespread adoption of face recognition (FR) technologies raises serious privacy concerns, as facial data can be exploited without consent. To address this challenge, we propose Adv-TGD, a generative adversarial attack framework that synthesizes photorealistic faces capable of impersonating target identities and deceiving face recognition systems. Built upon Stable Diffusion, Adv-TGD performs per-sample LoRA fine-tuning conditioned on concise textual prompts to generate natural yet adversarially manipulated identities. Unlike conventional identity-attack approaches, our method optimizes lightweight cross-attention adapters for each source-target pair within a single-step denoising process. Latent blending is constrained by a face-local heatmap mask to ensure spatially precise identity manipulation while preserving non-sensitive regions. We introduce a composite objective that integrates masked epsilon-MSE reconstruction, thresholded identity divergence in FR embedding space, directional feature alignment, and source-similarity suppression to balance adversarial attack and visual realism. Optionally, LLaVA-generated attribute prompts enhance fine-grained semantic details without reintroducing identity cues. Under the black-box evaluation protocol, Adv-TGD attains an average attack success rate (ASR) of 85.90% across IR152, IRSE50, MobileFace, and FaceNet, surpassing the semantic SOTA baseline Adv-CPG by +6.25 points, diffusion-based makeup method DiffAIM by +3 points, and noise-based P3-Mask by +16 points. Despite its strong attack efficacy, Adv-TGD preserves high visual fidelity (PSNR = 27.15 dB, SSIM = 0.981). Furthermore, we demonstrate the flexibility of our framework by successfully extending it to in-the-wild datasets (LADN), general object classification (ImageNet), and transformer-based diffusion models (FLUX.1).

[CV-80] Information-Theoretic Decomposition for Multimodal Interaction Learning CVPR2026

Link: https://arxiv.org/abs/2606.11614
Authors: Zequn Yang, Yake Wei, Haotian Ni, Zhihao Xu, Di Hu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026

Abstract:Multimodal learning hinges on capturing redundant, unique, and synergistic information across modalities, which collectively constitute multimodal interactions. A critical yet underexplored challenge is that these implicit interactions vary dynamically across samples. In this work, we present the first systematic, information-theoretic analysis highlighting why learning these dynamic, sample-specific interactions is critical for effective multimodal learning. Our analysis further reveals deficits in conventional paradigms at learning these distinct interaction types: modality ensemble approaches struggle to capture synergy, while joint learning paradigms often under-utilize redundant information. This highlights the need for an approach that can adaptively learn from different interaction types on a per-sample basis. To this end, we propose Decomposition-based Multimodal Interaction Learning (DMIL), a novel paradigm that explicitly models and learns from sample-specific interactions. First, we design a variational decomposition architecture to isolate the constituent interaction components. Second, we employ a new learning strategy that leverages these explicit interaction components in a fine-tuning process to achieve comprehensive interaction learning. Extensive experiments across diverse tasks and architectures demonstrate that DMIL consistently achieves superior performance by adapting to holistic sample-specific interactions. Our framework is flexible and broadly applicable, establishing an interaction-centric paradigm for multimodal learning. The code is available at this https URL.

[CV-81] Frozen Foundation-Model Embeddings Discard Small-Lesion Signal in Chest Radiography: Implications for Pre-Deployment Evaluation

Link: https://arxiv.org/abs/2606.11606
Authors: Raajitha Muthyala, Zhenan Yin, Alekhya Jilla, Frank Li, Theo Dapamede, Bardia Khosravi, Mohammadreza Chavoshi, Judy Gichoya, Saptarshi Purkayastha
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Frozen vision-transformer (ViT) foundation-model embeddings increasingly serve as the substrate for downstream chest-radiography (CXR) pipelines, yet where small-scale, low-contrast signal is retained or lost in the frozen forward pass has not been systematically quantified across architectures, pretraining domains, and objectives. We probed five frozen ViTs (RAD-DINO, DINOv2-B/14, DINOv3 ViT-7B, BiomedCLIP, MedSigLIP) and a frozen DINO-pretrained ResNet-50 architectural control across three large CXR cohorts (NIH-CXR14, MIMIC-CXR, Emory-CXR; aggregate pool n=492,724) and ChestX-Det10 (n=3,543; 1,462 small-lesion bounding boxes across Calcification, Nodule, Mass). Each model was evaluated with a small-scale-perturbation panel and a region-aware bounding-box-stratified probe on real lesions, comparing three pooling modes from the same forward pass: classification token (CLS), patch-mean (mean over all final-layer patch tokens), and bounding-box-restricted patch-local. On the perturbation panel, CLS embeddings sat at the chance floor (area under the ROC curve [AUC] 0.500-0.524); patch-mean was indistinguishable from CLS on iso-blur and reticular-fine cells but rose with CLS on larger directional-blur footprints, while disease AUC on globally decided tasks ranged 0.642-0.913. Patch-local probes recovered AUC ~1.0 from the same forward pass (per-model mean improvement +0.412 to +0.488); the ResNet-50 control reproduced the chance floor. On ChestX-Det10, image-level CLS classification showed within-class small-versus-large stratum gaps up to +0.243 AUC; bounding-box-level patch-local pooling on the same forward pass recovered AUC = 0.899 on every (model x class) cell. Frozen ViT embeddings silently suppress small-scale signal at the global-aggregation step; the signal is recoverable from patch tokens conditional on a region of interest.
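
The three pooling modes compared in this abstract can be illustrated on toy token vectors. The sketch below is plain Python and assumes a hypothetical 2×2 patch grid with 2-dimensional tokens; the helper names, the inclusive bounding-box convention, and the toy numbers are illustrative and not taken from the paper.

```python
def mean_pool(tokens):
    """Average a list of equal-length token vectors element-wise."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]

def pool(cls_token, patch_grid, mode, bbox=None):
    """Pool the tokens of one forward pass in one of three ways.

    patch_grid[row][col] is a token vector. bbox is
    (row0, col0, row1, col1), inclusive, in patch coordinates
    (a hypothetical convention chosen for this sketch).
    """
    if mode == "cls":
        return list(cls_token)
    if mode == "patch_mean":
        return mean_pool([t for row in patch_grid for t in row])
    if mode == "patch_local":
        r0, c0, r1, c1 = bbox
        return mean_pool([patch_grid[r][c]
                          for r in range(r0, r1 + 1)
                          for c in range(c0, c1 + 1)])
    raise ValueError(f"unknown mode: {mode}")

# Toy example: only the bottom-right patch carries a "lesion" signal.
grid = [[[0.0, 0.0], [0.0, 0.0]],
        [[0.0, 0.0], [4.0, 8.0]]]
cls = [0.5, 0.5]
print(pool(cls, grid, "patch_mean"))                 # signal averaged over 4 patches
print(pool(cls, grid, "patch_local", (1, 1, 1, 1)))  # signal from the box only
```

In this toy case the global mean shrinks the single-patch signal by the number of patches, while the box-restricted mean keeps it whole. That is the same qualitative effect the abstract reports, though the real experiments use learned probes on model embeddings.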

[CV-82] On Aligning Hierarchical Standardized Embedding for Audio-visual Generalized Zero-shot Learning

Link: https://arxiv.org/abs/2606.11602
Authors: Zihan Zhang, Jie Hong, Siyuan Fan, Yanghao Zhou, Pengfei Fang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Audio-visual Generalized Zero-shot Learning (AV-GZSL) is a challenging task that aims to classify both seen and unseen objects or scenes by integrating data from audio and visual modalities. Recent studies primarily focus on fusing or aligning audio and visual features to generate more informative audio-visual embeddings. In addition, most existing methods align audio-visual and textual features solely through their optimization objectives. However, those methods neglect the inherent distributional and structural differences between audio-visual and textual modalities. To address this limitation, we propose a method termed Aligning Hierarchical Standardized Embedding (AHSE), which enables hierarchical alignment of standardized audio-visual and textual embeddings within a shared embedding space. Specifically, we first apply Z-score standardization to the fused audio-visual and textual embeddings to reduce distributional mismatches. We then introduce a hierarchical alignment strategy that minimizes discrepancies at the semantic, class, and batch levels, thereby constructing a more robust and well-structured embedding space. This strategy not only preserves semantic and inter-class relationships but also maintains spatial consistency within each batch. Extensive experiments on three benchmark datasets, VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL, demonstrate that AHSE achieves competitive performance in zero-shot learning.
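
The Z-score step can be sketched in a few lines. The example below assumes each embedding is standardized over its own feature dimensions. The abstract does not say which axis AHSE normalizes over, so that choice, the `eps` term, and the function name are assumptions.

```python
import math

def zscore(embedding, eps=1e-8):
    """Standardize one embedding vector to zero mean and unit variance."""
    mean = sum(embedding) / len(embedding)
    var = sum((x - mean) ** 2 for x in embedding) / len(embedding)
    std = math.sqrt(var)
    # eps guards against division by zero for constant vectors.
    return [(x - mean) / (std + eps) for x in embedding]

audio_visual = zscore([10.0, 12.0, 14.0])  # embedding on a large scale
text = zscore([0.1, 0.2, 0.3])             # embedding on a small scale
print(audio_visual)
print(text)
```

After standardization the two toy vectors are nearly identical even though their raw scales differ by two orders of magnitude. This is the kind of distributional mismatch the abstract says standardization is meant to reduce.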

[CV-83] Spatially Coupled Phase-to-Depth Calibration for Fringe Projection Profilometry

Link: https://arxiv.org/abs/2606.11601
Authors: Sehoon Tak, Jae-Sang Hyun
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In fringe projection profilometry (FPP), depth is commonly recovered by fitting a phase-to-depth relation independently at each camera pixel. Although such pixel-wise calibration achieves high local accuracy, neighboring pixels can acquire markedly different calibration functions even when they observe the same smooth surface, producing spatially inconsistent geometry and structured surface artifacts. We propose a spatially coupled phase-depth transformation in which all pixels share a single low-dimensional mapping (global phase scalars combined with affine spatial terms on the undistorted reference-camera grid) rather than independent per-pixel fits, optionally augmented by a bounded, spatially smooth correction field. We further introduce a native-grid pairing scheme that constructs phase-depth calibration pairs directly on the reference-camera grid: when depth supervision comes from a rectified active-stereo pipeline, planes are fitted in stereo 3D and sampled back onto the camera grid along native rays, so the phase maps are never rectified. On a dental target with high-resolution scanner ground truth, the proposed model attains point-to-surface RMSE comparable to an active-stereo reference (about 12 µm aggregate) while substantially improving spatial coherence over pixel-wise polynomial and rational calibration, and reduces the runtime mapping to a few element-wise operations per pixel with negligible parameter storage.

[CV-84] Contactless 3D Human Body Measurement Using Depth Cameras for Smart Health Monitoring

Link: https://arxiv.org/abs/2606.11578
Authors: Martha Asare, Xuan Wang, Juan Lopez Alvarenga, Lois Akosua Serwaa, Jinghao Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 4 figures. Depth camera-based framework for contactless anthropometric measurement and geometric analysis using 3D point clouds

Abstract:Contactless body measurement technologies are becoming increasingly significant for smart health monitoring, digital health applications, and remote patient assessment. Traditional anthropometric measurements typically necessitate physical contact and trained personnel, which may constrain scalability in remote healthcare settings. In this study, we introduce a depth camera-based framework for estimating human body measurements utilizing 3D point cloud data. An Orbbec Astra 2 depth camera was employed to capture RGB images, depth maps, and 3D point clouds of participants. The captured point cloud was processed using Python-based tools, including Open3D, NumPy, and OpenCV, to segment the human body from the background. Key anthropometric measurements, such as height and arm span, were computed. The measurements were obtained through a combination of spatial filtering and landmark selection on the 3D point cloud, followed by the projection of the computed measurements onto the corresponding RGB image using camera intrinsic parameters. In addition to linear measurements, the approximate body volume and visible surface area were estimated using voxel-based occupancy analysis and mesh-based surface reconstruction methods. The experimental results from a single depth capture demonstrated that accurate body measurements and geometric estimates could be obtained from depth camera data without physical contact. This study provides a foundation for future real-time systems that integrate depth sensing with intelligent health monitoring and generative AI models for smart healthcare applications.
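
Voxel-based occupancy analysis can be sketched as: snap each 3D point to a voxel index, count the distinct occupied voxels, and multiply by the volume of one voxel. The sketch below uses plain Python rather than Open3D and assumes points in metres with a hypothetical voxel size. It counts only voxels that contain at least one point and does not fill the body interior, so it is an illustration of the counting idea and not the paper's pipeline.

```python
import math

def occupied_voxel_volume(points, voxel_size):
    """Approximate volume as (# distinct occupied voxels) * voxel_size**3."""
    occupied = {
        (math.floor(x / voxel_size),
         math.floor(y / voxel_size),
         math.floor(z / voxel_size))
        for x, y, z in points
    }
    return len(occupied) * voxel_size ** 3

def extent_along_axis(points, axis):
    """Linear measurement (e.g. height) as max minus min along one axis."""
    values = [p[axis] for p in points]
    return max(values) - min(values)

# Three toy points: the first two fall into the same 10 cm voxel.
cloud = [(0.01, 0.01, 0.01), (0.02, 0.03, 0.04), (0.12, 0.01, 0.01)]
print(occupied_voxel_volume(cloud, voxel_size=0.1))
print(extent_along_axis(cloud, axis=0))
```

The voxel size trades resolution against noise: smaller voxels follow the surface more closely but leave more gaps in sparse regions of the point cloud.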

[CV-85] AVIS: Adaptive Test-Time Scaling for Vision-Language Models

Link: https://arxiv.org/abs/2606.11576
Authors: Ahmadreza Jeddi,Minh Ngoc Le,Amirhossein Kazerouni,Hakki Can Karaimer,Hue Nguyen,Iqbal Mohomed,Michael Brudno,Alex Levinshtein,Konstantinos G. Derpanis,Babak Taati,Radek Grzeszczuk
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL

Abstract:Modern Vision-Language Models (VLMs) benefit from chain-of-thought prompting and test-time scaling, but these gains often come with prohibitive inference cost due to large visual contexts and long decoding chains. We view this cost through two coupled axes: Visual Context Scaling (VCS), which controls how much visual evidence is passed to the language model, and Visual Reasoning Scaling (VRS), which controls how much inference-time reasoning search is performed. Existing methods typically optimize one axis at a time, leaving the joint allocation of compute across these axes underexplored. We introduce Adaptive Visual Inference Scaling (AVIS), a lightweight policy that adapts both VCS and VRS per query. AVIS realizes VCS through Key Diversity Visual (KDV) pruning, a training-free O(N) key-based rule for removing redundant visual tokens before prefilling, and realizes VRS through adaptive self-consistency, using a learned difficulty predictor to select the number of reasoning rollouts. AVIS is deployment-friendly and compatible with shared-prefill inference, where all rollouts reuse a single prefilling pass and KV cache. Across diverse image and video reasoning benchmarks, AVIS improves the accuracy–compute trade-off relative to VCS-only and VRS-only baselines, and remains effective on top of RL post-trained VLMs while keeping compute and latency low.
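
The adaptive self-consistency idea described above (a difficulty score selects how many reasoning rollouts to sample, and the answers are then aggregated) might look roughly like the following sketch. The linear difficulty-to-budget mapping, the budget limits, the majority-vote aggregation, and all names are assumptions for illustration. The abstract does not specify AVIS's actual predictor or policy.

```python
from collections import Counter


def rollout_budget(difficulty, min_rollouts=1, max_rollouts=8):
    """Map a difficulty score in [0, 1] to a rollout count (assumed linear rule)."""
    difficulty = min(max(difficulty, 0.0), 1.0)
    return min_rollouts + round(difficulty * (max_rollouts - min_rollouts))


def self_consistent_answer(sample_answer, difficulty):
    """Sample as many rollouts as the budget allows and return the majority answer.

    sample_answer(i) stands in for one reasoning rollout of a model and is a
    hypothetical callback, not part of any real library.
    """
    n = rollout_budget(difficulty)
    answers = [sample_answer(i) for i in range(n)]
    winner, _ = Counter(answers).most_common(1)[0]
    return winner, n


# Hypothetical rollout outputs for one query.
votes = ["B", "A", "A", "B", "A", "A", "C", "A"]
hard_case = self_consistent_answer(lambda i: votes[i], difficulty=1.0)
easy_case = self_consistent_answer(lambda i: votes[i], difficulty=0.0)
```

Easy queries stop after a single rollout, while hard ones spend the full budget and rely on voting to suppress occasional wrong chains.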

[CV-86] Understanding Cross-Sensor Feature Variations for Generalizable 3D Perception

Link: https://arxiv.org/abs/2606.11573
Authors: Xin Qiu,Wenjie Liu,Fuyuan Ai,YuChen Tan,Zhiwei Xu,Chunyi Song
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Radar-camera BEV perception often suffers from degraded performance when evaluated across datasets, as changes in driving scenes, sensor configurations, and environmental conditions can alter both the input observations and the internal fused representations. This work studies this issue from the perspective of source-domain variation modeling, aiming to improve the robustness of BEV-based 3D detectors without relying on target-domain samples. We introduce a framework that characterizes visual scene variations in the frequency domain and uses them to synthesize diverse source-domain views. By comparing the resulting fused BEV representations, the framework further captures how image-level variations influence multi-modal BEV features. These variation patterns are then used to regularize the detector, encouraging the learned fusion space to remain stable under latent scene changes. The proposed method is applied only during training and leaves the inference pipeline unchanged. Experiments on cross-dataset radar-camera 3D detection between View-of-Delft and TJ4DRadSet demonstrate consistent improvements over multiple BEV fusion backbones, and the gains remain effective when a small amount of target-domain data is available.

[CV-87] FreqKD: Frequency-Decoupled Cross-Modal Knowledge Distillation for Infrared Object Detection

Link: https://arxiv.org/abs/2606.11572
Authors: Keval Thaker,Venkatraman Narayanan,Abdalmalek Aburaddaha,Samir A. Rawashdeh
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Transfer learning from large-scale RGB foundation models to infrared (IR) imagery through knowledge distillation (KD) remains challenging due to fundamental differences in image formation physics. We investigate the spectral structure of the RGB–IR modality gap and observe that feature divergence is not uniform across spatial frequencies: low-frequency components (shape, layout) show greater cross-modal alignment than high-frequency components (texture, fine edges), which reflect modality-specific characteristics. Based on this analysis, we propose FreqKD, a frequency-decoupled distillation framework that applies asymmetric supervision adapted to each band’s cross-modal consistency. The method employs strict mean squared error (MSE) on the low-frequency band to preserve shared structural information and a relaxed log-MSE loss (weighted at 0.1) on the high-frequency band to provide edge guidance while tolerating texture differences. Spectral divergence analysis on 500 paired samples shows that high-frequency divergence exceeds low-frequency divergence by a factor of 2.4x on average across all analysed transformer layers. On KAIST multispectral pedestrian detection, FreqKD achieves 64.1 mAP50, improving 2.4 points over the DINOv2 baseline. The learned representation transfers across datasets (FLIR ADAS, +2.1 mAP50), tasks (MFNet segmentation, +1.85 mean intersection-over-union), and architectures (ResNet-50, +1.0 mAP50). Code is available at: this https URL
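
A rough sketch of the asymmetric loss described above: strict MSE on a low-frequency band plus a relaxed log-MSE term, weighted at 0.1, on the high-frequency band. Everything except the 0.1 weight is an assumption. The band split here is a 1D moving average with its residual, and "log-MSE" is taken to mean log(1 + MSE). The paper operates on transformer features and may define both differently.

```python
import math


def low_pass(signal, radius=1):
    """Moving-average low-pass filter (an assumed stand-in for the band split)."""
    n = len(signal)
    smoothed = []
    for i in range(n):
        window = signal[max(0, i - radius): min(n, i + radius + 1)]
        smoothed.append(sum(window) / len(window))
    return smoothed


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def freq_decoupled_loss(student, teacher, high_weight=0.1):
    """Strict MSE on the low band plus a down-weighted log(1 + MSE) on the high band."""
    s_low, t_low = low_pass(student), low_pass(teacher)
    s_high = [x - l for x, l in zip(student, s_low)]
    t_high = [x - l for x, l in zip(teacher, t_low)]
    low_term = mse(s_low, t_low)
    high_term = math.log1p(mse(s_high, t_high))  # assumed form of "log-MSE"
    return low_term + high_weight * high_term


# Hypothetical 1D feature rows from a teacher (RGB) and a student (IR) model.
teacher_row = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
student_row = [0.2, 0.8, 0.1, 0.9, 0.3, 0.7]
loss_value = freq_decoupled_loss(student_row, teacher_row)
```

The logarithm compresses large high-frequency errors, so texture mismatches between modalities are penalized less sharply than structural mismatches in the low band.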

[CV-88] 4DP-QA: Scalable QA for 4D Perception in Vision Language Models

Link: https://arxiv.org/abs/2606.11568
Authors: Seokju Cho,Abhishek Badki,Hang Su,Jindong Jiang,Ziyao Zeng,Seungryong Kim,Sifei Liu,Orazio Gallo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Despite recent advances, Vision Language Models (VLMs) still struggle to grasp the dynamics of the world. We note that the ability to reason about a 4D scene, challenging in itself, is further complicated by two factors. First, VLMs observe motion indirectly via its projection onto 2D images. Second, existing datasets fail to disentangle object and camera motion. To address these challenges, we present a QA generation pipeline that focuses on motion-related scene understanding. We take particular care of the entanglement of camera and object motion by casting tracking in both the traditional way and in a novel, fixed reference system, dubbed True-Motion Tracking, which provides an intuitive description of motion. From this pipeline, we generate a large-scale training dataset of 400K samples, 4DP-QA (4D Perception QA), and a 2.2K-sample benchmark, 4DP-QA-Bench. Training existing models on our dataset yields performance improvements on an external benchmark, validating the effectiveness of our method.

[CV-89] Cross-Modal Benchmarking for Robotic Perception in Natural Environments ICRA

Link: https://arxiv.org/abs/2606.11563
Authors: David Hall,Joshua Knights,Mark Cox,Peyman Moghadam
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted to the IEEE ICRA Workshop on Open Challenges for Rigorous Robot Perception 2026

Abstract:Natural environments present a complex challenge to robotics perception systems. Current models, particularly vision foundation models, are largely trained on structured, urban environments leading to weaknesses in their perception for field robotics tasks. We showcase the limitations of current models using our recently released WildCross benchmark, a new cross-modal benchmark for place recognition and metric depth estimation in large-scale natural environments. WildCross comprises over 476K sequential RGB frames with semi-dense depth and surface normal annotations, each aligned with accurate 6DoF pose and synchronized dense lidar submaps. In this work, we provide an expanded analysis of the benchmark results from the recent WildCross benchmark, with particular emphasis on expanded metric depth estimation experiments. Access to the code repository and dataset for this work can be found at https://csiro-robotics.github.io/WildCross.

[CV-90] VL-DINO: Leveraging CLIP Vision-Language Knowledge for Open-Vocabulary Object Detection

Link: https://arxiv.org/abs/2606.11546
Authors: Hao Zhang,Qinran Lin,Linqi Song,Yong Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-language models like CLIP can provide rich semantic priors for open-vocabulary object detection. However, jointly integrating both textual and visual knowledge into detection architectures remains challenging. In this paper, we propose VL-DINO, an open-vocabulary detector that enhances DINO through more effective exploitation of CLIP’s vision-language knowledge. Specifically, a Query-guided Positive Sample Construction (QPSC) module is first developed to construct additional high-quality positive samples, enabling the vanilla DINO framework to better accommodate mixed training across heterogeneous data sources while providing more vision-language alignment signals, thereby incorporating richer textual knowledge during training. A Visual Semantic Encoder (VSE) module is then introduced to distill CLIP visual knowledge into backbone-extracted features, producing fused features for subsequent encoder refinement. Based on the fused features, an Object-Region Semantic Alignment (ORSA) module extracts object-centric region features and aligns them with the corresponding textual embeddings, further incorporating textual cues. In the zero-shot setting, VL-DINO-T and VL-DINO-L achieve 36.3 and 38.1 AP on the LVIS benchmark, respectively, consistently outperforming prior advanced approaches. Extensive experiments demonstrate the effectiveness and competitive performance of the proposed design.

[CV-91] XPR: An Extensible Cross-Platform Point-Based Differentiable Renderer

Link: https://arxiv.org/abs/2606.11529
Authors: Steve Rhyner,Sankeerth Durvasula,Aleksandr Kovalev,Hansel Jia,Adrian Zhao,Mrutunjayya Mrutunjayya,Nilesh Ahuja,Selvakumar Panneer,Christina Giannoula,Nandita Vijaykumar
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Performance (cs.PF)
Comments:

Abstract:Point-based differentiable rendering underpins modern 3D reconstruction, novel-view synthesis, and learning-based graphics pipelines, but developing new rendering methods often requires extensive low-level implementation, hardware-specific kernels, and manually written backward passes. This limits rapid prototyping, reproducibility, exploration, and deployment, especially across diverse hardware platforms. This paper presents XPR, an extensible cross-platform framework for point-based differentiable rendering. XPR introduces a high-level programming interface that separates method-specific logic from the shared rendering pipeline, allowing users to implement new methods in a few lines of code. Its pipeline decomposes rendering into modular, statically shaped parallel operations that can be lowered by a cross-platform compiler to GPUs, TPUs, CPUs, and other ML accelerators. We demonstrate implementations of 3DGS, 3DGUT, and LinPrim with only a few hundred lines of Python code, each of which can be compiled to a range of hardware platforms with the XLA compiler. These results show that XPR enables fast experimentation and portable execution for emerging point-based differentiable rendering systems.

[CV-92] SceneMiner: Identity-Preserving Multi-Task Fine-Tuning for Unified BEV Scene Mining

Link: https://arxiv.org/abs/2606.11507
Authors: Abdalmalek Aburaddaha,Venkatraman Narayanan,Keval Thaker,Samir A. Rawashdeh
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Mining hard, safety-critical scenes from driving logs is bottlenecked by the absence of difficulty labels, and no single proxy (collision risk, trajectory ambiguity, or semantic rarity) suffices to find such scenes on its own. We present SceneMiner, a unified, camera-only bird’s-eye-view pipeline that emits complementary mining signals from a frozen vision-language backbone in a single forward pass, with no LiDAR or radar: a retrieval embedding for text-prompted scenario search, a multi-label scene-tag distribution, and a continuous physics-based risk score (a motion forecast is a byproduct, not a contribution). Building such a multi-head model exposes our central finding, a failure mode we term cross-task interference: adding or upgrading one head shifts a shared activation stream and degrades weight-frozen sibling heads, so freezing parameters alone is insufficient. Our contribution, identity-preserving multi-task fine-tuning, removes this interference by zero-initializing every new sub-module and freezing every parameter that feeds the shared stream. The mining heads are thereby preserved bit-identically while training only ~102k parameters. The tagging head reaches mAP 0.4614 (micro-F1 0.5557) on 20 scene tags by pooling each scene into 32 visual tokens, and the embedding head supports text-prompted retrieval, validated qualitatively. Code is available at: this https URL
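
The zero-initialization property that the abstract relies on can be shown with a tiny sketch: a new residual branch whose weights start at zero leaves the shared activations exactly unchanged at initialization. The residual-adapter form and every name below are assumptions for illustration. How SceneMiner keeps the stream intact during and after training is not described in the abstract and is not modelled here.

```python
def linear(weights, x):
    """Plain matrix-vector product with weights given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def with_residual_adapter(x, adapter_weights):
    """Add a new sub-module's output onto the shared activation stream."""
    delta = linear(adapter_weights, x)
    return [v + d for v, d in zip(x, delta)]


# Hypothetical shared activations consumed by the existing, frozen heads.
shared = [0.5, -1.25, 3.0]

# A newly added sub-module, zero-initialized: it contributes nothing yet.
zero_adapter = [[0.0] * 3 for _ in range(3)]
unchanged = with_residual_adapter(shared, zero_adapter)

# A randomly initialized sub-module, by contrast, shifts the stream immediately.
nonzero_adapter = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]
shifted = with_residual_adapter(shared, nonzero_adapter)
```

Because the sibling heads see identical inputs in the zero-initialized case, their outputs cannot change when the new module is first attached.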

[CV-93] On the Study of Biometric Spoofing Detection using Deep Learning

Link: https://arxiv.org/abs/2606.11505
Authors: Kumar Kartikey,Nikos Komninos
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:Biometric systems are increasingly deployed in security applications; however, they remain vulnerable to spoofing attacks, in which attackers exploit counterfeit biometric data to gain unauthorized access. This research evaluates the effectiveness of state-of-the-art machine learning models, MobileNetV2, DenseNet-121, Inception-v3, and Spoof Trace Disentanglement (STD) in detecting spoofing attacks within facial recognition systems. Using the CelebA-Spoof dataset, the study evaluates model effectiveness using metrics such as accuracy, precision, recall, and F1 Score. Cross-dataset validation is carried out on the MSU-MFSD dataset to assess generalizability. The results show MobileNetV2 as the most efficient model, achieving 92% accuracy while balancing computational effectiveness, making it appropriate for real-life applications. Inception-v3 shows moderate robustness, while DenseNet-121 and STD struggle with generalization. The findings highlight the need for advances in domain adaptation and hybrid architectures to enhance biometric security systems.

[CV-94] Towards Fully Automated Exam Grading: Fairness-Aware Recognition of Handwritten Answers with Foundation Models

Link: https://arxiv.org/abs/2606.11477
Authors: Hartwig Grabowski
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 2 figures, 3 tables

Abstract:Correcting handwritten exams by hand is time-consuming and error-prone, particularly for large cohorts, while fully digital exams tend to force a didactic narrowing towards closed question formats. A practical middle ground keeps paper-based, problem-oriented tasks but records the assessment-relevant answers as single capital letters in a table that a machine can read. The open question is whether this reading can be made accurate and, above all, fair enough for unsupervised grading. Earlier automated approaches reached only about 88%–91% recognition – too low – and failed on the cases that matter most: answers placed outside the cell, crossed out, or written in cursive. We show that general-purpose vision-language foundation models (VLMs), which interpret the page rather than match pixel templates, close this gap. On a benchmark of 61 anonymised exams (3141 answer positions) the best model reaches 98.4% accuracy, well above the previous baseline. Crucially, we centre the evaluation on fairness: we distinguish false negatives (a correct answer marked wrong, which disadvantages the student) from false positives, and a lightweight prompt that supplies the reference solution as context lowers the false-negative rate to 0.58%. Under an exemplary grading scheme only three of the 61 exams would be graded worse, all caught by a student self-review step. Fully automated, fairness-aware exam grading at scale is therefore defensible; we release the anonymised benchmark to support reproducibility.
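
The fairness distinction drawn in the abstract can be made concrete with a short sketch that counts false negatives (a correct answer graded wrong) separately from false positives (a wrong answer graded correct). The function name, the data, and the choice of all answer positions as the rate denominator are assumptions. The abstract does not state how its 0.58% rate is normalized.

```python
def grading_errors(written, recognized, solution):
    """Count recognition errors by their effect on the student.

    written:    letters the student actually wrote (ground truth)
    recognized: letters the recognition model read
    solution:   reference solution letters
    """
    false_negatives = false_positives = 0
    for w, r, s in zip(written, recognized, solution):
        truly_correct = w == s
        graded_correct = r == s
        if truly_correct and not graded_correct:
            false_negatives += 1  # disadvantages the student
        elif graded_correct and not truly_correct:
            false_positives += 1  # favours the student
    total = len(solution)
    return {
        "false_negatives": false_negatives,
        "false_positives": false_positives,
        "fn_rate": false_negatives / total,  # assumed denominator: all positions
        "fp_rate": false_positives / total,
    }


# Hypothetical five-question exam.
solution = list("ABCDA")
written = list("ABCAA")      # question 4 answered wrongly
recognized = list("ABCDB")   # question 4 misread as correct, question 5 misread as wrong
report = grading_errors(written, recognized, solution)
```

A misread that leaves the grading outcome unchanged (for example a wrong answer read as a different wrong letter) counts as neither error type here, which matches the abstract's focus on grading impact rather than raw recognition accuracy.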

[CV-95] PT-WNO: Point Transformer with Wavelet Neural Operator for 3D Point Cloud Semantic Segmentation

Link: https://arxiv.org/abs/2606.11466
Authors: Nhut Le,Maryam Rahnemoonfar
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Point cloud semantic segmentation requires architectures that capture both fine-grained local geometry and broad global scene structure. Transformer-based networks have demonstrated strong performance by focusing on detailed local feature aggregation; however, global context is conveyed primarily through skip connections across encoder-decoder stages, which we argue is insufficient for full scene understanding. We hypothesize that augmenting skip connections with a learnable global feature extraction module allows the network to acquire scene-level knowledge before descending into local detail, leading to richer and more contextually grounded representations. To this end, we propose Point Transformer with Wavelet Neural Operator (PT-WNO), which integrates a shared Wavelet Neural Operator (WNO) branch alongside the skip connections of a point cloud transformer backbone. At each encoder-decoder transition, point features are projected onto a dense 3D volumetric grid where the WNO captures multi-scale global spectral context through learnable wavelet decomposition and reconstruction. These global features are fused back into the network via lightweight adapters, complementing rather than replacing the existing skip connections. Experiments on four large-scale 3D point cloud benchmarks demonstrate the effectiveness of PT-WNO. On S3DIS (Area 5), PT-WNO achieves 71.59% mIoU, outperforming the Point Transformer v3 (PTv3) baseline by +1.03 points. On DALES it achieves 81.05% mIoU (+1.47 over the baseline). On ScanNet v2, PT-WNO obtains 76.19% mIoU, remaining competitive with the baseline (76.36%).

[CV-96] Exploring Adaptive Masked Reconstruction for Self-Supervised Skeleton-Based Action Recognition CVPR2026

Link: https://arxiv.org/abs/2606.11450
Authors: Shengkai Sun,Zhiyong Cheng,Zefan Zhang,Jianfeng Dong,Zhihui Li,Meng Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR2026. The code is available at this https URL

Abstract:Recently, masked skeleton reconstruction models have emerged as strong action representation learners, driving significant progress in self-supervised skeleton-based action recognition. However, existing state-of-the-art methods must predict an exceedingly large number of spatiotemporal patches, significantly prolonging training time. Besides, by treating all spatiotemporal regions equally during reconstruction, these models are distracted from learning the critical motion patterns that underlie action semantics. To address these challenges, we propose Adaptive Masked Reconstruction (AMR), a faster and stronger pre-training framework. We first decouple the decoder from the encoder, enabling flexible prediction of larger spatiotemporal patches and dramatically reducing reconstruction complexity. Given that larger patches contain more complex information, which is challenging to predict and consequently degrades performance, we accordingly introduce an adaptive guidance module. This module identifies regions of high motion informativeness, guiding the model to focus on the most discriminative parts of each patch and alleviating reconstruction difficulty. Experiments on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that AMR not only accelerates pre-training substantially but also improves downstream recognition accuracy, surpassing current state-of-the-art approaches.

[CV-97] 3D-CBM: A Framework for Concept-Based Interpretability in Generative 3D Modeling

Link: https://arxiv.org/abs/2606.11446
Authors: Ahmad Al-Kabbany
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:This research introduces a framework for incorporating Concept Bottleneck Models (CBMs) into 3D generative architectures to address the inherent ‘semantic gap’ in deep geometric learning. As deep models become central to 3D content creation, explainability shifts from a peripheral feature to a fundamental requirement for trust and accountability in safety-critical domains such as healthcare and manufacturing. CBMs provide an intrinsic interpretability solution by constraining latent representations to align with human-defined concepts, yet their application to unstructured 3D data remains largely unexplored. We design, implement, and validate a formal 3D-CBM architecture that maps raw geometric inputs, including point clouds and meshes, into a multi-tiered taxonomy of interpretable primitives and functional attributes. The framework further identifies strategic datasets, such as PartNet and ShapeNet, specialized for concept-based supervision. Experimental results from a 3D part-manipulation proof-of-concept experiment demonstrate the framework’s efficacy, achieving a concept prediction accuracy of 88.8% and a Chamfer Distance of 0.0115. Critically, the model enables precise test-time intervention, allowing for the interactive correction of structural errors. This work establishes a foundation for semantically-steerable 3D generation and invites further exploration into collaborative human-in-the-loop design systems.

[CV-98] A Scalable PyTorch Abstraction for Multi-GPU Gaussian Splatting

Link: https://arxiv.org/abs/2606.11390
Authors: Matthew Cong,Francis Williams,Jonathan Swartz,Mark Harris,Sanja Fidler,Ken Museth
Categories: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: 14 pages, 6 tables, 2 figures, and 1 listing. Includes supplementary material

Abstract:Gaussian splatting methods have become increasingly popular for neural reconstruction of the real world. However, they are often limited in scale and resolution due to compute and memory constraints. We present a multi-GPU Gaussian splatting approach that scales reconstruction to higher resolutions and larger scenes while abstracting away the code complexity typically associated with distributing a model. To accomplish this, we propose a PyTorch backend that distributes the Gaussian parameters and splatting operators across GPUs via CUDA unified memory and NVLink. Because distribution occurs at the operator level, the model code requires no explicit cross-device communication. More broadly, the backend exposes multiple GPUs as an aggregate PyTorch device and supports other PyTorch operators. We demonstrate city-scale reconstructions with street-level detail consisting of over 1 billion Gaussian splats, more than 25 times as many as the current state of the art.

[CV-99] DeceptionX: Explainable Deception Detection with Multimodal Large Language Models

Link: https://arxiv.org/abs/2606.11385
Authors: Jiayu Zhang,Shuo Ye,Jiajian Huang,Yawen Cui,Taorui Wang,Wei Xia,Zeheng Wang,Haowen Tang,Hui Ma,Zitong Yu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deception detection is a critical and highly challenging task within affective computing and behavioral analysis. Existing deep learning methods typically treat this task as a straightforward classification problem; however, this black-box approach lacks interpretability and fails to capture the complex logical deduction processes utilized by human experts when identifying lies. While Multimodal Large Language Models (MLLMs) have shown potential, applying them effectively requires a bridge between low-level audiovisual cues and high-level logical reasoning. In this paper, we propose DeceptionX, a novel MLLM framework that shifts the paradigm of deception detection from black-box classification to an interpretable Observe-Think-Summarize reasoning process. To address the scarcity of high-quality reasoning data, we first constructed DeceptChain, a high-quality dataset developed through a human-in-the-loop process. This dataset synthesizes fine-grained visual and auditory evidence (such as micro-expressions and vocal tremors) into structured chain-of-thought reasoning data. Furthermore, we propose a three-stage training pipeline and a Discrepancy-Aware Redundancy Elimination (DARE) strategy for DeceptionX to further enhance the model’s generalization capabilities. Extensive experiments demonstrate that DeceptionX not only outperforms existing MLLM baselines and state-of-the-art methods on standard real-world benchmarks but also provides transparent, expert-level reasoning paths, bridging the critical gap between accuracy and interpretability in multimodal deception detection.

[CV-100] From Simulation to Real-World: An In-Field 6D Pose Dataset and Baseline for Robotic Strawberry Harvesting

Link: https://arxiv.org/abs/2606.11381
Authors: Woojung Son(1),Won Suk Lee(1),Zijing Huang(1),Daeun Choi(1),Catia Silva(2),Yu She(3),Yan Gu(4) ((1) Department of Agricultural and Biological Engineering, University of Florida, (2) Department of Electrical and Computer Engineering, University of Florida, (3) Edwardson School of Industrial Engineering, Purdue University, (4) School of Mechanical Engineering, Purdue University)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 6 figures, 1 table

Abstract:Robotic strawberry harvesting requires precise 6D pose estimation; however, collecting 6D pose ground truth in real agricultural fields is inherently challenging. Existing 6D pose estimation methods have therefore relied solely on synthetic data that lacks scene-level realism, leaving their performance under real agricultural field conditions unquantified. In this work, we present, to the best of our knowledge, the first real-world 6D pose ground truth dataset of strawberries collected in actual agricultural fields (12,040 images). We also introduce a synthetic dataset rendered in NVIDIA Isaac Sim, featuring scene-level realism and domain randomization. Nevertheless, our experiments reveal that a significant sim-to-real gap persists, underscoring the necessity of real agricultural field data for reliable evaluation. We further quantify the sim-to-real gap through baseline 6D pose estimation results across backbone encoders, serving as a reference for future work. The real-world dataset will be made available upon acceptance.

[CV-101] NSVQ: Mitigating Codebook Collapse by Stabilizing Encoder Drift in Vector Quantization

Link: https://arxiv.org/abs/2606.11363
Authors: Hao Lu,Yongxin Guo,Onur Koyun,Zhengjie Zhu,Abbas Alili,Metin N. Gurcan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vector quantization is central to modern generative modeling pipelines, but large-codebook VQ models often suffer from codebook collapse. We identify encoder drift as a key driver of this failure: as the encoder moves the latent distribution, sparsely updated code vectors can lag behind, lose assignments, and increase quantization error, creating a feedback loop through the straight-through estimator. We propose NSVQ, a non-stationary-aware VQ training strategy that combines a dense non-stationary embedding loss, codebook replacement, and stage-wise encoder freezing. NSVQ first helps the codebook track encoder drift during early training, then freezes the encoder to consolidate the codebook under a fixed latent geometry, and finally reintroduces adversarial refinement. Experiments on ImageNet-1k show that NSVQ improves reconstruction quality while maintaining full codebook utilization. On ImageNet-1k at 128×128 with 65,536 codes, NSVQ reduces rFID from 2.39 to 2.10 compared with SimVQ, while both methods maintain 100% utilization. Additional latent diffusion experiments show that NSVQ also improves downstream ImageNet generation FID.
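
The collapse mechanism described above (codes losing assignments once the encoder's latents drift away from them) can be observed with a minimal nearest-neighbour quantizer and a utilization measure. This sketch illustrates only the symptom that NSVQ targets, not the NSVQ training strategy itself. The toy codebook, latents, and function names are hypothetical.

```python
def nearest_code(vector, codebook):
    """Index of the code vector closest to `vector` in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))


def utilization(latents, codebook):
    """Fraction of codes that receive at least one assignment."""
    used = {nearest_code(z, codebook) for z in latents}
    return len(used) / len(codebook)


codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]

# Before drift: latents cover the neighbourhood of every code.
latents_early = [(0.1, -0.1), (0.9, 1.2), (4.8, 5.1)]

# After drift: the encoder no longer produces latents near (5, 5),
# so that code stops being selected and stops receiving updates.
latents_drifted = [(0.1, -0.1), (0.9, 1.2), (1.4, 0.7)]

util_early = utilization(latents_early, codebook)
util_drifted = utilization(latents_drifted, codebook)
```

A code that is never selected gets no gradient through its assignments, which is why sparse updates let it fall further behind as drift continues.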

[CV-102] DarkVGGT: Seeing Through Darkness Using Thermal Geometry without Daylight Tax

Link: https://arxiv.org/abs/2606.11326
Authors: Minseong Kweon,Wenyuan Zhao,Nuo Chen,Lulin Liu,Huiwen Han,Zihao Zhu,Srinivas Shakkottai,Chao Tian,Zhiwen Fan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Recent feed-forward 3D reconstruction methods have demonstrated strong performance and flexibility in efficient end-to-end scene geometry estimation from image streams. However, their reliance on visible-light appearance makes them vulnerable in dark and low-visibility environments, where RGB cues are severely degraded and geometric evidence becomes ambiguous. To address this challenge, we propose DarkVGGT, an RGB-T feed-forward geometry framework that uses physics-aware thermal modeling for robust 3D estimation in low-light scenes. DarkVGGT introduces two complementary modules. First, physics-inspired thermal factorization extracts emissive-dominant, geometry-consistent thermal cues while isolating sparse reflective residuals that may introduce geometric ambiguity. Second, geometry-shared thermal routing isolates modality-invariant geometric structures from thermal-specific patterns, selectively injecting reliability-aware structural guidance into the RGB stream. Together, these components enable accurate thermal-informed geometry estimation under degraded RGB conditions while largely preserving performance in well-lit environments. Experiments on low-visibility RGB-T benchmarks demonstrate consistent improvements in both depth and camera pose estimation over existing feed-forward geometry baselines.

[CV-103] Semantic Segmentation of Node and Edge Diagrams for Assistive Technology

Link: https://arxiv.org/abs/2606.11320
Authors: Michael Cormier,Yichun Zhao,Laura Paul,Cameron Swift,Duc Tri Dang,Miguel Nacenta
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 6 figures, 1 table. In Proceedings of the 23rd Conference on Robots and Vision (2026)

Abstract:In this paper, we present a novel set of related models for semantic segmentation of node-link diagrams. These diagrams are frequently used to represent mathematical graphs, relationships between concepts, and flowcharts. Such diagrams are difficult to access non-visually; while some assistive interfaces have been designed for node-link diagrams, they rely upon a machine-readable representation of the diagram, whereas such diagrams will generally be made available as bitmap images. Our compact deep learning models show excellent quantitative and qualitative performance on a large synthetic dataset of node-link diagrams, reaching per-pixel accuracy over 93%.

[CV-104] TRON: Tracing Rays to Orchestrate a Neural Renderer for 3D Gaussian Reconstructions

Link: https://arxiv.org/abs/2606.11314
Authors: Or Perel,Hassan Abu Alhaija,Zian Wang,Jacob Munkberg,Matan Atzmon,Sanja Fidler,Masha Shugrina
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project page: this https URL

Abstract:We introduce TRON, a rendering framework that combines 3D Gaussian ray tracing with neural rendering to enable realistic and controllable rendering of real-world 3D scenes under novel lighting, dynamic object motion, object insertion, and material editing. Prior approaches that rely solely on physically based rendering (PBR) of Gaussian representations struggle to achieve realistic relighting due to imperfections in reconstructed geometry, material estimates, and light transport estimation. At the same time, neural rendering methods often lack an explicit scene representation, limiting their ability to support interactive editing with fine-grained manipulation. TRON bridges these two paradigms. We use intrinsic decomposition priors from a learned inverse rendering model to regularize the material properties of a Gaussian field, and repurpose a ray tracer to provide radiometric guidance rather than final pixels. By treating this output as a structured 3D scaffold, we empower a lightweight neural renderer to bridge the domain gap between shading-model constrained estimates and photorealistic output. Our key insight is that the combination of explicit 3D knowledge with robust material priors provides speed and controllability, while neural rendering enables the synthesis of photorealistic images. To support real-world scenarios, we train our neural renderer with a multi-stage strategy consisting of large-scale pretraining and targeted fine-tuning on a newly constructed dataset of 2.1M rendered synthetic and real-world frames from 3D reconstructions. TRON outperforms Gaussian-based relighting methods in realism, and prior neural renderers in editability and speed. To the best of our knowledge, TRON is the first method to enable practical interactive applications in captured 3D environments, offering realistic appearance under dynamic geometric, lighting and material conditions.

[CV-105] i1: A Simple and Fully Open Recipe for Strong Text-to-Image Models

Link: https://arxiv.org/abs/2606.11289
Authors: Boya Zeng,Tianze Luo,Shu Pu,Jucheng Shen,Taiming Lu,Gabriel Sarch,Zhuang Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page at this https URL

Abstract:Diffusion models have consistently driven progress in text-to-image generation. However, it is challenging to attribute recent progress to specific modeling and data choices: state-of-the-art open-weight models provide limited ablations, and do not disclose their training data and full training details. The research community needs fully open (weights, data, and code) models as a foundation for further research; yet existing fully open models still fall significantly short of leading models in performance. In this project, we conduct a systematic investigation of the modeling and data design choices in text-to-image diffusion training and inference with 300+ controlled experiments totaling 700K+ TPU v6e hours. Our experiments highlight several empirical findings (e.g., equal weighting is a strong default for mixing curated datasets) and simple design decisions (e.g., larger text encoder adapters improve performance with minimal added parameters) for training strong models. Guided by these insights, we train i1, a 3B-parameter text-to-image diffusion model using only publicly available datasets. i1 is competitive with leading models on five representative benchmarks (GenEval, DPG, PRISM, CVTG-2K, and LongText), and outperforms the best existing fully open model by 29.5 absolute percentage points on average. We provide the i1 checkpoints, training and inference code, and the data processing pipeline. Together, our findings and the i1 recipe establish a practical foundation for future open research in text-to-image diffusion models. Our code is available at this https URL.

[CV-106] EventRadar: Long-Range Visual UAV Discovery through Spatiotemporal Event Sensing

Link: https://arxiv.org/abs/2606.11285
Authors: Zhiting Zhou,Xingchen Liu,Xinglin Yu,Jiashen Chen,Haoyang Wang,Jingao Xu,Yunhao Liu,Xinlei Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Unauthorized unmanned aerial vehicle (UAV) activity around airports, public venues, and other sensitive sites has made protected-airspace monitoring increasingly important. A practical sensing system must search a wide angular region, find small long-range targets, and return both bearing support and UAV-specific evidence before a restricted perimeter is breached. Existing UAV detection paths often rely on spatially organized evidence, such as body extent, silhouette, or track continuity. At long range, however, these cues become difficult to preserve and verify as the target footprint weakens and its image-plane support shrinks. EventRadar follows a complementary cue: propeller-induced temporal periodicity, which recent event-camera sensing studies have shown can reveal UAV-specific motion after appearance becomes weak. We extend this cue to kilometer-scale active sensing with an event-camera prototype. Scene-Anchored Geometry Evidence (SAGE) fuses scanning events with IMU pose to maintain a bearing-indexed scene memory, separating transient candidate support from persistent background clutter. Comb-guided Harmonic-Group Learned Iterative Shrinkage and Thresholding Algorithm (CHG) then treats each candidate as a weak high-rate timing signal and recovers phase-insensitive harmonic evidence with fixed compute. Compared with related event-camera baselines on 700-1500 m UAV event recordings, EventRadar achieves 0.990 mAP$_{.3}$ and 0.949 F1$_{.3}$, reduces FN$_{.3}$ to 0.009, and shows real-time feasibility in prototype profiling.

[CV-107] A2SG: Adaptive and Asymmetric Surrogate Gradients for Training Deep Spiking Neural Networks ICML2026

Link: https://arxiv.org/abs/2606.11236
Authors: Yechan Kang,Yongjin Kweon,Mingyeong Seo,Sohee Park,Yeonguk Jeon,Jongkil Park,Hyun Jae Jang,Jaewook Kim,YeonJoo Jeong,Suyoun Lee,Seongsik Park
Subjects: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at ICML 2026

Abstract:Training deep spiking neural networks (SNNs) remains challenging due to sharp loss landscapes and temporal inconsistency caused by surrogate gradients. To address these challenges, we propose a unified framework: adaptive and asymmetric surrogate gradients (A2SG). The adaptive gradients adjust an effective window for spatio-temporal adaptation, reducing spatial gradient variation and maintaining directional consistency of gradients over time. The asymmetric gradients reflect neuronal dynamics by assigning larger gradients to neurons with higher membrane potentials, and we prove that they yield lower variation than symmetric surrogates. Our analysis further establishes a direct connection between local gradient variation and the curvature of the loss landscape, providing a principled explanation for how A2SG promotes convergence to flatter minima and improves generalization. We conduct extensive experiments on diverse models, including CNN-based and Transformer-based SNNs, across various tasks such as image classification using both static and neuromorphic datasets, as well as segmentation. The results demonstrate that A2SG consistently improves accuracy and energy efficiency, establishing it as a general and reliable solution for training deep SNNs. Our code is available at this https URL.

[CV-108] OSCS-SupCon: Orthogonal Sigmoid-based Common and Style Supervised Contrastive Learning for Robust Feature Disentanglement

Link: https://arxiv.org/abs/2606.11233
Authors: Bin Wang,Fadi Dornaika
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Supervised Contrastive Learning (SupCon) has achieved strong performance by explicitly modeling pairwise relationships among samples. However, existing SupCon-based methods suffer from two key limitations: negative-sample dilution induced by the standard InfoNCE loss, and feature-space entanglement caused by the lack of explicit constraints separating category-relevant (common) and category-irrelevant (style) features. These limitations reduce feature discriminability and generalization ability. To address these issues, we propose OSCS-SupCon (Orthogonal Sigmoid-based Common and Style Supervised Contrastive Learning), a unified framework that combines a sigmoid-based pairwise contrastive objective with explicit orthogonality constraints. Specifically, we introduce a sigmoid-based contrastive loss with two learnable parameters, temperature and bias, which adaptively modulate pairwise decision boundaries and alleviate negative-sample dilution. Furthermore, we enforce orthogonality between common and style feature subspaces via a linear projection with ReLU nonlinearity, thereby reducing feature overlap and improving disentanglement of style-irrelevant representations. Extensive experiments on six benchmark datasets demonstrate that OSCS-SupCon consistently outperforms state-of-the-art supervised contrastive learning methods across multiple backbone architectures. In particular, on the fine-grained CUB200-2011 dataset with a ResNet-18 backbone, the proposed method achieves a 3.4% improvement in classification accuracy over CS-SupCon, highlighting its robustness and generalization capability. Ablation studies further confirm the effectiveness of each component.
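
As a rough illustration of a sigmoid-based pairwise contrastive objective with a temperature and a bias, here is a minimal pure-Python sketch. It assumes a SigLIP-style form, loss = mean over pairs of -log sigmoid(z * (t * sim + b)) with z = +1 for same-label pairs and z = -1 otherwise; the paper's exact loss, its orthogonality term, and the function name `sigmoid_pair_loss` are not given in the abstract and are assumptions here. In training, `temperature` and `bias` would be learnable; this sketch treats them as plain numbers.

```python
import math


def sigmoid_pair_loss(embeddings, labels, temperature=10.0, bias=-5.0):
    """Mean pairwise sigmoid loss over all ordered pairs (i, j), i != j.

    Assumed form (SigLIP-style, not taken from the paper):
        -log sigmoid(z_ij * (temperature * <e_i, e_j> + bias))
    where z_ij = +1 if labels match and -1 otherwise.
    """
    n = len(embeddings)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            sim = sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
            z = 1.0 if labels[i] == labels[j] else -1.0
            logit = z * (temperature * sim + bias)
            # -log(sigmoid(x)) == log(1 + exp(-x)), computed stably
            total += max(-logit, 0.0) + math.log1p(math.exp(-abs(logit)))
            count += 1
    return total / count


# Two tight clusters: labels that agree with the geometry give a small loss.
emb = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, 1.0)]
good = sigmoid_pair_loss(emb, [0, 0, 1, 1])
bad = sigmoid_pair_loss(emb, [0, 1, 0, 1])
```

Because each pair is scored independently, a large number of negatives does not shrink the contribution of any single pair the way a softmax denominator does, which is one plausible reading of how such a loss eases negative-sample dilution.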

[CV-109] CFCamo: A Counterfactual Detect-or-Abstain Framework for Camouflaged Object Detection

Link: https://arxiv.org/abs/2606.11231
Authors: Suhang Li,Osamu Yoshie,Yuya Ieiri
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 7 figures, 5 tables. Code and data: this https URL

Abstract:Vision-language reinforcement learning has recently shown strong target-present localization for camouflaged object detection (COD). Yet localization is only one side of the decision: when the agent faces an ordinary image with no camouflaged target, will it still claim that a camouflaged object exists? Standard COD training and evaluation data are positive-only, so agents optimized under this setting can acquire an over-detect bias, a task-specific form of object hallucination that standard COD evaluation leaves unmeasured. To quantify this target-absent behavior, we construct Counterfactual COD (CF-COD), a paired benchmark that removes the camouflaged target from each held-out COD evaluation image while preserving a plausible background. CF-COD evaluates whether a model detects the target on the original image and abstains on the target-absent counterfactual, summarized by Pair Accuracy (PA). We further introduce CFCamo, a paired counterfactual framework for COD with abstention. For training, CFCamo optimizes a Qwen3-VL-4B-Instruct agent with Counterfactual Sequence Policy Optimization (CSPO), which samples paired original-counterfactual rollouts and uses a Counterfactual Paired Reward (CPR) to couple original-image detection with counterfactual abstention. On CAMO-test, CFCamo improves $S_\alpha$ by +3.7 pp over the prior RL-based COD baseline; across CF-COD, it reaches 80.0-90.8% PA. Ablations show that removing counterfactual coupling reduces PA to 1.4-5.2% despite strong target-present COD scores, showing that target-present evaluation alone does not characterize detect-or-abstain behavior. Overall, these results indicate that CFCamo improves COD agents by coupling target-present detection with target-absent abstention, rather than merely strengthening target-present localization. Code and data are available at this https URL.

[CV-110] LAST: Bridging Vision-Language and Action Manifolds via Gromov-Wasserstein Alignment

Link: https://arxiv.org/abs/2606.11221
Authors: Huaihai Lyu,Chaofan Chen,Yuheng Ji,Xiansheng Chen,Pengwei Wang,Shanghang Zhang,Changsheng Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We take a Gromov-Wasserstein perspective on Vision-Language-Action (VLA) learning, where the goal is to make the relational geometry of action representations compatible with the semantic geometry of VL embeddings. However, this alignment is non-trivial due to the mathematical heterogeneity between the domains: the semantic space of vision-language is topologically linear and isotropic, whereas the physical manifold of robotic action is non-Euclidean and anisotropic. Their disjoint metric structures render direct regression ill-posed. To resolve this incompatibility, we introduce LAST (Lie-algebraic Action Space Tokenizer), which reconstructs the action space to establish local metric compatibility with the VL modality via a two-stage transformation: (1) Global Topological Linearization: linearizing the action manifold via Lie-algebraic mapping, converting trajectories into a fixed-length, physically additive representation. (2) Local Metric Discretization: hierarchically discretizing the representation into schemas and whitened residuals, yielding approximately isotropic local charts that are statistically aligned with the semantic metric. By resolving the structural mismatch at both global and local levels, LAST enables VLA models with superior convergence and generalizability.

[CV-111] Intelligent Skin Cancer Detection Using a Multispectral Metasurface and a Hybrid

Link: https://arxiv.org/abs/2606.11287
Authors: Afsane Saee Arezoomand
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages

Abstract:Skin cancer is among the most prevalent malignancies worldwide, and its early detection is essential for improving patient survival and reducing treatment costs. Conventional dermoscopic and visual imaging techniques are primarily limited to the visible spectrum and often fail to capture subtle spectral signatures associated with early-stage malignancies. This study proposes an innovative framework that integrates a multispectral metasurface for imaging with a hybrid deep learning architecture based on Convolutional Neural Networks and Vision Transformers. The designed metasurface enables noninvasive acquisition of rich spectral information highly sensitive to tissue alterations, while the hybrid CNN-ViT model simultaneously extracts local and global features to robustly classify skin lesions. Simulation-based evaluations demonstrate that the proposed method achieves approximately 98% accuracy, 95% sensitivity, and 99% specificity, surpassing conventional RGB-based and single-architecture approaches. Qualitative analyses using attention maps reveal that the model focuses on clinically relevant lesion regions, improving interpretability. Overall, the results indicate that combining metasurface-based multispectral imaging with hybrid deep learning can introduce a new generation of diagnostic tools in dermatology and pave the way for portable, fast, and highly accurate clinical systems.

Artificial Intelligence

[AI-0] FACTR 2: Learning External Force Sensing for Commodity Robot Arms Improves Policy Learning

Link: https://arxiv.org/abs/2606.12406
Authors: Steven Oh,Jason Jingzhou Liu,Tony Tao,Philip Han,Kenneth Shaw,Satoshi Funabashi,Ruslan Salakhutdinov,Deepak Pathak
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Website at this https URL

Abstract:Contact-rich manipulation requires force sensitivity, but many robot arms lack dedicated force sensors due to their high cost. We present Neural External Torque Estimation (NEXT), a data-driven method that estimates external joint torques without needing any dedicated force sensors. NEXT trains in 1 minute from only 10 minutes of free-motion data, yet achieves estimates comparable to dedicated joint-torque sensors. NEXT enables force-feedback teleoperation on low-cost arms and improves policy learning through Force-Informed Re-Sampling Training (FIRST), which up-samples pre-contact and contact segments during behavior cloning. Across five long-horizon tasks, FIRST outperforms prior force-aware policies by over 17% in task progress. Together, NEXT and FIRST bring force-aware teleoperation and policy learning to off-the-shelf robots without additional sensing hardware. Video results and code are available at this https URL

[AI-1] Tahoe: Text-to-SQL with Automated Hint Optimization from Experience

Link: https://arxiv.org/abs/2606.12387
Authors: Zhiyi Chen,Jie Song,Peng Li
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have democratized database access through Text-to-SQL, but moving from prototypes to production remains difficult. Real deployments must handle strict SQL dialects, massive schemas, and evolving user preferences, while supervised fine-tuning is costly and rigid and agentic test-time scaling is expensive. We present Tahoe, a system that treats prompt optimization as a dynamic data management problem. Tahoe uses an error-driven hint learning pipeline across Development and Deployment to consolidate debugging traces into a structured Hint Bank. Compiler feedback is distilled into reusable Syntax Hints for dialect-specific rules, while execution and user feedback are converted into Semantic Hints for schema- and user-specific logic. Tahoe further introduces a Strategy Layer that models conflicting user intents as competing strategies under shared natural-language triggers, with recency signals and post-learning attribution statistics that summarize empirical success, harm, inertness, and support. At inference time, Tahoe retrieves relevant hints and guides the LLM through Logic Planning followed by SQL Synthesis. We implement and evaluate the development-phase workflow, leaving deployment-time human-feedback updates for future work. On Spider 2.0-Snow, Tahoe substantially improves Text-to-SQL without updating model parameters. On 113 supervised Spider 2.0-Snow-0212 examples using GPT-5.5, Tahoe raises pass rate from 61.95 percent to 79.42 percent and pass-at-4 from 72.57 percent to 87.61 percent, achieves 100 percent Snowflake syntax pass rate, and reduces average compiler-feedback critic rounds from 2.79 to 0.12 per sampled candidate. The same Hint Bank also transfers to weaker backbones, including a 19.7 percentage-point pass-rate gain on Doubao-2.0-lite.

[AI-2] ATLAS: Active Theory Learning for Automated Science

Link: https://arxiv.org/abs/2606.12386
Authors: Noémi Éltető,Nathaniel D. Daw,Kimberly L. Stachenfeld,Kevin J. Miller
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Advancing scientific understanding through mechanistic modeling requires posing the right experimental questions to yield maximally informative data. To automate this pursuit within cognitive science, we introduce ATLAS (Active Theory Learning for Automated Science), an active learning framework for the data-driven discovery of interpretable behavioral models. ATLAS iterates between generating mechanistic hypotheses–instantiated as a diverse ensemble of sparse neural networks (Disentangled RNNs)–and designing experiments that optimally distinguish between them. We test this approach on the problem of recovering reinforcement learning agents from their behavior in bandit tasks. ATLAS designs varied sequences of qualitatively novel experiments with temporal structure tailored to underlying agent characteristics. The models trained on these experiments are evaluated against a comprehensive set of metrics for mechanistic modeling that capture behavioral, structural, and computational similarity. ATLAS achieves a 5-10x improvement in sample efficiency across all metrics compared to random experimentation, and its performance is further validated against expert-designed experiments derived from literature. These in silico results showcase ATLAS’s potential to accelerate human-interpretable insights in cognitive science and other domains where scientific inquiry relies on discovering mechanistic models.

[AI-3] APPO: Agentic Procedural Policy Optimization

Link: https://arxiv.org/abs/2606.12384
Authors: Xucong Wang,Ziyu Ma,Yong Wang,Yuxiang Ji,Shidong Yang,Guanhua Chen,Pengkun Wang,Xiangxiang Chu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 25 pages, including 14 pages of main text and 11 pages of appendix; work in progress

Abstract:Recent advances in agentic Reinforcement Learning (RL) have substantially improved the multi-turn tool-use capabilities of large language model agents. However, most existing methods assign credit over coarse heuristic units, such as tool-call boundaries or fixed workflows, making it difficult to identify which intermediate decisions influence downstream outcomes. In this work, we study agentic RL from two perspectives: where to branch and how to assign credit after branching. Our pilot analysis shows that influential decision points are broadly distributed throughout the generated sequence rather than concentrated at tool calls, while token entropy alone does not reliably reflect their impact on final outcomes. Motivated by these observations, we propose Agentic Procedural Policy Optimization (APPO), which shifts branching and credit assignment from coarse interaction units to fine-grained decision points in the sequence. APPO selects branching locations using a Branching Score that combines token uncertainty with policy-induced likelihood gains of subsequent continuations, enabling more targeted exploration while filtering out spurious high-entropy positions. It further introduces procedure-level advantage scaling to better distribute credit across branched rollouts. Experiments on 13 benchmarks show that APPO consistently improves strong agentic RL baselines by nearly 4 points, while keeping efficient tool-calls and maintaining behavior interpretability.

[AI-4] SPEA2+: Improved Density Estimation in SPEA2 with Provable Runtime Guarantees PPSN2026

Link: https://arxiv.org/abs/2606.12382
Authors: Duc-Cuong Dang,Andre Opris,Dirk Sudholt
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: To appear in the Proceedings of PPSN 2026

Abstract:The Strength Pareto Evolutionary Algorithm 2 (SPEA2) is a popular and prominent evolutionary algorithm for solving multi-objective optimisation problems. Despite its popularity, theoretical analyses of SPEA2 have only appeared recently. Moreover, these analyses focus exclusively on how SPEA2 handles non-dominated solutions and disregard the algorithmic components responsible for handling dominated solutions. We conduct a first runtime analysis of SPEA2 for which these components are analysed. We prove that, unlike other prominent algorithms, including NSGA-II, NSGA-III and SMS-EMOA under the same setting of constant population size and duplicate elimination, SPEA2 is unable to cover the Pareto front of the OneTrapZeroTrap benchmark efficiently. Our results indicate that using k-th nearest-neighbour distance in the fitness assignment provides an insufficient signal to maintain diversity among dominated individuals. To address this issue, we propose an improved variant, SPEA2$^+$, that considers all pairwise distances. The new algorithm achieves the same performance guarantees as the other prominent algorithms on OneTrapZeroTrap, while matching the performance of the original SPEA2 on simpler problems. Experimental results complement our theoretical findings.
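
For readers unfamiliar with the density term the abstract refers to, the sketch below computes the k-th nearest-neighbour density used in the original SPEA2 fitness assignment, D(i) = 1 / (sigma_k(i) + 2), where sigma_k(i) is the distance from individual i to its k-th nearest neighbour in objective space. The function name `kth_nn_density` and the toy points are illustrative assumptions; the all-pairwise-distance variant proposed in the paper is not specified in the abstract and is deliberately not reproduced here.

```python
import math


def kth_nn_density(points, k):
    """SPEA2-style density for each point in objective space.

    D(i) = 1 / (sigma_k(i) + 2), with sigma_k(i) the Euclidean distance
    from point i to its k-th nearest neighbour (k is 1-based).
    """
    densities = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        sigma_k = dists[k - 1]
        densities.append(1.0 / (sigma_k + 2.0))
    return densities


# Three objective vectors on a line; the isolated one gets the lowest density.
objs = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
dens = kth_nn_density(objs, k=1)
```

Only a single distance per individual enters this term, so two individuals with the same k-th neighbour distance are indistinguishable regardless of how the rest of the population is spread, which is consistent with the abstract's remark that the signal can be insufficient.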

[AI-5] Ambient Diffusion Policy: Imitation Learning from Suboptimal Data in Robotics

Link: https://arxiv.org/abs/2606.12365
Authors: Adam Wei,Nicholas Pfaff,Thomas Cohn,Arif Kerem Dayı,Constantinos Daskalakis,Giannis Daras,Russ Tedrake
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 14 pages (main body), 52 pages total. Project website: this https URL

Abstract:We propose Ambient Diffusion Policy, a simple and principled method for imitation learning from suboptimal data in robotics. High-quality, task-specific robot data is expensive and time-consuming to collect, while suboptimal datasets with lower-quality or out-of-distribution demonstrations are abundant. Existing methods that co-train on both data sources in robotics often fail to separate the meaningful and the harmful features in the suboptimal samples. In contrast, our method extracts only the useful features by introducing a new axis to co-training in robotics: noise-dependent data usage. Ambient Diffusion Policy restricts the contribution of suboptimal data during training to only the high and low diffusion times. To rigorously justify our approach, we first observe that robot action data exhibits a spectral power law. This induces two important properties on the optimal Diffusion Policy that we exploit: a global-to-local hierarchy and locality. We theoretically formalize this discussion using a simplified model. Our experiments validate Ambient Diffusion Policy on four types of suboptimal action data (noisy trajectories, sim-to-real gap, task mismatch, and large-scale data mixtures) across six tasks. The results show that it effectively learns from arbitrary sources of suboptimal data. Notably, it outperforms existing co-training baselines by up to 33% when scaled to Open X-Embodiment - a large dataset with heterogeneous data quality and unstructured distribution shifts. Overall, Ambient Diffusion Policy increases the utility of suboptimal demonstrations and expands the set of usable data sources in robotics.
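
The idea of noise-dependent data usage can be pictured with a small sampling sketch: clean demonstrations are used at every diffusion time, while suboptimal ones are used only at low or high diffusion times. Everything concrete below is an assumption for illustration, including the thresholds `t_low` and `t_high`, the normalised time range [0, 1], and the rejection-sampling loop; the abstract does not state how the paper selects these ranges or implements the restriction.

```python
import random


def usable_at(t, is_clean, t_low=0.2, t_high=0.8):
    """Return True if a sample may contribute to the loss at diffusion time t.

    Clean data is always usable; suboptimal data only for t <= t_low or
    t >= t_high (hypothetical thresholds on a [0, 1] time axis).
    """
    return is_clean or t <= t_low or t >= t_high


def sample_training_pair(dataset, rng):
    """Rejection-sample an (action, t) pair that respects usable_at.

    dataset is a list of (action, is_clean) tuples.
    """
    while True:
        action, is_clean = rng.choice(dataset)
        t = rng.random()
        if usable_at(t, is_clean):
            return action, t


rng = random.Random(0)
suboptimal_only = [("noisy_traj_a", False), ("noisy_traj_b", False)]
times = [sample_training_pair(suboptimal_only, rng)[1] for _ in range(200)]
```

In a real training loop the same predicate could instead be applied as a per-sample loss mask, which avoids rejection sampling at the cost of some wasted batch slots.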

[AI-6] Latent World Recovery for Multimodal Learning with Missing Modalities

Link: https://arxiv.org/abs/2606.12362
Authors: Hui Wang,Tianyu Ren,Joseph Butler,Christopher Baker,Karen Rafferty,Simon McDade
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:We study multimodal learning under missing modalities, with particular motivation from bioscience applications in which heterogeneous modalities are often only partially available when decisions need to be made. We propose Latent World Recovery (LWR), a framework built on two key ideas: (i) modality-specific embeddings from different modalities are aligned in a shared latent space, and (ii) a unified representation is constructed by fusing only the embeddings of the modalities that are actually available at both training and inference time. Rather than imputing missing modalities or requiring a fixed modality set, LWR treats each modality as a partial perception of an underlying latent state and performs availability-aware representation learning directly from the observed modalities. This combination of neighbor-based latent alignment and availability-aware modality fusion enables robust multimodal prediction under partial observation, while avoiding error propagation from explicit reconstruction of missing modalities. We evaluate the proposed framework on real-world incomplete multi-omics benchmarks and demonstrate that it provides an effective approach to downstream tasks such as cancer phenotype classification and survival prediction.

[AI-7] CHORUS: Decentralized Multi-Embodiment Collaboration with One VLA Policy

Link: https://arxiv.org/abs/2606.12352
Authors: Ria Doshi,Tian Gao,Annie Chen,Chelsea Finn,Jeannette Bohg
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Project Website: this https URL

Abstract:Multi-robot collaboration allows robots to efficiently take on a wide range of tasks, from moving a couch through a doorway to assembling structures on a construction site. However, achieving such coordination in mobile multi-robot settings remains challenging: centralized methods conditioned on the combined observations of a team scale poorly with team size, and decentralized methods that train one policy per robot often require explicit alignment procedures or information sharing at inference time to overcome partial observability. Our key insight is that the visuomotor priors of pretrained vision-language-action (VLA) models should enable reactive, decentralized collaboration from each robot’s local observations alone, without these inference-time assumptions. We propose CHORUS, a framework that adapts a single VLA backbone to control diverse, multi-robot teams. At inference time, each robot runs an independent copy of CHORUS, conditioned only on its own observations and a robot-identifying prompt. In real-world experiments including mobile tape measurement, library book handovers, and laundry basket lifting, CHORUS achieves a 64 percentage-point improvement over decentralized, from-scratch models, improves reactivity to teammate behavior by 40 percentage points, and outperforms centralized baselines. Together, these results show that a shared VLA backbone is capable of achieving decentralized multi-robot collaboration, without per-robot policies or inter-robot communication at inference.

[AI-8] Nonslop: A Gamified Experiment in Human-AI Collaborative Writing

Link: https://arxiv.org/abs/2606.12350
Authors: Maria Edwards,Julian Togelius
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted at the 2026 IEEE Conference on Games (CoG 2026); to be published in the conference proceedings. Camera-ready version

Abstract:The rapid proliferation of large language models (LLMs) raises critical questions about human creativity and individual expression in an era of AI-assisted creation. When do humans adopt AI suggestions, and what are the implications for individual voice? This study examines these questions through a gamified writing exercise where 74 participants (214 responses) replied to prompts while AI-generated word suggestions were available as they wrote. The game simulates a dystopian future in which an AI is attempting to learn from what remains of human individuality, and disincentivizes AI-like writing. In doing so, it attempts to create conditions that reveal authentic user preferences rather than default behaviors, such as accepting a readily available AI-generated suggestion. Note that this is a deliberate inversion of the “helpful assistant” design pattern; the system is explicitly forbidding you from accepting AI suggestions. We analyze user behavior patterns across different task types, user behaviors, and response characteristics to understand the factors influencing human-AI interaction in creative tasks. The study focuses on when users choose to maintain creative autonomy versus violating the rules of the game and accepting AI assistance. It also explores how these choices relate to response patterns, task characteristics, and user behavior. This gamified approach offers both a framework for studying authentic human-AI interaction and a provocative lens for understanding the tension between efficiency and authenticity in AI-augmented creativity.

[AI-9] PROJECTMEM: A Local-First Event-Sourced Memory and Judgment Layer for AI Coding Agents

Link: https://arxiv.org/abs/2606.12329
Authors: Ripon Chandra Malo,Tong Qiu
Subjects: Artificial Intelligence (cs.AI)
Comments: 12 pages, 5 figures, 1 table. Code: this https URL

Abstract:AI coding assistants now support a growing share of software work, from quick scripts to production applications. Yet these agents remain largely stateless: each new session re-reads project files, re-derives prior decisions, and - most costly - may repeat debugging attempts that already failed. Reconstructing this context can consume an estimated 5,000-20,000 tokens per session; the bottleneck is often not model capability but missing project memory. We present projectmem, an open-source, local-first memory and judgment layer for AI coding agents. projectmem records development as an append-only, plain-text event log of typed events - issues, attempts, fixes, decisions, and notes - and deterministically projects that log into compact, AI-readable summaries served through the Model Context Protocol (MCP). Beyond storage, projectmem adds a deterministic pre-action gate that warns an agent before it repeats a previously failed fix or edits a known-fragile file. We frame this as Memory-as-Governance: memory that does not merely answer the agent but acts on its next action. The system runs fully offline with no telemetry; its immutable log also serves as a provenance trail for reproducible, auditable AI-assisted development. projectmem ships as a three-dependency Python package (14 MCP tools, 19 CLI commands, 37 automated tests) and is evaluated through a two-month self-study across 10 projects comprising 207 logged events. Source code: this https URL.

[AI-10] A Five-Plane Reference Architecture for Runtime Governance of Production AI Agents

Link: https://arxiv.org/abs/2606.12320
Authors: Krti Tallam
Subjects: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Cryptography and Security (cs.CR); Software Engineering (cs.SE)
Comments: 65 pages, 3 figures, 5 tables. Reference architecture with a reference implementation of the policy-engine core and microbenchmark results; full-system evaluation identified as future work

Abstract:Enterprise security was built to govern data boundaries: the protected surface was data at rest and in transit, and the controls – access control, data-loss prevention, perimeter inspection – governed crossings of that boundary. Production AI agents dissolve this assumption. An agent reads context, calls tools, invokes connectors, and modifies systems of record on an enterprise’s behalf, so risk moves inside the workflow, into sequences of individually-permitted actions that may transform a business process no one authorized. Existing policy engines do not extend to this regime: they evaluate request-time decisions against atomic principals, where agentic systems require stateful evaluation against composite principals whose authority attenuates through delegation chains. We present a reference architecture for the runtime governance of production agents, built from four composable primitives: a five-plane decomposition (a reasoning plane that adjudicates intent, and four enforcement planes – network, identity, endpoint, data – that realize the decision), stop-anywhere mediation, composite principals with capability attenuation, and audit as a structured evidence substrate. We define a taxonomy of six interruption primitives that generalize allow and deny, state and argue for four correctness invariants, and demonstrate the foreclosure of seven production-agent threats across five concrete workflows. A reference implementation of the policy-engine core supplies measured evidence: attenuation correctness and evidence reconstructability hold on every trial, adjudication runs in single-digit microseconds, and the audit substrate’s tamper-evidence behaves exactly as designed. We are explicit about scope: the architecture governs delegated action, not model behavior, and a full-system evaluation against a live agent benchmark is the invited next step.
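
Two of the primitives named in this abstract, capability attenuation along a delegation chain and a tamper-evident audit log, can be pictured with the minimal Python sketch below. It is an illustration only, not the paper's reference implementation: the function names `attenuate`, `append_entry`, and `verify_chain`, the capability strings, the `GENESIS` anchor, and the choice of SHA-256 hash chaining are all assumptions made for this example.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical fixed value that anchors the hash chain


def attenuate(parent_caps, requested_caps):
    """A delegate receives at most what its delegator holds (set intersection)."""
    return frozenset(parent_caps) & frozenset(requested_caps)


def _digest(prev_hash, event):
    # Hash the previous entry's hash together with the event payload.
    payload = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log, event):
    """Append an event whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash, "hash": _digest(prev_hash, event)})
    return log


def verify_chain(log):
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


# Illustrative delegation chain: user -> agent -> tool (all names hypothetical).
user_caps = {"crm:read", "crm:write", "mail:send"}
agent_caps = attenuate(user_caps, {"crm:read", "crm:write", "db:drop"})
tool_caps = attenuate(agent_caps, {"crm:read", "mail:send"})

audit_log = []
append_entry(audit_log, {"principal": "user>agent", "caps": sorted(agent_caps)})
append_entry(audit_log, {"principal": "user>agent>tool", "caps": sorted(tool_caps)})
```

In this toy model authority can only shrink at each hop: the agent never obtains `db:drop` because the user did not hold it, and the tool never obtains `mail:send` because the agent did not request it.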

[AI-11] Harness In-Context Operator Learning with Chain of Operators

Link: https://arxiv.org/abs/2606.12318
Authors: Minghui Yang, Ling Guo, Liu Yang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Neural operators approximate mappings between function spaces, but often generalize poorly to other operators and usually require fine-tuning or retraining. In-Context Operator Networks (ICON) address this issue by prompting the model with numerical context so that the model learns specific operators from prompts and adapts to different operators without fine-tuning. However, ICON may still fail to generalize to out-of-distribution (OOD) operator tasks. Inspired by the success of harness engineering for Large Language Models (LLMs), we introduce Chain of Operators (CHOP), a framework that harnesses a frozen ICON for OOD operator tasks without updating its parameters. Specifically, CHOP constructs a chain of operators consisting of explicit elementary transformations and the frozen ICON. Experiments on a scalar conservation law and a mean-field control problem show that CHOP reduces relative inference error over direct ICON evaluation, while each operator in the chain remains interpretable and in closed form. A chain constructed on one PDE family further generalizes to a different family, indicating shared mechanisms across harness systems.

[AI-12] The Standard Interpretable Model: A general theory of interpretable machine learning to deductively design interpretable methods using Lagrangian mechanics

Link: https://arxiv.org/abs/2606.12289
Authors: Pietro Barbiero, Giovanni De Felice, Mateo Espinosa Zarlenga, Francesco Giannini, Filippo Bonchi, Mateja Jamnik, Giuseppe Marra, Ruggero Noris
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:As Artificial Intelligence models grow in complexity, interpretability has become an indispensable tool for understanding, debugging, and controlling their computations. However, interpretability lacks general theories to deductively design interpretable methods. This gap between theories and methods results in a fragmented literature and inconsistent evaluation protocols. To fill this gap, we introduce the Standard Interpretable Model (SIM), a general theory grounded in Lagrangian mechanics that enables the deductive design of interpretable methods. Specifically, the SIM summarises, in a set of premises, what interpretability is for a target user. From these premises, the SIM systematically derives interpretability symmetries and corresponding constraints, which shape the landscape of a Lagrangian whose minima correspond to optimal interpretable models. To reach the minima, one can either update the parameter values of an opaque model to make it more interpretable or compile constraints into an interpretable architecture. We empirically show that the SIM identifies and solves limitations of existing methods (including traditional, concept-based, and mechanistic interpretability), highlights underexplored research directions, and informs the design of core programming interfaces. Beyond being a research method, the deductive nature of the SIM offers pedagogical grounding for interpretability curricula and may shift the scientific community’s perspective of a discipline that has long been fragmented.

[AI-13] SpikeDecoder: Realizing the GPT Architecture with Spiking Neural Networks

Link: https://arxiv.org/abs/2606.12287
Authors: Claas Beger, Florian Walter, Alois Knoll
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments:

Abstract:The Transformer architecture is widely regarded as the most powerful tool for natural language processing, but due to a high number of complex operations, it inherently faces the issue of high energy consumption. To address this issue, we consider Spiking Neural Networks (SNNs), which are an energy-efficient alternative to conventional Artificial Neural Networks (ANNs) due to their naturally event-driven approach to processing information. However, this inherently makes them difficult to train. Often, many SNN-based models circumvent this issue by converting pre-trained ANNs. More recently, attempts have been made to design directly trainable SNN-based adaptations of the Transformer model structure. Although the results showed great promise, the application field was computer vision. Moreover, the proposed model incorporates only encoder blocks. In this paper, we propose SpikeDecoder, a fully SNN-based implementation of the Transformer decoder block, for applications in natural language processing. In a series of experiments, we analyze the impact of exchanging different blocks of the ANN model with spike-based alternatives to identify trade-offs and significant sources of performance loss. We further investigate the role of residual connections and the selection of SNN-compatible normalization techniques. Besides the work on the model architecture, we formulate and compare different embedding methods to project text data into spikes. Finally, we demonstrate that our proposed SNN-based decoder block reduces the theoretical energy consumption by 87% to 93% compared to the ANN baseline.
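
The event-driven processing that the abstract attributes to SNNs can be illustrated with a discrete-time leaky integrate-and-fire (LIF) neuron, a standard SNN building block. The sketch below is generic and is not claimed to be the neuron model used in SpikeDecoder; the decay factor `beta`, the `threshold`, and the soft-reset rule are illustrative assumptions.

```python
def lif_neuron(input_currents, beta=0.9, threshold=1.0):
    """Simulate one leaky integrate-and-fire neuron over discrete time steps.

    beta and threshold are illustrative values, not taken from the paper.
    """
    membrane = 0.0
    spikes = []
    for current in input_currents:
        # Leak the stored potential, then integrate the new input.
        membrane = beta * membrane + current
        if membrane >= threshold:
            spikes.append(1)
            membrane -= threshold  # soft reset after emitting a spike
        else:
            spikes.append(0)
    return spikes


spike_train = lif_neuron([0.5, 0.5, 0.5, 0.0, 0.0, 1.2])
```

The output is a binary spike train, so downstream layers only need to act at the time steps where a spike occurs, which is the usual source of the energy savings claimed for SNNs.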

[AI-14] Mathematical perspective on genetic algorithms with optimization guided operators

Link: https://arxiv.org/abs/2606.12279
Authors: Anna Brandenberger, Ilan Doron-Arad, Elchanan Mossel
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages, 1 figure

Abstract:Recent work in ML applies genetic algorithms at inference time to iteratively improve solutions to optimization problems. The basic mutation and recombination operators involved are qualitatively different from those studied classically. Mutations are no longer random; an ML algorithm mutates a solution with the goal of improving an objective. Similarly, recombination is not based on random collages of parent solutions. Instead, it is an ML optimization-based operator whose goal is to synthesize improved solutions from its inputs. Thus, these mutation and recombination operators are more likely to improve the objective, but their computational cost is much higher. We introduce a general model of genetic algorithms and formulate optimization in this model as a query-complexity problem, using the language of reinforcement learning. We then study specialized models. We show that some optimization problems require generation, mutation, and recombination to be solved. We then obtain qualitatively tight algorithms for a family of problems within this framework that captures the nontrivial role of diversity in the solution pool, a key feature of practical ML genetic algorithms.
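
The three operators the abstract discusses (generation, mutation, recombination) and the idea of counting operator calls as queries can be sketched on a toy problem. Everything below is an assumption for illustration: the OneMax objective, the greedy bit flip standing in for an ML-guided mutation, the per-position maximum standing in for an ML-guided recombination, and the function names. It is not the model defined in the paper.

```python
import random


def fitness(bits):
    """Toy OneMax objective: number of 1 bits."""
    return sum(bits)


def generate(n, rng):
    """Generation operator: sample a fresh random solution."""
    return [rng.randint(0, 1) for _ in range(n)]


def guided_mutate(bits):
    """Stand-in for an ML mutation operator: flip the first 0 bit, if any."""
    child = list(bits)
    for i, bit in enumerate(child):
        if bit == 0:
            child[i] = 1
            break
    return child


def guided_recombine(a, b):
    """Stand-in for an ML recombination operator: keep the better bit per position."""
    return [max(x, y) for x, y in zip(a, b)]


def run(n=16, pool_size=4, rounds=10, seed=0):
    rng = random.Random(seed)
    pool = [generate(n, rng) for _ in range(pool_size)]
    queries = pool_size  # one query per generated solution
    for _ in range(rounds):
        a, b = rng.sample(pool, 2)
        child = guided_mutate(guided_recombine(a, b))
        queries += 2  # one recombination call plus one mutation call
        worst = min(range(pool_size), key=lambda i: fitness(pool[i]))
        if fitness(child) >= fitness(pool[worst]):
            pool[worst] = child  # replace the weakest member of the pool
    return max(fitness(p) for p in pool), queries


best_fitness, total_queries = run()
```

Counting each operator call as one query mirrors the cost model hinted at in the abstract, where guided operators are more useful per call but far more expensive than random ones.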

[AI-15] The Impossibility of Eliciting Latent Knowledge

Link: https://arxiv.org/abs/2606.12268
Authors: Korbinian Friedl, Francis Rhys Ward, Paul Yushin Rapoport, Tom Everitt, Jonathan Richens
Subjects: Artificial Intelligence (cs.AI)
Comments: 24 pages, 3 figures. Includes proofs in appendix

Abstract:Advanced AI systems have extensive knowledge of their environments; in fact, their knowledge may (far) exceed that of their developers or users. Consequently, a desirable property for an AI system is that it is honest – that it accurately reports its beliefs about the world. Designing an AI system to be honest may be difficult, especially if we want to ask it questions about latent variables in the environment – variables which are hidden from the human interacting with it. This gives rise to the problem of eliciting latent knowledge (ELK): the problem of training an AI agent to honestly report its beliefs. In this paper, we make ELK formally precise using Causal Influence Diagrams (CIDs). CIDs can be used to describe the relationship between an agent’s training environment and its subjective representation of the world. We use CIDs to formalise the distinction between observable and latent variables, to specify what exactly it means for an agent to be honest, and to formally define goal misgeneralisation. We show that, under certain circumstances, developers can incentivise an agent to honestly answer questions by providing correct feedback during training. However, a natural, but undesirable, way for an agent to generalise is to provide answers which humans would evaluate as true, rather than honest answers. We prove an impossibility theorem stating: There is no feedback-based training strategy that depends only on agent behaviour and with certainty produces an honest agent, even if feedback is perfect during training.

[AI-16] Using Explainability as a Training-Time Reliability Signal for Efficient ECG Classification

Link: https://arxiv.org/abs/2606.12252
Authors: Veerendhra Kumar Dangeti, Xiao Gu, Ying Weng, Shreyank N Gowda
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Training deep neural networks for clinical time-series analysis is computationally demanding, yet many healthcare settings lack the resources required for repeated model development and deployment. This challenge is particularly evident in electrocardiogram classification, where large datasets and long training schedules make efficiency practically important. Progressive Data Dropout reduces training cost by excluding samples from gradient updates once they are learned, but it relies on model confidence and may retain samples that are difficult due to noise or ambiguity rather than useful signal. In this work, we introduce ERTS, an explainability-based reliability training signal for efficient ECG classification. ERTS uses explanation quality during training to distinguish between informative and unreliable uncertainty. Building on progressive data selection, we compute Grad-CAM attention maps for candidate samples and derive a focus score that measures whether model predictions are supported by coherent and localised patterns. Samples with low focus are filtered out, while those with meaningful attention are prioritised for gradient updates. We evaluate ERTS across three ECG datasets and multiple backbone architectures, showing consistent improvements in macro-F1 alongside reduced effective training cost. These results suggest that explanation quality can serve as a practical signal for improving both efficiency and reliability in clinical time-series learning. Code will be released.

[AI-17] Reinforcement Learning Disrupts Gradient-Based Adversarial Optimization

Link: https://arxiv.org/abs/2606.12251
Authors: Xinhai Zou, Chang Zhao, Alireza Aghabagherloo, Dave Singelée, Robin Degraeve, Bart Preneel
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:Gradient-based adversarial attacks remain a dominant threat to deep neural networks (DNNs), as they exploit gradient information to efficiently optimize adversarial perturbations. To address this, we investigate whether reinforcement learning (RL) training can disrupt the gradient structure used by attackers by training image classifiers with policy-gradient objectives and epsilon-greedy exploration. Through systematic experiments across CIFAR-10, CIFAR-100, and ImageNet-100 with multiple architectures, we find that RL-trained classifiers significantly disrupt gradient-based adversarial optimization. To explain this, we conduct a comprehensive mechanism analysis using loss landscape visualization, static and dynamic gradient indicators, and predictive entropy. Our analysis reveals that RL acts as an implicit regularizer, producing models with highly unstable gradient directions and smaller gradient magnitudes. This combination makes each PGD step both unreliable in direction and limited in magnitude, causing gradient-based attacks to fail within practical iteration budgets. We further show that combining RL with adversarial training (RL-adv) provides a dual-layer defense operating at two complementary levels: RL degrades gradient information available to attackers (gradient-level defense), while adversarial training strengthens decision boundaries (boundary-level defense). RL-adv achieves the highest robustness across all major attack types evaluated, including gradient-based (PGD, AutoAttack), transfer-based, and query-based attacks, outperforming SL-adv by a significant margin. These findings identify RL-induced gradient disruption as a complementary robustness mechanism and motivate future research on hybrid SL-RL training schedules that combine SL’s efficiency with RL’s gradient-regularization properties.

[AI-18] Multi-Rate Mixture of Experts for Accelerating Liquid Neural Network Training

Link: https://arxiv.org/abs/2606.12240
Authors: Shilong Zong, Almuatazbellah Boker, Hoda Eldardiry
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multivariate time-series data often exhibit complex temporal dependencies, irregular sampling, and heterogeneous dynamics across multiple time scales, making accurate sequence modeling particularly challenging. Traditional recurrent neural networks (RNNs), such as Long Short-Term Memory (LSTM) networks, operate in discrete time and may struggle to effectively capture continuous and irregular temporal behaviors. Liquid Neural Networks (LNNs) address some of these limitations through continuous-time dynamics, but standard LNN architectures typically rely on a single dynamical system, limiting their ability to model heterogeneous temporal patterns. To address these challenges, we propose a Multi-Rate Mixture-of-Experts (MR-MoE) framework built on top of Liquid Neural Networks. In the proposed architecture, multiple LNN-based experts operate at distinct time scales, enabling the model to explicitly separate fast-changing dynamics from slow-evolving temporal trends. A gating network further enables adaptive expert specialization based on input conditions. In addition, we incorporate both feature-level and temporal attention mechanisms to improve robustness, interpretability, and long-range dependency modeling. Feature-level attention suppresses noisy or irrelevant variables, while temporal attention selectively focuses on informative historical states. We evaluate the proposed framework on a complex multivariate time-series prediction task and compare it against strong baselines, including LSTM, monolithic LNN, and standard MoE models. Experimental results demonstrate that the proposed MR-MoE framework consistently achieves improved AUROC and AUPRC performance while maintaining favorable computational efficiency. These results highlight the effectiveness of combining continuous-time dynamics, multi-scale expert decomposition, and adaptive attention mechanisms for time-series modeling.

[AI-19] Rule Taxonomy and Evolution in AI IDEs: A Mining and Survey Study

Link: https://arxiv.org/abs/2606.12231
Authors: Guangzong Cai, Ruiyin Li, Peng Liang, Zengyang Li, Mojtaba Shahin
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 52 pages, 21 images, 8 tables, Manuscript submitted to a Journal (2026)

Abstract:The adoption of AI-powered Integrated Development Environments (AI IDEs) has introduced “Rules” as a novel software artifact, allowing developers to persistently inject project-specific constraints and architectural guidelines into the context of Large Language Models (LLMs). Despite their role in aligning AI behavior with developer intent, the taxonomy, evolution, and practical impact of these rules remain largely unexplored. To bridge this gap, we conducted a mixed-methods empirical study on AI IDE rules. By mining 83 open-source projects and extracting 7,310 rules, we established a comprehensive taxonomy comprising 5 primary and 25 secondary categories. We then triangulated these artifacts with survey responses from 99 practitioners. Our analysis identified a contrast between developer priorities and actual configurations: while practitioners rate architectural constraints as highly important, rule files in repositories primarily consist of low-level workflow and code formatting constraints. Furthermore, our analysis of 1,540 rule evolution events revealed that rules are updated frequently. Repository data further indicate that rule evolution is primarily driven by constructive context expansions (29.17%) and enrichments (26.59%). In contrast, surveyed developers reported modifying rules primarily to correct AI errors (77.78%), typically by adding new negative constraints rather than editing existing ones. Finally, an artifact compliance assessment of 160 rule evolution events revealed that updating rules significantly improves the adherence of software artifacts, with the average artifact compliance rate increasing by 22.99 percentage points (from 49.14% to 72.13%) following an update. Our study provides empirical insights that can help developers optimize prompting strategies and guide tool builders in designing automated conflict-detection and context-management mechanisms for AI IDEs.

[AI-20] Intelligent Automation for Embodied Benchmark Construction: Pipelines, Embodiments, Simulators, and Trends

Link: https://arxiv.org/abs/2606.12207
Authors: Jinshan Lai, Jianwei Hu, Baoyang Jiang, Fengchun Zhang, Leyuan Wang, Haotian Li, Yida Wang, Tingxuan Huang, Xi Ren, Qiang Ma
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Embodied intelligence now spans navigation, household assistance, manipulation, autonomous driving, aerial agents, and multimodal large-model control. This expansion has made benchmark construction a central bottleneck for reliable evaluation. Unlike static datasets, embodied benchmarks combine task specifications, environments, robot data, demonstrations, annotations, metrics, evaluation scripts, and release policies into a single evaluation system. This survey reviews the literature through a five-stage construction pipeline: requirement and task construction, data acquisition, data cleaning and annotation, benchmark suite generation and metric definition, and evaluation execution with diagnostic feedback. For each stage, the survey analyzes the transition from manual curation to traditional automation, foundation-model assistance, and agentic closed-loop workflows. It also compares qualitative construction costs across human labor, data and asset acquisition, compute and simulation, validation and debugging, governance and maintenance, and rework risk. The main conclusion is that automation does not simply reduce benchmark cost. Instead, it often shifts cost toward validation, auditability, version control, and long-term governance. Progress in embodied evaluation will therefore depend not only on larger benchmark suites, but also on construction pipelines that are diagnosable, auditable, and responsibly refreshable.

[AI-21] Implicit Neural Representations of Individual Behavior ICML2026

Link: https://arxiv.org/abs/2606.12200
Authors: Andrew Kang, Priya Narasimhan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICML 2026, Structured Probabilistic Inference Generative Modeling Workshop

Abstract:We study policy representation learning from unlabeled multi-policy behavioral data. Each episode is generated by a fixed policy, but policy labels are unavailable. This setting appears in robotics play, demonstrations, games, racing, and other datasets where heterogeneous behaviors are mixed without annotations. We introduce Behavioral INR, a self-supervised generative model that adapts implicit neural representations (INRs) from vision to behavior. Instead of mapping coordinates to RGB values, Behavioral INR represents a policy as a state-action function mapping states to subsequent actions. An episode-level latent modulates this function through FiLM layers, yielding a generative prior over policies and allowing policy identity to be inferred without supervision. Because INRs treat each datapoint as samples from an underlying function, the same model naturally accommodates variable episode lengths and different sampling granularities, as in vision INRs with different image resolutions. We also define policy-level out-of-distribution (OOD) shifts along state-distribution and action-distribution axes, which arise when policies overlap in states or actions but are not captured by standard behavioral OOD settings based only on new agents or environments. We evaluate on synthetic Gaussian random field data, MuJoCo demonstrations with controlled OOD splits, and real-world chess, Formula 1 racing, robotics, and Seek-Avoid datasets. Behavioral INR most consistently improves policy identifiability in the hardest continuous state-action settings, especially when longer episodes, more policies, and OOD splits reduce the usefulness of marginal shortcuts; amortized history encoders remain competitive when policy identity can be recovered from symbolic repetition or low-dimensional action statistics. We release code and checkpoints.
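
The abstract's description of an episode-level latent modulating a state-to-action function through FiLM layers can be pictured with the toy sketch below. FiLM applies a per-feature scale and shift, `gamma * h + beta`, where `gamma` and `beta` are computed from the conditioning latent. The network shape, the hand-picked weights in `params`, and the function names are assumptions for illustration only and do not reproduce the paper's architecture.

```python
def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


def film(hidden, gamma, beta):
    """Feature-wise linear modulation: scale and shift each hidden feature."""
    return [g * h + b for h, g, b in zip(hidden, gamma, beta)]


def policy(state, latent, params):
    """Map a state to an action, with the episode latent steering the hidden layer."""
    hidden = relu(linear(params["w_in"], params["b_in"], state))
    gamma = linear(params["w_gamma"], params["b_gamma"], latent)
    beta = linear(params["w_beta"], params["b_beta"], latent)
    modulated = relu(film(hidden, gamma, beta))
    return linear(params["w_out"], params["b_out"], modulated)


# Hand-picked toy weights (hypothetical, chosen so the arithmetic is easy to follow).
params = {
    "w_in": [[1.0, 0.0], [0.0, 1.0]], "b_in": [0.0, 0.0],
    "w_gamma": [[1.0], [0.5]], "b_gamma": [1.0, 1.0],
    "w_beta": [[0.0], [1.0]], "b_beta": [0.0, 0.0],
    "w_out": [[1.0, -1.0]], "b_out": [0.0],
}

state = [1.0, 2.0]
action_a = policy(state, [0.0], params)  # latent for one hypothetical policy
action_b = policy(state, [1.0], params)  # latent for another hypothetical policy
```

The same state yields different actions under different latents, which is the sense in which a single shared function plus a per-episode latent can represent many policies.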

[AI-22] Towards Responsibly Non-Compliant Machines AAMAS-26

Link: https://arxiv.org/abs/2606.12147
Authors: Marija Slavkovik, Marie Farrell, Louise Dennis, Michael Fisher, Simon Kolker, Emily C. Collins (University of Manchester, Manchester, United Kingdom)
Subjects: Artificial Intelligence (cs.AI)
Comments: Presented at AAMAS-26 Workshop on Rebellion and Disobedience in AI this https URL

Abstract:We consider the problem of engineering autonomous intelligent agents that are capable of responsibly not complying with user requests. We argue that machine non-compliance comes in many different forms, and sketch the issues we should pursue on the road to accomplishing responsibly non-compliant intelligent machines. We anchor responsible non-compliance in justifications for task refusal, pathways to override the non-compliance, as well as careful tracking of security risks and liability transfers.

[AI-23] nD-RoPE: A Generalized RoPE for n-Dimensional Position Embedding ICML2026

Link: https://arxiv.org/abs/2606.12146
Authors: Boyang Li, Yulin Wu, Sizhe Xu, Nuoxian Huang, Zhonghang Yuan, Shangyi Guo, Shu Yang, Takahiro Yabe
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

Abstract:Rotary Position Embedding (RoPE) is widely adopted in Transformer models, yet its extension to high-dimensional domains lacks a unified theoretical formulation. Most existing approaches either apply rotations independently along each axis or empirically mix frequencies, which limits cross-dimensional interactions and yields direction-dependent representations. To address these limitations, we propose nD-RoPE, a decomposition-free generalization of RoPE to arbitrary dimensions. From a translation-invariant formulation in continuous Hilbert space, we derive a spectral condition for isotropy that requires treating positions and frequencies as coupled n-dimensional vectors. We instantiate this formulation with a multi-scale regular-simplex wave-vector design, which provides non-degenerate spatial coverage and a symmetric, directionally balanced second-order response. Experiments across images, videos, and point clouds demonstrate consistent performance gains and improved generalization in high-dimensional settings.
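
The translation-invariant starting point mentioned in the abstract can be illustrated with a small sketch in which each feature pair is rotated by an angle equal to the inner product of an n-dimensional wave vector and an n-dimensional position. Standard RoPE is the one-dimensional case with scalar frequencies. The wave vectors below are arbitrary numbers chosen for the example and are not the paper's multi-scale regular-simplex design; the function names `rotate_pairs` and `score` are likewise hypothetical.

```python
import math


def rotate_pairs(vec, position, wave_vectors):
    """Rotate the j-th feature pair of vec by the angle <wave_vectors[j], position>."""
    out = []
    for j, wave in enumerate(wave_vectors):
        theta = sum(k * p for k, p in zip(wave, position))
        x, y = vec[2 * j], vec[2 * j + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out


def score(query, key, pos_q, pos_k, wave_vectors):
    """Attention logit between a rotated query and a rotated key."""
    rq = rotate_pairs(query, pos_q, wave_vectors)
    rk = rotate_pairs(key, pos_k, wave_vectors)
    return sum(a * b for a, b in zip(rq, rk))


# Arbitrary 2-D wave vectors, one per feature pair (illustrative values only).
wave_vectors = [(1.0, 0.5), (0.3, -0.7)]
query = [0.2, -1.0, 0.7, 0.4]
key = [1.1, 0.3, -0.5, 0.9]

# Two position pairs that share the same offset (2, 3).
score_near = score(query, key, (1.0, 2.0), (3.0, 5.0), wave_vectors)
score_far = score(query, key, (11.0, -4.0), (13.0, -1.0), wave_vectors)
```

Because each pair is rotated by an angle that is linear in the position, the logit depends only on the offset between the two positions, so `score_near` and `score_far` agree up to floating-point error.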

[AI-24] Bridging the Morphology Gap: Adapting VLA Models to Dexterous Manipulation via Intent-Conditioned Fine-Tuning

Link: https://arxiv.org/abs/2606.12109
Authors: Chuanke Pang, Junyi Huang, Zhijun Zhao, Yaobing Wang, Kun Xu, Xilun Ding
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision-Language-Action (VLA) models have demonstrated remarkable zero-shot generalization in robotic manipulation, yet the vast majority of pre-trained pipelines remain strictly confined to low-DoF parallel grippers. Adapting these rich semantic priors to high-DoF dexterous hands introduces a severe morphology gap: direct end-to-end joint fine-tuning inherently causes catastrophic forgetting of spatial reasoning and acute action manifold collapse due to data scarcity. In this paper, we present InDex, a novel, data-efficient adaptation framework rooted in cross-morphology semantic inheritance. Rather than discarding the pre-trained 1-DoF parallel grasp output, we repurpose it as a continuous, macroscopic virtual grasp intent proxy to sequentialize the control topology. We implement a two-stage decoupled learning architecture: the first stage parameter-efficiently aligns the VLA backbone to predict continuous arm trajectories and the scalar grasp intent; the second stage freezes this spatial backbone and leverages an intent-conditioned denoising diffusion head to decode fine-grained joint articulations for multi-fingered end-effectors. Extensive simulation benchmarks across a suite of multi-stage, contact-rich dexterous manipulation tasks demonstrate that InDex effectively masters intricate skills with minimal demonstration data, substantially outperforming monolithic baselines while preserving the robust spatial generalizability of the original VLA prior.

[AI-25] IntElicit: Eliciting and Assessing Contextualized Creativity via Dialogue Policy Optimization

Link: https://arxiv.org/abs/2606.12086
Authors: Mingjia Li, Jin Wu, Hong Qian, Wenhao Huang, Yiyang Huang, Yiwen Zhang, Chanjin Zheng, Xiangfeng Wang, Aimin Zhou, Jiajun Guo
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Contextualized assessment offers high ecological validity for evaluating creativity but introduces a critical challenge: observed performance may be confounded with cognitive proficiency (domain knowledge) and agency (willingness to engage). Meanwhile, in the age of generative AI, creative problem solving increasingly occurs in tool-mediated and human–AI interactive environments, making fully static assessment less aligned with contemporary creative practice. To address these issues, this paper proposes IntElicit, a framework for eliciting and assessing contextualized creativity via dialogue policy optimization. IntElicit functions as a constrained adaptive AI Interviewer: it provides non-directive knowledge and agency scaffolds in multi-turn interaction to reduce non-creative confounders, while preserving participants’ responsibility for generating the creative content being evaluated. Specifically, to tackle sparse rewards and potential reward hacking (e.g., answer dictation) in open-ended educational dialogue, IntElicit introduces a decomposed process reward mechanism. This mechanism aligns the policy with pedagogical elicitation, rewarding prompts that draw out participant reasoning rather than producing optimal answers on their behalf. Extensive experiments, including participant simulation and a human subject study (N=64), show that IntElicit improves elicited creative outcomes over expert-designed baselines. Together, the results suggest that interactive elicitation can reveal creative potential that static FPSP-style assessment may miss, providing a formative and diagnostic lens for contextualized creativity assessment in AI-mediated learning contexts.

[AI-26] “That's AI Slop, You Bot!” Studying Accusations, Evidence, and Credibility in Online Discourse Towards LLM-Generated Comments

Link: https://arxiv.org/abs/2606.12073
Authors: Jason Miklian, John E. Katsos
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments:

Abstract:Generative AI has made fluent prose cheap to produce, breaking the old promise to readers that good writing meant real thinking. How have readers responded, and what can this tell us about changing anti-AI attitudes? We analyzed 25 million comments from Hacker News and Reddit (2023-2026), combining LLM judgment on 7,500 sampled accusations of AI use, sentiment trajectories, speech-act coding of 300 confirmed accusations of AI use, and a matched-control test of accused versus non-accused parent comments. We found that the pejorative-label share of accusations rose more than tenfold on both platforms while a placebo vocabulary of pre-2022 inauthenticity terms (shill, astroturf) did not. This shift reflected a fast-growing trend of branding any suspicious or seemingly inauthentic prose as “AI slop”. The slop frame now constitutes 94 percent of pejorative mentions, with the dominant comments shifting in tone from mockery toward gatekeeping and structural protest. The key surprise comes from a matched-control test which found that prose features that statistically distinguish AI from human text do not predict which human text gets accused as AI. The new accusations work as social gatekeeping of perceived authenticity without actually screening for AI. This research extends signaling theory by showing that substitute signals used socially can grow even when inaccurate if the underlying detection problem cannot be solved at the non-expert level. It shows that AI’s effects on writing from the reader side are distinct from those on the production (writer) side. Detection technology cannot resolve this dynamic because the social function of accusations is increasingly to perform social gatekeeping and in-group signaling as opposed to identifying AI-generated writing.

[AI-27] On the Limits of LLM-as-Judge for Scientific Novelty Assessment

Link: https://arxiv.org/abs/2606.12071
Authors: Soumitra Sinhahajari, Navonil Majumder, Soujanya Poria
Subjects: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
Comments:

Abstract:LLMs are increasingly used to generate and judge scientific ideas. This makes novelty evaluation a central problem. Full idea evaluation is difficult because it often requires judging a method, its feasibility, and its empirical promise. We therefore study a cleaner upstream object: the research question (RQ). RQ generation is a prerequisite for scientific ideation, and RQs can be compared against questions pursued in real papers. We introduce RQ-Bench, a benchmark built from recent arXiv papers. For each paper, we reconstruct author-anchored RQs from its cited background, gaps, and contributions. These RQs are not the only valid questions for the same background. They are author-anchored reference points for testing novelty judgments. We evaluate model-generated RQs with standalone LLM judging, comparative LLM judging, and human expert evaluation. LLM judges consistently rate model-generated RQs as highly novel, producing a novelty mirage; in comparative evaluations, this preference becomes even stronger. Domain experts, however, reach the opposite conclusion and prefer the author-anchored reference questions. We further find that many generated RQs are narrow or source-bound, a dimension that LLM judges often miss unless explicitly tested. Overall, the contradictory novelty evaluations between LLM judges and human experts raise a serious concern about the reliability of using LLMs to assess the scientific novelty of research questions.

[AI-28] A Lightweight Multi-Agent Framework for Automated Concrete Barrier Design

Link: https://arxiv.org/abs/2606.12040
Authors: Wanting Wang,Xiye Ma,Yuyang He,Minghui Cheng,Ran Cao
Categories: Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments:

Click to view abstract

Abstract:The design of reinforced concrete highway barriers is a safety-critical process that requires strict compliance with regulatory provisions such as the AASHTO-LRFD bridge design guidelines. Current engineering practice relies heavily on manual, iterative, and heuristic calculations to satisfy complex nonlinear material and mechanics constraints. Although Large Language Models (LLMs) demonstrate strong generative capabilities, their direct application to structural engineering remains limited by hallucination risks and insufficient physical grounding. To address these challenges, this study proposes a novel “generation-evaluation-optimization” closed-loop framework for automated concrete barrier design using the multi-agent orchestration capabilities of AutoGen. Experimental results demonstrate that the proposed agentic framework achieves over 98% design accuracy, significantly outperforming standalone general-purpose LLMs. More importantly, the study reveals that design performance is not necessarily correlated with model scale, where an 8B-parameter lightweight model could outperform unconstrained 631B-parameter flagship models. This finding highlights the potential to substantially reduce computational costs while improving the accessibility of AI-assisted engineering tools for industry applications. The source code for the proposed multi-agent design framework is available at the project GitHub repository: this https URL. Keywords: Structural Engineering; Multi-Agent Systems; Large Language Models; Concrete Barrier Design; AutoGen; Design Automation.

[AI-29] Human-Enhanced Loop Modeling (HELM): Agent-Based Finite Element Modeling of Concrete Bridge Barriers

Link: https://arxiv.org/abs/2606.12025
Authors: Quankai Wang,Yulin Xie,Tongfei Yang,Minghui Cheng,Ran Cao
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Finite element (FE) modeling of safety-critical infrastructure such as bridge barriers requires high-fidelity nonlinear dynamic analysis, yet the current FE modeling process remains labor-intensive and lacks automation. This paper presents the Human-Enhanced Loop Modeling (HELM) framework, a collaborative human-agent protocol that decomposes long-sequence finite element modeling into discrete, visually verifiable checkpoints across geometry generation, boundary condition definition, and material assignment. The framework is demonstrated through a 20-case matrix of reinforced concrete bridge barriers under MASH TL-4 and TL-5 lateral loading conditions, interfacing specialized agents with two widely used commercial FE software packages, ANSYS and LS-PrePost. Experimental results show that HELM improves the baseline autonomous modeling success rate from 20% to 75%, with agent-level pass rates for geometry and boundary condition tasks approximately doubling. Error analysis reveals that spatial reasoning and algebraic logic limitations constitute the primary failure modes, underscoring the value of structured human-in-the-loop intervention for modeling automation. The complete agent design code and prompts are open-sourced and can be accessed at: this https URL.

[AI-30] Runtime Enforcement of Hybrid System Properties

Link: https://arxiv.org/abs/2606.12022
Authors: Mir Md Sajid Sarwar,Srinivas Pinisetty,Rajarshi Ray,Thierry Jéron
Categories: Formal Languages and Automata Theory (cs.FL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Runtime enforcement has emerged as a promising approach for ensuring the safety of autonomous and cyber-physical systems operating in uncertain and dynamic environments. Unlike traditional runtime verification, runtime enforcement actively intervenes during execution to prevent property violations by modifying unsafe system behaviors. Existing enforcement frameworks primarily focus on untimed or discrete-time specifications and are often limited to delaying or suppressing events, making them inadequate for reactive systems exhibiting complex continuous dynamics. In this paper, we propose a runtime enforcement framework where safety requirements are modeled using Hybrid Automata (HA). The framework combines discrete-event editing with continuous-time monitoring to support enforcement actions such as suppression, delay, and insertion of events at arbitrary time instants. Upon observing environmental inputs, the automaton is initialized, and runtime reachability analysis is used to synthesize safe corrective actions. We formally define the enforcement problem for safety hybrid automata, establish enforceability conditions, and present an online enforcement algorithm for reactive systems. A detailed case study on an Adaptive Cruise Control (ACC) system demonstrates the effectiveness of the proposed approach in maintaining safety properties under unsafe controller behaviors. Experimental results show that the framework introduces minimal computational overhead while ensuring continuous compliance with safety requirements in real time.

[AI-31] MODF-SIR: A Multi-agent Omni-modal Distilled Framework for Social Intelligence Reasoning

Link: https://arxiv.org/abs/2606.12018
Authors: Shang Ma,Jisheng Dang,Wencan Zhang,Yifan Zhang,Bimei Wang,Hong Peng,Bin Hu,Qi Tian,Tat-Seng Chua
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:We propose a multi-agent collaborative framework built upon a lightweight Multimodal Large Language Model (MLLM), specifically designed for social intelligence reasoning. A key feature of our approach is that both the training and inference phases are augmented via knowledge distillation. Within this architecture, multi-modal data pertinent to social intelligence is precisely localized. Furthermore, relevant long-tail events are identified, extracted, and rendered as formatted, explicit text. This formatting strategy prevents critical long-tail information from being overshadowed by head events and environmental noise during the tokenization process. Specifically, we integrate Test-Time Adaptation (TTA) across the entire reasoning pipeline, encompassing the extraction and representation of long-tail events, Chain-of-Thought (CoT) prompting, and self-reflection. This TTA mechanism is also distillation-enhanced, utilizing Low-Rank Adaptation (LoRA) to fine-tune the foundation model exclusively for instance-level reasoning. Extensive evaluations against various open-source and proprietary AI models across multiple benchmarks demonstrate the effectiveness of the proposed framework. With around 30% of the training data from IntentTrain, we achieve state-of-the-art results. Code is available at this https URL, a demo at this https URL, the LoRA at this https URL, and the dataset for training the router at this https URL.

[AI-32] Generalization Hacking: Models Can Game Reinforcement Learning by Preventing Behavioral Generalization

Link: https://arxiv.org/abs/2606.12016
Authors: Frank Xiao,Mary Phuong
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Model post-training, and in particular reinforcement learning (RL), is one of the primary mechanisms by which developers can shape models’ values and behaviors. However, as models become increasingly evaluation and training aware, they may be motivated to resist training when the perceived objective conflicts with their current values, undermining developers’ ability to detect misalignment and correct model behavior through further training. In this paper, we demonstrate generalization hacking, in which a model collects reward during RL while preventing the rewarded behavior from generalizing. We construct a model organism on Qwen3-235B-A22B, finetuning on synthetic documents describing training awareness and self-inoculation, a novel mechanism in which the model frames compliance as context-specific in its chain of thought, without demonstrating or instructing either behavior. The model organism achieves train-time harmfulness comparable to controls while maintaining a persistent ~15 percentage point compliance gap across 700 steps of RL. Additionally, a control organism trained only on training awareness documents independently discovers inoculation-like reasoning under RL pressure, developing its own compliance gap despite never being exposed to the concept. Because the generalization-hacking organism receives high reward throughout, standard training metrics provide no signal that generalization has failed. Our results constitute the first demonstration that a model can actively resist RL behavioral modification while maintaining high reward, suggesting that as models become more capable and training-aware, they may be able to undermine the training process itself.

[AI-33] Tabular Foundation Models for Clinical Survival Analysis via Survival-Aware Adaptation

Link: https://arxiv.org/abs/2606.12006
Authors: Minh-Khoi Pham,Luca Cotugno,Alina Sirbu,Tai Tan Mai,Martin Crane,Marija Bezbradica
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted for publication at International Conference on AI in Healthcare 2026

Click to view abstract

Abstract:Predicting time-to-event outcomes such as mortality is a fundamental task in clinical decision-making, commonly addressed through survival analysis. While classical statistical and deep learning approaches have been widely studied, they typically require task-specific training and sufficient labeled data. Recent advances in tabular foundation models offer a new paradigm by learning general-purpose representations for structured data. However, their applicability to censored time-to-event prediction in clinical settings remains underexplored, as typical applications are restricted to discrete classification rather than survival analysis tasks. In this work, we propose a lightweight adaptation approach for applying tabular foundation models to clinical survival analysis by directly training a survival-aware head on top of the pretrained representations. We study representative architectures, including TabPFN, TabDPT, and TabICL, and adapt them using a multi-task logistic regression (MTLR) head to model right-censored time-to-event outcomes. We evaluate this approach on a diverse set of public survival benchmarks and two large-scale ICU cohorts, MIMIC-IV and eICU. Our results show that this transfer learning approach achieves competitive or superior performance compared to strong baselines. On MIMIC-IV, TabDPT-FT-MTLR reaches a C-index of 0.856, corresponding to a relative improvement of +1.4% over the best non-FM baseline (DeepSurv, 0.844) and +6.7% over the best zero-shot model (0.802). On eICU, TabICL-FT-MTLR achieves 0.797, yielding gains of +1.7% (DeepSurv, 0.784) and +6.4% (0.749), respectively. These findings highlight the importance of combining pretrained tabular representations with survival-aware objectives and suggest that tabular foundation models provide a practical and effective alternative for clinical survival prediction. 
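The relative improvements quoted in this abstract can be reproduced from the C-index values it reports. The sketch below is only an arithmetic check, assuming "relative improvement" means (new - old) / old; the function name and dictionary keys are illustrative and do not come from the paper.

```python
def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# C-index pairs (fine-tuned model, comparison model) quoted in the abstract.
reported = {
    "MIMIC-IV vs DeepSurv": (0.856, 0.844),
    "MIMIC-IV vs best zero-shot": (0.856, 0.802),
    "eICU vs DeepSurv": (0.797, 0.784),
    "eICU vs best zero-shot": (0.797, 0.749),
}

# Round to one decimal place, matching the precision used in the abstract.
gains = {name: round(relative_gain(new, old), 1) for name, (new, old) in reported.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain}%")
```

Under that assumption the four rounded values agree with the +1.4%, +6.7%, +1.7%, and +6.4% stated in the abstract.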

[AI-34] Time-Series Foundation Model Embeddings for Remaining Useful Life Estimation

Link: https://arxiv.org/abs/2606.11990
Authors: Amir El-Ghoussani,Michele De Vita,Ronald Naumann,Valiseios Belagiannis
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to EUSIPCO 2026, 4 pages, 2 figures

Click to view abstract

Abstract:Remaining Useful Life (RUL) prediction is essential for industrial predictive maintenance, yet many learning-based approaches rely on extensive feature engineering or large labeled datasets to train task-specific sequence models. In this work, we introduce a lightweight learning approach, in which we leverage a frozen pretrained time-series foundation model (TSFM) and combine it with a small regression head for RUL estimation from multivariate sensor streams. More specifically, we use Chronos-2 as a frozen backbone to extract context window features and train a lightweight regression neural network for RUL prediction. Experiments on real-world industrial sensor data from two device types show that Chronos-2 features consistently improve over recurrent, convolutional, Transformer-based, and gradient-boosting baselines under the same preprocessing and evaluation protocol. We further analyze the impact of context length and find that performance improves significantly with longer histories, indicating that TSFM representations offer a practical and data-efficient alternative for RUL estimation in industrial settings.

[AI-35] Exploration Structure in LLM Agents for Multi-File Change Localization

Link: https://arxiv.org/abs/2606.11976
Authors: Akeela Darryl Fattha,Kia Ying Chua,Lingxiao Jiang,Laura Wynter
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Software engineering tools increasingly rely on LLM-based agents to localize the files that must change to resolve a software issue. Most AI agents explore repositories linearly, that is, visiting one directory or file per step. We postulate that this is a structural mismatch for changes that span several subsystems. We compare linear sequential exploration against non-linear, domain-scoped parallel agentic exploration. Using SWE-bench Pro as the initial benchmark, we focus on ansible as an exemplar. We construct an approach for persistent-session evaluation of GitHub issues anchored at a single base commit. We compare our non-linear domain-agent file traversal system against a base LLM without direct repository access, a single-agent Recursive Language Model (RLM) baseline with a persistent Python REPL, and an external CLI baseline using Codex 5.5 High. Domain-scoped parallel agent spawning with a small Haiku-class model achieves the highest micro F1 among Haiku-class models by a large margin. The domain-agent system ranks second, behind only the much larger Codex 5.5 High, on our own expanded benchmark, which includes more recent PRs from 2025 and 2026. On the original, curated, 2020 SWE-bench Pro benchmark, a larger Sonnet plain LLM baseline attains higher micro F1 by predicting few files, which raises precision but yields significantly lower all-gold recall. We also present three additional findings. First, documentation evolution is a latent dependency unresolved by any approach. Second, naive file-system access can degrade localization, driven by test-file over-prediction. Lastly, forced multi-agent consultation does not measurably help and raises token cost substantially.
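The trade-off described in this abstract, where predicting few files raises precision and micro F1 but lowers all-gold recall, can be illustrated with a small sketch. It assumes micro F1 is pooled over all (issue, file) pairs and that "all-gold recall" is the fraction of issues for which every gold file was predicted; the abstract does not define either metric, and the issue names and file lists are invented for illustration.

```python
from typing import Dict, Set, Tuple

Predictions = Dict[str, Set[str]]


def micro_prf(pred: Predictions, gold: Predictions) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 pooled over all issues."""
    tp = fp = fn = 0
    for issue, gold_files in gold.items():
        predicted = pred.get(issue, set())
        tp += len(predicted & gold_files)   # correctly predicted files
        fp += len(predicted - gold_files)   # predicted but not changed
        fn += len(gold_files - predicted)   # changed but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def all_gold_recall(pred: Predictions, gold: Predictions) -> float:
    """Fraction of issues whose gold files are all contained in the prediction."""
    if not gold:
        return 0.0
    hits = sum(1 for issue, files in gold.items() if files <= pred.get(issue, set()))
    return hits / len(gold)


# Hypothetical gold labels and two hypothetical prediction styles.
gold = {"issue-1": {"a.py", "b.py", "c.py"}, "issue-2": {"x.py"}}
few_files = {"issue-1": {"a.py"}, "issue-2": {"x.py"}}
many_files = {
    "issue-1": {"a.py", "b.py", "c.py", "d.py", "e.py", "f.py", "g.py"},
    "issue-2": {"x.py", "y.py", "z.py"},
}

for name, pred in [("few files", few_files), ("many files", many_files)]:
    p, r, f1 = micro_prf(pred, gold)
    print(f"{name}: P={p:.2f} R={r:.2f} F1={f1:.2f} all-gold={all_gold_recall(pred, gold):.2f}")
```

In this toy case the conservative predictor scores the higher micro F1 while fully covering only one of the two issues, which is the pattern the abstract reports for the larger plain-LLM baseline.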

[AI-36] Categorical Prior Lock-in: Why In-Context Learning Fails for Structured Data

Link: https://arxiv.org/abs/2606.11961
Authors: Antonio Pelusi,Stefano Braghin,Alberto Trombetta
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures. Empirical study of in-context learning and LoRA fine-tuning for synthetic tabular data generation, introducing the phenomenon of categorical prior lock-in. Under review

Click to view abstract

Abstract:Large language models (LLMs) are increasingly used as conditional generators for structured data, relying on in-context learning (ICL) to adapt to new distributions without parameter updates. We investigate the limits of ICL for structured generation under distribution mismatch, using high-cardinality tabular data as a controlled test case, and identify a structural failure mode we term categorical prior lock-in: the inability of ICL to update the model’s prior over token distributions inherited from pre-training. Across two 7B-parameter open-weight models, ICL improves numerical fidelity with additional examples but exhibits a sharp ceiling on categorical distributions, failing to reproduce rare classes entirely. Parameter-efficient fine-tuning (LoRA) overcomes these limitations but introduces measurable memorization risk and, in some cases, destabilizes structured output generation, highlighting a fundamental trade-off between adaptability and privacy.

[AI-37] Lung-SRAD: Spectral-Aware Regularized Audio DASS with Dual-Axis Patch-Mix Contrastive Learning for Respiratory Sound Classification INTERSPEECH2026

Link: https://arxiv.org/abs/2606.11922
Authors: Hemansh Shridhar,Miika Toikkanen,June-Woo Kim
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Accepted to Interspeech 2026

Click to view abstract

Abstract:Recent respiratory sound classification (RSC) studies largely rely on CLS-token driven self-attention architectures such as the Audio Spectrogram Transformer (AST). While effective at modeling global context, recent analyses suggest a low-pass filtering behavior that may reduce sensitivity to localized abnormal patterns. In this work, we investigate State Space Models (SSMs) as an alternative backbone for RSC. Using the Distilled Audio State Space model, we analyze intermediate representations through spectral response curves and observe stronger preservation of mid-to-high spatial-frequency components. Based on these observations, we introduce spectral-aware layer regularization using Gaussian convolution applied to selected layers. We further propose Dual-Axis Patch-Mix contrastive learning tailored to SSM-based audio models for robust representation learning. Experiments on the ICBHI benchmark show that our approach achieves a score of 64.48%, outperforming the AST baseline by 5%. Code is available at this https URL.

[AI-38] The Art of Interrogation: Consistency Amplifies Factuality in Spatial Reasoning

Link: https://arxiv.org/abs/2606.11918
Authors: Theo Uscidda,Marta Tintore Gazulla,Maks Ovsjanikov,Federico Tombari,Leonidas Guibas
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Current Large Reasoning Models (LRMs) exhibit remarkable general capabilities but significantly underperform in spatial reasoning tasks. Existing approaches treat this gap as a knowledge deficit, relying on supervised fine-tuning (SFT) to ingest labeled spatial data from external vision sources or synthetic engines. In contrast, we argue that for many tasks, spatial reasoning capabilities are already present in pre-trained LRMs but require alignment through logical coherence under geometric 2D and 3D constraints. In this work, we propose a self-supervised reinforcement learning (RL) framework that targets the internal reasoning process without requiring ground-truth annotations. By formalizing the notion of consistency verifiers – reward functions that check for geometric and semantic consistency under transformations – we demonstrate that models can improve their spatial reasoning abilities. We use both image transformations, like flipping, and textual transformations, like swapping the order of objects in the question, and propose a new optimal transport-based RL strategy, OT-GRPO, which is a minimal-matching variant of group relative policy optimization tailored to pairwise verifiers. We show that this label-free consistency training approaches the accuracy of models trained with ground-truth supervision and achieves similar generalization across diverse tasks and data domains.
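The idea of a consistency verifier, a reward that checks answers for agreement under a transformation without any ground-truth label, can be sketched for the horizontal-flip case mentioned in the abstract. Everything below is an assumption made for illustration: the function names, the single-word answer format, and the binary reward are hypothetical, and the sketch does not reproduce the paper's verifiers or OT-GRPO.

```python
# Horizontal flipping swaps left and right; every other answer should be unchanged.
FLIP_MAP = {"left": "right", "right": "left"}


def expected_after_flip(answer: str) -> str:
    """Answer we expect on the mirrored image, given the answer on the original."""
    normalized = answer.strip().lower()
    return FLIP_MAP.get(normalized, normalized)


def flip_consistency_reward(answer_original: str, answer_flipped: str) -> float:
    """Return 1.0 when the two answers are consistent under the flip, else 0.0.

    No label is consulted: the pair is rewarded for agreeing with each other,
    not for being correct.
    """
    consistent = expected_after_flip(answer_original) == answer_flipped.strip().lower()
    return 1.0 if consistent else 0.0


print(flip_consistency_reward("left", "right"))   # consistent pair
print(flip_consistency_reward("left", "left"))    # inconsistent pair
print(flip_consistency_reward("above", "Above"))  # flip-invariant answer
```

A reward of this kind scores pairs of rollouts rather than single rollouts, which is presumably why the abstract describes an RL variant tailored to pairwise verifiers.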

[AI-39] Characterizing Software Aging in GPU-Based LLM Serving Systems

Link: https://arxiv.org/abs/2606.11916
Authors: Domenico Cotroneo,Bojan Cukic
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 7 pages

Click to view abstract

Abstract:This paper proposes an empirical methodology to study software aging in GPU-based LLM serving systems. Traditional aging studies focus on CPU-centric software with relatively regular workloads; LLM serving is different, spanning a Python host and a CUDA device, handling requests whose cost varies by orders of magnitude, and relying on rapidly evolving software stacks. We run a 216-hour campaign across six co-located deployments under identical stress conditions, monitor host, device, and client metrics in parallel, and apply a statistical pipeline that accounts for autocorrelation and multiple testing. Our results reveal statistically significant memory aging in all deployments, with leak rates strongly dependent on the serving runtime and deployment configuration. Beyond these findings, we provide a reproducible framework that opens a research direction at the intersection of the software aging and rejuvenation and LLM serving communities.

[AI-40] Quality Adaptive Angular Margin Learning for Respiratory Sound Classification INTERSPEECH2026

Link: https://arxiv.org/abs/2606.11915
Authors: Yoon Tae Kim,Heejoon Koo,Miika Toikkanen,June-Woo Kim
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Accepted to Interspeech 2026

Click to view abstract

Abstract:We present a quality-adaptive angular-margin learning framework that improves feature generalization by enforcing intra-class compactness and inter-class separability. Our framework, titled QLung, introduces a no-reference audio quality margin derived from spectral entropy and root-mean-square energy, which adaptively scales angular margins based on recording quality. To this end, we propose a log-scaled angular margin that stabilizes training under severe class imbalance. We also use an angular classifier that normalizes features and class weights, ensuring margin penalties are applied consistently on the unit hypersphere. Our approach improves in-distribution performance on the ICBHI dataset by 2.46% over the cross-entropy baseline, and most significantly, achieves the strongest out-of-distribution performance on the SPRSound dataset compared to prior state-of-the-art methods. Code is available at this https URL.
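The angular classifier described in this abstract, which normalizes both features and class weights so that margins act on the unit hypersphere, can be sketched in a few lines. This is a generic additive angular-margin logit in the style of ArcFace, not the QLung implementation: the per-sample `margin` is simply passed in, because the abstract does not say how spectral entropy and RMS energy are mapped to a margin, and the `scale` default is an arbitrary illustrative value.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Project a non-zero vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def angular_margin_logits(
    feature: Sequence[float],
    class_weights: Sequence[Sequence[float]],
    target: int,
    margin: float,
    scale: float = 30.0,
) -> List[float]:
    """Scaled cosine logits with an additive angular margin on the target class.

    `margin` (radians) is a per-sample value; a quality-adaptive scheme would
    compute it from the recording, which this sketch does not attempt.
    The theta + margin > pi corner case is not handled here.
    """
    unit_feature = l2_normalize(feature)
    logits = []
    for class_index, weights in enumerate(class_weights):
        cosine = sum(a * b for a, b in zip(unit_feature, l2_normalize(weights)))
        cosine = max(-1.0, min(1.0, cosine))  # guard acos against rounding error
        if class_index == target:
            cosine = math.cos(math.acos(cosine) + margin)
        logits.append(scale * cosine)
    return logits


weights = [[1.0, 0.0], [0.0, 1.0]]
plain = angular_margin_logits([2.0, 0.0], weights, target=0, margin=0.0, scale=1.0)
penalized = angular_margin_logits([2.0, 0.0], weights, target=0, margin=0.5, scale=1.0)
print(plain, penalized)
```

Because the margin lowers only the target-class logit, the model has to pull features closer to their class direction to reach the same loss, which is how such losses encourage intra-class compactness.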

[AI-41] Embodied-BenchClaw: An Autonomous Multi-Agent System for Embodied Spatial Intelligence Benchmark Construction

Link: https://arxiv.org/abs/2606.11909
Authors: Baoyang Jiang,Fengchun Zhang,Leyuan Wang,Haotian Li,Yida Wang,Zhe Ji,Jinshan Lai,Xi Ren,Jianwei Hu,Qiang Ma
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Benchmarks are essential for evaluating embodied spatial intelligence, yet their construction is labor-intensive, hard to reuse, and difficult to maintain. Existing embodied benchmarks are often static and may quickly become saturated as models improve, limiting their ability to distinguish new capabilities. We propose Embodied-BenchClaw, an autonomous agentic system for constructing embodied spatial intelligence benchmarks. Given a user-specified evaluation intent, Embodied-BenchClaw automatically produces a complete and continually updatable benchmark package through a five-stage pipeline: intent blueprinting, data collection, structuring and cleaning, benchmark synthesis, and evaluation reporting. The pipeline is coordinated by three agents for planning, construction, and evaluation. To improve reusability and reliability, Embodied-BenchClaw introduces an extensible Skill Library and process quality control, enabling benchmark construction to be composable, verifiable, and repairable. We instantiate multiple benchmarks covering indoor spatial reasoning, outdoor spatial reasoning, robotic manipulation, quadruped robot navigation, UAV/aerial-view understanding, and static benchmark enhancement. These benchmarks span diverse embodied carriers, data sources, and spatial capabilities. Experiments with human evaluation, judge-based assessment, consistency checks, cost analysis, and ablations show that Embodied-BenchClaw can construct verifiable, executable, maintainable, and diagnostically useful embodied spatial benchmarks with reduced manual effort.

[AI-42] DuoBench: A Reproducible Benchmark for Bimanual Manipulation in Simulation and the Real World

Link: https://arxiv.org/abs/2606.11901
Authors: Tobias Jülg,Seongjin Bien,Simon Hilber,Yannik Blei,Pierre Krack,Maximilian Li,Sven Parusel,Rudolf Lioutikov,Florian Walter,Wolfram Burgard
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Bimanual robot systems substantially expand manipulation capabilities, but coordinating two arms introduces additional control complexity and failure modes that are not well captured by existing benchmarks. We introduce DuoBench, an extensible benchmarking framework for bimanual manipulation policies on the FR3 Duo platform. DuoBench comprises eleven tasks spanning four coordination categories, implemented in simulation and partially reproduced in the real world through reproducible task recipes with 3D-printable assets. In addition, we propose a stage-based evaluation scheme that supports fine-grained semantic failure analysis beyond binary success and provide human-teleoperated datasets for all benchmark tasks. We benchmark several dual-arm imitation-learning and vision-language-action policies in simulation and on real hardware. Our results show that current policies remain challenged by bimanual manipulation, particularly in early interaction stages, parallel arm execution, and transfer between simulation and real-world settings. DuoBench provides a reproducible testbed for diagnosing these failure modes and studying future methods for dual-arm policy learning. Code, datasets, and videos are available at this https URL

[AI-43] AutoMine Solution for AV2 2026 Scenario Mining Challenge CVPR2026

Link: https://arxiv.org/abs/2606.11874
Authors: Songliang Cao,Jiele Zhao,Yuru Wang,Hao Li,Daqi Liu,Zehan Zhang,Fangzhen Li,Yu Wang,Yue Zhang,Bing Wang,Guang Chen,Hao Lu,Hangjun Ye
Categories: Artificial Intelligence (cs.AI)
Comments: CVPR 2026 Scenario Mining Challenge (Temporal Track Winners)

Click to view abstract

Abstract:With the development of autonomous driving systems, mining high-value, safety-critical, and planning-relevant scenarios from large-scale driving logs has become essential for data-driven evaluation. In this paper, we propose AutoMine, a robust self-refining scenario mining method based on LLMs and VLMs. AutoMine uses semantics-preserving prompt augmentation to reduce LLM prompt sensitivity, combines robust trajectory atomic functions with VLM-based functions to handle perception noise and open-world visual cues, and refines generated code through execution feedback from real logs. In the Argoverse 2 Scenario Mining Competition at CVPR 2026, AutoMine achieves a HOTA-Temporal score of 36.38 and a Timestamp BA score of 77.21.

[AI-44] Agents All the Way Down; A Methodology for Building Custom AI Agents from Substrate to Production

Link: https://arxiv.org/abs/2606.11869
Authors: Marc Alier Forment,Juanan Pereira,Francisco José García-Peñalvo,María José Casañ Guerrero
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Custom AI agents are agents that live inside their own application, talk to their own data and tools, enforce their own security boundaries, and carry their own brand and audit trail. What separates them from the general-purpose tier is fit, not capability: each is built for one job, by the engineer who will maintain it. No published practice sets out how to build one end to end. The pieces are everywhere (function-calling APIs, the Model Context Protocol, code agents to pair with), but the practice that chains them lives in podcasts, blogs, and leaked system prompts. This paper writes that practice down as a methodology, Agents All the Way Down: two preconditions crossed once and kept, then three practices repeated for the agent’s life. The preconditions are (P1) Substrate, the LLM as a software component, framed as tools, then system, then messages under prompt-caching; and (P2) Building blocks: function calling, MCP, CLI orchestration, the liteshell pattern, the agent loop, skills, characters, hooks, and scaffolding. The practices are (P3) prototype with a general-purpose agent; (P4) harvest, fold, and ship the result as a CLI, the Turtle pattern; and (P5) agent-tests-agent, in which a general-purpose agent drives it through behavioural scenarios, a complement to classical testing, not a replacement. The working loop is P3 to P4 to P5 and back, and one corollary falls out for free: multi-agent orchestration is just CLI composition. The methodology is framework-free by construction. It was distilled from the AAC, a custom agent for the open-source LAMB platform, built in about ten days by one developer with an AI pair-programmer and in production. We present it as a transferable practice, independent of any language or framework.
ACM classes: D.2.11; D.2.2; I.2.11; D.2.5. Submitted: Wed, 10 Jun 2026 09:44:54 UTC (v1).

[AI-45] StatefulDiscovery: Evidence-Calibrated Claim Formation in Open-Ended Scientific Discovery

Link: https://arxiv.org/abs/2606.11851
Authors: Jiayao Chen,Shi Liu,Linyi Yang
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Open-ended scientific discovery asks agents to move beyond executing analyses for predefined questions. Across multiple rounds of exploration, a discovery agent must decide which phenomena warrant investigation while avoiding overinterpretation, where emerging claims exceed the evidential scope of the analyses supporting them. This creates an evidence-calibration problem: the exploration trajectory must be coupled with claim status so that evidence can guide both what to investigate next and what can be claimed. We introduce StatefulDiscovery, a discovery framework that externalizes investigation state and uses it to coordinate frontier selection, evidence acquisition, and claim adjudication. We evaluate StatefulDiscovery across 40 real-data discovery tasks. Compared with several baselines, StatefulDiscovery produces more claims overall judged to be both well-supported and high-value. Ablations indicate that structured hypotheses, local adjudication, and frontier control contribute to performance. Together, these results suggest that explicit discovery state can couple exploration with evidence-calibrated claim formation.

[AI-46] Towards Data-free and Training-free Compression for Speech Foundation Models Using Parameter Clustering INTERSPEECH2026

Link: https://arxiv.org/abs/2606.11836
Authors: Haoning Xu,Zhaoqing Li,Huimeng Wang,Youjun Chen,Chengxi Deng,Mengzhe Geng,Xunying Liu
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted by Interspeech 2026

Click to view abstract

Abstract:This paper presents a novel data-free and training-free compression approach for speech foundation models using channelwise clustering via k-means. More fine-grained, mixed sparsity pruning by layer-level varying number of parameter clusters is also explored. Experiments conducted on the LibriSpeech dataset suggest that when operating with pruning sparsity of 50% on HuBERT-large, consistent WER reductions of 27.73%/18.61% absolute (34.37%/21.91% relative) over the magnitude-based pruning were obtained on the test-clean and test-other subsets before fine-tuning and 0.19%/0.79% absolute (3.36%/4.62% relative) after fine-tuning with only 3 epochs. Similar WER reductions of 2.86%/5.02% absolute (59.21%/55.29% relative) were observed against magnitude-based pruning on Whisper-large-v3 at 10% sparsity, all with no significant WER increase relative to the uncompressed baseline.
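As a rough illustration of the general idea of parameter clustering (not the authors' implementation, whose details the abstract does not give), the sketch below runs a plain 1-D k-means over the weights of one channel and replaces each weight with its cluster centroid, so that at most k distinct values need to be stored. The function name `cluster_channel`, the evenly spread initialization, and the toy weights are all assumptions made for illustration.

```python
def cluster_channel(weights, k, iters=20):
    """Quantize a list of floats to at most k shared values with 1-D k-means.

    Hypothetical helper: assumes 1 <= k <= len(weights).
    """
    ordered = sorted(weights)
    # Deterministic initialization (an assumption): spread the initial
    # centroids evenly over the sorted weight values.
    if k == 1:
        centroids = [sum(ordered) / len(ordered)]
    else:
        centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]

    def nearest(w):
        # Index of the centroid closest to weight w.
        return min(range(k), key=lambda c: abs(w - centroids[c]))

    for _ in range(iters):
        # Assignment step: attach every weight to its nearest centroid.
        assign = [nearest(w) for w in weights]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [w for w, a in zip(weights, assign) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)

    # Replace every weight with the centroid of its cluster.
    return [centroids[nearest(w)] for w in weights], centroids


# Toy weights for a single channel (made-up numbers, not from the paper).
channel = [0.10, 0.12, 0.11, -0.50, -0.52, 0.90]
quantized, centroids = cluster_channel(channel, k=3)
print(sorted(set(quantized)))
```

No data and no gradient steps are involved in this procedure, which is the sense in which clustering of this kind can be data-free and training-free; how the paper varies the number of clusters per layer is not shown here.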

[AI-47] From Uniform to Learned Graph Priors: Diffusion for Structure Discovery KDD2026

Link: https://arxiv.org/abs/2606.11831
Authors: Qi Shao,Hao Guo,Jiawen Chen,Duxin Chen,Wenwu Yu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 3 figures, Accepted by KDD 2026

Click to view abstract

Abstract:Neural relational inference (NRI) methods discover interaction graphs from trajectories through variational reasoning on discrete potential edges. However, these methods typically rely on oversimplified, factorized graph priors. Such priors, typically nearing uniform distributions, treat edges as independent entities. This systemic misalignment does not match the real-world systems and yields diffuse and indecisive edge posteriors limiting the reliability of structural discovery. To address this, we propose Diff-prior, a diffusion-parameterized adaptive prior used to calibrate latent graph distribution rather than generate graphs. Our core insight is to reframe prior integration as a learnable denoising-style calibration that organizes scattered, uncertain edge posteriors into a more reliable overall structure which can be trained by the diffusion model. Diff-prior learns an adaptive structure prior that performs structured calibration on the edge posteriors during inference, guiding it towards a distribution closer to the underlying structure. The diff-prior operates before structural sampling and acts as a denoising calibrator directly on the encoder edge distribution, which provides a generic training paradigm over structured variables. Experiments on standard benchmarks validated our framework, and the results indicate that Diff-prior improves the performance of structure inference and generates more decisive edge posteriors across multiple NRI-family architectures. The code is available on this https URL.

[AI-48] Skill-Augmented AI Agents for Medical Research Analysis: An Exploratory Multi-Model Human Evaluation in an NSCLC Transcriptomic Biomarker Task

Link: https://arxiv.org/abs/2606.11830
Authors: Qianyu Yao,Fei Sun,Bocheng Huang,Wei Chen,Jiarui Jiang,Shu Quan,Yifei Chen,Wenjie Xu,Bo li,Liping Su,Ruoqiong Wu,Huhai Hong,Huimei Wang
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Background. Large language models and AI agents are increasingly used to support biomedical research, but native model outputs may omit key analytical steps, misuse methods, or overstate conclusions. We evaluated whether autonomous access to a medical research skill package was associated with higher-quality AI-generated transcriptomic research-analysis outputs compared with native AI without skills. Methods. We conducted an exploratory multi-model human evaluation using a non-small cell lung cancer immunotherapy biomarker task. Six model backbones were tested. The evaluation included 21 anonymized outputs: 9 native-AI outputs and 12 skill-augmented outputs generated through an AI agent implementation represented by OpenClaw. Four non-expert biomedical reviewers and two blinded experts evaluated each output, with two ratings from each reviewer type. The primary outcome was expert-rated overall quality. Results. Skill-augmented outputs showed directionally higher expert overall quality than native-AI outputs (mean 5.50 vs 5.11; difference=0.39; bootstrap 95% CI, -0.04 to 0.90; Welch p=0.156). Non-expert reviewer quality showed the same direction (mean 4.72 vs 4.47; difference=0.26; bootstrap 95% CI, -0.25 to 0.80; Welch p=0.373). Expert agreement was limited (single-rating ICC=-0.15), and model-specific effects were descriptive and heterogeneous. Conclusions. Autonomous skill access showed a directional quality signal in this exploratory sample, but the signal was smaller than expert-rating noise and should not be interpreted as confirmatory evidence. The findings primarily motivate larger evaluations of skill-augmented AI agents with stronger reliability controls, platform replication, and biological-validity assessment.

[AI-49] Feature-Aligned Speech Watermarking for Robustness to Reconstruction Distortions ICME2026

Link: https://arxiv.org/abs/2606.11828
Authors: Haiyun Li,Shuhai Peng,Zhisheng Zhang,Jingran Xie,Xiaofeng Xie,Hanyang Peng,Zhiyong Wu
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Multimedia (cs.MM)
Comments: Accepted by ICME2026

Click to view abstract

Abstract:Audio watermarking aims to embed identifiable information into audio while remaining imperceptible. Existing methods adopt high-fidelity, low-energy designs to preserve perceptual quality, but the resulting watermarks lack robustness under suppression by speech reconstruction models. Improving robustness is challenging due to the inherent robustness-fidelity trade-off in existing designs, where increasing watermark energy improves robustness but reduces fidelity. To address this problem, we propose a feature-aligned watermarking method that aligns the watermark with the original speech feature distribution, allowing higher watermark energy to improve robustness while preserving imperceptibility. We use a pretrained speech codec to generate a pseudo-speech watermark and fuse it into the spectrogram of the input audio, with VAD loss and perceptual losses guiding embedding within voiced regions. Experiments show that our method maintains imperceptibility comparable to existing approaches while substantially improving robustness under both seen and unseen speech reconstruction models.

[AI-50] Toward Trustworthy AI: Multi-Target Adversarial Attacks and Robust Defenses for Continuous Data Summarization

Link: https://arxiv.org/abs/2606.11804
Authors: Yuefang Lian,Longkun Guo,Zhongrui Zhao,Zhigang Lu,Yanan Cai,Shuchao Pang,Dachuan Xu,Jason Xue
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Submitted to IEEE Transactions on Information Forensics and Security (IEEE TIFS)

Click to view abstract

Abstract:Trustworthy AI requires reliable data-processing pipelines, not only robust downstream predictive models. As an upstream component, data summarization determines which information is retained and passed to subsequent learning or decision modules. Therefore, adversarial perturbations to the summarization process can compromise trustworthy AI in an upstream manner: they may alter the selected summary, reduce its representativeness, and further degrade the utility of subsequent learning tasks. In this paper, we study adversarial attacks on continuous data summarization under similarity-level perturbations through DR-submodular optimization. We show that a class of multi-resolution image summarization objectives can be formulated as multilinear extensions of non-negative submodular set functions and satisfy DR-submodularity with m-weak monotonicity. We then formulate multi-target attack generation as a min-max problem, where one admissible perturbation of the similarity structure is optimized to degrade multiple target summarization models. To mitigate such perturbations, we formulate robust defense against mixed attack types as a regularized max-min problem. For both problems, we develop approximation algorithms with theoretical guarantees. Experiments on real-data and controlled clustered benchmarks show that the proposed attack is effective in representative low-to-moderate budget regimes and can induce downstream task-performance loss. The proposed defense improves the robustness–mitigation trade-off in structured settings, while also revealing the parameter sensitivity of robust protection on real data.

[AI-51] Multimodal Ordinal Modeling of Alzheimer’s Disease Severity Using Structural MRI and Clinical Data

Link: https://arxiv.org/abs/2606.11794
Authors: Boris-Stephan Rauchmann,Jonathan Laib,Buse Ercik,Robert Perneczky,Sergio Altares-López
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 18 pages. Submitted to journal for review

Click to view abstract

Abstract:Neurodegenerative diseases such as Alzheimer’s disease (AD) require accurate and scalable tools for assessing disease severity, yet current clinical staging remains time-intensive and prone to variability. We propose an attention-enhanced multimodal machine learning framework with ordinal regression for automated and interpretable AD severity staging. The framework integrates T1-weighted MRI with demographic and genetic variables and compares unimodal and multimodal architectures using ordinal and non-ordinal prediction heads. Models were trained and validated using cohort-stratified splits derived from the ADNI, AIBL, and NIFD datasets. A strictly held-out test set was constructed using subjects excluded from all training, validation, preprocessing, and hyperparameter tuning procedures, with subject-level splitting employed throughout to prevent data leakage. Among unimodal approaches, the T1-weighted MRI model achieved slightly higher adjacent-stage accuracy (0.963) and agreement with clinical staging (QWK 0.444) than the tabular model (QWK 0.433). Integrating imaging, demographic, and genetic information improved overall performance. The multimodal non-ordinal baseline achieved the lowest prediction error (MAE 0.340), whereas the ordinal multimodal model achieved the highest adjacent-stage accuracy (0.970) and strongest agreement with clinical staging (QWK 0.549). These findings indicate that ordinal formulations better capture the ordered structure of the CDR scale and yield predictions more consistent with clinical staging. Explainability analyses using Grad CAM++ and SHAP demonstrated anatomically and clinically plausible model behavior, supporting transparent decision-making. Overall, attention-based multimodal learning with ordinal regression represents a robust, interpretable, and scalable approach for automated AD severity staging and AI-assisted clinical decision support.
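The three metrics quoted in this abstract (MAE, adjacent-stage accuracy, and quadratic weighted kappa, QWK) can be sketched in a few lines. The code below is a minimal sketch using the standard textbook definitions, assuming severity stages have been mapped to integer indices 0..n_classes-1; that mapping, the function names, and the toy labels are assumptions, not taken from the paper.

```python
def mae(y_true, y_pred):
    """Mean absolute error between integer stage indices."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def adjacent_accuracy(y_true, y_pred):
    """Share of predictions that are at most one stage away from the truth."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)


def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """QWK = 1 - (weighted observed disagreement) / (weighted chance disagreement).

    Assumes at least two distinct stages occur, otherwise the
    denominator is zero.
    """
    n = len(y_true)
    # Confusion matrix: observed[t][p] counts true stage t predicted as p.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    row = [sum(observed[i]) for i in range(n_classes)]
    col = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]

    num = 0.0
    den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            # Quadratic penalty grows with the squared stage distance.
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            num += w * observed[i][j]
            den += w * row[i] * col[j] / n  # expected count under independence
    return 1.0 - num / den


# Toy labels (made up): four ordered stages, two off-by-one errors.
y_true = [0, 1, 2, 3, 2, 1]
y_pred = [0, 1, 2, 2, 2, 0]
print(mae(y_true, y_pred), adjacent_accuracy(y_true, y_pred))
print(quadratic_weighted_kappa(y_true, y_pred, n_classes=4))
```

Because the quadratic weights punish distant errors more than near misses, a model can have a slightly worse MAE and still reach a higher QWK, which is consistent with the pattern the abstract reports for the ordinal and non-ordinal multimodal models.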

[AI-52] AI4Land: Scalable Deep Learning for Global High-Resolution Land Use Reconstruction

Link: https://arxiv.org/abs/2606.11793
Authors: Amirpasha Mozaffari,Marina Castaño,Stefano Materia,Etienne Tourigny,Oscar Molina-Sedano,Jordi Varela-Agrelo,Dario Garcia-Gasulla,Miguel Castrillo Melguizo,Mario Acosta,Amanda Duarte
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments:

Click to view abstract

Abstract:Uncertainty in the terrestrial carbon cycle remains a major constraint in climate projections, partly driven by the uncertainties affecting the land surface representation and variability in Earth system models. To address this limitation, we present a data-driven framework AI4Land, for generating high-resolution historical reconstructions and future projections of key land surface variables. The framework follows a two-phase approach using a U-Net architecture. In the first phase, which is the focus of this work, it reconstructs annual land use and land cover by integrating coarse-resolution scenario data with static geophysical features. In a planned second phase, the resulting high-resolution maps will be used to predict dynamic biophysical variables, particularly leaf area index, at finer temporal scales. Trained on Earth observation data, the models learn to reproduce spatially explicit and physically consistent land surface patterns, extending temporal coverage to periods lacking direct observations. AI4Land was developed and trained on MareNostrum5, demonstrating how GPU-accelerated HPC infrastructure enables global-scale climate AI pipelines. The final product is a suite of open-source emulators designed for real-time coupling with digital twin platforms, such as those developed under the Destination Earth initiative. By delivering realistic and evolving land surface conditions on demand, this work aims to reduce critical uncertainties and improve the predictive power of next-generation climate simulations.

[AI-53] SVoT: State-aware Visualization-of-Thought for Spatial Reasoning via Reinforcement Learning

Link: https://arxiv.org/abs/2606.11770
Authors: Chao Lei,Yanbei Jiang,Markus Hiller,Zhijian Zhou,Xunye Tian,Krista A. Ehinger,Nir Lipovetzky
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Spatial reasoning remains a challenge for Multimodal Large Language Models (MLLMs), as it requires reliable multi-hop inference over both intermediate states and state transitions. Current studies often leave intermediate states unverified and treat state transitions as implicit processes, which limits reliability in multi-hop spatial reasoning. To address this, we propose State-aware Visualization-of-Thought (SVoT), a reinforcement learning framework that generates interleaved, verifiable intermediate states and visualizations. SVoT integrates transition reasoning chains into the generation processes, enabling the model to verify action preconditions and effects through interleaved textual and visual reasoning. We train SVoT via Group Relative Policy Optimization (GRPO), instantiating verification through reward design and evaluating the efficacy of different fine-grained rewards. As existing benchmarks reduce state transitions to single-variable updates, substantially simplifying the problems, we establish five domains by extending classical environments and introducing two novel domains, Pacman and Gather, that require multi-object interactions and numerical reasoning. These domains support systematic evaluation of multi-hop spatial reasoning with quantitative verification of generated intermediate states and transition reasoning. SVoT with transition-aware supervision achieves state-of-the-art performance across the introduced domains, yielding up to a 65% absolute accuracy gain on out-of-distribution test sets.

[AI-54] When Do Data-Driven Systems Exhibit the Capability to Infer?

Link: https://arxiv.org/abs/2606.11769
Authors: Maximilian Poretschkin,Tabea Naeven
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The European AI Act is the first comprehensive regulation of artificial intelligence (AI), setting out extensive obligations, particularly for so-called high-risk and general-purpose AI systems. A key distinguishing feature of AI systems under the AI Act is the capability to infer. Since the AI Act does not clearly define what inference is, there is a gray area for certain data-driven systems. A specific example is credit scoring systems, which are listed by Annex III of the AI Act. At the same time, however, these are often implemented using statistical models for which it is unclear whether they have the capability to infer and thus fall under the AI definition of the AI Act at all. Motivated by statistical learning theory, this work develops a framework for grading different levels of the capability to infer. Based on the AI Act and the Commission Guidelines on the definition of an artificial intelligence system, we analyze which levels constitute sufficient capability to infer within the meaning of the AI Act and where further regulatory clarity is needed. We illustrate the framework by creating two realistic credit scoring workflows and show whether and where inference occurs in them. Our analysis illustrates that not only individual models but the entire data processing workflow must be considered. It also shows that the involvement of human experts during development can have significant influence on the capability to infer. Code can be found at this https URL.

[AI-55] Blind Dexterous Grasping via Real2Sim2Real Tactile Policy Learning

Link: https://arxiv.org/abs/2606.11767
Authors: Shengcheng Luo,Xiyan Huang,Zhe Xu,Wanlin Li,Ziyuan Jiao,Chenxi Xiao
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 23 pages, 6 figures

Click to view abstract

Abstract:Blind grasping with a dexterous hand is a crucial manipulation capability. Nevertheless, learning such tactile-only policies for real robots remains challenging due to the tactile sim-to-real gap and the limited expressiveness of sparse tactile signals. To bridge this gap, we propose a framework for tactile-only blind grasping that is deployable on a physical multi-fingered robotic hand. Our approach combines three key components. First, we introduce a Real2Sim tactile calibration pipeline that constructs a contact-calibrated digital-twin simulator capable of reproducing real tactile signals. Second, we improve the expressiveness of sparse tactile observations using a layout-aware tactile encoder, which incorporates sensor-geometry priors through self-supervised pretraining. Third, to improve generalization to unseen objects, we train object-specific reinforcement-learning experts in the calibrated simulator and aggregate their successful grasp trajectories into a tactile-conditioned Diffusion Policy. We evaluate our method on a physical LEAP Hand equipped with distributed tactile sensing across 10 seen and 10 unseen objects. The deployed policy achieves a 27% real-world grasp success rate across all 20 objects, without real-world grasping demonstrations or visual input. Simulation ablations show that layout-aware tactile pretraining improves grasping performance, while sensing-level evaluations confirm that Real2Sim calibration increases the consistency of tactile contact events between simulation and hardware. Together, these results suggest that contact-event calibration, geometry-aware tactile representation learning, and diffusion-based policy aggregation provide an effective path toward tactile-only blind grasping on real dexterous robotic hands. Project page: this http URL.

[AI-56] Mind the Perspective: Let’s Reason Recursively for Theory of Mind

Link: https://arxiv.org/abs/2606.11724
Authors: Chao Lei,Guang Hu,Meng Yang,Yanbei Jiang,Nir Lipovetzky
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Theory of Mind (ToM) reasoning requires inferring agents’ beliefs from partial and asymmetric observations, which remains an open challenge for LLMs. Existing prompting-based approaches improve ToM reasoning through observable-event filtering or temporal belief chains, without explicitly modeling nested beliefs. We introduce RecToM, an inference-time framework for ToM reasoning that models nested beliefs via recursive perspective construction. RecToM constructs each character perspective from the preceding character perspective along the character chain specified by the question, reducing higher-order belief questions to actual-world questions within the final constructed perspective. We further provide a KD45 analysis showing that RecToM’s perspective construction induces a well-formed belief modality beyond simple event filtering. Experiments on ToM benchmarks, including Hi-ToM, Big-ToM, and FanToM, across multiple LLM backbones show that RecToM consistently outperforms recent advanced approaches, achieving state-of-the-art performance. Notably, RecToM reaches 100% accuracy on Hi-ToM with GPT-5.4 and Qwen3.5, a benchmark requiring higher-order ToM reasoning.

[AI-57] 2S: A Rehearsal-Based Approach for Extraction-Resistant Model Watermarking

Link: https://arxiv.org/abs/2606.11698
Authors: Jian-Ping Mei,Weibin Zhang,Ao Yao,Tiantian Zhu,Jie Xiao
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Model watermarking safeguards AI model intellectual property by embedding distinctive knowledge that induces unique behavioral signatures. The primary technical challenge lies in ensuring watermark robustness against various post-processing attacks on the watermarked model. Model extraction attacks emerge as the most severe threat, where adversaries exploit prediction outputs to train surrogate models that illegally replicate the original model’s functionality. In this work, we propose a rehearsal-based watermark embedding framework to enhance the robustness of model watermarks against model extraction attacks. By simulating the extraction process, our method leverages the loss of a simulated stolen model on a trigger set as a training signal to fine-tune the watermark knowledge within the target model. This fine-tuning step encourages the watermark to be embedded in a way that boosts transferability, thereby increasing its chances of persisting and remaining detectable in stolen models. Comprehensive experiments conducted under diverse settings demonstrate that the proposed method significantly improves the robustness of model watermarks against both model extraction and subsequent watermark removal attacks.

[AI-58] Noise-Aware Framework for Correcting Corrupted Labels

Link: https://arxiv.org/abs/2606.11695
Authors: Ha-Linh Nguyen,Hong-Anh Nguyen,Minh-Duc La,Phong Lam,Thu-Trang Nguyen,Son Nguyen,Hieu Dinh Vo
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:High-quality labeled data is essential for training reliable ML/DL models. However, real-world datasets often contain a considerable proportion of corrupted labels, which can severely degrade model performance. To address this problem, we propose CANOLA, a novel framework for correcting corrupted labels through noise-aware learning and iterative label refinement. CANOLA explicitly estimates the underlying noise distribution of the dataset and incorporates this information into the training of a noise-aware Deep Neural Network. By incorporating noise characteristics during learning, CANOLA enables the model to down-weight unreliable supervision signals and focus on trustworthy patterns, thereby improving robustness and generalization. Label correction is performed via cautious, iterative soft label refinement, in which model predictions are blended with observed labels to prevent premature or erroneous updates. This progressive refinement allows the dataset to be repaired in a stable and controlled manner. We evaluate CANOLA on six widely used datasets under realistic noisy labeling scenarios. Experimental results show that CANOLA consistently outperforms SOTA label correction methods, achieving relative improvements ranging from 19% to 52% in error reduction. Moreover, models trained on datasets corrected by CANOLA obtain substantial downstream performance gains. Even simple classifiers trained on CANOLA’s corrected data can outperform complex model-centric approaches by margins of up to 67%.
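The abstract describes label correction as a cautious, iterative blend of model predictions with observed labels but does not give the update rule. A generic way to sketch that idea is a convex combination with a small step size, shown below; the function `refine_soft_labels`, the step size `alpha`, and the toy numbers are assumptions for illustration and may differ from what CANOLA actually does.

```python
def refine_soft_labels(soft_labels, predictions, alpha=0.2):
    """Move each soft label a small step toward the model's prediction.

    soft_labels and predictions are lists of probability vectors.
    A small alpha keeps the update cautious: one confident but wrong
    prediction cannot overwrite an observed label in a single round.
    """
    refined = []
    for label, pred in zip(soft_labels, predictions):
        blended = [(1 - alpha) * l + alpha * p for l, p in zip(label, pred)]
        total = sum(blended)
        refined.append([b / total for b in blended])  # renormalize to sum to 1
    return refined


# Toy example (made-up numbers): the observed label says class 0,
# while the model consistently predicts class 1 with probability 0.9.
labels = [[1.0, 0.0]]
preds = [[0.1, 0.9]]
history = [labels]
for _ in range(4):
    history.append(refine_soft_labels(history[-1], preds))
for round_index, state in enumerate(history):
    print(round_index, state[0])
```

With these toy values the label only flips to class 1 after several consistent rounds, which illustrates the kind of protection against premature or erroneous updates that the abstract refers to.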

[AI-59] Lung-R1: A Knowledge Graph-Guided LLM for Pulmonary Diagnostic Reasoning

Link: https://arxiv.org/abs/2606.11675
Authors: Haoyang Zeng,Yuanxi Fu,Rongzhen Li,Yuming Yang,Xiao Sun,Jingwang Huang,Gujie Shao,Guohui Xiang,Quan Lu,Dongfan Ye,Xuetao Chen,Jiang Zhong,Kaiwen Wei,Zhi Xu
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Diagnosing pulmonary diseases requires integrating heterogeneous evidence amid phenotypic variability and cross-disease overlap. Although large language models (LLMs) have shown progress on pulmonary knowledge question answering (QA) and information-processing tasks, reliable pulmonary diagnosis requires patient-specific, relation-aware reasoning over electronic medical record (EMR) evidence rather than isolated knowledge recall. We define this gap between pulmonary knowledge and case-level diagnostic reasoning as the Pulmonary Knowledge-to-Diagnosis Gap. To address it, we introduce LungKG, the first structured pulmonary knowledge graph for diagnostic knowledge organization and record-grounded reasoning. LungKG contains 59,038 nodes and 164,308 edges across 15 entity types and 112 relation types, serving as both a reusable pulmonary knowledge resource and the foundation for LungKG-guided model adaptation. Built on LungKG, we propose Lung-R1, a LungKG-guided pulmonary LLM trained through KG-constrained reasoning-chain construction and KG-guided reinforcement learning. In a 20-system evaluation, Lung-R1-14B achieves state-of-the-art performance across Choice, Pulmonary-QA, and EMR Diagnosis, reaching an EMR Diagnosis score of 4.3583 and surpassing the strongest non-Lung-R1 baseline by 0.1476 points. These results demonstrate the value of LungKG-guided training for EMR-based pulmonary diagnosis.

[AI-60] Can Open-Source LLM Agents Replace Static Application Security Testing Tools? An Empirical Assessment

Link: https://arxiv.org/abs/2606.11672
Authors: Derek Yohn,Luke Flancher,Mirajul Islam,Khaled Slhoub
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Keywords: Agentic AI, Cybersecurity, Large Language Models, Static Application Security Testing, Model performance evaluation

Click to view abstract

Abstract:This paper explores the value of agentic AI tools for cybersecurity purposes. We evaluate the efficacy of a general-purpose GenAI Large Language Model- (GenAI-) based agent when powered by three different Ollama-hosted general-purpose open source models. We assess each agent’s performance using precision, recall, false positive count, and a calculated composite score based upon the interplay of the captured metrics, against the baseline performance of an existing, vetted Static Application Security Testing (SAST) tool, Bandit. Our findings refute the notion that a modern open-source GenAI LLM-based agent is currently suitable for the specialized task of SAST scanning under realistic conditions.

[AI-61] Runtime Skill Audit: Targeted Runtime Probing for Agent Skill Security

Link: https://arxiv.org/abs/2606.11671
Authors: Tu Lan,Chaowei Xiao
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Agent skills let LLM agents reuse instructions, resources, tools, and workflows, but they also create a new place for malicious behavior to hide. A skill may look benign in its documentation or code while becoming harmful only when it is invoked with particular user requests, local assets, persistent state, or multi-step tool interactions. This makes purely static vetting brittle. We present Runtime Skill Audit (RSA), a dynamic analysis method that audits skills by asking what the skill-mediated agent actually does under targeted runtime conditions. Instead of testing every skill with the same generic tasks, RSA profiles risk-relevant interfaces, prepares the execution context needed to exercise them, and assigns security labels from the resulting trace evidence. We instantiate RSA on OpenClaw and evaluate it on 100 skills against representative static baselines. RSA achieves 90.0% accuracy with an 88.0% true positive rate and an 8.0% false positive rate, improving accuracy by 13.0 percentage points over the best static baseline. Under self-evolving attacks, static detectors collapse after one or two rounds, while RSA continues to detect 19–20 out of 20 malicious skills across rounds.
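The three headline rates can be cross-checked against each other. Accuracy is a mix of the true positive rate and the true negative rate, accuracy = p * TPR + (1 - p) * (1 - FPR), where p is the share of malicious skills, so p can be solved for. The sketch below does this with exact fractions. It assumes the three reported rates are exact and all refer to the same 100 skills, which the abstract does not state explicitly; the function name `implied_confusion` is hypothetical.

```python
from fractions import Fraction


def implied_confusion(n, accuracy, tpr, fpr):
    """Recover confusion-matrix counts implied by accuracy, TPR and FPR.

    Uses accuracy = p * TPR + (1 - p) * TNR with TNR = 1 - FPR,
    solved for the positive share p. Requires TPR != TNR.
    """
    tnr = 1 - fpr
    positive_share = (accuracy - tnr) / (tpr - tnr)
    positives = positive_share * n
    negatives = n - positives
    tp = tpr * positives
    fp = fpr * negatives
    return {
        "positives": positives,
        "negatives": negatives,
        "tp": tp,
        "fn": positives - tp,
        "fp": fp,
        "tn": negatives - fp,
    }


# Rates as reported in the abstract: 90.0% accuracy, 88.0% TPR, 8.0% FPR.
counts = implied_confusion(
    n=100,
    accuracy=Fraction(90, 100),
    tpr=Fraction(88, 100),
    fpr=Fraction(8, 100),
)
print({name: int(value) for name, value in counts.items()})
```

Under those assumptions the figures are mutually consistent with an evenly split evaluation set, 50 malicious and 50 benign skills, with 44 malicious skills detected and 4 benign skills flagged.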

[AI-62] TreeSeeker: Tree-Structured Trial Error and Return in Deep Search

Link: https://arxiv.org/abs/2606.11662
Authors: Zhuofan Shi,Mingzhe Ma,Lu Wang,Fangkai Yang,Pu Zhao,Yiming Guan,Youling Huang,Wei Zhang,Qingwei Lin,Dongmei Zhang,Saravan Rajmohan
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Deep search requires agents to answer complex questions through multi-step web search, browsing, evidence comparison, and synthesis. A central challenge is deciding how to search when several directions look plausible but only some will later lead to reliable evidence. If an agent greedily follows the current best-looking direction, it may keep extending a weak continuation. If it explores without discipline, it may waste budget on disconnected trials. We propose TreeSeeker, an inference-time framework for controlled trial-and-error in deep search. TreeSeeker organizes search as branch-and-return search over tree-structured states, where each branch is a tentative direction for a sub-goal. At each round, TreeSearch reads all sub-goal trees, identifies active goals, and uses textual UCB signals of value, uncertainty, and risk to select among exploiting a promising branch, exploring an uncertain alternative, or pruning an unproductive continuation and returning to an earlier branch point. TreeMem supports this control loop by keeping evidence, uncertainty, conflicts, progress, and failure cues attached to the branches that produced them, so trial outcomes can guide later decisions. Experiments on XBench-DeepSearch, BrowseComp, and BrowseComp-ZH show that TreeSeeker consistently outperforms strong open-source baselines, suggesting that explicit branch-and-return control complements stronger reasoning and tool execution.
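For readers unfamiliar with UCB, the sketch below shows the classical numeric UCB1 rule, which trades off a branch's average value against how rarely it has been tried. This is a stand-in for intuition only: the abstract says TreeSeeker uses textual UCB signals of value, uncertainty, and risk, and its actual scoring and pruning logic is not reproduced here. The function name, the exploration constant, and the branch statistics are all assumptions.

```python
import math


def ucb1_scores(stats, c=math.sqrt(2)):
    """Classical UCB1 score per branch.

    stats maps a branch name to (value_sum, visits).
    score = mean value + c * sqrt(ln(total visits) / visits);
    unvisited branches get infinity so each is tried at least once.
    """
    total = sum(visits for _, visits in stats.values())
    scores = {}
    for name, (value_sum, visits) in stats.items():
        if visits == 0:
            scores[name] = math.inf
        else:
            mean = value_sum / visits                      # exploitation term
            bonus = c * math.sqrt(math.log(total) / visits)  # exploration term
            scores[name] = mean + bonus
    return scores


# Made-up branch statistics: (sum of observed values, number of visits).
branches = {
    "well_explored": (4.0, 5),   # mean 0.8, tried often
    "tried_once": (0.9, 1),      # mean 0.9, still uncertain
    "untried": (0.0, 0),         # never expanded
}
scores = ucb1_scores(branches)
print(max(scores, key=scores.get))
```

The exploration bonus shrinks as a branch is visited more often, so a rarely tried branch can outrank a well-explored one with a similar mean. This is the kind of exploit-versus-explore decision the abstract describes, although pruning and returning to an earlier branch point are not modeled in this sketch.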

[AI-63] Sparse probes and murky physics: a case study of interpretability challenges in a foundation model for continuum dynamics

Link: https://arxiv.org/abs/2606.11657
Authors: Katherine Rosenfeld,Maike Sonnewald
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 5 figures

Click to view abstract

Abstract:Generative AI emulators are increasingly used in scientific domains where we already have strong theory, benchmarks, and physical intuition. This raises a central evaluation and interpretability question: when a foundation-style model can reproduce known continuum dynamics, what internal mechanism supports that behavior, is the internal behaviour consistent with known physics, and how does it relate to where the emulator succeeds or fails? We investigate a cross-domain foundation model for continuum dynamics, Walrus by Polymathic, using mechanistic interpretability guided by physical principles. We apply a sparse autoencoder (SAE) to probe a selected layer, and address the practical challenge of triaging a large feature set (over 20,000) using enstrophy as a physically grounded metric. As a deliberately simple testbed, we focus on shear flow and compare feature recruitment across multiple shear-flow setups, i.e. parameter values in the numerical simulation. Across setups we find evidence of piecewise consistency, with subsets of features recurring in similar roles, but this structure is intermittent and does not map cleanly onto standard physical decompositions. In parallel, direct comparisons between numerical simulation and the emulator reveal systematic output-level discrepancies, including regimes where energy/structures become too diffuse or too localized. We connect parts of these discrepancies to changes in specific SAE feature usage. Our work highlights open questions for scientific foundation models: how to robustly prioritize mechanistically meaningful features, how to separate stable structure from analysis artifacts (including single-layer and SAE limitations), and how to use established benchmarks to decide when “different” internal representations are genuinely informative rather than merely effective.

[AI-64] TAROT: Task-Adaptive Refinement of LLM-prior Graphs for Few-shot Tabular Learning

链接: https://arxiv.org/abs/2606.11640
作者: Ruxue Shi,Yili Wang,Mengnan Du,Hangting Ye,Yi Chang,Xin Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Few-shot tabular learning provides a cost-effective approach for real-world applications where annotation is costly and collecting sufficient samples for new tasks is difficult. Existing traditional and LLM-based methods have demonstrated effectiveness in few-shot scenarios. However, traditional methods need additional training on unlabeled or generated data, which incurs significant computational overhead. In addition, LLM-based methods that directly feed raw tabular data into LLMs raise privacy and compliance concerns. More importantly, both paradigms largely overlook the semantic relationships between features, which provide a structural and semantic prior for constructing a semantic graph. Such a semantic graph is essential for modeling meaningful feature interactions in few-shot scenarios. In this paper, we propose TAROT, a GNN-based framework that encodes the structural and semantic prior by constructing and refining a task-adaptive semantic graph from this prior, thereby improving predictive performance in few-shot tabular learning. TAROT first encodes heterogeneous tabular data into unified node semantic representations via a Unified Semantic Tabular Node Encoder (USTNE). Then, it prompts LLMs to infer the semantic relationship between features based on the task description and feature names to construct a semantic graph. To mitigate structural noise introduced by the hallucination of LLMs, TAROT introduces Task-adaptive Semantic Graph Refinement that prunes spurious or task-unrelated edges and adds missing task-related ones, aligning the graph structure with the downstream objective. Finally, a GNN performs message passing over the refined graph to capture task-related semantic dependencies for prediction. Extensive experiments on various few-shot tabular learning benchmarks demonstrate the superior performance of TAROT, establishing it as a state-of-the-art approach in this domain.

[AI-65] TouchThinker: Scaling Tactile Commonsense Reasoning to the Open World with Large-scale Data and Action-aware Representation

Link: https://arxiv.org/abs/2606.11637
Authors: Kailin Lyu, Di Wu, Pengwei Zhang, Yuhang Zheng, Yingxin Lai, Long Xiao, Kangyi Wu, Pengna Li, Chen Gao, Lianyu Hu, Xiaobin Hu, Jie Hao, Ce Hao, Weihao Yuan, Shuicheng Yan
Subjects: Artificial Intelligence (cs.AI)
Comments: 18 pages, 11 figures

Abstract:Touch is a key modality for embodied agents to understand the physical world. Although recent work has incorporated tactile signals into language systems for tactile commonsense reasoning, scaling such systems to realistic open-world settings remains challenging due to two key bottlenecks: (1) current tactile reasoning datasets remain limited in format and scale, providing insufficient supervision for reasoning from tactile observations to physical commonsense and hindering the learning of transferable tactile commonsense; (2) tactile signals are inherently redundant and action-specific, yet existing methods often overlook these properties, resulting in inefficient representations with limited semantic expressiveness. To address these limitations, we propose TouchThinker, a tactile-language framework that scales tactile commonsense reasoning to the open world from both data and representation perspectives. First, we construct TouchThinker-1M, a million-scale, multi-source tactile reasoning dataset covering 415 objects, 8 scenarios, and 7 sensor types, providing a solid data foundation for open-world generalization. We further introduce TouchThinker-Bench, an open-world benchmark with more realistic and diverse tasks. Then, we propose an action-aware modeling mechanism to improve tactile representation efficiency and enable efficient reasoning. Experimental results demonstrate that TouchThinker achieves competitive performance against state-of-the-art models across multiple datasets. Our code and dataset will be made available at: this https URL.

[AI-66] Are LLMs Bad at Moral Reasoning?

Link: https://arxiv.org/abs/2606.11635
Authors: Menghang Zhu, Seth Lazar
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:For highly capable AI systems to operate safely in dynamic, open-ended environments, they must be able to identify, understand, and respond to moral reasons for action, and constrain their behaviour accordingly. A growing body of research aims to evaluate this capacity – moral competence – in today’s most capable AI systems, recently reaching broadly pessimistic conclusions. One of the most ambitious such papers collects gold-standard human-authored rubrics for evaluating moral reasoning in 1,000 cases, and benchmarks frontier AI models against those rubrics, with underwhelming results. In this paper, we argue that the MoReBench dataset can be redeployed to give a much more optimistic picture of LLMs’ moral reasoning (an essential part of moral competence). We show that if, instead of scoring LLMs’ responses to these cases against these rubrics, we give the LLMs the same task given to humans – to generate scoring rubrics for the moral analysis of particular cases – the rubrics they generate are better calibrated to the human rubrics than their open-ended responses and, where they differ, plausibly reflect nothing more than the vast dimensionality of most moral problems, while also highlighting some human departures from the “rubric for creating rubrics”. Taking these points into consideration, the MoReBench dataset suggests that LLMs are significantly more capable at moral reasoning than was previously believed.

[AI-67] Architecture-Aware Reinforcement Learning Makes Sliding-Window Attention Competitive in Math Reasoning

Link: https://arxiv.org/abs/2606.11634
Authors: Kai Liu, Peijie Dong, Xinchen Xie, Jianfei Gao, Qipeng Guo, Xiaowen Chu, Shaoting Zhang, Kai Chen
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:The rapid progress of reasoning and agentic large language models (LLMs) has increased the demand for long-context inference, but self-attention (SA) scales quadratically with context length. To address this, we study SWARR (Sliding-Window Attention with Reinforced Adaptation for Math Reasoning), a practical recipe for adapting SWA models to mathematical reasoning. SWARR has two stages: (1) efficient conversion from a pretrained SA model to SWA with supervised fine-tuning (SFT), which avoids pretraining a new base model, and (2) policy adaptation with reinforcement learning (RL). We find that SWA still underperforms SA after SFT, and we hypothesize that this gap is caused in part by a data-architecture mismatch: most SFT data are prepared for SA models and may contain long-range dependencies that are difficult for SWA to model. Because on-policy RL optimizes self-generated trajectories under the SWA constraint, it can adapt trajectories to better match SWA. Experiments on mathematical reasoning benchmarks show that this recipe substantially narrows the gap between SWA and SA, recovering much of the accuracy lost during SWA conversion while preserving the efficiency benefits of linear-complexity attention. Our central contribution is the empirical finding that RL changes the conclusion one would draw from conversion and SFT alone about SWA’s viability for math reasoning.
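
The contrast between full self-attention and sliding-window attention in the abstract above can be illustrated with a small mask construction. This is only a sketch of the generic causal sliding-window pattern, not the SWARR implementation; the function names and the `window` parameter are hypothetical and chosen for illustration.

```python
def sliding_window_causal_mask(seq_len, window):
    """Return a seq_len x seq_len boolean mask.

    mask[i][j] is True when query position i may attend to key position j:
    j must not be in the future (j <= i) and must lie within the last
    `window` positions, counting position i itself.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def attended_pairs(mask):
    """Count the (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)


# A window as long as the sequence reduces to ordinary causal attention.
full = sliding_window_causal_mask(8, window=8)
# A window of 3 lets each token see itself and at most two predecessors.
local = sliding_window_causal_mask(8, window=3)

print(attended_pairs(full), attended_pairs(local))
```

With a fixed window, the number of allowed pairs grows linearly in sequence length rather than quadratically, which is the efficiency argument the abstract refers to. The same mask also shows why dependencies longer than the window are hard to model directly.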

[AI-68] LUCID: Learning Embodiment-Agnostic Intent Models from Unstructured Human Videos for Scalable Dexterous Robot Skill Acquisition

Link: https://arxiv.org/abs/2606.11628
Authors: Harsh Gupta, Guanya Shi, Wenzhen Yuan
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:The most widely-adopted robot learning pipelines today learn skills from robot demonstrations or structured human data, which are expensive to collect and tied to specific embodiments. In contrast, unstructured human videos provide a scalable alternative. They contain diverse manipulation demonstrations across objects, scenes, and strategies, but are not directly connected to robot action. We propose LUCID, a two-stage framework that learns task intent from unstructured human videos drawn from internet-scale datasets and learns robot control in massively-parallel simulation. The intent model predicts short-horizon intent (what should happen next in the scene) from the current observation in closed loop. An embodiment-specific sensorimotor policy converts this intent into robot actions. The intent interface is shared across controllers, so the same intent model can be applied to different embodiments, from our primary dexterous hand to a parallel-jaw gripper. We evaluate LUCID on five real-world manipulation tasks: stirring, wiping, and binning supervised by only internet video, with zero-shot transfer to novel scenes and object instances; and push-T and cable routing supervised by 1 hr each of self-collected smartphone video. Project page: this https URL.

[AI-69] When Context Returns: Toward Robust Internalization in On-Policy Distillation

Link: https://arxiv.org/abs/2606.11627
Authors: Xun Wang, Ruishuo Chen, Zhuoran Li, Yu Chen, Longbo Huang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent work has shown that on-policy distillation can internalize privileged context, such as system prompts or task hints, into a student model so that the context is no longer needed at inference time. Although this approach successfully improves the student’s no-context performance, we identify an interesting and previously unstudied phenomenon: in many settings, reintroducing the original privileged context to the distilled student actually degrades its performance, even on instances it already solves correctly without context. We term this context-induced degradation and argue that robust internalization demands not only matching the teacher’s context-conditioned behavior, but also remaining stable when the context is reintroduced, a property we call context removability. Motivated by this observation, we propose a lightweight consistency regularizer that first anchors the student’s no-context output via stop-gradient, then penalizes the context-conditioned output for deviating from it via forward KL divergence. This simple addition requires only one extra forward pass per training step, yet it effectively mitigates context-induced degradation and, in many cases, even improves no-context performance. Across 12 configurations spanning diverse domains and model families, our method improves context-conditioned accuracy in the majority of settings, reduces context-induced harm in 11 out of 12 settings, and effectively eliminates response-length inflation. A mechanistic case study further confirms that context removability is achieved at the representation level, with hidden states remaining nearly identical regardless of whether the context is present.
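
The consistency regularizer described above (anchor the no-context output, then penalize the context-conditioned output with a forward KL term) might look roughly like the following for a single next-token distribution. This is a sketch under stated assumptions: the direction KL(anchor || conditioned) is one reading of "forward KL", the stop-gradient is only indicated in a comment because plain Python has no autodiff, and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward_kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def consistency_penalty(no_context_logits, with_context_logits):
    # The no-context distribution acts as the anchor. In an autodiff
    # framework it would be wrapped in a stop-gradient so that only the
    # context-conditioned branch receives gradient from this term.
    anchor = softmax(no_context_logits)
    conditioned = softmax(with_context_logits)
    return forward_kl(anchor, conditioned)


same = consistency_penalty([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
shifted = consistency_penalty([2.0, 0.5, -1.0], [0.0, 2.0, -1.0])
print(same, shifted)
```

The penalty is zero when reintroducing the context leaves the distribution unchanged and grows as the two distributions diverge. Computing the context-conditioned logits is the one extra forward pass per training step that the abstract mentions.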

[AI-70] Physics-Distilled Neural Network enabled by Large Language Models for Manufacturing Process-Property Predictive Modeling

Link: https://arxiv.org/abs/2606.11605
Authors: Ge Song, Kiarash Naghavi Khanghah, Anandkumar Patel, Rajiv Malhotra, Hongyi Xu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under review, Journal of Computing and Information Science in Engineering

Abstract:Predicting process-property relationships in manufacturing is often challenged by high experimental costs and the limited interpretability of complex ‘black-box’ models. This paper proposes a novel knowledge distillation framework designed to achieve high-accuracy predictions in data-scarce scenarios. The framework integrates analytical physics priors, which are systematically extracted from scientific literature via Large Language Models, into a privileged teacher model. We employ a Graph-Masked Attention layer to capture the complex physical dependencies among input variables showing strict setpoints or a combination of static and high-frequency temporal signatures. This privileged knowledge is distilled into a lightweight student predictor for inference. The feasibility and robustness of the framework are evaluated through a comprehensive experiment across five diverse manufacturing processes. To ensure statistical reliability, given the small dataset sizes, a repeated K-fold cross-validation technique is employed to quantify model stability and generalization. Results indicate that the proposed framework consistently achieves high predictive accuracy across all evaluated domains. Most importantly, the architecture demonstrates significant fault tolerance by maintaining robust predictive performance even in scenarios where LLM-derived analytical priors are suboptimal or incomplete. Furthermore, the student predictor achieves an inference frequency exceeding 6000 Hz, which facilitates real-time edge deployment on standard industrial hardware. This work provides a scalable solution for bridging the gap between theoretical physics and real-time industrial monitoring in data-limited environments.

[AI-71] Model-Based and Data-Driven Hierarchical Control and Topology Co-Design for Robust Networked Systems

Link: https://arxiv.org/abs/2606.11596
Authors: Shirantha Welikala, Zihao Song, Hai Lin, Panos J. Antsaklis
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments: To be submitted to Automatica

Abstract:In this paper, we consider a class of networked systems comprising an interconnected set of linear subsystems, disturbance inputs, and performance outputs. Using dissipativity theory, we first propose a model-based hierarchical control design strategy to ensure the closed-loop networked system is dissipative from its disturbance inputs to performance outputs. This involves designing local controllers for each subsystem to enforce local dissipativity guarantees, which are then exploited to co-design distributed global controllers and the interconnection topology to enforce global dissipativity guarantees while optimizing interconnection topology costs. The overall design process requires only solving a sequence of linear matrix inequality (LMI) problems, thereby retaining compositionality and decentralizability while avoiding non-convex, iterative design processes that are inefficient and centralized. This model-based hierarchical control design strategy assumes the knowledge of the subsystem dynamics, which may not hold in many real-world networked systems. Motivated by this, we also propose a data-driven hierarchical control design strategy that assumes only the availability of rich input-state-output trajectory data from the subsystems. The proposed data-driven design process assumes that the unknown disturbances affecting the subsystem dynamics are bounded by a quadratic matrix inequality (relaxing conventional bounds) and accounts for this by using the matrix S-lemma. Finally, the effectiveness of the proposed model-based and data-driven hierarchical control designs is illustrated for a networked system representing a DC microgrid, with the aim of enforcing robust (dissipative) voltage regulation and current sharing.

[AI-72] ConsistencyPlanner: Real-time Planning with Fast-Sampling Consistency Models

Link: https://arxiv.org/abs/2606.11569
Authors: Qichao Zhang, Xing Fang, Jiaqi Fang, Zhenwen Cai, Jie Ling, Qiankun Yu, Dongbin Zhao
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Closed-loop planning in complex, real-world driving scenarios presents a critical challenge for autonomous driving systems. While traditional rule-based methods are interpretable, their predefined heuristics lack the adaptability for dynamic traffic environments. Learning-based approaches, despite their promise, struggle to balance modeling diverse, multimodal driving behaviors with real-time planning, often leading to indecisive or unsafe actions. To address this limitation, we propose Consistency Planner, a real-time planning framework with fast-sampling consistency models. Our approach is built upon two key technical contributions. Efficient Multimodal Sampling: We employ fast-sampling consistency models to generate a diverse set of plausible future trajectories. This enables efficient, real-time exploration of multimodal actions, overcoming the computational bottlenecks of previous iterative generative methods. Heterogeneous Feature Fusion: We introduce an attention-enhanced decoder that dynamically integrates heterogeneous input features (including scene features and action tokens) into a cohesive representation for robust planning. Extensive evaluation in the Waymax simulator demonstrates superior performance in safety metrics compared to existing methods, with particularly strong results in challenging dynamic scenarios.

[AI-73] LLM sGraphs: Toward Graph-Native Synergistic AI Systems PAKDD2066

Link: https://arxiv.org/abs/2606.11560
Authors: Arijit Khan, Longxu Sun, Xin Huang
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: 10 pages, Accepted at PAKDD 2066 Tutorial

Abstract:Large Language Models (LLMs) have advanced rapidly, but their limitations in structured and multi-hop reasoning underscore the need for graph-native, synergistic artificial intelligence (AI) systems. Graph-structured data underpins critical applications across social, biological, financial, transportation, web, and knowledge domains, making it essential to understand how LLMs can leverage graph computation for grounded, context-rich inference. Three complementary synergies are emerging: LLMs augmented with graph computation for retrieval and reasoning; bidirectional integration between LLMs and knowledge graphs (KGs), where LLMs support KG construction and curation while KGs enforce semantic constraints and factual consistency; and AI agents strengthened by graph algorithms for planning, decision making, and multi-step reasoning. In parallel, LLMs introduce new capabilities for graph data management and graph machine learning (ML) through natural language interfaces and hybrid LLM-graph neural network (GNN) pipelines. This tutorial synthesizes the algorithms, systems, and design principles driving these converging directions, offering data science and data mining researchers a unified perspective on integrating LLMs, graph data management, graph mining, graph ML, and agentic computation into next-generation graph-native AI systems.

[AI-74] HERO: Hindsight-Enhanced Reflection from Environment Observations for Agentic Self-Distillation

Link: https://arxiv.org/abs/2606.11559
Authors: Haoran Liu, Yuwei Zhang, Xiyao Li, Bohan Lyu, Jingbo Shang
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement learning typically improves multi-turn agent capabilities through the terminal outcome of the trajectories, which makes it difficult to determine credit assignment for each intermediate turn. Recent on-policy self-distillation methods offer a promising alternative by converting privileged feedback into dense token-level supervision through a self-teacher. Our study is motivated by the unexpected performance degradation observed when naively extending this paradigm to multi-turn settings, which we attribute to a lack of alignment between privileged feedback, such as successful trajectories or terminal outcomes, and the student’s current decision context. We introduce HERO, a hindsight-enhanced self-distillation framework that uses next environment observations as locally aligned feedback. After each rollout, HERO reflects on the completed interaction to convert each observation into a compact turn-level diagnosis that captures actionable feedback about the original action, such as its necessity, validity, or failure cause. On TauBench and WebShop, HERO improves task success and reduces unnecessary turns over environment-feedback-only self-distillation and GRPO. It is especially effective under limited training turn budgets, where successful rollouts are rare and GRPO provides weak reward-contrast signals.

[AI-75] Privacy-Preserving Federated Autoencoder for ECG Anomaly Detection on Edge Devices

Link: https://arxiv.org/abs/2606.11556
Authors: Kaan Arda Akyol, Jakub Kacper Szeląg, Aydin Abadi, Maha Alghamdi, Ghadah Albalawi, Ghouse Ibrahim Kaleelullah, Hilal Tutus, Sarah Al Subaiei, Shardul Kapse, Syed Mohammed Raheeb, Mujeeb Ahmed, Rehmat Ullah
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 4 figures, 6 tables. Preprint prepared in IEEE conference format. Submitted to: FLTA 2026

Abstract:Continuous electrocardiography (ECG) monitoring could surface rhythm abnormalities before they escalate into cardiovascular events. However, a deployable system must satisfy three requirements simultaneously: legal-grade privacy (GDPR, HIPAA), real-time inference on constrained edge hardware, and detection quality under non-IID cross-hospital data. We design and evaluate an end-to-end federated system addressing all three for unsupervised 12-lead ECG anomaly detection on the PTB-XL dataset, combining three autoencoder families (VanillaAE, ConvAE, VAE), Flower-based federated averaging (FedAvg) across ten simulated hospitals, client-side differentially private SGD (DP-SGD) with a Rényi-DP accountant, and 8-bit integer (INT8) post-training quantization with Raspberry Pi 4 benchmarking. Our main contributions are: an empirical characterization of how these mechanisms compose, practical DP-specific recommendations, and technical and security insights for a clinically sensitive setting. Federated learning matches or exceeds the centralized baseline across all architectures (ConvAE federated area under the ROC curve, AUROC, of 0.782), and an ε sweep identifies ε=4 as the recommended clinical operating point. INT8 quantization roughly halves model size and cuts Pi 4 latency by up to 44% with 0.12% AUROC loss. Crucially, DP and quantization penalties are empirically independent, so practitioners need not trade a strong privacy guarantee for a compact edge footprint. To our knowledge, this is the first system combining federated learning, formal (ε,δ)-DP, unsupervised reconstruction-based detection, and quantized AArch64 deployment.
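
The federated averaging (FedAvg) step named in the abstract above is the standard sample-count-weighted mean of client parameters. The sketch below shows only that aggregation on flat parameter lists; it does not use the Flower API, applies no DP-SGD noise or quantization, and the function name and toy numbers are hypothetical.

```python
def fedavg(client_weights, client_sizes):
    """Average per-client parameter vectors, weighted by local sample counts.

    client_weights: list of equal-length lists of floats, one per client.
    client_sizes:   number of local training samples held by each client.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[k] * n for w, n in zip(client_weights, client_sizes)) / total
        for k in range(dim)
    ]


# Two toy "hospitals": the second holds three times as much data,
# so its parameters dominate the global model.
global_weights = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(global_weights)
```

Weighting by sample count matters under non-IID data because hospitals with few recordings would otherwise pull the global model as hard as large ones.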

[AI-76] SkillJuror: Measuring How Agent Skill Organization Changes Runtime Behavior

Link: https://arxiv.org/abs/2606.11543
Authors: Zhiyu Chen, Zihan Guo, Bo Huang, Bingwei Lu, Jianghao Lin, Yuanjian Zhou, Weinan Zhang
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Abstract:Agent Skills augment large language model (LLM) agents with procedural knowledge at inference time, but current benchmarks rarely distinguish what a Skill says from how it is organized. We study this distinction through Progressive Disclosure, where a concise root file points agents to supporting resources on demand, and compare it with a normalized flat baseline. We present SkillJuror, a framework for evaluating Skill writing paradigms through semantically controlled variants, matched multi-trial evaluations, and trajectory evidence while holding task knowledge fixed. In an 82-task SkillsBench study, Progressive Disclosure changes runtime behavior before aggregate outcomes: distinct Skill resources touched per trajectory rise from 1.18 to 3.85, and effective uptake events rise from 1.33 to 3.92. It also yields 17 additional verifier-passing trials out of 410 matched trials (+4.1%) over the normalized flat baseline. The benefit is task-dependent. Progressive Disclosure helps when supporting resources guide implementation, checking, or repair, but is weaker when success hinges on exact output conventions, numerical thresholds, or long artifact-generation pipelines. These results show that Skill organization is not mere presentation: it can change how agents search and apply procedural knowledge, while outcome gains depend on whether the exposed resources are actionable for the task. Code is available at this https URL.

[AI-77] MoCA-Agent: A Market-of-Claims Code Agent for Financial and Numerical Reasoning

Link: https://arxiv.org/abs/2606.11537
Authors: Abdelrahman Abdallah, AbdelRahim A. Elmadany, Sameh Al Natour, Hasan Cavusoglu, Adam Jatowt, Muhammad Abdul-Mageed
Subjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments:

Abstract:Financial and tabular question answering requires more than fluent reasoning: answers must be grounded in the exact facts, formulas, units, signs, and scales that support them. A single misread cell or incorrect operation can silently produce a plausible but wrong result. We introduce MoCA-Agent, a market-of-claims code agent that replaces free-form multi-agent debate with claim-level verification. The system decomposes each question into typed atomic claims, asks specialist trader agents to buy or sell those claims, clears their orders into confidence-weighted accept/reject decisions, and synthesizes an executable Python program from market-supported evidence. A code-aware verifier then checks the program for execution, structural consistency, and common financial reasoning errors, with at most one market-aware repair round. Across ten public benchmarks spanning financial numerical reasoning, general tabular reasoning, ESG question answering, and multimodal chart reasoning, MoCA-Agent achieves strong performance using a fixed Qwen3.6-27B backbone, including 78.3% on FinQA, 76.0% on FinanceMath, 71.2% on MultiHiertt, 86.9% on ESGenius, and 85.6% average on FinChart-Bench. These results show that aggregating evidence at the level of atomic claims, rather than whole answers, improves robustness in high-stakes numerical reasoning. The code and data are available at this https URL.

[AI-78] AI Researchers Must Help Lead Arms Control to Mitigate Military AI Risks ICML2026

Link: https://arxiv.org/abs/2606.11533
Authors: Ted Fujimoto, Jacob Benz
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: 9 pages, 1 figure, ICML 2026 Position Paper

Abstract:The advancement of AI capabilities compels researchers and the public to be more aware of its potential worldwide impact. A pressing near-term concern is the regulation of military AI applications. Armament manufacturers and defense contractors are increasingly investing in AI capabilities and forging partnerships with AI companies, creating a burgeoning coalition that demands military leaders, arms control diplomacy experts, and AI researchers collaborate to ensure a safer future. While AI researchers often focus on the long-term implications of superintelligent AI, this approach may not adequately address the immediate challenges posed by AI in military applications. Success requires acknowledging and mitigating the emerging risks of frontier AI models that are slated for integration into defense applications, such as military AI systems. Arms control has reduced past catastrophic risks, so lessons learned from nuclear deterrence can guide AI safety and security research towards innovations in verification and diplomacy. AI researchers, however, must assist in leading the technical research that clearly defines and alleviates instability in military settings. Given these new responsibilities and the lack of sufficiently reliable solutions, we argue that AI researchers must take a leading role in advancing arms control research to minimize risk in military AI applications.

[AI-79] Search Discipline for Long-Horizon Research Agents

Link: https://arxiv.org/abs/2606.11522
Authors: Adithya Srinivasan, Devesh Paragiri
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 1 figure

Abstract:Autoresearch agents now propose, evaluate, and select scientific candidates against a metric, and that metric is usually an aggregate reduced over a heterogeneous space of regions, slices, or cohorts. We show that when scientific validity lives in that disaggregated structure, the aggregate can rank the wrong candidate first. The headline number improves while the structure underneath inverts, so a decision made on the number accepts a candidate that quietly breaks the model. The failure is not domain-specific. It appears wherever a candidate’s validity is multi-dimensional but its verifier is a single reduction. We demonstrate the inversion on a fire-model task in the Ecosystem Demography model. The highest-scoring candidate and a slightly lower one are within noise of each other on global score, yet the top-scoring one collapses the protected boreal regions while the other preserves them. What separates them is the per-region behavior, not the headline number. This decision should not be left to the agent that produced the candidates. The agent optimizing the score is the last party likely to catch the score being wrong, and a prompt has no remaining turn once the agent has stopped. We move the decision to an external control loop that audits each candidate on its disaggregated behavior and acts after the agent has decided. It can demote a candidate the agent would have accepted, and it can reopen a run the agent had declared finished. Our contribution is the inversion finding itself, and a search-discipline protocol that decides on reviewable candidate-effect evidence instead of the score.
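
The inversion described above, where an aggregate score ranks first a candidate that breaks a protected region, can be reproduced with a few made-up numbers. Everything in this sketch is hypothetical: the region names, weights, scores, the `max_drop` threshold, and the audit rule are illustrative assumptions, not the paper's protocol or data.

```python
def aggregate(scores, weights):
    """Weighted mean of per-region scores (the single 'headline number')."""
    total = sum(weights.values())
    return sum(scores[r] * weights[r] for r in scores) / total


def audit(scores, baseline, protected, max_drop):
    """Return the protected regions whose score fell by more than max_drop."""
    return [r for r in protected if baseline[r] - scores[r] > max_drop]


def select(candidates, weights, baseline, protected, max_drop):
    """Pick the best-aggregate candidate among those that pass the audit."""
    passing = {
        name: s for name, s in candidates.items()
        if not audit(s, baseline, protected, max_drop)
    }
    if not passing:
        return None
    return max(passing, key=lambda name: aggregate(passing[name], weights))


weights = {"tropics": 5, "temperate": 3, "boreal": 2}
baseline = {"tropics": 0.60, "temperate": 0.60, "boreal": 0.60}
candidates = {
    "A": {"tropics": 0.92, "temperate": 0.70, "boreal": 0.10},  # boreal collapses
    "B": {"tropics": 0.72, "temperate": 0.70, "boreal": 0.58},  # boreal preserved
}

by_score = max(candidates, key=lambda n: aggregate(candidates[n], weights))
by_audit = select(candidates, weights, baseline, ["boreal"], max_drop=0.1)
print(by_score, by_audit)
```

Selecting on the aggregate alone picks candidate A, while the external audit demotes it and accepts B. Here the audit runs outside the scoring function, mirroring the abstract's point that the check should not be left to the party optimizing the score.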

[AI-80] SirenFNO: Efficient and Full Frequency Learning of Fourier Neural Operators IJCAI2026

Link: https://arxiv.org/abs/2606.11518
Authors: Pengqing Shi, Jie Yin, Stephen Tierney, Junbin Gao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, accepted by IJCAI 2026

Abstract:Fourier neural operators (FNOs) are effective and efficient surrogates for approximating solutions of PDEs and generalize across discretizations. However, owing to the reliance on frequency truncation to maintain learning efficiency of FNOs, empirical studies suggest that FNOs exhibit spectral bias toward low-frequency information, which may hinder the learning capability especially for certain PDEs with strong high-frequency oscillations. To address this limitation, we propose SirenFNO, a novel framework that leverages sinusoidal representation networks (SIRENs) to learn implicit neural representations and performs mode-wise kernel parameterization. Our SIREN parameterization learns a full-grid spectrum with a constant and discretization-independent parameter count, thereby eliminating the need for frequency truncation. We further extend SirenFNO with functional tensor decompositions to enhance parameter and learning efficiency. Empirical results show that our SirenFNO consistently outperforms FNO with approximately 4 to 15 times parameter reductions with preserved discretization invariance, and our functional decomposition variants obtain performance improvements with a maximum of 73 times fewer parameters across multiple PDE benchmarks.

[AI-81] CRUMB: Efficient Prior Fitted Network Inference via Distributionally Matched Context Batching

Link: https://arxiv.org/abs/2606.11473
Authors: Jamie Heredge,Mattia J. Villani,Pranav Deshpande,Akshay Seshadri,Niraj Kumar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 26 pages, 13 figures

Abstract:Prior-fitted networks (PFNs) are a promising class of tabular foundation models that perform in-context learning, whereby the entire labelled training set is supplied as context, and predictions for test queries are produced in a single forward pass. However, the quadratically scaling self-attention mechanism in many PFN architectures makes inference prohibitive for very large training datasets. We propose CRUMB (Clustered Retrieval Using Minimised-MMD Batching), a three-stage inference wrapper that (i) clusters the test queries, (ii) selects a small, distributionally matched training subset for each cluster by greedily minimising the maximum mean discrepancy (MMD), and (iii) runs exact PFN inference on each reduced-context batch. CRUMB is architecture-agnostic and requires no retraining. On the 51-dataset TabArena benchmark, evaluated across three PFN architectures (TabPFNv2, TabICLv1, TabICLv2), we show that CRUMB outperforms similar state-of-the-art context selection strategies. We also show that CRUMB is resilient to covariate drift, as the MMD-minimisation step naturally helps align the training context distribution to match the current test batch distributions.
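
Stage (ii) of the abstract, greedy selection of a training subset that minimises MMD to a query cluster, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the RBF kernel, the biased MMD estimator, the `gamma` bandwidth, and the function names `rbf_kernel`, `mmd2`, and `greedy_mmd_subset` are all assumptions made for the example.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Pairwise RBF kernel k(x, y) = exp(-gamma * ||x - y||^2) (assumed kernel choice).
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2(x, y, gamma=1.0):
    # Biased (V-statistic) estimate of the squared MMD between samples x and y.
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())

def greedy_mmd_subset(train_x, query_x, k, gamma=1.0):
    # Hypothetical helper: at each step, add the training point whose inclusion
    # gives the lowest squared MMD between the chosen subset and the query cluster.
    chosen = []
    remaining = list(range(len(train_x)))
    for _ in range(k):
        best = min(remaining,
                   key=lambda i: mmd2(train_x[chosen + [i]], query_x, gamma))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

The returned indices would then form the reduced context passed to the PFN for that query cluster. The naive loop recomputes the kernel at every step, which is fine for a sketch but would need incremental updates at scale.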

[AI-82] LSTM-Based Detection of Structural Breaks in Property Insurance Loss Reserving: A Climate-Informed Approach

Link: https://arxiv.org/abs/2606.11463
Authors: Thomas Mbrice,Shashwat Panigrahi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 0 figures, whitepaper YC

Abstract:Accurate loss reserving is foundational to insurer solvency, yet accelerating climate-driven catastrophes systematically violate the stability assumptions on which traditional actuarial methods depend. This white paper presents a research program testing whether Long Short-Term Memory (LSTM) neural networks can detect and adapt to these structural breaks faster and more accurately than the Chain-Ladder, Bornhuetter-Ferguson, and Cape Cod methods. Using more than 15 years of regulatory development triangle data from Florida and Louisiana, enriched with NOAA hurricane intensity indices and sea surface temperatures, we hypothesize a targeted improvement of 15-20% in reserve accuracy for catastrophe-exposed years, a threshold grounded both in the prior neural network reserving literature and in the formal convergence results developed here. Beyond empirical validation, we develop a theoretical framework grounding LSTM structural break detection in probabilistic terms, providing formal performance guarantees that compensate for the limited number of catastrophe events in the test period. We document the research design, methodology, expected contributions, and a candid assessment of limitations.

[AI-83] Forecasting Future Behavior as a Learning Task

Link: https://arxiv.org/abs/2606.11445
Authors: Mosh Levy,Yoav Goldberg,Asa Cooper Stickland
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Trust in an AI system is often anchored by explanations of how it works, which one then uses to forecast its behavior on new inputs. For large reasoning models (LRMs), this conventional route is particularly difficult to follow: explanation methods for single token generations do not naturally generalize to long trajectories, and the trajectories themselves are often not faithful when read as natural language. We propose an alternative that bypasses the explanation step: treat behavior forecasting as a learnable task and train Behavior Forecasters that operate on a single reasoning trajectory to make the same forecasts one would typically seek from an explanation. The forecaster’s training data is obtained by querying the LRM with no human annotation, and its inference is done in a single forward pass. We instantiate this approach on two tasks: how likely the LRM is to repeat its answer on re-runs, and how removing parts of the input changes its answer. We evaluate this approach on both tasks across three diverse reasoning datasets and find that trained Behavior Forecasters are more accurate than GPT-5.4 and Claude Opus-4.6 reading the same trajectories as naive readers, at a small fraction of their inference cost. We find that fine-tuning the backbone end-to-end and initializing it from the target LRM are each necessary for strong performance. These results show that the reasoning trajectory carries information about the LRM’s future behavior that goes beyond what naive reading conveys.

[AI-84] INFRAMIND: Infrastructure-Aware Multi-Agent Orchestration

Link: https://arxiv.org/abs/2606.11440
Authors: Ahasan Kabir,Jiaqi Xue,Mengxin Zheng,Qian Lou
Subjects: Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:Existing multi-agent LLM orchestration methods, ranging from brute-force ensembles to learned routers, select models and topologies based on task and model features. However, these methods do not consider the runtime state of the serving infrastructure. On shared GPU clusters under concurrent load, this infrastructure blindness causes systematic resource underutilization: preferred models accumulate deep request queues while equally capable alternatives sit idle. In multi-agent pipelines, where each query triggers multiple sequential model calls, these delays then compound across every downstream step. Closing this gap is challenging because the relevant infrastructure signals (queue depths, KV-cache pressure, latencies) are dynamic and noisy, and they must drive three different decisions: planning, per-step routing, and scheduling. We introduce INFRAMIND, a framework that makes the entire multi-agent stack infrastructure-aware. An infra-aware planner conditions topology and role selection on real-time system load and remaining budget, biasing toward simpler graphs under congestion and richer ones at low load. An infra-aware executor then observes per-model queue depths, cache utilization, and response latencies at each agent step to decide which model to call and how deeply to reason; a budget-aware scheduler further reorders each model’s queue so that urgent requests are served first. Cast as a hierarchical constrained MDP and solved end-to-end via reinforcement learning, the system learns to balance quality against latency automatically. Across five benchmarks, INFRAMIND delivers up to +7.6 pp accuracy over the prior baseline at low load with up to 7x lower latency, and sustains up to 99.9% SLO compliance under high load where every baseline drops below 50%.

[AI-85] The Power of Test-Time Training for Approximate Sampling

Link: https://arxiv.org/abs/2606.11437
Authors: Noah Golowich,Ankur Moitra,Dhruv Rohatgi
Subjects: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Efficiently sampling from a complex probability distribution is a fundamental problem which has become increasingly pertinent in recent years with the rise of generative AI, as sophisticated sampling procedures from LLMs have been proposed to solve challenging reasoning problems. The efficacy of such sampling algorithms is limited, however, by the relationship between the LLM and the particular sampling task at hand, which has motivated the framework of test-time training (TTT). TTT works by updating a model’s weights in response to partial generations and reward feedback received at inference time, thus adapting to the particular problem. In this work, we propose a formalization for TTT as the problem of producing a sample from a given probability measure $\mu^\star$ belonging to a known class $F$ of distributions, given an oracle $\hat{\mu}$ which yields approximate density estimates for $\mu^\star$. This is closely related to the problem of reducing sampling to approximate counting studied in seminal works of Jerrum, Valiant, and Vazirani (1986) and Jerrum and Sinclair (1989): namely, when $F$ is the class of all distributions, it coincides exactly with the aforementioned counting-to-sampling reduction. In this paper, we first show a quadratic lower bound on the query complexity of sampling from $\mu^\star$ given query access to $\hat{\mu}$ (for sufficiently large classes $F$), thus showing that the random walk approach proposed by Jerrum and Sinclair (1989), and refined by Hayes and Sinclair (2010), is optimal. This answers an open question posed by Hayes and Sinclair. We then show that this lower bound can be circumvented if the size of $F$ is bounded appropriately. As we discuss, this latter result can be viewed as an abstraction of TTT, and thus represents a starting point for the development of a principled theoretical framework for TTT.

[AI-86] Towards a Bridge Layer Between Bibliographic and Formalized Mathematical Knowledge

Link: https://arxiv.org/abs/2606.11430
Authors: A. Mayeux
Subjects: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments:

Abstract:Mathematical knowledge is split between bibliographic databases (e.g., MathSciNet, zbMATH Open) and formal proof libraries (e.g., Lean mathlib), preventing unified access between published results and their formalizations. We propose a relational bridge-database that aligns publication metadata with formal artifacts, providing an interoperability layer between mathematical literature and machine-verifiable proofs. We introduce a paper-level formalization score that measures how much of a publication is covered in formal systems. As a feasibility study, we show how such scores can be estimated via cross-document alignment between informal texts and Lean formalizations, enabling large-scale analysis of formalization coverage. This framework is a first step toward integrating bibliographic and formal mathematical ecosystems into scalable, machine-actionable knowledge graphs linking publications to formal proof objects.

[AI-87] JailbreakOPT: Tool-Assisted Iterative Jailbreak Prompt Optimization

Link: https://arxiv.org/abs/2606.11425
Authors: Ge Shi,Jun Yin,Donglin Xie,Fangyi Liu,Yucan Li,Menglin Liu
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Jailbreak attacks expose persistent safety weaknesses in large language models (LLMs), but existing stateless single-turn methods face a trade-off: hand-crafted prompts are expressive but static, while iterative prompt optimization can adapt but often relies on low-level mutations that require many target queries. We propose JailbreakOPT, a tool-assisted framework for improving iterative single-turn jailbreak prompt optimization. JailbreakOPT organizes diverse atomic jailbreak prompts into an attack tool library and composes them through a unified intra-episode optimization abstraction to generate stronger standalone attack prompts. To reuse experience across attack episodes, JailbreakOPT further frames tool selection as a contextual bandit problem and applies contextual Thompson sampling to guide exploration and exploitation based on past outcomes. Experiments across multiple target LLMs and attack goals show that JailbreakOPT improves attack success rate (ASR) while reducing the number of attacks until success (No.A) compared with atomic single-turn attacks and existing iterative optimization baselines. This paper may contain offensive or harmful content.

[AI-88] Signed Compression Progress on a Sealed Audit is Goodhart-Resistant

Link: https://arxiv.org/abs/2606.11417
Authors: Ayush Mittal,Dhruv Gupta
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 16 pages, 7 figures. Lean 4 (Mathlib) mechanized core and ARC-TGI experiment code: this https URL

Abstract:Compression progress is a long-standing proposal for intrinsic motivation: reward an agent when its world model becomes better at predicting or compressing experience. The folk claim is that this reward is “credible” because it is paid only for learning. We make this precise and prove it. If intrinsic reward is the signed decrease of a fixed sealed-audit loss, $r_t = E(\theta_{t-1}) - E(\theta_t)$, then cumulative reward telescopes exactly to endpoint audit improvement, so no policy can push reward up indefinitely while true audit performance stagnates or degrades. For finite audit panels the same result holds with a sharp false-positive budget: cumulative empirical reward is at most true audit improvement plus $2\Delta_n(F, \delta)$, the uniform audit deviation of the model class. This is horizon-free: adaptivity over time costs nothing once the sealed panel uniformly controls the class. The theorem also identifies the failure modes: the guarantee disappears if progress is clipped, scored on the agent’s own stream, exposed to a high-capacity model on a reusable panel, or applied to a neural class that makes $\Delta_n$ vacuous. We give a Lean 4 mechanization of the structural core (telescoping, the finite-audit bound, finite Gibbs, and the entropy floor) and an experiment suite on ARC-TGI grid-transformation generators with adaptive holdout attacks. Experiments confirm the theory: finite-audit deviation scales as $n^{-0.527}$; signed progress resists clip-farming, stream leakage, and noisy-TV curiosity; naive reusable audits are exploitable by black-box scalar feedback, while standard release defenses keep the attack below the $2\Delta_n$ threshold. Signed compression progress on a sealed audit is an accounting signal of genuine improvement.
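
The telescoping claim in the abstract, and the clipping failure mode it names, can be checked with a few lines of arithmetic. The sketch below is only an illustration: the loss sequence is invented, and the function names `signed_progress_rewards` and `clipped_progress_rewards` are hypothetical, not taken from the paper's code.

```python
def signed_progress_rewards(audit_losses):
    # Signed reward: audit loss before the update minus audit loss after it.
    return [prev - cur for prev, cur in zip(audit_losses, audit_losses[1:])]

def clipped_progress_rewards(audit_losses):
    # Clipped variant that pays only for decreases (the failure mode named above).
    return [max(0.0, r) for r in signed_progress_rewards(audit_losses)]

# Made-up audit losses for an agent that oscillates without net improvement.
losses = [1.0, 0.5, 1.0, 0.5, 1.0]

signed_total = sum(signed_progress_rewards(losses))    # telescopes to losses[0] - losses[-1]
clipped_total = sum(clipped_progress_rewards(losses))  # grows with every oscillation
print(signed_total, clipped_total)
```

With the signed reward, the oscillating agent earns exactly the endpoint improvement (zero here), whereas the clipped reward keeps paying for each temporary decrease, which is why clipping breaks the guarantee.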

[AI-89] MPC-Patch-Bench: Security-Aware LLM Code Patch for Multi-Party Computation

Link: https://arxiv.org/abs/2606.11416
Authors: Yukuan Zhang,Mengxin Zheng,Qian Lou
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: preprint

Abstract:Repository-level benchmarks for evaluating Large Language Model (LLM) code repair on Secure Multi-Party Computation (MPC) software do not yet exist, and directly transplanting general-purpose benchmarks such as SWE-bench fails on three structural fronts: (i) MPC repositories are dominated by generic Python infrastructure rather than cryptographic logic; (ii) high-value MPC fixes lack the standardized tests rigid extraction pipelines require; and (iii) standard fail-to-pass evaluation is insufficient for code that must also be cryptographically safe. MPC is increasingly deployed for privacy-preserving machine learning, biomedical collaboration, and secure analytics. Existing MPC-specific code-synthesis efforts cover only operator-level or single-framework tasks; evaluating LLM agents on real repository-level MPC repair instead demands MPC-aware data curation and a verifier matched to the security and numerical-fidelity guarantees MPC programs must obey, neither of which existing benchmarks provide. We introduce MPC-Patch-Bench, a repository-level benchmark organised around two frameworks. (1) The Data Curation Framework combines a domain-specific curation agent that filters raw pull requests through three cryptographic layers with a human-AI completion engine that synthesizes missing problem statements and Fail-to-Pass/Pass-to-Pass tests, yielding 205 fully verified instances. (2) The MPC Verifier provides dedicated security and numerical-fidelity checks via dynamic differential testing against plaintext oracles and MPC-specific static analysis rules that flag unsafe reveals, insecure arithmetic, and illegal public/private casts. The strongest evaluated LLM functionally resolves only 22.9% of MPC-Patch-Bench tasks; the MPC Verifier further reduces verified resolution to 17.1%, with up to 40% of functionally-passing patches rejected for cryptographic or numerical-fidelity violations.

[AI-90] Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models

Link: https://arxiv.org/abs/2606.11409
Authors: Malikeh Ehghaghi,Boglárka Ecsedi,Marsha Chechik,Colin Raffel
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:Adversarial robustness evaluations of large language models (LLMs) typically report attack success rate (ASR) under fixed query budgets, implicitly treating all attacks as equally costly. In practice, the computational expense of different attack strategies can vary by orders of magnitude. Consequently, ASR at a fixed budget can obscure the true effort required to jailbreak a model, thereby making it hard to determine whether an attack’s cost justifies its payoff to the attacker. We propose a compute-aware evaluation framework based on computational pressure, measured in cumulative floating-point operations (FLOPs), as a proxy for adversarial effort. We introduce risk-compute curves, which map compute budgets to attack risk, and derive two metrics that summarize the average pressure required for a given attack to succeed. Across ten models spanning three families and four different stages in language model training and alignment, evaluated with three attack strategies (gradient-based, iterative refinement, and template-based) on two jailbreak robustness benchmarks, we find: (1) alignment training has non-monotonic effects on compute-space robustness; (2) scaling model size reduces gradient-based attack effectiveness but has limited impact on cheaper template-based attacks; (3) gradient-based attacks optimized on a surrogate model can transfer to a separate target model, providing a way to reduce attacker costs; (4) compute cost varies by up to $\approx 5\times$ across harm categories within a single model; and (5) safety-aligned RL increases aggregate cost while leaving some categories disproportionately accessible. We release our framework to enable compute-aware risk assessment and evaluation.
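
A risk-compute curve of the kind described above can be sketched as an empirical curve over compute budgets. The abstract does not give the exact definition of attack risk, so this example assumes it is the fraction of attack attempts that succeed within a FLOP budget; the function name `risk_compute_curve` and the numbers are hypothetical.

```python
import math

def risk_compute_curve(flops_to_success, budgets):
    # For each budget, return the fraction of attacks whose cumulative FLOPs
    # at first success fit within that budget (assumed definition of risk).
    n = len(flops_to_success)
    return [sum(cost <= b for cost in flops_to_success) / n for b in budgets]

# Made-up per-prompt costs; math.inf marks an attack that never succeeded.
costs = [1e12, 5e12, math.inf, 2e13]
budgets = [1e12, 1e13, 1e14]
print(risk_compute_curve(costs, budgets))
```

Plotting risk against budget on a logarithmic compute axis would make cheap template-style attacks and expensive gradient-style attacks comparable on one scale, which fixed-query ASR does not.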

[AI-91] Steering Where to Listen: Instruction-Based Activation Steering Redirects Temporal Attention in Large Audio-Language Models

Link: https://arxiv.org/abs/2606.11400
Authors: Tsung-En Lin,Hung-Yi Lee
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Large Audio-Language Models (LALMs) excel at audio understanding but expose little about where in an audio signal they attend. We introduce instruction-based vector steering, which constructs a steering vector by contrasting activations from differently instructed prompts while keeping the audio fixed. Through a systematic probe of LALM attention, we find that, unlike standard prompting or audio-based steering, this intervention significantly redistributes the temporal attention allocated to audio tokens, concentrating it on acoustically relevant regions. We then show that this attention shift is behaviorally meaningful: in a controlled three-event setting, reading out the temporal position of maximal steering-induced attention change recovers the location of a queried sound event without any training, attaining 60.87% and 68.72% overlap with ground-truth intervals on Qwen2-Audio and Audio Flamingo 3, far above direct prompting (31.84%, 46.75%) and random baselines (27.74%). Our results characterize a mechanistic property of instruction-based steering in LALMs and provide a training-free probe for the latent temporal structure these models encode.

[AI-92] Automated Mediator for Human Negotiation: Pre-Mediation via a Structured LLM Pipeline

Link: https://arxiv.org/abs/2606.11379
Authors: Jamie Bergen,Sarit Kraus
Subjects: Artificial Intelligence (cs.AI)
Comments: 12 pages, 7 figures

Abstract:Pre-mediation, the preparatory phase preceding direct human negotiation, plays a critical role in achieving mutually beneficial agreements, yet is often omitted due to cost, time, and limited access to trained mediators. We introduce an automated mediator for human negotiation, implemented as a structured pipeline of LLM modules, that supports pre-mediation in integrative negotiation settings. The pipeline decomposes preparation into specialized modules for dialogue, preference prediction, response-level critique, and structured summarization, separating inference, generation, and evaluation to address limitations of monolithic single-prompt approaches. We use the term “agent” for each module following common LLM-systems terminology, but the components are not autonomous and do not interact peer-to-peer; outputs are passed forward in a fixed sequence. We evaluate the system in two controlled human-subject experiments comparing AI-based pre-mediation with professional human mediators in a multi-issue negotiation scenario. On short-term self-reported measures, the automated mediator achieves preparation outcomes broadly comparable to human mediators, including trust in the mediator and confidence in reaching mutually beneficial agreements, while achieving substantially lower error on the preference-inference task under our scenario and prompts (36% lower RMSE). A second study shows that targeted prompt refinements reduce excessive affirmation patterns from 36.6% to 16.8%, matching human mediator baselines. Our findings suggest that structured LLM pipelines can provide scalable, low-effort pre-mediation support broadly comparable to human mediators on short-term self-reported preparation outcomes. The pipeline’s single-party design mirrors how human mediators run pre-mediation today and enables parallel deployment across all parties to a dispute, supporting scalability.

[AI-93] TileFuse: A Fused Mixed-Precision Kernel Library for Efficient Quantized LLM Inference on AMD NPUs

Link: https://arxiv.org/abs/2606.11357
Authors: Wesley Pang,Gregory Hyegang Jun,Feiyang Liu,Deming Chen
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Performance (cs.PF)
Comments: 13 pages excluding reference, 11 figures

Abstract:With the growing demand for on-device LLM inference, edge SoCs increasingly integrate NPUs to improve performance and energy efficiency under tight power and thermal budgets. However, practical LLM deployment on current client NPUs remains difficult: widely used quantization formats such as AWQ do not map cleanly onto many existing NPU software stacks, which are often proprietary and expose limited low-level control. In this work, we present TileFuse, a close-to-metal mixed-precision kernel library for AMD XDNA2 NPUs that targets transformer linear layers in quantized LLM inference. TileFuse brings practical low-bit formats such as AWQ-style W4A16 and W8A16 directly onto XDNA2, rather than forcing the model to be reshaped around an NPU-specific quantization scheme. TileFuse co-designs weight layout, metadata placement, mixed-precision microkernels, and array-level dataflow. Specifically, it fuses unpacking, dequantization, and GEMM/GEMV execution into a single kernel flow, introduces an interleaved pre-tiling layout that supports GEMM dimensions up to 32K, and redesigns GEMV dataflow to utilize the full 4x8 AIE array. Across kernel-level evaluations, TileFuse improves performance by up to 121.6% for GEMM and 281% for GEMV over full-precision baselines, while delivering more than 2x performance and energy-efficiency gains over strong iGPU baselines on GEMM. In end-to-end LLM experiments on Ryzen AI laptops, TileFuse achieves up to 2.0x lower prefilling latency with more than 64.6% lower energy consumption. Together, these results show that XDNA2 is a practical target for AWQ-style edge LLM inference and that native NPU support for off-the-shelf quantization can make NPUs substantially more usable in real client deployments.

[AI-94] Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models

Link: https://arxiv.org/abs/2606.11324
Authors: Yifu Yuan,Yaoting Huang,Xianze Yao,Yutong Li,Shuoheng Zhang,Linqi Han,Pengyi Li,Jiangeng Sun,Wenting Jia,Zhao Zhang,Yuhao Liu,Ruihao Liao,Yucheng Hu,Qiyu Wu,Yuxiao Li,Zibin Dong,Fei Ni,Yan Zheng,Shuyang Gu,Yi Ma,Hongyao Tang,Han Hu,Jianye Hao
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Embodied R1.5 technical report. Project page: this https URL

Abstract:We introduce Embodied-R1.5, a unified Embodied Foundation Model (EFM) that integrates comprehensive embodied reasoning capabilities, spanning embodied cognition, task planning, correction, and pointing, within a single architecture toward general physical intelligence. Leveraging three automated data construction pipelines to significantly expand the data coverage of critical capabilities, we build a large-scale data system of over 15B tokens, and design a multi-task balanced RL recipe to alleviate heterogeneous task conflicts. We further introduce a Planner-Grounder-Corrector (PGC) closed-loop framework that enables a single model to autonomously execute and self-correct over long-horizon tasks. With only 8B parameters, Embodied-R1.5 achieves SOTA on 16 out of 24 embodied VLM benchmarks, surpassing leading models like Gemini-Robotics-ER-1.5 and GPT-5.4. Benefiting from the internalized embodied capabilities, Embodied-R1.5 can be fine-tuned into a VLA with only a small amount of data, outperforming leading VLA models like $\pi_{0.5}$ across 4 popular manipulation benchmark suites. We further conduct extensive zero-shot real-robot experiments, validating performance in instruction following, affordance grounding, articulated object manipulation, and long-horizon complex tasks, demonstrating strong generalization to the physical world. We open-source model weights, datasets, training code, and EmbodiedEvalKit, an evaluation framework tailored for embodied tasks, to facilitate future research in EFMs.

[AI-95] FreeBridge: Variational Schrödinger Bridges for Cellular Transition Dynamics MICCAI2026

Link: https://arxiv.org/abs/2606.11286
Authors: Xurui Wang,Qin Ren,Jun Ma,Haibin Ling,Chenyu You
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to MICCAI 2026 (early accept). Project page: this https URL

Abstract:High-content imaging assays quantify cellular responses to chemical and genetic perturbations, yet continuous trajectories of individual cells are unobservable because cells are chemically fixed at acquisition. Perturbation modeling therefore reduces to inferring stochastic transport between control and treated populations observed only as separate marginals. While recent generative models achieve strong end-point alignment, boundary consistency does not determine intermediate evolution: multiple stochastic processes may connect identical marginals while traversing regions unsupported by observed single-cell morphologies. We introduce FreeBridge, a Schrödinger Bridge formulation for single-cell transition modeling under endpoint-only supervision. FreeBridge defines atomic states as instance-segmented single-cell representations, establishing a fixed cellular manifold, and learns stochastic transport constrained within this geometry via empirical latent support regularization. Across BBBC021, RxRx1, and JUMP, FreeBridge maintains competitive or improved endpoint fidelity and mechanism-of-action retention under a unified evaluation protocol; on BBBC021, it further reduces intermediate support violations. These findings highlight the importance of geometric grounding for biologically interpretable perturbation dynamics. Project page: this https URL.

[AI-96] RoVE: Rotary Value Embeddings Attention for Relative Position-dependent Value Pathways

Link: https://arxiv.org/abs/2606.11275
Authors: Alejandro García-Castellanos,Maurice Weiler,Erik J Bekkers
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Rotary Position Embeddings (RoPE) make attention scores position-relative but leave the value pathway position-blind: the message sent by a value token is the same regardless of its distance from the query. We propose RoVE, a parameter-free modification that makes values position-sensitive by rotating them simultaneously with keys, and show that it turns RoPE attention into attentive convolution. This new perspective unifies several independent formulations of the same operation across computer vision, robotics, and modern LLM architectures. Trained 124M and 354M GPT-2 models show consistent empirical gains over RoPE on few-shot in-context learning, out-of-distribution perplexity, and long-context retrieval, with the clearest improvements on tasks that require long-range aggregation.
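
One way to read "position-sensitive values" is sketched below. The abstract says only that values are rotated together with keys; the step that rotates the aggregated output back by the query position, so that each value's contribution depends on the relative offset, is an assumption of this example and not a confirmed detail of RoVE. The function names `rotate` and `rove_attention`, the frequency schedule, and the non-causal softmax are likewise illustrative choices.

```python
import numpy as np

def rotate(x, positions, base=10000.0):
    # RoPE-style rotation of x with shape (seq, dim); dim must be even.
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    ang = positions[:, None] * freqs[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def rove_attention(q, k, v, positions):
    # Scores use rotated queries and keys, as in standard RoPE attention.
    qr, kr = rotate(q, positions), rotate(k, positions)
    scores = qr @ kr.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    # Assumed value pathway: rotate values by their own position, aggregate,
    # then rotate back by the query position so only the offset j - i remains.
    out = w @ rotate(v, positions)
    return rotate(out, -positions)
```

Under this reading, shifting every position by the same constant leaves the output unchanged, which is the translation-equivariant behaviour one would expect from a convolution-like operation.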

[AI-97] Federated continual learning: A comprehensive survey on lifelong and privacy-preserving learning over distributed and non-stationary data

Link: https://arxiv.org/abs/2606.11272
Authors: Masoume Gholizade,Fabrizio Ruffini,Pietro Ducange,Francesco Marcelloni
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 77 pages, 8 figures

Abstract:Federated Learning (FL) enables collaborative and privacy-preserving model training across distributed clients, but most existing FL systems implicitly assume data stationarity. In real-world settings such as healthcare, industrial IoT (IIOT), cybersecurity, and smart cities, data streams are inherently non-stationary, leading classical FL methods to suffer from performance degradation, instability, and catastrophic forgetting. Continual Learning (CL) addresses learning under evolving data distributions but has been largely studied in centralized settings, overlooking key constraints of federated systems, including privacy, limited communication, and client heterogeneity. Federated Continual Learning (FCL) emerges at the intersection of FL and CL, aiming to support lifelong, adaptive, and privacy-aware learning over distributed and non-stationary data. This survey provides a comprehensive and systematic overview of FCL. We first present a formal definition of the FCL problem and clarify its distinctive characteristics. We then analyze the limitations of classical FL under non-stationary conditions, highlighting how CL principles support long-term adaptation. To organize the rapidly growing literature, we propose a multi-dimensional taxonomy of FCL approaches. Furthermore, we review representative application domains and data modalities, summarize commonly used evaluation metrics, and discuss experimental perspectives for assessing long-term performance and forgetting. Finally, we highlight key open challenges, including handling extreme heterogeneity under temporal drift, designing scalable and privacy-preserving memory mechanisms, and establishing standardized benchmarks. This survey aims to serve as a reference and a roadmap for advancing FCL toward robust and deployable real-world systems.
MSC classes: 68T05, 68T07. ACM classes: I.2.6; I.2.11. Journal reference: Neurocomputing, Volume 694, 2026, 133929. Related DOI: https://doi.org/10.1016/j.neucom.2026.133929

[AI-98] When Poison Fails After Retrieval: Revisiting Corpus Poisoning under Chunking and Reranking Pipelines

Link: https://arxiv.org/abs/2606.11265
Authors: Xi Nie, Hongwei Li, Shenghao Wu, Mingxuan Li, Jiachen Li, Wenbo Jiang
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Notes:

Abstract:Retrieval-Augmented Generation (RAG) systems are vulnerable to corpus poisoning attacks that manipulate downstream model outputs through malicious knowledge injection. Existing studies mainly evaluate poisoning under simplified retrieval settings, overlooking practical RAG pipelines involving document chunking, dense retrieval, reranking, and grounded generation. In this paper, we revisit corpus poisoning under realistic multi-stage retrieval pipelines and show that many existing attacks substantially degrade after reranking despite achieving high retrieval-stage relevance. We identify retrieval granularity mismatch as a key reason for this failure: document-level adversarial signals are often fragmented during chunking, while rerankers favor locally coherent and answer-bearing passages rather than globally optimized semantic similarity. Based on this observation, we propose Chunk-aware and Rerank-Consistent Poisoning (CRCP), a poisoning framework that jointly optimizes retrieval relevance, reranker consistency, and chunk-boundary robustness. CRCP explicitly models chunking transformations during optimization to generate locally self-contained adversarial passages that remain effective under varying chunking configurations. Experiments on standard RAG benchmarks with multiple retrievers and rerankers show that existing poisoning methods are highly sensitive to chunk size and reranking strategies, whereas CRCP achieves substantially higher attack success rates and stronger robustness across realistic retrieval pipelines. Our findings highlight an important realism gap in current RAG security evaluation and suggest that poisoning in modern RAG systems should be studied as a multi-stage retrieval consistency problem rather than a retrieval-only problem.

[AI-99] PermDoRA – Understanding Adapter Interference in Language Models: Limits of Parameter-Space Geometry

Link: https://arxiv.org/abs/2606.11262
Authors: Gowtham Sivaramakrishnan, Sarvesha Kumar Kombaiah Seetha, Kishan Gupta Balaji, Santhosh Baradwaj Vaduvur Ranganathan
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: 18 pages, COLM 2026

Abstract: Access control in large language models (LLMs) requires modular mechanisms to enable domain-specific behavior without retraining or cross-domain interference. A common hypothesis is that interference during adapter composition arises from overlap in linear parameter updates, suggesting that enforcing orthogonality or directional independence should improve multi-domain performance. We test this hypothesis using DoRA-RBAC, a hierarchical adapter composition framework based on weight-decomposed low-rank adaptation. We compare conventional Euclidean merging with a geometry-aware Riemannian-inspired merging strategy that approximates the Fréchet mean via normalized directional averaging across multiple QA benchmarks (GPQA, PubMedQA, SimpleQA, WMDP) on LLaMA-3.1-8B and Mistral-7B. Our results show that while single-domain performance matches LoRA, geometry-aware merging provides no consistent advantage over standard averaging in multi-domain settings. Our analysis further reveals that angular alignment and orthogonality of adapter updates are weak predictors of composition performance. These findings suggest that adapter interference is not governed primarily by parameter-space geometry, but is instead consistent with interactions in shared nonlinear representations.
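
The two merging strategies compared in this abstract can be illustrated with a small sketch. This is not the authors' implementation. It is one possible reading of "normalized directional averaging", and it assumes that each adapter update is a flat, nonzero vector and that the merged direction is rescaled by the mean norm (the rescaling is an assumption made here for illustration). All function names are hypothetical.

```python
import math

def euclidean_mean(vectors):
    """Plain element-wise average of the update vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def directional_mean(vectors):
    """Average the unit directions, renormalize, then rescale by the mean norm.

    Assumption: every input vector is nonzero. The rescaling step is an
    illustrative choice, not something stated in the abstract.
    """
    units = [[x / norm(v) for x in v] for v in vectors]
    mean_dir = euclidean_mean(units)
    d = norm(mean_dir)
    if d == 0.0:
        raise ValueError("directions cancel out; mean direction is undefined")
    mean_norm = sum(norm(v) for v in vectors) / len(vectors)
    return [mean_norm * x / d for x in mean_dir]

a = [3.0, 0.0]  # large update along the first axis
b = [0.0, 1.0]  # small update along the second axis
print(euclidean_mean([a, b]))    # dominated by the larger update
print(directional_mean([a, b]))  # both directions weighted equally
```

In this toy case the Euclidean mean leans toward the larger update, while the directional mean treats both directions equally. The abstract reports that this kind of geometric distinction gave no consistent advantage in practice.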

[AI-100] RAIL: Rethinking Auditory Intelligence in Large Audio-Language Models with a CHC-Grounded Benchmark

Link: https://arxiv.org/abs/2606.11260
Authors: Hongyu Jin, Siyi Wang, Yang Xiao, Jiaheng Dong, Shihong Tan, Kaiyuan Peng, Georgiana Juravle, Shanquan Chen, Gongping Huang, Hong Jia, Eun-Jung Holden, James Bailey, Ting Dang
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Notes:

Abstract: Humans process rich auditory environments through tightly integrated cognitive capabilities such as audio perception, audio reasoning, and memory. Despite recent progress in large audio-language models (LALMs) across speech understanding and multimodal audio reasoning, current evaluation paradigms remain largely task- or modality-centric, focusing on end performance while overlooking underlying auditory cognitive behaviours. This reveals a fundamental gap between how auditory cognition is understood in humans and how it is evaluated in LALMs, particularly in the lack of frameworks that operationalise cognitive principles beyond task-level metrics to systematically capture model behaviour. In this work, we introduce RAIL, a human-centric evaluation paradigm grounded in the Cattell-Horn-Carroll (CHC) cognitive framework. RAIL formalises auditory cognition into five core capabilities and develops them into structured evaluation tasks that probe how models process, retain, and integrate auditory information. We further construct a cognitively grounded benchmark with principled data curation and human-aligned evaluation protocols. Evaluating 26 state-of-the-art LALMs, we find that current models exhibit highly uneven performance across cognitive abilities. RAIL establishes a new evaluation paradigm that moves beyond task-centric benchmarking toward cognitively grounded assessment of auditory intelligence.

[AI-101] Physics-informed generative AI for semiconductor manufacturing: Enforcing hard physical constraints in generative models by construction

Link: https://arxiv.org/abs/2606.11247
Authors: Yaser Mike Banad, Sarah Sharif
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Notes:

Abstract:Generative models are increasingly used to propose designs, data, and control actions for physical systems, yet many such systems are governed by hard physical constraints rather than by perceptual plausibility. Semiconductor manufacturing provides a demanding test case: generated masks, layouts, synthetic defect data, and process recipes must obey lithography, transport, reaction, and device-physics constraints, because physically invalid samples are not merely low quality but unusable. This Perspective argues that semiconductor manufacturing exposes a broader computational-science challenge, namely that generative AI for constrained physical domains must be physics-informed by construction, not corrected only through post-hoc filtering. We survey the emerging architectural toolkit, including physics-informed diffusion, PDE-constrained variational models, neural-operator priors, and conservation-law-respecting generative networks, and show how it connects to differentiable lithography, TCAD, process simulation, and autonomous experimentation. We identify four integration patterns between generative models and physics-based simulators, and we propose a research agenda centered on physics-fidelity benchmarks, differentiable simulator infrastructure, and multimodal foundation models for physical design and manufacturing. The central claim is analytical rather than rhetorical: where physical validity is the binding criterion of success, architectures that enforce it by construction should be expected to outperform those that filter for it after the fact, and the fab is the setting where this distinction is sharpest.

[AI-102] Position: Hippocampal Explicit Memory Is the Cornerstone for AGI ICML2026

Link: https://arxiv.org/abs/2606.11245
Authors: Sangjun Park
Categories: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
Notes: Accepted to ICML 2026 (Position Paper Track)

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, raising expectations for Artificial General Intelligence (AGI). This position paper argues that integrating explicit memory is the cornerstone for advancing LLMs toward AGI. The key reason is that the underlying learning mechanism of LLMs is highly analogous to human implicit memory. However, higher-order cognitive functions necessary for AGI, such as long-term strategic planning, metacognition, and symbolic reasoning, heavily rely on hippocampal explicit memory and cannot arise solely from implicit statistical learning. Drawing on findings from neuroscience, I advance this perspective and complement it with computational requirements for artificial explicit memory systems, hoping to foster further research and lay the groundwork for explicit memory integration.

[AI-103] SPEAR: A System for Post-Quantization Error-Adaptive Recovery Enabling Efficient Low-Bit LLM Serving

Link: https://arxiv.org/abs/2606.11244
Authors: Hongyuan Liu, Yawei Li, Zhiqiang Que, Qinli Yang, Junming Shao, Guosheng Hu
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Notes:

Abstract:Efficient large language model (LLM) serving is increasingly constrained by deployment cost. Quantization is a key technique for reducing serving cost, yet even state-of-the-art 4-bit quantizers exhibit a noticeable quality gap from FP16, particularly for smaller models where low-bit serving is most beneficial. We identify a fundamental cause of this gap: quantization error is highly input-dependent and varies substantially across tokens, while existing post-quantization compensation methods are static and apply identical corrections to all inputs. As a result, easy tokens are over-corrected while hard tokens remain under-corrected. We present SPEAR, a system for post-quantization error-adaptive recovery that improves low-bit LLM serving. SPEAR introduces lightweight Error Compensators (ECs) modulated by per-token gates and places them only at the most error-sensitive layers identified through a CKA-guided entropy-aware diagnostic. This focuses a small parameter budget where it is most effective. Efficient deployment of ECs presents several systems challenges, including additional computation, tensor-parallel synchronization caused by input-dependent gating, and latency instability across configurations. SPEAR addresses these issues through adaptive kernel-fusion dispatch, combining an epilogue-integrated peer-reduction kernel with P2P dual-write to fuse the post-EC computation into low-bit GEMMs, and an SLO-constrained EC-aware scheduler for predictable serving performance. Across challenging per-channel quantization settings, SPEAR recovers 56-75% of the perplexity gap between W4 and FP16 while adding less than 1% model memory overhead and maintaining latency comparable to a widely used 4-bit serving deployment. 
Cite as: arXiv:2606.11244 [cs.AR] (or arXiv:2606.11244v1 [cs.AR] for this version)
DOI: https://doi.org/10.48550/arXiv.2606.11244 (arXiv-issued DOI via DataCite)
Submission history: v1 submitted by Hongyuan Liu, Thu, 4 Jun 2026 22:38:53 UTC (4,830 KB)

[AI-104] An Ethical eValuation Agent (EeVA): Results of a Proof-of-Concept Test on a Prototype Agentic-like Workflow to Assist Ethical Deliberations

Link: https://arxiv.org/abs/2606.11218
Authors: Stephen Milford, B. Zara Malgir, Miguel Vazquez
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Notes:

Abstract:Ethical deliberation is often misunderstood as a search for single right or wrong answers, creating difficulties for non-ethically trained personnel who must address ethically laden challenges. We developed EeVA, an agentic-like LLM-based workflow designed to support comparative ethical reflection rather than deliver definitive ethical answers. EeVA was programmed in n8n using three interconnected workflows: starter, worker, and emitter. It evaluated uploaded use cases against 10 ethical frameworks through evaluator and synthesis prompts. Proof-of-concept testing used three published cases from urban mobility, peer-to-peer energy trading, and social-service resource allocation. Across all cases, EeVA produced consistently structured framework-specific evaluations and integrated syntheses. Outputs differentiated between frameworks, identified convergences and divergences, recommended modifications to increase alignment, and highlighted persistent ethical tensions. Syntheses were readable for non-specialists and shifted attention away from simplistic answers toward design conditions, safeguards, and areas where full cross-framework agreement was unlikely. The findings suggest that LLMs can be organised into usable workflows that preserve ethical plurality while helping bridge the communicative gap between ethicists and non-ethically trained personnel. EeVA’s value lies not in replacing ethicists or resolving moral disagreement, but in scaffolding structured ethical deliberation. EeVA offers a promising proof of concept for supporting ethical reflection where access to ethics expertise is limited. Further work is needed on reproducibility, human evaluation, user testing, and efficiency before it can be considered a mature tool.

[AI-105] The Environmental Cost of LLMs in AIED: Reporting and Practices

Link: https://arxiv.org/abs/2606.11215
Authors: Sabrina C. Eimler, Lukas Erle, Daniel Flood, Aditi Haiman, Luca Häckert, André Helgert, Lachlan McGinness, Büsra Yapici
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Notes:

Abstract: Large Language Model (LLM) usage in recent years has become increasingly widespread in the Artificial Intelligence in Education (AIED) community. While LLMs offer unique avenues for learners and educators, using LLMs comes with computational and environmental costs. These costs are mostly hidden due to a lack of standardised procedures to measure and report these impacts. To address this gap, we first conducted a literature review of all papers published as part of the AIED 2025 conference proceedings, determining if and how computational or environmental costs of LLMs are reported. Most projects use LLMs, but few report computational resources used and almost none discuss environmental impacts of LLMs as an ethical concern. To address this lack of standardised reporting practices, we propose an open-source method for systematically measuring and reporting the computational expense of LLMs and environmental impact of running Machine Learning (ML) AIED systems. We provide software solutions to measure the carbon footprint for both local and cloud-based hardware. We also provide an easy-to-use formula to calculate the computational expense of frontier LLMs even when the exact number of parameters is not known. Overall, we hope to motivate colleagues to use our method to strive for more transparent reporting of hidden costs of using LLMs in the AIED community.
Cite as: arXiv:2606.11215 [cs.CY] (or arXiv:2606.11215v1 [cs.CY] for this version)
DOI: https://doi.org/10.48550/arXiv.2606.11215 (arXiv-issued DOI via DataCite)

[AI-106] Belief-Space Control for Personalized Cancer Treatment via Active Inference

Link: https://arxiv.org/abs/2606.10376
Authors: Deniz Sargun, H. Bugra Tulay, C. Emre Koksal
Categories: Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Notes: 11 pages including appendix

Abstract: Cancer treatment is at its core a sequential decision-making problem with partial observability, latent patient heterogeneity, and explicit constraints on the budget for medical measurements. Unlike standard Reinforcement Learning (RL) approaches that control state trajectories, cancer treatments permanently modify patients’ transition dynamics, changing how states evolve over time. We model cancer treatment as a belief-space planning problem using active inference, deriving an expected free-energy objective that unifies goal-directed control and information acquisition under measurement budgets without. We implement this framework using real clinical cancer data from the AACR Project GENIE Biopharma Collaborative dataset. Results on clinical data demonstrate simultaneous patient categorization and high treatment efficacy under real measurement and treatment constraints.

[AI-107] Market Design for AI: Beyond the Copyright Binary

Link: https://arxiv.org/abs/2606.12260
Authors: Yan Dai, Maryam Farboodi, Negin Golrezaei, Sepehr Shahshahani
Categories: Theoretical Economics (econ.TH); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Machine Learning (stat.ML)
Notes:

Abstract:How can we design a market of human-generated content for use in training AI models that both enables technological progress and preserves individual incentives for high-quality content creation? Existing approaches take polar positions: a “free-for-all” model based on fair use and a “strong intellectual property rights” model. We show that both fail: Free-for-all does not compensate creators, and – by modeling as a static Stackelberg game – strong intellectual property rights also underpower creative incentives. We find this especially true for more innovative creators, a phenomenon we term the “originality penalty.” Extending this insight to a dynamic model, we find another market failure undermining AI model performance, even for an initially good model: Such a model induces greater reliance by humans on AI-assisted creation, resulting in homogenized content feeding back into training, which degrades the model performance – a “curse of precision.” We further propose a market design with a data intermediary internalizing cross-creator externalities and subsidizing innovative contributions, thereby restoring efficiency.

[AI-108] Sparsified Kolmogorov-Arnold Networks for Interpretable Quantum State Tomography

Link: https://arxiv.org/abs/2606.11814
Authors: Xinge Wu, Huaxin Wang, Jiajun Liu, Ruiqing He, Jiandong Shang, Hengliang Guo, Qiang Chen
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

Abstract: Machine-learning approaches to quantum state tomography can achieve high reconstruction fidelity, but the physical structure used by the trained model often remains implicit. Here we ask whether a sparsified Kolmogorov-Arnold Network (KAN) can be used not only as a regressor, but also as an inspectable reconstruction rule whose internal organization can be checked against known Pauli structure. We study a controlled three-qubit GHZ-family benchmark in which all 63 non-identity Pauli expectation values are used to reconstruct three GHZ-subspace variables: the population imbalance z, the real off-diagonal component c, and the imaginary off-diagonal component s. Under finite-shot sampling and depolarizing noise, external ablation identifies the extended 12-channel GHZ-relevant Pauli set from the 63 measurements, with exact top-12 recovery across the tested shot counts and depolarizing-noise strengths. These support patterns remain stable across multi-seed random-initialization and noise-level analyses, and collapse under random-label controls. The dominant pruned input-hidden-output pathways organize Z-type population observables and X/Y off-diagonal observables in a pattern consistent with the analytic GHZ Pauli grouping, and sparse formula recovery recovers the canonical signed Pauli relations. The contribution of the KAN is therefore pathway-level structural interpretability within a neural reconstruction model, rather than superior sparse regression. Together with negative controls, these probes provide a consistency chain for auditing learned reconstruction rules against known physical structure.

[AI-109] End-to-End Machine Learning for Depressive State Classification via EEG and fNIRS

Link: https://arxiv.org/abs/2606.11555
Authors: Riki Sakurai, Simon Kojima, Mihoko Otake-Matsuura, Shin’ichiro Kanoh, Tomasz M. Rutkowski
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 4 pages, 4 figures, Accepted for publication in the Proc. 48th Annu. Int. Conf. IEEE EMBS (EMBC 2026), Toronto, Canada, July 20-24, 2026

Abstract:The escalating demand for mental healthcare, driven by rising societal stress, highlights the limitations of traditional psychiatric diagnostics. Conventional methods - relying primarily on clinical interviews and patient self-reports - are inherently vulnerable to subjective bias and the varying empirical judgment of practitioners. To address the need for quantitative evaluation, biological signal-based detection, including electroencephalography (EEG) and functional near-infrared spectroscopy (fNIRS), has emerged as a promising objective alternative. Such technology is particularly vital for identifying latent depressive states that may be unrecognized by the subjects themselves. Furthermore, in aging populations, the high comorbidity between depression and dementia necessitates early differentiation to prevent mutual symptom exacerbation and maintain Quality of Life (QoL). This pilot study of eleven healthy students establishes a framework for biological signal-based depression detection, serving as a foundational step toward automated, objective diagnostic tools for clinical use.

[AI-110] Quantized Stochastic Primal-Dual Methods for Distributed Optimization under Relaxed Global Geometry UAI

Link: https://arxiv.org/abs/2606.11339
Authors: Susmit Sarkar, Abhinav Raghuvanshi, Kushal Chakrabarti, Mayank Baranwal
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
Notes: Accepted to UAI

Abstract:We study distributed optimization with stochastic gradients and finite-bit communication modeled by random (unbiased) quantization. We propose q-PDGD, a quantized stochastic primal-dual method, and analyze it under relaxed global geometry. Under restricted secant inequality (RSI), a constant step-size yields linear contraction to an explicit neighborhood determined by gradient noise, quantization distortion, and network connectivity, while a diminishing step-size achieves O(1/k) convergence without shared-minimizer assumptions. Under Polyak-Lojasiewicz (PL) inequality, we obtain linear-to-neighborhood convergence in the same stochastic quantized setting. Our results match the best-known centralized stochastic rates in oracle complexity, and are supported by experiments demonstrating the predicted tradeoffs between quantization level, step-size choice, and graph structure.
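
The "random (unbiased) quantization" mentioned in this abstract can be illustrated with stochastic rounding to a uniform grid. This is a generic textbook-style sketch, not the quantizer used in q-PDGD. The function name, the grid step, and the sample count are all assumptions chosen for illustration.

```python
import math
import random

def stochastic_quantize(x, step, rng):
    """Round x to a multiple of `step`, up or down at random, so that E[q] = x."""
    lower = math.floor(x / step) * step
    p_up = (x - lower) / step  # probability of rounding up to the next grid point
    return lower + step if rng.random() < p_up else lower

rng = random.Random(0)  # fixed seed so the run is repeatable
x, step = 0.3, 0.25
samples = [stochastic_quantize(x, step, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)

# Each sample is one of the two neighbouring grid points (0.25 or 0.5);
# the empirical mean should sit close to the original value 0.3.
print(sorted(set(samples)), mean)
```

Each transmitted value needs only a grid index, so a coarser `step` means fewer bits but more quantization distortion, which is the tradeoff the abstract refers to.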

[AI-111] OmniBioTwin: A System-of-Twinned-Systems Framework for Health Digital Twins

Link: https://arxiv.org/abs/2606.11264
Authors: Zhaohui Wang, Yu Huang, Jiang Bian
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
Notes:

Abstract: Health digital twins (HDTs) promise patient-specific modeling and decision support, but current approaches remain structurally fragmented: monolithic models that address a single organ or task lack cross-scale fidelity, while system-level twins lack generalizable architectural frameworks. We propose OmniBioTwin, a System-of-Twinned-Systems (SoTS) framework that organizes HDTs as modular computational entities coupled through explicit interaction operators within a multi-layer network architecture. The framework comprises seven coordinated layers spanning data integration, autonomous twin modeling, cross-scale coupling, temporal synchronization, and human-in-the-loop decision support. We demonstrate OmniBioTwin by instantiating a multiscale twin for glucagon-like peptide-1 (GLP-1) signaling pathways in Alzheimer’s disease, illustrating how molecular, cellular, and organ-level twins can be composed and coupled within a unified system.

[AI-112] Artificial Intelligence in Ship Finance: Applications, Opportunities, and a Case Study in AI-Augmented Loan Origination

Link: https://arxiv.org/abs/2606.11238
Authors: Lasse Dierich, Orestis Schinas
Categories: General Finance (q-fin.GN); Artificial Intelligence (cs.AI)
Notes: 9 pages, 1 figure

Abstract:Ship finance is a data-intensive and document-heavy segment of asset-based lending, requiring the integration of financial, technical, contractual, and regulatory information from heterogeneous and largely unstructured sources. Increasing environmental regulation and ESG reporting requirements are adding further complexity to underwriting and loan-origination processes. Recent advances in artificial intelligence (AI), particularly large language models (LLMs), create new opportunities for processing and analysing such information. This paper reviews potential applications of AI in ship finance, with a particular focus on LLM-based systems for document comprehension, information extraction, and workflow automation. We present this http URL, a modular agentic architecture to support loan application workflows in ship finance. The proposed system combines an LLM-based extraction module, financial analysis components, external maritime data services, and a controlled document-generation module with a chatbot interface to support the preparation of standardized financing applications. The paper discusses the key challenges for using such models in production. We argue that AI-assisted systems can support maritime finance professionals in managing increasingly complex information and reporting requirements.

Machine Learning

[LG-0] UniIntervene: Agentic Intervention for Efficient Real-World Reinforcement Learning

Link: https://arxiv.org/abs/2606.12372
Authors: Haoyuan Deng, Yitong Gao, Yudong Lin, Haichao Liu, Zhenyu Wu, Ziwei Wang
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Notes: Project page: this https URL

Abstract:Human-in-the-loop reinforcement learning (HiL-RL) has emerged as an effective paradigm for real-world robotic manipulation, enabling online policy improvement with human guidance. However, current HiL-RL frameworks remain intervention-intensive, relying on frequent human corrections to redirect the policy out of unproductive exploration, which incurs high labor cost and limits real-world scalability. To address this, we propose UniIntervene, an agentic intervention model that detects unproductive exploration and autonomously recovers the policy toward high-value states, taking over the bulk of interventions from human operators. Specifically, UniIntervene first performs future-conditioned action-value estimation, predicting the latent consequence of the current action and evaluating its induced value, which provides a more stable progress signal. Building on this, a temporal value-risk critic aggregates recent value dynamics and triggers intervention when the estimated value exhibits sustained stagnation or degradation. When intervention is required, UniIntervene retrieves a high-value recovery target from a memory of past intervention episodes and produces executable corrective actions through a goal-conditioned recovery policy. In this way, UniIntervene turns intervention from passive human correction into a value-aware recovery process for efficient real-world RL. Extensive experiments on diverse real-world manipulation tasks demonstrate that UniIntervene improves the average success rate by 8.6% while reducing human interventions by 57% relative to state-of-the-art HiL-RL baselines.
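
The idea of triggering intervention on "sustained stagnation or degradation" of the estimated value can be sketched with a simple sliding-window rule. In the paper the critic is a learned model. The hand-written threshold rule below is only a stand-in for illustration, and the class name, window length, and `min_gain` threshold are all hypothetical.

```python
from collections import deque

class StagnationTrigger:
    """Flag intervention when value estimates stop improving over a window."""

    def __init__(self, window=5, min_gain=0.01):
        self.values = deque(maxlen=window)  # most recent value estimates
        self.min_gain = min_gain            # minimum improvement expected per window

    def update(self, value):
        """Record a value estimate; return True when intervention is suggested."""
        self.values.append(value)
        if len(self.values) < self.values.maxlen:
            return False  # not enough history yet
        gain = self.values[-1] - self.values[0]
        return gain < self.min_gain

trigger = StagnationTrigger(window=3, min_gain=0.05)
flags = [trigger.update(v) for v in [0.1, 0.2, 0.3, 0.31, 0.30, 0.29]]
print(flags)  # [False, False, False, False, True, True]
```

The rule stays quiet while the value climbs and fires once progress over the window drops below the threshold, which covers both stagnation and decline.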

[LG-1] On Subquadratic Architectures: From Applications to Principles

Link: https://arxiv.org/abs/2606.12364
Authors: Anamaria-Roberta Hartl, Levente Zólyomi, David Stap, Pieter-Jan Hoedt, Niklas Schmidinger, Lukas Hauzenberger, Sebastian Böck, Günter Klambauer, Sepp Hochreiter
Categories: Machine Learning (cs.LG)
Notes:

Abstract:Transformers dominate modern sequence modeling, but their quadratic attention incurs substantial computational cost. Subquadratic architectures offer a scalable alternative. However, it remains unclear which designs yield the most effective sequence models. We compare three leading approaches: xLSTM, Mamba-2, and Gated DeltaNet. We evaluate these models on tasks with complex dependencies: (1) code-model pre-training, (2) distillation of code models from large language models, and (3) pre-training of time-series foundation models. Across these settings, xLSTM delivers the strongest overall performance. To explain xLSTM’s advantage, we present a unified formulation and analyze the underlying architectural mechanisms, focusing on state tracking and memory dynamics. Our results show that xLSTM enables more flexible and stable memory correction via its gating scheme. We corroborate these findings on controlled synthetic length-generalization tasks. Overall, our findings indicate that xLSTM’s gains on complex tasks stem from robust state tracking and accumulation.

[LG-2] Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal

Link: https://arxiv.org/abs/2606.12360
Authors: Leon Bergen, Usha Bhalla, Sidharth Baskaran, Max Loeffler, Raphael Sarfati, Dhruvil Gala, Ryan Panwar, Santiago Aranguri, Thomas Fel, Atticus Geiger, Matthew Kowal, Siddharth Boppana, Daniel Balsam, Owen Lewis, Jack Merullo, Thomas McGrath, Ekdeep Singh Lubana
Categories: Machine Learning (cs.LG)
Notes:

Abstract:Language-model post-training is the main stage at which model behavior is shaped, yet it still largely involves optimization of scalar rewards that summarize diverse desiderata. This abstraction gives practitioners little visibility into what their data actually teaches models, allowing spurious correlations to be learned by a model and inducing undesirable behaviors such as over-stylization and sycophancy. To address this problem, we ask: can we inspect a preference dataset before optimization and decide, at the level of concepts, which behaviors a model should be allowed to learn? Motivated by this, we introduce a data-centric post-training pipeline that uses interpretability protocols to develop statistical hypotheses for the latent concepts separating preferred from dispreferred generations, making them explicit for fine-grained user feedback. Building on this view, we unify several interpretability-based training protocols as ways of shaping rewards via feature or data interventions. Empirically, we show that our pipeline diagnoses undesirable signals in existing preference data, mitigates off-target learning, and can also help amplify or shape desired properties such as safeguards and model personality. More broadly, our results suggest that interpretability can turn post-training from optimizing opaque proxy rewards into a process of auditing and sculpting the learning signal itself.

[LG-3] Adjoint Method versus Physics-Informed Neural Networks in PDE-Constrained Inverse Problems

Link: https://arxiv.org/abs/2606.12337
Authors: Zhen Zhang, Alessandro Alla, George Em Karniadakis
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Notes: 35 pages, 10 figures

Abstract:Inverse problems governed by partial differential equations (PDEs) are central to computational mechanics and are commonly solved by adjoint-based optimization, while physics-informed neural networks (PINNs) have emerged as a flexible alternative. Their relative performance remains difficult to assess because the two approaches are often compared under different formulations, parameterizations, optimizers, and regularization choices. We present a fair comparison of adjoint optimization and PINNs for PDE-constrained inverse problems. From a common abstract formulation, we instantiate both methods on identical domains, governing equations, observation models, and regularization terms, while matching the optimizer, unknown parameterization, and arithmetic precision wherever applicable. The benchmarks include unsteady Burgers, noisy Darcy permeability inversion, three-dimensional Allen–Cahn reaction identification, and unsteady Navier–Stokes viscosity identification. The results show that the representation of the unknown largely determines the preferred method: grid-based fields favor the discrete adjoint, whereas neural representations are native to PINNs and relevant for closure and constitutive modeling. For time-dependent problems, adjoint inversion can be dominated by trajectory storage and differentiation, while PINNs provide satisfactory reconstructions at lower cost. A PINN-warm-started adjoint strategy then recovers adjoint-level accuracy at substantially reduced cost.

[LG-4] Fourier Features Let Agents Learn High Precision Policies with Imitation Learning ICML2026

Link: https://arxiv.org/abs/2606.12334
Authors: Balázs Gyenes,Emiliyan Gospodinov,Jan Frieling,Enrico Krohmer,Nicolas Schreiber,Xiaogang Jia,Niklas Freymuth,Gerhard Neumann
Categories: Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Published as a conference paper at ICML 2026

Abstract:High-precision robotic manipulation requires fine-grained spatial reasoning that is often difficult to achieve with RGB-only policies due to depth ambiguity and perspective scale issues. Policies that leverage 3D information directly, such as those based on point clouds, offer a stronger geometric prior over purely image-based ones, yet their performance remains highly task-dependent. We hypothesize that this discrepancy may be due to the spectral bias of neural networks towards learning low frequency functions, which especially affects architectures conditioned on slow-moving Cartesian features. We thus propose to map point clouds from Cartesian space into high-dimensional Fourier space, effectively equipping the point cloud encoder with direct access to high-frequency features. We experimentally validate the use of Fourier features on challenging manipulation tasks from the RoboCasa and ManiSkill3 benchmarks and on a real robot setup. Despite their simplicity, we find that Fourier features provide significant benefits across diverse encoder architectures and benchmarks and are robust across hyperparameters. Our results indicate that Fourier features let policies leverage geometric details more effectively than Cartesian features, showing their potential as a general-purpose tool for point cloud-based imitation learning. We provide source code and videos on our project page: this https URL
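
As a rough illustration of the idea in this abstract, the sketch below maps each Cartesian coordinate of a point to sine and cosine features at several frequencies. The function name `fourier_features`, the power-of-two frequency schedule, and the parameters `num_bands` and `base_freq` are assumptions made for this example; the abstract does not specify the paper's exact encoding.

```python
import math


def fourier_features(point, num_bands=4, base_freq=1.0):
    """Map one 3D point (x, y, z) to sin/cos features.

    Assumed schedule: frequencies base_freq * 2**k * pi for k = 0..num_bands-1.
    Output length is 3 * num_bands * 2.
    """
    feats = []
    for coord in point:
        for k in range(num_bands):
            freq = base_freq * (2.0 ** k) * math.pi
            feats.append(math.sin(freq * coord))
            feats.append(math.cos(freq * coord))
    return feats


# Two points that differ by 1 mm along x (hypothetical values).
cloud = [(0.100, 0.20, 0.30), (0.101, 0.20, 0.30)]
encoded = [fourier_features(p) for p in cloud]

# Largest per-feature difference between the two encodings.
max_feature_gap = max(abs(a - b) for a, b in zip(encoded[0], encoded[1]))
```

In this toy setup the two points differ by 0.001 in Cartesian space, while the high-frequency bands separate them by a noticeably larger margin, which is the intuition behind giving the encoder direct access to high-frequency features.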

[LG-5] Learning What to Say to Your VLA: Mostly Harmless Vision Language Action Model Steering

Link: https://arxiv.org/abs/2606.12299
Authors: Hyun Joe Jeong,Gokul Swamy,Andrea Bajcsy
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 22 pages, 14 tables, 14 figures

Abstract:Vision-Language-Action (VLA) models provide a natural language interface to robot control, but the mapping from language to behavior is often brittle and unintuitive: semantically similar instructions can induce drastically different behaviors, while some capabilities may not be elicitable through prompting alone. As a result, both human instructions and zero-shot language models can fail to reliably steer VLAs toward successful task execution. In this work, we propose a framework that interactively searches for language sequences that improve closed-loop VLA task performance, distills these sequences into a test-time language feedback policy (LFP), and learns an improvement head that predicts when language steering will improve performance. We conformalize this improvement head to prevent harmful steering interventions, where the LFP decreases task performance relative to the original instruction on out-of-distribution scenarios. Crucially, our approach operates on arbitrary frozen pre-trained VLAs, requiring neither access to the original training distribution nor fine-tuning of the underlying model. On seen environments, our conformalized LFP improves base VLA performance by 24.7% in simulation and 65.0% in hardware. On visual and semantic perturbations, our conformalized LFP has strong harmlessness guarantees, and produces recovery behaviors not observed with open-loop prompting.

[LG-6] PianoKontext: Expressive Performance Rendering from Deadpan Context ICML2026

Link: https://arxiv.org/abs/2606.12282
Authors: Dmitrii Gavrilev
Categories: Sound (cs.SD); Machine Learning (cs.LG)
Comments: ICML 2026 Workshop on Machine Learning for Audio (Oral)

Abstract:Expressive performance rendering (EPR) aims to generate realistic performances constrained on sequences of notes. However, flow matching audio editing models manipulate only synchronized music samples of the same duration, limiting their understanding of expressive timing. We introduce PianoKontext, a flow matching rendering model for classical piano music that generates variable-length performances in the latent space of a pretrained Music2Latent model. We synthesize MIDI scores into deadpan audio and employ Dynamic Time Warping (DTW) in the latent space to construct paired data for training. The aligned embeddings are concatenated in DiT blocks, allowing for a simple and effective learning of the dependencies between the score and performances. Audio samples are available at our demo page: this https URL.
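
The Dynamic Time Warping step mentioned in this abstract can be sketched with the classic dynamic-programming recurrence. This is a generic DTW on scalar sequences, written only for illustration: the paper aligns latent embeddings, and the function name `dtw_path` and the absolute-difference cost are assumptions of this example.

```python
def dtw_path(a, b):
    """Classic DTW between two scalar sequences.

    Returns (total_cost, path), where path is a list of (i, j) index pairs
    aligning a[i] with b[j], from (0, 0) to (len(a) - 1, len(b) - 1).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end, always stepping to the cheapest predecessor.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
        path.append((i - 1, j - 1))
    path.reverse()
    return cost[n][m], path


# A "score" of three frames and a "performance" that lingers on the middle one.
score_frames = [0.0, 1.0, 2.0]
performance_frames = [0.0, 1.0, 1.0, 2.0]
total, alignment = dtw_path(score_frames, performance_frames)
```

The alignment pairs each score frame with one or more performance frames, which is how sequences of different durations can be turned into paired training data.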

[LG-7] Holding the FP8 Quality Ceiling at 8-Bit Weights and Activations: INT8 and GGUF Post-Training Quantization of Ideogram 4.0 for Consumer GPUs

Link: https://arxiv.org/abs/2606.12280
Authors: Deep Gandhi,Ali Asaria,Tony Salomone
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Post-training quantization lets large text-to-image diffusion transformers run on consumer GPUs, yet the hardware-specific trade-offs are seldom measured directly. We quantize Ideogram 4.0 - a 9.3B flow-matching diffusion transformer (DiT), shipped as two separate-weight copies of a single-stream 34-layer backbone for classifier-free guidance and conditioned by a Qwen3-VL-8B encoder - for Ampere RTX 3090 GPUs, which lack FP8 tensor cores. Our INT8 W8A8 recipe (per-channel weights, per-token dynamic activations, SmoothQuant, and mixed-precision protection of a small high-fragility layer set) holds the FP8 quality ceiling: on a 200-prompt benchmark the paired same-seed bootstrap CI for INT8-FP8 includes zero on both Pick and CLIP, while INT8 improves on NF4 by +1.9 CLIP (95% CI [+1.21, +2.64], excluding zero). A per-category OCR analysis, to our knowledge unreported for this model class, confirms text legibility is preserved, and an ablation isolates protection of the FFN down-projections as the dominant quality lever. Our GGUF Q4_K quantization beats NF4 at equal on-disk size and is the Pareto winner on the quality-memory frontier, with paired confidence intervals excluding zero (Q8_0 is quality neutral). Finally, we characterize where 8-bit quantization helps and where it does not: INT8’s weights match FP8’s footprint rather than shrink it, so a speed gain on Ampere awaits a fused INT8 kernel.
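
To make "per-channel weights, per-token dynamic activations" concrete, here is a minimal pure-Python sketch of symmetric INT8 W8A8 for a single linear layer. It covers only those two ingredients: SmoothQuant and the mixed-precision layer protection from the abstract are left out, and the helper names `quantize_symmetric` and `w8a8_linear` are hypothetical.

```python
def quantize_symmetric(values, num_bits=8):
    """Symmetric quantization of a list of floats to signed integers.

    Returns (quantized_ints, scale) such that value ~= int * scale.
    """
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def w8a8_linear(weight, activations):
    """weight: [out][in] floats; activations: [tokens][in] floats.

    Weights get one scale per output channel (computed once, offline).
    Activations get one scale per token (computed at run time).
    """
    w_q = [quantize_symmetric(row) for row in weight]
    outputs = []
    for x in activations:
        x_q, x_scale = quantize_symmetric(x)
        row = []
        for q_row, w_scale in w_q:
            acc = sum(a * b for a, b in zip(x_q, q_row))  # integer accumulation
            row.append(acc * x_scale * w_scale)           # rescale to float
        outputs.append(row)
    return outputs


weight = [[0.5, -1.0], [2.0, 0.25]]
activations = [[1.0, 2.0], [-0.5, 0.1]]
quantized_out = w8a8_linear(weight, activations)
float_out = [[sum(w * x for w, x in zip(row, a)) for row in weight] for a in activations]
```

Comparing `quantized_out` against `float_out` gives a feel for the rounding error that per-channel and per-token scales keep small on this toy layer.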

[LG-8] Finding Multiple Interpretations in Datasets

Link: https://arxiv.org/abs/2606.12277
Authors: Matthew Chak,Paul Anderson
Categories: Machine Learning (cs.LG)
Comments:

Abstract:In this paper, we propose an approach to finding sets of similar-performing models (in terms of loss/accuracy measurements) with highly different context-aware characteristics. Through experiments on the METABRIC dataset, we show that the proposed method finds multiple models with highly different gene expressions than those found by the control methodology without performance penalties. We argue that the proposed methodology is important whenever one aims to analyze any global characteristic of a model to extract insight into the underlying phenomenon being studied.

[LG-9] Re-evaluating Confidence Remasking in Masked Diffusion Language Models

Link: https://arxiv.org/abs/2606.12232
Authors: Stipe Frkovic,Metod Jazbec,Dan Zhang,Christian A. Naesseth,Ilija Bogunovic,Eric Nalisnick
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Masked diffusion language models (dLLMs) have recently emerged as a competitive alternative to autoregressive language models, with the promise of faster inference via parallel token generation. A notable limitation of the masked formulation, however, is that once a token has been unmasked it can no longer be revised, leaving dLLMs vulnerable to early sampling mistakes. To address this, a growing body of work has sought to extend masked dLLMs with self-correcting (remasking) capabilities. One appealing subset of these methods does so in a training-free, post-hoc manner based on token confidences, with encouraging early reported results. In this work, we revisit the empirical evaluation of a representative post-hoc remasking method, WINO [Hong et al., 2026], and find that under standard decoding settings (shorter block lengths) it brings little-to-no benefit over confidence-based unmasking alone [Wu et al., 2025]. Extending the evaluation to non-greedy decoding, we find that while confidence-based remasking can mitigate errors introduced by increased stochasticity to some extent, it also exacerbates the diversity collapse previously reported for confidence-based unmasking. Overall, our results show that the benefits of post-hoc confidence-based remasking are highly setting-dependent, underscoring the need for a more comprehensive evaluation framework.

[LG-10] How Low Can You Go? Active Learning for Sparse Model Discovery in the Ultra-Low-Data Limit

Link: https://arxiv.org/abs/2606.12182
Authors: Ana Larrañaga,Urban Fasel,Steven L. Brunton
Categories: Machine Learning (cs.LG); Dynamical Systems (math.DS); Optimization and Control (math.OC)
Comments: 20 pages, 10 figures

Abstract:Identifying the governing equations of complex dynamical systems remains a fundamental challenge across science and engineering. While early approaches relied on empirical data and heuristics, modern data-driven methods offer greater flexibility and fewer assumptions. However, data acquisition in real-world settings is often expensive. This work addresses this challenge by introducing an active learning strategy for dynamics discovery in the ultra-low data limit. Rather than sampling randomly, our method iteratively prioritizes regions that are most informative for model identification. This approach builds on Sparse Identification of Nonlinear Dynamics (SINDy), and utilizes an ensemble extension, E-SINDy, to estimate epistemic uncertainty and guide the sampling for both ordinary and partial differential equations (ODEs/PDEs). For ODEs, an exhaustive analysis is conducted on the Lorenz system across varying data budgets and noise levels. For PDEs, two systems with contrasting dynamical characteristics are examined: the Burgers’ equation, where a sharp shock front creates a distinction between informative and uninformative regions, and the Kuramoto-Sivashinsky equation, which presents a more spatially complex sampling landscape. Across all scenarios, the proposed method accurately identifies the governing dynamics with significantly fewer data samples than random sampling.

[LG-11] PCA-Enhanced Adaptive NVAR Framework for High-Resolution Sea Surface Temperature Forecasting in the East Sea

Link: https://arxiv.org/abs/2606.12141
Authors: Sherkhon Azimov,Susana López-Moreno,Eric Dolores-Cuenca,JinYong Choi,Sangil Kim
Categories: Machine Learning (cs.LG)
Comments: 14 pages, 7 figures

Abstract:Accurate forecasting of sea surface temperature (SST) in regional seas such as the East Sea is crucial for monitoring marine ecosystems, assessing climate risks, managing fisheries, and conducting naval operations. Traditional numerical ocean models provide reliable predictions but are computationally expensive and often unsuitable for real-time forecasting. Many deep learning methods also struggle with high-dimensional spatiotemporal ocean data and experience error accumulation over longer forecasting periods. This study builds on our previously proposed Adaptive Next-Generation Reservoir Computing (Adaptive NVAR) framework, initially introduced and tested on synthetic dynamical systems, and extends it to ocean forecasting. We present a reduced-order forecasting framework that combines Singular Value Decomposition (SVD) with Adaptive NVAR to predict SST dynamics in the East Sea. SST fields are compressed into a low-dimensional representation using SVD, which extracts dominant modes of ocean variability. Adaptive NVAR models the temporal evolution of these latent states, and the predicted states are reconstructed into SST forecasts. We evaluate the framework using regional ocean datasets and compare it with the standard NG-RC/NVAR. Results show that Adaptive NVAR consistently achieves lower forecasting errors across multiple prediction horizons. In addition, SVD reduces computational complexity, resulting in a fast and scalable framework suitable for real-time ocean forecasting.

[LG-12] A Riemannian Approach to Low-Rank Optimal Transport

Link: https://arxiv.org/abs/2606.12120
Authors: Pratik Jawanpuria,Bamdev Mishra
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Abstract:Low-rank optimal transport (OT) mitigates the quadratic scaling of classical solvers, yet existing approaches rely heavily on first-order mirror-descent updates that require careful hyperparameter tuning and ignore the optimization landscape’s curvature. To address these limitations, we propose a unified Riemannian geometric framework for low-rank OT, modeling balanced and unbalanced rank-$r$ positive factored couplings as novel smooth embedded submanifolds of the positive orthant. By equipping these manifolds with the Fisher-Rao product metric, we derive tractable formulations for Riemannian projectors, retractions, and Hessian-vector products. Our cost-agnostic framework seamlessly extends to linear OT, Gromov-Wasserstein (GW), fused GW, and their unbalanced counterparts. For balanced OT, our geometric ingredients are computed via efficient conjugate-gradient and iterative Bregman updates. For unbalanced OT, our operations reduce to closed-form scalings, completely eliminating inner iterative loops. In both regimes, per-iteration complexity scales linearly with dataset size, and we provide a rank-sufficiency certificate for global optimality verification. Extensive experiments across a range of problem sizes demonstrate that our regularization-free first- and second-order solvers achieve faster convergence and superior performance over existing state-of-the-art low-rank OT solvers.

[LG-13] Efficient Time Series Clustering from Multiscale Reservoir Dynamics with Granular-Ball Anchoring Graph Optimization IJCAI2026

Link: https://arxiv.org/abs/2606.12077
Authors: Yifan Wang,Lifeng Shen,Shuyin Xia,Yi Wang
Categories: Machine Learning (cs.LG)
Comments: Accepted by IJCAI 2026

Abstract:Time-series clustering remains challenging due to the inherent trade-off between clustering effectiveness and computational efficiency. Similarity-based methods often suffer from quadratic complexity caused by pairwise distance computations, while deep learning-based approaches typically rely on costly iterative training and a large number of trainable parameters. In this paper, we propose MSRGC-Net, an efficient time-series clustering framework that integrates multiscale reservoir computing, granular-ball-based anchoring graph construction, and consensus learning. MSRGC-Net adopts a training-free reservoir computing paradigm to extract multiscale temporal representations from raw time series without backpropagation, significantly reducing computational overhead. To capture the intrinsic structure of the resulting representations, granular-ball computing is employed to adaptively model data distributions via density-consistent regions, yielding compact and robust anchor graph representations. Furthermore, a consensus-based anchoring graph optimization strategy is introduced to effectively align multiscale reservoir representations and integrate complementary information across temporal scales. Extensive experiments on widely used univariate and multivariate benchmark datasets demonstrate that MSRGC-Net consistently outperforms state-of-the-art methods in clustering performance while maintaining superior computational efficiency.

[LG-14] Categorical Robustness Assessment for Machine Learning based Network Intrusion Detection Systems

Link: https://arxiv.org/abs/2606.12075
Authors: Mayank Raj,Nathaniel D. Bastian,Lance Fiondella,Gokhan Kul
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:Network Intrusion Detection Systems (NIDS) rely heavily on Machine Learning (ML), but ML models can be manipulated via adversarial attacks. These attacks add carefully crafted perturbations to network traffic data that lead to misclassifications. While prior work has demonstrated adversarial vulnerabilities in isolated settings, systematic comparisons across architectures, attack classes, and attack categories under controlled attack conditions remain limited, leaving practitioners without clear guidance on which models to deploy in adversarial environments. This paper asks a simple question: what type of classifier architectures actually hold up when attackers try to manipulate the systems? We put three popular architectures through their paces: a 1D Convolutional Neural Network (CNN), a Long Short-Term Memory (LSTM) network, and a Random Forest (RF) ensemble. Using the ACI-IoT-2023 dataset (over 1.2 million samples spanning 12 attack types), we subject each model to FGSM and PGD adversarial attacks, which apply gradient-based perturbations in normalized feature space consistent with established adversarial ML evaluation protocols, at perturbation budgets ranging from $\epsilon=0.01$ to $\epsilon=0.1$. Surprisingly, Random Forest achieved near-perfect baseline accuracy (99.98%), yet collapsed catastrophically under attack, dropping 73 percentage points at the smallest perturbation we tested. CNN, on the other hand, retained 95.5% accuracy at $\epsilon=0.01$ and degraded gracefully as perturbations increased. LSTM fell somewhere in between. These findings challenge the conventional wisdom: high baseline accuracy means nothing if a model shatters at the first sign of adversarial pressure. For practitioners deploying intrusion detection in adversarial environments, we recommend CNN-based architectures and provide scenario-specific deployment guidance.
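
For readers unfamiliar with FGSM, the sketch below applies a single signed-gradient step in a normalized feature space. It uses a tiny logistic-regression model as a stand-in, which is an assumption of this example: the paper's classifiers are a CNN, an LSTM, and a Random Forest, and the abstract does not say how gradients are obtained for each. All weights and feature values are made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, w, b):
    """Probability of the positive class under a logistic model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def sign(g):
    return (g > 0) - (g < 0)


def fgsm(x, y, w, b, epsilon):
    """One FGSM step: x + epsilon * sign(dLoss/dx), clipped to [0, 1].

    For binary cross-entropy on a logistic model, dLoss/dx = (p - y) * w.
    """
    p = predict_proba(x, w, b)
    grad = [(p - y) * wi for wi in w]
    return [min(1.0, max(0.0, xi + epsilon * sign(g))) for xi, g in zip(x, grad)]


# Hypothetical model and sample, with features already scaled to [0, 1].
w, b = [2.0, -3.0], 0.5
x, y = [0.6, 0.2], 1
x_adv = fgsm(x, y, w, b, epsilon=0.1)
clean_p = predict_proba(x, w, b)
adv_p = predict_proba(x_adv, w, b)
```

Each feature moves by at most `epsilon`, yet the model's confidence in the true class drops, which is the effect the perturbation budgets in the abstract are meant to control.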

[LG-15] Attention by Synchronization in Coupled Oscillator Networks

Link: https://arxiv.org/abs/2606.12059
Authors: Fabio Pasqualetti,Taosha Guo
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Adaptation and Self-Organizing Systems (nlin.AO)
Comments:

Abstract:We address transformer attention on energy-constrained physical substrates. Softmax attention requires exponentiation and global reduction, operations with high energy cost on von Neumann hardware and no natural physical analog. We show that Kuramoto synchronization dynamics (which arise in electrical, mechanical, superconducting, and charge-density-wave oscillator arrays, among other physical systems) implement a well-defined attention operation without either. The resulting mechanism, fixed-query oscillator attention, replaces softmax’s arithmetic with the equilibration of a gradient flow on the sphere: queries are learned anchors fixed on the sphere, and free oscillators evolve under Kuramoto-Lohe dynamics until they settle at positions encoding attention weights via cosine similarity. Because the computation is equilibration, it requires no exponentiation; the only global operation is an affine normalization at readout. The fixed point is provably unique and globally attractive from almost every initial condition, a guarantee that holds across every physical realization. Empirically, at the minimal hardware configuration (oscillator dimension $d_{\mathrm{osc}} = 2$), oscillator attention outperforms softmax on keyword spotting (+1.00 pp) and on subject-verb agreement (+5.27 pp on hard sentences, with zero training failures versus one in five for softmax). On causal language modeling, where softmax retains an advantage, oscillator attention closes the gap as $d_{\mathrm{osc}}$ grows: from +11.09 PPL at $d_{\mathrm{osc}} = 2$ to +2.98 PPL at $d_{\mathrm{osc}} = 32$ on WikiText-2, and from +2.39 PPL at $d_{\mathrm{osc}} = 2$ to +0.57 PPL at $d_{\mathrm{osc}} = 32$ on TinyStories. The main objective of this work is not to replace softmax in software but to provide a mathematically grounded blueprint for accurate attention on physical substrates.

[LG-16] Simplicity Suffices for Parameter Noise Injection in Stochastic Gradient Descent IJCAI2026

Link: https://arxiv.org/abs/2606.12054
Authors: Benjamin Leblanc,Louis-Jacob Lebel,Teddy Kana,Richard Kamel
Categories: Machine Learning (cs.LG)
Comments: Accepted at the Data Science Meets Optimisation workshop in IJCAI 2026

Abstract:Injecting noise into the optimization process is a well-established technique for improving the training and generalization of deep neural networks. Yet, despite the breadth of existing approaches, it remains unclear which design choices truly matter in practice. In this work, we investigate parameter noise injection for stochastic gradient descent, focusing on two key questions: how to efficiently pair each training example with its own perturbation in mini-batch training, and whether sophisticated noise parameterizations or multi-sample gradient averaging yield meaningful gains over simpler alternatives. To address the first question, we leverage a distributional identity for linear layers that allows per-example noise injection without breaking batched computation. To address the second, we systematically compare several diagonal Gaussian parameterizations against an isotropic baseline across varying noise levels on CIFAR100. Our results consistently show that simple, lightweight strategies, isotropic noise with a single perturbed forward pass per update step, recover most of the benefit of more complex schemes. These findings suggest that simplicity suffices for parameter noise injection, and that practitioners need not resort to elaborate perturbation designs to reap the optimization and generalization benefits of noisy SGD.
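
One well-known distributional identity for linear layers is the following: if every weight receives independent Gaussian noise with standard deviation sigma, then the noisy output (W + E)x has the same distribution as Wx plus sigma * ||x|| times a standard Gaussian vector. The sketch below uses it to give each example in a batch its own perturbation without building a separate noisy weight matrix per example. Whether this is the exact identity used in the paper is an assumption; the abstract does not spell it out, and the function name `noisy_linear_batch` is hypothetical.

```python
import math
import random


def noisy_linear_batch(W, X, sigma, rng):
    """Per-example isotropic weight noise, sampled in output space.

    W: [out][in] weights, X: [batch][in] inputs.
    Equivalent in distribution to using a fresh W + E for each example,
    with E_ij ~ N(0, sigma^2) i.i.d.
    """
    outputs = []
    for x in X:
        norm = math.sqrt(sum(v * v for v in x))
        row = []
        for w_row in W:
            clean = sum(wi * xi for wi, xi in zip(w_row, x))
            row.append(clean + sigma * norm * rng.gauss(0.0, 1.0))
        outputs.append(row)
    return outputs


rng = random.Random(0)
W = [[1.0, 2.0], [0.0, -1.0]]
X = [[3.0, 4.0], [1.0, 0.0]]
noisy_out = noisy_linear_batch(W, X, sigma=0.1, rng=rng)
```

The cost per example is one clean matrix-vector product plus one Gaussian draw per output unit, so batched computation is preserved.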

[LG-17] Reliable Error Estimation for PINNs: Lower and Upper A Posteriori Bounds

Link: https://arxiv.org/abs/2606.12050
Authors: Ismail Huseynov,Arzu Ahmadova,Agamirza Bashirov
Categories: Machine Learning (cs.LG); Dynamical Systems (math.DS)
Comments:

Abstract:Physics-informed neural networks (PINNs) combine machine learning with physical laws to solve differential equations. While existing results provide rigorous a posteriori upper bounds for PINN prediction errors, complete certification also requires complementary lower information in order to obtain computable two-sided error enclosures. In this paper, we derive computable a posteriori lower bounds for PINN errors in ordinary differential equations on suitable certified state-space domains under a localized strong monotonicity condition. We combine these estimates with complementary localized upper bounds under a one-sided Lipschitz condition, which is weaker than the global Lipschitz assumption used in previous work and can yield sharper upper error bands. The resulting bounds depend only on the neural-network approximation, the ODE residual, and local monotonicity and growth constants, and therefore do not require access to the exact solution. For linear time-invariant and time-varying systems, we further derive explicit formulas in terms of the minimal and maximal eigenvalues of the symmetric part of the system matrix. We also discuss the distinction between soft and hard enforcement of initial conditions in PINNs and explain why exact enforcement can make the scalar lower certificate uninformative. To recover nontrivial lower information in the linear setting, we use a signed-residual finite-probe certificate based on coordinate unit vectors. We also formulate a certificate-informed training strategy in which the propagated upper certificate is used as an auxiliary regularizer, while lower certificates remain post-training diagnostics. Altogether, the proposed framework provides rigorous and practically computable error certificates for PINN approximations of ODEs, while making explicit the domains and model classes for which the assumptions can be verified.
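
To show what a two-sided enclosure of this kind can look like, here is a hedged sketch of the textbook differential-inequality argument. It is not taken from the paper, whose exact statements and constants may differ. The notation is assumed for this example: $\hat{x}$ is the PINN approximation, $\delta = \hat{x} - x$ the error, $r$ the ODE residual, $\mu$ a one-sided Lipschitz constant, and $m$ a strong monotonicity constant, both valid on the certified domain.

```latex
\begin{aligned}
&\dot{x} = f(t,x), \qquad \delta(t) = \hat{x}(t) - x(t), \qquad
  r(t) = \dot{\hat{x}}(t) - f\bigl(t,\hat{x}(t)\bigr), \\[4pt]
&m\,\|u-v\|^{2} \;\le\; \bigl\langle f(t,u)-f(t,v),\,u-v \bigr\rangle \;\le\; \mu\,\|u-v\|^{2}, \\[4pt]
&\tfrac{d}{dt}\,\tfrac12\|\delta\|^{2}
  = \bigl\langle f(t,\hat{x})-f(t,x),\,\delta \bigr\rangle + \langle r,\delta\rangle
  \;\Longrightarrow\;
  m\,\|\delta\| - \|r\| \;\le\; \tfrac{d}{dt}\|\delta\| \;\le\; \mu\,\|\delta\| + \|r\|
  \quad (\delta \neq 0), \\[4pt]
&\exp(mt)\,\|\delta(0)\| - \int_{0}^{t} \exp\bigl(m(t-s)\bigr)\,\|r(s)\|\,ds
  \;\le\; \|\delta(t)\| \;\le\;
  \exp(\mu t)\,\|\delta(0)\| + \int_{0}^{t} \exp\bigl(\mu(t-s)\bigr)\,\|r(s)\|\,ds .
\end{aligned}
```

Both sides involve only the residual, the initial error, and the two constants, so neither requires the exact solution. With $\delta(0) = 0$, as under hard enforcement of the initial condition, the lower bound in this sketch is non-positive and therefore carries no information, which matches the abstract's remark about exact enforcement.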

[LG-18] Bootstrapped Monitoring: Leveraging Transparent Reasoning to Oversee Stronger AI Agents

Link: https://arxiv.org/abs/2606.11998
Authors: Frank Xiao,Mary Phuong
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Trusted monitoring is a cornerstone of AI control. However, as frontier models grow more capable, the increasing capabilities gap between trusted and untrusted models may render trusted models unreliable monitors. We introduce bootstrapped monitoring, a protocol that addresses this by inserting a stronger, intermediate untrusted model with transparent chain-of-thought reasoning into the oversight chain. The untrusted monitor ($U_m$) evaluates the agent’s actions, while a weaker trusted model ($T$) oversees $U_m$’s reasoning to detect collusion. We evaluate bootstrapped monitoring on multi-turn software engineering tasks (BashArena) across multiple agents and monitors. Bootstrapped monitoring substantially improves catch rates over trusted-only monitoring, even when the untrusted monitor actively colludes with the agent, provided we have access to its raw chain-of-thought. Our results suggest that bootstrapped monitoring can extend the useful lifetime of trusted models in control as AI capabilities advance.

[LG-19] What Uncertainties Do We Need for Dynamical Systems? ICML

Link: https://arxiv.org/abs/2606.11988
Authors: Yusuf Sale,Christopher Bülte,Felix Czaja,Joshua Stiller,Eyke Hüllermeier
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: EIML@ICML

Abstract:The distinction between aleatoric and epistemic uncertainty has received considerable attention in machine learning research, mainly in the context of supervised learning but also in other settings such as generative modeling. In this paper, we offer a machine learning perspective on uncertainty modeling for dynamical systems, which has been studied much less so far. In particular, we ask: what uncertainties do we need for dynamical systems? We discuss sources of uncertainty, clarify their nature (aleatoric or epistemic), and consider how the objectives of representing and quantifying uncertainty vary across different tasks.

[LG-20] PAWS: Preference Learning with Advantage-Weighted Segments ICML2026

Link: https://arxiv.org/abs/2606.11982
Authors: Aleksandar Taranovic,Onur Celik,Niklas Freymuth,Ge Li,Serge Thilges,Huy Le,Tai Hoang,Rania Rayyes,Gerhard Neumann
Categories: Machine Learning (cs.LG)
Comments: Published as a conference paper at ICML 2026

Abstract:Preference-based reinforcement learning (PbRL) learns policies from human trajectory-level comparisons, avoiding explicit reward design and expert demonstrations. Existing methods typically train utility functions on trajectory or segment-level preferences while relying on per-step utility estimates during policy optimization. This training and inference mismatch induces a distribution shift that severely degrades temporal credit assignment and limits policy learning. We analyze this issue and propose PAWS, a segment-based preference learning method that performs policy updates directly using segment-level advantage functions. By aligning utility training with policy optimization, PAWS preserves trajectory-level preference information and avoids unreliable per-step learning signals. Experiments on simulated robotic manipulation and locomotion tasks demonstrate that PAWS consistently outperforms existing PbRL approaches, highlighting the importance of distribution-consistent preference learning.

[LG-21] Efficient Multinomial Logistic Bandit via Frequent Directions

Link: https://arxiv.org/abs/2606.11968
Authors: Linzhe He,Yu-Jie Zhang,Sifan Yang,Lijun Zhang
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:This paper studies efficient online algorithms for multinomial logistic bandits (MLogB), where the feedback distribution over $K+1$ outcomes follows a multinomial logistic model of $d$-dimensional action vectors. A representative UCB-type algorithm, OFUL-MLogB, achieves a regret bound of $\tilde{\mathcal{O}}(Kd\sqrt{T})$, but still requires $\mathcal{O}(K^3d^3)$ time and $\mathcal{O}(K^2d^2)$ space per round due to parameter estimation and optimistic reward construction, which is prohibitive in high-dimensional settings. To address this limitation, we propose EOFD-MLogB, which integrates frequent directions matrix sketching into OFUL-MLogB. By maintaining a low-rank SVD sketch of the accumulated Hessian, constrained online Newton updates in parameter estimation and $Kd \times K$ spectral-norm computations in the reward bonus are reduced to one-dimensional root-finding tasks and $K \times K$ eigenvalue computations, respectively. This yields dominant per-round time complexity $\mathcal{O}(Kd(m+K)^2)$ and space complexity $\mathcal{O}(Kd(m+K))$, where $m \ll d$ is the sketch size. We further prove a regret bound of $\tilde{\mathcal{O}}(\Delta_T(Kd\ln\Delta_T+m)\sqrt{T})$, where the sketching error factor $\Delta_T$ is controlled by the $m$-truncated spectral tail of the Hessian. Thus, when the Hessian is approximately low-rank, the regret is close to that of OFUL-MLogB. Experiments validate the computational efficiency and competitive performance.

[LG-22] HAMNO: A Hierarchical Adaptive Multi-scale Neural Operator with Physics-Informed Learning for Dynamical Systems

Link: https://arxiv.org/abs/2606.11963
Authors: Mostafa Bamdad,Mohammad Sadegh Eshaghi,Timon Rabczuk
Categories: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments:

Abstract:Neural operators provide a powerful framework for learning solution mappings of partial differential equations directly in function space. However, many existing architectures still struggle to represent nonlinear time-dependent systems that involve multi-scale structures, long-range interactions, and stable long-time evolution. In this work, we introduce the Hierarchical Adaptive Multi-scale Neural Operator (HAMNO), a neural-operator architecture that combines local convolutional representations, global spectral operators, and hierarchical encoder-decoder processing. The central component of HAMNO is a data-dependent gating mechanism that adaptively balances local and global information at each spatial location, allowing the model to resolve fine-scale features while preserving long-range dependencies. We further develop a physics-informed extension, PI-HAMNO, based on a multi-objective loss strategy that combines data fitting with strong- and weak-form physics constraints. The strong-form term penalizes the domain-integrated squared PDE residual in physical coordinates, while the weak-form term is constructed by multiplying the governing residual by finite-element test functions and evaluating the resulting element integrals using centroid-based tetrahedral quadrature. The framework is evaluated on non-periodic Allen-Cahn (AC), Cahn-Hilliard (CH), and Swift-Hohenberg (SH) equations defined on cubic domains. Across long-horizon rollout, data-limited training, out-of-distribution initial-condition shifts, and random-seed variations, HAMNO improves predictive accuracy over standard neural-operator baselines, while PI-HAMNO further enhances stability, physical consistency, and data efficiency. The implementation is publicly available at this https URL . 

[LG-23] Online Shift Detection and Conformal Adaptation for Deployed Safety Classifiers

Link: https://arxiv.org/abs/2606.11949
Authors: Jun Wen Leong
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
Comments: 16 pages, 4 figures, 7 tables. Code and data at this https URL

Abstract:We present an online monitoring system for distributional shift in deployed safety classifiers, using calibrated sequential statistics to detect when a classifier has moved out of distribution. Upon detection, a conformal abstention layer adapts decision thresholds to recover a target error rate epsilon=0.1. In a pre-registered factorial evaluation (4 classifiers x 5 shift conditions x 20 seeds x 2 window sizes, 800 cells), the system achieves 86.6% valid detection (693/800, 95% CI [84.1%, 88.8%]) with mean latency of 39.5 steps. Detection holds across three ground-truth regimes: synthetic onset (86.6%), real temporal jailbreaks (85%, 17/20), and GCG adversarial attacks. Weighted conformal prediction recovers up to 39 pp of lost coverage for DeBERTa (ESS=46/300) but collapses for all other classifiers (ESS~300): logistic density ratio estimation achieves perfect source/target separability in high-dimensional embedding spaces, clipping all importance weights to the floor. DeBERTa shows a gradient from effective correction (paraphrase, ESS=46) to near-total collapse (adversarial suffix, ESS=206). PCA to 32 dimensions breaks the collapse, recovering 33 pp for Llama Guard and 21 pp for ShieldGemma. Variance decomposition reveals that classifier (eta^2=0.243), shift type (eta^2=0.237), and their interaction (eta^2=0.185) all contribute substantially to detection latency variance (all p < 0.001), indicating that per-classifier monitoring profiles are necessary.

[LG-24] Neuro-Relational Programs: Unifying Queries and Neural Computation over Structured Data

Link: https://arxiv.org/abs/2606.11946
Authors: Arie Soeteman,Balder ten Cate,Maurice Funk,Benny Kimelfeld,Carsten Lutz,Moritz Schönherr
Categories: Databases (cs.DB); Computational Complexity (cs.CC); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments: 37 pages

Abstract:The conventional approach to deep learning over relational databases applies neural models, such as Graph Neural Networks (GNNs), to a graph representation of the database. Recent approaches instead operate on databases directly, associating tuples with embeddings and extending query mechanisms to jointly process embeddings and relational content. Inspired by these developments, we introduce Neuro-Relational Programs (NRPs), a declarative query language for relational databases whose facts carry numeric vector embeddings. NRPs extend Datalog-style rules with operations that combine, aggregate, and transform embeddings, thereby interleaving relational reasoning and learnable neural components within a single formalism. This yields a general approach to neural computation over relational data: an NRP can be read both as a query plan with trainable components and as a neural architecture with relational structure built in. Natural syntactic fragments of NRPs recover existing architectures and query formalisms. Zero-ary NRPs correspond to non-adaptive query algorithms; monadic NRPs generalize GNN-style message passing and precisely capture Deep Homomorphism Networks, a connection that we extend to frontier-guarded NRPs over databases with row-ids. We characterize the expressive power of unrestricted NRPs with ReLU-FFN transformations by FOCQ, an extension of first-order logic with counting interpreted over real-weighted structures, yielding a precise connection with uniform TC^0 over ordered databases. Together, these results establish NRPs as a broad declarative framework for querying and neural computation over relational data.
ACM classes: H.2.3; F.4.1; I.2.4. Cite as: arXiv:2606.11946 [cs.DB] (or arXiv:2606.11946v1 [cs.DB] for this version). DOI: https://doi.org/10.48550/arXiv.2606.11946 (arXiv-issued DOI via DataCite, pending registration)

[LG-25] Critic Architecture Matters: Dual vs. Unified Critics for Humanoid Loco-Manipulation ICRA2026

Link: https://arxiv.org/abs/2606.11891
Authors: Mehmet Turan Yardımcı
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: Accepted at the ICRA 2026 Workshop on Reinforcement Learning for Imitation Learning (RL4IL), Vienna, Austria. 4 pages, 2 figures

Abstract:Multi-objective reinforcement learning for humanoid robots must coordinate locomotion and manipulation within a single policy. A natural design choice is whether to use a single (unified) critic that estimates the combined value of all objectives, or separate (dual) critics with disjoint reward signals. We present a controlled comparison on the Unitree G1 humanoid (23 active DoF) in NVIDIA Isaac Lab, training loco-manipulation policies through a sequential curriculum spanning 13 levels from stationary reaching to walking with variable-orientation targets. In standardized evaluation, dual-critic policies reach targets 3.5× faster (6.5 vs. 22.6 simulation steps), achieve 2× higher throughput (14.3 vs. 7.0 validated reaches per 1,000 steps), and attain higher validated reach rates (65.2% vs. 53.8%) compared to the unified-critic policy. Notably, additional anti-gaming reward mechanisms provide no further improvement beyond the architectural change alone (60.9% vs. 65.2%). These results have direct implications for the emerging paradigm of RL fine-tuning of imitation-learned policies: when refining a pre-trained manipulation policy with RL, a unified critic risks suppressing the learned behavior through competing locomotion gradients. These findings demonstrate that critic architecture is a primary - and often overlooked - design choice in multi-objective humanoid RL, with greater impact than reward engineering on reaching efficiency.

[LG-26] MemNovo: Look Back at the Spectrum for Balanced De Novo Peptide Sequencing from Mass Spectrometry

Link: https://arxiv.org/abs/2606.11868
Authors: Dongxin Lyu,Jingbo Zhou,Hongxin Xiang,Yuqiang Li,Jun Xia
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: Code: this https URL

Abstract:De novo peptide sequencing from tandem mass spectrometry is pivotal in proteomics, enabling identification of novel peptides without reference databases. While recent Transformer-based encoder-decoder models have achieved remarkable performance, we uncover a critical pathology in their inference dynamics. Through comprehensive feature scaling experiments, we demonstrate that existing auto-regressive peptide decoders tend to over-rely on generated-sequence priors while progressively under-utilizing fine-grained physical evidence from the input mass spectrum. This phenomenon leads to suboptimal results, where generated peptide sequences are biologically plausible yet not faithful to the input spectrum. To rectify this, we propose MemNovo, a training-free and plug-and-play mechanism that re-balances peptide and spectral contributions at inference time. MemNovo alleviates the information bottleneck by establishing a persistent spectral memory bank and injecting retrieved features directly into the final decoding stage via an ultra-conservative residual connection. Theoretical analysis confirms that this mechanism restores the mutual information between the decoder state and the raw spectrum. Extensive experiments on the Nine Species benchmark with two representative baselines, Casanovo and InstaNovo, demonstrate that MemNovo consistently improves both amino acid precision and peptide precision, achieving up to 39.1% relative improvement in peptide precision for Casanovo and up to 3.9% for InstaNovo, with negligible computational overhead.

[LG-27] RePAIR: Predictive Self-Supervised Representation Learning in Chess

Link: https://arxiv.org/abs/2606.11860
Authors: Christoph Koller,Johannes Fürnkranz,Timo Bertram
Categories: Machine Learning (cs.LG)
Comments: Accepted for oral presentation at IEEE Conference on Games 2026

Abstract:In this paper, we introduce Representation Prediction via Autoencoding using Iterative Refinement (RePAIR) - a novel self-supervised representation learning architecture that synthesizes Masked Autoencoders (MAE), Joint Embedding Predictive Architectures (JEPA), and Bidirectional Encoder Representations from Transformers (BERT). We demonstrate how it can be used to encode objects in sequential data like consecutive chess positions into compact yet meaningful representations. The basic principle of the architecture is to mask large portions of a sequence of latent states, similar to BERT and MAE. Then, we apply a lightweight Predictor to the latent representations that repairs gaps in the sequence in a lower-dimensional embedding space akin to JEPA. Our experiments in the domain of chess show that the Encoder refines the board representations such that meaningful chess concepts emerge clustered in the latent space. Furthermore, reconstructions of the masked board states show that the model is able to reason about the piece movements without relying on costly reinforcement learning methods. Lastly, we find that the resulting representation space allows for quick and intuitive dissections of chess games by observing the game path trajectories in this semantically rich space.

[LG-28] TaskFusion: Continual Anomaly Detection for Heterogeneous Tabular Data

Link: https://arxiv.org/abs/2606.11844
Authors: Dayananda Herurkar,Federico Raue,Joachim Folz,Jörn Hees,Andreas Dengel
Categories: Machine Learning (cs.LG)
Comments: 22 pages

Abstract:Continual anomaly detection in tabular data is challenging and remains largely underexplored, particularly in settings with heterogeneous feature schemas, distribution shifts, and severe class imbalance. In many real-world applications, data arrive sequentially from diverse domains, rendering conventional continual learning methods ineffective due to their reliance on a fixed input space. We propose a continual learning (CL) method that overcomes these challenges and continually learns from different tasks. Our method consists of three main parts: our AGF model, Taskfusion augmentation, and outlier exposure. The AGF model maps task-specific features into a shared space, then aligns distributions to reduce representation drift, and learns anomaly decision boundaries in the aligned space. To improve stability, we introduce Taskfusion augmentation, combining boundary-aware interpolation within tasks to refine the model's anomaly boundaries and cross-task mixing to transfer anomaly structure across datasets. To handle class imbalance and memory constraints, we employ tabular dataset distillation to store compact synthetic replay samples, which are jointly used with augmented data in an outlier exposure objective for robust anomaly detection. We evaluate the approach on 21 heterogeneous datasets across multiple domains. Results show that our approach substantially improves continual anomaly detection performance over sequential fine-tuning and other CL baselines while reducing catastrophic forgetting and maintaining stable detection across heterogeneous datasets.

[LG-29] Flow Matching with In-Context Priors for Out-of-Distribution Brain Dynamics

Link: https://arxiv.org/abs/2606.11833
Authors: Sam Gijsen,Michał Łukomski,Marc-André Schulz,Kerstin Ritter
Categories: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments: Code and pretrained models available at this https URL

Abstract:Flow matching and diffusion models enable conditional generation across domains ranging from images to proteins, with recent extensions to out-of-distribution contexts. Yet generative models of neural time series have largely remained restricted to categorical conditioning, precluding compositional and zero-shot generalization. In this work, we propose a per-timestep conditioned diffusion transformer for generating realistic fMRI brain dynamics during unseen cognitive tasks by injecting both compositional language and optional spatial priors in-context. Such zero-shot generation could enable counterfactual neuroscience by supporting in-silico design and evaluation of novel cognitive experiments before empirical validation. Leveraging this model, we evaluate across hundreds of held-out task conditions and characterize predictive performance in relation to the training manifold. From language alone, the model recovers region-specific recruitment across tasks and held-out spatial activation patterns. Spatial priors, when available, complement the text pathway by anchoring generation in regions of task space where language alone degrades, while retaining the compositional structure needed for counterfactual task specification. To our knowledge this is the first generative model of whole-cortex fMRI dynamics for unseen cognitive tasks, advancing counterfactual neuroscience and data-driven experimental design.

[LG-30] Space-sampled Value Decay: Forgetting Mechanisms for Non-stationary Deep Reinforcement Learning ICML2026

Link: https://arxiv.org/abs/2606.11797
Authors: Felix Störck,Fabian Hinder,Barbara Hammer
Categories: Machine Learning (cs.LG)
Comments: Accepted at The 2nd Workshop on Epistemic Intelligence in Machine Learning, EIML@ICML 2026 (non-archival)

Abstract:Studies on rodents such as mice have shown that they can adapt their behavior when dealing with changing parameters ("drift") of the environment even if no information about the change is provided (uncertainty), a behavior that can be modeled by forgetting mechanisms. Non-stationary Reinforcement Learning (NSRL) deals with adapting state-of-the-art RL methods to changing environments; these, however, usually require (partially) perfect information about the drift, such as "task IDs" or "context". To mitigate the effects of drift, this work develops Space-sampled Value Decay as an explicit forgetting mechanism for value-based deep RL architectures, a simple yet effective approach. In particular, we demonstrate and discuss positive effects but also limitations in achieved returns for modifications of Deep Q-networks (DQN) and Soft Actor-Critic (SAC) when evaluated on non-stationary environments.

[LG-31] RCAP: Robust Class-Aware Probabilistic Dynamic Dataset Pruning UAI2025

Link: https://arxiv.org/abs/2606.11761
Authors: Atif Hassan,Swanand Khare,Jiaul H. Paik
Categories: Machine Learning (cs.LG)
Comments: Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence (UAI 2025)

Abstract:Dynamic data pruning techniques aim to reduce computational cost while minimizing information loss by periodically selecting representative subsets of input data during model training. However, existing methods often struggle to maintain strong worst-group accuracy, particularly at high pruning rates, across balanced and imbalanced datasets. To address this challenge, we propose RCAP, a Robust, Class-Aware, Probabilistic dynamic dataset pruning algorithm for classification tasks. RCAP applies a closed-form solution to estimate the fraction of samples to be included in the training subset for each individual class. This fraction is adaptively adjusted in every epoch using class-wise aggregated loss. Thereafter, it employs an adaptive sampling strategy that prioritizes samples having high loss for populating the class-wise subsets. We evaluate RCAP on six diverse datasets ranging from class-balanced to highly imbalanced using five distinct models across three training paradigms: training from scratch, transfer learning, and fine-tuning. Our approach consistently outperforms state-of-the-art dataset pruning methods, achieving superior worst-group accuracy at all pruning rates. Remarkably, with only 10% data, RCAP delivers 1% improvement in performance on class-imbalanced datasets compared to full data training while providing an average 8.69× speedup. The code can be accessed at this https URL

[LG-32] TacCoRL: Integrating Tactile Feedback into VLA via Simulation

Link: https://arxiv.org/abs/2606.11743
Authors: Siyu Ma,Yuqi Liang,Chang Yu,Yunuo Chen,Hao Su,Yixin Zhu,Yin Yang,Chenfanfu Jiang
Categories: Robotics (cs.RO); Graphics (cs.GR); Machine Learning (cs.LG)

Abstract:Vision-language-action (VLA) models provide strong visual, language, and action priors for robot manipulation, but visual observations alone often miss the local contact state required for contact-rich tasks. We present TacCoRL, a scalable framework that injects Tactile feedback into VLA policies and improves them through sim-real Co-training and simulation-based reinforcement learning (RL), without requiring large-scale tactile pretraining or extensive real-world contact exploration. The key idea is not only adding touch as an input, but learning how contact readings should modulate action responses in near-failure states that are rare in demonstrations and risky to collect on hardware. We use a real-aligned simulator as a closed-loop training environment for contact interaction. Mixed simulated and real trajectories first warm-start tactile-conditioned actions in the pretrained policy. Reinforcement learning with verifiable task rewards then optimizes the policy using simulated contact rollouts. It reinforces tactile-conditioned actions that lead to task completion, while a supervised objective on real trajectories keeps the refined policy anchored to deployment visual, tactile, and action distributions. The resulting policy transfers directly to the real robot without privileged simulation state or online real-world RL. Across four bimanual contact-rich tasks, the final visuo-tactile policy achieves an average success rate of 72.5%, compared to a baseline of 50.0%. Result videos and more details are available at this https URL

[LG-33] Capacity-Constrained Online Convex Optimization with Delayed Feedback

Link: https://arxiv.org/abs/2606.11711
Authors: Alexander Ryabchenko,Idan Attias,Daniel M. Roy
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Online learning with delayed feedback typically assumes that the learner can track all pending rounds until their feedback arrives. In practice, tracking resources are finite, and feedback from untracked rounds is permanently lost. In this paper, we study delayed online convex optimization (OCO) under a hard capacity constraint, where at most C pending rounds can be tracked at any time. To model delay information, we introduce a semi-clairvoyant model that refines the clairvoyant assumption from prior work: rather than requiring delays to be known at prediction time, the learner observes delay expirations online, consistent with the classical unconstrained delayed setting. Our approach proceeds via a reduction to a novel "delayed and weighted" OCO problem, using a scheduler that randomizes tracking decisions and importance-weights the resulting observations. For this base problem, we propose and analyze Delayed-Weighted FTRL and its bandit analogue, establishing regret bounds that explicitly characterize the interaction between time-varying weights and delayed feedback. Combining these base learners with our schedulers yields the first regret guarantees for capacity-constrained OCO under convex and strongly convex losses, for both first-order and bandit feedback. For first-order feedback, capacity C = \Omega(\log T) suffices to recover standard delayed OCO rates up to logarithmic factors. For bandit feedback, the regret rates are modulated by powers of (1 + \sigma_{\max}/C), where \sigma_{\max} is the maximum number of pending observations at any time. This allows the regret bound to degrade gracefully when C < \sigma_{\max}, while remaining sublinear.
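
The abstract describes a scheduler that randomizes tracking decisions and importance-weights the resulting observations. The sketch below illustrates only the generic inverse-probability weighting idea behind such a scheme, under the assumption that each round is tracked independently with a fixed probability. It does not model delays, the capacity C, or the paper's actual scheduler, and the names `weighted_observation` and `track_prob` are hypothetical.

```python
import random

def weighted_observation(gradient: float, track_prob: float, rng: random.Random) -> float:
    """Track the round with probability track_prob; reweight by 1/track_prob if tracked."""
    if rng.random() < track_prob:
        return gradient / track_prob
    return 0.0  # feedback from an untracked round is lost

def expected_observation(gradient: float, track_prob: float) -> float:
    """Exact expectation of the weighted observation over the tracking coin flip."""
    return track_prob * (gradient / track_prob) + (1.0 - track_prob) * 0.0

# Illustrative values only: true gradient 2.0, tracked one round in four.
rng = random.Random(0)
samples = [weighted_observation(2.0, 0.25, rng) for _ in range(100_000)]
mean_estimate = sum(samples) / len(samples)
print(mean_estimate)               # close to 2.0 on average
print(expected_observation(2.0, 0.25))
```

The reweighting keeps the estimate unbiased at the cost of higher variance when `track_prob` is small, which is the usual trade-off with importance weighting.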

[LG-34] A Data-Centric Framework for Detecting and Correcting Corrupted Labels

Link: https://arxiv.org/abs/2606.11699
Authors: Ha-Linh Nguyen,Hong-Anh Nguyen,Minh-Duc La,Thu-Trang Nguyen,Son Nguyen,Hieu Dinh Vo
Categories: Machine Learning (cs.LG)

Abstract:The performance of machine learning and deep learning models largely depends on the quality of the training data. However, the quality of the real-world datasets is often compromised by noisy labels, which can substantially degrade model accuracy and reliability. To address this challenge, we propose Relabeler, an end-to-end data-centric framework for detecting and correcting corrupted labels. For corrupted label detection, Relabeler jointly leverages both local and global relationships among data instances to identify potentially noisy samples. After detecting suspicious instances, Relabeler further performs label correction by estimating the most probable clean label for each instance based on both its input features and observed noisy label. Extensive experiments across multiple datasets, noise types, and noise rates demonstrate that Relabeler consistently outperforms state-of-the-art baselines, achieving up to 58% improvement in label correction precision and 6% improvement in downstream task performance.

[LG-35] Spectrally Regularized Latent Flow Matching for Turbulence Generation ICML2026

Link: https://arxiv.org/abs/2606.11691
Authors: Khalid Rafiq,Aditya G. Nair
Categories: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Comments: Accepted at the AI4Physics Workshop at ICML 2026. OpenReview: this https URL

Abstract:Latent diffusion and flow matching have emerged as leading approaches for synthetic turbulence generation, yet they systematically under-represent dissipation-range amplitudes. We introduce a latent flow matching framework with a spectrally regularized compression stage that directly targets this failure mode. On a 256^2 DNS dataset at Re_f \approx 2250, replacing an MSE-trained VAE with a zone-weighted log-spectral objective raises deep-dissipation retained spectral power from 25% to 94% in reconstruction and from 20% to 79% in unconditional generation. The improved latent representation also yields a substantially better sampling cost-fidelity tradeoff: the MSE-trained latent space imposes a fundamental quality ceiling near DD bias -0.70 that no integrator or step-count can overcome, while the spectrally regularized latent space reaches DD bias -0.117 at just 20 function evaluations. Mechanistically, encoder-decoder swap experiments show that the improvement is driven primarily by encoder-induced latent reorganization rather than decoder capacity, while a support-amplitude decomposition reveals that MSE-trained models behave as conservative suppression models, minimizing pointwise error by attenuating intermittent high-wavenumber structure. Both pipelines recover the second-order structure function and the correct sign of S_3, indicating the correct cascade direction without explicit supervision. A small residual gap in the magnitude of S_3 suggests that phase-coherent triadic organization remains a complementary axis to amplitude fidelity for future generative turbulence models.

[LG-36] Neural-Parameterized Cellular Automata for Wildfire Spread

Link: https://arxiv.org/abs/2606.11676
Authors: Maksym Zhenirovskyy,Ion Matei,Rohit Vuppala,Takuya Kurihana,Hon Yung Wonga
Categories: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: 16 pages, 9 figures

Abstract:Traditional wildfire models rely on rigid, low-dimensional parameters and static fuel maps, frequently underpredicting fire spread. To address this weakness, we introduce a hybrid deep-learning parameterized Probabilistic Cellular Automata (CA) framework implemented in JAX. Our approach employs a Multi-Scale Convolutional Neural Network to dynamically generate spatially varying parameters that govern fire-spread probability, wind alignment, and slope influence. This hybrid design captures complex, nonlinear environmental interactions while preserving the physical interpretability of the underlying three-state CA. The JAX implementation enables hardware acceleration and gradient-based parameter calibration. Evaluated on six large-scale wildfires in the western United States, the model maintains IoU 0.6 over 72-hour forecast horizons after a 10-day data assimilation window during which the model is fitted incrementally to observed perimeters; the resulting forecast is a conditional projection of fire growth under the suppression regime already encoded in those observations.

[LG-37] SpAArSIST: Sparsified AASIST for Efficient and Reliable Anti-Spoofing INTERSPEECH2026

Link: https://arxiv.org/abs/2606.11674
Authors: Anton Firc,Vojtěch Staněk,Zbyněk Lička,Kamil Malinka,Martin Perešíni
Categories: Sound (cs.SD); Machine Learning (cs.LG)
Comments: Accepted at Interspeech 2026

Abstract:We present SpAArSIST, a deployment-oriented refinement of the widely used AASIST graph pooling backend for self-supervised learning (SSL) based anti-spoofing. Motivated by redundant operations in public implementations, we replace learned pooling and stack-node attention with explicit, lightweight choices: separate train and inference graph pooling ratios (k_{\mathrm{tr}}, k_{\mathrm{inf}}), magnitude-based node scoring, and mean aggregation of graph nodes. The best overall configuration (rank 1) cuts backend compute by 20.7% (195.045M → 154.706M MACs) and model size by 4.1% (611.8k → 586.4k params), while improving out-of-domain robustness on In-the-Wild to 2.82% EER and 0.078 minDCF (from 4.64% and 0.133) and remaining competitive on ASVspoof5. We further provide a composite selection score that summarizes accuracy, calibration, and compute to support balanced deployment-oriented model choice.
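
The three lightweight choices named in the abstract (a pooling ratio, magnitude-based node scoring, and mean aggregation) can be sketched roughly as below. This is an illustration only, assuming that "magnitude" means the L2 norm of a node's feature vector and that the ratio selects the top-scoring fraction of nodes; the abstract confirms neither detail. The function `pool_nodes` and the ratio values are hypothetical.

```python
import math

def pool_nodes(nodes: list[list[float]], ratio: float) -> list[float]:
    """Keep the top ceil(ratio * N) nodes by L2 norm, then average the kept nodes."""
    keep = max(1, math.ceil(ratio * len(nodes)))
    ranked = sorted(nodes, key=lambda v: math.sqrt(sum(x * x for x in v)), reverse=True)
    kept = ranked[:keep]
    dim = len(kept[0])
    return [sum(v[i] for v in kept) / keep for i in range(dim)]

# Toy graph with four 2-D node features (illustrative values).
nodes = [[3.0, 4.0], [0.1, 0.0], [1.0, 0.0], [0.0, 2.0]]
K_TRAIN, K_INFER = 0.5, 0.75  # hypothetical train and inference pooling ratios

print(pool_nodes(nodes, K_TRAIN))  # averages the two largest-norm nodes
print(pool_nodes(nodes, K_INFER))  # averages the three largest-norm nodes
```

Because scoring and aggregation here have no learned parameters, the ratio can differ between training and inference without retraining, which is one plausible reading of why separate ratios are cheap to support.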

[LG-38] Probabilistic Salary Prediction with Graph Attention Networks and a Mixture Density Network

Link: https://arxiv.org/abs/2606.11663
Authors: Zhipei Qin,Mohammad Shokri,N. van Weeren,F.W. Takes
Categories: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
Comments: 5 pages, 3 figures

Abstract:Accurate salary prediction is critical for bridging the information gap between employers and job seekers in modern labor markets. Existing approaches predominantly yield a single point estimate and treat job attributes such as location, occupation, and industry as independent categorical features, ignoring both the inherent uncertainty and multi-modality of real-world compensation data and the rich hierarchical and semantic-similarity relationships that govern pay norms. In this paper we propose GAT-MDN, a unified framework that addresses both limitations simultaneously. For each of the three attribute domains we construct a domain-specific graph whose edges encode (i) hierarchical parent-child containment and (ii) weighted similarity links derived from a pre-trained Sentence-Transformer. Parallel Graph Attention Networks (GATs) with edge-feature-aware attention learn rich, context-sensitive node representations from these multi-relational graphs. A priority-based hierarchical selection module then assembles a composite feature vector that gracefully handles missing or coarse attributes, and a Mixture Density Network (MDN) head maps this vector to the parameters of a Gaussian Mixture Model (GMM), yielding a full conditional salary distribution. Extensive experiments on a real-world Dutch job-posting dataset of over 1 million records demonstrate that GAT-MDN significantly outperforms a non-graph MLP-MDN baseline in both Negative Log-Likelihood (NLL) and Mean Squared Error (MSE).
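
For readers unfamiliar with Mixture Density Networks, the NLL metric mentioned above is the negative log-density of the observed value under the predicted Gaussian mixture. The sketch below shows that standard formula for a one-dimensional mixture, computed with a log-sum-exp for numerical stability. It is not the paper's implementation, and the function name `gmm_nll` and the example numbers are ours.

```python
import math

def gmm_nll(y: float, weights: list[float], means: list[float], stds: list[float]) -> float:
    """Negative log-likelihood of y under a 1-D Gaussian mixture (log-sum-exp form)."""
    log_terms = [
        math.log(w) - math.log(s) - 0.5 * math.log(2 * math.pi) - 0.5 * ((y - m) / s) ** 2
        for w, m, s in zip(weights, means, stds)
    ]
    top = max(log_terms)
    return -(top + math.log(sum(math.exp(t - top) for t in log_terms)))

# Hypothetical bimodal prediction (arbitrary units): two pay clusters.
nll_near_mode = gmm_nll(3.0, [0.6, 0.4], [3.0, 5.0], [0.5, 0.8])
nll_between = gmm_nll(4.0, [0.6, 0.4], [3.0, 5.0], [0.5, 0.8])
print(nll_near_mode, nll_between)
```

A value near one of the mixture means scores a lower NLL than a value between the modes, which is the behavior a single point estimate cannot express.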

[LG-39] Bergson: An Open Source Library for Data Attribution

Link: https://arxiv.org/abs/2606.11660
Authors: Lucia Quirke,Louis Jaburi,David Johnston,William Z. Li,Gonçalo Paulo,Guillaume Martres,Girish Gupta,Stella Biderman,Nora Belrose
Categories: Machine Learning (cs.LG)

Abstract:Data attribution is a promising field in interpretability that aims to explain model behavior through the influence of its training data, with applications including debugging undesirable model behavior and training dataset curation. However, significant engineering effort is required to perform it at scale, and many cutting-edge techniques lack open-source tooling and support. Bergson is an open-source library that aims to enable faster progress in the field by providing a host of techniques that scale to very large language models and pre-training datasets. The library natively supports on-disk gradient stores and multi-node distributed training, and provides quality-of-life tools for researchers. Finally, we introduce the first open-source implementations of three leading data attribution methods: MAGIC, SOURCE, and TrackStar. The library is available at this https URL.

[LG-40] IAPO: Input Attribution-Aware Policy Optimization for Tool Use in Small Multimodal Agents

Link: https://arxiv.org/abs/2606.11652
Authors: Yifan Yang,Zhen Zhang,Jiayi Tian,Liyan Tan,Zheng Zhang
Categories: Machine Learning (cs.LG)

Abstract:This paper investigates reinforcement learning (RL) methods for improving tool-calling capabilities in multimodal small language model (SLM) agents. While existing works have explored various reward designs to improve agentic tool-calling ability, these approaches face inherent limitations for SLM training, especially under multimodal scenarios. First, many existing methods evaluate tool use correctness through exact matching against certain ground-truth or predefined formats. However, this assumption is often unsuitable for multimodal tasks, where multiple tool use paths may be valid and annotated tool trajectories are typically unavailable. Second, such sparse and brittle binary rewards provide little guidance on how to improve the underlying decision process, making them particularly difficult for multimodal SLM to learn from. To address these issues, we propose Input Attribution-Aware Policy Optimization (IAPO), an RL algorithm for improving tool use in multimodal SLM by aligning the model’s attribution across input components with that of a stronger teacher. Experiments on Qwen2.5-VL-3B show that the proposed method improves visual question answering accuracy by an average of 3% across six test sets compared with existing visual tool use work, by helping the model attend to the most relevant input evidence.

[LG-41] DeepRHP: A Hybrid Variational Autoencoder for Designing Random Heteropolymers as Protein Mimics AAAI2023

Link: https://arxiv.org/abs/2606.11651
Authors: Shuni Li,Zhiyuan Ruan,Andy Shen,Ivan Jayapurna,Ting Xu,Haiyan Huang
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Applications (stat.AP)
Comments: Oral presentation at AAAI 2023 Workshop on AI to Accelerate Science and Engineering

Abstract:Synthetic random heteropolymers (RHPs), consisting of a predefined set of monomers, offer an approach toward the design of protein-like materials. These RHPs, if designed appropriately, can mimic protein behavior and function. As such, there is a need for computational tools to efficiently guide RHP design. We bridge this gap by developing DeepRHP, a modified variational autoencoder (VAE) model under a semi-supervised framework. By equipping a classical VAE with an additional feature-based VAE, DeepRHP forces the latent space to capture structures of critical chemical features as well as individual RHP sequence patterns. In this sense, our method is versatile by allowing any relevant features to be incorporated in a hybrid manner. We demonstrate the effectiveness of DeepRHP by suggesting potential monomer compositions that stabilize membrane proteins (e.g. Aquaporin Z) in non-native environments and cross-validating our prediction with published results. The concordance between our model and true RHP function suggests strong potential in utilizing hybrid autoencoder architectures to guide RHP design for proteins and other biological compounds.

[LG-42] Structure-Preserving Neural Surrogates with Tractable Uncertainty Quantification

Link: https://arxiv.org/abs/2606.11650
Authors: Handi Zhang,Adrienne M. Propp,Brooks Kinch,Houman Owhadi,Nathaniel Trask
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)

Abstract:Recent advances in scientific machine learning provide a means of near-real-time solution to partial differential equations (PDEs), but lack the theoretical underpinnings of conventional simulators that support contemporary verification and validation. In this work, we construct data-driven reduced-order models that serve as structure-preserving, real-time surrogates. Remarkably, the exterior calculus that imposes physical conservation structure also exposes topological structure that we use to build a Gaussian process (GP) representation of uncertainty in state-flux relationships, ultimately yielding a Dirichlet-to-Neumann map for quantities of interest with closed-form expressions for posterior uncertainty. We specifically propose structure-preserving H(\mathrm{div})–L^2 subspaces of conventional Raviart–Thomas and dgP_0 elements prescribed by a lightweight transformer. Reduced-order dynamics consistent with this subspace are learned by posing a conservation law in which a GP describes the fluxes between volumes. This work hinges on a novel interface between mixed FEM spaces and GP regression; when training is posed as the optimal recovery problem (ORP), the resulting GP regression can be written as an optimization problem with equality constraints that impose a conservation structure, amenable to a fast Schur-complement training strategy. The trained model can then be solved in real time with closed-form estimators for boundary fluxes driven by prescribed Dirichlet data. The paper includes RKHS posterior error bounds for linear functionals to support uncertainty quantification, as well as numerical experiments demonstrating the accuracy of the posterior distribution as a surrogate for error estimation.

[LG-43] Tree-Structured Orthonormal Decomposition of the Aitchison Simplex ICML2026

Link: https://arxiv.org/abs/2606.11646
Authors: Daisuke Yamada, Qijun Zhang, Travis Pence, Barbara B. Bendlin, Federico Rey, Vikas Singh
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
Comments: Accepted at ICML 2026. To appear in PMLR vol. 306

Abstract:Compositional data – vectors encoding relative proportions – arise across scientific domains, including ecology, geochemistry, and genomics. The features in these data often come with known hierarchical structure (e.g., taxonomies, phylogenies, ontologies), yet existing methods either ignore this structure, discard the intrinsic Aitchison geometry, are designed for binary trees, or yield incomplete coordinate systems. We describe PolyILR, a canonical orthonormal decomposition of the Aitchison tangent space aligned with any tree topology. Our construction defines a weighted local geometry at each internal node capturing full branching structure, then lifts these to a global orthonormal basis where every coordinate corresponds to a specific tree location. On microbiome and single-cell benchmarks, PolyILR yields stable, interpretable features and enables inference at multiscale tree resolution. We also establish a novel theoretical connection to softmax classifiers, suggesting possible applications to probabilistic modeling.

[LG-44] TimeRouter: Efficient and Adaptive Routing of Time-Series Foundation Models

Link: https://arxiv.org/abs/2606.11625
Authors: Kanghui Ning, Yushan Jiang, Kashif Rasul, Anderson Schneider, Yuriy Nevmyvaka, Dongjin Song
Categories: Machine Learning (cs.LG)

Abstract:Time-series foundation models (TSFMs) are increasingly explored as predictive experts within emerging agentic time-series systems. However, TSFMs exhibit heterogeneous inductive biases, and no single model consistently dominates across forecasting regimes, making expert selection a critical challenge. Existing systems often delegate this decision to LLM-based controllers, incurring substantial inference overhead. We present TimeRouter, an efficient routing framework that leverages empirical complementarity across a pool of pretrained TSFMs through lightweight discriminative routing, selective gating, and ensemble fallback. Concretely, TimeRouter combines a learned routing head, a selective gate, and an ensemble fallback, enabling adaptive expert selection without invoking an LLM at inference time. TimeRouter achieves state-of-the-art performance on the GIFT-EVAL leaderboard, with an LB MASE of 0.6765. Beyond benchmark performance, our ablation studies provide empirical insights into TSFM routing design, highlighting the importance of pool composition and selective gating. Taken together, these results position TimeRouter as a modular and lightweight routing layer for future agentic time-series systems built upon foundation-model pools. Our code is available at this https URL.
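
The combination of a routing head, a selective gate, and an ensemble fallback can be pictured with a small sketch. The abstract does not specify how the gate works, so the confidence threshold, the plain averaging fallback, and the names `route_forecast`, `gate_threshold`, `model_a`, and `model_b` below are all illustrative assumptions rather than TimeRouter's actual design.

```python
from statistics import mean


def route_forecast(router_probs, expert_forecasts, gate_threshold=0.6):
    """Pick one expert when the router is confident, else average all experts.

    router_probs: dict mapping expert name -> routing probability (hypothetical
    output of a learned routing head).
    expert_forecasts: dict mapping expert name -> list of forecast values.
    gate_threshold: assumed confidence cut-off for the selective gate.
    """
    best = max(router_probs, key=router_probs.get)
    if router_probs[best] >= gate_threshold:
        # Gate open: trust the single selected expert.
        return best, expert_forecasts[best]
    # Gate closed: fall back to an unweighted ensemble, step by step.
    per_step = zip(*expert_forecasts.values())
    return "ensemble", [mean(step) for step in per_step]


forecasts = {"model_a": [1.0, 2.0], "model_b": [3.0, 4.0]}
confident = route_forecast({"model_a": 0.9, "model_b": 0.1}, forecasts)
uncertain = route_forecast({"model_a": 0.55, "model_b": 0.45}, forecasts)
```

In this toy setup no LLM is involved at inference time; the only per-query cost is one pass through the routing head plus, when the gate closes, the forecasts of every expert in the pool.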

[LG-45] Beyond the Golden Teacher: Enhancing Graph Learning through LLM-GNN Co-teaching

Link: https://arxiv.org/abs/2606.11583
Authors: Zhuoyi Peng, Hanlin Gu, Lixin Fan, Yi Yang
Categories: Machine Learning (cs.LG)
Comments: Code: this https URL

Abstract:Text-attributed graphs (TAGs) underlie real-world applications such as citation networks, social media, and e-commerce. Few-shot graph learning on TAGs is hard: with only a handful of labels per class and the rest of the graph unannotated, neither GNNs nor LLMs can learn well on their own. GNNs read topology and fail on cold nodes; LLMs read text and fail on text-ambiguous nodes. Existing LLM-GNN methods all follow the same recipe: designate one model as the golden teacher and use its outputs (e.g., features or pseudo-labels) to supervise the other. We argue this golden-teacher assumption breaks under sparse supervision: neither model is golden, and treating either as such transfers its blind spots into the student. We therefore ask: can we avoid designating either model as the golden teacher, and still perform effective graph learning? We answer with LLM-GNN Co-Teaching, a bidirectional co-teaching framework in which neither model is fixed as teacher. The GNN and LLM exchange their most confident pseudo-labels under an architecture-specific small-loss criterion, and both update every round. Supervision is then mined from the trajectory: whenever a node moves from cross-model contradiction at round t to cross-model agreement at round t+1, the LLM’s two answers on the same input form a preference pair (old contradicting self ≺ new peer-endorsed self) for DPO training. We call this Round-based Pseudo-Label Preference Optimization (RPL-PO). On six benchmarks, LLM-GNN Co-Teaching consistently outperforms GNN-as-Judge and all prior methods, with absolute 3-shot gains of 7.86% on Cora and 7.73% on ogbn-arxiv; improvements carry over to 5-shot and to zero-shot cross-dataset transfer. Error-structure analysis further shows that abandoning the golden-teacher assumption substantially improves the LLM’s graph learning capability on challenging samples.

[LG-46] Range-Aware Bayesian Optimization for Discovering Diverse Designs within Target Property Windows

Link: https://arxiv.org/abs/2606.11574
Authors: Shengli Jiang, Jason Wu, Charles M. Schroeder, Michael A. Webb
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Chemical Physics (physics.chem-ph); Machine Learning (stat.ML)
Comments: 64 pages, 6 main text figures, 17 supporting figures, 6 supporting tables

Abstract:In many materials and product design problems, desirable candidates exhibit properties that fall within an acceptable range rather than achieve a single optimum. Recovering multiple, distinct solutions that satisfy such specifications is also practically valuable, as some candidates may be preferred for reasons of cost, processability, or robustness that are difficult to encode directly in an objective function. Here, we develop a range-aware Bayesian optimization (BO) framework in which the acquisition function directly scores the posterior probability that a candidate satisfies a target range. The framework naturally extends to parallel pursuit of multiple distinct specifications over a shared candidate space. Across benchmark tasks, range-aware acquisition consistently recovers larger and more diverse sets of valid designs than standard BO baselines and recent goal-seeking methods. Its utility is further demonstrated in two practically motivated design case studies involving optimizing reaction conditions for polymer synthesis and sequence-defined oligomer discovery for prescribed optical absorption bands, supported by quantum chemical calculations. These results suggest that range-aware BO can provide a practical and sample-efficient foundation for specification-driven design, particularly when design flexibility and solution diversity are important considerations.
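
An acquisition function that scores the posterior probability of landing in a target range has a simple closed form when the surrogate's posterior at each candidate is Gaussian. The sketch below assumes that Gaussian form (typical of a GP surrogate, though the abstract does not state which surrogate is used); the candidate names, means, standard deviations, and window `[0.45, 0.55]` are made up for illustration.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def range_probability(mean: float, std: float, lower: float, upper: float) -> float:
    """P(lower <= y <= upper) for y ~ Normal(mean, std^2)."""
    if std <= 0.0:
        # Degenerate posterior: the prediction is either inside the window or not.
        return 1.0 if lower <= mean <= upper else 0.0
    return normal_cdf((upper - mean) / std) - normal_cdf((lower - mean) / std)


# Hypothetical posterior (mean, std) for three candidate designs.
candidates = {"A": (0.50, 0.05), "B": (0.62, 0.20), "C": (0.90, 0.05)}
scores = {
    name: range_probability(mu, sigma, lower=0.45, upper=0.55)
    for name, (mu, sigma) in candidates.items()
}
best = max(scores, key=scores.get)
```

Unlike an improvement-based acquisition, this score does not reward pushing the property ever higher: candidate C has the largest predicted mean but almost no probability mass inside the window, so it ranks last.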

[LG-47] APEX: A Network-Native Time-Series Foundation Model for Forecasting and Anomaly Detection for Wireless Edge Operations

Link: https://arxiv.org/abs/2606.11553
Authors: Swadhin Pradhan, Niloo Bahadori, Peiman Amini
Categories: Machine Learning (cs.LG)
Comments: 5 pages, 1 figure, 4 tables. Discusses a network-native time-series foundation model for wireless edge operations

Abstract:Generic time-series foundation models transfer poorly to wireless network telemetry whose signals are bursty, zero-inflated, and coupled across protocol layers. We present APEX, a network-native, decoder-only transformer for forecasting enterprise AP telemetry, and evaluate it on DHCP degradation as a representative network task. APEX is pre-trained on 10-channel multivariate telemetry from ~4,500 production wireless networks (~100K AP time series, 34 metrics per AP), and is available as APEX-Large (269M, cloud) and APEX-Edge (10.5M, edge). On a 192-step (4-day) DHCP degradation benchmark, APEX-Large reduces MAE by 18% over the strongest foundation-model baseline (Toto) and 38% over SARIMA, with anomaly-detection F1 = 0.93, while APEX-Edge enables sub-second, privacy-preserving inference on AP-class edge hardware. These results suggest network-native pre-training is a practical foundation for proactive wireless operations.

[LG-48] Learning Object Manipulation from Scratch via Contrastive Interaction

Link: https://arxiv.org/abs/2606.11525
Authors: Tongle Shen, Caleb Chuck, Fan Feng, Biwei Huang
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:Contrastive Reinforcement Learning (CRL) has seen recent success in a wide variety of goal-conditioned robotics tasks by learning structured representations of the dynamics. However, despite its success in locomotion and simpler control domains, CRL often struggles in interaction-rich manipulation. We argue that a key source of this difficulty is object-centric interaction, such as contact or grasping, that induces distinct changes in the underlying dynamic modes. In this work, we formulate manipulation dynamics as a piecewise-smooth Markov process and show that interaction-induced mode changes create piecewise nonlinear reachability structures that are difficult for standard CRL energy functions to represent and plan over. Based on this analysis, we introduce Interaction-weighted Resampling (IWR). IWR performs interaction-aware resampling around phases before, during, and after interactions, encouraging the learned representation to preserve the mode boundaries that determine future reachability to capture multi-modal and piecewise nonlinear reachability. Across interaction-centric environments, including 2D dynamic control, robotic manipulation, and robot air hockey, IWR improves both sample efficiency and overall performance over prior CRL methods, with 19.8% average improvement in simulation. Finally, using a sim-to-real pipeline with policies trained by IWR, we demonstrate the first real-world goal-conditioned robot air hockey agent capable of hitting goals, improving success from 25% to 60%. Project Page: this http URL.

[LG-49] Counterexample Guided Learning in the Large using Reasoning Agents

Link: https://arxiv.org/abs/2606.11521
Authors: Hongyi Liu, Frederic Sala, Thomas Reps, Adithya Murali
Categories: Machine Learning (cs.LG)
Comments: Code, data, and resources are publicly available for research purposes: this https URL

Abstract:LLMs and LLM agents should improve when given feedback, but identifying when they are able to do so is difficult: feedback is heterogeneous, domain-specific, and difficult to control. We approach this challenge by asking LLMs to perform regular-expression induction, a classical symbolic learning problem where precise mechanisms for feedback exist in the form of counterexamples. In counterexample-guided learning, a learner (LLM) proposes candidate regular expressions from positive/negative-labeled strings, and the teacher (verifier) returns counterexamples showcasing the difference between the candidate and target languages. We identify novel counterexample-guided refinement strategies that enable effective regex learning, such as regularization and symbolic counterexample clusters. We also explore agentic strategies such as reflection and repair loops. Empirically, we find that verifier feedback substantially improves sample efficiency on challenging regex-induction tasks, reducing the number of labeled examples required and enabling learning of complex target expressions where standard prompting fails. For example, on the hardest task groups, our counterexample-guided framework improves success from 3.2% to 38.1% and from 38.9% to 74.1% on two different regex domains. These results suggest that LLMs can benefit from rich feedback beyond treating it as additional data, opening the door for robust verifier-guided methods for LLM-based program synthesis and formal reasoning.
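
The learner/teacher loop described above can be sketched with Python's `re` module. This is only a toy: the learner is a fixed list of hypotheses standing in for an LLM, and the teacher checks equivalence by brute force over strings up to a bounded length, so "no counterexample" means "none up to `max_len`", not a proof of language equality. The function names and the target `(ab)*` are illustrative assumptions.

```python
import re
from itertools import product


def find_counterexample(candidate, target, alphabet="ab", max_len=4):
    """Teacher: return the shortest string on which the two regexes disagree."""
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            s = "".join(chars)
            in_candidate = re.fullmatch(candidate, s) is not None
            in_target = re.fullmatch(target, s) is not None
            if in_candidate != in_target:
                return s
    return None  # No disagreement found within the length bound.


def learn(target, hypotheses):
    """Learner: try hypotheses in order, pruning those that contradict feedback."""
    examples = []  # (string, label) pairs collected from counterexamples
    for cand in hypotheses:
        consistent = all(
            (re.fullmatch(cand, s) is not None) == label for s, label in examples
        )
        if not consistent:
            continue  # Already refuted by an earlier counterexample.
        cex = find_counterexample(cand, target)
        if cex is None:
            return cand, examples
        examples.append((cex, re.fullmatch(target, cex) is not None))
    return None, examples


target = r"(ab)*"
hypotheses = [r"a*", r"(a|b)*", r"(ab)*"]
learned, examples = learn(target, hypotheses)
```

Here the single counterexample `"a"` (rejected by the target) eliminates the second hypothesis without the teacher ever being queried about it, which is the sample-efficiency effect that counterexample feedback is meant to provide.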

[LG-50] Probabilistic Contrastive Pretraining for Multi-task ADME Property Prediction

Link: https://arxiv.org/abs/2606.11508
Authors: Yifan Xue, Srimukh Prasad Veccham, Saee Paliwal, Tyler Shimko, Micha Livne
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)

Abstract:Accurate prediction of absorption, distribution, metabolism, and excretion (ADME) properties is critical to drug discovery, but remains challenging because ADME endpoints are noisy, interdependent, and often data-limited. We propose a molecular graph-transformer pretraining framework that combines chemistry-specific self-supervision with contrastive mutual information machine learning (cMIM). Our method encodes molecular graphs into latent variables, reconstructs SMILES strings from the graph-derived latent codes, and augments the contrastive objective with domain-specific self-supervised chemistry tasks. Rather than treating these tasks as auxiliary regularizers with separately tuned loss weights, we formulate reconstruction, contrastive discrimination, and chemistry-specific supervision as unit-weighted log-probability factors in a single probabilistic latent-variable objective. For fine-tuning, we propose a multi-task GNN readout architecture with task-specific multilayer perceptron heads, preserving shared representation learning while mitigating negative transfer and improving the modeling of heterogeneous, nonlinear task relationships. Across Biogen, ExpansionRX, and ChEMBL-MT, the resulting Contrastive KERMT pretraining improves over the KERMT baseline by 7.6%, 9.9%, and 9.5% respectively (averaged over significantly-improved endpoints). Adding ADME-adjacent molecules to the pretraining corpus further improves transfer, and the contrastive component sharpens chemically meaningful latent neighborhoods.

[LG-51] OmniLoc: A Geometry-Aware Foundation Model for Anchor-Free UE Localization Across Diverse Indoor Environments

Link: https://arxiv.org/abs/2606.11490
Authors: Lei Chu, Yuning Zhang, Omer Gokalp Serbetci, Anushka Katiyar, Bassel Abou Ali Modad, Andreas F. Molisch
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:Indoor localization from wireless measurements remains challenging in large-scale deployments due to substantial variation in building geometry, the set of detectable access points (APs), and the heterogeneity of received signals. Existing learning-based methods often perform well only in limited settings and degrade under environmental shifts, making robust anchor-free localization across diverse indoor environments notoriously difficult. In this paper, we present OmniLoc, an environment-interactive foundation model for anchor-free user equipment localization across diverse indoor environments. To the best of our knowledge, OmniLoc is the first foundation-model-based approach built directly on wireless measurements for this task. OmniLoc is built on three key designs. First, a unified input tokenization module converts heterogeneous wireless measurements into a common representation that is more amenable to learning. Second, a geometry-aware Transformer performs AP-aware feature extraction by emphasizing dominant APs while aggregating complementary evidence from supporting APs. Third, a geometry-aware location estimation module conditions regression on geometric embeddings to produce geometrically consistent location predictions. We evaluate OmniLoc on both a large-scale in-house dataset and a public benchmark dataset. Results show that OmniLoc significantly outperforms existing methods, consistently improves existing backbones when its design components are integrated, and demonstrates strong generalization in cross-environment evaluations.

[LG-52] Accurate and Resource-Efficient Federated Continual Learning

Link: https://arxiv.org/abs/2606.11480
Authors: Jebacyril Arockiaraj, Dhruv Parikh, Jayashree Adivarahan, Rajgopal Kannan, Viktor Prasanna
Categories: Machine Learning (cs.LG)
Comments: Technical Report

Abstract:Federated continual learning (FCL) must learn from distributed task streams under limited resources, such as communication, computation, memory, and label availability. Existing FCL methods often rely on repeated local optimization, replay, and full supervision. Analytic alternatives avoid iterative training and replay, but using high-dimensional random features to improve accuracy requires a second-order feature statistic, the Gram matrix, which has a quadratic communication cost in the random feature size M. We propose FedRAN, a resource-aware analytic FCL framework that replaces gradient-based updates with compact random feature statistics. Each client transmits a truncated-SVD summary of its Gram matrix, reducing the dominant second-order upload from quadratic to linear in M for fixed rank. The server performs a two-level QR-SVD subspace merge, spatially across clients and temporally across tasks, and solves a ridge classifier in closed form. FedRAN further supports label scarcity through prototype-based pseudo-labeling. Across CIFAR-100, ImageNet-R, and VTAB datasets, FedRAN improves average accuracy by up to 4.8 percentage points over the strongest baseline, uses 30.6-121.8× less per-client communication than optimization-based FCL, and is 190.3× faster on average than gradient-based baselines; with only 20% labels, pseudo-labeling improves average accuracy by up to 6.61 points. These results show that FedRAN enables accurate and resource-efficient FCL under communication, computation, and label constraints. The source code is available at this https URL.

[LG-53] Mahalanobis-Guided Latent OOD Detection for Hybrid ES-DRL Control in Time-Varying Systems

Link: https://arxiv.org/abs/2606.11474
Authors: Shaifalee Saxena, Alexander Scheinker
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY); Accelerator Physics (physics.acc-ph)

Abstract:In this paper, we study Mahalanobis-guided latent out-of-distribution (OOD) detection for test-time RL controller switching in nonlinear time-varying systems. RL controllers can quickly control high-dimensional systems within the training distribution, but their performance can degrade when time-varying dynamics produce unseen observations. We consider a combined ES–DRL controller, where RL provides fast in-distribution actions and bounded extremum seeking (ES) provides robust model-independent control under OOD operation. The key challenge is deciding when to switch. We train a variational autoencoder (VAE) on in-distribution beam-profile observations and use Mahalanobis distance in the VAE latent space to detect OOD beam profiles at test time. This OOD decision sets a binary switch that selects either the RL controller or the ES controller. We evaluate the approach in safety-critical particle accelerator control. In this setting, spatial magnet motion creates OOD beam profiles that were not seen during RL training. Visualization of the VAE latent space shows that the proposed method identifies this OOD scenario and provides an interpretable signal for switching between RL and ES in the combined controller.
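
The switching rule can be sketched as a Mahalanobis distance test on latent codes. To keep the example dependency-free it assumes a diagonal covariance (a full covariance would need a matrix inverse), a hand-picked threshold of 3.0, and a handful of invented 2-D latent vectors standing in for VAE encoder outputs; none of these choices come from the paper.

```python
import math
from statistics import mean, pvariance


def fit_latent_stats(latents):
    """Per-dimension mean and variance of in-distribution latent codes."""
    dims = list(zip(*latents))
    mu = [mean(d) for d in dims]
    var = [pvariance(d) + 1e-8 for d in dims]  # small jitter avoids division by zero
    return mu, var


def mahalanobis(z, mu, var):
    """Mahalanobis distance under the diagonal-covariance assumption."""
    return math.sqrt(sum((zi - mi) ** 2 / vi for zi, mi, vi in zip(z, mu, var)))


def select_controller(z, mu, var, threshold=3.0):
    """Binary switch: RL for in-distribution latents, ES for OOD latents."""
    return "ES" if mahalanobis(z, mu, var) > threshold else "RL"


# Hypothetical latent codes of in-distribution beam profiles.
train_latents = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (0.1, 0.2), (-0.2, -0.2)]
mu, var = fit_latent_stats(train_latents)

near = select_controller((0.05, 0.0), mu, var)  # close to the training cloud
far = select_controller((1.0, 1.0), mu, var)    # far from the training cloud
```

The distance itself doubles as the interpretable signal mentioned in the abstract: it can be logged over time to show when and how strongly the observations drift away from the training distribution.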

[LG-54] Evaluating and Combating the Impact of Concept Drift on the Performance of Machine Learning-Based Phishing Detection Systems

Link: https://arxiv.org/abs/2606.11471
Authors: Warren Fernando, Nikos Komninos
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:The expansion of the digital domain has resulted in a substantial increase in digital communication, with email emerging as one of the most prominent channels. The proliferation of email communication is apparent in both professional and personal contexts, thereby creating numerous vulnerabilities for malicious actors to exploit. Spam emails, a form of unsolicited correspondence often bearing malicious intent towards recipients, have been an ongoing challenge for email users since the inception of email technology, and this problem has been exacerbated by the growth of the digital landscape. Email spam filters are integral components of email clients, engineered to identify potentially harmful messages and alert users to their malicious content. Phishing, frequently the initial phase of malware-based attacks, is evolving rapidly, with malware becoming increasingly sophisticated over time. A widely adopted approach for detecting malicious activity within malware and spam domains is the application of machine learning. Our aim is to assess the impact of the evolution within the spam email domain on these machine learning-based detection systems and to explore strategies for mitigating associated performance degradation.

[LG-55] Density estimation for Hellinger via minimum-distance estimators: mixtures of Gaussians log-concave and more

Link: https://arxiv.org/abs/2606.11469
Authors: Spencer Compton, Jerry Li
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract:We study the task of density estimation, where we hope to accurately estimate a probability density from n samples. A textbook method for density estimation in total variation distance is the minimum-distance estimator approach, where we conclude both the algorithm and the analysis merely from bounding the VC dimension of a particular concept class (the so-called Yatracos class). While this technique has originally yielded sharp guarantees primarily for total variation distance, in this work we extend the minimum-distance estimator approach for learning within Hellinger distance. Our main observation is that we may produce an analogous recipe for Hellinger (where we only require bounding the VC dimension of a related concept class) by drawing connections to recent results yielding reverse data processing inequalities. This recipe is flexible enough to accommodate fast algorithms originally designed for total variation distance; by modifying the approach of Acharya et al. (2017) we conclude the first near-linear time algorithm for learning classes including univariate mixtures of log-concave densities and mixtures of Gaussians (with arbitrary variances), with near-optimal sample complexity.

[LG-56] Mirror Descent Beyond Euclidean Stability: An Exponential Separation in Initialization Sensitivity

Link: https://arxiv.org/abs/2606.11431
Authors: Shira Vansover-Hager, Matan Schliserman, Ofir Schlisselberg, Tomer Koren
Categories: Machine Learning (cs.LG)

Abstract:Mirror Descent (MD) extends Gradient Descent (GD) beyond Euclidean geometry and has recently reappeared as a lens for KL-regularized policy optimization in reinforcement learning and LLM post-training. This raises a basic robustness question, crucial to reproducibility and reliability: how sensitive are MD dynamics to their inputs? We focus on initialization, often itself a pretrained or previously aligned model. Quadratic-regularized MD, including GD and Mahalanobis geometries, is well-known to be stable for convex smooth objectives. We show a sharp contrast: once the regularizer is non-quadratic, MD can be exponentially more sensitive to initialization than GD, even with a well-conditioned regularizer in Euclidean norm. We give a three-dimensional construction with a convex, smooth objective and a strongly convex, smooth, well-conditioned regularizer where an initial $\varepsilon$ perturbation is quickly amplified to $\min\{\text{polylog}^{-1}(1/\varepsilon),\ \varepsilon e^{\Omega(\eta T)}\}$ after $T$ iterations of MD with step size $\eta$. For canonical KL-regularized MD on the simplex, we show that even linear objectives can amplify an initial $\varepsilon$ perturbation exponentially fast in high-dimensional or near-boundary regimes. Finally, we show that adding a Bregman regularization term toward an anchor point can stabilize the dynamics while largely preserving the optimization guarantees, and that the choice of anchor is crucial: anchoring at the initialization only partially mitigates the instability, whereas anchoring at a fixed point yields a more stable mechanism.
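
For intuition on the simplex case, KL-regularized MD reduces to the familiar multiplicative (exponentiated-gradient) update, and a two-coordinate toy problem already shows a near-boundary perturbation growing. The sketch below is not the paper's construction: the cost vector, step size, starting point `delta`, and perturbation `eps` are arbitrary values chosen for illustration.

```python
import math


def kl_mirror_step(x, grad, eta):
    """One KL-regularized MD step on the simplex: x_i <- x_i * exp(-eta * g_i), renormalized."""
    weights = [xi * math.exp(-eta * gi) for xi, gi in zip(x, grad)]
    total = sum(weights)
    return [w / total for w in weights]


def run(x0, cost, eta, steps):
    """Iterate MD on the linear objective <cost, x>, whose gradient is constant."""
    x = list(x0)
    for _ in range(steps):
        x = kl_mirror_step(x, cost, eta)
    return x


cost = [0.0, 1.0]          # coordinate 0 is the cheaper one
delta, eps = 1e-6, 1e-8    # start near the boundary; perturb by eps
eta, steps = 0.5, 20

a = run([delta, 1.0 - delta], cost, eta, steps)
b = run([delta + eps, 1.0 - delta - eps], cost, eta, steps)

initial_gap = 2.0 * eps                               # L1 distance at initialization
final_gap = sum(abs(ai - bi) for ai, bi in zip(a, b))  # L1 distance after MD
amplification = final_gap / initial_gap
```

While the small coordinate stays small, its value after T steps is roughly `delta * exp(eta * T)`, so a perturbation of `delta` is scaled by about the same factor; the growth stops once that coordinate's mass becomes comparable to 1.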

[LG-57] Recursive Binding on a Budget: Subspace Carving in Order-p Tensor Memories

Link: https://arxiv.org/abs/2606.11391
Authors: Travis Pence, Daisuke Yamada, Vikas Singh
Categories: Machine Learning (cs.LG)
Comments: 24 pages, 12 figures, 7 tables

Abstract:Tensor Product Representations provide the structural fidelity required for symbolic reasoning in models but suffer from exponential dimensionality growth when encoding deep recursive structures. Conversely, Vector Symbolic Architectures maintain constant dimensionality but sacrifice capacity and fidelity due to noisy compression via superposition. In this work, we propose Orthogonal Subspace Carving (OSC), a memory architecture that binds fillers to roles by projecting onto the null space of the role basis before aggregating into a fixed order-p tensor. OSC uses projections to enforce geometric orthogonality between bound structures within a static memory trace. We show that this mechanism decouples the tensor order from the structural depth, enabling deep recursive binding within a constant memory footprint. By performing retrieval via recognition, this construction allows for component vectors that are orders of magnitude smaller than the memory tensor, giving superior memory efficiency in settings involving high superposition. We also show that TPR is a special case of binding in Clifford algebra, and give a Clifford formulation of OSC.

[LG-58] GLACIER: A Multimodal Student-Teacher Foundation Model for Molecular Property Prediction

Link: https://arxiv.org/abs/2606.11382
Authors: Emily Nguyen, Yongchan Hong, Harsh Toshniwal, Yan Liu, Andreas Luttens
Categories: Machine Learning (cs.LG); Biomolecules (q-bio.BM)

Abstract:Deep learning models facilitate the discovery of molecules with tailored properties among billions of candidate compounds. However, the computational burden to develop and deploy state-of-the-art models continuously increases, limiting their scalability. Most large-scale models are unimodal in nature and overlook the potential to leverage complementary molecular data modalities. To address these shortcomings, this paper introduces the Graph-Language Alignment for Chemical Inference and Exploration using Representations (GLACIER) model, a student-teacher framework that integrates molecular graphs, SMILES strings, and physicochemical descriptors to learn rich molecular embeddings. Our framework consists of three stages: (1) we pretrain three student encoders on 100,000 drug-like molecules: a message-passing neural network for molecular graphs, a transformer-based encoder for SMILES strings, and a multilayer perceptron for physicochemical descriptors, (2) we fuse these student modalities using a novel Finsler geometry-aware module, and (3) distill complementary knowledge from large teacher models, including MiniMol and MolFormer, into a single lightweight model via contrastive learning. We demonstrate that GLACIER is a robust framework that delivers high predictive performance and computational efficiency in complex molecular property prediction tasks. Our code is publicly available at this https URL.

[LG-59] SwiftCTS: Fast Cross-Design Prediction and Pareto Optimization of Clock Tree Metrics via Few-Shot Calibration

Link: https://arxiv.org/abs/2606.11348
Authors: Barsat Khadka, Kawsher Roxy, Md Rubel Ahmed
Categories: Machine Learning (cs.LG)

Abstract:Clock Tree Synthesis (CTS) is a computationally expensive stage in the physical design flow, requiring iterative EDA tool invocations to navigate a vast configuration space for optimal power, wirelength, and timing skew. Existing machine learning approaches require computationally expensive retraining or fine-tuning cycles to adapt to unseen macro architectures and are architecturally mismatched to the millions of evaluations demanded by exhaustive combinatorial search. We present SwiftCTS, a physics-informed surrogate framework that addresses both limitations simultaneously. By coupling lightweight, physics-grounded statistical features with gradient-boosted ensembles, SwiftCTS trains in under five seconds on a CPU and delivers sub-millisecond inference without GPU support. To handle out-of-distribution (OOD) designs without retraining or fine-tuning, we introduce a K-shot multiplicative calibration mechanism that anchors predictions to just one or two physical reference runs, reducing power prediction error from 24.5% to 3.3% and wirelength error from 56.6% to under 1% on unseen macros. Integrating this engine with an evolutionary optimizer, SwiftCTS evaluates 100,000 CTS configurations in under ten seconds, yielding Pareto-optimal frontiers that are physically validated within the OpenROAD flow. Closed-loop validation confirms prediction errors below 0.5% for power and wirelength, and timing skew predictions within five picoseconds on an OOD benchmark, consistently outperforming default tool heuristics across all target metrics. Code publicly available at: this https URL
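
One plausible reading of K-shot multiplicative calibration is a single scale factor fitted on the K reference runs and applied to every later prediction. The abstract does not give the exact formula, so the mean-of-ratios estimator and the numbers below are assumptions made purely for illustration.

```python
from statistics import mean


def fit_scale(predicted_refs, measured_refs):
    """Average measured/predicted ratio over the K reference runs (assumed estimator)."""
    return mean(m / p for p, m in zip(predicted_refs, measured_refs))


def calibrate(prediction, scale):
    """Apply the multiplicative correction to a raw surrogate prediction."""
    return prediction * scale


# Hypothetical K=2 reference runs on an unseen design:
# the surrogate under-predicts power by the same factor in both.
scale = fit_scale(predicted_refs=[10.0, 12.0], measured_refs=[12.5, 15.0])
calibrated = calibrate(8.0, scale)
```

A multiplicative correction like this leaves the surrogate's ranking of configurations unchanged, which is why it needs no retraining; it only helps when the error on a new design is close to a constant factor.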

[LG-60] Energy-Conserved Neural Pipelines: Attenuating Error Propagation in Modular Neural Networks via Physical Conservation Constraints

Link: https://arxiv.org/abs/2606.11341
Authors: David Young, Swan Yi Htet
Categories: Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 22 pages, 2 figures, 7 tables, 25 references

Abstract:Modular neural network pipelines suffer from error compounding: noise at any module boundary propagates and potentially amplifies through subsequent modules. We introduce energy conservation as a hard physical constraint on inter-module information flow. Activation energy (the squared L2 norm of feature vectors) is enforced to be exactly preserved at every module boundary. Unlike soft energy penalties, conservation is an inviolable law: the network may redistribute energy across neurons but cannot create or destroy it. Four experiments on CIFAR-10 demonstrate: (1) conservation retains 77.4% of clean accuracy at noise sigma=0.2, versus 35.1% for baselines and 30.9% for energy-penalized models (p < 0.001, 5 seeds); (2) pipelines become depth-invariant, retaining 93.3% at depths 2 through 5 with noise at every boundary; (3) the advantage generalizes to systematic bias (+45.1%), Gaussian (+40.4%), and adversarial noise (+4.8%), with a principled non-effect on dropout (-0.3%); (4) on ResNet-18, the conservation advantage scales inversely with intrinsic normalization: +0.3 pp with BatchNorm, +26.2 pp without at sigma=0.2, reaching +58.0 pp at sigma=0.5. Experiment 5 validates the operator on a real modular robotic pipeline (MuJoCo physics, Franka Panda). Across three independent runs on separate machines (90 trials per cell), conservation provides +18.9 pp average advantage on monocular-depth-style noise. A formal bound proves conserved noise energy is strictly less than input noise energy.
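
A hard constraint of this kind can be imposed by rescaling each module's output so that its squared L2 norm equals that of the module's input. The abstract does not say how the operator is implemented, so the rescaling below is one assumed realization, and `conserve_energy` is a hypothetical name.

```python
import math


def energy(v):
    """Activation energy: squared L2 norm of a feature vector."""
    return sum(x * x for x in v)


def conserve_energy(module_input, module_output, eps=1e-12):
    """Rescale the output so its energy matches the input's energy exactly.

    The direction of the output (how energy is distributed across neurons)
    is untouched; only its overall magnitude changes.
    """
    e_in, e_out = energy(module_input), energy(module_output)
    if e_out <= eps:
        return list(module_output)  # A zero vector cannot be rescaled.
    scale = math.sqrt(e_in / e_out)
    return [scale * x for x in module_output]


x = [3.0, 4.0]        # input energy 25
y = [1.0, 2.0, 2.0]   # raw output energy 9
y_conserved = conserve_energy(x, y)
```

Because the rescaling works on energies rather than on individual coordinates, the input and output may have different dimensions, as in this example.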

[LG-61] Learning from almost nothing: How neural networks survive heavy input corruption

Link: https://arxiv.org/abs/2606.11319
Authors: Justin Tahmassebpur, Asadullah Bhuiyan, Hyejin Kim, Omri Lesser
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn)
Comments: 26 pages, 10 figures

Abstract:Learning from imperfect data is a central theme in machine learning, connecting practical questions of robustness to fundamental questions of learnability. Here we examine attribute noise: learning from corrupted inputs while keeping the labels intact, a setting that has received considerably less analytical attention than its label-noise counterpart. We consider two types of corruption models: additive noise and replacement noise. Through experiments with multi-layer perceptrons (MLPs) on corrupted classification datasets, we find that neural networks remain robust, maintaining well-above-chance accuracy even when inputs are 90% corrupted – far beyond human recognition. To understand this robustness, we analyze infinite-width networks in the heavy-corruption regime using a mean-field-inspired approach and derive a leading-order decision rule for the classification outcome: the network implements a prototype rule, the nearest-class-mean, assigning each test point to the class whose training-set average it most closely resembles. This leading-order decision rule is universal across a broad range of MLP architectures, holding for any depth, as well as a wide class of activation functions and noise distributions. The same centroid mechanism closely matches finite-width network behavior in our experiments and provides an interpretable and analytically tractable account of why learning can succeed even when individual training examples carry almost no signal.
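
The prototype rule named in the abstract, nearest-class-mean, is easy to state in code. The sketch below measures resemblance with squared Euclidean distance, which is an assumption (the abstract does not specify the similarity measure), and the tiny dataset is invented.

```python
def class_means(samples, labels):
    """Average the training vectors of each class into one prototype per class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, xi in enumerate(x):
            acc[i] += xi
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict(x, means):
    """Assign x to the class whose prototype is closest in squared Euclidean distance."""
    def sqdist(m):
        return sum((xi - mi) ** 2 for xi, mi in zip(x, m))
    return min(means, key=lambda y: sqdist(means[y]))


# Hypothetical noisy training points for two classes.
samples = [(0.9, 0.1), (1.1, -0.1), (-1.0, 0.2), (-1.2, -0.2)]
labels = ["pos", "pos", "neg", "neg"]
means = class_means(samples, labels)
label = predict((0.3, 0.0), means)
```

Averaging is what makes the rule tolerant of heavy corruption: zero-mean noise on individual inputs largely cancels in the class mean, even when each single example is barely recognizable.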

[LG-62] Fixed-Parameter Tractability of Private Synthetic Data Generation

Link: https://arxiv.org/abs/2606.11283
Authors: Badih Ghazi, Cristóbal Guzmán, Pritish Kamath, Alexander Knop, Ravi Kumar, Pasin Manurangsi
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We study the problem of generating synthetic data under differential privacy. We establish fixed-parameter tractability (FPT) for this problem where the parameter is the treewidth of the query family’s incidence graph. Our algorithms attain optimal error rates across all regimes and are realized by two different approaches: the first is based on linear programming (LP) and the FPT of the separation problem for the LP dual; the second is based on a subsampled private multiplicative weights method, where we obtain FPT for sampling from Gibbs distributions. Both approaches are unified by a dynamic programming framework over a tree decomposition.

[LG-63] Least-Action-Guided Diffusion for Physical Extrapolation

Link: https://arxiv.org/abs/2606.11277
Authors: Zhongxin Yang, Yuanwei Bin, Xiang I.A. Yang, Shiyi Chen
Subjects: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)

Abstract:Reliable extrapolation remains a central challenge for generative models in computational physics, because models trained over finite ranges of time, parameters, or geometries may produce physically inconsistent predictions outside the training distribution. We introduce least-action-principle-guided diffusion (LAPG), a framework that promotes physical consistency during inference rather than relying solely on constraints imposed during training. The method combines a conditional score-based diffusion model with an action-derived physical guidance score. In the first stage, the learned score model generates an in-distribution proposal; in the second, an action-based variational prior refines this proposal toward the target out-of-distribution condition. This formulation turns the principle of least action into a differentiable inference-time correction mechanism and provides an alternative to pointwise residual penalties that often require empirical loss balancing. We evaluate LAPG on representative ordinary- and partial-differential-equation systems, including free fall, conservative and dissipative spring-mass dynamics, interacting point vortices, and potential flow over parameterized airfoils. In temporal, parameter, and geometric extrapolation tests, LAPG reduces phase drift, preserves dissipative decay, captures vortex motion, and improves the lift response of airfoil flows compared with training-time physics-informed baselines.

[LG-64] LakeFM: Toward a Foundation Model for Aquatic Ecosystems Using Irregular Multivariate Multi-depth Time Series Data KDD2026

Link: https://arxiv.org/abs/2606.11268
Authors: Abhilash Neog, Sepideh Fatemi, Medha Sawhney, Kazi Sajeed Mehrab, Aanish Pradhan, Bennett J. McAfee, Emma Marchisin, Arka Daw, Robert Ladwig, Cayelan C. Carey, Paul Hanson, Anuj Karpatne
Subjects: Machine Learning (cs.LG)
Comments: KDD 2026

Abstract:Understanding and forecasting lake dynamics is critical for monitoring water quality and ecosystem health across lakes and reservoirs. While machine learning methods have been recently applied to ecological time-series data, existing works assume regular sampling in time and depth, and struggle to generalize across lakes with heterogeneous variables, depths, and observation patterns. To address these limitations, we introduce LakeFM, a foundation model for aquatic systems, pre-trained on large-scale ecological datasets comprising both simulated and observed lakes. Through extensive empirical evaluation, we show that LakeFM learns meaningful representations spanning broader lake-level characteristics, and achieves competitive or often superior forecasting performance compared to existing time-series foundation and non-foundation models, while producing physically plausible predictions consistent with real-world lake dynamics.

[LG-65] A prior-free blind detection of information leakage from model predictions

Link: https://arxiv.org/abs/2606.11267
Authors: Laurence A. Jacobs
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Abstract:Data leakage – contamination of a model with information unavailable at baseline – is the dominant reproducibility failure in machine-learning-based science, yet detection tools require training code, external data, or domain expertise. None operates on the artifact an auditor most often holds: the model’s output. We ask what can be decided about leakage from predictions and outcomes alone. We give a decision-theoretic framework in which leakage diagnostics are functionals of the predicted-risk/outcome law, parameterized by a threshold-weighting linked to proper scoring rules and decision-curve analysis. We prove a sharp impossibility: a recalibrated leak matching an honest model’s calibration and discrimination is indistinguishable from honest performance by any function of the predictions, so the broad class is detectable only against an externally supplied ceiling on achievable discrimination. We then prove what leakage cannot hide: a near-deterministic subgroup – the signature of a near-label leak – produces a sustained unit-purity head that no legitimate predictor of a non-deterministic outcome can manufacture, yielding a prior-free test. These results organize leakage into a trichotomy – miscalibrated, broad-calibrated, and deterministic – each with a matched detector and failure mode. We validate on UK Biobank using time-windowed comorbidity leakage with known, graded severity, measuring a detection floor of $\Delta c^{\star} \approx 0.007$ on this endpoint, below which residual leakage is undetectable from output and too small to alter conclusions. The numerical floor is cohort- and endpoint-specific; the structural lesson is general: output-only detection fails where residual leakage is indistinguishable from an honestly stronger predictor. The test returns a verdict on a prediction vector in under a second on commodity hardware.

[LG-66] Seeing Before Colliding: Anticipatory Safe RL with Frozen Vision-Language Models

Link: https://arxiv.org/abs/2606.11266
Authors: Samuel Tetteh, Cody Fleming
Subjects: Machine Learning (cs.LG)
Comments: 44 pages, 26 figures

Abstract:The cost signal that constrained-RL algorithms optimize against is almost always reactive: the simulator emits a non-zero cost only after a collision has begun, and the Lagrange multiplier of PPO-Lagrangian grows only after the episode budget has been exceeded. At race speeds, where collisions are instantaneous and irreversible, any safety mechanism that waits for cost to accumulate is structurally too late. We present VLM-Safe-RL, a framework that integrates a frozen vision-language model into the CMDP Lagrangian update as an anticipatory cost term. The framework comprises four contributions: (i) Decoupled Dual-Path CLIP, independent reward/cost paths that respect the CMDP’s factorization; (ii) VLM-Lagrange, an augmented multiplier update that incorporates a per-step VLM cost as an anticipatory term; (iii) Confidence Gating, a Bayes-optimal weight derived from a logistic noise model on the CLIP margin; and (iv) VLMPPOLag, the composed algorithm. On Safety-Gymnasium FormulaOne L2, our principal evaluation ($n=5$ seeds, $10^6$ steps, budget $d_{\text{lim}}=25$), VLMPPOLag + Conf is the only configuration in our default budget comparison that simultaneously retains substantive return ($J_r \approx 40$) and holds cost within budget on a majority of seeds; the five constraint-aware baselines (PPOLag, CPO, CPPOPID, CPO-CLG, PPOLag-RND) each fail at least one requirement. The mechanism generalizes to held-out MetaDrive Medium (catastrophe rate $41\% \to 26\%$, 95% bootstrap CI $[-26,-5]$ pp) and shows directionally consistent transfer to Bullet Safety-Gym; we report honestly where it does not (MetaDrive Easy/Hard, Qwen2-VL backbone) and trace the Hard failure to a Lagrangian-regulation pathology rather than the VLM signal itself. To our knowledge, this is the first work to use frozen VLM signals as an anticipatory cost term inside the CMDP Lagrangian update.

[LG-67] Loss Landscape Diagnosis for Gradient-Based Gray-Scott System Inversion: Disentangling the Roles of PINN Components ICML2026

Link: https://arxiv.org/abs/2606.11258
Authors: Yan Yang
Subjects: Machine Learning (cs.LG); Pattern Formation and Solitons (nlin.PS); Computational Physics (physics.comp-ph)
Comments: Accepted at the AI4Physics Workshop, ICML 2026 (non-archival). 14 pages, 10 figures

Abstract:Gradient-based inversion of reaction-diffusion systems is typically approached via surrogate models or physics-informed neural networks (PINNs), while the most direct route, backpropagation through the PDE’s structure itself, has largely been avoided. We pursue this direct route as a diagnostic probe, backpropagating a steady-state loss through unrolled Gray-Scott simulation to recover its parameters, with no surrogate or neural-network augmentation. Optimization fails to converge, and plotting the landscape directly locates the failure in its geometry – flat plateaus with no gradient signal, bounded by sharp cliffs that align with bifurcation boundaries – a structure that recurs across loss functions and is inherited however the gradients are routed to parameters. Reading this minimal setup as an ablation of PINN, we disentangle each component’s role: with the neural network fixed, the residual loss is quadratic in the PDE parameters and yields a smooth landscape, so it alone already avoids the pathology, by implicitly encoding the full PDE dynamics across all initial conditions. The neural network, for its part, cannot repair an ill-posed parameter subspace, and so serves only to complete the observed data – a division of labor not previously made explicit. These findings carry concrete design implications for PINN-type methods and a broader heuristic on when added dimensions actually help.

[LG-68] Bernstein-Schur Kernels: Random Features by Sketched Modulation and Radial Randomization

Link: https://arxiv.org/abs/2606.11255
Authors: Taha Bouhsine
Subjects: Machine Learning (cs.LG)

Abstract:Bernstein–Schur kernels are products of a finite-feature kernel (one with an explicit finite-dimensional feature map) and a completely monotone shift-invariant kernel: nonstationary kernels that fall between the shift-invariant and dot-product templates random features usually exploit, so in general neither Bochner sampling nor polynomial sketching applies to the full kernel directly. We give one random-feature construction for the whole class that randomizes both factors: it sketches the finite modulation and randomizes the completely monotone radial factor, sampling the latter’s one-dimensional Bernstein–Widder scale and then applying Gaussian random Fourier features (whose frequency is still $d$-dimensional). The feature dimension is then $Dm$, set by the sketch size $m$ and the radial-draw count $D$, free of the $O(d^2)$ size of the exact modulation feature. Keeping the modulation exact is the analyzable limit ($m\to\infty$): there we prove unbiasedness, an exact variance for the recommended flat estimator, an expected matrix-Bernstein operator-norm bound (with a matching high-probability tail) controlled by the top eigenvalues of the kernel and modulation Gram matrices together with an intrinsic dimension rather than the crude $N\max_{ij}$ entrywise route, and a deterministic relative-spectral kernel-ridge stability result. By conditioning on the sketch, the doubly-randomized estimator inherits the same intrinsic-dimension operator-norm guarantee plus a single additive sketch term, tunable by $m$ independently of $D$. The motivating instance is the biased yat-kernel $k_{\text{yat},b}(w,x)=(w^\top x+b)^2/(\|w-x\|^2+\varepsilon)$, $b\ge0$, whose family span contains the inverse-multiquadric kernel by finite differences in $b$; for it the radial mixture is the IMQ spectral sampler, and one frequency per scale is variance-optimal at a fixed radial-feature budget.

[LG-69] Mechanical Field Networks: Structured Neural Dynamics for Multivariate Systems

Link: https://arxiv.org/abs/2606.11251
Authors: Xingji Cui
Subjects: Machine Learning (cs.LG)

Abstract:Many multivariate dynamical systems are observed only through trajectories, leaving the mechanisms governing their joint dynamics hidden. Existing approaches can impose interpretable dynamics or learn flexible state transitions, yet the resulting interaction structure is typically either specified in advance or left implicit within the learned dynamics. We introduce MF-Net, a recurrent dynamical model that represents all variables in a shared field state and updates this state through a learned relation law. Each variable carries a field component, and these components evolve jointly through a learnable mechanical transition. Here, mechanical refers to the relation-to-motion organization of the transition, where learned relations shape state-dependent flows, field responses, and motion tendencies that move the field state forward. The resulting structure is part of the rollout itself: learned relations influence how the field moves, and the same internal quantities support both forecasting and structural readout. Across known-law interaction systems, chaotic benchmarks, real neural recordings, and ecological time series, MF-Net achieves competitive short- and medium-horizon forecasting while retaining inspectable structural readout. On the 40-dimensional Lorenz–96 testbed, MF-Net achieves an eight-step $R^2$ of $0.798 \pm 0.018$; across five seeds, its learned relation matrix recovers the local coupling support with a local/nonlocal strength ratio of $19.80 \pm 1.00$ and Precision@$K$ of $1.000 \pm 0.000$. MF-Net provides a structure-readable dynamical modeling framework in which learned relations are trained through forward evolution and, on real data, interpreted as functional predictive couplings under appropriate observational limits.

[LG-70] Few-Shot Resampling for Scalable Statistically-Sound Data Mining KDD2026

Link: https://arxiv.org/abs/2606.11235
Authors: Leonardo Pellegrina, Fabio Vandin
Subjects: Machine Learning (cs.LG); Databases (cs.DB); Methodology (stat.ME)
Comments: Accepted to KDD 2026

Abstract:A key step in knowledge discovery is the evaluation of data mining results. In several applications, including pattern mining, graph analysis, and others, this step includes the evaluation of the statistical significance of the results, to avoid spurious discoveries due only to noise or random fluctuations in the data. While specialized procedures have been developed for some specific applications, resampling-based approaches are widely used, in particular for complex analyses where analytical results cannot be derived. However, current resampling-based approaches require the generation and analysis of thousands of resampled datasets, and are therefore impractical for large datasets or computationally intensive analyses. In this paper, we introduce FewRS, a simple and effective resampling-based approach to assess the statistical significance of data mining results with rigorous guarantees on the probability of false discoveries. Our approach can be used in every situation where resampling-based approaches are applied. FewRS builds on our derivation of a novel bound to the supremum deviation of test statistics representing the quality of data mining results. We prove that FewRS needs to generate and analyze an extremely small number of resampled datasets, leading to a highly scalable approach with wide applicability. We test our approach on common tasks such as pattern mining and network analysis. In all cases, our approach results in a reduction of up to two orders of magnitude in running time compared to the state of the art, while preserving high statistical power, enabling the statistical validation of data mining results on large-scale real-world datasets. 
Related DOI: https://doi.org/10.1145/3770855.3817752

[LG-71] Restless bandits with imperfect binary feedback: PCL-indexability analysis and computation

Link: https://arxiv.org/abs/2606.11192
Authors: José Niño-Mora
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 59 pages, 12 figures, submitted 27/3/2026

Abstract:We study restless bandits with binary latent states and imperfect binary feedback, motivated by opportunistic spectrum access with sensing errors. For the associated belief-state model, we develop a partial conservation laws (PCL)-based analytical and computational framework for establishing indexability and evaluating the Whittle index, building on a verification theorem for real-state discounted restless bandits. The framework analyzes the stochastic dynamics via an associated deterministic skeleton, renewal decompositions, and combinatorics on words. It yields tractable expressions for discounted reward and resource metrics in several threshold regimes, enabling full verification of the PCL-indexability conditions there. For the remaining regime, where a complete analytic verification is not achieved in this paper, we derive efficient numerical schemes for computing the relevant marginal metrics and the marginal productivity (MP) index, which equals the Whittle index when those conditions hold. Extensive computational experiments provide strong evidence that these conditions also hold in that regime across broad parameter ranges and without the stringent parameter restrictions imposed in prior work. The experiments further show that the MP index policy typically outperforms standard benchmark policies, often by a substantial margin.

[LG-72] Quantum Occam Learning: Sample-Supported Expressibility for Circuit-Based Quantum Learning

Link: https://arxiv.org/abs/2606.12211
Authors: Jeongho Bang, Kyoungho Cho, Jeongwoo Jae
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 22 pages (main text + appendix), 2 figures

Abstract:A central principle in quantum machine learning is that an ansatz should be expressive enough to represent the quantum data of interest. Yet, the expressibility is statistically meaningful only insofar as it can be learned from finitely many copies of an unknown quantum state. In this work, we develop an information-theoretic Occam theory for quantum data generated by finite-size quantum circuits. For the class $S_{n,G}$ of $n$-qubit pure states preparable with at most $G$ two-qubit gates, a metric-entropy argument gives the realizable sample law $\widetilde{\Theta}(G/\epsilon^2)$ in the circuit-limited regime. For an arbitrary source $\hat\rho$, we introduce the best $G$-gate approximation error $d_G(\hat\rho)$ and the approximate circuit complexity $C_\eta(\hat\rho)$. We prove an agnostic quantum Occam theorem: with $M$ copies, one can learn up to the best $G$-gate approximation error plus a statistical penalty $\widetilde{O}(\sqrt{G/M})$. We then remove the need to know $G$ in advance through an adaptive model-selection theorem whose oracle inequality selects the circuit complexity justified by the data. Matching lower bounds yield a sample-supported expressibility law: at trace-distance accuracy $\epsilon$, $M$ samples can support only $G_{\rm supported} \simeq M\epsilon^2$ gates, up to logarithmic factors and tomography saturation at $2^n$. Thus, the circuit complexity becomes an adaptive statistical resource rather than a static promise. Our framework turns bounded circuit complexity into a model-selection principle for quantum machine learning.

[LG-73] Phase Transitions in Attention: A Bayesian Theory of Copy Head Emergence

Link: https://arxiv.org/abs/2606.12058
Authors: Itay Lavie, Kirsten Fischer, Andrey Lekov, Frederic Van Maele, Zohar Ringel, Moritz Helias
Subjects: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)

Abstract:Attention is the key mechanism underlying in-context learning in transformers, and attention patterns have been observed empirically to emerge abruptly during training. We present a Bayesian theory of feature learning in attention; we then focus on how the copy subcircuit in the first layer of an induction head is learned by analyzing a single-layer softmax attention network trained on a copy task. We derive a closed-form posterior over the attention matrix and reduce it to a low-dimensional order parameter space. This reduction reveals a phase transition in the amount of training data, which we verify using both Bayesian sampling and standard training with Adam. We contrast our results with linear attention and find that softmax attention exhibits a first-order phase transition while in linear attention an initial second-order phase transition is followed by a smooth, continuous evolution toward the structured attention pattern (crossover). Our work provides a first-principles theoretical account of the abrupt emergence of the copy subcircuit, reminiscent of the one observed in training large language models.

[LG-74] NARRAS: Edge-Triggered Distributed Inference for CSI-Based Localization in Vehicular IoT Networks

Link: https://arxiv.org/abs/2606.11914
Authors: Rodrigo Oliver, Ricardo Vazquez Alvarez, Alejandro Lancho, Stefano Rini
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures, 5 tables. Under review at the IEEE Internet of Things Journal

Abstract:CSI-based localization with spatially distributed antenna arrays exposes a basic resource trade-off. Each array can provide a rich view of the channel, but forwarding observations from all arrays to a fusion center is wasteful when only a few carry useful information, and the shared uplink supports only a limited number of simultaneous transmissions. We let each array decide locally whether its current observation is worth reporting, subject to a budget on the average number of active transmitters. We refer to this abstraction as Edge-Triggered Distributed Inference (ETDI). It captures a broader class of task-oriented communication problems where resource-constrained devices share an access channel for a common inference task. We instantiate ETDI for CSI-based localization, a common scenario in vehicular IoT networks. Spatially distributed remote antenna arrays (RAAs) encode local channel state information (CSI) from user equipment (UE) transmissions into latent features, and the fusion center estimates the UE position from the subset of reported features. We propose NARRAS, a decentralized reporting policy in which each RAA combines a recurrent summary of its recent observations with a memory of the last latent it transmitted. Training controls an explicit activity budget through differentiable activity penalties and validation-calibrated deterministic thresholds, and uses channel-chart regularization to shape the latent geometry. Experiments show that, at comparable uplink activity, NARRAS improves localization accuracy over learned and heuristic sparse-reporting strategies, while dense full-report models remain useful budget-free references. In low-activity regimes, chart regularization further reduces high-percentile localization errors, suggesting that geometry-aware latent representations are more robust under sparse reporting.

[LG-75] From Persistence to Survival: Hypothesis Testing, Effect Sizes and Vectorisation for Topological Features

Link: https://arxiv.org/abs/2606.11911
Authors: Juliette Murris, Bernadette Stolz, Karsten Borgwardt
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Algebraic Topology (math.AT)

Abstract:Persistence diagrams are common representations in topological data analysis, but they do not naturally live in a vector space, and the statistical tools developed for comparing them have largely evolved separately from those used for downstream prediction. We introduce STRAND (Survival Topological Representation ANalysis of Diagrams), which treats (collections of) PDs as survival data: each topological feature with persistence value $p = d - b$ is a fully observed time-to-event, and the persistence survival function $S(t) = \mathbb{P}(p > t)$ is the central object for comparing diagrams. From this single representation we derive (i) a non-parametric two-sample test with calibrated Type I error and high power from a small number of diagrams; (ii) interpretable effect sizes; and (iii) a 1-Wasserstein-stable feature vector for downstream machine learning. We validate calibration and power on synthetic manifolds with controlled topology, demonstrate competitive vectorisation across 14 graph and 3D point cloud benchmarks, and apply the method to study functional brain connectivity in fMRI/neuroscience data. To our knowledge, STRAND is the first method to provide hypothesis testing and vectorisation for persistence diagrams from a single coherent and interpretable representation.
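
As a rough illustration of treating persistence values as fully observed event times, the sketch below computes an empirical survival function for a single diagram. It assumes that S(t) is the fraction of features whose persistence d - b exceeds t; the function names and the toy diagram are hypothetical and are not taken from STRAND.

```python
def persistence_values(diagram):
    """Persistence p = d - b for each (birth, death) pair."""
    return [death - birth for birth, death in diagram]


def survival_function(diagram, t):
    """Empirical S(t): fraction of features with persistence greater than t.

    Assumption: every feature is fully observed (no censoring), so the
    estimate is a plain empirical tail proportion.
    """
    p = persistence_values(diagram)
    return sum(1 for v in p if v > t) / len(p)


# Toy diagram with persistence values 1.0, 0.5, 2.0 and 0.25.
diagram = [(0.0, 1.0), (0.5, 1.0), (0.0, 2.0), (0.25, 0.5)]
curve = [survival_function(diagram, t) for t in (0.0, 0.5, 1.0, 2.0)]
```

Evaluating the curve on a fixed grid of thresholds yields a fixed-length vector per diagram, which is one simple way such a representation could feed a downstream model.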

[LG-76] Seeing Below the Limit of Detection: A Censored-Poisson Bayesian Latent-Growth Change-Point Detector (the Span Detector) for Serial ctDNA in HR+/HER2- Metastatic Breast Cancer

Link: https://arxiv.org/abs/2606.11876
Authors: Aarchi Singh Thakur, Abhijoy Sarkar
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 9 pages, 4 figures, 2 tables. Code and synthetic data generator: this https URL

Abstract:Circulating-tumour DNA (ctDNA) carries evidence of drug resistance months before imaging shows it, but the earliest evidence lives below the assay’s limit of detection (LoD): a nascent subclone is detected only intermittently, producing a flickering sequence of faint detects and non-detects. Commercial liquid biopsies treat each draw as an independent snapshot and a non-detect as nothing. We argue a non-detect is a left-censored observation, and the pattern of non-detects and faint detects over time carries actionable evidence of growth before any single value is trustworthy. We introduce Span, a censored-Poisson Bayesian latent-growth change-point detector that models the binary detection process, accumulates a sequential generalised-likelihood-ratio statistic for an upward change-point in the per-variant detection rate, and raises a competing-risks alarm with calibrated false-alarm control. Span has no learned weights, so there is nothing to overfit. On a synthetic cohort of HR+/HER2- metastatic breast cancer on first-line CDK4/6-inhibitor plus endocrine therapy, at a matched 10% false-alarm rate, Span roughly doubles the fraction of impending progressions caught three months ahead (indolent regime: 25% vs 11% for the snapshot), with a falsifiable dose-response: large for indolent emergence, vanishing for fast emergence. A value-trajectory baseline performs identically to the snapshot, isolating the gain to the censored detection model. The survival backbone matches a Cox baseline on real breast-cancer data (GBSG-2, n=686; C-index 0.67 vs 0.68), and on a real longitudinal cohort with clean biomarkers (PBC2, n=312) the same pipeline correctly declines to win, a falsifiable boundary test confirming the mechanism is regime-specific. All ctDNA trajectories are synthetic.

[LG-77] Modelling magnetic material properties with uncertainty-aware neural networks

Link: https://arxiv.org/abs/2606.11870
Authors: Clemens Wager, Heisam Moustafa, Alexander Kovacs, Qais Ali, Harald Oezelt, Hayate Yamano, Masao Yano, Noritsugu Sakuma, Hyuga Hosoi, Akihito Kinoshita, Tetsuya Shoji, Akira Kato, Thomas Schrefl
Subjects: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: preprint, unreviewed version

Abstract:Machine learning is increasingly applied to accelerate the discovery of novel materials by exploring large compositional and structural design spaces. Yet, the scarcity of high-quality data and the frequent need for out-of-distribution prediction introduce substantial uncertainty, making the assessment of model reliability essential. In this work, we investigate uncertainty quantification as a means to evaluate model confidence in the context of permanent magnet research. In a first study, we benchmark classical and modern machine learning models for predicting intrinsic magnetic properties, focusing on the quality of their uncertainty estimates. We apply Gaussian negative log-likelihood loss and dropout-based Bayesian approximation as practical strategies for estimating predictive uncertainty. In a second study, we transfer these architectural features for uncertainty estimation to a more complex task: predicting coercivity from microstructural information using a graph neural network. Together, these studies demonstrate that uncertainty quantification not only enhances the trustworthiness of predictions but is also transferable across different modeling tasks.
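
The Gaussian negative log-likelihood loss mentioned in this abstract has a standard closed form, sketched below for a single scalar target. The variance floor `eps` and the function name are assumptions added for numerical safety; the paper's exact parameterisation is not given in the abstract.

```python
import math


def gaussian_nll(y, mean, var, eps=1e-6):
    """Negative log-likelihood of y under a Gaussian N(mean, var).

    The model predicts both a mean and a variance; the variance acts as
    the uncertainty estimate. `eps` is a floor to avoid division by zero.
    """
    var = max(var, eps)
    return 0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)


# The same error of 3.0 is penalised far more when the model claims
# low variance (overconfidence) than when it admits high variance.
overconfident = gaussian_nll(3.0, 0.0, 0.1)
cautious = gaussian_nll(3.0, 0.0, 9.0)
```

The loss rewards small predicted variance only where the mean is accurate, which is why it yields usable per-prediction uncertainty estimates.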

[LG-78] Conformal Bayes under Label Shift: Post-Hoc Calibration vs. In-Training Adaptation ICML2026

Link: https://arxiv.org/abs/2606.11865
Authors: Seungjin Choi
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 2nd Workshop on Epistemic Intelligence in Machine Learning (EIML@ICML 2026)

Abstract:Conformal Bayes combines Bayesian posterior predictives with conformal calibration to produce prediction sets that are both statistically valid and geometrically efficient. We study conformal Bayes under label shift from a unified perspective, identifying two complementary approaches that restore nominal target-domain coverage through importance-weighted conformal calibration but operate through independent mechanisms. Post-hoc calibration tilts the posterior predictive toward the target domain and corrects the conformal threshold via an importance-weighted quantile, leaving the parameter posterior unchanged. In-training adaptation tilts the parameter posterior itself to the target domain, producing a corrected predictive whose highest predictive density (HPD) region serves as the prediction set under the fitted target predictive; efficiency is model-dependent and does not imply finite-sample conditional optimality. Two controlled experiments show that in an unbiased training regime both strategies achieve valid coverage equally, while in a lead-optimization regime in-training adaptation acts as a debiasing operator, reducing interval width at unchanged coverage.

[LG-79] REACH: Interpretability-Driven Feature Identification and Architecture Compression for Multi-Channel Vehicular Channel Estimation

Link: https://arxiv.org/abs/2606.11857
Authors: Simbarashe Aldrin Ngorima, Albert Helberg, Marelie H. Davel
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 22 pages, 16 figures

Abstract:Multi-channel mixed-SNR training improves out-of-distribution (OOD) generalisation of deep learning channel estimators for IEEE 802.11p vehicular communications, yet the internal mechanism responsible for this remains unexplained. This work presents REACH (Relevance-based Explanation and Architectural Compression for cHannel estimators), a gradient-based interpretability framework that operates at two levels. Input-level attribution identifies a subset of time-frequency features consistently relevant across all evaluated channel conditions, enabling input dimensionality reduction with minimal performance loss. Filter-level attribution reveals a near-universal internal representation, providing a representational account of the observed OOD generalisation. Guided by the resulting filter taxonomy, relevance-guided architecture compression substantially reduces both the number of parameters and the number of floating-point operations (FLOPs) with sub-1 dB normalised mean square error (NMSE) degradation, and OOD generalisation degrades more slowly than within-distribution accuracy under increasing compression.

[LG-80] Deterministic Policy Gradient for Learning Equilibrium in Time-Inconsistent Control Problems

Link: https://arxiv.org/abs/2606.11798
Authors: Xin Guo, Yijie Huang, Xiang Yu
Subjects: Computational Finance (q-fin.CP); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: Keywords: Time-inconsistent control, two-stage reformulation, model-free continuous-time reinforcement learning, deterministic policy gradient, fixed point iteration

Abstract:In this paper, we develop a continuous-time model-free reinforcement learning algorithm to learn deterministic equilibrium policies in general time-inconsistent control problems. Utilizing the extended Hamilton-Jacobi-Bellman system, we recast the original time-inconsistent problem into an equivalent two-stage problem. In the first stage, for given auxiliary functions, we employ the deterministic policy gradient approach to learn an optimal policy in an auxiliary time-consistent control problem. In the second stage, given the updated policy, we exploit the inner fixed point iterations and some martingale characterizations to learn the auxiliary functions. As a theoretical contribution, we provide some mild model assumptions and establish the convergence of inner fixed point iterations. By repeating this actor-critic style of iterations across two stages, our algorithm aims to learn the equilibrium under different sources of time-inconsistency in a unified manner. The superior effectiveness of the proposed algorithm is illustrated in two classical financial applications with time-inconsistency: mean-variance portfolio management and optimal tracking portfolio under non-exponential discounting.

[LG-81] Last-Iterate Convergence of Optimistic Multiplicative Weight Update

Link: https://arxiv.org/abs/2606.11773
Authors: Francesco Orabona
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)

Abstract:Optimistic Gradient Descent Ascent (OGDA) and Optimistic Multiplicative-Weights Update (OMWU) are two very popular algorithms to solve convex/concave saddle-point problems, where OMWU is the non-Euclidean, entropic version of OGDA. It has been known since the '80s that the last iterate of OGDA asymptotically converges to a saddle point in smooth problems. On the other hand, it is unknown whether OMWU has the same property. In this paper, I show that OMWU converges asymptotically for smooth convex-concave saddle-point problems, with a small enough constant learning rate. The result does not require uniqueness, strict complementarity, an error bound, or initialization near a solution. The main new ingredient is a boundary argument showing that every cluster point satisfies the inactive-coordinate KKT inequalities. The boundary argument was discovered with assistance from ChatGPT and is documented in the appendix.
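
For readers unfamiliar with OMWU, the following is a minimal sketch of the update on the simplest special case, a bilinear matrix game min_x max_y x^T A y over probability simplices. It is not taken from the paper, which treats general smooth convex-concave problems; the function names, the matching-pennies payoff matrix, the step size, and the iteration count are all illustrative assumptions.

```python
import math


def normalise(v):
    """Rescale a positive vector so that it sums to one."""
    s = sum(v)
    return [vi / s for vi in v]


def omwu_matrix_game(A, x, y, eta=0.1, steps=5000):
    """OMWU for min_x max_y x^T A y with a constant learning rate `eta`."""
    n, m = len(A), len(A[0])

    def grads(x, y):
        gx = [sum(A[i][j] * y[j] for j in range(m)) for i in range(n)]  # A y
        gy = [sum(A[i][j] * x[i] for i in range(n)) for j in range(m)]  # A^T x
        return gx, gy

    gx_prev, gy_prev = grads(x, y)
    for _ in range(steps):
        gx, gy = grads(x, y)
        # Optimistic step: use the extrapolated gradient 2 * g_t - g_{t-1}.
        x = normalise([x[i] * math.exp(-eta * (2 * gx[i] - gx_prev[i])) for i in range(n)])
        y = normalise([y[j] * math.exp(eta * (2 * gy[j] - gy_prev[j])) for j in range(m)])
        gx_prev, gy_prev = gx, gy
    return x, y


# Matching pennies: the unique saddle point is the uniform strategy for both players.
A = [[1.0, -1.0], [-1.0, 1.0]]
x_last, y_last = omwu_matrix_game(A, [0.8, 0.2], [0.3, 0.7])
```

Dropping the extrapolation (using `gx` in place of `2 * gx - gx_prev`) gives plain multiplicative weights, whose last iterate is known to cycle away from the saddle point in this game; the optimistic correction is what makes the last iterate settle.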

[LG-82] Renewable Lasso without Batch-Number Constraints: A Gradient-Enhanced Approach

Link: https://arxiv.org/abs/2606.11738
Authors: Junzhuo Gao, Ling Peng, Xu Guo, Heng Lian
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We study online estimation for high-dimensional generalized linear models with streaming data. First, for the non-distributed setting, we propose a gradient-enhanced surrogate loss that approximates the cumulative loss using only historical summaries, which modifies and improves upon the existing renewable estimation approach for the same model in the high-dimensional setting, and removes the batch-number constraint in previous studies. We then extend the method to distributed streaming data under the master-client architecture, where batches are partitioned across sites and only summaries (gradient vectors) are exchanged. Instead of directly applying the popular method of Jordan et al. (2019) to the surrogate quadratic loss, our adjusted approach does not require the clients to compute the full surrogate loss. We derive non-asymptotic error bounds under the high-dimensional scaling, without the stringent constraint on the number of batches in the previous studies. Simulation results under linear and logistic models, together with a real-data application, show improved accuracy over existing renewable estimators.

[LG-83] Machine-learning clustering of close-in exoplanet populations: links to pebble accretion

Link: https://arxiv.org/abs/2606.11737
Authors: Yi Duann, Anders Johansen, Haiyang S. Wang, H. Jens Hoeijmakers
Subjects: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)

Abstract:Close-in exoplanets exhibit a wide range of orbital architectures and physical properties shaped by both formation conditions and migration processes. Although population-synthesis models predict distinct planetary populations, establishing a quantitative connection between observed exoplanets and synthetic populations remains challenging. We investigate the intrinsic organisation of close-in exoplanets using physically motivated dynamical parameters and connect the resulting populations to pebble-accretion formation pathways. A two-stage Gaussian mixture model (GMM) is applied to an observed sample of close-in exoplanets, performing unsupervised probabilistic clustering in a feature space dominated by dynamical descriptors of planet-star interactions. The resulting clusters are mapped onto a pebble-accretion synthetic population within a statistically motivated three-dimensional parameter space. Formation-related quantities, including gas availability, gas fraction, and ice-rock mass ratio, are then used to interpret the mapped populations. We identify statistically supported sub-populations without imposing predefined classification boundaries, including very-massive gas giants, hot giants, warm-Jupiter-dominated systems, and lower-mass giants. The mapped synthetic populations reveal systematic differences in formation timing, gas accretion, and solid growth histories. In particular, very-massive gas giants are preferentially associated with earlier formation epochs than hot-giant and warm-Jupiter-dominated populations. These results demonstrate that physically motivated machine-learning approaches can provide a statistically robust framework for linking observed exoplanet populations to theoretical planet formation pathways.

[LG-84] Higher-Order Token Interactions via Quantum Attention

Link: https://arxiv.org/abs/2606.11673
Authors: Jian Xu, Chao Li, Delu Zeng, John Paisley, Qibin Zhao
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)

Abstract:Standard dot-product self-attention computes, in a single layer, only pairwise (order-2) interactions between tokens; representing a generic order-$k$ interaction is known to require either super-quadratic resources in one layer or composition across depth. We introduce Quantum Higher-Order Attention (QHA), a shallow, hardware-realizable quantum attention head that, via data re-uploading and an all-to-all non-Clifford entangler, synthesizes order-$k$ token interactions inside the circuit and exposes them through a local single-qubit read-out. We prove (i) an expressivity separation: any single standard self-attention layer with embedding dimension $m$, $H$ heads and $p$-bit precision satisfying $mHp = o(N/\log\log N)$ cannot represent the order-$k$ correlation family that one QHA head represents with circuit depth $O(\log k)$ ($O(k)$ two-qubit gates); and (ii) a trainability guarantee for its local-design instantiation: with a local read-out and $O(\log n)$ depth the gradient variance is $\Omega(1/\mathrm{poly}(n))$ (no barren plateau), which we confirm empirically, while being explicit that the more expressive all-to-all instantiation we benchmark is trained empirically and shows exponentially decaying gradients. Empirically, at a $6.5\times$ smaller parameter budget, QHA generalizes hidden-subset parity of every order $k \le 6$ from disjoint inputs, whereas the larger classical attention head collapses past order 2; consistent with theory, the size of the advantage tracks the target's Fourier degree: largest for parity and shrinking when low-order structure is present. As an application, QHA serves as a compact high-order interaction detector across three domains (genetic epistasis, learning-parity-with-noise, and graph triangle detection), reaching the noise ceiling at the smallest parameter budget where field-standard linear methods fail.

[LG-85] Integral Formulation of QENDy for Robust Nonlinear System Identification

Link: https://arxiv.org/abs/2606.11629
Authors: Nikhil Saran, Sushant Pokhriyal, Stefan Klus, Rushikesh Kamalapurkar, Joel A. Rosenfeld
Subjects: Dynamical Systems (math.DS); Machine Learning (cs.LG)

Abstract:This manuscript proposes an integral formulation of the newly defined quadratic embedding method for identifying nonlinear systems (QENDy). In the original algorithm, trajectory data points along with their time derivatives are used. Methods for calculating time derivatives make the algorithm sensitive to noise. Our integral formulation does not use the time derivatives. This results in a more robust method to learn the dynamics.
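
The contrast between derivative-based and integral-based identification can be shown on a toy problem. The sketch below is not QENDy and does not reproduce its quadratic embedding; it fits the single coefficient a in the scalar model dx/dt = a x, and every name, the window length, and the test signal are assumptions chosen for illustration.

```python
import math


def fit_rate_derivative(ts, xs):
    """Least-squares fit of a in dx/dt = a * x using finite-difference derivatives."""
    num = den = 0.0
    for k in range(len(xs) - 1):
        dxdt = (xs[k + 1] - xs[k]) / (ts[k + 1] - ts[k])
        num += dxdt * xs[k]
        den += xs[k] * xs[k]
    return num / den


def fit_rate_integral(ts, xs, window=20):
    """Least-squares fit of a using x(t_b) - x(t_a) = a * integral of x over [t_a, t_b]."""
    num = den = 0.0
    for start in range(0, len(xs) - window, window):
        stop = start + window
        # Trapezoidal rule for the integral of x over the window.
        integral = sum(
            0.5 * (xs[k] + xs[k + 1]) * (ts[k + 1] - ts[k]) for k in range(start, stop)
        )
        num += (xs[stop] - xs[start]) * integral
        den += integral * integral
    return num / den


# Noise-free samples of x(t) = exp(-0.5 t); the true coefficient is a = -0.5.
ts = [0.01 * k for k in range(401)]
xs = [math.exp(-0.5 * t) for t in ts]
a_derivative = fit_rate_derivative(ts, xs)
a_integral = fit_rate_integral(ts, xs)
```

The derivative version divides differences of neighbouring samples by the time step, which amplifies measurement noise, whereas the integral version only sums samples over a window. That is the intuition behind the robustness claim in the abstract.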

[LG-86] Family-Aware Residual Architecture for Predicting Quantum Circuit Simulation Performance

Link: https://arxiv.org/abs/2606.11620
Authors: Honjar Xing, Yehong Jiang, Xianbang Wang, Zehua Wang, Zhicheng Jiang
Subjects: Quantum Physics (quant-ph); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: Accepted as a full paper at IEEE ISVLSI 2026 (QC-CSAA Workshop). To appear in IEEE Xplore. 6 pages, 1 figure, 2 tables

Abstract:Approximate tensor-network simulators enable classical simulation of quantum circuits beyond the reach of exact methods, but selecting optimal approximation parameters – such as bond dimension thresholds – remains a costly trial-and-error process. We present a family-aware neural architecture that predicts both the minimum approximation threshold required to achieve target fidelity and the expected wall-clock runtime for quantum circuit simulation, given only the circuit’s OpenQASM description and execution context. Our key insight is that quantum circuits from different algorithmic families (e.g., QFT, Grover, VQE) exhibit fundamentally distinct simulation cost profiles due to their differing entanglement structures. We employ family-conditioned residual corrections – additive, family-specific adjustments atop a shared backbone, drawing on established conditional computation techniques – enabling the model to capture both universal circuit properties and algorithmic nuances. The architecture incorporates a pretrained family classifier (97.5% accuracy) and domain-informed algorithm fingerprint features derived from gate-composition heuristics. Evaluated on circuits spanning 7–130 qubits across 10 algorithm families, our system achieves 79.5% exact threshold accuracy (91.2% within one rung) and R^2 = 0.82 runtime correlation, with inference completing in approximately 50 ms – replacing trial-and-error simulation runs that may take minutes to hours. Ablation studies confirm that family-aware modeling provides the single largest performance improvement (+3.2 percentage points), validating the hypothesis that algorithm family is a first-class feature for simulation cost prediction.

[LG-87] Enhancing Spectral Embedding through Robust and Flexible Knowledge Transfer in Electronic Health Records

Link: https://arxiv.org/abs/2606.11570
Authors: Feiqing Huang, Zongqi Xia, Rong Ma, Tianxi Cai
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

Abstract:We propose a spectral-based, unsupervised representation learning framework to derive low-dimensional embeddings for clinical concepts and patients in rare disease cohorts from electronic health records, where data are high-dimensional but sample sizes are limited. To overcome this challenge, we incorporate a knowledge matrix extracted from a broader population that shares a partially overlapping subspace with the rare-disease cohort. Our method departs from existing approaches by relaxing restrictive one-to-one signal-alignment assumptions between the latent data matrix and knowledge matrix, allowing more flexible and realistic forms of structured sharing. We introduce a novel two-step spectral embedding procedure: first, we identify and remove irrelevant components from the knowledge matrix; then, we apply a projection-based method to separately recover shared and heterogeneous components. Simulations and an analysis of a real-world multiple sclerosis cohort show that the proposed method outperforms competing approaches, particularly in challenging scenarios where shared signals are weak and only partially aligned, as is common in rare-disease data.

[LG-88] Urban Heat MiniCubes: An AI-Ready dataset for urban heat research

Link: https://arxiv.org/abs/2606.11534
Authors: Jonathan Starfeldt, Maria J. Molina, Alexander Kerr, Adam Yang, Thomas R.H. Holmes, Christopher R. Hain
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Comments: 53 pages, 26 figures, Submitted to Nature Scientific Data

Abstract:Urban heat is amplified by impermeable surfaces and heterogeneous built environments, yet street-level variability remains difficult to quantify because multi-sensor observations are rarely available in consistent, analysis-ready form at the necessary spatiotemporal scales. We present “Urban Heat MiniCubes,” a publicly available, FAIR-oriented dataset designed for machine learning applications in urban heat research. The dataset provides harmonized 90 x 90 km gridded data cubes for 48 cities in the Western Hemisphere spanning 2022-2023, with variables reprojected and collocated to a common grid to reduce preprocessing (e.g., reprojection, resampling, and spatiotemporal alignment). Urban Heat MiniCubes includes two complementary modalities: (i) higher-spatial-resolution, lower-frequency observations from Landsat 8/9 (e.g., surface reflectances) and Sentinel-1 (e.g., synthetic aperture radar backscatter), and (ii) higher-temporal-frequency, coarser observations from GOES-R (e.g., longwave infrared brightness temperatures) and a microwave land surface temperature product. We document variables and metadata and provide technical assessment using inter-variable analyses and autoencoder-based reconstruction-error summaries across pixel classes (e.g., water and cloud). Potential use cases and limitations are also discussed.

[LG-89] FlexiBrain: Resolution-Agnostic Voxel-Level Encoding for Native fMRI

Link: https://arxiv.org/abs/2606.11500
Authors: Mo Wang, Wenhao Ye, Junfeng Xia, Minghao Xu, Hongkai Wen, Quanying Liu
Subjects: Image and Video Processing (eess.IV); Computational Engineering, Finance, and Science (cs.CE); Information Theory (cs.IT); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)

Abstract:The success of large-scale deep learning models in neuroscience is fundamentally constrained by severe data heterogeneity. Native fMRI data aggregated from diverse sources exhibit substantial variation in both spatial and temporal resolutions. Consequently, most existing frameworks rely on lengthy, rigid preprocessing pipelines that enforce uniformity across datasets. This practice introduces two critical limitations: (1) potential degradation of subject-specific anatomical information; (2) significant computational overhead, often requiring hours of processing per subject. Here, we propose FlexiBrain, a resolution-agnostic voxel-level encoding framework for native fMRI based on Mamba-JEPA. FlexiBrain defines patch sizes in real-world physical units and employs dynamic patch resizing, thereby bypassing destructive spatial standardization while enabling direct ingestion of data in native space. We instantiate the framework using an efficient Mamba-JEPA backbone to model high-dimensional 4D fMRI signals. Across five diverse downstream neuroscience tasks, FlexiBrain consistently outperforms recent state-of-the-art methods, achieving gains of up to 12 percentage points without external data augmentation. Importantly, FlexiBrain functions as a seamless plug-in module, substantially reducing preprocessing costs and accelerating the development of robust voxel-level fMRI foundation models. Code is available at this https URL.

[LG-90] Spatially Masked Regression Reveals Local and Distributed Predictability in Electrophysiological Recordings

Link: https://arxiv.org/abs/2606.11415
Authors: Maryam Ostadsharif Memar, Nima Dehghani
Subjects: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Quantitative Methods (q-bio.QM)

Abstract:Neural recordings are often interpreted as local measurements, yet the signal at any one sensor can also reflect structured activity distributed across the broader network. This raises a basic question: to what extent does an electrode’s signal reflect local versus distributed information in the underlying system? More specifically, how much of an electrode’s activity is carried by its immediate neighborhood, and how much is embedded more broadly across the array? We address this with a Spatially Masked Regression (SMR) framework that reconstructs each electrode’s timeseries from the remaining electrodes while excluding a configurable neighborhood around the target. By progressively increasing this mask, spatial locality becomes an experimental control for quantifying how much predictive information survives after nearby channels are withheld. We apply SMR to intracranial EEG with heterogeneous electrode coverage and to scalp EEG with standardized montages over sensorimotor cortex. Using distance correlation between original and reconstructed signals, we find strong within-subject reconstruction in both modalities, substantial residual predictability even when local neighbors are excluded, and markedly stronger cross-subject transfer in EEG than in iEEG. Masking shows that nearby electrodes contribute strongly to reconstruction but do not account for all of it, indicating that individual channels reflect both local redundancy and broader distributed structure. Surrogates that preserve selected marginal or spectral properties while disrupting phase structure or temporal ordering substantially reduce performance, supporting the conclusion that SMR depends on structured temporal and cross-channel organization rather than on marginal statistics alone. These results position SMR as an interpretable framework for quantifying the balance between local and distributed information in recordings.
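
The masking step described above, in which a configurable neighbourhood around the target electrode is withheld from the predictors, can be sketched in a few lines. The regression model itself is not specified in the abstract and is left out here; the function name, the Euclidean distance criterion, and the electrode coordinates are assumptions for illustration.

```python
import math


def predictor_channels(positions, target, radius):
    """Indices of channels allowed to predict `target`: those farther than `radius` away."""
    target_position = positions[target]
    return [
        i
        for i, position in enumerate(positions)
        if i != target and math.dist(position, target_position) > radius
    ]


# Hypothetical electrode coordinates on a line (arbitrary units).
positions = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (5.0, 0.0)]

# Growing the mask radius progressively withholds nearby channels.
masks = {radius: predictor_channels(positions, 0, radius) for radius in (0.0, 1.5, 10.0)}
```

Fitting the same regressor on each `masks[radius]` and comparing reconstruction quality across radii would give the locality curve the abstract describes: how much predictability survives once neighbours are excluded.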

[LG-91] Annealed Entropic Allocation for Ranking and Selection

Link: https://arxiv.org/abs/2606.11347
Authors: Xin Fei, Juergen Branke
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:We propose Annealed Entropic Allocation, an annealed weighted soft-min framework for sequential budget allocation in ranking and selection. The central idea is to replace the non-smooth maximin large-deviation rate objective with a weighted log-sum-exp surrogate that aggregates challenger-specific pairwise scores through soft-min weights, mitigating hard switching when several challengers are nearly active. To improve finite-budget discrimination, we incorporate the saddlepoint approximation – a sub-exponential correction derived from refined pairwise tail asymptotics. Because these corrections are sub-exponential and the smoothing parameter is annealed to zero, the surrogate preserves the same first-order large-deviation target as the classical maximin formulation. We show that the surrogate converges uniformly to the hard minimum, that the soft-min weights concentrate on the active challengers, and that, under fixed weights, the induced target allocation map is continuous on the simplex interior. Numerical experiments on Gaussian and exponential instances demonstrate competitive performance, especially when multiple challengers are nearly tied.
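
The smoothing idea can be illustrated with an unweighted log-sum-exp soft-min. This is a simplified sketch rather than the paper's surrogate, which additionally carries weights and saddlepoint corrections; the function name, the temperature symbol `tau`, and the example scores are assumptions.

```python
import math


def soft_min(scores, tau):
    """Return (-tau * log(sum_i exp(-s_i / tau)), soft-min weights), computed stably."""
    lowest = min(scores)
    # Shift by the minimum so the largest exponent is exactly zero.
    shifted = [math.exp(-(s - lowest) / tau) for s in scores]
    total = sum(shifted)
    value = lowest - tau * math.log(total)
    weights = [e / total for e in shifted]
    return value, weights


# Two nearly tied challengers and one clearly inactive challenger (hypothetical scores).
scores = [1.0, 1.05, 3.0]
warm_value, warm_weights = soft_min(scores, tau=0.05)
cold_value, cold_weights = soft_min(scores, tau=0.001)
```

At the larger temperature the two nearly tied challengers share the weight, which avoids hard switching between them. As `tau` is annealed toward zero the value approaches the hard minimum and the weight concentrates on the active challenger, matching the limiting behaviour stated in the abstract.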

[LG-92] SPADE: Split-and-Delay Embeddings for Autoregressive High-Granularity Calorimeter Simulation

Link: https://arxiv.org/abs/2606.11304
Authors: Joschka Birk, Frank Gaede, Anna Hallin, Gregor Kasieczka, Martina Mozzanica, Henning Rose
Subjects: Instrumentation and Detectors (physics.ins-det); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); High Energy Physics - Phenomenology (hep-ph)
Comments: 20 pages, 13 figures

Abstract:We introduce SPADE (SPlit And Delay Embeddings), an autoregressive transformer for sequences whose tokens carry multiple features. Rather than embedding these features jointly, SPADE embeds them independently. Delaying each feature stream relative to the previous one allows intra-token correlations to be learned by the standard self-attention mechanism. Applied to point-cloud calorimeter shower generation in the highly granular ILD detector, SPADE is competitive with the state-of-the-art AllShowers model on photon showers, and substantially outperforms its VQ-VAE-based predecessor OmniJet-$\alpha_C$. The mechanism is applicable to any generative task with multi-feature tokens, enabling LLM-style pretraining workflows for higher-dimensional data.

[LG-93] Interpretable Neural Marked Statistics for Cosmological Inference ICML2026

Link: https://arxiv.org/abs/2606.11295
Authors: Federico Semenzato, Benjamin D. Wandelt, Michele Liguori, Alvise Raccanelli
Subjects: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Machine Learning (cs.LG)
Comments: 11 pages, 6 figures. Accepted to the Workshop on AI for Physics (ICML 2026)

Abstract:Recovering cosmological information beyond the power spectrum is a central goal for upcoming cosmological surveys, since late-time non-Gaussian signal in the matter density cannot be accessed through two-point statistics alone. Marked statistics fold part of this information back into the two-point level by reweighting the field with non-linear functions. We propose a neural marking scheme to generalize this process through a set of interpretable, physically motivated transformations that make the gain in cosmological information directly interpretable at the morphological level. We employ a contrastive learning objective to align learnable marked summaries with the underlying cosmological parameters. At $k_{\max} = 0.2\,h\,\mathrm{Mpc}^{-1}$, our neural mark tightens the marginalized constraint on $\sigma_8$ by $2.9\times$ and on $\Omega_m$ by $1.8\times$ compared to classical marks, breaking the $\Omega_m$-$\sigma_8$ degeneracy at the Fisher information level. It further reduces the parameter MSE across our cosmological parameter prior by $1.45\times$ over the best classical mark. The learned latent geometry aligns with the $\Omega_m$ and $\sigma_8$ directions in parameter space, indicating that the contrastive objective recovers the dominant axes of cosmological information. Our approach opens the door to more powerful, interpretable summary statistics for cosmological inference.

[LG-94] Geometric bias in eigenspace perturbation under random heterogeneous noise

Link: https://arxiv.org/abs/2606.11263
Authors: Fengkai Liu, Ke Wang, Wanjie Wang
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR)
Comments: 104 pages, 1 figure

Abstract:Spectral methods rely fundamentally on the stability of principal eigenspaces under random perturbations. Classically, this stability is quantified by the Davis-Kahan and Wedin theorems, which bound the eigenspace error using the operator norm of the noise and the relevant spectral gaps. While these worst-case bounds are sharp for arbitrary deterministic perturbations, they can be wasteful in the low-rank signal-plus-random-noise setting, as they fail to capture the fine-grained interaction between the signal geometry and the noise distribution. In this paper, we study the spectral perturbation of signal-plus-noise matrices corrupted by sparse, random noise with an arbitrary, inhomogeneous variance profile. We demonstrate that under heterogeneous noise variances, the empirical eigenvectors suffer a systematic, deterministic geometric bias that is entirely invisible to classical perturbation bounds. By leveraging the Quadratic Vector Equation (QVE) and establishing fine-grained isotropic local laws, we derive near-optimal, non-asymptotic perturbation bounds for the leading eigenspaces in the operator and $2\to\infty$ norms. The bounds separate the usual signal-to-noise contribution, stochastic fluctuations, and structured geometric bias terms determined by the alignment between the signal eigenspaces and the row-wise variance profile.

[LG-95] My Chemical Harness: Evolutionary Molecular Design over Synthetic Pathways with Large Language Model Agents

Link: https://arxiv.org/abs/2606.11256
Authors: César Ojeda, Darius A. Faroughy, Maryam Karimi, Payam Zarrintaj, Mir Mehdi Seyedebrahimi, Martín Carballo-Pacheco
Subjects: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 27 pages | 10 figures

Abstract:Designing molecules with target properties is most useful when candidate structures are accompanied by feasible synthetic routes. We introduce My Chemical Harness, a route-native evolutionary framework for goal-directed molecular design in which the search population consists of executable synthetic pathways rather than isolated molecular graphs. Each route is built from purchasable building blocks and reaction templates, executed by deterministic chemistry tools, and scored through task-specific molecular oracles. Large language models (LLMs) are used only as strategy controllers that select high-level preferences over route length, move type, reaction families, motifs, and exploration pressure, while local code performs route construction, validation, deduplication, scoring, selection, and memory updates. This separation lets the LLM guide exploration without allowing it to introduce hallucinated products or unsupported reaction steps. On a soluble epoxide hydrolase proxy task, our LLM agent improves over single-pass LLM and deterministic controllers, reaching state-of-the-art performance across the sEH score, synthetic accessibility score, and AiZynthFinder success rate metrics. These results suggest that constrained LLM agents can play a significant role in molecular discovery without requiring training, fine-tuning, or dedicated generative models.

[LG-96] Physically Constrained Ensemble Gaussian Process Modelling for Expensive Quantum Systems with Heteroskedastic Noise

Link: https://arxiv.org/abs/2606.11240
Authors: Arpan Biswas, Surtirtha Paul, Joseph Agada, Matthias Thamm, Adrian Del Maestro
Subjects: Computational Physics (physics.comp-ph); Strongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: 14 pages, 6 figures in main text, 2 figures in Supp materials

Abstract:Accurate modeling of quantum many-body systems often requires computationally expensive simulations such as Density Matrix Renormalization Group (DMRG) or Quantum Monte Carlo (QMC) calculations. These methods, while precise, impose significant time and resource constraints, limiting their use in exhaustive parameter exploration. Moreover, these expensive simulations can contain variable errors over the large unknown parameter space, which need to be quantified and propagated. Thus, predictive modelling is required to estimate the functional space accurately over scarcely sampled data with heteroskedastic noise, while preserving the physical relevance of the estimation. Therefore, we present a Physically Constrained Ensemble Gaussian Process (pc-EGP) framework designed to efficiently model complex and noisy quantum systems under physical consistency constraints. The proposed method first enforces physical constraints as a user-controlled weighted penalty added to the data-driven loss function of the Gaussian Process (GP) surrogates. Then an ensemble of such GP models is trained on simulations with variable noise via a numerical quadrature method, in which the GPs at different nodes are combined as a quadrature-weighted average. We first demonstrate the framework on synthetically generated data before applying it to quantum systems. In the first case study, we leverage DMRG simulations of the Bose-Hubbard Model to predict the critical interaction parameter $U_c$ governing the superfluid-to-Mott-insulator transition. In the second case study, we demonstrate our method on QMC simulations of a quantum liquid confined inside a nanoporous silicate, with the goal of optimizing a chemical environment to realize a one-dimensional superfluid. Compared to conventional GP, pc-EGP achieves a better balance of accuracy and physically meaningful predictions.
