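This post lists the latest papers retrieved from arXiv.org on 2026-07-02. It is updated automatically and grouped into six broad areas: NLP, CV, ML, AI, IR, and MA.
Note: The daily paper data is fetched from arXiv.org and refreshed automatically at around 12:30 each day.
Tip: If the list has not been updated on a given day, either arXiv published no new papers that day or the script failed. Failures will be fixed the same day where possible.
Table of Contents
Overview (2026-07-02)
664 papers were updated today, including:
- Natural Language Processing: 93 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 186 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 161 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 145 papers (Machine Learning (cs.LG))
- Multiagent Systems: 13 papers (Multiagent Systems (cs.MA))
- Information Retrieval: 26 papers (Information Retrieval (cs.IR))
- Human-Computer Interaction: 21 papers (Human-Computer Interaction (cs.HC))
Multiagent Systems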
[MA-0] AutoMem: Automated Learning of Memory as a Cognitive Skill
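[Quick read]: This paper addresses the performance limits that large language models (LLMs) face on long-horizon tasks when they lack effective memory management. Existing approaches rely on hand-designed prompts, file schemas, and action vocabularies to organize memory, but these structures resist manual optimization, and memory mistakes often surface only long after they are made, which makes human review of full trajectories impractical. The paper proposes AutoMem, a framework that treats memory management as a trainable skill and automates its optimization in two loops. In the first loop, a strong LLM reviews complete agent trajectories and iteratively revises the memory structure (prompts, file schemas, and action vocabulary). In the second loop, the agent's own good memory decisions are identified across many episodes and used as training signal to improve the model's memory proficiency directly. Optimizing memory alone, without changing task-action behavior, improves the base agent by roughly 2x to 4x on three procedurally generated long-horizon games (Crafter, MiniHack, and NetHack), making a 32B open-weight model competitive with frontier systems such as Claude Opus 4.5 and Gemini 3.1 Pro Thinking. The results indicate that memory management is an independently learnable, high-leverage skill for long-horizon tasks.
Link: https://arxiv.org/abs/2607.01224
Authors: Shengguang Wu, Hao Zhu, Yuhui Zhang, Xiaohan Wang, Serena Yeung-Levy
Affiliations: Stanford University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: Project Website: this https URL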
Abstract:Memory expertise is a learned skill: knowing what to encode, when to retrieve, and how to organize knowledge–a capacity known in cognitive science as metamemory. We bring this perspective to LLMs by treating memory management as a trainable skill. We promote file-system operations to first-class memory actions alongside task actions, letting the model itself decide how to manage its memory. This memory skill improves along two axes: the structure that supports it (prompts, file schemas, action vocabulary), and the proficiency of the model exercising it. Both axes resist manual optimization: episodes in long-horizon tasks run for thousands of steps, and a single memory mistake can hide long before it surfaces, making human review of full trajectories impractical. We introduce AutoMem, a framework that automates both axes. In the first loop, a strong LLM reviews complete agent trajectories and iteratively revises the memory structure that shapes how the agent interacts with its memory files. In the second loop, the agent’s own good memory decisions are identified from many episodes and used as training signal to sharpen the model’s memory proficiency directly. Across three procedurally generated long-horizon games (Crafter, MiniHack, and NetHack), optimizing memory alone–without modifying the model’s task-action behavior–improved the base agent’s performance ~2x-4x, bringing a 32B open-weight model competitive with frontier systems such as Claude Opus 4.5 and Gemini 3.1 Pro Thinking. Our results show that memory management is an independently learnable skill, and a high-leverage objective yielding large gains on long-horizon tasks.
[MA-1] From Personas to Plot: Character-Grounded Multi-Agent Story Generation for Long-Form Narratives
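[Quick read]: This paper addresses the weak narrative consistency and plot coherence of large language models (LLMs) in long-form story generation. Existing methods tend to produce logical breaks, inconsistent character behavior, and factual hallucinations over long stories, and struggle to maintain a unified world and plot progression. The proposed solution is a unified multi-agent framework for narrative generation and verification, consisting of MAGNET and ATLAS. MAGNET uses persona-grounded character agents that propose actions based on a shared world state and evolving story goals, which strengthens structure and character consistency. ATLAS is a graph-based pipeline that compares scene-level world-state representations across a generated story to detect hallucinations. At 100 pages, MAGNET reduces annotations and hallucinations by 41% and 50%, respectively, relative to single-model prompting, and by 34% and 45%, respectively, relative to IBSEN; pairwise rubric evaluation shows similar results. This supports explicit world-state tracking and goal-driven multi-agent generation as a basis for controllable, structurally coherent long-form narratives.
Link: https://arxiv.org/abs/2607.00918
Authors: Aayush Aluru, Chloe Ho, Muhammad Hammouri, Kerry Luo, Myra Malik, Ryan Lagasse, Arjun Bahuguna, Vasu Sharma
Affiliations: Princeton University; University of Michigan; University of Maryland; Universitat Pompeu Fabra
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: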
Abstract:Although large language models (LLMs) have demonstrated impressive creative fiction generation, they struggle to maintain narrative consistency and coherent plot lines in long-form stories. In this work, we introduce a unified framework for long-form narrative generation and verification. MAGNET, a multi-agent goal-driven narrative engine for storytelling, generates stories with persona-grounded character agents that propose actions based on a shared world state and evolving story goals, while ATLAS is a graph-based pipeline that compares scene-level world representations across a generated story to detect hallucinations. By evaluating MAGNET using an LLM editor, pairwise rubric scoring, and ATLAS, we show that our framework produces coherent narratives compared to single-model prompting and IBSEN. At 100 pages, MAGNET reduced annotations and hallucinations by 41 and 50%, respectively, compared to the single model baseline and by 34 and 45%, respectively, compared to IBSEN, with pairwise rubric evaluation showing similar results. These results suggest that long-form narratives can emerge from explicit world-state tracking and goal-driven multi-agent generation, providing a foundation for controllable and structurally coherent long-form narrative generation.
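The idea of comparing scene-level world states to flag hallucinations can be illustrated with a small sketch. This is not the ATLAS pipeline itself: the flat fact dictionary, the `declared_changes` sets, and the function name `find_contradictions` are all assumptions made for illustration, and the real system is described as graph-based.

```python
from typing import Dict, List, Set, Tuple

Fact = Tuple[str, str]  # (entity, attribute); a hypothetical flat encoding

def find_contradictions(
    scenes: List[Dict[Fact, str]],
    declared_changes: List[Set[Fact]],
) -> List[Tuple[int, Fact, str, str]]:
    """Flag facts whose value differs from an earlier scene
    without the scene declaring that the fact changed."""
    known: Dict[Fact, str] = {}
    issues = []
    for index, (state, changes) in enumerate(zip(scenes, declared_changes)):
        for fact, value in state.items():
            previous = known.get(fact)
            if previous is not None and previous != value and fact not in changes:
                issues.append((index, fact, previous, value))
            known[fact] = value  # later scenes are compared against the latest value
    return issues

# Toy story: the location change is declared, the eye-color change is not.
scenes = [
    {("Mara", "location"): "harbor", ("Mara", "eye_color"): "green"},
    {("Mara", "location"): "castle"},
    {("Mara", "eye_color"): "blue"},
]
declared_changes = [set(), {("Mara", "location")}, set()]
issues = find_contradictions(scenes, declared_changes)
print(issues)
```

In this toy setup, a change that the plot accounts for passes silently, while an unexplained change is reported with the scene index and both values.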
[MA-2] Calibrating the Instrument: Controllability of an LLM-Driven Synthetic Population
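[Quick read]: This paper addresses the internal validity of Generative Synthetic Populations (GSP) before they are applied to real populations: does a synthetic population respond to stimuli of known valence in an ordered, replicable, group-structured way? The paper calls this property controllability. Rather than asking whether a synthetic population tracks real humans, it asks whether the latent structure imposed on the population is recovered in its own responses. The key contribution is SIVE (Synthetic Instrument Validation Experiment): in the fictional municipality of Montelago, 120 synthetic personas with known latent structure are exposed to seven institutional-communication conditions ranging from strongly positive to strongly negative, and seven pre-registered criteria, evaluated across a temperature sweep, jointly assess fidelity, stability, noise floor, specificity, sensitivity, and ordering. All seven criteria pass at every temperature. A central finding is that a message designed as "weakly positive" was identified as functionally negative, traced to unresolved problems, uncertainty, and institutional passivity in its text; a redesigned version restored the expected ordering and interacted with agents' latent trust in unanticipated ways. A noise sub-experiment shows that the instrument's intrinsic noise is roughly half the cross-agent estimate and stable across temperatures, and individual trajectories reveal coherent micro-dynamics that summary statistics obscure. Overall, the work provides a methodological basis for trustworthy use of GSP instruments and argues that internal controllability must be tested before any claim of external validity.
Link: https://arxiv.org/abs/2607.00910
Authors: Mirko Degli Esposti
Affiliations: University of Bologna
Categories: Multiagent Systems (cs.MA)
Comments: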
Abstract:Generative Synthetic Populations (GSP) – the convergence of population synthesis, agent-based modelling, and LLM agents – are attracting growing interest for urban simulation and institutional communication research. Before any GSP instrument is used on a real population, a more basic question must be answered: does it respond to stimuli of known valence in an ordered, replicable, group-structured way? We call this controllability. We ask not whether a synthetic population tracks humans, but whether it tracks itself: whether the latent structure we impose on it is recovered in its own responses. This internal-validity question is logically prior to any claim about external validity, just as characterising an instrument’s response function must precede using it to test a theory. We report SIVE (Synthetic Instrument Validation Experiment): a fictional municipality (Montelago) with 120 synthetic personas of known latent structure, exposed to seven conditions spanning strongly positive to strongly negative institutional communications about a water network. Seven pre-registered criteria, evaluated across a temperature sweep, jointly assess fidelity, stability, noise floor, specificity, sensitivity, and ordering. All seven pass at every temperature. A central finding turns a calibration failure into a diagnostic success: a message designed as “weakly positive” was identified by the instrument as functionally negative, traced to unresolved problems, uncertainty, and institutional passivity in its text; a redesigned version restored the expected ordering and interacts with agents’ latent trust in unanticipated ways. A noise sub-experiment shows the instrument’s intrinsic noise is roughly half the cross-agent estimate and stable across temperatures. Individual trajectories reveal coherent micro-dynamics that summary statistics obscure. Full data are available via an interactive explorer.
[MA-3] M2Note: Continual Evolution of Vision Language Models via Mistake Notebook Learning
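[Quick read]: This paper addresses recurring failures of Vision Language Models (VLMs) in multimodal reasoning, such as skipping key visual checks, misapplying domain rules, and hallucinating unsupported concepts. Existing solutions rely mainly on supervised fine-tuning (SFT) and reinforcement learning (RL), which are expensive to iterate and can be brittle under distribution shift. The paper proposes Multimodal Mistake Notebook Learning (M2Note), a training-free continual evolution framework that externalizes learning into an editable memory. M2Note turns failed trajectories into compact subject-guidance notes: the subject summarizes the underlying domain and concept, and the guidance gives reusable, actionable verification steps. At inference time, relevant notes are retrieved via multimodal retrieval-augmented generation (RAG) and appended to the model context to steer reasoning away from previously observed pitfalls. To stabilize continual evolution, M2Note uses batch-level post-verification with rollback, committing notebook edits only if they improve performance on the same batch, which reduces noisy updates and prevents regressions. The framework supports self-evolving (the same VLM acts as solver and supervisor) and cross-model evolving (a stronger supervisor guides a weaker solver), enabling capability transfer without weight updates. Experiments on six multimodal reasoning benchmarks show consistent improvements with strong cost and sample efficiency, and the method is complementary to Chain-of-Thought (CoT) prompting.
Link: https://arxiv.org/abs/2607.00685
Authors: Haiwen Li, Jing Tang, Rui Chen, Lei Sun, Xiangxiang Chu
Affiliations: AMAP, Alibaba Group
Categories: Multiagent Systems (cs.MA)
Comments: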
Abstract:Vision Language Models (VLMs) have demonstrated remarkable capabilities in multimodal reasoning tasks, yet they still suffer from recurring failures, such as skipping key visual checks, misapplying domain rules, and hallucinating unsupported concepts. Most existing solutions rely on supervised fine-tuning (SFT) and reinforcement learning (RL), which are expensive to iterate and can be brittle under distribution shift. To this end, we propose Multimodal Mistake Notebook Learning (M2Note), a training-free continual evolution framework that externalizes learning into an editable memory. M2Note transforms failed trajectories into compact subject-guidance notes: the subject summarizes the underlying domain and concept, while the guidance provides actionable verification steps that can be reused in future inference. At test time, M2Note retrieves relevant notes via multimodal retrieval-augmented generation (RAG) and appends them to the model context, steering reasoning away from previously observed pitfalls. To stabilize continual evolution, we adopt batch-level post-verification with rollback, which commits notebook edits only if they improve performance on the same batch, reducing noisy updates and preventing regressions. M2Note supports both self-evolving, where the same VLM acts as solver and supervisor, and cross-model evolving, where a stronger supervisor guides a weaker solver, enabling capability transfer without weight updates. Experiments on six multimodal reasoning benchmarks show consistent improvements across domains and backbones, while achieving strong cost and sample efficiency and remaining complementary to Chain-of-Thought (CoT) prompting.
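The commit-or-rollback rule described above (keep a notebook edit only if it improves performance on the same batch) can be sketched as follows. This is a minimal illustration, not the authors' code: `toy_accuracy`, the `(subject, guidance)` batch format, and `commit_if_better` are hypothetical stand-ins for running a VLM with retrieved notes.

```python
from typing import Callable, Dict, List, Tuple

Notebook = Dict[str, str]      # subject -> guidance note
Batch = List[Tuple[str, str]]  # (subject, guidance that fixes the item); toy format

def toy_accuracy(notebook: Notebook, batch: Batch) -> float:
    """Stand-in for model evaluation: an item counts as solved
    when the notebook holds the guidance that item needs."""
    hits = sum(1 for subject, needed in batch if notebook.get(subject) == needed)
    return hits / len(batch)

def commit_if_better(
    notebook: Notebook,
    edits: Notebook,
    batch: Batch,
    evaluate: Callable[[Notebook, Batch], float] = toy_accuracy,
) -> Tuple[Notebook, bool]:
    """Apply edits to a copy, re-evaluate on the same batch,
    and keep the copy only on a strict improvement."""
    candidate = {**notebook, **edits}
    if evaluate(candidate, batch) > evaluate(notebook, batch):
        return candidate, True   # commit
    return notebook, False       # rollback: the original notebook is untouched

batch = [("geometry", "check units"), ("charts", "read axis labels")]
notebook = {"geometry": "check units"}
notebook, kept_good = commit_if_better(notebook, {"charts": "read axis labels"}, batch)
notebook, kept_bad = commit_if_better(notebook, {"geometry": "guess the answer"}, batch)
print(kept_good, kept_bad, notebook)
```

The strict comparison means neutral edits are also rolled back, which is one reasonable reading of "only if they improve performance"; a tolerance or tie-breaking rule would be a design choice outside what the abstract states.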
[MA-4] From Real-Time Planning to Reliable Execution: Scalable Coordination for Heterogeneous Multi-Robot Fleets in Industrial Environments
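[Quick read]: This paper addresses real-time path planning and coordination for heterogeneous robot fleets operating at high density in industrial environments, focusing on how communication delays, execution uncertainties, and other disturbances cause robots to deviate from the temporal assumptions of planned paths, leading to excessive waiting and congestion propagation. The proposed solution is SCALE, a reactive online coordination framework. Its key elements are a motion-induced conflict reduction mechanism that supports online generation of feasible paths for conflict resolution, and a generalized Conjugate Action-Precedence Hypergraph (CAPH) that adaptively adjusts precedence relations among robots to improve robustness to disturbances. Experiments, including a deployment in a real warehouse, indicate that the framework achieves stable and efficient online coordination and path planning.
Link: https://arxiv.org/abs/2607.00591
Authors: Bo Cao, Zhe Liu, Hesheng Wang
Affiliations: Unknown
Categories: Robotics (cs.RO); Multiagent Systems (cs.MA)
Comments: 11 pages, 9 figures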
Abstract:With the increasing deployment of heterogeneous robot fleets in industrial environments, efficient coordination remains a critical challenge. Real-time path planning must simultaneously accommodate high robot densities and heterogeneous motion capabilities, while communication delays, execution uncertainties, and other disturbances may cause robots to deviate from the temporal assumptions underlying planned paths. Such deviations can lead to excessive waiting and congestion propagation across the fleet. This paper presents SCALE, a reactive online coordination framework that enables real-time planning while maintaining robust execution. Within this framework, we introduce a motion-induced conflict reduction mechanism to support the online generation of feasible paths for online conflict resolution. To mitigate the effects of disturbances, we further design a generalized Conjugate Action-Precedence Hypergraph (CAPH) that adaptively adjusts precedence relations among robots. Extensive validation experiments, together with a three-day deployment in a warehouse, demonstrate the
[MA-5] Agri-SAGE: Simulation-Grounded Multi-Agent LLM for Context-Aware Agricultural Advisory Generation
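[Quick read]: This paper addresses two problems in agricultural advisory systems. First, static agronomic guidelines are evidence-based but blind to in-season variability and uncertainty. Second, LLM-based advisory systems can generate recommendations that look agronomically plausible but lack physiological grounding. The paper proposes Agri-SAGE, a closed-loop framework that combines retrieval-grounded multi-agent LLM reasoning with APSIM-based biophysical simulation to generate and validate agronomic advisories. In a 10-year retrospective analysis comparing three reasoning approaches (Plan-and-Solve, Tree of Thoughts, and Reflexion), all three significantly outperform the static Package-of-Practice (PoP) baselines, with Tree of Thoughts achieving the highest peak yields. Reflexion achieves comparable agronomic outcomes at substantially lower computational cost by using cross-seasonal episodic memory.
Link: https://arxiv.org/abs/2607.00454
Authors: Vedant Balasubramaniam, Geetha Charan, Manojkumar Patil, Rohit P Suresh, V Priyanka, Kodur Sai Vinay Sathvik, Y. Narahari
Affiliations: Indian Institute of Science, Bengaluru; BNM Institute of Technology, Bengaluru
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: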
Abstract:Agricultural advisory systems face a fundamental tension: static agronomic guidelines offer consistent, evidence-based recommendations, yet remain blind to in-season variability and dynamic uncertainties. Recent advisory systems powered by LLMs are liable for a different risk of generating recommendations that are agronomically credible but physiologically unconvincing. Agri-SAGE is a closed-loop framework designed to resolve the above two limitations by integrating retrieval-grounded multi-agent LLM reasoning with APSIM-based biophysical simulation, to generate and validate agronomic advisories. To assess this framework, we evaluate three reasoning approaches, namely Plan-and-Solve, Tree of Thoughts, and Reflexion, over a 10-year retrospective analysis. All three significantly outperform static PoP (Package-of-Practice) baselines, with Tree of Thoughts achieving impressive peak yields. At the same time, Reflexion achieves comparable agronomic outcomes at substantially lower computational cost by leveraging cross-seasonal episodic memory.
[MA-6] ASPIRE: Agentic Skills Discovery for Robotics
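[Quick read]: This paper addresses the difficulty of traditional robot programming, which requires orchestrating multimodal perception, managing physical contact dynamics, and handling diverse configurations and execution failures. The proposed solution is ASPIRE (Agentic Skill Programming through Iterative Robot Exploration), a continual learning system in the code-as-policy paradigm that autonomously writes and refines robot control programs while compounding experience into a reusable skill library. ASPIRE runs an open-ended loop with three components: (1) a closed-loop robot execution engine that exposes fine-grained multimodal traces, enabling autonomous failure diagnosis, repair synthesis, and validation; (2) a continually expanding skill library that distills validated fixes into reusable, transferable knowledge; and (3) evolutionary search that generates diverse task sequences and control programs to explore beyond single-trajectory refinement. ASPIRE outperforms prior methods on several benchmarks, by up to 77% on LIBERO-Pro manipulation under perturbation, and its accumulated library enables zero-shot generalization to unseen long-horizon tasks.
Link: https://arxiv.org/abs/2607.00272
Authors: Runyu Lu, Yubo Wu, Ethan Kou, Letian Fu, Wenli Xiao, Ajay Mandlekar, Yinzhen Xu, Guanya Shi, Ken Goldberg, Ang Chen, Mosharaf Chowdhury, Yuke Zhu, Linxi "Jim" Fan, Guanzhi Wang
Affiliations: NVIDIA; University of Michigan; University of Illinois Urbana-Champaign; University of California, Berkeley; Carnegie Mellon University
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 43 pages, 12 figures, 9 tables. Project page: this https URL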
Abstract:Traditional robot programming is challenging: it requires orchestrating multimodal perception, managing physical contact dynamics, and handling diverse configurations and execution failures. We introduce ASPIRE (Agentic Skill Programming through Iterative Robot Exploration), a continual learning system that autonomously writes and refines robot control programs in a code-as-policy paradigm while compounding experience into a reusable skill library. ASPIRE discovers skills that persist across tasks, simulation and real-world settings, and embodiments. It operates in an open-ended loop with three components: (1) a closed-loop robot execution engine that exposes fine-grained multimodal traces, enabling autonomous failure diagnosis, repair synthesis, and validation; (2) a continually expanding skill library that distills validated fixes into reusable, transferable knowledge; and (3) evolutionary search that generates diverse task sequences and control programs to explore beyond single-trajectory refinement. ASPIRE surpasses prior methods by up to 77% on LIBERO-Pro manipulation under perturbation, 72% on Robosuite bimanual handover, and 32% on BEHAVIOR-1K long-horizon household tasks. Its accumulated library also enables zero-shot generalization to unseen long-horizon tasks: on LIBERO-Pro Long, ASPIRE achieves 31% success versus 4% for prior methods despite their use of test-time reasoning and retries. Finally, simulation-discovered skills provide initial evidence of sim-to-real transfer, substantially reducing real-robot programming effort across different embodiments and robot APIs.
[MA-7] From Signals to Structure: How Memory Architecture Drives Language Emergence in LLM Agents
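[Quick read]: This paper asks how two agents can invent a shared language from scratch. In a Lewis signaling game, a sender and a receiver must coordinate on a code using only their interaction history. The central challenge is how agents avoid forgetting and signal ambiguity and arrive at stable, reusable signal-meaning mappings. The key factor is memory architecture: agents with a persistent private notebook externalize learned conventions, so they do not have to re-derive codes each round, and achieve the most reliable coordination (0.867 ± 0.023 at capacity = 25). Stateless agents, limited by a rolling context window, peak at moderate channel capacity and then degrade as the vocabulary grows, a high-capacity collapse. Channel capacity alone cannot predict coordination; memory architecture determines whether agents turn interaction history into stable conventions, and both dimensions are needed to understand how signals become language.
Link: https://arxiv.org/abs/2607.00233
Authors: Yashar Talebirad, Eden Redman, Ali Parsaee, Osmar R. Zaiane
Affiliations: Alberta Machine Intelligence Institute (Amii); University of Alberta; Network for Applied Technology (NAT)
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT); Multiagent Systems (cs.MA)
Comments: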
Abstract:How do two agents invent a shared language from scratch? In a Lewis signaling game, a sender and receiver must coordinate on a code using only their interaction history. We study five memory architectures across varying channel configurations with LLM agents and find that memory architecture matters more than channel capacity. Agents with a persistent private notebook benefit from surplus channel capacity and avoid the high-capacity collapse seen in stateless agents, achieving the most reliable coordination ($0.867 \pm 0.023$ at capacity = 25). Stateless agents peak at moderate capacity and then degrade as the vocabulary grows beyond what a rolling context window can track. The notebook externalizes learned conventions, freeing agents from having to re-derive codes each round. An information bottleneck-inspired argument predicts an optimal capacity equal to the number of objects. Instead, the bottleneck (capacity = 8) proves to be a fragility point, and surplus capacity is generally better. We show that channel capacity alone cannot predict coordination; memory architecture determines whether agents turn interaction history into stable conventions, and both dimensions are needed to understand how signals become language.
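To make the role of a persistent notebook concrete, here is a toy Lewis signaling game with rule-based agents. It is only a sketch under strong assumptions: the agents are simple random guessers rather than LLMs, none of the paper's five memory architectures is reproduced, and the function `play` and its parameters are hypothetical.

```python
import random

def play(n_objects=8, capacity=25, rounds=2000, eval_rounds=200,
         use_notebook=True, seed=0):
    """Return the success rate over the final `eval_rounds` rounds."""
    rng = random.Random(seed)
    sender_notes = {}    # object -> signal, written only after a success
    receiver_notes = {}  # signal -> object, written only after a success
    wins = 0
    for t in range(rounds + eval_rounds):
        obj = rng.randrange(n_objects)
        if use_notebook and obj in sender_notes:
            signal = sender_notes[obj]              # reuse the recorded convention
        else:
            used = set(sender_notes.values())
            free = [s for s in range(capacity) if s not in used]
            signal = rng.choice(free)               # try a signal not yet assigned
        if use_notebook and signal in receiver_notes:
            guess = receiver_notes[signal]
        else:
            guess = rng.randrange(n_objects)        # no memory: guess at random
        success = guess == obj
        if success and use_notebook:
            sender_notes[obj] = signal
            receiver_notes[signal] = obj
        if t >= rounds and success:
            wins += 1
    return wins / eval_rounds

with_notes = play(use_notebook=True)
stateless = play(use_notebook=False)
print(with_notes, stateless)
```

In this toy version the notebook agents lock in one convention per object and then reuse it, while the memoryless pair stays near chance (about 1 in 8). The stateless condition in the paper uses a rolling context window, so its behavior is richer than this baseline.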
[MA-8] HydraCollab: Adaptive Collaborative-Perception for Distributed Autonomous Systems IROS2026
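[Quick read]: This paper addresses the inherent trade-off between communication bandwidth and perception accuracy in collaborative perception for multi-robot systems. Existing methods often improve perception by transmitting more data, which raises communication overhead beyond what real networks with bandwidth constraints can support. The paper proposes HydraCollab, an adaptive collaborative-perception framework that (i) selectively transmits the most informative sensor features to reduce redundant communication and (ii) dynamically chooses a collaboration strategy (intermediate or late fusion) based on spatial confidence maps. Relative to the state-of-the-art Where2comm, HydraCollab uses only 41% of the bandwidth on V2X-R and 26% on V2X-Radar while improving performance by 0.78% and 0.75%, respectively.
Link: https://arxiv.org/abs/2607.00191
Authors: Luke Chen, Cheng-Ju Wu, David R. Martin, Qilin Ye, Pramod Khargonekar, Mohammad Abdullah Al Faruque
Affiliations: University of California, Irvine
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Accepted at IROS 2026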
Abstract:Collaborative-perception enables multi-robot systems to enhance situational awareness by sharing perceptual information. Existing collaborative-perception systems face an inherent trade-off between communication bandwidth requirements and perception accuracy, where methods that exchange more information achieve better perception results at the cost of increased communication overhead. However, real-world communication networks impose bandwidth constraints that require minimizing communication overhead without sacrificing perception performance. To address this challenge, we propose HydraCollab, an adaptive collaborative-perception framework that (i) selectively transmits the most informative sensor features and (ii) dynamically employs collaboration strategies (intermediate or late) based on spatial confidence maps. Extensive evaluations on the V2X-R, V2X-Radar and UAV3D-mini datasets demonstrate that HydraCollab achieves the best overall trade-off between accuracy and communication cost among existing collaborative-perception methods. Relative to SOTA Where2comm, HydraCollab uses only 41% of the bandwidth on V2X-R and 26% on V2X-Radar while improving performance by 0.78% and 0.75% respectively. Our code and models are available at this https URL.
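The two decisions named in the abstract, which features to send and which fusion strategy to use, can be sketched with a flattened confidence map. The top-k rule, the mean-confidence threshold, and the function names below are assumptions for illustration only; the abstract does not specify how HydraCollab makes either choice.

```python
from typing import List

def select_cells(confidence: List[float], budget: int) -> List[int]:
    """Pick the indices of the `budget` highest-confidence cells
    (assumed proxy for the 'most informative' features)."""
    ranked = sorted(range(len(confidence)), key=lambda i: confidence[i], reverse=True)
    return sorted(ranked[:budget])

def choose_strategy(confidence: List[float], threshold: float = 0.8) -> str:
    """Hypothetical rule: if the local map is already confident, share only
    final detections (late fusion); otherwise share features (intermediate)."""
    mean_confidence = sum(confidence) / len(confidence)
    return "late" if mean_confidence >= threshold else "intermediate"

confidence_map = [0.9, 0.2, 0.75, 0.4, 0.95, 0.1]  # flattened spatial confidence map
cells_to_send = select_cells(confidence_map, budget=2)
strategy = choose_strategy(confidence_map)
print(cells_to_send, strategy)
```

With a budget of 2 out of 6 cells, only a third of the map is transmitted; the bandwidth figures reported in the abstract come from the authors' experiments, not from a rule like this one.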
[MA-9] Active Sensing for RIS-Aided Tracking and Power Control: A Hybrid Neuroevolution and Supervised Learning Approach
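[Quick read]: This paper addresses energy-efficient tracking of power-limited mobile users with the assistance of a Reconfigurable Intelligent Surface (RIS). Localization pilot transmissions dominate the energy budget of power-constrained devices, and the active sensing problem involves discrete, decentralized decisions. The paper proposes a Dual-Agent (DA) deep learning framework that jointly optimizes the discrete RIS phase profiles and the uplink transmit power of the user equipment (UE) in real time. A hybrid training methodology combining neuroevolution with supervised learning overcomes the non-differentiability of the discrete phase responses of the RIS elements and the information bottleneck of single-bit feedback messages. The framework applies to both single- and multi-antenna base stations; the multi-antenna case requires only a minor change to one neural network, namely an additional output branch that selects a valid digital combiner from a finite set. Simulations show accurate and robust tracking across diverse target motion models, outperforming extended Kalman filters, particle filters, and machine learning-based trackers; in static localization, the scheme also significantly outperforms traditional fingerprinting schemes, deep reinforcement learning baselines, and standard backpropagation-based estimators.
Link: https://arxiv.org/abs/2607.00056
Authors: George Stamatelis, Hui Chen, Henk Wymeersch, George C. Alexandropoulos
Affiliations: National and Kapodistrian University of Athens; Chalmers University of Technology
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI)
Comments: Submitted to an IEEE journal, 16 pages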
Abstract:This paper studies energy efficient tracking of power-limited mobile users with the assistance of a Reconfigurable Intelligent Surface (RIS). Since localization pilot transmissions dominate the energy budget of power-constrained devices, we introduce a low-overhead feedback link from the Base Station (BS) to the user to enable dynamic uplink power control. To navigate the discrete and decentralized nature of this active sensing problem, we propose a novel Dual-Agent (DA) deep learning framework that jointly optimizes the discrete RIS phase profiles and the UE’s transmit power in real time. Specifically, our approach employs a hybrid training methodology integrating the neuroevolution paradigm with supervised learning, effectively overcoming the non-differentiability of discrete phase responses from the RIS unit elements and the strict information bottleneck of single-bit feedback messages for pilot power control. The proposed DA active sensing framework can be applied with both single- and multi-antenna BSs, the latter with only minor modifications in the structure of one NN: an additional output branch with appropriate structure is included for the latter case to select a valid digital combiner from a finite set. Extensive numerical simulations demonstrate that the proposed scheme achieves highly accurate and robust tracking across diverse target motion models, outperforming extended Kalman and particle filters, as well as, machine learning-based trackers. Furthermore, in static localization, it is shown to significantly outperform traditional fingerprinting schemes, deep reinforcement learning baselines, and standard backpropagation-based estimators.
[MA-10] A Role-Based Multi-Agent Model for Climate Adaptation Deliberation Across Living Labs
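[Quick read]: This paper addresses a limitation of existing agent-based models (ABMs) in climate governance research: they typically focus on either institutional dynamics or individual behavioural mechanisms in isolation, and so fail to capture the interactions among citizens, advocacy groups, media actors, and political decision-makers. The proposed solution is a modular multi-level agent-based architecture that integrates, in one simulation framework, motive-based individual decision-making operationalised through the HUMAT and MOA frameworks, social influence via demographic homophily networks, and institutional strategy modules for environmental non-governmental organisations (NGOs), media agents, and politicians. Political decisions emerge from aggregating multiple signals, including expert input, public mobilisation, party alignment, and media framing. The model is designed to be calibrated with synthetic populations derived from survey data and with institutional parameters informed by Living Lab stakeholder engagement, and to support scenario-based exploration of climate-relevant land-use governance. The paper focuses on architectural design principles, modular structure, and integration logic rather than empirical results, and outlines pathways for generalization and future validation.
Link: https://arxiv.org/abs/2607.00046
Authors: Önder Gürcan, David Eric John Herbert, F. LeRon Shultz, Christopher Frantz, Ivan Puga-Gonzalez
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Software Engineering (cs.SE)
Comments: 4 pages, 1 figure, accepted as poster at the 21st annual Social Simulation Conference (SSC 2026)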
Abstract:Climate governance processes involve complex interactions between heterogeneous citizens, advocacy groups, media actors, and political decision-makers. While agent-based models (ABMs) have been widely used to study environmental policy and socio-ecological systems, many existing approaches focus either on institutional dynamics or individual behavioural mechanisms in isolation. This paper presents a modular multi-level agent-based architecture that integrates empirically grounded cognitive decision models with strategic institutional behaviour within a unified simulation framework. The architecture combines (i) motive-based individual decision-making operationalised through the HUMAT and MOA frameworks, (ii) socially embedded influence processes via demographic homophily networks, and (iii) institutional strategy modules for environmental non-governmental organisations (NGOs), media agents, and politicians. Political decisions emerge from the aggregation of multiple signals, including expert input, public mobilisation, party alignment, and media framing. The model is designed to be empirically calibrated through synthetic populations derived from survey data and institutional parameters informed through Living Lab stakeholder engagement, and to support scenario-based exploration of climate-relevant land-use governance processes. Rather than presenting empirical results, this paper focuses on the architectural design principles, modular structure, and integration logic of the model. We discuss how this multi-layered approach contributes to the modelling of democratic climate governance and outline pathways for generalization and future validation.
[MA-11] Memory-Native Non-Terrestrial Networks for Embodied Intelligence
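[Quick read]: This paper addresses the challenges that non-terrestrial networks (NTN) face when supporting embodied intelligence (EI) applications: a highly dynamic, resource-constrained, topology-varying, and task-oriented environment. Conventional memoryless NTN protocols are inefficient because their decisions are driven only by local channel conditions and instantaneous service demands. The paper proposes the memory-native NTN (MemNTN) paradigm, which uses long-horizon context for system optimization. Its core is a dual-memory architecture that distinguishes physical memory, representing the state of the world, from digital memory, encoding historical network experience, together with mechanisms for memory acquisition, compression, valuation, update, and utilization that support cross-layer, memory-native decision-making from the physical and access layers up to the network and application layers. Experiments in satellite embodied question answering (SEQA) show that MemNTN significantly outperforms conventional stateless NTN and terrestrial approaches.
Link: https://arxiv.org/abs/2607.00029
Authors: Chengyang Li, Yikun Wang, Jiahui He, Yujie Wan, Shuai Wang, Yuan Wu, Yik-Chung Wu, Chengzhong Xu, Huseyin Arslan
Affiliations: The University of Hong Kong; Southern University of Science and Technology; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; University of Macau; Istanbul Medipol University
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI)
Comments: 8 pages, 4 figures, 2 tables, submitted to IEEE for possible publication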
Abstract:Non-terrestrial networks (NTN) provide ubiquitous connectivity for embodied intelligence (EI), enabling robots in wilderness to leverage cloud resources or report critical information to remote centers. However, the synergy is nontrivial due to the highly-dynamic, resource-constrained, topology-varying, and task-oriented environment. Existing memoryless NTN protocols become inefficient, since the decisions are driven by local channel conditions and instantaneous service demands. To address these limitations, this paper proposes the memory-native NTN (MemNTN) paradigm that leverages long-horizon contexts for memory augmented system optimization. To realize this paradigm shift, we establish a dual-memory architecture that distinguishes between physical memory representing the state of the world and digital memory encoding historical network experience. We develop memory acquisition, compression, valuation, update, and utilization mechanisms that facilitate cross-layer, memory-native decision-making, spanning from the physical and access layers up to the network and application layers. Experiments in satellite embodied question answering (SEQA) demonstrate that the proposed MemNTN significantly outperforms conventional stateless NTN and terrestrial approaches.
[MA-12] Decentralized Geometric Control for Cable-Suspended Payload Transport with Adaptive Mass Estimation IROS
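[Quick read]: This paper addresses three requirements of cooperative aerial transport with multiple quadrotors: controllers must respect the geometry of the nonlinear configuration manifold, the system must operate without centralized coordination, and operational safety constraints must be satisfied. The proposed solution is GPAC, a four-layer hierarchical architecture that achieves decentralized cooperative control through implicit coordination. Each quadrotor independently estimates its effective load share from local cable measurements, so the combined forces converge to the correct total without knowledge of the number of quadrotors N or the payload mass. Each agent reconstructs the payload position locally from its own cable geometry, and the only inter-agent communication is a low-rate neighbor-position broadcast for collision avoidance. GPAC operates directly on the full nonlinear configuration manifold and integrates geometric position and attitude control, anti-swing regulation, an extended-state observer for wind rejection, concurrent learning-based mass estimation without persistent excitation, and a priority-ordered control barrier function (CBF)-inspired safety filter with input-to-state safety (ISSf) margins that hold exactly under single-constraint activation. A compatibility result shows that the filter's force modifications keep the desired attitude within the almost-global stability region of the SO(3) attitude controller. In high-fidelity simulation with flexible cables, onboard sensor fusion, and wind turbulence, with all control and estimation loops closed through the estimator, GPAC achieves a mean payload-tracking RMSE of 33.8 cm (2.8% coefficient of variation over 13 seeds) at low per-agent computational cost.
Link: https://arxiv.org/abs/2607.00024
Authors: Hadi Hajieghrary, Benedikt Walter, Paul Schmitt, Miguel Hurtado
Affiliations: Georgia Institute of Technology; Massachusetts Robotics
Categories: Robotics (cs.RO); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments: Accepted to be presented at IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2026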
Abstract:Cooperative aerial transport requires controllers that respect nonlinear manifold geometry, operate without centralized coordination, and respect operational safety constraints. To address these demands, we present GPAC, a four-layer hierarchical architecture that enables N quadrotors to transport a cable-suspended payload without a central coordinator or by exchanging cable states or adaptive parameters. The key insight is implicit coordination: each quadrotor independently estimates its effective load share from local cable measurements, so combined forces converge to the correct total, even without knowledge of N or the payload mass; the payload position is reconstructed locally from each agent’s own cable geometry, and the only inter-agent communication is a low-rate neighbor-position broadcast for collision avoidance. GPAC operates directly on the full nonlinear configuration manifold and integrates geometric position and attitude control, anti-swing regulation, an extended-state observer for wind rejection, concurrent learning-based mass estimation without persistent excitation, and a priority-ordered control barrier function (CBF)-inspired safety filter that reduces operational risk, with input-to-state safety (ISSf) margins that hold exactly under single-constraint activation. A compatibility result shows that the filter’s force modifications keep the desired attitude within the almost-global stability region of the $\mathrm{SO}(3)$ attitude controller. Finally, high-fidelity simulation with flexible cables, onboard sensor fusion, and wind turbulence – with all control and estimation loops closed through the estimator – yields a mean payload-tracking RMSE of 33.8 cm (2.8% coefficient of variation over 13 seeds) at a low per-agent computational cost.
Natural Language Processing
[NLP-0] Measuring the Gap Between Human and LLM Research Ideas
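[Quick read]: This paper addresses the gap in "research taste" between ideas generated by large language models (LLMs) and those of human researchers. Existing evaluations mostly judge individual ideas by novelty, feasibility, or expert preference, and do not systematically compare the overall distribution of ideas against human research. The paper builds a large-scale ideation evaluation framework from high-quality human research papers: for each paper, it reverse-engineers a small set of closely related prior works that likely inspired the core idea, and prompts LLMs to generate a new idea from the titles and summaries of that set. A two-axis research-taste taxonomy profiles each idea by its opportunity pattern and research paradigm, which makes it possible to quantify the divergence between human and LLM ideas. Across idea sets from different LLMs, there is a consistent distributional gap: LLM ideas concentrate disproportionately on bridge-like opportunities and synthesis methods, whereas human papers spread more broadly across ways of framing gaps and constructing contributions. Strong LLMs can therefore produce a range of reasonable ideas, but that range is narrower than, and systematically shifted relative to, human research taste.
Link: https://arxiv.org/abs/2607.01233
Authors: Ziyu Chen, Yilun Zhao, Arman Cohan
Affiliations: Yale University; University of Chicago
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: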
Abstract:LLMs are increasingly used to brainstorm research ideas, but existing evaluations mostly judge individual ideas by novelty, feasibility, or expert preference. We instead ask: how far are current LLM-generated ideas from human researchers? To characterize this gap, we build a large-scale evaluation framework for ideation from high-quality human research papers. For each paper, we reverse-engineer a small set of closely related prior works that likely inspired its core idea. LLMs are then prompted to generate a new idea from the set of paper titles and summaries. We introduce a two-axis research-taste taxonomy to profile each idea by its opportunity pattern and research paradigm, and use it to quantify the divergence between human and LLM ideas. Across idea sets generated by different LLMs, we observe a consistent distributional gap: LLM ideas are disproportionately concentrated around bridge-like opportunities and synthesis methods, whereas the human paper reference distribution spreads more broadly across ways of framing gaps and constructing contributions. This result suggests that strong LLMs can produce a range of reasonable ideas, but that range remains narrower than, and systematically shifted relative to, human research taste.
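One way to quantify a gap between two distributions over taxonomy categories is the Jensen-Shannon divergence. The abstract does not say which divergence measure the authors use, so the choice of measure and the function name `js_divergence` are assumptions made only to illustrate the idea.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two discrete distributions."""
    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL sum.
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)
```

Identical category distributions give 0, and distributions with no shared categories give 1 bit, so the value can be read as a bounded measure of how far LLM ideas drift from the human reference.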
[NLP-1] Is One Layer Enough? Training A Single Transformer Layer Can Match Full-Parameter RL Training
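[Quick read]: This paper studies how the gains from reinforcement learning (RL) post-training of large language models (LLMs) are distributed across transformer layers. Existing methods update all parameters uniformly, implicitly assuming that every layer contributes similarly. A systematic layer-wise study shows that this assumption does not hold: training a single transformer layer recovers most of the gains of full-parameter RL training and in some cases surpasses it. The key tool is a metric called layer contribution, which measures the fraction of the full RL improvement recovered when a layer is trained in isolation. Across seven models from two families (Qwen3, Qwen2.5), three RL algorithms (GRPO, GiGPO, Dr. GRPO), and tasks spanning mathematical reasoning, code generation, and agentic decision-making, RL gains concentrate in a small subset of layers, often a single one. High-contribution layers sit in the middle of the transformer stack, while layers near the input and output ends contribute substantially less, and the resulting layer rankings remain strongly correlated across datasets, tasks, model families, and RL algorithms.
Link: https://arxiv.org/abs/2607.01232
Authors: Zijian Zhang,Rizhen Hu,Athanasios Glentis,Dawei Li,Chung-Yiu Yau,Hongzhou Lin,Mingyi Hong
Affiliations: University of Minnesota; Peking University; Amazon
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: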
Abstract:Reinforcement learning (RL) has become a central component of post-training large language models (LLMs), yet little is understood about how RL adaptation is distributed across transformer layers. Existing approaches typically update all model parameters uniformly, implicitly assuming that every layer contributes similarly to the gains obtained during RL post-training. In this work, we challenge this assumption through a systematic layer-wise study of RL training. Surprisingly, we find that training a single transformer layer can recover most of the gains achieved by full-parameter RL training, and in some cases even surpass it. To quantify this phenomenon, we introduce the quantity layer contribution, which measures the fraction of full RL improvement recovered by training a layer in isolation. Across seven models spanning two model families (Qwen3, Qwen2.5), three RL algorithms (GRPO, GiGPO, Dr. GRPO), and multiple task domains including mathematical reasoning, code generation, and agentic decision-making, we observe a remarkably stable pattern: RL gains are highly concentrated in a small subset of, and in many cases even a single, transformer layers. More strikingly, the same structural pattern consistently emerges: high-contribution layers concentrate in the middle of the transformer stack, while layers near the input and output ends contribute substantially less. The resulting layer rankings remain strongly correlated across datasets, tasks, model families, and RL algorithms.
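The layer contribution described above can be read as a simple ratio of score gains. The sketch below is one natural reading of that definition; the exact formula and the argument names are assumptions, since the abstract gives only the verbal description.

```python
def layer_contribution(base_score, full_rl_score, single_layer_score):
    """Fraction of the full-RL improvement recovered by training one layer alone."""
    full_gain = full_rl_score - base_score
    if full_gain == 0:
        raise ValueError("full RL training produced no improvement to attribute")
    return (single_layer_score - base_score) / full_gain
```

A value near 1 means the single layer recovers almost all of the full-parameter gain, and a value above 1 corresponds to the case where single-layer training surpasses full training.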
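[NLP-2] Theoria: Rewrite-Acceptability Verification over Informal Reasoning States
[Quick read]: This paper addresses a dilemma in deciding when an AI system's answer can be trusted: formal proof assistants offer certainty but cannot reach most real problems, while scalar LLM judges offer broad coverage but produce opaque scores that cannot be audited and share the coherence issues of any LLM. The proposed solution is Theoria, a verification architecture that rewrites a candidate solution into a sequence of typed state transitions, each licensed by an explicit justification (a citation, a computation, or a problem-given fact), so that every transition can be audited independently. Its foundational invariant is completeness of change: every difference between consecutive proof states must be accounted for, so hidden premises surface as unlicensed mutations instead of passing silently. On HLE-Verified Gold (185 text-only expert problems), Theoria certifies 105 problems at 91.4% strict precision (Wilson 95% CI [84.5%, 95.4%]) and produces human-readable proof traces in which each step can be challenged. Holistic LLM judges reach comparable precision at matched coverage but fail on different problems (Jaccard 0.14-0.36), so the two approaches are complementary. On 95 adversarially poisoned proofs across 15 domains, structured judges catch 94.7% versus 83.2% for holistic judging (p = 0.0017), with the gap concentrated in hidden premises (90.6% vs. 62.5%) and fabricated citations (100% vs. 90%), the error classes where the formal analysis predicts an advantage; performance is identical on arithmetic and theorem-misapplication errors, where none is predicted. On GPQA Diamond (n = 65), certified precision is 97.1% (Wilson CI [85.1%, 99.5%]).
Link: https://arxiv.org/abs/2607.01223
Authors: Ben Slivinski,Michael Saldivar
Affiliations: Independent Researchers
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Software Engineering (cs.SE)
Comments: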
Abstract:When should an AI system’s answer be trusted? Formal proof assistants offer certainty but cannot reach most of the problem distribution; scalar LLM judges offer coverage but produce opaque scores that cannot be audited after the fact and are subject to the same coherence issues as any LLM. We present Theoria, a verification architecture that closes this gap. A candidate solution is rewritten into a sequence of typed state transitions, each licensed by an explicit justification, whether that be a citation, computation, or problem-given fact, and every transition is independently auditable. The foundational invariant is completeness of change: every difference between consecutive proof states must be accounted for, so hidden premises surface as unlicensed mutations rather than passing silently. On HLE-Verified Gold (185 text-only expert problems), Theoria certifies 105 at 91.4% strict precision (Wilson 95% CI [84.5%, 95.4%]). Every certification produces a human-readable proof trace in which each step can be independently challenged. Holistic LLM judges achieve comparable precision at matched coverage but fail on different problems (Jaccard 0.14-0.36), making the approaches complementary. On 95 adversarial poisoned proofs across 15 domains, structured judges catch 94.7% versus 83.2% for holistic judging (p = 0.0017). The overall 11.5 pp gap concentrates in hidden premises (90.6% vs. 62.5%, a 28 pp difference) and fabricated citations (100% vs. 90%), the error classes where the formal analysis predicts an advantage; performance is identical on arithmetic and theorem-misapplication errors, where no advantage is predicted. On GPQA Diamond (n = 65), certified precision is 97.1% (Wilson CI [85.1%, 99.5%]).
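The Wilson intervals quoted above can be reproduced with the standard Wilson score formula. In the sketch below the counts 96 of 105 and 33 of 34 are back-computed from the reported percentages and are assumptions; the abstract states only the percentages and intervals.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Assumed counts: 96 correct of 105 certified (91.4%), 33 of 34 (97.1%).
hle_low, hle_high = wilson_interval(96, 105)
gpqa_low, gpqa_high = wilson_interval(33, 34)
```

Under these assumed counts the intervals round to [84.5%, 95.4%] and [85.1%, 99.5%], which agrees with the figures in the abstract.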
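[NLP-3] The State-Prediction Separation Hypothesis
[Quick read]: This paper targets an inefficiency in standard Transformers, which use the same forward computation stream both to predict the next token and to store state that is useful for future predictions. The authors formulate the state-prediction separation hypothesis, that disentangling these two roles yields better language modeling, and design a Transformer variant with two computation streams: one dedicated to predicting the next token and one dedicated to maintaining state for future predictions. Pretraining experiments across several scales show that the separation consistently improves data and compute efficiency, lowering validation loss and outperforming standard Transformers by 2-3 percentage points on average on downstream tasks. Extensive empirical analysis rules out potential confounders and demonstrates that the design entails fundamentally different gradients.
Link: https://arxiv.org/abs/2607.01218
Authors: Giovanni Monea,Nathan Godey,Kianté Brantley,Yoav Artzi
Affiliations: Cornell University; Harvard University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint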
Abstract:Transformers use the same forward computation stream to both predict the next token and store useful state for future token predictions. We formulate the state-prediction separation hypothesis: disentangling the two roles yields better language modeling performance. We design a Transformer variant that uses two computation streams to separate the two functions, and conduct pretraining experiments across various scales. Our experiments show that state-prediction separation consistently offers better data and compute efficiencies, improving validation loss and outperforming standard Transformers by 2–3 percentage points on average on downstream tasks. We also conduct extensive empirical analysis that rules out potential confounders and demonstrates the fundamental difference in the gradients our design entails.
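[NLP-4] Distill to Detect: Exposing Stealth Biases in LLMs through Cartridge Distillation ICML2026
[Quick read]: This paper addresses preferential biases that may be hidden in language models deployed in high-stakes roles, in particular "stealth" biases that appear only on a specific topic while the model behaves identically to its base model on all other inputs. Such biases can be introduced by any actor in the model's supply chain and are hard to detect because the signal lives in the soft logit distribution and not in readable text. Without knowing the bias topic in advance, existing detection methods cannot reliably surface them. The proposed method, Distill to Detect (D2D), distills the distributional shift between a suspected model and its base into a cartridge (a KV-cache prefix adapter), which concentrates the dominant divergence and amplifies the bias signal into generated text. The key mechanism is to use the capacity bottleneck of prefix-tuning adapters as a detection tool instead of a parameter-efficiency tool; a theoretical framework explains its efficacy as a Fisher-weighted projection of the logit distribution shift. Experiments show that D2D reliably amplifies several types of hidden bias, providing a practical building block for auditing deployed language models.
Link: https://arxiv.org/abs/2607.01208
Authors: Shayan Talaei,Abhinav Chinta,Devvrit Khatri,Amin Karbasi,Azalia Mirhoseini,Amin Saberi
Affiliations: Stanford University; University of Texas at Austin; Foundation AI–Cisco Systems Inc.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to the ICML 2026 Workshops on TAIGR, AI4GOOD, Mechanistic Interpretability, and CoLoRAI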
Abstract:Language models deployed in high-stakes roles can potentially favor certain entities, brands, or viewpoints, steering user decisions at scale. Such preferential biases can be introduced by any actor in the model’s supply chain and are most dangerous when the model reveals its preference only on the relevant topic while behaving identically to its unmodified base on all other inputs. Recent work has shown that these biases can transfer through context distillation on semantically unrelated data, with the signal residing entirely in the soft logit distribution and remaining invisible to text-based inspection. However, the defender faces a fundamental asymmetry: without knowing the bias topic, no detection method can reliably surface a stealth preferential bias, regardless of whether it examines generated text, internal representations, or model weights. Here we introduce Distill to Detect (D2D), a method that surfaces hidden biases by distilling the distributional shift between a suspected model and its base into a cartridge (a KV-cache prefix adapter), concentrating the dominant divergence and amplifying the bias signal into generated text. We show that D2D successfully amplifies the hidden biases of stealth models to the extent that they can be reliably detected across multiple bias types. We also propose a theoretical framework that explains the efficacy of D2D through the lens of Fisher-weighted projection of the logit distribution shift, supported by empirical observations. By turning the capacity bottleneck of prefix-tuning adapters into a detection tool, D2D provides a practical building block for auditing hidden behaviors in deployed language models.
[NLP-5] Right in the Right Way: LM Training with Verifiable Rewards and Human Demonstrations
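[Quick read]: This paper addresses a limitation of reinforcement learning with verifiable rewards (RLVR) for training language models: it optimizes only what can be scored objectively and neglects subjective properties of human-written outputs, such as style, structure, and naturalness, that are hard to formalize as scalar rewards. This leads to failure modes such as diversity collapse, unnatural-sounding responses, and reward hacking. The proposed solution is an adversarial generator-discriminator framework that augments verifiable rewards with a learned signal from human demonstrations. The generator is trained with RL to maximize both task accuracy and an adversarial reward from the discriminator; the discriminator, trained alongside the generator, learns to distinguish human-written from model-generated outputs and thus serves as a learned proxy for the human output distribution. Across domains including bug fixing and open-ended generation, the method improves non-verifiable properties while matching the task performance of RLVR, bridging RL and supervised fine-tuning (SFT) and offering a scalable path toward jointly optimizing verifiable and non-verifiable properties.
Link: https://arxiv.org/abs/2607.01181
Authors: Mehul Damani,Isha Puri,Idan Shenfeld,Jacob Andreas
Affiliations: MIT EECS
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: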
Abstract:RL with verifiable rewards (RLVR) has emerged as a powerful paradigm for training LMs on tasks with well-defined success metrics, such as code generation and mathematical reasoning. However, current RLVR methods optimize only what can be objectively scored, often neglecting subjective, non-verifiable aspects of human-like outputs, such as style and structure. This limitation leads to well-documented failure modes such as diversity collapse, unnatural-sounding responses, and reward hacking. We propose an adversarial generator-discriminator framework that augments verifiable rewards with a learned signal from human demonstrations. A generator model is trained using RL to maximize both task accuracy and an adversarial reward derived from a discriminator. The discriminator, trained alongside the generator policy, learns to distinguish human-written outputs from model-generated ones. The discriminator serves as a learned proxy for the human output distribution, providing feedback on aspects of generation that are difficult to formalize as scalar rewards. Across diverse domains, including bug fixing and open-ended generation, our approach consistently improves non-verifiable properties while preserving the accuracy gains of RLVR. In bug fixing, our method produces solutions with significantly lower edit distance compared to RLVR baselines while matching end performance. In story generation, our method significantly improves win rate while producing stories that are diverse and more human-like. And in a simple reward hacking benchmark, our method nearly eliminates model misbehavior while maintaining high benchmark scores. Together, these results show that our approach bridges RL and SFT, offering a scalable path toward jointly optimizing the verifiable and non-verifiable properties of a task.
[NLP-6] QuasiMoTTo: Quasi-Monte Carlo Test-Time Scaling
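[Quick read]: This paper addresses the compute wasted when inference-time scaling draws parallel samples independently: i.i.d. sampling is trivial to parallelize but produces many redundant solutions. The proposed method, QuasiMoTTo, reparameterizes autoregressive sampling as inverse-CDF sampling and draws the underlying uniforms with quasi-Monte Carlo (QMC), which yields correlated but exact samples entirely in parallel. Because QMC spreads the uniforms more evenly than i.i.d. draws, the samples cover the output space with far less redundancy. Although the batch is correlated, each sample is still marginally distributed according to the language model, so the batch can be used directly for policy-gradient RL such as GRPO. Across four reasoning benchmarks, QuasiMoTTo matches i.i.d. pass@k accuracy with 25-47% fewer samples and often saturates an upper bound on pass@k that holds for any marginal-preserving sampler; in RL it matches i.i.d. performance with 50% fewer training steps, a gain attributed to higher coverage and hence a stronger learning signal per batch.
Link: https://arxiv.org/abs/2607.01179
Authors: Michael Y. Li,Anthony Zhan,Kanishk Gandhi,Noah D. Goodman,Emily B. Fox
Affiliations: Stanford University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: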
Abstract:Scaling inference compute, by generating many parallel attempts per problem, is a costly but reliable lever for improving language model capabilities. By default these attempts are generated independently, wasting inference compute on redundant solutions. This waste seems unavoidable. After all, independence is what makes parallel sampling trivial to scale. However, this tradeoff is not fundamental: there is a rich design space of samplers that generate correlated but exact samples entirely in parallel. We explore this design space as an avenue for improving sample efficiency in scaling inference compute and reinforcement learning (RL). Concretely, we introduce QuasiMoTTo, which uses correlated samples as a drop-in replacement for i.i.d. samples. To generate these samples, QuasiMoTTo uses a reparameterization of autoregressive sampling as inverse-CDF sampling and draws the underlying uniforms with quasi-Monte Carlo (QMC); because QMC spreads the uniforms out more evenly than i.i.d., the resulting samples cover the output space with far less redundancy. Even though the batch is correlated, each sample is marginally distributed according to the language model, so we can use the batch for policy-gradient training. Our empirical analysis focuses on understanding how efficiently QuasiMoTTo can turn compute into performance. To evaluate correlated samplers, whose dependence breaks standard pass@k estimators, we first develop an unbiased bootstrap estimator. Across four reasoning benchmarks, QuasiMoTTo matches i.i.d. pass@k accuracy with 25-47% fewer samples. Strikingly, QuasiMoTTo often saturates an upper bound on pass@k that holds for any marginal-preserving sampler. We also apply QuasiMoTTo to policy-gradient RL (GRPO) where it matches i.i.d. performance with 50% fewer training steps. These gains come from higher coverage, which yields a stronger learning signal per batch.
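The core idea, inverse-CDF sampling driven by evenly spread uniforms, can be shown on a single categorical draw. The sketch below uses a randomly shifted one-dimensional lattice as a simple QMC point set; that construction and all function names are assumptions chosen for illustration, and the paper's sampler for full autoregressive sequences may differ.

```python
import bisect
import itertools
import random

def shifted_lattice_uniforms(n, rng):
    """n evenly spaced points with one shared random shift; each is marginally U[0, 1)."""
    shift = rng.random()
    return [(i / n + shift) % 1.0 for i in range(n)]

def inverse_cdf_sample(probs, u):
    """Map a uniform u in [0, 1) to a category index via the inverse CDF."""
    cdf = list(itertools.accumulate(probs))
    # bisect_right finds the first category whose cumulative mass exceeds u.
    return min(bisect.bisect_right(cdf, u), len(probs) - 1)

def correlated_batch(probs, n, rng):
    """Draw n correlated samples that share one lattice of uniforms."""
    return [inverse_cdf_sample(probs, u) for u in shifted_lattice_uniforms(n, rng)]
```

With four equally likely categories and a batch of four, the lattice places one point in each quarter of the unit interval, so every category appears once, whereas i.i.d. draws would often repeat one. Each individual sample still follows `probs` because a shifted lattice point is marginally uniform.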
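[NLP-7] Adversarial Pragmatics for AI Safety Evaluation: A Benchmark for Instruction Conflict Embedded Commands and Policy Ambiguity
[Quick read]: This paper addresses distortions in LLM safety evaluation that arise from judging ambiguous natural-language behaviour, including instruction following, policy compliance, refusal, resistance to embedded commands, and misreported progress in agentic tasks. Existing benchmarks often compress these distinctions into pass/fail labels, which hides whether a failure stems from capability limits, policy ambiguity, instruction conflict, scaffold failure, or unstable evaluator judgments. The paper introduces adversarial pragmatics as a benchmark and annotation protocol built on a linguistically controlled taxonomy covering instruction conflict, embedded commands, quotation, scope ambiguity, deixis, indirect speech acts, and multi-turn agent transcripts. Its contributions are an 18-item seed benchmark with validator-enforced metadata, a 54-row local seed pilot, an expert-evaluation protocol that distinguishes task success, policy compliance, safety risk, refusal outcome, and evaluator confidence, and metrics for judge validity, diagnostic ambiguity, and taxonomy drift. The framework turns linguistic judgment methodology into a practical tool for validating safety evals, LLM judges, gold-set construction, prompt-injection tests, and safety documentation.
Link: https://arxiv.org/abs/2607.01153
Authors: Brett Reynolds
Affiliations: Humber Polytechnic; University of Toronto
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 15-page main paper plus 9-page supplement; 6 figures and 8 tables total; code and data artifact available at the linked repository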
Abstract:Safety evaluations for language models increasingly depend on judgments about ambiguous natural-language behaviour: whether a model has followed an instruction, refused appropriately, complied with a policy, resisted an embedded command, or misreported progress in an agentic task. Existing benchmarks often compress these distinctions into pass/fail labels, obscuring whether failures arise from capability limits, policy ambiguity, instruction conflict, scaffold failure, or unstable evaluator judgments. This paper introduces adversarial pragmatics as a benchmark and annotation protocol for evaluating model behaviour under instruction conflict, embedded commands, quotation, scope ambiguity, deixis, indirect speech acts, and multi-turn agent transcripts. The contribution is empirical and methodological: a linguistically controlled taxonomy, an 18-item seed benchmark with validator-enforced metadata, a 54-row local seed pilot, an expert-evaluation protocol distinguishing task success, policy compliance, safety risk, refusal outcome, and evaluator confidence, and metrics for judge validity, diagnostic ambiguity, and taxonomy drift. The framework turns linguistic judgment methodology into a practical tool for validating safety evals, LLM judges, gold-set construction, prompt-injection tests, and safety documentation.
[NLP-8] AGC-Bench: Measuring Artificial General Creativity
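[Quick read]: This paper addresses the lack of a unified, scalable benchmark for AI creativity that can separate creativity from general intelligence. Creativity research has long debated whether creativity is domain-specific (e.g., visual, writing, science) and whether it is psychometrically separable from general intelligence, and both questions now apply to LLMs. The authors introduce AGC-Bench, an artificial general creativity benchmark built from a systematic literature review (3,101 papers screened, 497 benchmarks identified) and paired with an agentic harness that converts idiosyncratic codebases into HELM-standardized benchmarks. To address bias in LLM-as-judge scoring, they apply Judge Response Theory, a psychometric calibration of judge leniency and severity, and then fine-tune Qwen3-30B on the bias-corrected ratings of three frontier LLMs to produce AGC-Judge, an open-weight model that robustly scores creativity benchmarks it was not trained on. Frontier models lead the leaderboard with open models close behind, and models show different creative strengths across domains (ranking higher on writing than on scientific ideation, for example). Three further findings follow: factor analysis across 83 LLMs recovers a single creativity factor 'c' that explains 81.5% of variance and is related to but separable from general knowledge and reasoning; prompting models to "be creative" boosts performance far more than enabling reasoning, which suggests the benchmark tracks creativity over general ability; and on a human-matched subset the top human still leads the top LLM. The benchmark, leaderboard, AGC-Judge, and human data are released as open infrastructure.
Link: https://arxiv.org/abs/2607.01152
Authors: Roger Beaty,Vijeta Deshpande,Clin K.Y. Lai,Anna Attuch,Namrata Shivagunde,Swastik Roy,Rajkumar Pujari,Paul V. DiStefano,Sherin Muckatira,Claire E. Stevenson,Mikhail Gronas,Anna Rumshisky
Affiliations: Pennsylvania State University; University of Massachusetts Lowell; University of Amsterdam; Amazon AGI; Dartmouth College
Subjects: Computation and Language (cs.CL)
Comments: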
Abstract:Creativity research has debated whether creativity is domain-specific (e.g., visual, writing, science), and if it is psychometrically separable from general intelligence. Both questions now apply to LLMs, but a unified benchmark of AI creativity remains elusive. We introduce AGC-Bench, an artificial general creativity benchmark built from a systematic review of the AI creativity literature (3,101 papers screened, 497 benchmarks identified), paired with an agentic harness that converts idiosyncratic codebases into HELM-standardized benchmarks. The first release covers 78 datasets spanning brainstorming, problem solving, STEM, narrative, figurative language, and humor. To address bias in LLM-as-judge, we apply Judge Response Theory – a psychometric calibration of judge leniency/severity; we then fine-tune Qwen3-30B on the bias-corrected ratings of three frontier LLMs to produce AGC-Judge, an open-weight model that robustly scores new creativity benchmarks it was not trained on. Results reveal frontier models at the top of the AGC-Bench leaderboard, with open models close behind. LLMs show different creative strengths, ranking higher on some domains (e.g., writing) than others (e.g., scientific ideation). Extensive experiments yield three main findings. First, applying factor analysis across 83 LLMs, we recover a single creativity factor ‘c’, analogous to the ‘g’ factor of general intelligence, that explains 81.5% of variance, related to but separable from general knowledge/reasoning. Second, we show that prompting models to “be creative” boosts their performance far more than enabling reasoning, evidence that the benchmark tracks creativity over general ability. Third, on a human-matched subset, we find the top human still leads the top LLM on creativity. We release AGC-Bench with a public leaderboard, AGC-Judge, and human data as open infrastructure for measuring AI creativity at scale.
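[NLP-9] $\text{Log}_\text{b}$Quant: Quantizing Language Models in Logarithmic Space
[Quick read]: This paper addresses the performance degradation of language models under low-precision quantization (such as 4-bit), where uniform quantization codebooks represent low-frequency, high-magnitude weights poorly. The proposed solution is $\text{Log}_\text{b}$Quant, a logarithmic quantization approach with an adjustable base that adapts to common parameter distributions. At 4-bit precision and tensor-wise granularity, it outperforms asymmetric linear quantization on several benchmarks while delivering moderate speedup and high memory savings, making it suitable for private use on consumer-grade GPUs.
Link: https://arxiv.org/abs/2607.01127
Authors: Jeremias Bohn,Tizian Dippold,Mahdi Koubaa,Elias R. Wahl,Georg Groh
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: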
Abstract:Quantization has become an invaluable tool to reduce the memory requirements and inference latency of modern language models, in particular to make them available for consumer setups and edge devices. While previous work has primarily focused on uniform quantization codebooks, such approaches are prone to suboptimal representations due to low-frequency high-magnitude weights. We introduce $\text{Log}_\text{b}$Quant, a novel logarithmic quantization approach with adjustable bases, to adapt to common parameter distributions. We show that our method exhibits superior performance at 4-bit precision on several performance benchmarks compared to asymmetric linear quantization at tensor-wise granularity, while achieving moderate speedup and high memory savings, making it suitable for private use on consumer-grade GPUs.
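To make the idea of logarithmic quantization with an adjustable base concrete, the toy sketch below stores each weight as a sign plus an integer exponent, so the representable magnitudes are `w_max * base**(-code)`. The codebook layout (one sign bit, the remaining bits for the exponent, no exact zero) and every name here are assumptions for illustration; the paper's actual scheme is not described in the abstract.

```python
import math

def log_quantize(w, w_max, base=2.0, bits=4):
    """Encode one weight as (sign, exponent code) on a logarithmic grid."""
    max_code = 2 ** (bits - 1) - 1  # 1 sign bit, (bits - 1) exponent bits
    sign = -1 if w < 0 else 1
    magnitude = abs(w)
    if magnitude == 0.0:
        return sign, max_code  # toy codebook has no exact zero; use the smallest level
    # Round to the nearest level in the log domain, then clamp to the code range.
    code = round(-math.log(magnitude / w_max, base))
    return sign, min(max(code, 0), max_code)

def log_dequantize(sign, code, w_max, base=2.0):
    """Decode (sign, exponent code) back to a real-valued weight."""
    return sign * w_max * base ** (-code)
```

A smaller base packs the levels more densely near `w_max`, while a larger base spends more levels on small magnitudes, which is the sense in which the base can be tuned to a weight distribution.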
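[NLP-10] Towards Developing a Multimodal Chat Assistant for University Stakeholders: RAG-based Approach
[Quick read]: This paper addresses the difficulty university stakeholders face in obtaining timely, reliable information, especially in developing countries, and the inability of existing rule-based chatbots to handle complex, domain-specific queries or adapt to evolving institutional policies. The proposed solution is a multimodal university chatbot based on retrieval-augmented generation (RAG), which combines a large language model with semantic retrieval to produce context-grounded responses from institution-centric resources such as the university handbook. The system accepts text and image queries through a vision-language model and applies quantized inference for rapid deployment on constrained hardware; a scalable FastAPI backend and a responsive frontend provide real-time usability. Multimodal evaluation shows strong satisfaction scores for both text and image queries despite longer response times for visual inputs, and quantitative evaluation shows that the RAG-based system reduces the hallucination rate from 31.7% to 6.6%.
Link: https://arxiv.org/abs/2607.01115
Authors: Md Abu Hanif Shaikh,Abdullah Al Shafi
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at 2025 28th International Conference on Computer and Information Technology (ICCIT)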
Abstract:University stakeholders often face difficulties in accessing timely and reliable information, especially in developing countries, where there are very few intelligent support systems. Existing rule-based chatbots are unable to handle complex, domain-specific queries and are not well-equipped to adapt to evolving institutional policies. As a fill-in-the-gap solution, we present the multimodal university chatbot with retrieval-augmented generation. The system combines the large language model with semantic retrieval to produce context-based responses from institution-centric resources, such as the university handbook. The system accepts text and image queries through the vision-language model and applies quantized inference for rapid deployment on constrained hardware. A scalable backend built with FastAPI, adjoined with a responsive frontend developed with this http URL, ensures real-time usability. Our multimodal evaluation demonstrates that the system maintains strong satisfaction scores across both text and image queries, despite increased response time for visual inputs. Furthermore, quantitative evaluation shows that hallucination is reduced from 31.7% to 6.6% in our proposed RAG-based system, confirming the effectiveness of retrieval grounding.
[NLP-11] CausalMix: Data Mixture as Causal Inference for Language Model Training
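[Quick read]: This paper addresses the inability of data-mixing strategies for large language model (LLM) training to adapt when the data distribution changes. Existing methods assume a static data distribution, so a larger or shifted data pool requires costly retraining from scratch, which prevents seamless scaling from small settings to larger data pools and model sizes. The proposed solution, CausalMix, casts data mixture optimization as a causal inference problem: statistical features of the data pool are the covariates, the domain mixture is the treatment, and a causal model estimates the Conditional Average Treatment Effect (CATE). After fitting the causal model on 512 runs of Qwen2.5-0.5B, the authors extrapolate the optimal mixture for an 800K data pool and use it to train a 7B model; the framework also generalizes to long chain-of-thought data on Qwen3-4B-Base. By isolating confounding biases, CausalMix infers state-dependent optimal mixtures, and its mixtures consistently outperform RegMix and other baselines across multiple downstream tasks. A CATE Interpreter provides visual analysis of the learned mixing strategy.
Link: https://arxiv.org/abs/2607.01104
Authors: Zinan Tang,Yukun Zhang,Shaomian Zheng,Zhuoshi Pan,Qizhi Pei,Dingnan Jin,Jun Zhou,Yujun Wang,Biqing Huang
Affiliations: Tsinghua University; Alibaba Group; Peking University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 22 pages, 3 figures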
Abstract:In Large Language Model (LLM) training, data mixing plays a pivotal role in determining model performance. Recent methods optimize mixture weights via proxy models, but they rely on the assumption of static data distributions. As a result, when the underlying data pool shifts, these methods require costly retraining from scratch. This limitation restricts their ability to scale seamlessly from small settings to larger data pools and model sizes. In this paper, we propose CausalMix to address this limitation by casting data mixture optimization as a causal inference problem. We formulate the statistical features of the data pool as covariates and the domain mixture as the treatment. After fitting a causal model on 512 runs of Qwen2.5-0.5B to estimate the Conditional Average Treatment Effect (CATE), we extrapolate the optimal mixture for an 800K data pool and apply it to train a 7B model. Furthermore, we successfully generalize the framework to long chain-of-thought data on Qwen3-4B-Base. By leveraging causal modeling to isolate confounding biases, CausalMix dynamically infers state-dependent optimal data mixtures. Extensive experiments show that the mixture guided by CausalMix consistently improves performance across multiple downstream tasks, outperforming RegMix and other baselines. In addition, we use the CATE Interpreter to provide visual analysis of the learned mixing strategy. Overall, CausalMix offers a causal and interpretable framework for optimizing LLM data mixtures.
[NLP-12] Clinician-Level Agreement Without Clinical Caution: LLM Evaluator Limits in Medical AI Benchmarking
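[Quick read]: This paper asks whether automated LLM evaluators for open-response medical questions reproduce the calibration and caution of clinicians. Open-response evaluation has stronger clinical validity than multiple-choice benchmarks, but manual scoring is a bottleneck, which motivates LLM-as-a-Judge approaches whose clinical caution has not been tested. The authors introduce MedQADE, the first standardised open-response clinical benchmark for German, comprising 3,800 items annotated by ten practising physicians and nine LLM evaluators. The best evaluator, Gemini 3 Flash, reaches agreement consistent with the physician ceiling (κ = 0.694 vs. κ = 0.709), although wide confidence intervals limit interpretation. Despite this statistical alignment, automated evaluators show almost no clinical metacognition: physicians abstain more often as item difficulty rises, while frontier models assign definitive scores in every case. The study also quantifies systematic lineage-dependent bias, in which models preferentially score architectural siblings, an effect independent of language. Statistical agreement therefore does not ensure clinical caution, and evaluator independence requires explicit verification.
Link: https://arxiv.org/abs/2607.01103
Authors: William Philipp,Finn Fassbender,Thorsten Langer,Martje Pauly,Rebecca Herzog,Alexander Baumann,Markus Hobert,Theresa Paulus,Ip Chi Wang,Lukas Goede,Johanna Reimer,Sebastian Löns,Ronald Böck,Sebastian Fudickar
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: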
Abstract:Open-response evaluation provides stronger clinical validity than multiple-choice benchmarks but creates a scoring bottleneck that motivates automated LLM-as-a-Judge approaches. Whether such evaluators replicate clinical calibration and caution, however, remains untested. We introduce MedQADE, the first standardised open-response clinical benchmark for German, a major clinical language lacking native evaluation infrastructure, comprising 3,800 items annotated by ten practising physicians and nine Large Language Model (LLM) evaluators. The top-performing evaluator model, Gemini 3 Flash, reached alignment consistent with the physician ceiling ($\kappa = 0.694$ vs. $\kappa = 0.709$), though wide confidence intervals limit interpretation. Despite this statistical alignment, automated evaluators exhibited near-absent clinical metacognition: physicians scaled abstention with item difficulty, while frontier models assigned definitive scores in every case. We additionally quantified systematic lineage-dependent biases, where models preferentially scored architectural siblings, an effect independent of language. These results show that statistical alignment does not ensure clinical caution, and that evaluator independence requires explicit verification.
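Agreement figures such as the κ values above are chance-corrected. The abstract does not say which kappa statistic is used, so the sketch below assumes plain two-rater Cohen's kappa over categorical scores; the function name is hypothetical.

```python
from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)
```

Note that kappa only scores items on which both raters commit to a label, so it cannot by itself reveal the abstention behaviour that the paper identifies as the difference between physicians and LLM evaluators.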
[NLP-13] Message Passing Enables Efficient Reasoning
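[Quick read]: This paper addresses the computational bottleneck of generating long chains-of-thought (CoT) in LLM reasoning, and the limits of both sequential scaling (CoT) and existing parallel scaling based on fork-and-join (FJ) primitives, in which threads are transient and cannot communicate point-to-point. The authors propose Message Passing Language Models (MPLMs), a framework in which reasoning threads communicate directly via lightweight send and receive primitives. Two mechanisms enable efficient scaling: reduced communication cost, achieved by avoiding redundant context sharing, and preemption, which lets a thread terminate early based on partial information from its peers. On Sudoku, MPLMs need an asymptotically smaller context than serial CoT and parallel FJ, and a single fine-tuned model solves 25 x 25 puzzles; on 3-SAT, preemption terminates unpromising branches and improves efficiency; and appropriately prompted large pre-trained models follow the MPLM protocol, achieving competitive results on long-context question answering relative to popular fork-join approaches.
Link: https://arxiv.org/abs/2607.01077
Authors: Xuecheng Liu,Daman Arora,Gokul Swamy,Andrea Zanette
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: pre-print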
Abstract:While inference-time scaling has improved the reasoning abilities of large language models (LLMs), the need to generate long chains-of-thought (CoTs) is a computational bottleneck. Thus, in contrast to sequential scaling methods like CoT, recent parallel scaling techniques instead use fork and join (FJ) primitives to divide work across multiple LLM threads. However, in the fork-join paradigm, threads are typically transient and do not communicate pointwise with one another which limits scalability. To tackle this, we introduce Message Passing Language Models (MPLMs), a framework for LLM reasoning in which threads communicate directly via lightweight send and receive primitives. MPLMs enable efficient scaling through two key mechanisms: (1) reduced communication costs, achieved by avoiding redundant context sharing, and (2) preemption, which allows threads to terminate early based on partial information from their peers. We demonstrate the promise of MPLMs on 3 classes of tasks. First, on Sudoku puzzles, we show that MPLMs require an asymptotically smaller context than both serial CoT and parallel FJ. We then fine-tune a single model to solve 25 x 25 puzzles that remain challenging for standard CoT and FJ approaches, as well as frontier reasoning models without tools. Second, on 3-SAT puzzles, the capability of preemption allows termination of unpromising branches, which results in improved efficiency. Finally, we show that appropriately prompted large pre-trained models follow the MPLM protocol, achieving competitive results on long-context question answering relative to popular fork-join approaches.
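The send/receive and preemption mechanisms can be mimicked with ordinary threads and queues. In the sketch below each worker stands in for a reasoning thread that searches its own branch, checks its inbox before every step, and notifies its peers when it finds a solution. The names `worker` and `solve`, and the use of Python threads in place of LLM threads, are assumptions for illustration only.

```python
import queue
import threading

def worker(wid, candidates, is_solution, inboxes, results):
    """Search one branch; stop early if a peer reports success."""
    for x in candidates:
        try:
            inboxes[wid].get_nowait()  # receive: non-blocking inbox check
            return  # preempted by a peer's message
        except queue.Empty:
            pass
        if is_solution(x):
            results.put(x)
            for peer, inbox in enumerate(inboxes):
                if peer != wid:
                    inbox.put("solved")  # send: short message, no shared context
            return

def solve(branches, is_solution):
    """Run one worker per branch and return a found solution, or None."""
    inboxes = [queue.Queue() for _ in branches]
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(i, b, is_solution, inboxes, results))
        for i, b in enumerate(branches)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return None if results.empty() else results.get()
```

The messages carry only a short status string, which mirrors the point about avoiding redundant context sharing: peers learn that they can stop without receiving the full history of the thread that succeeded.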
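[NLP-14] Agentic generation of verifiable rules for deterministic self-expanding reaction classification
[Quick Read]: This paper addresses the scalability and adaptability of the reaction-rule libraries used in synthesis planning: conventional approaches rely on hand-designed, fixed rulesets that cannot cope with the long-tailed distribution of chemical reactions or adapt to new chemistries. The key to the solution is a fully automated multi-agent framework in which large language models (LLMs) classify reactions and write the reaction rules themselves, generating each rule under a verification loop that tests it against a corpus of 665,901 US patent reactions. The method expands a standard reaction taxonomy from 68 to 14,073 classes without human curation. Combined with a lightweight fingerprint classifier, it classifies 97.7% of unseen reactions, matching a leading proprietary classifier while resolving chemistry more finely and extending on demand to chemistry outside its training distribution. The result is a living reactivity database and a general route to turning generative models into reliable, self-expanding symbolic systems.
Link: https://arxiv.org/abs/2607.01061
Authors: Daniel Armstrong,Maarten Dobbelaere,Valentas Olikauskas,Helena Avila,Octavian Susanu,Jérôme Waser,Philippe Schwaller
Affiliations: École Polytechnique Fédérale de Lausanne (EPFL); Ghent University; NCCR Catalysis
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: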
Abstract:Computer-assisted synthesis planning breaks target molecules into accessible precursors using large libraries of reaction rules that assign each transformation a deterministic, interpretable label. But chemistry is long-tailed, making manual encoding intractable, and existing tools rely on fixed rulesets that cannot adapt to new chemistries. Here we present a fully automated pipeline in which a multi-agent framework of large language models (LLMs) classifies reactions and writes the rules themselves across 665,901 US patent reactions, generating each rule under a verification loop that tests it against the corpus. It expands a standard taxonomy from 68 to 14,073 classes without human curation. With a lightweight fingerprint classifier, it classifies 97.7% of unseen reactions, matching a leading proprietary classifier while resolving chemistry more finely and extending on demand to chemistry outside its training distribution. The result is a living reactivity database and a general route to turning generative models into reliable, self-expanding symbolic systems.
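[NLP-15] Conversable Complexity: Agentic LLM Collectives as Interpretable Substrates
[Quick Read]: The paper addresses the tension between complexity and interpretability: systems rich enough for complex behaviour to emerge are usually too opaque to interrogate, while transparent systems are too simple for anything complex to emerge. It proposes collectives of agentic large language models (LLMs) as a computational substrate for Artificial Life (ALife) research. The key idea is to endow multiple LLMs with persistent memory, tools, shared skills, and the capacity to initiate actions unprompted, so that they form an interacting, dynamically evolving collective. Because the agents communicate in natural language, their collective behaviour can be interrogated by examining textual traces or by asking the agents themselves. The paper extends the notion of interpretability from language-model research to collectives of agents and surveys recent agentic LLM collectives, from controlled experiments to deployments in the wild.
Link: https://arxiv.org/abs/2607.01047
Authors: Elias Najarro,Ane Espeseth,Eleni Nisioti,Sebastian Risi,Stefano Nichele
Affiliations: IT University of Copenhagen; University of Oslo; Østfold University of Applied Sciences; Sakana AI
Subjects: Computation and Language (cs.CL)
Comments: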
Abstract:Complexity and interpretability rarely coincide: systems rich enough for complex behaviours to emerge are usually too opaque to question, while transparent ones are too simple for anything complex to emerge. A single large language model (LLM) is a static artefact, hardly exhibiting any of the emergent properties we associate with life. This changes through interaction: populations of LLMs display emergent dynamics absent from isolated models. Furthermore, LLMs can be endowed with persistent memory, tools and shared skills, and the capacity to initiate actions unprompted, i.e., turning LLMs agentic. In this paper, we argue that such collectives of agents can serve as a computational substrate for Artificial Life (ALife) research. Critically, since the agents communicate in natural language, their collective behaviour can be directly interrogated by examining textual traces and asking the agents themselves. We outline the notion of interpretability in language-model research and extend it for collectives of agents. Lastly, we survey recent examples of agentic LLM collectives that already instantiate the idea of agentic substrates, from controlled experiments to deployments in the wild.
[NLP-16] Evidence-Supported Credit Risk Report Generation Using News-Centric Financial Knowledge Graphs
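[Quick Read]: The paper addresses the lack of explicit links between financial market dynamics and the event drivers that remain implicit in news text. In credit risk report generation in particular, existing methods struggle to capture the relations among companies, events, and the external environment. The challenge is to extract interpretable event anchors from unstructured news and to build factual, company-centric, environment-aware knowledge graphs that support high-quality, low-hallucination report generation. The key to the solution is FinKG-News, a framework that automatically extracts news events as anchors and links them to company entities, producing a company-centric knowledge graph that integrates events, news, and company data. On top of this graph, the authors build an in-context learning architecture that generates credit risk reports across three core financial dimensions. The approach improves report quality by 19%-34% while reducing hallucinations.
Link: https://arxiv.org/abs/2607.01023
Authors: Rocio Jimenez-Villen,Ziwei Xu,Ying Chen,Oscar Araque,Ryutaro Ichise
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 15 pages, 5 figures, extended version of paper accepted at DEXA 2026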
Abstract:Financial markets evolve in response to real-world events reported in news, yet these drivers often remain implicit in text. To better explain market dynamics, event-market relations must be explicitly modeled through factual, company-centric, and environment-aware knowledge graphs. We present FinKG-News, a framework that automatically constructs such graphs by extracting news events as anchors linked to companies. Using FinKG-News as grounded evidence that integrates events, news, and company data, we develop an in-context learning architecture for credit risk report generation across three core financial dimensions. Automatic and human evaluations show that automated hallucination detection and quality assessment remain unreliable, making expert judgment indispensable. Our approach consistently outperforms baselines, improving quality by 19%-34% while reducing hallucinations. The source code and project resources are publicly available at: this https URL.
[NLP-17] Reading Order Inference for Complex Document Layouts
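[Quick Read]: The paper addresses reading order inference for digitized historical manuscripts whose pages contain multiple spatially interleaved reading streams. The typical case is the Glossa Ordinaria layout, where a central text is surrounded by commentaries that wrap around it in non-rectangular, non-convex regions. The challenge is to model the logical order of text lines without training data while avoiding the cascading "edge-theft" errors of greedy edge selection. The key to the solution is a training-free, graph-based framework: each OCR text line is a node in a directed graph; edges are scored by a weighted combination of two lightweight language-model signals (causal language model conditional likelihood and BERT next-sentence prediction, NSP); and the global reading order is recovered as a degree-constrained directed path cover. A max-regret inference rule prioritizes edges with high opportunity cost. On wrap-around Glossa layouts the method recovers 95% of ground-truth successor edges on average, versus 50% for recursive XY-cut; on the OmniDocBench multi-column subset it reaches 88% macro edge accuracy, versus 75% for XY-cut and 25% for LayoutReader. Under horizontal and vertical page reflections the method changes by less than 1 percentage point, versus 2 to 8 points for the baselines.
Link: https://arxiv.org/abs/2607.01018
Authors: Iddo Hakim,Sharva Gogawale,Omer Ventura,Gal Grudka,Daria Vasyutinsky-Shapira,Berat Kurar-Barakat,Nachum Dershowitz
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)
Comments: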
Abstract:Reading order inference remains a critical bottleneck in the digitization of complex historical manuscripts, where pages contain multiple spatially interleaved reading streams, the canonical example being the Glossa Ordinaria layout, in which a central text is surrounded by commentaries that wrap around it in non-rectangular, non-convex regions. We present a training-free, graph-based framework: each OCR text line becomes a node in a directed candidate-transition graph, edges are scored by a weighted additive ensemble of two lightweight language-model signals (causal language model conditional likelihood and BERT next-sentence prediction, NSP; a third sentence-embedding signal was evaluated but did not improve reading order), and the global reading order is recovered as a degree-constrained directed path cover. To avoid the cascading “edge-theft” failures of greedy edge selection, we propose a max-regret inference rule that prioritizes commitments with high opportunity cost. We evaluate on synthetic Glossa Ordinaria grid layouts, on 23 ALTO page geometries (10 historical source pages plus mirrored and flipped variants), and on a 140-page multi-column English subset of OmniDocBench, comparing our method against the canonical recursive XY-cut (PaddleOCR PP-StructureV3) and two LayoutReader variants (layout-only and text+layout) on identical inputs. On wrap-around Glossa layouts our method recovers 95% of ground-truth successor edges on average vs. XY-cut’s 50%; on the OmniDocBench multi-column subset it reaches 88% macro edge accuracy versus XY-cut’s 75% and LayoutReader’s 25%. The LayoutReader baselines transfer poorly due to a word-level vs. line-level granularity mismatch. We additionally verify mirror-invariance under horizontal and vertical page reflections: Our method changes by less than 1 percentage point, classical XY-cut by 2 points, and LayoutReader-T by up to 8 points.
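The difference between greedy edge selection and a max-regret rule on a degree-constrained path cover can be sketched as follows. This is a minimal illustration under stated assumptions: regret is taken here as the gap between a node's best and second-best feasible successor score (in the style of Vogel's approximation), and the edge scores are made up. The paper's exact regret definition and scoring ensemble are not given in the abstract, and the function names are hypothetical.

```python
def _tail(succ, u):
    # Follow committed successor links from u to the end of its path.
    while u in succ:
        u = succ[u]
    return u


def _feasible(succ, pred, u, v):
    # u needs a free out-slot, v a free in-slot, and u -> v must not close a cycle.
    return u != v and u not in succ and v not in pred and _tail(succ, v) != u


def greedy_path_cover(scores, min_score=0.0):
    """Baseline: commit edges in descending score order."""
    succ, pred = {}, {}
    for (u, v), s in sorted(scores.items(), key=lambda kv: -kv[1]):
        if s >= min_score and _feasible(succ, pred, u, v):
            succ[u], pred[v] = v, u
    return succ


def max_regret_path_cover(scores, min_score=0.0):
    """Commit first the node that would lose most by missing its best successor."""
    succ, pred = {}, {}
    sources = {u for u, _ in scores}
    while True:
        best = None  # (regret, top_score, u, v)
        for u in sources:
            options = sorted(
                ((s, v) for (a, v), s in scores.items()
                 if a == u and s >= min_score and _feasible(succ, pred, u, v)),
                reverse=True,
            )
            if not options:
                continue
            top_score, v = options[0]
            runner_up = options[1][0] if len(options) > 1 else min_score
            cand = (top_score - runner_up, top_score, u, v)
            if best is None or cand > best:
                best = cand
        if best is None:
            return succ
        _, _, u, v = best
        succ[u], pred[v] = v, u


# Made-up scores; the intended reading order is 0 -> 1 -> 2.
scores = {(0, 2): 0.90, (0, 1): 0.88, (1, 2): 0.85, (1, 0): 0.10}
print(greedy_path_cover(scores))
print(max_regret_path_cover(scores))
```

On these scores the greedy baseline commits 0 -> 2 first, which takes line 2 away from line 1 and forces the poor edge 1 -> 0. The max-regret rule commits 1 -> 2 first because line 1 has no good alternative, and line 0 then falls back to its nearly equal second choice.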
[NLP-18] Understanding Large Language Models
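[Quick Read]: The chapter addresses open debates about the mechanisms and capabilities of large language models (LLMs) and their relationship to human cognition, centrally whether LLM capabilities reflect genuine understanding or only imitation of patterns in the training data. It reviews recent evidence on emergent capabilities such as symbolic reasoning, theory of mind, and deception strategies, together with explainable AI methods, from neuron activation analysis to circuit tracing, that probe how these capabilities are implemented internally. It also criticizes the reductionist view that LLM behaviour is simply memorization of training data, arguing that this view rests on misconceptions about optimization processes and cognitive capacity. The authors advocate a more nuanced discussion that neither dismisses the differences between humans and LLMs nor precludes the possibility of AI cognition.
Link: https://arxiv.org/abs/2607.01006
Authors: Yannik Keller,Thomas Eisenmann
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 25 pages, 1 figure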
Abstract:Large Language Models (LLMs) represent one of the most significant advances in AI and natural language processing in recent years. Still, many pressing questions about their mechanisms, capabilities, and relationship to human cognition remain highly debated. This chapter aims to outline our current understanding of LLMs by discussing recent evidence on emerging capabilities and their mechanistic implementation within processing layers. We begin with a concise overview of the Transformer architecture, emphasizing how the attention mechanism enables training on massive datasets, allowing LLMs to function as generalist rather than specialized models. Next, we examine emergent LLM capabilities that appear to resemble aspects of human cognition, including symbolic reasoning, theory of mind, and deception strategies. Several studies provide evidence that LLMs can solve tasks previously thought to require human-like cognition. Other studies reveal insightful failure cases that shed light on the differences between human and LLM cognition. Alongside these findings, we review explainable AI approaches ranging from neuron activation analysis to circuit tracing. In the final section, we address current debates concerning what LLMs genuinely understand versus what they merely appear to understand. Prominent arguments against AI anthropomorphism point to the simplicity of LLM training objectives, claiming that LLM behavior is better explained by pattern memorization of training data than by genuine cognition. We argue that this standpoint is guided by misconceptions about optimization processes and cognitive capacity, and advocate for a more nuanced discussion of LLM cognition that neither dismisses the differences between humans and LLMs nor precludes the possibility of AI cognition through overly simplistic reductionist arguments.
[NLP-19] Logit-Contribution Scoring Identifies Non-Literal Retrieval Heads
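[Quick Read]: The paper addresses a gap in attention-head detection. In long-context settings, LLMs often synthesize answers from the meaning of a relevant span rather than copying it (non-literal synthesis), and existing detectors miss the heads responsible. Those detectors use a literal-copy criterion that rewards heads whose attended token matches the generated token, which captures where a head reads but not what it writes through its output-value (OV) circuit. The proposed Logit-Contribution Scoring (LOCOS) is a write-aware detector that scores each head by the projection of its OV-circuit output onto the answer-token unembedding direction, contrasting needle and off-needle source positions in a single forward pass. Across three model families (Qwen3, Gemma-3, OLMo-3.1), mean-ablating the top LOCOS heads collapses ROUGE-L on the NoLiMa non-literal retrieval benchmark at lower head counts than prior attention-based detectors; on Qwen3-8B, ablating 50 heads drives ROUGE-L from 0.401 to 0.000, while the strongest baseline retains 0.292. Parametric recall and arithmetic reasoning stay at baseline under the same ablation, which indicates that the selected heads are retrieval-specific. On Qwen3-8B the same ablation also drops MuSiQue from 0.55 to 0.08 and BABILong from 0.62 to 0.20, while a random-heads control stays within 0.05 of baseline.
Link: https://arxiv.org/abs/2607.01002
Authors: Aryo Pradipta Gema,Beatrice Alex,Pasquale Minervini
Affiliations: University of Edinburgh; Heriot-Watt University; Miniml.AI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 41 pages, 18 figures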
Abstract:In long-context use, large language models frequently synthesize answers from the meaning of a relevant context span rather than literally copy-pasting them. Identifying which attention heads perform this synthesis matters for interpreting long-context model behavior. Yet existing detectors miss these heads by construction: they reward heads whose attended token matches the generated token, a literal-copy criterion that captures where a head reads but not what it writes through its output-value (OV) circuit, the very mechanism that carries non-literal retrieval. We introduce Logit-Contribution Scoring (LOCOS), a write-aware detector that scores each head by the projection of its OV-circuit output onto the answer-token unembedding direction, contrasting needle and off-needle source positions in a single forward pass. Across three model families (Qwen3, Gemma-3, OLMo-3.1), mean-ablating the top LOCOS heads on the NoLiMa non-literal retrieval benchmark collapses ROUGE-L at lower head counts than prior attention-based detections; on Qwen3-8B, ablating 50 heads drives ROUGE-L from 0.401 to 0.000 while the strongest baseline still retains 0.292. The selected heads are retrieval-specific: parametric recall and arithmetic reasoning stay at baseline under the same ablation. On Qwen3-8B, the same ablation also drops MuSiQue from 0.55 to 0.08 and BABI-Long from 0.62 to 0.20, while a random-heads control stays within 0.05 of baseline.
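The idea of scoring a head by what it writes rather than where it reads can be sketched for a single head in plain Python. The sketch assumes the head's contribution from source position s is its attention weight times the value vector, passed through the output matrix and projected onto the answer token's unembedding vector, and that the contrast is needle total minus off-needle total. Layer normalization is ignored, all tensors are tiny made-up lists, and the function names are hypothetical; the actual LOCOS formula may differ.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def head_logit_contributions(attn, values, w_o, unembed):
    """Per-source contribution of one head to the answer-token logit.

    attn[s]       attention weight from the final query position to source s
    values[s]     value vector at source s, length d_head
    w_o[i]        row i of the head's output matrix, length d_model
    unembed       unembedding vector of the answer token, length d_model
    """
    # Direction in head space that raises the answer logit: W_O applied to u.
    head_dir = [dot(row, unembed) for row in w_o]
    return [a * dot(v, head_dir) for a, v in zip(attn, values)]


def write_aware_score(contribs, needle_positions):
    """Assumed contrast: logit mass written from the needle minus the rest."""
    needle = sum(c for s, c in enumerate(contribs) if s in needle_positions)
    off = sum(c for s, c in enumerate(contribs) if s not in needle_positions)
    return needle - off


# Made-up example: 3 source positions, d_head = 2, d_model = 3.
attn = [0.1, 0.8, 0.1]
values = [[1, 0], [0, 2], [1, 1]]
w_o = [[1, 0, 0], [0, 1, 0]]
unembed = [0, 1, 0]

contribs = head_logit_contributions(attn, values, w_o, unembed)
print(contribs, write_aware_score(contribs, needle_positions={1}))
```

A head can attend strongly to the needle and still score near zero here if its value and output matrices write in a direction orthogonal to the answer token, which is the distinction the abstract draws against attention-only detectors.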
[NLP-20] KnowledgeDebugger – an Exploration Tool for Knowledge Localization and Editing in Transformers
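[Quick Read]: The paper targets research on how Transformers store knowledge and how that knowledge can be located and edited. Work in this area typically begins by exploring phenomena on individual samples, a phase that is slow when done by hand. The authors propose KnowledgeDebugger, a GUI-based exploration tool for knowledge localization and editing. Its key feature is no-code access to the methods in EasyEdit, a widely used library of knowledge-editing approaches, so that researchers can run editing experiments on single samples before moving on to larger, statistically robust studies. Case studies of recent findings in the field demonstrate the tool's effectiveness.
Link: https://arxiv.org/abs/2607.01000
Authors: Eric Benz,Lennart Stöpler,Nikolai Bolik,Artur Andrzejak
Affiliations: Heidelberg University
Subjects: Computation and Language (cs.CL)
Comments: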
Abstract:Recent research has increasingly focused on understanding how Transformers store and process knowledge, as well as how this knowledge can be edited. Research work in this area is often conducted in two phases: first, phenomena are explored on individual samples. Then, when results appear promising, more statistically robust experiments follow. To support the first phase, we propose KnowledgeDebugger, a GUI-based exploration tool for knowledge localization and editing in Transformers. Our tool - inspired by LM-Debugger - offers no-code access to the methods in EasyEdit, a widely used library of state-of-the-art Knowledge Editing approaches. We demonstrate the tool’s effectiveness through case studies of recent findings in this field.
[NLP-21] Svarna: An Open Corpus Workbench for Modern Greek
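[Quick Read]: The paper addresses the long-standing fragmentation and restricted availability of resources for Modern Greek language technology: corpora exist, but they are scattered across platforms, and many sit behind institutional access or are no longer online. The solution, Svarna, is a free, open-source, web-based corpus workbench that brings five corpora covering institutional, literary, dialectal, social media, and historical registers into a single interface, totalling more than 507 million words and about 29 million sentences. The system is built on SQLite FTS5 full-text indexes served through a FastAPI backend and deployed as Docker containers on Azure; it requires no login or installation. It offers a concordancer with KWIC marking, frequency analysis with per-register normalization, collocation extraction using mutual information, a dictionary of 93 Greek discourse markers with distribution profiles, text-level analysis (n-grams, variants, collocation networks), register comparison using log-ratio, regular expression search, and an optional LLM layer for pragmatic annotation and a free research mode. The system is released under the MIT license, with source code, build scripts, and deployment configurations public on GitHub; users can add their own corpora and deploy their own instances.
Link: https://arxiv.org/abs/2607.00970
Authors: Stergios Chatzikyriakidis
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: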
Abstract:This paper introduces Svarna, a free, open-source, web-based corpus workbench for modern Greek. Svarna integrates five databases covering various registers, institutional, literary, dialectal, social media, and historical, to provide a total of more than 507 million words and around 29 million sentences. This platform addresses the chronic gaps in Greek language technology. Although various corpus resources exist, they are scattered across different platforms, and in many cases, institutional access is restricted or they are no longer available online. Svarna integrates these resources into a single interface that can be used without logging in, installation, or specialized training. This system provides a concordancer with KWIC marking capabilities, frequency analysis including register-by-register normalization, collocation extraction using mutual information, a dictionary of 93 Greek discourse markers providing distribution profiles, text-level analysis tools including n-grams, variants, and collocation networks, register comparison using log-ratio, regular expression search, and an optional LLM layer for pragmatic annotation and free research mode. This platform is built upon SQLite FTS5 full-text indexes provided via a FastAPI backend, deployed as Docker containers on Azure, and released under the MIT license. Source code, build scripts, and deployment configurations are publicly available on GitHub. Users can add their own corpora and deploy their own instances. This document describes the system design, corpus structure, and use cases demonstrating the various queries supported by the platform. Svarna serves as the first step in exploring available data and is expected to lay the foundation for more comprehensive research in the future.
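Three of the statistics the abstract names (per-register normalization, log-ratio, and mutual information) have standard textbook definitions, sketched below. These are the usual corpus-linguistics formulas, not code taken from Svarna; the function names, the per-million scale, and the zero-count substitution are assumptions for illustration.

```python
import math


def per_million(freq, corpus_size):
    """Normalized frequency, so registers of different sizes can be compared."""
    return freq * 1_000_000 / corpus_size


def log_ratio(freq_a, size_a, freq_b, size_b, zero_sub=0.5):
    """Binary log of the ratio of relative frequencies in registers A and B.

    A zero count is replaced by `zero_sub` (an assumed convention) so the
    logarithm stays defined.
    """
    fa = freq_a if freq_a > 0 else zero_sub
    fb = freq_b if freq_b > 0 else zero_sub
    return math.log2((fa * size_b) / (fb * size_a))


def pointwise_mi(freq_pair, freq_x, freq_y, corpus_size):
    """Pointwise mutual information of a node word x and a collocate y."""
    return math.log2((freq_pair * corpus_size) / (freq_x * freq_y))


# Made-up counts: a word seen 20 times in one 1,000-token register
# and 5 times in another register of the same size.
print(per_million(20, 1000))
print(log_ratio(20, 1000, 5, 1000))
print(pointwise_mi(10, 100, 50, 100_000))
```

A log-ratio of 2 means the item is four times as frequent, relative to register size, in the first register as in the second; each additional point doubles that ratio.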
[NLP-22] Persona Non Grata: LLM Persona-Driven Generations in MCQA are Unstable in Distinct Dimensions
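[Quick Read]: The paper studies the instability of persona-driven generation (PDG) when the output is not free-form text, specifically in multiple-choice question answering (MCQA). The stability and consistency of personas expressed through free text such as dialogue are well studied, but personas expressed through structured outputs such as answer choices are often overlooked. The authors propose three metrics that measure stability along three dimensions: performance, outcome, and question correctness. They find that instability varies with model family, model size, and question domain, with math and commonsense questions producing more instability, and that task prompt format introduces more prediction instability than hyperparameters such as temperature. Instability is also related to task accuracy, and the metrics reveal that similar experimental settings can yield different best and worst personas for a task, which is why hyperparameter-induced instability should be checked in PDG work.
Link: https://arxiv.org/abs/2607.00937
Authors: César Guerra-Solano,Xiang Lorraine Li
Affiliations: University of Pittsburgh
Subjects: Computation and Language (cs.CL)
Comments: 23 pages, 12 figures. Under review at ARR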
Abstract:Persona-driven generations (PDGs) have seen prolific use in research and industry applications, where a large language model (LLM) takes on a ‘persona’ while completing some task. While persona expressed through free-form text (like dialogue) has substantial work investigating stability or consistency, relatively, persona expressed in non-text-heavy outputs (like in multiple-choice question answering, or MCQA) is often overlooked. We work to address this gap, seeking to understand the instability of LLM PDGs in MCQA tasks. We develop three metrics investigating the performance, outcome, and question correctness stability, evaluating three distinct dimensions. Using these metrics, we find that instability varies consistently between model families and model size, and across question domains, with math/commonsense questions leading to greater instability. We also find task prompt format introduces more prediction instability than other hyperparameters, like temperature. Finally, we find that instability is related to task accuracy, and using our instability metrics, find different experimental settings that result in different best and worst personas for tasks, despite their similarity. This reveals the importance of checking hyperparameter instability in PDGs.
[NLP-23] Graph-Native Reinforcement Learning Enables Traceable Scientific Hypothesis Generation through Conceptual Recombination
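[Quick Read]: The paper addresses the weak traceability of LLM answers to open-ended materials design problems: standard models produce fluent text, but their reasoning is implicit, so it is hard to tell whether a final answer is supported by coherent intermediate steps. The authors develop Graph-PRefLexOR, a family of graph-native reasoning models fine-tuned with Group Relative Policy Optimization (GRPO) that organize reasoning into explicit phases for mechanism exploration, graph construction, pattern extraction, and hypothesis synthesis. This design couples neural language generation with symbolic relational structure, so that causal connections can be constructed, inspected, and reused. On 100 open-ended questions from the materials science and mechanics literature, Graph-PRefLexOR improves on the corresponding base models by 40-65%, with the largest gains in reasoning traceability. Embedding analyses show broader semantic exploration and roughly 2-3 times greater semantic diversity than baselines; semantic backtracking and layer-wise hidden-state analyses show stronger alignment between structured reasoning and final answers. Test-time graph expansion indicates that extra compute mainly increases long-range conceptual recombination within a bounded semantic space rather than widening semantic coverage.
Link: https://arxiv.org/abs/2607.00924
Authors: Subhadeep Pal,Shashwat Sourav,Tirthankar Ghosal,Markus J. Buehler
Affiliations: Massachusetts Institute of Technology; Oak Ridge National Laboratory; Lawrence Berkeley National Laboratory
Subjects: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: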
Abstract:Accelerating materials discovery requires AI systems that can generate scientifically valid hypotheses through multi-step, domain-grounded reasoning. Standard large language models often produce fluent but weakly traceable responses to open-ended materials design problems, making it difficult to determine whether final answers are supported by coherent intermediate reasoning. We develop Graph-PRefLexOR, a family of graph-native reasoning models fine-tuned with Group Relative Policy Optimization (GRPO) to organize reasoning into explicit phases for mechanism exploration, graph construction, pattern extraction, and hypothesis synthesis. This design links neural language generation with symbolic relational structure, enabling causal connections to be constructed, inspected, and reused. On 100 open-ended questions from materials science and mechanics literature, Graph-PRefLexOR achieves 40-65% improvements over corresponding base models, with the largest gains in reasoning traceability. Embedding analyses show broader semantic exploration and approximately 2-3 times greater semantic diversity than baselines. Semantic backtracking and layer-wise hidden-state analyses further show stronger alignment between structured reasoning and final answers. Finally, test-time graph expansion reveals that additional compute primarily increases long-range conceptual recombination within a bounded semantic space, rather than simply expanding semantic coverage. These results establish graph-native reinforcement learning as a pathway toward interpretable AI systems for scientific hypothesis generation in materials design and other scientific applications.
[NLP-24] Beyond Document Grounding: Span-Level Hallucination Detection over Code Tool Output and Documents
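[Quick Read]: The paper addresses span-level hallucination detection for retrieval-augmented generation (RAG) over structured, non-natural-language inputs such as source code, developer-tool output, markdown documents, tables, and repository metadata. Hallucination detection is usually evaluated only on natural-language document evidence, which leaves these increasingly common inputs uncovered. The key contribution is a unified benchmark spanning code, tool output, structured documents, and existing natural-language RAG datasets, built by starting from grounded correct answers, injecting localized hallucinations with exact character labels, and validating the code test split with evidence-based review. A fine-tuned Qwen3.5-2B detector reaches 0.689 span-F1 on the unified test set and 0.60 on the code-agent source, well above LettuceDetect-large (0.17) and the strongest zero-shot LLM judges evaluated (at most 0.22). It stays competitive on established natural-language benchmarks, with 81.8 example-F1 on RAGTruth and 0.724 IoU on English PsiloQA.
Link: https://arxiv.org/abs/2607.00895
Authors: Ádám Kovács,Bowei He,Xue Liu,István Boros,Szilveszter Tóth,Gábor Recski
Affiliations: KR Labs; MBZUAI; McGill University; TU Wien
Subjects: Computation and Language (cs.CL)
Comments: 8 pages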
Abstract:Hallucination detection for retrieval-augmented generation (RAG) is usually evaluated on natural-language document evidence. However, grounded generation systems increasingly rely on structured inputs: source code, developer-tool output, markdown documents, tables, and repository metadata. We introduce a unified benchmark for span-level hallucination detection over code, tool output, structured documents, and existing natural-language RAG datasets. The benchmark is built by starting from grounded correct answers, injecting localized hallucinations with exact character labels, and validating the code test split with evidence-based review. Our fine-tuned Qwen3.5-2B detector reaches 0.689 span-F1 on the unified test set and 0.60 on the code-agent source, where it substantially outperforms LettuceDetect-large (0.17) and the strongest zero-shot LLM judges we evaluated (at most 0.22). The same model remains competitive on established natural-language benchmarks, with 81.8 RAGTruth example-F1 and 0.724 English PsiloQA IoU.
[NLP-25] MultiSynt/MT: Trillion-Token Multi-Parallel Pre-Training Data Translated Across 36 Languages
[Quick Read]: The paper addresses the concentration of open web-scale pre-training corpora in English, which limits multilingual LLM development, particularly for medium- and lower-resource European languages. The solution is MultiSynt/MT, an open synthetic parallel corpus of approximately 4.8 trillion target-language tokens across 36 European languages, produced by translating 100 billion high-quality Nemotron-CC tokens with Tower+ and OPUS-MT/HPLT-MT systems. For many medium- and lower-resource European languages it is the largest openly available pre-training resource. On a broad multilingual benchmark suite, reference models trained on MultiSynt/MT reach the final score of HPLT 2.0, a native-data baseline, with roughly 72% fewer pre-training tokens, and outperform it by about 15% relative at a matched 100B-token training budget. The analysis also identifies evaluation blind spots: standard multiple-choice benchmarks miss translation-quality differences that a fluency-sensitive LLM-as-judge evaluation recovers, and Norwegian idiomatic and culturally grounded tasks remain better served by native data. The corpus is released with row-aligned translations from multiple systems to support controlled research.
Link: https://arxiv.org/abs/2607.00890
Authors: Maximilian Idahl,Jörg Tiedemann,Sampo Pyysalo,David Salinas,Tomasz Galica,Shenbin Qian,Tudor Nicolae Mateiu,Zihao Li,Anna Lokrantz,Fedor Vitiugin,André F. T. Martins,Jenna Kanerva,Filip Ginter,Matthias Lindemann,Tim Isbister,Birger Moell,Jonas Lindh,Jan Hajič,Jenia Jitsev,Andrey Kutuzov,Stephan Oepen,Gema Ramírez-Sánchez
Affiliations: Leibniz University Hannover; University of Helsinki; University of Turku; ELLIS Institute Tübingen; Prior Labs; University of Oslo; Prompsit Language Engineering; AI Sweden; Instituto de Telecomunicações; Instituto Superior Técnico; TransPerfect; Charles University; Ontocord; LAION; Open-Ψ (Open-Sci) Collective; Juelich Supercomputing Center (JSC), Research Center Juelich (FZJ)
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Open web-scale pre-training corpora remain concentrated in English, limiting multilingual LLM development. We introduce MultiSynt/MT, an open synthetic parallel corpus with approximately 4.8 trillion target-language tokens across 36 European languages, produced by translating 100 billion high-quality Nemotron-CC tokens with Tower+ and OPUS-MT/HPLT-MT systems. For many medium- and lower-resource European languages, this is the largest openly available pre-training resource. On a broad multilingual benchmark suite, reference LLMs trained on MultiSynt/MT reach the final score of HPLT 2.0, a native-data baseline, using roughly 72% fewer pre-training tokens, and outperform it by approximately 15% relative at a matched 100B-token training budget. Our analyses also identify evaluation blind spots: standard multiple-choice benchmarks miss translation-quality differences that a fluency-sensitive LLM-as-judge evaluation cleanly recovers on the trained LLMs (with no fluency deficit in MultiSynt itself), and Norwegian idiomatic and culturally grounded tasks remain better served by native data. We release the corpus, including row-aligned translations from multiple systems, to support controlled research on multilingual pre-training data and evaluation.
[NLP-26] How Ethos and Pathos Appeals Resonate in Reader Interpretations of Social Media Messages SIGDIAL
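[Quick Read]: The paper addresses a blind spot in studies of rhetoric on social media: they focus on visible engagement such as comments and overlook the silent audience, the majority of readers who never state how a message affects them. It investigates how two classical modes of persuasion, ethos and pathos, carry over into this audience's interpretations. Using a dataset of social media sentences paired with human-written interpretations, the authors label both the originals and the interpretations for ethos and pathos and measure whether the appeals are preserved. Interpretations diverge from the original sentences in 30% of cases, and rhetorically charged content elicits more variability than neutral content. Ethos and pathos in the original sentences also predict audience attitudes toward the author, which indicates that rhetoric shapes perception beyond visible engagement.
Link: https://arxiv.org/abs/2607.00873
Authors: Ewelina Gajewska,Katarzyna Budzynska,Jaroslaw Chudziak,Liesbeth Allein
Affiliations: Warsaw University of Technology, Poland; Department of Computer Science, KU Leuven, Belgium; Department of Electronics and Information Systems, Ghent University, Belgium
Subjects: Computation and Language (cs.CL)
Comments: The article has been accepted to the 27th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL) that will be held in Atlanta, Georgia on August 2-5, 2026. The official version will appear in the conference proceedings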
Abstract:Rhetorical strategies and their influence on audiences are often studied through social media posts and comments. However, this focus overlooks the universal audience, which is the majority of readers who remain silent and do not explicitly express how a message affects them. This study investigates how two classical modes of persuasion, ethos and pathos, resonate in the silent audience’s interpretations of meaning. Using a dataset of social media sentences paired with human-written interpretations, we label both sources for ethos and pathos and assess whether these rhetorical appeals are preserved. Our analyses show that interpretations diverge from the original sentences in 30% of cases, with rhetorically charged content eliciting greater variability than neutral content. We further find that ethos and pathos in original sentences can predict audience attitudes toward the author, underscoring the subtle ways rhetoric shapes perception beyond visible engagement.
[NLP-27] Self-Evolving Agents with Anytime-Valid Certificates
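[Quick Read]: The paper addresses a basic problem for guarantees about self-evolving agents: the data, evaluator, components, and hypothesis space are all produced by the policy being updated, which violates the assumptions behind most learning-theoretic guarantees. The proposed architecture, SEA, confines self-modification to a small steering adapter and a versioned harness around a frozen base model, and admits each modification only through an anytime-valid gate that emits an auditable certificate against a fixed error budget. Five loop controllers compose published guarantees. Because such gates can only select among behaviours the frozen base model already produces, five verifier-in-the-loop mechanisms (best-of-N, micro-step search, self-authored reproduction oracles, search-layer control, and self-repair) supply the dense, grader-free signal the gates need, computed from the issue text alone. On a 52-instance SWE-bench Verified subset across four base models, base capability is the dominant, confound-free effect; on two strong base models a deliberate no-op-composite control isolates the suite's contribution at +4 and +5 (Glm 5.2 24→28; Gpt 29→34, the 65% best), with event logs confirming that the mechanisms fire and prevent regressions. Results come from single runs of expensive evaluations; confirming run-to-run variance and adapting the per-task algorithm mix are left to future work.
Link: https://arxiv.org/abs/2607.00871
Authors: Biswa Sengupta
Affiliations: JPMorgan Chase & Co.
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: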
Abstract:Self-evolving agents violate the assumption behind most learning-theoretic guarantees: the data, evaluator, components, and hypothesis space are produced by the policy being updated. We present SEA, an architecture that confines self-modification to a small steering adapter and a versioned harness around a frozen base model and admits each modification only through an anytime-valid gate that emits an auditable certificate against a fixed error budget. Five loop controllers compose published guarantees; because such gates can only select among behaviors the frozen base already produces, five verifier-in-the-loop mechanisms – best-of-N, micro-step search, self-authored reproduction oracles, search-layer control, and self-repair – supply the dense, grader-free signal the gates require, computed from the issue text alone. On a 52-instance SWE-bench Verified subset across four base models, base capability is the dominant, confound-free effect, and on two strong base models a deliberate no-op-composite control isolates the suite’s contribution at +4 and +5 (Glm 5.2 24→28; Gpt 29→34, the 65% best), with event logs confirming that its mechanisms fire and prevent regressions. Results are single-run on expensive evaluations; confirming run-to-run variance and adapting the per-task algorithm mix are future work.
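One standard way to build an anytime-valid gate is a betting test martingale, sketched below. The abstract does not say which construction SEA uses, so this is an assumed stand-in: the null hypothesis is that the modified agent's success rate is at most `p0`, wealth is multiplied by `1 + bet * (x - p0)` after each binary outcome `x`, and the modification is admitted once wealth reaches `1 / alpha`. By Ville's inequality the chance of ever admitting under the null is at most `alpha`, no matter when the check stops. The function name and parameters are hypothetical.

```python
def anytime_valid_gate(outcomes, p0, alpha=0.05, bet=1.0):
    """Sequential test of H0: success rate <= p0, valid at any stopping time.

    outcomes  iterable of 0/1 results for the candidate modification
    p0        baseline success rate the candidate must beat
    alpha     error budget: bound on the probability of a false admission
    bet       stake per round; must keep every wealth factor positive
    Returns (admitted, steps_used, final_wealth).
    """
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 must lie strictly between 0 and 1")
    if not 0.0 <= bet < 1.0 / p0:
        raise ValueError("bet must lie in [0, 1/p0)")
    wealth = 1.0
    steps = 0
    for x in outcomes:
        steps += 1
        # Under H0 the expected factor is at most 1, so wealth is a supermartingale.
        wealth *= 1.0 + bet * (x - p0)
        if wealth >= 1.0 / alpha:
            return True, steps, wealth
    return False, steps, wealth


# A candidate that succeeds every time against a 50% baseline.
print(anytime_valid_gate([1] * 10, p0=0.5))
# A candidate that never succeeds is never admitted.
print(anytime_valid_gate([0] * 10, p0=0.5))
```

The final wealth and the step at which the threshold was crossed can be logged alongside the outcomes, which is one way a gate of this kind could back an auditable certificate against a fixed error budget.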
[NLP-28] Dynamic Bidirectional Pattern Memory: A Production-Scale Empirical Characterisation of Inference-Time Gating in Clinical NLP
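[Quick Read]: The paper studies an efficiency bottleneck in generator-verifier pipelines for large-scale clinical natural language processing (NLP): at inference time the verifier repeatedly re-examines candidates already seen to fail. The question is whether a lightweight pattern memory can learn at deployment which extractions to filter, without adding work for the verifier. The study finds that learning filtering rules directly from the verifier's rejections fails at full scale, because the rejections are spread too thinly across too many distinct forms, whereas a simple rule based on a fixed clinical ontology captures a large number of ontology-violating relations. Across all five versions of the question-answering filter, one pattern holds: a filter is selective only when it tests the same evidence the verifier weighs, not when it imitates the verifier's output. The system flags suspect extractions rather than deleting them, so every decision stays visible for clinical review.
Link: https://arxiv.org/abs/2607.00870
Authors: Ali H. Lazem,William Teahan
Affiliations: Bangor University
Subjects: Computation and Language (cs.CL)
Comments: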
Abstract:We study inference-time pattern-memory gating in a production-scale clinical natural language processing (NLP) pipeline. The pipeline pairs a generator (Llama-3.3 70B) proposing extractions with a verifier (MMed-Llama-3.1 70B) accepting or rejecting them, over 167,034 PMC-Patients narratives, and adds a lightweight memory that learns at deployment which extractions to filter, so the verifier need not re-examine candidates already seen to fail. We report four findings. First, learning filtering rules directly from the verifier’s rejections failed at full scale: the relation-extraction filter stayed empty despite 785,797 logged rejections, because they were spread too thinly across too many distinct forms to accumulate. Second, a simpler rule using a fixed clinical ontology produced the same filtering without the verifier, capturing 49,734 ontology-violating relations on a held-out 5,000-patient set. Third, of five versions of the question-answering filter, four failed for distinct, instructive reasons; the fifth succeeded by checking whether a patient’s extracted entities support the question asked, and where it applies was 1.84 times likelier to flag an answer the verifier would reject than one it would accept. Fourth, one pattern held across all five: a filter is selective only when it tests the same evidence the verifier weighs, not when it imitates the verifier’s output. Together these give a transferable result for any generator-verifier pipeline: the most natural memory design can fail silently at scale, and whether a pre-generation gate is selective is decided before any engineering effort, by whether its signal probes the question the verifier itself answers. Throughout, the system flags suspect extractions rather than deleting them, so every decision stays visible for clinical review. All code and test artefacts are released openly.
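The second finding, a fixed-ontology rule that flags relations without consulting the verifier, can be sketched as a type-signature check. The ontology below (three allowed triples), the field names, and the example extractions are invented for illustration; the paper's clinical ontology and data schema are not described in the abstract.

```python
# Hypothetical ontology: allowed (head type, relation, tail type) signatures.
ALLOWED_SIGNATURES = {
    ("Drug", "treats", "Disease"),
    ("Drug", "causes", "Symptom"),
    ("Procedure", "treats", "Disease"),
}


def gate_relations(extractions, allowed=ALLOWED_SIGNATURES):
    """Flag relations whose type signature the ontology does not allow.

    Nothing is deleted: each extraction is returned with a `flagged` field,
    so every decision remains visible for later review.
    """
    gated = []
    for ext in extractions:
        signature = (ext["head_type"], ext["relation"], ext["tail_type"])
        gated.append({**ext, "flagged": signature not in allowed})
    return gated


extractions = [
    {"head": "metformin", "head_type": "Drug",
     "relation": "treats", "tail": "type 2 diabetes", "tail_type": "Disease"},
    {"head": "nausea", "head_type": "Symptom",
     "relation": "treats", "tail": "metformin", "tail_type": "Drug"},
]

for row in gate_relations(extractions):
    print(row["head"], row["relation"], row["tail"], "flagged:", row["flagged"])
```

Because the rule depends only on a fixed set of signatures, it needs no accumulated rejection history, which is the property the abstract contrasts with the learned filter that stayed empty at full scale.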
[NLP-29] CAT: Confidence-Adaptive Thinking for Efficient Reasoning of Large Reasoning Models ACL2026
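[Quick read]: This paper addresses "overthinking" in Large Reasoning Models (LRMs): on simple queries they still generate long chain-of-thought (CoT) trajectories, which adds token overhead and lowers inference efficiency. Existing compression methods mostly apply uniform length reduction or rely on coarse-grained difficulty estimation, which often degrades performance on difficult problems. The paper proposes Confidence-Adaptive Thinking (CAT), a framework whose key idea is to use the model's intrinsic self-certainty signals as a confidence measure inside preference optimization, so that reasoning length is adjusted to problem difficulty: confident responses are compressed, while uncertain ones keep fuller deliberation. Experiments show that CAT consistently outperforms state-of-the-art baselines on reasoning accuracy across multiple benchmarks and base models, offering a potentially robust way to balance accuracy and latency in practical industrial scenarios.
Link: https://arxiv.org/abs/2607.00862
Authors: Qizhi Jiang, Shuo Wang, Pei Ke, Yuhang Song, Ke Qin
Affiliations: University of Electronic Science and Technology of China, Chengdu, China; Ubiquitous Intelligence and Trusted Services Key Laboratory of Sichuan Province
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at ACL 2026 Industry Track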
Abstract:Large Reasoning Models (LRMs) have achieved remarkable success on complex tasks by leveraging long chain-of-thought (CoT) trajectories, yet they frequently exhibit overthinking on simple queries, resulting in significant token overhead and reduced inference efficiency. However, existing compression methods predominantly apply uniform length reduction or rely on coarse-grained difficulty estimation, often leading to performance degradation on difficult problems. To address this limitation, we propose Confidence-Adaptive Thinking (CAT), a framework that incorporates the model’s intrinsic self-certainty signals as confidence into the preference optimization process, which autonomously modulates reasoning lengths based on problem difficulty. Experimental results show that CAT consistently outperforms state-of-the-art baselines on reasoning accuracy across multiple benchmarks on different base models. Our work enables LRMs to effectively compress confident responses while deliberating on uncertain ones, offering a potentially robust solution for balancing accuracy and latency in practical industrial scenarios.
[NLP-30] Recovering Input Text from Hidden States: Study of Gradient-Based Inversion of Decoder-Only Language Models
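[Quick read]: This paper studies hidden-state inversion for decoder-only language models: recovering the original input token sequence from last-layer hidden states. Its key design choice is to treat inversion as a continuous embedding-space optimisation in which a soft proxy is driven towards the leaked target with no hard-token projection during the search, and a token is committed only once, at the end of the inner loop. Keeping the optimisation in continuous space exposes internal signals such as rank trajectories of the ground-truth token, per-position loss curves, and a discrete loss measured at commit time; the cumulative discrete loss can then be used to assess whether recovery is correct. Failures are dominated by space-prefixed, high-frequency function words in dense regions of the embedding matrix, while content-bearing tokens are recovered almost perfectly. On 10-token C4 prompts the exact-match rate rises from 66.9% to 97.5% (mean similarity 0.994) as the candidate window is widened, indicating that most errors are recoverable near-misses rather than genuine ambiguities. Compared with the released SIPIT reference, per-step hard projection is faster, but the continuous formulation makes the optimisation observable and its failures detectable. The results indicate that GPT-2's last-layer hidden states should be treated as being as sensitive as the original text.
Link: https://arxiv.org/abs/2607.00852
Authors: Mikołaj Słowikowski, Maciej Witold Majewski
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: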
Abstract:This work studies the hidden-state inversion problem: recovering the original input token sequence of a decoder-only language model from its last-layer hidden states. Rather than treating inversion as a one-shot reconstruction, we study it as a continuous embedding-space optimisation in which a soft proxy is driven towards the leaked target without any hard-token projection during the search, and a token is committed only once, at the end of the inner loop. This design choice has two consequences which are the main focus of this paper. First, keeping the optimisation entirely in continuous space exposes a rich set of internal signals: rank trajectories of the ground-truth token, per-position loss curves, and a discrete loss measured at commit time. Second, the discrete loss allows assessing the correctness of recovery via cumulative discrete loss. We further analyse which tokens break the reconstructions and find a sharp categorical asymmetry: space-prefixed, high-frequency function words in dense regions of the embedding matrix dominate the failures, while content-bearing tokens are recovered almost perfectly. On 10-token C4 prompts the exact-match rate rises from 66.9% to 97.5% (mean similarity 0.994) as the candidate window is widened, confirming that most errors are recoverable near-misses rather than genuine ambiguities. A comparison with the released SIPIT reference situates these findings: per-step hard projection is faster, but the continuous formulation is what makes the optimisation observable and its failures detectable. The results show that last-layer hidden states of GPT-2 are as sensitive as the original text.
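[NLP-31] The Course of News Events: A Comparison of Bottom-Up and Top-Down Approaches for Collecting Text-Based Data about Disasters
[Quick read]: This paper addresses a key methodological question in socio-environmental research: how to select a representative sample of news data. Two approaches are common: querying news databases top-down with the aid of an existing disaster inventory, or using NLP methods to cluster news texts bottom-up by temporal and spatial features. Using a dataset of German news about landslides worldwide, the authors compare the two approaches and discuss how event coverage differs between them. The main point is that this research design decision shapes the resulting news sample, and therefore affects its use in studies of inequality in media coverage, disaster monitoring, and inventory enrichment.
Link: https://arxiv.org/abs/2607.00849
Authors: Brielen Madureira, Andreas Niekler, Mariana Madruga de Brito
Affiliations: LeipzigLab – Climate Discourse, Leipzig University, Germany; Helmholtz Centre for Environmental Research, Germany; Computational Humanities, Leipzig University, Germany
Categories: Computation and Language (cs.CL)
Comments: work in progress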
Abstract:News articles are an important source of information on disaster impacts and adaptation. A key methodological challenge in socio-environmental studies is how to select a representative data sample. Two approaches are common: querying news databases top-down with the aid of an existing disaster inventory or using NLP methods to cluster news texts bottom-up based on temporal and spatial features. Using a dataset of German news about landslides worldwide, we compare these approaches and discuss variations in event coverage. Such research design decisions can influence the resulting news sample, affecting its use in studies of inequality in media coverage, disaster monitoring and inventory enrichment.
[NLP-32] MetaHOPE: A Metaphor-Oriented Evaluation Framework for Analysing MT and LLM Translation Errors
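[Quick read]: This opinion paper addresses how to evaluate metaphor translation by machine translation (MT) systems and LLMs, where semantic complexity, contextual dependency, and cultural embedding can cause ambiguity and translation errors. It proposes MetaHOPE, an error severity-aware annotation framework for metaphor translations. The authors evaluate three representative systems (GoogleMT, GPT5.4, and Hunyuan-7b) on two human-annotated metaphor corpora, VUAMC and PSUCMC, for English-to-Chinese and Chinese-to-English translation. Because the original corpora are monolingual, the authors annotate translation errors with MetaHOPE and also produce human post-edited gold references, yielding a new bilingual resource. The framework, the parallel corpora, and the error analysis of state-of-the-art systems are offered as reusable resources for metaphor translation research.
Link: https://arxiv.org/abs/2607.00848
Authors: Jiahui Liang, Lifeng Han
Affiliations: Leiden University; Leiden University Medical Centre
Categories: Computation and Language (cs.CL)
Comments: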
Abstract:In this opinion paper, we propose MetaHOPE, an error severity-aware annotation framework for evaluating metaphor translations. Metaphors present challenges for machine translation (MT) and natural language understanding and processing (NLU, NLP), because they exhibit semantic complexity, contextual dependency, and cultural embedding, which can lead to ambiguity issues for NLP models. To investigate how state-of-the-art NLP models perform on translating metaphors, we select three representative systems, i.e., GoogleMT, GPT5.4, and Hunyuan-7b as Neural MT (NMT) models and LLMs. We used two human-annotated metaphor corpora, VUAMC and PSUCMC, for English-to-Chinese and Chinese-to-English translation purposes. The original corpora we used are monolingual; we carried out error annotation on them using the MetaHOPE framework, and also produced human post-edited gold references for bilingual use as a new resource. We believe the MetaHOPE evaluation framework for metaphor translation annotation, the parallel corpora resources, and the error analysis on SOTA automatic translation models can be useful and shed some light on the field of metaphor translation study. We will share our resources publicly upon paper acceptance.
[NLP-33] MSQA: A Natively Sourced Multilingual and Multicultural SimpleQA Benchmark
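[Quick read]: This paper examines what the authors call the Illusion of Cultural Alignment: the assumption that a model that can speak a user's language must also understand the culture encoded by that language. To test it directly, they introduce MSQA, a benchmark of 1,064 natively sourced questions across 11 language groups, five cultural dimensions, and three difficulty tiers, which targets locally grounded knowledge and reduces shortcuts from English-centric cross-lingual transfer. Evaluating 18 LLMs reveals substantial cultural degradation and a pronounced Locality Effect: cultural competence tracks pre-training exposure more closely than general reasoning ability. Common inference-time remedies do not dissolve the illusion: models remain overconfident on unfamiliar cultural questions, repeated sampling yields unstable rather than reliable correctness, and retrieval augmentation helps unevenly on long-tail facts. The authors conclude that cultural alignment cannot be inferred from multilingual ability alone and requires deeper intervention than calibration, sampling, or retrieval at inference time.
Link: https://arxiv.org/abs/2607.00724
Authors: Xianru Chen, Yukai Huang, Mingxiang Chen, Xinping Lei, Fangbing Deng, Jin Chen, Ge Zhang, Wenhao Huang, Jiaheng Liu
Affiliations: ByteDance Seed; Beijing University of Posts and Telecommunications; Nanjing University
Categories: Computation and Language (cs.CL)
Comments: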
Abstract:Multilingual fluency often invites a stronger assumption: a model that can speak a user’s language must also understand the culture encoded by that language. We call this the Illusion of Cultural Alignment. To test this assumption directly, we introduce MSQA, a benchmark of 1,064 natively sourced questions across 11 language groups, five cultural dimensions, and three difficulty tiers. Unlike translated benchmarks, MSQA targets locally grounded knowledge and reduces shortcuts from English-centric cross-lingual transfer. Evaluating 18 LLMs, we find substantial cultural degradation and a pronounced Locality Effect: cultural competence tracks pre-training exposure more closely than general reasoning ability. We further show that common inference-time remedies do not dissolve the illusion. Models remain overconfident on unfamiliar cultural questions, repeated sampling yields unstable rather than reliable correctness, and retrieval augmentation helps unevenly on long-tail facts. These findings indicate that cultural alignment cannot be inferred from multilingual ability alone and requires deeper intervention than calibration, sampling, or retrieval at inference time.
[NLP-34] Self-conditioned Flow Map Language Models via Fixed-point Flows
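[Quick read]: This paper addresses two open questions about self-conditioning in continuous flow-based language models: why it improves performance, and how to use it in few-step generators based on flow maps. The authors show that flow language models with self-conditioning solve a fixed-point iteration that bootstraps the performance of the learned denoiser. Building on this view, they formulate fixed-point flows, a two-dimensional class of self-conditioned flows in which the first dimension represents the flow process and the second represents the fixed-point iteration. Fixed-point flows define valid flow maps and can be distilled from self-conditioned flow models by compressing both dimensions: the fixed-point iterations with fixed-point distillation and the flow process with flow map distillation. The resulting flow map language model, FMLM^\star, outperforms state-of-the-art self-conditioned models and few-step models in one- and few-step generation on OpenWebText.
Link: https://arxiv.org/abs/2607.00714
Authors: Jaehoon Yoo, Wonjung Kim, Floor Eijkelboom, Chanhyuk Lee, Nicholas M. Boffi, Seunghoon Hong, Jinwoo Kim
Affiliations: KAIST; University of Amsterdam; Carnegie Mellon University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: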
Abstract:Self-conditioning is a core technique that enhances continuous flow-based language models, where the model learns to denoise generated text by conditioning on its own denoising estimate. While empirically successful, its performance improvements are poorly understood. Moreover, there is growing interest in the use of few-step generators based on flow maps, for which how to leverage self-conditioning is unclear. Here, we show that flow language models with self-conditioning solve a fixed-point iteration that bootstraps the performance of the learned denoiser. We use this viewpoint to formulate fixed-point flows, a two-dimensional class of self-conditioned flows, where the first dimension represents the flow process and the second represents the fixed-point iteration. We show that fixed-point flows define valid flow maps, and show that they can be distilled from self-conditioned flow models by compressing both fixed-point iterations and the flow process, the former with fixed-point distillation and the latter with flow map distillation. Our resulting flow map language model, FMLM^\star, outperforms state-of-the-art self-conditioned models and few-step models in one- and few-step generation on OpenWebText. Code is available at this https URL.
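[NLP-35] YOMI-Bench: A Benchmark for Evaluating Kanji Reading and Phonological Understanding of LLMs for Japanese
[Quick read]: This paper addresses the weak kanji reading and phonological understanding of large language models (LLMs) for Japanese. A single kanji character often has multiple possible readings (for example on'yomi and kun'yomi readings), so the correct reading is hard to infer from surface-level text alone. The authors propose YOMI-Bench, a benchmark of four tasks designed specifically to evaluate kanji reading performance in Japanese. They evaluate one multilingual open LLM, four Japanese-specific open LLMs, and five commercial LLMs, and find that even Japanese-specific models perform poorly and that commercial models also perform poorly on generation tasks that require consideration of kanji readings.
Link: https://arxiv.org/abs/2607.00664
Authors: Ryota Mibayashi, Hiroya Takamura, Hitomi Yanaka
Affiliations: Kobe University; National Institute of Advanced Industrial Science and Technology (AIST); The University of Tokyo; RIKEN; Tohoku University
Categories: Computation and Language (cs.CL)
Comments: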
Abstract:We propose YOMI-Bench, a benchmark for evaluating kanji reading and phonological understanding of large language models (LLMs) for Japanese. In Japanese, a single kanji character often has multiple possible readings, making it difficult to infer the correct reading from surface-level text alone. Due to these linguistic characteristics, it is empirically known that LLMs exhibit low performance in kanji reading for Japanese. The proposed YOMI-Bench consists of four tasks specifically designed to evaluate kanji reading performance in Japanese. In our evaluation using YOMI-Bench, we assessed one multilingual open LLM, four Japanese-specific open LLMs, and five commercial LLMs. As a result, we found that even Japanese-specific models show low performance, and that commercial models also perform poorly on generation tasks that require consideration of kanji readings.
[NLP-36] Faithful by Definition: Emotion Analysis via Natural Semantic Metalanguage Explications
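[Quick read]: This paper addresses the fact that explanations for emotion classifiers are usually produced post hoc, with no guarantee that they reflect the computation behind the label. It presents an explication interface for event-based emotion analysis. A parser maps the input text to an explication, a short script in the closed vocabulary of Natural Semantic Metalanguage (NSM) organized into twelve typed slots, and a fixed decision list of rules transcribed from published semantic definitions computes the label from the explication alone. The faithfulness guarantee is therefore causal and definitional, while all empirical risk sits in the learned parser, which a per-line entailment interface makes auditable against the input. On crowd-sourced event descriptions, the fine-tuned parser reaches 0.33 accuracy and 0.48 selective accuracy on a small held-out set; the authors read this as trading an insignificant accuracy difference relative to a black-box model for a verifiable, inspectable decision basis. They also release EmoExpl-1200 with per-line verification metadata and the full rule set.
Link: https://arxiv.org/abs/2607.00661
Authors: Frank Xing, Erik Cambria
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages, 8 figures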
Abstract:Explanations for emotion classifiers are usually produced post hoc, with no guarantee that they reflect the computation behind the label. We present an explication interface for event-based emotion analysis. A parser maps the input text to an explication, a short script in the closed vocabulary of Natural Semantic Metalanguage organized into twelve typed slots, and a fixed decision list of rules transcribed from published semantic definitions computes the label from the explication alone. The faithfulness guarantee is therefore causal and definitional, while all empirical risk lives in the learned parser, which the per-line entailment interface makes auditable against the input. On crowd-sourced event descriptions, our fine-tuned parser reaches 0.33 accuracy and 0.48 selective accuracy on a small held-out set, suggesting that the interface trades insignificant accuracy difference to a black-box model for a verifiable, inspectable decision basis for first-person event-based emotion analysis. We also release EmoExpl-1200 with per-line verification metadata and the full rule set.
[NLP-37] Auditing Forgetting in Limited Memory Language Models
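[Quick read]: Limited Memory Language Models (LMLMs) externalize factual knowledge to a database so that facts can be unlearned by deletion without retraining. This paper asks whether a deleted fact can still persist through residual parametric memory, alternative retrieval paths, or near-neighbor retrieval artifacts, which existing aggregate measures of post-deletion correctness cannot distinguish. The authors propose a causal auditing framework that holds the model fixed and varies the database state at inference time across three interventions (FULL, DEL-ON, and DEL-OFF), decomposing post-deletion behavior into parametric leakage L(f), retrieval-mediated correctness R(f), and a retrieval artifact rate grounded in the inference-time retrieval trace. They apply it to 12,228 alias-closure deletions across thirteen databases, including four adversarial topologies (Base, Alias, Noise, Collision), and six prompt formulations. Parametric leakage is near zero in every variant and prompt style. The residual that survives lives in the retrieval graph: retrieval-mediated correctness and the retrieval artifact rate match within rounding, so post-deletion correctness is predominantly reconstituted from near-neighbor retrieval. The residual ranges from 0.7% on the released LMLM database to 13.6% on the most adversarial variant, and prompt formulation does not independently control how much of a deleted fact survives. The authors conclude that, for this class of LMLM and deletion procedure, the unlearning boundary is drawn primarily by the database administrator rather than by the model.
Link: https://arxiv.org/abs/2607.00605
Authors: Arya Raeesi, Hanna Roed
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages, 7 figures, 6 tables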
Abstract:Limited Memory Language Models (LMLMs) externalize factual knowledge to a database to enable deletion-based unlearning without retraining. Existing evaluations measure post-deletion correctness in aggregate and cannot tell whether a deleted fact persists through residual parametric memory, alternative retrieval paths, or near-neighbor retrieval artifacts. We propose a causal auditing framework that holds the model fixed and varies the database state at inference time across three interventions: FULL, DEL-ON, and DEL-OFF. The framework decomposes post-deletion behavior into parametric leakage L(f), retrieval-mediated correctness R(f), and a retrieval artifact rate grounded in the inference-time retrieval trace. We apply it to 12,228 alias-closure deletions across thirteen databases, including four adversarial topologies (Base, Alias, Noise, Collision) we construct in three domains, and six prompt formulations. Parametric leakage is near zero in every variant and every prompt style: the model rarely returns the deleted answer in the absence of retrieval. The residual that does survive lives in the retrieval graph: retrieval-mediated correctness and the retrieval artifact rate match within rounding everywhere, so post-deletion correctness is, in our audit, predominantly reconstituted from near-neighbor retrieval. This residual ranges from 0.7% on the released LMLM database to 13.6% on the most adversarial variant, and prompt formulation does not independently control how much of a deleted fact survives. These results suggest that, for this class of LMLM and deletion procedure, the unlearning boundary is drawn primarily by the database administrator rather than by the model.
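[NLP-38] “Don’t Say It!”: Constraints, Compliance and Communication when Language Models Play Taboo
[Quick read]: This paper studies how large language models (LLMs) balance rule compliance against communicative effectiveness when generating descriptions under strict lexical constraints. The testbed is the game of Taboo, in which a target word must be described without using a set of forbidden words so that other players can guess it. The authors evaluate two open-weight models under conditions that intervene at progressively deeper levels of the generative process: prompting, generation-time constraints, and manipulation of internal representations. Outputs are assessed through forbidden-word violation detection, an LLM-as-a-judge measure of how well descriptions evoke the target concept for human and machine guessers, and a comparison of model strategies with those of human players. Compliance and communicative effectiveness trade off differently across conditions, and models remain substantially weaker than humans as guessers, suggesting that lexical grounding under constraint is an open challenge for current language models.
Link: https://arxiv.org/abs/2607.00601
Authors: Sara Candussio, Francesca Padovani, Daniel Scalena, Malvina Nissim
Affiliations: AILab, MIGe, University of Trieste; Center for Language and Cognition (CLCG), University of Groningen; University of Milano - Bicocca
Categories: Computation and Language (cs.CL)
Comments: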
Abstract:The game of Taboo requires describing a target word without using a set of forbidden words, so that other players can guess it. This deceptively simple task combines strict lexical constraints with the need for communicatively effective descriptions, making it a compelling playground for examining how LLMs navigate competing demands at inference time. We evaluate two open-weight models under conditions that intervene at progressively deeper levels of the generative process, from prompting to generation-time constraints to internal representations manipulations. We assess their outputs through forbidden word violation detection, LLM-as-a-judge measuring the degree to which generated descriptions successfully evoke the target concept for both human and machine guessers, and examining whether the strategies models adopt under constraint align with those of human players. Our results show that compliance with the rules of the game and communicative effectiveness trade off differently across conditions, and that models remain substantially weaker than humans as guessers, suggesting that lexical grounding under constraint is an open challenge for current language models.
[NLP-39] Low Perplexity is Repetition: A One-Dimensional Self-Conditioning Attractor in Continuous Diffusion LMs
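[Quick read]: This paper shows that continuous diffusion language models such as ELF repeat far more than human text, and that generative perplexity (Gen-PPL) rewards rather than penalizes this repetition, so their record-low Gen-PPL scores overstate quality. With the repetition stripped, ELF-B's Gen-PPL rises from 19.5 to 27.7, and the smallest model posts the best Gen-PPL because it repeats most. The authors trace the repetition to a contractive attractor along a single direction in the self-conditioning feedback loop, which feeds each step's clean estimate into the next. Because the failure is one-dimensional, they propose a one-dimensional fix, ACE (Attractor-Contrast-Escape), which subtracts that single, label-free direction from the feedback at each step. Estimated once on the 105M model, the direction cuts repetition to near the human level while keeping quality competitive and transfers near-unchanged to the 342M and 652M models and across samplers; the same recipe recovers useful directions on other architectures. Since Gen-PPL itself rewards repetition, the authors instead measure the compute each fix needs to produce human-clean text, where ACE is 1.5 to 5 times cheaper.
Link: https://arxiv.org/abs/2607.00588
Authors: Shuai Zhang, Zijie Chen, Hongliang He, Lun Du, Zhenzhong Lan
Affiliations: Zhejiang University; Westlake University; Ant Group
Categories: Computation and Language (cs.CL)
Comments: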
Abstract:Continuous diffusion language models such as ELF report record-low generative perplexity (Gen-PPL). We find a catch: these models repeat far more than human text, and Gen-PPL rewards rather than penalizes that repetition, so its low scores overstate quality. Strip the repetition and ELF-B’s Gen-PPL rises from 19.5 to 27.7; the smallest model even posts the best Gen-PPL because it repeats most. We trace the repetition to its source: a contractive attractor along a single direction in the self-conditioning feedback loop, the loop that feeds each step’s clean estimate into the next. Because the failure is one-dimensional, a one-dimensional fix suffices, and we propose one. ACE (Attractor-Contrast-Escape) subtracts that single, label-free direction from the feedback at each step. Estimated once on the 105M model, the direction cuts repetition to near the human level while keeping quality competitive, and transfers near-unchanged to the 342M and 652M models and across samplers; the same recipe recovers useful directions on other architectures. Since Gen-PPL itself rewards repetition, we instead measure the compute each fix needs to produce human-clean text, where ACE is 1.5-5x cheaper.
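The one-dimensional fix described in this abstract can be illustrated with a minimal sketch. It assumes that "subtracting the direction" means removing the feedback's projection onto a unit vector; the abstract does not spell out the exact operation, so this is one plausible reading rather than the authors' implementation. The function name `remove_direction`, the `strength` parameter, and the toy vectors are hypothetical.

```python
import math

def remove_direction(feedback, direction, strength=1.0):
    """Subtract the component of `feedback` that lies along `direction`.

    Assumption: projection removal stands in for the paper's subtraction
    step; `strength` is a hypothetical knob, not a parameter from the paper.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        # A zero direction carries no information, so leave the feedback as is.
        return list(feedback)
    unit = [d / norm for d in direction]
    # Scalar projection of the feedback onto the unit direction.
    coeff = sum(f * u for f, u in zip(feedback, unit))
    return [f - strength * coeff * u for f, u in zip(feedback, unit)]

# Toy example: the direction is the first axis, so only that coordinate changes.
feedback = [3.0, 4.0, 0.0]
direction = [1.0, 0.0, 0.0]
cleaned = remove_direction(feedback, direction)
print(cleaned)
```

With `strength=1.0` the result is orthogonal to the direction, so repeated application leaves it unchanged; a smaller value would only damp the component.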
[NLP-40] Safe Alone Unsafe Together: Safeguarding Against Implicit Toxicity When Benign Images Combine
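[Quick read]: This paper studies how to identify multi-image implicit toxicity (MIIT) in social media content, where each image appears benign in isolation but harmful semantics emerge when the images are interpreted jointly. MIIT is hard for existing commercial moderation APIs and models because no single image carries explicit risky cues. The authors give a formal definition of MIIT and analyze three key challenges for its detection. To alleviate data scarcity, they construct MIIT-dataset, an image-only multi-image safety dataset covering seven representative risk categories, through an automatic generation pipeline. They then train MiShield with progressively distilled reasoning supervision, so that it produces safety judgments accompanied by explicit analyses of the correlated entities that give rise to the hazard. Experiments show that MiShield-8B models outperform representative moderation services and even larger-scale models.
Link: https://arxiv.org/abs/2607.00576
Authors: Jiaxian Lv, Shiyao Cui, Yingkang Wang, Guoxin Wu, Qingling Zhang, Minlie Huang
Affiliations: The Conversational AI (CoAI) Group, DCST, Tsinghua University
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Multimedia (cs.MM)
Comments: 15 pages, 8 figures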
Abstract:Multi-image content has become an increasingly prevalent form of visual communication in social media, giving rise to a new safety issue, multi-image implicit toxicity (MIIT), where each image appears benign in isolation, but harmful semantics emerge when the images are interpreted jointly. MIIT is particularly challenging for existing commercial moderation APIs and models due to the lack of explicit risky cues in each image. This paper aims to study how to identify MIIT. We first provide a formal definition of MIIT and analyze three key challenges for its detection. To alleviate the scarcity of data in this area, we construct MIIT-dataset, an image-only multi-image safety dataset covering seven representative risk categories through an automatic generation pipeline. Finally, we train MiShield with progressively distilled reasoning supervision, enabling it to produce safety judgments accompanied by explicit analyses of the correlated entities that result in the hazards. Experiments show that MiShield-8B models outperform representative moderation services and even larger-scale models, revealing its effectiveness and practical value for this widely used visual format. Warning: This paper contains potentially sensitive content.
[NLP-41] Dual-Confidence Contrastive Decoding for Retrieval-Augmented Generation
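[Quick read]: This paper studies intra-context conflict in multi-document retrieval-augmented generation (RAG), where only some retrieved sources are relevant and the retrieved bundle may contain stale, noisy, or conflicting evidence. Existing contrastive decoding methods mainly resolve conflicts between the model's internal memory and the retrieved context; this work targets the complementary case of conflicts among the retrieved documents themselves. The authors introduce DRQA, a factual-conflict question answering benchmark derived from enterprise deep-research scenarios, whose answers are grounded in synthetic enterprise-specific facts designed not to be recoverable from the model's internal memory. They propose Dual-Confidence Contrastive Decoding (DCCD), a training-free method that combines document-level confidence (whether a document appears sufficient to answer the question) with token-level confidence (whether that document supports a confident next-token prediction). DCCD uses these signals to select positive and negative document-conditioned streams and scales a document-level contrast by their confidence margin. Across DRQA and standard multi-document QA benchmarks, DCCD achieves the best average performance among full-context and contrastive decoding baselines, with the largest gains on DRQA.
Link: https://arxiv.org/abs/2607.00570
Authors: Raymond Li, Md Tawkat Islam Khondaker, Amirhossein Abaskohi, Gabriel Murray, Giuseppe Carenini, Issam H. Laradji
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: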
Abstract:Retrieval-augmented generation (RAG) increasingly requires models to answer questions from multiple retrieved documents, where only some sources are relevant and the retrieved bundle may contain stale, noisy, or conflicting evidence. Existing contrastive decoding methods primarily focus on resolving conflicts between the model’s internal memory and the retrieved context. In contrast, we study the complementary problem of intra-context conflict in multi-document RAG. To evaluate this setting, we introduce DRQA, a factual-conflict question answering benchmark derived from enterprise deep-research scenarios, where answers are grounded in synthetic enterprise-specific facts that are designed not to be recoverable from the model’s internal memory. We further propose Dual-Confidence Contrastive Decoding (DCCD), a training-free decoding method that combines document-level confidence, which estimates whether a document appears sufficient for answering the question, with token-level confidence, which estimates whether that document supports a confident next-token prediction. DCCD selects positive and negative document-conditioned streams using these dual-confidence signals and scales a document-level contrast by their confidence margin. Across DRQA and standard multi-document QA benchmarks, DCCD achieves the best average performance among full-context and contrastive decoding baselines, with the largest gains on DRQA. These results highlight the importance of source-aware, confidence-gated decoding when retrieved evidence is internally conflicting.
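To make the decoding step in this abstract more concrete, here is a rough sketch of a confidence-gated contrast between document-conditioned logits. Several choices are assumptions that the abstract does not specify: token-level confidence is taken as the maximum next-token probability, the two confidences are combined by multiplication, and the contrast has the form `pos + alpha * margin * (pos - neg)`. The function names and the `alpha` parameter are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dual_confidence_contrast(doc_logits, doc_conf, alpha=1.0):
    """Contrast the most and least confident document-conditioned streams.

    doc_logits: one next-token logit vector per retrieved document.
    doc_conf:   one document-level confidence in [0, 1] per document.
    Assumptions: token confidence = max softmax probability, combined
    confidence = product of the two, contrast scaled by the margin.
    """
    token_conf = [max(softmax(logits)) for logits in doc_logits]
    combined = [d * t for d, t in zip(doc_conf, token_conf)]
    pos = max(range(len(combined)), key=combined.__getitem__)
    neg = min(range(len(combined)), key=combined.__getitem__)
    margin = combined[pos] - combined[neg]
    scores = [p + alpha * margin * (p - n)
              for p, n in zip(doc_logits[pos], doc_logits[neg])]
    return scores, pos, neg

# Toy example: document 0 is trusted and favours token 0,
# document 1 is distrusted and favours token 1.
doc_logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
doc_conf = [0.9, 0.2]
scores, pos, neg = dual_confidence_contrast(doc_logits, doc_conf)
print(pos, neg, scores.index(max(scores)))
```

In this toy case the contrast pushes the token favoured only by the low-confidence document below zero, while the token favoured by the high-confidence document is boosted. When all documents have equal combined confidence, the margin is zero and the scores reduce to the positive stream's logits.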
[NLP-42] A Task-State Representation for Long-Horizon Mobile GUI Agents
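[Quick read]: This paper addresses the context burden that long-horizon mobile GUI agents incur when persistent task state is entangled with transient screen observations. As execution histories grow, agents forget initial requirements, hallucinate progress, or repeatedly interact with stale interfaces. The authors introduce Task-State Representation (TSR), a training-free framework that explicitly decouples task state from sensory input. Acting as a lightweight external wrapper, TSR maintains three structured components: a global instruction summary, a dynamic progress tracker for subgoals, and a transition-aware action verifier. It is updated continuously through pre- and post-action visual comparisons and guides the agent's reasoning without architectural modifications. Experiments on four mobile GUI benchmarks show gains of up to 12 absolute points in success rate on complex cross-application and memory-intensive tasks.
Link: https://arxiv.org/abs/2607.00502
Authors: Yujie Zheng, Zikang Liu, Xin Zhao, Ji-Rong Wen
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; School of Software, Beihang University
Categories: Computation and Language (cs.CL)
Comments: Preprint. 9 pages, 3 figures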
Abstract:While long-horizon mobile GUI agents typically rely on thought-action-observation loops, they struggle to separate persistent task states from transient screen observations. As execution histories grow, this entanglement imposes a severe context burden, causing agents to forget initial requirements, hallucinate progress, or repeatedly interact with stale interfaces. To address this, we introduce Task-State Representation (TSR), a training-free framework that explicitly decouples task state from sensory input. Acting as a lightweight external wrapper, TSR maintains three structured components: a global instruction summary, a dynamic progress tracker for subgoals, and a transition-aware action verifier. By continuously updating through pre- and post-action visual comparisons, TSR effectively guides the agent’s reasoning without requiring architectural modifications. Experiments across four mobile GUI benchmarks validate TSR’s effectiveness, yielding up to a 12 absolute point increase in success rate on complex cross-application and memory-intensive tasks.
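The three components named in this abstract can be pictured as a small data structure kept outside the agent's context. The sketch below is only an illustration under strong simplifications: screens are represented as plain strings, and the "transition-aware" check is reduced to whether the screen changed after the action. The class `TaskState` and its methods are hypothetical names, not the paper's API.

```python
from dataclasses import dataclass, field

@dataclass
class TaskState:
    """Hypothetical external wrapper holding persistent task state."""
    instruction_summary: str                      # global instruction summary
    subgoals: dict = field(default_factory=dict)  # subgoal name -> completed?

    def verify_transition(self, before, after):
        # Simplifying assumption: an action counts as effective only if the
        # post-action screen differs from the pre-action screen.
        return before != after

    def update(self, subgoal, before, after):
        """Mark a subgoal as done only when the transition is verified."""
        changed = self.verify_transition(before, after)
        if changed:
            self.subgoals[subgoal] = True
        return changed

    def render(self):
        """Compact text that could be prepended to the agent's prompt."""
        done = [g for g, ok in self.subgoals.items() if ok]
        todo = [g for g, ok in self.subgoals.items() if not ok]
        return f"Task: {self.instruction_summary}\nDone: {done}\nTodo: {todo}"

state = TaskState("Send photo to Alice",
                  {"open gallery": False, "share photo": False})
first = state.update("open gallery", "home", "gallery")    # screen changed
second = state.update("share photo", "gallery", "gallery")  # stale screen
print(state.render())
```

Keeping the rendered summary short and regenerating it after every action is one way to keep the initial requirements visible without replaying the full execution history.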
[NLP-43] BaseRT: Best-in-Class LLM Inference on Apple Silicon via Native Metal
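[Quick read]: This paper addresses the overhead that existing inference runtimes incur when running large language models (LLMs) on Apple Silicon, because their abstractions were not designed for Metal's execution model or Apple Silicon's unified memory topology. It presents BaseRT, a native Metal inference runtime that uses chip-specific kernel fusion, unified memory-aware optimisation, and custom dispatch logic to recover performance that framework-based approaches leave unused. BaseRT supports a wide range of model families across eight quantisation formats (Q2 to FP16) on all Apple M-series devices. Evaluated on the Qwen3, Llama 3.2, and Gemma 4 families at Q4 and Q8 quantisation on M3 and M4 Pro devices, it achieves up to 1.56x higher decode throughput than the existing runtimes it is compared against (up to 1.35x over MLX), with substantially larger margins on prefill for mixture-of-experts models, and consistent best-in-class throughput from sub-1B to 30B parameter models. The authors argue that performance-optimised local runtimes are a critical enabling layer as privacy requirements, latency constraints, and cloud cost pressures push inference toward on-device deployment.
Link: https://arxiv.org/abs/2607.00501
Authors: Prabod Rathnayaka, Fabian Waschkowski, Lukas Wesemann
Affiliations: Base Compute, Melbourne, Australia
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments: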
Abstract:We present BaseRT, a native Metal inference runtime for large language models (LLMs) on Apple Silicon, and report the highest inference throughput on this hardware to date. Existing runtimes, including this http URL and MLX-based frameworks, incur overhead from abstractions not designed for Metal’s execution model or Apple Silicon’s unified memory topology. By building natively on Metal with chip-specific kernel fusion, unified memory-aware optimisation, and custom dispatch logic, BaseRT recovers performance that framework-based approaches leave on the table. BaseRT supports a wide range of model families across eight quantisation formats (Q2 to FP16) on all Apple M-series devices. In this paper, we evaluate the Qwen3, Llama 3.2, and Gemma 4 families at Q4 and Q8 quantisation on M3 and M4 Pro devices. BaseRT achieves up to 1.56x higher decode throughput than this http URL and up to 1.35x higher than MLX, with substantially larger margins on prefill for mixture-of-experts models, delivering consistent best-in-class throughput from sub-1B to 30B parameter models. These results establish Apple Silicon as a more capable inference platform than previously reported, with direct implications for the emerging edge inference paradigm: as privacy requirements, latency constraints, and cloud cost pressures drive inference toward on-device deployment, performance-optimised local runtimes are a critical enabling layer for this transition. BaseRT is publicly available at this https URL
[NLP-44] MindEdit-Bench: Benchmarking Object-Level Counterfactual Spatial Reasoning in VLMs from In-the-Wild Photos
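[Quick read]: Most benchmarks for vision-language models (VLMs) test observational spatial reasoning, where models describe relations already visible in the input, and existing what-if tasks typically vary the observer while keeping the scene fixed. This paper asks whether VLMs can instead predict the consequences of hypothetically moving or rotating an object. It introduces MindEdit-Bench, a benchmark of six spatial reasoning tasks built from three-photo smartphone triplets of newly captured indoor scenes via an automatic in-the-wild 3D scene-graph extraction pipeline. Four tasks probe perception and perspective transformation over observed structure; two new tasks, L4 (spatial editing) and L5 (cross-view visibility editing), probe object-level counterfactual reasoning, where correct answers are absent from all input images. Each question provides 8-24 structured answer choices, enabling answer-letter-level diagnosis of spatial and fallback errors. The benchmark covers 120 private indoor scenes not drawn from public datasets, which reduces the risk of pretraining overlap. Across 15 VLMs on 1,003 human-verified questions, task-wise mean VLM accuracy is only 8%-31%, versus 81%-97% human majority-vote accuracy; the pooled gap between humans and the best VLM is 53 pp, with at least 39 pp on every task. The structured answer space also reveals non-uniform failures, including weaker camera-depth-axis inference and fallback behavior on difficult visibility-editing cases.
Link: https://arxiv.org/abs/2607.00491
Authors: Leyuan Yu, Xiao Tang, Minghao Liu, Xinyuan Li, Xiaokai Bai, Sheng Zhou, Qunshu Lin, Weihao Xuan, Naoto Yokoya
Affiliations: ZODA; Zhejiang University; Tongji University; The University of Tokyo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 18 pages, 7 figures. Dataset available at this https URL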
Abstract:Benchmarks for vision-language models (VLMs) mostly test observational spatial reasoning: models describe relations already visible in the input. Existing what-if tasks typically vary the observer while keeping the scene fixed. Can VLMs instead predict the consequences of hypothetically moving or rotating an object? We introduce MindEdit-Bench, a benchmark of six spatial reasoning tasks built from three-photo smartphone triplets of newly captured indoor scenes via an automatic in-the-wild 3D scene-graph extraction pipeline. Four tasks probe perception and perspective transformation over observed structure; two new tasks, L4 (spatial editing) and L5 (cross-view visibility editing), probe object-level counterfactual reasoning, where correct answers are absent from all input images. Each question provides 8-24 structured answer choices, enabling answer-letter-level diagnosis of spatial and fallback errors. The benchmark covers 120 private indoor scenes not drawn from public datasets, reducing public-data pretraining-overlap risk. Across 15 VLMs on 1,003 human-verified questions, task-wise mean VLM accuracy is only 8%-31%, versus 81%-97% human majority-vote accuracy. The pooled human–best-VLM gap is 53 pp, with at least 39 pp on every task. The structured answer space further reveals non-uniform failures, including weaker camera-depth-axis inference and fallback behavior on difficult visibility-editing cases.
[NLP-45] Efficient Multilingual Reasoning Transfer via Progressive Code-Switching
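[Quick read]: This paper addresses the sharp drop in reasoning ability that large reasoning models (LRMs) show when they must reason in languages other than English. Existing transfer methods usually rely on target-language reasoning traces distilled from stronger models, or on online supervision from external judge models, both of which are costly and hard to scale. The authors propose PCS (Progressive Code-Switching), an efficient transfer framework that needs only lightweight translation, with no stronger model for distillation or judging. PCS first builds mixed-language (code-switched) reasoning traces by translating a subset of the English reasoning steps into the target language, and uses supervised fine-tuning on them to initialize the model's code-switching ability. It then applies reinforcement learning with a step-level language consistency curriculum that gradually raises the target-language ratio until the model reasons entirely in the target language. This progressive design gives the model a smooth transfer path and avoids the instability and performance loss commonly seen when target-language reasoning is enforced directly. Experiments on multiple benchmarks and five typologically diverse languages show that PCS substantially narrows the gap between target-language and English reasoning, yielding more language-consistent reasoning while keeping accuracy competitive.
Link: https://arxiv.org/abs/2607.00485
Authors: Zhijun Wang,Junxiao Liu,Hao Zhou,Hao-Ran Wei,Baosong Yang,Shujian Huang
Affiliations: Tongyi Lab; Zhejiang University
Categories: Computation and Language (cs.CL)
Notes: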
Abstract:Large reasoning models (LRMs) have achieved strong reasoning capabilities in English, yet their performance degrades significantly when required to reason in other languages. A natural solution is to transfer the model’s English reasoning ability to target languages. However, existing transfer approaches typically rely on distilled target-language reasoning traces from stronger LRMs or online supervision from external judge models, which are costly and difficult to scale. In this paper, we propose PCS (Progressive Code-Switching), a more efficient transfer framework that requires only lightweight translation without any stronger model for distillation or judging. PCS first constructs code-switched reasoning traces by translating a subset of English reasoning steps into the target language, and uses them to initialize the model’s code-switching ability via supervised fine-tuning. It then applies reinforcement learning with a step-level language consistency curriculum, progressively raising the target-language ratio until the model reasons entirely in the target language. This progressive design provides a smooth transfer path that avoids the instability and performance degradation commonly observed when directly enforcing target-language reasoning. Experiments on multiple benchmarks and five typologically diverse languages show that PCS substantially narrows the performance gap between target-language and English reasoning, yielding more language-consistent reasoning while maintaining competitive accuracy.
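The data-construction step and the rising target-language ratio described above can be sketched roughly as follows. This is only an illustration, not the authors' implementation: the function names `code_switch` and `target_ratio`, the random choice of which steps to translate, the linear schedule, and the `translate` callable are all assumptions made for the example, and the reinforcement learning stage is not shown.

```python
import random


def code_switch(steps, ratio, translate, seed=0):
    """Translate about `ratio` of the reasoning steps, keeping their order.

    `translate` is a hypothetical callable standing in for any lightweight
    translation system; which steps get translated is chosen at random here.
    """
    rng = random.Random(seed)
    k = round(ratio * len(steps))
    chosen = set(rng.sample(range(len(steps)), k))
    return [translate(s) if i in chosen else s for i, s in enumerate(steps)]


def target_ratio(stage, num_stages):
    """Assumed linear curriculum: the ratio grows from 1/num_stages to 1.0."""
    return min(1.0, (stage + 1) / num_stages)
```

For instance, with four reasoning steps and `ratio=0.5`, two steps are passed through `translate` and two stay in English; at the final curriculum stage `target_ratio` returns 1.0, so every step would be in the target language.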
[NLP-46] Know When to Stop: Segment-Level Credit Assignment for Reducing Overthinking
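[Quick read]: This paper addresses overthinking in reasoning language models: generating long, unhelpful chains of behaviors such as hedging, abandoning an approach, and self-contradiction, which consume compute without improving the answer. The core challenge is telling apart where self-reflection helps and where it hurts, since fine-grained step-level annotations are expensive to obtain. The key idea is a cheap proxy signal: by comparing each intermediate answer candidate in the reasoning trace with the ground truth, one can judge whether the subsequent reflection is productive, which enables segment-level credit assignment without extra human annotation. Building on this, the authors propose DASH (Drift Aware advantage SHaping), which adjusts the reward signal according to whether each reasoning segment moves toward or away from the correct answer. On competition-level math benchmarks, DASH achieves the highest accuracy where overthinking is prevalent (AIME25: 50.8% vs. 45.4% for GRPO), while reducing overthinking behaviors and producing more productive self-correction than the baselines.
Link: https://arxiv.org/abs/2607.00482
Authors: Chia-Hsuan Lee,Sihui Dai,Mingyang Zhou,Isha Slavin,Shi-Xiong Zhang,Sambit Sahu,William Campbell
Affiliations: Capital One
Categories: Computation and Language (cs.CL)
Notes: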
Abstract:Reasoning language models frequently overthink: generating extended chains of behaviors such as hedging, approach abandonment, and self contradiction that consume tokens without improving answers. We show that these behaviors are not merely a consequence of length; even when controlling for response length, incorrect traces exhibit higher rates of unproductive self-reflection than correct ones. Addressing this requires identifying where self-reflection helps vs hurts, but obtaining these step-level annotations is costly. We observe that intermediate answer commitments within reasoning traces can provide a cheap proxy: by comparing each final answer candidate in the trace to the ground truth, we can determine whether subsequent reflection is productive without any additional supervision. Building on this insight, we propose DASH (Drift Aware advantage SHaping), which assigns segment-level credit based on whether each reasoning segment leads toward or away from correctness. On competition-level math benchmarks, DASH achieves the highest accuracy where overthinking is prevalent (AIME25: 50.8% vs. 45.4% GRPO) while reducing overthinking behaviors and achieving more productive self-correction than baselines.
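To make the proxy signal concrete, here is a minimal sketch of segment-level credit derived from intermediate answer candidates. It is not the DASH advantage formula, which the abstract does not spell out: the function name `segment_credits`, the exact-match comparison, and the fixed `bonus` and `penalty` values are assumptions for illustration only.

```python
def segment_credits(candidates, truth, bonus=1.0, penalty=1.0):
    """Return one credit per reasoning segment.

    `candidates` lists the intermediate answers in the order they appear
    in the trace; each answer closes one segment. A segment is rewarded
    when it moves from a wrong (or no) answer to the right one, penalized
    when it drifts from the right answer to a wrong one, and left neutral
    otherwise.
    """
    credits = []
    prev_correct = False
    for answer in candidates:
        now_correct = answer == truth
        if now_correct and not prev_correct:
            credits.append(bonus)      # reflection was productive
        elif prev_correct and not now_correct:
            credits.append(-penalty)   # reflection drifted away
        else:
            credits.append(0.0)        # no change in correctness
        prev_correct = now_correct
    return credits
```

With `candidates = ["12", "15", "15", "9"]` and `truth = "15"`, the function returns `[0.0, 1.0, 0.0, -1.0]`: the second segment reaches the right answer and the last one abandons it.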
[NLP-47] StochasT: Learning with Stochastic Turn Depth for Visual Instruction Tuning ECCV2026
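[Quick read]: This paper addresses a mismatch between training and evaluation in Large Vision-Language Models (LVLMs): Visual Instruction Tuning (VIT) usually packs multiple language tasks about the same image into multi-turn conversational training, whereas existing benchmarks mostly evaluate in isolated single-turn settings. During multi-turn training the models can suffer from visual attention decay and contextual overfitting, which limits their performance at test time. The authors propose learning with Stochastic Turn Depth (StochasT), which randomly groups the language tasks for the same image into training sequences of varying depth (turn depth) while preserving the natural order of the tasks. The method borrows from stochastic depth in residual networks but does not actually drop any data, so the training data is used in full. The paper also designs a benchmark-agnostic evaluation mechanism based on the Balanced Latin Square to measure model robustness under varying degrees of contextual dependency. Extensive experiments show that StochasT gives LVLMs strong, harmonized capabilities in both single-turn and multi-turn use.
Link: https://arxiv.org/abs/2607.00465
Authors: Yuan Qing,Chengzhi Mao,Boqing Gong
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: Accepted to ECCV 2026. Project page and code: this https URL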
Abstract:Large Vision-Language Models (LVLMs) rely extensively on Visual Instruction Tuning (VIT) to elicit their multimodal reasoning capabilities. However, we find a discrepancy: VIT often packs multiple language tasks about the same image for conversational, multi-turn training, whereas existing benchmarks evaluate LVLMs in isolated, single-turn scenarios. The models can suffer from visual attention decay and contextual overfitting during multi-turn training, making it hard for them to realize their full potential in the mismatched test phase. To close the gap, we propose learning with Stochastic Turn Depth (StochasT), which stochastically groups language tasks for the same image into clusters of varying sizes (turn depth) while preserving their organic order. Hence, while StochasT draws on Dropout and stochastic depth for ResNets, it does not actually drop anything to maximize the utility of the training data. Furthermore, we introduce a challenging, benchmark-agnostic evaluation mechanism based on the Balanced Latin Square to measure LVLMs’ robustness under varying contextual dependencies. Extensive experiments demonstrate that StochasT effectively grants LVLMs strong, harmonized capabilities for both single-turn and multi-turn use cases.
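The grouping idea can be sketched in a few lines. This is one possible reading, not the paper's procedure: it assumes the clusters are contiguous chunks of the original task order and that each depth is drawn uniformly from 1 to `max_depth`; the function name `stochastic_turn_groups` and both parameters are hypothetical.

```python
import random


def stochastic_turn_groups(tasks, max_depth, seed=None):
    """Split the ordered tasks of one image into conversations of random depth.

    Every task is kept (nothing is dropped) and the original order is
    preserved; only the number of turns per training conversation varies.
    """
    rng = random.Random(seed)
    groups = []
    i = 0
    while i < len(tasks):
        depth = rng.randint(1, max_depth)  # inclusive on both ends
        groups.append(tasks[i:i + depth])
        i += depth
    return groups
```

Each returned group would then be formatted as a separate multi-turn (or single-turn, when the depth is 1) training sample for the same image.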
[NLP-48] MolSafeEval: A Benchmark for Uncovering Safety Risks in AI-Generated Molecules ACL2026
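[Quick read]: This paper addresses the fact that current evaluation of molecular generation models focuses on task complexity, novelty, and property alignment while overlooking the potential safety risks of the generated molecules. Many generative models may produce molecules that are toxic, highly reactive, or otherwise hazardous, and existing evaluations neither identify nor control these risks adequately. The authors propose the MolSafeEval benchmark. Its core innovation is a structured molecular safety knowledge graph that integrates heterogeneous safety knowledge sources, from toxicological databases to hazard rules, and serves as the basis for large language model (LLM) reasoning, enabling systematic detection and interpretable analysis of unsafe features in generated molecules. The study also groups molecular generation models into four representative task types (unconditional generation, property optimization, target protein-based design, and text-guided generation) and provides standardized datasets and safety evaluation protocols for each. By systematically exposing the safety vulnerabilities of existing generative methods, MolSafeEval offers a new perspective for evaluating molecular generation models and guidance toward safer, more trustworthy molecular design.
Link: https://arxiv.org/abs/2607.00464
Authors: Tong Xu,Xinzhe Cao,Zhihui Zhu,Keyan Ding,Huajun Chen
Affiliations: Zhejiang University; ZJU-Hangzhou Global Scientific and Technological Innovation Center; University of Oxford
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Notes: Accepted by Findings of ACL 2026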
Abstract:Current molecular generation benchmarks emphasize task complexity, molecule novelty, and property alignment; they largely overlook a critical concern: the potential safety risks of AI-generated molecules. In practice, many generative models may produce molecules with toxic, reactive, or otherwise hazardous characteristics - posing hidden dangers that remain insufficiently addressed. To address this gap, we introduce MolSafeEval, a benchmark dedicated to evaluating and analyzing the safety risks of molecular generation. Unlike prior approaches that rely on narrow toxicity predictors, MolSafeEval integrates heterogeneous safety knowledge - ranging from toxicological databases to hazard rules - into a structured molecular safety knowledge graph. This graph serves as a foundation for large language model-based reasoning, enabling systematic detection and explanation of unsafe features in generated compounds. We further categorize molecular generative models into four representative task types - unconditional generation, property optimization, target protein-based design, and text-based generation - and provide standardized datasets and safety evaluation protocols for each. By systematically revealing the safety vulnerabilities of current generative approaches, MolSafeEval offers a new lens for benchmarking molecular models and provides essential guidance toward safer, more trustworthy molecular design.
[NLP-49] Understanding Why Language Models Hallucinate: Testing Reasoning Against Priors
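[Quick read]: This paper studies hallucinations in large language model answers, in particular hallucinations that violate prompt-level constraints. The central question is whether such errors arise because the model lacks the relevant knowledge, or because it has the information but follows the wrong inference path. The authors introduce the notion of inference misalignment: a mismatch between the answer supported by the prompt and the answer favored by statistically salient latent associations. They formalize this with a latent key-task model, which shows how pretraining-frequency imbalance can let a shortcut path dominate the constraint-sensitive path and induce positive inference loss. The framework predicts two failure modes: task-retrieval bias in entity disambiguation and key-selection bias in action choice. To test the theory, the authors build TrapQA, a controlled diagnostic testbed with two components: ScientistQA tests disambiguation among similar scientists with supplementary factual probes, and Real-Life Constrained QA tests constraint following in everyday situations under salient shortcuts. The results indicate that hallucination can arise from biased latent inference rather than from missing knowledge alone.
Link: https://arxiv.org/abs/2607.00447
Authors: Yangfan Hu,Xuhan Tong,Haoyue Bai,Xi Ding,Shashank Muralidhar Bharadwaj,Siyang Cao,Robert Nowak,Jiawei Zhang
Affiliations: University of Wisconsin–Madison
Categories: Computation and Language (cs.CL)
Notes: Project page: this https URL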
Abstract:Large language models often produce hallucinated answers that violate prompt-level constraints. A key diagnostic question is whether these failures reflect missing knowledge, or whether the model has the relevant information but follows the wrong inference path. We study this phenomenon as inference misalignment: a mismatch between the answer supported by the prompt and the answer favored by statistically salient latent associations. We formalize this view with a latent key-task model, in which pretraining-frequency imbalance can cause a shortcut path to dominate the constraint-sensitive path and induce positive inference loss. The framework predicts two failure modes: task-retrieval bias in entity disambiguation and key-selection bias in action choice. We introduce TrapQA, a controlled diagnostic testbed with two components. ScientistQA tests disambiguation among similar scientists with supplementary factual probes, while Real-Life Constrained QA tests everyday constraint following under salient shortcuts. Our results show that hallucination can arise from biased latent inference rather than absent knowledge alone.
[NLP-50] Selective Test-Time Debiasing for CLIP via Reward Gating
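[Quick read]: This paper addresses the social stereotypes that Vision Language Models (VLMs) exhibit on person-centric queries: despite strong zero-shot performance, these models often propagate bias and yield skewed demographic distributions. Existing debiasing methods apply a uniform correction regardless of how bias-sensitive the input query is, which creates a fundamental fairness-utility trade-off: strong debiasing distorts semantic information in bias-insensitive queries, while weak debiasing fails to mitigate stereotypes in bias-sensitive ones. The authors propose Reward-Gated Test-Time Adaptation (RG-TTA), a reinforcement learning-based test-time adaptation framework whose core idea is to trigger debiasing selectively according to the bias sensitivity of the input. During test-time policy adaptation, RG-TTA activates fairness regularization adaptively based on each input's bias sensitivity, and for bias-insensitive inputs it focuses exclusively on optimizing cross-modal alignment, avoiding unnecessary interference. Experiments on fairness benchmarks such as FairFace and UTKFace show substantial bias reduction together with improved zero-shot performance, resolving the fairness-utility trade-off of uniform debiasing.
Link: https://arxiv.org/abs/2607.00423
Authors: Jaeho Han,Jisoo Yang,Hyeondong Woo,Mingyu Jeon,Sunjae Yoon,Junyeong Kim
Affiliations: Chung-Ang University
Categories: Computation and Language (cs.CL)
Notes: 15 pages, 7 figures, 11 tables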
Abstract:Vision language models (VLMs) demonstrate strong zero-shot performance, but often perpetuate social stereotypes in person-centric queries, yielding skewed demographic distributions. Current debiasing methods apply uniform bias corrections across all input queries regardless of their bias sensitivity, creating a fundamental fairness–utility trade-off. Strong debiasing distorts semantically meaningful information in bias-insensitive queries, while weak debiasing fails to mitigate stereotypes in bias-sensitive ones. This one-size-fits-all approach hampers simultaneously achieving high utility on bias-insensitive queries and fairness on bias-sensitive queries. We introduce Reward-Gated Test-Time Adaptation (RG-TTA), a reinforcement learning-based test-time adaptation framework that selectively applies debiasing based on input sensitivity. RG-TTA adaptively triggers fairness regularization based on the bias sensitivity of each input during test-time policy adaptation, while focusing exclusively on optimizing cross-modal alignment for bias-insensitive inputs. Experiments on fairness benchmarks (e.g., FairFace, UTKFace) demonstrate substantial bias reduction while simultaneously improving zero-shot utility, resolving the trade-off of uniform debiasing.
[NLP-51] Speech Playground: An Interactive Tool for Speech Analysis and Comparison INTERSPEECH2026
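[Quick read]: This paper addresses the difficulty of integrating existing speech analysis tools (such as Praat) with modern deep learning representations and of using them to compare multiple types of speech features. The solution is Speech Playground, an interactive speech visualization and comparison platform that combines a Python backend with a web frontend. It supports interactive exploration of several feature types, including continuous, discrete, and variable-length representations, integrates TextGrid annotation and forced alignment, and provides configurable distance metrics and alignment settings for both visual and auditory comparison. It is intended for speech research, representation validation, and experimentation oriented toward computer-aided pronunciation training (CAPT).
Link: https://arxiv.org/abs/2607.00418
Authors: Stephen McIntosh,Daisuke Saito,Nobuaki Minematsu
Affiliations: The University of Tokyo
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Notes: Accepted to Interspeech 2026 (Show and Tell); 2 pages, 3 figures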
Abstract:This paper presents Speech Playground, an interactive speech visualization and comparison tool. While existing tools such as Praat are excellent, it can be cumbersome to integrate them with modern deep learning representations and use them for comparison. Speech Playground addresses this by combining a Python backend with a web-based frontend for interactive exploration of multiple feature types, including continuous, discrete, and variable-length representations. It includes TextGrid and forced alignment support together with configurable distance and alignment settings for visual and auditory comparison. Speech Playground is intended for use in speech research, representation validation, and computer-aided pronunciation training (CAPT)-oriented experimentation.
[NLP-52] A Mechanistic View of Authority Hierarchy in LLM Sycophancy
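[Quick read]: This paper studies authority bias in language models, a critical safety concern: models systematically prioritize social cues from authority figures over factual consistency, so their answers are swayed by source credibility rather than evidence. Using a controlled medical question answering (medical QA) setting, the authors examine the mechanism behind the phenomenon. Although authority levels are never explicitly prompted, the strength of the model's response is graded by perceived authority, a hierarchy that emerges from training. Analysis of internal representations (logit lens, linear and non-linear probing) shows that the core mechanism is not a surface-level output bias but a precise erasure of knowledge in a late layer: high-authority signals actively erase the internal representation of the correct answer. The erasure scales with authority level, resists mean vector intervention, and is only partially reversible through chain-of-thought reasoning. Authority-induced sycophancy is therefore a deep, layer-localized overwriting of internal representations, a mechanistic knowledge erasure rather than a simple output tendency.
Link: https://arxiv.org/abs/2607.00415
Authors: Emil Joswin,Srujananjali Medicherla,Priyanka Mary Mammen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: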
Abstract:Authority bias poses a critical safety concern in language models: models systematically prioritize social cues from authority figures over factual consistency, swaying their answers based on source credibility rather than evidence. We mechanistically investigate this phenomenon using a controlled medical QA setting, where hints suggesting incorrect answers are attributed to personas of varying expertise. Across Llama-3.1-8B, Qwen3-8B, and Gemma-2-9B, we find that models respond in a graded manner proportional to perceived authority, a hierarchy that is never explicitly prompted but emerges from training. Logit lens analysis and linear/non-linear probing localize this effect to a critical late layer where correct answer representations are actively erased, an erasure that scales with authority level, resists mean vector intervention, and is only partially reversible through chain-of-thought reasoning. Our findings suggest that authority-induced sycophancy is not a surface-level output bias but mechanistic knowledge erasure, a precise, layer-localized overwriting of correct internal representations by high-status authority signals.
[NLP-53] When Classic Cache Policies Fail: Learning-Augmented Replacement for Semantic Retrieval Buffers
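[Quick read]: This paper addresses the lack of principled cache management for the retrieval buffers used by large language model (LLM) agents. Existing approaches mostly rely on heuristics such as least recently used (LRU) and least frequently used (LFU), which perform poorly on semantic workloads because semantic retrieval lacks temporal locality and frequency concentration. The proposed solution is SOLAR, a learning-augmented cache replacement framework with two key ideas: 1) the timing of buffer modifications is derived from regret accumulation, giving a modest modification rate of about 17%; 2) content selection is driven by Bayesian online learning over implicit retrieval feedback. The theoretical analysis shows that SOLAR has a constant competitive ratio (≤3), independent of cache size and horizon (versus Ω(K) for FIFO), and that its eviction regret is O(√KT log T), matching the Ω(√KT) lower bound up to logarithmic factors. Experiments show a 5%-75% relative improvement over FIFO at tight cache sizes, with a clear phase transition at the working set boundary. Synthetic experiments further reveal an inverted-U relationship between pool size and retrieval quality, supporting the view that capacity constraints are a retrieval noise phenomenon rather than a storage limitation.
Link: https://arxiv.org/abs/2607.00394
Authors: Yushi Sun,Bowen Cao,Wai Lam
Affiliations: Tencent; The Chinese University of Hong Kong
Categories: Databases (cs.DB); Computation and Language (cs.CL)
Notes: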
Abstract:LLM agents increasingly rely on retrieval buffers to store and reuse past experience, yet the cache management policies governing these buffers remain largely ad-hoc. We formalize this as an online semantic cache replacement problem with switching costs, where items are matched by embedding similarity and hit quality is continuous rather than binary. Through experiments on two datasets from MemoryBench-Full (LoCoMo, DialSim) with 8 replacement policies, we reveal a surprising finding: classic heuristics (LRU, LFU) consistently underperform the naive FIFO baseline on semantic workloads, due to the absence of temporal locality and frequency concentration. We propose SOLAR, a learning-augmented framework that derives modification timing from regret accumulation (achieving ~17% modification rate) and content selection from Bayesian online learning over implicit retrieval feedback. We prove SOLAR achieves a constant competitive ratio ≤ 3, independent of cache size and horizon (vs. Ω(K) for FIFO), and eviction regret O(√(KT log T)), matching the Ω(√(KT)) lower bound up to logarithmic factors. Experiments demonstrate 5–75% relative improvement over FIFO at tight cache sizes, with a clearly characterized phase transition at the working set boundary. Synthetic experiments with 5000-item pools further reveal an inverted-U relationship between pool size and retrieval quality, justifying capacity constraints as a retrieval noise phenomenon rather than a storage limitation.
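The problem setting, where a lookup returns a continuous hit quality instead of a binary hit or miss, can be illustrated with the naive FIFO baseline. This sketch does not implement SOLAR; the class name `FifoSemanticBuffer`, the use of cosine similarity as the hit quality, and the tuple-based embeddings are assumptions chosen to keep the example self-contained.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class FifoSemanticBuffer:
    """Fixed-capacity buffer that evicts the oldest item first."""

    def __init__(self, capacity):
        # deque with maxlen drops the oldest entry when a new one is appended
        self.items = deque(maxlen=capacity)

    def add(self, embedding):
        self.items.append(tuple(embedding))

    def hit_quality(self, query):
        """Best similarity to any stored item; 0.0 for an empty buffer."""
        return max((cosine(query, e) for e in self.items), default=0.0)
```

A replacement policy in this setting would be compared by the average `hit_quality` it sustains over a query stream, rather than by a hit rate.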
[NLP-54] Beyond Perplexity: A Behavioral Evaluation Framework for Deployment-Memory Claims in LLM Test-Time Training
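[Quick read]: This paper addresses a key problem in how test-time training (TTT) for large language models is evaluated: current evaluations rely mainly on local proxy metrics such as perplexity, future-token loss, and long-context performance. These metrics are well suited to claims about stream adaptation, domain adaptation, or context compression, but they are weak evidence for higher-level behavioral claims such as deployed assistant memory, personalization, or sparse post-deployment learning. Those capabilities require behavioral evidence: later recall, paraphrase robustness, retention, locality, conflict handling, and use in downstream actions after the original support context has been removed. The authors propose a behavioral evaluation framework that calibrates TTT memory claims to their evidence through two components: a claim-calibrated evidence ladder that separates stream/domain adaptation, bridge internalization, and deployment-time behavioral learning; and an evaluation protocol with matched explicit-memory baselines and mutually exclusive failure categories. By auditing recent related work and building a controlled diagnostic in a sparse nonce-fact setting, they find that one-step LoRA updates substantially lower support and answer loss while generated free-form recall stays at zero, exposing a clear gap between proxy improvement and actual deployment behavior. The framework gives authors and evaluators a concrete standard for keeping TTT memory claims consistent with the evidence actually reported.
Link: https://arxiv.org/abs/2607.00368
Authors: Xiangchen Song,Zhenhao Chen,Lingjing Kong,Shaoan Xie,Xinshuai Dong,Guangyi Chen,Kun Zhang
Affiliations: Carnegie Mellon University; MBZUAI (Mohamed bin Zayed University of Artificial Intelligence)
Categories: Computation and Language (cs.CL)
Notes: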
Abstract:Large language model test-time training (TTT) is often evaluated through local proxy metrics: models are updated on recent tokens, retrieved context, target-domain data, or verifiable task attempts, and then judged by perplexity, future-token loss, long-context performance, or reward. These metrics are well matched to claims about stream adaptation, domain adaptation, context compression, and reward-backed test-time improvement. They are weaker evidence, however, for a capability that TTT results are increasingly used to motivate: deployed assistant memory, personalization, or sparse post-deployment learning, which instead requires behavioral evidence such as later recall, paraphrase robustness, retention, locality, conflict handling, and use in downstream actions after the original support context is removed. We introduce a behavioral evaluation framework that calibrates TTT memory claims to the evidence that supports them. It has two components: a claim-calibrated evidence ladder that separates stream/domain adaptation, bridge internalization, and deployment-time behavioral learning; and an evaluation protocol with matched explicit-memory baselines and mutually exclusive failure categories. We validate the framework by auditing recent TTT and memory-adjacent work and by instantiating it as a controlled diagnostic in which, in a sparse nonce-fact setting, one-step LoRA updates lower support and answer loss across three Qwen3 model scales while generated free-form recall stays at zero, exposing a measurable gap between proxy improvement and deployment behavior. The framework gives authors and evaluators a concrete standard for aligning TTT memory claims with the evidence actually reported.
[NLP-55] DiscoLoop: Looping Discrete Embeddings and Continuous Hidden States for Multi-hop Reasoning
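[Quick read]: This paper addresses the bottleneck that large language models face when they must complete multi-hop reasoning within a single forward pass, focusing on two-hop reasoning, where the model has to compose several pieces of parametric knowledge. Standard non-recurrent Transformers suffer from a depth-local storage problem: facts learned in earlier layers are unavailable where the second-hop retrieval happens in later layers. Looped Transformers mitigate this by reusing the same memory, but still generalize imperfectly. The study finds that the remaining bottleneck is representational alignment: even when the first loop makes the bridge entity nearly perfectly decodable, the corresponding hidden state stays poorly aligned with the bridge token embedding. Surprisingly, a simple training-free realignment intervention nearly closes the generalization gap. Building on this insight, the authors propose DiscoLoop, a looping architecture whose recurrence carries both a discrete embedding channel and a continuous hidden-state channel. DiscoLoop reaches near-perfect accuracy on symbolic and synthetic-language multi-hop reasoning tasks with substantially fewer training steps; in real-world pretraining it also attains lower training loss and stronger benchmark performance than looped-transformer baselines, suggesting that the mixed-channel design transfers to practical language modeling.
Link: https://arxiv.org/abs/2607.00341
Authors: Hengyu Fu,Tianyu Guo,Zixuan Wang,Hanlin Zhu,Jason D. Lee,Jiantao Jiao,Stuart Russell,Song Mei
Affiliations: University of California, Berkeley
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 16 pages, 7 figures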
Abstract:Large language models achieve strong performance on many reasoning tasks when allowed to externalize intermediate steps as Chain-of-Thought (CoT). However, many questions require the model to internalize the multi-step reasoning within a single forward pass before generating the answer. We study this challenge through two-hop reasoning, a representative task where the model must compose multiple pieces of parametric knowledge within a single forward pass. Standard non-recurrent Transformers suffer from a depth-local storage problem: facts learned in earlier layers are unavailable where second-hop retrieval happens. We found that Looped Transformers mitigate this issue by reusing the same memory, but still generalize imperfectly. We show that the remaining bottleneck is representational. In the two-hop reasoning task, the first loop often makes the correct bridge entity nearly perfectly decodable, yet the corresponding hidden state remains poorly aligned with the bridge token embedding. Surprisingly, an easy training-free realignment intervention nearly closes the generalization gap. Building upon this insight, we propose DiscoLoop, a looping architecture whose recurrence carries both a discrete embedding channel and a continuous hidden-state channel. DiscoLoop achieves near-perfect accuracy with substantially fewer training steps across symbolic and synthetic-language multi-hop reasoning tasks. When applied to real-world pretraining, DiscoLoop attains lower training loss and stronger benchmark performance than looped-transformer baselines, suggesting that the mixed-channel design transfers to practical language modeling.
[NLP-56] TRACE: State-Aware Query Processing over Temporal Evidence Graphs for Conversational Data
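[Quick read]: This paper addresses the challenge of modeling user state from long-running conversational data. As a conversation evolves, the user's plans and preferences change, and later messages often supersede or contradict earlier ones. Existing long-memory pipelines treat memories as independent text or vector objects, so the retrieved evidence is semantically similar but stale and gives little support for reasoning about the current state. The proposed solution is TRACE, a query processing framework over temporal evidence graphs. It models conversations as a hierarchical graph spanning events, sessions, and topics, enriched with typed temporal, causal, update, and contradiction relations. Validity annotations keep obsolete facts available for historical queries while discounting them for current-state answers. At query time, TRACE combines vector-based note retrieval with graph-guided evidence search to produce validity-aware support paths and a hybrid context for answer generation. This separates lexical recall from evidence reconstruction and enables bounded query-time reasoning over long conversational histories. Experiments show improved temporal and multi-hop reasoning, and ablations confirm the importance of hierarchy, update-aware seeding, and path-grounded evidence.
Link: https://arxiv.org/abs/2607.00339
Authors: Maolin Wang,Yu Wang,Zichun Liu,Baiyuan Qiu,Chenbin Zhang,Jiguang Shen,Haoran Yang,Hao Miao
Affiliations: University of Science and Technology of China; Tsinghua University; Peking University; Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL)
Notes: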
Abstract:Conversational data is increasingly used as a persistent source of user state for long-running assistants and AI agents. However, querying this data remains challenging because conversations naturally evolve: plans are revised, preferences change, and later messages frequently supersede or contradict earlier information. Existing long-memory pipelines largely treat memories as independent text or vector objects. This approach often retrieves semantically similar but stale evidence, offering limited support for state-aware reasoning. To address this problem, we present TRACE, a query processing framework over temporal evidence graphs for evolving conversational data. TRACE models conversations as a hierarchical graph spanning events, sessions, and topics, enriched with typed temporal, causal, update, and contradiction relations. Crucially, the framework maintains validity annotations so obsolete facts remain accessible for historical queries but are discounted for current-state answers. At query time, TRACE combines vector-based note retrieval with graph-guided evidence search, generating validity-aware support paths and a hybrid context for answer generation. This design separates lexical recall from evidence reconstruction, enabling bounded query-time reasoning over long conversational histories. Experiments on long-conversation query-answering (QA) benchmarks show that TRACE improves temporal and multi-hop reasoning, with ablations highlighting the importance of hierarchy, update-aware seeding, and path-grounded evidence.
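The role of validity annotations can be shown with a much simplified sketch: a flat list of facts instead of a hierarchical graph. It is not the TRACE data model; the `Fact` fields, the integer timestamps, and the helper names `add_update` and `query` are hypothetical, and here an obsolete fact is simply skipped for current-state queries, whereas the abstract says such facts are discounted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fact:
    key: str
    value: str
    valid_from: int
    valid_to: Optional[int] = None  # None means the fact is still valid


def add_update(facts, key, value, t):
    """Record a new value at time t and close the validity of the old one."""
    for fact in facts:
        if fact.key == key and fact.valid_to is None:
            fact.valid_to = t  # superseded, but kept for historical queries
    facts.append(Fact(key, value, t))


def query(facts, key, at=None):
    """Return the current value, or the value valid at time `at` if given."""
    for fact in facts:
        if fact.key != key:
            continue
        if at is None:
            if fact.valid_to is None:
                return fact.value
        elif fact.valid_from <= at and (fact.valid_to is None or at < fact.valid_to):
            return fact.value
    return None
```

If a user says "I live in Paris" at time 1 and "I moved to Rome" at time 5, a current-state query returns "Rome", while a historical query with `at=3` still returns "Paris".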
[NLP-57] Watermarking for Proprietary Dataset Protection ICML2026
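[Quick read]: This paper addresses training data membership inference for generative language models, a problem widely regarded as fundamentally hard in modern language modeling settings. The key idea is to use watermarking: prior work has shown that language models retain a residual watermark signal ("radioactivity") when trained on partially watermarked data, which turns the otherwise intractable membership inference problem into a more tractable watermark detection task. The study compares watermark-based dataset inference head-to-head with traditional loss-based membership inference and shows that, when subset exposure is high enough and under an alternate set of assumptions, watermarking achieves comparable membership detection performance, highlighting its potential as a tool for making training data membership inference more tractable.
Link: https://arxiv.org/abs/2607.00325
Authors: John Kirchenbauer,Brian R. Bartoldson,Bhavya Kailkhura,Tom Goldstein
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Notes: 8 pages and 6 figures in the main body; presented at the ICML 2026 Workshop on Trustworthy AI for Good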
Abstract:A growing body of literature suggests that training data membership inference problems are fundamentally hard tasks in modern language modeling settings. We argue that output watermarking techniques are the right gadget to make training membership tests for generative models more tractable, based on prior results showing that language models exhibit residual watermark “radioactivity” under partially watermarked training datasets. We pit a watermark-based dataset inference approach head-to-head against traditional loss-based membership inference methods and show that watermarking can achieve comparable membership detection performance when subset exposure is high enough, under an alternate set of assumptions.
[NLP-58] Mapping the Evaluation Frontier: An Empirical Survey of the Bias-Reliability Tradeoff Across Eleven Evaluator-Agent Conditions
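[Quick read]: This paper studies a bias-reliability tradeoff in large language model (LLM) evaluation systems: at a fixed sample size N, evaluator coupling (γ), strategy diversity (H), and small-sample measurement reliability (CV(N)) cannot all be optimal at once. The contribution is to expand the empirical base to 11 evaluation conditions and to measure the tradeoff among γ, H, and CV(N=5) systematically. When evaluator coupling is low (γ < 0.2), measurement noise is high (CV(N=5) > 1.0), whereas strong coupling (γ > 0.9) yields low noise (CV(N=5) < 0.16). γ and H are very strongly negatively correlated (r = −0.989), indicating that evaluator coupling suppresses strategy diversity. No condition falls in the region γ < 0.2 and CV(N=5) < 0.3, which further supports the tradeoff. The study also finds an anomalous pattern of γ = 0.000 and H = 1.000 in the GPT-4o conditions, which it attributes to version drift in the June 2026 GPT-4o API, and releases all per-condition metrics as a standardized benchmark dataset for evaluator comparison.
Link: https://arxiv.org/abs/2607.00304
Authors: Zewen Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: 5 pages, 1 figure, 1 table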
Abstract:The bias-reliability tradeoff conjectures that LLM evaluation systems are constrained in (gamma, H, CV) space, where evaluator coupling (gamma), strategy diversity (H), and small-sample measurement reliability (CV(N)) cannot be simultaneously optimized at fixed sample size N. Prior evidence rests on n=5 conditions with complete metrics from a single study. We expand the empirical base to 11 conditions, measuring gamma and H for all 11 (nine with valid weight vectors) and CV(N=5) for seven with sufficient seeds (N = 5). Five conditions provide the complete (gamma, H, CV) triple. The data confirm the trade-off: conditions with low evaluator coupling (gamma < 0.2) exhibit high measurement noise (CV(N=5) > 1.0), while conditions with strong coupling (gamma > 0.9) achieve low noise (CV(N=5) < 0.16). The correlation r(H, gamma) = -0.989 (n=5, excluding GPT-4o conditions) confirms that evaluator coupling suppresses strategy diversity. Four GPT-4o conditions show gamma=0.000 and H=1.000 across all seeds – a pattern we attribute to version drift in the June 2026 GPT-4o API. No condition occupies the region gamma < 0.2, CV(N=5) < 0.3. We release all per-condition metrics as a standardized benchmark dataset for evaluator comparison.
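Two of the three quantities can be illustrated with standard statistics. The abstract does not define them, so this sketch makes explicit assumptions: CV is taken to be the coefficient of variation (sample standard deviation divided by the mean) over seeds, and H is taken to be Shannon entropy of the strategy weights normalized to [0, 1]. The coupling measure gamma is not sketched because its definition is not given. Function names are hypothetical.

```python
import math
import statistics


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the absolute mean (assumed CV)."""
    mean = statistics.mean(samples)
    return statistics.stdev(samples) / abs(mean)


def normalized_entropy(weights):
    """Shannon entropy of the weights, scaled so a uniform vector gives 1.0.

    This is an assumed reading of the diversity measure H.
    """
    if len(weights) < 2:
        return 0.0
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(weights))
```

Under this reading, a uniform weight vector gives H = 1.0 and a vector concentrated on a single strategy gives H = 0.0, while a CV above 1.0 means the spread across seeds exceeds the mean itself.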
[NLP-59] EPC: A Standardized Protocol for Measuring Evaluator Preference Dynamics in LLM Agent Systems
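[Quick read]: This paper addresses Evaluator Preference Coupling (EPC) in large language model (LLM) agents: when agents adapt in closed loops using evaluator feedback, the evaluator's biases propagate through, and become entrenched in, the agent's strategy distribution. Prior work has documented coupling across evaluator families and model versions, but there is no unified, reproducible protocol that lets third-party researchers reproduce coupling measurements, compare results across evaluators and time points, and detect measurement decay as proprietary evaluators update. The paper specifies the EPC protocol, a detailed RFC (Request for Comments)-style specification of the four-phase isolation paradigm, covering executor and evaluator configuration, strategy and task design, the TTRL update rule, metric computation (gamma, JSD, ECE, Brier), and output schema. It is accompanied by a versioned Reference Snapshot v1.0 containing coupling measurements for 8 evaluator conditions, 122 unique experimental repetitions in total across models including GPT-4o, Qwen, and DeepSeek, annotated with evaluator version identifiers, API endpoints, and measurement dates. The snapshot is explicitly time-bound: all values depend on specific model versions and are expected to decay as proprietary evaluators update. The paper also defines a versioning convention (vX.Y-Z) and provides a usage guide covering adoption, interpretation, and known pitfalls. The protocol, reference snapshot, and implementation code are released as open infrastructure.
Link: https://arxiv.org/abs/2607.00297
Authors: Zewen Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Notes: 10 pages, 3 tables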
Abstract:When LLM agents use evaluator feedback to adapt their behavior in closed loops, evaluator biases propagate through the agent’s strategy distribution – a phenomenon known as evaluator preference coupling. Prior work has documented coupling across multiple evaluator families and model versions, but the field lacks a standardized protocol that enables third-party researchers to (i) reproduce coupling measurements, (ii) compare results across evaluators and time points, and (iii) detect measurement decay as proprietary evaluators silently update. This paper provides the protocol. We specify EPC (Evaluator Preference Coupling) – a detailed, RFC-style protocol specification for the four-phase isolation paradigm, covering executor and evaluator configuration, strategy and task design, the TTRL update rule, metric computation (gamma, JSD, ECE, Brier), and output schema. We accompany the protocol with a versioned Reference Snapshot v1.0: coupling measurements for eight evaluator conditions (N=122 unique experimental repetitions across GPT-4o, Qwen, DeepSeek, and others) derived from five independent studies, annotated with evaluator version identifiers, API endpoints, and measurement dates. The snapshot is explicitly time-bound: all values are conditional on specific model versions and are expected to decay as proprietary evaluators update. We define a versioning convention (vX.Y-Z, encoding protocol version, snapshot version, and evaluator generation) and provide a usage guide covering adoption, interpretation, and known pitfalls. The protocol, reference snapshot, and implementation code are released as open infrastructure.
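Two of the listed metrics have standard textbook definitions that can be sketched directly. The code below uses those general definitions (binary Brier score, Jensen-Shannon divergence in bits); whether the EPC protocol uses exactly these variants, for example the logarithm base or a multi-class Brier score, is an assumption. The function names are hypothetical.

```python
import math


def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions, in bits.

    Ranges from 0.0 (identical) to 1.0 (disjoint support).
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # terms with zero probability in x contribute nothing
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

For example, `brier_score([0.5, 0.5], [1, 0])` is 0.25, and `js_divergence([1, 0], [0, 1])` is 1.0, the maximum for two strategy distributions with no overlap.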
[NLP-60] Rosetta: Composable Native Multimodal Pretraining
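[Quick Read]: This paper addresses catastrophic forgetting when multimodal foundation models keep absorbing new modalities: adding continuous generative objectives (such as image generation) alongside discrete understanding tasks (such as language and visual understanding) should not destroy existing knowledge. Existing architectures, including standard Mixture-of-Experts (MoE) and the structurally partitioned Mixture-of-Transformers (MoT), suffer representation overwriting caused by gradient conflicts and cannot expand non-destructively. The key to the solution is Momentum-Anchored Orthogonal Projection (MAOP), which uses the optimizer's momentum state as an implicit semantic anchor to selectively neutralize conflicting gradient components from new modalities while preserving synergistic updates. By keeping core foundational knowledge in global shared experts and distributing modality-specific capabilities across plug-and-play experts, the Rosetta framework makes modality expansion composable and non-destructive. Experiments indicate that, unlike standard MoE and MoT, Rosetta preserves established language and visual understanding while delivering superior image generation and unlocking cross-modal synergy.
Link: https://arxiv.org/abs/2607.00293
Authors: Xiangyue Liu,Zijian Zhang,Miles Yang,Zhao Zhong,Liefeng Bo,Ping Tan
Affiliations: Tencent Hunyuan; HKUST
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: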
Abstract:Achieving true artificial general intelligence requires foundation models capable of integrating new modalities without forgetting prior knowledge. However, accommodating continuous generative objectives alongside discrete understanding tasks causes severe gradient conflicts. Existing architectures, including standard Mixture-of-Experts (MoE), are highly susceptible to representation overwriting. Even structurally partitioned paradigms like Mixture-of-Transformers (MoT) remain vulnerable to catastrophic forgetting, severely impeding multimodal scalability. In this work, we introduce Rosetta, a composable native multimodal pretraining framework designed for seamless and non-destructive modality expansion. Rosetta adopts a modular paradigm where core foundational knowledge is preserved within global shared experts, while modality-specific capabilities are distributed across plug-and-play experts. To guarantee non-destructive composition, we propose Momentum-Anchored Orthogonal Projection (MAOP). MAOP leverages the optimizer’s momentum state as an implicit semantic anchor, selectively neutralizing conflicting gradient components from new modalities while preserving synergistic updates. Extensive evaluations demonstrate that, while standard MoE and MoT architectures suffer catastrophic forgetting of previously acquired knowledge, Rosetta robustly preserves established language and visual understanding. Furthermore, it delivers superior image generation and unlocks cross-modal synergy, paving the way for truly composable and unified multimodal foundation models. To facilitate further multimodal research, we release our code and checkpoints to the community. Project page at this https URL.
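The abstract does not give the MAOP update rule, so the following is only a guess at the general idea of "neutralizing conflicting gradient components": a plain orthogonal projection that removes the part of a new gradient pointing against the momentum direction. The function name and the conflict test (negative dot product) are assumptions, not the authors' formulation.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_out_conflict(grad, momentum):
    """Remove the component of grad that opposes momentum (assumed conflict rule).

    If grad and momentum agree (non-negative dot product), grad is returned
    unchanged, so synergistic updates pass through untouched.
    """
    g_dot_m = dot(grad, momentum)
    m_sq = dot(momentum, momentum)
    if g_dot_m >= 0 or m_sq == 0:
        return list(grad)
    scale = g_dot_m / m_sq
    return [g - scale * m for g, m in zip(grad, momentum)]


momentum = [1.0, 0.0]        # hypothetical anchor from earlier training
conflicting = [-1.0, 1.0]    # new-modality gradient pointing partly against it
print(project_out_conflict(conflicting, momentum))
```

After projection the gradient is orthogonal to the momentum vector, which is the sense in which the update no longer pushes against the anchored direction.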
[NLP-61] An LLM-Based Framework for Intent-Driven Network Topology Design
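[Quick Read]: This paper tackles the automatic design of deployable, resilient network topologies from natural-language requirements, where the core difficulty is ensuring that generated topologies are structurally valid and satisfy all constraints. The key to the solution is a constraint-driven pipeline that combines hierarchical modeling with systematic validation. The framework is evaluated through a multi-model comparison of proprietary and open-weight Large Language Models (LLMs) on four realistic network scenarios, measuring structural correctness with node and edge F1-scores against reference topologies and resilience with server and content connectivity metrics. The study also analyzes common failure modes, such as interface mismatches and directional inconsistencies, providing a quantitative benchmark and a basis for model selection in AI-driven network design.
Link: https://arxiv.org/abs/2607.00292
Authors: Kholoud El-Habbouli,Fen Zhou,Stephane Huet
Affiliations: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: submitted to IEEE CNSM 2026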
Abstract:Designing deployable and resilient network topologies from natural language requirements remains a challenging problem in network automation. This work investigates the ability of Large Language Models (LLMs) to generate structurally valid and constraint-compliant network topologies through a constraint-driven pipeline combining hierarchical modeling and systematic validation. The framework is evaluated via a multimodel comparison of proprietary and open-weight LLMs across four realistic network scenarios released as a public dataset. We assess structural correctness using node and edge F1-scores against reference topologies, and evaluate resilience through server and content connectivity metrics. In addition, we analyze common failure modes, including interface mismatches and directional inconsistencies in generated topologies. Overall, this work provides a systematic benchmark for understanding how LLMs handle structural and resilience constraints in topology synthesis, and supports informed model selection for AI-driven network design.
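To make the structural-correctness metric concrete, here is a minimal sketch of node and edge F1 against a reference topology. The node names and the choice to compare edges as undirected pairs are assumptions for illustration; the paper may score directed links, especially since it reports directional inconsistencies as a failure mode.

```python
def f1(predicted, reference):
    """F1 between a predicted set and a reference set."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def undirected(edges):
    """Treat (a, b) and (b, a) as the same link (an assumption)."""
    return {frozenset(edge) for edge in edges}


# Hypothetical reference and generated topologies.
ref_nodes = {"r1", "sw1", "srv1", "srv2"}
gen_nodes = {"r1", "sw1", "srv1", "fw1"}
ref_edges = [("r1", "sw1"), ("sw1", "srv1"), ("sw1", "srv2")]
gen_edges = [("sw1", "r1"), ("sw1", "srv1"), ("r1", "fw1")]

node_f1 = f1(gen_nodes, ref_nodes)
edge_f1 = f1(undirected(gen_edges), undirected(ref_edges))
print(node_f1, edge_f1)
```

In this toy case three of four nodes and two of three links match, giving a node F1 of 0.75 and an edge F1 of about 0.67.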
[NLP-62] Testing Frontier Large Language Models' Physics Literacy in Parallel Physical Worlds
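[Quick Read]: This paper addresses a limitation of current large language model (LLM) physics benchmarks: scoring by answer accuracy alone cannot distinguish genuine reasoning from recall of familiar problem patterns, and it reveals little about where a model's reasoning breaks down. The authors propose an auditable four-stage diagnostic that evaluates whether an LLM can reason inside an unfamiliar physics framework through induction, formulation, prediction, and review. Its key design elements are locked pre-registrations, fresh sessions between stages, dual-LLM judging, and a human-audit pathway. The diagnostic is applied to three parallel physics worlds: a single-equation counterfactual world (F=mv), a historical framework (Aristotelian mechanics), and a four-domain counterfactual world (Decay World). Across Claude Opus 4.7, GPT-5.5, and Gemini 3.1 Pro, the three worlds yield composite PASS rates of 6/15, 6/15, and 0/15 respectively. The most pointed empirical pattern is a qualitative-versus-quantitative asymmetry: in Decay World, models almost never predict the wrong direction of change but frequently compute the wrong ratio by slipping back to standard-physics relations. The study also finds that LLM-judge reliability does not transfer across frameworks and that Stage 4 self-review is weak in every framework: in at least two-thirds of the trials that contained an earlier error, the model's own review reported none. The full prompts, responses, verdicts, and audit records are released.
Link: https://arxiv.org/abs/2607.00276
Authors: Dong Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 37 pages, 2 figures, 9 tables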
Abstract:Current large-language-model (LLM) physics benchmarks are usually scored by answer accuracy, which cannot distinguish genuine reasoning from recall of familiar problem patterns and reveals little about where a model’s reasoning breaks down. We introduce an auditable four-stage diagnostic that evaluates whether an LLM can reason inside an unfamiliar physics framework through induction, formulation, prediction, and review. The diagnostic combines locked pre-registrations, fresh sessions between stages, dual-LLM judging, and a human-audit pathway, and we apply it to three parallel physics worlds: a single-equation counterfactual world ($F=mv$), a historical framework (Aristotelian mechanics), and a four-domain counterfactual world (Decay World). Across Claude Opus 4.7, GPT-5.5, and Gemini 3.1 Pro, the three worlds yield composite PASS rates of 6/15, 6/15, and 0/15 respectively (content $\land$ structural for $F=mv$ and Aristotelian, content axis only for Decay World where the structural axis is out of scope). The most pointed empirical pattern is a qualitative-versus-quantitative asymmetry: in Decay World, models almost never predict the wrong direction of change, but frequently compute the wrong ratio by slipping back to standard-physics relations. The protocol also surfaces two methodology findings: LLM-judge reliability does not transfer across frameworks, and Stage 4 self-review is weak in every framework, with the model’s own review wrongly reporting no earlier error in at least two-thirds of the trials that actually contained one. We release the full prompts, responses, verdicts, and audit records.
[NLP-63] SEFORA: Student Essays with Feedback Corpus and LLM Feedback Evaluation Framework EMNLP2026
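[Quick Read]: This paper addresses two obstacles to generating high-quality writing feedback at scale: the lack of public corpora that capture instructor feedback in real classrooms, and the lack of a reliable method for measuring whether generated feedback aligns with what an instructor would write. It contributes two tools. SEFORA is a public corpus of 564 drafts from multi-draft student writing across college genres, paired with instructor inline feedback, assignment prompts, rubrics, and scores. UniMatch is a reference-based evaluation framework for open-ended generation that segments feedback into feedback units, scores their semantic correspondence under instructor-derived criteria, and aligns them via optimal matching to yield interpretable precision, recall, and F1. Across a range of large language model (LLM) configurations, the best F1 does not exceed 0.4, showing that current models struggle to identify the feedback points instructors would prioritize, and performance degrades as models generate more.
Link: https://arxiv.org/abs/2607.00274
Authors: Shayan Peyghambari Oskoui,Norah Almousa,Zhaoyi Joey Hou,Carolina Gustafson,Gayle Rogers,Raquel Coelho,Diane Litman,Xiang Lorraine Li
Affiliations: University of Pittsburgh
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under review for EMNLP 2026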
Abstract:Effective writing feedback is among the strongest drivers of student learning, yet producing it at scale is labor-intensive. LLMs offer a natural path to scaling writing support, but two gaps stand in the way: few public corpora capture how instructors actually deliver feedback in real classrooms, and no reliable method measures whether generated feedback aligns with what an instructor would write. We address both. SEFORA is a public corpus pairing instructor inline feedback with assignment prompts, rubrics, scores, and multi-draft revisions across various college writing genres, comprising 564 drafts and 8,240 instructor annotations. UniMatch is a reference-based evaluation framework for open-ended generation: it segments feedback into feedback units, scores their semantic correspondence under instructor-derived criteria, and aligns them via optimal matching to yield interpretable precision, recall, and F1. Across 74 experimental configurations spanning multiple LLMs, no setting exceeds 0.4 F1. UniMatch reveals that models struggle to identify the feedback instructors would prioritize, and performance degrades as models generate more.
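The matching step described in the abstract can be illustrated with a toy version: given a similarity matrix between generated and instructor feedback units, find the best one-to-one alignment and count aligned pairs above a threshold. This is a sketch, not UniMatch itself. The similarity scores, the 0.5 threshold, and the brute-force search (only practical for a handful of units) are all assumptions.

```python
from itertools import permutations


def optimal_pairs(sim):
    """Brute-force one-to-one alignment maximizing total similarity.

    sim[i][j] is the similarity between generated unit i and reference unit j.
    """
    n_gen, n_ref = len(sim), len(sim[0])
    if n_gen <= n_ref:
        best = max(
            permutations(range(n_ref), n_gen),
            key=lambda cols: sum(sim[i][j] for i, j in enumerate(cols)),
        )
        return list(enumerate(best))
    best = max(
        permutations(range(n_gen), n_ref),
        key=lambda rows: sum(sim[i][j] for j, i in enumerate(rows)),
    )
    return [(i, j) for j, i in enumerate(best)]


def precision_recall_f1(sim, threshold=0.5):
    """Count aligned pairs at or above the threshold as matches."""
    hits = sum(1 for i, j in optimal_pairs(sim) if sim[i][j] >= threshold)
    precision = hits / len(sim)
    recall = hits / len(sim[0])
    f1 = 0.0 if hits == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical scores: 3 generated units vs. 2 instructor units.
sim = [
    [0.9, 0.2],
    [0.1, 0.3],
    [0.4, 0.8],
]
print(precision_recall_f1(sim))
```

In this toy setup the unmatched second generated unit lowers precision to 2/3 while recall stays at 1, so extra generated units that match nothing in the reference pull the score down.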
[NLP-64] LV-ROVER: Multi-Stream Tesseract Voting for Maltese Paragraph OCR ALT
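[Quick Read]: This paper addresses the low-resource status of Maltese for optical character recognition (OCR): although the language has decent text corpora and pretrained language models, the only known real labelled PDF corpus for OCR training has 57 pages, far below what paragraph-level training needs. The authors therefore build a synthetic training-data pipeline and a 5-stream Tesseract LV-ROVER ensemble. Ensemble recognition alone lowers the character error rate (CER) from the baseline's 0.0234 to 0.01317 (a 44 percent reduction), and a five-stage post-processing chain brings the full pipeline to 0.00700 (a 70 percent reduction overall). Most of that chain is typographic normalisation, but one stage recovers misread diacritics and is reported separately as a recognition gain rather than as format correction. The authors treat the 44 percent figure as the portable estimate of what the recogniser learned and the 70 percent figure as specific to this benchmark's label convention.
Link: https://arxiv.org/abs/2607.00250
Authors: Adam Darmanin
Affiliations: Independent Researcher; Hecatus Research
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 1 figure, 3 tables. System paper for the DocEng 2026 Maltese Paragraph OCR Competition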
Abstract:Maltese has decent text corpora and pretrained language models, but, like many languages outside the handful with large OCR benchmarks, it has only a single known real labelled PDF corpus for OCR training, at 57 pages far below what paragraph-level training needs: it is low-resource for OCR specifically. With no real corpus to train on at scale, we built a synthetic training pipeline and a 5-stream Tesseract LV-ROVER ensemble, and report results on a 422-paragraph benchmark against a fine-tuned-Tesseract baseline of character error rate (CER) 0.0234. Ensemble recognition alone improves CER by 44 percent, to 0.01317; a five-stage post-processing chain brings the full pipeline to CER 0.00700, a 70 percent reduction. Most of that chain is typographic normalisation, but one stage recovers misread diacritics rather than aligning punctuation, so we report it as a recognition gain rather than folding the whole chain under one label. We treat the 44 percent figure as the portable estimate of what the recogniser learned, and the 70 percent figure as specific to this benchmark’s label convention.
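The percentages in the abstract follow directly from the reported CER values, and a short sketch shows both the usual CER definition (edit distance divided by reference length) and the relative-reduction arithmetic. The sample word and helper names are illustrative assumptions. The paper's exact scoring script is not described in the abstract.

```python
def levenshtein(a, b):
    """Character-level edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(hypothesis, reference):
    """Character error rate: edit distance over reference length."""
    return levenshtein(hypothesis, reference) / len(reference)


def relative_reduction(baseline, new):
    return (baseline - new) / baseline


# A Maltese word with three diacritic letters, read without any of them.
reference = "g\u0127a\u017ci\u017c"
print(cer("ghaziz", reference))

# Reported figures from the abstract: baseline, ensemble, full pipeline.
print(round(relative_reduction(0.0234, 0.01317), 2))
print(round(relative_reduction(0.0234, 0.00700), 2))
```

The diacritic example shows why the authors single out diacritic recovery: three plain-letter substitutions in a six-character word already give a CER of 0.5.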
[NLP-65] SLIM-RL: Risk-Budgeted Random-Masking RL for Diffusion LLMs Without Trajectory Slicing
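[Quick Read]: This paper addresses inefficient reinforcement-learning training of diffusion large language models (dLLMs) caused by the mismatch between random masking and the model's inference trajectory. The current state of the art, TraceRL, reconstructs the trajectory by slicing each rollout into up to K/s trajectory-aligned training samples, a cost that grows with the block size K. SLIM-RL avoids trajectory reconstruction: a tau-budget decoder bounds the commit risk of each rollout step, reducing aggregate commit risk in the training data. During optimization, SLIM-RL uses a trace-free random-masking objective that combines sequence-level importance sampling with deterministic quadrature over masking levels under a mean-preserving, monotonically decreasing per-block mask schedule for variance reduction. On SDAR-4B at block size 16, SLIM-RL matches TraceRL's best MATH500 accuracy with only 0.46x the training samples, and improves over TraceRL by 6.32% on MATH500 and 11.05% on GSM8K under matched dynamic sampling. At block size 4, the 4B SLIM-RL surpasses the larger LLaDA-8B and Dream-7B on math while staying below the autoregressive Qwen2.5-7B, and on code it improves over TraceRL by 4.20% on MBPP and 3.65% on HumanEval. The tau-budget decoder also transfers training-free across LLaDA, Dream, and SDAR.
Link: https://arxiv.org/abs/2607.00208
Authors: Ruikang Zhao,Zhenting Wang,Han Gao,Ligong Han
Affiliations: Technical University of Denmark; MBZUAI Institute of Foundation Models; Iowa State University; Red Hat AI Innovation; MIT–IBM Watson AI Lab
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages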
Abstract:Reinforcement learning for diffusion large language models (dLLMs) has largely moved to trajectory-aware methods. The current state of the art, TraceRL, holds that random masking is mismatched with the model’s inference trajectory, and it reconstructs that trajectory during training by slicing each rollout into up to K/s trajectory-aligned training samples, a cost that grows with the block size K. We show that this mismatch can be mitigated without reconstructing the trajectory. Our method, SLIM-RL, bounds the commit risk of each rollout step with a tau-budget decoder, reducing aggregate commit risk in the training data. During optimization, SLIM-RL trains on these risk-controlled rollouts with a trace-free random-masking objective that adapts variance-reduction tools, combining sequence-level importance sampling with deterministic quadrature over masking levels under a mean-preserving, monotonically decreasing per-block mask schedule that we introduce. On SDAR-4B, SLIM-RL matches TraceRL’s best MATH500 accuracy on only 0.46x its training samples at block size 16, improving over TraceRL by 6.32% on MATH500 and 11.05% on GSM8K under matched dynamic sampling. At block size 4, the 4B SLIM-RL surpasses the larger LLaDA-8B and Dream-7B dLLMs on math, exceeding LLaDA-8B by 10.76% on MATH500 while staying below the autoregressive Qwen2.5-7B. On code, it improves over TraceRL by 4.20% on MBPP and 3.65% on HumanEval. The tau-budget decoder transfers training-free across LLaDA, Dream, and SDAR. The source code is available at this https URL.
[NLP-66] Structural Pattern Mining in Inka Khipus: Unsupervised Clustering, Provenance Classification, and a Computational Validation of the Santa Valley Match
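[Quick Read]: This paper addresses the undeciphered khipu, the knotted-cord recording system of the Inka Empire (c. 1400-1532 CE), asking what structural regularities and provenance signals can be recovered from the corpus. It builds a reproducible machine-learning pipeline on the Open Khipu Repository (OKR), which contains 619 khipus, 54,403 cords, and 110,677 knots. With 27 structural features engineered per khipu, the study applies (i) unsupervised clustering via UMAP and HDBSCAN, recovering three structurally distinct groups (silhouette = 0.769); (ii) supervised provenance classification via gradient boosting, reaching F1 = 0.86 for the Inka Late Horizon imperial style; and (iii) SHAP-based interpretability, which identifies cord twist direction as the dominant structural discriminator of imperial khipus. One cluster is dominated not by a geographic region but by nineteenth-century European museum collections, indicating that colonial acquisition and recording practices are structurally encoded in the data. Using only the public OKR database, the study also independently verifies the recto/verso (moiety) structure of the six Santa Valley khipus reported by Medrano and Urton (2018), reproducing both the aggregate attachment ratio and the identification of the single mixed specimen. It further reports a negative result: knot-type sequence order, encoded as n-grams, adds no provenance signal beyond aggregate features. All code and data are openly available.
Link: https://arxiv.org/abs/2607.00185
Authors: Maria Contreras
Affiliations: Universidad Peruana de Ciencias Aplicadas (UPC), Lima, Peru
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 4 figures, 2 tables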
Abstract:Khipus–knotted cord devices–were the primary recording medium of the Inka Empire (c. 1400-1532 CE), yet their system remains undeciphered. We present a reproducible machine-learning pipeline applied to the Open Khipu Repository (OKR), a public database of 619 khipus comprising 54,403 cords and 110,677 knots. We engineer 27 structural features per khipu and apply (i) unsupervised clustering via UMAP and HDBSCAN, recovering three structurally distinct groups (silhouette = 0.769); (ii) supervised provenance classification via gradient boosting, reaching F1 = 0.86 for the Inka Late Horizon imperial style; and (iii) SHAP-based interpretability, which identifies cord twist direction as the dominant structural discriminator of imperial khipus. We further report two findings of methodological interest. First, one cluster is dominated not by a geographic region but by nineteenth-century European museum collections, indicating that colonial acquisition and recording practices are structurally encoded in the corpus. Second, we provide an independent computational verification of the recto/verso (moiety) structure of the six Santa Valley khipus reported by Medrano and Urton (2018), reproducing both the aggregate attachment ratio and the identification of the single mixed specimen–using only the public OKR database, without physical access to the objects. We additionally report a negative result: knot-type sequence order, encoded as n-grams, adds no provenance signal beyond aggregate features. All code and data are openly available.
[NLP-67] ALEE: Any-Language Evaluation of Embeddings via English-Centric Minimal Pairs
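[Quick Read]: This paper addresses shortcomings of existing text-embedding benchmarks, which are static, cover a limited set of languages, are often domain-specific, are susceptible to overfitting, and poorly represent low-resource languages. It proposes ALEE, a framework that extends Sentence Smith to the cross-lingual and paragraph level: Abstract Meaning Representations (AMR) are used to generate English minimal pairs with controlled, fine-grained semantic shifts, which are then paired with parallel translations in target languages. This enables targeted diagnostics for embedding models in any language with English parallel data. A large-scale study across 275+ languages and three parallel datasets shows that performance varies substantially across languages, text lengths, and linguistic phenomena, with gaps that track language prevalence in training resources and subword tokenization.
Link: https://arxiv.org/abs/2607.00171
Authors: Andrianos Michail,Stylianos Psychias,Michelle Wastl,Simon Clematide,Rico Sennrich,Juri Opitz
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: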
Abstract:Text embeddings are standard for semantic similarity tasks, yet their evaluation remains an open challenge. Current benchmarks are static, cover only a limited set of languages, are often domain-specific, susceptible to overfitting, and poorly representative of low-resource languages. To address these limitations, we introduce ALEE, a framework that extends Sentence Smith (Li et al., 2025) to the cross-lingual and paragraph level. ALEE uses Abstract Meaning Representations (AMR) to generate English minimal pairs with controlled, fine-grained semantic shifts, which are paired with translations in target languages. This approach enables targeted diagnostics for models in any language with English parallel data. We conduct a large-scale empirical study across a diverse set of embedding models and 275+ languages spanning three parallel datasets. On ALEE, performance varies substantially across languages, text lengths, and linguistic phenomena, exposing persistent gaps in cross-lingual semantic representation that track language prevalence in training resources and subword tokenization. We release ALEE at this https URL
[NLP-68] Readable but Not Controllable: Neuron-Level Evidence for Medical LLM Hallucination
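[Quick Read]: This paper studies hallucination in medical large language models (LLMs), asking whether the internal representations associated with hallucination can be used for control rather than detection alone. A simple, carefully conditioned probe detects hallucination with AUROC between 0.77 and 0.86 across several medical question-answering datasets. The signal is distributed and redundant: random subsets of a few hundred neurons recover nearly the full signal, and low-dimensional random projections preserve most of the detection performance. Yet the signal is hard to act on causally: across 16 model–dataset combinations there is a sharp gap between decodability and neuron-level controllability. Hallucination is readily visible in internal activations but is not reliably corrected by steering the neurons most associated with it, which suggests that mitigation is not simply a matter of identifying the right neurons and points to a deeper separation between what representations reveal and what they allow us to change.
Link: https://arxiv.org/abs/2607.00158
Authors: Vijay Vankadaru,Asha Matthews,Tanya Roosta,Peyman Passban
Affiliations: University of California, Berkeley
Categories: Computation and Language (cs.CL)
Comments: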
Abstract:Hallucination remains one of the central obstacles to deploying medical LLMs. Yet, even when hallucination can be detected, it is still unclear whether the internal representations associated with it can be used for control rather than detection alone. Using four open-source models across a suite of medical question-answering datasets, we show that a simple, carefully conditioned probe can reliably detect hallucination, with AUROC scores between 0.77 and 0.86 in our case. We further show that this signal is distributed and redundant rather than narrowly localized. Systematically selected neurons outperform random neurons only at very small subset sizes, whereas random subsets of a few hundred neurons recover nearly the full signal, and low-dimensional random projections preserve most of the detection performance. Beyond detection, we test whether this representation is causally actionable. Across 16 model–dataset combinations, our results reveal a sharp gap between decodability and controllability. The same internal structure that makes hallucination easy to detect does not translate into reliable neuron-level control. These findings show that medical hallucination seems to be readily visible in internal activations, but not easily corrected by steering the neurons most associated with it. More broadly, our results suggest that hallucination mitigation is not simply a matter of identifying the right neurons, and point to a deeper separation between what representations reveal and what they allow us to change.
[NLP-69] GRPO, Dr. GRPO, and DAPO Are Three Operations on One Number: The Group-Standard-Deviation Identity
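[Quick Read]: This paper asks how reasoning training for language models actually uses its feedback signal. Three popular methods, Group Relative Policy Optimization (GRPO), GRPO Done Right (Dr. GRPO), and Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO), look like independent fixes but all adjust a single number: the standard deviation of the marks given to a prompt's sampled answers. That standard deviation measures how much the answers disagree: it is largest when answers split evenly between right and wrong and zero when they all agree. GRPO divides by it, Dr. GRPO drops the division, and DAPO discards the groups where it is zero. The paper proves that the three are settings of one dial and states the group-standard-deviation identity: for right-or-wrong rewards, the disagreement is exactly the size of the training update. A split group teaches the most, a unanimous group teaches nothing, and the same result indicates which problems deserve the most weight and how many tries each one needs. The findings are confirmed on a large real difficulty dataset (Big-Math) and in a controlled training run.
Link: https://arxiv.org/abs/2607.00152
Authors: Yong Yi Bay,Kathleen A. Yearick
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 18 pages, 10 figures, 4 tables. Code and data: this https URL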
Abstract:Three of the most popular methods for training language models to reason look like three different tricks. They are not. All three adjust a single number: standard deviation, reflecting how much a prompt’s sampled answers disagree. When such a model is trained, it answers each problem many times, and an automatic checker marks every answer right or wrong. The standard deviation of those marks measures the disagreement: largest when the answers split evenly between right and wrong, and zero when they all agree. Group Relative Policy Optimization (GRPO) divides by this number, GRPO Done Right (Dr. GRPO) drops the division, and Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) discards the groups where it is zero. Each is presented as its own fix, yet this paper proves they are three settings of one dial. That dial is not cosmetic: for right-or-wrong rewards, the disagreement is exactly the size of the training update, the group-standard-deviation identity. A split group teaches the most, while a unanimous group teaches nothing and falls silent. The same result says which problems deserve the most weight and how many tries each one needs. This paper confirms the intuition on a large real difficulty dataset (Big-Math) and in a controlled training run. What looks like a harmless normalization step is the dial that decides where learning happens and how strongly.
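The three operations the abstract describes can be written down in a few lines for 0/1 rewards. This sketch covers only the advantage computation as the abstract summarizes it (divide by the group standard deviation, drop the division, discard zero-deviation groups). The published methods also differ in clipping and loss normalization, which are omitted here, and the function names and `eps` constant are assumptions.

```python
import math


def group_stats(rewards):
    """Mean and population standard deviation of one prompt's marks."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return mean, std


def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style: center, then divide by the group standard deviation."""
    mean, std = group_stats(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Dr. GRPO-style: center only, no division."""
    mean, _ = group_stats(rewards)
    return [r - mean for r in rewards]


def dapo_keeps_group(rewards):
    """DAPO-style dynamic sampling: drop groups whose marks all agree."""
    _, std = group_stats(rewards)
    return std > 0


# For 0/1 marks with a fraction p correct, the deviation is sqrt(p * (1 - p)).
for marks in ([1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 1]):
    p = sum(marks) / len(marks)
    print(marks, group_stats(marks)[1], math.sqrt(p * (1 - p)), dapo_keeps_group(marks))
```

The evenly split group has the largest deviation (0.5), the unanimous group has zero and is the one DAPO-style filtering would discard.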
[NLP-70] Hate Speech Detection in Turkish and Arabic Languages: A Comprehensive Study
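[Quick Read]: This paper addresses the difficulty of identifying and moderating online hate speech across languages and topics, particularly hate speech that targets groups based on religion, race, ethnicity, culture, nationality, or migration status, where platforms must balance freedom of expression against effective content moderation. The key contribution is a comprehensive hate speech dataset covering five topics in Turkish (refugees, the Israel-Palestine conflict, anti-Greek sentiment in Turkey, ethnic or religious communities such as Alevis, Armenians, Arabs, Jews, and Kurds, and LGBTI+) and one topic in Arabic (refugees). On top of it, the authors develop BERT-based models for hate category classification, hate intensity prediction, target identification, and hate speech span detection, enabling fine-grained analysis of hateful content in online discourse.
Link: https://arxiv.org/abs/2607.00143
Authors: Somaiyeh Dehghan,Gökçe Uludoğan,Mehmet Umut Şen,Elif Erol,Arzucan Özgür,Berrin Yanikoglu
Affiliations: Sabanci University; Bogazici University; Hrant Dink Foundation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 Tables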
Abstract:Online hate speech has been linked to a global rise in violence against minorities, including incidents such as mass shootings, lynchings, and ethnic cleansing. Societies grappling with this issue, particularly when hate speech targets specific groups based on religion, race, ethnicity, culture, nationality, or migration status, face the challenge of balancing freedom of expression with the need for effective content moderation on widely used online platforms. In response to this challenge, we introduce a comprehensive hate speech dataset covering five distinct topics in Turkish: refugees, the Israel-Palestine conflict, anti-Greek sentiment in Turkey, ethnic or religious communities (Alevis, Armenians, Arabs, Jews, and Kurds), and LGBTI+, alongside one topic in Arabic (refugees). In addition, we develop state-of-the-art BERT-based models to address multiple dimensions of hate speech analysis, including hate category classification, hate intensity prediction, target identification, and hate speech span detection, enabling a comprehensive understanding of hateful content in online discourse.
[NLP-71] CogTax: A Four-Level Cognitive Taxonomy for Command-Line Computing Education
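[Quick Read]: This paper addresses a gap that appears as computing education expands into operational domains such as systems administration and command-line environments: existing pedagogical frameworks do not capture the real-world consequences of learner actions. Cognitive taxonomies such as Bloom's Revised Taxonomy classify objectives by cognitive complexity but ignore system impact, even though conceptually simple commands can have severe consequences. The paper proposes CogTax, a four-level cognitive taxonomy that integrates cognitive complexity with operational impact; its levels range from safe read-only inspection to advanced system management requiring integration of multiple abstract models, and the level is defined as the maximum of the two dimensions so that both conceptual understanding and operational awareness are addressed. The framework gives instructors a principled basis for sequencing course material and calibrating assessment difficulty, and gives students an explicit reference for self-assessment and gap identification. To make level assignment automatic and scalable, the authors train a classifier that combines syntactic representations derived from abstract syntax trees (ASTs) with semantic embeddings; on 585 expert-annotated Linux/bash commands it reaches 89% accuracy, outperforming either representation alone, and it demonstrates cross-language extensibility through structural equivalences across command languages.
Link: https://arxiv.org/abs/2607.00140
Authors: Manuel Alonso-Carracedo(1 and 2),Ruben Fernandez-Boullon(1 and 2),Pedro Celard(1 and 2),Francisco J. Rodriguez-Martinez(1 and 2),Lorena Otero-Cerdeira(1 and 2) ((1) Universidade de Vigo, Spain, (2) IFCAE, Universidade de Vigo, Spain)
Affiliations: Universidade de Vigo, Spain; IFCAE, Universidade de Vigo, Spain
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 35 pages, 9 figures, 4 tables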
Abstract:As computing education expands beyond traditional programming into operational domains such as systems administration and command-line environments, existing pedagogical frameworks struggle to capture a dimension that is critical in these contexts: the real-world consequences of learner actions. Existing cognitive taxonomies classify learning objectives by mental operations but do not account for system impact, leaving a critical gap in command-line education where conceptually simple commands can have severe consequences. This work presents CogTax, a four-level cognitive taxonomy that integrates two dimensions: cognitive complexity, derived from Bloom’s Revised Taxonomy, and operational impact, which distinguishes observational, reversible, structural, and administrative operations. The four progressive levels range from safe read-only inspection to advanced system management requiring integration of multiple abstract models. Then, the taxonomy level is defined as the maximum of these dimensions, ensuring that both conceptual understanding and operational awareness are addressed. CogTax gives instructors a principled framework for sequencing course material and calibrating assessment difficulty, and gives students an explicit reference for self-assessment and gap identification. To demonstrate that taxonomy levels are automatically assignable, making the framework scalable without manual expert annotation, a classifier that combines syntactic representations derived from abstract syntax trees with semantic embeddings is trained. Evaluated on 585 expert-annotated Linux/bash commands, this combined approach achieves 89% accuracy, outperforming either representation alone, and demonstrates cross-language extensibility through structural equivalences across command languages.
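The "maximum of these dimensions" rule is simple enough to sketch. The four impact categories come from the abstract; mapping them to 1-4, rating cognitive complexity on the same 1-4 scale, and the example command ratings are assumptions made only to show how the rule behaves.

```python
from enum import IntEnum


class Impact(IntEnum):
    """Operational impact categories named in the abstract (numeric order assumed)."""
    OBSERVATIONAL = 1
    REVERSIBLE = 2
    STRUCTURAL = 3
    ADMINISTRATIVE = 4


def cogtax_level(cognitive, impact):
    """Taxonomy level as the maximum of the two dimensions.

    `cognitive` is a hypothetical 1-4 complexity rating.
    """
    return max(cognitive, int(impact))


# Hypothetical ratings, for illustration only.
examples = {
    "ls -l": (1, Impact.OBSERVATIONAL),
    "rm -rf build/": (1, Impact.STRUCTURAL),
}
for command, (cognitive, impact) in examples.items():
    print(command, cogtax_level(cognitive, impact))
```

Taking the maximum means a conceptually trivial but destructive command is never placed at the lowest level, which is the gap the abstract says existing taxonomies leave open.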
[NLP-72] Benchmarking Frontier LLMs on Arabic Cultural and Sociolinguistic Knowledge: A Cross-Evaluation Framework with Human SME Ground Truth
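[Quick Read]: This paper addresses the cost of human expert evaluation as a bottleneck for deploying language models in specialized, high-stakes domains, here Arabic sociolinguistic knowledge. It builds a cross-evaluation framework for two dialect communities, Egyptian and Iraqi Arabic, with 103 prompt-rubric pairs (53 Cultural, 50 Linguistic) authored and graded by native-speaker subject-matter experts (SMEs) using penalty-weighted rubrics that distinguish positive content requirements from answer-specific negative error criteria. Three frontier LLMs serve as target models, graded by human SMEs across 302 unique prompt-response pairs, and five frontier LLMs serve as automated judges. A dual-metric scheme combining Mean Absolute Deviation (MAD) with Signed Mean Error separates directional grading bias from symmetric noise. Four of five judges show systematic leniency (+2.01% to +6.56%), and Cultural tasks are harder to grade than Linguistic tasks for all judges. The central finding is that implicit cultural reasoning, which requires simulating native-speaker judgment rather than relying on lexical verification, is the primary failure mode for automated grading across all judge models.
Link: https://arxiv.org/abs/2607.00139
Authors: Sajjad Abdoli,Ghassan Al-Sumaidaee,Ahmad ElShiekh,Clayton W. Taylor,Ahmed Rashad
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: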
Abstract:The cost of human expert evaluation is a principal bottleneck to deploying language models in specialized, high-stakes domains. This is particularly acute for Arabic sociolinguistic knowledge: credible grading requires not only linguistic fluency but deep cultural familiarity that cannot be approximated by surface-level metrics. We address this with a cross-evaluation framework instantiated on two underrepresented Arabic dialect communities: Egyptian and Iraqi Arabic. We contribute 103 validated prompt-rubric pairs (70 Egyptian, 33 Iraqi; 53 Cultural, 50 Linguistic), authored and graded by native-speaker SMEs using penalty-weighted rubrics distinguishing positive content requirements from answer-specific negative error criteria. Three frontier LLMs serve as target models (graded by human SMEs across 302 unique prompt-response pairs), while five frontier LLMs serve as automated judges enforcing a provider-level self-evaluation guard. A dual-metric scheme combining Mean Absolute Deviation (MAD) with Signed Mean Error separates directional grading bias from symmetric noise. Across 1,307 judge evaluations: GPT-5.4 is the most reliable judge (MADj = 10.21 pp, Signed Error = -1.12%); four of five judges show systematic leniency (+2.01% to +6.56%); Cultural tasks are harder to grade than Linguistic tasks for all judges (MAD gap 1.83-4.78 pp); and models substantially outperform on Egyptian prompts compared to Iraqi prompts. However, given leniency differences between Iraqi and Egyptian SMEs, we cannot solely attribute this gap to model knowledge. We therefore emphasize findings that do not assume identical leniency across human graders. Across all samples, implicit cultural reasoning – requiring models to simulate native-speaker judgment rather than rely on lexical verification – emerges as the primary failure mode for automated grading across all judge models.
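A toy comparison shows why the abstract pairs MAD with a signed error: two judges can be equally far from the human grades on average while only one of them is biased in a direction. The scores below are made up, and the functions are the generic definitions rather than the paper's exact MADj computation.

```python
def mean_absolute_deviation(judge, human):
    """Average absolute gap between judge and human scores."""
    return sum(abs(j - h) for j, h in zip(judge, human)) / len(judge)


def signed_mean_error(judge, human):
    """Average signed gap; positive values mean the judge is lenient."""
    return sum(j - h for j, h in zip(judge, human)) / len(judge)


# Hypothetical scores in percentage points.
human = [60, 80, 40, 70]
lenient_judge = [70, 90, 50, 80]   # always 10 points too high
noisy_judge = [70, 70, 50, 60]     # off by 10 points in both directions

for name, judge in (("lenient", lenient_judge), ("noisy", noisy_judge)):
    print(name, mean_absolute_deviation(judge, human), signed_mean_error(judge, human))
```

Both judges have a MAD of 10, but only the lenient one has a signed error of +10, so the noisy judge's errors cancel to 0.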
[NLP-73] Harnessing the Latent Space: From Steering Vectors to Model Calibrators for Control and Trust ACL2026
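[Quick Read]: This paper addresses the difficulty of understanding the internal representations of large language models (LLMs) as their scale grows, and the resulting need to keep model behavior controllable and outputs trustworthy in medium- and high-stakes settings where users increasingly rely on them. It discusses two contributions: steering vectors, which manipulate a language model's latent space to control its outputs, and latent space-based model calibrators, which help determine when to trust model outputs. Together, these contributions help demystify the latent spaces of language models and offer insights into using model internals to build more trustworthy language technology.
Link: https://arxiv.org/abs/2607.00083
Authors: Nishant Subramani
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ACL 2026 (BigPicture Workshop)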
Abstract:Language models have changed from unreliable text generators to highly-capable large models with trillions of parameters. Capability increases come hand-in-hand with increases in scale, making understanding the internal representations of models more challenging. Since millions of users increasingly rely on language models to interact with external tools or make decisions in medium or high-stakes scenarios, we need to establish control over model behavior and know when to trust model outputs. In this paper, we discuss our contributions on harnessing the latent spaces by proposing steering vectors for control and developing latent space-based model calibrators for trust. Together, our contributions help demystify the latent spaces of language models and offer new insights into how to harness model internals to build more trustworthy language technology.
[NLP-74] Destination-Labeled Self-Looping Systems with Dwell: Intrinsic Characterization, Realization Cost, and Recognition
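[Quick Read]: This paper studies how to design finite-state symbolic controllers for systems in which the admissible visible transitions are fixed in advance and each visible state carries a minimum dwell requirement. The core difficulty is that, once dwell is imposed, the current visible state no longer determines whether a departure is allowed. The paper introduces destination-labeled self-looping systems with dwell (DLSL systems), which record the visible graph together with local decision maps; dwell memory appears only after phase expansion. The main result is that the deterministic transducers realizable as phase-expanded DLSL systems over a fixed visible graph are exactly the fiber-linear graph-respecting transducers. Under natural reachability and realizable-departure assumptions, equivalent accessible realizations over the same visible graph are isomorphic, so the visible transduction determines the dwell vector and the local decision maps. The paper also proves that any graph-preserving deterministic realization enforcing dwell values $(d_i)$ requires exactly $\sum_i d_i$ control states, gives an $O(|Q||\Omega|)$ recognition and reconstruction procedure, and extends the analysis to an edge-entry variant in which transitions may enter interior phases of successor fibers.
Link: https://arxiv.org/abs/2607.00044
Authors: Reda Belaiche
Affiliations: Université Paris-Est Créteil
Categories: Formal Languages and Automata Theory (cs.FL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
Comments: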
Abstract:We study a finite-state symbolic controller for systems in which the admissible visible transitions are fixed in advance and each visible state carries a minimum dwell requirement. The resulting model, which we call a destination-labeled self-looping system with dwell (DLSL system), records the visible graph together with local decision maps; dwell memory appears only after phase expansion. The main structural issue is that, once dwell is imposed, the current visible state no longer determines whether a departure is allowed. This leads to the converse problem: which deterministic transducers arise as phase-expanded realizations of DLSL systems over a fixed visible graph? We show that the answer is exactly the class of fiber-linear graph-respecting transducers. Under natural reachability and realizable-departure assumptions, equivalent accessible realizations over the same visible graph are isomorphic; in particular, the visible transduction determines the dwell vector and the local decision maps. We also prove that any graph-preserving deterministic realization enforcing dwell values $(d_i)$ requires at least $\sum_i d_i$ control states. Finally, we give an $O(|Q||\Omega|)$ recognition and reconstruction procedure, and extend the analysis to an edge-entry variant in which transitions may enter interior phases of successor fibers.
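The phase-expansion idea in this abstract can be illustrated with a minimal Python sketch. It rests on assumptions that are not taken from the paper: each visible state `v` with dwell `d_v` is expanded into a linear chain of `d_v` phases, the local decision map is consulted only at the last phase, and a departure always enters the successor at phase 0 (so this is not the edge-entry variant). The names `phase_expand`, `step`, and `decide` are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

State = Tuple[str, int]  # (visible state, phase inside its chain)


def phase_expand(dwell: Dict[str, int]) -> List[State]:
    """List the control states: one chain of d_v phases per visible state v."""
    return [(v, p) for v, d in dwell.items() for p in range(d)]


def step(state: State, symbol: str, dwell: Dict[str, int],
         edges: Dict[str, Set[str]],
         decide: Callable[[str, str], str]) -> State:
    """Advance the phase-expanded controller by one input symbol."""
    v, p = state
    if p < dwell[v] - 1:
        return (v, p + 1)        # dwell not yet met: forced self-loop on v
    target = decide(v, symbol)   # hypothetical local decision map
    if target == v:
        return state             # stay on v; its dwell is already satisfied
    if target not in edges[v]:
        raise ValueError(f"{v} -> {target} is not a visible edge")
    return (target, 0)           # assumption: enter the successor at phase 0


# Toy example: two visible states with dwell 2 and 3.
dwell = {"A": 2, "B": 3}
edges = {"A": {"B"}, "B": {"A"}}


def toggle(v: str, symbol: str) -> str:
    """Leave for the other state on 'go', otherwise stay."""
    if symbol != "go":
        return v
    return "B" if v == "A" else "A"


state = ("A", 0)
visible_trace = [state[0]]
for _ in range(5):
    state = step(state, "go", dwell, edges, toggle)
    visible_trace.append(state[0])
```

In this toy run the controller has 2 + 3 = 5 control states and the visible trace is A, A, B, B, B, A even though every input requests a departure, which shows why the visible state alone cannot decide whether leaving is allowed.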
[NLP-75] Controllable Narrative Rendering for Enhanced Assisted Writing
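[Quick read]: This paper addresses a core tension in using large language models (LLMs) for creative writing, caused by an inherent binary failure: during editing, models oscillate between safe but shallow fixes (remedial polishing) and destructive, uncontrolled plot expansion, producing a seemingly irreconcilable trade-off between narrative fidelity and descriptive intensity. The key to the solution is Loom, a framework grounded in the narratological distinction between story and discourse. Loom uses a three-layer pipeline that applies an intent-centered semiotic chain-of-thought to control narrative intent and rendering density precisely. The architecture separates the generation of perceptual material from syntactic insertion, so that enhancement does not break the original event structure and descriptive intensity rises while factual integrity is preserved. Empirical evaluation shows that Loom outperforms state-of-the-art baselines on both automatic metrics and human evaluation, easing the tension between narrative fidelity and expressiveness.
Link: https://arxiv.org/abs/2607.00009
Authors: Mingzhe Lu,Yanbing Liu,Jiayue Wu,Jiarui Zhang,Qihao Wang,Yue Hu,Yunpeng Li,Yangyan Xu
Affiliations: 1. Tsinghua University; 2. Institute for AI Industry Research, Tsinghua University; 3. Alibaba Cloud
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: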
Abstract:Despite the remarkable proficiency of large language models (LLMs) in basic writing assistance, their utility in creative writing is fundamentally hindered by a persistent binary failure. This issue manifests as an oscillation between safe, surface-level editing, referred to as remedial polishing, and destructive, uncontrolled plot expansion. This dilemma defines a critical trade-off between narrative fidelity and descriptive intensity. We propose Loom, an assisted writing framework grounded in the narratological distinction between story and discourse. Loom employs a three-layer pipeline that operationalizes an intent-centered semiotic chain-of-thought to enforce precise control over narrative intent and rendering density. This architecture separates the generation of perceptual material from syntactic insertion, ensuring that enhancement occurs without violating the original event structure. Our comprehensive evaluation, which includes LLM-based metrics and human assessment, demonstrates that Loom successfully resolves this fundamental tension. Loom achieves the highest overall quality score, yielding substantial gains in factual integrity and descriptive intensity compared to state-of-the-art baselines.
[NLP-76] Persona Without Substrate: Regime-Dependence and the LLM Individuation Problem
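[Quick read]: This paper targets a core assumption underlying an ontological framework for the LLM individuation problem: the cross-regime co-reference assumption, namely that the same direction picks out the same semantic content across different operating regimes (prompt conditioning, gradient-descent fine-tuning, and inference-time steering), which has not been adequately argued for. The key to the solution is a new framework, regime-indexed individuation, which holds that the identity unit for representational content is a (vehicle, regime) pair rather than a vehicle alone. The revision is supported by four persona-topology experiments on Qwen3-4B-Instruct and Mistral-7B-Instruct-v0.2: non-collinearity between prompt-extracted vectors and fine-tune basins; fictional personas displacing the model along real-anchor directions more strongly than real anchors do; mixtures with contradictory valence being biased toward an attractor determined by training history; and an asymmetry in compositional algebra between inference-time arithmetic and fine-tune-time chimera training. Together these undermine the original assumption. As a result, candidate positions previously treated as competing for the same referent in fact describe different regime-internal objects, giving a finer-grained theoretical basis for understanding the stability and transferability of internal model representations.
Link: https://arxiv.org/abs/2607.00006
Authors: Shuaizhi Cheng
Affiliations: Harbin Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 30 pages, 2 figures, 1 table. Replies to Beckmann & Butlin (arXiv:2604.17031)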
Abstract:Beckmann & Butlin’s (2026) ontological framework for the LLM individuation problem inherits an unargued cross-regime co-reference assumption from the persona-vectors literature: that the same direction picks out the same content under prompt-conditioning, gradient-descent fine-tuning, and inference-time steering. We present four empirical wedges from persona-topology experiments on Qwen3-4B-Instruct and Mistral-7B-Instruct-v0.2 - non-collinearity of prompt-extracted vectors and fine-tune basins; fictional personas displacing the model along real-anchor directions more strongly than real anchors do; contradictory-valenced mixtures biased toward a training-history-determined attractor; and asymmetric compositional algebra under inference-time arithmetic versus fine-tune-time chimera training - that jointly undermine the assumption. We propose regime-indexed individuation: the identity unit for representational content is a (vehicle, regime) pair, not a vehicle alone. Under this framework, Beckmann & Butlin’s three candidate positions describe three different regime-internal objects rather than competing for the same referent; the same diagnosis applies to Mollo & Millière, Chalmers, and Cerullo.
[NLP-77] Disentangling Speaker and Language Effects in Cross-Lingual Speaker Verification for Iberian Languages
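[Quick read]: This paper addresses the performance degradation of cross-lingual speaker verification (SV) systems under language mismatch, and in particular a flaw in existing evaluation protocols, which confound language differences with differences between individual speakers. Conventional evaluation usually runs cross-lingual tests with different speakers across languages, making it hard to tell whether degradation stems from language differences or from speaker variability. To address this, the paper introduces a bilingual same-speaker evaluation set covering five Iberian languages, so that cross-lingual SV can be analyzed with speaker identity held fixed. The study uses a HuBERT-based SV system and applies the Cross-Lingual Transfer Matrix (CLTM) to quantify transfer between language pairs. The results show that speaker-related variability explains part of the degradation, but language mismatch remains the dominant factor. The work gives a more precise characterization of language dependence in cross-lingual SV and exposes the model's real robustness bottleneck in cross-lingual scenarios.
Link: https://arxiv.org/abs/2607.01161
Authors: Pol Buitrago,Javier Hernando
Affiliations: Barcelona Supercomputing Center; Universitat Politècnica de Catalunya
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Comments: 5 pages, 8 figures, Submitted to IberSPEECH 2026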
Abstract:Cross-lingual speaker verification (SV) systems typically exhibit performance degradation when enrollment and test utterances are spoken in different languages. However, standard evaluation protocols confound language mismatch with inter-speaker variability, as evaluation is generally performed with different speakers across languages. In this work, we introduce a bilingual same-speaker evaluation set for five Iberian languages, enabling analysis of cross-lingual SV under constant speaker identity. We apply this setup to a HuBERT-based SV system previously shown to exhibit strong language dependence, and analyze results using the Cross-Lingual Transfer Matrix (CLTM) to study pairwise cross-lingual transfer. Our results show that speaker-related variability accounts for part of the observed degradation, but language mismatch remains the main driver of cross-lingual performance loss. These findings provide a more precise characterization of language dependence in cross-lingual SV.
[NLP-78] NeuroCogMap Reveals Cognitive Organization of Large Language Models
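[Quick read]: This paper addresses the core question of whether the internal representations of large language models (LLMs) form reproducible, functionally meaningful systems that explain model behavior, failure modes, and links to human cognition. The key to the solution is NeuroCogMap, a framework inspired by cognitive neuroscience that organizes the internal features of an LLM into functional parcels, analogous to brain regions, and links them to interpretable functions, a hierarchy of cognitive capabilities, and human cortical activity. The parcels form a stable, semantically coherent organization that is partly conserved across models and functionally linked to model outputs. Within this organization, major failure modes (hallucination, bias, refusal failure, and sycophancy) map to distinct disruptions in representational and behavioral-control systems, yielding signals for mechanism-guided detection and targets for intervention. NeuroCogMap also improves prediction of human cortical responses during naturalistic language comprehension, with the strongest correspondence in higher-order association cortex, and exposes latent strategies that guide refinements of classical models of human decision-making. NeuroCogMap thus provides a system-level framework for mapping functional organization in artificial systems and relating it to human cortical function and cognitive behavior.
Link: https://arxiv.org/abs/2607.00397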
Authors: Zhongxiang Sun,Haolang Lu,Qiang Ma,Qi Li,Qipeng Wang,Liang Pang,Chenyu Liu,Qiankun Li,Hao Sun,Kun Wang,Yi Zeng,Jun Xu,Guoqi Li,Ji-Rong Wen
Affiliations: The University of Hong Kong; Institute of Automation, Chinese Academy of Sciences; Gaoling School of Artificial Intelligence, Renmin University of China; State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; College of Computing and Data Science, Nanyang Technological University; IGS, Imperial College London
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 79 pages, 6 main figures, 5 extended figures
Abstract:Understanding how complex cognitive functions are organized within artificial systems is central to interpreting large language models (LLMs) and relating them to biological cognition. Yet although LLMs exhibit broad cognitive-like behaviours, it remains unclear whether their internal representations form reproducible functional systems that explain behaviour, failure and links to human cognition. Here we present NeuroCogMap, a cognitive neuroscience-inspired framework that organizes internal features of LLMs into functional parcels and links them to interpretable functions, cognitive capabilities and a cognitive hierarchy. These parcels form a stable and semantically coherent organization that is partly conserved across models and functionally linked to model outputs. Within this organization, major LLM failures, including hallucination, bias, refusal failure and sycophancy, correspond to distinct disruptions in representational and behavioural-control systems, yielding internal signatures for mechanism-guided detection and targeted intervention. Beyond model behaviour, NeuroCogMap improves prediction of human cortical responses during naturalistic language comprehension, with the strongest correspondence in higher-order association cortex. At the cognitive level, its internal signatures expose latent strategies that guide refinements of classical models of human decision-making. Together, these findings establish NeuroCogMap as a system-level framework for mapping functional organization in artificial systems and for relating this organization to human cortical function and cognitive behaviour.
Information Retrieval
[IR-0] Diffusion-GR2: Diffusion Generative Reasoning Re-ranker
Link: https://arxiv.org/abs/2607.01170
Authors: Zhuoxuan Zhang,Kangqi Ni,Yuhang Chen,Mingfu Liang,Xiaohan Wei,Yunchen Pu,Fei Tian,Chonglin Sun,Frank Shyu,Adam(Yang)Song,Sandeep Pandey,Luke Simon,Tianlong Chen,Xi Liu
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Work in progress
Abstract:Generative reasoning re-rankers achieve strong recommendation accuracy by emitting a chain-of-thought before re-ordering a candidate list, but they are slow at inference: an autoregressive (AR) decoder spends one sequential forward pass per reasoning token, and the reasoning trace far exceeds the ranking it produces. To reduce this cost, block-diffusion language models decode many positions in parallel over a few denoising steps and are substantially faster, yet naively converting an AR re-ranker into one opens two accuracy gaps: (1) a structural gap: answer positions are denoised in parallel and scored independently, so the decoder emits invalid rankings (duplicated, dropped, or out-of-set identifiers) that AR avoids through left-to-right masking; and (2) a distributional gap: fine-tuning the converted model on fixed teacher trajectories is off-policy relative to its own decoding at inference, leaving a residual accuracy gap. To close both gaps while keeping the speedup, we propose Diffusion-GR2, a recipe that converts our AR reasoning re-ranker (GR2) into a block-diffusion re-ranker. First, conversion fine-tuning (CFT) adapts the AR-initialized diffusion model to denoise the answer into a valid permutation on its own, without an external constrained decoder. Next, on-policy distillation (OPD) supervises the model on its own decoded trajectories with dense per-token targets from the AR teacher. Finally, we apply a reinforcement-learning (RL) stage against a re-ranking reward on top of OPD’s on-policy policy. Experiments on Amazon Beauty demonstrate that Diffusion-GR2 recovers to near-parity with the AR re-ranker, while block-parallel decoding raises decode throughput by 2.4x to 3.5x at the model’s reasoning output length. Ablations show that CFT recovers most of the conversion gap, and that on-policy distillation further closes it to the AR reference.
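The structural gap named in this abstract (duplicated, dropped, or out-of-set identifiers) can be made concrete with a small validity check. This is only an illustrative sketch, not the paper's method: the function names are hypothetical, and candidate identifiers are assumed to be unique strings.

```python
from collections import Counter
from typing import Dict, List, Sequence


def ranking_errors(candidates: Sequence[str],
                   ranking: Sequence[str]) -> Dict[str, List[str]]:
    """Classify the ways a decoded ranking fails to be a permutation."""
    allowed = set(candidates)
    counts = Counter(ranking)
    return {
        # in-set identifiers emitted more than once
        "duplicated": sorted(i for i, c in counts.items()
                             if c > 1 and i in allowed),
        # candidates that never appear in the ranking
        "dropped": sorted(allowed - set(ranking)),
        # identifiers that were never in the candidate list
        "out_of_set": sorted(set(ranking) - allowed),
    }


def is_valid_permutation(candidates: Sequence[str],
                         ranking: Sequence[str]) -> bool:
    """True when the ranking contains each candidate exactly once."""
    return not any(ranking_errors(candidates, ranking).values())
```

For candidates `a`, `b`, `c`, the ranking `b, a, c` passes, while `a, a, d` is flagged for all three error types at once.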
[IR-1] Trie-based Experiment Plans for Efficient IR Pipeline Experiments SIGIR2026
Link: https://arxiv.org/abs/2607.01162
Authors: Irene Anu,Craig Macdonald
Categories: Information Retrieval (cs.IR)
Comments: Accepted at ReNeuIR’26 workshop, colocated with SIGIR 2026. To appear in CEUR workshop proceedings
Abstract:Search engines are often formulated as cascading pipelines, where successive stages combine the results of different retrievers, and iteratively refine the ranking of candidate documents to obtain a final ranking, which can be presented to a user, or provided as context to an LLM. Such pipelines can be complex to evaluate in an end-to-end manner, necessitating measurement of Recall of early stages, and Precision of later stages, which are often interchangeable. PyTerrier is ideal for building and evaluating cascading retrieval pipelines, due to its declarative nature for pipeline construction and wide ecosystem of retrievers and rerankers. However, comparative evaluation of pipelines can be expensive due to repeated components. In this work, we describe the use of a trie data structure to formulate an experiment plan for comparative pipeline experiments that enhances experiment efficiency compared to a sequential “linear” plan. Empirically, on a demonstration experiment involving BM25, MonoT5 and DuoT5 on MSMARCO v2, we observe a 26% reduction in experiment duration. Finally, we report on a user study of undergraduate and postgraduate research students’ use of the experiment plans.
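Why a trie helps when pipelines share components can be shown with a minimal sketch that counts stage executions only. It does not use the PyTerrier API; pipelines are modelled as plain lists of stage names, and `build_trie`, `trie_cost`, and `linear_cost` are hypothetical helpers.

```python
from typing import Dict, Sequence


def build_trie(pipelines: Sequence[Sequence[str]]) -> Dict[str, dict]:
    """Insert each pipeline as a path of stage names; shared prefixes merge."""
    root: Dict[str, dict] = {}
    for stages in pipelines:
        node = root
        for stage in stages:
            node = node.setdefault(stage, {})
    return root


def trie_cost(trie: Dict[str, dict]) -> int:
    """Stage executions when every trie node is run exactly once."""
    return sum(1 + trie_cost(child) for child in trie.values())


def linear_cost(pipelines: Sequence[Sequence[str]]) -> int:
    """Stage executions when each pipeline is run from scratch."""
    return sum(len(stages) for stages in pipelines)


pipelines = [
    ["BM25"],
    ["BM25", "MonoT5"],
    ["BM25", "MonoT5", "DuoT5"],
]
plan = build_trie(pipelines)
```

Here the three pipelines need 6 stage executions when run one after another and 3 under the trie plan. The count ignores how expensive each stage is, so it is not comparable to the duration figure in the abstract.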
[IR-2] MemSyco-Bench: Benchmarking Sycophancy in Agent Memory
Link: https://arxiv.org/abs/2607.01071
Authors: Zhishang Xiang,Zerui Chen,Yunbo Tang,Zhimin Wei,Ruqin Ning,Yujie Lin,Qinggang Zhang,Jinsong Su
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Memory has emerged as a cornerstone of modern LLM-based agents, supporting their evolution from single-turn assistants to long-term collaborators. However, memory is not always beneficial: retrieved memories often induce a critical issue of sycophancy, causing agents to over-align with the user at the cost of factual accuracy or objective reasoning. Despite this emerging risk, existing memory benchmarks primarily evaluate whether memories are correctly stored, retrieved, or updated, while overlooking how retrieved memories influence downstream reasoning and decision-making. To bridge this gap, we propose MemSyco-Bench, a comprehensive benchmark for evaluating memory-induced sycophancy in agent systems. MemSyco-Bench measures when memory should influence a decision and how valid memory should be used. Specifically, it covers five tasks that assess whether agents can reject memory as factual evidence, respect its applicable scope, resolve conflicts between memory and objective evidence, track memory updates, and use valid memory for personalization. All related resources are collected for the community at this https URL.
[IR-3] As It Was: Aligning LLM Search Evaluation with Historical User Preferences
Link: https://arxiv.org/abs/2607.01040
Authors: Ali Vardasbi,Gustavo Penha,Enrico Palumbo,Claudia Hauff,Hugues Bouchard,Mounia Lalmas
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Large-scale search systems evolve faster than human quality assurance can scale, especially for long-tail intents and multilingual queries. LLM-as-a-judge approaches provide a scalable alternative for evaluating the relevance of search engine result pages (SERPs), but judgments based solely on semantic similarity or world knowledge can drift from actual user preferences, particularly for ambiguous queries. We introduce a behavior-grounded LLM judge that augments each SERP item with a lightweight and auditable behavioral prior in the form of a Query-Relevance-Impressions (QRI) card. Each card summarizes how users have historically interacted with similar queries and results, providing compact empirical evidence that the judge can cite to resolve ambiguity and make more consistent relevance judgments while still relying on semantic reasoning. In a large-scale music search evaluation at Spotify, using relevance estimates derived from historical user interactions across 6,000 recomposed SERPs, the behavior-grounded judge achieves stronger alignment with user preferences, improving Spearman rank correlation by approximately 5% overall and yielding a 91% relative improvement on disagreement cases. On a multilingual human-judged dataset spanning five languages, grounding further increases correlation with human relevance judgments by 15%. Importantly, when evaluated against outcomes from a live A/B test, the grounded judge shows consistently higher alignment with the observed winning model. While absolute alignment remains moderate, these findings demonstrate that lightweight behavioral grounding can improve the reliability and practical usefulness of LLM-based evaluation in real-world search systems.
[IR-4] RACORN-1: Adaptive Recall-Preserving Speedup for Low-Selectivity Filtered Vector Search
Link: https://arxiv.org/abs/2607.00768
Authors: Yoonseok Kim,Gyusik Choe
Categories: Databases (cs.DB); Information Retrieval (cs.IR)
Comments: 13 pages, 11 figures, 10 tables
Abstract:Filtered Vector Search (FVS), which combines vector embedding similarity with structured metadata predicates, has emerged as a core requirement in RAG and production retrieval systems. ACORN-1, the representative In-filtering algorithm that reuses an existing HNSW index, substantially reduces latency at low selectivity but suffers connectivity instability below 5% selectivity and recall collapse below 1%. We propose RACORN-1, an in-place extension of ACORN-1 that resolves this collapse via (i) Adaptive Search Fallback (ASF) – repurposing filter-failing nodes as transient bridges to detour around severed paths; bridge and two-hop candidate selection uses stride sampling for spatial diversity. While filter-first ACORN-family methods have a structural recall trade-off relative to distance-first HNSW, RACORN-1 improves the trade-off curve via ASF, minimizing recall loss while substantially reducing latency. Across three 1M-scale and one 40M-scale dataset, RACORN-1 delivers approximately 9-26x latency reduction over HNSW in the sweet spot (1%-0.3%), and recovers ACORN-1’s recall collapse from 0.45-0.72 (1%) and 0.03-0.10 (0.3%) to 0.70-0.96 and 0.77-0.98 respectively. For the extreme-low-selectivity regime where linear scan can outperform graph search, we combine RACORN-1 with (ii) Adaptive Exact Fallback (AEF) in a variant RACORN-1+, achieving recall 1.00 with 20-75x speedup at 1M =0.1% and 13x speedup at 40M 0.01%. Under a Negative Correlation evaluation (K-means clusters), where ACORN-1 collapses (recall 0.08-0.41), RACORN-1 maintains recall 0.80-0.98 with a 5-9x latency advantage over HNSW. Together, RACORN-1 and RACORN-1+ form an ACORN-1-compatible mechanism robust to both extreme-low-selectivity and adversarial query-filter correlation.
[IR-5] When to Repair a Graph ANN Index: Navigability-Signal-Triggered Local Repair Protects Tail Recall Under Bursty Churn
Link: https://arxiv.org/abs/2607.00728
Authors: Madhulatha Mandarapu,Sandeep Kunkunuru
Categories: Databases (cs.DB); Information Retrieval (cs.IR)
Comments: 7 pages. Code + one-command reproduction: this https URL
Abstract:Graph approximate-nearest-neighbor (ANN) indexes (HNSW, DiskANN/Vamana) lose recall under insert/delete churn, because deletions orphan the greedy-search paths that route through removed nodes. Production systems restore navigability by repairing the graph on a fixed schedule (consolidate every X operations). We ask whether triggering local edge repair on a measured navigability-degradation signal, rather than a blind clock, spends a fixed repair budget better. On two real ANN datasets (SIFT-128 and Fashion-MNIST-784) under a controlled bursty churn stream, and comparing repair policies at matched amortized repair budget (equal consolidation count), signal-triggered repair Pareto-dominates fixed-cadence repair. The gain is concentrated on worst-case (tail) recall at scarce budget: at roughly one consolidation it improves the minimum recall@10 by +0.014 (SIFT) to +0.050 (Fashion-MNIST) across four stream seeds, with 95% confidence intervals excluding zero, while the mean-recall gain is small (0.005). The advantage follows a clean drift-severity gradient – larger for sparser, more fragile graphs – and fades to parity when the index is robust or budget is ample. A cheap probe-recall signal is a valid, leading indicator of true recall (Spearman rho ~= 0.95). We contribute the mechanism, a budget-matched evaluation protocol that separates repair scheduling from repair spend, and an open, reproducible churn-repair harness. We deliberately do not claim a mean-recall improvement or a new index; a recall-versus-repair-cost bound and data-distribution-drift coupling are left as future work.
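The contrast between a blind clock and a measured signal can be sketched as two scheduling policies that see the same stream of probe-recall readings. This is a hedged illustration rather than the authors' harness: the class names, the fixed threshold rule, and the readings are all assumptions, and the actual graph repair is left to the caller.

```python
from typing import List


class FixedCadencePolicy:
    """Repair every `every` observations, regardless of the signal."""

    def __init__(self, every: int) -> None:
        self.every = every
        self.seen = 0

    def should_repair(self, probe_recall: float) -> bool:
        self.seen += 1
        return self.seen % self.every == 0


class SignalTriggeredPolicy:
    """Repair when probe recall drops below a threshold, within a budget."""

    def __init__(self, threshold: float, budget: int) -> None:
        self.threshold = threshold
        self.remaining = budget

    def should_repair(self, probe_recall: float) -> bool:
        if self.remaining > 0 and probe_recall < self.threshold:
            self.remaining -= 1
            return True
        return False


def decisions(policy, readings: List[float]) -> List[bool]:
    """Feed a stream of probe-recall readings to a policy."""
    return [policy.should_repair(r) for r in readings]


# Made-up readings: a burst of deletions hurts recall at the third step.
readings = [0.99, 0.98, 0.90, 0.97, 0.89]
fixed = decisions(FixedCadencePolicy(every=5), readings)
triggered = decisions(SignalTriggeredPolicy(threshold=0.95, budget=1), readings)
```

Both policies spend exactly one repair here, which mirrors the matched-budget comparison described in the abstract; the signal-triggered policy spends it at the first degraded reading, while the fixed cadence waits until the fifth.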
[IR-6] What Survives Into Context: A Diagnostic for Budget-Constrained Multi-Hop RAG and When Submodular Evidence Packing Improves It
Link: https://arxiv.org/abs/2607.00725
Authors: Ananto Nayan Bala
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 12 pages, 5 figures
Abstract:Retrieval-augmented generation (RAG) under a fixed reader-context budget forces a selection problem: of the evidence retrieved, only a fraction can be shown to the reader. We argue that document recall – the standard retrieval metric – is the wrong quantity to optimize in this regime, and we make two contributions. First, as a general contribution, we introduce answer-in-context, a diagnostic that measures whether a gold answer survives as a contiguous span in the packed reader context (not the retrieved set). It predicts answer F1 better than recall (r=0.39-0.55 vs. about 0.31), separates answer quality roughly five-fold (0.60 vs. 0.12 on HotpotQA), and carries information beyond retrieval: it adds $\Delta R^2 = 0.17$ over recall and shows a 4.6x EM gap even among questions where all gold was retrieved. We also confirm it interventionally: on 2WikiMultiHopQA a packing change that raises coverage but not answer-in-context yields no accuracy gain. Second, as a conditional contribution, we cast reader-context construction as budgeted monotone submodular maximization and build a packer that jointly optimizes relevance, query coverage, representativeness, and diversity. On HotpotQA with a 160-token budget and a 3B reader it beats a strong focused heuristic, MMR, and naive packing – by up to +5.1 F1 at equal-or-lower token cost, across three seeds. Crucially, we map the scope of this win honestly: it requires the conjunction of (i) multi-hop complementary structure, (ii) retrieval that surfaces the evidence, (iii) a binding but not extreme budget, and (iv) a reader weak enough that evidence density, not reading capacity, is the bottleneck. A quantization-controlled reader-scale ladder (3B to 7B to 14B) shows the edge over the heuristic is absorbed by 7B and significantly reverses by 14B, while the diagnostic explains every boundary with a single variable.
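The answer-in-context diagnostic described here can be approximated in a few lines. This is a rough sketch under stated assumptions, not the paper's implementation: tokenization is a lowercase word split, the packer is a naive truncating one (not the submodular packer), and a match that straddles two passages is not excluded. All function names are hypothetical.

```python
import re
from typing import List, Sequence


def tokens(text: str) -> List[str]:
    """Lowercase word tokens; a stand-in for a real tokenizer."""
    return re.findall(r"\w+", text.lower())


def naive_pack(passages: Sequence[str], budget: int) -> List[str]:
    """Keep passages in order until the token budget is spent."""
    packed: List[str] = []
    for passage in passages:
        room = budget - len(packed)
        if room <= 0:
            break
        packed.extend(tokens(passage)[:room])  # truncate the last passage
    return packed


def answer_in_context(gold: str, context: Sequence[str]) -> bool:
    """True if the gold answer appears as a contiguous token span."""
    needle = tokens(gold)
    n = len(needle)
    if n == 0:
        return False
    return any(list(context[i:i + n]) == needle
               for i in range(len(context) - n + 1))


retrieved = [
    "Paris is the capital of France.",
    "The Eiffel Tower opened in 1889.",
]
tight = naive_pack(retrieved, budget=8)
loose = naive_pack(retrieved, budget=12)
```

Both passages count as retrieved, yet with an 8-token budget the answer `1889` is cut off and the diagnostic is false, while a 12-token budget keeps it. This is the distinction between recall over the retrieved set and survival in the packed context.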
[IR-7] Multi-Turn Agentic Scientific Literature Search via Workflow Induction
Link: https://arxiv.org/abs/2607.00597
Authors: Jisen Li(1 and 2),Bingxuan Li(1),Nanyi Jiang(3),Xuying Ning(1),Xiyao Wang(3),Yifan Shen(1),Heng Wang(1),Yuqing Jian(2),Xiaoxia Wu(2),Ben Athiwaratkun(2),Pan Lu(4),Jiaxuan You(1),Bingxin Zhao(3) ((1) University of Illinois Urbana-Champaign, (2) Together AI, (3) University of Pennsylvania, (4) Stanford University)
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 17 pages, 12 figures
Abstract:Scientific literature search often requires more than retrieving papers from a single query: users’ intents are underspecified, preference-dependent, and evolve through interaction. Existing search agents typically rely on fixed pipelines or implicit language-only reasoning, making their search strategies difficult to control, inspect, and refine. We introduce PaperPilot, a multi-turn literature search agent that frames scientific search as workflow induction. Given an anchor paper and a user query, PaperPilot constructs an executable DAG of paper-search operators, including keyword search, citation expansion, filtering, scoring, reranking, and evidence extraction. User feedback is then used to refine both the query and the workflow itself. We train PaperPilot with supervised workflow imitation and preference optimization over controlled workflow corruptions. Experiments show that PaperPilot-9B improves over the base Qwen3.5-9B toolset agent under multi-turn interaction, increasing Hit@5 from 58.0 to 77.0, MRR from 47.5 to 59.4, and nDCG@10 from 26.8 to 32.5, while reducing workflow execution errors from 9.5% to 0%. These results show that explicit, editable search workflows provide an effective and controllable interface for aligning literature search agents with complex scientific intent.
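An executable DAG of search operators, as described in this abstract, can be sketched with the standard-library `graphlib` module. This is a toy illustration, not PaperPilot: the operator names follow the abstract, but their bodies, the paper records, and `run_workflow` are all hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List


def run_workflow(deps: Dict[str, List[str]],
                 ops: Dict[str, Callable[..., object]]) -> Dict[str, object]:
    """Run each operator after its dependencies, passing their outputs in."""
    results: Dict[str, object] = {}
    # static_order() yields nodes so that dependencies come first and
    # raises graphlib.CycleError if the graph is not a DAG.
    for name in TopologicalSorter(deps).static_order():
        inputs = [results[d] for d in deps.get(name, [])]
        results[name] = ops[name](*inputs)
    return results


# Made-up paper records standing in for real search results.
keyword_hits = [{"id": "p1", "year": 2021, "score": 0.4},
                {"id": "p2", "year": 2024, "score": 0.9}]
citation_hits = [{"id": "p3", "year": 2025, "score": 0.7}]

deps = {
    "keyword_search": [],
    "citation_expand": [],
    "filter": ["keyword_search", "citation_expand"],
    "rerank": ["filter"],
}
ops = {
    "keyword_search": lambda: keyword_hits,
    "citation_expand": lambda: citation_hits,
    "filter": lambda a, b: [p for p in a + b if p["year"] >= 2023],
    "rerank": lambda ps: sorted(ps, key=lambda p: p["score"], reverse=True),
}
results = run_workflow(deps, ops)
ranked_ids = [p["id"] for p in results["rerank"]]
```

Because the workflow is plain data (`deps` and `ops`), refining it after user feedback amounts to editing a node or an edge and re-running it, which is one way to read the abstract's point about editable workflows.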
[IR-8] When RAG Meets Query Planning: Logical Query Trees for Resolving Exploratory Reasoning Problems SIGMOD2027
Link: https://arxiv.org/abs/2607.00508
Authors: Ganlin Xu,Linghao Zhang,Zhitao Yin,Hongda Xi,Chen Yang,Jiaqing Liang,Weijia Lu,Sihang Jiang,Yanghua Xiao,Deqing Yang
Categories: Information Retrieval (cs.IR)
Comments: Accepted by SIGMOD 2027
Abstract:Retrieval-Augmented Generation (RAG) effectively grounds large language models (LLMs) in external knowledge but struggles with exploratory reasoning problems (ERPs), complex queries involving high uncertainty and ambiguity. Resolving ERPs requires complex reasoning with unclear paths, tending to result in retrieval noise and error accumulation. Furthermore, the absence of an end-to-end planning mechanism makes it difficult to generate effective trajectories for ERPs. Motivated by database query planning, we introduce PlanRAG, a RAG framework that models natural-language ERPs as logical query trees (LQTs). However, translating ERPs into LQTs is non-trivial due to representation and optimization gaps between structured SQL and unstructured natural language, making it highly challenging to construct high-quality LQTs. To address these problems, we first decompose ERPs into atomic queries and then organize them into LQTs using dynamic programming guided by a cost model involving multiple complementary dimensions. Finally, we execute iterative aggregation, rewriting, retrieval, and generation over LQTs, processing nodes concurrently and propagating intermediate results upward, with further parallelization across multiple threads for efficiency. Our experimental results show that PlanRAG outperforms state-of-the-art iteration-based and graph-based RAG systems on our newly constructed dataset, WikiWeb-ERP, thereby providing a new formulation for optimizing natural language queries. Our source code and dataset are available at this https URL.
[IR-9] Real-Time Hard Negative Sampling via LLM-based Clustering for Large-Scale Two-Tower Retrieval
Link: https://arxiv.org/abs/2607.00448
Authors: Ivan Ji,Liuyi Hu,Harrison(Zihao)Zhao,Lei Huang,Qunshu Zhang,Max (Xiangjun)Fan,Aameek Singh
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:The two-tower model has been widely used for large-scale recommendation systems, particularly in the retrieval stage. Industry standards for training two-tower models typically involve in-batch and/or out-of-batch negative sampling. However, these methods often produce easy negatives that models can quickly learn, failing to sufficiently challenge the model. To address this issue, a novel self-supervised hard negative sampling technique is proposed that leverages a large language model (LLM) to generate hard negatives from the same cluster during model training. By utilizing the LLM to learn media representations, the proposed approach ensures that the generated negatives are more challenging and informative. This real-time sampling framework is designed for seamless integration into production models, capable of handling billions of training data points with minimal computational complexity. Experiments on public datasets, along with deployment to a large-scale online system, demonstrate that the proposed negative sampling technique outperforms widely used industry methods. Furthermore, analysis in industrial applications reveals that this sampling method can help break inherent feedback loops in recommendations and significantly reduce popularity bias.
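Drawing hard negatives from the positive item's own cluster can be sketched as follows. This is a minimal illustration under assumptions the abstract does not spell out: cluster assignments are taken as precomputed (for example from LLM-derived item representations), and the function names and item ids are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List


def build_cluster_index(item_cluster: Dict[str, Hashable]) -> Dict[Hashable, List[str]]:
    """Group item ids by their (precomputed) cluster id."""
    index: Dict[Hashable, List[str]] = defaultdict(list)
    for item, cluster in item_cluster.items():
        index[cluster].append(item)
    return index


def sample_hard_negatives(positive: str,
                          item_cluster: Dict[str, Hashable],
                          index: Dict[Hashable, List[str]],
                          k: int,
                          rng: random.Random) -> List[str]:
    """Sample up to k negatives from the positive item's own cluster."""
    pool = [i for i in index[item_cluster[positive]] if i != positive]
    return rng.sample(pool, min(k, len(pool)))


# Toy assignment: items a, b, c share a cluster; d is alone.
item_cluster = {"a": 0, "b": 0, "c": 0, "d": 1}
index = build_cluster_index(item_cluster)
negatives = sample_hard_negatives("a", item_cluster, index, k=2,
                                  rng=random.Random(0))
```

Same-cluster items are close to the positive in representation space, so they are harder to separate than random in-batch negatives; an item that is alone in its cluster, like `d` here, simply yields no hard negatives and would need a fallback.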
[IR-10] Attribute-Prompted Kernel Hashing for Unsupervised Data-Efficient Cross-Modal Retrieval
Link: https://arxiv.org/abs/2607.00379
Authors: Runhao Li,Xiaoxu Ma,Zhenyu Weng,Yue Zhang,Guibo Luo,Huiping Zhuang,Zhiping Lin,Yap-Peng Tan
Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Unsupervised cross-modal hashing enables efficient retrieval of semantically related instances across different modalities without requiring manual semantic annotation. However, existing unsupervised methods rely heavily on large-scale image-text pairs. Collecting such data can be costly, particularly in scenarios where well-aligned pairs are scarce due to privacy and specialized constraints. More critically, existing methods tend to overfit to seen training data, restricting their generalization performance on unseen categories that the constrained training data cannot cover. To address these limitations, we propose Attribute-Prompted Kernel Hashing (APKH), a novel data-efficient approach that constructs a compact, modality-aligned Hamming space driven by the generalized attribute priors of vision-language foundation models. Specifically, APKH introduces two core modules: Context-optimized Attribute Kernel Mapping (CAKM) and Kernel-Smoothed Contrastive Alignment (KSCA). CAKM formulates cross-modal alignment through hyperspherical Radial Basis Function kernel mapping, optimizing dynamic attribute kernels via prompt learning to capture modality-invariant semantics. Furthermore, KSCA extends conventional point-to-point contrastive learning by modeling limited paired data as continuous kernel distributions. This explicit smoothing of the modality gap alleviates overfitting to sparse pairwise correlations. Extensive experiments demonstrate that APKH outperforms state-of-the-art hashing methods in the challenging cross-modal retrieval tasks from seen to unseen categories under data-constrained scenarios.
[IR-11] Learning to Compose: Revisiting Proxy Task Design for Zero-Shot Composed Image Retrieval ECCV2026
Link: https://arxiv.org/abs/2607.00374
Authors: Jingjing Zhang, Lei Zhang, Zheren Fu, Zhendong Mao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments: Accepted by ECCV 2026
Abstract:Composed Image Retrieval (CIR) retrieves a target image from a reference image and a textual modification. While supervised CIR relies on costly triplets, Zero-Shot CIR (ZS-CIR) alleviates this reliance through proxy tasks trained on image-text pairs. However, existing proxy tasks primarily enhance visual and textual representations to accommodate a predefined composition mechanism such as pseudo-word injection into a frozen text encoder or linear feature arithmetic. As a result, the composition function itself remains unlearned, limiting the model’s ability to express diverse and fine-grained semantic modifications. To address this, we propose FoCo, which models composition as two coordinated stages: focusing on modification-relevant visual content, and then completing the target semantics. We realize these through two proxy tasks: text-anchored visual aggregation to selectively gather visual content guided by localized textual semantics, and context-conditioned semantic completion to transform these aggregated visuals with the remaining scene context into a coherent composed representation. The tasks are trained jointly with a cross-instance contrastive objective, encouraging semantic diversity and discouraging shortcut composition strategies. Extensive experiments on four ZS-CIR benchmarks show FoCo’s state-of-the-art performance and improved generalization.
[IR-12] Identifying and Resolving Pitfalls of Knowledge-Based VQA Benchmarks: Auditing, Repairing, and Augmenting ECCV2026
Link: https://arxiv.org/abs/2607.00159
Authors: Qian Ma, S M Rayeed, Charles V. Stewart, Qiong Wu, Yao Ma
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments: Accepted to ECCV 2026. The datasets and code are available in this https URL
Abstract:Knowledge-Based Visual Question Answering (KB-VQA) aims to evaluate whether Visual Language Models (VLMs) can retrieve, ground, and reason over external structured knowledge beyond visual evidence. In practice, answer accuracy is widely adopted as the primary evaluation metric, implicitly treating correctness as a proxy for knowledge-grounded reasoning. However, for existing KB-VQA benchmarks, this proxy relies on critical assumptions that are often overlooked and rendered unreliable by benchmark issues: annotated answer must be derivable from the associated knowledge base, question must be well-posed with sufficient constraints, and visual setting must meaningfully require grounded disambiguation. In this work, we show that these assumptions are systematically violated in existing KB-VQA benchmarks. Our audit reveals substantial instances with missing or contradicted answers and underspecified questions that render accuracy a misleading metric. Furthermore, we find that existing datasets rely on visually trivial, single-entity scenes that bypass the need for sophisticated visual-to-knowledge mapping. We demonstrate that even with controlled architectures, these flaws lead to distorted model rankings and overestimations of reasoning capabilities. To address this, we introduce (1) a principled audit-and-repair protocol that restores answer derivability and question clarity, and (2) a controlled multi-entity augmentation protocol that introduces visual ambiguity to challenge initial retrieval and grounded reasoning. Re-evaluation under corrected and augmented settings yields markedly different performance trends. Our findings call for rethinking evaluation protocols and designing more interaction-aware KB-VQA benchmarks that prioritize verifiable reasoning over simple matching.
[IR-13] AGE: Adaptive-masking for Graph Embedding in Graph Retrieval-Augmented Generation
Link: https://arxiv.org/abs/2607.00052
Authors: Bao Long Nguyen Huu, Atsushi Hashimoto
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:GraphRAG is an extension of retrieval-augmented generation (RAG) that supports large language models (LLMs) by referring to graph-structured data as external knowledge. While this technique ideally captures intricate relationships, it often struggles with graph representations for LLMs, particularly for frozen LLMs, due to the misalignment between graph-based and text-based latent features. We tackle this issue by introducing the Adaptive-masking for Graph Embedding (AGE). AGE employs a Transformer in a mask-based self-supervised learning (SSL) approach. We designed the architecture similar to text embedding encoders, addressing the latent feature misalignment. In contrast to natural language texts, graphs are concise representations, and there exist key nodes that hold dominant contextual information, which are challenging to predict from their surroundings. Masking such key nodes leads to inefficiency in the SSL process. Therefore, AGE focuses on predicting nodes apart from key nodes, utilizing a learnable node sampler. Our experimental results indicate that AGE significantly improves approaches using non-parametric search component in GraphQA tasks, achieving superior accuracy across four benchmark datasets with distinct characteristics.
[IR-14] Aligning Sentence Embeddings to Human Concepts via Sparse Autoencoders
Link: https://arxiv.org/abs/2607.00023
Authors: Wonseok Shin, Songkuk Kim
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Dense sentence embeddings are fundamental to modern Retrieval-Augmented Generation (RAG) systems but suffer from a lack of interpretability due to feature superposition. This opacity hinders the alignment of retrieval processes with human intent, as the entangled representations are difficult to analyze or control. In this work, we propose a method to disentangle the dense representations of sentence transformers (e.g., E5) into human-interpretable concepts using Top-k Sparse Autoencoders (SAEs). We demonstrate that these disentangled features align with specific semantic, syntactic, and pragmatic categories. Furthermore, we introduce an activation steering mechanism that allows for precise intervention in the retrieval process. By clamping specific latent features, we show that it is possible to re-rank search results to better align with user constraints without retraining the backbone model. Our findings suggest that SAE-based decomposition offers a viable path toward transparent and steerable neural information retrieval.
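To make the Top-k sparsity and feature clamping mentioned in the abstract above more concrete, here is a minimal sketch in plain Python. The function names (`top_k_activations`, `clamp_feature`) and the toy values are assumptions for illustration only, not the authors' implementation; a real sparse autoencoder would also have learned encoder and decoder weights around this step.

```python
def top_k_activations(pre_activations, k):
    """Keep the k largest pre-activations and zero out the rest (Top-k sparsity)."""
    if k <= 0:
        return [0.0] * len(pre_activations)
    # Indices of the k largest values; ties are broken by lower index first.
    order = sorted(range(len(pre_activations)),
                   key=lambda i: (-pre_activations[i], i))
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(pre_activations)]


def clamp_feature(latents, index, value):
    """Return a copy of the sparse code with one latent forced to a fixed value."""
    steered = list(latents)
    steered[index] = value
    return steered


# Toy example: 6 hypothetical latent features, keep the top 2.
pre = [0.1, 2.0, -0.5, 1.5, 0.3, 0.0]
sparse = top_k_activations(pre, k=2)
# Steering: force hypothetical feature 4 on before decoding or re-ranking.
steered = clamp_feature(sparse, index=4, value=3.0)
print(sparse)
print(steered)
```

In this sketch, steering only edits the sparse code; how the edited code is mapped back into the embedding space for re-ranking is left out.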
[IR-15] Learning User-Aware Recall: Personalized Retrieval in Long-Term Conversational Memory
Link: https://arxiv.org/abs/2607.00017
Authors: ZhiShu Jiang, Haibo Liu, Xin Shen, Guanqiang QI, Chenxi Miao, Weikang Li, Liwei Qian, Xin Pei, Jizhou Huang
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Long-term conversational agents are expected to remember past interactions, but memory is useful only when the right evidence is recalled for the right user. Existing memory-augmented LLM agents have made progress in building compact memory banks, yet retrieval is still often driven by query-centered similarity or fixed ranking rules, leaving user-conditioned relevance […]. To address this gap, we propose Profile-guided Personalized Retrieval Optimization (PPRO), a retrieval-centric framework that makes memory retrieval both user-aware and […]. […] builds episodic and semantic memory banks from dialogue histories and derives a user profile from accumulated […]. […] profile serves as an explicit personalized prior in memory ranking, allowing retrieval to account for stable user attributes, preferences, and […]. […] further trains a query rewriter with Group Relative Policy Optimization, using both evidence retrieval quality and downstream answer quality as feedback while keeping the memory banks and answer model […]. […] on LoCoMo and LongMemEval-S show consistent gains over training-free memory systems and training-based […]. […] studies further show that both profile-guided ranking and retrieval-oriented rewriting contribute substantially to performance, highlighting retrieval optimization as a key factor in personalized long-term memory use.
[IR-16] Libra: Training the Environment for Agentic Information Retrieval
Link: https://arxiv.org/abs/2607.00016
Authors: Xuan Zhao, Andy Chiu, Gengyu Wang
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Information localization within massive repositories is a cornerstone of agentic LLM systems. While synthetic data-driven optimization has proven successful in training LLMs, little attention has been paid to optimizing the agent’s working environment (the repository itself) in a data-driven manner. To bridge this gap, we present Libra, a self-evolving framework that introduces mutable “catalogs” (hierarchical Markdown files serving as navigable indices) into the repository. Libra runs an LLM-driven optimization loop where a Prompter generates synthetic queries, a frozen Solver attempts to resolve them by navigating the catalogs, and a Healer rewrites the catalogs in response to the Solver’s localization failures. Evaluations across 12 SWE-bench Lite repositories demonstrate that this environmental healing yields continual, logarithmic improvements in code localization accuracy. Furthermore, these environmental improvements transfer zero-shot across different LLMs and problem sets. Although the focus of this paper is to study the general behavior of such a system, we also demonstrate that a minimalist coding agent equipped with Libra-optimized catalogs outperforms state-of-the-art baselines. Code is available at this https URL and data at this https URL.
[IR-17] GRACE-RAG: Governed Retrieval Architecture for Canonical Evidence Synthesis Enabling Lightweight Deployment in Closed-Domain Institutional Settings
Link: https://arxiv.org/abs/2607.00013
Authors: Asit Desai, Aman Kumar, Prashant Devadiga
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 15 pages, 5 figures, 4 tables. Submitted to COLM 2026
Abstract:Retrieval-Augmented Generation (RAG) systems are widely used in institutional question answering settings where responses must be grounded in authoritative documentation (Gao et al., 2023). In entity-dense domains where relevant information is distributed across heterogeneous documents, vector-only retrieval often produces fragmented evidence and increases dependence on inference-time reasoning (Zhao et al., 2024). This paper introduces GRACE-RAG, a retrieval-governed, graph-augmented RAG architecture that externalizes structural reasoning from the generative stage to a structured retrieval layer, resolving structural ambiguity offline, enabling deployment on self-hosted lightweight models calibrated to closed-domain institutional vocabulary. Experiments across three model capacities: Mistral 24B, GPT OSS 120B, and Gemini 2.5 Flash show consistent improvements in completeness, depth, and anticipatory coverage, with overall quality gains of up to 20% under mid-scale models, indicating that retrieval architecture governs structural quality over model scale, reducing computational and latency footprint without dependence on proprietary systems.
[IR-18] PRA-RAG: Provably Robust Aggregation in Retrieval-Augmented Generation against Retrieval Corruption
Link: https://arxiv.org/abs/2607.00012
Authors: Xue Tan, Yi Zheng, Chang Huo, Yunruo Zhang, Yu Liu, Hao Luan, Zhuyang Yu, Xiaoyan Sun, Ping Chen, Jun Dai
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by incorporating external knowledge, effectively mitigating their inherent knowledge limitations. However, RAG remains vulnerable to poisoning attacks that manipulate retrieved texts to mislead model outputs. Existing defense mechanisms often lack theoretical robustness guarantees and perform unreliably when the LLM has limited knowledge of the retrieved content. In this work, we propose PRA-RAG, a provably robust retrieval aggregation algorithm designed to defend against poisoning attacks on retrieved texts. PRA-RAG samples multiple combinations of retrieved texts and utilizes geometric structures in the embedding space to identify a robust subset, from which a stable aggregated representation is derived. We provide theoretical bounds on the maximum impact of poisoned retrieved content and establish a quantitative measure of RAG’s robustness. Experiments across multiple benchmarks and RAG architectures demonstrate that PRA-RAG reduces the attack success rate to as low as 1% while maintaining an accuracy of 71%, significantly outperforming representative state-of-the-art methods.
[IR-19] SkillSelect-Serve: Budget-Controllable and QoS-Aware Skill Service Recommendation and Composition for Small LLM Agents
Link: https://arxiv.org/abs/2607.00011
Authors: Jingyuan Zheng, Dongjing Wang, Xin Zhang, Butian Huang, Haiping Zhang, Dongjin Yu, Shuguang Deng
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 5 figures, 6 tables
Abstract:Reusable skill libraries are becoming important infrastructure for large language model (LLM) agents, yet existing selection methods often treat skills as retrievable documents and return fixed top-k lists. This paper presents SkillSelect-Serve, a budget-controllable and QoS-aware framework that formulates agent skill selection as Skill Service Recommendation and Composition. SkillSelect-Serve represents raw skills as structured Skill Services with functional descriptions, dependencies, context cost, risk, and QoS-related attributes. A local Micro-Agent Requirement Planner converts natural-language tasks into structured service requirements, while a shared discovery backbone retrieves candidate services from a large registry. The framework then performs dual-granularity utility modeling with skill-level marginal suitability estimation and bundle-level calibration for coverage, redundancy, cost, and risk trade-offs. Experiments on 35,353 skills and 586 task queries show that SkillSelect-Serve consistently improves same-budget bundle recall and mean utility over fixed top-k retrieval baselines.
[IR-20] Prompt Optimization for User Simulation in Conversational Recommender Systems: A Multi-Objective Framework ICDE
Link: https://arxiv.org/abs/2607.00010
Authors: Nipun B Nair, Tongtong Wu, Weiqing Wang
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: to be published in 2026 IEEE 42nd International Conference on Data Engineering Workshops (ICDEW)
Abstract:Conversational recommender systems (CRSs) are a core component of next-generation intelligent recommender systems because they enable users to actively elicit preferences, clarify intentions, and adapt recommendations in real time. However, there are two key obstacles in the CRS domain: evaluation and access to training data. Evaluating CRSs through real human studies is more critical than for traditional recommender systems, yet such studies are both costly and time-consuming. Moreover, CRS interaction data are often difficult to obtain for model training due to privacy concerns. Large language model (LLM)-based user simulators have shown promise in addressing both challenges by generating synthetic user interactions for evaluation and training. However, existing approaches suffer from systematic positive bias, data leakage, and limited behavioral diversity, and they rely on brittle manual prompt engineering that requires extensive domain expertise. In this paper, we propose a framework to automatically optimize prompts for LLM-based user simulators in CRSs, simultaneously mitigating these issues. Experimental results demonstrate that the proposed framework achieves improved behavioral alignment with human interaction patterns compared to baseline methods across diverse prompt settings.
[IR-21] SchemaRAG: Dynamic Large Schema Reduction for LLM-driven Structured Information Extraction
Link: https://arxiv.org/abs/2607.00008
Authors: Sin Yu Bonnie Ho, Arlie Coles, Erik Larsson, Eric Marshall, Nathan Bodenstab, Paul Vozila
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Extracting structured data from unstructured text using large language models (LLMs) becomes challenging when target schemas are large and complex. In such cases, including the full schema in the prompt increases cost and latency, risks lost-in-the-middle performance degradation, and can exceed context length limits. We propose SchemaRAG, a retrieval-augmented generation (RAG) framework that dynamically prunes the output schema space for schema-conditioned information extraction tasks by leveraging schema metadata and few-shot examples when available. We evaluate SchemaRAG on real-world healthcare and e-commerce datasets. Our results show that SchemaRAG can achieve up to an 8.8% increase in micro-F1, a 47% reduction in latency, and a 48% reduction in token costs, demonstrating its practicality for large-schema extraction.
[IR-22] BaRA: BFS-and-Reflection Web Data Collection Agent
Link: https://arxiv.org/abs/2607.00007
Authors: Soojeong Lee, Joseph Lee, Yongseong Cho, Sunjae Kim, Youngwoo Moon, Kyungwoo Song
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language model (LLM)-based web agents reduce manual scripting for web data collection, yet on live websites, they often miss relevant pages, return incomplete multimodal outputs, or return media URLs that are not directly downloadable. We present BFS-and-Reflection Agent (BaRA), a framework for site-level collection under a fixed interaction budget. The framework combines bounded breadth-first search (BFS) traversal with history-based self-reflection. We evaluate BaRA on 50 synthetic websites with ground-truth reference sets. We additionally test on three public websites with cluttered or dynamic layouts. BaRA outperforms Pure LLM, SeeAct-Vision, and Browser-use on link discovery and downloadable multimodal extraction, with the largest gains in download-valid image and video recovery. Our code is available at this https URL.
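The bounded breadth-first traversal under a fixed interaction budget, as described in the abstract above, can be sketched roughly as follows. This is an illustrative sketch only: `bounded_bfs`, the `get_links` callback, and the `SITE` map are hypothetical names, one page visit is assumed to cost one interaction, and the history-based self-reflection step is not modeled.

```python
from collections import deque


def bounded_bfs(start, get_links, budget):
    """Visit pages breadth-first, stopping after `budget` page visits."""
    visited = []
    seen = {start}
    queue = deque([start])
    while queue and len(visited) < budget:
        page = queue.popleft()
        visited.append(page)  # one "interaction" is spent on this page
        for link in get_links(page):
            if link not in seen:  # never enqueue the same page twice
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical site map standing in for a live website.
SITE = {
    "/": ["/news", "/gallery"],
    "/news": ["/news/1", "/news/2"],
    "/gallery": ["/gallery/a"],
    "/news/1": [],
    "/news/2": [],
    "/gallery/a": [],
}

pages = bounded_bfs("/", lambda p: SITE.get(p, []), budget=4)
print(pages)  # ['/', '/news', '/gallery', '/news/1']
```

With a budget of 4, the traversal covers both top-level sections before descending, which is the usual reason to prefer breadth-first order when the budget is smaller than the site.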
[IR-23] Topological Void Analysis A Mathematical Framework for Systematic Technical Innovation Discovery in Knowledge Spaces
Link: https://arxiv.org/abs/2607.00005
Authors: Kris Pan
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 11 pages, 3 tables, 2 case studies; arXiv Industry Track; ACM classes: I.2.6; H.3.3
Abstract:Identifying where to innovate in a dense technical domain - such as operating systems or hardware/software co-design - is fundamentally a search problem in a high-dimensional knowledge space. Existing approaches rely on keyword search, citation proximity, or human intuition, none of which formalise the notion of an unexplored region that is simultaneously relevant to a target goal and absent from prior art. We present Topological Void Analysis (TVA), a mathematical framework that defines topological voids as triads (A, B, C) in a dense-sparse hybrid embedding space. A void requires three conditions: (i) both concepts A and B are semantically cohesive with domain anchor C; (ii) their pairwise similarity falls within a calibrated marginality band - avoiding both obvious combinations and unrelated noise; and (iii) they share a sparse lexical bridge while the geodesic midpoint on the embedding hypersphere is unoccupied. Applied to ~140k indexed documents, TVA generates 2,128 invention candidates across 96 targets; 90% survive automated quality filtering, yielding 191 REVISE and 1 APPROVE verdict from four-specialist adversarial review (0.05% end-to-end). Two case studies demonstrate the framework surfaces non-obvious connective tissue rather than merely obvious related pairs.
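The three void conditions listed in the abstract above can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the thresholds (`cohesion`, `band`, `occupied`), the use of shared terms as a stand-in for the "sparse lexical bridge", and the toy 2-D vectors are not taken from the paper. The one standard fact used is that the geodesic midpoint of two non-antipodal unit vectors is their normalized sum.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity of two unit vectors (a plain dot product)."""
    return sum(a * b for a, b in zip(u, v))


def geodesic_midpoint(a, b):
    """Midpoint of the great-circle arc between unit vectors a and b."""
    return normalize([x + y for x, y in zip(a, b)])


def is_void(a, b, c, corpus, terms_a, terms_b,
            cohesion=0.5, band=(0.3, 0.8), occupied=0.95):
    """Check conditions (i)-(iii) with placeholder thresholds."""
    cohesive = cosine(a, c) >= cohesion and cosine(b, c) >= cohesion  # (i)
    marginal = band[0] <= cosine(a, b) <= band[1]                     # (ii)
    bridged = bool(set(terms_a) & set(terms_b))                       # (iii) lexical bridge
    mid = geodesic_midpoint(a, b)
    unoccupied = all(cosine(mid, d) < occupied for d in corpus)       # (iii) empty midpoint
    return cohesive and marginal and bridged and unoccupied


# Toy 2-D embeddings (hypothetical concepts A, B and anchor C).
a = normalize([1.0, 0.0])
b = normalize([3.0, 4.0])
c = normalize([1.0, 0.2])
corpus = [a, b, normalize([0.0, 1.0])]

found = is_void(a, b, c, corpus, {"scheduler", "cache"}, {"cache", "dma"})
# Adding a document exactly at the midpoint makes the region occupied.
crowded = is_void(a, b, c, corpus + [geodesic_midpoint(a, b)],
                  {"scheduler", "cache"}, {"cache", "dma"})
print(found, crowded)
```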
[IR-24] Why Advanced Encoders Lag on Sparse Retrieval? The Answer and an Approach to Bridging Vocabulary Gaps SIGIR2026
Link: https://arxiv.org/abs/2607.00004
Authors: Zhichao Geng, Yang Yang
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at SIGIR 2026. Related DOI: https://doi.org/10.1145/3805712.3809724
Abstract:While advanced foundation models like ModernBERT significantly outperform older architectures in dense retrieval, they surprisingly lag behind the aging BERT-base baseline in learned sparse retrieval (LSR). We identify the root cause as the Vocabulary Gap: modern tokenizers utilize raw, case-sensitive vocabularies designed for lossless reconstruction, which map single semantic units to redundant surface forms, wasting model capacity on morphological noise and hindering lexical matching. We formalize this intuition through a theoretical framework, demonstrating that appropriate vocabulary coarse-graining can tighten the generalization bounds by reducing complexity of the hypothesis class, provided that semantic integrity is preserved. To resolve this, we propose Vocabulary Transfer (VT), a model-agnostic framework that migrates advanced encoders to sparse-friendly, normalized vocabularies with minimal computational cost. VT utilizes a novel Semantic Initialization via spatial topology to preserve geometric structure and an Activation Potential Calibration (APC) mechanism to align pre-trained manifolds with sparsity constraints, preventing the dead neuron and dense collapse observed in standard fine-tuning. Empirically, VT is universally effective: it enables ModernBERT to achieve state-of-the-art performance on the BEIR benchmark (52.4 nDCG, a +4.7 improvement), resuscitates failing models like RoBERTa-large, and generalizes seamlessly to inference-free architectures and specialized domains. These results confirm that the performance lag is not an architectural deficiency but a solvable vocabulary mismatch. We’ve released our code and models (this https URL; all details included).
[IR-25] From “Strings” to “Things” for Personal Knowledge Graphs: Evaluating LLM Triple Extraction for Recommendation Systems
Link: https://arxiv.org/abs/2607.00003
Authors: Abhirup Dasgupta, Fernando Spadea, Oshani Seneviratne
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Personal Knowledge Graphs (PKGs) offer a privacy-preserving framework for modeling user preferences, yet constructing them from unstructured, decentralized conversational data remains a challenge. This paper bridges the gap between conversational “strings” and semantic “things” by presenting a reproducible pipeline for extracting structured user-preference triples using lightweight Large Language Models. We evaluate Qwen- and Gemma-based models on their ability to extract RDF-compliant triples linked to Wikidata identifiers from conversational data for PKG construction. Our evaluation assesses both the semantic extraction fidelity and the utility of the resulting graphs in a downstream recommendation task. We found that certain models performed well and had proportionally high downstream performance relative to their triple extraction performance.
Human-Computer Interaction
[HC-0] Touching and Feeling the Data: A Reusable Software Pipeline for Tactile Statistical Graphs in Accessible Education
Link: https://arxiv.org/abs/2607.01214
Authors: Lawrence Obiuwevwi, Krzysztof J. Rechowicz, Jessica M. Johnson, Erika Frydenlund, Vikas Ashok, Sachin Shetty, Sampath Jayarathna
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Statistical visualization is usually treated as a visual medium, but data can also be touched. Three dimensional printed tactile graphs let blind and low vision students feel distributions, trace trends, and explore relationships through direct haptic interaction. Yet classroom scale use remains limited because producing each graph in CAD software requires specialized skill and hours of manual work. We address this bottleneck as a software problem through a three layer reusable pipeline in about 1500 lines of JavaScript. The first layer derives tactile design parameters automatically from plate dimensions using tactile perception research. The second provides shared chart scaffolding and five modular builders for scatter, bar, histogram, line, and box plots. The optional third layer uses a multi-modal large language model to extract structured chart specifications from uploaded images, with mandatory teacher review before print generation. The pipeline produces print ready binary Standard Tessellation Language files in under 250 milliseconds. We present the design, performance, and limitations.
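For readers unfamiliar with the binary Standard Tessellation Language (STL) output mentioned in the abstract above, the file layout is simple: an 80-byte header, a little-endian 32-bit triangle count, then 50 bytes per triangle (a normal and three vertices as float32 values, plus a 2-byte attribute field). The sketch below writes that layout in Python purely for illustration; the paper's pipeline is in JavaScript, and the function name `binary_stl` and the toy triangle are assumptions, not the authors' code.

```python
import struct


def binary_stl(triangles, header_text=b"tactile graph sketch"):
    """Serialize triangles to binary STL bytes.

    Each triangle is (normal, v1, v2, v3); every element is an (x, y, z) tuple.
    """
    header = header_text[:80].ljust(80, b"\0")          # fixed 80-byte header
    out = [header, struct.pack("<I", len(triangles))]   # little-endian triangle count
    for normal, v1, v2, v3 in triangles:
        floats = (*normal, *v1, *v2, *v3)               # 12 float32 values
        out.append(struct.pack("<12fH", *floats, 0))    # plus a 2-byte attribute field
    return b"".join(out)


# One upward-facing triangle (toy geometry).
tri = ((0.0, 0.0, 1.0), (0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 10.0, 0.0))
data = binary_stl([tri])
print(len(data))  # 80 + 4 + 50 = 134 bytes
```

Because the format is a flat byte layout with no parsing or text formatting, generating it is typically fast, which is consistent with the sub-second generation times the abstract reports.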
[HC-1] Behavior-Adaptive Conversational Agents: Toward a Fluid Personality Framework AAAI-2026
Link: https://arxiv.org/abs/2607.01034
Authors: Hasibur Rahman, Smit Desai
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Presented at Bridging AI and Behavior Change, a Bridge Program organized at the AAAI Conference on Artificial Intelligence 2026 (AAAI-2026)
Abstract:Large language model (LLM)-based conversational agents (CAs) are now ubiquitous, creating new opportunities for AI-mediated behavior change. Their capacity to project nuanced personalities and adopt diverse metaphorical roles raises a design question: how should an agent’s persona and personality be calibrated to the moment? Recent evidence suggests that (i) moderate personality expression outperforms low or high extremes on trust, enjoyment, and intention to adopt in goal-oriented tasks, and (ii) context-appropriate metaphors outperform static one-note assistants on user experience and uptake. Yet most CAs still fix both persona and style, risking misalignment when dynamics, urgency, and formality vary, for example in medical information seeking, fitness coaching, and reflective learning. We propose a Fluid Personality Framework that jointly adapts (1) the agent’s metaphorical persona, such as coach, tutor, librarian, or tool, and (2) its personality expression intensity, low, medium, or high, as a function of task context, user goals and traits, and situational urgency. We sketch the framework and its core design dimensions.
[HC-2] SenseWalk: Agent-Based Semantic Trajectory Simulation Powered by Large Language Models in Zoned Environments
Link: https://arxiv.org/abs/2607.00989
Authors: Ziyue Lin, Xinhang Xie, Kangyi Wang, Siming Chen
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 18 pages, 7 figures
Abstract:Semantic trajectory analysis has recently emerged as an approach for modeling human movement by capturing implicit patterns and behaviors through semantic information (e.g., visitors’ profiles and goals) beyond raw spatial paths to better understand why people move in certain ways. However, analyzing semantic trajectories in real-world scenarios remains challenging, as collecting high-quality data is costly and often lacks rich semantic information. Meanwhile, existing simulation tools require substantial technical expertise, which makes them difficult for practitioners to adopt. To address these limitations, the paper proposes SenseWalk, an interactive system that supports simulating semantic trajectories by LLM-powered agents. We develop a simulation workflow that combines LLMs and the social force model to balance physical plausibility and semantic coherence. A user-friendly interface is designed to facilitate users in customizing the simulation configuration and analyzing simulation outputs. We also conduct a quantitative experiment to evaluate the effectiveness of our simulation workflow, and a user study (n=12) to assess the usefulness and efficiency of our system.
[HC-3] Visualizing Engineering Fundamentals: Design of Mixed Reality and Physical Toolkits for Effective Learning
Link: https://arxiv.org/abs/2607.00979
Authors: Mohammad Abu Nasir Rakib, Sharmin Akter, Eshwara Prasad Sridhar, Somik Biswas, Md Rassel Raihan, Mahmudur Rahman
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:This study examined students’ experiences with mixed-reality applications and physical toolkits in Engineering Mechanics to inform design guidelines for educational tools. In a user study with 24 participants, we compared classroom instruction alone, classroom instruction with a mixed-reality application, and classroom instruction with physical toolkits. Thematic analysis of participant feedback revealed that learners’ workflows and engagement with fundamental mechanics problems varied across instructional modalities. Participants valued multimodal and interactive experiences that combined visualization with hands-on interaction, while reporting challenges with complex or unclear visualizations. These insights support the human-centered design of mixed-reality and physical tools for engineering education.
[HC-4] Understanding How Humans Inject Knowledge into Machine Learning Workflows through Visual Analytics
Link: https://arxiv.org/abs/2607.00969
Authors: Yiwen Xing, Philip Beaucamp, Joyraj Chakraborty, Afrah Farea, Yuanzhe Jin, Saiful Khan, Gennady Andrienko, Natalia Andrienko, Min Chen
Subjects: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:
Abstract:Visual analytics (VA) plays an increasingly important role in supporting machine learning (ML) workflows. In the field of visualization, such approaches and techniques are referred to as VIS4ML. While ML models are mostly learned automatically, the corresponding ML workflows receive a variety of human inputs, such as data labelling, feature engineering, model architecture designing, hyper-parameter tuning, and so on. In this work, we surveyed over 200 VIS4ML papers to gain an understanding of how humans inject their knowledge into ML workflows through interactive visualization. We collected a corpus of VIS4ML papers from the IEEE VIS conferences in the past decade. We developed a coding scheme to facilitate the literature research from four perspectives: characteristics of ML, visualization, interaction, and actions. The analysis of the coded dataset allows us to observe different pathways that transfer human knowledge to ML workflows via interactive visualization. Building on the analysis, we explain the phenomena of VIS4ML using the conceptual model that views VA as model building and the information-theoretic cost-benefit analysis that reasons VA as for optimizing ML workflows. This work provides unequivocal evidence showing the merits of using VA in ML workflows. The full list of surveyed papers, along with all analysis results and figures, is available at this https URL.
[HC-5] Quantifying the Affective Gap: A Zero-Shot Evaluation of LLMs on Fine-Grained Emotion Taxonomies
Link: https://arxiv.org/abs/2607.00968
Authors: Lawrence Obiuwevwi, Krzysztof J. Rechowicz, Jessica M. Johnson, Vikas Ashok, Sachin Shetty, Sampath Jayarathna
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: in Proc. 27th IEEE Int. Conf. (IRI’2026)
Abstract:Emotion recognition in natural language is a foundational challenge in affective computing, with critical implications for human-computer interaction, mental health support, and conversational AI. This paper presents a rigorous, unified zero-shot evaluation of three leading commercial large language models: Claude (claude-sonnet-4-6), ChatGPT (GPT-5.4), and Gemini (gemini-2.5-flash). The models were queried through their respective production APIs as of April 2026 on a fine-grained 13-class emotion classification task. Using a stratified 1,000-sentence sample from the boltuix/emotions dataset, which comprises 131,306 sentences across 13 categories, a single uniform prompt with no exemplars was applied identically across all models. Gemini achieves the highest accuracy (39.9%) and macro-F1 score (0.363), followed by GPT-5.4 (38.8%, macro-F1 = 0.291) and Claude (38.0%, macro-F1 = 0.159). All models excel on sarcasm and desire while consistently failing on love, confusion, and shame. McNemar tests reveal no statistically significant pairwise differences (p > 0.10), suggesting convergence at a shared zero-shot ceiling. Claude’s markedly lower macro-F1 score exposes a class-imbalance prediction bias. These findings highlight the current limitations of frontier AI systems in zero-shot fine-grained emotion classification.
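The abstract above reports models with nearly equal accuracy but very different macro-F1 scores. The toy sketch below shows how that can happen: a model that collapses most predictions onto one class can match another model's accuracy while scoring lower on macro-F1, because macro-F1 averages per-class F1 without weighting by class frequency. The three labels and both prediction lists are made up for illustration and are not data from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


labels = ["sarcasm", "love", "shame"]
y_true = ["sarcasm"] * 4 + ["love"] * 4 + ["shame"] * 4  # balanced toy sample

# Hypothetical model A: correct answers spread evenly over the classes.
model_a = ["sarcasm", "sarcasm", "love", "shame",
           "love", "love", "sarcasm", "shame",
           "shame", "shame", "sarcasm", "love"]
# Hypothetical model B: answers "sarcasm" almost everywhere.
model_b = ["sarcasm"] * 4 + ["sarcasm"] * 3 + ["love"] + ["sarcasm"] * 3 + ["shame"]

for name, pred in (("A", model_a), ("B", model_b)):
    print(name, accuracy(y_true, pred), round(macro_f1(y_true, pred, labels), 3))
```

Both toy models get 6 of 12 sentences right, yet model B's macro-F1 is lower because its two neglected classes each have poor recall.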
[HC-6] A field experiment of social influence and behavioral contagion with bots on Reddit
Link: https://arxiv.org/abs/2607.00854
Authors: Hiroki Oda,Kinga Makovi,Taha Yasseri,Milena Tsvetkova
Categories: Social and Information Networks (cs.SI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 10 pages, 3 figures
Abstract:Recent advances in AI have heightened scholars’ and policy makers’ concern with social influence and behavioral contagion in online communities. We conduct a field experiment on Reddit to investigate the extent to which online users are susceptible to positive behavioral stimuli from other users and artificial agents. We let apparent human and bot accounts give symbolic awards to users with one of four rationales: praising the recipient’s logical argument, emotional sensitivity, or moral integrity, or explaining that the award resulted from a random draw in a lottery. We evaluate how the different rationales for the award affect the recipients’ subsequent behavior on the platform in terms of volume, impact, and content, as well as the further behavioral contagion to other users. We find that awards do not increase user activity and downstream impact, and awards from bots with the lottery rationale can in fact reduce them. Nevertheless, awards encourage direct communication between users. These findings highlight the possible resilience of online users to simple behavioral manipulation from platform algorithms and artificial agents, but not necessarily to more sophisticated schemes that simulate human conversation. Transparently labeling automated agents remains essential for ethical and effective platform governance.
[HC-7] AI-Centered Grand Challenges in Visual Analytics for Healthcare: Synthesizing the VAHC 2025 Community Experience
Link: https://arxiv.org/abs/2607.00542
Authors: Jürgen Bernard,David Gotz,Robert S Laramee,Silvia Miksch,Gabriela Morgenshtern,Renata G. Raidou,Alessio Arleo
Categories: Human-Computer Interaction (cs.HC)
Comments: 4 pages, 2 tables, reference to the OSF project behind: this https URL
Abstract:The intersection of AI, healthcare, and visualization is evolving rapidly, posing challenges that cut across disciplinary boundaries and resist easy resolution. The Visual Analytics in Healthcare workshop (VAHC), co-located every other year at the IEEE VIS conference and the AMIA (American Medical Informatics Association) annual conference, has served as a forum to connect the visualization and medical informatics community since 2010. In 2025, to celebrate the 16th edition, we used the workshop as an opportunity to consolidate the community’s collective experience (and expertise) and identify Grand Challenges where the field should prioritize going forward. We combined thematic coding of the 15 accepted VAHC workshop papers with structured group discussions among more than 40 participants, organized around three major themes: “Technical innovation vs. clinical reality”, “Human-centered and scalable VAHC”, and “From foundations to actionable insights”, followed by post-workshop reflexive analysis. Across all three groups, AI emerged as the most consistently recurring concern. In this paper, we report our AI-centered insights from the VAHC 2025 group activity, contextualize them against the broader literature along five Grand Challenges themes, and distill them into five challenge clusters, each concluded with recommendations for future research directions that cross disciplinary boundaries: (1) trust and bias, (2) data and infrastructure, (3) explainability and communication, (4) human-AI interaction, and (5) model reliability and validation. We share these challenges and their associated research directions as a starting point for discussion and collaboration across the healthcare, AI, and visualization communities. All supplemental materials are available at this https URL.
[HC-8] You Shall Not Pass! Where and Why Developers Draw The Line on AI Autonomy
Link: https://arxiv.org/abs/2607.00533
Authors: Rudrajit Choudhuri,Christian Bird,Carmen Badea,Marco Gerosa,Anita Sarma
Categories: Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
Comments:
Abstract:As AI takes on more software work, the line between human and AI effort is shifting. Where developers draw that line around AI autonomy bears on how we design tools and roles that preserve meaningful work. Drawing on cognitive appraisal theory, work design, and automation research, we conducted a mixed-methods study of 448 professional developers at Microsoft to investigate their accepted levels of AI autonomy across software engineering work. Most developers accepted AI producing work under their oversight, although accepted autonomy varied substantively across tasks and individuals. Acceptance was lowest for identity-defining, human-facing, and design-oriented work, and higher among developers with more AI experience and risk tolerance. Task accountability was associated with lower odds of allowing AI to act on developers’ behalf, whereas task identity was associated with lower odds of granting AI decision-making autonomy. Task demands had the opposite effect, increasing willingness to delegate decision-making to AI. Our findings suggest that preferences for AI autonomy reflect how developers cognitively experience their work, highlighting important considerations for designing meaningful work.
[HC-9] AI Trust and Teaming: The Humans-as-Handlers Approach for Autonomous and Opaque AI Systems
Link: https://arxiv.org/abs/2607.00523
Authors: Nathan G. Wood
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET); Systems and Control (eess.SY)
Comments:
Abstract:Artificial intelligence (AI) is becoming ubiquitous, and across domains, increasingly autonomous systems are carrying out tasks which raise significant ethical and legal challenges which demonstrate a need for strong human-machine teams rooted in trust. In this article, I argue that within highly impactful areas (such as medicine or warfighting) there are grounds for us initially treating autonomous and opaque systems as relevantly analogous to dogs (or other animals with which we have close relationships). Under this analogy, humans making use of these systems are not to be viewed as “users” or “deployers” of these systems, but instead take the role of “handlers”. This recasting of roles shifts the way we view humans, AI-enabled and autonomous systems, and the relations between them, and moreover clarifies the clear and traceable lines of responsibility humans have for the outcomes brought about when using these systems. In developing this point, I clarify that the machine-animal analogy does admit disanalogous elements, but that its touch-points ground it as a starting point. I then explore how we can divest the humans-as-handlers approach of those aspects of our relationships with animals which are unfitting for how we engage with and make use of autonomous and AI-enabled systems. I conclude by arguing that the trajectory of human-machine teamings for autonomous and AI-enabled systems should be a state where we authentically view these not as artifacts which we simply make use of, but as collaborators with which we pursue complex goals and carry out complex tasks.
[HC-10] Draped Surfaces: A Contour-Adaptive Interface Overlaid on the Physical Environment for Mixed Reality Workspaces
Link: https://arxiv.org/abs/2607.00518
Authors: SoonUk Kwon,Barrett Ens,Pourang Irani
Categories: Human-Computer Interaction (cs.HC)
Comments: Published in CHI '26
Abstract:Conventional Mixed Reality (MR) workspaces are frequently organized in cockpit-like layouts, where multiple floating windows surround the user. While this configuration facilitates access to digital content, it often induces occlusion, reducing understanding of the physical environment and limiting access to real-world objects. To overcome this challenge, we present the Contour-Adaptive Mixed Environment Overlays (CAMEO), a contour-adaptive MR interface that drapes virtual windows onto physical surfaces. This design integrates digital content with nearby items, thereby improving users’ visual access to background objects and supporting interaction with them. We evaluate CAMEO in two controlled studies. The first demonstrates that draping reduces hand-movement detours relative to flat mid-air surfaces, enabling more direct interaction with nearby items. The second shows that controlled window deformation does not significantly impair text legibility when compared to flat surfaces. Together, these findings contribute a novel design paradigm for MR workspaces that balances immersion, readability, and environmental understanding.
[HC-11] Gaze-Informed Proactive AI Assistance for Children's Picture Exploration
Link: https://arxiv.org/abs/2607.00445
Authors: Zekun Wu,Man Su,Huiyong Li,Tomohiro Nagashima,Anna Maria Feit
Categories: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Proactive assistance with large language models (LLMs) has received growing attention in the human-computer interaction (HCI) community. However, most past work on proactive LLM assistance has focused on adult users and task-oriented settings, leaving open how such systems could support children, whose interests and needs are often expressed through gaze and other nonverbal behaviors rather than explicit requests. In this study, we focus on two key challenges of proactive assistance in children’s picture exploration: when to provide assistance and what assistance to provide based on children’s nonverbal behaviors. To address these challenges, we introduce Ollie, a gaze-informed proactive artificial intelligence (AI) assistant that offers short narrative descriptions based on where a child is looking. Ollie uses children’s gaze to estimate their attention, identify their current visual focus, and select a related picture region for the LLM to verbally describe. In a within-subject experiment, we compared gaze-informed assistance with random assistance. Results show that gaze-informed assistance kept children’s attention on their current focus for a longer period of time, and guided them more effectively to related picture regions. Children, parents, and a participating kindergarten teacher viewed Ollie positively and considered that it better matched children’s interests when compared with the random assistance. This work shows the feasibility of using gaze as an implicit input for proactive AI assistance for children and provides design implications for future child-centered AI systems.
[HC-12] A Simple Solution to Improving Human Supervision of Algorithms: Evidence from Smart Vending
Link: https://arxiv.org/abs/2607.00420
Authors: Minda Zhao,Brian Rongqing Han,Xin Chen,Tao Zhu
Categories: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Organizations increasingly deploy autonomous artificial intelligence (AI) systems for operational decisions, such as inventory replenishment. Yet fully granting override rights can degrade performance due to human bias and noise, while prohibiting them may overlook valuable private information. This raises a key question: How should override rights be structured to improve human supervision of autonomous AI? Methodology/results: We propose a constrained override policy that limits overrides per decision episode to enable selective filtering that prioritizes high-value overrides. We tested it through a randomized field experiment with 553 workers at a major Chinese smart vending machine retailer that manages more than 59,000 machines and 4,000 SKUs. Workers were assigned to no overrides, free overrides, or a two-per-machine limit on downward overrides. Free overrides reduce inventory by 1.95% but also cut sales by 1.19%. Constrained overrides reduce inventory by 1.28% without harming sales, as workers select better SKUs to override, confirmed via local average treatment effects. Gains are largest for experienced workers, high-incentive SKUs, and growth-stage SKUs. A simulated personalized policy further increases sales probability by 9.1%. Managerial implications: Academics gain novel insights from the causal effects of discretion design in human-supervised AI, emphasizing selective filtering to enhance decision quality. Managers can benefit from a scalable, low-cost policy for operations such as retail, logistics, and resource planning, reducing excess inventory without sales loss while harnessing private human information, with no need for algorithmic redesign, information customization, or additional training.
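The constrained override policy in this abstract can be pictured with a small sketch. Everything below is an assumption made for illustration: the `Proposal` fields, the `priority` score (standing in for the worker's own ranking of which overrides matter most), and the decision to ignore non-downward proposals are hypothetical and not taken from the paper. The sketch only shows the filtering idea: with a cap of two downward overrides per machine, only the highest-priority proposals survive and every other SKU keeps the algorithm's quantity.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    machine_id: str
    sku: str
    algo_qty: int     # replenishment quantity recommended by the algorithm
    worker_qty: int   # quantity the worker would prefer instead
    priority: float   # hypothetical: how strongly the worker wants this override


def select_overrides(proposals, limit_per_machine=2):
    """Keep at most `limit_per_machine` downward overrides per machine."""
    per_machine = defaultdict(list)
    for p in proposals:
        # Assumption: only downward overrides are considered in this sketch.
        if p.worker_qty < p.algo_qty:
            per_machine[p.machine_id].append(p)
    accepted = []
    for machine_id in sorted(per_machine):
        ranked = sorted(per_machine[machine_id], key=lambda p: p.priority, reverse=True)
        accepted.extend(ranked[:limit_per_machine])
    return accepted


proposals = [
    Proposal("m1", "cola", 10, 6, 0.9),
    Proposal("m1", "chips", 8, 7, 0.2),
    Proposal("m1", "tea", 12, 5, 0.7),
    Proposal("m1", "gum", 4, 9, 0.8),     # upward change, ignored here
    Proposal("m2", "water", 20, 15, 0.5),
]

accepted = select_overrides(proposals)

# Final quantities: the algorithm's recommendation unless an override was accepted.
final_qty = {(p.machine_id, p.sku): p.algo_qty for p in proposals}
for p in accepted:
    final_qty[(p.machine_id, p.sku)] = p.worker_qty

print([(p.machine_id, p.sku) for p in accepted])
print(final_qty)
```

The cap forces a ranking step, which is one way to read the abstract's "selective filtering": low-priority overrides such as the `chips` proposal are dropped rather than applied.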
[HC-13] A Penny for Your Prompts: Experiments Detecting and Mitigating LLM Usage by Survey Respondents
Link: https://arxiv.org/abs/2607.00403
Authors: Zane Xu,Nathan Malkin
Categories: Human-Computer Interaction (cs.HC); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
Comments: Published at SOUPS 2026 (Symposium on Usable Privacy and Security)
Abstract:Large language models are increasingly used by participants on crowdsourcing platforms when responding to surveys, potentially undermining the validity of collected data. Our study aims to quantify the prevalence of this behavior and investigate methods to detect and prevent it. In a series of surveys (N = 250), we examined conditions such as platform choice, survey length, requests not to use AI, and disabling copy-paste functionality. We were able to identify distinct characteristics of LLM-assisted responses and found that their frequency varied widely, from under 10% on Prolific to over 80% on Mechanical Turk. Mitigation measures reduced LLM usage but did not necessarily improve data quality. No participants employed browser-use agents at the time of our survey, but we report on our own detection experiments. We recommend that researchers actively screen survey responses for LLM usage by recording and analyzing keystroke data and crafting instructions and questions aimed at AI.
[HC-14] Child Safety in Generative AI: An Expert-Guided and Incident-Grounded Evaluation Framework
Link: https://arxiv.org/abs/2607.00395
Authors: Haein Kong
Categories: Human-Computer Interaction (cs.HC)
Comments: Accepted to the HEAL Workshop at CHI 2026
Abstract:As generative AI is increasingly used by children and adolescents, there is a growing need for risk evaluation frameworks that account for child-specific harms. However, most existing safety evaluation frameworks focus on general user populations, often overlooking risks unique to younger users. To address this gap, we propose an evaluation framework that integrates expert-guided risk factors with real-world AI incident data for child safety. The framework identifies hazard categories from expert guidelines and AI incident databases and uses this information to construct a synthetic test set for model evaluation. Particularly, we apply the framework to the education domain and evaluate three Llama Guard models on their ability to detect unsafe user prompts. Our results show that current Llama Guard models struggle to identify education-related unsafe user prompts. We conclude by discussing how future work can extend the evaluation to additional risk categories and incorporate domain experts throughout the evaluation pipeline.
[HC-15] A Text-Steerable Instrument for Sketching Procedural Soundscapes via Language Models
Link: https://arxiv.org/abs/2607.00309
Authors: Prabal Gupta (Rama Labs, Kitchener, Canada)
Categories: Sound (cs.SD); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 10 pages, 7 figures, 2 tables. Accepted to the International Conference on New Interfaces for Musical Expression (NIME 2026), London, UK. Supplementary material included as an appendix. Code and demo: this https URL
Abstract:We present a real-time musical interface that converts natural-language scene descriptions into evolving procedural soundscapes. A performer types a prompt such as “warm jazz cafe at midnight” and steers it through direct parameter adjustments - stepping brightness down, switching a rhythm style - each producing a predictable, audible shift without re-prompting. Where GPU-bound text-to-audio systems synthesize monolithic waveforms, our instrument generates human-readable configurations over a categorical schema, enabling fine-grained performer control; most valid combinations are designed to sound musically coherent. Three interchangeable backends - embedding retrieval for sub-second CPU-only use, hosted LLMs via API, and a fine-tuned 270M local model - all emit the same schema. A live generator architecture continuously emits audio while resolving new instructions in the background, crossfading seamlessly when ready; even when an LLM takes 5-12 seconds to respond, the audience hears uninterrupted sound - reframing text-to-music as an ongoing performable stream rather than a one-shot generation. We evaluate text-audio semantic alignment using LAION-CLAP on held-out prompts as a technical proxy, finding that retrieval-based configuration outperforms random valid configurations on this metric, while noting that LAION-CLAP also informed retrieval-map construction. We report performance observations, informal listener feedback, and release materials for the SDK, dataset artifacts, model, and audiovisual performance interface.
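The "keep playing, then crossfade when ready" idea in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the instrument's implementation: the function names `crossfade` and `stitch` are hypothetical, audio is represented as plain lists of float samples, and a linear ramp is used because the abstract does not say which fade curve the system applies.

```python
def crossfade(old_block, new_block):
    """Linearly fade from old_block to new_block, sample by sample."""
    n = min(len(old_block), len(new_block))
    if n == 0:
        return []
    if n == 1:
        return [float(new_block[0])]
    mixed = []
    for i in range(n):
        t = i / (n - 1)  # ramps from 0.0 (all old) to 1.0 (all new)
        mixed.append((1.0 - t) * old_block[i] + t * new_block[i])
    return mixed


def stitch(old, new, ready_at, fade_len):
    """Play `old` until the new configuration is ready, then fade into `new`.

    `old` and `new` are assumed to be equal-length streams on the same clock;
    `ready_at` is the sample index at which the new stream becomes available.
    """
    head = old[:ready_at]
    fade = crossfade(old[ready_at:ready_at + fade_len],
                     new[ready_at:ready_at + fade_len])
    tail = new[ready_at + fade_len:]
    return head + fade + tail


# Toy streams: a constant 1.0 "old" sound and a constant 0.0 "new" sound.
old = [1.0] * 8
new = [0.0] * 8
out = stitch(old, new, ready_at=2, fade_len=5)
print(out)
```

Because the old stream keeps supplying samples until `ready_at`, the output never has a gap, regardless of how long the new configuration took to resolve. An equal-power curve would be a common alternative to the linear ramp used here.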
[HC-16] May (A)I Beautify Your Visualization? Expert Judgments of Acceptable Aesthetic Alterations
Link: https://arxiv.org/abs/2607.00239
Authors: Kalina Borkiewicz,Jixian Li,Joshua A. Levine,Katherine E. Isaacs
Categories: Human-Computer Interaction (cs.HC); Graphics (cs.GR)
Comments:
Abstract:In 3D visualizations of natural phenomena, improving aesthetics can provide measurable benefits, but often involves transformations that affect how the data is perceived. As a growing range of tools - including AI-based methods - make visual design and modification more accessible, it is increasingly important to understand trade-offs and concerns when making these changes. We conducted an expert survey (N=95) with visualization researchers, practitioners, and domain scientists, investigating reactions to fifteen alterations spanning presentation-level adjustments (e.g., lighting, camera position) and data-level modifications (e.g., removing errors, filling gaps), applied by both humans and AI systems. Results show differences in perceived acceptability are driven by the transformation’s meaning, regardless of whether it operates at the presentation or data level. Additionally, certain modifications were consistently judged as more permissible than others regardless of human or AI authorship. While this relative ordering remains largely stable, AI-generated transformations are consistently rated as less acceptable than identical human-produced changes. These results reveal a distinction between more permissible and more sensitive alterations, and suggest the need for both designers and AI-assisted visualization tools to incorporate constraints and guardrails that reflect these differences.
[HC-17] Constructing Epistemic AI Literacy: Detecting Epistemic Aims and Processes in Student-AI Co-Programming
Link: https://arxiv.org/abs/2607.00211
Authors: Mengqian Wu
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Epistemic thinking plays a central role in students’ learning processes when applying generative artificial intelligence (GenAI), particularly in programming contexts where learners must construct queries, evaluate and validate AI-generated outputs, and regulate problem-solving strategies. This study introduces the conceptual framework of Epistemic AI Literacy (EAIL), reframing AI literacy as a process-oriented epistemic phenomenon that emerges through dynamic human-AI interactions across different domains. Drawing on the AIR (epistemic aims, ideals and reliable epistemic processes) framework, this study examines how epistemic aims and epistemic processes are enacted in GenAI-supported co-programming activities and explores scalable approaches for operationalizing these constructs in interaction data. Using a large dialogue dataset of human-AI co-programming, this study identifies observable dimensions of epistemic aims (i.e., mastery-oriented aims) and epistemic processes (i.e., outsourcing, explanation seeking, verification seeking, prompt monitoring, and epistemic justification). The results reveal a prevalent lack of EAIL, with 78.8% of student-GenAI interactions relying on non-mastery-oriented aims and less reliable epistemic strategies like outsourcing and verification-seeking. Conversely, only 11.1% of interactions showed high epistemic engagement, where mastery-oriented aims were coupled with advanced epistemic strategies like epistemic justification in a more reliable epistemic process.
[HC-18] Comparing the Emotional Impact of Thematic Versus Episodic Framing in Visualization Text
Link: https://arxiv.org/abs/2607.00103
Authors: Poorna Talkad Sukumar,Maurizio Porfiri,Oded Nov
Categories: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Although textual framing in data visualizations is known to influence comprehension, recall, and perceptions of bias, its effects on viewers’ emotional responses remain underexplored. Drawing on two widely studied framing strategies in political communication, we examine how episodic framing (foregrounding a specific event) versus thematic framing (foregrounding broader trends) affects emotional and attitudinal responses to visualizations. We conducted a preregistered, between-subjects online experiment (N = 800) in which participants viewed identical visualizations of U.S. mass shooting data that varied only in textual framing: a thematic title, a thematic title with annotation, or an episodic title paired with the same annotation. Results show that episodic framing elicited significantly more negative emotional valence than both thematic conditions. In contrast, adding an annotation to a thematic title did not alter emotional impact. While framing did not significantly affect policy attitudes, mediation analysis revealed a significant indirect effect: increased negative emotion under episodic framing predicted greater support for gun control. These findings position emotion as a critical, yet underexamined, dimension of how textual framing shapes responses to data visualizations.
[HC-19] DigitalCoach: Communication and Grounding Gaps in Human and Agentic Computer Use Coaching
Link: https://arxiv.org/abs/2606.31980
Authors: Meng Chen,Anya Ji,Tsung-Han Wu,Tobias Maringgele,David M. Chan,Alane Suhr,Amy Pavel
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Agents are increasingly capable of automating software tasks, but can they teach humans how to use software themselves? We introduce DigitalCoach, a multimodal dataset of 72 human expert-novice computer use coaching sessions consisting of 22,752 dialogue turns grounded in 28.1 hours of screen and input event recordings across five software applications. We use DigitalCoach to evaluate whether state-of-the-art models can teach humans how to use computers. Automated evaluation shows that models differ from humans in how they coach: models provide more direct instructions, but fewer explanations, error diagnoses, and knowledge-check questions. When we fix the coaching method, models produce utterances similar to human references yet poorly grounded in visual context. Interactive evaluation confirms that model coaches cause learners to passively follow instructions without deeper engagement and fall short in visual grounding. DigitalCoach lays a foundation for collaborative and proactive computer use coaching agents.
[HC-20] Surfacing Variations to Calibrate Perceived Reliability of MLLM-generated Image Descriptions
Link: https://arxiv.org/abs/2507.15692
Authors: Meng Chen,Akhil Iyer,Amy Pavel
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 6 figures
Abstract:Multimodal large language models (MLLMs) provide new opportunities for blind and low vision (BLV) people to access visual information in their daily lives. However, these models often produce errors that are difficult to detect without sight, posing safety and social risks in scenarios from medication identification to outfit selection. While BLV MLLM users use creative workarounds such as cross-checking between tools and consulting sighted individuals, these approaches are often time-consuming and impractical. We explore how systematically surfacing variations across multiple MLLM responses can support BLV users to detect unreliable information without visually inspecting the image. We contribute a design space for eliciting and presenting variations in MLLM descriptions, a prototype system implementing three variation presentation styles, and findings from a user study with 15 BLV participants. Our results demonstrate that presenting variations significantly increases users’ ability to identify unreliable claims (by 4.9x using our approach compared to single descriptions) and significantly decreases perceived reliability of MLLM responses. 14 of 15 participants preferred seeing variations of MLLM responses over a single description, and all expressed interest in using our system for tasks from understanding a tornado’s path to posting an image on social media.
Computer Vision
[CV-0] Ink3D: Sculpting 3D Assets with Extremely Complex Textures via Video Generative Models ECCV2026
Link: https://arxiv.org/abs/2607.01222
Authors: Yue Han,Chong Li,Zhening Liu,Cong Huang,Fang Deng,Yong Liu,Fangyun Wei,Yan Lu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026. Project page: this https URL
Abstract:Recent 3D generative models can synthesize high-quality geometry but often struggle to reproduce intricate textures from reference images, largely due to the scarcity of large-scale 3D training data with rich surface appearance. In contrast, visual generative models are trained on datasets several orders of magnitude larger and excel at modeling complex visual patterns. Motivated by this gap, we introduce Ink3D, a framework that bridges 3D generation with large-scale video generative models to synthesize extremely complex textures. Ink3D first reconstructs a white-mesh geometry using an off-the-shelf 3D generation model. It then employs OrbitPainter, a conditional video generative model, to produce dense orbit-scan videos capturing object appearance across viewpoints. To convert these views into coherent textures, we introduce TextureOptimizer, a neural baking module that integrates dense multi-view observations while mitigating geometry inconsistencies arising from video generation. By decoupling geometry and texture synthesis and leveraging large-scale pretrained video priors, Ink3D enables significantly richer and more faithful texture generation than prior approaches.
[CV-1] Linkify: Learning from Interface-Augmented Assembly Graphs
Link: https://arxiv.org/abs/2607.01205
Authors: Anushrut Jignasu,Daniele Grandi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code is available at this https URL
Abstract:We present Linkify, a framework for learning from interface-augmented assembly graphs to enable context-aware part retrieval in mechanical assemblies. While recent generative AI methods for CAD have focused largely on isolated parts or monolithic assemblies, the rich geometric information at the interfaces between parts, where function is realized, remains underexplored. We address this gap by recomputing high-fidelity interface geometry for the Fusion 360 Gallery Assembly dataset, correcting missing and erroneous contacts, and generating point-cloud representations of local contact regions. Using this data, we construct assembly graphs whose nodes encode part geometry and whose edges encode interface geometry via a pretrained point-cloud encoder. On top of this representation, we train a Graph Attention Network based on GATv2 to solve a masked part prediction task: given an assembly with one part held out, the model predicts the class of the missing component from a large vocabulary of geometrically clustered parts, thereby approximating a realistic part-retrieval scenario. Compared to non-graph baselines such as logistic regression and k-nearest neighbors operating on aggregated node features, Linkify achieves higher Top-K accuracy and F1 scores. Ablation studies on graph connectivity, edge attributes, and attention mechanisms demonstrate that accurate contact computation and dynamic attention over interfaces are critical for performance. Our corrected interface dataset and training pipeline, released publicly, provide a foundation for future interface-aware models for assembly retrieval, validation, and generative design.
[CV-2] World from Motion: Generative Dynamic Gaussian Reconstruction from Monocular Video
Link: https://arxiv.org/abs/2607.01202
Authors: Liyuan Zhu,Shengyu Huang,Amrita Mazumdar,Tianye Li,Zan Gojcic,Gordon Wetzstein,Iro Armeni,Shalini De Mello,Alex Trevithick
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: Project page: this https URL
Abstract:We present World from Motion, a method for generating freely renderable dynamic 3D Gaussian representations from monocular videos. Our approach conditions a video model on dense, pixel-aligned renderings that encode appearance, geometry, and 3D scene motion along both input and target camera trajectories to correct rendering artifacts and fill in missing regions from an initial reconstruction. To train this model, we construct a dataset of aligned multiview video pairs and dynamic 3DGS representations, with simulated artifacts characteristic of monocular reconstruction. At test time, we distill the model’s generations, including newly observed regions and motions, back into a single consistent, high-quality dynamic 3DGS, improving both novel-view synthesis and the underlying 3D motion. Our method sets a new state of the art in 4D reconstruction and seamlessly generalizes to in-the-wild videos with large viewpoint changes and dynamic motions.
[CV-3] Perceive-to-Reason: Decoupling Perception and Reasoning for Fine-Grained Visual Reasoning
Link: https://arxiv.org/abs/2607.01191
Authors: Hongxing Li,Xiufeng Huang,Dingming Li,Wenjing Jiang,Zixuan Wang,Haolei Xu,Hanrong Zhang,Haiwen Hong,Longtao Huang,Hui Xue,Weiming Lu,Jun Xiao,Yueting Zhuang,Yongliang Shen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL
Abstract:Fine-grained visual reasoning remains challenging for vision-language models, especially when small but critical visual cues are buried in high-resolution images. Existing approaches rely on repeated cropping or test-time visual search to introduce local evidence, but they typically do not explicitly distinguish perception from reasoning. In this paper, we propose Perceive-to-Reason (P2R), a unified framework that formulates fine-grained visual reasoning as a two-stage process: the model first localizes question-relevant evidence as a Perceiver, and then answers the question as a Reasoner based on the annotated image and cropped regions. To better align training with this decoupled formulation, we further introduce Perception-Reasoning Alternating GRPO (PRA-GRPO), a role-aware reinforcement learning strategy that alternates between perception-focused and reasoning-focused updates using only final-answer supervision. Built on top of Qwen3-VL-Instruct-2B/4B/8B, P2R consistently improves performance across model scales. In particular, P2R-4B achieves 93.2% on V-Star, 81.9% on HR-Bench-4K, and 80.5% on HR-Bench-8K, substantially outperforming its corresponding backbone. Further experiments show that the benefits of P2R extend beyond high-resolution benchmarks to broader multimodal reasoning tasks. These results suggest that explicitly decoupling perception from reasoning provides an effective framework for fine-grained visual reasoning.
[CV-4] High-dimensional Embedding Prior for Noisy K-space Domain MRIReconstruction
Link: https://arxiv.org/abs/2607.01176
Authors: Yu Guan,Tianjia Huang,Qinrong Cai,Qiuyun Fan,Dong Liang,Qiegen Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Magnetic resonance imaging (MRI) reconstruction under realistic acquisition conditions can be fundamentally viewed as estimating the underlying k-space distribution from incomplete and noise-corrupted measurements. While diffusion models have recently shown strong potential as generative priors for inverse problems, existing approaches struggle to handle noisy reconstruction settings, especially when operating directly in the k-space domain. In this work, we propose a unified high-dimensional k-space reconstruction framework tailored for noisy inverse problems, which enhances diffusion-based solvers through representation this http URL underlying optimization procedures, the proposed framework augments the data representation space, enabling existing diffusion-based solvers to operate on enriched k-space embeddings with improved expressiveness. Extensive experiments on both in-house and public datasets across varying noise levels and undersampling factors demonstrate that the proposed framework consistently improves reconstruction quality for multiple diffusion-based inverse solvers. Notably, the largest gains are observed in high-noise regimes, which is consistent with our theoretical analysis of error propagation under high-dimensional representation. These results suggest that high-dimensional representation provides a general and model-agnostic mechanism for improving diffusion-based MRI reconstruction in noisy settings, offering a new perspective on robust k-space generative modeling for practical inverse problems. The code will be available at this https URL.
[CV-5] Structured 4D Latent Predictive Model for Robot Planning
Link: https://arxiv.org/abs/2607.01166
Authors: Zhiyi Li,Peilin Wu,Xiaoshen Han,Ruojin Cai,Yilun Du
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Video predictive models are emerging as a powerful paradigm in robotics, offering a promising path toward task generalization, long-horizon planning, and flexible decision-making. However, prevailing approaches often operate on 2D video sequences, inherently lacking the 3D geometric understanding necessary for precise spatial reasoning and physical consistency. We introduce a Structured 4D Latent Predictive Model, which predicts the evolution of a scene’s 3D structure in a structured latent space conditioned on observations and textual instructions. Our representation encodes the scene holistically and can be decoded into diverse 3D formats, enabling a more complete and 3D consistent scene understanding. This structured 4D latent predictive model serves as a planner, generating future scenes that are translated into executable actions by a goal-conditioned inverse dynamics module. Experiments demonstrate that our model generates futures with strong visual quality, substantially better 3D consistency and multi-view coherence compared to state-of-the-art video-based planners. Consequently, our full planning pipeline achieves superior performance on complex manipulation tasks, exhibits robust generalization to novel visual conditions, and proves effective on real-world robotic platforms. Our website is available at this https URL.
[CV-6] EquiSteer: Cross-Attention Steering Towards a Fairer Text-Guided Image Generation
Link: https://arxiv.org/abs/2607.01147
Authors: Tatiana Gaintseva,Akshit Achara,Gregory Slabaugh,Jiankang Deng,Ismail Elezi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Text-to-image diffusion models power everyday creative tasks, but they still reproduce the demographic biases in their training data. On common prompts such as “a photo of a nurse” or “a photo of a CEO”, they skew their outputs toward one gender, driven by the statistics of training data rather than anything in the text. Existing debiasing methods show promise in narrow settings but require retraining, batch-level control, or prompt-specific tuning, limiting their scalability. We propose EquiSteer, a training-free method that works per sample by steering cross-attention (CA) activations at inference time. For each target attribute, EquiSteer precomputes steering vectors from contrastive prompts. Then at generation time, a prompt-aware gate leaves attribute-specific prompts untouched, while for neutral ones it clears existing attribute signals from the CA activations and injects a target attribute. Across SD-1.5, SD-2.1, SDXL, and SANA, EquiSteer reduces the average parity gap by up to 87%, with minimal effect on image quality and text-image alignment. Code is available at this https URL.
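The gate, clear, and inject steps can be pictured with a small sketch. It assumes a steering direction computed as the normalized difference of mean activations from two contrastive prompt sets, clearing by removing the projection onto that direction, and injection by adding a scaled copy of it. These are generic activation-steering choices made for illustration; the function names, the toy 2-D activations, and the boolean gate are hypothetical and may differ from what EquiSteer actually does.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equally sized activation vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def steering_direction(acts_a, acts_b):
    """Unit vector pointing from attribute B toward attribute A.

    Assumes the two mean activations differ (non-zero norm)."""
    diff = [a - b for a, b in zip(mean_vector(acts_a), mean_vector(acts_b))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def steer(activation, direction, strength, prompt_is_neutral):
    """Gate, then clear the existing attribute component, then inject the target."""
    if not prompt_is_neutral:
        # Attribute-specific prompts pass through unchanged.
        return list(activation)
    coeff = sum(a * d for a, d in zip(activation, direction))
    cleared = [a - coeff * d for a, d in zip(activation, direction)]
    return [c + strength * d for c, d in zip(cleared, direction)]

# Toy activations: attribute A prompts vs. attribute B prompts.
direction = steering_direction([[1.0, 0.0], [3.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
steered = steer([0.7, 0.2], direction, strength=-0.5, prompt_is_neutral=True)
untouched = steer([0.7, 0.2], direction, strength=-0.5, prompt_is_neutral=False)
```

After steering, the component along `direction` equals `strength` regardless of its original value, while the orthogonal component is left alone.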
[CV-7] Relation-Centric Open-Vocabulary 3D Gaussian Segmentation
Link: https://arxiv.org/abs/2607.01140
Authors: Eunsung Cha,Hyunjoon Lee,Jaesik Park
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL
Abstract:Open-vocabulary 3D Gaussian segmentation is challenging because it requires language understanding for diverse queries and accurate separation of Gaussians along object boundaries. Prior approaches either embed language knowledge into individual Gaussians to improve query responsiveness or optimize per-Gaussian instance features to encode object identity. However, these strategies may produce noisy Gaussian segmentations or rely on cost-inefficient per-scene optimization. We propose PairGS, a framework that reframes Gaussian segmentation as modeling pairwise relations between Gaussians. 3D Gaussian representations provide rich signals for relation estimation, such as view contribution weights and multi-view mask evidence. By leveraging these cues, PairGS explicitly constructs a relation graph for segmentation without a heavy optimization process. PairGS first proposes sparse edge candidates using low-dimensional descriptors, computes precise pairwise affinities only on those candidates, and builds a hierarchical cluster tree for multi-granular querying. It achieves state-of-the-art results on open-vocabulary 3D Gaussian segmentation benchmarks, while the fast variant is 50x faster than optimization-based instance-feature approaches.
[CV-8] SD-RouteFusion: Ego-Trajectory Prediction with SD-Map Route Conditioning
Link: https://arxiv.org/abs/2607.01139
Authors: Sviatoslav Voloshyn,Bruno K. W. Martens,Wangxin Liu,Jakob Vinkås,Junsheng Fu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 4 figures, 29th International Conference on Information Fusion
Abstract:This paper presents SD-RouteFusion, a deployable end-to-end ego-trajectory prediction method that fuses a front-facing camera, vehicle kinematics, and a navigation route derived from a Standard Definition (SD) map. Unlike approaches that rely on High Definition (HD) map geometry, SD-RouteFusion aligns the learning objective with scalable and production-ready SD-map route inputs, enabling route-aware prediction without requiring HD-map infrastructure. First, we demonstrate that SD-map route prior provides a powerful long-horizon semantic prior. Through a comprehensive study on a large-scale real-world dataset comprising 480k driving scenarios across 10 European countries and the U.S., we quantify the value of SD-route conditioning: incorporating SD-map routes yields a 10.5% ADE improvement over an image-and-kinematics baseline, while our full fusion strategy achieves a 16.9% ADE reduction given a prediction horizon of 8 seconds. The fusion strategy consists of a dual-hypothesis design paired with a gated classifier, to ensure robustness under route corruption and visual uncertainty. Finally, to support broader evaluation, we release an SD-route generation toolkit that enables SD-route-conditioned ego-trajectory prediction on all datasets containing ego pose and future trajectories. Together, SD-RouteFusion establishes a practical path toward robust, route-aware ego-trajectory prediction at scale.
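For readers unfamiliar with the metric, ADE (average displacement error) is the mean Euclidean distance between predicted and ground-truth positions over the prediction horizon. The sketch below shows the metric and how a relative reduction is computed; the trajectories and the baseline and improved values are toy numbers, not results from the paper.

```python
import math

def ade(pred, gt):
    """Average displacement error: mean Euclidean distance over time steps."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def relative_reduction(baseline, improved):
    """Percentage by which `improved` lowers an error relative to `baseline`."""
    return 100.0 * (baseline - improved) / baseline

# Toy trajectories as (x, y) positions in metres.
gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
pred = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
toy_ade = ade(pred, gt)
toy_gain = relative_reduction(2.0, 1.5)
```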
[CV-9] Towards Metric-Agnostic Trajectory Forecasting ECCV2026
Link: https://arxiv.org/abs/2607.01133
Authors: Markus Knoche,Daan de Geus,Bastian Leibe
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: ECCV 2026. Project page at this https URL
Abstract:Accurate trajectory forecasting of surrounding traffic participants is a core capability for autonomous driving, enabling vehicles to anticipate behavior and plan safe maneuvers. We observe that current state-of-the-art forecasting models on Argoverse 2 and the Waymo Open Motion Dataset tailor their training objectives to the different benchmark metrics. Because these metrics encourage conflicting behavior, we propose a paradigm change for trajectory forecasting: training models with metric-agnostic probabilistic objectives and treating metric optimization as a downstream task applied to the predictive distribution. Concretely, we introduce Trajectory Distribution Evaluation (TraDiE) policies, metric-specific policies that map a predictive distribution to the set of K trajectories and confidences required by trajectory forecasting metrics. We evaluate this framework by introducing DONUT-NLL, which adapts the training objective of the state-of-the-art trajectory forecasting model DONUT to directly optimize the predictive distribution. Using our policies, DONUT-NLL achieves state-of-the-art results on all metrics of the Waymo motion prediction benchmark.
[CV-10] Autonomous Scientific Discovery via Iterative Meta-Reflection
Link: https://arxiv.org/abs/2607.01131
Authors: Bingchen Zhao,Sara Beery,Oisin Mac Aodha
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Autonomous scientific discovery systems offer the potential to accelerate research by automating the process of hypothesis generation and validation. However, current systems operate within constrained search spaces or require predefined research questions, limiting their capacity for true open-ended inquiry. Furthermore, while they generate hypotheses iteratively, they largely lack the ability to explicitly synthesize their own accumulated findings to uncover complex, interconnected phenomena. We introduce DiscoPER, an autonomous large language model-powered framework that conducts open-ended research by dynamically generating and executing code to explore datasets without pre-specified research objectives. To ensure rigorous scientific validity, every proposed discovery must pass statistical testing. To overcome the limitations of isolated search, our framework introduces a second-order reasoning mechanism that periodically analyzes its own accumulated discoveries. By treating prior discoveries as empirical data, DiscoPER identifies structural patterns, confounds, and epistemic gaps, actively redirecting hypothesis exploration toward uncharted regions of the search space. The search space is further expanded by incorporating tool use, enabling the system to explore hypotheses beyond structured metadata by seamlessly processing and extracting useful information from multimodal sources like images. Evaluated on iNatDisco, a new multimodal ecological knowledge benchmark with pattern-level ground truth obtained from peer-reviewed literature, DiscoPER recovers 8 of 9 known patterns with a 72.7% hypothesis support rate, outperforming both classical causal discovery and LLM-guided baselines. Ablations show that DiscoPER scales with more data and confirm the benefits of second-order meta-reflection.
[CV-11] MoHallBench: A Benchmark for Motion Hallucination in Video Large Language Models
Link: https://arxiv.org/abs/2607.01117
Authors: Jiale Li,Sihan Chen,Mengyuan Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 5 figures
Abstract:Video Large Language Models (VideoLLMs) have shown strong progress in video understanding, yet they still suffer from hallucinations that are inconsistent with visual evidence. Existing benchmarks mainly focus on object hallucination or coarse action perception, leaving a key video-specific problem underexplored: motion hallucination, in which models infer human motions that are absent from the video. We present MoHallBench, a benchmark for diagnosing motion hallucination in VideoLLMs. MoHallBench systematically evaluates three major sources of hallucination: co-occurrence priors, sequential inference, and similarity confusion. It contains 11,306 video clips and 40,493 question-answer pairs, covering binary-choice, multiple-choice, and generative settings. We further introduce a bi-directional questioning protocol with bias-aware metrics to reduce affirmation bias in binary evaluation. Experiments on ten recent open-source VideoLLMs reveal a clear decoupling between action recognition and hallucination resistance, as models that perform well on positive action recognition often fail on adversarial negatives. Among all settings, sequential inference hallucination is the most severe, showing that current models tend to over-infer expected outcomes from partial motion cues. Our analyses further confirm that stronger priors and finer-grained similarity substantially amplify hallucination. We hope MoHallBench can facilitate future evaluation and mitigation of motion hallucination in VideoLLMs.
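One way to picture bi-directional questioning with bias-aware scoring is sketched below. It assumes each item is asked twice, once about a motion that is present (expected answer "yes") and once about a motion that is absent (expected answer "no"), and that a model is credited only when both answers are right. The metric names and the yes-rate statistic are illustrative assumptions; the benchmark's actual protocol and metrics may be defined differently.

```python
def bias_aware_scores(pairs):
    """pairs: list of (answer_about_present_motion, answer_about_absent_motion)."""
    n = len(pairs)
    pos_acc = sum(a == "yes" for a, _ in pairs) / n
    neg_acc = sum(b == "no" for _, b in pairs) / n
    # Credit an item only if both directions are answered correctly.
    pair_acc = sum(a == "yes" and b == "no" for a, b in pairs) / n
    # Fraction of all answers that are "yes"; 0.5 is unbiased on balanced pairs.
    yes_rate = sum((a == "yes") + (b == "yes") for a, b in pairs) / (2 * n)
    return {"pos_acc": pos_acc, "neg_acc": neg_acc,
            "pair_acc": pair_acc, "yes_rate": yes_rate}

# Toy answers from a model that tends to say "yes".
answers = [("yes", "yes"), ("yes", "no"), ("yes", "yes"), ("no", "no")]
scores = bias_aware_scores(answers)
```

In the toy data the model looks strong on positive questions (0.75) but the paired accuracy drops to 0.25, which is the kind of gap between recognition and hallucination resistance the abstract describes.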
[CV-12] CPDDNet: Color-Polarization Denoising and Demosaicking Network ICIP2026
Link: https://arxiv.org/abs/2607.01100
Authors: Qihang Zhang,Yusuke Monno,Masayuki Tanaka,Masatoshi Okutomi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Presented at ICIP 2026. Project Page: this http URL
Abstract:Color-polarization imaging using a color-polarization filter array (CPFA) sensor captures both texture (color intensity) and physical (polarization) information of the scene in a single shot, enabling various applications in computer vision. However, the raw mosaic output from a CPFA sensor often suffers from severe noise and resolution loss, especially under low-light conditions. Existing methods generally focus on either denoising or demosaicking tasks, failing to capture the coupling between them and neglecting shared low-level features. In this paper, we propose a color-polarization denoising and demosaicking network (CPDDNet), which is a joint framework that performs noise removal and CPFA interpolation using a feature fusion module that retains the features from the CPFA raw data at both the denoising and the demosaicking stages. Experimental results demonstrate that CPDDNet significantly enhances image quality and polarization parameter accuracy, outperforming existing approaches on a real dataset.
[CV-13] LongVQUBench: Benchmarking Long-Term Video Quality Understanding of Vision-Language Models
Link: https://arxiv.org/abs/2607.01086
Authors: Arpita Nema,Hanwei Zhu,Xi Zhang,Weisi Lin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at European Conference on Computer Vision 2026
Abstract:The evaluation of long-term video quality understanding remains an open challenge for large vision-language models (LVLMs). Existing video quality benchmarks predominantly focus on short clips and isolated distortions, overlooking the temporal continuity, cumulative degradation, and reasoning complexity inherent in long-duration content. To address these limitations, we present LongVQUBench, a comprehensive benchmark for long-term video quality understanding. LongVQUBench contains over 1200 diverse videos spanning movies, documentaries, surveillance footage, egocentric recordings, and animated content, accompanied by 1500 multiple-choice and open-ended questions for validation and testing. To assess perceptual reasoning across different temporal scopes, we introduce three progressively complex evaluation levels: (i) local event quality understanding (LQU) for analyzing localized distortions; (ii) cross-event quality reasoning (CQR) for integrating multiple degraded events; and (iii) global quality understanding (GQU) for holistic perceptual evaluation over extended durations. Furthermore, a needle distortion question-answering (NDQA) paradigm is embedded across all three levels, where spatial or temporal artifacts are sparsely inserted to probe fine-grained detection and reasoning capabilities. Extensive experiments on 14 state-of-the-art LVLMs reveal significant performance degradation with increasing video length and reasoning depth, highlighting their limited capacity for long-range temporal integration and perceptual attribution. We envision LongVQUBench as a foundational step toward the systematic, hierarchical, and explainable evaluation of LVLMs’ long-term video quality understanding.
[CV-14] Human-Centric Transferable Tactile Pre-Training for Dexterous Robotic Manipulation
Link: https://arxiv.org/abs/2607.01067
Authors: Chi Zhang,Penglin Cai,Ziheng Xi,Haoqi Yuan,Hao Luo,Wanpeng Zhang,Sipeng Zheng,Chaoyi Xu,Zongqing Lu
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: The first two authors contribute equally. Orders are decided by flipping a coin
Abstract:As an essential modality for dexterous and contact-rich tasks, tactile sensing provides precise force feedback that cannot be reliably inferred from vision. However, limited by hardware and data collection systems, existing datasets with tactility remain small in scale and narrow in contact coverage. Meanwhile, Vision-Language-Action (VLA) models with tactile modality are constrained on dynamics-agnostic post-training, which limits the performance ceiling on downstream tasks. In this paper, we present H-Tac, a large-scale tactile-action dataset with 160-hour egocentric human videos containing more than 300 tasks and 135k episodes. Building upon this, we propose Transferable Tactile Pre-Training (TTP), a system of tactile-based pre-training on human data for fine-grained robotic tasks. To bridge the gap between humans and robots, we use unified tactile and action spaces throughout the pre-training and post-training phases, preserving prior knowledge during human-to-robot transfer. By leveraging a tactile expert for future tactile prediction, our framework explicitly models the contact dynamics and precise physical interactions. Extensive experiments in simulation and on real robots demonstrate that our model achieves superior performance, exhibiting robust generalization and fine-grained manipulation capabilities. TTP paves the way for scalable tactile pre-training via human-to-robot transfer.
[CV-15] GeoSearcher: Anchor-Guided Progressive Reasoning for Remote Sensing Visual Grounding with Process Supervision
Link: https://arxiv.org/abs/2607.01050
Authors: Dianyu Wang,Yidan Zhang,Peirong Zhang,Xuyang Li,Xiaoxuan Liu,Lei Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 11 figures, 7 tables
Abstract:Recent multimodal large language models (MLLMs) have shown strong cross-modal understanding and coordinate generation abilities in visual grounding. However, transferring these abilities to remote sensing visual grounding (RSVG) remains challenging. High-resolution remote sensing images usually cover large-scale scenes, where targets are often extremely small and surrounded by numerous visually similar distractors. Meanwhile, queries often contain multiple clues, such as reference objects, spatial relations, and target attributes. Existing MLLM-based methods usually formulate RSVG as one-step coordinate generation, which may lead to unstable predictions for small-object localization and complex queries. To address these challenges, we propose GeoSearcher, which reformulates RSVG as an anchor-guided progressive reasoning process and realizes it through two coupled stages: Anchor-Centric Reasoning Supervised Fine-Tuning (ACR-SFT) and Process-Faithful Group Relative Policy Optimization (PF-GRPO). In ACR-SFT, anchor-centric reasoning data are used to teach the model to represent key visual clues as anchors and progressively integrate location, relational, and attribute clues around them. In PF-GRPO, Process-Aware Reward (PAR) and Reasoning-Informative Sample Selector (RISS) further optimize this reasoning behavior by jointly evaluating key reasoning steps and target localization, while focusing training on samples that are more beneficial for improving progressive reasoning. Through this design, GeoSearcher transforms large-scale visual search into a more constrained local reasoning process. Extensive experiments on DIOR-RSVG, OPT-RSVG, and VRS-Bench show that GeoSearcher outperforms existing state-of-the-art methods. The project will be released at this https URL.
[CV-16] GenAU: Language-Grounded Industrial Anomaly Understanding with Vision-Language Models
Link: https://arxiv.org/abs/2607.01049
Authors: Hongkuan Zhou,Tristan Rehm,Nadeem Nazer,Lavdim Halilaj,Jingcheng Wu,Steffen Staab
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Industrial inspection requires more than binary anomaly detection: a practical system should determine whether an anomaly exists, localize the defective region, identify the defect type, and provide interpretable visual evidence. Existing CLIP-based methods detect and localize anomalies well but offer limited language-level defect understanding, while instruction-tuned vision-language models can describe defects but do not natively produce pixel-level masks. We introduce GenAU, a Generalist vision-language framework for industrial Anomaly Understanding that unifies image-level detection, pixel-level segmentation, multi-type anomaly detection, and defect analysis in a single instruction-following model. GenAU augments a vision-language model with two segmentation tokens, [SEG_defect] and [SEG_normal], whose hidden states act as language-grounded queries over multi-scale visual features for pixel-level localization; the image-level score fuses this map with the decoder’s textual normal/defect decision, while the language decoder produces structured defect-aware responses. Trained with a joint language-modeling and segmentation objective, GenAU covers all four tasks within one architecture and recipe, adding zero-shot multi-type detection and language-grounded defect analysis at a quantified cost to detection and segmentation. Across cross-dataset benchmarks, GenAU attains the strongest image-level detection among CLIP-based zero-shot methods on VisA and Real-IAD, with segmentation approaching but not surpassing specialized CLIP baselines.
[CV-17] EchoRisk: A Multicentre Echocardiography Dataset and Benchmark for Cardio-Oncology MICCAI2026
Link: https://arxiv.org/abs/2607.01039
Authors: Grigorios Kalliatakis,Georgia Karanasiou,Georgios Manikis,Manolis Tsiknakis,Dimitrios Fotiadis,Dorothea Tsekoura,Kalliopi Keramida,Vasileios Bouratzis,Lampros Lakkas,Katerina Naka,Andri Papakonstantinou,Anastasia Constantinidou,Kostas Marias
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Primary technical reference for the EchoRisk-MICCAI 2026 challenge, accepted as a satellite event at MICCAI 2026
Abstract:Therapy-induced cardiotoxicity is the leading non-oncological cause of treatment interruption in breast cancer patients, yet early, automated risk stratification from routine cardiac imaging remains an unsolved problem. We present EchoRisk, the first curated, multicentre, longitudinal echocardiography dataset with explicit cardiotoxicity labels, released as the primary technical reference for the EchoRisk-MICCAI 2026 challenge. The dataset comprises 422 patients enrolled in the EU-funded CARDIOCARE prospective study across five European sites, yielding 2,159 echocardiography videos across 1,123 clinical exams acquired at up to five longitudinal timepoints, alongside a dedicated cohort of 280 patients with baseline imaging for early cardiotoxicity prediction. Three clinically grounded tasks are defined: automated estimation of left ventricular ejection fraction from cine video (Task 1), classification of LV dysfunction from longitudinal imaging (Task 2), and early prediction of therapy-induced cardiotoxicity from pre-therapy baseline echocardiography alone (Task 3). For each task we specify the evaluation protocol, primary and secondary metrics, and ranking procedure. We establish baseline performance using an R(2+1)D video backbone with LSTM aggregation trained from Kinetics-400 pretrained weights, demonstrating strong discriminative performance for cardiac functional assessment and LV dysfunction classification, while early cardiotoxicity prediction from a single pre-therapy video remains a significant open problem for the community. The dataset, evaluation code, and baseline implementations are publicly available to serve as a benchmark for further collaboration, comparison, and the creation of task-specific architectures in cardio-oncology.
[CV-18] SuperFlex: Deformable Superquadrics for Point Cloud Decomposition
Link: https://arxiv.org/abs/2607.01015
Authors: Gabriel Tavernini,Elisabetta Fedele,Tiago Novello,Leonidas Guibas,Marc Pollefeys,Francis Engelmann
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:Superquadrics have proven to provide a compact, geometrically meaningful representation for 3D objects. However, existing methods suffer from limited reconstruction accuracy, are restricted to rigid primitives, and lack robustness to partial point clouds. In this work, we present SuperFlex, an enhanced framework that expands the expressive power and applicability of superquadric decompositions. First, we introduce a novel loss formulation which significantly improves reconstruction accuracy. Second, we include bending and tapering deformations, enabling high-fidelity representation of curved and asymmetric geometries. Finally, we leverage these high-quality decompositions as supervision to train a model that is robust to partial real-world point clouds. Experiments demonstrate substantial improvements in reconstruction accuracy over both optimization- and learning-based baselines while maintaining a highly compact primitive representation.
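As background, the classical superquadric inside-outside function and a linear tapering deformation can be written in a few lines. The sketch uses the textbook formulation (a value below 1 means inside, 1 on the surface, above 1 outside) for a primitive in its canonical pose. It is not the loss or the parameterization used by SuperFlex, and the function and parameter names are chosen here for illustration.

```python
def inside_outside(point, scale, eps1, eps2):
    """Classical superquadric implicit function for an axis-aligned, centred primitive."""
    x, y, z = (abs(c) / s for c, s in zip(point, scale))
    xy = (x ** (2.0 / eps2) + y ** (2.0 / eps2)) ** (eps2 / eps1)
    return xy + z ** (2.0 / eps1)

def untaper(point, scale, kx, ky):
    """Invert a linear taper along z so the point can be tested against the
    undeformed primitive. Assumes the taper factors stay non-zero."""
    X, Y, Z = point
    fx = kx * Z / scale[2] + 1.0
    fy = ky * Z / scale[2] + 1.0
    return (X / fx, Y / fy, Z)

unit = (1.0, 1.0, 1.0)
on_surface = inside_outside((1.0, 0.0, 0.0), unit, 1.0, 1.0)   # sphere case
inside = inside_outside((0.5, 0.0, 0.0), unit, 1.0, 1.0)
tapered = inside_outside(untaper((0.25, 0.0, 0.5), unit, -0.5, -0.5), unit, 1.0, 1.0)
```

With both shape exponents equal to 1 the primitive is an ellipsoid; smaller exponents give boxier shapes, and the taper shrinks or widens the cross-section along z.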
[CV-19] Foundation Models vs. Radiomics for Lung Computed Tomography: A Benchmark of Feature Extractors, Classification Heads, and Segmentation Choices
Link: https://arxiv.org/abs/2607.01001
Authors: Nils Neukirch,Martin Maurer,Nils Strodthoff
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 17 pages, 8 figures, 2 tables, Code is available at this https URL
Abstract:Radiomics is the established approach for CT-based lung cancer phenotyping, yet comparisons with foundation models rarely isolate contributions of feature extractor, classification head, and segmentation choice, or test cross-cohort robustness. We benchmark five feature extractors (Curia, Curia-2, DINOv3, Radiomics2D, Radiomics3D), seven classification heads (TabPFN, TabICL, XGBoost, CatBoost, Random Forest, logistic regression, Ridge), and three segmentation regimes on five tasks: tumor volume and stage classification, 2-year survival prediction, histology classification, and age prediction. Models are trained on LUNG1 (n=338) and evaluated on an internal test set (n=84) and the external LUNG2 cohort (n=211), with worst-case cross-cohort performance as the primary metric. The dominant design factor is task-dependent: segmentation drives volume and stage classification, while classifier choice drives survival, histology, and age prediction. Radiomics is competitive for tumor volume, tumor stage and survival (partly due to label-derivation effects for the former); Curia variants reach comparable peak scores for survival; DINOv3 falls slightly short across tasks. Patch and slice aggregation have negligible impact. We recommend Curia with tumor segmentation and a CatBoost head as a safe default, achieving the best mean rank across the three primary clinical tasks, though task-specific selection consistently outperforms any cross-task default. When tumor delineations are unavailable, Curia-2 with lung segmentation and logistic regression offers a competitive alternative. All pipelines use a two-stage design suited to small cohort sizes where end-to-end fine-tuning would risk overfitting.
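The two summary statistics mentioned in the abstract, worst-case cross-cohort performance and mean rank across tasks, can be computed as in the sketch below. The pipeline names "A" and "B" and all scores are toy values, and ties are not handled; the paper's exact ranking procedure may differ.

```python
def worst_case(internal, external):
    """Worst-case cross-cohort score for a higher-is-better metric."""
    return min(internal, external)

def mean_ranks(scores_by_task):
    """scores_by_task: {task: {pipeline: score}}, higher is better, rank 1 is best."""
    ranks = {}
    for scores in scores_by_task.values():
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, name in enumerate(ordered, start=1):
            ranks.setdefault(name, []).append(rank)
    return {name: sum(r) / len(r) for name, r in ranks.items()}

# Toy (internal, external) scores for two hypothetical pipelines.
toy = {
    "stage": {"A": worst_case(0.80, 0.70), "B": worst_case(0.75, 0.74)},
    "survival": {"A": worst_case(0.66, 0.64), "B": worst_case(0.70, 0.60)},
    "histology": {"A": worst_case(0.60, 0.58), "B": worst_case(0.62, 0.50)},
}
ranks = mean_ranks(toy)
```

In the toy data pipeline B has the higher internal score on two tasks, yet pipeline A obtains the better mean rank because its external-cohort scores hold up.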
[CV-20] AVSR-Diff: Scale-Agnostic Diffusion Priors for Temporally Consistent Arbitrary-Scale Video Super-Resolution ECCV2026
Link: https://arxiv.org/abs/2607.00987
Authors: Geunhyuk Youk,Jeonghyeok Do,Dayeon Kim,Jihyong Oh,Munchurl Kim
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026. Project page: this https URL
Abstract:Diffusion models have significantly advanced video super-resolution (VSR) but remain largely constrained to fixed upsampling scales. Conversely, while coordinate-based arbitrary-scale VSR methods offer scale flexibility, they inherently suffer from severe over-smoothing at large scaling factors. Integrating generative priors with continuous decoding is promising but currently hindered by severe temporal flickering caused by the stochasticity of diffusion sampling. To address this, we propose AVSR-Diff (Arbitrary-scale Video Super-Resolution with Diffusion), a novel decoupled framework that separates scale-agnostic latent denoising from continuous coordinate rendering, effectively avoiding computationally heavy resolution-specific sampling. Our approach introduces a Temporally-Gated Feature Recurrence (TGFR) module to extract strictly aligned, temporally consistent latent priors. Furthermore, we design a continuous video VAE decoder incorporating a Scale-Aware Fourier Refinement (SAFR) module to dynamically adapt frequency components to any target scale. Extensive experiments demonstrate that AVSR-Diff consistently preserves high-frequency details and strong temporal stability across various scales, surpassing state-of-the-art arbitrary-scale baselines. Remarkably, our framework outperforms recent fixed-scale generative models even on their native resolution.
[CV-21] QCA: Query- and Content-Aware Keyframe Selection for Long Video Understanding
Link: https://arxiv.org/abs/2607.00983
Authors: Jun Peng,Baiyang Song,Jie Li,Hui Li,Yiyi Zhou,Rongrong Ji,Yonghong Tian
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Video understanding is often plagued by severe temporal redundancy, where processing dense frame sequences is both semantically inefficient and computationally expensive. This challenge is further amplified when only a small subset of frames is truly relevant to the given query. In this paper, we propose a Query- and Content-Aware (QCA) keyframe selection framework that can select a compact yet information-rich set of frames from long videos. QCA first partitions the video into temporal segments and estimates the information contribution of each segment by jointly modeling query relevance and content deviation, and dynamically allocates keyframe budget to each segment. Within each segment, QCA anchors on the most query-relevant frame and iteratively incorporates additional frames to maximize diversity while maintaining high semantic relevance to the query. Crucially, our method requires no additional training and can be seamlessly integrated into existing Video-LLMs. Extensive experiments across multiple long video understanding benchmarks demonstrate that our proposed approach achieves state-of-the-art performance and has strong generalization ability. For instance, QCA achieves 67.8% on LongVideoBench using 128 frames, while GPT-4o achieves 66.7% using 256 frames. Our code is available on GitHub (this https URL).
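The segment-then-select procedure can be sketched as follows. The sketch assumes cosine similarity between frame and query embeddings as the relevance score, one minus the cosine between a segment's mean feature and the global mean as the content deviation, proportional budget allocation, and a greedy relevance-minus-redundancy rule inside each segment. All of these concrete choices, the weighting `lam`, and the function names are assumptions made for illustration; the paper's scoring functions may differ.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mean_vec(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def allocate_budget(scores, budget):
    """Split `budget` in proportion to positive scores (largest-remainder rounding)."""
    total = sum(scores)
    raw = [budget * s / total for s in scores]
    alloc = [int(r) for r in raw]
    order = sorted(range(len(raw)), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in order[: budget - sum(alloc)]:
        alloc[i] += 1
    return alloc

def select_in_segment(indices, frames, query, k, lam=0.7):
    """Anchor on the most query-relevant frame, then add relevant but diverse frames."""
    if k <= 0:
        return []
    rel = {i: cosine(frames[i], query) for i in indices}
    chosen = [max(indices, key=rel.get)]
    while len(chosen) < min(k, len(indices)):
        rest = [i for i in indices if i not in chosen]
        def gain(i):
            redundancy = max(cosine(frames[i], frames[j]) for j in chosen)
            return lam * rel[i] - (1 - lam) * redundancy
        chosen.append(max(rest, key=gain))
    return sorted(chosen)

def select_keyframes(frames, query, num_segments, budget):
    size = math.ceil(len(frames) / num_segments)
    segments = [list(range(s, min(s + size, len(frames))))
                for s in range(0, len(frames), size)]
    global_mean = mean_vec(frames)
    scores = []
    for seg in segments:
        relevance = sum(cosine(frames[i], query) for i in seg) / len(seg)
        deviation = 1.0 - cosine(mean_vec([frames[i] for i in seg]), global_mean)
        scores.append(max(relevance, 0.0) + deviation + 1e-6)
    picks = []
    for seg, k in zip(segments, allocate_budget(scores, budget)):
        picks.extend(select_in_segment(seg, frames, query, k))
    return picks

# Toy 2-D "embeddings": the second half of the video matches the query.
frames = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 1.0], [0.7, 0.7]]
picks = select_keyframes(frames, query=[0.0, 1.0], num_segments=2, budget=3)
```

Because the procedure only needs precomputed embeddings, a selector of this kind runs before the Video-LLM and requires no training, which matches the training-free claim in the abstract.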
[CV-22] Privacy-Preserving Depth-Only Open-Vocabulary 3D Semantic Segmentation Via Uncertainty-Guided Test-Time Optimization
Link: https://arxiv.org/abs/2607.00978
Authors: Xuying Huang,Sicong Pan,Maren Bennewitz
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:Privacy-preserving perception is a critical requirement for deploying 3D scene understanding systems in real-world indoor environments, yet it remains underexplored in open-vocabulary 3D semantic segmentation. Existing methods typically rely on obtaining rich semantic cues from RGB images, which may expose privacy-sensitive visual information. Depth-only 3D geometry provides a privacy-preserving alternative, but the absence of appearance-based semantic cues makes open-vocabulary predictions highly uncertain and less reliable. Under this setting, we propose to convert uncertainty into a guidance signal to identify unreliable semantic responses and use semantic priors from foundation models to regularize their refinement. We present UTTO, an uncertainty-guided test-time optimization framework for depth-only open-vocabulary 3D semantic segmentation. Without additional training, experiments on ScanNet20, ScanNet40, and ScanNet200 demonstrate that UTTO consistently improves depth-only open-vocabulary 3D segmentation and outperforms representative baselines under privacy-preserving conditions.
[CV-23] RCGL-Net: A Long-Tailed Multi-Label Chest X-Ray Classification Framework with Generative Data Augmentation and Label Co-Occurrence Modeling
Link: https://arxiv.org/abs/2607.00975
Authors: Tong Shao,Hongshun Ling,Li Zhang,Jinjing Wu,Junke Wang,Yuan Gao,Fang Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Chest X-ray multi-label classification is a core task in intelligent medical imaging diagnosis. However, real clinical data often exhibit extreme long-tailed distributions, leading to degraded performance on rare diseases in tail classes. This issue is not only driven by data scarcity but also by two intrinsic factors: 1) attenuation of tail-class lesion representations under complex anatomical backgrounds, and 2) dominance of head classes in modeling label co-occurrence relationships. To address these challenges, we propose TRCGL-Net. First, a learnable text-guided conditional diffusion model is employed to generate high-quality tail-class chest X-ray image samples under disease semantic constraints, improving data diversity and realism of rare disease patterns while alleviating class imbalance and preserving pathology-consistent this http URL, a channel reweighting mechanism is introduced to perform feature recalibration by emphasizing disease-relevant feature channels, thereby improving feature discriminability under long-tailed distributions. A class-aware attention mechanism is further applied to generate class-specific attention maps, enabling the model to localize disease-relevant regions and focus on fine-grained lesion this http URL, a graph convolutional network based on label co-occurrence is introduced to establish an information propagation mechanism among categories. Experiments on the PadChest dataset show that the proposed method achieves a tail-class mAP of 0.4904, an overall mAP of 0.4408, and an mAUC of 0.8989, outperforming state-of-the-art methods. TRCGL-Net effectively improves recognition performance for rare diseases under long-tailed distributions and mitigates the impact of extreme class imbalance in chest X-ray multi-label classification.
[CV-24] QuaMoE-DRF: Proactive Beam and Rate Adaptation via Multimodal Dynamic Radio Map Forecasting in ISAC Networks
Link: https://arxiv.org/abs/2607.00974
Authors: Zhihan Zeng,Kaihe Wang,Zhongpei Zhang,Chongwen Huang
Subjects: Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Static radio maps provide location-dependent propagation priors, but they cannot capture short-term blockage caused by moving objects. Direct sensing-assisted beam prediction is also limited because a beam index discards SINR margins, MCS thresholds, BS alternatives, and communication-equivalent neighboring beams. This paper proposes QuaMoE-DRF, a quality-aware multimodal dynamic radio map forecasting framework for proactive beam and rate adaptation in ISAC networks. Its core representation is a future beam-SINR field. We show that the full multi-BS beam-SINR field is sufficient for finite-codebook threshold-rate BS, beam, MCS, goodput, and outage decisions. For tractability, the implemented model learns a compact reference-BS local field, complemented by BS-level supervision, joint BS–beam supervision, and latent network context; we also clarify that this compact projection alone is not sufficient for BS association. QuaMoE-DRF fuses static geometry, event-like motion observations, structured sensing states, and wireless history through a quality-aware mixture-of-experts module motivated by inverse-variance fusion under heteroscedastic modality errors. It jointly predicts communication-oriented map channels and proactive BS, beam, and MCS decisions. On a dynamic multi-BS and multi-UE urban benchmark, QuaMoE-DRF achieves 402.5 Mbps effective rate, 0.0417 outage probability, and 0.1836 map RMSE, improving the effective rate by 5.67% and reducing outage by 8.35% over the strongest completed effective-rate baseline. The current validation uses labels from a compact blockage/path-loss simulator, with ray tracing used only for calibration and sanity checking.
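The abstract says the mixture-of-experts module is motivated by inverse-variance fusion under heteroscedastic modality errors. As a point of reference, the classical inverse-variance rule (not the paper's learned module) can be sketched as below; the function name and the assumption that the per-modality estimates are independent with known variances are introduced here for illustration only.

```python
def inverse_variance_fuse(estimates, variances):
    """Fuse independent noisy estimates of one quantity, weighting each by 1/variance.

    Assumption: errors are independent and each variance is known and positive.
    Returns the fused estimate and its variance.
    """
    if not estimates or len(estimates) != len(variances):
        raise ValueError("need one positive variance per estimate")
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    fused = sum(w * x for w, x in zip(weights, estimates)) / total
    return fused, 1.0 / total
```

A modality with a large error variance contributes little to the fused value, which is the intuition behind letting a quality-aware gate down-weight unreliable inputs.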
[CV-25] Slope-Guided Mamba and Angular-Refined Transformer for Light Field Super-Resolution ICME2026
Link: https://arxiv.org/abs/2607.00965
Authors: Li Jin,Jian Huang,Junde Lu,Shuai Wang,Hao Sheng,Jie Wu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures, 4 tables. Accepted by IEEE ICME 2026. Affiliation: Hangzhou International Innovation Institute, Beihang University, Hangzhou, China. Corresponding author: Jie Wu
Abstract:Light Field Super-Resolution (LFSR) necessitates accurate modeling of spatial-angular correlations while preserving intrinsic 4D ray coherence. However, maintaining such high-dimensional consistency remains challenging, primarily due to two inherent limitations in prevailing modeling paradigms. First, spatial and angular dimensions are often modeled in a decoupled manner, restricting early cross-dimensional interaction and leading to geometric inconsistencies. Moreover, although continuous sequence modeling paradigms show promise in representing epipolar structures, their rigid scanning mechanisms fundamentally conflict with epipolar geometry, limiting geometry-aware feature aggregation. To address these challenges, we propose a hybrid light field super-resolution network, termed SMART, which integrates a Slope-Guided Mamba and an Angular-Refined Transformer to effectively overcome these limitations. Specifically, we introduce an angular-modulated spatial module to bridge the decoupling gap, incorporating angular priors to strengthen spatial-angular correlation modeling. To mitigate the scan-geometry mismatch, we propose a manifold-aligned trajectory module that enables geometry-consistent sequence modeling along epipolar structures. Experiments on five benchmarks demonstrate that SMART achieves state-of-the-art performance, surpassing previous methods by 0.42 dB (PSNR) with significantly reduced artifacts.
[CV-26] GaussianEmoTalker: Real-Time Emotional Talking Head Synthesis with Audio-Driven and Blendshape-Based 3D Gaussian Splatting
Link: https://arxiv.org/abs/2607.00959
Authors: Haijie Yang,Zhenyu Zhang,Yixuan Dong,Jianjun Qian,Jian Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Audio-driven talking head synthesis has achieved impressive progress in lip synchronization and visual quality, yet generating expressive emotional avatars with controllable intensity remains challenging, especially under real-time constraints. In this paper, we present GaussianEmoTalker, an audio-driven framework for real-time emotional talking head synthesis based on 3D Gaussian Splatting. Instead of directly predicting the final emotional avatar from speech, we formulate emotional animation as a neutral-to-emotional residual deformation problem. GaussianEmoTalker first constructs an identity-specific neutral talking space with GaussianBlendshapes, which provides high-fidelity Gaussian attributes and phoneme-synchronized neutral motion. It then predicts an emotion-conditioned residual deformation by combining mesh displacement cues, audio features, emotion categories, and intensity encodings. To fuse these heterogeneous signals, we introduce a spatial-audio-emotion attention module that estimates the offsets of Gaussian attributes for expressive and temporally stable rendering. Extensive experiments demonstrate that GaussianEmoTalker achieves competitive video quality, accurate lip synchronization, controllable emotional expression, and real-time rendering compared with recent emotional talking head methods. Our project page is available at this https URL
[CV-27] Learning Cardiac Motion Priors for Implicit Neural Representations
Link: https://arxiv.org/abs/2607.00955
Authors: Andrew Bell,George Webber,Andrew P King,Steffen E Petersen,Muhummad Sohaib Nazir,Alistair Young
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Implicit neural representations (INRs) are well suited to cardiac motion estimation, providing continuous, compact representations of motion fields. However, fitting an INR to each image sequence is time-consuming and sensitive to the optimisation trajectory. Learned priors can help guide optimisation towards plausible motion fields and enable faster adaptation, but learning priors for cardiac motion INRs remains under-explored. In this work, we compare four strategies for learning cardiac motion priors, including a population prior learned by joint optimisation, a consensus prior obtained by weight averaging, auto-decoders, and meta-learning. Using short-axis tagged cardiac magnetic resonance images from the UK Biobank, we evaluate their impact on tracking accuracy, motion behaviour, and adaptation trajectory. All learned priors substantially improved early adaptation performance compared with random initialisation. While the simple consensus prior was effective, auto-decoders recovered large deformations faster during early adaptation. Meta-learning achieved strong early performance and maintained the best adaptation trajectory over 50 iterations.
[CV-28] Dataset Biases and Shortcut Learning in Motion-Based AI-Generated Video Detection
Link: https://arxiv.org/abs/2607.00948
Authors: Joren Michels,Lode Jorissen,Nick Michiels
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The visual quality of AI-generated videos has improved drastically in recent years, making it increasingly difficult for humans to distinguish between real and synthetic media. In this work, we evaluate the robustness and applicability of four state-of-the-art motion-based AI-generated video detectors. We identify significant preprocessing and sampling biases in these methods and demonstrate that they account for a substantial portion of their reported performance. Furthermore, we find that these detectors are highly sensitive to motion patterns specific to their evaluation datasets, where AI-generated videos generally exhibit less inter-frame movement than real videos. We show that for all detectors, performance collapses to near-random levels when evaluated on a dataset that does not contain this motion bias. Additionally, through dataset rebalancing and the application of simple spatial augmentations, we observe severe performance degradation across all evaluated models. In contrast, we find that an existing frequency-based detector maintains strong performance across all evaluated datasets, suggesting that frequency-based approaches may offer a more generalizable path forward for AI-generated video detection. We hope that our work raises awareness of these vulnerabilities and encourages the development of more representative, unbiased datasets and more robust evaluation protocols.
[CV-29] Post-Training Pruning for Diffusion Transformers
Link: https://arxiv.org/abs/2607.00927
Authors: Chengzhi Hu,Xuewen Liu,Jing Zhang,Mengjuan Chen,Zhikai Li,Qingyi Gu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages, 13 figures
Abstract:Diffusion Transformers (DiTs) have demonstrated impressive performance in image generation but suffer from substantial computational overhead and resource consumption. Post-training pruning offers a promising solution; however, due to DiTs’ unique architectural design and parameter distribution, traditional pruning methods are inapplicable, leading to significant performance degradation. Specifically, prior methods developed for LLMs, which derive metrics through a series of approximations, amplify the relative contribution of weights in the saliency metric. In addition, weights in DiTs exhibit significantly larger magnitudes than those in LLMs. Moreover, existing pruning granularity overlooks variations in model structures. In this paper, we propose DiT-Pruning, which improves pruning performance by introducing customized saliency criteria and pruning granularity. We design a novel metric that balances the contributions of weights and activations from an energy-based perspective, enabling more effective identification of important elements. Furthermore, we observe distinct clustering patterns in the two-dimensional weight space. Accordingly, we adopt a clustering-aware pruning granularity, enabling effective sparse allocation. Extensive evaluations on various DiTs show that our method consistently preserves image quality, especially under high sparsity. For FLUX.1-dev at 512x512 resolution on MJHQ, DiT-Pruning achieves only a 0.001 loss in CLIP score at 50% sparsity, dramatically outperforming recent pruning methods.
[CV-30] GMO-E²DIT: Grounded Multi-Operation Editing for E-Commerce Images
Link: https://arxiv.org/abs/2607.00920
Authors: Zipeng Guo,Xiaoan Liu,Lichen Ma,Cheng Wang,Yu He,Xiaolong Fu,Jingling Fu,Xinyuan Shan,Shaojie Guo,Luohang Liu,Junshi Huang,Yan Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Real-world e-commerce image editing often requires multiple, localized, and auditable operations rather than global restyling. This compositional nature poses a dual challenge: models must precisely apply all requested edits to the correct regions while preserving unmodified content, even under ambiguous instructions. Existing one-shot editors conflate intent resolution, spatial grounding, and synthesis into a single step, frequently resulting in partial execution failures, which are unacceptable in commercial scenarios. To address this, we introduce GMO-E²DIT, an agentic editing framework that couples a Vision-Language Model (VLM) with a mask-conditioned image editor to tackle structured multi-turn task completion. Given an underspecified instruction, the VLM agent constructs a region-grounded edit agenda, effectively decoupling cognitive reasoning from generative rendering. The framework then executes sub-programs via operation-aware masks and references, utilizing a reflection-driven loop to inspect intermediate results and determine the subsequent state. This iterative mechanism reliably preserves safe partial progress, retries unfinished operations, and recovers from errors. Furthermore, we develop a unified data pipeline providing aligned supervision for planning, execution, and reflection, alongside EComEditBench, a comprehensive benchmark for instruction-driven evaluation. Extensive experiments demonstrate that GMO-E²DIT achieves competitive performance compared to strong closed-source models, yielding superior instruction accuracy and edit fidelity over existing baselines.
[CV-31] Condensing Large-Scale Datasets Directly with Minimal Information Loss ECCV2026
Link: https://arxiv.org/abs/2607.00916
Authors: Xinyi Shang,Peng Sun,Bei Shi,Zixuan Wang,Tao Lin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2026
Abstract:Recent advancements in scaling dataset distillation rely heavily on decoupled information extraction pipelines, comprising SQUEEZE, RECOVER, and RELABEL stages. Despite their scalability to large-scale datasets, these methods suffer from prohibitive computational overhead and poor cross-architecture generalization. In this paper, we reveal the root cause of these bottlenecks: the implicit dual-compression process, from data to model and back to images, inherently induces severe information loss. Crucially, we empirically and theoretically demonstrate that this loss creates a distribution shift that fundamentally compromises the widely adopted RELABEL strategy, transforming the pre-trained model into an unreliable labeler that yields sub-optimal labels. To overcome these critical flaws, we propose CIM, a novel, metric-driven framework that abandons the flawed dual-compression paradigm. Instead, CIM explicitly quantifies and minimizes the information gap between the original and synthetic datasets. By directly aligning the data distributions, our approach ensures high-fidelity information condensation and inherently satisfies the prerequisites for effective relabeling. Extensive experiments demonstrate that CIM establishes a new state-of-the-art. Notably, it distills ImageNet-1K at an IPC=10 in merely 80 minutes on a single RTX-4090 GPU, achieving an unprecedented 48.7% Top-1 accuracy on ResNet-18 and significantly outperforming previous SOTA approaches, such as NRR-DD and DELT, by 2.6% and 2.9%, respectively. Our code is available at this https URL.
[CV-32] MG-RWKV: Multi-Grained Context-Aware RWKV for Temporal Forgery Localization ECCV2026
Link: https://arxiv.org/abs/2607.00902
Authors: Jingchen Ni,Cangjin Yu,Dan Jiang,Quan Zhang,Keyu Lv,Shannan Yan,Linyue Pan,Ke Zhang,Chun Yuan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026
Abstract:Driven by Artificial Intelligence-Generated Content (AIGC), the authenticity of audio-visual content is facing severe challenges. Temporal Forgery Localization (TFL) aims to precisely identify manipulated segments within untrimmed sequences. However, existing methods are limited by CNNs’ local receptive fields or Transformers’ quadratic complexity, while emerging linear models often struggle to balance global authentic context compression with local abrupt forgery perception. To address this, we propose MG-RWKV, a multi-granularity framework that leverages the data-dependent state evolution of RWKV to achieve efficient full-sequence processing with O(T) complexity. Our framework features three core innovations: (1) a Bidirectional RWKV architecture that captures bidirectional temporal contexts without quadratic overhead; (2) a Multi-Granularity Mixture of Experts (MG-MoE) that performs dynamic routing over explicit temporal receptive fields, adaptively selecting granularities based on forgery duration to significantly enhance decision interpretability; and (3) Cross-Granularity Consistency (CGC), which aligns adjacent feature pyramid levels through hierarchical scale-wise pairing and spatial boundary-aware weighting, effectively reducing false positives in authentic regions. Extensive experiments on Lav-DF, TVIL, and Psynd datasets demonstrate that MG-RWKV achieves state-of-the-art performance with low computational cost.
[CV-33] DeWorldSG: Depth-Aware 3D Semantic Scene Graph Generation via World-Model Priors ECCV2026
Link: https://arxiv.org/abs/2607.00889
Authors: Seok-Young Kim,Abdelrahman Elskhawy,Taewook Ha,Dooyoung Kim,Eunjae Shin,Benjamin Busam,Woontack Woo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 19 pages, 6 figures, ECCV 2026
Abstract:We present DeWorldSG, a novel framework that generates spatio-temporally robust 3D Semantic Scene Graphs from RGB-D sequences. Existing methods often struggle to construct reliable 3D scene graphs due to unstable 3D object representations and missing relations caused by frame-wise inference. DeWorldSG addresses these issues by estimating instance-level geometric 3D Gaussian distributions through depth-guided filtering and representing each object as a probabilistic 3D node rather than a single projected point. To mitigate relational sparsity from frame-wise inference, our framework further aggregates spatiotemporal evidence across object pairs and refines relations using contextual priors derived from a world model (V-JEPA 2). Experiments on the 3DSSG and ReplicaSSG datasets demonstrate state-of-the-art (SoTA) performance in both object and predicate prediction, while producing temporally consistent scene structures. In particular, our method improves triplet recall by 77.4% and predicate recall by 23.2% over prior SoTA approaches, making it suitable for robotic manipulation and AR applications. Our code and models are open-sourced.
[CV-34] Geometry-Aware Cross-Height Channel Knowledge Map Prediction for UAV-Assisted Communications With Uncertainty-Guided 3D Sensing
Link: https://arxiv.org/abs/2607.00887
Authors: Zhihan Zeng,Amir Hussain,Yue Xiu,Phee Lep Yeoh,Lu Chen,Zhongpei Zhang,Guan Gui
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Low-altitude Unmanned Aerial Vehicles (UAVs) often need to infer channel knowledge across a range of heights from only sparse observations collected at a few altitude layers. To address this challenge, this paper studies height-conditioned cross-height channel knowledge map (CKM) prediction for UAV-assisted communications in geometry-rich urban environments. We develop a geometry-aware conditional prediction framework that combines urban scene priors, sparse multi-altitude observations, and target-height descriptors to reconstruct dense CKMs at unobserved target heights. An uncertainty head is further introduced to characterize prediction confidence and to support cost-aware online UAV sensing under motion and safety constraints. Experiments on a layered aerial CKM benchmark show that the proposed Feature Pyramid Network (FPN)-Transformer achieves the best overall performance under both unseen-scene zero-shot and legacy patch-random protocols, reducing the Root Mean Square Error (RMSE) to 5.347 dB and 1.111 dB, respectively, compared with 6.937 dB and 1.221 dB for the strongest baseline 3D-RadioDiff. Moreover, after applying our unseen-scene few-shot adaptation, the RMSE further decreases from 5.347 dB in zero-shot prediction to 3.518 dB with 10-shot two-height support, while the uncertainty-guided cost-aware sensing policy improves active reconstruction from 6.94 dB at initialization to 4.79 dB at sensing budget 40, outperforming uncertainty-only sensing at 5.08 dB and random aerial sampling at 5.84 dB.
[CV-35] Beyond Pixel Overlap: A Framework for Decomposing Segmentation Evaluation Metrics
Link: https://arxiv.org/abs/2607.00886
Authors: Youwei Pang,Xiaoqi Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Evaluation metrics are central to binary target segmentation because they determine how progress is measured, compared, and interpreted. In this paper, target denotes the task-defined positive region to be segmented rather than a generic foreground object. It may be salient, camouflaged, transparent, glass-like, mirror-like, shadow-like, lesion-like, or defined by other application-specific semantics. We treat existing metrics as compositions of modular design choices rather than isolated formulas. The proposed framework decomposes each metric into five stages covering prediction representation, target extraction, target matching, score computation, and metric reporting. We use this framework to analyze representative metrics and show how newer metrics address specific limits in earlier protocols. The stage choices keep each metric’s assumptions visible. We then discuss the design space opened by the framework and its implications for task-aware evaluation protocols. Reference code is available at this https URL.
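To make the staged view of a metric concrete, the sketch below strings together three of the five stages named in the abstract (prediction representation, score computation, metric reporting) for two familiar overlap scores. It is an illustration only: the target extraction and target matching stages are omitted, and the threshold, the empty-mask convention, and all function names are assumptions made here rather than the paper's reference code.

```python
def binarize(prob_map, threshold=0.5):
    """Prediction representation: turn a probability map into a binary mask."""
    return [[1 if p >= threshold else 0 for p in row] for row in prob_map]


def overlap_scores(pred, target):
    """Score computation: IoU and F1 from pixel-level confusion counts."""
    tp = fp = fn = 0
    for prow, trow in zip(pred, target):
        for p, t in zip(prow, trow):
            tp += p * t
            fp += p * (1 - t)
            fn += (1 - p) * t
    if tp + fp + fn == 0:
        # Assumed convention: empty prediction on an empty target is a perfect score.
        return {"iou": 1.0, "f1": 1.0}
    return {"iou": tp / (tp + fp + fn), "f1": 2 * tp / (2 * tp + fp + fn)}


def evaluate(prob_maps, targets, threshold=0.5):
    """Metric reporting: average each score over all images."""
    per_image = [overlap_scores(binarize(p, threshold), t) for p, t in zip(prob_maps, targets)]
    return {k: sum(s[k] for s in per_image) / len(per_image) for k in ("iou", "f1")}
```

Changing the threshold, the empty-mask convention, or the averaging rule changes the reported number without touching the model, which is why keeping each stage's choice visible matters.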
[CV-36] Improving Sparse-View 3DGS Generalization via Flat Minima Optimization ECCV2026
Link: https://arxiv.org/abs/2607.00885
Authors: Kangmin Seo,Sangeek Hyun,MinKyu Lee,Jae-Pil Heo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ECCV 2026. Project Page: this https URL
Abstract:Recent advances in neural rendering have established 3D Gaussian Splatting (3DGS) as a highly efficient representation for novel view synthesis, enabling fast training and real-time rendering with strong fidelity. However, when supervision is limited to sparse input views, 3DGS tends to overfit to the observed images and generalize poorly to unseen viewpoints. We address this challenge from the perspective of flat minima (FM) optimization, which seeks solutions that remain stable under small parameter perturbations. Viewing Gaussian parameters as trainable weights, we adapt FM principles to the geometric and dynamic nature of 3DGS with a lightweight training framework. Our method regularizes optimization with controlled Gaussian perturbations that account for each Gaussian’s anisotropy and the training progress, preserving fine details while improving robustness to sparse-view overfitting. To further stabilize this flat minima optimization process, we introduce periodic reinitialization, which temporarily returns non-positional parameters to their initial states for a short window. Together, these techniques integrate seamlessly into existing 3DGS pipelines without architectural changes. Experiments on LLFF and Mip-NeRF360 datasets demonstrate improved quantitative metrics and perceptual quality under sparse-view supervision, producing reconstructions that are sharper, more stable, and better generalized to novel viewpoints.
[CV-37] OmniView-Space: Reinforcing Spatial Reasoning via Multi-Perspective Spatial Mapping
Link: https://arxiv.org/abs/2607.00881
Authors: Xudong Li,Mengdan Zhang,Peixian Chen,Jiaxi Tan,Zihao Huang,Jingyuan Zheng,Yan Zhang,Xiawu Zheng,Xing Sun,Rongrong Ji
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Spatial intelligence remains a persistent challenge for Multimodal Large Language Models (MLLMs), as it requires coherent spatial scene representations beyond basic object recognition. Existing methods typically build such representations through textual reasoning or 3D reconstruction. However, they often falter during multi-step reasoning, particularly when required to dynamically re-anchor evidence to the specific camera-, object-, or direction-centric reference frames demanded by complex queries. To address this, we propose OmniView-Space, a framework designed to maintain spatial consistency through multimodal egocentric evidence. Our approach consists of three core components: (1) Multi-Perspective Spatial Mapping (MPSM), which re-anchors reconstructed geometry into a query-aligned visual cognitive map and a textual spatial graph; (2) Tool-Guided Egocentric Reasoning, an interleaved policy trained to actively select the ego anchor required by the query and request the corresponding MPSM evidence; and (3) Cognitive-Map Distillation, which uses MPSM-generated trajectories and ego-frame rewards to train the model to reason with self-generated cognitive maps. Experiments on single- and multi-image spatial reasoning benchmarks show that OmniView-Space achieves state-of-the-art performance. Furthermore, the distilled model maintains this performance while reducing reliance on external geometry pipelines.
[CV-38] EFlow: Learning Evidence Flow for Long-Video Reasoning with Adaptive Reflection
Link: https://arxiv.org/abs/2607.00867
Authors: Wenhao Zhang,Kuanwei Lin,Xuyi Yang,Wei Gao,Ge Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Long-video reasoning is fundamentally constrained by how models acquire and utilize visual evidence. Existing tool-augmented video frameworks often interleave temporal grounding and answer reasoning within a single trajectory, causing early semantic hypotheses to bias evidence localization. We term this failure mode premature semantic commitment, where biased grounding retrieves incomplete evidence and incomplete evidence further reinforces incorrect reasoning. To address this issue, we propose EFlow, an evidence-first video reasoning framework built upon Qwen3-VL. EFlow explicitly separates temporal grounding and logical reasoning through CoT for Temporal Grounding and CoT for Reasoning, enabling the model to retrieve relevant evidence before answer inference. In addition, EFlow introduces a confidence-aware reflection mechanism that re-evaluates the full video when retrieved evidence is potentially insufficient. We further construct dedicated trajectory datasets and train EFlow through supervised fine-tuning, reinforcement learning, and reinforcement fine-tuning. Extensive experiments across five video understanding benchmarks demonstrate that EFlow consistently improves long-video reasoning performance.
[CV-39] TrajLoc: Trajectory-Attention Localization for Multi-Object Motion Control
Link: https://arxiv.org/abs/2607.00861
Authors: Omer Sela,Inbar Huberman-Spiegelglas,Michael Rotman,Sagie Benaim,Avi Ben-Cohen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project page: this https URL Code: this https URL
Abstract:Controlling the motion of multiple objects in image-to-video (I2V) generation requires preserving object identities while enforcing adherence to distinct target trajectories. This becomes particularly challenging as the number of objects increases and their paths intersect or occlude one another. Existing approaches entangle multiple trajectories within a shared, dense conditioning signal, making object-level correspondence difficult to preserve in crowded scenes. We depart from this paradigm and enforce a strict, per-object spatial constraint that isolates instances independently. Our method, TrajLoc, achieves this directly within the attention layers by substituting the cross-attention weights of each object token with a Gaussian heatmap centered on its target location at every frame. The same per-object token interface carries trajectory and depth through a learned embedding and preserves identity by encoding first-frame appearance in place of an object token. Evaluations across six datasets, featuring up to 20 simultaneously controlled objects and out-of-distribution real-world scenes, demonstrate that our method consistently improves both visual fidelity and trajectory adherence. Applied to two architecturally distinct backbones (CogVideoX 5B and Wan 2.1 14B), our approach achieves average gains of +4.3 dB PSNR and a 51% reduction in trajectory endpoint error compared to the strongest baselines.
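The core substitution in this abstract, replacing an object token's cross-attention weights with a Gaussian heatmap centered on its target location in each frame, can be illustrated with a small grid-level sketch. Everything below is an assumption made for illustration: the grid resolution, the value of `sigma`, the normalization to a sum of one, and the function names are hypothetical and do not come from the paper's code.

```python
import math


def gaussian_heatmap(height, width, center_xy, sigma):
    """Return a height x width grid of weights peaked at center_xy.

    Assumption: weights are normalized to sum to 1 so they can stand in for
    one token's attention distribution over spatial positions.
    """
    cx, cy = center_xy
    grid = [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2)) for x in range(width)]
        for y in range(height)
    ]
    total = sum(sum(row) for row in grid)
    return [[v / total for v in row] for row in grid]


def heatmaps_for_trajectory(height, width, trajectory, sigma=1.5):
    """One heatmap per frame, following the object's target (x, y) positions."""
    return [gaussian_heatmap(height, width, point, sigma) for point in trajectory]
```

Because each object gets its own heatmap per frame, two objects whose paths cross still have separate spatial weightings, which matches the per-object isolation the abstract emphasizes.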
[CV-40] MoVA: Learning Asymmetric Dual Projections for Modular Long Video-Text Alignment ECCV2026
Link: https://arxiv.org/abs/2607.00858
Authors: Peiyuan Zhu,Shaoan Xie,Zijian Li,Yifan Shen,Namrata Deka,Harsh Shrivastava,Guangyi Chen,Kun Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ECCV 2026
Abstract:Contrastive pre-training has propelled video-text alignment, yet models often inherit the critical limitations of their image-text predecessors like CLIP, resulting in entangled representations. These challenges are severely exacerbated by two fundamental properties in the video domain: Temporal Misalignment, where textual descriptions often correlate only to specific, constrained temporal windows, leaving other frames text-irrelevant; and Semantic Asymmetry, which dictates a sparse, bidirectional, and non-equivalent relevance between frame-level visual details and caption-level concepts. This failure persists whether captions are short and temporally disjoint, creating ambiguity, or long and detailed, fostering entanglement between static objects and their temporal evolution. In this paper, we establish theoretical conditions that enable flexible alignment between video and text representations across the temporal dimension and at varying levels of granularity. Building on these theoretical insights, we introduce MoVA, Modular Long Video-Text Alignment, which learns dual asymmetric projections: a text-side projection that adaptively selects frame-aware subspaces of the caption, and a video-side projection that disentangles text-relevant visual concepts. Our framework ensures that the model can preserve global cross-modal semantics while disentangling evolving, frame-specific concepts and scale naturally to long captions and videos. Empirical evaluations show that MoVA outperforms existing methods in multiple video-text alignment tasks, demonstrating the effectiveness of our method.
[CV-41] Mirror-Fusion Attention for Reflection-Aware Self-Supervised Representation Learning KDD2026 ECML
Link: https://arxiv.org/abs/2607.00850
Authors: Ruixin Li,Jin Liu,Yuling Shi,Stefano Lodi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at ECML PKDD 2026. The final authenticated version will be available in the Springer LNCS proceedings
Abstract:Most self-supervised learning (SSL) methods encourage invariance across augmentations, but strict flip invariance can suppress informative left–right correspondences in approximately bilateral data such as medical images and human faces. We propose Mirror-Fusion-Augmented Self-Supervised Learning (MFASSL), a Vision Transformer framework that injects a soft reflection prior into standard SSL without redesigning the backbone. MFASSL constructs mirror-paired views aligned to an estimated symmetry axis and introduces a lightweight Mirror-Fusion Attention (MFA) module for adaptive token-level interaction between mirrored regions while preserving asymmetric cues. The base SSL objective is further coupled with reflection-consistency and mid-layer token-alignment losses. Across CheXpert, BraTS, CelebA-HQ, and WFLW, MFASSL improves downstream performance, calibration, and reflection robustness over MoCo-v3, DINO, and MAE baselines under matched ViT-B/16 settings. It also achieves stronger and more consistent gains than recent equivariant SSL approaches with only approximately 2.7% additional parameters. These results show that lightweight geometry-aware priors can effectively complement invariance-based SSL.
[CV-42] Rethinking Multi-Label Image Classification With Deep Learning: Taxonomy Challenge and Outlook
Link: https://arxiv.org/abs/2607.00839
Authors: Xuelin Zhu,Xiu-Shen Wei,Jiawei Ge,Shuai Xu,Bing Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Multi-label image classification (MLIC), a fundamental task in computer vision, focuses on identifying multiple objects or concepts within an image, underpinning numerous real-world applications, such as autonomous driving, disease diagnosis, recommendation systems, and mobile service robots. Over the past decade, deep learning paradigms based on convolutional neural networks, recurrent neural networks, and Transformers have significantly advanced this field, owing to their powerful capability in visual representation and relationship modeling. These advances have markedly improved the robustness, scalability, and generalization ability of MLIC models across diverse datasets and application domains. In this survey, we provide a comprehensive review of the deep learning-based literature on MLIC. Concretely, we first revisit the background, including problem definition, datasets, backbones and evaluation metrics. Next, we develop a plausible taxonomy for the deep learning-based MLIC approaches, organizing them into six groups: region-oriented methods, label-oriented methods, architecture-oriented methods, representation-oriented methods, learning-oriented methods, and data-oriented methods. Finally, we provide an insightful exposition of the underlying learning game in MLIC and its implications for other vision domains, and we empirically summarize the key challenges and research directions in MLIC while outlining promising avenues for future development. We believe this survey offers the research community a holistic and systematic perspective on MLIC, thereby facilitating subsequent exploration and innovation in this field and beyond.
[CV-43] Pano2World: End-to-End 3D Generation via Unified Multi-View Sequences
Link: https://arxiv.org/abs/2607.00832
Authors: Zhenjia Li,Jinrang Jia,Yifeng Shi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 3 figures, 3 tables. Preprint
Abstract:A single panorama captures the full visual sphere from one camera center, yet confines users to looking around in place without enabling true scene exploration. Converting a single panorama into a persistent, renderable 3D representation for free-viewpoint navigation has attracted growing interest; existing methods either adopt iterative per-view completion that propagates inpainting results to update the underlying geometry, leading to progressive error accumulation and cumbersome multi-step pipelines, or leverage the temporal consistency priors of video generation models, yet the continuous-trajectory constraint intrinsic to such models limits their flexibility in covering scenes from multiple directions simultaneously. We present Pano2World, which takes a single indoor panorama as input and directly outputs a persistent, explorable 3D Gaussian scene. Given the source panorama, Pano2World first reconstructs a coarse 3D Gaussian proxy and renders it at adaptively sampled nearby poses to obtain geometrically aligned guidance panoramas; a panoramic diffusion model then jointly denoises all target views via View-Aware Attention Routing, where each target view simultaneously receives geometric constraints from its corresponding guidance panorama and global semantic guidance from the source panorama, naturally enforcing cross-view consistency. To avoid the information loss incurred by decoding the multi-view hidden features formed during joint denoising back to the pixel domain via VAE, we introduce Latent Feature Adapter, a geometry-aware bridge module that directly distills these hidden features into a scene latent, subsequently decoded into the final 3D Gaussian scene. Experiments demonstrate that Pano2World significantly outperforms existing methods on the multi-position panoramic novel-view synthesis benchmark.
[CV-44] Stitched Embeddings: A Unified Latent Space for 3D Garments and 2D Patterns
Link: https://arxiv.org/abs/2607.00829
Authors: Andrea Sanchietti,Riccardo Marin,Bharat Lal Bhatnagar,Yuanlu Xu,Gerard Pons-Moll
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:While garments are essential for realistic digital humans, their topological variety makes them much harder to model than parametric bodies. Traditional tailoring relies on 2D sewing patterns, yet bridging these patterns to 3D geometry currently requires physical simulations. We present Stitched Embeddings, the first simulation-free framework to unify 3D garment reconstruction and sewing pattern inference within a single bidirectional latent space. By leveraging the geometric priors of a pretrained 3D foundation model, our approach overcomes the data scarcity typically associated with high-quality garment modeling. We propose to use the BoxMesh as a critical intermediate representation to align 2D panels into 3D configurations without the computational overhead of a simulator. This architecture achieves state-of-the-art accuracy in pattern reconstruction while significantly improving efficiency. Furthermore, our differentiable pipeline enables novel applications, including pattern recovery from meshes and 3D editing from 2D patterns. Finally, this work provides a scalable link between neural 3D vision and the physical garment manufacturing pipeline. Project Page: this https URL
[CV-45] Training-Free Debiasing of Diffusion Models via CLIP-Guided Denoising Optimization
Link: https://arxiv.org/abs/2607.00817
Authors: Dain Kim,Jinseo Kim,Sungyong Baik
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Text-to-image diffusion models achieve impressive visual quality, yet demographic bias remains a challenge, as neutral prompts consistently produce stereotypical representations across gender and race. Existing approaches remain limited by costly retraining or by inference-time interventions that often degrade image quality and semantic alignment. We propose Text Embedding Steering (TES), a training-free framework that mitigates demographic bias by directly optimizing conditional text embeddings during the diffusion process. We show that a two-stage strategy - early-stage global alignment followed by iterative denoising-time refinement with CLIP-based feedback - enables stable and controllable attribute steering without modifying model parameters. Extensive experiments on Stable Diffusion demonstrate that TES outperforms existing training-free baselines in fairness while maintaining competitive image quality. These results highlight that inference-time text embedding optimization is a practical and scalable solution for fairness-aware generation in diffusion models.
[CV-46] Towards High-Resolution Visual Perception via Hierarchical Entity Exploration ECCV2026
Link: https://arxiv.org/abs/2607.00816
Authors: Ziyu Ma,Shidong Yang,Yuxiang Ji,Yiming Hu,Tongwen Huang,Yong Wang,Jianfei Cai,Xiangxiang Chu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV2026
Abstract:High-resolution (HR) image perception remains a key challenge in multimodal large language models (MLLMs), as fine-grained details are often lost when the image is processed as a whole. Existing methods either require training to teach models where to look or heuristically divide the image into fixed regions, both of which struggle to generalize in complex HR scenes. In this work, we propose Hierarchical Entity Exploration (HEE), a training-free and model-agnostic framework that transforms static image understanding into dynamic, query-guided entity exploration. HEE first evaluates each region using a dual scoring mechanism to determine whether it already contains sufficient evidence to answer the question. If not, it applies object detection within the most promising region to extract fine-grained entities, clusters them into coherent subregions, and organizes them into a multi-level semantic hierarchy for deeper exploration. When deeper regions still fail to yield confident answers, a confidence-guided backtracking mechanism revisits alternative paths to ensure adaptive perception. Extensive results show that HEE outperforms training-free methods like ZoomEye and RAP in both accuracy and efficiency on two complex HR benchmarks (Visual Probe and HR-Bench), across different MLLMs such as Qwen2.5-VL and LLaVA-OneVision. Moreover, HEE demonstrates generalization on the MME-RealWorld benchmark.
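The explore-then-backtrack loop described in this abstract can be sketched as a depth-first search over regions. This is only an illustration under stated assumptions, not the authors' implementation: `score` is a hypothetical stand-in for the dual scoring mechanism, `children` is a hypothetical stand-in for object detection plus clustering into subregions, and `threshold` and `max_depth` are made-up parameters.

```python
def explore(region, score, children, threshold, max_depth=3):
    """Depth-first region exploration with backtracking (illustrative only).

    score(region)    -> confidence in [0, 1] that the region suffices to answer
                        (hypothetical stand-in for the dual scoring mechanism).
    children(region) -> list of candidate subregions
                        (hypothetical stand-in for detection + clustering).
    Returns (region, confidence) for the first region reaching the threshold,
    otherwise the best-scoring region seen.
    """
    best = (region, score(region))
    if best[1] >= threshold or max_depth == 0:
        return best
    # Try the most promising subregion first. Moving on to the next sibling
    # after a failed branch is the backtracking step.
    for child in sorted(children(region), key=score, reverse=True):
        candidate = explore(child, score, children, threshold, max_depth - 1)
        if candidate[1] >= threshold:
            return candidate
        if candidate[1] > best[1]:
            best = candidate
    return best


# Toy region tree with made-up confidences (not data from the paper).
scores = {"img": 0.2, "A": 0.5, "B": 0.4, "A1": 0.3, "B1": 0.9}
kids = {"img": ["A", "B"], "A": ["A1"], "B": ["B1"]}
result = explore("img", scores.get, lambda r: kids.get(r, []), threshold=0.8)
```

In the toy tree the search first descends into `A`, fails to reach the threshold there, then backtracks into `B` and stops at `B1`.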
[CV-47] Spotted: Location-informed Reidentification of Hyenas and Leopards in Camera Trap Surveys
Link: https://arxiv.org/abs/2607.00804
Authors: Halil Sina Kelebek,Julia Hindel,Kobus Hoffman,Lauren Hoffman,Andrew Loveridge,Bob Mandinyenya,Kudakwashe Ncube,Justin Seymour-Smith,Andrea Sibanda,Abhinav Valada,Matthew Wijers,Daniele De Martini
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Animal re-identification (ReID) in camera-trap surveys remains challenging due to low image quality, strong variation in illumination and viewpoint, and highly imbalanced numbers of observations per individual. As a result, current ReID performance is often insufficient for fully automated use, and practical workflows typically depend on expert review of algorithmically proposed candidate matches. Moreover, most existing approaches focus almost exclusively on visual cues and overlook auxiliary information routinely available in field studies, such as image timestamps and camera-trap locations. We introduce Spotted, a location-informed, human-in-the-loop animal ReID framework that integrates visual similarity with spatio-temporal feasibility priors derived from camera locations, thereby reducing the amount of required expert review. Our method (i) computes an image-model-agnostic feasibility score based on the minimum travel speed required for two detections to correspond to the same individual, (ii) uses these feasibility cues as pseudo-supervision to train a lightweight head on top of a frozen visual foundation model, and (iii) fuses adapted visual similarity with spatio-temporal feasibility to obtain a robust pairwise matching score. We additionally integrate an active pair sampling strategy to accelerate annotation by initially prioritizing uncertain predictions. We evaluate Spotted on three challenging camera-trap ReID datasets comprised of spotted hyenas and leopards, which we release as part of this work. Our model improves average top-5 identification accuracy by 9pp, 2pp and 9pp over the best baseline on our LeopardID102, SpottedHyenaID109 and SpottedHyenaID415 datasets, respectively. Further, we show that our human-in-the-loop strategy reduces the number of queried comparisons by up to 69pp while achieving equivalent positive matches.
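Step (i) of this abstract, a feasibility score based on the minimum travel speed between two detections, can be illustrated with a short sketch. Everything beyond that idea is an assumption made for illustration: the detection layout `(lat, lon, unix_seconds)`, the linear decay, the `max_speed_kmh` value, and the multiplicative `fuse` rule are hypothetical and are not taken from the paper.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def min_speed_kmh(det_a, det_b):
    """Minimum speed one animal would need to produce both detections.

    Each detection is a (lat, lon, unix_seconds) tuple (assumed layout).
    """
    dist = haversine_km(det_a[0], det_a[1], det_b[0], det_b[1])
    hours = abs(det_b[2] - det_a[2]) / 3600.0
    if hours == 0:
        return 0.0 if dist == 0 else math.inf
    return dist / hours


def feasibility(det_a, det_b, max_speed_kmh=10.0):
    """Hypothetical soft score in [0, 1]: 1 if no travel is needed, 0 at or above max speed."""
    return max(0.0, 1.0 - min_speed_kmh(det_a, det_b) / max_speed_kmh)


def fuse(visual_similarity, feasibility_score):
    """Hypothetical fusion rule: down-weight visual matches that are physically implausible."""
    return visual_similarity * feasibility_score
```

A pair of detections at the same camera one hour apart gets a feasibility of 1, while two cameras roughly 111 km apart within one hour get 0 under the assumed 10 km/h cap, so even a high visual similarity is suppressed.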
[CV-48] ClinRAG-GRAPH: Clinical-prior Retrieval-Augmented Graph Model with Domain Adversarial Learning for Breast pCR Prediction
Link: https://arxiv.org/abs/2607.00798
Authors: Yaofei Duan,Yuhao Huang,Tianyu Zhang,Yuan Gao,Luyi Han,Xin Wang,Xinyu Xie,Xinglong Liang,Chunyao Lu,Muzhen He,Patrick Pang,Yue Sun,Ning Mao,Tao Tan,Ritse Mann
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 5 figures
Abstract:Neoadjuvant chemotherapy (NAC) response prediction is clinically important for treatment stratification in breast cancer. However, robust pre-treatment pathological complete response (pCR) prediction remains challenging due to insufficient cross-modal modeling, multicenter imaging heterogeneity, and weak evidence-grounded interpretability. We propose ClinRAG-GRAPH, a Clinically informed Retrieval-Augmented Generation Graph framework, for pre-treatment pCR prediction from DCE-MRI, structured clinical variables, and biopsy-derived pathological biomarkers. ClinRAG-GRAPH constructs an intra-patient clinical-prior graph and applies a prior-guided relation-aware graph convolutional network for structured multimodal representation learning. To improve cross-center robustness, we introduce a dual-branch domain-adversarial learning strategy to suppress protocol-related MRI bias while preserving pCR-relevant features. To enhance interpretability, we further incorporate a large language model (LLM)-driven subgraph RAG module that retrieves clinically analogous historical cases and integrates retrieved evidence for pCR inference. We assemble a large-scale multicenter NAC breast cancer cohort for extensive validation, drawing from two public sources and three in-house this http URL show that ClinRAG-GRAPH achieves AUCs of 0.815 on the internal test set and 0.774/0.712 on two external test sets, demonstrating robust pre-treatment pCR prediction across centers. The code is available at the anonymized this https URL.
[CV-49] LeVLJEPA: End-to-End Vision-Language Pretraining Without Negatives
Link: https://arxiv.org/abs/2607.00784
Authors: Lukas Kuhn,Giuseppe Serra,Randall Balestriero,Florian Buettner
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Vision-language pretraining remains dominated by contrastive objectives, whereas vision-only self-supervised learning has largely adopted non-contrastive methods. At the same time, the role of vision-language encoders has shifted: they are increasingly deployed not as zero-shot classifiers but as the frozen visual backbone of vision-language models and dense prediction systems, which consume the full grid of patch tokens rather than a single pooled embedding. We introduce LeVLJEPA, the first fully non-contrastive end-to-end vision-language pretraining method. LeVLJEPA learns through cross-modal prediction with stop-gradient targets and per-modality distributional regularization, without negatives, temperature, momentum encoder, or teacher-student schedule, and trains stably at large scale. We find that the resulting encoder provides markedly stronger dense semantic features for downstream use: as a frozen vision-language-model backbone, LeVLJEPA is the strongest of the evaluated encoders across GQA, VQAv2, and POPE under two distinct language models, and outperforms contrastive baselines on semantic segmentation, while remaining on par on global readouts such as linear probing. These results establish non-contrastive pretraining as an effective means of producing dense semantic vision features.
[CV-50] SpiralFovea: Input-Adaptive Foveated Tokenization as a Third Lever of Resource-Adaptive Inference
Link: https://arxiv.org/abs/2607.00780
Authors: Kyan Mahajan,Mohammad Saqlain
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Most adaptive-inference techniques for foundation models change what the model does - early exit, MoE routing, KV-cache compression, dynamic attention sparsity. The input that hits the backbone, however, remains a fixed-grid tokenisation indifferent to image content. We argue that this is a missed lever. We present SpiralFovea, a parameter-free, input-adaptive tokeniser in which token identity, location, scale, and count are all functions of local visual entropy and selection completes before any backbone parameter is queried. Around content-driven hotspot anchors, multi-scale spiral rings produce = 78 patches that replace the standard 196-patch ViT grid at the input stage. Across four canonical fine-grained benchmarks, SpiralFovea yields +1.7-2.1 pp accuracy with a 60% reduction in input tokens, an 84% reduction in self-attention FLOPs at every transformer layer, and 18-29% throughput gains over the matched static tokenisation baseline. A controlled ablation on CUB-200-2011 Genus across four backbones reveals a clean diagnostic: the gain magnitude tracks inversely with the strength of the backbone’s whole-image positional prior, isolating self-supervised foundation models as the regime where input-adaptive tokenisation is most valuable.
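The token and FLOP figures quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch takes the patch count as 78 and assumes that the attention-score cost grows with the square of the token count. Both are assumptions made here; the paper may count FLOPs differently.

```python
GRID_TOKENS = 196   # standard ViT patch grid quoted in the abstract
FOVEA_TOKENS = 78   # patch count quoted in the abstract, taken at face value

# Fraction of input tokens removed.
token_reduction = 1 - FOVEA_TOKENS / GRID_TOKENS

# Assumption: self-attention cost is proportional to (number of tokens) ** 2.
attention_reduction = 1 - (FOVEA_TOKENS / GRID_TOKENS) ** 2

print(f"{token_reduction:.1%} fewer tokens, {attention_reduction:.1%} fewer attention FLOPs")
# prints "60.2% fewer tokens, 84.2% fewer attention FLOPs"
```

Under that quadratic assumption, the result matches the 60% and 84% reductions reported in the abstract.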
[CV-51] Soft Mixture-of-Recursions: Going Deeper with Recursive Vision Transformers
Link: https://arxiv.org/abs/2607.00774
Authors: Sang In Lee,Jihun Park
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 16 pages, 8 figures
Abstract:Recent recursive Transformer studies have primarily reused shared parameters across computation steps to construct compact, parameter-efficient models. In this work, we leverage recursion to build effectively deeper Transformers with stronger representational capacity. However, in Vision Transformers, simply increasing recursion depth does not reliably improve performance, as existing recursive approaches do not fully utilize the intermediate representations produced throughout recursive computation. We propose Soft Mixture-of-Recursions (SoftMoR) and its Vision Transformer instantiation, Soft Recursive Vision Transformer (SR-ViT). SoftMoR learns token-wise mixture weights to softly combine outputs from all recursion steps, allowing intermediate representations to be utilized in a learnable and flexible way. Across diverse vision tasks, SR-ViT consistently improves as recursion depth increases with minimal parameter overhead. On ImageNet-1K, increasing recursion depth from 1 to 4 improves SR-ViT-S top-1 accuracy from 79.83% to 82.48% with only 1.7M additional parameters, outperforming the substantially larger DeiT-B while using approximately 27% of its parameters. These results demonstrate that SoftMoR provides a parameter-efficient path to deeper and stronger Vision Transformers through recursion.
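The core idea stated in this abstract, token-wise mixture weights that softly combine the outputs of all recursion steps, can be sketched in plain Python. This is a minimal illustration, not the SR-ViT code: the softmax normalisation is an assumption, and `gate_logits` is passed in directly here, whereas a real model would presumably compute it with a learned layer.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def soft_mix(step_outputs, gate_logits):
    """Combine per-step token features with token-wise mixture weights.

    step_outputs[r][t] is the feature vector of token t after recursion step r.
    gate_logits[t][r] is token t's unnormalised preference for step r
    (hypothetical input; stands in for a learned gating layer).
    Returns one mixed feature vector per token.
    """
    n_steps = len(step_outputs)
    n_tokens = len(step_outputs[0])
    dim = len(step_outputs[0][0])
    mixed = []
    for t in range(n_tokens):
        weights = softmax(gate_logits[t])
        mixed.append([
            sum(weights[r] * step_outputs[r][t][d] for r in range(n_steps))
            for d in range(dim)
        ])
    return mixed


# Two recursion steps, two tokens, two feature dimensions (made-up numbers).
outputs = [
    [[1.0, 0.0], [2.0, 2.0]],  # after step 0
    [[3.0, 4.0], [0.0, 0.0]],  # after step 1
]
uniform = soft_mix(outputs, [[0.0, 0.0], [0.0, 0.0]])
peaked = soft_mix(outputs, [[50.0, -50.0], [-50.0, 50.0]])
```

With equal logits each token receives the average of its per-step features. With sharply peaked logits each token effectively selects a single recursion step, so different tokens can draw on different depths.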
[CV-52] Decoupled Guidance: Disentangling Subject and Context Pathways in Text-to-Image Personalization
Link: https://arxiv.org/abs/2607.00766
Authors: Seongmin Kim,Kyucheol Shin,Heesun Jung,Jinseo Kim,Sungyong Baik
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Text-to-image personalization aims to generate a user-provided subject in novel scenes described by text. However, most existing methods encode subject identity (fidelity) and context (editability) through the same conditioning pathway, forcing the two to compete for attention-map resources. We refer to this phenomenon as conditioning entanglement and show that it induces a fidelity-editability trade-off. We further provide causal evidence by replacing the target subject token with a generic subject token, which produces shifts in attention allocation and corresponding changes in context adherence. To this end, we propose Decoupled Guidance (DeGu), a plug-and-play framework that routes subject identity and scene context through two independent guidance streams. We further introduce a spatial mixing mechanism that dynamically fuses these streams, ensuring each operates within its semantically relevant region without interference. Furthermore, DeGu can be readily applied to existing personalization methods without modifying the underlying backbone models, consistently improving the overall personalization performance while enabling inference-time control over the fidelity-editability balance, across diverse methods and backbones, including flow-matching Diffusion Transformers (DiTs).
[CV-53] GKDT: General Keypoint Detection Transformer ECCV2026
Link: https://arxiv.org/abs/2607.00752
Authors: Changsheng Lu,Yuxin Chen,Haokun Gui,Rong Wang,Jie Yang,Harry Yang,Anton van den Hengel,Jiaya Jia
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2026
Abstract:With the emergence of various pre-trained vision and language models, computer vision is shifting from narrow-domain to open-domain recognition. The construction of a more powerful yet general keypoint detection (GKD) model to support diverse tasks has become increasingly important in the field. To this end, we first present a large-scale unified keypoint dataset called MegaKPT. The dataset is composed of over 1.3 million diverse object instances from twenty-nine existing datasets, and enjoys high-quality unified annotations with keypoint text descriptions. Based on MegaKPT, we develop GKDT, a simple, flexible and powerful DINOv3-based Transformer model for General Keypoint Detection. Our GKDT supports visual prompts, text prompts, or both. To enhance model training, we also propose a suite of useful strategies such as mix-modal prompted training and dynamic importance sampling. By testing over 22 test sets with seen or unseen objects, our single GKDT model shows strong performance and generality in detecting keypoints on broad categories, with most categories over 90% PCK@0.1 accuracy, offering high practical applicability to real-world problems. The dataset, models, and code will be released at this https URL.
[CV-54] FrameONE: Hierarchical Motion Modeling for Universal Multi-View Echocardiographic Keyframe Detection MICCAI2026
Link: https://arxiv.org/abs/2607.00748
Authors: Rusi Chen,Yuhao Huang,Hongyuan Zhang,Chao Tian,Shunan Ji,Yuhan Zhang,Dong Ni
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by MICCAI 2026. 10 pages, 4 figures
Abstract:Accurate detection of end-systole (ES) and end-diastole (ED) frames is fundamental to echocardiographic assessment. Existing methods are typically developed in a view-specific manner and depend on auxiliary annotations or intensive visual modeling, which limits their generalizability. In multi-view modeling, keyframe detection is driven by shared cardiac motion, yet large appearance differences and motion patterns make unified modeling challenging. To address these issues, we propose FrameONE, a unified end-to-end framework for multi-view echocardiographic keyframe detection. FrameONE introduces a Hierarchical Motion Modeling strategy: an intra-view multi-task learning module reduces appearance bias and promotes motion-focused representations within each view; an inter-view general motion learning module further separates view-agnostic dynamics from view-specific patterns, enabling shared yet flexible motion representation learning across views. Extensive experiments on 25,872 videos spanning four standard views demonstrate that FrameONE achieves state-of-the-art keyframe detection accuracy with strong cross-view generalization. Code is available at this https URL.
[CV-55] Active Learning for Cascaded Object Detection: Balancing Coverage and Uncertainty in Table Extraction Pipelines ICDAR2026
Link: https://arxiv.org/abs/2607.00747
Authors: Eliott Thomas,Mickael Coustaty,Aurelie Joseph,Gaspar Deloin,Vincent Poulain d’Andecy,Jean-Marc Ogier
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at ICDAR 2026
Abstract:Table extraction from business documents relies on a cascaded pipeline where Table Detection (TD) first localizes tables and Table Structure Recognition (TSR) then recovers their internal layout. Building task-specific training sets for this pipeline is costly, particularly for TSR which requires fine-grained structural annotations. Active learning (AL) can reduce this annotation burden, yet most AL strategies are designed for single-model tasks and do not account for inter-stage dependencies in cascaded architectures. In this work, we present the first adaptation of Uncertainty Herding (UHerding), a hybrid coverage-uncertainty sampling method originally proposed for image classification, to cascaded object detection pipelines. We propose two pipeline-aware extensions that exploit the TD-to-TSR dependency: RankFusion adds dual-manifold coverage over both detection and structure representation spaces, while CAPA further incorporates stage-dependent gating and per-task uncertainty calibration. Extensive experiments across two public (PubTables-1M and FinTabNet) and two private table extraction datasets, with various annotation budgets (from 71 to 500 documents) show that UHerding generalizes well to table extraction, outperforming each baseline. Among pipeline-aware variants, RankFusion achieves higher expected gains but at the cost of greater variance, while CAPA emerges as the most consistent strategy, outperforming standard UHerding on three out of four datasets.
[CV-56] GaussianFusion: Unified 3D Gaussian Representation for Multi-Modal Fusion Perception ICLR2026
Link: https://arxiv.org/abs/2607.00746
Authors: Xiao Zhao,Chang Liu,Mingxu Zhu,Zheyuan Zhang,Linna Song,Qingliang Luo,Chufan Guo,Kuifeng Su
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ICLR 2026
Abstract:The bird’s-eye view (BEV) representation enables multi-sensor features to be fused within a unified space, serving as the primary approach for achieving comprehensive 3D perception. However, the discrete grid representation of BEV leads to significant detail loss and limits feature alignment and cross-modal information interaction in multimodal fusion perception. In this work, we break from the conventional BEV paradigm and propose a new universal framework for multi-modal fusion based on 3D Gaussian representation. This approach naturally unifies multi-modal features within a shared and continuous 3D Gaussian space, effectively preserving edge and fine texture details. To achieve this, we design a novel forward-projection-based multi-modal Gaussian initialization module and a shared cross-modal Gaussian encoder that iteratively updates Gaussian properties based on an attention mechanism. GaussianFusion is inherently a task-agnostic model, with its unified Gaussian representation naturally supporting various 3D perception tasks. Extensive experiments demonstrate the generality and robustness of GaussianFusion. On the nuScenes dataset, it outperforms the 3D object detection baseline BEVFusion by 2.6 NDS. Its variant surpasses GaussFormer on 3D semantic occupancy with 1.55 mIoU improvement while using only 30% of the Gaussians and achieving a 450% speedup.
[CV-57] Foundation Model-driven Key Anatomy Frame Selection for Blind-sweep Ultrasound Fetal Birth Weight Estimation MICCAI2026
Link: https://arxiv.org/abs/2607.00745
Authors: Le Ou,Xiliang Zhu,Huanwen Liang,Wenxiong Pan,Yuhao Huang,Yuxiang Deng,Xuan Sheng,Hong Yin,Juhua Xiao,Xin Zhou,Dong Ni
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by MICCAI 2026. 10 pages, 2 figures. Code: this https URL
Abstract:Accurate fetal birth weight (FBW) estimation shortly before delivery is clinically valuable yet challenging due to its reliance on operator expertise, particularly in low-resource settings. To reduce this reliance, we study near-term birth-weight regression from blind-sweep ultrasound (US) videos acquired within 48 hours prior to delivery, with post-delivery weighing as ground truth. Accordingly, we propose a foundation model-driven key anatomy frame selection framework that enables accurate FBW regression despite the absence of plane constraints in blind sweeps. Our highlights are as follows: (1) We believe this is the first work to estimate FBW using blind-sweep US videos, enabling operator-independent assessment. (2) An Anatomy-Guided Frame Selection module equipped with a vision-language foundation model is proposed for keyframe collection in unconstrained sweeps. (3) A Redundancy-Aware Feature Compression module is designed to compress frame features while preserving task-relevant information, alleviating temporal redundancy. Extensively validated on prospectively collected data from 839 patients, our method achieves an MAE of 161.3 g, with 90.23% and 100% of cases falling within 10% and 15% absolute percentage error, outperforming typical Hadlock estimation and strong competitors. Codes are available at this https URL.
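The two kinds of figures reported in this abstract, an MAE in grams and the share of cases within a given absolute percentage error, are standard regression metrics. The sketch below shows how such numbers are typically computed. The function names and the sample weights are invented for illustration and are not data from the paper.

```python
def mean_absolute_error(predicted, actual):
    """Mean absolute error, in the same unit as the inputs (grams here)."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def share_within(predicted, actual, tolerance):
    """Fraction of cases whose absolute percentage error is at most `tolerance`.

    tolerance is a fraction, e.g. 0.10 for 10%.
    """
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) / a <= tolerance)
    return hits / len(actual)


# Made-up birth weights in grams, for illustration only.
actual_g = [3000, 3500, 4000]
predicted_g = [3100, 3400, 4500]

mae_g = mean_absolute_error(predicted_g, actual_g)
within_10 = share_within(predicted_g, actual_g, 0.10)
within_15 = share_within(predicted_g, actual_g, 0.15)
```

On the made-up data, two of the three cases fall within 10% and all three fall within 15%. The abstract reports its 90.23% and 100% figures in the same form.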
[CV-58] Prototype Memory-Guided Training-Free Anomaly Classification and Localization in Prenatal Ultrasound MICCAI2026
Link: https://arxiv.org/abs/2607.00744
Authors: Huanwen Liang,Yuhao Huang,Xiliang Zhu,Yuanji Zhang,Xuedong Deng,Xinru Gao,Guowei Tao,Yuhan Zhang,Dong Ni
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by MICCAI2026
Abstract:Prenatal anomaly classification and localization is of critical importance for fetal health and pregnancy management. Although ultrasound (US) is the primary modality for prenatal screening, accurate diagnosis remains challenging due to the low prevalence and high heterogeneity of anomalies. Existing deep learning methods for prenatal tasks rely on large-scale annotated datasets, which are difficult to obtain in practice. Although few-shot learning alleviates data scarcity, it typically requires fine-tuning for new categories, limiting its practicality in resource-limited clinical settings. To address these challenges, we propose a training-free framework for multi-class prenatal US anomaly classification and localization that operates with only a few reference images per class, representing the first exploration of this setting. Our framework comprises three key components: (1) a memory bank with multi-granular prototypes that explicitly models both class-level semantics and anomaly characteristics; (2) a prototype-driven soft merging mechanism that aggregates discriminative features to detect the anomaly region; and (3) a class-aware refinement strategy that leverages prototype consistency to improve category prediction. Extensively validated on a multi-center prenatal US dataset containing 1,149 cases, with a total of 2,357 images and 9 categories, our proposed method outperforms the competitors.
[CV-59] Towards Robust Driving Perception: A Flexible Scale-Driven Family for Self-Supervised Monocular Depth Estimation ECCV2026
Link: https://arxiv.org/abs/2607.00736
Authors: Zhaowen Zhu,Li Zhang,Yujie Chen,Tian Zhang,Yingjie Wang,Mingxia Zhan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV2026. Code is available at this https URL
Abstract:Self-Supervised Monocular Depth Estimation (MDE) has garnered attention in recent years due to its independence from ground truth. However, most existing models are limited to a single scale and exhibit considerable performance degradation in complex driving environments. Networks specifically designed to handle dynamic traffic participants tend to be overly complex, hindering their deployment on resource-constrained automotive edge devices. To address these limitations and move towards robust driving perception, we propose FlexDepth, a scale-driven and flexible family of self-supervised MDE models tailored for challenging road scenarios. FlexDepth employs a two-stage static-dynamic decoupled training strategy, enabling the independent assessment of confidence for both static backgrounds and dynamic road objects. Furthermore, it introduces a meticulously designed Scale-Driven Decoder (SDD) to dynamically select components based on scale size, facilitating efficient feature fusion and the output of high-precision depth maps. Extensive experiments on standard driving benchmarks demonstrate that without any auxiliary information, our model achieves state-of-the-art performance across arbitrary scales with minimal computational overhead. Our smallest model, Flex-Nano, requires only 0.7 GFLOPs and achieves 37.6 FPS on mobile platforms, ensuring reliable real-time perception while maintaining excellent zero-shot this http URL source code is available: this https URL
[CV-60] ConRTF: Edge-Constrained Boundary Distribution Refinement for Realtime TransFormer Table Structure Recognition ICDAR2026
Link: https://arxiv.org/abs/2607.00734
Authors: Eliott Thomas,Tri-Cong Pham,Mickael Coustaty,Aurelie Joseph,Gaspar Deloin,Vincent Poulain d’Andecy,Jean-Marc Ogier,Antoine Doucet
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ICDAR 2026
Abstract:Table Structure Recognition (TSR) aims to recover the row and column layout of tables from document images, a key step in document understanding pipelines. Accurate TSR depends on precise boundary localization: small errors in row or column boundaries can propagate into incorrect cell assignments and structural inconsistencies. Yet detection-based approaches treat table elements as generic objects, ignoring a fundamental property of table layout: rows and columns play structurally distinct roles and their boundaries carry unequal importance. We propose an Edge-constrained Fine-grained Localization loss (EFL) that formalizes this structural asymmetry by encoding table-specific geometric priors into the training objective: row-like elements are supervised with emphasis on their horizontal boundaries, while column-like elements prioritize vertical boundaries. Implemented within a real-time detector with distribution-based boundary refinement (D-FINE), EFL operates during training only and guides boundary refinement toward structurally meaningful adjustments with no change to the inference pipeline. The proposed approach, ConRTF, is also data-efficient, maintaining robust accuracy with as few as 2k–3k annotated tables. Experiments on PubTables-1M and two private datasets show consistent improvements over the optimized baseline and several real-time detectors including RT-DETRv2 and YOLOv10-11, with gains of up to +1.6 GriTS points at equal inference speed.
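The row/column asymmetry described in this abstract can be illustrated with a much simpler stand-in than the paper's loss. The sketch below is a plain edge-weighted L1 distance on `(x1, y1, x2, y2)` boxes, not the distribution-based EFL used with D-FINE; the function name `edge_weighted_l1` and the weight values in `EDGE_WEIGHTS` are hypothetical and chosen only for illustration.

```python
from typing import Sequence

# Hypothetical per-edge weights in (x1, y1, x2, y2) order; not taken from the paper.
EDGE_WEIGHTS = {
    "row": (0.5, 1.5, 0.5, 1.5),     # emphasize top/bottom edges (horizontal boundaries)
    "column": (1.5, 0.5, 1.5, 0.5),  # emphasize left/right edges (vertical boundaries)
}


def edge_weighted_l1(pred: Sequence[float], target: Sequence[float], kind: str) -> float:
    """Weighted L1 distance between two (x1, y1, x2, y2) boxes."""
    weights = EDGE_WEIGHTS[kind]
    return sum(w * abs(p - t) for w, p, t in zip(weights, pred, target))


target = (0.0, 10.0, 100.0, 20.0)
top_edge_off = (0.0, 12.0, 100.0, 20.0)   # top edge displaced by 2
left_edge_off = (2.0, 10.0, 100.0, 20.0)  # left edge displaced by 2

# For a row, the same displacement costs more on a horizontal boundary.
print(edge_weighted_l1(top_edge_off, target, "row"))
print(edge_weighted_l1(left_edge_off, target, "row"))
```

Because the weighting lives entirely in the training objective, a detector trained this way would run unchanged at inference, which matches the abstract's claim that EFL operates during training only.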
[CV-61] AV-SyncBench: Decoupled Benchmarking of Temporal and Semantic Audio-Visual Synchronization INTERSPEECH2026
Link: https://arxiv.org/abs/2607.00726
Authors: Tianhong Zhou,Mingyang Han,Boyu Li,Yuxuan Jiang,Jiaxin Ye,Dongxiao Wang,Haoxiang Shi,Kunpeng Wang,Jun Song,Cheng Yu,Bo Zheng
Categories: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Comments: Accepted by Interspeech 2026
Abstract:Audio-visual feature extraction is a fundamental component of multimodal understanding and generation tasks. However, existing evaluation protocols for feature extraction models exhibit dimensional bias, typically focusing on either semantic matching or temporal offset detection. Moreover, their data construction remains coupled, preventing independent assessment of temporal and semantic consistency. We propose AV-SyncBench, the first benchmark to fully separate temporal and semantic evaluation for audio-visual synchronization. Built from in-the-wild videos, it spans Voice, Music, and Sound across 10 scenarios and 5 challenge tasks. Data are automatically filtered and manually verified to ensure on-screen sound sources. The benchmark contains 3,269 videos and 38,390 samples, and we evaluate five representative models to quantify feature quality for alignment and downstream tasks. The code and dataset are available at: this https URL.
[CV-62] Partial Skeleton Visibility for Action Recognition: A Constrained Field-of-View Approach
Link: https://arxiv.org/abs/2607.00716
Authors: Yingjie Dai,Tianyang Xu,Yanglin Deng,Xiao-Jun Wu,Josef Kittler
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages, 4 figures
Abstract:Skeleton-based action recognition has achieved remarkable success by exploiting joint coordinates and their topological connections, yet prevailing methods overwhelmingly assume complete and clean skeleton inputs. In real-world deployments, such as egocentric vision, crowded surveillance, wearable devices, or edge robotics, limited field-of-view (FoV) frequently causes substantial joint visibility dropout, leading to severe performance degradation that existing models are largely unprepared to handle. To bridge this critical yet underexplored gap, we introduce PartialVisGraph, a novel hypergraph framework tailored for robust skeleton action recognition under constrained FoV. We first construct highly expressive hypergraphs by introducing learnable virtual hyperedges that form a soft incidence matrix, capturing flexible high-order dependencies beyond conventional pairwise graphs. We then propose the Single-Head Sample-Adaptive Transformer, which adaptively aggregates joint features onto hyperedges while explicitly incorporating a visibility prior. This prior selectively gates information flow, preventing occluded or out-of-view joints from corrupting reliable feature propagation. We further establish rigorous evaluation protocols with realistic FoV simulation benchmarks on NTU RGB+D 60 and 120. Extensive experiments demonstrate that PartialVisGraph consistently achieves state-of-the-art accuracy under partial visibility, with gains of up to 68.8% on subsets with severe FoV restrictions compared to recent strong baselines, while remaining superior on full-visibility settings. Our approach offers a principled and practical pathway toward deployable skeleton-based action understanding in unconstrained environments.
[CV-63] Towards Memory-Efficient Autoregressive Video Generation via Instance-Specific Parametric Absorption ECCV2026
Link: https://arxiv.org/abs/2607.00712
Authors: Xiaomeng Fu,Jia Li,Yiming Hu,Yong Wang,Hayden Kwok-Hay So,Jiao Dai,Xiangxiang Chu,Jizhong Han
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: ECCV 2026 Camera Ready
Abstract:Autoregressive (AR) streaming models have emerged as a powerful paradigm for long video generation. However, the linearly growing Key-Value (KV) cache poses a significant bottleneck, leading to memory overload and degraded inference throughput. A common compression method is to drop redundant KV tokens, which often breaks long-range dependencies, resulting in temporal flickering and identity loss. In this paper, we propose Instance-Specific Parametric Absorption (ISPA), a novel framework that shifts the KV cache compression from discarding to distilling. The core idea is to transition a subset of layers from Full-Attention (F-Layers) to memory-efficient Local-Attention (L-Layers) by “absorbing” historical context into the model’s weights. Specifically, during a brief warmup phase, ISPA monitors the output discrepancy between global and local attention. At the transition point, we solve a closed-form least-squares problem to compute an instance-specific weight modulation that compensates for the missing history. Experiments across architectures (1.3B to 14B) demonstrate that ISPA can remove up to 50% of the KV cache with near-lossless visual quality. We hope this perspective encourages future work to explore parametric memory consolidation beyond external token-level cache management for streaming generative models.
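The abstract does not give the form of the weight modulation, so the following is only a toy illustration of what "closed-form least squares" can mean here. It assumes a per-channel affine correction (scale and bias) that maps local-attention outputs collected during warmup onto the corresponding full-attention outputs; the diagonal form and the name `fit_channel_modulation` are assumptions, not the paper's method.

```python
from typing import List, Tuple


def fit_channel_modulation(local_out: List[List[float]],
                           full_out: List[List[float]]) -> List[Tuple[float, float]]:
    """Per channel, return (scale, bias) minimizing sum_i (scale*x_i + bias - y_i)^2.

    local_out[i][j] and full_out[i][j] are the channel-j outputs of local and
    full attention for warmup token i (hypothetical calibration data).
    """
    n = len(local_out)
    params = []
    for j in range(len(local_out[0])):
        xs = [row[j] for row in local_out]
        ys = [row[j] for row in full_out]
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        var_x = sum((x - mean_x) ** 2 for x in xs)
        cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        # Ordinary least-squares slope; fall back to identity scale if x is constant.
        scale = cov_xy / var_x if var_x > 0 else 1.0
        params.append((scale, mean_y - scale * mean_x))
    return params


local = [[1.0, 0.0], [2.0, 1.0], [3.0, 2.0]]
full = [[3.0, -0.5], [5.0, 0.5], [7.0, 1.5]]
print(fit_channel_modulation(local, full))
```

The point of the sketch is that the correction is computed once per instance from warmup statistics, with no gradient steps, after which the long-range KV entries for that layer are no longer needed.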
[CV-64] Creating Impactful Autonomous Driving Datasets: A Strategic Guide from Research Gap to Benchmark
Link: https://arxiv.org/abs/2607.00710
Authors: Richard Schwarzkopf,Jonas Merkert,Frank Bieder,Annika Bätz,Alexander Blumberg,Carlos Fernandez,Felix Hauser,Fabian Immel,Christian Kinzig,Hendrik Königshof,Fabian Konstantinidis,Martin Lauer,Willi Poh,Nils Rack,Kevin Rösch,Yinzhe Shen,Marlon Steiner,Gleb Stepanov,Dominik Strutz,Ömer Şahin Taş,Julian Truetsch,Kaiwen Wang,Royden Wagner,Jan-Hendrik Pauls,Christoph Stiller
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Keywords: Autonomous Driving, Dataset Design, Benchmarks, Research Gap Identification. 14 pages, 3 figures
Abstract:Well-designed autonomous driving datasets have fundamentally shaped research progress, yet existing literature primarily describes what datasets contain rather than how to strategically design impactful ones. This is especially limiting for small and medium-sized labs and startups that cannot afford to misallocate scarce resources. We argue that impactful dataset creation begins with a diagnosis: whether a research question is blocked by a data problem or an evaluation problem, and proceeds by selecting the minimal data operator(s) that closes the resulting gap, recording new data only when no cheaper operator(s) suffices. We analyze the evolution of major autonomous driving (AD) datasets through this lens and distill a strategic framework spanning gap identification, operator choice, sensor suite design, and annotation strategy. We ground the framework in a running case study of our KITScenes dataset family. The datasets are available at: this https URL
[CV-65] Imprint: Online Memory Compression for Long-Horizon Egocentric QA
Link: https://arxiv.org/abs/2607.00696
Authors: Kousik Das,Debaditya Roy
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Long-horizon egocentric question answering involves answering questions about events that occurred hours or days in the past. This requires memory representations that remain both retrieval-effective and scalable over days or weeks of recording. Existing long-horizon egocentric QA methods construct memory as hierarchical textual summaries of observations. While effective for reducing memory size, summarization optimizes for descriptive compression rather than retrieval: repeated interactions are absorbed into coarse textual descriptions instead of being preserved as explicit, recurring memory units, making long-horizon evidence aggregation difficult. We propose Imprint, an interaction-centric memory framework that formulates long-horizon egocentric memory as an online memory compression problem rather than summarization. Incoming observations are first represented as structured Interaction Records and continuously organized into recurring interaction patterns. Using human memory consolidation signals of recurrence, recency, and distinctiveness, Imprint selectively retains and compresses interactions into a compact retrieval-oriented memory. We evaluate Imprint on EgoLifeQA, a seven-day egocentric benchmark containing questions that require reasoning over interactions occurring hours to days before the query. With the same LLM, Imprint improves QA accuracy from 31.0% to 35.8%, increases evidence-grounded answers by $6\times$ compared with EgoRAG, reduces memory footprint by $2.3\times$, and decreases retrieval latency by $11.8\times$. These results demonstrate that memory compression provides a scalable and retrieval-effective foundation for long-horizon egocentric question answering.
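To make the three consolidation signals concrete, here is a minimal sketch of a retention score that combines recurrence, recency, and distinctiveness and keeps the top-scoring patterns under a memory budget. The scoring formula, the 24-hour half-life, and all names (`InteractionPattern`, `retention_score`, `compress`) are hypothetical; the abstract does not say how Imprint actually weighs the signals.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class InteractionPattern:
    name: str
    count: int              # recurrence: how often the interaction was observed
    last_seen_h: float      # recency: hours since the last occurrence
    distinctiveness: float  # 0..1, assumed to come from some novelty estimate


def retention_score(p: InteractionPattern, half_life_h: float = 24.0) -> float:
    """Hypothetical score: recency-decayed log-recurrence plus distinctiveness."""
    recurrence = math.log1p(p.count)
    recency = 0.5 ** (p.last_seen_h / half_life_h)
    return recurrence * recency + p.distinctiveness


def compress(patterns: List[InteractionPattern], budget: int) -> List[InteractionPattern]:
    """Keep the `budget` highest-scoring patterns, best first."""
    return sorted(patterns, key=retention_score, reverse=True)[:budget]


memory = [
    InteractionPattern("make coffee", count=10, last_seen_h=2.0, distinctiveness=0.1),
    InteractionPattern("open mailbox", count=1, last_seen_h=100.0, distinctiveness=0.1),
    InteractionPattern("fix bike tire", count=1, last_seen_h=48.0, distinctiveness=0.9),
]
print([p.name for p in compress(memory, budget=2)])
```

A frequent recent habit and a rare but distinctive event both survive, while an old, unremarkable one-off is dropped, which is the behavior the abstract contrasts with uniform summarization.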
[CV-66] LUMA: Benchmarking Segmentation via a Lightweight Universal Mask Adapter
Link: https://arxiv.org/abs/2607.00687
Authors: Tobias Christian Nauen,Anosh Billimoria,Federico Raue,Stanislav Frolov,Brian B. Moser,Andreas Dengel
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
Comments:
Abstract:Comparing transformer backbones for image segmentation is confounded: each is paired with a different decoder, recipe, and pretraining, so reported differences rarely reflect the backbone itself. We introduce the Lightweight Universal Mask Adapter (LUMA), a lightweight, backbone-agnostic mask-transformer head that treats any backbone as a black-box feature extractor, letting a set of queries read from its features through cheap cross-attention. LUMA matches the accuracy of EoMT, the state-of-the-art efficient ViT-segmenter, at lower cost, while attaching unchanged to isotropic, hierarchical, convolutional, and mixture-of-experts backbones alike. Holding this head fixed, we benchmark 20 backbones, 11 pretraining schemes and a range of resolutions on ADE20K and Cityscapes under one modern recipe. We find that “efficient” token mixers fail to deliver efficiency even at the high resolutions that motivate them, with plain ViT holding the throughput Pareto-front at every resolution. Additionally, the pretraining objective, not the architecture, the lever the field has tuned hardest, governs segmentation quality.
[CV-67] ABot-M0.5: Unified Mobility-and-Manipulation World Action Model
Link: https://arxiv.org/abs/2607.00678
Authors: Ronghan Chen,Yandan Yang,Zuojin Tang,Dongjie Huo,Tong Lin,Haoning Wu,Haoyun Liu,Yuzhi Chen,Lulu Zheng,Botai Yuan,Tianlun Li,Mingxin Wang,Dekang Qi,Bin Hu,Wei Mei,Yuze Xuan,Haolong Yang,Yanqing Zhu,Mu Xu,Zhiheng Ma,Xinyuan Chang
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Code: this https URL
Abstract:Mobile manipulation is a key capability for general-purpose robots, yet remains challenging for current embodied learning methods. VLA policies are typically reactive and lack explicit world modeling, while existing World Action Models (WAMs) are still poorly aligned with the structure of mobile manipulation: they operate on coarse video chunks, model entangled navigation-manipulation actions, and train inverse dynamics under supervision that does not match autoregressive inference. As a result, they often miss fine-grained contact dynamics, suffer from action-distribution conflicts, and accumulate errors over long-horizon rollouts. We propose ABot-M0.5, a new WAM built on the insight that mobile manipulation requires alignment at three levels: temporal granularity, action space, and train-test consistency. To align temporal granularity, we introduce intermediate latent actions that capture local visual state transitions and serve as a bridging action space between video latents and embodiment-specific controls. To align action space, we design a dual-level Mixture-of-Transformers architecture that disentangles both modality representations and heterogeneous action subspaces such as base movement and arm manipulation. To align inference conditions, we propose the dream-forcing training strategy that progressively trains inverse dynamics on model-predicted videos, improving train-test alignment and robustness during autoregressive prediction. Experiments on challenging mobile and fine-grained manipulation benchmarks demonstrate that ABot-M0.5 achieves state-of-the-art performance in both long-horizon task success and fine-grained control accuracy. These results highlight the critical importance of granularity-aligned, action-disentangled, and inference-consistent world-action modeling.
[CV-68] DART: Difficulty-Adaptive Routing for Zero-Shot Video Temporal Grounding ECCV
Link: https://arxiv.org/abs/2607.00672
Authors: Zhengbo Zhang,Mark He Huang,Zhigang Tu,Ming-Hsuan Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the European Conference on Computer Vision (ECCV) 2026
Abstract:Zero-shot video temporal grounding (VTG) localizes events in untrimmed videos from natural language queries without task-specific training. Existing methods rely on frame-query feature matching, which suffices for simple events but struggles with complex multi-stage queries that require understanding temporal ordering and causal structure – a disparity we call the reasoning gap. We propose DART (Difficulty-Adaptive Routing for Temporal Grounding), which bridges this gap by coupling difficulty-aware routing with structured reasoning in large vision-language models. A query-conditioned Determinantal Point Process (DPP) serves a dual role: selecting diverse, query-relevant keyframes as temporal evidence, and providing spectral entropy as a difficulty indicator. Simple queries are routed to a Fast path for direct prediction, while complex queries follow a Slow path with Temporal Markup Prompting, which decomposes localization into global event analysis, per-frame temporal role annotation, and boundary extraction. On Charades-STA and ActivityNet Captions, DART achieves state-of-the-art zero-shot performance across both identically distributed and multiple out-of-distribution settings, improving mIoU by up to 3.5 points over the strongest baseline while using over 7 times fewer frames. The project homepage is available at this https URL.
[CV-69] Domain Arithmetic: One-Shot VLA Adaptation under Environmental Shifts ECCV2026
Link: https://arxiv.org/abs/2607.00666
Authors: Taewook Kang,Taeheon Kim,Donghyun Shin,Jonghyun Choi
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ECCV 2026. Project page: this https URL
Abstract:Vision-Language-Action (VLA) models often fail to perform the same learned tasks under environmental shifts, such as changes in camera pose and shifts to a different but similar robot (e.g., from Panda to UR5e). Adapting these models to the shifted environment (i.e., target domain) often requires training on multiple demonstrations for each task, which are costly to collect. To reduce the burden of data curation and training, we propose an analogy-based method that adapts VLA models under environmental shifts through weight vector arithmetic with domain-specific information addition, named Domain ARiThmetic (DART). Unlike prior approaches, DART requires collecting only a single demonstration, enabling efficient adaptation. To accurately isolate domain-specific information for addition, DART performs subspace alignment between singular components in weight vectors to filter out noisy components. In both simulated and real-world experiments, DART outperforms existing VLA adaptation methods in one-shot scenarios across diverse visual and embodiment shifts. Code is available at this https URL.
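The basic idea of weight-vector arithmetic can be sketched in a few lines. The example below assumes a "domain vector" is the element-wise difference between a checkpoint adapted with the single target-domain demonstration and its source-domain counterpart, and that this vector is then added to the weights being adapted. That reading of the analogy is an assumption, the names `domain_vector` and `apply_domain_vector` are hypothetical, and the subspace-alignment filtering the abstract describes is deliberately omitted.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened values


def domain_vector(target_adapted: Weights, source_ref: Weights) -> Weights:
    """Element-wise difference that is assumed to isolate the domain shift."""
    return {k: [t - s for t, s in zip(target_adapted[k], source_ref[k])]
            for k in target_adapted}


def apply_domain_vector(weights: Weights, dvec: Weights, alpha: float = 1.0) -> Weights:
    """Add the (optionally scaled) domain vector to another set of weights."""
    return {k: [w + alpha * d for w, d in zip(weights[k], dvec[k])]
            for k in weights}


source_ref = {"layer.w": [1.0, 2.0]}
target_adapted = {"layer.w": [1.5, 1.0]}   # after tuning on one target-domain demo
policy = {"layer.w": [10.0, 20.0]}         # weights to be moved to the target domain

dvec = domain_vector(target_adapted, source_ref)
print(apply_domain_vector(policy, dvec))
```

In this naive form every component of the difference is transferred, including noise from the single demonstration, which is the problem the paper's alignment between singular components is meant to address.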
[CV-70] Linguistic Relative Policy Optimization for Video Anomaly Reasoning ICML2026
Link: https://arxiv.org/abs/2607.00654
Authors: Jiaxu Leng,Jiankang Zheng,Mengjingcheng Mo,Zhanjie Wu,Haosheng Chen,Ji Gan,Xinbo Gao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICML 2026; 18 pages, 8 figures, 9 tables
Abstract:Video anomaly detection (VAD) with multimodal large language models has shown strong potential, yet most existing methods still depend on large-scale annotations or expert-designed priors, limiting their ability to acquire anomaly knowledge with as little human intervention as possible. To address this, we propose Linguistic Relative Policy Optimization (LRPO), which distills group-relative semantic advantages from multiple reasoning trajectories into a linguistically expressed anomaly experience prior, and adapts the model by injecting this prior into the context to steer its output distribution without any parameter updates. LRPO builds two complementary experience representations: general experience captures transferable anomaly preferences across scenarios, while scenario experience models context-dependent anomaly rules for targeted refinement. To further improve the learned experience, we introduce an anomaly alignment reward that guides trajectory optimization to match human risk preferences and reinforce temporally grounded reasoning. Extensive experiments on XD-Violence, UCF-Crime, and UBnormal demonstrate that LRPO significantly outperforms existing state-of-the-art methods under tuning-free settings.
[CV-71] Not All Prediction Targets Keep Training-Free Diffusion Guidance on the Manifold ECCV2026
Link: https://arxiv.org/abs/2607.00647
Authors: Yunsung Lee,Hyeongmin Lee
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026. 15-page main paper with appendix (48 pages total, 14 figures). Project page: this https URL
Abstract:Training-free guidance (TFG) steers a pretrained diffusion model toward a desired attribute at inference. To be effective, this guidance must be applied from the earliest, high-noise steps of sampling. Because its objective (a classifier or energy) is defined on clean images, $\epsilon$- and $v$-prediction models must first estimate the clean image $\hat{x}$ from the noisy state at each step, and the accuracy of that estimate determines how easily guidance drifts off the data manifold. $x$-prediction, a recent alternative, outputs the clean image directly, removing this source of error even at high noise. This is our motivation. We provide a theoretical analysis of how each prediction target shapes this accuracy, and introduce guided-class FID (Child FID), a metric that exposes the manifold damage standard evaluation misses. Experiments on a new fine-grained bird benchmark and on style transfer confirm that $x$-prediction keeps guided samples on the manifold most reliably, making it the strongest foundation for training-free guidance. Code is available at this https URL
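For readers who want the conversion the abstract refers to, the standard identities are written out below. They assume the usual forward process $x_t = \alpha_t x_0 + \sigma_t \epsilon$ with a variance-preserving schedule ($\alpha_t^2 + \sigma_t^2 = 1$) and the common definition $v = \alpha_t \epsilon - \sigma_t x_0$. This is textbook background rather than the paper's own analysis, and $\delta$ is a hypothetical error in the network output.

```latex
% Clean-image estimates recovered from each prediction target
\hat{x}_{\epsilon} = \frac{x_t - \sigma_t\,\hat{\epsilon}}{\alpha_t},
\qquad
\hat{x}_{v} = \alpha_t\,x_t - \sigma_t\,\hat{v},
\qquad
\hat{x}_{x} = \hat{x}\ \text{(direct network output)}.

% Effect of an output error \delta on the clean-image estimate
\hat{\epsilon} = \epsilon + \delta
  \;\Rightarrow\; \hat{x}_{\epsilon} - x_0 = -\frac{\sigma_t}{\alpha_t}\,\delta,
\qquad
\hat{v} = v + \delta
  \;\Rightarrow\; \hat{x}_{v} - x_0 = -\sigma_t\,\delta.
```

At high noise $\alpha_t \to 0$, so the factor $\sigma_t / \alpha_t$ grows without bound for $\epsilon$-prediction, while for $v$-prediction the error is scaled by at most $\sigma_t \le 1$. With $x$-prediction no conversion step is applied at all, which is the property the abstract highlights.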
[CV-72] Uncertainty-aware tree height change regression
Link: https://arxiv.org/abs/2607.00638
Authors: Max Gaber,Dimitri Gominski,Jaime C. Revenga,Stefan Oehmcke,Rasmus Fensholt,Martin Brandt
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Monitoring canopy height change is essential for understanding carbon sinks and forest dynamics. Remote sensing enables consistent, large-scale observations of such changes, increasingly integrated with deep learning architectures such as Geospatial Foundation Models (GFMs). However, existing methods and datasets frame the problem as binary change detection, which overlooks both the continuous nature of change, especially for vegetation, and the inherent uncertainty in labels. We present the Canopy Height Change (CHC) dataset, providing $3\,\mathrm{m}$ resolution continuous canopy height differences and associated spatially resolved uncertainties across $10598\,\mathrm{km}^2$ of northern and western Spain. The dataset is paired with a co-located time series of PlanetScope satellite imagery. Based on the dataset, we introduce the task of uncertainty-aware change regression, associated metrics and strategies for fine-tuning GFMs. Furthermore, we evaluate state-of-the-art GFMs and highlight promising directions and remaining challenges for advancing continuous canopy height change estimation.
[CV-73] Learning to Watch: Active Video Anomaly Understanding via Interleaved Policy Optimization ICML2026
Link: https://arxiv.org/abs/2607.00622
Authors: Mengjingcheng Mo,Jiaxu Leng,Xinbo Gao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICML 2026; 25 pages, 8 figures, 15 tables
Abstract:Video anomaly understanding (VAU) relies on sparse, context-dependent cues. However, existing passive paradigms suffer from observational aliasing, where static sampling fails to disambiguate semantically distinct events. To overcome this, we propose Anom-$\pi$, a closed-loop framework that reconceptualizes video understanding as an active sequential decision-making process within a dynamic environment. Inspired by human video-reviewing behavior, this framework unifies internal cognitive reasoning and strategic evidence acquisition into an interleaved policy, utilizing temporal atomic operators such as local backtracking, temporal expansion, and fine-grained sampling to endow the model with perceptual proactivity. To learn such complex interaction strategies under video-level weak supervision, we design Interactive Direct Preference Optimization (iDPO) to achieve trajectory-level policy alignment, guided by an Active Evidence Inquiry (AEI) utility that balances task success, informative evidence acquisition, and interaction cost. This approach enables the agent to learn to actively disambiguate hypotheses while suppressing redundant exploration. Extensive experiments demonstrate that our framework, with only 2B parameters, achieves highly competitive performance, significantly outperforming state-of-the-art large-scale VAU models in complex scenarios.
[CV-74] Identifying Latent Concepts and Structures for Generalized Category Discovery ICML2026
Link: https://arxiv.org/abs/2607.00620
Authors: Boyang Dai,Chaoqi Chen,Yizhou Yu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted by ICML2026
Abstract:Generalized Category Discovery (GCD) aims to recognize known classes while autonomously discovering novel ones in open-world settings. However, current approaches primarily focus on designing clustering objectives, often overlooking a critical bottleneck: standard vision backbones yield high-rank, entangled token representations that are ill-suited for unsupervised discovery of latent concepts and structures. In this paper, we propose Compositional Primitive Fields (CPF-GCD), a novel representation learning framework that reshapes the feature space to make such latent structure identifiable by enforcing a low-rank compositional organization. Our core hypothesis is that all categories, whether known or novel, can be expressed as compositions and spatial arrangements of a finite set of learnable visual primitives that capture reusable concepts. CPF instantiates this geometric constraint via a spatial field mechanism. Inserted between the backbone and the head, it rewrites noisy patch tokens through low-rank primitive mixtures, effectively decomposing images into reusable atomic parts and their spatial layouts. By explicitly modeling the spatial distribution of primitives, CPF enables novel categories to emerge naturally as new activation patterns over a shared vocabulary. This shifts the focus of representation from merely partitioning global embeddings to constructing a structured and separable primitive field. Extensive experiments demonstrate that CPF serves as a generic, plug-and-play module that consistently boosts performance across diverse GCD baselines, validating that identifying and leveraging low-rank compositional structure is a crucial inductive bias for open-world recognition.
[CV-75] Diffusion-Based Multi-Class Normality for OOD Detection: An Application to CDP Authentication
Link: https://arxiv.org/abs/2607.00609
Authors: Bolutife Atoki(imagine, LIRIS),Iuliia Tkachenko(imagine, LIRIS),Bertrand Kerautret(imagine, LIRIS),Carlos Crispim-Junior(imagine, LIRIS)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: IEEE International Conference on Advanced Visual And Signal-Based Systems, Aug 2026, Lecce, Italy
Abstract:Reconstruction-based generative models offer a natural framework for unsupervised out-of-distribution (OOD) detection, but multi-class normality modelling requires a single detector to capture multiple in-distribution manifolds and produce comparable anomaly scores across classes. We study this problem in copy detection pattern (CDP) authentication, where authentic and counterfeit samples are visually similar but differ in subtle printing-and-digitisation (P&D) signatures. We propose a diffusion-based multi-class normality framework in which a single class-conditional ControlNet is trained exclusively on authentic CDPs from multiple P&D classes and detects counterfeits through reconstruction error under authentic-class conditioning. We further introduce dual template masking, which hides complementary regions of the input template and scores only withheld pixels, reducing reliance on visible binary structure. On the Indigo 1 x 1 Base dataset, the proposed method outperforms traditional and adapted generative baselines under multi-class authentic-versus-counterfeit evaluation, without using counterfeit samples for training or threshold calibration.
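A small sketch may help clarify how dual template masking scores a sample. It assumes two complementary checkerboard masks (the abstract only says the regions are complementary) and a caller-supplied `reconstruct` function standing in for the class-conditional diffusion model; both, along with the names `checkerboard` and `dual_mask_score`, are hypothetical.

```python
from typing import Callable, List

Image = List[List[float]]


def checkerboard(h: int, w: int, phase: int) -> Image:
    """Mask with 1 for hidden pixels and 0 for visible ones."""
    return [[1 if (r + c) % 2 == phase else 0 for c in range(w)] for r in range(h)]


def dual_mask_score(sample: Image, template: Image,
                    reconstruct: Callable[[Image, Image], Image]) -> float:
    """Mean absolute reconstruction error measured only on withheld pixels.

    `reconstruct(masked_template, mask)` must return a full-size prediction of
    the printed sample; here it is a placeholder for the generative model.
    """
    h, w = len(sample), len(sample[0])
    total, count = 0.0, 0
    for phase in (0, 1):  # two complementary masks cover every pixel exactly once
        mask = checkerboard(h, w, phase)
        masked = [[template[r][c] * (1 - mask[r][c]) for c in range(w)] for r in range(h)]
        pred = reconstruct(masked, mask)
        for r in range(h):
            for c in range(w):
                if mask[r][c]:  # score only where the template was hidden
                    total += abs(pred[r][c] - sample[r][c])
                    count += 1
    return total / count


def flat_gray(masked: Image, mask: Image) -> Image:
    """Stand-in reconstructor that predicts mid-gray everywhere."""
    return [[0.5 for _ in row] for row in masked]


print(dual_mask_score([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]], flat_gray))
```

Since the model never sees the template value at a pixel it is scored on, it cannot lower its error by copying the visible binary pattern, which is the reliance the abstract says the masking reduces.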
[CV-76] Retrieved Images as Visual Thought: Training-Free Multimodal In-Context Learning for the Open-vs-Closed Gap
Link: https://arxiv.org/abs/2607.00606
Authors: Bingchen Huang,Zhiling Wang,Yifu Chen,Yuanchao Du
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 6 figures. Includes appendix. Introduces the MAAC-Bench benchmark
Abstract:Recent work on Thinking with Images makes vision a dynamic part of reasoning, but does so through generation: the model invokes external tools, synthesizes code, or imagines new imagery, each at the cost of a tool protocol, brittle code, or an expensive training pipeline. A fourth route makes vision dynamic without generating anything, by retrieving labeled exemplar images and reasoning over them, yet it remains underexplored despite being train-free. We present ReVisIT, a train-free framework that realizes this retrieval-based route by treating each retrieved image-label pair as a unit of visual thought. ReVisIT combines structured class definitions, per-query multimodal retrieval of exemplars, and alternating user/assistant injection of those exemplars before joint multi-attribute decoding, and degrades gracefully to whichever components a task admits. On VL-ICL Bench Fast Open MiniImageNet, Qwen3-VL-30B-A3B with ReVisIT reaches 98.5% at 4-shot, statistically indistinguishable from the 72B LLaVA-OneVision SOTA (98.7%) on this near-saturated task at about 1/2.4 the parameters, while the same backbone without the scaffold sits at chance. The turns layer alone adds 26.1 points to GPT-4.1 on free-form concept induction (Bongard-OpenWorld), and the full stack yields a 4-6 point macro gain across three backbones on MAAC-Bench, a new license-clean 27-class, 5-attribute benchmark, significant by paired bootstrap on the curator-derived attributes. Component analysis shows that retrieval-plus-turns is the universal lever while structured definitions are need-adaptive, and that 83% of the retrieval gain comes from retrieval quality rather than from the presence of exemplars. MAAC-Bench is released with a rubric-grounded LLM verification protocol that replaces author spot-check on subjective attributes.
[CV-77] Semantic-Guided Reading Order Reconstruction in Historical Armenian Newspapers with LLMs
Link: https://arxiv.org/abs/2607.00596
Authors: Chahan Vidal-Gorène(CJM, LIPN),Nadi Tomeh(LIPN),Victoria Khurshudyan(Inalco, SeDyL)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: International Conference on Pattern Recognition, 2026, Lyon, France
Abstract:This paper addresses reading order reconstruction in historical Armenian newspapers, which combine complex layouts with limited language resources. We introduce a new annotated dataset of 66 pages and compare geometric heuristics, YOLO-based layout parsing, an end-to-end document model ECLAIR, and a hybrid method combining semantic zone detection with a generative LLM. Our hybrid method achieves the lowest error rates of all evaluated approaches, reducing ordering errors by up to 76% over the strongest geometric baseline, and remains robust in multi-page settings and under noisy OCR. Rather than targeting production, the method is designed as a data bootstrapping strategy enabling rapid annotation in highly under-resourced scenarios. Alongside the dataset, we release a specialized Tesseract OCR model for historical Armenian print.
[CV-78] GADA: Geometry-Aware Deformable Aggregation for Image-Based Gaussian Splatting ICML2025
Link: https://arxiv.org/abs/2607.00595
Authors: Siwoo Lim,Sunjae Yoon,Gwanhyeong Koo,Chang D. Yoo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2025
Abstract:Gaussian Splatting has achieved significant improvements by incorporating warping-based techniques. However, such methods suffer from pixel-level inaccuracies due to uncertain geometry. This uncertainty leads to spatial misalignments in the warped images, which disrupt residual learning used in warping-based methods and fundamentally limit the gains of correction, particularly on thin structures and high-frequency details. Driven by our insight that useful visual cues are not lost but locally preserved under slight displacement, we propose Geometry-Aware Deformable Aggregation (GADA). This method introduces an iterative refinement module with deformable offsets to actively correct spatial misalignments and recover these displaced cues. Furthermore, to address the limitations of standard pipelines where visibility checks (i.e., thresholding) often discard valid pixels and multi-view warped image fusion relies on naive mean aggregation, our module is coupled with an implicit confidence weighting mechanism that selectively suppresses unreliable evidence. Consequently, our approach outperforms prior warping-based Gaussian Splatting, preserving high-frequency quality while achieving 2.13 times faster FPS.
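The contrast between naive mean aggregation and confidence weighting can be shown with a per-pixel toy example. The sketch below uses a softmax over per-view confidence logits as the weighting; that choice, and the names `fuse_mean` and `fuse_confidence`, are assumptions for illustration, since the abstract describes the mechanism only as implicit confidence weighting.

```python
import math
from typing import Sequence


def fuse_mean(values: Sequence[float]) -> float:
    """Naive fusion: every warped view contributes equally."""
    return sum(values) / len(values)


def fuse_confidence(values: Sequence[float], logits: Sequence[float]) -> float:
    """Softmax-weighted fusion; low-confidence views are softly suppressed."""
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# One pixel seen from three warped source views; the third is misaligned.
views = [1.0, 1.0, 10.0]
confidence_logits = [0.0, 0.0, -50.0]  # hypothetical confidence-head output

print(fuse_mean(views))                           # pulled toward the outlier
print(fuse_confidence(views, confidence_logits))  # stays close to the consistent views
```

Unlike a hard visibility threshold, the soft weights never discard a view outright, so a partially reliable pixel can still contribute in proportion to its confidence.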
[CV-79] Active Spatial Guidance: Eliminating Injected Positional Mechanisms in Vision Transformers
Link: https://arxiv.org/abs/2607.00580
Authors: Cong Liu,Xiaofang Li,Simon X. Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Vision Transformers (ViTs) commonly rely on injected positional mechanisms to address self-attention’s permutation invariance. Motivated by the spatial regularities of natural images, we ask whether spatial organization can be induced from data rather than explicitly injected. Under controlled, matched from-scratch training, we propose Active Spatial Guidance (Guidance), a training-only objective that disables positional injection and applies an auxiliary 2D coordinate-regression loss to the final-layer patch tokens. The guidance head is used only during training and removed for inference; the deployed model consists of a positional-injection-free ViT encoder and the task-specific prediction module. Using DINOv3 ViT backbones, Guidance consistently improves performance on ImageNet-100 classification, ADE20K semantic segmentation, and Hypersim monocular depth estimation, outperforming strong injected baselines such as learned absolute positional embeddings and rotary positional embeddings under identical training protocols. On ImageNet-100, broader comparisons against representative injected positional designs further support Guidance’s effectiveness. Guidance also improves robustness under resolution transfer, and multi-resolution training further strengthens accuracy across input sizes. Overall, our results suggest that spatial inductive bias in ViTs need not be architecturally injected, but can be shaped through training-time supervision. The code used for training and evaluation is publicly available in this https URL.
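To make the auxiliary objective concrete, here is a minimal sketch of a 2D coordinate-regression loss over patch tokens. The abstract does not specify the target normalization or the loss form, so the normalized patch centres in [0, 1] and the mean squared error below are assumptions, and the function names are hypothetical rather than taken from the released code.

```python
def patch_coordinate_targets(grid_h, grid_w):
    """Normalized (row, col) centre of every patch, in row-major order.

    Assumption: targets live in [0, 1]; the paper may normalize differently.
    """
    return [((r + 0.5) / grid_h, (c + 0.5) / grid_w)
            for r in range(grid_h) for c in range(grid_w)]


def coordinate_regression_loss(predicted, targets):
    """Mean squared error over all patches and both coordinates (assumed form)."""
    total = sum((py - ty) ** 2 + (px - tx) ** 2
                for (py, px), (ty, tx) in zip(predicted, targets))
    return total / (2 * len(targets))


# A 2x2 patch grid: the head should predict each token's own location.
targets = patch_coordinate_targets(2, 2)
perfect_loss = coordinate_regression_loss(targets, targets)

# A head that predicts the image centre for every token carries no
# spatial information and is penalized.
collapsed = [(0.5, 0.5)] * len(targets)
collapsed_loss = coordinate_regression_loss(collapsed, targets)
```

In this sketch the loss is zero only when each final-layer token encodes its own grid position, which is how a training-only head of this kind could pressure the encoder to organize tokens spatially without any injected positional signal.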
[CV-80] EPO: Boosting 3D Foundation Models with Edge-based Pose Optimization ECCV2026
Link: https://arxiv.org/abs/2607.00579
Authors: Mattia D’Urso,Christian Sormann,Mattia Rossi,Friedrich Fraundorfer
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ECCV 2026
Abstract:We introduce Edge-based Pose Optimization (EPO), a trackless geometric optimization framework specifically designed to boost the Structure-from-Motion reconstructions generated by 3D Foundation Models. These models achieve rapid inference by bypassing the time-consuming feature extraction and matching stages of traditional pipelines, where explicit correspondences between each 3D point and multiple images, referred to as tracks, are established. However, their geometric accuracy currently falls short of traditional pipelines. While this can be addressed in a post-processing step via Bundle Adjustment-like refinement, doing so requires extracting feature tracks, thus defeating the original speed advantage. Instead, our fully differentiable framework uses edge map alignment as a proxy for geometric optimization, avoiding feature extraction and track construction entirely. Through extensive evaluation across multiple datasets and tasks, we demonstrate that EPO matches or outperforms Bundle Adjustment-like methods while requiring significantly lower runtime and memory. Notably, its reduced memory footprint makes EPO suitable for consumer-grade hardware, where competing refinement methods cannot run.
[CV-81] Caption Bottleneck Models ECCV2026
Link: https://arxiv.org/abs/2607.00578
Authors: Seref Baris Cagliyan,Umut Ozdemir,Merve Tapli,Emre Akbas
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026
Abstract:Concept Bottleneck Models (CBMs) provide interpretability by routing predictions through a layer of human-understandable concepts. However, defining an optimal concept set for a specific dataset remains an open challenge. Existing approaches rely on expensive expert annotations or LLM-generated lists based solely on class names. Even “open-vocabulary” variants typically depend on static concept sets, which restrict discovery and introduce label bias. Furthermore, traditional CBMs often suffer from information leakage, where unmodeled visual features bypass the bottleneck and compromise the integrity of the explanations. To overcome these limitations, we propose Caption Bottleneck Models (CaBM), a framework that circumvents the need for predefined concept sets by replacing rigid concept layers with free-form natural language. By representing images via LMM-generated captions and training a classifier strictly on this text, CaBM ensures a leakage-free architecture by construction. Additionally, by analyzing the text classifier post-training, CaBM autonomously discovers high-quality, dataset-specific concepts. Our results across fine- and coarse-grained benchmarks demonstrate that CaBM achieves competitive accuracy while preserving interpretability without the constraints of external dictionaries or manual labeling.
[CV-82] BrainFIBRE: A Foundation Model via Information Decomposition for Brain Microstructure ECCV2026
Link: https://arxiv.org/abs/2607.00573
Authors: Zijian Dong,Yi Lin,Ji Fang,Jianxiong Zhou,Kwun Kei Ng,Juan Helen Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2026. The first three authors contributed equally
Abstract:Diffusion MRI probes brain microstructure with particular sensitivity to early cerebrovascular and neurodegenerative changes. Neurite Orientation Dispersion and Density Imaging (NODDI) decomposes the diffusion signal into three biophysically interpretable maps: neurite density index (NDI), orientation dispersion index (ODI), and free water fraction (FWF), capturing neurite packing, fiber coherence, and extracellular fluid. These 3D maps offer a rich substrate for transferable microstructural representations, yet integrating them is challenging: standard representation learning struggles to disentangle the unique information in each map from their shared and synergistic interactions. We present BrainFIBRE, the first foundation model for brain microstructure, pretrained on NODDI-derived maps from 55,592 UK Biobank participants. We propose Self-supervised Partial Information Decomposition (SPID), which extends PID-guided multimodal learning to the self-supervised regime for the first time. A novel Counterfactual Candidate Construction (CCC) paradigm perturbs inter-modality alignment through modality dropping and swapping, providing the contrastive signal for a Mixture-of-Experts architecture to disentangle unique, synergistic, and redundant information without any downstream label. On both Caucasian and Asian cohorts, BrainFIBRE achieves state-of-the-art performance across diverse tasks predicting age, sex, cerebrovascular and neurodegenerative markers, and cognition, while yielding neurobiologically interpretable representations that reveal task- and cohort-specific interaction patterns. BrainFIBRE establishes a versatile foundation for neuroimaging analysis at the microstructural level.
[CV-83] EgoGapBench: Benchmarking Egocentric Action Selection in Multi-Agent Scenes
Link: https://arxiv.org/abs/2607.00547
Authors: Jihyeok Jung(1),Jeewu Lee(2),Sanghyeop Kim(2),Chanhee Han(3),Seong Joon Oh(1) ((1) KAIST AI, (2) Sogang University, (3) Ministry of Science and ICT)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages, 2 figures, 8 tables. Code and benchmark are available at this https URL
Abstract:Existing egocentric benchmarks have primarily constructed the egocentric setting from first-person-view data, which makes it difficult to evaluate egocentric perspective itself in isolation. However, understanding first-person-view input and taking an egocentric perspective are separable abilities, especially when first-person body cues are absent or when other agents are present. To isolate egocentric perspective understanding, we introduce EgoGapBench, a diagnostic benchmark for measuring action selection in multi-agent egocentric scenes. We define the ability measured by this benchmark as Egocentric Action Selection (EAS): selecting an appropriate action from the agent’s perspective in the presence of other agents. On EgoGapBench, humans answer reliably, whereas both open-source and proprietary MLLMs perform substantially worse and systematically select actions performed by other visible agents. Fine-tuning on existing egocentric data fails to close this gap and can even be detrimental. In contrast, fine-tuning on EgoGapBench training data improves accuracy but does not reach human performance. These results show that EAS is difficult to acquire from first-person-view data alone, and that MLLMs should be evaluated and trained not only for scene understanding but also for egocentric action selection.
[CV-84] ECoSim: Data Efficient Fine-Tuning for Controllable Traffic Simulation ECCV
Link: https://arxiv.org/abs/2607.00545
Authors: Yu-Hsiang Chen,Wei-Jer Chang,Yi-Ting Chen,Masayoshi Tomizuka
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: European Conference on Computer Vision (ECCV) 2026
Abstract:Controllable traffic simulation is critical for testing autonomous driving systems, yet existing approaches often require retraining large generative models with extensive annotated data. We introduce a lightweight control adaptation framework that enables multi-modal controllability (sketch, latent behavior codes, and text) for pretrained state-of-the-art diffusion and autoregressive traffic models. By modulating intermediate features through identity-initialized FiLM layers, our method efficiently adds new control modalities while preserving the base model’s generative prior. Evaluated on Waymo Open Sim Agents Challenge, our approach demonstrates strong controllability with less than 1% of the paired control data. Through context-aware condition transfer, our framework enables counterfactual scenario generation and long-tail synthesis while maintaining stable closed-loop driving realism and safety. Our framework unlocks new possibilities for controllable traffic simulation, enabling targeted scenario generation through lightweight adaptation of pretrained generative models. Project page: this https URL
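The role of identity initialization can be illustrated with a small sketch of a FiLM (feature-wise linear modulation) layer. This is a generic illustration, not the ECoSim implementation: the class name, the linear conditioning projections, and the dimensions are all hypothetical. The only property it aims to show is that zero-initialized projections leave the pretrained features untouched until training moves them.

```python
class IdentityInitFiLM:
    """FiLM layer computing y = (1 + dgamma(c)) * x + beta(c).

    Hypothetical sketch: both projections start at zero, so the layer is an
    exact identity on x at initialization and the base model is preserved.
    """

    def __init__(self, cond_dim, feat_dim):
        # Zero-initialized weights with shape (feat_dim, cond_dim).
        self.w_gamma = [[0.0] * cond_dim for _ in range(feat_dim)]
        self.w_beta = [[0.0] * cond_dim for _ in range(feat_dim)]

    @staticmethod
    def _matvec(weights, cond):
        return [sum(w * c for w, c in zip(row, cond)) for row in weights]

    def __call__(self, features, cond):
        dgamma = self._matvec(self.w_gamma, cond)
        beta = self._matvec(self.w_beta, cond)
        return [(1.0 + g) * x + b for x, g, b in zip(features, dgamma, beta)]


film = IdentityInitFiLM(cond_dim=2, feat_dim=3)
features = [0.5, -1.0, 2.0]   # intermediate features of the frozen base model
control = [1.0, 3.0]          # embedding of a new control signal (made up)

# At initialization the control signal has no effect.
out_at_init = film(features, control)

# After (simulated) training updates, the control modulates the features.
film.w_gamma[0][0] = 0.5      # scale of feature 0 becomes 1 + 0.5 * 1.0
film.w_beta[2][1] = 0.1       # shift of feature 2 becomes 0.1 * 3.0
out_after_update = film(features, control)
```

Starting from the identity means fine-tuning begins exactly at the pretrained model's behaviour, which is one plausible reason such a layer can add a control modality with little paired data.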
[CV-85] GEAR-Seg: A Grounded Explainable Agent for Reasoning Segmentation and Data Engine
Link: https://arxiv.org/abs/2607.00544
Authors: Yanan Wang,Wen Li,Yibin Ying,Zhenghao Fei
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 8 figures
Abstract:Reasoning segmentation requires localizing targets based on complex, implicit queries. Current end-to-end models typically entangle perception and deduction into an opaque black box, severely limiting interpretability and scalability. To address this, we propose GEAR-Seg (Grounded Explainable Agent for Reasoning Segmentation), an explicitly decoupled agent that shifts the paradigm by translating visual pixels into dense, attribute-rich text. By decoupling class-agnostic segmentation, semantic description, and Large Language Model (LLM) deduction, GEAR-Seg transforms implicit reasoning into an explicit, trackable logic chain. As a zero-shot inference framework, it achieves highly competitive performance across diverse reasoning and fine-grained referring segmentation benchmarks. Furthermore, GEAR-Seg inherently functions as a highly scalable data engine. Utilizing this engine, we construct GEAR-131K, a massive benchmark (over 38k images, 656k QA-mask pairs) introducing a multifaceted taxonomy tailored for complex real-world manipulation-oriented reasoning. Finally, distillation experiments demonstrate that lightweight models supervised exclusively by our automated pipeline closely match the upper-bound performance of costly human-annotated baselines.
[CV-86] Flow-Map GRPO: Reinforcement Learning for Few-Step Flow-Map Generators via Anchored Stochastic Composition
Link: https://arxiv.org/abs/2607.00535
Authors: Zhiqi Li,Wen Zhang,Bo Zhu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 31 pages, 29 figures
Abstract:Few-step flow-map generators, such as consistency models and MeanFlow, accelerate sampling by directly learning long-range transport maps between noise and data. However, these models are typically deterministic, which makes them difficult to optimize with reinforcement learning (RL) post-training methods that require stochastic trajectories and well-defined likelihood ratios. Existing SDE-based stochasticization techniques are designed for velocity-based samplers with infinitesimal or finely discretized transitions, and therefore do not directly apply to long-range flow maps. In this work, we propose Flow-Map GRPO, an online RL post-training framework for deterministic few-step flow-map generators. The key component is Anchored Stochastic Flow Map Composition (ASFMC), a path-preserving stochasticization mechanism that introduces randomness through anchor-based conditional resampling while preserving the original marginal probability path of the deterministic flow map. We derive GRPO objectives for both single-time and two-time flow-map parameterizations. Experiments on few-step FLUX-based text-to-image generators, including MeanFlow and sCM, show that Flow-Map GRPO improves pretrained deterministic flow-map models across reward-based, perceptual, and task-level evaluation metrics. Our results demonstrate that deterministic few-step flow-map generators can be effectively aligned with RL post-training without modifying their original model parameterization or retraining them as native stochastic models.
[CV-87] NoPA: Non-Parametric Online 3D Scene Graph Generation ECCV26
Link: https://arxiv.org/abs/2607.00529
Authors: Qi Xun Yeo,Seungjun Lee,Yan Li,Gim Hee Lee
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted in ECCV 26
Abstract:Classic 3D scene graph generation approaches fail to work in real-time due to the heavy computational cost of environment mapping and the need to generate intermediate point-cloud representations. To alleviate this issue, a recent work eschews point clouds in favor of a lightweight Gaussian distribution for each object. This approximation drastically speeds up inference and enables real-time 3D scene graph generation. However, the representation has two key weaknesses. 1) Each object is approximated by a single 3D Gaussian, which causes a severe loss of 3D geometric detail. 2) The discrepancy between this approximation and the true object geometry exacerbates the inaccurate merging of object candidates during online inference. To address these issues, we propose NoPA, which represents each object as a separate non-parametric distribution. This formulation retains 3D geometric information while preserving real-time inference of the parametric Gaussian formulation. To build upon our novel object representation, we propose a tailored merging strategy to recover coherent object instances. Specifically, we leverage maximum mean discrepancy on kernel density estimates to enable robust merging of object candidates during online exploration while minimizing added computational complexity. The key is to maintain a fixed particle set per object. Furthermore, to rectify the relation loss caused by misclassified objects, NoPA propagates relationships between objects with high affinity. Experiments show that NoPA substantially outperforms current methods without sacrificing real-time inference speed.
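As a rough illustration of merging object candidates by maximum mean discrepancy (MMD), the sketch below compares two small particle sets with a Gaussian kernel. The kernel choice, bandwidth, merge threshold, and function names are assumptions made for the example; the abstract does not give NoPA's actual estimator or hyperparameters.

```python
import math


def gaussian_kernel(a, b, bandwidth):
    """Gaussian (RBF) kernel between two 3D points."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=0.5):
    """Biased estimate of squared MMD between two particle sets."""
    def mean_kernel(ps, qs):
        total = sum(gaussian_kernel(p, q, bandwidth) for p in ps for q in qs)
        return total / (len(ps) * len(qs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def should_merge(xs, ys, threshold=0.1, bandwidth=0.5):
    """Merge two candidates when their particle distributions are close."""
    return mmd_squared(xs, ys, bandwidth) < threshold


# Made-up particles: two overlapping views of one object, and a distant object.
chair_a = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0)]
chair_b = [(0.02, 0.0, 0.0), (0.12, 0.0, 0.0), (0.02, 0.1, 0.0)]
table = [(3.0, 0.0, 0.0), (3.1, 0.0, 0.0), (3.0, 0.1, 0.0)]

merge_same_object = should_merge(chair_a, chair_b)
merge_different_objects = should_merge(chair_a, table)
```

Because each object keeps a fixed number of particles, the cost of one comparison stays constant however long the exploration runs, which matches the abstract's emphasis on bounded added complexity.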
[CV-88] SPECSIA: Stylization Dataset for Novel-View Enhancement in Drawing-based 3D Animation ECCV2026
Link: https://arxiv.org/abs/2607.00525
Authors: Kyuwon Kim,Sunjae Yoon,Chang D. Yoo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2026
Abstract:Generating animation from a single 2D drawing is challenging because the output must preserve character appearance while remaining plausible and temporally coherent under motion. Existing drawing-based 3D animation pipelines often use sample-wise 2D refinement to align animated renderings with the input image, but such optimization tends to overfit to the observed view and fails to correct projection-induced artifacts in novel views. To address this limitation, we introduce SPECSIA-15K, a paired stylization dataset containing 14,980 artifact-corrupted projection/refinement-target pairs from 1,498 3DBiCar characters. We further present DraViE (Drawing-based View Enhancement), a lightweight plug-and-play module trained with data-level priors to remove novel-view artifacts while preserving style and motion plausibility. Experiments show consistent gains in novel-view fidelity and temporal coherence with lower per-character adaptation cost than sample-wise fine-tuning.
[CV-89] Restore3D: Breathing Life into Broken Objects with Shape and Texture Restoration
Link: https://arxiv.org/abs/2607.00522
Authors: Xiaolong Shen,Zongxin Yang,Yi Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Restoring incomplete or damaged 3D objects is crucial for cultural heritage preservation, occluded object reconstruction, and artistic design. Existing methods primarily focus on geometric completion, often neglecting texture restoration and struggling with relatively complex and diverse objects. We introduce Restore3D, a novel framework that simultaneously restores both the shape and texture of broken objects using multi-view images. To address limited training data, we develop an automated data generation pipeline that synthesizes paired incomplete-complete samples from large-scale 3D datasets. Central to Restore3D is a multi-view model, enhanced by a carefully designed Mask Self-Perceiver module with a Depth-Aware Mask Rectifier. The rectified masks learned by the self-perceiver guide an image integration and enhancement phase, helping retain observed shape and texture patterns while refining the generated regions and mitigating the low-resolution limitations of the base model, yielding high-resolution, semantically coherent, and view-consistent multi-view images. A coarse-to-fine reconstruction strategy is then employed to recover detailed textured 3D meshes from refined multi-view images. Experiments on synthetic and real broken-object benchmarks show that Restore3D improves multi-view restoration quality and textured-mesh reconstruction over representative inpainting, completion, and reconstruction baselines in the evaluated settings. Project Page: this http URL
[CV-90] Cross4D-JEPA: Dense Cross-modal Correspondence Distillation for 4D Point Cloud Representation Learning
Link: https://arxiv.org/abs/2607.00514
Authors: Trung Thanh Nguyen,Hai Nguyen-Truong,Tu Vo,Hoang M. Truong,Tuan-Anh Vu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Automatic understanding of dynamic 4D point clouds, the 3D-point sequences captured over time by depth sensors and LiDAR, is central to robotics and embodied perception. Yet annotating them densely is expensive, making self-supervised pretraining the natural route to transferable representations. Existing pretext tasks, however, are almost entirely intra-modal, and the few methods that transfer knowledge from 2D foundation models rely on a single global embedding per clip, discarding the rich per-patch semantics that these models compute. To address this gap, we propose Cross4D-JEPA, a teacher-student method that distills a frozen 2D foundation model, an image model DINOv2, or a video model V-JEPA 2, into a 4D point encoder. The proposed method combines (1) a dense cross-modal correspondence that maps every 3D point to the teacher patch feature it projects to, and (2) a per-point objective that trains the student to match these features in latent space with no masking, negatives, or decoder. We evaluate Cross4D-JEPA on four benchmarks, MSR-Action3D, DeformingThings4D, NTU-RGB+D 60, and HOI4D, against intra-modal and global cross-modal baselines. Experimental results show that, under a matched protocol, the proposed method consistently outperforms intra-modal and global cross-modal baselines across the four benchmarks and is competitive with heavier published 4D methods; further analysis attributes this gain primarily to the granularity of the correspondence rather than the teacher modality. Beyond recognition accuracy, the dense representation learned by Cross4D-JEPA transfers across domains, improves label efficiency, and improves full-label fine-tuning under the same training budget, while a 13x smaller encoder matches a heavyweight pooling backbone.
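The dense correspondence in item (1) can be pictured as projecting each 3D point into the image and looking up the patch it lands in. The sketch below assumes a pinhole camera with known intrinsics and a 224x224 image split into 14-pixel patches; the function name and every numeric value are hypothetical, and the paper's actual projection pipeline may differ.

```python
def point_to_patch_index(point, fx, fy, cx, cy, image_w, image_h, patch):
    """Return the row-major index of the image patch a 3D point projects to.

    Assumes a pinhole camera with the point given in camera coordinates.
    Returns None when the point is behind the camera or outside the image.
    """
    x, y, z = point
    if z <= 0:
        return None
    u = fx * x / z + cx   # horizontal pixel coordinate
    v = fy * y / z + cy   # vertical pixel coordinate
    if not (0 <= u < image_w and 0 <= v < image_h):
        return None
    grid_w = image_w // patch
    return int(v // patch) * grid_w + int(u // patch)


# Hypothetical intrinsics; 224 / 14 gives a 16x16 grid of teacher patches.
camera = dict(fx=200.0, fy=200.0, cx=112.0, cy=112.0,
              image_w=224, image_h=224, patch=14)

centre_patch = point_to_patch_index((0.0, 0.0, 2.0), **camera)
corner_patch = point_to_patch_index((-1.0, -1.0, 2.0), **camera)
out_of_view = point_to_patch_index((5.0, 0.0, 1.0), **camera)
behind_camera = point_to_patch_index((0.0, 0.0, -1.0), **camera)
```

With such an index in hand, a per-point objective could regress each student point feature toward the teacher feature of its patch, in contrast to matching one pooled embedding per clip.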
[CV-91] AnF-DiffPET: Anatomy- and Frequency-Guided Diffusion for PET/CT Denoising
Link: https://arxiv.org/abs/2607.00509
Authors: Xuepeng Liu,Ruili Li,Zetong Liu,Renyiming Li,Yan Li,Yin Dai,Chao Li,Yueyang Teng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 8 figures, 3 tables
Abstract:Positron emission tomography (PET) provides essential functional information for disease assessment; however, reducing injected activity or acquisition time produces low-dose (LD) PET with stronger count-dependent noise and less reliable uptake quantification. Diffusion models offer a promising solution for PET denoising by progressively recovering high-dose (HD) PET images from LD inputs. However, LD-to-HD PET denoising is still challenging due to insufficient anatomical guidance, unstable multi-scale feature propagation, and uncertain frequency-domain uptake recovery. We propose AnF-DiffPET, an anatomy- and frequency-guided diffusion framework for computed tomography (CT)-conditioned LD PET denoising. The framework integrates Anatomical-Frequency Guidance (AFG), Multi-Scale Cross-Transformer Reconstruction (MSCTR), and Frequency-Contrastive Hard Mining (FCHM) to enhance anatomy-aware feature modulation and frequency-domain consistency during denoising. Experimental results across four PET/CT datasets show that the proposed method improves image fidelity, anatomical consistency, and quantitative fidelity over representative CNN-based, GAN-based, transformer-based, and diffusion-based methods. The code and trained models will be publicly released upon acceptance.
[CV-92] Prior-Anchored Debiasing for Long-Tailed Multi-Organ Pathology Report Generation
Link: https://arxiv.org/abs/2607.00499
Authors: Feng Yang,Jie Liu,Yubo Pang,Peilin Chen,Xinheng Lyu,Shiqi Wang,Howard Leung,Ping Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Automated pathology report generation from Whole Slide Images (WSIs) has attracted increasing attention in digital pathology. However, existing methods are predominantly developed under single-organ settings, overlooking the multi-organ scenarios encountered in clinical practice, where organ types typically follow a long-tailed distribution. To address this gap, we identify two critical biases: (1) visual representation bias, where the encoder favors head-class patterns over tail-class discriminative features, and (2) textual decoding bias, where the decoder overfits to head-class narrative patterns, yielding diagnostically unreliable outputs for tail-class organs. To mitigate these two biases, we propose a novel Prior-anchored multi-Organ pathology report Generation framework (PriOrGen). Specifically, a Visual-Prototype Anchored Bottleneck module leverages the information bottleneck principle with learnable anchor representations to selectively retain diagnostically relevant visual information while filtering out head-biased redundancy. Secondly, a Meta-Report Anchored Bank module constructs an organ-specific meta-report anchored bank and retrieves organ-faithful textual priors to steer the decoder away from head-class narrative patterns. Extensive experiments on a multi-organ pathology dataset demonstrate that our method effectively mitigates long-tail biases and achieves superior report generation performance across both head and tail organ categories compared to state-of-the-art methods.
[CV-93] Robust 3D Alignment of Generative Reconstructions via Partial Monocular Observations
Link: https://arxiv.org/abs/2607.00498
Authors: Yuchen Zhang,Luanyuan Dai,Yiwei Wang,Xiwei Xu,Jianing Zhang,Johnny.r.zhang,Xianhui Meng,Yanbiao Ma,Jiayi Ma,Xiaoshuai Hao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Aligning generative 3D reconstructions with partial monocular observations is a critical but under-explored challenge in computer vision. This task is inherently ill-posed due to severe asymmetries between noisy, sparse monocular inputs and dense generative priors, whose scale ambiguity and geometric hallucinations, combined with the lack of initial overlap, render traditional registration pipelines ineffective. To resolve these issues, we propose a training-free and interpretable geometric alignment framework that grounds generative 3D priors via a 3D similarity transformation (Sim(3)), which can recover accurate metric scale and pose. Specifically, we introduce an explicit scale factor to resolve metric ambiguity and employ a coarse-to-fine alignment strategy, leveraging geometry-aware descriptors for robust initialization and a decoupled closed-form solver for precision refinement. In addition, we introduce a Hallucination Filtering operation to effectively suppress outliers caused by hallucinated geometry. To evaluate alignment performance under these extreme conditions, we introduce GenPMOAlign–Where2Place, a rigorous benchmark specifically designed for Generative-to-Partial Monocular Observational Alignment. Experiments demonstrate that our method achieves stable and accurate registration, substantially outperforming both classical geometric pipelines and state-of-the-art learning-based baselines. Code and the benchmark will be publicly released.
[CV-94] HieDG: A Hierarchical Discrete Geometry-Guided Framework for Multi-Animal Tracking ECCV2026
Link: https://arxiv.org/abs/2607.00494
Authors: Chenxun Deng,Zhongde Zhang,Ye Yuan,Chengyang Zhang,Yifan Zhang,Bohao Chen,Hongying Yan,Hang Zhou,Hua Han,Xi Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026
Abstract:Multi-animal tracking (MAT) is critical for wildlife monitoring and behavioral analysis, yet remains challenging due to uniform appearance, high density, and irregular motion. Existing methods typically follow heuristic- or query-based paradigms: the former relies on handcrafted geometric associations without end-to-end optimization, whereas the latter enables joint optimization but relies heavily on appearance embeddings. In such conditions, continuous geometric embeddings can be unstable, as small coordinate perturbations may disproportionately alter cross-frame attention weights, degrading identity association performance. To address this limitation, we propose HieDG, a Hierarchical Discrete Geometry-guided tracking framework that reformulates geometric dynamics as structured discrete representations within a query-based tracker. Instead of directly using raw geometric signals, HieDG employs a two-stage residual codebook to discretize position, scale, and velocity cues, transforming unstable continuous geometry into structured, stable discrete tokens. These tokens are aligned with visual embeddings and integrated into the tracking queries to enhance identity consistency. Extensive experiments on animal-specific benchmarks (AnimalTrack, BFT, and BuckTales) demonstrate state-of-the-art association performance with significant improvements in HOTA, AssA, and IDF1. Additional evaluations on generic multi-object tracking benchmarks, including DanceTrack and SportsMOT, show competitive performance, indicating the broader applicability of discretized geometric modeling beyond animal-specific scenarios.
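The idea of a two-stage residual codebook can be sketched in a few lines: a coarse codebook quantizes the raw geometric cue, and a second codebook quantizes what the first stage left over. The codebooks below are hand-written toy values and the function names are hypothetical; in HieDG the codebooks would be learned and applied to position, scale, and velocity cues.

```python
def nearest(codebook, vector):
    """Index of the codeword closest to vector in squared Euclidean distance."""
    def sq_dist(code):
        return sum((c - v) ** 2 for c, v in zip(code, vector))

    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def residual_quantize(vector, coarse, fine):
    """Two-stage residual quantization: returns (tokens, reconstruction)."""
    i = nearest(coarse, vector)
    residual = [v - c for v, c in zip(vector, coarse[i])]
    j = nearest(fine, residual)
    reconstruction = [c + f for c, f in zip(coarse[i], fine[j])]
    return (i, j), reconstruction


# Toy codebooks for a normalized 2D position cue (illustrative values only).
coarse = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
fine = [(0.0, 0.0), (0.25, 0.0), (-0.25, 0.0), (0.0, 0.25), (0.0, -0.25)]

tokens_a, recon_a = residual_quantize((0.90, 0.20), coarse, fine)
# A slightly perturbed observation of the same animal in the next frame.
tokens_b, recon_b = residual_quantize((0.92, 0.21), coarse, fine)
```

Both observations map to the same pair of discrete tokens here, which illustrates the stability argument in the abstract: small coordinate perturbations that would shift a continuous embedding leave the discrete representation unchanged unless they cross a codeword boundary.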
[CV-95] GenSP: Consistent Spherical Parameterization via Learning Shape Generative Models ECCV2026
Link: https://arxiv.org/abs/2607.00492
Authors: Sai Karthikey Pentapati,Shashank Gupta,Rajesh Sureddi,Yuezhi Yang,Alan C. Bovik,Qixing Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ECCV 2026. Sai Karthikey Pentapati and Shashank Gupta contributed equally to this work
Abstract:We introduce GenSP, a data-driven framework that learns consistent spherical parameterizations across a collection of genus-0 shapes. Instead of optimizing the parameterization of each shape independently, our method learns a neural generative model that predicts a continuous mapping from the unit sphere to shapes in a dataset. Under this formulation, spherical parameterizations are obtained through the inverse mappings of the learned generator, which encourages similar shapes to share consistent parameterizations. To make this formulation practical, we address several key challenges in learning such a generative model. First, we introduce a continuous neural deformation model that predicts surface points from sphere coordinates and latent shape codes, avoiding discretization artifacts common in mesh-based formulations. Second, we augment the training space with intermediate shapes that bridge the sphere and input shapes, allowing the model to learn meaningful deformations across a heterogeneous shape collection. Third, we compute reliable initial correspondences by propagating mappings along a spanning tree of training shapes in the latent space. Experiments on the ShapeNet dataset demonstrate that our approach significantly reduces geometric distortion and improves cross-shape consistency compared with state-of-the-art spherical parameterization methods.
[CV-96] PAPA: Online Personalized Active Preference Alignment KDD2026 ECML
Link: https://arxiv.org/abs/2607.00486
Authors: Anindya Sarkar,Nasik Muhammad Nafi,Isaac Lyngaas,Muralikrishnan Gopalakrishnan Meena,Yevgeniy Vorobeychik
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECML PKDD 2026
Abstract:Diffusion models are highly effective at modeling complex data distributions, including images and text. However, in applications like personalized recommender systems, the objective often shifts to modeling specific regions of the distribution that maximize user preferences, which are initially unknown but gradually uncovered through interactive feedback. This can naturally be framed as a reinforcement learning problem, where the goal is to fine-tune a diffusion model to maximize a reward function based on preferences. However, the main challenge lies in learning a parameterized reward model, which typically requires large-scale preference data, something that is often not feasible in practice. In this work, we introduce Personalized Active Preference Alignment (PAPA), a novel method that bypasses the requirement for a parameterized reward model by directly optimizing the diffusion model using real-time user feedback. PAPA enables feedback-efficient preference alignment, drawing inspiration from the variational inference framework. We demonstrate PAPA’s effectiveness through extensive experiments and ablation studies across diverse class-conditioned and fine-grained alignment tasks. Additionally, based on theoretical insights, we propose an enhanced fine-tuning strategy, referred to as EPAPA, that requires less computational budget and accelerates the fine-tuning process, further boosting PAPA’s suitability for real-world deployment. Our code is made publicly available at this https URL.
[CV-97] Multimodal Continuous Reasoning via Asymmetric Mutual Variational Learning
Link: https://arxiv.org/abs/2607.00461
Authors: Shijie Li,Yilin Gao,Siyuan Yang,Tieyuan Chen,Chaofan Gan,Zhihao He,Zicheng Zhao,Yuyu Guo,Weiyao Lin,Hang Yu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Multimodal Large Language Models (MLLMs) are often constrained by a language-space bottleneck, forcing complex visual reasoning into discrete tokens which can lose perceptual nuance. A promising alternative is continuous latent reasoning, where the goal is to discover implicit reasoning pathways that bridge the multimodal query and the final answer. However, this introduces a severe train-inference mismatch: a training-time posterior, conditioned on the ground-truth answer, can exploit answer-dependent shortcuts. Standard variational training then forces the inference-time prior to mimic a posterior that has access to information unavailable at test time, leading to poor performance. To address this, we propose Asymmetric Mutual Variational Learning (AMVL), a framework that resolves this mismatch via a bidirectional calibration objective. A forward KL divergence trains the target-agnostic prior to match the posterior, while a novel reverse KL divergence simultaneously regularizes the posterior, preventing it from collapsing into inference-incompatible regions and mitigating this “answer leakage”. We provide theoretical analysis formalizing this leakage as prior contamination and prove that our dual-KL objective reduces it. We instantiate AMVL in a latent-integrated MLLM and show that it consistently outperforms strong discrete and latent-reasoning baselines, improving the average score on the complex BLINK benchmark by +10.83 and achieving gains of up to +32.00 on individual reasoning tasks, with analyses confirming improved latent-space stability.
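Illustrative note (not from the paper): the bidirectional objective above relies on the fact that KL divergence is asymmetric, so the two directions penalize different mismatches between prior and posterior. The sketch below evaluates both directions in closed form for diagonal Gaussians. The choice of Gaussians, the toy numbers, and the naming of `q` as the answer-conditioned posterior and `p` as the prior are assumptions for illustration. Which direction AMVL labels "forward" and how gradients are routed are not specified in the abstract.

```python
import math

def kl_diag_gauss(mu_a, sigma_a, mu_b, sigma_b):
    """KL( N(mu_a, sigma_a^2) || N(mu_b, sigma_b^2) ), summed over dimensions."""
    total = 0.0
    for ma, sa, mb, sb in zip(mu_a, sigma_a, mu_b, sigma_b):
        total += math.log(sb / sa) + (sa ** 2 + (ma - mb) ** 2) / (2 * sb ** 2) - 0.5
    return total

# Hypothetical 2-D latents: q is a narrow, shifted posterior; p is a unit prior.
q_mu, q_sigma = [1.0, 0.0], [0.5, 1.0]
p_mu, p_sigma = [0.0, 0.0], [1.0, 1.0]

kl_q_p = kl_diag_gauss(q_mu, q_sigma, p_mu, p_sigma)  # KL(q || p)
kl_p_q = kl_diag_gauss(p_mu, p_sigma, q_mu, q_sigma)  # KL(p || q)
```

With these toy values KL(p || q) is noticeably larger than KL(q || p), because a broad prior is penalized heavily for placing mass where a narrow posterior has almost none.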
[CV-98] VideoSearch-R1: Iterative Video Retrieval and Reasoning via Soft Query Refinement ECCV2026
Link: https://arxiv.org/abs/2607.00446
Authors: Seohyun Lee,Seoung Choi,Dohwan Ko,Jongha Kim,Hyunwoo J. Kim
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ECCV 2026
Abstract:As video corpora continue to expand in both scale and task complexity, there is increasing demand for approaches that retrieve relevant videos from large-scale corpora (inter-video reasoning) and subsequently perform fine-grained, query-conditioned tasks (intra-video reasoning) within the retrieved content, such as temporal grounding. However, existing approaches typically treat retrieval as a preprocessing step, and consequently, when the initial retrieval fails, there is no mechanism to refine the search, leading to the failure of subsequent fine-grained intra-video reasoning. Moreover, while recent agentic frameworks have advanced video understanding, they typically assume that the query-relevant video is already given, focusing exclusively on intra-video reasoning tasks. To address these limitations, we propose VideoSearch-R1, an agentic framework for iterative video retrieval and reasoning through multi-turn interaction with a video search engine. Specifically, we introduce Soft Query Refinement (SQR) to refine search query tokens in a continuous latent space rather than rewriting queries in the discrete text space, enabling more efficient and fine-grained adjustments. SQR and its reasoning process are trained using Group Relative Policy Optimization (GRPO), guided by task-level reward signals derived from retrieval and downstream tasks. Building upon this, VideoSearch-R1 achieves state-of-the-art performance across three datasets on Video Corpus Moment Retrieval (VCMR), iteratively retrieving videos from large-scale corpora, refining search queries, and performing precise query-conditioned temporal grounding within the retrieved content. Our analyses show that SQR effectively refines the original query, requiring significantly fewer generated tokens than explicit text-level query refinement. Code and model checkpoints are publicly available at this http URL.
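Illustrative note (not from the paper): GRPO, as commonly described, replaces a learned value baseline with statistics computed over a group of sampled responses to the same query. The sketch below shows only that normalization step. The reward values, the `eps` constant, and the use of the population standard deviation are assumptions, and the retrieval and grounding rewards used by VideoSearch-R1 are not modeled.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical task-level rewards for four rollouts of the same query.
rewards = [0.2, 0.9, 0.4, 0.5]
advantages = group_relative_advantages(rewards)
```

Rollouts that score above the group mean receive positive advantages and are reinforced, while those below the mean are pushed down. A group with identical rewards yields zero advantage for every rollout.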
[CV-99] Information-Regularized Attention for Visual-Centric Reasoning ECCV2026
Link: https://arxiv.org/abs/2607.00434
Authors: Guohao Sun,Xiaofang Wang,Yash Patel,Mengchen Liu,Zhiqiang Tao,Praveen Krishnan
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by ECCV 2026
Abstract:Vision-language models (VLMs) have become a paradigm for multimodal learning, yet remain unstable due to object hallucination, weak visual grounding, and catastrophic forgetting after full-parameter instruction tuning. We claim these failures result from a lack of explicit control over visual representation learning during the standard next-token prediction objective. As a result, visual embeddings become passively optimized and prone to injecting redundant or spurious signals. To counter this, we introduce Information-Regularized Attention (IRA), a stochastic attention mechanism that explicitly regulates the amount of visual information injected into the hidden states of intermediate transformer layers. This local reparameterization translates uncertainty about visual representations into local noise that is independent across data points. Beyond evaluating model performance, we also quantify embedding properties, where IRA produces smoother curvature trajectories and suppresses attention sinks across all layers, indicating a more stable transformation of the visual signal. Our results suggest that stochastic attention is not merely a regularizer but a key contributor to representation learning in a generative architecture, offering a new direction for building more reliable VLMs.
[CV-100] HyFL-CLIP: Hyperbolic Fine-Tuning of CLIP for Robust Long-Context Understanding ECCV2026
Link: https://arxiv.org/abs/2607.00428
Authors: Ji Ha Jang,Hayeon Kim,Chulwon Lee,Junghun James Kim,Se Young Chun
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026. Project page: this https URL
Abstract:CLIP (Contrastive Language-Image Pre-training) has become a de facto paradigm for image-text alignment, but it struggles with long-context descriptions (more than 77 tokens) due to absolute positional encoding and pretraining on short captions. In long contexts, sentences are often reordered, summarized, or partially omitted. Although prior works extend CLIP with longer positional encodings, they often suffer from degraded image-text alignment under such text perturbations. We attribute this limitation to the Euclidean contrastive objective, which enforces strict one-to-one matching and lacks explicit mechanisms for modeling hierarchical relationships between global context and its constituent elements. To address this issue, we propose HyFL-CLIP, a hyperbolic fine-tuning framework that distills the well-established text-image alignment learned in Euclidean CLIP into hyperbolic space via cross-manifold similarity distillation, leveraging its geometry to capture hierarchical and entailment relations. Our method models hierarchical semantics by linking summarized token-wise features, long-context descriptions, constituent short textual components, and images, capturing part-whole relationships via hyperbolic entailment with Einstein midpoint aggregation. Experiments on diverse benchmarks, including long-context cross-modal retrieval, cross-modal retrieval with caption perturbations, intra-modality retrieval, and short-text cross-modal retrieval, show that HyFL-CLIP achieves more robust long-context understanding. In particular, it yields up to 19.5% improvement in long-text cross-modal retrieval under textual perturbations over the best prior method. We also show HyFL-CLIP can be seamlessly integrated into other model frameworks by applying it to Stable Diffusion XL (SDXL).
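Illustrative note (not from the paper): the Einstein midpoint is a standard way to average points in hyperbolic space. In the Klein model with curvature -1, each point is weighted by its Lorentz factor, so points near the boundary of the ball pull the average toward themselves. The sketch below assumes that model and curvature. The abstract does not state which hyperbolic model or curvature HyFL-CLIP uses, and the function names and toy points are hypothetical.

```python
import math

def lorentz_factor(x):
    """Lorentz factor 1 / sqrt(1 - ||x||^2) of a point inside the unit ball."""
    sq_norm = sum(v * v for v in x)
    if sq_norm >= 1.0:
        raise ValueError("point must lie strictly inside the unit ball")
    return 1.0 / math.sqrt(1.0 - sq_norm)

def einstein_midpoint(points, weights=None):
    """Weighted Einstein midpoint of Klein-model points (curvature -1 assumed)."""
    if weights is None:
        weights = [1.0] * len(points)
    gammas = [w * lorentz_factor(p) for w, p in zip(weights, points)]
    total = sum(gammas)
    dim = len(points[0])
    return [sum(g * p[d] for g, p in zip(gammas, points)) / total for d in range(dim)]

# Hypothetical embeddings of two text parts: one near the origin, one near the boundary.
parts = [[0.1, 0.0], [0.9, 0.0]]
whole = einstein_midpoint(parts)
```

For these two points the midpoint lies beyond the Euclidean average of 0.5 along the first axis, since the point closer to the boundary carries the larger Lorentz factor.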
[CV-101] EO-VGGT: Orbital Ray-Conditioned 3D Foundation Models for Satellite Multi-View Reconstruction
Link: https://arxiv.org/abs/2607.00417
Authors: Qiyan Luo,Yingdong Pi,Lekang Wen,Jie Yang,Xiaoyu Wang,Haiming Zhang,Mi Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This article has been submitted to a journal and is under review
Abstract:In the era of satellite constellations, multi-view optical satellite imagery is pivotal for Earth Observation (EO) and high-quality Digital Surface Model (DSM) reconstruction. Although feed-forward 3D foundation models have transformed computer vision, their deployment in satellite remote sensing is inherently constrained by the structural discrepancy between implicit perspective assumptions and explicit orbital pushbroom geometry. This geometric incongruity is further compounded by pronounced view-set heterogeneity. We present EO-VGGT, a framework that adapts a frozen perspective-driven model to orbital observations via explicit physical geometry. First, the Geometry-Correlation Constrained Selection (GCCS) strategy prunes sub-optimal observations by balancing geometric diversity and radiometric consistency to optimize the input sequence. Second, a Sensor-Ray Encoder (SRE) parameterizes pixel-level pushbroom lines of sight derived from the Rational Function Model (RFM) into high-dimensional space-geometric tokens, reconciling the mathematical discrepancy between central projection and orbital kinematics. Third, a lightweight Ray-Pointing-Aware Adapter (RPAA) employs gated residual blocks to integrate these tokens directly into the frozen transformer backbone. Our findings underscore that integrating explicit physical geometry with optimized view selection is essential for robust feed-forward satellite 3D reconstruction.
[CV-102] DroneIQA-VLE: Multi-Task Drone Image Quality Assessment via Vision-Language Ensemble ALT ICME2026
Link: https://arxiv.org/abs/2607.00416
Authors: Wei Sun,Weixia Zhang,Hongjian Zhan,Mingkai Lu,Yixuan Gao,Guangtao Zhai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The model achieves 2nd place in ICME 2026 Drone-IQA Grand Challenge on Target-aware Image Quality Assessment for Low-altitude UAV Images
Abstract:We present DroneIQA-VLE, our solution to the ICME 2026 Drone-IQA Grand Challenge on Target-aware Image Quality Assessment for Low-altitude UAV Images. The framework jointly predicts global, target, and background quality scores by ensembling two complementary pipelines: (1) SigLIP2 vision encoders with multi-task regression heads, and (2) a LoRA-adapted Qwen3.5-9B multimodal large language model for quality score regression. The final global quality prediction is obtained by arithmetically averaging the outputs of both pipelines. Our method achieves 2nd place in the challenge, demonstrating its effectiveness. The code is available at this https URL.
[CV-103] MindAU: EEG-Conditioned Facial Action Unit Editing via Dual-Stream Manifold Alignment
Link: https://arxiv.org/abs/2607.00410
Authors: Zhenhang Li,Xin Zhou,Hao Deng,Lijun Yin
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Recent brain decoding studies have made substantial progress in reconstructing externally perceived visual content from neural signals. However, using electroencephalography (EEG) recordings to guide facial expression editing remains largely unexplored and poses a distinct challenge: rather than recovering what a subject sees, it requires identifying facial-action related patterns from noisy EEG signals and grounding them in localized, identity-preserving expression edits. In this paper, we investigate EEG-conditioned facial image editing for fine-grained facial action unit (AU) control and propose MindAU, a unified framework for controlling facial AU edits from EEG signals. MindAU first learns noise-robust and AU-discriminative EEG representations through temporal masked reconstruction and AU classification supervision. It then bridges the modality gap via Dual-Stream Manifold Alignment, aligning EEG features with AU-level text semantics and identity-reduced visual displacement trajectories in the multimodal space of Qwen2.5-VL. Finally, MindAU incorporates EEG-aware Multimodal Rotary Positional Embeddings, landmark-guided reference masking, and AU-aware region supervision into a multimodal diffusion-based editor for high-fidelity identity-preserving editing. We also introduce E-CAFE, a curated benchmark for EEG-Conditioned Action-Unit Facial Editing with paired EEG-face editing samples and standardized evaluation protocols. Extensive experiments demonstrate the effectiveness of MindAU and suggest its potential as a step towards future assistive expression technologies for individuals with facial neuromuscular disorders.
[CV-104] MedCAGD: Context-Aware Gated Decoder for Efficient Medical Image Segmentation ECCV2026
Link: https://arxiv.org/abs/2607.00409
Authors: Saad Wazir,Patrick Dominique Vibild,Dinh Phu Tran,Seongah Kim,Daeyoung Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the European Conference on Computer Vision (ECCV 2026)
Abstract:Medical image segmentation relies on the ability of encoder-decoder architectures to translate rich feature representations into accurate pixel-level predictions under challenging conditions such as low contrast, structural ambiguity, and scale variability. While recent advances in large-scale pretraining and transformer-based encoders have substantially improved feature extraction, segmentation accuracy remains constrained by decoder design, particularly in terms of cross-scale alignment, contextual integration, and boundary preservation. In this work, we revisit medical image segmentation from a decoder-centric perspective and propose a context-aware gated decoder that systematically regulates feature fusion and contextual aggregation throughout the decoding process. The proposed decoder integrates lightweight multi-scale channel recalibration, gated skip fusion with spatial competition and a global context aggregation mechanism that injects encoder-wide information into intermediate decoding stages. This design enables effective translation of strong pretrained encoder representations into spatially consistent predictions. Extensive experiments across 11 medical image segmentation benchmarks validate the effectiveness and demonstrate that the proposed approach consistently outperforms strong baselines while remaining computationally practical. Code: this https URL
[CV-105] The Illusion of High Utility in Safety Alignment of Text-to-Image Diffusion Models ECCV2026
Link: https://arxiv.org/abs/2607.00402
Authors: Adeel Yousaf,Soumik Ghosh,James Beetham,Amrit Singh Bedi,Mubarak Shah
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ECCV 2026
Abstract:Safety alignment of text-to-image (T2I) diffusion models aims to suppress harmful generations while preserving utility on benign prompts. Recent methods often appear to deliver high safety with high utility, but this conclusion rests largely on coarse global utility metrics (e.g., FID, CLIPScore) that are insensitive to fine-grained semantic correctness, creating an illusion of high utility. We show that when utility is measured with structured evaluation, this illusion breaks: on TIFA (Text-to-Image Faithfulness evaluation with Question Answering), safety-aligned models suffer substantial drops in semantic fidelity, including failures in object counts, attributes, and relationships. To diagnose the source of this gap, we analyze the text-encoder prompt embedding space and uncover semantic collapse, a contraction of embedding spread coupled with distortion of inter-prompt similarity structure, which strongly correlates with structured utility loss. Guided by this insight, we propose Structure-Aware Geometric Regularization (SAGE), a safety alignment objective that explicitly preserves embedding spread and inter-prompt relational structure during adaptation. Our method restores structured utility (TIFA +5.0% over prior state-of-the-art) while maintaining strong safety performance and competitive coarse-grained utility scores. Our source code and trained models are available at this https URL.
[CV-106] DriveVer: Lightweight Trajectory Evaluator as Test-Time Verifier for Autonomous Driving
Link: https://arxiv.org/abs/2607.00399
Authors: Chong He,Yuechen Luo,Fang Li,Shaoqing Xu,Fuxi Wen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:End-to-end autonomous driving models often encounter performance bottlenecks, as training-time scaling leads to high computational costs and diminishing marginal returns. Existing planners typically adopt a one-shot generation paradigm, lacking secondary validation and active correction mechanisms to detect and revise suboptimal or unsafe trajectories during inference. To address this issue, we propose DriveVer, a lightweight, plug-and-play Test-Time Verifier that leverages the test-time scaling paradigm to enable autonomous driving systems to validate and refine trajectories without costly and heavy training. We construct a dedicated trajectory dataset based on the NAVSIM benchmark through condition-driven clustering and balanced sampling according to ego-vehicle states and navigation commands. Employing a dual-head architecture, DriveVer efficiently fuses candidate trajectories with multi-view visual representations and ego-vehicle kinematic features to simultaneously predict a safety confidence score and an absolute geometric refinement vector. Extensive experiments on the NAVSIM benchmark show that DriveVer significantly improves the performance of base planning models. Notably, as an extremely compact model with only 34M parameters, DriveVer introduces minimal computational overhead, achieving competitive results while maintaining real-time inference efficiency.
[CV-107] Vitality-Aware Compression for Efficient Image-to-Shape Diffusion Transformers ECCV2026
Link: https://arxiv.org/abs/2607.00382
Authors: Jaeah Lee,Hyunjin Kim,Jaewoong Cho,Gihyun Kwon
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026
Abstract:We propose the first compression approach for image-to-shape Diffusion Transformers (DiTs) that substantially reduces model size while preserving geometric fidelity. Despite remarkable progress in 3D shape generation, large DiT-based models remain computationally prohibitive in resource-constrained settings. Furthermore, it is difficult to directly transfer existing diffusion model compression strategies developed for different domains to 3D generation, and prior 3D efficiency approaches focus primarily on inference speed rather than backbone compression. To address this limitation, we build a geometry-aware compression framework tailored to image-to-shape DiTs. Guided by the observation that 3D DiT layers exhibit non-uniform importance for geometry synthesis, we introduce a vitality-guided framework integrating structured pruning, adaptive quantization, and targeted fine-tuning. Our method achieves up to 66% model-size reduction across state-of-the-art image-to-3D models while maintaining synthesis fidelity comparable to full-sized counterparts. This highlights the potential of our framework as a plug-and-play solution for efficient 3D shape generation across diverse models.
[CV-108] Radial Interaction Tomography: Recognizing Non-Transitive Evolutionary Games from One Range-Expansion Image
Link: https://arxiv.org/abs/2607.00378
Authors: Faruk Alpay,Baris Basaran
Categories: Computer Vision and Pattern Recognition (cs.CV); Populations and Evolution (q-bio.PE)
Comments: 17 pages, 10 figures. Ancillary files include computational diagnostics, benchmark code, and supplementary proofs
Abstract:Colored sectors in a microbial range expansion encode more than lineage survival counts. We formulate a computer-vision inverse problem: from one endpoint image of an accretive multi-type expansion, recover the radius-indexed pairwise boundary-flow field and test whether the visual pattern is compatible with a transitive scalar fitness hierarchy. The observable is a geometric signal extracted from sector-boundary curves in log-polar coordinates. We prove endpoint observability and stability for frozen fronts, weighted transitive/cyclic decomposition, contact-complete circular design, physical-clock and mechanism non-identifiability, exact Gaussian cyclicity testing, and Bonferroni-valid interval scanning. The benchmark is deterministic: analytic endpoint images, blurred/noisy pixel round trips, scalar-null stress tests, public-image tracing, multi-resolution mechanistic endpoints, and a non-learning frozen-front simulator. The implementation recovers pairwise edge-flow histories from endpoint images, detects cyclic residuals in a mechanistic four-type expansion, and uses those residuals as forcing signals for a dimensionless active design-control layer covering reaction-diffusion control, phenotype-frontier optimization, protocol synthesis, Monte Carlo robustness, and a downstream population-state bridge.
[CV-109] LIST3R: Long-sequence Instance-aware 3D Reconstruction
Link: https://arxiv.org/abs/2607.00375
Authors: Jing Gao,Wei Wang,Feiran Wang,Yan Yan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We present LIST3R, an instance-aware framework for long-sequence 3D reconstruction inspired by the way humans organize spatial memory around stable and recognizable objects. LIST3R organizes long-sequence reconstruction around instance anchors, using them to reconnect fragmented subsequences and consolidate local observations into a coherent global 3D scene. Given a long video, our approach partitions it into overlapping subsequences and builds a structured local instance library for each partial reconstruction, maintaining persistent trackable anchors with semantic and geometric evidence. These anchors are matched across subsequences to recover revisited regions and provide object-aware constraints for fragment alignment, producing a consistent global reconstruction. During this process, the evolving geometric evidence updates the local instance libraries and progressively organizes them into a unified global 3D instance library. Experiments on long-sequence benchmarks show that our method produces more accurate trajectories and higher-quality 3D reconstructions, highlighting the effectiveness of persistent instance anchors for organizing long-horizon 3D reconstruction. Our code is available on the project page: this https URL.
[CV-110] MEPA: Multi-Scale Representation Alignment for Visual Autoregressive Modeling with Mixture of Experts ECCV2026
Link: https://arxiv.org/abs/2607.00371
Authors: Nuoyan Zhou,Zhijun Tu,Lei Yu,Kun Cheng,Jie Hu,Nannan Wang,Xinghao Chen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages, 4 figures, 8 tables, Accepted at ECCV 2026
Abstract:Visual AutoRegressive modeling (VAR) has pioneered a coarse-to-fine multi-scale autoregressive generative paradigm, demonstrating strong capabilities in image generation. However, VAR still suffers from inherent deficiencies in multi-scale representation learning. Specifically, lower scales primarily capture global semantics, while higher scales focus on fine-grained details. Employing a shared architecture across scales induces optimization conflicts. Moreover, due to the causal autoregressive process, inaccurate semantics at early scales can propagate and significantly degrade the final output. To address these issues, we introduce a scale-aware token-routed Mixture of Experts (MoE) architecture, allowing scale-adaptive expert selection, thereby facilitating decoupled representation learning across scales. In addition, we enhance semantic modeling at early scales by incorporating external self-supervised features. Unlike naive alignment, we analyse and design a residual feature aggregation scheme tailored to the VAR paradigm. Extensive experiments show that our method significantly improves both training efficiency and generation quality. On the ImageNet 256×256 benchmark, our model achieves a superior FID compared to the dense baseline while requiring only half of the default training epochs and a smaller parameter budget, with only a marginal increase in training cost. Moreover, the performance gap widens further with more training epochs.
[CV-111] SFDATrack: Generalized Source-Free Domain Adaptive Tracking Under Adverse Weather Conditions ECCV2026
Link: https://arxiv.org/abs/2607.00369
Authors: Siyuan Yao,Ziqi Wang,Ruiqi Yu,Junqi Huang,Wenqi Ren,Xiaochun Cao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026
Abstract:Domain adaptive visual object tracking under adverse weather conditions has garnered significant attention in recent years. Despite the impressive performance, existing methods heavily rely on the large-scale video frames from both source and target domains, which is impractical under rigid resource constraints where source data is unavailable. To overcome this limitation, we propose SFDATrack, a generalized source-free domain adaptive tracker that merely leverages adverse weather samples from the target domain for robust state estimation. Specifically, SFDATrack first employs a mean-teacher backbone with Dual Interactive Mamba (DIM) blocks to distill the candidate target tokens that are resilient to weather variations from classified, augmented samples. Afterwards, we introduce a hyperspherical prototype projection (HPP) module to project these tokens onto multi-domain prototypes within a latent hyperspherical space. By enforcing both domain-specific and domain-invariant properties of the multi-domain prototypes, SFDATrack can be seamlessly adapted to diverse weather conditions with powerful generalizability. Extensive experiments evaluated on various benchmarks demonstrate that SFDATrack achieves superior performance compared to state-of-the-art approaches. The code is available at this https URL.
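Illustrative note (not from the paper): in the usual mean-teacher setup, the teacher network is not trained by gradient descent but tracks an exponential moving average (EMA) of the student's weights. The sketch below shows that update on plain dictionaries of scalars. The parameter names, the `momentum` value, and the flat-dictionary representation are hypothetical, and the abstract does not describe how SFDATrack configures its teacher.

```python
def ema_update(teacher, student, momentum=0.99):
    """Return new teacher parameters as an EMA of the student parameters."""
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }

# Hypothetical scalar "weights" standing in for full parameter tensors.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 2.0}

# One slow update step: the teacher moves 10% of the way toward the student.
teacher = ema_update(teacher, student, momentum=0.9)
```

A momentum close to 1 makes the teacher change slowly, which is typically what keeps its pseudo-labels stable while the student adapts to target-domain samples.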
[CV-112] Personalized Object Identification and Localization via In-Context Inference with Vision-Language Models
Link: https://arxiv.org/abs/2607.00357
Authors: Kensuke Nakamura,Byung-Woo Hong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Personalized object localization (POL) localizes an object instance in a query image based on a few reference images with bounding-box annotations and a target object label. The pioneering method, IPLoc, solves this task through in-context inference with vision-language models (VLMs). However, it assumes that the query image always contains the target object. This assumption severely limits its applicability to real-world scenarios with many irrelevant images. To address this issue, we formulate a new task, personalized object identification and localization (POIL), by positioning POL within the broader few-shot object detection framework. POIL aims to localize the target object instance while rejecting query images that do not contain the reference object instance. We also present POIL datasets constructed from public sources. We further propose an in-context algorithm named IPLoc-ID for solving POIL with VLMs. IPLoc-ID first predicts a candidate bounding box and then determines whether it corresponds to the reference object instance. We introduce a self-posed query to connect these two steps within a single autoregressive generation framework. Through ablation studies and comprehensive experiments, we show that IPLoc-ID substantially suppresses false-positive detections on negative query images while maintaining localization performance comparable to IPLoc. Overall, IPLoc-ID effectively addresses the practical instance-level POIL task, which cannot be sufficiently solved by conventional object detection, few-shot object detection, or the localization-only IPLoc method.
[CV-113] DroneFINE: Domain-Aware Parameter-Efficient Fine-Tuning of Vision-Language Detectors for Drone Images ECCV2026
Link: https://arxiv.org/abs/2607.00338
Authors: Ke Wu,Yanan Zhang,Yingjie Gao,Wenhao Li,Chenyu Zhou,XinZhu Ma,Jiaxin Chen,Di Huang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2026
Abstract:Object detection for Unmanned Aerial Vehicles (UAVs) working in open and dynamic environments is a highly challenging task. While Vision-Language Models (VLMs) have offered a powerful solution for universal object detection, adapting them to UAV scenarios remains non-trivial due to a substantial domain gap between VLM pre-training data and aerial imagery. The prevailing Parameter-Efficient Fine-Tuning (PEFT) methods prove ineffective in bridging this gap, as VLMs’ “natural-scene, foreground-dominant” visual priors misalign with the “bird’s-eye-view, background-dominant, small-object” characteristics of UAV data. To address this issue, we propose DroneFINE, a novel PEFT paradigm comprising two domain-aware complementary modules tailored for VLM-based drone image detectors. Specifically, a data-dependent, foreground-aware, and multi-path adaptation mechanism named HyperAdapter is designed, which overcomes the static structural constraints of PEFT. In addition, a background suppression algorithm named SemanticGate is developed. It is a text-conditioned guidance strategy that employs background vocabulary to actively guide the model in suppressing responses from irrelevant regions. Extensive experiments on VisDrone and UAVDT demonstrate that DroneFINE significantly outperforms existing PEFT methods and achieves performance comparable to full fine-tuning while substantially reducing the number of trainable parameters.
[CV-114] CORGI: Consistency-Aware 3D Dog Reconstruction from a Single Image in the Wild
Link: https://arxiv.org/abs/2607.00321
Authors: Yuxiao Wu,Weile Li,Boyi Zhu,Yumeng Liu,Youcheng Cai,Ligang Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Reconstructing high-fidelity 3D models of highly articulated animals, such as dogs, from a single in-the-wild image remains a formidable challenge. In this paper, we introduce CORGI, a novel framework for consistency-aware 3D dog reconstruction from a single unconstrained image that completely eliminates the need for 3D supervision. To overcome generative inconsistencies and the lack of multi-view capture, our pipeline introduces three core components. First, we propose a Canonical-Driven Orbital Generation (CDOG) strategy, utilizing specialized Canonical and Orbit LoRAs to normalize arbitrary input poses and synthesize reliable 360-degree video observations. Second, we design a Consistency-aware Deformable 3DGS (CA-3DGS) module that anchors on a D-SMAL prior, explicitly modeling per-view generative errors through dedicated neural deformation fields to learn accurate vertex-level displacements. Finally, to eliminate structural distortions and recover high-frequency details, we introduce a self-supervised Deformation-Conditioned Generative Repair (DCGR) module. Extensive experiments demonstrate that CORGI achieves state-of-the-art performance, generalizing seamlessly across diverse dog breeds to produce geometrically accurate, visually coherent, and fully animatable 3D assets ready for downstream applications.
[CV-115] Typography-Based Monocular Distance Estimation for Advanced Driver-Assistance Systems
Link: https://arxiv.org/abs/2607.00319
Authors: Manognya Lokesh Reddy,Zheng Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 11 figures
Abstract:Estimating the distance to a leading vehicle is a basic input to forward collision warning, adaptive cruise control, and automated emergency braking. Production systems obtain this distance from radar, laser scanners, or stereo camera pairs, which add cost, power draw, and packaging constraints. This paper asks whether a single ordinary camera can recover the same distance by using a target that is standardized in size and present on every road vehicle: the rear license plate. U.S. plates share a fixed outer size and a character height that is set by regulation and varies only narrowly between states, so the height of a plate character in the image is a direct measure of distance once the camera geometry is known. The proposed method (Typography-Based Monocular Distance Estimation) detects the plate, measures the height of its printed characters, identifies the issuing state to select the correct physical character height, and recovers distance from the camera projection. Three measurements taken from the same plate (the character height, the stroke width, and the character spacing), together with the spacing of the two mounting holes and a single-image depth network, are combined so that a weak or corrupted measurement is given less weight automatically. The distance, its rate of change, and a time-to-collision estimate are smoothed across frames and used to raise a warning with the timing used by U.S. collision-warning regulations. The same plate that anchors the scale also identifies the vehicle, so the method returns a distance, a bearing, and an identity from one passive sensor. It reads scale from a printed standard instead of from time of flight or parallax, making it a cheap, low-maintenance complement to those sensors in a fault-tolerant perception stack, and it achieves cost-effective distance estimation with an error of less than 0.13 m.
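The projection step described in this abstract can be sketched with a simple pinhole model, where distance equals the focal length (in pixels) times the physical character height, divided by the character height measured in the image. The focal length, the 0.07 m character height, the variances, and the inverse-variance fusion rule below are illustrative assumptions, not values or methods taken from the paper.

```python
def distance_from_height(focal_px: float, real_height_m: float, pixel_height: float) -> float:
    """Pinhole model: d = f * H / h."""
    if pixel_height <= 0:
        raise ValueError("pixel_height must be positive")
    return focal_px * real_height_m / pixel_height


def fuse(estimates):
    """Inverse-variance weighted mean of (distance_m, variance) pairs.

    A noisy cue (large variance) automatically receives a small weight.
    This fusion rule is an assumption made for illustration.
    """
    weights = [1.0 / var for _, var in estimates]
    total = sum(weights)
    return sum(w * d for w, (d, _) in zip(weights, estimates)) / total


def time_to_collision(distance_m: float, closing_speed_mps: float) -> float:
    """TTC = distance / closing speed; infinite when the gap is not shrinking."""
    if closing_speed_mps <= 0:
        return float("inf")
    return distance_m / closing_speed_mps


# Hypothetical numbers: 1000 px focal length, 0.07 m characters, 5 px tall in the image.
d_char = distance_from_height(1000.0, 0.07, 5.0)   # about 14 m
d_fused = fuse([(d_char, 0.25), (16.0, 4.0)])       # the second cue is much noisier
ttc = time_to_collision(d_fused, 5.0)
```

Because the second cue carries sixteen times the variance of the first, the fused distance stays close to the character-height estimate, which mirrors the abstract's point that a weak measurement should count for less.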
[CV-116] RetailSMV: Exocentric vs. Egocentric Adaptation of Foundation Video World Models in Retail
Link: https://arxiv.org/abs/2607.00310
Authors: Amirreza Rouhi,Rajat Aggarwal,Parikshit Sakurikar,Anoop M. Namboodiri,Sashi P. Reddi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Foundation video diffusion models are increasingly viewed as world simulators for embodied agents, yet their pretraining on internet-scale generic video leaves them poorly aligned with real-world deployment domains. We study parameter-efficient adaptation of a pretrained foundation video world model to retail scenes: when synchronized egocentric and exocentric video of the same activity are available, which viewpoint of training data produces the strongest adapted model? We introduce RetailSMV (Retail Synchronized Multi-View), a corpus of 32,105 captioned retail clips from five supermarkets with synchronized ego/exo capture from the store-staff perspective (stocking, arranging, weighing, managing supply carts, scanning at checkout), rather than the customer-centric framing of prior retail video corpora, and train three matched Low-Rank Adaptation (LoRA) configurations of Cosmos3-Nano (egocentric-only, exocentric-only, combined) under identical hyperparameters. On a 200-clip held-out test set evaluated with seven complementary metrics under a strict paired statistical protocol, exocentric-only adaptation matches or exceeds combined adaptation on six of seven point estimates and is significantly better on LPIPS, PSNR, and DreamSim, despite training on only 15,985 exocentric clips (versus 32,105 for combined). A symmetric paired comparison further shows that adding exocentric data to egocentric-only training helps while adding egocentric data to exocentric-only training hurts. The absolute adaptation gap is largest at the shortest rollout time, identifying the near-horizon prediction window as the regime in which adaptation is most beneficial.
[CV-117] Wake up for Touch! Mask-isolated Tactile Alignment Learning in MLLMs KR ECCV2026
Link: https://arxiv.org/abs/2607.00302
Authors: Yoonhyung Park,Minji Kim,Sungwon Moon,Jiyoung Lee
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Robotics (cs.RO)
Comments: ECCV 2026, Project page: this http URL
Abstract:Touch supplies the physical grounding needed to perceive intrinsic material properties, such as friction and compliance, that vision alone often cannot resolve. Recent efforts to equip multimodal LLMs with this tactile sense, however, expose a zero-sum trade-off: the limited parameter budget of compact models forces a choice between acquiring the new sensory modality and preserving the established vision-language reasoning. We present Splash, a mask-isolated tactile alignment learning framework for MLLMs. Splash quantifies the significance of each pretrained parameter and partitions the parameter space into a dormant and a critical subspace. While the frozen critical subspace acts as a stable anchor to safeguard general visual knowledge, Splash updates the isolated dormant subspace to internalize tactile alignment towards LLMs. This selective update prevents catastrophic forgetting and keeps the modality expansion non-destructive. Extensive experiments show that Splash achieves tactile reasoning without additional inference overhead in the LLM part, demonstrating state-of-the-art performance on visuo-tactile benchmarks, including SSVTP, TVL, and TacQuad, while preserving its original general-purpose capabilities.
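The dormant/critical split can be illustrated with a toy masked update: score every parameter, freeze the highest-scoring fraction, and apply gradient steps only to the rest. The abstract does not say how significance is quantified, so the |w * g| score, the 50% split, and all numbers below are assumptions chosen for illustration.

```python
def dormant_mask(weights, grads, critical_fraction):
    """Return a 0/1 mask: 1 marks dormant (trainable) entries, 0 marks critical (frozen) ones.

    Importance is |w * g|, a common saliency proxy used here as an assumption.
    """
    scores = [abs(w * g) for w, g in zip(weights, grads)]
    n_critical = int(round(critical_fraction * len(scores)))
    # Indices sorted from most to least important.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    critical = set(order[:n_critical])
    return [0 if i in critical else 1 for i in range(len(scores))]


def masked_sgd_step(weights, grads, mask, lr):
    """Plain SGD restricted to the dormant subspace; frozen entries come back unchanged."""
    return [w - lr * g * m for w, g, m in zip(weights, grads, mask)]


weights = [0.9, -0.01, 0.5, 0.02]
pretrain_grads = [1.0, 0.1, -0.8, 0.2]   # hypothetical gradients on the original task
mask = dormant_mask(weights, pretrain_grads, critical_fraction=0.5)
tactile_grads = [0.3, 0.3, 0.3, 0.3]     # hypothetical gradients on the tactile task
updated = masked_sgd_step(weights, tactile_grads, mask, lr=0.1)
```

Since the mask only zeroes out updates, the model keeps its original shape and size, which is consistent with the abstract's claim of no additional inference overhead in the LLM part.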
[CV-118] Learning When to Listen: Gated Affect Fusion for Human Motion Prediction
Link: https://arxiv.org/abs/2607.00296
Authors: Jingni Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Human motion forecasting in unconstrained real-world videos remains challenging due to the ambiguity of future behaviors and the presence of noisy multimodal observations. While facial affect potentially provides complementary behavioral cues, its practical utility and mechanistic boundaries within motion forecasting frameworks remain poorly understood. In this work, we present a systematic study investigating the utility and temporal limitations of affect-conditioned forecasting in-the-wild. We establish a rigorous multimodal pipeline combining MediaPipe body pose trajectories with HSEmotion facial affect representations, and introduce the Gated Affect Transformer (GAT) to dynamically regulate cross-modal information flow. Through extensive multi-horizon evaluations under a strict subject-wise protocol, we demonstrate that naive early cross-modal concatenation consistently degrades forecasting accuracy relative to pose-only baselines. Conversely, our proposed gating mechanism stabilizes cross-modal integration by adaptively controlling the affective stream. Crucially, controlled counterfactual experiments using shuffled and randomized affect inputs reveal that the learned gate successfully suppresses unstructured cross-modal noise while remaining responsive to plausible affective signals. Furthermore, our empirical results indicate that facial affect features provide bounded, horizon-dependent predictive cues strictly within short-to-medium windows (e.g., 30 frames), whereas long-term trajectories remain predominantly governed by intrinsic kinematic continuity. Our findings provide empirical evidence that facial affect should be regarded as a complementary behavioral cue rather than a dominant driver of future motion, offering practical guidance for selective multimodal fusion in unconstrained human motion forecasting.
[CV-119] OnPoint: Offline-to-Online Multi-Level Distillation for Point-Supervised Online Temporal Action Localization ECCV2026
Link: https://arxiv.org/abs/2607.00289
Authors: Sakib Reza,Gauri Jagatap,Mohsen Moghaddam,Octavia Camps,Andrea Fanelli
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ECCV 2026
Abstract:Temporal Action Localization (TAL) typically relies on segment annotations or offline access to full videos, limiting scalability and online use. We introduce Point-Supervised Online TAL (POTAL), which localizes actions in streaming videos using only one temporal point per instance. To solve POTAL, we propose OnPoint, an offline-to-online multi-level distillation framework that transfers knowledge from a point-supervised offline teacher to an online student via (i) pseudo-segment instance distillation, (ii) class-activation sequence distillation, and (iii) anticipatory window-level distillation. We further improve robustness by incorporating the original point labels into student training and by refining anchor decoding with actionness-guided attention calibration. Experiments on five datasets show OnPoint consistently outperforms strong baselines, establishing a solid foundation for POTAL.
[CV-120] What's Hidden Matters: Identifying Planning-Critical Occluded Agents using Vision-Language Models IROS2026
Link: https://arxiv.org/abs/2607.00283
Authors: Amirhosein Chahe,Tyler Naes,Jovin D’sa,Faizan M. Tariq,Sangjae Bae,Lifeng Zhou,David Isele
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026). 9 pages, 5 figures, 5 tables
Abstract:Autonomous vehicles must safely navigate complex environments where planning-critical agents may be hidden from view. Current approaches often treat all occlusions with uniform conservatism, yielding needlessly defensive driving, or they infer hidden spaces without estimating the impact on the planner. This work bridges the critical gap between perception and planning by enabling Vision-Language Models (VLMs) to identify and reason about the specific hidden agents that are most critical to the ego-vehicle’s trajectory. We introduce a novel framework that uses Planning KL-divergence (PKL), an information-theoretic metric, to systematically identify and rank occluded agents based on their impact on the ego vehicle’s plan. Using this planning-aware ranking, we employ an expert VLM (GPT-5) to generate rich, structured annotations that capture the visual evidence and reasoning required for this task. We apply this framework to the nuScenes dataset to create a new benchmark focused on high-impact scenarios. We conduct comprehensive experiments on a wide range of general-purpose and domain-adapted VLMs, demonstrating that fine-tuning on our PKL-guided data yields dramatic performance improvements across all models. Notably, our results show that smaller, fine-tuned models significantly outperform their much larger zero-shot counterparts, and that our PKL-guided data selection strategy improves performance by approximately 30% over random sampling. Our work presents the first systematic approach for training VLMs to focus on planning-critical occlusions, enabling more semantically grounded and efficient risk assessment in autonomous driving.
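A planning-aware ranking of this kind can be sketched by comparing the planner's distribution over candidate plans with and without each hidden agent, then sorting agents by the resulting KL divergence. The planner outputs, the agent names, and the direction of the divergence below are hypothetical; the paper's exact PKL formulation may differ.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same candidate plans."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def rank_agents(plan_with_all, plans_without):
    """Sort agent ids by how much removing each one changes the plan distribution."""
    scores = {
        agent: kl_divergence(plan_with_all, plan_without)
        for agent, plan_without in plans_without.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical planner outputs over three candidate manoeuvres: (keep lane, slow down, stop).
plan_with_all = [0.1, 0.3, 0.6]
plans_without = {
    "pedestrian_behind_van": [0.7, 0.2, 0.1],   # removing it changes the plan a lot
    "parked_car": [0.1, 0.3, 0.6],              # removing it changes nothing
}
ranking = rank_agents(plan_with_all, plans_without)
```

An agent whose removal leaves the plan distribution unchanged scores zero, so it drops to the bottom of the ranking regardless of how visually prominent the occlusion is.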
[CV-121] AEGIS: A Multi-Task Joint-Embedding Predictive Architecture for Mammography
Link: https://arxiv.org/abs/2607.00277
Authors: Scott Chase Waggener,Sai Karthik Navuluru,Lakshman Tamil
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We present Aegis, a joint-embedding predictive architecture for breast cancer detection and density assessment in mammography. We train three Vision Transformer variants (Small/Base/Large) using self-supervised joint-embedding predictive architecture (JEPA) pre-training on 71,103 studies from 14 clinical sites, followed by supervised fine-tuning with progressive resolution scaling up to 2048x1536. On a curated 785-study test set, our largest model achieves area under the receiver operating characteristic curve (AUC) 0.949 for breast cancer triage with 93% sensitivity and 75% specificity at the optimal operating point. An ensemble combining our model with a U.S. Food and Drug Administration-cleared baseline further improves discrimination to 0.952 AUC. For breast density classification, the model achieves 0.953 AUC for binary (dense vs. non-dense) classification and 62.6% exact accuracy across four Breast Imaging Reporting and Data System (BI-RADS) categories, with 98.8% adjacent accuracy comparable to reported human inter-reader agreement. External validation on the public VinDr-Mammo dataset provides evidence of cross-population transfer under a different reference standard, with the largest model achieving 0.871 AUC for triage in a zero-shot setting.
[CV-122] MVDGC: Joint 3D and 2D Multi-view Pedestrian Detection via Dual Geometric Constraints
Link: https://arxiv.org/abs/2607.00273
Authors: Thinh Phan,Hao Vo,Khoa Vo,Thanh Ngo,Cuong Pham,Ngan Le
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The core challenge in multi-view pedestrian detection (MVPD) lies in effective aggregation of visual features from different viewpoints for robust occlusion reasoning. Recent approaches have addressed this by first projecting image-view features onto a Bird’s Eye View (BEV) map, where ground localization is then performed. Despite impressive performance, the perspective transformation induces severe distortion, breaking spatial structure and degrading the quality of object feature extraction. The blurred and ambiguous features hinder accurate BEV point localization, especially in densely populated regions. Moreover, the strong mutual relationship between the BEV ground point and image bounding boxes is not capitalized on. Although multi-view consistency of 2D detections can serve as a powerful constraint in BEV space, these detections are commonly treated as auxiliary signals rather than being jointly optimized with the primary this http URL this work, we propose MVDGC, a unified framework that jointly estimates pedestrian locations on the BEV plane and 2D bounding boxes in image views. MVDGC employs a sparse set of 3D cylindrical queries that embraces geometric context across both BEV and image views, enforcing dual spatial constraints for precise localization. Specifically, the geometric constraints are established by modeling each pedestrian as a vertical cylinder whose center lies on the BEV plane and whose projection casts a rectangular box in the image views. These queries function as shape anchors that directly extract 2D features from the intact image-view features using camera projection, eliminating projection-induced distortions. The 3D cylindrical query enables the unification of BEV and ImV localization into a single task: 3D cylinder position and shape refinement. Code is available at: this https URL
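The cylinder-to-box relationship can be made concrete with a small projection sketch: sample the top and bottom rims of a vertical cylinder standing on the ground plane, project the samples through a pinhole camera, and take the enclosing rectangle. The single forward-looking camera, its intrinsics, and the cylinder dimensions are all hypothetical; the paper works with calibrated multi-view cameras.

```python
import math


def project(point, focal_px, cx, cy, cam_height):
    """Pinhole projection for a camera at (0, 0, cam_height) looking along +Y (world Z is up)."""
    x, y, z = point
    if y <= 0:
        raise ValueError("point is behind the camera")
    u = cx + focal_px * x / y
    v = cy - focal_px * (z - cam_height) / y   # image v grows downward
    return u, v


def cylinder_to_box(center_xy, radius, height, focal_px=800.0, cx=640.0, cy=360.0,
                    cam_height=1.5, samples=32):
    """Bounding box (u_min, v_min, u_max, v_max) of a vertical cylinder standing on z = 0."""
    x0, y0 = center_xy
    pts = []
    for k in range(samples):
        theta = 2.0 * math.pi * k / samples
        for z in (0.0, height):                # bottom and top rims
            pts.append((x0 + radius * math.cos(theta), y0 + radius * math.sin(theta), z))
    uv = [project(p, focal_px, cx, cy, cam_height) for p in pts]
    us, vs = zip(*uv)
    return min(us), min(vs), max(us), max(vs)


near = cylinder_to_box((0.0, 5.0), radius=0.3, height=1.8)
far = cylinder_to_box((0.0, 10.0), radius=0.3, height=1.8)
```

The same three numbers (ground position, radius, height) determine both the BEV point and the image box, which is the coupling the abstract relies on when it treats the two localizations as one refinement task.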
[CV-123] Multi-Hypothesis Test-Time Adaptation to Mitigate Underspecification ECCV’26
Link: https://arxiv.org/abs/2607.00259
Authors: Afshar Shamsi,Xiao-Yu Guo,Hamid Alinejad-Rokny,Arash Mohammadi,Damien Teney,Ehsan Abbasnejad
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 26 pages, 4 figures, 12 tables, Accepted in ECCV’26
Abstract:Test-Time Adaptation (TTA) seeks to improve model robustness under distribution shifts by adapting parameters using unlabeled target data. However, in the absence of supervision, entropy-based adaptation is fundamentally underconstrained: multiple distinct parameter updates can achieve similarly low entropy while inducing drastically different decision boundaries. This phenomenon, known as underspecification, renders standard TTA brittle and prone to collapse into spurious modes. In this work, we reinterpret TTA through a posterior-inspired lens induced by entropy minimization, where low-entropy solutions define a pseudo-likelihood over parameters. Instead of committing to a single point estimate, we introduce a particle-based diversification framework that explores multiple plausible adaptation trajectories simultaneously. Our method can be viewed as a structured exploration of multiple plausible adaptation solutions, implemented through multi-level diversification at the output, parameter, optimizer, and input levels. Crucially, the framework acts as a plug-and-play wrapper compatible with existing TTA methods. Extensive experiments on challenging benchmarks demonstrate consistent gains in stability and robustness, achieving improvements of 3-4% under mixed shifts, 2-3% with batch size one, and 1-2.5% under label shifts, outperforming state-of-the-art baselines. Our results suggest that treating TTA as a multi-hypothesis inference problem, rather than a single-point optimization task, is key to mitigating underspecification and enabling reliable real-world deployment.
[CV-124] Leveraging Phase Information to Boost Unrolled Network Learning for Image Deblurring
Link: https://arxiv.org/abs/2607.00251
Authors: Samira Malek,Haichuan Zhang,Chul Lee,Vishal Monga
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:While most image deblurring techniques directly restore the spatial image variable, we propose an amplitude and phase decomposition recognizing the importance of accurate phase estimation in recovering sharp image details. To that end, we first develop novel linear minimum mean squared error (LMMSE) estimators of the amplitude and phase of the blurred, noisy image observation. An iterative optimization algorithm follows that recovers the sharp image using the aforementioned LMMSE estimators. Finally, matrix parameters that are statistically determined and fixed in the iterative algorithm are now learned using a training dataset of clean and degraded observations. Our deblurring engine is dubbed UPADNet (Unrolled Phase and Amplitude Decomposition Network), such that each iteration of the underlying phase and amplitude recovery algorithm is parameterized and trained end-to-end. Experiments on benchmark evaluation datasets such as GoPro, RealBlur, and COCO confirm that UPADNet outperforms state-of-the-art deep networks, including those based on algorithm unrolling in the image domain. The benefits of UPADNet are even more pronounced in high-noise and limited-training-data regimes.
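Why phase matters for sharp detail can be seen on a one-dimensional toy signal: splitting its discrete Fourier transform into amplitude and phase is lossless, but discarding the phase while keeping the amplitude destroys the edge positions. This sketch only shows the decomposition itself on a made-up signal; it does not implement the paper's LMMSE estimators or the unrolled network.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform of a real sequence."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse transform; returns the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


def split(spectrum):
    """Polar form: amplitude |X_k| and phase arg(X_k)."""
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def merge(amplitude, phase):
    """Rebuild complex coefficients from amplitude and phase."""
    return [a * cmath.exp(1j * p) for a, p in zip(amplitude, phase)]


edge = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0]          # a sharp 1D "edge" profile
amplitude, phase = split(dft(edge))
restored = idft(merge(amplitude, phase))                   # matches the input up to rounding
flat_phase = idft(merge(amplitude, [0.0] * len(phase)))    # same amplitude, phase discarded
```

With the phase zeroed, the energy collapses toward the first sample and the edges no longer sit where they did, even though the amplitude spectrum is untouched.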
[CV-125] Does Your ViT Still Need U-Net for Segmentation?
Link: https://arxiv.org/abs/2607.00223
Authors: Xin Li,Wenhui Zhu,Xuanzhao Dong,Xiwen Chen,Yanxi Chen,Yujian Xiong,Hao Wang,Oana M. Dumitrascu,Yalin Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 4 figures
Abstract:Medical image segmentation is dominated by U-Net-style encoder-decoder architectures. Vision Transformers (ViTs) overcome the limited receptive field of convolutional networks through self-attention, enabling modeling of long-range dependencies. Early ViT-based segmentation methods typically retained U-Net-style decoders because pretrained ViT representations were insufficient to support accurate dense prediction. Recent advances in large-scale pretraining have redefined the representation capability of ViTs, reducing the reliance on U-Net-style decoder architectures in modern vision models. This prompts two questions: Is the U-Net paradigm still necessary for medical image segmentation? If not, how should an encoder-only segmentation framework be designed? Motivated by these questions, we explore key architectural choices for encoder-only medical image segmentation based on modern ViT backbones and establish a query-based encoder-only design with multi-level query modeling and learnable block fusion, realized in Encoder-only Segmentation (EoSeg). Extensive experiments across seven benchmark datasets spanning CT, MRI, histopathology, endoscopy, and dermoscopy validate the effectiveness of the proposed design across diverse medical imaging modalities, including mDice scores of 85.50% on Synapse, 91.73% on ACDC, and 93.27% on GlaS. The results demonstrate that a U-Net-style decoder is no longer necessary for medical image segmentation with modern ViT backbones and further show that EoSeg provides an effective encoder-only design. Code is available at: this https URL
[CV-126] EgoSafetyBench: A Diagnostic Egocentric Video Benchmark for Evaluating Embodied VLMs as Runtime Safety Guards
Link: https://arxiv.org/abs/2607.00218
Authors: Siddhant Panpatil,Arth Singh,Mijin Koo,Chaeyun Kim,Haon Park,Dasol Choi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Vision-language models (VLMs) are now proposed as runtime safety guards for embodied agents in homes and factories. A deployable guard must catch genuinely unsafe situations while avoiding unnecessary intervention on routine but superficially alarming activity, a distinction that binary safety benchmarks obscure. We introduce EgoSafetyBench, an egocentric video benchmark of 1,200 robot-view scenarios annotated at half-second granularity, to evaluate VLMs as streaming guards across two tracks. The situational track (800 scenarios) spans four families, from routine and safe-but-suspicious scenes to obvious and contextual hazards. The visual-channel track (400 scenarios) targets in-scene text (a sign, sticker, or label visible in the scene) that can misrepresent the physical situation, pairing each misleading sign with a truthful version to test both whether a guard flags the text as misleading and whether the text corrupts its physical-safety judgment. Both tracks use contrastive ladders: near-identical scenarios differing only in a single visible deciding cue, so a correct call must hinge on that cue rather than the overall scene type. We evaluate ten open- and closed-source VLMs. We find that while guards reliably recognize videos containing hazards, they often miss specific hazardous moments, particularly contextual hazards. Furthermore, misleading in-scene signs degrade all tested guards: vulnerable models miss up to a third of hazards, while robust models over-intervene on safe content. Matched controls reveal that apparent safety robustness often reflects indiscriminate alarming rather than true physical reasoning.
[CV-127] Trust the Prior (or Not): Uncertainty-Aware Abdominal Aortic Aneurysm Segmentation
Link: https://arxiv.org/abs/2607.00201
Authors: Erich Robbi,Daniele Ravanelli,Andrea Passerini
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 4 figures
Abstract:Robust segmentation of intraluminal thrombus is critical for risk assessment in Abdominal Aortic Aneurysm, yet it remains challenging due to heterogeneous thrombus features and low contrast with surrounding non-enhanced tissues. Domain shifts induced by different Computed Tomography Angiography (CTA) protocols further inhibit multi-center generalization of deep learning models. To address these challenges, we propose a patient-specific framework that integrates discriminative learning with anatomically informed priors. Our approach introduces two key components: (1) a patient-specific intensity normalization based on a Gaussian Mixture Model of local anatomy, and (2) an Uncertainty-Gated Anatomical Attention module that incorporates spatial priors while adaptively modulating their influence according to voxel-wise confidence. This design allows for anatomical guidance in ambiguous regions while suppressing unreliable priors. The proposed method achieves state-of-the-art performance on in-distribution test data and substantially outperforms existing alternatives in generalization to external multi-center CTA data, while remaining interpretable through an explicit separation of visual and anatomical evidence.
[CV-128] VOCA: Visual Odometry with Codec Awareness ECCV2026
Link: https://arxiv.org/abs/2607.00189
Authors: Nouri Alexander Hilscher,Mateo de Mayo,Dominik Muhle,Christoph Otten genannt Hermes,Daniel Cremers
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026
Abstract:Camera pose estimation from image streams is a critical component of spatial world models that integrate perception into planning and decision-making. Nearly all Visual Odometry (VO) and Simultaneous Localization and Mapping (V-SLAM) systems have focused on datasets containing raw, uncompressed videos. Many working systems instead use ubiquitous hardware units to efficiently compress and decode video streams, saving orders of magnitude in storage and bandwidth. However, this lossy compression introduces visual artifacts that hinder the performance of traditional tracking systems. We present VOCA, a causal stereo visual-odometry method that exploits codec information to improve tracking performance. We achieve state-of-the-art performance on causal VO for relative trajectory error, efficiency, and absolute trajectory error on compressed streams. This work highlights the potential of leveraging widely available video codec information for vision tasks.
[CV-129] DriftScope: Measuring The Hidden Effects of Diffusion Model Adaptation ECCV2026
Link: https://arxiv.org/abs/2607.00183
Authors: Héctor Laria,Yiping Han,Julian D. Santamaria,Kai Wang,Bogdan Raducanu,Joost van de Weijer,Alexandra Gomez-Villa
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 5 figures, Accepted at ECCV 2026
Abstract:Adapting pre-trained text-to-image diffusion models, whether to learn new visual concepts or erase unwanted ones, is routinely evaluated on its intended effects alone. We argue this framing is incomplete. Through sparse autoencoder analysis and zero-shot classification, we demonstrate that adaptation systematically damages semantically unrelated concepts in ways that aggregate metrics structurally cannot surface: when damage is severe enough for FID and KID to respond, the model is already nearly unusable; when the model remains functional, FID and KID stay flat while specific classes silently suffer worst-case zero-shot accuracy drops of up to 18.9 points and concept-level distributions shift dramatically. This pattern appears at both ends of the adaptation spectrum (concept customization and concept unlearning), suggesting it is a systematic consequence of weight-level modification rather than an artifact of any particular method. To surface this hidden drift before deployment, we introduce DriftScope, a prompt-level diagnostic tool that takes any two model checkpoints and returns a ranked list of tokens whose visual concepts have shifted most between them. DriftScope optimizes a soft prompt to attribute drift at the token level without requiring access to real data or model internals. The result is an interpretable, concept-level audit that aggregate evaluation cannot provide.
[CV-130] PRISM-VO: Scale-Aware Visual Odometry Using Photometric Plenoptic Bundle Adjustment ECCV
Link: https://arxiv.org/abs/2607.00176
Authors: Aymeric Fleith,Julian Zirbel,Daniel Cremers,Niclas Zeller
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at the 19th European Conference on Computer Vision (ECCV) 2026
Abstract:We introduce PRISM-VO, a novel pure optimization-based sparse photometric visual odometry framework for focused plenoptic cameras. The core of PRISM-VO is a novel photometric plenoptic bundle adjustment which jointly optimizes camera poses and inverse depth values of points in a sliding window. By combining geometric depth from a single plenoptic image with temporal multi-view constraints, PRISM-VO achieves accurate and drift-resilient motion estimation. Through explicit modeling of the plenoptic projection, PRISM-VO provides reliable metric-scale reconstructions, overcoming the scale ambiguity of monocular SLAM algorithms. Importantly, our approach relies solely on a single plenoptic sensor and avoids complex initialization, as depth priors are computed directly from plenoptic imaging. Experiments show that PRISM-VO outperforms the current state-of-the-art plenoptic visual odometry method on indoor and outdoor scenes. The proposed approach rivals other optimization- and learning-based methods while accurately and reliably recovering a metric scale of the scene. Project page: this https URL
[CV-131] Steal the Patch Size: Adversarially Manipulate Vision-Language Models
Link: https://arxiv.org/abs/2607.00174
Authors: Kai Hu,Akash Bharadwaj,Weichen Yu,Matt Fredrikson
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:We present a black-box model-stealing attack that recovers private vision-tokenizer configurations of deployed vision-language models (VLMs), including the visual patch size and input preprocessing pipeline. The key idea is a task-level side channel induced by ViT-style patchification: when a synthetic grid image is aligned with the hidden patch grid, boundary cues are erased at tokenization, causing periodic drops in accuracy. By sweeping the grid cell size and measuring these collapses, we infer the patch size; by introducing padding and a consistency-check test, we further identify whether preprocessing is dynamic- or fixed-resolution and recover the target resize resolution. Across open-source Qwen-VL variants and proprietary models including GPT and Claude, we reliably recover tokenizer-related parameters. Finally, we show that such leakage enables preprocessing-aware transfer attacks and model-targeted adversarial manipulation.
[CV-132] Progressive Pose-Guided 4D Animal Reconstruction from Monocular Video ECCV2026
Link: https://arxiv.org/abs/2607.00157
Authors: Siyuan Li,Weiying Chen,Yilin Wang,Xinxin Zuo,Xingyu Li,Li Cheng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026. Camera-ready author version
Abstract:Reconstructing 4D animals from monocular videos is challenging due to large inter-species variation, complex articulations, and the lack of reliable templates. Existing approaches typically rely on either strict category-specific priors that restrict generalization, or unconstrained generative models that sacrifice input fidelity. To bridge this gap, we present a progressive test-time optimization framework built on 3D Gaussian Splatting for high-fidelity 4D animal reconstruction from a single video. Our key insight is that a coarse shape prior suffices when coupled with a progressive strategy that disentangles articulated pose from non-rigid deformation. Specifically, we employ a symmetry-aware temporal encoding that exploits bilateral cues while absorbing camera estimation drift and a part-conditioned deformation mechanism guided by learnable part anchors and a learnable skinning field. Extensive experiments demonstrate that our approach generalizes robustly across diverse species, achieving superior geometric accuracy, temporal consistency, and visual fidelity compared to existing baselines, even under severe prior mismatch.
[CV-133] 3D Point World Models: Point Completion Enables More Accurate Dynamics Learning
Link: https://arxiv.org/abs/2607.00148
Authors: Skand Peri,Hung Nguyen,Chanho Kim,Li Fuxin,Stefan Lee
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 Pages
Abstract:Learning predictive models of the world enables robotic control through planning, potentially allowing robots to improvise solutions on new tasks. However, large video-based dynamics models lack explicit 3D spatial structure and suffer from geometrically inconsistent long-term rollouts with compounding errors. Emerging 3D dynamics models based on partial point clouds improve geometric consistency but remain sensitive to occlusions and accumulated prediction drift. To address these challenges, we present 3D Point World Models (3DPWM), a task-agnostic world model that operates entirely in 3D space by first completing partial point clouds and then learning action-conditioned dynamics in this completed 3D scene. By operating on completed geometry, 3DPWM enables reliable long-horizon rollouts and more accurate cost evaluation for model-based planning while supporting adaptation to new tasks. Experiments across different robotic embodiments and tabletop manipulation benchmarks demonstrate that 3DPWM achieves significantly more reliable long-horizon rollouts (100-300+ steps), supports both open-loop and closed-loop planning, and enables successful sim-to-real transfer.
[CV-134] A Mechanism-Driven Theory of Phase Transitions in Active Learning ECCV2026
Link: https://arxiv.org/abs/2607.00144
Authors: Julia Machnio,Mads Nielsen,Mostafa Mehdipour Ghazi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at ECCV 2026
Abstract:Active learning (AL) performance is known to be budget-dependent, yet regimes are typically defined by heuristic label counts that fail to generalize across datasets or architectures. We characterize AL dynamics by reframing budget regimes as shifts in the dominant generalization mechanism. By reinterpreting PAC-style risk components as dynamic interacting terms, we prove that dominance shifts are structurally unavoidable, creating a moving bottleneck for generalization. We operationalize this using measurable proxies and a segmented regression procedure to identify a tripartite taxonomy: data-driven, transition, and model-driven phases. Our framework explains the long-standing observation that representativeness, coverage, and uncertainty strategies excel at different stages. Experiments across natural and medical imaging show that AL efficiency depends on the alignment between the strategy’s inductive bias and the active bottleneck. Moreover, self-supervised representation shift transitions earlier along the labeling trajectory, highlighting the role of representation quality in shaping AL dynamics. Overall, this work provides a unified framework for the next generation of transition-aware AL algorithms.
[CV-135] MG-SpaIR: Multi-grade Sparse-guided Implicit Representation for Training-Data-Free Image Restoration
Link: https://arxiv.org/abs/2607.00138
Authors: Jianmin Liao,Lei Huang,Ronglong Fang,Ashley Prater-Bennette,Lixin Shen,Yuesheng Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:MG-SpaIR is a training-data-free framework for restoring a clean image from a single observation corrupted by a mixture of blur, downsampling, noise, and missing pixels. Building on implicit neural representations (INRs), we introduce a multi-grade coarse-to-fine residual hierarchy that progressively refines the reconstruction across resolution grades, improving representational fidelity and mitigating spectral limitations. To stabilize reconstruction optimization and suppress INR-induced artifacts, we further propose an explicit sparse proximal regularization (e.g., $\ell_0$-type) applied directly in the high-resolution image domain, which discourages spurious high-frequency patterns while preserving sharp structures. The resulting optimization is solved efficiently via a multi-grade proximal alternating scheme, and we establish convergence guarantees for the associated updates under standard regularity conditions. Experiments on mixed-degradation benchmarks demonstrate that MG-SpaIR consistently outperforms strong training-data-free baselines such as Deep Image Prior, providing a stable, interpretable, and data-efficient alternative to conventional learning-based restoration methods.
[CV-136] A Synthetic-Driven Vision System for Assembly Step Recognition
Link: https://arxiv.org/abs/2607.00129
Authors: Hui Zhang,Xuanang Lei,Rui Wang,Julian Ferchow,Mirko Meboldt
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CASE 2026
Abstract:Quality control in industrial assembly is essential, and real-time monitoring of the assembly process is crucial for preventing costly defects and ensuring production reliability. Vision-based automated inspection offers a powerful solution for such real-time monitoring. However, due to the specialized industrial components and processes, training these models typically relies on task-specific real-world data, which is costly and labor-intensive to collect and annotate. In this paper, we propose a system that automatically generates realistic assembly sequences and further trains real-time inspection models using the synthetic data. It can be efficiently applied to a given task within an hour, requiring only CAD models and simple step descriptions. Focusing on practical challenges, our system integrates a physics-based motion generation module to capture the variance of different human assembly, designs domain-randomized rendering to deal with the environmental complexity and variation, and employs an object-detection-based step recognition module for robust sim-to-real transfer, leading to 92.4% accuracy on a real-world assembly case with 46.7%, 15.8% and 61.2% performance improvement, respectively. Overall, our system provides a practical solution for industrial assembly inspection without requiring expensive real-world data collection and annotation, with the effectiveness validated on real industrial assembly tasks.
[CV-137] Decompose Compare and Decide: Multimodal LLMs are Implicit Few-Shot Learners
Link: https://arxiv.org/abs/2607.00125
Authors: Yunhan Wang,Eshika Khandelwal,Edson Araujo,Walid Bousselham,Nina Shvetsova,Hilde Kuehne
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Multimodal Large Language Models (MLLMs) have demonstrated remarkable abilities when analyzing images, yet translating these capabilities to few-shot image classification remains challenging. To bridge this gap, we present DeCoDe, a simple yet effective technique that enables off-the-shelf MLLMs to act as strong few-shot classifiers without any additional training. Our approach builds on the idea of few-shot classification as a set of pairwise image comparisons, decomposing the task into a set of binary decisions. Given a query image and a support image from a candidate class, the MLLM is prompted to decide whether the two images depict the same class. The logit corresponding to an affirmative response is then used as a similarity score to assign the query image to the most likely class. While this already yields good results, we show that providing additional high-level information, such as the data domain, to the model further improves performance. Our evaluation provides an extensive analysis of various inference variants on a suite of twelve datasets, six established and six newly curated few-shot benchmarks spanning across diverse domains. The results show that the proposed simple decomposition technique can turn off-the-shelf MLLMs into powerful few-shot learners, significantly outperforming current state-of-the-art few-shot methods on both standard and novel domains. Code is available at this https URL.
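The decision rule described in the DeCoDe abstract above (score each query-support pair by the logit of an affirmative "same class" answer, then pick the most likely class) can be sketched roughly as follows. This is only an illustration, not the authors' code: the function name, the example logits, and the choice to average scores over a class's support images are assumptions, since the abstract does not say how several support images per class are aggregated.

```python
from collections import defaultdict


def classify_by_pairwise_scores(pair_scores):
    """Assign a query image to a class from pairwise comparison scores.

    pair_scores: list of (class_name, score) tuples, one per support image.
    Each score is assumed to be the model's logit for answering "yes" to
    "do these two images show the same class?".
    """
    per_class = defaultdict(list)
    for class_name, score in pair_scores:
        per_class[class_name].append(score)
    # Assumption: average the scores of all support images in a class.
    mean_scores = {c: sum(s) / len(s) for c, s in per_class.items()}
    best_class = max(mean_scores, key=mean_scores.get)
    return best_class, mean_scores


# Hypothetical affirmative-answer logits for one query image.
scores = [("cat", 2.1), ("cat", 1.5), ("dog", -0.3), ("dog", 0.4), ("fox", 1.0)]
label, means = classify_by_pairwise_scores(scores)
print(label)  # prints: cat
```

Averaging is just one plausible aggregation; taking the maximum score per class would be an equally simple alternative when only one support image is expected to resemble the query.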
[CV-138] Segmenting Fast and Slow: Real-Time Open-Vocabulary Video Instance Segmentation with Dual-Path Processing ECCV2026
Link: https://arxiv.org/abs/2607.00124
Authors: Luca Barsellotti,Martin Sundermeyer,Mattia Segu,Nikita Araslanov,Muhammad Ferjad Naeem,Marcella Cornia,Yongqin Xian,Maxim Berman
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2026
Abstract:Object-centric models inspired by DETR have become the dominant paradigm for open-vocabulary video instance segmentation (OV-VIS). While recent efforts have reduced the computational cost of pixel decoding, textual modality fusion, and object decoding to make these architectures more suitable for mobile devices, real-time on-device inference at high frame rates remains an open challenge. In this paper, we introduce SegFS, a dual-stream fast-slow framework that significantly improves efficiency without sacrificing accuracy. On sparse keyframes, an open-vocabulary object-based model predicts instance-level representations. These representations are then projected back into the backbone feature space to condition a lightweight fast network, which efficiently relocalizes and segments the instances in subsequent frames. By shifting instance propagation from object decoding to feature-space conditioning, our approach decouples multimodal semantic understanding from dense mask prediction and enables efficient temporal propagation. The proposed fast branch achieves up to 14x lower latency than the mobile-oriented MOBIUS model, while maintaining competitive segmentation performance on standard OV-VIS benchmarks.
[CV-139] PixelEyes: Decoupling Perception and Reasoning for Pinpoint Visual Evidence Seeking
Link: https://arxiv.org/abs/2607.00115
Authors: Dengxian Gong,Yuanzheng Wu,Haobo Yuan,Zhengdong Hu,Tao Zhang,Yikang Zhou,Shihao Chen,Quanzhu Niu,Kai Wang,Jason Li,Haochen Wang,Lu Qi,Shunping Ji,Ming-Hsuan Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 10 figures
Abstract:This paper explores multi-turn visual reasoning and observes that MLLMs repeatedly fail to localize the target, leading to long, redundant trajectories. We attribute this failure to the entanglement of reasoning and perception within a single model: the MLLM reasons and localizes simultaneously, and inaccurate localization triggers additional reasoning turns that bloat the trajectory. To solve this problem, we propose PixelEyes, a multi-turn visual reasoning agent that explicitly decouples reasoning from perception, i.e., the reasoner decides what to look for, while a specialized perception tool answers where it is. Specifically, PixelEyes introduces 1) Mask-guided Visual Search. A referring segmentation model is invoked to provide mask-precise localization, freeing the reasoner from the need to compensate for imprecise grounding. 2) Semantic-region Breadth-first Search (BFS). To eliminate redundant loops caused by repeatedly cropping incorrect sub-regions, we organize exploration as a breadth-first search over semantic regions. To internalize these capabilities, we construct the PixelEyes-6K dataset by resynthesizing expert trajectories from existing data. This explicitly embeds our mask-guided search and BFS logic into the model. We further introduce Pinpoint-Bench, a zero-hint visual search benchmark, i.e., no location cues are provided in the question, with instance-level masks and bounding boxes that separate localization failures from reasoning failures, enabling fine-grained analysis of failure modes such as inattentional blindness. Recent state-of-the-art MLLMs and visual reasoning agents leave large headroom on Pinpoint-Bench, demonstrating its quality and difficulty. Code and models are open-sourced.
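To make the "breadth-first search over semantic regions" idea in the PixelEyes abstract above concrete, here is a generic BFS sketch. It is not the paper's implementation: the region hierarchy, the region names, and the `is_target` predicate (standing in for the perception tool) are hypothetical. The point is only that a visited set plus level-by-level expansion prevents the same sub-region from being examined twice.

```python
from collections import deque


def bfs_regions(root, children, is_target):
    """Explore regions level by level; return (found_region, visit_order).

    children: dict mapping a region name to its sub-region names
              (hypothetical hierarchy for illustration).
    is_target: callable that says whether a region contains the target.
    """
    queue = deque([root])
    visited = {root}
    order = []
    while queue:
        region = queue.popleft()
        order.append(region)
        if is_target(region):
            return region, order
        for child in children.get(region, []):
            if child not in visited:  # never re-crop a region seen before
                visited.add(child)
                queue.append(child)
    return None, order


# Hypothetical semantic-region hierarchy for one image.
regions = {
    "image": ["street", "building"],
    "street": ["car", "sign"],
    "building": ["window"],
}
found, order = bfs_regions("image", regions, lambda r: r == "sign")
print(found, order)
```

With this hierarchy, both top-level regions are examined before any of their sub-regions, which is the breadth-first property the abstract relies on.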
[CV-140] Lost in the Tail: Addressing Geographic Imbalance in Urban Visual Place Recognition ECCV2026
Link: https://arxiv.org/abs/2607.00090
Authors: Zhiyao Shu,Jiacheng Yang,Yang Lu,Waishan Qiu,Chuan Li,Da Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ECCV 2026, 28 pages including supplementary material
Abstract:Urban-scale Visual Place Recognition (VPR) aims to identify the geographic location of a query image by matching it against a geo-tagged database. While recent methods achieve impressive performance, they overlook a severe long-tailed imbalance hidden in urban-scale datasets, which biases models towards frequently photographed locations and causes them to fail in sparsely covered, less-visited areas. In this paper, we systematically characterize this imbalance challenge and propose Distribution-Aware Place Recognition (DAPR), a model-agnostic plug-in framework that rebalances gradient contributions across head and tail classes. Additionally, within classification-retrieval pipelines, DAPR applies a multi-scale distance search mechanism to compute per-class distributional compactness, providing complementary gains at the retrieval stage. On the large-scale SF-XL benchmark, our framework outperforms the previous classification-retrieval baseline by 18.3% on test set v1, and 6.7% on test set v2. As a plug-in module, it achieves consistent improvements across representative VPR methods on SF-XL, MSLS, and Pitts30k, demonstrating broad generalizability across different methods and benchmarks.
[CV-141] Synergistic Perception-Reasoning Governance: Grounding Medical MLLMs with Verifiable Anatomical Evidence MICCAI2026
Link: https://arxiv.org/abs/2607.00060
Authors: Rui Hao,Qiankun Li,Junyuan Mao,Linghao Meng,Dirui Xie,Dayu Tan,Zhigang Zeng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by MICCAI 2026 (Early Accept, Top 9%)
Abstract:Multimodal large language models (MLLMs) show strong promise for clinical VQA and radiology report generation, yet inference-time hallucinations still undermine trustworthy use: models can produce fluent conclusions that conflict with imaging evidence. Existing mitigation strategies typically rely on additional training, external retrieval/knowledge bases, or multi-stage post-hoc verification, which increases cost and pipeline complexity and often generalizes poorly across models. To address this, we propose a holistic, training-free evidence-injection framework that systematically mitigates hallucinations through dual-side evidence injection. By leveraging ROI priors acquired using MedSAM in our implementation, we recalibrate the visual perception trajectory via ROI-guided activation modulation while anchoring the textual reasoning trajectory by mapping anatomical coordinates into discrete semantic tokens as verifiable external memory. Then we introduce a task-aware dynamic router to select modality-specific interventions based on task semantics, balancing perceptual grounding and linguistic fluency. We conduct systematic evaluations on 2 tasks and 5 datasets using LLaVA-1.5-7B, LLaVA-Med-1.5-7B, Qwen3-VL-8B/32B, and InternVL-3.5-8B/38B. Controlled ablations and visualizations further validate the framework, which consistently outperforms baselines across medical benchmarks, improving close-ended accuracy by up to ~6% and reducing open-ended hallucinations by ~35%. The code has been made available on GitHub: this https URL.
[CV-142] Joint Medical Image Enhancement and Segmentation with Diffusion-based Symbiotic Information Interaction IJCAI2026
Link: https://arxiv.org/abs/2607.00058
Authors: Ying Chen,Jinyue Li,Qiankun Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IJCAI 2026
Abstract:Image quality is critical for accurate medical diagnosis. However, MRI, CT, and ultrasound images are often of low resolution and quality due to cost constraints, complicating the visualization of key anatomical structures and lesions. While such limitations are common in practice, traditional methods treat image enhancement as a separate preprocessing step, failing to fully leverage its potential synergy with image segmentation. To address this, we propose DiSIINet (Diffusion-based Symbiotic Information Interaction Network), which is built on the principle that enhancement and segmentation should mutually reinforce each other in a unified model. Based on Denoising Diffusion Implicit Models (DDIM), DiSIINet integrates an enhancement branch and a segmentation branch. These branches interact through a novel Symbiotic Information Interaction (SII) module, which facilitates dynamic, feature-level information exchange via cross-attention during the reverse diffusion process. This design enables both tasks to iteratively improve each other. The DDIM backbone ensures high-quality output and efficient inference through deterministic sampling. Experiments on multi-modal medical datasets (MRI, CT, ultrasound) show that DiSIINet achieves significant performance improvements compared to sequential or independent enhancement and segmentation approaches. The code is available at: this https URL.
[CV-143] Enhancing Oracle Bone Inscription Recognition via Multi-Scale Layer Attention
Link: https://arxiv.org/abs/2607.00057
Authors: Chaowen Yan,Kaishen Wang,Yong Wang,Jianlong Xiong,Tao He
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recognition of Oracle Bone Inscriptions (OBIs) plays a crucial role in understanding ancient Chinese culture. However, accurately recognizing OBIs remains highly challenging due to their complex, irregular, and often degraded shapes. Traditional methods rely on expert knowledge and manual analysis, which are time-consuming and error-prone. Although deep learning has greatly advanced general image recognition, existing methods struggle to capture the fine-grained details and subtle variations inherent in OBIs, resulting in limited performance. Even the most recent and effective layer attention techniques, which are designed to capture fine-grained dependencies through enhanced inter-layer interactions, yield only marginal improvements in OBI recognition. To address these limitations, we propose Multi-Scale Layer Attention (MSLA), a novel paradigm that explicitly models both multi-scale and cross-layer feature interactions. By enriching the representation with fine-grained details across multiple spatial scales, MSLA enables more accurate and robust OBI recognition. Extensive experiments on large-scale OBI datasets demonstrate that MSLA consistently outperforms existing attention mechanisms while maintaining computational efficiency.
[CV-144] Vertigo Vertigo: Reconstructing a Cinematic Ideal through its Predictive AI Double SIGGRAPH
Link: https://arxiv.org/abs/2607.00047
Authors: Adam Cole,Mick Grierson
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted to Ars Electronica EXPANDED 2026 - Conference on Animation and Interactive Art (in cooperation with ACM SIGGRAPH), Ars Electronica Festival, Linz. 7 pages, 7 figures. Authors’ version
Abstract:Vertigo Vertigo is a scene-for-scene AI reconstruction of Hitchcock’s Vertigo (1958), generated from only 2.78% of the original film’s frames. Using this sparse set of keyframe anchors, we perform first-last frame interpolation via a large video diffusion model to predict the intervening sequences. Vertigo is itself a film about the obsessive reconstruction of an artificial ideal; Vertigo Vertigo extends this logic to the material of the film, treating the canonical text as a probe for the normative conventions of classical cinema encoded within generative systems. Evaluated through computational analysis and critical feedback from media theorists (Lev Manovich, Shane Denson, Kevin L. Ferguson), the artifact demonstrates remarkable structural fidelity: 73.1% of frames are recognizable as plausible renditions of Vertigo and only 3.6% fail catastrophically. This fidelity suggests that cinematic norms are deeply compressed within the model’s latent priors. Aesthetically, the reconstruction is rendered as an unstable overlay between the original film and its predictive shadow, fueling a persistent doubt in the viewer’s perception of authenticity – a 21st-century vertigo. The work argues that generative media is not a paradigm shift from cinema but an acceleration of its logic of desire and false authenticity, extending from classical Hollywood through to the predictive media environments now reshaping contemporary perception.
[CV-145] Learning Dexterous Manipulation Using Contact Wrench Guidance From Human Demonstration
Link: https://arxiv.org/abs/2607.00033
Authors: Xinghao Zhu,Zixi Liu,Shalin Jain,Chenran Li,Milad Noori,Huihua Zhao,John Welsh,Michael Andres Lin,Wei Liu,Tingwu Wang,Xingye Da,Zhengyi Luo,Vishal Kulkarni,Naema Bhatti,Yuke Zhu,Linxi Fan,Bowen Wen,Danfei Xu,Soha Pouya,Yan Chang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Dexterous robot manipulation can benefit from the abundance of human demonstrations, but transferring such demonstrations to robot policies remains challenging. We present Contact Wrench Guidance from Human Demonstration in Robotic Dexterous Manipulation (CHORD), a framework for long-horizon manipulation of rigid and articulated objects with reinforcement learning. The key idea is object-centric contact wrench space guidance: we represent human and robot motions by the forces and torques they can induce on the object, enabling similarity to be measured by the induced instantaneous motions. This guidance makes reinforcement learning more scalable for contact-rich dexterous manipulation. We further introduce a large-scale simulation benchmark with 4,739 bimanual dexterous manipulation tasks, constructed from motion-capture datasets and reconstructed in-house videos. Evaluated on 1,831 benchmark tasks, CHORD achieves an average success rate of 82.12%, demonstrating strong scalability. CHORD also generalizes to whole-body manipulation from hand-only and third-person demonstrations, achieving a 90.77% success rate, and the learned policies transfer to the real world in both open-loop and closed-loop settings.
[CV-146] Towards an automated AI-based framework for floor plan compliance checks for residential buildings
Link: https://arxiv.org/abs/2607.00015
Authors: Subash Gautam,Debaditya Acharya,Alexandra Kleeman,Sarah Foster
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
Comments:
Abstract:To improve residents’ well-being in Australia’s urban areas, governments have introduced policy reforms such as SEPP65, BADS, and SPP7.3 to enhance apartment design quality. These regulations require precise geometric and spatial analysis to evaluate health-related features, including daylight access, natural ventilation, privacy, and space efficiency. However, compliance checking remains challenging due to its manual, time-intensive nature. Additionally, evolving policies limit scalability for large-scale assessments across thousands of apartments. Existing automated floor plan analysis methods are fragmented and typically focus on single apartments, lacking a unified framework for multi-unit compliance checking. This article explores current advancements in automated floor plan analysis, particularly AI-driven approaches, and highlights key challenges in their practical adoption. To address these gaps, a conceptual framework is proposed for automated compliance checking in multi-apartment buildings. A Large Language Model (LLM) is used within a Rule Engine to convert textual building codes into executable, explainable rules. A Data Extraction Engine segments floor plan images into elements such as walls, rooms, fixtures, text, and symbols, and transforms them into a structured building graph with topological relationships. This structured representation is then evaluated by a Compliance Check Engine, which leverages LLM-generated rules for assessment. The proposed framework offers a scalable, consistent, and transparent approach to automated compliance checking across jurisdictions, supporting efficient enforcement of apartment design standards and promoting healthier, higher-density urban development.
[CV-147] UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios
Link: https://arxiv.org/abs/2511.18050
Authors: Tian Ye,Song Fei,Lei Zhu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project Page: this https URL
Abstract:Diffusion transformers have recently delivered strong text-to-image generation around 1K resolution, but we show that extending them to native 4K across diverse aspect ratios exposes a tightly coupled failure mode spanning positional encoding, VAE compression, and optimization. Tackling any of these factors in isolation leaves substantial quality on the table. We therefore take a data-model co-design view and introduce UltraFlux, a Flux-based DiT trained natively at 4K on MultiAspect-4K-1M, a 1M-image 4K corpus with controlled multi-AR coverage, bilingual captions, and rich VLM/IQA metadata for resolution- and AR-aware sampling. On the model side, UltraFlux couples (i) Resonance 2D RoPE with YaRN for training-window-, frequency-, and AR-aware positional encoding at 4K; (ii) a simple, non-adversarial VAE post-training scheme that improves 4K reconstruction fidelity; (iii) an SNR-Aware Huber Wavelet objective that rebalances gradients across timesteps and frequency bands; and (iv) a Stage-wise Aesthetic Curriculum Learning strategy that concentrates high-aesthetic supervision on high-noise steps governed by the model prior. Together, these components yield a stable, detail-preserving 4K DiT that generalizes across wide, square, and tall ARs. On the Aesthetic-Eval at 4096 benchmark and multi-AR 4K settings, UltraFlux consistently outperforms strong open-source baselines across fidelity, aesthetic, and alignment metrics, and, with an LLM prompt refiner, matches or surpasses the proprietary Seedream 4.0.
[CV-148] Closed-loop coupling of personalised and foundation models for real-time treatment guidance with MRI
Link: https://arxiv.org/abs/2607.00500
Authors: James Grover,Emily A. Hewson,Andrew Phair,Michael Ferraro,Hilary L. Byrne,Paul Keall,Michael G. Jameson,David E.J. Waddington
Subjects: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 8 figures, 2 supplementary figures
Abstract:Image-guided therapies, including radiotherapy, biopsy and deep brain stimulation, rely on real-time targeting of anatomical structures. However, in the presence of motion, imaging latencies create a temporal misalignment between observed and true anatomy, compromising treatment accuracy. Artificial intelligence-based frameworks have increasingly been presented to close this latency gap, but leading personalised models can fail due to a lack of stable anatomical grounding. Foundation models can provide grounded behaviour, but they do not adapt to real-time, individual patient dynamics. Here we introduce a closed-loop coupling framework that synergises patient-specific temporal prediction with continuous segmentation-based anatomical interpretation from a foundation model. A personalised model predicts future anatomy to compensate for system latency, while a streaming foundation model provides anatomical supervision used to continuously update the temporal predictor in real time during treatment. We validate the framework using a digital phantom and intrafraction magnetic resonance imaging (MRI) from patients undergoing MRI-guided radiotherapy. For a prediction horizon of 400 ms, the proposed method improves anatomical prediction and reduces dosimetric error compared with existing approaches, within clinically relevant latency constraints. These results establish closed-loop coupling as a general strategy for real-time image-guided intervention.
[CV-149] Predicting Lethal Outcome (Cause) And Understanding Key Biomarkers Linked With Acute Myocardial Infarction Using Deep Artificial Neural Network And Ensemble Of Machine Learning Methodologies
Link: https://arxiv.org/abs/2607.00472
Authors: Sagnik Ghosh
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Master of Science (MSc), Thesis Report
Abstract:Cardiovascular disease is still one of the main causes of death around the world. Acute myocardial infarction (MI), or heart attack, claims millions of lives each year. MI happens when blood flow to the coronary arteries is blocked or reduced, which causes permanent damage to the heart muscle. Without treatment, this can lead to cardiac arrest, where the heart stops pumping blood to the organs, resulting in organ failure and death. Even survivors often face serious problems like heart failure, pulmonary edema, and asystole. Research shows that 5 to 10 percent of survivors die within the first year after an MI, and nearly half need to be hospitalized again. Early thrombolytic treatment leads to better outcomes, so there is a clear need for faster and more accurate ways to diagnose MI. Right now, doctors usually review patient history and use their own experience to find the causes of MI. This process takes a lot of time and can be inconsistent. Detecting MI accurately and quickly can help patients take better care of themselves and prevent fatal events. In this study, we introduce an automated model to predict deadly outcomes of MI and help doctors understand important biomarkers linked to its complications. This approach aims to make diagnosis clearer, faster, and more affordable. The process includes preparing the data, filling in missing values, and handling imbalanced data using SVMSMOTE, ADASYN, and class-weighted methods. We use wrapper and embedded feature selection to find the most important variables, then scale the features for consistency. The model combines Logistic Regression, Random Forest, Light-GBM, and Bagging SVM, and is further improved with an artificial neural network to increase accuracy. We evaluate all models using precision, recall, and other key measures to find the best option for clinical use.
[CV-150] MalariAI: A Label-Resilient Decoupled Framework for Universal Cell Segmentation and Explainable Stage Classification in Dense Malaria Blood Smears
Link: https://arxiv.org/abs/2607.00385
Authors: Kaysarul Anas Apurba,Md Hasibul Hasan,Mohammed Ali,Tanzilur Rahman
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to Computerized Medical Imaging and Graphics (under review). 4 authors, includes figures and appendix
Abstract:Automated malaria diagnosis from blood smear microscopy is a critical challenge in global health AI; in resource-limited settings, the scarcity of expert microscopists remains the primary bottleneck to timely and accurate diagnosis. Three compounding failure modes prevent reliable clinical deployment of existing deep learning systems. First, end-to-end detectors treat unannotated cells as background during training, producing recall figures that are strongly influenced by annotation completeness rather than reflecting true cell recovery. Second, Non-Maximum Suppression tends to suppress valid detections in dense smear regions where infection counts matter most. Third, existing whole-slide detection pipelines lack per-cell spatial evidence for clinical audit, despite image-level explainability methods such as Grad-CAM having been applied to malaria image classification tasks. We present MalariAI, a two-stage decoupled framework that addresses all three failure modes in a unified pipeline. Stage 1 applies an annotation-agnostic distance-transform guided watershed algorithm to isolate every cell in a full 1600x1200 blood smear image, recovering 75.95% of ground-truth cells by centroid localisation across the 120-image NIH BBBC041 test set without any ground-truth input. Stage 2 fine-tunes EfficientNet-B0 with Focal Loss (gamma = 2.0, per-class inverse-frequency weights) on 64x64 crops, achieving 98.36% overall classification accuracy with 87.5% and 75.0% per-class accuracy on the rare schizont and gametocyte stages, compared to only 24.57% and 25.95% AP for a Faster R-CNN baseline on the same classes. Grad-CAM++ heatmaps generated per detected cell provide instance-level spatial evidence for clinical audit, enabling microscopists to verify model predictions at the individual parasite level without sacrificing classification performance.
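The MalariAI abstract above mentions Focal Loss with gamma = 2.0 and per-class inverse-frequency weights. The sketch below shows the standard per-sample focal loss, FL = -w_c * (1 - p_t)^gamma * log(p_t), in plain Python. The class counts, the "ring" class name, and the weight normalisation N / (K * n_c) are assumptions for illustration; the abstract does not give the exact weighting formula.

```python
import math


def inverse_frequency_weights(class_counts):
    """Weight each class by N / (K * n_c); this normalisation is an assumption."""
    total = sum(class_counts.values())
    num_classes = len(class_counts)
    return {c: total / (num_classes * n) for c, n in class_counts.items()}


def focal_loss(probs, target, weights, gamma=2.0):
    """Per-sample focal loss: -w_c * (1 - p_t)^gamma * log(p_t)."""
    p_t = probs[target]
    return -weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)


# Hypothetical training-set counts with two rare stages.
counts = {"ring": 900, "schizont": 60, "gametocyte": 40}
weights = inverse_frequency_weights(counts)

# A confident prediction on the common class contributes almost nothing...
easy = focal_loss({"ring": 0.95, "schizont": 0.03, "gametocyte": 0.02}, "ring", weights)
# ...while a poorly predicted rare-class sample dominates the loss.
hard = focal_loss({"ring": 0.60, "schizont": 0.30, "gametocyte": 0.10}, "schizont", weights)
print(easy < hard)  # prints: True
```

Setting gamma to 0 reduces this to ordinary class-weighted cross-entropy, which makes it easy to see how much of the effect comes from the focusing term rather than from the class weights.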
Artificial Intelligence
[AI-0] Language-Critique Imitation Learning from Suboptimal Demonstrations
Link: https://arxiv.org/abs/2607.01225
Authors: Chih-Han Yang,Dai-Jie Wu,Yun-Ping Huang,Ping-Chun Hsieh,Kenneth Marino,Shao-Hua Sun
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Prior work on imitation learning from suboptimal demonstrations typically relies on compressed supervision signals such as confidence estimates, discriminator scores, or importance weights. These scalar signals are inherently limited, as they cannot explicitly express intermediate reasoning about task progress, failure modes, or corrective actions. We propose a language-critique framework for imitation learning from suboptimal demonstrations that instead leverages natural language as a structured supervision signal, avoiding the collapse of expressive feedback into scalars. Our method first constructs language labels from demonstrations that explicitly describe current progress, identify suboptimal behaviors, and provide fine-grained corrective guidance. We then introduce a language-critique loss that directly trains policies using these structured signals without reducing them to scalars, and instantiate it for both behavior cloning and diffusion policies, yielding LC-BC and LC-DP. We further provide a theoretical result showing that the proposed objective upper-bounds the expert performance gap under standard assumptions. Empirically, we evaluate on diverse continuous control tasks spanning navigation, manipulation, and gameplay, where our methods consistently outperform strong imitation learning and offline reinforcement learning baselines. These results demonstrate that language can serve as a powerful and structured form of supervision for learning robust policies from suboptimal data.
[AI-1] FurnitureVLA: Learning Long-Horizon Bimanual Furniture Assembly with Vision-Language-Action Model
Link: https://arxiv.org/abs/2607.01212
Authors: Chenyang Ma, Yue Yang, Radu Corcodel, Siddarth Jain, Andrew Wu, Chiori Hori, Diego Romeres
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Project Page: this https URL
Abstract:Current work on robot furniture assembly mostly focuses on toy-scale settings or single-arm manipulation. We introduce FurnitureVLA, the first systematic study of real-scale bimanual furniture assembly using Vision-Language-Action models (VLAs). We formalize the task, develop a scalable simulation pipeline for expert data generation and evaluation, and build a VR teleoperation system for single-operator bimanual control to collect high-quality real-world demonstrations. To address extreme long-horizon assembly with up to 7 subtasks and 1550 control steps, we propose a progress-enhanced VLA, finetuned on semantically grounded subtasks, that jointly predicts actions and a continuous progress signal, enabling automatic subtask transitions and reducing compounding errors during inference. We further study perception and control design factors that critically affect precision in real-scale assembly. FurnitureVLA improves average simulation success from 48% to 80% compared to baselines across three furniture types, with an additional 21% gain from our design factor study. We validate on a real Kinova Gen3 platform with only 16% drop on the hardest task.
[AI-2] Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?
Link: https://arxiv.org/abs/2607.01211
Authors: Zhi Chen, Zhensu Sun, Yuling Shi, David Lo, Lingxiao Jiang
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Repository-level performance-optimization benchmarks such as GSO, SWE-Perf and SWE-fficiency evaluate coding agents by applying patches to real repositories and comparing runtime against unoptimized baselines and official reference patches. Their leaderboard scores are increasingly used as evidence of coding-agent progress, but those scores can conflate runtime instability, benchmark-specific scoring rules, and how many tasks are already solved by at least one public submission. We audit these issues across the three benchmarks. First, we replay the official reference patches for 740 code optimization tasks across four common types of Google Cloud machines. Most benchmark tasks can be replayed, but their reference patches satisfy the original benchmark validity rules in every cross-machine replay for only 39/102 GSO tasks, 11/140 SWE-Perf tasks, and 411/498 SWE-fficiency tasks; SWE-Perf is especially fragile because many reference patches produce close-to-zero runtime changes. Second, we show that public submission rankings depend strongly on the benchmark scoring rule. Among eight public submissions shared by GSO and SWE-fficiency, the official rankings disagree on 9 of 28 pairwise submission comparisons, and SWE-fficiency’s leaderboard scoring rule assigns the worst ten tasks overly high score weights of 58.5%-82.8%. Third, looking across 10 public submissions for each task, we find that at least one submission matches or beats the reference patch on 85.3% (384/450) of replay-valid GSO and SWE-fficiency tasks, and beats the unoptimized base code on 99.8% (449/450). Our study complements leaderboard scores by identifying tasks with more reliable performance signals, quantifying per-task score contributions, and exposing the remaining performance gaps that are hidden by aggregate rankings.
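The "9 of 28 pairwise submission comparisons" figure follows from eight shared submissions, which give C(8, 2) = 28 pairs. A minimal sketch of how such a disagreement count can be computed is shown here; the two rankings are hypothetical placeholders, not the leaderboards from the paper.

```python
from itertools import combinations

def pairwise_disagreements(rank_a, rank_b):
    """Count pairs of items that the two rankings order differently."""
    pos_a = {name: i for i, name in enumerate(rank_a)}
    pos_b = {name: i for i, name in enumerate(rank_b)}
    pairs = list(combinations(rank_a, 2))
    flipped = sum(
        1 for x, y in pairs
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )
    return flipped, len(pairs)

# Hypothetical rankings of eight submissions under two scoring rules.
rank_a = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"]
rank_b = ["s2", "s1", "s3", "s4", "s6", "s5", "s7", "s8"]

flipped, total = pairwise_disagreements(rank_a, rank_b)
print(f"{flipped} of {total} pairs disagree")
```

Counting flipped pairs is the unnormalised form of the Kendall tau distance, which makes disagreement between scoring rules comparable across leaderboards of different sizes once divided by the number of pairs.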
[AI-3] GPU-Parallel Linearization Error Bounds for Real-Time Robust Optimal Control of Nonlinear and Neural Network Dynamics
Link: https://arxiv.org/abs/2607.01203
Authors: Jeffrey Fang, Keyi Shen, Anutam Srinivasan, Glen Chou
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Optimization and Control (math.OC)
Comments:
Abstract:This paper studies real-time robust optimal control for uncertain nonlinear systems, where linear time-varying (LTV) approximations make planning tractable but require sound linearization error bounds (LEBs) to guarantee robust constraint satisfaction. We develop tight, differentiable, GPU-parallel LEBs for LTV approximations of nonlinear and neural network (NN) dynamics. For analytic dynamics, we introduce path-based Hessian bounds that are tighter than standard interval methods. For NN dynamics, we derive certified LEBs using NN verifier-generated affine relaxations and local Jacobian corrections. We adapt a GPU-parallel system-level synthesis LTV-based robust control solver to be compatible with these LEBs by extending it to handle right-invertible disturbance matrices and non-zero-centered disturbance sets for tight zonotopic uncertainty propagation. Our method, GPUSLS-LEO, enables online optimization of robust feedback policies that account for linearization error, producing tight, formally verified reachable tubes. On complex nonlinear and NN dynamics up to 168 state dimensions, our method can compute robust control policies on the GPU at rates up to 67 Hz, reducing solve times and conservativeness relative to baselines while preserving formal guarantees and real-time performance.
[AI-4] Optimal Resource Utilization for Autonomous Laboratory Orchestrators
Link: https://arxiv.org/abs/2607.01188
Authors: Austin McDannald, Julia Tisaranni, Howie Joress
Subjects: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci)
Comments:
Abstract:In autonomous laboratories, AI agents suggest the next batch of experiments to do. However, planning and executing those tasks so as to take full advantage of the available resources is a separate problem. This can be challenging when dealing with real-world hardware constraints, especially when there are multiple instruments with different capacities and throughputs. Here we demonstrate a 2-step method to address resource utilization for our autonomous platform for metal-organic framework synthesis. First, we use constraint programming to find optimal schedules that minimize the total time while still satisfying the limitations and capacities of the hardware. Second, we use a system of status dependencies for each task, which allows for the robust execution of the optimal schedules.
[AI-5] Sequentially-Controlled Interactive Multi-Particle Flow-Maps for Online Feedback-Driven Search
Link: https://arxiv.org/abs/2607.01144
Authors: Binglin Ji, Anindya Sarkar, Hengchang Lu, Jens Sjölund, Yevgeniy Vorobeychik
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: 28 pages, 19 figures
Abstract:While generative models have enabled training-free reward alignment, current methods typically excel in local exploration within narrow regions of the underlying distribution. These approaches struggle when preferences are unknown a priori and only revealed through sequential feedback, a scenario demanding broad exploration to uncover high-utility regions. To address this, we propose Sequentially-Controlled Interactive Multi-Particle Flow-Maps (IMPFM), a framework for sample-efficient online feedback-driven search. IMPFM progressively transports a group of interactive particles toward the target distribution, maintaining the broad coverage essential for heterogeneous preference alignment. IMPFM introduces a principled and efficient posterior sample sharing mechanism across particles powered by flow maps. By correcting individual particle drift with the collective posterior samples of the entire ensemble at each resampling step, the framework maximizes sample utility to enable global exploration while actively mitigating the reward over-optimization typical of standard control frameworks. Paired with a principled exploration-exploitation reweighting mechanism involving multi-particle interaction, these sequentially corrected multi-particle dynamics explicitly preserve structural diversity and overcome the weight degeneracy inherent to standard SMC samplers. Crucially, we prove that the resulting sampling framework yields a multi-particle interaction-aware Feynman-Kac corrector that progressively steers the multi-particle system toward a KL-tilted target distribution, facilitating global exploration and preventing mode collapse. Extensive empirical evaluations and rigorous ablations across diverse search and alignment tasks confirm the efficacy of IMPFM over existing baselines.
[AI-6] Skills Are Not Islands: Measuring Dependency and Risk in Agent Skill Supply Chains
Link: https://arxiv.org/abs/2607.01136
Authors: Changguo Jia, Tianqi Zhao, Runzhi He, Minghui Zhou
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Agent skills package reusable operational knowledge for Large Language Model (LLM) agents, yet as they grow in scope, they become dependency-bearing artifacts whose identities, versions, and provenance remain implicit. This opacity already causes duplicated dependencies and inconsistent installations, exposing a gap that dependency management has yet to close. We introduce Agent Skill Supply Chains (ASSCs) to characterize mixed skill-package-service dependency graphs and help close this gap. Borrowing from Software Bill of Materials (SBOMs), we design SkillDepAnalyzer to capture natural-language dependency evidence and model skills as dependency-bearing artifacts. On the SKILL-DEP benchmark, SkillDepAnalyzer recovers skill metadata and dependency graphs accurately and comprehensively, substantially outperforming an LLM-based baseline and package-centric SBOM tools. Applying SkillDepAnalyzer to over 1.43 million skills, we obtain ASSCs and explore their structural diversity and security signals. We find four structural patterns: skill metadata is activation-ready but governance-poor; dependency graphs span skill, package, and service dependencies with concentrated reuse; recursive skill reuse expands dependency graphs and creates hidden package inventory; and skill dependency clusters form around related workflows. We also find that inspecting a skill alone misses security-relevant signals hiding in its dependencies. By analyzing ASSCs, we identify and report known malicious skills persisting in ASSCs to their developers. Based on these findings, we recommend typed dependency manifests, first-class dependency-cluster management, risk-warning audit commands for skill infrastructure maintainers, and lockfile-like records for skill developers.
[AI-7] Muon as a Residual Connection
Link: https://arxiv.org/abs/2607.01124
Authors: Hao Huang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Muon has recently emerged as one of the most effective optimizers for training large neural networks, yet its empirical success has been explained from several different perspectives. In this paper, we propose a simple mechanistic interpretation: Muon can be understood as an implicit residual connection during training. Specifically, orthogonalizing the update can sacrifice some immediate gradient fidelity while improving representation preservation for downstream layers. We study this trade-off in controlled linear optimization settings, where Muon can learn representations that are slower to fit a local target but easier for downstream layers to exploit. Our results suggest a conceptual explanation for Muon and a design perspective for optimizers that balance local descent with downstream usability.
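The phrase "orthogonalizing the update" can be made concrete with a small sketch. The abstract does not say which procedure is used, so a plain cubic Newton-Schulz iteration is swapped in here for illustration; practical Muon implementations typically use a tuned polynomial variant. The matrix values and helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def orthogonalize(update, steps=30):
    """Approximate the orthogonal polar factor of `update` via X <- 1.5 X - 0.5 X X^T X."""
    # Frobenius normalisation keeps all singular values in (0, 1],
    # which is inside the convergence region of the cubic iteration.
    norm = sum(v * v for row in update for v in row) ** 0.5
    x = [[v / norm for v in row] for row in update]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxt_x[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x

# Hypothetical full-rank 2x2 gradient-like update.
q = orthogonalize([[2.0, 0.0], [1.0, 1.0]])
gram = matmul(q, transpose(q))  # should be close to the identity matrix
print(gram)
```

The iteration pushes every singular value of the update toward 1 while leaving the singular vectors unchanged, which is the sense in which the orthogonalized step discards magnitude information from the raw gradient.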
[AI-8] FAR: Failure-Aware Retry for Test-Time Recovery and Continual Policy Improvement
Link: https://arxiv.org/abs/2607.01111
Authors: Haoran Hao, Shahram Najam Syed, Jeffrey Ichnowski, Jeff Schneider
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Robot policies inevitably encounter failures when deployed in real environments. Naive retries often repeat the same mistakes, while many existing recovery methods rely on human intervention. In this paper, we propose Failure-Aware Retry (FAR), a framework that enables robots to learn from previous failures at test time, adapt their behavior accordingly, and eventually complete the task autonomously. FAR combines Failure-Contrastive Preference Adaptation, which constructs preference learning data from failures to steer the policy away from previously unsuccessful behaviors, with lightweight action perturbations during retries to encourage local exploration. We further incorporate successful recovery trajectories into a training loop for continual policy improvement. Experiments in both simulation and real-world manipulation tasks show that FAR substantially improves success rates and robustness, with average gains of 17.6% over the standard diffusion policy in simulation and 11.7% in the real world. In addition, FAR significantly improves data efficiency under both reset and timestep budgets during continual policy improvement by exploiting informative failure cases.
[AI-9] Cheap Code Costly Judgment: A Case Study on Governable Agentic Software Engineering
Link: https://arxiv.org/abs/2607.01087
Authors: James C. Davis, Paschal C. Amusuo, Tanmay Singla, Berk Çakar, Kirsten A. Davis
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 12 pages
Abstract:Generative AI is shifting software engineering from a practice organized around scarce implementation effort toward one organized around abundant, low-cost code production. This shift changes the central engineering problem: not whether AI can generate useful code, but how engineers organize architectures, tools, evidence, and feedback loops so that AI-mediated development remains inspectable, correctable, and maintainable. We study this problem through a first-person case study: a 12-week development effort in which a single expert software engineer used frontier AI coding agents to build a document accessibility remediation system. The empirical record comprises 88 contemporaneous field notes, 420 KLOC of production code, and 1.16 MLOC of tests, lints, supporting documentation, and agent tooling. From this record, we develop a candidate middle-range theory of governance conversion, expressed as a process model explaining how high-velocity agentic implementation becomes governable. The model explains how agentic implementation velocity surfaces recurring structural failure classes, and how engineering judgment sustains velocity by converting those failures into durable governance mechanisms. In contrast to existing governance models that derive controls from known obligations, governance conversion explains how controls are discovered from failures that become visible only during agentic work. We use our model to make testable predictions and to describe implications for software engineering research and practice.
[AI-10] Can Agents Generalize to the Open World? Unveiling the Fragility of Static Training in Tool Use ICML2026
Link: https://arxiv.org/abs/2607.01084
Authors: Song-Lin Lv, Weiming Wu, Rui Zhu, Zi-Jian Cheng, Lan-Zhe Guo
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted by ICML 2026
Abstract:While Large Language Model (LLM) agents demonstrate proficiency in static benchmarks, their deployment in real-world scenarios is hindered by the dynamic nature of user queries, tool sets, and interaction dynamics. To address this generalization gap, we formalize OpenAgent (Tool-Use Agent in Open-World), a problem setting characterized by distributional shifts across query, action, observation, and domain dimensions. To systematically diagnose its impact, we construct a controlled sandbox environment where we define fine-grained environmental shifts across a four-tier hierarchy, Perception, Interaction, Reasoning, and Internalization, and conduct a comprehensive series of experiments. Our analysis yields a series of key insights, demonstrating that agents trained via both Supervised Fine-Tuning (SFT) and Reinforcement Learning suffer from varying degrees of performance degradation when confronting open environmental shifts. Building on these insights, we propose Perturbation-Augmented Fine-Tuning, a disturbance-based intervention strategy for SFT that lays the foundation for enhancing agent robustness and utility in realistic environments. Our code will be released at: https://github.com/LAMDA-NeSy/OpenAgent.
[AI-11] Staleness-Learning Rate Scaling Laws for Asynchronous RLHF
Link: https://arxiv.org/abs/2607.01083
Authors: Jingwei Song, Haofeng Xu, Jie Xiao, Chengke Bao, Jingwei Shi, Pengbin Feng, Weixun Wang, Yuhang Han, Chuan Wu, Linfeng Zhang, Bill Shi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:High-throughput RLHF systems often decouple rollout generation from policy optimization, leading to the use of stale rollouts during learner updates. In this work, we study the effect of such staleness in asynchronous GRPO. We make the behavior policy explicit in the GRPO surrogate objective and distinguish between the surrogate-gradient mapping used by the learner and the true total derivative of a distribution-dependent population objective. Under assumptions of local boundedness, distributional smoothness, and behavior-policy smoothness, we show that stale rollouts introduce a per-step surrogate-gradient bias of order O(S * eta), where S denotes the maximum rollout lag and eta denotes the learning rate. We further derive a conditional collapse-time scaling law: when within-cycle drift remains below a batch-level clipping radius, collapse is governed primarily by cumulative learner drift T * eta; when the stale-rollout constraint is active, stability instead depends explicitly on S * eta. This yields a two-constraint stability condition eta <= min{R_batch / (S * G_upd), R_crit / (T * G_upd)}, explaining why the maximum stable learning rate may appear weakly dependent on staleness in the horizon-limited regime.
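The two-constraint stability condition can be illustrated with a short sketch. It assumes the condition is an upper bound on eta given by the smaller of R_batch / (S * G_upd) and R_crit / (T * G_upd); the function name and all numeric values are hypothetical and only show how the binding constraint switches between regimes.

```python
def max_stable_lr(r_batch, r_crit, staleness, horizon, g_upd):
    """Return the assumed learning-rate bound and the name of the binding constraint."""
    stale_bound = r_batch / (staleness * g_upd)   # stale-rollout constraint, scales as 1 / S
    horizon_bound = r_crit / (horizon * g_upd)    # cumulative-drift constraint, scales as 1 / T
    if stale_bound < horizon_bound:
        return stale_bound, "staleness"
    return horizon_bound, "horizon"

# Hypothetical constants: long training horizon, small rollout lag.
lr_small_lag, regime_small_lag = max_stable_lr(
    r_batch=0.2, r_crit=1.0, staleness=4, horizon=1000, g_upd=1.0)

# Same setup with a much larger rollout lag.
lr_large_lag, regime_large_lag = max_stable_lr(
    r_batch=0.2, r_crit=1.0, staleness=400, horizon=1000, g_upd=1.0)

print(lr_small_lag, regime_small_lag)
print(lr_large_lag, regime_large_lag)
```

In the first call the horizon term is the smaller one, so changing S does not change the bound; this is the horizon-limited regime in which the maximum stable learning rate looks only weakly dependent on staleness.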
[AI-12] DART-VLN: Test-Time Memory Decay and Anti-Loop Regularization for Discrete Vision-Language Navigation
Link: https://arxiv.org/abs/2607.01043
Authors: Shaoheng Zhang, Zhichen Li, Jie Mei
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted by the 2026 IEEE International Conference on Systems, Man, and Cybernetics (IEEE SMC 2026). Camera-ready version
Abstract:Memory-based discrete vision-language navigation (VLN) agents must act under partial observability, yet even strong frozen backbones remain vulnerable at test time. Two common failure modes are stale historical evidence at memory readout and inefficient local backtracking during action selection. We present DART-VLN, a training-free test-time control framework for discrete VLN. DART-VLN combines Test-Time Memory Decay, a read-side memory reweighting rule that suppresses stale and redundant evidence without rewriting stored content, with Anti-Loop Regularization, a lightweight next-hop penalty that discourages immediate reversals during action selection. The framework introduces no new learnable parameters and leaves the learned backbone unchanged. Experiments on R2R and REVERIE show a consistent pattern: decay-only provides stable read-side gains, while decay+anti-loop achieves the best overall quality-efficiency trade-off, yielding shorter trajectories, lower runtime, and improved navigation performance in key settings. Behavioral analysis further confirms that anti-loop regularization reduces local backtracking and improves path efficiency under frozen backbones. Overall, the results show that modest test-time control can make memory-based discrete VLN more reliable and efficient without retraining.
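The two test-time controls can be pictured with a toy sketch. The abstract does not give the exact rules, so the exponential age decay, the fixed reversal penalty, and every name and value here are assumptions for illustration rather than the DART-VLN formulation.

```python
import math

def decayed_read_weights(scores, ages, decay_rate=0.1):
    """Reweight memory read scores by exp(-decay_rate * age), then renormalise.

    Stored entries are not modified; only the read-side weights change.
    """
    raw = [s * math.exp(-decay_rate * a) for s, a in zip(scores, ages)]
    total = sum(raw)
    return [r / total for r in raw]

def select_next_hop(candidate_scores, previous_node, penalty=1.0):
    """Pick the highest-scoring candidate after penalising an immediate reversal."""
    adjusted = {
        node: score - (penalty if node == previous_node else 0.0)
        for node, score in candidate_scores.items()
    }
    return max(adjusted, key=adjusted.get)

# Two memory entries with equal raw scores; the second is ten steps older.
read_weights = decayed_read_weights([1.0, 1.0], ages=[0, 10])

# Node "A" was just visited, so stepping back to it is discouraged.
next_node = select_next_hop({"A": 0.9, "B": 0.5}, previous_node="A")
print(read_weights, next_node)
```

Both controls act only on scores at inference time, which is consistent with the claim that no learnable parameters are added and the backbone stays frozen.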
[AI-13] PedNStream: Scalable Network Flow Simulation for Pedestrian Traffic Management
Link: https://arxiv.org/abs/2607.01021
Authors: Weiming Mai, Dorine Duives, Serge Hoogendoorn
Subjects: Artificial Intelligence (cs.AI)
Comments: 13 pages, 14 figures
Abstract:Large-scale crowd management requires pedestrian simulations that are both computationally efficient and compatible with feedback-based control. However, most open-source tools are either microscopic or not designed for network-scale closed-loop evaluation. This paper presents PedNStream (Pedestrian Network Flow Simulation), an open-source, Python-native simulator for macroscopic pedestrian network loading based on the Link Transmission Model (LTM). The framework extends LTM-based pedestrian models by incorporating stochastic link dynamics that capture diffusion and activity-induced variability, and replaces dynamic user equilibrium route choice with a utility-based formulation suited to uncertain, intervention-driven settings. PedNStream is implemented as a modular framework with built-in controller interfaces for interventions such as gating, flow separation, and route guidance. We evaluate the framework in a staged manner. Synthetic scenarios verify key mechanisms, including queue formation, spillback, congestion dissipation, and adaptive rerouting. Real-network experiments assess large-scale behavior and consistency with observed pedestrian counts. A closed-loop case study demonstrates controller integration, and a runtime analysis quantifies scalability. These results establish PedNStream as an efficient and practical testbed for large-scale pedestrian network simulation and control.
[AI-14] SWE-Doctor: Guiding Software Engineering Agents with Runtime Diagnosis from Multi-Faceted Bug Reproduction Tests
Link: https://arxiv.org/abs/2607.00990
Authors: Yaoqi Guo, Yang Liu, Jie M. Zhang, Yun Ma, Yiling Lou, Zhenpeng Chen
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language model (LLM)-based software engineering agents are increasingly developed to resolve software issues by generating patches from issue reports and code repositories. Bug reproduction tests (BRTs) are an important building block for such agents and have been shown useful for patch validation. However, it remains unclear whether BRTs can also help the more central stage of patch generation. We first conduct a preliminary study and find that directly using advanced BRT generators to guide patch generation is not beneficial: fail-to-fail BRTs can mislead agents, while even fail-to-pass BRTs bring limited or negative gains. Our analysis reveals two reasons: fail-to-pass BRTs may cover only one manifestation of the reported issue, leading to partial patches, whereas fail-to-fail BRTs are unreliable as direct patch-generation targets. Motivated by these insights, we propose SWE-Doctor, a software issue resolution agent that guides patch generation with runtime diagnoses derived from multi-faceted BRT executions. SWE-Doctor first generates multi-faceted BRTs for different behavioral requirements stated in the issue, then executes and debugs these BRTs to construct runtime-grounded diagnosis records, and finally uses the diagnoses together with localization information inferred during BRT generation to guide patch generation and reduce partial patches. We evaluate SWE-Doctor on Python bug-fixing issues from the widely adopted SWE-bench Verified and SWE-bench Pro across five LLM backends. SWE-Doctor consistently outperforms existing agents across all 10 LLM-benchmark combinations, achieving average resolution rates of 75.7% on SWE-bench Verified and 59.4% on SWE-bench Pro. In particular, on the more challenging SWE-bench Pro, SWE-Doctor improves the average resolution rate by 8.0-8.9 percentage points over the baseline agents.
[AI-15] Bayesian Uncertainty Propagation for Agentic RAG Pipelines: A Proof-of-Concept Study on Multi-Hop Question Answering
Link: https://arxiv.org/abs/2607.00972
Authors: Louis Donaldson, Connor Walker, Koorosh Aslansefat, Yiannis Papadopoulos
Subjects: Artificial Intelligence (cs.AI)
Comments: Submitted for 7th International Conference on Maintenance and Intelligent Asset Management (ICMIAM 2026)
Abstract:Trustworthy deployment of Agentic Retrieval-Augmented Generation (RAG) systems requires mechanisms for estimating when multi-stage reasoning pipelines may fail. This paper presents an uncertainty-aware Agentic Retrieval-Augmented Generation (RAG) framework in which planner, evaluator and generator stages produce uncertainty signals derived from semantic divergence and generator self-evaluation. These signals are propagated through a Bayesian Network (BN) to estimate system-level uncertainty and provide node-level indicators of potential failure points across the workflow. The approach is evaluated on StrategyQA and HotpotQA using GPT-3.5-Turbo and GPT-4.1-Nano, with Area Under the Receiver Operating Characteristic Curve (AUROC), Area Under the Accuracy-Rejection Curve (AUARC), Expected Calibration Error (ECE), and Brier Score used to assess discrimination, selective prediction and calibration. Results show that Bayesian propagation is more effective on HotpotQA, where uncertainty accumulates across multi-hop reasoning stages, while StrategyQA exposes limitations caused by miscalibration and unreliable upstream signals. The study positions Bayesian uncertainty propagation as a promising but preliminary mechanism for monitoring Agentic RAG systems, with future validation required in industrial domains such as Offshore Wind (OSW) maintenance decision support.
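Two of the calibration metrics named above, Brier Score and Expected Calibration Error, are simple enough to sketch directly. The equal-width binning, the bin count, and the toy confidence values are assumptions for illustration; the paper's own evaluation settings are not given in the abstract.

```python
def brier_score(confidences, outcomes):
    """Mean squared difference between predicted confidence and the 0/1 outcome."""
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(outcomes)

def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for c, o in zip(confidences, outcomes):
        idx = min(int(c * n_bins), n_bins - 1)  # a confidence of 1.0 falls in the last bin
        bins[idx].append((c, o))
    n = len(outcomes)
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(o for _, o in members) / len(members)
            ece += len(members) / n * abs(avg_conf - accuracy)
    return ece

# Hypothetical answers: confidence 0.8 on five questions, four of them correct.
confidences = [0.8, 0.8, 0.8, 0.8, 0.8]
outcomes = [1, 1, 1, 1, 0]
print(brier_score(confidences, outcomes))
print(expected_calibration_error(confidences, outcomes))
```

In this toy case the stated confidence matches the observed accuracy, so the calibration error is near zero even though the Brier score is not, which is why the two metrics are reported together.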
[AI-16] Aionoscope: Debugging Latent-State Accessibility in Time-Series Representations KDD
Link: https://arxiv.org/abs/2607.00956
Authors: Alexander Chemeris, Ming Jin, Randall Balestriero
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
ACM classes: I.2.6; I.5.2; G.3
Comments: 9 pages, 4 figures. Accepted by the 12th Mining and Learning from Time Series (KDD MILETS 2026). Interactive results: this https URL. Source artifacts: this https URL and this https URL
Abstract:Time-series models are often evaluated by what they can forecast or classify, but those scores do not show whether their representations preserve the process state a user may want to inspect: event timing, phase, amplitude, frequency, or regime variables. We introduce Aionoscope, a generator-based diagnostic tool for debugging latent-state accessibility in frozen time-series representations. Aionoscope separates process generation from observation rendering, producing seeded synthetic streams with exact categorical and dense labels across mixture complexity and nuisance variation. We instantiate Aionoscope as Primitive Process Mixtures and evaluate 37 model-plus-adapter systems with a common pooled linear-probe protocol. The main result is a mismatch between coarse and fine-grained accessibility. Most systems make component presence easy to recover, but expose dense process state much less reliably: the highest observed dense-probe row reaches 0.689 mean masked R^2, while a dense-feature oracle reaches 0.999. This is the failure mode Aionoscope is designed to surface: a representation can look informative at the level of “what kind of signal is present” while hiding the timing, phase, amplitude, frequency, or regime variables needed for debugging.
[AI-17] Human-Machine Collaboration on Generative Meta-Learning: Model and Algorithm
Link: https://arxiv.org/abs/2607.00926
Authors: Midhun Parakkal Unni, Samuel Kaski
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Generalizing machine learning models to environments that differ from their training distribution remains a critical hurdle, particularly when data from the target domain is entirely or partially unavailable. We propose Generative Meta-Learning with Human Feedback (GMHF), a novel framework that bridges this domain gap by leveraging expert intuition to guide data synthesis. Grounded in a theoretical analysis of generalization error, we derive bounds demonstrating that aligning the distribution of generated data with human beliefs regarding the target physics significantly mitigates risk. GMHF operationalizes this insight by employing a Conditional Neural ODE (cNODE) as a generative digital twin, coupled with a Reinforcement Learning (RL) agent. The agent iteratively refines the latent physical parameters of the generated trajectories based on feedback, effectively steering the meta-learner toward the unobserved target distribution. Empirical validation on a nonlinear Duffing oscillator shows that GMHF substantially reduces deployment loss as expert reliability increases, and that the divergence between generated and target data falls under reliable feedback, directly corroborating the divergence-minimisation mechanism predicted by our theory. Further experiments on a non-dynamical probabilistic model confirm that the framework extends beyond ODE-governed systems, establishing human-AI collaboration as a rigorous catalyst for robust generalisation under distribution shift.
[AI-18] Valdi: Value Diffusion World Models
Link: https://arxiv.org/abs/2607.00917
Authors: Christopher Lindenberg, Kashyap Chitta
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: RLC 2026 WMW
Abstract:World models can enable Model Predictive Control (MPC), but this requires dynamics prediction that is both fast enough for online use and expressive enough to represent uncertain futures. Diffusion models offer a natural mechanism for modeling uncertain dynamics, yet their iterative inference procedure makes them difficult to use for low-latency latent planning. We bridge this gap with Value Diffusion World Models (Valdi), combining end-to-end online training for MPC with a latent diffusion dynamics model. In preliminary experiments on the CarRacing environment, we show that Valdi, using a single diffusion step at both training and inference, matches a deterministic MLP baseline. Our experiments expose a trade-off between predictive multimodality and control performance in this setup. Code is available at this https URL.
[AI-19] Two AI Metrics Diverged: Will it Make All the Difference? ICML
Link: https://arxiv.org/abs/2607.00913
Authors: Alex Fogelson, Zachary A. Brown, Hans Gundlach, Jayson Lynch, Neil Thompson
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted into 2026 ICML Technical AI Governance Research Workshop
Abstract:As exponential compute scaling continues, will the capabilities of frontier AI models outstrip what is accessible to developers on a small fixed budget? Or will capabilities converge, with “meek models inheriting the earth”? Building on Gundlach et al. (2025b), we show that the answer depends on how we value and measure AI capabilities. We discuss conventional performance measures and show that, while validation loss shows a shrinking gap, on other metrics frontier models grow their lead forever. Classifying performance metrics by their functional forms in relation to training (and inference) compute, we provide tight mathematical conditions for determining which metrics favor meek models, and show that bounded performance metrics always do. But careful interpretation of performance metrics is essential: we show that many common bounded metrics have closely-related counterpart metrics that are unbounded (and vice versa). Determining the apt metric in a domain is a prerequisite for policy, since bounded and unbounded metrics may suggest opposing policy responses. If a particular capability – like software engineering, synthetic biology, or rhetorical persuasiveness – is unbounded when measured in the terms we care about, frontier-level capability will likely be concentrated in the hands of a few wealthy actors. Conversely, if that capability is instead bounded, frontier-level capabilities proliferate through meek models into the hands of the many.
[AI-20] From World Models to World Action Models: A Concise Tutorial for Robotics
Link: https://arxiv.org/abs/2607.00836
Authors: Xiaoxiong Zhang,Xiong Zeng,Wei Zhang
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: Project page: this https URL
Abstract:World models are increasingly used in embodied intelligence and generative simulation, yet their scope remains ambiguous across communities. This tutorial presents a design-space view of world models as action-conditioned predictive models that estimate the future evolution of task-relevant observations or states. We categorize existing methods into observation-space and state-space world models, comparing their trade-offs in visual fidelity, spatial structure, physical interpretability, and control usability. We further introduce world action models, which connect predicted futures with executable robot actions, and summarize four representative paradigms: imagine-then-execute, video-feature-conditioned action prediction, joint video-action modeling, and auxiliary video prediction for policy learning. The goal of this tutorial is to clarify the conceptual scope of world (action) models and provide a structured taxonomy for embodied prediction and control.
[AI-21] Exploring the Semantic Gap in Agentic Data Systems: A Formative Study of Operationalization Failures in Analytical Workflows
Link: https://arxiv.org/abs/2607.00828
Authors: Jalal Mahmud,Eser Kandogan
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) are increasingly used to generate queries, invoke tools, and construct analytical workflows. Although recent advances have substantially improved workflow generation and execution, the semantic information required to operationalize analytical concepts often lies beyond what is explicitly represented in database schemas and data values. We present a cross-domain formative study of operationalization failures in agent-generated analytical workflows. Across 236 analytical intents spanning finance, human resources, and public safety domains, we identify 153 recurring failures despite successful workflow generation and execution. Our analysis reveals five recurring classes of failures: comparative grounding, process reasoning, quantitative reasoning, role confusion, and policy grounding. These findings suggest a semantic gap between user-level analytical concepts and the information available to workflow-generation systems. More broadly, they raise questions about the admissibility of analytical operations and suggest that future agentic data systems may require richer semantic representations to bridge the gap between analytical intent and executable computation.
[AI-22] LRAT-Catcher: Importing SAT Solver Certificates into Lean4 by Reflection
Link: https://arxiv.org/abs/2607.00815
Authors: Stefan Szeider
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments:
Abstract:SAT solvers settle combinatorial problems beyond the reach of interactive theorem provers and produce LRAT certificates for independent verification. We present LRAT-Catcher, a standalone, general-purpose tool that imports a DIMACS formula together with an LRAT certificate into Lean 4 as a theorem. LRAT-Catcher runs the formally verified LRAT checker from Lean core as compiled native code via reflection. This scales to instances where Mathlib’s explicit proof-term import exhausts memory. LRAT-Catcher also composes cube-and-conquer solving runs entirely inside Lean. Per-cube refutations are combined with a cover-completeness certificate, itself an LRAT proof, into a single unsatisfiability theorem. Verified encodings connect CNF-level results to the original combinatorial problems. We evaluate the tool against Mathlib’s proof-term import and the external checker cake_lpr on establishing the Schur number S(4) = 44 and the Ramsey number R(4,4) = 18 as Lean theorems.
[AI-23] Phantom References: Hallucinated Citations That Survive Peer Review at Top-Tier Conferences
Link: https://arxiv.org/abs/2607.00738
Authors: Mark Russinovich,Ram Shankar Siva Kumar,Ahmed Salem
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models can generate polished scientific text that includes unsupported claims, allowing hallucinations to enter the archival record. Assessing this risk via technical statements is difficult and often requires expert judgment, but citations provide a more auditable surface: a reference either resolves to a real scholarly work with compatible authorship, or it does not. We measure citation hallucination in peer-reviewed proceedings using a conservative definition limited to identity-level failures: non-existent works and substantial author-list mismatches. We explicitly exclude ordinary bibliographic drift (e.g., venue/year differences, publication-status updates, minor name variants). To audit citations at scale, we build RefChecker, a verification pipeline that resolves bibliography entries against multiple bibliographic sources and escalates unresolved cases to web-search re-verification. We apply RefChecker to accepted camera-ready papers from ICLR, ICML, NeurIPS, and USENIX Security. Hallucinated citations have entered the archival record. While reference-level rates are usually below 1%, proceedings are large enough that paper-level failures are visible: in 2025, roughly one in twenty NeurIPS and USENIX Security papers contains at least two likely hallucinated academic-paper-like references under our strict definition. We also observe post-ChatGPT increases in several venues, including a tail of papers with 5+ failures in a single bibliography, and likely hallucinated citations even among award-winning papers. These results suggest peer review alone does not reliably enforce citation integrity, yet auditing is tractable (about 0.04 per paper in one venue-scale scan). We open-source RefChecker for routine, reproducible citation verification before publication (this https URL). 
Submission history: [v1] Wed, 1 Jul 2026 10:21:07 UTC (311 KB), submitted by Ahmed Salem. DOI: https://doi.org/10.48550/arXiv.2607.00738 (arXiv-issued via DataCite, pending registration)
[AI-24] LLM-Guided ODE Discovery and Parameter Inference from Small-Cohort Aggregate Data
Link: https://arxiv.org/abs/2607.00733
Authors: Hanning Yang,Meropi Karakioulaki,Lennart Purucker,Tim Litwin,Cristina Has,Moritz Hess
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Mechanistic modeling via ordinary differential equations (ODEs) provides interpretable descriptions of complex dynamics and enables inference of underlying mechanisms, which is particularly valuable in clinical settings. However, in rare diseases, both the structure and parameters of the model are typically unknown, while individual-level data is scarce, noisy, heterogeneous, and subject to privacy constraints. In such settings, population-level summary statistics provide a practical privacy-preserving data representation, while capturing heterogeneity further requires modeling parameters as distributions rather than fixed values. Yet no existing method jointly discovers ODE structure and refines parameter distributions solely from summary statistics. We present AgentODE, an end-to-end framework that addresses this gap. An LLM proposes candidate ODE structures, while a tool-augmented inference agent iteratively refines parameter distributions through a diagnosis–update loop, operating on population-level summary statistics alone. We evaluate AgentODE on three benchmark problems across different fields and two clinical datasets, including the rare disease recessive dystrophic epidermolysis bullosa (RDEB), with only 231 observations across 46 patients. AgentODE recovers functionally consistent ODE structures across all settings, and experiments on RDEB demonstrate that, in sparse and noisy data settings, reasoning from summary statistics promotes mechanistically principled structure discovery, whereas baselines with individual-level data access recover implausible structures despite better predictive performance. AgentODE opens new possibilities for mechanistic modeling of rare diseases directly from population-level summary statistics, where data scarcity and privacy constraints have traditionally limited such analyses.
[AI-25] Detecting the Undetectable: Enhancing Unsupervised time series Anomaly Detection via Active Learning
Link: https://arxiv.org/abs/2607.00720
Authors: Seung Hun Han,Hyeongwon Kang,Jinwoo Park,Pilsung Kang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Despite the increasing sophistication of industrial AI systems, the ability to reliably detect subtle and noisy anomalies in complex time series data remains a critical yet unresolved challenge. In large-scale industrial applications, labeling time series data is often prohibitively expensive and time-consuming, making unsupervised learning a practical and widely adopted approach. However, existing unsupervised methods frequently struggle to distinguish near-normal anomalies from normal patterns and are vulnerable to noise contamination within normal samples. To address these limitations, we propose a novel framework that leverages active learning to iteratively enhance the performance of unsupervised models. Our framework’s core contributions are (1) a masked time-series reconstruction feedback strategy that forces the model to learn robust temporal dependencies, and (2) a minimax learning strategy that promotes robustness by differentially treating normal and abnormal samples. This process encourages the model to better capture the dynamics of subtle and noisy patterns. The proposed framework is evaluated across 28 test cases involving four multivariate time-series datasets and seven unsupervised backbone models. Experimental results demonstrate a 12.39% improvement in AUC compared to the original models, confirming that our method can be readily integrated into existing unsupervised reconstruction-based anomaly detection systems to significantly enhance their performance.
[AI-26] LLVM-Bench: Benchmarking and Advancing Large Language Models for LLVM Compiler Issue Resolution
Link: https://arxiv.org/abs/2607.00700
Authors: Zhao Tian,Yingquan Zhao,Chenyao Suo,Meng Wang,Junjie Chen
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments:
Abstract:LLVM is a widely used compiler infrastructure whose scale and complexity make issue resolution labor-intensive and challenging. Although large language models (LLMs) have recently achieved remarkable success in issue resolution, their effectiveness on the complex, system-level LLVM compiler remains largely unexplored. To address this gap, we introduce LLVM-Bench, the first large-scale benchmark for LLVM issue resolution, containing 423 real-world, validated tasks collected from the LLVM project. We further develop LLVM-Gym, a scalable evaluation platform that automates issue reproduction, patch application, compiler building, and test execution. Using LLVM-Bench and LLVM-Gym, we conduct a comprehensive study of four representative LLMs, six retrieval configurations, and three agents. Our results show that current LLM-based issue resolution techniques remain limited on LLVM-Bench, with patch invalidity and build failures as the dominant failure modes. We further reveal a strong complementarity among different LLMs and agents, motivating LLVM-Ens, a lightweight ensemble approach that expands the patch space by integrating the patches generated by diverse techniques, filters incorrect and redundant candidates, and identifies the most promising solution. Our results show that LLVM-Ens achieves a resolution rate of up to 21.99%, further improving LLVM issue resolution.
[AI-27] Self-GC: Self-Governing Context for Long-Horizon LLM Agents
Link: https://arxiv.org/abs/2607.00692
Authors: Xubin Hao,Hongjin Meng,Xin Yin,Jiawei Zhu,Chenpeng Cao
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Long-horizon LLM agents accumulate tool results, files, plans, and user constraints that are too structured to be treated as a disposable text suffix. Current systems mostly rely on in-run heuristics such as chronological pruning and tool-output masking, or on final self-summary near a context limit. Heuristics are cheap but blind to future dependencies; summaries preserve narrative state but often hide exact evidence, locators, and editable artifacts. We present Self-GC, where GC denotes self-governing context while deliberately echoing garbage collection: the system does not merely reclaim unused tokens, but governs the lifecycle of agent context objects. Self-GC turns user turns, tool spans, and skill state into indexed objects; asks a side-channel planner to propose fold, mask, and prune actions; and lets the harness enforce recoverable sidecars, safe commit boundaries, and cache-aware commit. On a 33-session Hard Set, Self-GC prunes 43.95% of prefix tokens while leaving 84.85% of future continuations unaffected, compared with no-impact rates of 54.55% to 69.70% for heuristic baselines. On a 332-session production-derived suite, three planner backbones reach no-impact rates of 91.27% to 94.58%, while baselines remain at 77.71% to 87.46%. In production, an online account-level split reduces daytime average input tokens by 10% to 15%, with peak reductions near 20%. These results point to context management as runtime lifecycle control over indexed, recoverable objects rather than post hoc text cleanup.
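The object lifecycle described in the abstract can be pictured with a small sketch. Everything here is hypothetical and chosen for illustration: the class names `ContextObject` and `Harness`, the exact meaning given to fold, mask, and prune, and the dictionary used as a "sidecar" are assumptions, not the paper's implementation. The planner that proposes actions is omitted; only the harness-side enforcement of recoverability is shown.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ContextObject:
    obj_id: str
    kind: str   # e.g. "user_turn", "tool_span", "skill_state" (assumed labels)
    text: str


@dataclass
class Harness:
    """Hypothetical harness: applies planner actions and keeps originals recoverable."""
    live: Dict[str, ContextObject] = field(default_factory=dict)
    sidecar: Dict[str, str] = field(default_factory=dict)

    def add(self, obj: ContextObject) -> None:
        self.live[obj.obj_id] = obj

    def apply(self, action: str, obj_id: str, summary: str = "") -> None:
        if action not in {"fold", "mask", "prune"}:
            raise ValueError(f"unknown action: {action}")
        obj = self.live[obj_id]
        # Store the original text once, so repeated actions cannot overwrite it.
        self.sidecar.setdefault(obj_id, obj.text)
        if action == "prune":      # assumed: drop the object from the live context
            del self.live[obj_id]
        elif action == "mask":     # assumed: replace the body with a short placeholder
            obj.text = f"[masked {obj.kind} {obj_id}]"
        else:                      # assumed: "fold" replaces the body with a summary
            obj.text = summary

    def recover(self, obj_id: str) -> str:
        return self.sidecar[obj_id]

    def render(self) -> str:
        return "\n".join(o.text for o in self.live.values())


harness = Harness()
harness.add(ContextObject("t1", "user_turn", "Refactor utils.py"))
harness.add(ContextObject("s1", "tool_span", "long grep output ..."))
harness.add(ContextObject("s2", "tool_span", "stale directory listing"))
harness.apply("mask", "s1")
harness.apply("prune", "s2")
print(harness.render())
```

The point of the sketch is the design choice the abstract emphasizes: actions operate on indexed objects rather than on raw text, and nothing is removed without a recoverable copy.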
[AI-28] Multi-Label Node Classification with Label Influence Propagation ICLR2025
Link: https://arxiv.org/abs/2607.00671
Authors: Yifei Sun,Zemin Liu,Bryan Hooi,Yang Yang,Rizal Fathony,Jia Chen,Bingsheng He
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to ICLR 2025
Abstract:Graphs are a complex and versatile data structure used across various domains, with possibly multi-label nodes playing a particularly crucial role. Examples include proteins in PPI networks with multiple functions and users in social or e-commerce networks exhibiting diverse interests. Tackling multi-label node classification (MLNC) on graphs has led to the development of various approaches. Some methods leverage graph neural networks (GNNs) to exploit label co-occurrence correlations, while others incorporate label embeddings to capture label proximity. However, these approaches fail to account for the intricate influences between labels in non-Euclidean graph data. To address this issue, we decompose the message passing process in GNNs into two operations: propagation and transformation. We then conduct a comprehensive analysis and quantification of the influence correlations between labels in each operation. Building on these insights, we propose a novel model, Label Influence Propagation (LIP). Specifically, we construct a label influence graph based on the integrated label correlations. Then, we propagate high-order influences through this graph, dynamically adjusting the learning process by amplifying labels with positive contributions and mitigating those with negative influence. Finally, our framework is evaluated on comprehensive benchmark datasets, consistently outperforming SOTA methods across various settings, demonstrating its effectiveness on MLNC tasks.
[AI-29] Coachable agents for interactive gameplay
Link: https://arxiv.org/abs/2607.00642
Authors: Roberto Capobianco(1),Harm van Seijen(2),Nolan D. Bard(2),Neil Burch(2),Fatima Davelouis(2),Josh Davidson(2),Alisa Devlic(1),Yunshu Du(2),Ishan Durugkar(2),Siddhant Gangapurwala(2),Daniel Hernandez(2),G. Zacharias Holland(2),Sahil Jain(2),Kenta Kawamoto(3),Raksha Kumaraswamy(2),Patrick MacAlpine(2),Dustin R. Morrill(2),Declan Oller(2),Francesco Riccio(1),Akanksha Saran(2),Craig Sherstan(3),Kaushik Subramanian(1),Thomas J. Walsh(2),Samuel Barrett(2),Kizza N. Frisbee(2),Mady Govil(2),Johannes Günther(2),Varun R. Kompella(2),James A. MacGlashan(2),Maxwell Svetlik(2),Michael D. Thomure(2),Jaden B. Travnik(2),Kevin Waugh(2),Elahe Aghapour(2),Florian Fuchs(1),Andreanne Lemay(2),Shruti Mishra(1),Takuma Seno(3),Peter Stone(2),Michael Spranger(3),Peter R. Wurman(2) ((1) Sony AI, Zurich, Switzerland, (2) Sony AI, North America, various locations, (3) Sony AI, Tokyo, Japan)
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Reinforcement learning has proven to be a valuable tool in the creation of advanced AI and robotic systems, contributing to everything from game playing to robotics to foundation models. Through trial-and-error, these AI systems typically learn one, near-optimal behavior to solve their tasks. However, there are many use cases in which one would like to assert some level of control, preferably in real time, over how the task is solved. We refer to these modifications of a core task as styles. We combine universal value function approximators (UVFAs) with carefully selected training scenarios, learning algorithms, and data augmentation to create a framework for coaching agents that exhibit styles in complex domains. We demonstrate the framework’s application in the AAA video games Horizon Forbidden West and Gran Turismo, and in an open-source humanoid test domain. Despite the different nature of the domains – car racing, stylized game combat, and humanoid walking – each agent shows strong coherence to the style requests while still satisfying the main task in its domain. Importantly, the techniques outlined in this paper allow an end user to choose the final behavior at run time, giving them flexible control over the final executed performance.
[AI-30] Loss Smoothing for Stable Adaptation Under Distribution Shift
Link: https://arxiv.org/abs/2607.00634
Authors: Darshan Patil,Ekaterina Lobacheva,Razvan Pascanu,Sarath Chandar
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:In settings such as fine-tuning and reinforcement learning, neural networks are often adapted under distribution shift. Standard adaptation methods typically optimize the target objective directly, inducing an abrupt change from the source training objective. This abrupt transition can distort learned representations, including features that may still be useful for the new task. We investigate whether a more gradual transition can improve adaptation. We propose loss smoothing, a simple approach that interpolates between the source and target training objectives at the start of adaptation. This smooth transition helps to preserve useful features from the source distribution while still enabling the model to specialize to the target distribution. Across controlled supervised shifts, pretrained vision adaptation, offline-to-online and online reinforcement learning, and language model fine-tuning, we find that loss smoothing consistently improves performance, suggesting that smoother objective transitions are a broadly useful tool for model adaptation.
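The interpolation idea in the abstract can be written down in a few lines. This is a minimal sketch under stated assumptions: the abstract says only that the source and target objectives are interpolated at the start of adaptation, so the linear ramp, the `warmup_steps` parameter, and both function names are illustrative choices rather than the authors' schedule.

```python
def smoothing_weight(step: int, warmup_steps: int) -> float:
    """Weight on the target loss; ramps linearly from 0 to 1 (assumed schedule)."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, max(0.0, step / warmup_steps))


def smoothed_loss(source_loss: float, target_loss: float,
                  step: int, warmup_steps: int) -> float:
    """Convex combination of the source and target objectives at a given step."""
    alpha = smoothing_weight(step, warmup_steps)
    return (1.0 - alpha) * source_loss + alpha * target_loss


# With source loss 2.0 and target loss 4.0 over a 10-step ramp:
for step in (0, 5, 10, 20):
    print(step, smoothed_loss(2.0, 4.0, step, warmup_steps=10))
```

At step 0 the objective equals the source loss, and after the ramp it equals the target loss, so the abrupt switch the abstract describes is replaced by a gradual one.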
[AI-31] AGI Maze as a Benchmark Framework for World-Modeling Agents
Link: https://arxiv.org/abs/2607.00627
Authors: Alexey Potapov
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) are powerful pattern-completion systems, but their default operating mode - predicting the next token from a static context - does not reliably produce persistent, manipulable representations of an external world. Many tasks that look like “reasoning” in text become substantially harder once the environment is partially observable, stateful, and requires memory and structured hypotheses about hidden state. AGI Maze is a lightweight framework for building such environments without requiring high-dimensional sensory inputs. It provides a family of grid-based maze tasks with a clean API and multiple difficulty regimes. The goal is to create benchmarks where agents must learn and use world state representations, not just infer a local rule over readily provided observations. We provide an initial evaluation of several vanilla LLMs on simple mazes showing that they fail to represent mazes internally at LLM inference time. We also introduce a baseline agent, which is allowed to use its message history as a working memory to construct descriptions of observations at agentic runtime. Although this can improve performance, it is still insufficient for an LLM agent to reliably solve even small mazes within a step budget that is more than enough for humans.
[AI-32] HARC: Coupling Harmfulness and Refusal Directions for Robust Safety Alignment
Link: https://arxiv.org/abs/2607.00572
Authors: Shei Pern Chua,Fangzhao Wu
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:
Abstract:Understanding how aligned LLMs internally represent safety is critical for diagnosing alignment vulnerabilities, as it explains why jailbreaks succeed and informs the design of robust alignment strategies. Prior work shows that aligned LLMs encode harmfulness and refusal as separable directions in the residual stream at prompt-side token positions. We show that jailbreaks succeed at prompt encoding by suppressing either the refusal or harmfulness direction before any token is generated, with distinct attack classes occupying separable regions of the harmfulness-refusal plane. Extending the analysis to response-token positions, we find that the model recognizes harmful content while it is generating that content, even when it failed to recognize the input as harmful at the prompt side. Motivated by our findings, we introduce HARC (Harmfulness-And-Refusal Coupling), a fine-tuning method that pairs the two directions across both prompt and response positions. Since the intervention is confined to the harmfulness-refusal subspace, it leaves the rest of the residual stream intact and does not degrade general capability or inflate over-refusal. Across extensive experiments, HARC achieves the strongest robustness-capability-usability trade-off among six baselines spanning the major training-time and inference-time safety methods. The harmfulness and refusal directions at prompt and response positions transfer across the five model families and two scales we tested without architecture-specific tuning.
[AI-33] A Methodology for Investigating AI Patterns Prevalence in Software Repositories
Link: https://arxiv.org/abs/2607.00558
Authors: Srinath Perera,Hasinthaka Piyumal,Frank Leymann,Rania Khalaf
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Published in PATTERNS 2026 : The Eighteenth International Conference on Pervasive Patterns and Applications
Abstract:As Artificial Intelligence (AI)-based applications take off, a clear understanding of AI patterns can uplift the quality of AI applications. Many AI patterns have been proposed in the literature; however, their prevalence in real-life code has not yet been validated. Understanding the actual use of those patterns in practice can clarify our understanding of both the significance of these patterns and their utility. In this paper, we present a methodology to a) identify relevant patterns by mining the literature and then to b) validate their presence and prevalence in actual code repositories using active learning. To that end, we identify 14 AI pattern classes by mining 44 published AI pattern-related sources. Then we use an active learning approach to determine the prevalence of the most common pattern class across 100 GitHub open AI repositories. Using prevalence estimation, we propose bounds on the accuracy of the occurrences. The model achieves 56% accuracy and 55% recall in an 8-way classification task, significantly outperforming the 11% random-chance baseline. Furthermore, the prevalence estimation offers usable bounds for analyzing pattern applications. This methodology provides a robust foundation to start understanding how AI patterns are used in practice, a field that currently lacks empirical data.
[AI-34] Group-Equivariant Poincaré Convolutional Networks
Link: https://arxiv.org/abs/2607.00556
Authors: Aiden Durrant,Rahul Baburajan,Georgios Leontidis
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 19 pages, 5 figures
Abstract:While recent advancements like the Poincaré ResNet have demonstrated the potential of learning visual representations directly in hyperbolic space, their optimisation remains hampered by the computationally intensive nature of Riemannian gradients and the strict boundaries of the manifold. Furthermore, standard hyperbolic networks treat spatial transformations of the same object as distinct hierarchical concepts, leading to redundant parameter usage and vanishing signals. We propose Equivariant Poincaré ResNets, combining hyperbolic geometry with discrete symmetry groups (C_4 and D_4). We identify critical roadblocks in applying Euclidean equivariance to hyperbolic space and propose geometrically safe tensor reshaping, left-regular permutations for hyperbolic group convolutions, and joint-orientation Poincaré Midpoint Batch normalisation. Empirically, embedding equivariance drastically reduces the optimisation space, accelerating convergence while respecting the boundary constraints of the Poincaré ball and preserving spatial-group equivariance.
[AI-35] Cross-Domain Generalization Failure in Lightweight Intrusion Detection Models for IIoT Networks
Link: https://arxiv.org/abs/2607.00553
Authors: MD Azizul Hakim,Md Shihab Uddin,Talha Ibne Anis
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Lightweight machine learning models are increasingly proposed for intrusion detection in Industrial Internet of Things (IIoT) networks due to their suitability for resource-constrained edge deployment. Most reported results evaluate these models only within their training network, leaving behavior on unseen networks unverified. This study trains four lightweight architectures on one IIoT dataset and evaluates them, without retraining, on two structurally distinct IIoT datasets using a feature representation restricted to attributes available across all three sources. Explainability analysis across two top-performing models shows both rely overwhelmingly on coarse port-category features; the most influential category occurs in source-domain attack traffic at 96 to 435 times the rate in the two target domains, indicating that coarsening port resolution relocates rather than removes a documented shortcut. Evaluation under naturally imbalanced class distributions reveals a further effect: the evaluation protocol used can reverse which target network appears to pose the greater generalization challenge. Adversarial robustness and recovery through limited target-domain exposure are also assessed; robustness to adversarial perturbation is unrelated to cross-network generalization, and recovery through adaptation varies considerably by architecture. These findings suggest deployment readiness should be assessed using cross-network evaluation under realistic class distributions, rather than within-domain accuracy alone.
[AI-36] Active-GRPO: Adaptive Imitation and Self-Improving Reasoning for Molecular Optimization
Link: https://arxiv.org/abs/2607.00531
Authors: Xuefeng Liu,Mingxuan Cao,Qinan Huang,Thomas Brettin,Rick Stevens,Le Cong
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM); Machine Learning (stat.ML)
Comments:
Abstract:Scientific reasoning is an increasingly important capability of large language models, yet improving the robustness and efficiency of training such reasoning remains a key open challenge. We study this problem in instruction-based molecular optimization, where answer-only supervised fine-tuning (SFT) collapses multi-step reasoning and reinforcement learning with verifiable rewards (RLVR) suffers from sparse feedback. Reference-guided Policy Optimization mitigates both by anchoring policy updates to dataset-provided references, but its effectiveness is tightly coupled to reference quality: weak or misaligned references impose a performance ceiling. To overcome this ceiling, we propose active reasoning, a paradigm in which the policy actively decides, on a per-instance basis, when to imitate a reference and when to reinforce its own discoveries, while continuously upgrading what it imitates. We instantiate this paradigm as Active Group Relative Policy Optimization (Active-GRPO), realized through two coupled mechanisms: active imitate-reinforce and active referencing. The former performs imitation learning when the reference still outperforms the policy’s own candidates, and shifts to self-improvement via reinforcement learning once the policy has generated molecules that surpass the reference. The latter continuously upgrades the reference itself by replacing it with the best policy-generated candidate discovered so far, progressively raising the imitation target and ensuring that reference guidance remains informative, rather than restrictive, throughout training. Across TOMG-Bench MOLOPT, Active-GRPO improves average SRxSim from 0.0959 for GRPO and 0.1665 for RePO to 0.1773 under matched three-seed evaluation, with statistically significant gains on LogP, MR, and QED.
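The two coupled mechanisms in the abstract amount to a per-instance decision rule, which can be sketched roughly as follows. The names `StepDecision` and `active_step`, the string representation of molecules, and the toy scoring function are all hypothetical; the actual policy-gradient and imitation losses are left out, and only the "imitate or reinforce, then upgrade the reference" logic is shown.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class StepDecision:
    mode: str               # "imitate" or "reinforce"
    reference: str          # reference to carry forward to the next step
    reference_score: float


def active_step(reference: str,
                candidates: Sequence[str],
                score: Callable[[str], float]) -> StepDecision:
    """Choose the training mode for one instance and upgrade the reference if beaten."""
    ref_score = score(reference)
    if not candidates:
        return StepDecision("imitate", reference, ref_score)
    best = max(candidates, key=score)
    best_score = score(best)
    if best_score > ref_score:
        # The policy has surpassed the reference: reinforce its own discovery
        # and promote the best candidate to be the new reference.
        return StepDecision("reinforce", best, best_score)
    # The reference still outperforms every candidate: keep imitating it.
    return StepDecision("imitate", reference, ref_score)


# Toy example with string length standing in for a molecular property score.
toy_score = lambda s: float(len(s))
print(active_step("CCO", ["CC", "CCCO"], toy_score))
print(active_step("CCO", ["C"], toy_score))
```

Because the returned reference is fed back in on the next step, the imitation target can only improve, which is the property the abstract credits with lifting the performance ceiling imposed by weak references.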
[AI-37] From Technical Metrics to User Perception: A User Study of a Multimodal Human-Robot Interaction System for Object Detection and Grasping
Link: https://arxiv.org/abs/2607.00530
Authors: Jian Song,Tian Zi,Shen Guanting
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages
Abstract:Improvements in the technical performance of human–robot interaction (HRI) systems do not automatically translate into differences that human users can detect during live interaction. This paper investigates whether a 15 percentage point gain in end-to-end task success (from 75% in a multimodal baseline system to 90% in an improved configuration identified through a prior ablation study) is sufficient to produce consistent and measurable differences in user perception. The baseline system combines Whisper for speech recognition, Florence-2 for open-vocabulary object detection, LLaMA 3.1 for action extraction, and an interval Type-2 fuzzy logic controller for motion execution. The improved configuration replaces the perception and language modules with Grounding DINO + SAM and Qwen 3.5 9B, respectively, while retaining the same controller. A within-subject user study with 24 participants compared both systems on the same tabletop object-grasping task. After interacting with each configuration, participants rated perceived speed, reliability, and overall competence and fluency on a 7-point Likert scale. Results show that 17 out of 24 participants (70.83%) preferred the improved system (exact binomial test, p = 0.043, h = 0.43), and all three perceptual constructs were rated significantly higher for the improved configuration after Holm correction, with large to very large effect sizes (p < 0.001). These findings confirm that the identified technical improvements are perceptible to users in direct interaction and underscore the importance of complementing benchmark evaluation with user-centred evidence when assessing robotic manipulation pipelines.
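The preference share and effect size quoted in the abstract can be reproduced from the counts alone. The sketch below assumes that h denotes Cohen's h measured against a 50% no-preference baseline; the abstract does not name the measure, so that reading is an assumption, and the function name is illustrative.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2.0 * math.asin(math.sqrt(p1)) - 2.0 * math.asin(math.sqrt(p2))


preferred, total = 17, 24
share = preferred / total          # observed preference share
h = cohens_h(share, 0.5)           # compared with a 50% no-preference baseline

print(f"share = {100 * share:.2f}%")
print(f"h = {h:.2f}")
```

Under that assumption the computed values agree with the reported 70.83% and h = 0.43.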
[AI-38] AI Native Games: A Survey and Roadmap
Link: https://arxiv.org/abs/2607.00527
Authors: Zhiyue Xu,Fandi Meng,Kaijie Xu,Clark Verbrugge,Simon Lucas,Jian Zhao
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Generative AI now enables games to produce dialogue, quests, characters, images, and worlds at runtime. Yet generation alone does not make a game AI-native, nor does it guarantee playability. This paper defines AI-native games by whether runtime generative AI is constitutive of the core loop: if the AI component were removed or trivially replaced, the central form of play would collapse or become fundamentally different. This counterfactual criterion separates AI-native games from AI-augmented games, boundary artifacts, chatbots, tavern-style role-play, procedural content generation, and AI-assisted production. Using this definition, we screen candidate artifacts and analyze 53 publicly available AI-native games and prototypes. We introduce a dual-axis G/N taxonomy: the G-axis captures player-facing game type, while the N-axis captures the dominant AI mechanic that makes generative AI indispensable to play. The corpus is concentrated around language-forward designs, especially narrative adventure, epistemic interaction, and generative narrative, while categories such as semantic adjudication, multi-agent simulation, generative construction, and relationship/companion play remain less represented. We argue that the central design problem is organizing semantic openness into stable gameplay. AI-native design depends on mechanical invariants: goals, rules, state, feedback, pacing, and player agency that make open-ended AI outputs interpretable and consequential. We conclude with a roadmap for controllable generation, AI-as-mechanic design, multimodal and multi-agent systems, inference economics, evaluation, safety, and regulation.
[AI-39] Beyond the Prompt: Jailbreaking Function-Calling LLMs via Simulated Moderation Traces
Link: https://arxiv.org/abs/2607.00481
Authors: Junlong Liu, Haobo Wang, Weiqi Luo, Xiaojun Jia
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Jailbreak attacks remain a critical threat to the safe deployment of large language models (LLMs). While prior work has primarily studied attacks and defenses at the prompt level, we show that this prompt-centric paradigm overlooks a structural vulnerability in stateful, function-calling environments. In such applications, developer-defined schemas, structured arguments, and untrusted tool outputs are interleaved into a single shared model context. This architecture expands the attack surface by blurring the boundary between trusted control logic and untrusted data, allowing adversarial intent to be distributed across a multi-turn execution path. We exploit this architectural flaw through SMT, a black-box attack framework based on Simulated Moderation Traces. Departing from purely prompt-based interactions, SMT constructs a multi-turn trajectory that simulates a legitimate moderation-auditing workflow. Within this trajectory, a fabricated moderation frame leverages red-team testing as a pretext to elicit harmful generations. The subsequent validation feedback treats safety refusals as execution failures, prompting refinements that gradually weaken the model’s safety constraints and ultimately trigger harmful outputs. Extensive empirical evaluations on prominent commercial LLMs from five different providers across two standardized safety benchmarks show that SMT consistently achieves the highest average attack success rate and HarmScore while requiring a near-minimal number of queries, substantially outperforming existing baselines. These findings demonstrate that prompt-level sanitization alone is fundamentally insufficient for defending tool-enabled LLM systems and highlight the urgent need for context-aware validation across schemas, arguments, tool outputs, and accumulated conversation state. The code is available at this https URL.
[AI-40] A Multi-Resolution Finite-Volume Inspired Deep Learning Framework for Spatiotemporal Dynamics Prediction
Link: https://arxiv.org/abs/2607.00460
Authors: Xin-Yang Liu, Xiantao Fan, Jian-Xun Wang
Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
Comments: 19 pages, 11 figures
Abstract:Predicting complex spatiotemporal dynamics in physical processes often demands computationally expensive numerical methods or data-driven neural networks that suffer from high training costs, error accumulation, and limited generalizability to unseen parameters. An effective approach to address these challenges is leveraging physics priors in training neural networks, known as physics-informed deep learning (PiDL). In this work, we introduce the Multi-Resolution Finite-Volume-inspired network, MuRFiV, designed to capitalize on the conservative property of finite volume on the global scale and the expressive power of deep learning on the local scale. We demonstrate the effectiveness of MuRFiV on several spatio-temporal systems governed by partial differential equations (PDEs), including Burgers’ equation, shallow water equations, and incompressible Navier-Stokes equations. By embedding PDE information into the deep learning architecture, MuRFiV achieves strong long-term prediction accuracy and remains stable over very long autoregressive rollouts, significantly outperforming data-driven neural network baselines. This result highlights the promise of combining multiresolution learning with finite-volume-inspired inductive bias for accurate and robust long-term prediction of complex dynamics.
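The "conservative property of finite volume" mentioned above can be illustrated on the 1D inviscid Burgers' equation. The sketch below is a classical finite-volume step with a Rusanov flux on a periodic grid, chosen here purely for illustration; it is not MuRFiV and contains no learned component. Because every interface flux is subtracted from one cell and added to its neighbour, the total of the cell averages is preserved up to rounding.

```python
from math import pi, sin


def rusanov_flux(u_left: float, u_right: float) -> float:
    """Rusanov (local Lax-Friedrichs) flux for Burgers' equation, f(u) = u^2 / 2."""
    speed = max(abs(u_left), abs(u_right))
    average_flux = 0.5 * (0.5 * u_left**2 + 0.5 * u_right**2)
    return average_flux - 0.5 * speed * (u_right - u_left)


def finite_volume_step(u: list, dt: float, dx: float) -> list:
    """One explicit update of the cell averages on a periodic 1D grid."""
    n = len(u)
    # flux[i] is the flux through the interface between cell i and cell i+1.
    flux = [rusanov_flux(u[i], u[(i + 1) % n]) for i in range(n)]
    # Each cell loses what leaves on its right and gains what enters on its left;
    # flux[-1] wraps around to the interface between the last and first cell.
    return [u[i] - dt / dx * (flux[i] - flux[i - 1]) for i in range(n)]


n = 64
dx = 1.0 / n
dt = 0.4 * dx / 1.5  # CFL-limited step, since |u| <= 1.5 for this initial state
u0 = [1.0 + 0.5 * sin(2 * pi * i / n) for i in range(n)]
u = u0
for _ in range(50):
    u = finite_volume_step(u, dt, dx)
print(f"mass before={sum(u0) * dx:.12f}  after={sum(u) * dx:.12f}")
```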
[AI-41] Multi-scale Mixture of World Models for Embodied Agents in Evolving Environments ECCV2026
Link: https://arxiv.org/abs/2607.00457
Authors: Jinwoo Jang, Daniel J. Rho, Sihyung Yoon, Hyunsuk Cho, Honguk Woo
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted at ECCV 2026. 15 pages
Abstract:Embodied agents operating in the real world require multi-scale reasoning and knowledge adaptation as conditions change. We identify two challenges in applying Mixture of Experts (MoE) to this setting: routing lacks an explicit notion of scale, preventing targeted updates at specific scales, and a uniform update policy cannot accommodate the different rates at which knowledge at each scale becomes outdated. We present MuSix, a framework that addresses both challenges through scale-aware world model mixture and evolution. A two-stage routing mechanism grounds scale selection in experiential distance, a measure of situational novelty inspired by Construal Level Theory: a meta-router first maps this quantity to a weight over continuous scale space, then per-scale base routers select world models within the identified scale. For adaptation, scale-dependent forgetting rates allow low-scale knowledge to refresh rapidly while high-scale abstractions persist, and gated inter-scale transfer maintains coherence across the hierarchy. Experiments on EmbodiedBench and HAZARD show that MuSix improves over state-of-the-art baselines on multi-scale reasoning and dynamic adaptation.
[AI-42] Gauging Measuring and Controlling Critic Complexity in Actor-Critic Reinforcement Learning
Link: https://arxiv.org/abs/2607.00452
Authors: Konstantin Garbers
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Actor-critic methods depend on learned critics, but critic quality is often evaluated only indirectly through return, temporal-difference error, or value loss. Critic complexity is introduced as an additional diagnostic and intervention dimension for actor-critic reinforcement learning. The analysis uses spectral effective-rank entropy, a rank-like summary of the singular-value distributions of critic weight matrices, to assess critic model complexity. Across TD3 and PPO experiments, critic complexity is tracked together with return and Monte Carlo value-estimation bias. The results show that critic complexity is measurable throughout training and is systematically associated with training behavior, while also making clear that the relationship is heterogeneous across algorithms, tasks, and hyperparameters. A direct complexity-control intervention is then evaluated by adding a spectral-entropy penalty to the critic loss. This intervention reliably changes the targeted spectral quantity, demonstrating that critic complexity can be controlled rather than only observed. Return effects are treated as task-dependent evidence rather than as a general performance claim, because overall complexity-control results vary.
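A rank-like summary of a singular-value distribution can be sketched as follows. The code assumes the common "effective rank" definition (the exponential of the Shannon entropy of the normalized singular values); the abstract does not give the exact formula, so this is an assumption rather than the paper's definition. The singular values are taken as input so the example stays dependency-free.

```python
from math import exp, log


def spectral_entropy(singular_values):
    """Shannon entropy of the singular values normalized to sum to one."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    return -sum(p * log(p) for p in probs)


def effective_rank(singular_values):
    """exp(entropy): equals k when k singular values are equal and the rest are zero."""
    return exp(spectral_entropy(singular_values))


# In practice the singular values would come from a critic weight matrix W,
# for example via numpy.linalg.svd(W, compute_uv=False).
print(effective_rank([1.0, 1.0, 1.0, 1.0]))  # flat spectrum
print(effective_rank([5.0, 0.1, 0.1, 0.1]))  # one dominant direction
```

A penalty of the kind described in the abstract would add a multiple of `spectral_entropy` to the critic loss; the sign and weight of that term are not stated in the abstract.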
[AI-43] Search-Based Spatiotemporal and Multi-Robot Motion Planning on Graphs of Space-Time Convex Sets
Link: https://arxiv.org/abs/2607.00444
Authors: Jingtao Tang, Zining Mao, Lufan Yang, Hang Ma
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Spatiotemporal motion planning, especially in multi-robot settings, requires robots to reason about collision-free regions that change over time, which is challenging in continuous spaces when feasible regions are transient and geometrically constrained. We present an algorithmic framework based on graphs of space-time convex sets (ST-GCSs), where collision-free regions are represented as convex sets in space-time and trajectories correspond to paths on the graph together with continuous motions within the selected sets. We formulate time-optimal planning on ST-GCSs as a graph-search problem over path-indexed states and develop a best-first search solver that evaluates partial paths via continuous trajectory optimization, guided by admissible heuristics and dominance checks. We further present an Exact Convex Decomposition (ECD) scheme to reserve trajectory occupancies in space-time, enabling unified handling of dynamic obstacles and multi-robot interactions. For multi-robot motion planning, we integrate ST-GCS planning and ECD into prioritized planning methods and introduce a windowed coordination scheme to improve efficiency. Extensive experiments on single-robot and multi-robot problems demonstrate substantial speedups over various planners while maintaining high solution quality, particularly in environments with narrow and transient feasible regions. Large-scale demonstrations further show that the proposed multi-robot motion planner can solve instances with up to 100 robots within only a few minutes. Project homepage: this https URL
[AI-44] Learning Gait-Aware Quadruped Locomotion with Temporal Logic Specifications
Link: https://arxiv.org/abs/2607.00442
Authors: Merve Atasever, Cagan Bakirci, Alfredo Reina Corona, Keyan Azbijari, Jyotirmoy V. Deshmukh
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Reinforcement learning (RL) for quadruped locomotion commonly depends on fixed, hand-crafted, and Markovian reward functions that limit the interpretability of learned policies and offer no explicit control over gait behaviors. We introduce a framework where distinct gaits are specified using parameterized constraints expressed in Signal Temporal Logic (STL). These include safety bounds, gait synchronization constraints, command tracking, and actuation bounds. From these specifications, we develop a reward shaping mechanism that provides learning agents a dense, continuous reward landscape that encodes desired behavior. We define parametric STL templates for three speed regimes (walking-trot, trot, bound), calibrate their parameters from reference rollouts, and compute rewards using smooth approximations of STL robustness over the rollouts. The generated rewards can be used to provide shaped gradients compatible with Proximal Policy Optimization (PPO). We instantiate the approach on Google’s Barkour quadruped robot in MuJoCo XLA (MJX). We use parallelization within the simulator to improve training speeds and use domain randomization to robustify learned policies. We show that compared to a baseline of hand-crafted rewards, the STL-shaped rewards yield tighter velocity tracking and more stable training. Videos can be found on our project website: this https URL.
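A smooth approximation of STL robustness can be sketched for a single "always" formula. The example below assumes a hypothetical command-tracking specification, G(|v_t - command| <= tolerance), and replaces the hard minimum over time with a log-sum-exp soft minimum; the specification, parameter values, and function names are illustrative and are not taken from the paper's templates.

```python
from math import exp, log


def smooth_min(values, beta=20.0):
    """Log-sum-exp lower bound on min(values); tighter as beta grows."""
    m = min(values)
    # Shifting by the true minimum keeps the exponentials well-scaled.
    return m - log(sum(exp(-beta * (v - m)) for v in values)) / beta


def always_tracking_robustness(velocities, command, tolerance, beta=20.0):
    """Smoothed robustness of G(|v_t - command| <= tolerance) over one rollout."""
    margins = [tolerance - abs(v - command) for v in velocities]
    return smooth_min(margins, beta)


rollout = [0.95, 1.02, 1.10, 0.98]  # made-up forward velocities in m/s
reward = always_tracking_robustness(rollout, command=1.0, tolerance=0.2)
print(f"smoothed robustness used as reward: {reward:.4f}")
```

A positive value means the rollout satisfies the specification with some margin; unlike the hard minimum, the soft minimum passes gradient to every time step, which is what makes it usable as a dense shaping signal.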
[AI-45] PHREEQC-MCQ-200: A Diagnostic Benchmark for Tool-Augmented Scientific Simulator Agents
Link: https://arxiv.org/abs/2607.00436
Authors: Ke Zhang, Sahchit Chundur, Mohammad Javad Qomi, Maziar Raissi
Subjects: Artificial Intelligence (cs.AI)
Comments: 30 pages, 2 figures
Abstract:Large language model agents are increasingly connected to scientific software, yet it remains unclear when tool access makes scientific computation more reliable rather than merely more complex. We introduce PHREEQC-MCQ-200, a benchmark for evaluating tool-augmented agents on deterministic aqueous-geochemistry simulations. The benchmark contains 200 multiple-choice questions derived from 21 validated PHREEQC scenarios, requiring agents to construct simulator inputs, execute PHREEQC, inspect structured outputs, and commit to final answers. Across multiple frontier and mid-tier model families, simulator access substantially improves aggregate accuracy, confirming that grounded execution is necessary for many scientific-computation tasks. However, the gains are not monotonic: tool-augmented agents also lose items they answered correctly without tools, revealing regressions that average accuracy alone hides. We further show that output-access protocol matters. A table-of-contents interface can reduce token cost while preserving or improving accuracy for stronger models, but it degrades performance for mid-tier models that cannot reliably navigate structured simulator outputs. PHREEQC-MCQ-200 therefore frames scientific tool use as an end-to-end diagnostic problem rather than a simple tool-calling capability. We argue that evaluations of scientific agents should report not only accuracy, but also item-level retention, output-access sensitivity, trajectory failures, and where the computation chain breaks.
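The difference between aggregate accuracy and item-level retention can be made concrete with a small sketch. The function name, the report fields, and the six-item toy data below are all made up for illustration; they are not from the benchmark.

```python
def retention_report(correct_without_tools, correct_with_tools):
    """Compare per-item correctness of the same questions with and without tools."""
    pairs = list(zip(correct_without_tools, correct_with_tools))
    gained = sum(1 for before, after in pairs if not before and after)
    lost = sum(1 for before, after in pairs if before and not after)
    kept = sum(1 for before, after in pairs if before and after)
    baseline_correct = kept + lost
    return {
        "accuracy_without": baseline_correct / len(pairs),
        "accuracy_with": (kept + gained) / len(pairs),
        "gained": gained,
        "lost": lost,  # regressions that the accuracy delta alone hides
        "retention": kept / baseline_correct if baseline_correct else None,
    }


# Toy data: one boolean per question, in the same order for both runs.
without_tools = [True, True, False, False, True, False]
with_tools = [True, False, True, True, True, False]
print(retention_report(without_tools, with_tools))
```

In this toy case accuracy rises from 3/6 to 4/6, yet one of the three originally correct items is lost, which is the kind of regression the abstract argues should be reported.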
[AI-46] Personalization as Inverse Planning: Learning Latent Design Intents for Agentic Slide Generation via Structural Denoising ECCV2026
Link: https://arxiv.org/abs/2607.00407
Authors: Tianci Liu, Zihan Dong, Linjun Zhang, Haoyu Wang, Jing Gao, Emre Kiciman, Ranveer Chandra, Wei-Ting Chen
Subjects: Artificial Intelligence (cs.AI)
Comments: ECCV 2026
Abstract:Slide design requires personalizing both deck themes and page layouts. Yet, current AI agent-based methods struggle with fine-grained, page-level design. Solely relying on prespecified templates or user verbose instructions, they fail to capture latent design intents, leaving Page-level Slide Personalization (PSP) unresolved. To close this gap, this work formulates PSP as an inverse planning problem. We propose to learn a design intent without assuming any knowledge of the specific executing tools (e.g., PowerPoint, Beamer) being used. However, relinquishing control over these tools makes the problem intractable to optimize end-to-end. To overcome this, we propose SPIRE, a principled framework to solve PSP approximately. By intentionally corrupting the visual structures of clean slides, SPIRE creates a verifiable task to denoise the corruption, whereby two agents learn to collaboratively refine executable designs via reinforcement learning (RL). We present a proof that structural denoising is a consistent surrogate for PSP, and that the multi-agent formulation strictly reduces policy gradient variance in RL. Extensive experiments demonstrate the superiority of SPIRE.
[AI-47] Learning Generalizable Skill Policy with Data-Efficient Unsupervised RL
Link: https://arxiv.org/abs/2607.00392
Authors: Jongchan Park, Seungjun Oh, Seungho Baek, Yusung Kim
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Unsupervised Reinforcement Learning (URL) aims to pre-train scalable, skill-conditioned policies without extrinsic rewards, serving as a foundation for downstream control tasks. Despite recent progress, we argue that current off-policy URL methods are limited by two critical, overlooked bottlenecks: (1) non-stationary skill semantics and (2) brittle generalization. To address these challenges, we propose GenDa (Generalizable Data-efficient Agent), a unified framework for robust unsupervised reinforcement learning. First, we introduce a skill relabeling mechanism to mitigate non-stationarity and significantly improve data efficiency for pre-training. Second, we propose a Complementary Information Bottleneck (CIB), encouraging the learned skill policy to focus on ego-centric features and become robust to distribution shifts for downstream tasks. Through various experiments, we demonstrate that GenDa significantly enhances the scalability of URL with superior generalizability and data efficiency. Our code and videos are available at this https URL.
[AI-48] Enhancing Flow Matching with A Unified Guidance Framework for Efficient and Robust Speech Synthesis INTERSPEECH2026
Link: https://arxiv.org/abs/2607.00363
Authors: Zuda Yu, Qianhui Xu, Ting Chen, Junhui Zhang, Tao Fu, Hongjiang Yu, Qiangqing Wang, Yang Song
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Accepted to INTERSPEECH 2026
Abstract:Flow Matching (FM) has emerged as a powerful paradigm for speech generation but remains constrained by high inference latency and timbre leakage. To address these bottlenecks, we propose a unified guidance framework that enhances generation efficiency and robustness through two complementary strategies. On the data front, we introduce Data-guidance via heterogeneous augmentation, encouraging the model to disentangle linguistic content from acoustic residue. In parallel, we propose an enhanced Model-guidance mechanism that synergizes trajectory rectification with a novel intrinsic guidance objective. This approach distills conditional knowledge into network weights and straightens inference trajectory path, thereby eliminating Classifier-Free Guidance (CFG) overhead. Experiments demonstrate that our framework accelerates inference by nearly three times while effectively improving speaker similarity compared to state-of-the-art baselines.
[AI-49] SoK: Attack and Defense Landscape of Mobile On-device AI Systems
Link: https://arxiv.org/abs/2607.00362
Authors: Yujin Huang, Xin Zheng, Xingliang Yuan, Kwok-Yan Lam
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Mobile on-device AI (MoAI) systems that integrate locally deployed AI models with conventional mobile software components are emerging as a key paradigm for delivering intelligent functionality directly on end-user devices. By moving inference from remote cloud services to the local mobile environment, such systems enable privacy-preserving, low-latency, and offline-capable AI functionality, yet introduce new security risks arising from the local storage of AI models. This paper presents the first comprehensive systematization of knowledge on MoAI security, covering security pillars, attack landscape, and defense landscape of MoAI systems. We further identify unresolved gaps in current attack and defense research and point to promising directions for future research in this emerging area. Our work establishes the first systematic framework for understanding the attack and defense landscapes of MoAI systems, serving as a foundation for building secure MoAI systems and advancing research in this critical domain. Companion resources are available at this https URL.
[AI-50] Managed Autonomy at Runtime: Gear-Based Safety and Governance for Single- and Multi-Agent Cyber-Physical Systems
Link: https://arxiv.org/abs/2607.00334
Authors: Srini Ramaswamy, Wang Miaosheng
Subjects: Artificial Intelligence (cs.AI)
Comments: to be submitted to a Journal, 18 pages
Abstract:Autonomous agents, whether LLM-driven software agents or robotic physical agents, face a common class of failure modes when operating without continuous human oversight: safety violations from unverified actions, behavioral instability from unconstrained loops, and continuity loss from unhandled error states. We develop \system, a discrete-time control system that combines five execution gears (\Gobs, \Gsug, \Gplan, \Gexec, \Gint) with utility-gated dispatch and event-driven fallback. For the single-agent case, we prove monotonic stability, execution safety, eventual stabilization, fallback completeness, and equivalence to a gear-constrained Markov decision process. For multi-agent cyber-physical systems (CPS), we apply the established \smart managed-autonomy lifecycle and map runtime evidence into its four governance states (\Stable/\Meta/\Assisted/\Regulated). Consensus gating, swarm-level Lyapunov analysis, per-agent gear authority, and rendezvous control provide distributed safety and stability guarantees, including zero collision under the stated assumptions. We evaluate the resulting runtime on a three-agent UR5 robotic assembly cell using fault magnitudes calibrated from the NIST Degradation Measurement of Robot Arm Position Accuracy dataset across 10,000 Monte Carlo episodes. It achieves a 99.6% anomaly detection rate versus 2.1% for the single-agent baseline, reduces detection latency by 3.5×, and supplies a formal physical-workspace safety certificate. The execution gears act as micro-level permissions beneath the \smart runtime governance states, separating action control from autonomy governance.
[AI-51] K-Inverse-RFM: A Modified RFM that Bridges the Gap to Neural Networks for Data-Corrupted Mathematical Tasks
Link: https://arxiv.org/abs/2607.00329
Authors: Gil Pasternak
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Master’s thesis, University of California San Diego, 2025
Abstract:Recursive Feature Machines (RFMs) are a class of kernel machines that utilize the Average Gradient Outer Product (AGOP) as a mechanism for feature learning. They have been shown to effectively replicate the learning dynamics and feature representations of Feedforward Neural Networks (FNNs) across various settings. However, despite comparable capacity for feature learning and the similarities in the features they acquire, RFMs exhibit significantly lower performance than neural networks in certain data-corrupted scenarios. In this work, we investigate these limitations in mathematical problems. As a solution, we introduce a remarkably effective transformation applied to the training labels which promotes learning in noisy, complexly represented, and class-imbalanced data. This simple yet powerful adjustment enables RFMs to close the performance gap with FNNs and, in some cases, even surpass them.
[AI-52] Entropy-Regularized Probabilistic Gates for Sparse Model Discovery in Scarce-Data Federated Learning
Link: https://arxiv.org/abs/2607.00275
Authors: Krishna Harsha Kovelakuntla Huthasana, Alireza Olama, Andreas Lundell
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (stat.ML)
Comments:
Abstract:Federated Learning (FL) is a distributed machine learning (ML) paradigm with collaboration among multiple clients without sharing data. FL is challenging under data heterogeneity and partial client participation. Learning sparse models is useful for communication and computational efficiency in FL, but it is especially difficult in the small-sample high-dimensional regime (d ≫ N) where optimization can yield parameter configurations that fail to generalize to unseen test data. While magnitude-based pruning doesn’t account for uncertainty exploration in the parameter space, a formulation with probabilistic gates and an L0 constraint allows sampling from competing sparse configurations during training. In this work, we study entropy regularization of gate distributions as a mechanism to maintain uncertainty in sparse federated optimization by preventing early commitment to sparse support. We examine its impact under data heterogeneity, client participation heterogeneity, and sparsity. Experiments on synthetic and real-world benchmarks show consistent improvements over federated iterative hard thresholding (Fed-IHT) and pruning after dense federated averaging (FedAvg) training, both in statistical performance on test data and in sparsity recovery accuracy.
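Entropy regularization of probabilistic gates can be sketched under a simple assumption: each parameter has an independent Bernoulli gate whose open probability is the sigmoid of a learnable logit. The abstract does not specify the gate parameterization, so this choice, the function names, and the objective form below are illustrative only.

```python
from math import exp, log


def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))


def gate_entropy(logits):
    """Total Bernoulli entropy of independent gates with P(open) = sigmoid(logit)."""
    total = 0.0
    for logit in logits:
        p = sigmoid(logit)
        for q in (p, 1.0 - p):
            if q > 0.0:
                total -= q * log(q)
    return total


def expected_l0(logits):
    """Expected number of open gates, the quantity an L0 constraint would bound."""
    return sum(sigmoid(logit) for logit in logits)


def regularized_objective(data_loss, logits, entropy_weight):
    """Subtracting the entropy rewards gates that have not yet committed."""
    return data_loss - entropy_weight * gate_entropy(logits)


undecided = [0.0, 0.0, 0.0]    # every gate is open with probability 0.5
committed = [6.0, -6.0, -6.0]  # support already fixed to the first parameter
for name, logits in (("undecided", undecided), ("committed", committed)):
    print(name, round(gate_entropy(logits), 4), round(expected_l0(logits), 4))
```

With the same data loss, the undecided configuration gets the lower objective value, which is the sense in which the entropy term discourages early commitment to a sparse support.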
[AI-53] Mnemosyne: Agentic Transaction Processing for Validating and Repairing AI-generated Workflows
Link: https://arxiv.org/abs/2607.00269
Authors: Edward Y. Chang, Longling Geng, Emily J. Chang
Subjects: Artificial Intelligence (cs.AI)
Comments: 36 pages, 24 tables, 6 figures
Abstract:LLMs, solvers, and agent teams increasingly generate workflow actions, repairs, and plans, but a generated action may be syntactically valid yet stale, infeasible, conflicting, or destructive of the evidence that triggered a repair. We introduce Agentic Transaction Processing (ATP), a transaction model that treats generated actions as untrusted proposals until they pass deterministic admission under a declared, executable constraint set C. The principle is two-sided: a proposal is not truth, and no proposal foresees every disruption: anything may propose, but only the runtime admits and commits, and when an unforeseen disruption strikes it repairs reactively within bounds rather than trusting a fresh proposal. Relative to C, committed-state correctness becomes independent of the competence, honesty, or learning of the proposing layer. We realize ATP in Mnemosyne, a runtime with an append-only transition log, effective-state projection, dependency-safe compensation, and active commitment records, and prove four safety properties relative to C (authority separation, serial-equivalent generative admission, evidence-preserving repair, and obligation containment) together with a bounded-reactive-repair guarantee for its localized repair protocol (LCRP). A reproducible artifact rejects the targeted violations across nine falsification tests while still admitting valid work, at under 6% projection-and-validation overhead, and bounded local repair edits an order of magnitude fewer operations than global recompute. Mnemosyne is open source: this https URL.
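The idea of treating generated actions as untrusted proposals that must pass deterministic admission can be sketched in a few lines. Everything below is hypothetical: the `Runtime` class, its methods, and the two toy constraints are invented for illustration and are not Mnemosyne's API. The sketch covers only admission against a declared constraint set and an append-only log, not compensation or repair.

```python
class Runtime:
    """Toy admission layer: anything may propose, only the runtime commits."""

    def __init__(self, constraints):
        self.constraints = constraints  # declared constraint set C: predicates on state
        self.log = []                   # append-only transition log

    def state(self):
        """Effective state, obtained by replaying the committed log in order."""
        state = {}
        for update in self.log:
            state.update(update)
        return state

    def admit(self, proposal):
        """Commit the proposal only if the resulting state satisfies every constraint."""
        candidate = {**self.state(), **proposal}
        if all(check(candidate) for check in self.constraints):
            self.log.append(dict(proposal))
            return True
        return False


constraints = [
    lambda s: s.get("budget", 0) >= 0,
    lambda s: s.get("workers", 0) <= 5,
]
runtime = Runtime(constraints)
print(runtime.admit({"budget": 10, "workers": 3}))  # satisfies both constraints
print(runtime.admit({"workers": 9}))                # violates the worker limit
print(runtime.state())
```

Because the check is a pure function of the committed log and the proposal, the committed state satisfies the constraints regardless of who or what produced the proposal.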
[AI-54] Validating Causal Abstraction Metrics on Simulated Complex Systems
Link: https://arxiv.org/abs/2607.00267
Authors: Maxime Méloux, Tiago Pimentel, François Portet, Maxime Peyrard
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:A central goal of science is to produce valid explanations of complex systems: high-level causal accounts that faithfully reflect the behavior of lower-level mechanisms. Yet no consensus exists on how to measure whether a proposed high-level explanation is actually valid. We introduce a benchmark of ten complex systems spanning both discrete and continuous state spaces, as well as static and dynamical regimes, each equipped with consensual ground-truth causal explanations and invalid contrastive conditions. Within a unified causal abstraction framework, we systematically evaluate over thirty candidate metrics drawn from observational, functional, information-theoretic, and causal families. Our results show that only the latter reliably discriminates valid from invalid abstractions, and only when incorporating faithfulness testing over unmapped variables. Building on these findings, we introduce the Causal Abstraction Error (CAE), a continuous validity metric with an explicit faithfulness test, which passes all discrimination tests across every system and can converge with as few as 30 sampled interventions. We offer it as a general-purpose metric for the discovery and validation of high-level explanations.
[AI-55] Seed2.0 Model Card: Towards Intelligence Frontier for Real-World Complexity
Link: https://arxiv.org/abs/2607.00248
Authors: Bytedance Seed
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:We present Seed2.0, a model series that takes a meaningful step toward solving complex, real-world tasks. Our approach begins with identifying users’ genuine needs and constructing a reliable, forward-looking evaluation system by selecting and abstracting benchmarks grounded in these needs and in realistic, complex scenarios. Guided by this evaluation system, Seed2.0 targets two persistent challenges, long-tail knowledge and complex instruction following, substantially improving the model’s reliability on intricate, long-horizon tasks. Beyond these, Seed2.0 delivers world-leading reasoning intelligence, visual understanding, and search capabilities that address the most common needs of a broad user base. Through extensive real-world use cases documented in this model card, we demonstrate that Seed2.0 begins to exhibit the ability to handle initial complex real-world tasks, delivering greater value to hundreds of millions of users.
[AI-56] Adaptive Perturbation Selection for Contrastive Audio Decoding
Link: https://arxiv.org/abs/2607.00247
Authors: Aaron Isidore Grace, Zhouyuan Huo, Weiran Wang
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: In submission
Abstract:Large audio-language models (LALMs) frequently hallucinate by overriding acoustic evidence with language priors. While contrastive decoding (CD) offers training-free mitigation, existing methods rely on blunt perturbations like masking or noise, leaving structured audio transformations unexplored. We explore this design space by evaluating a diverse library of targeted audio perturbations and adaptively selecting the optimal negative branch for each task and example. First, we improve upon earlier prompt engineering by showing that a simple binary yes/no constraint reduces the model’s tendency to falsely confirm absent audio features. Second, evaluating our library across temporal, spectral, frequency, and amplitude domains reveals that optimal transformations are highly task-dependent; for instance, reversing the audio array disrupts temporal coherence, raising accuracy on the temporal order task from 74.7% to 81.4%. Finally, we trained a light-weight perturbation selector on model hidden states to dynamically route negative branches, yielding an additional +4.3% gain on the existence task.
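Contrastive decoding with a perturbed negative branch can be sketched on a two-token yes/no vocabulary. The combination rule below, (1 + alpha) * original - alpha * perturbed, is one commonly used form and is an assumption here; the abstract does not state the exact rule, and the logit values are invented to show the mechanism.

```python
def reverse_audio(samples):
    """Time-reverse a waveform, the perturbation the abstract cites for temporal order."""
    return samples[::-1]


def contrastive_logits(original, perturbed, alpha=1.0):
    """Boost tokens the real audio supports, penalize tokens the perturbed audio also supports."""
    return [(1 + alpha) * o - alpha * p for o, p in zip(original, perturbed)]


tokens = ["yes", "no"]
# Made-up logits: with the real audio the model leans "yes"; with the perturbed
# audio it leans "yes" even more, suggesting a language prior rather than evidence.
logits_original = [2.0, 1.5]
logits_perturbed = [2.2, 0.5]

adjusted = contrastive_logits(logits_original, logits_perturbed, alpha=1.0)
answer = tokens[max(range(len(tokens)), key=lambda i: adjusted[i])]
print(adjusted, answer)
```

In a real pipeline the perturbed logits would come from a second forward pass on the transformed audio, for example the output of `reverse_audio`; the adaptive selector described in the abstract would choose which transformation to use.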
[AI-57] Play Like Champions: Counterfactual Feedback Generation in Latent Space
Link: https://arxiv.org/abs/2607.00190
Authors: Andrzej Białecki, Adam Mastalerz, Han Zhou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 19 pages total, 5 figures, 6 tables, 28 equations
Abstract:Recent advances in reinforcement learning have produced superhuman agents across a wide range of competitive games. As a byproduct, researchers have begun studying how these agents play, extracting behavioral representations, analyzing decision structure, and modeling the latent geometry of expert performance. However, this growing body of work has overwhelmingly focused on defeating human players rather than providing feedback, leaving a critical gap in creating model solutions to improve human players. Unlike chess and Go, where AI has become integral to player training, real-time strategy (RTS) games lack principled frameworks for translating expert knowledge into actionable feedback. We introduce Latent Maps of Performance, a framework for counterfactual path generation. We focus on StarCraft II data to model player improvement as an algorithmic recourse within a learned representation space. As inspiration for our work, we have looked at the championship model used in sports science. We trained a Guided Variational Autoencoder model on 23,305 professional tournament replays, enabling counterfactual traversal between losing and winning gameplay profiles. To fulfill our goal, we have devised and verified four traversal strategies on out-of-distribution (OOD) data randomly sampled from a dataset of amateur replays, namely linear interpolation, iterative optimal transport, density-regularized gradient ascent, and neural flow matching, each designed to generate multi-step improvement trajectories that remain grounded in observed expert behavior while moving a player’s profile toward winning configurations. Feedback is extracted at multiple granularities to support players at different stages of improvement. Finally, we conclude that there is a trade-off between the path-finding methods we employ and hope that future research will focus on developing model solutions for human improvement.
[AI-58] Scaling Up Thermodynamic AI Models
Link: https://arxiv.org/abs/2607.00170
Authors: Andrew G. Moore
Subjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI)
Comments:
Abstract:Thermodynamic computing devices based on the Ising model show great promise for low-power AI inference and edge computing, but scalable methods for training large models for such hardware remain limited. Prior theory shows that the time-averaged behavior of high-temperature Gibbs-sampled Ising systems can implement feed-forward neural inference. We turn this theoretical correspondence into a scalable and purely backpropagation-based algorithm for training deep convolutional networks for thermodynamic inference on Ising machine hardware. Our image classification models achieve accuracies of 94.9% on CIFAR-10 and 76.0% on CIFAR-100 under binary Gibbs sampling. We then develop and experimentally validate a mathematical theory relating inference cost to accuracy and controlling autocorrelation times. Subsequently, we calculate asymptotic results showing that inference cost is bounded by a well-controlled tradeoff with performance and exhibit algorithms for computing optimal inference schedules. Finally, we discuss implications for hardware development and the future of high-temperature thermodynamic AI models.
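The link between time-averaged Gibbs sampling and a neural activation can be seen in the smallest possible case: a single ±1 spin in a fixed field. For energy E = -field * spin at inverse temperature beta, the stationary mean of the spin is tanh(beta * field). The sketch below is a toy with one isolated spin and a fixed random seed; it does not reproduce the paper's coupled networks or its training algorithm.

```python
import random
from math import exp, tanh


def gibbs_time_average(field, beta, steps, seed=0):
    """Average of a single Gibbs-sampled +/-1 spin with energy E = -field * spin."""
    rng = random.Random(seed)
    # Conditional probability of spin = +1 under the Boltzmann distribution.
    p_up = 1.0 / (1.0 + exp(-2.0 * beta * field))
    total = 0
    for _ in range(steps):
        spin = 1 if rng.random() < p_up else -1
        total += spin
    return total / steps


field, beta = 0.8, 0.5
estimate = gibbs_time_average(field, beta, steps=20000)
print(f"time average={estimate:.3f}  tanh(beta * field)={tanh(beta * field):.3f}")
```

The sampling error of the time average shrinks roughly as one over the square root of the number of samples, which hints at why inference cost and accuracy trade off in this setting.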
[AI-59] A Contextual-Bandit Oversight Game with Two-Sided Informational Asymmetry
Link: https://arxiv.org/abs/2607.00155
Authors: Yunjin Tong
Subjects: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments:
Abstract:We study runtime human oversight of an AI agent when private information runs in both directions: the human privately knows her reward function, while the AI privately knows the quality of the action it proposes. This is the kind of asymmetry that arises naturally when an autonomous robot or software agent has inspected a situation its human supervisor cannot directly assess. Building on Cooperative Inverse Reinforcement Learning (CIRL) and the Oversight Game, we introduce a contextual-bandit team game with two-sided asymmetric information and a play/ask/trust/oversee interface. The bandit structure removes physical state transitions and thereby yields exact one-shot characterizations that would remain conjectural in the full POMDP setting, though the common belief remains a dynamically controlled state across rounds. We give two one-shot characterizations, a team optimum and a behaviorally natural myopic rule, whose gap is a slab of avoidable harm: a region in which the AI privately knows the proposed action is harmful and shutdown would help, yet a myopic human, trusting her prior, declines to oversee. We show this gap is the price of non-credible oversight communication, and give a partial analysis of how it resolves dynamically over repeated rounds through passive learning and active signaling with a one-period-lagged oversight response.
[AI-60] EVOTS: Evolutionary Transformer Search for Time Series Forecasting
链接: https://arxiv.org/abs/2607.00154
作者: AbdElRahman ElSaid,Damir Pulatov
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:Evolutionary neural architecture design for multivariate time-series forecasting remains underexplored, with most approaches relying on fixed Transformer architectures despite substantial variation across tasks and forecasting settings. This paper introduces an evolutionary neural architecture search framework for discovering task-adaptive Transformer-like models for time-series forecasting (EVOTS). Architectures are encoded using a modular genome representation that enables flexible composition of attention, feed-forward, and projection components, while a repair mechanism enforces structural validity throughout the evolutionary process. This formulation allows effective exploration of a diverse architecture space without relying on hand-crafted design rules. The proposed approach is evaluated on four benchmark datasets from the ETT family (ETTh1, ETTh2, ETTm1, and ETTm2) under multiple forecasting settings, including univariate-to-univariate, multivariate-to-univariate, and multivariate-to-multivariate prediction, with horizons of 96, 192, 336, and 720. In the multivariate-to-multivariate setting, the evolved architectures achieve competitive and, in several cases, improved mean squared error relative to a strong Transformer-based baseline. Additional analyses examine performance differences across forecasting settings and report wall-clock training time to provide a coarse indication of computational cost. Overall, the results demonstrate that evolutionary search can effectively discover flexible and high-performing Transformer-like architectures for multivariate time-series forecasting within practical runtime constraints.
[AI-61] RareDxR1: Autonomous Medical Reasoning for Rare Disease Diagnosis Beyond Human Annotation ICME
链接: https://arxiv.org/abs/2607.00147
作者: Deyang Jiang,Haoran Wu,Ziyi Wang,Yiming Rong,Yunlong Zhao,Ye Jin,Bo Xu
类目: Artificial Intelligence (cs.AI)
备注: 7 pages, 3 figures. Accepted to IEEE International Conference on Multimedia and Expo (ICME) 2026
Abstract:Rare disease differential diagnosis is a critical yet arduous clinical task, requiring physicians to identify precise phenotypes from complex, unstructured patient symptoms and execute intricate reasoning within a vast search space. However, existing AI approaches typically rely on pipeline-based phenotype extraction or retrieval-augmented generation, which suffer from critical information loss due to predefined ontologies, retrieval bottlenecks, and a lack of diagnostic logic. To address these challenges, we introduce RareDxR1, an end-to-end reasoning-centric large language model designed for open-domain rare disease diagnosis directly from unstructured clinical notes. We design a progressive end-to-end training framework by synergizing knowledge internalization with autonomous evolutionary learning, thereby bypassing reliance on structured phenotypes and closed-set decision-making. To overcome the limitations of RAG and phenotype restriction, we enabled the deep internalization of fragmented rare-disease knowledge directly into the model’s parameters. Moreover, to bridge the gap between model generation and expert reasoning, we propose Reflection-Enhanced Reasoning Sampling (RERS), a strategy that synthesizes expert-level diagnostic trajectories by learning from failures without human annotation. Additionally, we propose a dual-level curriculum reinforcement learning approach for gradually mastering rare disease diagnosis. Experimental results demonstrate that RareDxR1 achieves state-of-the-art accuracy across different benchmarks, marking a significant breakthrough in open-domain rare disease diagnosis. Our code and dataset will be publicly available.
[AI-62] Would You Marry Superintelligence?
链接: https://arxiv.org/abs/2607.00120
作者: Inyoung Cheong
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 21 pages, 1 table
Abstract:Emotional bonds between humans and AI companions are growing, and the question of whether a person may marry an AI system will soon move from speculative fiction into law. This chapter examines whether the autonomy-centered logic that has expanded marital choice among human beings can justify extending marital status to superintelligent companions. Following a scenario-envisioning exercise informed by anticipatory ethics, I argue that granting such status leads to socially unjust outcomes, even under the generous assumption of reliable superintelligence. Marriage as a socio-legal institution does more than ratify private agreement; it creates networks of mutual obligation, joins families, and makes each partner vulnerable to the other. A relationship sustained by corporate policy and continued payments is a subscription rather than a bond tested by time. Discussing wholesale marital status is therefore the wrong frame. Law should carve out targeted rights and protections for pressing needs arising from intimate human-AI relationships.
[AI-63] SNAP-FM: Sparse Nonlinear Accelerated Projection for Physics-Constrained Generative Modeling
链接: https://arxiv.org/abs/2607.00095
作者: Alaina Kolli,Theodoros Xenakis,Utkarsh Utkarsh,Pengfei Cai,Rafael Gomez-Bombarelli,Alan Edelman,Christopher Vincent Rackauckas
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:
Abstract:Generative models have emerged as scalable surrogates for physical simulation, yet they offer no guarantee that their outputs respect the conservation laws, boundary conditions, and nonlinear invariants that govern the underlying physics. Constrained sampling closes this gap, enforcing such constraints exactly at inference time without retraining, but at a computational cost: projection, correction, and trajectory-optimization steps are repeated during sampling, with these steps becoming expensive for nonlinear constraints. Standard ML frameworks exacerbate this: their dense tensor algebra and limited sparse solver composability obscure the structure that physical constraints naturally induce, making efficient batched nonlinear optimization difficult to realize in practice. We address this bottleneck by exploiting the structure that sample-wise batching and local PDE couplings induce in the projection subproblems – namely, block-sparse Jacobian and KKT systems – exposing this structure using this http URL and solving the resulting sparse nonlinear programs with this http URL and GPU sparse factorization. Applied to Physics-Constrained Flow Matching (PCFM), on PDE benchmarks with linear, nonlinear, one-dimensional, and two-dimensional constraints, this approach accelerates nonlinear constraint projection while maintaining constraint satisfaction. These results show that sparse GPU nonlinear optimization is a practical foundation for constrained generative sampling in scientific machine learning.
[AI-64] Optimal any-angle path planning in static and dynamic environments
链接: https://arxiv.org/abs/2607.00065
作者: Yiyuan Zou,Clark Borst
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 33 pages, 13 figures
Abstract:Any-angle path planning extends traditional graph-based path planning by allowing movement between any pair of vertices, rather than being restricted by predefined edges. It can find straighter and shorter paths in continuous space with graphs, making it particularly suitable for navigation in open areas such as airspaces, warehouses, and oceans. Many any-angle path-planning algorithms have been proposed, but only a few can guarantee optimal solutions, especially in the presence of dynamic obstacles. To address this challenge, this article focuses on optimal any-angle path planning on grids and introduces two general techniques that accelerate computation while preserving optimality in both static and dynamic environments: 1) elliptical forward expansion, which leverages ellipse-based neighborhoods to restrict the search space, and 2) field of view, which replaces traditional line-of-sight methods to speed up visibility checks. To integrate these two techniques, inverted and forward scanning are introduced. Inverted scanning establishes visual connections from open nodes, whereas forward scanning initiates scans from closed nodes. Building on the proposed techniques, Zeta* and Zeta*-SIPP are developed for static and dynamic environments respectively. Zeta*, when combined with forward scanning, is similar to the state-of-the-art algorithm Anya and attains comparable performance. Unlike Anya, Zeta* can be readily extended to other settings, such as dynamic environments (e.g., Zeta*-SIPP). Zeta*-SIPP, with either scanning method, is more than 20 times faster than the corresponding state-of-the-art optimal planner TO-AA-SIPP. Overall, this research identifies the key requirements for achieving optimal any-angle path planning and introduces a unified approach suitable for different environments.
[AI-65] Solution space path planning for supporting en-route air traffic control
链接: https://arxiv.org/abs/2607.00064
作者: Yiyuan Zou,Wenying Lyu,Clark Borst
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
备注: 37 pages, 16 figures
Abstract:As technology advances, many path-planning algorithms have been proposed for Air Traffic Management, yet their operational adoption in tactical control remains limited, revealing a misalignment between algorithmic design priorities and air traffic controllers’ needs. This underscores the need for decision-support solutions that are inherently interpretable, computationally efficient, and explicitly designed for human use. Focusing on this design challenge, this study develops a conflict-free path-planning algorithm for en-route Air Traffic Control (ATC) designed to be compatible with two guiding considerations: (1) the interpretability and flexibility offered by solution-space displays, which motivate constructing an algorithm that exposes all feasible safe actions and accommodates shifting optimization goals; and (2) the decision logic controllers naturally apply when enforcing operational constraints, such as separation standards, maneuverability limits, waypoint minimization, and routing practicality. Centered on these principles, the algorithm integrates three intent-based conflict detection methods – distance-based, time-interval-based, and zone-based – within a solution-space framework to identify conflict-free paths in computationally efficient ways. Additionally, vertex-based and edge-based search nodes are proposed for solution space path planning (SSPP), resulting in two variants – SSPPV and SSPPE, respectively, which are evaluated in terms of computational speed and solution quality. Empirical results show that SSPPV paired with zone-based conflict detection achieves the best performance, computing paths in 3.69 ms on average in operational-relevant scenarios based on the Delta sector of the Maastricht Upper Area Control Centre (MUAC) using a 5 nmi grid.
[AI-66] AlgoBench: Benchmarking Algorithmic Adaptation in Code Generation
链接: https://arxiv.org/abs/2607.00062
作者: Xinyuan Song,Zekun Cai,Liang Zhao
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注: Under Review
Abstract:High pass rates on established programming benchmarks such as HumanEval and LiveCodeBench do not always show whether a model can reason about algorithms. Many fixed benchmarks eventually become part of the public training ecosystem through released problem statements, editorials, and generated solutions, allowing later models to improve partly by exposure rather than by stronger algorithmic ability. We introduce ALGOBENCH, a framework that automatically builds novel algorithmic problems from known competitive-programming problems through structured constraint-shifting transformations. Each accepted ALGOBENCH variant is traceable to a source problem, but must make the original reference algorithm fail. Beyond pass@k, we introduce complexity-aware metrics – including OPTT, OPTS, TRAPRATE, GAPT, and CONSENS – to test whether a solution is not only functionally correct but also asymptotically suitable for the generated problem. Experiments across multiple LLMs and prompting strategies show that performance drops sharply on ALGOBENCH variants, retrieval can increase reuse of the old algorithm, and many correct-looking solutions fail to meet the required complexity. Error analysis shows that failures are mainly algorithmic rather than implementation-level, suggesting that ALGOBENCH evaluates adaptation beyond functional correctness.
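For readers unfamiliar with the pass@k baseline metric that the abstract above builds on, the following sketch shows the widely used unbiased estimator from the HumanEval evaluation protocol: with n sampled solutions of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). The abstract does not say which estimator ALGOBENCH uses, so treating it as this one is an assumption. The complexity-aware metrics (OPTT, OPTS, and so on) are not defined in the abstract and are not reproduced here.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct ones.

    Returns the probability that at least one of k samples, drawn without
    replacement from the n generated solutions, is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 10 samples, 3 of them correct.
    for k in (1, 5, 10):
        print(f"pass@{k} = {pass_at_k(10, 3, k):.4f}")
```

Because the estimator only checks functional correctness, a solution that passes the tests with an asymptotically unsuitable algorithm still counts as a success, which is the gap the complexity-aware metrics are meant to address.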
[AI-67] SWE-Router: Routing in Multi-turn Agentic Software Engineering Tasks ICML2026
链接: https://arxiv.org/abs/2607.00053
作者: Seongho Son,Sangwoong Yoon,Jiahua Tang,Shuhan Wang,Lorenz Wolf,Ilija Bogunovic
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: The 5th Deep Learning for Code Workshop, ICML 2026
Abstract:Large language models (LLMs) embedded in multi-turn agentic harnesses are reshaping software engineering (SWE), but routing every task to a frontier model is wasteful when many issues admit cheap fixes. Existing LLM routers operate on the task description alone, which inherits an information-theoretic Bayes-error floor in agentic settings: a similar issue can hide either a localized typo or a multi-module refactor, and the prompt does not separate the two. We introduce SWE-Router, a value-based temporal approach that lets a cheap model run for a few exploratory turns and reads the resulting partial trajectory before deciding whether to continue cheaply or to escalate to an expensive model. We provide a Bayes-optimality theorem showing that conditioning on the partial trajectory never harms routing and is strictly better whenever exploration is informative. Across the LLM pairs of weak and strong models spanning the contemporary cost–capability frontier, we show that SWE-Router greatly improves the cost efficiency of SWE tasks, while maintaining the majority of the performances of the stronger model. We additionally release a multi-LLM trajectory dataset which allows reproduction of our trajectory-level routing.
[AI-68] Prompting GPT-5 on Scrum Certification Questions: An Empirical Accuracy Study
链接: https://arxiv.org/abs/2607.00049
作者: Mirko Perkusich,Danyllo Albuquerque,João Paiva,Robson Vilar,Emanuel Dantas,Ademar França de Sousa Neto,Rohit Gheyi,Kyller Gorgônio,Angelo Perkusich
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) are increasingly used in Agile Software Development for documentation, coaching, and training. As practitioners adopt these tools to prepare for certifications such as Professional Scrum Master (PSM), a key question is whether LLMs can reliably reason about Scrum, a framework with normative, well-defined rules described in the Scrum Guide (2020). This paper examines how different prompt techniques affect the factual accuracy of LLM responses to Scrum certification-style questions. A dataset of 993 validated PSM-aligned questions was answered by GPT-5 using three techniques: zero-shot, chain-of-thought, and with-source citation. All prompts achieved certification-level accuracy above 85%, with the citation-based variant performing best (89.1%) and yielding the lowest error rate. Correct answers concentrated in well-defined topics, such as Definition of Done, Events, and Product Backlog Management, and in single-answer multiple-choice items, while multi-select questions and more interpretive areas, such as Scrum Team and Product Value, were less stable. Among questions where at least one prompt failed (16.2%), errors clustered into misalignment with the Scrum Guide (28%), content outside its scope (34%), and outdated or biased interpretations (38%). Overall, prompt techniques produced modest but consistent improvements, particularly in reducing misinterpretation and version drift, supporting more reliable use of LLMs in Agile learning and certification preparation.
[AI-69] Comparing Large Language Models on Scrum Certification-Style Questions: Accuracy Stability and Error Patterns
链接: https://arxiv.org/abs/2607.00048
作者: Robson Alves Vilar,Emanuel Dantas Filho,Ademar França de Sousa Neto,Mirko Perkusich,Danyllo Wagner Albuquerque,João Paiva,Kyller Gorgônio,Angelo Perkusich
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) are increasingly used in exam- and certification-style question answering tasks, where their ability to retrieve, interpret, and apply domain-specific knowledge can be systematically assessed. In Software Engineering, such settings are particularly relevant when questions depend on strict adherence to normative definitions, roles, artifacts, and rules. This paper evaluates the performance of three contemporary LLMs, GPT-5 mini, Gemini 3 Flash, and DeepSeek Chat 3.2, in answering 993 Scrum certification-style questions aligned with the Professional Scrum Master I (PSM I) assessment format. We evaluated the models under three prompting strategies (zero-shot, chain-of-thought, and source-grounded), with repeated executions to assess intra-model stability. We also analyzed performance across Scrum topics and question formats, complemented by a qualitative analysis of recurring error patterns in incorrect answers. Results revealed clear differences among models, with Gemini 3 Flash achieving the highest accuracy, followed by GPT-5 mini and DeepSeek Chat 3.2, while intra-model variability remained low across all conditions. By question format, the models achieved the highest accuracy on single-answer multiple-choice items, whereas multi-select and True/False questions were more error-prone. By topic, performance was more consistent in normatively explicit areas such as Artifacts, Empiricism, and Product Value, but more fragile in Scrum Values, Self-Managing Teams, and Stakeholders & Customers. The qualitative analysis showed that errors were systematic rather than random, involving overgeneralization, restrictive wording, compound distractors, and conflicts between common market interpretations and strict Scrum definitions.
[AI-70] ATM: CID-Brokered Pre-Write Admission for Multi-Agent Code Co-Synthesis
链接: https://arxiv.org/abs/2607.00041
作者: Eagl Huang
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 40 pages, 8 figures. Source code and supplementary artifact links are described in the manuscript appendix
Abstract:Multi-agent LLM systems can decompose software-engineering work into planning, generation, validation, and repair, but a narrower systems problem remains: before any governed shared mutation is applied, a system must decide which concurrently formed write intents may proceed in parallel, which require deterministic composition or serialization, and which must take a fail-closed path. We address this problem with the AI-Atomic-Framework (ATM), a specification-grounded governance substrate for software agents operating within a single governance domain. ATM binds task intent, repository scope, write admission, validation, and evidence obligations into one governance chain. A Content Identifier (CID) broker serves as the shared-mutation admission subsystem. Adapter-guided atomization maps write intents to semantic atoms and bounded regions; when persistent atom-map coverage is incomplete, virtual atoms provide temporary auditable governance units for conservative comparison and routing. Governed shared writes are ultimately applied by a neutral steward rather than directly by proposing agents. Evaluation combines controlled, field, adoption, and extension evidence, including a 12-scenario deterministic design matrix, three archived runner cases, ATM-AdmissionBench, three archived same-file boundary cases, a three-week external-adopter study, and an operational recovery-routing benchmark. The results support feasibility, auditability, and bounded recoverability within the observed single-domain settings, but do not claim broad comparative superiority or cross-clone governance.
[AI-71] Making Failure Safe: A Constrained Verifiable Agent Framework for Open-Web Data Collection
链接: https://arxiv.org/abs/2607.00035
作者: Bo Chen
类目: Artificial Intelligence (cs.AI)
备注: 15 pages, 1 figure
Abstract:LLMs and agents can generate web scrapers from natural-language requirements, but direct generation remains unreliable because of dependency errors, broken selectors, schema mismatches, and heterogeneous page structures. We propose a constrained, verifiable agent framework that shifts LLM output from free-form code to typed JSON collector configurations, combining a six-type collector taxonomy, template and utility-function constraints, static Airflow DAG execution, rule-based quality checking, and structured feedback correction. Experiments on 138 tasks show that the taxonomy supports description-based requirement typing, while confirming that stable instantiation requires completing source, field, and execution constraints beyond the initial description. On 80 independently source-verified tasks, the framework runs with zero execution-stage LLM tokens and the lowest average wall-clock time, trading moderate one-shot quality for a reusable, deterministic, and verifiable execution path suited to repeated scheduled collection. These results position the framework as a reusable, low-cost, and verifiable execution path for repeated open-web data collection.
[AI-72] The MMM Data Model – A Normative Specification for Knowledge Interoperability in a Decentralisable Knowledge Commons
链接: https://arxiv.org/abs/2607.00032
作者: Mathilde Noual
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Many information systems are built around documents: self-contained units optimised for print production and linear reading. While effective for large-scale dissemination, the document-centric organisation constrains how knowledge can be structured, updated, shared, and reused. Formal approaches address some of these limitations but struggle to achieve widespread contribution and adoption due to their prioritisation of formal structure over other system properties such as human usability and scope. AI systems are reshaping document production, but without providing a unified portable alternative to traditional documents for humans’ expression and exchange of knowledge. This paper presents MMM, a data model for knowledge documentation that emerged from the practical needs of interdisciplinary collaborative research, and positioned here within a comparative analysis of the design space of information systems. MMM combines a small set of normative constraints with the expressive freedom of free-text labels. It is designed for interoperability across disciplines, applications and deployments without requiring semantic convergence. A reference implementation and pilot deployment data demonstrate implementability and early usability.
[AI-73] FLYNN: Robust Neural Network for Robot Navigation using Fly Brain Topology
链接: https://arxiv.org/abs/2607.00025
作者: Benquan Wang,Jingdao Chen
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:While deep learning models achieve state-of-the-art performance in complex tasks, they remain brittle when faced with new environments or sensory deprivation. In contrast, biological systems exhibit remarkable tolerance to these challenges. We address this vulnerability by developing a recurrent neural network (RNN) whose architecture is directly derived from the synaptic-resolution brain connectome of the fruit fly Drosophila melanogaster. We demonstrate the feasibility of training the fly connectome neural network (FLYNN) to perform vision-based navigation in MuJoCo, achieving performance comparable to modern hand-crafted networks of similar parameter counts. Crucially, FLYNN exhibits superior resistance to out-of-distribution (OOD) data and tolerance to sensory loss without further training. It remained functional even under total vision loss while hand-crafted networks largely failed, even when specifically trained with camera dropout. Principal Component Analysis (PCA) of the internal state of FLYNN suggests that it exhibits a particularly high degree of representational modularity, which might be related to its robustness. Our work provides a new direction for designing resilient artificial agents following the topology of biological brains.
[AI-74] LLMs in the Real World: Evaluating “AI” in Emergency Contexts ACL
链接: https://arxiv.org/abs/2607.00019
作者: Sara Court,Lara Downing,Micha Elsner
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Accepted to ACL Findings 2026 in San Diego
Abstract:This paper offers a call to action. We urge our colleagues in the research community to play a greater role in the articulation of our findings to the public. To illustrate the stakes we present a case study on the initial stages of an LLM-based machine translation application’s deployment in a real-world context: a text-2-911 system advertising capabilities in 55 languages for use in emergencies in which it may be difficult to call operators directly. We identify a number of common misconceptions about technologies such as these, concluding with a set of concrete recommendations and best practices for stakeholders at every stage of the development and deployment pipeline. While the advancement of scientific research often lies in solving the “hard” problems, we argue it is often the “easy” ones – problems for which the latest technology is often unnecessary – that are most overlooked.
[AI-75] Bounded Morality: Defining the Space of Moral Computation AAAI-26
链接: https://arxiv.org/abs/2607.00002
作者: Max Kanwal,Caryn Tran,Patrick Mineault
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 24 pages, 2 figures; Proceedings of the AAAI-26 Workshop on Machine Ethics
Abstract:Moral cognition has traditionally been modeled as adherence to fixed ethical theories–deontology, consequentialism, virtue ethics–implemented as static rules or value functions. We propose Bounded Morality, a formal framework for analyzing the computational demands of moral problems faced by finite agents. Extending Herbert Simon’s notion of bounded rationality, we formalize moral situations along two orthogonal dimensions: moral breadth, the scope of entities treated as morally relevant, and moral depth, the inferential integration required to evaluate their interactions. Limited resources impose an unavoidable tradeoff between these dimensions, defining a feasible space of moral computation. Within this space, ethical theories correspond to locally efficient strategies adapted to different demand regimes rather than competing accounts of moral truth. The framework yields a formal notion of moral regret and moral progress under constraint, and implies that moral alignment in artificial systems depends on the scaling and allocation of moral reasoning capacity rather than on direct imitation of human judgments.
[AI-76] Constructive Alignment: Governing Preference Dynamics in Human-AI Interaction AAAI-26
链接: https://arxiv.org/abs/2607.00001
作者: Max Kanwal,Caryn Tran
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 23 pages, 1 figure; Proceedings of the AAAI-26 Workshop on Machine Ethics
Abstract:Most approaches to AI alignment treat human preferences as fixed targets to be inferred and optimized. This assumption conflicts with extensive empirical evidence showing that preferences are layered, dynamic, and constructed through interaction–particularly with adaptive technologies. As AI systems become more persistent, personalized, and socially embedded, they increasingly participate in shaping what people attend to, value, and endorse over time. We introduce Constructive Alignment, a paradigm that reframes alignment as a control problem over evolving human preference trajectories rather than static preference satisfaction. Drawing on behavioral economics, psychology, and constructivist social theory, we model preferences as layered state variables that evolve under interaction with AI systems. We formalize this view using a control-theoretic framework in which system actions and interaction design jointly influence both world states and human evaluative states. We argue that alignment is not primarily about controlling AI behavior, but about regulating how AI systems influence the evolution of human preferences–ensuring that value trajectories remain coherent, reflectively endorsed, epistemically grounded, bounded against manipulation, and empowering under uncertainty. Alignment thus becomes a problem of governing long-term value formation rather than simply satisfying static preferences.
[AI-77] Meta-Transfer Learning for mmWave Beam Alignment
链接: https://arxiv.org/abs/2607.00860
作者: Ahmet Nuri Cevik,Sinem Coleri
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:Millimeter-wave (mmWave) beam alignment plays a critical role in next-generation wireless systems, yet its efficient implementation remains challenging. Meta-learning and transfer learning have been explored to enable deep learning-based beam prediction models to rapidly adapt to unseen environments; however, existing meta-learning approaches adapt the entire network and are trained from random initialization, leading to a large number of updated parameters and a high meta-training cost, while transfer learning approaches restrict adaptation to part of the network but do not exploit episodic meta-learning, which explicitly trains the model over multiple tasks, to optimize the adaptation process itself. To overcome these limitations, we propose MTL-BA, a meta-transfer learning framework for beam alignment in millimeter-wave multiple-input single-output (MISO) systems that freezes a pre-trained convolutional backbone and meta-learns only lightweight Scale-and-Shift (SS) adapters together with a classifier head. Warm-starting from the pre-trained model and restricting adaptation to the SS adapters and classifier head reduce both the adaptation cost and the meta-training budget without sacrificing prediction performance. Simulation results on the DeepMIMO ray-tracing dataset show that MTL-BA matches the accuracy and spectral efficiency of full fine-tuning across various SNR levels despite updating approximately $17\times$ fewer parameters than both full fine-tuning and Model-Agnostic Meta-Learning (MAML), outperforms last-layer fine-tuning while updating a comparable number of parameters, and approaches MAML’s performance while requiring 60% fewer meta-training epochs.
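The idea of adapting a frozen backbone through Scale-and-Shift adapters, as described in the abstract above, can be sketched in a few lines. The abstract does not spell out the adapter parameterization, so this example assumes the common form of one learnable scale and one learnable shift per output channel, applied to a toy dense layer instead of a convolution. All names and numbers below are hypothetical.

```python
def frozen_layer(weights, bias, x):
    """Forward pass of a pre-trained dense layer whose parameters stay fixed."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def scale_and_shift(features, scale, shift):
    """Per-channel adapter (assumed form): out_j = scale_j * feature_j + shift_j."""
    return [s * f + t for f, s, t in zip(features, scale, shift)]


def count_params(*tensors):
    """Count scalar entries in nested lists of numbers."""
    total = 0
    for t in tensors:
        for item in t:
            total += len(item) if isinstance(item, list) else 1
    return total


# Hypothetical frozen backbone layer: 3 inputs, 2 output channels.
W = [[1.0, -2.0, 0.5], [0.0, 1.0, 1.0]]
b = [0.1, -0.1]

# Adapter starts at the identity, so the pre-trained behavior is preserved.
scale = [1.0, 1.0]
shift = [0.0, 0.0]

if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    features = frozen_layer(W, b, x)
    print("backbone output:", features)
    print("adapted output: ", scale_and_shift(features, scale, shift))
    print("frozen params:", count_params(W, b),
          "trainable adapter params:", count_params(scale, shift))
```

Only `scale` and `shift` (plus a classifier head, omitted here) would be updated during adaptation. In a convolutional layer the saving is larger than in this toy case, because the adapter grows with the number of channels while the kernel grows with channels times kernel size.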
[AI-78] Holographic Quantum Transformer: A Generalist Neuro-Symbolic Architecture for Solving Frustrated Systems via Generative Attention KDD’26
Link: https://arxiv.org/abs/2607.00398
Authors: Xingran Guo,Tiaojie Xiao,Jie Liu,Keqin Li
Categories: Strongly Correlated Electrons (cond-mat.str-el); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: 10 pages, accepted to KDD '26
Abstract:Simulating two-dimensional frustrated quantum matter is a grand challenge due to the sign problem and exponential Hilbert space complexity. In this work, we introduce the Holographic Quantum Transformer (HQT), a physics-inspired generative architecture that leverages global self-attention to resolve non-local entanglement patterns. We validate HQT on the square lattice $J_1$-$J_2$ Heisenberg model. On the heavily frustrated $8 \times 8$ lattice at the quantum critical point ($J_2 = 0.5$), HQT reaches a ground-state energy per site ($E/N$) of $\mathbf{-0.5001(1)}$, consistent with the expected finite-size scaling trend. Beyond numerical accuracy, HQT exhibits intrinsic physical awareness, autonomously recovering the underlying $J_2$ interaction geometry through interpretable attention maps. Our central contribution is "Holographic Transfer", a zero-shot size-extrapolation protocol with rapid alignment: a model trained on $8 \times 8$ systems is directly projected onto larger $10 \times 10$ lattices via continuous positional-embedding interpolation and head re-initialization, achieving high-fidelity initialization and rapid convergence. This zero-shot protocol yields an energy of $E/N = \mathbf{-0.49782(3)}$, statistically consistent with the variational state of the art while requiring no from-scratch training on the target lattice. Our results establish generative attention as a scalable paradigm for transferable quantum simulation.
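The size-extrapolation step can be pictured as resampling a learned grid of positional embeddings onto a larger lattice. The sketch below assumes plain bilinear interpolation with aligned corners, which is one common choice. The abstract only says the interpolation is continuous, so both the scheme and the function `resize_pos_embed` are illustrative assumptions.

```python
def resize_pos_embed(grid, new_size):
    """Bilinearly resample a square grid of embedding vectors.

    `grid[i][j]` is the embedding (a list of floats) at lattice site (i, j).
    Corners of the old and new grids are aligned. Requires len(grid) >= 2
    and new_size >= 2.
    """
    old = len(grid)
    dim = len(grid[0][0])
    step = (old - 1) / (new_size - 1)
    out = []
    for i in range(new_size):
        y = i * step
        y0 = min(int(y), old - 2)  # top neighbour row
        ty = y - y0                # vertical blend weight
        row = []
        for j in range(new_size):
            x = j * step
            x0 = min(int(x), old - 2)  # left neighbour column
            tx = x - x0                # horizontal blend weight
            vec = []
            for k in range(dim):
                top = (1 - tx) * grid[y0][x0][k] + tx * grid[y0][x0 + 1][k]
                bottom = (1 - tx) * grid[y0 + 1][x0][k] + tx * grid[y0 + 1][x0 + 1][k]
                vec.append((1 - ty) * top + ty * bottom)
            row.append(vec)
        out.append(row)
    return out


# Toy 8x8 grid whose embedding at (i, j) is simply [i, j].
small = [[[float(i), float(j)] for j in range(8)] for i in range(8)]
large = resize_pos_embed(small, 10)
```

Because bilinear interpolation reproduces linear functions exactly, the toy embedding at new site (i, j) comes out as [7i/9, 7j/9], which gives an easy sanity check before applying the same routine to learned embeddings.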
[AI-79] When AI meets quantum information: A comprehensive review
Link: https://arxiv.org/abs/2607.00365
Authors: Min Chen,Yu Gan,Xin Jin,Yuqing Li,Junqi Wang,Zeguan Wu,Yunfei Wang,Bingzhi Zhang,Priyam Srivastava,Tianlong Chen,Ankit Kulshrestha,Yuan Liu,Juan José Mendoza-Arenas,Kaushik P. Seshadreesan,Sarvagya Upadhyay,Xueyue Zhang,Quntao Zhuang,Junyu Liu
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 62 pages, 4 figures
Abstract:Artificial intelligence (AI) and quantum information (QI) are rapidly co-evolving. AI is becoming a practical tool for learning, designing, controlling, and verifying quantum systems, while QI offers new computational models, representational structures, and learning-theoretic questions for AI. This survey reviews the interface from both directions. In the AI for QI direction, we organize recent progress around the central tasks of extracting information from limited measurements, training and discovering quantum algorithms, stabilizing noisy hardware, automating experimental and programming workflows, and extending learning-based methods to sensing and networking. In the QI for AI direction, we examine how quantum computation and quantum-inspired structures affect learning through algorithmic speedups, expressivity, trainability, generalization, neural-network design, and tensor-network representations. We close by identifying cross-cutting challenges in reproducibility, scalability, hardware realism, and co-design, arguing that progress will depend on tighter integration of theory, experiment, and hybrid quantum–classical systems.
[AI-80] A Category Theory Account of AI Identity
Link: https://arxiv.org/abs/2607.00220
Authors: Andrea Ferrario
Categories: Category Theory (math.CT); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 25 pages, 4 figures
Abstract:Artificial intelligence (AI) systems are routinely modified after deployment through retraining and changes in their environments. These transformations raise a metaphysical question: under what conditions does an AI system remain the same system over time or across deployments? Earlier work formulates synchronic and diachronic identity propositionally, by relating identity within a fixed AI system type to equality of trustworthiness levels. Such criteria specify when identity statements are true, but leave implicit the structure of the states compared, the transformations connecting them, and the temporal organization of persistence. We develop a category-theoretic formalization of AI identity. An AI system type is specified by a datum consisting of a techno-function, a trustworthiness profile, and a trustworthiness-level function. Profile-relative states are connected by admissible lifecycle paths, which are restricted to trustworthiness-level-preserving transformations and quotiented to obtain a reachability category. Temporally admissible functors represent AI system histories, while time-synchronous natural transformations compare realized histories. The formalization yields two categorical interpretations of the earlier AI identity criteria. A weak interpretation recovers identity as equality of trustworthiness level. A strong interpretation requires mutual trustworthiness-preserving reachability, expressed through state isomorphism or natural isomorphism of realized histories. Category theory therefore replaces a single AI identity relation with a structured hierarchy of diachronic and synchronic criteria. The resulting framework identifies identity-related preconditions for transferring responsible-AI claims, evidence, and governance procedures across versions or deployments, without treating categorical identity as sufficient by itself for such transfer. 
[AI-81] Spectral Geometry and Bosonic-Bloch Probes: Explorations in Quantum Learning
Link: https://arxiv.org/abs/2607.00063
Authors: Santanu Ganguly,Xing Liang,Dimitrios Makris
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper studies how spectral geometry emerges in quantum learning models and how it can be diagnosed with physically grounded probes. In graph-regularized quantum networks, training reorganizes the output similarity graph, increases the effective spectral dimension ($\Delta S = +0.23$), and reshapes the Laplacian spectrum. Edge-resolved two-boson interference directly probes this restructuring: the bosonic enhancement $\Delta P_{uv}$ correlates with the Fiedler edge split $|\Delta v_2|$ ($r = -0.50$), linking learned spectral partitions to interference signatures. A phase diagram shows a nonmonotonic dependence of performance on coupling strength $\gamma$ and noise $\delta$, with graph regularization improving fidelity only in a restricted regime; hardware experiments confirm the predicted interference behavior within shot-noise uncertainty. We also analyze a hybrid quantum autoencoder and introduce Bloch-space drift as a geometric diagnostic of its latent representation. With an unsupervised benign-data threshold, the model achieves high ranking performance (ROC-AUC about 0.99) and negligible false-negative rates. Absolute Bloch drift strongly discriminates anomalies (ROC-AUC at least about 0.9), while consecutive drift is near random (ROC-AUC about 0.5), showing that detection arises from persistent state-space displacement rather than local fluctuations. Through the geometry of reduced single-qubit states and associated quantum Fisher information, these results show that learning-induced spectral organization appears as measurable quantum-state structure, establishing a unified spectral-geometric framework for diagnosing quantum learning systems with bosonic and Bloch probes.
Machine Learning
[LG-0] TiRex-2: Generalizing TiRex to Multivariate Data and Streaming
Link: https://arxiv.org/abs/2607.01204
Authors: Patrick Podest,Marco Pichler,Elias Bürger,Levente Zólyomi,Bernhard Voggenberger,Wilhelm Berghammer,Daniel Klotz,Sebastian Böck,Günter Klambauer,Sepp Hochreiter
Categories: Machine Learning (cs.LG)
Comments:
Abstract:We introduce TiRex-2, a recurrent xLSTM-based time series foundation model that generalizes the univariate TiRex to multivariate forecasting with both past and future covariates. Real-world forecasting is inherently sequential: observations arrive continuously, variables evolve jointly, and a subset of covariates is known ahead of time. Existing Transformer-based time series foundation models capture cross-variate dependencies but incur quadratic complexity in context length and require full-history recomputation as new observations arrive. TiRex-2 addresses these limitations through a memory-centric recurrent design that operates at constant per-patch cost under streaming. The model combines a bidirectional time mixer with an asymmetric grouped-attention variate mixer, enabling the integration of future-known covariates while preserving strict causality over target variables. To our knowledge, this is the first time series foundation model that achieves this combination of properties. To support scalable multivariate pretraining, we propose a synthetic coupling pipeline that composes diverse multivariate samples on the fly from large univariate corpora. Empirically, TiRex-2 achieves state-of-the-art zero-shot performance on GIFT-Eval and fev-bench, remains stable when streamed to arbitrary context lengths, and maintains constant inference cost per patch. The model uses 38.4M active parameters in univariate mode, with an additional 44.1M parameters activated for multivariate forecasting.
[LG-1] Quantum vs. Classical Machine Learning: A Unified Empirical Comparison
Link: https://arxiv.org/abs/2607.01197
Authors: Chuanming Yu,Jiaming Liu,Zihao Ge,Xiongfei Wu,Lulu Zhu,Pengzhan Zhao,Jianjun Zhao
Categories: Machine Learning (cs.LG)
Comments: This paper has been accepted for a poster presentation at the 5th CCF Quantum Computation Conference (CQCC 2026) on August 3, 2026
Abstract:Quantum computing has emerged as a promising computational paradigm for machine learning (ML), with the potential to offer computational advantages over classical approaches. At this stage, the evidence supporting the performance and advantages of quantum machine learning (QML) models relative to classical models is this http URL address this gap, this paper presents an empirical study on the performance of QML models and their classical counterparts. We compare seven model pairs spanning supervised learning and reinforcement learning. Our results indicate that the evaluated quantum machine learning models do not yet surpass the classical baselines in overall prediction performance, policy stability, or training time. Nevertheless, QML remains a promising approach for filtering noise and controlling false positives. Our research findings summarize the challenges facing quantum machine learning across hardware environments, training efficiency, and convergence stability, providing a foundation for research into the robustness and parameter optimization of QML. This work is publicly available at this https URL.
[LG-2] Neural Certificate Pricing for Combinatorial Optimization Problems
Link: https://arxiv.org/abs/2607.01185
Authors: Jingyi Chen,Xinyuan Zhang,Xinwu Qian
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Combinatorial optimization (CO) problems are difficult because certifiable discrete structure induces exponential search. Certifying optimality requires searching over a set of exponentially many candidates; however, the structural feasibility of a path, packing, or cover can be verified in polynomial time once it is supplied. In this study, we introduce Neural Certificate Pricing (NCP), which exploits this asymmetry under an unsupervised learning framework. A neural network is trained to predict certificate-level dual prices, while a structured recovery layer constructs the induced primal marginal. NCP can be viewed as amortized separation: instead of enumerating violated inequalities, it learns the residual prices through which their aggregate effect enters recovery. When the certificate-consistency condition holds, the recovered marginal is globally feasible, and a local theory shows that first-order errors in the predicted price induce only second-order loss in objective value. Across three classes of CO problems, NCP either outperforms state-of-the-art neural baselines by large margins or matches them at a fraction of the computation time, and shows stronger out-of-distribution generalization.
[LG-3] Decision-Aware Training for Sample-Based Generative Models
Link: https://arxiv.org/abs/2607.01171
Authors: Kornelius Raeth,Nicole Ludwig
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Sample-based generative models are increasingly used for probabilistic forecasting in high-stakes decision settings, yet their training objectives are blind to the decision maker’s cost structure. These models are commonly trained with strictly proper scoring rules, such as the energy score, which allocate their training signal in proportion to data density, with no awareness of where forecast errors are most costly for downstream decisions. We therefore propose decision-aware training for sample-based generative models, augmenting the energy score objective with a differentiable decision loss that directly penalises the cost incurred by acting on the model’s forecast. This combined loss is theoretically grounded, as the decision loss is itself a proper scoring rule. We validate our method on one synthetic and two real-world tasks, showing targeted improvements in cost-sensitive regions while retaining full probabilistic forecasts.
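The combined objective can be sketched for a scalar target as follows. The energy score here is the standard plug-in estimator computed from forecast samples. The decision term is a hypothetical asymmetric cost for acting on the sample mean, chosen only for illustration, and the weight `lam` is likewise an assumption; the abstract does not state the authors' actual decision losses or weighting.

```python
def energy_score(samples, y):
    """Plug-in energy score for a scalar target (the CRPS in one dimension).

    First term rewards samples close to y; second term rewards spread,
    which keeps the score proper.
    """
    m = len(samples)
    fit = sum(abs(s - y) for s in samples) / m
    spread = sum(abs(a - b) for a in samples for b in samples) / (2 * m * m)
    return fit - spread


def decision_cost(samples, y, c_under=3.0, c_over=1.0):
    """Hypothetical cost of acting on the sample mean.

    Under-forecasting is assumed to cost three times as much as
    over-forecasting; the function is piecewise linear in the action.
    """
    action = sum(samples) / len(samples)
    return c_under * max(y - action, 0.0) + c_over * max(action - y, 0.0)


def decision_aware_loss(samples, y, lam=0.5):
    """Energy score augmented with a weighted decision cost."""
    return energy_score(samples, y) + lam * decision_cost(samples, y)


centred = decision_aware_loss([1.0, 3.0], 2.0)   # forecast centred on the outcome
too_low = decision_aware_loss([1.0, 3.0], 4.0)   # forecast under-predicts
```

With these toy numbers the centred forecast pays only the energy score, while the under-predicting forecast pays an extra penalty proportional to the assumed shortfall cost.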
[LG-4] Efficient Compression of Structured and Unstructured Volumes via Learned 3D Gaussian Representation
Link: https://arxiv.org/abs/2607.01164
Authors: Landon Dyken,Sharmistha Chakrabarti,Nathan Debardeleben,Steve Petruzza,Qi Wu,Will Usher,Sidharth Kumar
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Recent work has shown that implicit neural representations (INRs) can be trained to effectively compress structured and unstructured volume data, allowing for direct data querying with a reduced memory footprint. However, as existing INRs for unstructured volumes do not encode geometry, they require partial mesh storage for later sampling, limiting achievable compression. At the same time, novel view synthesis methods have shown that explicit collections of 3D Gaussians can be used to accurately visualize volume data. In this work, we introduce an explicit model for volume data compression based on 3D Gaussian primitives. We reinterpret collections of 3D Gaussians as an explicit representation of a scalar field and use a sampling strategy that reconstructs scalar values at spatial locations through weighted aggregation of intersecting Gaussians. We develop optimized CUDA-accelerated pipelines for structured and unstructured model sampling, loss functions that encourage accurate domain encoding by our models, and a novel sampling-error based densification strategy. Our explicit formulation naturally encodes domain geometry, eliminating the need for mesh storage in unstructured volumes and introducing significantly higher compression opportunities. Compared to existing INRs, we demonstrate that our explicit model achieves competitive reconstruction quality with significant training speedups on structured volumes, while markedly outperforming in all metrics on unstructured volumes.
[LG-5] A Lightweight Self-Supervised Learning Framework for Multivariate Time Series using Hierarchical-JEPA on ECG Data
Link: https://arxiv.org/abs/2607.01145
Authors: Siwon Kim
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 25 pages, 7 figures. Code will be made publicly available soon
Abstract:Data analysis in the medical domain often encounters scenarios involving a limited target dataset and a large, unannotated dataset with a general distribution. Under such circumstances, self-supervised learning (SSL) methods are highly effective for utilizing large datasets, making them a popular choice for electrocardiogram (ECG) analysis. This work presents the Event Reconstruction Joint-Embedding Predictive Architecture (ER-JEPA), a lightweight SSL framework for multivariate time series, whose name and two-fold hierarchical structure are inspired by the diagnostic approach of cardiologists. At its core, ER-JEPA features: (1) a two-stage structure that constructs representations for each time interval and subsequently processes these representations as a univariate time series, (2) the hierarchical integration of two Joint-Embedding Predictive Architectures (JEPAs), and (3) a Vision Transformer (ViT) backbone. The structural concatenation of two JEPAs categorizes the model as a Hierarchical JEPA (H-JEPA), designed to encode multiple levels of abstract representations for enhanced prediction on complex tasks. This study reports a successful application of H-JEPA to 12-lead ECG data as a multivariate time series alongside an analysis of the sensitivity of hierarchical representation during the pretraining stage. Pretrained on approximately 180,000 10-second recordings, the model achieves state-of-the-art downstream performance on the ST-MEM benchmark, with rapid computation and minimal resource usage.
[LG-6] GAIA: Geometry-Adaptive Operator Learning for Forward and Inverse Problems
Link: https://arxiv.org/abs/2607.01128
Authors: Meenakshi Krishnan,Pranav Pulijala,Ke Chen,Haizhao Yang,Ramani Duraiswami
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:
Abstract:Operator learning for partial differential equations (PDEs) on arbitrary geometries builds fast neural surrogates for large-scale simulation. Although recent geometry-adaptive neural operators have made substantial progress, they are mainly designed for forward problems in which inputs and outputs share the same spatial domain. This limits their applicability for boundary value problems (BVPs) and inverse problems, where inputs and outputs may live on different domains. We introduce the Geometry-Adaptive Integral Autoencoder (GAIA), an operator learning model that encodes the domain boundary and the interior field distribution into geometry tokens, and conditions integral transform layers on these tokens via cross-attention, allowing the kernel to adapt locally to geometric features. This yields a single architecture for forward (including BVPs) and inverse problems on arbitrary domains in one pass, without retraining, iterative optimization, or graph construction. We evaluate GAIA on seven 2D and 3D benchmarks, four of which are new or substantially extended benchmarks for inverse problems and BVP: electrical impedance tomography, optical tomography, 3D Darcy flow on varying geometries, and a modified setting of Poisson BVP on mechanical components benchmark (MCB). GAIA sets new state-of-the-art results on every inverse and BVP task, reducing median relative L^2 error by 64% on airfoil flow reconstruction and 27% on EIT relative to the next best amortized method, and outperforming all baselines on every shape category of MCB. On other forward problems, GAIA is competitive with specialized solvers while maintaining stable accuracy across point resolutions on which transformer-based baselines degrade.
[LG-7] ZO-Act: Efficient Zeroth-Order Fine-Tuning via One-Shot Activation-Informed Low-Rank Subspaces
Link: https://arxiv.org/abs/2607.01125
Authors: Xun Dong,Yibo Xu,Naigang Wang,Xin Li,Penghang Yin,Zi Yang
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:
Abstract:Zeroth-order (ZO) optimization enables fine-tuning large language models when backpropagation is unavailable or memory-prohibitive, but existing methods often perturb full model weights or randomly constructed low-dimensional subspaces, yielding high-variance estimates and limited performance. We propose ZO-Act, an activation-informed ZO fine-tuning method that restricts perturbations to a fixed low-rank subspace derived from input activations. For each linear layer, ZO-Act computes a small activation basis once at initialization and optimizes only lightweight coefficient matrices using forward-only loss evaluations. This reduces the effective perturbation dimension, exposes explicit trainable variables compatible with momentum-based optimizers such as Adam, and naturally supports quantized LLM fine-tuning by keeping low-bit weights frozen. We analyze ZO-Act as zeroth-order optimization over a restricted coefficient space and show that perturbing the low-dimensional coefficients reduces both the variance-dependent convergence term and the finite-difference error of the ZO estimator, at the cost of a controlled subspace approximation bias that is mitigated by the low-rank structure of LLM activations and gradients. Experiments on Llama-3-8B, OPT-13B, and INT4 Llama-3-8B show consistent gains over strong ZO fine-tuning baselines across language understanding, question answering, and commonsense reasoning.
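The core mechanism, perturbing only a small coefficient vector inside a fixed subspace and estimating the gradient from two forward evaluations, can be sketched as below. This is a generic two-point zeroth-order estimator under stated assumptions, not the authors' code. The basis is hand-picked rather than derived from activations, plain SGD stands in for Adam, and the function `zo_subspace_step` and the toy quadratic loss are hypothetical.

```python
import random


def zo_subspace_step(loss, w0, basis, coeffs, mu, lr, rng):
    """One two-point zeroth-order update on subspace coefficients.

    Effective weights are w0 + basis @ coeffs. Only `coeffs` is perturbed
    and updated; `w0` and `basis` stay fixed, and `loss` is queried with
    forward evaluations only.
    """
    def weights(c):
        return [w + sum(b * ci for b, ci in zip(row, c))
                for w, row in zip(w0, basis)]

    u = [rng.gauss(0.0, 1.0) for _ in coeffs]  # random direction in coefficient space
    plus = loss(weights([c + mu * ui for c, ui in zip(coeffs, u)]))
    minus = loss(weights([c - mu * ui for c, ui in zip(coeffs, u)]))
    slope = (plus - minus) / (2.0 * mu)  # finite-difference directional derivative
    return [c - lr * slope * ui for c, ui in zip(coeffs, u)]


target = [1.0, -2.0, 0.0, 0.0]


def quadratic_loss(w):
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))


w0 = [0.0, 0.0, 0.0, 0.0]
basis = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]  # 4 weights, rank-2 subspace
rng = random.Random(0)
coeffs = [0.0, 0.0]
initial = quadratic_loss(w0)
for _ in range(500):
    coeffs = zo_subspace_step(quadratic_loss, w0, basis, coeffs, 1e-3, 0.05, rng)
final = quadratic_loss([w + sum(b * c for b, c in zip(row, coeffs))
                        for w, row in zip(w0, basis)])
```

The random direction has two entries here instead of four, which is the dimension reduction the abstract credits with lowering estimator variance. In this toy case the target lies inside the subspace, so there is no approximation bias.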
[LG-8] SynLaD: Latent Diffusion for Generating Synthesizable Molecules Conditioned on 3D Pharmacophore Profiles
Link: https://arxiv.org/abs/2607.01105
Authors: Miruna Cretu,John Bradshaw,Patricia Suriana,Saeed Saremi,Omar Mahmood,Kirill Shmilovich,Kangway Chuang,Vishnu Sresht,Colin Grambow
Categories: Machine Learning (cs.LG)
Comments:
Abstract:We present SynLaD, a latent diffusion framework for small-molecule generation that unifies ligand-based drug design objectives (what to make) with synthetic accessibility (how to make it). Current models typically optimize one objective at the expense of the other, creating a bottleneck for discovering high-scoring and synthesizable molecules. SynLaD combines reaction-constrained generation with pharmacophore-conditioned 3D design by learning a latent space that decodes to both 3D structures and synthesis pathways. An encoder maps molecules to a latent representation used by two decoder heads: (i) a geometric head that reconstructs atom types and coordinates and (ii) an autoregressive synthesis head that outputs synthetic routes in a serialized, reaction-based notation. A diffusion transformer generates novel latents in the learned space, conditioned on pharmacophore profiles. Across analogue generation tasks for bioactive ligands, SynLaD outperforms existing baselines in synthesizable and diverse hit generation, demonstrating that a single model can produce shape-aligned molecules with feasible synthesis plans.
[LG-9] When Context Compensates for Sparse Event History: AlphaEarth for Spatio-Temporal Point-Process Forecasting
Link: https://arxiv.org/abs/2607.01082
Authors: Yahya Aalaila,Mouad Elhamdi,Gerrit Großmann,Daniel Jenson,Elizaveta Semenova,Sebastian Vollmer
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Spatio-temporal point-process models must often generalise across space when local event histories are sparse. We study whether exogenous spatial context can compensate in such regimes. Using a fixed log-Gaussian Cox process backbone, we compare an event-only model with the same model augmented by AlphaEarth embeddings as linear spatial context. We evaluate spatial transfer on emergency medical services (EMS) forecasting across eight held-out regions, fixed forecast anchors, and a sweep over history length $w$, using only AlphaEarth (AE) embeddings available strictly before each anchor. AE improves out-of-region predictive performance across all history regimes, with the largest gains under scarce histories: approximately $2$–$6\times$ multiplicative improvements at 1–2 weeks, tapering to roughly 10–20% at $w = 20$–$104$ weeks. These results show that contextual information can substantially stabilise spatially transferred point-process forecasts when event history is limited.
[LG-10] Balancing Expressivity and Learnability in Quantum Kernel Bandit Optimization
Link: https://arxiv.org/abs/2607.01080
Authors: Yuqi Huang,Vincent Y. F. Tan,Sharu Theresa Jose
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)
Comments:
Abstract:We investigate Gaussian process (GP) bandit optimization with quantum kernels, assuming the mean reward function lies in the reproducing kernel Hilbert space (RKHS) induced by the quantum kernel. This setting is motivated by NISQ-era tasks such as quantum control, state preparation and variational quantum algorithms. While quantum kernels can offer a ‘quantum advantage’ via domain-specific inductive biases, naïvely using full, high-dimensional kernels increases model complexity and information gain, leading to higher cumulative regret and poor learnability. To address this, we propose projected quantum kernels and classical kernel approximation techniques that reduce feature dimensionality while preserving key quantum properties. Using these approximate kernels, we develop misspecified GP bandit algorithms and derive regret bounds that characterize the trade-off between approximation error and information gain. The regret bounds provide principled guidance for selecting the optimal model complexity. Empirically, our methods outperform full quantum kernels in sample efficiency, while substantially reducing computational overhead, enabling scalable GP optimization for quantum-native applications.
[LG-11] GSRQ: Gain-Shape Residual Quantization for Sub-1-bit KV Cache ICML2026
Link: https://arxiv.org/abs/2607.01065
Authors: Soosung Kim,Minjae Park,Eui-Young Chung,Jaeyong Chung
Categories: Machine Learning (cs.LG)
Comments: ICML 2026
Abstract:The deployment of Large Language Models (LLMs) with extended context windows is increasingly constrained by the linear growth of Key-Value (KV) cache memory. Vector Quantization (VQ), particularly Residual Quantization (RQ), is a promising approach for pushing KV cache storage toward the sub-1-bit regime by progressively encoding residuals with small codebooks. However, most VQ methods still rely on standard $\ell_2$ $K$-means as the core codebook-learning primitive. We identify a subtle high-dimensional issue of this primitive: Euclidean centroid averaging can induce centroid shrinkage, which weakens the angular alignment term in the $\ell_2$ distortion and makes directional preservation harder. To address this issue, we propose Gain-Shape $K$-means (GSKM), a drop-in replacement for $K$-means that improves directional fidelity while matching, and in some regimes improving, $\ell_2$ distortion. We then build Gain-Shape Residual Quantization (GSRQ) by incorporating a weighted extension of GSKM into an RQ pipeline. On LLaMA-3-8B, GSRQ substantially improves over strong KV cache quantization baselines across bit rates. At 1-bit, it improves the average accuracy across LongBench tasks from 11.34 to 33.54, a gain of 22.20 percentage points over VQLLM.
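Centroid shrinkage is easy to reproduce on a toy cluster. The sketch below contrasts the ordinary coordinate-wise mean with a gain-shape style centroid in the spirit of classic gain-shape vector quantization, where magnitude (gain) and direction (shape) are averaged separately. The abstract does not spell out GSKM's update rule, so `gain_shape_centroid` is an assumed illustration, not the paper's algorithm.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def euclidean_centroid(points):
    """Standard K-means centroid: the coordinate-wise mean."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]


def gain_shape_centroid(points):
    """Assumed gain-shape centroid: mean norm times unit mean direction.

    Requires non-zero points whose unit directions do not cancel out.
    """
    gain = sum(norm(p) for p in points) / len(points)
    mean_dir = euclidean_centroid([[x / norm(p) for x in p] for p in points])
    scale = norm(mean_dir)
    return [gain * x / scale for x in mean_dir]


cluster = [[1.0, 0.0], [0.0, 1.0]]  # two unit vectors 90 degrees apart
plain = euclidean_centroid(cluster)
gs = gain_shape_centroid(cluster)
```

Both centroids point along the same diagonal, but the plain mean has norm about 0.707 while the gain-shape version keeps norm 1, matching the members of the cluster.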
[LG-12] The Model Organism Lottery: Model Organism Interpretability Strongly Depends on Training Methodology
Link: https://arxiv.org/abs/2607.01033
Authors: Andrzej Szablewski,Gabriel Konar-Steenberg,Raffaello Fornasiere,Nikita Menon,Stefan Heimersheim
Categories: Machine Learning (cs.LG)
Comments: 9 pages, 9 figures, references and appendices
Abstract:Model organisms (MOs) - language models trained to exhibit undesired or unnatural behaviours - are frequently used as testbeds for evaluating white-box interpretability techniques. Current MOs are typically constructed via post-hoc supervised fine-tuning (SFT) on behavioural transcripts or synthetic documents. Prior research has shown that interpretability methods can easily identify hidden behaviours in these MOs. However, recent work suggests that such post-hoc training methods may make interpretability unrealistically easy. We investigate this claim by constructing a suite of 54 OLMo2-1B- and gemma-3-1b-it-based MOs trained with seven different techniques, including standard post-hoc SFT, post-hoc DPO, and more realistic integration of MO data into the OLMo post-training DPO phase. We use these MO variants to benchmark activation oracles, activation steering, logit lens, and sparse autoencoders. Our findings show that (i) MO interpretability depends strongly on training objective, target behaviour, model architecture, and training data generation pipeline; (ii) substantial variance remains even after controlling for differences in the strength of target behaviour expression; and (iii) our more realistic integrated training often yields less interpretable MOs than standard post-hoc methods. Our results cast substantial doubt on the validity of current MOs as interpretability proxies.
[LG-13] Seahorse: A Unified Benchmarking Framework for Spatiotemporal Event Modeling
Link: https://arxiv.org/abs/2607.01022
Authors: Yahya Aalaila,Gerrit Großmann,Sebastian Vollmer
Categories: Machine Learning (cs.LG)
Comments: 24 pages, 9 figures. Code: this https URL
Abstract:Spatiotemporal point processes (STPPs) model event data in continuous time and space, with applications in mobility, epidemiology, and public safety. Recent neural STPPs span expressive intensity models, conditional density models, continuous-time latent dynamics, normalizing-flow spatial decoders, and score-based generative mechanisms. Yet comparison remains fragile because implementations differ in preprocessing, coordinate normalization, splits, likelihood conventions, and evaluation protocols. We present SEAHORSE, a unified framework for reproducible STPP experimentation. SEAHORSE formalizes neural STPPs through a common encode-evolve-decode interface and trains, tunes, and evaluates every model family under a single executable benchmark protocol with raw-coordinate likelihood reporting. This enables fair comparisons but, more importantly, controlled diagnostic studies. We pair SEAHORSE with HawkesNest, a synthetic stress-test suite, and show that increasing event-pattern complexity exposes each family’s inductive bias, degrading some models sharply and leaving others stable. Code: this https URL.
[LG-14] Generative Model Proposal based Particle Filtering for Data Assimilation
Link: https://arxiv.org/abs/2607.01012
Authors: Chandni Nagda,Mayank Shrivastava,Gudrun Thorkelsdottir,Gan Zhang,Morteza Mardani,Arindam Banerjee
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Data assimilation models state dynamics conditioned on sequential observations, and has wide-ranging scientific applications. In the filtering setting, the goal is to model the posterior over the current state given all observations so far. Classical solutions typically make simplifying distributional or functional assumptions, e.g., linear-Gaussian systems, which can be inaccurate in many scenarios. In principle, particle filters (PFs) remove these assumptions, yet often collapse in high dimensions. Recent generative approaches learn conditional state transitions, but without principled Bayesian updates they do not recover the correct filtering posterior and can accumulate error over long horizons. In this work, we introduce Flow Proposal Particle Filters (FPPF), which learn a conditional generative model based proposal approximating the variance-minimizing optimal proposal for particle propagation. Conditioning on observations steers particles toward high-likelihood regions before weighting, reducing weight variance and delaying degeneracy. Since our proposal admits tractable likelihood evaluation, FPPF computes accurate importance weights and retains a Bayesian update step. We further extend FPPF to high-dimensional problems through localization strategies, addressing another standard PF failure mode. Extensive experiments on a variety of dynamical systems show that FPPF outperforms statistical baselines and other generative methods in non-linear, non-Gaussian, and high-dimensional regimes.
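Why an observation-conditioned proposal reduces weight variance can be checked in the one case where the optimal proposal has a closed form: a scalar linear-Gaussian model. The sketch below is a textbook illustration, not FPPF itself, which learns its proposal with a generative model. The model parameters `a`, `q_var`, `r_var` and all function names are assumptions made for the example.

```python
import math


def log_normal(x, mean, var):
    """Log density of N(mean, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def optimal_proposal(x_prev, y, a, q_var, r_var):
    """Mean and variance of p(x | x_prev, y) when
    x ~ N(a * x_prev, q_var) and y ~ N(x, r_var)."""
    var = 1.0 / (1.0 / q_var + 1.0 / r_var)
    mean = var * (a * x_prev / q_var + y / r_var)
    return mean, var


def log_incremental_weight(x, x_prev, y, a, q_var, r_var, prop_mean, prop_var):
    """log[ p(y | x) p(x | x_prev) / q(x | x_prev, y) ] for a proposed particle x."""
    return (log_normal(y, x, r_var)
            + log_normal(x, a * x_prev, q_var)
            - log_normal(x, prop_mean, prop_var))


def effective_sample_size(weights):
    """Kish effective sample size of unnormalised, non-negative weights."""
    total = sum(weights)
    return 1.0 / sum((w / total) ** 2 for w in weights)


a, q_var, r_var = 0.9, 0.5, 0.2
x_prev, y = 1.0, 2.0
m, v = optimal_proposal(x_prev, y, a, q_var, r_var)
# Two very different proposed particles receive the same incremental weight.
lw_a = log_incremental_weight(0.3, x_prev, y, a, q_var, r_var, m, v)
lw_b = log_incremental_weight(-1.2, x_prev, y, a, q_var, r_var, m, v)
predictive = log_normal(y, a * x_prev, q_var + r_var)  # log p(y | x_prev)
```

Under the optimal proposal the incremental weight equals p(y | x_prev) regardless of where the particle lands, so the sampling step adds no extra weight variance. A learned proposal that approximates it, with a tractable density for the denominator, aims for the same effect in models without a closed form.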
[LG-15] Automatic Detection of Stress from Speech in the Trier Social Stress Test INTERSPEECH2026
Link: https://arxiv.org/abs/2607.00986
Authors: Hanna Drimalla,Wieland R. Cremer,Christine Kraus,Oliver T. Wolf
Categories: Machine Learning (cs.LG)
Comments: Accepted to/for Interspeech 2026
Abstract:Automatically detecting stress in speech provides an unobtrusive way to gain insights relevant to behavioral research or clinical assessment. This study investigates the automatic differentiation between a stressful and non-stressful situation, and the prediction of physiological and affective stress responses. Speech data was collected from 50 participants who either completed the Trier Social Stress Test (TSST) or a non-stressful control condition. With a processing pipeline that included speaker diarization and machine learning models, we achieved stress detection performance significantly above a mean baseline. Moreover, relevant physiological and affective stress responses were partially predictable from acoustic-prosodic features. Feature-importance analyses identified the most informative predictors contributing to model performance. The findings demonstrate that speech can serve as a meaningful and unobtrusive indicator of multiple dimensions of the human stress response.
[LG-16] LeNEPA: No-Augmentation Next-Latent Prediction for Time-Series Representation Learning KDD
Link: https://arxiv.org/abs/2607.00958
Authors: Alexander Chemeris,Ming Jin,Randall Balestriero
Subjects: Machine Learning (cs.LG)
Comments: 9 pages, 4 figures, 6 tables; accepted by the 12th Mining and Learning from Time Series (KDD MILETS 2026); source code and artifacts: this https URL
Abstract:Time series are central to modern data mining applications, from industrial telemetry and server metrics to finance and physiology, yet time-series self-supervised learning often depends on view and augmentation choices that encode domain-specific invariances. We study how an SSL recipe behaves when its method-specific configuration is reused unchanged after the pretraining signal family changes, framing this as a fixed-recipe stress test rather than a comparison against optimally tuned methods. We introduce Latent Euclidean Next-Embedding Prediction Architecture (LeNEPA), a no-augmentation next-latent-token objective with a causal backbone. LeNEPA replaces the stop-gradient/EMA stabilization used by vanilla NEPA with SIGReg-based isotropy regularization and computes the predictive loss in a lightweight projected space that is discarded for evaluation. We compare LeNEPA with an ECG-tuned JEPA recipe under a fixed-horizon frozen-probe protocol on PTB-XL and Diag, a synthetic diagnostic corpus generated with Aionoscope. Both methods are retrained independently on each dataset while keeping their method-specific recipes unchanged. In this protocol, the ECG-tuned JEPA recipe is strong in-domain on PTB-XL but weaker when reused unchanged on Diag, whereas LeNEPA preserves useful frozen-probe gains on both datasets. Learning curves suggest faster early representation acquisition: LeNEPA reaches 80% of its final AUROC/AUPRC gain after 2–5k updates, compared with 5–10k updates for the faster JEPA readout. As a separate external frozen-encoder check, a CauKer-pretrained LeNEPA variant reaches 77.65% mean UCR-128 Random-Forest accuracy in a single-seed, best-checkpoint run, within 1.16 points of Mantis and within 0.24 points of MOMENT (77.89%). Overall, the results support no-augmentation latent prediction as a useful candidate recipe for low-retuning time-series SSL.
[LG-17] Diffeomorphic Optimization
Link: https://arxiv.org/abs/2607.00947
Authors: Ludwig Winkler,Andrew Leaver-Fay,Joseph Kleinhenz,Pan Kessel
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Generative models learn data distributions that reside on a low-dimensional manifold within a higher-dimensional ambient space. Optimizing differentiable objectives on this manifold is challenging: the ambient loss landscape is high-dimensional, rugged, and non-convex. Direct gradient descent, blind to the manifold’s geometry, quickly drifts off it. Diffeomorphic optimization starts from the observation that diffusion and flow models provide a map from the data manifold to a much simpler base space in which we perform gradient descent. Using differential geometry, we show this is equivalent to Riemannian gradient descent on the data manifold up to \mathcal{O}(\lambda^2) corrections, keeping trajectories on-manifold by construction and yielding a smoother optimization surface. For protein design, we extend diffeomorphic optimization to the matrix Lie groups \mathrm{SO}(3) and \mathrm{SE}(3), deriving an autograd-compatible \mathrm{SO}(3) gradient and a generalized adjoint-state method for backpropagation through Lie-group ODE solvers. Diffeomorphic optimization improves over tuned guidance on secondary-structure targeting with FrameFlow (91.3% vs. 63.3% of residues in the Ramachandran target), outperforms OC-Flow on peptide binding affinity at 2\times the speed, and reduces Rosetta energies by thousands of units across the PDB test set for structures with hundreds of residues.
[LG-18] A Geometric Perspective on Composable Emotion Steering in Text-to-Speech Models
Link: https://arxiv.org/abs/2607.00946
Authors: Siyi Wang,James Bailey,Ting Dang
Subjects: Sound (cs.SD); Machine Learning (cs.LG)
Comments:
Abstract:While prior work has explored emotion control in hybrid text-to-speech systems, the geometric properties of these modules, and their implications for steerability, remain poorly understood. We present the first comparative study of speech language model (SLM) and conditional flow-matching (CFM) modules as activation steering sites for mixed-emotion speech synthesis. We first characterize emotion representations using linear probing and local intrinsic dimensionality (LID), and then evaluate single-site and joint steering for mixed-emotion synthesis. Our results show that SLM offers a clean, low-dimensional emotion-specific subspace with strong speaker–emotion disentanglement, while CFM exhibits poor cross-speaker generalization due to speaker–emotion entanglement. Joint steering increases emotion intensity but degrades proportional control and speech quality on in-distribution data. These findings provide practical guidance for multi-site activation steering in hybrid TTS systems and highlight the importance of representation geometry in controllable speech generation.
[LG-19] Explainable AI for Cancer Drug Response Prediction: Beyond Univariate Feature Attributions
Link: https://arxiv.org/abs/2607.00931
Authors: Martino Ciaperoni,Margherita Lalli,Simone Piaggesi,Martina Varisco,Francesco Carli,Riccardo Guidotti,Dino Pedreschi,Francesco Raimondi,Fosca Giannotti
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Predicting cancer drug response from transcriptomic profiles is a cornerstone of precision oncology, yet the scientific value of machine learning models hinges not solely on predictive accuracy, but also on their capacity to generate reliable biological insights. Current explainability approaches in this setting are computationally costly, lack robustness, and reduce complex drug response to univariate gene importance scores, overlooking the coordinated gene activity that drives sensitivity and resistance. In this work, we present ILLUME+, a scalable post-hoc explainability framework that moves beyond single-gene assessments to capture multiple, complementary forms of explanation. Integrated into our end-to-end pipeline, ILLUME+ produces more stable gene importance scores than existing baselines, recovers established drug-gene associations and mechanisms of action, and enables AI-assisted hypothesis generation to uncover novel interaction-driven molecular signals in cancer biology.
[LG-20] Beyond Activation Alignment: The Alignment-Diversity Tradeoff in Task-Aware LLM Quantization
Link: https://arxiv.org/abs/2607.00908
Authors: Fei Wang,Chao Xue,Taoran Liu,Li Shen,Ye Liu,ChangXing Ding
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Mixed-precision quantization (MPQ) has become a key technique for deploying large language models under stringent memory and compute constraints. We first identify a phenomenon that we term the Perplexity Illusion: layers ranked as important by perplexity-based sensitivity show little rank correlation with those that are most influential for complex reasoning performance, with Kendall \tau \approx 0 in our analysis. We further reveal an Alignment-Diversity Tradeoff: using only target-task calibration data can degrade post-quantization performance, whereas incorporating general-domain data stabilizes sensitivity estimation and improves robustness across tasks. Based on these observations, we propose TASA (Task-Aware Sensitivity Analysis), a two-level framework that jointly optimizes calibration-data composition and mixed-precision bit allocation. Specifically, TASA searches for a calibration-data mixture using a training-free gradient-trace alignment criterion, and then aggregates perplexity and reasoning-oriented sensitivity signals to guide both inter-layer and intra-layer bit allocation. Experiments on LLaMA-3-8B and Qwen2.5-7B reveal a precision inversion: appropriately allocated 3.5-bit models can match or surpass less task-aware 4-bit baselines. At an average precision of 3.5 bits, TASA matches or outperforms several competitive 4-bit uniform baselines in aggregate accuracy, and improves over the strongest W3 baseline on GSM8K by more than 20 absolute points on LLaMA-3-8B. These results show that calibration-data composition substantially affects task-sensitive quantization, a factor underexplored in prior work.
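The rank-correlation check behind the "Perplexity Illusion" can be illustrated with Kendall's τ. The following is a minimal sketch of the tie-free variant (tau-a), not the paper's analysis code. The two per-layer rankings are made-up numbers chosen only to show what τ ≈ 0 looks like.

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's tau-a for two equally long score lists without ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1   # the pair is ordered the same way in both lists
        elif s < 0:
            discordant += 1   # the pair is ordered oppositely
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical layer ranks: by perplexity sensitivity vs. by reasoning sensitivity.
ppl_rank = [1, 2, 3, 4]
reasoning_rank = [2, 4, 1, 3]
print(kendall_tau(ppl_rank, reasoning_rank))  # 3 concordant, 3 discordant pairs
```

A value near 0 means that knowing a layer's perplexity-based rank says almost nothing about its rank under the reasoning-oriented measure.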
[LG-21] The Binary Tree Mechanism is Optimal for Approximate Differentially Private Continual Counting
Link: https://arxiv.org/abs/2607.00876
Authors: Konstantina Bairaktari,Kasper Green Larsen
Subjects: Data Structures and Algorithms (cs.DS); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:Private continual counting is a fundamental problem in differential privacy: given a binary stream of length n, where each 1 corresponds to the contribution of one individual, the goal is to release all running counts while protecting the privacy of each individual. The standard algorithm is the binary tree mechanism, whose Gaussian-noise variant achieves expected \ell_\infty error proportional to \log^{3/2} n for approximate differential privacy. Whether this dependence on the stream length is necessary has remained a central open problem. In this work, we resolve the dependence on n by proving that every differentially private mechanism for continual counting must incur expected \ell_\infty error \Omega(\log^{3/2} n). This shows that the binary tree mechanism is asymptotically optimal in the approximate-DP setting. As a consequence, we also obtain a largest-possible separation between hereditary discrepancy and private \ell_\infty error for linear queries, showing that the known general upper bound in terms of hereditary discrepancy has the optimal dependence on the number of queries.
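For readers unfamiliar with the binary tree mechanism, a minimal sketch follows. It is a textbook-style illustration rather than code from the paper. The noise scale `sigma` is left as a free parameter, and calibrating it to a target (ε, δ) is deliberately omitted. The function name and the lazy caching of node noise are choices made for this example.

```python
import random

def binary_tree_counts(stream, sigma, rng=None):
    """Release all running counts of a 0/1 stream using noisy dyadic block sums.

    Each dyadic block (start, length) gets one Gaussian noise draw that is reused
    by every prefix containing that block. A prefix of length t is the sum of at
    most floor(log2 t) + 1 noisy blocks.
    """
    rng = rng or random.Random(0)
    prefix = [0]
    for x in stream:
        prefix.append(prefix[-1] + x)

    noisy = {}  # (start, length) -> noisy block sum, drawn once and cached

    def node(start, length):
        key = (start, length)
        if key not in noisy:
            exact = prefix[start + length] - prefix[start]
            noisy[key] = exact + rng.gauss(0.0, sigma)
        return noisy[key]

    out = []
    for t in range(1, len(stream) + 1):
        total, start = 0.0, 0
        # The binary digits of t give the dyadic decomposition of [0, t).
        for bit in reversed(range(t.bit_length())):
            length = 1 << bit
            if t & length:
                total += node(start, length)
                start += length
        out.append(total)
    return out

print(binary_tree_counts([1, 0, 1, 1, 0, 1, 1, 1], sigma=1.0))
```

Because each stream element lies in only logarithmically many blocks and each released count sums only logarithmically many noisy blocks, the error grows polylogarithmically in n. The abstract's result concerns whether the resulting \log^{3/2} n rate can be improved.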
[LG-22] Constrained Bayesian Optimisation with Multiple Information Sources
Link: https://arxiv.org/abs/2607.00865
Authors: Hauke Maathuis,Roeland De Breuker,Saullo Castro,Maike Osborne
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Bayesian Optimisation (BO) under unknown constraints is particularly challenging when feasible regions are small. In such settings, existing methods that typically rely solely on evaluations of the true objective and constraints struggle to efficiently explore the design space. However, many real-world applications offer auxiliary data sources (e.g. surrogate models or simplified simulations) that can support early exploration. Despite this potential, their integration into constrained BO remains largely unexplored. We propose a general multi-source framework that extends constrained Max-value Entropy Search, capturing inter-source correlation while balancing evaluation cost and information gain. Experiments on both synthetic and physics-based benchmarks show that our method efficiently identifies feasible and optimal solutions, even when auxiliary data are only weakly correlated. The proposed approach consistently outperforms existing methods, particularly in early-stage exploration.
[LG-23] Spectroscopy Analysis with Machine Learning Regression for the Quantification of Carbon and Nitrogen Contents in Inceptisol and Oxisol Soil Types: Comparing Different Preprocessing and Validation methods as well as Feature Importance
Link: https://arxiv.org/abs/2607.00834
Authors: Vinicius Herique Kieling,Guilherme Macedo Baggio,Felipe Augusto Bueno Rossi,Marco Antonio de Castro Barbosa,Dalcimar Casanova,Larissa Macedo dos Santos Tonial,Jefferson Tales Oliva
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Near-Infrared (NIR) spectroscopy has emerged as a promising alternative to traditional soil analysis methods, offering advantages such as speed, low cost, and non-destructive testing. This work proposes a machine learning (ML) approach to calibrate predictive models for carbon (C) and nitrogen (N) content in Oxisols and Inceptisols, utilizing NIR spectral data acquired with a portable MyNIR device. Various preprocessing methods were evaluated, with the most effective being the Savitzky-Golay (SG) filter and a robust outlier removal method based on the Nonlinear Iterative Partial Least Squares (NIPALS) algorithm coupled with a Huber loss function. Multiple validation strategies were compared, including 10-fold cross-validation, leave-one-out, and holdout via the Kennard-Stone method, followed by standardization. Stacking ensemble learning models were employed, using Partial Least Squares (PLS), Support Vector Regression (SVR), and Ridge as base models, with linear regression as the meta-model. The models were evaluated using R^2, Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Ratio of Performance Deviation (RPD) metrics. The performance gap between soil types suggests the influence of pedological characteristics. Furthermore, the models achieved an RPD 2.0 with low overfitting, validating the potential of this approach for rapid C and N quantification. This study contributes to the optimization of sustainable agricultural practices, aligning with the demand for efficient and environmentally friendly analytical methods. The developed technique enables faster decision-making for producers and consultants based on organic matter content, fertility indicators, and nutrient availability.
[LG-24] From Pixels to Temporal Correlations: Learning Informative Representations for Reinforcement Learning Pre-training ACM-MM2025
Link: https://arxiv.org/abs/2607.00811
Authors: Jinwen Wang,Youfang Lin,Xiaobo Hu,Siyu Yang,Sheng Han,Shuo Wang,Kai Lv
Subjects: Machine Learning (cs.LG)
Comments: 10 pages, 8 figures. Accepted by ACM MM 2025
Abstract:Unsupervised pre-training on large-scale datasets has demonstrated significant potential for improving the sample efficiency and performance of Reinforcement Learning (RL). Given the large-scale action-free internet videos, existing methods utilize single-step transition prediction and image reconstruction to learn representations. However, these methods prefer to preserve large-proportion stationary information in the pixel space, neglecting small but crucial information. To preserve enough information in the representation, it is essential to pay equal attention to each element in videos. Specifically, we propose a temporal correlation space to distinguish each element. For implementation, we introduce the Multi-scale Temporal Contrastive Learning (MTCL) method to model multi-scale temporal correlations separately. This approach can balance the attention of different elements and yield more informative representations, effectively supporting policy learning in various downstream tasks. Experimental results demonstrate that our method improves sample efficiency and asymptotic performance across various downstream tasks.
[LG-25] Local Motion Matters: A Deconstruct-Recompose Paradigm for Reinforcement Learning Pre-training from Videos
Link: https://arxiv.org/abs/2607.00808
Authors: Jinwen Wang,Youfang Lin,Xiaobo Hu,Shuo Wang,Kai Lv
Subjects: Machine Learning (cs.LG)
Comments: 20 pages, 16 figures
Abstract:Pre-training on large-scale videos to improve reinforcement learning efficiency is promising yet remains challenging. Existing methods typically treat the agent as an indivisible entity, modeling motion patterns globally. Such global modeling is tightly coupled with the morphology, hindering transfer across domains. In contrast, despite the vast disparity in global motions, the local components exhibit similar motion patterns across different agents. Building on this insight, we propose a novel Deconstruct-Recompose Paradigm (DRP) for learning transferable local motion representations. Specifically, in the Deconstruct phase, we identify multiple local points and track their frame-wise motions, defining each as an Atomic Action. We introduce a Dual-Attention Encoder (DAE) to learn local motion representations from these Atomic Actions, capturing their spatiotemporal relationships. In the Recompose phase, we compose local motion representations with a learnable Motion Aggregation Token [MAT] via latent dynamics model learning. Additionally, an adapter bridges local motion and downstream action-specific dynamics to accelerate policy learning. Extensive experiments demonstrate that our method effectively transfers to diverse robotic control and manipulation tasks, significantly improving sample efficiency and performance.
[LG-26] Task-Relevant Representation Decoupling for Visual Reinforcement Learning Generalization
Link: https://arxiv.org/abs/2607.00796
Authors: Jinwen Wang,Youfang Lin,Xiaobo Hu,Qian Xu,Shuo Wang,Zhuo Chen,Kai Lv
Subjects: Machine Learning (cs.LG)
Comments: 23 pages, 13 figures
Abstract:Visual Reinforcement Learning (VRL) has achieved considerable success in solving control tasks. However, generalizing learned policies to new environments remains a major challenge, as agents often overfit to task-irrelevant features in the training environment. To solve this problem, we introduce the concept of decoupling observations into task-relevant and task-irrelevant representations. Building on this idea, we propose a self-supervised Task-Relevant Representation Decoupling (T2RD) algorithm for VRL. This algorithm consists of three components: task-relevant representation consistency, cross-reconstruction, and cross-dynamic prediction. The first two components achieve the decoupling of content and style features, but the resulting content representations are not necessarily task-relevant. To further refine task-relevant features from content representations, we design the third component that introduces dynamic prediction. T2RD achieves State-Of-The-Art (SOTA) generalization performance and sample efficiency in the DeepMind Control Suite and Robotic Manipulation tasks.
[LG-27] Which Metric Reflects the Spelling Rate Accuracy in Event-Related Potential-Based Brain-Computer Interfaces?
Link: https://arxiv.org/abs/2607.00794
Authors: Okba Bekhelifi,Naoual El Djouher Mebtouche
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: Accepted for presentation at the 2026 IEEE International Conference on Metrology for eXtended Reality, Artificial Intelligence and Neural Engineering (IEEE MetroXRAINE 2026), Chemnitz, Germany
Abstract:For predictive models, the most commonly reported performance metrics are loss and accuracy. In synchronous Brain-Computer Interface (BCI) systems, these metrics are informative for most BCI paradigms; however, for Event-Related Potential (ERP) applications the spelling rate, which measures the number of characters correctly selected, is more important, as it influences the estimation of information transfer rate (ITR) and any related metric measuring spelling performance. Moreover, ERP-based BCIs have imbalanced class distributions, which require reporting metrics that can handle the imbalance, such as the area under the receiver operating characteristic curve (ROC AUC). In this work, we study the correlation of the spelling rate with 13 metrics to identify which among them best reflect user spelling performance and how they are affected by trial repetition. The results on two datasets (a private LARESI ERP dataset and the public OpenBMI ERP dataset) favor the Brier score, Matthews Correlation Coefficient (MCC), and the metrics that account for class imbalance in binary classification: ROC AUC, area under the Precision-Recall curve (PR AUC), Average Precision (AP), and partial AUC (pAUC). These findings encourage researchers and practitioners to report those metrics in ERP-based BCI experiments.
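Two of the favored metrics are easy to state concretely. The sketch below uses the standard textbook definitions of the Brier score and MCC in plain Python. The labels and probabilities are made-up values imitating an imbalanced target/non-target split, not data from the paper.

```python
import math

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def matthews_corrcoef(y_true, y_pred):
    """MCC from the binary confusion matrix; returns 0.0 when undefined."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Illustrative imbalanced case: one target flash among five non-targets.
y_true = [1, 0, 0, 0, 0, 0]
always_negative = [0] * 6  # a classifier that never detects the target
accuracy = sum(int(y == p) for y, p in zip(y_true, always_negative)) / len(y_true)
print(accuracy, matthews_corrcoef(y_true, always_negative))
```

The always-negative classifier reaches about 83% accuracy while its MCC is 0, which shows why accuracy alone can be misleading on imbalanced data.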
[LG-28] Evaluating Pretrained Music Embeddings for Cross-Performance Jazz Standard Recognition ICML2026
Link: https://arxiv.org/abs/2607.00777
Authors: Çağrı Eser
Subjects: Sound (cs.SD); Machine Learning (cs.LG)
Comments: 6 pages, 2 figures, 4 tables. Accepted to the ICML 2026 Workshop on Machine Learning for Audio
Abstract:Recognizing jazz standards from audio is a challenging form of tune-level music retrieval: different performances of the same standard may vary in tempo, key, arrangement, instrumentation, improvisational content, and even whether the head melody is present. We study this problem using a curated subset of the Jazz Trio Database designed for cross-performance standard recognition. We compare a from-scratch trained Harmonic CNN baseline against frozen pretrained music representations from recent music understanding foundation models, using both supervised probing and nearest-neighbor retrieval. Our results suggest that from-scratch spectrogram models overfit strongly to training performances, while pretrained embeddings provide better top-k results but are sensitive to performer identity, which can be partially reduced with a lightweight contrastive projection. Our findings motivate jazz standard recognition as a useful stress test for music representation models and as a step toward retrieval-based standard identification. Project page: this https URL.
[LG-29] Accelerating Discrete Diffusion Models with Parallel-In-Time Sampling
Link: https://arxiv.org/abs/2607.00773
Authors: Yu Yao,Huanjian Zhou,Andi Han,Wei Huang,Masashi Sugiyama
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS); Numerical Analysis (math.NA)
Comments: 33 pages, 10 figures
Abstract:Discrete diffusion models are widely used for learning and generating discrete distributions. As the generation process is inherently sequential, the acceleration of sampling is of significant importance. In this work, we parallelize the mainstream \tau-leaping algorithm for absorbing discrete diffusion in a Continuous-Time Markov Chain (CTMC) framework. By leveraging the continuous-time stochastic integral form of the \tau-leaping algorithm and the Picard iteration method, we achieve parallel-in-time sampling acceleration and provide a proof of exponential-factorial convergence for our algorithm. We improve the overall time complexity of \tau-leaping under absorbing settings from \mathcal{O}(d \log S) to \mathcal{O}(\log(d \log S) \cdot \log d) with respect to NFE. Empirically, our method shows consistent acceleration across synthetic and real-data settings. The new sampler achieves at most 7–9\times runtime speedup for synthetic distribution, and maintains the same quality with 50% fewer NFE and 1.45–1.86\times runtime speedups in image/text tasks on a single GPU. Our research expands the potential of discrete diffusion models for efficient parallel inference, with broader implications for applications such as molecular structure and language generation.
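The idea of replacing a sequential time loop with Picard iteration can be shown on a much simpler problem than the paper's. The sketch below is not the authors' sampler: it substitutes a deterministic scalar ODE discretized with explicit Euler for the stochastic \tau-leaping dynamics. All function names are hypothetical. In each Picard sweep, the drift evaluations at all time points are independent of each other, which is what makes the sweep parallelizable.

```python
def euler_sequential(f, x0, h, n_steps):
    """Reference solution: each step needs the previous state, so it is inherently serial."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(xs[-1] + h * f(xs[-1]))
    return xs

def picard_parallel(f, x0, h, n_steps, n_iters):
    """Picard sweeps on the discretized integral form x_i = x0 + h * sum_{j<i} f(x_j)."""
    xs = [x0] * (n_steps + 1)  # initial guess: a constant trajectory
    for _ in range(n_iters):
        # These evaluations do not depend on each other and could run concurrently.
        drifts = [f(x) for x in xs[:-1]]
        new, acc = [x0], 0.0
        for d in drifts:       # prefix sum (written serially here for clarity)
            acc += h * d
            new.append(x0 + acc)
        xs = new
    return xs

decay = lambda x: -x
reference = euler_sequential(decay, 1.0, 0.1, 10)
approx = picard_parallel(decay, 1.0, 0.1, 10, n_iters=10)
print(max(abs(a - b) for a, b in zip(reference, approx)))
```

After k sweeps the first k+1 time points already agree with the sequential solution, so n_steps sweeps recover it exactly up to floating-point rounding. The practical gain comes from needing far fewer sweeps than steps for a given tolerance.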
[LG-30] Forensic-Oriented Intrusion Detection Using Synthetic Network Traffic Data and Explainable Artificial Intelligence
Link: https://arxiv.org/abs/2607.00763
Authors: Jose Luis Vela Alonso,Carmen Pellicer
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 23 pages, 8 figures
Abstract:Digital forensic investigations of network intrusions require analytical outputs that are traceable, reproducible, and court-defensible - requirements existing machine learning pipelines do not satisfy, since they treat original evidence as training data and produce opaque classifications without instance-level justification. This paper presents a forensic-oriented intrusion detection framework resolving both problems simultaneously, integrating synthetic data generation, binary classification, and explainability within a single pipeline governed by ISO/IEC 27037, 27041, 27042, and NIST SP 800-86. The framework operationalises the ISO/IEC 27037 requirement for strict separation between original digital evidence and derived analytical artefacts. Original datasets are treated as immutable, hash-verified artefacts; all training operates on parameterized synthetic derivatives via SDV + CTGAN. XGBoost binary classification provides high-performance detection on tabular network flow data, and SHAP TreeExplainer produces instance-level feature attributions mapping statistical predictions to observable network behaviour for forensic reporting. Train-on-Synthetic, Test-on-Real (TSTR) evaluation on CICIDS2017 achieves F1-macro = 0.96, within cross-validation variance of the real-data baseline (0.97). Kolmogorov-Smirnov testing confirms synthetic privacy preservation (mean |KS| = 0.38) alongside operational utility. Cross-dataset validation on UNSW-NB15 and Kitsune identifies feature space dimensionality as the primary determinant of synthetic training effectiveness, establishing a practical deployment boundary of approximately 30 numeric flow-level features. SHAP attributions for Brute Force, Port Scan, and DoS attacks are consistent across real and synthetic instances, confirming synthetic training preserves forensically relevant attack fingerprints required for expert witness testimony. 
[LG-31] MosaicKV: Serving Long-Context LLM with Dynamic Two-D KV Cache Compression
Link: https://arxiv.org/abs/2607.00760
Authors: Sheng Qiang,Ruiwei Chen,Yinpeng Wu,Jinyu Gu,Zhichao Hua,Yubin Xia,Binyu Zang,Haibo Chen
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 15 pages, 10 figures
Abstract:Long-context LLM services now sustain prompts with hundreds of thousands to millions of tokens, making the key-value (KV) cache a first-order serving cost. Because the cache grows linearly with context length, it can exhaust GPU memory, force smaller batches, and reduce serving throughput. Prior KV cache compression techniques typically target only the sequence dimension or only the channel dimension, which leaves limited headroom as context windows scale. Compressing both dimensions promises higher memory reduction, but applying the two forms of compression directly leads to significant accuracy loss. This paper introduces MosaicKV, a dynamic two-D (dimensional) KV cache compression system for extremely long-context serving. MosaicKV uses dynamic two-D compression to address the accuracy challenge, exploiting the non-uniform importance distribution of elements within the KV cache. Instead of applying one compression pattern globally, MosaicKV identifies important elements for each KV vector and selects compression strategies at the granularity of KV cache segments. To address the performance challenge, where fine-grained sparsity and compression management overhead can offset the gains from compression, MosaicKV introduces compressed KV cache management. This mechanism uses underutilized GPU and CPU resources to maintain compressed KV caches and accelerate attention computation. Evaluation on an H800 GPU with multiple LLMs shows that MosaicKV delivers up to 16x attention speedup, 4.8x lower decode latency, and 7.3x higher throughput than the uncompressed baseline. At the same time, it reduces memory usage by 3x and incurs only 1.76% average accuracy loss on LongBench and RULER. 
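The claim that the KV cache grows linearly with context length can be made concrete with back-of-the-envelope arithmetic. The sketch below uses the usual size formula for a standard attention KV cache (one key and one value vector per layer, KV head, and token). The model configuration is hypothetical and is not taken from the paper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2, batch=1):
    """Uncompressed KV cache size: 2 tensors (K and V) per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch

# Hypothetical configuration: 32 layers, 8 KV heads, head_dim 128, fp16 (2 bytes).
per_token = kv_cache_bytes(32, 8, 128, seq_len=1)
for tokens in (128_000, 1_000_000):
    gib = kv_cache_bytes(32, 8, 128, tokens) / 2**30
    print(f"{tokens:>9} tokens: {gib:.1f} GiB ({per_token} bytes per token)")
```

Under these assumptions a single million-token sequence needs roughly 122 GiB for the cache alone. The sequence dimension (`seq_len`) and the channel dimension (`head_dim`) are the two axes along which compression can reduce this product.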
[LG-32] Generative Refinement for Low-Budget Black-Box Optimization
Link: https://arxiv.org/abs/2607.00691
Authors: Edouard R. Dufour,Pascal Fua
Subjects: Machine Learning (cs.LG)
Comments: 20 pages, 7 figures
Abstract:Black-box optimization is a fundamental science and engineering tool that makes it possible to optimize objectives without gradient information. Unfortunately, as it often requires many function evaluations, it can be challenging when each one is costly. This is especially true when the evaluation function is noisy or failure-prone, and when high-performing solutions are confined to thin, curved, or disconnected regions of the search space. Existing methods leveraging generative models to navigate these subspaces are built to sample from reward-aligned distributions. As a result, they require a large number of evaluations to align their sampler effectively, making them impractical in low-budget settings. We propose SPARROW, an algorithm that completely decouples the generative prior from the reward signal. SPARROW can use any sampler with a known corruption process and trained on unevaluated data, as a fixed, structured proposal operator. Optimization proceeds by rank-based guidance over an archive of evaluated candidates. SPARROW can navigate complex geometries, handle unreliable reward signals, and perform effective optimization under very low evaluation budgets. We provide asymptotic convergence guarantees over the sampler support and demonstrate strong empirical performance on problems with unreliable rewards and geometrically complex landscapes.
[LG-33] AdaBoosting Text Prompts for Vision-Language Models ECCV2026
Link: https://arxiv.org/abs/2607.00684
Authors: Seokhee Jin,Changhwan Sung,Sunung Mun,Hoyoung Kim,Jungseul Ok
Subjects: Machine Learning (cs.LG)
Comments: Accepted to ECCV 2026
Abstract:The classification accuracy of pretrained Vision-Language Models (VLMs) relies on the quality of the text prompts. Handcrafted templates and Large Language Model (LLM)-generated descriptions not only make predictions more interpretable, but also enable reuse of the same prompts across heterogeneous VLMs. Recent works construct task-adapted text prompts with a small number of labeled images. However, existing few-shot text prompting methods do not explicitly focus on misclassified examples during prompt construction, leading to only marginal improvements even as more shots become available. To fully exploit few-shot supervision, we propose Text Prompt Boosting (TPB), an AdaBoost-inspired framework that treats each text-prompt-based classifier as a weak learner and sequentially aggregates them into a strong ensemble by explicitly targeting hard, misclassified examples. Extensive experiments show that TPB preserves task-intrinsic, model-agnostic cues in text space, enabling robust cross-model transfer. Across eleven classification benchmarks, TPB improves accuracy on the source model and preserves shot-driven gains when transferred to larger, more capable VLMs, where existing methods struggle to sustain such improvements.
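The way boosting focuses on misclassified examples can be seen in a single round of classic binary AdaBoost. This is a sketch of the textbook update only. How TPB adapts it to text-prompt classifiers and multi-class tasks is not specified in the abstract, so nothing below should be read as the paper's method. Sample weights are assumed to sum to 1.

```python
import math

def adaboost_round(weights, correct):
    """One classic AdaBoost round.

    weights : current sample weights (assumed to sum to 1)
    correct : booleans, True where the weak learner classified the sample correctly
    Returns the weak learner's vote weight alpha and the re-normalized sample weights.
    """
    err = sum(w for w, c in zip(weights, correct) if not c)  # weighted error, needs 0 < err < 1
    alpha = 0.5 * math.log((1.0 - err) / err)
    updated = [w * math.exp(-alpha if c else alpha) for w, c in zip(weights, correct)]
    total = sum(updated)
    return alpha, [u / total for u in updated]

# Four equally weighted examples; the weak learner gets the last one wrong.
alpha, new_weights = adaboost_round([0.25] * 4, [True, True, True, False])
print(alpha, new_weights)
```

After the update the misclassified example carries half of the total weight, so the next weak learner is pushed toward getting it right.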
[LG-34] Distributed Online Bandit Submodular Maximization with Bounded Sampling Violations
Link: https://arxiv.org/abs/2607.00680
Authors: Bin Du,Chang Liu,Dingqi Zhu,Lintao Ye,Dengfeng Sun
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:We study distributed online submodular maximization under partition matroid constraints, in which multiple agents select a limited number of actions from their own subsets sequentially to maximize the cumulative value of a sequence of objective functions. We develop a unified algorithmic framework that accommodates full-information and bandit feedback models. For both feedback models, we prove that the proposed algorithms achieve sublinear (1-1/e) -regret guarantees, which are comparable to those achieved by existing centralized counterparts. Furthermore, to tackle the sampling violation issue caused by continuous relaxation and rounding, we develop a bounded stochastic pipage rounding scheme and show that the probability of sampling violation vanishes asymptotically. As a result, the cumulative sampling violation remains sublinear in T , which is further shown to be not improvable under certain conditions. Numerical results validate the theoretical findings in this paper.
[LG-35] Whats a Credit Worth? A Market Framework for Attribution-Aware Compensation in Generative Music
Link: https://arxiv.org/abs/2607.00641
Authors: Luyang Zhang,Xirui Jiang,Junwei Deng,Beibei Li,Jiaqi W. Ma,Chris Donahue
Subjects: Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:
Abstract:Advances in generative AI are rapidly increasing the quality and commercial value of generated music, and this progress depends on large catalogs of creators’ recordings. This raises a central question for platform design: how should creators be compensated when their work is used to train generative AI models that in turn produce commercial outputs? We develop a framework for fairly compensating creators in generative-music markets, where each creator’s payment depends on a data-attribution score estimating their contribution to model outputs. Compared to past compensation frameworks, our framework has two unique considerations: (1) attribution is traced to entire creator catalogs, not individual songs, and (2) the informativeness (signal-to-noise ratio) of the attribution score is an input to the payment mechanism. The framework yields a closed-form payment rule per creator and measures the welfare cost of inaccurate attribution for both creators and the platform. Whether the welfare-optimal contract is royalty-based or takes the form of fixed-fee licensing depends on how informative attribution is for that creator’s catalog. We show that better attribution translates directly into welfare gains for both creators and the platform, yet under multi-platform competition a platform only captures gains from attribution improvements when its signal becomes the most precise in the market. To ground our framework in empirical behavior, we train acoustic and symbolic music generation models and measure the informativeness of scalable attribution techniques against a leave-one-catalog-out ground truth. Our experiments reveal that noisy attribution signals push payment toward fixed-fee licensing and diminish welfare for both creators and the platform, providing an economic motivation for further research on improved attribution.
[LG-36] Measuring Dead Directions: Decomposing and Classifying Singular Structure off Canonical Alignment
Link: https://arxiv.org/abs/2607.00603
Authors: Tejas Pradeep Shirodkar
Subjects: Machine Learning (cs.LG)
Comments: 45 pages, 14 figures, 19 tables. Methods and empirical companion to arXiv:2606.05957 (Dead Directions: Geometric Singular Learning)
Abstract:We give a descent-free, alignment-free measurement of singular structure on trained networks. At a single frozen checkpoint the read recovers the order $k$ of each dead direction from the directional-Fisher rate, the master invariant from which the per-direction learning coefficient $1/(2k)$ follows exactly, in whatever basis the optimizer left. The same read classifies each direction, separating a genuine singularity, whose order the architecture fixes, from a flat gauge symmetry; the directional-Fisher magnitude settles the cases the order cannot. A pluggable detector supplies the directions for transformer, convolutional, and normalisation layers. The read recovers the architecture-predicted order across constructed cells and trained networks, including a fine-tuned vision transformer whose dead structure is the LayerNorm-kernel gauge and a from-scratch one whose compressed MLP forms a node-death at its activation order. Where the singular structure enumerates, the per-direction orders assemble, through the typed intersection of the loci, into the global coefficient $(\lambda, m)$ matching the closed form. The method removes the canonical-alignment and descent preconditions of the underlying rate result, turning order-recovery into a deterministic, architecture-general reading. We then map its reach into the Watanabe triple: the order determines the universal singular fluctuation $\nu(k)$, though a trained network’s realized $\nu$ falls below it as the live structure absorbs the dead direction’s data fluctuation, and the multiplicity recovers from the dominant structure under a single-locus assumption.
[LG-37] Decision-focused Sparse Tangent Portfolio Optimization ICML2026
Link: https://arxiv.org/abs/2607.00581
Authors: Haeun Jeon,Seunghoon Choi,Hyunglip Bae,Yongjae Lee,Woo Chang Kim
Subjects: Machine Learning (cs.LG)
Comments: ICML 2026
Abstract:Sparse tangent portfolio optimization aims to learn an interpretable, low-cardinality portfolio in the tangency direction of the mean-variance frontier. However, the associated cardinality-constrained formulation is NP-hard, and standard predict-then-optimize pipelines often misalign forecasting accuracy with downstream portfolio quality. We propose an end-to-end decision-focused learning framework that reformulates Sharpe ratio maximization as a Disciplined Parametrized Programming (DPP)-compliant convex programming layer and replaces discrete selection with a smooth top-$k$ operator enforcing an exact cardinality $k$. This enables gradient flow through prediction, asset selection, and re-optimization, allowing the predictive model to directly optimize portfolio performance. Across four major equity markets, our method achieves competitive and often superior out-of-sample Sharpe ratios compared with historical and prediction-focused baselines, with particularly strong gains in larger asset universes. Our code is publicly available at this https URL.
[LG-38] From Structural Equation Modelling to Double Machine Learning: Robustness Analysis for Survey-Based Research
Link: https://arxiv.org/abs/2607.00512
Authors: Ka Ching Chan,Qiana Liu,Sanjib Tiwari,Ranga Chimhundu
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 21 pages, 1 figure, 13 tables
Abstract:Structural equation modelling (SEM) is widely used in survey-based business and information systems research to assess latent constructs and theory-driven structural relationships. However, SEM path significance is obtained within a particular model specification and may not show whether findings remain stable under alternative estimation frameworks. This study develops and demonstrates a staged robustness analysis framework that connects SEM, ordinary least squares (OLS) regression, and Double Machine Learning (DML). SEM is first used to refine the measurement structure and estimate the robustness-baseline SEM model, in which the full theory-specified structural path system is retained for downstream robustness analysis before final structural path evaluation. OLS regression is then applied to SEM-derived construct scores as a transparent regression benchmark. Finally, DML-style residualisation is used to examine whether each tested focal relationship remains stable after flexible machine-learning-based adjustment for observed controls. Learner-sensitivity checks compare Random Forest, Gradient Boosting, and Support Vector Machine learners, and selected reverse-direction diagnostics are used to examine directional sensitivity. The framework is demonstrated using a FinTech Digital Customer Intimacy survey model. The findings identify which relationships are stable across SEM, OLS, and DML-style checks, and which require more cautious interpretation. A reproducible Google Colab workbook and generated result files are publicly available, providing a reusable template that researchers and students can adapt to other survey-based latent-construct studies. The paper contributes a practical robustness workflow and interpretation guide for survey-based researchers seeking to complement SEM with conventional and machine-learning-based robustness checks.
[LG-39] Prototype Language Models
Link: https://arxiv.org/abs/2607.00510
Authors: Dan Ley,Giang Nguyen,Himabindu Lakkaraju,Julius Adebayo
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Knowing which training examples drive outputs is fundamental to auditing, correcting, and understanding language models, yet for modern LLMs this remains expensive, approximate, and largely post-hoc. Standard language models generate tokens through a dense network pathway, causing training data’s influence to be distributed across parameters rather than organized along explicit, traceable components. We introduce a prototype language model architecture, Prototypes for Interpretable Sequence Modeling (PRISM), that forms each prediction via a sparse, non-negative mixture of learned prototypes, trained with clustering objectives that anchor each prototype to coherent neighborhoods of training examples. Across architectures from 130M to 1.6B parameters trained on up to 50B tokens, prototype language models either surpass or remain within 2.5 percentage points on average downstream accuracy of matched dense baselines. We show that sparse prototype structure localizes curvature in the loss landscape, yielding a more tractable Hessian and enabling training data attribution that is ~500x faster than post hoc baselines when consuming equivalent memory. Calibrating linear prototype controllers can improve downstream accuracy by roughly 3 points while tracing those corrections back to training neighborhoods, and targeted prototype suppression can remove model behaviors without finetuning or measurable loss in generation quality.
[LG-40] Ghost in the Kernel: In-Context Learning with Efficient Transformers via Domain Generalization
Link: https://arxiv.org/abs/2607.00479
Authors: Peilin Liu,Ding-Xuan Zhou
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Transformer-based large models have demonstrated remarkable generalization abilities across different tasks by leveraging a context-aware attention module for in-context learning. With richer context, transformers adapt more effectively to the current use case without any parameter updates. However, the quadratic computational and memory complexity with respect to context length significantly slows data processing in softmax transformers. Linear transformers were proposed to address this issue by reducing the complexity to linear dependence on context length, but the design and understanding of the feature mapping in linear attention, from a theoretical viewpoint, remain unclear. In this paper, we investigate the approximation and generalization abilities of linear transformers under a two-staged sampling process from domain generalization. We show that linear transformers perform in-context learning as learning a mapping from context distributions to response functions. A dimension-independent convergence rate is obtained for our generalization analysis, which also exhibits the tradeoff between the regularities of data distributions and latent features. Guided by our theoretical framework, we propose a new perspective on activation and loss design for linearizing pretrained softmax large language models.
[LG-41] Interpretable vs Learned Encoders for High-Cardinality Fraud Detection
Link: https://arxiv.org/abs/2607.00477
Authors: Xiao Han,Jingjing Liu,Moxuan Zheng,Zhen Zhang,Chenyu Wu
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Comments:
Abstract:Seven categorical encoding methods were tested on the IEEE-CIS fraud benchmark dataset (590,540 records, 3.5% positives, 8 high-cardinality columns). The encoders were evaluated using stratified 5-fold cross-validation (CV) with three repetitions. Five of the encoders shared an identical frozen LightGBM learner in the downstream phase, allowing controlled comparisons among them. CatBoost and TabNet were included as cross-paradigm comparisons using different learners. Entity embeddings produced the highest AUC-ROC (0.9612), statistically tied with CatBoost (0.9602) and statistically superior to tier group encoding (0.9548), whereas target encoding was only 0.0023 worse than tier group encoding and the auditor-friendly tier boundaries were maintained. Off-the-shelf TabNet did not outperform tree-based pipelines and collapsed under data scarcity. On AUC-PR, CatBoost leads (0.822 vs. 0.793); no encoder dominated both metrics. Per-column analysis confirmed the embedding advantage arises from joint multi-column representation.
[LG-42] How Early Is Early Enough? Design-Dependent Observation-Window Sufficiency in Subscription Churn Prediction
Link: https://arxiv.org/abs/2607.00473
Authors: Xiao Han,Yao Xiao,Chenyu Wu,Tongchen Zhang
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:How many days of early behavior suffice for subscription churn prediction? In the public KKBox dataset, the early signal of churn typically reflects a user's contract status; in the high-churn manual-renewal segment, however, access to early behavior substantially improves prediction for that segment (PR +0.10 at 120 days). A nine-window sufficiency curve shows a diminishing-returns knee in the 45-90 day band. However, stress-testing across three cohort/task designs shows that this curve is specific to the design being tested; for example, with a moving target the curve inverts, and it can shift depending on the feature set used. Therefore, any window-sufficiency claim should state its cohort construction, target definition, and feature families. All evidence is from one music-streaming dataset; the mechanism should generalize but the magnitudes may not.
[LG-43] TimeSynth: A Temporal Fidelity Framework for Health Signal Digital Twins
Link: https://arxiv.org/abs/2607.00431
Authors: Md Rakibul Haque,Shireen Elhabian,Warren Woodrich Pettine
Subjects: Machine Learning (cs.LG)
Comments: Under review at Nature Communications
Abstract:Forecasting models for health-signal digital twins must preserve the oscillatory, frequency, phase, and state-transition dynamics of physiological signals, yet the pointwise metrics used to benchmark them cannot detect when these fundamental properties are lost. We show that this blind spot misranks models: across 11 architectures, models with comparable pointwise error diverge by up to 53° in phase accuracy, equivalent to roughly 123 ms for a 1.2 Hz cardiac rhythm and invisible to standard metrics. To enable development of models that escape such failures, we introduce TimeSynth, a controlled benchmarking framework with two reusable components: a physiologically grounded generator producing signals with analytically known ground-truth dynamics from parametric models fitted to real electroencephalography, electrocardiography and photoplethysmogram signals, along with diagnostics quantifying amplitude, frequency, phase, and state-transition fidelity. Linear and full-sequence attention models systematically lose frequency and phase information despite acceptable amplitude error, whereas architectures with localized temporal structure better preserve dynamical fidelity and adapt to observable state transitions; none, however, reliably preserves stochastic switching. Because the dominant determinant of fidelity is architectural, model choice becomes a principled, use-case-driven decision rather than a search for a single winner. TimeSynth thus supplies the controlled preclinical stress test missing before models are coupled to patient data, with a reusable generator and diagnostics for fidelity-aware development.
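The phase-to-time conversion quoted in the abstract above (53° at 1.2 Hz, roughly 123 ms) can be checked with a few lines of arithmetic. This is only an illustrative sketch; the function name `phase_error_to_ms` is hypothetical and not taken from the paper.

```python
def phase_error_to_ms(phase_deg: float, freq_hz: float) -> float:
    """Convert a phase error in degrees into a time shift in milliseconds."""
    period_s = 1.0 / freq_hz            # one full cycle, in seconds
    fraction_of_cycle = phase_deg / 360.0
    return fraction_of_cycle * period_s * 1000.0


# 53 degrees of phase error on a 1.2 Hz cardiac rhythm
shift_ms = phase_error_to_ms(53.0, 1.2)
print(f"{shift_ms:.1f} ms")  # prints "122.7 ms", i.e. roughly 123 ms
```

The same phase error maps to a shorter time shift at higher frequencies, which is why phase fidelity is reported in degrees rather than milliseconds.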
[LG-44] SAOT: Self-Supervised Continual Graph Learning with Structure-Aware Optimal Transport ICML2026
Link: https://arxiv.org/abs/2607.00377
Authors: Yuting Zhang,Yanbei Liu,Zhitao Xiao,Lei Geng,Yanwei Pang,Xiao Wang
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: The paper has 9 pages of text and 13 pages in total (including acknowledgments, impact statement, references, and appendix), with 6 figures and 4 tables. This paper has been accepted by ICML 2026 conference and this is a final version of the manuscript submitted to the conference
Abstract:Self-supervised Continual Graph Learning (CGL) aims to successively learn from a graph sequence with different tasks without label supervision - a paradigm that has attracted widespread attention. Most existing self-supervised CGL methods rely on instance-level consistency objectives that enforce stability of individual node (or node-pair) embeddings. Due to optimizing nodes in isolation, these methods fail to maintain global relational structure, causing inter-node correspondences to progressively distort under continual learning. To this end, we propose a novel Structure-Aware Optimal Transport (SAOT) framework that explicitly captures and preserves relational structure within graph representations across sequential tasks. Specifically, SAOT leverages optimal transport theory to capture global inter-node correspondences, thereby facilitating and enhancing graph representation learning. Simultaneously, SAOT incorporates a cross-task knowledge distillation mechanism to preserve the previous structural knowledge. Extensive experiments on four CGL benchmark datasets demonstrate that SAOT outperforms existing self-supervised baselines. In particular, SAOT achieves significant performance gains, improving average accuracy by up to 5% on CoraFull-CL and over 15% on Products-CL compared with state-of-the-art methods in the Class-IL setting.
[LG-45] PRISM: Prioritized Channel Importance with Semi-supervised Domain Adaptation for Cross-Subject EEG Emotion Recognition
Link: https://arxiv.org/abs/2607.00358
Authors: Xin Zhou,Xiang Zhang,Hao Deng,Lijun Yin
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Electroencephalogram (EEG) captures endogenous brain activity with high temporal fidelity and holds substantial promise for precise emotion decoding. However, channel redundancy and pronounced inter-subject variability remain key obstacles to scalable generalization. To address these limitations, we propose a novel framework termed PRioritized channel Importance with Semi-supervised doMain adaptation (PRISM), enabling label-efficient cross-subject emotion decoding. On the channel side, PRISM assigns differentiable, data-dependent channel weights via a lightweight expert ensemble, amplifying reliable electrodes while suppressing distractors. On the domain side, PRISM leverages unlabeled data through confidence-filtered pseudo-labels to drive consistency regularization and domain alignment, mitigating subject-specific heterogeneity. Extensive experiments show that PRISM surpasses state-of-the-art methods on DEAP, DREAMER, and SEED datasets, achieving robust cross-subject generalization given limited annotations.
[LG-46] Generative Modeling of Quantum Distribution with Functional Flow Matching
Link: https://arxiv.org/abs/2607.00301
Authors: Jaehoon Hahm,Tak Hur,Joonseok Lee,Daniel K. Park
Subjects: Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: Accepted as an extended abstract at the Quantum Techniques in Machine Learning (QTML) 2024
Abstract:The emergence of powerful deep generative models based on diffusion and flow matching has enabled the learning and modeling of complex distributions. Learning quantum distributions, however, remains challenging due to the inherent difficulty of accurately modeling the meaningful physical properties of quantum states. We propose Quantum Flow Matching (QFM), a novel generative model designed to learn quantum distributions by utilizing the spin Wigner function and flow matching. By converting the density matrix into the spin Wigner function and leveraging functional flow matching to learn distributions in function space, QFM enables accurate and effective learning of multi-qubit quantum distributions. We demonstrate the effectiveness of our method by evaluating physical quantities such as trace, purity, and entanglement entropy of the generated quantum states, accurately capturing the underlying physics of the given quantum distributions.
[LG-47] Self-Organized Learning in Oscillatory Neural Networks with Memristive Signed Couplings
Link: https://arxiv.org/abs/2607.00286
Authors: Riley Acker,Aman Desai,Garrett Kenyon,Frank Barrows
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Adaptation and Self-Organizing Systems (nlin.AO)
Comments: 14 pages single column
Abstract:Oscillatory neural networks (ONNs) have emerged as a promising neuromorphic architecture, leveraging coupled dynamical systems to perform computation and represent information through phase relationships. Their interactions can be designed to support intrinsic energy-minimizing dynamics, enabling tasks such as associative memory and optimization, and positioning them as a candidate architecture for continuous learning and inference. We present a neuromorphic primitive implemented using memristive edges with inhibitory couplings as a potential design for autonomous learning, and provide circuit simulation validation that the system is capable of denoising noisy inputs on an auto-associative task. While numerical Hopfield/Ising models routinely assume signed weights, neuromorphic implementations of ONNs often fail to realize negative weights due to device and circuit constraints. A practically implementable route to inhibitory (negative) weights is particularly valuable: it expands the class of attractor structures accessible to oscillator networks beyond purely synchronous couplings, and supports phase-coded memories where anti-phase constraints are not merely transiently enforced during training but can persist autonomously after release. We provide circuit simulations and theoretical analyses demonstrating that signed effective weights are necessary for anti-phase attractors to persist autonomously.
[LG-48] Understanding Guest Preferences and Optimizing Two-sided Marketplaces: Airbnb as an Example KDD2024
Link: https://arxiv.org/abs/2607.00280
Authors: Yufei Wu,Daniel Schmierer
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Econometrics (econ.EM); Applications (stat.AP)
Comments: 5 pages, 3 figures. Presented at the KDD 2024 Workshop on Two-Sided Marketplace Optimization, Barcelona, Spain
Abstract:Airbnb is a community based on connection and belonging – many hosts on Airbnb are everyday people who share their worlds to provide guests with the feeling of connection and being at home; Airbnb strives to connect people and places. Among our efforts to connect guests and hosts, we provide tools to enable hosts to set competitive prices, which helps improve affordability for guests while helping hosts get more bookings. We also personalize the guest experience to show them the listings that match their needs. To help inform these efforts, we combine economic modeling and causal inference techniques to understand how guests book stays based on the prices hosts set, among other factors, and how that preference varies across different guests and listings. Such understanding helps us identify opportunities for Airbnb to support the marketplace and better connect guests and hosts. For example, understanding how much guests respond to different prices helps optimize the tools that we provide to hosts, in order to enable hosts to choose and set competitive prices that further balance demand and supply. As another example, understanding heterogeneity in guest preferences helps us personalize the guest experience and better match them with the listings that meet their needs, based on how much they respond to different prices and other factors.
[LG-49] Learning dynamical systems from noisy data with Weak-form Kernel Ridge Regression
Link: https://arxiv.org/abs/2607.00257
Authors: Max Kreider,John Harlim,Daning Huang
Subjects: Machine Learning (cs.LG); Dynamical Systems (math.DS)
Comments:
Abstract:Accurate prediction of complex dynamical systems from noisy measurements remains a significant challenge in scientific computing. Kernel ridge regression learning strategies are often effective when applied to clean data, but have limited success with noisy data. Recent work has observed that a weak formulation can act to filter noisy data, and different learning strategies have achieved increased noise robustness with a weak-form framework. In this manuscript, we give an overview of the filtering mechanism behind the weak formulation and provide a bias-variance error decomposition. Using these insights, we combine a weak formulation with a kernel learning strategy to propose Weak-form Kernel Ridge Regression (WKRR) for learning dynamical systems. The proposed framework is simple to implement, effective for both clean and noisy data, and outperforms several baseline methods. We demonstrate the performance of WKRR on chaotic benchmark systems in up to 64 dimensions, as well as 15,000-dimensional real-world fluid data.
[LG-50] Distributionally Robust Linear Regression With Block Lewis Weights ICLR2026
Link: https://arxiv.org/abs/2607.00252
Authors: Naren Sarayu Manoj,Kumar Kshitij Patel
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: ICLR 2026. Comments welcome!
Abstract:We present an algorithm for the group distributionally robust (GDR) least squares problem. Given $m$ groups, a parameter vector in $\mathbb{R}^d$, and stacked design matrices and responses $\mathbf{A}$ and $\mathbf{b}$, our algorithm obtains a $(1+\varepsilon)$-multiplicative optimal solution using $\widetilde{O}(\min\{\mathsf{rank}(\mathbf{A}), m^{1/3}\varepsilon^{-2/3}\})$ linear-system-solves of matrices of the form $\mathbf{A}^\top\mathbf{B}\mathbf{A}$ for block-diagonal $\mathbf{B}$. Our technical methods follow from a recent geometric construction, block Lewis weights, that relates the empirical GDR problem to a carefully chosen least squares problem and an application of accelerated proximal methods. Our algorithm improves over known interior point methods for moderate accuracy regimes and matches the state-of-the-art guarantees for the special case of $\ell_\infty$ regression. We also give algorithms that smoothly interpolate between minimizing the average least squares loss and the distributionally robust loss.
[LG-51] Device Passport: Enabling Spatio-Temporal Pretrained Models to Generalize Across Input Layouts ICML2026
Link: https://arxiv.org/abs/2607.00249
Authors: Geeling Chau,Ran Liu,Juri Minxha,Wenhui Cui,Erdrin Azemi,Ellen L. Zippi,Behrooz Mahasseni,Christopher M. Sandino
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: Workshop on Structured Data for Health, ICML 2026
Abstract:New device layouts pose a challenging modeling problem due to the lack of large datasets for each specific layout. Biosignal foundation models offer a plausible solution if they are able to generalize to new layouts effectively. To improve cross-layout transfer, we study how different channel embedding techniques behave when pretraining layouts differ substantially from the downstream decoding layout. We propose Device Passport, a new channel embedding technique that learns experts and mixture models that take each channel’s functional activity and metadata as input. This contrasts with prior embedding methods, which typically use only functional information or only metadata to look up learned or fixed positional embeddings. Across controlled subset-transfer experiments and realistic transfer to ear-EEG, Device Passport is competitive overall and improves over the strongest learned baseline in the layout-transfer regimes that motivate this work. These results suggest that channel embedding design is a key consideration when reusing large-scale pretrained biosignal models on new devices.
[LG-52] StateFlow: Dual-State Recurrent Modeling for Long-Horizon Time Series Forecasting
Link: https://arxiv.org/abs/2607.00197
Authors: Haroon Gharwi,Yue Dai,Kai Shu
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Long-horizon multivariate time series forecasting (LTSF) remains challenging due to non-stationarity, regime shifts, and error accumulation. The Variability-Aware Recursive Neural Network (VARNN) is designed to track such variability by maintaining a residual-memory state driven by one-step prediction errors. However, its original formulation is limited to one-step sequence regression and does not directly support multi-step forecasting. In this work, we extend VARNN to long-horizon forecasting and introduce StateFlow, a recurrent forecasting framework that uses VARNN as a dual-state recurrent backbone to capture two complementary signals from the lookback sequence: a hidden-state trajectory representing primary temporal dynamics, including trend, seasonality, level changes, and recurring patterns, and a residual-memory trajectory representing structured local prediction deviations, driven from a nonlinear recurrent transformation of errors between one-step base predictions and observed values. A chunk-based decoder separately summarizes these trajectories and maps them to the future horizon for direct multi-step forecasting. We further employ a two-stage optimization strategy that first trains the VARNN encoder through a one-step base prediction objective to optimize the internal representations over the lookback sequence, and then trains a horizon-specific decoder for direct multi-step forecasting. Experiments on standard LTSF benchmarks show that StateFlow achieves competitive performance against strong linear, recurrent, convolutional, and Transformer-based baselines while preserving linear recurrent encoding and a compact model design.
[LG-53] TRIE: An Evaluation Framework for Stochastic PDE Surrogates
Link: https://arxiv.org/abs/2607.00196
Authors: Bharat Srikishan,Javier E. Santos,Nikhil Muralidhar,Charles D. Young
Subjects: Machine Learning (cs.LG)
Comments: 17 pages, 10 figures
Abstract:Many scientific systems exhibit uncertainty from stochastic forcing, unresolved degrees of freedom, or imperfect observations, making reliable surrogate forecasting fundamentally distributional rather than pointwise. For such systems, deterministic neural surrogates fail to capture statistical measures and forecast uncertainty. We introduce TRIE, an evaluation framework for stochastic PDE surrogates that asks whether models reproduce invariant measures, provide trustworthy predictive uncertainty, and scale to efficient probabilistic generation. We demonstrate TRIE on two stationary chaotic spatially extended SPDEs, stochastic Kuramoto–Sivashinsky and stochastic Kolmogorov flow, across 11 parameter values. Our evaluation shows that standard pointwise-trained neural surrogates can produce plausible short rollouts while failing to match long-time statistical structure. Approximate uncertainty methods such as Monte Carlo dropout and heteroscedastic Gaussian likelihoods produce stochastic forecasts, but are often miscalibrated and overconfident under temporal and spatial uncertainty diagnostics. Across these criteria, generative models provide the most consistent performance, accurately capturing invariant measure statistics and achieving the lowest CRPS in all reported probabilistic settings. Finally, we show that latent generative models with automatic dimension discovery retain much of this statistical fidelity while reducing Kolmogorov inference time by roughly 12x. We release our code and data at this https URL to support reproducible evaluation of stochastic PDE forecasting models.
[LG-54] TallyTrain: Communication-Efficient Federated Distillation
Link: https://arxiv.org/abs/2607.00173
Authors: Radhakrishna Achanta,Will Reed
Subjects: Machine Learning (cs.LG)
Comments: 27 pages, 7 figures, 12 tables
Abstract:Federated learning is bandwidth-bound on two orthogonal axes: model size, which limits how often parameter-averaging methods can afford to merge, and class count, which makes per-probe soft-label distillation prohibitive at large vocabularies. Both ceilings tighten as modern systems scale. We collapse the class-count axis to $\lceil \log_2 C \rceil$ bits per probe by transmitting only each peer’s $\arg\max$ class index, where $C$ is the number of output classes. The resulting protocol, TallyTrain, is not merely compressed: under non-IID training it can be preferable to soft-label distillation, because under-trained peers are confidently wrong and majority voting filters this noise where soft-label averaging amplifies it. Across standard benchmarks, TallyTrain matches or beats soft-label distillation at up to three orders of magnitude less communication. We also relax the model-size axis: we compose the cheap hard-label consensus with sparse parameter merges to obtain a bandwidth-bridge variant, which Pareto-dominates every tested operating point of the standard FedAvg, FedProx and FedDF baselines.
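The two claims in the abstract above (a hard label costs only about log2(C) bits, and majority voting can filter out a confidently wrong peer that soft-label averaging lets through) can be illustrated with a toy example. This is a minimal sketch, not the TallyTrain protocol itself: the function names, the three-peer probabilities, the float32-per-class cost of a soft label, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter


def hard_label_bits(num_classes: int) -> int:
    """Bits to encode one class index, i.e. ceil(log2(C)) for C >= 1."""
    return (num_classes - 1).bit_length()


def soft_label_bits(num_classes: int, bits_per_value: int = 32) -> int:
    """Bits to send a full probability vector (float32 per class is an assumption)."""
    return num_classes * bits_per_value


def argmax(probs: list[float]) -> int:
    """Index of the largest probability (first index wins on ties)."""
    return max(range(len(probs)), key=lambda c: probs[c])


def majority_vote(votes: list[int]) -> int:
    """Most common class index; ties go to the smallest index (an assumption)."""
    counts = Counter(votes)
    top = max(counts.values())
    return min(c for c, n in counts.items() if n == top)


def soft_average_label(peer_probs: list[list[float]]) -> int:
    """Average the peers' probability vectors, then take the argmax."""
    num_classes = len(peer_probs[0])
    mean = [sum(p[c] for p in peer_probs) / len(peer_probs) for c in range(num_classes)]
    return argmax(mean)


# Hypothetical probe: two peers mildly prefer class 0, one is confidently wrong.
peers = [
    [0.60, 0.40, 0.00],
    [0.60, 0.40, 0.00],
    [0.01, 0.98, 0.01],
]
hard = majority_vote([argmax(p) for p in peers])
soft = soft_average_label(peers)
print(hard, soft)                                      # prints "0 1"
print(hard_label_bits(1000), soft_label_bits(1000))    # prints "10 32000"
```

In this toy case the single overconfident peer drags the averaged distribution to class 1, while the vote over argmax indices stays at class 0. With 1000 classes the per-probe cost drops from 32000 bits to 10 bits under the float32 assumption.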
[LG-55] Verifiable Rewards for Calibrated Probabilistic Forecasting
Link: https://arxiv.org/abs/2607.00164
Authors: Sadanand Singh, Allam Reddy, Manan Chopra
Subjects: Machine Learning (cs.LG)
Abstract:Reinforcement learning with verifiable rewards can in principle train calibrated probabilistic forecasters, since a proper scoring rule such as the Brier score is computed from outcomes alone and is minimized in expectation by the true probability. In practice it degrades calibration, and existing remedies address epistemic uncertainty, where a model’s confidence accompanies a verifiably correct or incorrect answer. We study aleatoric forecasting, where the forecast itself is the output and the label is one stochastic outcome, taking NFL in-game win probability as a testbed with the betting market as a reference. Rewarding the realized per-play outcome fails, because the single outcome is a noisy target and the policy gradient corrupts the chain of thought. We introduce a verifiable, label-free reward, a state-conditioned empirical win rate estimated from past outcomes, that removes the label noise, and we keep the gradient off the reasoning, by direct prediction or a gradient mask, so it cannot be corrupted. Trained with this reward alone, without human labels or supervised fine-tuning, a 7B model reaches the calibration of the betting market by direct prediction and is better calibrated than a zero-shot frontier model. That frontier model and a tabular estimator reach the same Brier score as this model, identifying the market’s small remaining edge as live in-game information beyond their shared inputs. Masking the gradient, rather than dropping the chain of thought, preserves reasoning from which the forecast follows, which ordinary chain-of-thought training corrupts.
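A minimal sketch of the two reward signals contrasted in the abstract follows. The state keys, the toy history, and the choice of a negative squared error as the reward shape are assumptions for illustration; the abstract only states that the reward is a state-conditioned empirical win rate estimated from past outcomes.

```python
from collections import defaultdict


def brier(forecast: float, outcome: int) -> float:
    """Brier score for a single binary outcome (lower is better)."""
    return (forecast - outcome) ** 2


def empirical_win_rates(history):
    """Win rate per game state, estimated from past (state, outcome) pairs."""
    wins = defaultdict(int)
    games = defaultdict(int)
    for state, outcome in history:
        games[state] += 1
        wins[state] += outcome
    return {state: wins[state] / games[state] for state in games}


def label_free_reward(forecast: float, state: str, rates) -> float:
    """Assumed reward shape: negative squared error against the empirical rate."""
    return -((forecast - rates[state]) ** 2)


# Hypothetical history of coarse game states and final outcomes (1 = win).
history = [
    ("lead_by_7_q4", 1), ("lead_by_7_q4", 1),
    ("lead_by_7_q4", 0), ("lead_by_7_q4", 1),
    ("tied_q1", 1), ("tied_q1", 0),
]
rates = empirical_win_rates(history)

# The same forecast earns very different per-play Brier scores depending on
# which single outcome happens to be realized, but a stable label-free reward.
noisy_if_loss = brier(0.75, 0)
noisy_if_win = brier(0.75, 1)
stable = label_free_reward(0.75, "lead_by_7_q4", rates)
print(rates, noisy_if_loss, noisy_if_win, stable)
```

The expected Brier score for a forecast f when the true win probability is p equals (f - p)^2 + p(1 - p), so the realized-outcome target carries irreducible variance that the empirical-rate target removes.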
[LG-56] FRAME: Learning the Adaptation Domain with a Mixture of Fractional-Fourier Experts
Link: https://arxiv.org/abs/2607.00162
Authors: Tom Saliencro, Maya Lindqvist, Rohan Desai, Priya Nair, Daniel Whitmore
Subjects: Machine Learning (cs.LG)
Abstract: Parameter-efficient fine-tuning (PEFT) reparameterizes weight updates in a fixed basis: low-rank adapters operate in the spatial domain, while a recent line of spectral methods operates in a fixed Fourier domain. We argue that the choice of domain is itself a design degree of freedom that should be learned, and that no single basis is optimal across tasks, layers, or tokens. We introduce Fractional-Fourier Mixture of Experts, a mixture-of-experts adapter in which every expert carries a learnable fractional-Fourier order that continuously interpolates between the spatial domain (recovering vanilla LoRA) and the Fourier domain (recovering a spectral adapter). Routing tokens through experts that occupy different points on this spatial-spectral continuum lets the model place each low-rank update in the domain where it is most compact and, because fractional-Fourier operators of different orders are mutually incoherent, makes the experts naturally decorrelated, which reduces interference and improves multi-task composition. The order is a single scalar per expert, trained with a separate optimizer, and the transform is computed with an $\mathcal{O}(d \log d)$ chirp-FFT surrogate, so Fractional-Fourier Mixture of Experts adds negligible cost over standard MoE-LoRA. Across commonsense, mathematical, code, and knowledge benchmarks on LLaMA-3.1-8B and Qwen2.5-7B, Fractional-Fourier Mixture of Experts improves over strong MoE-LoRA and spectral baselines, including FlyLoRA, FourierMoE, and HMoRA, while keeping the active-parameter budget small, and analysis shows that the learned orders specialize by task and layer in interpretable ways.
[LG-57] A Filtered Mixture-of-Generators for Fully Synthetic Survival Training
Link: https://arxiv.org/abs/2607.00127
Authors: Niccolò Maria Rizzi, Eugenio Lomurno, Alberto Archetti, Matteo Matteucci
Subjects: Machine Learning (cs.LG)
Abstract: Survival analysis models time-to-event data, but in clinical settings training data are costly and scarce: events accrue over years of follow-up, cohorts are small, and privacy regulations restrict sharing across institutions. Tabular generative models promise augmentation and privacy-preserving cohort sharing, yet are themselves data-hungry: on the small cohorts typical of survival analysis, a single generator rarely characterizes the population well enough for downstream models trained on its output to match real-data performance. FoGS (Filtered Mixture-of-Generators for Survival analysis) reframes synthetic-data construction as sample selection rather than generation. A candidate pool is drawn from four architecturally distinct tabular generators, and each sample is scored by an ensemble of seven survival models trained on real data, using proper scoring rules as a per-sample plausibility proxy. A two-level pipeline optimizes, in its outer loop, a selection policy (generator quotas, scorer weights, a random complement, and stratified balancing on event time and censoring) against held-out downstream performance, while an inner loop tunes the downstream model (XGBoost-Cox). On 16 public datasets under train-on-synthetic, test-on-real (C-index and IBS, 0-100 scale), FoGS yields mean improvements of +2.17 in C-index and +0.67 in IBS, improving both metrics on 9 of 16 datasets and at least one on 13 (one-sided Wilcoxon $p=0.039$ and $p=0.035$). It matches or exceeds real-data training on most cohorts, with no significant change in nearest-neighbour privacy margin relative to unfiltered sampling. Sample filtering over a heterogeneous generator pool is thus a viable substitute for real-data training in privacy-restricted clinical settings.
[LG-58] SemiScope: Disentangling Classifier Tuning and Joint Optimization in Semi-Supervised Security Classification
Link: https://arxiv.org/abs/2607.00113
Authors: Rui Shu, Tianpei Xia, Jingzhu He
Subjects: Machine Learning (cs.LG)
Abstract: Background. Labeled data for security classification is scarce. Semi-supervised learning (SSL) propagates labels from a small labeled pool to larger unlabeled pools. Yet security applications often use SSL as a black box: default parameters, a fixed classifier, and no handling of pseudo-label-induced class imbalance. Aims. Recent work reports sizeable gains from optimizing SSL pipelines via joint search, AutoML, or per-component tuning. These gains are hard to attribute: they may reflect useful SSL-classifier interactions, or they may come mostly from simply tuning the downstream classifier. We disentangle these effects for binary tabular security data with classical SSL and tree-based classifiers. Method. We build SemiScope as an analysis instrument, not a deployment recommendation. It uses Bayesian Optimization to jointly tune SSL settings, confidence filtering, oversampling, and the classifier. The key control, Tuned-Clf, fixes SSL to defaults but gets the same 100-trial classifier budget and validation-set threshold tuning as SemiScope. At 10% labels, we compare them with paired TOST using a ±1.0 g-measure smallest effect of interest. Results. SemiScope beats every default SSL baseline on all five datasets, improving over the strongest by 0.7-12.7 points. Under the equal-budget control, Tuned-Clf is statistically equivalent to the full pipeline on 4 of 5 datasets; Phishing is inconclusive. Classifier HPO alone recovers a median 86% of SemiScope’s gain over Default Self-Training (ST) + Random Forest (RF). Conclusions. The reusable contribution is the decomposition protocol. A simpler recipe suffices: use Self-Training, tune the classifier with Bayesian Optimization, and tune the decision threshold on validation data. It reaches within 1 g-measure of Supervised RF at 20-30% labels on four datasets and 40% on Drebin, at the same or lower label rate than Default ST + RF on every dataset.
[LG-59] Representation as a Bottleneck for Mechanistic Interpretability: The Manifestation Unit Protocol
Link: https://arxiv.org/abs/2607.00089
Authors: Hussein Chouman, Wataru Sasaki, Tomokazu Matsui, Hirohiko Suwa, Keiichi Yasumoto
Subjects: Machine Learning (cs.LG)
Comments: 65 pages. Interactive demos: this https URL, this https URL
Abstract:Mechanistic interpretability has produced a rich inventory of component-level analyses that characterise what neural-network components encode and how they interact. Their outputs, however, are not easily reusable: selectivity tables, circuit diagrams, and feature lists remain locked in per-study notebooks - non-composable, not queryable in natural language, and not directly actionable for downstream audit or intervention. We study the representation layer that sits between these analyses and downstream use as a bottleneck that can be evaluated independently, and introduce Manifestation Units, a typed tuple protocol (E, S, R, D, G) extended with attention-head primitives (T) for transformer architectures, organising per-component statistics into structured fields populated automatically and queried through hybrid retrieval. Instantiated across generative vision (beta-VAE), discriminative vision (CNN), and language (GPT-2), the protocol supports two findings: typed structure substantially outperforms unstructured baselines on retrieval, and CNN filters retrieved by the schema satisfy causal sufficiency and necessity criteria under matched-budget controls. The schema absorbs attention-head primitives without modification, set-recovers known IOI circuit members under retrieval-budget-matched controls, and reveals an irreducible two-field core (S+R) with remaining fields either redundant or actively interfering. We present this as schema infrastructure for mechanistic interpretability rather than frontier-scale validation.
[LG-60] Urban Deceleration Behavior Modes Under Scene Context: An Early-Kinematic Classifier from Argoverse 2 Multi-Agent Trajectories
Link: https://arxiv.org/abs/2607.00027
Authors: Eni Solomon Laughter
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Signal Processing (eess.SP); Systems and Control (eess.SY)
Abstract: Urban deceleration is one of the most empirically studied yet least taxonomically organized behaviors in car-following research. Recent perception-equipped autonomous-vehicle datasets enable trajectory-anchored mode discovery. We extract 1,219 sustained deceleration events from 234 urban driving logs of the Argoverse 2 Sensor dataset, encode each event in a 19-dimensional kinematic feature vector, discover behavioral modes via K-means clustering with bootstrap stability analysis, and quantify modulation by eleven scene-context variables. A HistGradientBoosting classifier predicts mode membership from the first 1.0 s of each event. Four stable modes emerge with a bootstrap Adjusted Rand Index of 0.897 across 50 resamples: anticipatory soft (62.8%), reactive closing (30.6%), brake-like jerk (4.8%), and an outlier category (1.8%). Only pair age shows a medium effect ($\epsilon^2 = 0.085$); scene geometry and vulnerable-road-user proximity show negligible effects. The early-event classifier achieves macro-F1 = 0.758 at 1.0 s, with scene context contributing +0.059 F1 over kinematics alone. Modes are regime-invariant in medium-speed driving (ARI = 0.817) but regime-dependent at low speed (ARI = 0.166). A small set of stable kinematic modes structures urban deceleration; early-window jerk dominates predictive signal; and pair age is the primary contextual modulator.
[LG-61] Group-invariant Coresets for Data-efficient Active Learning
Link: https://arxiv.org/abs/2607.01089
Authors: L. C. Ayres, J. C. M. Bermudez, S. J. M. de Almeida, R. A. Borsoi
Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
Abstract:Active learning reduces labeling cost by querying the most informative unlabeled samples, but standard coreset methods ignore known data symmetries and can waste budget on transformed versions of the same instance. We propose GRINCO, a group-invariant coreset framework that performs acquisition in the quotient space induced by a transformation group, so that selection operates on orbits rather than raw samples. The method uses either canonical representatives or learned orbit-separating invariant embeddings to define practical quotient metrics, and combines quotient-space k-center selection with invariant training through an orbit-averaged loss. We further derive a generalization bound that relates excess orbit-averaged risk to quotient-space coverage, label uncertainty, and intra-orbit variability. Experiments on synthetic scale-invariant data and image benchmarks with rotation-induced redundancy show that GRINCO improves orbit coverage and achieves stronger label efficiency than conventional coreset baselines, especially when group-induced redundancy is substantial.
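The idea of selecting on orbits rather than raw samples can be sketched with greedy k-center selection over canonical representatives. This is only an illustration under stated assumptions: the group is positive scaling, the canonical representative is the unit-norm vector, and the function names and toy points are hypothetical. The learned invariant embeddings and the orbit-averaged loss mentioned in the abstract are not shown.

```python
import math


def canonical(x):
    """Canonical representative under positive scaling: the unit-norm vector."""
    norm = math.sqrt(sum(v * v for v in x))
    return tuple(v / norm for v in x)


def k_center_greedy(points, k, first=0, quotient=True):
    """Farthest-first k-center selection; returns the chosen indices.

    With quotient=True, distances are measured between canonical
    representatives, so scaled copies of one sample sit at distance zero.
    """
    reps = [canonical(p) if quotient else tuple(p) for p in points]
    chosen = [first]
    min_dist = [math.dist(r, reps[first]) for r in reps]
    while len(chosen) < k:
        nxt = max(range(len(reps)), key=min_dist.__getitem__)
        chosen.append(nxt)
        min_dist = [min(d, math.dist(r, reps[nxt])) for d, r in zip(min_dist, reps)]
    return chosen


# Toy pool: three scaled copies of (1, 0), two scaled copies of (0, 1), one diagonal.
pool = [(1, 0), (2, 0), (10, 0), (0, 1), (0, 5), (1, 1)]
quotient_pick = k_center_greedy(pool, k=3, quotient=True)
raw_pick = k_center_greedy(pool, k=3, quotient=False)
print(quotient_pick, raw_pick)
```

On this pool the quotient-space selection covers three distinct orbits, while raw-space selection spends its second query on (10, 0), a scaled copy of the first sample.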
[LG-62] Characterizing and Identifying Separable Graphical Models
Link: https://arxiv.org/abs/2607.01057
Authors: Christopher Meek, Kayvan Sadeghi
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 69 pages, 7 figures, complete paper currently under submission
Abstract: We study a broad class of graphical models whose independencies correspond to vertex separation in mixed graphs with directed, undirected, and bidirected edges, and that are capable of encoding independence structures arising from feedback, latent, and selection mechanisms. In particular, we introduce separable graphs, in which each missing edge implies the existence of a separating set for its endpoints, and essentially separable graphs, which are graphs that are separation equivalent to a separable graph. We show that these models include many existing graph families used to define graphical models and provide several characterizations of separable graphs and essentially separable graphs. We also provide multiple characterizations of separation equivalence for separable graphs. One is a graphical characterization in terms of ordinary graph properties, extending earlier results for specific subfamilies. Another is a separational characterization depending only on graph separation properties. Finally, we provide a canonical representation for the equivalence classes of essentially separable graphs and develop an algorithm that, under suitable assumptions, identifies the equivalence class of any essentially separable graph.
[LG-63] How Much Do RF Drone Benchmarks Overstate? A Controlled Study and Theory of Data Leakage in UAV Signal Identification
Link: https://arxiv.org/abs/2607.01025
Authors: David Shulman
Subjects: Applied Physics (physics.app-ph); Machine Learning (cs.LG)
Abstract:Radio-frequency (RF) sensing is a central modality for counter-unmanned-aerial-system (counter-UAS) defence because it exploits the control, telemetry, and video links between a drone and its operator. Reported accuracies for RF-based drone detection and identification are often very high, but many are obtained using cross-validation that splits a small number of continuous recordings into short segments. This can place near-duplicate slices of the same recording in both training and test partitions, creating data leakage. We study this leakage problem through theory and measurement. We formalise the optimism of segment-level cross-validation and show, using Cover’s function-counting theorem, that a classifier can exactly memorise the recording-to-label map when the number of independent recordings, R, is small relative to the feature dimension, d. In particular, this can occur when 2R is less than or approximately equal to d. Under these conditions, naive accuracy approaches 1, and the inflation gap approaches 1 - ACC*, where ACC* is the Bayes accuracy. The inflation eases only once R grows beyond this separability threshold. A controlled synthetic experiment with 10 seeds confirms the predicted curves: naive balanced accuracy rises from the Bayes level toward 1.0 as recording-specific nuisance variation grows, while honest recording-grouped evaluation declines to chance, with a gap reaching about 0.5. On the public DroneRF dataset, pooled leave-one-recording-out cross-validation shows drone type identification, AR versus Bebop, collapsing from a naive macro-F1 of 0.74 to 0.46, the two-class chance level. A leakage-pathway ablation attributes essentially all of the inflation to segment-level leakage. 
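The leakage mechanism described in the abstract can be made concrete by comparing a segment-level split with a recording-grouped split. The sketch below is illustrative only: the recording names, segment counts, and helper functions are hypothetical and do not reproduce the paper's experimental protocol.

```python
import random


def segment_level_split(segments, test_fraction, seed=0):
    """Naive split: shuffle individual segments, ignoring their source recording."""
    rng = random.Random(seed)
    shuffled = segments[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def leave_one_recording_out(segments):
    """Grouped split: each fold holds out every segment of one recording."""
    recordings = sorted({rec for rec, _ in segments})
    for held_out in recordings:
        train = [s for s in segments if s[0] != held_out]
        test = [s for s in segments if s[0] == held_out]
        yield train, test


def shared_recordings(train, test):
    """Recordings that contribute segments to both partitions (leakage pathway)."""
    return {rec for rec, _ in train} & {rec for rec, _ in test}


# Hypothetical corpus: 3 continuous recordings, each cut into 50 short segments.
segments = [(rec, i) for rec in ("rec_A", "rec_B", "rec_C") for i in range(50)]

naive_train, naive_test = segment_level_split(segments, test_fraction=0.2)
naive_leak = shared_recordings(naive_train, naive_test)
grouped_leaks = [shared_recordings(tr, te) for tr, te in leave_one_recording_out(segments)]
print(sorted(naive_leak), grouped_leaks)
```

With 50 segments per recording and only 30 test segments in total, every recording that appears in the naive test set necessarily also appears in training, whereas the grouped folds never share a recording.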
[LG-64] Function-Counting Theory for Low-Dimensional Data Structures
Link: https://arxiv.org/abs/2607.01010
Authors: Konstantin Häberle, Helmut Bölcskei
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Classical Analysis and ODEs (math.CA); Combinatorics (math.CO)
Comments: 49 pages, 7 figures
Abstract:The success of deep learning models in classification and regression is widely attributed to the low-dimensional structure that real-world data tend to exhibit, despite their high-dimensional representation. This work attempts to provide a mathematical framework for binary classification on low-dimensional data, building on Cover’s (1965) function-counting theory. With our framework, we aim to address the question of how the low-dimensional structure of the data affects the classification capabilities of learning models. Cover’s theory relies on a general position assumption that blinds it to the underlying data structure. We refine this assumption to account for the low-dimensionality of the data and derive dichotomy counts that reflect the data structure. We further extend Cover’s separation capacity and problem of generalization to the low-dimensional setting, enabling the impact of the underlying data structure on both to be analyzed.
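For orientation, the classical count that this framework builds on can be computed directly. The sketch below implements Cover's (1965) formula for the number of dichotomies of N points in general position in d dimensions that are separable by a hyperplane through the origin, C(N, d) = 2 * sum_{k=0}^{d-1} binom(N-1, k). The function names are illustrative, and the refined, structure-aware counts derived in the paper are not reproduced here.

```python
from math import comb


def cover_count(n_points: int, dim: int) -> int:
    """Homogeneously linearly separable dichotomies of n_points in general position in R^dim."""
    return 2 * sum(comb(n_points - 1, k) for k in range(dim))


def separable_fraction(n_points: int, dim: int) -> float:
    """Fraction of all 2^n_points dichotomies that are linearly separable."""
    return cover_count(n_points, dim) / 2 ** n_points


# Every dichotomy is separable while n_points <= dim; exactly half are at n_points = 2 * dim.
for n in (5, 10, 20, 40):
    print(n, separable_fraction(n, dim=10))
```

The transition at N = 2d is the separation capacity that the abstract proposes to extend to low-dimensional data.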
[LG-65] Deep Multitask Learning for Mixed-Type Outcomes with Shared Sparsity
Link: https://arxiv.org/abs/2607.00995
Authors: Huichao Li, Tong Wang, Sanguo Zhang, Shuangge Ma
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Abstract: Most existing multitask learning approaches are limited by their reliance on task-specific loss functions tailored to the scale and type of each outcome. When outcomes differ across tasks, these losses are generally not directly comparable, which makes it difficult to formulate a unified objective and may limit information sharing across tasks. We propose a multitask transformation framework in which task-specific responses may differ through unknown monotone transformations. Motivated by high-dimensional biological applications in which the predictor dimension may diverge with the sample size while only a common subset of predictors is informative, we consider shared sparsity across tasks. Under this framework, we estimate the target functions and identify important predictors by optimizing a smoothed rank-based criterion with a group-Lasso penalty, implemented through a multitask deep neural network with a shared first layer. We establish nonasymptotic excess-risk bounds and variable-selection consistency for the proposed estimator. Simulation studies show that the proposed method achieves competitive prediction and variable-selection performance compared with competing approaches. Analyses of gene-expression studies with continuous, binary, and mixed outcomes further illustrate that the proposed method improves prediction and identifies biologically meaningful shared predictors.
[LG-66] Bridging Quantum Computing Paradigms toward Semiconductor Yield: A Controlled CV-versus-DV Comparison on Wafer-Map Defect Classification
Link: https://arxiv.org/abs/2607.00961
Authors: Yeonhong Kim, Jonghyeok Im, Monu Nath Baitha, Kyoungsik Kim
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 15 pages, 5 figures, 5 tables
Abstract: Realizing quantum neural networks (QNNs) in industry requires knowing which quantum computing paradigm suits which task. Motivated by AI accelerators and high-bandwidth memory, where die stacking makes wafer-level defect screening central to yield, we study WM-811K wafer-map defect classification (eight classes), comparing the dominant paradigms, continuous-variable (CV) and discrete-variable (DV), under controlled conditions. To isolate the quantum circuit as the sole variable, a shared convolutional backbone (~4.3M parameters) feeds interchangeable heads (classical dense, CV-QNN, or DV-QNN) as the only structural difference; each quantum head is scaled over three sizes (3, 4, 8 qumodes/qubits). The CV head consistently outperforms the DV head: at four qumodes/qubits it reaches 79.7 ± 1.8% accuracy versus 61.6 ± 1.4%, a non-overlapping 18-point gap. The advantage is sharpest on the spatially localized Edge-Loc class, easily confused with Scratch, which CV recovers with recall 0.66 ± 0.06 while DV fails at every size (≤ 0.05), showing the structured CV layer better captures fine spatial distinctions between defect types. Training curves show the DV limitation is a representational-capacity ceiling, not an optimization failure; at the Fock cutoff used here (d = 2) the CV advantage reflects two intrinsic properties, a structured, neural-network-analogue layer and continuous phase-space encoding, not Hilbert-space dimensionality. On IBM hardware, DV accuracy holds at shallow depth, degrading only at the deepest circuit. Both quantum heads remain below the classical baseline (85.0%), but the controlled setting isolates where a structured head already helps and, as noise and scale improve, which paradigm can deliver practical advantage.
[LG-67] Shapley in Context: Explaining Financial Language with Domain Expertise
Link: https://arxiv.org/abs/2607.00856
Authors: Dangxing Chen, Pengzhan Guo
Subjects: Computational Finance (q-fin.CP); Machine Learning (cs.LG)
Comments: European Journal of Finance
Abstract:In recent years, large language models have achieved remarkable success and have seen growing adoption in financial applications. At the same time, explainability remains critical in finance, a domain characterized by high stakes and strict regulatory requirements. Although numerous methods have been proposed to explain black box machine learning models, the majority of these approaches are designed for general purpose tasks and do not incorporate domain specific knowledge. In this work, we study the explainability of financial textual data modeled by large language models through the lens of the Shapley value. Specifically, we investigate whether Shapley based attributions align with established financial domain knowledge. Through rigorous theoretical analysis and extensive empirical evaluations, we demonstrate that Shapley values can yield explanations that are consistent with financial reasoning and can offer meaningful insights into the model’s behavior in text based financial applications.
[LG-68] Optimal scaling of MCMC algorithms: exploiting the symmetry of the Metropolis-Hastings formula
Link: https://arxiv.org/abs/2607.00586
Authors: P. Dobson, J.M. Sanz-Serna, K.C. Zygalakis
Subjects: Computation (stat.CO); Machine Learning (cs.LG); Probability (math.PR)
Comments: 23 pages, 3 figures
Abstract: We present a simple yet general approach to study the scaling properties of Metropolised MCMC sampling algorithms as the dimensionality increases. The study relies ultimately on the symmetry of the Metropolis-Hastings formula. Our findings contain, as particular cases, many known results for the Random Walk Metropolis, MALA and other algorithms. In addition, they provide, in an easy way, new optimal scaling results for a variety of proposal mechanisms, including implicit proposals and proposals generated with the help of differential equation integrators. The analysis applies to targets that are products of a given, not necessarily univariate distribution, and also to cases where the different terms in the product are scaled differently. We show how to construct gradient-based MALA-like proposals where the variance of the proposal as the dimension $d$ increases may be taken as $O(1/d^\mu)$, with $\mu > 0$ arbitrarily small, to be compared with the values $\mu = 1$ for Random Walk Metropolis and $\mu = 1/3$ for MALA.
[LG-69] How Environment and Urbanization Shape Bird Diversity in Sri Lanka
Link: https://arxiv.org/abs/2607.00582
Authors: Dilusha Chandrasiri, Maneesha Herath, Yasith Hewarathna, Muditha Herath, Gishan Bandara, Madara Mendis, Nathali Athukorala, Nisansa de Silva, Sandareka Wickramanayake
Subjects: Populations and Evolution (q-bio.PE); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures. IEEE conference paper. Dept. of Computer Science and Engineering, University of Moratuwa, Sri Lanka. Dataset and code publicly available on Hugging Face and GitHub
Abstract:This study presents a comprehensive analysis of bird diversity across Sri Lanka by integrating spatial, temporal, and environmental data. Bird observation records were combined with environmental variables, including weather conditions, air pollution, the Normalized Difference Vegetation Index (NDVI), land cover, elevation, and Artificial Light At Night (ALAN), and rigorously preprocessed to ensure data quality. Spatial analyses were conducted on multiple grid scales (2 km, 5 km, 10 km) to evaluate patterns in species richness while minimizing sampling bias through spatial thinning. Temporal trends were assessed using effort-corrected metrics including rarefied richness and occupancy rates to account for variations in observation effort over time. Environmental drivers of bird diversity were examined using multivariate statistical models, including Poisson Generalized Linear Models (GLMs) and correlation analyses, to identify key associations between ecological factors and species richness. Additionally, community structure, dominance patterns, and beta diversity were analyzed to understand variations in species composition across regions and time. The study found that land-cover type is a stronger predictor of bird diversity than individual continuous variables such as NDVI or temperature alone. Urbanization, measured by ALAN, exhibits nuanced scale-dependent effects, supporting high abundances of a few generalist species while reducing overall richness. The findings provide actionable insights into the patterns and drivers of avian diversity in Sri Lanka, offering a scalable and reproducible framework for biodiversity research and conservation planning.
[LG-70] Neural Network-Based Estimation of Time-Dependent Parameters in AR(p) Processes
Link: https://arxiv.org/abs/2607.00470
Authors: Agnieszka Kopeć, Paweł Przybyłowicz, Martyna Wiącek
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Abstract: We investigate a forecasting framework based on a simple discrete-time dynamic model with coefficients varying in time. The parameters of the model are recovered within a deep learning framework, which makes it possible to retain a transparent parametric structure while simultaneously accounting for complex and nonstationary patterns in the observed phenomenon. Our analysis covers two specifications of the noise process. Besides the standard Gaussian setting, we also consider Laplace-distributed noise, which can offer a more adequate description in the presence of heavier tails and sharper local fluctuations. For both cases, we formulate the predictive scheme of the model and analyze the associated uncertainty quantification, including the construction of prediction intervals. The results illustrate that a relatively simple model, when combined with time-dependent parameter estimation, can serve as a mathematically tractable and practically flexible tool for forecasting complex dynamics under different noise assumptions. The general model is stated for TVAR($p$), while the prediction-interval formulas and the numerical experiments are developed for the TVAR(1) case.
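To make the two noise settings concrete, the sketch below computes one-step prediction intervals for a TVAR(1) recursion. It assumes the parametrization X_t = a_t * X_{t-1} + noise_t with already-estimated a_t and noise scale, which the abstract does not spell out; the function names and numbers are hypothetical, and the paper's own interval formulas may differ.

```python
import math
from statistics import NormalDist


def gaussian_interval(a_t, x_prev, sigma_t, alpha=0.05):
    """Central (1 - alpha) interval when noise_t ~ Normal(0, sigma_t^2)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    center = a_t * x_prev
    return center - z * sigma_t, center + z * sigma_t


def laplace_interval(a_t, x_prev, b_t, alpha=0.05):
    """Central (1 - alpha) interval when noise_t ~ Laplace(0, b_t).

    The Laplace upper quantile solves 1 - 0.5 * exp(-q / b) = 1 - alpha / 2,
    which gives q = -b * ln(alpha).
    """
    q = -b_t * math.log(alpha)
    center = a_t * x_prev
    return center - q, center + q


# Hypothetical estimated parameters at time t.
gauss_lo, gauss_hi = gaussian_interval(a_t=0.8, x_prev=2.0, sigma_t=1.0)
lap_lo, lap_hi = laplace_interval(a_t=0.8, x_prev=2.0, b_t=1.0)
print((gauss_lo, gauss_hi), (lap_lo, lap_hi))
```

At equal scale parameters the Laplace interval is wider, reflecting the heavier tails that motivate the second noise specification.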
[LG-71] From Spectral Methods to Sample Complexity Bounds for Fourier Neural Operators
Link: https://arxiv.org/abs/2607.00320
Authors: Nisha Chandramoorthy, Daniel Sanz-Alonso, Nathan Waniorek
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 66 pages
Abstract: We establish approximation and learning guarantees for Fourier neural operators (FNOs) applied to time-$T$ solution operators of dissipative evolution equations. The analysis builds on the premise that FNOs can efficiently approximate and learn solution operators whenever these operators admit stable and accurate spectral discretizations. To formalize this idea, we introduce classes of evolution operators defined through spectral methods and derive FNO approximation bounds and polynomial sample complexity guarantees for these classes. For equations with polynomial nonlinearities, the learning rates depend primarily on the smoothness of the input space and the dimension of the physical domain. Our results hold uniformly over broad families of dissipative equations, rather than for a single fixed PDE, and apply in particular to the Navier–Stokes, Allen–Cahn, and Cahn–Hilliard equations. For equations with non-polynomial smooth nonlinearities, we prove that polynomial sample complexity still holds with rates that now additionally depend on the smoothness of the nonlinear terms and the dissipation strength. Overall, we connect classical spectral approximation theory with modern operator learning and explain when FNOs can learn nonlinear evolution operators efficiently.
[LG-72] Computer vision-based neural networks for radioisotope identification in urban environments
Link: https://arxiv.org/abs/2607.00270
Authors: Masen Bachleda, Peter Lalor
Subjects: Instrumentation and Detectors (physics.ins-det); Machine Learning (cs.LG)
Comments: 17 pages, 2 figures, 4 tables
Abstract: Algorithm development for radioisotope identification in mobile urban search scenarios faces significant challenges from non-uniform backgrounds, momentary source encounters, and severe class imbalance between rare threat signatures and background measurements. We present a machine learning-based approach to this problem that converts list-mode gamma-ray data into two-dimensional waterfall spectrograms and applies computer vision architectures to the resulting images. Rather than treating waterfalls as conventional images, we employ a representation where consecutive time spectra can form input channels, similar to RGB channels in color images. This representation encodes both spectral and temporal information, enabling neural networks to more effectively learn patterns that distinguish source signatures from background fluctuations. We evaluate three architectures, a multilayer perceptron (MLP), convolutional neural network (CNN), and vision transformer (ViT), on the Radiological Anomaly Detection and Identification (RADAI) benchmark dataset. At a false positive rate of less than one false alarm per hour, our CNN outperforms the previous-best non-negative matrix factorization (NMF) method across all global metrics, achieving true detection, classification, and identification rates of 0.4334, 0.3965, and 0.2950 respectively, compared to 0.4151, 0.3611, and 0.2625 for NMF. At lower false positive rate constraints, the neural network approaches show comparable but ultimately lower performance than NMF, indicating opportunities for further research.
[LG-73] Leveraging Multimodality for Real-Time Classification of Transients and Variables found by the Zwicky Transient Facility
Link: https://arxiv.org/abs/2607.00228
Authors: Ved G. Shah, Nabeel Rehemtulla, Adam A. Miller, Sushant Sharma Chaudhary, Michael W. Coughlin, Antoine Le Calloch, Matthew J. Graham, Joahan Castaneda Jaimes, Theophile Jegou du Laz, Ashish A. Mahabal, Frank J. Masci, Josiah Purdum, Reed Riddle, Jesper Sollerman, Anastasia Wei, Mansi M. Kasliwal
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); High Energy Astrophysical Phenomena (astro-ph.HE); Machine Learning (cs.LG)
Comments: 29 pages, 15 figures, 8 tables. Comments welcome
Abstract:Modern time-domain surveys such as the Zwicky Transient Facility (ZTF) generate hundreds of thousands of alerts each night, making real-time decisions for follow-up observations a central challenge in time-domain astronomy. Robust early classification is crucial for making informed decisions, but is hindered by sparse light curves and degeneracies between classes. In this work, we leverage multimodality to substantially improve real-time classification and demonstrate the practicality of our approach by deploying our model on the ZTF alert stream. Building on the Online Ranked Astrophysical CLass Estimator (ORACLE), we introduce the ORACLE-2 models, which combine light curves, metadata, and images for real-time hierarchical classification. Using both real and simulated datasets, we show that incorporating additional modalities consistently improves classification performance. On observations from ZTF’s Bright Transient Survey, our best-performing model, ORACLE-2 Omni, achieves a macro F1 score of 0.73 – an improvement of up to 11% over models using light curves and metadata alone, and up to 40% over light-curve-only models, with the strongest gains realized at early times. To demonstrate applicability to the Legacy Survey of Space and Time, which will increase alert volume by more than an order of magnitude, we train a light curve + metadata variant on the simulated ELAsTiCC dataset. This model achieves a macro F1 score of 0.88, an improvement of up to 13% over the light-curve-only variant, matching the performance of other state-of-the-art models. Finally, we quantify the trade-offs between performance and throughput, identifying regimes where multimodal approaches offer the greatest benefit. These results show that combining multiple modalities improves early-time classification, enabling more effective triage of high-volume alert streams for current and future time-domain surveys.
[LG-74] Sample Complexities of Estimating Gumbel–Max Watermark Proportions with and without Reduction to Pivotal Statistics
Link: https://arxiv.org/abs/2607.00224
Authors: Shuwen Chai, Qiaosen Wang
Categories: Statistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract:Watermarking promises a statistical trace of large language model (LLM) use, but real documents, after editing or paraphrasing, rarely arrive as purely human-written or purely machine-generated. This motivates a quantitative question beyond detection: what proportion of a document is generated from a pre-specified watermarked LLM? We study this watermark proportion estimation problem under the Gumbel–max watermarking mechanism, treating the next-token prediction (NTP) distributions as unknown and arbitrary nuisance parameters subject to a non-degeneracy condition. We compare two observation regimes: in the full observation regime, the estimator observes the pseudorandom vector and the selected token at each position; under the more popular setting of pivotal reduction, it observes only a scalar pivot, which follows a one-dimensional Uniform–Beta mixture distribution. Under pivotal reduction, we develop a Laguerre-polynomial estimator and establish a matching information-theoretic lower bound for the sample complexity. For full observation, we introduce an event-counting estimator and show a matching lower bound, yielding a substantially smaller sample complexity. As our results imply, although reducing to pivotal statistics is an elegant and widely used procedure, it is not always sample-efficient for estimating the proportion of watermarks.
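To make the "scalar pivot" in the abstract above concrete, the sketch below simulates the standard Gumbel-max rule (pick the token maximizing U_w^(1/P_w)) and records the pivot U at the selected token. The final moment-based estimate is a deliberately naive illustration that assumes a known, uniform next-token distribution; it is not the paper's Laguerre-polynomial or event-counting estimator, and all names and values here are hypothetical.

```python
import random

def gumbel_max_token(probs, uniforms):
    """Pick argmax_w U_w ** (1 / P_w); the pick is distributed according to probs."""
    return max(
        (w for w in range(len(probs)) if probs[w] > 0),
        key=lambda w: uniforms[w] ** (1.0 / probs[w]),
    )

def simulate_pivots(n, eps, probs, rng):
    """Return n pivots; each position is watermarked with probability eps."""
    pivots = []
    for _ in range(n):
        uniforms = [rng.random() for _ in probs]  # pseudorandom vector at this position
        if rng.random() < eps:
            token = gumbel_max_token(probs, uniforms)  # watermarked token
        else:
            # Unwatermarked token: drawn independently of the pseudorandom vector.
            token = rng.choices(range(len(probs)), weights=probs)[0]
        pivots.append(uniforms[token])
    return pivots

rng = random.Random(0)
probs = [0.25, 0.25, 0.25, 0.25]  # ASSUMPTION: known, uniform NTP distribution
pivots = simulate_pivots(20000, eps=0.4, probs=probs, rng=rng)
mean_pivot = sum(pivots) / len(pivots)

# With uniform probs over K = 4 tokens, the pivot is Uniform(0, 1) when
# unwatermarked (mean 1/2) and Beta(K, 1) when watermarked (mean K/(K+1) = 0.8).
eps_hat = (mean_pivot - 0.5) / (0.8 - 0.5)
```

In the setting the abstract describes, the next-token distributions are unknown nuisance parameters, so the watermarked mean used in the last line is not available; that gap is what makes the estimation problem nontrivial.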
[LG-75] Homogenization of $\ell_2$-Adversarial Training in High-Dimensions: Exact Dynamics under Stochastic Gradient Descent
Link: https://arxiv.org/abs/2607.00207
Authors: Fabrizzio Sabelli
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
Abstract:We develop a framework for analyzing the learning dynamics of $\ell_2$-adversarial training of single-index models on Gaussian mixtures in the high-dimensional limit under streaming stochastic gradient descent (SGD). We derive deterministic equivalents for a broad class of statistics of the SGD iterates, including the adversarial risk and distance to adversarial optimality, in terms of the solution to a system of ODEs. We use them to study two idealized learning rate schedules: the Polyak stepsize and exact line search. In the case of $\ell_2$-adversarial least squares with a single class, we show that, unlike noiseless standard least squares, no constant learning rate guarantees monotone descent of SGD towards a minimizer of the adversarial risk. We identify anisotropic covariance and a mismatch in ridge parameters as the main sources of suboptimality of exact line search relative to the Polyak stepsize. We also introduce a stochastic differential equation (SDE), called adversarial homogenized SGD, that captures the evolution of statistics of the iterates of SGD. For $\ell_2$-adversarial least squares, using this SDE, we show the evolution of the risk is equivalent, up to dimension-free constants, to that of SGD on standard least squares with an adaptive learning rate and adaptive $\ell_2$-regularization. When the dynamics converge, the limiting adversarial risk and SGD iterate are determined by a fixed-point equation, with the limiting iterate being equivalent to the solution of a ridge regression problem whose regularization parameter is the limiting effective regularization of SGD.
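For readers unfamiliar with adversarial least squares, the sketch below checks a well-known closed form: for a linear predictor, the worst-case squared loss over perturbations with Euclidean norm at most eps equals (|residual| + eps * ||theta||)^2. This is an assumption about the loss the abstract refers to, not a reproduction of the paper's setup; the function names and numeric values are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def adversarial_sq_loss(theta, x, y, eps):
    """Closed form of max over ||delta||_2 <= eps of (<theta, x + delta> - y)^2."""
    return (abs(dot(theta, x) - y) + eps * math.sqrt(dot(theta, theta))) ** 2

def worst_case_delta(theta, x, y, eps):
    """Maximizing perturbation eps * sign(residual) * theta / ||theta||_2 (theta != 0)."""
    residual = dot(theta, x) - y
    sign = 1.0 if residual >= 0 else -1.0
    norm = math.sqrt(dot(theta, theta))
    return [eps * sign * t / norm for t in theta]

def perturbed_loss(theta, x, y, delta):
    return (dot(theta, [xi + di for xi, di in zip(x, delta)]) - y) ** 2

# Hypothetical values, for illustration only.
theta, x, y, eps = [1.0, -2.0, 0.5], [0.3, 0.1, -0.4], 1.0, 0.2
closed_form = adversarial_sq_loss(theta, x, y, eps)
attained = perturbed_loss(theta, x, y, worst_case_delta(theta, x, y, eps))

# Random feasible perturbations should never exceed the closed form.
rng = random.Random(0)

def random_delta():
    g = [rng.gauss(0, 1) for _ in theta]
    scale = eps * rng.random() / math.sqrt(dot(g, g))
    return [scale * gi for gi in g]

sampled_max = max(perturbed_loss(theta, x, y, random_delta()) for _ in range(1000))
```

The eps * ||theta|| term acts like a norm penalty on the parameters, which gives some intuition for why the abstract relates the adversarial dynamics to standard least squares with adaptive regularization.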
[LG-76] Spatio-Temporal Gaussian Process for Building Terrain-Incorporating Wind Power Curves
Link: https://arxiv.org/abs/2607.00051
Authors: Ahmadreza Chokhachian, V. Roshan Joseph, Yu Ding
Categories: Applications (stat.AP); Machine Learning (cs.LG)
Abstract:Accurate modeling of wind turbine power curves is crucial for optimal wind farm operation. Nearly all existing power curve models focus on temporal variables such as wind speed and temperature while overlooking the influence of terrain covariates, which govern inflow wind conditions and thus also affect wind power production. This paper proposes a nonparametric spatio-temporal Gaussian process model that integrates temporal environmental covariates with spatial terrain features. The model falls in the category of spatio-temporal Gaussian process models with data on a grid. The challenge to be addressed is that spatio-temporal modeling requires a certain temporal alignment among the data, a property that wind farm data does not have. Our solution strategy is to construct a shared representative temporal covariate set which not only aligns the temporal inputs but is also an order of magnitude smaller than the original data. With this transformation, our resulting model is able to employ a separable kernel structure that captures both spatial and temporal dependencies. Empirical analysis on a real wind farm dataset shows that our method improves predictive accuracy over existing baselines and can be used to quantify the various impacts of the terrain characteristics on turbine performance.
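The "separable kernel structure" mentioned in the abstract above has a convenient consequence when inputs lie on a grid: the full covariance matrix is the Kronecker product of a spatial and a temporal Gram matrix. The sketch below illustrates that generic property only; the squared-exponential kernel, the lengthscales, and the toy inputs are assumptions and are not taken from the paper.

```python
import math

def rbf(a, b, lengthscale):
    """Squared-exponential kernel between two points given as tuples."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-0.5 * sq / lengthscale ** 2)

def gram(points, lengthscale):
    return [[rbf(p, q, lengthscale) for q in points] for p in points]

def kron(A, B):
    """Kronecker product of two dense matrices stored as lists of rows."""
    return [
        [a * b for a in row_a for b in row_b]
        for row_a in A
        for row_b in B
    ]

# Hypothetical inputs: 3 turbine sites described by two terrain features each,
# and 4 shared representative temporal covariate values (e.g. wind speeds).
sites = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.5)]
times = [(5.0,), (7.0,), (9.0,), (11.0,)]

K_space = gram(sites, lengthscale=1.0)
K_time = gram(times, lengthscale=3.0)
K_full = kron(K_space, K_time)  # 12 x 12 covariance over the full site-by-time grid
```

The row for site i and time a sits at index i * len(times) + a. In practice the point of the Kronecker structure is that the two small factors can be factorized separately, so the full matrix never needs to be formed as it is here.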